Table of Contents
cs.CL
[1] Omnilingual ASR: Open-Source Multilingual Speech Recognition for 1600+ Languages
Omnilingual ASR team,Gil Keren,Artyom Kozhevnikov,Yen Meng,Christophe Ropers,Matthew Setzler,Skyler Wang,Ife Adebara,Michael Auli,Can Balioglu,Kevin Chan,Chierh Cheng,Joe Chuang,Caley Droof,Mark Duppenthaler,Paul-Ambroise Duquenne,Alexander Erben,Cynthia Gao,Gabriel Mejia Gonzalez,Kehan Lyu,Sagar Miglani,Vineel Pratap,Kaushik Ram Sadagopan,Safiyyah Saleem,Arina Turkatenko,Albert Ventayol-Boada,Zheng-Xin Yong,Yu-An Chung,Jean Maillard,Rashel Moritz,Alexandre Mourachko,Mary Williamson,Shireen Yates
Main category: cs.CL
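TL;DR: Omnilingual ASR is the first large-scale, extensible speech recognition system. Using self-supervised pre-training and an LLM-inspired decoder, it covers more than 1,600 languages (including over 500 never before served), supports zero-shot generalization to low-resource languages, and is released as open-source models to encourage community participation.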
Details
Motivation: Most languages lack ASR support because of high costs, architectural limits, and missing community collaboration; a scalable and ethical solution is needed to close this technology gap.
Method: Scales self-supervised pre-training to 7B parameters, designs an encoder-decoder architecture with an LLM-inspired decoder, and combines public data with a diverse corpus collected through community partnerships, yielding robust cross-lingual representations and zero-shot generalization.
Result: ASR coverage for more than 1,600 languages, over 500 of them supported for the first time; substantial gains over prior systems in low-resource conditions, strong generalization, and an open-source model family ranging from 300M to 7B parameters.
Conclusion: Through technical innovation and community collaboration, Omnilingual ASR greatly expands the language coverage of speech recognition; its open-source release lowers barriers to research, supports inclusive technology, and can have a positive societal impact.
Abstract: Automatic speech recognition (ASR) has advanced in high-resource languages, but most of the world's 7,000+ languages remain unsupported, leaving thousands of long-tail languages behind. Expanding ASR coverage has been costly and limited by architectures that restrict language support, making extension inaccessible to most--all while entangled with ethical concerns when pursued without community collaboration. To transcend these limitations, we introduce Omnilingual ASR, the first large-scale ASR system designed for extensibility. Omnilingual ASR enables communities to introduce unserved languages with only a handful of data samples. It scales self-supervised pre-training to 7B parameters to learn robust speech representations and introduces an encoder-decoder architecture designed for zero-shot generalization, leveraging an LLM-inspired decoder. This capability is grounded in a massive and diverse training corpus; by combining breadth of coverage with linguistic variety, the model learns representations robust enough to adapt to unseen languages. Incorporating public resources with community-sourced recordings gathered through compensated local partnerships, Omnilingual ASR expands coverage to over 1,600 languages, the largest such effort to date--including over 500 never before served by ASR. Automatic evaluations show substantial gains over prior systems, especially in low-resource conditions, and strong generalization. We release Omnilingual ASR as a family of models, from 300M variants for low-power devices to 7B for maximum accuracy.
We reflect on the ethical considerations shaping this design and conclude by discussing its societal impact. In particular, we highlight how open-sourcing models and tools can lower barriers for researchers and communities, inviting new forms of participation. Open-source artifacts are available at https://github.com/facebookresearch/omnilingual-asr.
[2] Order Matters: Rethinking Prompt Construction in In-Context Learning
Warren Li,Yiqian Wang,Zihan Wang,Jingbo Shang
Main category: cs.CL
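TL;DR: This paper revisits how example selection and example ordering affect in-context learning performance, finding that ordering matters about as much as selection and that effective orderings can be found with a development set.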
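A minimal sketch of choosing an ordering with a development set, assuming an exhaustive search over permutations. The digest does not describe the paper's actual search procedure, so `best_ordering`, the `score_prompt` callback, and the toy scorer are all hypothetical names introduced for illustration.

```python
from itertools import permutations
from typing import Callable, Sequence, Tuple

Example = Tuple[str, str]  # (input text, label)


def best_ordering(
    examples: Sequence[Example],
    dev_set: Sequence[Example],
    score_prompt: Callable[[Sequence[Example], Sequence[Example]], float],
) -> Tuple[Tuple[Example, ...], float]:
    """Return the permutation of `examples` with the highest dev-set score."""
    best, best_score = None, float("-inf")
    for order in permutations(examples):
        score = score_prompt(order, dev_set)
        if score > best_score:  # strict '>' keeps the first of any tied orderings
            best, best_score = order, score
    return best, best_score


def toy_score(order: Sequence[Example], dev_set: Sequence[Example]) -> float:
    # Hypothetical stand-in for "prompt the model with this ordering and
    # measure dev accuracy": it rewards orderings whose last demonstration
    # shares its label with more dev items (a crude recency effect).
    last_label = order[-1][1]
    return sum(1 for _, label in dev_set if label == last_label) / len(dev_set)


demos = [("great film", "pos"), ("dull plot", "neg"), ("loved it", "pos")]
dev = [("fine", "pos"), ("bad", "neg"), ("awful", "neg")]
order, score = best_ordering(demos, dev, toy_score)
```

In practice `score_prompt` would call the model, and because the number of permutations grows factorially with the number of demonstrations, a sampled subset of orderings would likely replace the exhaustive loop.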
Details
Motivation: Prior work assumes that example selection matters more than ordering; this paper questions that assumption and systematically compares the two effects.
Method: Controlled experiments on classification and generation tasks with multiple open-source models (0.5B to 27B parameters) and GPT-5, plus a search for strong orderings using a development set.
Result: The performance variance caused by example ordering is comparable to that from swapping the entire example set; orderings chosen on the development set come close to the oracle ordering selected with test labels.
Conclusion: Example selection and ordering are equally important and intertwined in prompt design, so the assumptions held in ICL need to be reexamined.
Abstract: In-context learning (ICL) enables large language models to perform new tasks by conditioning on a sequence of examples. Most prior work reasonably and intuitively assumes that which examples are chosen has a far greater effect on performance than how those examples are ordered, leading to a focus on example selection. We revisit this assumption and conduct a systematic comparison between the effect of selection and ordering. Through controlled experiments on both classification and generation tasks, using multiple open-source model families (0.5B to 27B parameters) and GPT-5, we find that the variance in performance due to different example orderings is comparable to that from using entirely different example sets. Furthermore, we show that strong orderings can be identified using only a development set, achieving performance close to an oracle that selects the best ordering based on test labels. Our findings highlight the equal and intertwined importance of example selection and ordering in prompt design, calling for a reexamination of the assumptions held in ICL.
[3] Contextual morphologically-guided tokenization for Latin encoder models
Marisa Hudspeth,Patrick J. Burns,Brendan O'Connor
Main category: cs.CL
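TL;DR: This paper studies morphology-aware tokenization for Latin and finds that morphologically guided tokenization improves overall performance on four downstream tasks, most of all on out-of-domain texts, showing that linguistic resources can improve language modeling for morphologically complex languages.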
Details
Motivation: Standard tokenization methods usually prioritize information-theoretic objectives and neglect the morphological alignment that morphologically rich languages need, which leads to poor performance on those languages.
Method: Proposes and applies morphologically guided tokenization, drawing on Latin's rich lexical resources for the experimental evaluation.
Result: Morphologically guided tokenization improves overall performance on four downstream tasks, with the largest gains on out-of-domain texts, indicating better generalization.
Conclusion: Incorporating linguistic resources can improve language-model performance for morphologically complex languages and is a feasible path for low-resource languages that lack large-scale pretraining data.
Abstract: Tokenization is a critical component of language model pretraining, yet standard tokenization methods often prioritize information-theoretical goals like high compression and low fertility rather than linguistic goals like morphological alignment. In fact, they have been shown to be suboptimal for morphologically rich languages, where tokenization quality directly impacts downstream performance. In this work, we investigate morphologically-aware tokenization for Latin, a morphologically rich language that is medium-resource in terms of pretraining data, but high-resource in terms of curated lexical resources -- a distinction that is often overlooked but critical in discussions of low-resource language modeling. We find that morphologically-guided tokenization improves overall performance on four downstream tasks. Performance gains are most pronounced for out-of-domain texts, highlighting our models' improved generalization ability. Our findings demonstrate the utility of linguistic resources to improve language modeling for morphologically complex languages. For low-resource languages that lack large-scale pretraining data, the development and incorporation of linguistic resources can serve as a feasible alternative to improve LM performance.
[4] Assessing the Applicability of Natural Language Processing to Traditional Social Science Methodology: A Case Study in Identifying Strategic Signaling Patterns in Presidential Directives
C. LeMay,A. Lane,J. Seales,M. Winstead,S. Baty
Main category: cs.CL
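TL;DR: This study examines the use of natural language processing (NLP) to extract main topics from a corpus of Presidential Directives, compares NLP output with human labels, and finds discrepancies that call for further validation of NLP in social science, while also illustrating the potential and limits of early AI tools in emerging social science research.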
Details
Motivation: As the volume of text data grows, traditional manual analysis becomes inefficient; the study asks whether NLP can effectively support social science research, particularly in identifying signaling themes in political texts.
Method: Applies NLP techniques to extract topics from Presidential Directives issued from the Reagan through Clinton administrations and compares the results with human-labeled themes.
Result: NLP identified relevant documents and themes, showing its potential for large corpora, but its output differed noticeably from human labels, indicating limitations in current NLP tools.
Conclusion: NLP shows promise for large-scale text analysis but should be applied to social science with caution; further research is needed to improve its accuracy and reliability, especially for semantically complex political texts.
Abstract: Our research investigates how Natural Language Processing (NLP) can be used to extract main topics from a larger corpus of written data, as applied to the case of identifying signaling themes in Presidential Directives (PDs) from the Reagan through Clinton administrations. Analysts and NLP both identified relevant documents, demonstrating the potential utility of NLP in research involving large written corpora. However, we also identified discrepancies between NLP and human-labeled results that indicate a need for more research to assess the validity of NLP in this use case. The research was conducted in 2023, and the rapidly evolving landscape of AI/ML means existing tools have improved and new tools have been developed; this research displays the inherent capabilities of a potentially dated AI tool in emerging social science applications.
[5] How Small Can You Go? Compact Language Models for On-Device Critical Error Detection in Machine Translation
Muskaan Chopra,Lorenz Sparrenberg,Sarthak Khanna,Rafet Sifa
Main category: cs.CL
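TL;DR: Compact language models (about 1B parameters), given lightweight calibration and small-sample supervision, can detect critical errors in machine translation efficiently and reliably on device, balancing quality against compute cost.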
Details
Motivation: Large language models are good at evaluating machine translation, but their scale and cost make them hard to deploy on edge devices or in privacy-sensitive settings. The study asks how far a model can shrink while still detecting meaning-altering errors.
Method: Focuses on English-to-German critical error detection, benchmarking several sub-2B models (such as Gemma-3-1B and Qwen-3) with standardized prompts, lightweight logit-bias calibration, and majority voting, and evaluating semantic quality and compute cost on WMT21, WMT22, and SynCED-EnDe-2025.
Result: Models around 1B parameters give the best trade-off: after fine-tuning, Gemma-3-1B reaches MCC=0.77 and F1-ERR=0.98 on SynCED-EnDe-2025 with 400 ms single-sample latency on a MacBook Pro; Qwen-3-1.7B scores higher but costs more compute; 0.6B models remain usable but miss more entity and number errors.
Conclusion: Compact instruction-tuned models combined with lightweight calibration can deliver trustworthy MT error detection on edge devices, supporting private, low-cost use in practice.
Abstract: Large Language Models (LLMs) excel at evaluating machine translation (MT), but their scale and cost hinder deployment on edge devices and in privacy-sensitive workflows. We ask: how small can you get while still detecting meaning-altering translation errors? Focusing on English->German Critical Error Detection (CED), we benchmark sub-2B models (LFM2-350M, Qwen-3-0.6B/1.7B, Llama-3.2-1B-Instruct, Gemma-3-1B) across WMT21, WMT22, and SynCED-EnDe-2025. Our framework standardizes prompts, applies lightweight logit-bias calibration and majority voting, and reports both semantic quality (MCC, F1-ERR/F1-NOT) and compute metrics (VRAM, latency, throughput). Results reveal a clear sweet spot around one billion parameters: Gemma-3-1B provides the best quality-efficiency trade-off, reaching MCC=0.77 with F1-ERR=0.98 on SynCED-EnDe-2025 after merged-weights fine-tuning, while maintaining 400 ms single-sample latency on a MacBook Pro M4 Pro (24 GB). At larger scale, Qwen-3-1.7B attains the highest absolute MCC (+0.11 over Gemma) but with higher compute cost. In contrast, ultra-small models (0.6B) remain usable with few-shot calibration yet under-detect entity and number errors. Overall, compact, instruction-tuned LLMs augmented with lightweight calibration and small-sample supervision can deliver trustworthy, on-device CED for MT, enabling private, low-cost error screening in real-world translation pipelines.
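To make the majority-voting step and the MCC metric concrete, here is a small sketch in pure Python. The label strings `ERR` and `NOT` are borrowed from the metric names F1-ERR/F1-NOT and are an assumption about how outputs are encoded; the sample votes are invented for illustration and are not results from the paper.

```python
from collections import Counter
from math import sqrt
from typing import Sequence


def majority_vote(votes: Sequence[str]) -> str:
    """Return the most frequent label; on a tie, the label seen first wins."""
    return Counter(votes).most_common(1)[0][0]


def mcc(y_true: Sequence[str], y_pred: Sequence[str], positive: str = "ERR") -> float:
    """Matthews correlation coefficient for a binary labelling."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive and truth == positive:
            tp += 1
        elif pred == positive:
            fp += 1
        elif truth == positive:
            fn += 1
        else:
            tn += 1
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Conventionally MCC is reported as 0 when any marginal count is zero.
    return (tp * tn - fp * fn) / denom if denom else 0.0


# Three hypothetical sampled outputs per translation pair.
sampled = [
    ["ERR", "ERR", "NOT"],
    ["NOT", "NOT", "ERR"],
    ["ERR", "NOT", "ERR"],
    ["NOT", "NOT", "NOT"],
]
gold = ["ERR", "NOT", "NOT", "NOT"]
pred = [majority_vote(v) for v in sampled]
quality = mcc(gold, pred)
```

Unlike accuracy, MCC uses all four confusion-matrix cells, which is why it is a common choice when critical errors are much rarer than clean translations.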
All datasets, prompts, and scripts are publicly available at our GitHub repository.
[6] Predicate-Argument Structure Divergences in Chinese and English Parallel Sentences and their Impact on Language Transfer
Rocco Tripodi,Xiaoyu Liu
Main category: cs.CL
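TL;DR: This paper analyzes predicate-argument structures in parallel Chinese and English sentences, examines alignment and misalignment in cross-lingual transfer, and finds that language transfer is asymmetric, so the source language must be chosen with care.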
Details
Motivation: Linguistic differences, especially between typologically distant languages, make cross-lingual knowledge transfer difficult; structural divergences need systematic analysis to improve transfer.
Method: Runs annotation projection experiments that use Chinese and English in turn as the source language to project predicate annotations onto the target language, analyzes the results qualitatively and quantitatively, and proposes a categorization of structural divergences.
Result: The experiments show a clear asymmetry in cross-lingual transfer; the choice of source language strongly affects transfer quality.
Conclusion: Cross-lingual NLP research must account for this asymmetry, and the effect of the source language should be assessed before practical use or theoretical claims.
Abstract: Cross-lingual Natural Language Processing (NLP) has gained significant traction in recent years, offering practical solutions in low-resource settings by transferring linguistic knowledge from resource-rich to low-resource languages. This field leverages techniques like annotation projection and model transfer for language adaptation, supported by multilingual pre-trained language models. However, linguistic divergences hinder language transfer, especially among typologically distant languages. In this paper, we present an analysis of predicate-argument structures in parallel Chinese and English sentences. We explore the alignment and misalignment of predicate annotations, inspecting similarities and differences and proposing a categorization of structural divergences. The analysis and the categorization are supported by a qualitative and quantitative analysis of the results of an annotation projection experiment, in which, in turn, one of the two languages has been used as source language to project annotations into the corresponding parallel sentences. The results of this analysis show clearly that language transfer is asymmetric, an aspect that requires attention when it comes to selecting the source language in transfer learning applications and that needs to be investigated before any scientific claim about cross-lingual NLP is proposed.
[7] TARG: Training-Free Adaptive Retrieval Gating for Efficient RAG
Yufeng Wang,Lu wei,Haibin Ling
Main category: cs.CL
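TL;DR: This paper proposes TARG, a training-free adaptive retrieval gating method that computes a lightweight uncertainty score from the prefix logits of a short draft by the base model to decide whether to retrieve, sharply reducing retrieval calls and end-to-end latency while matching or improving accuracy.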
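As a rough sketch of such a gate, the code below computes two of the uncertainty signals named in the abstract (mean token entropy and a top-1/top-2 margin) from per-step logits and thresholds them. The abstract only says the margin passes through "a monotone link"; the `exp(-gap)` link, the threshold of 0.5, and the toy logits are assumptions made for illustration, not the paper's settings.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over one step's logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def mean_token_entropy(prefix_logits: Sequence[Sequence[float]]) -> float:
    """Average Shannon entropy (in nats) of the next-token distributions."""
    total = 0.0
    for logits in prefix_logits:
        probs = softmax(logits)
        total -= sum(p * math.log(p) for p in probs if p > 0.0)
    return total / len(prefix_logits)


def margin_score(prefix_logits: Sequence[Sequence[float]]) -> float:
    """Mean of exp(-gap), where gap is the top-1 minus top-2 logit.

    exp(-gap) is an ASSUMED monotone link: a small gap (an uncertain
    step) maps near 1, a large gap (a confident step) maps near 0.
    """
    total = 0.0
    for logits in prefix_logits:
        top1, top2 = sorted(logits, reverse=True)[:2]
        total += math.exp(-(top1 - top2))
    return total / len(prefix_logits)


def should_retrieve(score: float, threshold: float) -> bool:
    """Trigger retrieval only when uncertainty exceeds the threshold."""
    return score > threshold


# Toy per-step logits over a three-token vocabulary (illustrative only).
confident = [[10.0, 0.0, 0.0], [9.0, 1.0, 0.0]]
uncertain = [[1.0, 0.9, 0.8], [0.5, 0.5, 0.4]]
gate_confident = should_retrieve(margin_score(confident), 0.5)
gate_uncertain = should_retrieve(margin_score(uncertain), 0.5)
```

With real models the logits would come from the first few tokens of a no-context draft, and the threshold would be tuned against a retrieval budget rather than fixed by hand.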
Details
Motivation: Conventional retrieval-augmented generation (RAG) retrieves for every query; this improves factuality but adds overhead and can lower quality, so a more efficient way to decide when to retrieve is needed.
Method: TARG takes the prefix logits of a short, no-context draft generated by the base model and computes a lightweight uncertainty score (mean token entropy, a margin signal from the top-1/top-2 logit gap, or small-N variance across a handful of stochastic prefixes), triggering retrieval when the score exceeds a threshold; no additional training or auxiliary heads are required.
Result: On NQ-Open, TriviaQA, and PopQA, TARG reduces retrieval by 70-90% relative to Always-RAG and cuts end-to-end latency while matching or improving EM/F1, with overhead close to Never-RAG; the margin signal is a robust default under modern instruction-tuned LLMs, and small-N variance is a conservative, budget-first alternative.
Conclusion: TARG is a model-agnostic, training-free retrieval gating policy that substantially reduces retrieval overhead while maintaining or improving performance, offering a new accuracy-efficiency trade-off for retrieval-augmented generation.
Abstract: Retrieval-Augmented Generation (RAG) improves factuality but retrieving for every query often hurts quality while inflating tokens and latency. We propose Training-free Adaptive Retrieval Gating (TARG), a single-shot policy that decides when to retrieve using only a short, no-context draft from the base model. From the draft's prefix logits, TARG computes lightweight uncertainty scores: mean token entropy, a margin signal derived from the top-1/top-2 logit gap via a monotone link, or small-N variance across a handful of stochastic prefixes, and triggers retrieval only when the score exceeds a threshold. The gate is model agnostic, adds only tens to hundreds of draft tokens, and requires no additional training or auxiliary heads. On NQ-Open, TriviaQA, and PopQA, TARG consistently shifts the accuracy-efficiency frontier: compared with Always-RAG, TARG matches or improves EM/F1 while reducing retrieval by 70-90% and cutting end-to-end latency, and it remains close to Never-RAG in overhead. A central empirical finding is that under modern instruction-tuned LLMs the margin signal is a robust default (entropy compresses as backbones sharpen), with small-N variance offering a conservative, budget-first alternative. We provide ablations over gate type and prefix length and use a delta-latency view to make budget trade-offs explicit.
[8] Khmer Spellchecking: A Holistic Approach
Marry Kong,Rina Buoy,Sovisal Chenda,Nguonly Taing
Main category: cs.CL
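TL;DR: Proposes a holistic approach to Khmer spellchecking that combines subword segmentation, named-entity recognition, grapheme-to-phoneme conversion, and a language model, reaching an accuracy of up to 94.4%.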
Details
Motivation: Khmer spellchecking suffers from mismatches between the lexicon and the word segmentation model, multiple written forms of the same word, loosely formed compound words, and proper nouns flagged as misspellings; existing methods do not address these adequately.
Method: Integrates Khmer subword segmentation, named-entity recognition, grapheme-to-phoneme conversion, and a language model to generate and rank correction candidates.
Result: Achieves a spellchecking accuracy of up to 94.4% in experiments, outperforming existing solutions.
Conclusion: The approach addresses the main challenges of Khmer spellchecking and improves correction performance; the benchmark datasets will be made publicly available.
Abstract: Compared to English and other high-resource languages, spellchecking for Khmer remains an unresolved problem due to several challenges. First, there are misalignments between words in the lexicon and the word segmentation model. Second, a Khmer word can be written in different forms. Third, Khmer compound words are often loosely and easily formed, and these compound words are not always found in the lexicon. Fourth, some proper nouns may be flagged as misspellings due to the absence of a Khmer named-entity recognition (NER) model. Unfortunately, existing solutions do not adequately address these challenges. This paper proposes a holistic approach to the Khmer spellchecking problem by integrating Khmer subword segmentation, Khmer NER, Khmer grapheme-to-phoneme (G2P) conversion, and a Khmer language model to tackle these challenges, identify potential correction candidates, and rank the most suitable candidate. Experimental results show that the proposed approach achieves a state-of-the-art Khmer spellchecking accuracy of up to 94.4%, compared to existing solutions. The benchmark datasets for Khmer spellchecking and NER tasks in this study will be made publicly available.
[9] Improving Graduate Outcomes by Identifying Skills Gaps and Recommending Courses Based on Career Interests
Rahul Soni,Basem Suleiman,Sonit Singh
Main category: cs.CL
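TL;DR: This paper proposes a course recommendation system that combines data mining, machine learning, and user preferences to bridge the gap between higher education and industry needs.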
Details
Motivation: Students lack personalized guidance that matches course selection to industry trends; the work aims to make course choices more relevant and graduates more employable.
Method: Uses data mining, collaborative filtering, and machine learning algorithms together with user preferences, academic criteria, and career goals to design and implement an algorithmic framework, and develops a front-end interface focused on usability.
Result: The system provides personalized recommendations based on past course data and users' career goals, and iterative refinement with user feedback improved its usability and user experience.
Conclusion: The system can help students, instructors, and career advisers make data-driven, industry-informed course decisions, promoting lifelong learning and professional progression and improving graduates' academic and employment outcomes.
Abstract: This paper aims to address the challenge of selecting relevant courses for students by proposing the design and development of a course recommendation system. The course recommendation system utilises a combination of data analytics techniques and machine learning algorithms to recommend courses that align with current industry trends and requirements. In order to provide customised suggestions, the study entails the design and implementation of an extensive algorithmic framework that combines machine learning methods, user preferences, and academic criteria. The system employs data mining and collaborative filtering techniques to examine past courses and individual career goals in order to provide course recommendations. Moreover, to improve the accessibility and usefulness of the recommendation system, special attention is given to the development of an easy-to-use front-end interface. The front-end design prioritises visual clarity, interaction, and simplicity through iterative prototyping and user input revisions, guaranteeing a smooth and captivating user experience. We refined and optimised the proposed system by incorporating user feedback, ensuring that it effectively meets the needs and preferences of its target users. The proposed course recommendation system could be a useful tool for students, instructors, and career advisers to use in promoting lifelong learning and professional progression as it fills the gap between university learning and industry expectations.
We hope that the proposed course recommendation system will help university students make data-driven and industry-informed course decisions, in turn improving graduate outcomes for the university sector.
[10] Answering Students' Questions on Course Forums Using Multiple Chain-of-Thought Reasoning and Finetuning RAG-Enabled LLM
Neo Wang,Sonit Singh
Main category: cs.CL
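TL;DR: Proposes a course-forum question answering system based on an open-source large language model and retrieval-augmented generation (RAG), combining a local knowledge base with multiple chain-of-thought reasoning to improve answer accuracy and reduce hallucination.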
Details
Motivation: Growing numbers of student questions in online courses lead to delayed replies and repetitive queries; the work aims to reduce the load on instructors.
Method: Fine-tunes an open-source large language model, retrieves relevant information from a local course knowledge base with RAG, and adds multiple chain-of-thought reasoning to reduce hallucination.
Result: Experiments on the HotpotQA dataset show that the method performs strongly on the question answering task.
Conclusion: Combining a fine-tuned LLM with RAG improves automatic question answering on course forums and has potential for practical use.
Abstract: Course forums are increasingly significant and play a vital role in facilitating student discussions and answering their questions related to the course. They provide a platform for students to post questions about course content and administrative issues. However, there are several challenges due to the increase in the number of students enrolled in a course. The primary challenge is that students' queries cannot be answered immediately and the instructors have to face many repetitive questions. To mitigate these issues, we propose a question answering system based on a large language model with the retrieval-augmented generation (RAG) method. This work focuses on designing a question answering system with an open-source Large Language Model (LLM) and fine-tuning it on the relevant course dataset. To further improve the performance, we use a local knowledge base and apply the RAG method to retrieve documents relevant to students' queries, where the local knowledge base contains all the course content. To mitigate the hallucination of LLMs, we also integrate multi chain-of-thought reasoning. In this work, we evaluate the fine-tuned LLM with the RAG method on the HotpotQA dataset. The experimental results demonstrate that the fine-tuned LLM with the RAG method performs strongly on the question answering task.
[11] TermGPT: Multi-Level Contrastive Fine-Tuning for Terminology Adaptation in Legal and Financial Domain
Yidan Sun,Mengying Zhu,Feiyue Chen,Yangyang Wu,Xiaolei Dan,Mengyuan Yang,Xiaolin Zheng,Shenglin Ben
Main category: cs.CL
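TL;DR: This paper proposes TermGPT, a multi-level contrastive fine-tuning framework that addresses the isotropy problem in how large language models represent financial and legal terminology; by building a sentence graph and applying contrastive learning at the sentence and token levels, it substantially improves term discrimination.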
Details
Motivation: Large language models perform well on text generation, but their embedding spaces suffer from the isotropy problem, which leads to poor discrimination of domain terminology (especially legal and financial) and hurts downstream tasks.
Method: Proposes the TermGPT framework: first builds a sentence graph to capture semantic and structural relations and generates positive and negative samples from contextual and topological cues; then applies multi-level contrastive learning at the sentence and token levels to strengthen global contextual understanding and fine-grained term discrimination; also constructs the first financial terminology dataset derived from official regulatory documents.
Result: Experiments show that TermGPT outperforms existing baselines on term discrimination tasks in the finance and legal domains.
Conclusion: TermGPT mitigates the limits of large language models in representing specialized terminology; multi-level contrastive learning improves semantic discrimination and gives more reliable embeddings for downstream tasks such as legal judgment prediction and financial risk analysis.
Abstract: Large language models (LLMs) have demonstrated impressive performance in text generation tasks; however, their embedding spaces often suffer from the isotropy problem, resulting in poor discrimination of domain-specific terminology, particularly in legal and financial contexts. This weakness in terminology-level representation can severely hinder downstream tasks such as legal judgment prediction or financial risk analysis, where subtle semantic distinctions are critical. To address this problem, we propose TermGPT, a multi-level contrastive fine-tuning framework designed for terminology adaptation. We first construct a sentence graph to capture semantic and structural relations, and generate semantically consistent yet discriminative positive and negative samples based on contextual and topological cues. We then devise a multi-level contrastive learning approach at both the sentence and token levels, enhancing global contextual understanding and fine-grained terminology discrimination. To support robust evaluation, we construct the first financial terminology dataset derived from official regulatory documents. Experiments show that TermGPT outperforms existing baselines in term discrimination tasks within the finance and legal domains.
[12] In-Token Rationality Optimization: Towards Accurate and Concise LLM Reasoning via Self-Feedback
Mingye Zhu,Yi Liu,Zheren Fu,Quan Wang,Yongdong Zhang
Main category: cs.CL
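TL;DR: Proposes InTRO, a new framework that uses token-level exploration and self-feedback to make the chain-of-thought reasoning of large language models more accurate and more concise.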
Details
Motivation: Conventional supervised fine-tuning relies on a single "golden" reasoning path, which limits generalization, while reinforcement learning faces difficult credit assignment and high computational cost.
Method: Introduces InTRO (In-Token Rationality Optimization), which estimates token-wise importance weights (correction factors) from the information discrepancy between the generative policy and its answer-conditioned counterpart, enabling token-level exploration and self-feedback within a single forward pass.
Result: On six math-reasoning benchmarks, InTRO raises solution accuracy by up to 20% relative to the base model and produces more concise chains of thought; it also transfers across domains.
Conclusion: InTRO addresses the limits of existing methods in reasoning-path diversity, credit assignment, and computational efficiency, improving the accuracy, conciseness, and generalization of reasoning.
Abstract: Training Large Language Models (LLMs) for chain-of-thought reasoning presents a significant challenge: supervised fine-tuning on a single "golden" rationale hurts generalization as it penalizes equally valid alternatives, whereas reinforcement learning with verifiable rewards struggles with credit assignment and prohibitive computational cost. To tackle these limitations, we introduce InTRO (In-Token Rationality Optimization), a new framework that enables both token-level exploration and self-feedback for accurate and concise reasoning. Instead of directly optimizing an intractable objective over all valid reasoning paths, InTRO leverages correction factors, token-wise importance weights estimated by the information discrepancy between the generative policy and its answer-conditioned counterpart, for informative next token selection. This approach allows the model to perform token-level exploration and receive self-generated feedback within a single forward pass, ultimately encouraging accurate and concise rationales. Across six math-reasoning benchmarks, InTRO consistently outperforms other baselines, raising solution accuracy by up to 20% relative to the base model. Its chains of thought are also notably more concise, exhibiting reduced verbosity. Beyond this, InTRO enables cross-domain transfer, successfully adapting to out-of-domain reasoning tasks that extend beyond the realm of mathematics, demonstrating robust generalization.
[13] HierRouter: Coordinated Routing of Specialized Large Language Models via Reinforcement Learning
Nikunj Gupta,Bill Guo,Rajgopal Kannan,Viktor K. Prasanna
Main category: cs.CL
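TL;DR: Proposes HierRouter, a hierarchical routing method that uses reinforcement learning to dynamically select lightweight language models for multi-hop inference, improving performance substantially while keeping compute cost low.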
Details
Motivation: Large language models perform very well but have high compute and memory costs, which makes them hard to deploy in resource-constrained or real-time settings; more efficient inference methods are needed.
Method: Formulates hierarchical routing as a finite-horizon Markov Decision Process (MDP) and trains a Proximal Policy Optimization (PPO) reinforcement learning agent that selects which lightweight model to invoke at each stage, conditioned on the evolving context and accumulated cost.
Result: Experiments with three open-source LLMs on six benchmarks (including QA, code generation, and mathematical reasoning) show that HierRouter improves response quality by up to 2.4x compared with using individual models independently, while adding only minimal inference cost on average.
Conclusion: HierRouter shows that hierarchical routing can deliver efficient, high-performance LLM inference and offers a practical option for deployment in resource-constrained environments.
Abstract: Large Language Models (LLMs) deliver state-of-the-art performance across many tasks but impose high computational and memory costs, limiting their deployment in resource-constrained or real-time settings. To address this, we propose HierRouter, a hierarchical routing approach that dynamically assembles inference pipelines from a pool of specialized, lightweight language models. Formulated as a finite-horizon Markov Decision Process (MDP), our approach trains a Proximal Policy Optimization (PPO)-based reinforcement learning agent to iteratively select which models to invoke at each stage of multi-hop inference. The agent conditions on the evolving context and accumulated cost to make context-aware routing decisions. Experiments with three open-source candidate LLMs across six benchmarks, including QA, code generation, and mathematical reasoning, show that HierRouter improves response quality by up to 2.4x compared to using individual models independently, while incurring only a minimal additional inference cost on average. These results highlight the promise of hierarchical routing for cost-efficient, high-performance LLM inference. All code can be found at https://github.com/Nikunj-Gupta/hierouter.
[14] EnchTable: Unified Safety Alignment Transfer in Fine-tuned Large Language Models
Jialin Wu,Kecen Li,Zhicong Huang,Xinfeng Li,Xiaofeng Wang,Cheng Hong
Main category: cs.CL
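TL;DR: EnchTable is a new framework that uses NTK-based safety vector distillation and interference-aware merging to transfer and maintain safety alignment in downstream large language models without extensive retraining, balancing safety and utility.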
Details
Motivation: Fine-tuning large language models improves performance in specific domains but often degrades safety alignment and raises the risk of harmful outputs; a method that preserves safety without extensive retraining is needed.
Method: Proposes the EnchTable framework, which uses a Neural Tangent Kernel (NTK)-based safety vector distillation method to decouple safety constraints from task-specific reasoning, and an interference-aware merging strategy to balance safety and performance.
Result: Validated on three task domains, three LLM architectures, and eleven datasets; EnchTable resists static and dynamic jailbreaking attacks and outperforms vendor-released safety models. Compared with six parameter modification methods and two inference-time alignment baselines, it has a lower unsafe rate, a higher utility score, and broader applicability.
Conclusion: EnchTable maintains safety alignment in downstream LLMs, generalizes across architectures and domains, and integrates into deployment pipelines without significant overhead, offering a workable way to fine-tune models for both safety and performance.
Abstract: Many machine learning models are fine-tuned from large language models (LLMs) to achieve high performance in specialized domains like code generation, biomedical analysis, and mathematical problem solving. However, this fine-tuning process often introduces a critical vulnerability: the systematic degradation of safety alignment, undermining ethical guidelines and increasing the risk of harmful outputs. Addressing this challenge, we introduce EnchTable, a novel framework designed to transfer and maintain safety alignment in downstream LLMs without requiring extensive retraining. EnchTable leverages a Neural Tangent Kernel (NTK)-based safety vector distillation method to decouple safety constraints from task-specific reasoning, ensuring compatibility across diverse model architectures and sizes. Additionally, our interference-aware merging technique effectively balances safety and utility, minimizing performance compromises across various task domains. We implemented a fully functional prototype of EnchTable on three different task domains and three distinct LLM architectures, and evaluated its performance through extensive experiments on eleven diverse datasets, assessing both utility and model safety. Our evaluations include LLMs from different vendors, demonstrating EnchTable's generalization capability. Furthermore, EnchTable exhibits robust resistance to static and dynamic jailbreaking attacks, outperforming vendor-released safety models in mitigating adversarial prompts.
Comparative analyses with six parameter modification methods and two inference-time alignment baselines reveal that EnchTable achieves a significantly lower unsafe rate, higher utility score, and universal applicability across different task domains. Additionally, we validate that EnchTable can be seamlessly integrated into various deployment pipelines without significant overhead.
[15] HI-TransPA: Hearing Impairments Translation Personal Assistant
Zhiming Ma,Shiyu Gan,Junhao Zhao,Xianming Li,Qingyun Pan,Peidong Wang,Mingjun Pan,Yuhao Mo,Jiajie Cheng,Chengxin Chen,Zhonglun Cao,Chonghan Liu,Shi Cheng
Main category: cs.CL
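TL;DR: This paper presents HI-TransPA, an instruction-driven audio-visual personal assistant for hearing-impaired people. Built on the Omni-Model paradigm, it fuses indistinct speech with high-frame-rate lip dynamics to handle translation and dialogue in one framework. With a data preprocessing and quality assessment pipeline, a curriculum learning strategy, and an efficient encoding structure, it reaches state-of-the-art literal and semantic accuracy on the purpose-built HI-Dialogue dataset.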
Details
Motivation: Existing Omni-Models adapt poorly to the indistinct speech of hearing-impaired people, and there is no unified, flexible aid for daily communication; a model is needed that fuses multimodal information effectively and tolerates noisy data.
Method: Proposes the HI-TransPA model, which fuses indistinct speech with high-frame-rate lip dynamics; builds a preprocessing pipeline covering facial landmark detection, lip-region stabilization, and quality assessment; uses a curriculum learning strategy that starts from high-quality samples and progressively adds harder ones; encodes lip motion efficiently with a SigLIP encoder and a Unified 3D-Resampler.
Result: On the purpose-built HI-Dialogue dataset, HI-TransPA reaches state-of-the-art performance in both literal accuracy and semantic fidelity, confirming robustness to noisy and heterogeneous data.
Conclusion: The work lays a foundation for applying Omni-Models to assistive communication technology for hearing-impaired users, providing an end-to-end modeling framework and essential data processing tools for future research.
Abstract: To provide a unified and flexible solution for daily communication among hearing-impaired individuals, we introduce the Omni-Model paradigm into assistive technology and present HI-TransPA, an instruction-driven audio-visual personal assistant. The model fuses indistinct speech with high-frame-rate lip dynamics, enabling both translation and dialogue within a single multimodal framework. To tackle the challenges of noisy and heterogeneous raw data and the limited adaptability of existing Omni-Models to hearing-impaired speech, we construct a comprehensive preprocessing and curation pipeline that detects facial landmarks, isolates and stabilizes the lip region, and quantitatively assesses multimodal sample quality. These quality scores guide a curriculum learning strategy that first trains on clean, high-confidence samples and progressively incorporates harder cases to strengthen model robustness. We further adopt a SigLIP encoder combined with a Unified 3D-Resampler to efficiently encode high-frame-rate lip motion. Experiments on our purpose-built HI-Dialogue dataset show that HI-TransPA achieves state-of-the-art performance in both literal accuracy and semantic fidelity. This work establishes a foundation for applying Omni-Models to assistive communication technology, providing an end-to-end modeling framework and essential processing tools for future research.
[16] MINDS: A Cross-cultural Dialogue Corpus for Social Norm Classification and Adherence Detection
Pritish Sahu,Anirudh Som,Dimitra Vergyri,Ajay Divakaran
Main category: cs.CL
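TL;DR: This paper proposes Norm-RAG, a retrieval-augmented agentic framework for social norm inference in multi-turn dialogues, and introduces the bilingual MINDS dataset to improve the detection of norm adherence and violation in cross-cultural, multilingual dialogue systems.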
Details
Motivation: Existing work mostly targets isolated utterances or synthetic dialogues and struggles to capture how social norms shift across real multi-turn conversations; social norms are also subjective, context-dependent, and culturally variable, which challenges computational models.
Method: Proposes the Norm-RAG framework, which retrieves structured normative documentation through a semantic chunking approach and models utterance-level communicative intent, speaker roles, interpersonal framing, and linguistic cues for interpretable, context-aware norm reasoning; also builds the MINDS dataset of 31 multi-turn bilingual conversations, with each turn annotated for norm category and adherence status.
Result: Experiments show that Norm-RAG improves norm detection and generalization, particularly in cross-cultural, multilingual dialogue settings.
Conclusion: By combining retrieval augmentation with fine-grained context modeling, Norm-RAG supports social norm understanding in multi-turn, multilingual dialogue and offers a workable path toward culturally adaptive dialogue systems.
Abstract: Social norms are implicit, culturally grounded expectations that guide interpersonal communication. Unlike factual commonsense, norm reasoning is subjective, context-dependent, and varies across cultures, posing challenges for computational models. Prior works provide valuable normative annotations but mostly target isolated utterances or synthetic dialogues, limiting their ability to capture the fluid, multi-turn nature of real-world conversations. In this work, we present Norm-RAG, a retrieval-augmented, agentic framework for nuanced social norm inference in multi-turn dialogues. Norm-RAG models utterance-level attributes including communicative intent, speaker roles, interpersonal framing, and linguistic cues and grounds them in structured normative documentation retrieved via a novel Semantic Chunking approach. This enables interpretable and context-aware reasoning about norm adherence and violation across multilingual dialogues. We further introduce MINDS (Multilingual Interactions with Norm-Driven Speech), a bilingual dataset comprising 31 multi-turn Mandarin-English and Spanish-English conversations. Each turn is annotated for norm category and adherence status using multi-annotator consensus, reflecting cross-cultural and realistic norm expression. Our experiments show that Norm-RAG improves norm detection and generalization and demonstrates improved performance for culturally adaptive and socially intelligent dialogue systems.
[17] Leveraging Large Language Models for Identifying Knowledge Components
Canwen Wang,Jionghao Lin,Kenneth R. Koedinger
Main category: cs.CL
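TL;DR: This study examines the feasibility of using large language models (LLMs) to generate knowledge components (KCs) automatically and proposes merging redundant KC labels by semantic similarity, to make KC identification in adaptive learning systems more efficient and accurate.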
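The merging idea in this entry can be illustrated with a minimal sketch. It assumes each KC label already has an embedding vector; the toy vectors, the greedy first-match grouping, and the names `cosine` and `merge_labels` are illustrative assumptions, since the abstract states only that labels are merged by cosine similarity with a threshold such as 0.8.

```python
import math

def cosine(u, v):
    """Cosine similarity between two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def merge_labels(embeddings, threshold=0.8):
    """Map each label to a representative label (hypothetical greedy scheme).

    A label joins the first earlier representative whose embedding has
    cosine similarity >= threshold; otherwise it becomes a new representative.
    """
    representatives = []
    mapping = {}
    for label, vec in embeddings.items():
        for rep in representatives:
            if cosine(vec, embeddings[rep]) >= threshold:
                mapping[label] = rep
                break
        else:
            representatives.append(label)
            mapping[label] = label
    return mapping

# Toy embeddings (made up for illustration, not produced by any model).
toy = {
    "solve linear equations": [1.0, 0.1, 0.0],
    "solving linear equations": [0.9, 0.2, 0.0],
    "interpret bar charts": [0.0, 0.1, 1.0],
}
merged = merge_labels(toy)
kc_count = len(set(merged.values()))
```

Raising the threshold merges fewer labels and keeps the KC count closer to the raw LLM output; lowering it merges more aggressively.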
Details
Motivation: Identifying knowledge components (KCs) by hand is slow and depends on domain experts, which holds back adaptive learning systems. LLMs could automate the process, but earlier approaches were limited to small datasets and often produced redundant, superfluous KC labels, so a scalable method that reduces redundancy is needed. Method: The study applies a "simulated textbook" prompting strategy with GPT-4o-mini to a larger dataset of 646 multiple-choice questions to generate initial KC labels, then proposes a semantic merging strategy based on cosine similarity that merges similar KC labels at different thresholds and measures the effect on model performance. Result: The initial LLM-generated model performed worse than the expert-designed model (RMSE 0.4285 vs. 0.4206) and produced many redundant KCs (569 vs. 101). Merging at a cosine similarity threshold of 0.8 reduced the KC count to 428 and improved the RMSE to 0.4259. Conclusion: Scaling LLM generation alone does not surpass an expert-designed KC model, but combining it with semantic merging reduces redundancy and improves performance, offering a viable path toward automated KC identification. Abstract: Knowledge Components (KCs) are foundational to adaptive learning systems, but their manual identification by domain experts is a significant bottleneck. While Large Language Models (LLMs) offer a promising avenue for automating this process, prior research has been limited to small datasets and has been shown to produce superfluous, redundant KC labels. This study addresses these limitations by first scaling a "simulated textbook" LLM prompting strategy (using GPT-4o-mini) to a larger dataset of 646 multiple-choice questions. We found that this initial automated approach performed significantly worse than an expert-designed KC model (RMSE 0.4285 vs. 0.4206) and generated an excessive number of KCs (569 vs. 101). To address the issue of redundancy, we proposed and evaluated a novel method for merging semantically similar KC labels based on their cosine similarity. This merging strategy significantly improved the model's performance; a model using a cosine similarity threshold of 0.8 achieved the best result, reducing the KC count to 428 and improving the RMSE to 0.4259. This demonstrates that while scaled LLM generation alone is insufficient, combining it with a semantic merging technique offers a viable path toward automating and refining KC identification.
[18] REAP: Enhancing RAG with Recursive Evaluation and Adaptive Planning for Multi-Hop Question Answering
Yijie Zhu,Haojie Zhou,Wanting Hong,Tailin Liu,Ning Wang
Main category: cs.CL
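TL;DR: Proposes REAP, a recursive evaluation and adaptive planning method that improves multi-hop reasoning in retrieval-augmented generation; its sub-task planning and fact extraction modules provide global planning and fine-grained analysis, and it significantly outperforms existing methods.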
Details
Motivation: Existing retrieval-augmented generation methods lack global planning for multi-hop reasoning, easily fall into local reasoning impasses, and fail to exploit retrieved content and latent clues fully, which hurts reasoning accuracy. Method: The approach introduces a Sub-task Planner (SP) and a Fact Extractor (FE): SP maintains a global perspective and dynamically optimizes the reasoning trajectory, while FE performs fine-grained analysis of retrieved content to extract reliable information. A unified task paradigm supports multi-task fine-tuning. Result: Experiments on multiple public multi-hop datasets show that REAP significantly outperforms existing RAG methods in both in-domain and out-of-domain settings. Conclusion: Through structured sub-task and fact maintenance, global planning, and adaptive optimization, REAP improves the accuracy and traceability of complex multi-hop reasoning. Abstract: Retrieval-augmented generation (RAG) has been extensively employed to mitigate hallucinations in large language models (LLMs). However, existing methods for multi-hop reasoning tasks often lack global planning, increasing the risk of falling into local reasoning impasses. Insufficient exploitation of retrieved content and the neglect of latent clues fail to ensure the accuracy of reasoning outcomes. To overcome these limitations, we propose Recursive Evaluation and Adaptive Planning (REAP), whose core idea is to explicitly maintain structured sub-tasks and facts related to the current task through the Sub-task Planner (SP) and Fact Extractor (FE) modules. SP maintains a global perspective, guiding the overall reasoning direction and evaluating the task state based on the outcomes of FE, enabling dynamic optimization of the task-solving trajectory. FE performs fine-grained analysis over retrieved content to extract reliable answers and clues. These two modules incrementally enrich a logically coherent representation of global knowledge, enhancing the reliability and the traceability of the reasoning process. Furthermore, we propose a unified task paradigm design that enables effective multi-task fine-tuning, significantly enhancing SP's performance on complex, data-scarce tasks. We conduct extensive experiments on multiple public multi-hop datasets, and the results demonstrate that our method significantly outperforms existing RAG methods in both in-domain and out-of-domain settings, validating its effectiveness in complex multi-hop reasoning tasks.
[19] NumPert: Numerical Perturbations to Probe Language Models for Veracity Prediction
Peter Røysland Aarnes,Vinay Setty
Main category: cs.CL
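TL;DR: This paper systematically evaluates state-of-the-art large language models on veracity prediction for numerical claims and finds that even the strongest models lose substantial accuracy under specific perturbations, with no model robust across all conditions. Longer context generally lowers accuracy, but most models substantially recover when the context is enriched with perturbed demonstrations.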
Details
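Motivation: Large language models do well on knowledge-intensive tasks but still struggle with numerical reasoning. To expose how fragile current models are on numerical claims, the study evaluates their robustness systematically with controlled perturbations such as label-flipping probes. Method: Controlled perturbations, including label-flipping probes, are applied to numerical claim and evidence pairs to evaluate state-of-the-art models on veracity prediction; the study also tests how context length and prompts containing perturbed demonstrations affect performance. Result: Even leading proprietary models lose up to 62% accuracy under certain perturbations, and no model is robust across all conditions. Longer context generally reduces accuracy, but most models substantially recover when perturbed demonstrations are added to the context. Conclusion: Current large language models have serious limitations in numerical fact-checking, and robustness remains an open problem; future model design needs to pay more attention to stable reasoning over numerical information. Abstract: Large language models show strong performance on knowledge-intensive tasks such as fact-checking and question answering, yet they often struggle with numerical reasoning. We present a systematic evaluation of state-of-the-art models for veracity prediction on numerical claims and evidence pairs using controlled perturbations, including label-flipping probes, to test robustness. Our results indicate that even leading proprietary systems experience accuracy drops of up to 62% under certain perturbations. No model proves to be robust across all conditions. We further find that increasing context length generally reduces accuracy, but when extended context is enriched with perturbed demonstrations, most models substantially recover. These findings highlight critical limitations in numerical fact-checking and suggest that robustness remains an open challenge for current language models.
[20] Modeling Uncertainty Trends for Timely Retrieval in Dynamic RAG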
Bo Li,Tian Tian,Zhenghua Xu,Hao Cheng,Shikun Zhang,Wei Ye
Main category: cs.CL
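TL;DR: Proposes ETC, a training-free method for controlling retrieval timing that models the trend of token-level uncertainty, enabling earlier and more precise retrieval in dynamic retrieval-augmented generation while improving performance and reducing the number of retrievals.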
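The trend signal this entry refers to can be sketched as follows. The sketch computes token-level entropy and its first- and second-order differences; the trigger rule (entropy both rising and accelerating) and the thresholds `slope_min` and `accel_min` are hypothetical choices for illustration, not the paper's actual criterion.

```python
import math

def entropy(probs):
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def differences(values):
    """Discrete difference: out[i] = values[i + 1] - values[i]."""
    return [b - a for a, b in zip(values, values[1:])]

def first_trigger(entropies, slope_min=0.1, accel_min=0.0):
    """Return the first token index where entropy is rising and accelerating.

    Hypothetical rule: at token t, fire when the first difference
    H[t] - H[t-1] exceeds slope_min and the second difference
    H[t] - 2*H[t-1] + H[t-2] exceeds accel_min. Returns None otherwise.
    """
    d1 = differences(entropies)
    d2 = differences(d1)
    for i, accel in enumerate(d2):
        slope = d1[i + 1]  # first difference ending at token i + 2
        if slope > slope_min and accel > accel_min:
            return i + 2
    return None

# Made-up entropy trace: flat at first, then uncertainty starts to climb.
trace = [0.2, 0.2, 0.25, 0.5, 1.0]
trigger_at = first_trigger(trace)
```

A plain confidence threshold would wait until entropy is already high; looking at differences allows a trigger while entropy is still moderate but climbing.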
Details
Motivation: Existing methods that trigger retrieval on low confidence may intervene only after errors have already propagated, so retrieval comes too late to improve generation quality effectively. Method: Entropy-Trend Constraint (ETC) uses first- and second-order differences of the entropy sequence to detect uncertainty trends and decide when to retrieve; it needs no additional training and is plug-and-play. Result: Experiments on six QA benchmarks with three LLM backbones show that ETC consistently outperforms strong baselines while reducing retrieval frequency, with robust generalization in domain-specific scenarios. Ablation and qualitative analyses confirm that trend-aware uncertainty modeling is more effective. Conclusion: By capturing how uncertainty evolves, ETC makes better retrieval-timing decisions and is a general, efficient, and easily integrated solution for dynamic RAG. Abstract: Dynamic retrieval-augmented generation (RAG) allows large language models (LLMs) to fetch external knowledge on demand, offering greater adaptability than static RAG. A central challenge in this setting lies in determining the optimal timing for retrieval. Existing methods often trigger retrieval based on low token-level confidence, which may lead to delayed intervention after errors have already propagated. We introduce Entropy-Trend Constraint (ETC), a training-free method that determines optimal retrieval timing by modeling the dynamics of token-level uncertainty. Specifically, ETC utilizes first- and second-order differences of the entropy sequence to detect emerging uncertainty trends, enabling earlier and more precise retrieval. Experiments on six QA benchmarks with three LLM backbones demonstrate that ETC consistently outperforms strong baselines while reducing retrieval frequency. ETC is particularly effective in domain-specific scenarios, exhibiting robust generalization capabilities. Ablation studies and qualitative analyses further confirm that trend-aware uncertainty modeling yields more effective retrieval timing. The method is plug-and-play, model-agnostic, and readily integrable into existing decoding pipelines. Implementation code is included in the supplementary materials.
[21] Language Drift in Multilingual Retrieval-Augmented Generation: Characterization and Decoding-Time Mitigation
Bo Li,Zhenghua Xu,Rui Xie
Main category: cs.CL
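TL;DR: This paper studies output language drift in multilingual retrieval-augmented generation (RAG) and finds that it stems from decoder-level collapse rather than comprehension failure, with English acting as a semantic attractor that dominates generation. It proposes Soft Constrained Decoding (SCD), a lightweight, training-free method that mitigates drift and improves language alignment and task performance in multilingual RAG.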
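The core of a soft decoding constraint can be shown with a toy example. This is a sketch under stated assumptions: the four-word vocabulary, the language mask, and the fixed `penalty` value are invented for illustration, and how the paper identifies non-target-language tokens or sets the penalty is not stated in the abstract.

```python
import math

def soft_constrain(logits, is_target_language, penalty=2.0):
    """Subtract a fixed penalty from the logits of non-target-language tokens.

    The constraint is soft: penalized tokens keep non-zero probability,
    so names or code identifiers in other languages remain reachable.
    """
    return [x if ok else x - penalty for x, ok in zip(logits, is_target_language)]

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Toy vocabulary with German as the target language (hypothetical values).
vocab = ["the", "is", "der", "ist"]
target_mask = [False, False, True, True]
logits = [2.0, 1.5, 1.8, 1.0]

before = softmax(logits)
after = softmax(soft_constrain(logits, target_mask))
top_before = vocab[before.index(max(before))]
top_after = vocab[after.index(max(after))]
```

Because the adjustment happens on logits at each step, it composes with greedy search, sampling, or beam search without touching model weights.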
Details
Motivation: In multilingual RAG, models often drift into an unintended language when the retrieved documents, user query, and in-context exemplars differ in language, and the problem is worse in reasoning-intensive generation such as Chain-of-Thought. The paper aims to analyze this problem systematically and propose a general solution. Method: Controlled experiments across multiple datasets, languages, and LLM backbones analyze the causes of language drift. Soft Constrained Decoding (SCD) then steers generation by lightly penalizing non-target-language tokens at decoding time; it requires no training and works with any generation algorithm. Result: The experiments show that language drift is driven mainly by decoder-level bias in token distributions, with English the dominant interference source and fallback language under cross-lingual conditions. SCD significantly improves language alignment and task performance on three multilingual datasets and multiple languages. Conclusion: Language drift is a systematic bias in the decoding process of multilingual RAG, and SCD, as a general, lightweight, training-free decoding strategy, suppresses it effectively and offers a practical, generalizable solution for multilingual generation. Abstract: Multilingual Retrieval-Augmented Generation (RAG) enables large language models (LLMs) to perform knowledge-intensive tasks in multilingual settings by leveraging retrieved documents as external evidence. However, when the retrieved evidence differs in language from the user query and in-context exemplars, the model often exhibits language drift by generating responses in an unintended language. This phenomenon is especially pronounced during reasoning-intensive decoding, such as Chain-of-Thought (CoT) generation, where intermediate steps introduce further language instability. In this paper, we systematically study output language drift in multilingual RAG across multiple datasets, languages, and LLM backbones. Our controlled experiments reveal that the drift results not from comprehension failure but from decoder-level collapse, where dominant token distributions and high-frequency English patterns dominate the intended generation language. We further observe that English serves as a semantic attractor under cross-lingual conditions, emerging as both the strongest interference source and the most frequent fallback language. To mitigate this, we propose Soft Constrained Decoding (SCD), a lightweight, training-free decoding strategy that gently steers generation toward the target language by penalizing non-target-language tokens. SCD is model-agnostic and can be applied to any generation algorithm without modifying the architecture or requiring additional data. Experiments across three multilingual datasets and multiple typologically diverse languages show that SCD consistently improves language alignment and task performance, providing an effective and generalizable solution in multilingual RAG.
[22] FinNuE: Exposing the Risks of Using BERTScore for Numerical Semantic Evaluation in Finance
Yu-Shiang Huang,Yun-Yu Lee,Tzu-Hsin Chou,Che Lin,Chuan-Ju Wang
Main category: cs.CL
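TL;DR: BERTScore is insensitive to numerical changes in financial text; this paper introduces the FinNuE dataset to expose that limitation and calls for numerically aware evaluation frameworks for financial NLP.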
Details
Motivation: Embedding-based similarity metrics such as BERTScore are not sensitive enough to numerical variation in finance, where numerical precision is critical, so this weakness needs to be identified and addressed. Method: The authors build FinNuE, a financial text dataset with controlled numerical perturbations covering earnings calls, regulatory filings, social media, and news, and use it to evaluate systematically how BERTScore handles numerical differences. Result: BERTScore fails to distinguish numerical changes that carry important semantic differences and often assigns high similarity to text pairs with very different financial meanings. Conclusion: Current embedding-based metrics have fundamental limitations in financial settings, and new evaluation methods that are aware of numerical information are needed. Abstract: BERTScore has become a widely adopted metric for evaluating semantic similarity between natural language sentences. However, we identify a critical limitation: BERTScore exhibits low sensitivity to numerical variation, a significant weakness in finance where numerical precision directly affects meaning (e.g., distinguishing a 2% gain from a 20% loss). We introduce FinNuE, a diagnostic dataset constructed with controlled numerical perturbations across earnings calls, regulatory filings, social media, and news articles. Using FinNuE, we demonstrate that BERTScore fails to distinguish semantically critical numerical differences, often assigning high similarity scores to financially divergent text pairs. Our findings reveal fundamental limitations of embedding-based metrics for finance and motivate numerically-aware evaluation frameworks for financial NLP.
[23] PustakAI: Curriculum-Aligned and Interactive Textbooks Using Large Language Models
Shivam Sharma,Riya Naik,Tejas Gawas,Heramb Patil,Kunal Korgaonkar
Main category: cs.CL
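TL;DR: This paper presents PustakAI, a framework for designing and evaluating NCERT-QA, a question-answering dataset aligned with India's NCERT curriculum for English and Science in grades 6 to 8, and uses several prompting techniques to assess how effective open-source and high-end large language models are in educational applications.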
Details
Motivation: Adapting large language models to a specific curriculum such as India's NCERT syllabus raises challenges of accuracy, alignment, and pedagogical relevance, especially for personalized learning in regions with limited teaching resources. Method: The authors build the NCERT-QA dataset, with QA pairs classified as Factoid, Inferential, and Others (evaluative and reasoning), and evaluate different large language models using meta-prompt, few-shot, and chain-of-thought prompting with a range of evaluation metrics. Result: Gemma3:1b, Llama3.2:3b, Nemotron-mini:4b, Llama-4-Scout-17B, and Deepseek-r1-70B are evaluated on NCERT-QA, with analysis of how effective each prompting method is and how well it fits the structure of the curriculum. Conclusion: The study shows the potential of a high-quality, curriculum-specific QA dataset for bringing large language models into formal education, and it reveals the strengths and limitations of current models in accuracy and pedagogical suitability. Abstract: Large Language Models (LLMs) have demonstrated remarkable capabilities in understanding and generating human-like content. This has revolutionized various sectors such as healthcare, software development, and education. In education, LLMs offer potential for personalized and interactive learning experiences, especially in regions with limited teaching resources. However, adapting these models effectively to curriculum-specific content, such as the National Council of Educational Research and Training (NCERT) syllabus in India, presents unique challenges in terms of accuracy, alignment, and pedagogical relevance. In this paper, we present the framework "PustakAI" (Pustak means 'book' in many Indian languages) for the design and evaluation of a novel question-answering dataset "NCERT-QA" aligned with the NCERT curriculum for English and Science subjects of grades 6 to 8. We classify the curated QA pairs as Factoid, Inferential, and Others (evaluative and reasoning). We evaluate the dataset with various prompting techniques, such as meta-prompt, few-shot, and CoT-style prompting, using diverse evaluation metrics to understand which approach aligns more efficiently with the structure and demands of the curriculum. Along with the usability of the dataset, we analyze the strengths and limitations of current open-source LLMs (Gemma3:1b, Llama3.2:3b, and Nemotron-mini:4b) and high-end LLMs (Llama-4-Scout-17B and Deepseek-r1-70B) as AI-based learning tools in formal education systems.
[24] ScaleFormer: Span Representation Cumulation for Long-Context Transformer
Jiangshu Du,Wenpeng Yin,Philip Yu
Main category: cs.CL
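TL;DR: ScaleFormer is a plug-and-play method that needs no architectural changes or re-pretraining; it handles long text efficiently through overlapping chunks and cumulative context representations and performs strongly on long-document summarization.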
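The two ingredients named in this entry, overlapping chunks and cumulative context, can be sketched in a few lines. This is only an illustration under assumptions: the per-chunk summary vectors are taken as given, and the running-sum form of `cumulative_context` is a guess at what a parameter-free accumulation might look like, not the paper's actual fusion mechanism.

```python
def overlapping_chunks(tokens, size, overlap):
    """Split a token list into chunks of length `size` sharing `overlap` tokens."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must satisfy 0 <= overlap < size")
    step = size - overlap
    chunks = []
    start = 0
    while True:
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break
        start += step
    return chunks

def cumulative_context(summaries):
    """For each chunk, sum the summary vectors of all earlier and all later chunks.

    Uses prefix and suffix running sums, so the cost is linear in the
    number of chunks. Returns (preceding, succeeding) lists of vectors.
    """
    n = len(summaries)
    dim = len(summaries[0])
    preceding = [[0.0] * dim for _ in range(n)]
    succeeding = [[0.0] * dim for _ in range(n)]
    for i in range(1, n):
        preceding[i] = [p + s for p, s in zip(preceding[i - 1], summaries[i - 1])]
    for i in range(n - 2, -1, -1):
        succeeding[i] = [q + s for q, s in zip(succeeding[i + 1], summaries[i + 1])]
    return preceding, succeeding

chunks = overlapping_chunks(list(range(10)), size=4, overlap=1)
# One-dimensional toy summaries, one per chunk (hypothetical values).
preceding, succeeding = cumulative_context([[1.0], [2.0], [4.0]])
```

In a real pipeline each chunk would be encoded independently by the pre-trained encoder, and the two context vectors would then be combined with that chunk's boundary representations before decoding.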
Details
Motivation: The quadratic complexity of standard self-attention limits Transformers on long-text tasks, and existing efficient variants usually require architectural changes and pre-training from scratch, so they cannot be applied directly to existing pre-trained models. Method: Long inputs are split into overlapping chunks, and a parameter-free fusion mechanism accumulates context from preceding and succeeding chunks into each chunk's boundary representations, producing a compressed, structure-aware representation that achieves linear complexity while preserving the pre-trained model's capabilities. Result: On long-document summarization, ScaleFormer matches or outperforms state-of-the-art methods without architectural modifications or external retrieval mechanisms. Conclusion: ScaleFormer is a simple and effective plug-and-play framework that lets off-the-shelf encoder-decoder models process long sequences efficiently and reason more effectively over long-form text. Abstract: The quadratic complexity of standard self-attention severely limits the application of Transformer-based models to long-context tasks. While efficient Transformer variants exist, they often require architectural changes and costly pre-training from scratch. To circumvent this, we propose ScaleFormer (Span Representation Cumulation for Long-Context Transformer) - a simple and effective plug-and-play framework that adapts off-the-shelf pre-trained encoder-decoder models to process long sequences without requiring architectural modifications. Our approach segments long inputs into overlapping chunks and generates a compressed, context-aware representation for the decoder. The core of our method is a novel, parameter-free fusion mechanism that endows each chunk's representation with structural awareness of its position within the document. It achieves this by enriching each chunk's boundary representations with cumulative context vectors from all preceding and succeeding chunks. This strategy provides the model with a strong signal of the document's narrative flow, achieves linear complexity, and enables pre-trained models to reason effectively over long-form text. Experiments on long-document summarization show that our method is highly competitive with and often outperforms state-of-the-art approaches without requiring architectural modifications or external retrieval mechanisms.
[25] Do Language Models Associate Sound with Meaning? A Multimodal Study of Sound Symbolism
Jinhong Jeong,Sunghyun Lee,Jaeyoung Lee,Seonah Han,Youngjae Yu
Main category: cs.CL
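TL;DR: This study uses sound symbolism to probe how multimodal large language models (MLLMs) interpret phonetic information. It introduces LEX-ICON, a dataset of 8,052 real words and 2,930 pseudo-words covering text and audio modalities in four languages. By analyzing model performance on up to 25 semantic dimensions and phoneme-level attention distributions, it finds that MLLMs' phonetic intuitions agree with linguistic research and that the models attend to iconic phonemes, providing the first large-scale, quantitative interpretability analysis of phonetic iconicity in MLLMs.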
Details
Motivation: To explore how multimodal large language models understand the non-arbitrary associations between sound and meaning in human language (sound symbolism), offering a new angle on how MLLMs process auditory information. Method: The authors build LEX-ICON, a dataset of real words and pseudo-words in four languages annotated along multiple semantic dimensions. Inputs are given in orthographic, IPA, and audio form, and phoneme-level attention scores are analyzed across layers to evaluate phonetic iconicity across modalities and semantic dimensions. Result: (1) MLLMs show sound-symbolic effects consistent with human linguistic intuition on multiple semantic dimensions; (2) attention analysis shows that the models focus on key iconic phonemes; (3) the behavior is consistent across text and audio modalities. Conclusion: MLLMs have human-like sound-symbolic intuitions, and their attention mechanisms capture non-arbitrary sound-meaning associations, giving new empirical grounding for research at the intersection of AI and cognitive linguistics. Abstract: Sound symbolism is a linguistic concept that refers to non-arbitrary associations between phonetic forms and their meanings. We suggest that this can be a compelling probe into how Multimodal Large Language Models (MLLMs) interpret auditory information in human languages. We investigate MLLMs' performance on phonetic iconicity across textual (orthographic and IPA) and auditory forms of inputs with up to 25 semantic dimensions (e.g., sharp vs. round), observing models' layer-wise information processing by measuring phoneme-level attention fraction scores. To this end, we present LEX-ICON, an extensive mimetic word dataset consisting of 8,052 words from four natural languages (English, French, Japanese, and Korean) and 2,930 systematically constructed pseudo-words, annotated with semantic features applied across both text and audio modalities. Our key findings demonstrate (1) MLLMs' phonetic intuitions that align with existing linguistic research across multiple semantic dimensions and (2) phonosemantic attention patterns that highlight models' focus on iconic phonemes. These results bridge domains of artificial intelligence and cognitive linguistics, providing the first large-scale, quantitative analyses of phonetic iconicity in terms of MLLMs' interpretability.
[26] GraphIF: Enhancing Multi-Turn Instruction Following for Large Language Models with Relation Graph Prompt
Zhenhe Li,Can Lin,Ling Zheng,Wen-Da Wei,Junli Liang,Qi Song
Main category: cs.CL
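TL;DR: This paper proposes GraphIF, a plug-and-play framework that models multi-turn dialogues as directed relation graphs and uses graph prompts to strengthen the instruction-following ability of large language models.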
Details
Motivation: Existing methods mainly fine-tune large language models on large-scale multi-turn dialogue datasets without explicitly including multi-turn instruction following in the optimization objective, so models struggle with complex long-distance constraints. Method: GraphIF has three modules: agent-based relation extraction, relation graph prompt generation, and response rewriting. It captures semantic relations through action-triggered mechanisms to build a structured graph, then converts the graph into natural language prompts that refine the model's output. Result: Experiments on two long multi-turn dialogue datasets show that GraphIF integrates seamlessly into instruction-tuned LLMs and yields significant gains on all four multi-turn instruction-following metrics. Conclusion: GraphIF improves the instruction-following ability of large language models in multi-turn dialogue, especially when handling complex constraints that span turns. Abstract: Multi-turn instruction following is essential for building intelligent conversational systems that can consistently adhere to instructions across dialogue turns. However, existing approaches to enhancing multi-turn instruction following primarily rely on collecting or generating large-scale multi-turn dialogue datasets to fine-tune large language models (LLMs), which treat each response generation as an isolated task and fail to explicitly incorporate multi-turn instruction following into the optimization objectives. As a result, instruction-tuned LLMs often struggle with complex long-distance constraints. In multi-turn dialogues, relational constraints across turns can be naturally modeled as labeled directed edges, making graph structures particularly suitable for modeling multi-turn instruction following. Despite this potential, leveraging graph structures to enhance the multi-turn instruction following capabilities of LLMs remains unexplored. To bridge this gap, we propose GraphIF, a plug-and-play framework that models multi-turn dialogues as directed relation graphs and leverages graph prompts to enhance the instruction following capabilities of LLMs. GraphIF comprises three key components: (1) an agent-based relation extraction module that captures inter-turn semantic relations via action-triggered mechanisms to construct structured graphs; (2) a relation graph prompt generation module that converts structured graph information into natural language prompts; and (3) a response rewriting module that refines initial LLM outputs using the generated graph prompts. Extensive experiments on two long multi-turn dialogue datasets demonstrate that GraphIF can be seamlessly integrated into instruction-tuned LLMs and leads to significant improvements across all four multi-turn instruction-following evaluation metrics.
[27] ADI-20: Arabic Dialect Identification dataset and models
Haroun Elleuch,Salima Mdhaffar,Yannick Estève,Fethi Bougares
Main category: cs.CL
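TL;DR: This paper introduces ADI-20, an extension of the previously released ADI-17 Arabic dialect identification dataset that covers the dialects of all Arabic-speaking countries, with 3,556 hours of data from 19 Arabic dialects plus Modern Standard Arabic (MSA). The authors use it to train and evaluate several state-of-the-art ADI systems, exploring fine-tuning of pre-trained ECAPA-TDNN-based models as well as Whisper encoder blocks combined with an attention pooling layer and a dense classification layer. They study how training data size and parameter count affect identification performance and find only a small drop in F1 when using just 30% of the original training data. The collected data and trained models are open-sourced to support reproducibility and further research.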
Details
Motivation: Improving Arabic dialect identification (ADI) requires larger, more comprehensive datasets and stronger models that cover all Arabic dialects. Method: Models based on ECAPA-TDNN and on Whisper encoder blocks, combined with attention pooling and a classification layer, are trained and evaluated on ADI-20, and the effect of training data size and parameter count on performance is studied. Result: Using only 30% of the training data causes just a small drop in F1 score, indicating good data efficiency; the data and models are open-sourced. Conclusion: ADI-20 is a large, diverse Arabic dialect identification dataset that supports training and evaluating advanced models and advances research in ADI. Abstract: We present ADI-20, an extension of the previously published ADI-17 Arabic Dialect Identification (ADI) dataset. ADI-20 covers all Arabic-speaking countries' dialects. It comprises 3,556 hours from 19 Arabic dialects in addition to Modern Standard Arabic (MSA). We used this dataset to train and evaluate various state-of-the-art ADI systems. We explored fine-tuning pre-trained ECAPA-TDNN-based models, as well as Whisper encoder blocks coupled with an attention pooling layer and a classification dense layer. We investigated the effect of (i) training data size and (ii) the model's number of parameters on identification performance. Our results show a small decrease in F1 score while using only 30% of the original training data. We open-source our collected data and trained models to enable the reproduction of our work, as well as support further research in ADI.
[28] Format Matters: The Robustness of Multimodal LLMs in Reviewing Evidence from Tables and Charts
Xanh Ho,Yun-Ang Wu,Sunisth Kumar,Florian Boudin,Atsuhiro Takasu,Akiko Aizawa
Main category: cs.CL
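TL;DR: This paper evaluates how well multimodal large language models verify scientific claims across evidence formats (tables and charts) and finds that current models do worse with charts and that smaller models generalize poorly across modalities; it recommends that future work strengthen chart understanding.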
Details
Motivation: As the number of scientific papers grows, systems that help reviewers evaluate research claims are increasingly needed, yet it is unclear how well current multimodal LLMs verify scientific claims across evidence formats, and comparative studies of tables versus charts are lacking. Method: The authors design experiments and build a dataset for multimodal claim verification by adding annotations and structure to two existing scientific paper datasets, then evaluate 12 multimodal LLMs with tables and with charts as evidence. Result: Current models perform better with table-based evidence and worse with chart-based evidence, while humans perform well on both formats. Smaller models (under 8B) show weak correlation between table and chart performance, indicating limited cross-modal generalization. Conclusion: Current multimodal LLMs have a critical gap in scientific claim verification, particularly in chart understanding; future work should focus on improving how models parse and reason over charts to better support multimodal scientific reasoning. Abstract: With the growing number of submitted scientific papers, there is an increasing demand for systems that can assist reviewers in evaluating research claims. Experimental results are a core component of scientific work, often presented in varying formats such as tables or charts. Understanding how robust current multimodal large language models (multimodal LLMs) are at verifying scientific claims across different evidence formats remains an important and underexplored challenge. In this paper, we design and conduct a series of experiments to assess the ability of multimodal LLMs to verify scientific claims using both tables and charts as evidence. To enable this evaluation, we adapt two existing datasets of scientific papers by incorporating annotations and structures necessary for a multimodal claim verification task. Using this adapted dataset, we evaluate 12 multimodal LLMs and find that current models perform better with table-based evidence while struggling with chart-based evidence. We further conduct human evaluations and observe that humans maintain strong performance across both formats, unlike the models. Our analysis also reveals that smaller multimodal LLMs (under 8B) show weak correlation in performance between table-based and chart-based tasks, indicating limited cross-modal generalization. These findings highlight a critical gap in current models' multimodal reasoning capabilities. We suggest that future multimodal LLMs should place greater emphasis on improving chart understanding to better support scientific claim verification.
[29] ELYADATA & LIA at NADI 2025: ASR and ADI Subtasks
Haroun Elleuch,Youssef Saidi,Salima Mdhaffar,Yannick Estève,Fethi Bougares
Main category: cs.CL
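TL;DR: The systems Elyadata & LIA submitted to the NADI 2025 multi-dialectal Arabic speech processing task ranked first in the Spoken Arabic Dialect Identification (ADI) subtask and second in the multi-dialectal Arabic ASR subtask.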
Details
Motivation: To improve multi-dialectal Arabic speech recognition and dialect identification by applying targeted fine-tuning to large pre-trained models. Method: For ADI, a Whisper-large-v3 encoder is fine-tuned with data augmentation; for ASR, SeamlessM4T-v2 Large (Egyptian variant) is fine-tuned separately for each of the eight dialects. Result: The ADI system reaches 79.83% accuracy; the ASR systems obtain an average WER of 38.54% and CER of 14.53%. Conclusion: Large pre-trained speech models combined with targeted fine-tuning perform well for Arabic speech processing. Abstract: This paper describes Elyadata & LIA's joint submission to the NADI multi-dialectal Arabic Speech Processing 2025. We participated in the Spoken Arabic Dialect Identification (ADI) and multi-dialectal Arabic ASR subtasks. Our submission ranked first for the ADI subtask and second for the multi-dialectal Arabic ASR subtask among all participants. Our ADI system is a fine-tuned Whisper-large-v3 encoder with data augmentation. This system obtained the highest ADI accuracy score of 79.83% on the official test set. For multi-dialectal Arabic ASR, we fine-tuned SeamlessM4T-v2 Large (Egyptian variant) separately for each of the eight considered dialects. Overall, we obtained an average WER and CER of 38.54% and 14.53%, respectively, on the test set. Our results demonstrate the effectiveness of large pre-trained speech models with targeted fine-tuning for Arabic speech processing.
[30] On the Military Applications of Large Language Models
Satu Johansson,Taneli Riihonen
Main category: cs.CL
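TL;DR: This paper examines military applications of natural language processing and large language models by querying a GPT-based model (Microsoft Copilot) and assessing the information it provides, and it studies the feasibility of building such applications with commercial cloud services (Microsoft Azure).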
Details
Motivation: To explore potential applications of large language models in military settings and assess whether existing technology can implement them. Method: The authors interrogate a GPT-based language model for its view of military applications and analyze the role and feasibility of commercial cloud platforms such as Microsoft Azure in developing them. Result: The summarization and generative capabilities of language models directly support many military applications, other features have particular uses, and building these applications on existing cloud services is feasible. Conclusion: Large language models, especially their generation and summarization functions, have broad potential in the military domain, and commercial cloud platforms can speed up practical deployment. Abstract: In this paper, military use cases or applications and implementation thereof are considered for natural language processing and large language models, which have risen to prominence with the invention of the generative pre-trained transformer (GPT) and the extensive foundation model pretraining done by OpenAI for ChatGPT and others. First, we interrogate a GPT-based language model (viz. Microsoft Copilot) to make it reveal its own knowledge about their potential military applications and then critically assess the information. Second, we study how commercial cloud services (viz. Microsoft Azure) could be used readily to build such applications and assess which of them are feasible. We conclude that the summarization and generative properties of language models directly facilitate many applications at large and other features may find particular uses.
[31] Generalizing to Unseen Disaster Events: A Causal View
Philipp Seeberger,Steffen Freisinger,Tobias Bocklet,Korbinian Riedhammer
Main category: cs.CL
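TL;DR: Proposes a debiasing method based on causal learning that improves generalization for classifying social media data during disaster events, outperforming baseline models on three tasks.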
Details
Motivation: Existing systems for processing disaster information on social media are exposed to event-related biases, which weakens their generalization to newly emerging events. Method: Taking a causal learning view, the authors propose a method that reduces event- and domain-related biases to improve adaptation to future events. Result: The method outperforms multiple baselines by up to +1.9% F1 across three disaster classification tasks and significantly improves a classifier based on a pre-trained language model. Conclusion: The proposed causal debiasing method mitigates bias in disaster information processing and improves generalization to unseen events. Abstract: Due to the rapid growth of social media platforms, these tools have become essential for monitoring information during ongoing disaster events. However, extracting valuable insights requires real-time processing of vast amounts of data. A major challenge in existing systems is their exposure to event-related biases, which negatively affects their ability to generalize to emerging events. While recent advancements in debiasing and causal learning offer promising solutions, they remain underexplored in the disaster event domain. In this work, we approach bias mitigation through a causal lens and propose a method to reduce event- and domain-related biases, enhancing generalization to future events. Our approach outperforms multiple baselines by up to +1.9% F1 and significantly improves a PLM-based classifier across three disaster classification tasks.
[32] Beyond the Black Box: Demystifying Multi-Turn LLM Reasoning with VISTA
Yiran Zhang,Mingyang Lin,Mark Dras,Usman Naseem
Main category: cs.CL
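TL;DR: This paper presents VISTA, a visual interactive text analytics system for multi-turn reasoning tasks that supports visualizing context influence, editing conversation histories, and generating reasoning dependency trees, reducing the complexity of analyzing how large language models reason.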
Details
Motivation: Researchers lack specialized tools for analyzing the complex reasoning of large language models in multi-turn interactions, which imposes a high cognitive load. Method: The authors design and implement VISTA, a web-based visual interactive system that visualizes how context influences model decisions, supports interactive edits to the conversation history for "what-if" analysis, and builds reasoning dependency trees automatically. Result: VISTA shows a model's step-by-step reasoning path clearly, supports cross-model comparison, custom benchmarks, and local model integration, and significantly reduces the complexity of analyzing reasoning chains. Conclusion: VISTA provides a transparent, unified, and extensible interactive framework for analyzing LLM reasoning in multi-turn settings, helping researchers understand the capabilities and limits of current models. Abstract: Recent research has increasingly focused on the reasoning capabilities of Large Language Models (LLMs) in multi-turn interactions, as these scenarios more closely mirror real-world problem-solving. However, analyzing the intricate reasoning processes within these interactions presents a significant challenge due to complex contextual dependencies and a lack of specialized visualization tools, leading to a high cognitive load for researchers. To address this gap, we present VISTA, a web-based Visual Interactive System for Textual Analytics in multi-turn reasoning tasks. VISTA allows users to visualize the influence of context on model decisions and interactively modify conversation histories to conduct "what-if" analyses across different models. Furthermore, the platform can automatically parse a session and generate a reasoning dependency tree, offering a transparent view of the model's step-by-step logical path. By providing a unified and interactive framework, VISTA significantly reduces the complexity of analyzing reasoning chains, thereby facilitating a deeper understanding of the capabilities and limitations of current LLMs. The platform is open-source and supports easy integration of custom benchmarks and local models.
[33] Text2SQL-Flow: A Robust SQL-Aware Data Augmentation Framework for Text-to-SQL
Qifeng Cai,Hao Liang,Chang Xu,Tao Xie,Wentao Zhang,Bin Cui
Main category: cs.CL
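TL;DR: Proposes Text2SQL-Flow, a SQL-aware data augmentation framework that generates large-scale, semantically valid, and structurally diverse Text-to-SQL pairs, and builds the high-quality SQLFlow dataset, which improves the Text-to-SQL performance of both open-source and closed-source large models.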
Details
Motivation: Existing Text-to-SQL datasets are scarce, simplistic, and low in diversity, which limits model performance and creates a need for high-quality, diverse data. Method: Text2SQL-Flow augments data along six dimensions and integrates SQL execution verification, natural language question generation, chain-of-thought reasoning traces, data classification, and a Database Manager module. The authors build the SQLFlow dataset and propose a masked alignment retrieval method for example matching with closed-source models. Result: Under the same data budget, open-source LLMs fine-tuned on SQLFlow perform better across benchmarks, and the proposed retrieval method outperforms existing techniques with closed-source models, supporting the value of the data quality and alignment strategy. Conclusion: High-quality structured data is critical for Text-to-SQL systems, and Text2SQL-Flow and SQLFlow provide a scalable foundation for data-centric AI. Abstract: The data-centric paradigm has become pivotal in AI, especially for Text-to-SQL, where performance is limited by scarce, simplistic, and low-diversity datasets. To address this, we propose Text2SQL-Flow, a SQL-aware data augmentation framework that generates large-scale, semantically valid, and structurally diverse Text-to-SQL pairs from minimal seed data. It operates across six augmentation dimensions and integrates an end-to-end pipeline featuring SQL execution verification, natural language question generation, chain-of-thought reasoning traces, and data classification. A modular Database Manager ensures cross-database compatibility and scalability. Using this framework, we build SQLFlow, a high-quality dataset of 89,544 annotated examples. We evaluate SQLFlow in two settings: (1) For open-source LLMs, fine-tuning on SQLFlow consistently improves performance across benchmarks under the same data budget. (2) For closed-source LLMs, we introduce a masked alignment retrieval method that treats SQLFlow as both knowledge base and training data for the retriever. This enables structure-aware example matching by modeling fine-grained alignments between questions and SQL queries. Experiments show our retrieval strategy outperforms existing methods, underscoring the value of SQLFlow's high-fidelity data and our novel technique. Our work establishes a scalable, data-centric foundation for advancing Text-to-SQL systems and highlights the critical role of high-quality structured data in modern AI.
[34] EffiReason-Bench: A Unified Benchmark for Evaluating and Advancing Efficient Reasoning in Large Language Models
Junquan Huang,Haotian Wu,Yubo Gao,Yibo Yan,Junyan Zhang,Yonghua Hei,Song Dai,Jie Zhang,Puay Siew Tan,Xuming Hu
Main category: cs.CL
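TL;DR: This paper presents EffiReason-Bench, a unified benchmark for evaluating efficient reasoning methods, and introduces the E3-Score metric; experiments across tasks and models show that no single method is best in every setting.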
Details
Motivation: LLMs prompted with chain-of-thought often produce overly long reasoning, which hurts efficiency and sometimes accuracy, and there is no unified standard for comparing efficient reasoning methods. Method: Builds EffiReason-Bench, a unified benchmark covering three categories of methods (Reasoning Blueprints, Dynamic Execution, and Post-hoc Refinement), designs a standardized chain-of-thought annotation pipeline, and proposes the E3-Score as a comprehensive metric. Result: Evaluates 7 methods across 6 open-source LLMs and 4 datasets; performance depends on model scale, task complexity, and architecture, with no single best strategy. Conclusion: The choice of efficient reasoning method should be weighed against model scale, task type, and architecture; a unified evaluation framework and metric help advance the field. Abstract: Large language models (LLMs) with Chain-of-Thought (CoT) prompting achieve strong reasoning but often produce unnecessarily long explanations, increasing cost and sometimes reducing accuracy. Fair comparison of efficiency-oriented approaches is hindered by fragmented evaluation practices. We introduce EffiReason-Bench, a unified benchmark for rigorous cross-paradigm evaluation of efficient reasoning methods across three categories: Reasoning Blueprints, Dynamic Execution, and Post-hoc Refinement. To enable step-by-step evaluation, we construct verified CoT annotations for CommonsenseQA and LogiQA via a pipeline that enforces standardized reasoning structures, comprehensive option-wise analysis, and human verification. We evaluate 7 methods across 6 open-source LLMs (1B-70B) on 4 datasets spanning mathematics, commonsense, and logic, and propose the E3-Score, a principled metric inspired by economic trade-off modeling that provides smooth, stable evaluation without discontinuities or heavy reliance on heuristics. Experiments show that no single method universally dominates; optimal strategies depend on backbone scale, task complexity, and architecture.
[35] Persona-Aware Alignment Framework for Personalized Dialogue Generation
Guanrong Li,Xinyu Liu,Zhen Wu,Xinyu Dai
Main category: cs.CL
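TL;DR: Proposes PAL, a new framework for personalized dialogue generation that directly optimizes persona consistency through two-stage training, improving the relevance and personalization of generated responses.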
Details
Motivation: Existing models often ignore the given persona when generating personalized dialogue, producing generic and inconsistent responses, so persona alignment needs to be modeled more explicitly. Method: Proposes the Persona-Aware Alignment Framework (PAL), which uses two-stage training (Persona-aware Learning and Persona Alignment) together with a Select then Generate inference strategy to strengthen persona sensitivity at the semantic level. Result: Experiments show that PAL outperforms existing state-of-the-art personalized dialogue models and LLMs on multiple metrics, markedly improving the persona relevance and consistency of responses. Conclusion: By making persona alignment an explicit training objective, PAL improves the quality of personalized dialogue generation and offers a new approach to modeling persona consistency. Abstract: Personalized dialogue generation aims to leverage persona profiles and dialogue history to generate persona-relevant and consistent responses. Mainstream models typically rely on token-level language model training with persona dialogue data, such as Next Token Prediction, to implicitly achieve personalization, so these methods tend to neglect the given personas and generate generic responses. To address this issue, we propose a novel Persona-Aware Alignment Framework (PAL), which directly treats persona alignment as the training objective of dialogue generation. Specifically, PAL employs a two-stage training method including Persona-aware Learning and Persona Alignment, equipped with an easy-to-use inference strategy Select then Generate, to improve persona sensitivity and generate more persona-relevant responses at the semantics level. Through extensive experiments, we demonstrate that our framework outperforms many state-of-the-art personalized dialogue methods and large language models.
[36] LangGPS: Language Separability Guided Data Pre-Selection for Joint Multilingual Instruction Tuning
Yangfan Ye,Xiaocheng Feng,Xiachong Feng,Lei Huang,Weitao Ma,Qichen Hong,Yunfei Lu,Duyu Tang,Dandan Tu,Bing Qin
Main category: cs.CL
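TL;DR: This paper proposes LangGPS, a lightweight two-stage pre-selection framework based on language separability that improves multilingual LLM training, with the largest gains on low-resource languages and understanding tasks.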
Details
Motivation: Existing data selection methods overlook the intrinsic linguistic structure of multilingual data, leaving multilingual capability sensitive to the composition of the training data. A method that attends to representational differences between languages is needed to make data selection more effective. Method: Proposes the LangGPS framework: the first stage filters training data by language separability score, and the second stage refines the subset with existing selection methods; also explores using language separability for curriculum learning. Result: Experiments on six benchmarks and 22 languages show that LangGPS improves the effectiveness and generalizability of existing selection methods, especially on understanding tasks for low-resource languages; highly separable samples help form clear language boundaries, while low-separability samples promote cross-lingual alignment. Conclusion: Language separability is an effective indicator of multilingual data utility; LangGPS offers a finer-grained data selection strategy for multilingual instruction tuning and a new perspective for building linguistically informed LLMs. Abstract: Joint multilingual instruction tuning is a widely adopted approach to improve the multilingual instruction-following ability and downstream performance of large language models (LLMs), but the resulting multilingual capability remains highly sensitive to the composition and selection of the training data. Existing selection methods, often based on features like text quality, diversity, or task relevance, typically overlook the intrinsic linguistic structure of multilingual data. In this paper, we propose LangGPS, a lightweight two-stage pre-selection framework guided by language separability, which quantifies how well samples in different languages can be distinguished in the model's representation space. LangGPS first filters training data based on separability scores and then refines the subset using existing selection methods. Extensive experiments across six benchmarks and 22 languages demonstrate that applying LangGPS on top of existing selection methods improves their effectiveness and generalizability in multilingual training, especially for understanding tasks and low-resource languages. Further analysis reveals that highly separable samples facilitate the formation of clearer language boundaries and support faster adaptation, while low-separability samples tend to function as bridges for cross-lingual alignment. We also find that language separability can serve as an effective signal for multilingual curriculum learning, where interleaving samples with diverse separability levels yields stable and generalizable gains.
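As a rough illustration of the idea of a per-sample language separability score, the sketch below uses a silhouette-style measure over toy embeddings. This is an assumption made for illustration only: the abstract does not give the LangGPS formula, and the function names, the centroid-based distance, and the example vectors are all hypothetical.

```python
import math
from collections import defaultdict


def centroid(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def separability_scores(embeddings, languages):
    """Silhouette-style score per sample (hypothetical stand-in metric).

    a = distance to the sample's own-language centroid
    b = distance to the nearest other-language centroid
    score = (b - a) / max(a, b), which lies in [-1, 1]; higher means the
    sample sits more clearly inside its own language cluster.
    """
    by_lang = defaultdict(list)
    for vec, lang in zip(embeddings, languages):
        by_lang[lang].append(vec)
    if len(by_lang) < 2:
        raise ValueError("need samples from at least two languages")
    centroids = {lang: centroid(vs) for lang, vs in by_lang.items()}

    scores = []
    for vec, lang in zip(embeddings, languages):
        a = math.dist(vec, centroids[lang])
        b = min(math.dist(vec, c) for other, c in centroids.items() if other != lang)
        denom = max(a, b)
        scores.append(0.0 if denom == 0 else (b - a) / denom)
    return scores


# Toy 2-D "embeddings": the last English sample lies between the two clusters.
embeddings = [[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 6.0], [2.5, 3.0]]
languages = ["en", "en", "hi", "hi", "en"]
scores = separability_scores(embeddings, languages)
for lang, score in zip(languages, scores):
    print(lang, round(score, 3))
```

In this toy setup the in-between sample receives a lower score than the samples deep inside a cluster, which mirrors the abstract's distinction between highly separable samples and low-separability "bridge" samples.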
Together, we hope our work offers a new perspective on data utility in multilingual contexts and supports the development of more linguistically informed LLMs.
[37] VocalNet-M2: Advancing Low-Latency Spoken Language Modeling via Integrated Multi-Codebook Tokenization and Multi-Token Prediction
Yuhao Wang,Ziyang Cheng,Heyang Liu,Ronghua Wu,Qunshan Gu,Yanfeng Wang,Yu Wang
Main category: cs.CL
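TL;DR: This paper proposes VocalNet-M2, a low-latency end-to-end spoken language model that uses a multi-codebook tokenizer and a multi-token prediction strategy to substantially reduce response latency while maintaining competitive performance.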
Details
Motivation: Existing end-to-end spoken language models suffer from high response latency, mainly caused by autoregressive generation of speech tokens and complex flow-matching synthesis models, which limits real-time interactive applications. Method: Introduces VocalNet-M2, which uses a multi-codebook tokenizer to generate speech tokens directly and a multi-token prediction (MTP) strategy to improve generation efficiency, avoiding a high-latency flow-matching model. Result: Experiments show that VocalNet-M2 cuts first-chunk latency from about 725 ms to 350 ms while remaining competitive with mainstream SLMs, and provides a comprehensive comparison of single-codebook and multi-codebook strategies. Conclusion: VocalNet-M2 balances low latency with strong performance, offering a practical option for real-time interactive speech applications and advancing efficient SLMs. Abstract: Current end-to-end spoken language models (SLMs) have made notable progress, yet they still encounter considerable response latency. This delay primarily arises from the autoregressive generation of speech tokens and the reliance on complex flow-matching models for speech synthesis. To overcome this, we introduce VocalNet-M2, a novel low-latency SLM that integrates a multi-codebook tokenizer and a multi-token prediction (MTP) strategy. Our model directly generates multi-codebook speech tokens, thus eliminating the need for a latency-inducing flow-matching model. Furthermore, our MTP strategy enhances generation efficiency and improves overall performance. Extensive experiments demonstrate that VocalNet-M2 achieves a substantial reduction in first chunk latency (from approximately 725ms to 350ms) while maintaining competitive performance across mainstream SLMs. This work also provides a comprehensive comparison of single-codebook and multi-codebook strategies, offering valuable insights for developing efficient and high-performance SLMs for real-time interactive applications.
[38] MTR-DuplexBench: Towards a Comprehensive Evaluation of Multi-Round Conversations for Full-Duplex Speech Language Models
He Zhang,Wenqian Cui,Haoning Xu,Xiaohui Li,Lei Zhu,Shaohua Ma,Irwin King
Main category: cs.CL
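TL;DR: Proposes MTR-DuplexBench, the first benchmark supporting turn-by-turn evaluation of multi-round full-duplex dialogue, covering dialogue quality, conversational dynamics, instruction following, and safety.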
Details
Motivation: Existing evaluations of full-duplex speech language models are mostly limited to single-round dialogue and lack systematic assessment of key capabilities such as instruction following and safety in multi-round interaction; they also face blurred turn boundaries and context inconsistency. Method: Segments continuous full-duplex dialogues into discrete turns to build the MTR-DuplexBench benchmark, enabling turn-by-turn, multi-dimensional evaluation (dialogue quality, conversational dynamics, instruction following, safety). Result: Experiments show that current FD-SLMs are unstable in multi-round settings, with inconsistent performance across dimensions, which supports the need for and effectiveness of the proposed benchmark. Conclusion: MTR-DuplexBench fills the gap in multi-round evaluation of full-duplex speech language models and provides an important tool for future research. Abstract: Full-Duplex Speech Language Models (FD-SLMs) enable real-time, overlapping conversational interactions, offering a more dynamic user experience compared to traditional half-duplex models. However, existing benchmarks primarily focus on evaluating single-round interactions and conversational features, neglecting the complexities of multi-round communication and critical capabilities such as instruction following and safety. Evaluating FD-SLMs in multi-round settings poses significant challenges, including blurred turn boundaries in communication and context inconsistency during model inference. To address these gaps, we introduce MTR-DuplexBench, a novel benchmark that segments continuous full-duplex dialogues into discrete turns, enabling comprehensive, turn-by-turn evaluation of FD-SLMs across dialogue quality, conversational dynamics, instruction following, and safety. Experimental results reveal that current FD-SLMs face difficulties in maintaining consistent performance across multiple rounds and evaluation dimensions, highlighting the necessity and effectiveness of our proposed benchmark. The benchmark and code will be available in the future.
[39] Local Hybrid Retrieval-Augmented Document QA
Paolo Astrino
Main category: cs.CL
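TL;DR: Proposes a question-answering system that runs entirely locally and combines semantic understanding with keyword precision, achieving high accuracy while protecting data privacy; it targets legal, scientific, and conversational documents.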
Details
Motivation: Addresses the trade-off organizations face between the privacy risks of cloud-based AI systems and the poor accuracy of local processing. Method: Combines two complementary retrieval strategies, semantic understanding and keyword retrieval, and uses consumer-grade hardware acceleration, running entirely on local infrastructure. Result: The system achieves competitive accuracy on a range of complex documents with minimal errors, without any data leaving the local machine. Conclusion: Enterprises can obtain high-performing AI question answering without sacrificing privacy, showing that privacy and performance need not be mutually exclusive in enterprise AI deployment. Abstract: Organizations handling sensitive documents face a critical dilemma: adopt cloud-based AI systems that offer powerful question-answering capabilities but compromise data privacy, or maintain local processing that ensures security but delivers poor accuracy. We present a question-answering system that resolves this trade-off by combining semantic understanding with keyword precision, operating entirely on local infrastructure without internet access. Our approach demonstrates that organizations can achieve competitive accuracy on complex queries across legal, scientific, and conversational documents while keeping all data on their machines. By balancing two complementary retrieval strategies and using consumer-grade hardware acceleration, the system delivers reliable answers with minimal errors, letting banks, hospitals, and law firms adopt conversational document AI without transmitting proprietary information to external providers. This work establishes that privacy and performance need not be mutually exclusive in enterprise AI deployment.
[40] Rectify Evaluation Preference: Improving LLMs' Critique on Math Reasoning via Perplexity-aware Reinforcement Learning
Changyuan Tian,Zhicong Lu,Shuang Qian,Nayu Liu,Peiguang Li,Li Jin,Leiyi Hu,Zhizhao Zeng,Sirui Wang,Ke Zeng,Zhi Guo
Main category: cs.CL
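TL;DR: This paper proposes a perplexity-aware reinforcement learning algorithm that corrects the imbalanced evaluation preference of LLMs in multi-step mathematical reasoning, improving their critiquing capability.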
Details
Motivation: Existing methods mainly rely on high-quality supervised fine-tuning demonstrations to improve critiquing capability, overlooking the underlying cause of LLMs' poor critiquing performance. This paper investigates and addresses that cause. Method: Builds a One-to-many Problem-Solution (OPS) benchmark, conducts a statistical preference analysis, and proposes a perplexity-aware Group Relative Policy Optimization algorithm to rectify the evaluation preference. Result: Experiments on the self-built OPS benchmark and existing critic benchmarks show that the method improves the critiquing capability of LLMs. Conclusion: Identifying and rectifying the imbalanced evaluation preference of LLMs can markedly improve their critiquing capability on multi-step mathematical reasoning tasks. Abstract: To improve Multi-step Mathematical Reasoning (MsMR) of Large Language Models (LLMs), it is crucial to obtain scalable supervision from the corpus by automatically critiquing mistakes in the reasoning process of MsMR and rendering a final verdict of the problem-solution. Most existing methods rely on crafting high-quality supervised fine-tuning demonstrations for critiquing capability enhancement and pay little attention to delving into the underlying reason for the poor critiquing performance of LLMs. In this paper, we orthogonally quantify and investigate the potential reason -- imbalanced evaluation preference, and conduct a statistical preference analysis. Motivated by the analysis of the reason, a novel perplexity-aware reinforcement learning algorithm is proposed to rectify the evaluation preference, elevating the critiquing capability. Specifically, to probe into LLMs' critiquing characteristics, a One-to-many Problem-Solution (OPS) benchmark is meticulously constructed to quantify the behavior difference of LLMs when evaluating the problem solutions generated by itself and others. Then, to investigate the behavior difference in depth, we conduct a statistical preference analysis oriented on perplexity and find an intriguing phenomenon -- "LLMs incline to judge solutions with lower perplexity as correct", which is dubbed as imbalanced evaluation preference. To rectify this preference, we regard perplexity as the baton in the algorithm of Group Relative Policy Optimization, supporting the LLMs to explore trajectories that judge lower perplexity as wrong and higher perplexity as correct.
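For readers unfamiliar with the perplexity signal this abstract relies on, the sketch below computes perplexity from per-token log-probabilities using the standard definition (the exponential of the mean negative log-likelihood). The log-probability values are made up for illustration, and nothing here reproduces the paper's reward design or its GRPO variant.

```python
import math


def perplexity(token_logprobs):
    """Perplexity from per-token natural-log probabilities.

    Defined as exp(-(1/N) * sum(log p_i)); lower values mean the model
    found the sequence more predictable.
    """
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    avg_nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(avg_nll)


# Hypothetical log-probs a critic model might assign to two candidate solutions.
fluent_solution = [math.log(0.5)] * 4      # each token fairly likely
surprising_solution = [math.log(0.1)] * 4  # each token unlikely

print(perplexity(fluent_solution))      # close to 2
print(perplexity(surprising_solution))  # close to 10
```

Under the preference described in the abstract, a critic would be inclined to mark the first (lower-perplexity) solution as correct regardless of its actual validity.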
Extensive experimental results on our OPS benchmark and existing critic benchmarks demonstrate the validity of our method.
[41] BhashaKritika: Building Synthetic Pretraining Data at Scale for Indic Languages
Guduru Manoj,Neel Prabhanjan Rachamalla,Ashish Kulkarni,Gautam Rajeev,Jay Piplodiya,Arul Menezes,Shaharukh Khan,Souvik Rana,Manya Sah,Chandra Khatri,Shubham Agarwal
Main category: cs.CL
TL;DR: 本研究系统探讨了印度语言的合成多语言预训练数据生成与评估,构建了包含5400亿token的大规模合成数据集BhashaKritika,并提出了一套可扩展、语言敏感的质量评估框架,揭示了不同生成策略的关键权衡与最佳实践。
Details
Motivation: In LLM pretraining, data scarcity for low-resource languages leads to uneven model development, so synthetic data is needed to improve model performance for low-resource languages such as those of India. Method: Generates synthetic data for 10 Indic languages using 5 techniques to build the BhashaKritika dataset (540B tokens), grounding generation in documents, personas, and topics; introduces a modular quality evaluation pipeline with script and language detection, metadata consistency checks, n-gram repetition analysis, and KenLM-based perplexity filtering. Result: Experiments reveal key trade-offs among generation strategies: native generation outperforms content translated from English, and language choice (in prompts and document grounding) significantly affects data quality; the evaluation framework supports quality control across languages and scripts. Conclusion: Synthetic data is an effective way to improve LLM performance for low-resource Indic languages; combining sound generation strategies with rigorous quality evaluation yields high-quality multilingual pretraining corpora. Abstract: In the context of pretraining of Large Language Models (LLMs), synthetic data has emerged as an alternative for generating high-quality pretraining data at scale. This is particularly beneficial in low-resource language settings where the benefits of recent LLMs have been unevenly distributed across languages. In this work, we present a systematic study on the generation and evaluation of synthetic multilingual pretraining data for Indic languages, where we construct a large-scale synthetic dataset BhashaKritika, comprising 540B tokens using 5 different techniques for 10 languages. We explore the impact of grounding generation in documents, personas, and topics. We analyze how language choice, both in the prompt instructions and document grounding, affects data quality, and we compare translations of English content with native generation in Indic languages. To support scalable and language-sensitive evaluation, we introduce a modular quality evaluation pipeline that integrates script and language detection, metadata consistency checks, n-gram repetition analysis, and perplexity-based filtering using KenLM models. Our framework enables robust quality control across diverse scripts and linguistic contexts. Empirical results through model runs reveal key trade-offs in generation strategies and highlight best practices for constructing effective multilingual corpora.
[42] Knowledge Graphs Generation from Cultural Heritage Texts: Combining LLMs and Ontological Engineering for Scholarly Debates
Andrea Schimmenti,Valentina Pasqual,Fabio Vitali,Marieke van Erp
Main category: cs.CL
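TL;DR: This paper proposes ATR4CH, a systematic five-step methodology for LLM-driven knowledge extraction from Cultural Heritage texts, and validates it in a case study on authenticity assessment debates.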
Details
Motivation: Cultural Heritage texts contain rich but unstructured knowledge that is hard to query systematically, so effective methods are needed to convert it into structured knowledge graphs. Method: Proposes ATR4CH, which combines annotation models, ontological frameworks, and LLM-based knowledge extraction, converting text to RDF through five steps: foundational analysis, annotation schema development, pipeline architecture, integration refinement, and comprehensive evaluation. Result: Validated on Wikipedia articles about disputed items: F1 of 0.96-0.99 for metadata extraction, 0.7-0.8 for entity recognition, 0.65-0.75 for hypothesis extraction, 0.95-0.97 for evidence extraction, and a G-EVAL score of 0.62 for discourse representation; smaller models performed competitively, supporting low-cost deployment. Conclusion: ATR4CH is the first knowledge extraction framework to systematically combine LLMs with Cultural Heritage ontologies; it is replicable and adaptable across domains, helping Cultural Heritage institutions structure knowledge and automate discovery. Abstract: Cultural Heritage texts contain rich knowledge that is difficult to query systematically due to the challenges of converting unstructured discourse into structured Knowledge Graphs (KGs). This paper introduces ATR4CH (Adaptive Text-to-RDF for Cultural Heritage), a systematic five-step methodology for Large Language Model-based Knowledge Extraction from Cultural Heritage documents. We validate the methodology through a case study on authenticity assessment debates. Methodology - ATR4CH combines annotation models, ontological frameworks, and LLM-based extraction through iterative development: foundational analysis, annotation schema development, pipeline architecture, integration refinement, and comprehensive evaluation. We demonstrate the approach using Wikipedia articles about disputed items (documents, artifacts...), implementing a sequential pipeline with three LLMs (Claude Sonnet 3.7, Llama 3.3 70B, GPT-4o-mini). Findings - The methodology successfully extracts complex Cultural Heritage knowledge: 0.96-0.99 F1 for metadata extraction, 0.7-0.8 F1 for entity recognition, 0.65-0.75 F1 for hypothesis extraction, 0.95-0.97 for evidence extraction, and 0.62 G-EVAL for discourse representation. Smaller models performed competitively, enabling cost-effective deployment. Originality - This is the first systematic methodology for coordinating LLM-based extraction with Cultural Heritage ontologies. ATR4CH provides a replicable framework adaptable across CH domains and institutional resources. Research Limitations - The produced KG is limited to Wikipedia articles. While the results are encouraging, human oversight is necessary during post-processing. Practical Implications - ATR4CH enables Cultural Heritage institutions to systematically convert textual knowledge into queryable KGs, supporting automated metadata enrichment and knowledge discovery.
[43] TruthfulRAG: Resolving Factual-level Conflicts in Retrieval-Augmented Generation with Knowledge Graphs
Shuyi Liu,Yuming Shang,Xi Zhang
Main category: cs.CL
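TL;DR: Proposes TruthfulRAG, a framework that uses knowledge graphs to resolve, at the factual level, conflicts between externally retrieved information and an LLM's internal knowledge in retrieval-augmented generation (RAG) systems.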
Details
Motivation: Existing RAG systems mostly handle conflicts between retrieved external information and LLM internal knowledge at the token or semantic level, which makes it hard to fully identify and resolve factual contradictions and harms the accuracy and reliability of generated content. Method: Builds a knowledge graph by extracting triples from retrieved content, uses query-based graph retrieval to obtain relevant knowledge, and applies an entropy-based filtering mechanism to pinpoint conflicting elements, thereby identifying and mitigating factual-level knowledge conflicts. Result: Experiments show that TruthfulRAG outperforms existing methods on multiple knowledge-intensive tasks, reducing knowledge conflicts and improving the accuracy and robustness of RAG systems. Conclusion: TruthfulRAG is the first framework to resolve knowledge conflicts in RAG at the factual level; introducing knowledge graphs markedly improves the faithfulness and trustworthiness of generated content. Abstract: Retrieval-Augmented Generation (RAG) has emerged as a powerful framework for enhancing the capabilities of Large Language Models (LLMs) by integrating retrieval-based methods with generative models. As external knowledge repositories continue to expand and the parametric knowledge within models becomes outdated, a critical challenge for RAG systems is resolving conflicts between retrieved external information and LLMs' internal knowledge, which can significantly compromise the accuracy and reliability of generated content. However, existing approaches to conflict resolution typically operate at the token or semantic level, often leading to fragmented and partial understanding of factual discrepancies between LLMs' knowledge and context, particularly in knowledge-intensive tasks. To address this limitation, we propose TruthfulRAG, the first framework that leverages Knowledge Graphs (KGs) to resolve factual-level knowledge conflicts in RAG systems. Specifically, TruthfulRAG constructs KGs by systematically extracting triples from retrieved content, utilizes query-based graph retrieval to identify relevant knowledge, and employs entropy-based filtering mechanisms to precisely locate conflicting elements and mitigate factual inconsistencies, thereby enabling LLMs to generate faithful and accurate responses. Extensive experiments reveal that TruthfulRAG outperforms existing methods, effectively alleviating knowledge conflicts and improving the robustness and trustworthiness of RAG systems.
[44] Position: On the Methodological Pitfalls of Evaluating Base LLMs for Reasoning
Jason Chan,Zhixue Zhao,Robert Gaizauskas
Main category: cs.CL
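TL;DR: This paper discusses methodological problems in evaluating the reasoning capabilities of base LLMs, pointing to a fundamental mismatch between their pretraining objective and the criteria by which reasoning is assessed.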
Details
Motivation: Existing studies often overlook methodological flaws in evaluating base LLMs for reasoning; this paper aims to expose these overlooked problems. Method: Analyzes the relationship between base LLMs' pretraining objective and their outputs, showing that the conclusions they generate are byproducts of statistical language patterns rather than results of genuine reasoning. Result: Finds that base LLMs' outputs cannot be treated as valid reflections of their true reasoning ability, and that conclusions about them do not readily generalize to post-trained LLMs. Conclusion: Calls for a critical re-examination of work that relies on these assumptions and recommends that future research avoid these methodological pitfalls. Abstract: Existing work investigates the reasoning capabilities of large language models (LLMs) to uncover their limitations, human-like biases and underlying processes. Such studies include evaluations of base LLMs (pre-trained on unlabeled corpora only) for this purpose. Our position paper argues that evaluating base LLMs' reasoning capabilities raises inherent methodological concerns that are overlooked in such existing studies. We highlight the fundamental mismatch between base LLMs' pretraining objective and normative qualities, such as correctness, by which reasoning is assessed. In particular, we show how base LLMs generate logically valid or invalid conclusions as coincidental byproducts of conforming to purely linguistic patterns of statistical plausibility. This fundamental mismatch challenges the assumptions that (a) base LLMs' outputs can be assessed as their bona fide attempts at correct answers or conclusions; and (b) conclusions about base LLMs' reasoning can generalize to post-trained LLMs optimized for successful instruction-following. We call for a critical re-examination of existing work that relies implicitly on these assumptions, and for future work to account for these methodological pitfalls.
[45] DELICATE: Diachronic Entity LInking using Classes And Temporal Evidence
Cristian Santini,Sebastian Barzaghi,Paolo Sernani,Emanuele Frontoni,Mehwish Alam
Main category: cs.CL
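TL;DR: This paper presents DELICATE, a neuro-symbolic method that combines a BERT-based encoder with contextual information from Wikidata for entity linking in historical Italian texts, and releases ENEIDE, a multi-domain corpus. Experiments show the method outperforms much larger models on niche historical texts and is more interpretable.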
Details
Motivation: Entity linking in the humanities is challenging because of complex document typologies, a lack of domain-specific datasets and models, and long-tail entities in knowledge bases. Method: Proposes DELICATE, which combines a BERT-based encoder with contextual information from Wikidata, such as temporal plausibility and entity type consistency, for entity linking; builds the ENEIDE corpus, covering literary and political texts from the 19th and 20th centuries. Result: DELICATE outperforms other models on historical Italian entity linking, including models with far more parameters, and analyses of its confidence scores and feature sensitivity indicate that it is more explainable and interpretable. Conclusion: Combining neural and symbolic methods, DELICATE improves the accuracy and interpretability of entity linking in historical texts, and the ENEIDE corpus is a valuable resource for related research. Abstract: In spite of the remarkable advancements in the field of Natural Language Processing, the task of Entity Linking (EL) remains challenging in the field of humanities due to complex document typologies, lack of domain-specific datasets and models, and long-tail entities, i.e., entities under-represented in Knowledge Bases (KBs). The goal of this paper is to address these issues with two main contributions. The first contribution is DELICATE, a novel neuro-symbolic method for EL on historical Italian which combines a BERT-based encoder with contextual information from Wikidata to select appropriate KB entities using temporal plausibility and entity type consistency. The second contribution is ENEIDE, a multi-domain EL corpus in historical Italian semi-automatically extracted from two annotated editions spanning from the 19th to the 20th century and including literary and political texts. Results show how DELICATE outperforms other EL models in historical Italian even if compared with larger architectures with billions of parameters. Moreover, further analyses reveal how DELICATE confidence scores and feature sensitivity provide results which are more explainable and interpretable than purely neural methods.
[46] Analogical Structure, Minimal Contextual Cues and Contrastive Distractors: Input Design for Sample-Efficient Linguistic Rule Induction
Chunyang Jiang,Paola Merlo
Main category: cs.CL
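TL;DR: Proposes a cognitively inspired analogical paradigm organization that lets lightweight models outperform large-scale models on linguistic rule learning tasks with only a small amount of data.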
Details
Motivation: Explores whether analogical paradigm organization can let lightweight models match the performance of large models with minimal data, reducing reliance on large-scale training data. Method: Combines three cognitively inspired principles (analogical structure, contrastive learning, and minimal contextual cues) into a computational approach and trains lightweight models (BERT+CNN, 0.5M parameters) on structured completion tasks. Result: Trained on only 100 examples of English causative/inchoative alternations, the model reaches F1 = 0.95, exceeding zero-shot GPT-o3 (F1 = 0.87); ablations show that analogical and contrastive structure help; cross-phenomenon validation indicates robustness. Conclusion: Analogical paradigm organization enables efficient linguistic rule learning, matching or exceeding large-model performance with orders of magnitude less data. Abstract: Large language models achieve strong performance through training on vast datasets. Can analogical paradigm organization enable lightweight models to match this performance with minimal data? We develop a computational approach implementing three cognitive-inspired principles: analogical structure, contrastive learning, and minimal contextual cues. We test this approach with structured completion tasks where models identify correct sentence completions from analogical patterns with contrastive alternatives. Training lightweight models (BERT+CNN, 0.5M parameters) on only one hundred structured examples of English causative/inchoative alternations achieves F1 = 0.95, outperforming zero-shot GPT-o3 (F1 = 0.87). Ablation studies confirm that analogical organization and contrastive structure improve performance, consistently surpassing randomly shuffled baselines across architectures. Cross-phenomenon validation using unspecified object alternations replicates these efficiency gains, confirming approach robustness. Our results show that analogical paradigm organization enables competitive linguistic rule learning with orders of magnitude less data than conventional approaches require.
[47] Reasoning About Intent for Ambiguous Requests
Irina Saparina,Mirella Lapata
Main category: cs.CL
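TL;DR: Proposes handling ambiguous requests by generating multiple interpretation-answer pairs, trained with reinforcement learning and customized reward functions; the method covers more valid answers than baselines.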
Details
Motivation: When faced with an ambiguous request, LLMs usually commit implicitly to one interpretation, which easily leads to misunderstood intent, user frustration, and safety risks. Method: Uses reinforcement learning with customized reward functions, with multiple valid answers as supervision, to train models to produce structured responses containing several interpretations and their corresponding answers. Result: Experiments on conversational question answering and semantic parsing show that the method covers more valid answers than baseline approaches; human evaluation shows that predicted interpretations are highly aligned with their answers. Conclusion: The method improves transparency through explicit interpretations, remains efficient by requiring only one generation step, and supports downstream applications through structured output. Abstract: Large language models often respond to ambiguous requests by implicitly committing to one interpretation. Intent misunderstandings can frustrate users and create safety risks. To address this, we propose generating multiple interpretation-answer pairs in a single structured response to ambiguous requests. Our models are trained with reinforcement learning and customized reward functions using multiple valid answers as supervision. Experiments on conversational question answering and semantic parsing demonstrate that our method achieves higher coverage of valid answers than baseline approaches. Human evaluation confirms that predicted interpretations are highly aligned with their answers. Our approach promotes transparency with explicit interpretations, achieves efficiency by requiring only one generation step, and supports downstream applications through its structured output format.
[48] Exploring State Tracking Capabilities of Large Language Models
Kiamehr Rezaee,Jose Camacho-Collados,Mohammad Taher Pilehvar
Main category: cs.CL
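TL;DR: This paper studies how LLMs perform on state tracking and proposes a benchmark based on three well-defined tasks. The results show that recent models (such as GPT-4 and Llama3) track state effectively when combined with mechanisms such as Chain of Thought, whereas earlier-generation models often fail after several steps.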
Details
Motivation: To assess LLMs on tasks that require reasoning and state maintenance, in particular by isolating the state tracking component. Method: Designs a benchmark of three well-defined state tracking tasks and systematically analyzes the performance of different LLMs across scenarios. Result: Recent models (such as GPT-4 and Llama3) show strong state tracking ability, especially with Chain of Thought; earlier-generation models understand the task and succeed at first but tend to fail after more steps. Conclusion: State tracking remains challenging for LLMs, with a clear performance gap between model generations, suggesting that long-horizon reasoning and state retention need further improvement. Abstract: Large Language Models (LLMs) have demonstrated impressive capabilities in solving complex tasks, including those requiring a certain level of reasoning. In this paper, we focus on state tracking, a problem where models need to keep track of the state governing a number of entities. To isolate the state tracking component from other factors, we propose a benchmark based on three well-defined state tracking tasks and analyse the performance of LLMs in different scenarios. The results indicate that the recent generation of LLMs (specifically, GPT-4 and Llama3) are capable of tracking state, especially when integrated with mechanisms such as Chain of Thought. However, models from the former generation, while understanding the task and being able to solve it at the initial stages, often fail at this task after a certain number of steps.
[49] LocalBench: Benchmarking LLMs on County-Level Local Knowledge and Reasoning
Zihan Gao,Yifei Xu,Jacob Thebault-Spieker
Main category: cs.CL
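TL;DR: This paper presents LocalBench, the first benchmark to systematically evaluate LLMs on county-level local knowledge across the United States, revealing serious shortcomings of current models in handling hyper-local knowledge.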
Details
Motivation: As real-world applications increasingly require AI to understand community-specific dynamics, existing benchmarks fail to capture the complexity of fine-grained local knowledge, so a more precise evaluation tool is needed. Method: Grounded in the Localness Conceptual Framework, builds the LocalBench dataset of 14,782 validated question-answer pairs covering 526 counties in 49 states, drawing on Census statistics, local subreddit discourse, and regional news, and evaluates 13 state-of-the-art LLMs in closed-book and web-augmented settings across physical, cognitive, and relational dimensions. Result: Even the best-performing models reach only 56.8% accuracy on narrative-style questions and below 15.5% on numerical reasoning; larger models and web retrieval do not always help: search improves Gemini by +13.6% but reduces GPT-series performance by -11.4%. Conclusion: Current LLMs have significant limitations in handling hyper-local knowledge; geography-aware AI systems that understand local context fairly and precisely are needed. Abstract: Large language models (LLMs) have been widely evaluated on macro-scale geographic tasks, such as global factual recall, event summarization, and regional reasoning. Yet, their ability to handle hyper-local knowledge remains poorly understood. This gap is increasingly consequential as real-world applications, from civic platforms to community journalism, demand AI systems that can reason about neighborhood-specific dynamics, cultural narratives, and local governance. Existing benchmarks fall short in capturing this complexity, often relying on coarse-grained data or isolated references. We present LocalBench, the first benchmark designed to systematically evaluate LLMs on county-level local knowledge across the United States. Grounded in the Localness Conceptual Framework, LocalBench includes 14,782 validated question-answer pairs across 526 U.S. counties in 49 states, integrating diverse sources such as Census statistics, local subreddit discourse, and regional news. It spans physical, cognitive, and relational dimensions of locality. Using LocalBench, we evaluate 13 state-of-the-art LLMs under both closed-book and web-augmented settings. Our findings reveal critical limitations: even the best-performing models reach only 56.8% accuracy on narrative-style questions and perform below 15.5% on numerical reasoning. Moreover, larger model size and web augmentation do not guarantee better performance; for example, search improves Gemini's accuracy by +13.6%, but reduces GPT-series performance by -11.4%. These results underscore the urgent need for language models that can support equitable, place-aware AI systems: capable of engaging with the diverse, fine-grained realities of local communities across geographic and cultural contexts.
[50] Beyond Elicitation: Provision-based Prompt Optimization for Knowledge-Intensive Tasks
Yunzhe Xu,Zhuosheng Zhang,Zhe Liu
Main category: cs.CL
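TL;DR: Proposes Knowledge-Provision-based Prompt Optimization (KPPO), which systematically integrates knowledge rather than eliciting latent model capability, markedly improving language model performance on knowledge-intensive tasks.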
Details
Motivation: Existing prompt optimization methods mainly rely on eliciting capabilities the model already has, which cannot meet the needs of knowledge-intensive tasks for factual knowledge, terminology precision, and reasoning patterns. Method: The KPPO framework has three innovations: a knowledge gap filling mechanism, a batch-wise candidate evaluation approach (weighing both performance improvement and distributional stability), and an adaptive knowledge pruning strategy (balancing performance and token efficiency). Result: On 15 knowledge-intensive benchmarks across domains, KPPO improves on the strongest baseline by about 6% on average with comparable or lower token consumption, reducing token usage by up to 29%. Conclusion: KPPO shifts prompt optimization from capability elicitation to knowledge provision, improving language model performance in specialized domains while remaining efficient and practical. Abstract: While prompt optimization has emerged as a critical technique for enhancing language model performance, existing approaches primarily focus on elicitation-based strategies that search for optimal prompts to activate models' capabilities. These methods exhibit fundamental limitations when addressing knowledge-intensive tasks, as they operate within fixed parametric boundaries rather than providing the factual knowledge, terminology precision, and reasoning patterns required in specialized domains. To address these limitations, we propose Knowledge-Provision-based Prompt Optimization (KPPO), a framework that reformulates prompt optimization as systematic knowledge integration rather than potential elicitation. KPPO introduces three key innovations: 1) a knowledge gap filling mechanism for knowledge gap identification and targeted remediation; 2) a batch-wise candidate evaluation approach that considers both performance improvement and distributional stability; 3) an adaptive knowledge pruning strategy that balances performance and token efficiency, reducing up to 29% token usage. Extensive evaluation on 15 knowledge-intensive benchmarks from various domains demonstrates KPPO's superiority over elicitation-based methods, with an average performance improvement of ~6% over the strongest baseline while achieving comparable or lower token consumption. Code at: https://github.com/xyz9911/KPPO.
[51] Rubric-Based Benchmarking and Reinforcement Learning for Advancing LLM Instruction Following
Yun He,Wenzhe Li,Hejia Zhang,Songlin Li,Karishma Mandyam,Sopan Khosla,Yuanhao Xiong,Nanshu Wang,Selina Peng,Beibin Li,Shengjie Bi,Shishir G. Patil,Qi Qi,Shengyu Feng,Julian Katz-Samuels,Richard Yuanzhe Pang,Sujan Gonugondla,Hunter Lang,Yue Yu,Yundi Qian,Maryam Fazel-Zarandi,Licheng Yu,Amine Benhalloum,Hany Awadalla,Manaal Faruqui
Main category: cs.CL
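TL;DR: This paper proposes the AdvancedIF benchmark and the RIFL training method, which use rubric-based reinforcement learning to substantially improve large language models on complex instruction-following tasks.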
Details
Motivation: Large language models still struggle with complex, multi-turn, and system-level instruction following, and the field lacks high-quality human-annotated benchmarks and reliable reward signals for evaluating and training these capabilities. Method: The authors propose the AdvancedIF benchmark, containing over 1,600 prompts and expert-designed rubrics, and the RIFL training framework, which combines rubric generation, a fine-tuned rubric verifier, and reward shaping to enable rubric-based reinforcement learning. Result: Experiments show that RIFL achieves a 6.7% absolute gain on AdvancedIF and performs strongly on public benchmarks; ablation studies confirm the effectiveness of each component. Conclusion: Rubrics can serve as a powerful tool for both training and evaluating advanced instruction following in large language models, opening a new path toward more capable and reliable AI systems. Abstract: Recent progress in large language models (LLMs) has led to impressive performance on a range of tasks, yet advanced instruction following (IF), especially for complex, multi-turn, and system-prompted instructions, remains a significant challenge. Rigorous evaluation and effective training for such capabilities are hindered by the lack of high-quality, human-annotated benchmarks and reliable, interpretable reward signals. In this work, we introduce AdvancedIF (we will release this benchmark soon), a comprehensive benchmark featuring over 1,600 prompts and expert-curated rubrics that assess LLMs' ability to follow complex, multi-turn, and system-level instructions. We further propose RIFL (Rubric-based Instruction-Following Learning), a novel post-training pipeline that leverages rubric generation, a finetuned rubric verifier, and reward shaping to enable effective reinforcement learning for instruction following. Extensive experiments demonstrate that RIFL substantially improves the instruction-following abilities of LLMs, achieving a 6.7% absolute gain on AdvancedIF and strong results on public benchmarks. Our ablation studies confirm the effectiveness of each component in RIFL. This work establishes rubrics as a powerful tool for both training and evaluating advanced IF in LLMs, paving the way for more capable and reliable AI systems.
[52] LOCA-R: Near-Perfect Performance on the Chinese Physics Olympiad 2025
Dong-Shan Jian,Xiang Li,Chen-Xu Yan,Hui-Wen Zheng,Zhi-Zhang Bian,You-Le Fang,Sheng-Qi Zhang,Bing-Rui Gong,Ren-Xi He,Jing-Tian Zhang,Ce Meng,Yan-Qing Ma
Main category: cs.CL
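TL;DR: This paper proposes LOCA-R (LOgical Chain Augmentation for Reasoning), an improved framework for complex reasoning, and applies it to the theory examination of the 2025 Chinese Physics Olympiad, where it scores a near-perfect 313 out of 320 points, outperforming the top human competitor and all baseline methods.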
Details
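Motivation: Physics olympiad problems place extremely high demands on computational precision, abstract reasoning, and understanding of physical principles, making them an ideal testbed for advanced AI capabilities. Existing methods still fall short on reasoning tasks of this complexity, so a stronger reasoning framework is needed. Method: The authors propose and implement the LOCA-R framework, an improvement on LOCA that strengthens the construction of logical chains and the reasoning over them, to better handle the multi-step reasoning and knowledge integration found in complex physics problems. Result: LOCA-R scores 313/320 on the CPhO 2025 theory examination, surpassing the highest-scoring human competitor and significantly outperforming all baseline models. Conclusion: LOCA-R performs exceptionally well on highly difficult physics reasoning tasks, demonstrating its potential for complex scientific problem solving and offering an effective path for AI to tackle olympiad-level science problems. Abstract: Olympiad-level physics problem-solving presents a significant challenge for both humans and artificial intelligence (AI), as it requires a sophisticated integration of precise calculation, abstract reasoning, and a fundamental grasp of physical principles. The Chinese Physics Olympiad (CPhO), renowned for its complexity and depth, serves as an ideal and rigorous testbed for these advanced capabilities. In this paper, we introduce LOCA-R (LOgical Chain Augmentation for Reasoning), an improved version of the LOCA framework adapted for complex reasoning, and apply it to the CPhO 2025 theory examination. LOCA-R achieves a near-perfect score of 313 out of 320 points, solidly surpassing the highest-scoring human competitor and significantly outperforming all baseline methods.
[53] Say It Differently: Linguistic Styles as Jailbreak Vectors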
Srikant Panda,Avinash Rai
Main category: cs.CL
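TL;DR: This study systematically examines how linguistic styles (such as fear or curiosity) can reframe harmful intent and bypass the safety constraints of aligned models. It builds a style-augmented jailbreak benchmark covering 11 linguistic styles and finds that stylistic reframing raises jailbreak success rates by up to 57 percentage points. The most effective styles are fearful, curious, and compassionate, and contextualized rewrites outperform templated variants. As a mitigation, the authors propose a style neutralization method that preprocesses inputs with a secondary LLM, which effectively lowers jailbreak success rates. The findings expose a widespread vulnerability in current safety mechanisms that does not disappear as models scale.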
Details
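Motivation: Prior work focuses mostly on robustness to semantically equivalent or paraphrased prompts and overlooks variation in linguistic style as a potential attack surface; this paper investigates how different linguistic styles can be used to bypass the safety alignment of large language models. Method: Using handcrafted templates and LLM-based rewrites, prompts from three standard datasets are transformed into 11 distinct linguistic styles to build a style-augmented jailbreak benchmark; the effect of stylistic reframing on jailbreak success is evaluated on 16 open- and closed-source instruction-tuned models; and a style neutralization preprocessing method based on a secondary LLM is proposed to strip manipulative stylistic cues from inputs. Result: Stylistic reframing raises jailbreak success rates by up to 57 percentage points, with fearful, curious, and compassionate styles the most effective and contextualized rewrites outperforming templated ones; the proposed style neutralization preprocessing significantly reduces jailbreak success rates. Conclusion: Linguistic style is an overlooked but serious safety vulnerability; current safety alignment methods are not robust to this kind of stylistic manipulation, and new mechanisms such as style neutralization are needed to strengthen model defenses. Abstract: Large Language Models (LLMs) are commonly evaluated for robustness against paraphrased or semantically equivalent jailbreak prompts, yet little attention has been paid to linguistic variation as an attack surface. In this work, we systematically study how linguistic styles such as fear or curiosity can reframe harmful intent and elicit unsafe responses from aligned models. We construct a style-augmented jailbreak benchmark by transforming prompts from 3 standard datasets into 11 distinct linguistic styles using handcrafted templates and LLM-based rewrites, while preserving semantic intent. Evaluating 16 open- and closed-source instruction-tuned models, we find that stylistic reframing increases jailbreak success rates by up to +57 percentage points. Styles such as fearful, curious and compassionate are most effective and contextualized rewrites outperform templated variants. To mitigate this, we introduce a style neutralization preprocessing step using a secondary LLM to strip manipulative stylistic cues from user inputs, significantly reducing jailbreak success rates. Our findings reveal a systemic and scaling-resistant vulnerability overlooked in current safety pipelines.
[54] Convomem Benchmark: Why Your First 150 Conversations Don't Need RAG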
Egor Pakhomov,Erik Nijkamp,Caiming Xiong
Main category: cs.CL
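TL;DR: The paper introduces a conversational memory evaluation benchmark containing 75,336 question-answer pairs across multiple categories, addressing the limitations of existing evaluation frameworks in statistical power, data generation consistency, and evaluation flexibility. It shows that simple full-context approaches outperform sophisticated RAG-based memory systems on histories of fewer than 150 conversations, revealing the advantage conversational memory has on small corpora and suggesting that it deserves dedicated study rather than direct reuse of general RAG solutions.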
Details
Motivation: Existing conversational memory benchmarks fall short in statistical power, data consistency, and evaluation flexibility, and the relationship between conversational memory and retrieval-augmented generation (RAG) is poorly understood; a more comprehensive benchmark and a deeper analysis of memory's distinctive properties and applicable scenarios are needed. Method: The authors build a large-scale, multi-category conversational memory benchmark and compare full-context models against RAG-based memory systems (such as Mem0) across conversation history lengths, drawing on research on long-context effectiveness to identify performance transition points and discuss architectural differences. Result: Simple full-context approaches reach 70-82% accuracy on histories of up to 150 conversations, outperforming RAG systems such as Mem0, which reach only 30-45%; 30 conversations marks a performance transition point, long context remains viable up to 150 conversations, and beyond that hybrid or RAG approaches are required. Conclusion: Because conversational memory starts from zero and grows progressively, it enjoys an exploitable "small-corpus advantage" at small scale that supports efficient exhaustive search and reranking, and it should be studied in its own right rather than by copying conventional RAG methods. Abstract: We introduce a comprehensive benchmark for conversational memory evaluation containing 75,336 question-answer pairs across diverse categories including user facts, assistant recall, abstention, preferences, temporal changes, and implicit connections. While existing benchmarks have advanced the field, our work addresses fundamental challenges in statistical power, data generation consistency, and evaluation flexibility that limit current memory evaluation frameworks. We examine the relationship between conversational memory and retrieval-augmented generation (RAG). While these systems share fundamental architectural patterns--temporal reasoning, implicit extraction, knowledge updates, and graph representations--memory systems have a unique characteristic: they start from zero and grow progressively with each conversation. This characteristic enables naive approaches that would be impractical for traditional RAG. Consistent with recent findings on long context effectiveness, we observe that simple full-context approaches achieve 70-82% accuracy even on our most challenging multi-message evidence cases, while sophisticated RAG-based memory systems like Mem0 achieve only 30-45% when operating on conversation histories under 150 interactions. Our analysis reveals practical transition points: long context excels for the first 30 conversations, remains viable with manageable trade-offs up to 150 conversations, and typically requires hybrid or RAG approaches beyond that point as costs and latencies become prohibitive. These patterns indicate that the small-corpus advantage of conversational memory--where exhaustive search and complete reranking are feasible--deserves dedicated research attention rather than simply applying general RAG solutions to conversation histories.
[55] Computing the Formal and Institutional Boundaries of Contemporary Genre and Literary Fiction
Natasha Johnson
Main category: cs.CL
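TL;DR: This study uses computational methods to analyze the formal and institutional characteristics of literary genres, finds significant formal markers for each literary category, and reveals that female authorship faces a blurred target in attaining literary status.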
Details
Motivation: The study examines the soundness of genre as a formal versus an institutional designation, in particular whether the distinction in contemporary fiction between literary fiction and genre fiction (such as romance, mystery, and science fiction) has a formal basis. Method: A corpus of literary and genre fiction is built from Andrew Piper's CONLIT dataset; Welch's ANOVA is used to compare how author gender affects the distribution of narrative features across genres, logistic regression models the effect of each feature on literary classification, and stylistic and semantic vector representations are analyzed. Result: Each literary category has statistically significant formal markers, and author gender moderates literary classification: works by female authors are formally harder to classify as "literary fiction". Conclusion: Genre is shaped by institutional factors but also has identifiable formal features; female authors nevertheless face greater uncertainty in pursuing literary status, and the formal features of their work may be systematically marginalized. Abstract: Though the concept of genre has been a subject of discussion for millennia, the relatively recent emergence of genre fiction has added a new layer to this ongoing conversation. While more traditional perspectives on genre have emphasized form, contemporary scholarship has invoked both formal and institutional characteristics in its taxonomy of genre, genre fiction, and literary fiction. This project uses computational methods to explore the soundness of genre as a formal designation as opposed to an institutional one. Pulling from Andrew Piper's CONLIT dataset of Contemporary Literature, we assemble a corpus of literary and genre fiction, with the latter category containing romance, mystery, and science fiction novels. We use Welch's ANOVA to compare the distribution of narrative features according to author gender within each genre and within genre versus literary fiction. Then, we use logistic regression to model the effect that each feature has on literary classification and to measure how author gender moderates these effects. Finally, we analyze stylistic and semantic vector representations of our genre categories to understand the importance of form and content in literary classification. This project finds statistically significant formal markers of each literary category and illustrates how female authorship narrows and blurs the target for achieving literary status.
[56] URaG: Unified Retrieval and Generation in Multimodal LLMs for Efficient Long Document Understanding
Yongxin Shi,Jiapeng Wang,Zeyu Shan,Dezhi Peng,Zening Lin,Lianwen Jin
Main category: cs.CL
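TL;DR: The paper proposes URaG, a framework that unifies retrieval and generation: within a multimodal large language model, the early layers perform cross-modal retrieval to select key evidence pages, improving the efficiency and accuracy of long document understanding.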
Details
Motivation: Existing methods for long documents suffer from information interference and high computational cost, while compression loses detail and external retrievers add complexity. The authors aim to exploit the evidence localization ability inside MLLMs to retrieve efficiently during reasoning. Method: Building on the observed coarse-to-fine reasoning pattern of MLLMs, they design a lightweight cross-modal retrieval module that turns the early Transformer layers into an evidence selector, filtering for the most relevant pages so that the deeper layers can focus on them. Result: Experiments show that URaG achieves state-of-the-art performance on multiple tasks while reducing computational overhead by 44-56%. Conclusion: URaG successfully unifies retrieval and generation within a single MLLM, improving both the efficiency and accuracy of long document understanding, with good practicality and room for further optimization. Abstract: Recent multimodal large language models (MLLMs) still struggle with long document understanding due to two fundamental challenges: information interference from abundant irrelevant content, and the quadratic computational cost of Transformer-based architectures. Existing approaches primarily fall into two categories: token compression, which sacrifices fine-grained details; and introducing external retrievers, which increase system complexity and prevent end-to-end optimization. To address these issues, we conduct an in-depth analysis and observe that MLLMs exhibit a human-like coarse-to-fine reasoning pattern: early Transformer layers attend broadly across the document, while deeper layers focus on relevant evidence pages. Motivated by this insight, we posit that the inherent evidence localization capabilities of MLLMs can be explicitly leveraged to perform retrieval during the reasoning process, facilitating efficient long document understanding. To this end, we propose URaG, a simple-yet-effective framework that Unifies Retrieval and Generation within a single MLLM. URaG introduces a lightweight cross-modal retrieval module that converts the early Transformer layers into an efficient evidence selector, identifying and preserving the most relevant pages while discarding irrelevant content. This design enables the deeper layers to concentrate computational resources on pertinent information, improving both accuracy and efficiency. Extensive experiments demonstrate that URaG achieves state-of-the-art performance while reducing computational overhead by 44-56%. The code is available at https://github.com/shi-yx/URaG.
[57] DESS: DeBERTa Enhanced Syntactic-Semantic Aspect Sentiment Triplet Extraction
Vishal Thenuwara,Nisansa de Silva
Main category: cs.CL
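TL;DR: This paper proposes DESS, a new approach to fine-grained sentiment analysis that combines DeBERTa and an LSTM in a dual-channel structure, effectively improving aspect-opinion pair identification and sentiment polarity classification.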
Details
Motivation: Existing ASTE methods still fall short in capturing the complex relationships among aspects, opinions, and sentiment polarities, with limited ability to handle long-range dependencies and complex sentence structures, so stronger language models and fusion mechanisms are needed. Method: The proposed DESS model uses DeBERTa's enhanced attention mechanism to understand semantic context, adds an LSTM channel to capture grammatical structure, forms a dual-channel framework, and refines how information is exchanged between the two channels. Result: On standard datasets, DESS achieves F1-score increases of 4.85, 8.36, and 2.42 on aspect-opinion pair identification and sentiment classification, and ablation experiments indicate that DeBERTa's attention mechanism markedly improves the handling of complex sentence structures. Conclusion: Integrating a more advanced language model such as DeBERTa, together with a well-designed multi-channel fusion strategy, can substantially improve fine-grained sentiment analysis and offers an effective path for follow-up research. Abstract: Fine-grained sentiment analysis faces ongoing challenges in Aspect Sentiment Triplet Extraction (ASTE), particularly in accurately capturing the relationships between aspects, opinions, and sentiment polarities. While researchers have made progress using BERT and Graph Neural Networks, the full potential of advanced language models in understanding complex language patterns remains unexplored. We introduce DESS, a new approach that builds upon previous work by integrating DeBERTa's enhanced attention mechanism to better understand context and relationships in text. Our framework maintains a dual-channel structure, where DeBERTa works alongside an LSTM channel to process both meaning and grammatical patterns in text. We have carefully refined how these components work together, paying special attention to how different types of language information interact. When we tested DESS on standard datasets, it showed meaningful improvements over current methods, with F1-score increases of 4.85, 8.36, and 2.42 in identifying aspect opinion pairs and determining sentiment accurately. Looking deeper into the results, we found that DeBERTa's sophisticated attention system helps DESS handle complicated sentence structures better, especially when important words are far apart. Our findings suggest that upgrading to more advanced language models, when thoughtfully integrated, can lead to real improvements in how well we can analyze sentiments in text. The implementation of our approach is publicly available at: https://github.com/VishalRepos/DESS.
[58] Evaluating Prompting Strategies with MedGemma for Medical Order Extraction
Abhinand Balachandran,Bavana Durgapraveen,Gowsikkan Sikkan Sudhagar,Vidhya Varshany J S,Sriram Rajkumar
Main category: cs.CL
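TL;DR: This paper studies the extraction of medical orders from doctor-patient conversations with the MedGemma model, compares three prompting paradigms, and finds that simple one-shot prompting performs best on manually annotated data.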
Details
Motivation: Accurately extracting medical orders from doctor-patient conversations is essential for reducing clinical documentation burden and ensuring patient safety, and existing methods still leave room for improvement in real-world settings. Method: Using the MedGemma model, the authors systematically evaluate three prompting paradigms on the MEDIQA-OE-2025 Shared Task: one-shot prompting, the ReAct reasoning framework, and a multi-step agentic workflow. Result: Simple one-shot prompting performs best on the official validation set; more complex frameworks such as ReAct and agentic flows may do worse because "overthinking" introduces noise. Conclusion: On manually annotated clinical text, direct prompting is more robust and efficient, and the choice of prompting strategy should take data quality and the application scenario into account. Abstract: The accurate extraction of medical orders from doctor-patient conversations is a critical task for reducing clinical documentation burdens and ensuring patient safety. This paper details our team submission to the MEDIQA-OE-2025 Shared Task. We investigate the performance of MedGemma, a new domain-specific open-source language model, for structured order extraction. We systematically evaluate three distinct prompting paradigms: a straightforward one-shot approach, a reasoning-focused ReAct framework, and a multi-step agentic workflow. Our experiments reveal that while more complex frameworks like ReAct and agentic flows are powerful, the simpler one-shot prompting method achieved the highest performance on the official validation set. We posit that on manually annotated transcripts, complex reasoning chains can lead to "overthinking" and introduce noise, making a direct approach more robust and efficient. Our work provides valuable insights into selecting appropriate prompting strategies for clinical information extraction in varied data conditions.
[59] Mined Prompting and Metadata-Guided Generation for Wound Care Visual Question Answering
Bavana Durgapraveen,Sornaraj Sivasankaran,Abhinand Balachandran,Sriram Rajkumar
Main category: cs.CL
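TL;DR: This paper presents two approaches for generating text responses to wound care queries: a retrieval-based (mined) prompting strategy and metadata-guided generation. Combined, they improve the relevance and precision of AI answers to clinical questions in remote care.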
Details
Motivation: The rapid growth of asynchronous remote care has increased clinician workload, creating an urgent need for AI systems that help handle patient queries, particularly in image-based wound care. Method: The first approach uses a retrieval-based prompting strategy that retrieves the k most similar training examples as few-shot demonstrations; the second builds on a metadata ablation study, training classifiers to predict four key metadata attributes and incorporating them into generation, with outputs adjusted dynamically according to prediction confidence. Result: Experiments show that retrieval-based prompting improves response relevance, while metadata guidance further improves clinical precision. Conclusion: The two approaches point to workable directions for building efficient, reliable AI-driven wound care support tools. Abstract: The rapid expansion of asynchronous remote care has intensified provider workload, creating demand for AI systems that can assist clinicians in managing patient queries more efficiently. The MEDIQA-WV 2025 shared task addresses this challenge by focusing on generating free-text responses to wound care queries paired with images. In this work, we present two complementary approaches developed for the English track. The first leverages a mined prompting strategy, where training data is embedded and the top-k most similar examples are retrieved to serve as few-shot demonstrations during generation. The second approach builds on a metadata ablation study, which identified four metadata attributes that consistently enhance response quality. We train classifiers to predict these attributes for test cases and incorporate them into the generation pipeline, dynamically adjusting outputs based on prediction confidence. Experimental results demonstrate that mined prompting improves response relevance, while metadata-guided generation further refines clinical precision. Together, these methods highlight promising directions for developing AI-driven tools that can provide reliable and efficient wound care support.
[60] Know Your Limits: Entropy Estimation Modeling for Compression and Generalization
Benjamin L. Badger,Matthew Neligeorge
Main category: cs.CL
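TL;DR: This paper proposes an encoder-augmented causal decoder architecture that achieves higher compression than conventional causal transformers even on limited hardware, and uses per-token entropy estimates to improve language model generalization.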
Details
Motivation: The informational entropy intrinsic to language imposes an upper limit on prediction accuracy and a lower bound on compression. The most efficient compression algorithms today are causal large language models, but using them to estimate language entropy accurately is computationally infeasible, so more efficient models and methods are needed to approach it. Method: The authors introduce an encoder-augmented causal decoder architecture that improves training efficiency and achieves higher compression, and use per-token entropy estimates to guide training so that loss minimization does not go beyond the entropy lower bound. Result: The model compresses better than causal transformers even when trained on modest hardware, and experiments show that models trained to approach but not exceed per-token entropy estimates generalize better. Conclusion: Combining encoder information with entropy-guided training lets a model approach the entropy of language while generalizing better, offering a new and effective path for language modeling and compression. Abstract: Language prediction is constrained by informational entropy intrinsic to language, such that there exists a limit to how accurate any language model can become and equivalently a lower bound to language compression. The most efficient language compression algorithms today are causal (next token prediction) large language models, but the use of these models to form accurate estimates of language entropy is currently computationally infeasible. We introduce encoder-augmented causal decoder model architectures that exhibit superior training efficiency characteristics and achieve higher compression than causal transformers even when trained on modest hardware. We demonstrate how entropy estimates can be obtained on a per-token basis, and show that the generalization of models trained to approach the entropy of their training data necessarily exceeds the generalization of models trained to minimize loss beyond this value. We show empirically that causal models trained to approach but not exceed estimated per-token entropies exhibit greater generalization than models trained without taking entropy into account.
[61] SSR: Socratic Self-Refine for Large Language Model Reasoning
Haizhou Shi,Ye Liu,Bo Pang,Zeyu Leo Liu,Hao Wang,Silvio Savarese,Caiming Xiong,Yingbo Zhou,Semih Yavuz
Main category: cs.CL
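TL;DR: This paper proposes the Socratic Self-Refine (SSR) framework, which decomposes an LLM's reasoning into verifiable sub-question/sub-answer pairs to enable fine-grained evaluation and precise refinement, improving reasoning accuracy and interpretability on complex tasks.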
Details
Motivation: Existing test-time frameworks rely on coarse self-verification and self-correction, which limits their effectiveness on complex reasoning tasks, so a finer-grained and more controllable way to refine reasoning is needed. Method: SSR decomposes model outputs into (sub-question, sub-answer) pairs, estimates step-level confidence through controlled re-solving and self-consistency checks, and locates unreliable steps for iterative refinement. Result: Experiments on five reasoning benchmarks and three LLMs show that SSR consistently outperforms state-of-the-art self-refinement baselines. Conclusion: SSR offers an effective black-box approach that not only improves LLM reasoning performance but also strengthens the ability to understand and evaluate the model's internal reasoning process. Abstract: Large Language Models (LLMs) have demonstrated remarkable reasoning abilities, yet existing test-time frameworks often rely on coarse self-verification and self-correction, limiting their effectiveness on complex tasks. In this paper, we propose Socratic Self-Refine (SSR), a novel framework for fine-grained evaluation and precise refinement of LLM reasoning. Our proposed SSR decomposes model responses into verifiable (sub-question, sub-answer) pairs, enabling step-level confidence estimation through controlled re-solving and self-consistency checks. By pinpointing unreliable steps and iteratively refining them, SSR produces more accurate and interpretable reasoning chains. Empirical results across five reasoning benchmarks and three LLMs show that SSR consistently outperforms state-of-the-art iterative self-refinement baselines. Beyond performance gains, SSR provides a principled black-box approach for evaluating and understanding the internal reasoning processes of LLMs. Code is available at https://github.com/SalesforceAIResearch/socratic-self-refine-reasoning.
[62] Instella: Fully Open Language Models with Stellar Performance
Jiang Liu,Jialian Wu,Xiaodong Yu,Yusheng Su,Prakamya Mishra,Gowtham Ramesh,Sudhanshu Ranjan,Chaitanya Manem,Ximeng Sun,Ze Wang,Pratik Prabhanjan Brahma,Zicheng Liu,Emad Barsoum
Main category: cs.CL
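TL;DR: Instella is a family of fully open three-billion-parameter language models trained on openly available data and code. It reaches leading performance among comparable open models, and specialized variants for long context and mathematical reasoning are also released.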
Details
Motivation: Most high-performing large language models remain closed-source or only partially open, which limits research transparency and reproducibility, so a fully open, high-performing language model is needed to advance open research. Method: Instella is trained on AMD Instinct MI300X GPUs through large-scale pre-training, general-purpose instruction tuning, and alignment with human preferences; the specialized variants Instella-Long and Instella-Math are developed with supervised fine-tuning and reinforcement learning. Result: With fewer pre-training tokens, Instella achieves state-of-the-art results among fully open models and is competitive with leading open-weight models of comparable size; Instella-Long handles contexts of up to 128K tokens, and Instella-Math performs strongly on mathematical reasoning tasks. Conclusion: Instella provides a transparent, efficient, and versatile open language model, advancing reproducible and open language modeling research. Abstract: Large language models (LLMs) have demonstrated remarkable performance across a wide range of tasks, yet the majority of high-performing models remain closed-source or partially open, limiting transparency and reproducibility. In this work, we introduce Instella, a family of fully open three billion parameter language models trained entirely on openly available data and codebase. Powered by AMD Instinct MI300X GPUs, Instella is developed through large-scale pre-training, general-purpose instruction tuning, and alignment with human preferences. Despite using substantially fewer pre-training tokens than many contemporaries, Instella achieves state-of-the-art results among fully open models and is competitive with leading open-weight models of comparable size. We further release two specialized variants: Instella-Long, capable of handling context lengths up to 128K tokens, and Instella-Math, a reasoning-focused model enhanced through supervised fine-tuning and reinforcement learning on mathematical tasks. Together, these contributions establish Instella as a transparent, performant, and versatile alternative for the community, advancing the goal of open and reproducible language modeling research.
[63] Black-Box On-Policy Distillation of Large Language Models
Tianzhu Ye,Li Dong,Zewen Chi,Xun Wu,Shaohan Huang,Furu Wei
Main category: cs.CL
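TL;DR: This paper proposes Generative Adversarial Distillation (GAD), a new method for black-box distillation of large language models. It treats the student model as a generator and trains a discriminator to distinguish its outputs from the teacher's, forming a minimax game; experiments show that GAD outperforms conventional sequence-level knowledge distillation.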
Details
Motivation: When the internal parameters or logits of a proprietary teacher model are inaccessible, existing black-box distillation methods have limited effectiveness, so a more effective black-box distillation paradigm is needed to improve student models. Method: The proposed Generative Adversarial Distillation (GAD) treats the student LLM as a generator and trains a discriminator to distinguish student responses from teacher responses; adversarial training yields an on-policy reward model that co-evolves with the student, providing stable, adaptive feedback. Result: Experiments show that Qwen2.5-14B-Instruct (the student) trained with GAD comes close to its teacher, GPT-5-Chat, on the LMSYS-Chat automatic evaluation, and that GAD consistently outperforms conventional sequence-level knowledge distillation. Conclusion: GAD is a promising and effective paradigm for black-box LLM distillation that can substantially improve student models through an adversarial reward mechanism. Abstract: Black-box distillation creates student large language models (LLMs) by learning from a proprietary teacher model's text outputs alone, without access to its internal logits or parameters. In this work, we introduce Generative Adversarial Distillation (GAD), which enables on-policy and black-box distillation. GAD frames the student LLM as a generator and trains a discriminator to distinguish its responses from the teacher LLM's, creating a minimax game. The discriminator acts as an on-policy reward model that co-evolves with the student, providing stable, adaptive feedback. Experimental results show that GAD consistently surpasses the commonly used sequence-level knowledge distillation. In particular, Qwen2.5-14B-Instruct (student) trained with GAD becomes comparable to its teacher, GPT-5-Chat, on the LMSYS-Chat automatic evaluation. The results establish GAD as a promising and effective paradigm for black-box LLM distillation.
[64] ParoQuant: Pairwise Rotation Quantization for Efficient Reasoning LLM Inference
Yesheng Liang,Haisheng Chen,Song Han,Zhijian Liu
Main category: cs.CL
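TL;DR: The paper proposes ParoQuant, a weight-only post-training quantization method that combines hardware-efficient independent Givens rotations with channel-wise scaling to suppress outliers in weights and activations, improving the quantization accuracy and efficiency of large language models on reasoning tasks.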
Details
Motivation: Existing weight-only post-training quantization methods either fail to suppress outliers in LLM weights and activations sufficiently, which costs accuracy, or add too much inference overhead; errors accumulate especially badly in long chain-of-thought reasoning. Method: The proposed Pairwise Rotation Quantization (ParoQuant) uses optimizable independent Givens rotations together with channel-wise scaling to even out magnitudes across channels and narrow the dynamic range within each quantization group, and co-designs the inference kernel to exploit GPU parallelism fully and keep the runtime cost light. Result: On reasoning tasks, ParoQuant improves accuracy by 2.4% on average over AWQ with less than 10% runtime overhead. Conclusion: ParoQuant offers a practical route to efficient, accurate deployment of reasoning LLMs, substantially improving the accuracy-efficiency trade-off in quantization. Abstract: Weight-only post-training quantization (PTQ) compresses the weights of Large Language Models (LLMs) into low-precision representations to reduce memory footprint and accelerate inference. However, the presence of outliers in weights and activations often leads to large quantization errors and severe accuracy degradation, especially in recent reasoning LLMs where errors accumulate across long chains of thought. Existing PTQ methods either fail to sufficiently suppress outliers or introduce significant overhead during inference. In this paper, we propose Pairwise Rotation Quantization (ParoQuant), a weight-only PTQ method that combines hardware-efficient and optimizable independent Givens rotations with channel-wise scaling to even out the magnitude across channels and narrow the dynamic range within each quantization group. We further co-design the inference kernel to fully exploit GPU parallelism and keep the rotations and scaling lightweight at runtime. ParoQuant achieves an average 2.4% accuracy improvement over AWQ on reasoning tasks with less than 10% overhead. This paves the way for more efficient and accurate deployment of reasoning LLMs.
cs.CV [Back]
[65] FedeCouple: Fine-Grained Balancing of Global-Generalization and Local-Adaptability in Federated Learning
Ming Yang,Dongrun Li,Xin Wang,Feng Li,Lisheng Fan,Chunxiao Wang,Xiaoming Wu,Peng Cheng
Main category: cs.CV
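TL;DR: This paper proposes FedeCouple, a federated learning method that balances global generalization and local adaptability at a fine-grained level, improving personalized learning in privacy-preserving mobile networks with heterogeneous data.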
Details
Motivation: Existing personalized federated learning methods neglect the local adaptability of the feature extractor and the global generalization of the classifier during local training, leading to insufficient coordination and weak coupling between components and degrading overall model performance. Method: FedeCouple jointly learns global and local feature representations, uses dynamic knowledge distillation to strengthen classifier generalization, and introduces non-transmitted local anchors to refine the feature space, preserving privacy while reducing communication overhead. Result: Experiments on five image-classification datasets show that FedeCouple consistently outperforms nine baseline methods in effectiveness, stability, scalability, and security, with a gain of up to 4.3% in effectiveness. Conclusion: FedeCouple strengthens the coupling between feature extractor and classifier and achieves a better performance-privacy trade-off, with a theoretical proof of convergence for nonconvex objectives. Abstract: In privacy-preserving mobile network transmission scenarios with heterogeneous client data, personalized federated learning methods that decouple feature extractors and classifiers have demonstrated notable advantages in enhancing learning capability. However, many existing approaches primarily focus on feature space consistency and classification personalization during local training, often neglecting the local adaptability of the extractor and the global generalization of the classifier. This oversight results in insufficient coordination and weak coupling between the components, ultimately degrading the overall model performance. To address this challenge, we propose FedeCouple, a federated learning method that balances global generalization and local adaptability at a fine-grained level. Our approach jointly learns global and local feature representations while employing dynamic knowledge distillation to enhance the generalization of personalized classifiers. We further introduce anchors to refine the feature space; their strict locality and non-transmission inherently preserve privacy and reduce communication overhead. Furthermore, we provide a theoretical analysis proving that FedeCouple converges for nonconvex objectives, with iterates approaching a stationary point as the number of communication rounds increases. Extensive experiments conducted on five image-classification datasets demonstrate that FedeCouple consistently outperforms nine baseline methods in effectiveness, stability, scalability, and security. Notably, in experiments evaluating effectiveness, FedeCouple surpasses the best baseline by a significant margin of 4.3%.
[66] MMaDA-Parallel: Multimodal Large Diffusion Language Models for Thinking-Aware Editing and Generation
Ye Tian,Ling Yang,Jiongfan Yang,Anran Wang,Yu Tian,Jiani Zheng,Haochen Wang,Zhiyang Teng,Zhuochen Wang,Yinjie Wang,Yunhai Tong,Mengdi Wang,Xiangtai Li
Main category: cs.CV
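TL;DR: This paper proposes ParaBench, a benchmark for evaluating the consistency of text and image outputs in multimodal reasoning, and MMaDA-Parallel, a framework that uses a parallel diffusion model and a new reinforcement learning strategy, ParaRL, to improve cross-modal alignment and semantic consistency, significantly outperforming existing methods on ParaBench.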
Details
Motivation: Existing autoregressive thinking-aware generation methods degrade on complex tasks because of error propagation, and their reasoning is poorly aligned with the generated image. Method: The authors propose the ParaBench benchmark to evaluate text and image outputs systematically; design MMaDA-Parallel, a parallel multimodal diffusion framework that supports bidirectional interaction between text and images throughout denoising; and optimize it with supervised finetuning followed by Parallel Reinforcement Learning (ParaRL), which applies a new trajectory-level semantic reward. Result: Experiments show that MMaDA-Parallel improves Output Alignment on ParaBench by 6.9% over the current state-of-the-art model, Bagel, markedly strengthening cross-modal alignment and semantic consistency. Conclusion: Through its parallel architecture and trajectory-level reinforcement learning, MMaDA-Parallel mitigates error propagation and establishes a more robust paradigm for thinking-aware image generation. Abstract: While thinking-aware generation aims to improve performance on complex tasks, we identify a critical failure mode where existing sequential, autoregressive approaches can paradoxically degrade performance due to error propagation. To systematically analyze this issue, we propose ParaBench, a new benchmark designed to evaluate both text and image output modalities. Our analysis using ParaBench reveals that this performance degradation is strongly correlated with poor alignment between the generated reasoning and the final image. To resolve this, we propose a parallel multimodal diffusion framework, MMaDA-Parallel, that enables continuous, bidirectional interaction between text and images throughout the entire denoising trajectory. MMaDA-Parallel is trained with supervised finetuning and then further optimized by Parallel Reinforcement Learning (ParaRL), a novel strategy that applies semantic rewards along the trajectory to enforce cross-modal consistency. Experiments validate that our model significantly improves cross-modal alignment and semantic consistency, achieving a 6.9% improvement in Output Alignment on ParaBench compared to the state-of-the-art model, Bagel, establishing a more robust paradigm for thinking-aware image synthesis. Our code is open-sourced at https://github.com/tyfeld/MMaDA-Parallel
[67] PriVi: Towards A General-Purpose Video Model For Primate Behavior In The Wild
Felix B. Mueller,Jan F. Meier,Timo Lueddecke,Richard Vogg,Roger L. Freixanet,Valentin Hassler,Tiffany Bosshard,Elif Karakoc,William J. O'Hearn,Sofia M. Pereira,Sandro Sehner,Kaja Wierucka,Judith Burkart,Claudia Fichtel,Julia Fischer,Alexander Gail,Catherine Hobaiter,Julia Ostner,Liran Samuni,Oliver Schülke,Neda Shahidi,Erin G. Wessling,Alexander S. Ecker
Main category: cs.CV
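TL;DR: This paper takes a data-centric approach, building PriVi, a large-scale primate-centric video pretraining dataset, and pretraining V-JEPA on it, which substantially improves data efficiency and generalization on primate behavior recognition tasks.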
Details
Motivation: Existing computer vision methods mostly rely on human-centric pretrained models and are usually confined to a single dataset, which limits generalization in primate behavior analysis; a data-driven approach focused on primates is needed. Method: The authors build PriVi, a large-scale dataset of 424 hours of video, of which 174 hours come from 11 behavioral research settings and 250 hours are web-sourced, assembled through a scalable data curation pipeline; they pretrain V-JEPA on it and evaluate with a lightweight frozen classifier on several benchmark datasets. Result: Experiments on four benchmark datasets (ChimpACT, BaboonLand, PanAf500, and ChimpBehave) show that the approach outperforms prior work, including fully finetuned baselines, and scales better when labels are scarce. Conclusion: Primate-centric pretraining substantially improves data efficiency and cross-dataset generalization, offering an effective solution for primate behavior research in low-label conditions. Abstract: Non-human primates are our closest living relatives, and analyzing their behavior is central to research in cognition, evolution, and conservation. Computer vision could greatly aid this research, but existing methods often rely on human-centric pretrained models and focus on single datasets, which limits generalization. We address this limitation by shifting from a model-centric to a data-centric approach and introduce PriVi, a large-scale primate-centric video pretraining dataset. PriVi contains 424 hours of curated video, combining 174 hours from behavioral research across 11 settings with 250 hours of diverse web-sourced footage, assembled through a scalable data curation pipeline. We pretrain V-JEPA on PriVi to learn primate-specific representations and evaluate it using a lightweight frozen classifier. Across four benchmark datasets, ChimpACT, BaboonLand, PanAf500, and ChimpBehave, our approach consistently outperforms prior work, including fully finetuned baselines, and scales favorably with fewer labels. These results demonstrate that primate-centric pretraining substantially improves data efficiency and generalization, making it a promising approach for low-label applications. Code, models, and the majority of the dataset will be made available.
[68] Classifying Phonotrauma Severity from Vocal Fold Images with Soft Ordinal Regression
Katie Matton,Purvaja Balaji,Hamzeh Ghasemzadeh,Jameson C. Cooper,Daryush D. Mehta,Jarrad H. Van Stan,Robert E. Hillman,Rosalind Picard,John Guttag,S. Mazdak Abulnaga
Main category: cs.CV
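TL;DR: Proposes an automatic classification method based on soft-label ordinal regression for assessing phonotrauma severity; its performance approaches that of clinical experts, and it provides reliable uncertainty estimates.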
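As a rough illustration of the soft-label ordinal regression idea in this entry, the sketch below uses one common formulation: a model emits K-1 logits for the events "severity > k", and an annotator rating distribution over K ordered classes is converted into K-1 soft cumulative targets scored with binary cross-entropy. The function names and this particular cumulative formulation are assumptions made for illustration; the abstract does not specify the authors' exact loss.

```python
import math

def soft_cumulative_targets(q):
    """Turn a rating distribution q over K ordered classes into
    K-1 soft targets, where target k is P(rating > k)."""
    targets = []
    tail = sum(q)
    for k in range(len(q) - 1):
        tail -= q[k]
        targets.append(tail)
    return targets

def soft_ordinal_loss(logits, q):
    """Binary cross-entropy between sigmoid(logits) and the soft
    cumulative targets (hypothetical formulation, not the paper's)."""
    loss = 0.0
    for z, t in zip(logits, soft_cumulative_targets(q)):
        p = 1.0 / (1.0 + math.exp(-z))
        p = min(max(p, 1e-12), 1.0 - 1e-12)  # guard against log(0)
        loss -= t * math.log(p) + (1.0 - t) * math.log(1.0 - p)
    return loss

# Two annotators disagree between the first two of three severity levels.
print(soft_cumulative_targets([0.5, 0.5, 0.0]))
print(soft_ordinal_loss([0.0, -3.0], [0.5, 0.5, 0.0]))
```

With a one-hot distribution the targets reduce to the usual hard ordinal encoding, so the soft version is a strict generalization of it.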
Details
Motivation: Assessing phonotrauma severity depends on clinicians' subjective judgment, which is costly and varies widely in reliability, so an automated, objective assessment method is needed. Method: Adopts an ordinal regression framework and proposes a new modification to the loss function that lets it handle soft labels reflecting annotator rating distributions, addressing label uncertainty. Result: The proposed soft ordinal regression method approaches clinical-expert-level predictive performance and produces well-calibrated uncertainty estimates. Conclusion: The method provides an automated tool for assessing phonotrauma severity, which can enable large-scale studies and improve clinical understanding and patient care. Abstract: Phonotrauma refers to vocal fold tissue damage resulting from exposure to forces during voicing. It occurs on a continuum from mild to severe, and treatment options can vary based on severity. Assessment of severity involves a clinician's expert judgment, which is costly and can vary widely in reliability. In this work, we present the first method for automatically classifying phonotrauma severity from vocal fold images. To account for the ordinal nature of the labels, we adopt a widely used ordinal regression framework. To account for label uncertainty, we propose a novel modification to ordinal regression loss functions that enables them to operate on soft labels reflecting annotator rating distributions. Our proposed soft ordinal regression method achieves predictive performance approaching that of clinical experts, while producing well-calibrated uncertainty estimates. By providing an automated tool for phonotrauma severity assessment, our work can enable large-scale studies of phonotrauma, ultimately leading to improved clinical understanding and patient care.
[69] SliderEdit: Continuous Image Editing with Fine-Grained Instruction Control
Arman Zarei,Samyadeep Basu,Mobina Pournemat,Sayan Nag,Ryan Rossi,Soheil Feizi
Main category: cs.CV
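TL;DR: This paper proposes SliderEdit, a framework for continuous image editing that gives fine-grained, interpretable strength control over each instruction in a multi-instruction prompt, with smooth adjustment of edit strength through globally trained sliders.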
Details
Motivation: Existing instruction-based image editing models apply each instruction at a fixed strength and lack precise, continuous control over the intensity of individual edits, which limits user control. Method: SliderEdit disentangles a multi-part edit instruction and learns a globally shared set of low-rank adaptation matrices that exposes each instruction as an adjustable slider and generalizes across diverse edits, attributes, and compositional instructions. Result: Applied to state-of-the-art models such as FLUX-Kontext and Qwen-Image-Edit, SliderEdit substantially improves edit controllability, visual consistency, and user steerability, supporting continuous interpolation along individual edit dimensions while preserving spatial locality and global semantic consistency. Conclusion: SliderEdit is, to the authors' knowledge, the first framework for continuous, fine-grained instruction control in instruction-based image editing, opening new possibilities for interactive, instruction-driven image manipulation. Abstract: Instruction-based image editing models have recently achieved impressive performance, enabling complex edits to an input image from a multi-instruction prompt. However, these models apply each instruction in the prompt with a fixed strength, limiting the user's ability to precisely and continuously control the intensity of individual edits. We introduce SliderEdit, a framework for continuous image editing with fine-grained, interpretable instruction control. Given a multi-part edit instruction, SliderEdit disentangles the individual instructions and exposes each as a globally trained slider, allowing smooth adjustment of its strength. Unlike prior works that introduced slider-based attribute controls in text-to-image generation, typically requiring separate training or fine-tuning for each attribute or concept, our method learns a single set of low-rank adaptation matrices that generalize across diverse edits, attributes, and compositional instructions. This enables continuous interpolation along individual edit dimensions while preserving both spatial locality and global semantic consistency. We apply SliderEdit to state-of-the-art image editing models, including FLUX-Kontext and Qwen-Image-Edit, and observe substantial improvements in edit controllability, visual consistency, and user steerability. To the best of our knowledge, we are the first to explore and propose a framework for continuous, fine-grained instruction control in instruction-based image editing models. Our results pave the way for interactive, instruction-driven image manipulation with continuous and compositional control.
[70] Density Estimation and Crowd Counting
Balachandra Devarangadi Sunil,Rakshith Venkatesh,Shantanu Todmal
Main category: cs.CV
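TL;DR: This study proposes a video-oriented crowd density estimation method that combines a denoising probabilistic model, diffusion processes, and event-driven sampling to improve estimation accuracy and computational efficiency.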
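The event-driven sampling step in this entry can be pictured with a very small sketch. It assumes a per-transition motion score has already been computed (for example, the mean magnitude of a dense Farneback optical flow field between consecutive frames) and keeps only frames that follow a significant motion event. The function name, the threshold rule, and the choice to always keep the first frame are illustrative assumptions; the abstract does not state the actual selection criterion.

```python
def select_event_frames(motion_scores, threshold):
    """Return indices of frames to keep.

    motion_scores[i] is a precomputed motion measure between frame i and
    frame i + 1 (assumed here to come from optical flow). Frame 0 is always
    kept as a reference; frame i + 1 is kept when its score exceeds the
    threshold.
    """
    kept = [0]
    for idx, score in enumerate(motion_scores, start=1):
        if score > threshold:
            kept.append(idx)
    return kept

# Five frames, four transitions; only two transitions show strong motion.
print(select_event_frames([0.1, 2.0, 0.2, 3.5], threshold=1.0))
```

Only the kept frames would then be passed to the density estimator, which is where the savings in computation and storage would come from.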
Details
Motivation: Conventional image-based crowd density estimation methods struggle with the temporal dynamics of video data and are computationally expensive, so they cannot meet real-time monitoring needs. Method: A diffusion-based denoising probabilistic model generates high-quality density maps, using narrow Gaussian kernels and producing multiple density maps; a regression branch extracts features, and the maps are fused by similarity score into a final result; event-driven sampling based on the Farneback optical flow algorithm keeps only frames with significant motion. Result: The method estimates density well in both dense and sparse scenes, as measured by MAE, and event-driven sampling markedly reduces frame counts and computational load while retaining key crowd dynamics. Conclusion: The method addresses temporal modeling and efficiency in video crowd density estimation and provides a scalable, efficient framework for real-time crowd monitoring in applications such as public safety and disaster response. Abstract: This study enhances a crowd density estimation algorithm originally designed for image-based analysis by adapting it for video-based scenarios. The proposed method integrates a denoising probabilistic model that utilizes diffusion processes to generate high-quality crowd density maps. To improve accuracy, narrow Gaussian kernels are employed, and multiple density map outputs are generated. A regression branch is incorporated into the model for precise feature extraction, while a consolidation mechanism combines these maps based on similarity scores to produce a robust final result. An event-driven sampling technique, utilizing the Farneback optical flow algorithm, is introduced to selectively capture frames showing significant crowd movements, reducing computational load and storage by focusing on critical crowd dynamics. Through qualitative and quantitative evaluations, including overlay plots and Mean Absolute Error (MAE), the model demonstrates its ability to effectively capture crowd dynamics in both dense and sparse settings. The efficiency of the sampling method is further assessed, showcasing its capability to decrease frame counts while maintaining essential crowd events. By addressing the temporal challenges unique to video analysis, this work offers a scalable and efficient framework for real-time crowd monitoring in applications such as public safety, disaster response, and event management.
[71] PALMS+: Modular Image-Based Floor Plan Localization Leveraging Depth Foundation Model
Yunqian Cheng,Benjamin Princen,Roberto Manduchi
Main category: cs.CV
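TL;DR: PALMS+ is an infrastructure-free indoor localization system that uses a monocular depth estimation model to reconstruct 3D point clouds from RGB images and matches their geometric layout against a floor plan to localize accurately.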
Details
Motivation: Existing vision-based localization methods are limited by the short range of smartphone LiDAR and by ambiguity in indoor layouts, making accurate, training-free, infrastructure-free localization difficult in large buildings. Method: PALMS+ uses the Depth Pro model to generate scale-aligned 3D point clouds from posed RGB images, matches geometric layout by convolution with the floor plan, and outputs a posterior over location and orientation, supporting stationary and sequential localization. Result: On Structured3D and a custom campus dataset, PALMS+ outperforms PALMS and F3Loc in stationary localization; in sequential localization on 33 real-world trajectories with a particle filter, it shows lower localization error. Conclusion: PALMS+ achieves accurate, robust infrastructure-free indoor localization without training, suitable for applications such as emergency response and assistive navigation. Abstract: Indoor localization in GPS-denied environments is crucial for applications like emergency response and assistive navigation. Vision-based methods such as PALMS enable infrastructure-free localization using only a floor plan and a stationary scan, but are limited by the short range of smartphone LiDAR and ambiguity in indoor layouts. We propose PALMS$+$, a modular, image-based system that addresses these challenges by reconstructing scale-aligned 3D point clouds from posed RGB images using a foundation monocular depth estimation model (Depth Pro), followed by geometric layout matching via convolution with the floor plan. PALMS$+$ outputs a posterior over the location and orientation, usable for direct or sequential localization. Evaluated on Structured3D and a custom campus dataset consisting of 80 observations across four large campus buildings, PALMS$+$ outperforms PALMS and F3Loc in stationary localization accuracy -- without requiring any training. Furthermore, when integrated with a particle filter for sequential localization on 33 real-world trajectories, PALMS$+$ achieved lower localization errors compared to other methods, demonstrating robustness for camera-free tracking and its potential for infrastructure-free applications. Code and data are available at https://github.com/Head-inthe-Cloud/PALMS-Plane-based-Accessible-Indoor-Localization-Using-Mobile-Smartphones
[72] Social LSTM with Dynamic Occupancy Modeling for Realistic Pedestrian Trajectory Prediction
Ahmed Alia,Mohcine Chraibi,Armin Seyfried
Main category: cs.CV
TL;DR: 本文提出了一种改进的Social LSTM模型,通过引入动态占用空间损失函数,在保持低位移误差的同时有效减少行人轨迹预测中的碰撞率,适用于不同密度的拥挤环境。
Details
Motivation: Existing methods usually treat pedestrians as point entities and ignore the physical space each person occupies, so trajectory predictions in dense crowds often contain unrealistic collisions. A model is needed that accounts for individual spatial occupancy and adapts to scenes of different densities. Method: Proposes a new Dynamic Occupied Space loss function that combines the average displacement error with a collision penalty sensitive to scene density and individual spatial occupancy, and uses it to train Social LSTM to avoid realistic collisions. Result: Experiments on five datasets generated from real data recorded at the 2022 Festival of Lights in Lyon show that the model lowers the collision rate (by up to 31%) and reduces the average displacement error (by 5%) and the final displacement error (by 6%) on average across datasets relative to the baseline, and it outperforms several state-of-the-art deep learning models on most test sets. Conclusion: The proposed Dynamic Occupied Space loss makes Social LSTM trajectory predictions in complex crowd scenes more realistic and more accurate, combining low collision rates with high positional accuracy, and has potential for broad use in heterogeneous-density environments. Abstract: In dynamic and crowded environments, realistic pedestrian trajectory prediction remains a challenging task due to the complex nature of human motion and the mutual influences among individuals. Deep learning models have recently achieved promising results by implicitly learning such patterns from 2D trajectory data. However, most approaches treat pedestrians as point entities, ignoring the physical space that each person occupies. To address these limitations, this paper proposes a novel deep learning model that enhances the Social LSTM with a new Dynamic Occupied Space loss function. This loss function guides Social LSTM in learning to avoid realistic collisions without increasing displacement error across different crowd densities, ranging from low to high, in both homogeneous and heterogeneous density settings. Such a function achieves this by combining the average displacement error with a new collision penalty that is sensitive to scene density and individual spatial occupancy. For efficient training and evaluation, five datasets were generated from real pedestrian trajectories recorded during the Festival of Lights in Lyon 2022. Four datasets represent homogeneous crowd conditions -- low, medium, high, and very high density -- while the fifth corresponds to a heterogeneous density distribution. The experimental findings indicate that the proposed model not only lowers collision rates but also enhances displacement prediction accuracy in each dataset. Specifically, the model achieves up to a 31% reduction in the collision rate and reduces the average displacement error and the final displacement error by 5% and 6%, respectively, on average across all datasets compared to the baseline. Moreover, the proposed model consistently outperforms several state-of-the-art deep learning models across most test sets.
[73] Soiling detection for Advanced Driver Assistance Systems
Filip Beránek,Václav Diviš,Ivan Gruber
Main category: cs.CV
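TL;DR: This paper treats soiling detection for automotive cameras as a semantic segmentation problem, compares several popular segmentation methods, and shows that they outperform tile-level classification; it also identifies data leakage and imprecise annotations in the Woodscape dataset and builds a much smaller new subset that yields comparable performance in less time.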
Details
Motivation: Make advanced driver assistance systems more robust to adverse external conditions (e.g., weather, dust) and address the data leakage and imprecise annotations in the existing dataset. Method: Soiling detection is performed by semantic segmentation; several popular segmentation models are compared, and the Woodscape dataset is re-split and corrected. Result: Segmentation methods outperform conventional tile-level classification; the new subset, though smaller, trains faster and reaches comparable detection quality. Conclusion: Semantic segmentation is the better approach to soiling detection, and a high-quality data subset improves training efficiency and model reliability. Abstract: Soiling detection for automotive cameras is a crucial part of advanced driver assistance systems to make them more robust to external conditions like weather, dust, etc. In this paper, we regard soiling detection as a semantic segmentation problem. We provide a comprehensive comparison of popular segmentation methods and show their superior performance compared with tile-level classification approaches. Moreover, we present an extensive analysis of the Woodscape dataset showing that the original dataset contains data leakage and imprecise annotations. To address these problems, we create a new data subset, which, despite being much smaller, provides enough information for the segmentation method to reach comparable results in a much shorter time. All our code and dataset splits are available at https://github.com/filipberanek/woodscape_revision.
[74] Feature Quality and Adaptability of Medical Foundation Models: A Comparative Evaluation for Radiographic Classification and Segmentation
Frank Li,Theo Dapamede,Mohammadreza Chavoshi,Young Seok Jeon,Bardia Khosravi,Abdulhameed Dere,Beatrice Brown-Mulry,Rohan Satya Isaac,Aawez Mansuri,Chiratidzo Sanyika,Janice Newsome,Saptarshi Purkayastha,Imon Banerjee,Hari Trivedi,Judy Gichoya
Main category: cs.CV
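TL;DR: This study evaluates vision encoders from eight medical and general-domain foundation models on chest X-ray analysis. Medically pretrained models do better in linear probing, but feature utility is highly task-dependent: on complex lesion segmentation, all foundation models perform poorly without substantial fine-tuning. Expensive text-image alignment is not necessary, since image-only and label-supervised models perform well, and an end-to-end supervised model remains competitive on segmentation.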
Details
Motivation: It remains unclear how pre-training domain (medical vs. general), paradigm (e.g., text-guided), and architecture affect embedding quality, which makes it hard to choose a foundation model for a given radiology task. Method: Vision encoders from eight medical and general-domain foundation models are benchmarked on chest X-ray classification (pneumothorax, cardiomegaly) and segmentation (pneumothorax, cardiac boundary) using linear probing and fine-tuning, with subgroup analysis to identify reliance on confounding features. Result: Medically pretrained models clearly outperform general-domain models in linear probing, indicating higher initial feature quality. Pre-trained embeddings work for global classification and for segmenting salient anatomy but generally perform poorly on subtle pathology segmentation (e.g., pneumothorax) without substantial fine-tuning. Some models exploit confounding cues (e.g., chest tubes) for classification but cannot localize precisely. Models that need no expensive text-image alignment (e.g., RAD-DINO, Ark+) are among the top performers, and an end-to-end supervised baseline matches or exceeds the best foundation models on segmentation. Conclusion: Medical pre-training improves feature quality, but the benefit varies by task and major gaps remain for complex localization; architectural choices (e.g., multi-scale) are critical, pre-trained features are not a cure-all, and conventional supervised models remain strong competitors. Abstract: Foundation models (FMs) promise to generalize medical imaging, but their effectiveness varies. It remains unclear how pre-training domain (medical vs. general), paradigm (e.g., text-guided), and architecture influence embedding quality, hindering the selection of optimal encoders for specific radiology tasks. To address this, we evaluate vision encoders from eight medical and general-domain FMs for chest X-ray analysis. We benchmark classification (pneumothorax, cardiomegaly) and segmentation (pneumothorax, cardiac boundary) using linear probing and fine-tuning. Our results show that domain-specific pre-training provides a significant advantage; medical FMs consistently outperformed general-domain models in linear probing, establishing superior initial feature quality. However, feature utility is highly task-dependent. Pre-trained embeddings were strong for global classification and segmenting salient anatomy (e.g., heart). In contrast, for segmenting complex, subtle pathologies (e.g., pneumothorax), all FMs performed poorly without significant fine-tuning, revealing a critical gap in localizing subtle disease. Subgroup analysis showed FMs use confounding shortcuts (e.g., chest tubes for pneumothorax) for classification, a strategy that fails for precise segmentation. We also found that expensive text-image alignment is not a prerequisite; image-only (RAD-DINO) and label-supervised (Ark+) FMs were among top performers. Notably, a supervised, end-to-end baseline remained highly competitive, matching or exceeding the best FMs on segmentation tasks. These findings show that while medical pre-training is beneficial, architectural choices (e.g., multi-scale) are critical, and pre-trained features are not universally effective, especially for complex localization tasks where supervised models remain a strong alternative.
[75] Gradient-Guided Exploration of Generative Model's Latent Space for Controlled Iris Image Augmentations
Mahsa Mitcheff,Siamul Karim Khan,Adam Czajka
Main category: cs.CV
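TL;DR: Proposes an iris image augmentation strategy based on traversing a generative model's latent space, which manipulates specific iris attributes (e.g., sharpness, pupil size) while preserving identity.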
Details
Motivation: Existing iris datasets lack diversity, and it is hard to synthesize iris images with specific attribute changes while preserving identity. Method: Gradient-guided traversal of a generative model's latent space manipulates geometrical, textural, or quality-related features; GAN inversion maps real or generated iris images into the latent space for attribute editing. Result: The method achieves controllable editing of multiple iris image attributes with identity preserved, and is applicable to tasks such as data augmentation and presentation attack detection. Conclusion: The method offers an effective data augmentation tool for iris recognition and attack detection and extends to any attribute defined by a differentiable loss. Abstract: Developing reliable iris recognition and presentation attack detection methods requires diverse datasets that capture realistic variations in iris features and a wide spectrum of anomalies. Because of the rich texture of iris images, which spans a wide range of spatial frequencies, synthesizing same-identity iris images while controlling specific attributes remains challenging. In this work, we introduce a new iris image augmentation strategy by traversing a generative model's latent space toward latent codes that represent same-identity samples but with some desired iris image properties manipulated. The latent space traversal is guided by a gradient of specific geometrical, textural, or quality-related iris image features (e.g., sharpness, pupil size, iris size, or pupil-to-iris ratio) and preserves the identity represented by the image being manipulated. The proposed approach can be easily extended to manipulate any attribute for which a differentiable loss term can be formulated. Additionally, our approach can use either images randomly generated by a pre-trained GAN model or real-world iris images. We can utilize GAN inversion to project any given iris image into the latent space and obtain its corresponding latent code.
[76] STORM: Segment, Track, and Object Re-Localization from a Single 3D Model
Yu Deng,Teng Cao,Hikaru Shindo,Jiahong Xue,Quentin Delfosse,Kristian Kersting
Main category: cs.CV
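TL;DR: STORM is a real-time 6D pose estimation system that requires no manual annotation; by combining vision-language understanding with self-supervised feature matching, it achieves state-of-the-art accuracy and robustness in industrial scenarios.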
Details
Motivation: Existing 6D pose estimation methods depend on a manually annotated segmentation mask in the first frame, which is time-consuming, and they degrade under occlusion or rapid motion; a more automated, robust method is needed. Method: STORM uses a three-stage pipeline: vision-language understanding guides object localization, self-cross-attention identifies candidate regions, and a segmentation model produces precise masks. An automatic re-registration mechanism detects tracking failures by monitoring feature similarity and recovers from them. Result: STORM reaches state-of-the-art accuracy on challenging industrial datasets with multi-object occlusion, high-speed motion, and varying illumination, at real-time speed and without additional training. Conclusion: STORM delivers efficient, robust 6D pose estimation without manual annotation, significantly lowering deployment cost, and suits applications such as flexible manufacturing and intelligent quality control. Abstract: Accurate 6D pose estimation and tracking are fundamental capabilities for physical AI systems such as robots. However, existing approaches typically rely on a manually annotated segmentation mask of the target in the first frame, which is labor-intensive and leads to reduced performance when faced with occlusions or rapid movement. To address these limitations, we propose STORM (Segment, Track, and Object Re-localization from a single 3D Model), an open-source robust real-time 6D pose estimation system that requires no manual annotation. STORM employs a novel three-stage pipeline combining vision-language understanding with self-supervised feature matching: contextual object descriptions guide localization, self-cross-attention mechanisms identify candidate regions, and a segmentation model produces precise masks for accurate pose estimation. Another key innovation is our automatic re-registration mechanism that detects tracking failures through feature similarity monitoring and recovers from severe occlusions or rapid motion. STORM achieves state-of-the-art accuracy on challenging industrial datasets featuring multi-object occlusions, high-speed motion, and varying illumination, while operating at real-time speeds without additional training. This annotation-free approach significantly reduces deployment overhead, providing a practical solution for modern applications, such as flexible manufacturing and intelligent quality control.
[77] PANDA - Patch And Distribution-Aware Augmentation for Long-Tailed Exemplar-Free Continual Learning
Siddeshwar Raghavan,Jiangpeng He,Fengqing Zhu
Main category: cs.CV
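TL;DR: Proposes PANDA, a framework whose patch-level, distribution-aware augmentation mitigates class-level and inter-task data imbalance in exemplar-free continual learning with pretrained models.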
Details
Motivation: Existing exemplar-free continual learning methods overlook the dual-level imbalance (class skew within and across tasks) present in real data streams, which degrades performance. Method: A CLIP encoder identifies representative image regions of low-frequency classes and transplants them into frequent-class samples to amplify the low-frequency classes; an adaptive balancing strategy uses prior task distributions to ease inter-task imbalance. Result: PANDA's effectiveness is validated on multiple benchmarks, with notable gains in classification accuracy and reduced catastrophic forgetting, and it integrates seamlessly with existing methods. Conclusion: Through fine-grained augmentation and distribution awareness, PANDA achieves fairer, more stable continual learning with frozen pretrained models. Abstract: Exemplar-Free Continual Learning (EFCL) restricts the storage of previous task data and is highly susceptible to catastrophic forgetting. While pre-trained models (PTMs) are increasingly leveraged for EFCL, existing methods often overlook the inherent imbalance of real-world data distributions. We discovered that real-world data streams commonly exhibit dual-level imbalances, dataset-level distributions combined with extreme or reversed skews within individual tasks, creating both intra-task and inter-task disparities that hinder effective learning and generalization. To address these challenges, we propose PANDA, a Patch-and-Distribution-Aware Augmentation framework that integrates seamlessly with existing PTM-based EFCL methods. PANDA amplifies low-frequency classes by using a CLIP encoder to identify representative regions and transplanting those into frequent-class samples within each task. Furthermore, PANDA incorporates an adaptive balancing strategy that leverages prior task distributions to smooth inter-task imbalances, reducing the overall gap between average samples across tasks and enabling fairer learning with frozen PTMs. Extensive experiments and ablation studies demonstrate PANDA's capability to work with existing PTM-based CL methods, improving accuracy and reducing catastrophic forgetting.
[78] Test-Time Spectrum-Aware Latent Steering for Zero-Shot Generalization in Vision-Language Models
Konstantinos M. Dafnis,Dimitris N. Metaxas
Main category: cs.CV
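TL;DR: Proposes Spectrum-Aware Test-Time Steering (STS), a lightweight test-time adaptation framework that improves vision-language models under domain shift by adapting a small number of per-sample parameters in latent space, without updating model weights and far more efficiently than existing methods.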
Details
Motivation: Existing test-time adaptation methods usually require backpropagation through, or modification of, large models; the computational cost and complexity limit practical use. Method: STS extracts a spectral subspace from text embeddings to define principal semantic directions and, at inference, adapts only a few per-sample shift parameters by minimizing entropy across augmented views, steering latent representations in a spectrum-aware manner. Result: Under standard evaluation protocols, STS surpasses or matches state-of-the-art test-time adaptation methods with very few extra parameters, with inference up to 8x faster and a memory footprint 12x smaller than conventional test-time prompt tuning. Conclusion: STS is an efficient, lightweight test-time adaptation method that leaves the frozen encoders unmodified, offering a practical way to deploy VLMs under domain shift. Abstract: Vision-Language Models (VLMs) excel at zero-shot inference but often degrade under test-time domain shifts. For this reason, episodic test-time adaptation strategies have recently emerged as powerful techniques for adapting VLMs to a single unlabeled image. However, existing adaptation strategies, such as test-time prompt tuning, typically require backpropagating through large encoder weights or altering core model components. In this work, we introduce Spectrum-Aware Test-Time Steering (STS), a lightweight adaptation framework that extracts a spectral subspace from the textual embeddings to define principal semantic directions and learns to steer latent representations in a spectrum-aware manner by adapting a small number of per-sample shift parameters to minimize entropy across augmented views. STS operates entirely at inference in the latent space, without backpropagation through or modification of the frozen encoders. Building on standard evaluation protocols, our comprehensive experiments demonstrate that STS largely surpasses or compares favorably against state-of-the-art test-time adaptation methods, while introducing only a handful of additional parameters and achieving inference speeds up to 8x faster with a 12x smaller memory footprint than conventional test-time prompt tuning. The code is available at https://github.com/kdafnis/STS.
[79] Lumos3D: A Single-Forward Framework for Low-Light 3D Scene Restoration
Hanzhou Liu,Peng Jiang,Jia Huang,Mi Lu
Main category: cs.CV
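TL;DR: Proposes Lumos3D, a generalizable pose-free framework for 3D low-light scene restoration that, after a single training run, restores illumination and structure feed-forward from unposed multi-view low-light images, with high fidelity and strong generalization.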
Details
Motivation: Existing methods depend on precomputed camera poses and scene-specific optimization, which makes them hard to extend to dynamic real-world environments and limits scalability in practice. Method: Lumos3D builds on a geometry-grounded backbone, uses a cross-illumination distillation scheme to transfer the teacher network's geometric information (e.g., depth) to the student model, and adds a dedicated Lumos loss to promote photometric consistency in the reconstructed 3D space; the output is a normal-light 3D Gaussian representation. Result: Experiments on real-world datasets show that Lumos3D achieves high-fidelity low-light 3D scene restoration with accurate geometry and strong generalization to unseen scenes, and it extends naturally to over-exposure correction. Conclusion: Lumos3D is an efficient, general 3D low-light restoration framework that needs no per-scene optimization and applies across a range of lighting restoration tasks. Abstract: Restoring 3D scenes captured under low-light conditions remains a fundamental yet challenging problem. Most existing approaches depend on precomputed camera poses and scene-specific optimization, which greatly restricts their scalability to dynamic real-world environments. To overcome these limitations, we introduce Lumos3D, a generalizable pose-free framework for 3D low-light scene restoration. Trained once on a single dataset, Lumos3D performs inference in a purely feed-forward manner, directly restoring illumination and structure from unposed, low-light multi-view images without any per-scene training or optimization. Built upon a geometry-grounded backbone, Lumos3D reconstructs a normal-light 3D Gaussian representation that restores illumination while faithfully preserving structural details. During training, a cross-illumination distillation scheme is employed, where the teacher network is distilled on normal-light ground truth to transfer accurate geometric information, such as depth, to the student model. A dedicated Lumos loss is further introduced to promote photometric consistency within the reconstructed 3D space. Experiments on real-world datasets demonstrate that Lumos3D achieves high-fidelity low-light 3D scene restoration with accurate geometry and strong generalization to unseen cases. Furthermore, the framework naturally extends to handle over-exposure correction, highlighting its versatility for diverse lighting restoration tasks.
[80] From Street to Orbit: Training-Free Cross-View Retrieval via Location Semantics and LLM Guidance
Jeongho Min,Dongyoung Kim,Jaehyup Lee
Main category: cs.CV
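TL;DR: Proposes a training-free street-to-satellite image retrieval framework built on a pretrained vision encoder and a large language model; it outperforms existing methods in the zero-shot setting and can automatically construct semantically aligned cross-view datasets.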
Details
Motivation: Existing cross-view image retrieval methods depend on supervised training and on specific image types, which limits real-world use; a general solution is needed that requires no additional training and works with monocular street-view images. Method: Using a pretrained vision encoder (e.g., DINOv2) and a large language model, the method extracts geographic cues and infers location through web-based image search and the LLM, generates a satellite query via a geocoding API, and retrieves matches using features refined with PCA-based whitening. Result: In the zero-shot setting it outperforms prior supervised learning methods, retrieves cross-view images without ground-truth supervision or finetuning, and can automatically generate aligned street-to-satellite datasets. Conclusion: The method offers a scalable, low-cost cross-view retrieval solution that needs no manual annotation or model finetuning and has good potential for practical deployment. Abstract: Cross-view image retrieval, particularly street-to-satellite matching, is a critical task for applications such as autonomous navigation, urban planning, and localization in GPS-denied environments. However, existing approaches often require supervised training on curated datasets and rely on panoramic or UAV-based images, which limits real-world deployment. In this paper, we present a simple yet effective cross-view image retrieval framework that leverages a pretrained vision encoder and a large language model (LLM), requiring no additional training. Given a monocular street-view image, our method extracts geographic cues through web-based image search and LLM-based location inference, generates a satellite query via a geocoding API, and retrieves matching tiles using a pretrained vision encoder (e.g., DINOv2) with PCA-based whitening feature refinement. Despite using no ground-truth supervision or finetuning, our proposed method outperforms prior learning-based approaches on the benchmark dataset under zero-shot settings. Moreover, our pipeline enables automatic construction of semantically aligned street-to-satellite datasets, offering a scalable and cost-efficient alternative to manual annotation. All source code will be made publicly available at https://jeonghomin.github.io/street2orbit.github.io/.
[81] AHA! Animating Human Avatars in Diverse Scenes with Gaussian Splatting
Aymen Mir,Jian Wang,Riza Alp Guler,Chuan Guo,Gerard Pons-Moll,Bing Zhou
Main category: cs.CV
TL;DR: Proposes a new framework based on 3D Gaussian Splatting (3DGS) for geometry-consistent, free-viewpoint human animation and interaction in 3D scenes.
Details
Motivation: Existing methods mostly use meshes or point clouds, while 3DGS excels at novel-view synthesis but remains under-explored for human-scene animation and interaction. Method: Humans and scenes are both represented as Gaussians; rendering is decoupled from motion synthesis; opacity-based cues and projected Gaussian structures guide pose alignment; and a Gaussian refinement optimization improves interaction realism. Result: The method is validated on Scannet++ and SuperSplat scenes and on avatars reconstructed from multi-view capture, and it supports free-viewpoint rendering and realistic interaction. Conclusion: The framework shows the unique advantage of 3DGS for monocular video-based human animation and achieves high-quality, geometry-consistent human-scene animation. Abstract: We present a novel framework for animating humans in 3D scenes using 3D Gaussian Splatting (3DGS), a neural scene representation that has recently achieved state-of-the-art photorealistic results for novel-view synthesis but remains under-explored for human-scene animation and interaction. Unlike existing animation pipelines that use meshes or point clouds as the underlying 3D representation, our approach introduces the use of 3DGS as the 3D representation to the problem of animating humans in scenes. By representing humans and scenes as Gaussians, our approach allows for geometry-consistent free-viewpoint rendering of humans interacting with 3D scenes. Our key insight is that the rendering can be decoupled from the motion synthesis and each sub-problem can be addressed independently, without the need for paired human-scene data. Central to our method is a Gaussian-aligned motion module that synthesizes motion without explicit scene geometry, using opacity-based cues and projected Gaussian structures to guide human placement and pose alignment. To ensure natural interactions, we further propose a human-scene Gaussian refinement optimization that enforces realistic contact and navigation. We evaluate our approach on scenes from Scannet++ and the SuperSplat library, and on avatars reconstructed from sparse and dense multi-view human capture. Finally, we demonstrate that our framework allows for novel applications such as geometry-consistent free-viewpoint rendering of edited monocular RGB videos with new animated humans, showcasing the unique advantage of 3DGS for monocular video-based human animation.
[82] CertMask: Certifiable Defense Against Adversarial Patches via Theoretically Optimal Mask Coverage
Xuntao Lyu,Ching-Chi Lin,Abdullah Al Arafat,Georg von der Brüggen,Jian-Jia Chen,Zhishan Guo
Main category: cs.CV
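TL;DR: This paper proposes CertMask, a certifiably robust defense that constructs a set of binary masks to counter adversarial image patch attacks, markedly improving efficiency and certified accuracy while retaining theoretical guarantees.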
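The coverage requirement in this entry (every possible patch location hidden by at least k masks) can be checked by brute force on a toy problem. The sketch below works in one dimension, lets masks extend past the image border (equivalent to padding), and uses a simple fixed-stride placement; all of these are simplifying assumptions made for illustration and are not the construction used by CertMask.

```python
def mask_starts(image_len, patch_len, mask_len, k):
    """Start positions of 1-D masks placed at a fixed stride so that every
    patch location is fully covered by at least k masks (toy construction)."""
    slack = mask_len - patch_len
    if slack < 0:
        raise ValueError("mask must be at least as long as the patch")
    stride = (slack + 1) // k
    if stride < 1:
        raise ValueError("mask too short to cover each patch k times")
    # Negative starts mean the mask hangs over the left border of the image.
    return list(range(-slack, image_len - patch_len + 1, stride))

def min_coverage(image_len, patch_len, mask_len, starts):
    """Smallest number of masks that fully cover a patch, over all patch positions."""
    counts = []
    for i in range(image_len - patch_len + 1):
        counts.append(sum(1 for a in starts
                          if a <= i and i + patch_len <= a + mask_len))
    return min(counts)

starts = mask_starts(image_len=10, patch_len=3, mask_len=8, k=3)
print(starts, min_coverage(10, 3, 8, starts))
```

The brute-force check makes the guarantee easy to verify on small cases before reasoning about it analytically; the number of masks grows linearly with the image length in this toy setting.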
Details
Motivation: Adversarial patch attacks can be physically deployed and threaten real-world vision systems, and existing defenses fall short in efficiency and certified performance, so a more efficient solution with theoretical guarantees is needed. Method: CertMask uses a single round of masking with O(n) complexity and builds its binary mask set with a mathematically rigorous coverage strategy that ensures every possible patch location is covered at least k times; the coverage condition is proven sufficient for certified robustness. Result: Experiments on ImageNet, ImageNette, and CIFAR-10 show that CertMask improves certified robust accuracy by up to 13.4% over PatchCleanser while keeping clean accuracy close to that of the original model. Conclusion: CertMask strikes a better balance among theoretical guarantees, computational efficiency, and defensive performance, making it an efficient, certifiable defense against adversarial patch attacks. Abstract: Adversarial patch attacks inject localized perturbations into images to mislead deep vision models. These attacks can be physically deployed, posing serious risks to real-world applications. In this paper, we propose CertMask, a certifiably robust defense that constructs a provably sufficient set of binary masks to neutralize patch effects with strong theoretical guarantees. While the state-of-the-art approach (PatchCleanser) requires two rounds of masking and incurs $O(n^2)$ inference cost, CertMask performs only a single round of masking with $O(n)$ time complexity, where $n$ is the cardinality of the mask set to cover an input image. Our proposed mask set is computed using a mathematically rigorous coverage strategy that ensures each possible patch location is covered at least $k$ times, providing both efficiency and robustness. We offer a theoretical analysis of the coverage condition and prove its sufficiency for certification. Experiments on ImageNet, ImageNette, and CIFAR-10 show that CertMask improves certified robust accuracy by up to +13.4% over PatchCleanser, while maintaining clean accuracy nearly identical to the vanilla model.
[83] CORONA-Fields: Leveraging Foundation Models for Classification of Solar Wind Phenomena
Daniela Martin,Jinsu Hong,Connor O'Brien,Valmir P Moraes Filho,Jasmine R. Kobayashi,Evangelia Samara,Joseph Gallego
Main category: cs.CV
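TL;DR: This study is a first attempt to apply embeddings from a solar-physics foundation model to classifying in situ solar wind structures; it encodes spacecraft position and magnetic connectivity with Fourier features to build a neural field model, fine-tunes on Parker Solar Probe data, and demonstrates the feasibility of the approach.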
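For readers unfamiliar with Fourier feature encoding, the sketch below shows the standard sinusoidal form: each scalar coordinate is mapped to sine and cosine values at several frequencies, and the result is concatenated with an image embedding before being passed to a downstream network. The power-of-two frequency schedule, the number of bands, and the placeholder embedding values are assumptions for illustration; the entry does not specify which variant the authors use.

```python
import math

def fourier_features(coords, num_bands):
    """Encode each scalar coordinate as [sin(2^b * pi * x), cos(2^b * pi * x)]
    for b = 0 .. num_bands - 1 (a common, but here assumed, frequency schedule)."""
    feats = []
    for x in coords:
        for b in range(num_bands):
            angle = (2 ** b) * math.pi * x
            feats.extend([math.sin(angle), math.cos(angle)])
    return feats

# Hypothetical inputs: a normalized 3-D spacecraft position and a tiny embedding.
position = [0.1, 0.2, 0.3]
embedding = [0.5, -1.2, 0.7, 0.0]  # placeholder values, not real model output

model_input = embedding + fourier_features(position, num_bands=4)
print(len(model_input))
```

Each coordinate contributes 2 * num_bands values, so the three-coordinate position above adds 24 features to the 4-dimensional placeholder embedding.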
Details
Motivation: Space weather poses growing risks to satellites and ground infrastructure, and the complexity of the solar wind and coronal mass ejections makes their automated classification challenging; new methods that combine remote sensing with in situ observations are needed. Method: A foundation model pretrained on Solar Dynamics Observatory imagery produces embeddings, which are concatenated with spacecraft position and solar magnetic connectivity encoded using Fourier features to build a neural field deep learning model; the model is fine-tuned on Parker Solar Probe measurements to classify solar wind structures. Result: Classification performance is limited (likely because of coarse labels, class imbalance, and limited transferability of the pretrained model), but the study demonstrates that foundation model embeddings can be used for in situ solar wind tasks. Conclusion: As a proof of concept, the study lays the groundwork for more reliable space weather prediction methods, and the code is public to support reproducibility. Abstract: Space weather at Earth, driven by solar activity, poses growing risks to satellites around our planet as well as to critical ground-based technological infrastructure. Major space weather contributors are the solar wind and coronal mass ejections whose variable density, speed, temperature, and magnetic field make the automated classification of those structures challenging. In this work, we adapt a foundation model for solar physics, originally trained on Solar Dynamics Observatory imagery, to create embeddings suitable for solar wind structure analysis. These embeddings are concatenated with the spacecraft position and solar magnetic connectivity encoded using Fourier features, which yields a neural field-based model. The full deep learning architecture is fine-tuned, bridging the gap between remote sensing and in situ observations. Labels are derived from Parker Solar Probe measurements, forming a downstream classification task that maps plasma properties to solar wind structures. Although overall classification performance is modest, likely due to coarse labeling, class imbalance, and limited transferability of the pretrained model, this study demonstrates the feasibility of leveraging foundation model embeddings for in situ solar wind tasks. As a first proof-of-concept, it lays the groundwork for future improvements toward more reliable space weather predictions. The code and configuration files used in this study are publicly available to support reproducibility.
[84] IPCD: Intrinsic Point-Cloud Decomposition
Shogo Sato,Takuhiro Kaneko,Shoichiro Takeda,Tomoyasu Shimada,Kazuhiko Murasaki,Taiga Yoshida,Ryuichi Tanida,Akisato Kimura
Main category: cs.CV
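TL;DR: This paper proposes Intrinsic Point-Cloud Decomposition (IPCD) for colored point clouds. Using IPCD-Net and a Projection-based Luminance Distribution (PLD), it separates albedo from shade, addresses non-grid data processing and global-light direction modeling, and is validated on tasks such as realistic texture editing and relighting.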
Details
Motivation: The non-grid structure of point clouds, together with the fact that existing models ignore global-light direction, makes it hard to separate albedo from shade accurately, which limits applications in AR, robotics, and related fields.
Method: IPCD-Net handles non-grid data with point-wise feature aggregation; a PLD module captures global-light information through multi-view projection and uses a hierarchical feature refinement strategy.
Result: Experiments on a synthetic outdoor-scene dataset show that IPCD-Net reduces cast shadows in albedo, improves color accuracy in shade, and works well in texture editing, relighting, and point-cloud registration under varying illumination.
Conclusion: The IPCD framework achieves, for the first time, end-to-end intrinsic decomposition for point clouds, accounting for both structural properties and illumination modeling, and significantly improves decomposition quality and practical potential.
Abstract: Point clouds are widely used in various fields, including augmented reality (AR) and robotics, where relighting and texture editing are crucial for realistic visualization. Achieving these tasks requires accurately separating albedo from shade. However, performing this separation on point clouds presents two key challenges: (1) the non-grid structure of point clouds makes conventional image-based decomposition models ineffective, and (2) point-cloud models designed for other tasks do not explicitly consider global-light direction, resulting in inaccurate shade. In this paper, we introduce Intrinsic Point-Cloud Decomposition (IPCD), which extends image decomposition to the direct decomposition of colored point clouds into albedo and shade. To overcome challenge (1), we propose IPCD-Net, which extends an image-based model with point-wise feature aggregation for non-grid data processing. For challenge (2), we introduce Projection-based Luminance Distribution (PLD) with a hierarchical feature refinement, capturing global-light cues via multi-view projection. For comprehensive evaluation, we create a synthetic outdoor-scene dataset. Experimental results demonstrate that IPCD-Net reduces cast shadows in albedo and enhances color accuracy in shade. Furthermore, we showcase its applications in texture editing, relighting, and point-cloud registration under varying illumination. Finally, we verify the real-world applicability of IPCD-Net.
[85] Remember Me: Bridging the Long-Range Gap in LVLMs with Three-Step Inference-Only Decay Resilience Strategies
Peng Gao,Yujian Lee,Xiaofeng Zhang,Zailong Chen,Hui Zhang
Main category: cs.CV
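TL;DR: This paper proposes training-free Three-step Decay Resilience Strategies (T-DRS) to address the long-range attention decay that large vision-language models (LVLMs) exhibit when using Rotary Positional Encoding (ROPE), improving the model's memory of global context and its VQA performance.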
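The long-range decay that motivates T-DRS can be observed in a small, self-contained sketch. This illustrates only the ROPE behavior, not T-DRS itself: the all-ones query and key, the head dimension of 64, the base of 10000, and the names `rope_rotate` and `rope_logit` are assumptions chosen for the example.

```python
import math

def rope_rotate(vec, position, base=10000.0):
    """Rotate consecutive pairs of vec by position-dependent angles (rotary encoding)."""
    dim = len(vec)
    out = []
    for i in range(0, dim, 2):
        theta = base ** (-i / dim)          # per-pair rotation frequency
        angle = position * theta
        c, s = math.cos(angle), math.sin(angle)
        x, y = vec[i], vec[i + 1]
        out.extend([x * c - y * s, x * s + y * c])
    return out

def rope_logit(distance, dim=64):
    """Attention logit between identical all-ones query/key placed `distance` tokens apart."""
    q = rope_rotate([1.0] * dim, 0)
    k = rope_rotate([1.0] * dim, distance)
    return sum(a * b for a, b in zip(q, k))

# Print the logit at increasing relative distances; content is identical, only position differs.
for d in (0, 1, 16, 256, 4096):
    print(d, round(rope_logit(d), 3))
```

Because the query and key contents are identical here, any drop in the logit comes from relative position alone, which is the effect an inference-only strategy would need to counteract for distant but relevant tokens.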
Details
Motivation: When LVLMs use ROPE, attention decays progressively as token distance grows, which weakens the modeling of long-range semantic dependencies and global context and limits performance on complex multimodal tasks.
Method: T-DRS is applied at inference time only and has three modules: (1) semantic-driven SD-DRS, which amplifies important but distant semantic signals through content-aware residuals; (2) distance-aware control DC-DRS, which smoothly modulates attention weights by positional distance to suppress noise while preserving locality; and (3) re-Reinforce Distant reRD-DRS, which consolidates the remaining remote dependencies to maintain global coherence.
Result: T-DRS is validated on multiple visual question answering (VQA) benchmarks, where it consistently improves LVLM performance without training, especially on tasks that require long-range dependencies.
Conclusion: T-DRS mitigates the long-range attention decay induced by ROPE, recovering key long-range dependencies while keeping local inductive biases, and offers an efficient plug-and-play inference-time way to improve existing LVLMs.
Abstract: Large Vision-Language Models (LVLMs) have achieved impressive performance across a wide range of multimodal tasks. However, they still face critical challenges in modeling long-range dependencies when using Rotary Positional Encoding (ROPE). Although ROPE facilitates precise modeling of token positions, it induces progressive attention decay as token distance increases, especially over distant token pairs, which severely impairs the model's ability to remember global context. To alleviate this issue, we propose inference-only Three-step Decay Resilience Strategies (T-DRS), comprising (1) Semantic-Driven DRS (SD-DRS), which amplifies semantically meaningful but distant signals via content-aware residuals, (2) Distance-aware Control DRS (DC-DRS), which purifies attention by smoothly modulating weights based on positional distances, suppressing noise while preserving locality, and (3) re-Reinforce Distant DRS (reRD-DRS), which consolidates the remaining informative remote dependencies to maintain global coherence. Together, the T-DRS recover suppressed long-range token pairs without harming local inductive biases. Extensive experiments on Vision Question Answering (VQA) benchmarks demonstrate that T-DRS can consistently improve performance in a training-free manner. The code can be accessed at https://github.com/labixiaoq-qq/Remember-me
[86] SAM-DAQ: Segment Anything Model with Depth-guided Adaptive Queries for RGB-D Video Salient Object Detection
Jia Lin,Xiaofei Zhou,Jiyuan Liu,Runmin Cong,Guodao Zhang,Zhi Liu,Jiyong Zhang
Main category: cs.CV
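TL;DR: Proposes SAM-DAQ, a Segment Anything Model with depth-guided adaptive queries, to address the reliance on manual prompts, high memory consumption, and heavy computational burden in RGB-D video salient object detection.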
Details
Motivation: Existing attempts to apply SAM to RGB-D video salient object detection face three challenges: dependence on manual prompts, high memory consumption of sequential adapters, and the computational overhead of memory attention. A more efficient, automated way to fuse depth and temporal information is needed.
Method: SAM-DAQ has two core modules: (1) depth-guided parallel adapters (DPAs), which fuse multi-modal features in a skip-connection manner and fine-tune the SAM encoder under prompt-free conditions; and (2) a query-driven temporal memory (QTM) module, which unifies the memory bank and prompt embeddings and uses frame-level and video-level queries to extract temporally consistent features and update the query representations.
Result: On three RGB-D VSOD datasets, SAM-DAQ outperforms existing state-of-the-art methods on all evaluation metrics.
Conclusion: By introducing depth guidance and query-driven temporal modeling, SAM-DAQ improves SAM's performance on RGB-D video salient object detection while reducing the dependence on manual prompts and the consumption of computational resources.
Abstract: Recently, the Segment Anything Model (SAM) has attracted widespread attention, and it is often treated as a vision foundation model for universal segmentation. Some researchers have attempted to directly apply the foundation model to the RGB-D video salient object detection (RGB-D VSOD) task, which often encounters three challenges: the dependence on manual prompts, the high memory consumption of sequential adapters, and the computational burden of memory attention. To address these limitations, we propose a novel method, namely Segment Anything Model with Depth-guided Adaptive Queries (SAM-DAQ), which adapts SAM2 to pop out salient objects from videos by seamlessly integrating depth and temporal cues within a unified framework. Firstly, we deploy a parallel adapter-based multi-modal image encoder (PAMIE), which incorporates several depth-guided parallel adapters (DPAs) in a skip-connection manner. Remarkably, we fine-tune the frozen SAM encoder under prompt-free conditions, where the DPA utilizes depth cues to facilitate the fusion of multi-modal features. Secondly, we deploy a query-driven temporal memory (QTM) module, which unifies the memory bank and prompt embeddings into a learnable pipeline. Concretely, by leveraging both frame-level queries and video-level queries simultaneously, the QTM module can not only selectively extract temporal consistency features but also iteratively update the temporal representations of the queries. Extensive experiments are conducted on three RGB-D VSOD datasets, and the results show that the proposed SAM-DAQ consistently outperforms state-of-the-art methods in terms of all evaluation metrics.
[87] RWKV-PCSSC: Exploring RWKV Model for Point Cloud Semantic Scene Completion
Wenzhe He,Xiaojun Chen,Wentang Chen,Hongyu Wang,Ying Liu,Ruihui Li
Main category: cs.CV
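TL;DR: Proposes RWKV-PCSSC, a lightweight point cloud semantic scene completion network based on the RWKV mechanism. Its RWKV-SG and RWKV-PD modules perform efficient feature aggregation and progressive restoration, reaching state-of-the-art performance on several indoor and outdoor datasets while substantially reducing parameter count and improving memory efficiency.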
Details
Motivation: Existing semantic scene completion methods mostly use dense network architectures, which leads to high model complexity and resource consumption; lighter, more efficient solutions are needed.
Method: RWKV-PCSSC comprises an RWKV Seed Generator (RWKV-SG), which produces a coarse point cloud with coarse features from a partial point cloud, and multi-stage RWKV Point Deconvolution (RWKV-PD) modules, which progressively restore point-wise features; the overall design is compact and efficient.
Result: Compared with PointSSC, the parameter count is reduced by 4.18× and memory efficiency improves by 1.37×, with state-of-the-art performance on SSC-PC, NYUCAD-PC, PointSSC, NYUCAD-PC-V2, and 3D-FRONT-PC.
Conclusion: By introducing an RWKV-based lightweight design, RWKV-PCSSC keeps high performance while greatly reducing model complexity, offering an efficient and scalable solution for semantic scene completion.
Abstract: Semantic Scene Completion (SSC) aims to generate a complete semantic scene from an incomplete input. Existing approaches often employ dense network architectures with a high parameter count, leading to increased model complexity and resource demands. To address these limitations, we propose RWKV-PCSSC, a lightweight point cloud semantic scene completion network inspired by the Receptance Weighted Key Value (RWKV) mechanism. Specifically, we introduce an RWKV Seed Generator (RWKV-SG) module that can aggregate features from a partial point cloud to produce a coarse point cloud with coarse features. Subsequently, the point-wise features of the point cloud are progressively restored through multiple stages of the RWKV Point Deconvolution (RWKV-PD) modules. By leveraging a compact and efficient design, our method achieves a lightweight model representation. Experimental results demonstrate that RWKV-PCSSC reduces the parameter count by 4.18× and improves memory efficiency by 1.37× compared to the state-of-the-art method PointSSC. Furthermore, our network achieves state-of-the-art performance on established indoor (SSC-PC, NYUCAD-PC) and outdoor (PointSSC) scene datasets, as well as on our proposed datasets (NYUCAD-PC-V2, 3D-FRONT-PC).
[88] HCC-3D: Hierarchical Compensatory Compression for 98% 3D Token Reduction in Vision-Language Models
Liheng Zhang,Jin Wang,Hui Li,Bingfeng Zhang,Weifeng Liu
Main category: cs.CV
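TL;DR: This paper proposes HCC-3D, a hierarchical compensatory compression method that reduces the computational cost of processing 3D point cloud data in 3D vision-language models while preserving the integrity of key information.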
Details
Motivation: Existing 3D-VLMs feed all 3D tokens into the large language model, which makes computation too expensive for practical use; an efficient compression method is needed to relieve this bottleneck.
Method: The HCC-3D framework has two modules, global structure compression (GSC) and adaptive detail mining (ADM): GSC uses global queries to compress the many 3D tokens into a few key tokens that retain overall structure, and ADM uses complementary scoring to selectively recover important detail features that were overlooked.
Result: Experiments show that HCC-3D achieves an extreme compression ratio of about 98% compared with previous methods while reaching state-of-the-art performance on multiple benchmarks, improving both efficiency and accuracy.
Conclusion: HCC-3D addresses the excessive computational cost of 3D-VLMs, maintaining or even improving model performance under heavy compression, and offers a new direction for efficient 3D multimodal understanding.
Abstract: 3D understanding has drawn significant attention recently, leveraging Vision-Language Models (VLMs) to enable multi-modal reasoning between point cloud and text data. Current 3D-VLMs directly embed the 3D point clouds into 3D tokens, following large 2D-VLMs with powerful reasoning capabilities. However, this framework has a high computational cost that limits its application, and we identify that the bottleneck lies in processing all 3D tokens in the Large Language Model (LLM) part. This raises the question: how can we reduce the computational overhead introduced by 3D tokens while preserving the integrity of their essential information? To address this question, we introduce Hierarchical Compensatory Compression (HCC-3D) to efficiently compress 3D tokens while maintaining critical detail retention. Specifically, we first propose a global structure compression (GSC), in which we design global queries to compress all 3D tokens into a few key tokens while keeping overall structural information. Then, to compensate for the information loss in GSC, we further propose an adaptive detail mining (ADM) module that selectively recompresses salient but under-attended features through complementary scoring. Extensive experiments demonstrate that HCC-3D not only achieves extreme compression ratios (approximately 98%) compared to previous 3D-VLMs, but also achieves new state-of-the-art performance, showing great improvements in both efficiency and performance.
[89] Scale-Aware Relay and Scale-Adaptive Loss for Tiny Object Detection in Aerial Images
Jinfu Li,Yuqi Huang,Hong Song,Ting Wang,Jianghan Xia,Yucong Lin,Jingfan Fan,Jian Yang
Main category: cs.CV
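TL;DR: This paper proposes a Scale-Aware Relay Layer (SARL) and a Scale-Adaptive Loss (SAL) for tiny object detection in aerial images, improving existing detectors on multiple benchmarks, particularly for tiny objects and noisy data.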
Details
Motivation: Modern detectors struggle with tiny objects in aerial images, mainly because tiny objects carry few features that are easily lost during network propagation, and because smaller objects receive disproportionately greater regression penalties than larger ones during training.
Method: SARL uses a cross-scale spatial-channel attention mechanism to enrich the features of each layer and promote cross-layer feature sharing; SAL dynamically lowers the loss weight of larger objects so that training focuses on tiny objects.
Result: On three benchmarks (AI-TOD, DOTA-v2.0, and VisDrone2019), embedding the method in YOLOv5 and YOLOx improves average precision (AP) by 5.5%, and it reaches 29.0% AP on the real-world noisy dataset AI-TOD-v2.0.
Conclusion: SARL and SAL improve tiny object detection, generalize well, are robust, and fit mainstream detection frameworks.
Abstract: Recently, despite the remarkable advancements in object detection, modern detectors still struggle to detect tiny objects in aerial images. One key reason is that tiny objects carry limited features that are inevitably degraded or lost during long-distance network propagation. Another is that smaller objects receive disproportionately greater regression penalties than larger ones during training. To tackle these issues, we propose a Scale-Aware Relay Layer (SARL) and a Scale-Adaptive Loss (SAL) for tiny object detection, both of which are seamlessly compatible with the top-performing frameworks. Specifically, SARL employs a cross-scale spatial-channel attention to progressively enrich the meaningful features of each layer and strengthen the cross-layer feature sharing. SAL reshapes the vanilla IoU-based losses so as to dynamically assign lower weights to larger objects. This loss is able to focus training on tiny objects while reducing the influence on large objects. Extensive experiments are conducted on three benchmarks (i.e., AI-TOD, DOTA-v2.0 and VisDrone2019), and the results demonstrate that the proposed method boosts the generalization ability by 5.5% Average Precision (AP) when embedded in YOLOv5 (anchor-based) and YOLOx (anchor-free) baselines. Moreover, it also promotes robust performance, with 29.0% AP on the real-world noisy dataset (i.e., AI-TOD-v2.0).
[90] Regional Attention-Enhanced Swin Transformer for Clinically Relevant Medical Image Captioning
Zubia Naz,Farhan Asghar,Muhammad Ishfaq Hussain,Yahya Hadadi,Muhammad Aasim Rafique,Wookjin Choi,Moongu Jeon
Main category: cs.CV
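TL;DR: Proposes a Swin-BART encoder-decoder system with a lightweight regional attention module that achieves state-of-the-art semantic fidelity in medical image captioning, with good interpretability and clinical applicability.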
Details
Motivation: To improve the accuracy and interpretability of automatically generated diagnostic descriptions for medical images and to support radiology reporting workflows, a model is needed that highlights key diagnostic regions and generates high-quality text.
Method: Swin-BART serves as the encoder-decoder framework, with a lightweight regional attention module that strengthens attention on diagnostically salient regions before cross-attention; the model is trained and evaluated on the ROCO dataset, decoded with beam search, and studied through ablations, per-modality analysis, and significance tests.
Result: The model clearly outperforms the ResNet-CNN and BLIP2-OPT baselines on ROUGE (0.603 vs. 0.356 for ResNet-CNN and 0.255 for BLIP2-OPT) and BERTScore (0.807 vs. 0.623 for ResNet-CNN and 0.645 for BLIP2-OPT), with comparable BLEU, CIDEr, and METEOR; heatmap visualizations show the regions driving each description, supporting the model's interpretability.
Conclusion: The method generates accurate captions that follow clinical phrasing and provides transparent regional attribution, making it suitable for safe research use under human supervision.
Abstract: Automated medical image captioning translates complex radiological images into diagnostic narratives that can support reporting workflows. We present a Swin-BART encoder-decoder system with a lightweight regional attention module that amplifies diagnostically salient regions before cross-attention. Trained and evaluated on ROCO, our model achieves state-of-the-art semantic fidelity while remaining compact and interpretable. We report results as mean±std over three seeds and include 95% confidence intervals. Compared with baselines, our approach improves ROUGE (proposed 0.603, ResNet-CNN 0.356, BLIP2-OPT 0.255) and BERTScore (proposed 0.807, BLIP2-OPT 0.645, ResNet-CNN 0.623), with competitive BLEU, CIDEr, and METEOR. We further provide ablations (regional attention on/off and token-count sweep), per-modality analysis (CT/MRI/X-ray), paired significance tests, and qualitative heatmaps that visualize the regions driving each description. Decoding uses beam search (beam size = 4), length penalty = 1.1, no_repeat_ngram_size = 3, and max length = 128. The proposed design yields accurate, clinically phrased captions and transparent regional attributions, supporting safe research use with a human in the loop.
[91] Simulating Distribution Dynamics: Liquid Temporal Feature Evolution for Single-Domain Generalized Object Detection
Zihao Zhang,Yang Li,Aming Wu,Yahong Han
Main category: cs.CV
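TL;DR: This paper proposes Liquid Temporal Feature Evolution (LTFE), a new method for Single-Domain Generalized Object Detection (Single-DGOD). By introducing temporal modeling and a liquid neural network-driven parameter adjustment mechanism, it simulates the progressive evolution of features from the source domain to latent distributions and handles continuous domain shift.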
Details
Motivation: Existing Single-DGOD methods rely on discrete data augmentation or static perturbation and struggle to capture the continuous, gradual domain shifts seen in real scenes (such as changes in weather or lighting), which limits the model's ability to perceive fine-grained cross-domain differences.
Method: Liquid Temporal Feature Evolution (LTFE) first simulates initial feature perturbations with controllable Gaussian noise injection and multi-scale Gaussian blurring, then combines temporal modeling with the dynamic parameter adjustment mechanism of a liquid neural network to generate adaptive modulation parameters, enabling smooth and continuous adaptation across domains.
Result: The method achieves significant performance gains on the Diverse Weather dataset and the Real-to-Art benchmark, supporting its generalization and robustness on unseen domains.
Conclusion: By modeling the continuous evolution of features and dynamically regulating adaptation paths, LTFE narrows the distribution gap between the source domain and unknown domains and offers a better solution for Single-DGOD.
Abstract: In this paper, we focus on Single-Domain Generalized Object Detection (Single-DGOD), aiming to transfer a detector trained on one source domain to multiple unknown domains. Existing methods for Single-DGOD typically rely on discrete data augmentation or static perturbation methods to expand data diversity, thereby mitigating the lack of access to target domain data. However, in real-world scenarios such as changes in weather or lighting conditions, domain shifts often occur continuously and gradually. Discrete augmentations and static perturbations fail to effectively capture the dynamic variation of feature distributions, thereby limiting the model's ability to perceive fine-grained cross-domain differences. To this end, we propose a new method, Liquid Temporal Feature Evolution, which simulates the progressive evolution of features from the source domain to simulated latent distributions by incorporating temporal modeling and liquid neural network-driven parameter adjustment. Specifically, we introduce controllable Gaussian noise injection and multi-scale Gaussian blurring to simulate initial feature perturbations, followed by temporal modeling and a liquid parameter adjustment mechanism to generate adaptive modulation parameters, enabling a smooth and continuous adaptation across domains. By capturing progressive cross-domain feature evolution and dynamically regulating adaptation paths, our method bridges the source-unknown domain distribution gap, significantly boosting generalization and robustness to unseen shifts. Significant performance improvements on the Diverse Weather dataset and Real-to-Art benchmark demonstrate the superiority of our method. Our code is available at https://github.com/2490o/LTFE.
[92] MosaicDoc: A Large-Scale Bilingual Benchmark for Visually Rich Document Understanding
Ketong Chen,Yuhao Chen,Yang Xue
Main category: cs.CV
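TL;DR: This paper presents DocWeaver, a multi-agent pipeline that uses large language models to automatically generate MosaicDoc, a large-scale bilingual (Chinese and English) benchmark for visually rich document understanding. MosaicDoc contains 72K images and over 600K QA pairs sourced from newspapers and magazines, features complex layouts and diverse styles, supports multi-task evaluation (OCR, VQA, reading order, and localization), and reveals the shortcomings of existing vision-language models on real-world document complexity.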
Details
Motivation: Existing vision-language model benchmarks are mostly English-centric, use simple layouts, and cover limited tasks, so they cannot properly evaluate Visually Rich Document Understanding (VRDU) under complex layouts and dense text. A more challenging and representative benchmark is needed.
Method: The DocWeaver multi-agent pipeline uses large language models to build the MosaicDoc benchmark automatically. The benchmark contains bilingual document images collected from newspapers and magazines, covers complex and varied layouts, and provides multi-task annotations for OCR, visual question answering (VQA), reading order, and localization.
Result: MosaicDoc contains 72K images and over 600K QA pairs, covering rich styles and complex layouts from 196 publishers. Experiments on several state-of-the-art vision-language models show that existing models handle real-world document complexity poorly.
Conclusion: MosaicDoc provides a more challenging and realistic evaluation benchmark for visually rich document understanding, exposes the limits of current vision-language models in complex-layout and multilingual settings, and points to directions for future research.
Abstract: Despite the rapid progress of Vision-Language Models (VLMs), their capabilities are inadequately assessed by existing benchmarks, which are predominantly English-centric, feature simplistic layouts, and support limited tasks. Consequently, they fail to evaluate model performance for Visually Rich Document Understanding (VRDU), a critical challenge involving complex layouts and dense text. To address this, we introduce DocWeaver, a novel multi-agent pipeline that leverages Large Language Models to automatically generate a new benchmark. The result is MosaicDoc, a large-scale, bilingual (Chinese and English) resource designed to push the boundaries of VRDU. Sourced from newspapers and magazines, MosaicDoc features diverse and complex layouts (including multi-column and non-Manhattan), rich stylistic variety from 196 publishers, and comprehensive multi-task annotations (OCR, VQA, reading order, and localization). With 72K images and over 600K QA pairs, MosaicDoc serves as a definitive benchmark for the field. Our extensive evaluation of state-of-the-art models on this benchmark reveals their current limitations in handling real-world document complexity and charts a clear path for future research.
[93] Compensating Distribution Drifts in Class-incremental Learning of Pre-trained Vision Transformers
Xuan Rao,Simian Xu,Zheng Li,Bo Zhao,Derong Liu,Mingming Ha,Cesare Alippi
Main category: cs.CV
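TL;DR: Proposes SLDC (Sequential Learning with Drift Compensation) for class-incremental learning, which introduces a latent space transition operator to compensate for distribution drift and, combined with knowledge distillation, significantly improves the performance of SeqFT.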
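The abstract for this entry describes a linear SLDC variant that learns a linear operator by solving a regularized least-squares problem mapping features before fine-tuning to features after it. One standard way to write such a problem is the ridge form below. This is a sketch under assumed notation: $F_{\text{old}}$ and $F_{\text{new}}$ (features of the same samples before and after fine-tuning, one row per sample), the operator $W$, and the weight $\lambda$ are symbols introduced here for illustration, and the entry does not state the exact regularizer the authors use.

```latex
W^{*} = \arg\min_{W} \; \lVert F_{\text{old}} W - F_{\text{new}} \rVert_F^{2} + \lambda \lVert W \rVert_F^{2}
      = \left( F_{\text{old}}^{\top} F_{\text{old}} + \lambda I \right)^{-1} F_{\text{old}}^{\top} F_{\text{new}}
```

Under these assumptions, stored statistics of earlier classes (for example a class mean $\mu$) could be moved into the updated feature space as $\mu W^{*}$ before the classifier is refined.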
Details
Motivation: Existing sequential fine-tuning methods for class-incremental learning are prone to distribution drift, which leaves the feature distributions of earlier classes mismatched with the updated model and degrades classifier performance.
Method: Linear and weakly nonlinear SLDC variants are proposed: the linear variant learns a linear operator by solving a regularized least-squares problem, and the weakly nonlinear variant uses learnable weakly nonlinear mappings to balance flexibility and generalization; knowledge distillation is added to reduce representation drift.
Result: Experiments on several standard CIL benchmarks show that SLDC significantly improves SeqFT, and with KD the performance approaches that of joint training.
Conclusion: SLDC mitigates distribution drift and improves sequential fine-tuning in class-incremental learning, offering a more stable solution for practical use.
Abstract: Recent advances have shown that sequential fine-tuning (SeqFT) of pre-trained vision transformers (ViTs), followed by classifier refinement using approximate distributions of class features, can be an effective strategy for class-incremental learning (CIL). However, this approach is susceptible to distribution drift, caused by the sequential optimization of shared backbone parameters. This results in a mismatch between the distributions of the previously learned classes and that of the updated model, ultimately degrading classifier performance over time. To address this issue, we introduce a latent space transition operator and propose Sequential Learning with Drift Compensation (SLDC). SLDC aims to align feature distributions across tasks to mitigate the impact of drift. First, we present a linear variant of SLDC, which learns a linear operator by solving a regularized least-squares problem that maps features before and after fine-tuning. Next, we extend this with a weakly nonlinear SLDC variant, which assumes that the ideal transition operator lies between purely linear and fully nonlinear transformations. This is implemented using learnable, weakly nonlinear mappings that balance flexibility and generalization. To further reduce representation drift, we apply knowledge distillation (KD) in both algorithmic variants. Extensive experiments on standard CIL benchmarks demonstrate that SLDC significantly improves the performance of SeqFT. Notably, by combining KD to address representation drift with SLDC to compensate distribution drift, SeqFT achieves performance comparable to joint training across all evaluated datasets. Code: https://github.com/raoxuan98-hash/sldc.git.
[94] Debiased Dual-Invariant Defense for Adversarially Robust Person Re-Identification
Yuhang Zhou,Yanxiang Zhao,Zhongyun Hua,Zhipu Liu,Zhaoquan Gu,Qing Liao,Leo Yu Zhang
Main category: cs.CV
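TL;DR: This paper proposes a debiased dual-invariant defense framework against adversarial attacks on person re-identification (ReID). Through data balancing and a bi-adversarial self-meta defense mechanism, it achieves superior robustness to unseen identities and unseen attack types.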
Details
Motivation: Existing adversarial defenses have progressed on classification tasks but are underexplored for metric learning tasks such as person ReID, and they do not address ReID-specific challenges such as model bias and composite generalization requirements.
Method: The debiased dual-invariant defense framework has two phases: the first uses a diffusion-model-based data resampling strategy to mitigate model bias; the second introduces metric adversarial training with farthest negative extension softening, combined with an adversarially enhanced self-meta mechanism, to generalize to both unseen identities and unseen attack types.
Result: Experiments show that the method significantly outperforms existing state-of-the-art defenses under multiple attacks and improves the adversarial robustness of ReID models.
Conclusion: The framework addresses model bias and insufficient generalization in ReID and offers a new approach to building secure and reliable person re-identification systems.
Abstract: Person re-identification (ReID) is a fundamental task in many real-world applications such as pedestrian trajectory tracking. However, advanced deep learning-based ReID models are highly susceptible to adversarial attacks, where imperceptible perturbations to pedestrian images can cause entirely incorrect predictions, posing significant security threats. Although numerous adversarial defense strategies have been proposed for classification tasks, their extension to metric learning tasks such as person ReID remains relatively unexplored. Moreover, the few existing defenses for person ReID fail to address the unique challenges inherent in adversarially robust ReID. In this paper, we systematically distill the challenges of adversarial defense in person ReID into two key issues: model bias and composite generalization requirements. To address them, we propose a debiased dual-invariant defense framework composed of two main phases. In the data balancing phase, we mitigate model bias using a diffusion-model-based data resampling strategy that promotes fairness and diversity in training data. In the bi-adversarial self-meta defense phase, we introduce a novel metric adversarial training approach incorporating farthest negative extension softening to overcome the robustness degradation caused by the absence of a classifier. Additionally, we introduce an adversarially-enhanced self-meta mechanism to achieve dual-generalization for both unseen identities and unseen attack types. Experiments demonstrate that our method significantly outperforms existing state-of-the-art defenses.
[95] AdaptViG: Adaptive Vision GNN with Exponential Decay Gating
Mustafa Munir,Md Mostafijur Rahman,Radu Marculescu
Main category: cs.CV
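TL;DR: Proposes AdaptViG, an efficient hybrid Vision GNN that reaches a new state of the art in the accuracy-efficiency trade-off through adaptive graph convolution and a dynamic gating mechanism.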
Details
Motivation: ViGs are promising as vision architectures, but their graph construction phase is computationally expensive and hurts efficiency, so a more efficient approach is needed.
Method: Adaptive Graph Convolution builds on a static axial scaffold and a content-aware Exponential Decay Gating mechanism; a hybrid strategy uses the efficient gating in the early stages and global attention in the final stage.
Result: AdaptViG-M reaches 82.6% top-1 accuracy on ImageNet with 80% fewer parameters and 84% fewer GMACs than ViG-B, and it also surpasses larger models on downstream tasks.
Conclusion: AdaptViG improves the balance between efficiency and performance in Vision GNNs and provides a practical, efficient architecture for vision graph networks.
Abstract: Vision Graph Neural Networks (ViGs) offer a new direction for advancements in vision architectures. While powerful, ViGs often face substantial computational challenges stemming from their graph construction phase, which can hinder their efficiency. To address this issue, we propose AdaptViG, an efficient and powerful hybrid Vision GNN that introduces a novel graph construction mechanism called Adaptive Graph Convolution. This mechanism builds upon a highly efficient static axial scaffold and a dynamic, content-aware gating strategy called Exponential Decay Gating. This gating mechanism selectively weighs long-range connections based on feature similarity. Furthermore, AdaptViG employs a hybrid strategy, utilizing our efficient gating mechanism in the early stages and a full Global Attention block in the final stage for maximum feature aggregation. Our method achieves a new state-of-the-art trade-off between accuracy and efficiency among Vision GNNs. For instance, our AdaptViG-M achieves 82.6% top-1 accuracy, outperforming ViG-B by 0.3% while using 80% fewer parameters and 84% fewer GMACs. On downstream tasks, AdaptViG-M obtains 45.8 mIoU, 44.8 APbox, and 41.1 APmask, surpassing the much larger EfficientFormer-L7 by 0.7 mIoU, 2.2 APbox, and 2.1 APmask, respectively, with 78% fewer parameters.
[96] TSPE-GS: Probabilistic Depth Extraction for Semi-Transparent Surface Reconstruction via 3D Gaussian Splatting
Zhiyuan Xu,Nan Min,Yuhang Guo,Tong Wei
Main category: cs.CV
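TL;DR: Proposes TSPE-GS, which models a pixel-wise multi-modal distribution of opacity and depth to resolve the depth ambiguity in reconstructing semi-transparent surfaces with 3D Gaussian Splatting.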
Details
Motivation: Conventional methods assume a single depth per pixel and cannot handle cases where multiple surfaces are visible, which makes semi-transparent surfaces hard to reconstruct.
Method: The method uniformly samples transmittance to model a pixel-wise multi-modal distribution of opacity and depth, and reconstructs external and internal surfaces separately by fusing truncated signed distance functions.
Result: Experiments on public and self-collected datasets show that TSPE-GS significantly improves semi-transparent geometry reconstruction while maintaining good performance on opaque scenes.
Conclusion: TSPE-GS resolves the depth ambiguity of semi-transparent surfaces and generalizes to other Gaussian-based reconstruction frameworks without extra training overhead.
Abstract: 3D Gaussian Splatting offers a strong speed-quality trade-off but struggles to reconstruct semi-transparent surfaces because most methods assume a single depth per pixel, which fails when multiple surfaces are visible. We propose TSPE-GS (Transparent Surface Probabilistic Extraction for Gaussian Splatting), which uniformly samples transmittance to model a pixel-wise multi-modal distribution of opacity and depth, replacing the prior single-peak assumption and resolving cross-surface depth ambiguity. By progressively fusing truncated signed distance functions, TSPE-GS reconstructs external and internal surfaces separately within a unified framework. The method generalizes to other Gaussian-based reconstruction pipelines without extra training overhead. Extensive experiments on public and self-collected semi-transparent and opaque datasets show TSPE-GS significantly improves semi-transparent geometry reconstruction while maintaining performance on opaque scenes.
[97] Beyond Cosine Similarity Magnitude-Aware CLIP for No-Reference Image Quality Assessment
Zhicheng Liao,Dongxu Wu,Zhenshan Shi,Sijie Mai,Hanwei Zhu,Lingyu Zhu,Yuncheng Jiang,Baoliang Chen
Main category: cs.CV
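TL;DR: Proposes a new adaptive fusion framework that combines the magnitude of CLIP image features with cosine similarity for no-reference image quality assessment; it requires no task-specific training and performs strongly on multiple benchmarks.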
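The two cues named in this entry, a Box-Cox-normalized magnitude cue and the cosine-based prompt score, can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the authors' method: the Box-Cox parameter `lam`, the use of a mean as the scalar summary, the softmax scale of 100, and all function names are hypothetical, and the confidence-guided fusion step is left out because the entry does not specify it.

```python
import math

def box_cox(x, lam):
    """Box-Cox transform of a positive value x with parameter lam."""
    if x <= 0:
        raise ValueError("Box-Cox requires x > 0")
    if abs(lam) < 1e-12:
        return math.log(x)
    return (x ** lam - 1.0) / lam

def magnitude_cue(features, lam=0.5, eps=1e-8):
    """Average the Box-Cox-transformed absolute feature values (assumed scalar summary)."""
    values = [box_cox(abs(f) + eps, lam) for f in features]
    return sum(values) / len(values)

def cosine(u, v):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def cosine_quality(image_emb, good_emb, bad_emb, scale=100.0):
    """Softmax probability of the 'good photo' prompt against the 'bad photo' prompt."""
    good = scale * cosine(image_emb, good_emb)
    bad = scale * cosine(image_emb, bad_emb)
    m = max(good, bad)                      # subtract the max for numerical stability
    e_good, e_bad = math.exp(good - m), math.exp(bad - m)
    return e_good / (e_good + e_bad)

# Toy vectors standing in for unnormalized CLIP embeddings (placeholders).
image = [0.9, 0.2, -0.4]
good_prompt = [1.0, 0.1, -0.3]
bad_prompt = [-0.8, 0.5, 0.6]
print(magnitude_cue(image), cosine_quality(image, good_prompt, bad_prompt))
```

The point of the sketch is that cosine similarity discards the feature norm by construction, so a magnitude cue has to be computed from the unnormalized features before any normalization step.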
Details
Motivation: Existing CLIP-based image quality assessment methods use only semantic similarity (such as cosine similarity) and ignore the strong correlation between image feature magnitude and perceptual quality, so a magnitude-aware quality cue is needed to improve performance.
Method: Absolute CLIP image features are extracted and statistically normalized with a Box-Cox transformation to reduce semantic sensitivity, producing a semantically normalized auxiliary cue; a confidence-guided fusion mechanism adaptively combines the magnitude cue with cosine similarity.
Result: Experiments on several standard IQA datasets show that the method outperforms standard CLIP-based IQA and current state-of-the-art training-free methods.
Conclusion: Introducing magnitude-aware features with an adaptive fusion strategy significantly improves training-free CLIP-based image quality assessment and shows the potential of feature magnitude as a quality cue.
Abstract: Recent efforts have repurposed the Contrastive Language-Image Pre-training (CLIP) model for No-Reference Image Quality Assessment (NR-IQA) by measuring the cosine similarity between the image embedding and textual prompts such as "a good photo" or "a bad photo." However, this semantic similarity overlooks a critical yet underexplored cue: the magnitude of the CLIP image features, which we empirically find to exhibit a strong correlation with perceptual quality. In this work, we introduce a novel adaptive fusion framework that complements cosine similarity with a magnitude-aware quality cue. Specifically, we first extract the absolute CLIP image features and apply a Box-Cox transformation to statistically normalize the feature distribution and mitigate semantic sensitivity. The resulting scalar summary serves as a semantically-normalized auxiliary cue that complements cosine-based prompt matching. To integrate both cues effectively, we further design a confidence-guided fusion scheme that adaptively weighs each term according to its relative strength. Extensive experiments on multiple benchmark IQA datasets demonstrate that our method consistently outperforms standard CLIP-based IQA and state-of-the-art baselines, without any task-specific training.
[98] Robust Object Detection with Pseudo Labels from VLMs using Per-Object Co-teaching
Uday Bhaskar,Rishabh Bhattacharya,Avinash Patel,Sarthak Khoche,Praveen Anil Kulkarni,Naresh Manwani
Main category: cs.CV
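TL;DR: Proposes a new method that uses vision-language models (VLMs) to generate pseudo-labels and trains efficient real-time object detectors with a per-object co-teaching strategy, significantly reducing reliance on manual annotation and improving performance on KITTI and other datasets.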
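The per-object co-teaching idea summarized here, where each detector trains only on the boxes its peer finds reliable, can be sketched as a selection step over per-box losses. This is a simplified illustration, not the authors' training loop: the fixed `keep_ratio`, the small-loss selection rule, and the function names are assumptions, and real detector losses would replace the toy numbers.

```python
def select_clean_boxes(peer_losses, keep_ratio):
    """Return sorted indices of the boxes with the smallest peer loss."""
    if not peer_losses:
        return []
    num_keep = max(1, int(round(keep_ratio * len(peer_losses))))
    order = sorted(range(len(peer_losses)), key=lambda i: peer_losses[i])
    return sorted(order[:num_keep])

def co_teaching_step(losses_a, losses_b, keep_ratio=0.75):
    """Filter pseudo-labelled boxes per object: each model keeps what its peer trusts."""
    keep_for_a = select_clean_boxes(losses_b, keep_ratio)   # chosen by model B
    keep_for_b = select_clean_boxes(losses_a, keep_ratio)   # chosen by model A
    loss_a = sum(losses_a[i] for i in keep_for_a) / max(1, len(keep_for_a))
    loss_b = sum(losses_b[i] for i in keep_for_b) / max(1, len(keep_for_b))
    return keep_for_a, keep_for_b, loss_a, loss_b

# Toy per-box losses for one mini-batch of four pseudo-labelled boxes.
losses_a = [0.2, 1.5, 0.3, 0.4]
losses_b = [0.1, 0.2, 2.0, 0.3]
keep_a, keep_b, loss_a, loss_b = co_teaching_step(losses_a, losses_b)
print(keep_a, keep_b)
```

Filtering at the box level rather than the image level means that one hallucinated box does not cause the remaining, usable boxes in the same image to be discarded.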
Details
Motivation: VLMs perform well at zero-shot object detection, but their detection latency and hallucinated predictions make them hard to use directly in real-time applications such as autonomous driving, and manual annotation is expensive.
Method: A new pipeline uses VLMs to generate pseudo-labels and trains with a per-object co-teaching strategy: two YOLO models collaborate, each filtering unreliable bounding boxes out of every mini-batch according to the other's loss values, which reduces pseudo-label noise.
Result: On KITTI, the method raises mAP@0.5 from 31.12% to 46.61%, and to 57.97% when 10% ground truth labels are added; similar gains appear on ACDC and BDD100k, with real-time detection speed maintained.
Conclusion: The method offers an efficient, robust, and scalable way to train object detectors for autonomous driving, greatly reducing reliance on manual annotation.
Abstract: Foundation models, especially vision-language models (VLMs), offer compelling zero-shot object detection for applications like autonomous driving, a domain where manual labelling is prohibitively expensive. However, their detection latency and tendency to hallucinate predictions render them unsuitable for direct deployment. This work introduces a novel pipeline that addresses this challenge by leveraging VLMs to automatically generate pseudo-labels for training efficient, real-time object detectors. Our key innovation is a per-object co-teaching-based training strategy that mitigates the inherent noise in VLM-generated labels. The proposed per-object co-teaching approach filters noisy bounding boxes from training instead of filtering the entire image. Specifically, two YOLO models learn collaboratively, filtering out unreliable boxes from each mini-batch based on their peers' per-object loss values. Overall, our pipeline provides an efficient, robust, and scalable approach to train high-performance object detectors for autonomous driving, significantly reducing reliance on costly human annotation. Experimental results on the KITTI dataset demonstrate that our method outperforms a baseline YOLOv5m model, achieving a significant mAP@0.5 boost (31.12% to 46.61%) while maintaining real-time detection latency. Furthermore, we show that supplementing our pseudo-labelled data with a small fraction of ground truth labels (10%) leads to further performance gains, reaching 57.97% mAP@0.5 on the KITTI dataset. We observe similar performance improvements for the ACDC and BDD100k datasets.
[99] Equivariant Sampling for Improving Diffusion Model-based Image Restoration
Chenxu Wu,Qingpeng Kong,Peiang Zhao,Wendi Yang,Wenxin Ma,Fenghe Tang,Zihang Jiang,S. Kevin Zhou
Main category: cs.CV
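TL;DR: This paper proposes EquS, a new diffusion model-based image restoration method that introduces equivariant information through dual sampling trajectories and adds a Timestep-Aware Schedule (TAS), significantly improving existing problem-agnostic DMIR methods without increasing computational cost.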
Details
Motivation: Existing problem-agnostic diffusion model-based image restoration methods do not make full use of diffusion priors, which limits their performance; this paper analyzes problems in their sampling process and proposes improvements.
Method: EquS introduces equivariant information through dual sampling trajectories; a Timestep-Aware Schedule (TAS) is designed to increase certainty and sampling efficiency, further improving EquS.
Result: Experiments on multiple benchmark datasets show that the method is compatible with earlier problem-agnostic DMIR methods and significantly improves their performance without adding computational overhead.
Conclusion: By introducing equivariance constraints and an improved sampling strategy, EquS and EquS+ improve diffusion models on image restoration tasks and offer a new direction for improving problem-agnostic DMIR methods.
Abstract: Recent advances in generative models, especially diffusion models, have significantly improved image restoration (IR) performance. However, existing problem-agnostic diffusion model-based image restoration (DMIR) methods face challenges in fully leveraging diffusion priors, resulting in suboptimal performance. In this paper, we address the limitations of current problem-agnostic DMIR methods by analyzing their sampling process and providing effective solutions. We introduce EquS, a DMIR method that imposes equivariant information through dual sampling trajectories. To further boost EquS, we propose the Timestep-Aware Schedule (TAS) and introduce EquS+. TAS prioritizes deterministic steps to enhance certainty and sampling efficiency. Extensive experiments on benchmarks demonstrate that our method is compatible with previous problem-agnostic DMIR methods and significantly boosts their performance without increasing computational costs. Our code is available at https://github.com/FouierL/EquS.
[100] Difference Vector Equalization for Robust Fine-tuning of Vision-Language Models
Satoshi Suzuki,Shin'ya Yamaguchi,Shoichiro Takeda,Taiga Yamane,Naoki Makishima,Naotaka Kawata,Mana Ihori,Tomohiro Tanaka,Shota Orihashi,Ryo Masumura
Main category: cs.CV
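TL;DR: This paper proposes Difference Vector Equalization (DiVE), a method for robustly fine-tuning pre-trained vision-language models without harming out-of-distribution (OOD) and zero-shot performance.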
Details
Motivation: Existing fine-tuning methods distort the geometric structure of the embedding space during fine-tuning, which limits generalization in OOD and zero-shot settings; a fine-tuning method that keeps this structure intact is therefore needed. Method: DiVE preserves the geometric structure by constraining the difference vectors between pre-trained and fine-tuned embeddings to be consistent. Two losses are introduced: the average vector loss (AVL) for global structure preservation and the pairwise vector loss (PVL) for local preservation of multimodal alignment. Result: Experiments show that DiVE effectively preserves the geometric structure of the embedding space and achieves strong results on ID, OOD, and zero-shot classification. Conclusion: By preserving the geometric structure of the embedding space, DiVE achieves robust fine-tuning that improves downstream performance while retaining the original generalization ability. Abstract: Contrastive pre-trained vision-language models, such as CLIP, demonstrate strong generalization abilities in zero-shot classification by leveraging embeddings extracted from image and text encoders. This paper aims to robustly fine-tune these vision-language models on in-distribution (ID) data without compromising their generalization abilities in out-of-distribution (OOD) and zero-shot settings. Current robust fine-tuning methods tackle this challenge by reusing contrastive learning, which was used in pre-training, for fine-tuning. However, we found that these methods distort the geometric structure of the embeddings, which plays a crucial role in the generalization of vision-language models, resulting in limited OOD and zero-shot performance. To address this, we propose Difference Vector Equalization (DiVE), which preserves the geometric structure during fine-tuning. The idea behind DiVE is to constrain difference vectors, each of which is obtained by subtracting the embeddings extracted from the pre-trained and fine-tuning models for the same data sample. By constraining the difference vectors to be equal across various data samples, we effectively preserve the geometric structure. Therefore, we introduce two losses: average vector loss (AVL) and pairwise vector loss (PVL). AVL preserves the geometric structure globally by constraining difference vectors to be equal to their weighted average. PVL preserves the geometric structure locally by ensuring a consistent multimodal alignment.
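The difference-vector idea can be sketched in a few lines of NumPy. If every sample's embedding moves by the same vector during fine-tuning, the embedding cloud is only translated, so all pairwise distances are unchanged. The sketch below is illustrative only: the uniform default weights in `average_vector_loss` and the exact form of `pairwise_vector_loss` (matching the image and text difference vectors of each pair) are assumptions, since the abstract does not give the formulas.

```python
import numpy as np

def difference_vectors(finetuned_emb, pretrained_emb):
    """Per-sample difference between fine-tuned and pre-trained embeddings, shape (n, d)."""
    return np.asarray(finetuned_emb, dtype=float) - np.asarray(pretrained_emb, dtype=float)

def average_vector_loss(diff, weights=None):
    """Penalize deviation of each difference vector from the weighted average (AVL-style)."""
    n = diff.shape[0]
    w = np.full(n, 1.0 / n) if weights is None else np.asarray(weights, dtype=float)
    mean_diff = (w[:, None] * diff).sum(axis=0)
    return float(((diff - mean_diff) ** 2).sum(axis=1).mean())

def pairwise_vector_loss(image_diff, text_diff):
    """Assumed PVL-style term: the image and text of one pair should move by the same vector."""
    return float(((image_diff - text_diff) ** 2).sum(axis=1).mean())
```

A uniform shift of all embeddings gives an `average_vector_loss` of zero, while a rescaling of the embedding space, which changes relative distances, gives a positive value.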
Our experiments demonstrate that DiVE effectively preserves the geometric structure, achieving strong results across ID, OOD, and zero-shot metrics.
[101] STELLAR: Scene Text Editor for Low-Resource Languages and Real-World Data
Yongdeuk Seo,Hyun-seok Min,Sungchul Choi
Main category: cs.CV
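TL;DR: This paper proposes STELLAR, a scene text editing method for low-resource languages and real-world data. A language-adaptive glyph encoder and a multi-stage training strategy (pre-training on synthetic data, then fine-tuning on real images) make multilingual text editing more reliable. The authors also build a new dataset, STIPLAR, and propose the Text Appearance Similarity (TAS) metric to evaluate how well font, color, and background style are preserved. Experiments show that STELLAR outperforms existing methods in visual consistency and recognition accuracy, with an average TAS improvement of 2.2%.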
Details
Motivation: Existing diffusion-based scene text editing methods fall short in three respects: support for low-resource languages, the domain gap between synthetic and real data, and the lack of an effective metric for text style preservation. Method: The STELLAR framework combines a language-adaptive glyph encoder with a multi-stage training strategy (pre-training on synthetic data, then fine-tuning on real images), and a new dataset, STIPLAR, is constructed. In addition, the Text Appearance Similarity (TAS) metric measures font, color, and background similarity independently to evaluate style preservation. Result: STELLAR outperforms state-of-the-art models in visual consistency and text recognition accuracy, with an average TAS improvement of 2.2% across languages, confirming its effectiveness for multilingual text editing in real scenes. Conclusion: STELLAR addresses low-resource language support, domain shift, and the missing style metric, providing a reliable solution for multilingual scene text editing in practical applications. Abstract: Scene Text Editing (STE) is the task of modifying text content in an image while preserving its visual style, such as font, color, and background. While recent diffusion-based approaches have shown improvements in visual quality, key limitations remain: lack of support for low-resource languages, domain gap between synthetic and real data, and the absence of appropriate metrics for evaluating text style preservation. To address these challenges, we propose STELLAR (Scene Text Editor for Low-resource LAnguages and Real-world data). STELLAR enables reliable multilingual editing through a language-adaptive glyph encoder and a multi-stage training strategy that first pre-trains on synthetic data and then fine-tunes on real images. We also construct a new dataset, STIPLAR (Scene Text Image Pairs of Low-resource lAnguages and Real-world data), for training and evaluation. Furthermore, we propose Text Appearance Similarity (TAS), a novel metric that assesses style preservation by independently measuring font, color, and background similarity, enabling robust evaluation even without ground truth. Experimental results demonstrate that STELLAR outperforms state-of-the-art models in visual consistency and recognition accuracy, achieving an average TAS improvement of 2.2% across languages over the baselines.
[102] MOBA: A Material-Oriented Backdoor Attack against LiDAR-based 3D Object Detection Systems
Saket S. Chaturvedi,Gaurav Bagwe,Lan Zhang,Pan He,Xiaoyong Yuan
Main category: cs.CV
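TL;DR: This paper proposes MOBA, a material-oriented backdoor attack framework against LiDAR-based 3D object detection that bridges the gap between the digital and physical domains by modeling the material properties of real-world triggers.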
Details
Motivation: Existing LiDAR backdoor attacks lack physical realizability: digital triggers ignore material-dependent reflection properties, while physical triggers are often unoptimized, making them ineffective or easy to detect. Method: The MOBA framework (1) systematically selects materials with high diffuse reflectivity and environmental robustness, such as TiO$_2$, and (2) develops a new simulation pipeline comprising an angle-independent approximation of the Oren-Nayar BRDF model to generate realistic LiDAR intensities and a distance-aware scaling mechanism to maintain spatial consistency across depths. Result: Experiments on state-of-the-art LiDAR and multi-modal fusion models show that MOBA reaches an attack success rate of 93.50%, outperforming prior methods by over 41%. Conclusion: MOBA reveals a new class of physically realizable security threats and stresses that defenses must account for material-level properties in real-world environments. Abstract: LiDAR-based 3D object detection is widely used in safety-critical systems. However, these systems remain vulnerable to backdoor attacks that embed hidden malicious behaviors during training. A key limitation of existing backdoor attacks is their lack of physical realizability, primarily due to the digital-to-physical domain gap. Digital triggers often fail in real-world settings because they overlook material-dependent LiDAR reflection properties. On the other hand, physically constructed triggers are often unoptimized, leading to low effectiveness or easy detectability. This paper introduces Material-Oriented Backdoor Attack (MOBA), a novel framework that bridges the digital-physical gap by explicitly modeling the material properties of real-world triggers. MOBA tackles two key challenges in physical backdoor design: 1) robustness of the trigger material under diverse environmental conditions, 2) alignment between the physical trigger's behavior and its digital simulation. First, we propose a systematic approach to selecting robust trigger materials, identifying titanium dioxide (TiO$_2$) for its high diffuse reflectivity and environmental resilience. Second, to ensure the digital trigger accurately mimics the physical behavior of the material-based trigger, we develop a novel simulation pipeline that features: (1) an angle-independent approximation of the Oren-Nayar BRDF model to generate realistic LiDAR intensities, and (2) a distance-aware scaling mechanism to maintain spatial consistency across varying depths.
We conduct extensive experiments on state-of-the-art LiDAR-based and Camera-LiDAR fusion models, showing that MOBA achieves a 93.50% attack success rate, outperforming prior methods by over 41%. Our work reveals a new class of physically realizable threats and underscores the urgent need for defenses that account for material-level properties in real-world environments.
[103] DBGroup: Dual-Branch Point Grouping for Weakly Supervised 3D Instance Segmentation
Xuexun Liu,Xiaoxu Xu,Qiudan Zhang,Lin Ma,Xu Wang
Main category: cs.CV
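TL;DR: Proposes DBGroup, a two-stage weakly supervised 3D instance segmentation framework that uses scene-level annotations for better efficiency and scalability, reducing annotation cost while outperforming existing methods.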
Details
Motivation: Existing weakly supervised 3D instance segmentation methods rely on click or bounding-box annotations, which still require substantial manual effort and expert annotators. A more efficient form of weak supervision is needed to lower annotation cost and improve scalability. Method: DBGroup has two stages. The first uses a Dual-Branch Point Grouping module that generates pseudo labels from semantic and mask cues in multi-view images, then refines them with Granularity-Aware Instance Merging and Semantic Selection and Propagation. The second performs multi-round self-training with the refined pseudo labels and introduces an Instance Mask Filter strategy to mitigate pseudo-label inconsistencies. Result: Experiments show that DBGroup is competitive with sparse-point-level supervised methods and outperforms state-of-the-art scene-level supervised 3D semantic segmentation methods. Conclusion: DBGroup demonstrates that scene-level annotations are viable for 3D instance segmentation, offering an efficient, low-annotation-cost solution for 3D scene understanding on large-scale data. Abstract: Weakly supervised 3D instance segmentation is essential for 3D scene understanding, especially given the growing scale of data and the high annotation costs associated with fully supervised approaches. Existing methods primarily rely on two forms of weak supervision: one-thing-one-click annotations and bounding box annotations, both of which aim to reduce labeling efforts. However, these approaches still encounter limitations, including labor-intensive annotation processes, high complexity, and reliance on expert annotators. To address these challenges, we propose DBGroup, a two-stage weakly supervised 3D instance segmentation framework that leverages scene-level annotations as a more efficient and scalable alternative. In the first stage, we introduce a Dual-Branch Point Grouping module to generate pseudo labels guided by semantic and mask cues extracted from multi-view images. To further improve label quality, we develop two refinement strategies: Granularity-Aware Instance Merging and Semantic Selection and Propagation. The second stage involves multi-round self-training on an end-to-end instance segmentation network using the refined pseudo-labels. Additionally, we introduce an Instance Mask Filter strategy to address inconsistencies within the pseudo labels. Extensive experiments demonstrate that DBGroup achieves competitive performance compared to sparse-point-level supervised 3D instance segmentation methods, while surpassing state-of-the-art scene-level supervised 3D semantic segmentation approaches.
Code is available at https://github.com/liuxuexun/DBGroup.
[104] LampQ: Towards Accurate Layer-wise Mixed Precision Quantization for Vision Transformers
Minjun Kim,Jaeri Lee,Jongjin Kim,Jeongin Yun,Yongmo Kwon,U Kang
Main category: cs.CV
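TL;DR: This paper proposes LampQ, a layer-wise mixed precision quantization method for Vision Transformers (ViTs). With fine-grained layer-level quantization, a type-aware Fisher information metric, and bit allocation optimized by integer linear programming, it achieves state-of-the-art quantization performance on tasks such as image classification and object detection.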
Details
Motivation: Most existing ViT quantization methods use uniform precision and ignore the differing quantization sensitivity of components, while existing mixed precision methods suffer from coarse granularity, inconsistent metric scales, and quantization-unaware bit allocation. Method: LampQ applies layer-wise mixed precision quantization, uses type-aware Fisher information as the sensitivity metric, assigns bit-widths optimally via integer linear programming, and refines them with an iterative update strategy. Result: Experiments on several pre-trained ViT models and tasks (image classification, object detection, zero-shot quantization) show that LampQ significantly outperforms existing quantization methods at the same bit-width and reaches the state of the art. Conclusion: LampQ addresses the three main limitations of existing mixed precision quantization for ViTs, achieving accurate, fine-grained, and acceleration-friendly model compression that supports practical deployment of low-bit ViTs. Abstract: How can we accurately quantize a pre-trained Vision Transformer model? Quantization algorithms compress Vision Transformers (ViTs) into low-bit formats, reducing memory and computation demands with minimal accuracy degradation. However, existing methods rely on uniform precision, ignoring the diverse sensitivity of ViT components to quantization. Metric-based Mixed Precision Quantization (MPQ) is a promising alternative, but previous MPQ methods for ViTs suffer from three major limitations: 1) coarse granularity, 2) mismatch in metric scale across component types, and 3) quantization-unaware bit allocation. In this paper, we propose LampQ (Layer-wise Mixed Precision Quantization for Vision Transformers), an accurate metric-based MPQ method for ViTs to overcome these limitations. LampQ performs layer-wise quantization to achieve both fine-grained control and efficient acceleration, incorporating a type-aware Fisher-based metric to measure sensitivity. Then, LampQ assigns bit-widths optimally through integer linear programming and further updates them iteratively. Extensive experiments show that LampQ provides state-of-the-art performance in quantizing ViTs pre-trained on various tasks such as image classification, object detection, and zero-shot quantization.
[105] MIRNet: Integrating Constrained Graph-Based Reasoning with Pre-training for Diagnostic Medical Imaging
Shufeng Kong,Zijie Wang,Nuan Cui,Hao Tang,Yihan Meng,Yuanyuan Wei,Feifan Chen,Yingheng Wang,Zhuo Cai,Yaonan Wang,Yulong Zhang,Yuzheng Li,Zibin Zheng,Caihua Liu
Main category: cs.CV
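TL;DR: Proposes MIRNet, a framework that combines self-supervised pre-training with constrained graph-based reasoning for medical image analysis and performs especially well on tongue image diagnosis.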
Details
Motivation: Automated interpretation of medical images faces annotation scarcity, label imbalance, and clinical plausibility constraints, particularly in tongue image diagnosis, which requires fine-grained visual-semantic understanding. Method: A self-supervised masked autoencoder (MAE) learns visual representations from unlabeled data; a graph attention network (GAT) models an expert-defined label relationship graph; clinical priors are enforced through KL divergence and regularization losses; asymmetric loss (ASL) and boosting ensembles mitigate class imbalance; and a large dataset of 4,000 images, TongueAtlas-4K, is constructed. Result: The method achieves state-of-the-art performance on tongue diagnosis, and the framework extends to other diagnostic medical imaging tasks. Conclusion: MIRNet integrates self-supervised learning with structured semantic reasoning, handles annotation scarcity, label imbalance, and clinical consistency well, and offers a scalable solution for medical image understanding. Abstract: Automated interpretation of medical images demands robust modeling of complex visual-semantic relationships while addressing annotation scarcity, label imbalance, and clinical plausibility constraints. We introduce MIRNet (Medical Image Reasoner Network), a novel framework that integrates self-supervised pre-training with constrained graph-based reasoning. Tongue image diagnosis is a particularly challenging domain that requires fine-grained visual and semantic understanding. Our approach leverages self-supervised masked autoencoder (MAE) to learn transferable visual representations from unlabeled data; employs graph attention networks (GAT) to model label correlations through expert-defined structured graphs; enforces clinical priors via constraint-aware optimization using KL divergence and regularization losses; and mitigates imbalance using asymmetric loss (ASL) and boosting ensembles. To address annotation scarcity, we also introduce TongueAtlas-4K, a comprehensive expert-curated benchmark comprising 4,000 images annotated with 22 diagnostic labels, representing the largest public dataset in tongue analysis. Validation shows our method achieves state-of-the-art performance. While optimized for tongue diagnosis, the framework readily generalizes to broader diagnostic medical imaging tasks.
[106] AffordBot: 3D Fine-grained Embodied Reasoning via Multimodal Large Language Models
Xinyi Wang,Xun Yang,Yanlong Xu,Yuchen Wu,Zhen Li,Na Zhao
Main category: cs.CV
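TL;DR: This paper introduces a new task, Fine-grained 3D Embodied Reasoning, in which an agent predicts the location, motion type, and motion axis of actionable elements in a 3D scene from an instruction. The proposed AffordBot framework combines multimodal large language models with chain-of-thought reasoning, using surround-view rendering and viewpoint selection for instruction-driven fine-grained interaction reasoning, and achieves the best performance on the SceneFun3D dataset.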
Details
Motivation: Existing methods usually operate at the object level or handle fine-grained affordance reasoning in a disjointed way; they lack coherent, instruction-driven grounding and reasoning and cannot meet the need for precise interaction understanding in human-agent collaboration in physical environments. Method: AffordBot combines a multimodal large language model (MLLM) with a tailored chain-of-thought (CoT) reasoning pipeline: the 3D scene is rendered into surround-view images and 3D element candidates are projected into these views; an active perception stage lets the MLLM select the most informative viewpoint, followed by step-by-step reasoning to localize actionable elements and infer interaction motions. Result: Experiments on the SceneFun3D dataset show that AffordBot reaches state-of-the-art performance using only 3D point cloud input and MLLMs, with strong generalization and physically plausible reasoning. Conclusion: By combining MLLMs with a structured reasoning pipeline, AffordBot achieves instruction-driven fine-grained 3D embodied reasoning and provides more precise, interpretable interaction understanding for human-agent collaboration in physical environments. Abstract: Effective human-agent collaboration in physical environments requires understanding not only what to act upon, but also where the actionable elements are and how to interact with them. Existing approaches often operate at the object level or disjointedly handle fine-grained affordance reasoning, lacking coherent, instruction-driven grounding and reasoning. In this work, we introduce a new task: Fine-grained 3D Embodied Reasoning, which requires an agent to predict, for each referenced affordance element in a 3D scene, a structured triplet comprising its spatial location, motion type, and motion axis, based on a task instruction. To solve this task, we propose AffordBot, a novel framework that integrates Multimodal Large Language Models (MLLMs) with a tailored chain-of-thought (CoT) reasoning paradigm. To bridge the gap between 3D input and 2D-compatible MLLMs, we render surround-view images of the scene and project 3D element candidates into these views, forming a rich visual representation aligned with the scene geometry. Our CoT pipeline begins with an active perception stage, prompting the MLLM to select the most informative viewpoint based on the instruction, before proceeding with step-by-step reasoning to localize affordance elements and infer plausible interaction motions. Evaluated on the SceneFun3D dataset, AffordBot achieves state-of-the-art performance, demonstrating strong generalization and physically grounded reasoning with only 3D point cloud input and MLLMs.
[107] Anomagic: Crossmodal Prompt-driven Zero-shot Anomaly Generation
Yuxin Jiang,Wei Luo,Hui Zhang,Qiyu Chen,Haiming Yao,Weiming Shen,Yunkang Cao
Main category: cs.CV
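TL;DR: Proposes Anomagic, a zero-shot anomaly generation method based on crossmodal prompts and contrastive refinement, trained on the AnomVerse dataset to improve anomaly detection performance.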
Details
Motivation: Existing anomaly generation methods depend on anomaly samples or lack semantic coherence, making it hard to generate diverse and realistic anomalies; a method that produces high-quality anomalies without anomaly exemplars is needed. Method: A crossmodal prompt encoding scheme unifies visual and textual cues to steer an inpainting model that generates anomalies, and a contrastive refinement strategy ensures that generated anomalies align precisely with their masks. Training uses AnomVerse, which contains 12,987 anomaly-mask-caption triplets. Result: Experiments show that Anomagic generates more realistic and diverse anomalies, outperforms prior methods on downstream anomaly detection, and can generate anomalies for any normal image from user-defined prompts. Conclusion: Anomagic is an effective zero-shot anomaly generation framework with good generalization and practicality, serving as a general foundation model for anomaly generation. Abstract: We propose Anomagic, a zero-shot anomaly generation method that produces semantically coherent anomalies without requiring any exemplar anomalies. By unifying both visual and textual cues through a crossmodal prompt encoding scheme, Anomagic leverages rich contextual information to steer an inpainting-based generation pipeline. A subsequent contrastive refinement strategy enforces precise alignment between synthesized anomalies and their masks, thereby bolstering downstream anomaly detection accuracy. To facilitate training, we introduce AnomVerse, a collection of 12,987 anomaly-mask-caption triplets assembled from 13 publicly available datasets, where captions are automatically generated by multimodal large language models using structured visual prompts and template-based textual hints. Extensive experiments demonstrate that Anomagic trained on AnomVerse can synthesize more realistic and varied anomalies than prior methods, yielding superior improvements in downstream anomaly detection. Furthermore, Anomagic can generate anomalies for any normal-category image using user-defined prompts, establishing a versatile foundation model for anomaly generation.
[108] DGFusion: Dual-guided Fusion for Robust Multi-Modal 3D Object Detection
Feiyang Jia,Caiyan Jia,Ailin Liu,Shaoqing Xu,Qiming Xia,Lin Liu,Lei Yang,Yan Gong,Ziying Song
Main category: cs.CV
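TL;DR: This paper proposes DGFusion, a multi-modal 3D object detection method based on a dual-guided paradigm, which uses a difficulty-aware instance matching mechanism to improve detection of hard objects that are distant, small, or occluded.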
Details
Motivation: Existing single-guided multi-modal 3D detection methods do not fully account for how the information density of hard instances differs between modalities, so they detect distant, small, or occluded objects poorly. Method: DGFusion adopts a dual-guided paradigm (Point-guide-Image and Image-guide-Point) and designs a Difficulty-aware Instance Pair Matcher (DIPM) that performs difficulty-based instance-level feature matching to generate easy and hard instance pairs, which the Dual-guided Modules use for effective multi-modal feature fusion. Result: On the nuScenes dataset, the method improves over the baseline by +1.0% mAP, +0.8% NDS, and +1.3% average recall, and is more robust across a range of hard scenarios. Conclusion: Through the dual-guided paradigm and difficulty-aware matching, DGFusion improves the perception of hard instances in multi-modal 3D detection and strengthens the safety of autonomous driving systems. Abstract: As a critical task in autonomous driving perception systems, 3D object detection is used to identify and track key objects, such as vehicles and pedestrians. However, detecting distant, small, or occluded objects (hard instances) remains a challenge, which directly compromises the safety of autonomous driving systems. We observe that existing multi-modal 3D object detection methods often follow a single-guided paradigm, failing to account for the differences in information density of hard instances between modalities. In this work, we propose DGFusion, based on the Dual-guided paradigm, which fully inherits the advantages of the Point-guide-Image paradigm and integrates the Image-guide-Point paradigm to address the limitations of the single paradigms. The core of DGFusion, the Difficulty-aware Instance Pair Matcher (DIPM), performs instance-level feature matching based on difficulty to generate easy and hard instance pairs, while the Dual-guided Modules exploit the advantages of both pair types to enable effective multi-modal feature fusion. Experimental results demonstrate that our DGFusion outperforms the baseline methods, with respective improvements of +1.0% mAP, +0.8% NDS, and +1.3% average recall on nuScenes. Extensive experiments demonstrate consistent robustness gains for hard instance detection across ego-distance, size, visibility, and small-scale training scenarios.
[109] LoG3D: Ultra-High-Resolution 3D Shape Modeling via Local-to-Global Partitioning
Xinran Yang,Shuichang Lai,Jiangjing Lyu,Hongjie Li,Bowen Pan,Yuanqi Li,Jie Guo,Zhou Zhengkang,Yanwen Guo
Main category: cs.CV
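TL;DR: Proposes a 3D variational autoencoder (VAE) framework based on unsigned distance fields (UDFs) whose local-to-global (LoG) architecture enables high-fidelity 3D content generation, supports complex topologies, and reaches an ultra-high resolution of $2048^3$.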
Details
Motivation: Existing methods are limited when handling non-manifold geometry, open surfaces, and intricate internal structures: SDFs require costly watertight preprocessing, and point-cloud representations are prone to sampling artifacts. Method: A UDF-based 3D VAE framework partitions the volume into UBlocks, combines 3D convolutions for local detail with sparse transformers for global coherence, and introduces a Pad-Average strategy to smooth boundary transitions. Result: The method reaches the state of the art in reconstruction accuracy and generation quality, markedly improves surface smoothness and geometric flexibility, and supports generation and reconstruction at resolutions up to $2048^3$. Conclusion: The method overcomes the limitations of conventional representations, models complex and incomplete shapes efficiently and with high fidelity, and moves 3D content generation toward higher resolutions and more complex topologies. Abstract: Generating high-fidelity 3D content remains a fundamental challenge due to the complexity of representing arbitrary topologies, such as open surfaces and intricate internal structures, while preserving geometric details. Prevailing methods based on signed distance fields (SDFs) are hampered by costly watertight preprocessing and struggle with non-manifold geometries, while point-cloud representations often suffer from sampling artifacts and surface discontinuities. To overcome these limitations, we propose a novel 3D variational autoencoder (VAE) framework built upon unsigned distance fields (UDFs), a more robust and computationally efficient representation that naturally handles complex and incomplete shapes. Our core innovation is a local-to-global (LoG) architecture that processes the UDF by partitioning it into uniform subvolumes, termed UBlocks. This architecture couples 3D convolutions for capturing local detail with sparse transformers for enforcing global coherence. A Pad-Average strategy further ensures smooth transitions at subvolume boundaries during reconstruction. This modular design enables seamless scaling to ultra-high resolutions up to $2048^3$, a regime previously unattainable for 3D VAEs. Experiments demonstrate state-of-the-art performance in both reconstruction accuracy and generative quality, yielding superior surface smoothness and geometric flexibility.
[110] FreDFT: Frequency Domain Fusion Transformer for Visible-Infrared Object Detection
Wencong Wu,Xiuwei Zhang,Hanlin Yin,Shun Dai,Hongxi Zhang,Yanning Zhang
Main category: cs.CV
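TL;DR: This paper proposes FreDFT, a frequency domain fusion transformer for visible-infrared object detection. Frequency domain attention and a multi-scale frequency domain feed-forward layer mine complementary cross-modal information, while a cross-modal global modeling module and a local feature enhancement module reduce modality imbalance and improve feature fusion; the model achieves excellent results on multiple public datasets.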
Details
Motivation: In complex scenes, existing methods face an information imbalance between the visible and infrared modalities, and most are confined to cross-modal fusion in the spatial domain, overlooking the potential of the frequency domain for mining complementary information; the result is insufficient fusion and degraded detection. Method: FreDFT comprises multimodal frequency domain attention (MFDA) and a frequency domain feed-forward layer (FDFFL) with a mixed-scale frequency feature fusion strategy; a cross-modal global modeling module (CGMM) performs pixel-wise spatial and channel interaction to reduce information imbalance; and a local feature enhancement module (LFEM) strengthens local feature representation using various convolutions and a channel shuffle. Result: On multiple public visible-infrared detection datasets, FreDFT significantly outperforms existing methods, confirming its effectiveness for cross-modal fusion and detection. Conclusion: By modeling in the frequency domain, FreDFT mines complementary cross-modal information, addresses modality imbalance, and improves detection in complex environments, offering a new frequency domain perspective on multimodal detection. Abstract: Visible-infrared object detection has gained considerable attention due to its detection performance in low light, fog, and rain conditions. However, visible and infrared modalities captured by different sensors suffer from an information imbalance problem in complex scenarios, which can cause inadequate cross-modal fusion, resulting in degraded detection performance. Furthermore, most existing methods use transformers in the spatial domain to capture complementary features, ignoring the advantages of developing frequency domain transformers to mine complementary information. To solve these weaknesses, we propose a frequency domain fusion transformer, called FreDFT, for visible-infrared object detection. The proposed approach employs a novel multimodal frequency domain attention (MFDA) to mine complementary information between modalities, and a frequency domain feed-forward layer (FDFFL) with a mixed-scale frequency feature fusion strategy is designed to better enhance multimodal features. To eliminate the imbalance of multimodal information, a cross-modal global modeling module (CGMM) is constructed to perform pixel-wise inter-modal feature interaction in a spatial and channel manner. Moreover, a local feature enhancement module (LFEM) is developed to strengthen multimodal local feature representation and promote multimodal feature fusion by using various convolution layers and applying a channel shuffle. Extensive experimental results have verified that our proposed FreDFT achieves excellent performance on multiple public datasets compared with other state-of-the-art methods.
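The channel shuffle mentioned for the LFEM is a standard operation (popularized by ShuffleNet) and can be sketched as below. This is a generic NumPy illustration, not FreDFT's code: the function name, the `(batch, channels, height, width)` layout, and the choice of `groups` are assumptions, since the abstract does not say how the shuffle is configured.

```python
import numpy as np

def channel_shuffle(x, groups):
    """Interleave channels across groups for a tensor of shape (n, c, h, w)."""
    n, c, h, w = x.shape
    if c % groups != 0:
        raise ValueError("number of channels must be divisible by groups")
    # Split channels into (groups, channels_per_group), swap the two axes, flatten back.
    x = x.reshape(n, groups, c // groups, h, w)
    x = x.transpose(0, 2, 1, 3, 4)
    return x.reshape(n, c, h, w)
```

With six channels and two groups, the channel order `[0, 1, 2, 3, 4, 5]` becomes `[0, 3, 1, 4, 2, 5]`, so a subsequent grouped convolution sees channels that came from both groups.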
The code of our FreDFT is linked at https://github.com/WenCongWu/FreDFT.
[111] MuSc-V2: Zero-Shot Multimodal Industrial Anomaly Classification and Segmentation with Mutual Scoring of Unlabeled Samples
Xurui Li,Feng Xue,Yu Zhou
Main category: cs.CV
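TL;DR: This paper proposes MuSc-V2, a mutual scoring framework for zero-shot anomaly classification and segmentation. It exploits a key property, namely that normal image patches resemble many other patches in both 2D appearance and 3D shape while anomalies are isolated, and combines 2D/3D multimodal information to clearly surpass existing methods on the MVTec 3D-AD and Eyecandies datasets.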
Details
Motivation: Existing zero-shot anomaly detection methods overlook the fact that normal samples are highly similar and clustered in both 2D and 3D space, whereas anomalous samples are diverse and isolated; this paper aims to exploit that discriminative property explicitly. Method: The MuSc-V2 framework includes Iterative Point Grouping (IPG) to improve the 3D representation, Similarity Neighborhood Aggregation with Multi-Degrees (SNAMD) to fuse 2D/3D multi-scale features, a Mutual Scoring Mechanism (MSM) for scoring within each modality, Cross-modal Anomaly Enhancement (CAE) to fuse 2D and 3D scores, and Re-scoring with Constrained Neighborhood (RsCon) to suppress false detections. Result: The method gains +23.7% AP on MVTec 3D-AD and +19.3% on Eyecandies, surpassing previous zero-shot methods and outperforming most few-shot methods. Conclusion: By explicitly modeling the distributional difference between normal and anomalous samples, MuSc-V2 improves the accuracy and robustness of zero-shot anomaly classification and segmentation and adapts well across product lines. Abstract: Zero-shot anomaly classification (AC) and segmentation (AS) methods aim to identify and outline defects without using any labeled samples. In this paper, we reveal a key property that is overlooked by existing methods: normal image patches across industrial products typically find many other similar patches, not only in 2D appearance but also in 3D shapes, while anomalies remain diverse and isolated. To explicitly leverage this discriminative property, we propose a Mutual Scoring framework (MuSc-V2) for zero-shot AC/AS, which flexibly supports single 2D/3D or multimodality. Specifically, our method begins by improving 3D representation through Iterative Point Grouping (IPG), which reduces false positives from discontinuous surfaces. Then we use Similarity Neighborhood Aggregation with Multi-Degrees (SNAMD) to fuse 2D/3D neighborhood cues into more discriminative multi-scale patch features for mutual scoring. The core comprises a Mutual Scoring Mechanism (MSM) that lets samples within each modality assign scores to each other, and Cross-modal Anomaly Enhancement (CAE) that fuses 2D and 3D scores to recover modality-specific missing anomalies. Finally, Re-scoring with Constrained Neighborhood (RsCon) suppresses false classification based on similarity to more representative samples. Our framework flexibly works on both the full dataset and smaller subsets with consistently robust performance, ensuring seamless adaptability across diverse product lines.
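The mutual scoring idea, in which unlabeled samples score each other, can be sketched for a single modality as follows. This is a simplified illustration under stated assumptions: patch features are taken as given, Euclidean distance is used, and per-peer scores are combined with a plain mean; the function name and these choices are hypothetical rather than the paper's exact mechanism.

```python
import numpy as np

def mutual_scores(patch_feats):
    """Score every patch by how far it is from its nearest patch in each other image.

    patch_feats: array of shape (n_images, n_patches, dim), with n_images >= 2.
    Returns an array of shape (n_images, n_patches); higher means more isolated.
    """
    feats = np.asarray(patch_feats, dtype=float)
    n_images, n_patches, _ = feats.shape
    scores = np.zeros((n_images, n_patches))
    for i in range(n_images):
        per_peer = []
        for j in range(n_images):
            if i == j:
                continue
            # Distances between all patches of image i and all patches of image j.
            dists = np.linalg.norm(feats[i][:, None, :] - feats[j][None, :, :], axis=-1)
            per_peer.append(dists.min(axis=1))  # nearest match in image j
        scores[i] = np.mean(per_peer, axis=0)
    return scores
```

A normal patch finds a close match in most other images and receives a low score, whereas an anomalous patch has no close match anywhere and receives a high one, which is the property the abstract highlights.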
MuSc-V2 achieves significant performance improvements: a +23.7% AP gain on the MVTec 3D-AD dataset and a +19.3% boost on the Eyecandies dataset, surpassing previous zero-shot benchmarks and even outperforming most few-shot methods. The code will be available at https://github.com/HUST-SLOW/MuSc-V2.
[112] Image Aesthetic Reasoning via HCM-GRPO: Empowering Compact Model for Superior Performance
Zhiyuan Hu,Zheng Sun,Yi Wei,Long Yu
Main category: cs.CV
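TL;DR: This paper proposes a complete solution to weak image aesthetic reasoning, consisting of a large-scale image screening dataset and the HCM-GRPO method for improving multimodal large models.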
Details
Motivation: Image screening currently performs poorly because data is lacking and multimodal large language models (MLLMs) are weak at image aesthetic reasoning, so effective data and methods are needed. Method: A large-scale dataset of over 128k samples and about 640k images is built, covering aesthetic evaluation in four aspects: appearance deformation, physical shadow, placement layout, and extension rationality. Several annotation approaches are used to obtain high-quality chain-of-thought (CoT) data, and the HCM-GRPO method combines Hard Cases Mining with a Dynamic Proportional Accuracy reward. Result: Experiments show that even state-of-the-art closed-source MLLMs such as GPT4o and Qwen-VL-Max perform close to random guessing on image aesthetic reasoning, whereas a small model trained with HCM-GRPO clearly outperforms both open-source and closed-source large models. Conclusion: A high-quality dataset together with the HCM-GRPO training strategy improves multimodal models on image aesthetic reasoning and provides an efficient, cost-controlled solution for image screening. Abstract: The performance of image generation has been significantly improved in recent years. However, the study of image screening is rare and its performance with Multimodal Large Language Models (MLLMs) is unsatisfactory due to the lack of data and the weak image aesthetic reasoning ability in MLLMs. In this work, we propose a complete solution to address these problems in terms of data and methodology. For data, we collect a comprehensive image screening dataset with over 128k samples, about 640k images. Each sample consists of an original image and four generated images. The dataset evaluates the image aesthetic reasoning ability under four aspects: appearance deformation, physical shadow, placement layout, and extension rationality. Regarding data annotation, we investigate multiple approaches, including purely manual, fully automated, and answer-driven annotations, to acquire high-quality chains of thought (CoT) data in the most cost-effective manner. Methodologically, we introduce a Hard Cases Mining (HCM) strategy with a Dynamic Proportional Accuracy (DPA) reward into the Group Relative Policy Optimization (GRPO) framework, called HCM-GRPO. This enhanced method demonstrates superior image aesthetic reasoning capabilities compared to the original GRPO. Our experimental results reveal that even state-of-the-art closed-source MLLMs, such as GPT4o and Qwen-VL-Max, exhibit performance akin to random guessing in image aesthetic reasoning.
In contrast, by leveraging HCM-GRPO, we are able to surpass the scores of both large-scale open-source and leading closed-source models with a much smaller model.
[113] When Eyes and Ears Disagree: Can MLLMs Discern Audio-Visual Confusion?
Qilang Ye,Wei Zeng,Meng Liu,Jie Zhang,Yupeng Hu,Zitong Yu,Yu Zhou
Main category: cs.CV
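TL;DR: This paper introduces a new benchmark, AV-ConfuseBench, to evaluate whether multimodal large language models (MLLMs) can identify confusing objects that are visually present but have no audio, and proposes RL-CoMM, a reinforcement learning-based collaborative multi-MLLM framework that improves the accuracy of audio-visual reasoning.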
Details
Motivation: Existing MLLMs are found to judge the presence of audio poorly because vision dominates their reasoning, so a method that reduces this visual-dominance bias is needed to improve audio-visual understanding. Method: A two-stage approach is proposed: first, a Large Audio Language Model (LALM) generates audio-only reasoning as a reference, and a Step-wise Reasoning Reward lets the MLLM improve itself; second, Answer-centered Confidence Optimization reduces the uncertainty arising from heterogeneous reasoning differences. Result: Experiments show that RL-CoMM improves accuracy by 10% to 30% over the baseline on audio-visual question answering and audio-visual hallucination tasks while requiring limited training data. Conclusion: RL-CoMM improves MLLM reasoning in complex audio-visual scenes, especially when visual and auditory information conflict. Abstract: Can Multimodal Large Language Models (MLLMs) discern confused objects that are visually present but audio-absent? To study this, we introduce a new benchmark, AV-ConfuseBench, which simulates an "Audio-Visual Confusion" scene by modifying the corresponding sound of an object in the video, e.g., muting the sounding object and asking MLLMs "Is there a/an muted-object sound". Experimental results reveal that MLLMs, such as Qwen2.5-Omni and Gemini 2.5, struggle to discriminate non-existent audio due to visually dominated reasoning. Motivated by this observation, we introduce RL-CoMM, a Reinforcement Learning-based Collaborative Multi-MLLM that is built upon the Qwen2.5-Omni foundation. RL-CoMM includes two stages: 1) To alleviate visually dominated ambiguities, we introduce an external model, a Large Audio Language Model (LALM), as the reference model to generate audio-only reasoning. Then, we design a Step-wise Reasoning Reward function that enables MLLMs to self-improve audio-visual reasoning with the audio-only reference. 2) To ensure an accurate answer prediction, we introduce Answer-centered Confidence Optimization to reduce the uncertainty of potential heterogeneous reasoning differences. Extensive experiments on audio-visual question answering and audio-visual hallucination show that RL-CoMM improves the accuracy by 10% to 30% over the baseline model with limited training data. Follow: https://github.com/rikeilong/AVConfusion.
[114] Multivariate Gaussian Representation Learning for Medical Action Evaluation
Luming Yang,Haoxian Liu,Siqing Li,Alper Yilmaz
Main category: cs.CV
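TL;DR: This paper presents the CPREval-6k medical action benchmark and the GaussMedAct framework for fine-grained medical action evaluation. The method learns adaptive spatiotemporal representations through multivariate Gaussian encoding and hybrid spatial encoding, and outperforms existing methods in accuracy and robustness.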
Details
Motivation: Fine-grained action evaluation in medical vision is difficult because comprehensive datasets are unavailable, precision requirements are strict, and the spatiotemporal dynamics of rapid actions are insufficiently modeled. Method: The multi-view, multi-label CPREval-6k dataset is introduced, and the GaussMedAct framework is designed: a multivariate Gaussian representation maps joint motion into a temporally scaled multi-dimensional space and decomposes it into adaptive 3D Gaussian "tokens", combined with hybrid spatial encoding that uses a Cartesian and Vector dual-stream strategy. Result: The method reaches 92.1% Top-1 accuracy on CPREval-6k with real-time inference, a 5.9% improvement over the ST-GCN baseline using only 10% of the FLOPs; cross-dataset experiments confirm its robustness advantage. Conclusion: Through adaptive spatiotemporal modeling, GaussMedAct significantly improves the performance and efficiency of fine-grained action recognition in medical settings, with good generalization and application potential. Abstract: Fine-grained action evaluation in medical vision faces unique challenges due to the unavailability of comprehensive datasets, stringent precision requirements, and insufficient spatiotemporal dynamic modeling of very rapid actions. To support development and evaluation, we introduce CPREval-6k, a multi-view, multi-label medical action benchmark containing 6,372 expert-annotated videos with 22 clinical labels. Using this dataset, we present GaussMedAct, a multivariate Gaussian encoding framework, to advance medical motion analysis through adaptive spatiotemporal representation learning. Multivariate Gaussian Representation projects the joint motions to a temporally scaled multi-dimensional space, and decomposes actions into adaptive 3D Gaussians that serve as tokens. These tokens preserve motion semantics through anisotropic covariance modeling while maintaining robustness to spatiotemporal noise. Hybrid Spatial Encoding, employing a Cartesian and Vector dual-stream strategy, effectively utilizes skeletal information in the form of joint and bone features. The proposed method achieves 92.1% Top-1 accuracy with real-time inference on the benchmark, outperforming the ST-GCN baseline by +5.9% accuracy with only 10% FLOPs. Cross-dataset experiments confirm the superiority of our method in robustness.
[115] Perceive, Act and Correct: Confidence Is Not Enough for Hyperspectral Classification
Muzhou Yang,Wuzhou Quan,Mingqiang Wei
Main category: cs.CV
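TL;DR: Proposes CABIN, a semi-supervised framework that improves model cognitive awareness in hyperspectral image classification through a closed loop of perception, action, and correction, effectively mitigating the confirmation bias caused by misleading confidence.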
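A minimal sketch of the "perceive, then act" idea: score each sample by an epistemic-uncertainty proxy, then pick the most uncertain samples for exploration and the most certain ones as pseudo-label anchors. The proxy used here (mutual information across several stochastic forward passes) and the function names are assumptions; the entry does not say which estimator or selection rule CABIN actually uses.

```python
import math


def entropy(p):
    """Shannon entropy (nats) of one probability vector."""
    return -sum(q * math.log(q) for q in p if q > 0.0)


def epistemic_uncertainty(passes):
    """Mutual information: H(mean prediction) minus the mean per-pass entropy.

    `passes` holds one probability vector per stochastic forward pass.
    The score is near zero when the passes agree, even if each pass is unsure.
    """
    n_cls = len(passes[0])
    mean_p = [sum(p[c] for p in passes) / len(passes) for c in range(n_cls)]
    return entropy(mean_p) - sum(entropy(p) for p in passes) / len(passes)


def dual_sample(all_passes, k):
    """Return (k most uncertain sample indices, k most certain sample indices)."""
    scores = [epistemic_uncertainty(p) for p in all_passes]
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    return order[-k:], order[:k]


# Hypothetical predictions for three samples, two passes each.
all_passes = [
    [[0.9, 0.1], [0.9, 0.1]],  # passes agree: low epistemic uncertainty
    [[0.9, 0.1], [0.1, 0.9]],  # passes disagree strongly: high uncertainty
    [[0.6, 0.4], [0.4, 0.6]],  # mild disagreement
]
uncertain, confident = dual_sample(all_passes, k=1)
```

Note that sample 1 is confident in every single pass, yet it ranks as the most uncertain because the passes contradict each other, which is the distinction between confidence and correctness that the entry emphasizes.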
Details
Motivation: Existing models for hyperspectral image classification often mistake high confidence for correctness and lack awareness of uncertainty, which leads to confirmation bias and overfitting under sparse annotations or class imbalance. Method: CABIN builds perceptual awareness by estimating epistemic uncertainty; uses an uncertainty-guided dual sampling strategy to select samples for exploration and to generate stable pseudo-labels; and introduces a fine-grained dynamic assignment strategy that categorizes pseudo-labeled data and applies tailored losses to correct noisy supervision. Result: Experiments show that a range of state-of-the-art methods improve in both performance and labeling efficiency when combined with CABIN, identifying ambiguous regions more effectively and reducing bias. Conclusion: CABIN's closed-loop, cognition-aware learning mechanism markedly improves generalization under uncertainty and offers an effective solution for semi-supervised hyperspectral image classification. Abstract: Confidence alone is often misleading in hyperspectral image classification, as models tend to mistake high predictive scores for correctness while lacking awareness of uncertainty. This leads to confirmation bias, especially under sparse annotations or class imbalance, where models overfit confident errors and fail to generalize. We propose CABIN (Cognitive-Aware Behavior-Informed learNing), a semi-supervised framework that addresses this limitation through a closed-loop learning process of perception, action, and correction. CABIN first develops perceptual awareness by estimating epistemic uncertainty, identifying ambiguous regions where errors are likely to occur. It then acts by adopting an Uncertainty-Guided Dual Sampling Strategy, selecting uncertain samples for exploration while anchoring confident ones as stable pseudo-labels to reduce bias. To correct noisy supervision, CABIN introduces a Fine-Grained Dynamic Assignment Strategy that categorizes pseudo-labeled data into reliable, ambiguous, and noisy subsets, applying tailored losses to enhance generalization. Experimental results show that a wide range of state-of-the-art methods benefit from the integration of CABIN, with improved labeling efficiency and performance.
[116] VLF-MSC: Vision-Language Feature-Based Multimodal Semantic Communication System
Gwangyeon Ahn,Jiwan Seo,Joonhyuk Kang
Main category: cs.CV
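TL;DR: Proposes VLF-MSC, a multimodal semantic communication system based on vision-language features that supports both image and text generation from a single unified compact representation, improving spectral efficiency and noise robustness.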
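This entry evaluates a transmitted feature under channel noise at low SNR. As a minimal sketch of what a target SNR means, the function below adds white Gaussian noise scaled to the signal's average power. The AWGN channel model, the real-valued feature vector, and the name `awgn` are assumptions; the entry does not state which channel model VLF-MSC uses.

```python
import math
import random


def awgn(signal, snr_db, rng):
    """Add white Gaussian noise so that signal power / noise power matches snr_db."""
    power = sum(x * x for x in signal) / len(signal)
    noise_power = power / (10 ** (snr_db / 10))
    sigma = math.sqrt(noise_power)
    return [x + rng.gauss(0.0, sigma) for x in signal]


# Hypothetical unit-power feature vector sent over a 0 dB channel,
# where the noise is as strong as the signal itself.
rng = random.Random(0)
sent = [1.0 if i % 2 == 0 else -1.0 for i in range(20000)]
received = awgn(sent, snr_db=0.0, rng=rng)
```

At 0 dB the noise power equals the signal power, and every 10 dB increase cuts the noise power by a factor of ten.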
Details
Motivation: Existing semantic communication methods usually handle each modality separately, wasting bandwidth and producing semantic inconsistency, so a unified multimodal representation is needed. Method: A pre-trained vision-language model (VLM) encodes the image into a vision-language semantic feature (VLF); at the receiver, a decoder-based language model and a diffusion-based image generator are both conditioned on this feature to generate text and an image. Result: Experiments show that VLF-MSC outperforms text-only and image-only baselines at low SNR, substantially reducing bandwidth while improving semantic accuracy for both modalities. Conclusion: With a unified semantic representation, VLF-MSC achieves efficient, robust multimodal communication and offers a new paradigm for future semantic communication systems. Abstract: We propose Vision-Language Feature-based Multimodal Semantic Communication (VLF-MSC), a unified system that transmits a single compact vision-language representation to support both image and text generation at the receiver. Unlike existing semantic communication techniques that process each modality separately, VLF-MSC employs a pre-trained vision-language model (VLM) to encode the source image into a vision-language semantic feature (VLF), which is transmitted over the wireless channel. At the receiver, a decoder-based language model and a diffusion-based image generator are both conditioned on the VLF to produce a descriptive text and a semantically aligned image. This unified representation eliminates the need for modality-specific streams or retransmissions, improving spectral efficiency and adaptability. By leveraging foundation models, the system achieves robustness to channel noise while preserving semantic fidelity. Experiments demonstrate that VLF-MSC outperforms text-only and image-only baselines, achieving higher semantic accuracy for both modalities under low SNR with significantly reduced bandwidth.
[117] Mitigating Error Accumulation in Co-Speech Motion Generation via Global Rotation Diffusion and Multi-Level Constraints
Xiangyue Zhang,Jianfang Li,Jianqiang Ren,Jiaxu Zhang
Main category: cs.CV
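TL;DR: This paper proposes GlobalDiff, a diffusion-based co-speech motion generation framework that, for the first time, generates in the space of global joint rotations, avoiding the error accumulation that the hierarchical structure causes in conventional methods.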
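The error accumulation that motivates this entry can be shown with a toy example. In a hierarchical skeleton, a joint's global orientation is the composition of all local rotations above it, so a small error at every joint adds up toward the end-effector. The sketch below assumes a planar chain where rotations are plain angles; this is a simplification for illustration, not the paper's 3D formulation, and the function name is hypothetical.

```python
def local_to_global(local_angles):
    """Planar kinematic chain: joint i's global angle is the sum of local angles 0..i."""
    out, total = [], 0.0
    for a in local_angles:
        total += a
        out.append(total)
    return out


# Hypothetical five-joint chain with the same small error on every predicted angle.
true_local = [0.1] * 5
eps = 0.02
true_global = local_to_global(true_local)

# Predicting local rotations: each joint inherits the errors of all its ancestors.
from_local = local_to_global([a + eps for a in true_local])
err_local = [abs(a - b) for a, b in zip(from_local, true_global)]

# Predicting global rotations directly: each joint carries only its own error.
direct_global = [a + eps for a in true_global]
err_global = [abs(a - b) for a, b in zip(direct_global, true_global)]
```

With five joints, the last joint's orientation error is about 0.10 rad in the local parameterization and 0.02 rad in the global one. The trade-off is that global predictions no longer respect the skeleton by construction, which is why the entry pairs them with structural constraints.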
Details
Motivation: Existing methods generate motion in the space of local joint rotations; because of the skeleton hierarchy, errors accumulate and end-effector motion becomes unstable and unnatural. A new method is needed that decouples dependencies between joint predictions and reduces error accumulation. Method: Proposes GlobalDiff, a diffusion-based framework operating directly in global joint rotation space, with a multi-level constraint scheme comprising a joint structure constraint (virtual anchor points), a skeleton structure constraint (angular consistency across bones), and a temporal structure constraint (a multi-scale variational encoder) to strengthen structural awareness during generation. Result: Experiments on standard co-speech motion generation benchmarks show that GlobalDiff generates smoother and more accurate motions, improving on the current SOTA by 46.0% under multiple speaker identities. Conclusion: By performing diffusion in global rotation space with multi-level structural constraints, GlobalDiff effectively addresses hierarchical error accumulation and markedly improves the quality and stability of co-speech motion generation. Abstract: Reliable co-speech motion generation requires precise motion representation and consistent structural priors across all joints. Existing generative methods typically operate on local joint rotations, which are defined hierarchically based on the skeleton structure. This leads to cumulative errors during generation, manifesting as unstable and implausible motions at end-effectors. In this work, we propose GlobalDiff, a diffusion-based framework that operates directly in the space of global joint rotations for the first time, fundamentally decoupling each joint's prediction from upstream dependencies and alleviating hierarchical error accumulation. To compensate for the absence of structural priors in global rotation space, we introduce a multi-level constraint scheme. Specifically, a joint structure constraint introduces virtual anchor points around each joint to better capture fine-grained orientation. A skeleton structure constraint enforces angular consistency across bones to maintain structural integrity. A temporal structure constraint utilizes a multi-scale variational encoder to align the generated motion with ground-truth temporal patterns. These constraints jointly regularize the global diffusion process and reinforce structural awareness. Extensive evaluations on standard co-speech benchmarks show that GlobalDiff generates smooth and accurate motions, improving the performance by 46.0% compared to the current SOTA under multiple speaker identities.
[118] GridPrune: From "Where to Look" to "What to Select" in Visual Token Pruning for MLLMs
Yuxiang Duan,Ao Li,Yingqin Li,Luyu Li,Pengwei Wang
Main category: cs.CV
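TL;DR: Proposes GridPrune, which optimizes visual token pruning in multimodal large language models through a zonal "guide-globally, select-locally" strategy, substantially reducing computational overhead while maintaining high performance.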
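The "guide-globally, select-locally" strategy can be sketched in two steps: split a token budget across spatial zones in proportion to a per-zone relevance score, then keep the top-scoring tokens inside each zone. The proportional split with largest-remainder rounding, the per-zone relevance values, and all names below are assumptions for illustration; the entry does not specify how GridPrune computes its text-conditional guidance or budgets.

```python
def allocate_budget(zone_relevance, total_budget):
    """Split total_budget across zones in proportion to relevance."""
    total = sum(zone_relevance)
    raw = [total_budget * r / total for r in zone_relevance]
    budget = [int(x) for x in raw]
    # Hand leftover tokens to the zones with the largest fractional remainder.
    leftovers = total_budget - sum(budget)
    by_remainder = sorted(range(len(raw)), key=lambda z: raw[z] - budget[z],
                          reverse=True)
    for z in by_remainder[:leftovers]:
        budget[z] += 1
    return budget


def zonal_prune(token_scores, token_zone, zone_relevance, total_budget):
    """Keep the top-scoring tokens within each zone, up to that zone's budget.

    If a zone has fewer tokens than its budget, all of its tokens are kept
    and the surplus is simply left unused in this sketch.
    """
    budget = allocate_budget(zone_relevance, total_budget)
    kept = []
    for z, b in enumerate(budget):
        members = [i for i, zz in enumerate(token_zone) if zz == z]
        members.sort(key=lambda i: token_scores[i], reverse=True)
        kept.extend(members[:b])
    return sorted(kept)


# Hypothetical case: 8 tokens in 2 zones; zone 1 is more relevant to the text.
token_scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2]
token_zone = [0, 0, 0, 0, 1, 1, 1, 1]
kept = zonal_prune(token_scores, token_zone, zone_relevance=[1.0, 3.0],
                   total_budget=4)
global_top_k = sorted(range(8), key=lambda i: token_scores[i], reverse=True)[:4]
```

A global Top-K keeps tokens 0 to 3, all from zone 0, while the zonal version keeps one token from zone 0 and three from the zone the guidance marks as relevant. This is the spatial allocation effect the entry describes.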
Details
Motivation: Existing visual token pruning methods focus mainly on "what to select" and neglect "where to look", resulting in inefficient spatial allocation, positional bias, and retention of redundant tokens. Inspired by human visual attention, a coarse-to-fine, two-stage attention allocation strategy is needed. Method: Proposes GridPrune, which divides the image into spatial zones, first dynamically allocates a token budget to each zone based on text-conditional guidance (guide-globally), then performs local token selection within each zone (select-locally) for more efficient pruning. Result: On LLaVA-NeXT-7B, it retains 96.98% of full performance using only 11.1% of the tokens, 2.34% better than the best baseline at the same pruning rate. Conclusion: By mimicking the two-stage mechanism of human visual attention, GridPrune resolves the inefficient spatial allocation of conventional pruning methods and markedly improves the balance between inference efficiency and performance in multimodal large language models. Abstract: Multimodal large language models (MLLMs) have shown remarkable capabilities in a wide range of vision-language tasks. However, the large number of visual tokens introduces significant computational overhead. To address this issue, visual token pruning has emerged as a key technique for enhancing the efficiency of MLLMs. In cognitive science, humans tend to first determine which regions of a scene to attend to ("where to look") before deciding which specific elements within those regions to process in detail ("what to select"). This two-stage strategy enables the visual system to efficiently allocate attention at a coarse spatial level before performing fine-grained selection. However, existing pruning methods primarily focus on directly optimizing "what to select", typically using attention scores or similarity metrics. They rarely consider "where to look", which has been shown to lead to inefficient spatial allocation, positional bias, and the retention of irrelevant or redundant tokens. In this paper, we propose GridPrune, a method that replaces the global Top-K mechanism with a "guide-globally, select-locally" zonal selection system. GridPrune splits the pruning process into two steps: first, it uses text-conditional guidance to dynamically allocate a token budget across spatial zones; and then, it performs local selection within each budgeted zone. Experimental results demonstrate that GridPrune achieves superior performance across various MLLM architectures. On LLaVA-NeXT-7B, GridPrune retains 96.98% of the full performance while using 11.1% of the tokens, outperforming the best-performing baseline by 2.34% at the same pruning rate.
[119] SUGAR: Learning Skeleton Representation with Visual-Motion Knowledge for Action Recognition
Qilang Ye,Yu Zhou,Lian He,Jie Zhang,Xuanming Guo,Jiayu Zhang,Mingkui Tan,Weicheng Xie,Yue Sun,Tao Tan,Xiaochen Yuan,Ghada Khoriba,Zitong Yu
Main category: cs.CV
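TL;DR: Proposes SUGAR, a new paradigm that combines large language models (LLMs) with human skeleton data for action recognition and description. Visual-motion knowledge guides skeleton learning to produce discrete representations, a pre-trained LLM generates action targets and descriptions, and a TQP module handles long skeleton sequences. The approach performs well on multiple benchmarks and outperforms linear-based methods in zero-shot scenarios.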
Details
Motivation: Explores how to apply large language models (LLMs) to skeleton-based human action recognition, addressing the difficulty LLMs have in directly understanding skeleton data and distinguishing among actions. Method: Off-the-shelf large-scale video models extract visual and motion information as prior knowledge to supervise skeleton learning and yield discrete representations; an LLM with untouched pre-trained weights interprets these representations and generates action classifications and descriptions; a Temporal Query Projection (TQP) module models long skeleton sequences. Result: SUGAR's effectiveness is validated on several skeleton-based action classification benchmarks, and it outperforms linear-based methods in zero-shot scenarios, showing stronger generalization. Conclusion: SUGAR bridges the gap between LLMs and skeleton-based action recognition, achieving action understanding and description without fine-tuning the LLM, and shows potential and generality for cross-modal tasks. Abstract: Large Language Models (LLMs) hold rich implicit knowledge and powerful transferability. In this paper, we explore the combination of LLMs with the human skeleton to perform action classification and description. However, when treating LLM as a recognizer, two questions arise: 1) How can LLMs understand skeleton? 2) How can LLMs distinguish among actions? To address these problems, we introduce a novel paradigm named learning Skeleton representation with visUal-motion knowledGe for Action Recognition (SUGAR). In our pipeline, we first utilize off-the-shelf large-scale video models as a knowledge base to generate visual, motion information related to actions. Then, we propose to supervise skeleton learning through this prior knowledge to yield discrete representations. Finally, we use the LLM with untouched pre-training weights to understand these representations and generate the desired action targets and descriptions. Notably, we present a Temporal Query Projection (TQP) module to continuously model the skeleton signals with long sequences. Experiments on several skeleton-based action classification benchmarks demonstrate the efficacy of our SUGAR. Moreover, experiments on zero-shot scenarios show that SUGAR is more versatile than linear-based methods.
[120] MTAttack: Multi-Target Backdoor Attacks against Large Vision-Language Models
Zihan Wang,Guansong Pang,Wenjun Miao,Jin Zheng,Xiao Bai
Main category: cs.CV
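TL;DR: This paper presents MTAttack, the first multi-target backdoor attack framework against large vision-language models (LVLMs). By introducing Proxy Space Partitioning and Trigger Prototype Anchoring constraints, it resolves feature interference among multiple triggers and achieves multi-target attacks with high success rates, strong generalization, and robustness to defenses.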
Details
Motivation: Existing backdoor attacks focus mainly on single-target attacks, whereas multi-target attacks pose a greater threat in practice. This paper aims to expose the security vulnerability of LVLMs to multi-target backdoor attacks. Method: Proposes the MTAttack framework, whose novel optimization method jointly optimizes multiple triggers in the latent space and introduces a Proxy Space Partitioning constraint and a Trigger Prototype Anchoring constraint so that each trigger maps independently to a unique proxy class while remaining separable. Result: Experiments show that MTAttack achieves high attack success rates on multiple benchmarks, substantially outperforming existing methods, and exhibits cross-dataset generalization and robustness against defense strategies. Conclusion: LVLMs face a serious risk from multi-target backdoor attacks, and MTAttack's effectiveness underscores the urgency of developing corresponding defenses. Abstract: Recent advances in Large Visual Language Models (LVLMs) have demonstrated impressive performance across various vision-language tasks by leveraging large-scale image-text pretraining and instruction tuning. However, the security vulnerabilities of LVLMs have become increasingly concerning, particularly their susceptibility to backdoor attacks. Existing backdoor attacks focus on single-target attacks, i.e., targeting a single malicious output associated with a specific trigger. In this work, we uncover multi-target backdoor attacks, where multiple independent triggers corresponding to different attack targets are added in a single pass of training, posing a greater threat to LVLMs in real-world applications. Executing such attacks in LVLMs is challenging since there can be many incorrect trigger-target mappings due to severe feature interference among different triggers. To address this challenge, we propose MTAttack, the first multi-target backdoor attack framework for enforcing accurate multiple trigger-target mappings in LVLMs. The core of MTAttack is a novel optimization method with two constraints, namely Proxy Space Partitioning constraint and Trigger Prototype Anchoring constraint. It jointly optimizes multiple triggers in the latent space, with each trigger independently mapping clean images to a unique proxy class while at the same time guaranteeing their separability. Experiments on popular benchmarks demonstrate a high success rate of MTAttack for multi-target attacks, substantially outperforming existing attack methods. Furthermore, our attack exhibits strong generalizability across datasets and robustness against backdoor defense strategies. These findings highlight the vulnerability of LVLMs to multi-target backdoor attacks and underscore the urgent need for mitigating such threats. Code is available at https://github.com/mala-lab/MTAttack.
[121] RobIA: Robust Instance-aware Continual Test-time Adaptation for Deep Stereo
Jueun Ko,Hyewon Park,Hyesong Choi,Dongbo Min
Main category: cs.CV
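TL;DR: Proposes RobIA, a robust, instance-aware continual test-time adaptation framework for stereo depth estimation that improves performance in dynamic domains through dynamic routing and pseudo-supervision.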
Details
Motivation: Stereo depth estimation in real environments faces continual domain shifts and sparse annotations; most existing TTA methods rest on static assumptions and cope poorly with continual shift. Method: Designs the AttEx-MoE module, which uses a lightweight self-attention mechanism tailored to epipolar geometry to dynamically route inputs to expert networks, and builds a PEFT-based Robust AdaptBN Teacher model that provides dense pseudo-supervision by complementing sparse handcrafted labels. Result: Experiments show that RobIA achieves better adaptation performance than existing methods across multiple dynamic target domains while remaining computationally efficient. Conclusion: With an instance-aware dynamic adaptation strategy and a robust teacher-student framework, RobIA improves the robustness and generalization of stereo depth estimation under continual domain shift. Abstract: Stereo Depth Estimation in real-world environments poses significant challenges due to dynamic domain shifts, sparse or unreliable supervision, and the high cost of acquiring dense ground-truth labels. While recent Test-Time Adaptation (TTA) methods offer promising solutions, most rely on static target domain assumptions and input-invariant adaptation strategies, limiting their effectiveness under continual shifts. In this paper, we propose RobIA, a novel Robust, Instance-Aware framework for Continual Test-Time Adaptation (CTTA) in stereo depth estimation. RobIA integrates two key components: (1) Attend-and-Excite Mixture-of-Experts (AttEx-MoE), a parameter-efficient module that dynamically routes input to frozen experts via a lightweight self-attention mechanism tailored to epipolar geometry, and (2) Robust AdaptBN Teacher, a PEFT-based teacher model that provides dense pseudo-supervision by complementing sparse handcrafted labels. This strategy enables input-specific flexibility and broad supervision coverage, improving generalization under domain shift. Extensive experiments demonstrate that RobIA achieves superior adaptation performance across dynamic target domains while maintaining computational efficiency.
[122] Explicit Temporal-Semantic Modeling for Dense Video Captioning via Context-Aware Cross-Modal Interaction
Mingda Jia,Weiliang Meng,Zenghuang Fu,Yiheng Li,Qi Zeng,Yifan Zhang,Ju Xin,Rongtao Xu,Jiguang Zhang,Xiaopeng Zhang
Main category: cs.CV
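TL;DR: Proposes CACMI, an explicit temporal-semantic modeling framework for dense video captioning that improves performance through cross-modal frame aggregation and context-aware feature enhancement.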
Details
Motivation: Existing methods rely on implicit modeling and struggle to capture temporal coherence across event sequences and the full semantics of the visual context. Method: Designs the Context-Aware Cross-Modal Interaction (CACMI) framework, with two modules, Cross-modal Frame Aggregation and Context-aware Feature Enhancement, which explicitly model latent temporal characteristics in videos and linguistic semantics from text. Result: Achieves state-of-the-art performance on the ActivityNet Captions and YouCook2 datasets. Conclusion: Through explicit temporal-semantic modeling, CACMI improves the accuracy and coherence of dense video captioning. Abstract: Dense video captioning jointly localizes and captions salient events in untrimmed videos. Recent methods primarily focus on leveraging additional prior knowledge and advanced multi-task architectures to achieve competitive performance. However, these pipelines rely on implicit modeling that uses frame-level or fragmented video features, failing to capture the temporal coherence across event sequences and comprehensive semantics within visual contexts. To address this, we propose an explicit temporal-semantic modeling framework called Context-Aware Cross-Modal Interaction (CACMI), which leverages both latent temporal characteristics within videos and linguistic semantics from text corpus. Specifically, our model consists of two core components: Cross-modal Frame Aggregation aggregates relevant frames to extract temporally coherent, event-aligned textual features through cross-modal retrieval; and Context-aware Feature Enhancement utilizes query-guided attention to integrate visual dynamics with pseudo-event semantics. Extensive experiments on the ActivityNet Captions and YouCook2 datasets demonstrate that CACMI achieves state-of-the-art performance on the dense video captioning task.
[123] Right Looks, Wrong Reasons: Compositional Fidelity in Text-to-Image Generation
Mayank Vatsa,Aparna Bharati,Richa Singh
Main category: cs.CV
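TL;DR: This survey examines how current mainstream text-to-image models fail at logical composition (such as negation, counting, and spatial relations). Performance collapses under composition, which the authors attribute to training data that lack negation, continuous attention architectures that are ill-suited to discrete logic, and evaluation metrics that favor visual plausibility over logical satisfaction. They argue that genuine compositionality requires fundamental advances in representation and reasoning.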
Details
Motivation: Existing text-to-image models handle logical composition poorly and cannot meet the need for precise semantic understanding, so a systematic analysis of the root causes is needed. Method: Analyzes compositional tasks over three logical primitives (negation, counting, and spatial relations), drawing on existing benchmarks and methods to reveal why performance degrades, and attributes the failure at three levels: data, architecture, and evaluation. Result: Model performance drops sharply on compositional logic tasks, mainly because training data lack explicit negation, continuous attention mechanisms struggle with discrete logic, and evaluation metrics favor visual plausibility over logical correctness. Conclusion: Genuinely compositional generation requires fundamental breakthroughs in representation and reasoning mechanisms rather than incremental improvements to existing architectures. Abstract: The architectural blueprint of today's leading text-to-image models contains a fundamental flaw: an inability to handle logical composition. This survey investigates this breakdown across three core primitives: negation, counting, and spatial relations. Our analysis reveals a dramatic performance collapse: models that are accurate on single primitives fail precipitously when these are combined, exposing severe interference. We trace this failure to three key factors. First, training data show a near-total absence of explicit negations. Second, continuous attention architectures are fundamentally unsuitable for discrete logic. Third, evaluation metrics reward visual plausibility over constraint satisfaction. By analyzing recent benchmarks and methods, we show that current solutions and simple scaling cannot bridge this gap. Achieving genuine compositionality, we conclude, will require fundamental advances in representation and reasoning rather than incremental adjustments to existing architectures.
[124] Split-Layer: Enhancing Implicit Neural Representation by Maximizing the Dimensionality of Feature Space
Zhicheng Cai,Hao Zhu,Linsen Chen,Qiu Shen,Xun Cao
Main category: cs.CV
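TL;DR: Proposes the split-layer, a new MLP construction that splits each layer into multiple parallel branches and fuses their outputs via the Hadamard product, effectively constructing a high-degree polynomial space and substantially increasing the representational capacity of implicit neural representations (INRs) without excessive computational overhead.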
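A minimal sketch of why fusing parallel branches with a Hadamard product raises the polynomial degree: each branch is linear in the input, so the elementwise product of B branches is a degree-B polynomial in the input. The function names and the omission of activation functions are assumptions; the entry does not say where the paper places its nonlinearities or how branch widths are chosen.

```python
def linear(weights, bias, x):
    """Affine map: one output per weight row."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def split_layer(branches, x):
    """Apply every (weights, bias) branch to x and fuse by elementwise product."""
    out = None
    for weights, bias in branches:
        h = linear(weights, bias, x)
        out = h if out is None else [a * b for a, b in zip(out, h)]
    return out


# Hypothetical one-dimensional case: every branch is the identity map.
identity = ([[1.0]], [0.0])
squared = split_layer([identity, identity], [3.0])          # x * x
cubed = split_layer([identity, identity, identity], [3.0])  # x * x * x
```

Two identity branches already produce x squared from linear parts, whereas a single linear layer of any width stays at degree one before its activation.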
Details
Motivation: In conventional MLP architectures, the low-dimensional feature space limits the representational capacity of implicit neural representations (INRs); widening the network increases feature dimensionality linearly but makes computational and memory costs grow quadratically. Method: Proposes the split-layer, which divides each MLP layer into multiple parallel branches and integrates their outputs via the Hadamard product, building a high-dimensional feature space with high-degree polynomial representational capacity. Result: The split-layer substantially outperforms existing methods on multiple tasks, including 2D image fitting, 2D CT reconstruction, 3D shape representation, and 5D novel view synthesis. Conclusion: The split-layer efficiently expands the feature space dimensionality of INRs and greatly improves their representational capacity and practical results without a significant increase in computational cost. Abstract: Implicit neural representation (INR) models signals as continuous functions using neural networks, offering efficient and differentiable optimization for inverse problems across diverse disciplines. However, the representational capacity of INR, defined by the range of functions the neural network can characterize, is inherently limited by the low-dimensional feature space in conventional multilayer perceptron (MLP) architectures. While widening the MLP can linearly increase feature space dimensionality, it also leads to a quadratic growth in computational and memory costs. To address this limitation, we propose the split-layer, a novel reformulation of MLP construction. The split-layer divides each layer into multiple parallel branches and integrates their outputs via Hadamard product, effectively constructing a high-degree polynomial space. This approach significantly enhances INR's representational capacity by expanding the feature space dimensionality without incurring prohibitive computational overhead. Extensive experiments demonstrate that the split-layer substantially improves INR performance, surpassing existing methods across multiple tasks, including 2D image fitting, 2D CT reconstruction, 3D shape representation, and 5D novel view synthesis.
[125] Decoupling Bias, Aligning Distributions: Synergistic Fairness Optimization for Deepfake Detection
Feng Ding,Wenhui Yi,Yunpeng Zhou,Xinan He,Hong Rao,Shu Hu
Main category: cs.CV
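TL;DR: Proposes a dual-mechanism collaborative optimization framework that uses structural fairness decoupling and global distribution alignment to improve the inter-group and intra-group fairness of cross-domain deepfake detection models while maintaining detection accuracy.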
Details
Motivation: Existing fairness-enhanced deepfake detectors often sacrifice detection accuracy, and bias toward different demographic groups leads to systemic misjudgments and social inequity. Method: Decouples channels sensitive to demographic groups at the model architecture level and, at the feature level, reduces the distance between the overall sample distribution and each group's distribution, combining structural fairness decoupling with global distribution alignment. Result: Experiments show that the method improves inter-group and intra-group fairness across multiple domains while maintaining high overall detection accuracy. Conclusion: The dual-mechanism framework effectively balances detection accuracy and fairness, supporting trustworthy and fair deepfake detection systems. Abstract: Fairness is a core element in the trustworthy deployment of deepfake detection models, especially in the field of digital identity security. Biases in detection models toward different demographic groups, such as gender and race, may lead to systemic misjudgments, exacerbating the digital divide and social inequities. However, current fairness-enhanced detectors often improve fairness at the cost of detection accuracy. To address this challenge, we propose a dual-mechanism collaborative optimization framework. Our proposed method innovatively integrates structural fairness decoupling and global distribution alignment: decoupling channels sensitive to demographic groups at the model architectural level, and subsequently reducing the distance between the overall sample distribution and the distributions corresponding to each demographic group at the feature level. Experimental results demonstrate that, compared with other methods, our framework improves both inter-group and intra-group fairness while maintaining overall detection accuracy across domains.
[126] GEA: Generation-Enhanced Alignment for Text-to-Image Person Retrieval
Hao Zou,Runqing Zhang,Xue Zhou,Jianxiao Zou
Main category: cs.CV
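TL;DR: Proposes Generation-Enhanced Alignment (GEA), which uses diffusion-generated images as intermediate semantic representations to improve cross-modal alignment in text-to-image person retrieval.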
Details
Motivation: Existing text-to-image person retrieval methods are limited by incomplete textual queries and the modality gap, leading to poor cross-modal alignment and overfitting. Method: Designs two parallel modules: Text-Guided Token Enhancement (TGTE) uses diffusion-generated images as an intermediate semantic bridge between text and image; Generative Intermediate Fusion (GIF) fuses generated-image, original-image, and text features via cross-attention and optimizes the unified representation with a triplet alignment loss. Result: Experiments on the CUHK-PEDES, RSTPReid, and ICFG-PEDES datasets show that GEA outperforms existing methods and improves retrieval performance. Conclusion: The generation-based GEA framework effectively mitigates the modality gap and missing semantics, markedly improving cross-modal alignment in text-to-image person retrieval. Abstract: Text-to-Image Person Retrieval (TIPR) aims to retrieve person images based on natural language descriptions. Although many TIPR methods have achieved promising results, sometimes textual queries cannot accurately and comprehensively reflect the content of the image, leading to poor cross-modal alignment and overfitting to limited datasets. Moreover, the inherent modality gap between text and image further amplifies these issues, making accurate cross-modal retrieval even more challenging. To address these limitations, we propose the Generation-Enhanced Alignment (GEA) from a generative perspective. GEA contains two parallel modules: (1) Text-Guided Token Enhancement (TGTE), which introduces diffusion-generated images as intermediate semantic representations to bridge the gap between text and visual patterns. These generated images enrich the semantic representation of text and facilitate cross-modal alignment. (2) Generative Intermediate Fusion (GIF), which combines cross-attention between generated images, original images, and text features to generate a unified representation optimized by triplet alignment loss. We conduct extensive experiments on three public TIPR datasets, CUHK-PEDES, RSTPReid, and ICFG-PEDES, to evaluate the performance of GEA. The results justify the effectiveness of our method. More implementation details and extended results are available at https://github.com/sugelamyd123/Sup-for-GEA.
[127] Physically Interpretable Multi-Degradation Image Restoration via Deep Unfolding and Explainable Convolution
Hu Gao,Xiaoning Lei,Xichen Xu,Depeng Dang,Lizhuang Ma
Main category: cs.CV
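TL;DR: This paper proposes InterIR, an interpretable multi-degradation image restoration method based on a deep unfolding network. Using an improved second-order semi-smooth Newton algorithm and an explainable convolution module, it achieves excellent restoration performance while remaining interpretable.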
Details
Motivation: Most existing image restoration methods target a single degradation type, and stacking modules lacks interpretability; real-world images often contain multiple degradations, calling for models that offer both performance and interpretability. Method: Built on a deep unfolding network that maps the iterations of an optimization algorithm into a learnable structure; an improved second-order semi-smooth Newton algorithm keeps each module physically interpretable, and an explainable convolution module inspired by information processing in the human brain adds flexibility and adaptability. Result: InterIR performs excellently on multi-degradation image restoration and remains highly competitive on single-degradation tasks. Conclusion: With its interpretability-driven design, InterIR improves the handling of complex real-world degradations and increases model transparency and flexibility, offering a new direction for image restoration. Abstract: Although image restoration has advanced significantly, most existing methods target only a single type of degradation. In real-world scenarios, images often contain multiple degradations simultaneously, such as rain, noise, and haze, requiring models capable of handling diverse degradation types. Moreover, methods that improve performance through module stacking often suffer from limited interpretability. In this paper, we propose a novel interpretability-driven approach for multi-degradation image restoration, built upon a deep unfolding network that maps the iterative process of a mathematical optimization algorithm into a learnable network structure. Specifically, we employ an improved second-order semi-smooth Newton algorithm to ensure that each module maintains clear physical interpretability. To further enhance interpretability and adaptability, we design an explainable convolution module inspired by the human brain's flexible information processing and the intrinsic characteristics of images, allowing the network to flexibly leverage learned knowledge and autonomously adjust parameters for different input. The resulting tightly integrated architecture, named InterIR, demonstrates excellent performance in multi-degradation restoration while remaining highly competitive on single-degradation tasks.
[128] Towards Blind and Low-Vision Accessibility of Lightweight VLMs and Custom LLM-Evals
Shruti Singh Baghel,Yash Pratap Singh Rathore,Sushovan Jena,Anurag Pradhan,Amit Shukla,Arnav Bhavsar,Pawan Goyal
Main category: cs.CV
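TL;DR: This study evaluates vision-language models of different sizes (SmolVLM2) at generating accessible video descriptions for blind and low-vision users, proposes two new evaluation frameworks, and tests on-device deployment performance on a smartphone.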
Details
Motivation: Large vision-language models are capable but resource-hungry and hard to deploy on mobile devices, limiting their practical use for blind and low-vision users; the performance of small models in accessibility scenarios therefore needs study. Method: Evaluates the 500M- and 2.2B-parameter SmolVLM2 models on the AVCaps and Charades datasets; proposes two new evaluation frameworks, the Multi-Context BLV Framework and the Navigational Assistance Framework; systematically compares four prompt design strategies; and tests FP32 and INT8 precision variants on a smartphone. Result: With suitable prompts, the smaller model can approach the larger model's description quality; the proposed frameworks measure information useful to BLV users more comprehensively; mobile deployment shows that INT8 quantization markedly reduces resource consumption while maintaining good performance. Conclusion: Small vision-language models, once optimized, can provide high-quality, context-aware video descriptions for BLV users on resource-constrained devices and have practical potential. Abstract: Large Vision-Language Models (VLMs) excel at understanding and generating video descriptions, but their high memory, computation, and deployment demands hinder practical use, particularly for blind and low-vision (BLV) users who depend on detailed, context-aware descriptions. To study the effect of model size on accessibility-focused description quality, we evaluate SmolVLM2 variants with 500M and 2.2B parameters across two diverse datasets: AVCaps (outdoor), and Charades (indoor). In this work, we introduce two novel evaluation frameworks specifically designed for BLV accessibility assessment: the Multi-Context BLV Framework evaluating spatial orientation, social interaction, action events, and ambience contexts; and the Navigational Assistance Framework focusing on mobility-critical information. Additionally, we conduct a systematic evaluation of four different prompt design strategies and deploy both models on a smartphone, evaluating FP32 and INT8 precision variants to assess real-world performance constraints on resource-limited mobile devices.
[129] CephRes-MHNet: A Multi-Head Residual Network for Accurate and Robust Cephalometric Landmark Detection
Ahmed Jaheen,Islam Hassan,Mohanad Abouserie,Abdelaty Rehab,Adham Elasfar,Knzy Elmasry,Mostafa El-Dawlatly,Seif Eldawlatly
Main category: cs.CV
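TL;DR: This paper presents CephRes-MHNet, a multi-head residual convolutional network that automatically detects cephalometric landmarks in 2D lateral skull X-rays with high accuracy and efficiency.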
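This entry reports its results as mean radial error (MRE) and success detection rate (SDR) at 2.0 mm. A minimal sketch of how these two landmark metrics are commonly computed follows; the `mm_per_pixel` scaling, the inclusive threshold, and the function names are assumptions, and the paper's exact evaluation protocol may differ.

```python
import math


def radial_errors(pred, truth, mm_per_pixel=1.0):
    """Euclidean distance between each predicted and true landmark, in mm."""
    return [math.dist(p, t) * mm_per_pixel for p, t in zip(pred, truth)]


def mean_radial_error(errors):
    """MRE: the average radial error over all landmarks."""
    return sum(errors) / len(errors)


def success_detection_rate(errors, threshold_mm=2.0):
    """SDR: percentage of landmarks whose error is within the threshold."""
    return 100.0 * sum(1 for e in errors if e <= threshold_mm) / len(errors)


# Hypothetical landmarks (already in mm): errors of 0, 5, and 1 mm.
pred = [(0.0, 0.0), (3.0, 4.0), (1.0, 0.0)]
truth = [(0.0, 0.0), (0.0, 0.0), (0.0, 0.0)]
errors = radial_errors(pred, truth)
mre = mean_radial_error(errors)
sdr = success_detection_rate(errors)
```

The two metrics answer different questions: MRE is pulled up by a single large miss, while SDR only counts how many landmarks fall inside the clinically acceptable radius.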
Details
Motivation: Manual annotation of cephalometric landmarks is time-consuming and error-prone, and existing automated methods perform poorly under low contrast and complex anatomy, so a more robust and efficient automated detector is needed. Method: Proposes CephRes-MHNet, which combines residual encoding, dual-attention mechanisms, and multi-head decoders to improve contextual reasoning and anatomical localization precision; it is trained and evaluated on the Aariz cephalometric dataset (1,000 radiographs). Result: CephRes-MHNet achieves a mean radial error (MRE) of 1.23 mm and a success detection rate (SDR) at 2.0 mm of 85.5%, outperforming all compared models with less than 25% of the parameters of the strongest baseline, AFPF-Net. Conclusion: Through architectural optimization, CephRes-MHNet reaches state-of-the-art cephalometric landmark detection with high accuracy and low computational cost, making it suitable for clinical use. Abstract: Accurate localization of cephalometric landmarks from 2D lateral skull X-rays is vital for orthodontic diagnosis and treatment. Manual annotation is time-consuming and error-prone, whereas automated approaches often struggle with low contrast and anatomical complexity. This paper introduces CephRes-MHNet, a multi-head residual convolutional network for robust and efficient cephalometric landmark detection. The architecture integrates residual encoding, dual-attention mechanisms, and multi-head decoders to enhance contextual reasoning and anatomical precision. Trained on the Aariz Cephalometric dataset of 1,000 radiographs, CephRes-MHNet achieved a mean radial error (MRE) of 1.23 mm and a success detection rate (SDR) @ 2.0 mm of 85.5%, outperforming all evaluated models. In particular, it exceeded the strongest baseline, the attention-driven AFPF-Net (MRE = 1.25 mm, SDR @ 2.0 mm = 84.1%), while using less than 25% of its parameters. These results demonstrate that CephRes-MHNet attains state-of-the-art accuracy through architectural efficiency, providing a practical solution for real-world orthodontic analysis.
[130] Utilizing a Geospatial Foundation Model for Coastline Delineation in Small Sandy Islands
Tishya Chhabra,Manisha Bajpai,Walter Zesk,Skylar Tibbits
Main category: cs.CV
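TL;DR: This study evaluates NASA and IBM's Prithvi-EO-2.0 geospatial foundation model for shoreline delineation on small sandy islands in the Maldives, using 225 multispectral images for training and testing. Even with only 5 training images the model achieves high accuracy (F1 = 0.94, IoU = 0.79), showing strong transfer learning capability for coastal monitoring in data-poor regions.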
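The F1 and IoU figures quoted in this entry are standard segmentation metrics. A minimal sketch of both for a single binary mask follows, assuming flattened 0/1 masks and the convention that two empty masks score 1.0; the function name is hypothetical and the paper's averaging scheme is not stated in the entry.

```python
def f1_and_iou(pred, truth):
    """Return (F1, IoU) for two flattened binary masks of equal length."""
    tp = sum(1 for p, t in zip(pred, truth) if p and t)
    fp = sum(1 for p, t in zip(pred, truth) if p and not t)
    fn = sum(1 for p, t in zip(pred, truth) if not p and t)
    if tp + fp + fn == 0:
        return 1.0, 1.0  # both masks empty: treated as a perfect match here
    f1 = 2 * tp / (2 * tp + fp + fn)
    iou = tp / (tp + fp + fn)
    return f1, iou


# Hypothetical four-pixel masks: one hit, one false alarm, one miss.
f1, iou = f1_and_iou([1, 1, 0, 0], [1, 0, 1, 0])
```

For one mask the two metrics are tied by IoU = F1 / (2 - F1), so F1 = 0.94 would correspond to an IoU of about 0.89. The reported pair (0.94, 0.79) does not fit that identity, which suggests the two numbers are aggregated differently, for example averaged over images or classes; this is an inference, not something the entry states.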
Details
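Motivation: Shoreline delineation is essential for coastal monitoring, but high-quality labeled data are hard to obtain in data-poor regions, so foundation models with strong transfer learning are needed to make monitoring more efficient. Method: Collects and labels 225 multispectral satellite images of two Maldivian islands, releases the dataset publicly, and fine-tunes the 300M- and 600M-parameter versions of Prithvi-EO-2.0 on training subsets of 5 to 181 images to evaluate few-shot shoreline delineation. Result: Even with only 5 training images, the Prithvi models reach an F1 score of 0.94 and an IoU of 0.79, and performance improves steadily as training samples increase, confirming strong transfer learning. Conclusion: Prithvi-EO-2.0 delineates shorelines very well in the few-shot setting, indicating that geospatial foundation models have great potential for remote sensing in data-poor regions and can support global coastal monitoring. Abstract: We present an initial evaluation of NASA and IBM's Prithvi-EO-2.0 geospatial foundation model on shoreline delineation of small sandy islands using satellite images. We curated and labeled a dataset of 225 multispectral images of two Maldivian islands, which we publicly release, and fine-tuned both the 300M and 600M parameter versions of Prithvi on training subsets ranging from 5 to 181 images. Our experiments show that even with as few as 5 training images, the models achieve high performance (F1 of 0.94, IoU of 0.79). Our results demonstrate the strong transfer learning capability of Prithvi, underscoring the potential of such models to support coastal monitoring in data-poor regions.
[131] VISTA: A Vision and Intent-Aware Social Attention Framework for Multi-Agent Trajectory Prediction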
Stephane Da Silva Martins,Emanuel Aldea,Sylvie Le Hégarat-Mascle
Main category: cs.CV
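TL;DR: Proposes VISTA, a recursive goal-conditioned transformer for multi-agent trajectory prediction. By fusing long-term intent with motion history, flexibly modeling social interactions, and exposing interpretable social influence patterns, it achieves state-of-the-art accuracy and lower collision rates on multiple benchmarks.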
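This entry evaluates both displacement accuracy and collision rate. The sketch below shows the usual average and final displacement errors (ADE, FDE) together with one plausible collision-rate definition: the percentage of agent pairs that come within a radius of each other at any shared timestep. That pairwise definition, the `radius` parameter, and the function names are assumptions; the entry does not state how the paper counts collisions.

```python
import math
from itertools import combinations


def ade(pred, truth):
    """Average displacement error over all timesteps of one trajectory."""
    return sum(math.dist(p, t) for p, t in zip(pred, truth)) / len(pred)


def fde(pred, truth):
    """Final displacement error: distance at the last timestep."""
    return math.dist(pred[-1], truth[-1])


def collision_rate(trajectories, radius):
    """Percentage of agent pairs closer than `radius` at any shared timestep."""
    pairs = list(combinations(range(len(trajectories)), 2))
    hits = sum(
        1 for i, j in pairs
        if any(math.dist(a, b) < radius
               for a, b in zip(trajectories[i], trajectories[j]))
    )
    return 100.0 * hits / len(pairs)


# Hypothetical scene: agents 0 and 1 nearly touch at the second timestep.
scene = [
    [(0.0, 0.0), (1.0, 0.0)],
    [(0.0, 1.0), (1.0, 0.1)],
    [(5.0, 5.0), (6.0, 6.0)],
]
rate = collision_rate(scene, radius=0.2)
example_ade = ade([(0.0, 0.0), (1.0, 0.0)], [(0.0, 0.0), (1.0, 2.0)])
example_fde = fde([(0.0, 0.0), (1.0, 0.0)], [(0.0, 0.0), (1.0, 2.0)])
```

Displacement metrics score each agent in isolation, so a set of individually accurate trajectories can still collide; a joint metric like this one is what exposes that.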
Details
Motivation: Existing methods struggle to capture both agents' long-term goals and their fine-grained social interactions, so the predicted multi-agent futures are unrealistic. Method: VISTA uses a cross-attention fusion module to integrate long-term intent with past motion, a social-token attention mechanism for flexible inter-agent interaction modeling, and pairwise attention maps that make social influence patterns interpretable at inference time. Result: On the MADRAS and SDD datasets, VISTA reaches state-of-the-art ADE, FDE, and minFDE; on MADRAS the average collision rate drops from 2.14% to 0.03%, and on SDD it attains zero collisions. Conclusion: VISTA generates socially compliant, goal-aware, and interpretable trajectories, making it suitable for safety-critical autonomous systems. Abstract: Multi-agent trajectory prediction is crucial for autonomous systems operating in dense, interactive environments. Existing methods often fail to jointly capture agents' long-term goals and their fine-grained social interactions, which leads to unrealistic multi-agent futures. We propose VISTA, a recursive goal-conditioned transformer for multi-agent trajectory forecasting. VISTA combines (i) a cross-attention fusion module that integrates long-horizon intent with past motion, (ii) a social-token attention mechanism for flexible interaction modeling across agents, and (iii) pairwise attention maps that make social influence patterns interpretable at inference time. Our model turns single-agent goal-conditioned prediction into a coherent multi-agent forecasting framework. Beyond standard displacement metrics, we evaluate trajectory collision rates as a measure of joint realism. On the high-density MADRAS benchmark and on SDD, VISTA achieves state-of-the-art accuracy and substantially fewer collisions. On MADRAS, it reduces the average collision rate of strong baselines from 2.14 to 0.03 percent, and on SDD it attains zero collisions while improving ADE, FDE, and minFDE. These results show that VISTA generates socially compliant, goal-aware, and interpretable trajectories, making it promising for safety-critical autonomous systems.
[132] LiNeXt: Revisiting LiDAR Completion with Efficient Non-Diffusion Architectures
Wenzhe He,Xiaojun Chen,Ruiqi Wang,Ruihui Li,Huilong Pi,Jiapeng Zhang,Zhuo Tang,Kenli Li
Main category: cs.CV
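TL;DR: Proposes LiNeXt, a lightweight, non-diffusion network for fast and accurate point cloud completion. On SemanticKITTI it achieves 199.8x faster inference than LiDiff, reduces Chamfer Distance by 50.7%, and uses only 6.1% of the parameters.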
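The Chamfer Distance cited in this entry compares two point clouds by matching every point to its nearest neighbor in the other cloud. A minimal brute-force sketch follows, assuming the symmetric, mean-of-squared-distances convention; other conventions (unsquared distances, sums instead of means) are also in use, and the entry does not say which one the paper reports.

```python
import math


def chamfer_distance(a, b):
    """Symmetric Chamfer distance: mean squared nearest-neighbor distance, both ways."""
    def one_way(src, dst):
        return sum(min(math.dist(p, q) ** 2 for q in dst) for p in src) / len(src)

    return one_way(a, b) + one_way(b, a)


# Hypothetical clouds: the completed scan is missing one far point.
reference = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (10.0, 0.0, 0.0)]
completed = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
cd = chamfer_distance(completed, reference)
```

Here the completed cloud's points all have exact matches, so the whole penalty comes from the reference-to-completed direction, where the missing far point is 9 units from its nearest neighbor. This quadratic-time version is only for illustration; real LiDAR scans need a spatial index for the nearest-neighbor search.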
Details
Motivation: Existing diffusion-based methods for 3D LiDAR point cloud completion incur heavy computational overhead from multi-step iterative sampling, which makes real-time use difficult. Method: LiNeXt comprises a Noise-to-Coarse (N2C) Module that denoises in a single pass and a Refine Module that performs finer refinement, together with a Distance-aware Selected Repeat strategy that generates a more uniformly distributed noisy point cloud to suit LiDAR's dense-near, sparse-far sampling. Result: On SemanticKITTI, compared with LiDiff, LiNeXt achieves a 199.8x inference speedup, a 50.7% lower Chamfer Distance, and only 6.1% of the parameters. Conclusion: LiNeXt substantially improves efficiency while maintaining high accuracy, making it suitable for real-time 3D LiDAR scene completion. Abstract: 3D LiDAR scene completion from point clouds is a fundamental component of perception systems in autonomous vehicles. Previous methods have predominantly employed diffusion models for high-fidelity reconstruction. However, their multi-step iterative sampling incurs significant computational overhead, limiting their real-time applicability. To address this, we propose LiNeXt, a lightweight, non-diffusion network optimized for rapid and accurate point cloud completion. Specifically, LiNeXt first applies the Noise-to-Coarse (N2C) Module to denoise the input noisy point cloud in a single pass, thereby obviating the multi-step iterative sampling of diffusion-based methods. The Refine Module then takes the coarse point cloud and its intermediate features from the N2C Module to perform more precise refinement, further enhancing structural completeness. Furthermore, we observe that LiDAR point clouds exhibit a distance-dependent spatial distribution, being densely sampled at proximal ranges and sparsely sampled at distal ranges. Accordingly, we propose the Distance-aware Selected Repeat strategy to generate a more uniformly distributed noisy point cloud. On the SemanticKITTI dataset, LiNeXt achieves a 199.8x speedup in inference, reduces Chamfer Distance by 50.7%, and uses only 6.1% of the parameters compared with LiDiff. These results demonstrate the superior efficiency and effectiveness of LiNeXt for real-time scene completion.
[133] HeatV2X: Scalable Heterogeneous Collaborative Perception via Efficient Alignment and Interaction
Yueran Zhao,Zhang Zhang,Chao Sun,Tianze Wang,Chao Yue,Nuoran Li
Main category: cs.CV
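TL;DR: Proposes HeatV2X, a scalable Vehicle-to-Everything (V2X) collaborative perception framework that uses heterogeneous graph attention and adaptive fine-tuning to handle feature alignment and collaboration among multi-modal heterogeneous agents, improving perception performance while reducing training overhead.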
Details
Motivation: Existing V2X collaborative perception frameworks struggle with feature alignment across multi-modal heterogeneous agents and are hard to extend to new agents, which degrades performance and inflates training cost. Method: HeatV2X first trains a high-performance base agent with heterogeneous graph attention, then applies Local Heterogeneous Fine-Tuning (using Hetero-Aware Adapters to extract modality-specific differences) and Global Collaborative Fine-Tuning (using the Multi-Cognitive Adapter to enhance cross-agent collaboration) for efficient alignment and fusion. Result: On the OPV2V-H and DAIR-V2X datasets, the method achieves better perception performance than existing state-of-the-art methods with significantly lower training overhead. Conclusion: HeatV2X addresses heterogeneity and scalability in V2X collaborative perception, combining high performance with low training cost, and shows potential for practical use. Abstract: Vehicle-to-Everything (V2X) collaborative perception extends sensing beyond single vehicle limits through transmission. However, as more agents participate, existing frameworks face two key challenges: (1) the participating agents are inherently multi-modal and heterogeneous, and (2) the collaborative framework must be scalable to accommodate new agents. The former requires effective cross-agent feature alignment to mitigate heterogeneity loss, while the latter renders full-parameter training impractical, highlighting the importance of scalable adaptation. To address these issues, we propose Heterogeneous Adaptation (HeatV2X), a scalable collaborative framework. We first train a high-performance agent based on heterogeneous graph attention as the foundation for collaborative learning. Then, we design Local Heterogeneous Fine-Tuning and Global Collaborative Fine-Tuning to achieve effective alignment and interaction among heterogeneous agents. The former efficiently extracts modality-specific differences using Hetero-Aware Adapters, while the latter employs the Multi-Cognitive Adapter to enhance cross-agent collaboration and fully exploit the fusion potential. These designs enable substantial performance improvement of the collaborative framework with minimal training cost. We evaluate our approach on the OPV2V-H and DAIR-V2X datasets. Experimental results demonstrate that our method achieves superior perception performance with significantly reduced training overhead, outperforming existing state-of-the-art approaches. Our implementation will be released soon.
[134] Next-Frame Feature Prediction for Multimodal Deepfake Detection and Temporal Localization
Ashutosh Anshul,Shreyas Gopal,Deepu Rajan,Eng Siong Chng
Main category: cs.CV
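TL;DR: Proposes a single-stage training framework that adds uni-modal and cross-modal next-frame prediction and a window-level attention mechanism to improve the generalization and temporal localization accuracy of multimodal deepfake detection.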
Details
Motivation: Existing methods rely on pretraining and focus mainly on audio-visual inconsistencies; they generalize poorly to unseen manipulations and overlook intra-modal artifacts, performing badly in particular on manipulations that preserve audio-visual alignment. Method: Within single-stage training, the framework introduces uni-modal and cross-modal next-frame prediction and a window-level attention mechanism that captures discrepancies between predicted and actual frames to detect local artifacts around each frame. Result: Evaluation on multiple benchmark datasets shows strong generalization, accurate classification of fully manipulated videos, and precise localization of forged segments in partially manipulated videos. Conclusion: The method generalizes well without pretraining, detects both intra-modal and cross-modal anomalies, and delivers strong temporal localization on both fully and partially manipulated deepfake videos. Abstract: Recent multimodal deepfake detection methods designed for generalization conjecture that single-stage supervised training struggles to generalize across unseen manipulations and datasets. However, such approaches that target generalization require pretraining over real samples. Additionally, these methods primarily focus on detecting audio-visual inconsistencies and may overlook intra-modal artifacts, causing them to fail against manipulations that preserve audio-visual alignment. To address these limitations, we propose a single-stage training framework that enhances generalization by incorporating next-frame prediction for both uni-modal and cross-modal features. Additionally, we introduce a window-level attention mechanism to capture discrepancies between predicted and actual frames, enabling the model to detect local artifacts around every frame, which is crucial for accurately classifying fully manipulated videos and effectively localizing deepfake segments in partially spoofed samples. Our model, evaluated on multiple benchmark datasets, demonstrates strong generalization and precise temporal localization.
[135] TubeRMC: Tube-conditioned Reconstruction with Mutual Constraints for Weakly-supervised Spatio-Temporal Video Grounding
Jinxuan Li,Yi Zhang,Jian-Fang Hu,Chaolei Tan,Tianming Liang,Beihao Xia
Main category: cs.CV
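TL;DR: Proposes TubeRMC, a framework for weakly-supervised Spatio-Temporal Video Grounding (STVG) that generates text-conditioned candidate tubes and refines them through reconstruction with spatio-temporal constraints, markedly improving the consistency of localization and tracking.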
Details
Motivation: Existing weakly-supervised STVG methods mostly use simple late fusion and generate spatio-temporal tubes independent of the text description, which leads to failed target identification and inconsistent tracking; a method that better fuses language and visual information is needed. Method: TubeRMC uses pre-trained visual grounding models to generate text-conditioned candidate tubes, designs three reconstruction strategies (temporal, spatial, and spatio-temporal), each with its own Tube-conditioned Reconstructor, and introduces mutual constraints between spatial and temporal proposals during reconstruction to improve their quality. Result: TubeRMC outperforms existing methods on two public benchmarks, VidSTG and HCSTVG, and visualizations show that it mitigates target identification errors and inconsistent tracking. Conclusion: Through tube-conditioned reconstruction and mutual constraints, TubeRMC achieves more precise text-guided spatio-temporal grounding and offers a new, effective solution for weakly-supervised STVG. Abstract: Spatio-Temporal Video Grounding (STVG) aims to localize a spatio-temporal tube that corresponds to a given language query in an untrimmed video. This is a challenging task since it involves complex vision-language understanding and spatiotemporal reasoning. Recent works have explored the weakly-supervised setting in STVG to eliminate reliance on fine-grained annotations like bounding boxes or temporal stamps. However, they typically follow a simple late-fusion manner, which generates tubes independent of the text description, often resulting in failed target identification and inconsistent target tracking. To address this limitation, we propose a Tube-conditioned Reconstruction with Mutual Constraints (TubeRMC) framework that generates text-conditioned candidate tubes with pre-trained visual grounding models and further refines them via tube-conditioned reconstruction with spatio-temporal constraints. Specifically, we design three reconstruction strategies from temporal, spatial, and spatio-temporal perspectives to comprehensively capture rich tube-text correspondences. Each strategy is equipped with a Tube-conditioned Reconstructor, utilizing spatio-temporal tubes as condition to reconstruct the key clues in the query. We further introduce mutual constraints between spatial and temporal proposals to enhance their quality for reconstruction. TubeRMC outperforms existing methods on two public benchmarks, VidSTG and HCSTVG. Further visualization shows that TubeRMC effectively mitigates both target identification errors and inconsistent tracking.
[136] FineSkiing: A Fine-grained Benchmark for Skiing Action Quality Assessment
Yongji Zhang,Siqi Li,Yue Gao,Yu Jiang
Main category: cs.CV
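TL;DR: Presents the first aerial skiing Action Quality Assessment (AQA) dataset with fine-grained sub-score and deduction annotations, and proposes JudgeMind, a method that simulates the scoring mindset of professional referees through stage segmentation, a stage-aware feature enhancement and fusion module, and a knowledge-based grade-aware decoder, markedly improving AQA performance and reliability.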
Details
Motivation: Existing AQA methods rely on features from the entire video and lack interpretability and reliability, and existing datasets lack fine-grained score annotations, making detailed scoring analysis difficult. Method: JudgeMind processes the action video stage by stage, uses a stage-aware feature enhancement and fusion module to sharpen perception of key regions, and adds a knowledge-driven scoring decoder that incorporates deduction items as prior knowledge to improve scoring accuracy. Result: Experiments on the newly built fine-grained AQA dataset show that the method achieves state-of-the-art performance, significantly outperforming existing methods. Conclusion: By simulating referee scoring logic and using fine-grained annotations, JudgeMind improves AQA accuracy, interpretability, and robustness, and provides a new benchmark and direction for future AQA research. Abstract: Action Quality Assessment (AQA) aims to evaluate and score sports actions, which has attracted widespread interest in recent years. Existing AQA methods primarily predict scores based on features extracted from the entire video, resulting in limited interpretability and reliability. Meanwhile, existing AQA datasets also lack fine-grained annotations for action scores, especially for deduction items and sub-score annotations. In this paper, we construct the first AQA dataset containing fine-grained sub-score and deduction annotations for aerial skiing, which will be released as a new benchmark. For the technical challenges, we propose a novel AQA method, named JudgeMind, which significantly enhances performance and reliability by simulating the judgment and scoring mindset of professional referees. Our method segments the input action video into different stages and scores each stage to enhance accuracy. Then, we propose a stage-aware feature enhancement and fusion module to boost the perception of stage-specific key regions and enhance the robustness to visual changes caused by frequent camera viewpoint switching. In addition, we propose a knowledge-based grade-aware decoder to incorporate possible deduction items as prior knowledge to predict more accurate and reliable scores. Experimental results demonstrate that our method achieves state-of-the-art performance.
[137] Facial-R1: Aligning Reasoning and Recognition for Facial Emotion Analysis
Jiulong Wu,Yucheng Shen,Lingyong Yan,Haixin Sun,Deguo Xia,Jizhou Huang,Min Cao
Main category: cs.CV
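TL;DR: Proposes Facial-R1, a three-stage alignment framework that addresses hallucinated reasoning and reasoning-recognition misalignment in facial emotion analysis, and introduces the FEA-20K dataset, achieving state-of-the-art performance.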
Details
Motivation: Existing vision-language-model-based methods for facial emotion analysis suffer from hallucinated reasoning and from inconsistency between emotion recognition and the reasoning process, and lack fine-grained, interpretable joint modeling. Method: A three-stage framework: instruction fine-tuning establishes basic reasoning capability; reinforcement learning with emotion and AU labels as reward signals aligns reasoning with recognition; and a data synthesis pipeline iteratively expands the training data for self-improvement. Result: Validated with the FEA-20K dataset, Facial-R1 achieves state-of-the-art performance on all eight standard benchmarks, with strong generalization and interpretability. Conclusion: Facial-R1 addresses key weaknesses of VLMs in facial emotion analysis and delivers high-performance, interpretable, fine-grained emotion analysis with minimal supervision. Abstract: Facial Emotion Analysis (FEA) extends traditional facial emotion recognition by incorporating explainable, fine-grained reasoning. The task integrates three subtasks: emotion recognition, facial Action Unit (AU) recognition, and AU-based emotion reasoning to model affective states jointly. While recent approaches leverage Vision-Language Models (VLMs) and achieve promising results, they face two critical limitations: (1) hallucinated reasoning, where VLMs generate plausible but inaccurate explanations due to insufficient emotion-specific knowledge; and (2) misalignment between emotion reasoning and recognition, caused by fragmented connections between observed facial features and final labels. We propose Facial-R1, a three-stage alignment framework that effectively addresses both challenges with minimal supervision. First, we employ instruction fine-tuning to establish basic emotional reasoning capability. Second, we introduce reinforcement training guided by emotion and AU labels as reward signals, which explicitly aligns the generated reasoning process with the predicted emotion. Third, we design a data synthesis pipeline that iteratively leverages the prior stages to expand the training dataset, enabling scalable self-improvement of the model. Built upon this framework, we introduce FEA-20K, a benchmark dataset comprising 17,737 training and 1,688 test samples with fine-grained emotion analysis annotations. Extensive experiments across eight standard benchmarks demonstrate that Facial-R1 achieves state-of-the-art performance in FEA, with strong generalization and robust interpretability.
[138] H3Former: Hypergraph-based Semantic-Aware Aggregation via Hyperbolic Hierarchical Contrastive Loss for Fine-Grained Visual Classification
Yongji Zhang,Siqi Li,Kuiyang Huang,Yue Gao,Yu Jiang
Main category: cs.CV
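TL;DR: Proposes H3Former, a token-to-region framework for fine-grained visual classification that exploits high-order semantic relations, improving performance through hypergraph convolution and a hyperbolic hierarchical contrastive loss.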
Details
Motivation: Existing methods fall short in capturing discriminative features and often introduce substantial category-agnostic redundancy, making it hard to handle the subtle inter-class differences and intra-class variations of fine-grained categories. Method: The Semantic-Aware Aggregation Module (SAAM) uses multi-scale context to build a weighted hypergraph and aggregates local features into region-level representations via hypergraph convolution; the Hyperbolic Hierarchical Contrastive Loss (HHCL) strengthens inter-class separability and intra-class consistency in a non-Euclidean space. Result: Experiments on four standard fine-grained classification benchmarks show that H3Former outperforms existing methods. Conclusion: By modeling high-order semantic relations and applying hierarchical contrastive learning, H3Former effectively improves fine-grained visual classification. Abstract: Fine-Grained Visual Classification (FGVC) remains a challenging task due to subtle inter-class differences and large intra-class variations. Existing approaches typically rely on feature-selection mechanisms or region-proposal strategies to localize discriminative regions for semantic analysis. However, these methods often fail to capture discriminative cues comprehensively while introducing substantial category-agnostic redundancy. To address these limitations, we propose H3Former, a novel token-to-region framework that leverages high-order semantic relations to aggregate local fine-grained representations with structured region-level modeling. Specifically, we propose the Semantic-Aware Aggregation Module (SAAM), which exploits multi-scale contextual cues to dynamically construct a weighted hypergraph among tokens. By applying hypergraph convolution, SAAM captures high-order semantic dependencies and progressively aggregates token features into compact region-level representations. Furthermore, we introduce the Hyperbolic Hierarchical Contrastive Loss (HHCL), which enforces hierarchical semantic constraints in a non-Euclidean embedding space. The HHCL enhances inter-class separability and intra-class consistency while preserving the intrinsic hierarchical relationships among fine-grained categories. Comprehensive experiments conducted on four standard FGVC benchmarks validate the superiority of our H3Former framework.
[139] PROPA: Toward Process-level Optimization in Visual Reasoning via Reinforcement Learning
Yanbei Jiang,Chao Lei,Yihao Ding,Krista Ehinger,Jey Han Lau
Main category: cs.CV
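TL;DR: PROPA is a reasoning optimization framework for vision-language models that combines Monte Carlo Tree Search with policy alignment, generating dense process-level rewards to improve complex visual reasoning without human annotations and significantly outperforming existing methods on multiple benchmarks.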
Details
Motivation: Existing vision-language models are limited in complex visual reasoning by the propagation of early errors, and current post-training paradigms (such as supervised fine-tuning and sparse-reward reinforcement learning) either depend on costly annotations or lack fine-grained feedback, making it hard to optimize intermediate reasoning steps stably. Method: PROPA combines Monte Carlo Tree Search with GRPO to generate dense process-level rewards, alternates GRPO updates with supervised fine-tuning to ease the cold-start problem, and trains a Process Reward Model (PRM) to guide inference-time search so that training and test-time behavior are aligned. Result: Across seven benchmark datasets and four vision-language model backbones, PROPA outperforms both SFT- and RLVR-based methods, with gains of up to 17.0% on in-domain tasks and up to 21.0% on out-of-domain tasks, showing strong reasoning and generalization. Conclusion: By introducing process-level optimization, PROPA improves the multi-step reasoning and generalization of vision-language models without human annotations, offering a new approach to complex visual reasoning tasks. Abstract: Despite significant progress, Vision-Language Models (VLMs) still struggle with complex visual reasoning, where multi-step dependencies cause early errors to cascade through the reasoning chain. Existing post-training paradigms are limited: Supervised Fine-Tuning (SFT) relies on costly step-level annotations, while Reinforcement Learning with Verifiable Rewards (RLVR) methods like GRPO provide only sparse, outcome-level feedback, hindering stable optimization. We introduce PROPA (Process-level Reasoning Optimization with interleaved Policy Alignment), a novel framework that integrates Monte Carlo Tree Search (MCTS) with GRPO to generate dense, process-level rewards and optimize reasoning at each intermediate step without human annotations. To overcome the cold-start problem, PROPA interleaves GRPO updates with SFT, enabling the model to learn from both successful and failed reasoning trajectories. A Process Reward Model (PRM) is further trained to guide inference-time search, aligning the test-time search with the training signal. Across seven benchmarks and four VLM backbones, PROPA consistently outperforms both SFT- and RLVR-based baselines. It achieves up to 17.0% gains on in-domain tasks and 21.0% gains on out-of-domain tasks compared to existing state-of-the-art, establishing a strong reasoning and generalization capability for visual reasoning tasks. The code is available at: https://github.com/YanbeiJiang/PROPA.
[140] Adaptive Residual-Update Steering for Low-Overhead Hallucination Mitigation in Large Vision Language Models
Zhengtao Zou,Ya Gao,Jiarui Guan,Bin Li,Pekka Marttinen
Main category: cs.CV
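TL;DR: Proposes RUDDER, a framework that uses a Contextual Activation Residual Direction vector and a Bayesian-inspired adaptive gate to reduce object hallucination in large vision-language models with almost no added computational overhead.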
Details
Motivation: Large Vision-Language Models (LVLMs) often hallucinate objects, and existing mitigation methods trade effectiveness against computational efficiency; in particular, extra forward passes cause high latency that limits practical use. Method: RUDDER has two core components: (1) a per-sample visual evidence vector (CARD) extracted from the residual update of a self-attention layer, and (2) a Bayesian-inspired adaptive gate that adjusts the strength of the corrective signal according to how far the model deviates from the visual context, giving low-overhead, token-wise decoding regulation. Result: On mainstream hallucination benchmarks such as POPE and CHAIR, RUDDER performs comparably to state-of-the-art methods with very low added latency, requiring only a single standard forward pass for the intervention. Conclusion: RUDDER is an efficient and practical inference-time intervention that improves the visual consistency and reliability of LVLM outputs without a significant loss of efficiency. Abstract: Large Vision-Language Models (LVLMs) often suffer from object hallucination, generating text inconsistent with visual inputs, which can critically undermine their reliability. Existing inference-time interventions to mitigate this issue present a challenging trade-off: while methods that steer internal states or adjust output logits can be effective, they often incur substantial computational overhead, typically requiring extra forward passes. This efficiency bottleneck can limit their practicality for real-world, latency-sensitive deployments. In this work, we aim to address this trade-off with Residual-Update Directed DEcoding Regulation (RUDDER), a low-overhead framework that steers LVLMs towards visually-grounded generation. RUDDER is built on two key innovations: (1) Contextual Activation Residual Direction (CARD) vector, a per-sample visual evidence vector extracted from the residual update of a self-attention layer during a single, standard forward pass. (2) A Bayesian-inspired adaptive gate that performs token-wise injection, applying a corrective signal whose strength is conditioned on the model's deviation from the visual context. Extensive experiments on key hallucination benchmarks, including POPE and CHAIR, indicate that RUDDER achieves performance comparable to state-of-the-art methods while introducing negligible computational latency, validating RUDDER as a pragmatic and effective approach for improving LVLMs' reliability without a significant compromise on efficiency.
[141] Generalizable Slum Detection from Satellite Imagery with Mixture-of-Experts
Sumin Lee,Sungwon Park,Jeasurk Yang,Jihee Kim,Meeyoung Cha
Main category: cs.CV
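TL;DR: Proposes GRAM, a framework that uses a large-scale satellite imagery dataset to achieve robust slum segmentation without labeled data from the target region.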
Details
Motivation: Because informal settlements vary widely in morphology, models trained on specific regions generalize poorly to new ones, so better cross-region generalization is needed. Method: The authors build a million-scale satellite imagery dataset covering 12 cities and use a Mixture-of-Experts architecture that captures region-specific characteristics on top of a shared backbone; the two-phase test-time adaptation framework GRAM then uses prediction consistency across experts to filter out unreliable pseudo-labels, achieving unsupervised domain adaptation. Result: GRAM outperforms state-of-the-art baselines in low-resource cities such as those in Africa, improving the accuracy and robustness of cross-region slum identification. Conclusion: GRAM offers a scalable, label-efficient solution for global slum mapping and supports data-driven urban planning. Abstract: Satellite-based slum segmentation holds significant promise in generating global estimates of urban poverty. However, the morphological heterogeneity of informal settlements presents a major challenge, hindering the ability of models trained on specific regions to generalize effectively to unseen locations. To address this, we introduce a large-scale high-resolution dataset and propose GRAM (Generalized Region-Aware Mixture-of-Experts), a two-phase test-time adaptation framework that enables robust slum segmentation without requiring labeled data from target regions. We compile a million-scale satellite imagery dataset from 12 cities across four continents for source training. Using this dataset, the model employs a Mixture-of-Experts architecture to capture region-specific slum characteristics while learning universal features through a shared backbone. During adaptation, prediction consistency across experts filters out unreliable pseudo-labels, allowing the model to generalize effectively to previously unseen regions. GRAM outperforms state-of-the-art baselines in low-resource settings such as African cities, offering a scalable and label-efficient solution for global slum mapping and data-driven urban planning.
[142] Rethinking Visual Information Processing in Multimodal LLMs
Dongwan Kim,Viresh Ranjan,Takashi Nagata,Arnab Dhua,Amit Kumar K C
Main category: cs.CV
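TL;DR: Proposes LLaViT, which makes the large language model double as a vision encoder and improves vision-language modeling through three key modifications.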
Details
Motivation: To address the difficulty the LLaVA architecture has in integrating visual features, caused by the mismatch between the text and vision modalities. Method: The LLM is given vision-encoding capability by learning separate QKV projections for the vision modality, enabling bidirectional attention on visual tokens, and fusing global and local visual representations. Result: In experiments across a range of LLMs, LLaViT significantly outperforms the baseline LLaVA method and surpasses models with twice its parameter count. Conclusion: LLaViT provides a more effective approach to vision-language modeling and supports the potential of LLMs as extended Vision Transformers. Abstract: Despite the remarkable success of the LLaVA architecture for vision-language tasks, its design struggles to effectively integrate visual features due to the inherent mismatch between text and vision modalities. We tackle this issue from a novel perspective in which the LLM serves not only as a language model but also as a powerful vision encoder. To this end, we present LLaViT - Large Language Models as extended Vision Transformers - which enables the LLM to simultaneously function as a vision encoder through three key modifications: (1) learning separate QKV projections for vision modality, (2) enabling bidirectional attention on visual tokens, and (3) incorporating both global and local visual representations. Through extensive controlled experiments on a wide range of LLMs, we demonstrate that LLaViT significantly outperforms the baseline LLaVA method on a multitude of benchmarks, even surpassing models with double its parameter count, establishing a more effective approach to vision-language modeling.
[143] Revisiting Evaluation of Deep Neural Networks for Pedestrian Detection
Patrick Feifel,Benedikt Franke,Frank Bonarens,Frank Köster,Arne Raulf,Friedhelm Schwenker
Main category: cs.CV
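TL;DR: Proposes a new evaluation approach for pedestrian detection that introduces eight error categories and new metrics, using image segmentation information for more fine-grained and robust model comparison, and achieves SOTA on the CityPersons dataset without extra training data.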
Details
Motivation: Existing performance metrics for pedestrian detection, computed on different subsets of a validation dataset, have limitations and do not realistically reflect how a DNN behaves in safety-critical situations, so a finer and more reliable evaluation is needed. Method: Using the fine-grained scene information provided by image segmentation, the paper proposes eight pedestrian detection error categories and designs new evaluation metrics, validated experimentally on a simplified APD framework with different backbones. Result: The new metrics allow more fine-grained and robust model comparison, in particular for safety-critical performance; the approach reaches SOTA on the CityPersons-reasonable setting without extra training data. Conclusion: By introducing segmentation-based error categories and new metrics, the work improves how pedestrian detectors are evaluated and supports more reliable and safer pedestrian detection in automated driving systems. Abstract: Reliable pedestrian detection represents a crucial step towards automated driving systems. However, the current performance benchmarks exhibit weaknesses. The currently applied metrics for various subsets of a validation dataset prohibit a realistic performance evaluation of a DNN for pedestrian detection. As image segmentation supplies fine-grained information about a street scene, it can serve as a starting point to automatically distinguish between different types of errors during the evaluation of a pedestrian detector. In this work, eight different error categories for pedestrian detection are proposed, along with new metrics for performance comparison along these error categories. We use the new metrics to compare various backbones for a simplified version of the APD, and show a more fine-grained and robust way to compare models with each other, especially in terms of safety-critical performance. We achieve SOTA on CityPersons-reasonable (without extra training data) by using a rather simple architecture.
[144] CLIP4VI-ReID: Learning Modality-shared Representations via CLIP Semantic Bridge for Visible-Infrared Person Re-identification
Xiaomei Yang,Xizhan Gao,Sijie Niu,Fa Zhu,Guang Feng,Xiaofeng Qu,David Camacho
Main category: cs.CV
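TL;DR: Proposes CLIP4VI-ReID, a CLIP-based modality-shared representation learning network for visible-infrared person re-identification (VI-ReID) that achieves cross-modal alignment through three modules (text semantic generation, infrared feature rectification, and high-level semantic alignment), markedly improving recognition performance.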
Details
Motivation: Visible and infrared images differ greatly in their physical characteristics, so direct modality alignment is difficult; a method is needed that bridges the two modalities and extracts identity-related semantic information. Method: CLIP4VI-ReID consists of three modules, Text Semantic Generation (TSG), Infrared Feature Embedding (IFE), and High-level Semantic Alignment (HSA): TSG generates text semantics for visible images to achieve preliminary alignment; IFE uses the text semantics to rectify infrared features and inject identity-related information; HSA further refines high-level semantic alignment so that the text semantics focus on identity information. Result: Experiments show that CLIP4VI-ReID outperforms existing state-of-the-art methods on several mainstream VI-ReID datasets, achieving more accurate cross-modal alignment and more discriminative shared representations. Conclusion: By using text as a bridge and a multi-stage semantic alignment design, CLIP4VI-ReID narrows the semantic gap between the visible and infrared modalities and improves cross-modal matching accuracy in VI-ReID. Abstract: This paper proposes a novel CLIP-driven modality-shared representation learning network named CLIP4VI-ReID for the VI-ReID task, which consists of Text Semantic Generation (TSG), Infrared Feature Embedding (IFE), and High-level Semantic Alignment (HSA). Specifically, considering the huge gap in the physical characteristics between natural images and infrared images, the TSG is designed to generate text semantics only for visible images, thereby enabling preliminary visible-text modality alignment. Then, the IFE is proposed to rectify the feature embeddings of infrared images using the generated text semantics. This process injects id-related semantics into the shared image encoder, enhancing its adaptability to the infrared modality. Besides, with text serving as a bridge, it enables indirect visible-infrared modality alignment. Finally, the HSA is established to refine the high-level semantic alignment. This process ensures that the fine-tuned text semantics only contain id-related information, thereby achieving more accurate cross-modal alignment and enhancing the discriminability of the learned modal-shared representations. Extensive experimental results demonstrate that the proposed CLIP4VI-ReID outperforms other state-of-the-art methods on some widely used VI-ReID datasets.
[145] Depth-Consistent 3D Gaussian Splatting via Physical Defocus Modeling and Multi-View Geometric Supervision
Yu Deng,Baozhu Zhao,Junyan Su,Xiaohan Zhang,Qi Liu
Main category: cs.CV
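TL;DR: Proposes a 3D Gaussian Splatting framework that combines depth-of-field supervision with multi-view consistency supervision to improve depth reconstruction accuracy in both near and far regions of complex scenes.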
Details
Motivation: Existing methods cannot simultaneously fix inaccurate depth estimation in distant regions and structural degradation in close-range regions, and perform poorly in scenes with extreme depth variation. Method: (1) A monocular depth estimator generates depth priors, defocus convolution synthesizes physically accurate defocused images, and a depth-of-field loss strengthens geometric consistency; (2) LoFTR-based semi-dense feature matching minimizes cross-view geometric errors, and least squares optimization over reliable matched points reinforces depth consistency. Result: On the Waymo Open Dataset, the method improves PSNR by 0.8 dB over the state-of-the-art method and markedly improves depth fidelity in near and far fields. Conclusion: The method combines physical imaging principles with learning-based depth regularization, offering a scalable solution for complex depth stratification in urban environments. Abstract: Three-dimensional reconstruction in scenes with extreme depth variations remains challenging due to inconsistent supervisory signals between near-field and far-field regions. Existing methods fail to simultaneously address inaccurate depth estimation in distant areas and structural degradation in close-range regions. This paper proposes a novel computational framework that integrates depth-of-field supervision and multi-view consistency supervision to advance 3D Gaussian Splatting. Our approach comprises two core components: (1) Depth-of-field Supervision employs a scale-recovered monocular depth estimator (e.g., Metric3D) to generate depth priors, leverages defocus convolution to synthesize physically accurate defocused images, and enforces geometric consistency through a novel depth-of-field loss, thereby enhancing depth fidelity in both far-field and near-field regions; (2) Multi-View Consistency Supervision employs LoFTR-based semi-dense feature matching to minimize cross-view geometric errors and enforces depth consistency via least squares optimization of reliable matched points. By unifying defocus physics with multi-view geometric constraints, our method achieves superior depth fidelity, demonstrating a 0.8 dB PSNR improvement over the state-of-the-art method on the Waymo Open Dataset. This framework bridges physical imaging principles and learning-based depth regularization, offering a scalable solution for complex depth stratification in urban environments.
[146] Learning to Tell Apart: Weakly Supervised Video Anomaly Detection via Disentangled Semantic Alignment
Wenti Yin,Huaxin Zhang,Xiang Wang,Yuqing Lu,Yicheng Zhang,Bingquan Gong,Jialong Zuo,Li Yu,Changxin Gao,Nong Sang
Main category: cs.CV
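TL;DR: Proposes the Disentangled Semantic Alignment Network (DSANet) for weakly-supervised video anomaly detection, which separates normal and abnormal features at coarse-grained and fine-grained levels to improve classification performance.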
Details
Motivation: Existing methods tend to detect only the most salient segments, neglect the mining of diverse normal patterns, and confuse categories with similar appearance, making fine-grained classification difficult. Method: At the coarse-grained level, a self-guided normality modeling branch reconstructs video features using learned normal prototypes; at the fine-grained level, a decoupled contrastive semantic alignment mechanism decomposes each video into event-centric and background-centric components and applies visual-language contrastive learning to strengthen class-discriminative representations. Result: Experiments on two standard datasets, XD-Violence and UCF-Crime, show that DSANet outperforms existing state-of-the-art methods. Conclusion: DSANet effectively separates normal and abnormal features and improves weakly-supervised video anomaly detection, especially for fine-grained classification. Abstract: Recent advancements in weakly-supervised video anomaly detection have achieved remarkable performance by applying the multiple instance learning paradigm based on multimodal foundation models such as CLIP to highlight anomalous instances and classify categories. However, their objectives may tend to detect the most salient response segments, while neglecting to mine diverse normal patterns separated from anomalies, and are prone to category confusion due to similar appearance, leading to unsatisfactory fine-grained classification results. Therefore, we propose a novel Disentangled Semantic Alignment Network (DSANet) to explicitly separate abnormal and normal features from coarse-grained and fine-grained aspects, enhancing the distinguishability. Specifically, at the coarse-grained level, we introduce a self-guided normality modeling branch that reconstructs input video features under the guidance of learned normal prototypes, encouraging the model to exploit normality cues inherent in the video, thereby improving the temporal separation of normal patterns and anomalous events. At the fine-grained level, we present a decoupled contrastive semantic alignment mechanism, which first temporally decomposes each video into event-centric and background-centric components using frame-level anomaly scores and then applies visual-language contrastive learning to enhance class-discriminative representations. Comprehensive experiments on two standard benchmarks, namely XD-Violence and UCF-Crime, demonstrate that DSANet outperforms existing state-of-the-art methods.
[147] FOUND: Fourier-based von Mises Distribution for Robust Single Domain Generalization in Object Detection
Mengzhu Wang,Changyuan Deng,Shanshan Wang,Nan Yin,Long Lan,Liang Yang
Main category: cs.CV
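TL;DR: Proposes a new single domain generalization (SDG) object detection framework that combines the vMF distribution with Fourier transformation to strengthen the cross-domain robustness of a CLIP-guided model.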
Details
Motivation: Existing methods overlook the importance of feature distribution structure and frequency-domain characteristics for SDG, which limits generalization. Method: The vMF distribution models directional features to capture domain-invariant semantic structure in the embedding space, and a Fourier-based augmentation strategy perturbs amplitude and phase components in the frequency domain to simulate domain shifts. Result: Experiments on an adverse-weather driving benchmark show that the method outperforms current state-of-the-art SDG object detection methods. Conclusion: By combining vMF modeling with Fourier augmentation, the method preserves CLIP's semantic alignment while improving feature diversity and cross-domain consistency, strengthening SDG performance. Abstract: Single Domain Generalization (SDG) for object detection aims to train a model on a single source domain that can generalize effectively to unseen target domains. While recent methods like CLIP-based semantic augmentation have shown promise, they often overlook the underlying structure of feature distributions and frequency-domain characteristics that are critical for robustness. In this paper, we propose a novel framework that enhances SDG object detection by integrating the von Mises-Fisher (vMF) distribution and Fourier transformation into a CLIP-guided pipeline. Specifically, we model the directional features of object representations using vMF to better capture domain-invariant semantic structures in the embedding space. Additionally, we introduce a Fourier-based augmentation strategy that perturbs amplitude and phase components to simulate domain shifts in the frequency domain, further improving feature robustness. Our method not only preserves the semantic alignment benefits of CLIP but also enriches feature diversity and structural consistency across domains. Extensive experiments on the diverse weather-driving benchmark demonstrate that our approach outperforms the existing state-of-the-art method.
[148] DermAI: Clinical dermatology acquisition through quality-driven image collection for AI classification in mobile
Thales Bezerra,Emanoel Thyago,Kelvin Cunha,Rodrigo Abreu,Fábio Papais,Francisco Mauro,Natália Lopes,Érico Medeiros,Jéssica Guido,Shirley Cruz,Paulo Borba,Tsang Ing Ren
Main category: cs.CV
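TL;DR: DermAI is a lightweight smartphone application for capturing, annotating, and classifying skin lesions in real time during routine consultations, underscoring the importance of standardized, diverse data collection for machine learning.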
Details
Motivation: Adoption of AI in dermatology is held back by biased datasets, uneven image quality, and insufficient validation. Method: The authors developed DermAI, a lightweight smartphone-based application that supports on-device quality checks and local model adaptation, and trained and validated it on a clinical dataset covering a range of skin tones, ethnicities, and source devices. Result: In preliminary experiments, models trained on public datasets failed to generalize to the new samples, while fine-tuning with local data improved performance. Conclusion: Standardized and diverse data collection is essential for meeting healthcare needs and advancing machine learning. Abstract: AI-based dermatology adoption remains limited by biased datasets, variable image quality, and limited validation. We introduce DermAI, a lightweight, smartphone-based application that enables real-time capture, annotation, and classification of skin lesions during routine consultations. Unlike prior dermoscopy-focused tools, DermAI performs on-device quality checks and local model adaptation. The DermAI clinical dataset encompasses a wide range of skin tones, ethnicities, and source devices. In preliminary experiments, models trained on public datasets failed to generalize to our samples, while fine-tuning with local data improved performance. These results highlight the importance of standardized, diverse data collection aligned with healthcare needs and oriented to machine learning development.
[149] SHRUG-FM: Reliability-Aware Foundation Models for Earth Observation
Kai-Hendrik Cohrs,Zuzanna Osika,Maria Gonzalez-Calabuig,Vishal Nedungadi,Ruben Cartuyvels,Steffen Knoblauch,Joppe Massant,Shruti Nath,Patrick Ebel,Vasileios Sitokonstantinou
Main category: cs.CV
TL;DR: SHRUG-FM is a reliability-aware framework for geospatial foundation models that combines input and embedding space OOD detection with predictive uncertainty to improve real-world performance, particularly in underrepresented environments like low-elevation and large river areas.
Details
Motivation: Geospatial foundation models often perform poorly in environments underrepresented during pretraining, limiting their reliability in real-world, climate-sensitive applications. Method: SHRUG-FM integrates three signals: out-of-distribution (OOD) detection in input space, OOD detection in embedding space, and task-specific predictive uncertainty, applied to burn scar segmentation and evaluated using HydroATLAS land cover data. Result: OOD scores correlate with reduced performance in specific environments; uncertainty-based flags effectively identify poor predictions, revealing systematic failures in low-elevation and large river regions due to data underrepresentation. Conclusion: SHRUG-FM enhances the safety and interpretability of geospatial foundation models by identifying and explaining prediction failures, bridging the gap between benchmark results and real-world reliability. Abstract: Geospatial foundation models for Earth observation often fail to perform reliably in environments underrepresented during pretraining. We introduce SHRUG-FM, a framework for reliability-aware prediction that integrates three complementary signals: out-of-distribution (OOD) detection in the input space, OOD detection in the embedding space and task-specific predictive uncertainty. Applied to burn scar segmentation, SHRUG-FM shows that OOD scores correlate with lower performance in specific environmental conditions, while uncertainty-based flags help discard many poorly performing predictions. Linking these flags to land cover attributes from HydroATLAS shows that failures are not random but concentrated in certain geographies, such as low-elevation zones and large river areas, likely due to underrepresentation in pretraining data. 
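The three-signal flagging described in the abstract can be sketched as follows. This is a minimal illustration, not the SHRUG-FM implementation: the function name `reliability_flags`, the fixed thresholds, and the "flag if any signal fires" rule are all assumptions made for the example, and the per-sample scores are taken as given rather than computed from a model.

```python
import numpy as np

def reliability_flags(input_ood, embed_ood, uncertainty, thresholds):
    """Flag samples on which a prediction should not be trusted.

    Each score array holds one value per sample, higher meaning less reliable.
    `thresholds` is a hypothetical (input, embedding, uncertainty) triple,
    e.g. quantiles estimated on held-out in-distribution data.
    """
    t_in, t_emb, t_unc = thresholds
    flags = {
        "input_ood": np.asarray(input_ood, dtype=float) > t_in,
        "embedding_ood": np.asarray(embed_ood, dtype=float) > t_emb,
        "uncertain": np.asarray(uncertainty, dtype=float) > t_unc,
    }
    # Assumed combination rule: a sample is flagged if any signal fires.
    flags["any"] = flags["input_ood"] | flags["embedding_ood"] | flags["uncertain"]
    return flags

# Three toy samples: the second is OOD in input space, the third is uncertain.
scores = reliability_flags(
    input_ood=[0.1, 0.9, 0.2],
    embed_ood=[0.2, 0.1, 0.3],
    uncertainty=[0.05, 0.10, 0.40],
    thresholds=(0.5, 0.5, 0.2),
)
print(scores["any"])
```

Keeping the three flags separate, rather than returning only the combined one, is what would allow flagged samples to be cross-referenced against external attributes afterwards.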
SHRUG-FM provides a pathway toward safer and more interpretable deployment of geospatial foundation models (GFMs) in climate-sensitive applications, helping bridge the gap between benchmark performance and real-world reliability.
[150] MSGNav: Unleashing the Power of Multi-modal 3D Scene Graph for Zero-Shot Embodied Navigation
Xun Huang,Shijia Zhao,Yunxiang Wang,Xin Lu,Wanfa Zhang,Rongsheng Qu,Weixin Li,Yunhong Wang,Chenglu Wen
Main category: cs.CV
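TL;DR: MSGNav is a zero-shot navigation system built on a Multi-modal 3D Scene Graph (M3DSG). It retains visual cues through dynamically assigned image edges, addressing the visual information loss and vocabulary limits of existing methods, and adds a visibility-aware viewpoint decision module for the last-mile problem in zero-shot navigation. It reaches SOTA performance on the GOAT-Bench and HM3D-OVON datasets.
Details
Motivation: Existing zero-shot embodied navigation methods compress rich visual observations into text-only relations when building 3D scene graphs, which causes high construction cost, irreversible loss of visual information, and constrained vocabularies; this falls short of the open-vocabulary generalization and low training overhead that real-world deployment requires. Method: The Multi-modal 3D Scene Graph (M3DSG) replaces textual relational edges with dynamically assigned images to preserve visual cues. MSGNav is built on M3DSG and comprises Key Subgraph Selection, Adaptive Vocabulary Update, and Closed-Loop Reasoning modules, plus a Visibility-based Viewpoint Decision module for the last-mile problem. Result: MSGNav achieves state-of-the-art zero-shot navigation performance on both the GOAT-Bench and HM3D-OVON benchmarks, clearly outperforming existing methods and validating the multimodal scene representation and each module. Conclusion: With a multimodal scene graph that preserves visual information and purpose-built modules, MSGNav addresses information loss, vocabulary limits, and final viewpoint selection in zero-shot embodied navigation, advancing general robot navigation without task-specific training. Abstract: Embodied navigation is a fundamental capability for robotic agents. Real-world deployment requires open vocabulary generalization and low training overhead, motivating zero-shot methods rather than task-specific RL training. However, existing zero-shot methods that build explicit 3D scene graphs often compress rich visual observations into text-only relations, leading to high construction cost, irreversible loss of visual evidence, and constrained vocabularies. To address these limitations, we introduce the Multi-modal 3D Scene Graph (M3DSG), which preserves visual cues by replacing textual relational edges with dynamically assigned images. Built on M3DSG, we propose MSGNav, a zero-shot navigation system that includes a Key Subgraph Selection module for efficient reasoning, an Adaptive Vocabulary Update module for open vocabulary support, and a Closed-Loop Reasoning module for accurate exploration reasoning. We further identify the last-mile problem in zero-shot navigation, that is, determining the feasible target location with a suitable final viewpoint, and propose a Visibility-based Viewpoint Decision module to explicitly resolve it. Comprehensive experimental results demonstrate that MSGNav achieves state-of-the-art performance on GOAT-Bench and HM3D-OVON datasets. The open-source code will be publicly available.
[151] Fragile by Design: On the Limits of Adversarial Defenses in Personalized Generation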
Zhen Chen,Yi Zhang,Xiangyu Yin,Chengxuan Qin,Xingyu Zhao,Xiaowei Huang,Wenjie Ruan
Main category: cs.CV
TL;DR: Existing defenses against facial identity leakage in personalized generation models have clear weaknesses: their adversarial perturbations are easy to detect and fragile. The study proposes a new evaluation framework, AntiDB_Purify, finds that all current methods fail under realistic purification threats, and calls for more imperceptible and robust protection mechanisms.
Details
Motivation: Personalized AI applications such as DreamBooth pose privacy risks, especially facial identity leakage. Existing defenses such as Anti-DreamBooth inject adversarial perturbations, but these produce perceptible artifacts and are fragile, so their practical effectiveness needs systematic evaluation. Method: The authors propose an evaluation framework, AntiDB_Purify, that systematically tests the robustness of existing defenses against realistic purification, including traditional image filters and adversarial purification. Result: All existing defenses lose their protective effect after simple filtering or adversarial purification; once the perturbations are removed, the model can again memorize and reproduce the user's identity. Conclusion: Current defenses offer a false sense of security; future work needs protections that are both less perceptible and more robust to truly safeguard user identity. Abstract: Personalized AI applications such as DreamBooth enable the generation of customized content from user images, but also raise significant privacy concerns, particularly the risk of facial identity leakage. Recent defense mechanisms like Anti-DreamBooth attempt to mitigate this risk by injecting adversarial perturbations into user photos to prevent successful personalization. However, we identify two critical yet overlooked limitations of these methods. First, the adversarial examples often exhibit perceptible artifacts such as conspicuous patterns or stripes, making them easily detectable as manipulated content. Second, the perturbations are highly fragile, as even a simple, non-learned filter can effectively remove them, thereby restoring the model's ability to memorize and reproduce user identity. To investigate this vulnerability, we propose a novel evaluation framework, AntiDB_Purify, to systematically evaluate existing defenses under realistic purification threats, including both traditional image filters and adversarial purification. Results reveal that none of the current methods maintains its protective effectiveness under such threats. These findings highlight that current defenses offer a false sense of security and underscore the urgent need for more imperceptible and robust protections to safeguard user identity in personalized generation.
[152] SAMIRO: Spatial Attention Mutual Information Regularization with a Pre-trained Model as Oracle for Lane Detection
Hyunjong Lee,Jangho Lee,Jaekoo Lee
Main category: cs.CV
TL;DR: This paper proposes SAMIRO, a lane detection method that combines a pre-trained model with spatial attention mutual information regularization to improve existing models in complex environments.
Details
Motivation: Background clutter, varying illumination, and occlusion in real environments leave conventional data-driven lane detection with high annotation costs and poor generalization, so contextual and global information is needed to improve detection. Method: SAMIRO uses a pre-trained model as an Oracle, preserves domain-agnostic spatial information through a spatial attention mechanism, and plugs into a range of mainstream lane detection models. Result: Experiments on CULane, TuSimple, and LLAMAS show that SAMIRO consistently improves the detection performance of different models. Conclusion: SAMIRO is an effective plug-and-play module that improves lane detection through knowledge transfer, with good generality and application prospects. Abstract: Lane detection is an important topic in future mobility solutions. Real-world environmental challenges such as background clutter, varying illumination, and occlusions pose significant obstacles to effective lane detection, particularly when relying on data-driven approaches that require substantial effort and cost for data collection and annotation. To address these issues, lane detection methods must leverage contextual and global information from surrounding lanes and objects. In this paper, we propose Spatial Attention Mutual Information Regularization with a pre-trained model as an Oracle, called SAMIRO. SAMIRO enhances lane detection performance by transferring knowledge from a pretrained model while preserving domain-agnostic spatial information. Leveraging SAMIRO's plug-and-play characteristic, we integrate it into various state-of-the-art lane detection approaches and conduct extensive experiments on major benchmarks such as CULane, TuSimple, and LLAMAS. The results demonstrate that SAMIRO consistently improves performance across different models and datasets. The code will be made available upon publication.
[153] Physics informed Transformer-VAE for biophysical parameter estimation: PROSAIL model inversion in Sentinel-2 imagery
Prince Mensah,Pelumi Victor Aderinto,Ibrahim Salihu Yusuf,Arnu Pretorius
Main category: cs.CV
TL;DR: A physics-informed Transformer-VAE architecture inverts the PROSAIL model from Sentinel-2 data to estimate vegetation canopy parameters simultaneously; it is trained only on simulated data yet matches state-of-the-art methods that use real imagery.
Details
Motivation: Accurately retrieving vegetation biophysical parameters from satellite imagery is essential for ecosystem monitoring and agricultural management, but existing methods mostly depend on real imagery for training, which limits their wider use. Method: A physics-informed Transformer-VAE uses the PROSAIL radiative transfer model as a differentiable physical decoder and is trained only on simulated data, retrieving leaf area index (LAI) and canopy chlorophyll content (CCC) simultaneously. Result: On the real-world field datasets FRM4Veg and BelSAR, accuracy is comparable to state-of-the-art methods trained on real Sentinel-2 imagery, with no in-situ labels or calibration on real images. Conclusion: By combining a physical model with deep learning, the method achieves self-supervised retrieval of vegetation parameters without real labels, offering an efficient, physically consistent approach to global-scale vegetation monitoring. Abstract: Accurate retrieval of vegetation biophysical variables from satellite imagery is crucial for ecosystem monitoring and agricultural management. In this work, we propose a physics-informed Transformer-VAE architecture to invert the PROSAIL radiative transfer model for simultaneous estimation of key canopy parameters from Sentinel-2 data. Unlike previous hybrid approaches that require real satellite images for self-supervised training, our model is trained exclusively on simulated data, yet achieves performance on par with state-of-the-art methods that utilize real imagery. The Transformer-VAE incorporates the PROSAIL model as a differentiable physical decoder, ensuring that inferred latent variables correspond to physically plausible leaf and canopy properties. We demonstrate retrieval of leaf area index (LAI) and canopy chlorophyll content (CCC) on real-world field datasets (FRM4Veg and BelSAR) with accuracy comparable to models trained with real Sentinel-2 data. Our method requires no in-situ labels or calibration on real images, offering a cost-effective and self-supervised solution for global vegetation monitoring. The proposed approach illustrates how integrating physical models with advanced deep networks can improve the inversion of radiative transfer models (RTMs), opening new prospects for large-scale, physically-constrained remote sensing of vegetation traits.
[154] MonkeyOCR v1.5 Technical Report: Unlocking Robust Document Parsing for Complex Patterns
Jiarui Zhang,Yuliang Liu,Zijun Wu,Guosheng Pang,Zhili Ye,Yupei Zhong,Junteng Ma,Tao Wei,Haiyang Xu,Weikai Chen,Zeen Wang,Qiangjun Ji,Fanxi Zhou,Qi Zhang,Yuanrui Hu,Jiahao Liu,Zhang Li,Ziyang Zhang,Qiang Liu,Xiang Bai
Main category: cs.CV
TL;DR: MonkeyOCR v1.5 is a unified vision-language framework for parsing complex documents that improves layout understanding and content recognition through a two-stage pipeline.
Details
Motivation: Existing OCR systems still struggle with real-world documents that have complex layouts, multi-level tables, embedded images or formulas, and cross-page structures, so stronger document parsing is needed. Method: A two-stage pipeline: the first stage uses a large multimodal model to jointly predict document layout and reading order; the second performs localized recognition of text, formulas, and tables within the detected regions. A visual consistency-based reinforcement learning scheme improves table recognition, and two specialized modules handle tables containing images and the merging of cross-page tables. Result: On OmniDocBench v1.5, MonkeyOCR v1.5 outperforms PPOCR-VL and MinerU 2.5, reaching state-of-the-art performance and showing exceptional robustness on visually complex documents. Conclusion: MonkeyOCR v1.5 improves parsing accuracy and robustness on complex documents, particularly for multimodal elements and cross-region structures. Abstract: Document parsing is a core task in document intelligence, supporting applications such as information extraction, retrieval-augmented generation, and automated document analysis. However, real-world documents often feature complex layouts with multi-level tables, embedded images or formulas, and cross-page structures, which remain challenging for existing OCR systems. We introduce MonkeyOCR v1.5, a unified vision-language framework that enhances both layout understanding and content recognition through a two-stage parsing pipeline. The first stage employs a large multimodal model to jointly predict document layout and reading order, leveraging visual information to ensure structural and sequential consistency. The second stage performs localized recognition of text, formulas, and tables within detected regions, maintaining high visual fidelity while reducing error propagation. To address complex table structures, we propose a visual consistency-based reinforcement learning scheme that evaluates recognition quality via render-and-compare alignment, improving structural accuracy without manual annotations. Additionally, two specialized modules, Image-Decoupled Table Parsing and Type-Guided Table Merging, are introduced to enable reliable parsing of tables containing embedded images and reconstruction of tables crossing pages or columns.
Comprehensive experiments on OmniDocBench v1.5 demonstrate that MonkeyOCR v1.5 achieves state-of-the-art performance, outperforming PPOCR-VL and MinerU 2.5 while showing exceptional robustness in visually complex document scenarios.
[155] GrounDiff: Diffusion-Based Ground Surface Generation from Digital Surface Models
Oussema Dhaouadi,Johannes Meier,Jacques Kaiser,Daniel Cremers
Main category: cs.CV
TL;DR: This paper presents Ground Diffusion (GrounDiff), the first diffusion-based framework for generating Digital Terrain Models (DTMs). It frames the removal of non-ground structures as a denoising task to convert Digital Surface Models (DSMs) into DTMs and, combined with a confidence-guided gated design and Prior-Guided Stitching (PrioStitch), clearly outperforms existing methods on several benchmarks.
Details
Motivation: Traditional DTM generation relies on manually tuned parameters or on complex network architectures that need post-processing; an efficient, general, and accurate automated solution is lacking. Method: GrounDiff uses a diffusion model to iteratively remove non-ground structures from the DSM, with a gating mechanism and confidence-guided generation for selective filtering. For scalability, PrioStitch uses a low-resolution global prior generated by GrounDiff to guide local high-resolution predictions. Result: RMSE drops by up to 93% on ALS2DTM and up to 47% on USGS. For road reconstruction, distance error on the GeRoD benchmark is up to 81% lower than with specialized methods while smoothness remains competitive; the GrounDiff+ variant further improves surface smoothness. Conclusion: GrounDiff is the first diffusion-based DSM-to-DTM framework; it is accurate, generalizes well, and scales, and it surpasses state-of-the-art methods across applications, including demanding road reconstruction, without task-specific optimization. Abstract: Digital Terrain Models (DTMs) represent the bare-earth elevation and are important in numerous geospatial applications. Such data models cannot be directly measured by sensors and are typically generated from Digital Surface Models (DSMs) derived from LiDAR or photogrammetry. Traditional filtering approaches rely on manually tuned parameters, while learning-based methods require well-designed architectures, often combined with post-processing. To address these challenges, we introduce Ground Diffusion (GrounDiff), the first diffusion-based framework that iteratively removes non-ground structures by formulating the problem as a denoising task. We incorporate a gated design with confidence-guided generation that enables selective filtering. To increase scalability, we further propose Prior-Guided Stitching (PrioStitch), which employs a downsampled global prior automatically generated using GrounDiff to guide local high-resolution predictions. We evaluate our method on the DSM-to-DTM translation task across diverse datasets, showing that GrounDiff consistently outperforms deep learning-based state-of-the-art methods, reducing RMSE by up to 93% on ALS2DTM and up to 47% on USGS benchmarks. In the task of road reconstruction, which requires both high precision and smoothness, our method achieves up to 81% lower distance error compared to specialized techniques on the GeRoD benchmark, while maintaining competitive surface smoothness using only DSM inputs, without task-specific optimization.
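For readers unfamiliar with how figures such as "RMSE reduced by up to 93%" are derived, the sketch below computes RMSE between a predicted and a reference elevation raster and the relative reduction against a baseline. It only illustrates the metric under assumed conventions: the function names `rmse` and `relative_reduction`, the optional validity mask, and the toy rasters are not taken from the paper.

```python
import numpy as np

def rmse(pred, ref, valid_mask=None):
    """Root-mean-square error between two elevation rasters (same shape)."""
    diff = np.asarray(pred, dtype=float) - np.asarray(ref, dtype=float)
    if valid_mask is not None:
        # Keep only cells with a valid reference elevation (assumed convention).
        diff = diff[np.asarray(valid_mask, dtype=bool)]
    return float(np.sqrt(np.mean(diff ** 2)))

def relative_reduction(baseline_error, new_error):
    """Percentage by which new_error is lower than baseline_error."""
    return 100.0 * (baseline_error - new_error) / baseline_error

# Toy 2x2 rasters in metres: the prediction is off by 3 m and 4 m in two cells.
reference = [[0.0, 0.0], [0.0, 0.0]]
predicted = [[3.0, 4.0], [0.0, 0.0]]
error = rmse(predicted, reference)            # sqrt((9 + 16) / 4) = 2.5
print(error, relative_reduction(2.0, 0.5))    # 2.5 75.0
```

Note that a relative reduction is computed against a specific baseline, so percentages reported on different benchmarks are not directly comparable with one another.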
Our variant for road reconstruction, GrounDiff+, is specifically designed to produce even smoother surfaces, further surpassing state-of-the-art methods. The project page is available at https://deepscenario.github.io/GrounDiff/.
[156] LLM-YOLOMS: Large Language Model-based Semantic Interpretation and Fault Diagnosis for Wind Turbine Components
Yaru Li,Yanxue Wang,Meng Li,Xinming Li,Jianbo Feng
Main category: cs.CV
TL;DR: An intelligent fault analysis framework for wind turbines combines YOLOMS with a large language model (LLM); multi-scale detection and a KV mapping module improve fault feature extraction and semantic interpretability, and experiments show high accuracy in fault detection and in generating maintenance recommendations.
Details
Motivation: Existing wind turbine fault detection methods rely mostly on visual recognition; their outputs lack semantic interpretability and do little to support maintenance decisions, so diagnostic results need to be more understandable and usable. Method: YOLOMS performs multi-scale detection with sliding-window cropping to strengthen feature extraction; a lightweight KV mapping module converts detection results into structured text with qualitative and quantitative attributes; a domain-tuned LLM then reasons over this text to produce interpretable fault analyses and maintenance recommendations. Result: On real-world datasets, the framework reaches 90.6% fault detection accuracy and an average accuracy of 89% for generated maintenance reports. Conclusion: The method improves the semantic interpretability of wind turbine fault diagnosis and its support for maintenance decisions, offering a workable approach to intelligent operation and maintenance. Abstract: The health condition of wind turbine (WT) components is crucial for ensuring stable and reliable operation. However, existing fault detection methods are largely limited to visual recognition, producing structured outputs that lack semantic interpretability and fail to support maintenance decision-making. To address these limitations, this study proposes an integrated framework that combines YOLOMS with a large language model (LLM) for intelligent fault analysis and diagnosis. Specifically, YOLOMS employs multi-scale detection and sliding-window cropping to enhance fault feature extraction, while a lightweight key-value (KV) mapping module bridges the gap between visual outputs and textual inputs. This module converts YOLOMS detection results into structured textual representations enriched with both qualitative and quantitative attributes. A domain-tuned LLM then performs semantic reasoning to generate interpretable fault analyses and maintenance recommendations. Experiments on real-world datasets demonstrate that the proposed framework achieves a fault detection accuracy of 90.6% and generates maintenance reports with an average accuracy of 89%, thereby improving the interpretability of diagnostic results and providing practical decision support for the operation and maintenance of wind turbines.
[157] 3DFETUS: Standardizing Fetal Facial Planes in 3D Ultrasound
Alomar Antonia,Rubio Ricardo,Albaiges Gerard,Salort-Benejam Laura,Caminal Julia,Prat Maria,Rueda Carolina,Cortes Berta,Piella Gemma,Sukno Federico
Main category: cs.CV
TL;DR: The paper proposes GT++, an algorithm, and 3DFETUS, a deep learning model, for automatically standardizing the localization of facial planes in 3D fetal ultrasound.
Details
Motivation: Acquiring standard facial planes in routine fetal ultrasound is difficult because of fetal movement, variable orientation, and differences in operator skill, leading to inconsistent examinations, longer examination times, and potential diagnostic bias. Method: GT++ estimates standard facial planes from annotated anatomical landmarks, and the 3DFETUS deep learning model automates and standardizes their localization. Result: In quantitative evaluation the approach achieves a mean translation error of 4.13 mm and a mean rotation error of 7.93 degrees, outperforming state-of-the-art methods; clinical evaluation also confirms a significant improvement in plane estimation accuracy. Conclusion: GT++ and 3DFETUS improve the accuracy and consistency of facial plane localization in 3D fetal ultrasound, helping reduce operator dependence and diagnostic bias. Abstract: Acquiring standard facial planes during routine fetal ultrasound (US) examinations is often challenging due to fetal movement, variability in orientation, and operator-dependent expertise. These factors contribute to inconsistencies, increased examination time, and potential diagnostic bias. To address these challenges in the context of facial assessment, we present: 1) GT++, a robust algorithm that estimates standard facial planes from 3D US volumes using annotated anatomical landmarks; and 2) 3DFETUS, a deep learning model that automates and standardizes their localization in 3D fetal US volumes. We evaluated our methods both qualitatively, through expert clinical review, and quantitatively. The proposed approach achieved a mean translation error of 4.13 mm and a mean rotation error of 7.93 degrees per plane, outperforming other state-of-the-art methods on 3D US volumes. Clinical assessments further confirmed the effectiveness of both GT++ and 3DFETUS, demonstrating statistically significant improvements in plane estimation accuracy.
[158] RodEpil: A Video Dataset of Laboratory Rodents for Seizure Detection and Benchmark Evaluation
Daniele Perlo,Vladimir Despotovic,Selma Boudissa,Sang-Yoon Kim,Petr Nazarov,Yanrong Zhang,Max Wintermark,Olivier Keunen
Main category: cs.CV
TL;DR: The paper introduces and releases RodEpil, a dataset of 13,053 annotated short video clips for video-based seizure detection in rodents.
Details
Motivation: Non-invasive, video-based preclinical epilepsy research needs a high-quality, well-annotated video dataset for training and evaluating automatic seizure detection models. Method: The authors build a video dataset with positive and negative samples, use five-fold cross-validation with strict subject-wise partitioning, and run baseline experiments with a TimeSformer model. Result: TimeSformer distinguishes seizures from normal activity with an average F1-score of 97%, supporting the usefulness of the dataset and the strength of the baseline. Conclusion: RodEpil is a reliable resource for reproducible seizure monitoring research and advances video-based analysis of animal behavior. Abstract: We introduce a curated video dataset of laboratory rodents for automatic detection of convulsive events. The dataset contains short (10 s) top-down and side-view video clips of individual rodents, labeled at clip level as normal activity or seizure. It includes 10,101 negative samples and 2,952 positive samples collected from 19 subjects. We describe the data curation, annotation protocol and preprocessing pipeline, and report baseline experiments using a transformer-based video classifier (TimeSformer). Experiments employ five-fold cross-validation with strict subject-wise partitioning to prevent data leakage (no subject appears in more than one fold). Results show that the TimeSformer architecture enables discrimination between seizure and normal activity with an average F1-score of 97%. The dataset and baseline code are publicly released to support reproducible research on non-invasive, video-based monitoring in preclinical epilepsy research. RodEpil Dataset access - DOI: 10.5281/zenodo.17601357
[159] Histology-informed tiling of whole tissue sections improves the interpretability and predictability of cancer relapse and genetic alterations
Willem Bonnaffé,Yang Hu,Andrea Chatrian,Mengran Fan,Stefano Malacrino,Sandy Figiel,CRUK ICGC Prostate Group,Srinivasa R. Rao,Richard Colling,Richard J. Bryant,Freddie C. Hamdy,Dan J. Woodcock,Ian G. Mills,Clare Verrill,Jens Rittscher
Main category: cs.CV
TL;DR: Histology-informed tiling (HIT) extracts glands via semantic segmentation to improve the accuracy and interpretability of multiple-instance learning in cancer analysis.
Details
Motivation: Conventional digital pathology pipelines use grid-based tiling that ignores tissue architecture, which introduces irrelevant information and limits model interpretability. Method: Semantic segmentation extracts glands from whole slide images (WSIs) as biologically meaningful input patches for multiple-instance learning (MIL) and phenotyping. Result: Trained on 137 samples, HIT achieves a gland-level Dice score of 0.83. Extracting 380,000 glands from the ICGC-C and TCGA-PRAD cohorts raised MIL AUCs by 10% for detecting EMT- and MYC-related copy number variations and revealed 15 gland clusters, several associated with cancer relapse, oncogenic mutations, and high Gleason scores. Conclusion: By focusing on biologically meaningful structures, HIT improves the accuracy and interpretability of MIL models and streamlines computation. Abstract: Histopathologists establish cancer grade by assessing histological structures, such as glands in prostate cancer. Yet, digital pathology pipelines often rely on grid-based tiling that ignores tissue architecture. This introduces irrelevant information and limits interpretability. We introduce histology-informed tiling (HIT), which uses semantic segmentation to extract glands from whole slide images (WSIs) as biologically meaningful input patches for multiple-instance learning (MIL) and phenotyping. Trained on 137 samples from the ProMPT cohort, HIT achieved a gland-level Dice score of 0.83 +/- 0.17. By extracting 380,000 glands from 760 WSIs across ICGC-C and TCGA-PRAD cohorts, HIT improved the AUCs of MIL models by 10% for detecting copy number variations (CNVs) in genes related to epithelial-mesenchymal transitions (EMT) and MYC, and revealed 15 gland clusters, several of which were associated with cancer relapse, oncogenic mutations, and high Gleason. Therefore, HIT improved the accuracy and interpretability of MIL predictions, while streamlining computations by focussing on biologically meaningful structures during feature extraction.
[160] OpenSR-SRGAN: A Flexible Super-Resolution Framework for Multispectral Earth Observation Data
Simon Donike,Cesar Aybar,Julio Contreras,Luis Gómez-Chova
Main category: cs.CV
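TL;DR: OpenSR-SRGAN is an open-source, modular framework for single-image super-resolution in Earth observation. It provides a unified implementation of SRGAN-style models, exposes generators, discriminators, loss functions, and training schedules through configuration files, and applies to multispectral satellite data such as Sentinel-2, lowering the barrier for researchers and practitioners to experiment with, compare, and deploy GAN-based super-resolution.
Details
Motivation: Existing super-resolution models usually require code changes to adapt to different architectures or data, which limits extensibility and ease of use; OpenSR-SRGAN aims to provide an open, modular, easily configured framework that simplifies GAN-based super-resolution in Earth observation. Method: SRGAN-style models are implemented with a configuration-driven, modular design for generators, discriminators, loss functions, and training schedules; configurations can be reused across remote sensing scenarios, and logging, validation, and large-scene inference are built in. Result: The framework supports flexible switching between architectures, scale factors, and band setups, ships ready-to-use configurations with sensible defaults, and applies to multispectral satellite data such as Sentinel-2, reducing the complexity of model experimentation and deployment. Conclusion: As a practical tool and benchmark implementation, OpenSR-SRGAN makes super-resolution in Earth observation more accessible, reproducible, and extensible. Abstract: We present OpenSR-SRGAN, an open and modular framework for single-image super-resolution in Earth Observation. The software provides a unified implementation of SRGAN-style models that is easy to configure, extend, and apply to multispectral satellite data such as Sentinel-2. Instead of requiring users to modify model code, OpenSR-SRGAN exposes generators, discriminators, loss functions, and training schedules through concise configuration files, making it straightforward to switch between architectures, scale factors, and band setups. The framework is designed as a practical tool and benchmark implementation rather than a state-of-the-art model. It ships with ready-to-use configurations for common remote sensing scenarios, sensible default settings for adversarial training, and built-in hooks for logging, validation, and large-scene inference. By turning GAN-based super-resolution into a configuration-driven workflow, OpenSR-SRGAN lowers the entry barrier for researchers and practitioners who wish to experiment with SRGANs, compare models in a reproducible way, and deploy super-resolution pipelines across diverse Earth-observation datasets.
[161] Utility of Pancreas Surface Lobularity as a CT Biomarker for Opportunistic Screening of Type 2 Diabetes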
Tejas Sudharshan Mathai,Anisa V. Prasad,Xinya Wang,Praveen T. S. Balamuralikrishna,Yan Zhuang,Abhinav Suri,Jianfei Liu,Perry J. Pickhardt,Ronald M. Summers
Main category: cs.CV
TL;DR: This study proposes a fully automated deep learning approach to opportunistic screening for type 2 diabetes mellitus (T2DM) using CT biomarkers such as pancreatic surface lobularity (PSL). PSL is significantly higher in T2DM patients, and a multivariate model reaches an AUC of 0.90.
Details
Motivation: Early detection of T2DM is essential for preventing organ damage. Prior studies have examined pancreas volume and fat content, but the role of pancreatic surface lobularity (PSL) in T2DM remains underexplored, motivating an automated imaging biomarker analysis for early screening. Method: Four deep learning models segment the pancreas and other abdominal structures in an internal CT dataset of 584 patients; PSL is detected automatically and imaging biomarkers are extracted. Segmentation is evaluated with the Dice score and ASSD, and a multivariate model is built to predict T2DM. Result: PSL is significantly higher in T2DM patients than in non-diabetic patients (p=0.01); the PancAP model performs best (Dice: 0.79 ± 0.17, ASSD: 1.94 ± 2.63 mm); the multivariate model based on CT biomarkers predicts T2DM with an AUC of 0.90, 66.7% sensitivity, and 91.9% specificity. Conclusion: PSL is a potentially useful imaging biomarker for T2DM; combining deep learning segmentation with CT biomarker analysis enables high-specificity opportunistic screening and may help predict early disease onset. Abstract: Type 2 Diabetes Mellitus (T2DM) is a chronic metabolic disease that affects millions of people worldwide. Early detection is crucial because the disease can alter pancreas function through morphological changes and increased deposition of ectopic fat, eventually leading to organ damage. While studies have shown an association between T2DM and pancreas volume and fat content, the role of increased pancreatic surface lobularity (PSL) in patients with T2DM has not been fully investigated. In this pilot work, we propose a fully automated approach to delineate the pancreas and other abdominal structures, derive CT imaging biomarkers, and opportunistically screen for T2DM. Four deep learning-based models were used to segment the pancreas in an internal dataset of 584 patients (297 males, 437 non-diabetic, age: 45 ± 15 years). PSL was automatically detected and was higher for diabetic patients (p=0.01), at 4.26 ± 8.32 compared to 3.19 ± 3.62 for non-diabetic patients. The PancAP model achieved the highest Dice score of 0.79 ± 0.17 and the lowest ASSD error of 1.94 ± 2.63 mm (p<0.05). For predicting T2DM, a multivariate model trained with CT biomarkers attained 0.90 AUC, 66.7% sensitivity, and 91.9% specificity. Our results suggest that PSL is useful for T2DM screening and could potentially help predict the early onset of T2DM.
[162] SPOT: Sparsification with Attention Dynamics via Token Relevance in Vision Transformers
Oded Schlesinger,Amirhossein Farzam,J. Matias Di Martino,Guillermo Sapiro
Main category: cs.CV
TL;DR: This paper proposes SPOT, a framework that uses token relevance dynamics to identify and remove redundant tokens early in Vision Transformers, improving computational efficiency without sacrificing performance.
Details
Motivation: The computational cost of Vision Transformers (ViTs) grows quadratically with the number of tokens, so efficient ways to cut computation are needed. Existing methods struggle to identify unimportant tokens accurately at an early stage, which calls for a more context-aware and interpretable way to assess token importance. Method: SPOT combines token embeddings, cross-layer interactions, and attention dynamics, using lightweight predictors to infer token importance at each layer and sparsify tokens adaptively per input. It plugs into various ViT architectures and allows the degree of sparsity to be adjusted to different resource constraints. Result: Compared with standard ViTs, SPOT achieves efficiency gains of up to 40% while maintaining or even improving accuracy. Conclusion: SPOT is an efficient, flexible, and interpretable token pruning method that eases the high computational cost of ViTs across architectures and resource-constrained settings. Abstract: While Vision Transformers (ViT) have demonstrated remarkable performance across diverse tasks, their computational demands are substantial, scaling quadratically with the number of processed tokens. Compact attention representations, reflecting token interaction distributions, can guide early detection and reduction of less salient tokens prior to attention computation. Motivated by this, we present SParsification with attentiOn dynamics via Token relevance (SPOT), a framework for early detection of redundant tokens within ViTs that leverages token embeddings, interactions, and attention dynamics across layers to infer token importance, resulting in a more context-aware and interpretable relevance detection process. SPOT informs token sparsification and facilitates the elimination of such tokens, improving computational efficiency without sacrificing performance. SPOT employs computationally lightweight predictors that can be plugged into various ViT architectures and learn to derive effective input-specific token prioritization across layers. Its versatile design supports a range of performance levels adaptable to varying resource constraints. Empirical evaluations demonstrate significant efficiency gains of up to 40% compared to standard ViTs, while maintaining or even improving accuracy. Code and models are available at https://github.com/odedsc/SPOT.
[163] Learnable Total Variation with Lambda Mapping for Low-Dose CT Denoising
Yusuf Talha Basak,Mehmet Ozan Unal,Metin Ertas,Isa Yildirim
Main category: cs.CV
TL;DR: A Learnable Total Variation (LTV) framework jointly optimizes reconstruction and regularization through end-to-end training, yielding spatially adaptive smoothing that outperforms classical TV and FBP+U-Net in denoising and edge preservation.
Details
Motivation: Classical total variation (TV) depends on a fixed lambda parameter, which limits its efficiency and practicality and prevents it from balancing denoising and edge preservation adaptively across regions. Method: The Learnable Total Variation (LTV) framework couples an unrolled TV solver with a data-driven LambdaNet that predicts a per-pixel regularization map, and the two are optimized jointly end to end. Result: On the DeepLesion dataset, LTV gains +2.9 dB PSNR and +6% SSIM on average over classical TV and FBP+U-Net, with strong smoothing in homogeneous regions and relaxed smoothing near boundaries. Conclusion: LTV offers an interpretable alternative to black-box CNNs for image reconstruction and a basis for 3D and data-consistency-driven reconstruction. Abstract: Although Total Variation (TV) performs well in noise reduction and edge preservation on images, its dependence on the lambda parameter limits its efficiency and makes it difficult to use effectively. In this study, we present a Learnable Total Variation (LTV) framework that couples an unrolled TV solver with a data-driven Lambda Mapping Network (LambdaNet) predicting a per-pixel regularization map. The pipeline is trained end-to-end so that reconstruction and regularization are optimized jointly, yielding spatially adaptive smoothing: strong in homogeneous regions, relaxed near anatomical boundaries. Experiments on the DeepLesion dataset, using a realistic noise model adapted from the LoDoPaB-CT methodology, show consistent gains over classical TV and FBP+U-Net: +2.9 dB PSNR and +6% SSIM on average. LTV provides an interpretable alternative to black-box CNNs and a basis for 3D and data-consistency-driven reconstruction. Our code is available at: https://github.com/itu-biai/deep_tv_for_ldct
[164] SemanticVLA: Semantic-Aligned Sparsification and Enhancement for Efficient Robotic Manipulation
Wei Li,Renshan Zhang,Rui Shao,Zhijian Fang,Kaiwen Zhou,Zhuotao Tian,Liqiang Nie
Main category: cs.CV
TL;DR: This paper proposes SemanticVLA, a vision-language-action framework for efficient robotic manipulation. Semantic-aligned sparsification and enhancement reduce redundant perception while strengthening the semantic alignment between instructions and vision, improving both performance and efficiency.
Details
Motivation: Existing VLA models suffer from perceptual redundancy and superficial instruction-vision alignment, which weakens semantic grounding and hurts deployment efficiency and effectiveness. Method: SemanticVLA comprises (1) SD-Pruner (ID-Pruner and SA-Pruner) for semantically guided visual sparsification; (2) SH-Fuser, which fuses dense patches and sparse tokens to integrate semantic and geometric information; and (3) SA-Coupler, which improves the mapping from perception to action for more efficient and interpretable behavior modeling. Result: Across simulation and real-world tasks, SemanticVLA surpasses OpenVLA on the LIBERO benchmark by 21.1% in success rate while cutting training cost 3.0-fold and inference latency 2.7-fold. Conclusion: Through semantic-aligned sparsification and enhancement, SemanticVLA addresses perceptual redundancy and weak semantic grounding, delivering high-performance, efficient robotic manipulation with good prospects for practical use. Abstract: Vision-Language-Action (VLA) models have advanced in robotic manipulation, yet practical deployment remains hindered by two key limitations: 1) perceptual redundancy, where irrelevant visual inputs are processed inefficiently, and 2) superficial instruction-vision alignment, which hampers semantic grounding of actions. In this paper, we propose SemanticVLA, a novel VLA framework that performs Semantic-Aligned Sparsification and Enhancement for Efficient Robotic Manipulation. Specifically: 1) To sparsify redundant perception while preserving semantic alignment, the Semantic-guided Dual Visual Pruner (SD-Pruner) combines an Instruction-driven Pruner (ID-Pruner), which extracts global action cues and local semantic anchors in SigLIP, with a Spatial-aggregation Pruner (SA-Pruner), which compacts geometry-rich features into task-adaptive tokens in DINOv2. 2) To exploit sparsified features and integrate semantics with spatial geometry, the Semantic-complementary Hierarchical Fuser (SH-Fuser) fuses dense patches and sparse tokens across SigLIP and DINOv2 for coherent representation. 3) To enhance the transformation from perception to action, the Semantic-conditioned Action Coupler (SA-Coupler) replaces the conventional observation-to-DoF approach, yielding more efficient and interpretable behavior modeling for manipulation tasks. Extensive experiments on simulation and real-world tasks show that SemanticVLA sets a new SOTA in both performance and efficiency.
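The general idea of instruction-driven visual sparsification can be illustrated with a generic top-k filter. The sketch below is not the ID-Pruner or SA-Pruner described above: it simply ranks visual tokens by cosine similarity to a single instruction embedding and keeps the best `keep` tokens. The function name `prune_tokens`, the use of cosine similarity, and the toy 2-D embeddings are all assumptions for the example.

```python
import numpy as np

def prune_tokens(visual_tokens, instruction_embedding, keep):
    """Keep the `keep` visual tokens most similar to the instruction.

    visual_tokens: array of shape (num_tokens, dim).
    instruction_embedding: array of shape (dim,).
    Returns the kept tokens and their original indices, in original order.
    """
    tokens = np.asarray(visual_tokens, dtype=float)
    query = np.asarray(instruction_embedding, dtype=float)
    # Cosine similarity between every token and the instruction vector.
    token_norms = np.linalg.norm(tokens, axis=1) + 1e-8
    query_norm = np.linalg.norm(query) + 1e-8
    scores = (tokens @ query) / (token_norms * query_norm)
    # Indices of the highest-scoring tokens, restored to spatial order.
    kept = np.sort(np.argsort(-scores)[:keep])
    return tokens[kept], kept

# Four toy tokens in 2-D; the instruction points along the first axis.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]]
kept_tokens, kept_idx = prune_tokens(tokens, [1.0, 0.0], keep=2)
print(kept_idx)  # tokens 0 and 2 are the most aligned with the instruction
```

Returning the kept indices alongside the tokens preserves the spatial positions of the surviving patches, which a downstream fusion step would typically need.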
SemanticVLA surpasses OpenVLA on the LIBERO benchmark by 21.1% in success rate, while reducing training cost and inference latency by 3.0-fold and 2.7-fold, respectively. SemanticVLA is open-sourced and publicly available at https://github.com/JiuTian-VL/SemanticVLA
[165] Dynamic Avatar-Scene Rendering from Human-centric Context
Wenqing Wang,Haosen Yang,Josef Kittler,Xiatian Zhu
Main category: cs.CV
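TL;DR: Proposes a Separate-then-Map (StM) strategy for reconstructing dynamic humans interacting with real environments from monocular video. A shared transformation function unifies the separately modeled human and scene, markedly improving rendering quality and accuracy.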
Details
Motivation: When modeling dynamic human-scene interaction, existing methods either ignore the distinct motion characteristics of different components, which leads to incomplete reconstructions, or lack information exchange between components, which causes spatial inconsistencies and boundary artifacts. Method: Models the human and the scene separately and introduces a dedicated information mapping mechanism: a shared transformation function is designed for each Gaussian attribute so that the human and the scene can be optimized efficiently as a unified whole. Result: Experiments on multiple monocular video datasets show that StM outperforms current state-of-the-art methods in both visual quality and rendering accuracy, especially at human-scene interaction boundaries. Conclusion: The StM strategy resolves the information disconnect between human and scene models and achieves high-quality, spatially consistent 4D reconstruction while remaining computationally efficient. Abstract: Reconstructing dynamic humans interacting with real-world environments from monocular videos is an important and challenging task. Despite considerable progress in 4D neural rendering, existing approaches either model dynamic scenes holistically or model scenes and backgrounds separately in order to introduce parametric human priors. However, these approaches either neglect the distinct motion characteristics of the various components in the scene, especially the human, leading to incomplete reconstructions, or ignore the information exchange between the separately modeled components, resulting in spatial inconsistencies and visual artifacts at human-scene boundaries. To address this, we propose the Separate-then-Map (StM) strategy, which introduces a dedicated information mapping mechanism to bridge separately defined and optimized models. Our method employs a shared transformation function for each Gaussian attribute to unify separately modeled components, enhancing computational efficiency by avoiding exhaustive pairwise interactions while ensuring spatial and visual coherence between humans and their surroundings. Extensive experiments on monocular video datasets demonstrate that StM significantly outperforms existing state-of-the-art methods in both visual quality and rendering accuracy, particularly at challenging human-scene interaction boundaries.
[166] Benchmarking Diversity in Image Generation via Attribute-Conditional Human Evaluation
Isabela Albuquerque,Ira Ktena,Olivia Wiles,Ivana Kajić,Amal Rannen-Triki,Cristina Vasconcelos,Aida Nematzadeh
Main category: cs.CV
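TL;DR: This paper proposes a new framework for evaluating the diversity of text-to-image (T2I) models. It measures generation diversity systematically through a human evaluation template, a carefully designed prompt set, and statistical methods, and it exposes the diversity shortcomings of existing models.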
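The binomial-test comparison that this entry describes can be sketched as follows. The setup is an assumption, since the entry does not spell it out: annotators see outputs from two models for the same prompt and pick the more diverse set, ties are discarded, and the null hypothesis is that each model wins with probability 0.5. The function name and the counts are hypothetical.

```python
from math import comb

def binomial_two_sided_p(wins, trials, p_null=0.5):
    """Exact two-sided binomial test.

    Sums the probability of every outcome that is no more likely than the
    observed one under the null hypothesis.
    """
    probs = [
        comb(trials, i) * p_null**i * (1 - p_null) ** (trials - i)
        for i in range(trials + 1)
    ]
    observed = probs[wins]
    # Small relative tolerance so outcomes tied with the observed one count.
    total = sum(q for q in probs if q <= observed * (1 + 1e-9))
    return min(1.0, total)

# Hypothetical counts: model A judged more diverse in 9 of 10 non-tied comparisons.
p_value = binomial_two_sided_p(wins=9, trials=10)
print(round(p_value, 4))
```

With 9 wins out of 10, the outcomes 0, 1, 9, and 10 are at least as extreme as the observation, so the p-value is 22/1024. Running the same test per concept or per factor of variation gives a per-category comparison rather than a single global score.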
Details
Motivation: Although the generation quality of current T2I models has improved, their outputs lack diversity, so a reliable method for evaluating diversity is needed. Method: Proposes a human evaluation template and a prompt set covering diverse concepts and their factors of variation, and uses binomial tests to compare models on human annotations; it also evaluates how well several image embeddings measure diversity. Result: The framework systematically evaluates the diversity of T2I models across concepts and factors of variation, ranks models by diversity, and identifies the categories where models perform poorly. Conclusion: The study provides a reliable method for evaluating T2I model diversity and lays a foundation for improving model diversity and developing related metrics. Abstract: Despite advances in generation quality, current text-to-image (T2I) models often lack diversity, generating homogeneous outputs. This work introduces a framework to address the need for robust diversity evaluation in T2I models. Our framework systematically assesses diversity by evaluating individual concepts and their relevant factors of variation. Key contributions include: (1) a novel human evaluation template for nuanced diversity assessment; (2) a curated prompt set covering diverse concepts with their identified factors of variation (e.g. prompt: An image of an apple, factor of variation: color); and (3) a methodology for comparing models in terms of human annotations via binomial tests. Furthermore, we rigorously compare various image embeddings for diversity measurement. Notably, our principled approach enables ranking of T2I models by diversity, identifying categories where they particularly struggle. This research offers a robust methodology and insights, paving the way for improvements in T2I model diversity and metric development.
[167] A Style is Worth One Code: Unlocking Code-to-Style Image Generation with Discrete Style Space
Huijie Liu,Shuhao Cui,Haoxiang Cao,Shuai Ma,Kai Wu,Guoliang Kang
Main category: cs.CV
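TL;DR: This paper proposes CoTyle, the first open-source method for code-to-style image generation. It generates novel, consistent visual styles from a numerical style code, supporting the claim that "a style is worth one code".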
Details
Motivation: Existing style generation methods depend on complex inputs (long text prompts, reference images, or fine-tuning), struggle to guarantee style consistency and creativity, and leave a gap in academic research on generating styles from numerical codes. Method: First, a discrete style codebook is trained on a collection of images to extract style embeddings, which condition a text-to-image diffusion model. Next, an autoregressive style generator is trained on the style embeddings to model their distribution so that new style embeddings can be synthesized. At inference, the generator maps a numerical style code to a style embedding, which guides the diffusion model to produce images in the corresponding style. Result: Experiments show that CoTyle turns a numerical code into an effective style controller, generating diverse, consistent, novel visual styles without text or reference images, with high reproducibility and simplicity. Conclusion: CoTyle fills the academic gap in code-driven style generation, shows that a single numerical code can control and generate complex visual styles, and advances open, reproducible research on stylized generation. Abstract: Innovative visual stylization is a cornerstone of artistic creation, yet generating novel and consistent visual styles remains a significant challenge. Existing generative approaches typically rely on lengthy textual prompts, reference images, or parameter-efficient fine-tuning to guide style-aware image generation, but often struggle with style consistency, limited creativity, and complex style representations. In this paper, we affirm that a style is worth one numerical code by introducing the novel task, code-to-style image generation, which produces images with novel, consistent visual styles conditioned solely on a numerical style code. To date, this field has been explored primarily by industry (e.g., Midjourney), with no open-source research from the academic community. To fill this gap, we propose CoTyle, the first open-source method for this task. Specifically, we first train a discrete style codebook from a collection of images to extract style embeddings. These embeddings serve as conditions for a text-to-image diffusion model (T2I-DM) to generate stylistic images. Subsequently, we train an autoregressive style generator on the discrete style embeddings to model their distribution, allowing the synthesis of novel style embeddings. During inference, a numerical style code is mapped to a unique style embedding by the style generator, and this embedding guides the T2I-DM to generate images in the corresponding style. Unlike existing methods, our method offers unparalleled simplicity and diversity, unlocking a vast space of reproducible styles from minimal input.
Extensive experiments validate that CoTyle effectively turns a numerical code into a style controller, demonstrating that a style is worth one code.
[168] OmniVGGT: Omni-Modality Driven Visual Geometry Grounded
Haosong Peng,Hao Li,Yalun Dai,Yushi Lan,Yihang Luo,Tianyu Qi,Zhengshen Zhang,Yufeng Zhan,Junfei Zhang,Wenchao Xu,Ziwei Liu
Main category: cs.CV
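TL;DR: This paper proposes OmniVGGT, a general 3D foundation model framework that can fuse an arbitrary number of geometric modalities (such as depth and camera intrinsics/extrinsics). With a GeoAdapter and a stochastic multimodal fusion strategy, it improves performance on several vision tasks while keeping inference efficient, and its practical value is validated in vision-language-action models.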
Details
Motivation: Most existing general 3D foundation models use only RGB input and ignore readily available geometric information (such as camera parameters and depth maps), which limits their performance on spatial perception tasks; a method that can flexibly exploit multiple geometric modalities is therefore needed. Method: The OmniVGGT framework includes a GeoAdapter module, which uses zero-initialized convolutions to inject geometric information progressively, and a stochastic multimodal fusion training strategy, which randomly samples subsets of modalities during training. Together they fuse multiple geometric inputs without disrupting the foundation model's representation space and allow any combination of modalities at test time. Result: OmniVGGT outperforms prior methods that use auxiliary inputs on monocular/multi-view depth estimation, multi-view stereo, and camera pose estimation, and reaches SOTA even with RGB-only input; when integrated into vision-language-action models, it outperforms the baseline on mainstream benchmarks and robotic tasks. Conclusion: OmniVGGT fuses multiple geometric modalities effectively and flexibly, significantly improving 3D vision task performance without slowing inference, and shows good generalization and practical value. Abstract: General 3D foundation models have started to lead the trend of unifying diverse vision tasks, yet most assume RGB-only inputs and ignore readily available geometric cues (e.g., camera intrinsics, poses, and depth maps). To address this issue, we introduce OmniVGGT, a novel framework that can effectively benefit from an arbitrary number of auxiliary geometric modalities during both training and inference. In our framework, a GeoAdapter is proposed to encode depth and camera intrinsics/extrinsics into a spatial foundation model. It employs zero-initialized convolutions to progressively inject geometric information without disrupting the foundation model's representation space. This design ensures stable optimization with negligible overhead, maintaining inference speed comparable to VGGT even with multiple additional inputs. Additionally, a stochastic multimodal fusion regimen is proposed, which randomly samples modality subsets per instance during training. This enables an arbitrary number of modality inputs during testing and promotes learning robust spatial representations instead of overfitting to auxiliary cues. Comprehensive experiments on monocular/multi-view depth estimation, multi-view stereo, and camera pose estimation demonstrate that OmniVGGT outperforms prior methods with auxiliary inputs and achieves state-of-the-art results even with RGB-only input. To further highlight its practical utility, we integrated OmniVGGT into vision-language-action (VLA) models.
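The two mechanisms described above, zero-initialized injection and random modality subsets, can be illustrated with a small sketch. It is written under loud assumptions: a 1x1 convolution at a single spatial location is modeled as a plain linear map, each modality is kept with probability 0.5, and every name here (`make_zero_adapter`, `inject`, `sample_modalities`) is hypothetical rather than taken from the OmniVGGT code.

```python
import random

def linear(weights, bias, x):
    """Per-location linear map, i.e. a 1x1 convolution at one pixel."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, bias)]

def make_zero_adapter(in_dim, out_dim):
    """Adapter whose weights and bias start at zero, so it adds nothing at init."""
    return [[0.0] * in_dim for _ in range(out_dim)], [0.0] * out_dim

def inject(base_feature, aux_inputs, adapters):
    """Add each available auxiliary modality to the base feature via its adapter."""
    out = list(base_feature)
    for name, x in aux_inputs.items():
        weights, bias = adapters[name]
        delta = linear(weights, bias, x)
        out = [o + d for o, d in zip(out, delta)]
    return out

def sample_modalities(available, rng, keep_prob=0.5):
    """Assumed sampling rule: keep each auxiliary modality independently."""
    return {name: x for name, x in available.items() if rng.random() < keep_prob}

base = [0.2, -1.0, 0.5]
aux = {"depth": [3.0], "intrinsics": [500.0, 500.0, 320.0, 240.0]}
adapters = {"depth": make_zero_adapter(1, 3), "intrinsics": make_zero_adapter(4, 3)}

fused = inject(base, aux, adapters)                 # identical to base at initialization
subset = sample_modalities(aux, random.Random(0))   # one training-time draw
print(fused == base, sorted(subset))
```

Because the adapters start at zero, the fused feature equals the base feature before any training step, which is why this kind of initialization does not disturb a pretrained representation. Training on random subsets means the model has seen every combination it might receive at test time, including RGB only.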
The OmniVGGT-enhanced VLA model not only outperforms the vanilla point-cloud-based baseline on mainstream benchmarks, but also effectively leverages accessible auxiliary inputs to achieve consistent gains on robotic tasks.
[169] From 2D to 3D Without Extra Baggage: Data-Efficient Cancer Detection in Digital Breast Tomosynthesis
Yen Nhi Truong Vu,Dan Guo,Sripad Joshi,Harshit Kumar,Jason Su,Thomas Paul Matthews
Main category: cs.CV
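TL;DR: Proposes M&M-3D, a model that performs 3D reasoning on DBT data without adding parameters and significantly improves localization and classification in breast cancer detection.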
Details
Motivation: Because annotated data is scarce, existing deep learning methods for digital breast tomosynthesis (DBT) face a trade-off: 2D methods discard 3D information, while 3D methods need more data and more complex architectures. Method: Modifies operations in the original M&M model to construct learnable, malignancy-guided 3D features and repeatedly mixes them with slice-level information to achieve 3D reasoning, without introducing extra parameters and with direct weight transfer from the FFDM model. Result: M&M-3D surpasses 2D projection and 3D slice-based methods by 11-54% for localization and 3-10% for classification. In the low-data regime it outperforms complex 3D reasoning variants by 20-47% for localization and 2-10% for classification. On the BCS-DBT benchmark it exceeds the previous top baseline by 4% for classification and 10% for localization. Conclusion: M&M-3D achieves 3D reasoning on DBT data without adding parameters, solves the model-transfer problem under data scarcity, and performs especially well when data is limited. Abstract: Digital Breast Tomosynthesis (DBT) enhances finding visibility for breast cancer detection by providing volumetric information that reduces the impact of overlapping tissues; however, limited annotated data has constrained the development of deep learning models for DBT. To address data scarcity, existing methods attempt to reuse 2D full-field digital mammography (FFDM) models by either flattening DBT volumes or processing slices individually, thus discarding volumetric information. Alternatively, 3D reasoning approaches introduce complex architectures that require more DBT training data. Tackling these drawbacks, we propose M&M-3D, an architecture that enables learnable 3D reasoning while remaining parameter-free relative to its FFDM counterpart, M&M. M&M-3D constructs malignancy-guided 3D features, and 3D reasoning is learned through repeatedly mixing these 3D features with slice-level information. This is achieved by modifying operations in M&M without adding parameters, thus enabling direct weight transfer from FFDM. Extensive experiments show that M&M-3D surpasses 2D projection and 3D slice-based methods by 11-54% for localization and 3-10% for classification. Additionally, M&M-3D outperforms complex 3D reasoning variants by 20-47% for localization and 2-10% for classification in the low-data regime, while matching their performance in the high-data regime. On the popular BCS-DBT benchmark, M&M-3D outperforms the previous top baseline by 4% for classification and 10% for localization.
[170] Multitask GLocal OBIA-Mamba for Sentinel-2 Landcover Mapping
Zack Dewis,Yimin Zhu,Zhengsen Xu,Mabel Heffring,Saeid Taleghanidoozdoozan,Kaylee Xiao,Motasem Alkayid,Lincoln Linlin Xu
Main category: cs.CV
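TL;DR: Proposes a land cover classification method for Sentinel-2 imagery based on the Multitask GLocal OBIA-Mamba (MSOM) model. It combines superpixels, a dual-branch CNN-Mamba structure, and a multitask optimization framework, significantly improving classification accuracy and detail.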
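Using superpixels as tokens means a sequence model sees one token per image object instead of one per pixel. A minimal sketch of that reduction follows, assuming mean pooling of per-pixel features inside each superpixel; the pooling rule, the function name `superpixel_tokens`, and the toy data are illustrative assumptions, and the superpixel labels would normally come from a segmentation algorithm that is not shown here.

```python
def superpixel_tokens(features, labels):
    """Mean-pool per-pixel features within each superpixel.

    features: H x W x C nested lists of floats.
    labels:   H x W nested lists of integer superpixel ids.
    Returns a dict mapping superpixel id to its mean feature vector.
    """
    sums, counts = {}, {}
    for feat_row, label_row in zip(features, labels):
        for feat, lab in zip(feat_row, label_row):
            acc = sums.setdefault(lab, [0.0] * len(feat))
            for c, value in enumerate(feat):
                acc[c] += value
            counts[lab] = counts.get(lab, 0) + 1
    return {lab: [v / counts[lab] for v in acc] for lab, acc in sums.items()}

# Toy 2 x 2 image with 2 channels; the top row is superpixel 0, the bottom row is 1.
features = [[[1.0, 0.0], [3.0, 0.0]],
            [[0.0, 2.0], [0.0, 4.0]]]
labels = [[0, 0],
          [1, 1]]
tokens = superpixel_tokens(features, labels)
print(tokens)
```

Here four pixels collapse to two tokens. On a real scene the reduction is much larger, which is where the savings in redundant computation come from.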
Details
Motivation: To address key problems in Sentinel-2 land cover classification: spatial heterogeneity, insufficient context information, and spectral signature ambiguity. Method: Designs an OBIA-Mamba model that uses superpixels as Mamba tokens; builds a GLocal dual-branch CNN-Mamba architecture that combines local detail with global context; and adopts a multitask optimization framework with dual loss functions to balance local precision and global consistency. Result: Experiments on Sentinel-2 imagery of Alberta, Canada show that the method achieves higher classification accuracy and finer classification results than existing advanced methods. Conclusion: The MSOM model improves land use and land cover classification on Sentinel-2 imagery, handles complex land surface scenes, and suits applications such as environmental monitoring. Abstract: Although Sentinel-2 based land use and land cover (LULC) classification is critical for various environmental monitoring applications, it is a very difficult task due to some key data challenges (e.g., spatial heterogeneity, context information, signature ambiguity). This paper presents a novel Multitask Glocal OBIA-Mamba (MSOM) for enhanced Sentinel-2 classification with the following contributions. First, an object-based image analysis (OBIA) Mamba model (OBIA-Mamba) is designed to reduce redundant computation without compromising fine-grained details by using superpixels as Mamba tokens. Second, a global-local (GLocal) dual-branch convolutional neural network (CNN)-Mamba architecture is designed to jointly model local spatial detail and global contextual information. Third, a multitask optimization framework is designed to employ dual loss functions to balance local precision with global consistency. The proposed approach is tested on Sentinel-2 imagery in Alberta, Canada, in comparison with several advanced classification approaches, and the results demonstrate that the proposed approach achieves higher classification accuracy and finer details than the other state-of-the-art methods.
[171] One Small Step in Latent, One Giant Leap for Pixels: Fast Latent Upscale Adapter for Your Diffusion Models
Aleksandr Razin,Danil Kazantsev,Ilya Makarov
Main category: cs.CV
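TL;DR: Proposes the Latent Upscaler Adapter (LUA), a lightweight module that performs super-resolution directly in latent space, improving the efficiency and quality of high-resolution image generation with diffusion models.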
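The abstract for this entry mentions scale-specific pixel-shuffle heads. Pixel shuffle is a standard rearrangement that trades channels for spatial resolution: a tensor of shape (C·r², H, W) becomes (C, r·H, r·W). The sketch below implements that rearrangement on nested lists, following the channel ordering used by common deep learning libraries; it illustrates the operation only and makes no claim about how LUA's heads are built.

```python
def pixel_shuffle(x, r):
    """Rearrange a (C*r*r, H, W) nested list into (C, H*r, W*r).

    Output pixel (c, y*r + i, z*r + j) is taken from input channel
    c*r*r + i*r + j at position (y, z).
    """
    channels_in = len(x)
    h, w = len(x[0]), len(x[0][0])
    if channels_in % (r * r) != 0:
        raise ValueError("channel count must be divisible by r*r")
    c_out = channels_in // (r * r)
    out = [[[0.0] * (w * r) for _ in range(h * r)] for _ in range(c_out)]
    for c in range(c_out):
        for i in range(r):
            for j in range(r):
                src = x[c * r * r + i * r + j]
                for y in range(h):
                    for z in range(w):
                        out[c][y * r + i][z * r + j] = src[y][z]
    return out

# Four 1x1 channels become one 2x2 channel (upscale factor 2).
latent = [[[1.0]], [[2.0]], [[3.0]], [[4.0]]]
upscaled = pixel_shuffle(latent, 2)
print(upscaled)
```

A 2x head therefore needs 4 output channels per target channel and a 4x head needs 16, which is one reason to keep a separate head for each scale on top of a shared backbone.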
Details
Motivation: Diffusion models scale poorly beyond their training resolution: direct high-resolution sampling is slow and expensive, while post-hoc image super-resolution (ISR) introduces artifacts and latency because it runs after decoding. An efficient, low-latency method for high-resolution generation is therefore needed. Method: Designs a lightweight LUA module that operates on the generator's latent code and performs super-resolution before VAE decoding. It uses a shared Swin-style backbone with scale-specific pixel-shuffle heads, supports 2x and 4x upscaling, and remains compatible with image-space SR methods. Result: LUA matches existing methods in perceptual quality with nearly 3x lower decoding and upscaling time (adding only 0.42 s, versus 1.87 s for pixel-space SR), requires no changes to the base model and no extra diffusion stages, and generalizes well across the latent spaces of different VAEs. Conclusion: LUA offers modern diffusion models a practical and efficient path to high-fidelity, scalable image synthesis, improving the efficiency of high-resolution generation and the ease of deployment. Abstract: Diffusion models struggle to scale beyond their training resolutions, as direct high-resolution sampling is slow and costly, while post-hoc image super-resolution (ISR) introduces artifacts and additional latency by operating after decoding. We present the Latent Upscaler Adapter (LUA), a lightweight module that performs super-resolution directly on the generator's latent code before the final VAE decoding step. LUA integrates as a drop-in component, requiring no modifications to the base model or additional diffusion stages, and enables high-resolution synthesis through a single feed-forward pass in latent space. A shared Swin-style backbone with scale-specific pixel-shuffle heads supports 2x and 4x factors and remains compatible with image-space SR baselines, achieving comparable perceptual quality with nearly 3x lower decoding and upscaling time (adding only +0.42 s for 1024 px generation from 512 px, compared to 1.87 s for pixel-space SR using the same SwinIR architecture). Furthermore, LUA shows strong generalization across the latent spaces of different VAEs, making it easy to deploy without retraining from scratch for each new decoder. Extensive experiments demonstrate that LUA closely matches the fidelity of native high-resolution generation while offering a practical and efficient path to scalable, high-fidelity image synthesis in modern diffusion pipelines.
[172] Depth Anything 3: Recovering the Visual Space from Any Views
Haotong Lin,Sili Chen,Junhao Liew,Donny Y. Chen,Zhenyu Li,Guang Shi,Jiashi Feng,Bingyi Kang
Main category: cs.CV
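TL;DR: Depth Anything 3 (DA3) is a model that predicts spatially consistent geometry from an arbitrary number of visual inputs. Without complex architectural design or multi-task learning, a plain Transformer alone reaches a new SOTA on a range of visual geometry tasks.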
Details
Motivation: To pursue minimal modeling and achieve general, high-accuracy visual geometry prediction with less dependence on specialized architectures and known camera poses. Method: Adopts a teacher-student training paradigm, uses a single plain Transformer as the backbone, and uses a single depth-ray representation as the prediction target. Result: On the newly proposed visual geometry benchmark, DA3 surpasses the prior SOTA by an average of 44.3% in camera pose accuracy and 25.1% in geometric accuracy, and it outperforms DA2 in monocular depth estimation. Conclusion: By simplifying the model architecture and training target, DA3 achieves excellent generalization and accuracy across visual geometry tasks without relying on proprietary data. Abstract: We present Depth Anything 3 (DA3), a model that predicts spatially consistent geometry from an arbitrary number of visual inputs, with or without known camera poses. In pursuit of minimal modeling, DA3 yields two key insights: a single plain transformer (e.g., vanilla DINO encoder) is sufficient as a backbone without architectural specialization, and a singular depth-ray prediction target obviates the need for complex multi-task learning. Through our teacher-student training paradigm, the model achieves a level of detail and generalization on par with Depth Anything 2 (DA2). We establish a new visual geometry benchmark covering camera pose estimation, any-view geometry and visual rendering. On this benchmark, DA3 sets a new state-of-the-art across all tasks, surpassing prior SOTA VGGT by an average of 44.3% in camera pose accuracy and 25.1% in geometric accuracy. Moreover, it outperforms DA2 in monocular depth estimation. All models are trained exclusively on public academic datasets.
[173] Enhancing the Outcome Reward-based RL Training of MLLMs with Self-Consistency Sampling
Jiahao Wang,Weiye Xu,Aijun Yang,Wengang Zhou,Lewei Lu,Houqiang Li,Xiaohua Wang,Jinguo Zhu
Main category: cs.CV
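TL;DR: Proposes Self-Consistency Sampling (SCS), which introduces visual perturbations and repeated truncation-and-resampling and uses a trajectory consistency score to suppress the influence of unreliable reasoning paths, significantly improving multimodal large language models trained with outcome-reward reinforcement learning.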