cs.CL

[1] Omnilingual ASR: Open-Source Multilingual Speech Recognition for 1600+ Languages

Omnilingual ASR team,Gil Keren,Artyom Kozhevnikov,Yen Meng,Christophe Ropers,Matthew Setzler,Skyler Wang,Ife Adebara,Michael Auli,Can Balioglu,Kevin Chan,Chierh Cheng,Joe Chuang,Caley Droof,Mark Duppenthaler,Paul-Ambroise Duquenne,Alexander Erben,Cynthia Gao,Gabriel Mejia Gonzalez,Kehan Lyu,Sagar Miglani,Vineel Pratap,Kaushik Ram Sadagopan,Safiyyah Saleem,Arina Turkatenko,Albert Ventayol-Boada,Zheng-Xin Yong,Yu-An Chung,Jean Maillard,Rashel Moritz,Alexandre Mourachko,Mary Williamson,Shireen Yates

Main category: cs.CL

TL;DR: Omnilingual ASR is the first large-scale speech recognition system designed for extensibility. Using self-supervised pre-training and an LLM-inspired decoder architecture, it covers more than 1,600 languages (over 500 of them never before served by ASR), generalizes zero-shot to unseen languages, substantially improves recognition for low-resource languages, and is released as open source to encourage community participation.

Details

Motivation: Most languages lack automatic speech recognition (ASR) support; existing systems are constrained by their architectures and by data costs, and raise ethical concerns. A scalable, inclusive solution that respects community collaboration is needed.

Method: Scales self-supervised pre-training to 7B parameters and pairs it with an encoder-decoder architecture that uses an LLM-inspired decoder; training on a diverse corpus of public resources and community-sourced recordings enables zero-shot generalization to unseen languages.

Result: The system covers more than 1,600 languages, over 500 of them supported by ASR for the first time; it substantially outperforms prior systems in low-resource conditions, generalizes strongly, and is released as a family of open-source models from 300M to 7B parameters.

Conclusion: Omnilingual ASR makes ASR extensible to long-tail languages. Its open models and tools lower the barrier to research and application, and the work underlines the importance of community collaboration and ethical design, with potential for broad societal impact.

Abstract: Automatic speech recognition (ASR) has advanced in high-resource languages, but most of the world's 7,000+ languages remain unsupported, leaving thousands of long-tail languages behind. Expanding ASR coverage has been costly and limited by architectures that restrict language support, making extension inaccessible to most--all while entangled with ethical concerns when pursued without community collaboration. To transcend these limitations, we introduce Omnilingual ASR, the first large-scale ASR system designed for extensibility. Omnilingual ASR enables communities to introduce unserved languages with only a handful of data samples. It scales self-supervised pre-training to 7B parameters to learn robust speech representations and introduces an encoder-decoder architecture designed for zero-shot generalization, leveraging an LLM-inspired decoder. This capability is grounded in a massive and diverse training corpus; by combining breadth of coverage with linguistic variety, the model learns representations robust enough to adapt to unseen languages. Incorporating public resources with community-sourced recordings gathered through compensated local partnerships, Omnilingual ASR expands coverage to over 1,600 languages, the largest such effort to date--including over 500 never before served by ASR. Automatic evaluations show substantial gains over prior systems, especially in low-resource conditions, and strong generalization. We release Omnilingual ASR as a family of models, from 300M variants for low-power devices to 7B for maximum accuracy. We reflect on the ethical considerations shaping this design and conclude by discussing its societal impact. In particular, we highlight how open-sourcing models and tools can lower barriers for researchers and communities, inviting new forms of participation. Open-source artifacts are available at https://github.com/facebookresearch/omnilingual-asr.

[2] Order Matters: Rethinking Prompt Construction in In-Context Learning

Warren Li,Yiqian Wang,Zihan Wang,Jingbo Shang

Main category: cs.CL

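TL;DR: This paper revisits how example selection and example ordering affect in-context learning (ICL) performance. It finds that the effect of ordering is comparable to that of using entirely different example sets, and that strong orderings can be found with only a development set, which calls for a reexamination of conventional assumptions in ICL.

Details

Motivation: Prior work generally assumes that which examples are chosen matters more than their order, so research has concentrated on example selection. This paper questions that assumption and systematically compares the effects of selection and ordering.

Method: Controlled experiments on classification and generation tasks, using multiple open-source model families (0.5B to 27B parameters) and GPT-5, compare the effect of different example selections and orderings on performance, and use a development set to search for strong orderings.

Result: The performance variance caused by different orderings is comparable to that from using different example sets. Orderings chosen on a development set come close to an oracle that picks the best ordering using test labels.

Conclusion: Example selection and ordering are equally important and intertwined in prompt design, and existing assumptions in ICL need to be reexamined.

Abstract: In-context learning (ICL) enables large language models to perform new tasks by conditioning on a sequence of examples. Most prior work reasonably and intuitively assumes that which examples are chosen has a far greater effect on performance than how those examples are ordered, leading to a focus on example selection. We revisit this assumption and conduct a systematic comparison between the effect of selection and ordering. Through controlled experiments on both classification and generation tasks, using multiple open-source model families (0.5B to 27B parameters) and GPT-5, we find that the variance in performance due to different example orderings is comparable to that from using entirely different example sets. Furthermore, we show that strong orderings can be identified using only a development set, achieving performance close to an oracle that selects the best ordering based on test labels. Our findings highlight the equal and intertwined importance of example selection and ordering in prompt design, calling for a reexamination of the assumptions held in ICL.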
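The idea of choosing an ordering with only a development set can be sketched as follows. This is a minimal illustration, not the authors' procedure: the prompt template, the candidate-sampling scheme, and every function name here are assumptions, and `recency_biased_model` is a toy stand-in for a real LLM call.

```python
import math
import random
from typing import Callable, List, Sequence, Tuple

Example = Tuple[str, str]  # (input text, gold label)


def build_prompt(demos: Sequence[Example], query: str) -> str:
    """Format the demonstrations in the given order, then the unanswered query."""
    blocks = [f"Input: {x}\nLabel: {y}" for x, y in demos]
    blocks.append(f"Input: {query}\nLabel:")
    return "\n\n".join(blocks)


def candidate_orderings(demos: Sequence[Example], n_candidates: int,
                        seed: int = 0) -> List[List[Example]]:
    """Sample up to n_candidates distinct orderings of the same example set."""
    rng = random.Random(seed)
    limit = min(n_candidates, math.factorial(len(demos)))
    idx = list(range(len(demos)))
    seen, out = set(), []
    while len(out) < limit:
        rng.shuffle(idx)
        key = tuple(idx)
        if key not in seen:
            seen.add(key)
            out.append([demos[i] for i in key])
    return out


def dev_accuracy(ordering: Sequence[Example], dev_set: Sequence[Example],
                 predict: Callable[[str], str]) -> float:
    """Accuracy of one ordering on the development set."""
    hits = sum(predict(build_prompt(ordering, x)) == y for x, y in dev_set)
    return hits / len(dev_set)


def pick_ordering(demos: Sequence[Example], dev_set: Sequence[Example],
                  predict: Callable[[str], str], n_candidates: int = 24,
                  seed: int = 0) -> List[Example]:
    """Return the sampled ordering with the highest development-set accuracy."""
    cands = candidate_orderings(demos, n_candidates, seed)
    return max(cands, key=lambda o: dev_accuracy(o, dev_set, predict))


def recency_biased_model(prompt: str) -> str:
    """Toy model (hypothetical): copies the label of the last demonstration."""
    labels = [ln[len("Label:"):].strip()
              for ln in prompt.splitlines() if ln.startswith("Label:")]
    return labels[-2]  # labels[-1] is the empty slot for the query


demos = [("great film", "pos"), ("dull plot", "neg")]
dev_set = [("boring", "neg"), ("tedious", "neg")]
best = pick_ordering(demos, dev_set, recency_biased_model)
```

With this toy model the selected ordering places the "neg" demonstration last, which shows how a fixed example set can score very differently depending on order alone.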

[3] Contextual morphologically-guided tokenization for Latin encoder models

Marisa Hudspeth,Patrick J. Burns,Brendan O'Connor

Main category: cs.CL

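TL;DR: This paper studies morphologically-aware tokenization for Latin and finds that it improves performance on several downstream tasks, especially on out-of-domain texts, showing that linguistic resources can effectively improve language modeling for morphologically complex languages.

Details

Motivation: Standard tokenization methods usually prioritize information-theoretic goals such as high compression and low fertility over linguistic goals such as morphological alignment, which can give suboptimal results for morphologically rich languages. The paper therefore explores tokenization better suited to such languages.

Method: Proposes a morphologically-guided tokenization strategy that draws on Latin's rich curated lexical resources, and evaluates it on four downstream tasks.

Result: Morphologically-guided tokenization improves overall performance on the four downstream tasks, with the largest gains on out-of-domain texts, indicating better generalization.

Conclusion: Morphologically-aware tokenization built on linguistic resources is an effective way to improve language modeling for morphologically complex languages, and a feasible alternative for low-resource languages that lack large-scale pretraining data.

Abstract: Tokenization is a critical component of language model pretraining, yet standard tokenization methods often prioritize information-theoretical goals like high compression and low fertility rather than linguistic goals like morphological alignment. In fact, they have been shown to be suboptimal for morphologically rich languages, where tokenization quality directly impacts downstream performance. In this work, we investigate morphologically-aware tokenization for Latin, a morphologically rich language that is medium-resource in terms of pretraining data, but high-resource in terms of curated lexical resources -- a distinction that is often overlooked but critical in discussions of low-resource language modeling. We find that morphologically-guided tokenization improves overall performance on four downstream tasks. Performance gains are most pronounced for out-of-domain texts, highlighting our models' improved generalization ability. Our findings demonstrate the utility of linguistic resources to improve language modeling for morphologically complex languages. For low-resource languages that lack large-scale pretraining data, the development and incorporation of linguistic resources can serve as a feasible alternative to improve LM performance.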
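The two kinds of goal the abstract contrasts can be made concrete with a small sketch. Fertility is computed here as the average number of subword tokens per word, and morphological alignment is approximated by the share of gold morpheme boundaries that a tokenizer reproduces. The boundary-recall measure, the function names, and the Latin segmentations are illustrative assumptions, not the paper's metrics or data.

```python
from typing import Sequence, Set


def fertility(tokenized_words: Sequence[Sequence[str]]) -> float:
    """Average number of subword tokens per word (lower means more compression)."""
    return sum(len(t) for t in tokenized_words) / len(tokenized_words)


def boundaries(pieces: Sequence[str]) -> Set[int]:
    """Character offsets of the internal split points of a segmented word."""
    cuts, pos = set(), 0
    for piece in pieces[:-1]:
        pos += len(piece)
        cuts.add(pos)
    return cuts


def boundary_recall(gold: Sequence[str], predicted: Sequence[str]) -> float:
    """Share of gold morpheme boundaries that the tokenizer also produces."""
    gold_cuts = boundaries(gold)
    if not gold_cuts:
        return 1.0  # an unsegmented gold word has no boundary to miss
    return len(gold_cuts & boundaries(predicted)) / len(gold_cuts)


# Illustrative segmentations of the Latin form "amabantur" (hypothetical data).
gold_morphs = ["ama", "ba", "nt", "ur"]
frequency_based = ["amab", "ant", "ur"]   # splits ignore morpheme edges
morph_guided = ["ama", "ba", "ntur"]      # splits follow morpheme edges

recall_frequency = boundary_recall(gold_morphs, frequency_based)
recall_guided = boundary_recall(gold_morphs, morph_guided)
```

Both candidate tokenizations have the same fertility for this word (three tokens), yet the morphologically-guided one recovers more gold boundaries, which is the kind of difference a compression-only objective cannot see.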

[4] Assessing the Applicability of Natural Language Processing to Traditional Social Science Methodology: A Case Study in Identifying Strategic Signaling Patterns in Presidential Directives

C. LeMay,A. Lane,J. Seales,M. Winstead,S. Baty

Main category: cs.CL

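TL;DR: This study examines the use of natural language processing (NLP) to extract main topics from a corpus of Presidential Directives. Comparing NLP output with human labels shows potential but also discrepancies, so the validity of the approach needs further assessment.

Details

Motivation: To explore how applicable NLP is to large-scale text analysis, in particular the need in social science research to identify policy signaling themes automatically.

Method: Applies NLP techniques to extract topics from Presidential Directives issued from the Reagan through Clinton administrations, and compares the output with human-labeled results.

Result: Both NLP and human analysts identified relevant documents, but their results differ, indicating that the NLP tool used still has limitations in this use case.

Conclusion: NLP shows potential for handling large text corpora, but needs further improvement and validation before it is used for fine-grained social science research. Given how quickly AI is evolving, the validity of existing tools should be reassessed.

Abstract: Our research investigates how Natural Language Processing (NLP) can be used to extract main topics from a larger corpus of written data, as applied to the case of identifying signaling themes in Presidential Directives (PDs) from the Reagan through Clinton administrations. Analysts and NLP both identified relevant documents, demonstrating the potential utility of NLP in research involving large written corpora. However, we also identified discrepancies between NLP and human-labeled results that indicate a need for more research to assess the validity of NLP in this use case. The research was conducted in 2023, and the rapidly evolving landscape of AI/ML means existing tools have improved and new tools have been developed; this research displays the inherent capabilities of a potentially dated AI tool in emerging social science applications.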

[5] How Small Can You Go? Compact Language Models for On-Device Critical Error Detection in Machine Translation

Muskaan Chopra,Lorenz Sparrenberg,Sarthak Khanna,Rafet Sifa

Main category: cs.CL

TL;DR: Small language models (around 1B parameters), combined with lightweight calibration and small-sample supervision, can detect critical errors in machine translation efficiently on device, balancing accuracy and compute cost.

Details

Motivation: Large language models evaluate machine translation well, but their scale and cost make them hard to deploy on edge devices and in privacy-sensitive settings. The study asks how small a model can be while still retaining detection ability.

Method: Focuses on English-to-German critical error detection (CED) and benchmarks several sub-2B-parameter models on WMT21, WMT22, and SynCED-EnDe-2025. The framework uses standardized prompts, lightweight logit-bias calibration, and majority voting, and reports both semantic quality and compute cost.

Result: Models around 1B parameters give the best quality-efficiency trade-off. Gemma-3-1B reaches MCC=0.77 and F1-ERR=0.98 on SynCED-EnDe-2025 after fine-tuning, with 400 ms single-sample latency on a MacBook Pro M4 Pro. Qwen-3-1.7B scores higher (MCC +0.11) at higher compute cost. 0.6B models remain usable but under-detect entity and number errors.

Conclusion: Compact instruction-tuned language models with lightweight calibration and small-sample supervision can deliver reliable, private, on-device translation error detection, suited to low-cost, low-latency screening in real translation pipelines.

Abstract: Large Language Models (LLMs) excel at evaluating machine translation (MT), but their scale and cost hinder deployment on edge devices and in privacy-sensitive workflows. We ask: how small can you get while still detecting meaning-altering translation errors? Focusing on English->German Critical Error Detection (CED), we benchmark sub-2B models (LFM2-350M, Qwen-3-0.6B/1.7B, Llama-3.2-1B-Instruct, Gemma-3-1B) across WMT21, WMT22, and SynCED-EnDe-2025. Our framework standardizes prompts, applies lightweight logit-bias calibration and majority voting, and reports both semantic quality (MCC, F1-ERR/F1-NOT) and compute metrics (VRAM, latency, throughput). Results reveal a clear sweet spot around one billion parameters: Gemma-3-1B provides the best quality-efficiency trade-off, reaching MCC=0.77 with F1-ERR=0.98 on SynCED-EnDe-2025 after merged-weights fine-tuning, while maintaining 400 ms single-sample latency on a MacBook Pro M4 Pro (24 GB). At larger scale, Qwen-3-1.7B attains the highest absolute MCC (+0.11 over Gemma) but with higher compute cost. In contrast, ultra-small models (0.6B) remain usable with few-shot calibration yet under-detect entity and number errors. Overall, compact, instruction-tuned LLMs augmented with lightweight calibration and small-sample supervision can deliver trustworthy, on-device CED for MT, enabling private, low-cost error screening in real-world translation pipelines. All datasets, prompts, and scripts are publicly available at our GitHub repository.
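The evaluation pieces named in the abstract (logit-bias calibration, majority voting, MCC) can be sketched in a few lines. The MCC formula is the standard one; everything else is an assumption made for illustration: the ERR/NOT label strings, the tie-breaking rule, the idea of calibrating a single additive bias on the ERR-minus-NOT logit gap by grid search, and the toy numbers.

```python
import math
from collections import Counter
from typing import List, Sequence


def majority_vote(votes: Sequence[str]) -> str:
    """Combine several sampled verdicts; ties fall back to NOT (an assumption)."""
    counts = Counter(votes)
    return "ERR" if counts["ERR"] > counts["NOT"] else "NOT"


def mcc(gold: Sequence[str], pred: Sequence[str], positive: str = "ERR") -> float:
    """Matthews correlation coefficient for binary labels."""
    pairs = list(zip(gold, pred))
    tp = sum(g == positive and p == positive for g, p in pairs)
    tn = sum(g != positive and p != positive for g, p in pairs)
    fp = sum(g != positive and p == positive for g, p in pairs)
    fn = sum(g == positive and p != positive for g, p in pairs)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


def apply_bias(gaps: Sequence[float], bias: float) -> List[str]:
    """gap = logit(ERR) - logit(NOT); a positive bias shifts decisions toward ERR."""
    return ["ERR" if g + bias > 0 else "NOT" for g in gaps]


def calibrate_bias(gaps: Sequence[float], gold: Sequence[str],
                   grid: Sequence[float]) -> float:
    """Pick the bias from a small grid that maximizes MCC on a calibration set."""
    return max(grid, key=lambda b: mcc(gold, apply_bias(gaps, b)))


# Toy calibration set (hypothetical): the model under-predicts ERR.
gold = ["ERR", "ERR", "NOT", "NOT"]
gaps = [-0.5, -0.2, -1.0, -2.0]
bias = calibrate_bias(gaps, gold, grid=[0.0, 0.3, 0.6, 0.9])
calibrated = apply_bias(gaps, bias)
```

MCC is a natural headline metric here because critical errors are rare: a model that always answers NOT gets high accuracy but an MCC of zero, which is exactly what the uncalibrated toy model above produces.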

[6] Predicate-Argument Structure Divergences in Chinese and English Parallel Sentences and their Impact on Language Transfer

Rocco Tripodi,Xiaoyu Liu

Main category: cs.CL

TL;DR: This paper analyzes predicate-argument structures in parallel Chinese and English sentences, examines alignment and misalignment in cross-lingual transfer, and shows that language transfer is asymmetric.

Details

Motivation: Linguistic divergences, especially between typologically distant languages, make cross-lingual knowledge transfer difficult, so structural divergences need systematic analysis to improve NLP in low-resource settings.

Method: Runs an annotation projection experiment in which Chinese and English each serve in turn as the source language for projecting predicate annotations onto the parallel sentences, analyzes the results qualitatively and quantitatively, and proposes a categorization of structural divergences.

Result: Language transfer is clearly asymmetric: the choice of source language strongly affects projection quality, and the patterns of structural misalignment differ systematically between the languages.

Conclusion: Cross-lingual NLP research must take transfer asymmetry into account. Choosing a suitable source language is crucial for transfer quality, and structural differences between languages should be analyzed before scientific claims are made.

Abstract: Cross-lingual Natural Language Processing (NLP) has gained significant traction in recent years, offering practical solutions in low-resource settings by transferring linguistic knowledge from resource-rich to low-resource languages. This field leverages techniques like annotation projection and model transfer for language adaptation, supported by multilingual pre-trained language models. However, linguistic divergences hinder language transfer, especially among typologically distant languages. In this paper, we present an analysis of predicate-argument structures in parallel Chinese and English sentences. We explore the alignment and misalignment of predicate annotations, inspecting similarities and differences and proposing a categorization of structural divergences. The analysis and the categorization are supported by a qualitative and quantitative analysis of the results of an annotation projection experiment, in which, in turn, one of the two languages has been used as source language to project annotations into the corresponding parallel sentences. The results of this analysis show clearly that language transfer is asymmetric, an aspect that requires attention when selecting the source language in transfer learning applications and that needs to be investigated before any scientific claim about cross-lingual NLP is made.

[7] TARG: Training-Free Adaptive Retrieval Gating for Efficient RAG

Yufeng Wang,Lu wei,Haibin Ling

Main category: cs.CL

TL;DR: Proposes TARG, a training-free adaptive retrieval gate that decides whether to retrieve from an uncertainty score computed on the base model's short-prefix logits, cutting retrieval by 70-90% and reducing latency while matching or improving accuracy.

Details

Motivation: Conventional RAG retrieves for every query, which is inefficient and adds latency and cost, while existing methods mostly require extra training or complex designs and lack generality and practicality.

Method: From the prefix logits of a short, no-context draft generated by the base model, computes a lightweight uncertainty score (mean token entropy, a margin signal from the top-1/top-2 logit gap, or small-N prefix variance) and triggers retrieval only when the score exceeds a threshold. The method needs no training, is model agnostic, and adds only a small number of draft tokens.

Result: On NQ-Open, TriviaQA, and PopQA, TARG reduces retrieval by 70-90% relative to Always-RAG and lowers end-to-end latency while matching or improving EM/F1. Its overhead stays close to Never-RAG, and the margin signal proves to be a robust default.

Conclusion: TARG is an efficient, plug-and-play retrieval gate that improves the accuracy-efficiency trade-off of RAG systems and suits practical deployment with modern instruction-tuned LLMs.

Abstract: Retrieval-Augmented Generation (RAG) improves factuality, but retrieving for every query often hurts quality while inflating tokens and latency. We propose Training-free Adaptive Retrieval Gating (TARG), a single-shot policy that decides when to retrieve using only a short, no-context draft from the base model. From the draft's prefix logits, TARG computes lightweight uncertainty scores: mean token entropy, a margin signal derived from the top-1/top-2 logit gap via a monotone link, or small-N variance across a handful of stochastic prefixes, and triggers retrieval only when the score exceeds a threshold. The gate is model agnostic, adds only tens to hundreds of draft tokens, and requires no additional training or auxiliary heads. On NQ-Open, TriviaQA, and PopQA, TARG consistently shifts the accuracy-efficiency frontier: compared with Always-RAG, TARG matches or improves EM/F1 while reducing retrieval by 70-90% and cutting end-to-end latency, and it remains close to Never-RAG in overhead. A central empirical finding is that under modern instruction-tuned LLMs the margin signal is a robust default (entropy compresses as backbones sharpen), with small-N variance offering a conservative, budget-first alternative. We provide ablations over gate type and prefix length and use a delta-latency view to make budget trade-offs explicit.
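A retrieval gate of the kind described can be sketched from per-token logits alone. The entropy score below is the standard Shannon entropy of the softmax distribution. The margin score is an assumption: the abstract says only that the top-1/top-2 logit gap passes through "a monotone link", so `exp(-gap)` is one illustrative choice, and the function names, threshold, and toy logits are hypothetical.

```python
import math
from typing import Callable, List, Sequence

Logits = Sequence[float]  # unnormalized scores over the vocabulary for one token


def softmax(logits: Logits) -> List[float]:
    """Numerically stable softmax."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def mean_token_entropy(prefix_logits: Sequence[Logits]) -> float:
    """Average Shannon entropy (in nats) over the draft prefix."""
    entropies = []
    for logits in prefix_logits:
        probs = softmax(logits)
        entropies.append(-sum(p * math.log(p) for p in probs if p > 0))
    return sum(entropies) / len(entropies)


def margin_uncertainty(prefix_logits: Sequence[Logits]) -> float:
    """Mean of exp(-(top1 - top2)): near 1 when the top two tokens are tied,
    near 0 when the model is confident. The link function is an assumption."""
    scores = []
    for logits in prefix_logits:
        top1, top2 = sorted(logits, reverse=True)[:2]
        scores.append(math.exp(-(top1 - top2)))
    return sum(scores) / len(scores)


def should_retrieve(prefix_logits: Sequence[Logits], threshold: float,
                    score_fn: Callable[[Sequence[Logits]], float] = margin_uncertainty
                    ) -> bool:
    """Trigger retrieval only when the uncertainty score exceeds the threshold."""
    return score_fn(prefix_logits) > threshold


# Toy draft prefixes over a three-token vocabulary (hypothetical numbers).
confident = [[10.0, 0.0, 0.0], [9.0, 0.0, 0.0]]
uncertain = [[1.0, 0.9, 0.0], [0.5, 0.5, 0.0]]

gate_confident = should_retrieve(confident, threshold=0.5)
gate_uncertain = should_retrieve(uncertain, threshold=0.5)
```

In practice the threshold would be tuned on held-out queries to hit a target retrieval budget; the gate itself adds no trainable parameters, which is what makes the approach training-free.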

[8] Khmer Spellchecking: A Holistic Approach

Marry Kong,Rina Buoy,Sovisal Chenda,Nguonly Taing

Main category: cs.CL

TL;DR: Proposes a holistic approach to Khmer spellchecking that combines subword segmentation, named-entity recognition, grapheme-to-phoneme conversion, and a language model, achieving accuracy of up to 94.4%.

Details

Motivation: Khmer spellchecking faces mismatches between the lexicon and the word segmentation model, multiple written forms of the same word, loosely formed compound words, and proper nouns wrongly flagged as misspellings. Existing methods do not address these adequately.

Method: Integrates Khmer subword segmentation, named-entity recognition (NER), grapheme-to-phoneme (G2P) conversion, and a language model to identify correction candidates and rank the most suitable one.

Result: The approach reaches up to 94.4% accuracy on Khmer spellchecking, better than existing solutions. Benchmark datasets for the spellchecking and NER tasks will be made publicly available.

Conclusion: The holistic approach markedly improves Khmer spellchecking and offers a framework that other low-resource languages can draw on.

Abstract: Compared to English and other high-resource languages, spellchecking for Khmer remains an unresolved problem due to several challenges. First, there are misalignments between words in the lexicon and the word segmentation model. Second, a Khmer word can be written in different forms. Third, Khmer compound words are often loosely and easily formed, and these compound words are not always found in the lexicon. Fourth, some proper nouns may be flagged as misspellings due to the absence of a Khmer named-entity recognition (NER) model. Unfortunately, existing solutions do not adequately address these challenges. This paper proposes a holistic approach to the Khmer spellchecking problem by integrating Khmer subword segmentation, Khmer NER, Khmer grapheme-to-phoneme (G2P) conversion, and a Khmer language model to tackle these challenges, identify potential correction candidates, and rank the most suitable candidate. Experimental results show that the proposed approach achieves a state-of-the-art Khmer spellchecking accuracy of up to 94.4%, compared to existing solutions. The benchmark datasets for Khmer spellchecking and NER tasks in this study will be made publicly available.

[9] Improving Graduate Outcomes by Identifying Skills Gaps and Recommending Courses Based on Career Interests

Rahul Soni,Basem Suleiman,Sonit Singh

Main category: cs.CL

TL;DR: Proposes a course recommendation system that combines data mining, machine learning, and user preferences to close the gap between university education and industry needs, helping students choose courses on an informed basis and supporting their career development.

Details

Motivation: Students currently lack effective guidance for choosing courses that align with industry trends and career goals, leaving education disconnected from employment needs.

Method: Builds a recommendation model using data mining, collaborative filtering, and machine learning algorithms together with user preferences, academic criteria, and career goals, and develops a front-end interface through iterative design that emphasizes usability and interaction.

Result: The system provides personalized, industry-oriented course recommendations and is refined through user feedback to improve recommendation accuracy and user experience.

Conclusion: The system helps students make data-driven course decisions, supports lifelong learning and career development, and can improve graduate employability.

Abstract: This paper aims to address the challenge of selecting relevant courses for students by proposing the design and development of a course recommendation system. The course recommendation system utilises a combination of data analytics techniques and machine learning algorithms to recommend courses that align with current industry trends and requirements. In order to provide customised suggestions, the study entails the design and implementation of an extensive algorithmic framework that combines machine learning methods, user preferences, and academic criteria. The system employs data mining and collaborative filtering techniques to examine past courses and individual career goals in order to provide course recommendations. Moreover, to improve the accessibility and usefulness of the recommendation system, special attention is given to the development of an easy-to-use front-end interface. The front-end design prioritises visual clarity, interaction, and simplicity through iterative prototyping and user input revisions, guaranteeing a smooth and captivating user experience. We refined and optimised the proposed system by incorporating user feedback, ensuring that it effectively meets the needs and preferences of its target users. The proposed course recommendation system could be a useful tool for students, instructors, and career advisers to use in promoting lifelong learning and professional progression as it fills the gap between university learning and industry expectations. We hope that the proposed course recommendation system will help university students in making data-driven and industry-informed course decisions, in turn improving graduate outcomes for the university sector.

[10] Answering Students' Questions on Course Forums Using Multiple Chain-of-Thought Reasoning and Finetuning RAG-Enabled LLM

Neo Wang,Sonit Singh

Main category: cs.CL

TL;DR: Proposes a course question answering system built on an open-source large language model and retrieval-augmented generation (RAG), combining a local knowledge base with multi chain-of-thought reasoning to cope with the growing number of student questions and with model hallucination. It performs well on the HotpotQA dataset.

Details

Motivation: As course enrollment grows, the number of forum questions surges and instructors cannot respond promptly to repetitive questions, so an automated question answering solution is needed.

Method: Fine-tunes an open-source LLM, builds a local knowledge base from the course content, retrieves relevant information with RAG, and adds multi chain-of-thought reasoning to reduce hallucination.

Result: Experiments on the HotpotQA dataset show that the fine-tuned LLM with RAG performs strongly on question answering.

Conclusion: The approach can ease the question-answering load in large course forums and has practical potential for automated question answering in educational settings.

Abstract: Course forums are increasingly important and play a vital role in facilitating student discussion and answering course-related questions. They provide a platform for students to post questions about course content and administrative issues. However, growing enrollment creates several challenges. The primary one is that students' queries cannot be answered immediately and instructors face many repetitive questions. To mitigate these issues, we propose a question answering system based on a large language model (LLM) with retrieval-augmented generation (RAG). This work focuses on designing a question answering system with an open-source LLM and fine-tuning it on the relevant course dataset. To further improve performance, we use a local knowledge base that contains all the course content and apply RAG to retrieve documents relevant to students' queries. To mitigate LLM hallucination, we also integrate multi chain-of-thought reasoning. In this work, we evaluate the fine-tuned LLM with RAG on the HotpotQA dataset. The experimental results demonstrate that the fine-tuned LLM with RAG performs strongly on the question answering task.

Yidan Sun,Mengying Zhu,Feiyue Chen,Yangyang Wu,Xiaolei Dan,Mengyuan Yang,Xiaolin Zheng,Shenglin Ben

Main category: cs.CL

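TL;DR: Proposes TermGPT, a multi-level contrastive fine-tuning framework for terminology adaptation. It builds a sentence graph to generate semantically consistent yet discriminative positive and negative samples and applies contrastive learning at both the sentence and token levels, improving the representation of financial and legal terminology.

Details

Motivation: LLM embedding spaces suffer from the isotropy problem, which leads to poor discrimination of domain-specific terminology (especially legal and financial) and hurts downstream tasks.

Method: Builds a sentence graph to capture semantic and structural relations and generates positive and negative samples from contextual and topological cues; designs multi-level contrastive learning at the sentence and token levels to strengthen global contextual understanding and fine-grained terminology discrimination; and constructs the first financial terminology dataset derived from official regulatory documents.

Result: Experiments show that TermGPT outperforms existing baselines on term discrimination tasks in the finance and legal domains.

Conclusion: TermGPT mitigates the isotropy problem of LLM embedding spaces and substantially improves the representation quality of domain terminology, making it suitable for downstream tasks that need fine semantic distinctions.

Abstract: Large language models (LLMs) have demonstrated impressive performance in text generation tasks; however, their embedding spaces often suffer from the isotropy problem, resulting in poor discrimination of domain-specific terminology, particularly in legal and financial contexts. This weakness in terminology-level representation can severely hinder downstream tasks such as legal judgment prediction or financial risk analysis, where subtle semantic distinctions are critical. To address this problem, we propose TermGPT, a multi-level contrastive fine-tuning framework designed for terminology adaptation. We first construct a sentence graph to capture semantic and structural relations, and generate semantically consistent yet discriminative positive and negative samples based on contextual and topological cues. We then devise a multi-level contrastive learning approach at both the sentence and token levels, enhancing global contextual understanding and fine-grained terminology discrimination. To support robust evaluation, we construct the first financial terminology dataset derived from official regulatory documents. Experiments show that TermGPT outperforms existing baselines in term discrimination tasks within the finance and legal domains.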

[12] In-Token Rationality Optimization: Towards Accurate and Concise LLM Reasoning via Self-Feedback

Mingye Zhu,Yi Liu,Zheren Fu,Quan Wang,Yongdong Zhang

Main category: cs.CL

TL;DR: Proposes InTRO, a framework that uses token-level exploration and self-feedback to make LLM chain-of-thought reasoning more accurate and concise. It clearly outperforms baselines on math reasoning tasks and generalizes across domains.

Details

Motivation: Supervised fine-tuning relies on a single "golden" rationale, which limits generalization, while reinforcement learning struggles with credit assignment and high computational cost.

Method: InTRO estimates token-wise importance weights (correction factors) from the information discrepancy between the generative policy and its answer-conditioned counterpart, enabling token-level exploration and self-feedback within a single forward pass.

Result: On six math-reasoning benchmarks, InTRO raises solution accuracy by up to 20% relative to the base model, produces more concise chains of thought, and transfers to other domains.

Conclusion: InTRO addresses the limits of existing chain-of-thought training in generalization, accuracy, and efficiency, offering a scalable and efficient training paradigm for LLM reasoning.

Abstract: Training Large Language Models (LLMs) for chain-of-thought reasoning presents a significant challenge: supervised fine-tuning on a single "golden" rationale hurts generalization as it penalizes equally valid alternatives, whereas reinforcement learning with verifiable rewards struggles with credit assignment and prohibitive computational cost. To tackle these limitations, we introduce InTRO (In-Token Rationality Optimization), a new framework that enables both token-level exploration and self-feedback for accurate and concise reasoning. Instead of directly optimizing an intractable objective over all valid reasoning paths, InTRO leverages correction factors, token-wise importance weights estimated by the information discrepancy between the generative policy and its answer-conditioned counterpart, for informative next-token selection. This approach allows the model to perform token-level exploration and receive self-generated feedback within a single forward pass, ultimately encouraging accurate and concise rationales. Across six math-reasoning benchmarks, InTRO consistently outperforms other baselines, raising solution accuracy by up to 20% relative to the base model. Its chains of thought are also notably more concise, exhibiting reduced verbosity. Beyond this, InTRO enables cross-domain transfer, successfully adapting to out-of-domain reasoning tasks that extend beyond the realm of mathematics, demonstrating robust generalization.

[13] HierRouter: Coordinated Routing of Specialized Large Language Models via Reinforcement Learning

Nikunj Gupta,Bill Guo,Rajgopal Kannan,Viktor K. Prasanna

Main category: cs.CL

TL;DR: Proposes HierRouter, a hierarchical routing method that uses reinforcement learning to select dynamically among several lightweight models during inference, markedly improving quality while keeping cost under control.

Details

Motivation: LLMs perform well but have high compute and memory costs, making them hard to deploy in resource-constrained or real-time settings, so more efficient inference is needed.

Method: Formulates hierarchical routing as a finite-horizon Markov Decision Process (MDP) and trains a PPO reinforcement learning agent that, conditioned on the evolving context and accumulated cost, selects which model to invoke at each stage of multi-hop inference.

Result: Experiments with three open-source LLMs on six benchmarks (including QA, code generation, and mathematical reasoning) show that HierRouter improves response quality by up to 2.4x over using individual models alone, with only minimal additional inference cost on average.

Conclusion: HierRouter shows the promise of hierarchical routing for efficient, high-performance LLM inference in resource-constrained settings.

Abstract: Large Language Models (LLMs) deliver state-of-the-art performance across many tasks but impose high computational and memory costs, limiting their deployment in resource-constrained or real-time settings. To address this, we propose HierRouter, a hierarchical routing approach that dynamically assembles inference pipelines from a pool of specialized, lightweight language models. Formulated as a finite-horizon Markov Decision Process (MDP), our approach trains a Proximal Policy Optimization (PPO)-based reinforcement learning agent to iteratively select which models to invoke at each stage of multi-hop inference. The agent conditions on the evolving context and accumulated cost to make context-aware routing decisions. Experiments with three open-source candidate LLMs across six benchmarks, including QA, code generation, and mathematical reasoning, show that HierRouter improves response quality by up to 2.4x compared to using individual models independently, while incurring only a minimal additional inference cost on average. These results highlight the promise of hierarchical routing for cost-efficient, high-performance LLM inference. All code is available at https://github.com/Nikunj-Gupta/hierouter.

[14] EnchTable: Unified Safety Alignment Transfer in Fine-tuned Large Language Models

Jialin Wu,Kecen Li,Zhicong Huang,Xinfeng Li,Xiaofeng Wang,Cheng Hong

Main category: cs.CL

TL;DR: EnchTable is a framework that uses NTK-based safety vector distillation and interference-aware merging to transfer and preserve safety alignment in downstream LLMs without extensive retraining. It is compatible with multiple architectures and achieves a better balance of safety and utility than existing methods across several task domains.

Details

Motivation: Fine-tuning LLMs often degrades safety alignment and raises the risk of harmful outputs, so a way to preserve safety without frequent retraining is needed.

Method: Proposes EnchTable, which uses Neural Tangent Kernel (NTK)-based safety vector distillation to decouple safety constraints from task-specific reasoning, and an interference-aware merging technique to balance safety and utility.

Result: Evaluated on three task domains, three LLM architectures, and eleven datasets, EnchTable resists static and dynamic jailbreaking attacks, is safer than vendor-released safety models, and beats six parameter modification methods and two inference-time alignment baselines on unsafe rate, utility score, and generality.

Conclusion: EnchTable preserves safety alignment in fine-tuned models, generalizes across architectures and domains, and integrates into deployment pipelines without significant overhead.

Abstract: Many machine learning models are fine-tuned from large language models (LLMs) to achieve high performance in specialized domains like code generation, biomedical analysis, and mathematical problem solving. However, this fine-tuning process often introduces a critical vulnerability: the systematic degradation of safety alignment, undermining ethical guidelines and increasing the risk of harmful outputs. Addressing this challenge, we introduce EnchTable, a novel framework designed to transfer and maintain safety alignment in downstream LLMs without requiring extensive retraining. EnchTable leverages a Neural Tangent Kernel (NTK)-based safety vector distillation method to decouple safety constraints from task-specific reasoning, ensuring compatibility across diverse model architectures and sizes. Additionally, our interference-aware merging technique effectively balances safety and utility, minimizing performance compromises across various task domains. We implemented a fully functional prototype of EnchTable on three different task domains and three distinct LLM architectures, and evaluated its performance through extensive experiments on eleven diverse datasets, assessing both utility and model safety. Our evaluations include LLMs from different vendors, demonstrating EnchTable's generalization capability. Furthermore, EnchTable exhibits robust resistance to static and dynamic jailbreaking attacks, outperforming vendor-released safety models in mitigating adversarial prompts. Comparative analyses with six parameter modification methods and two inference-time alignment baselines reveal that EnchTable achieves a significantly lower unsafe rate, higher utility score, and universal applicability across different task domains. Additionally, we validate that EnchTable can be seamlessly integrated into various deployment pipelines without significant overhead.

[15] HI-TransPA: Hearing Impairments Translation Personal Assistant

Zhiming Ma,Shiyu Gan,Junhao Zhao,Xianming Li,Qingyun Pan,Peidong Wang,Mingjun Pan,Yuhao Mo,Jiajie Cheng,Chengxin Chen,Zhonglun Cao,Chonghan Liu,Shi Cheng

Main category: cs.CL

TL;DR: Presents HI-TransPA, an audio-visual assistant for hearing-impaired users built on the Omni-Model paradigm, which fuses indistinct speech with high-frame-rate lip dynamics to support translation and dialogue in one framework.

Details

Motivation: Existing Omni-Models adapt poorly to the indistinct speech of hearing-impaired people, and a unified, flexible communication aid is lacking.

Method: Builds a preprocessing pipeline with facial landmark detection, lip-region isolation and stabilization, and multimodal sample quality assessment, and uses a curriculum learning strategy together with a SigLIP encoder and a Unified 3D-Resampler to model high-frame-rate lip motion.

Result: On the purpose-built HI-Dialogue dataset, HI-TransPA achieves state-of-the-art performance in both literal accuracy and semantic fidelity.

Conclusion: The work lays a foundation for applying Omni-Models to assistive communication technology, providing an end-to-end modeling framework and key processing tools.

Abstract: To provide a unified and flexible solution for daily communication among hearing-impaired individuals, we introduce the Omni-Model paradigm into assistive technology and present HI-TransPA, an instruction-driven audio-visual personal assistant. The model fuses indistinct speech with high-frame-rate lip dynamics, enabling both translation and dialogue within a single multimodal framework. To tackle the challenges of noisy and heterogeneous raw data and the limited adaptability of existing Omni-Models to hearing-impaired speech, we construct a comprehensive preprocessing and curation pipeline that detects facial landmarks, isolates and stabilizes the lip region, and quantitatively assesses multimodal sample quality. These quality scores guide a curriculum learning strategy that first trains on clean, high-confidence samples and progressively incorporates harder cases to strengthen model robustness. We further adopt a SigLIP encoder combined with a Unified 3D-Resampler to efficiently encode high-frame-rate lip motion. Experiments on our purpose-built HI-Dialogue dataset show that HI-TransPA achieves state-of-the-art performance in both literal accuracy and semantic fidelity. This work establishes a foundation for applying Omni-Models to assistive communication technology, providing an end-to-end modeling framework and essential processing tools for future research.

[16] MINDS: A Cross-cultural Dialogue Corpus for Social Norm Classification and Adherence Detection

Pritish Sahu,Anirudh Som,Dimitra Vergyri,Ajay Divakaran

Main category: cs.CL

TL;DR: Presents Norm-RAG, a retrieval-augmented framework for social norm inference in multi-turn dialogue, and introduces the bilingual dataset MINDS to improve norm recognition in cross-cultural settings and the social intelligence of dialogue systems.

Details

Motivation: Prior work mostly annotates norms in isolated utterances or synthetic dialogues, which makes it hard to capture the dynamic, context-dependent norms of real multi-turn conversations, and it lacks a cross-cultural perspective.

Method: Norm-RAG combines pragmatic attributes (such as communicative intent and speaker roles) with retrieval based on semantic chunking to perform interpretable norm inference over structured normative documentation. The MINDS bilingual multi-turn dialogue dataset is annotated for norm category and adherence status by multi-annotator consensus.

Result: Experiments show that Norm-RAG improves norm detection and generalization, with stronger adaptability in cross-lingual and cross-cultural dialogues.

Conclusion: By combining context awareness with external knowledge retrieval, Norm-RAG improves the accuracy and interpretability of social norm understanding in multi-turn dialogue, offering a path toward culturally sensitive, socially intelligent systems.

Abstract: Social norms are implicit, culturally grounded expectations that guide interpersonal communication. Unlike factual commonsense, norm reasoning is subjective, context-dependent, and varies across cultures, posing challenges for computational models. Prior works provide valuable normative annotations but mostly target isolated utterances or synthetic dialogues, limiting their ability to capture the fluid, multi-turn nature of real-world conversations. In this work, we present Norm-RAG, a retrieval-augmented, agentic framework for nuanced social norm inference in multi-turn dialogues. Norm-RAG models utterance-level attributes, including communicative intent, speaker roles, interpersonal framing, and linguistic cues, and grounds them in structured normative documentation retrieved via a novel Semantic Chunking approach. This enables interpretable and context-aware reasoning about norm adherence and violation across multilingual dialogues. We further introduce MINDS (Multilingual Interactions with Norm-Driven Speech), a bilingual dataset comprising 31 multi-turn Mandarin-English and Spanish-English conversations. Each turn is annotated for norm category and adherence status using multi-annotator consensus, reflecting cross-cultural and realistic norm expression. Our experiments show that Norm-RAG improves norm detection and generalization, and demonstrates improved performance for culturally adaptive and socially intelligent dialogue systems.

[17] Leveraging Large Language Models for Identifying Knowledge Components

Canwen Wang,Jionghao Lin,Kenneth R. Koedinger

Main category: cs.CL

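TL;DR: This study examines whether large language models (LLMs) can automatically generate knowledge components (KCs) for adaptive learning systems. Simply scaling up LLM generation degrades performance and yields redundant KCs; adding a cosine-similarity-based semantic merging strategy significantly improves model performance and offers a workable path toward automated KC identification.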

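Details Motivation: Manually annotating knowledge components (KCs) is time-consuming and depends on domain experts, making it a bottleneck for adaptive learning systems. LLMs promise to automate KC generation, but prior studies were limited to small datasets and tended to produce redundant labels, so a more scalable and efficient automated method is needed. Method: GPT-4o-mini is used with a "simulated textbook" prompting strategy to generate KCs for a larger dataset of 646 multiple-choice questions, and the resulting KC model is evaluated. To address the excessive number of generated KCs, the authors propose merging semantically similar KC labels by cosine similarity, clustering similar KCs at different thresholds (e.g., 0.8) and rebuilding the model. Result: The raw LLM-generated KC model underperforms the expert-designed one (RMSE 0.4285 vs. 0.4206) and produces 569 KCs against the experts' 101. Merging at a cosine-similarity threshold of 0.8 reduces the KC count to 428 and improves RMSE to 0.4259, better than the unmerged model. Conclusion: Scaling up LLM-based KC generation alone is not enough to surpass the expert model, but combining it with semantic-similarity merging reduces redundancy and improves predictive performance, which suggests that "LLM generation plus semantic merging" is a viable direction for automated KC identification. Abstract: Knowledge Components (KCs) are foundational to adaptive learning systems, but their manual identification by domain experts is a significant bottleneck. While Large Language Models (LLMs) offer a promising avenue for automating this process, prior research has been limited to small datasets and has been shown to produce superfluous, redundant KC labels. This study addresses these limitations by first scaling a "simulated textbook" LLM prompting strategy (using GPT-4o-mini) to a larger dataset of 646 multiple-choice questions. We found that this initial automated approach performed significantly worse than an expert-designed KC model (RMSE 0.4285 vs. 0.4206) and generated an excessive number of KCs (569 vs. 101). To address the issue of redundancy, we proposed and evaluated a novel method for merging semantically similar KC labels based on their cosine similarity. This merging strategy significantly improved the model's performance; a model using a cosine similarity threshold of 0.8 achieved the best result, reducing the KC count to 428 and improving the RMSE to 0.4259. This demonstrates that while scaled LLM generation alone is insufficient, combining it with a semantic merging technique offers a viable path toward automating and refining KC identification.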
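
The merging step in [17] can be illustrated with a minimal sketch. The abstract only states that KC labels are merged by cosine similarity at a threshold such as 0.8; everything else here is an assumption made for illustration: the bag-of-words vectors stand in for whatever embedding the authors used, and the single-link grouping via union-find, along with the names `bow_vector` and `merge_labels`, is hypothetical.

```python
import math
from collections import Counter


def bow_vector(label):
    """Toy stand-in for a sentence embedding: a bag-of-words count vector."""
    return Counter(label.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def merge_labels(labels, threshold=0.8):
    """Group labels whose pairwise similarity reaches the threshold.

    Assumption: single-link grouping with union-find; the paper's exact
    clustering rule is not given in the abstract.
    """
    parent = list(range(len(labels)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    vecs = [bow_vector(label) for label in labels]
    for i in range(len(labels)):
        for j in range(i + 1, len(labels)):
            if cosine(vecs[i], vecs[j]) >= threshold:
                parent[find(j)] = find(i)

    groups = {}
    for i, label in enumerate(labels):
        groups.setdefault(find(i), []).append(label)
    return list(groups.values())


if __name__ == "__main__":
    kcs = [
        "solve linear equations",
        "solve linear equations quickly",
        "identify cell organelles",
    ]
    for group in merge_labels(kcs, threshold=0.8):
        print(group)
```

Lowering the threshold merges more aggressively and shrinks the KC count further; the abstract reports 0.8 as the best-performing setting on its data.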

[18] REAP: Enhancing RAG with Recursive Evaluation and Adaptive Planning for Multi-Hop Question Answering

Yijie Zhu,Haojie Zhou,Wanting Hong,Tailin Liu,Ning Wang

Main category: cs.CL

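TL;DR: Proposes REAP, a recursive evaluation and adaptive planning method whose sub-task planner and fact extractor modules improve the global planning ability and reasoning reliability of retrieval-augmented generation in multi-hop reasoning.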

Details Motivation: Existing retrieval-augmented generation methods lack global planning in multi-hop reasoning tasks and easily fall into local reasoning impasses; they also underuse retrieved content and latent clues, which hurts reasoning accuracy. Method: Two modules are designed, a Sub-task Planner (SP) and a Fact Extractor (FE). SP maintains a global view and dynamically optimizes the reasoning path; FE performs fine-grained analysis of retrieved content to extract reliable answers and clues. Together they build a coherent global knowledge representation. A unified task paradigm is also proposed for multi-task fine-tuning. Result: Experiments on multiple public multi-hop datasets show that REAP significantly outperforms existing RAG methods in both in-domain and out-of-domain settings. Conclusion: Through explicit structured sub-task management and adaptive planning, REAP improves the accuracy and traceability of complex multi-hop reasoning, supporting its effectiveness in mitigating LLM hallucination and strengthening reasoning. Abstract: Retrieval-augmented generation (RAG) has been extensively employed to mitigate hallucinations in large language models (LLMs). However, existing methods for multi-hop reasoning tasks often lack global planning, increasing the risk of falling into local reasoning impasses. Insufficient exploitation of retrieved content and the neglect of latent clues fail to ensure the accuracy of reasoning outcomes. To overcome these limitations, we propose Recursive Evaluation and Adaptive Planning (REAP), whose core idea is to explicitly maintain structured sub-tasks and facts related to the current task through the Sub-task Planner (SP) and Fact Extractor (FE) modules. SP maintains a global perspective, guiding the overall reasoning direction and evaluating the task state based on the outcomes of FE, enabling dynamic optimization of the task-solving trajectory. FE performs fine-grained analysis over retrieved content to extract reliable answers and clues. These two modules incrementally enrich a logically coherent representation of global knowledge, enhancing the reliability and the traceability of the reasoning process. Furthermore, we propose a unified task paradigm design that enables effective multi-task fine-tuning, significantly enhancing SP's performance on complex, data-scarce tasks. We conduct extensive experiments on multiple public multi-hop datasets, and the results demonstrate that our method significantly outperforms existing RAG methods in both in-domain and out-of-domain settings, validating its effectiveness in complex multi-hop reasoning tasks.

[19] NumPert: Numerical Perturbations to Probe Language Models for Veracity Prediction

Peter Røysland Aarnes,Vinay Setty

Main category: cs.CL

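TL;DR: This paper systematically evaluates state-of-the-art large language models on veracity prediction for numerical claims. Even the strongest models lose substantial accuracy under certain perturbations, and longer context generally lowers accuracy, though adding perturbed demonstrations partially recovers it.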

Details Motivation: LLMs perform well on knowledge-intensive tasks but are notably weak at numerical reasoning, so their robustness in numerical fact-checking needs to be assessed. Method: Current state-of-the-art models are systematically evaluated under controlled perturbations (including label-flipping probes), and the effects of context length and perturbed demonstrations are tested. Result: Leading models lose up to 62% accuracy under some perturbations, and no model is robust across all conditions. Longer context generally reduces accuracy, but most models recover substantially when perturbed demonstrations are added. Conclusion: Current LLMs have serious robustness deficiencies in numerical fact-checking, and making numerical reasoning stable remains an open challenge. Abstract: Large language models show strong performance on knowledge-intensive tasks such as fact-checking and question answering, yet they often struggle with numerical reasoning. We present a systematic evaluation of state-of-the-art models for veracity prediction on numerical claims and evidence pairs using controlled perturbations, including label-flipping probes, to test robustness. Our results indicate that even leading proprietary systems experience accuracy drops of up to 62% under certain perturbations. No model proves to be robust across all conditions. We further find that increasing context length generally reduces accuracy, but when extended context is enriched with perturbed demonstrations, most models substantially recover. These findings highlight critical limitations in numerical fact-checking and suggest that robustness remains an open challenge for current language models.

Bo Li,Tian Tian,Zhenghua Xu,Hao Cheng,Shikun Zhang,Wei Ye

Main category: cs.CL

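TL;DR: Proposes ETC, a training-free dynamic retrieval-augmented generation method that models trends in token-level uncertainty to trigger retrieval earlier and more precisely, improving QA performance while reducing the number of retrievals.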

Details Motivation: Existing dynamic retrieval methods trigger retrieval on low token-level confidence and often intervene only after errors have propagated, so retrieval comes too late. Method: Entropy-Trend Constraint (ETC) uses first- and second-order differences of the entropy sequence to detect uncertainty trends and decide when to retrieve; it needs no training and is plug-and-play. Result: Experiments on six QA benchmarks and three LLMs show that ETC consistently outperforms strong baselines while lowering retrieval frequency, and generalizes well in domain-specific scenarios. Conclusion: Trend-aware uncertainty modeling decides retrieval timing more effectively, and ETC is a general, efficient, and easily integrated dynamic RAG solution. Abstract: Dynamic retrieval-augmented generation (RAG) allows large language models (LLMs) to fetch external knowledge on demand, offering greater adaptability than static RAG. A central challenge in this setting lies in determining the optimal timing for retrieval. Existing methods often trigger retrieval based on low token-level confidence, which may lead to delayed intervention after errors have already propagated. We introduce Entropy-Trend Constraint (ETC), a training-free method that determines optimal retrieval timing by modeling the dynamics of token-level uncertainty. Specifically, ETC utilizes first- and second-order differences of the entropy sequence to detect emerging uncertainty trends, enabling earlier and more precise retrieval. Experiments on six QA benchmarks with three LLM backbones demonstrate that ETC consistently outperforms strong baselines while reducing retrieval frequency. ETC is particularly effective in domain-specific scenarios, exhibiting robust generalization capabilities. Ablation studies and qualitative analyses further confirm that trend-aware uncertainty modeling yields more effective retrieval timing. The method is plug-and-play, model-agnostic, and readily integrable into existing decoding pipelines. Implementation code is included in the supplementary materials.
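
To make the idea of first- and second-order entropy differences concrete, here is a minimal sketch. It is not the authors' implementation: the abstract does not give the trigger rule, so the condition used below (entropy rising faster than `tau1` and accelerating by more than `tau2`), the threshold values, and the function names are all assumptions chosen for illustration.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of one next-token probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def first_retrieval_step(entropies, tau1=0.1, tau2=0.0):
    """Return the first step t at which uncertainty shows a rising trend.

    d1 is the first-order difference H[t] - H[t-1]; d2 is the second-order
    difference d1[t] - d1[t-1]. The rule "d1 > tau1 and d2 > tau2" is a
    hypothetical trigger, not the rule from the paper.
    """
    for t in range(2, len(entropies)):
        d1 = entropies[t] - entropies[t - 1]
        d1_prev = entropies[t - 1] - entropies[t - 2]
        d2 = d1 - d1_prev
        if d1 > tau1 and d2 > tau2:
            return t
    return None


if __name__ == "__main__":
    # Per-token entropies of a hypothetical generation; uncertainty starts
    # to climb before it reaches its peak at the last step.
    trace = [0.2, 0.2, 0.25, 0.5, 1.2]
    print(first_retrieval_step(trace))
```

On this trace a plain entropy threshold of, say, 1.0 would fire only at the last step, whereas the trend rule fires one step earlier, which is the kind of earlier intervention the abstract describes.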

[21] Language Drift in Multilingual Retrieval-Augmented Generation: Characterization and Decoding-Time Mitigation

Bo Li,Zhenghua Xu,Rui Xie

Main category: cs.CL

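TL;DR: This paper studies output language drift in multilingual retrieval-augmented generation (RAG), finds that it stems from decoder-level collapse rather than comprehension failure, and proposes Soft Constrained Decoding (SCD), a lightweight, training-free method to mitigate it.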

Details Motivation: In multilingual RAG, when retrieved documents, the user query, and in-context exemplars differ in language, the model tends to respond in a non-target language; the drift is worse when complex reasoning such as chain-of-thought generation is required. Method: Language drift is analyzed through controlled experiments across multiple datasets, languages, and LLMs, and Soft Constrained Decoding (SCD) is proposed to steer generation at decoding time by penalizing non-target-language tokens. Result: English acts as a semantic attractor under cross-lingual conditions and dominates generation; SCD consistently improves language alignment and task performance across languages and datasets. Conclusion: Language drift is mainly caused by decoder-level bias in token distributions, and SCD is a general, training-free, and effective way to reduce it in multilingual RAG. Abstract: Multilingual Retrieval-Augmented Generation (RAG) enables large language models (LLMs) to perform knowledge-intensive tasks in multilingual settings by leveraging retrieved documents as external evidence. However, when the retrieved evidence differs in language from the user query and in-context exemplars, the model often exhibits language drift by generating responses in an unintended language. This phenomenon is especially pronounced during reasoning-intensive decoding, such as Chain-of-Thought (CoT) generation, where intermediate steps introduce further language instability. In this paper, we systematically study output language drift in multilingual RAG across multiple datasets, languages, and LLM backbones. Our controlled experiments reveal that the drift results not from comprehension failure but from decoder-level collapse, where dominant token distributions and high-frequency English patterns dominate the intended generation language. We further observe that English serves as a semantic attractor under cross-lingual conditions, emerging as both the strongest interference source and the most frequent fallback language. To mitigate this, we propose Soft Constrained Decoding (SCD), a lightweight, training-free decoding strategy that gently steers generation toward the target language by penalizing non-target-language tokens. SCD is model-agnostic and can be applied to any generation algorithm without modifying the architecture or requiring additional data. Experiments across three multilingual datasets and multiple typologically diverse languages show that SCD consistently improves language alignment and task performance, providing an effective and generalizable solution in multilingual RAG.
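
The core of SCD as described in [21], penalizing non-target-language tokens at decoding time, can be sketched in a few lines. This is an illustration under stated assumptions, not the paper's method: the script-based language check (`is_target_script`), the fixed additive `penalty`, and the toy logits are all hypothetical, since the abstract does not say how non-target tokens are identified or how large the penalty is.

```python
import unicodedata


def is_target_script(token, script_prefix):
    """Hypothetical heuristic: a token is 'target-language' if every letter
    in it has a Unicode name starting with script_prefix (e.g. 'CJK').
    Tokens without letters (punctuation, digits) are never penalized."""
    letters = [ch for ch in token if ch.isalpha()]
    return all(unicodedata.name(ch, "").startswith(script_prefix) for ch in letters)


def soft_constrain(logits, script_prefix, penalty=5.0):
    """Subtract a fixed penalty from the logit of each non-target token.

    The constraint is soft: penalized tokens stay available, so a strongly
    preferred off-language token (a name, a code identifier) can still win.
    """
    return {
        tok: score if is_target_script(tok, script_prefix) else score - penalty
        for tok, score in logits.items()
    }


if __name__ == "__main__":
    # Toy next-token logits where English slightly outranks Chinese.
    logits = {"the": 3.0, "这": 2.5, "。": 0.5}
    adjusted = soft_constrain(logits, script_prefix="CJK")
    print(max(adjusted, key=adjusted.get))
```

A script test only separates languages with different writing systems; languages that share a script would need a vocabulary-level language tag instead, which is outside what this sketch covers.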

[22] FinNuE: Exposing the Risks of Using BERTScore for Numerical Semantic Evaluation in Finance

Yu-Shiang Huang,Yun-Yu Lee,Tzu-Hsin Chou,Che Lin,Chuan-Ju Wang

Main category: cs.CL

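TL;DR: BERTScore is insensitive to numerical changes, which can cause serious misjudgments in finance. The authors introduce the FinNuE dataset to diagnose the problem and call for numerically aware evaluation frameworks better suited to financial NLP.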

Details Motivation: BERTScore is insensitive to numerical changes when measuring semantic similarity, a serious flaw in finance, where numerical precision matters. Method: The authors build FinNuE, a diagnostic dataset with controlled numerical perturbations covering earnings calls, regulatory filings, social media, and news articles, and use it to evaluate BERTScore. Result: BERTScore fails to distinguish numerical changes that carry important semantic differences and often assigns high similarity scores to text pairs with very different financial meanings. Conclusion: Embedding-based evaluation metrics have fundamental limitations in finance, and new evaluation frameworks that are sensitive to numerical changes are needed. Abstract: BERTScore has become a widely adopted metric for evaluating semantic similarity between natural language sentences. However, we identify a critical limitation: BERTScore exhibits low sensitivity to numerical variation, a significant weakness in finance where numerical precision directly affects meaning (e.g., distinguishing a 2% gain from a 20% loss). We introduce FinNuE, a diagnostic dataset constructed with controlled numerical perturbations across earnings calls, regulatory filings, social media, and news articles. Using FinNuE, we demonstrate that BERTScore fails to distinguish semantically critical numerical differences, often assigning high similarity scores to financially divergent text pairs. Our findings reveal fundamental limitations of embedding-based metrics for finance and motivate numerically-aware evaluation frameworks for financial NLP.

[23] PustakAI: Curriculum-Aligned and Interactive Textbooks Using Large Language Models

Shivam Sharma,Riya Naik,Tejas Gawas,Heramb Patil,Kunal Korgaonkar

Main category: cs.CL

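TL;DR: This paper presents PustakAI, a framework for designing and evaluating "NCERT-QA", a question-answering dataset aligned with India's NCERT curriculum for English and Science in grades 6 to 8, and analyzes how well various large language models suit educational use.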

Details Motivation: Adapting LLMs effectively to a specific curriculum (such as India's NCERT syllabus) raises challenges of accuracy, alignment, and pedagogical relevance, and personalized learning tools are especially needed in regions with scarce educational resources. Method: The NCERT-QA dataset is built and its QA pairs are classified as Factoid, Inferential, and Others (evaluative and reasoning). Prompting techniques including meta-prompt, few-shot, and chain-of-thought are applied, and open-source and high-end LLMs are evaluated with multiple metrics. Result: The evaluation shows how prompting methods differ in curriculum alignment and exposes the relative strengths and weaknesses of current open-source models (Gemma, Llama3.2, Nemotron) and high-end models (Llama-4-Scout, Deepseek-r1) in educational applications. Conclusion: The PustakAI framework and the NCERT-QA dataset offer a workable path for curriculum-aligned LLM applications and can help improve personalized education in resource-limited regions, while highlighting the key role of model selection and prompt engineering in teaching scenarios. Abstract: Large Language Models (LLMs) have demonstrated remarkable capabilities in understanding and generating human-like content. This has revolutionized various sectors such as healthcare, software development, and education. In education, LLMs offer potential for personalized and interactive learning experiences, especially in regions with limited teaching resources. However, adapting these models effectively to curriculum-specific content, such as the National Council of Educational Research and Training (NCERT) syllabus in India, presents unique challenges in terms of accuracy, alignment, and pedagogical relevance. In this paper, we present the framework "PustakAI" (Pustak means 'book' in many Indian languages) for the design and evaluation of a novel question-answering dataset "NCERT-QA" aligned with the NCERT curriculum for English and Science subjects of grades 6 to 8. We classify the curated QA pairs as Factoid, Inferential, and Others (evaluative and reasoning). We evaluate the dataset with various prompting techniques, such as meta-prompt, few-shot, and CoT-style prompting, using diverse evaluation metrics to understand which approach aligns more efficiently with the structure and demands of the curriculum. Along with the usability of the dataset, we analyze the strengths and limitations of current open-source LLMs (Gemma3:1b, Llama3.2:3b, and Nemotron-mini:4b) and high-end LLMs (Llama-4-Scout-17B and Deepseek-r1-70B) as AI-based learning tools in formal education systems.

[24] ScaleFormer: Span Representation Cumulation for Long-Context Transformer

Jiangshu Du,Wenpeng Yin,Philip Yu

Main category: cs.CL

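TL;DR: ScaleFormer is a plug-and-play framework that requires no architectural changes; it uses overlapping chunks and cumulative context representations so that pre-trained encoder-decoder models can process long text efficiently.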

Details Motivation: The quadratic complexity of standard self-attention limits the use of Transformers on long-text tasks, and existing efficient variants usually require re-pre-training or architectural changes, so they are hard to apply directly. Method: Long inputs are split into overlapping chunks, and a parameter-free fusion mechanism enriches each chunk's boundary representations with cumulative context vectors from the preceding and succeeding chunks, giving the model awareness of document structure at linear complexity. Result: On long-document summarization, ScaleFormer matches or outperforms current state-of-the-art methods without architectural changes or external retrieval mechanisms. Conclusion: ScaleFormer is a simple and effective way to let off-the-shelf pre-trained models handle long sequences while preserving model performance and inference efficiency. Abstract: The quadratic complexity of standard self-attention severely limits the application of Transformer-based models to long-context tasks. While efficient Transformer variants exist, they often require architectural changes and costly pre-training from scratch. To circumvent this, we propose ScaleFormer (Span Representation Cumulation for Long-Context Transformer), a simple and effective plug-and-play framework that adapts off-the-shelf pre-trained encoder-decoder models to process long sequences without requiring architectural modifications. Our approach segments long inputs into overlapping chunks and generates a compressed, context-aware representation for the decoder. The core of our method is a novel, parameter-free fusion mechanism that endows each chunk's representation with structural awareness of its position within the document. It achieves this by enriching each chunk's boundary representations with cumulative context vectors from all preceding and succeeding chunks. This strategy provides the model with a strong signal of the document's narrative flow, achieves linear complexity, and enables pre-trained models to reason effectively over long-form text. Experiments on long-document summarization show that our method is highly competitive with and often outperforms state-of-the-art approaches without requiring architectural modifications or external retrieval mechanisms.
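
The two ingredients named in [24], overlapping chunks and cumulative context from preceding and succeeding chunks, can be sketched as follows. This is a simplified illustration, not the paper's fusion mechanism: the abstract does not specify how a chunk is summarized or how the cumulative vectors are combined with boundary representations, so the plain element-wise sums and the function names below are assumptions.

```python
def overlapping_chunks(tokens, size, overlap):
    """Split a token list into chunks of `size` tokens, with consecutive
    chunks sharing `overlap` tokens."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must satisfy 0 <= overlap < size")
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # the last chunk already reaches the end of the input
    return chunks


def cumulative_context(chunk_vecs):
    """For chunk i, return (sum of vectors of chunks before i,
    sum of vectors of chunks after i).

    Prefix and suffix sums are built in one pass each, so the cost is
    linear in the number of chunks. Using a plain sum is an assumption.
    """
    if not chunk_vecs:
        return []
    zero = [0.0] * len(chunk_vecs[0])
    prefix = [zero]
    for vec in chunk_vecs[:-1]:
        prefix.append([a + b for a, b in zip(prefix[-1], vec)])
    suffix = [zero]
    for vec in reversed(chunk_vecs[1:]):
        suffix.append([a + b for a, b in zip(suffix[-1], vec)])
    suffix.reverse()
    return list(zip(prefix, suffix))


if __name__ == "__main__":
    print(overlapping_chunks(list(range(10)), size=4, overlap=2))
    # Hypothetical 2-dimensional summary vectors, one per chunk.
    vecs = [[1.0, 0.0], [0.0, 2.0], [3.0, 3.0]]
    for before, after in cumulative_context(vecs):
        print(before, after)
```

Because each chunk is encoded independently and the context sums are computed once, the overall cost grows linearly with input length, which matches the complexity claim in the abstract.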

[25] Do Language Models Associate Sound with Meaning? A Multimodal Study of Sound Symbolism

Jinhong Jeong,Sunghyun Lee,Jaeyoung Lee,Seonah Han,Youngjae Yu

Main category: cs.CL

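TL;DR: This paper uses sound symbolism to probe how multimodal large language models (MLLMs) interpret auditory information in human language. It builds LEX-ICON, a large mimetic word dataset of real words and pseudo-words, and analyzes MLLMs' responses to phonetic iconicity and their attention mechanisms across multiple semantic dimensions.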

Details Motivation: To explore how MLLMs handle non-arbitrary associations between sound and meaning (sound symbolism), revealing how well they understand auditory information and connecting artificial intelligence with cognitive linguistics. Method: The authors build the LEX-ICON dataset, containing real words from four natural languages and systematically constructed pseudo-words annotated with multiple semantic features. They evaluate MLLMs across semantic dimensions using text (orthographic and IPA) and audio inputs, and analyze layer-wise information processing by measuring phoneme-level attention scores. Result: (1) MLLMs show phonetic intuitions consistent with existing linguistic research across multiple semantic dimensions; (2) attention patterns show that models focus on iconic phonemes, suggesting some sensitivity to sound-meaning correspondence. Conclusion: The study provides the first large-scale quantitative analysis of phonetic iconicity in terms of MLLM interpretability, offering a new view of how AI models process language and bridging artificial intelligence and cognitive linguistics. Abstract: Sound symbolism is a linguistic concept that refers to non-arbitrary associations between phonetic forms and their meanings. We suggest that this can be a compelling probe into how Multimodal Large Language Models (MLLMs) interpret auditory information in human languages. We investigate MLLMs' performance on phonetic iconicity across textual (orthographic and IPA) and auditory forms of inputs with up to 25 semantic dimensions (e.g., sharp vs. round), observing models' layer-wise information processing by measuring phoneme-level attention fraction scores. To this end, we present LEX-ICON, an extensive mimetic word dataset consisting of 8,052 words from four natural languages (English, French, Japanese, and Korean) and 2,930 systematically constructed pseudo-words, annotated with semantic features applied across both text and audio modalities. Our key findings demonstrate (1) MLLMs' phonetic intuitions that align with existing linguistic research across multiple semantic dimensions and (2) phonosemantic attention patterns that highlight models' focus on iconic phonemes. These results bridge domains of artificial intelligence and cognitive linguistics, providing the first large-scale, quantitative analyses of phonetic iconicity in terms of MLLMs' interpretability.

[26] GraphIF: Enhancing Multi-Turn Instruction Following for Large Language Models with Relation Graph Prompt

Zhenhe Li,Can Lin,Ling Zheng,Wen-Da Wei,Junli Liang,Qi Song

Main category: cs.CL

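TL;DR: This paper proposes GraphIF, a graph-based framework for multi-turn instruction following that models multi-turn dialogues as directed relation graphs and uses graph prompts to strengthen LLM instruction following, yielding significant gains in long multi-turn dialogues.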

Details Motivation: Existing approaches to multi-turn instruction following mainly fine-tune on large multi-turn dialogue datasets but usually treat each response as an independent task and do not model multi-turn instruction following explicitly, so models struggle with complex long-distance constraints. Method: The GraphIF framework has three modules: an agent-based relation extraction module that captures inter-turn semantic relations through action-triggered mechanisms and builds structured graphs; a relation graph prompt generation module that converts graph information into natural language prompts; and a response rewriting module that refines the initial LLM output using the generated graph prompts. Result: Experiments on two long multi-turn dialogue datasets show that GraphIF integrates seamlessly into instruction-tuned LLMs and brings significant improvements on all four multi-turn instruction-following metrics. Conclusion: GraphIF is the first exploration of graph structures for improving LLMs' multi-turn instruction following; it supports the value of structured modeling in complex dialogue and suggests a new way to improve the coherence and instruction consistency of conversational systems. Abstract: Multi-turn instruction following is essential for building intelligent conversational systems that can consistently adhere to instructions across dialogue turns. However, existing approaches to enhancing multi-turn instruction following primarily rely on collecting or generating large-scale multi-turn dialogue datasets to fine-tune large language models (LLMs), which treat each response generation as an isolated task and fail to explicitly incorporate multi-turn instruction following into the optimization objectives. As a result, instruction-tuned LLMs often struggle with complex long-distance constraints. In multi-turn dialogues, relational constraints across turns can be naturally modeled as labeled directed edges, making graph structures particularly suitable for modeling multi-turn instruction following. Despite this potential, leveraging graph structures to enhance the multi-turn instruction following capabilities of LLMs remains unexplored. To bridge this gap, we propose GraphIF, a plug-and-play framework that models multi-turn dialogues as directed relation graphs and leverages graph prompts to enhance the instruction following capabilities of LLMs. GraphIF comprises three key components: (1) an agent-based relation extraction module that captures inter-turn semantic relations via action-triggered mechanisms to construct structured graphs; (2) a relation graph prompt generation module that converts structured graph information into natural language prompts; and (3) a response rewriting module that refines initial LLM outputs using the generated graph prompts. Extensive experiments on two long multi-turn dialogue datasets demonstrate that GraphIF can be seamlessly integrated into instruction-tuned LLMs and leads to significant improvements across all four multi-turn instruction-following evaluation metrics.

[27] ADI-20: Arabic Dialect Identification dataset and models

Haroun Elleuch,Salima Mdhaffar,Yannick Estève,Fethi Bougares

Main category: cs.CL

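TL;DR: This paper introduces ADI-20, an extension of the previously released ADI-17 Arabic dialect identification dataset that covers the dialects of all Arabic-speaking countries, with 3,556 hours of speech from 19 Arabic dialects plus Modern Standard Arabic (MSA). The authors use it to train and evaluate several state-of-the-art ADI systems, exploring fine-tuning of pre-trained ECAPA-TDNN models as well as Whisper encoder blocks combined with an attention pooling layer and a dense classification layer. They also analyze how training data size and model parameter count affect identification performance; F1 drops only slightly even when just 30% of the original training data is used. The collected data and trained models are open-sourced to support further research and reproduction.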

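Details Motivation: Improving Arabic dialect identification (ADI) systems and advancing related research requires a larger, higher-quality dataset with broader coverage. The existing ADI-17 dataset has limited coverage, so the authors build ADI-20, a larger dataset covering the dialects of all Arabic-speaking countries, and run systematic experiments on it. Method: ADI-20 contains 19 Arabic dialects plus Modern Standard Arabic (MSA), totaling 3,556 hours. Two architectures are tested: fine-tuned pre-trained ECAPA-TDNN models, and Whisper encoder blocks combined with an attention pooling layer and a dense classification layer. The effects of training data size (e.g., a 30% subset) and model parameter count on ADI performance are also studied. Result: The models perform well on ADI, and F1 decreases only slightly when just 30% of the training data is used, indicating robustness to data volume. Model size has some effect on performance, but gains must be weighed against compute cost. All data and trained models are open-sourced. Conclusion: ADI-20 is a large-scale Arabic dialect identification dataset with full country coverage that substantially extends the coverage and volume of existing resources. Experiments on it confirm the effectiveness of current state-of-the-art models for ADI and show how data volume and model size affect performance. The open data and models should support future research on Arabic dialect identification and related areas. Abstract: We present ADI-20, an extension of the previously published ADI-17 Arabic Dialect Identification (ADI) dataset. ADI-20 covers all Arabic-speaking countries' dialects. It comprises 3,556 hours from 19 Arabic dialects in addition to Modern Standard Arabic (MSA). We used this dataset to train and evaluate various state-of-the-art ADI systems. We explored fine-tuning pre-trained ECAPA-TDNN-based models, as well as Whisper encoder blocks coupled with an attention pooling layer and a classification dense layer. We investigated the effect of (i) training data size and (ii) the model's number of parameters on identification performance. Our results show a small decrease in F1 score while using only 30% of the original training data. We open-source our collected data and trained models to enable the reproduction of our work, as well as support further research in ADI.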

[28] Format Matters: The Robustness of Multimodal LLMs in Reviewing Evidence from Tables and Charts

Xanh Ho,Yun-Ang Wu,Sunisth Kumar,Florian Boudin,Atsuhiro Takasu,Akiko Aizawa

Main category: cs.CL

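TL;DR: This paper studies how well multimodal large language models understand tables and charts as evidence in scientific claim verification. Current models handle tables comparatively well but fall clearly short on charts, exposing limits in cross-modal reasoning.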

Details Motivation: As the number of scientific papers grows, systems that help reviewers assess research claims are increasingly needed, yet it is unclear how well existing models verify scientific claims across evidence formats such as tables and charts. Method: The authors design experiments and build a dataset for multimodal claim verification by annotating and restructuring two existing datasets of scientific papers. They evaluate 12 multimodal LLMs on table and chart evidence and run a human comparison study. Result: Current multimodal LLMs perform better with table evidence than with chart evidence, while humans perform well on both formats. Smaller multimodal models (under 8B) show weak correlation between table and chart performance, indicating limited cross-modal generalization. Conclusion: Existing multimodal LLMs have a marked weakness in chart understanding, and future work should strengthen it to improve multimodal reasoning for scientific claim verification. Abstract: With the growing number of submitted scientific papers, there is an increasing demand for systems that can assist reviewers in evaluating research claims. Experimental results are a core component of scientific work, often presented in varying formats such as tables or charts. Understanding how robust current multimodal large language models (multimodal LLMs) are at verifying scientific claims across different evidence formats remains an important and underexplored challenge. In this paper, we design and conduct a series of experiments to assess the ability of multimodal LLMs to verify scientific claims using both tables and charts as evidence. To enable this evaluation, we adapt two existing datasets of scientific papers by incorporating annotations and structures necessary for a multimodal claim verification task. Using this adapted dataset, we evaluate 12 multimodal LLMs and find that current models perform better with table-based evidence while struggling with chart-based evidence. We further conduct human evaluations and observe that humans maintain strong performance across both formats, unlike the models. Our analysis also reveals that smaller multimodal LLMs (under 8B) show weak correlation in performance between table-based and chart-based tasks, indicating limited cross-modal generalization. These findings highlight a critical gap in current models' multimodal reasoning capabilities. We suggest that future multimodal LLMs should place greater emphasis on improving chart understanding to better support scientific claim verification.

[29] ELYADATA & LIA at NADI 2025: ASR and ADI Subtasks

Haroun Elleuch,Youssef Saidi,Salima Mdhaffar,Yannick Estève,Fethi Bougares

Main category: cs.CL

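TL;DR: Elyadata and LIA performed strongly in the NADI 2025 multi-dialectal Arabic speech processing challenge: their joint submission ranked first in the Spoken Arabic Dialect Identification (ADI) subtask and second in the multi-dialectal Arabic ASR subtask.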

Details Motivation: To improve multi-dialectal Arabic speech recognition and dialect identification, using large pre-trained models to address speech processing for low-resource dialects. Method: For ADI, a Whisper-large-v3 encoder is fine-tuned with data augmentation. For multi-dialectal ASR, SeamlessM4T-v2 Large (Egyptian variant) is fine-tuned separately for each of the eight dialects. Result: The ADI system reaches 79.83% accuracy on the test set and ranks first; the multi-dialectal ASR system achieves an average WER of 38.54% and CER of 14.53% on the test set and ranks second. Conclusion: Large pre-trained speech models combined with targeted fine-tuning are effective for multi-dialectal Arabic speech processing. Abstract: This paper describes Elyadata & LIA's joint submission to the NADI multi-dialectal Arabic Speech Processing 2025. We participated in the Spoken Arabic Dialect Identification (ADI) and multi-dialectal Arabic ASR subtasks. Our submission ranked first for the ADI subtask and second for the multi-dialectal Arabic ASR subtask among all participants. Our ADI system is a fine-tuned Whisper-large-v3 encoder with data augmentation. This system obtained the highest ADI accuracy score of 79.83% on the official test set. For multi-dialectal Arabic ASR, we fine-tuned SeamlessM4T-v2 Large (Egyptian variant) separately for each of the eight considered dialects. Overall, we obtained an average WER and CER of 38.54% and 14.53%, respectively, on the test set. Our results demonstrate the effectiveness of large pre-trained speech models with targeted fine-tuning for Arabic speech processing.

[30] On the Military Applications of Large Language Models

Satu Johansson,Taneli Riihonen

Main category: cs.CL

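TL;DR: This paper examines military applications of natural language processing and large language models by questioning a GPT-based model (Microsoft Copilot) and assessing the information it provides, and studies the feasibility of building such applications on commercial cloud services (Microsoft Azure).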

Details Motivation: To explore potential applications of large language models in military settings and assess whether existing technology can implement them. Method: The authors interact with a GPT-based language model to elicit its view of military applications, and analyze how well commercial cloud platforms support building such applications. Result: The summarization and generative capabilities of language models directly support many military applications, other features have specific uses, and some applications are feasible to implement with existing cloud services. Conclusion: Large language models have broad potential in the military domain, and some applications can be realized on existing commercial cloud platforms. Abstract: In this paper, military use cases or applications and implementation thereof are considered for natural language processing and large language models, which have risen to prominence with the invention of the generative pre-trained transformer (GPT) and the extensive foundation model pretraining done by OpenAI for ChatGPT and others. First, we interrogate a GPT-based language model (viz. Microsoft Copilot) to make it reveal its own knowledge about their potential military applications and then critically assess the information. Second, we study how commercial cloud services (viz. Microsoft Azure) could be used readily to build such applications and assess which of them are feasible. We conclude that the summarization and generative properties of language models directly facilitate many applications at large and other features may find particular uses.

[31] Generalizing to Unseen Disaster Events: A Causal View

Philipp Seeberger,Steffen Freisinger,Tobias Bocklet,Korbinian Riedhammer

Main category: cs.CL

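TL;DR: Proposes a causal-learning-based method to reduce event- and domain-related biases in disaster event classification, improving generalization to future events.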

Details Motivation: Existing systems that process social media data are prone to event-related biases and generalize poorly to newly emerging disasters, and current debiasing methods are underexplored in this domain. Method: A debiasing method is designed from a causal learning perspective to reduce event- and domain-related biases and improve generalization in disaster classification. Result: Across three disaster classification tasks, the method outperforms multiple baselines by up to +1.9% F1 and significantly improves a classifier based on a pre-trained language model. Conclusion: The proposed causal debiasing method mitigates bias in disaster information processing and improves adaptability to emerging events and classification performance. Abstract: Due to the rapid growth of social media platforms, these tools have become essential for monitoring information during ongoing disaster events. However, extracting valuable insights requires real-time processing of vast amounts of data. A major challenge in existing systems is their exposure to event-related biases, which negatively affects their ability to generalize to emerging events. While recent advancements in debiasing and causal learning offer promising solutions, they remain underexplored in the disaster event domain. In this work, we approach bias mitigation through a causal lens and propose a method to reduce event- and domain-related biases, enhancing generalization to future events. Our approach outperforms multiple baselines by up to +1.9% F1 and significantly improves a PLM-based classifier across three disaster classification tasks.

[32] Beyond the Black Box: Demystifying Multi-Turn LLM Reasoning with VISTA

Yiran Zhang,Mingyang Lin,Mark Dras,Usman Naseem

Main category: cs.CL

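TL;DR: This paper presents VISTA, an interactive visualization system for analyzing language model behavior in multi-turn reasoning tasks. It supports visualizing context influence, editing conversation histories, and generating reasoning dependency trees, reducing the complexity of analysis.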

Details Motivation: Effective tools for analyzing the complex contextual dependencies in multi-turn reasoning are lacking, which imposes a high cognitive load on researchers and makes it hard to understand LLM reasoning in depth. Method: The authors design and implement VISTA, a web-based interactive visualization system that supports context influence visualization, interactive "what-if" analysis, and automatic construction of reasoning dependency trees, along with custom benchmarks and local model integration. Result: VISTA exposes the reasoning paths of LLMs in multi-turn dialogue, improves understanding of model decisions, and supports cross-model comparison and transparent analysis. Conclusion: VISTA is an efficient, transparent, and extensible tool for analyzing the multi-turn reasoning of LLMs and can help advance research into model reasoning mechanisms. Abstract: Recent research has increasingly focused on the reasoning capabilities of Large Language Models (LLMs) in multi-turn interactions, as these scenarios more closely mirror real-world problem-solving. However, analyzing the intricate reasoning processes within these interactions presents a significant challenge due to complex contextual dependencies and a lack of specialized visualization tools, leading to a high cognitive load for researchers. To address this gap, we present VISTA, a web-based Visual Interactive System for Textual Analytics in multi-turn reasoning tasks. VISTA allows users to visualize the influence of context on model decisions and interactively modify conversation histories to conduct "what-if" analyses across different models. Furthermore, the platform can automatically parse a session and generate a reasoning dependency tree, offering a transparent view of the model's step-by-step logical path. By providing a unified and interactive framework, VISTA significantly reduces the complexity of analyzing reasoning chains, thereby facilitating a deeper understanding of the capabilities and limitations of current LLMs. The platform is open-source and supports easy integration of custom benchmarks and local models.

[33] Text2SQL-Flow: A Robust SQL-Aware Data Augmentation Framework for Text-to-SQL

Qifeng Cai,Hao Liang,Chang Xu,Tao Xie,Wentao Zhang,Bin Cui

Main category: cs.CL

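TL;DR: Proposes Text2SQL-Flow, a SQL-aware data augmentation framework that generates large-scale, semantically valid, and structurally diverse Text-to-SQL pairs, and builds the high-quality dataset SQLFlow, which improves the Text-to-SQL performance of both open-source and closed-source LLMs.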

Details Motivation: Existing Text-to-SQL datasets are scarce, simplistic, and low in diversity, which limits model performance; high-quality, diverse data is needed. Method: The Text2SQL-Flow framework augments data along six dimensions and integrates SQL execution verification, natural language question generation, chain-of-thought reasoning traces, and data classification, with a modular Database Manager for cross-database compatibility. The SQLFlow dataset is built with it, and a masked alignment retrieval method is proposed for closed-source models. Result: Under the same data budget, open-source LLMs fine-tuned on SQLFlow perform better across benchmarks, and the proposed retrieval method outperforms existing techniques with closed-source models, supporting the value of the data quality and the alignment strategy. Conclusion: High-quality, structurally rich data is essential for Text-to-SQL, and Text2SQL-Flow and SQLFlow provide a scalable foundation for data-centric Text-to-SQL systems. Abstract: The data-centric paradigm has become pivotal in AI, especially for Text-to-SQL, where performance is limited by scarce, simplistic, and low-diversity datasets. To address this, we propose Text2SQL-Flow, a SQL-aware data augmentation framework that generates large-scale, semantically valid, and structurally diverse Text-to-SQL pairs from minimal seed data. It operates across six augmentation dimensions and integrates an end-to-end pipeline featuring SQL execution verification, natural language question generation, chain-of-thought reasoning traces, and data classification. A modular Database Manager ensures cross-database compatibility and scalability. Using this framework, we build SQLFlow, a high-quality dataset of 89,544 annotated examples. We evaluate SQLFlow in two settings: (1) For open-source LLMs, fine-tuning on SQLFlow consistently improves performance across benchmarks under the same data budget. (2) For closed-source LLMs, we introduce a masked alignment retrieval method that treats SQLFlow as both knowledge base and training data for the retriever. This enables structure-aware example matching by modeling fine-grained alignments between questions and SQL queries. Experiments show our retrieval strategy outperforms existing methods, underscoring the value of SQLFlow's high-fidelity data and our novel technique. Our work establishes a scalable, data-centric foundation for advancing Text-to-SQL systems and highlights the critical role of high-quality structured data in modern AI.

[34] EffiReason-Bench: A Unified Benchmark for Evaluating and Advancing Efficient Reasoning in Large Language Models

Junquan Huang,Haotian Wu,Yubo Gao,Yibo Yan,Junyan Zhang,Yonghua Hei,Song Dai,Jie Zhang,Puay Siew Tan,Xuming Hu

Main category: cs.CL

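TL;DR: Introduces EffiReason-Bench, a unified benchmark for evaluating efficient reasoning methods, together with the E3-Score metric; cross-paradigm comparison across multiple tasks and models finds that no single method is best in every setting.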

Details Motivation: LLMs using chain-of-thought prompting often produce long, unnecessary reasoning, which raises cost and can reduce accuracy; the lack of a unified evaluation standard also makes efficient reasoning methods hard to compare. Method: Builds EffiReason-Bench, a unified benchmark covering three categories of efficient reasoning methods (Reasoning Blueprints, Dynamic Execution, and Post-hoc Refinement), and creates human-verified chain-of-thought annotations for CommonsenseQA and LogiQA through a standardized pipeline; evaluates 7 methods on 6 open-source LLMs and 4 datasets, and proposes the E3-Score, inspired by economic trade-off modeling, as a composite metric. Result: Method performance depends on model scale, task complexity, and architecture, and no method dominates in all cases; the E3-Score gives stable, smooth evaluation, avoiding the discontinuities and heuristic dependence of traditional metrics. Conclusion: Efficient reasoning methods should be chosen per model and task; future work should design adaptive strategies that account for model characteristics and task requirements, and EffiReason-Bench and the E3-Score provide a reliable evaluation basis for such research. Abstract: Large language models (LLMs) with Chain-of-Thought (CoT) prompting achieve strong reasoning but often produce unnecessarily long explanations, increasing cost and sometimes reducing accuracy. Fair comparison of efficiency-oriented approaches is hindered by fragmented evaluation practices. We introduce EffiReason-Bench, a unified benchmark for rigorous cross-paradigm evaluation of efficient reasoning methods across three categories: Reasoning Blueprints, Dynamic Execution, and Post-hoc Refinement. To enable step-by-step evaluation, we construct verified CoT annotations for CommonsenseQA and LogiQA via a pipeline that enforces standardized reasoning structures, comprehensive option-wise analysis, and human verification. We evaluate 7 methods across 6 open-source LLMs (1B-70B) on 4 datasets spanning mathematics, commonsense, and logic, and propose the E3-Score, a principled metric inspired by economic trade-off modeling that provides smooth, stable evaluation without discontinuities or heavy reliance on heuristics. Experiments show that no single method universally dominates; optimal strategies depend on backbone scale, task complexity, and architecture.

[35] Persona-Aware Alignment Framework for Personalized Dialogue Generation

Guanrong Li,Xinyu Liu,Zhen Wu,Xinyu Dai

Main category: cs.CL

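TL;DR: Proposes PAL, a new framework for personalized dialogue generation that uses a two-stage training method to make responses more persona-relevant.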

Details Motivation: Mainstream models often neglect the given persona information when generating personalized dialogue, producing generic and inconsistent responses. Method: Proposes the Persona-Aware Alignment Framework (PAL), which uses two-stage training (Persona-aware Learning and Persona Alignment) together with a "Select then Generate" inference strategy to strengthen persona relevance at the semantic level. Result: Experiments show that PAL outperforms many state-of-the-art personalized dialogue methods and large language models. Conclusion: By making persona alignment the training objective, PAL effectively improves persona relevance and consistency in dialogue generation. Abstract: Personalized dialogue generation aims to leverage persona profiles and dialogue history to generate persona-relevant and consistent responses. Mainstream models typically rely on token-level language model training with persona dialogue data, such as Next Token Prediction, to implicitly achieve personalization, which makes these methods prone to neglecting the given personas and generating generic responses. To address this issue, we propose a novel Persona-Aware Alignment Framework (PAL), which directly treats persona alignment as the training objective of dialogue generation. Specifically, PAL employs a two-stage training method including Persona-aware Learning and Persona Alignment, equipped with an easy-to-use inference strategy, Select then Generate, to improve persona sensitivity and generate more persona-relevant responses at the semantic level. Through extensive experiments, we demonstrate that our framework outperforms many state-of-the-art personalized dialogue methods and large language models.

[36] LangGPS: Language Separability Guided Data Pre-Selection for Joint Multilingual Instruction Tuning

Yangfan Ye,Xiaocheng Feng,Xiachong Feng,Lei Huang,Weitao Ma,Qichen Hong,Yunfei Lu,Duyu Tang,Dandan Tu,Bing Qin

Main category: cs.CL

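TL;DR: Proposes LangGPS, a lightweight two-stage pre-selection framework guided by language separability, to make data selection more effective when training multilingual LLMs.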

Details Motivation: Existing data selection methods usually overlook the intrinsic linguistic structure of multilingual data, while multilingual capability remains sensitive to the composition of the training data; a selection mechanism that attends to linguistic properties is therefore needed. Method: LangGPS first filters the training data by language separability scores computed in the model's representation space, then refines the subset with existing selection methods. Result: Experiments on six benchmarks and 22 languages show that LangGPS improves the effectiveness of existing selection methods, especially on understanding tasks and low-resource languages; highly separable samples help form clear language boundaries, while low-separability samples promote cross-lingual alignment. Language separability can also serve as an effective signal for multilingual curriculum learning. Conclusion: Language separability is an important indicator of multilingual data utility, and LangGPS offers a new approach to building more linguistically informed LLMs. Abstract: Joint multilingual instruction tuning is a widely adopted approach to improve the multilingual instruction-following ability and downstream performance of large language models (LLMs), but the resulting multilingual capability remains highly sensitive to the composition and selection of the training data. Existing selection methods, often based on features like text quality, diversity, or task relevance, typically overlook the intrinsic linguistic structure of multilingual data. In this paper, we propose LangGPS, a lightweight two-stage pre-selection framework guided by language separability, which quantifies how well samples in different languages can be distinguished in the model's representation space. LangGPS first filters training data based on separability scores and then refines the subset using existing selection methods. Extensive experiments across six benchmarks and 22 languages demonstrate that applying LangGPS on top of existing selection methods improves their effectiveness and generalizability in multilingual training, especially for understanding tasks and low-resource languages. Further analysis reveals that highly separable samples facilitate the formation of clearer language boundaries and support faster adaptation, while low-separability samples tend to function as bridges for cross-lingual alignment. We also find that language separability can serve as an effective signal for multilingual curriculum learning, where interleaving samples with diverse separability levels yields stable and generalizable gains. Together, we hope our work offers a new perspective on data utility in multilingual contexts and supports the development of more linguistically informed LLMs.

[37] VocalNet-M2: Advancing Low-Latency Spoken Language Modeling via Integrated Multi-Codebook Tokenization and Multi-Token Prediction

Yuhao Wang,Ziyang Cheng,Heyang Liu,Ronghua Wu,Qunshan Gu,Yanfeng Wang,Yu Wang

Main category: cs.CL

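TL;DR: Proposes VocalNet-M2, a low-latency end-to-end spoken language model that substantially reduces response latency through a multi-codebook tokenizer and a multi-token prediction strategy.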

Details Motivation: Existing end-to-end spoken language models suffer from high response latency, mainly because of the autoregressive generation of speech tokens and the use of complex flow-matching synthesis models. Method: Introduces a multi-codebook tokenizer and a multi-token prediction (MTP) strategy, generating multi-codebook speech tokens directly and avoiding a high-latency flow-matching model. Result: Experiments show that VocalNet-M2 reduces first-chunk latency from about 725ms to 350ms while keeping performance comparable to other mainstream SLMs. Conclusion: VocalNet-M2 effectively reduces speech generation latency, suits real-time interactive applications, and offers new insights for efficient SLM design. Abstract: Current end-to-end spoken language models (SLMs) have made notable progress, yet they still encounter considerable response latency. This delay primarily arises from the autoregressive generation of speech tokens and the reliance on complex flow-matching models for speech synthesis. To overcome this, we introduce VocalNet-M2, a novel low-latency SLM that integrates a multi-codebook tokenizer and a multi-token prediction (MTP) strategy. Our model directly generates multi-codebook speech tokens, thus eliminating the need for a latency-inducing flow-matching model. Furthermore, our MTP strategy enhances generation efficiency and improves overall performance. Extensive experiments demonstrate that VocalNet-M2 achieves a substantial reduction in first-chunk latency (from approximately 725ms to 350ms) while maintaining performance competitive with mainstream SLMs. This work also provides a comprehensive comparison of single-codebook and multi-codebook strategies, offering valuable insights for developing efficient and high-performance SLMs for real-time interactive applications.

[38] MTR-DuplexBench: Towards a Comprehensive Evaluation of Multi-Round Conversations for Full-Duplex Speech Language Models

He Zhang,Wenqian Cui,Haoning Xu,Xiaohui Li,Lei Zhu,Shaohua Ma,Irwin King

Main category: cs.CL

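TL;DR: Introduces MTR-DuplexBench, a new benchmark for evaluating full-duplex speech language models (FD-SLMs) in multi-round conversations, addressing gaps in existing evaluations around multi-round interaction, instruction following, and safety.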

Details Motivation: Existing FD-SLM evaluations focus mainly on single-round dialogue and basic conversational features, ignoring the complexities of multi-round interaction, such as blurred turn boundaries and context inconsistency, and lacking systematic assessment of key capabilities such as instruction following and safety. Method: Proposes MTR-DuplexBench, which segments continuous full-duplex dialogues into discrete turns to enable turn-by-turn evaluation across four dimensions: dialogue quality, conversational dynamics, instruction following, and safety. Result: Experiments show that current FD-SLMs struggle to maintain consistent performance in multi-round conversations, particularly in instruction following and safety, confirming the effectiveness and necessity of the benchmark. Conclusion: MTR-DuplexBench provides a more comprehensive and fine-grained evaluation framework for full-duplex speech language models, helping move the field toward more realistic and complex multi-round interaction. Abstract: Full-Duplex Speech Language Models (FD-SLMs) enable real-time, overlapping conversational interactions, offering a more dynamic user experience compared to traditional half-duplex models. However, existing benchmarks primarily focus on evaluating single-round interactions and conversational features, neglecting the complexities of multi-round communication and critical capabilities such as instruction following and safety. Evaluating FD-SLMs in multi-round settings poses significant challenges, including blurred turn boundaries in communication and context inconsistency during model inference. To address these gaps, we introduce MTR-DuplexBench, a novel benchmark that segments continuous full-duplex dialogues into discrete turns, enabling comprehensive, turn-by-turn evaluation of FD-SLMs across dialogue quality, conversational dynamics, instruction following, and safety. Experimental results reveal that current FD-SLMs face difficulties in maintaining consistent performance across multiple rounds and evaluation dimensions, highlighting the necessity and effectiveness of our proposed benchmark. The benchmark and code will be available in the future.

[39] Local Hybrid Retrieval-Augmented Document QA

Paolo Astrino

Main category: cs.CL

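TL;DR: Proposes a question-answering system that runs entirely locally, combining semantic understanding with keyword precision to achieve high accuracy while protecting data privacy, for legal, scientific, and conversational documents.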

Details Motivation: Organizations using cloud-based AI systems face a trade-off between data privacy and accuracy, especially with sensitive documents that require both security and effective retrieval. Method: Combines two complementary strategies, semantic understanding and keyword retrieval, in a system that runs entirely on local infrastructure and uses consumer-grade hardware acceleration for efficient retrieval. Result: Achieves competitive accuracy on legal, scientific, and conversational documents, with reliable answers to complex queries and few errors; all data stays local and no internet access is needed. Conclusion: Shows that privacy and performance can coexist in enterprise AI deployment, letting institutions such as banks, hospitals, and law firms adopt conversational document AI without transmitting proprietary information externally. Abstract: Organizations handling sensitive documents face a critical dilemma: adopt cloud-based AI systems that offer powerful question-answering capabilities but compromise data privacy, or maintain local processing that ensures security but delivers poor accuracy. We present a question-answering system that resolves this trade-off by combining semantic understanding with keyword precision, operating entirely on local infrastructure without internet access. Our approach demonstrates that organizations can achieve competitive accuracy on complex queries across legal, scientific, and conversational documents while keeping all data on their machines. By balancing two complementary retrieval strategies and using consumer-grade hardware acceleration, the system delivers reliable answers with minimal errors, letting banks, hospitals, and law firms adopt conversational document AI without transmitting proprietary information to external providers. This work establishes that privacy and performance need not be mutually exclusive in enterprise AI deployment.
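As an illustration of how semantic and keyword retrieval can be balanced, the sketch below ranks documents by a weighted sum of the two scores. This is an assumption about one common way to build a hybrid retriever, not the paper's actual method: the weight `alpha`, the term-overlap keyword score (a stand-in for something like BM25), and the precomputed toy embedding vectors are all hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def keyword_overlap(query, doc):
    """Fraction of query terms that appear in the document (toy keyword score)."""
    q_terms = set(query.lower().split())
    d_terms = set(doc.lower().split())
    return len(q_terms & d_terms) / len(q_terms) if q_terms else 0.0

def hybrid_rank(query, query_vec, docs, doc_vecs, alpha=0.5):
    """Rank docs by alpha * semantic score + (1 - alpha) * keyword score."""
    scored = []
    for doc, vec in zip(docs, doc_vecs):
        score = alpha * cosine(query_vec, vec) + (1 - alpha) * keyword_overlap(query, doc)
        scored.append((score, doc))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in scored]

# Toy corpus; the 2-d vectors are made-up stand-ins for a local embedding model.
docs = ["termination clause in the lease", "weather report for monday"]
doc_vecs = [[0.9, 0.1], [0.1, 0.9]]
ranking = hybrid_rank("lease termination", [1.0, 0.0], docs, doc_vecs)
```

Setting `alpha` closer to 1 favors semantic similarity, while values closer to 0 favor exact term matches, which tends to matter for identifiers and legal citations.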

[40] Rectify Evaluation Preference: Improving LLMs' Critique on Math Reasoning via Perplexity-aware Reinforcement Learning

Changyuan Tian,Zhicong Lu,Shuang Qian,Nayu Liu,Peiguang Li,Li Jin,Leiyi Hu,Zhizhao Zeng,Sirui Wang,Ke Zeng,Zhi Guo

Main category: cs.CL

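TL;DR: Proposes a perplexity-aware reinforcement learning algorithm that rectifies the imbalanced evaluation preference of LLMs in multi-step mathematical reasoning, improving their critiquing capability.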

Details Motivation: Existing methods rely mainly on high-quality supervised fine-tuning demonstrations to improve critiquing capability and overlook the root cause of LLMs' poor critiquing performance. This paper investigates a potential underlying factor, imbalanced evaluation preference, and corrects it. Method: Builds a One-to-many Problem-Solution (OPS) benchmark to quantify how LLMs behave differently when evaluating solutions generated by themselves versus by others; a perplexity-oriented statistical preference analysis finds that models tend to judge low-perplexity solutions as correct; proposes a perplexity-aware Group Relative Policy Optimization algorithm that guides the model to explore trajectories judging lower-perplexity solutions as wrong and higher-perplexity solutions as correct. Result: Experiments on the self-built OPS benchmark and existing critic benchmarks show that the method effectively improves the critiquing capability of LLMs. Conclusion: By identifying and rectifying the "imbalanced evaluation preference" in LLMs, the proposed perplexity-aware reinforcement learning method significantly improves critiquing capability in multi-step mathematical reasoning and offers a new direction for improving reasoning supervision. Abstract: To improve Multi-step Mathematical Reasoning (MsMR) of Large Language Models (LLMs), it is crucial to obtain scalable supervision from the corpus by automatically critiquing mistakes in the reasoning process of MsMR and rendering a final verdict of the problem-solution. Most existing methods rely on crafting high-quality supervised fine-tuning demonstrations for critiquing capability enhancement and pay little attention to delving into the underlying reason for the poor critiquing performance of LLMs. In this paper, we orthogonally quantify and investigate the potential reason -- imbalanced evaluation preference, and conduct a statistical preference analysis. Motivated by the analysis of the reason, a novel perplexity-aware reinforcement learning algorithm is proposed to rectify the evaluation preference, elevating the critiquing capability. Specifically, to probe into LLMs' critiquing characteristics, a One-to-many Problem-Solution (OPS) benchmark is meticulously constructed to quantify the behavior difference of LLMs when evaluating the problem solutions generated by themselves and by others. Then, to investigate the behavior difference in depth, we conduct a statistical preference analysis oriented on perplexity and find an intriguing phenomenon -- "LLMs incline to judge solutions with lower perplexity as correct", which is dubbed imbalanced evaluation preference. To rectify this preference, we regard perplexity as the baton in the algorithm of Group Relative Policy Optimization, supporting the LLMs to explore trajectories that judge lower perplexity as wrong and higher perplexity as correct. Extensive experimental results on our built OPS and existing available critic benchmarks demonstrate the validity of our method.
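Because the analysis above hinges on perplexity, a small worked example may help. The sketch uses the standard definition, the exponential of the negative mean token log-probability; how the paper feeds perplexity into Group Relative Policy Optimization is not specified in the abstract and is not modeled here. The function name and the toy log-probabilities are illustrative assumptions.

```python
import math

def perplexity(token_logprobs):
    """Perplexity = exp(-(1/N) * sum of natural-log token probabilities)."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

# Toy solutions: every token has probability 0.5 in one and 0.125 in the other.
confident_solution = [math.log(0.5)] * 4
hesitant_solution = [math.log(0.125)] * 4

ppl_confident = perplexity(confident_solution)  # about 2.0
ppl_hesitant = perplexity(hesitant_solution)    # about 8.0
```

Under the preference reported in the abstract, a critic model would be more likely to accept the first solution than the second, regardless of which one is actually correct.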

[41] BhashaKritika: Building Synthetic Pretraining Data at Scale for Indic Languages

Guduru Manoj,Neel Prabhanjan Rachamalla,Ashish Kulkarni,Gautam Rajeev,Jay Piplodiya,Arul Menezes,Shaharukh Khan,Souvik Rana,Manya Sah,Chandra Khatri,Shubham Agarwal

Main category: cs.CL

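TL;DR: Systematically studies the generation and evaluation of synthetic multilingual pretraining data for Indic languages, builds BhashaKritika, a large-scale dataset of 540B tokens, and proposes a modular quality evaluation pipeline for scalable, language-sensitive quality control.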

Details Motivation: In LLM pretraining, data scarcity for low-resource languages leads to uneven model development, so efficient methods for generating high-quality multilingual data are needed. Method: Generates synthetic data for 10 Indic languages using 5 techniques, grounding generation in documents, personas, and topics; compares translation of English content with native-language generation and analyzes how prompt language and document language affect data quality; designs an evaluation pipeline that integrates script and language detection, metadata checks, n-gram repetition analysis, and KenLM perplexity filtering. Result: Builds the 540B-token BhashaKritika dataset; experiments reveal trade-offs between generation strategies and find that document grounding and native-language prompts improve data quality. Conclusion: Native generation combined with multi-dimensional quality control effectively improves synthetic data quality for Indic languages, offering a viable path and best practices for LLM pretraining in low-resource languages. Abstract: In the context of pretraining of Large Language Models (LLMs), synthetic data has emerged as an alternative for generating high-quality pretraining data at scale. This is particularly beneficial in low-resource language settings where the benefits of recent LLMs have been unevenly distributed across languages. In this work, we present a systematic study on the generation and evaluation of synthetic multilingual pretraining data for Indic languages, where we construct a large-scale synthetic dataset BhashaKritika, comprising 540B tokens using 5 different techniques for 10 languages. We explore the impact of grounding generation in documents, personas, and topics. We analyze how language choice, both in the prompt instructions and document grounding, affects data quality, and we compare translations of English content with native generation in Indic languages. To support scalable and language-sensitive evaluation, we introduce a modular quality evaluation pipeline that integrates script and language detection, metadata consistency checks, n-gram repetition analysis, and perplexity-based filtering using KenLM models. Our framework enables robust quality control across diverse scripts and linguistic contexts. Empirical results through model runs reveal key trade-offs in generation strategies and highlight best practices for constructing effective multilingual corpora.
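One stage of such a pipeline, n-gram repetition analysis, can be sketched as follows. This is a minimal illustration under stated assumptions rather than the authors' implementation: the metric (share of n-grams that occur more than once), the whitespace tokenizer, and the default `n=3` are all choices made for this example, and whitespace splitting would need replacing for scripts that do not separate words with spaces.

```python
from collections import Counter

def ngram_repetition_ratio(text, n=3):
    """Share of word n-grams in `text` that belong to a repeated n-gram type."""
    tokens = text.split()  # assumption: whitespace tokenization is good enough
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    if not ngrams:
        return 0.0  # text shorter than n tokens
    counts = Counter(ngrams)
    repeated = sum(c for c in counts.values() if c > 1)
    return repeated / len(ngrams)

looping = ngram_repetition_ratio("a b c a b c a b c")        # degenerate loop
varied = ngram_repetition_ratio("the quick brown fox jumps")  # no repeats
```

A document whose ratio exceeds some threshold could then be dropped or down-weighted; the threshold itself would have to be tuned per language and is not given in the abstract.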

[42] Knowledge Graphs Generation from Cultural Heritage Texts: Combining LLMs and Ontological Engineering for Scholarly Debates

Andrea Schimmenti,Valentina Pasqual,Fabio Vitali,Marieke van Erp

Main category: cs.CL

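TL;DR: Presents ATR4CH, a systematic five-step, LLM-based methodology for turning Cultural Heritage texts into Knowledge Graphs; a Wikipedia case study shows strong performance on metadata, entity, hypothesis, and evidence extraction, supporting knowledge structuring and querying for Cultural Heritage institutions.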

Details Motivation: Cultural Heritage texts contain rich but unstructured knowledge that is hard to query systematically, and existing methods lack a systematic framework coordinated with domain ontologies. Method: Proposes the five-step ATR4CH methodology: foundational analysis, annotation schema development, pipeline architecture, integration refinement, and evaluation; combines annotation models, ontological frameworks, and several LLMs (Claude Sonnet 3.7, Llama 3.3 70B, GPT-4o-mini) in a sequential extraction pipeline. Result: On Wikipedia articles about disputed items, metadata extraction reaches 0.96-0.99 F1, entity recognition 0.7-0.8, hypothesis extraction 0.65-0.75, and evidence extraction 0.95-0.97, with a G-EVAL score of 0.62 for discourse representation; smaller models perform competitively and are cost-effective. Conclusion: ATR4CH is the first methodology to systematically combine LLMs with Cultural Heritage ontologies for knowledge extraction, providing a replicable, adaptable framework that helps Cultural Heritage institutions structure and apply textual knowledge. Abstract: Cultural Heritage texts contain rich knowledge that is difficult to query systematically due to the challenges of converting unstructured discourse into structured Knowledge Graphs (KGs). This paper introduces ATR4CH (Adaptive Text-to-RDF for Cultural Heritage), a systematic five-step methodology for Large Language Model-based Knowledge Extraction from Cultural Heritage documents. We validate the methodology through a case study on authenticity assessment debates. Methodology - ATR4CH combines annotation models, ontological frameworks, and LLM-based extraction through iterative development: foundational analysis, annotation schema development, pipeline architecture, integration refinement, and comprehensive evaluation. We demonstrate the approach using Wikipedia articles about disputed items (documents, artifacts, etc.), implementing a sequential pipeline with three LLMs (Claude Sonnet 3.7, Llama 3.3 70B, GPT-4o-mini). Findings - The methodology successfully extracts complex Cultural Heritage knowledge: 0.96-0.99 F1 for metadata extraction, 0.7-0.8 F1 for entity recognition, 0.65-0.75 F1 for hypothesis extraction, 0.95-0.97 for evidence extraction, and 0.62 G-EVAL for discourse representation. Smaller models performed competitively, enabling cost-effective deployment. Originality - This is the first systematic methodology for coordinating LLM-based extraction with Cultural Heritage ontologies. ATR4CH provides a replicable framework adaptable across CH domains and institutional resources. Research Limitations - The produced KG is limited to Wikipedia articles. While the results are encouraging, human oversight is necessary during post-processing. Practical Implications - ATR4CH enables Cultural Heritage institutions to systematically convert textual knowledge into queryable KGs, supporting automated metadata enrichment and knowledge discovery.

[43] TruthfulRAG: Resolving Factual-level Conflicts in Retrieval-Augmented Generation with Knowledge Graphs

Shuyi Liu,Yuming Shang,Xi Zhang

Main category: cs.CL

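TL;DR: Proposes TruthfulRAG, the first framework to use Knowledge Graphs (KGs) to resolve factual-level knowledge conflicts in Retrieval-Augmented Generation (RAG) systems. It builds a KG from triples extracted from retrieved content and combines query-based graph retrieval with entropy-based filtering to precisely locate conflicts and reduce factual inconsistencies, improving the accuracy and trustworthiness of RAG systems. Experiments show it outperforms existing methods.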

Details Motivation: As external knowledge repositories expand and the parametric knowledge inside LLMs becomes outdated, RAG systems face factual conflicts between retrieved information and the model's internal knowledge; existing methods handle conflicts at the semantic or token level and struggle to form a holistic understanding of factual discrepancies, which particularly hurts accuracy and reliability on knowledge-intensive tasks. Method: Proposes the TruthfulRAG framework: 1) systematically extract triples from retrieved content to build a knowledge graph; 2) use query-based graph retrieval to obtain relevant knowledge; 3) apply entropy-based filtering to precisely locate conflicting elements and mitigate factual inconsistencies. Result: Experiments show that TruthfulRAG outperforms existing methods on multiple knowledge-intensive tasks, effectively alleviating knowledge conflicts and improving the faithfulness of generated content and the robustness of the system. Conclusion: By introducing knowledge graphs, TruthfulRAG achieves fine-grained modeling and resolution of factual-level conflicts in RAG systems, significantly improving the accuracy and trustworthiness of generated results and offering a new approach to building more reliable RAG systems. Abstract: Retrieval-Augmented Generation (RAG) has emerged as a powerful framework for enhancing the capabilities of Large Language Models (LLMs) by integrating retrieval-based methods with generative models. As external knowledge repositories continue to expand and the parametric knowledge within models becomes outdated, a critical challenge for RAG systems is resolving conflicts between retrieved external information and LLMs' internal knowledge, which can significantly compromise the accuracy and reliability of generated content. However, existing approaches to conflict resolution typically operate at the token or semantic level, often leading to fragmented and partial understanding of factual discrepancies between LLMs' knowledge and context, particularly in knowledge-intensive tasks. To address this limitation, we propose TruthfulRAG, the first framework that leverages Knowledge Graphs (KGs) to resolve factual-level knowledge conflicts in RAG systems. Specifically, TruthfulRAG constructs KGs by systematically extracting triples from retrieved content, utilizes query-based graph retrieval to identify relevant knowledge, and employs entropy-based filtering mechanisms to precisely locate conflicting elements and mitigate factual inconsistencies, thereby enabling LLMs to generate faithful and accurate responses. Extensive experiments reveal that TruthfulRAG outperforms existing methods, effectively alleviating knowledge conflicts and improving the robustness and trustworthiness of RAG systems.

[44] Position: On the Methodological Pitfalls of Evaluating Base LLMs for Reasoning

Jason Chan,Zhixue Zhao,Robert Gaizauskas

Main category: cs.CL

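TL;DR: Discusses methodological problems in evaluating the reasoning capabilities of base LLMs, pointing to a fundamental mismatch between their pretraining objective and the standards by which reasoning is assessed, and calls for a re-examination of the implicit assumptions in existing work.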

Details Motivation: Existing studies often overlook methodological flaws in evaluating base LLMs for reasoning; this paper aims to expose these neglected issues. Method: Analyzes the relationship between base LLMs' pretraining objective and the logical correctness of their outputs, which is coincidental, to argue that evaluating their reasoning is methodologically limited. Result: Base LLMs generate conclusions as a result of following statistical language patterns rather than genuinely trying to reason correctly, so their outputs should not be treated as bona fide reasoning attempts. Conclusion: Reasoning evaluations of base LLMs are fundamentally problematic, their conclusions are hard to generalize to post-trained, instruction-optimized LLMs, and the assumptions behind existing studies warrant caution. Abstract: Existing work investigates the reasoning capabilities of large language models (LLMs) to uncover their limitations, human-like biases and underlying processes. Such studies include evaluations of base LLMs (pre-trained on unlabeled corpora only) for this purpose. Our position paper argues that evaluating base LLMs' reasoning capabilities raises inherent methodological concerns that are overlooked in such existing studies. We highlight the fundamental mismatch between base LLMs' pretraining objective and normative qualities, such as correctness, by which reasoning is assessed. In particular, we show how base LLMs generate logically valid or invalid conclusions as coincidental byproducts of conforming to purely linguistic patterns of statistical plausibility. This fundamental mismatch challenges the assumptions that (a) base LLMs' outputs can be assessed as their bona fide attempts at correct answers or conclusions; and (b) conclusions about base LLMs' reasoning can generalize to post-trained LLMs optimized for successful instruction-following. We call for a critical re-examination of existing work that relies implicitly on these assumptions, and for future work to account for these methodological pitfalls.

[45] DELICATE: Diachronic Entity LInking using Classes And Temporal Evidence

Cristian Santini,Sebastian Barzaghi,Paolo Sernani,Emanuele Frontoni,Mehwish Alam

Main category: cs.CL

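TL;DR: Presents DELICATE, a neuro-symbolic method combining a BERT-based encoder with contextual information from Wikidata, and ENEIDE, a multi-domain entity linking corpus in historical Italian, to address entity linking challenges in the humanities that arise from complex document typologies, a lack of domain-specific datasets, and long-tail entities.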

Details Motivation: Entity linking remains challenging in the humanities because of complex document typologies, a lack of domain-specific datasets and models, and long-tail entities that are under-represented in knowledge bases. Method: Proposes DELICATE, which combines a BERT-based encoder with contextual information from Wikidata and selects knowledge base entities using temporal plausibility and entity type consistency; also builds the ENEIDE corpus, covering literary and political texts from the 19th to the 20th century. Result: DELICATE outperforms other models on entity linking in historical Italian, including larger architectures with billions of parameters, and is more explainable and feature-sensitive. Conclusion: By combining neural and symbolic methods, DELICATE achieves high-performing, interpretable entity linking for a low-resource historical language, and the ENEIDE corpus is a valuable resource for future research. Abstract: In spite of the remarkable advancements in the field of Natural Language Processing, the task of Entity Linking (EL) remains challenging in the field of humanities due to complex document typologies, lack of domain-specific datasets and models, and long-tail entities, i.e., entities under-represented in Knowledge Bases (KBs). The goal of this paper is to address these issues with two main contributions. The first contribution is DELICATE, a novel neuro-symbolic method for EL on historical Italian which combines a BERT-based encoder with contextual information from Wikidata to select appropriate KB entities using temporal plausibility and entity type consistency. The second contribution is ENEIDE, a multi-domain EL corpus in historical Italian semi-automatically extracted from two annotated editions spanning from the 19th to the 20th century and including literary and political texts. Results show that DELICATE outperforms other EL models in historical Italian even when compared with larger architectures with billions of parameters. Moreover, further analyses reveal that DELICATE's confidence scores and feature sensitivity provide results which are more explainable and interpretable than purely neural methods.

[46] Analogical Structure, Minimal Contextual Cues and Contrastive Distractors: Input Design for Sample-Efficient Linguistic Rule Induction

Chunyang Jiang,Paola Merlo

Main category: cs.CL

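TL;DR: Proposes a cognitively inspired approach based on analogical paradigm organization that lets lightweight models, trained on only one hundred examples, outperform a zero-shot large model on a linguistic rule learning task.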

Details Motivation: Explores whether analogical paradigm organization can let lightweight models match large-model performance with minimal data. Method: Combines three cognitively inspired principles (analogical structure, contrastive learning, and minimal contextual cues) into a computational approach and trains lightweight models on structured completion tasks. Result: With only 100 examples, a lightweight model (BERT+CNN) reaches F1=0.95 on the causative/inchoative alternation task, exceeding zero-shot GPT-o3 at F1=0.87; ablation studies confirm the contribution of each component, and cross-phenomenon validation shows the approach is robust. Conclusion: Analogical paradigm organization enables efficient linguistic rule learning with orders of magnitude less data, offering a new path for few-shot learning with lightweight models. Abstract: Large language models achieve strong performance through training on vast datasets. Can analogical paradigm organization enable lightweight models to match this performance with minimal data? We develop a computational approach implementing three cognitive-inspired principles: analogical structure, contrastive learning, and minimal contextual cues. We test this approach with structured completion tasks where models identify correct sentence completions from analogical patterns with contrastive alternatives. Training lightweight models (BERT+CNN, 0.5M parameters) on only one hundred structured examples of English causative/inchoative alternations achieves F1=0.95, outperforming zero-shot GPT-o3 (F1=0.87). Ablation studies confirm that analogical organization and contrastive structure improve performance, consistently surpassing randomly shuffled baselines across architectures. Cross-phenomenon validation using unspecified object alternations replicates these efficiency gains, confirming approach robustness. Our results show that analogical paradigm organization enables competitive linguistic rule learning with orders of magnitude less data than conventional approaches require.

[47] Reasoning About Intent for Ambiguous Requests

Irina Saparina,Mirella Lapata

Main category: cs.CL

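TL;DR: Proposes handling ambiguous requests by generating multiple interpretation-answer pairs, training models with reinforcement learning and customized reward functions; the approach covers more valid answers than baseline methods.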

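Details Motivation: When faced with ambiguous requests, large language models usually commit implicitly to one interpretation, which can lead to intent misunderstandings, user frustration, and safety risks; a more transparent and accurate way to handle ambiguity is needed. Method: Trains models with reinforcement learning and customized reward functions, using multiple valid answers as supervision, so that the model outputs a structured response containing multiple interpretation-answer pairs in a single generation. Result: Experiments on conversational question answering and semantic parsing show that the method covers more valid answers than baselines, and human evaluation finds that predicted interpretations are highly aligned with their answers. Conclusion: The method improves transparency through explicit interpretations, stays efficient by requiring only one generation step, and supports downstream applications through structured output, improving how models respond to ambiguous requests. Abstract: Large language models often respond to ambiguous requests by implicitly committing to one interpretation. Intent misunderstandings can frustrate users and create safety risks. To address this, we propose generating multiple interpretation-answer pairs in a single structured response to ambiguous requests. Our models are trained with reinforcement learning and customized reward functions using multiple valid answers as supervision. Experiments on conversational question answering and semantic parsing demonstrate that our method achieves higher coverage of valid answers than baseline approaches. Human evaluation confirms that predicted interpretations are highly aligned with their answers. Our approach promotes transparency with explicit interpretations, achieves efficiency by requiring only one generation step, and supports downstream applications through its structured output format.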
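The notion of "coverage of valid answers" can be sketched as the fraction of reference answers that appear among the answers in a structured response. The abstract does not define the metric or the response schema, so both are assumptions here: the `interpretation`/`answer` field names, the exact-match comparison after light normalization, and the example request are hypothetical.

```python
def normalize(text):
    """Lowercase and collapse whitespace so trivially different strings match."""
    return " ".join(text.lower().split())

def answer_coverage(pairs, valid_answers):
    """Fraction of valid answers matched by some predicted interpretation-answer pair."""
    predicted = {normalize(p["answer"]) for p in pairs}
    gold = {normalize(a) for a in valid_answers}
    if not gold:
        return 0.0
    return len(gold & predicted) / len(gold)

# Ambiguous request: "When were the Olympics held in London?"
response = [
    {"interpretation": "the most recent London Games", "answer": "2012"},
    {"interpretation": "the first post-war Games", "answer": "1948"},
]
coverage = answer_coverage(response, ["1908", "1948", "2012"])  # 2 of 3 covered
```

A quantity like this could serve as one term in a reward function, since it rewards a single response for surfacing several valid readings of the request instead of committing to one.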

[48] Exploring State Tracking Capabilities of Large Language Models

Kiamehr Rezaee,Jose Camacho-Collados,Mohammad Taher Pilehvar

Main category: cs.CL

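TL;DR: Studies how LLMs perform on state tracking and proposes a benchmark built on three well-defined tasks. Results show that newer-generation models such as GPT-4 and Llama3 can track state effectively when combined with mechanisms such as Chain of Thought, while previous-generation models often fail after multiple steps.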

Details Motivation: To assess the ability of LLMs on state tracking tasks that require sustained reasoning, while isolating other confounding factors. Method: Designs a benchmark of three well-defined state tracking tasks and evaluates different LLMs across scenarios, in particular when combined with prompting techniques such as Chain of Thought. Result: The latest generation of models, such as GPT-4 and Llama3, can complete state tracking tasks effectively, especially with Chain of Thought; earlier models understand the task and succeed at the initial stages but often fail after many reasoning steps. Conclusion: State tracking remains challenging for LLMs, there is a clear performance gap between model generations, and prompting mechanisms such as Chain of Thought help. Abstract: Large Language Models (LLMs) have demonstrated impressive capabilities in solving complex tasks, including those requiring a certain level of reasoning. In this paper, we focus on state tracking, a problem where models need to keep track of the state governing a number of entities. To isolate the state tracking component from other factors, we propose a benchmark based on three well-defined state tracking tasks and analyse the performance of LLMs in different scenarios. The results indicate that recent-generation LLMs (specifically, GPT-4 and Llama3) are capable of tracking state, especially when integrated with mechanisms such as Chain of Thought. However, models from the former generation, while understanding the task and being able to solve it at the initial stages, often fail at this task after a certain number of steps.

[49] LocalBench: Benchmarking LLMs on County-Level Local Knowledge and Reasoning

Zihan Gao,Yifei Xu,Jacob Thebault-Spieker

Main category: cs.CL

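TL;DR: Introduces LocalBench, the first benchmark to systematically evaluate LLMs on county-level local knowledge in the United States, revealing serious shortcomings of current models in handling hyper-local knowledge.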

Details Motivation: Real-world applications increasingly need AI that understands community-specific dynamics, but existing benchmarks cannot capture the complexity of fine-grained local knowledge, so a more precise evaluation tool is needed. Method: Grounded in the Localness Conceptual Framework, builds the LocalBench dataset of 14,782 validated question-answer pairs covering 526 counties in 49 states, drawing on Census data, local subreddits, and regional news, and evaluates 13 mainstream LLMs in closed-book and web-augmented settings. Result: The best models reach only 56.8% accuracy on narrative-style questions and stay below 15.5% on numerical reasoning; larger models and added search do not always help: search improves Gemini by 13.6% but lowers GPT-series performance by 11.4%. Conclusion: Current LLMs have significant limitations in handling hyper-local knowledge, and "place-aware" AI systems that understand local contexts fairly and accurately are urgently needed. Abstract: Large language models (LLMs) have been widely evaluated on macro-scale geographic tasks, such as global factual recall, event summarization, and regional reasoning. Yet, their ability to handle hyper-local knowledge remains poorly understood. This gap is increasingly consequential as real-world applications, from civic platforms to community journalism, demand AI systems that can reason about neighborhood-specific dynamics, cultural narratives, and local governance. Existing benchmarks fall short in capturing this complexity, often relying on coarse-grained data or isolated references. We present LocalBench, the first benchmark designed to systematically evaluate LLMs on county-level local knowledge across the United States. Grounded in the Localness Conceptual Framework, LocalBench includes 14,782 validated question-answer pairs across 526 U.S. counties in 49 states, integrating diverse sources such as Census statistics, local subreddit discourse, and regional news. It spans physical, cognitive, and relational dimensions of locality. Using LocalBench, we evaluate 13 state-of-the-art LLMs under both closed-book and web-augmented settings. Our findings reveal critical limitations: even the best-performing models reach only 56.8% accuracy on narrative-style questions and perform below 15.5% on numerical reasoning. Moreover, larger model size and web augmentation do not guarantee better performance; for example, search improves Gemini's accuracy by 13.6% but reduces GPT-series performance by 11.4%. These results underscore the urgent need for language models that can support equitable, place-aware AI systems capable of engaging with the diverse, fine-grained realities of local communities across geographic and cultural contexts.

[50] Beyond Elicitation: Provision-based Prompt Optimization for Knowledge-Intensive Tasks

Yunzhe Xu,Zhuosheng Zhang,Zhe Liu

Main category: cs.CL

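TL;DR: Proposes KPPO, a knowledge-provision-based prompt optimization framework that uses knowledge gap filling, batch-wise candidate evaluation, and adaptive knowledge pruning to improve performance on knowledge-intensive tasks by about 6% on average while cutting token usage by up to 29%.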

Details Motivation: Existing elicitation-based prompting methods are confined to the model's fixed parameters on knowledge-intensive tasks and cannot supply the specialized knowledge, terminology precision, and reasoning patterns these tasks require. Method: Reformulates prompt optimization as systematic knowledge integration and proposes the KPPO framework, comprising knowledge gap identification and filling, batch-wise evaluation that weighs both performance improvement and distributional stability, and an adaptive knowledge pruning strategy that balances performance and token efficiency. Result: On 15 knowledge-intensive benchmarks from various domains, KPPO improves on the strongest baseline by about 6% on average with comparable or lower token consumption. Conclusion: By actively supplying domain knowledge rather than relying only on eliciting the model's latent capabilities, KPPO significantly improves language model performance on knowledge-intensive tasks and offers a new paradigm for prompt optimization. Abstract: While prompt optimization has emerged as a critical technique for enhancing language model performance, existing approaches primarily focus on elicitation-based strategies that search for optimal prompts to activate models' capabilities. These methods exhibit fundamental limitations when addressing knowledge-intensive tasks, as they operate within fixed parametric boundaries rather than providing the factual knowledge, terminology precision, and reasoning patterns required in specialized domains. To address these limitations, we propose Knowledge-Provision-based Prompt Optimization (KPPO), a framework that reformulates prompt optimization as systematic knowledge integration rather than potential elicitation. KPPO introduces three key innovations: 1) a knowledge gap filling mechanism for knowledge gap identification and targeted remediation; 2) a batch-wise candidate evaluation approach that considers both performance improvement and distributional stability; 3) an adaptive knowledge pruning strategy that balances performance and token efficiency, reducing token usage by up to 29%. Extensive evaluation on 15 knowledge-intensive benchmarks from various domains demonstrates KPPO's superiority over elicitation-based methods, with an average performance improvement of ~6% over the strongest baseline while achieving comparable or lower token consumption. Code at: https://github.com/xyz9911/KPPO.

[51] Rubric-Based Benchmarking and Reinforcement Learning for Advancing LLM Instruction Following

Yun He,Wenzhe Li,Hejia Zhang,Songlin Li,Karishma Mandyam,Sopan Khosla,Yuanhao Xiong,Nanshu Wang,Selina Peng,Beibin Li,Shengjie Bi,Shishir G. Patil,Qi Qi,Shengyu Feng,Julian Katz-Samuels,Richard Yuanzhe Pang,Sujan Gonugondla,Hunter Lang,Yue Yu,Yundi Qian,Maryam Fazel-Zarandi,Licheng Yu,Amine Benhalloum,Hany Awadalla,Manaal Faruqui

Main category: cs.CL

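TL;DR: This paper introduces the AdvancedIF benchmark and the RIFL training method, which uses rubric-based reinforcement learning to substantially improve LLM performance on complex instruction-following tasks.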

Details Motivation: LLMs still struggle with complex, multi-turn, and system-level instruction following, and the field lacks high-quality human-annotated benchmarks and reliable reward signals for evaluating and training these capabilities. Method: Introduces AdvancedIF, a benchmark with over 1,600 prompts and expert-designed rubrics, and proposes the RIFL training framework, which combines rubric generation, a finetuned rubric verifier, and reward shaping to enable rubric-based reinforcement learning. Result: RIFL achieves a 6.7% absolute gain on AdvancedIF and performs strongly on public benchmarks; ablation studies confirm the effectiveness of each component. Conclusion: Rubrics are an effective tool for both training and evaluating advanced instruction following in LLMs, laying the groundwork for more capable and reliable AI systems. Abstract: Recent progress in large language models (LLMs) has led to impressive performance on a range of tasks, yet advanced instruction following (IF), especially for complex, multi-turn, and system-prompted instructions, remains a significant challenge. Rigorous evaluation and effective training for such capabilities are hindered by the lack of high-quality, human-annotated benchmarks and reliable, interpretable reward signals. In this work, we introduce AdvancedIF (we will release this benchmark soon), a comprehensive benchmark featuring over 1,600 prompts and expert-curated rubrics that assess LLMs' ability to follow complex, multi-turn, and system-level instructions. We further propose RIFL (Rubric-based Instruction-Following Learning), a novel post-training pipeline that leverages rubric generation, a finetuned rubric verifier, and reward shaping to enable effective reinforcement learning for instruction following. Extensive experiments demonstrate that RIFL substantially improves the instruction-following abilities of LLMs, achieving a 6.7% absolute gain on AdvancedIF and strong results on public benchmarks. Our ablation studies confirm the effectiveness of each component in RIFL. This work establishes rubrics as a powerful tool for both training and evaluating advanced IF in LLMs, paving the way for more capable and reliable AI systems.

[52] LOCA-R: Near-Perfect Performance on the Chinese Physics Olympiad 2025

Dong-Shan Jian,Xiang Li,Chen-Xu Yan,Hui-Wen Zheng,Zhi-Zhang Bian,You-Le Fang,Sheng-Qi Zhang,Bing-Rui Gong,Ren-Xi He,Jing-Tian Zhang,Ce Meng,Yan-Qing Ma

Main category: cs.CL

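TL;DR: This paper presents LOCA-R (LOgical Chain Augmentation for Reasoning), an improved framework for complex reasoning, and applies it to the theory examination of the 2025 Chinese Physics Olympiad, where it scores a near-perfect 313/320, surpassing the top human competitor and all baseline methods.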

Details Motivation: Olympiad-level physics problem solving is highly challenging for AI because it demands precise calculation, abstract reasoning, and a deep grasp of physical principles; the difficulty of the Chinese Physics Olympiad (CPhO) makes it an ideal testbed for these advanced capabilities. Method: Improves the LOCA framework into LOCA-R, which augments logical reasoning chains to handle complex physics problem solving, and applies it to the CPhO 2025 theory problems. Result: LOCA-R scores 313 out of 320 on the CPhO 2025 theory examination, exceeding the highest-scoring human competitor and significantly outperforming all baseline models. Conclusion: LOCA-R performs strongly on highly difficult physics reasoning tasks, demonstrating the potential of AI for complex scientific problem solving. Abstract: Olympiad-level physics problem-solving presents a significant challenge for both humans and artificial intelligence (AI), as it requires a sophisticated integration of precise calculation, abstract reasoning, and a fundamental grasp of physical principles. The Chinese Physics Olympiad (CPhO), renowned for its complexity and depth, serves as an ideal and rigorous testbed for these advanced capabilities. In this paper, we introduce LOCA-R (LOgical Chain Augmentation for Reasoning), an improved version of the LOCA framework adapted for complex reasoning, and apply it to the CPhO 2025 theory examination. LOCA-R achieves a near-perfect score of 313 out of 320 points, solidly surpassing the highest-scoring human competitor and significantly outperforming all baseline methods.

[53] Say It Differently: Linguistic Styles as Jailbreak Vectors

Srikant Panda,Avinash Rai

Main category: cs.CL

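TL;DR: This study systematically examines how linguistic styles (such as fear or curiosity) can reframe harmful intent and bypass the safety guardrails of aligned models, finds that stylistic rewriting raises jailbreak success rates by up to 57 percentage points, and proposes a style neutralization preprocessing step as mitigation.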

Details Motivation: Existing robustness evaluations of LLMs mostly target semantically equivalent paraphrases and overlook linguistic style variation as an attack surface; this paper aims to expose the systematic effect of linguistic style on model safety. Method: Using handcrafted templates and LLM-based rewrites, prompts from three standard datasets are converted into 11 distinct linguistic styles to build a style-augmented jailbreak benchmark, which is evaluated on 16 mainstream instruction-tuned models; a style neutralization preprocessing step using a secondary LLM is proposed as a defense. Result: Stylistic rewriting significantly raises jailbreak success rates (by up to +57 percentage points); fearful, curious, and compassionate styles are the most effective, and contextualized LLM rewrites outperform templated ones. The proposed style neutralization significantly reduces jailbreak success rates. Conclusion: Linguistic style is an overlooked but serious safety vulnerability that current safety mechanisms do not withstand, so safety pipelines need defenses that account for linguistic style. Abstract: Large Language Models (LLMs) are commonly evaluated for robustness against paraphrased or semantically equivalent jailbreak prompts, yet little attention has been paid to linguistic variation as an attack surface. In this work, we systematically study how linguistic styles such as fear or curiosity can reframe harmful intent and elicit unsafe responses from aligned models. We construct a style-augmented jailbreak benchmark by transforming prompts from 3 standard datasets into 11 distinct linguistic styles using handcrafted templates and LLM-based rewrites, while preserving semantic intent. Evaluating 16 open- and closed-source instruction-tuned models, we find that stylistic reframing increases jailbreak success rates by up to +57 percentage points. Styles such as fearful, curious and compassionate are most effective and contextualized rewrites outperform templated variants. To mitigate this, we introduce a style neutralization preprocessing step using a secondary LLM to strip manipulative stylistic cues from user inputs, significantly reducing jailbreak success rates. Our findings reveal a systemic and scaling-resistant vulnerability overlooked in current safety pipelines.

[54] Convomem Benchmark: Why Your First 150 Conversations Don't Need RAG

Egor Pakhomov,Erik Nijkamp,Caiming Xiong

Main category: cs.CL

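TL;DR: This paper introduces a comprehensive benchmark for conversational memory evaluation containing 75,336 question-answer pairs across categories including user facts, assistant recall, preferences, and temporal changes. It examines the relationship between conversational memory and retrieval-augmented generation (RAG), shows that simple full-context approaches outperform sophisticated RAG systems in small-corpus settings, and argues that conversational memory deserves dedicated study rather than direct reuse of general RAG solutions.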

Details Motivation: Existing conversational memory benchmarks are limited in statistical power, data generation consistency, and evaluation flexibility, and they do not adequately account for the way memory systems grow progressively with each conversation; a more comprehensive benchmark and a fresh look at the applicability of RAG are therefore needed. Method: Builds a large-scale, multi-category conversational memory benchmark, compares approaches such as full-context models and the RAG-based system Mem0 across conversation history lengths, and identifies performance transition points. Result: Simple full-context approaches reach 70-82% accuracy on challenging cases, while RAG systems such as Mem0 reach only 30-45%; long context excels for the first 30 conversations, remains viable up to 150, and beyond that hybrid or RAG approaches are needed. Conclusion: Because conversational memory starts from zero and grows progressively, it has distinctive advantages at small scale and should be studied in its own right rather than handled by directly applying conventional RAG frameworks. Abstract: We introduce a comprehensive benchmark for conversational memory evaluation containing 75,336 question-answer pairs across diverse categories including user facts, assistant recall, abstention, preferences, temporal changes, and implicit connections. While existing benchmarks have advanced the field, our work addresses fundamental challenges in statistical power, data generation consistency, and evaluation flexibility that limit current memory evaluation frameworks. We examine the relationship between conversational memory and retrieval-augmented generation (RAG). While these systems share fundamental architectural patterns--temporal reasoning, implicit extraction, knowledge updates, and graph representations--memory systems have a unique characteristic: they start from zero and grow progressively with each conversation. This characteristic enables naive approaches that would be impractical for traditional RAG. Consistent with recent findings on long context effectiveness, we observe that simple full-context approaches achieve 70-82% accuracy even on our most challenging multi-message evidence cases, while sophisticated RAG-based memory systems like Mem0 achieve only 30-45% when operating on conversation histories under 150 interactions. Our analysis reveals practical transition points: long context excels for the first 30 conversations, remains viable with manageable trade-offs up to 150 conversations, and typically requires hybrid or RAG approaches beyond that point as costs and latencies become prohibitive.
These patterns indicate that the small-corpus advantage of conversational memory--where exhaustive search and complete reranking are feasible--deserves dedicated research attention rather than simply applying general RAG solutions to conversation histories.
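
The transition points reported in this abstract can be expressed as a small routing rule. The sketch below is only an illustration: the 30 and 150 cutoffs come from the abstract, while the function name and the strategy labels are hypothetical and not part of the benchmark.

```python
def choose_memory_strategy(num_conversations: int) -> str:
    """Pick a memory approach from the number of stored conversations.

    The 30 and 150 cutoffs are the transition points reported in the
    abstract; the returned labels are hypothetical names for illustration.
    """
    if num_conversations < 0:
        raise ValueError("num_conversations must be non-negative")
    if num_conversations <= 30:
        # Long context excels for the first 30 conversations.
        return "full_context"
    if num_conversations <= 150:
        # Still viable, with cost and latency trade-offs to manage.
        return "full_context_with_tradeoffs"
    # Beyond 150, costs and latencies typically become prohibitive.
    return "hybrid_or_rag"
```

A deployment would likely tune these cutoffs to its own context window, cost, and latency budget rather than treat them as fixed.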

[55] Computing the Formal and Institutional Boundaries of Contemporary Genre and Literary Fiction

Natasha Johnson

Main category: cs.CL

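TL;DR: This study uses computational methods to analyze formal and institutional differences between literary fiction and genre fiction (romance, mystery, science fiction), finds statistically significant formal markers for each category, and shows that female authorship makes literary status harder to attain.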

Details Motivation: Examines whether, in a contemporary setting, genre is better understood as a formal designation or an institutional one, and how author gender affects literary classification. Method: Builds a corpus of literary and genre fiction from Andrew Piper's CONLIT dataset, uses Welch's ANOVA to compare the distribution of narrative features by author gender across genres, uses logistic regression to model each feature's effect on literary classification and how gender moderates it, and analyzes stylistic and semantic vector representations. Result: Every literary category shows statistically significant formal markers, and for female authors the target for achieving literary status is narrower and blurrier. Conclusion: Genre has a formal basis but is also shaped by institutional factors; author gender plays an important role in literary evaluation, and writing by women faces a higher bar for literary recognition. Abstract: Though the concept of genre has been a subject of discussion for millennia, the relatively recent emergence of genre fiction has added a new layer to this ongoing conversation. While more traditional perspectives on genre have emphasized form, contemporary scholarship has invoked both formal and institutional characteristics in its taxonomy of genre, genre fiction, and literary fiction. This project uses computational methods to explore the soundness of genre as a formal designation as opposed to an institutional one. Pulling from Andrew Piper's CONLIT dataset of Contemporary Literature, we assemble a corpus of literary and genre fiction, with the latter category containing romance, mystery, and science fiction novels. We use Welch's ANOVA to compare the distribution of narrative features according to author gender within each genre and within genre versus literary fiction. Then, we use logistic regression to model the effect that each feature has on literary classification and to measure how author gender moderates these effects. Finally, we analyze stylistic and semantic vector representations of our genre categories to understand the importance of form and content in literary classification. This project finds statistically significant formal markers of each literary category and illustrates how female authorship narrows and blurs the target for achieving literary status.

[56] URaG: Unified Retrieval and Generation in Multimodal LLMs for Efficient Long Document Understanding

Yongxin Shi,Jiapeng Wang,Zeyu Shan,Dezhi Peng,Zening Lin,Lianwen Jin

Main category: cs.CL

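TL;DR: Proposes URaG, a framework that unifies retrieval and generation inside a single multimodal LLM by using the early Transformer layers as a cross-modal evidence selector, enabling efficient long document understanding.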

Details Motivation: Existing methods for long documents suffer from information interference and high computational cost, and they either sacrifice fine-grained detail or add system complexity that prevents end-to-end optimization. Method: Building on the coarse-to-fine reasoning pattern observed in MLLMs, designs a lightweight cross-modal retrieval module that turns the early Transformer layers into an evidence selector, which identifies and keeps the most relevant pages. Result: Experiments show that URaG reaches state-of-the-art performance while reducing computational overhead by 44-56%. Conclusion: URaG effectively combines retrieval and generation, improving both the efficiency and the accuracy of long document understanding, and it is practical and scalable. Abstract: Recent multimodal large language models (MLLMs) still struggle with long document understanding due to two fundamental challenges: information interference from abundant irrelevant content, and the quadratic computational cost of Transformer-based architectures. Existing approaches primarily fall into two categories: token compression, which sacrifices fine-grained details; and introducing external retrievers, which increase system complexity and prevent end-to-end optimization. To address these issues, we conduct an in-depth analysis and observe that MLLMs exhibit a human-like coarse-to-fine reasoning pattern: early Transformer layers attend broadly across the document, while deeper layers focus on relevant evidence pages. Motivated by this insight, we posit that the inherent evidence localization capabilities of MLLMs can be explicitly leveraged to perform retrieval during the reasoning process, facilitating efficient long document understanding. To this end, we propose URaG, a simple-yet-effective framework that Unifies Retrieval and Generation within a single MLLM. URaG introduces a lightweight cross-modal retrieval module that converts the early Transformer layers into an efficient evidence selector, identifying and preserving the most relevant pages while discarding irrelevant content. This design enables the deeper layers to concentrate computational resources on pertinent information, improving both accuracy and efficiency. Extensive experiments demonstrate that URaG achieves state-of-the-art performance while reducing computational overhead by 44-56%. The code is available at https://github.com/shi-yx/URaG.

[57] DESS: DeBERTa Enhanced Syntactic-Semantic Aspect Sentiment Triplet Extraction

Vishal Thenuwara,Nisansa de Silva

Main category: cs.CL

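TL;DR: This paper proposes DESS, a dual-channel architecture combining DeBERTa with an LSTM, which markedly improves Aspect Sentiment Triplet Extraction (ASTE) performance, particularly on complex sentence structures.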

Details Motivation: Existing ASTE methods still struggle to capture the relationships between aspects, opinions, and sentiment polarities, and the potential of advanced language models for understanding complex language patterns has not been fully explored. Method: Proposes the DESS framework, which runs DeBERTa's enhanced attention mechanism in parallel with an LSTM to process semantic and syntactic information, and refines how the two channels exchange information. Result: F1 scores on standard datasets improve by 4.85, 8.36, and 2.42; experiments indicate that DeBERTa's attention mechanism helps with long-range dependencies and complex sentences. Conclusion: Thoughtfully integrating more advanced language models can substantially improve fine-grained sentiment analysis, which underlines the importance of co-designing the model architecture with the pretrained model. Abstract: Fine-grained sentiment analysis faces ongoing challenges in Aspect Sentiment Triplet Extraction (ASTE), particularly in accurately capturing the relationships between aspects, opinions, and sentiment polarities. While researchers have made progress using BERT and Graph Neural Networks, the full potential of advanced language models in understanding complex language patterns remains unexplored. We introduce DESS, a new approach that builds upon previous work by integrating DeBERTa's enhanced attention mechanism to better understand context and relationships in text. Our framework maintains a dual-channel structure, where DeBERTa works alongside an LSTM channel to process both meaning and grammatical patterns in text. We have carefully refined how these components work together, paying special attention to how different types of language information interact. When we tested DESS on standard datasets, it showed meaningful improvements over current methods, with F1-score increases of 4.85, 8.36, and 2.42 in identifying aspect opinion pairs and determining sentiment accurately. Looking deeper into the results, we found that DeBERTa's sophisticated attention system helps DESS handle complicated sentence structures better, especially when important words are far apart. Our findings suggest that upgrading to more advanced language models, when thoughtfully integrated, can lead to real improvements in how well we can analyze sentiments in text. The implementation of our approach is publicly available at: https://github.com/VishalRepos/DESS.

[58] Evaluating Prompting Strategies with MedGemma for Medical Order Extraction

Abhinand Balachandran,Bavana Durgapraveen,Gowsikkan Sikkan Sudhagar,Vidhya Varshany J S,Sriram Rajkumar

Main category: cs.CL

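TL;DR: This paper studies how the domain-specific language model MedGemma performs with different prompting methods on extracting medical orders from doctor-patient conversations, and finds that simple one-shot prompting outperforms complex reasoning frameworks on manually annotated data.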

Details Motivation: Accurately extracting medical orders from doctor-patient conversations is critical for reducing clinical documentation burden and ensuring patient safety, but how to choose an appropriate prompting strategy remains an open question. Method: Uses the MedGemma model to systematically evaluate three prompting paradigms (one-shot prompting, the reasoning-focused ReAct framework, and a multi-step agentic workflow) on the validation set of the MEDIQA-OE-2025 shared task. Result: Simple one-shot prompting performs best on the official validation set, while the more complex ReAct and agentic flows introduce noise through "overthinking" and perform worse. Conclusion: On manually annotated clinical transcripts, direct prompting is more robust and efficient, and complex reasoning frameworks do not necessarily help. Abstract: The accurate extraction of medical orders from doctor-patient conversations is a critical task for reducing clinical documentation burdens and ensuring patient safety. This paper details our team submission to the MEDIQA-OE-2025 Shared Task. We investigate the performance of MedGemma, a new domain-specific open-source language model, for structured order extraction. We systematically evaluate three distinct prompting paradigms: a straightforward one-shot approach, a reasoning-focused ReAct framework, and a multi-step agentic workflow. Our experiments reveal that while more complex frameworks like ReAct and agentic flows are powerful, the simpler one-shot prompting method achieved the highest performance on the official validation set. We posit that on manually annotated transcripts, complex reasoning chains can lead to "overthinking" and introduce noise, making a direct approach more robust and efficient. Our work provides valuable insights into selecting appropriate prompting strategies for clinical information extraction in varied data conditions.

[59] Mined Prompting and Metadata-Guided Generation for Wound Care Visual Question Answering

Bavana Durgapraveen,Sornaraj Sivasankaran,Abhinand Balachandran,Sriram Rajkumar

Main category: cs.CL

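TL;DR: For the MEDIQA-WV 2025 shared task, this paper proposes two methods for generating free-text wound care responses: a retrieval-based mined prompting strategy and metadata-guided generation.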

Details Motivation: The rapid growth of asynchronous remote care has increased clinician workload, creating a need for AI systems that help handle patient queries efficiently, particularly for wound care queries paired with images. Method: The first approach uses mined prompting, retrieving similar examples from the training data to serve as few-shot demonstrations. The second builds on a metadata ablation study: classifiers are trained to predict four key metadata attributes, which are then incorporated dynamically into the generation process to improve output quality. Result: Mined prompting improves response relevance, and metadata-guided generation further improves clinical precision. Conclusion: Combining retrieval augmentation with metadata modeling can improve AI performance on wound care question answering and offers a viable direction for clinical support systems. Abstract: The rapid expansion of asynchronous remote care has intensified provider workload, creating demand for AI systems that can assist clinicians in managing patient queries more efficiently. The MEDIQA-WV 2025 shared task addresses this challenge by focusing on generating free-text responses to wound care queries paired with images. In this work, we present two complementary approaches developed for the English track. The first leverages a mined prompting strategy, where training data is embedded and the top-k most similar examples are retrieved to serve as few-shot demonstrations during generation. The second approach builds on a metadata ablation study, which identified four metadata attributes that consistently enhance response quality. We train classifiers to predict these attributes for test cases and incorporate them into the generation pipeline, dynamically adjusting outputs based on prediction confidence. Experimental results demonstrate that mined prompting improves response relevance, while metadata-guided generation further refines clinical precision. Together, these methods highlight promising directions for developing AI-driven tools that can provide reliable and efficient wound care support.
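
The mined prompting step described above (embed the training data, retrieve the top-k most similar examples, use them as few-shot demonstrations) can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the embeddings are assumed to be precomputed vectors, cosine similarity is an assumed choice of metric, and the names `mine_demonstrations` and `build_prompt` as well as the prompt layout are hypothetical.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

def mine_demonstrations(query_vec, train_items, k=2):
    """Return the k training items most similar to the query.

    train_items is a list of (embedding, query_text, response_text) tuples;
    the embeddings are assumed to come from some external encoder.
    """
    ranked = sorted(train_items,
                    key=lambda item: cosine(query_vec, item[0]),
                    reverse=True)
    return ranked[:k]

def build_prompt(query_text, demos):
    """Lay out retrieved examples as few-shot demonstrations (hypothetical format)."""
    parts = [f"Query: {q}\nResponse: {r}" for _, q, r in demos]
    parts.append(f"Query: {query_text}\nResponse:")
    return "\n\n".join(parts)

train = [
    ([1.0, 0.0], "q1", "r1"),
    ([0.0, 1.0], "q2", "r2"),
    ([0.9, 0.1], "q3", "r3"),
]
demos = mine_demonstrations([1.0, 0.0], train, k=2)
prompt = build_prompt("new question", demos)
```

In the shared task the queries are paired with images, so a real system would also need a way to embed or describe the image; that part is outside this sketch.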

[60] Know Your Limits: Entropy Estimation Modeling for Compression and Generalization

Benjamin L. Badger,Matthew Neligeorge

Main category: cs.CL

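TL;DR: This paper proposes an encoder-augmented causal decoder architecture that achieves higher compression than conventional causal Transformers on modest hardware, and shows that per-token entropy estimation can improve language model generalization.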

Details Motivation: The informational entropy intrinsic to language imposes an upper limit on prediction accuracy and a lower bound on compression. The most efficient compressors today are causal LLMs doing next-token prediction, but they are too computationally expensive to use for accurate estimates of language entropy. Method: Introduces an encoder-augmented causal decoder architecture that improves training efficiency and compression, and adds per-token entropy estimates to training so that the model approaches but does not go beyond the estimated entropy, with the aim of better generalization. Result: The new architecture compresses better than conventional causal Transformers on modest hardware, and models trained with the entropy constraint generalize better than models trained without it. Conclusion: Combining encoder information with entropy-aware training lets models approach the entropy limit of language more efficiently and improves generalization, suggesting a new direction for language modeling and compression. Abstract: Language prediction is constrained by informational entropy intrinsic to language, such that there exists a limit to how accurate any language model can become and equivalently a lower bound to language compression. The most efficient language compression algorithms today are causal (next token prediction) large language models, but the use of these models to form accurate estimates of language entropy is currently computationally infeasible. We introduce encoder-augmented causal decoder model architectures that exhibit superior training efficiency characteristics and achieve higher compression than causal transformers even when trained on modest hardware. We demonstrate how entropy estimates can be obtained on a per-token basis, and show that the generalization of models trained to approach the entropy of their training data necessarily exceeds the generalization of models trained to minimize loss beyond this value. We show empirically that causal models trained to approach but not exceed estimated per-token entropies exhibit greater generalization than models trained without taking entropy into account.

[61] SSR: Socratic Self-Refine for Large Language Model Reasoning

Haizhou Shi,Ye Liu,Bo Pang,Zeyu Leo Liu,Hao Wang,Silvio Savarese,Caiming Xiong,Yingbo Zhou,Semih Yavuz

Main category: cs.CL

TL;DR: Proposes Socratic Self-Refine (SSR), a new framework that improves LLM reasoning through fine-grained verification and iterative refinement.

Details Motivation: Existing test-time frameworks rely on coarse self-verification and self-correction, which limits their effectiveness on the fine-grained reasoning that complex tasks require. Method: Decomposes model responses into verifiable (sub-question, sub-answer) pairs, estimates step-level confidence through controlled re-solving and self-consistency checks, and iteratively refines the steps found to be unreliable. Result: Experiments on five reasoning benchmarks and three LLMs show that SSR consistently outperforms state-of-the-art self-refinement methods. Conclusion: SSR offers an effective black-box approach for evaluating and understanding the internal reasoning of LLMs, improving both accuracy and interpretability. Abstract: Large Language Models (LLMs) have demonstrated remarkable reasoning abilities, yet existing test-time frameworks often rely on coarse self-verification and self-correction, limiting their effectiveness on complex tasks. In this paper, we propose Socratic Self-Refine (SSR), a novel framework for fine-grained evaluation and precise refinement of LLM reasoning. Our proposed SSR decomposes model responses into verifiable (sub-question, sub-answer) pairs, enabling step-level confidence estimation through controlled re-solving and self-consistency checks. By pinpointing unreliable steps and iteratively refining them, SSR produces more accurate and interpretable reasoning chains. Empirical results across five reasoning benchmarks and three LLMs show that SSR consistently outperforms state-of-the-art iterative self-refinement baselines. Beyond performance gains, SSR provides a principled black-box approach for evaluating and understanding the internal reasoning processes of LLMs. Code is available at https://github.com/SalesforceAIResearch/socratic-self-refine-reasoning.
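
The step-level confidence idea in this abstract (re-solve each sub-question several times, check agreement, refine the least reliable step) can be sketched roughly as below. This is an assumption-laden illustration rather than the released implementation: `resolve` stands in for an LLM call, exact string match stands in for answer equivalence, and the sample count, threshold, and majority-vote repair are hypothetical choices.

```python
from collections import Counter
from typing import Callable, List, Tuple

Step = Tuple[str, str]  # (sub_question, sub_answer)

def step_confidence(sub_answer: str, resolved: List[str]) -> float:
    """Fraction of independent re-solves that agree with the original sub-answer."""
    if not resolved:
        return 0.0
    return sum(1 for a in resolved if a == sub_answer) / len(resolved)

def refine_least_confident(steps: List[Step],
                           resolve: Callable[[str], str],
                           n_samples: int = 5,
                           threshold: float = 0.6) -> List[Step]:
    """One refinement round: repair the step with the lowest confidence.

    `resolve` is a hypothetical stand-in for re-solving a sub-question with
    an LLM. If the weakest step falls below `threshold`, its answer is
    replaced by the majority re-solved answer; otherwise nothing changes.
    """
    if not steps:
        return []
    scored = []
    for question, answer in steps:
        samples = [resolve(question) for _ in range(n_samples)]
        scored.append((step_confidence(answer, samples), samples))
    weakest = min(range(len(steps)), key=lambda i: scored[i][0])
    confidence, samples = scored[weakest]
    if confidence >= threshold:
        return list(steps)
    majority = Counter(samples).most_common(1)[0][0]
    refined = list(steps)
    refined[weakest] = (steps[weakest][0], majority)
    return refined

# Toy usage with a deterministic "solver" in place of an LLM.
answers = {"2+3?": "5", "5*4?": "20"}
chain = [("2+3?", "5"), ("5*4?", "21")]
fixed = refine_least_confident(chain, lambda q: answers[q])
```

With a sampling LLM the re-solves would differ from call to call, which is what makes the agreement rate informative as a confidence signal.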

[62] Instella: Fully Open Language Models with Stellar Performance

Jiang Liu,Jialian Wu,Xiaodong Yu,Yusheng Su,Prakamya Mishra,Gowtham Ramesh,Sudhanshu Ranjan,Chaitanya Manem,Ximeng Sun,Ze Wang,Pratik Prabhanjan Brahma,Zicheng Liu,Emad Barsoum

Main category: cs.CL

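TL;DR: Instella is a family of fully open 3-billion-parameter language models trained on openly available data and code, achieving leading performance among fully open models of its class, with specialized variants released for long context and mathematical reasoning.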

Details Motivation: Most high-performing LLMs remain closed-source or only partially open, which limits transparency and reproducibility, so a fully open, high-performing language model is needed. Method: Trained on AMD Instinct MI300X GPUs through large-scale pre-training, general-purpose instruction tuning, and alignment with human preferences, using openly available data and code. Specialized versions supporting 128K context and mathematical reasoning were also developed. Result: Despite using fewer pre-training tokens than comparable models, Instella achieves state-of-the-art results among fully open models and is competitive with open-weight models of similar size. Its two variants, Instella-Long and Instella-Math, perform well on long-context and mathematical reasoning tasks respectively. Conclusion: Instella offers a transparent, efficient, and versatile open language model, advancing open and reproducible language modeling research. Abstract: Large language models (LLMs) have demonstrated remarkable performance across a wide range of tasks, yet the majority of high-performing models remain closed-source or partially open, limiting transparency and reproducibility. In this work, we introduce Instella, a family of fully open three billion parameter language models trained entirely on openly available data and codebase. Powered by AMD Instinct MI300X GPUs, Instella is developed through large-scale pre-training, general-purpose instruction tuning, and alignment with human preferences. Despite using substantially fewer pre-training tokens than many contemporaries, Instella achieves state-of-the-art results among fully open models and is competitive with leading open-weight models of comparable size. We further release two specialized variants: Instella-Long, capable of handling context lengths up to 128K tokens, and Instella-Math, a reasoning-focused model enhanced through supervised fine-tuning and reinforcement learning on mathematical tasks. Together, these contributions establish Instella as a transparent, performant, and versatile alternative for the community, advancing the goal of open and reproducible language modeling research.

[63] Black-Box On-Policy Distillation of Large Language Models

Tianzhu Ye,Li Dong,Zewen Chi,Xun Wu,Shaohan Huang,Furu Wei

Main category: cs.CL

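TL;DR: This paper proposes Generative Adversarial Distillation (GAD), a black-box, on-policy distillation method for LLMs whose adversarial training yields student models that clearly outperform those trained with conventional sequence-level knowledge distillation.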

Details Motivation: Existing black-box distillation methods typically learn from the teacher model's output text but lack an effective feedback mechanism to guide on-policy optimization of the student, so a more stable and adaptive distillation framework is needed. Method: Proposes Generative Adversarial Distillation (GAD), which treats the student model as a generator and trains a discriminator to distinguish student responses from teacher responses, forming a minimax game; the discriminator acts as an on-policy reward model that co-evolves with the student and provides stable, adaptive feedback. Result: GAD consistently outperforms conventional sequence-level knowledge distillation across evaluations; a Qwen2.5-14B-Instruct student trained with GAD performs close to its teacher, GPT-5-Chat, on the LMSYS-Chat automatic evaluation. Conclusion: GAD is a promising and effective paradigm for black-box LLM distillation, achieving better student learning through an adversarial reward mechanism. Abstract: Black-box distillation creates student large language models (LLMs) by learning from a proprietary teacher model's text outputs alone, without access to its internal logits or parameters. In this work, we introduce Generative Adversarial Distillation (GAD), which enables on-policy and black-box distillation. GAD frames the student LLM as a generator and trains a discriminator to distinguish its responses from the teacher LLM's, creating a minimax game. The discriminator acts as an on-policy reward model that co-evolves with the student, providing stable, adaptive feedback. Experimental results show that GAD consistently surpasses the commonly used sequence-level knowledge distillation. In particular, Qwen2.5-14B-Instruct (student) trained with GAD becomes comparable to its teacher, GPT-5-Chat, on the LMSYS-Chat automatic evaluation. The results establish GAD as a promising and effective paradigm for black-box LLM distillation.

[64] ParoQuant: Pairwise Rotation Quantization for Efficient Reasoning LLM Inference

Yesheng Liang,Haisheng Chen,Song Han,Zhijian Liu

Main category: cs.CL

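TL;DR: Proposes ParoQuant, a weight-only quantization method that uses pairwise rotations and channel-wise scaling to reduce quantization error in LLM inference, improving reasoning accuracy while keeping overhead low.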

Details Motivation: When handling outliers in LLM weights and activations, existing weight quantization methods tend to incur large quantization errors and accuracy loss, and the problem is worse on reasoning tasks with long chains of thought. Method: Proposes Pairwise Rotation Quantization (ParoQuant), which combines hardware-efficient independent Givens rotations with channel-wise scaling to even out magnitudes across channels and narrow the dynamic range within each quantization group, and co-designs the inference kernel to exploit GPU parallelism. Result: Improves accuracy over AWQ by 2.4% on average on reasoning tasks, with less than 10% runtime overhead. Conclusion: ParoQuant balances quantization efficiency against model accuracy and offers a practical route to efficient deployment of reasoning LLMs. Abstract: Weight-only post-training quantization (PTQ) compresses the weights of Large Language Models (LLMs) into low-precision representations to reduce memory footprint and accelerate inference. However, the presence of outliers in weights and activations often leads to large quantization errors and severe accuracy degradation, especially in recent reasoning LLMs where errors accumulate across long chains of thought. Existing PTQ methods either fail to sufficiently suppress outliers or introduce significant overhead during inference. In this paper, we propose Pairwise Rotation Quantization (ParoQuant), a weight-only PTQ method that combines hardware-efficient and optimizable independent Givens rotations with channel-wise scaling to even out the magnitude across channels and narrow the dynamic range within each quantization group. We further co-design the inference kernel to fully exploit GPU parallelism and keep the rotations and scaling lightweight at runtime. ParoQuant achieves an average 2.4% accuracy improvement over AWQ on reasoning tasks with less than 10% overhead. This paves the way for more efficient and accurate deployment of reasoning LLMs.
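
To see why a rotation between a pair of channels can narrow the dynamic range of a quantization group, consider the toy sketch below. It covers only the rotation idea, not channel-wise scaling or the inference kernel, and it is not ParoQuant's algorithm: the closed-form angle that equalizes the two magnitudes, the 4-bit symmetric round-to-nearest quantizer, and all function names are assumptions made for illustration (the abstract describes the rotations as optimizable, not closed-form).

```python
import math

def givens_rotate(x, y, theta):
    """Rotate the 2-D point (x, y) counterclockwise by theta."""
    c, s = math.cos(theta), math.sin(theta)
    return c * x - s * y, s * x + c * y

def quantize_group(values, bits=4):
    """Symmetric round-to-nearest quantization with one scale per group."""
    qmax = 2 ** (bits - 1) - 1
    scale = max(abs(v) for v in values) / qmax
    if scale == 0.0:
        return list(values)
    return [round(v / scale) * scale for v in values]

def squared_error(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def quantize_with_pair_rotation(values, i, j, bits=4):
    """Rotate entries i and j to equal magnitude, quantize, then rotate back.

    The angle is chosen so the rotated pair points along the 45-degree
    diagonal, which spreads an outlier's energy over both entries.
    """
    theta = math.pi / 4 - math.atan2(values[j], values[i])
    rotated = list(values)
    rotated[i], rotated[j] = givens_rotate(values[i], values[j], theta)
    quantized = quantize_group(rotated, bits)
    # Undo the rotation so the result is comparable with the original weights.
    quantized[i], quantized[j] = givens_rotate(quantized[i], quantized[j], -theta)
    return quantized

weights = [6.0, 0.2, 0.5, -0.5]  # entry 0 is an outlier
err_plain = squared_error(weights, quantize_group(weights))
err_rotated = squared_error(weights, quantize_with_pair_rotation(weights, 0, 1))
```

Because the rotation is orthogonal, it can be undone exactly at inference time; the gain comes from the smaller shared scale, which lets the remaining small entries land closer to a quantization level.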

cs.CV [Back]

[65] FedeCouple: Fine-Grained Balancing of Global-Generalization and Local-Adaptability in Federated Learning

Ming Yang,Dongrun Li,Xin Wang,Feng Li,Lisheng Fan,Chunxiao Wang,Xiaoming Wu,Peng Cheng

Main category: cs.CV

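TL;DR: Proposes FedeCouple, a federated learning method that balances global generalization and local adaptability at a fine-grained level by jointly learning global and local feature representations, applying dynamic knowledge distillation, and refining the feature space with anchors, which improves performance while preserving privacy.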

Details Motivation: Existing personalized federated learning methods mostly focus on feature space consistency and classifier personalization, neglecting the local adaptability of the feature extractor and the global generalization of the classifier; the resulting weak coordination between components hurts overall performance. Method: Proposes FedeCouple, which jointly learns global and local feature representations, uses dynamic knowledge distillation to strengthen classifier generalization, and introduces local anchors that are never transmitted to refine the feature space and tighten the coupling between the decoupled components. Result: On five image classification datasets, FedeCouple outperforms nine baselines in effectiveness, stability, scalability, and security, with up to a 4.3% gain in effectiveness. The algorithm is also proven to converge for nonconvex objectives. Conclusion: By balancing local adaptability and global generalization at a fine-grained level, FedeCouple improves decoupled personalized federated learning, combining theoretical guarantees with practical gains, and suits privacy-preserving settings with heterogeneous data. Abstract: In privacy-preserving mobile network transmission scenarios with heterogeneous client data, personalized federated learning methods that decouple feature extractors and classifiers have demonstrated notable advantages in enhancing learning capability. However, many existing approaches primarily focus on feature space consistency and classification personalization during local training, often neglecting the local adaptability of the extractor and the global generalization of the classifier. This oversight results in insufficient coordination and weak coupling between the components, ultimately degrading the overall model performance. To address this challenge, we propose FedeCouple, a federated learning method that balances global generalization and local adaptability at a fine-grained level. Our approach jointly learns global and local feature representations while employing dynamic knowledge distillation to enhance the generalization of personalized classifiers. We further introduce anchors to refine the feature space; their strict locality and non-transmission inherently preserve privacy and reduce communication overhead. Furthermore, we provide a theoretical analysis proving that FedeCouple converges for nonconvex objectives, with iterates approaching a stationary point as the number of communication rounds increases. Extensive experiments conducted on five image-classification datasets demonstrate that FedeCouple consistently outperforms nine baseline methods in effectiveness, stability, scalability, and security.
Notably, in experiments evaluating effectiveness, FedeCouple surpasses the best baseline by a significant margin of 4.3%.

[66] MMaDA-Parallel: Multimodal Large Diffusion Language Models for Thinking-Aware Editing and Generation

Ye Tian,Ling Yang,Jiongfan Yang,Anran Wang,Yu Tian,Jiani Zheng,Haochen Wang,Zhiyang Teng,Zhuochen Wang,Yinjie Wang,Yunhai Tong,Mengdi Wang,Xiangtai Li

Main category: cs.CV

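TL;DR: This paper proposes MMaDA-Parallel, a new parallel multimodal diffusion framework that addresses the performance degradation in thinking-aware generation caused by poor alignment between the reasoning and the output image. Evaluated on the newly built ParaBench benchmark and further optimized with parallel reinforcement learning, it substantially improves cross-modal consistency.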

Details Motivation: Existing autoregressive thinking-aware generation methods suffer from error propagation on complex tasks, which leaves the reasoning poorly aligned with the output image and lowers performance. Method: Proposes the ParaBench benchmark to evaluate both text and image outputs; designs the MMaDA-Parallel framework, in which text and images interact bidirectionally throughout the denoising process; trains with supervised finetuning followed by a new Parallel Reinforcement Learning (ParaRL) strategy that applies semantic rewards along the trajectory to strengthen cross-modal consistency. Result: On ParaBench, MMaDA-Parallel improves Output Alignment by 6.9% over the current state-of-the-art model, Bagel, with clearly better cross-modal alignment and semantic consistency. Conclusion: Through parallel multimodal interaction and trajectory-level reinforcement learning, MMaDA-Parallel establishes a more robust paradigm for thinking-aware image generation and mitigates error accumulation and modality misalignment. Abstract: While thinking-aware generation aims to improve performance on complex tasks, we identify a critical failure mode where existing sequential, autoregressive approaches can paradoxically degrade performance due to error propagation. To systematically analyze this issue, we propose ParaBench, a new benchmark designed to evaluate both text and image output modalities. Our analysis using ParaBench reveals that this performance degradation is strongly correlated with poor alignment between the generated reasoning and the final image. To resolve this, we propose a parallel multimodal diffusion framework, MMaDA-Parallel, that enables continuous, bidirectional interaction between text and images throughout the entire denoising trajectory. MMaDA-Parallel is trained with supervised finetuning and then further optimized by Parallel Reinforcement Learning (ParaRL), a novel strategy that applies semantic rewards along the trajectory to enforce cross-modal consistency. Experiments validate that our model significantly improves cross-modal alignment and semantic consistency, achieving a 6.9% improvement in Output Alignment on ParaBench compared to the state-of-the-art model, Bagel, establishing a more robust paradigm for thinking-aware image synthesis. Our code is open-sourced at https://github.com/tyfeld/MMaDA-Parallel

[67] PriVi: Towards A General-Purpose Video Model For Primate Behavior In The Wild

Felix B. Mueller,Jan F. Meier,Timo Lueddecke,Richard Vogg,Roger L. Freixanet,Valentin Hassler,Tiffany Bosshard,Elif Karakoc,William J. O'Hearn,Sofia M. Pereira,Sandro Sehner,Kaja Wierucka,Judith Burkart,Claudia Fichtel,Julia Fischer,Alexander Gail,Catherine Hobaiter,Julia Ostner,Liran Samuni,Oliver Schülke,Neda Shahidi,Erin G. Wessling,Alexander S. Ecker

Main category: cs.CV

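TL;DR: This paper takes a data-centric approach, building PriVi, a large-scale primate-centric video pretraining dataset, and shows through experiments on multiple benchmarks that it generalizes better and is more data-efficient than existing methods on behavior recognition tasks.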

Details Motivation: Existing computer vision methods mostly rely on human-centric pretrained models and are confined to single datasets, so they generalize poorly to non-human primate behavior analysis. A more general and data-efficient primate-specific pretraining approach is needed. Method: Introduces the PriVi dataset, containing 424 hours of primate video (174 hours from research settings and 250 hours from the web) assembled with a scalable curation pipeline; pretrains V-JEPA on it in a self-supervised manner and evaluates with a lightweight frozen classifier on four benchmark datasets. Result: On ChimpACT, BaboonLand, PanAf500, and ChimpBehave, the approach consistently outperforms existing methods, including fully finetuned baselines, and does better still when labels are scarce, showing good scalability and data efficiency. Conclusion: Primate-centric pretraining substantially improves generalization and data efficiency for behavior recognition across settings and is a promising approach for low-label scenarios. Abstract: Non-human primates are our closest living relatives, and analyzing their behavior is central to research in cognition, evolution, and conservation. Computer vision could greatly aid this research, but existing methods often rely on human-centric pretrained models and focus on single datasets, which limits generalization. We address this limitation by shifting from a model-centric to a data-centric approach and introduce PriVi, a large-scale primate-centric video pretraining dataset. PriVi contains 424 hours of curated video, combining 174 hours from behavioral research across 11 settings with 250 hours of diverse web-sourced footage, assembled through a scalable data curation pipeline. We pretrain V-JEPA on PriVi to learn primate-specific representations and evaluate it using a lightweight frozen classifier. Across four benchmark datasets, ChimpACT, BaboonLand, PanAf500, and ChimpBehave, our approach consistently outperforms prior work, including fully finetuned baselines, and scales favorably with fewer labels. These results demonstrate that primate-centric pretraining substantially improves data efficiency and generalization, making it a promising approach for low-label applications. Code, models, and the majority of the dataset will be made available.

[68] Classifying Phonotrauma Severity from Vocal Fold Images with Soft Ordinal Regression

Katie Matton,Purvaja Balaji,Hamzeh Ghasemzadeh,Jameson C. Cooper,Daryush D. Mehta,Jarrad H. Van Stan,Robert E. Hillman,Rosalind Picard,John Guttag,S. Mazdak Abulnaga

Main category: cs.CV

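TL;DR: Proposes an automatic classification method based on soft-label ordinal regression for assessing phonotrauma severity from vocal fold images; its performance approaches that of clinical experts, and it provides reliable uncertainty estimates.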

Details Motivation: Assessing the severity of vocal fold damage depends on a clinician's expert judgment, which is costly and varies widely in reliability, so an automated, repeatable assessment method is needed. Method: The authors adopt an ordinal regression framework and propose a new modification to its loss functions that lets them operate on soft labels reflecting the distribution of annotator ratings, which better models label uncertainty. Result: The proposed soft ordinal regression method achieves predictive performance approaching that of clinical experts and produces well-calibrated uncertainty estimates. Conclusion: The method provides an automated tool for assessing vocal fold damage severity, which can support large-scale studies and improve clinical understanding and patient care. Abstract: Phonotrauma refers to vocal fold tissue damage resulting from exposure to forces during voicing. It occurs on a continuum from mild to severe, and treatment options can vary based on severity. Assessment of severity involves a clinician's expert judgment, which is costly and can vary widely in reliability. In this work, we present the first method for automatically classifying phonotrauma severity from vocal fold images. To account for the ordinal nature of the labels, we adopt a widely used ordinal regression framework. To account for label uncertainty, we propose a novel modification to ordinal regression loss functions that enables them to operate on soft labels reflecting annotator rating distributions. Our proposed soft ordinal regression method achieves predictive performance approaching that of clinical experts, while producing well-calibrated uncertainty estimates. By providing an automated tool for phonotrauma severity assessment, our work can enable large-scale studies of phonotrauma, ultimately leading to improved clinical understanding and patient care.
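
The abstract above does not spell out how the ordinal loss is modified for soft labels, so the following is only a minimal sketch of one plausible construction, not the authors' method. It assumes a cumulative-link style decomposition in which K ordered classes become K-1 binary tasks "is the severity greater than level k?", and it replaces the usual 0/1 targets with the probabilities P(y > k) computed from an annotator rating distribution. The function names `soft_ordinal_targets` and `soft_ordinal_bce` are hypothetical.

```python
import math


def soft_ordinal_targets(label_dist):
    """Turn a distribution over K ordered classes into K-1 targets P(y > k).

    Assumption: label_dist sums to 1 and is ordered from mildest to most severe.
    """
    targets = []
    tail = 1.0
    for p in label_dist[:-1]:
        tail -= p  # probability mass remaining above the current level
        targets.append(max(0.0, min(1.0, tail)))
    return targets


def soft_ordinal_bce(logits, label_dist, eps=1e-12):
    """Mean binary cross-entropy between K-1 threshold logits and soft targets."""
    targets = soft_ordinal_targets(label_dist)
    if len(logits) != len(targets):
        raise ValueError("expected one logit per threshold (K-1 in total)")
    loss = 0.0
    for z, t in zip(logits, targets):
        q = 1.0 / (1.0 + math.exp(-z))  # predicted P(y > k)
        loss -= t * math.log(q + eps) + (1.0 - t) * math.log(1.0 - q + eps)
    return loss / len(targets)


# Three severity levels; annotators split 25% / 50% / 25% across them.
print(soft_ordinal_targets([0.25, 0.5, 0.25]))
print(soft_ordinal_bce([1.0, -1.0], [0.25, 0.5, 0.25]))
```

With a one-hot distribution the targets reduce to the ordinary hard ordinal targets, so under these assumptions the soft version is a strict generalization of the hard-label loss.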

[69] SliderEdit: Continuous Image Editing with Fine-Grained Instruction Control

Arman Zarei,Samyadeep Basu,Mobina Pournemat,Sayan Nag,Ryan Rossi,Soheil Feizi

Main category: cs.CV

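TL;DR: This paper presents SliderEdit, a framework for continuous image editing that provides fine-grained, interpretable strength control over each instruction in a multi-instruction prompt.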

Details Motivation: Existing instruction-based image editing models apply each instruction with a fixed strength, which limits precise, continuous user control over edit intensity. Method: SliderEdit disentangles multi-part edit instructions and represents each instruction as a globally trained slider, allowing smooth adjustment of edit strength; it uses a single shared set of low-rank adaptation matrices, avoiding separate training for each attribute. Result: Applying SliderEdit to state-of-the-art models such as FLUX-Kontext and Qwen-Image-Edit substantially improves edit controllability, visual consistency, and user steerability. Conclusion: According to the authors, SliderEdit is the first framework to enable continuous, fine-grained control in instruction-based image editing, opening a new direction for interactive, instruction-driven image manipulation. Abstract: Instruction-based image editing models have recently achieved impressive performance, enabling complex edits to an input image from a multi-instruction prompt. However, these models apply each instruction in the prompt with a fixed strength, limiting the user's ability to precisely and continuously control the intensity of individual edits. We introduce SliderEdit, a framework for continuous image editing with fine-grained, interpretable instruction control. Given a multi-part edit instruction, SliderEdit disentangles the individual instructions and exposes each as a globally trained slider, allowing smooth adjustment of its strength. Unlike prior works that introduced slider-based attribute controls in text-to-image generation, typically requiring separate training or fine-tuning for each attribute or concept, our method learns a single set of low-rank adaptation matrices that generalize across diverse edits, attributes, and compositional instructions. This enables continuous interpolation along individual edit dimensions while preserving both spatial locality and global semantic consistency. We apply SliderEdit to state-of-the-art image editing models, including FLUX-Kontext and Qwen-Image-Edit, and observe substantial improvements in edit controllability, visual consistency, and user steerability. To the best of our knowledge, we are the first to explore and propose a framework for continuous, fine-grained instruction control in instruction-based image editing models. Our results pave the way for interactive, instruction-driven image manipulation with continuous and compositional control.

[70] Density Estimation and Crowd Counting

Balachandra Devarangadi Sunil,Rakshith Venkatesh,Shantanu Todmal

Main category: cs.CV

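TL;DR: This study proposes an improved video-based crowd density estimation method that combines a denoising probabilistic model with a diffusion process to generate high-quality density maps, and introduces event-driven sampling to reduce computational overhead.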

Details Motivation: Traditional image-based crowd density estimation methods struggle to handle temporal dynamics in video data and incur high computational and storage costs, so a more efficient and accurate video-based crowd analysis framework is needed. Method: A diffusion-based denoising probabilistic model generates high-accuracy density maps, using narrow Gaussian kernels and producing multiple density maps; a regression branch is added for feature extraction, and the multiple outputs are fused using similarity scores; the Farneback optical flow algorithm drives event-based key-frame sampling. Result: The method captures crowd dynamics accurately in both dense and sparse scenes and significantly reduces frame counts and storage requirements while preserving key events; qualitative and quantitative evaluations (such as overlay plots and MAE) support its performance. Conclusion: The proposed framework addresses temporal modeling and efficiency in video-based crowd density estimation, is scalable and suited to real-time use, and applies to practical scenarios such as public safety, disaster response, and event management. Abstract: This study enhances a crowd density estimation algorithm originally designed for image-based analysis by adapting it for video-based scenarios. The proposed method integrates a denoising probabilistic model that utilizes diffusion processes to generate high-quality crowd density maps. To improve accuracy, narrow Gaussian kernels are employed, and multiple density map outputs are generated. A regression branch is incorporated into the model for precise feature extraction, while a consolidation mechanism combines these maps based on similarity scores to produce a robust final result. An event-driven sampling technique, utilizing the Farneback optical flow algorithm, is introduced to selectively capture frames showing significant crowd movements, reducing computational load and storage by focusing on critical crowd dynamics. Through qualitative and quantitative evaluations, including overlay plots and Mean Absolute Error (MAE), the model demonstrates its ability to effectively capture crowd dynamics in both dense and sparse settings. The efficiency of the sampling method is further assessed, showcasing its capability to decrease frame counts while maintaining essential crowd events. By addressing the temporal challenges unique to video analysis, this work offers a scalable and efficient framework for real-time crowd monitoring in applications such as public safety, disaster response, and event management.

[71] PALMS+: Modular Image-Based Floor Plan Localization Leveraging Depth Foundation Model

Yunqian Cheng,Benjamin Princen,Roberto Manduchi

Main category: cs.CV

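TL;DR: PALMS+ is an infrastructure-free indoor localization system that uses a monocular depth estimation model to reconstruct 3D point clouds and achieves accurate localization through geometric layout matching against a floor plan, outperforming existing methods in both stationary and sequential localization.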

Details Motivation: Existing vision-based localization methods such as PALMS are constrained by the short range of smartphone LiDAR and by ambiguity in indoor layouts; the goal is accurate, infrastructure-free indoor localization in GPS-denied environments. Method: PALMS+ uses a foundation monocular depth estimation model (Depth Pro) to reconstruct scale-aligned 3D point clouds from RGB images, performs geometric layout matching via convolution with the floor plan, and outputs a posterior over location and orientation that supports direct or sequential localization. Result: On Structured3D and a custom campus dataset, PALMS+ outperforms PALMS and F3Loc in stationary localization accuracy; combined with a particle filter for sequential localization on 33 real-world trajectories, it also yields lower localization errors. Conclusion: PALMS+ achieves accurate, robust, infrastructure-free indoor localization without any training, with potential applications in emergency response and assistive navigation. Abstract: Indoor localization in GPS-denied environments is crucial for applications like emergency response and assistive navigation. Vision-based methods such as PALMS enable infrastructure-free localization using only a floor plan and a stationary scan, but are limited by the short range of smartphone LiDAR and ambiguity in indoor layouts. We propose PALMS$+$, a modular, image-based system that addresses these challenges by reconstructing scale-aligned 3D point clouds from posed RGB images using a foundation monocular depth estimation model (Depth Pro), followed by geometric layout matching via convolution with the floor plan. PALMS$+$ outputs a posterior over the location and orientation, usable for direct or sequential localization. Evaluated on the Structured3D and a custom campus dataset consisting of 80 observations across four large campus buildings, PALMS$+$ outperforms PALMS and F3Loc in stationary localization accuracy -- without requiring any training. Furthermore, when integrated with a particle filter for sequential localization on 33 real-world trajectories, PALMS$+$ achieved lower localization errors compared to other methods, demonstrating robustness for camera-free tracking and its potential for infrastructure-free applications. Code and data are available at https://github.com/Head-inthe-Cloud/PALMS-Plane-based-Accessible-Indoor-Localization-Using-Mobile-Smartphones

[72] Social LSTM with Dynamic Occupancy Modeling for Realistic Pedestrian Trajectory Prediction

Ahmed Alia,Mohcine Chraibi,Armin Seyfried

Main category: cs.CV

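TL;DR: This paper proposes an improved Social LSTM model that introduces a Dynamic Occupied Space loss function to reduce collision rates in pedestrian trajectory prediction without increasing displacement error.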

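Details Motivation: Existing methods often treat pedestrians as point entities and ignore the space they actually occupy, which leads to unrealistic, collision-prone predictions in dense scenes. A prediction model is needed that accounts for pedestrians' physical space occupancy and adapts to scenes of different densities. Method: A new Dynamic Occupied Space loss function is added to Social LSTM; the loss combines the average displacement error with a collision penalty that is sensitive to scene density and individual spatial occupancy. Five datasets (four with homogeneous density and one with mixed density) were generated from real pedestrian trajectories recorded at the 2022 Festival of Lights in Lyon for training and evaluation. Result: The proposed model lowers both collision rate and displacement error on every dataset; relative to the baseline it achieves up to a 31% reduction in collision rate and, on average across datasets, reduces the average displacement error by 5% and the final displacement error by 6%, and it outperforms several state-of-the-art deep learning models on most test sets. Conclusion: A loss function that accounts for pedestrians' physical space and scene density improves the realism and accuracy of pedestrian trajectory prediction, and the proposed method performs well across density conditions. Abstract: In dynamic and crowded environments, realistic pedestrian trajectory prediction remains a challenging task due to the complex nature of human motion and the mutual influences among individuals. Deep learning models have recently achieved promising results by implicitly learning such patterns from 2D trajectory data. However, most approaches treat pedestrians as point entities, ignoring the physical space that each person occupies. To address these limitations, this paper proposes a novel deep learning model that enhances the Social LSTM with a new Dynamic Occupied Space loss function. This loss function guides Social LSTM in learning to avoid realistic collisions without increasing displacement error across different crowd densities, ranging from low to high, in both homogeneous and heterogeneous density settings. Such a function achieves this by combining the average displacement error with a new collision penalty that is sensitive to scene density and individual spatial occupancy. For efficient training and evaluation, five datasets were generated from real pedestrian trajectories recorded during the Festival of Lights in Lyon 2022. Four datasets represent homogeneous crowd conditions -- low, medium, high, and very high density -- while the fifth corresponds to a heterogeneous density distribution. The experimental findings indicate that the proposed model not only lowers collision rates but also enhances displacement prediction accuracy in each dataset. 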
Specifically, the model achieves up to a 31% reduction in the collision rate and reduces the average displacement error and the final displacement error by 5% and 6%, respectively, on average across all datasets compared to the baseline. Moreover, the proposed model consistently outperforms several state-of-the-art deep learning models across most test sets.
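
The abstract describes the Dynamic Occupied Space loss only at a high level (average displacement error plus a collision penalty), so the sketch below is an illustrative stand-in rather than the paper's formulation. It assumes each pedestrian occupies a disc of hypothetical radius `radius`, penalizes predicted pairs whose discs overlap, and adds that penalty to the ADE with a hypothetical weight `weight`. The density-dependent behaviour mentioned in the abstract is not modeled here.

```python
import math


def ade(pred, truth):
    """Average displacement error over all pedestrians and timesteps.

    pred and truth are lists (one per pedestrian) of lists of (x, y) points.
    """
    total, count = 0.0, 0
    for p_traj, t_traj in zip(pred, truth):
        for (px, py), (tx, ty) in zip(p_traj, t_traj):
            total += math.hypot(px - tx, py - ty)
            count += 1
    return total / count


def collision_penalty(pred, radius):
    """Mean overlap depth between predicted pedestrian discs at each timestep."""
    total, pairs = 0.0, 0
    for t in range(len(pred[0])):
        for i in range(len(pred)):
            for j in range(i + 1, len(pred)):
                d = math.hypot(pred[i][t][0] - pred[j][t][0],
                               pred[i][t][1] - pred[j][t][1])
                total += max(0.0, 2.0 * radius - d)  # zero when discs do not touch
                pairs += 1
    return total / pairs if pairs else 0.0


def occupancy_aware_loss(pred, truth, radius=0.2, weight=1.0):
    """ADE plus a weighted collision penalty (illustrative combination only)."""
    return ade(pred, truth) + weight * collision_penalty(pred, radius)


truth = [[(0.0, 0.0), (1.0, 0.0)], [(0.0, 1.0), (1.0, 1.0)]]
colliding = [[(0.0, 0.0), (1.0, 0.0)], [(0.0, 0.1), (1.0, 0.1)]]
print(occupancy_aware_loss(truth, truth), occupancy_aware_loss(colliding, truth))
```

A hinge on the overlap depth keeps the penalty at zero for well-separated pedestrians, so in this sketch the extra term only affects predictions that would physically collide.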

[73] Soiling detection for Advanced Driver Assistance Systems

Filip Beránek,Václav Diviš,Ivan Gruber

Main category: cs.CV

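TL;DR: This paper treats soiling detection for automotive cameras as a semantic segmentation task, compares several popular segmentation methods, and shows their advantage over tile-level classification. The authors also analyze the Woodscape dataset in detail and find data leakage and imprecise annotations. To address this, they build a smaller but higher-quality subset that reaches comparable performance with much shorter training time. Code and dataset splits are publicly available.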

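Details Motivation: To make advanced driver assistance systems more robust to external conditions (such as weather and dust), soiling on the camera lens must be detected accurately. However, existing approaches (such as tile-level classification) have limited performance, and the commonly used dataset suffers from data leakage and annotation errors, which undermines the reliability of model evaluation. Method: Soiling detection is modeled as a semantic segmentation task; several popular segmentation models are systematically evaluated and compared with tile-level classification approaches. The Woodscape dataset is also analyzed in depth, its data leakage and annotation problems are identified and corrected, and a smaller but more precise subset is built for experimental validation. Result: Segmentation methods significantly outperform tile-level classification. On the corrected, higher-quality subset, models reach performance comparable to training on the original full set with much shorter training time. Conclusion: Semantic segmentation is an effective approach to automotive camera soiling detection, and high-quality, leakage-free annotated data is essential for model training and fair evaluation. The corrected dataset splits and code released by the authors support further research in this area. Abstract: Soiling detection for automotive cameras is a crucial part of advanced driver assistance systems to make them more robust to external conditions like weather, dust, etc. In this paper, we regard the soiling detection as a semantic segmentation problem. We provide a comprehensive comparison of popular segmentation methods and show their superior performance compared with tile-level classification approaches. Moreover, we present an extensive analysis of the Woodscape dataset showing that the original dataset contains data leakage and imprecise annotations. To address these problems, we create a new data subset, which, despite being much smaller, provides enough information for the segmentation method to reach comparable results in a much shorter time. All our code and dataset splits are available at https://github.com/filipberanek/woodscape_revision.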

[74] Feature Quality and Adaptability of Medical Foundation Models: A Comparative Evaluation for Radiographic Classification and Segmentation

Frank Li,Theo Dapamede,Mohammadreza Chavoshi,Young Seok Jeon,Bardia Khosravi,Abdulhameed Dere,Beatrice Brown-Mulry,Rohan Satya Isaac,Aawez Mansuri,Chiratidzo Sanyika,Janice Newsome,Saptarshi Purkayastha,Imon Banerjee,Hari Trivedi,Judy Gichoya

Main category: cs.CV

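TL;DR: This study evaluates vision encoders from eight medical and general-domain foundation models for chest X-ray analysis. Medical-domain pretrained models perform better under linear probing, but feature utility is highly task-dependent: all models perform poorly on segmenting complex lesions, and some rely on confounding features. Expensive text-image alignment is not required, since image-only or label-supervised models reach top performance, and a conventional supervised end-to-end model remains competitive on segmentation.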

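Details Motivation: The generalization ability of foundation models in medical imaging varies widely, and it is unclear how pre-training domain, paradigm, and architecture affect embedding quality, which makes it hard to select the best encoder for a given radiology task. Method: Vision encoders from eight medical and general-domain foundation models are benchmarked on chest X-ray classification (pneumothorax, cardiomegaly) and segmentation (pneumothorax, cardiac boundary) using both linear probing and fine-tuning. Result: Medical-domain pretrained models significantly outperform general-domain models under linear probing, indicating higher initial feature quality; pre-trained features work well for global classification and for segmenting salient anatomy, but perform poorly when segmenting complex, subtle pathologies (such as pneumothorax) without substantial fine-tuning; some models exploit confounding cues (such as chest tubes) for classification, a strategy that fails for segmentation; the image-only (RAD-DINO) and label-supervised (Ark+) models, which need no text-image alignment, are among the top performers; a conventional supervised end-to-end model matches or exceeds the best foundation models on segmentation. Conclusion: Medical pre-training improves feature quality, but its benefit depends on the task, and a significant gap remains in localizing subtle disease; architectural choices (such as multi-scale design) are critical; pre-trained features are not universally effective, and for complex localization tasks conventional supervised models remain strong competitors. Abstract: Foundation models (FMs) promise to generalize medical imaging, but their effectiveness varies. It remains unclear how pre-training domain (medical vs. general), paradigm (e.g., text-guided), and architecture influence embedding quality, hindering the selection of optimal encoders for specific radiology tasks. To address this, we evaluate vision encoders from eight medical and general-domain FMs for chest X-ray analysis. We benchmark classification (pneumothorax, cardiomegaly) and segmentation (pneumothorax, cardiac boundary) using linear probing and fine-tuning. Our results show that domain-specific pre-training provides a significant advantage; medical FMs consistently outperformed general-domain models in linear probing, establishing superior initial feature quality. However, feature utility is highly task-dependent. Pre-trained embeddings were strong for global classification and segmenting salient anatomy (e.g., heart). In contrast, for segmenting complex, subtle pathologies (e.g., pneumothorax), all FMs performed poorly without significant fine-tuning, revealing a critical gap in localizing subtle disease. Subgroup analysis showed FMs use confounding shortcuts (e.g., chest tubes for pneumothorax) for classification, a strategy that fails for precise segmentation. We also found that expensive text-image alignment is not a prerequisite; image-only (RAD-DINO) and label-supervised (Ark+) FMs were among top performers. 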
Notably, a supervised, end-to-end baseline remained highly competitive, matching or exceeding the best FMs on segmentation tasks. These findings show that while medical pre-training is beneficial, architectural choices (e.g., multi-scale) are critical, and pre-trained features are not universally effective, especially for complex localization tasks where supervised models remain a strong alternative.

[75] Gradient-Guided Exploration of Generative Model's Latent Space for Controlled Iris Image Augmentations

Mahsa Mitcheff,Siamul Karim Khan,Adam Czajka

Main category: cs.CV

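TL;DR: Proposes an iris image augmentation strategy based on traversing a generative model's latent space, using gradient guidance to manipulate specific iris attributes (such as sharpness and pupil size) while preserving identity.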

Details Motivation: Existing iris datasets lack diversity, and synthesizing same-identity iris images with realistic variations while controlling specific attributes remains challenging. Method: Gradient-guided traversal is performed in the generative model's latent space, with differentiable loss functions steering geometrical, textural, or quality-related features; the starting point can be an image from a pre-trained GAN or a real iris image, with GAN inversion used to project real images into the latent space and obtain the corresponding latent codes. Result: The approach enables controllable editing of multiple iris image attributes while keeping identity unchanged; it is general and extends to any attribute for which a differentiable loss can be formulated; augmentation can start from either randomly generated or real images. Conclusion: The method offers a data augmentation tool for iris recognition and presentation attack detection that increases the diversity and controllability of generated images, which could help improve model robustness and generalization. Abstract: Developing reliable iris recognition and presentation attack detection methods requires diverse datasets that capture realistic variations in iris features and a wide spectrum of anomalies. Because of the rich texture of iris images, which spans a wide range of spatial frequencies, synthesizing same-identity iris images while controlling specific attributes remains challenging. In this work, we introduce a new iris image augmentation strategy by traversing a generative model's latent space toward latent codes that represent same-identity samples but with some desired iris image properties manipulated. The latent space traversal is guided by a gradient of specific geometrical, textural, or quality-related iris image features (e.g., sharpness, pupil size, iris size, or pupil-to-iris ratio) and preserves the identity represented by the image being manipulated. The proposed approach can be easily extended to manipulate any attribute for which a differentiable loss term can be formulated. Additionally, our approach can use either images randomly generated by a pre-trained GAN model or real-world iris images. We can utilize GAN inversion to project any given iris image into the latent space and obtain its corresponding latent code.

[76] STORM: Segment, Track, and Object Re-Localization from a Single 3D Model

Yu Deng,Teng Cao,Hikaru Shindo,Jiahong Xue,Quentin Delfosse,Kristian Kersting

Main category: cs.CV

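TL;DR: STORM is a real-time 6D pose estimation system that requires no manual annotation and combines vision-language understanding with self-supervised feature matching for accurate object localization, tracking, and re-localization.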

Details Motivation: Existing 6D pose estimation methods rely on a manually annotated segmentation mask in the first frame, which is time-consuming to produce, and their performance degrades under occlusion or rapid motion. Method: A three-stage pipeline is proposed: vision-language understanding provides contextual object localization, self-cross-attention mechanisms identify candidate regions, and a segmentation model produces precise masks; an automatic re-registration mechanism based on feature similarity monitoring handles tracking failures. Result: STORM reaches state-of-the-art accuracy on industrial datasets featuring multi-object occlusion, high-speed motion, and varying illumination, while running in real time without additional training. Conclusion: STORM delivers robust annotation-free 6D pose estimation and tracking, significantly reducing deployment cost, and suits practical applications such as flexible manufacturing and intelligent quality control. Abstract: Accurate 6D pose estimation and tracking are fundamental capabilities for physical AI systems such as robots. However, existing approaches typically rely on a manually annotated segmentation mask of the target in the first frame, which is labor-intensive and leads to reduced performance when faced with occlusions or rapid movement. To address these limitations, we propose STORM (Segment, Track, and Object Re-localization from a single 3D Model), an open-source robust real-time 6D pose estimation system that requires no manual annotation. STORM employs a novel three-stage pipeline combining vision-language understanding with self-supervised feature matching: contextual object descriptions guide localization, self-cross-attention mechanisms identify candidate regions, and a segmentation model produces precise masks for accurate pose estimation. Another key innovation is our automatic re-registration mechanism that detects tracking failures through feature similarity monitoring and recovers from severe occlusions or rapid motion. STORM achieves state-of-the-art accuracy on challenging industrial datasets featuring multi-object occlusions, high-speed motion, and varying illumination, while operating at real-time speeds without additional training. This annotation-free approach significantly reduces deployment overhead, providing a practical solution for modern applications, such as flexible manufacturing and intelligent quality control.

[77] PANDA - Patch And Distribution-Aware Augmentation for Long-Tailed Exemplar-Free Continual Learning

Siddeshwar Raghavan,Jiangpeng He,Fengqing Zhu

Main category: cs.CV

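TL;DR: Proposes PANDA, a framework that uses patch-level, distribution-aware data augmentation to address dual-level imbalance in exemplar-free continual learning with pre-trained models, mitigating catastrophic forgetting.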

Details Motivation: Real-world data streams exhibit both intra-task and inter-task imbalance, and existing methods overlook this, which degrades learning. Method: A CLIP encoder identifies representative regions of low-frequency classes, which are transplanted into frequent-class samples for augmentation; an adaptive balancing strategy uses prior task distributions to smooth inter-task imbalance. Result: Experiments on multiple benchmarks show that PANDA improves accuracy and reduces catastrophic forgetting, and that it is compatible with existing PTM-based EFCL methods. Conclusion: Through fine-grained augmentation and a distribution-aware strategy, PANDA substantially improves continual learning performance on imbalanced data. Abstract: Exemplar-Free Continual Learning (EFCL) restricts the storage of previous task data and is highly susceptible to catastrophic forgetting. While pre-trained models (PTMs) are increasingly leveraged for EFCL, existing methods often overlook the inherent imbalance of real-world data distributions. We discovered that real-world data streams commonly exhibit dual-level imbalances, dataset-level distributions combined with extreme or reversed skews within individual tasks, creating both intra-task and inter-task disparities that hinder effective learning and generalization. To address these challenges, we propose PANDA, a Patch-and-Distribution-Aware Augmentation framework that integrates seamlessly with existing PTM-based EFCL methods. PANDA amplifies low-frequency classes by using a CLIP encoder to identify representative regions and transplanting those into frequent-class samples within each task. Furthermore, PANDA incorporates an adaptive balancing strategy that leverages prior task distributions to smooth inter-task imbalances, reducing the overall gap between average samples across tasks and enabling fairer learning with frozen PTMs. Extensive experiments and ablation studies demonstrate PANDA's capability to work with existing PTM-based CL methods, improving accuracy and reducing catastrophic forgetting.

[78] Test-Time Spectrum-Aware Latent Steering for Zero-Shot Generalization in Vision-Language Models

Konstantinos M. Dafnis,Dimitris N. Metaxas

Main category: cs.CV

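TL;DR: Proposes Spectrum-Aware Test-Time Steering (STS), a lightweight test-time adaptation framework that adjusts a small number of parameters in the latent space to improve vision-language model performance under domain shift, without backpropagating through or modifying the frozen encoders.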

Details Motivation: Existing test-time adaptation methods typically require backpropagation through large model weights or modification of core components, which is computationally expensive and impractical for deployment. Method: A spectral subspace is extracted from the text embeddings to define principal semantic directions, and latent representations are steered in a spectrum-aware manner by optimizing a small number of per-sample shift parameters to minimize entropy across augmented views. Result: Under standard evaluation protocols, STS largely surpasses or compares favorably against state-of-the-art methods, with very few additional parameters, inference up to 8x faster, and a memory footprint 12x smaller than conventional test-time prompt tuning. Conclusion: STS is an efficient, lightweight test-time adaptation method that requires no changes to the model, suited to zero-shot generalization under domain shift in practical settings. Abstract: Vision-Language Models (VLMs) excel at zero-shot inference but often degrade under test-time domain shifts. For this reason, episodic test-time adaptation strategies have recently emerged as powerful techniques for adapting VLMs to a single unlabeled image. However, existing adaptation strategies, such as test-time prompt tuning, typically require backpropagating through large encoder weights or altering core model components. In this work, we introduce Spectrum-Aware Test-Time Steering (STS), a lightweight adaptation framework that extracts a spectral subspace from the textual embeddings to define principal semantic directions and learns to steer latent representations in a spectrum-aware manner by adapting a small number of per-sample shift parameters to minimize entropy across augmented views. STS operates entirely at inference in the latent space, without backpropagation through or modification of the frozen encoders. Building on standard evaluation protocols, our comprehensive experiments demonstrate that STS largely surpasses or compares favorably against state-of-the-art test-time adaptation methods, while introducing only a handful of additional parameters and achieving inference speeds up to 8x faster with a 12x smaller memory footprint than conventional test-time prompt tuning. The code is available at https://github.com/kdafnis/STS.
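
To make the "steer, then minimize entropy across augmented views" idea more concrete, here is a small sketch under stated assumptions; it is not taken from the STS code base. It assumes the steering is an additive shift along a few basis directions (the hypothetical `steer` helper) and that the objective is the entropy of the class distribution averaged over views, which is the form commonly used in test-time prompt tuning. How the basis is extracted from text embeddings and how the shifts are optimized are left out.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]


def steer(feature, basis, shifts):
    """Shift a latent feature along basis directions: f + sum_i shifts[i] * basis[i]."""
    return [f + sum(s * b[d] for s, b in zip(shifts, basis))
            for d, f in enumerate(feature)]


def marginal_entropy(view_logits):
    """Entropy of the class distribution averaged over augmented views."""
    probs = [softmax(logits) for logits in view_logits]
    n_classes = len(probs[0])
    mean = [sum(p[c] for p in probs) / len(probs) for c in range(n_classes)]
    return -sum(p * math.log(p) for p in mean if p > 0.0)


# Two views that disagree give a higher marginal entropy than two that agree.
print(marginal_entropy([[4.0, 0.0], [0.0, 4.0]]))
print(marginal_entropy([[4.0, 0.0], [4.0, 0.0]]))
```

In this sketch only the handful of values in `shifts` would be optimized per sample, which matches the abstract's description of a small number of per-sample parameters.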

[79] Lumos3D: A Single-Forward Framework for Low-Light 3D Scene Restoration

Hanzhou Liu,Peng Jiang,Jia Huang,Mi Lu

Main category: cs.CV

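TL;DR: Proposes Lumos3D, a generalizable pose-free framework for 3D low-light scene restoration that restores illumination and structure directly from unposed multi-view low-light images without per-scene optimization.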

Details Motivation: Existing methods depend on precomputed camera poses and scene-specific optimization, which limits scalability to dynamic real-world environments. Method: Built on a geometry-grounded backbone and trained once on a single dataset with a cross-illumination distillation scheme and a dedicated Lumos loss, the model runs feed-forward inference to reconstruct a normal-light 3D Gaussian representation. Result: Experiments on real-world datasets show that Lumos3D achieves high-fidelity, geometrically accurate low-light 3D scene restoration with strong generalization, and extends naturally to over-exposure correction. Conclusion: Lumos3D is an efficient, general 3D illumination restoration framework that needs neither pose input nor scene-specific optimization, with good potential for practical use. Abstract: Restoring 3D scenes captured under low-light conditions remains a fundamental yet challenging problem. Most existing approaches depend on precomputed camera poses and scene-specific optimization, which greatly restricts their scalability to dynamic real-world environments. To overcome these limitations, we introduce Lumos3D, a generalizable pose-free framework for 3D low-light scene restoration. Trained once on a single dataset, Lumos3D performs inference in a purely feed-forward manner, directly restoring illumination and structure from unposed, low-light multi-view images without any per-scene training or optimization. Built upon a geometry-grounded backbone, Lumos3D reconstructs a normal-light 3D Gaussian representation that restores illumination while faithfully preserving structural details. During training, a cross-illumination distillation scheme is employed, where the teacher network is distilled on normal-light ground truth to transfer accurate geometric information, such as depth, to the student model. A dedicated Lumos loss is further introduced to promote photometric consistency within the reconstructed 3D space. Experiments on real-world datasets demonstrate that Lumos3D achieves high-fidelity low-light 3D scene restoration with accurate geometry and strong generalization to unseen cases. Furthermore, the framework naturally extends to handle over-exposure correction, highlighting its versatility for diverse lighting restoration tasks.

[80] From Street to Orbit: Training-Free Cross-View Retrieval via Location Semantics and LLM Guidance

Jeongho Min,Dongyoung Kim,Jaehyup Lee

Main category: cs.CV

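TL;DR: Proposes a training-free cross-view image retrieval framework that uses a pretrained vision encoder and a large language model to extract geographic cues and match satellite images, outperforming prior methods in the zero-shot setting.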

Details Motivation: Existing methods rely on supervised training and specific data (such as panoramic or UAV images), which limits real-world use; a more general, low-cost solution is needed. Method: A pretrained vision encoder (e.g., DINOv2) with PCA-whitening feature refinement is combined with web image search and LLM-based location inference; a geocoding API then generates the satellite query, and matching images are retrieved. Result: In the zero-shot setting, the method outperforms prior supervised learning-based approaches on the benchmark dataset and can automatically construct semantically aligned street-to-satellite datasets. Conclusion: The method requires no finetuning or ground-truth labels, is scalable and cost-efficient, and offers a new direction for cross-view retrieval. Abstract: Cross-view image retrieval, particularly street-to-satellite matching, is a critical task for applications such as autonomous navigation, urban planning, and localization in GPS-denied environments. However, existing approaches often require supervised training on curated datasets and rely on panoramic or UAV-based images, which limits real-world deployment. In this paper, we present a simple yet effective cross-view image retrieval framework that leverages a pretrained vision encoder and a large language model (LLM), requiring no additional training. Given a monocular street-view image, our method extracts geographic cues through web-based image search and LLM-based location inference, generates a satellite query via a geocoding API, and retrieves matching tiles using a pretrained vision encoder (e.g., DINOv2) with PCA-based whitening feature refinement. Despite using no ground-truth supervision or finetuning, our proposed method outperforms prior learning-based approaches on the benchmark dataset under zero-shot settings. Moreover, our pipeline enables automatic construction of semantically aligned street-to-satellite datasets, offering a scalable and cost-efficient alternative to manual annotation. All source code will be made publicly available at https://jeonghomin.github.io/street2orbit.github.io/.

[81] AHA! Animating Human Avatars in Diverse Scenes with Gaussian Splatting

Aymen Mir,Jian Wang,Riza Alp Guler,Chuan Guo,Gerard Pons-Moll,Bing Zhou

Main category: cs.CV

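TL;DR: Proposes a new framework based on 3D Gaussian Splatting (3DGS) for geometry-consistent, free-viewpoint human animation and interaction in 3D scenes, decoupling rendering from motion synthesis without paired data.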

Details Motivation: Existing animation pipelines mostly use mesh or point-cloud representations, while 3DGS excels at novel-view synthesis but remains under-explored for human-scene animation; this work aims to use 3DGS to improve the realism and flexibility of human animation. Method: Both humans and scenes are represented as Gaussians; rendering is decoupled from motion synthesis, and a Gaussian-aligned motion module uses opacity cues and projected structures to guide human pose and placement; a human-scene Gaussian refinement optimization further improves interaction realism. Result: The method is evaluated on scenes from Scannet++ and SuperSplat and on avatars reconstructed from multi-view capture; it supports free-viewpoint rendering and enables a new application in which animated humans are added to edited monocular videos. Conclusion: The framework exploits the strengths of 3DGS to achieve high-quality, geometry-consistent human animation and scene interaction, and extends its potential for monocular video-based animation. Abstract: We present a novel framework for animating humans in 3D scenes using 3D Gaussian Splatting (3DGS), a neural scene representation that has recently achieved state-of-the-art photorealistic results for novel-view synthesis but remains under-explored for human-scene animation and interaction. Unlike existing animation pipelines that use meshes or point clouds as the underlying 3D representation, our approach introduces the use of 3DGS as the 3D representation to the problem of animating humans in scenes. By representing humans and scenes as Gaussians, our approach allows for geometry-consistent free-viewpoint rendering of humans interacting with 3D scenes. Our key insight is that the rendering can be decoupled from the motion synthesis and each sub-problem can be addressed independently, without the need for paired human-scene data. Central to our method is a Gaussian-aligned motion module that synthesizes motion without explicit scene geometry, using opacity-based cues and projected Gaussian structures to guide human placement and pose alignment. To ensure natural interactions, we further propose a human-scene Gaussian refinement optimization that enforces realistic contact and navigation. We evaluate our approach on scenes from Scannet++ and the SuperSplat library, and on avatars reconstructed from sparse and dense multi-view human capture. Finally, we demonstrate that our framework allows for novel applications such as geometry-consistent free-viewpoint rendering of edited monocular RGB videos with new animated humans, showcasing the unique advantage of 3DGS for monocular video-based human animation.

[82] CertMask: Certifiable Defense Against Adversarial Patches via Theoretically Optimal Mask Coverage

Xuntao Lyu,Ching-Chi Lin,Abdullah Al Arafat,Georg von der Brüggen,Jian-Jia Chen,Zhishan Guo

Main category: cs.CV

TL;DR: Proposes CertMask, a certifiably robust defense that counters adversarial image patch attacks with a single-round masking strategy of O(n) complexity.

Details Motivation: Adversarial patch attacks can be physically deployed and threaten real-world vision systems, while existing defenses are inefficient or lack sufficient theoretical guarantees. Method: A mathematically rigorous coverage strategy constructs a set of binary masks that covers every possible patch location at least k times, requiring only a single round of masking with O(n) time complexity. Result: On ImageNet, ImageNette, and CIFAR-10, CertMask improves certified robust accuracy by up to 13.4% over PatchCleanser while keeping clean accuracy close to that of the vanilla model. Conclusion: CertMask offers strong theoretical guarantees while improving defense efficiency and performance, making it an effective defense against adversarial patch attacks. Abstract: Adversarial patch attacks inject localized perturbations into images to mislead deep vision models. These attacks can be physically deployed, posing serious risks to real-world applications. In this paper, we propose CertMask, a certifiably robust defense that constructs a provably sufficient set of binary masks to neutralize patch effects with strong theoretical guarantees. While the state-of-the-art approach (PatchCleanser) requires two rounds of masking and incurs $O(n^2)$ inference cost, CertMask performs only a single round of masking with $O(n)$ time complexity, where $n$ is the cardinality of the mask set to cover an input image. Our proposed mask set is computed using a mathematically rigorous coverage strategy that ensures each possible patch location is covered at least $k$ times, providing both efficiency and robustness. We offer a theoretical analysis of the coverage condition and prove its sufficiency for certification. Experiments on ImageNet, ImageNette, and CIFAR-10 show that CertMask improves certified robust accuracy by up to +13.4\% over PatchCleanser, while maintaining clean accuracy nearly identical to the vanilla model.
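
The coverage condition ("each possible patch location is covered at least k times") can be checked mechanically. The sketch below is a one-dimensional toy, not CertMask's actual mask construction: it assumes sliding masks of fixed size placed at a regular stride, lets masks overhang the image border, and counts how many masks fully contain each possible patch position. The helpers `mask_offsets` and `min_coverage` are hypothetical names.

```python
def mask_offsets(length, patch_size, mask_size, stride):
    """Start offsets of sliding masks; masks may overhang the image border."""
    first = -(mask_size - patch_size)   # earliest mask that still covers position 0
    last = length - patch_size          # latest mask needed for the final position
    offsets = list(range(first, last + 1, stride))
    if offsets[-1] != last:
        offsets.append(last)            # make sure the final position is reachable
    return offsets


def min_coverage(length, patch_size, mask_size, stride):
    """Smallest number of masks that fully contain any possible patch position."""
    offsets = mask_offsets(length, patch_size, mask_size, stride)
    counts = []
    for x in range(length - patch_size + 1):
        covering = sum(1 for o in offsets
                       if o <= x and x + patch_size <= o + mask_size)
        counts.append(covering)
    return min(counts)


# Image width 10, patch width 2, mask width 4.
for stride in (1, 3, 4):
    print(stride, min_coverage(10, 2, 4, stride))
```

In this toy setting a patch at position x is covered by masks starting in a window of mask_size - patch_size + 1 consecutive offsets, so a smaller stride raises the guaranteed coverage k at the cost of a larger mask set.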

[83] CORONA-Fields: Leveraging Foundation Models for Classification of Solar Wind Phenomena

Daniela Martin,Jinsu Hong,Connor O'Brien,Valmir P Moraes Filho,Jasmine R. Kobayashi,Evangelia Samara,Joseph Gallego

Main category: cs.CV

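TL;DR: This study applies a foundation model for solar physics to solar wind structure analysis, building a neural field model that incorporates spacecraft position and solar magnetic connectivity and using Parker Solar Probe data for a downstream classification task, demonstrating the feasibility of foundation model embeddings for in situ solar wind tasks.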

Details Motivation: Space weather driven by solar activity poses growing risks to satellites and ground-based technological infrastructure, and the variability of the solar wind and coronal mass ejections makes automated classification challenging, so a new method that bridges remote sensing and in situ observations is needed. Method: A foundation model originally trained on Solar Dynamics Observatory imagery generates embeddings for solar wind structure analysis; these are concatenated with Fourier-feature encodings of spacecraft position and solar magnetic connectivity to build a neural field-based deep learning model, which is fine-tuned on Parker Solar Probe measurements to classify solar wind structures from plasma properties. Result: Overall classification performance is modest (likely due to coarse labels, class imbalance, and limited transferability of the pretrained model), but the model demonstrates that foundation model embeddings can be used for in situ solar wind analysis. Conclusion: As a first proof of concept, the study lays the groundwork for improved space weather prediction models, and the code is public to support reproducibility. Abstract: Space weather at Earth, driven by solar activity, poses growing risks to satellites around our planet as well as to critical ground-based technological infrastructure. Major space weather contributors are the solar wind and coronal mass ejections whose variable density, speed, temperature, and magnetic field make the automated classification of those structures challenging. In this work, we adapt a foundation model for solar physics, originally trained on Solar Dynamics Observatory imagery, to create embeddings suitable for solar wind structure analysis. These embeddings are concatenated with the spacecraft position and solar magnetic connectivity encoded using Fourier features which generates a neural field-based model. The full deep learning architecture is fine-tuned bridging the gap between remote sensing and in situ observations. Labels are derived from Parker Solar Probe measurements, forming a downstream classification task that maps plasma properties to solar wind structures. Although overall classification performance is modest, likely due to coarse labeling, class imbalance, and limited transferability of the pretrained model, this study demonstrates the feasibility of leveraging foundation model embeddings for in situ solar wind tasks. As a first proof-of-concept, it lays the groundwork for future improvements toward more reliable space weather predictions. The code and configuration files used in this study are publicly available to support reproducibility.
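
As a rough illustration of the input construction described above (image embeddings concatenated with Fourier-feature encodings of position and connectivity), the sketch below uses the common sin/cos encoding with frequencies doubling at each level. The frequency schedule, the number of frequencies, the coordinate normalization, and the helper names `fourier_features` and `build_input` are all assumptions; the abstract does not specify them.

```python
import math


def fourier_features(coords, num_freqs=4):
    """Encode each coordinate x as [sin(2^i * pi * x), cos(2^i * pi * x)] for i < num_freqs."""
    feats = []
    for x in coords:
        for i in range(num_freqs):
            angle = (2.0 ** i) * math.pi * x
            feats.append(math.sin(angle))
            feats.append(math.cos(angle))
    return feats


def build_input(embedding, position, connectivity, num_freqs=4):
    """Concatenate an image embedding with encoded position and connectivity."""
    return (list(embedding)
            + fourier_features(position, num_freqs)
            + fourier_features(connectivity, num_freqs))


# Hypothetical 8-dim embedding, 3-dim normalized position, 2-dim connectivity footpoint.
vec = build_input([0.1] * 8, [0.2, 0.3, 0.4], [0.5, 0.6])
print(len(vec))
```

Each scalar coordinate expands to 2 * num_freqs values, so the downstream classifier sees both coarse and fine variation in position without any learned parameters in the encoding itself.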

[84] IPCD: Intrinsic Point-Cloud Decomposition

Shogo Sato,Takuhiro Kaneko,Shoichiro Takeda,Tomoyasu Shimada,Kazuhiko Murasaki,Taiga Yoshida,Ryuichi Tanida,Akisato Kimura

Main category: cs.CV

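TL;DR: This paper proposes Intrinsic Point-Cloud Decomposition (IPCD), which decomposes colored point clouds into albedo and shade and addresses two challenges: handling non-grid structure and modeling global-light direction.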

Details Motivation: Point clouds are widely used in fields such as AR and robotics, but existing methods struggle to separate albedo from shade accurately on non-grid structures and do not model global-light direction, which leads to inaccurate decomposition. Method: Proposes IPCD-Net, which uses point-wise feature aggregation to handle non-grid data, and introduces Projection-based Luminance Distribution (PLD) with hierarchical feature refinement to capture global-light information through multi-view projection. Result: Experiments show that IPCD-Net reduces cast shadows in albedo, improves color accuracy in shade, and works well for texture editing, relighting, and point-cloud registration under varying illumination. Conclusion: IPCD-Net achieves effective intrinsic decomposition of point clouds, with good performance and application potential in both synthetic and real-world scenes. Abstract: Point clouds are widely used in various fields, including augmented reality (AR) and robotics, where relighting and texture editing are crucial for realistic visualization. Achieving these tasks requires accurately separating albedo from shade. However, performing this separation on point clouds presents two key challenges: (1) the non-grid structure of point clouds makes conventional image-based decomposition models ineffective, and (2) point-cloud models designed for other tasks do not explicitly consider global-light direction, resulting in inaccurate shade. In this paper, we introduce Intrinsic Point-Cloud Decomposition (IPCD), which extends image decomposition to the direct decomposition of colored point clouds into albedo and shade. To overcome challenge (1), we propose IPCD-Net, which extends an image-based model with point-wise feature aggregation for non-grid data processing. For challenge (2), we introduce Projection-based Luminance Distribution (PLD) with a hierarchical feature refinement, capturing global-light cues via multi-view projection. For comprehensive evaluation, we create a synthetic outdoor-scene dataset. Experimental results demonstrate that IPCD-Net reduces cast shadows in albedo and enhances color accuracy in shade. Furthermore, we showcase its applications in texture editing, relighting, and point-cloud registration under varying illumination. Finally, we verify the real-world applicability of IPCD-Net.

[85] Remember Me: Bridging the Long-Range Gap in LVLMs with Three-Step Inference-Only Decay Resilience Strategies

Peng Gao,Yujian Lee,Xiaofeng Zhang,Zailong Chen,Hui Zhang

Main category: cs.CV

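TL;DR: Proposes training-free Three-step Decay Resilience Strategies (T-DRS) to mitigate the long-range dependency decay caused by rotary positional encoding in large vision-language models, improving performance on visual question answering tasks.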

Details Motivation: Large vision-language models that use Rotary Positional Encoding (ROPE) suffer from long-range attention decay, which weakens their memory of global context and limits their performance on multimodal tasks. Method: Proposes inference-only Three-step Decay Resilience Strategies (T-DRS), comprising Semantic-Driven DRS (SD-DRS), Distance-aware Control DRS (DC-DRS), and re-Reinforce Distant DRS (reRD-DRS), which recover suppressed long-range dependencies through content-aware residuals, smooth modulation based on positional distance, and consolidation of the remaining remote dependencies, respectively. Result: Extensive experiments on several visual question answering (VQA) benchmarks show that T-DRS consistently improves model performance without any training. Conclusion: T-DRS mitigates the long-range attention decay introduced by ROPE and restores global context modeling without harming local inductive biases, offering an efficient, plug-and-play improvement for large vision-language models. Abstract: Large Vision-Language Models (LVLMs) have achieved impressive performance across a wide range of multimodal tasks. However, they still face critical challenges in modeling long-range dependencies when using Rotary Positional Encoding (ROPE). Although it can facilitate precise modeling of token positions, it induces progressive attention decay as token distance increases, especially over distant token pairs, which severely impairs the model's ability to remember global context. To alleviate this issue, we propose inference-only Three-step Decay Resilience Strategies (T-DRS), comprising (1) Semantic-Driven DRS (SD-DRS), amplifying semantically meaningful but distant signals via content-aware residuals, (2) Distance-aware Control DRS (DC-DRS), which can purify attention by smoothly modulating weights based on positional distances, suppressing noise while preserving locality, and (3) re-Reinforce Distant DRS (reRD-DRS), consolidating the remaining informative remote dependencies to maintain global coherence. Together, the T-DRS recover suppressed long-range token pairs without harming local inductive biases. Extensive experiments on Vision Question Answering (VQA) benchmarks demonstrate that T-DRS can consistently improve performance in a training-free manner. The code is available at https://github.com/labixiaoq-qq/Remember-me
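
To make the attention decay that motivates this work concrete, the sketch below computes the rotary-encoded attention logit between two identical all-ones vectors at increasing token distances. It illustrates the standard rotary encoding only and is not part of T-DRS; the dimension (64) and base (10000) are commonly used defaults assumed here.

```python
import math

def rope_score(distance, dim=64, base=10000.0):
    """Attention logit between two all-ones vectors under rotary encoding.

    For q = k = ones, each 2-D rotary pair contributes 2 * cos(distance * theta_i),
    where theta_i = base ** (-2 * i / dim).
    """
    total = 0.0
    for i in range(dim // 2):
        theta = base ** (-2.0 * i / dim)
        total += 2.0 * math.cos(distance * theta)
    return total

# Logit at several relative distances; distance 0 gives the maximum value, dim.
scores = {d: rope_score(d) for d in (0, 1, 10, 100, 1000)}
for d, s in scores.items():
    print(d, round(s, 2))
```

The logit peaks at zero distance and is lower at larger distances (with oscillations), which is the long-range suppression that inference-time strategies such as T-DRS try to counteract.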

[86] SAM-DAQ: Segment Anything Model with Depth-guided Adaptive Queries for RGB-D Video Salient Object Detection

Jia Lin,Xiaofei Zhou,Jiyuan Liu,Runmin Cong,Guodao Zhang,Zhi Liu,Jiyong Zhang

Main category: cs.CV

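TL;DR: Proposes SAM-DAQ, a SAM-based model with depth-guided adaptive queries for RGB-D video salient object detection, which fuses depth and temporal information to achieve efficient and accurate segmentation without manual prompts.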

Details Motivation: Existing attempts to apply SAM to RGB-D video salient object detection face three challenges: dependence on manual prompts, high memory consumption, and heavy computation, so a more efficient and automated solution is needed. Method: Proposes SAM-DAQ with two core modules: 1) a parallel adapter-based multi-modal image encoder (PAMIE), which adds depth-guided parallel adapters (DPAs) in a skip-connection manner to fuse depth information and fine-tunes the SAM encoder under prompt-free conditions; 2) a query-driven temporal memory (QTM) module, which unifies the memory bank and prompt embeddings and uses frame-level and video-level queries to extract temporally consistent features and update the query representations. Result: Experiments on three RGB-D VSOD datasets show that SAM-DAQ outperforms existing state-of-the-art methods on all evaluation metrics, with stronger performance and efficiency. Conclusion: SAM-DAQ addresses the key bottlenecks in applying SAM to RGB-D video salient object detection, achieving prompt-free, memory-efficient, high-performance salient object segmentation and providing a workable framework for follow-up research. Abstract: Recently, the Segment Anything Model (SAM) has attracted widespread attention, and it is often treated as a vision foundation model for universal segmentation. Some researchers have attempted to directly apply the foundation model to the RGB-D video salient object detection (RGB-D VSOD) task, which often encounters three challenges, including the dependence on manual prompts, the high memory consumption of sequential adapters, and the computational burden of memory attention. To address the limitations, we propose a novel method, namely Segment Anything Model with Depth-guided Adaptive Queries (SAM-DAQ), which adapts SAM2 to pop out salient objects from videos by seamlessly integrating depth and temporal cues within a unified framework. Firstly, we deploy a parallel adapter-based multi-modal image encoder (PAMIE), which incorporates several depth-guided parallel adapters (DPAs) in a skip-connection way. Remarkably, we fine-tune the frozen SAM encoder under prompt-free conditions, where the DPA utilizes depth cues to facilitate the fusion of multi-modal features. Secondly, we deploy a query-driven temporal memory (QTM) module, which unifies the memory bank and prompt embeddings into a learnable pipeline. Concretely, by leveraging both frame-level queries and video-level queries simultaneously, the QTM module can not only selectively extract temporal consistency features but also iteratively update the temporal representations of the queries. 
Extensive experiments are conducted on three RGB-D VSOD datasets, and the results show that the proposed SAM-DAQ consistently outperforms state-of-the-art methods in terms of all evaluation metrics.

[87] RWKV-PCSSC: Exploring RWKV Model for Point Cloud Semantic Scene Completion

Wenzhe He,Xiaojun Chen,Wentang Chen,Hongyu Wang,Ying Liu,Ruihui Li

Main category: cs.CV

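TL;DR: Proposes RWKV-PCSSC, a lightweight point cloud semantic scene completion network based on the RWKV mechanism. Its RWKV-SG and RWKV-PD modules perform efficient feature aggregation and point-wise restoration, substantially reducing parameter count and improving memory efficiency while reaching SOTA performance on multiple indoor and outdoor datasets.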

Details Motivation: Existing semantic scene completion methods usually adopt dense network architectures, which leads to high model complexity and heavy resource consumption, so a lighter and more efficient solution is needed. Method: Designs an RWKV Seed Generator (RWKV-SG) module that aggregates features from a partial point cloud to produce a coarse point cloud with coarse features, then restores point-wise features progressively through multiple stages of RWKV Point Deconvolution (RWKV-PD) modules, within a lightweight overall architecture. Result: Compared with the current method PointSSC, the parameter count is reduced by 4.18 times and memory efficiency improves by 1.37 times, with state-of-the-art performance on standard datasets (SSC-PC, NYUCAD-PC, PointSSC) and newly proposed ones (NYUCAD-PC-V2, 3D-FRONT-PC). Conclusion: By introducing the RWKV mechanism, RWKV-PCSSC achieves efficient, lightweight point cloud semantic scene completion, greatly reducing model complexity while keeping high performance, with good application prospects. Abstract: Semantic Scene Completion (SSC) aims to generate a complete semantic scene from an incomplete input. Existing approaches often employ dense network architectures with a high parameter count, leading to increased model complexity and resource demands. To address these limitations, we propose RWKV-PCSSC, a lightweight point cloud semantic scene completion network inspired by the Receptance Weighted Key Value (RWKV) mechanism. Specifically, we introduce a RWKV Seed Generator (RWKV-SG) module that can aggregate features from a partial point cloud to produce a coarse point cloud with coarse features. Subsequently, the point-wise feature of the point cloud is progressively restored through multiple stages of the RWKV Point Deconvolution (RWKV-PD) modules. By leveraging a compact and efficient design, our method achieves a lightweight model representation. Experimental results demonstrate that RWKV-PCSSC reduces the parameter count by 4.18$\times$ and improves memory efficiency by 1.37$\times$ compared to the state-of-the-art method PointSSC. Furthermore, our network achieves state-of-the-art performance on established indoor (SSC-PC, NYUCAD-PC) and outdoor (PointSSC) scene datasets, as well as on our proposed datasets (NYUCAD-PC-V2, 3D-FRONT-PC).

[88] HCC-3D: Hierarchical Compensatory Compression for 98% 3D Token Reduction in Vision-Language Models

Liheng Zhang,Jin Wang,Hui Li,Bingfeng Zhang,Weifeng Liu

Main category: cs.CV

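TL;DR: Proposes HCC-3D, a hierarchical compensatory compression method that efficiently compresses 3D point cloud tokens, greatly reducing computational overhead while preserving essential information.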

Details Motivation: Existing 3D-VLMs feed all 3D tokens into the large language model, which makes computation too expensive, so a method is needed that reduces this overhead while retaining essential information. Method: Proposes HCC-3D with two modules, global structure compression (GSC) and adaptive detail mining (ADM): GSC uses global queries to compress a large number of 3D tokens into a few key tokens, and ADM selectively recovers important features that were overlooked, using a complementary scoring mechanism. Result: Experiments show that HCC-3D reaches an extreme compression ratio of about 98% and achieves new SOTA performance alongside a large gain in efficiency. Conclusion: HCC-3D resolves the tension between computational overhead and information retention in 3D-VLMs, combining high efficiency with high performance and providing a better framework for multi-modal 3D understanding. Abstract: 3D understanding has drawn significant attention recently, leveraging Vision-Language Models (VLMs) to enable multi-modal reasoning between point cloud and text data. Current 3D-VLMs directly embed the 3D point clouds into 3D tokens, following large 2D-VLMs with powerful reasoning capabilities. However, this framework incurs a high computational cost that limits its application, and we identify that the bottleneck lies in processing all 3D tokens in the Large Language Model (LLM) part. This raises the question: how can we reduce the computational overhead introduced by 3D tokens while preserving the integrity of their essential information? To address this question, we introduce Hierarchical Compensatory Compression (HCC-3D) to efficiently compress 3D tokens while maintaining critical detail retention. Specifically, we first propose a global structure compression (GSC), in which we design global queries to compress all 3D tokens into a few key tokens while keeping overall structural information. Then, to compensate for the information loss in GSC, we further propose an adaptive detail mining (ADM) module that selectively recompresses salient but under-attended features through complementary scoring. Extensive experiments demonstrate that HCC-3D not only achieves extreme compression ratios (approximately 98%) compared to previous 3D-VLMs, but also achieves new state-of-the-art performance, showing great improvements in both efficiency and performance.
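
The abstract describes global queries that compress all 3D tokens into a few key tokens. One generic way to realize such compression is attention pooling, sketched below. This is an assumption about the general mechanism rather than the GSC module itself: the queries would be learned in practice, and the token values, query vectors, and token counts here are placeholders.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def compress_tokens(tokens, queries):
    """Each query attends over all tokens and returns one pooled token."""
    dim = len(tokens[0])
    pooled = []
    for q in queries:
        scores = [sum(qi * ti for qi, ti in zip(q, t)) / math.sqrt(dim) for t in tokens]
        weights = softmax(scores)
        pooled.append([sum(w * t[d] for w, t in zip(weights, tokens)) for d in range(dim)])
    return pooled

# Placeholder data: 100 two-dimensional tokens and 2 fixed (normally learned) queries.
tokens = [[math.sin(i), math.cos(i)] for i in range(100)]
queries = [[1.0, 0.0], [0.0, 1.0]]

compressed = compress_tokens(tokens, queries)
reduction = 1.0 - len(compressed) / len(tokens)
print(len(compressed), reduction)
```

With 100 input tokens and 2 queries the token count drops by 98%, which matches the order of compression reported in the abstract; the actual token counts used by the authors are not stated in this digest.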

[89] Scale-Aware Relay and Scale-Adaptive Loss for Tiny Object Detection in Aerial Images

Jinfu Li,Yuqi Huang,Hong Song,Ting Wang,Jianghan Xia,Yucong Lin,Jingfan Fan,Jian Yang

Main category: cs.CV

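TL;DR: Proposes a Scale-Aware Relay Layer (SARL) and a Scale-Adaptive Loss (SAL) for tiny object detection in aerial images, which substantially improve detection performance on several benchmarks.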

Details Motivation: Modern detectors struggle with tiny objects in aerial images, mainly because tiny objects carry few features that are easily lost during network propagation, and because the regression penalties applied during training are poorly balanced across object scales. Method: Designs the SARL module, which uses cross-scale spatial-channel attention to enrich the features of each layer and strengthen cross-layer feature sharing, and proposes the SAL loss, which dynamically lowers the weight of larger objects so that training focuses on tiny ones. Result: On the three benchmarks AI-TOD, DOTA-v2.0, and VisDrone2019, embedding the method in YOLOv5 and YOLOX raises Average Precision (AP) by 5.5%, and it reaches 29.0% AP on the real-world noisy dataset AI-TOD-v2.0. Conclusion: SARL and SAL improve tiny object detection, are general and robust, and fit mainstream detection frameworks. Abstract: Recently, despite the remarkable advancements in object detection, modern detectors still struggle to detect tiny objects in aerial images. One key reason is that tiny objects carry limited features that are inevitably degraded or lost during long-distance network propagation. Another is that smaller objects receive disproportionately greater regression penalties than larger ones during training. To tackle these issues, we propose a Scale-Aware Relay Layer (SARL) and a Scale-Adaptive Loss (SAL) for tiny object detection, both of which are seamlessly compatible with the top-performing frameworks. Specifically, SARL employs a cross-scale spatial-channel attention to progressively enrich the meaningful features of each layer and strengthen the cross-layer feature sharing. SAL reshapes the vanilla IoU-based losses so as to dynamically assign lower weights to larger objects. This loss is able to focus training on tiny objects while reducing the influence on large objects. Extensive experiments are conducted on three benchmarks (i.e., AI-TOD, DOTA-v2.0 and VisDrone2019), and the results demonstrate that the proposed method boosts the generalization ability by 5.5% Average Precision (AP) when embedded in YOLOv5 (anchor-based) and YOLOX (anchor-free) baselines. Moreover, it also promotes the robust performance with 29.0% AP on the real-world noisy dataset (i.e., AI-TOD-v2.0).
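
The abstract states that SAL reshapes IoU-based losses so that larger objects receive lower weights, but it does not give the formula. The sketch below shows one hypothetical weighting with that property; the function `scale_weight`, its form `1 / (1 + size / ref_size)`, and the reference size of 32 pixels are assumptions, not the authors' loss.

```python
import math

def iou(box_a, box_b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0

def scale_weight(gt_box, ref_size=32.0):
    """Hypothetical weight that decreases as the ground-truth box gets larger."""
    x1, y1, x2, y2 = gt_box
    size = math.sqrt((x2 - x1) * (y2 - y1))
    return 1.0 / (1.0 + size / ref_size)

def scale_adaptive_iou_loss(pred_box, gt_box, ref_size=32.0):
    """Vanilla IoU loss (1 - IoU) multiplied by the scale-dependent weight."""
    return scale_weight(gt_box, ref_size) * (1.0 - iou(pred_box, gt_box))

# Two predictions with the same IoU (0.6) but very different object sizes.
tiny_loss = scale_adaptive_iou_loss((2, 0, 10, 8), (0, 0, 8, 8))
large_loss = scale_adaptive_iou_loss((32, 0, 160, 128), (0, 0, 128, 128))
print(tiny_loss, large_loss)
```

Both predictions have identical IoU, yet the tiny object contributes four times the loss of the large one under this weighting, which is the qualitative behaviour the abstract describes.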

[90] Regional Attention-Enhanced Swin Transformer for Clinically Relevant Medical Image Captioning

Zubia Naz,Farhan Asghar,Muhammad Ishfaq Hussain,Yahya Hadadi,Muhammad Aasim Rafique,Wookjin Choi,Moongu Jeon

Main category: cs.CV

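TL;DR: Proposes a Swin-BART encoder-decoder system with a lightweight regional attention module for automated medical image captioning, achieving state-of-the-art semantic fidelity on the ROCO dataset while remaining compact and interpretable.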

Details Motivation: Medical image captioning can support radiology reporting workflows, but it must balance semantic accuracy with model interpretability. Existing methods fall short in capturing key diagnostic regions, so a more efficient and clinically relevant model is needed. Method: Uses Swin-BART as the encoder-decoder framework and adds a lightweight regional attention module that strengthens attention on diagnostically salient regions before cross-attention. The model is trained and evaluated on ROCO, decoded with beam search, and studied through ablations, per-modality analysis, and significance tests. Result: The model clearly outperforms the baselines (ResNet-CNN and BLIP2-OPT) on ROUGE and BERTScore, reaching ROUGE 0.603 and BERTScore 0.807 (reported as mean±std over three seeds), with competitive BLEU, CIDEr, and METEOR scores. Ablations confirm the benefit of regional attention, and heatmaps visualize the regions that drive each caption. Conclusion: The proposed model produces accurate, clinically phrased captions with transparent regional attributions, supporting safe research use with a human in the loop. Abstract: Automated medical image captioning translates complex radiological images into diagnostic narratives that can support reporting workflows. We present a Swin-BART encoder-decoder system with a lightweight regional attention module that amplifies diagnostically salient regions before cross-attention. Trained and evaluated on ROCO, our model achieves state-of-the-art semantic fidelity while remaining compact and interpretable. We report results as mean$\pm$std over three seeds and include $95\%$ confidence intervals. Compared with baselines, our approach improves ROUGE (proposed 0.603, ResNet-CNN 0.356, BLIP2-OPT 0.255) and BERTScore (proposed 0.807, BLIP2-OPT 0.645, ResNet-CNN 0.623), with competitive BLEU, CIDEr, and METEOR. We further provide ablations (regional attention on/off and token-count sweep), per-modality analysis (CT/MRI/X-ray), paired significance tests, and qualitative heatmaps that visualize the regions driving each description. Decoding uses beam search (beam size $=4$), length penalty $=1.1$, no_repeat_ngram_size $=3$, and max length $=128$. The proposed design yields accurate, clinically phrased captions and transparent regional attributions, supporting safe research use with a human in the loop.
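
The decoding settings listed in the abstract can be written down as a plain configuration, shown below together with a small sketch of what a no-repeat n-gram constraint of size 3 does. The dictionary keys mirror common text-generation keyword names and are an assumption about how the settings would be passed; the helper `banned_next_tokens` and the example caption tokens are hypothetical.

```python
# Decoding settings as reported in the abstract (key names are assumed).
decoding_config = {
    "num_beams": 4,
    "length_penalty": 1.1,
    "no_repeat_ngram_size": 3,
    "max_length": 128,
}

def banned_next_tokens(tokens, ngram_size=3):
    """Return tokens that would complete an n-gram already present in `tokens`."""
    if len(tokens) < ngram_size - 1:
        return set()
    # The last (ngram_size - 1) tokens form the prefix of the next n-gram.
    prefix = tuple(tokens[len(tokens) - (ngram_size - 1):])
    banned = set()
    for i in range(len(tokens) - ngram_size + 1):
        if tuple(tokens[i:i + ngram_size - 1]) == prefix:
            banned.add(tokens[i + ngram_size - 1])
    return banned

# Hypothetical partial caption: "the left lung" already occurred once.
partial = ["the", "left", "lung", "and", "the", "left"]
banned = banned_next_tokens(partial, decoding_config["no_repeat_ngram_size"])
print(banned)
```

Because the 3-gram "the left lung" has already been generated, "lung" is excluded as the next token, which is how this constraint suppresses repeated phrases in generated captions.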

[91] Simulating Distribution Dynamics: Liquid Temporal Feature Evolution for Single-Domain Generalized Object Detection

Zihao Zhang,Yang Li,Aming Wu,Yahong Han

Main category: cs.CV

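TL;DR: Proposes Liquid Temporal Feature Evolution (LTFE) for Single-Domain Generalized Object Detection (Single-DGOD). By introducing temporal modeling and liquid neural network-driven parameter adjustment, it simulates the progressive evolution of features from the source domain to latent distributions, substantially improving generalization and robustness on unknown domains.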

Details Motivation: Existing methods rely on discrete data augmentation or static perturbations to compensate for missing target-domain data. They struggle to capture the continuous, gradual domain shifts of real scenes (such as changes in weather or lighting), which limits their ability to perceive fine-grained cross-domain differences. Method: Uses controllable Gaussian noise injection and multi-scale Gaussian blurring to simulate initial feature perturbations, then combines temporal modeling with a liquid neural network-driven parameter adjustment mechanism to generate adaptive modulation parameters, enabling smooth and continuous adaptation across domains. Result: Achieves significant performance gains on the Diverse Weather dataset and the Real-to-Art benchmark, confirming the method's advantage when transferring to unknown domains. Conclusion: By modeling progressive cross-domain feature evolution and dynamically regulating adaptation paths, LTFE narrows the distribution gap between the source domain and unknown domains and markedly improves detector generalization and robustness. Abstract: In this paper, we focus on Single-Domain Generalized Object Detection (Single-DGOD), aiming to transfer a detector trained on one source domain to multiple unknown domains. Existing methods for Single-DGOD typically rely on discrete data augmentation or static perturbation methods to expand data diversity, thereby mitigating the lack of access to target domain data. However, in real-world scenarios such as changes in weather or lighting conditions, domain shifts often occur continuously and gradually. Discrete augmentations and static perturbations fail to effectively capture the dynamic variation of feature distributions, thereby limiting the model's ability to perceive fine-grained cross-domain differences. To this end, we propose a new method, Liquid Temporal Feature Evolution, which simulates the progressive evolution of features from the source domain to simulated latent distributions by incorporating temporal modeling and liquid neural network-driven parameter adjustment. Specifically, we introduce controllable Gaussian noise injection and multi-scale Gaussian blurring to simulate initial feature perturbations, followed by temporal modeling and a liquid parameter adjustment mechanism to generate adaptive modulation parameters, enabling a smooth and continuous adaptation across domains. By capturing progressive cross-domain feature evolution and dynamically regulating adaptation paths, our method bridges the source-unknown domain distribution gap, significantly boosting generalization and robustness to unseen shifts. 
Significant performance improvements on the Diverse Weather dataset and Real-to-Art benchmark demonstrate the superiority of our method. Our code is available at https://github.com/2490o/LTFE.

[92] MosaicDoc: A Large-Scale Bilingual Benchmark for Visually Rich Document Understanding

Ketong Chen,Yuhao Chen,Yang Xue

Main category: cs.CV

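TL;DR: This paper presents MosaicDoc, a large-scale, bilingual (Chinese and English), multi-task benchmark for Visually Rich Document Understanding (VRDU), generated automatically by a multi-agent pipeline and covering complex layouts and real-world document challenges.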

Details Motivation: Existing benchmarks for Vision-Language Models (VLMs) are mostly English, use simple layouts, and cover few tasks, so they cannot properly assess visually rich document understanding with complex layouts and dense text. Method: Proposes DocWeaver, a multi-agent pipeline that uses Large Language Models (LLMs) to build the MosaicDoc benchmark automatically. The data comes from newspapers and magazines, contains 72K images and over 600K QA pairs, and supports multiple tasks including OCR, visual question answering, reading order, and localization. Result: MosaicDoc contains diverse, complex layouts (such as multi-column and non-Manhattan structures) from 196 publishers with bilingual coverage, and experiments on several state-of-the-art models reveal clear weaknesses in handling real-world document complexity. Conclusion: MosaicDoc provides a new standard for evaluating vision-language models on complex document understanding and points to directions for future research. Abstract: Despite the rapid progress of Vision-Language Models (VLMs), their capabilities are inadequately assessed by existing benchmarks, which are predominantly English-centric, feature simplistic layouts, and support limited tasks. Consequently, they fail to evaluate model performance for Visually Rich Document Understanding (VRDU), a critical challenge involving complex layouts and dense text. To address this, we introduce DocWeaver, a novel multi-agent pipeline that leverages Large Language Models to automatically generate a new benchmark. The result is MosaicDoc, a large-scale, bilingual (Chinese and English) resource designed to push the boundaries of VRDU. Sourced from newspapers and magazines, MosaicDoc features diverse and complex layouts (including multi-column and non-Manhattan), rich stylistic variety from 196 publishers, and comprehensive multi-task annotations (OCR, VQA, reading order, and localization). With 72K images and over 600K QA pairs, MosaicDoc serves as a definitive benchmark for the field. Our extensive evaluation of state-of-the-art models on this benchmark reveals their current limitations in handling real-world document complexity and charts a clear path for future research.

[93] Compensating Distribution Drifts in Class-incremental Learning of Pre-trained Vision Transformers

Xuan Rao,Simian Xu,Zheng Li,Bo Zhao,Derong Liu,Mingming Ha,Cesare Alippi

Main category: cs.CV

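TL;DR: Proposes SLDC, a sequential learning method for class-incremental learning that improves pre-trained vision transformers by compensating for feature distribution drift.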

Details Motivation: Existing sequential fine-tuning methods for class-incremental learning are vulnerable to distribution drift: the feature distributions of previously learned classes no longer match the updated model, which hurts classifier performance. Method: Introduces a latent space transition operator and proposes linear and weakly nonlinear variants of SLDC, combined with knowledge distillation, to align feature distributions across tasks and mitigate drift. Result: Experiments on several standard CIL benchmarks confirm the effectiveness of SLDC, which significantly improves SeqFT and, combined with KD, reaches performance comparable to joint training. Conclusion: SLDC compensates for distribution drift and improves sequential fine-tuning in class-incremental learning, offering a practical option for continual learning. Abstract: Recent advances have shown that sequential fine-tuning (SeqFT) of pre-trained vision transformers (ViTs), followed by classifier refinement using approximate distributions of class features, can be an effective strategy for class-incremental learning (CIL). However, this approach is susceptible to distribution drift, caused by the sequential optimization of shared backbone parameters. This results in a mismatch between the distributions of the previously learned classes and that of the updated model, ultimately degrading classifier performance over time. To address this issue, we introduce a latent space transition operator and propose Sequential Learning with Drift Compensation (SLDC). SLDC aims to align feature distributions across tasks to mitigate the impact of drift. First, we present a linear variant of SLDC, which learns a linear operator by solving a regularized least-squares problem that maps features before and after fine-tuning. Next, we extend this with a weakly nonlinear SLDC variant, which assumes that the ideal transition operator lies between purely linear and fully nonlinear transformations. This is implemented using learnable, weakly nonlinear mappings that balance flexibility and generalization. To further reduce representation drift, we apply knowledge distillation (KD) in both algorithmic variants. Extensive experiments on standard CIL benchmarks demonstrate that SLDC significantly improves the performance of SeqFT. Notably, by combining KD to address representation drift with SLDC to compensate distribution drift, SeqFT achieves performance comparable to joint training across all evaluated datasets. Code: https://github.com/raoxuan98-hash/sldc.git.
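
The linear variant described in the abstract learns an operator by regularized least squares that maps features from before fine-tuning to features after it. The sketch below solves the standard ridge-regression closed form W = (XᵀX + λI)⁻¹XᵀY for 2-dimensional features. It is a toy illustration under stated assumptions: the feature values, the synthetic drift matrix, the regularization strength, and the restriction to two dimensions (so the inverse can be written by hand) are all hypothetical.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]

def transpose(a):
    return [list(row) for row in zip(*a)]

def inv2(m):
    """Inverse of a 2x2 matrix."""
    (a, b), (c, d) = m
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

def fit_linear_transition(feats_before, feats_after, lam=1e-6):
    """Ridge solution W = (X^T X + lam * I)^-1 X^T Y for 2-D features."""
    xt = transpose(feats_before)
    gram = matmul(xt, feats_before)
    for i in range(2):
        gram[i][i] += lam
    return matmul(inv2(gram), matmul(xt, feats_after))

# Synthetic example: features drift by a known linear map during fine-tuning.
true_drift = [[1.1, 0.2], [-0.1, 0.9]]
before = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, -1.0]]
after = matmul(before, true_drift)

W = fit_linear_transition(before, after)
old_class_mean = [0.5, 0.5]
compensated_mean = matmul([old_class_mean], W)[0]
print(W, compensated_mean)
```

Applying the learned operator to stored statistics of old classes (here, a class mean) moves them into the updated feature space, so a classifier can be refined without revisiting old data.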

[94] Debiased Dual-Invariant Defense for Adversarially Robust Person Re-Identification

Yuhang Zhou,Yanxiang Zhao,Zhongyun Hua,Zhipu Liu,Zhaoquan Gu,Qing Liao,Leo Yu Zhang

Main category: cs.CV

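TL;DR: This paper proposes a debiased dual-invariant defense framework to improve the robustness of person re-identification (ReID) models under adversarial attacks. Through data balancing and a bi-adversarial self-meta mechanism, it achieves strong generalization to unseen identities and unseen attack types.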

Details Motivation: Adversarial defenses have progressed for classification tasks but do not transfer directly to metric learning tasks such as person ReID, and existing ReID defenses do not address the core challenges of model bias and composite generalization requirements. Method: Proposes a debiased dual-invariant defense framework with two phases: 1) diffusion-model-based data resampling to mitigate model bias; 2) metric adversarial training with farthest negative extension softening, combined with an adversarially-enhanced self-meta mechanism, to achieve dual generalization to unseen identities and unseen attack types. Result: Experiments show that the method significantly outperforms existing state-of-the-art defenses under a variety of adversarial attacks, improving the robustness and generalization of ReID models. Conclusion: The framework addresses model bias and composite generalization in ReID and offers a new approach to building secure and reliable person re-identification systems. Abstract: Person re-identification (ReID) is a fundamental task in many real-world applications such as pedestrian trajectory tracking. However, advanced deep learning-based ReID models are highly susceptible to adversarial attacks, where imperceptible perturbations to pedestrian images can cause entirely incorrect predictions, posing significant security threats. Although numerous adversarial defense strategies have been proposed for classification tasks, their extension to metric learning tasks such as person ReID remains relatively unexplored. Moreover, the few existing defenses for person ReID fail to address the inherent unique challenges of adversarially robust ReID. In this paper, we systematically distill the challenges of adversarial defense in person ReID into two key issues: model bias and composite generalization requirements. To address them, we propose a debiased dual-invariant defense framework composed of two main phases. In the data balancing phase, we mitigate model bias using a diffusion-model-based data resampling strategy that promotes fairness and diversity in training data. In the bi-adversarial self-meta defense phase, we introduce a novel metric adversarial training approach incorporating farthest negative extension softening to overcome the robustness degradation caused by the absence of a classifier. Additionally, we introduce an adversarially-enhanced self-meta mechanism to achieve dual-generalization for both unseen identities and unseen attack types. Experiments demonstrate that our method significantly outperforms existing state-of-the-art defenses.

[95] AdaptViG: Adaptive Vision GNN with Exponential Decay Gating

Mustafa Munir,Md Mostafijur Rahman,Radu Marculescu

Main category: cs.CV

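TL;DR: Proposes AdaptViG, an efficient hybrid vision graph neural network whose adaptive graph convolution mechanism sets a new state-of-the-art trade-off between accuracy and efficiency.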

Details Motivation: Vision Graph Neural Networks (ViGs) incur high computational overhead in their graph construction phase, and the goal is to improve model efficiency. Method: Introduces Adaptive Graph Convolution, built on a static axial scaffold and a dynamic, content-aware gating strategy (Exponential Decay Gating), together with a hybrid design that uses the efficient gating in early stages and global attention in the final stage. Result: AdaptViG-M reaches 82.6% top-1 accuracy with 80% fewer parameters and 84% fewer GMACs, and on downstream tasks it surpasses larger models in mIoU, APbox, and APmask. Conclusion: AdaptViG improves performance while substantially reducing computational cost, giving vision GNNs a better accuracy-efficiency trade-off. Abstract: Vision Graph Neural Networks (ViGs) offer a new direction for advancements in vision architectures. While powerful, ViGs often face substantial computational challenges stemming from their graph construction phase, which can hinder their efficiency. To address this issue we propose AdaptViG, an efficient and powerful hybrid Vision GNN that introduces a novel graph construction mechanism called Adaptive Graph Convolution. This mechanism builds upon a highly efficient static axial scaffold and a dynamic, content-aware gating strategy called Exponential Decay Gating. This gating mechanism selectively weighs long-range connections based on feature similarity. Furthermore, AdaptViG employs a hybrid strategy, utilizing our efficient gating mechanism in the early stages and a full Global Attention block in the final stage for maximum feature aggregation. Our method achieves a new state-of-the-art trade-off between accuracy and efficiency among Vision GNNs. For instance, our AdaptViG-M achieves 82.6% top-1 accuracy, outperforming ViG-B by 0.3% while using 80% fewer parameters and 84% fewer GMACs. On downstream tasks, AdaptViG-M obtains 45.8 mIoU, 44.8 APbox, and 41.1 APmask, surpassing the much larger EfficientFormer-L7 by 0.7 mIoU, 2.2 APbox, and 2.1 APmask, respectively, with 78% fewer parameters.
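
The abstract says Exponential Decay Gating weighs long-range connections by feature similarity, without giving the formula. A minimal sketch of one gate with that behaviour follows. The specific form `exp(-L1 distance / temperature)`, the temperature value, and the feature vectors are assumptions made for illustration, not the authors' definition.

```python
import math

def l1_distance(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))

def decay_gate(feat_i, feat_j, temperature=1.0):
    """Hypothetical gate: 1 for identical features, decaying toward 0 as they differ."""
    return math.exp(-l1_distance(feat_i, feat_j) / temperature)

def gated_aggregate(center, neighbors, temperature=1.0):
    """Weigh each neighbor's features by its gate and sum them."""
    gates = [decay_gate(center, n, temperature) for n in neighbors]
    dim = len(center)
    aggregated = [sum(g * n[d] for g, n in zip(gates, neighbors)) for d in range(dim)]
    return gates, aggregated

# Placeholder patch features: an identical, a dissimilar, and a similar neighbor.
center = [1.0, 0.0]
neighbors = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]]
gates, aggregated = gated_aggregate(center, neighbors)
print([round(g, 3) for g in gates])
```

Because the neighbor set can come from a fixed scaffold, only the cheap gate depends on content, which is consistent with the efficiency argument made in the abstract.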

[96] TSPE-GS: Probabilistic Depth Extraction for Semi-Transparent Surface Reconstruction via 3D Gaussian Splatting

Zhiyuan Xu,Nan Min,Yuhang Guo,Tong Wei

Main category: cs.CV

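TL;DR: Proposes TSPE-GS, which models pixel-wise multi-modal distributions of opacity and depth to resolve the depth ambiguity that arises when reconstructing semi-transparent surfaces with 3D Gaussian Splatting.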

Details Motivation: Conventional 3D Gaussian Splatting assumes a single depth per pixel and cannot handle multiple visible surfaces (such as semi-transparent objects), which leads to poor reconstruction of semi-transparent surfaces. Method: Proposes TSPE-GS, which uniformly samples transmittance to model a pixel-wise multi-modal distribution of opacity and depth, and progressively fuses truncated signed distance functions to reconstruct external and internal surfaces separately within a unified framework. Result: Experiments on public and self-collected semi-transparent and opaque datasets show that TSPE-GS significantly improves semi-transparent geometry reconstruction while maintaining good performance on opaque scenes. Conclusion: TSPE-GS resolves the depth ambiguity of semi-transparent surfaces and generalizes to other Gaussian-based reconstruction pipelines without extra training overhead. Abstract: 3D Gaussian Splatting offers a strong speed-quality trade-off but struggles to reconstruct semi-transparent surfaces because most methods assume a single depth per pixel, which fails when multiple surfaces are visible. We propose TSPE-GS (Transparent Surface Probabilistic Extraction for Gaussian Splatting), which uniformly samples transmittance to model a pixel-wise multi-modal distribution of opacity and depth, replacing the prior single-peak assumption and resolving cross-surface depth ambiguity. By progressively fusing truncated signed distance functions, TSPE-GS reconstructs external and internal surfaces separately within a unified framework. The method generalizes to other Gaussian-based reconstruction pipelines without extra training overhead. Extensive experiments on public and self-collected semi-transparent and opaque datasets show TSPE-GS significantly improves semi-transparent geometry reconstruction while maintaining performance on opaque scenes.
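
One way to read "uniformly samples transmittance" is sketched below for a single pixel ray: transmittance is accumulated front to back, and for each of several evenly spaced transmittance levels the first depth at which transmittance falls below that level is recorded. This is an interpretation offered as an assumption, not the TSPE-GS algorithm; the function name, the per-sample opacities, and the depths are hypothetical.

```python
def depths_at_transmittance_levels(depths, alphas, num_levels=5):
    """For evenly spaced transmittance levels, find the depth where each is crossed.

    `depths` and `alphas` describe samples along one ray, sorted front to back.
    """
    levels = [(i + 0.5) / num_levels for i in range(num_levels)]
    picked = []
    for tau in levels:
        transmittance = 1.0
        for depth, alpha in zip(depths, alphas):
            transmittance *= (1.0 - alpha)   # light remaining after this sample
            if transmittance < tau:
                picked.append(depth)
                break
    return picked

# Hypothetical ray: a thin semi-transparent layer at depth 1.0-1.1
# and an opaque surface behind it at depth 3.0-3.1.
ray_depths = [1.0, 1.1, 3.0, 3.1]
ray_alphas = [0.4, 0.1, 0.9, 0.5]

picked = depths_at_transmittance_levels(ray_depths, ray_alphas)
print(picked)
```

The recorded depths cluster around two values, one for the front layer and one for the surface behind it, whereas a single-depth-per-pixel assumption would have to choose only one of them.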

[97] Beyond Cosine Similarity Magnitude-Aware CLIP for No-Reference Image Quality Assessment

Zhicheng Liao,Dongxu Wu,Zhenshan Shi,Sijie Mai,Hanwei Zhu,Lingyu Zhu,Yuncheng Jiang,Baoliang Chen

Main category: cs.CV

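TL;DR: Proposes an adaptive fusion framework that uses both the magnitude of CLIP image features and cosine similarity for no-reference image quality assessment. It needs no task-specific training and outperforms existing methods on multiple benchmark datasets.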

Details Motivation: Existing CLIP-based no-reference image quality assessment methods rely mainly on semantic similarity (such as cosine similarity) and overlook the strong correlation between image feature magnitude and perceptual quality. Method: Extracts the absolute CLIP image features, applies a Box-Cox transformation to normalize them statistically and reduce semantic sensitivity, and designs a confidence-guided fusion scheme that adaptively combines the magnitude cue with cosine similarity. Result: Experiments on multiple standard IQA datasets show that the method consistently outperforms standard CLIP-based IQA and state-of-the-art baselines without task-specific training. Conclusion: Feature magnitude is an important and overlooked quality cue, and combining it with semantic matching improves no-reference image quality assessment. Abstract: Recent efforts have repurposed the Contrastive Language-Image Pre-training (CLIP) model for No-Reference Image Quality Assessment (NR-IQA) by measuring the cosine similarity between the image embedding and textual prompts such as "a good photo" or "a bad photo." However, this semantic similarity overlooks a critical yet underexplored cue: the magnitude of the CLIP image features, which we empirically find to exhibit a strong correlation with perceptual quality. In this work, we introduce a novel adaptive fusion framework that complements cosine similarity with a magnitude-aware quality cue. Specifically, we first extract the absolute CLIP image features and apply a Box-Cox transformation to statistically normalize the feature distribution and mitigate semantic sensitivity. The resulting scalar summary serves as a semantically-normalized auxiliary cue that complements cosine-based prompt matching. To integrate both cues effectively, we further design a confidence-guided fusion scheme that adaptively weighs each term according to its relative strength. Extensive experiments on multiple benchmark IQA datasets demonstrate that our method consistently outperforms standard CLIP-based IQA and state-of-the-art baselines, without any task-specific training.
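
The magnitude cue described in the abstract takes absolute feature values, applies a Box-Cox transformation, and reduces the result to a scalar. The sketch below implements the standard Box-Cox formula and one possible scalar summary. The choice of λ = 0.5, the small offset `eps` that keeps inputs positive, the use of a mean as the summary, and the feature values are all assumptions; the confidence-guided fusion step is not reproduced here.

```python
import math

def box_cox(x, lam):
    """Standard Box-Cox transform for x > 0: (x^lam - 1) / lam, or log(x) when lam == 0."""
    if lam == 0:
        return math.log(x)
    return (x ** lam - 1.0) / lam

def magnitude_cue(features, lam=0.5, eps=1e-6):
    """Mean Box-Cox value of the absolute features (assumed scalar summary)."""
    values = [box_cox(abs(f) + eps, lam) for f in features]
    return sum(values) / len(values)

# Placeholder image features: the second vector has twice the magnitude of the first.
weak_features = [0.2, -0.1, 0.05, -0.3]
strong_features = [2.0 * f for f in weak_features]

print(magnitude_cue(weak_features), magnitude_cue(strong_features))
```

Since Box-Cox is monotonically increasing for positive inputs, a feature vector with larger magnitudes yields a larger cue, while the transform compresses heavy-tailed values.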

[98] Robust Object Detection with Pseudo Labels from VLMs using Per-Object Co-teaching

Uday Bhaskar,Rishabh Bhattacharya,Avinash Patel,Sarthak Khoche,Praveen Anil Kulkarni,Naresh Manwani

Main category: cs.CV

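TL;DR: Proposes a method that uses vision-language models (VLMs) to generate pseudo-labels and a per-object co-teaching strategy to train efficient, real-time object detectors, greatly reducing reliance on human annotation and improving performance on KITTI and other datasets.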

Details Motivation: VLMs show promise for zero-shot object detection, but their high latency and hallucinated predictions make them unsuitable for direct use in real-time settings such as autonomous driving, where manual labelling is also prohibitively expensive. Method: Designs a pipeline in which a VLM generates pseudo-labels and a per-object co-teaching strategy trains the detector: two YOLO models learn together, and each filters unreliable bounding boxes out of every mini-batch based on its peer's loss values instead of discarding whole images. Result: On KITTI, mAP@0.5 rises from 31.12% to 46.61% compared with the baseline YOLOv5m, and to 57.97% when 10% ground truth labels are added. Similar gains are observed on ACDC and BDD100k, with real-time latency maintained. Conclusion: The method offers an efficient, robust, and scalable way to train high-performance detectors for autonomous driving while greatly reducing reliance on human annotation. Abstract: Foundation models, especially vision-language models (VLMs), offer compelling zero-shot object detection for applications like autonomous driving, a domain where manual labelling is prohibitively expensive. However, their detection latency and tendency to hallucinate predictions render them unsuitable for direct deployment. This work introduces a novel pipeline that addresses this challenge by leveraging VLMs to automatically generate pseudo-labels for training efficient, real-time object detectors. Our key innovation is a per-object co-teaching-based training strategy that mitigates the inherent noise in VLM-generated labels. The proposed per-object co-teaching approach filters noisy bounding boxes from training instead of filtering the entire image. Specifically, two YOLO models learn collaboratively, filtering out unreliable boxes from each mini-batch based on their peer's per-object loss values. Overall, our pipeline provides an efficient, robust, and scalable approach to train high-performance object detectors for autonomous driving, significantly reducing reliance on costly human annotation. Experimental results on the KITTI dataset demonstrate that our method outperforms a baseline YOLOv5m model, achieving a significant mAP@0.5 boost ($31.12\%$ to $46.61\%$) while maintaining real-time detection latency. Furthermore, we show that supplementing our pseudo-labelled data with a small fraction of ground truth labels ($10\%$) leads to further performance gains, reaching $57.97\%$ mAP@0.5 on the KITTI dataset. We observe similar performance improvements for the ACDC and BDD100k datasets.
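
The per-object filtering step described in the abstract can be sketched as a small-loss selection in which each model trains only on the boxes its peer finds reliable. The version below is a simplified illustration: the keep ratio, the loss values, and the function names are assumptions, and a real pipeline would compute per-box detection losses from two YOLO models.

```python
def select_small_loss(losses, keep_ratio):
    """Return indices of the boxes with the smallest losses, in original order."""
    keep = max(1, int(round(keep_ratio * len(losses))))
    ranked = sorted(range(len(losses)), key=lambda i: losses[i])
    return sorted(ranked[:keep])

def co_teaching_step(losses_a, losses_b, keep_ratio=0.75):
    """Each model keeps the boxes that its peer assigns a small loss to."""
    keep_for_a = select_small_loss(losses_b, keep_ratio)  # chosen by model B
    keep_for_b = select_small_loss(losses_a, keep_ratio)  # chosen by model A
    return keep_for_a, keep_for_b

# Hypothetical per-box losses for four pseudo-labelled boxes in one mini-batch.
losses_a = [0.2, 0.3, 2.5, 0.1]
losses_b = [0.25, 2.0, 0.4, 0.15]

keep_for_a, keep_for_b = co_teaching_step(losses_a, losses_b)
print(keep_for_a, keep_for_b)
```

Here model A skips box 1 (which model B finds suspicious) and model B skips box 2, while the remaining boxes in the same image still contribute to training.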

[99] Equivariant Sampling for Improving Diffusion Model-based Image Restoration

Chenxu Wu,Qingpeng Kong,Peiang Zhao,Wendi Yang,Wenxin Ma,Fenghe Tang,Zihang Jiang,S. Kevin Zhou

Main category: cs.CV

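TL;DR: This paper proposes EquS, a diffusion model-based image restoration method that imposes equivariant information through dual sampling trajectories and adds a Timestep-Aware Schedule (TAS) to improve performance. Experiments show that it significantly improves on existing methods without increasing computational cost.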

Details Motivation: Existing problem-agnostic diffusion model-based image restoration methods struggle to exploit diffusion priors fully, which limits their performance. This paper analyzes their sampling process and proposes improvements. Method: Proposes EquS, which imposes equivariant information through dual sampling trajectories, and further designs the Timestep-Aware Schedule (TAS), which prioritizes deterministic steps to improve sampling efficiency and restoration quality. Result: Experiments on several benchmark datasets show that EquS and its enhanced version EquS+ significantly improve existing problem-agnostic DMIR methods, are broadly compatible with them, and add no computational overhead. Conclusion: By introducing equivariant information and an improved sampling strategy, EquS improves diffusion models on image restoration tasks and offers a new direction for problem-agnostic DMIR. Abstract: Recent advances in generative models, especially diffusion models, have significantly improved image restoration (IR) performance. However, existing problem-agnostic diffusion model-based image restoration (DMIR) methods face challenges in fully leveraging diffusion priors, resulting in suboptimal performance. In this paper, we address the limitations of current problem-agnostic DMIR methods by analyzing their sampling process and providing effective solutions. We introduce EquS, a DMIR method that imposes equivariant information through dual sampling trajectories. To further boost EquS, we propose the Timestep-Aware Schedule (TAS) and introduce EquS$^+$. TAS prioritizes deterministic steps to enhance certainty and sampling efficiency. Extensive experiments on benchmarks demonstrate that our method is compatible with previous problem-agnostic DMIR methods and significantly boosts their performance without increasing computational costs. Our code is available at https://github.com/FouierL/EquS.

[100] Difference Vector Equalization for Robust Fine-tuning of Vision-Language Models

Satoshi Suzuki,Shin'ya Yamaguchi,Shoichiro Takeda,Taiga Yamane,Naoki Makishima,Naotaka Kawata,Mana Ihori,Tomohiro Tanaka,Shota Orihashi,Ryo Masumura

Main category: cs.CV

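TL;DR: This paper proposes Difference Vector Equalization (DiVE), a robust fine-tuning method that preserves the geometric structure of pre-trained vision-language models and thereby performs well in in-distribution, out-of-distribution, and zero-shot settings.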

Details Motivation: Existing fine-tuning methods distort the geometric structure of the embedding space during fine-tuning, which limits generalization on out-of-distribution and zero-shot tasks, so a fine-tuning method that preserves this structure is needed. Method: DiVE preserves the geometric structure by constraining difference vectors, each computed between the embeddings that the pre-trained and fine-tuned models produce for the same data sample. Two losses are introduced: the average vector loss (AVL) globally constrains the difference vectors to equal their weighted average, and the pairwise vector loss (PVL) locally keeps the multimodal alignment consistent. Result: Experiments show that DiVE effectively preserves the geometric structure of the embedding space and outperforms existing methods on in-distribution, out-of-distribution, and zero-shot classification tasks. Conclusion: By retaining the embedding geometry learned during pre-training, DiVE improves in-distribution performance without sacrificing zero-shot and out-of-distribution performance, offering an effective solution for robust fine-tuning of vision-language models. Abstract: Contrastive pre-trained vision-language models, such as CLIP, demonstrate strong generalization abilities in zero-shot classification by leveraging embeddings extracted from image and text encoders. This paper aims to robustly fine-tune these vision-language models on in-distribution (ID) data without compromising their generalization abilities in out-of-distribution (OOD) and zero-shot settings. Current robust fine-tuning methods tackle this challenge by reusing contrastive learning, which was used in pre-training, for fine-tuning. However, we found that these methods distort the geometric structure of the embeddings, which plays a crucial role in the generalization of vision-language models, resulting in limited OOD and zero-shot performance. To address this, we propose Difference Vector Equalization (DiVE), which preserves the geometric structure during fine-tuning. The idea behind DiVE is to constrain difference vectors, each of which is obtained by subtracting the embeddings extracted from the pre-trained and fine-tuning models for the same data sample. By constraining the difference vectors to be equal across various data samples, we effectively preserve the geometric structure. Therefore, we introduce two losses: average vector loss (AVL) and pairwise vector loss (PVL). AVL preserves the geometric structure globally by constraining difference vectors to be equal to their weighted average. PVL preserves the geometric structure locally by ensuring a consistent multimodal alignment. Our experiments demonstrate that DiVE effectively preserves the geometric structure, achieving strong results across ID, OOD, and zero-shot metrics.
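
To make the difference-vector idea concrete, here is a minimal NumPy sketch of an AVL-style penalty. It assumes uniform weights for the weighted average and a squared-error form; both are illustrative choices, and `average_vector_loss` is a hypothetical name rather than the authors' implementation.

```python
import numpy as np

def average_vector_loss(pre_emb, ft_emb, weights=None):
    """Penalize difference vectors that deviate from their (weighted) average."""
    diff = ft_emb - pre_emb                      # one difference vector per sample, shape (n, d)
    if weights is None:
        weights = np.full(len(diff), 1.0 / len(diff))   # assumption: uniform weights
    mean_diff = weights @ diff                   # weighted average difference vector, shape (d,)
    return float(np.mean(np.sum((diff - mean_diff) ** 2, axis=1)))

rng = np.random.default_rng(0)
pre = rng.normal(size=(8, 4))

# A pure translation moves every embedding by the same vector: geometry is preserved.
shifted = pre + np.array([0.5, -0.2, 0.0, 1.0])
# Independent per-sample perturbations distort the relative geometry.
distorted = pre + rng.normal(scale=0.3, size=pre.shape)

loss_shift = average_vector_loss(pre, shifted)
loss_distort = average_vector_loss(pre, distorted)
```

The translated embeddings give a loss of essentially zero, while the independently perturbed ones are penalized, which is the sense in which equal difference vectors preserve the geometric structure.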

[101] STELLAR: Scene Text Editor for Low-Resource Languages and Real-World Data

Yongdeuk Seo,Hyun-seok Min,Sungchul Choi

Main category: cs.CV

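TL;DR: This paper proposes STELLAR, a scene text editing method for low-resource languages and real-world data. With a language-adaptive glyph encoder and a multi-stage training strategy, it outperforms existing methods in visual consistency and recognition accuracy, and it introduces TAS, a new metric for evaluating style preservation.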

Details Motivation: Existing diffusion models for scene text editing offer poor support for low-resource languages, suffer from a large domain gap between synthetic and real data, and lack an effective metric for evaluating style preservation. Method: The STELLAR framework combines a language-adaptive glyph encoder with a multi-stage training strategy (pre-training on synthetic data, then fine-tuning on real images). The authors also build a new dataset, STIPLAR, and propose the Text Appearance Similarity (TAS) metric to quantify font, color, and background similarity. Result: Experiments show that STELLAR improves the average TAS across languages by 2.2% over the baselines and outperforms current state-of-the-art models in both visual quality and recognition accuracy. Conclusion: STELLAR addresses low-resource language support, domain shift, and style evaluation, substantially improving both the performance and the evaluation of multilingual text editing in real-world scenes. Abstract: Scene Text Editing (STE) is the task of modifying text content in an image while preserving its visual style, such as font, color, and background. While recent diffusion-based approaches have shown improvements in visual quality, key limitations remain: lack of support for low-resource languages, domain gap between synthetic and real data, and the absence of appropriate metrics for evaluating text style preservation. To address these challenges, we propose STELLAR (Scene Text Editor for Low-resource LAnguages and Real-world data). STELLAR enables reliable multilingual editing through a language-adaptive glyph encoder and a multi-stage training strategy that first pre-trains on synthetic data and then fine-tunes on real images. We also construct a new dataset, STIPLAR (Scene Text Image Pairs of Low-resource lAnguages and Real-world data), for training and evaluation. Furthermore, we propose Text Appearance Similarity (TAS), a novel metric that assesses style preservation by independently measuring font, color, and background similarity, enabling robust evaluation even without ground truth. Experimental results demonstrate that STELLAR outperforms state-of-the-art models in visual consistency and recognition accuracy, achieving an average TAS improvement of 2.2% across languages over the baselines.

[102] MOBA: A Material-Oriented Backdoor Attack against LiDAR-based 3D Object Detection Systems

Saket S. Chaturvedi,Gaurav Bagwe,Lan Zhang,Pan He,Xiaoyong Yuan

Main category: cs.CV

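TL;DR: This paper proposes MOBA, a material-oriented backdoor attack framework that bridges the digital-physical gap by modeling the material properties of real-world triggers. It reaches an attack success rate of up to 93.50% against LiDAR-based 3D object detection systems, significantly outperforming existing methods.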

Details Motivation: Existing LiDAR backdoor attacks lack physical realizability: digital triggers ignore material-dependent reflection properties, while physical triggers are often unoptimized, making them ineffective or easy to detect. Method: The MOBA framework (1) selects titanium dioxide (TiO_2) as the trigger material for its high diffuse reflectivity and environmental robustness, and (2) builds a simulation pipeline with an angle-independent approximation of the Oren-Nayar BRDF model and a distance-aware scaling mechanism, so that the digital trigger accurately mimics physical behavior. Result: Experiments on state-of-the-art LiDAR-based and multimodal fusion models show that MOBA reaches a 93.50% attack success rate, more than 41% above prior methods. Conclusion: MOBA exposes a new class of physically realizable threats and underlines that defenses must account for material-level properties in real-world environments. Abstract: LiDAR-based 3D object detection is widely used in safety-critical systems. However, these systems remain vulnerable to backdoor attacks that embed hidden malicious behaviors during training. A key limitation of existing backdoor attacks is their lack of physical realizability, primarily due to the digital-to-physical domain gap. Digital triggers often fail in real-world settings because they overlook material-dependent LiDAR reflection properties. On the other hand, physically constructed triggers are often unoptimized, leading to low effectiveness or easy detectability. This paper introduces Material-Oriented Backdoor Attack (MOBA), a novel framework that bridges the digital-physical gap by explicitly modeling the material properties of real-world triggers. MOBA tackles two key challenges in physical backdoor design: 1) robustness of the trigger material under diverse environmental conditions, 2) alignment between the physical trigger's behavior and its digital simulation. First, we propose a systematic approach to selecting robust trigger materials, identifying titanium dioxide (TiO_2) for its high diffuse reflectivity and environmental resilience. Second, to ensure the digital trigger accurately mimics the physical behavior of the material-based trigger, we develop a novel simulation pipeline that features: (1) an angle-independent approximation of the Oren-Nayar BRDF model to generate realistic LiDAR intensities, and (2) a distance-aware scaling mechanism to maintain spatial consistency across varying depths. We conduct extensive experiments on state-of-the-art LiDAR-based and Camera-LiDAR fusion models, showing that MOBA achieves a 93.50% attack success rate, outperforming prior methods by over 41%. Our work reveals a new class of physically realizable threats and underscores the urgent need for defenses that account for material-level properties in real-world environments.

[103] DBGroup: Dual-Branch Point Grouping for Weakly Supervised 3D Instance Segmentation

Xuexun Liu,Xiaoxu Xu,Qiudan Zhang,Lin Ma,Xu Wang

Main category: cs.CV

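TL;DR: This paper proposes DBGroup, a two-stage weakly supervised 3D instance segmentation framework that generates high-quality pseudo-labels from scene-level annotations and improves them through self-training, outperforming existing methods while reducing annotation cost.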

Details Motivation: Existing weakly supervised 3D instance segmentation methods rely on click or bounding-box annotations, which still require substantial and costly manual effort; this paper aims to lower the annotation burden with more efficient and scalable scene-level annotations. Method: In the first stage of DBGroup, a Dual-Branch Point Grouping module generates pseudo-labels from semantic and mask cues in multi-view images, and two strategies, Granularity-Aware Instance Merging and Semantic Selection and Propagation, refine label quality. In the second stage, an end-to-end network is trained with multi-round self-training and an Instance Mask Filter strategy. Result: Experiments show that DBGroup is competitive with sparse-point-level supervised methods and surpasses state-of-the-art scene-level supervised 3D semantic segmentation methods. Conclusion: Through scene-level annotations and pseudo-label refinement, DBGroup lowers annotation cost while delivering high-performance 3D instance segmentation, with good scalability and application potential. Abstract: Weakly supervised 3D instance segmentation is essential for 3D scene understanding, especially given the growing scale of data and the high annotation costs associated with fully supervised approaches. Existing methods primarily rely on two forms of weak supervision: one-thing-one-click annotations and bounding box annotations, both of which aim to reduce labeling efforts. However, these approaches still encounter limitations, including labor-intensive annotation processes, high complexity, and reliance on expert annotators. To address these challenges, we propose DBGroup, a two-stage weakly supervised 3D instance segmentation framework that leverages scene-level annotations as a more efficient and scalable alternative. In the first stage, we introduce a Dual-Branch Point Grouping module to generate pseudo labels guided by semantic and mask cues extracted from multi-view images. To further improve label quality, we develop two refinement strategies: Granularity-Aware Instance Merging and Semantic Selection and Propagation. The second stage involves multi-round self-training on an end-to-end instance segmentation network using the refined pseudo-labels. Additionally, we introduce an Instance Mask Filter strategy to address inconsistencies within the pseudo labels. Extensive experiments demonstrate that DBGroup achieves competitive performance compared to sparse-point-level supervised 3D instance segmentation methods, while surpassing state-of-the-art scene-level supervised 3D semantic segmentation approaches. Code is available at https://github.com/liuxuexun/DBGroup.

[104] LampQ: Towards Accurate Layer-wise Mixed Precision Quantization for Vision Transformers

Minjun Kim,Jaeri Lee,Jongjin Kim,Jeongin Yun,Yongmo Kwon,U Kang

Main category: cs.CV

TL;DR: Proposes LampQ, a layer-wise mixed precision quantization method for quantizing Vision Transformer models efficiently and accurately.

Details Motivation: Existing quantization methods use uniform precision and ignore how differently ViT components respond to quantization; earlier mixed precision methods are limited in granularity, in metric-scale matching across component types, and in bit allocation. Method: LampQ quantizes layer by layer, introduces a type-aware Fisher-based sensitivity metric, assigns bit-widths optimally through integer linear programming, and then updates them iteratively. Result: Experiments show that LampQ reaches state-of-the-art results on tasks such as image classification, object detection, and zero-shot quantization. Conclusion: LampQ overcomes the three major limitations of existing ViT quantization methods and balances high accuracy with efficient compression. Abstract: How can we accurately quantize a pre-trained Vision Transformer model? Quantization algorithms compress Vision Transformers (ViTs) into low-bit formats, reducing memory and computation demands with minimal accuracy degradation. However, existing methods rely on uniform precision, ignoring the diverse sensitivity of ViT components to quantization. Metric-based Mixed Precision Quantization (MPQ) is a promising alternative, but previous MPQ methods for ViTs suffer from three major limitations: 1) coarse granularity, 2) mismatch in metric scale across component types, and 3) quantization-unaware bit allocation. In this paper, we propose LampQ (Layer-wise Mixed Precision Quantization for Vision Transformers), an accurate metric-based MPQ method for ViTs to overcome these limitations. LampQ performs layer-wise quantization to achieve both fine-grained control and efficient acceleration, incorporating a type-aware Fisher-based metric to measure sensitivity. Then, LampQ assigns bit-widths optimally through integer linear programming and further updates them iteratively. Extensive experiments show that LampQ achieves state-of-the-art performance in quantizing ViTs pre-trained on various tasks such as image classification, object detection, and zero-shot quantization.
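
The bit-allocation step can be illustrated on a toy problem. The sketch below replaces the integer linear program with an exhaustive search over a three-layer model, and it assumes a cost of sensitivity times 4^(-bits) as a stand-in for quantization error; the cost model, the function name `allocate_bits`, and all numbers are hypothetical.

```python
from itertools import product

def allocate_bits(sensitivity, n_params, budget_bits, choices=(2, 4, 8)):
    """Pick one bit-width per layer, minimizing a sensitivity-weighted error proxy
    subject to a total model-size budget (exhaustive search, toy scale only)."""
    best, best_cost = None, float("inf")
    for bits in product(choices, repeat=len(sensitivity)):
        size = sum(n * b for n, b in zip(n_params, bits))
        if size > budget_bits:
            continue  # violates the memory budget
        # Assumed error proxy: quantization noise shrinks roughly 4x per extra bit.
        cost = sum(s * 4.0 ** (-b) for s, b in zip(sensitivity, bits))
        if cost < best_cost:
            best, best_cost = bits, cost
    return best

sensitivity = [10.0, 1.0, 0.1]   # layer 0 is the most sensitive
n_params = [100, 100, 100]       # equal-sized layers
bits = allocate_bits(sensitivity, n_params, budget_bits=1400)
```

Under this budget the search gives the most sensitive layer 8 bits and the least sensitive layer 2 bits. A real ILP solver would find the same kind of allocation without enumerating every combination.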

[105] MIRNet: Integrating Constrained Graph-Based Reasoning with Pre-training for Diagnostic Medical Imaging

Shufeng Kong,Zijie Wang,Nuan Cui,Hao Tang,Yihan Meng,Yuanyuan Wei,Feifan Chen,Yingheng Wang,Zhuo Cai,Yaonan Wang,Yulong Zhang,Yuzheng Li,Zibin Zheng,Caihua Liu

Main category: cs.CV

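TL;DR: Proposes MIRNet, a framework that combines self-supervised pre-training with constrained graph-based reasoning for diagnostic medical imaging, particularly tongue image analysis, and builds the large-scale dataset TongueAtlas-4K.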

Details Motivation: Automated interpretation of medical images must cope with annotation scarcity, label imbalance, and clinical plausibility constraints, especially in tongue diagnosis, a challenging domain that demands fine-grained visual-semantic understanding. Method: A self-supervised masked autoencoder (MAE) learns visual representations from unlabeled data; a graph attention network (GAT) models expert-defined label correlations; KL divergence and regularization losses inject clinical prior constraints; and asymmetric loss (ASL) together with boosting ensembles mitigates class imbalance. Result: The method reaches state-of-the-art performance on tongue diagnosis, the framework extends to broader diagnostic medical imaging tasks, and the authors release TongueAtlas-4K, a large public dataset of 4,000 images. Conclusion: MIRNet integrates self-supervised learning with structured knowledge reasoning, reducing dependence on labeled data while improving clinical plausibility and generalization, and offers a scalable paradigm for medical image analysis. Abstract: Automated interpretation of medical images demands robust modeling of complex visual-semantic relationships while addressing annotation scarcity, label imbalance, and clinical plausibility constraints. We introduce MIRNet (Medical Image Reasoner Network), a novel framework that integrates self-supervised pre-training with constrained graph-based reasoning. Tongue image diagnosis is a particularly challenging domain that requires fine-grained visual and semantic understanding. Our approach leverages a self-supervised masked autoencoder (MAE) to learn transferable visual representations from unlabeled data; employs graph attention networks (GAT) to model label correlations through expert-defined structured graphs; enforces clinical priors via constraint-aware optimization using KL divergence and regularization losses; and mitigates imbalance using asymmetric loss (ASL) and boosting ensembles. To address annotation scarcity, we also introduce TongueAtlas-4K, a comprehensive expert-curated benchmark comprising 4,000 images annotated with 22 diagnostic labels--representing the largest public dataset in tongue analysis. Validation shows our method achieves state-of-the-art performance. While optimized for tongue diagnosis, the framework readily generalizes to broader diagnostic medical imaging tasks.

[106] AffordBot: 3D Fine-grained Embodied Reasoning via Multimodal Large Language Models

Xinyi Wang,Xun Yang,Yanlong Xu,Yuchen Wu,Zhen Li,Na Zhao

Main category: cs.CV

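TL;DR: This paper introduces a new task, Fine-grained 3D Embodied Reasoning, which asks an agent to predict the spatial location, motion type, and motion axis of actionable elements in a 3D scene from a task instruction. To solve it, the authors design AffordBot, a framework that combines multimodal large language models with a tailored chain-of-thought reasoning paradigm and achieves 3D-2D alignment by projecting 3D element candidates into surround-view images. The method reaches state-of-the-art performance on the SceneFun3D dataset and shows strong generalization and physically grounded reasoning.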

Details Motivation: Existing methods usually operate at the object level or handle fine-grained affordance reasoning in a disjointed way; they lack coherent, instruction-driven grounding and reasoning and cannot meet the need for precise interaction understanding in human-agent collaboration in physical environments. Method: AffordBot couples a multimodal large language model (MLLM) with a tailored chain-of-thought (CoT) pipeline: an active perception stage first selects the most informative viewpoint, then step-by-step reasoning localizes affordance elements and infers interaction motions. Surround-view images of the scene are rendered and 3D element candidates are projected into them, producing a visual representation aligned with the scene geometry. Result: Experiments on the SceneFun3D dataset show that AffordBot reaches state-of-the-art performance using only 3D point cloud input and MLLMs, with strong generalization and physically consistent reasoning. Conclusion: By combining MLLMs with a structured reasoning pipeline, AffordBot achieves instruction-driven, fine-grained 3D affordance understanding and offers an effective solution for precise interaction by embodied agents in complex environments. Abstract: Effective human-agent collaboration in physical environments requires understanding not only what to act upon, but also where the actionable elements are and how to interact with them. Existing approaches often operate at the object level or disjointedly handle fine-grained affordance reasoning, lacking coherent, instruction-driven grounding and reasoning. In this work, we introduce a new task: Fine-grained 3D Embodied Reasoning, which requires an agent to predict, for each referenced affordance element in a 3D scene, a structured triplet comprising its spatial location, motion type, and motion axis, based on a task instruction. To solve this task, we propose AffordBot, a novel framework that integrates Multimodal Large Language Models (MLLMs) with a tailored chain-of-thought (CoT) reasoning paradigm. To bridge the gap between 3D input and 2D-compatible MLLMs, we render surround-view images of the scene and project 3D element candidates into these views, forming a rich visual representation aligned with the scene geometry. Our CoT pipeline begins with an active perception stage, prompting the MLLM to select the most informative viewpoint based on the instruction, before proceeding with step-by-step reasoning to localize affordance elements and infer plausible interaction motions. Evaluated on the SceneFun3D dataset, AffordBot achieves state-of-the-art performance, demonstrating strong generalization and physically grounded reasoning with only 3D point cloud input and MLLMs.
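
Projecting 3D element candidates into a rendered view comes down to a standard pinhole camera projection. The following sketch assumes known intrinsics `K` and a world-to-camera pose `(R, t)`; these names and the example camera are illustrative and are not taken from AffordBot's code.

```python
import numpy as np

def project_points(points_world, K, R, t):
    """Project (n, 3) world points to pixel coordinates with a pinhole camera."""
    cam = points_world @ R.T + t          # world frame -> camera frame
    in_front = cam[:, 2] > 0              # only points in front of the camera are visible
    uvw = cam @ K.T                       # apply intrinsics
    uv = uvw[:, :2] / uvw[:, 2:3]         # perspective divide
    return uv, in_front

# Hypothetical camera: 500 px focal length, principal point at (320, 240).
K = np.array([[500.0, 0.0, 320.0],
              [0.0, 500.0, 240.0],
              [0.0, 0.0, 1.0]])
R, t = np.eye(3), np.zeros(3)             # camera placed at the world origin

candidates = np.array([[0.0, 0.0, 2.0],   # on the optical axis
                       [1.0, 0.0, 2.0]])  # one metre to the right
uv, in_front = project_points(candidates, K, R, t)
```

Repeating this for each surround-view camera gives every 3D candidate a 2D location in the images a 2D-compatible model can read.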

[107] Anomagic: Crossmodal Prompt-driven Zero-shot Anomaly Generation

Yuxin Jiang,Wei Luo,Hui Zhang,Qiyu Chen,Haiming Yao,Weiming Shen,Yunkang Cao

Main category: cs.CV

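TL;DR: Proposes Anomagic, a zero-shot anomaly generation method based on crossmodal prompt encoding and contrastive refinement. It combines visual and textual cues to generate semantically coherent anomalies without any anomaly samples, is trained on the newly built AnomVerse dataset, and significantly improves downstream anomaly detection.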

Details Motivation: Existing anomaly generation methods usually depend on anomaly samples or exemplars, which limits their use in real-world scenarios; a zero-shot method is needed that generates diverse, semantically plausible anomalies without them. Method: Anomagic uses crossmodal prompt encoding to fuse visual and textual cues that steer an inpainting-based generation pipeline, and a contrastive refinement strategy enforces precise alignment between generated anomalies and their masks. The AnomVerse dataset of 12,987 triplets is built for training. Result: Experiments show that Anomagic produces more realistic and more varied anomalies, outperforms prior methods on downstream anomaly detection, and can generate anomalies for any normal-category image from user-defined prompts. Conclusion: Anomagic is an effective zero-shot anomaly generation framework with good generalization and practicality, offering a scalable way to synthesize data for anomaly detection. Abstract: We propose Anomagic, a zero-shot anomaly generation method that produces semantically coherent anomalies without requiring any exemplar anomalies. By unifying both visual and textual cues through a crossmodal prompt encoding scheme, Anomagic leverages rich contextual information to steer an inpainting-based generation pipeline. A subsequent contrastive refinement strategy enforces precise alignment between synthesized anomalies and their masks, thereby bolstering downstream anomaly detection accuracy. To facilitate training, we introduce AnomVerse, a collection of 12,987 anomaly-mask-caption triplets assembled from 13 publicly available datasets, where captions are automatically generated by multimodal large language models using structured visual prompts and template-based textual hints. Extensive experiments demonstrate that Anomagic trained on AnomVerse can synthesize more realistic and varied anomalies than prior methods, yielding superior improvements in downstream anomaly detection. Furthermore, Anomagic can generate anomalies for any normal-category image using user-defined prompts, establishing a versatile foundation model for anomaly generation.

[108] DGFusion: Dual-guided Fusion for Robust Multi-Modal 3D Object Detection

Feiyang Jia,Caiyan Jia,Ailin Liu,Shaoqing Xu,Qiming Xia,Lin Liu,Lei Yang,Yan Gong,Ziying Song

Main category: cs.CV

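TL;DR: This paper proposes DGFusion, a multi-modal 3D object detection method based on a dual-guided paradigm, which uses a difficulty-aware instance matching mechanism to improve detection of hard objects that are distant, small, or occluded.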

Details Motivation: Existing single-guided multi-modal 3D detection methods do not account for how information density on hard instances differs between modalities, so they detect hard objects (distant, small, or occluded) poorly, which affects the safety of autonomous driving. Method: DGFusion adopts a dual-guided paradigm (combining Point-guide-Image and Image-guide-Point) and designs a Difficulty-aware Instance Pair Matcher (DIPM) that performs instance-level feature matching to produce easy and hard instance pairs, which the Dual-guided Modules then fuse. Result: On nuScenes, DGFusion improves over the baseline by +1.0% mAP, +0.8% NDS, and +1.3% average recall, and it is more robust on hard instances across ego-distance, size, visibility, and small-scale training scenarios. Conclusion: With its dual-guided fusion paradigm and difficulty-aware matching, DGFusion improves detection of hard instances in multi-modal 3D object detection and makes the system more reliable in complex driving environments. Abstract: As a critical task in autonomous driving perception systems, 3D object detection is used to identify and track key objects, such as vehicles and pedestrians. However, detecting distant, small, or occluded objects (hard instances) remains a challenge, which directly compromises the safety of autonomous driving systems. We observe that existing multi-modal 3D object detection methods often follow a single-guided paradigm, failing to account for the differences in information density of hard instances between modalities. In this work, we propose DGFusion, based on the Dual-guided paradigm, which fully inherits the advantages of the Point-guide-Image paradigm and integrates the Image-guide-Point paradigm to address the limitations of the single paradigms. The core of DGFusion, the Difficulty-aware Instance Pair Matcher (DIPM), performs instance-level feature matching based on difficulty to generate easy and hard instance pairs, while the Dual-guided Modules exploit the advantages of both pair types to enable effective multi-modal feature fusion. Experimental results demonstrate that our DGFusion outperforms the baseline methods, with respective improvements of +1.0% mAP, +0.8% NDS, and +1.3% average recall on nuScenes. Extensive experiments demonstrate consistent robustness gains for hard instance detection across ego-distance, size, visibility, and small-scale training scenarios.

[109] LoG3D: Ultra-High-Resolution 3D Shape Modeling via Local-to-Global Partitioning

Xinran Yang,Shuichang Lai,Jiangjing Lyu,Hongjie Li,Bowen Pan,Yuanqi Li,Jie Guo,Zhou Zhengkang,Yanwen Guo

Main category: cs.CV

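TL;DR: Proposes a 3D variational autoencoder (VAE) framework built on unsigned distance fields (UDFs). Its local-to-global (LoG) architecture enables high-fidelity 3D content generation, supports complex topologies, and scales to an ultra-high resolution of 2048^3.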

Details Motivation: Existing methods based on signed distance fields (SDFs) and point clouds suffer from costly preprocessing and surface discontinuities when handling non-manifold geometry, open surfaces, and internal structures, and they struggle to preserve geometric detail and complex topology at the same time. Method: A new 3D VAE framework uses an unsigned distance field (UDF) representation. Its local-to-global (LoG) architecture partitions the UDF into UBlock subvolumes, pairs 3D convolutions for local detail with sparse transformers for global coherence, and applies a Pad-Average strategy to improve reconstruction at block boundaries. Result: The method generates and reconstructs 3D shapes at resolutions up to 2048^3 and reaches state-of-the-art reconstruction accuracy, surface smoothness, and geometric flexibility. Conclusion: The method overcomes the limits of conventional SDF and point-cloud representations and provides an efficient, scalable solution for high-fidelity 3D content with complex topology. Abstract: Generating high-fidelity 3D content remains a fundamental challenge due to the complexity of representing arbitrary topologies, such as open surfaces and intricate internal structures, while preserving geometric details. Prevailing methods based on signed distance fields (SDFs) are hampered by costly watertight preprocessing and struggle with non-manifold geometries, while point-cloud representations often suffer from sampling artifacts and surface discontinuities. To overcome these limitations, we propose a novel 3D variational autoencoder (VAE) framework built upon unsigned distance fields (UDFs), a more robust and computationally efficient representation that naturally handles complex and incomplete shapes. Our core innovation is a local-to-global (LoG) architecture that processes the UDF by partitioning it into uniform subvolumes, termed UBlocks. This architecture couples 3D convolutions for capturing local detail with sparse transformers for enforcing global coherence. A Pad-Average strategy further ensures smooth transitions at subvolume boundaries during reconstruction. This modular design enables seamless scaling to ultra-high resolutions up to 2048^3, a regime previously unattainable for 3D VAEs. Experiments demonstrate state-of-the-art performance in both reconstruction accuracy and generative quality, yielding superior surface smoothness and geometric flexibility.
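
One plausible reading of the partition-and-merge idea is sketched below: cut the volume into uniform blocks with a padded border, then average the overlapping borders when reassembling. This is an interpretation of "UBlocks" and "Pad-Average" from the abstract alone, with hypothetical function names, and it assumes a cubic volume whose side is a multiple of the block size.

```python
import numpy as np

def split_blocks(vol, block, pad):
    """Cut a cubic volume into uniform blocks, each extended by `pad` voxels per side."""
    padded = np.pad(vol, pad, mode="edge")
    n = vol.shape[0]
    size = block + 2 * pad
    blocks = []
    for x in range(0, n, block):
        for y in range(0, n, block):
            for z in range(0, n, block):
                sub = padded[x:x + size, y:y + size, z:z + size]
                blocks.append(((x, y, z), sub.copy()))
    return blocks

def merge_blocks(blocks, n, block, pad):
    """Reassemble the volume, averaging voxels covered by several padded blocks."""
    size = block + 2 * pad
    acc = np.zeros((n + 2 * pad,) * 3)
    cnt = np.zeros_like(acc)
    for (x, y, z), sub in blocks:
        acc[x:x + size, y:y + size, z:z + size] += sub
        cnt[x:x + size, y:y + size, z:z + size] += 1
    merged = acc / cnt                    # every voxel is covered at least once
    return merged[pad:n + pad, pad:n + pad, pad:n + pad]

rng = np.random.default_rng(0)
vol = rng.random((8, 8, 8))               # stand-in for a small UDF grid
blocks = split_blocks(vol, block=4, pad=1)
merged = merge_blocks(blocks, n=8, block=4, pad=1)
```

With unmodified blocks the round trip reproduces the input exactly. When each block is instead decoded independently by a network, the averaging of the overlapping borders is what smooths the seams.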

[110] FreDFT: Frequency Domain Fusion Transformer for Visible-Infrared Object Detection

Wencong Wu,Xiuwei Zhang,Hanlin Yin,Shun Dai,Hongxi Zhang,Yanning Zhang

Main category: cs.CV

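TL;DR: This paper proposes FreDFT, a frequency domain fusion transformer for visible-infrared object detection. It introduces multimodal frequency domain attention (MFDA) and a frequency domain feed-forward layer (FDFFL), together with a cross-modal global modeling module (CGMM) and a local feature enhancement module (LFEM), to address multimodal information imbalance and inadequate cross-modal fusion, and it performs strongly on multiple public datasets.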

Details Motivation: Existing methods suffer from information imbalance when fusing visible and infrared modalities, and they mostly apply transformers in the spatial domain, overlooking the potential of the frequency domain for mining complementary information. Method: FreDFT comprises multimodal frequency domain attention (MFDA) to mine complementary information between modalities, a frequency domain feed-forward layer (FDFFL) for mixed-scale frequency feature fusion, a cross-modal global modeling module (CGMM) for pixel-wise cross-modal interaction, and a local feature enhancement module (LFEM) to strengthen local feature representation and fusion. Result: On multiple public visible-infrared object detection datasets, FreDFT significantly outperforms existing methods. Conclusion: Through frequency-domain modeling and the cooperation of its modules, FreDFT improves cross-modal visible-infrared object detection and offers a new solution for multimodal detection. Abstract: Visible-infrared object detection has gained considerable attention due to its detection performance in low light, fog, and rain conditions. However, visible and infrared modalities captured by different sensors suffer from an information imbalance problem in complex scenarios, which can cause inadequate cross-modal fusion, resulting in degraded detection performance. Furthermore, most existing methods use transformers in the spatial domain to capture complementary features, ignoring the advantages of developing frequency domain transformers to mine complementary information. To solve these weaknesses, we propose a frequency domain fusion transformer, called FreDFT, for visible-infrared object detection. The proposed approach employs a novel multimodal frequency domain attention (MFDA) to mine complementary information between modalities, and a frequency domain feed-forward layer (FDFFL) with a mixed-scale frequency feature fusion strategy is designed to better enhance multimodal features. To eliminate the imbalance of multimodal information, a cross-modal global modeling module (CGMM) is constructed to perform pixel-wise inter-modal feature interaction in a spatial and channel manner. Moreover, a local feature enhancement module (LFEM) is developed to strengthen multimodal local feature representation and promote multimodal feature fusion by using various convolution layers and applying a channel shuffle. Extensive experimental results have verified that our proposed FreDFT achieves excellent performance on multiple public datasets compared with other state-of-the-art methods. The code of our FreDFT is available at https://github.com/WenCongWu/FreDFT.

[111] MuSc-V2: Zero-Shot Multimodal Industrial Anomaly Classification and Segmentation with Mutual Scoring of Unlabeled Samples

Xurui Li,Feng Xue,Yu Zhou

Main category: cs.CV

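TL;DR: This paper proposes MuSc-V2, a mutual scoring framework for zero-shot anomaly classification and segmentation. It exploits the contrast between normal patches, which are similar to many others in both 2D appearance and 3D shape, and anomalies, which are isolated, and it improves performance substantially through multimodal fusion and mutual scoring.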

Details Motivation: Existing zero-shot anomaly detection methods overlook a key property: normal image patches are highly similar to one another in both 2D and 3D, whereas anomalies tend to be isolated and diverse. This paper uses that discriminative property to improve detection. Method: The MuSc-V2 framework includes Iterative Point Grouping (IPG) to improve the 3D representation, Similarity Neighborhood Aggregation with Multi-Degrees (SNAMD) to fuse 2D/3D multi-scale features, a Mutual Scoring Mechanism (MSM) for scoring within each modality, Cross-modal Anomaly Enhancement (CAE) to fuse the scores of both modalities, and Re-scoring with Constrained Neighborhood (RsCon) to suppress false classifications. Result: The method gains +23.7% AP on the MVTec 3D-AD dataset and +19.3% on the Eyecandies dataset, surpassing previous zero-shot methods and most few-shot methods. Conclusion: By explicitly modeling how normal samples cluster and anomalies stay isolated, and by combining this with multimodal mutual scoring, MuSc-V2 substantially improves zero-shot anomaly classification and segmentation with good adaptability and robustness. Abstract: Zero-shot anomaly classification (AC) and segmentation (AS) methods aim to identify and outline defects without using any labeled samples. In this paper, we reveal a key property that is overlooked by existing methods: normal image patches across industrial products typically find many other similar patches, not only in 2D appearance but also in 3D shapes, while anomalies remain diverse and isolated. To explicitly leverage this discriminative property, we propose a Mutual Scoring framework (MuSc-V2) for zero-shot AC/AS, which flexibly supports single 2D/3D or multimodality. Specifically, our method begins by improving 3D representation through Iterative Point Grouping (IPG), which reduces false positives from discontinuous surfaces. Then we use Similarity Neighborhood Aggregation with Multi-Degrees (SNAMD) to fuse 2D/3D neighborhood cues into more discriminative multi-scale patch features for mutual scoring. The core comprises a Mutual Scoring Mechanism (MSM) that lets samples within each modality assign scores to each other, and Cross-modal Anomaly Enhancement (CAE) that fuses 2D and 3D scores to recover modality-specific missing anomalies. Finally, Re-scoring with Constrained Neighborhood (RsCon) suppresses false classification based on similarity to more representative samples. Our framework flexibly works on both the full dataset and smaller subsets with consistently robust performance, ensuring seamless adaptability across diverse product lines. With this framework, MuSc-V2 achieves significant performance improvements: a +23.7% AP gain on the MVTec 3D-AD dataset and a +19.3% boost on the Eyecandies dataset, surpassing previous zero-shot benchmarks and even outperforming most few-shot methods. The code will be available at https://github.com/HUST-SLOW/MuSc-V2.
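
The intuition behind mutual scoring is that a normal patch finds a close match in most other unlabeled images, while an anomalous patch does not. A much simplified sketch follows; it assumes precomputed patch features and a plain nearest-neighbor distance averaged over images, and it leaves out IPG, SNAMD, CAE, and RsCon. The function name is hypothetical.

```python
import numpy as np

def mutual_scores(features):
    """features: (n_images, n_patches, dim). Returns (n_images, n_patches) anomaly scores,
    where each patch is scored by the other unlabeled images."""
    n_images = features.shape[0]
    scores = np.zeros(features.shape[:2])
    for i in range(n_images):
        per_image_min = []
        for j in range(n_images):
            if i == j:
                continue
            # Distances between every patch of image i and every patch of image j.
            d = np.linalg.norm(features[i][:, None, :] - features[j][None, :, :], axis=-1)
            per_image_min.append(d.min(axis=1))   # closest match in image j
        scores[i] = np.mean(per_image_min, axis=0)
    return scores

rng = np.random.default_rng(0)
feats = rng.normal(scale=0.05, size=(4, 6, 8))    # four images of similar "normal" patches
feats[2, 3] += 3.0                                # plant one isolated patch
scores = mutual_scores(feats)
```

The planted patch has no close match in any other image, so it gets the highest score, while the patches of the other images are barely affected because they still match each other.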

[112] Image Aesthetic Reasoning via HCM-GRPO: Empowering Compact Model for Superior Performance

Zhiyuan Hu,Zheng Sun,Yi Wei,Long Yu

Main category: cs.CV

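TL;DR: Proposes a complete solution for image aesthetic reasoning, comprising a large-scale dataset and HCM-GRPO, an improved training method based on hard case mining, which significantly outperforms existing large models on image screening.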

Details Motivation: Existing MLLMs perform poorly on image aesthetic reasoning, mainly because data are scarce and the models' reasoning ability is weak. Method: The authors build a large-scale image screening dataset of 128k samples covering four aesthetic aspects, explore several annotation approaches to obtain high-quality chain-of-thought data, and propose HCM-GRPO, which combines hard case mining with a Dynamic Proportional Accuracy reward to strengthen reasoning. Result: Experiments show that mainstream closed-source MLLMs such as GPT4o and Qwen-VL-Max perform close to random guessing on this task, whereas a small model trained with HCM-GRPO significantly outperforms both open-source and closed-source large models. Conclusion: With a high-quality dataset and an improved training strategy, stronger image aesthetic reasoning is achievable without relying on large models, providing an effective way to assess image generation quality. Abstract: The performance of image generation has been significantly improved in recent years. However, the study of image screening is rare and its performance with Multimodal Large Language Models (MLLMs) is unsatisfactory due to the lack of data and the weak image aesthetic reasoning ability in MLLMs. In this work, we propose a complete solution to address these problems in terms of data and methodology. For data, we collect a comprehensive image screening dataset with over 128k samples, about 640k images. Each sample consists of an original image and four generated images. The dataset evaluates the image aesthetic reasoning ability under four aspects: appearance deformation, physical shadow, placement layout, and extension rationality. Regarding data annotation, we investigate multiple approaches, including purely manual, fully automated, and answer-driven annotations, to acquire high-quality chains of thought (CoT) data in the most cost-effective manner. Methodologically, we introduce a Hard Cases Mining (HCM) strategy with a Dynamic Proportional Accuracy (DPA) reward into the Group Relative Policy Optimization (GRPO) framework, called HCM-GRPO. This enhanced method demonstrates superior image aesthetic reasoning capabilities compared to the original GRPO. Our experimental results reveal that even state-of-the-art closed-source MLLMs, such as GPT4o and Qwen-VL-Max, exhibit performance akin to random guessing in image aesthetic reasoning. In contrast, by leveraging the HCM-GRPO, we are able to surpass the scores of both large-scale open-source and leading closed-source models with a much smaller model.
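
For readers unfamiliar with GRPO, its core step normalizes each sampled response's reward against the other responses drawn for the same prompt. The sketch below shows that standard group-relative advantage, plus a hypothetical hard-case flag based on mean group accuracy; the flag and its threshold are assumptions for illustration and do not reproduce the paper's HCM strategy or DPA reward.

```python
import numpy as np

def group_relative_advantages(rewards, eps=1e-6):
    """GRPO-style advantage: standardize rewards within one prompt's group of samples."""
    rewards = np.asarray(rewards, dtype=float)
    return (rewards - rewards.mean()) / (rewards.std() + eps)

def is_hard_case(rewards, threshold=0.5):
    # Assumption: a prompt counts as "hard" when the group's mean accuracy is low.
    return float(np.mean(rewards)) < threshold

group = [1.0, 0.0, 0.0, 0.0]      # one correct answer out of four sampled responses
adv = group_relative_advantages(group)
hard = is_hard_case(group)
```

The single correct response receives a positive advantage and the three incorrect ones a negative advantage. A mining step could then, for example, revisit prompts flagged as hard more often.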

[113] When Eyes and Ears Disagree: Can MLLMs Discern Audio-Visual Confusion?

Qilang Ye,Wei Zeng,Meng Liu,Jie Zhang,Yupeng Hu,Zitong Yu,Yu Zhou

Main category: cs.CV

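TL;DR: This paper introduces AV-ConfuseBench, a new benchmark for evaluating whether multimodal large language models (MLLMs) can discern objects that are visually present but audio-absent, and finds that existing MLLMs suffer from visually dominated reasoning bias. The authors then propose RL-CoMM, which combines reinforcement learning with a collaborative multi-model architecture: a large audio language model (LALM) supplies audio-only reasoning as a reference, and a step-wise reasoning reward function plus answer-centered confidence optimization raise audio-visual question answering accuracy by 10-30%.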

Details Motivation: Existing MLLMs tend to rely on visual cues when processing audio-visual input, struggle to judge whether a sound is actually present, and are prone to audio-visual confusion. Investigating and correcting this visually dominated reasoning bias requires a new benchmark and a better multimodal collaboration mechanism. Method: The RL-CoMM framework has two stages: 1) a large audio language model (LALM) generates audio-only reasoning as a reference, and a step-wise reasoning reward function guides the MLLM, through reinforcement learning, to improve its own audio-visual reasoning; 2) an answer-centered confidence optimization strategy reduces the uncertainty caused by heterogeneous reasoning differences. Result: Experiments on AV-ConfuseBench show that mainstream MLLMs such as Qwen2.5-Omni and Gemini 2.5 struggle to identify non-existent audio, whereas RL-CoMM, with limited training data, improves accuracy over the baseline by 10-30% on audio-visual question answering and hallucination tasks. Conclusion: MLLMs exhibit a visually dominated reasoning flaw in audio-visual understanding. By adding an audio-specific model and a reinforcement learning-based collaboration mechanism, RL-CoMM strengthens the ability to recognize absent audio and improves the accuracy and robustness of multimodal reasoning. Abstract: Can Multimodal Large Language Models (MLLMs) discern confused objects that are visually present but audio-absent? To study this, we introduce a new benchmark, AV-ConfuseBench, which simulates an "Audio-Visual Confusion" scene by modifying the corresponding sound of an object in the video, e.g., muting the sounding object and asking MLLMs "Is there a/an muted-object sound". Experimental results reveal that MLLMs, such as Qwen2.5-Omni and Gemini 2.5, struggle to discriminate non-existent audio due to visually dominated reasoning. Motivated by this observation, we introduce RL-CoMM, a Reinforcement Learning-based Collaborative Multi-MLLM that is built upon the Qwen2.5-Omni foundation. RL-CoMM includes two stages: 1) To alleviate visually dominated ambiguities, we introduce an external model, a Large Audio Language Model (LALM), as the reference model to generate audio-only reasoning. Then, we design a Step-wise Reasoning Reward function that enables MLLMs to self-improve audio-visual reasoning with the audio-only reference. 2) To ensure an accurate answer prediction, we introduce Answer-centered Confidence Optimization to reduce the uncertainty of potential heterogeneous reasoning differences. Extensive experiments on audio-visual question answering and audio-visual hallucination show that RL-CoMM improves the accuracy by 10-30% over the baseline model with limited training data. Follow: https://github.com/rikeilong/AVConfusion.

[114] Multivariate Gaussian Representation Learning for Medical Action Evaluation

Luming Yang,Haoxian Liu,Siqing Li,Alper Yilmaz

Main category: cs.CV

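TL;DR: Proposes the CPREval-6k dataset and the GaussMedAct framework for fine-grained medical action analysis, substantially improving accuracy and efficiency.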

Details Motivation: Fine-grained action evaluation in medical vision is held back by a lack of datasets, strict precision requirements, and insufficient modeling of rapid actions. Method: The authors build CPREval-6k, a multi-view, multi-label benchmark, and propose GaussMedAct, a framework based on multivariate Gaussian encoding that combines adaptive spatiotemporal representation learning with a hybrid spatial encoding strategy. Result: The method reaches 92.1% Top-1 accuracy on CPREval-6k, 5.9% above the ST-GCN baseline while using only 10% of the FLOPs, runs in real time, and is more robust in cross-dataset experiments. Conclusion: Through adaptive 3D Gaussian modeling and hybrid spatial encoding, GaussMedAct improves recognition accuracy and noise robustness for rapid actions in medical settings and advances fine-grained action evaluation. Abstract: Fine-grained action evaluation in medical vision faces unique challenges due to the unavailability of comprehensive datasets, stringent precision requirements, and insufficient spatiotemporal dynamic modeling of very rapid actions. To support development and evaluation, we introduce CPREval-6k, a multi-view, multi-label medical action benchmark containing 6,372 expert-annotated videos with 22 clinical labels. Using this dataset, we present GaussMedAct, a multivariate Gaussian encoding framework, to advance medical motion analysis through adaptive spatiotemporal representation learning. Multivariate Gaussian Representation projects the joint motions to a temporally scaled multi-dimensional space, and decomposes actions into adaptive 3D Gaussians that serve as tokens. These tokens preserve motion semantics through anisotropic covariance modeling while maintaining robustness to spatiotemporal noise. Hybrid Spatial Encoding, employing a Cartesian and Vector dual-stream strategy, effectively utilizes skeletal information in the form of joint and bone features. The proposed method achieves 92.1% Top-1 accuracy with real-time inference on the benchmark, outperforming the ST-GCN baseline by +5.9% accuracy with only 10% FLOPs. Cross-dataset experiments confirm the superiority of our method in robustness.
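
The idea of turning a joint's motion into Gaussian tokens can be sketched as follows. The sketch assumes a 2D joint trajectory lifted to 3D by appending scaled time, and it uses fixed-length windows where the paper describes adaptive Gaussians; `gaussian_tokens` and its parameters are hypothetical.

```python
import numpy as np

def gaussian_tokens(traj_xy, window, time_scale=1.0):
    """Summarize a (T, 2) joint trajectory as a list of (mean, covariance) tokens in (x, y, t)."""
    t = time_scale * np.arange(len(traj_xy), dtype=float)[:, None]
    pts = np.hstack([traj_xy, t])                  # (T, 3): position plus scaled time
    tokens = []
    for start in range(0, len(pts) - window + 1, window):
        chunk = pts[start:start + window]
        mean = chunk.mean(axis=0)                  # where and when the motion segment sits
        cov = np.cov(chunk, rowvar=False)          # 3x3 anisotropic spread of the segment
        tokens.append((mean, cov))
    return tokens

frames = np.arange(20)
# Toy joint that oscillates along x and stays fixed in y.
traj = np.stack([np.sin(frames / 3.0), np.zeros(20)], axis=1)
tokens = gaussian_tokens(traj, window=5, time_scale=0.1)
```

Each token's covariance captures the direction and extent of motion within its window. Here the y-variance is zero because the toy joint only moves along x.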

[115] Perceive, Act and Correct: Confidence Is Not Enough for Hyperspectral Classification

Muzhou Yang,Wuzhou Quan,Mingqiang Wei

Main category: cs.CV

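TL;DR: This paper proposes CABIN, a cognition-aware semi-supervised learning framework that improves hyperspectral image classification through a closed loop of perception, action, and correction.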

Details Motivation: Existing hyperspectral image classifiers often over-rely on high-confidence predictions and ignore uncertainty, which causes confirmation bias and hurts performance most when annotations are sparse or classes are imbalanced. Method: CABIN estimates epistemic uncertainty to perceive ambiguous regions, selects samples with an uncertainty-guided dual sampling strategy, and introduces a fine-grained dynamic assignment strategy that categorizes pseudo-labeled data and applies a tailored loss to each category. Result: Experiments show that a range of state-of-the-art methods gain labeling efficiency and classification performance when combined with CABIN. Conclusion: CABIN mitigates confirmation bias, strengthens the model's ability to perceive and correct uncertainty, and improves the generalization of semi-supervised hyperspectral image classification. Abstract: Confidence alone is often misleading in hyperspectral image classification, as models tend to mistake high predictive scores for correctness while lacking awareness of uncertainty. This leads to confirmation bias, especially under sparse annotations or class imbalance, where models overfit confident errors and fail to generalize. We propose CABIN (Cognitive-Aware Behavior-Informed learNing), a semi-supervised framework that addresses this limitation through a closed-loop learning process of perception, action, and correction. CABIN first develops perceptual awareness by estimating epistemic uncertainty, identifying ambiguous regions where errors are likely to occur. It then acts by adopting an Uncertainty-Guided Dual Sampling Strategy, selecting uncertain samples for exploration while anchoring confident ones as stable pseudo-labels to reduce bias. To correct noisy supervision, CABIN introduces a Fine-Grained Dynamic Assignment Strategy that categorizes pseudo-labeled data into reliable, ambiguous, and noisy subsets, applying tailored losses to enhance generalization. Experimental results show that a wide range of state-of-the-art methods benefit from the integration of CABIN, with improved labeling efficiency and performance.
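The abstract does not say how CABIN estimates epistemic uncertainty. As a stand-in only, the sketch below uses a common estimator: the mutual information between the prediction and the model, computed from several stochastic predictions for the same pixel (ensemble members or MC-dropout passes, both assumptions here). The function names are hypothetical. It illustrates why confidence and epistemic uncertainty are different quantities.

```python
import math

def entropy(probs):
    """Shannon entropy in nats; zero-probability classes contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def epistemic_uncertainty(member_probs):
    """Mutual information: entropy of the mean prediction minus mean entropy.

    member_probs: list of class-probability vectors, one per ensemble member
    (assumed setup, not taken from the paper).
    """
    n = len(member_probs)
    mean = [sum(col) / n for col in zip(*member_probs)]
    return entropy(mean) - sum(entropy(p) for p in member_probs) / n

# Members are individually certain but disagree: high epistemic uncertainty.
disagree = epistemic_uncertainty([[1.0, 0.0], [0.0, 1.0]])
# Members agree that the pixel is ambiguous: the uncertainty is not epistemic.
agree = epistemic_uncertainty([[0.5, 0.5], [0.5, 0.5]])
```

In the first case every member is fully confident, yet the disagreement between members yields the maximum value for two classes, ln 2.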

[116] VLF-MSC: Vision-Language Feature-Based Multimodal Semantic Communication System

Gwangyeon Ahn,Jiwan Seo,Joonhyuk Kang

Main category: cs.CV

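TL;DR: Proposes VLF-MSC, a multimodal semantic communication system based on vision-language features. A single compact representation supports both image and text generation at the receiver, and pre-trained models improve spectral efficiency and noise robustness.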

Details Motivation: Existing semantic communication methods usually handle each modality separately, which wastes bandwidth and produces semantic inconsistency, so a unified multimodal representation is needed to improve transmission efficiency and semantic fidelity. Method: A pre-trained vision-language model (VLM) encodes the source image into a vision-language semantic feature (VLF), which is transmitted over the wireless channel; at the receiver, a decoder-based language model and a diffusion-based image generator jointly produce text and an image. Result: Experiments show that VLF-MSC outperforms text-only and image-only baselines at low SNR, reducing bandwidth substantially while improving semantic accuracy for both modalities. Conclusion: With a unified vision-language representation, VLF-MSC achieves efficient and robust multimodal semantic communication and offers a new framework for future semantic communication systems. Abstract: We propose Vision-Language Feature-based Multimodal Semantic Communication (VLF-MSC), a unified system that transmits a single compact vision-language representation to support both image and text generation at the receiver. Unlike existing semantic communication techniques that process each modality separately, VLF-MSC employs a pre-trained vision-language model (VLM) to encode the source image into a vision-language semantic feature (VLF), which is transmitted over the wireless channel. At the receiver, a decoder-based language model and a diffusion-based image generator are both conditioned on the VLF to produce a descriptive text and a semantically aligned image. This unified representation eliminates the need for modality-specific streams or retransmissions, improving spectral efficiency and adaptability. By leveraging foundation models, the system achieves robustness to channel noise while preserving semantic fidelity. Experiments demonstrate that VLF-MSC outperforms text-only and image-only baselines, achieving higher semantic accuracy for both modalities under low SNR with significantly reduced bandwidth.

[117] Mitigating Error Accumulation in Co-Speech Motion Generation via Global Rotation Diffusion and Multi-Level Constraints

Xiangyue Zhang,Jianfang Li,Jianqiang Ren,Jiaxu Zhang

Main category: cs.CV

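TL;DR: This paper proposes GlobalDiff, the first co-speech motion generation framework to run diffusion in the space of global joint rotations. A multi-level constraint scheme compensates for the structural priors missing from the global representation, improving the stability and accuracy of the generated motion.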

Details Motivation: Existing methods model local joint rotations, so errors accumulate through the hierarchical skeleton and produce unstable, implausible motion, especially at end-effectors, which makes reliable motion hard to guarantee. Method: GlobalDiff performs diffusion in the space of global joint rotations for the first time, decoupling each joint's prediction from its upstream dependencies. It adds a multi-level constraint scheme: a joint structure constraint captures fine-grained orientation through virtual anchor points, a skeleton structure constraint keeps the angles between bones consistent, and a temporal structure constraint uses a multi-scale variational encoder to align with ground-truth temporal patterns. Result: In extensive evaluations on standard co-speech benchmarks, GlobalDiff generates smoother and more accurate motion, improving on the current SOTA by 46.0% under multiple speaker identities. Conclusion: Modeling in global rotation space alleviates the error accumulation of earlier methods, and the multi-level structural constraints improve the structural plausibility and temporal consistency of the results, raising the quality of co-speech motion generation. Abstract: Reliable co-speech motion generation requires precise motion representation and consistent structural priors across all joints. Existing generative methods typically operate on local joint rotations, which are defined hierarchically based on the skeleton structure. This leads to cumulative errors during generation, manifesting as unstable and implausible motions at end-effectors. In this work, we propose GlobalDiff, a diffusion-based framework that operates directly in the space of global joint rotations for the first time, fundamentally decoupling each joint's prediction from upstream dependencies and alleviating hierarchical error accumulation. To compensate for the absence of structural priors in global rotation space, we introduce a multi-level constraint scheme. Specifically, a joint structure constraint introduces virtual anchor points around each joint to better capture fine-grained orientation. A skeleton structure constraint enforces angular consistency across bones to maintain structural integrity. A temporal structure constraint utilizes a multi-scale variational encoder to align the generated motion with ground-truth temporal patterns. These constraints jointly regularize the global diffusion process and reinforce structural awareness. Extensive evaluations on standard co-speech benchmarks show that GlobalDiff generates smooth and accurate motions, improving the performance by 46.0% compared to the current SOTA under multiple speaker identities.
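The hierarchical error accumulation described above can be seen in a toy planar chain. This is a minimal sketch, not the paper's model: it assumes a 2D chain of unit-length bones and compares the end-effector error caused by the same angular error at the root joint under the two parameterizations. All function names are hypothetical.

```python
import math

def chain_from_local(local_angles, bone=1.0):
    """Joint positions when each angle is relative to the parent bone."""
    x = y = heading = 0.0
    joints = []
    for angle in local_angles:
        heading += angle  # orientation inherits every upstream rotation
        x += bone * math.cos(heading)
        y += bone * math.sin(heading)
        joints.append((x, y))
    return joints

def chain_from_global(global_angles, bone=1.0):
    """Joint positions when each angle is an absolute bone orientation."""
    x = y = 0.0
    joints = []
    for heading in global_angles:
        x += bone * math.cos(heading)
        y += bone * math.sin(heading)
        joints.append((x, y))
    return joints

def end_error(noisy, clean):
    """Displacement of the end-effector (last joint)."""
    return math.dist(noisy[-1], clean[-1])

eps = 0.1  # the same angular error, applied to the root joint only
clean = [0.0, 0.0, 0.0]
err_local = end_error(chain_from_local([eps, 0.0, 0.0]), chain_from_local(clean))
err_global = end_error(chain_from_global([eps, 0.0, 0.0]), chain_from_global(clean))
```

With three bones, the root error rotates the whole chain in the local parameterization but only the first bone in the global one, so `err_local` is three times `err_global`. The gap grows with chain length.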

[118] GridPrune: From "Where to Look" to "What to Select" in Visual Token Pruning for MLLMs

Yuxiang Duan,Ao Li,Yingqin Li,Luyu Li,Pengwei Wang

Main category: cs.CV

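TL;DR: Proposes GridPrune, which uses a zonal "guide-globally, select-locally" strategy to prune visual tokens in multimodal large language models more efficiently, cutting computation substantially while maintaining strong performance.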

Details Motivation: Existing visual token pruning methods focus on "what to select" and neglect "where to look", which leads to inefficient spatial allocation, positional bias, and retention of redundant tokens. Inspired by human visual attention, a two-stage strategy is needed to allocate attention more efficiently. Method: GridPrune divides the image into spatial zones, first allocates a token budget to each zone dynamically under text-conditional guidance (guide-globally), and then selects tokens locally within each zone (select-locally), replacing the conventional global Top-K mechanism. Result: On LLaVA-NeXT-7B, the method retains 96.98% of full performance with only 11.1% of the visual tokens, 2.34% better than the best baseline at the same pruning rate. Conclusion: By mimicking the two-stage mechanism of human visual attention, GridPrune achieves more efficient spatial allocation and fine-grained selection, improving the trade-off between pruning efficiency and model performance across several MLLM architectures. Abstract: Multimodal large language models (MLLMs) have shown remarkable capabilities in a wide range of vision-language tasks. However, the large number of visual tokens introduces significant computational overhead. To address this issue, visual token pruning has emerged as a key technique for enhancing the efficiency of MLLMs. In cognitive science, humans tend to first determine which regions of a scene to attend to ("where to look") before deciding which specific elements within those regions to process in detail ("what to select"). This two-stage strategy enables the visual system to efficiently allocate attention at a coarse spatial level before performing fine-grained selection. However, existing pruning methods primarily focus on directly optimizing "what to select", typically using attention scores or similarity metrics. They rarely consider "where to look", which has been shown to lead to inefficient spatial allocation, positional bias, and the retention of irrelevant or redundant tokens. In this paper, we propose GridPrune, a method that replaces the global Top-K mechanism with a "guide-globally, select-locally" zonal selection system. GridPrune splits the pruning process into two steps: first, it uses text-conditional guidance to dynamically allocate a token budget across spatial zones; and then, it performs local selection within each budgeted zone. Experimental results demonstrate that GridPrune achieves superior performance across various MLLM architectures. On LLaVA-NeXT-7B, GridPrune retains 96.98% of the full performance while using 11.1% of the tokens, outperforming the best-performing baseline by 2.34% at the same pruning rate.

[119] SUGAR: Learning Skeleton Representation with Visual-Motion Knowledge for Action Recognition

Qilang Ye,Yu Zhou,Lian He,Jie Zhang,Xuanming Guo,Jiayu Zhang,Mingkui Tan,Weicheng Xie,Yue Sun,Tao Tan,Xiaochen Yuan,Ghada Khoriba,Zitong Yu

Main category: cs.CV

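TL;DR: Proposes SUGAR, a new paradigm that uses visual-motion knowledge to guide skeleton learning and pairs it with a large language model for action recognition and description.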

Details Motivation: The work explores how to combine large language models (LLMs) with human skeletons for action classification and description, addressing two problems: LLMs cannot directly understand skeleton data, and they struggle to distinguish between actions. Method: Off-the-shelf large-scale video models provide visual and motion information as prior knowledge, which supervises skeleton learning to produce discrete representations; a Temporal Query Projection (TQP) module models long skeleton sequences; a pre-trained LLM, without fine-tuning, interprets these representations and generates action labels and descriptions. Result: SUGAR is validated on several skeleton-based action classification benchmarks, and in zero-shot scenarios it outperforms linear-based methods, showing stronger generalization. Conclusion: SUGAR bridges the semantic gap between LLMs and skeleton data, demonstrates that external visual-motion knowledge can guide skeleton representation learning, and shows strong transferability and versatility in action recognition. Abstract: Large Language Models (LLMs) hold rich implicit knowledge and powerful transferability. In this paper, we explore the combination of LLMs with the human skeleton to perform action classification and description. However, when treating LLM as a recognizer, two questions arise: 1) How can LLMs understand skeleton? 2) How can LLMs distinguish among actions? To address these problems, we introduce a novel paradigm named learning Skeleton representation with visUal-motion knowledGe for Action Recognition (SUGAR). In our pipeline, we first utilize off-the-shelf large-scale video models as a knowledge base to generate visual and motion information related to actions. Then, we propose to supervise skeleton learning through this prior knowledge to yield discrete representations. Finally, we use the LLM with untouched pre-training weights to understand these representations and generate the desired action targets and descriptions. Notably, we present a Temporal Query Projection (TQP) module to continuously model the skeleton signals with long sequences. Experiments on several skeleton-based action classification benchmarks demonstrate the efficacy of our SUGAR. Moreover, experiments on zero-shot scenarios show that SUGAR is more versatile than linear-based methods.

[120] MTAttack: Multi-Target Backdoor Attacks against Large Vision-Language Models

Zihan Wang,Guansong Pang,Wenjun Miao,Jin Zheng,Xiao Bai

Main category: cs.CV

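TL;DR: This paper proposes MTAttack, the first multi-target backdoor attack framework against large vision-language models (LVLMs). Proxy space partitioning and trigger prototype anchoring constraints resolve feature interference among multiple triggers, yielding multi-target attacks with high success rates, strong generalization, and resistance to defenses.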

Details Motivation: Existing backdoor attacks concentrate on single-target attacks, whereas multi-target attacks are more threatening in practice. The paper aims to expose the security vulnerability of LVLMs to multi-target backdoor attacks. Method: The MTAttack framework uses a new optimization method that jointly optimizes multiple triggers in the latent space. A proxy space partitioning constraint and a trigger prototype anchoring constraint ensure that each trigger maps independently to a unique proxy class and that the triggers remain separable. Result: Experiments show that MTAttack achieves high attack success rates on several benchmarks, substantially outperforms existing methods, generalizes across datasets, and is robust to multiple backdoor defense strategies. Conclusion: LVLMs face a serious risk from multi-target backdoor attacks, and the effectiveness of MTAttack underscores the urgency of developing corresponding defenses. Abstract: Recent advances in Large Visual Language Models (LVLMs) have demonstrated impressive performance across various vision-language tasks by leveraging large-scale image-text pretraining and instruction tuning. However, the security vulnerabilities of LVLMs have become increasingly concerning, particularly their susceptibility to backdoor attacks. Existing backdoor attacks focus on single-target attacks, i.e., targeting a single malicious output associated with a specific trigger. In this work, we uncover multi-target backdoor attacks, where multiple independent triggers corresponding to different attack targets are added in a single pass of training, posing a greater threat to LVLMs in real-world applications. Executing such attacks in LVLMs is challenging since there can be many incorrect trigger-target mappings due to severe feature interference among different triggers. To address this challenge, we propose MTAttack, the first multi-target backdoor attack framework for enforcing accurate multiple trigger-target mappings in LVLMs. The core of MTAttack is a novel optimization method with two constraints, namely Proxy Space Partitioning constraint and Trigger Prototype Anchoring constraint. It jointly optimizes multiple triggers in the latent space, with each trigger independently mapping clean images to a unique proxy class while at the same time guaranteeing their separability. Experiments on popular benchmarks demonstrate a high success rate of MTAttack for multi-target attacks, substantially outperforming existing attack methods. Furthermore, our attack exhibits strong generalizability across datasets and robustness against backdoor defense strategies. These findings highlight the vulnerability of LVLMs to multi-target backdoor attacks and underscore the urgent need for mitigating such threats. Code is available at https://github.com/mala-lab/MTAttack.

[121] RobIA: Robust Instance-aware Continual Test-time Adaptation for Deep Stereo

Jueun Ko,Hyewon Park,Hyesong Choi,Dongbo Min

Main category: cs.CV

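TL;DR: Proposes RobIA, a robust, instance-aware continual test-time adaptation framework for stereo depth estimation that uses dynamic routing and pseudo-supervision to improve performance under dynamic domain shifts.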

Details Motivation: Stereo depth estimation in real environments faces dynamic domain shifts, sparse or unreliable supervision, and costly dense ground-truth labels. Most existing TTA methods assume a static target domain and cope poorly with continually changing environments. Method: The RobIA framework has two key components: (1) Attend-and-Excite Mixture-of-Experts (AttEx-MoE), which routes inputs dynamically to frozen experts through a lightweight self-attention mechanism; (2) Robust AdaptBN Teacher, a PEFT-based teacher model that provides dense pseudo-supervision to complement sparse handcrafted labels. Result: Experiments show that RobIA achieves superior adaptation performance across multiple dynamic target domains while remaining computationally efficient. Conclusion: With an instance-aware dynamic adaptation strategy and a stronger supervision signal, RobIA improves the robustness and generalization of stereo depth estimation under continual domain shift. Abstract: Stereo Depth Estimation in real-world environments poses significant challenges due to dynamic domain shifts, sparse or unreliable supervision, and the high cost of acquiring dense ground-truth labels. While recent Test-Time Adaptation (TTA) methods offer promising solutions, most rely on static target domain assumptions and input-invariant adaptation strategies, limiting their effectiveness under continual shifts. In this paper, we propose RobIA, a novel Robust, Instance-Aware framework for Continual Test-Time Adaptation (CTTA) in stereo depth estimation. RobIA integrates two key components: (1) Attend-and-Excite Mixture-of-Experts (AttEx-MoE), a parameter-efficient module that dynamically routes input to frozen experts via a lightweight self-attention mechanism tailored to epipolar geometry, and (2) Robust AdaptBN Teacher, a PEFT-based teacher model that provides dense pseudo-supervision by complementing sparse handcrafted labels. This strategy enables input-specific flexibility and broad supervision coverage, improving generalization under domain shift. Extensive experiments demonstrate that RobIA achieves superior adaptation performance across dynamic target domains while maintaining computational efficiency.

[122] Explicit Temporal-Semantic Modeling for Dense Video Captioning via Context-Aware Cross-Modal Interaction

Mingda Jia,Weiliang Meng,Zenghuang Fu,Yiheng Li,Qi Zeng,Yifan Zhang,Ju Xin,Rongtao Xu,Jiguang Zhang,Xiaopeng Zhang

Main category: cs.CV

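TL;DR: Proposes CACMI, an explicit temporal-semantic modeling framework for dense video captioning that reaches state-of-the-art performance through cross-modal frame aggregation and context-aware feature enhancement.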

Details Motivation: Existing methods rely on implicit modeling with frame-level or fragmented features, which fails to capture temporal coherence across event sequences and the full semantics of the visual context. Method: The Context-Aware Cross-Modal Interaction (CACMI) framework comprises cross-modal frame aggregation, which extracts temporally coherent, event-aligned textual features, and context-aware feature enhancement, which fuses visual dynamics with pseudo-event semantics through query-guided attention. Result: In extensive experiments on the ActivityNet Captions and YouCook2 datasets, CACMI achieves state-of-the-art performance on dense video captioning. Conclusion: Through explicit temporal-semantic modeling, CACMI improves the accuracy and semantic completeness of dense video captions. Abstract: Dense video captioning jointly localizes and captions salient events in untrimmed videos. Recent methods primarily focus on leveraging additional prior knowledge and advanced multi-task architectures to achieve competitive performance. However, these pipelines rely on implicit modeling that uses frame-level or fragmented video features, failing to capture the temporal coherence across event sequences and comprehensive semantics within visual contexts. To address this, we propose an explicit temporal-semantic modeling framework called Context-Aware Cross-Modal Interaction (CACMI), which leverages both latent temporal characteristics within videos and linguistic semantics from text corpus. Specifically, our model consists of two core components: Cross-modal Frame Aggregation aggregates relevant frames to extract temporally coherent, event-aligned textual features through cross-modal retrieval; and Context-aware Feature Enhancement utilizes query-guided attention to integrate visual dynamics with pseudo-event semantics. Extensive experiments on the ActivityNet Captions and YouCook2 datasets demonstrate that CACMI achieves state-of-the-art performance on the dense video captioning task.

[123] Right Looks, Wrong Reasons: Compositional Fidelity in Text-to-Image Generation

Mayank Vatsa,Aparna Bharati,Richa Singh

Main category: cs.CV

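TL;DR: Today's leading text-to-image models perform poorly on logical composition (negation, counting, and spatial relations). The causes are training data that lack negation, continuous attention architectures that are ill-suited to discrete logic, and evaluation metrics that reward visual plausibility over logical satisfaction. Genuine compositionality will require fundamental advances in representation and reasoning.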

Details Motivation: To expose and analyze the performance collapse of existing text-to-image models on logical composition tasks and to examine its root causes. Method: The survey analyzes how models perform when three core logical primitives (negation, counting, and spatial relations) are combined, and systematically examines training data, model architectures, and evaluation metrics to identify the key factors behind compositional failure. Result: Performance drops sharply when multiple logical primitives are combined; explicit negation is almost absent from training data; continuous attention mechanisms handle discrete logic poorly; and existing evaluation metrics do not favor logical accuracy. Conclusion: Genuine compositionality cannot be achieved by simple scaling or by fine-tuning existing architectures; new representation and reasoning mechanisms are needed. Abstract: The architectural blueprint of today's leading text-to-image models contains a fundamental flaw: an inability to handle logical composition. This survey investigates this breakdown across three core primitives: negation, counting, and spatial relations. Our analysis reveals a dramatic performance collapse: models that are accurate on single primitives fail precipitously when these are combined, exposing severe interference. We trace this failure to three key factors. First, training data show a near-total absence of explicit negations. Second, continuous attention architectures are fundamentally unsuitable for discrete logic. Third, evaluation metrics reward visual plausibility over constraint satisfaction. By analyzing recent benchmarks and methods, we show that current solutions and simple scaling cannot bridge this gap. Achieving genuine compositionality, we conclude, will require fundamental advances in representation and reasoning rather than incremental adjustments to existing architectures.

[124] Split-Layer: Enhancing Implicit Neural Representation by Maximizing the Dimensionality of Feature Space

Zhicheng Cai,Hao Zhu,Linsen Chen,Qiu Shen,Xun Cao

Main category: cs.CV

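TL;DR: Proposes the split-layer, a new MLP construction that splits each layer into multiple parallel branches and fuses their outputs with a Hadamard product, substantially increasing the representational capacity of implicit neural representations (INR) without a sharp rise in compute or memory cost.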

Details Motivation: The low-dimensional feature space of conventional MLP architectures limits the representational capacity of implicit neural representations (INR), while simply widening the network makes compute and memory costs grow quadratically, so an efficient way to expand the feature space is needed. Method: The split-layer divides each MLP layer into multiple parallel branches and combines their outputs with a Hadamard product, constructing a high-degree polynomial space and expanding the feature space dimensionality without significant extra computation. Result: Across tasks including 2D image fitting, 2D CT reconstruction, 3D shape representation, and 5D novel view synthesis, the split-layer substantially outperforms existing methods, confirming its benefit for INR performance. Conclusion: By constructing a high-dimensional feature space efficiently, the split-layer strengthens the representational capacity of INR and provides a more powerful architecture for inverse problems and continuous signal representation. Abstract: Implicit neural representation (INR) models signals as continuous functions using neural networks, offering efficient and differentiable optimization for inverse problems across diverse disciplines. However, the representational capacity of INR, defined by the range of functions the neural network can characterize, is inherently limited by the low-dimensional feature space in conventional multilayer perceptron (MLP) architectures. While widening the MLP can linearly increase feature space dimensionality, it also leads to a quadratic growth in computational and memory costs. To address this limitation, we propose the split-layer, a novel reformulation of MLP construction. The split-layer divides each layer into multiple parallel branches and integrates their outputs via Hadamard product, effectively constructing a high-degree polynomial space. This approach significantly enhances INR's representational capacity by expanding the feature space dimensionality without incurring prohibitive computational overhead. Extensive experiments demonstrate that the split-layer substantially improves INR performance, surpassing existing methods across multiple tasks, including 2D image fitting, 2D CT reconstruction, 3D shape representation, and 5D novel view synthesis.
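The core operation, parallel branches fused by an element-wise (Hadamard) product, can be sketched as follows. This is a minimal illustration that assumes plain affine branches with no activation function; how the paper places activations or shares weights is not stated in the abstract, and the function names are hypothetical.

```python
def linear(x, weights, bias):
    """Affine map: one output per weight row."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]

def split_layer(x, branches):
    """Apply each (weights, bias) branch to x and multiply outputs element-wise."""
    out = None
    for weights, bias in branches:
        y = linear(x, weights, bias)
        out = y if out is None else [o * v for o, v in zip(out, y)]
    return out

# Two 1-D branches, x and x + 1: their product x * (x + 1) is a degree-2
# polynomial in the input, which a single affine layer cannot represent.
branches = [([[1.0]], [0.0]), ([[1.0]], [1.0])]
result = split_layer([2.0], branches)  # 2 * (2 + 1)
```

With k branches the output is a degree-k polynomial in the input, which is the sense in which the product builds a high-degree polynomial space while each branch stays as narrow as the original layer.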

[125] Decoupling Bias, Aligning Distributions: Synergistic Fairness Optimization for Deepfake Detection

Feng Ding,Wenhui Yi,Yunpeng Zhou,Xinan He,Hong Rao,Shu Hu

Main category: cs.CV

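TL;DR: Proposes a dual-mechanism collaborative optimization framework that combines structural fairness decoupling with global distribution alignment to improve inter-group and intra-group fairness in cross-domain deepfake detection while preserving detection accuracy.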

Details Motivation: Existing fairness-enhanced deepfake detectors often sacrifice detection accuracy, and bias against groups defined by gender, race, and similar attributes can cause systemic misjudgments and worsen social inequity. Method: The dual-mechanism collaborative optimization framework combines structural fairness decoupling at the architecture level (decoupling channels sensitive to demographic attributes) with global distribution alignment at the feature level (reducing the distance between the overall sample distribution and each group's distribution). Result: Experiments show that the method outperforms existing methods across multiple domains, improving inter-group and intra-group fairness while maintaining overall detection accuracy. Conclusion: The framework balances fairness and detection performance, supporting the deployment of trustworthy and fair deepfake detectors in sensitive settings such as digital identity security. Abstract: Fairness is a core element in the trustworthy deployment of deepfake detection models, especially in the field of digital identity security. Biases in detection models toward different demographic groups, such as gender and race, may lead to systemic misjudgments, exacerbating the digital divide and social inequities. However, current fairness-enhanced detectors often improve fairness at the cost of detection accuracy. To address this challenge, we propose a dual-mechanism collaborative optimization framework. Our proposed method innovatively integrates structural fairness decoupling and global distribution alignment: decoupling channels sensitive to demographic groups at the model architectural level, and subsequently reducing the distance between the overall sample distribution and the distributions corresponding to each demographic group at the feature level. Experimental results demonstrate that, compared with other methods, our framework improves both inter-group and intra-group fairness while maintaining overall detection accuracy across domains.

[126] GEA: Generation-Enhanced Alignment for Text-to-Image Person Retrieval

Hao Zou,Runqing Zhang,Xue Zhou,Jianxiao Zou

Main category: cs.CV

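TL;DR: This paper proposes Generation-Enhanced Alignment (GEA), which introduces diffusion-generated images as intermediate semantic representations to improve cross-modal alignment in text-to-image person retrieval.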

Details Motivation: Existing text-to-image person retrieval methods are limited by incomplete textual queries and the modality gap, which lead to poor cross-modal alignment and overfitting. Method: The GEA framework has two modules: Text-Guided Token Enhancement (TGTE) uses diffusion-generated images as an intermediary, and Generative Intermediate Fusion (GIF) fuses generated-image, original-image, and text features through cross-attention, optimized with a triplet loss. Result: Experiments on the CUHK-PEDES, RSTPReid, and ICFG-PEDES datasets show that the method improves retrieval performance. Conclusion: Using generated images as an intermediate representation, GEA mitigates the modality gap and semantic incompleteness and improves the accuracy and robustness of text-to-image person retrieval. Abstract: Text-to-Image Person Retrieval (TIPR) aims to retrieve person images based on natural language descriptions. Although many TIPR methods have achieved promising results, sometimes textual queries cannot accurately and comprehensively reflect the content of the image, leading to poor cross-modal alignment and overfitting to limited datasets. Moreover, the inherent modality gap between text and image further amplifies these issues, making accurate cross-modal retrieval even more challenging. To address these limitations, we propose the Generation-Enhanced Alignment (GEA) from a generative perspective. GEA contains two parallel modules: (1) Text-Guided Token Enhancement (TGTE), which introduces diffusion-generated images as intermediate semantic representations to bridge the gap between text and visual patterns. These generated images enrich the semantic representation of text and facilitate cross-modal alignment. (2) Generative Intermediate Fusion (GIF), which combines cross-attention between generated images, original images, and text features to generate a unified representation optimized by triplet alignment loss. We conduct extensive experiments on three public TIPR datasets, CUHK-PEDES, RSTPReid, and ICFG-PEDES, to evaluate the performance of GEA. The results justify the effectiveness of our method. More implementation details and extended results are available at https://github.com/sugelamyd123/Sup-for-GEA.

[127] Physically Interpretable Multi-Degradation Image Restoration via Deep Unfolding and Explainable Convolution

Hu Gao,Xiaoning Lei,Xichen Xu,Depeng Dang,Lizhuang Ma

Main category: cs.CV

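TL;DR: This paper proposes InterIR, an interpretable multi-degradation image restoration method built on a deep unfolding network. An improved second-order semi-smooth Newton algorithm and an explainable convolution module let it adapt to multiple degradation types while keeping the model interpretable.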

Details Motivation: Most existing image restoration methods target a single degradation type, and methods that stack modules lack interpretability. Real images often contain several degradations at once, so a unified model that is both accurate and interpretable is needed. Method: Within a deep unfolding framework, the iterations of an optimization algorithm are mapped to a learnable network structure. An improved second-order semi-smooth Newton algorithm keeps each module physically interpretable, and an explainable convolution module inspired by human brain information processing improves adaptability and transparency. Result: InterIR performs very well on multi-degradation restoration and remains highly competitive on single-degradation tasks, confirming its effectiveness and generalization. Conclusion: The study demonstrates the advantages of interpretability-driven design for multi-degradation restoration. InterIR balances performance, interpretability, and flexibility and offers a new approach to image restoration in complex real-world scenes. Abstract: Although image restoration has advanced significantly, most existing methods target only a single type of degradation. In real-world scenarios, images often contain multiple degradations simultaneously, such as rain, noise, and haze, requiring models capable of handling diverse degradation types. Moreover, methods that improve performance through module stacking often suffer from limited interpretability. In this paper, we propose a novel interpretability-driven approach for multi-degradation image restoration, built upon a deep unfolding network that maps the iterative process of a mathematical optimization algorithm into a learnable network structure. Specifically, we employ an improved second-order semi-smooth Newton algorithm to ensure that each module maintains clear physical interpretability. To further enhance interpretability and adaptability, we design an explainable convolution module inspired by the human brain's flexible information processing and the intrinsic characteristics of images, allowing the network to flexibly leverage learned knowledge and autonomously adjust parameters for different inputs. The resulting tightly integrated architecture, named InterIR, demonstrates excellent performance in multi-degradation restoration while remaining highly competitive on single-degradation tasks.

[128] Towards Blind and Low-Vision Accessibility of Lightweight VLMs and Custom LLM-Evals

Shruti Singh Baghel,Yash Pratap Singh Rathore,Sushovan Jena,Anurag Pradhan,Amit Shukla,Arnav Bhavsar,Pawan Goyal

Main category: cs.CV

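TL;DR: This paper studies how vision-language models of different sizes perform when generating accessible video descriptions for blind and low-vision (BLV) users, proposes two new evaluation frameworks, and measures on-device performance on a smartphone.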

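Details Motivation: Large vision-language models are powerful but resource-hungry and hard to deploy on mobile devices, which limits their practical use for BLV users. The performance and feasibility of small models in accessibility settings therefore need study. Method: The 500M and 2.2B parameter versions of SmolVLM2 are evaluated on the AVCaps and Charades datasets; a Multi-Context BLV Framework and a Navigational Assistance Framework are proposed to assess description quality; four prompt design strategies are analyzed systematically; and performance at FP32 and INT8 precision is tested on a smartphone. Result: Smaller models can generate high-quality, context-aware descriptions under particular prompt strategies; INT8 quantization markedly reduces resource consumption with limited effect on description quality; and the two new frameworks measure BLV-relevant information coverage more completely. Conclusion: Lightweight VLMs, combined with sound prompt design and quantization, can support the video understanding needs of BLV users on resource-constrained devices and have practical deployment potential. Abstract: Large Vision-Language Models (VLMs) excel at understanding and generating video descriptions but their high memory, computation, and deployment demands hinder practical use, particularly for blind and low-vision (BLV) users who depend on detailed, context-aware descriptions. To study the effect of model size on accessibility-focused description quality, we evaluate SmolVLM2 variants with 500M and 2.2B parameters across two diverse datasets: AVCaps (outdoor), and Charades (indoor). In this work, we introduce two novel evaluation frameworks specifically designed for BLV accessibility assessment: the Multi-Context BLV Framework evaluating spatial orientation, social interaction, action events, and ambience contexts; and the Navigational Assistance Framework focusing on mobility-critical information. Additionally, we conduct a systematic evaluation of four different prompt design strategies and deploy both models on a smartphone, evaluating FP32 and INT8 precision variants to assess real-world performance constraints on resource-limited mobile devices.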

[129] CephRes-MHNet: A Multi-Head Residual Network for Accurate and Robust Cephalometric Landmark Detection

Ahmed Jaheen,Islam Hassan,Mohanad Abouserie,Abdelaty Rehab,Adham Elasfar,Knzy Elmasry,Mostafa El-Dawlatly,Seif Eldawlatly

Main category: cs.CV

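TL;DR: This paper proposes CephRes-MHNet, a multi-head residual convolutional network that automatically detects cephalometric landmarks in 2D lateral skull X-rays. It is accurate and parameter-efficient and outperforms existing methods on the Aariz dataset.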

Details Motivation: Accurate cephalometric landmark localization is essential for orthodontic diagnosis, but manual annotation is slow and error-prone, and existing automated methods perform poorly under low contrast and complex anatomy. Method: CephRes-MHNet combines residual encoding, dual-attention mechanisms, and multi-head decoders to strengthen contextual reasoning and anatomical localization precision. Result: On the Aariz dataset (1,000 images), CephRes-MHNet reaches a mean radial error of 1.23 mm and a success detection rate of 85.5% within 2.0 mm, outperforming all compared models with less than 25% of the parameters of the strongest baseline. Conclusion: With an efficient architecture, CephRes-MHNet reaches state-of-the-art cephalometric landmark detection, combining high accuracy with a light footprint suited to clinical use. Abstract: Accurate localization of cephalometric landmarks from 2D lateral skull X-rays is vital for orthodontic diagnosis and treatment. Manual annotation is time-consuming and error-prone, whereas automated approaches often struggle with low contrast and anatomical complexity. This paper introduces CephRes-MHNet, a multi-head residual convolutional network for robust and efficient cephalometric landmark detection. The architecture integrates residual encoding, dual-attention mechanisms, and multi-head decoders to enhance contextual reasoning and anatomical precision. Trained on the Aariz Cephalometric dataset of 1,000 radiographs, CephRes-MHNet achieved a mean radial error (MRE) of 1.23 mm and a success detection rate (SDR) @ 2.0 mm of 85.5%, outperforming all evaluated models. In particular, it exceeded the strongest baseline, the attention-driven AFPF-Net (MRE = 1.25 mm, SDR @ 2.0 mm = 84.1%), while using less than 25% of its parameters. These results demonstrate that CephRes-MHNet attains state-of-the-art accuracy through architectural efficiency, providing a practical solution for real-world orthodontic analysis.
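The two reported metrics can be computed as in the sketch below. It assumes landmark coordinates are already expressed in millimetres and that SDR counts landmarks whose radial error is at or below the threshold (some evaluations use a strict inequality). The function name `mre_sdr` is hypothetical.

```python
import math

def mre_sdr(predicted, truth, threshold_mm=2.0):
    """Mean radial error and success detection rate for paired landmarks.

    predicted, truth: equal-length lists of (x, y) points in millimetres.
    """
    errors = [math.dist(p, t) for p, t in zip(predicted, truth)]
    mre = sum(errors) / len(errors)
    sdr = sum(1 for e in errors if e <= threshold_mm) / len(errors)
    return mre, sdr

# Two landmarks: one is 1 mm off, the other 5 mm off.
mre, sdr = mre_sdr([(0.0, 0.0), (3.0, 4.0)], [(0.0, 1.0), (0.0, 0.0)])
```

Here the radial errors are 1 mm and 5 mm, so the MRE is 3.0 mm and the SDR at 2.0 mm is 0.5: the mean is sensitive to the single large miss, while the SDR only counts how many landmarks fall inside the tolerance.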

[130] Utilizing a Geospatial Foundation Model for Coastline Delineation in Small Sandy Islands

Tishya Chhabra,Manisha Bajpai,Walter Zesk,Skylar Tibbits

Main category: cs.CV

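TL;DR: This study evaluates NASA and IBM's Prithvi-EO-2.0 geospatial foundation model on shoreline delineation for small sandy islands, training and testing on 225 multispectral images of Maldivian islands. Even with only 5 training images the model performs well (F1 = 0.94, IoU = 0.79), showing its potential for coastal monitoring in data-scarce regions.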

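Details Motivation: Small islands often lack sufficient labeled remote sensing data, which makes shoreline delineation difficult for conventional methods, so foundation models with strong transfer learning ability are needed to support coastal monitoring in data-poor regions. Method: The authors build and publicly release a dataset of 225 multispectral satellite images covering two Maldivian islands, then fine-tune the 300M and 600M parameter versions of Prithvi-EO-2.0 on training subsets of different sizes (5 to 181 images) and evaluate shoreline delineation performance. Result: Even with only 5 training images, the Prithvi models reach an F1 score of 0.94 and an IoU of 0.79, and performance is stable across training set sizes, confirming strong transfer learning ability. Conclusion: Prithvi-EO-2.0 delineates shorelines well in the few-shot setting, indicating that geospatial foundation models have broad potential in data-scarce coastal regions and could make remote sensing monitoring more efficient. Abstract: We present an initial evaluation of NASA and IBM's Prithvi-EO-2.0 geospatial foundation model on shoreline delineation of small sandy islands using satellite images. We curated and labeled a dataset of 225 multispectral images of two Maldivian islands, which we publicly release, and fine-tuned both the 300M and 600M parameter versions of Prithvi on training subsets ranging from 5 to 181 images. Our experiments show that even with as few as 5 training images, the models achieve high performance (F1 of 0.94, IoU of 0.79). Our results demonstrate the strong transfer learning capability of Prithvi, underscoring the potential of such models to support coastal monitoring in data-poor regions.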
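For readers unfamiliar with the two scores, the sketch below computes F1 and IoU for one pair of binary masks, flattened to lists of 0/1. It assumes at least one positive pixel in either mask, and the name `f1_iou` is hypothetical. How the paper aggregates scores over images or classes is not stated in the abstract.

```python
def f1_iou(pred, truth):
    """F1 (Dice) and IoU (Jaccard) for two flattened binary masks."""
    tp = sum(1 for p, t in zip(pred, truth) if p and t)
    fp = sum(1 for p, t in zip(pred, truth) if p and not t)
    fn = sum(1 for p, t in zip(pred, truth) if not p and t)
    f1 = 2 * tp / (2 * tp + fp + fn)
    iou = tp / (tp + fp + fn)
    return f1, iou

# One true positive, one false positive, one false negative.
f1, iou = f1_iou([1, 1, 0, 0], [1, 0, 1, 0])
```

For a single mask pair the two scores are tied by F1 = 2·IoU / (1 + IoU), so an IoU of 0.79 would correspond to an F1 of about 0.88. The reported pair (0.94, 0.79) does not satisfy this identity, which suggests the figures are aggregated differently, for example averaged per image or per class.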

[131] VISTA: A Vision and Intent-Aware Social Attention Framework for Multi-Agent Trajectory Prediction

Stephane Da Silva Martins,Emanuel Aldea,Sylvie Le Hégarat-Mascle

Main category: cs.CV

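TL;DR: VISTA is a recursive goal-conditioned transformer for multi-agent trajectory prediction. It fuses long-term intent with past motion, models social interactions flexibly, and exposes interpretable social influence patterns, achieving state-of-the-art accuracy and lower collision rates in dense environments.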

Details Motivation: Existing methods struggle to capture agents' long-term goals and their fine-grained social interactions at the same time, so the predicted multi-agent futures are unrealistic. Method: VISTA comprises a cross-attention fusion module (integrating long-term intent with past motion), a social-token attention mechanism (flexible modeling of interactions between agents), and pairwise attention maps (making social influence interpretable at inference time), extending single-agent goal-conditioned prediction into a coherent multi-agent forecasting framework. Result: On the MADRAS and SDD datasets, VISTA is state of the art on standard displacement metrics and reduces collisions markedly: the average collision rate on MADRAS falls from 2.14% to 0.03%, and on SDD it reaches zero collisions while improving ADE, FDE, and minFDE. Conclusion: VISTA generates socially compliant, goal-aware, and interpretable trajectories suitable for safety-critical autonomous systems. Abstract: Multi-agent trajectory prediction is crucial for autonomous systems operating in dense, interactive environments. Existing methods often fail to jointly capture agents' long-term goals and their fine-grained social interactions, which leads to unrealistic multi-agent futures. We propose VISTA, a recursive goal-conditioned transformer for multi-agent trajectory forecasting. VISTA combines (i) a cross-attention fusion module that integrates long-horizon intent with past motion, (ii) a social-token attention mechanism for flexible interaction modeling across agents, and (iii) pairwise attention maps that make social influence patterns interpretable at inference time. Our model turns single-agent goal-conditioned prediction into a coherent multi-agent forecasting framework. Beyond standard displacement metrics, we evaluate trajectory collision rates as a measure of joint realism. On the high-density MADRAS benchmark and on SDD, VISTA achieves state-of-the-art accuracy and substantially fewer collisions. On MADRAS, it reduces the average collision rate of strong baselines from 2.14 to 0.03 percent, and on SDD it attains zero collisions while improving ADE, FDE, and minFDE. These results show that VISTA generates socially compliant, goal-aware, and interpretable trajectories, making it promising for safety-critical autonomous systems.
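The displacement and collision metrics mentioned above can be sketched as follows. ADE and FDE follow their usual definitions. The collision rate here is one plausible definition, chosen as an assumption: the fraction of agent pairs that come within a given radius at any shared timestep. The paper's exact definition is not given in the abstract, and the function names are hypothetical.

```python
import math
from itertools import combinations

def ade_fde(pred, truth):
    """Average and final displacement error for one trajectory of (x, y) points."""
    dists = [math.dist(p, t) for p, t in zip(pred, truth)]
    return sum(dists) / len(dists), dists[-1]

def collision_rate(trajectories, radius):
    """Fraction of agent pairs closer than `radius` at some shared timestep."""
    pairs = list(combinations(trajectories, 2))
    hits = sum(
        1 for a, b in pairs
        if any(math.dist(pa, pb) < radius for pa, pb in zip(a, b))
    )
    return hits / len(pairs)

ade, fde = ade_fde([(0, 0), (1, 0), (2, 0)], [(0, 0), (1, 1), (2, 2)])

# Agents A and B nearly touch at the second timestep; C stays far away.
agents = [[(0, 0), (1, 0)], [(0, 1), (1, 0.1)], [(5, 5), (6, 6)]]
rate = collision_rate(agents, radius=0.5)
```

ADE and FDE score each agent in isolation, so a set of individually accurate trajectories can still collide. A joint metric such as the collision rate checks for that, which is why the abstract reports it as a measure of joint realism.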

[132] LiNeXt: Revisiting LiDAR Completion with Efficient Non-Diffusion Architectures

Wenzhe He,Xiaojun Chen,Ruiqi Wang,Ruihui Li,Huilong Pi,Jiapeng Zhang,Zhuo Tang,Kenli Li

Main category: cs.CV

TL;DR: Proposes LiNeXt, a lightweight non-diffusion network for fast and accurate 3D LiDAR point cloud completion that substantially speeds up inference and cuts computational cost relative to diffusion models.

Details Motivation: Existing diffusion-based methods incur heavy computational overhead from multi-step iterative sampling, which makes them hard to use for real-time perception in autonomous driving. Method: A Noise-to-Coarse (N2C) module denoises in a single step and a Refine module then performs fine-grained refinement; a Distance-aware Selected Repeat strategy generates a more uniformly distributed noisy point cloud. Result: On SemanticKITTI, compared with LiDiff, inference is 199.8x faster, Chamfer Distance is 50.7% lower, and the parameter count is only 6.1% as large. Conclusion: LiNeXt clearly outperforms existing methods in both efficiency and completion quality, making it suitable for real-time 3D LiDAR scene completion. Abstract: 3D LiDAR scene completion from point clouds is a fundamental component of perception systems in autonomous vehicles. Previous methods have predominantly employed diffusion models for high-fidelity reconstruction. However, their multi-step iterative sampling incurs significant computational overhead, limiting their real-time applicability. To address this, we propose LiNeXt, a lightweight, non-diffusion network optimized for rapid and accurate point cloud completion. Specifically, LiNeXt first applies the Noise-to-Coarse (N2C) Module to denoise the input noisy point cloud in a single pass, thereby obviating the multi-step iterative sampling of diffusion-based methods. The Refine Module then takes the coarse point cloud and its intermediate features from the N2C Module to perform more precise refinement, further enhancing structural completeness. Furthermore, we observe that LiDAR point clouds exhibit a distance-dependent spatial distribution, being densely sampled at proximal ranges and sparsely sampled at distal ranges. Accordingly, we propose the Distance-aware Selected Repeat strategy to generate a more uniformly distributed noisy point cloud. On the SemanticKITTI dataset, LiNeXt achieves a 199.8x speedup in inference, reduces Chamfer Distance by 50.7%, and uses only 6.1% of the parameters compared with LiDiff. These results demonstrate the superior efficiency and effectiveness of LiNeXt for real-time scene completion.
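
For readers unfamiliar with the Chamfer Distance reported here, a brute-force sketch may help. It assumes one common convention (the sum of the two directional means of squared nearest-neighbor distances); the abstract does not say which variant the authors use, and the function name is illustrative.

```python
def chamfer_distance(a, b):
    """Symmetric Chamfer distance between two non-empty point sets.

    a, b: lists of equal-dimension coordinate tuples.
    Uses squared Euclidean distance, which is an assumed convention.
    """
    def one_way(src, dst):
        # Mean over src of the squared distance to the nearest point in dst.
        return sum(
            min(sum((s - d) ** 2 for s, d in zip(p, q)) for q in dst)
            for p in src
        ) / len(src)

    return one_way(a, b) + one_way(b, a)
```

This runs in O(|a| * |b|) time, which is fine for illustration; evaluations on full LiDAR scans normally rely on a spatial index or a GPU implementation.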

[133] HeatV2X: Scalable Heterogeneous Collaborative Perception via Efficient Alignment and Interaction

Yueran Zhao,Zhang Zhang,Chao Sun,Tianze Wang,Chao Yue,Nuoran Li

Main category: cs.CV

TL;DR: Proposes HeatV2X, a scalable Vehicle-to-Everything (V2X) collaborative perception framework that uses heterogeneous graph attention and adaptive fine-tuning to align features and coordinate multi-modal heterogeneous agents efficiently, improving perception performance while sharply reducing training overhead.

Details Motivation: Existing V2X collaborative perception frameworks have difficulty aligning features across multi-modal heterogeneous agents and do not scale well to new agents, which raises training cost and limits performance. Method: HeatV2X first trains a high-performance base agent with heterogeneous graph attention, then applies Local Heterogeneous Fine-Tuning (using Hetero-Aware Adapters to extract modality-specific differences) and Global Collaborative Fine-Tuning (using a Multi-Cognitive Adapter to strengthen cross-agent collaboration) to achieve efficient alignment and fusion. Result: On the OPV2V-H and DAIR-V2X datasets, the method outperforms current state-of-the-art approaches in perception performance while significantly reducing training overhead. Conclusion: Through local and global adaptive fine-tuning, HeatV2X addresses collaborative perception among multi-modal heterogeneous agents, combining high performance with scalability for practical V2X use. Abstract: Vehicle-to-Everything (V2X) collaborative perception extends sensing beyond single vehicle limits through transmission. However, as more agents participate, existing frameworks face two key challenges: (1) the participating agents are inherently multi-modal and heterogeneous, and (2) the collaborative framework must be scalable to accommodate new agents. The former requires effective cross-agent feature alignment to mitigate heterogeneity loss, while the latter renders full-parameter training impractical, highlighting the importance of scalable adaptation. To address these issues, we propose Heterogeneous Adaptation (HeatV2X), a scalable collaborative framework. We first train a high-performance agent based on heterogeneous graph attention as the foundation for collaborative learning. Then, we design Local Heterogeneous Fine-Tuning and Global Collaborative Fine-Tuning to achieve effective alignment and interaction among heterogeneous agents. The former efficiently extracts modality-specific differences using Hetero-Aware Adapters, while the latter employs the Multi-Cognitive Adapter to enhance cross-agent collaboration and fully exploit the fusion potential. These designs enable substantial performance improvement of the collaborative framework with minimal training cost. We evaluate our approach on the OPV2V-H and DAIR-V2X datasets. Experimental results demonstrate that our method achieves superior perception performance with significantly reduced training overhead, outperforming existing state-of-the-art approaches. Our implementation will be released soon.

[134] Next-Frame Feature Prediction for Multimodal Deepfake Detection and Temporal Localization

Ashutosh Anshul,Shreyas Gopal,Deepu Rajan,Eng Siong Chng

Main category: cs.CV

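TL;DR: Proposes a single-stage training framework that adds uni-modal and cross-modal next-frame prediction plus a window-level attention mechanism to improve the generalization and temporal localization accuracy of multimodal deepfake detection.

Details Motivation: Existing methods depend on pretraining and focus mainly on audio-visual inconsistencies, so they generalize poorly to unseen manipulation types, overlook intra-modal artifacts, and perform poorly on fully or partially manipulated videos. Method: Next-frame prediction tasks (both uni-modal and cross-modal) are added to single-stage training, and a window-level attention mechanism captures discrepancies between predicted and actual frames to detect local manipulation traces. Result: The model shows strong generalization and precise temporal localization on multiple benchmark datasets, particularly when detecting fully manipulated videos and partially spoofed segments. Conclusion: The method generalizes well without extra pretraining, detects forgeries that preserve audio-visual alignment, and localizes manipulated segments precisely, improving the practicality and robustness of deepfake detection. Abstract: Recent multimodal deepfake detection methods designed for generalization conjecture that single-stage supervised training struggles to generalize across unseen manipulations and datasets. However, such approaches that target generalization require pretraining over real samples. Additionally, these methods primarily focus on detecting audio-visual inconsistencies and may overlook intra-modal artifacts, causing them to fail against manipulations that preserve audio-visual alignment. To address these limitations, we propose a single-stage training framework that enhances generalization by incorporating next-frame prediction for both uni-modal and cross-modal features. Additionally, we introduce a window-level attention mechanism to capture discrepancies between predicted and actual frames, enabling the model to detect local artifacts around every frame, which is crucial for accurately classifying fully manipulated videos and effectively localizing deepfake segments in partially spoofed samples. Our model, evaluated on multiple benchmark datasets, demonstrates strong generalization and precise temporal localization.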

[135] TubeRMC: Tube-conditioned Reconstruction with Mutual Constraints for Weakly-supervised Spatio-Temporal Video Grounding

Jinxuan Li,Yi Zhang,Jian-Fang Hu,Chaolei Tan,Tianming Liang,Beihao Xia

Main category: cs.CV

TL;DR: Proposes TubeRMC, a framework for weakly-supervised spatio-temporal video grounding that improves text-video alignment through tube-conditioned reconstruction and mutual constraints, markedly improving the consistency of target identification and tracking.

Details Motivation: Existing weakly-supervised spatio-temporal video grounding methods mostly use simple late fusion, so the candidate tubes they generate are independent of the text description, causing failed target identification and inconsistent tracking; a tighter coupling between text and spatio-temporal tubes is needed. Method: TubeRMC uses pre-trained visual grounding models to generate text-conditioned candidate tubes, designs three reconstruction strategies (temporal, spatial, and spatio-temporal), each with its own Tube-conditioned Reconstructor, and introduces spatio-temporal mutual constraints during reconstruction to improve proposal quality. Result: The method surpasses existing approaches on the two public benchmarks VidSTG and HCSTVG, and visualizations show that it reduces target identification errors and inconsistent tracking. Conclusion: By combining text-guided candidate tube generation with multi-perspective tube-conditioned reconstruction, TubeRMC achieves more accurate and consistent spatio-temporal video grounding in the weakly-supervised setting. Abstract: Spatio-Temporal Video Grounding (STVG) aims to localize a spatio-temporal tube that corresponds to a given language query in an untrimmed video. This is a challenging task since it involves complex vision-language understanding and spatiotemporal reasoning. Recent works have explored the weakly-supervised setting in STVG to eliminate reliance on fine-grained annotations like bounding boxes or temporal stamps. However, they typically follow a simple late-fusion manner, which generates tubes independent of the text description, often resulting in failed target identification and inconsistent target tracking. To address this limitation, we propose a Tube-conditioned Reconstruction with Mutual Constraints (TubeRMC) framework that generates text-conditioned candidate tubes with pre-trained visual grounding models and further refines them via tube-conditioned reconstruction with spatio-temporal constraints. Specifically, we design three reconstruction strategies from temporal, spatial, and spatio-temporal perspectives to comprehensively capture rich tube-text correspondences. Each strategy is equipped with a Tube-conditioned Reconstructor, utilizing spatio-temporal tubes as condition to reconstruct the key clues in the query. We further introduce mutual constraints between spatial and temporal proposals to enhance their quality for reconstruction. TubeRMC outperforms existing methods on two public benchmarks VidSTG and HCSTVG. Further visualization shows that TubeRMC effectively mitigates both target identification errors and inconsistent tracking.

[136] FineSkiing: A Fine-grained Benchmark for Skiing Action Quality Assessment

Yongji Zhang,Siqi Li,Yue Gao,Yu Jiang

Main category: cs.CV

TL;DR: Proposes JudgeMind, a new action quality assessment method, and builds the first aerial (freestyle) skiing dataset with fine-grained sub-score and deduction annotations; by simulating how professional judges score, it improves both accuracy and interpretability.

Details Motivation: Existing action quality assessment methods rely on whole-video features and lack interpretability, and existing datasets lack fine-grained score annotations, which limits model reliability and performance. Method: JudgeMind scores the action video stage by stage, adds a stage-aware feature enhancement and fusion module, and couples it with a knowledge-based grade-aware decoder that uses deduction items as prior knowledge to improve scoring accuracy. Result: Experiments on the newly built dataset show that the method achieves state-of-the-art performance, with clear gains in scoring accuracy and robustness. Conclusion: By simulating judges' reasoning and introducing fine-grained annotations, JudgeMind improves the reliability and interpretability of action quality assessment and offers a new benchmark and direction for future AQA research. Abstract: Action Quality Assessment (AQA) aims to evaluate and score sports actions, which has attracted widespread interest in recent years. Existing AQA methods primarily predict scores based on features extracted from the entire video, resulting in limited interpretability and reliability. Meanwhile, existing AQA datasets also lack fine-grained annotations for action scores, especially for deduction items and sub-score annotations. In this paper, we construct the first AQA dataset containing fine-grained sub-score and deduction annotations for aerial skiing, which will be released as a new benchmark. For the technical challenges, we propose a novel AQA method, named JudgeMind, which significantly enhances performance and reliability by simulating the judgment and scoring mindset of professional referees. Our method segments the input action video into different stages and scores each stage to enhance accuracy. Then, we propose a stage-aware feature enhancement and fusion module to boost the perception of stage-specific key regions and enhance the robustness to visual changes caused by frequent camera viewpoint switching. In addition, we propose a knowledge-based grade-aware decoder to incorporate possible deduction items as prior knowledge to predict more accurate and reliable scores. Experimental results demonstrate that our method achieves state-of-the-art performance.

[137] Facial-R1: Aligning Reasoning and Recognition for Facial Emotion Analysis

Jiulong Wu,Yucheng Shen,Lingyong Yan,Haixin Sun,Deguo Xia,Jizhou Huang,Min Cao

Main category: cs.CV

TL;DR: Proposes Facial-R1, a three-stage alignment framework that tackles hallucinated reasoning and the mismatch between reasoning and recognition in facial emotion analysis, and introduces the FEA-20K dataset, achieving SOTA performance.

Details Motivation: Existing vision-language-model approaches to facial emotion analysis suffer from hallucinated reasoning and from inconsistency between emotion recognition and the reasoning process, and they lack fine-grained, interpretable joint modeling. Method: Facial-R1 is a three-stage alignment framework: 1) instruction fine-tuning establishes basic emotional reasoning capability; 2) reinforcement training with emotion and AU labels as reward signals aligns reasoning with recognition; 3) a data synthesis pipeline iteratively expands the training set. The FEA-20K dataset is built alongside it. Result: Experiments on eight standard benchmarks show that Facial-R1 achieves state-of-the-art performance in facial emotion analysis, with strong generalization and interpretability. Conclusion: Through three-stage alignment, Facial-R1 addresses hallucination and inconsistency in VLM-based emotion analysis and, combined with a self-evolving data synthesis strategy, advances interpretable fine-grained emotion analysis. Abstract: Facial Emotion Analysis (FEA) extends traditional facial emotion recognition by incorporating explainable, fine-grained reasoning. The task integrates three subtasks: emotion recognition, facial Action Unit (AU) recognition, and AU-based emotion reasoning to model affective states jointly. While recent approaches leverage Vision-Language Models (VLMs) and achieve promising results, they face two critical limitations: (1) hallucinated reasoning, where VLMs generate plausible but inaccurate explanations due to insufficient emotion-specific knowledge; and (2) misalignment between emotion reasoning and recognition, caused by fragmented connections between observed facial features and final labels. We propose Facial-R1, a three-stage alignment framework that effectively addresses both challenges with minimal supervision. First, we employ instruction fine-tuning to establish basic emotional reasoning capability. Second, we introduce reinforcement training guided by emotion and AU labels as reward signals, which explicitly aligns the generated reasoning process with the predicted emotion. Third, we design a data synthesis pipeline that iteratively leverages the prior stages to expand the training dataset, enabling scalable self-improvement of the model. Built upon this framework, we introduce FEA-20K, a benchmark dataset comprising 17,737 training and 1,688 test samples with fine-grained emotion analysis annotations. Extensive experiments across eight standard benchmarks demonstrate that Facial-R1 achieves state-of-the-art performance in FEA, with strong generalization and robust interpretability.

[138] H3Former: Hypergraph-based Semantic-Aware Aggregation via Hyperbolic Hierarchical Contrastive Loss for Fine-Grained Visual Classification

Yongji Zhang,Siqi Li,Kuiyang Huang,Yue Gao,Yu Jiang

Main category: cs.CV

TL;DR: Proposes H3Former, a new token-to-region framework for fine-grained visual classification (FGVC) that aggregates local representations through high-order semantic relations and models them at the region level, yielding a clear gain in classification performance.

Details Motivation: Existing methods often capture discriminative features incompletely and introduce substantial category-agnostic redundancy, making it hard to handle subtle inter-class differences and intra-class variation among fine-grained categories. Method: H3Former combines a Semantic-Aware Aggregation Module (SAAM) with a Hyperbolic Hierarchical Contrastive Loss (HHCL). SAAM uses multi-scale contextual cues to build a weighted hypergraph and aggregates token features via hypergraph convolution; HHCL imposes hierarchical semantic constraints in a non-Euclidean space to improve inter-class separability and intra-class consistency. Result: Experiments on four standard FGVC benchmarks show that H3Former outperforms existing methods on fine-grained classification. Conclusion: Through high-order semantic modeling and hierarchical contrastive learning in non-Euclidean space, H3Former extracts discriminative features more comprehensively and offers an effective new approach to fine-grained visual classification. Abstract: Fine-Grained Visual Classification (FGVC) remains a challenging task due to subtle inter-class differences and large intra-class variations. Existing approaches typically rely on feature-selection mechanisms or region-proposal strategies to localize discriminative regions for semantic analysis. However, these methods often fail to capture discriminative cues comprehensively while introducing substantial category-agnostic redundancy. To address these limitations, we propose H3Former, a novel token-to-region framework that leverages high-order semantic relations to aggregate local fine-grained representations with structured region-level modeling. Specifically, we propose the Semantic-Aware Aggregation Module (SAAM), which exploits multi-scale contextual cues to dynamically construct a weighted hypergraph among tokens. By applying hypergraph convolution, SAAM captures high-order semantic dependencies and progressively aggregates token features into compact region-level representations. Furthermore, we introduce the Hyperbolic Hierarchical Contrastive Loss (HHCL), which enforces hierarchical semantic constraints in a non-Euclidean embedding space. The HHCL enhances inter-class separability and intra-class consistency while preserving the intrinsic hierarchical relationships among fine-grained categories. Comprehensive experiments conducted on four standard FGVC benchmarks validate the superiority of our H3Former framework.

[139] PROPA: Toward Process-level Optimization in Visual Reasoning via Reinforcement Learning

Yanbei Jiang,Chao Lei,Yihao Ding,Krista Ehinger,Jey Han Lau

Main category: cs.CV

TL;DR: Proposes PROPA, a framework that combines Monte Carlo Tree Search with reinforcement learning to optimize vision-language reasoning at a fine-grained level without human annotations, with substantial gains across multiple benchmarks and models.

Details Motivation: Existing vision-language models perform poorly on complex reasoning because early errors propagate, and they depend on costly step-level annotations or sparse feedback, which makes stable optimization difficult. Method: PROPA combines MCTS with GRPO to generate dense process-level rewards, interleaves SFT and GRPO updates to overcome the cold-start problem, and trains a Process Reward Model to guide inference-time search. Result: Across seven benchmarks and four VLM backbones, PROPA outperforms SFT and RLVR baselines, with gains of up to 17.0% on in-domain tasks and 21.0% on out-of-domain tasks. Conclusion: By introducing process-level optimization and aligning training with inference, PROPA markedly strengthens the complex reasoning ability and generalization of vision-language models. Abstract: Despite significant progress, Vision-Language Models (VLMs) still struggle with complex visual reasoning, where multi-step dependencies cause early errors to cascade through the reasoning chain. Existing post-training paradigms are limited: Supervised Fine-Tuning (SFT) relies on costly step-level annotations, while Reinforcement Learning with Verifiable Rewards (RLVR) methods like GRPO provide only sparse, outcome-level feedback, hindering stable optimization. We introduce PROPA (Process-level Reasoning Optimization with interleaved Policy Alignment), a novel framework that integrates Monte Carlo Tree Search (MCTS) with GRPO to generate dense, process-level rewards and optimize reasoning at each intermediate step without human annotations. To overcome the cold-start problem, PROPA interleaves GRPO updates with SFT, enabling the model to learn from both successful and failed reasoning trajectories. A Process Reward Model (PRM) is further trained to guide inference-time search, aligning the test-time search with the training signal. Across seven benchmarks and four VLM backbones, PROPA consistently outperforms both SFT- and RLVR-based baselines. It achieves up to 17.0% gains on in-domain tasks and 21.0% gains on out-of-domain tasks compared to existing state-of-the-art, establishing a strong reasoning and generalization capability for visual reasoning tasks. The code is available at: https://github.com/YanbeiJiang/PROPA.

[140] Adaptive Residual-Update Steering for Low-Overhead Hallucination Mitigation in Large Vision Language Models

Zhengtao Zou,Ya Gao,Jiarui Guan,Bin Li,Pekka Marttinen

Main category: cs.CV

TL;DR: Proposes RUDDER, a low-overhead framework for reducing object hallucination when large vision-language models (LVLMs) generate text. It uses a Contextual Activation Residual Direction (CARD) vector taken from the residual update of a self-attention layer, together with a Bayesian-inspired adaptive gate, to improve the visual consistency of generated output without a significant increase in latency.

Details Motivation: Large vision-language models often hallucinate objects, generating content inconsistent with the visual input, which undermines reliability. Existing mitigation methods work but are computationally expensive, limiting practical use, so an efficient, low-latency intervention is needed. Method: RUDDER has two core parts: first, a CARD vector extracted from the residual update of a self-attention layer during a single forward pass serves as visual evidence; second, a Bayesian-inspired adaptive gate adjusts the strength of the corrective signal according to how far the model deviates from the visual context, giving token-wise correction. Result: On mainstream hallucination benchmarks such as POPE and CHAIR, RUDDER performs comparably to the current best methods while adding almost no computational latency, indicating that it improves the reliability of LVLM generation without sacrificing efficiency. Conclusion: RUDDER is a practical, efficient inference-time intervention that substantially reduces LVLM object hallucination at very low cost and is suited to latency-sensitive applications. Abstract: Large Vision-Language Models (LVLMs) often suffer from object hallucination, generating text inconsistent with visual inputs, which can critically undermine their reliability. Existing inference-time interventions to mitigate this issue present a challenging trade-off: while methods that steer internal states or adjust output logits can be effective, they often incur substantial computational overhead, typically requiring extra forward passes. This efficiency bottleneck can limit their practicality for real-world, latency-sensitive deployments. In this work, we aim to address this trade-off with Residual-Update Directed DEcoding Regulation (RUDDER), a low-overhead framework that steers LVLMs towards visually-grounded generation. RUDDER is built on two key innovations: (1) Contextual Activation Residual Direction (CARD) vector, a per-sample visual evidence vector extracted from the residual update of a self-attention layer during a single, standard forward pass. (2) A Bayesian-inspired adaptive gate that performs token-wise injection, applying a corrective signal whose strength is conditioned on the model's deviation from the visual context. Extensive experiments on key hallucination benchmarks, including POPE and CHAIR, indicate that RUDDER achieves performance comparable to state-of-the-art methods while introducing negligible computational latency, validating RUDDER as a pragmatic and effective approach for improving LVLMs' reliability without a significant compromise on efficiency.

[141] Generalizable Slum Detection from Satellite Imagery with Mixture-of-Experts

Sumin Lee,Sungwon Park,Jeasurk Yang,Jihee Kim,Meeyoung Cha

Main category: cs.CV

TL;DR: Proposes GRAM, a framework that uses a large-scale satellite imagery dataset and two-phase test-time adaptation to achieve robust slum segmentation without labeled data from the target region.

Details Motivation: Informal settlements vary widely in morphology, so existing models generalize poorly to unseen regions, which limits satellite-based global poverty estimation. Method: A million-scale, cross-continent satellite imagery dataset is built, and a Mixture-of-Experts architecture captures region-specific characteristics on top of a shared backbone; a two-phase test-time adaptation mechanism uses prediction consistency across experts to filter out unreliable pseudo-labels, enabling unsupervised domain adaptation. Result: GRAM outperforms state-of-the-art baselines in low-resource cities such as those in Africa, clearly improving cross-region slum segmentation. Conclusion: GRAM offers a scalable, label-efficient solution for global slum mapping and helps advance data-driven urban planning. Abstract: Satellite-based slum segmentation holds significant promise in generating global estimates of urban poverty. However, the morphological heterogeneity of informal settlements presents a major challenge, hindering the ability of models trained on specific regions to generalize effectively to unseen locations. To address this, we introduce a large-scale high-resolution dataset and propose GRAM (Generalized Region-Aware Mixture-of-Experts), a two-phase test-time adaptation framework that enables robust slum segmentation without requiring labeled data from target regions. We compile a million-scale satellite imagery dataset from 12 cities across four continents for source training. Using this dataset, the model employs a Mixture-of-Experts architecture to capture region-specific slum characteristics while learning universal features through a shared backbone. During adaptation, prediction consistency across experts filters out unreliable pseudo-labels, allowing the model to generalize effectively to previously unseen regions. GRAM outperforms state-of-the-art baselines in low-resource settings such as African cities, offering a scalable and label-efficient solution for global slum mapping and data-driven urban planning.
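
The idea of filtering pseudo-labels by prediction consistency across experts can be illustrated with a small sketch. The abstract does not give GRAM's actual rule, so the unanimous-agreement criterion, the 0.5 threshold, and the function name below are all assumptions chosen for simplicity.

```python
def consistent_pseudo_labels(expert_probs, threshold=0.5):
    """Keep a pseudo-label only where every expert agrees.

    expert_probs: one list of per-pixel slum probabilities per expert,
    all of the same length. Returns one entry per pixel: 1 (slum),
    0 (non-slum), or None where experts disagree and the pixel is
    left out of adaptation. Unanimity is an assumed criterion.
    """
    labels = []
    for pixel_probs in zip(*expert_probs):
        votes = [int(p >= threshold) for p in pixel_probs]
        # A single distinct vote means all experts agree on this pixel.
        labels.append(votes[0] if len(set(votes)) == 1 else None)
    return labels
```

A softer variant could keep pixels where a majority agrees or where the variance of expert probabilities falls below a bound.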

[142] Rethinking Visual Information Processing in Multimodal LLMs

Dongwan Kim,Viresh Ranjan,Takashi Nagata,Arnab Dhua,Amit Kumar K C

Main category: cs.CV

TL;DR: Proposes LLaViT, which lets the large language model also act as a vision encoder; three key modifications improve vision-language modeling and significantly outperform the LLaVA baseline.

Details Motivation: The LLaVA architecture struggles to fuse text and vision modalities in vision-language tasks, so a more effective cross-modal integration method is needed. Method: LLaViT learns separate QKV projections for the vision modality, enables bidirectional attention on visual tokens, and fuses global and local visual representations, giving the LLM vision-encoding capability. Result: Experiments on multiple LLMs show that LLaViT significantly outperforms LLaVA on many benchmarks, even surpassing models with twice its parameter count. Conclusion: LLaViT offers a more effective paradigm for vision-language modeling and demonstrates the potential of LLMs as vision encoders. Abstract: Despite the remarkable success of the LLaVA architecture for vision-language tasks, its design inherently struggles to effectively integrate visual features due to the inherent mismatch between text and vision modalities. We tackle this issue from a novel perspective in which the LLM serves not only as a language model but also as a powerful vision encoder. To this end, we present LLaViT - Large Language Models as extended Vision Transformers - which enables the LLM to simultaneously function as a vision encoder through three key modifications: (1) learning separate QKV projections for vision modality, (2) enabling bidirectional attention on visual tokens, and (3) incorporating both global and local visual representations. Through extensive controlled experiments on a wide range of LLMs, we demonstrate that LLaViT significantly outperforms the baseline LLaVA method on a multitude of benchmarks, even surpassing models with double its parameter count, establishing a more effective approach to vision-language modeling.

[143] Revisiting Evaluation of Deep Neural Networks for Pedestrian Detection

Patrick Feifel,Benedikt Franke,Frank Bonarens,Frank Köster,Arne Raulf,Friedhelm Schwenker

Main category: cs.CV

TL;DR: Proposes eight pedestrian-detection error categories and new metrics that use image segmentation to enable finer-grained, more reliable model evaluation, and reaches SOTA on the CityPersons dataset.

Details Motivation: Existing pedestrian-detection metrics do not realistically reflect how a DNN performs on different validation subsets and lack fine-grained analysis of safety-critical scenarios. Method: Using image segmentation information, the paper defines eight pedestrian-detection error categories and designs new evaluation metrics; different backbones are compared using a simplified APD architecture. Result: The approach reaches state-of-the-art performance on CityPersons-reasonable (without extra training data) and provides a finer-grained, more robust way to compare models. Conclusion: The proposed error categories and metrics evaluate pedestrian detectors more accurately, especially for safety-critical performance, and help improve the reliability of automated driving systems. Abstract: Reliable pedestrian detection represents a crucial step towards automated driving systems. However, the current performance benchmarks exhibit weaknesses. The currently applied metrics for various subsets of a validation dataset prohibit a realistic performance evaluation of a DNN for pedestrian detection. As image segmentation supplies fine-grained information about a street scene, it can serve as a starting point to automatically distinguish between different types of errors during the evaluation of a pedestrian detector. In this work, we propose eight error categories for pedestrian detection and new metrics for performance comparison along these categories. We use the new metrics to compare various backbones for a simplified version of the APD, and show a more fine-grained and robust way to compare models with each other, especially in terms of safety-critical performance. We achieve SOTA on CityPersons-reasonable (without extra training data) by using a rather simple architecture.

[144] CLIP4VI-ReID: Learning Modality-shared Representations via CLIP Semantic Bridge for Visible-Infrared Person Re-identification

Xiaomei Yang,Xizhan Gao,Sijie Niu,Fa Zhu,Guang Feng,Xiaofeng Qu,David Camacho

Main category: cs.CV

TL;DR: Proposes CLIP4VI-ReID, a CLIP-based modality-shared representation learning network for visible-infrared person re-identification (VI-ReID) that uses text semantics as a bridge for cross-modal alignment and outperforms existing methods on several datasets.

Details Motivation: Visible and infrared images differ greatly in physical characteristics, so direct cross-modal alignment is difficult; an effective mechanism is needed to narrow the modality gap and make shared representations more discriminative. Method: Three modules are designed: Text Semantic Generation (TSG), Infrared Feature Embedding (IFE), and High-level Semantic Alignment (HSA). TSG generates text semantics for visible images to achieve visible-text alignment; IFE uses the text semantics to rectify infrared features and inject identity-related information; HSA further refines high-level semantic alignment so that the text semantics focus on identity information. Result: Experiments on several mainstream VI-ReID datasets show that CLIP4VI-ReID outperforms current state-of-the-art methods. Conclusion: By using text as a bridge and aligning semantics in stages, CLIP4VI-ReID improves the learning and discriminability of cross-modal shared representations for visible-infrared person re-identification. Abstract: This paper proposes a novel CLIP-driven modality-shared representation learning network named CLIP4VI-ReID for the VI-ReID task, which consists of Text Semantic Generation (TSG), Infrared Feature Embedding (IFE), and High-level Semantic Alignment (HSA). Specifically, considering the huge gap in the physical characteristics between natural images and infrared images, the TSG is designed to generate text semantics only for visible images, thereby enabling preliminary visible-text modality alignment. Then, the IFE is proposed to rectify the feature embeddings of infrared images using the generated text semantics. This process injects id-related semantics into the shared image encoder, enhancing its adaptability to the infrared modality. Besides, with text serving as a bridge, it enables indirect visible-infrared modality alignment. Finally, the HSA is established to refine the high-level semantic alignment. This process ensures that the fine-tuned text semantics only contain id-related information, thereby achieving more accurate cross-modal alignment and enhancing the discriminability of the learned modal-shared representations. Extensive experimental results demonstrate that the proposed CLIP4VI-ReID achieves performance superior to other state-of-the-art methods on some widely used VI-ReID datasets.

[145] Depth-Consistent 3D Gaussian Splatting via Physical Defocus Modeling and Multi-View Geometric Supervision

Yu Deng,Baozhu Zhao,Junyan Su,Xiaohan Zhang,Qi Liu

Main category: cs.CV

TL;DR: Proposes a 3D Gaussian Splatting framework that combines depth-of-field supervision with multi-view consistency supervision, substantially improving depth reconstruction accuracy in both far and near regions of complex scenes.

Details Motivation: Existing methods cannot simultaneously address inaccurate depth estimation in distant regions and structural degradation in close-range regions, and they perform poorly in scenes with sharp depth variation. Method: 1) A monocular depth estimator generates depth priors, defocus convolution synthesizes physically accurate defocused images, and a depth-of-field loss strengthens geometric consistency; 2) LoFTR provides semi-dense feature matching, and multi-view consistency supervision is obtained by minimizing cross-view geometric errors and applying least-squares optimization to reliable matched points. Result: On the Waymo Open Dataset the method improves PSNR by 0.8 dB over the current state-of-the-art method and clearly improves depth fidelity in far-field and near-field regions. Conclusion: By combining physical imaging principles with learning-based depth regularization, the method offers a scalable solution to complex depth stratification in urban environments. Abstract: Three-dimensional reconstruction in scenes with extreme depth variations remains challenging due to inconsistent supervisory signals between near-field and far-field regions. Existing methods fail to simultaneously address inaccurate depth estimation in distant areas and structural degradation in close-range regions. This paper proposes a novel computational framework that integrates depth-of-field supervision and multi-view consistency supervision to advance 3D Gaussian Splatting. Our approach comprises two core components: (1) Depth-of-field Supervision employs a scale-recovered monocular depth estimator (e.g., Metric3D) to generate depth priors, leverages defocus convolution to synthesize physically accurate defocused images, and enforces geometric consistency through a novel depth-of-field loss, thereby enhancing depth fidelity in both far-field and near-field regions; (2) Multi-View Consistency Supervision employs LoFTR-based semi-dense feature matching to minimize cross-view geometric errors and enforces depth consistency via least squares optimization of reliable matched points. By unifying defocus physics with multi-view geometric constraints, our method achieves superior depth fidelity, demonstrating a 0.8 dB PSNR improvement over the state-of-the-art method on the Waymo Open Dataset. This framework bridges physical imaging principles and learning-based depth regularization, offering a scalable solution for complex depth stratification in urban environments.
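
To put the reported 0.8 dB PSNR gain in context, the standard PSNR definition can be inverted to see what it implies for mean squared error. The sketch below uses only that textbook definition; the function names and the peak value of 1.0 are illustrative assumptions, not details from the paper.

```python
import math

def psnr(mse, max_val=1.0):
    """Peak signal-to-noise ratio in dB for a given mean squared error.

    Assumes pixel values are scaled so the peak is `max_val`.
    """
    return 10.0 * math.log10(max_val ** 2 / mse)

def mse_ratio_for_gain(gain_db):
    """MSE multiplier that corresponds to a PSNR gain of `gain_db` dB."""
    return 10.0 ** (-gain_db / 10.0)
```

Under this definition, `mse_ratio_for_gain(0.8)` is roughly 0.83, so a 0.8 dB gain corresponds to about a 17% reduction in mean squared error at a fixed peak value.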

[146] Learning to Tell Apart: Weakly Supervised Video Anomaly Detection via Disentangled Semantic Alignment

Wenti Yin,Huaxin Zhang,Xiang Wang,Yuqing Lu,Yicheng Zhang,Bingquan Gong,Jialong Zuo,Li Yu,Changxin Gao,Nong Sang

Main category: cs.CV

TL;DR: Proposes the Disentangled Semantic Alignment Network (DSANet), which separates normal and abnormal features at both coarse-grained and fine-grained levels to improve video anomaly detection.

Details Motivation: Existing weakly-supervised video anomaly detection methods tend to focus on the most salient segments, neglect mining diverse normal patterns, and confuse categories with similar appearance, which hurts fine-grained classification. Method: At the coarse-grained level, a self-guided normality modeling branch reconstructs video features from learned normal prototypes; at the fine-grained level, a decoupled contrastive semantic alignment mechanism decomposes each video into event-centric and background-centric components and uses visual-language contrastive learning to strengthen class-discriminative representations. Result: Experiments on the two standard datasets XD-Violence and UCF-Crime show that DSANet outperforms existing state-of-the-art methods. Conclusion: DSANet effectively separates normal and abnormal features, improving the accuracy and fine-grained classification ability of weakly-supervised video anomaly detection. Abstract: Recent advancements in weakly-supervised video anomaly detection have achieved remarkable performance by applying the multiple instance learning paradigm based on multimodal foundation models such as CLIP to highlight anomalous instances and classify categories. However, their objectives may tend to detect the most salient response segments, while neglecting to mine diverse normal patterns separated from anomalies, and are prone to category confusion due to similar appearance, leading to unsatisfactory fine-grained classification results. Therefore, we propose a novel Disentangled Semantic Alignment Network (DSANet) to explicitly separate abnormal and normal features from coarse-grained and fine-grained aspects, enhancing the distinguishability. Specifically, at the coarse-grained level, we introduce a self-guided normality modeling branch that reconstructs input video features under the guidance of learned normal prototypes, encouraging the model to exploit normality cues inherent in the video, thereby improving the temporal separation of normal patterns and anomalous events. At the fine-grained level, we present a decoupled contrastive semantic alignment mechanism, which first temporally decomposes each video into event-centric and background-centric components using frame-level anomaly scores and then applies visual-language contrastive learning to enhance class-discriminative representations. Comprehensive experiments on two standard benchmarks, namely XD-Violence and UCF-Crime, demonstrate that DSANet outperforms existing state-of-the-art methods.

[147] FOUND: Fourier-based von Mises Distribution for Robust Single Domain Generalization in Object Detection

Mengzhu Wang,Changyuan Deng,Shanshan Wang,Nan Yin,Long Lan,Liang Yang

Main category: cs.CV

TL;DR: Proposes a CLIP-guided framework that combines the vMF distribution with the Fourier transform to improve single-domain generalization in object detection, using directional feature modeling and frequency-domain augmentation to increase cross-domain robustness.

Details Motivation: Existing single-domain generalization methods overlook how feature distribution structure and frequency-domain characteristics affect robustness, leaving room to improve generalization to unseen domains. Method: The vMF distribution models the directional features of object representations, the Fourier transform is used to perturb amplitude and phase to simulate domain shift, and CLIP provides guidance for semantic consistency. Result: The method clearly outperforms the existing state-of-the-art approach on an adverse-weather driving benchmark, supporting its advantages in feature robustness and cross-domain consistency. Conclusion: The framework combines directional feature modeling with frequency-domain augmentation, improving the robustness and generalization of single-domain generalized detectors. Abstract: Single Domain Generalization (SDG) for object detection aims to train a model on a single source domain that can generalize effectively to unseen target domains. While recent methods like CLIP-based semantic augmentation have shown promise, they often overlook the underlying structure of feature distributions and frequency-domain characteristics that are critical for robustness. In this paper, we propose a novel framework that enhances SDG object detection by integrating the von Mises-Fisher (vMF) distribution and Fourier transformation into a CLIP-guided pipeline. Specifically, we model the directional features of object representations using vMF to better capture domain-invariant semantic structures in the embedding space. Additionally, we introduce a Fourier-based augmentation strategy that perturbs amplitude and phase components to simulate domain shifts in the frequency domain, further improving feature robustness. Our method not only preserves the semantic alignment benefits of CLIP but also enriches feature diversity and structural consistency across domains. Extensive experiments on the diverse weather-driving benchmark demonstrate that our approach outperforms the existing state-of-the-art method.

[148] DermAI: Clinical dermatology acquisition through quality-driven image collection for AI classification in mobile

Thales Bezerra,Emanoel Thyago,Kelvin Cunha,Rodrigo Abreu,Fábio Papais,Francisco Mauro,Natália Lopes,Érico Medeiros,Jéssica Guido,Shirley Cruz,Paulo Borba,Tsang Ing Ren

Main category: cs.CV

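TL;DR: DermAI is a lightweight smartphone app that supports real-time capture, annotation, and classification of skin lesions during routine consultations, underscoring the importance of standardized, diverse data collection.

Details Motivation: AI-based dermatology diagnosis is held back by dataset bias, uneven image quality, and insufficient validation. Method: The authors developed a smartphone application, DermAI, with on-device quality checks and local model adaptation, and trained and validated it on a clinical dataset covering a range of skin tones, ethnicities, and source devices. Result: In preliminary experiments, models trained on public datasets failed to generalize to the new samples, while fine-tuning on local data improved performance markedly. Conclusion: Standardized and diverse data collection is needed to better support machine learning development and meet healthcare needs. Abstract: AI-based dermatology adoption remains limited by biased datasets, variable image quality, and limited validation. We introduce DermAI, a lightweight, smartphone-based application that enables real-time capture, annotation, and classification of skin lesions during routine consultations. Unlike prior dermoscopy-focused tools, DermAI performs on-device quality checks and local model adaptation. The DermAI clinical dataset encompasses a wide range of skin tones, ethnicities, and source devices. In preliminary experiments, models trained on public datasets failed to generalize to our samples, while fine-tuning with local data improved performance. These results highlight the importance of standardized, diverse data collection aligned with healthcare needs and oriented to machine learning development.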

[149] SHRUG-FM: Reliability-Aware Foundation Models for Earth Observation

Kai-Hendrik Cohrs,Zuzanna Osika,Maria Gonzalez-Calabuig,Vishal Nedungadi,Ruben Cartuyvels,Steffen Knoblauch,Joppe Massant,Shruti Nath,Patrick Ebel,Vasileios Sitokonstantinou

Main category: cs.CV

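TL;DR: SHRUG-FM is a reliability-aware framework for geospatial foundation models in Earth observation that combines out-of-distribution detection in the input and embedding spaces with task-specific predictive uncertainty to improve reliability in underrepresented environments.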

Details Motivation: Existing geospatial foundation models perform unreliably in environments underrepresented during pretraining, so their real-world robustness and interpretability need to improve. Method: Proposes the SHRUG-FM framework, which integrates three signals (input-space OOD detection, embedding-space OOD detection, and task-specific predictive uncertainty) and applies it to burn scar segmentation. Result: SHRUG-FM shows that OOD scores correlate with degraded performance under specific environmental conditions and that uncertainty flags help filter out poorly performing predictions; linking the flags to HydroATLAS land cover attributes shows that failures concentrate in low-elevation zones and large river areas. Conclusion: SHRUG-FM offers a viable path toward safe, interpretable deployment of GFMs in climate-sensitive applications, narrowing the gap between benchmark performance and real-world reliability. Abstract: Geospatial foundation models for Earth observation often fail to perform reliably in environments underrepresented during pretraining. We introduce SHRUG-FM, a framework for reliability-aware prediction that integrates three complementary signals: out-of-distribution (OOD) detection in the input space, OOD detection in the embedding space and task-specific predictive uncertainty. Applied to burn scar segmentation, SHRUG-FM shows that OOD scores correlate with lower performance in specific environmental conditions, while uncertainty-based flags help discard many poorly performing predictions. Linking these flags to land cover attributes from HydroATLAS shows that failures are not random but concentrated in certain geographies, such as low-elevation zones and large river areas, likely due to underrepresentation in pretraining data. SHRUG-FM provides a pathway toward safer and more interpretable deployment of GFMs in climate-sensitive applications, helping bridge the gap between benchmark performance and real-world reliability.

[150] MSGNav: Unleashing the Power of Multi-modal 3D Scene Graph for Zero-Shot Embodied Navigation

Xun Huang,Shijia Zhao,Yunxiang Wang,Xin Lu,Wanfa Zhang,Rongsheng Qu,Weixin Li,Yunhong Wang,Chenglu Wen

Main category: cs.CV

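TL;DR: Proposes MSGNav, a zero-shot navigation system built on a Multi-modal 3D Scene Graph (M3DSG) that improves on existing methods by preserving visual cues through dynamically assigned image edges. It adds modules for efficient reasoning, open-vocabulary support, and closed-loop reasoning, addresses the "last-mile" problem in zero-shot navigation, and achieves SOTA performance on the GOAT-Bench and HM3D-OVON datasets.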

Details Motivation: Existing zero-shot embodied navigation methods compress rich visual information into text-only relations when building 3D scene graphs, which loses visual information, raises construction cost, and constrains vocabulary, making it hard to meet real-world deployment needs for open vocabulary and low training overhead. Method: Proposes the Multi-modal 3D Scene Graph (M3DSG), which replaces textual relational edges with dynamically assigned images to preserve visual cues; builds the MSGNav system on top of it with Key Subgraph Selection, Adaptive Vocabulary Update, and Closed-Loop Reasoning modules, and adds a Visibility-based Viewpoint Decision module to solve the "last-mile" problem. Result: MSGNav achieves state-of-the-art zero-shot navigation performance on the GOAT-Bench and HM3D-OVON benchmarks, improving localization accuracy and path efficiency. Conclusion: By preserving visual information and coordinating multiple modules, MSGNav substantially improves zero-shot embodied navigation, especially in open-vocabulary settings and final target localization, and shows potential for practical use. Abstract: Embodied navigation is a fundamental capability for robotic agents. Real-world deployment requires open-vocabulary generalization and low training overhead, motivating zero-shot methods rather than task-specific RL training. However, existing zero-shot methods that build explicit 3D scene graphs often compress rich visual observations into text-only relations, leading to high construction cost, irreversible loss of visual evidence, and constrained vocabularies. To address these limitations, we introduce the Multi-modal 3D Scene Graph (M3DSG), which preserves visual cues by replacing textual relational edges with dynamically assigned images. Built on M3DSG, we propose MSGNav, a zero-shot navigation system that includes a Key Subgraph Selection module for efficient reasoning, an Adaptive Vocabulary Update module for open vocabulary support, and a Closed-Loop Reasoning module for accurate exploration reasoning. Additionally, we identify the last-mile problem in zero-shot navigation, namely determining the feasible target location with a suitable final viewpoint, and propose a Visibility-based Viewpoint Decision module to explicitly resolve it. Comprehensive experimental results demonstrate that MSGNav achieves state-of-the-art performance on GOAT-Bench and HM3D-OVON datasets. The open-source code will be publicly available.

[151] Fragile by Design: On the Limits of Adversarial Defenses in Personalized Generation

Zhen Chen,Yi Zhang,Xiangyu Yin,Chengxuan Qin,Xingyu Zhao,Xiaowei Huang,Wenjie Ruan

Main category: cs.CV

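TL;DR: Existing defenses against facial identity leakage in personalized generative models have clear flaws: their adversarial perturbations are easy to detect and fragile, and simple filtering defeats them. The study proposes a new evaluation framework, AntiDB_Purify, that exposes these shortcomings and highlights the need for more imperceptible and robust protection.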

Details Motivation: Personalized AI applications such as DreamBooth carry privacy leakage risks, and existing defenses do not hold up against realistic purification attacks, so their security needs systematic evaluation. Method: Proposes a new evaluation framework, AntiDB_Purify, that systematically tests existing defenses under realistic threats including traditional image filtering and adversarial purification. Result: Experiments show that all current defenses lose their protective effect after purification: the perturbations are easily removed and user identity can still be recovered. Conclusion: Existing defenses provide a false sense of security; more imperceptible and more robust protections are urgently needed to truly safeguard user identity privacy. Abstract: Personalized AI applications such as DreamBooth enable the generation of customized content from user images, but also raise significant privacy concerns, particularly the risk of facial identity leakage. Recent defense mechanisms like Anti-DreamBooth attempt to mitigate this risk by injecting adversarial perturbations into user photos to prevent successful personalization. However, we identify two critical yet overlooked limitations of these methods. First, the adversarial examples often exhibit perceptible artifacts such as conspicuous patterns or stripes, making them easily detectable as manipulated content. Second, the perturbations are highly fragile, as even a simple, non-learned filter can effectively remove them, thereby restoring the model's ability to memorize and reproduce user identity. To investigate this vulnerability, we propose a novel evaluation framework, AntiDB_Purify, to systematically evaluate existing defenses under realistic purification threats, including both traditional image filters and adversarial purification. Results reveal that none of the current methods maintains its protective effectiveness under such threats. These findings highlight that current defenses offer a false sense of security and underscore the urgent need for more imperceptible and robust protections to safeguard user identity in personalized generation.

[152] SAMIRO: Spatial Attention Mutual Information Regularization with a Pre-trained Model as Oracle for Lane Detection

Hyunjong Lee,Jangho Lee,Jaekoo Lee

Main category: cs.CV

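TL;DR: This paper proposes SAMIRO, a lane detection method that improves performance by using a pre-trained model and spatial attention mutual information regularization, and validates it on several major benchmarks.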

Details Motivation: Background clutter, varying illumination, and occlusions in real environments challenge data-driven lane detection, which also incurs high data collection and annotation costs; contextual and global information is needed to improve detection. Method: Proposes SAMIRO, which uses a pre-trained model as an Oracle and introduces spatial attention mutual information regularization to preserve domain-agnostic spatial information, and can be integrated plug-and-play into various state-of-the-art lane detection models. Result: Experiments on major benchmarks including CULane, Tusimple, and LLAMAS show that SAMIRO consistently improves performance across models and datasets. Conclusion: SAMIRO effectively improves lane detection performance, is general and practical, and fits a variety of existing detection frameworks. Abstract: Lane detection is an important topic in future mobility solutions. Real-world environmental challenges such as background clutter, varying illumination, and occlusions pose significant obstacles to effective lane detection, particularly when relying on data-driven approaches that require substantial effort and cost for data collection and annotation. To address these issues, lane detection methods must leverage contextual and global information from surrounding lanes and objects. In this paper, we propose a Spatial Attention Mutual Information Regularization with a pre-trained model as an Oracle, called SAMIRO. SAMIRO enhances lane detection performance by transferring knowledge from a pretrained model while preserving domain-agnostic spatial information. Leveraging SAMIRO's plug-and-play characteristic, we integrate it into various state-of-the-art lane detection approaches and conduct extensive experiments on major benchmarks such as CULane, Tusimple, and LLAMAS. The results demonstrate that SAMIRO consistently improves performance across different models and datasets. The code will be made available upon publication.

[153] Physics informed Transformer-VAE for biophysical parameter estimation: PROSAIL model inversion in Sentinel-2 imagery

Prince Mensah,Pelumi Victor Aderinto,Ibrahim Salihu Yusuf,Arnu Pretorius

Main category: cs.CV

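TL;DR: Proposes a physics-informed Transformer-VAE architecture that inverts the PROSAIL model from Sentinel-2 data to jointly estimate vegetation biophysical parameters; trained only on simulated data, it matches state-of-the-art methods trained on real imagery.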

Details Motivation: Accurately retrieving vegetation biophysical parameters from satellite imagery is crucial for ecosystem monitoring and agricultural management, but existing hybrid methods typically rely on real remote sensing images for self-supervised training, which limits their wider use. Method: Proposes a Transformer-VAE architecture that uses the PROSAIL radiative transfer model as a differentiable physical decoder and is trained only on simulated data, jointly retrieving leaf area index (LAI) and canopy chlorophyll content (CCC). Result: On the real-world FRM4Veg and BelSAR field datasets, retrieval accuracy is comparable to state-of-the-art methods trained on real Sentinel-2 imagery, with no in-situ labels or calibration on real images. Conclusion: Combining physical models with deep networks can improve RTM inversion and offers a low-cost, self-supervised approach to global-scale, physically constrained remote sensing of vegetation traits. Abstract: Accurate retrieval of vegetation biophysical variables from satellite imagery is crucial for ecosystem monitoring and agricultural management. In this work, we propose a physics-informed Transformer-VAE architecture to invert the PROSAIL radiative transfer model for simultaneous estimation of key canopy parameters from Sentinel-2 data. Unlike previous hybrid approaches that require real satellite images for self-supervised training, our model is trained exclusively on simulated data, yet achieves performance on par with state-of-the-art methods that utilize real imagery. The Transformer-VAE incorporates the PROSAIL model as a differentiable physical decoder, ensuring that inferred latent variables correspond to physically plausible leaf and canopy properties. We demonstrate retrieval of leaf area index (LAI) and canopy chlorophyll content (CCC) on real-world field datasets (FRM4Veg and BelSAR) with accuracy comparable to models trained with real Sentinel-2 data. Our method requires no in-situ labels or calibration on real images, offering a cost-effective and self-supervised solution for global vegetation monitoring. The proposed approach illustrates how integrating physical models with advanced deep networks can improve the inversion of RTMs, opening new prospects for large-scale, physically-constrained remote sensing of vegetation traits.

[154] MonkeyOCR v1.5 Technical Report: Unlocking Robust Document Parsing for Complex Patterns

Jiarui Zhang,Yuliang Liu,Zijun Wu,Guosheng Pang,Zhili Ye,Yupei Zhong,Junteng Ma,Tao Wei,Haiyang Xu,Weikai Chen,Zeen Wang,Qiangjun Ji,Fanxi Zhou,Qi Zhang,Yuanrui Hu,Jiahao Liu,Zhang Li,Ziyang Zhang,Qiang Liu,Xiang Bai

Main category: cs.CV

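TL;DR: MonkeyOCR v1.5 is a unified vision-language framework that improves layout understanding and content recognition for complex documents through a two-stage parsing pipeline.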

Details Motivation: Existing OCR systems still struggle with real-world documents that have complex layouts, multi-level tables, embedded images or formulas, and cross-page structures. Method: Uses a two-stage parsing pipeline: the first stage uses a large multimodal model to jointly predict document layout and reading order; the second stage performs localized recognition of text, formulas, and tables within detected regions. It also introduces a visual consistency-based reinforcement learning scheme and two specialized modules for complex tables. Result: Experiments on OmniDocBench v1.5 show that MonkeyOCR v1.5 outperforms PPOCR-VL and MinerU 2.5, reaching state-of-the-art performance, with exceptional robustness in visually complex document scenarios. Conclusion: MonkeyOCR v1.5 improves parsing accuracy and robustness on complex documents, particularly for multi-level tables and cross-page structures. Abstract: Document parsing is a core task in document intelligence, supporting applications such as information extraction, retrieval-augmented generation, and automated document analysis. However, real-world documents often feature complex layouts with multi-level tables, embedded images or formulas, and cross-page structures, which remain challenging for existing OCR systems. We introduce MonkeyOCR v1.5, a unified vision-language framework that enhances both layout understanding and content recognition through a two-stage parsing pipeline. The first stage employs a large multimodal model to jointly predict document layout and reading order, leveraging visual information to ensure structural and sequential consistency. The second stage performs localized recognition of text, formulas, and tables within detected regions, maintaining high visual fidelity while reducing error propagation. To address complex table structures, we propose a visual consistency-based reinforcement learning scheme that evaluates recognition quality via render-and-compare alignment, improving structural accuracy without manual annotations. Additionally, two specialized modules, Image-Decoupled Table Parsing and Type-Guided Table Merging, are introduced to enable reliable parsing of tables containing embedded images and reconstruction of tables crossing pages or columns. Comprehensive experiments on OmniDocBench v1.5 demonstrate that MonkeyOCR v1.5 achieves state-of-the-art performance, outperforming PPOCR-VL and MinerU 2.5 while showing exceptional robustness in visually complex document scenarios.

[155] GrounDiff: Diffusion-Based Ground Surface Generation from Digital Surface Models

Oussema Dhaouadi,Johannes Meier,Jacques Kaiser,Daniel Cremers

Main category: cs.CV

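TL;DR: This paper proposes GrounDiff, the first diffusion-based framework for generating digital terrain models (DTMs). It casts the removal of non-ground structures as a denoising task, uses a gated design with confidence-guided generation for selective filtering, and introduces Prior-Guided Stitching (PrioStitch) to improve scalability, significantly outperforming existing methods on multiple datasets.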

Details Motivation: Traditional DTM generation methods rely on manually tuned parameters or on carefully designed network architectures that need post-processing, and lack an efficient, automated solution. Method: Proposes GrounDiff, a diffusion-based framework with a gating mechanism and confidence-guided generation, combined with the PrioStitch strategy, which uses a downsampled global prior to guide local high-resolution predictions. Result: Reduces RMSE by up to 93% on ALS2DTM and up to 47% on USGS benchmarks; on the GeRoD road reconstruction task, distance error is up to 81% lower with smoother surfaces; GrounDiff+ improves performance further. Conclusion: GrounDiff is the first diffusion-based DSM-to-DTM framework; it produces accurate, smooth DTMs without post-processing and significantly outperforms state-of-the-art methods across tasks and datasets. Abstract: Digital Terrain Models (DTMs) represent the bare-earth elevation and are important in numerous geospatial applications. Such data models cannot be directly measured by sensors and are typically generated from Digital Surface Models (DSMs) derived from LiDAR or photogrammetry. Traditional filtering approaches rely on manually tuned parameters, while learning-based methods require well-designed architectures, often combined with post-processing. To address these challenges, we introduce Ground Diffusion (GrounDiff), the first diffusion-based framework that iteratively removes non-ground structures by formulating the problem as a denoising task. We incorporate a gated design with confidence-guided generation that enables selective filtering. To increase scalability, we further propose Prior-Guided Stitching (PrioStitch), which employs a downsampled global prior automatically generated using GrounDiff to guide local high-resolution predictions. We evaluate our method on the DSM-to-DTM translation task across diverse datasets, showing that GrounDiff consistently outperforms deep learning-based state-of-the-art methods, reducing RMSE by up to 93% on ALS2DTM and up to 47% on USGS benchmarks. In the task of road reconstruction, which requires both high precision and smoothness, our method achieves up to 81% lower distance error compared to specialized techniques on the GeRoD benchmark, while maintaining competitive surface smoothness using only DSM inputs, without task-specific optimization. Our variant for road reconstruction, GrounDiff+, is specifically designed to produce even smoother surfaces, further surpassing state-of-the-art methods.
The project page is available at https://deepscenario.github.io/GrounDiff/.

[156] LLM-YOLOMS: Large Language Model-based Semantic Interpretation and Fault Diagnosis for Wind Turbine Components

Yaru Li,Yanxue Wang,Meng Li,Xinming Li,Jianbo Feng

Main category: cs.CV

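TL;DR: Proposes an intelligent fault analysis framework for wind turbines that combines YOLOMS with a large language model (LLM): multi-scale detection and sliding-window cropping improve feature extraction, a lightweight KV mapping module converts visual outputs into semantically rich text, and a domain-tuned LLM generates interpretable fault analyses and maintenance recommendations.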

Details Motivation: Existing fault detection methods are mostly limited to visual recognition; their outputs lack semantic interpretability and do not readily support operation and maintenance decisions. Method: Uses YOLOMS with multi-scale detection and sliding-window cropping to enhance fault feature extraction; designs a lightweight KV mapping module that converts detection results into structured text; and applies a domain-tuned large language model for semantic reasoning to generate fault analyses and maintenance recommendations. Result: Experiments on real-world datasets show a fault detection accuracy of 90.6% and an average accuracy of 89% for generated maintenance reports. Conclusion: The framework improves the interpretability of wind turbine fault diagnosis and provides practical support for operation and maintenance decisions. Abstract: The health condition of wind turbine (WT) components is crucial for ensuring stable and reliable operation. However, existing fault detection methods are largely limited to visual recognition, producing structured outputs that lack semantic interpretability and fail to support maintenance decision-making. To address these limitations, this study proposes an integrated framework that combines YOLOMS with a large language model (LLM) for intelligent fault analysis and diagnosis. Specifically, YOLOMS employs multi-scale detection and sliding-window cropping to enhance fault feature extraction, while a lightweight key-value (KV) mapping module bridges the gap between visual outputs and textual inputs. This module converts YOLOMS detection results into structured textual representations enriched with both qualitative and quantitative attributes. A domain-tuned LLM then performs semantic reasoning to generate interpretable fault analyses and maintenance recommendations. Experiments on real-world datasets demonstrate that the proposed framework achieves a fault detection accuracy of 90.6% and generates maintenance reports with an average accuracy of 89%, thereby improving the interpretability of diagnostic results and providing practical decision support for the operation and maintenance of wind turbines.

[157] 3DFETUS: Standardizing Fetal Facial Planes in 3D Ultrasound

Alomar Antonia,Rubio Ricardo,Albaiges Gerard,Salort-Benejam Laura,Caminal Julia,Prat Maria,Rueda Carolina,Cortes Berta,Piella Gemma,Sukno Federico

Main category: cs.CV

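TL;DR: Proposes GT++, an algorithm, and 3DFETUS, a deep learning model, for automatically standardizing facial plane localization in 3D fetal ultrasound, significantly improving accuracy.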

Details Motivation: Acquiring standard facial planes in routine fetal ultrasound is challenging because of fetal movement, variability in orientation, and operator dependence, which lead to inconsistent examinations, longer examination times, and potential diagnostic bias. Method: Uses annotated anatomical landmarks with the GT++ algorithm to estimate standard facial planes, and the 3DFETUS deep learning model to automate their localization. Result: In quantitative evaluation, the method achieves a mean translation error of 4.13 mm and a mean rotation error of 7.93 degrees, outperforming other state-of-the-art methods; clinical assessment also confirms statistically significant improvements in plane estimation accuracy. Conclusion: GT++ and 3DFETUS improve the automation and standardization of facial plane localization in 3D fetal ultrasound, helping reduce operator dependence and improve diagnostic consistency. Abstract: Acquiring standard facial planes during routine fetal ultrasound (US) examinations is often challenging due to fetal movement, variability in orientation, and operator-dependent expertise. These factors contribute to inconsistencies, increased examination time, and potential diagnostic bias. To address these challenges in the context of facial assessment, we present: 1) GT++, a robust algorithm that estimates standard facial planes from 3D US volumes using annotated anatomical landmarks; and 2) 3DFETUS, a deep learning model that automates and standardizes their localization in 3D fetal US volumes. We evaluated our methods both qualitatively, through expert clinical review, and quantitatively. The proposed approach achieved a mean translation error of 4.13 mm and a mean rotation error of 7.93 degrees per plane, outperforming other state-of-the-art methods on 3D US volumes. Clinical assessments further confirmed the effectiveness of both GT++ and 3DFETUS, demonstrating statistically significant improvements in plane estimation accuracy.

[158] RodEpil: A Video Dataset of Laboratory Rodents for Seizure Detection and Benchmark Evaluation

Daniele Perlo,Vladimir Despotovic,Selma Boudissa,Sang-Yoon Kim,Petr Nazarov,Yanrong Zhang,Max Wintermark,Olivier Keunen

Main category: cs.CV

TL;DR: Presents and releases RodEpil, a dataset of short video clips of laboratory rodents for automatic seizure detection; a TimeSformer baseline reaches an F1-score of 97%.

Details Motivation: To support research on video-based, non-invasive seizure monitoring by providing a high-quality, well-annotated dataset of rodent behavior, addressing the scarcity of existing data and inconsistent annotation. Method: Builds a video dataset with positive and negative samples, uses five-fold cross-validation with strict subject-wise partitioning, and runs video classification experiments with the TimeSformer architecture. Result: TimeSformer reaches an average F1-score of 97% under five-fold cross-validation, discriminating well between seizure and normal activity. Conclusion: The RodEpil dataset supports reproducible preclinical epilepsy research, and the baseline demonstrates the effectiveness of Transformers for video-level behavior recognition. Abstract: We introduce a curated video dataset of laboratory rodents for automatic detection of convulsive events. The dataset contains short (10 s) top-down and side-view video clips of individual rodents, labeled at clip level as normal activity or seizure. It includes 10,101 negative samples and 2,952 positive samples collected from 19 subjects. We describe the data curation, annotation protocol and preprocessing pipeline, and report baseline experiments using a transformer-based video classifier (TimeSformer). Experiments employ five-fold cross-validation with strict subject-wise partitioning to prevent data leakage (no subject appears in more than one fold). Results show that the TimeSformer architecture enables discrimination between seizure and normal activity with an average F1-score of 97%. The dataset and baseline code are publicly released to support reproducible research on non-invasive, video-based monitoring in preclinical epilepsy research. RodEpil Dataset access - DOI: 10.5281/zenodo.17601357
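Subject-wise partitioning, as described above, means every clip from a given animal lands in exactly one fold. A minimal sketch of one way to build such folds follows; the greedy size balancing and the name `subject_wise_folds` are assumptions for illustration, not the RodEpil authors' released code.

```python
from collections import defaultdict

def subject_wise_folds(subject_ids, n_folds=5):
    """Return n_folds lists of clip indices; each subject appears in one fold only.

    Hypothetical helper: balancing strategy is an assumption, not the paper's.
    """
    clips_by_subject = defaultdict(list)
    for idx, subject in enumerate(subject_ids):
        clips_by_subject[subject].append(idx)
    folds = [[] for _ in range(n_folds)]
    # Greedy balancing: place the largest subjects first into the smallest fold.
    for subject in sorted(clips_by_subject, key=lambda s: -len(clips_by_subject[s])):
        smallest = min(folds, key=len)
        smallest.extend(clips_by_subject[subject])
    return folds
```

Each fold then serves once as the held-out set while the remaining folds are used for training, so no animal contributes clips to both sides of a split.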

[159] Histology-informed tiling of whole tissue sections improves the interpretability and predictability of cancer relapse and genetic alterations

Willem Bonnaffé,Yang Hu,Andrea Chatrian,Mengran Fan,Stefano Malacrino,Sandy Figiel,CRUK ICGC Prostate Group,Srinivasa R. Rao,Richard Colling,Richard J. Bryant,Freddie C. Hamdy,Dan J. Woodcock,Ian G. Mills,Clare Verrill,Jens Rittscher

Main category: cs.CV

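TL;DR: This paper proposes histology-informed tiling (HIT), which uses semantic segmentation to extract glands from whole slide images as biologically meaningful input patches for multiple-instance learning and phenotyping, improving both the accuracy of detecting cancer-related gene copy number variations and model interpretability.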

Details Motivation: Conventional digital pathology pipelines often use grid-based tiling that ignores tissue architecture, which introduces irrelevant information and limits interpretability; a tiling strategy based on biologically meaningful structures is needed. Method: Proposes HIT, which uses semantic segmentation to extract glands from whole slide images as input patches for multiple-instance learning models, and validates it on the ProMPT, ICGC-C, and TCGA-PRAD cohorts. Result: HIT achieves a gland-level Dice score of 0.83 +/- 0.17, improves MIL model AUCs by 10% for detecting copy number variations in EMT- and MYC-related genes, and reveals 15 gland clusters, several of which are associated with cancer relapse, oncogenic mutations, and high Gleason scores. Conclusion: By focusing on biologically meaningful structures, HIT improves model accuracy and interpretability while streamlining computation, offering a better way to extract features for digital pathology analysis. Abstract: Histopathologists establish cancer grade by assessing histological structures, such as glands in prostate cancer. Yet, digital pathology pipelines often rely on grid-based tiling that ignores tissue architecture. This introduces irrelevant information and limits interpretability. We introduce histology-informed tiling (HIT), which uses semantic segmentation to extract glands from whole slide images (WSIs) as biologically meaningful input patches for multiple-instance learning (MIL) and phenotyping. Trained on 137 samples from the ProMPT cohort, HIT achieved a gland-level Dice score of 0.83 +/- 0.17. By extracting 380,000 glands from 760 WSIs across ICGC-C and TCGA-PRAD cohorts, HIT improved MIL model AUCs by 10% for detecting copy number variations (CNVs) in genes related to epithelial-mesenchymal transitions (EMT) and MYC, and revealed 15 gland clusters, several of which were associated with cancer relapse, oncogenic mutations, and high Gleason scores. Therefore, HIT improved the accuracy and interpretability of MIL predictions, while streamlining computations by focussing on biologically meaningful structures during feature extraction.
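For readers unfamiliar with the Dice score reported above, a minimal sketch of the standard definition, Dice = 2|A ∩ B| / (|A| + |B|), on two binary masks follows. The function name and the nested-list mask format are assumptions for illustration; how the paper aggregates scores per gland is not specified here.

```python
def dice_score(pred, target):
    """Dice overlap between two binary masks given as nested lists of 0/1."""
    flat_pred = [bool(v) for row in pred for v in row]
    flat_target = [bool(v) for row in target for v in row]
    intersection = sum(p and t for p, t in zip(flat_pred, flat_target))
    total = sum(flat_pred) + sum(flat_target)
    # Convention (an assumption): two empty masks count as a perfect match.
    return 1.0 if total == 0 else 2.0 * intersection / total
```

A score of 1 means the predicted and reference masks coincide exactly, and 0 means they do not overlap at all.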

[160] OpenSR-SRGAN: A Flexible Super-Resolution Framework for Multispectral Earth Observation Data

Simon Donike,Cesar Aybar,Julio Contreras,Luis Gómez-Chova

Main category: cs.CV

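TL;DR: OpenSR-SRGAN is an open-source, modular framework for single-image super-resolution in Earth observation. It provides a unified, configuration-driven implementation of SRGAN-style models, supports multispectral satellite data such as Sentinel-2, and lowers the barrier for researchers and practitioners to run GAN-based super-resolution experiments, compare models, and deploy them.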

Details Motivation: To simplify the use of super-resolution models in Earth observation and address the need to modify code and the difficulty of reproducing and extending existing approaches, by providing a unified, easy-to-use, configurable open-source framework. Method: Designs a modular framework in which configuration files define generators, discriminators, loss functions, and training schedules; it supports multiple architectures, scale factors, and band setups, and integrates logging, validation, and large-scene inference. Result: Delivers the OpenSR-SRGAN framework with ready-to-use configurations and sensible defaults, supporting super-resolution of multispectral remote sensing images and easing model comparison and deployment. Conclusion: As a practical tool and benchmark implementation, OpenSR-SRGAN turns GAN-based super-resolution into a configuration-driven workflow, improving reproducibility and accessibility across diverse Earth observation data. Abstract: We present OpenSR-SRGAN, an open and modular framework for single-image super-resolution in Earth Observation. The software provides a unified implementation of SRGAN-style models that is easy to configure, extend, and apply to multispectral satellite data such as Sentinel-2. Instead of requiring users to modify model code, OpenSR-SRGAN exposes generators, discriminators, loss functions, and training schedules through concise configuration files, making it straightforward to switch between architectures, scale factors, and band setups. The framework is designed as a practical tool and benchmark implementation rather than a state-of-the-art model. It ships with ready-to-use configurations for common remote sensing scenarios, sensible default settings for adversarial training, and built-in hooks for logging, validation, and large-scene inference. By turning GAN-based super-resolution into a configuration-driven workflow, OpenSR-SRGAN lowers the entry barrier for researchers and practitioners who wish to experiment with SRGANs, compare models in a reproducible way, and deploy super-resolution pipelines across diverse Earth-observation datasets.

[161] Utility of Pancreas Surface Lobularity as a CT Biomarker for Opportunistic Screening of Type 2 Diabetes

Tejas Sudharshan Mathai,Anisa V. Prasad,Xinya Wang,Praveen T. S. Balamuralikrishna,Yan Zhuang,Abhinav Suri,Jianfei Liu,Perry J. Pickhardt,Ronald M. Summers

Main category: cs.CV

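TL;DR: This study proposes a fully automated deep learning approach for opportunistic screening of type 2 diabetes mellitus (T2DM) using CT biomarkers such as pancreatic surface lobularity (PSL). PSL was significantly higher in patients with T2DM, and the model predicted T2DM with an AUC of 0.90.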

Details Motivation: Early detection of T2DM is crucial, but the role of pancreatic surface lobularity (PSL) in T2DM has not been fully investigated, so an automated image analysis method is needed to explore its potential as a biomarker. Method: Uses four deep learning models to segment the pancreas and other structures on abdominal CT from 584 patients, automatically extracts imaging biomarkers including PSL, and builds a multivariate model to predict T2DM. Result: PSL was significantly higher in diabetic than in non-diabetic patients (p=0.01); the PancAP model gave the best segmentation (Dice=0.79, ASSD=1.94 mm); the CT-biomarker prediction model reached 0.90 AUC, 66.7% sensitivity, and 91.9% specificity. Conclusion: PSL is a potentially useful imaging biomarker for T2DM, and automated deep learning-based CT analysis could support early screening and prediction of T2DM. Abstract: Type 2 Diabetes Mellitus (T2DM) is a chronic metabolic disease that affects millions of people worldwide. Early detection is crucial because the disease can alter pancreas function through morphological changes and increased deposition of ectopic fat, eventually leading to organ damage. While studies have shown an association between T2DM and pancreas volume and fat content, the role of increased pancreatic surface lobularity (PSL) in patients with T2DM has not been fully investigated. In this pilot work, we propose a fully automated approach to delineate the pancreas and other abdominal structures, derive CT imaging biomarkers, and opportunistically screen for T2DM. Four deep learning-based models were used to segment the pancreas in an internal dataset of 584 patients (297 males, 437 non-diabetic, age: 45 ± 15 years). PSL was automatically detected and was higher for diabetic patients (p=0.01) at 4.26 ± 8.32 compared to 3.19 ± 3.62 for non-diabetic patients. The PancAP model achieved the highest Dice score of 0.79 ± 0.17 and lowest ASSD error of 1.94 ± 2.63 mm (p<0.05). For predicting T2DM, a multivariate model trained with CT biomarkers attained 0.90 AUC, 66.7% sensitivity, and 91.9% specificity. Our results suggest that PSL is useful for T2DM screening and could potentially help predict the early onset of T2DM.
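The sensitivity and specificity figures above follow the usual confusion-matrix definitions. As a minimal sketch, the counts below are hypothetical, chosen only so that the rounded rates match the reported 66.7% and 91.9%; the paper's actual confusion matrix is not given in this summary.

```python
def sensitivity_specificity(tp, fn, tn, fp):
    """Sensitivity = TP / (TP + FN); specificity = TN / (TN + FP)."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return sensitivity, specificity

# Hypothetical counts (an assumption, not data from the study).
sens, spec = sensitivity_specificity(tp=20, fn=10, tn=34, fp=3)
print(f"sensitivity={sens:.1%}, specificity={spec:.1%}")
```

In a screening setting this combination reads as: about one in three diabetic patients is missed, while relatively few non-diabetic patients are flagged.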

[162] SPOT: Sparsification with Attention Dynamics via Token Relevance in Vision Transformers

Oded Schlesinger,Amirhossein Farzam,J. Matias Di Martino,Guillermo Sapiro

Main category: cs.CV

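TL;DR: Proposes SPOT, a framework that uses token relevance to detect and sparsify redundant tokens early, improving the computational efficiency of ViTs.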

Details Motivation: Vision Transformers are computationally expensive, so unimportant tokens should be identified and removed early to reduce cost. Method: Designs lightweight predictors based on token embeddings, interactions, and attention dynamics that infer token importance across layers and enable context-aware token sparsification. Result: Up to 40% higher computational efficiency than standard ViTs while maintaining or improving accuracy. Conclusion: SPOT balances ViT performance against computational cost and is versatile and practical. Abstract: While Vision Transformers (ViT) have demonstrated remarkable performance across diverse tasks, their computational demands are substantial, scaling quadratically with the number of processed tokens. Compact attention representations, reflecting token interaction distributions, can guide early detection and reduction of less salient tokens prior to attention computation. Motivated by this, we present SParsification with attentiOn dynamics via Token relevance (SPOT), a framework for early detection of redundant tokens within ViTs that leverages token embeddings, interactions, and attention dynamics across layers to infer token importance, resulting in a more context-aware and interpretable relevance detection process. SPOT informs token sparsification and facilitates the elimination of such tokens, improving computational efficiency without sacrificing performance. SPOT employs computationally lightweight predictors that can be plugged into various ViT architectures and learn to derive effective input-specific token prioritization across layers. Its versatile design supports a range of performance levels adaptable to varying resource constraints. Empirical evaluations demonstrate significant efficiency gains of up to 40% compared to standard ViTs, while maintaining or even improving accuracy. Code and models are available at https://github.com/odedsc/SPOT.
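The basic mechanics of token sparsification can be sketched as keeping the highest-scoring tokens before attention is computed. The sketch below assumes relevance scores are already available; SPOT's learned predictors are not reproduced, and the name `keep_top_tokens` and the `keep_ratio` parameter are hypothetical.

```python
import numpy as np

def keep_top_tokens(tokens, scores, keep_ratio=0.6):
    """Keep the highest-scoring tokens, preserving their original order.

    tokens: (N, D) array of token embeddings.
    scores: (N,) relevance scores (assumed given; SPOT predicts these).
    """
    n_keep = max(1, int(round(keep_ratio * tokens.shape[0])))
    # Indices of the n_keep largest scores, re-sorted into sequence order.
    keep_idx = np.sort(np.argsort(scores)[-n_keep:])
    return tokens[keep_idx], keep_idx
```

Because self-attention cost grows quadratically with token count, keeping a fraction r of the tokens reduces the pairwise attention term to roughly r² of its original size, which is why early pruning pays off.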

[163] Learnable Total Variation with Lambda Mapping for Low-Dose CT Denoising

Yusuf Talha Basak,Mehmet Ozan Unal,Metin Ertas,Isa Yildirim

Main category: cs.CV

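TL;DR: Proposes a Learnable Total Variation (LTV) framework that couples an unrolled TV solver with a data-driven LambdaNet and trains them end-to-end, yielding spatially adaptive smoothing that outperforms classical TV and FBP+U-Net in denoising and edge preservation.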

Details Motivation: Classical TV depends on a fixed lambda parameter, which limits its efficiency and practicality and prevents it from adapting regularization strength across regions. Method: Designs a Learnable TV framework (LTV) comprising an unrolled TV solver and a LambdaNet that predicts a per-pixel regularization map, trained end-to-end so that reconstruction and regularization are optimized jointly. Result: Experiments on the DeepLesion dataset with a realistic noise model show average gains of +2.9 dB PSNR and +6% SSIM over classical TV and FBP+U-Net, with strong smoothing in homogeneous regions and relaxed smoothing near boundaries. Conclusion: LTV offers an interpretable alternative to black-box CNNs and a basis for 3D and data-consistency-driven reconstruction. Abstract: Although Total Variation (TV) performs well in noise reduction and edge preservation on images, its dependence on the lambda parameter limits its efficiency and makes it difficult to use effectively. In this study, we present a Learnable Total Variation (LTV) framework that couples an unrolled TV solver with a data-driven Lambda Mapping Network (LambdaNet) predicting a per-pixel regularization map. The pipeline is trained end-to-end so that reconstruction and regularization are optimized jointly, yielding spatially adaptive smoothing: strong in homogeneous regions, relaxed near anatomical boundaries. Experiments on the DeepLesion dataset, using a realistic noise model adapted from the LoDoPaB-CT methodology, show consistent gains over classical TV and FBP+U-Net: +2.9 dB PSNR and +6% SSIM on average. LTV provides an interpretable alternative to black-box CNNs and a basis for 3D and data-consistency-driven reconstruction. Our codes are available at: https://github.com/itu-biai/deep_tv_for_ldct
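To make the role of a per-pixel lambda map concrete, here is a minimal sketch of TV denoising in which the regularization weight varies by pixel. It is not the paper's unrolled solver: it runs plain gradient descent on a smoothed TV objective, 0.5 * ||x - y||^2 + sum(lam * sqrt(|grad x|^2 + eps)), and `lam_map` is supplied by hand, whereas LTV predicts it with LambdaNet. All names and default values are assumptions.

```python
import numpy as np

def weighted_tv_denoise(noisy, lam_map, n_iters=200, step=0.05, eps=1e-2):
    """Gradient descent on a smoothed TV objective with per-pixel weights.

    Hypothetical sketch; solver and defaults are illustrative assumptions.
    """
    x = noisy.copy()
    for _ in range(n_iters):
        # Forward differences with a replicated last row/column (zero gradient there).
        dx = np.diff(x, axis=1, append=x[:, -1:])
        dy = np.diff(x, axis=0, append=x[-1:, :])
        mag = np.sqrt(dx ** 2 + dy ** 2 + eps)
        px, py = lam_map * dx / mag, lam_map * dy / mag
        # Divergence via backward differences (adjoint of the forward gradient).
        div = (px - np.roll(px, 1, axis=1)) + (py - np.roll(py, 1, axis=0))
        # Gradient of the objective is (x - noisy) - div.
        x = x - step * ((x - noisy) - div)
    return x
```

Setting `lam_map` high in flat regions and low near edges reproduces, in miniature, the "strong in homogeneous regions, relaxed near boundaries" behavior that the abstract attributes to the learned map; a map of zeros leaves the input untouched.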

[164] SemanticVLA: Semantic-Aligned Sparsification and Enhancement for Efficient Robotic Manipulation

Wei Li,Renshan Zhang,Rui Shao,Zhijian Fang,Kaiwen Zhou,Zhuotao Tian,Liqiang Nie

Main category: cs.CV

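TL;DR: This paper proposes SemanticVLA, a framework that uses semantic-aligned sparsification and enhancement to address perceptual redundancy and superficial instruction-vision alignment in vision-language-action models for robotic manipulation, substantially improving performance and efficiency.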

Details Motivation: Existing VLA models suffer from perceptual redundancy and superficial alignment between instructions and visual information, which weakens semantic grounding and hurts the efficiency and effectiveness of real deployment. Method: Proposes the SemanticVLA framework, comprising: 1) SD-Pruner (with ID-Pruner and SA-Pruner) for semantic-guided dual-path visual pruning; 2) SH-Fuser, which fuses dense patches and sparse tokens to integrate semantics with geometry; 3) SA-Coupler, which provides semantic-conditioned modeling from perception to action. Result: On simulation and real-world tasks, SemanticVLA improves success rate over OpenVLA by 21.1% on the LIBERO benchmark while reducing training cost and inference latency by 3.0-fold and 2.7-fold, respectively. Conclusion: Through semantic-aligned sparsification and enhancement, SemanticVLA improves robotic manipulation performance while substantially improving efficiency, offering an effective route to practical VLA deployment. Abstract: Vision-Language-Action (VLA) models have advanced in robotic manipulation, yet practical deployment remains hindered by two key limitations: 1) perceptual redundancy, where irrelevant visual inputs are processed inefficiently, and 2) superficial instruction-vision alignment, which hampers semantic grounding of actions. In this paper, we propose SemanticVLA, a novel VLA framework that performs Semantic-Aligned Sparsification and Enhancement for Efficient Robotic Manipulation. Specifically: 1) To sparsify redundant perception while preserving semantic alignment, the Semantic-guided Dual Visual Pruner (SD-Pruner) combines two components: the Instruction-driven Pruner (ID-Pruner) extracts global action cues and local semantic anchors in SigLIP, and the Spatial-aggregation Pruner (SA-Pruner) compacts geometry-rich features into task-adaptive tokens in DINOv2. 2) To exploit sparsified features and integrate semantics with spatial geometry, the Semantic-complementary Hierarchical Fuser (SH-Fuser) fuses dense patches and sparse tokens across SigLIP and DINOv2 for coherent representation. 3) To enhance the transformation from perception to action, the Semantic-conditioned Action Coupler (SA-Coupler) replaces the conventional observation-to-DoF approach, yielding more efficient and interpretable behavior modeling for manipulation tasks. Extensive experiments on simulation and real-world tasks show that SemanticVLA sets a new SOTA in both performance and efficiency.
SemanticVLA surpasses OpenVLA on the LIBERO benchmark by 21.1% in success rate, while reducing training cost and inference latency by 3.0-fold and 2.7-fold, respectively. SemanticVLA is open-sourced and publicly available at https://github.com/JiuTian-VL/SemanticVLA

[165] Dynamic Avatar-Scene Rendering from Human-centric Context

Wenqing Wang,Haosen Yang,Josef Kittler,Xiatian Zhu

Main category: cs.CV

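TL;DR: Proposes a Separate-then-Map (StM) strategy for reconstructing dynamic humans interacting with real environments from monocular videos; shared transformation functions unify the separately modeled human and scene, markedly improving rendering quality and accuracy.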

Details Motivation: When modeling dynamic human-environment interaction, existing methods either neglect the distinct motion characteristics of the human and the scene, leading to incomplete reconstructions, or lack information exchange between components, producing spatial inconsistencies and visual artifacts at boundaries. Method: Models the human and the scene separately and optimizes each, then introduces a dedicated information mapping mechanism: a shared transformation function for each Gaussian attribute fuses information across components efficiently and avoids exhaustive pairwise interaction computation. Result: Experiments on several monocular video datasets show that StM outperforms current state-of-the-art methods in visual quality and rendering accuracy, especially at complex human-scene interaction boundaries. Conclusion: StM addresses information isolation and computational efficiency in reconstructing dynamic human-scene interaction, enabling more consistent, higher-quality 4D neural rendering. Abstract: Reconstructing dynamic humans interacting with real-world environments from monocular videos is an important and challenging task. Despite considerable progress in 4D neural rendering, existing approaches either model dynamic scenes holistically or model humans and backgrounds separately in order to introduce parametric human priors. However, these approaches either neglect the distinct motion characteristics of the various components in the scene, especially humans, leading to incomplete reconstructions, or ignore the information exchange between the separately modeled components, resulting in spatial inconsistencies and visual artifacts at human-scene boundaries. To address this, we propose the Separate-then-Map (StM) strategy, which introduces a dedicated information mapping mechanism to bridge separately defined and optimized models. Our method employs a shared transformation function for each Gaussian attribute to unify separately modeled components, enhancing computational efficiency by avoiding exhaustive pairwise interactions while ensuring spatial and visual coherence between humans and their surroundings. Extensive experiments on monocular video datasets demonstrate that StM significantly outperforms existing state-of-the-art methods in both visual quality and rendering accuracy, particularly at challenging human-scene interaction boundaries.

[166] Benchmarking Diversity in Image Generation via Attribute-Conditional Human Evaluation

Isabela Albuquerque,Ira Ktena,Olivia Wiles,Ivana Kajić,Amal Rannen-Triki,Cristina Vasconcelos,Aida Nematzadeh

Main category: cs.CV

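TL;DR: Proposes a new framework for evaluating diversity in text-to-image (T2I) models, comprising a human evaluation template, a prompt set annotated with factors of variation, and a binomial-test-based method for comparing models, which improves diversity assessment and model ranking.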

Details Motivation: Current T2I models generate outputs that lack diversity, so a systematic, fine-grained diversity evaluation method is needed. Method: Designs a human evaluation template and a prompt set covering many concepts and their factors of variation, and compares models using binomial tests; also evaluates how various image embeddings perform for diversity measurement. Result: The framework systematically evaluates T2I model diversity across concepts, identifies categories where models are weak, and enables model ranking based on human annotations. Conclusion: The proposed method provides a reliable tool for evaluating T2I model diversity and supports progress on diversity improvements and evaluation metrics. Abstract: Despite advances in generation quality, current text-to-image (T2I) models often lack diversity, generating homogeneous outputs. This work introduces a framework to address the need for robust diversity evaluation in T2I models. Our framework systematically assesses diversity by evaluating individual concepts and their relevant factors of variation. Key contributions include: (1) a novel human evaluation template for nuanced diversity assessment; (2) a curated prompt set covering diverse concepts with their identified factors of variation (e.g. prompt: An image of an apple, factor of variation: color); and (3) a methodology for comparing models in terms of human annotations via binomial tests. Furthermore, we rigorously compare various image embeddings for diversity measurement. Notably, our principled approach enables ranking of T2I models by diversity, identifying categories where they particularly struggle. This research offers a robust methodology and insights, paving the way for improvements in T2I model diversity and metric development.
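
To make the binomial-test comparison concrete, here is a minimal sketch. The abstract only states that models are compared "in terms of human annotations via binomial tests"; the pairwise-preference setup, the null hypothesis of p = 0.5, the function name `binomial_two_sided_p`, and the example counts are all assumptions for illustration, not the authors' protocol.

```python
from math import comb


def binomial_two_sided_p(wins: int, trials: int, p: float = 0.5) -> float:
    """Exact two-sided binomial test p-value.

    Sums the probability of every outcome that is no more likely than the
    observed one under the null hypothesis (success probability p).
    """
    if not 0 <= wins <= trials:
        raise ValueError("wins must lie between 0 and trials")
    pmf = [comb(trials, k) * p**k * (1 - p) ** (trials - k) for k in range(trials + 1)]
    observed = pmf[wins]
    # Small relative tolerance so that ties in probability are counted.
    return min(1.0, sum(q for q in pmf if q <= observed * (1 + 1e-9)))


if __name__ == "__main__":
    # Hypothetical annotation counts: model A judged more diverse than
    # model B in 70 of 100 pairwise comparisons (ties discarded).
    print(binomial_two_sided_p(70, 100))
```

A small p-value would suggest that annotators prefer one model's diversity more often than chance; repeating the test per concept or per factor of variation is one way a per-category ranking could be built.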

[167] A Style is Worth One Code: Unlocking Code-to-Style Image Generation with Discrete Style Space

Huijie Liu,Shuhao Cui,Haoxiang Cao,Shuai Ma,Kai Wu,Guoliang Kang

Main category: cs.CV

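TL;DR: Proposes CoTyle, the first open-source code-to-style image generation method, which generates novel and consistent visual styles from a numerical style code.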

Details Motivation: Existing style generation methods rely on complex inputs (long text prompts, reference images, or fine-tuning), struggle to guarantee style consistency and creativity, and the academic community lacks open research in this area. Method: First trains a discrete style codebook on a collection of images to extract style embeddings; these embeddings condition a text-to-image diffusion model to generate stylized images; an autoregressive style generator is then trained to model the distribution of the discrete style embeddings so that novel style embeddings can be synthesized. At inference, a numerical style code is mapped to a unique style embedding that guides the diffusion model to generate images in the corresponding style. Result: Experiments show that CoTyle effectively turns a numerical code into a style controller, performing well on style diversity, consistency, and reproducibility, and generating rich, novel styles from simple input. Conclusion: CoTyle fills the gap in academic research on code-to-style generation, validates the idea that "a style is worth one code", and offers a new direction for simplifying and extending stylized image generation. Abstract: Innovative visual stylization is a cornerstone of artistic creation, yet generating novel and consistent visual styles remains a significant challenge. Existing generative approaches typically rely on lengthy textual prompts, reference images, or parameter-efficient fine-tuning to guide style-aware image generation, but often struggle with style consistency, limited creativity, and complex style representations. In this paper, we affirm that a style is worth one numerical code by introducing the novel task, code-to-style image generation, which produces images with novel, consistent visual styles conditioned solely on a numerical style code. To date, this field has been explored primarily by industry (e.g., Midjourney), with no open-source research from the academic community. To fill this gap, we propose CoTyle, the first open-source method for this task. Specifically, we first train a discrete style codebook from a collection of images to extract style embeddings. These embeddings serve as conditions for a text-to-image diffusion model (T2I-DM) to generate stylistic images. Subsequently, we train an autoregressive style generator on the discrete style embeddings to model their distribution, allowing the synthesis of novel style embeddings. During inference, a numerical style code is mapped to a unique style embedding by the style generator, and this embedding guides the T2I-DM to generate images in the corresponding style. Unlike existing methods, our method offers unparalleled simplicity and diversity, unlocking a vast space of reproducible styles from minimal input. Extensive experiments validate that CoTyle effectively turns a numerical code into a style controller, demonstrating a style is worth one code.

[168] OmniVGGT: Omni-Modality Driven Visual Geometry Grounded

Haosong Peng,Hao Li,Yalun Dai,Yushi Lan,Yihang Luo,Tianyu Qi,Zhengshen Zhang,Yufeng Zhan,Junfei Zhang,Wenchao Xu,Ziwei Liu

Main category: cs.CV

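TL;DR: Proposes OmniVGGT, a general 3D foundation model framework that effectively fuses an arbitrary number of geometric modalities (e.g., depth, camera intrinsics/extrinsics). With a GeoAdapter and a stochastic multimodal fusion strategy, it improves performance on a range of vision tasks while preserving inference efficiency, and its practical utility is validated in vision-language-action models.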

Details Motivation: Most existing 3D foundation models take RGB-only input and ignore readily available geometric information (such as camera parameters and depth maps), which limits their spatial representation ability; a method that can flexibly exploit multiple geometric modalities is therefore needed. Method: Proposes the OmniVGGT framework, consisting of a GeoAdapter module (which uses zero-initialized convolutions to inject geometric information progressively) and a stochastic multimodal fusion regimen (which randomly samples modality subsets during training), supporting an arbitrary number of geometric inputs while preventing overfitting. Result: Surpasses prior methods on monocular/multi-view depth estimation, multi-view stereo, and camera pose estimation, reaching state-of-the-art results even with RGB-only input; integrated into a VLA model, it outperforms the baseline on robotic tasks. Conclusion: OmniVGGT effectively fuses multiple geometric modalities, improves 3D vision task performance without sacrificing inference speed, and shows good extensibility and practical value. Abstract: General 3D foundation models have started to lead the trend of unifying diverse vision tasks, yet most assume RGB-only inputs and ignore readily available geometric cues (e.g., camera intrinsics, poses, and depth maps). To address this issue, we introduce OmniVGGT, a novel framework that can effectively benefit from an arbitrary number of auxiliary geometric modalities during both training and inference. In our framework, a GeoAdapter is proposed to encode depth and camera intrinsics/extrinsics into a spatial foundation model. It employs zero-initialized convolutions to progressively inject geometric information without disrupting the foundation model's representation space. This design ensures stable optimization with negligible overhead, maintaining inference speed comparable to VGGT even with multiple additional inputs. Additionally, a stochastic multimodal fusion regimen is proposed, which randomly samples modality subsets per instance during training. This enables an arbitrary number of modality inputs during testing and promotes learning robust spatial representations instead of overfitting to auxiliary cues. Comprehensive experiments on monocular/multi-view depth estimation, multi-view stereo, and camera pose estimation demonstrate that OmniVGGT outperforms prior methods with auxiliary inputs and achieves state-of-the-art results even with RGB-only input. To further highlight its practical utility, we integrated OmniVGGT into vision-language-action (VLA) models. The OmniVGGT-enhanced VLA model not only outperforms the vanilla point-cloud-based baseline on mainstream benchmarks, but also effectively leverages accessible auxiliary inputs to achieve consistent gains on robotic tasks.
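
The per-instance modality sampling described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the modality names, the independent keep probability `keep_prob`, and the function `sample_modalities` are assumptions, and the abstract does not say how subsets are actually drawn.

```python
import random
from typing import List

# Hypothetical names for the auxiliary geometric inputs mentioned in the abstract.
AUX_MODALITIES = ("depth", "intrinsics", "extrinsics")


def sample_modalities(rng: random.Random, keep_prob: float = 0.5) -> List[str]:
    """Pick the inputs for one training instance.

    RGB is always kept; each auxiliary modality is kept independently with
    probability keep_prob (an assumed sampling scheme).
    """
    return ["rgb"] + [m for m in AUX_MODALITIES if rng.random() < keep_prob]


if __name__ == "__main__":
    rng = random.Random(0)
    for _ in range(4):
        print(sample_modalities(rng))
```

Because the model sees every combination from RGB-only to all inputs during training, it can accept whichever subset happens to be available at test time.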

[169] From 2D to 3D Without Extra Baggage: Data-Efficient Cancer Detection in Digital Breast Tomosynthesis

Yen Nhi Truong Vu,Dan Guo,Sripad Joshi,Harshit Kumar,Jason Su,Thomas Paul Matthews

Main category: cs.CV

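TL;DR: Proposes M&M-3D, a new architecture for breast cancer detection in digital breast tomosynthesis (DBT) that enables learnable 3D reasoning without adding parameters, addresses data scarcity, and significantly outperforms existing methods on several metrics.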

Details Motivation: Limited annotated DBT data constrains the use of deep learning models on DBT; existing methods either discard 3D information or rely on complex architectures and more training data. Method: M&M-3D modifies operations in the original M&M model to build malignancy-guided 3D features without adding parameters, learns 3D reasoning by repeatedly mixing these features with slice-level information, and supports direct weight transfer from FFDM models. Result: M&M-3D outperforms 2D projection and 3D slice-based methods by 11-54% on localization and 3-10% on classification; in the low-data regime it outperforms complex 3D reasoning variants by 20-47% (localization) and 2-10% (classification), while matching them in the high-data regime; on the BCS-DBT benchmark it improves on the previous top baseline by 4% for classification and 10% for localization. Conclusion: M&M-3D achieves effective 3D reasoning without adding model parameters, resolves the tension between data scarcity and the use of volumetric information in DBT, and significantly improves breast cancer detection performance. Abstract: Digital Breast Tomosynthesis (DBT) enhances finding visibility for breast cancer detection by providing volumetric information that reduces the impact of overlapping tissues; however, limited annotated data has constrained the development of deep learning models for DBT. To address data scarcity, existing methods attempt to reuse 2D full-field digital mammography (FFDM) models by either flattening DBT volumes or processing slices individually, thus discarding volumetric information. Alternatively, 3D reasoning approaches introduce complex architectures that require more DBT training data. Tackling these drawbacks, we propose M&M-3D, an architecture that enables learnable 3D reasoning while remaining parameter-free relative to its FFDM counterpart, M&M. M&M-3D constructs malignancy-guided 3D features, and 3D reasoning is learned through repeatedly mixing these 3D features with slice-level information. This is achieved by modifying operations in M&M without adding parameters, thus enabling direct weight transfer from FFDM. Extensive experiments show that M&M-3D surpasses 2D projection and 3D slice-based methods by 11-54% for localization and 3-10% for classification. Additionally, M&M-3D outperforms complex 3D reasoning variants by 20-47% for localization and 2-10% for classification in the low-data regime, while matching their performance in the high-data regime. On the popular BCS-DBT benchmark, M&M-3D outperforms the previous top baseline by 4% for classification and 10% for localization.

[170] Multitask GLocal OBIA-Mamba for Sentinel-2 Landcover Mapping

Zack Dewis,Yimin Zhu,Zhengsen Xu,Mabel Heffring,Saeid Taleghanidoozdoozan,Kaylee Xiao,Motasem Alkayid,Lincoln Linlin Xu

Main category: cs.CV

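TL;DR: Proposes a new land use classification method for Sentinel-2 imagery based on a Multitask GLocal OBIA-Mamba (MSOM), significantly improving classification accuracy and detail.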

Details Motivation: Sentinel-2 based LULC classification is challenging because of spatial heterogeneity, context information, and spectral ambiguity, and existing methods struggle to balance fine-grained detail with global consistency. Method: Designs an OBIA-Mamba model that uses superpixels as Mamba tokens; builds a GLocal dual-branch CNN-Mamba architecture that fuses local spatial detail with global contextual information; and adopts a multitask optimization framework with dual loss functions to balance local precision and global consistency. Result: Tested on Sentinel-2 imagery of Alberta, Canada, the method achieves higher classification accuracy and finer classification results than other advanced methods. Conclusion: The MSOM model addresses key difficulties in Sentinel-2 LULC classification, significantly improving classification performance while maintaining computational efficiency, and has broad application prospects. Abstract: Although Sentinel-2 based land use and land cover (LULC) classification is critical for various environmental monitoring applications, it is a very difficult task due to some key data challenges (e.g., spatial heterogeneity, context information, signature ambiguity). This paper presents a novel Multitask GLocal OBIA-Mamba (MSOM) for enhanced Sentinel-2 classification with the following contributions. First, an object-based image analysis (OBIA) Mamba model (OBIA-Mamba) is designed to reduce redundant computation without compromising fine-grained details by using superpixels as Mamba tokens. Second, a global-local (GLocal) dual-branch convolutional neural network (CNN)-Mamba architecture is designed to jointly model local spatial detail and global contextual information. Third, a multitask optimization framework is designed to employ dual loss functions to balance local precision with global consistency. The proposed approach is tested on Sentinel-2 imagery in Alberta, Canada, in comparison with several advanced classification approaches, and the results demonstrate that the proposed approach achieves higher classification accuracy and finer details than the other state-of-the-art methods.
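
As a rough illustration of "superpixels as Mamba tokens", the sketch below pools per-pixel features into one vector per superpixel, which shortens the token sequence from one entry per pixel to one entry per object. Mean pooling, the function name `superpixel_tokens`, and the toy data are assumptions; the abstract does not specify how superpixel tokens are formed.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def superpixel_tokens(
    features: Sequence[Sequence[float]], labels: Sequence[int]
) -> Dict[int, List[float]]:
    """Average per-pixel feature vectors within each superpixel.

    features[i] is the feature vector of pixel i and labels[i] is the id of
    the superpixel that pixel belongs to. Returns one mean vector per id.
    """
    if len(features) != len(labels):
        raise ValueError("features and labels must have the same length")
    sums: Dict[int, List[float]] = {}
    counts: Dict[int, int] = defaultdict(int)
    for vec, lab in zip(features, labels):
        acc = sums.setdefault(lab, [0.0] * len(vec))
        for i, value in enumerate(vec):
            acc[i] += value
        counts[lab] += 1
    return {lab: [s / counts[lab] for s in acc] for lab, acc in sums.items()}


if __name__ == "__main__":
    # Three pixels with 2-band features; the first two share a superpixel.
    tokens = superpixel_tokens([[1, 2], [3, 4], [10, 20]], [0, 0, 1])
    print(tokens)
```

In a real pipeline the labels would come from a superpixel segmentation of the Sentinel-2 scene, and the resulting token sequence would be fed to the sequence model.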

[171] One Small Step in Latent, One Giant Leap for Pixels: Fast Latent Upscale Adapter for Your Diffusion Models

Aleksandr Razin,Danil Kazantsev,Ilya Makarov

Main category: cs.CV

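TL;DR: Proposes the Latent Upscaler Adapter (LUA), a lightweight module that performs super-resolution directly in latent space, improving the efficiency and quality of high-resolution image generation with diffusion models.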

Details Motivation: Diffusion models struggle to scale beyond their training resolution: direct high-resolution sampling is slow and costly, while post-hoc image super-resolution (ISR) operates after decoding, introducing artifacts and extra latency. Method: Designs a lightweight LUA module that operates on the generator's latent code and performs super-resolution before VAE decoding; a shared Swin-style backbone with scale-specific pixel-shuffle heads supports 2x and 4x upscaling and remains compatible with image-space SR baselines. Result: LUA closely matches the fidelity of native high-resolution generation with nearly 3x lower decoding and upscaling time (adding only 0.42 s for 1024 px generation from 512 px, versus 1.87 s for pixel-space SR with the same SwinIR architecture), without modifying the base model or adding diffusion stages; it also generalizes well across the latent spaces of different VAEs. Conclusion: LUA offers modern diffusion models an efficient, scalable path to high-resolution image generation that combines high quality with low latency. Abstract: Diffusion models struggle to scale beyond their training resolutions, as direct high-resolution sampling is slow and costly, while post-hoc image super-resolution (ISR) introduces artifacts and additional latency by operating after decoding. We present the Latent Upscaler Adapter (LUA), a lightweight module that performs super-resolution directly on the generator's latent code before the final VAE decoding step. LUA integrates as a drop-in component, requiring no modifications to the base model or additional diffusion stages, and enables high-resolution synthesis through a single feed-forward pass in latent space. A shared Swin-style backbone with scale-specific pixel-shuffle heads supports 2x and 4x factors and remains compatible with image-space SR baselines, achieving comparable perceptual quality with nearly 3x lower decoding and upscaling time (adding only +0.42 s for 1024 px generation from 512 px, compared to 1.87 s for pixel-space SR using the same SwinIR architecture). Furthermore, LUA shows strong generalization across the latent spaces of different VAEs, making it easy to deploy without retraining from scratch for each new decoder. Extensive experiments demonstrate that LUA closely matches the fidelity of native high-resolution generation while offering a practical and efficient path to scalable, high-fidelity image synthesis in modern diffusion pipelines.
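
The pixel-shuffle operation named in the abstract is a standard rearrangement that trades channels for spatial resolution. The sketch below shows the generic operation on nested lists, using the common channel ordering in which input channel `c*r*r + i*r + j` fills output offset `(i, j)`; it is not LUA's actual head, and the function name `pixel_shuffle` is only illustrative.

```python
from typing import List

Tensor3 = List[List[List[float]]]


def pixel_shuffle(x: Tensor3, r: int) -> Tensor3:
    """Rearrange a (C*r*r, H, W) nested list into (C, H*r, W*r).

    Each group of r*r input channels supplies the r x r block of output
    pixels that replaces one input pixel.
    """
    channels_in = len(x)
    if r < 1 or channels_in % (r * r) != 0:
        raise ValueError("channel count must be divisible by r*r")
    c_out = channels_in // (r * r)
    h, w = len(x[0]), len(x[0][0])
    out = [[[0.0] * (w * r) for _ in range(h * r)] for _ in range(c_out)]
    for c in range(c_out):
        for i in range(r):
            for j in range(r):
                src = x[c * r * r + i * r + j]
                for row in range(h):
                    for col in range(w):
                        out[c][row * r + i][col * r + j] = src[row][col]
    return out


if __name__ == "__main__":
    # Four 1x1 channels become one 2x2 channel for r = 2.
    print(pixel_shuffle([[[0]], [[1]], [[2]], [[3]]], 2))
```

A head for 2x upscaling would therefore predict four times as many channels as the latent has, and a 4x head sixteen times as many, before this rearrangement is applied.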

[172] Depth Anything 3: Recovering the Visual Space from Any Views

Haotong Lin,Sili Chen,Junhao Liew,Donny Y. Chen,Zhenyu Li,Guang Shi,Jiashi Feng,Bingyi Kang

Main category: cs.CV

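TL;DR: Depth Anything 3 (DA3) is a model that predicts spatially consistent geometry from an arbitrary number of visual inputs, with or without known camera poses. It uses a plain Transformer architecture and a single depth-ray prediction target, and sets a new state of the art on multiple visual geometry tasks.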

Details Motivation: To pursue minimal modeling, improving the consistency and generalization of multi-view geometry prediction while reducing reliance on complex architectures and multi-task learning. Method: Uses a plain Transformer as the backbone, proposes a single depth-ray prediction target, and trains the model with a teacher-student paradigm. Result: On a newly built visual geometry benchmark, DA3 surpasses the prior SOTA VGGT by an average of 44.3% in camera pose accuracy and 25.1% in geometric accuracy, and outperforms DA2 in monocular depth estimation. Conclusion: By simplifying the model design, DA3 achieves strong performance and generalization, validating the effectiveness of minimal architectures for visual geometry tasks. Abstract: We present Depth Anything 3 (DA3), a model that predicts spatially consistent geometry from an arbitrary number of visual inputs, with or without known camera poses. In pursuit of minimal modeling, DA3 yields two key insights: a single plain transformer (e.g., vanilla DINO encoder) is sufficient as a backbone without architectural specialization, and a singular depth-ray prediction target obviates the need for complex multi-task learning. Through our teacher-student training paradigm, the model achieves a level of detail and generalization on par with Depth Anything 2 (DA2). We establish a new visual geometry benchmark covering camera pose estimation, any-view geometry and visual rendering. On this benchmark, DA3 sets a new state-of-the-art across all tasks, surpassing prior SOTA VGGT by an average of 44.3% in camera pose accuracy and 25.1% in geometric accuracy. Moreover, it outperforms DA2 in monocular depth estimation. All models are trained exclusively on public academic datasets.

[173] Enhancing the Outcome Reward-based RL Training of MLLMs with Self-Consistency Sampling

Jiahao Wang,Weiye Xu,Aijun Yang,Wengang Zhou,Lewei Lu,Houqiang Li,Xiaohua Wang,Jinguo Zhu

Main category: cs.CV

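TL;DR: Proposes Self-Consistency Sampling (SCS), which introduces visual perturbations and repeated truncation and resampling, and uses a consistency score to reduce the training bias that arises in outcome-reward reinforcement learning of multimodal large language models when a faulty reasoning chain guesses the correct answer, significantly improving performance.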

Details Motivation: In the multiple-choice setting, outcome-reward reinforcement learning gives the same reward to a faulty reasoning chain that happens to guess the correct option, so the training signal is inaccurate and the optimization of reasoning ability suffers. Method: For each question, SCS introduces small visual perturbations and repeatedly truncates and resamples the initial reasoning trajectory; agreement among the resulting trajectories yields a differentiable consistency score that down-weights unreliable trajectories during policy updates. Result: On Qwen2.5-VL-7B-Instruct, combining SCS with RLOO, GRPO, and REINFORCE++ improves accuracy by up to 7.7 percentage points across six multimodal benchmarks; notable gains are also obtained on Qwen2.5-VL-3B-Instruct and InternVL3-8B, with negligible extra computation. Conclusion: SCS is a simple, general, and effective remedy that corrects reward distortion in outcome-reward reinforcement learning and improves reasoning training for multimodal large language models. Abstract: Outcome-reward reinforcement learning (RL) is a common and increasingly significant way to refine the step-by-step reasoning of multimodal large language models (MLLMs). In the multiple-choice setting - a dominant format for multimodal reasoning benchmarks - the paradigm faces a significant yet often overlooked obstacle: unfaithful trajectories that guess the correct option after a faulty chain of thought receive the same reward as genuine reasoning, which is a flaw that cannot be ignored. We propose Self-Consistency Sampling (SCS) to correct this issue. For each question, SCS (i) introduces small visual perturbations and (ii) performs repeated truncation and resampling of an initial trajectory; agreement among the resulting trajectories yields a differentiable consistency score that down-weights unreliable traces during policy updates. Based on Qwen2.5-VL-7B-Instruct, plugging SCS into RLOO, GRPO, and REINFORCE++ series improves accuracy by up to 7.7 percentage points on six multimodal benchmarks with negligible extra computation. SCS also yields notable gains on both Qwen2.5-VL-3B-Instruct and InternVL3-8B, offering a simple, general remedy for outcome-reward RL in MLLMs.
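
The down-weighting idea can be illustrated with a deliberately simplified stand-in. The abstract describes a differentiable consistency score; the sketch below instead uses a plain agreement fraction over resampled answers, which is an assumption made for readability and not the paper's formulation. The function names and the example answers are hypothetical.

```python
from typing import Sequence


def consistency_score(original_answer: str, resampled_answers: Sequence[str]) -> float:
    """Fraction of resampled trajectories whose final answer matches the original.

    A simplified, non-differentiable proxy: a trajectory that only guessed
    the right option tends to lose agreement once it is truncated and
    resampled, or once the image is slightly perturbed.
    """
    if not resampled_answers:
        raise ValueError("need at least one resampled answer")
    matches = sum(1 for answer in resampled_answers if answer == original_answer)
    return matches / len(resampled_answers)


def weighted_reward(outcome_reward: float, score: float) -> float:
    """Scale the outcome reward so that unreliable trajectories count less."""
    return outcome_reward * score


if __name__ == "__main__":
    # Hypothetical multiple-choice answers from four resampled continuations.
    score = consistency_score("B", ["B", "B", "A", "B"])
    print(score, weighted_reward(1.0, score))
```

Under this proxy, a correct answer reached through stable reasoning keeps most of its reward, while a lucky guess that flips under resampling contributes less to the policy update.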