cs.CL

[1] Omnilingual ASR: Open-Source Multilingual Speech Recognition for 1600+ Languages

Omnilingual ASR team,Gil Keren,Artyom Kozhevnikov,Yen Meng,Christophe Ropers,Matthew Setzler,Skyler Wang,Ife Adebara,Michael Auli,Can Balioglu,Kevin Chan,Chierh Cheng,Joe Chuang,Caley Droof,Mark Duppenthaler,Paul-Ambroise Duquenne,Alexander Erben,Cynthia Gao,Gabriel Mejia Gonzalez,Kehan Lyu,Sagar Miglani,Vineel Pratap,Kaushik Ram Sadagopan,Safiyyah Saleem,Arina Turkatenko,Albert Ventayol-Boada,Zheng-Xin Yong,Yu-An Chung,Jean Maillard,Rashel Moritz,Alexandre Mourachko,Mary Williamson,Shireen Yates

Main category: cs.CL

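TL;DR: Omnilingual ASR is the first large-scale speech recognition system designed for extensibility. Using self-supervised pre-training and an LLM-inspired decoder architecture, it supports more than 1,600 languages, over 500 of them covered by ASR for the first time, and substantially improves recognition for low-resource languages.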

Details Motivation: Most languages lack automatic speech recognition (ASR) support. Existing systems are constrained by architecture and data cost and often neglect collaboration with language communities, which leaves long-tail languages uncovered. Method: Scales self-supervised pre-training to 7B parameters and combines an encoder-decoder architecture with an LLM-inspired decoder, trained on a large multilingual corpus of public resources and community-sourced recordings, to enable zero-shot generalization. Result: The system supports more than 1,600 languages, including over 500 never before covered by ASR, substantially outperforms prior systems in low-resource conditions, and is released as an open-source model family ranging from 300M to 7B parameters. Conclusion: Omnilingual ASR delivers highly extensible speech recognition, lowers the barrier to research and application through open models and tools, emphasizes community collaboration and ethical design, and has broad potential for societal impact. Abstract: Automatic speech recognition (ASR) has advanced in high-resource languages, but most of the world's 7,000+ languages remain unsupported, leaving thousands of long-tail languages behind. Expanding ASR coverage has been costly and limited by architectures that restrict language support, making extension inaccessible to most--all while entangled with ethical concerns when pursued without community collaboration. To transcend these limitations, we introduce Omnilingual ASR, the first large-scale ASR system designed for extensibility. Omnilingual ASR enables communities to introduce unserved languages with only a handful of data samples. It scales self-supervised pre-training to 7B parameters to learn robust speech representations and introduces an encoder-decoder architecture designed for zero-shot generalization, leveraging an LLM-inspired decoder. This capability is grounded in a massive and diverse training corpus; by combining breadth of coverage with linguistic variety, the model learns representations robust enough to adapt to unseen languages. Incorporating public resources with community-sourced recordings gathered through compensated local partnerships, Omnilingual ASR expands coverage to over 1,600 languages, the largest such effort to date--including over 500 never before served by ASR. Automatic evaluations show substantial gains over prior systems, especially in low-resource conditions, and strong generalization. We release Omnilingual ASR as a family of models, from 300M variants for low-power devices to 7B for maximum accuracy.
We reflect on the ethical considerations shaping this design and conclude by discussing its societal impact. In particular, we highlight how open-sourcing models and tools can lower barriers for researchers and communities, inviting new forms of participation. Open-source artifacts are available at https://github.com/facebookresearch/omnilingual-asr.

[2] Order Matters: Rethinking Prompt Construction in In-Context Learning

Warren Li,Yiqian Wang,Zihan Wang,Jingbo Shang

Main category: cs.CL

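TL;DR: This paper revisits how example selection and example ordering affect in-context learning performance, finding that ordering matters about as much as selection and that effective orderings can be found with a development set.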

Details Motivation: Prior work assumes that in in-context learning the choice of examples matters more than their order; this paper questions that assumption and systematically compares the two effects. Method: On classification and generation tasks, with multiple open-source models (0.5B to 27B parameters) and GPT-5, controlled experiments compare how different example selections and orderings affect performance, and a development set is used to search for strong orderings. Result: The performance variance caused by example ordering is comparable to that from swapping the entire example set; a development set alone is enough to find strong orderings that approach the performance of an oracle using test labels. Conclusion: Example selection and ordering are equally important and intertwined in prompt design, which calls for re-examining existing assumptions in in-context learning. Abstract: In-context learning (ICL) enables large language models to perform new tasks by conditioning on a sequence of examples. Most prior work reasonably and intuitively assumes that which examples are chosen has a far greater effect on performance than how those examples are ordered, leading to a focus on example selection. We revisit this assumption and conduct a systematic comparison between the effect of selection and ordering. Through controlled experiments on both classification and generation tasks, using multiple open-source model families (0.5B to 27B parameters) and GPT-5, we find that the variance in performance due to different example orderings is comparable to that from using entirely different example sets. Furthermore, we show that strong orderings can be identified using only a development set, achieving performance close to an oracle that selects the best ordering based on test labels. Our findings highlight the equal and intertwined importance of example selection and ordering in prompt design, calling for a reexamination of the assumptions held in ICL.
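The idea of choosing an ordering with a development set can be sketched as below. This is a minimal illustration, not the authors' procedure: the function names, the random sampling of candidate orderings, and the `score_fn` callback (a stand-in for "build the prompt in this order and measure dev-set accuracy") are all assumptions.

```python
import math
import random


def candidate_orderings(n_examples, n_candidates, seed=0):
    """Sample up to n_candidates distinct index permutations (hypothetical helper)."""
    rng = random.Random(seed)
    # Never ask for more permutations than exist.
    target = min(n_candidates, math.factorial(n_examples))
    ordered, seen = [], set()
    while len(ordered) < target:
        perm = tuple(rng.sample(range(n_examples), n_examples))
        if perm not in seen:
            seen.add(perm)
            ordered.append(perm)
    return ordered


def pick_best_ordering(examples, dev_set, score_fn, n_candidates=24, seed=0):
    """Return the candidate ordering of `examples` with the highest dev-set score.

    score_fn(ordered_examples, dev_set) is an assumed callback that would,
    in practice, run the model with the prompt built in that order.
    """
    best = max(
        candidate_orderings(len(examples), n_candidates, seed),
        key=lambda perm: score_fn([examples[i] for i in perm], dev_set),
    )
    return [examples[i] for i in best]
```

Because only the development set is scored, test labels are never touched during the search; the oracle mentioned in the abstract would be the same loop scored on test labels instead.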

[3] Contextual morphologically-guided tokenization for Latin encoder models

Marisa Hudspeth,Patrick J. Burns,Brendan O'Connor

Main category: cs.CL

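TL;DR: This paper studies morphologically aware tokenization for Latin, a morphologically rich, medium-resource language. Morphologically guided tokenization improves overall performance on four downstream tasks, especially on out-of-domain text, showing that linguistic resources can effectively improve language models for morphologically complex languages.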

Details Motivation: Standard tokenization methods usually prioritize information-theoretic goals such as high compression and low fertility over linguistic goals such as morphological alignment, which disadvantages morphologically rich languages. The paper therefore explores tokenization that accounts for morphological structure to improve language models for such languages. Method: Proposes morphologically guided tokenization that draws on existing high-quality lexical resources to preprocess Latin, and evaluates it on several downstream tasks. Result: Morphologically guided tokenization improves overall performance on four downstream tasks, with the largest gains on out-of-domain text, indicating better generalization. Conclusion: Using linguistic resources for morphologically aware tokenization is an effective way to improve language models for morphologically complex languages; for low-resource languages that lack large-scale pretraining data, developing and incorporating linguistic resources is a feasible alternative. Abstract: Tokenization is a critical component of language model pretraining, yet standard tokenization methods often prioritize information-theoretical goals like high compression and low fertility rather than linguistic goals like morphological alignment. In fact, they have been shown to be suboptimal for morphologically rich languages, where tokenization quality directly impacts downstream performance. In this work, we investigate morphologically-aware tokenization for Latin, a morphologically rich language that is medium-resource in terms of pretraining data, but high-resource in terms of curated lexical resources -- a distinction that is often overlooked but critical in discussions of low-resource language modeling. We find that morphologically-guided tokenization improves overall performance on four downstream tasks. Performance gains are most pronounced for out of domain texts, highlighting our models' improved generalization ability. Our findings demonstrate the utility of linguistic resources to improve language modeling for morphologically complex languages. For low-resource languages that lack large-scale pretraining data, the development and incorporation of linguistic resources can serve as a feasible alternative to improve LM performance.

[4] Assessing the Applicability of Natural Language Processing to Traditional Social Science Methodology: A Case Study in Identifying Strategic Signaling Patterns in Presidential Directives

C. LeMay,A. Lane,J. Seales,M. Winstead,S. Baty

Main category: cs.CL

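TL;DR: This study examines the use of natural language processing (NLP) to extract main topics from a corpus of Presidential Directives, compares NLP output with human labels, shows both its potential and its limitations, and notes that further validation of NLP in this kind of social science research is needed.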

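Details Motivation: Explore how well NLP techniques apply to large-scale text analysis, in particular their potential for identifying signaling themes in political texts, to improve the efficiency and accuracy of social science research. Method: Applies NLP techniques to extract topics from Presidential Directives issued from the Reagan through Clinton administrations and compares the output with human-labeled results. Result: NLP identified relevant documents and shows potential for processing large text collections, but its output differs from the human labels, indicating that its accuracy still needs improvement. Conclusion: NLP shows promise for analyzing political texts such as Presidential Directives, but its results diverge from human judgment and further research is needed to assess its validity; given how quickly AI is evolving, the tool evaluated here may already be dated. Abstract: Our research investigates how Natural Language Processing (NLP) can be used to extract main topics from a larger corpus of written data, as applied to the case of identifying signaling themes in Presidential Directives (PDs) from the Reagan through Clinton administrations. Analysts and NLP both identified relevant documents, demonstrating the potential utility of NLP in research involving large written corpora. However, we also identified discrepancies between NLP and human-labeled results that indicate a need for more research to assess the validity of NLP in this use case. The research was conducted in 2023, and the rapidly evolving landscape of AI/ML means existing tools have improved and new tools have been developed; this research displays the inherent capabilities of a potentially dated AI tool in emerging social science applications.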

[5] How Small Can You Go? Compact Language Models for On-Device Critical Error Detection in Machine Translation

Muskaan Chopra,Lorenz Sparrenberg,Sarthak Khanna,Rafet Sifa

Main category: cs.CL

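TL;DR: With lightweight calibration and small-sample supervision, small language models (<2B parameters) can efficiently detect critical errors in English-German translation on device; Gemma-3-1B offers the best balance between quality and efficiency.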

Details Motivation: Large language models evaluate machine translation well, but their size and cost make them hard to deploy on edge devices and in privacy-sensitive settings, so the paper asks how small a model can be while keeping detection performance and reducing resource use. Method: Focuses on English-German critical error detection and evaluates several sub-2B models (such as Gemma-3-1B and Qwen-3) with a standardized prompt template, lightweight logit-bias calibration, and majority voting, measuring semantic quality and compute cost on WMT21, WMT22, and SynCED-EnDe-2025. Result: Models of about one billion parameters perform best: after merged-weights fine-tuning, Gemma-3-1B reaches MCC=0.77 and F1-ERR=0.98 on SynCED-EnDe-2025 with 400 ms single-sample latency on a MacBook Pro M4 Pro; Qwen-3-1.7B scores higher but costs more compute; 0.6B models remain usable but miss more entity and number errors. Conclusion: Compact instruction-tuned models with lightweight calibration can provide reliable, low-cost, private translation error detection on edge devices, suitable for quality screening in real translation pipelines. Abstract: Large Language Models (LLMs) excel at evaluating machine translation (MT), but their scale and cost hinder deployment on edge devices and in privacy-sensitive workflows. We ask: how small can you get while still detecting meaning-altering translation errors? Focusing on English->German Critical Error Detection (CED), we benchmark sub-2B models (LFM2-350M, Qwen-3-0.6B/1.7B, Llama-3.2-1B-Instruct, Gemma-3-1B) across WMT21, WMT22, and SynCED-EnDe-2025. Our framework standardizes prompts, applies lightweight logit-bias calibration and majority voting, and reports both semantic quality (MCC, F1-ERR/F1-NOT) and compute metrics (VRAM, latency, throughput). Results reveal a clear sweet spot around one billion parameters: Gemma-3-1B provides the best quality-efficiency trade-off, reaching MCC=0.77 with F1-ERR=0.98 on SynCED-EnDe-2025 after merged-weights fine-tuning, while maintaining 400 ms single-sample latency on a MacBook Pro M4 Pro (24 GB). At larger scale, Qwen-3-1.7B attains the highest absolute MCC (+0.11 over Gemma) but with higher compute cost. In contrast, ultra-small models (0.6B) remain usable with few-shot calibration yet under-detect entity and number errors. Overall, compact, instruction-tuned LLMs augmented with lightweight calibration and small-sample supervision can deliver trustworthy, on-device CED for MT, enabling private, low-cost error screening in real-world translation pipelines. All datasets, prompts, and scripts are publicly available at our GitHub repository.
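For readers unfamiliar with the pieces named in this abstract, the sketch below shows one plausible reading of logit-bias calibration, majority voting, and the Matthews correlation coefficient (MCC). The function names, the two-label setup (`ERR`/`NOT`), and the way the bias is applied are illustrative assumptions, not the paper's code; only the MCC formula itself is standard.

```python
import math
from collections import Counter


def classify(logit_err, logit_not, bias=0.0):
    """Pick a label after adding a calibration bias to the ERR logit (assumed form)."""
    return "ERR" if logit_err + bias > logit_not else "NOT"


def majority_vote(labels):
    """Most frequent label; use an odd number of votes to avoid ties."""
    return Counter(labels).most_common(1)[0][0]


def mcc(y_true, y_pred, positive="ERR"):
    """Matthews correlation coefficient for a binary task."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    tn = sum(t != positive and p != positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Convention: return 0.0 when any marginal is empty.
    return (tp * tn - fp * fn) / denom if denom else 0.0
```

In this reading, the bias would be tuned on a small labeled set so that a model that under-predicts `ERR` is nudged toward it, and the vote would be taken over several sampled generations for the same input.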

[6] Predicate-Argument Structure Divergences in Chinese and English Parallel Sentences and their Impact on Language Transfer

Rocco Tripodi,Xiaoyu Liu

Main category: cs.CL

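TL;DR: This paper analyzes predicate-argument structures in parallel Chinese and English sentences, examines alignment and misalignment in cross-lingual transfer, and finds that language transfer is asymmetric, which suggests that the source language in cross-lingual NLP must be chosen with care.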

Details Motivation: Typological differences between languages make cross-lingual knowledge transfer difficult, and effective language adaptation is a key problem, especially for low-resource languages. Method: Runs annotation projection experiments on predicate annotations in parallel Chinese and English sentences, analyzes the results qualitatively and quantitatively, and proposes a categorization of structural divergences. Result: Annotation projection from one language to the other is markedly asymmetric, so the quality of cross-lingual transfer depends on the choice of source language. Conclusion: Language transfer in cross-lingual NLP is asymmetric and the choice of source language strongly affects transfer quality; future work should account for this before drawing scientific conclusions. Abstract: Cross-lingual Natural Language Processing (NLP) has gained significant traction in recent years, offering practical solutions in low-resource settings by transferring linguistic knowledge from resource-rich to low-resource languages. This field leverages techniques like annotation projection and model transfer for language adaptation, supported by multilingual pre-trained language models. However, linguistic divergences hinder language transfer, especially among typologically distant languages. In this paper, we present an analysis of predicate-argument structures in parallel Chinese and English sentences. We explore the alignment and misalignment of predicate annotations, inspecting similarities and differences and proposing a categorization of structural divergences. The analysis and the categorization are supported by a qualitative and quantitative analysis of the results of an annotation projection experiment, in which, in turn, one of the two languages has been used as source language to project annotations into the corresponding parallel sentences. The results of this analysis show clearly that language transfer is asymmetric, an aspect that requires attention when it comes to selecting the source language in transfer learning applications and that needs to be investigated before any scientific claim about cross-lingual NLP is proposed.

[7] TARG: Training-Free Adaptive Retrieval Gating for Efficient RAG

Yufeng Wang,Lu wei,Haibin Ling

Main category: cs.CL

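TL;DR: Proposes TARG, a training-free adaptive retrieval gating method that decides whether to retrieve from a lightweight uncertainty score computed on the model's draft output, substantially reducing retrieval calls and latency while maintaining or improving the accuracy of RAG systems.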

Details Motivation: Conventional RAG systems retrieve for every query, which is inefficient, adds latency, and wastes resources, so a mechanism is needed to decide when retrieval is actually required. Method: TARG computes an uncertainty score from the logits of a short draft prefix generated by the base model, using mean token entropy, a margin signal based on the top-1/top-2 logit gap, or small-N variance across a few stochastic prefixes, and triggers retrieval when the score exceeds a threshold. The method needs no additional training, is model agnostic, and adds only a small number of draft tokens. Result: On NQ-Open, TriviaQA, and PopQA, TARG reduces retrieval by 70-90% compared with Always-RAG and lowers end-to-end latency while matching or improving EM/F1; its overhead stays close to Never-RAG, and the margin signal proves to be a robust default. Conclusion: TARG offers an efficient, general, training-free retrieval gate that balances accuracy and inference efficiency in RAG systems and shifts the accuracy-efficiency frontier. Abstract: Retrieval-Augmented Generation (RAG) improves factuality but retrieving for every query often hurts quality while inflating tokens and latency. We propose Training-free Adaptive Retrieval Gating (TARG), a single-shot policy that decides when to retrieve using only a short, no-context draft from the base model. From the draft's prefix logits, TARG computes lightweight uncertainty scores: mean token entropy, a margin signal derived from the top-1/top-2 logit gap via a monotone link, or small-N variance across a handful of stochastic prefixes, and triggers retrieval only when the score exceeds a threshold. The gate is model agnostic, adds only tens to hundreds of draft tokens, and requires no additional training or auxiliary heads. On NQ-Open, TriviaQA, and PopQA, TARG consistently shifts the accuracy-efficiency frontier: compared with Always-RAG, TARG matches or improves EM/F1 while reducing retrieval by 70-90% and cutting end-to-end latency, and it remains close to Never-RAG in overhead. A central empirical finding is that under modern instruction-tuned LLMs the margin signal is a robust default (entropy compresses as backbones sharpen), with small-N variance offering a conservative, budget-first alternative. We provide ablations over gate type and prefix length and use a delta-latency view to make budget trade-offs explicit.
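The two logit-based scores described above can be sketched as follows. This is a hedged illustration rather than TARG's implementation: the abstract does not specify the monotone link, so `exp(-gap)` is an assumption chosen so that a small top-1/top-2 gap yields a high uncertainty score, and all function names are hypothetical. Input is a list of per-token logit vectors for the draft prefix.

```python
import math


def softmax(logits):
    """Numerically stable softmax over one token's logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]


def mean_token_entropy(prefix_logits):
    """Average Shannon entropy (in nats) over the draft prefix tokens."""
    total = 0.0
    for logits in prefix_logits:
        total += -sum(p * math.log(p) for p in softmax(logits) if p > 0)
    return total / len(prefix_logits)


def margin_score(prefix_logits):
    """Average of exp(-(top1 - top2)); the exp link is an assumed choice."""
    scores = []
    for logits in prefix_logits:
        top1, top2 = sorted(logits, reverse=True)[:2]
        scores.append(math.exp(-(top1 - top2)))
    return sum(scores) / len(scores)


def should_retrieve(prefix_logits, threshold, score_fn=margin_score):
    """Trigger retrieval only when the uncertainty score exceeds the threshold."""
    return score_fn(prefix_logits) > threshold
```

With this link, a draft whose top token leads by 10 logits scores close to 0 and skips retrieval, while a draft with tied top logits scores 1 and triggers it; the threshold would be tuned against a retrieval budget.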

[8] Khmer Spellchecking: A Holistic Approach

Marry Kong,Rina Buoy,Sovisal Chenda,Nguonly Taing

Main category: cs.CL

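TL;DR: This paper proposes a holistic approach to Khmer spellchecking that combines subword segmentation, named-entity recognition, grapheme-to-phoneme conversion, and a language model, reaching accuracy of up to 94.4%, and releases benchmark datasets.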

Details Motivation: Khmer spellchecking faces several challenges: mismatches between the lexicon and the word segmentation model, multiple written forms of the same word, loosely formed compound words, and proper nouns flagged as misspellings. Existing methods do not address these adequately. Method: Integrates Khmer subword segmentation, named-entity recognition (NER), grapheme-to-phoneme (G2P) conversion, and a language model to identify correction candidates and rank the best one. Result: The approach reaches Khmer spellchecking accuracy of up to 94.4%, better than existing solutions, and benchmark datasets for the spellchecking and NER tasks are released. Conclusion: The holistic approach addresses several key challenges in Khmer spellchecking, substantially improves correction accuracy, and has practical value. Abstract: Compared to English and other high-resource languages, spellchecking for Khmer remains an unresolved problem due to several challenges. First, there are misalignments between words in the lexicon and the word segmentation model. Second, a Khmer word can be written in different forms. Third, Khmer compound words are often loosely and easily formed, and these compound words are not always found in the lexicon. Fourth, some proper nouns may be flagged as misspellings due to the absence of a Khmer named-entity recognition (NER) model. Unfortunately, existing solutions do not adequately address these challenges. This paper proposes a holistic approach to the Khmer spellchecking problem by integrating Khmer subword segmentation, Khmer NER, Khmer grapheme-to-phoneme (G2P) conversion, and a Khmer language model to tackle these challenges, identify potential correction candidates, and rank the most suitable candidate. Experimental results show that the proposed approach achieves a state-of-the-art Khmer spellchecking accuracy of up to 94.4%, compared to existing solutions. The benchmark datasets for Khmer spellchecking and NER tasks in this study will be made publicly available.

[9] Improving Graduate Outcomes by Identifying Skills Gaps and Recommending Courses Based on Career Interests

Rahul Soni,Basem Suleiman,Sonit Singh

Main category: cs.CL

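TL;DR: This paper proposes a course recommendation system that combines data mining, machine learning, and user preferences to bridge the gap between university education and industry needs, making course selection more targeted and better supporting career development.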

Details Motivation: Students currently choose courses without guidance tied to industry trends, so the skills they learn drift from what the job market needs; an intelligent, personalized course recommendation tool is needed to connect education with career development. Method: Uses data mining, collaborative filtering, and machine learning algorithms together with user preferences, academic criteria, and career goals to design and implement an algorithmic framework, and refines the front-end interface through iterative prototyping and user feedback. Result: A course recommendation system with a good user experience that gives personalized recommendations based on past course data and career plans and is refined through user feedback. Conclusion: The system helps students make data-driven, industry-informed course decisions, benefits graduate employability and the quality of higher education, and can support students, instructors, and career advisers. Abstract: This paper aims to address the challenge of selecting relevant courses for students by proposing the design and development of a course recommendation system. The course recommendation system utilises a combination of data analytics techniques and machine learning algorithms to recommend courses that align with current industry trends and requirements. In order to provide customised suggestions, the study entails the design and implementation of an extensive algorithmic framework that combines machine learning methods, user preferences, and academic criteria. The system employs data mining and collaborative filtering techniques to examine past courses and individual career goals in order to provide course recommendations. Moreover, to improve the accessibility and usefulness of the recommendation system, special attention is given to the development of an easy-to-use front-end interface. The front-end design prioritises visual clarity, interaction, and simplicity through iterative prototyping and user input revisions, guaranteeing a smooth and captivating user experience. We refined and optimised the proposed system by incorporating user feedback, ensuring that it effectively meets the needs and preferences of its target users. The proposed course recommendation system could be a useful tool for students, instructors, and career advisers to use in promoting lifelong learning and professional progression as it fills the gap between university learning and industry expectations.
We hope that the proposed course recommendation system will help university students make data-driven and industry-informed course decisions, in turn improving graduate outcomes for the university sector.

[10] Answering Students' Questions on Course Forums Using Multiple Chain-of-Thought Reasoning and Finetuning RAG-Enabled LLM

Neo Wang,Sonit Singh

Main category: cs.CL

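TL;DR: This paper proposes a course question answering system built on an open-source large language model with retrieval-augmented generation (RAG), combining a local knowledge base with multi chain-of-thought reasoning to cope with growing numbers of student questions and with model hallucination; it performs well on the HotpotQA dataset.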

Details Motivation: As course enrollment grows, forums fill with repetitive questions that instructors cannot answer promptly, so an automated question answering system is needed to reduce the load. Method: Uses an open-source large language model with retrieval-augmented generation (RAG), retrieving documents from a local knowledge base of course content, adds multi chain-of-thought reasoning to reduce hallucination, and fine-tunes the model on a relevant course dataset. Result: Experiments on the HotpotQA dataset show that the fine-tuned LLM with RAG performs strongly on question answering. Conclusion: The approach can improve the accuracy and reliability of automatic answers on course forums, ease the pressure on instructors, and has practical potential. Abstract: Course forums are increasingly significant and play a vital role in facilitating student discussions and answering their questions related to the course. They provide a platform for students to post questions about course content and administrative issues. However, several challenges arise as the number of students enrolled in a course increases. The primary challenge is that students' queries cannot be answered immediately and instructors face many repetitive questions. To mitigate these issues, we propose a question answering system based on a large language model with the retrieval-augmented generation (RAG) method. This work focuses on designing a question answering system with an open-source Large Language Model (LLM) and fine-tuning it on the relevant course dataset. To further improve performance, we use a local knowledge base and apply the RAG method to retrieve documents relevant to students' queries, where the local knowledge base contains all the course content. To mitigate hallucination in LLMs, we also integrate multi chain-of-thought reasoning. In this work, we experiment with the fine-tuned LLM with the RAG method on the HotpotQA dataset. The experimental results demonstrate that the fine-tuned LLM with the RAG method performs strongly on the question answering task.

Yidan Sun,Mengying Zhu,Feiyue Chen,Yangyang Wu,Xiaolei Dan,Mengyuan Yang,Xiaolin Zheng,Shenglin Ben

Main category: cs.CL

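TL;DR: Proposes TermGPT, a multi-level contrastive fine-tuning framework that improves how well large language models discriminate terminology representations in the financial and legal domains.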

Details Motivation: The embedding spaces of large language models suffer from the isotropy problem, which leads to poor discrimination of domain terminology (especially legal and financial) and hurts downstream tasks. Method: Builds a sentence graph to capture semantic and structural relations and generates positive and negative samples from contextual and topological cues; designs multi-level contrastive learning at the sentence and token levels to strengthen global understanding and fine-grained term discrimination; constructs the first financial terminology dataset derived from official regulatory documents. Result: Experiments show that TermGPT outperforms existing baselines on term discrimination tasks in the financial and legal domains. Conclusion: TermGPT mitigates the isotropy problem in LLM embedding spaces and markedly improves the quality of domain terminology representations, which helps downstream tasks such as legal judgment prediction and financial risk analysis. Abstract: Large language models (LLMs) have demonstrated impressive performance in text generation tasks; however, their embedding spaces often suffer from the isotropy problem, resulting in poor discrimination of domain-specific terminology, particularly in legal and financial contexts. This weakness in terminology-level representation can severely hinder downstream tasks such as legal judgment prediction or financial risk analysis, where subtle semantic distinctions are critical. To address this problem, we propose TermGPT, a multi-level contrastive fine-tuning framework designed for terminology adaptation. We first construct a sentence graph to capture semantic and structural relations, and generate semantically consistent yet discriminative positive and negative samples based on contextual and topological cues. We then devise a multi-level contrastive learning approach at both the sentence and token levels, enhancing global contextual understanding and fine-grained terminology discrimination. To support robust evaluation, we construct the first financial terminology dataset derived from official regulatory documents. Experiments show that TermGPT outperforms existing baselines in term discrimination tasks within the finance and legal domains.

[12] In-Token Rationality Optimization: Towards Accurate and Concise LLM Reasoning via Self-Feedback

Mingye Zhu,Yi Liu,Zheren Fu,Quan Wang,Yongdong Zhang

Main category: cs.CL

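TL;DR: This paper proposes InTRO (In-Token Rationality Optimization), a framework that makes chain-of-thought reasoning in large language models more accurate and concise through token-level exploration and self-feedback; it clearly outperforms baselines on several math reasoning tasks and transfers across domains.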

Details Motivation: Supervised fine-tuning on a single "golden" rationale limits generalization, while reinforcement learning with verifiable rewards suffers from difficult credit assignment and high computational cost. Method: InTRO introduces correction factors, token-wise importance weights estimated from the information discrepancy between the generative policy and its answer-conditioned counterpart, to guide the selection of key tokens and achieve token-level exploration and self-feedback within a single forward pass. Result: On six math reasoning benchmarks, InTRO raises solution accuracy by up to 20% relative to the base model, produces more concise reasoning, and transfers to reasoning tasks outside mathematics. Conclusion: Through fine-grained token-level optimization, InTRO improves the performance and generalization of large models on complex reasoning tasks and offers an efficient, practical approach to chain-of-thought training. Abstract: Training Large Language Models (LLMs) for chain-of-thought reasoning presents a significant challenge: supervised fine-tuning on a single "golden" rationale hurts generalization as it penalizes equally valid alternatives, whereas reinforcement learning with verifiable rewards struggles with credit assignment and prohibitive computational cost. To tackle these limitations, we introduce InTRO (In-Token Rationality Optimization), a new framework that enables both token-level exploration and self-feedback for accurate and concise reasoning. Instead of directly optimizing an intractable objective over all valid reasoning paths, InTRO leverages correction factors, token-wise importance weights estimated by the information discrepancy between the generative policy and its answer-conditioned counterpart, for informative next-token selection. This approach allows the model to perform token-level exploration and receive self-generated feedback within a single forward pass, ultimately encouraging accurate and concise rationales. Across six math-reasoning benchmarks, InTRO consistently outperforms other baselines, raising solution accuracy by up to 20% relative to the base model. Its chains of thought are also notably more concise, exhibiting reduced verbosity. Beyond this, InTRO enables cross-domain transfer, successfully adapting to out-of-domain reasoning tasks that extend beyond the realm of mathematics, demonstrating robust generalization.

[13] HierRouter: Coordinated Routing of Specialized Large Language Models via Reinforcement Learning

Nikunj Gupta,Bill Guo,Rajgopal Kannan,Viktor K. Prasanna

Main category: cs.CL

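TL;DR: Proposes HierRouter, a hierarchical routing method that uses reinforcement learning to dynamically select lightweight language models for multi-hop inference, markedly improving response quality while keeping compute cost low.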

Details Motivation: Large language models perform well but carry high compute and memory costs, which makes them hard to deploy in resource-constrained or real-time settings, so an efficient, low-cost inference method is needed. Method: Formulates hierarchical routing as a finite-horizon Markov Decision Process (MDP) and trains a Proximal Policy Optimization (PPO) reinforcement learning agent that chooses which lightweight model to invoke at each step based on the context and the accumulated cost. Result: Experiments on six benchmarks show that HierRouter improves response quality by up to 2.4x compared with using individual models independently, while adding only minimal inference cost on average. Conclusion: HierRouter shows that hierarchical routing can deliver efficient, high-performance LLM inference and offers an effective option for deploying models in resource-constrained environments. Abstract: Large Language Models (LLMs) deliver state-of-the-art performance across many tasks but impose high computational and memory costs, limiting their deployment in resource-constrained or real-time settings. To address this, we propose HierRouter, a hierarchical routing approach that dynamically assembles inference pipelines from a pool of specialized, lightweight language models. Formulated as a finite-horizon Markov Decision Process (MDP), our approach trains a Proximal Policy Optimization (PPO)-based reinforcement learning agent to iteratively select which models to invoke at each stage of multi-hop inference. The agent conditions on the evolving context and accumulated cost to make context-aware routing decisions. Experiments with three open-source candidate LLMs across six benchmarks, including QA, code generation, and mathematical reasoning, show that HierRouter improves response quality by up to 2.4x compared to using individual models independently, while incurring only a minimal additional inference cost on average. These results highlight the promise of hierarchical routing for cost-efficient, high-performance LLM inference. All code can be found at https://github.com/Nikunj-Gupta/hierouter.

[14] EnchTable: Unified Safety Alignment Transfer in Fine-tuned Large Language Models

Jialin Wu,Kecen Li,Zhicong Huang,Xinfeng Li,Xiaofeng Wang,Cheng Hong

Main category: cs.CL

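TL;DR: This paper proposes EnchTable, a framework that preserves safety alignment in downstream large language models without retraining, using NTK-based safety vector distillation and interference-aware merging to balance safety and performance across multiple tasks and model architectures.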

Details Motivation: Fine-tuning large language models often degrades safety alignment and raises the risk of harmful outputs, so an efficient alignment method that preserves safety is needed. Method: Proposes the EnchTable framework, which uses Neural Tangent Kernel (NTK)-based safety vector distillation to decouple safety constraints from task reasoning and an interference-aware merging technique to balance safety and utility. Result: Validated on three task domains, three LLM architectures, and eleven datasets, EnchTable achieves a lower unsafe rate and higher utility score than six parameter modification methods and two inference-time alignment baselines, and resists jailbreaking attacks. Conclusion: EnchTable effectively transfers and maintains safety alignment in downstream LLMs; it is general, robust, and cheap to deploy, and suits real-world settings with multiple vendors and architectures. Abstract: Many machine learning models are fine-tuned from large language models (LLMs) to achieve high performance in specialized domains like code generation, biomedical analysis, and mathematical problem solving. However, this fine-tuning process often introduces a critical vulnerability: the systematic degradation of safety alignment, undermining ethical guidelines and increasing the risk of harmful outputs. Addressing this challenge, we introduce EnchTable, a novel framework designed to transfer and maintain safety alignment in downstream LLMs without requiring extensive retraining. EnchTable leverages a Neural Tangent Kernel (NTK)-based safety vector distillation method to decouple safety constraints from task-specific reasoning, ensuring compatibility across diverse model architectures and sizes. Additionally, our interference-aware merging technique effectively balances safety and utility, minimizing performance compromises across various task domains. We implemented a fully functional prototype of EnchTable on three different task domains and three distinct LLM architectures, and evaluated its performance through extensive experiments on eleven diverse datasets, assessing both utility and model safety. Our evaluations include LLMs from different vendors, demonstrating EnchTable's generalization capability. Furthermore, EnchTable exhibits robust resistance to static and dynamic jailbreaking attacks, outperforming vendor-released safety models in mitigating adversarial prompts.
Comparative analyses with six parameter modification methods and two inference-time alignment baselines reveal that EnchTable achieves a significantly lower unsafe rate, higher utility score, and universal applicability across different task domains. Additionally, we validate EnchTable can be seamlessly integrated into various deployment pipelines without significant overhead.

[15] HI-TransPA: Hearing Impairments Translation Personal Assistant

Zhiming Ma,Shiyu Gan,Junhao Zhao,Xianming Li,Qingyun Pan,Peidong Wang,Mingjun Pan,Yuhao Mo,Jiajie Cheng,Chengxin Chen,Zhonglun Cao,Chonghan Liu,Shi Cheng

Main category: cs.CL

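TL;DR: This paper presents HI-TransPA, an audio-visual personal assistant for hearing-impaired people based on the Omni-Model paradigm, which fuses indistinct speech with high-frame-rate lip motion to provide translation and dialogue in a single framework.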

Details Motivation: Existing Omni-Models adapt poorly to the indistinct speech of hearing-impaired people, and the raw data are noisy and heterogeneous with no unified, efficient processing framework, so a robust multimodal system designed for hearing-impaired communication is needed. Method: Proposes the HI-TransPA model, which combines a SigLIP encoder with a Unified 3D-Resampler to encode high-frame-rate lip motion efficiently; builds a preprocessing pipeline with facial landmark detection, lip-region stabilization, and multimodal quality assessment; and trains the model progressively with a curriculum learning strategy driven by the quality scores. Result: Experiments on the purpose-built HI-Dialogue dataset show that HI-TransPA reaches state-of-the-art literal accuracy and semantic fidelity. Conclusion: The work lays a foundation for applying Omni-Models to assistive communication technology, provides an end-to-end modeling framework and key data processing tools, and advances research on intelligent interaction for hearing-impaired people. Abstract: To provide a unified and flexible solution for daily communication among hearing-impaired individuals, we introduce the Omni-Model paradigm into assistive technology and present HI-TransPA, an instruction-driven audio-visual personal assistant. The model fuses indistinct speech with high-frame-rate lip dynamics, enabling both translation and dialogue within a single multimodal framework. To tackle the challenges of noisy and heterogeneous raw data and the limited adaptability of existing Omni-Models to hearing-impaired speech, we construct a comprehensive preprocessing and curation pipeline that detects facial landmarks, isolates and stabilizes the lip region, and quantitatively assesses multimodal sample quality. These quality scores guide a curriculum learning strategy that first trains on clean, high-confidence samples and progressively incorporates harder cases to strengthen model robustness. We further adopt a SigLIP encoder combined with a Unified 3D-Resampler to efficiently encode high-frame-rate lip motion. Experiments on our purpose-built HI-Dialogue dataset show that HI-TransPA achieves state-of-the-art performance in both literal accuracy and semantic fidelity. This work establishes a foundation for applying Omni-Models to assistive communication technology, providing an end-to-end modeling framework and essential processing tools for future research.
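A quality-guided curriculum of the kind described above could be scheduled roughly as follows. This is only a sketch under assumptions: the abstract does not say how stages are defined, so the equal-sized cumulative stages, the `(sample_id, quality)` input format, and the function name are all hypothetical.

```python
import math


def curriculum_stages(samples, n_stages=3):
    """Build cumulative training pools from cleanest to hardest samples.

    samples: list of (sample_id, quality) pairs, higher quality = cleaner.
    Returns one list of sample ids per stage; each stage contains the
    previous one plus the next-lower quality slice (assumed schedule).
    """
    ranked = sorted(samples, key=lambda s: s[1], reverse=True)
    stages = []
    for k in range(1, n_stages + 1):
        cutoff = math.ceil(len(ranked) * k / n_stages)
        stages.append([sample_id for sample_id, _ in ranked[:cutoff]])
    return stages
```

Training would then iterate over the stages in order, so early updates see only high-confidence samples and the final stage covers the full dataset.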

[16] MINDS: A Cross-cultural Dialogue Corpus for Social Norm Classification and Adherence Detection

Pritish Sahu,Anirudh Som,Dimitra Vergyri,Ajay Divakaran

Main category: cs.CL

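TL;DR: This paper proposes Norm-RAG, a retrieval-augmented agentic framework for social norm inference in multi-turn dialogues, and introduces the bilingual MINDS dataset to improve norm recognition in cross-cultural settings and the social intelligence of dialogue systems.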

Details Motivation: Existing work mostly annotates norms in isolated utterances or synthetic dialogues and struggles to capture the dynamic, context-dependent, cross-cultural expression of norms in real multi-turn conversations, so finer-grained, context-sensitive computational models are needed. Method: Proposes the Norm-RAG framework, which retrieves structured normative documentation through semantic chunking and models utterance-level communicative intent, speaker roles, interpersonal framing, and linguistic cues for interpretable, context-aware norm reasoning; also builds the MINDS dataset of 31 multi-turn Mandarin-English and Spanish-English conversations, with each turn annotated for norm category and adherence status. Result: Experiments show that Norm-RAG performs better at norm detection and generalization, with stronger adaptability and reasoning in cross-lingual and cross-cultural dialogues. Conclusion: By combining retrieval augmentation with multi-dimensional utterance modeling, Norm-RAG supports social norm understanding in multi-turn, multilingual dialogues and opens a path toward culturally sensitive, socially intelligent dialogue systems. Abstract: Social norms are implicit, culturally grounded expectations that guide interpersonal communication. Unlike factual commonsense, norm reasoning is subjective, context-dependent, and varies across cultures, posing challenges for computational models. Prior works provide valuable normative annotations but mostly target isolated utterances or synthetic dialogues, limiting their ability to capture the fluid, multi-turn nature of real-world conversations. In this work, we present Norm-RAG, a retrieval-augmented, agentic framework for nuanced social norm inference in multi-turn dialogues. Norm-RAG models utterance-level attributes including communicative intent, speaker roles, interpersonal framing, and linguistic cues and grounds them in structured normative documentation retrieved via a novel Semantic Chunking approach. This enables interpretable and context-aware reasoning about norm adherence and violation across multilingual dialogues. We further introduce MINDS (Multilingual Interactions with Norm-Driven Speech), a bilingual dataset comprising 31 multi-turn Mandarin-English and Spanish-English conversations. Each turn is annotated for norm category and adherence status using multi-annotator consensus, reflecting cross-cultural and realistic norm expression. Our experiments show that Norm-RAG improves norm detection and generalization, demonstrating improved performance for culturally adaptive and socially intelligent dialogue systems.

[17] Leveraging Large Language Models for Identifying Knowledge Components

Canwen Wang,Jionghao Lin,Kenneth R. Koedinger

Main category: cs.CL

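TL;DR: This study examines the feasibility of using large language models (LLMs) to automatically generate knowledge component (KC) labels and proposes a cosine-similarity-based semantic merging method to reduce redundancy. On a larger dataset of 646 multiple-choice questions, the raw LLM approach produced too many KCs and fit worse (569 KCs, RMSE 0.4285) than the expert-designed model (101 KCs, RMSE 0.4206). Merging at a cosine similarity threshold of 0.8 reduced the KC count to 428 and improved the RMSE to 0.4259, suggesting that LLM generation combined with semantic merging can advance automated KC construction.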

Details Motivation: Manually identifying knowledge components (KCs) is time-consuming, depends on domain experts, and is a bottleneck for adaptive learning systems. LLMs could automate the process, but prior work was limited to small datasets and tended to produce redundant KC labels, so a scalable and precise automated approach is needed. Method: GPT-4o-mini is applied with a "simulated textbook" prompting strategy to a larger dataset of 646 multiple-choice questions to generate KC labels; the authors then propose and evaluate a method that merges semantically similar KC labels based on the cosine similarity of their vector representations, varying the threshold to assess the effect on KC count and predictive performance. Result: The initial LLM run produced 569 KCs with an RMSE of 0.4285, worse than the expert model (101 KCs, RMSE 0.4206); with merging, a threshold of 0.8 worked best, reducing the KC count to 428 and improving the RMSE to 0.4259, significantly better than the unmerged model. Conclusion: Scaling LLM generation alone does not surpass the expert-designed KC model, but adding semantic label merging improves the quality of automated KC construction and offers a viable path toward efficient, scalable knowledge-structure modeling. Abstract: Knowledge Components (KCs) are foundational to adaptive learning systems, but their manual identification by domain experts is a significant bottleneck. While Large Language Models (LLMs) offer a promising avenue for automating this process, prior research has been limited to small datasets and has been shown to produce superfluous, redundant KC labels. This study addresses these limitations by first scaling a "simulated textbook" LLM prompting strategy (using GPT-4o-mini) to a larger dataset of 646 multiple-choice questions. We found that this initial automated approach performed significantly worse than an expert-designed KC model (RMSE 0.4285 vs. 0.4206) and generated an excessive number of KCs (569 vs. 101). To address the issue of redundancy, we proposed and evaluated a novel method for merging semantically similar KC labels based on their cosine similarity. This merging strategy significantly improved the model's performance; a model using a cosine similarity threshold of 0.8 achieved the best result, reducing the KC count to 428 and improving the RMSE to 0.4259. This demonstrates that while scaled LLM generation alone is insufficient, combining it with a semantic merging technique offers a viable path toward automating and refining KC identification.
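
The label-merging idea can be illustrated with a small sketch. This is not the authors' implementation: the embedding vectors are toy values, the greedy single-pass grouping is an assumption (the abstract does not say how merged groups are formed), and `merge_labels` is a hypothetical helper name. Only the 0.8 threshold comes from the abstract.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def merge_labels(embeddings, threshold=0.8):
    """Greedily group labels whose embedding has cosine similarity of at
    least `threshold` with a group's first member (assumed procedure)."""
    groups = []  # each group is a list of label strings
    for label, vec in embeddings.items():
        for group in groups:
            if cosine(vec, embeddings[group[0]]) >= threshold:
                group.append(label)
                break
        else:
            # No existing group was similar enough: start a new one.
            groups.append([label])
    return groups

# Toy 3-dimensional embeddings (hypothetical values, not real model output).
toy = {
    "add fractions": [1.0, 0.1, 0.0],
    "adding fractions": [0.9, 0.2, 0.0],
    "solve linear equation": [0.0, 0.1, 1.0],
}
merged = merge_labels(toy, threshold=0.8)
print(merged)
```

Raising the threshold merges fewer labels and keeps more KCs; lowering it merges more aggressively, which is the trade-off the study explores when it compares thresholds.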

[18] REAP: Enhancing RAG with Recursive Evaluation and Adaptive Planning for Multi-Hop Question Answering

Yijie Zhu,Haojie Zhou,Wanting Hong,Tailin Liu,Ning Wang

Main category: cs.CL

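TL;DR: Proposes REAP, a recursive evaluation and adaptive planning method that uses sub-task planning and fact extraction modules to improve global planning and content utilization in retrieval-augmented generation for multi-hop reasoning.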

Details Motivation: Existing retrieval-augmented generation methods lack global planning in multi-hop reasoning tasks and can get stuck in local reasoning impasses; they also under-use retrieved content and latent clues, which hurts reasoning accuracy. Method: REAP introduces a Sub-task Planner (SP) and a Fact Extractor (FE): SP maintains a global perspective and dynamically optimizes the reasoning trajectory, while FE extracts reliable information from retrieved content at a fine-grained level. Together they build a coherent representation of global knowledge, and a unified task paradigm enables multi-task fine-tuning. Result: Experiments on multiple public multi-hop datasets show that REAP significantly outperforms existing RAG methods in both in-domain and out-of-domain settings. Conclusion: By managing structured sub-tasks and facts, REAP improves the reliability and traceability of reasoning in complex multi-hop tasks. Abstract: Retrieval-augmented generation (RAG) has been extensively employed to mitigate hallucinations in large language models (LLMs). However, existing methods for multi-hop reasoning tasks often lack global planning, increasing the risk of falling into local reasoning impasses. Insufficient exploitation of retrieved content and the neglect of latent clues fail to ensure the accuracy of reasoning outcomes. To overcome these limitations, we propose Recursive Evaluation and Adaptive Planning (REAP), whose core idea is to explicitly maintain structured sub-tasks and facts related to the current task through the Sub-task Planner (SP) and Fact Extractor (FE) modules. SP maintains a global perspective, guiding the overall reasoning direction and evaluating the task state based on the outcomes of FE, enabling dynamic optimization of the task-solving trajectory. FE performs fine-grained analysis over retrieved content to extract reliable answers and clues. These two modules incrementally enrich a logically coherent representation of global knowledge, enhancing the reliability and the traceability of the reasoning process. Furthermore, we propose a unified task paradigm design that enables effective multi-task fine-tuning, significantly enhancing SP's performance on complex, data-scarce tasks. We conduct extensive experiments on multiple public multi-hop datasets, and the results demonstrate that our method significantly outperforms existing RAG methods in both in-domain and out-of-domain settings, validating its effectiveness in complex multi-hop reasoning tasks.

[19] NumPert: Numerical Perturbations to Probe Language Models for Veracity Prediction

Peter Røysland Aarnes,Vinay Setty

Main category: cs.CL

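TL;DR: This paper systematically evaluates state-of-the-art large language models on veracity prediction for numerical claims and finds that even leading models lose substantial accuracy under certain perturbations, and that no model is robust across all conditions. Longer context generally reduces accuracy, but adding perturbed demonstrations helps models recover, indicating that robustness in numerical fact-checking remains an open challenge.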

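Details Motivation: Large language models perform well on knowledge-intensive tasks but struggle with numerical reasoning. To expose how fragile current models are when handling numerical claims, the study tests their robustness with systematic perturbations. Method: State-of-the-art models are evaluated with controlled perturbations, including label-flipping probes, on veracity prediction for numerical claim and evidence pairs, and the effects of context length and perturbed demonstrations are examined. Result: Leading models see accuracy drops of up to 62% under certain perturbations, and none is robust across all conditions; longer context generally reduces accuracy, but most models substantially recover when the extended context includes perturbed demonstrations. Conclusion: Current large language models have serious robustness problems in numerical fact-checking and need improvement to handle perturbations in numerical reasoning. Abstract: Large language models show strong performance on knowledge-intensive tasks such as fact-checking and question answering, yet they often struggle with numerical reasoning. We present a systematic evaluation of state-of-the-art models for veracity prediction on numerical claims and evidence pairs using controlled perturbations, including label-flipping probes, to test robustness. Our results indicate that even leading proprietary systems experience accuracy drops of up to 62% under certain perturbations. No model proves to be robust across all conditions. We further find that increasing context length generally reduces accuracy, but when extended context is enriched with perturbed demonstrations, most models substantially recover. These findings highlight critical limitations in numerical fact-checking and suggest that robustness remains an open challenge for current language models.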

Bo Li,Tian Tian,Zhenghua Xu,Hao Cheng,Shikun Zhang,Wei Ye

Main category: cs.CL

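TL;DR: Proposes ETC, a training-free method for controlling dynamic retrieval timing that models trends in token-level uncertainty to trigger retrieval earlier and more precisely in retrieval-augmented generation.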

Details Motivation: Existing methods that trigger retrieval on low confidence often intervene only after errors have propagated, so retrieval is not triggered in time. Method: The Entropy-Trend Constraint (ETC) method uses first- and second-order differences of the entropy sequence to detect uncertainty trends and dynamically decide when to retrieve. Result: ETC is validated on six QA benchmarks with three large language models, improving performance over strong baselines while reducing retrieval frequency, and it is particularly effective in domain-specific scenarios. Conclusion: ETC is a plug-and-play, model-agnostic method that makes better retrieval-timing decisions through trend-aware uncertainty modeling and integrates readily into existing pipelines. Abstract: Dynamic retrieval-augmented generation (RAG) allows large language models (LLMs) to fetch external knowledge on demand, offering greater adaptability than static RAG. A central challenge in this setting lies in determining the optimal timing for retrieval. Existing methods often trigger retrieval based on low token-level confidence, which may lead to delayed intervention after errors have already propagated. We introduce Entropy-Trend Constraint (ETC), a training-free method that determines optimal retrieval timing by modeling the dynamics of token-level uncertainty. Specifically, ETC utilizes first- and second-order differences of the entropy sequence to detect emerging uncertainty trends, enabling earlier and more precise retrieval. Experiments on six QA benchmarks with three LLM backbones demonstrate that ETC consistently outperforms strong baselines while reducing retrieval frequency. ETC is particularly effective in domain-specific scenarios, exhibiting robust generalization capabilities. Ablation studies and qualitative analyses further confirm that trend-aware uncertainty modeling yields more effective retrieval timing. The method is plug-and-play, model-agnostic, and readily integrable into existing decoding pipelines. Implementation code is included in the supplementary materials.
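
To make the "first- and second-order differences of the entropy sequence" concrete, here is a minimal sketch. The abstract does not give the actual trigger rule, so `should_retrieve` and its threshold are assumptions chosen for illustration, and the next-token distributions are toy values rather than model output.

```python
import math

def entropy(probs):
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def differences(seq):
    """Differences between consecutive elements of a sequence."""
    return [b - a for a, b in zip(seq, seq[1:])]

def should_retrieve(entropies, threshold=0.2):
    """Hypothetical trigger: fire when the most recent second-order
    difference of the entropy sequence exceeds `threshold`."""
    d1 = differences(entropies)   # first-order: how fast entropy changes
    d2 = differences(d1)          # second-order: whether the change accelerates
    return bool(d2) and d2[-1] > threshold

# Toy distributions over a 4-token vocabulary, one per decoding step.
steps = [
    [0.97, 0.01, 0.01, 0.01],
    [0.94, 0.02, 0.02, 0.02],
    [0.70, 0.10, 0.10, 0.10],
    [0.25, 0.25, 0.25, 0.25],
]
H = [entropy(p) for p in steps]
print(should_retrieve(H[:3]))  # checks the trend after the third step
```

In this toy run the trigger fires at the third step, where uncertainty starts to accelerate, rather than waiting for the fourth step, where entropy is at its maximum. That is the intuition behind acting on the trend instead of on a raw confidence threshold.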

[21] Language Drift in Multilingual Retrieval-Augmented Generation: Characterization and Decoding-Time Mitigation

Bo Li,Zhenghua Xu,Rui Xie

Main category: cs.CL

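TL;DR: This paper studies output language drift in multilingual retrieval-augmented generation (RAG), finds that it stems from decoder-level collapse rather than comprehension failure, and proposes Soft Constrained Decoding (SCD), a lightweight, training-free method to mitigate it.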

Details Motivation: In multilingual RAG, when the retrieved documents, user query, and in-context exemplars are in different languages, the model tends to respond in an unintended language, and the drift is more severe during Chain-of-Thought reasoning. Method: Language drift is analyzed through controlled experiments across multiple datasets, languages, and large language models, and Soft Constrained Decoding (SCD) is proposed to steer generation by penalizing non-target-language tokens. Result: Experiments show that SCD improves language alignment and task performance in multilingual RAG and applies across generation algorithms and model architectures. Conclusion: SCD is a general, efficient, training-free solution that substantially reduces language drift in multilingual RAG. Abstract: Multilingual Retrieval-Augmented Generation (RAG) enables large language models (LLMs) to perform knowledge-intensive tasks in multilingual settings by leveraging retrieved documents as external evidence. However, when the retrieved evidence differs in language from the user query and in-context exemplars, the model often exhibits language drift by generating responses in an unintended language. This phenomenon is especially pronounced during reasoning-intensive decoding, such as Chain-of-Thought (CoT) generation, where intermediate steps introduce further language instability. In this paper, we systematically study output language drift in multilingual RAG across multiple datasets, languages, and LLM backbones. Our controlled experiments reveal that the drift results not from comprehension failure but from decoder-level collapse, where dominant token distributions and high-frequency English patterns dominate the intended generation language. We further observe that English serves as a semantic attractor under cross-lingual conditions, emerging as both the strongest interference source and the most frequent fallback language. To mitigate this, we propose Soft Constrained Decoding (SCD), a lightweight, training-free decoding strategy that gently steers generation toward the target language by penalizing non-target-language tokens. SCD is model-agnostic and can be applied to any generation algorithm without modifying the architecture or requiring additional data. Experiments across three multilingual datasets and multiple typologically diverse languages show that SCD consistently improves language alignment and task performance, providing an effective and generalizable solution in multilingual RAG.
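
A soft penalty on non-target-language tokens can be sketched as a simple logit adjustment. This is only an illustration under stated assumptions: the fixed additive `penalty`, the boolean token mask, and the four-word vocabulary are all hypothetical, and the abstract does not describe how SCD identifies non-target tokens or sizes the penalty.

```python
import math

def soft_constrain(logits, is_target_language, penalty=2.0):
    """Subtract `penalty` from the logit of every non-target-language token.
    Tokens stay reachable (soft constraint) rather than being masked out."""
    return [
        logit if ok else logit - penalty
        for logit, ok in zip(logits, is_target_language)
    ]

def softmax(logits):
    """Numerically stable softmax."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Toy setup: the target language is French, and English tokens compete.
vocab = ["the", "le", "chat", "cat"]
target_mask = [False, True, True, False]  # hypothetical per-token language mask
logits = [2.0, 1.5, 0.5, 1.0]

before = softmax(logits)
after = softmax(soft_constrain(logits, target_mask))
print(vocab[before.index(max(before))], "->", vocab[after.index(max(after))])
```

Because the adjustment happens on the logits, the same step can sit in front of greedy search, sampling, or beam search, which matches the abstract's claim that the strategy is independent of the generation algorithm.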

[22] FinNuE: Exposing the Risks of Using BERTScore for Numerical Semantic Evaluation in Finance

Yu-Shiang Huang,Yun-Yu Lee,Tzu-Hsin Chou,Che Lin,Chuan-Ju Wang

Main category: cs.CL

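TL;DR: BERTScore is insensitive to numerical variation, a problem that is especially acute in finance; the paper introduces the FinNuE dataset to evaluate BERTScore under numerical semantic changes and exposes the limitations of existing embedding-based metrics in financial NLP.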

Details Motivation: BERTScore is widely used to measure semantic similarity between natural language sentences, but its low sensitivity to numerical variation can cause serious misjudgments in financial settings, so its shortcomings on numerical semantic differences need study. Method: The authors build FinNuE, a financial text dataset with controlled numerical perturbations covering earnings calls, regulatory filings, social media, and news articles, and use it to evaluate BERTScore. Result: BERTScore fails to distinguish numerical changes with important financial consequences and often assigns high similarity scores to semantically divergent financial text pairs. Conclusion: Existing embedding-based metrics have fundamental limitations for financial NLP, and numerically aware evaluation frameworks are needed. Abstract: BERTScore has become a widely adopted metric for evaluating semantic similarity between natural language sentences. However, we identify a critical limitation: BERTScore exhibits low sensitivity to numerical variation, a significant weakness in finance where numerical precision directly affects meaning (e.g., distinguishing a 2% gain from a 20% loss). We introduce FinNuE, a diagnostic dataset constructed with controlled numerical perturbations across earnings calls, regulatory filings, social media, and news articles. Using FinNuE, we demonstrate that BERTScore fails to distinguish semantically critical numerical differences, often assigning high similarity scores to financially divergent text pairs. Our findings reveal fundamental limitations of embedding-based metrics for finance and motivate numerically-aware evaluation frameworks for financial NLP.

[23] PustakAI: Curriculum-Aligned and Interactive Textbooks Using Large Language Models

Shivam Sharma,Riya Naik,Tejas Gawas,Heramb Patil,Kunal Korgaonkar

Main category: cs.CL

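TL;DR: This paper presents PustakAI, a framework for designing and evaluating "NCERT-QA", a question-answering dataset aligned with India's NCERT curriculum for English and Science in grades 6 to 8, and uses several prompting techniques to assess how suitable open-source and high-end large language models are for formal education systems.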

Details Motivation: Adapting large language models to a specific curriculum such as India's NCERT syllabus raises challenges of accuracy, alignment, and pedagogical relevance, and is driven by the need for personalized learning in regions with limited teaching resources. Method: The NCERT-QA dataset is built with QA pairs classified as Factoid, Inferential, and Others (evaluative and reasoning), and experiments use meta-prompt, few-shot, and CoT-style prompting with diverse evaluation metrics. Result: Gemma3:1b, Llama3.2:3b, Nemotron-mini:4b, Llama-4-Scout-17B, and Deepseek-r1-70B are evaluated under different prompting strategies, with analysis of their effectiveness and limitations for curriculum alignment and educational use. Conclusion: The PustakAI framework and NCERT-QA dataset help improve the curriculum alignment of large language models in formal education, and the findings offer practical guidance for choosing models and prompting strategies to support teaching. Abstract: Large Language Models (LLMs) have demonstrated remarkable capabilities in understanding and generating human-like content. This has revolutionized various sectors such as healthcare, software development, and education. In education, LLMs offer potential for personalized and interactive learning experiences, especially in regions with limited teaching resources. However, adapting these models effectively to curriculum-specific content, such as the National Council of Educational Research and Training (NCERT) syllabus in India, presents unique challenges in terms of accuracy, alignment, and pedagogical relevance. In this paper, we present the framework "PustakAI" (Pustak means "book" in many Indian languages) for the design and evaluation of a novel question-answering dataset "NCERT-QA" aligned with the NCERT curriculum for English and Science subjects of grades 6 to 8. We classify the curated QA pairs as Factoid, Inferential, and Others (evaluative and reasoning). We evaluate the dataset with various prompting techniques, such as meta-prompt, few-shot, and CoT-style prompting, using diverse evaluation metrics to understand which approach aligns more efficiently with the structure and demands of the curriculum. Along with the usability of the dataset, we analyze the strengths and limitations of current open-source LLMs (Gemma3:1b, Llama3.2:3b, and Nemotron-mini:4b) and high-end LLMs (Llama-4-Scout-17B and Deepseek-r1-70B) as AI-based learning tools in formal education systems.

[24] ScaleFormer: Span Representation Cumulation for Long-Context Transformer

Jiangshu Du,Wenpeng Yin,Philip Yu

Main category: cs.CL

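TL;DR: ScaleFormer is a plug-and-play method that requires no architectural changes and improves a pre-trained Transformer's ability to handle long text through overlapping chunks and cumulative context representations.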

Details Motivation: The quadratic complexity of standard self-attention limits Transformers on long-text tasks, and existing efficient variants usually require architectural changes and pre-training from scratch. Method: Long inputs are split into overlapping chunks, and a parameter-free fusion mechanism accumulates context from preceding and succeeding chunks into each chunk's boundary representations, producing compressed, structure-aware representations. Result: On long-document summarization, the method is competitive with and often better than state-of-the-art approaches, without architectural changes or external retrieval mechanisms. Conclusion: ScaleFormer is a simple, effective framework that lets off-the-shelf pre-trained encoder-decoder models process long sequences efficiently with linear computational complexity. Abstract: The quadratic complexity of standard self-attention severely limits the application of Transformer-based models to long-context tasks. While efficient Transformer variants exist, they often require architectural changes and costly pre-training from scratch. To circumvent this, we propose ScaleFormer (Span Representation Cumulation for Long-Context Transformer) - a simple and effective plug-and-play framework that adapts off-the-shelf pre-trained encoder-decoder models to process long sequences without requiring architectural modifications. Our approach segments long inputs into overlapping chunks and generates a compressed, context-aware representation for the decoder. The core of our method is a novel, parameter-free fusion mechanism that endows each chunk's representation with structural awareness of its position within the document. It achieves this by enriching each chunk's boundary representations with cumulative context vectors from all preceding and succeeding chunks. This strategy provides the model with a strong signal of the document's narrative flow, achieves linear complexity, and enables pre-trained models to reason effectively over long-form text. Experiments on long-document summarization show that our method is highly competitive with and often outperforms state-of-the-art approaches without requiring architectural modifications or external retrieval mechanisms.
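
The two ingredients named in the abstract, overlapping chunks and cumulative context from preceding and succeeding chunks, can be sketched as follows. This is a rough stand-in rather than the paper's mechanism: using plain sums of per-chunk vectors is an assumption, and both function names are hypothetical. The abstract only states that the fusion is parameter-free and linear in cost.

```python
def overlapping_chunks(tokens, size, overlap):
    """Split `tokens` into windows of `size` that share `overlap` tokens."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must satisfy 0 <= overlap < size")
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # the last window already reaches the end of the input
    return chunks

def cumulative_context(chunk_vectors):
    """For each chunk, return (sum of all preceding chunk vectors,
    sum of all succeeding chunk vectors). Two linear passes, no parameters."""
    dim = len(chunk_vectors[0])
    prefix, running = [], [0.0] * dim
    for vec in chunk_vectors:
        prefix.append(running)
        running = [a + b for a, b in zip(running, vec)]
    suffix, running = [], [0.0] * dim
    for vec in reversed(chunk_vectors):
        suffix.append(running)
        running = [a + b for a, b in zip(running, vec)]
    suffix.reverse()
    return list(zip(prefix, suffix))

chunks = overlapping_chunks(list(range(10)), size=4, overlap=2)
context = cumulative_context([[1.0], [2.0], [3.0]])  # toy 1-d chunk vectors
print(chunks)
print(context)
```

Each chunk's pair of context vectors tells it how much material lies before and after it, which is one simple way to give a chunk a sense of its position in the document without adding trainable weights.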

[25] Do Language Models Associate Sound with Meaning? A Multimodal Study of Sound Symbolism

Jinhong Jeong,Sunghyun Lee,Jaeyoung Lee,Seonah Han,Youngjae Yu

Main category: cs.CL

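TL;DR: This paper studies how multimodal large language models (MLLMs) handle phonetic iconicity, using the newly built LEX-ICON dataset (natural words and pseudo-words) to analyze models' grasp of sound-meaning associations across text and audio modalities, and uses attention analysis to reveal how they process it.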

Details Motivation: To explore how MLLMs interpret sound symbolism in human language, that is, non-arbitrary associations between phonetic forms and meanings, and to offer a new lens on how models process auditory information. Method: The authors build LEX-ICON, with 8,052 natural words and 2,930 pseudo-words covering four languages (English, French, Japanese, and Korean) and annotated along multiple semantic dimensions; they evaluate MLLMs on text (orthographic and IPA) and audio inputs and analyze information processing with layer-wise phoneme-level attention scores. Result: MLLMs show phonetic intuitions consistent with linguistic research across multiple semantic dimensions, and attention patterns highlight the models' focus on iconic phonemes. Conclusion: The study provides the first large-scale quantitative analysis of phonetic iconicity in MLLMs, bridging artificial intelligence and cognitive linguistics and shedding light on the interpretability of sound-meaning associations across modalities. Abstract: Sound symbolism is a linguistic concept that refers to non-arbitrary associations between phonetic forms and their meanings. We suggest that this can be a compelling probe into how Multimodal Large Language Models (MLLMs) interpret auditory information in human languages. We investigate MLLMs' performance on phonetic iconicity across textual (orthographic and IPA) and auditory forms of inputs with up to 25 semantic dimensions (e.g., sharp vs. round), observing models' layer-wise information processing by measuring phoneme-level attention fraction scores. To this end, we present LEX-ICON, an extensive mimetic word dataset consisting of 8,052 words from four natural languages (English, French, Japanese, and Korean) and 2,930 systematically constructed pseudo-words, annotated with semantic features applied across both text and audio modalities. Our key findings demonstrate (1) MLLMs' phonetic intuitions that align with existing linguistic research across multiple semantic dimensions and (2) phonosemantic attention patterns that highlight models' focus on iconic phonemes. These results bridge domains of artificial intelligence and cognitive linguistics, providing the first large-scale, quantitative analyses of phonetic iconicity in terms of MLLMs' interpretability.

[26] GraphIF: Enhancing Multi-Turn Instruction Following for Large Language Models with Relation Graph Prompt

Zhenhe Li,Can Lin,Ling Zheng,Wen-Da Wei,Junli Liang,Qi Song

Main category: cs.CL

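TL;DR: This paper proposes GraphIF, a plug-and-play framework that models multi-turn dialogues as directed relation graphs and uses graph prompts to strengthen the multi-turn instruction following of large language models.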

Details Motivation: Existing methods do not explicitly build multi-turn instruction following into their optimization objectives, so models struggle with complex long-distance constraints. Method: GraphIF has three modules: an agent-based relation extraction module, a relation graph prompt generation module, and a response rewriting module; it converts multi-turn dialogues into structured graphs and generates natural language prompts to refine outputs. Result: Experiments on two long multi-turn dialogue datasets show that GraphIF significantly improves all four multi-turn instruction-following evaluation metrics. Conclusion: GraphIF improves how well large language models follow complex, long-distance instructions in multi-turn dialogue and integrates easily with existing models. Abstract: Multi-turn instruction following is essential for building intelligent conversational systems that can consistently adhere to instructions across dialogue turns. However, existing approaches to enhancing multi-turn instruction following primarily rely on collecting or generating large-scale multi-turn dialogue datasets to fine-tune large language models (LLMs), which treat each response generation as an isolated task and fail to explicitly incorporate multi-turn instruction following into the optimization objectives. As a result, instruction-tuned LLMs often struggle with complex long-distance constraints. In multi-turn dialogues, relational constraints across turns can be naturally modeled as labeled directed edges, making graph structures particularly suitable for modeling multi-turn instruction following. Despite this potential, leveraging graph structures to enhance the multi-turn instruction following capabilities of LLMs remains unexplored. To bridge this gap, we propose GraphIF, a plug-and-play framework that models multi-turn dialogues as directed relation graphs and leverages graph prompts to enhance the instruction following capabilities of LLMs. GraphIF comprises three key components: (1) an agent-based relation extraction module that captures inter-turn semantic relations via action-triggered mechanisms to construct structured graphs; (2) a relation graph prompt generation module that converts structured graph information into natural language prompts; and (3) a response rewriting module that refines initial LLM outputs using the generated graph prompts. Extensive experiments on two long multi-turn dialogue datasets demonstrate that GraphIF can be seamlessly integrated into instruction-tuned LLMs and leads to significant improvements across all four multi-turn instruction-following evaluation metrics.

[27] ADI-20: Arabic Dialect Identification dataset and models

Haroun Elleuch,Salima Mdhaffar,Yannick Estève,Fethi Bougares

Main category: cs.CL

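TL;DR: This paper introduces ADI-20, an extension of the previously published ADI-17 Arabic Dialect Identification dataset that covers the dialects of all Arabic-speaking countries, with 3,556 hours of speech from 19 Arabic dialects plus Modern Standard Arabic (MSA). The authors use it to train and evaluate several state-of-the-art ADI systems, exploring fine-tuning of pre-trained ECAPA-TDNN-based models as well as Whisper encoder blocks coupled with an attention pooling layer and a classification dense layer. They also analyze how training data size and model parameter count affect identification performance, finding only a small drop in F1 score when using just 30% of the original training data. The collected data and trained models are open-sourced to support further research and reproduction.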

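Details Motivation: To improve the performance and coverage of Arabic Dialect Identification (ADI), address the incomplete dialect coverage and limited resources of existing datasets, and advance research in the field. Method: The authors build and release the extended ADI-20 dataset, covering 19 Arabic dialects plus MSA; they fine-tune pre-trained ECAPA-TDNN-based models and design an architecture built from Whisper encoder blocks, an attention pooling layer, and a classification dense layer; and they systematically evaluate how training data size and model parameter count affect performance. Result: Strong ADI performance is reported on the full data; with only 30% of the training data, the F1 score drops only slightly, indicating good data efficiency; the dataset and trained models are open-sourced. Conclusion: ADI-20 is a large-scale dataset with comprehensive coverage of Arabic dialects that supports research on ADI systems; models based on ECAPA-TDNN and Whisper perform well on the task and are robust to changes in data size; the open-source release supports open and reproducible research in the field. Abstract: We present ADI-20, an extension of the previously published ADI-17 Arabic Dialect Identification (ADI) dataset. ADI-20 covers all Arabic-speaking countries' dialects. It comprises 3,556 hours from 19 Arabic dialects in addition to Modern Standard Arabic (MSA). We used this dataset to train and evaluate various state-of-the-art ADI systems. We explored fine-tuning pre-trained ECAPA-TDNN-based models, as well as Whisper encoder blocks coupled with an attention pooling layer and a classification dense layer. We investigated the effect of (i) training data size and (ii) the model's number of parameters on identification performance. Our results show a small decrease in F1 score while using only 30% of the original training data. We open-source our collected data and trained models to enable the reproduction of our work, as well as support further research in ADI.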

[28] Format Matters: The Robustness of Multimodal LLMs in Reviewing Evidence from Tables and Charts

Xanh Ho,Yun-Ang Wu,Sunisth Kumar,Florian Boudin,Atsuhiro Takasu,Akiko Aizawa

Main category: cs.CL

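TL;DR: This paper studies how well multimodal large language models understand table and chart evidence in scientific claim verification, finding that current models perform worse on charts and that smaller models in particular generalize poorly across modalities; it recommends that future work strengthen chart understanding.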

Details Motivation: As the number of scientific papers grows, systems that help reviewers evaluate research claims are needed; however, how well current multimodal large language models verify scientific claims across different evidence formats (such as tables and charts) remains unclear. Method: The authors design and run a series of experiments, adapt two existing datasets of scientific papers into an annotated dataset for multimodal claim verification, evaluate 12 multimodal large language models on table-based and chart-based evidence, and run a human comparison. Result: Current models perform better with table evidence than with chart evidence; humans perform well on both formats; smaller models (under 8B) show weak correlation in performance between table and chart tasks, indicating limited cross-modal generalization. Conclusion: Current multimodal large language models have a significant weakness in chart understanding that limits their use in scientific claim verification, and future work should focus on improving reasoning over chart information. Abstract: With the growing number of submitted scientific papers, there is an increasing demand for systems that can assist reviewers in evaluating research claims. Experimental results are a core component of scientific work, often presented in varying formats such as tables or charts. Understanding how robust current multimodal large language models (multimodal LLMs) are at verifying scientific claims across different evidence formats remains an important and underexplored challenge. In this paper, we design and conduct a series of experiments to assess the ability of multimodal LLMs to verify scientific claims using both tables and charts as evidence. To enable this evaluation, we adapt two existing datasets of scientific papers by incorporating annotations and structures necessary for a multimodal claim verification task. Using this adapted dataset, we evaluate 12 multimodal LLMs and find that current models perform better with table-based evidence while struggling with chart-based evidence. We further conduct human evaluations and observe that humans maintain strong performance across both formats, unlike the models. Our analysis also reveals that smaller multimodal LLMs (under 8B) show weak correlation in performance between table-based and chart-based tasks, indicating limited cross-modal generalization. These findings highlight a critical gap in current models' multimodal reasoning capabilities. We suggest that future multimodal LLMs should place greater emphasis on improving chart understanding to better support scientific claim verification.

[29] ELYADATA & LIA at NADI 2025: ASR and ADI Subtasks

Haroun Elleuch,Youssef Saidi,Salima Mdhaffar,Yannick Estève,Fethi Bougares

Main category: cs.CL

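TL;DR: This paper describes the joint submission by Elyadata and LIA to the NADI 2025 multi-dialectal Arabic speech processing task, which ranked first in the Spoken Arabic Dialect Identification (ADI) subtask and second in the multi-dialectal Arabic ASR subtask.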

Details Motivation: To explore how effective large pre-trained speech models are for multi-dialectal Arabic speech processing, particularly for improving dialect identification and speech recognition under low-resource conditions. Method: For ADI, a Whisper-large-v3 encoder is fine-tuned with data augmentation; for multi-dialectal ASR, SeamlessM4T-v2 Large (Egyptian variant) is fine-tuned separately for each of the eight dialects. Result: The ADI system reaches 79.83% accuracy on the test set, ranking first; the multi-dialectal ASR system obtains an average WER of 38.54% and CER of 14.53%, ranking second. Conclusion: Targeted fine-tuning of large pre-trained speech models effectively improves multi-dialectal Arabic speech processing. Abstract: This paper describes Elyadata & LIA's joint submission to the NADI multi-dialectal Arabic Speech Processing 2025. We participated in the Spoken Arabic Dialect Identification (ADI) and multi-dialectal Arabic ASR subtasks. Our submission ranked first for the ADI subtask and second for the multi-dialectal Arabic ASR subtask among all participants. Our ADI system is a fine-tuned Whisper-large-v3 encoder with data augmentation. This system obtained the highest ADI accuracy score of 79.83% on the official test set. For multi-dialectal Arabic ASR, we fine-tuned SeamlessM4T-v2 Large (Egyptian variant) separately for each of the eight considered dialects. Overall, we obtained an average WER and CER of 38.54% and 14.53%, respectively, on the test set. Our results demonstrate the effectiveness of large pre-trained speech models with targeted fine-tuning for Arabic speech processing.
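
For readers unfamiliar with the two ASR metrics reported above, the sketch below computes word error rate (WER) and character error rate (CER) in their standard form: edit distance between hypothesis and reference, divided by the reference length. It is a generic illustration with toy English strings; any text normalization used by the shared task's official scoring is not described in the abstract and is not reproduced here.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (words or characters)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]

def wer(reference, hypothesis):
    """Word error rate: word-level edit distance over reference word count."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)

def cer(reference, hypothesis):
    """Character error rate: character-level edit distance over reference length."""
    return edit_distance(list(reference), list(hypothesis)) / len(reference)

print(wer("the cat sat", "the cat sit"))
print(cer("the cat sat", "the cat sit"))
```

A single wrong word counts as a full word error but often only one or two character errors, which is why CER is typically much lower than WER for the same output, as in the 38.54% versus 14.53% reported here.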

[30] On the Military Applications of Large Language Models

Satu Johansson,Taneli Riihonen

Main category: cs.CL

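TL;DR: This paper explores military applications of natural language processing and large language models by querying a GPT-based model (Microsoft Copilot) and assessing the information it provides, and studies the feasibility of building such applications with commercial cloud services (Microsoft Azure).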

Details Motivation: With the rise of GPT and large-scale pre-trained models, exploring their potential applications in the military domain has become an important topic. Method: The authors interact with a GPT-based language model (Microsoft Copilot) to elicit its view of military applications and assess how feasible it is to implement them with commercial cloud services (Microsoft Azure). Result: The summarization and generative capabilities of language models directly support many applications at large, and other features may find particular uses. Conclusion: Large language models have broad potential in military applications, and some of these applications can feasibly be built on existing commercial cloud platforms. Abstract: In this paper, military use cases or applications and implementation thereof are considered for natural language processing and large language models, which have risen to fame with the invention of the generative pre-trained transformer (GPT) and the extensive foundation model pretraining done by OpenAI for ChatGPT and others. First, we interrogate a GPT-based language model (viz. Microsoft Copilot) to make it reveal its own knowledge about their potential military applications and then critically assess the information. Second, we study how commercial cloud services (viz. Microsoft Azure) could be used readily to build such applications and assess which of them are feasible. We conclude that the summarization and generative properties of language models directly facilitate many applications at large and other features may find particular uses.

[31] Generalizing to Unseen Disaster Events: A Causal View

Philipp Seeberger,Steffen Freisinger,Tobias Bocklet,Korbinian Riedhammer

Main category: cs.CL

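TL;DR: This paper proposes a causal-learning-based debiasing method that improves model generalization in classifying disaster-related social media content, outperforming baselines on three tasks.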

Details Motivation: Existing disaster event analysis systems are prone to event-related biases, which leads to poor generalization to newly emerging events. Method: Taking a causal learning perspective, the authors propose a method that reduces event- and domain-related biases to improve adaptation to future events. Result: Across three disaster classification tasks, the method outperforms multiple baselines by up to +1.9% F1 and significantly improves a classifier based on a pre-trained language model. Conclusion: Causal debiasing can mitigate bias in disaster information processing and improve generalization in real-world settings. Abstract: Due to the rapid growth of social media platforms, these tools have become essential for monitoring information during ongoing disaster events. However, extracting valuable insights requires real-time processing of vast amounts of data. A major challenge in existing systems is their exposure to event-related biases, which negatively affects their ability to generalize to emerging events. While recent advancements in debiasing and causal learning offer promising solutions, they remain underexplored in the disaster event domain. In this work, we approach bias mitigation through a causal lens and propose a method to reduce event- and domain-related biases, enhancing generalization to future events. Our approach outperforms multiple baselines by up to +1.9% F1 and significantly improves a PLM-based classifier across three disaster classification tasks.

[32] Beyond the Black Box: Demystifying Multi-Turn LLM Reasoning with VISTA

Yiran Zhang,Mingyang Lin,Mark Dras,Usman Naseem

Main category: cs.CL

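TL;DR: This paper presents VISTA, a visual interactive text analytics system for multi-turn reasoning tasks that supports visualizing context influence, editing conversation history, and generating reasoning dependency trees, reducing the complexity of analyzing large language model reasoning.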

Details Motivation: Existing research lacks specialized visualization tools for analyzing the complex reasoning of large language models in multi-turn interactions, which places a high cognitive load on researchers. Method: The authors design and implement VISTA, a web-based visual interactive system that visualizes the influence of context on model decisions, supports interactive editing of conversation history for "what-if" analyses, and automatically parses a session to generate a reasoning dependency tree. Result: VISTA provides a unified, interactive framework that significantly reduces the complexity of analyzing multi-turn reasoning chains, helps clarify the capabilities and limitations of current large language models, and supports integration of custom benchmarks and local models. Conclusion: VISTA improves the interpretability and efficiency of analyzing model behavior in multi-turn reasoning and is an open-source, extensible analysis platform. Abstract: Recent research has increasingly focused on the reasoning capabilities of Large Language Models (LLMs) in multi-turn interactions, as these scenarios more closely mirror real-world problem-solving. However, analyzing the intricate reasoning processes within these interactions presents a significant challenge due to complex contextual dependencies and a lack of specialized visualization tools, leading to a high cognitive load for researchers. To address this gap, we present VISTA, a web-based Visual Interactive System for Textual Analytics in multi-turn reasoning tasks. VISTA allows users to visualize the influence of context on model decisions and interactively modify conversation histories to conduct "what-if" analyses across different models. Furthermore, the platform can automatically parse a session and generate a reasoning dependency tree, offering a transparent view of the model's step-by-step logical path. By providing a unified and interactive framework, VISTA significantly reduces the complexity of analyzing reasoning chains, thereby facilitating a deeper understanding of the capabilities and limitations of current LLMs. The platform is open-source and supports easy integration of custom benchmarks and local models.

[33] Text2SQL-Flow: A Robust SQL-Aware Data Augmentation Framework for Text-to-SQL

Qifeng Cai,Hao Liang,Chang Xu,Tao Xie,Wentao Zhang,Bin Cui

Main category: cs.CL

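TL;DR: This paper proposes Text2SQL-Flow, a SQL-aware data augmentation framework that generates large-scale, semantically valid, and structurally diverse Text-to-SQL pairs, and builds the high-quality SQLFlow dataset, which significantly improves the Text-to-SQL performance of both open-source and closed-source large language models.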

Details Motivation: Existing Text-to-SQL datasets are scarce, simplistic, and low in diversity, which limits model performance; high-quality, diverse data is needed. Method: The Text2SQL-Flow framework covers six augmentation dimensions and integrates SQL execution verification, natural language question generation, chain-of-thought reasoning traces, and data classification, together with a modular Database Manager; the SQLFlow dataset is built with this framework, and a masked alignment retrieval method is proposed for example matching with closed-source models. Result: Under the same data budget, open-source large language models fine-tuned on SQLFlow perform better across benchmarks; the proposed retrieval method outperforms existing techniques with closed-source models, supporting the value of the data quality and alignment strategy. Conclusion: High-quality structured data is critical for Text-to-SQL systems, and Text2SQL-Flow and SQLFlow provide a scalable foundation for data-centric Text-to-SQL research. Abstract: The data-centric paradigm has become pivotal in AI, especially for Text-to-SQL, where performance is limited by scarce, simplistic, and low-diversity datasets. To address this, we propose Text2SQL-Flow, a SQL-aware data augmentation framework that generates large-scale, semantically valid, and structurally diverse Text-to-SQL pairs from minimal seed data. It operates across six augmentation dimensions and integrates an end-to-end pipeline featuring SQL execution verification, natural language question generation, chain-of-thought reasoning traces, and data classification. A modular Database Manager ensures cross-database compatibility and scalability. Using this framework, we build SQLFlow, a high-quality dataset of 89,544 annotated examples. We evaluate SQLFlow in two settings: (1) For open-source LLMs, fine-tuning on SQLFlow consistently improves performance across benchmarks under the same data budget. (2) For closed-source LLMs, we introduce a masked alignment retrieval method that treats SQLFlow as both knowledge base and training data for the retriever. This enables structure-aware example matching by modeling fine-grained alignments between questions and SQL queries. Experiments show our retrieval strategy outperforms existing methods, underscoring the value of SQLFlow's high-fidelity data and our novel technique. Our work establishes a scalable, data-centric foundation for advancing Text-to-SQL systems and highlights the critical role of high-quality structured data in modern AI.
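
One stage named in the pipeline, SQL execution verification, can be illustrated with Python's standard `sqlite3` module. This is a minimal sketch under stated assumptions, not the framework's Database Manager: the schema, the candidate queries, and the `executes_ok` helper are hypothetical, and the check only confirms that a query runs without error on an empty in-memory database, not that it answers the question correctly.

```python
import sqlite3

def executes_ok(schema_sql, query):
    """Return True if `query` runs without error on an in-memory SQLite
    database built from `schema_sql` (hypothetical verification step)."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.executescript(schema_sql)
        conn.execute(query).fetchall()
        return True
    except sqlite3.Error:
        # Syntax errors, unknown tables or columns, and similar failures.
        return False
    finally:
        conn.close()

# Toy schema and augmented candidates (illustrative only).
schema = "CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT);"
candidates = [
    "SELECT name FROM users WHERE id = 1",
    "SELECT nam FROM users",        # unknown column, should be rejected
    "SELEC name FROM users",        # syntax error, should be rejected
]
valid = [q for q in candidates if executes_ok(schema, q)]
print(valid)
```

Filtering generated queries this way removes pairs that cannot execute at all before they reach the dataset; checking semantic validity would additionally require populated tables and comparison of result sets.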

[34] EffiReason-Bench: A Unified Benchmark for Evaluating and Advancing Efficient Reasoning in Large Language Models

Junquan Huang,Haotian Wu,Yubo Gao,Yibo Yan,Junyan Zhang,Yonghua Hei,Song Dai,Jie Zhang,Puay Siew Tan,Xuming Hu

Main category: cs.CL

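TL;DR: This paper presents EffiReason-Bench, a unified benchmark for evaluating efficient reasoning methods, and introduces the E3-Score metric; experiments across multiple tasks and models show that no single method is best in every setting.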

Details Motivation: LLMs prompted with chain-of-thought often generate overly long reasoning, which hurts efficiency and sometimes accuracy, and there is no unified evaluation standard for comparing efficient reasoning methods fairly. Method: Builds EffiReason-Bench, a unified benchmark covering three categories of methods, proposes the E3-Score as an evaluation metric, and constructs verified CoT annotations for CommonsenseQA and LogiQA through standardized reasoning structures, option-wise analysis, and human verification. Result: Evaluates 7 methods on 6 open-source LLMs and 4 datasets; performance depends on model scale, task complexity, and architecture, with no single best strategy. Conclusion: Efficient reasoning methods should be chosen per model and task, and future research should account for these factors to reach the best performance. Abstract: Large language models (LLMs) with Chain-of-Thought (CoT) prompting achieve strong reasoning but often produce unnecessarily long explanations, increasing cost and sometimes reducing accuracy. Fair comparison of efficiency-oriented approaches is hindered by fragmented evaluation practices. We introduce EffiReason-Bench, a unified benchmark for rigorous cross-paradigm evaluation of efficient reasoning methods across three categories: Reasoning Blueprints, Dynamic Execution, and Post-hoc Refinement. To enable step-by-step evaluation, we construct verified CoT annotations for CommonsenseQA and LogiQA via a pipeline that enforces standardized reasoning structures, comprehensive option-wise analysis, and human verification. We evaluate 7 methods across 6 open-source LLMs (1B-70B) on 4 datasets spanning mathematics, commonsense, and logic, and propose the E3-Score, a principled metric inspired by economic trade-off modeling that provides smooth, stable evaluation without discontinuities or heavy reliance on heuristics. Experiments show that no single method universally dominates; optimal strategies depend on backbone scale, task complexity, and architecture.

[35] Persona-Aware Alignment Framework for Personalized Dialogue Generation

Guanrong Li,Xinyu Liu,Zhen Wu,Xinyu Dai

Main category: cs.CL

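TL;DR: Proposes PAL, a new framework for personalized dialogue generation that directly optimizes persona alignment through two-stage training, improving the personalization and relevance of generated responses.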

Details Motivation: Mainstream models learn personalization only implicitly, so they tend to ignore the given persona and produce generic responses. Method: Proposes the Persona-Aware Alignment Framework (PAL), which uses two-stage training (Persona-aware Learning and Persona Alignment) combined with a Select then Generate inference strategy. Result: Experiments show that PAL outperforms existing state-of-the-art personalized dialogue models and LLMs on multiple benchmarks. Conclusion: PAL improves the persona sensitivity and consistency of dialogue systems at the semantic level. Abstract: Personalized dialogue generation aims to leverage persona profiles and dialogue history to generate persona-relevant and consistent responses. Mainstream models typically rely on token-level language model training with persona dialogue data, such as Next Token Prediction, to implicitly achieve personalization, so these methods tend to neglect the given personas and generate generic responses. To address this issue, we propose a novel Persona-Aware Alignment Framework (PAL), which directly treats persona alignment as the training objective of dialogue generation. Specifically, PAL employs a two-stage training method including Persona-aware Learning and Persona Alignment, equipped with an easy-to-use inference strategy Select then Generate, to improve persona sensitivity and generate more persona-relevant responses at the semantics level. Through extensive experiments, we demonstrate that our framework outperforms many state-of-the-art personalized dialogue methods and large language models.

[36] LangGPS: Language Separability Guided Data Pre-Selection for Joint Multilingual Instruction Tuning

Yangfan Ye,Xiaocheng Feng,Xiachong Feng,Lei Huang,Weitao Ma,Qichen Hong,Yunfei Lu,Duyu Tang,Dandan Tu,Bing Qin

Main category: cs.CL

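TL;DR: This paper proposes LangGPS, a lightweight two-stage pre-selection framework guided by language separability, to make data selection more effective when training multilingual LLMs.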

Details Motivation: Existing data selection methods usually ignore the intrinsic linguistic structure of multilingual data, leaving multilingual capability sensitive to the composition of the training data. Method: LangGPS first filters training data by language separability score and then refines the subset with existing selection methods, optimizing the data composition for multilingual instruction tuning. Result: Experiments on six benchmarks and 22 languages show that LangGPS improves the effectiveness and generalizability of several selection methods, especially on understanding tasks and low-resource languages; highly separable samples help form clear language boundaries, low-separability samples promote cross-lingual alignment, and language separability also serves as an effective signal for multilingual curriculum learning. Conclusion: Language separability is an important new dimension for measuring the utility of multilingual data, and LangGPS offers a new approach to building LLMs that better reflect linguistic structure. Abstract: Joint multilingual instruction tuning is a widely adopted approach to improve the multilingual instruction-following ability and downstream performance of large language models (LLMs), but the resulting multilingual capability remains highly sensitive to the composition and selection of the training data. Existing selection methods, often based on features like text quality, diversity, or task relevance, typically overlook the intrinsic linguistic structure of multilingual data. In this paper, we propose LangGPS, a lightweight two-stage pre-selection framework guided by language separability which quantifies how well samples in different languages can be distinguished in the model's representation space. LangGPS first filters training data based on separability scores and then refines the subset using existing selection methods. Extensive experiments across six benchmarks and 22 languages demonstrate that applying LangGPS on top of existing selection methods improves their effectiveness and generalizability in multilingual training, especially for understanding tasks and low-resource languages. Further analysis reveals that highly separable samples facilitate the formation of clearer language boundaries and support faster adaptation, while low-separability samples tend to function as bridges for cross-lingual alignment. Besides, we also find that language separability can serve as an effective signal for multilingual curriculum learning, where interleaving samples with diverse separability levels yields stable and generalizable gains. Together, we hope our work offers a new perspective on data utility in multilingual contexts and supports the development of more linguistically informed LLMs.
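
The two-stage pre-selection described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the abstract does not define the separability score, so the sketch assumes a hypothetical proxy (cosine similarity to the sample's own language centroid minus the highest similarity to any other language centroid) and a hypothetical rule that keeps the most separable fraction. The names `separability`, `pre_select`, `keep_ratio`, and `downstream_select` are all illustrative.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def centroids(samples):
    """Mean representation per language; samples is a list of (lang, vector)."""
    sums, counts = {}, {}
    for lang, vec in samples:
        acc = sums.setdefault(lang, [0.0] * len(vec))
        for i, v in enumerate(vec):
            acc[i] += v
        counts[lang] = counts.get(lang, 0) + 1
    return {lang: [v / counts[lang] for v in acc] for lang, acc in sums.items()}


def separability(lang, vec, cents):
    """ASSUMED proxy score: own-centroid similarity minus best other-centroid similarity."""
    own = cosine(vec, cents[lang])
    others = [cosine(vec, c) for other, c in cents.items() if other != lang]
    return own - max(others) if others else own


def pre_select(samples, keep_ratio, downstream_select):
    """Stage 1: filter by separability. Stage 2: hand the subset to an existing selector."""
    cents = centroids(samples)
    ranked = sorted(samples, key=lambda s: separability(s[0], s[1], cents), reverse=True)
    kept = ranked[: max(1, int(len(ranked) * keep_ratio))]
    return downstream_select(kept)
```

Because the abstract reports that low-separability samples also help cross-lingual alignment, a real filter would likely not discard them wholesale; the top-fraction rule here is only the simplest choice for illustration.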

[37] VocalNet-M2: Advancing Low-Latency Spoken Language Modeling via Integrated Multi-Codebook Tokenization and Multi-Token Prediction

Yuhao Wang,Ziyang Cheng,Heyang Liu,Ronghua Wu,Qunshan Gu,Yanfeng Wang,Yu Wang

Main category: cs.CL

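TL;DR: This paper proposes VocalNet-M2, a low-latency end-to-end spoken language model that uses a multi-codebook tokenizer and a multi-token prediction strategy to cut response latency substantially while keeping performance competitive.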

Details Motivation: Existing end-to-end spoken language models suffer from high response latency, caused mainly by autoregressive generation of speech tokens and by complex flow-matching synthesis models, which limits real-time interactive use. Method: Introduces VocalNet-M2, which generates speech tokens directly with a multi-codebook tokenizer and adds a multi-token prediction (MTP) strategy to improve generation efficiency, avoiding a high-latency flow-matching model. Result: Experiments show that VocalNet-M2 reduces first-chunk latency from about 725ms to 350ms while remaining competitive with mainstream SLMs, and the work provides a comprehensive comparison of single-codebook and multi-codebook strategies. Conclusion: VocalNet-M2 effectively lowers speech generation latency and offers a practical route to efficient, real-time spoken language models. Abstract: Current end-to-end spoken language models (SLMs) have made notable progress, yet they still encounter considerable response latency. This delay primarily arises from the autoregressive generation of speech tokens and the reliance on complex flow-matching models for speech synthesis. To overcome this, we introduce VocalNet-M2, a novel low-latency SLM that integrates a multi-codebook tokenizer and a multi-token prediction (MTP) strategy. Our model directly generates multi-codebook speech tokens, thus eliminating the need for a latency-inducing flow-matching model. Furthermore, our MTP strategy enhances generation efficiency and improves overall performance. Extensive experiments demonstrate that VocalNet-M2 achieves a substantial reduction in first chunk latency (from approximately 725ms to 350ms) while maintaining competitive performance across mainstream SLMs. This work also provides a comprehensive comparison of single-codebook and multi-codebook strategies, offering valuable insights for developing efficient and high-performance SLMs for real-time interactive applications.

[38] MTR-DuplexBench: Towards a Comprehensive Evaluation of Multi-Round Conversations for Full-Duplex Speech Language Models

He Zhang,Wenqian Cui,Haoning Xu,Xiaohui Li,Lei Zhu,Shaohua Ma,Irwin King

Main category: cs.CL

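TL;DR: This paper presents MTR-DuplexBench, a new benchmark for evaluating full-duplex speech language models (FD-SLMs) in multi-round conversations, addressing gaps in existing evaluations around multi-round interaction, instruction following, and safety.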

Details Motivation: Existing FD-SLM evaluations focus on single-round dialogue and basic conversational features; they do not adequately assess the complexities of multi-round interaction (such as blurred turn boundaries and context inconsistency) or key capabilities such as instruction following and safety. Method: Proposes MTR-DuplexBench, which segments continuous full-duplex dialogues into discrete turns so that FD-SLMs can be evaluated turn by turn on dialogue quality, conversational dynamics, instruction following, and safety. Result: Experiments show that current FD-SLMs struggle to maintain consistent performance across multiple rounds and evaluation dimensions, supporting the necessity and effectiveness of the benchmark. Conclusion: MTR-DuplexBench evaluates FD-SLMs more comprehensively in realistic interactive settings and can guide their improvement on complex multi-round conversations. Abstract: Full-Duplex Speech Language Models (FD-SLMs) enable real-time, overlapping conversational interactions, offering a more dynamic user experience compared to traditional half-duplex models. However, existing benchmarks primarily focus on evaluating single-round interactions and conversational features, neglecting the complexities of multi-round communication and critical capabilities such as instruction following and safety. Evaluating FD-SLMs in multi-round settings poses significant challenges, including blurred turn boundaries in communication and context inconsistency during model inference. To address these gaps, we introduce MTR-DuplexBench, a novel benchmark that segments continuous full-duplex dialogues into discrete turns, enabling comprehensive, turn-by-turn evaluation of FD-SLMs across dialogue quality, conversational dynamics, instruction following, and safety. Experimental results reveal that current FD-SLMs face difficulties in maintaining consistent performance across multiple rounds and evaluation dimensions, highlighting the necessity and effectiveness of our proposed benchmark. The benchmark and code will be available in the future.

[39] Local Hybrid Retrieval-Augmented Document QA

Paolo Astrino

Main category: cs.CL

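TL;DR: Proposes a question-answering system that runs entirely locally and combines semantic understanding with precise keyword retrieval, answering complex queries accurately without sacrificing data privacy.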

Details Motivation: Organizations using cloud AI systems face a trade-off between privacy and performance, especially with sensitive documents that require both data security and accurate question answering. Method: Combines two complementary retrieval strategies, semantic understanding and keyword retrieval, and uses consumer-grade hardware acceleration; the whole system runs offline on local infrastructure. Result: Achieves competitive accuracy on legal, scientific, and conversational documents with very low error rates, while all data stays local. Conclusion: Organizations can deploy high-performing conversational document AI without sending proprietary information to external providers, showing that privacy and performance are compatible in enterprise AI. Abstract: Organizations handling sensitive documents face a critical dilemma: adopt cloud-based AI systems that offer powerful question-answering capabilities but compromise data privacy, or maintain local processing that ensures security but delivers poor accuracy. We present a question-answering system that resolves this trade-off by combining semantic understanding with keyword precision, operating entirely on local infrastructure without internet access. Our approach demonstrates that organizations can achieve competitive accuracy on complex queries across legal, scientific, and conversational documents while keeping all data on their machines. By balancing two complementary retrieval strategies and using consumer-grade hardware acceleration, the system delivers reliable answers with minimal errors, letting banks, hospitals, and law firms adopt conversational document AI without transmitting proprietary information to external providers. This work establishes that privacy and performance need not be mutually exclusive in enterprise AI deployment.
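
One common way to balance a semantic retriever against a keyword retriever is a weighted sum of normalized scores. The sketch below illustrates that general technique under stated assumptions: the abstract does not say how the paper fuses its two strategies, and the function names, the min-max normalization, and the weight `alpha` are hypothetical choices made here.

```python
def min_max(scores):
    """Rescale a {doc_id: score} mapping to [0, 1]; constant scores map to 0."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc: 0.0 for doc in scores}
    return {doc: (s - lo) / (hi - lo) for doc, s in scores.items()}


def hybrid_rank(semantic, keyword, alpha=0.5):
    """Rank documents by alpha * semantic + (1 - alpha) * keyword (ASSUMED fusion rule).

    semantic: {doc_id: similarity from an embedding model}
    keyword:  {doc_id: lexical score, e.g. from a BM25-style scorer}
    """
    sem, kw = min_max(semantic), min_max(keyword)
    docs = set(sem) | set(kw)
    fused = {d: alpha * sem.get(d, 0.0) + (1 - alpha) * kw.get(d, 0.0) for d in docs}
    # Sort by fused score, breaking ties by document id for determinism.
    return sorted(fused, key=lambda d: (-fused[d], d))
```

Normalizing before fusing matters because embedding similarities and lexical scores live on different scales; without it, the retriever with the larger raw range would dominate regardless of `alpha`.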

[40] Rectify Evaluation Preference: Improving LLMs' Critique on Math Reasoning via Perplexity-aware Reinforcement Learning

Changyuan Tian,Zhicong Lu,Shuang Qian,Nayu Liu,Peiguang Li,Li Jin,Leiyi Hu,Zhizhao Zeng,Sirui Wang,Ke Zeng,Zhi Guo

Main category: cs.CL

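TL;DR: This paper proposes a perplexity-aware reinforcement learning algorithm that corrects the imbalanced evaluation preference of LLMs in multi-step mathematical reasoning, improving their critiquing ability.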

Details Motivation: Existing methods rely mainly on high-quality supervised fine-tuning demonstrations to improve critiquing ability and rarely examine why LLMs critique poorly in the first place. This paper analyzes and addresses an underlying "imbalanced evaluation preference". Method: Builds a One-to-many Problem-Solution (OPS) benchmark and runs a statistical preference analysis, finding that LLMs tend to judge lower-perplexity solutions as correct; based on this, proposes a perplexity-aware Group Relative Policy Optimization algorithm that guides the model to explore trajectories judging low-perplexity solutions as wrong and high-perplexity solutions as correct. Result: Experiments on the self-built OPS benchmark and existing critic benchmarks show that the method improves the critiquing ability of LLMs. Conclusion: Identifying and correcting the imbalanced evaluation preference in LLMs strengthens their automatic critiquing in multi-step mathematical reasoning and offers a new way to improve reasoning quality. Abstract: To improve Multi-step Mathematical Reasoning (MsMR) of Large Language Models (LLMs), it is crucial to obtain scalable supervision from the corpus by automatically critiquing mistakes in the reasoning process of MsMR and rendering a final verdict of the problem-solution. Most existing methods rely on crafting high-quality supervised fine-tuning demonstrations for critiquing capability enhancement and pay little attention to delving into the underlying reason for the poor critiquing performance of LLMs. In this paper, we orthogonally quantify and investigate the potential reason, imbalanced evaluation preference, and conduct a statistical preference analysis. Motivated by the analysis of the reason, a novel perplexity-aware reinforcement learning algorithm is proposed to rectify the evaluation preference, elevating the critiquing capability. Specifically, to probe into LLMs' critiquing characteristics, a One-to-many Problem-Solution (OPS) benchmark is meticulously constructed to quantify the behavior difference of LLMs when evaluating the problem solutions generated by themselves and others. Then, to investigate the behavior difference in depth, we conduct a statistical preference analysis oriented on perplexity and find an intriguing phenomenon, "LLMs incline to judge solutions with lower perplexity as correct", which is dubbed imbalanced evaluation preference. To rectify this preference, we regard perplexity as the baton in the algorithm of Group Relative Policy Optimization, supporting the LLMs to explore trajectories that judge lower perplexity as wrong and higher perplexity as correct. Extensive experimental results on our built OPS and existing available critic benchmarks demonstrate the validity of our method.
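
Two standard ingredients behind the method above can be written down directly: the perplexity of a solution given its per-token log-probabilities, and the group-relative advantage used by Group Relative Policy Optimization. The sketch below shows only these generic definitions; how the paper injects perplexity into the reward or sampling is not specified in the abstract and is deliberately not modeled here. The function names are illustrative.

```python
import math


def perplexity(token_logprobs):
    """Perplexity = exp(-mean log-probability) over a sequence's tokens (natural log)."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each sampled response's reward against its group's mean and std.

    This is the standard GRPO-style advantage; eps guards against a zero std
    when every response in the group receives the same reward.
    """
    mean = sum(rewards) / len(rewards)
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / len(rewards))
    return [(r - mean) / (std + eps) for r in rewards]
```

For instance, a solution whose tokens each have probability 0.5 has perplexity 2, and in a group with rewards 1, 0, 1, 0 the correct verdicts receive advantages near +1 and the incorrect ones near -1.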

[41] BhashaKritika: Building Synthetic Pretraining Data at Scale for Indic Languages

Guduru Manoj,Neel Prabhanjan Rachamalla,Ashish Kulkarni,Gautam Rajeev,Jay Piplodiya,Arul Menezes,Shaharukh Khan,Souvik Rana,Manya Sah,Chandra Khatri,Shubham Agarwal

Main category: cs.CL

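TL;DR: This study systematically examines the generation and evaluation of synthetic multilingual pretraining data for Indic languages, builds BhashaKritika, a large-scale dataset of 540 billion tokens, and proposes a scalable quality evaluation framework.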

Details Motivation: In LLM pretraining, data scarcity for low-resource languages leads to uneven model development, creating a need for high-quality, scalable multilingual synthetic data. Method: Generates synthetic data for 10 Indic languages with 5 techniques, controlling generation through documents, personas, and topics; compares translation of English content with native generation and analyzes how prompt language and document language affect quality; designs a modular evaluation pipeline with script detection, metadata checks, n-gram repetition analysis, and KenLM perplexity filtering. Result: Builds the 540B-token BhashaKritika dataset; experiments reveal trade-offs among generation strategies and find that document grounding and native-language prompts markedly improve data quality, and that native generation outperforms translation. Conclusion: Sound generation strategies and language choices are critical to synthetic data quality, and the proposed generation and evaluation framework offers a practical path to high-quality multilingual pretraining data. Abstract: In the context of pretraining of Large Language Models (LLMs), synthetic data has emerged as an alternative for generating high-quality pretraining data at scale. This is particularly beneficial in low-resource language settings where the benefits of recent LLMs have been unevenly distributed across languages. In this work, we present a systematic study on the generation and evaluation of synthetic multilingual pretraining data for Indic languages, where we construct a large-scale synthetic dataset BhashaKritika, comprising 540B tokens using 5 different techniques for 10 languages. We explore the impact of grounding generation in documents, personas, and topics. We analyze how language choice, both in the prompt instructions and document grounding, affects data quality, and we compare translations of English content with native generation in Indic languages. To support scalable and language-sensitive evaluation, we introduce a modular quality evaluation pipeline that integrates script and language detection, metadata consistency checks, n-gram repetition analysis, and perplexity-based filtering using KenLM models. Our framework enables robust quality control across diverse scripts and linguistic contexts. Empirical results through model runs reveal key trade-offs in generation strategies and highlight best practices for constructing effective multilingual corpora.
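
The n-gram repetition analysis mentioned in the pipeline can be illustrated with a simple ratio: the share of n-gram occurrences that belong to an n-gram seen more than once. This is a minimal sketch of one plausible formulation; the paper's exact metric, n-gram size, tokenization, and rejection threshold are not given in the abstract, so `ngram_repetition_ratio` and its defaults are assumptions.

```python
from collections import Counter


def ngram_repetition_ratio(tokens, n=3):
    """Fraction of n-gram occurrences whose n-gram appears more than once.

    tokens: a pre-tokenized document (tokenization is left to the caller).
    Returns 0.0 for documents shorter than n tokens.
    """
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    if not ngrams:
        return 0.0
    counts = Counter(ngrams)
    repeated = sum(c for c in counts.values() if c > 1)
    return repeated / len(ngrams)
```

A document that loops on the same phrase scores close to 1.0, while ordinary prose scores near 0.0, so a filter could drop documents above a chosen cutoff.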

[42] Knowledge Graphs Generation from Cultural Heritage Texts: Combining LLMs and Ontological Engineering for Scholarly Debates

Andrea Schimmenti,Valentina Pasqual,Fabio Vitali,Marieke van Erp

Main category: cs.CL

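TL;DR: This paper presents ATR4CH, a systematic five-step methodology for LLM-based knowledge extraction from cultural heritage texts, and validates it on a case study of authenticity assessment debates.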

Details Motivation: Cultural heritage texts contain rich knowledge, but their unstructured nature makes them hard to turn into queryable knowledge graphs, so a systematic extraction methodology is needed. Method: ATR4CH combines annotation models, ontological frameworks, and LLM-based knowledge extraction, converting text to RDF in five steps (foundational analysis, annotation schema development, pipeline architecture, integration refinement, and comprehensive evaluation); it is validated with a sequential pipeline of three LLMs on Wikipedia articles about disputed items. Result: The method reaches 0.96-0.99 F1 on metadata extraction, 0.7-0.8 on entity recognition, 0.65-0.75 on hypothesis extraction, 0.95-0.97 on evidence extraction, and 0.62 G-EVAL on discourse representation; smaller models perform competitively and are cost-effective. Conclusion: ATR4CH is the first knowledge extraction framework to systematically combine LLMs with cultural heritage ontologies; it is replicable across domains and helps cultural heritage institutions convert textual knowledge into queryable knowledge graphs, although the output still needs human post-processing. Abstract: Cultural Heritage texts contain rich knowledge that is difficult to query systematically due to the challenges of converting unstructured discourse into structured Knowledge Graphs (KGs). This paper introduces ATR4CH (Adaptive Text-to-RDF for Cultural Heritage), a systematic five-step methodology for Large Language Model-based Knowledge Extraction from Cultural Heritage documents. We validate the methodology through a case study on authenticity assessment debates. Methodology - ATR4CH combines annotation models, ontological frameworks, and LLM-based extraction through iterative development: foundational analysis, annotation schema development, pipeline architecture, integration refinement, and comprehensive evaluation. We demonstrate the approach using Wikipedia articles about disputed items (documents, artifacts, etc.), implementing a sequential pipeline with three LLMs (Claude Sonnet 3.7, Llama 3.3 70B, GPT-4o-mini). Findings - The methodology successfully extracts complex Cultural Heritage knowledge: 0.96-0.99 F1 for metadata extraction, 0.7-0.8 F1 for entity recognition, 0.65-0.75 F1 for hypothesis extraction, 0.95-0.97 for evidence extraction, and 0.62 G-EVAL for discourse representation. Smaller models performed competitively, enabling cost-effective deployment. Originality - This is the first systematic methodology for coordinating LLM-based extraction with Cultural Heritage ontologies. ATR4CH provides a replicable framework adaptable across CH domains and institutional resources. Research Limitations - The produced KG is limited to Wikipedia articles. While the results are encouraging, human oversight is necessary during post-processing. Practical Implications - ATR4CH enables Cultural Heritage institutions to systematically convert textual knowledge into queryable KGs, supporting automated metadata enrichment and knowledge discovery.

[43] TruthfulRAG: Resolving Factual-level Conflicts in Retrieval-Augmented Generation with Knowledge Graphs

Shuyi Liu,Yuming Shang,Xi Zhang

Main category: cs.CL

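TL;DR: Proposes TruthfulRAG, the first framework that uses knowledge graphs to resolve factual-level knowledge conflicts in RAG systems, improving the accuracy and trustworthiness of generated content through triple extraction, query-based graph retrieval, and entropy-based filtering.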

Details Motivation: When retrieved information conflicts with an LLM's internal knowledge, existing RAG systems mostly work at the token or semantic level and struggle to identify factual contradictions fully, which harms the accuracy and reliability of the output. Method: Extracts triples from retrieved content to build a knowledge graph, uses query-based graph retrieval to gather relevant knowledge, and applies an entropy-based filtering mechanism to locate conflicting elements precisely and reduce inconsistencies. Result: Experiments show that TruthfulRAG outperforms existing methods on multiple knowledge-intensive tasks, mitigating knowledge conflicts and improving the robustness and trustworthiness of RAG systems. Conclusion: By using knowledge graphs, TruthfulRAG achieves fine-grained, factual-level conflict detection and resolution, making RAG systems more accurate and trustworthy when internal and external knowledge disagree. Abstract: Retrieval-Augmented Generation (RAG) has emerged as a powerful framework for enhancing the capabilities of Large Language Models (LLMs) by integrating retrieval-based methods with generative models. As external knowledge repositories continue to expand and the parametric knowledge within models becomes outdated, a critical challenge for RAG systems is resolving conflicts between retrieved external information and LLMs' internal knowledge, which can significantly compromise the accuracy and reliability of generated content. However, existing approaches to conflict resolution typically operate at the token or semantic level, often leading to fragmented and partial understanding of factual discrepancies between LLMs' knowledge and context, particularly in knowledge-intensive tasks. To address this limitation, we propose TruthfulRAG, the first framework that leverages Knowledge Graphs (KGs) to resolve factual-level knowledge conflicts in RAG systems. Specifically, TruthfulRAG constructs KGs by systematically extracting triples from retrieved content, utilizes query-based graph retrieval to identify relevant knowledge, and employs entropy-based filtering mechanisms to precisely locate conflicting elements and mitigate factual inconsistencies, thereby enabling LLMs to generate faithful and accurate responses. Extensive experiments reveal that TruthfulRAG outperforms existing methods, effectively alleviating knowledge conflicts and improving the robustness and trustworthiness of RAG systems.
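
An entropy-based filter of the kind mentioned above could compare the model's answer uncertainty with and without a retrieved knowledge path. The sketch below is a hedged illustration of that general idea, not the paper's algorithm: the abstract does not state which distribution the entropy is computed over, the direction of the comparison, or the threshold, so `flag_conflicting_paths`, its inputs, and `threshold` are hypothetical.

```python
import math


def shannon_entropy(probs):
    """Shannon entropy in bits of a discrete probability distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


def flag_conflicting_paths(baseline_probs, path_probs, threshold=0.5):
    """Flag knowledge paths whose inclusion shifts answer entropy by more than threshold.

    baseline_probs: answer distribution from the model without retrieved context.
    path_probs: {path_id: answer distribution when that path is added to the prompt}.
    ASSUMPTION: a large entropy shift signals disagreement with parametric knowledge.
    """
    base = shannon_entropy(baseline_probs)
    flagged = []
    for path_id, probs in path_probs.items():
        if abs(shannon_entropy(probs) - base) > threshold:
            flagged.append(path_id)
    return flagged
```

In this toy formulation, a path that turns a uniform guess over four answers (2 bits) into a near-certain answer is flagged, while a path that leaves the distribution unchanged is not.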

[44] Position: On the Methodological Pitfalls of Evaluating Base LLMs for Reasoning

Jason Chan,Zhixue Zhao,Robert Gaizauskas

Main category: cs.CL

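TL;DR: This paper examines methodological problems in evaluating the reasoning ability of base LLMs, pointing to a fundamental mismatch between their pretraining objective and the criteria by which reasoning is assessed.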

Details Motivation: Existing studies often evaluate base LLMs to probe their reasoning ability, but overlook that these models only predict text from statistical language patterns and do not aim at logical correctness, so such evaluations can mislead. Method: As an analytical position paper, it argues that when base LLMs produce logically valid conclusions they do so incidentally, because they follow statistically plausible language patterns and are not deliberately reasoning correctly. Result: Identifies two key problems: outputs of base LLMs should not be treated as genuine reasoning attempts, and conclusions drawn from base LLMs are hard to generalize to post-trained, instruction-optimized LLMs. Conclusion: Calls for a critical re-examination of existing work that relies on base LLMs to assess reasoning, and recommends that future work account for this methodological flaw. Abstract: Existing work investigates the reasoning capabilities of large language models (LLMs) to uncover their limitations, human-like biases and underlying processes. Such studies include evaluations of base LLMs (pre-trained on unlabeled corpora only) for this purpose. Our position paper argues that evaluating base LLMs' reasoning capabilities raises inherent methodological concerns that are overlooked in such existing studies. We highlight the fundamental mismatch between base LLMs' pretraining objective and normative qualities, such as correctness, by which reasoning is assessed. In particular, we show how base LLMs generate logically valid or invalid conclusions as coincidental byproducts of conforming to purely linguistic patterns of statistical plausibility. This fundamental mismatch challenges the assumptions that (a) base LLMs' outputs can be assessed as their bona fide attempts at correct answers or conclusions; and (b) conclusions about base LLMs' reasoning can generalize to post-trained LLMs optimized for successful instruction-following. We call for a critical re-examination of existing work that relies implicitly on these assumptions, and for future work to account for these methodological pitfalls.

[45] DELICATE: Diachronic Entity LInking using Classes And Temporal Evidence

Cristian Santini,Sebastian Barzaghi,Paolo Sernani,Emanuele Frontoni,Mehwish Alam

Main category: cs.CL

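TL;DR: This paper presents DELICATE, a neuro-symbolic method that combines a BERT encoder with contextual information from Wikidata for entity linking in historical Italian texts, and releases the multi-domain corpus ENEIDE; experiments show the method outperforms much larger models on this low-resource language and on long-tail entities while being more interpretable.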

Details Motivation: Entity linking in the humanities is hampered by complex document types, a lack of domain-specific datasets and models, and under-representation of long-tail entities in knowledge bases; existing methods handle entity disambiguation in historical texts poorly. Method: Proposes DELICATE, which combines a BERT-based encoder with contextual information from Wikidata, such as temporal plausibility and entity type consistency, in a neuro-symbolic approach to entity linking; also builds ENEIDE, a multi-domain annotated corpus of historical Italian. Result: DELICATE outperforms other models on historical Italian entity linking, including models with far more parameters; analysis shows that its confidence scores and feature sensitivity make results more explainable and interpretable. Conclusion: By combining neural and symbolic methods, DELICATE delivers strong performance and interpretability for entity linking in low-resource historical texts with long-tail entities, and the ENEIDE corpus is a valuable resource for related research. Abstract: In spite of the remarkable advancements in the field of Natural Language Processing, the task of Entity Linking (EL) remains challenging in the field of humanities due to complex document typologies, lack of domain-specific datasets and models, and long-tail entities, i.e., entities under-represented in Knowledge Bases (KBs). The goal of this paper is to address these issues with two main contributions. The first contribution is DELICATE, a novel neuro-symbolic method for EL on historical Italian which combines a BERT-based encoder with contextual information from Wikidata to select appropriate KB entities using temporal plausibility and entity type consistency. The second contribution is ENEIDE, a multi-domain EL corpus in historical Italian semi-automatically extracted from two annotated editions spanning from the 19th to the 20th century and including literary and political texts. Results show how DELICATE outperforms other EL models in historical Italian even if compared with larger architectures with billions of parameters. Moreover, further analyses reveal how DELICATE confidence scores and features sensitivity provide results which are more explainable and interpretable than purely neural methods.
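
The symbolic side of such a method can be pictured as a filter over candidates proposed by the neural encoder. The sketch below is a simplified, hypothetical illustration of temporal plausibility and type consistency: the candidate fields (`score`, `type`, `earliest_year`), the placeholder identifiers, and the hard-filter rule are assumptions made here, and the paper may combine these signals quite differently (for example as features rather than filters).

```python
def link_entity(candidates, doc_year, expected_type):
    """Pick the best-scoring candidate that is temporally and type-plausible.

    candidates: list of dicts with hypothetical keys
        "id"            - knowledge-base identifier (placeholder strings here)
        "score"         - similarity from a neural encoder
        "type"          - coarse entity class, e.g. "person" or "place"
        "earliest_year" - first year the entity could be mentioned, or None if unknown
    Returns the chosen id, or None when no candidate survives the filters.
    """
    plausible = [
        c for c in candidates
        if c["type"] == expected_type
        and (c["earliest_year"] is None or c["earliest_year"] <= doc_year)
    ]
    if not plausible:
        return None
    return max(plausible, key=lambda c: c["score"])["id"]
```

The point of the filter is that a person born in 1950 cannot be the referent of a mention in an 1880 letter, even if the encoder ranks that candidate highest because it is the best-known bearer of the name.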

[46] Analogical Structure, Minimal Contextual Cues and Contrastive Distractors: Input Design for Sample-Efficient Linguistic Rule Induction

Chunyang Jiang,Paola Merlo

Main category: cs.CL

TL;DR: 提出一种基于类比范式组织的计算方法,使轻量级模型在极少量数据下(仅100个样本)即可在结构化语言任务中超越零样本大模型(如GPT-3),验证了类比结构、对比学习和最小上下文线索的有效性。

Details Motivation: 探索是否可以通过认知启发的类比组织方式,使参数量小、训练数据少的模型达到或超过大规模语言模型在语言规则学习上的表现,从而降低对海量数据和大模型的依赖。 Method: 设计了一种结合类比结构、对比学习和最小上下文线索的计算方法,在结构化补全任务上训练轻量级模型(BERT+CNN,50万参数),使用仅100个英语因果/起始交替现象样例进行训练,并与零样本GPT-3对比;通过消融实验和跨现象验证检验方法有效性。 Result: 在因果/起始交替任务上,轻量级模型达到F1=0.95,超过零样本GPT-3的F1=0.87;消融实验显示类比和对比结构显著提升性能;在未指定宾语交替现象上的跨任务验证也复现了高效性,证明方法具有鲁棒性。 Conclusion: 类比范式组织能显著提升小模型在有限数据下的语言规则学习能力,实现与大模型相当甚至更优的性能,为低资源条件下的高效语言学习提供了新路径。 Abstract: Large language models achieve strong performance through training on vast datasets. Can analogical paradigm organization enable lightweight models to match this performance with minimal data? We develop a computational approach implementing three cognitive-inspired principles: analogical structure, contrastive learning, and minimal contextual cues. We test this approach with structured completion tasks where models identify correct sentence completions from analogical patterns with contrastive alternatives. Training lightweight models (BERT+CNN, $0.5M$ parameters) on only one hundred structured examples of English causative/inchoative alternations achieves $F1=0.95$, outperforming zero-shot \texttt{GPT-o3} ($F1=0.87$). Ablation studies confirm that analogical organization and contrastive structure improve performance, consistently surpassing randomly shuffled baselines across architectures. Cross-phenomenon validation using unspecified object alternations replicates these efficiency gains, confirming approach robustness. Our results show that analogical paradigm organization enables competitive linguistic rule learning with orders of magnitude less data than conventional approaches require.

[47] Reasoning About Intent for Ambiguous Requests

Irina Saparina,Mirella Lapata

Main category: cs.CL

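TL;DR: Proposes handling ambiguous requests by generating multiple interpretation-answer pairs, training models with reinforcement learning and customized reward functions; the method covers more valid answers than baselines on conversational question answering and semantic parsing.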

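Details Motivation: When faced with an ambiguous request, LLMs usually commit implicitly to one interpretation, which can misread the user's intent, frustrate users, and create safety risks, so a more transparent and reliable way to handle ambiguity is needed. Method: Uses reinforcement learning with customized reward functions, with multiple valid answers as the supervision signal, to train models that output a structured response containing several interpretation-answer pairs in a single generation. Result: In conversational question answering and semantic parsing experiments the method covers more valid answers than baseline approaches, and human evaluation shows that predicted interpretations align closely with their answers. Conclusion: The method improves transparency by stating interpretations explicitly, stays efficient by generating in a single step, and supports downstream applications through structured output. Abstract: Large language models often respond to ambiguous requests by implicitly committing to one interpretation. Intent misunderstandings can frustrate users and create safety risks. To address this, we propose generating multiple interpretation-answer pairs in a single structured response to ambiguous requests. Our models are trained with reinforcement learning and customized reward functions using multiple valid answers as supervision. Experiments on conversational question answering and semantic parsing demonstrate that our method achieves higher coverage of valid answers than baseline approaches. Human evaluation confirms that predicted interpretations are highly aligned with their answers. Our approach promotes transparency with explicit interpretations, achieves efficiency by requiring only one generation step, and supports downstream applications through its structured output format.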
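
Coverage of valid answers can be made concrete as the share of gold answers that appear among the answers in a structured response. The sketch below is one simple formulation under stated assumptions: the paper's actual matching rule and reward functions are not given in the abstract, so the exact-match comparison after lowercasing and the name `answer_coverage` are hypothetical.

```python
def normalize(text):
    """Crude normalization for matching: trim whitespace and lowercase."""
    return text.strip().lower()


def answer_coverage(predicted_pairs, valid_answers):
    """Fraction of valid answers matched by at least one predicted answer.

    predicted_pairs: list of (interpretation, answer) tuples from one response.
    valid_answers: list of gold answers, one per legitimate reading of the request.
    """
    gold = {normalize(a) for a in valid_answers}
    if not gold:
        return 0.0
    predicted = {normalize(answer) for _, answer in predicted_pairs}
    return len(predicted & gold) / len(gold)
```

A response that commits to a single interpretation can cover at most one of several valid answers, which is why a model emitting multiple interpretation-answer pairs can score higher on this kind of metric.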

[48] Exploring State Tracking Capabilities of Large Language Models

Kiamehr Rezaee,Jose Camacho-Collados,Mohammad Taher Pilehvar

Main category: cs.CL

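TL;DR: This paper studies how LLMs perform on state tracking and proposes a benchmark built on three well-defined tasks. Recent models (such as GPT-4 and Llama3) track state effectively when combined with mechanisms like Chain of Thought, while earlier models often fail after several steps.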

Details Motivation: To assess LLMs on tasks that require reasoning and sustained state management, isolating state tracking as the core challenge. Method: Designs a benchmark of three well-defined state tracking tasks and evaluates different LLMs under several settings, including with and without prompting techniques such as Chain of Thought. Result: Recent models such as GPT-4 and Llama3 complete state tracking tasks effectively, especially with Chain of Thought; earlier-generation models understand the task and succeed at first but often fail after a number of steps. Conclusion: State tracking remains challenging for LLMs, there is a clear performance gap between model generations, and prompting mechanisms such as Chain of Thought help. Abstract: Large Language Models (LLMs) have demonstrated impressive capabilities in solving complex tasks, including those requiring a certain level of reasoning. In this paper, we focus on state tracking, a problem where models need to keep track of the state governing a number of entities. To isolate the state tracking component from other factors, we propose a benchmark based on three well-defined state tracking tasks and analyse the performance of LLMs in different scenarios. The results indicate that the recent generation of LLMs (specifically, GPT-4 and Llama3) are capable of tracking state, especially when integrated with mechanisms such as Chain of Thought. However, models from the former generation, while understanding the task and being able to solve it at the initial stages, often fail at this task after a certain number of steps.

[49] LocalBench: Benchmarking LLMs on County-Level Local Knowledge and Reasoning

Zihan Gao,Yifei Xu,Jacob Thebault-Spieker

Main category: cs.CL

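TL;DR: This paper presents LocalBench, the first benchmark to systematically evaluate LLMs on county-level local knowledge in the United States, and reveals serious shortcomings in how current models handle hyper-local knowledge.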

Details Motivation: LLMs perform well on macro-scale geographic tasks, but their grasp of hyper-local knowledge (such as neighborhood dynamics and local governance) is poorly understood, even though this ability matters more and more in real applications. Method: Grounded in the Localness Conceptual Framework, builds the LocalBench benchmark with 14,782 validated questions covering 526 counties, drawing on Census data, local subreddits, and regional news, and evaluates 13 state-of-the-art LLMs along physical, cognitive, and relational dimensions in closed-book and web-augmented settings. Result: The best model reaches only 56.8% accuracy on narrative-style questions and below 15.5% on numerical reasoning; larger models and web augmentation do not always help: search improves Gemini by 13.6% but lowers GPT-series performance by 11.4%. Conclusion: Current LLMs have significant weaknesses in fine-grained local knowledge, and equitable, place-aware AI systems are urgently needed. Abstract: Large language models (LLMs) have been widely evaluated on macro-scale geographic tasks, such as global factual recall, event summarization, and regional reasoning. Yet, their ability to handle hyper-local knowledge remains poorly understood. This gap is increasingly consequential as real-world applications, from civic platforms to community journalism, demand AI systems that can reason about neighborhood-specific dynamics, cultural narratives, and local governance. Existing benchmarks fall short in capturing this complexity, often relying on coarse-grained data or isolated references. We present LocalBench, the first benchmark designed to systematically evaluate LLMs on county-level local knowledge across the United States. Grounded in the Localness Conceptual Framework, LocalBench includes 14,782 validated question-answer pairs across 526 U.S. counties in 49 states, integrating diverse sources such as Census statistics, local subreddit discourse, and regional news. It spans physical, cognitive, and relational dimensions of locality. Using LocalBench, we evaluate 13 state-of-the-art LLMs under both closed-book and web-augmented settings. Our findings reveal critical limitations: even the best-performing models reach only 56.8% accuracy on narrative-style questions and perform below 15.5% on numerical reasoning. Moreover, larger model size and web augmentation do not guarantee better performance: for example, search improves Gemini's accuracy by 13.6% but reduces GPT-series performance by 11.4%. These results underscore the urgent need for language models that can support equitable, place-aware AI systems capable of engaging with the diverse, fine-grained realities of local communities across geographic and cultural contexts.

[50] Beyond Elicitation: Provision-based Prompt Optimization for Knowledge-Intensive Tasks

Yunzhe Xu,Zhuosheng Zhang,Zhe Liu

Main category: cs.CL

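TL;DR: Proposes KPPO, a knowledge-provision-based prompt optimization framework that uses knowledge gap filling, batch-wise evaluation, and adaptive knowledge pruning to clearly outperform conventional prompt optimization on knowledge-intensive tasks.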

Details Motivation: Existing prompt optimization methods mainly try to activate abilities the model already has, which falls short of the precise knowledge and specialized reasoning that knowledge-intensive tasks demand. Method: KPPO reframes prompt optimization as systematic knowledge integration, comprising knowledge gap identification and filling, batch-wise candidate evaluation that weighs both performance and distributional stability, and an adaptive knowledge pruning strategy that balances performance against token efficiency. Result: Across 15 knowledge-intensive benchmarks from various domains, KPPO improves on the strongest baseline by about 6% on average with comparable or lower token consumption, cutting token usage by up to 29%. Conclusion: By actively supplying domain knowledge instead of only eliciting the model's latent ability, KPPO offers a more effective and efficient prompt optimization paradigm for knowledge-intensive tasks. Abstract: While prompt optimization has emerged as a critical technique for enhancing language model performance, existing approaches primarily focus on elicitation-based strategies that search for optimal prompts to activate models' capabilities. These methods exhibit fundamental limitations when addressing knowledge-intensive tasks, as they operate within fixed parametric boundaries rather than providing the factual knowledge, terminology precision, and reasoning patterns required in specialized domains. To address these limitations, we propose Knowledge-Provision-based Prompt Optimization (KPPO), a framework that reformulates prompt optimization as systematic knowledge integration rather than potential elicitation. KPPO introduces three key innovations: 1) a knowledge gap filling mechanism for knowledge gap identification and targeted remediation; 2) a batch-wise candidate evaluation approach that considers both performance improvement and distributional stability; 3) an adaptive knowledge pruning strategy that balances performance and token efficiency, reducing up to 29% token usage. Extensive evaluation on 15 knowledge-intensive benchmarks from various domains demonstrates KPPO's superiority over elicitation-based methods, with an average performance improvement of ~6% over the strongest baseline while achieving comparable or lower token consumption. Code at: https://github.com/xyz9911/KPPO.

[51] Rubric-Based Benchmarking and Reinforcement Learning for Advancing LLM Instruction Following

Yun He,Wenzhe Li,Hejia Zhang,Songlin Li,Karishma Mandyam,Sopan Khosla,Yuanhao Xiong,Nanshu Wang,Selina Peng,Beibin Li,Shengjie Bi,Shishir G. Patil,Qi Qi,Shengyu Feng,Julian Katz-Samuels,Richard Yuanzhe Pang,Sujan Gonugondla,Hunter Lang,Yue Yu,Yundi Qian,Maryam Fazel-Zarandi,Licheng Yu,Amine Benhalloum,Hany Awadalla,Manaal Faruqui

Main category: cs.CL

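TL;DR: Introduces the AdvancedIF benchmark and the RIFL training method, which uses rubric-based reinforcement learning to substantially improve large language models on complex instruction-following tasks.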

Details Motivation: Large language models still struggle with complex, multi-turn, and system-level instruction following, and the field lacks high-quality human-annotated benchmarks and reliable reward signals for evaluating and training this capability. Method: Introduces AdvancedIF, a benchmark with over 1,600 prompts and expert-designed rubrics, and RIFL, a training framework that combines rubric generation, a finetuned rubric verifier, and reward shaping to enable rubric-based reinforcement learning for instruction following. Result: RIFL achieves a 6.7% absolute gain on AdvancedIF and performs strongly on public benchmarks; ablation studies confirm the contribution of each component. Conclusion: Rubrics prove to be a powerful tool for both training and evaluating advanced instruction following, opening a path toward more capable and reliable AI systems. Abstract: Recent progress in large language models (LLMs) has led to impressive performance on a range of tasks, yet advanced instruction following (IF), especially for complex, multi-turn, and system-prompted instructions, remains a significant challenge. Rigorous evaluation and effective training for such capabilities are hindered by the lack of high-quality, human-annotated benchmarks and reliable, interpretable reward signals. In this work, we introduce AdvancedIF (we will release this benchmark soon), a comprehensive benchmark featuring over 1,600 prompts and expert-curated rubrics that assess LLMs' ability to follow complex, multi-turn, and system-level instructions. We further propose RIFL (Rubric-based Instruction-Following Learning), a novel post-training pipeline that leverages rubric generation, a finetuned rubric verifier, and reward shaping to enable effective reinforcement learning for instruction following. Extensive experiments demonstrate that RIFL substantially improves the instruction-following abilities of LLMs, achieving a 6.7% absolute gain on AdvancedIF and strong results on public benchmarks. Our ablation studies confirm the effectiveness of each component in RIFL. This work establishes rubrics as a powerful tool for both training and evaluating advanced IF in LLMs, paving the way for more capable and reliable AI systems.

[52] LOCA-R: Near-Perfect Performance on the Chinese Physics Olympiad 2025

Dong-Shan Jian,Xiang Li,Chen-Xu Yan,Hui-Wen Zheng,Zhi-Zhang Bian,You-Le Fang,Sheng-Qi Zhang,Bing-Rui Gong,Ren-Xi He,Jing-Tian Zhang,Ce Meng,Yan-Qing Ma

Main category: cs.CL

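TL;DR: Introduces LOCA-R (LOgical Chain Augmentation for Reasoning), an improved framework for complex reasoning, and applies it to the Chinese Physics Olympiad 2025 theory examination, scoring a near-perfect 313/320 and surpassing both the top human score and all baseline methods.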

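Details Motivation: Physics competition problems are highly challenging for AI, requiring precise calculation, abstract reasoning, and a deep understanding of physical principles. The Chinese Physics Olympiad (CPhO), known for its complexity and depth, is an ideal testbed for these advanced capabilities. Method: Improves the LOCA framework into LOCA-R (LOgical Chain Augmentation for Reasoning), which strengthens logical reasoning chains to improve performance on complex mathematical and physical problems, and applies it to the CPhO 2025 theory examination. Result: LOCA-R scores 313 out of 320 on the CPhO 2025 theory examination, significantly outperforming all baseline methods and surpassing the highest-scoring human competitor. Conclusion: LOCA-R shows strong reasoning ability on highly difficult physics competition problems, indicating considerable potential for integrating calculation, reasoning, and physical knowledge, and pointing to a new direction for AI on complex scientific problems. Abstract: Olympiad-level physics problem-solving presents a significant challenge for both humans and artificial intelligence (AI), as it requires a sophisticated integration of precise calculation, abstract reasoning, and a fundamental grasp of physical principles. The Chinese Physics Olympiad (CPhO), renowned for its complexity and depth, serves as an ideal and rigorous testbed for these advanced capabilities. In this paper, we introduce LOCA-R (LOgical Chain Augmentation for Reasoning), an improved version of the LOCA framework adapted for complex reasoning, and apply it to the CPhO 2025 theory examination. LOCA-R achieves a near-perfect score of 313 out of 320 points, solidly surpassing the highest-scoring human competitor and significantly outperforming all baseline methods.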

[53] Say It Differently: Linguistic Styles as Jailbreak Vectors

Srikant Panda,Avinash Rai

Main category: cs.CL

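TL;DR: Studies linguistic style (such as fear or curiosity) as a new jailbreak attack surface for large language models, finds that stylistic reframing substantially raises jailbreak success rates, and proposes a defense that uses a secondary LLM to neutralize style.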

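Details Motivation: Prior work focuses mostly on semantically equivalent jailbreak prompts and overlooks the safety risk posed by variation in linguistic style; this paper systematically examines how different linguistic styles affect the safety of aligned models. Method: Using handcrafted templates and LLM-based rewrites, prompts from three standard datasets are transformed into 11 distinct linguistic styles to build a style-augmented jailbreak benchmark, on which 16 mainstream instruction-tuned models are evaluated; the paper also proposes style neutralization, an input preprocessing step based on an auxiliary LLM, to weaken the effect of stylistic manipulation. Result: Stylistic reframing raises jailbreak success rates by up to 57 percentage points; fearful, curious, and compassionate styles are the most effective, and contextualized rewrites outperform templated variants. The proposed style neutralization significantly reduces jailbreak success rates. Conclusion: Linguistic style is an overlooked systemic vulnerability that current safety mechanisms defend against poorly, so style robustness should be considered in deployment pipelines. Abstract: Large Language Models (LLMs) are commonly evaluated for robustness against paraphrased or semantically equivalent jailbreak prompts, yet little attention has been paid to linguistic variation as an attack surface. In this work, we systematically study how linguistic styles such as fear or curiosity can reframe harmful intent and elicit unsafe responses from aligned models. We construct a style-augmented jailbreak benchmark by transforming prompts from 3 standard datasets into 11 distinct linguistic styles using handcrafted templates and LLM-based rewrites, while preserving semantic intent. Evaluating 16 open- and closed-source instruction-tuned models, we find that stylistic reframing increases jailbreak success rates by up to +57 percentage points. Styles such as fearful, curious and compassionate are most effective and contextualized rewrites outperform templated variants. To mitigate this, we introduce a style neutralization preprocessing step using a secondary LLM to strip manipulative stylistic cues from user inputs, significantly reducing jailbreak success rates. Our findings reveal a systemic and scaling-resistant vulnerability overlooked in current safety pipelines.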

[54] Convomem Benchmark: Why Your First 150 Conversations Don't Need RAG

Egor Pakhomov,Erik Nijkamp,Caiming Xiong

Main category: cs.CL

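TL;DR: Introduces a comprehensive benchmark for evaluating conversational memory, containing 75,336 question-answer pairs across categories including user facts, assistant recall, preferences, and temporal changes. For short conversation histories, simple full-context approaches outperform sophisticated RAG-based memory systems such as Mem0; long context excels for the first 30 conversations and remains viable up to 150, after which hybrid or RAG approaches are needed to control cost and latency. The small-corpus advantage of conversational memory deserves dedicated study rather than direct reuse of general RAG solutions.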

Details Motivation: Existing conversational memory benchmarks are limited in statistical power, data generation consistency, and evaluation flexibility, and current work often applies retrieval-augmented generation (RAG) directly to conversational memory while ignoring its distinctive property of starting from zero and growing with each conversation, so evaluation frameworks and methods designed for that property are needed. Method: Builds a large-scale, multi-category conversational memory benchmark, systematically compares full-context models with RAG-based memory systems (such as Mem0) across conversation lengths, identifies the performance transition points, and examines where long-context and RAG approaches each apply. Result: Simple full-context approaches reach 70-82% accuracy on challenging multi-message evidence cases, whereas RAG systems such as Mem0 reach only 30-45%; long context is viable below 150 conversations, beyond which cost and latency rise sharply and hybrid or RAG approaches become necessary. Conclusion: Conversational memory has a distinctive small-corpus advantage; the efficiency of long-context methods should be exploited within the first 150 conversations, and future research should focus on optimizing this regime rather than copying conventional RAG frameworks. Abstract: We introduce a comprehensive benchmark for conversational memory evaluation containing 75,336 question-answer pairs across diverse categories including user facts, assistant recall, abstention, preferences, temporal changes, and implicit connections. While existing benchmarks have advanced the field, our work addresses fundamental challenges in statistical power, data generation consistency, and evaluation flexibility that limit current memory evaluation frameworks. We examine the relationship between conversational memory and retrieval-augmented generation (RAG). While these systems share fundamental architectural patterns--temporal reasoning, implicit extraction, knowledge updates, and graph representations--memory systems have a unique characteristic: they start from zero and grow progressively with each conversation. This characteristic enables naive approaches that would be impractical for traditional RAG. Consistent with recent findings on long context effectiveness, we observe that simple full-context approaches achieve 70-82% accuracy even on our most challenging multi-message evidence cases, while sophisticated RAG-based memory systems like Mem0 achieve only 30-45% when operating on conversation histories under 150 interactions. Our analysis reveals practical transition points: long context excels for the first 30 conversations, remains viable with manageable trade-offs up to 150 conversations, and typically requires hybrid or RAG approaches beyond that point as costs and latencies become prohibitive.
These patterns indicate that the small-corpus advantage of conversational memory--where exhaustive search and complete reranking are feasible--deserves dedicated research attention rather than simply applying general RAG solutions to conversation histories.
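The transition points reported in the abstract can be read as a simple routing rule. The following is a minimal sketch under stated assumptions: the function name, the returned labels, and the choice to treat 30 and 150 as inclusive boundaries are all hypothetical and not taken from the paper.

```python
def choose_memory_strategy(num_conversations: int) -> str:
    """Pick a memory strategy from the transition points quoted in the abstract.

    Assumption: both thresholds are treated as inclusive upper bounds.
    """
    if num_conversations < 0:
        raise ValueError("num_conversations must be non-negative")
    if num_conversations <= 30:
        # Long context is reported to excel in this range.
        return "full_context"
    if num_conversations <= 150:
        # Still viable, but with cost and latency trade-offs.
        return "full_context_with_tradeoffs"
    # Beyond this point costs and latencies are reported as prohibitive.
    return "hybrid_or_rag"
```

In practice the exact cut-over would presumably depend on conversation length, model pricing, and latency budget, so the thresholds are better treated as defaults than as fixed constants.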

[55] Computing the Formal and Institutional Boundaries of Contemporary Genre and Literary Fiction

Natasha Johnson

Main category: cs.CL

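TL;DR: Uses computational methods to analyze the formal and institutional characteristics of literary genres, finding statistically significant formal markers for each literary category and showing how female authorship constrains the attainment of literary status.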

Details Motivation: To examine the validity of genre as a formal rather than institutional classification, and the role of author gender in literary classification. Method: Builds a corpus of literary and genre fiction from Piper's CONLIT dataset, uses Welch's ANOVA to compare the distribution of narrative features by author gender across genres, fits logistic regression models to measure each feature's effect on literary classification and how author gender moderates it, and analyzes stylistic and semantic vector representations to assess the importance of form and content. Result: Finds statistically significant formal markers for each literary category, and shows that female authorship narrows and blurs the target for achieving literary status. Conclusion: Genre is shaped not only by form but also by institutional factors such as author gender, and female authors face a higher bar in the literary field. Abstract: Though the concept of genre has been a subject of discussion for millennia, the relatively recent emergence of genre fiction has added a new layer to this ongoing conversation. While more traditional perspectives on genre have emphasized form, contemporary scholarship has invoked both formal and institutional characteristics in its taxonomy of genre, genre fiction, and literary fiction. This project uses computational methods to explore the soundness of genre as a formal designation as opposed to an institutional one. Pulling from Andrew Piper's CONLIT dataset of Contemporary Literature, we assemble a corpus of literary and genre fiction, with the latter category containing romance, mystery, and science fiction novels. We use Welch's ANOVA to compare the distribution of narrative features according to author gender within each genre and within genre versus literary fiction. Then, we use logistic regression to model the effect that each feature has on literary classification and to measure how author gender moderates these effects. Finally, we analyze stylistic and semantic vector representations of our genre categories to understand the importance of form and content in literary classification. This project finds statistically significant formal markers of each literary category and illustrates how female authorship narrows and blurs the target for achieving literary status.

[56] URaG: Unified Retrieval and Generation in Multimodal LLMs for Efficient Long Document Understanding

Yongxin Shi,Jiapeng Wang,Zeyu Shan,Dezhi Peng,Zening Lin,Lianwen Jin

Main category: cs.CL

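TL;DR: Proposes URaG, a framework that unifies retrieval and generation inside a multimodal large language model, using the early layers as a cross-modal retrieval module for evidence selection to improve the efficiency and accuracy of long document understanding.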

Details Motivation: Existing methods for long documents suffer from information interference and the high computational cost of Transformers, while compression loses detail and external retrievers add complexity. The authors aim to exploit the coarse-to-fine reasoning pattern implicit in MLLMs to achieve efficient, end-to-end long document understanding. Method: URaG introduces a lightweight cross-modal retrieval module that turns the early Transformer layers into an evidence selector, keeping the relevant pages and discarding irrelevant content so the deeper layers can concentrate resources on key information, thereby unifying retrieval and generation. Result: URaG achieves state-of-the-art performance across multiple tasks while reducing computational overhead by 44-56%. Conclusion: URaG makes effective use of the coarse-to-fine reasoning inherent in MLLMs to achieve efficient and accurate long document understanding without external modules, offering a new approach to building more efficient multimodal models. Abstract: Recent multimodal large language models (MLLMs) still struggle with long document understanding due to two fundamental challenges: information interference from abundant irrelevant content, and the quadratic computational cost of Transformer-based architectures. Existing approaches primarily fall into two categories: token compression, which sacrifices fine-grained details; and introducing external retrievers, which increase system complexity and prevent end-to-end optimization. To address these issues, we conduct an in-depth analysis and observe that MLLMs exhibit a human-like coarse-to-fine reasoning pattern: early Transformer layers attend broadly across the document, while deeper layers focus on relevant evidence pages. Motivated by this insight, we posit that the inherent evidence localization capabilities of MLLMs can be explicitly leveraged to perform retrieval during the reasoning process, facilitating efficient long document understanding. To this end, we propose URaG, a simple-yet-effective framework that Unifies Retrieval and Generation within a single MLLM. URaG introduces a lightweight cross-modal retrieval module that converts the early Transformer layers into an efficient evidence selector, identifying and preserving the most relevant pages while discarding irrelevant content. This design enables the deeper layers to concentrate computational resources on pertinent information, improving both accuracy and efficiency. Extensive experiments demonstrate that URaG achieves state-of-the-art performance while reducing computational overhead by 44-56%.
The code is available at https://github.com/shi-yx/URaG.

[57] DESS: DeBERTa Enhanced Syntactic-Semantic Aspect Sentiment Triplet Extraction

Vishal Thenuwara,Nisansa de Silva

Main category: cs.CL

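TL;DR: Proposes DESS, a new approach to fine-grained sentiment analysis that combines DeBERTa with an LSTM in a dual-channel structure and significantly outperforms existing methods at extracting aspect-opinion pairs and determining sentiment polarity.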

Details Motivation: Existing Aspect Sentiment Triplet Extraction (ASTE) methods still fall short in capturing the complex relationships among aspects, opinions, and sentiment polarities, with limited handling of long-range dependencies and complex linguistic structures. Method: DESS pairs DeBERTa's enhanced attention mechanism with an LSTM in a dual-channel structure that processes semantic and syntactic information separately, and refines how the two are fused to improve the interaction between different types of linguistic information. Result: On standard datasets, DESS improves F1 by 4.85, 8.36, and 2.42 respectively, and performs especially well on long-range dependencies and complex sentence structures. Conclusion: Thoughtfully integrating more advanced language models such as DeBERTa effectively improves fine-grained sentiment analysis, confirming the combined value of model upgrades and structural refinement. Abstract: Fine-grained sentiment analysis faces ongoing challenges in Aspect Sentiment Triplet Extraction (ASTE), particularly in accurately capturing the relationships between aspects, opinions, and sentiment polarities. While researchers have made progress using BERT and Graph Neural Networks, the full potential of advanced language models in understanding complex language patterns remains unexplored. We introduce DESS, a new approach that builds upon previous work by integrating DeBERTa's enhanced attention mechanism to better understand context and relationships in text. Our framework maintains a dual-channel structure, where DeBERTa works alongside an LSTM channel to process both meaning and grammatical patterns in text. We have carefully refined how these components work together, paying special attention to how different types of language information interact. When we tested DESS on standard datasets, it showed meaningful improvements over current methods, with F1-score increases of 4.85, 8.36, and 2.42 in identifying aspect opinion pairs and determining sentiment accurately. Looking deeper into the results, we found that DeBERTa's sophisticated attention system helps DESS handle complicated sentence structures better, especially when important words are far apart. Our findings suggest that upgrading to more advanced language models, when thoughtfully integrated, can lead to real improvements in how well we can analyze sentiments in text. The implementation of our approach is publicly available at: https://github.com/VishalRepos/DESS.

[58] Evaluating Prompting Strategies with MedGemma for Medical Order Extraction

Abhinand Balachandran,Bavana Durgapraveen,Gowsikkan Sikkan Sudhagar,Vidhya Varshany J S,Sriram Rajkumar

Main category: cs.CL

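TL;DR: Studies how MedGemma performs at extracting structured medical orders from doctor-patient conversations, comparing three prompting paradigms and finding that simple one-shot prompting works best on manually annotated data.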

Details Motivation: Reducing the clinical documentation burden and ensuring patient safety require accurate extraction of medical orders from doctor-patient conversations. Method: Uses the MedGemma model to evaluate three prompting paradigms: one-shot prompting, the ReAct reasoning framework, and a multi-step agentic workflow. Result: On the official validation set, simple one-shot prompting performs best, while more complex frameworks may introduce noise through "overthinking". Conclusion: On manually annotated doctor-patient conversations, direct prompting is more robust and efficient, giving practical guidance for choosing strategies in clinical information extraction. Abstract: The accurate extraction of medical orders from doctor-patient conversations is a critical task for reducing clinical documentation burdens and ensuring patient safety. This paper details our team submission to the MEDIQA-OE-2025 Shared Task. We investigate the performance of MedGemma, a new domain-specific open-source language model, for structured order extraction. We systematically evaluate three distinct prompting paradigms: a straightforward one-shot approach, a reasoning-focused ReAct framework, and a multi-step agentic workflow. Our experiments reveal that while more complex frameworks like ReAct and agentic flows are powerful, the simpler one-shot prompting method achieved the highest performance on the official validation set. We posit that on manually annotated transcripts, complex reasoning chains can lead to "overthinking" and introduce noise, making a direct approach more robust and efficient. Our work provides valuable insights into selecting appropriate prompting strategies for clinical information extraction in varied data conditions.

[59] Mined Prompting and Metadata-Guided Generation for Wound Care Visual Question Answering

Bavana Durgapraveen,Sornaraj Sivasankaran,Abhinand Balachandran,Sriram Rajkumar

Main category: cs.CL

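TL;DR: Presents two approaches for generating free-text wound care responses, a retrieval-based (mined) prompting strategy and a metadata-guided generation method, which improve response relevance and clinical precision.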

Details Motivation: The rapid growth of asynchronous remote care has increased clinician workload, creating demand for AI systems that help handle patient queries. The MEDIQA-WV 2025 shared task addresses this by generating wound care responses from paired images and text. Method: The first approach uses mined prompting, retrieving similar training examples as few-shot demonstrations; the second uses a metadata ablation study to select four key metadata attributes, trains classifiers to predict them, and adjusts generation dynamically. Result: Mined prompting improves response relevance, and metadata-guided generation further improves clinical precision. Conclusion: Combining mined prompting with metadata-guided generation is a workable direction for building efficient, reliable AI-assisted wound care systems. Abstract: The rapid expansion of asynchronous remote care has intensified provider workload, creating demand for AI systems that can assist clinicians in managing patient queries more efficiently. The MEDIQA-WV 2025 shared task addresses this challenge by focusing on generating free-text responses to wound care queries paired with images. In this work, we present two complementary approaches developed for the English track. The first leverages a mined prompting strategy, where training data is embedded and the top-k most similar examples are retrieved to serve as few-shot demonstrations during generation. The second approach builds on a metadata ablation study, which identified four metadata attributes that consistently enhance response quality. We train classifiers to predict these attributes for test cases and incorporate them into the generation pipeline, dynamically adjusting outputs based on prediction confidence. Experimental results demonstrate that mined prompting improves response relevance, while metadata-guided generation further refines clinical precision. Together, these methods highlight promising directions for developing AI-driven tools that can provide reliable and efficient wound care support.
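The mined prompting step described in the abstract (embed the training data, retrieve the top-k most similar examples, use them as few-shot demonstrations) can be sketched as follows. This is an illustration only: the abstract does not name the embedding model or the similarity measure, so cosine similarity, the toy two-dimensional vectors, and the function names `cosine` and `mine_few_shot` are all assumptions.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def mine_few_shot(query_vec, train_items, k=2):
    """Return the texts of the k training examples most similar to the query.

    train_items is a list of (embedding, example_text) pairs; the embeddings
    are assumed to come from some external encoder not shown here.
    """
    ranked = sorted(train_items, key=lambda item: cosine(query_vec, item[0]), reverse=True)
    return [text for _, text in ranked[:k]]

# Toy embeddings standing in for real encoder outputs.
train = [
    ([1.0, 0.0], "example A"),
    ([0.0, 1.0], "example B"),
    ([0.9, 0.1], "example C"),
]
demos = mine_few_shot([1.0, 0.0], train, k=2)
```

The retrieved texts would then be placed in the prompt ahead of the test query; how the demonstrations are formatted is left open here.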

[60] Know Your Limits: Entropy Estimation Modeling for Compression and Generalization

Benjamin L. Badger,Matthew Neligeorge

Main category: cs.CL

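TL;DR: Proposes encoder-augmented causal decoder architectures that train more efficiently and achieve higher compression, and uses per-token entropy estimates to improve language model generalization.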

Details Motivation: Because of the informational entropy intrinsic to language, language prediction has an upper limit on accuracy and a lower bound on compression, and existing large models cannot estimate language entropy efficiently. Method: Introduces encoder-augmented causal decoder architectures and trains with per-token entropy estimates to improve training efficiency and compression. Result: The new models outperform conventional causal transformers on modest hardware, and models trained toward entropy targets generalize better. Conclusion: Training strategies that take per-token entropy targets into account help improve generalization and offer a new path for language model compression and efficiency. Abstract: Language prediction is constrained by informational entropy intrinsic to language, such that there exists a limit to how accurate any language model can become and equivalently a lower bound to language compression. The most efficient language compression algorithms today are causal (next token prediction) large language models, but the use of these models to form accurate estimates of language entropy is currently computationally infeasible. We introduce encoder-augmented causal decoder model architectures that exhibit superior training efficiency characteristics and achieve higher compression than causal transformers even when trained on modest hardware. We demonstrate how entropy estimates can be obtained on a per-token basis, and show that the generalization of models trained to approach the entropy of their training data necessarily exceeds the generalization of models trained to minimize loss beyond this value. We show empirically that causal models trained to approach but not exceed estimated per-token entropies exhibit greater generalization than models trained without taking entropy into account.
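To make the link between prediction and compression concrete, the sketch below computes the standard per-token surprisal, -log2 p, from the probabilities a model assigns to the observed tokens. It is not the paper's estimator: the function names and the example probabilities are hypothetical, and the mean value is a model's cross-entropy in bits per token, which in expectation upper-bounds the true entropy rather than equalling it.

```python
import math

def per_token_bits(token_probs):
    """Surprisal in bits for each observed token, given the model's probability for it."""
    return [-math.log2(p) for p in token_probs]

def mean_bits_per_token(token_probs):
    """Average surprisal: the code length per token an ideal coder would need under this model."""
    bits = per_token_bits(token_probs)
    return sum(bits) / len(bits)

# Hypothetical probabilities a model assigned to three observed tokens.
probs = [0.5, 0.25, 0.125]
bits = per_token_bits(probs)       # [1.0, 2.0, 3.0]
average = mean_bits_per_token(probs)  # 2.0
```

A lower average means better compression under that model, which is the sense in which the abstract treats language models as compressors.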

[61] SSR: Socratic Self-Refine for Large Language Model Reasoning

Haizhou Shi,Ye Liu,Bo Pang,Zeyu Leo Liu,Hao Wang,Silvio Savarese,Caiming Xiong,Yingbo Zhou,Semih Yavuz

Main category: cs.CL

TL;DR: Proposes Socratic Self-Refine (SSR), a new framework that improves large language model reasoning through fine-grained verification and step-by-step refinement.

Details Motivation: Existing test-time frameworks rely on coarse self-verification and self-correction, which handle complex tasks poorly and limit further gains in LLM reasoning. Method: Decomposes model responses into verifiable (sub-question, sub-answer) pairs, estimates step-level confidence through controlled re-solving and self-consistency checks, and iteratively refines unreliable steps. Result: Experiments on five reasoning benchmarks and three LLMs show that SSR consistently outperforms state-of-the-art self-refinement methods. Conclusion: Beyond improving reasoning accuracy, SSR provides an interpretable, principled black-box approach for evaluating and understanding the internal reasoning processes of LLMs. Abstract: Large Language Models (LLMs) have demonstrated remarkable reasoning abilities, yet existing test-time frameworks often rely on coarse self-verification and self-correction, limiting their effectiveness on complex tasks. In this paper, we propose Socratic Self-Refine (SSR), a novel framework for fine-grained evaluation and precise refinement of LLM reasoning. Our proposed SSR decomposes model responses into verifiable (sub-question, sub-answer) pairs, enabling step-level confidence estimation through controlled re-solving and self-consistency checks. By pinpointing unreliable steps and iteratively refining them, SSR produces more accurate and interpretable reasoning chains. Empirical results across five reasoning benchmarks and three LLMs show that SSR consistently outperforms state-of-the-art iterative self-refinement baselines. Beyond performance gains, SSR provides a principled black-box approach for evaluating and understanding the internal reasoning processes of LLMs. Code is available at https://github.com/SalesforceAIResearch/socratic-self-refine-reasoning.
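The step-level confidence idea in the abstract can be illustrated with a small self-consistency check. This is a hedged sketch, not the released implementation: it assumes confidence is the fraction of re-solved answers that agree with the original sub-answer, and the names `step_confidence`, `find_unreliable_steps`, and the 0.5 threshold are hypothetical. The re-solved answers are given as plain lists where a real system would sample them from an LLM.

```python
def step_confidence(sub_answer, resolved_answers):
    """Fraction of independent re-solves that agree with the original sub-answer."""
    if not resolved_answers:
        return 0.0
    agree = sum(1 for answer in resolved_answers if answer == sub_answer)
    return agree / len(resolved_answers)

def find_unreliable_steps(steps, threshold=0.5):
    """Return indices of steps whose confidence falls below the threshold.

    steps is a list of (sub_question, sub_answer, resolved_answers) tuples.
    """
    return [
        index
        for index, (_, sub_answer, resolved) in enumerate(steps)
        if step_confidence(sub_answer, resolved) < threshold
    ]

# Toy reasoning chain: the second step disagrees with most re-solves.
chain = [
    ("What is 2 + 3?", "5", ["5", "5", "5"]),
    ("What is 5 * 4?", "25", ["20", "20", "25"]),
]
flagged = find_unreliable_steps(chain)
```

Steps flagged this way would be the candidates for refinement in the next iteration.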

[62] Instella: Fully Open Language Models with Stellar Performance

Jiang Liu,Jialian Wu,Xiaodong Yu,Yusheng Su,Prakamya Mishra,Gowtham Ramesh,Sudhanshu Ranjan,Chaitanya Manem,Ximeng Sun,Ze Wang,Pratik Prabhanjan Brahma,Zicheng Liu,Emad Barsoum

Main category: cs.CL

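TL;DR: Instella is a family of fully open three-billion-parameter language models trained on openly available data and code; it achieves leading performance among comparable open models and is released with two specialized variants for long context and mathematical reasoning.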

Details Motivation: Most high-performing large language models are closed-source or only partially open, which limits transparency and reproducibility, so a fully open, high-performing language model is needed. Method: Instella is trained on AMD Instinct MI300X GPUs through large-scale pre-training, general-purpose instruction tuning, and alignment with human preferences; a math-specialized model is developed with supervised fine-tuning and reinforcement learning. Result: With fewer pre-training tokens, Instella achieves state-of-the-art results among comparable fully open models and is competitive with leading open-weight models; Instella-Long handles contexts up to 128K tokens, and Instella-Math performs strongly on mathematical reasoning tasks. Conclusion: Instella offers a transparent, performant, and versatile open language model, advancing open and reproducible language modeling research. Abstract: Large language models (LLMs) have demonstrated remarkable performance across a wide range of tasks, yet the majority of high-performing models remain closed-source or partially open, limiting transparency and reproducibility. In this work, we introduce Instella, a family of fully open three billion parameter language models trained entirely on openly available data and codebase. Powered by AMD Instinct MI300X GPUs, Instella is developed through large-scale pre-training, general-purpose instruction tuning, and alignment with human preferences. Despite using substantially fewer pre-training tokens than many contemporaries, Instella achieves state-of-the-art results among fully open models and is competitive with leading open-weight models of comparable size. We further release two specialized variants: Instella-Long, capable of handling context lengths up to 128K tokens, and Instella-Math, a reasoning-focused model enhanced through supervised fine-tuning and reinforcement learning on mathematical tasks. Together, these contributions establish Instella as a transparent, performant, and versatile alternative for the community, advancing the goal of open and reproducible language modeling research.

[63] Black-Box On-Policy Distillation of Large Language Models

Tianzhu Ye,Li Dong,Zewen Chi,Xun Wu,Shaohan Huang,Furu Wei

Main category: cs.CL

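TL;DR: Proposes Generative Adversarial Distillation (GAD), which enables black-box, on-policy distillation of language models; through adversarial training, the student surpasses conventional sequence-level knowledge distillation and approaches its teacher's performance.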

Details Motivation: Existing black-box knowledge distillation methods cannot access the teacher's internals and are mostly off-policy, lacking a dynamic feedback mechanism, which limits student performance. Method: Treats the student model as a generator and trains a discriminator to distinguish its outputs from the teacher's, forming a minimax game; the discriminator serves as an on-policy reward model that co-evolves with the student and provides stable, adaptive feedback. Result: GAD outperforms conventional sequence-level knowledge distillation across multiple evaluations; a Qwen2.5-14B-Instruct student trained with GAD performs comparably to its teacher, GPT-5-Chat, on the LMSYS-Chat automatic evaluation. Conclusion: GAD is an effective paradigm for black-box LLM distillation, achieving better knowledge transfer through adversarial on-policy learning. Abstract: Black-box distillation creates student large language models (LLMs) by learning from a proprietary teacher model's text outputs alone, without access to its internal logits or parameters. In this work, we introduce Generative Adversarial Distillation (GAD), which enables on-policy and black-box distillation. GAD frames the student LLM as a generator and trains a discriminator to distinguish its responses from the teacher LLM's, creating a minimax game. The discriminator acts as an on-policy reward model that co-evolves with the student, providing stable, adaptive feedback. Experimental results show that GAD consistently surpasses the commonly used sequence-level knowledge distillation. In particular, Qwen2.5-14B-Instruct (student) trained with GAD becomes comparable to its teacher, GPT-5-Chat, on the LMSYS-Chat automatic evaluation. The results establish GAD as a promising and effective paradigm for black-box LLM distillation.

[64] ParoQuant: Pairwise Rotation Quantization for Efficient Reasoning LLM Inference

Yesheng Liang,Haisheng Chen,Song Han,Zhijian Liu

Main category: cs.CL

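TL;DR: Proposes ParoQuant, a weight-only post-training quantization method that combines independent Givens rotations with channel-wise scaling to even out magnitudes across channels and narrow the dynamic range within each quantization group, suppressing outliers and improving accuracy on reasoning tasks.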

Details Motivation: Existing PTQ methods either suppress weight and activation outliers in large language models insufficiently or add too much inference overhead, which particularly hurts the accuracy of reasoning LLMs. Method: ParoQuant combines hardware-efficient, optimizable independent Givens rotations with channel-wise scaling, and co-designs the inference kernel to exploit GPU parallelism and keep runtime cost low. Result: On reasoning tasks, ParoQuant improves accuracy over AWQ by 2.4% on average with less than 10% runtime overhead. Conclusion: ParoQuant offers a practical path to efficient, accurate deployment of reasoning LLMs. Abstract: Weight-only post-training quantization (PTQ) compresses the weights of Large Language Models (LLMs) into low-precision representations to reduce memory footprint and accelerate inference. However, the presence of outliers in weights and activations often leads to large quantization errors and severe accuracy degradation, especially in recent reasoning LLMs where errors accumulate across long chains of thought. Existing PTQ methods either fail to sufficiently suppress outliers or introduce significant overhead during inference. In this paper, we propose Pairwise Rotation Quantization (ParoQuant), a weight-only PTQ method that combines hardware-efficient and optimizable independent Givens rotations with channel-wise scaling to even out the magnitude across channels and narrow the dynamic range within each quantization group. We further co-design the inference kernel to fully exploit GPU parallelism and keep the rotations and scaling lightweight at runtime. ParoQuant achieves an average 2.4% accuracy improvement over AWQ on reasoning tasks with less than 10% overhead. This paves the way for more efficient and accurate deployment of reasoning LLMs.
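A Givens rotation mixes exactly two coordinates while preserving the vector's norm, which is why it can spread an outlier's magnitude onto a neighboring channel. The sketch below shows this on a toy weight vector. It is not ParoQuant itself: the abstract says the rotation angles are optimized and paired with channel-wise scaling, whereas here the angle is fixed at 45 degrees, the scaling step is omitted, and the function name `givens_rotate_pair` is hypothetical.

```python
import math

def givens_rotate_pair(x, i, j, theta):
    """Rotate coordinates i and j of x by angle theta, leaving the others unchanged."""
    c, s = math.cos(theta), math.sin(theta)
    y = list(x)
    y[i] = c * x[i] - s * x[j]
    y[j] = s * x[i] + c * x[j]
    return y

def l2_norm(x):
    """Euclidean norm, used to check that the rotation is norm-preserving."""
    return math.sqrt(sum(v * v for v in x))

# Channel 0 holds an outlier; rotating it against channel 1 evens out the pair.
weights = [10.0, 0.0, 1.0, -1.0]
rotated = givens_rotate_pair(weights, 0, 1, math.pi / 4)

max_before = max(abs(v) for v in weights)
max_after = max(abs(v) for v in rotated)
```

After the rotation the largest magnitude drops from 10 to roughly 7.07 while the norm is unchanged, so a uniform quantization grid over this group would cover a narrower range. Because the rotation is orthogonal, it can be undone exactly by rotating with the negated angle.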

cs.CV [Back]

[65] FedeCouple: Fine-Grained Balancing of Global-Generalization and Local-Adaptability in Federated Learning

Ming Yang,Dongrun Li,Xin Wang,Feng Li,Lisheng Fan,Chunxiao Wang,Xiaoming Wu,Peng Cheng

Main category: cs.CV

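TL;DR: Proposes FedeCouple, a federated learning method that balances global generalization and local adaptability at a fine-grained level to improve personalized learning on heterogeneous data in privacy-preserving mobile networks.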

Details Motivation: Existing personalized federated learning methods mostly focus on feature space consistency and classifier personalization, neglecting the local adaptability of the feature extractor and the global generalization of the classifier, which leaves the components weakly coupled and degrades performance. Method: FedeCouple jointly learns global and local feature representations, introduces dynamic knowledge distillation to strengthen classifier generalization, and refines the feature space with local anchors that are never transmitted, preserving privacy and reducing communication overhead. Result: On five image-classification datasets, FedeCouple outperforms nine baseline methods in effectiveness, stability, scalability, and security, surpassing the best baseline by 4.3% in the effectiveness experiments. Conclusion: By strengthening the coordination between feature extractor and classifier, FedeCouple achieves better model performance, comes with a convergence proof for nonconvex objectives, and suits privacy-sensitive mobile network scenarios. Abstract: In privacy-preserving mobile network transmission scenarios with heterogeneous client data, personalized federated learning methods that decouple feature extractors and classifiers have demonstrated notable advantages in enhancing learning capability. However, many existing approaches primarily focus on feature space consistency and classification personalization during local training, often neglecting the local adaptability of the extractor and the global generalization of the classifier. This oversight results in insufficient coordination and weak coupling between the components, ultimately degrading the overall model performance. To address this challenge, we propose FedeCouple, a federated learning method that balances global generalization and local adaptability at a fine-grained level. Our approach jointly learns global and local feature representations while employing dynamic knowledge distillation to enhance the generalization of personalized classifiers. We further introduce anchors to refine the feature space; their strict locality and non-transmission inherently preserve privacy and reduce communication overhead. Furthermore, we provide a theoretical analysis proving that FedeCouple converges for nonconvex objectives, with iterates approaching a stationary point as the number of communication rounds increases. Extensive experiments conducted on five image-classification datasets demonstrate that FedeCouple consistently outperforms nine baseline methods in effectiveness, stability, scalability, and security.
Notably, in experiments evaluating effectiveness, FedeCouple surpasses the best baseline by a significant margin of 4.3%.

[66] MMaDA-Parallel: Multimodal Large Diffusion Language Models for Thinking-Aware Editing and Generation

Ye Tian,Ling Yang,Jiongfan Yang,Anran Wang,Yu Tian,Jiani Zheng,Haochen Wang,Zhiyang Teng,Zhuochen Wang,Yinjie Wang,Yunhai Tong,Mengdi Wang,Xiangtai Li

Main category: cs.CV

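TL;DR: Identifies and analyzes how existing autoregressive approaches degrade on complex tasks because of error propagation, uses the new ParaBench benchmark to expose poor alignment between reasoning and generated images, and proposes MMaDA-Parallel, a framework combined with parallel reinforcement learning, to improve cross-modal consistency.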

Details Motivation: Existing sequential generation methods are prone to performance degradation from error propagation in multi-step reasoning, and lack a cross-modal alignment mechanism for text-and-image generation tasks in particular. Method: Builds the ParaBench benchmark to evaluate text and image outputs; proposes MMaDA-Parallel, a parallel diffusion framework that enables bidirectional interaction between text and images throughout the denoising process; and optimizes it with supervised finetuning followed by a new Parallel Reinforcement Learning (ParaRL) strategy that applies semantic rewards along the trajectory to strengthen consistency. Result: MMaDA-Parallel improves Output Alignment on ParaBench by 6.9% over the state-of-the-art model Bagel, confirming its advantage in cross-modal alignment and semantic consistency. Conclusion: Through its parallel architecture and trajectory-level reinforcement learning, MMaDA-Parallel mitigates error propagation and offers a more robust paradigm for thinking-aware image generation. Abstract: While thinking-aware generation aims to improve performance on complex tasks, we identify a critical failure mode where existing sequential, autoregressive approaches can paradoxically degrade performance due to error propagation. To systematically analyze this issue, we propose ParaBench, a new benchmark designed to evaluate both text and image output modalities. Our analysis using ParaBench reveals that this performance degradation is strongly correlated with poor alignment between the generated reasoning and the final image. To resolve this, we propose a parallel multimodal diffusion framework, MMaDA-Parallel, that enables continuous, bidirectional interaction between text and images throughout the entire denoising trajectory. MMaDA-Parallel is trained with supervised finetuning and then further optimized by Parallel Reinforcement Learning (ParaRL), a novel strategy that applies semantic rewards along the trajectory to enforce cross-modal consistency. Experiments validate that our model significantly improves cross-modal alignment and semantic consistency, achieving a 6.9% improvement in Output Alignment on ParaBench compared to the state-of-the-art model, Bagel, establishing a more robust paradigm for thinking-aware image synthesis. Our code is open-sourced at https://github.com/tyfeld/MMaDA-Parallel

[67] PriVi: Towards A General-Purpose Video Model For Primate Behavior In The Wild

Felix B. Mueller,Jan F. Meier,Timo Lueddecke,Richard Vogg,Roger L. Freixanet,Valentin Hassler,Tiffany Bosshard,Elif Karakoc,William J. O'Hearn,Sofia M. Pereira,Sandro Sehner,Kaja Wierucka,Judith Burkart,Claudia Fichtel,Julia Fischer,Alexander Gail,Catherine Hobaiter,Julia Ostner,Liran Samuni,Oliver Schülke,Neda Shahidi,Erin G. Wessling,Alexander S. Ecker

Main category: cs.CV

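TL;DR: Takes a data-centric approach, building PriVi, a large-scale primate-centric video pretraining dataset, and shows through experiments on multiple benchmarks that it delivers superior performance in low-label settings.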

Details Motivation: Existing computer vision methods mostly rely on human-centric pretrained models and are confined to single datasets, which limits generalization in primate behavior analysis, so a more general, primate-focused, data-driven approach is needed. Method: Builds PriVi, a large-scale primate video pretraining dataset with 424 hours of video combining research footage and web-sourced material, pretrains V-JEPA on it, and evaluates with a frozen classifier on multiple benchmark datasets. Result: The approach outperforms prior methods, including fully finetuned baselines, on all four benchmarks (ChimpACT, BaboonLand, PanAf500, and ChimpBehave) and performs well with fewer labels, showing higher data efficiency and generalization. Conclusion: Primate-centric pretraining substantially improves data efficiency and generalization for behavior recognition across settings, making it an effective route for low-label applications. Abstract: Non-human primates are our closest living relatives, and analyzing their behavior is central to research in cognition, evolution, and conservation. Computer vision could greatly aid this research, but existing methods often rely on human-centric pretrained models and focus on single datasets, which limits generalization. We address this limitation by shifting from a model-centric to a data-centric approach and introduce PriVi, a large-scale primate-centric video pretraining dataset. PriVi contains 424 hours of curated video, combining 174 hours from behavioral research across 11 settings with 250 hours of diverse web-sourced footage, assembled through a scalable data curation pipeline. We pretrain V-JEPA on PriVi to learn primate-specific representations and evaluate it using a lightweight frozen classifier. Across four benchmark datasets, ChimpACT, BaboonLand, PanAf500, and ChimpBehave, our approach consistently outperforms prior work, including fully finetuned baselines, and scales favorably with fewer labels. These results demonstrate that primate-centric pretraining substantially improves data efficiency and generalization, making it a promising approach for low-label applications. Code, models, and the majority of the dataset will be made available.

[68] Classifying Phonotrauma Severity from Vocal Fold Images with Soft Ordinal Regression

Katie Matton,Purvaja Balaji,Hamzeh Ghasemzadeh,Jameson C. Cooper,Daryush D. Mehta,Jarrad H. Van Stan,Robert E. Hillman,Rosalind Picard,John Guttag,S. Mazdak Abulnaga

Main category: cs.CV

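TL;DR: Proposes an automatic classification method based on soft-label ordinal regression for assessing phonotrauma severity; its performance approaches that of clinical experts and it provides reliable uncertainty estimates.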

Details Motivation: Assessing phonotrauma severity depends on a clinician's subjective judgment, which is costly and varies widely in reliability, so an automated, objective assessment method is needed. Method: Adopts an ordinal regression framework and proposes a new modification to its loss functions so that they can operate on soft labels reflecting the distribution of annotator ratings, which accounts for label uncertainty. Result: The proposed soft ordinal regression method reaches predictive performance close to that of clinical experts and produces well-calibrated uncertainty estimates. Conclusion: The method provides an automated tool for assessing phonotrauma severity, which can support large-scale studies and improve clinical understanding and patient care. Abstract: Phonotrauma refers to vocal fold tissue damage resulting from exposure to forces during voicing. It occurs on a continuum from mild to severe, and treatment options can vary based on severity. Assessment of severity involves a clinician's expert judgment, which is costly and can vary widely in reliability. In this work, we present the first method for automatically classifying phonotrauma severity from vocal fold images. To account for the ordinal nature of the labels, we adopt a widely used ordinal regression framework. To account for label uncertainty, we propose a novel modification to ordinal regression loss functions that enables them to operate on soft labels reflecting annotator rating distributions. Our proposed soft ordinal regression method achieves predictive performance approaching that of clinical experts, while producing well-calibrated uncertainty estimates. By providing an automated tool for phonotrauma severity assessment, our work can enable large-scale studies of phonotrauma, ultimately leading to improved clinical understanding and patient care.
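The abstract does not spell out the loss modification, so the following is only a sketch of one plausible reading, not the authors' formulation. It assumes the common threshold-style ordinal setup in which K ordered classes become K-1 binary "is the label above level k?" tasks, and it replaces the usual 0/1 targets with the tail probabilities of the annotator rating distribution. The function names `cumulative_targets` and `soft_ordinal_loss` are hypothetical.

```python
import math

def cumulative_targets(q):
    """Turn a rating distribution q over K ordered classes into
    K-1 soft targets t_k = P(label > k). Assumes q sums to 1."""
    targets = []
    tail = sum(q)
    for k in range(len(q) - 1):
        tail -= q[k]          # probability mass strictly above class k
        targets.append(tail)
    return targets

def soft_ordinal_loss(logits, q):
    """Binary cross-entropy summed over the K-1 threshold tasks,
    using soft targets instead of hard 0/1 targets (an assumption)."""
    loss = 0.0
    for z, t in zip(logits, cumulative_targets(q)):
        p = 1.0 / (1.0 + math.exp(-z))          # sigmoid
        p = min(max(p, 1e-12), 1.0 - 1e-12)     # avoid log(0)
        loss -= t * math.log(p) + (1.0 - t) * math.log(1.0 - p)
    return loss
```

With a one-hot distribution this reduces to the ordinary hard-label threshold loss, which is why it is a natural, though unconfirmed, candidate for handling annotator disagreement.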

[69] SliderEdit: Continuous Image Editing with Fine-Grained Instruction Control

Arman Zarei,Samyadeep Basu,Mobina Pournemat,Sayan Nag,Ryan Rossi,Soheil Feizi

Main category: cs.CV

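TL;DR: This paper presents SliderEdit, a framework for continuous, fine-grained control in instruction-based image editing. It disentangles multi-part edit instructions and maps each one to a globally trained slider, so the strength of an individual edit can be adjusted smoothly.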

Details Motivation: Existing instruction-based image editing models apply each instruction with a fixed strength and offer no precise, continuous control over the intensity of individual edits. Method: SliderEdit learns a single set of low-rank adaptation matrices that generalize across diverse edits, attributes, and compositional instructions, disentangles a multi-part edit instruction, and exposes an adjustable slider for each individual instruction. Result: Applied to state-of-the-art models such as FLUX-Kontext and Qwen-Image-Edit, SliderEdit substantially improves edit controllability, visual consistency, and user steerability. Conclusion: According to the authors, SliderEdit is the first framework to provide continuous, fine-grained control over multiple edit dimensions in instruction-based image editing models, opening a direction for interactive, instruction-driven image manipulation. Abstract: Instruction-based image editing models have recently achieved impressive performance, enabling complex edits to an input image from a multi-instruction prompt. However, these models apply each instruction in the prompt with a fixed strength, limiting the user's ability to precisely and continuously control the intensity of individual edits. We introduce SliderEdit, a framework for continuous image editing with fine-grained, interpretable instruction control. Given a multi-part edit instruction, SliderEdit disentangles the individual instructions and exposes each as a globally trained slider, allowing smooth adjustment of its strength. Unlike prior works that introduced slider-based attribute controls in text-to-image generation, typically requiring separate training or fine-tuning for each attribute or concept, our method learns a single set of low-rank adaptation matrices that generalize across diverse edits, attributes, and compositional instructions. This enables continuous interpolation along individual edit dimensions while preserving both spatial locality and global semantic consistency. We apply SliderEdit to state-of-the-art image editing models, including FLUX-Kontext and Qwen-Image-Edit, and observe substantial improvements in edit controllability, visual consistency, and user steerability. To the best of our knowledge, we are the first to explore and propose a framework for continuous, fine-grained instruction control in instruction-based image editing models. Our results pave the way for interactive, instruction-driven image manipulation with continuous and compositional control.
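How the slider value enters the model is not described in the abstract. The sketch below only illustrates the generic low-rank adaptation idea it builds on, under the assumption that a scalar slider scales the low-rank update added to a frozen weight matrix. `matmul` and `slider_weight` are hypothetical helper names, and plain Python lists stand in for tensors.

```python
def matmul(a, b):
    """Multiply two matrices given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]

def slider_weight(w, b, a, strength):
    """Return W + strength * (B @ A).

    w: frozen base weight (d_out x d_in)
    b: low-rank factor (d_out x r), a: low-rank factor (r x d_in)
    strength: slider value; 0 leaves the base model unchanged (assumption).
    """
    delta = matmul(b, a)
    return [[w[i][j] + strength * delta[i][j] for j in range(len(w[0]))]
            for i in range(len(w))]
```

Because the update is linear in `strength`, intermediate slider values interpolate smoothly between the unedited and fully edited weights, which matches the continuous control the abstract describes at a high level.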

[70] Density Estimation and Crowd Counting

Balachandra Devarangadi Sunil,Rakshith Venkatesh,Shantanu Todmal

Main category: cs.CV

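TL;DR: This study proposes a video-based denoising probabilistic model that combines diffusion processes with event-driven sampling to generate high-quality crowd density maps, improving both the accuracy and the computational efficiency of crowd density estimation.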

Details Motivation: Traditional image-based crowd density estimation struggles with the temporal dynamics of video data and carries high computational and storage costs, so a more efficient and accurate video-level solution is needed. Method: Introduces a diffusion-based denoising probabilistic model that uses narrow Gaussian kernels to generate multiple density maps, adds a regression branch for feature extraction, and fuses the maps according to similarity scores; the Farneback optical flow algorithm drives event-driven sampling so that only frames with significant motion are processed. Result: Evaluated with Mean Absolute Error (MAE) and overlay plots, the model captures crowd dynamics in both dense and sparse scenes; event-driven sampling reduces the number of frames while retaining key crowd events. Conclusion: The method offers a scalable and efficient framework for video crowd density estimation, suited to real-time monitoring applications such as public safety, disaster response, and event management. Abstract: This study enhances a crowd density estimation algorithm originally designed for image-based analysis by adapting it for video-based scenarios. The proposed method integrates a denoising probabilistic model that utilizes diffusion processes to generate high-quality crowd density maps. To improve accuracy, narrow Gaussian kernels are employed, and multiple density map outputs are generated. A regression branch is incorporated into the model for precise feature extraction, while a consolidation mechanism combines these maps based on similarity scores to produce a robust final result. An event-driven sampling technique, utilizing the Farneback optical flow algorithm, is introduced to selectively capture frames showing significant crowd movements, reducing computational load and storage by focusing on critical crowd dynamics. Through qualitative and quantitative evaluations, including overlay plots and Mean Absolute Error (MAE), the model demonstrates its ability to effectively capture crowd dynamics in both dense and sparse settings. The efficiency of the sampling method is further assessed, showcasing its capability to decrease frame counts while maintaining essential crowd events. By addressing the temporal challenges unique to video analysis, this work offers a scalable and efficient framework for real-time crowd monitoring in applications such as public safety, disaster response, and event management.
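The event-driven sampling step can be illustrated with a small sketch. It assumes a per-frame motion score has already been computed, for example the mean magnitude of a dense optical flow field between consecutive frames (OpenCV's `cv2.calcOpticalFlowFarneback` is one way to obtain such a field). The threshold rule, the choice to always keep the first frame, and the names `select_event_frames` and `mean_absolute_error` are assumptions made for illustration, not details taken from the paper.

```python
def select_event_frames(motion_scores, threshold):
    """Pick frames that follow significant motion.

    motion_scores[i] is the motion between frame i and frame i + 1.
    Frame 0 is always kept as a reference (an assumption).
    """
    kept = [0]
    for i, score in enumerate(motion_scores, start=1):
        if score > threshold:
            kept.append(i)
    return kept

def mean_absolute_error(predicted_counts, true_counts):
    """MAE between predicted and ground-truth crowd counts."""
    errors = [abs(p - t) for p, t in zip(predicted_counts, true_counts)]
    return sum(errors) / len(errors)
```

For example, `select_event_frames([0.1, 2.0, 0.3, 1.5], threshold=1.0)` keeps frames 0, 2, and 4 out of five, which is the kind of frame-count reduction the abstract evaluates.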

[71] PALMS+: Modular Image-Based Floor Plan Localization Leveraging Depth Foundation Model

Yunqian Cheng,Benjamin Princen,Roberto Manduchi

Main category: cs.CV

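TL;DR: PALMS+ is an infrastructure-free indoor localization system that uses a monocular depth estimation model to reconstruct 3D point clouds from RGB images and matches their geometric layout against a floor plan, enabling accurate localization in GPS-denied environments.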

Details Motivation: Existing vision-based localization methods are limited by the short range of smartphone LiDAR and by ambiguity in indoor layouts, and most depend on training or dedicated infrastructure, which makes efficient, scalable indoor localization difficult. Method: Proposes PALMS+, which reconstructs scale-aligned 3D point clouds from posed RGB images with a foundation monocular depth model (Depth Pro), matches the geometric layout by convolving with the floor plan, and outputs a posterior over location and orientation that supports both stationary and sequential localization. Result: On Structured3D and a custom campus dataset, PALMS+ outperforms PALMS and F3Loc in stationary localization; combined with a particle filter for sequential localization on 33 real-world trajectories, it achieves lower localization error than other methods. Conclusion: PALMS+ delivers accurate indoor localization without training, specialized hardware, or infrastructure, and shows practical potential for scenarios such as emergency response and assistive navigation. Abstract: Indoor localization in GPS-denied environments is crucial for applications like emergency response and assistive navigation. Vision-based methods such as PALMS enable infrastructure-free localization using only a floor plan and a stationary scan, but are limited by the short range of smartphone LiDAR and ambiguity in indoor layouts. We propose PALMS$+$, a modular, image-based system that addresses these challenges by reconstructing scale-aligned 3D point clouds from posed RGB images using a foundation monocular depth estimation model (Depth Pro), followed by geometric layout matching via convolution with the floor plan. PALMS$+$ outputs a posterior over the location and orientation, usable for direct or sequential localization. Evaluated on the Structured3D and a custom campus dataset consisting of 80 observations across four large campus buildings, PALMS$+$ outperforms PALMS and F3Loc in stationary localization accuracy -- without requiring any training. Furthermore, when integrated with a particle filter for sequential localization on 33 real-world trajectories, PALMS$+$ achieved lower localization errors compared to other methods, demonstrating robustness for camera-free tracking and its potential for infrastructure-free applications. Code and data are available at https://github.com/Head-inthe-Cloud/PALMS-Plane-based-Accessible-Indoor-Localization-Using-Mobile-Smartphones

[72] Social LSTM with Dynamic Occupancy Modeling for Realistic Pedestrian Trajectory Prediction

Ahmed Alia,Mohcine Chraibi,Armin Seyfried

Main category: cs.CV

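TL;DR: Proposes an improved Social LSTM model with a Dynamic Occupied Space loss function that lowers collision rates in pedestrian trajectory prediction while improving prediction accuracy.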

Details Motivation: Existing methods treat pedestrians as point entities and ignore the space each person actually occupies, which makes predictions less realistic and prone to collisions in dense scenes. Method: Extends Social LSTM with a new Dynamic Occupied Space loss function that combines the average displacement error with a collision penalty sensitive to scene density and individual spatial occupancy. Result: Experiments on five datasets generated from real trajectories (covering different density conditions) show that the new model reduces the collision rate by up to 31% and lowers the average and final displacement errors by 5% and 6%, respectively, on average, and that it outperforms existing models on most test sets. Conclusion: The proposed Dynamic Occupied Space loss effectively improves the realism and accuracy of pedestrian trajectory prediction, particularly in high-density and heterogeneous-density scenes. Abstract: In dynamic and crowded environments, realistic pedestrian trajectory prediction remains a challenging task due to the complex nature of human motion and the mutual influences among individuals. Deep learning models have recently achieved promising results by implicitly learning such patterns from 2D trajectory data. However, most approaches treat pedestrians as point entities, ignoring the physical space that each person occupies. To address these limitations, this paper proposes a novel deep learning model that enhances the Social LSTM with a new Dynamic Occupied Space loss function. This loss function guides Social LSTM in learning to avoid realistic collisions without increasing displacement error across different crowd densities, ranging from low to high, in both homogeneous and heterogeneous density settings. Such a function achieves this by combining the average displacement error with a new collision penalty that is sensitive to scene density and individual spatial occupancy. For efficient training and evaluation, five datasets were generated from real pedestrian trajectories recorded during the Festival of Lights in Lyon 2022. Four datasets represent homogeneous crowd conditions -- low, medium, high, and very high density -- while the fifth corresponds to a heterogeneous density distribution. The experimental findings indicate that the proposed model not only lowers collision rates but also enhances displacement prediction accuracy in each dataset.
Specifically, the model achieves up to a 31% reduction in the collision rate and reduces the average displacement error and the final displacement error by 5% and 6%, respectively, on average across all datasets compared to the baseline. Moreover, the proposed model consistently outperforms several state-of-the-art deep learning models across most test sets.
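The abstract describes the loss as the average displacement error plus a collision penalty, without giving the exact penalty. The sketch below is a simplified stand-in under stated assumptions: every pedestrian is a disc of fixed `radius`, the penalty is the summed overlap depth between the predicted position and each neighbor, and `weight` is a constant. In the paper the penalty is sensitive to scene density and individual occupancy, which this toy version does not model. All function names are hypothetical.

```python
import math

def ade(pred, truth):
    """Average displacement error over the predicted timesteps."""
    return sum(math.dist(p, t) for p, t in zip(pred, truth)) / len(pred)

def collision_penalty(pred, neighbors, radius):
    """Sum of disc overlap depths between the predicted position and
    every neighbor position at the same timestep."""
    total = 0.0
    for p, others in zip(pred, neighbors):
        for o in others:
            overlap = 2.0 * radius - math.dist(p, o)
            if overlap > 0.0:
                total += overlap
    return total

def occupancy_aware_loss(pred, truth, neighbors, radius=0.2, weight=1.0):
    """ADE plus a weighted collision penalty (illustrative only)."""
    return ade(pred, truth) + weight * collision_penalty(pred, neighbors, radius)
```

A prediction that stays close to the ground truth but passes through a neighbor's disc is therefore penalized more than one that keeps its distance, which is the behavior the loss is meant to encourage.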

[73] Soiling detection for Advanced Driver Assistance Systems

Filip Beránek,Václav Diviš,Ivan Gruber

Main category: cs.CV

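TL;DR: This paper treats soiling detection for automotive cameras as a semantic segmentation problem, compares several segmentation methods, and shows that they outperform tile-level classification. It also analyzes data leakage and imprecise annotations in the Woodscape dataset and builds a smaller but effective subset that reaches comparable performance in less time.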

Details Motivation: To make advanced driver assistance systems more robust to adverse external conditions (such as weather and dust) and to address data leakage and imprecise annotations in the existing dataset. Method: Applies popular semantic segmentation methods to soiling detection and compares them with conventional tile-level classification; re-analyzes the Woodscape dataset and builds a cleaner subset for training and evaluation. Result: Semantic segmentation methods outperform the tile-level classification approaches; with the smaller revised subset, comparable performance is reached in much less time. Conclusion: Semantic segmentation is an effective approach to automotive camera soiling detection, and revising the dataset improves training efficiency and the reliability of results. Abstract: Soiling detection for automotive cameras is a crucial part of advanced driver assistance systems to make them more robust to external conditions like weather, dust, etc. In this paper, we regard the soiling detection as a semantic segmentation problem. We provide a comprehensive comparison of popular segmentation methods and show their superiority in performance while comparing them to tile-level classification approaches. Moreover, we present an extensive analysis of the Woodscape dataset showing that the original dataset contains data leakage and imprecise annotations. To address these problems, we create a new data subset, which, despite being much smaller, provides enough information for the segmentation method to reach comparable results in a much shorter time. All our code and dataset splits are available at https://github.com/filipberanek/woodscape_revision.

[74] Feature Quality and Adaptability of Medical Foundation Models: A Comparative Evaluation for Radiographic Classification and Segmentation

Frank Li,Theo Dapamede,Mohammadreza Chavoshi,Young Seok Jeon,Bardia Khosravi,Abdulhameed Dere,Beatrice Brown-Mulry,Rohan Satya Isaac,Aawez Mansuri,Chiratidzo Sanyika,Janice Newsome,Saptarshi Purkayastha,Imon Banerjee,Hari Trivedi,Judy Gichoya

Main category: cs.CV

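TL;DR: This study evaluates the vision encoders of eight medical and general-domain foundation models for chest X-ray analysis. Medical-domain pretrained models perform better under linear probing, but feature utility depends heavily on the task: for segmenting complex, subtle pathologies (e.g., pneumothorax), all foundation models perform poorly without substantial fine-tuning, and they exploit confounding features for classification. The study also finds that expensive text-image alignment is not required, since image-only and label-supervised models also perform very well, and that a conventional end-to-end supervised model remains competitive on segmentation.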

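Details Motivation: It is still unclear how the pre-training domain (medical vs. general), the learning paradigm (e.g., text-guided), and the architecture affect embedding quality, which makes it hard to choose the best encoder for a given radiology task. Method: Evaluates vision encoders from eight medical and general-domain foundation models on chest X-ray classification (pneumothorax, cardiomegaly) and segmentation (pneumothorax, cardiac boundary), benchmarking with both linear probing and fine-tuning. Result: Medical-domain pretrained models consistently outperform general-domain models under linear probing, indicating higher initial feature quality. Pre-trained embeddings work well for global classification and for segmenting salient anatomy, but perform poorly across the board when segmenting complex, subtle pathologies (such as pneumothorax) and need substantial fine-tuning. Subgroup analysis shows that the models use confounding cues (such as chest tubes) for classification, which fails for precise segmentation. Image-only (RAD-DINO) and label-supervised (Ark+) models, which need no text-image alignment, are among the top performers, and a conventional supervised model matches or exceeds the best foundation models on segmentation. Conclusion: Medical pre-training improves feature quality, but its benefit depends on the task; architectural choices (e.g., multi-scale) are critical, pre-trained features are not universally effective, especially for complex localization tasks, and conventional supervised models remain strong competitors. Abstract: Foundation models (FMs) promise to generalize medical imaging, but their effectiveness varies. It remains unclear how pre-training domain (medical vs. general), paradigm (e.g., text-guided), and architecture influence embedding quality, hindering the selection of optimal encoders for specific radiology tasks. To address this, we evaluate vision encoders from eight medical and general-domain FMs for chest X-ray analysis. We benchmark classification (pneumothorax, cardiomegaly) and segmentation (pneumothorax, cardiac boundary) using linear probing and fine-tuning. Our results show that domain-specific pre-training provides a significant advantage; medical FMs consistently outperformed general-domain models in linear probing, establishing superior initial feature quality. However, feature utility is highly task-dependent. Pre-trained embeddings were strong for global classification and segmenting salient anatomy (e.g., heart). In contrast, for segmenting complex, subtle pathologies (e.g., pneumothorax), all FMs performed poorly without significant fine-tuning, revealing a critical gap in localizing subtle disease. Subgroup analysis showed FMs use confounding shortcuts (e.g., chest tubes for pneumothorax) for classification, a strategy that fails for precise segmentation. We also found that expensive text-image alignment is not a prerequisite; image-only (RAD-DINO) and label-supervised (Ark+) FMs were among top performers.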
Notably, a supervised, end-to-end baseline remained highly competitive, matching or exceeding the best FMs on segmentation tasks. These findings show that while medical pre-training is beneficial, architectural choices (e.g., multi-scale) are critical, and pre-trained features are not universally effective, especially for complex localization tasks where supervised models remain a strong alternative.

[75] Gradient-Guided Exploration of Generative Model's Latent Space for Controlled Iris Image Augmentations

Mahsa Mitcheff,Siamul Karim Khan,Adam Czajka

Main category: cs.CV

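TL;DR: Proposes an iris image augmentation strategy based on traversing a generative model's latent space, which allows specific attributes to be controlled while preserving identity.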

Details Motivation: Existing methods struggle to synthesize iris images with specific attribute changes while keeping the iris identity consistent, so a more flexible and controllable augmentation method is needed to make iris recognition and presentation attack detection more robust. Method: Traverses the latent space of a pre-trained GAN along the gradient of specific iris features (such as sharpness and pupil size) and uses GAN inversion to map real or generated iris images into the latent space, so that chosen attributes can be manipulated while identity is preserved. Result: The method generates same-identity iris images with different geometrical, textural, or quality attributes, can be extended to any attribute defined by a differentiable loss, and is applicable to data augmentation and presentation attack detection. Conclusion: The proposed latent space traversal offers a flexible and effective way to generate iris images in a controlled manner, increases data diversity, and can help improve the reliability of iris recognition systems. Abstract: Developing reliable iris recognition and presentation attack detection methods requires diverse datasets that capture realistic variations in iris features and a wide spectrum of anomalies. Because of the rich texture of iris images, which spans a wide range of spatial frequencies, synthesizing same-identity iris images while controlling specific attributes remains challenging. In this work, we introduce a new iris image augmentation strategy by traversing a generative model's latent space toward latent codes that represent same-identity samples but with some desired iris image properties manipulated. The latent space traversal is guided by a gradient of specific geometrical, textural, or quality-related iris image features (e.g., sharpness, pupil size, iris size, or pupil-to-iris ratio) and preserves the identity represented by the image being manipulated. The proposed approach can be easily extended to manipulate any attribute for which a differentiable loss term can be formulated. Additionally, our approach can operate on either images randomly generated by a pre-trained GAN model or real-world iris images. We can utilize GAN inversion to project any given iris image into the latent space and obtain its corresponding latent code.

[76] STORM: Segment, Track, and Object Re-Localization from a Single 3D Model

Yu Deng,Teng Cao,Hikaru Shindo,Jiahong Xue,Quentin Delfosse,Kristian Kersting

Main category: cs.CV

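TL;DR: STORM is an open-source, real-time 6D pose estimation system that needs no manual annotation. By combining vision-language understanding with self-supervised feature matching, it achieves state-of-the-art accuracy and real-time performance in complex industrial scenes.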

Details Motivation: Existing 6D pose estimation methods rely on a manually annotated segmentation mask in the first frame, which is time-consuming and degrades under occlusion or rapid motion, so a more robust, annotation-free solution is needed. Method: STORM uses a three-stage pipeline: contextual object descriptions guide localization, self-cross-attention mechanisms identify candidate regions, and a segmentation model produces precise masks; feature similarity monitoring triggers automatic re-registration to recover from tracking failures. Result: Achieves state-of-the-art accuracy on challenging industrial datasets with multi-object occlusions, high-speed motion, and varying illumination, while running in real time without additional training. Conclusion: STORM provides an annotation-free, efficient, and robust 6D pose estimation solution that significantly reduces deployment overhead and suits applications such as flexible manufacturing and intelligent quality control. Abstract: Accurate 6D pose estimation and tracking are fundamental capabilities for physical AI systems such as robots. However, existing approaches typically rely on a manually annotated segmentation mask of the target in the first frame, which is labor-intensive and leads to reduced performance when faced with occlusions or rapid movement. To address these limitations, we propose STORM (Segment, Track, and Object Re-localization from a single 3D Model), an open-source robust real-time 6D pose estimation system that requires no manual annotation. STORM employs a novel three-stage pipeline combining vision-language understanding with self-supervised feature matching: contextual object descriptions guide localization, self-cross-attention mechanisms identify candidate regions, and a segmentation model produces precise masks for accurate pose estimation. Another key innovation is our automatic re-registration mechanism that detects tracking failures through feature similarity monitoring and recovers from severe occlusions or rapid motion. STORM achieves state-of-the-art accuracy on challenging industrial datasets featuring multi-object occlusions, high-speed motion, and varying illumination, while operating at real-time speeds without additional training. This annotation-free approach significantly reduces deployment overhead, providing a practical solution for modern applications, such as flexible manufacturing and intelligent quality control.

[77] PANDA - Patch And Distribution-Aware Augmentation for Long-Tailed Exemplar-Free Continual Learning

Siddeshwar Raghavan,Jiangpeng He,Fengqing Zhu

Main category: cs.CV

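TL;DR: Proposes the PANDA framework to address dual-level imbalance in real-world data streams, improving the continual learning performance of pre-trained models without storing old data.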

Details Motivation: Existing exemplar-free continual learning methods overlook the dual-level imbalance of real-world data distributions, which leads to poor learning and catastrophic forgetting. Method: PANDA uses a CLIP encoder to identify representative regions of low-frequency classes and transplants them into frequent-class samples, thereby amplifying the low-frequency classes; it also applies an adaptive balancing strategy to ease inter-task imbalance. Result: Experiments show that PANDA improves the accuracy of continual learning methods based on pre-trained models and reduces catastrophic forgetting. Conclusion: Through patch-level augmentation and a distribution-aware mechanism, PANDA handles dual-level data imbalance without storing historical data and improves generalization. Abstract: Exemplar-Free Continual Learning (EFCL) restricts the storage of previous task data and is highly susceptible to catastrophic forgetting. While pre-trained models (PTMs) are increasingly leveraged for EFCL, existing methods often overlook the inherent imbalance of real-world data distributions. We discovered that real-world data streams commonly exhibit dual-level imbalances, dataset-level distributions combined with extreme or reversed skews within individual tasks, creating both intra-task and inter-task disparities that hinder effective learning and generalization. To address these challenges, we propose PANDA, a Patch-and-Distribution-Aware Augmentation framework that integrates seamlessly with existing PTM-based EFCL methods. PANDA amplifies low-frequency classes by using a CLIP encoder to identify representative regions and transplanting those into frequent-class samples within each task. Furthermore, PANDA incorporates an adaptive balancing strategy that leverages prior task distributions to smooth inter-task imbalances, reducing the overall gap between average samples across tasks and enabling fairer learning with frozen PTMs. Extensive experiments and ablation studies demonstrate PANDA's capability to work with existing PTM-based CL methods, improving accuracy and reducing catastrophic forgetting.

[78] Test-Time Spectrum-Aware Latent Steering for Zero-Shot Generalization in Vision-Language Models

Konstantinos M. Dafnis,Dimitris N. Metaxas

Main category: cs.CV

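TL;DR: Proposes Spectrum-Aware Test-Time Steering (STS), a lightweight test-time adaptation framework that adapts vision-language models efficiently by adjusting a small number of per-sample shift parameters in the latent space, without backpropagating through or modifying the frozen encoders.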

Details Motivation: Existing test-time adaptation methods usually require backpropagating through large model weights or altering core components, which is computationally expensive and limits practical deployment, so a lighter and more efficient adaptation method is needed. Method: STS extracts a spectral subspace from the textual embeddings to define principal semantic directions and learns to steer latent representations in a spectrum-aware manner by minimizing entropy across augmented views, optimizing only a small number of per-sample shift parameters. Result: Under standard evaluation protocols, STS largely surpasses or compares favorably against state-of-the-art test-time adaptation methods, adds very few parameters, and achieves inference up to 8x faster with a 12x smaller memory footprint than conventional test-time prompt tuning. Conclusion: STS is an efficient, scalable test-time adaptation method that achieves strong domain adaptation without modifying the frozen encoders, making it suitable for deployment in resource-constrained settings. Abstract: Vision-Language Models (VLMs) excel at zero-shot inference but often degrade under test-time domain shifts. For this reason, episodic test-time adaptation strategies have recently emerged as powerful techniques for adapting VLMs to a single unlabeled image. However, existing adaptation strategies, such as test-time prompt tuning, typically require backpropagating through large encoder weights or altering core model components. In this work, we introduce Spectrum-Aware Test-Time Steering (STS), a lightweight adaptation framework that extracts a spectral subspace from the textual embeddings to define principal semantic directions and learns to steer latent representations in a spectrum-aware manner by adapting a small number of per-sample shift parameters to minimize entropy across augmented views. STS operates entirely at inference in the latent space, without backpropagation through or modification of the frozen encoders. Building on standard evaluation protocols, our comprehensive experiments demonstrate that STS largely surpasses or compares favorably against state-of-the-art test-time adaptation methods, while introducing only a handful of additional parameters and achieving inference speeds up to 8x faster with a 12x smaller memory footprint than conventional test-time prompt tuning. The code is available at https://github.com/kdafnis/STS.
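Two ingredients named in the abstract, shifting a latent along a few basis directions and minimizing entropy across augmented views, can be sketched in a few lines. This is a toy illustration rather than the STS implementation: it assumes the spectral basis is already available as a list of direction vectors, it leaves open which latent is steered, and it uses the entropy of the view-averaged prediction as the objective, a common choice in test-time adaptation that the abstract does not confirm. The names `steer` and `marginal_entropy` are hypothetical.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of logits."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def steer(embedding, basis, coeffs):
    """Shift an embedding by sum_i coeffs[i] * basis[i].
    coeffs play the role of the per-sample shift parameters."""
    out = list(embedding)
    for c, direction in zip(coeffs, basis):
        for j, d in enumerate(direction):
            out[j] += c * d
    return out

def marginal_entropy(view_logits):
    """Entropy of the class distribution averaged over augmented views."""
    probs = [softmax(logits) for logits in view_logits]
    n = len(probs)
    avg = [sum(p[c] for p in probs) / n for c in range(len(probs[0]))]
    return -sum(p * math.log(p) for p in avg if p > 0.0)
```

In such a setup only `coeffs` would be optimized per test sample, so the number of trainable values equals the subspace dimension, which is consistent with the "handful of additional parameters" mentioned in the abstract.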

[79] Lumos3D: A Single-Forward Framework for Low-Light 3D Scene Restoration

Hanzhou Liu,Peng Jiang,Jia Huang,Mi Lu

Main category: cs.CV

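TL;DR: Proposes Lumos3D, a generalizable pose-free framework for 3D low-light scene restoration that uses a geometry-grounded backbone and a cross-illumination distillation scheme to restore illumination and structure from unposed multi-view images.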

Details Motivation: Existing methods depend on precomputed camera poses and scene-specific optimization and are hard to scale to dynamic real-world environments, so a pose-free 3D low-light restoration framework that generalizes well is needed. Method: Builds Lumos3D on a geometry-grounded backbone, uses a cross-illumination distillation scheme to transfer the teacher network's geometric information (such as depth) to the student model, and introduces a Lumos loss to promote photometric consistency within the reconstructed 3D space, enabling end-to-end feed-forward inference. Result: On real-world datasets, Lumos3D restores low-light 3D scenes with high fidelity, preserves accurate geometry, generalizes strongly, and extends naturally to over-exposure correction. Conclusion: Lumos3D achieves efficient 3D low-light restoration without per-scene optimization or known poses, shows good generality and application potential, and offers a new approach to practical lighting restoration tasks. Abstract: Restoring 3D scenes captured under low-light conditions remains a fundamental yet challenging problem. Most existing approaches depend on precomputed camera poses and scene-specific optimization, which greatly restricts their scalability to dynamic real-world environments. To overcome these limitations, we introduce Lumos3D, a generalizable pose-free framework for 3D low-light scene restoration. Trained once on a single dataset, Lumos3D performs inference in a purely feed-forward manner, directly restoring illumination and structure from unposed, low-light multi-view images without any per-scene training or optimization. Built upon a geometry-grounded backbone, Lumos3D reconstructs a normal-light 3D Gaussian representation that restores illumination while faithfully preserving structural details. During training, a cross-illumination distillation scheme is employed, where the teacher network is distilled on normal-light ground truth to transfer accurate geometric information, such as depth, to the student model. A dedicated Lumos loss is further introduced to promote photometric consistency within the reconstructed 3D space. Experiments on real-world datasets demonstrate that Lumos3D achieves high-fidelity low-light 3D scene restoration with accurate geometry and strong generalization to unseen cases. Furthermore, the framework naturally extends to handle over-exposure correction, highlighting its versatility for diverse lighting restoration tasks.

[80] From Street to Orbit: Training-Free Cross-View Retrieval via Location Semantics and LLM Guidance

Jeongho Min,Dongyoung Kim,Jaehyup Lee

Main category: cs.CV

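TL;DR: Proposes a training-free street-to-satellite image retrieval framework built on a pretrained vision encoder and a large language model; it outperforms prior methods in the zero-shot setting and can automatically generate semantically aligned cross-view datasets.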

Details Motivation: Existing cross-view image retrieval methods depend on supervised training and on specific data (such as panoramic or UAV images), which limits real-world use. A more general, low-cost solution that works in real scenarios is needed. Method: Uses a pretrained vision encoder (e.g., DINOv2) and a large language model: geographic cues are extracted through web-based image search, the LLM infers the location, a geocoding API generates the satellite query, and PCA-based whitening refines the features, all without fine-tuning or additional training. Result: In the zero-shot setting on the benchmark dataset, the method outperforms prior learning-based approaches, and it can automatically build paired street-to-satellite datasets. Conclusion: The method shows that effective cross-view retrieval is possible without supervised training and offers a scalable, cost-efficient solution for applications such as urban planning and autonomous navigation. Abstract: Cross-view image retrieval, particularly street-to-satellite matching, is a critical task for applications such as autonomous navigation, urban planning, and localization in GPS-denied environments. However, existing approaches often require supervised training on curated datasets and rely on panoramic or UAV-based images, which limits real-world deployment. In this paper, we present a simple yet effective cross-view image retrieval framework that leverages a pretrained vision encoder and a large language model (LLM), requiring no additional training. Given a monocular street-view image, our method extracts geographic cues through web-based image search and LLM-based location inference, generates a satellite query via geocoding API, and retrieves matching tiles using a pretrained vision encoder (e.g., DINOv2) with PCA-based whitening feature refinement. Despite using no ground-truth supervision or finetuning, our proposed method outperforms prior learning-based approaches on the benchmark dataset under zero-shot settings. Moreover, our pipeline enables automatic construction of semantically aligned street-to-satellite datasets, offering a scalable and cost-efficient alternative to manual annotation. All source code will be made publicly available at https://jeonghomin.github.io/street2orbit.github.io/.

[81] AHA! Animating Human Avatars in Diverse Scenes with Gaussian Splatting

Aymen Mir,Jian Wang,Riza Alp Guler,Chuan Guo,Gerard Pons-Moll,Bing Zhou

Main category: cs.CV

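TL;DR: Proposes a new framework based on 3D Gaussian Splatting (3DGS) for geometry-consistent, free-viewpoint human animation and interaction in 3D scenes; it needs no paired data and supports editing monocular videos to add new animated humans.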

Details Motivation: Existing animation pipelines mostly use meshes or point clouds, and 3DGS remains under-explored for human-scene interaction. This work aims to use the strengths of 3DGS in novel-view synthesis to make human animation in 3D scenes more realistic and flexible. Method: Represents both humans and scenes as Gaussians, decouples rendering from motion synthesis, and introduces a Gaussian-aligned motion module that uses opacity cues and projected Gaussian structures to guide human pose and placement; a human-scene Gaussian refinement optimization makes contact and navigation more realistic. Result: Evaluated on scenes from Scannet++ and SuperSplat and on avatars reconstructed from multi-view capture, the method produces free-viewpoint renderings and supports generating geometry-consistent new animated humans in edited monocular RGB videos. Conclusion: 3DGS has clear advantages as a unified representation for human-scene animation; the proposed framework achieves realistic, flexible human motion synthesis and scene interaction without paired data and extends to monocular video-based animation. Abstract: We present a novel framework for animating humans in 3D scenes using 3D Gaussian Splatting (3DGS), a neural scene representation that has recently achieved state-of-the-art photorealistic results for novel-view synthesis but remains under-explored for human-scene animation and interaction. Unlike existing animation pipelines that use meshes or point clouds as the underlying 3D representation, our approach introduces the use of 3DGS as the 3D representation to the problem of animating humans in scenes. By representing humans and scenes as Gaussians, our approach allows for geometry-consistent free-viewpoint rendering of humans interacting with 3D scenes. Our key insight is that the rendering can be decoupled from the motion synthesis and each sub-problem can be addressed independently, without the need for paired human-scene data. Central to our method is a Gaussian-aligned motion module that synthesizes motion without explicit scene geometry, using opacity-based cues and projected Gaussian structures to guide human placement and pose alignment. To ensure natural interactions, we further propose a human-scene Gaussian refinement optimization that enforces realistic contact and navigation. We evaluate our approach on scenes from Scannet++ and the SuperSplat library, and on avatars reconstructed from sparse and dense multi-view human capture.
Finally, we demonstrate that our framework allows for novel applications such as geometry-consistent free-viewpoint rendering of edited monocular RGB videos with new animated humans, showcasing the unique advantage of 3DGS for monocular video-based human animation.

[82] CertMask: Certifiable Defense Against Adversarial Patches via Theoretically Optimal Mask Coverage

Xuntao Lyu,Ching-Chi Lin,Abdullah Al Arafat,Georg von der Brüggen,Jian-Jia Chen,Zhishan Guo

Main category: cs.CV

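TL;DR: Proposes CertMask, a certifiably robust defense that counters adversarial image patch attacks with a single-round, O(n) masking strategy and is both more efficient and more accurate than existing methods.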

Details Motivation: Adversarial patch attacks can be physically deployed and threaten real-world applications, so efficient defenses with theoretical guarantees are needed. Method: Designs a mathematically rigorous coverage strategy that builds a set of binary masks covering every possible patch location at least k times, requires only a single round of masking with O(n) time complexity, and comes with a proof that the coverage condition is sufficient for certification. Result: Experiments on ImageNet, ImageNette, and CIFAR-10 show that CertMask improves certified robust accuracy by up to 13.4% over PatchCleanser while keeping clean accuracy nearly identical to the vanilla model. Conclusion: CertMask provides strong theoretical guarantees while substantially improving efficiency and robustness, making it an efficient defense against adversarial patch attacks. Abstract: Adversarial patch attacks inject localized perturbations into images to mislead deep vision models. These attacks can be physically deployed, posing serious risks to real-world applications. In this paper, we propose CertMask, a certifiably robust defense that constructs a provably sufficient set of binary masks to neutralize patch effects with strong theoretical guarantees. While the state-of-the-art approach (PatchCleanser) requires two rounds of masking and incurs $O(n^2)$ inference cost, CertMask performs only a single round of masking with $O(n)$ time complexity, where $n$ is the cardinality of the mask set to cover an input image. Our proposed mask set is computed using a mathematically rigorous coverage strategy that ensures each possible patch location is covered at least $k$ times, providing both efficiency and robustness. We offer a theoretical analysis of the coverage condition and prove its sufficiency for certification. Experiments on ImageNet, ImageNette, and CIFAR-10 show that CertMask improves certified robust accuracy by up to +13.4\% over PatchCleanser, while maintaining clean accuracy nearly identical to the vanilla model.
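The coverage condition, that every possible patch location is covered at least $k$ times, can be checked on a toy example. The sketch below works in one dimension and only verifies the condition for a given set of mask offsets; it does not reproduce the paper's mask construction, which the abstract does not describe. The function name `min_coverage` and the convention that masks may start before the image border (negative offsets) are assumptions.

```python
def min_coverage(image_size, patch_size, mask_size, offsets):
    """Return the smallest number of masks that fully cover a patch,
    taken over all patch locations in a 1D image.

    A mask starting at offset o covers the interval [o, o + mask_size).
    A patch at location x occupies [x, x + patch_size).
    """
    counts = []
    for x in range(image_size - patch_size + 1):
        covered = sum(
            1 for o in offsets
            if o <= x and x + patch_size <= o + mask_size
        )
        counts.append(covered)
    return min(counts)
```

A mask set satisfies the k-coverage condition exactly when `min_coverage(...) >= k`. For instance, with an image of width 8, patches of width 2, and masks of width 4, offsets `[0, 2, 4]` cover every patch at least once, while offsets `[0, 4]` leave the patch at location 3 uncovered.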

[83] CORONA-Fields: Leveraging Foundation Models for Classification of Solar Wind Phenomena

Daniela Martin,Jinsu Hong,Connor O'Brien,Valmir P Moraes Filho,Jasmine R. Kobayashi,Evangelia Samara,Joseph Gallego

Main category: cs.CV

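TL;DR: This study is a first attempt to apply a solar-physics foundation model to the in situ analysis of solar wind structures. It combines remote-sensing image embeddings with Fourier feature encoding to build a neural field model that links remote sensing and in situ observations. Classification performance is modest, but the study demonstrates feasibility and lays groundwork for better space weather forecasting.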

Details Motivation: Space weather driven by solar activity poses growing risks to satellites and ground-based infrastructure, and the variability of the solar wind and coronal mass ejections makes their automated classification challenging, so new methods are needed for reliable space weather prediction. Method: Uses a foundation model pretrained on Solar Dynamics Observatory imagery to generate embeddings, combines them with the spacecraft position and solar magnetic connectivity (encoded with Fourier features) to build a neural field model, and fine-tunes on a downstream classification task with labels derived from Parker Solar Probe measurements. Result: Overall classification performance is modest, likely because of coarse labeling, class imbalance, and limited transferability of the pretrained model, but the approach does bridge remote sensing and in situ observations. Conclusion: The study shows that using foundation model embeddings for in situ solar wind analysis is feasible and, as a proof of concept, points to directions for improvement toward more reliable space weather prediction. Abstract: Space weather at Earth, driven by solar activity, poses growing risks to satellites around our planet as well as to critical ground-based technological infrastructure. Major space weather contributors are the solar wind and coronal mass ejections whose variable density, speed, temperature, and magnetic field make the automated classification of those structures challenging. In this work, we adapt a foundation model for solar physics, originally trained on Solar Dynamics Observatory imagery, to create embeddings suitable for solar wind structure analysis. These embeddings are concatenated with the spacecraft position and solar magnetic connectivity encoded using Fourier features, which yields a neural field-based model. The full deep learning architecture is fine-tuned, bridging the gap between remote sensing and in situ observations. Labels are derived from Parker Solar Probe measurements, forming a downstream classification task that maps plasma properties to solar wind structures. Although overall classification performance is modest, likely due to coarse labeling, class imbalance, and limited transferability of the pretrained model, this study demonstrates the feasibility of leveraging foundation model embeddings for in situ solar wind tasks. As a first proof-of-concept, it lays the groundwork for future improvements toward more reliable space weather predictions. The code and configuration files used in this study are publicly available to support reproducibility.
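Fourier feature encoding of a low-dimensional input such as a spacecraft coordinate can be sketched as follows. The frequency schedule used here, powers of two times pi as in many neural field models, is an assumption, since the abstract does not state which frequencies the authors use. `fourier_features` and `encode_position` are hypothetical names.

```python
import math

def fourier_features(x, num_bands):
    """Encode a scalar x as [sin(2^k * pi * x), cos(2^k * pi * x)]
    for k = 0 .. num_bands - 1 (assumed frequency schedule)."""
    feats = []
    for k in range(num_bands):
        freq = (2 ** k) * math.pi
        feats.append(math.sin(freq * x))
        feats.append(math.cos(freq * x))
    return feats

def encode_position(coords, num_bands):
    """Concatenate the Fourier features of every coordinate."""
    out = []
    for c in coords:
        out.extend(fourier_features(c, num_bands))
    return out
```

The resulting vector would then be concatenated with the image embedding before the classification head, as the abstract describes at a high level. Coordinates are usually normalized to a bounded range first so that the highest frequency band stays meaningful.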

[84] IPCD: Intrinsic Point-Cloud Decomposition

Shogo Sato,Takuhiro Kaneko,Shoichiro Takeda,Tomoyasu Shimada,Kazuhiko Murasaki,Taiga Yoshida,Ryuichi Tanida,Akisato Kimura

Main category: cs.CV

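TL;DR: This paper proposes Intrinsic Point-Cloud Decomposition (IPCD), which decomposes colored point clouds into albedo and shade to enable realistic relighting and texture editing. The accompanying IPCD-Net combines point-wise feature aggregation with a Projection-based Luminance Distribution (PLD) to handle non-grid data and capture global lighting information.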

Details Motivation: The non-grid structure of point clouds makes conventional image decomposition methods inapplicable, and existing point-cloud models do not explicitly consider global light direction, which leads to inaccurate shade estimates. A method that accurately separates albedo from shade is therefore needed. Method: IPCD-Net extends an image decomposition model with point-wise feature aggregation for non-grid data, and introduces a Projection-based Luminance Distribution (PLD) with hierarchical feature refinement that captures global-light cues via multi-view projection. Result: Experiments on a synthetic outdoor-scene dataset show that IPCD-Net reduces cast shadows in albedo and improves color accuracy in shade; applications are demonstrated in texture editing, relighting, and point-cloud registration under varying illumination. Conclusion: IPCD-Net achieves effective intrinsic decomposition of point clouds and supports high-quality visual tasks, with good results on synthetic data and verified applicability to real-world scenes. Abstract: Point clouds are widely used in various fields, including augmented reality (AR) and robotics, where relighting and texture editing are crucial for realistic visualization. Achieving these tasks requires accurately separating albedo from shade. However, performing this separation on point clouds presents two key challenges: (1) the non-grid structure of point clouds makes conventional image-based decomposition models ineffective, and (2) point-cloud models designed for other tasks do not explicitly consider global-light direction, resulting in inaccurate shade. In this paper, we introduce Intrinsic Point-Cloud Decomposition (IPCD), which extends image decomposition to the direct decomposition of colored point clouds into albedo and shade. To overcome challenge (1), we propose IPCD-Net, which extends an image-based model with point-wise feature aggregation for non-grid data processing. For challenge (2), we introduce Projection-based Luminance Distribution (PLD) with a hierarchical feature refinement, capturing global-light cues via multi-view projection. For comprehensive evaluation, we create a synthetic outdoor-scene dataset. Experimental results demonstrate that IPCD-Net reduces cast shadows in albedo and enhances color accuracy in shade. Furthermore, we showcase its applications in texture editing, relighting, and point-cloud registration under varying illumination. Finally, we verify the real-world applicability of IPCD-Net.
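
To make the albedo/shade split concrete, the sketch below assumes the common multiplicative intrinsic model (per-point color = albedo × shade) with a scalar shade per point. Both are simplifying assumptions for illustration; the abstract does not specify the exact image formation model, and all values are made up.

```python
import numpy as np

# Hypothetical per-point RGB albedo and scalar shade for 4 points.
albedo = np.array([[0.8, 0.2, 0.2],
                   [0.8, 0.2, 0.2],
                   [0.1, 0.6, 0.3],
                   [0.1, 0.6, 0.3]])
shade = np.array([1.0, 0.4, 0.9, 0.5])

# Observed point colors under the assumed multiplicative model.
observed = albedo * shade[:, None]

# Relighting: keep the albedo, swap in a new shade.
new_shade = np.array([0.5, 1.0, 0.5, 1.0])
relit = albedo * new_shade[:, None]

# If the shade were known exactly, the albedo could be recovered by division.
recovered_albedo = observed / shade[:, None]
print(np.allclose(recovered_albedo, albedo))
```

In practice neither factor is known, which is why a network has to estimate both from the observed colors.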

[85] Remember Me: Bridging the Long-Range Gap in LVLMs with Three-Step Inference-Only Decay Resilience Strategies

Peng Gao,Yujian Lee,Xiaofeng Zhang,Zailong Chen,Hui Zhang

Main category: cs.CV

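TL;DR: Proposes training-free Three-step Decay Resilience Strategies (T-DRS) to counter the long-range dependency decay that rotary position encoding causes in large vision-language models, improving performance on visual question answering tasks.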

Details Motivation: Rotary position encoding locates tokens precisely but causes attention to decay over distant token pairs, which harms the model's ability to remember global context. Method: T-DRS comprises three strategies (semantic-driven, distance-aware control, and re-reinforcement of distant dependencies) that recover suppressed long-range dependencies through content-aware residuals, smooth modulation based on positional distance, and consolidation of the remaining remote dependencies, respectively. Result: Experiments on several visual question answering benchmarks show that T-DRS consistently improves performance without any training. Conclusion: T-DRS mitigates the attention decay introduced by RoPE and strengthens global context modeling without harming local inductive biases. Abstract: Large Vision-Language Models (LVLMs) have achieved impressive performance across a wide range of multimodal tasks. However, they still face critical challenges in modeling long-range dependencies when using Rotary Positional Encoding (RoPE). Although it facilitates precise modeling of token positions, it induces progressive attention decay as token distance increases, especially over distant token pairs, which severely impairs the model's ability to remember global context. To alleviate this issue, we propose inference-only Three-step Decay Resilience Strategies (T-DRS), comprising (1) Semantic-Driven DRS (SD-DRS), amplifying semantically meaningful but distant signals via content-aware residuals, (2) Distance-aware Control DRS (DC-DRS), which purifies attention by smoothly modulating weights based on positional distances, suppressing noise while preserving locality, and (3) re-Reinforce Distant DRS (reRD-DRS), consolidating the remaining informative remote dependencies to maintain global coherence. Together, the T-DRS recover suppressed long-range token pairs without harming local inductive biases. Extensive experiments on Visual Question Answering (VQA) benchmarks demonstrate that T-DRS can consistently improve performance in a training-free manner. The code can be accessed at https://github.com/labixiaoq-qq/Remember-me
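
The distance dependence that motivates T-DRS can be probed with a small rotary-encoding sketch. It uses the standard RoPE frequency schedule with base 10000 and all-ones query/key vectors; the vector choice and the dimension (64) are assumptions for illustration, and this shows only the encoding itself, not any of the three T-DRS steps.

```python
import numpy as np

def rope_rotate(x, pos, base=10000.0):
    """Apply rotary position encoding to an even-length vector x at integer position pos."""
    half = x.shape[-1] // 2
    theta = base ** (-np.arange(half) / half)     # one rotation frequency per 2-D pair
    ang = pos * theta
    x1, x2 = x[0::2], x[1::2]
    out = np.empty_like(x)
    out[0::2] = x1 * np.cos(ang) - x2 * np.sin(ang)
    out[1::2] = x1 * np.sin(ang) + x2 * np.cos(ang)
    return out

dim = 64
q = np.ones(dim)
k = np.ones(dim)

def score(distance):
    """Pre-softmax attention score between a query and a key `distance` positions apart."""
    return float(rope_rotate(q, distance) @ rope_rotate(k, 0))

for d in (0, 1, 8, 64, 512):
    print(d, round(score(d), 3))
```

The score depends only on the relative distance between the two positions, and for these all-ones vectors it can never exceed its value at distance 0; how quickly it falls off in a real model depends on the learned queries and keys.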

[86] SAM-DAQ: Segment Anything Model with Depth-guided Adaptive Queries for RGB-D Video Salient Object Detection

Jia Lin,Xiaofei Zhou,Jiyuan Liu,Runmin Cong,Guodao Zhang,Zhi Liu,Jiyong Zhang

Main category: cs.CV

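TL;DR: Proposes SAM-DAQ, a SAM2-based method with depth-guided adaptive queries for RGB-D video salient object detection. A parallel multi-modal encoder and a query-driven memory module address prompt dependence, high memory consumption, and heavy computation.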

Details Motivation: Applying SAM to RGB-D video salient object detection runs into dependence on manual prompts, high memory consumption, and high computational complexity, so a more efficient, prompt-free method that fuses depth and temporal information is needed. Method: SAM-DAQ has two core modules: (1) a parallel adapter-based multi-modal image encoder (PAMIE), which injects depth information through skip-connected depth-guided adapters and fine-tunes the SAM encoder under prompt-free conditions; (2) a query-driven temporal memory (QTM) module, which unifies the memory bank and prompt embeddings and uses frame-level and video-level queries to extract temporally consistent features and update the query representations. Result: On three RGB-D VSOD datasets, SAM-DAQ outperforms existing state-of-the-art methods on all evaluation metrics. Conclusion: SAM-DAQ addresses the key challenges of applying SAM to RGB-D video salient object detection, achieving prompt-free, memory-efficient multi-modal temporal modeling and advancing the use of general segmentation models for video saliency detection. Abstract: Recently, the Segment Anything Model (SAM) has attracted widespread attention, and it is often treated as a vision foundation model for universal segmentation. Some researchers have attempted to directly apply the foundation model to the RGB-D video salient object detection (RGB-D VSOD) task, which often encounters three challenges: the dependence on manual prompts, the high memory consumption of sequential adapters, and the computational burden of memory attention. To address these limitations, we propose a novel method, namely Segment Anything Model with Depth-guided Adaptive Queries (SAM-DAQ), which adapts SAM2 to pop out salient objects from videos by seamlessly integrating depth and temporal cues within a unified framework. Firstly, we deploy a parallel adapter-based multi-modal image encoder (PAMIE), which incorporates several depth-guided parallel adapters (DPAs) via skip connections. Remarkably, we fine-tune the frozen SAM encoder under prompt-free conditions, where the DPA utilizes depth cues to facilitate the fusion of multi-modal features. Secondly, we deploy a query-driven temporal memory (QTM) module, which unifies the memory bank and prompt embeddings into a learnable pipeline. Concretely, by leveraging both frame-level queries and video-level queries simultaneously, the QTM module can not only selectively extract temporal consistency features but also iteratively update the temporal representations of the queries. Extensive experiments are conducted on three RGB-D VSOD datasets, and the results show that the proposed SAM-DAQ consistently outperforms state-of-the-art methods in terms of all evaluation metrics.

[87] RWKV-PCSSC: Exploring RWKV Model for Point Cloud Semantic Scene Completion

Wenzhe He,Xiaojun Chen,Wentang Chen,Hongyu Wang,Ying Liu,Ruihui Li

Main category: cs.CV

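TL;DR: This paper proposes RWKV-PCSSC, a lightweight point cloud semantic scene completion network based on the RWKV mechanism. Its RWKV-SG and RWKV-PD modules perform efficient feature aggregation and point cloud restoration, cutting parameter count and improving memory efficiency while reaching state-of-the-art results on several indoor and outdoor datasets.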

Details Motivation: Existing semantic scene completion methods usually rely on dense network architectures, which leads to high model complexity and resource consumption, so lighter and more efficient solutions are needed. Method: RWKV-PCSSC contains an RWKV Seed Generator (RWKV-SG) module that produces a coarse point cloud with coarse features from a partial point cloud, and multi-stage RWKV Point Deconvolution (RWKV-PD) modules that progressively restore point-wise features; the RWKV mechanism keeps the design lightweight. Result: Compared with PointSSC, RWKV-PCSSC has 4.18 times fewer parameters and 1.37 times better memory efficiency, and achieves state-of-the-art performance on SSC-PC, NYUCAD-PC, and PointSSC as well as on the newly proposed NYUCAD-PC-V2 and 3D-FRONT-PC datasets. Conclusion: By adopting a lightweight RWKV-based architecture, RWKV-PCSSC lowers model complexity while maintaining and improving semantic scene completion performance, making it suitable for resource-constrained applications. Abstract: Semantic Scene Completion (SSC) aims to generate a complete semantic scene from an incomplete input. Existing approaches often employ dense network architectures with a high parameter count, leading to increased model complexity and resource demands. To address these limitations, we propose RWKV-PCSSC, a lightweight point cloud semantic scene completion network inspired by the Receptance Weighted Key Value (RWKV) mechanism. Specifically, we introduce an RWKV Seed Generator (RWKV-SG) module that can aggregate features from a partial point cloud to produce a coarse point cloud with coarse features. Subsequently, the point-wise features of the point cloud are progressively restored through multiple stages of the RWKV Point Deconvolution (RWKV-PD) modules. By leveraging a compact and efficient design, our method achieves a lightweight model representation. Experimental results demonstrate that RWKV-PCSSC reduces the parameter count by 4.18$\times$ and improves memory efficiency by 1.37$\times$ compared to the state-of-the-art method PointSSC. Furthermore, our network achieves state-of-the-art performance on established indoor (SSC-PC, NYUCAD-PC) and outdoor (PointSSC) scene datasets, as well as on our proposed datasets (NYUCAD-PC-V2, 3D-FRONT-PC).

[88] HCC-3D: Hierarchical Compensatory Compression for 98% 3D Token Reduction in Vision-Language Models

Liheng Zhang,Jin Wang,Hui Li,Bingfeng Zhang,Weifeng Liu

Main category: cs.CV

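TL;DR: This paper proposes HCC-3D, a hierarchical compensatory compression method that efficiently compresses 3D point cloud tokens, greatly reducing the computational overhead of 3D vision-language models while preserving essential information.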

Details Motivation: Existing 3D-VLMs feed all 3D tokens into the large language model, which makes computation too expensive and limits their application; a method is needed that compresses tokens while retaining key information. Method: Hierarchical Compensatory Compression (HCC-3D) consists of a global structure compression (GSC) module that compresses the point cloud into a few key tokens, and an adaptive detail mining (ADM) module that compensates by recovering important features that were overlooked. Result: Experiments show that HCC-3D reaches an extreme compression ratio of about 98%, well beyond previous methods, while setting new state-of-the-art results on several tasks. Conclusion: HCC-3D resolves the tension between computational efficiency and information retention in 3D-VLMs and offers a new paradigm for efficient multi-modal 3D understanding. Abstract: 3D understanding has drawn significant attention recently, leveraging Vision-Language Models (VLMs) to enable multi-modal reasoning between point cloud and text data. Current 3D-VLMs directly embed the 3D point clouds into 3D tokens, following large 2D-VLMs with powerful reasoning capabilities. However, this framework incurs a high computational cost that limits its application, and we identify that the bottleneck lies in processing all 3D tokens in the Large Language Model (LLM) part. This raises the question: how can we reduce the computational overhead introduced by 3D tokens while preserving the integrity of their essential information? To address this question, we introduce Hierarchical Compensatory Compression (HCC-3D) to efficiently compress 3D tokens while maintaining critical detail retention. Specifically, we first propose a global structure compression (GSC), in which we design global queries to compress all 3D tokens into a few key tokens while keeping overall structural information. Then, to compensate for the information loss in GSC, we further propose an adaptive detail mining (ADM) module that selectively recompresses salient but under-attended features through complementary scoring. Extensive experiments demonstrate that HCC-3D not only achieves extreme compression ratios (approximately 98%) compared to previous 3D-VLMs, but also achieves new state-of-the-art performance, showing great improvements in both efficiency and performance.
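
The query-based compression idea can be sketched with plain cross-attention pooling. Everything concrete here is an assumption for illustration: the token count (512), width (32), number of queries (10), the random stand-ins for learned queries, and the "complementary score" used to pick under-attended tokens are not taken from the paper.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def compress_tokens(tokens, queries):
    """Cross-attention pooling: each query attends over all tokens and yields one output token."""
    d = tokens.shape[-1]
    attn = softmax(queries @ tokens.T / np.sqrt(d), axis=-1)   # (num_queries, num_tokens)
    return attn @ tokens, attn

rng = np.random.default_rng(0)
tokens = rng.standard_normal((512, 32))    # hypothetical 3D tokens
queries = rng.standard_normal((10, 32))    # random stand-ins for learned global queries

compressed, attn = compress_tokens(tokens, queries)
reduction = 1.0 - compressed.shape[0] / tokens.shape[0]
print(compressed.shape, f"{reduction:.1%}")

# Assumed stand-in for complementary scoring: tokens that received little
# total attention score high and become candidates for a second pass.
received = attn.sum(axis=0)
complementary = 1.0 - received / received.max()
detail_idx = np.argsort(-complementary)[:4]
```

With these made-up sizes, keeping 10 of 512 tokens corresponds to roughly the 98% reduction quoted in the abstract; the real token counts may differ.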

[89] Scale-Aware Relay and Scale-Adaptive Loss for Tiny Object Detection in Aerial Images

Jinfu Li,Yuqi Huang,Hong Song,Ting Wang,Jianghan Xia,Yucong Lin,Jingfan Fan,Jian Yang

Main category: cs.CV

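TL;DR: This paper proposes a Scale-Aware Relay Layer (SARL) and a Scale-Adaptive Loss (SAL) for tiny object detection in aerial images. Both improve existing detectors on several benchmarks, with the largest benefit on tiny objects.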

Details Motivation: Modern detectors struggle with tiny objects in aerial images, mainly because tiny objects carry few features that are easily lost as they propagate through the network, and because smaller objects receive disproportionately greater regression penalties than larger ones during training. Method: SARL uses cross-scale spatial-channel attention to enrich the features of each layer and strengthen cross-layer feature sharing; SAL reshapes IoU-based losses to dynamically assign lower weights to larger objects so that training focuses on tiny objects. Result: On AI-TOD, DOTA-v2.0, and VisDrone2019, the method improves Average Precision (AP) by 5.5% on YOLOv5 and YOLOX baselines, and reaches 29.0% AP on the real-world noisy dataset AI-TOD-v2.0. Conclusion: SARL and SAL clearly improve tiny object detection in aerial images, are robust, and plug into mainstream detection frameworks. Abstract: Recently, despite the remarkable advancements in object detection, modern detectors still struggle to detect tiny objects in aerial images. One key reason is that tiny objects carry limited features that are inevitably degraded or lost during long-distance network propagation. Another is that smaller objects receive disproportionately greater regression penalties than larger ones during training. To tackle these issues, we propose a Scale-Aware Relay Layer (SARL) and a Scale-Adaptive Loss (SAL) for tiny object detection, both of which are seamlessly compatible with the top-performing frameworks. Specifically, SARL employs a cross-scale spatial-channel attention to progressively enrich the meaningful features of each layer and strengthen the cross-layer feature sharing. SAL reshapes the vanilla IoU-based losses so as to dynamically assign lower weights to larger objects. This loss is able to focus training on tiny objects while reducing the influence on large objects. Extensive experiments are conducted on three benchmarks (i.e., AI-TOD, DOTA-v2.0 and VisDrone2019), and the results demonstrate that the proposed method boosts the generalization ability by 5.5% Average Precision (AP) when embedded in YOLOv5 (anchor-based) and YOLOX (anchor-free) baselines. Moreover, it also promotes the robust performance with 29.0% AP on the real-world noisy dataset (i.e., AI-TOD-v2.0).
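
The abstract states only that SAL assigns lower weights to larger objects; it does not give the weighting function. The sketch below therefore uses a made-up weight, `1 / (1 + size / ref_size)`, purely to show how a size-dependent factor changes an IoU loss. The function names and `ref_size=32` are hypothetical.

```python
import math

def iou(box_a, box_b):
    """IoU of two axis-aligned (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

def scale_weight(gt_box, ref_size=32.0):
    """Hypothetical weight that shrinks as the ground-truth box grows."""
    size = math.sqrt((gt_box[2] - gt_box[0]) * (gt_box[3] - gt_box[1]))
    return 1.0 / (1.0 + size / ref_size)

def weighted_iou_loss(pred, gt, ref_size=32.0):
    return scale_weight(gt, ref_size) * (1.0 - iou(pred, gt))

# Two predictions with the same IoU (0.6) against a tiny and a large ground-truth box.
tiny_loss = weighted_iou_loss((2, 0, 10, 8), (0, 0, 8, 8))
large_loss = weighted_iou_loss((32, 0, 160, 128), (0, 0, 128, 128))
print(tiny_loss, large_loss)
```

Both predictions are equally good in IoU terms, yet the tiny box contributes four times as much loss under this weight, which is the kind of rebalancing the abstract describes.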

[90] Regional Attention-Enhanced Swin Transformer for Clinically Relevant Medical Image Captioning

Zubia Naz,Farhan Asghar,Muhammad Ishfaq Hussain,Yahya Hadadi,Muhammad Aasim Rafique,Wookjin Choi,Moongu Jeon

Main category: cs.CV

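TL;DR: Proposes a Swin-BART encoder-decoder system with a lightweight regional attention module for automated medical image captioning, achieving state-of-the-art semantic fidelity on the ROCO dataset.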

Details Motivation: Medical image captioning can support radiology reporting workflows, but existing methods fall short in semantic accuracy and clinical interpretability, so a model is needed that improves caption quality and also explains which regions drive the description. Method: A Swin-BART encoder-decoder framework is extended with a lightweight regional attention module that amplifies diagnostically salient regions before cross-attention; the model is trained and evaluated on ROCO, decoded with beam search, and analyzed through ablations, per-modality analysis, and significance tests. Result: The model clearly outperforms the baselines on ROUGE and BERTScore (ROUGE: 0.603 vs. 0.356/0.255; BERTScore: 0.807 vs. 0.645/0.623), with comparable BLEU, CIDEr, and METEOR; regional attention ablations, per-modality analysis, paired significance tests, and heatmap visualizations are provided. Conclusion: The proposed model stays compact and interpretable while generating accurate, clinically phrased captions, and can support safe research use with a human in the loop. Abstract: Automated medical image captioning translates complex radiological images into diagnostic narratives that can support reporting workflows. We present a Swin-BART encoder-decoder system with a lightweight regional attention module that amplifies diagnostically salient regions before cross-attention. Trained and evaluated on ROCO, our model achieves state-of-the-art semantic fidelity while remaining compact and interpretable. We report results as mean$\pm$std over three seeds and include $95\%$ confidence intervals. Compared with baselines, our approach improves ROUGE (proposed 0.603, ResNet-CNN 0.356, BLIP2-OPT 0.255) and BERTScore (proposed 0.807, BLIP2-OPT 0.645, ResNet-CNN 0.623), with competitive BLEU, CIDEr, and METEOR. We further provide ablations (regional attention on/off and token-count sweep), per-modality analysis (CT/MRI/X-ray), paired significance tests, and qualitative heatmaps that visualize the regions driving each description. Decoding uses beam search (beam size $=4$), length penalty $=1.1$, no_repeat_ngram_size $=3$, and max length $=128$. The proposed design yields accurate, clinically phrased captions and transparent regional attributions, supporting safe research use with a human in the loop.

[91] Simulating Distribution Dynamics: Liquid Temporal Feature Evolution for Single-Domain Generalized Object Detection

Zihao Zhang,Yang Li,Aming Wu,Yahong Han

Main category: cs.CV

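TL;DR: This paper proposes Liquid Temporal Feature Evolution (LTFE) for Single-Domain Generalized Object Detection (Single-DGOD). Using temporal modeling and a parameter adjustment mechanism driven by liquid neural networks, it simulates the gradual evolution of features from the source domain toward latent distributions, improving generalization to unknown domains.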

Details Motivation: Existing methods rely on discrete data augmentation or static perturbations to compensate for missing target-domain data, and so struggle to capture the continuous, gradual domain shifts seen in real scenes (such as changes in weather or lighting), which limits sensitivity to fine-grained cross-domain differences. Method: Liquid Temporal Feature Evolution (LTFE) generates initial feature perturbations with controllable Gaussian noise injection and multi-scale Gaussian blurring, then uses temporal modeling and a liquid neural network to adjust parameters dynamically, giving smooth, continuous adaptation across domains. Result: The method yields significant gains on the Diverse Weather dataset and the Real-to-Art benchmark, confirming its generalization and robustness under unseen domain shifts. Conclusion: By modeling progressive feature evolution and dynamically regulating adaptation paths, LTFE narrows the distribution gap between the source domain and unknown domains and provides a stronger continuous-adaptation solution for Single-DGOD. Abstract: In this paper, we focus on Single-Domain Generalized Object Detection (Single-DGOD), aiming to transfer a detector trained on one source domain to multiple unknown domains. Existing methods for Single-DGOD typically rely on discrete data augmentation or static perturbation methods to expand data diversity, thereby mitigating the lack of access to target domain data. However, in real-world scenarios such as changes in weather or lighting conditions, domain shifts often occur continuously and gradually. Discrete augmentations and static perturbations fail to effectively capture the dynamic variation of feature distributions, thereby limiting the model's ability to perceive fine-grained cross-domain differences. To this end, we propose a new method, Liquid Temporal Feature Evolution, which simulates the progressive evolution of features from the source domain to simulated latent distributions by incorporating temporal modeling and liquid neural network-driven parameter adjustment. Specifically, we introduce controllable Gaussian noise injection and multi-scale Gaussian blurring to simulate initial feature perturbations, followed by temporal modeling and a liquid parameter adjustment mechanism to generate adaptive modulation parameters, enabling a smooth and continuous adaptation across domains. By capturing progressive cross-domain feature evolution and dynamically regulating adaptation paths, our method bridges the source-unknown domain distribution gap, significantly boosting generalization and robustness to unseen shifts. Significant performance improvements on the Diverse Weather dataset and Real-to-Art benchmark demonstrate the superiority of our method. Our code is available at https://github.com/2490o/LTFE.

[92] MosaicDoc: A Large-Scale Bilingual Benchmark for Visually Rich Document Understanding

Ketong Chen,Yuhao Chen,Yang Xue

Main category: cs.CV

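TL;DR: This paper presents MosaicDoc, a large-scale, bilingual (Chinese and English), multi-task benchmark for visually rich document understanding. It is generated automatically by a multi-agent pipeline and is used to evaluate existing vision-language models on complex layouts and dense text.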

Details Motivation: Existing benchmarks for vision-language models are mostly English-only, use simple layouts, and cover few tasks, so they cannot properly evaluate models on visual documents with complex layouts and dense text. Method: The DocWeaver multi-agent pipeline uses large language models to generate the MosaicDoc benchmark automatically; it contains 72K images from newspapers and magazines and over 600K QA pairs, with multi-task annotations covering OCR, VQA, reading order, and localization. Result: MosaicDoc offers complex layouts (such as multi-column and non-Manhattan), rich stylistic variety (from 196 publishers), and multi-task evaluation in both Chinese and English; experiments show that current state-of-the-art models still have marked limitations in handling real-world document complexity. Conclusion: MosaicDoc provides a more challenging and realistic benchmark for visually rich document understanding, exposes the shortcomings of existing models, and points to directions for future research. Abstract: Despite the rapid progress of Vision-Language Models (VLMs), their capabilities are inadequately assessed by existing benchmarks, which are predominantly English-centric, feature simplistic layouts, and support limited tasks. Consequently, they fail to evaluate model performance for Visually Rich Document Understanding (VRDU), a critical challenge involving complex layouts and dense text. To address this, we introduce DocWeaver, a novel multi-agent pipeline that leverages Large Language Models to automatically generate a new benchmark. The result is MosaicDoc, a large-scale, bilingual (Chinese and English) resource designed to push the boundaries of VRDU. Sourced from newspapers and magazines, MosaicDoc features diverse and complex layouts (including multi-column and non-Manhattan), rich stylistic variety from 196 publishers, and comprehensive multi-task annotations (OCR, VQA, reading order, and localization). With 72K images and over 600K QA pairs, MosaicDoc serves as a definitive benchmark for the field. Our extensive evaluation of state-of-the-art models on this benchmark reveals their current limitations in handling real-world document complexity and charts a clear path for future research.

[93] Compensating Distribution Drifts in Class-incremental Learning of Pre-trained Vision Transformers

Xuan Rao,Simian Xu,Zheng Li,Bo Zhao,Derong Liu,Mingming Ha,Cesare Alippi

Main category: cs.CV

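TL;DR: Proposes SLDC, a sequential learning method for class-incremental learning that improves model performance by compensating for drift in feature distributions.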

Details Motivation: Existing sequential fine-tuning methods are vulnerable to distribution drift in class-incremental learning: the feature distributions of previously learned classes no longer match the updated model, which hurts classifier performance. Method: A latent space transition operator is introduced, with linear and weakly nonlinear variants of SLDC, and knowledge distillation is used alongside them; together they align feature distributions across tasks to mitigate drift. Result: Experiments on several standard class-incremental learning benchmarks show that SLDC significantly improves SeqFT, and that with knowledge distillation it reaches performance comparable to joint training. Conclusion: SLDC effectively compensates for distribution drift and improves sequential fine-tuning in class-incremental learning, with good potential for application. Abstract: Recent advances have shown that sequential fine-tuning (SeqFT) of pre-trained vision transformers (ViTs), followed by classifier refinement using approximate distributions of class features, can be an effective strategy for class-incremental learning (CIL). However, this approach is susceptible to distribution drift, caused by the sequential optimization of shared backbone parameters. This results in a mismatch between the distributions of the previously learned classes and that of the updated model, ultimately degrading classifier performance over time. To address this issue, we introduce a latent space transition operator and propose Sequential Learning with Drift Compensation (SLDC). SLDC aims to align feature distributions across tasks to mitigate the impact of drift. First, we present a linear variant of SLDC, which learns a linear operator by solving a regularized least-squares problem that maps features before and after fine-tuning. Next, we extend this with a weakly nonlinear SLDC variant, which assumes that the ideal transition operator lies between purely linear and fully nonlinear transformations. This is implemented using learnable, weakly nonlinear mappings that balance flexibility and generalization. To further reduce representation drift, we apply knowledge distillation (KD) in both algorithmic variants. Extensive experiments on standard CIL benchmarks demonstrate that SLDC significantly improves the performance of SeqFT. Notably, by combining KD to address representation drift with SLDC to compensate distribution drift, SeqFT achieves performance comparable to joint training across all evaluated datasets. Code: https://github.com/raoxuan98-hash/sldc.git.
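
The linear variant described above (a regularized least-squares map between features before and after fine-tuning) has a standard closed-form ridge solution, sketched below. The data are synthetic, and the names `fit_linear_transition`, `true_drift`, and the regularization value are assumptions for the example rather than details from the paper.

```python
import numpy as np

def fit_linear_transition(feats_before, feats_after, reg=1e-3):
    """Closed-form minimizer of ||feats_before @ W - feats_after||_F^2 + reg * ||W||_F^2."""
    d = feats_before.shape[1]
    gram = feats_before.T @ feats_before + reg * np.eye(d)
    return np.linalg.solve(gram, feats_before.T @ feats_after)

rng = np.random.default_rng(0)
d = 8
feats_before = rng.standard_normal((200, d))             # features from the backbone before fine-tuning
true_drift = np.eye(d) + 0.1 * rng.standard_normal((d, d))
feats_after = feats_before @ true_drift                   # same samples after fine-tuning (synthetic drift)

W = fit_linear_transition(feats_before, feats_after, reg=1e-6)

# Move a stored class mean from the old feature space into the new one.
old_class_mean = feats_before[:50].mean(axis=0)
compensated_mean = old_class_mean @ W
print(np.allclose(compensated_mean, feats_after[:50].mean(axis=0), atol=1e-4))
```

Because the operator is estimated from current-task samples passed through both backbones, stored statistics of earlier classes can be updated without revisiting their data; whether a purely linear map suffices is exactly what the weakly nonlinear variant is meant to address.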

[94] Debiased Dual-Invariant Defense for Adversarially Robust Person Re-Identification

Yuhang Zhou,Yanxiang Zhao,Zhongyun Hua,Zhipu Liu,Zhaoquan Gu,Qing Liao,Leo Yu Zhang

Main category: cs.CV

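TL;DR: This paper proposes a debiased dual-invariant defense framework that makes person re-identification (ReID) models more robust to adversarial attacks. Through data balancing and a new metric adversarial training mechanism, it generalizes better to both unseen identities and unseen attack types.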

Details Motivation: Most adversarial defenses target classification and do not transfer directly to metric learning tasks such as person ReID, and existing ReID defenses fail to address two challenges: model bias and composite generalization requirements. Method: The debiased dual-invariant defense framework has two phases: (1) a data balancing phase that mitigates model bias with a diffusion-model-based data resampling strategy; (2) a bi-adversarial self-meta defense phase that introduces metric adversarial training with farthest negative extension softening, plus an adversarially-enhanced self-meta mechanism for dual generalization to unseen identities and unseen attack types. Result: Experiments show the method significantly outperforms existing state-of-the-art defenses under a range of attacks, improving robustness and fairness in adversarial settings. Conclusion: The framework addresses model bias and composite generalization in ReID and offers a new route toward secure and reliable person re-identification systems. Abstract: Person re-identification (ReID) is a fundamental task in many real-world applications such as pedestrian trajectory tracking. However, advanced deep learning-based ReID models are highly susceptible to adversarial attacks, where imperceptible perturbations to pedestrian images can cause entirely incorrect predictions, posing significant security threats. Although numerous adversarial defense strategies have been proposed for classification tasks, their extension to metric learning tasks such as person ReID remains relatively unexplored. Moreover, the few existing defenses for person ReID fail to address the inherent unique challenges of adversarially robust ReID. In this paper, we systematically distill the challenges of adversarial defense in person ReID into two key issues: model bias and composite generalization requirements. To address them, we propose a debiased dual-invariant defense framework composed of two main phases. In the data balancing phase, we mitigate model bias using a diffusion-model-based data resampling strategy that promotes fairness and diversity in training data. In the bi-adversarial self-meta defense phase, we introduce a novel metric adversarial training approach incorporating farthest negative extension softening to overcome the robustness degradation caused by the absence of a classifier. Additionally, we introduce an adversarially-enhanced self-meta mechanism to achieve dual-generalization for both unseen identities and unseen attack types. Experiments demonstrate that our method significantly outperforms existing state-of-the-art defenses.

[95] AdaptViG: Adaptive Vision GNN with Exponential Decay Gating

Mustafa Munir,Md Mostafijur Rahman,Radu Marculescu

Main category: cs.CV

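TL;DR: Proposes AdaptViG, an efficient hybrid vision graph neural network that uses adaptive graph convolution and an exponential decay gating mechanism to cut computational cost substantially while keeping high accuracy.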

Details Motivation: ViGs are a promising vision architecture, but the high computational cost of their graph construction phase limits efficiency, so a more efficient graph construction mechanism is needed. Method: AdaptViG combines a static axial scaffold with a dynamic, content-aware exponential decay gating strategy, using the efficient gating in the early stages and a global attention block in the final stage to model long-range dependencies efficiently. Result: AdaptViG-M reaches 82.6% top-1 accuracy with 80% fewer parameters and 84% fewer GMACs than ViG-B, and also surpasses larger models on downstream tasks. Conclusion: AdaptViG sets a new state-of-the-art trade-off between accuracy and efficiency among Vision GNNs, clearly ahead of existing Vision GNNs and larger baseline models. Abstract: Vision Graph Neural Networks (ViGs) offer a new direction for advancements in vision architectures. While powerful, ViGs often face substantial computational challenges stemming from their graph construction phase, which can hinder their efficiency. To address this issue we propose AdaptViG, an efficient and powerful hybrid Vision GNN that introduces a novel graph construction mechanism called Adaptive Graph Convolution. This mechanism builds upon a highly efficient static axial scaffold and a dynamic, content-aware gating strategy called Exponential Decay Gating. This gating mechanism selectively weighs long-range connections based on feature similarity. Furthermore, AdaptViG employs a hybrid strategy, utilizing our efficient gating mechanism in the early stages and a full Global Attention block in the final stage for maximum feature aggregation. Our method achieves a new state-of-the-art trade-off between accuracy and efficiency among Vision GNNs. For instance, our AdaptViG-M achieves 82.6% top-1 accuracy, outperforming ViG-B by 0.3% while using 80% fewer parameters and 84% fewer GMACs. On downstream tasks, AdaptViG-M obtains 45.8 mIoU, 44.8 APbox, and 41.1 APmask, surpassing the much larger EfficientFormer-L7 by 0.7 mIoU, 2.2 APbox, and 2.1 APmask, respectively, with 78% fewer parameters.
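
One way to picture a static axial scaffold with similarity-based gating is sketched below. The abstract does not give the gating formula, so the gate `exp(-L1 distance / temperature)`, the fixed shifts (2, 4), the wrap-around neighbors from `np.roll`, and the residual aggregation rule are all assumptions chosen to keep the example short.

```python
import numpy as np

def exp_decay_gate(x, shift, axis, temperature=1.0):
    """Gate in (0, 1] between each patch and the patch `shift` steps away along `axis`."""
    neighbor = np.roll(x, shift, axis=axis)                     # fixed axial neighbor (wraps around)
    dist = np.abs(x - neighbor).mean(axis=-1, keepdims=True)   # mean L1 feature distance per patch
    return np.exp(-dist / temperature), neighbor

def gated_axial_aggregate(x, shifts=(2, 4), temperature=1.0):
    """Add gated differences from fixed row/column neighbors to each patch feature."""
    out = x.copy()
    for axis in (0, 1):                                         # rows, then columns
        for s in shifts:
            gate, neighbor = exp_decay_gate(x, s, axis, temperature)
            out = out + gate * (neighbor - x)
    return out

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 8, 16))     # hypothetical 8x8 grid of 16-dim patch features
out = gated_axial_aggregate(x)
print(out.shape)
```

The neighbor pattern is fixed in advance, so no nearest-neighbor search is needed; only the gate values depend on content, and dissimilar neighbors are exponentially down-weighted.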

[96] TSPE-GS: Probabilistic Depth Extraction for Semi-Transparent Surface Reconstruction via 3D Gaussian Splatting

Zhiyuan Xu,Nan Min,Yuhang Guo,Tong Wei

Main category: cs.CV

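TL;DR: Proposes TSPE-GS, which models a pixel-wise multi-modal distribution of opacity and depth to resolve the depth ambiguity that arises when reconstructing semi-transparent surfaces with 3D Gaussian Splatting.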

Details Motivation: Existing methods assume a single depth per pixel and cannot handle cases where multiple surfaces are visible (such as semi-transparent objects), which leads to poor reconstruction. Method: Transmittance is sampled uniformly to model a pixel-wise multi-peak distribution of opacity and depth, and truncated signed distance functions are fused to reconstruct external and internal surfaces separately within a unified framework. Result: Experiments on public and self-collected semi-transparent and opaque datasets show that TSPE-GS significantly improves semi-transparent geometry reconstruction while maintaining good performance on opaque scenes. Conclusion: TSPE-GS resolves the cross-surface depth ambiguity caused by the single-depth assumption, generalizes to other Gaussian-based reconstruction pipelines, and adds no extra training overhead. Abstract: 3D Gaussian Splatting offers a strong speed-quality trade-off but struggles to reconstruct semi-transparent surfaces because most methods assume a single depth per pixel, which fails when multiple surfaces are visible. We propose TSPE-GS (Transparent Surface Probabilistic Extraction for Gaussian Splatting), which uniformly samples transmittance to model a pixel-wise multi-modal distribution of opacity and depth, replacing the prior single-peak assumption and resolving cross-surface depth ambiguity. By progressively fusing truncated signed distance functions, TSPE-GS reconstructs external and internal surfaces separately within a unified framework. The method generalizes to other Gaussian-based reconstruction pipelines without extra training overhead. Extensive experiments on public and self-collected semi-transparent and opaque datasets show TSPE-GS significantly improves semi-transparent geometry reconstruction while maintaining performance on opaque scenes.
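
The single-depth failure mode, and why sampling along transmittance can expose more than one surface, can be illustrated on a single toy ray. This is one possible reading of "uniformly samples transmittance", not the paper's algorithm; the two-surface ray, its opacities, and the function name are made up for the example.

```python
import numpy as np

def depths_at_transmittance_levels(depths, alphas, num_samples=8):
    """Depth at which accumulated transmittance first drops below each uniformly spaced level."""
    trans_after = np.cumprod(1.0 - alphas)                    # transmittance after each surface
    levels = (np.arange(num_samples) + 0.5) / num_samples     # uniform levels in (0, 1)
    out = []
    for level in levels:
        hit = np.nonzero(trans_after < level)[0]
        if hit.size:                                          # skip levels the ray never reaches
            out.append(depths[hit[0]])
    return np.array(out)

# Toy ray: a semi-transparent front surface at depth 1.0 and a nearly opaque back surface at depth 3.0.
depths = np.array([1.0, 3.0])
alphas = np.array([0.5, 0.99])

# Standard alpha-blended depth: a single value that lies on neither surface.
trans_before = np.concatenate([[1.0], np.cumprod(1.0 - alphas)[:-1]])
blended_depth = float(np.sum(trans_before * alphas * depths))

samples = depths_at_transmittance_levels(depths, alphas)
print(blended_depth, samples)
```

The blended depth falls between the two surfaces, whereas the sampled depths cluster at both 1.0 and 3.0, giving a two-peak distribution from which each surface could be extracted separately.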

[97] Beyond Cosine Similarity Magnitude-Aware CLIP for No-Reference Image Quality Assessment

Zhicheng Liao,Dongxu Wu,Zhenshan Shi,Sijie Mai,Hanwei Zhu,Lingyu Zhu,Yuncheng Jiang,Baoliang Chen

Main category: cs.CV

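TL;DR: Proposes a new adaptive fusion framework that combines the cosine similarity of CLIP image features with their magnitude to improve no-reference image quality assessment.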

Details Motivation: Existing CLIP-based no-reference image quality assessment methods use only semantic similarity (such as cosine similarity) and overlook the strong correlation between image feature magnitude and perceptual quality. Method: Absolute CLIP image features are extracted and statistically normalized with a Box-Cox transformation to reduce semantic sensitivity, and a confidence-guided fusion strategy adaptively combines cosine similarity with the normalized magnitude cue. Result: Experiments on several standard IQA datasets show that the method consistently outperforms standard CLIP-based methods and current state-of-the-art baselines without any task-specific training. Conclusion: Adding a magnitude-aware quality cue and fusing it with semantic similarity effectively improves CLIP's performance on no-reference image quality assessment. Abstract: Recent efforts have repurposed the Contrastive Language-Image Pre-training (CLIP) model for No-Reference Image Quality Assessment (NR-IQA) by measuring the cosine similarity between the image embedding and textual prompts such as "a good photo" or "a bad photo." However, this semantic similarity overlooks a critical yet underexplored cue: the magnitude of the CLIP image features, which we empirically find to exhibit a strong correlation with perceptual quality. In this work, we introduce a novel adaptive fusion framework that complements cosine similarity with a magnitude-aware quality cue. Specifically, we first extract the absolute CLIP image features and apply a Box-Cox transformation to statistically normalize the feature distribution and mitigate semantic sensitivity. The resulting scalar summary serves as a semantically-normalized auxiliary cue that complements cosine-based prompt matching. To integrate both cues effectively, we further design a confidence-guided fusion scheme that adaptively weighs each term according to its relative strength. Extensive experiments on multiple benchmark IQA datasets demonstrate that our method consistently outperforms standard CLIP-based IQA and state-of-the-art baselines, without any task-specific training.
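
The Box-Cox step is a standard transform and can be written directly; how the paper summarizes the transformed features and fuses the two cues is not spelled out in the abstract. In the sketch below, the mean-based summary, the exponent `lam=0.5`, and the softmax-over-strength fusion in `fuse` are assumptions for illustration, and random numbers stand in for real CLIP features.

```python
import numpy as np

def box_cox(x, lam):
    """Box-Cox transform for positive x: (x^lam - 1) / lam, or log(x) when lam == 0."""
    x = np.asarray(x, dtype=np.float64)
    if lam == 0:
        return np.log(x)
    return (x ** lam - 1.0) / lam

def magnitude_cue(image_features, lam=0.5):
    """Assumed scalar summary: mean of the Box-Cox-transformed absolute feature values."""
    return float(box_cox(np.abs(image_features) + 1e-8, lam).mean())

def fuse(cos_score, mag_score):
    """Hypothetical confidence-guided fusion: weight each cue by a softmax over its absolute strength."""
    scores = np.array([cos_score, mag_score], dtype=np.float64)
    w = np.exp(np.abs(scores))
    w /= w.sum()
    return float(w @ scores)

rng = np.random.default_rng(0)
features = rng.standard_normal(512)      # stand-in for a CLIP image embedding
cue = magnitude_cue(features)
quality = fuse(cos_score=0.3, mag_score=cue)
print(cue, quality)
```

Any real use would first bring the two cues onto comparable scales; the fusion above simply lets the stronger cue dominate, in the spirit of the "relative strength" weighting the abstract mentions.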

[98] Robust Object Detection with Pseudo Labels from VLMs using Per-Object Co-teaching

Uday Bhaskar,Rishabh Bhattacharya,Avinash Patel,Sarthak Khoche,Praveen Anil Kulkarni,Naresh Manwani

Main category: cs.CV

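TL;DR: Proposes a method that uses vision-language models (VLMs) to generate pseudo-labels and a per-object co-teaching strategy to train efficient real-time object detectors, greatly reducing reliance on manual annotation and substantially improving mAP@0.5 on KITTI and other datasets.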

Details Motivation: VLMs show promise for zero-shot object detection, but their high latency and hallucinated predictions make them unsuitable for direct use in real-time settings such as autonomous driving, where manual labelling is also prohibitively expensive. Method: A new pipeline uses a VLM to generate pseudo-labels and trains with per-object co-teaching: two YOLO models learn collaboratively and filter noisy bounding boxes based on each other's per-object loss, which improves the effective quality of the pseudo-labels and yields an efficient real-time detector. Result: On KITTI, mAP@0.5 rises from 31.12% to 46.61% relative to a baseline YOLOv5m, and reaches 57.97% when 10% ground-truth labels are added. Similar gains appear on ACDC and BDD100k, with real-time detection speed maintained. Conclusion: The method offers an efficient, robust, and scalable way to train high-performance detectors from VLM-generated pseudo-labels, greatly reducing dependence on costly human annotation for applications such as autonomous driving. Abstract: Foundation models, especially vision-language models (VLMs), offer compelling zero-shot object detection for applications like autonomous driving, a domain where manual labelling is prohibitively expensive. However, their detection latency and tendency to hallucinate predictions render them unsuitable for direct deployment. This work introduces a novel pipeline that addresses this challenge by leveraging VLMs to automatically generate pseudo-labels for training efficient, real-time object detectors. Our key innovation is a per-object co-teaching-based training strategy that mitigates the inherent noise in VLM-generated labels. The proposed per-object co-teaching approach filters noisy bounding boxes from training instead of filtering the entire image. Specifically, two YOLO models learn collaboratively, filtering out unreliable boxes from each mini-batch based on their peers' per-object loss values. Overall, our pipeline provides an efficient, robust, and scalable approach to train high-performance object detectors for autonomous driving, significantly reducing reliance on costly human annotation. Experimental results on the KITTI dataset demonstrate that our method outperforms a baseline YOLOv5m model, achieving a significant mAP@0.5 boost ($31.12\%$ to $46.61\%$) while maintaining real-time detection latency. Furthermore, we show that supplementing our pseudo-labelled data with a small fraction of ground truth labels ($10\%$) leads to further performance gains, reaching $57.97\%$ mAP@0.5 on the KITTI dataset. We observe similar performance improvements for the ACDC and BDD100k datasets.
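
The per-object filtering step can be sketched with the usual small-loss selection rule from co-teaching: each model keeps the boxes on which its peer has the lowest loss. The loss values, the keep ratio of 2/3, and the function name are invented for the example; the abstract does not state how the threshold is chosen.

```python
import numpy as np

def keep_small_loss(per_object_loss, keep_ratio):
    """Indices of the boxes with the smallest loss; the rest are treated as likely label noise."""
    n_keep = max(1, int(round(keep_ratio * len(per_object_loss))))
    return np.sort(np.argsort(per_object_loss)[:n_keep])

# Hypothetical per-box losses of two detectors on the same 6 pseudo-labelled boxes in a mini-batch.
loss_a = np.array([0.2, 0.3, 2.5, 0.1, 0.4, 3.0])
loss_b = np.array([0.3, 0.2, 2.8, 0.2, 2.9, 0.5])

boxes_for_b = keep_small_loss(loss_a, keep_ratio=2 / 3)   # model A selects what model B trains on
boxes_for_a = keep_small_loss(loss_b, keep_ratio=2 / 3)   # model B selects what model A trains on
print(boxes_for_b, boxes_for_a)
```

Filtering per box rather than per image means an image with one hallucinated box still contributes its clean boxes, and exchanging selections between the two models keeps each one from simply reinforcing its own mistakes.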

[99] Equivariant Sampling for Improving Diffusion Model-based Image Restoration

Chenxu Wu,Qingpeng Kong,Peiang Zhao,Wendi Yang,Wenxin Ma,Fenghe Tang,Zihang Jiang,S. Kevin Zhou

Main category: cs.CV

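TL;DR: This paper proposes EquS, a new diffusion-model-based image restoration method that introduces equivariant information through dual sampling trajectories and adds a Timestep-Aware Schedule (TAS), significantly improving existing methods without extra computational cost.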

Details Motivation: Existing problem-agnostic diffusion-model-based image restoration methods fail to fully exploit diffusion priors, which limits performance; this paper analyzes the shortcomings of their sampling process and proposes improvements. Method: EquS imposes equivariant information through dual sampling trajectories; a Timestep-Aware Schedule (TAS) then prioritizes deterministic steps to improve sampling efficiency and restoration quality. Result: Experiments on several benchmark datasets show that the method is compatible with existing DMIR methods and significantly improves their performance with no additional computation. Conclusion: EquS and EquS+ strengthen diffusion models for image restoration and provide a better sampling strategy for problem-agnostic DMIR. Abstract: Recent advances in generative models, especially diffusion models, have significantly improved image restoration (IR) performance. However, existing problem-agnostic diffusion model-based image restoration (DMIR) methods face challenges in fully leveraging diffusion priors, resulting in suboptimal performance. In this paper, we address the limitations of current problem-agnostic DMIR methods by analyzing their sampling process and providing effective solutions. We introduce EquS, a DMIR method that imposes equivariant information through dual sampling trajectories. To further boost EquS, we propose the Timestep-Aware Schedule (TAS) and introduce EquS$^+$. TAS prioritizes deterministic steps to enhance certainty and sampling efficiency. Extensive experiments on benchmarks demonstrate that our method is compatible with previous problem-agnostic DMIR methods and significantly boosts their performance without increasing computational costs. Our code is available at https://github.com/FouierL/EquS.

[100] Difference Vector Equalization for Robust Fine-tuning of Vision-Language Models

Satoshi Suzuki,Shin'ya Yamaguchi,Shoichiro Takeda,Taiga Yamane,Naoki Makishima,Naotaka Kawata,Mana Ihori,Tomohiro Tanaka,Shota Orihashi,Ryo Masumura

Main category: cs.CV

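TL;DR: This paper proposes Difference Vector Equalization (DiVE), a method for robustly fine-tuning pre-trained vision-language models (such as CLIP) while preserving their geometric structure, achieving strong performance in in-distribution, out-of-distribution, and zero-shot settings.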

Details Motivation: Existing fine-tuning methods distort the geometric structure of the embedding space during fine-tuning, which limits generalization on out-of-distribution and zero-shot tasks, so a fine-tuning method that preserves this structure is needed. Method: DiVE preserves the geometric structure by constraining the difference vectors between the embeddings of the pre-trained and fine-tuned models to be consistent. It introduces two losses: the average vector loss (AVL) globally constrains difference vectors to equal their weighted average, and the pairwise vector loss (PVL) locally maintains multimodal alignment. Result: Experiments show that DiVE effectively preserves the geometric structure of the embedding space and outperforms existing methods on in-distribution, out-of-distribution, and zero-shot classification tasks. Conclusion: By preserving the geometric structure of the vision-language embedding space, DiVE achieves robust fine-tuning, improving in-distribution performance without sacrificing zero-shot and out-of-distribution performance. Abstract: Contrastive pre-trained vision-language models, such as CLIP, demonstrate strong generalization abilities in zero-shot classification by leveraging embeddings extracted from image and text encoders. This paper aims to robustly fine-tune these vision-language models on in-distribution (ID) data without compromising their generalization abilities in out-of-distribution (OOD) and zero-shot settings. Current robust fine-tuning methods tackle this challenge by reusing contrastive learning, which was used in pre-training, for fine-tuning. However, we found that these methods distort the geometric structure of the embeddings, which plays a crucial role in the generalization of vision-language models, resulting in limited OOD and zero-shot performance. To address this, we propose Difference Vector Equalization (DiVE), which preserves the geometric structure during fine-tuning. The idea behind DiVE is to constrain difference vectors, each of which is obtained by subtracting the embeddings extracted from the pre-trained and fine-tuning models for the same data sample. By constraining the difference vectors to be equal across various data samples, we effectively preserve the geometric structure. Therefore, we introduce two losses: average vector loss (AVL) and pairwise vector loss (PVL). AVL preserves the geometric structure globally by constraining difference vectors to be equal to their weighted average. PVL preserves the geometric structure locally by ensuring a consistent multimodal alignment. Our experiments demonstrate that DiVE effectively preserves the geometric structure, achieving strong results across ID, OOD, and zero-shot metrics.
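
The idea behind the average vector loss can be sketched numerically. The abstract says only that difference vectors are constrained to equal their weighted average; the squared-error form, the uniform default weights, and the function names below are illustrative assumptions rather than the paper's exact formulation.

```python
from typing import List, Optional, Sequence

Vector = List[float]


def difference_vectors(pre: Sequence[Vector], fine: Sequence[Vector]) -> List[Vector]:
    """Per-sample difference between fine-tuned and pre-trained embeddings."""
    return [[f - p for p, f in zip(pv, fv)] for pv, fv in zip(pre, fine)]


def average_vector_loss(diffs: Sequence[Vector], weights: Optional[Sequence[float]] = None) -> float:
    """Mean squared distance of each difference vector to the weighted average (assumed form)."""
    n = len(diffs)
    if weights is None:
        weights = [1.0 / n] * n  # uniform weighting is an assumption
    dim = len(diffs[0])
    avg = [sum(w * d[k] for w, d in zip(weights, diffs)) for k in range(dim)]
    return sum(sum((d[k] - avg[k]) ** 2 for k in range(dim)) for d in diffs) / n


pre = [[1.0, 0.0], [0.0, 1.0]]
shifted = [[1.5, 0.5], [0.5, 1.5]]    # every sample moved by the same offset
distorted = [[1.5, 0.0], [0.0, 2.0]]  # samples moved in different directions

loss_shifted = average_vector_loss(difference_vectors(pre, shifted))
loss_distorted = average_vector_loss(difference_vectors(pre, distorted))
print(loss_shifted, loss_distorted)
```

A pure translation of the embedding space leaves all pairwise distances unchanged and gives zero loss, while sample-dependent shifts are penalized, which matches the stated goal of preserving geometric structure.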

[101] STELLAR: Scene Text Editor for Low-Resource Languages and Real-World Data

Yongdeuk Seo,Hyun-seok Min,Sungchul Choi

Main category: cs.CV

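TL;DR: This paper proposes STELLAR, a scene text editing method for low-resource languages and real-world data. With a language-adaptive glyph encoder and a multi-stage training strategy that pre-trains on synthetic data and fine-tunes on real images, it improves the reliability of multilingual text editing. The authors also construct a new dataset, STIPLAR, and propose the Text Appearance Similarity (TAS) metric to assess the consistency of font, color, and background. Experiments show that STELLAR outperforms existing methods in visual consistency and recognition accuracy.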

Details Motivation: Existing diffusion-based scene text editing methods are limited by weak support for low-resource languages, the domain gap between synthetic and real data, and the lack of effective metrics for evaluating text style preservation. Method: The STELLAR framework uses a language-adaptive glyph encoder and a multi-stage training strategy (pre-training on synthetic data, then fine-tuning on real images). A new dataset, STIPLAR, is built for training and evaluation, and Text Appearance Similarity (TAS) is proposed as a new metric for text style preservation. Result: Experiments show that STELLAR outperforms state-of-the-art models in visual consistency and recognition accuracy, with an average TAS improvement of 2.2% across languages. Conclusion: STELLAR effectively addresses low-resource language support, the domain gap, and style evaluation, significantly improving the performance and reliability of scene text editing in real-world applications. Abstract: Scene Text Editing (STE) is the task of modifying text content in an image while preserving its visual style, such as font, color, and background. While recent diffusion-based approaches have shown improvements in visual quality, key limitations remain: lack of support for low-resource languages, domain gap between synthetic and real data, and the absence of appropriate metrics for evaluating text style preservation. To address these challenges, we propose STELLAR (Scene Text Editor for Low-resource LAnguages and Real-world data). STELLAR enables reliable multilingual editing through a language-adaptive glyph encoder and a multi-stage training strategy that first pre-trains on synthetic data and then fine-tunes on real images. We also construct a new dataset, STIPLAR (Scene Text Image Pairs of Low-resource lAnguages and Real-world data), for training and evaluation. Furthermore, we propose Text Appearance Similarity (TAS), a novel metric that assesses style preservation by independently measuring font, color, and background similarity, enabling robust evaluation even without ground truth. Experimental results demonstrate that STELLAR outperforms state-of-the-art models in visual consistency and recognition accuracy, achieving an average TAS improvement of 2.2% across languages over the baselines.

[102] MOBA: A Material-Oriented Backdoor Attack against LiDAR-based 3D Object Detection Systems

Saket S. Chaturvedi,Gaurav Bagwe,Lan Zhang,Pan He,Xiaoyong Yuan

Main category: cs.CV

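TL;DR: This paper proposes MOBA, a material-oriented backdoor attack framework that bridges the gap between the digital and physical domains by modeling the material properties of real-world triggers, achieving an attack success rate of up to 93.50% against LiDAR 3D object detection systems and significantly outperforming existing methods.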

Details Motivation: Existing LiDAR backdoor attacks lack physical realizability: digital triggers ignore material-dependent LiDAR reflection properties, while physical triggers are often unoptimized, making them ineffective or easy to detect. Method: The MOBA framework (1) systematically selects a material with high diffuse reflectivity and environmental robustness (such as TiO$_2$), and (2) builds a simulation pipeline, comprising an angle-independent approximation of the Oren-Nayar BRDF model and a distance-aware scaling mechanism, so that the digital trigger accurately mimics physical behavior. Result: Experiments on state-of-the-art LiDAR and multimodal fusion models show that MOBA reaches a 93.50% attack success rate, exceeding prior methods by more than 41%. Conclusion: MOBA reveals a new class of physically realizable threats based on material properties and underscores that defenses must account for material-level characteristics in real environments. Abstract: LiDAR-based 3D object detection is widely used in safety-critical systems. However, these systems remain vulnerable to backdoor attacks that embed hidden malicious behaviors during training. A key limitation of existing backdoor attacks is their lack of physical realizability, primarily due to the digital-to-physical domain gap. Digital triggers often fail in real-world settings because they overlook material-dependent LiDAR reflection properties. On the other hand, physically constructed triggers are often unoptimized, leading to low effectiveness or easy detectability. This paper introduces Material-Oriented Backdoor Attack (MOBA), a novel framework that bridges the digital-physical gap by explicitly modeling the material properties of real-world triggers. MOBA tackles two key challenges in physical backdoor design: 1) robustness of the trigger material under diverse environmental conditions, 2) alignment between the physical trigger's behavior and its digital simulation. First, we propose a systematic approach to selecting robust trigger materials, identifying titanium dioxide (TiO$_2$) for its high diffuse reflectivity and environmental resilience. Second, to ensure the digital trigger accurately mimics the physical behavior of the material-based trigger, we develop a novel simulation pipeline that features: (1) an angle-independent approximation of the Oren-Nayar BRDF model to generate realistic LiDAR intensities, and (2) a distance-aware scaling mechanism to maintain spatial consistency across varying depths.
We conduct extensive experiments on state-of-the-art LiDAR-based and Camera-LiDAR fusion models, showing that MOBA achieves a 93.50% attack success rate, outperforming prior methods by over 41%. Our work reveals a new class of physically realizable threats and underscores the urgent need for defenses that account for material-level properties in real-world environments.

[103] DBGroup: Dual-Branch Point Grouping for Weakly Supervised 3D Instance Segmentation

Xuexun Liu,Xiaoxu Xu,Qiudan Zhang,Lin Ma,Xu Wang

Main category: cs.CV

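TL;DR: Proposes DBGroup, a two-stage weakly supervised 3D instance segmentation framework based on scene-level annotations, which achieves efficient and accurate 3D instance segmentation through a Dual-Branch Point Grouping module and pseudo-label refinement strategies.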

Details Motivation: Existing weakly supervised 3D instance segmentation methods rely on click or bounding-box annotations, which still incur high annotation cost and a laborious process, so a more efficient and scalable form of weak supervision is needed. Method: A two-stage framework is designed. The first stage extracts semantic and mask cues from multi-view images, generates pseudo-labels with a Dual-Branch Point Grouping module, and refines them with Granularity-Aware Instance Merging and Semantic Selection and Propagation. The second stage performs multi-round self-training with the refined pseudo-labels and introduces an Instance Mask Filter strategy to mitigate pseudo-label inconsistencies. Result: Experiments show that, using only scene-level annotations, DBGroup outperforms existing scene-level supervised semantic segmentation methods and is competitive with sparse-point-level supervised instance segmentation methods. Conclusion: DBGroup demonstrates the feasibility of scene-level annotations for 3D instance segmentation, substantially lowering annotation cost while maintaining good performance and advancing weakly supervised 3D scene understanding. Abstract: Weakly supervised 3D instance segmentation is essential for 3D scene understanding, especially given the growing scale of data and the high annotation costs associated with fully supervised approaches. Existing methods primarily rely on two forms of weak supervision: one-thing-one-click annotations and bounding box annotations, both of which aim to reduce labeling efforts. However, these approaches still encounter limitations, including labor-intensive annotation processes, high complexity, and reliance on expert annotators. To address these challenges, we propose DBGroup, a two-stage weakly supervised 3D instance segmentation framework that leverages scene-level annotations as a more efficient and scalable alternative. In the first stage, we introduce a Dual-Branch Point Grouping module to generate pseudo labels guided by semantic and mask cues extracted from multi-view images. To further improve label quality, we develop two refinement strategies: Granularity-Aware Instance Merging and Semantic Selection and Propagation. The second stage involves multi-round self-training on an end-to-end instance segmentation network using the refined pseudo-labels. Additionally, we introduce an Instance Mask Filter strategy to address inconsistencies within the pseudo labels. Extensive experiments demonstrate that DBGroup achieves competitive performance compared to sparse-point-level supervised 3D instance segmentation methods, while surpassing state-of-the-art scene-level supervised 3D semantic segmentation approaches. Code is available at https://github.com/liuxuexun/DBGroup.

[104] LampQ: Towards Accurate Layer-wise Mixed Precision Quantization for Vision Transformers

Minjun Kim,Jaeri Lee,Jongjin Kim,Jeongin Yun,Yongmo Kwon,U Kang

Main category: cs.CV

TL;DR: Proposes LampQ, a layer-wise mixed-precision quantization method for quantizing Vision Transformer models efficiently and accurately.

Details Motivation: Existing quantization methods apply a uniform precision to all Vision Transformer components and ignore their differing sensitivity to quantization, while existing mixed-precision quantization methods suffer from coarse granularity, mismatched metric scales, and quantization-unaware bit allocation. Method: LampQ quantizes layer-wise for fine-grained control and efficient acceleration, introduces a type-aware Fisher-based metric to measure component sensitivity, optimizes bit-width allocation with integer linear programming, and then updates the allocation iteratively. Result: LampQ achieves state-of-the-art quantization performance across tasks including image classification, object detection, and zero-shot quantization. Conclusion: LampQ effectively addresses the limitations of existing ViT quantization methods, compressing models substantially while maintaining high accuracy across a range of vision tasks. Abstract: How can we accurately quantize a pre-trained Vision Transformer model? Quantization algorithms compress Vision Transformers (ViTs) into low-bit formats, reducing memory and computation demands with minimal accuracy degradation. However, existing methods rely on uniform precision, ignoring the diverse sensitivity of ViT components to quantization. Metric-based Mixed Precision Quantization (MPQ) is a promising alternative, but previous MPQ methods for ViTs suffer from three major limitations: 1) coarse granularity, 2) mismatch in metric scale across component types, and 3) quantization-unaware bit allocation. In this paper, we propose LampQ (Layer-wise Mixed Precision Quantization for Vision Transformers), an accurate metric-based MPQ method for ViTs to overcome these limitations. LampQ performs layer-wise quantization to achieve both fine-grained control and efficient acceleration, incorporating a type-aware Fisher-based metric to measure sensitivity. Then, LampQ assigns bit-widths optimally through integer linear programming and further updates them iteratively. Extensive experiments show that LampQ provides state-of-the-art performance in quantizing ViTs pre-trained on various tasks such as image classification, object detection, and zero-shot quantization.
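
The bit-allocation step is a small constrained optimization: choose one bit-width per layer to minimize total sensitivity under a model-size budget. The sketch below solves a toy instance by exhaustive search as a stand-in for an integer linear programming solver; the sensitivity table, layer sizes, budget, and function name are all made-up illustrations, and the paper's Fisher-based metric is not reproduced here.

```python
from itertools import product
from typing import Dict, List, Sequence, Tuple


def allocate_bits(
    sensitivity: Sequence[Dict[int, float]],  # sensitivity[layer][bits] -> cost
    layer_sizes: Sequence[int],               # number of weights per layer
    bit_choices: Sequence[int],
    budget_bits: int,
) -> Tuple[Tuple[int, ...], float]:
    """Brute-force the assignment with minimal total sensitivity within the budget."""
    best_cost, best_assignment = float("inf"), ()
    for assignment in product(bit_choices, repeat=len(layer_sizes)):
        total_bits = sum(size * bits for size, bits in zip(layer_sizes, assignment))
        if total_bits > budget_bits:
            continue  # violates the size constraint
        cost = sum(sensitivity[i][bits] for i, bits in enumerate(assignment))
        if cost < best_cost:
            best_cost, best_assignment = cost, assignment
    return best_assignment, best_cost


# Toy problem: three layers of 100 weights each, hypothetical sensitivities.
toy_sensitivity: List[Dict[int, float]] = [
    {2: 9.0, 4: 1.0, 8: 0.1},  # very sensitive layer
    {2: 0.5, 4: 0.2, 8: 0.1},  # robust layer
    {2: 4.0, 4: 0.8, 8: 0.2},
]
best_bits, best_cost = allocate_bits(toy_sensitivity, [100, 100, 100], (2, 4, 8), budget_bits=1400)
print(best_bits, best_cost)
```

In this toy case the search gives the sensitive first layer 8 bits and pushes the robust second layer down to 2 bits. Exhaustive search scales exponentially with the number of layers, which is why a real system would hand the same objective and constraint to an ILP solver.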

[105] MIRNet: Integrating Constrained Graph-Based Reasoning with Pre-training for Diagnostic Medical Imaging

Shufeng Kong,Zijie Wang,Nuan Cui,Hao Tang,Yihan Meng,Yuanyuan Wei,Feifan Chen,Yingheng Wang,Zhuo Cai,Yaonan Wang,Yulong Zhang,Yuzheng Li,Zibin Zheng,Caihua Liu

Main category: cs.CV

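TL;DR: Proposes MIRNet, a framework that combines self-supervised pre-training with constrained graph-based reasoning for medical image analysis, particularly tongue image diagnosis, and releases the large-scale TongueAtlas-4K dataset.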

Details Motivation: To address annotation scarcity, label imbalance, and clinical plausibility constraints in automated medical image interpretation, especially in tongue image diagnosis, a challenging domain that requires fine-grained visual and semantic understanding. Method: A self-supervised masked autoencoder (MAE) learns visual representations from unlabeled data; a graph attention network (GAT) models expert-defined label correlations; clinical prior constraints are introduced through KL divergence and regularization losses; and asymmetric loss (ASL) with boosting ensembles mitigates class imbalance. Result: The method achieves state-of-the-art performance on tongue diagnosis, and the framework generalizes to other diagnostic medical imaging tasks. The TongueAtlas-4K dataset, with 4,000 images and 22 diagnostic labels, is also released. Conclusion: MIRNet effectively integrates self-supervised learning with structured knowledge reasoning, improving diagnostic accuracy while handling annotation scarcity and clinical constraints, and offers a new direction for interpretable, reliable AI-assisted diagnosis. Abstract: Automated interpretation of medical images demands robust modeling of complex visual-semantic relationships while addressing annotation scarcity, label imbalance, and clinical plausibility constraints. We introduce MIRNet (Medical Image Reasoner Network), a novel framework that integrates self-supervised pre-training with constrained graph-based reasoning. Tongue image diagnosis is a particularly challenging domain that requires fine-grained visual and semantic understanding. Our approach leverages self-supervised masked autoencoder (MAE) to learn transferable visual representations from unlabeled data; employs graph attention networks (GAT) to model label correlations through expert-defined structured graphs; enforces clinical priors via constraint-aware optimization using KL divergence and regularization losses; and mitigates imbalance using asymmetric loss (ASL) and boosting ensembles. To address annotation scarcity, we also introduce TongueAtlas-4K, a comprehensive expert-curated benchmark comprising 4,000 images annotated with 22 diagnostic labels, representing the largest public dataset in tongue analysis. Validation shows our method achieves state-of-the-art performance. While optimized for tongue diagnosis, the framework readily generalizes to broader diagnostic medical imaging tasks.

[106] AffordBot: 3D Fine-grained Embodied Reasoning via Multimodal Large Language Models

Xinyi Wang,Xun Yang,Yanlong Xu,Yuchen Wu,Zhen Li,Na Zhao

Main category: cs.CV

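TL;DR: This paper introduces the task of fine-grained 3D embodied reasoning, which requires predicting the location, motion type, and motion axis of actionable elements in a 3D scene from an instruction. To solve it, the authors design AffordBot, a framework that combines multimodal large language models with chain-of-thought reasoning and achieves state-of-the-art performance through surround-view rendering and viewpoint selection.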

Details Motivation: Existing methods usually operate at the object level or handle fine-grained affordance reasoning in a disjointed way, lacking coherent, instruction-driven grounding and reasoning, and so fall short of the precise interaction understanding that human-agent collaboration in physical environments requires. Method: The AffordBot framework combines a multimodal large language model (MLLM) with a tailored chain-of-thought (CoT) reasoning pipeline. The 3D scene is first rendered into surround-view images and 3D element candidates are projected into these views; an active perception stage then lets the MLLM select the most informative viewpoint, followed by step-by-step reasoning to localize affordance elements and infer interaction motions. Result: On the SceneFun3D dataset, AffordBot achieves state-of-the-art performance using only 3D point cloud input and MLLMs, showing strong generalization and physically grounded reasoning. Conclusion: By combining MLLMs with a structured reasoning pipeline, AffordBot effectively realizes instruction-driven, fine-grained 3D affordance reasoning and offers a new approach to fine-grained interaction for embodied agents in complex environments. Abstract: Effective human-agent collaboration in physical environments requires understanding not only what to act upon, but also where the actionable elements are and how to interact with them. Existing approaches often operate at the object level or disjointedly handle fine-grained affordance reasoning, lacking coherent, instruction-driven grounding and reasoning. In this work, we introduce a new task: Fine-grained 3D Embodied Reasoning, which requires an agent to predict, for each referenced affordance element in a 3D scene, a structured triplet comprising its spatial location, motion type, and motion axis, based on a task instruction. To solve this task, we propose AffordBot, a novel framework that integrates Multimodal Large Language Models (MLLMs) with a tailored chain-of-thought (CoT) reasoning paradigm. To bridge the gap between 3D input and 2D-compatible MLLMs, we render surround-view images of the scene and project 3D element candidates into these views, forming a rich visual representation aligned with the scene geometry. Our CoT pipeline begins with an active perception stage, prompting the MLLM to select the most informative viewpoint based on the instruction, before proceeding with step-by-step reasoning to localize affordance elements and infer plausible interaction motions. Evaluated on the SceneFun3D dataset, AffordBot achieves state-of-the-art performance, demonstrating strong generalization and physically grounded reasoning with only 3D point cloud input and MLLMs.

[107] Anomagic: Crossmodal Prompt-driven Zero-shot Anomaly Generation

Yuxin Jiang,Wei Luo,Hui Zhang,Qiyu Chen,Haiming Yao,Weiming Shen,Yunkang Cao

Main category: cs.CV

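TL;DR: Proposes Anomagic, a zero-shot anomaly generation method that needs no anomaly samples. It generates semantically coherent anomalies through crossmodal prompt encoding and a contrastive refinement strategy, and is trained on AnomVerse, a new dataset of 12,987 triplets, significantly improving downstream anomaly detection.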

Details Motivation: Existing anomaly generation methods depend on anomaly samples or exemplars, which limits their generalization and range of application, so a new method is needed that generates diverse, semantically consistent anomalies without any anomalous input. Method: Anomagic uses crossmodal prompt encoding to fuse visual and textual cues and guide an inpainting-based anomaly generation pipeline, with a contrastive refinement strategy that keeps generated anomalies precisely aligned with their masks. The AnomVerse dataset is built for training, with anomaly-mask-caption triplets whose captions are generated automatically by multimodal large language models. Result: Experiments show that Anomagic trained on AnomVerse generates more realistic and more varied anomalies and outperforms prior methods on downstream anomaly detection; it can also generate anomalies for images of any normal category from user-defined prompts. Conclusion: Anomagic achieves zero-shot anomaly generation without anomaly samples and, together with AnomVerse, advances anomaly generation and detection, showing potential as a general-purpose foundation model for anomaly generation. Abstract: We propose Anomagic, a zero-shot anomaly generation method that produces semantically coherent anomalies without requiring any exemplar anomalies. By unifying both visual and textual cues through a crossmodal prompt encoding scheme, Anomagic leverages rich contextual information to steer an inpainting-based generation pipeline. A subsequent contrastive refinement strategy enforces precise alignment between synthesized anomalies and their masks, thereby bolstering downstream anomaly detection accuracy. To facilitate training, we introduce AnomVerse, a collection of 12,987 anomaly-mask-caption triplets assembled from 13 publicly available datasets, where captions are automatically generated by multimodal large language models using structured visual prompts and template-based textual hints. Extensive experiments demonstrate that Anomagic trained on AnomVerse can synthesize more realistic and varied anomalies than prior methods, yielding superior improvements in downstream anomaly detection. Furthermore, Anomagic can generate anomalies for any normal-category image using user-defined prompts, establishing a versatile foundation model for anomaly generation.

[108] DGFusion: Dual-guided Fusion for Robust Multi-Modal 3D Object Detection

Feiyang Jia,Caiyan Jia,Ailin Liu,Shaoqing Xu,Qiming Xia,Lin Liu,Lei Yang,Yan Gong,Ziying Song

Main category: cs.CV

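TL;DR: This paper proposes DGFusion, a multi-modal 3D object detection method based on a dual-guided paradigm, which uses a difficulty-aware instance matching mechanism to improve detection of hard samples such as distant, small, and occluded objects.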

Details Motivation: Existing single-guided multi-modal 3D detection methods do not adequately account for how the information density of hard samples differs between modalities, leading to poor detection of hard instances. Method: DGFusion adopts a dual-guided paradigm (Point-guide-Image and Image-guide-Point) and designs a Difficulty-aware Instance Pair Matcher (DIPM) that performs difficulty-based instance-level feature matching to produce easy and hard instance pairs, which the Dual-guided Modules use for effective multi-modal feature fusion. Result: On nuScenes, the method improves over the baseline by +1.0% mAP, +0.8% NDS, and +1.3% average recall, and shows stronger robustness across a range of hard scenarios. Conclusion: Through the dual-guided paradigm and difficulty-aware matching, DGFusion effectively improves hard-instance detection in multi-modal 3D detection, strengthening the safety of autonomous driving perception systems. Abstract: As a critical task in autonomous driving perception systems, 3D object detection is used to identify and track key objects, such as vehicles and pedestrians. However, detecting distant, small, or occluded objects (hard instances) remains a challenge, which directly compromises the safety of autonomous driving systems. We observe that existing multi-modal 3D object detection methods often follow a single-guided paradigm, failing to account for the differences in information density of hard instances between modalities. In this work, we propose DGFusion, based on the Dual-guided paradigm, which fully inherits the advantages of the Point-guide-Image paradigm and integrates the Image-guide-Point paradigm to address the limitations of the single paradigms. The core of DGFusion, the Difficulty-aware Instance Pair Matcher (DIPM), performs instance-level feature matching based on difficulty to generate easy and hard instance pairs, while the Dual-guided Modules exploit the advantages of both pair types to enable effective multi-modal feature fusion. Experimental results demonstrate that our DGFusion outperforms the baseline methods, with respective improvements of +1.0\% mAP, +0.8\% NDS, and +1.3\% average recall on nuScenes. Extensive experiments demonstrate consistent robustness gains for hard instance detection across ego-distance, size, visibility, and small-scale training scenarios.

[109] LoG3D: Ultra-High-Resolution 3D Shape Modeling via Local-to-Global Partitioning

Xinran Yang,Shuichang Lai,Jiangjing Lyu,Hongjie Li,Bowen Pan,Yuanqi Li,Jie Guo,Zhou Zhengkang,Yanwen Guo

Main category: cs.CV

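TL;DR: Proposes a 3D variational autoencoder (VAE) framework based on unsigned distance fields (UDFs) that uses a local-to-global (LoG) architecture for high-fidelity 3D content generation, supporting complex topologies and reaching ultra-high resolutions of $2048^3$.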

Details Motivation: Existing methods based on signed distance fields (SDFs) and point clouds suffer from costly preprocessing and surface discontinuities when handling non-manifold geometry, open surfaces, and internal structures, and struggle to preserve both geometric detail and complex topology. Method: A new 3D VAE framework uses an unsigned distance field (UDF) representation. Its local-to-global (LoG) architecture partitions the UDF into UBlocks, combines 3D convolutions for local detail with sparse transformers for global coherence, and applies a Pad-Average strategy to smooth subvolume boundaries. Result: The method reaches resolutions up to $2048^3$, well beyond previous 3D VAE methods, and achieves state-of-the-art reconstruction accuracy and generative quality, with smoother surfaces and greater geometric flexibility. Conclusion: The method effectively addresses the difficulty of preserving topology and detail in complex 3D shape modeling and advances high-resolution 3D content generation. Abstract: Generating high-fidelity 3D content remains a fundamental challenge due to the complexity of representing arbitrary topologies, such as open surfaces and intricate internal structures, while preserving geometric details. Prevailing methods based on signed distance fields (SDFs) are hampered by costly watertight preprocessing and struggle with non-manifold geometries, while point-cloud representations often suffer from sampling artifacts and surface discontinuities. To overcome these limitations, we propose a novel 3D variational autoencoder (VAE) framework built upon unsigned distance fields (UDFs), a more robust and computationally efficient representation that naturally handles complex and incomplete shapes. Our core innovation is a local-to-global (LoG) architecture that processes the UDF by partitioning it into uniform subvolumes, termed UBlocks. This architecture couples 3D convolutions for capturing local detail with sparse transformers for enforcing global coherence. A Pad-Average strategy further ensures smooth transitions at subvolume boundaries during reconstruction. This modular design enables seamless scaling to ultra-high resolutions up to $2048^3$, a regime previously unattainable for 3D VAEs. Experiments demonstrate state-of-the-art performance in both reconstruction accuracy and generative quality, yielding superior surface smoothness and geometric flexibility.
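
One plausible reading of the partition-and-merge idea is shown below in one dimension: split a field into uniform blocks with a small overlap (padding) and, on reconstruction, average the values wherever blocks overlap. The abstract does not define Pad-Average precisely, so this overlap-averaging interpretation, the 1D simplification, and the function names are all assumptions.

```python
from typing import List, Sequence, Tuple


def split_with_padding(values: Sequence[float], block: int, pad: int) -> List[Tuple[int, List[float]]]:
    """Cut a 1D field into uniform blocks, each extended by `pad` samples on both sides."""
    blocks = []
    for start in range(0, len(values), block):
        lo = max(0, start - pad)
        hi = min(len(values), start + block + pad)
        blocks.append((lo, list(values[lo:hi])))  # remember where the block begins
    return blocks


def merge_by_averaging(blocks: Sequence[Tuple[int, Sequence[float]]], length: int) -> List[float]:
    """Reassemble the field, averaging every position covered by more than one block."""
    total = [0.0] * length
    count = [0] * length
    for lo, chunk in blocks:
        for offset, value in enumerate(chunk):
            total[lo + offset] += value
            count[lo + offset] += 1
    return [t / c for t, c in zip(total, count)]


field = [float(i) for i in range(8)]
padded_blocks = split_with_padding(field, block=4, pad=1)
reconstructed = merge_by_averaging(padded_blocks, len(field))
print(reconstructed)
```

With unmodified blocks the reconstruction is exact; when each block is decoded independently and the overlapping predictions disagree slightly, averaging them is what would soften the seam at the block boundary.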

[110] FreDFT: Frequency Domain Fusion Transformer for Visible-Infrared Object Detection

Wencong Wu,Xiuwei Zhang,Hanlin Yin,Shun Dai,Hongxi Zhang,Yanning Zhang

Main category: cs.CV

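TL;DR: This paper proposes FreDFT, a frequency domain fusion transformer for visible-infrared object detection. It mines complementary information between modalities with frequency domain attention and a multi-scale frequency domain feed-forward layer, and designs a cross-modal global modeling module and a local feature enhancement module to reduce multimodal information imbalance and improve feature fusion, achieving superior performance on multiple public datasets.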

Details Motivation: In complex scenes, existing methods face an information imbalance between the visible and infrared modalities, and most are confined to spatial-domain transformers, overlooking the potential of mining complementary information in the frequency domain. Method: The FreDFT model comprises multimodal frequency domain attention (MFDA), a frequency domain feed-forward layer (FDFFL), a cross-modal global modeling module (CGMM), and a local feature enhancement module (LFEM), which together perform frequency-domain feature fusion and balance information across modalities. Result: Experiments on multiple public datasets show that FreDFT significantly outperforms existing state-of-the-art methods in detection performance. Conclusion: Frequency-domain transformers can effectively mine cross-modal complementary information, and combining them with global modeling and local enhancement improves visible-infrared object detection, offering a new direction for multimodal detection. Abstract: Visible-infrared object detection has gained considerable attention due to its detection performance in low light, fog, and rain conditions. However, visible and infrared modalities captured by different sensors suffer from an information imbalance problem in complex scenarios, which can cause inadequate cross-modal fusion, resulting in degraded detection performance. Furthermore, most existing methods use transformers in the spatial domain to capture complementary features, ignoring the advantages of developing frequency domain transformers to mine complementary information. To solve these weaknesses, we propose a frequency domain fusion transformer, called FreDFT, for visible-infrared object detection. The proposed approach employs a novel multimodal frequency domain attention (MFDA) to mine complementary information between modalities, and a frequency domain feed-forward layer (FDFFL) based on a mixed-scale frequency feature fusion strategy is designed to better enhance multimodal features. To eliminate the imbalance of multimodal information, a cross-modal global modeling module (CGMM) is constructed to perform pixel-wise inter-modal feature interaction in a spatial and channel manner. Moreover, a local feature enhancement module (LFEM) is developed to strengthen multimodal local feature representation and promote multimodal feature fusion by using various convolution layers and applying a channel shuffle. Extensive experimental results have verified that our proposed FreDFT achieves excellent performance on multiple public datasets compared with other state-of-the-art methods.
The code of our FreDFT is linked at https://github.com/WenCongWu/FreDFT.

[111] MuSc-V2: Zero-Shot Multimodal Industrial Anomaly Classification and Segmentation with Mutual Scoring of Unlabeled Samples

Xurui Li,Feng Xue,Yu Zhou

Main category: cs.CV

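TL;DR: This paper proposes MuSc-V2, a mutual scoring framework for zero-shot anomaly classification and segmentation. It exploits the contrast between normal samples, which are similar to one another in 2D appearance and 3D shape, and anomalies, which are isolated, and significantly improves performance through multimodal fusion and a mutual scoring mechanism.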

Details Motivation: Existing methods overlook a key property of industrial products: normal image patches are highly similar to one another in both 2D and 3D, whereas anomalies are isolated, and ignoring this limits detection performance. Method: The MuSc-V2 framework includes Iterative Point Grouping (IPG) to strengthen 3D representation, Similarity Neighborhood Aggregation with Multi-Degrees (SNAMD) to fuse 2D/3D multi-scale features, a Mutual Scoring Mechanism (MSM) for scoring within each modality, Cross-modal Anomaly Enhancement (CAE) to fuse 2D and 3D scores, and Re-scoring with Constrained Neighborhood (RsCon) to suppress false detections. Result: AP improves by +23.7% on MVTec 3D-AD and by +19.3% on Eyecandies, surpassing previous zero-shot methods and outperforming most few-shot methods. Conclusion: By explicitly modeling the distributional difference between normal and anomalous samples, MuSc-V2 achieves flexible, robust zero-shot anomaly detection that adapts to different product lines and data subsets. Abstract: Zero-shot anomaly classification (AC) and segmentation (AS) methods aim to identify and outline defects without using any labeled samples. In this paper, we reveal a key property that is overlooked by existing methods: normal image patches across industrial products typically find many other similar patches, not only in 2D appearance but also in 3D shapes, while anomalies remain diverse and isolated. To explicitly leverage this discriminative property, we propose a Mutual Scoring framework (MuSc-V2) for zero-shot AC/AS, which flexibly supports single 2D/3D or multimodality. Specifically, our method begins by improving 3D representation through Iterative Point Grouping (IPG), which reduces false positives from discontinuous surfaces. Then we use Similarity Neighborhood Aggregation with Multi-Degrees (SNAMD) to fuse 2D/3D neighborhood cues into more discriminative multi-scale patch features for mutual scoring. The core comprises a Mutual Scoring Mechanism (MSM) that lets samples within each modality assign scores to each other, and Cross-modal Anomaly Enhancement (CAE) that fuses 2D and 3D scores to recover modality-specific missing anomalies. Finally, Re-scoring with Constrained Neighborhood (RsCon) suppresses false classification based on similarity to more representative samples. Our framework flexibly works on both the full dataset and smaller subsets with consistently robust performance, ensuring seamless adaptability across diverse product lines.
With this framework, MuSc-V2 achieves significant performance improvements: a +23.7\% AP gain on the MVTec 3D-AD dataset and a +19.3\% boost on the Eyecandies dataset, surpassing previous zero-shot benchmarks and even outperforming most few-shot methods. The code will be available at https://github.com/HUST-SLOW/MuSc-V2.
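
The mutual scoring intuition, that a normal patch finds close matches in other unlabeled images while an anomalous patch does not, can be sketched in a few lines. This is a simplified illustration only: patches are plain 2D feature tuples, the score is the mean nearest-neighbor distance to every other image, and the function names are hypothetical. It omits IPG, SNAMD, CAE, and RsCon entirely.

```python
import math
from typing import List, Sequence, Tuple

Patch = Tuple[float, ...]


def patch_score(patch: Patch, other_images: Sequence[Sequence[Patch]]) -> float:
    """Mean distance from a patch to its nearest patch in each other image."""
    nearest = [min(math.dist(patch, q) for q in image) for image in other_images]
    return sum(nearest) / len(nearest)


def mutual_scores(images: Sequence[Sequence[Patch]]) -> List[List[float]]:
    """Score every patch of every image against all remaining unlabeled images."""
    scores = []
    for i, image in enumerate(images):
        others = [img for j, img in enumerate(images) if j != i]
        scores.append([patch_score(p, others) for p in image])
    return scores


# Three unlabeled "images" with two patch features each; the last patch is an outlier.
images = [
    [(0.0, 0.0), (1.0, 0.0)],
    [(0.0, 0.1), (1.0, 0.1)],
    [(0.0, 0.0), (5.0, 5.0)],
]
scores = mutual_scores(images)
print(scores)
```

The isolated patch `(5.0, 5.0)` receives by far the highest score because no other image contains anything similar, without any labels being used.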

[112] Image Aesthetic Reasoning via HCM-GRPO: Empowering Compact Model for Superior Performance

Zhiyuan Hu,Zheng Sun,Yi Wei,Long Yu

Main category: cs.CV

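TL;DR: This paper presents a complete solution to weak image aesthetic reasoning, comprising a large-scale image screening dataset and the HCM-GRPO method, to improve multimodal large models on image aesthetic tasks.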

Details Motivation: Because of the lack of data and the weak image aesthetic reasoning ability of multimodal large models, current image screening performance is poor, so a systematic solution is needed. Method: A large-scale dataset of 128k samples (about 640k images) covering four aesthetic evaluation aspects is built, with several annotation approaches used to obtain high-quality chain-of-thought data; the HCM-GRPO method is proposed, combining Hard Cases Mining with a Dynamic Proportional Accuracy reward. Result: Experiments show that leading closed-source MLLMs (such as GPT4o and Qwen-VL-Max) perform close to random guessing on this task, whereas a small model trained with HCM-GRPO significantly surpasses both open-source and closed-source large models. Conclusion: HCM-GRPO combined with a high-quality dataset effectively improves multimodal models on image aesthetic reasoning and offers an efficient, practical new paradigm for image screening. Abstract: The performance of image generation has been significantly improved in recent years. However, the study of image screening is rare and its performance with Multimodal Large Language Models (MLLMs) is unsatisfactory due to the lack of data and the weak image aesthetic reasoning ability in MLLMs. In this work, we propose a complete solution to address these problems in terms of data and methodology. For data, we collect a comprehensive image screening dataset with over 128k samples, about 640k images. Each sample consists of an original image and four generated images. The dataset evaluates the image aesthetic reasoning ability under four aspects: appearance deformation, physical shadow, placement layout, and extension rationality. Regarding data annotation, we investigate multiple approaches, including purely manual, fully automated, and answer-driven annotations, to acquire high-quality chains of thought (CoT) data in the most cost-effective manner. Methodologically, we introduce a Hard Cases Mining (HCM) strategy with a Dynamic Proportional Accuracy (DPA) reward into the Group Relative Policy Optimization (GRPO) framework, called HCM-GRPO. This enhanced method demonstrates superior image aesthetic reasoning capabilities compared to the original GRPO. Our experimental results reveal that even state-of-the-art closed-source MLLMs, such as GPT4o and Qwen-VL-Max, exhibit performance akin to random guessing in image aesthetic reasoning. In contrast, by leveraging the HCM-GRPO, we are able to surpass the scores of both large-scale open-source and leading closed-source models with a much smaller model.
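
For readers unfamiliar with GRPO, the group-relative advantage it builds on can be sketched as follows. This shows only the standard normalization of rewards within a group of sampled responses; the HCM strategy and the DPA reward are not specified in the abstract and are not reproduced. The function name, the `eps` constant, and the use of the population standard deviation are assumptions.

```python
from statistics import mean, pstdev
from typing import List, Sequence


def group_relative_advantages(rewards: Sequence[float], eps: float = 1e-6) -> List[float]:
    """Normalize each reward by the mean and standard deviation of its own group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; some implementations use the sample std
    return [(r - mu) / (sigma + eps) for r in rewards]


# Rewards for four sampled answers to the same prompt (1.0 = correct, 0.0 = wrong).
mixed_group = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
uniform_group = group_relative_advantages([1.0, 1.0, 1.0])
print(mixed_group, uniform_group)
```

When every response in a group earns the same reward, all advantages are zero and the prompt contributes no learning signal. That observation is one plausible motivation for focusing training on harder cases, though the abstract does not state how HCM selects them.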

[113] When Eyes and Ears Disagree: Can MLLMs Discern Audio-Visual Confusion?

Qilang Ye,Wei Zeng,Meng Liu,Jie Zhang,Yupeng Hu,Zitong Yu,Yu Zhou

Main category: cs.CV

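TL;DR: This paper introduces AV-ConfuseBench, a new benchmark for evaluating multimodal large language models (MLLMs) in audio-visual confusion scenarios, and proposes RL-CoMM, a reinforcement-learning-based collaborative multi-MLLM framework, to improve the accuracy of audio-visual reasoning.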

Details Motivation: Existing MLLMs are found to struggle to recognize non-existent sounds because their reasoning is visually dominated, so a new method is needed to reduce this audio-visual confusion. Method: A two-stage approach is proposed: first, a Large Audio Language Model (LALM) generates audio-only reasoning as a reference, and a Step-wise Reasoning Reward function is designed around it; second, Answer-centered Confidence Optimization reduces the uncertainty arising from heterogeneous reasoning differences. Result: Experiments show that RL-CoMM improves accuracy by 10 to 30% over the baseline model with limited training data. Conclusion: RL-CoMM effectively improves MLLM performance on audio-visual question answering and audio-visual hallucination tasks, addressing the audio recognition difficulties caused by visual dominance. Abstract: Can Multimodal Large Language Models (MLLMs) discern confused objects that are visually present but audio-absent? To study this, we introduce a new benchmark, AV-ConfuseBench, which simulates an "Audio-Visual Confusion" scene by modifying the corresponding sound of an object in the video, e.g., muting the sounding object and asking MLLMs "Is there a/an muted-object sound?". Experimental results reveal that MLLMs, such as Qwen2.5-Omni and Gemini 2.5, struggle to discriminate non-existent audio due to visually dominated reasoning. Motivated by this observation, we introduce RL-CoMM, a Reinforcement Learning-based Collaborative Multi-MLLM that is built upon the Qwen2.5-Omni foundation. RL-CoMM includes two stages: 1) To alleviate visually dominated ambiguities, we introduce an external model, a Large Audio Language Model (LALM), as the reference model to generate audio-only reasoning. Then, we design a Step-wise Reasoning Reward function that enables MLLMs to self-improve audio-visual reasoning with the audio-only reference. 2) To ensure an accurate answer prediction, we introduce Answer-centered Confidence Optimization to reduce the uncertainty of potential heterogeneous reasoning differences. Extensive experiments on audio-visual question answering and audio-visual hallucination show that RL-CoMM improves the accuracy by 10 to 30\% over the baseline model with limited training data. Follow: https://github.com/rikeilong/AVConfusion.

[114] Multivariate Gaussian Representation Learning for Medical Action Evaluation

Luming Yang,Haoxian Liu,Siqing Li,Alper Yilmaz

Main category: cs.CV

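TL;DR: This paper introduces the CPREval-6k medical action benchmark dataset and the GaussMedAct framework to improve the accuracy and robustness of fine-grained action analysis in medical settings.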

Details Motivation: Fine-grained action evaluation in medical vision is challenging because comprehensive datasets are lacking, precision requirements are strict, and the spatiotemporal dynamics of rapid actions are insufficiently modeled. Method: GaussMedAct is a multivariate Gaussian encoding framework that models joint motion through adaptive spatiotemporal representation learning; Hybrid Spatial Encoding with a dual-stream strategy exploits skeletal information. Result: On CPREval-6k the method reaches 92.1% Top-1 accuracy with real-time inference, improving on the ST-GCN baseline by 5.9% while using only 10% of the FLOPs; cross-dataset experiments confirm its robustness advantage. Conclusion: Through adaptive 3D Gaussian modeling and a hybrid encoding strategy, GaussMedAct significantly outperforms existing methods in accuracy, efficiency, and robustness, advancing medical action analysis. Abstract: Fine-grained action evaluation in medical vision faces unique challenges due to the unavailability of comprehensive datasets, stringent precision requirements, and insufficient spatiotemporal dynamic modeling of very rapid actions. To support development and evaluation, we introduce CPREval-6k, a multi-view, multi-label medical action benchmark containing 6,372 expert-annotated videos with 22 clinical labels. Using this dataset, we present GaussMedAct, a multivariate Gaussian encoding framework, to advance medical motion analysis through adaptive spatiotemporal representation learning. Multivariate Gaussian Representation projects the joint motions to a temporally scaled multi-dimensional space, and decomposes actions into adaptive 3D Gaussians that serve as tokens. These tokens preserve motion semantics through anisotropic covariance modeling while maintaining robustness to spatiotemporal noise. Hybrid Spatial Encoding, employing a Cartesian and Vector dual-stream strategy, effectively utilizes skeletal information in the form of joint and bone features. The proposed method achieves 92.1% Top-1 accuracy with real-time inference on the benchmark, outperforming the ST-GCN baseline by +5.9% accuracy with only 10% FLOPs. Cross-dataset experiments confirm the superiority of our method in robustness.

[115] Perceive, Act and Correct: Confidence Is Not Enough for Hyperspectral Classification

Muzhou Yang,Wuzhou Quan,Mingqiang Wei

Main category: cs.CV

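TL;DR: This paper proposes CABIN, a cognitive-aware semi-supervised learning framework that improves uncertainty modeling and pseudo-label quality in hyperspectral image classification through a closed loop of perception, action, and correction.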

Details Motivation: Existing hyperspectral image classifiers rely too heavily on high-confidence predictions and ignore uncertainty, which leads to confirmation bias and degrades further under sparse annotations or class imbalance. Method: CABIN estimates epistemic uncertainty to identify ambiguous regions (perception), uses an Uncertainty-Guided Dual Sampling Strategy to pick samples either for exploration or as stable pseudo-labels (action), and introduces a Fine-Grained Dynamic Assignment Strategy that categorizes pseudo-labeled data and applies tailored losses to correct noisy supervision (correction). Result: Experiments show that a range of state-of-the-art methods improve in both performance and labeling efficiency when combined with CABIN, with overfitting and confirmation bias reduced. Conclusion: Through its closed-loop, cognitive-aware learning mechanism, CABIN improves generalization under uncertainty and offers an effective approach to semi-supervised hyperspectral image classification. Abstract: Confidence alone is often misleading in hyperspectral image classification, as models tend to mistake high predictive scores for correctness while lacking awareness of uncertainty. This leads to confirmation bias, especially under sparse annotations or class imbalance, where models overfit confident errors and fail to generalize. We propose CABIN (Cognitive-Aware Behavior-Informed learNing), a semi-supervised framework that addresses this limitation through a closed-loop learning process of perception, action, and correction. CABIN first develops perceptual awareness by estimating epistemic uncertainty, identifying ambiguous regions where errors are likely to occur. It then acts by adopting an Uncertainty-Guided Dual Sampling Strategy, selecting uncertain samples for exploration while anchoring confident ones as stable pseudo-labels to reduce bias. To correct noisy supervision, CABIN introduces a Fine-Grained Dynamic Assignment Strategy that categorizes pseudo-labeled data into reliable, ambiguous, and noisy subsets, applying tailored losses to enhance generalization. Experimental results show that a wide range of state-of-the-art methods benefit from the integration of CABIN, with improved labeling efficiency and performance.
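
The perception and action steps can be sketched in a few lines. This is only an illustration under stated assumptions: the abstract does not say how CABIN estimates epistemic uncertainty, so the sketch uses the common mutual-information estimate over several stochastic predictions (as in MC dropout), and the names `epistemic_uncertainty` and `dual_sample` are hypothetical.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of one class-probability vector."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def epistemic_uncertainty(samples):
    """Mutual information between prediction and model: entropy of the
    mean prediction minus the mean entropy of the individual predictions.
    ASSUMPTION: this estimator is a stand-in, not CABIN's own."""
    n = len(samples)
    mean = [sum(s[c] for s in samples) / n for c in range(len(samples[0]))]
    return entropy(mean) - sum(entropy(s) for s in samples) / n


def dual_sample(per_pixel_samples, k):
    """Return (k most certain pixels, k most uncertain pixels)."""
    scores = {pid: epistemic_uncertainty(s) for pid, s in per_pixel_samples.items()}
    ranked = sorted(scores, key=scores.get)  # ascending uncertainty
    return ranked[:k], ranked[-k:]


# Two stochastic forward passes per pixel, two classes.
pixels = {
    "a": [[0.9, 0.1], [0.9, 0.1]],  # passes agree and are confident
    "b": [[0.9, 0.1], [0.1, 0.9]],  # passes disagree: high epistemic uncertainty
    "c": [[0.5, 0.5], [0.5, 0.5]],  # passes agree on an ambiguous answer
}
anchors, uncertain = dual_sample(pixels, k=1)
print("pseudo-label anchors:", anchors)
print("explore:", uncertain)
```

Pixel `c` shows why confidence alone is a poor signal: its predictions are maximally unconfident, yet the passes agree, so its epistemic score is zero and only pixel `b` is flagged for exploration.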

[116] VLF-MSC: Vision-Language Feature-Based Multimodal Semantic Communication System

Gwangyeon Ahn,Jiwan Seo,Joonhyuk Kang

Main category: cs.CV

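TL;DR: Proposes VLF-MSC, a multimodal semantic communication system based on vision-language features. A single unified vision-language representation supports both image and text generation at the receiver, improving spectral efficiency and robustness to noise.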

Details Motivation: Existing semantic communication methods usually process each modality separately, which wastes bandwidth and causes semantic mismatch; a unified multimodal representation is needed to improve communication efficiency and semantic consistency. Method: A pre-trained vision-language model (VLM) encodes the source image into a compact vision-language semantic feature (VLF), which is transmitted over the wireless channel; at the receiver it conditions a decoder-based language model and a diffusion model to generate text and an image, respectively. Result: Experiments show that VLF-MSC outperforms text-only and image-only baselines at low SNR, raising semantic accuracy for both modalities while significantly reducing bandwidth. Conclusion: With a unified cross-modal semantic representation, VLF-MSC achieves efficient and robust multimodal semantic communication and offers a new framework for future semantic communication systems. Abstract: We propose Vision-Language Feature-based Multimodal Semantic Communication (VLF-MSC), a unified system that transmits a single compact vision-language representation to support both image and text generation at the receiver. Unlike existing semantic communication techniques that process each modality separately, VLF-MSC employs a pre-trained vision-language model (VLM) to encode the source image into a vision-language semantic feature (VLF), which is transmitted over the wireless channel. At the receiver, a decoder-based language model and a diffusion-based image generator are both conditioned on the VLF to produce a descriptive text and a semantically aligned image. This unified representation eliminates the need for modality-specific streams or retransmissions, improving spectral efficiency and adaptability. By leveraging foundation models, the system achieves robustness to channel noise while preserving semantic fidelity. Experiments demonstrate that VLF-MSC outperforms text-only and image-only baselines, achieving higher semantic accuracy for both modalities under low SNR with significantly reduced bandwidth.
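
To make "transmitted over the wireless channel" at a given SNR concrete, the sketch below passes a feature vector through an additive white Gaussian noise (AWGN) channel. The channel model is an assumption (the abstract does not name one), the random vector merely stands in for a VLF, and `awgn_channel` is a hypothetical helper.

```python
import math
import random


def awgn_channel(feature, snr_db, rng):
    """Add white Gaussian noise so that signal power / noise power
    equals the requested SNR (given in dB)."""
    signal_power = sum(x * x for x in feature) / len(feature)
    noise_power = signal_power / (10 ** (snr_db / 10))
    sigma = math.sqrt(noise_power)
    return [x + rng.gauss(0.0, sigma) for x in feature]


def mse(a, b):
    """Mean squared error between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


rng = random.Random(0)
# Stand-in for the transmitted feature; a real VLF would come from the VLM encoder.
feature = [rng.uniform(-1.0, 1.0) for _ in range(4096)]

for snr_db in (0, 10, 20):
    received = awgn_channel(feature, snr_db, rng)
    print(f"SNR {snr_db:>2} dB -> feature MSE {mse(feature, received):.4f}")
```

Every 10 dB drop in SNR multiplies the distortion of the received feature by ten, which is the regime in which the receiver-side generators have to remain semantically faithful.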

[117] Mitigating Error Accumulation in Co-Speech Motion Generation via Global Rotation Diffusion and Multi-Level Constraints

Xiangyue Zhang,Jianfang Li,Jianqiang Ren,Jiaxu Zhang

Main category: cs.CV

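TL;DR: Proposes GlobalDiff, the first framework to run diffusion in the space of global joint rotations. A multi-level constraint scheme compensates for the missing structural priors and markedly improves co-speech gesture quality.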

Details Motivation: Existing methods generate co-speech gestures from local joint rotations, so errors accumulate along the skeleton hierarchy and produce unstable, unnatural end-effector motion. Method: GlobalDiff performs diffusion directly in global joint rotation space and adds multi-level constraints to strengthen structural awareness: a joint structure constraint (virtual anchor points), a skeleton structure constraint (angular consistency across bones), and a temporal structure constraint (a multi-scale variational encoder). Result: Extensive evaluation on standard co-speech benchmarks shows a 46.0% improvement over the current SOTA under multiple speaker identities, with smoother and more accurate motion. Conclusion: By decoupling each joint's prediction from hierarchical dependencies, GlobalDiff alleviates error accumulation and, together with the multi-level constraints, achieves high-quality co-speech gesture generation in global rotation space. Abstract: Reliable co-speech motion generation requires precise motion representation and consistent structural priors across all joints. Existing generative methods typically operate on local joint rotations, which are defined hierarchically based on the skeleton structure. This leads to cumulative errors during generation, manifesting as unstable and implausible motions at end-effectors. In this work, we propose GlobalDiff, a diffusion-based framework that operates directly in the space of global joint rotations for the first time, fundamentally decoupling each joint's prediction from upstream dependencies and alleviating hierarchical error accumulation. To compensate for the absence of structural priors in global rotation space, we introduce a multi-level constraint scheme. Specifically, a joint structure constraint introduces virtual anchor points around each joint to better capture fine-grained orientation. A skeleton structure constraint enforces angular consistency across bones to maintain structural integrity. A temporal structure constraint utilizes a multi-scale variational encoder to align the generated motion with ground-truth temporal patterns. These constraints jointly regularize the global diffusion process and reinforce structural awareness. Extensive evaluations on standard co-speech benchmarks show that GlobalDiff generates smooth and accurate motions, improving the performance by 46.0% compared to the current SOTA under multiple speaker identities.
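
The error-accumulation argument can be checked on a toy example. The sketch below uses a planar chain of unit-length bones instead of a 3D skeleton, which is a deliberate simplification, and applies the same small angular error to every joint in both parameterizations. All names are hypothetical.

```python
import math


def end_effector_from_local(local_angles, bone_length=1.0):
    """Forward kinematics with local rotations: each bone's global
    orientation is the sum of all upstream local angles."""
    x = y = 0.0
    global_angle = 0.0
    for angle in local_angles:
        global_angle += angle  # upstream errors are inherited here
        x += bone_length * math.cos(global_angle)
        y += bone_length * math.sin(global_angle)
    return x, y


def end_effector_from_global(global_angles, bone_length=1.0):
    """Same chain, but each bone's orientation is given directly."""
    x = y = 0.0
    for angle in global_angles:
        x += bone_length * math.cos(angle)
        y += bone_length * math.sin(angle)
    return x, y


def dist(p, q):
    return math.hypot(p[0] - q[0], p[1] - q[1])


n_joints = 8
local = [0.1] * n_joints                          # ground-truth local angles
glob = [0.1 * (i + 1) for i in range(n_joints)]   # the same pose, global angles
eps = 0.05                                        # identical error at every joint

truth = end_effector_from_local(local)
err_local = dist(truth, end_effector_from_local([a + eps for a in local]))
err_global = dist(truth, end_effector_from_global([g + eps for g in glob]))
print(f"end-effector error, local rotations:  {err_local:.3f}")
print(f"end-effector error, global rotations: {err_global:.3f}")
```

With local rotations the i-th bone ends up misoriented by i times `eps`, whereas with global rotations every bone is off by `eps` only, so the end-effector error is several times smaller in this toy setting.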

[118] GridPrune: From "Where to Look" to "What to Select" in Visual Token Pruning for MLLMs

Yuxiang Duan,Ao Li,Yingqin Li,Luyu Li,Pengwei Wang

Main category: cs.CV

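TL;DR: This paper proposes GridPrune, a "guide-globally, select-locally" visual token pruning method that dynamically allocates a token budget across spatial zones and then selects tokens within each zone, substantially improving the inference efficiency of multimodal large language models.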

Details Motivation: Existing visual token pruning methods mostly optimize "what to select" directly and neglect "where to look", which leads to inefficient spatial allocation, positional bias, and retention of redundant tokens. Inspired by human visual attention, the paper introduces a two-stage strategy to allocate attention more efficiently. Method: GridPrune divides the image into grid zones (zonal selection). It first uses text-conditional signals to allocate a token budget to each zone (guide-globally) and then selects tokens independently within each zone (select-locally), balancing fine-grained selection with efficiency. Result: On multimodal models such as LLaVA-NeXT-7B, GridPrune retains 96.98% of full performance while using only 11.1% of the visual tokens, outperforming the best baseline by 2.34% at the same pruning rate. Conclusion: By mimicking the two-stage mechanism of human visual attention, GridPrune addresses the inefficient spatial allocation of existing pruning methods and offers a new design direction for efficient MLLM inference. Abstract: Multimodal large language models (MLLMs) have shown remarkable capabilities in a wide range of vision-language tasks. However, the large number of visual tokens introduces significant computational overhead. To address this issue, visual token pruning has emerged as a key technique for enhancing the efficiency of MLLMs. In cognitive science, humans tend to first determine which regions of a scene to attend to ("where to look") before deciding which specific elements within those regions to process in detail ("what to select"). This two-stage strategy enables the visual system to efficiently allocate attention at a coarse spatial level before performing fine-grained selection. However, existing pruning methods primarily focus on directly optimizing "what to select", typically using attention scores or similarity metrics. They rarely consider "where to look", which has been shown to lead to inefficient spatial allocation, positional bias, and the retention of irrelevant or redundant tokens. In this paper, we propose GridPrune, a method that replaces the global Top-K mechanism with a "guide-globally, select-locally" zonal selection system. GridPrune splits the pruning process into two steps: first, it uses text-conditional guidance to dynamically allocate a token budget across spatial zones; and then, it performs local selection within each budgeted zone. Experimental results demonstrate that GridPrune achieves superior performance across various MLLM architectures.
On LLaVA-NeXT-7B, GridPrune retains 96.98% of the full performance while using 11.1% of the tokens, outperforming the best-performing baseline by 2.34% at the same pruning rate.
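
The two-step "guide-globally, select-locally" procedure can be sketched as follows. This is a minimal illustration, not GridPrune's implementation: the abstract does not specify how zone relevance or per-token scores are computed, so both are supplied as plain inputs here, and the proportional budget rule and function names are assumptions.

```python
def allocate_budget(zone_relevance, total_budget):
    """Step 1 (guide globally): split the token budget across zones in
    proportion to relevance, using largest-remainder rounding."""
    total = sum(zone_relevance)
    raw = [total_budget * r / total for r in zone_relevance]
    budgets = [int(x) for x in raw]
    leftover = total_budget - sum(budgets)
    by_remainder = sorted(range(len(raw)), key=lambda i: raw[i] - budgets[i], reverse=True)
    for i in by_remainder[:leftover]:
        budgets[i] += 1
    return budgets


def select_tokens(scores, zone_of, zone_relevance, total_budget):
    """Step 2 (select locally): keep the top-scoring tokens inside each
    zone, up to that zone's budget. A zone smaller than its budget simply
    contributes all of its tokens."""
    budgets = allocate_budget(zone_relevance, total_budget)
    kept = []
    for zone, budget in enumerate(budgets):
        members = [i for i in range(len(scores)) if zone_of[i] == zone]
        members.sort(key=lambda i: scores[i], reverse=True)
        kept.extend(members[:budget])
    return sorted(kept)


# A 4x4 grid of visual tokens split into four 2x2 zones.
grid = 4
zone_of = [(i // grid // 2) * 2 + (i % grid) // 2 for i in range(grid * grid)]
scores = [(i * 7) % 16 / 16 for i in range(grid * grid)]  # placeholder token scores
relevance = [0.5, 0.25, 0.125, 0.125]                     # placeholder text-conditioned relevance

kept = select_tokens(scores, zone_of, relevance, total_budget=8)
print("budgets per zone:", allocate_budget(relevance, 8))
print("kept token indices:", kept)
```

Unlike a single global Top-K, every zone with non-zero relevance is guaranteed some representation, while the most relevant zone keeps the most tokens.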

[119] SUGAR: Learning Skeleton Representation with Visual-Motion Knowledge for Action Recognition

Qilang Ye,Yu Zhou,Lian He,Jie Zhang,Xuanming Guo,Jiayu Zhang,Mingkui Tan,Weicheng Xie,Yue Sun,Tao Tan,Xiaochen Yuan,Ghada Khoriba,Zitong Yu

Main category: cs.CV

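TL;DR: This paper proposes SUGAR, a new paradigm that combines large language models (LLMs) with human skeleton data for action classification and description. Visual-motion knowledge guides skeleton representation learning, and a TQP module models long skeleton sequences. SUGAR performs well on several benchmarks and is more versatile than linear-based methods in zero-shot scenarios.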

Details Motivation: The paper explores how to apply large language models to skeleton data for action recognition, addressing the difficulty LLMs have in understanding skeleton information directly and in distinguishing between actions. Method: Off-the-shelf large-scale video models extract visual and motion information as prior knowledge, which supervises skeleton learning to produce discrete representations; an LLM with untouched pre-trained weights then interprets these representations and generates action labels and descriptions. A Temporal Query Projection (TQP) module handles long skeleton sequences. Result: SUGAR is validated on several skeleton-based action classification benchmarks and outperforms linear-based methods in zero-shot scenarios, indicating stronger generalization. Conclusion: By introducing visual-motion prior knowledge and the TQP module, SUGAR combines LLMs with skeleton data for effective action recognition and description, with good scalability and application potential. Abstract: Large Language Models (LLMs) hold rich implicit knowledge and powerful transferability. In this paper, we explore the combination of LLMs with the human skeleton to perform action classification and description. However, when treating an LLM as a recognizer, two questions arise: 1) How can LLMs understand skeletons? 2) How can LLMs distinguish among actions? To address these problems, we introduce a novel paradigm named learning Skeleton representation with visUal-motion knowledGe for Action Recognition (SUGAR). In our pipeline, we first utilize off-the-shelf large-scale video models as a knowledge base to generate visual and motion information related to actions. Then, we propose to supervise skeleton learning through this prior knowledge to yield discrete representations. Finally, we use the LLM with untouched pre-training weights to understand these representations and generate the desired action targets and descriptions. Notably, we present a Temporal Query Projection (TQP) module to continuously model the skeleton signals with long sequences. Experiments on several skeleton-based action classification benchmarks demonstrate the efficacy of our SUGAR. Moreover, experiments on zero-shot scenarios show that SUGAR is more versatile than linear-based methods.

[120] MTAttack: Multi-Target Backdoor Attacks against Large Vision-Language Models

Zihan Wang,Guansong Pang,Wenjun Miao,Jin Zheng,Xiao Bai

Main category: cs.CV

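TL;DR: This paper proposes MTAttack, the first multi-target backdoor attack framework for large vision-language models (LVLMs). Proxy Space Partitioning and Trigger Prototype Anchoring constraints resolve feature interference among multiple triggers, yielding multi-target attacks with high success rates, strong generalization, and robustness to defenses.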

Details Motivation: Existing backdoor attacks focus on a single target, whereas multi-target attacks pose a greater threat in practice. Severe feature interference among different triggers makes multi-target backdoor attacks hard to realize, so methods that implant multiple independent trigger-target mappings in a single training pass need to be studied. Method: The MTAttack framework is built around an optimization method with two new constraints: a Proxy Space Partitioning constraint and a Trigger Prototype Anchoring constraint. It jointly optimizes multiple triggers in the latent space so that each trigger maps clean images to a unique proxy class while the triggers remain separable. Result: Experiments on popular benchmarks show that MTAttack achieves a high success rate for multi-target attacks, substantially outperforms existing attack methods, generalizes well across datasets, and is robust against several backdoor defense strategies. Conclusion: LVLMs are vulnerable to multi-target backdoor attacks; MTAttack exposes this serious security weakness and underscores the urgency of developing corresponding defenses. Abstract: Recent advances in Large Visual Language Models (LVLMs) have demonstrated impressive performance across various vision-language tasks by leveraging large-scale image-text pretraining and instruction tuning. However, the security vulnerabilities of LVLMs have become increasingly concerning, particularly their susceptibility to backdoor attacks. Existing backdoor attacks focus on single-target attacks, i.e., targeting a single malicious output associated with a specific trigger. In this work, we uncover multi-target backdoor attacks, where multiple independent triggers corresponding to different attack targets are added in a single pass of training, posing a greater threat to LVLMs in real-world applications. Executing such attacks in LVLMs is challenging since there can be many incorrect trigger-target mappings due to severe feature interference among different triggers. To address this challenge, we propose MTAttack, the first multi-target backdoor attack framework for enforcing accurate multiple trigger-target mappings in LVLMs. The core of MTAttack is a novel optimization method with two constraints, namely the Proxy Space Partitioning constraint and the Trigger Prototype Anchoring constraint. It jointly optimizes multiple triggers in the latent space, with each trigger independently mapping clean images to a unique proxy class while at the same time guaranteeing their separability. Experiments on popular benchmarks demonstrate a high success rate of MTAttack for multi-target attacks, substantially outperforming existing attack methods.
Furthermore, our attack exhibits strong generalizability across datasets and robustness against backdoor defense strategies. These findings highlight the vulnerability of LVLMs to multi-target backdoor attacks and underscore the urgent need for mitigating such threats. Code is available at https://github.com/mala-lab/MTAttack.

[121] RobIA: Robust Instance-aware Continual Test-time Adaptation for Deep Stereo

Jueun Ko,Hyewon Park,Hyesong Choi,Dongbo Min

Main category: cs.CV

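TL;DR: This paper proposes RobIA, a robust, instance-aware continual test-time adaptation framework for stereo depth estimation that uses dynamic routing and pseudo-supervision to improve performance in dynamic domains.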

Details Motivation: Stereo depth estimation in real-world environments is challenged by dynamic domain shifts, sparse or unreliable supervision, and the high cost of dense ground-truth labels. Most existing TTA methods assume a static target domain and cope poorly with continually changing environments. Method: RobIA has two core components: an Attend-and-Excite Mixture-of-Experts (AttEx-MoE) that routes each input to experts through a lightweight self-attention mechanism, and a PEFT-based Robust AdaptBN Teacher that complements sparse handcrafted labels to produce dense pseudo-supervision. Result: Experiments show that RobIA adapts better than existing methods across multiple dynamic target domains while remaining computationally efficient. Conclusion: Through instance-aware dynamic adaptation and strengthened supervision, RobIA improves the robustness and generalization of stereo depth estimation under continual domain shift. Abstract: Stereo Depth Estimation in real-world environments poses significant challenges due to dynamic domain shifts, sparse or unreliable supervision, and the high cost of acquiring dense ground-truth labels. While recent Test-Time Adaptation (TTA) methods offer promising solutions, most rely on static target domain assumptions and input-invariant adaptation strategies, limiting their effectiveness under continual shifts. In this paper, we propose RobIA, a novel Robust, Instance-Aware framework for Continual Test-Time Adaptation (CTTA) in stereo depth estimation. RobIA integrates two key components: (1) Attend-and-Excite Mixture-of-Experts (AttEx-MoE), a parameter-efficient module that dynamically routes input to frozen experts via a lightweight self-attention mechanism tailored to epipolar geometry, and (2) Robust AdaptBN Teacher, a PEFT-based teacher model that provides dense pseudo-supervision by complementing sparse handcrafted labels. This strategy enables input-specific flexibility and broad supervision coverage, improving generalization under domain shift. Extensive experiments demonstrate that RobIA achieves superior adaptation performance across dynamic target domains while maintaining computational efficiency.

[122] Explicit Temporal-Semantic Modeling for Dense Video Captioning via Context-Aware Cross-Modal Interaction

Mingda Jia,Weiliang Meng,Zenghuang Fu,Yiheng Li,Qi Zeng,Yifan Zhang,Ju Xin,Rongtao Xu,Jiguang Zhang,Xiaopeng Zhang

Main category: cs.CV

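TL;DR: Proposes CACMI, an explicit temporal-semantic modeling framework for dense video captioning that improves performance through cross-modal frame aggregation and context-aware feature enhancement.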

Details Motivation: Existing methods rely on implicit modeling and fail to capture temporal coherence across event sequences or the full semantics of the visual context. Method: The Context-Aware Cross-Modal Interaction (CACMI) framework comprises two modules, Cross-modal Frame Aggregation and Context-aware Feature Enhancement, and uses text-guided attention to integrate visual dynamics with pseudo-event semantics. Result: CACMI achieves state-of-the-art performance on the ActivityNet Captions and YouCook2 datasets. Conclusion: CACMI effectively models temporal coherence and semantic richness in video, improving the quality of dense video captioning. Abstract: Dense video captioning jointly localizes and captions salient events in untrimmed videos. Recent methods primarily focus on leveraging additional prior knowledge and advanced multi-task architectures to achieve competitive performance. However, these pipelines rely on implicit modeling that uses frame-level or fragmented video features, failing to capture the temporal coherence across event sequences and comprehensive semantics within visual contexts. To address this, we propose an explicit temporal-semantic modeling framework called Context-Aware Cross-Modal Interaction (CACMI), which leverages both latent temporal characteristics within videos and linguistic semantics from a text corpus. Specifically, our model consists of two core components: Cross-modal Frame Aggregation aggregates relevant frames to extract temporally coherent, event-aligned textual features through cross-modal retrieval; and Context-aware Feature Enhancement utilizes query-guided attention to integrate visual dynamics with pseudo-event semantics. Extensive experiments on the ActivityNet Captions and YouCook2 datasets demonstrate that CACMI achieves state-of-the-art performance on the dense video captioning task.

[123] Right Looks, Wrong Reasons: Compositional Fidelity in Text-to-Image Generation

Mayank Vatsa,Aparna Bharati,Richa Singh

Main category: cs.CV

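TL;DR: This survey examines a serious weakness of current text-to-image models in handling logical composition (negation, counting, and spatial relations). Performance collapses when these primitives are combined; the paper traces the causes to data, architecture, and evaluation metrics, and argues that fundamental advances in representation and reasoning are required.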

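Details Motivation: Current text-to-image models cannot reliably handle logical composition, for example understanding "no red ball" and "two objects on the left" at the same time, which limits their use in semantically complex scenes. Method: The paper analyzes combinations of three logical primitives (negation, counting, and spatial relations), reviews existing benchmarks and methods to assess the performance collapse on compositional tasks, and attributes it to three factors: training data distribution, properties of attention architectures, and the preferences of evaluation metrics. Result: Models do well on single primitives, but accuracy drops sharply when primitives are combined; training data contain almost no explicit negation, continuous attention mechanisms handle discrete logic poorly, and existing metrics favor visual plausibility over logical constraint satisfaction. Conclusion: Genuine compositionality will require fundamental breakthroughs in representation and reasoning rather than incremental changes to existing architectures. Abstract: The architectural blueprint of today's leading text-to-image models contains a fundamental flaw: an inability to handle logical composition. This survey investigates this breakdown across three core primitives: negation, counting, and spatial relations. Our analysis reveals a dramatic performance collapse: models that are accurate on single primitives fail precipitously when these are combined, exposing severe interference. We trace this failure to three key factors. First, training data show a near-total absence of explicit negations. Second, continuous attention architectures are fundamentally unsuitable for discrete logic. Third, evaluation metrics reward visual plausibility over constraint satisfaction. By analyzing recent benchmarks and methods, we show that current solutions and simple scaling cannot bridge this gap. Achieving genuine compositionality, we conclude, will require fundamental advances in representation and reasoning rather than incremental adjustments to existing architectures.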

[124] Split-Layer: Enhancing Implicit Neural Representation by Maximizing the Dimensionality of Feature Space

Zhicheng Cai,Hao Zhu,Linsen Chen,Qiu Shen,Xun Cao

Main category: cs.CV

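TL;DR: Proposes the split-layer, a new MLP construction that divides each layer into multiple parallel branches and combines their outputs with a Hadamard product, substantially increasing the representational capacity of implicit neural representations (INRs) without a steep rise in compute or memory cost.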

Details Motivation: The low-dimensional feature space of conventional multilayer perceptrons (MLPs) limits the representational capacity of implicit neural representations (INRs), and simply widening the network makes compute and memory costs grow quadratically, so an efficient way to raise feature-space dimensionality is needed. Method: The split-layer divides each MLP layer into multiple parallel branches and fuses the branch outputs via a Hadamard product, constructing a high-degree polynomial space that expands feature-space dimensionality without significant extra computation. Result: Across 2D image fitting, 2D CT reconstruction, 3D shape representation, and 5D novel view synthesis, the split-layer substantially improves INR performance and surpasses existing methods. Conclusion: The split-layer is an efficient and scalable reformulation of the MLP that addresses the limited feature space of INRs and provides stronger modeling capacity for continuous signal representation in inverse problems. Abstract: Implicit neural representation (INR) models signals as continuous functions using neural networks, offering efficient and differentiable optimization for inverse problems across diverse disciplines. However, the representational capacity of INR, defined by the range of functions the neural network can characterize, is inherently limited by the low-dimensional feature space in conventional multilayer perceptron (MLP) architectures. While widening the MLP can linearly increase feature space dimensionality, it also leads to a quadratic growth in computational and memory costs. To address this limitation, we propose the split-layer, a novel reformulation of MLP construction. The split-layer divides each layer into multiple parallel branches and integrates their outputs via Hadamard product, effectively constructing a high-degree polynomial space. This approach significantly enhances INR's representational capacity by expanding the feature space dimensionality without incurring prohibitive computational overhead. Extensive experiments demonstrate that the split-layer substantially improves INR performance, surpassing existing methods across multiple tasks, including 2D image fitting, 2D CT reconstruction, 3D shape representation, and 5D novel view synthesis.
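
A tiny example shows why multiplying branch outputs raises the polynomial degree. The sketch below is an assumption-laden toy: activations, initialization, and any normalization the paper may use are omitted, and `split_layer` is a hypothetical name for "several linear branches fused by an element-wise product".

```python
def linear(weights, bias, x):
    """Affine map: one output per row of `weights`."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, bias)]


def split_layer(branches, x):
    """Run every (weights, bias) branch on the same input and fuse the
    results with an element-wise (Hadamard) product."""
    out = None
    for weights, bias in branches:
        y = linear(weights, bias, x)
        out = y if out is None else [a * b for a, b in zip(out, y)]
    return out


# Two branches on a scalar input: (2x + 1) and (x - 3).
branches = [
    ([[2.0]], [1.0]),
    ([[1.0]], [-3.0]),
]

for x in (-1.0, 0.0, 2.0):
    fused = split_layer(branches, [x])[0]
    poly = 2 * x * x - 5 * x - 3  # (2x + 1)(x - 3) expanded
    print(f"x={x:+.1f}  split-layer={fused:+.1f}  2x^2-5x-3={poly:+.1f}")
```

Each branch alone is degree one in `x`, but their product is the quadratic 2x^2 - 5x - 3; with B branches the fused output contains monomials up to degree B, which is the "high-degree polynomial space" the abstract refers to.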

[125] Decoupling Bias, Aligning Distributions: Synergistic Fairness Optimization for Deepfake Detection

Feng Ding,Wenhui Yi,Yunpeng Zhou,Xinan He,Hong Rao,Shu Hu

Main category: cs.CV

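TL;DR: Proposes a dual-mechanism collaborative optimization framework that combines structural fairness decoupling with global distribution alignment to improve both inter-group and intra-group fairness of cross-domain deepfake detectors while preserving detection accuracy.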

Details Motivation: Existing deepfake detection models are biased across demographic groups, and improving fairness often comes at the cost of detection accuracy, so a solution that delivers both is needed. Method: At the architectural level, channels sensitive to demographic attributes are decoupled; at the feature level, global distribution alignment reduces the distance between the overall sample distribution and each group's distribution. The two mechanisms are optimized jointly. Result: Experiments show that the method outperforms existing approaches across domains, improving inter-group and intra-group fairness while maintaining high detection accuracy. Conclusion: The dual-mechanism framework balances fairness and accuracy in deepfake detection and supports trustworthy, equitable digital identity security. Abstract: Fairness is a core element in the trustworthy deployment of deepfake detection models, especially in the field of digital identity security. Biases in detection models toward different demographic groups, such as gender and race, may lead to systemic misjudgments, exacerbating the digital divide and social inequities. However, current fairness-enhanced detectors often improve fairness at the cost of detection accuracy. To address this challenge, we propose a dual-mechanism collaborative optimization framework. Our proposed method innovatively integrates structural fairness decoupling and global distribution alignment: decoupling channels sensitive to demographic groups at the model architectural level, and subsequently reducing the distance between the overall sample distribution and the distributions corresponding to each demographic group at the feature level. Experimental results demonstrate that, compared with other methods, our framework improves both inter-group and intra-group fairness while maintaining overall detection accuracy across domains.

[126] GEA: Generation-Enhanced Alignment for Text-to-Image Person Retrieval

Hao Zou,Runqing Zhang,Xue Zhou,Jianxiao Zou

Main category: cs.CV

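TL;DR: Proposes Generation-Enhanced Alignment (GEA), which uses diffusion-generated images as intermediate semantic representations to improve cross-modal alignment in text-to-image person retrieval.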

Details Motivation: Existing text-to-image person retrieval methods are limited by incomplete textual queries and the modality gap between text and images, which makes cross-modal alignment difficult and encourages overfitting. Method: Two parallel modules are designed: Text-Guided Token Enhancement (TGTE) uses diffusion-generated images as an intermediate semantic bridge, and Generative Intermediate Fusion (GIF) fuses generated-image, original-image, and text features with cross-attention, optimizing the unified representation with a triplet alignment loss. Result: Extensive experiments on three public datasets, CUHK-PEDES, RSTPReid, and ICFG-PEDES, confirm the effectiveness of GEA. Conclusion: Approaching the task from a generative perspective, GEA mitigates the modality gap and overfitting and improves text-to-image person retrieval. Abstract: Text-to-Image Person Retrieval (TIPR) aims to retrieve person images based on natural language descriptions. Although many TIPR methods have achieved promising results, sometimes textual queries cannot accurately and comprehensively reflect the content of the image, leading to poor cross-modal alignment and overfitting to limited datasets. Moreover, the inherent modality gap between text and image further amplifies these issues, making accurate cross-modal retrieval even more challenging. To address these limitations, we propose Generation-Enhanced Alignment (GEA) from a generative perspective. GEA contains two parallel modules: (1) Text-Guided Token Enhancement (TGTE), which introduces diffusion-generated images as intermediate semantic representations to bridge the gap between text and visual patterns. These generated images enrich the semantic representation of text and facilitate cross-modal alignment. (2) Generative Intermediate Fusion (GIF), which combines cross-attention between generated images, original images, and text features to generate a unified representation optimized by a triplet alignment loss. We conduct extensive experiments on three public TIPR datasets, CUHK-PEDES, RSTPReid, and ICFG-PEDES, to evaluate the performance of GEA. The results justify the effectiveness of our method. More implementation details and extended results are available at https://github.com/sugelamyd123/Sup-for-GEA.

[127] Physically Interpretable Multi-Degradation Image Restoration via Deep Unfolding and Explainable Convolution

Hu Gao,Xiaoning Lei,Xichen Xu,Depeng Dang,Lizhuang Ma

Main category: cs.CV

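TL;DR: Proposes InterIR, an interpretable multi-degradation image restoration method built on a deep unfolding network. An improved second-order semi-smooth Newton algorithm and an explainable convolution module let it handle multiple degradation types flexibly, delivering strong performance while remaining highly interpretable.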

Details Motivation: Most image restoration methods target a single degradation type, and stacking modules hurts interpretability; real-world images often contain several degradations at once, so a model that is both accurate and interpretable is needed. Method: Built on a deep unfolding network, the method maps the iterations of an optimization algorithm into a learnable structure; an improved second-order semi-smooth Newton algorithm keeps each module physically interpretable; and an explainable convolution module, inspired by how the human brain processes information, adapts its parameters to the input. Result: InterIR performs strongly on multi-degradation restoration and remains competitive on single-degradation tasks, with good interpretability and adaptability. Conclusion: By combining mathematical optimization with interpretable module design, the method achieves high-performance, interpretable multi-degradation restoration and offers an effective solution for complex real-world scenes. Abstract: Although image restoration has advanced significantly, most existing methods target only a single type of degradation. In real-world scenarios, images often contain multiple degradations simultaneously, such as rain, noise, and haze, requiring models capable of handling diverse degradation types. Moreover, methods that improve performance through module stacking often suffer from limited interpretability. In this paper, we propose a novel interpretability-driven approach for multi-degradation image restoration, built upon a deep unfolding network that maps the iterative process of a mathematical optimization algorithm into a learnable network structure. Specifically, we employ an improved second-order semi-smooth Newton algorithm to ensure that each module maintains clear physical interpretability. To further enhance interpretability and adaptability, we design an explainable convolution module inspired by the human brain's flexible information processing and the intrinsic characteristics of images, allowing the network to flexibly leverage learned knowledge and autonomously adjust parameters for different inputs. The resulting tightly integrated architecture, named InterIR, demonstrates excellent performance in multi-degradation restoration while remaining highly competitive on single-degradation tasks.

[128] CephRes-MHNet: A Multi-Head Residual Network for Accurate and Robust Cephalometric Landmark Detection

Ahmed Jaheen,Islam Hassan,Mohanad Abouserie,Abdelaty Rehab,Adham Elasfar,Knzy Elmasry,Mostafa El-Dawlatly,Seif Eldawlatly

Main category: cs.CV

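TL;DR: This paper proposes CephRes-MHNet, a multi-head residual convolutional network that automatically detects cephalometric landmarks in 2D lateral skull X-rays, with better accuracy and efficiency than existing methods.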

Details Motivation: Accurate cephalometric landmark localization is essential for orthodontic diagnosis, but manual annotation is slow and error-prone, and existing automated methods struggle with low contrast and complex anatomy. Method: CephRes-MHNet combines residual encoding, dual-attention mechanisms, and multi-head decoders to strengthen contextual reasoning and anatomical precision. The model is trained on the Aariz Cephalometric dataset of 1,000 radiographs. Result: CephRes-MHNet achieves a mean radial error (MRE) of 1.23 mm and a success detection rate (SDR) within 2.0 mm of 85.5%, outperforming all compared models while using less than 25% of the parameters of the strongest baseline, AFPF-Net. Conclusion: Through architectural efficiency, CephRes-MHNet reaches state-of-the-art detection performance while staying lightweight, making it suitable for clinical orthodontic analysis. Abstract: Accurate localization of cephalometric landmarks from 2D lateral skull X-rays is vital for orthodontic diagnosis and treatment. Manual annotation is time-consuming and error-prone, whereas automated approaches often struggle with low contrast and anatomical complexity. This paper introduces CephRes-MHNet, a multi-head residual convolutional network for robust and efficient cephalometric landmark detection. The architecture integrates residual encoding, dual-attention mechanisms, and multi-head decoders to enhance contextual reasoning and anatomical precision. Trained on the Aariz Cephalometric dataset of 1,000 radiographs, CephRes-MHNet achieved a mean radial error (MRE) of 1.23 mm and a success detection rate (SDR) @ 2.0 mm of 85.5%, outperforming all evaluated models. In particular, it exceeded the strongest baseline, the attention-driven AFPF-Net (MRE = 1.25 mm, SDR @ 2.0 mm = 84.1%), while using less than 25% of its parameters. These results demonstrate that CephRes-MHNet attains state-of-the-art accuracy through architectural efficiency, providing a practical solution for real-world orthodontic analysis.
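
For readers unfamiliar with the two reported metrics, the sketch below computes MRE and SDR on made-up landmark coordinates. The coordinates are illustrative only, the `mm_per_pixel` conversion factor is a hypothetical parameter, and counting an error exactly equal to the threshold as a success is an assumed convention.

```python
import math


def radial_errors(pred, gt, mm_per_pixel=1.0):
    """Euclidean distance between each predicted and true landmark, in mm."""
    return [
        math.hypot(px - gx, py - gy) * mm_per_pixel
        for (px, py), (gx, gy) in zip(pred, gt)
    ]


def mean_radial_error(errors):
    """MRE: average radial error over all landmarks."""
    return sum(errors) / len(errors)


def success_detection_rate(errors, threshold_mm=2.0):
    """SDR: percentage of landmarks whose error is within the threshold."""
    return 100.0 * sum(e <= threshold_mm for e in errors) / len(errors)


# Illustrative coordinates (already in mm), not data from the paper.
gt = [(0.0, 0.0), (10.0, 10.0), (20.0, 5.0), (30.0, 30.0)]
pred = [(0.6, 0.8), (10.0, 10.0), (23.0, 9.0), (30.0, 31.5)]

errors = radial_errors(pred, gt)
print("radial errors (mm):", [round(e, 2) for e in errors])
print("MRE (mm):", round(mean_radial_error(errors), 3))
print("SDR @ 2.0 mm (%):", success_detection_rate(errors))
```

One badly placed landmark (5 mm off here) raises the MRE noticeably but costs only a single miss in the SDR, which is why the two metrics are usually reported together.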

[129] Utilizing a Geospatial Foundation Model for Coastline Delineation in Small Sandy Islands

Tishya Chhabra,Manisha Bajpai,Walter Zesk,Skylar Tibbits

Main category: cs.CV

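TL;DR: This study evaluates NASA and IBM's Prithvi-EO-2.0 geospatial foundation model for shoreline delineation of small sandy islands in the Maldives, using 225 multispectral images for training and testing. Even with only 5 training images the model performs well (F1 = 0.94, IoU = 0.79), showing strong transfer learning for coastal monitoring in data-poor regions.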

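Details Motivation: Small island regions often lack sufficient labeled data, which makes shoreline monitoring with conventional methods difficult; foundation models with strong transfer learning are needed to support environmental monitoring in such data-poor areas. Method: The authors collected and labeled 225 multispectral satellite images of two Maldivian islands, released the dataset publicly, and fine-tuned the 300M- and 600M-parameter versions of Prithvi-EO-2.0 on subsets of 5 to 181 images to evaluate shoreline delineation performance. Result: Even with only 5 training images, the Prithvi models reach an F1 score of 0.94 and an IoU of 0.79, and results are stable across model sizes, confirming strong transfer learning. Conclusion: Prithvi-EO-2.0 performs well in the few-shot regime, demonstrating the potential of geospatial foundation models for efficient shoreline monitoring in data-scarce coastal areas. Abstract: We present an initial evaluation of NASA and IBM's Prithvi-EO-2.0 geospatial foundation model on shoreline delineation of small sandy islands using satellite images. We curated and labeled a dataset of 225 multispectral images of two Maldivian islands, which we publicly release, and fine-tuned both the 300M and 600M parameter versions of Prithvi on training subsets ranging from 5 to 181 images. Our experiments show that even with as few as 5 training images, the models achieve high performance (F1 of 0.94, IoU of 0.79). Our results demonstrate the strong transfer learning capability of Prithvi, underscoring the potential of such models to support coastal monitoring in data-poor regions.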
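
As a reminder of what the two segmentation scores measure, here is a minimal sketch that computes F1 and IoU for a single binary land/water mask. The masks are toy data, and how the paper aggregates scores across images or classes is not stated in the abstract, so this covers only the per-mask definitions.

```python
def confusion_counts(pred, truth):
    """True positives, false positives, false negatives for 0/1 masks."""
    tp = sum(1 for p, t in zip(pred, truth) if p == 1 and t == 1)
    fp = sum(1 for p, t in zip(pred, truth) if p == 1 and t == 0)
    fn = sum(1 for p, t in zip(pred, truth) if p == 0 and t == 1)
    return tp, fp, fn


def f1_score(pred, truth):
    tp, fp, fn = confusion_counts(pred, truth)
    return 2 * tp / (2 * tp + fp + fn)


def iou_score(pred, truth):
    tp, fp, fn = confusion_counts(pred, truth)
    return tp / (tp + fp + fn)


# Flattened toy masks: 1 = land, 0 = water.
truth = [1, 1, 0, 0, 0, 1, 1, 0]
pred = [1, 1, 1, 0, 0, 0, 1, 0]

print("F1 :", f1_score(pred, truth))
print("IoU:", iou_score(pred, truth))
```

Both scores ignore true negatives, so a large area of correctly predicted open water does not inflate them; that matters for small islands, where land pixels are a small fraction of each image.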

[130] VISTA: A Vision and Intent-Aware Social Attention Framework for Multi-Agent Trajectory Prediction

Stephane Da Silva Martins,Emanuel Aldea,Sylvie Le Hégarat-Mascle

Main category: cs.CV

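TL;DR: VISTA is a recursive goal-conditioned transformer for multi-agent trajectory prediction. It fuses long-term intent with past motion, models social interactions flexibly, and exposes interpretable social influence patterns, producing more realistic and safer trajectories in dense, interactive environments.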

Details Motivation: Existing methods struggle to capture agents' long-term goals and their fine-grained social interactions at the same time, which yields unrealistic predictions; a model that handles goals and social interaction jointly is needed to improve accuracy and realism. Method: VISTA has three core components: a cross-attention fusion module (integrating long-horizon intent with past motion), a social-token attention mechanism (flexible modeling of interactions across agents), and pairwise attention maps (making social influence interpretable at inference time). Together they extend single-agent goal-conditioned prediction into a coherent multi-agent forecasting framework. Result: On the MADRAS and SDD datasets, VISTA reaches state-of-the-art accuracy on standard displacement metrics and sharply reduces collisions: on MADRAS the average collision rate falls from 2.14% to 0.03%, and on SDD it attains zero collisions while improving ADE, FDE, and minFDE. Conclusion: VISTA generates socially compliant, goal-aware, and interpretable trajectories, making it suitable for safety-critical autonomous systems. Abstract: Multi-agent trajectory prediction is crucial for autonomous systems operating in dense, interactive environments. Existing methods often fail to jointly capture agents' long-term goals and their fine-grained social interactions, which leads to unrealistic multi-agent futures. We propose VISTA, a recursive goal-conditioned transformer for multi-agent trajectory forecasting. VISTA combines (i) a cross-attention fusion module that integrates long-horizon intent with past motion, (ii) a social-token attention mechanism for flexible interaction modeling across agents, and (iii) pairwise attention maps that make social influence patterns interpretable at inference time. Our model turns single-agent goal-conditioned prediction into a coherent multi-agent forecasting framework. Beyond standard displacement metrics, we evaluate trajectory collision rates as a measure of joint realism. On the high-density MADRAS benchmark and on SDD, VISTA achieves state-of-the-art accuracy and substantially fewer collisions. On MADRAS, it reduces the average collision rate of strong baselines from 2.14 to 0.03 percent, and on SDD it attains zero collisions while improving ADE, FDE, and minFDE. These results show that VISTA generates socially compliant, goal-aware, and interpretable trajectories, making it promising for safety-critical autonomous systems.
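
The displacement and collision metrics mentioned above can be sketched as follows. ADE and FDE follow their standard definitions; the collision rate is an assumed definition (share of agent pairs that come within two agent radii of each other at any shared timestep), since the abstract does not spell out the paper's exact formula.

```python
import math
from itertools import combinations


def ade_fde(pred, gt):
    """Average and final displacement error for one trajectory."""
    d = [math.dist(p, g) for p, g in zip(pred, gt)]
    return sum(d) / len(d), d[-1]


def collision_rate(trajectories, radius):
    """Percentage of agent pairs closer than 2 * radius at some timestep.
    ASSUMPTION: pairwise definition chosen for illustration."""
    pairs = list(combinations(trajectories, 2))
    hits = sum(
        any(math.dist(p, q) < 2 * radius for p, q in zip(a, b))
        for a, b in pairs
    )
    return 100.0 * hits / len(pairs)


# Three predicted trajectories over three timesteps (toy data).
agent_a = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
agent_b = [(0.0, 1.0), (1.0, 0.1), (2.0, 1.0)]  # brushes past agent_a at t=1
agent_c = [(5.0, 5.0), (5.0, 6.0), (5.0, 7.0)]

gt_a = [(0.0, 0.0), (1.0, 1.0), (2.0, 2.0)]
ade, fde = ade_fde(agent_a, gt_a)
print("ADE:", ade, "FDE:", fde)
print("collision rate (%):", round(collision_rate([agent_a, agent_b, agent_c], 0.2), 2))
```

ADE and FDE score each agent in isolation, so a forecast can look accurate per agent and still place two agents on top of each other; a joint metric such as the collision rate catches that case.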

[131] LiNeXt: Revisiting LiDAR Completion with Efficient Non-Diffusion Architectures

Wenzhe He,Xiaojun Chen,Ruiqi Wang,Ruihui Li,Huilong Pi,Jiapeng Zhang,Zhuo Tang,Kenli Li

Main category: cs.CV

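TL;DR: Proposes LiNeXt, a lightweight non-diffusion network for fast and accurate 3D point cloud completion that clearly outperforms existing diffusion models in speed, accuracy, and parameter count.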

Details Motivation: Existing diffusion-based methods need multi-step iterative sampling, which is computationally expensive and hard to run in real time. Method: A Noise-to-Coarse (N2C) module denoises in a single pass, a Refine module performs finer reconstruction, and a distance-aware sampling strategy evens out the point cloud distribution. Result: On SemanticKITTI, inference is 199.8x faster, Chamfer Distance drops by 50.7%, and the model uses only 6.1% of LiDiff's parameters. Conclusion: LiNeXt greatly improves efficiency while maintaining high reconstruction quality, making it suitable for point cloud completion in real-time autonomous driving perception. Abstract: 3D LiDAR scene completion from point clouds is a fundamental component of perception systems in autonomous vehicles. Previous methods have predominantly employed diffusion models for high-fidelity reconstruction. However, their multi-step iterative sampling incurs significant computational overhead, limiting their real-time applicability. To address this, we propose LiNeXt, a lightweight, non-diffusion network optimized for rapid and accurate point cloud completion. Specifically, LiNeXt first applies the Noise-to-Coarse (N2C) Module to denoise the input noisy point cloud in a single pass, thereby obviating the multi-step iterative sampling of diffusion-based methods. The Refine Module then takes the coarse point cloud and its intermediate features from the N2C Module to perform more precise refinement, further enhancing structural completeness. Furthermore, we observe that LiDAR point clouds exhibit a distance-dependent spatial distribution, being densely sampled at proximal ranges and sparsely sampled at distal ranges. Accordingly, we propose the Distance-aware Selected Repeat strategy to generate a more uniformly distributed noisy point cloud. On the SemanticKITTI dataset, LiNeXt achieves a 199.8x speedup in inference, reduces Chamfer Distance by 50.7%, and uses only 6.1% of the parameters compared with LiDiff. These results demonstrate the superior efficiency and effectiveness of LiNeXt for real-time scene completion.
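
Chamfer Distance, the reconstruction metric quoted above, can be written in a few lines. Several variants exist (squared or unsquared distances, summed or averaged); the sketch below assumes the symmetric, mean-of-squared-distances form and uses a brute-force nearest-neighbor search, so it may not match the exact variant used in the paper.

```python
import math


def chamfer_distance(points_a, points_b):
    """Symmetric Chamfer Distance: for each point, squared distance to its
    nearest neighbor in the other cloud, averaged per cloud and summed."""

    def one_way(src, dst):
        return sum(min(math.dist(p, q) ** 2 for q in dst) for p in src) / len(src)

    return one_way(points_a, points_b) + one_way(points_b, points_a)


# Toy clouds: the second point of the completion is 1 unit too high.
ground_truth = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
completion = [(0.0, 0.0, 0.0), (1.0, 0.0, 1.0)]

print("Chamfer Distance:", chamfer_distance(ground_truth, completion))
print("identical clouds:", chamfer_distance(ground_truth, ground_truth))
```

The brute-force search is quadratic in the number of points, which is fine for an illustration; evaluation code for full LiDAR scans would normally use a spatial index such as a k-d tree.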

[132] HeatV2X: Scalable Heterogeneous Collaborative Perception via Efficient Alignment and Interaction

Yueran Zhao,Zhang Zhang,Chao Sun,Tianze Wang,Chao Yue,Nuoran Li

Main category: cs.CV

TL;DR: Proposes HeatV2X, a scalable vehicle-to-everything collaborative perception framework that uses heterogeneous graph attention and adaptive fine-tuning to address feature alignment and collaboration among multi-modal heterogeneous agents, improving perception performance while reducing training overhead.

Details Motivation: Faced with multi-modal heterogeneous agents and scalability requirements, existing collaborative perception frameworks suffer from difficult feature alignment and impractical full-parameter training, so a more efficient and scalable collaboration mechanism is needed. Method: A high-performance base agent is built on heterogeneous graph attention, and Local Heterogeneous Fine-Tuning (with Hetero-Aware Adapters) and Global Collaborative Fine-Tuning (with the Multi-Cognitive Adapter) are designed to align and fuse features across agents. Result: Evaluated on the OPV2V-H and DAIR-V2X datasets, the method significantly reduces training overhead and improves perception performance compared with existing methods. Conclusion: HeatV2X effectively addresses the heterogeneity and scalability challenges in V2X collaborative perception and offers a practical route to low-cost, high-performance collaborative perception. Abstract: Vehicle-to-Everything (V2X) collaborative perception extends sensing beyond single vehicle limits through transmission. However, as more agents participate, existing frameworks face two key challenges: (1) the participating agents are inherently multi-modal and heterogeneous, and (2) the collaborative framework must be scalable to accommodate new agents. The former requires effective cross-agent feature alignment to mitigate heterogeneity loss, while the latter renders full-parameter training impractical, highlighting the importance of scalable adaptation. To address these issues, we propose Heterogeneous Adaptation (HeatV2X), a scalable collaborative framework. We first train a high-performance agent based on heterogeneous graph attention as the foundation for collaborative learning. Then, we design Local Heterogeneous Fine-Tuning and Global Collaborative Fine-Tuning to achieve effective alignment and interaction among heterogeneous agents. The former efficiently extracts modality-specific differences using Hetero-Aware Adapters, while the latter employs the Multi-Cognitive Adapter to enhance cross-agent collaboration and fully exploit the fusion potential. These designs enable substantial performance improvement of the collaborative framework with minimal training cost. We evaluate our approach on the OPV2V-H and DAIR-V2X datasets. Experimental results demonstrate that our method achieves superior perception performance with significantly reduced training overhead, outperforming existing state-of-the-art approaches. Our implementation will be released soon.

[133] Next-Frame Feature Prediction for Multimodal Deepfake Detection and Temporal Localization

Ashutosh Anshul,Shreyas Gopal,Deepu Rajan,Eng Siong Chng

Main category: cs.CV

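TL;DR: Proposes a single-stage training framework that improves the generalization and temporal localization accuracy of multimodal deepfake detection by introducing uni-modal and cross-modal next-frame prediction together with a window-level attention mechanism.

Details Motivation: Existing methods rely on pretraining and focus mainly on audio-visual inconsistencies, so they generalize poorly to unseen manipulation types and overlook local artifacts in manipulations that preserve audio-visual alignment. Method: Next-frame prediction tasks (both uni-modal and cross-modal) are introduced into single-stage training, and a window-level attention mechanism captures discrepancies between predicted and actual frames to detect local anomalies around each frame. Result: Experiments on multiple benchmark datasets show good generalization and precise temporal localization, with particularly strong results on classifying fully manipulated videos and localizing partially manipulated segments. Conclusion: The proposed method achieves strong generalization without additional pretraining, handles cross-dataset and cross-manipulation detection effectively, and precisely localizes manipulated time segments. Abstract: Recent multimodal deepfake detection methods designed for generalization conjecture that single-stage supervised training struggles to generalize across unseen manipulations and datasets. However, such approaches that target generalization require pretraining over real samples. Additionally, these methods primarily focus on detecting audio-visual inconsistencies and may overlook intra-modal artifacts, causing them to fail against manipulations that preserve audio-visual alignment. To address these limitations, we propose a single-stage training framework that enhances generalization by incorporating next-frame prediction for both uni-modal and cross-modal features. Additionally, we introduce a window-level attention mechanism to capture discrepancies between predicted and actual frames, enabling the model to detect local artifacts around every frame, which is crucial for accurately classifying fully manipulated videos and effectively localizing deepfake segments in partially spoofed samples. Our model, evaluated on multiple benchmark datasets, demonstrates strong generalization and precise temporal localization.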

[134] TubeRMC: Tube-conditioned Reconstruction with Mutual Constraints for Weakly-supervised Spatio-Temporal Video Grounding

Jinxuan Li,Yi Zhang,Jian-Fang Hu,Chaolei Tan,Tianming Liang,Beihao Xia

Main category: cs.CV

TL;DR: Proposes TubeRMC, a new weakly-supervised spatio-temporal video grounding framework that improves the consistency of target identification and tracking through text-conditioned candidate tube generation and tube-conditioned reconstruction under spatio-temporal constraints.

Details Motivation: Existing weakly-supervised STVG methods mostly use simple late fusion, so the candidate tubes they generate are independent of the text description, which leads to failed target identification and inconsistent tracking. Method: Reconstruction strategies are designed from spatio-temporal, spatial, and temporal perspectives. Pre-trained visual grounding models generate text-conditioned candidate tubes, a tube-conditioned reconstructor refines them under spatio-temporal constraints, and mutual constraints between spatial and temporal proposals improve reconstruction quality. Result: The method outperforms existing approaches on the two public benchmarks VidSTG and HCSTVG, and visualizations show that TubeRMC effectively reduces target identification errors and inconsistent tracking. Conclusion: By introducing text-conditioned generation and multi-perspective reconstruction, TubeRMC significantly improves spatio-temporal video grounding in the weakly-supervised setting. Abstract: Spatio-Temporal Video Grounding (STVG) aims to localize a spatio-temporal tube that corresponds to a given language query in an untrimmed video. This is a challenging task since it involves complex vision-language understanding and spatiotemporal reasoning. Recent works have explored the weakly-supervised setting in STVG to eliminate reliance on fine-grained annotations like bounding boxes or temporal stamps. However, they typically follow a simple late-fusion manner, which generates tubes independent of the text description, often resulting in failed target identification and inconsistent target tracking. To address this limitation, we propose a Tube-conditioned Reconstruction with Mutual Constraints (TubeRMC) framework that generates text-conditioned candidate tubes with pre-trained visual grounding models and further refines them via tube-conditioned reconstruction with spatio-temporal constraints. Specifically, we design three reconstruction strategies from temporal, spatial, and spatio-temporal perspectives to comprehensively capture rich tube-text correspondences. Each strategy is equipped with a Tube-conditioned Reconstructor, utilizing spatio-temporal tubes as condition to reconstruct the key clues in the query. We further introduce mutual constraints between spatial and temporal proposals to enhance their quality for reconstruction. TubeRMC outperforms existing methods on two public benchmarks VidSTG and HCSTVG. Further visualization shows that TubeRMC effectively mitigates both target identification errors and inconsistent tracking.

[135] Towards Blind and Low-Vision Accessibility of Lightweight VLMs and Custom LLM-Evals

Shruti Singh Baghel,Yash Pratap Singh Rathore,Sushovan Jena,Anurag Pradhan,Amit Shukla,Arnav Bhavsar,Pawan Goyal

Main category: cs.CV

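TL;DR: This study evaluates vision-language models of different sizes (SmolVLM2) on generating accessible video descriptions for blind and low-vision users, proposes two new evaluation frameworks, and tests the models' real-world deployment performance on a smartphone.

Details Motivation: Large vision-language models are capable but resource-hungry, which makes it hard to provide real-time, context-aware descriptions for blind and low-vision users on mobile devices, so the effectiveness of small models in accessibility scenarios needs study. Method: SmolVLM2 models with 500M and 2.2B parameters are evaluated on the AVCaps and Charades datasets. A Multi-Context BLV Framework and a Navigational Assistance Framework are proposed for quality assessment. Four prompt design strategies are analyzed systematically, and FP32 and INT8 precision variants are tested on a smartphone. Result: With suitable prompts, the smaller model can approach the description quality of the larger one, and INT8 quantization significantly reduces resource consumption while keeping performance usable, indicating that small models have practical potential on mobile devices. Conclusion: With optimized prompt design and quantization, lightweight vision-language models can effectively support the accessibility needs of blind and low-vision users on resource-constrained devices. Future work should focus on customizing and evaluating small models for specific application scenarios. Abstract: Large Vision-Language Models (VLMs) excel at understanding and generating video descriptions but their high memory, computation, and deployment demands hinder practical use, particularly for blind and low-vision (BLV) users who depend on detailed, context-aware descriptions. To study the effect of model size on accessibility-focused description quality, we evaluate SmolVLM2 variants with 500M and 2.2B parameters across two diverse datasets: AVCaps (outdoor), and Charades (indoor). In this work, we introduce two novel evaluation frameworks specifically designed for BLV accessibility assessment: the Multi-Context BLV Framework evaluating spatial orientation, social interaction, action events, and ambience contexts; and the Navigational Assistance Framework focusing on mobility-critical information. Additionally, we conduct a systematic evaluation of four different prompt design strategies and deploy both models on a smartphone, evaluating FP32 and INT8 precision variants to assess real-world performance constraints on resource-limited mobile devices.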

[136] FineSkiing: A Fine-grained Benchmark for Skiing Action Quality Assessment

Yongji Zhang,Siqi Li,Yue Gao,Yu Jiang

Main category: cs.CV

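TL;DR: This paper presents the first aerial skiing Action Quality Assessment (AQA) dataset with fine-grained sub-score and deduction annotations, and proposes JudgeMind, a new method that simulates the scoring mindset of professional referees by combining stage segmentation, stage-aware feature enhancement, and a knowledge-guided decoder, significantly improving AQA performance and reliability.

Details Motivation: Existing AQA methods rely on features from the entire video and lack interpretability and reliability, and existing datasets lack fine-grained score annotations, which makes it hard to study fine-grained scoring mechanisms. Method: The proposed JudgeMind method scores the action video stage by stage, uses a stage-aware feature enhancement and fusion module to improve perception of key regions, and introduces a knowledge-based grade-aware decoder that incorporates deduction items as prior knowledge for score prediction. Result: Experiments on the newly built fine-grained AQA dataset show that the method achieves state-of-the-art performance and significantly improves scoring accuracy and model robustness, especially under frequent camera viewpoint switching. Conclusion: By building a dataset with fine-grained annotations and modeling the referee's mindset, this work provides a more reliable and interpretable technical path and benchmark data for AQA. Abstract: Action Quality Assessment (AQA) aims to evaluate and score sports actions, which has attracted widespread interest in recent years. Existing AQA methods primarily predict scores based on features extracted from the entire video, resulting in limited interpretability and reliability. Meanwhile, existing AQA datasets also lack fine-grained annotations for action scores, especially for deduction items and sub-score annotations. In this paper, we construct the first AQA dataset containing fine-grained sub-score and deduction annotations for aerial skiing, which will be released as a new benchmark. For the technical challenges, we propose a novel AQA method, named JudgeMind, which significantly enhances performance and reliability by simulating the judgment and scoring mindset of professional referees. Our method segments the input action video into different stages and scores each stage to enhance accuracy. Then, we propose a stage-aware feature enhancement and fusion module to boost the perception of stage-specific key regions and enhance the robustness to visual changes caused by frequent camera viewpoint switching. In addition, we propose a knowledge-based grade-aware decoder to incorporate possible deduction items as prior knowledge to predict more accurate and reliable scores. Experimental results demonstrate that our method achieves state-of-the-art performance.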

[137] Facial-R1: Aligning Reasoning and Recognition for Facial Emotion Analysis

Jiulong Wu,Yucheng Shen,Lingyong Yan,Haixin Sun,Deguo Xia,Jizhou Huang,Min Cao

Main category: cs.CV

TL;DR: Proposes Facial-R1, a three-stage alignment framework for Facial Emotion Analysis (FEA) that addresses hallucinated reasoning and the misalignment between emotion recognition and reasoning with minimal supervision, and builds the FEA-20K dataset, achieving state-of-the-art performance on multiple benchmarks.

Details Motivation: Existing vision-language-model-based methods for facial emotion analysis suffer from hallucinated reasoning and from inconsistency between emotion recognition and the reasoning process, and they lack fine-grained, interpretable joint modeling. Method: The three-stage alignment framework Facial-R1 is proposed: 1) instruction fine-tuning establishes basic emotional reasoning capability; 2) reinforcement training with emotion and AU labels as reward signals aligns reasoning with prediction; 3) a data synthesis pipeline expands the dataset through self-iteration. The FEA-20K dataset is also built. Result: Experiments on eight standard benchmarks show that Facial-R1 achieves state-of-the-art performance in facial emotion analysis, with good generalization and strong interpretability. Conclusion: Through three-stage alignment, Facial-R1 effectively addresses the hallucination and inconsistency problems of VLMs in FEA and delivers high-performance, interpretable fine-grained emotion analysis, while FEA-20K provides important data support for the field. Abstract: Facial Emotion Analysis (FEA) extends traditional facial emotion recognition by incorporating explainable, fine-grained reasoning. The task integrates three subtasks: emotion recognition, facial Action Unit (AU) recognition, and AU-based emotion reasoning to model affective states jointly. While recent approaches leverage Vision-Language Models (VLMs) and achieve promising results, they face two critical limitations: (1) hallucinated reasoning, where VLMs generate plausible but inaccurate explanations due to insufficient emotion-specific knowledge; and (2) misalignment between emotion reasoning and recognition, caused by fragmented connections between observed facial features and final labels. We propose Facial-R1, a three-stage alignment framework that effectively addresses both challenges with minimal supervision. First, we employ instruction fine-tuning to establish basic emotional reasoning capability. Second, we introduce reinforcement training guided by emotion and AU labels as reward signals, which explicitly aligns the generated reasoning process with the predicted emotion. Third, we design a data synthesis pipeline that iteratively leverages the prior stages to expand the training dataset, enabling scalable self-improvement of the model. Built upon this framework, we introduce FEA-20K, a benchmark dataset comprising 17,737 training and 1,688 test samples with fine-grained emotion analysis annotations. Extensive experiments across eight standard benchmarks demonstrate that Facial-R1 achieves state-of-the-art performance in FEA, with strong generalization and robust interpretability.

[138] H3Former: Hypergraph-based Semantic-Aware Aggregation via Hyperbolic Hierarchical Contrastive Loss for Fine-Grained Visual Classification

Yongji Zhang,Siqi Li,Kuiyang Huang,Yue Gao,Yu Jiang

Main category: cs.CV

TL;DR: Proposes H3Former, a token-to-region framework based on high-order semantic relations for fine-grained visual classification, which improves performance through hypergraph convolution and a hyperbolic hierarchical contrastive loss.

Details Motivation: Existing methods fall short in capturing discriminative features, often introduce substantial category-agnostic redundancy, and struggle with the subtle inter-class differences and intra-class variations of fine-grained categories. Method: A Semantic-Aware Aggregation Module (SAAM) uses multi-scale context to build a weighted hypergraph and aggregates token features into region representations through hypergraph convolution. A Hyperbolic Hierarchical Contrastive Loss (HHCL) enhances inter-class separability and intra-class consistency in a non-Euclidean space. Result: Experiments on four standard fine-grained classification benchmarks show that the proposed H3Former framework outperforms existing methods. Conclusion: By modeling high-order semantic relations and using hierarchical contrastive learning, H3Former effectively improves fine-grained visual classification performance. Abstract: Fine-Grained Visual Classification (FGVC) remains a challenging task due to subtle inter-class differences and large intra-class variations. Existing approaches typically rely on feature-selection mechanisms or region-proposal strategies to localize discriminative regions for semantic analysis. However, these methods often fail to capture discriminative cues comprehensively while introducing substantial category-agnostic redundancy. To address these limitations, we propose H3Former, a novel token-to-region framework that leverages high-order semantic relations to aggregate local fine-grained representations with structured region-level modeling. Specifically, we propose the Semantic-Aware Aggregation Module (SAAM), which exploits multi-scale contextual cues to dynamically construct a weighted hypergraph among tokens. By applying hypergraph convolution, SAAM captures high-order semantic dependencies and progressively aggregates token features into compact region-level representations. Furthermore, we introduce the Hyperbolic Hierarchical Contrastive Loss (HHCL), which enforces hierarchical semantic constraints in a non-Euclidean embedding space. The HHCL enhances inter-class separability and intra-class consistency while preserving the intrinsic hierarchical relationships among fine-grained categories. Comprehensive experiments conducted on four standard FGVC benchmarks validate the superiority of our H3Former framework.
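
To make the "non-Euclidean embedding space" concrete, the sketch below computes the geodesic distance in the Poincaré ball, one standard model of hyperbolic space. Whether HHCL uses the Poincaré ball, the Lorentz model, or another curvature is not stated in this entry, so the model choice and the function name are assumptions.

```python
import math

def poincare_distance(u, v):
    """Geodesic distance between two points inside the unit Poincaré ball."""
    nu = sum(x * x for x in u)
    nv = sum(x * x for x in v)
    if nu >= 1.0 or nv >= 1.0:
        raise ValueError("points must lie strictly inside the unit ball")
    diff = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.acosh(1.0 + 2.0 * diff / ((1.0 - nu) * (1.0 - nv)))

d_center = poincare_distance((0.0, 0.0), (0.5, 0.0))
d_edge = poincare_distance((0.9, 0.0), (0.9, 0.1))
```

Distances grow rapidly near the boundary of the ball, which gives tree-like hierarchies room to spread out. That property is the usual reason hyperbolic embeddings are paired with hierarchical losses.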

[139] PROPA: Toward Process-level Optimization in Visual Reasoning via Reinforcement Learning

Yanbei Jiang,Chao Lei,Yihao Ding,Krista Ehinger,Jey Han Lau

Main category: cs.CV

TL;DR: This paper proposes PROPA, a reasoning optimization framework for vision-language models that combines Monte Carlo Tree Search with reinforcement learning, generating dense process-level rewards to improve complex visual reasoning and clearly outperforming existing methods on multiple benchmarks.

Details Motivation: Existing vision-language models perform poorly on complex visual reasoning because early errors propagate, and current post-training methods depend on costly human annotation or sparse feedback, which makes it hard to optimize intermediate reasoning steps. Method: The PROPA framework combines Monte Carlo Tree Search (MCTS) with verifiable-reward reinforcement learning (GRPO) to generate dense process-level rewards. It alternates GRPO updates with supervised fine-tuning (SFT) to overcome the cold-start problem and trains a Process Reward Model (PRM) to guide inference-time search. Result: Across seven benchmarks and four vision-language model backbones, PROPA outperforms SFT and RLVR baselines, with gains of up to 17.0% on in-domain tasks and up to 21.0% on out-of-domain tasks. Conclusion: By introducing process-level optimization, PROPA significantly improves the reasoning and generalization of vision-language models without human annotation, offering an effective new paradigm for complex visual reasoning. Abstract: Despite significant progress, Vision-Language Models (VLMs) still struggle with complex visual reasoning, where multi-step dependencies cause early errors to cascade through the reasoning chain. Existing post-training paradigms are limited: Supervised Fine-Tuning (SFT) relies on costly step-level annotations, while Reinforcement Learning with Verifiable Rewards (RLVR) methods like GRPO provide only sparse, outcome-level feedback, hindering stable optimization. We introduce PROPA (Process-level Reasoning Optimization with interleaved Policy Alignment), a novel framework that integrates Monte Carlo Tree Search (MCTS) with GRPO to generate dense, process-level rewards and optimize reasoning at each intermediate step without human annotations. To overcome the cold-start problem, PROPA interleaves GRPO updates with SFT, enabling the model to learn from both successful and failed reasoning trajectories. A Process Reward Model (PRM) is further trained to guide inference-time search, aligning the test-time search with the training signal. Across seven benchmarks and four VLM backbones, PROPA consistently outperforms both SFT- and RLVR-based baselines. It achieves up to 17.0% gains on in-domain tasks and 21.0% gains on out-of-domain tasks compared to existing state-of-the-art, establishing a strong reasoning and generalization capability for visual reasoning tasks. The code is available at: https://github.com/YanbeiJiang/PROPA.
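
The "sparse, outcome-level feedback" attributed to GRPO above can be seen in a minimal sketch of its group-relative advantage: each sampled response gets one scalar, its reward standardized within the group, and that scalar is shared by every reasoning step in the response. This illustrates plain GRPO only, not PROPA's MCTS-based process rewards. The use of the population standard deviation and the `eps` value are assumptions, and implementations differ on both.

```python
import statistics

def group_relative_advantages(rewards, eps=1e-8):
    """Standardize outcome rewards within one group of sampled responses."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # assumed: population std
    return [(r - mean) / (std + eps) for r in rewards]

# Four responses to the same prompt, scored 1 (correct) or 0 (wrong).
adv = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
# A group in which every response gets the same reward yields no signal.
flat = group_relative_advantages([1.0, 1.0, 1.0])
```

Because the advantage does not distinguish a good early step from a bad later one, a process-level reward that scores intermediate steps can provide a denser training signal.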

[140] Adaptive Residual-Update Steering for Low-Overhead Hallucination Mitigation in Large Vision Language Models

Zhengtao Zou,Ya Gao,Jiarui Guan,Bin Li,Pekka Marttinen

Main category: cs.CV

TL;DR: This paper proposes RUDDER, a low-overhead framework that suppresses object hallucination in large vision-language models within a single forward pass through residual updates and an adaptive gating mechanism, balancing effectiveness and computational efficiency.

Details Motivation: Large vision-language models (LVLMs) often hallucinate objects. Existing inference-time interventions are effective but computationally expensive, which makes them hard to use in latency-sensitive settings, so an efficient and practical solution is needed. Method: The RUDDER framework has two core components: (1) the Contextual Activation Residual Direction (CARD) vector, a per-sample visual evidence vector extracted from the residual update of a self-attention layer; and (2) a Bayesian-inspired adaptive gate that adjusts the strength of the token-wise corrective signal according to how far the model deviates from the visual context. The whole process needs only one standard forward pass. Result: Experiments on key hallucination benchmarks such as POPE and CHAIR show that RUDDER performs comparably to state-of-the-art methods while adding negligible computational latency. Conclusion: RUDDER is a practical and effective approach to mitigating LVLM hallucination, improving the visual consistency of generated output and model reliability without a significant loss of efficiency. Abstract: Large Vision-Language Models (LVLMs) often suffer from object hallucination, generating text inconsistent with visual inputs, which can critically undermine their reliability. Existing inference-time interventions to mitigate this issue present a challenging trade-off: while methods that steer internal states or adjust output logits can be effective, they often incur substantial computational overhead, typically requiring extra forward passes. This efficiency bottleneck can limit their practicality for real-world, latency-sensitive deployments. In this work, we aim to address this trade-off with Residual-Update Directed DEcoding Regulation (RUDDER), a low-overhead framework that steers LVLMs towards visually-grounded generation. RUDDER is built on two key innovations: (1) Contextual Activation Residual Direction (CARD) vector, a per-sample visual evidence vector extracted from the residual update of a self-attention layer during a single, standard forward pass. (2) A Bayesian-inspired adaptive gate that performs token-wise injection, applying a corrective signal whose strength is conditioned on the model's deviation from the visual context. Extensive experiments on key hallucination benchmarks, including POPE and CHAIR, indicate that RUDDER achieves performance comparable to state-of-the-art methods while introducing negligible computational latency, validating RUDDER as a pragmatic and effective approach for improving LVLMs' reliability without a significant compromise on efficiency.

[141] Generalizable Slum Detection from Satellite Imagery with Mixture-of-Experts

Sumin Lee,Sungwon Park,Jeasurk Yang,Jihee Kim,Meeyoung Cha

Main category: cs.CV

TL;DR: Proposes the GRAM framework, which uses a large-scale satellite imagery dataset and a two-phase test-time adaptation method to achieve robust slum segmentation without labeled data from target regions.

Details Motivation: The morphological heterogeneity of slums causes existing models to generalize poorly to unseen regions. Method: A million-scale, cross-continent satellite imagery dataset is built, and a Mixture-of-Experts architecture captures region-specific characteristics on top of a shared backbone. In two-phase test-time adaptation, prediction consistency across experts filters out unreliable pseudo-labels, enabling unsupervised domain adaptation. Result: GRAM outperforms state-of-the-art baselines in low-resource cities such as those in Africa and generalizes effectively to unseen regions. Conclusion: GRAM offers a scalable, label-efficient solution for global slum mapping and supports data-driven urban planning. Abstract: Satellite-based slum segmentation holds significant promise in generating global estimates of urban poverty. However, the morphological heterogeneity of informal settlements presents a major challenge, hindering the ability of models trained on specific regions to generalize effectively to unseen locations. To address this, we introduce a large-scale high-resolution dataset and propose GRAM (Generalized Region-Aware Mixture-of-Experts), a two-phase test-time adaptation framework that enables robust slum segmentation without requiring labeled data from target regions. We compile a million-scale satellite imagery dataset from 12 cities across four continents for source training. Using this dataset, the model employs a Mixture-of-Experts architecture to capture region-specific slum characteristics while learning universal features through a shared backbone. During adaptation, prediction consistency across experts filters out unreliable pseudo-labels, allowing the model to generalize effectively to previously unseen regions. GRAM outperforms state-of-the-art baselines in low-resource settings such as African cities, offering a scalable and label-efficient solution for global slum mapping and data-driven urban planning.
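
The idea of filtering pseudo-labels by prediction consistency across experts can be sketched as a per-pixel agreement vote. This is an illustration only: GRAM's actual consistency criterion is not described in this entry, and the `min_agreement` threshold, the binary labels, and the function name are hypothetical.

```python
def consistent_pseudo_labels(expert_preds, min_agreement=1.0):
    """Keep a pixel's majority label only if enough experts agree on it.

    expert_preds: one list of binary per-pixel labels per expert.
    Returns a list with 0/1 for kept pixels and None for rejected ones.
    """
    n_experts = len(expert_preds)
    labels = []
    for votes in zip(*expert_preds):
        ones = sum(votes)
        majority = 1 if ones * 2 >= n_experts else 0
        agreement = max(ones, n_experts - ones) / n_experts
        labels.append(majority if agreement >= min_agreement else None)
    return labels

experts = [[1, 0, 1, 0], [1, 0, 0, 0], [1, 0, 1, 1]]
strict = consistent_pseudo_labels(experts)                     # unanimous only
loose = consistent_pseudo_labels(experts, min_agreement=0.6)   # 2 of 3 suffices
```

Pixels marked `None` would simply be excluded from the adaptation loss, so the model only trains on target-region labels that several region-specific experts independently support.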

[142] Rethinking Visual Information Processing in Multimodal LLMs

Dongwan Kim,Viresh Ranjan,Takashi Nagata,Arnab Dhua,Amit Kumar K C

Main category: cs.CV

TL;DR: Proposes LLaViT, which lets the large language model also act as a vision encoder, improving vision-language modeling performance through three key modifications.

Details Motivation: The LLaVA architecture suffers from a modality mismatch when integrating visual features. Method: The approach introduces separate QKV projections for the vision modality, bidirectional attention on visual tokens, and both global and local visual representations. Result: LLaViT significantly outperforms LLaVA on multiple benchmarks and even surpasses models with twice its parameter count. Conclusion: LLaViT provides a more effective approach to vision-language modeling. Abstract: Despite the remarkable success of the LLaVA architecture for vision-language tasks, its design inherently struggles to effectively integrate visual features due to the inherent mismatch between text and vision modalities. We tackle this issue from a novel perspective in which the LLM not only serves as a language model but also a powerful vision encoder. To this end, we present LLaViT - Large Language Models as extended Vision Transformers - which enables the LLM to simultaneously function as a vision encoder through three key modifications: (1) learning separate QKV projections for vision modality, (2) enabling bidirectional attention on visual tokens, and (3) incorporating both global and local visual representations. Through extensive controlled experiments on a wide range of LLMs, we demonstrate that LLaViT significantly outperforms the baseline LLaVA method on a multitude of benchmarks, even surpassing models with double its parameter count, establishing a more effective approach to vision-language modeling.
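
Modification (2), bidirectional attention on visual tokens, amounts to a change in the attention mask. The sketch below builds a boolean mask in which text tokens stay causal while visual tokens may also attend to later visual tokens. It assumes the visual tokens form one contiguous block and is not taken from the LLaViT implementation; the function name is hypothetical.

```python
def build_attention_mask(is_visual):
    """mask[q][k] is True when query position q may attend to key position k.

    is_visual: list of booleans, one per token position.
    Assumes a single contiguous block of visual tokens; with interleaved
    images the visual-to-visual rule would need a per-image restriction.
    """
    n = len(is_visual)
    mask = [[False] * n for _ in range(n)]
    for q in range(n):
        for k in range(n):
            causal = k <= q
            both_visual = is_visual[q] and is_visual[k]
            mask[q][k] = causal or both_visual
    return mask

# Two image tokens followed by two text tokens.
mask = build_attention_mask([True, True, False, False])
```

With this mask the image tokens see each other in both directions, as in a vision transformer, while text generation remains autoregressive.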

[143] Revisiting Evaluation of Deep Neural Networks for Pedestrian Detection

Patrick Feifel,Benedikt Franke,Frank Bonarens,Frank Köster,Arne Raulf,Friedhelm Schwenker

Main category: cs.CV

TL;DR: Proposes a fine-grained evaluation method based on image segmentation, defines eight categories of pedestrian detection errors with new metrics, and achieves SOTA performance on the CityPersons dataset.

Details Motivation: Existing pedestrian detection metrics do not realistically reflect DNN performance and lack fine-grained error analysis for safety-critical scenarios. Method: Using image segmentation information, eight pedestrian detection error categories are proposed along with new evaluation metrics, and different backbones are compared within a simplified APD framework. Result: The approach reaches SOTA on the CityPersons-reasonable subset (without extra training data), and the new metrics allow a more fine-grained and robust model comparison. Conclusion: The proposed fine-grained evaluation method measures the safety-critical performance of pedestrian detection models more accurately and supports reliable evaluation of automated driving systems. Abstract: Reliable pedestrian detection represents a crucial step towards automated driving systems. However, the current performance benchmarks exhibit weaknesses. The currently applied metrics for various subsets of a validation dataset prohibit a realistic performance evaluation of a DNN for pedestrian detection. As image segmentation supplies fine-grained information about a street scene, it can serve as a starting point to automatically distinguish between different types of errors during the evaluation of a pedestrian detector. In this work, eight different error categories for pedestrian detection are proposed and new metrics are proposed for performance comparison along these error categories. We use the new metrics to compare various backbones for a simplified version of the APD, and show a more fine-grained and robust way to compare models with each other, especially in terms of safety-critical performance. We achieve SOTA on CityPersons-reasonable (without extra training data) by using a rather simple architecture.

[144] CLIP4VI-ReID: Learning Modality-shared Representations via CLIP Semantic Bridge for Visible-Infrared Person Re-identification

Xiaomei Yang,Xizhan Gao,Sijie Niu,Fa Zhu,Guang Feng,Xiaofeng Qu,David Camacho

Main category: cs.CV

TL;DR: This paper proposes CLIP4VI-ReID, a CLIP-based modality-shared representation learning network for visible-infrared person re-identification (VI-ReID) that achieves cross-modal alignment by using text semantics as a bridge and outperforms existing methods on multiple datasets.

Details Motivation: Visible and infrared images differ greatly in their physical characteristics, which makes direct cross-modal alignment difficult, so an effective mechanism is needed to narrow the modality gap and improve the discriminability of shared representations. Method: CLIP4VI-ReID comprises three modules: Text Semantic Generation (TSG) generates text semantics for visible images to achieve visible-text alignment; Infrared Feature Embedding (IFE) uses the text semantics to rectify infrared features; and High-level Semantic Alignment (HSA) refines high-level semantic alignment so that the text semantics contain only identity-related information. Result: Experiments on several mainstream VI-ReID datasets show that CLIP4VI-ReID outperforms current state-of-the-art methods. Conclusion: By using text as a bridge and a staged alignment strategy, CLIP4VI-ReID effectively improves shared representation learning between the visible and infrared modalities and achieves more accurate cross-modal alignment. Abstract: This paper proposes a novel CLIP-driven modality-shared representation learning network named CLIP4VI-ReID for the VI-ReID task, which consists of Text Semantic Generation (TSG), Infrared Feature Embedding (IFE), and High-level Semantic Alignment (HSA). Specifically, considering the huge gap in the physical characteristics between natural images and infrared images, the TSG is designed to generate text semantics only for visible images, thereby enabling preliminary visible-text modality alignment. Then, the IFE is proposed to rectify the feature embeddings of infrared images using the generated text semantics. This process injects id-related semantics into the shared image encoder, enhancing its adaptability to the infrared modality. Besides, with text serving as a bridge, it enables indirect visible-infrared modality alignment. Finally, the HSA is established to refine the high-level semantic alignment. This process ensures that the fine-tuned text semantics only contain id-related information, thereby achieving more accurate cross-modal alignment and enhancing the discriminability of the learned modal-shared representations. Extensive experimental results demonstrate that the proposed CLIP4VI-ReID achieves superior performance to other state-of-the-art methods on some widely used VI-ReID datasets.

[145] Depth-Consistent 3D Gaussian Splatting via Physical Defocus Modeling and Multi-View Geometric Supervision

Yu Deng,Baozhu Zhao,Junyan Su,Xiaohan Zhang,Qi Liu

Main category: cs.CV

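TL;DR: This paper proposes a new computational framework combining depth-of-field supervision and multi-view consistency supervision to improve 3D Gaussian Splatting reconstruction quality in scenes with complex depth variation, achieving a 0.8 dB PSNR gain over existing methods on the Waymo dataset.

Details Motivation: Existing methods cannot simultaneously resolve inaccurate depth estimation in distant regions and structural degradation in close-range regions, and they lack consistent supervisory signals in scenes with extreme depth variation. Method: Two core components are proposed: 1) depth-of-field supervision, which uses a monocular depth estimator to generate depth priors, synthesizes physically accurate blurred images through defocus convolution, and applies a depth-of-field loss to improve depth fidelity in far and near regions; 2) multi-view consistency supervision, which uses LoFTR for semi-dense feature matching, minimizes cross-view geometric errors, and constrains depth consistency through least squares optimization over reliable matched points. Result: On the Waymo Open Dataset the method achieves reconstruction performance 0.8 dB PSNR higher than the current state-of-the-art method, significantly improving depth accuracy and structural integrity in far-field and near-field regions. Conclusion: By combining physical imaging principles with learning-based depth regularization, the framework offers a scalable and efficient 3D reconstruction solution for urban scenes with complex depth stratification. Abstract: Three-dimensional reconstruction in scenes with extreme depth variations remains challenging due to inconsistent supervisory signals between near-field and far-field regions. Existing methods fail to simultaneously address inaccurate depth estimation in distant areas and structural degradation in close-range regions. This paper proposes a novel computational framework that integrates depth-of-field supervision and multi-view consistency supervision to advance 3D Gaussian Splatting. Our approach comprises two core components: (1) Depth-of-field Supervision employs a scale-recovered monocular depth estimator (e.g., Metric3D) to generate depth priors, leverages defocus convolution to synthesize physically accurate defocused images, and enforces geometric consistency through a novel depth-of-field loss, thereby enhancing depth fidelity in both far-field and near-field regions; (2) Multi-View Consistency Supervision employs LoFTR-based semi-dense feature matching to minimize cross-view geometric errors and enforce depth consistency via least squares optimization of reliable matched points. By unifying defocus physics with multi-view geometric constraints, our method achieves superior depth fidelity, demonstrating a 0.8 dB PSNR improvement over the state-of-the-art method on the Waymo Open Dataset. This framework bridges physical imaging principles and learning-based depth regularization, offering a scalable solution for complex depth stratification in urban environments.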
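
The link between depth and defocus blur that depth-of-field supervision relies on can be illustrated with the standard thin-lens circle-of-confusion formula. This is a textbook sketch, not the paper's defocus convolution; the function name and the example lens parameters are assumptions.

```python
def circle_of_confusion(depth, focus_dist, focal_len, f_number):
    """Blur-circle diameter on the sensor for a point at `depth` (thin lens).

    All lengths share one unit (meters here). The aperture diameter is
    focal_len / f_number.
    """
    if depth <= 0 or focus_dist <= focal_len:
        raise ValueError("need depth > 0 and focus_dist > focal_len")
    aperture = focal_len / f_number
    return (aperture * abs(depth - focus_dist) / depth
            * focal_len / (focus_dist - focal_len))

# Hypothetical 50 mm f/2 lens focused at 10 m.
in_focus = circle_of_confusion(10.0, 10.0, 0.05, 2.0)
at_20m = circle_of_confusion(20.0, 10.0, 0.05, 2.0)
at_100m = circle_of_confusion(100.0, 10.0, 0.05, 2.0)
```

Because the blur size is a known function of depth, a rendered depth map that disagrees with the true geometry produces a visibly different defocused image, which is what lets a defocus-based loss supervise depth.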

[146] Learning to Tell Apart: Weakly Supervised Video Anomaly Detection via Disentangled Semantic Alignment

Wenti Yin,Huaxin Zhang,Xiang Wang,Yuqing Lu,Yicheng Zhang,Bingquan Gong,Jialong Zuo,Li Yu,Changxin Gao,Nong Sang

Main category: cs.CV

TL;DR: Proposes a new Disentangled Semantic Alignment Network (DSANet) that improves weakly-supervised video anomaly detection by separating normal and abnormal features at both coarse-grained and fine-grained levels.

Details Motivation: Existing methods tend to detect only the most salient segments and neglect mining diverse normal patterns, and similar appearance causes category confusion that hurts fine-grained classification. Method: At the coarse-grained level, a self-guided normality modeling branch reconstructs video features using learned normal prototypes. At the fine-grained level, a decoupled contrastive semantic alignment mechanism decomposes each video into event-centric and background-centric components and applies visual-language contrastive learning to strengthen class-discriminative representations. Result: Experiments on the two standard datasets XD-Violence and UCF-Crime show that DSANet outperforms existing state-of-the-art methods. Conclusion: DSANet effectively separates normal and abnormal features, improving anomaly detection accuracy and fine-grained classification. Abstract: Recent advancements in weakly-supervised video anomaly detection have achieved remarkable performance by applying the multiple instance learning paradigm based on multimodal foundation models such as CLIP to highlight anomalous instances and classify categories. However, their objectives may tend to detect the most salient response segments, while neglecting to mine diverse normal patterns separated from anomalies, and are prone to category confusion due to similar appearance, leading to unsatisfactory fine-grained classification results. Therefore, we propose a novel Disentangled Semantic Alignment Network (DSANet) to explicitly separate abnormal and normal features from coarse-grained and fine-grained aspects, enhancing the distinguishability. Specifically, at the coarse-grained level, we introduce a self-guided normality modeling branch that reconstructs input video features under the guidance of learned normal prototypes, encouraging the model to exploit normality cues inherent in the video, thereby improving the temporal separation of normal patterns and anomalous events. At the fine-grained level, we present a decoupled contrastive semantic alignment mechanism, which first temporally decomposes each video into event-centric and background-centric components using frame-level anomaly scores and then applies visual-language contrastive learning to enhance class-discriminative representations. Comprehensive experiments on two standard benchmarks, namely XD-Violence and UCF-Crime, demonstrate that DSANet outperforms existing state-of-the-art methods.

[147] FOUND: Fourier-based von Mises Distribution for Robust Single Domain Generalization in Object Detection

Mengzhu Wang,Changyuan Deng,Shanshan Wang,Nan Yin,Long Lan,Liang Yang

Main category: cs.CV

TL;DR: This paper proposes a CLIP-guided framework combining the von Mises-Fisher distribution and the Fourier transform to improve the robustness and cross-domain adaptability of single-domain generalized object detection.

Details Motivation: Existing single-domain generalization methods ignore the structure of feature distributions and frequency-domain characteristics, which limits generalization to unseen domains. Method: A vMF distribution models directional features, and a Fourier-based augmentation strategy perturbs amplitude and phase to simulate domain shifts. Result: Experiments on an adverse-weather driving benchmark show that the method outperforms current state-of-the-art single-domain generalized object detection methods. Conclusion: The proposed framework effectively improves feature robustness and semantic consistency and significantly strengthens generalization to unseen domains. Abstract: Single Domain Generalization (SDG) for object detection aims to train a model on a single source domain that can generalize effectively to unseen target domains. While recent methods like CLIP-based semantic augmentation have shown promise, they often overlook the underlying structure of feature distributions and frequency-domain characteristics that are critical for robustness. In this paper, we propose a novel framework that enhances SDG object detection by integrating the von Mises-Fisher (vMF) distribution and Fourier transformation into a CLIP-guided pipeline. Specifically, we model the directional features of object representations using vMF to better capture domain-invariant semantic structures in the embedding space. Additionally, we introduce a Fourier-based augmentation strategy that perturbs amplitude and phase components to simulate domain shifts in the frequency domain, further improving feature robustness. Our method not only preserves the semantic alignment benefits of CLIP but also enriches feature diversity and structural consistency across domains. Extensive experiments on the diverse weather-driving benchmark demonstrate that our approach outperforms the existing state-of-the-art method.
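
A Fourier-based augmentation that perturbs amplitude and phase can be sketched on a 1D signal with a naive DFT. This only illustrates the general technique: the paper works on image or feature tensors, and its perturbation scheme is not specified in this entry. The uniform noise model, the parameter names, and the choice to keep only the real part of the inverse transform are assumptions.

```python
import cmath
import random

def dft(x):
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]

def idft(spec):
    n = len(spec)
    return [sum(spec[k] * cmath.exp(2j * cmath.pi * k * t / n) for k in range(n)) / n
            for t in range(n)]

def fourier_perturb(signal, amp_scale=0.1, phase_jitter=0.0, seed=0):
    """Randomly rescale amplitudes and shift phases, then transform back.

    Independent per-bin noise breaks conjugate symmetry, so the inverse
    transform is complex in general; this sketch keeps only its real part.
    """
    rng = random.Random(seed)
    perturbed = []
    for coeff in dft(signal):
        amp, phase = cmath.polar(coeff)
        amp *= 1.0 + rng.uniform(-amp_scale, amp_scale)
        phase += rng.uniform(-phase_jitter, phase_jitter)
        perturbed.append(cmath.rect(amp, phase))
    return [v.real for v in idft(perturbed)]

signal = [1.0, 2.0, 3.0, 4.0]
unchanged = fourier_perturb(signal, amp_scale=0.0, phase_jitter=0.0)
augmented = fourier_perturb(signal, amp_scale=0.5, phase_jitter=0.2)
```

Amplitude is commonly associated with low-level style such as contrast and illumination, and phase with structure, so perturbing them separately is one way to imitate a change of domain without new data.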

[148] DermAI: Clinical dermatology acquisition through quality-driven image collection for AI classification in mobile

Thales Bezerra,Emanoel Thyago,Kelvin Cunha,Rodrigo Abreu,Fábio Papais,Francisco Mauro,Natália Lopes,Érico Medeiros,Jéssica Guido,Shirley Cruz,Paulo Borba,Tsang Ing Ren

Main category: cs.CV

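TL;DR: DermAI is a lightweight smartphone-based application for real-time capture, annotation, and classification of skin lesions during routine consultations, and it underscores the importance of standardized, diverse data collection.

Details Motivation: Adoption of AI in dermatology is limited by biased datasets, variable image quality, and insufficient validation. Method: An application named DermAI was developed that supports on-device quality checks and local model adaptation, and it was trained and tested on a clinical dataset covering a range of skin tones, ethnicities, and source devices. Result: In preliminary experiments, models trained on public datasets did not generalize well to the new samples, while fine-tuning on local data improved performance markedly. Conclusion: Standardized and diverse data collection is essential for meeting healthcare needs and advancing machine learning development. Abstract: AI-based dermatology adoption remains limited by biased datasets, variable image quality, and limited validation. We introduce DermAI, a lightweight, smartphone-based application that enables real-time capture, annotation, and classification of skin lesions during routine consultations. Unlike prior dermoscopy-focused tools, DermAI performs on-device quality checks and local model adaptation. The DermAI clinical dataset encompasses a wide range of skin tones, ethnicities, and source devices. In preliminary experiments, models trained on public datasets failed to generalize to our samples, while fine-tuning with local data improved performance. These results highlight the importance of standardized, diverse data collection aligned with healthcare needs and oriented to machine learning development.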

[149] SHRUG-FM: Reliability-Aware Foundation Models for Earth Observation

Kai-Hendrik Cohrs,Zuzanna Osika,Maria Gonzalez-Calabuig,Vishal Nedungadi,Ruben Cartuyvels,Steffen Knoblauch,Joppe Massant,Shruti Nath,Patrick Ebel,Vasileios Sitokonstantinou

Main category: cs.CV

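TL;DR: SHRUG-FM is a reliability-aware prediction framework for geospatial foundation models in Earth observation. It combines out-of-distribution detection in the input and embedding spaces with task-specific predictive uncertainty to make models more interpretable and safer in climate-sensitive applications.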

Details Motivation: Existing geospatial foundation models perform unreliably in environments underrepresented during pretraining and offer no way to identify or explain their failure modes. Method: Proposes the SHRUG-FM framework, which fuses three signals (input-space OOD detection, embedding-space OOD detection, and task-specific predictive uncertainty) and analyzes the geographic distribution of failures using land cover attributes from HydroATLAS. Result: On burn scar segmentation, OOD scores correlate with lower performance in specific environments (such as low-elevation zones and large river areas), and uncertainty-based flags filter out many poor predictions, showing that failures are not random but likely stem from underrepresentation in the data. Conclusion: SHRUG-FM helps narrow the gap between benchmark performance and real-world reliability and offers an interpretable path toward safer deployment of GFMs. Abstract: Geospatial foundation models for Earth observation often fail to perform reliably in environments underrepresented during pretraining. We introduce SHRUG-FM, a framework for reliability-aware prediction that integrates three complementary signals: out-of-distribution (OOD) detection in the input space, OOD detection in the embedding space and task-specific predictive uncertainty. Applied to burn scar segmentation, SHRUG-FM shows that OOD scores correlate with lower performance in specific environmental conditions, while uncertainty-based flags help discard many poorly performing predictions. Linking these flags to land cover attributes from HydroATLAS shows that failures are not random but concentrated in certain geographies, such as low-elevation zones and large river areas, likely due to underrepresentation in pretraining data. SHRUG-FM provides a pathway toward safer and more interpretable deployment of GFMs in climate-sensitive applications, helping bridge the gap between benchmark performance and real-world reliability.

[150] MSGNav: Unleashing the Power of Multi-modal 3D Scene Graph for Zero-Shot Embodied Navigation

Xun Huang,Shijia Zhao,Yunxiang Wang,Xin Lu,Wanfa Zhang,Rongsheng Qu,Weixin Li,Yunhong Wang,Chenglu Wen

Main category: cs.CV

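TL;DR: Proposes MSGNav, a zero-shot navigation system built on a Multi-modal 3D Scene Graph (M3DSG). By preserving visual cues and supporting open vocabulary and closed-loop reasoning, it achieves state-of-the-art performance on the GOAT-Bench and HM3D-OVON datasets.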

Details Motivation: Existing zero-shot embodied navigation methods compress visual information into text-only relations when building 3D scene graphs, which loses visual evidence, raises construction cost, and constrains the vocabulary, so they fall short of the open-vocabulary, low-training-overhead requirements of real-world deployment. Method: Proposes the Multi-modal 3D Scene Graph (M3DSG), which replaces textual edges with images to preserve visual cues. MSGNav is built on top of it and includes key subgraph selection, adaptive vocabulary update, and closed-loop reasoning modules, plus a visibility-based viewpoint decision module that addresses the last-mile problem. Result: MSGNav achieves state-of-the-art zero-shot navigation performance on both the GOAT-Bench and HM3D-OVON benchmarks. Conclusion: By preserving multi-modal information and strengthening reasoning, MSGNav improves zero-shot embodied navigation and offers an effective approach to open-vocabulary, low-overhead navigation in real-world scenes. Abstract: Embodied navigation is a fundamental capability for robotic agents. Real-world deployment requires open vocabulary generalization and low training overhead, motivating zero-shot methods rather than task-specific RL training. However, existing zero-shot methods that build explicit 3D scene graphs often compress rich visual observations into text-only relations, leading to high construction cost, irreversible loss of visual evidence, and constrained vocabularies. To address these limitations, we introduce the Multi-modal 3D Scene Graph (M3DSG), which preserves visual cues by replacing textual relational edges with dynamically assigned images. Built on M3DSG, we propose MSGNav, a zero-shot navigation system that includes a Key Subgraph Selection module for efficient reasoning, an Adaptive Vocabulary Update module for open vocabulary support, and a Closed-Loop Reasoning module for accurate exploration reasoning. Additionally, we identify the last-mile problem in zero-shot navigation, namely determining the feasible target location with a suitable final viewpoint, and propose a Visibility-based Viewpoint Decision module to explicitly resolve it. Comprehensive experimental results demonstrate that MSGNav achieves state-of-the-art performance on GOAT-Bench and HM3D-OVON datasets. The open-source code will be publicly available.

[151] Fragile by Design: On the Limits of Adversarial Defenses in Personalized Generation

Zhen Chen,Yi Zhang,Xiangyu Yin,Chengxuan Qin,Xingyu Zhao,Xiaowei Huang,Wenjie Ruan

Main category: cs.CV

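TL;DR: Existing defenses against facial identity leakage in personalized generation models have clear weaknesses. The paper proposes a new evaluation framework, AntiDB_Purify, which exposes how fragile current methods are under image purification attacks and underscores the need for more robust and imperceptible privacy protection.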

Details Motivation: Personalized AI applications such as DreamBooth can leak users' facial identity. Existing defenses such as Anti-DreamBooth try to protect privacy with adversarial perturbations, but these produce perceptible artifacts and are easily filtered out, so they lack practical effectiveness. Method: Proposes an evaluation framework, AntiDB_Purify, that systematically tests the robustness of existing defenses under realistic threats, including traditional image filtering and adversarial purification. Result: None of the existing defenses maintains its protective effectiveness after simple filtering or purification; the model regains its ability to memorize and reproduce user identity. Conclusion: Current defenses offer a false sense of security. Future work needs more robust and visually imperceptible privacy protection that withstands purification attacks in real-world settings. Abstract: Personalized AI applications such as DreamBooth enable the generation of customized content from user images, but also raise significant privacy concerns, particularly the risk of facial identity leakage. Recent defense mechanisms like Anti-DreamBooth attempt to mitigate this risk by injecting adversarial perturbations into user photos to prevent successful personalization. However, we identify two critical yet overlooked limitations of these methods. First, the adversarial examples often exhibit perceptible artifacts such as conspicuous patterns or stripes, making them easily detectable as manipulated content. Second, the perturbations are highly fragile, as even a simple, non-learned filter can effectively remove them, thereby restoring the model's ability to memorize and reproduce user identity. To investigate this vulnerability, we propose a novel evaluation framework, AntiDB_Purify, to systematically evaluate existing defenses under realistic purification threats, including both traditional image filters and adversarial purification. Results reveal that none of the current methods maintains their protective effectiveness under such threats. These findings highlight that current defenses offer a false sense of security and underscore the urgent need for more imperceptible and robust protections to safeguard user identity in personalized generation.

[152] SAMIRO: Spatial Attention Mutual Information Regularization with a Pre-trained Model as Oracle for Lane Detection

Hyunjong Lee,Jangho Lee,Jaekoo Lee

Main category: cs.CV

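TL;DR: Proposes SAMIRO, a lane detection method that uses a pre-trained model and spatial attention mutual information regularization to improve existing models in challenging environments, validated on several major datasets.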

Details Motivation: Background clutter, varying illumination, and occlusions in real environments make lane detection difficult, and data-driven approaches face high data collection and annotation costs, so contextual and global information must be used effectively to improve detection. Method: Proposes SAMIRO, which uses a pre-trained model as an Oracle and introduces spatial attention mutual information regularization to preserve domain-agnostic spatial information; it can be plugged into various state-of-the-art lane detection models. Result: Experiments on major benchmarks including CULane, Tusimple, and LLAMAS show that SAMIRO consistently improves performance across models and datasets. Conclusion: Through knowledge transfer and preservation of spatial information, SAMIRO improves the robustness and generality of lane detection and serves as an effective plug-and-play enhancement module. Abstract: Lane detection is an important topic in future mobility solutions. Real-world environmental challenges such as background clutter, varying illumination, and occlusions pose significant obstacles to effective lane detection, particularly when relying on data-driven approaches that require substantial effort and cost for data collection and annotation. To address these issues, lane detection methods must leverage contextual and global information from surrounding lanes and objects. In this paper, we propose a Spatial Attention Mutual Information Regularization with a pre-trained model as an Oracle, called SAMIRO. SAMIRO enhances lane detection performance by transferring knowledge from a pretrained model while preserving domain-agnostic spatial information. Leveraging SAMIRO's plug-and-play characteristic, we integrate it into various state-of-the-art lane detection approaches and conduct extensive experiments on major benchmarks such as CULane, Tusimple, and LLAMAS. The results demonstrate that SAMIRO consistently improves performance across different models and datasets. The code will be made available upon publication.

[153] Physics informed Transformer-VAE for biophysical parameter estimation: PROSAIL model inversion in Sentinel-2 imagery

Prince Mensah,Pelumi Victor Aderinto,Ibrahim Salihu Yusuf,Arnu Pretorius

Main category: cs.CV

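TL;DR: Proposes a physics-informed Transformer-VAE architecture that inverts the PROSAIL model on Sentinel-2 data to estimate vegetation canopy parameters simultaneously. Trained only on simulated data, it matches state-of-the-art methods trained on real imagery.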

Details Motivation: Accurate retrieval of vegetation biophysical parameters from satellite imagery is essential for ecosystem monitoring and agricultural management, but existing methods mostly depend on real imagery for training, which limits broad application. Method: Proposes a Transformer-VAE architecture that uses the PROSAIL radiative transfer model as a differentiable physical decoder and is trained only on simulated data, retrieving leaf area index (LAI) and canopy chlorophyll content (CCC) without in-situ labels. Result: On the FRM4Veg and BelSAR real-world field datasets, retrieval accuracy is comparable to state-of-the-art methods trained on real imagery, with no in-situ labels or calibration on real images. Conclusion: By combining a physical model with deep learning, the method delivers efficient, self-supervised, and physically consistent retrieval of vegetation parameters, offering a low-cost solution for large-scale vegetation remote sensing. Abstract: Accurate retrieval of vegetation biophysical variables from satellite imagery is crucial for ecosystem monitoring and agricultural management. In this work, we propose a physics-informed Transformer-VAE architecture to invert the PROSAIL radiative transfer model for simultaneous estimation of key canopy parameters from Sentinel-2 data. Unlike previous hybrid approaches that require real satellite images for self-supervised training, our model is trained exclusively on simulated data, yet achieves performance on par with state-of-the-art methods that utilize real imagery. The Transformer-VAE incorporates the PROSAIL model as a differentiable physical decoder, ensuring that inferred latent variables correspond to physically plausible leaf and canopy properties. We demonstrate retrieval of leaf area index (LAI) and canopy chlorophyll content (CCC) on real-world field datasets (FRM4Veg and BelSAR) with accuracy comparable to models trained with real Sentinel-2 data. Our method requires no in-situ labels or calibration on real images, offering a cost-effective and self-supervised solution for global vegetation monitoring. The proposed approach illustrates how integrating physical models with advanced deep networks can improve the inversion of RTMs, opening new prospects for large-scale, physically-constrained remote sensing of vegetation traits.

[154] MonkeyOCR v1.5 Technical Report: Unlocking Robust Document Parsing for Complex Patterns

Jiarui Zhang,Yuliang Liu,Zijun Wu,Guosheng Pang,Zhili Ye,Yupei Zhong,Junteng Ma,Tao Wei,Haiyang Xu,Weikai Chen,Zeen Wang,Qiangjun Ji,Fanxi Zhou,Qi Zhang,Yuanrui Hu,Jiahao Liu,Zhang Li,Ziyang Zhang,Qiang Liu,Xiang Bai

Main category: cs.CV

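TL;DR: MonkeyOCR v1.5 is a unified vision-language framework that improves layout understanding and content recognition for complex documents through a two-stage parsing pipeline.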

Details Motivation: Existing OCR systems still struggle with real-world documents that have complex layouts, multi-level tables, embedded images or formulas, and cross-page structures. Method: Uses a two-stage parsing pipeline. The first stage applies a large multimodal model to jointly predict document layout and reading order. The second stage performs localized recognition of text, formulas, and tables within detected regions, with a visual consistency-based reinforcement learning scheme to improve table parsing and two specialized modules for tables with embedded images and for merging tables that cross pages. Result: On OmniDocBench v1.5, MonkeyOCR v1.5 outperforms PPOCR-VL and MinerU 2.5 and is especially robust on visually complex documents. Conclusion: MonkeyOCR v1.5 improves the accuracy and robustness of complex document parsing and advances document intelligence. Abstract: Document parsing is a core task in document intelligence, supporting applications such as information extraction, retrieval-augmented generation, and automated document analysis. However, real-world documents often feature complex layouts with multi-level tables, embedded images or formulas, and cross-page structures, which remain challenging for existing OCR systems. We introduce MonkeyOCR v1.5, a unified vision-language framework that enhances both layout understanding and content recognition through a two-stage parsing pipeline. The first stage employs a large multimodal model to jointly predict document layout and reading order, leveraging visual information to ensure structural and sequential consistency. The second stage performs localized recognition of text, formulas, and tables within detected regions, maintaining high visual fidelity while reducing error propagation. To address complex table structures, we propose a visual consistency-based reinforcement learning scheme that evaluates recognition quality via render-and-compare alignment, improving structural accuracy without manual annotations. Additionally, two specialized modules, Image-Decoupled Table Parsing and Type-Guided Table Merging, are introduced to enable reliable parsing of tables containing embedded images and reconstruction of tables crossing pages or columns. Comprehensive experiments on OmniDocBench v1.5 demonstrate that MonkeyOCR v1.5 achieves state-of-the-art performance, outperforming PPOCR-VL and MinerU 2.5 while showing exceptional robustness in visually complex document scenarios.

[155] GrounDiff: Diffusion-Based Ground Surface Generation from Digital Surface Models

Oussema Dhaouadi,Johannes Meier,Jacques Kaiser,Daniel Cremers

Main category: cs.CV

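TL;DR: Proposes GrounDiff, the first diffusion-based framework for generating digital terrain models (DTMs). It formulates the removal of non-ground structures as a denoising task, uses a gated design with confidence-guided generation for selective filtering, and adds Prior-Guided Stitching (PrioStitch) for scalability, outperforming existing methods on several datasets.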

Details Motivation: Traditional DTM generation relies on manually tuned parameters, while learning-based methods need carefully designed architectures and often post-processing, so a more robust, scalable approach that needs no task-specific optimization is desirable. Method: Proposes GrounDiff, which uses a diffusion model to convert DSMs into DTMs with a gated design and confidence-guided generation. For scalability, PrioStitch uses a low-resolution global prior produced by GrounDiff to guide local high-resolution predictions. A smoothness-oriented variant, GrounDiff+, targets road reconstruction. Result: RMSE is reduced by up to 93% on ALS2DTM and up to 47% on USGS. On the GeRoD road reconstruction benchmark, distance error drops by up to 81% while surface smoothness remains competitive, and GrounDiff+ further surpasses state-of-the-art methods. Conclusion: GrounDiff is the first diffusion-based DTM generation framework; it is robust, accurate, and scalable, and outperforms existing methods across several applications without task-specific optimization. Abstract: Digital Terrain Models (DTMs) represent the bare-earth elevation and are important in numerous geospatial applications. Such data models cannot be directly measured by sensors and are typically generated from Digital Surface Models (DSMs) derived from LiDAR or photogrammetry. Traditional filtering approaches rely on manually tuned parameters, while learning-based methods require well-designed architectures, often combined with post-processing. To address these challenges, we introduce Ground Diffusion (GrounDiff), the first diffusion-based framework that iteratively removes non-ground structures by formulating the problem as a denoising task. We incorporate a gated design with confidence-guided generation that enables selective filtering. To increase scalability, we further propose Prior-Guided Stitching (PrioStitch), which employs a downsampled global prior automatically generated using GrounDiff to guide local high-resolution predictions. We evaluate our method on the DSM-to-DTM translation task across diverse datasets, showing that GrounDiff consistently outperforms deep learning-based state-of-the-art methods, reducing RMSE by up to 93% on ALS2DTM and up to 47% on USGS benchmarks. In the task of road reconstruction, which requires both high precision and smoothness, our method achieves up to 81% lower distance error compared to specialized techniques on the GeRoD benchmark, while maintaining competitive surface smoothness using only DSM inputs, without task-specific optimization. 
Our variant for road reconstruction, GrounDiff+, is specifically designed to produce even smoother surfaces, further surpassing state-of-the-art methods. The project page is available at https://deepscenario.github.io/GrounDiff/.

[156] LLM-YOLOMS: Large Language Model-based Semantic Interpretation and Fault Diagnosis for Wind Turbine Components

Yaru Li,Yanxue Wang,Meng Li,Xinming Li,Jianbo Feng

Main category: cs.CV

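TL;DR: Proposes an intelligent fault analysis framework for wind turbines that combines YOLOMS with a large language model (LLM). Multi-scale detection and a KV mapping module improve fault detection and interpretability; experiments report 90.6% detection accuracy and 89% accuracy for maintenance reports.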

Details Motivation: Existing wind turbine fault detection methods rely mostly on visual recognition, and their outputs lack semantic interpretability, so they do little to support operation and maintenance decisions. Method: Combines YOLOMS, which uses multi-scale detection and sliding-window cropping, with a lightweight KV mapping module that converts detection results into text enriched with qualitative and quantitative attributes; a domain-tuned LLM then performs semantic reasoning to generate interpretable fault analyses and maintenance recommendations. Result: On real-world datasets, fault detection accuracy is 90.6% and the generated maintenance reports have an average accuracy of 89%. Conclusion: The framework improves the interpretability of wind turbine fault diagnosis and provides practical support for operation and maintenance decisions. Abstract: The health condition of wind turbine (WT) components is crucial for ensuring stable and reliable operation. However, existing fault detection methods are largely limited to visual recognition, producing structured outputs that lack semantic interpretability and fail to support maintenance decision-making. To address these limitations, this study proposes an integrated framework that combines YOLOMS with a large language model (LLM) for intelligent fault analysis and diagnosis. Specifically, YOLOMS employs multi-scale detection and sliding-window cropping to enhance fault feature extraction, while a lightweight key-value (KV) mapping module bridges the gap between visual outputs and textual inputs. This module converts YOLOMS detection results into structured textual representations enriched with both qualitative and quantitative attributes. A domain-tuned LLM then performs semantic reasoning to generate interpretable fault analyses and maintenance recommendations. Experiments on real-world datasets demonstrate that the proposed framework achieves a fault detection accuracy of 90.6% and generates maintenance reports with an average accuracy of 89%, thereby improving the interpretability of diagnostic results and providing practical decision support for the operation and maintenance of wind turbines.

[157] 3DFETUS: Standardizing Fetal Facial Planes in 3D Ultrasound

Alomar Antonia,Rubio Ricardo,Albaiges Gerard,Salort-Benejam Laura,Caminal Julia,Prat Maria,Rueda Carolina,Cortes Berta,Piella Gemma,Sukno Federico

Main category: cs.CV

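TL;DR: Proposes the GT++ algorithm and the 3DFETUS deep learning model for automatically localizing standard facial planes in 3D fetal ultrasound, improving estimation accuracy and reducing operator dependence.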

Details Motivation: Acquiring standard facial planes during routine fetal ultrasound is difficult because of fetal movement, variable orientation, and dependence on operator experience, which leads to inconsistency and diagnostic bias. Method: Uses annotated anatomical landmarks with the GT++ algorithm to estimate standard facial planes, and the 3DFETUS deep learning model to automate and standardize their localization. Result: Quantitative evaluation gives a mean translation error of 4.13 mm and a mean rotation error of 7.93 degrees, outperforming existing techniques; clinical assessment confirms statistically significant improvements in plane estimation accuracy. Conclusion: GT++ and 3DFETUS improve the accuracy and consistency of standard plane localization in 3D fetal facial ultrasound and show potential for clinical use. Abstract: Acquiring standard facial planes during routine fetal ultrasound (US) examinations is often challenging due to fetal movement, variability in orientation, and operator-dependent expertise. These factors contribute to inconsistencies, increased examination time, and potential diagnostic bias. To address these challenges in the context of facial assessment, we present: 1) GT++, a robust algorithm that estimates standard facial planes from 3D US volumes using annotated anatomical landmarks; and 2) 3DFETUS, a deep learning model that automates and standardizes their localization in 3D fetal US volumes. We evaluated our methods both qualitatively, through expert clinical review, and quantitatively. The proposed approach achieved a mean translation error of 4.13 mm and a mean rotation error of 7.93 degrees per plane, outperforming other state-of-the-art methods on 3D US volumes. Clinical assessments further confirmed the effectiveness of both GT++ and 3DFETUS, demonstrating statistically significant improvements in plane estimation accuracy.

[158] RodEpil: A Video Dataset of Laboratory Rodents for Seizure Detection and Benchmark Evaluation

Daniele Perlo,Vladimir Despotovic,Selma Boudissa,Sang-Yoon Kim,Petr Nazarov,Yanrong Zhang,Max Wintermark,Olivier Keunen

Main category: cs.CV

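TL;DR: Introduces RodEpil, a video dataset for automatic seizure detection in laboratory animals, containing 13,053 labeled short clips. A TimeSformer baseline reaches an average F1-score of 97%; the dataset and code are publicly released.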

Details Motivation: Non-invasive, video-based preclinical epilepsy research needs a high-quality, well-annotated video dataset of animal behavior to drive the development of automated detection methods. Method: Builds a dataset of short top-down and side-view rodent video clips, evaluates with strict subject-wise five-fold cross-validation, and classifies clips with a TimeSformer model. Result: TimeSformer distinguishes seizure from normal activity with an average F1-score of 97%. Conclusion: RodEpil provides a reliable benchmark for automatic recognition of seizure behavior, and the public data and code support reproducible preclinical research. Abstract: We introduce a curated video dataset of laboratory rodents for automatic detection of convulsive events. The dataset contains short (10 s) top-down and side-view video clips of individual rodents, labeled at clip level as normal activity or seizure. It includes 10,101 negative samples and 2,952 positive samples collected from 19 subjects. We describe the data curation, annotation protocol and preprocessing pipeline, and report baseline experiments using a transformer-based video classifier (TimeSformer). Experiments employ five-fold cross-validation with strict subject-wise partitioning to prevent data leakage (no subject appears in more than one fold). Results show that the TimeSformer architecture enables discrimination between seizure and normal activity with an average F1-score of 97%. The dataset and baseline code are publicly released to support reproducible research on non-invasive, video-based monitoring in preclinical epilepsy research. RodEpil Dataset access - DOI: 10.5281/zenodo.17601357
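The subject-wise partitioning mentioned above (no subject appears in more than one fold) can be sketched as follows. This is an illustrative assumption about how such a split might be built, not the released baseline code; the function name `subject_wise_folds` and the greedy size-balancing heuristic are hypothetical.

```python
from collections import defaultdict

def subject_wise_folds(subject_ids, n_folds=5):
    """Assign clip indices to folds so that all clips of a subject share one fold.

    Hypothetical helper: the greedy balancing rule is an assumption for illustration.
    """
    clips_per_subject = defaultdict(list)
    for clip_index, subject in enumerate(subject_ids):
        clips_per_subject[subject].append(clip_index)
    folds = [[] for _ in range(n_folds)]
    # Place subjects with the most clips first, always into the currently smallest fold.
    ordered = sorted(clips_per_subject, key=lambda s: (-len(clips_per_subject[s]), str(s)))
    for subject in ordered:
        smallest = min(range(n_folds), key=lambda f: len(folds[f]))
        folds[smallest].extend(clips_per_subject[subject])
    return folds

# Toy example: 100 clips spread over 19 made-up subject identifiers.
example_ids = ["subject%02d" % (i % 19) for i in range(100)]
example_folds = subject_wise_folds(example_ids)
```

Each fold then serves once as the held-out set while the remaining folds are used for training, so clips from a held-out animal never leak into training.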

[159] Histology-informed tiling of whole tissue sections improves the interpretability and predictability of cancer relapse and genetic alterations

Willem Bonnaffé,Yang Hu,Andrea Chatrian,Mengran Fan,Stefano Malacrino,Sandy Figiel,CRUK ICGC Prostate Group,Srinivasa R. Rao,Richard Colling,Richard J. Bryant,Freddie C. Hamdy,Dan J. Woodcock,Ian G. Mills,Clare Verrill,Jens Rittscher

Main category: cs.CV

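TL;DR: Proposes histology-informed tiling (HIT), which uses semantic segmentation to extract glands from whole slide images as meaningful input patches for multiple-instance learning and phenotyping, improving both the accuracy of detecting cancer-related gene copy number variations and model interpretability.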

Details Motivation: Conventional digital pathology pipelines often use grid-based tiling, which ignores tissue architecture, introduces irrelevant information, and limits interpretability, so a method that focuses on biologically meaningful structures is needed. Method: Proposes HIT, which uses semantic segmentation to extract glands from prostate cancer whole slide images as input units for multiple-instance learning (MIL) and phenotyping, trained and validated on the ProMPT, ICGC-C, and TCGA-PRAD cohorts. Result: HIT achieves a gland-level Dice score of 0.83, improves MIL model AUCs by 10% for detecting copy number variations in EMT- and MYC-related genes, and identifies 15 gland clusters, several of which are associated with cancer relapse, oncogenic mutations, and high Gleason scores. Conclusion: By focusing on biologically meaningful structures, HIT improves the accuracy, interpretability, and computational efficiency of MIL models and provides a better tiling strategy for digital pathology. Abstract: Histopathologists establish cancer grade by assessing histological structures, such as glands in prostate cancer. Yet, digital pathology pipelines often rely on grid-based tiling that ignores tissue architecture. This introduces irrelevant information and limits interpretability. We introduce histology-informed tiling (HIT), which uses semantic segmentation to extract glands from whole slide images (WSIs) as biologically meaningful input patches for multiple-instance learning (MIL) and phenotyping. Trained on 137 samples from the ProMPT cohort, HIT achieved a gland-level Dice score of 0.83 +/- 0.17. By extracting 380,000 glands from 760 WSIs across ICGC-C and TCGA-PRAD cohorts, HIT improved MIL model AUCs by 10% for detecting copy number variations (CNVs) in genes related to epithelial-mesenchymal transitions (EMT) and MYC, and revealed 15 gland clusters, several of which were associated with cancer relapse, oncogenic mutations, and high Gleason scores. Therefore, HIT improved the accuracy and interpretability of MIL predictions, while streamlining computations by focussing on biologically meaningful structures during feature extraction.
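For readers unfamiliar with the Dice score reported above, the following minimal sketch computes it for two binary masks. The flat 0/1 list representation and the convention for two empty masks are assumptions made for illustration; the paper does not state how its gland-level score is implemented.

```python
def dice_score(pred, target):
    """Dice coefficient between two binary masks given as flat sequences of 0/1."""
    intersection = sum(p * t for p, t in zip(pred, target))
    total = sum(pred) + sum(target)
    # Assumed convention: two empty masks count as a perfect match.
    return 1.0 if total == 0 else 2.0 * intersection / total

# One overlapping pixel, two predicted and one true: 2 * 1 / (2 + 1)
overlap = dice_score([1, 1, 0, 0], [1, 0, 0, 0])
```

A score of 1 means the predicted and reference masks coincide exactly, and 0 means they do not overlap at all.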

[160] OpenSR-SRGAN: A Flexible Super-Resolution Framework for Multispectral Earth Observation Data

Simon Donike,Cesar Aybar,Julio Contreras,Luis Gómez-Chova

Main category: cs.CV

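TL;DR: OpenSR-SRGAN is an open-source, modular super-resolution framework for Earth observation imagery based on SRGAN-style models. Configuration files switch network architectures, scale factors, and band setups, and the framework targets multispectral satellite data such as Sentinel-2, lowering the barrier for researchers and practitioners to experiment with and deploy GAN-based super-resolution.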

Details Motivation: Existing super-resolution models usually require code changes to adapt to different architectures or data, and there is no unified, easy-to-use framework for the varied needs of Earth observation, which limits reproducibility and broad adoption. Method: Proposes the OpenSR-SRGAN framework, a modular design in which generators, discriminators, loss functions, and training schedules are defined in configuration files and can be changed without modifying code; it supports multispectral data and includes hooks for logging, validation, and large-scene inference. Result: The framework ships with ready-to-use configurations, sensible defaults for adversarial training, and support for common remote sensing scenarios, giving a configuration-driven super-resolution workflow that is extensible, reproducible, and easy to use. Conclusion: OpenSR-SRGAN lowers the barrier to applying GAN-based super-resolution in Earth observation and is a practical, extensible tool for model comparison and deployment in remote sensing image processing. Abstract: We present OpenSR-SRGAN, an open and modular framework for single-image super-resolution in Earth Observation. The software provides a unified implementation of SRGAN-style models that is easy to configure, extend, and apply to multispectral satellite data such as Sentinel-2. Instead of requiring users to modify model code, OpenSR-SRGAN exposes generators, discriminators, loss functions, and training schedules through concise configuration files, making it straightforward to switch between architectures, scale factors, and band setups. The framework is designed as a practical tool and benchmark implementation rather than a state-of-the-art model. It ships with ready-to-use configurations for common remote sensing scenarios, sensible default settings for adversarial training, and built-in hooks for logging, validation, and large-scene inference. By turning GAN-based super-resolution into a configuration-driven workflow, OpenSR-SRGAN lowers the entry barrier for researchers and practitioners who wish to experiment with SRGANs, compare models in a reproducible way, and deploy super-resolution pipelines across diverse Earth-observation datasets.

[161] Utility of Pancreas Surface Lobularity as a CT Biomarker for Opportunistic Screening of Type 2 Diabetes

Tejas Sudharshan Mathai,Anisa V. Prasad,Xinya Wang,Praveen T. S. Balamuralikrishna,Yan Zhuang,Abhinav Suri,Jianfei Liu,Perry J. Pickhardt,Ronald M. Summers

Main category: cs.CV

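TL;DR: Proposes a fully automated deep learning approach for opportunistic screening of type 2 diabetes (T2DM) using CT biomarkers such as pancreatic surface lobularity (PSL). PSL is significantly higher in T2DM patients, and the prediction model reaches an AUC of 0.90 with high specificity.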

Details Motivation: Early detection of type 2 diabetes is crucial, but the role of pancreatic surface lobularity (PSL) in T2DM has not been fully investigated, so an automated imaging approach is needed to assess PSL as a potential biomarker. Method: Four deep learning models segment the pancreas and other structures in abdominal CT images of 584 patients; imaging biomarkers including PSL are extracted automatically, and a multivariate model is built to predict T2DM. Result: PSL is significantly higher in T2DM patients than in non-diabetic patients (p=0.01); the PancAP model gives the best segmentation (Dice=0.79, ASSD=1.94 mm); the CT biomarker-based prediction model reaches 0.90 AUC, 66.7% sensitivity, and 91.9% specificity. Conclusion: PSL is a potentially useful imaging biomarker for T2DM, and automated deep learning-based CT analysis can support early screening and prediction of T2DM. Abstract: Type 2 Diabetes Mellitus (T2DM) is a chronic metabolic disease that affects millions of people worldwide. Early detection is crucial as the disease can alter pancreas function through morphological changes and increased deposition of ectopic fat, eventually leading to organ damage. While studies have shown an association between T2DM and pancreas volume and fat content, the role of increased pancreatic surface lobularity (PSL) in patients with T2DM has not been fully investigated. In this pilot work, we propose a fully automated approach to delineate the pancreas and other abdominal structures, derive CT imaging biomarkers, and opportunistically screen for T2DM. Four deep learning-based models were used to segment the pancreas in an internal dataset of 584 patients (297 males, 437 non-diabetic, age: 45 ± 15 years). PSL was automatically detected and it was higher for diabetic patients (p=0.01) at 4.26 ± 8.32 compared to 3.19 ± 3.62 for non-diabetic patients. The PancAP model achieved the highest Dice score of 0.79 ± 0.17 and lowest ASSD error of 1.94 ± 2.63 mm (p < 0.05). For predicting T2DM, a multivariate model trained with CT biomarkers attained 0.90 AUC, 66.7% sensitivity, and 91.9% specificity. Our results suggest that PSL is useful for T2DM screening and could potentially help predict the early onset of T2DM.

[162] SPOT: Sparsification with Attention Dynamics via Token Relevance in Vision Transformers

Oded Schlesinger,Amirhossein Farzam,J. Matias Di Martino,Guillermo Sapiro

Main category: cs.CV

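TL;DR: Proposes SPOT, a framework that uses token embeddings, interactions, and attention dynamics to detect and remove redundant tokens in Vision Transformers early, improving computational efficiency while maintaining or even improving accuracy.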

Details Motivation: The computational cost of Vision Transformers (ViT) grows quadratically with the number of processed tokens, so redundant computation needs to be reduced. Method: Designs lightweight predictors that combine token embeddings, cross-layer interactions, and attention dynamics to infer token importance and dynamically sparsify unimportant tokens. Result: Efficiency gains of up to 40% over standard ViTs while maintaining or improving accuracy. Conclusion: SPOT is a general, efficient, and interpretable way to accelerate ViTs; it fits various architectures and adapts to different resource constraints. Abstract: While Vision Transformers (ViT) have demonstrated remarkable performance across diverse tasks, their computational demands are substantial, scaling quadratically with the number of processed tokens. Compact attention representations, reflecting token interaction distributions, can guide early detection and reduction of less salient tokens prior to attention computation. Motivated by this, we present SParsification with attentiOn dynamics via Token relevance (SPOT), a framework for early detection of redundant tokens within ViTs that leverages token embeddings, interactions, and attention dynamics across layers to infer token importance, resulting in a more context-aware and interpretable relevance detection process. SPOT informs token sparsification and facilitates the elimination of such tokens, improving computational efficiency without sacrificing performance. SPOT employs computationally lightweight predictors that can be plugged into various ViT architectures and learn to derive effective input-specific token prioritization across layers. Its versatile design supports a range of performance levels adaptable to varying resource constraints. Empirical evaluations demonstrate significant efficiency gains of up to 40% compared to standard ViTs, while maintaining or even improving accuracy. Code and models are available at https://github.com/odedsc/SPOT .
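The general idea of token sparsification, dropping the least relevant tokens before the expensive attention step, can be sketched as below. This is not SPOT's predictor: the relevance scores are assumed to be given, and the function name `keep_top_tokens` and the fixed `keep_ratio` are hypothetical stand-ins for the learned, input-specific prioritization the abstract describes.

```python
def keep_top_tokens(tokens, scores, keep_ratio=0.6):
    """Keep the highest-scoring tokens, preserving their original order.

    Hypothetical sketch: in SPOT the scores come from learned predictors;
    here they are simply passed in.
    """
    n_keep = max(1, round(len(tokens) * keep_ratio))
    ranked = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)
    kept = sorted(ranked[:n_keep])  # restore the original token order
    return [tokens[i] for i in kept], kept

kept_tokens, kept_indices = keep_top_tokens(
    ["a", "b", "c", "d", "e"], [0.1, 0.9, 0.5, 0.05, 0.7]
)
```

Because self-attention cost grows quadratically with token count, keeping 60% of the tokens reduces the pairwise attention work to roughly 36% of the original for that layer.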

[163] Learnable Total Variation with Lambda Mapping for Low-Dose CT Denoising

Yusuf Talha Basak,Mehmet Ozan Unal,Metin Ertas,Isa Yildirim

Main category: cs.CV

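TL;DR: Proposes a Learnable Total Variation (LTV) framework in which a data-driven LambdaNet predicts a per-pixel regularization map; trained end-to-end, it outperforms classical TV and FBP+U-Net in denoising and edge preservation.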

Details Motivation: Classical Total Variation (TV) depends on a fixed lambda parameter, which limits its efficiency and practicality and makes it hard to balance denoising against edge preservation adaptively across regions. Method: Couples an unrolled TV solver with a data-driven LambdaNet to form the Learnable Total Variation (LTV) framework, trained end-to-end so that reconstruction and regularization are optimized jointly, yielding spatially adaptive smoothing. Result: On the DeepLesion dataset, LTV improves on classical TV and FBP+U-Net by +2.9 dB PSNR and +6% SSIM on average, with better interpretability. Conclusion: LTV provides an interpretable alternative to black-box CNNs for low-dose CT and a basis for 3D and data-consistency-driven reconstruction. Abstract: Although Total Variation (TV) performs well in noise reduction and edge preservation on images, its dependence on the lambda parameter limits its efficiency and makes it difficult to use effectively. In this study, we present a Learnable Total Variation (LTV) framework that couples an unrolled TV solver with a data-driven Lambda Mapping Network (LambdaNet) predicting a per-pixel regularization map. The pipeline is trained end-to-end so that reconstruction and regularization are optimized jointly, yielding spatially adaptive smoothing: strong in homogeneous regions, relaxed near anatomical boundaries. Experiments on the DeepLesion dataset, using a realistic noise model adapted from the LoDoPaB-CT methodology, show consistent gains over classical TV and FBP+U-Net: +2.9 dB PSNR and +6% SSIM on average. LTV provides an interpretable alternative to black-box CNNs and a basis for 3D and data-consistency-driven reconstruction. Our code is available at: https://github.com/itu-biai/deep_tv_for_ldct
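To make the role of a per-pixel regularization map concrete, the sketch below evaluates a TV-regularized objective in which each pixel has its own weight. The anisotropic form, forward differences, and the function name `weighted_tv_energy` are assumptions for illustration; the abstract does not specify the exact TV variant or solver, and in LTV the map would be predicted by LambdaNet rather than supplied by hand.

```python
def weighted_tv_energy(u, f, lam):
    """Evaluate 0.5 * ||u - f||^2 + sum_ij lam[i][j] * (|dx u| + |dy u|).

    u:   candidate denoised image (list of rows)
    f:   noisy input image
    lam: per-pixel regularization weights (hand-picked here, assumed learned in LTV)
    """
    rows, cols = len(u), len(u[0])
    fidelity = 0.0
    regularizer = 0.0
    for i in range(rows):
        for j in range(cols):
            fidelity += 0.5 * (u[i][j] - f[i][j]) ** 2
            # Forward differences, treated as zero at the image border.
            dx = u[i][j + 1] - u[i][j] if j + 1 < cols else 0.0
            dy = u[i + 1][j] - u[i][j] if i + 1 < rows else 0.0
            regularizer += lam[i][j] * (abs(dx) + abs(dy))
    return fidelity + regularizer

image = [[0.0, 1.0], [0.0, 1.0]]
energy = weighted_tv_energy(image, image, [[0.5, 1.0], [2.0, 1.0]])
```

A small weight at a pixel makes intensity jumps there cheap, which is how a learned map can relax smoothing near anatomical boundaries while keeping it strong in homogeneous regions.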

[164] SemanticVLA: Semantic-Aligned Sparsification and Enhancement for Efficient Robotic Manipulation

Wei Li,Renshan Zhang,Rui Shao,Zhijian Fang,Kaiwen Zhou,Zhuotao Tian,Liqiang Nie

Main category: cs.CV

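TL;DR: Proposes SemanticVLA, a new Vision-Language-Action (VLA) framework that improves the efficiency and performance of robotic manipulation through semantic-aligned sparsification and enhancement. It has three core components: SD-Pruner removes redundant perception while preserving semantic alignment, SH-Fuser fuses multi-modal features into a coherent representation, and SA-Coupler improves the mapping from perception to action. On the LIBERO benchmark, SemanticVLA surpasses OpenVLA by 21.1% in success rate while reducing training cost and inference latency by 3.0-fold and 2.7-fold, and it is open-sourced.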

Details Motivation: Existing VLA models suffer from perceptual redundancy and superficial instruction-vision alignment, which weakens semantic grounding and hurts the efficiency and accuracy of robotic manipulation, so a method is needed that compresses irrelevant visual input and strengthens the integration of semantics with spatial structure. Method: Proposes the SemanticVLA framework, comprising: 1) Semantic-guided Dual Visual Pruner (SD-Pruner), which combines an Instruction-driven Pruner (ID-Pruner) and a Spatial-aggregation Pruner (SA-Pruner) for semantically aligned perceptual sparsification; 2) Semantic-complementary Hierarchical Fuser (SH-Fuser), which fuses dense patches and sparse tokens across SigLIP and DINOv2; 3) Semantic-conditioned Action Coupler (SA-Coupler), which replaces the conventional observation-to-DoF mapping for more efficient and interpretable action modeling. Result: Across simulation and real-world tasks, SemanticVLA surpasses OpenVLA on the LIBERO benchmark by 21.1% in success rate while reducing training cost by 3.0-fold and inference latency by 2.7-fold, setting a new SOTA in performance and efficiency. Conclusion: Through semantic-aligned sparsification and enhancement, SemanticVLA addresses perceptual redundancy and semantic grounding in VLA models, improves the efficiency, performance, and interpretability of robotic manipulation, shows potential for practical deployment, and is open-sourced. Abstract: Vision-Language-Action (VLA) models have advanced in robotic manipulation, yet practical deployment remains hindered by two key limitations: 1) perceptual redundancy, where irrelevant visual inputs are processed inefficiently, and 2) superficial instruction-vision alignment, which hampers semantic grounding of actions. In this paper, we propose SemanticVLA, a novel VLA framework that performs Semantic-Aligned Sparsification and Enhancement for Efficient Robotic Manipulation. Specifically: 1) To sparsify redundant perception while preserving semantic alignment, Semantic-guided Dual Visual Pruner (SD-Pruner) performs: Instruction-driven Pruner (ID-Pruner) extracts global action cues and local semantic anchors in SigLIP; Spatial-aggregation Pruner (SA-Pruner) compacts geometry-rich features into task-adaptive tokens in DINOv2. 2) To exploit sparsified features and integrate semantics with spatial geometry, Semantic-complementary Hierarchical Fuser (SH-Fuser) fuses dense patches and sparse tokens across SigLIP and DINOv2 for coherent representation. 3) To enhance the transformation from perception to action, Semantic-conditioned Action Coupler (SA-Coupler) replaces the conventional observation-to-DoF approach, yielding more efficient and interpretable behavior modeling for manipulation tasks. 
Extensive experiments on simulation and real-world tasks show that SemanticVLA sets a new SOTA in both performance and efficiency. SemanticVLA surpasses OpenVLA on LIBERO benchmark by 21.1% in success rate, while reducing training cost and inference latency by 3.0-fold and 2.7-fold. SemanticVLA is open-sourced and publicly available at https://github.com/JiuTian-VL/SemanticVLA

[165] Dynamic Avatar-Scene Rendering from Human-centric Context

Wenqing Wang,Haosen Yang,Josef Kittler,Xiatian Zhu

Main category: cs.CV

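TL;DR: Proposes a Separate-then-Map (StM) strategy for reconstructing dynamic humans interacting with real environments from monocular video. A shared transformation function unifies the separately modeled components, improving spatial and visual coherence between human and scene while keeping computation efficient.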

Details Motivation: When modeling interactions between dynamic humans and scenes, existing methods either ignore the distinct motion characteristics of different components, leading to incomplete reconstructions, or lack information exchange between components, producing boundary artifacts and spatial inconsistencies. Method: Components are modeled separately and connected through a dedicated information mapping mechanism. A shared transformation function is applied to each Gaussian attribute, fusing human and scene efficiently and consistently without exhaustive pairwise interaction computation. Result: Experiments on several monocular video datasets show that StM significantly outperforms existing state-of-the-art methods in visual quality and rendering accuracy, particularly at complex human-scene interaction boundaries. Conclusion: The StM strategy addresses information isolation and inconsistency in reconstructing dynamic human-scene interactions, offering an efficient and robust modeling framework for 4D neural rendering. Abstract: Reconstructing dynamic humans interacting with real-world environments from monocular videos is an important and challenging task. Despite considerable progress in 4D neural rendering, existing approaches either model dynamic scenes holistically or model scenes and backgrounds separately in order to introduce parametric human priors. However, these approaches either neglect the distinct motion characteristics of the various components in a scene, especially humans, leading to incomplete reconstructions, or ignore the information exchange between the separately modeled components, resulting in spatial inconsistencies and visual artifacts at human-scene boundaries. To address this, we propose the Separate-then-Map (StM) strategy, which introduces a dedicated information mapping mechanism to bridge separately defined and optimized models. Our method employs a shared transformation function for each Gaussian attribute to unify separately modeled components, enhancing computational efficiency by avoiding exhaustive pairwise interactions while ensuring spatial and visual coherence between humans and their surroundings. Extensive experiments on monocular video datasets demonstrate that StM significantly outperforms existing state-of-the-art methods in both visual quality and rendering accuracy, particularly at challenging human-scene interaction boundaries.

[166] Benchmarking Diversity in Image Generation via Attribute-Conditional Human Evaluation

Isabela Albuquerque,Ira Ktena,Olivia Wiles,Ivana Kajić,Amal Rannen-Triki,Cristina Vasconcelos,Aida Nematzadeh

Main category: cs.CV

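TL;DR: This paper proposes a new framework for evaluating diversity in text-to-image (T2I) models. It measures generation diversity systematically with a human evaluation template, a curated prompt set, and a statistical methodology, and compares how different models and image embeddings perform.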

Details Motivation: Current T2I models generate outputs that lack diversity, so a systematic and reliable diversity evaluation method is needed to drive model improvement. Method: The authors build a prompt set covering diverse concepts and their factors of variation, design a human evaluation template, and compare models on human annotations using binomial tests. They also assess how well various image embeddings work for diversity measurement. Result: The framework distinguishes the diversity of different T2I models, identifies categories where models fall short, and finds that some image embeddings are better suited to diversity evaluation than others. Conclusion: The proposed method provides a reliable tool for evaluating T2I diversity and can guide future work on both models and diversity metrics. Abstract: Despite advances in generation quality, current text-to-image (T2I) models often lack diversity, generating homogeneous outputs. This work introduces a framework to address the need for robust diversity evaluation in T2I models. Our framework systematically assesses diversity by evaluating individual concepts and their relevant factors of variation. Key contributions include: (1) a novel human evaluation template for nuanced diversity assessment; (2) a curated prompt set covering diverse concepts with their identified factors of variation (e.g. prompt: An image of an apple, factor of variation: color); and (3) a methodology for comparing models in terms of human annotations via binomial tests. Furthermore, we rigorously compare various image embeddings for diversity measurement. Notably, our principled approach enables ranking of T2I models by diversity, identifying categories where they particularly struggle. This research offers a robust methodology and insights, paving the way for improvements in T2I model diversity and metric development.
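
A binomial test on pairwise human judgments can be sketched as follows. The abstract does not spell out the exact protocol, so this assumes a simple setup: annotators pick which of two models produced the more diverse image set, ties are dropped, and the null hypothesis is a 50/50 preference. The function name and the counts are illustrative only.

```python
from math import comb


def binomial_two_sided_p(wins, n, p=0.5):
    """Exact two-sided binomial test p-value.

    Sums the probability of every outcome that is no more likely than the
    observed one under the null hypothesis (the usual exact-test convention).
    """
    pmf = [comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(n + 1)]
    observed = pmf[wins]
    # Small relative tolerance so symmetric outcomes are counted despite rounding.
    return min(1.0, sum(q for q in pmf if q <= observed * (1 + 1e-9)))


# Hypothetical counts: model A judged more diverse in 15 of 20 non-tied comparisons.
p_value = binomial_two_sided_p(wins=15, n=20)
a_is_more_diverse = p_value < 0.05
```

With these made-up counts the p-value is about 0.041, so the preference for model A would be considered significant at the 5% level.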

[167] A Style is Worth One Code: Unlocking Code-to-Style Image Generation with Discrete Style Space

Huijie Liu,Shuhao Cui,Haoxiang Cao,Shuai Ma,Kai Wu,Guoliang Kang

Main category: cs.CV

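TL;DR: This paper proposes CoTyle, the first open-source code-to-style image generation method, which produces novel and consistent visual styles from a numerical style code.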

Details Motivation: Existing style generation methods depend on complex inputs (long text prompts, reference images, or fine-tuning), struggle to guarantee style consistency and creativity, and lack open research from the academic community. Method: A discrete style codebook is first trained on a collection of images to extract style embeddings, which condition a text-to-image diffusion model. An autoregressive style generator is then trained on the style embeddings to synthesize new styles. At inference, a numerical style code is mapped to a style embedding that guides image generation. Result: Experiments show that CoTyle effectively turns a numerical code into a style controller and performs well in style diversity, consistency, and reproducibility. Conclusion: CoTyle validates the idea that a style is worth one code and gives open academic research its first workable framework for code-driven style generation. Abstract: Innovative visual stylization is a cornerstone of artistic creation, yet generating novel and consistent visual styles remains a significant challenge. Existing generative approaches typically rely on lengthy textual prompts, reference images, or parameter-efficient fine-tuning to guide style-aware image generation, but often struggle with style consistency, limited creativity, and complex style representations. In this paper, we affirm that a style is worth one numerical code by introducing the novel task, code-to-style image generation, which produces images with novel, consistent visual styles conditioned solely on a numerical style code. To date, this field has been explored primarily by industry (e.g., Midjourney), with no open-source research from the academic community. To fill this gap, we propose CoTyle, the first open-source method for this task. Specifically, we first train a discrete style codebook from a collection of images to extract style embeddings. These embeddings serve as conditions for a text-to-image diffusion model (T2I-DM) to generate stylistic images. Subsequently, we train an autoregressive style generator on the discrete style embeddings to model their distribution, allowing the synthesis of novel style embeddings. During inference, a numerical style code is mapped to a unique style embedding by the style generator, and this embedding guides the T2I-DM to generate images in the corresponding style. Unlike existing methods, our method offers unparalleled simplicity and diversity, unlocking a vast space of reproducible styles from minimal input. 
Extensive experiments validate that CoTyle effectively turns a numerical code into a style controller, demonstrating a style is worth one code.

[168] OmniVGGT: Omni-Modality Driven Visual Geometry Grounded

Haosong Peng,Hao Li,Yalun Dai,Yushi Lan,Yihang Luo,Tianyu Qi,Zhengshen Zhang,Yufeng Zhan,Junfei Zhang,Wenchao Xu,Ziwei Liu

Main category: cs.CV

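TL;DR: This paper proposes OmniVGGT, a general 3D foundation model framework that can fuse an arbitrary number of geometric modalities (such as depth and camera intrinsics/extrinsics). Using a GeoAdapter and a stochastic multimodal fusion strategy, it improves performance on a range of vision tasks while keeping inference efficient, and it proves useful in vision-language-action models.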

Details Motivation: Most existing general 3D foundation models take only RGB input and ignore readily available geometric information (such as camera intrinsics, poses, and depth maps), which limits spatial understanding. A method that can flexibly exploit multiple geometric modalities is needed. Method: The OmniVGGT framework includes a GeoAdapter, which uses zero-initialized convolutions to inject geometric information progressively, and a stochastic multimodal fusion mechanism, which randomly samples modality subsets during training. Together they allow any number of geometric modalities to be supplied at training and inference time while keeping the model stable and efficient. Result: OmniVGGT outperforms prior methods on monocular/multi-view depth estimation, multi-view stereo, and camera pose estimation, and reaches SOTA even with RGB-only input. Integrated into vision-language-action (VLA) models, it outperforms the baseline on robotic tasks. Conclusion: OmniVGGT fuses multiple geometric modalities effectively and efficiently, improves 3D perception, generalizes well, and gives general vision models stronger spatial understanding. Abstract: General 3D foundation models have started to lead the trend of unifying diverse vision tasks, yet most assume RGB-only inputs and ignore readily available geometric cues (e.g., camera intrinsics, poses, and depth maps). To address this issue, we introduce OmniVGGT, a novel framework that can effectively benefit from an arbitrary number of auxiliary geometric modalities during both training and inference. In our framework, a GeoAdapter is proposed to encode depth and camera intrinsics/extrinsics into a spatial foundation model. It employs zero-initialized convolutions to progressively inject geometric information without disrupting the foundation model's representation space. This design ensures stable optimization with negligible overhead, maintaining inference speed comparable to VGGT even with multiple additional inputs. Additionally, a stochastic multimodal fusion regimen is proposed, which randomly samples modality subsets per instance during training. This enables an arbitrary number of modality inputs during testing and promotes learning robust spatial representations instead of overfitting to auxiliary cues. Comprehensive experiments on monocular/multi-view depth estimation, multi-view stereo, and camera pose estimation demonstrate that OmniVGGT outperforms prior methods with auxiliary inputs and achieves state-of-the-art results even with RGB-only input. To further highlight its practical utility, we integrated OmniVGGT into vision-language-action (VLA) models. 
The OmniVGGT-enhanced VLA model not only outperforms the vanilla point-cloud-based baseline on mainstream benchmarks, but also effectively leverages accessible auxiliary inputs to achieve consistent gains on robotic tasks.
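
Two ideas from this abstract lend themselves to a toy sketch: zero-initialized injection (the adapter leaves the backbone features untouched at the start of training) and per-instance random modality subsets. The code below is only an illustration under simplifying assumptions: a plain linear map stands in for a zero-initialized convolution, each modality is kept independently with a hypothetical `keep_prob`, and all names are invented here rather than taken from the paper.

```python
import random

MODALITIES = ("depth", "intrinsics", "extrinsics")


def sample_modalities(rng, keep_prob=0.5):
    """Independently keep each auxiliary modality (assumed sampling scheme)."""
    return [m for m in MODALITIES if rng.random() < keep_prob]


class ZeroInitAdapter:
    """Per-modality linear map with all-zero initial weights.

    Stand-in for a zero-initialized convolution: before any training step,
    injected geometry contributes exactly nothing to the output.
    """

    def __init__(self, dim):
        self.weights = {m: [[0.0] * dim for _ in range(dim)] for m in MODALITIES}

    def inject(self, rgb_feat, aux_feats):
        out = list(rgb_feat)
        for name, feat in aux_feats.items():
            w = self.weights[name]
            for i in range(len(out)):
                out[i] += sum(w[i][j] * feat[j] for j in range(len(feat)))
        return out


rng = random.Random(0)
adapter = ZeroInitAdapter(dim=3)
rgb = [0.2, -0.1, 0.7]
aux = {m: [1.0, 2.0, 3.0] for m in sample_modalities(rng)}
fused = adapter.inject(rgb, aux)  # identical to rgb while weights are zero
```

Because the subset of modalities changes from sample to sample, the same `inject` call works with zero, one, or all auxiliary inputs at test time.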

[169] From 2D to 3D Without Extra Baggage: Data-Efficient Cancer Detection in Digital Breast Tomosynthesis

Yen Nhi Truong Vu,Dan Guo,Sripad Joshi,Harshit Kumar,Jason Su,Thomas Paul Matthews

Main category: cs.CV

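TL;DR: Proposes the M&M-3D architecture, which enables learnable 3D reasoning without adding parameters and substantially improves lesion localization and classification in DBT.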

Details Motivation: Because annotated data is scarce, existing deep learning methods for digital breast tomosynthesis (DBT) face a trade-off: 2D methods discard volumetric information, while 3D methods need more data and more complex models. Method: The authors modify the structure of M&M, an existing FFDM model, to construct malignancy-guided 3D features and learn 3D reasoning by repeatedly mixing them with slice-level information. The parameter count is unchanged, so weights transfer directly from the FFDM model. Result: M&M-3D surpasses 2D projection and 3D slice-based methods by 11-54% for localization and 3-10% for classification. In the low-data regime it outperforms complex 3D reasoning variants by 20-47% (localization) and 2-10% (classification), and it matches them in the high-data regime. On the BCS-DBT benchmark it improves classification by 4% and localization by 10%. Conclusion: M&M-3D achieves 3D reasoning on DBT data without adding parameters, eases model transfer under data scarcity, and outperforms existing methods across several evaluation metrics. Abstract: Digital Breast Tomosynthesis (DBT) enhances finding visibility for breast cancer detection by providing volumetric information that reduces the impact of overlapping tissues; however, limited annotated data has constrained the development of deep learning models for DBT. To address data scarcity, existing methods attempt to reuse 2D full-field digital mammography (FFDM) models by either flattening DBT volumes or processing slices individually, thus discarding volumetric information. Alternatively, 3D reasoning approaches introduce complex architectures that require more DBT training data. Tackling these drawbacks, we propose M&M-3D, an architecture that enables learnable 3D reasoning while remaining parameter-free relative to its FFDM counterpart, M&M. M&M-3D constructs malignancy-guided 3D features, and 3D reasoning is learned through repeatedly mixing these 3D features with slice-level information. This is achieved by modifying operations in M&M without adding parameters, thus enabling direct weight transfer from FFDM. Extensive experiments show that M&M-3D surpasses 2D projection and 3D slice-based methods by 11-54% for localization and 3-10% for classification. Additionally, M&M-3D outperforms complex 3D reasoning variants by 20-47% for localization and 2-10% for classification in the low-data regime, while matching their performance in the high-data regime. On the popular BCS-DBT benchmark, M&M-3D outperforms the previous top baseline by 4% for classification and 10% for localization.

[170] Multitask GLocal OBIA-Mamba for Sentinel-2 Landcover Mapping

Zack Dewis,Yimin Zhu,Zhengsen Xu,Mabel Heffring,Saeid Taleghanidoozdoozan,Kaylee Xiao,Motasem Alkayid,Lincoln Linlin Xu

Main category: cs.CV

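TL;DR: Proposes a Sentinel-2 land use and land cover classification method based on a Multitask Glocal OBIA-Mamba (MSOM) model, which improves classification accuracy and fine detail.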

Details Motivation: Spatial heterogeneity, context information, and spectral signature ambiguity make Sentinel-2 based LULC classification difficult, so a more efficient method that balances local detail with global consistency is needed. Method: The authors design an OBIA-Mamba model that uses superpixels as Mamba tokens, adopt a global-local dual-branch CNN-Mamba architecture to fuse local spatial detail with global context, and build a multitask optimization framework whose dual loss functions balance local precision with global consistency. Result: Tests on Sentinel-2 imagery of Alberta, Canada show higher classification accuracy and finer classification results than existing advanced methods. Conclusion: The MSOM model addresses key difficulties in Sentinel-2 LULC classification and improves classification performance while remaining computationally efficient. Abstract: Although Sentinel-2 based land use and land cover (LULC) classification is critical for various environmental monitoring applications, it is a very difficult task due to some key data challenges (e.g., spatial heterogeneity, context information, signature ambiguity). This paper presents a novel Multitask Glocal OBIA-Mamba (MSOM) for enhanced Sentinel-2 classification with the following contributions. First, an object-based image analysis (OBIA) Mamba model (OBIA-Mamba) is designed to reduce redundant computation without compromising fine-grained details by using superpixels as Mamba tokens. Second, a global-local (GLocal) dual-branch convolutional neural network (CNN)-Mamba architecture is designed to jointly model local spatial detail and global contextual information. Third, a multitask optimization framework is designed to employ dual loss functions to balance local precision with global consistency. The proposed approach is tested on Sentinel-2 imagery in Alberta, Canada, in comparison with several advanced classification approaches, and the results demonstrate that the proposed approach achieves higher classification accuracy and finer details than the other state-of-the-art methods.
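
To see why superpixels reduce the token count, consider the following sketch. It assumes a superpixel label map has already been computed by some segmentation step (not shown) and that each token is the mean feature vector of its superpixel. Mean pooling and the function name are assumptions made for illustration, not details confirmed by the abstract.

```python
def superpixel_tokens(features, labels):
    """Average per-pixel feature vectors within each superpixel.

    features: H x W grid of feature vectors (nested lists).
    labels:   H x W grid of integer superpixel ids.
    Returns a dict mapping superpixel id to its mean feature vector.
    """
    sums, counts = {}, {}
    for feat_row, label_row in zip(features, labels):
        for vec, lab in zip(feat_row, label_row):
            acc = sums.setdefault(lab, [0.0] * len(vec))
            for channel, value in enumerate(vec):
                acc[channel] += value
            counts[lab] = counts.get(lab, 0) + 1
    return {lab: [s / counts[lab] for s in acc] for lab, acc in sorted(sums.items())}


# Toy 2x2 image with 2 feature channels and two superpixels (top row, bottom row).
features = [[[1.0, 0.0], [3.0, 0.0]],
            [[0.0, 2.0], [0.0, 4.0]]]
labels = [[0, 0],
          [1, 1]]
tokens = superpixel_tokens(features, labels)
```

Four pixels collapse into two tokens here; on a real tile the sequence fed to the model shrinks from the number of pixels to the number of superpixels.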

[171] One Small Step in Latent, One Giant Leap for Pixels: Fast Latent Upscale Adapter for Your Diffusion Models

Aleksandr Razin,Danil Kazantsev,Ilya Makarov

Main category: cs.CV

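TL;DR: Proposes the Latent Upscaler Adapter (LUA), a lightweight module that performs upscaling directly in latent space for efficient high-resolution image generation with diffusion models, without modifying the base model or adding diffusion stages.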

Details Motivation: Diffusion models scale poorly beyond their training resolution: direct high-resolution sampling is slow and costly, and post-hoc image super-resolution (ISR) introduces artifacts and adds latency. Method: LUA is a lightweight module with a shared Swin-style backbone and scale-specific pixel-shuffle heads. It upscales the latent code before VAE decoding, supports 2x and 4x factors, and plugs into existing models as a drop-in component. Result: LUA achieves perceptual quality comparable to existing methods with nearly 3x lower decoding and upscaling time (adding only 0.42 s for 1024 px generation from 512 px, versus 1.87 s for pixel-space SR), and it generalizes across the latent spaces of different VAEs. Conclusion: LUA offers modern diffusion models an efficient, scalable path to high-fidelity image generation and makes high-resolution generation more practical. Abstract: Diffusion models struggle to scale beyond their training resolutions, as direct high-resolution sampling is slow and costly, while post-hoc image super-resolution (ISR) introduces artifacts and additional latency by operating after decoding. We present the Latent Upscaler Adapter (LUA), a lightweight module that performs super-resolution directly on the generator's latent code before the final VAE decoding step. LUA integrates as a drop-in component, requiring no modifications to the base model or additional diffusion stages, and enables high-resolution synthesis through a single feed-forward pass in latent space. A shared Swin-style backbone with scale-specific pixel-shuffle heads supports 2x and 4x factors and remains compatible with image-space SR baselines, achieving comparable perceptual quality with nearly 3x lower decoding and upscaling time (adding only +0.42 s for 1024 px generation from 512 px, compared to 1.87 s for pixel-space SR using the same SwinIR architecture). Furthermore, LUA shows strong generalization across the latent spaces of different VAEs, making it easy to deploy without retraining from scratch for each new decoder. Extensive experiments demonstrate that LUA closely matches the fidelity of native high-resolution generation while offering a practical and efficient path to scalable, high-fidelity image synthesis in modern diffusion pipelines.
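
The pixel-shuffle operation named in the abstract is a fixed rearrangement that trades channels for spatial resolution. The sketch below implements only that rearrangement on nested lists, using the common convention in which output pixel (h*r+i, w*r+j) of channel c comes from input channel c*r*r + i*r + j. It does not reproduce LUA's backbone or heads, and the toy latent is invented.

```python
def pixel_shuffle(x, r):
    """Rearrange a [C*r*r][H][W] nested list into [C][H*r][W*r]."""
    channels_in, height, width = len(x), len(x[0]), len(x[0][0])
    if channels_in % (r * r):
        raise ValueError("channel count must be divisible by r*r")
    channels_out = channels_in // (r * r)
    out = [[[0.0] * (width * r) for _ in range(height * r)]
           for _ in range(channels_out)]
    for c in range(channels_out):
        for i in range(r):
            for j in range(r):
                src = x[c * r * r + i * r + j]  # sub-pixel plane for offset (i, j)
                for h in range(height):
                    for w in range(width):
                        out[c][h * r + i][w * r + j] = src[h][w]
    return out


# Toy latent: 4 channels at 1x1 resolution become 1 channel at 2x2.
latent = [[[float(k)]] for k in range(4)]
upscaled = pixel_shuffle(latent, r=2)
```

A 2x head therefore needs its preceding layer to emit four times as many channels as the target latent, and a 4x head sixteen times as many.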

[172] Depth Anything 3: Recovering the Visual Space from Any Views

Haotong Lin,Sili Chen,Junhao Liew,Donny Y. Chen,Zhenyu Li,Guang Shi,Jiashi Feng,Bingyi Kang

Main category: cs.CV

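TL;DR: Depth Anything 3 (DA3) is a model that predicts spatially consistent geometry from an arbitrary number of visual inputs. It needs neither a specialized architecture nor complex multi-task learning: a single plain transformer is enough to achieve strong performance.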

Details Motivation: The goal is to improve the accuracy and generalization of visual geometry prediction while keeping the modeling minimal, addressing the limitations of existing methods when camera poses are unknown or multiple views are involved. Method: DA3 uses a teacher-student training paradigm, a single plain transformer as the backbone, and a single depth-ray prediction target, avoiding complex multi-task designs. Result: On the newly established visual geometry benchmark, DA3 surpasses the prior best method, VGGT, by an average of 44.3% in camera pose accuracy and 25.1% in geometric accuracy, and it outperforms Depth Anything 2 in monocular depth estimation. Conclusion: By simplifying the model structure and training target, DA3 sets a new state of the art on multiple visual geometry tasks while training only on public academic datasets, which makes it broadly applicable and practical. Abstract: We present Depth Anything 3 (DA3), a model that predicts spatially consistent geometry from an arbitrary number of visual inputs, with or without known camera poses. In pursuit of minimal modeling, DA3 yields two key insights: a single plain transformer (e.g., vanilla DINO encoder) is sufficient as a backbone without architectural specialization, and a singular depth-ray prediction target obviates the need for complex multi-task learning. Through our teacher-student training paradigm, the model achieves a level of detail and generalization on par with Depth Anything 2 (DA2). We establish a new visual geometry benchmark covering camera pose estimation, any-view geometry and visual rendering. On this benchmark, DA3 sets a new state-of-the-art across all tasks, surpassing prior SOTA VGGT by an average of 44.3% in camera pose accuracy and 25.1% in geometric accuracy. Moreover, it outperforms DA2 in monocular depth estimation. All models are trained exclusively on public academic datasets.

[173] Enhancing the Outcome Reward-based RL Training of MLLMs with Self-Consistency Sampling

Jiahao Wang,Weiye Xu,Aijun Yang,Wengang Zhou,Lewei Lu,Houqiang Li,Xiaohua Wang,Jinguo Zhu

Main category: cs.CV

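TL;DR: Proposes Self-Consistency Sampling (SCS), which introduces visual perturbations and repeated truncation and resampling, then uses a consistency score to reduce the training bias that arises in outcome-reward RL for multimodal large language models when a faulty reasoning chain guesses the right answer. It improves accuracy on multiple benchmarks.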

Details Motivation: In the multiple-choice setting, outcome-reward RL gives the same reward to a faulty reasoning chain that happens to guess the correct option, so the training signal is unfaithful and reasoning quality suffers. Method: For each question, SCS introduces small visual perturbations and repeatedly truncates and resamples an initial reasoning trajectory. Agreement among the resulting trajectories yields a differentiable consistency score that down-weights unreliable trajectories during policy updates. Result: On Qwen2.5-VL-7B-Instruct, combining SCS with RLOO, GRPO, and REINFORCE++ improves accuracy by up to 7.7 percentage points on six multimodal benchmarks. It also yields notable gains on Qwen2.5-VL-3B-Instruct and InternVL3-8B, with negligible extra computation. Conclusion: SCS is a simple, general, and effective remedy for outcome-reward RL in multimodal large language models, mitigating the training bias caused by unfaithful reasoning trajectories. Abstract: Outcome-reward reinforcement learning (RL) is a common and increasingly significant way to refine the step-by-step reasoning of multimodal large language models (MLLMs). In the multiple-choice setting - a dominant format for multimodal reasoning benchmarks - the paradigm faces a significant yet often overlooked obstacle: unfaithful trajectories that guess the correct option after a faulty chain of thought receive the same reward as genuine reasoning, which is a flaw that cannot be ignored. We propose Self-Consistency Sampling (SCS) to correct this issue. For each question, SCS (i) introduces small visual perturbations and (ii) performs repeated truncation and resampling of an initial trajectory; agreement among the resulting trajectories yields a differentiable consistency score that down-weights unreliable traces during policy updates. Based on Qwen2.5-VL-7B-Instruct, plugging SCS into RLOO, GRPO, and REINFORCE++ series improves accuracy by up to 7.7 percentage points on six multimodal benchmarks with negligible extra computation. SCS also yields notable gains on both Qwen2.5-VL-3B-Instruct and InternVL3-8B, offering a simple, general remedy for outcome-reward RL in MLLMs.
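
The down-weighting idea can be sketched with a deliberately simplified, non-differentiable stand-in for the paper's consistency score. The sketch assumes the final answers of the resampled trajectories are already available as option letters, measures agreement as the majority fraction, and multiplies the outcome reward by it. Both the agreement measure and the multiplicative weighting are assumptions made for illustration.

```python
from collections import Counter


def consistency_score(answers):
    """Fraction of resampled trajectories that agree with the most common answer."""
    if not answers:
        return 0.0
    top_count = Counter(answers).most_common(1)[0][1]
    return top_count / len(answers)


def weighted_reward(outcome_reward, answers):
    """Scale the outcome reward by agreement (assumed weighting scheme)."""
    return outcome_reward * consistency_score(answers)


# Both initial trajectories picked the correct option and earned reward 1.0.
# Resampling reveals that only the first one reaches that answer reliably.
stable = weighted_reward(1.0, ["B", "B", "B", "B"])
lucky = weighted_reward(1.0, ["B", "A", "C", "B"])
```

The stable trajectory keeps its full reward while the lucky guess is halved, which is the qualitative effect the abstract describes for unreliable traces.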