cs.CL

[1] Interpreting Public Sentiment in Diplomacy Events: A Counterfactual Analysis Framework Using Large Language Models

Leyi Ouyang

Main category: cs.CL

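TL;DR: Proposes an LLM-based counterfactual generation framework that adjusts textual features of diplomatic event narratives to improve public sentiment. While keeping the core facts unchanged, it shifts negative sentiment to neutral or positive with a 70% success rate.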

Details Motivation: Traditional methods for measuring and steering public sentiment are time-consuming, labor-intensive, and not forward-looking, so an automated, data-driven approach is needed to manage and improve public opinion around diplomatic events. Method: First, a dataset of diplomatic event descriptions and their associated public discussions is constructed, and a language model is trained to predict public reaction. Modifiable textual features are then identified using communication theory and input from domain experts. Finally, an LLM-based counterfactual generation algorithm systematically produces text variants that preserve the facts but change the narrative framing. Result: In experiments the framework achieved a 70% success rate, shifting negative public sentiment to neutral or positive while providing interpretable suggestions for narrative improvement. Conclusion: The framework can serve as a practical tool for diplomats, policymakers, and communication specialists, offering data-driven support on how to frame diplomatic initiatives or report events so as to shape favorable public sentiment. Abstract: Diplomatic events consistently prompt widespread public discussion and debate. Public sentiment plays a critical role in diplomacy, as favorable sentiment provides vital support for policy implementation, helps resolve international issues, and shapes a nation's international image. Traditional methods for gauging public sentiment, such as large-scale surveys or manual content analysis of media, are typically time-consuming, labor-intensive, and lack the capacity for forward-looking analysis. We propose a novel framework that identifies specific modifications for diplomatic event narratives to shift public sentiment from negative to neutral or positive. First, we train a language model to predict public reaction towards diplomatic events. To this end, we construct a dataset comprising descriptions of diplomatic events and their associated public discussions. Second, guided by communication theories and in collaboration with domain experts, we predetermine several textual features for modification, ensuring that any alterations change the event's narrative framing while preserving its core facts. We develop a counterfactual generation algorithm that employs a large language model to systematically produce modified versions of an original text. The results show that this framework successfully shifted public sentiment to a more favorable state with a 70% success rate. This framework can therefore serve as a practical tool for diplomats, policymakers, and communication specialists, offering data-driven insights on how to frame diplomatic initiatives or report on events to foster a more desirable public sentiment.

[2] Speaker Style-Aware Phoneme Anchoring for Improved Cross-Lingual Speech Emotion Recognition

Shreya G. Upadhyay,Carlos Busso,Chi-Chun Lee

Main category: cs.CL

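TL;DR: Proposes a speaker-style-aware phoneme anchoring framework for cross-lingual speech emotion recognition that improves cross-lingual emotion transfer through dual-space anchoring in the speaker and phoneme spaces.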

Details Motivation: Phonetic variability across languages and differences in speakers' expressive styles make cross-lingual speech emotion recognition challenging; the externalization of emotion must be aligned effectively across languages and speakers. Method: Emotion-specific speaker communities are built via graph-based clustering to capture shared traits, and dual-space anchoring in the speaker and phoneme spaces aligns emotional expression across languages. Result: Experiments on MSP-Podcast (English) and BIIC-Podcast (Taiwanese Mandarin) show that the method outperforms existing baselines and improves the generalization of cross-lingual emotion recognition. Conclusion: The proposed framework captures emotional commonalities across languages and speakers, and dual-space anchoring makes emotion representations more transferable. Abstract: Cross-lingual speech emotion recognition (SER) remains a challenging task due to differences in phonetic variability and speaker-specific expressive styles across languages. Effectively capturing emotion under such diverse conditions requires a framework that can align the externalization of emotions across different speakers and languages. To address this problem, we propose a speaker-style aware phoneme anchoring framework that aligns emotional expression at the phonetic and speaker levels. Our method builds emotion-specific speaker communities via graph-based clustering to capture shared speaker traits. Using these groups, we apply dual-space anchoring in speaker and phonetic spaces to enable better emotion transfer across languages. Evaluations on the MSP-Podcast (English) and BIIC-Podcast (Taiwanese Mandarin) corpora demonstrate improved generalization over competitive baselines and provide valuable insights into the commonalities in cross-lingual emotion representation.

[3] CFD-LLMBench: A Benchmark Suite for Evaluating Large Language Models in Computational Fluid Dynamics

Nithin Somasekharan,Ling Yue,Yadi Cao,Weichao Li,Patrick Emami,Pochinapeddi Sai Bhargav,Anurag Acharya,Xingyu Xie,Shaowu Pan

Main category: cs.CL

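TL;DR: Introduces CFDLLMBench, a benchmark suite for evaluating the ability of large language models to automate numerical experiments in computational fluid dynamics (CFD). It comprises three parts, CFDQuery, CFDCodeBench, and FoamBench, intended to comprehensively evaluate LLMs on CFD knowledge, physical reasoning, and workflow implementation.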

Details Motivation: Although large language models perform well on natural language processing tasks, their use in automating numerical experiments on complex physical systems remains underexplored. As a core area of computational science, CFD offers a highly challenging testbed for evaluating the scientific capabilities of LLMs. Method: CFDLLMBench, a benchmark suite grounded in real-world CFD practice, is built from three components: CFDQuery (graduate-level CFD knowledge), CFDCodeBench (numerical and physical reasoning), and FoamBench (context-dependent implementation of CFD workflows). It combines a task taxonomy with a rigorous evaluation framework that quantifies code executability, solution accuracy, and numerical convergence. Result: CFDLLMBench establishes a reproducible, systematic evaluation environment that can measure LLM performance on CFD tasks, filling a gap in existing benchmarks for scientific computing, particularly the automation of complex physical system simulation. Conclusion: CFDLLMBench lays a solid foundation for developing and evaluating LLM-driven automation of numerical experiments on complex physical systems, and shows both the potential and the challenges of LLMs in scientific computing. Abstract: Large Language Models (LLMs) have demonstrated strong performance across general NLP tasks, but their utility in automating numerical experiments on complex physical systems, a critical and labor-intensive component, remains underexplored. As the major workhorse of computational science over the past decades, Computational Fluid Dynamics (CFD) offers a uniquely challenging testbed for evaluating the scientific capabilities of LLMs. We introduce CFDLLMBench, a benchmark suite comprising three complementary components (CFDQuery, CFDCodeBench, and FoamBench) designed to holistically evaluate LLM performance across three key competencies: graduate-level CFD knowledge, numerical and physical reasoning of CFD, and context-dependent implementation of CFD workflows. Grounded in real-world CFD practices, our benchmark combines a detailed task taxonomy with a rigorous evaluation framework to deliver reproducible results and quantify LLM performance across code executability, solution accuracy, and numerical convergence behavior. CFDLLMBench establishes a solid foundation for the development and evaluation of LLM-driven automation of numerical experiments for complex physical systems. Code and data are available at https://github.com/NREL-Theseus/cfdllmbench/.

[4] Assessing Classical Machine Learning and Transformer-based Approaches for Detecting AI-Generated Research Text

Sharanya Parimanoharan,Ruwan D. Nawarathna

Main category: cs.CL

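TL;DR: Evaluates several machine learning methods for distinguishing ChatGPT-3.5-generated text from human-written research abstracts. DistilBERT performs best, and an ensemble fails to beat the best single model, suggesting that a single transformer-based model outperforms simple model ensembling.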

Details Motivation: With the wide adoption of large language models such as ChatGPT, the line between AI-generated and human-written text is increasingly blurred, which challenges academic integrity, intellectual property, and information credibility; reliable AI-text detection is needed. Method: Using a dataset of 250 pairs of human-written and ChatGPT-generated abstracts drawn from a range of research topics, the study compares classical machine learning (Logistic Regression with Bag-of-Words, POS, and TF-IDF features) against transformer-based approaches (BERT, DistilBERT, a custom BERT classifier, and LSTM N-gram models), and tests whether a voting ensemble of the three best models improves detection. Result: DistilBERT performs best overall; Logistic Regression and the custom BERT classifier are solid and balanced; the LSTM and BERT N-gram approaches lag. A majority-voting ensemble of the three best models does not surpass DistilBERT alone, indicating that a single high-performing transformer matters more than model diversity. Conclusion: Transformer-based models currently hold the most promise for detecting AI-generated research text; future work should build larger, richer datasets to develop more robust detection frameworks that keep pace with advancing generative AI. Abstract: The rapid adoption of large language models (LLMs) such as ChatGPT has blurred the line between human and AI-generated texts, raising urgent questions about academic integrity, intellectual property, and the spread of misinformation. Thus, reliable AI-text detection is needed for fair assessment to safeguard human authenticity and cultivate trust in digital communication. In this study, we investigate how well current machine learning (ML) approaches can distinguish ChatGPT-3.5-generated texts from human-written texts, employing a labeled dataset of 250 pairs of abstracts from a wide range of research topics. We test and compare both classical (Logistic Regression armed with classical Bag-of-Words, POS, and TF-IDF features) and transformer-based (BERT augmented with N-grams, DistilBERT, BERT with a lightweight custom classifier, and LSTM-based N-gram models) ML detection techniques. As we aim to assess each model's performance in detecting AI-generated research texts, we also aim to test whether an ensemble of these models can outperform any single detector. Results show DistilBERT achieves the overall best performance, while Logistic Regression and BERT-Custom offer solid, balanced alternatives; LSTM- and BERT-N-gram approaches lag. The max voting ensemble of the three best models fails to surpass DistilBERT itself, highlighting the primacy of a single transformer-based representation over mere model diversity. By comprehensively assessing the strengths and weaknesses of these AI-text detection approaches, this work lays a foundation for more robust transformer frameworks with larger, richer datasets to keep pace with ever-improving generative AI models.
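
The max voting ensemble mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not describe the implementation, so the function names and the toy predictions below are hypothetical; the sketch only shows the general idea of taking, for each abstract, the label predicted by the most detectors.

```python
from collections import Counter


def max_vote(votes):
    """Return the label that receives the most votes for one sample."""
    return Counter(votes).most_common(1)[0][0]


def ensemble_predict(per_model_labels):
    """Combine per-model label lists (one list per model) sample by sample."""
    return [max_vote(votes) for votes in zip(*per_model_labels)]


# Hypothetical predictions from three detectors on four abstracts.
distilbert = ["ai", "human", "ai", "human"]
logreg = ["ai", "ai", "ai", "human"]
bert_custom = ["human", "human", "ai", "ai"]

print(ensemble_predict([distilbert, logreg, bert_custom]))
```

With three detectors and two labels a tie cannot occur, which is one reason an odd number of voters is a common choice for this kind of ensemble.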

[5] ConceptViz: A Visual Analytics Approach for Exploring Concepts in Large Language Models

Haoxuan Li,Zhen Wen,Qiqi Jiang,Chenxiao Li,Yuwei Wu,Yuchen Yang,Yiyao Wang,Xiuqi Huang,Minfeng Zhu,Wei Chen

Main category: cs.CL

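TL;DR: Presents ConceptViz, a visual analytics system for exploring concepts in large language models, which improves the interpretability of sparse autoencoder features through an Identification => Interpretation => Validation pipeline.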

Details Motivation: Features extracted by Sparse Autoencoders (SAEs) are hard to align with human-understandable concepts, which limits the interpretability of knowledge representations in large language models. Method: ConceptViz, a visual analytics system, is designed and implemented around an Identification => Interpretation => Validation pipeline that lets users query SAEs with concepts of interest, interactively explore concept-to-feature correspondences, and validate them through model behavior. Result: Two usage scenarios and a user study show that ConceptViz supports discovering and validating concept representations in LLMs and makes interpretability research more efficient. Conclusion: ConceptViz helps researchers build more accurate mental models of LLM features and advances understanding of internal knowledge representations in LLMs. Abstract: Large language models (LLMs) have achieved remarkable performance across a wide range of natural language tasks. Understanding how LLMs internally represent knowledge remains a significant challenge. Although Sparse Autoencoders (SAEs) have emerged as a promising technique for extracting interpretable features from LLMs, SAE features do not inherently align with human-understandable concepts, making their interpretation cumbersome and labor-intensive. To bridge the gap between SAE features and human concepts, we present ConceptViz, a visual analytics system designed for exploring concepts in LLMs. ConceptViz implements a novel Identification => Interpretation => Validation pipeline, enabling users to query SAEs using concepts of interest, interactively explore concept-to-feature alignments, and validate the correspondences through model behavior verification. We demonstrate the effectiveness of ConceptViz through two usage scenarios and a user study. Our results show that ConceptViz enhances interpretability research by streamlining the discovery and validation of meaningful concept representations in LLMs, ultimately aiding researchers in building more accurate mental models of LLM features. Our code and user guide are publicly available at https://github.com/Happy-Hippo209/ConceptViz.

[6] SKILL-RAG: Self-Knowledge Induced Learning and Filtering for Retrieval-Augmented Generation

Tomoaki Isoda

Main category: cs.CL

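TL;DR: Proposes SKILL-RAG, a retrieval-augmented generation method based on the model's self-knowledge. A reinforcement learning framework explicitly elicits self-knowledge from the model to select helpful retrieved documents, improving generation quality while reducing the number of input documents.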

Details Motivation: Retrieval systems may return irrelevant content that leads the model to hallucinate, so unhelpful retrieved results must be identified and filtered out; understanding what the model "knows" and "does not know" also helps integrate internal and external knowledge. Method: SKILL-RAG uses a reinforcement learning training framework to elicit self-knowledge from the model and filters retrieved content at the sentence level, keeping useful information and removing irrelevant content. Result: Experiments with Llama2-7B and Qwen3-8B on several question answering benchmarks show that SKILL-RAG improves generation quality and significantly reduces the number of input documents. Conclusion: The model's self-knowledge plays an important role in guiding the selection of high-quality retrieved content, and SKILL-RAG improves the performance of RAG systems. Abstract: Retrieval-Augmented Generation (RAG) has significantly improved the performance of large language models (LLMs) on knowledge-intensive tasks in recent years. However, since retrieval systems may return irrelevant content, incorporating such information into the model often leads to hallucinations. Thus, identifying and filtering out unhelpful retrieved content is a key challenge for improving RAG performance. To better integrate the internal knowledge of the model with external knowledge from retrieval, it is essential to understand what the model "knows" and "does not know" (which is also called "self-knowledge"). Based on this insight, we propose SKILL-RAG (Self-Knowledge Induced Learning and Filtering for RAG), a novel method that leverages the model's self-knowledge to determine which retrieved documents are beneficial for answering a given query. We design a reinforcement learning-based training framework to explicitly elicit self-knowledge from the model and employ sentence-level granularity to filter out irrelevant content while preserving useful knowledge. We evaluate SKILL-RAG using Llama2-7B and Qwen3-8B on several question answering benchmarks. Experimental results demonstrate that SKILL-RAG not only improves generation quality but also significantly reduces the number of input documents, validating the importance of self-knowledge in guiding the selection of high-quality retrievals.

[7] Beyond Global Emotion: Fine-Grained Emotional Speech Synthesis with Dynamic Word-Level Modulation

Sirui Wang,Andong Chen,Tiejun Zhao

Main category: cs.CL

TL;DR: Proposes Emo-FiLM, a fine-grained emotion modeling framework for LLM-based text-to-speech synthesis that enables word-level emotion control.

Details Motivation: Existing emotional TTS systems mostly rely on sentence-level global emotion control and struggle to capture emotional dynamics within a sentence. Method: Frame-level emotion2vec features are aligned to words to obtain word-level emotion annotations, and a FiLM layer modulates the text embeddings to achieve fine-grained emotion control. Result: Emo-FiLM outperforms existing methods on both global and fine-grained emotion control tasks, and the FEDD dataset, with detailed annotations of emotional transitions, is constructed for evaluation. Conclusion: Emo-FiLM achieves dynamic word-level emotion control and improves the expressiveness and generality of emotional TTS. Abstract: Emotional text-to-speech (E-TTS) is central to creating natural and trustworthy human-computer interaction. Existing systems typically rely on sentence-level control through predefined labels, reference audio, or natural language prompts. While effective for global emotion expression, these approaches fail to capture dynamic shifts within a sentence. To address this limitation, we introduce Emo-FiLM, a fine-grained emotion modeling framework for LLM-based TTS. Emo-FiLM aligns frame-level features from emotion2vec to words to obtain word-level emotion annotations, and maps them through a Feature-wise Linear Modulation (FiLM) layer, enabling word-level emotion control by directly modulating text embeddings. To support evaluation, we construct the Fine-grained Emotion Dynamics Dataset (FEDD) with detailed annotations of emotional transitions. Experiments show that Emo-FiLM outperforms existing approaches on both global and fine-grained tasks, demonstrating its effectiveness and generality for expressive speech synthesis.
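
As a rough illustration of the Feature-wise Linear Modulation idea named in the abstract, the sketch below applies the generic FiLM form, a per-feature scale and shift (gamma * x + beta) predicted from a conditioning vector, to a single word embedding. The linear projections, their weights, and all function names are assumptions made for illustration; the abstract does not specify how Emo-FiLM parameterizes the layer.

```python
def linear(weights, bias, vec):
    """Plain affine map: weights is a list of rows, bias one value per row."""
    return [sum(w * v for w, v in zip(row, vec)) + b
            for row, b in zip(weights, bias)]


def film(text_embedding, emotion_feature, w_gamma, b_gamma, w_beta, b_beta):
    """Scale and shift each embedding dimension using the emotion feature."""
    gamma = linear(w_gamma, b_gamma, emotion_feature)  # per-feature scale
    beta = linear(w_beta, b_beta, emotion_feature)     # per-feature shift
    return [g * x + b for g, x, b in zip(gamma, text_embedding, beta)]


# Toy 2-dimensional example with hypothetical weights.
word_embedding = [1.0, 2.0]
word_emotion = [1.0, 0.0]
modulated = film(
    word_embedding, word_emotion,
    w_gamma=[[0.5, 0.0], [0.0, 1.0]], b_gamma=[1.0, 1.0],
    w_beta=[[0.1, 0.0], [0.2, 0.0]], b_beta=[0.0, 0.0],
)
print(modulated)
```

Because the scale and shift are computed per word, two words in the same sentence can receive different modulation, which is what distinguishes this from a single sentence-level emotion label.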

[8] USB-Rec: An Effective Framework for Improving Conversational Recommendation Capability of Large Language Model

Jianyu Wen,Jingyun Wang,Cilin Yan,Jiayin Cai,Xiaolong Jiang,Ying Zhang

Main category: cs.CL

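TL;DR: Proposes a user-simulator-based training-inference framework (USB-Rec) that improves the performance of large language models in conversational recommender systems through reinforcement learning training and a self-enhancement strategy.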

Details Motivation: Existing LLM-based conversational recommender systems focus on exploiting the model's analysis and summarization capabilities and overlook model training, leaving much of their potential untapped. Method: An LLM-based Preference Optimization (PO) dataset construction strategy is designed for reinforcement learning training, and a Self-Enhancement Strategy (SES) is introduced at inference time to improve the model's performance in conversational recommendation. Result: Experiments on multiple datasets show that the method consistently outperforms previous state-of-the-art methods. Conclusion: By combining training and inference optimization, the proposed USB-Rec framework improves LLM performance on conversational recommendation and confirms the importance of model training. Abstract: Recently, Large Language Models (LLMs) have been widely employed in Conversational Recommender Systems (CRSs). Unlike traditional language model approaches that focus on training, existing LLM-based approaches mainly center on how to leverage the summarization and analysis capabilities of LLMs while ignoring the issue of training. Therefore, in this work, we propose an integrated training-inference framework, User-Simulator-Based framework (USB-Rec), for improving the performance of LLMs in conversational recommendation at the model level. Firstly, we design an LLM-based Preference Optimization (PO) dataset construction strategy for RL training, which helps the LLMs understand the strategies and methods in conversational recommendation. Secondly, we propose a Self-Enhancement Strategy (SES) at the inference stage to further exploit the conversational recommendation potential obtained from RL training. Extensive experiments on various datasets demonstrate that our method consistently outperforms previous state-of-the-art methods.

[9] Document Summarization with Conformal Importance Guarantees

Bruce Kuwahara,Chen-Yuan Lin,Xiao Shi Huang,Kin Kwan Leung,Jullian Arta Yapeter,Ilya Stanevich,Felipe Perez,Jesse C. Cresswell

Main category: cs.CL

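TL;DR: Introduces Conformal Importance Summarization, a framework that combines importance-preserving summary generation with conformal prediction to provide reliable coverage guarantees for critical content in high-stakes domains such as healthcare, law, and finance.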

Details Motivation: Existing automatic summarization systems offer no reliable guarantee that critical content is included in high-stakes domains, so a summarization method that ensures important information is not lost is needed. Method: Thresholds on sentence-level importance scores are calibrated with conformal prediction to provide rigorous, distribution-free coverage guarantees; the method is model-agnostic, requires only a small calibration set, and integrates with existing black-box LLMs. Result: Experiments on several standard summarization benchmarks show that the method attains the theoretical information coverage rate and safeguards recall of critical content. Conclusion: Conformal Importance Summarization can be combined with existing techniques to achieve reliable, controllable automatic summarization, opening a path to safer deployment of AI summarization tools in critical applications. Abstract: Automatic summarization systems have advanced rapidly with large language models (LLMs), yet they still lack reliable guarantees on inclusion of critical content in high-stakes domains like healthcare, law, and finance. In this work, we introduce Conformal Importance Summarization, the first framework for importance-preserving summary generation which uses conformal prediction to provide rigorous, distribution-free coverage guarantees. By calibrating thresholds on sentence-level importance scores, we enable extractive document summarization with user-specified coverage and recall rates over critical content. Our method is model-agnostic, requires only a small calibration set, and seamlessly integrates with existing black-box LLMs. Experiments on established summarization benchmarks demonstrate that Conformal Importance Summarization achieves the theoretically assured information coverage rate. Our work suggests that Conformal Importance Summarization can be combined with existing techniques to achieve reliable, controllable automatic summarization, paving the way for safer deployment of AI summarization tools in critical applications. Code is available at https://github.com/layer6ai-labs/conformal-importance-summarization.
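
To make "calibrating thresholds on sentence-level importance scores" concrete, here is a minimal sketch of a generic split-conformal recipe. It is not taken from the paper or its repository: the function names, the toy scores, and the labels marking which calibration sentences are critical are all hypothetical, and the sketch assumes calibration and test documents are exchangeable. For each calibration document it finds the largest score threshold that still keeps a fraction `beta` of the critical sentences, then picks a conservative lower quantile of those thresholds so that a new document reaches that recall with probability at least `1 - alpha`.

```python
import math


def doc_threshold(scores, important, beta):
    """Largest threshold t such that keeping sentences with score >= t
    still retains at least a fraction beta of the important sentences."""
    imp = sorted((s for s, flag in zip(scores, important) if flag), reverse=True)
    if not imp:
        return math.inf  # nothing to recall, any threshold works
    needed = max(1, math.ceil(beta * len(imp) - 1e-12))
    return imp[needed - 1]


def calibrate(cal_docs, alpha, beta):
    """Pick the floor(alpha * (n + 1))-th smallest per-document threshold.
    Rounding down keeps the choice conservative."""
    thresholds = sorted(doc_threshold(s, imp, beta) for s, imp in cal_docs)
    k = math.floor(alpha * (len(thresholds) + 1))
    return -math.inf if k == 0 else thresholds[k - 1]


def summarize(sentences, scores, q_hat):
    """Extractive summary: keep every sentence scoring at least q_hat."""
    return [sent for sent, s in zip(sentences, scores) if s >= q_hat]


# Hypothetical calibration set: (importance scores, is-critical labels).
cal_docs = [
    ([0.9, 0.2, 0.7], [True, False, True]),
    ([0.4, 0.8, 0.1], [True, True, False]),
    ([0.6, 0.5, 0.3], [False, True, False]),
    ([0.95, 0.85, 0.2], [True, True, False]),
]
q_hat = calibrate(cal_docs, alpha=0.4, beta=1.0)
print(q_hat)
print(summarize(["a", "b", "c"], [0.6, 0.45, 0.5], q_hat))
```

When the calibration set is too small for the requested `alpha`, `calibrate` returns negative infinity and the summary keeps every sentence, which is the only way to honor the guarantee in that case.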

[10] ShortCheck: Checkworthiness Detection of Multilingual Short-Form Videos

Henrik Vatndal,Vinay Setty

Main category: cs.CL

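TL;DR: ShortCheck is a modular, inference-only automated pipeline for detecting checkworthy misinformation on short-form video platforms such as TikTok; it integrates multiple techniques and performs well in a multilingual setting.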

Details Motivation: Content on short-form video platforms is multimodal, dynamic, and noisy, which makes misinformation detection challenging; effective tools are needed to assist human fact-checkers. Method: A modular inference pipeline named ShortCheck is built, integrating speech transcription, OCR, object and deepfake detection, video-to-text summarization, and claim verification. Result: Evaluated on two manually annotated multilingual TikTok video datasets, the pipeline achieves a weighted F1 score above 70%. Conclusion: ShortCheck identifies checkworthy short-form video content, helps make human fact-checking more efficient, and is suited to misinformation detection on short-form video in multilingual settings. Abstract: Short-form video platforms like TikTok present unique challenges for misinformation detection due to their multimodal, dynamic, and noisy content. We present ShortCheck, a modular, inference-only pipeline with a user-friendly interface that automatically identifies checkworthy short-form videos to help human fact-checkers. The system integrates speech transcription, OCR, object and deepfake detection, video-to-text summarization, and claim verification. ShortCheck is validated by evaluating it on two manually annotated datasets with TikTok videos in a multilingual setting. The pipeline achieves promising results, with a weighted F1 score over 70%.

[11] MARS: toward more efficient multi-agent collaboration for LLM reasoning

Xiao Wang,Jia Wang,Yijie Wang,Pengtao Dang,Sha Cao,Chi Zhang

Main category: cs.CL

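TL;DR: Proposes MARS (Multi-Agent Review System), a role-based collaborative reasoning framework. By dividing work among an author, reviewers, and a meta-reviewer, it matches the accuracy of Multi-Agent Debate (MAD) while cutting token usage and inference time by approximately 50%.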

Details Motivation: A single large language model has limited reasoning ability. Multi-Agent Debate (MAD) improves reasoning but incurs high computational overhead because of frequent communication, so a more efficient form of multi-agent collaborative reasoning is needed. Method: In the MARS framework, an author agent generates an initial solution, several reviewer agents give feedback independently, and a meta-reviewer agent integrates the feedback and guides revision. Avoiding direct reviewer-to-reviewer interaction reduces resource consumption. Result: Across multiple benchmarks, MARS matches the accuracy of MAD and other advanced reasoning methods while reducing token usage and inference time by approximately 50%. Conclusion: Through its division of roles, MARS achieves efficient, high-quality multi-agent collaborative reasoning, significantly lowering computational cost without sacrificing performance, and is a more scalable reasoning framework. Abstract: Large language models (LLMs) have achieved impressive results in natural language understanding, yet their reasoning capabilities remain limited when operating as single agents. Multi-Agent Debate (MAD) has been proposed to address this limitation by enabling collaborative reasoning among multiple models in a round-table debate manner. While effective, MAD introduces substantial computational overhead due to the number of agents involved and the frequent communication required. In this paper, we propose MARS (Multi-Agent Review System), a role-based collaboration framework inspired by the review process. In MARS, an author agent generates an initial solution, reviewer agents provide decisions and comments independently, and a meta-reviewer integrates the feedback to make the final decision and guide further revision. This design enhances reasoning quality while avoiding costly reviewer-to-reviewer interactions, thereby controlling token consumption and inference time. We compared MARS with both MAD and other state-of-the-art reasoning strategies across multiple benchmarks. Extensive experiments with different LLMs show that MARS matches the accuracy of MAD while reducing both token usage and inference time by approximately 50%. Code is available at https://github.com/xwang97/MARS.
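
The author, reviewer, and meta-reviewer roles described in the abstract can be sketched as a simple control loop. This is a minimal illustration and not the code from the MARS repository: the function signatures, the `max_rounds` cap, and the toy agents (plain functions standing in for LLM calls) are all assumptions.

```python
def mars(question, author, reviewers, meta_reviewer, max_rounds=3):
    """Author drafts, reviewers comment independently, meta-reviewer decides."""
    solution = author(question, feedback=None)
    for _ in range(max_rounds):
        # Reviewers never see each other's output, only the current solution.
        reviews = [review(question, solution) for review in reviewers]
        accept, guidance = meta_reviewer(question, solution, reviews)
        if accept:
            break
        solution = author(question, feedback=guidance)
    return solution


# Toy stand-ins for LLM-backed agents (hypothetical behavior).
def toy_author(question, feedback):
    return "5" if feedback is None else "4"


def toy_reviewer(question, solution):
    ok = solution == "4"
    return ok, "looks right" if ok else "recheck the arithmetic"


def toy_meta(question, solution, reviews):
    accepts = sum(1 for ok, _ in reviews if ok)
    guidance = "; ".join(comment for ok, comment in reviews if not ok)
    return accepts * 2 > len(reviews), guidance


answer = mars("2 + 2 = ?", toy_author, [toy_reviewer, toy_reviewer], toy_meta)
print(answer)
```

In this layout each round costs one call per reviewer plus one meta-reviewer call, so the number of calls grows linearly with the number of reviewers rather than with the number of reviewer pairs.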

[12] SiniticMTError: A Machine Translation Dataset with Error Annotations for Sinitic Languages

Hannah Liu,Junghyun Min,Ethan Yue Heng Cheung,Shou-Yi Hung,Syed Mekael Wasti,Runtong Liang,Shiyao Qian,Shizhao Zheng,Elsie Chan,Ka Ieng Charlotte Lo,Wing Yu Yip,Richard Tzong-Han Tsai,En-Shiun Annie Lee

Main category: cs.CL

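TL;DR: Introduces SiniticMTError, a new annotated dataset of machine translation errors for English to Mandarin, Cantonese, and Wu Chinese, with error span, type, and severity information, intended to support research on translation quality estimation and error-aware generation for low-resource languages.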

Details Motivation: Despite significant recent progress in machine translation, many low-resource languages that lack large-scale training data and linguistic resources, such as Cantonese and Wu Chinese, have seen limited progress, so dedicated datasets are needed to advance research. Method: Building on existing parallel corpora, native speakers annotate machine translation output with error spans, error types, and severity, and the work analyzes inter-annotator agreement, iterative feedback, and error patterns. Result: A multilingual error-annotated dataset named SiniticMTError is constructed, covering translation errors from English to Mandarin, Cantonese, and Wu Chinese, with reliable inter-annotator agreement and common error patterns reported. Conclusion: SiniticMTError is a valuable resource for machine translation quality estimation, error-aware generation, and model fine-tuning for low-resource languages, and helps advance research in these areas. Abstract: Despite major advances in machine translation (MT) in recent years, progress remains limited for many low-resource languages that lack large-scale training data and linguistic resources. Cantonese and Wu Chinese are two Sinitic examples, although each enjoys more than 80 million speakers around the world. In this paper, we introduce SiniticMTError, a novel dataset that builds on existing parallel corpora to provide error span, error type, and error severity annotations in machine-translated examples from English to Mandarin, Cantonese, and Wu Chinese. Our dataset serves as a resource for the MT community to utilize in fine-tuning models with error detection capabilities, supporting research on translation quality estimation, error-aware generation, and low-resource language evaluation. We report our rigorous annotation process by native speakers, with analyses on inter-annotator agreement, iterative feedback, and patterns in error type and severity.

[13] SwasthLLM: a Unified Cross-Lingual, Multi-Task, and Meta-Learning Zero-Shot Framework for Medical Diagnosis Using Contrastive Representations

Ayan Sar,Pranav Singh Puri,Sumit Aich,Tanupriya Choudhury,Abhijit Kumar

Main category: cs.CL

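TL;DR: Proposes SwasthLLM, a unified, zero-shot, cross-lingual, multi-task learning framework for medical diagnosis in English, Hindi, and Bengali that requires no language-specific fine-tuning and generalizes well to low-resource languages.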

Details Motivation: Scarce annotated medical data in low-resource languages and linguistic variation across populations make automatic disease diagnosis from clinical text challenging in multilingual healthcare settings. Method: Built on the multilingual XLM-RoBERTa encoder, the framework adds a language-aware attention mechanism, a Siamese contrastive learning module, a translation consistency module, and a contrastive projection head. A multi-task learning strategy jointly optimizes disease classification, translation alignment, and contrastive learning objectives, and MAML is used for rapid adaptation. Result: The model reaches 97.22% accuracy and a 97.17% F1 score in the supervised setting; in the zero-shot setting it attains 92.78% accuracy on Hindi and 73.33% on Bengali. Conclusion: SwasthLLM achieves cross-lingual medical diagnosis without language-specific fine-tuning and shows good generalization and application potential, especially for low-resource languages. Abstract: In multilingual healthcare environments, automatic disease diagnosis from clinical text remains a challenging task due to the scarcity of annotated medical data in low-resource languages and the linguistic variability across populations. This paper proposes SwasthLLM, a unified, zero-shot, cross-lingual, and multi-task learning framework for medical diagnosis that operates effectively across English, Hindi, and Bengali without requiring language-specific fine-tuning. At its core, SwasthLLM leverages the multilingual XLM-RoBERTa encoder augmented with a language-aware attention mechanism and a disease classification head, enabling the model to extract medically relevant information regardless of the language structure. To align semantic representations across languages, a Siamese contrastive learning module is introduced, ensuring that equivalent medical texts in different languages produce similar embeddings. Further, a translation consistency module and a contrastive projection head reinforce language-invariant representation learning. SwasthLLM is trained using a multi-task learning strategy, jointly optimizing disease classification, translation alignment, and contrastive learning objectives. Additionally, we employ Model-Agnostic Meta-Learning (MAML) to equip the model with rapid adaptation capabilities for unseen languages or tasks with minimal data. Our phased training pipeline emphasizes robust representation alignment before task-specific fine-tuning. Extensive evaluation shows that SwasthLLM achieves high diagnostic performance, with a test accuracy of 97.22% and an F1-score of 97.17% in supervised settings. Crucially, in zero-shot scenarios, it attains 92.78% accuracy on Hindi and 73.33% accuracy on Bengali medical text, demonstrating strong generalization in low-resource contexts.
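
The goal stated for the Siamese contrastive module, that equivalent texts in different languages produce similar embeddings, can be illustrated with the classic margin-based pairwise contrastive loss. The abstract does not say which loss SwasthLLM uses, so this choice, the function names, and the toy embeddings are assumptions for illustration only.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length embedding vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def contrastive_loss(emb_a, emb_b, same_meaning, margin=1.0):
    """Pull translation pairs together; push unrelated pairs at least
    `margin` apart (margin-based pairwise contrastive loss)."""
    d = euclidean(emb_a, emb_b)
    if same_meaning:
        return d ** 2
    return max(0.0, margin - d) ** 2


# Hypothetical embeddings: an English note, its Hindi translation, and an
# unrelated note.
english = [0.0, 0.0]
hindi_translation = [0.0, 0.0]
unrelated = [3.0, 4.0]

print(contrastive_loss(english, hindi_translation, same_meaning=True))
print(contrastive_loss(english, unrelated, same_meaning=False))
print(contrastive_loss(english, unrelated, same_meaning=False, margin=6.0))
```

The loss is zero when a translation pair coincides or when an unrelated pair is already farther apart than the margin, so only pairs that violate the desired geometry contribute to training.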

[14] Dynamic Reasoning Chains through Depth-Specialized Mixture-of-Experts in Transformer Architectures

Sampurna Roy,Ayan Sar,Anurag Kaushish,Kanav Gupta,Tanupriya Choudhury,Abhijit Kumar

Main category: cs.CL

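TL;DR: Proposes Dynamic Reasoning Chains with a Depth-Specialised Mixture-of-Experts model (DS-MoE), which dynamically selects expert modules according to input complexity to achieve more efficient, accurate, and interpretable reasoning.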

Details Motivation: Conventional transformers apply the same processing depth to every input, which wastes resources and limits complex reasoning, so an architecture that adapts computational depth to input complexity is needed. Method: The DS-MoE framework extends Mixture of Experts from width to depth specialisation, with expert modules designed for different reasoning levels (such as shallow pattern recognition, logical inference, and memory integration) and a learned routing network that dynamically assembles custom reasoning chains. Result: Experiments on The Pile show that, compared with fixed-depth models, DS-MoE saves up to 16% of computation, runs inference 35% faster, improves accuracy on complex multi-step reasoning tasks by 2.8%, and yields more interpretable reasoning paths. Conclusion: Through depth specialisation and dynamic routing, DS-MoE makes notable progress on the efficiency, accuracy, and interpretability of large-model reasoning and offers a new direction for adaptive neural architectures. Abstract: Contemporary transformer architectures apply identical processing depth to all inputs, creating inefficiencies and limiting reasoning quality. Simple factual queries are subjected to the same multilayered computation as complex logical problems, wasting resources while constraining deep inference. To overcome this, we propose Dynamic Reasoning Chains through Depth Specialised Mixture of Experts (DS-MoE), a modular framework that extends the Mixture of Experts paradigm from width-based to depth specialised computation. DS-MoE introduces expert modules optimised for distinct reasoning depths: shallow pattern recognition, compositional reasoning, logical inference, memory integration, and meta-cognitive supervision. A learned routing network dynamically assembles custom reasoning chains, activating only the necessary experts to match input complexity. We trained and evaluated DS-MoE on The Pile, an 800GB corpus covering diverse domains such as scientific papers, legal texts, programming code, and web content, enabling systematic assessment across reasoning depths. Experimental results demonstrate that DS-MoE achieves up to 16 per cent computational savings and 35 per cent faster inference compared to uniform-depth transformers, while delivering 2.8 per cent higher accuracy on complex multi-step reasoning benchmarks. Furthermore, routing decisions yield interpretable reasoning chains, enhancing transparency and scalability. These findings establish DS-MoE as a significant advancement in adaptive neural architectures, demonstrating that depth-specialised modular processing can simultaneously improve efficiency, reasoning quality, and interpretability in large-scale language models.

[15] Hierarchical Resolution Transformers: A Wavelet-Inspired Architecture for Multi-Scale Language Understanding

Ayan Sar,Sampurna Roy,Kanav Gupta,Anurag Kaushish,Tanupriya Choudhury,Abhijit Kumar

Main category: cs.CL

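TL;DR: Proposes the wavelet-inspired Hierarchical Resolution Transformer (HRT), which processes language at multiple resolutions to give a computational architecture aligned with the hierarchical structure of human language; it outperforms standard transformers on several benchmarks while being significantly more efficient.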

Details Motivation: Transformers treat text as a flat token sequence and cannot effectively model the hierarchical nature of language, leading to high computational cost, weak generalization, and inadequate discourse modeling. Method: A multi-resolution attention mechanism combines bottom-up composition with top-down contextualization and uses exponential sequence reduction across scales to achieve O(n log n) complexity. Result: On benchmarks including GLUE, SuperGLUE, and Long Range Arena, HRT improves by an average of 3.8% to 6.1%, reduces memory usage by 42% and inference latency by 37%, and ablations confirm the effectiveness of cross-resolution attention and scale-specialized modules. Conclusion: HRT is the first model to align computational structure with the hierarchical organization of language, showing that multi-scale, wavelet-inspired processing improves both theoretical efficiency and practical language understanding. Abstract: Transformer architectures have achieved state-of-the-art performance across natural language tasks, yet they fundamentally misrepresent the hierarchical nature of human language by processing text as flat token sequences. This results in quadratic computational cost, weak compositional generalization, and inadequate discourse-level modeling. We propose Hierarchical Resolution Transformer (HRT), a novel wavelet-inspired neural architecture that processes language simultaneously across multiple resolutions, from characters to discourse-level units. HRT constructs a multi-resolution attention, enabling bottom-up composition and top-down contextualization. By employing exponential sequence reduction across scales, HRT achieves O(n log n) complexity, offering significant efficiency improvements over standard transformers. We evaluated HRT on a diverse suite of benchmarks, including GLUE, SuperGLUE, Long Range Arena, and WikiText-103, and results demonstrated that HRT outperforms standard transformer baselines by an average of +3.8% on GLUE, +4.5% on SuperGLUE, and +6.1% on Long Range Arena, while reducing memory usage by 42% and inference latency by 37% compared to BERT and GPT style models of similar parameter count. Ablation studies confirm the effectiveness of cross-resolution attention and scale-specialized modules, showing that each contributes independently to both efficiency and accuracy. Our findings establish HRT as the first architecture to align computational structure with the hierarchical organization of human language, demonstrating that multi-scale, wavelet-inspired processing yields both theoretical efficiency gains and practical improvements in language understanding.

[16] FS-DFM: Fast and Accurate Long Text Generation with Few-Step Diffusion Language Models

Amin Karimi Monsefi,Nikhil Bhendawade,Manuel Rafael Ciosici,Dominic Culver,Yizhe Zhang,Irina Belousova

Main category: cs.CL

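TL;DR: Proposes FS-DFM (Few-Step Discrete Flow-Matching), a discrete flow-matching model aimed at fast, high-quality language generation. Whereas conventional diffusion models need hundreds to thousands of steps, FS-DFM reaches comparable generation quality in only 8 steps, significantly improving sampling speed and throughput.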

Details Motivation: Autoregressive language models generate slowly, and standard discrete diffusion models, although parallel, need many steps and are therefore inefficient. A model is needed that sharply reduces the number of sampling steps while maintaining generation quality. Method: The number of sampling steps is made an explicit parameter and the model is trained to be consistent across step budgets; a stable update rule avoids overshooting probabilities, and strong teacher guidance distilled from long-run trajectories improves training. Result: On language modeling benchmarks, FS-DFM with 8 sampling steps matches the perplexity of a 1,024-step baseline and is up to 128 times faster when generating 1,024 tokens, significantly reducing latency and increasing throughput. Conclusion: By designing a consistent few-step generation mechanism, FS-DFM greatly improves the generation efficiency of discrete diffusion language models without sacrificing quality, offering a new direction for efficient text generation. Abstract: Autoregressive language models (ARMs) deliver strong likelihoods, but are inherently serial: they generate one token per forward pass, which limits throughput and inflates latency for long sequences. Diffusion Language Models (DLMs) parallelize across positions and thus appear promising for language generation, yet standard discrete diffusion typically needs hundreds to thousands of model evaluations to reach high quality, trading serial depth for iterative breadth. We introduce FS-DFM (Few-Step Discrete Flow-Matching), a discrete flow-matching model designed for speed without sacrificing quality. The core idea is simple: make the number of sampling steps an explicit parameter and train the model to be consistent across step budgets, so one big move lands where many small moves would. We pair this with a reliable update rule that moves probability in the right direction without overshooting, and with strong teacher guidance distilled from long-run trajectories. Together, these choices make few-step sampling stable, accurate, and easy to control. On language modeling benchmarks, FS-DFM with 8 sampling steps achieves perplexity parity with a 1,024-step discrete-flow baseline for generating 1,024 tokens using a similar-size model, delivering up to 128 times faster sampling and corresponding latency/throughput gains.
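
The "up to 128 times" figure is consistent with the ratio of step counts quoted in the abstract. The short sketch below makes that arithmetic explicit, under the assumption (not stated in the abstract) that every sampling step costs one model evaluation of roughly equal cost and that other overheads are negligible; the function name is hypothetical.

```python
def max_speedup(baseline_steps, few_steps):
    """Upper bound on sampling speedup when each step costs one
    model evaluation of equal cost (an idealized assumption)."""
    if baseline_steps <= 0 or few_steps <= 0:
        raise ValueError("step counts must be positive")
    return baseline_steps / few_steps


print(max_speedup(1024, 8))
```

Real speedups can fall below this bound if per-step cost differs between the two samplers, which fits the abstract's "up to" wording.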

[17] Look Before you Leap: Estimating LLM Benchmark Scores from Descriptions

Jungsoo Park,Ethan Mendes,Gabriel Stanovsky,Alan Ritter

Main category: cs.CL

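TL;DR: Proposes forecasting large language model performance without running experiments, and builds PRECOG, a redacted dataset of task descriptions and configurations, to explore the feasibility of text-only performance forecasting.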

Details Motivation: Progress in large language models is limited by an evaluation bottleneck: conventional practice requires repeatedly building benchmarks and iterating on experiments, which is inefficient. The authors therefore aim to forecast model performance before experiments are run. Method: The text-only performance forecasting task estimates a model's score from a redacted task description and model configuration alone, with no access to dataset instances. The PRECOG dataset is curated to support systematic study, models equipped with a retrieval module are used for forecasting experiments, and forecasting is also tested in a zero-leakage setting. Result: The task is challenging but feasible: models with a retrieval module reach a mean absolute error as low as 8.7 on the Accuracy subset at high-confidence thresholds. Stronger reasoning models issue diverse, iterative queries, whereas current open-source models perform worse. In zero-leakage forecasting on newly released datasets, GPT-5 with web search still attains nontrivial prediction accuracy. Conclusion: The PRECOG dataset and accompanying analyses provide an initial basis for open-ended anticipatory evaluation, supporting task difficulty estimation and smarter experiment prioritization. Abstract: Progress in large language models is constrained by an evaluation bottleneck: build a benchmark, evaluate models and settings, then iterate. We therefore ask a simple question: can we forecast outcomes before running any experiments? We study text-only performance forecasting: estimating a model's score from a redacted task description and intended configuration, with no access to dataset instances. To support systematic study, we curate PRECOG, a corpus of redacted description-performance pairs spanning diverse tasks, domains, and metrics. Experiments show the task is challenging but feasible: models equipped with a retrieval module that excludes source papers achieve moderate prediction performance with well-calibrated uncertainty, reaching mean absolute error as low as 8.7 on the Accuracy subset at high-confidence thresholds. Our analysis indicates that stronger reasoning models engage in diverse, iterative querying, whereas current open-source models lag and often skip retrieval or gather evidence with limited diversity. We further test a zero-leakage setting, forecasting on newly released datasets or experiments before their papers are indexed, where GPT-5 with built-in web search still attains nontrivial prediction accuracy. Overall, our corpus and analyses offer an initial step toward open-ended anticipatory evaluation, supporting difficulty estimation and smarter experiment prioritization.

[18] Building Tailored Speech Recognizers for Japanese Speaking Assessment

Yotaro Kubo,Richard Sproat,Chihiro Taguchi,Llion Jones

Main category: cs.CL

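TL;DR: This paper proposes two methods for mitigating data sparsity in order to build a recognizer, tailored to Japanese speaking assessment, that outputs phonemic labels with accent markers.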

Details Motivation: Training data with accurate phonemic transcriptions that include accent marks is scarce, which makes it hard to build accurate Japanese pronunciation assessment systems. Method: A multitask training scheme adds auxiliary loss functions for orthographic labels and pitch patterns, and a method based on the finite-state transducer framework fuses an estimator over phonetic strings with one over text token sequences. Result: The proposed methods reduce the average mora-label error rate on the CSJ core evaluation sets from 12.3% to 7.1% and outperform generic multilingual recognizers. Conclusion: Multitask learning and model fusion effectively improve the accuracy of Japanese speech recognition for pronunciation assessment. Abstract: This paper presents methods for building speech recognizers tailored for Japanese speaking assessment tasks. Specifically, we build a speech recognizer that outputs phonemic labels with accent markers. Although Japanese is resource-rich, there is only a small amount of data for training models to produce accurate phonemic transcriptions that include accent marks. We propose two methods to mitigate data sparsity. First, a multitask training scheme introduces auxiliary loss functions to estimate orthographic text labels and pitch patterns of the input signal, so that utterances with only orthographic annotations can be leveraged in training. The second fuses two estimators, one over phonetic alphabet strings, and the other over text token sequences. To combine these estimates we develop an algorithm based on the finite-state transducer framework. Our results indicate that the use of multitask learning and fusion is effective for building an accurate phonemic recognizer. We show that this approach is advantageous compared to the use of generic multilingual recognizers. The relative advantages of the proposed methods were also compared. Our proposed methods reduced the average of mora-label error rates from 12.3% to 7.1% over the CSJ core evaluation sets.
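A label error rate of this kind is commonly computed as the edit distance between reference and hypothesis label sequences divided by the reference length. The sketch below assumes that definition; the abstract does not spell out the exact scoring procedure. The mora labels and the trailing `'` used as an accent marker are hypothetical notation for illustration.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two label sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]

def label_error_rate(ref, hyp):
    """Edit distance normalized by the reference length."""
    return edit_distance(ref, hyp) / len(ref)

# Hypothetical mora labels; the trailing "'" marks an accent in this toy notation.
ref = ["ha", "shi'", "ga"]
hyp = ["ha", "shi", "ga"]  # accent marker missed: one substitution
print(label_error_rate(ref, hyp))

# Relative reduction implied by the reported drop from 12.3% to 7.1%.
relative_reduction = (12.3 - 7.1) / 12.3
print(f"{relative_reduction:.1%}")
```

Under this definition a missed accent marker counts as a full substitution, so accent errors and phoneme errors weigh equally.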

[19] Enhancing Molecular Property Prediction with Knowledge from Large Language Models

Peng Zhou,Lai Hou Tim,Zhixiang Cheng,Kun Xie,Chaoyi Li,Wei Liu,Xiangxiang Zeng

Main category: cs.CL

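TL;DR: This paper proposes a new framework that combines knowledge extracted from large language models (LLMs) with structural features from pre-trained molecular models to improve molecular property prediction (MPP). LLMs are prompted to generate domain knowledge and executable code, which yield knowledge-based features that are fused with structural representations; experiments with several LLMs show the method outperforms existing approaches.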

Details Motivation: Graph neural networks and self-supervised learning have advanced molecular property prediction, but LLMs suffer from knowledge gaps and hallucinations, especially for less-studied molecular properties, so human prior knowledge must be integrated effectively to improve prediction accuracy. Method: A new framework uses LLMs such as GPT-4o, GPT-4.1, and DeepSeek-R1 to generate domain-relevant knowledge and executable code for molecular vectorization, producing knowledge-based features that are fused with structural features from pre-trained molecular models for molecular property prediction. Result: Experiments on multiple benchmark datasets show that the method outperforms existing molecular property prediction methods, supporting the effectiveness and robustness of combining LLM-derived knowledge with structural information. Conclusion: Fusing LLM-generated knowledge with molecular structural features is an effective and reliable strategy for molecular property prediction and suggests a direction for AI in drug discovery. Abstract: Predicting molecular properties is a critical component of drug discovery. Recent advances in deep learning, particularly Graph Neural Networks (GNNs), have enabled end-to-end learning from molecular structures, reducing reliance on manual feature engineering. However, while GNNs and self-supervised learning approaches have advanced molecular property prediction (MPP), the integration of human prior knowledge remains indispensable, as evidenced by recent methods that leverage large language models (LLMs) for knowledge extraction. Despite their strengths, LLMs are constrained by knowledge gaps and hallucinations, particularly for less-studied molecular properties. In this work, we propose a novel framework that, for the first time, integrates knowledge extracted from LLMs with structural features derived from pre-trained molecular models to enhance MPP. Our approach prompts LLMs to generate both domain-relevant knowledge and executable code for molecular vectorization, producing knowledge-based features that are subsequently fused with structural representations. We employ three state-of-the-art LLMs, GPT-4o, GPT-4.1, and DeepSeek-R1, for knowledge extraction. Extensive experiments demonstrate that our integrated method outperforms existing approaches, confirming that the combination of LLM-derived knowledge and structural information provides a robust and effective solution for MPP.

[20] RedHerring Attack: Testing the Reliability of Attack Detection

Jonathan Rusert

Main category: cs.CL

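TL;DR: Proposes and tests RedHerring, a new attack that aims to make attack detection models unreliable by causing them to raise false alarms while the classifier stays accurate.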

Details Motivation: The reliability of existing detectors of adversarial text attacks has not been thoroughly explored, so the work studies how such models can be made unreliable. Method: RedHerring modifies a text so that the detection model predicts an attack while the classifier remains correct. It is tested on 4 datasets against 3 detectors defending 4 classifiers. Result: RedHerring lowers detection accuracy by 20 to 71 points while maintaining or improving classifier accuracy. A simple confidence check, proposed as an initial defense, greatly increases detection accuracy. Conclusion: This novel threat model offers new insights into how adversaries may target detection models. Abstract: In response to adversarial text attacks, attack detection models have been proposed and shown to successfully identify text modified by adversaries. Attack detection models can be leveraged to provide an additional check for NLP models and give signals for human input. However, the reliability of these models has not yet been thoroughly explored. Thus, we propose and test a novel attack setting and attack, RedHerring. RedHerring aims to make attack detection models unreliable by modifying a text to cause the detection model to predict an attack, while keeping the classifier correct. This creates a tension between the classifier and detector. If a human sees that the detector is giving an "incorrect" prediction, but the classifier a correct one, then the human will see the detector as unreliable. We test this novel threat model on 4 datasets against 3 detectors defending 4 classifiers. We find that RedHerring is able to drop detection accuracy by 20 to 71 points, while maintaining (or improving) classifier accuracy. As an initial defense, we propose a simple confidence check which requires no retraining of the classifier or detector and increases detection accuracy greatly. This novel threat model offers new insights into how adversaries may target detection models.

[21] Overcoming Black-box Attack Inefficiency with Hybrid and Dynamic Select Algorithms

Abhinay Shankar Belde,Rohit Ramkumar,Jonathan Rusert

Main category: cs.CL

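TL;DR: Proposes two new attack selection strategies, Hybrid Select and Dynamic Select, which reduce the number of queries needed in adversarial text attacks while maintaining attack effectiveness.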

Details Motivation: Existing black-box attack methods need many queries and are computationally expensive, which is a burden for researchers with limited resources, so more efficient attack strategies are needed. Method: Two strategies are proposed: Hybrid Select combines BinarySelect and GreedySelect through a size threshold; Dynamic Select learns which selection method to apply to texts of which length. Result: Across 4 datasets and 6 target models, sentence-level Hybrid Select reduces the number of queries by up to 25.82% on average without losing attack effectiveness. Conclusion: The proposed methods substantially cut the query cost of adversarial attacks and improve attack efficiency, making them suitable for evaluating NLP model robustness under resource constraints. Abstract: Adversarial text attack research plays a crucial role in evaluating the robustness of NLP models. However, the increasing complexity of transformer-based architectures has dramatically raised the computational cost of attack testing, especially for researchers with limited resources (e.g., GPUs). Existing popular black-box attack methods often require a large number of queries, which can make them inefficient and impractical for researchers. To address these challenges, we propose two new attack selection strategies called Hybrid and Dynamic Select, which better combine the strengths of previous selection algorithms. Hybrid Select merges generalized BinarySelect techniques with GreedySelect by introducing a size threshold to decide which selection algorithm to use. Dynamic Select provides an alternative approach of combining the generalized Binary and GreedySelect by learning which lengths of texts each selection method should be applied to. This greatly reduces the number of queries needed while maintaining attack effectiveness (a limitation of BinarySelect). Across 4 datasets and 6 target models, our best method (sentence-level Hybrid Select) is able to reduce the number of required queries per attack by up to 25.82% on average against both encoder models and LLMs, without losing the effectiveness of the attack.
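The idea of switching selection algorithms at a size threshold can be sketched as follows. This is a simplified word-level illustration and not the authors' implementation: the binary phase halves the search range by querying the classifier with each half removed, and once the range is no larger than the threshold a greedy phase queries every remaining word. The function names, the toy scoring function, and the importance values are all assumptions.

```python
def hybrid_select(words, score_without, threshold):
    """Return (index of the most influential word, number of classifier queries).

    score_without(lo, hi) is the classifier confidence with words[lo:hi] removed;
    a lower score means the removed span mattered more.
    """
    if threshold < 1:
        raise ValueError("threshold must be at least 1")
    lo, hi, queries = 0, len(words), 0
    # Binary phase: keep the half whose removal hurts the classifier more.
    while hi - lo > threshold:
        mid = (lo + hi) // 2
        left = score_without(lo, mid)
        right = score_without(mid, hi)
        queries += 2
        if left <= right:
            hi = mid
        else:
            lo = mid
    # Greedy phase: query every remaining word individually.
    best, best_score = lo, float("inf")
    for i in range(lo, hi):
        score = score_without(i, i + 1)
        queries += 1
        if score < best_score:
            best, best_score = i, score
    return best, queries

# Toy stand-in for a black-box classifier (hypothetical importance values).
WORDS = "the movie was absolutely terrible and far too long".split()
IMPORTANCE = {"terrible": 0.6, "long": 0.2, "absolutely": 0.1}

def toy_score_without(lo, hi):
    removed = sum(IMPORTANCE.get(w, 0.0) for w in WORDS[lo:hi])
    return 0.95 - removed

print(hybrid_select(WORDS, toy_score_without, threshold=3))
print(hybrid_select(WORDS, toy_score_without, threshold=len(WORDS)))  # pure greedy
```

In this toy case both settings find the same word, but the hybrid run issues fewer queries. When influence is spread over several words, the binary phase can discard the half that holds the single most influential word, which is one reason to fall back to greedy selection on short spans.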

[22] MI-Fuse: Label Fusion for Unsupervised Domain Adaptation with Closed-Source Large-Audio Language Model

Hsiao-Ying Huang,Yi-Cheng Lin,Hung-yi Lee

Main category: cs.CL

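TL;DR: Proposes MI-Fuse, a denoised label fusion framework weighted by mutual information, which improves cross-domain speech emotion recognition when source-domain data are unavailable and the large audio-language model is accessible only through an API.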

Details Motivation: Speech emotion recognition degrades under domain mismatch in real-world deployments, where source data are unavailable and powerful audio-language models can be reached only through an API. Method: MI-Fuse uses an API-only large audio-language model and a source-domain-trained emotion classifier as two teachers, draws multiple stochastic predictions from each, weights their mean distributions by mutual-information-based uncertainty, and stabilizes training with an exponential moving average teacher. Result: Across three public emotion datasets and six cross-domain transfer settings, the student model consistently surpasses the large audio-language model and outperforms the strongest baseline by 3.9%. Conclusion: MI-Fuse adapts speech emotion recognition across domains without sharing source data, strengthening emotion-aware speech systems in realistic settings. Abstract: Large audio-language models (LALMs) show strong zero-shot ability on speech tasks, suggesting promise for speech emotion recognition (SER). However, SER in real-world deployments often fails under domain mismatch, where source data are unavailable and powerful LALMs are accessible only through an API. We ask: given only unlabeled target-domain audio and an API-only LALM, can a student model be adapted to outperform the LALM in the target domain? To this end, we propose MI-Fuse, a denoised label fusion framework that supplements the LALM with a source-domain trained SER classifier as an auxiliary teacher. The framework draws multiple stochastic predictions from both teachers, weights their mean distributions by mutual-information-based uncertainty, and stabilizes training with an exponential moving average teacher. Experiments across three public emotion datasets and six cross-domain transfers show consistent gains, with the student surpassing the LALM and outperforming the strongest baseline by 3.9%. This approach strengthens emotion-aware speech systems without sharing source data, enabling realistic adaptation.
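The fusion step can be illustrated with a small sketch. It estimates each teacher's uncertainty as the mutual information between its stochastic predictions and the label, computed as the entropy of the mean distribution minus the mean entropy of the individual predictions, and then mixes the teachers' mean distributions with weights that shrink as that quantity grows. The `exp(-MI)` weighting and all sample distributions are assumptions for illustration; the abstract does not state the exact weighting formula.

```python
import math

def entropy(p):
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(x * math.log(x) for x in p if x > 0)

def mean_distribution(samples):
    """Element-wise mean of several predicted distributions."""
    return [sum(col) / len(samples) for col in zip(*samples)]

def mutual_information(samples):
    """Entropy of the mean prediction minus the mean entropy of the predictions."""
    mean_p = mean_distribution(samples)
    return entropy(mean_p) - sum(entropy(p) for p in samples) / len(samples)

def fuse(teacher_samples):
    """Mix teachers' mean distributions, down-weighting teachers with high MI."""
    means = [mean_distribution(s) for s in teacher_samples]
    raw = [math.exp(-mutual_information(s)) for s in teacher_samples]  # assumed form
    weights = [r / sum(raw) for r in raw]
    n_classes = len(means[0])
    fused = [sum(w * m[c] for w, m in zip(weights, means)) for c in range(n_classes)]
    return fused, weights

# Hypothetical stochastic predictions over three emotion classes.
stable_teacher = [[0.8, 0.1, 0.1], [0.8, 0.1, 0.1]]        # samples agree
unstable_teacher = [[0.9, 0.05, 0.05], [0.05, 0.9, 0.05]]  # samples disagree

fused, weights = fuse([stable_teacher, unstable_teacher])
print(weights, fused)
```

A teacher whose repeated predictions agree has mutual information near zero and so receives the larger weight, which is the denoising intuition behind weighting by disagreement.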

[23] Probability Distribution Collapse: A Critical Bottleneck to Compact Unsupervised Neural Grammar Induction

Jinwook Park,Kangil Kim

Main category: cs.CL

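TL;DR: This paper proposes a remedy for probability distribution collapse in unsupervised neural grammar induction: a collapse-relaxing neural parameterization that substantially improves parsing performance and enables more compact grammars.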

Details Motivation: Existing unsupervised neural grammar induction models face an expressiveness bottleneck that often yields oversized yet underperforming grammars; the core cause is probability distribution collapse. Method: The paper analyzes how probability distribution collapse arises in neural parameterization and proposes collapse-relaxing neural parameterization to mitigate it. Result: The new method substantially improves parsing performance across many languages while allowing more compact grammars. Conclusion: Addressing probability distribution collapse effectively improves the expressiveness and efficiency of unsupervised neural grammar induction. Abstract: Unsupervised neural grammar induction aims to learn interpretable hierarchical structures from language data. However, existing models face an expressiveness bottleneck, often resulting in unnecessarily large yet underperforming grammars. We identify a core issue, probability distribution collapse, as the underlying cause of this limitation. We analyze when and how the collapse emerges across key components of neural parameterization and introduce a targeted solution, collapse-relaxing neural parameterization, to mitigate it. Our approach substantially improves parsing performance while enabling the use of significantly more compact grammars across a wide range of languages, as demonstrated through extensive empirical analysis.

[24] Confidence-guided Refinement Reasoning for Zero-shot Question Answering

Youwon Jang,Woo Suk Choi,Minjoon Jung,Minsu Lee,Byoung-Tak Zhang

Main category: cs.CL

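TL;DR: Proposes C2R, a training-free framework that constructs and refines sub-questions and their answers and uses the model's own confidence scores to improve question answering across modalities.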

Details Motivation: To make the reasoning of multimodal QA systems more reliable and robust without extra training, the work explores how sub-questions can be used effectively to refine reasoning. Method: C2R selects sub-QAs that explore diverse reasoning paths, compares the confidence scores of the resulting answer candidates, and picks the most reliable final answer; the whole procedure is training-free and can be integrated into existing QA models. Result: C2R yields consistent improvements across diverse models and benchmarks, and the analysis shows how the quantity and quality of sub-QAs affect reasoning. Conclusion: C2R is a general, effective training-free method for refining reasoning that uses confidence guidance to improve the accuracy and stability of multimodal QA systems. Abstract: We propose Confidence-guided Refinement Reasoning (C2R), a novel training-free framework applicable to question-answering (QA) tasks across text, image, and video domains. C2R strategically constructs and refines sub-questions and their answers (sub-QAs), deriving a better confidence score for the target answer. C2R first curates a subset of sub-QAs to explore diverse reasoning paths, then compares the confidence scores of the resulting answer candidates to select the most reliable final answer. Since C2R relies solely on confidence scores derived from the model itself, it can be seamlessly integrated with various existing QA models, demonstrating consistent performance improvements across diverse models and benchmarks. Furthermore, we provide essential yet underexplored insights into how leveraging sub-QAs affects model behavior, specifically analyzing the impact of both the quantity and quality of sub-QAs on achieving robust and reliable reasoning.

[25] SFT Doesn't Always Hurt General Capabilities: Revisiting Domain-Specific Fine-Tuning in LLMs

Jiacheng Lin,Zhongruo Wang,Kun Qian,Tian Wang,Arvind Srinivasan,Hansi Zeng,Ruochen Jiao,Xie Zhou,Jiri Gesi,Dakuo Wang,Yufan Guo,Kai Zhong,Weiqi Zhang,Sujay Sanghavi,Changyou Chen,Hyokun Yun,Lihong Li

Main category: cs.CL

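TL;DR: This paper studies how supervised fine-tuning (SFT) on domain-specific datasets affects the general capabilities of large language models, shows that a smaller learning rate mitigates the degradation, and proposes Token-Adaptive Loss Reweighting (TALR), which balances domain-specific gains and general capabilities better than existing methods.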

Details Motivation: SFT can improve domain performance while harming an LLM's general capabilities; the paper revisits this trade-off. Method: Experiments verify the effect of a small learning rate; a theoretical analysis motivates TALR, which is compared with other regularization and fine-tuning strategies. Result: A small learning rate substantially mitigates the loss of general performance, and TALR preserves general capabilities better while keeping domain performance, outperforming baselines such as LoRA and model averaging. Conclusion: SFT does not necessarily hurt general capabilities; the authors recommend a small learning rate and TALR to balance domain adaptation and generality. Abstract: Supervised Fine-Tuning (SFT) on domain-specific datasets is a common approach to adapt Large Language Models (LLMs) to specialized tasks but is often believed to degrade their general capabilities. In this work, we revisit this trade-off and present both empirical and theoretical insights. First, we show that SFT does not always hurt: using a smaller learning rate can substantially mitigate general performance degradation while preserving comparable target-domain performance. We then provide a theoretical analysis that explains these phenomena and further motivates a new method, Token-Adaptive Loss Reweighting (TALR). Building on this, and recognizing that smaller learning rates alone do not fully eliminate general-performance degradation in all cases, we evaluate a range of strategies for reducing general capability loss, including L2 regularization, LoRA, model averaging, FLOW, and our proposed TALR. Experimental results demonstrate that while no method completely eliminates the trade-off, TALR consistently outperforms these baselines in balancing domain-specific gains and general capabilities. Finally, we distill our findings into practical guidelines for adapting LLMs to new domains: (i) use a small learning rate to achieve a favorable trade-off, and (ii) when a stronger balance is further desired, adopt TALR as an effective strategy.

[26] Towards Atoms of Large Language Models

Chenhui Hu,Pengfei Cao,Yubo Chen,Kang Liu,Jun Zhao

Main category: cs.CL

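TL;DR: This paper proposes the Atoms Theory, which defines atoms as the fundamental units of the internal representations of language models. It introduces the atomic inner product (AIP) to correct representation shifting and proves that atoms satisfy conditions for stable and unique sparse representations; experiments show that threshold-activated sparse autoencoders reliably identify atoms, which clearly outperform neurons and features.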

Details Motivation: The fundamental units of internal representations in LLMs remain unclear: neurons are polysemous, and features suffer from unreliable reconstruction and instability, which limits understanding of model mechanisms. Method: The paper proposes the Atoms Theory, introduces the atomic inner product (AIP) to correct representation shifting, defines atoms and proves conditions under which they satisfy the Restricted Isometry Property (RIP), establishes uniqueness and exact ℓ₁ recoverability of sparse representations under stronger conditions, and analyzes theoretically how reliably threshold-activated sparse autoencoders (SAEs) identify atoms. Result: Threshold-activated SAEs trained on Gemma2 and Llama3.1 models reach 99.9% sparse reconstruction on average, and more than 99.8% of atoms satisfy the uniqueness condition, compared with 0.5% for neurons and 68.2% for features, indicating that atoms capture the intrinsic representations of LLMs more faithfully. Conclusion: The Atoms Theory provides a systematic theoretical framework for understanding the internal representations of LLMs, links them to compressed sensing, and lays a foundation for mechanistic interpretability. Abstract: The fundamental units of internal representations in large language models (LLMs) remain undefined, limiting further understanding of their mechanisms. Neurons or features are often regarded as such units, yet neurons suffer from polysemy, while features face concerns of unreliable reconstruction and instability. To address this issue, we propose the Atoms Theory, which defines such units as atoms. We introduce the atomic inner product (AIP) to correct representation shifting, formally define atoms, and prove the conditions under which atoms satisfy the Restricted Isometry Property (RIP), ensuring stable sparse representations over the atom set and linking to compressed sensing. Under stronger conditions, we further establish the uniqueness and exact ℓ₁ recoverability of the sparse representations, and provide guarantees that single-layer sparse autoencoders (SAEs) with threshold activations can reliably identify the atoms. To validate the Atoms Theory, we train threshold-activated SAEs on Gemma2-2B, Gemma2-9B, and Llama3.1-8B, achieving 99.9% sparse reconstruction across layers on average, and more than 99.8% of atoms satisfy the uniqueness condition, compared to 0.5% for neurons and 68.2% for features, showing that atoms more faithfully capture intrinsic representations of LLMs. Scaling experiments further reveal the link between SAE size and recovery capacity. Overall, this work systematically introduces and validates Atoms Theory of LLMs, providing a theoretical framework for understanding internal representations and a foundation for mechanistic interpretability.
Code available at https://github.com/ChenhuiHu/towards_atoms.
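For readers unfamiliar with the Restricted Isometry Property named in the abstract, the standard compressed-sensing statement is given below for a generic dictionary matrix $D$ and the Euclidean norm. The symbols $D$, $x$, $k$, and $\delta_k$ are generic notation chosen here; the paper works with its atomic inner product, so its exact formulation may differ.

```latex
% Standard RIP of order k with constant \delta_k \in (0, 1):
% D approximately preserves the norm of every k-sparse vector x.
\[
(1 - \delta_k)\,\lVert x \rVert_2^2
\;\le\; \lVert D x \rVert_2^2
\;\le\; (1 + \delta_k)\,\lVert x \rVert_2^2
\qquad \text{for all } x \text{ with } \lVert x \rVert_0 \le k .
\]
```

A small $\delta_k$ means that no two distinct sparse codes are mapped close together, which is what makes sparse representations stable and is the link to compressed sensing that the abstract mentions.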

[27] Few-Shot and Training-Free Review Generation via Conversational Prompting

Genki Kusano

Main category: cs.CL

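TL;DR: This paper proposes Conversational Prompting, a lightweight method for generating personalized reviews in few-shot, training-free settings. By reformulating user reviews as multi-turn conversations, it makes LLM output much more consistent with the target user's style.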

Details Motivation: Existing personalized review generation methods usually depend on extensive user review histories or additional model training, which does not fit real-world settings where reviews are scarce and fine-tuning is infeasible; a training-free, few-shot method is needed. Method: Two variants are proposed: Simple Conversational Prompting (SCP) builds a multi-turn conversation from the target user's reviews only; Contrastive Conversational Prompting (CCP) inserts reviews from other users or LLMs as incorrect replies and asks the model to correct them, strengthening style imitation. Result: Across eight product domains and five LLMs, conventional non-conversational prompts produce reviews similar to those of random users, whereas SCP and CCP produce reviews much closer to the target user's real reviews on ROUGE-L, BERTScore, user identity matching, and sentiment analysis, even with only two reviews per user. CCP performs better when high-quality negative examples are available, and SCP remains competitive without them. Conclusion: Conversational prompting is an effective, practical way to generate personalized reviews under few-shot and training-free conditions, especially in resource-constrained applications. Abstract: Personalized review generation helps businesses understand user preferences, yet most existing approaches assume extensive review histories of the target user or require additional model training. Real-world applications often face few-shot and training-free situations, where only a few user reviews are available and fine-tuning is infeasible. It is well known that large language models (LLMs) can address such low-resource settings, but their effectiveness depends on prompt engineering. In this paper, we propose Conversational Prompting, a lightweight method that reformulates user reviews as multi-turn conversations. Its simple variant, Simple Conversational Prompting (SCP), relies solely on the user's own reviews, while the contrastive variant, Contrastive Conversational Prompting (CCP), inserts reviews from other users or LLMs as incorrect replies and then asks the model to correct them, encouraging the model to produce text in the user's style. Experiments on eight product domains and five LLMs showed that the conventional non-conversational prompt often produced reviews similar to those written by random users, based on text-based metrics such as ROUGE-L and BERTScore, and application-oriented tasks like user identity matching and sentiment analysis. In contrast, both SCP and CCP produced reviews much closer to those of the target user, even when each user had only two reviews. CCP brings further improvements when high-quality negative examples are available, whereas SCP remains competitive when such data cannot be collected.
These results suggest that conversational prompting offers a practical solution for review generation under few-shot and training-free constraints.

[28] Enrich-on-Graph: Query-Graph Alignment for Complex Reasoning with LLM Enriching

Songze Li,Zhiqiang Liu,Zhengke Gui,Huajun Chen,Wen Zhang

Main category: cs.CL

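TL;DR: Proposes Enrich-on-Graph (EoG), a flexible framework that uses the prior knowledge of LLMs to enrich knowledge graphs, bridging the semantic gap between queries and graphs and improving knowledge graph question answering.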

Details Motivation: LLMs hallucinate and make factual errors on knowledge-intensive tasks, mainly because of the semantic gap between structured knowledge graphs and unstructured queries. Method: The Enrich-on-Graph (EoG) framework uses LLMs' prior knowledge to enrich knowledge graphs dynamically, and three graph quality metrics are designed to analyze query-graph alignment. Result: Experiments on two KGQA benchmark datasets show that EoG generates high-quality knowledge graphs and achieves state-of-the-art performance. Conclusion: By bridging the semantic gap, EoG enriches knowledge graphs efficiently, scalably, and adaptably, improving the accuracy and robustness of KGQA. Abstract: Large Language Models (LLMs) exhibit strong reasoning capabilities in complex tasks. However, they still struggle with hallucinations and factual errors in knowledge-intensive scenarios like knowledge graph question answering (KGQA). We attribute this to the semantic gap between structured knowledge graphs (KGs) and unstructured queries, caused by inherent differences in their focuses and structures. Existing methods usually employ resource-intensive, non-scalable workflows reasoning on vanilla KGs, but overlook this gap. To address this challenge, we propose a flexible framework, Enrich-on-Graph (EoG), which leverages LLMs' prior knowledge to enrich KGs and bridge the semantic gap between graphs and queries. EoG enables efficient evidence extraction from KGs for precise and robust reasoning, while ensuring low computational costs, scalability, and adaptability across different methods. Furthermore, we propose three graph quality evaluation metrics to analyze query-graph alignment in the KGQA task, supported by theoretical validation of our optimization objectives. Extensive experiments on two KGQA benchmark datasets indicate that EoG can effectively generate high-quality KGs and achieve state-of-the-art performance. Our code and data are available at https://github.com/zjukg/Enrich-on-Graph.

[29] Leveraging What's Overfixed: Post-Correction via LLM Grammatical Error Overcorrection

Taehee Park,Heejin Do,Gary Geunbae Lee

Main category: cs.CL

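TL;DR: Proposes Post-Correction via Overcorrection (PoCO), which combines the generative power of large models with the reliability of small models to balance recall and precision in grammatical error correction.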

Details Motivation: Small language models are precise but have low recall in grammatical error correction, whereas LLMs tend to overcorrect and lose precision; combining their strengths should improve overall performance. Method: An LLM is first used to trigger overcorrection and raise recall; fine-tuned smaller models then post-correct the output, fixing erroneous edits and balancing precision and recall. Result: Experiments show that PoCO raises recall while keeping precision competitive, improving overall grammatical error correction quality. Conclusion: PoCO combines the generative power of LLMs with the reliability of smaller models to address the precision-recall trade-off in grammatical error correction. Abstract: Robust supervised fine-tuned small Language Models (sLMs) often show high reliability but tend to undercorrect. They achieve high precision at the cost of low recall. Conversely, Large Language Models (LLMs) often show the opposite tendency, making excessive overcorrection, leading to low precision. To effectively harness the strengths of LLMs to address the recall challenges in sLMs, we propose Post-Correction via Overcorrection (PoCO), a novel approach that strategically balances recall and precision. PoCO first intentionally triggers overcorrection via LLM to maximize recall by allowing comprehensive revisions, then applies a targeted post-correction step via fine-tuning smaller models to identify and refine erroneous outputs. We aim to harmonize both aspects by leveraging the generative power of LLMs while preserving the reliability of smaller supervised models. Our extensive experiments demonstrate that PoCO effectively balances GEC performance by increasing recall with competitive precision, ultimately improving the overall quality of grammatical error correction.

[30] Distilling Many-Shot In-Context Learning into a Cheat Sheet

Ukyo Honda,Soichiro Murakami,Peinan Zhang

Main category: cs.CL

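TL;DR: Proposes cheat-sheet ICL, which distills the information in many-shot in-context learning into a concise textual summary, reducing input tokens while matching or improving reasoning performance.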

Details Motivation: Many-shot in-context learning with LLMs incurs high computational cost because of long inputs. Method: Information from many-shot ICL is distilled into a concise textual summary (a cheat sheet) that serves as the context at inference time. Result: On challenging reasoning tasks, cheat-sheet ICL performs comparably to or better than many-shot ICL with far fewer tokens, and matches retrieval-based ICL without test-time retrieval. Conclusion: Cheat-sheet ICL is a practical, efficient way to use LLMs for downstream tasks. Abstract: Recent advances in large language models (LLMs) enable effective in-context learning (ICL) with many-shot examples, but at the cost of high computational demand due to longer input tokens. To address this, we propose cheat-sheet ICL, which distills the information from many-shot ICL into a concise textual summary (cheat sheet) used as the context at inference time. Experiments on challenging reasoning tasks show that cheat-sheet ICL achieves comparable or better performance than many-shot ICL with far fewer tokens, and matches retrieval-based ICL without requiring test-time retrieval. These findings demonstrate that cheat-sheet ICL is a practical alternative for leveraging LLMs in downstream tasks.

Shuo Huang,Xingliang Yuan,Gholamreza Haffari,Lizhen Qu

Main category: cs.CL

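TL;DR: Proposes a zero-shot, tree-search-based iterative sentence rewriting algorithm that protects private information in user inputs while preserving the coherence and naturalness of the text.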

Details Motivation: Existing text de-identification methods struggle to balance privacy protection with text naturalness and can harm utility. Method: A tree search guided by a reward model incrementally rewrites privacy-sensitive segments, systematically obfuscating or deleting private information. Result: On privacy-sensitive datasets the method significantly outperforms existing baselines, striking a better balance between privacy protection and utility preservation. Conclusion: The method balances privacy protection with text quality and suits privacy-sensitive uses of large language models. Abstract: The increasing adoption of large language models (LLMs) in cloud-based services has raised significant privacy concerns, as user inputs may inadvertently expose sensitive information. Existing text anonymization and de-identification techniques, such as rule-based redaction and scrubbing, often struggle to balance privacy preservation with text naturalness and utility. In this work, we propose a zero-shot, tree-search-based iterative sentence rewriting algorithm that systematically obfuscates or deletes private information while preserving coherence, relevance, and naturalness. Our method incrementally rewrites privacy-sensitive segments through a structured search guided by a reward model, enabling dynamic exploration of the rewriting space. Experiments on privacy-sensitive datasets show that our approach significantly outperforms existing baselines, achieving a superior balance between privacy protection and utility preservation.

[32] Concise and Sufficient Sub-Sentence Citations for Retrieval-Augmented Generation

Guo Chen,Qiuyuan Li,Qiuxian Li,Hongliang Dai,Xiang Chen,Piji Li

Main category: cs.CL

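TL;DR: This paper proposes a method for generating fine-grained sub-sentence citations that makes citations in retrieval-augmented generation (RAG) systems more precise and readable, and so makes outputs easier to verify.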

Details Motivation: Citations in existing RAG systems are mostly at the sentence or paragraph level, so they contain redundant content or omit needed information, which makes verifying outputs slower and less accurate. Method: The authors define annotation guidelines for sub-sentence citations and build a corresponding dataset; they design an LLM-based attribution framework that uses LLMs to generate fine-tuning data automatically and a credit model to filter out low-quality examples. Result: The method produces more concise, sufficient, high-quality citations, improving readability and verification efficiency. Conclusion: The fine-grained citation generation framework addresses the precision and completeness shortcomings of existing citation methods and helps make RAG outputs more trustworthy and usable. Abstract: In retrieval-augmented generation (RAG) question answering systems, generating citations for large language model (LLM) outputs enhances verifiability and helps users identify potential hallucinations. However, we observe two problems in the citations produced by existing attribution methods. First, the citations are typically provided at the sentence or even paragraph level. Long sentences or paragraphs may include a substantial amount of irrelevant content. Second, sentence-level citations may omit information that is essential for verifying the output, forcing users to read the surrounding context. In this paper, we propose generating sub-sentence citations that are both concise and sufficient, thereby reducing the effort required by users to confirm the correctness of the generated output. To this end, we first develop annotation guidelines for such citations and construct a corresponding dataset. Then, we propose an attribution framework for generating citations that adhere to our standards. This framework leverages LLMs to automatically generate fine-tuning data for our task and employs a credit model to filter out low-quality examples. Our experiments on the constructed dataset demonstrate that the proposed approach can generate high-quality and more readable citations.

[33] WeFT: Weighted Entropy-driven Fine-Tuning for dLLMs

Guowei Xu,Wenxin Xu,Jiawang Zhao,Kaisheng Ma

Main category: cs.CL

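TL;DR: Proposes WeFT, a weighted supervised fine-tuning method that assigns tokens different weights based on their entropy to improve the reasoning of diffusion language models trained on few samples; it clearly outperforms standard SFT on several benchmarks.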

Details Motivation: Diffusion language models generate quickly but lack precise probability estimates at each denoising step, which makes generation less predictable and often inconsistent and makes supervised fine-tuning (SFT) hard to apply. Method: WeFT, derived from diffusion theory, assigns each token a weight based on its entropy so that key tokens are emphasized during fine-tuning and steer the direction of generation. Result: Trained on s1K, s1K-1.1, and 3k samples from open-r1, WeFT achieves relative improvements of 39%, 64%, and 83% over standard SFT on four reasoning benchmarks (Sudoku, Countdown, GSM8K, and MATH-500). Conclusion: WeFT improves the reasoning of diffusion language models in low-data settings, supports the importance of controlling key tokens, and offers a new approach to fine-tuning diffusion models. Abstract: Diffusion models have recently shown strong potential in language modeling, offering faster generation compared to traditional autoregressive approaches. However, applying supervised fine-tuning (SFT) to diffusion models remains challenging, as they lack precise probability estimates at each denoising step. While the diffusion mechanism enables the model to reason over entire sequences, it also makes the generation process less predictable and often inconsistent. This highlights the importance of controlling key tokens that guide the direction of generation. To address this issue, we propose WeFT, a weighted SFT method for diffusion language models, where tokens are assigned different weights based on their entropy. Derived from diffusion theory, WeFT delivers substantial gains: training on s1K, s1K-1.1, and 3k samples from open-r1, it achieves relative improvements of 39%, 64%, and 83% over standard SFT on four widely used reasoning benchmarks (Sudoku, Countdown, GSM8K, and MATH-500). The code and models will be made publicly available.
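The abstract says only that tokens receive different weights based on their entropy. A minimal sketch of that general idea follows, assuming weights proportional to each token's predictive entropy and normalized to average 1, applied to a per-token negative log-likelihood. This is an illustrative stand-in, not the WeFT objective: the actual weighting and how it enters the diffusion training loss are not given in the abstract, and the distributions below are made up.

```python
import math

def entropy(dist):
    """Shannon entropy (in nats) of a token's predictive distribution."""
    return -sum(p * math.log(p) for p in dist if p > 0)

def entropy_weights(token_dists):
    """Weights proportional to entropy, scaled so they average to 1 (assumed form)."""
    ents = [entropy(d) for d in token_dists]
    total = sum(ents)
    if total == 0:
        return [1.0] * len(ents)  # all tokens certain: fall back to uniform weights
    return [len(ents) * e / total for e in ents]

def weighted_nll(token_dists, targets):
    """Mean of per-token negative log-likelihoods, each scaled by its weight."""
    weights = entropy_weights(token_dists)
    losses = [-math.log(d[t]) for d, t in zip(token_dists, targets)]
    return sum(w * l for w, l in zip(weights, losses)) / len(losses)

# Hypothetical predictions for two target tokens over a 3-word vocabulary.
token_dists = [
    [0.98, 0.01, 0.01],  # confident prediction: low entropy, small weight
    [0.40, 0.30, 0.30],  # uncertain prediction: high entropy, large weight
]
targets = [0, 0]

print(entropy_weights(token_dists))
print(weighted_nll(token_dists, targets))
```

With these toy numbers the uncertain token dominates the loss, so gradient signal concentrates on positions where the model has not yet committed to a choice.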

[34] Single Answer is Not Enough: On Generating Ranked Lists with Medical Reasoning Models

Pittawat Taveekitworachai,Natpatchara Pongjirapat,Krittaphas Chaisutyakorn,Piyalitt Ittichaiwong,Tossaporn Saengja,Kunat Pipatanakul

Main category: cs.CL

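TL;DR: This paper presents the first systematic study of enabling medical reasoning models (MRMs) to generate ranked lists of answers for open-ended questions, comparing prompting and fine-tuning, and finds that models trained with reinforcement fine-tuning (RFT) are more robust across answer formats.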

Details Motivation: Clinical decisions usually weigh several options rather than one answer, yet existing MRMs are mostly trained to output a single answer. Method: Ranked lists are proposed as a new output format and studied through prompting and through fine-tuning, the latter with supervised fine-tuning (SFT) and reinforcement fine-tuning (RFT), using new reward functions designed for ranked lists. Result: Some SFT models generalize to certain answer formats, but RFT-trained models are more robust across formats; a case study on a modified MedQA shows that models may miss the benchmark's preferred answer yet still recognize valid answers. Conclusion: This is the first systematic study of having MRMs generate ranked answer lists; it indicates that RFT adapts across formats better than SFT and offers a path toward answer formats beyond a single answer in medicine. Abstract: This paper presents a systematic study on enabling medical reasoning models (MRMs) to generate ranked lists of answers for open-ended questions. Clinical decision-making rarely relies on a single answer but instead considers multiple options, reducing the risks of narrow perspectives. Yet current MRMs are typically trained to produce only one answer, even in open-ended settings. We propose an alternative format, ranked lists, and investigate two approaches: prompting and fine-tuning. While prompting is a cost-effective way to steer an MRM's response, not all MRMs generalize well across different answer formats: choice, short text, and list answers. Based on our prompting findings, we train and evaluate MRMs using supervised fine-tuning (SFT) and reinforcement fine-tuning (RFT). SFT teaches a model to imitate annotated responses, and RFT incentivizes exploration through the responses that maximize a reward. We propose new reward functions targeted at ranked-list answer formats, and conduct ablation studies for RFT. Our results show that while some SFT models generalize to certain answer formats, models trained with RFT are more robust across multiple formats. We also present a case study on a modified MedQA with multiple valid answers, finding that although MRMs might fail to select the benchmark's preferred ground truth, they can recognize valid answers. To the best of our knowledge, this is the first systematic investigation of approaches for enabling MRMs to generate answers as ranked lists.
We hope this work provides a first step toward developing alternative answer formats that are beneficial beyond single answers in medical domains.

[35] Learning to Summarize by Learning to Quiz: Adversarial Agentic Collaboration for Long Document Summarization

Weixuan Wang,Minghao Wu,Barry Haddow,Alexandra Birch

Main category: cs.CL

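TL;DR: This paper proposes SummQ, an adversarial multi-agent framework that improves long document summarization through the interplay of summarization and quizzing.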

Details Motivation: LLMs suffer from information loss, factual inconsistencies, and coherence issues when summarizing long documents. Method: A multi-agent framework of summary generators and reviewers, quiz generators and reviewers, and an examinee agent works adversarially across the two complementary domains of summarization and quizzing, refining the summary through iterative feedback. Result: On three widely used long document summarization benchmarks, SummQ significantly outperforms state-of-the-art methods on ROUGE, BERTScore, LLM-as-a-Judge, and human evaluations. Conclusion: The work shows that adversarial multi-agent collaboration improves long document summarization quality and offers a new approach for the field. Abstract: Long document summarization remains a significant challenge for current large language models (LLMs), as existing approaches commonly struggle with information loss, factual inconsistencies, and coherence issues when processing excessively long documents. We propose SummQ, a novel adversarial multi-agent framework that addresses these limitations through collaborative intelligence between specialized agents operating in two complementary domains: summarization and quizzing. Our approach employs summary generators and reviewers that work collaboratively to create and evaluate comprehensive summaries, while quiz generators and reviewers create comprehension questions that serve as continuous quality checks for the summarization process. This adversarial dynamic, enhanced by an examinee agent that validates whether the generated summary contains the information needed to answer the quiz questions, enables iterative refinement through multifaceted feedback mechanisms. We evaluate SummQ on three widely used long document summarization benchmarks. Experimental results demonstrate that our framework significantly outperforms existing state-of-the-art methods across ROUGE and BERTScore metrics, as well as in LLM-as-a-Judge and human evaluations. Our comprehensive analyses reveal the effectiveness of the multi-agent collaboration dynamics, the influence of different agent configurations, and the impact of the quizzing mechanism. This work establishes a new approach for long document summarization that uses adversarial agentic collaboration to improve summarization quality.

[36] MemLens: Uncovering Memorization in LLMs with Activation Trajectories

Zirui He,Haiyan Zhao,Ali Payani,Mengnan Du

Main category: cs.CL

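TL;DR: This paper proposes MemLens, a new method that detects memorization in large language models by analyzing the probability trajectories of numeric tokens during generation. Unlike traditional methods that rely on surface-level lexical overlap and perplexity, MemLens separates the reasoning trajectories of contaminated and uncontaminated samples.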

Details Motivation: Existing memorization detectors generalize poorly to implicitly contaminated data, and the benchmarks used for evaluation are themselves at risk of being memorized, so a more robust way to tell whether a model has memorized its training data is needed. Method: MemLens tracks how the probability assigned to numeric tokens changes across layers during generation and identifies the "shortcut" behavior of contaminated samples, in which the model locks onto an answer with high confidence in its early layers. The authors also validate the method by injecting designed samples through LoRA fine-tuning. Result: Contaminated and clean samples show clearly separated reasoning trajectories; MemLens picks up genuine memorization signals rather than spurious correlations, with consistent patterns on both naturally contaminated and artificially injected data. Conclusion: MemLens offers a new and effective lens for probing memorization in large language models, showing that analyzing internal activation trajectories substantially improves detection of implicit data contamination. Abstract: Large language models (LLMs) are commonly evaluated on challenging benchmarks such as AIME and Math500, which are susceptible to contamination and risk of being memorized. Existing detection methods, which primarily rely on surface-level lexical overlap and perplexity, demonstrate low generalization and degrade significantly when encountering implicitly contaminated data. In this paper, we propose MemLens (An Activation Lens for Memorization Detection) to detect memorization by analyzing the probability trajectories of numeric tokens during generation. Our method reveals that contaminated samples exhibit "shortcut" behaviors, locking onto an answer with high confidence in the model's early layers, whereas clean samples show more gradual evidence accumulation across the model's full depth. We observe that contaminated and clean samples exhibit distinct and well-separated reasoning trajectories. To further validate this, we inject carefully designed samples into the model through LoRA fine-tuning and observe the same trajectory patterns as in naturally contaminated data. These results provide strong evidence that MemLens captures genuine signals of memorization rather than spurious correlations.
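The layer-wise trajectory idea can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes a logit-lens style readout (projecting each layer's hidden state through the unembedding matrix), and the names `token_probability_trajectory` and `lock_in_layer`, the 0.9 threshold, and the toy arrays are all hypothetical.

```python
import numpy as np

def softmax(z):
    # Numerically stable softmax over the last axis.
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def token_probability_trajectory(hidden_states, unembed, token_id):
    """hidden_states: (num_layers, d_model); unembed: (d_model, vocab).
    Returns the probability of token_id read out at every layer."""
    logits = hidden_states @ unembed
    return softmax(logits)[:, token_id]

def lock_in_layer(trajectory, threshold=0.9):
    """Index of the first layer whose probability reaches the threshold, else None."""
    above = np.nonzero(trajectory >= threshold)[0]
    return int(above[0]) if above.size else None

# Toy data (made up): 3 layers, 4-token vocabulary, identity unembedding.
unembed = np.eye(4)
contaminated = np.array([[0.0, 0, 0, 0], [0, 0, 10, 0], [0, 0, 10, 0]])
clean = np.array([[0.0, 0, 0, 0], [0, 0, 1, 0], [0, 0, 10, 0]])

early = lock_in_layer(token_probability_trajectory(contaminated, unembed, 2))
late = lock_in_layer(token_probability_trajectory(clean, unembed, 2))
```

In this toy setup the "contaminated" trajectory crosses the threshold one layer earlier than the "clean" one, which is the kind of separation the abstract describes.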

[37] Cross-Linguistic Analysis of Memory Load in Sentence Comprehension: Linear Distance and Structural Density

Krishna Aggarwal

Main category: cs.CL

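TL;DR: Using cross-linguistic dependency treebanks and mixed-effects models, this study compares how sentence length, dependency length, and the complexity of intervening material affect memory load in sentence comprehension, and proposes the number of intervening heads as a structurally grounded predictor that improves on linear distance.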

Details Motivation: The study tests whether memory load in sentence comprehension is better explained by the linear distance between syntactically related words or by structural density (the complexity of the intervening material), with the aim of sharpening our understanding of locality in language processing. Method: Using uniformly annotated dependency treebanks, sentence-level memory load is operationalized as the linear sum of feature interference and feature misbinding, and cross-linguistic mixed-effects models assess the relative contributions of sentence length, dependency length, and Intervener Complexity. Result: Sentence length has the broadest influence, but Intervener Complexity retains significant explanatory power after controlling for linear distance, indicating that structural density reflects memory load better than linear distance alone. Conclusion: The study reconciles linear and hierarchical views of locality: dependency length is a surface cue, while the number of intervening heads is a more direct indicator of integration and maintenance demands. It also shows the methodological value of UD-based graph measures and cross-linguistic modelling. Abstract: This study examines whether sentence-level memory load in comprehension is better explained by linear proximity between syntactically related words or by the structural density of the intervening material. Building on locality-based accounts and cross-linguistic evidence for dependency length minimization, the work advances Intervener Complexity (the number of intervening heads between a head and its dependent) as a structurally grounded lens that refines linear distance measures. Using harmonized dependency treebanks and a mixed-effects framework across multiple languages, the analysis jointly evaluates sentence length, dependency length, and Intervener Complexity as predictors of the memory-load measure. Studies in psycholinguistics have reported the contributions of feature interference and misbinding to memory load during processing. For this study, I operationalized sentence-level memory load as the linear sum of feature misbinding and feature interference for tractability; current evidence does not establish that their cognitive contributions combine additively. All three factors are positively associated with memory load, with sentence length exerting the broadest influence and Intervener Complexity offering explanatory power beyond linear distance. Conceptually, the findings reconcile linear and hierarchical perspectives on locality by treating dependency length as an important surface signature while identifying intervening heads as a more proximate indicator of integration and maintenance demands. Methodologically, the study illustrates how UD-based graph measures and cross-linguistic mixed-effects modelling can disentangle linear and structural contributions to processing efficiency, providing a principled path for evaluating competing theories of memory load in sentence comprehension.
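To make the two predictors concrete, the sketch below computes dependency length and Intervener Complexity for each arc of one sentence. It rests on an assumption about the definition: an "intervening head" is taken to be any token lying strictly between a head and its dependent that itself governs at least one dependent. The function name and the hand-written head indices are illustrative only.

```python
def dependency_metrics(heads):
    """heads[i] is the 1-based index of the head of token i+1 (0 marks the root),
    as in the HEAD column of CoNLL-U. Returns one tuple per non-root token:
    (dependent, head, dependency_length, intervener_complexity)."""
    governing = {h for h in heads if h != 0}  # tokens that head something
    rows = []
    for dependent, head in enumerate(heads, start=1):
        if head == 0:
            continue  # the root has no incoming arc
        left, right = sorted((dependent, head))
        length = right - left
        interveners = sum(1 for k in range(left + 1, right) if k in governing)
        rows.append((dependent, head, length, interveners))
    return rows

# Toy sentence: "The dog that barked loudly chased cats"
#                 1   2    3     4      5      6     7
heads = [2, 6, 4, 2, 4, 0, 7 - 1]
rows = dependency_metrics(heads)
```

For the arc from "dog" (2) to "chased" (6), the dependency length is 4 but only one intervening token, "barked" (4), is itself a head, which shows how the two measures can diverge.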

[38] Tool Calling for Arabic LLMs: Data Strategies and Instruction Tuning

Asim Ersoy,Enes Altinisik,Husrev Taha Sencar,Kareem Darwish

Main category: cs.CL

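TL;DR: This paper studies how to enable tool calling for Arabic. It examines three strategies (using in-language Arabic tool-calling data, general-purpose instruction tuning, and fine-tuning on specific tools) and tests them experimentally on an open-weight Arabic LLM.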

Details Motivation: Tool-calling research is concentrated on English and offers little support for languages such as Arabic, so effective tool-calling methods for Arabic need to be explored. Method: Two open-source tool-calling datasets are translated and adapted into Arabic, and extensive experiments with base and post-trained variants of an open-weight Arabic LLM evaluate the different fine-tuning strategies. Result: The experiments indicate how best to achieve effective tool calling in Arabic, covering the importance of in-language data, the effect of general-purpose instruction tuning, and the value of fine-tuning on high-priority tools. Conclusion: Combining in-language training data, instruction tuning, and tool-specific fine-tuning is an effective route to robust tool-augmented agents for Arabic. Abstract: Tool calling is a critical capability that allows Large Language Models (LLMs) to interact with external systems, significantly expanding their utility. However, research and resources for tool calling are predominantly English-centric, leaving a gap in our understanding of how to enable this functionality for other languages, such as Arabic. This paper investigates three key research questions: (1) the necessity of in-language (Arabic) tool-calling data versus relying on cross-lingual transfer, (2) the effect of general-purpose instruction tuning on tool-calling performance, and (3) the value of fine-tuning on specific, high-priority tools. To address these questions, we conduct extensive experiments using base and post-trained variants of an open-weight Arabic LLM. To enable this study, we bridge the resource gap by translating and adapting two open-source tool-calling datasets into Arabic. Our findings provide crucial insights into the optimal strategies for developing robust tool-augmented agents for Arabic.

[39] Analysis of instruction-based LLMs' capabilities to score and judge text-input problems in an academic setting

Valeria Ramirez-Garcia,David de-Fitero-Dominguez,Antonio Garcia-Cabot,Eva Garcia-Lopez

Main category: cs.CL

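TL;DR: This study examines LLM-based automatic scoring of text-input problems in higher education. It proposes five rubric-based evaluation methods and tests them on a dataset of 110 computer science answers. Reference Aided Evaluation, which supplies a reference answer, comes closest to human scoring with the lowest deviation and is the best of the automatic scoring approaches tested.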

Details Motivation: To improve the accuracy and fairness of automatic assessment in education, the study explores effective ways of using large language models as scoring tools, especially as a reliable alternative where human grading resources are scarce. Method: Five LLM-driven evaluation methods are proposed and compared: JudgeLM Evaluation, Reference Aided Evaluation, No Reference Evaluation, Additive Evaluation, and Adaptive Evaluation. They are run with three models (JudgeLM, Llama-3.1-8B, and DeepSeek-R1-Distill-Llama-8B) on a custom dataset of student answers and compared against human scores. Result: Reference Aided Evaluation performs best, with a median absolute deviation of 0.945 and a root mean square deviation of 1.214, the closest to human scoring. The other methods do poorly on concise answers or suffer from missing information, and the native JudgeLM evaluation is held back by the model's limitations. Conclusion: With suitable methodology, LLM-based automatic scoring systems have potential as teaching aids that complement traditional academic resources. Abstract: Large language models (LLMs) can act as evaluators, a role studied by methods like LLM-as-a-Judge and fine-tuned judging LLMs. In the field of education, LLMs have been studied as assistant tools for students and teachers. Our research investigates LLM-driven automatic evaluation systems for academic Text-Input Problems using rubrics. We propose five evaluation systems that have been tested on a custom dataset of 110 answers about computer science from higher education students with three models: JudgeLM, Llama-3.1-8B and DeepSeek-R1-Distill-Llama-8B. The evaluation systems include: The JudgeLM evaluation, which uses the model's single answer prompt to obtain a score; Reference Aided Evaluation, which uses a correct answer as a guide aside from the original context of the question; No Reference Evaluation, which omits the reference answer; Additive Evaluation, which uses atomic criteria; and Adaptive Evaluation, which is an evaluation done with generated criteria fitted to each question. All evaluation methods have been compared with the results of a human evaluator. Results show that the best method to automatically evaluate and score Text-Input Problems using LLMs is Reference Aided Evaluation. With the lowest median absolute deviation (0.945) and the lowest root mean square deviation (1.214) when compared to human evaluation, Reference Aided Evaluation offers fair scoring as well as insightful and complete evaluations. Other methods such as Additive and Adaptive Evaluation fail to provide good results in concise answers, No Reference Evaluation lacks information needed to correctly assess questions and JudgeLM Evaluations have not provided good results due to the model's limitations. As a result, we conclude that Artificial Intelligence-driven automatic evaluation systems, aided with proper methodologies, show potential to work as complementary tools to other academic resources.
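The two agreement statistics quoted above can be computed as follows. This is a sketch under an assumption: both are taken to be deviations of the model's score from the human score on the same answer (median of absolute differences, and root of the mean squared difference). The abstract does not spell out the formulas, and the scores below are invented for illustration, not taken from the paper.

```python
import math
import statistics

def median_absolute_deviation(model_scores, human_scores):
    # Median of |model - human| over paired scores.
    return statistics.median(abs(m - h) for m, h in zip(model_scores, human_scores))

def root_mean_square_deviation(model_scores, human_scores):
    # Square root of the mean squared difference over paired scores.
    squared = [(m - h) ** 2 for m, h in zip(model_scores, human_scores)]
    return math.sqrt(sum(squared) / len(squared))

# Invented example scores on a 0-10 scale.
model_scores = [7, 5, 9, 4]
human_scores = [8, 5, 7, 6]

mad = median_absolute_deviation(model_scores, human_scores)
rmsd = root_mean_square_deviation(model_scores, human_scores)
```

Lower values on both statistics mean the automatic scores track the human grader more closely; the root mean square deviation penalizes occasional large disagreements more heavily than the median does.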

[40] Generative AI for FFRDCs

Arun S. Maiya

Main category: cs.CL

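TL;DR: This paper applies OnPrem.LLM, a large-language-model framework, to speed up text-heavy work at federally funded research and development centers (FFRDCs). With only a few examples it supports summarization, classification, information extraction, and sense-making, while preserving auditability and data sovereignty in sensitive government settings.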

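Details Motivation: FFRDCs face large volumes of text analysis that are slow to handle manually, and applying AI in sensitive government environments requires data security and compliance, so an efficient and secure approach to automated text analysis is needed. Method: The open-source OnPrem.LLM framework is combined with the few-shot abilities of large language models to summarize, classify, extract from, and make sense of texts such as defense policy documents and scientific literature. Result: Case studies on the NDAA and NSF Awards show that the approach improves text-processing efficiency and strategic analysis while keeping processing local and preserving audit trails and data sovereignty. Conclusion: The OnPrem.LLM framework can improve the efficiency of FFRDC text analysis in sensitive settings without compromising security or compliance, and has broad potential for application. Abstract: Federally funded research and development centers (FFRDCs) face text-heavy workloads, from policy documents to scientific and engineering papers, that are slow to analyze manually. We show how large language models can accelerate summarization, classification, extraction, and sense-making with only a few input-output examples. To enable use in sensitive government contexts, we apply OnPrem.LLM, an open-source framework for secure and flexible application of generative AI. Case studies on defense policy documents and scientific corpora, including the National Defense Authorization Act (NDAA) and National Science Foundation (NSF) Awards, demonstrate how this approach enhances oversight and strategic analysis while maintaining auditability and data sovereignty.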

[41] Behind RoPE: How Does Causal Mask Encode Positional Information?

Junu Kim,Xiao Liu,Zhenghao Lin,Lei Ji,Yeyun Gong,Edward Choi

Main category: cs.CL

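TL;DR: This paper studies how the causal mask in Transformer decoders makes attention scores depend on position. Even without parameters or causal dependency in the input, the causal mask induces a preference for nearby positions that resembles a positional encoding, and it interacts with RoPE to alter RoPE's relative attention patterns.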

Details Motivation: The paper asks whether the causal mask, in addition to explicit positional encodings such as RoPE, supplies important positional information, in order to give a fuller picture of how Transformers model position. Method: A theoretical analysis proves that the causal mask can induce position-dependent attention patterns, and an empirical study on modern large language models confirms the phenomenon and its interaction with RoPE. Result: The causal mask alone produces attention patterns that favor nearby query-key pairs, and when combined with RoPE it breaks RoPE's relative property, yielding non-relative attention patterns. Conclusion: The causal mask is an important source of positional information and should be considered alongside explicit positional encodings in model design. Abstract: While explicit positional encodings such as RoPE are a primary source of positional information in Transformer decoders, the causal mask also provides positional information. In this work, we prove that the causal mask can induce position-dependent patterns in attention scores, even without parameters or causal dependency in the input. Our theoretical analysis indicates that the induced attention pattern tends to favor nearby query-key pairs, mirroring the behavior of common positional encodings. Empirical analysis confirms that trained models exhibit the same behavior, with learned parameters further amplifying these patterns. Notably, we found that the interaction of causal mask and RoPE distorts RoPE's relative attention score patterns into non-relative ones. We consistently observed this effect in modern large language models, suggesting the importance of considering the causal mask as a source of positional information alongside explicit positional encodings.
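A small numerical experiment can illustrate the claim that masking alone produces a locality preference. This is not the paper's proof or setup; it assumes i.i.d. Gaussian token vectors, a first layer that is parameter-free uniform attention under a causal mask, and L2 normalization standing in for layer normalization before second-layer scores are taken. All names are hypothetical.

```python
import numpy as np

def causal_uniform_attention(x):
    """Parameter-free attention: position i averages tokens 0..i (causal mask only)."""
    n = x.shape[0]
    mask = np.tril(np.ones((n, n)))
    weights = mask / mask.sum(axis=1, keepdims=True)
    return weights @ x

def normalized_scores(h):
    """Dot products between L2-normalized states (a stand-in for layer norm)."""
    unit = h / np.linalg.norm(h, axis=1, keepdims=True)
    return unit @ unit.T

rng = np.random.default_rng(0)
x = rng.standard_normal((16, 512))  # 16 positions, no positional encoding at all
scores = normalized_scores(causal_uniform_attention(x))
last_query = scores[-1]  # scores of the final position against every key
```

Under these assumptions the expected score between query i and an earlier key j is roughly sqrt((j + 1) / (i + 1)), so `last_query` rises toward the most recent keys even though the inputs carry no positional signal.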

[42] When Instructions Multiply: Measuring and Estimating LLM Capabilities of Multiple Instructions Following

Keno Harada,Yudai Yamazaki,Masachika Taniguchi,Edison Marrese-Taylor,Takeshi Kojima,Yusuke Iwasawa,Yutaka Matsuo

Main category: cs.CL

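TL;DR: This paper introduces two dedicated benchmarks, ManyIFEval and StyleMBPP, for evaluating how well large language models follow multiple instructions at once, and finds that performance drops as the number of instructions grows. It also develops three types of regression models that use a small number of samples to estimate performance on unseen instruction combinations.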

Details Motivation: As large language models are deployed widely in real-world settings, understanding their ability to follow several instructions simultaneously becomes essential, yet systematic evaluation methods are lacking. Method: Two benchmarks are built, ManyIFEval for multi-instruction text generation and StyleMBPP for multi-instruction code generation. Ten LLMs are evaluated on them, and regression models are used to predict performance across different instruction counts and combinations. Result: Performance degrades consistently as the number of instructions increases. A logistic regression model with instruction count as the explanatory variable predicts performance on unseen instruction combinations with roughly 10% error, and sample sizes of 500 (ManyIFEval) and 300 (StyleMBPP) are enough for useful estimates. Conclusion: Following multiple instructions is a weak point of current LLMs, and the proposed benchmarks and regression-based estimates make it possible to evaluate and predict performance efficiently across many instruction combinations. Abstract: As large language models (LLMs) are increasingly applied to real-world scenarios, it becomes crucial to understand their ability to follow multiple instructions simultaneously. To systematically evaluate these capabilities, we introduce two specialized benchmarks for fundamental domains where multiple instructions following is important: Many Instruction-Following Eval (ManyIFEval) for text generation with up to ten instructions, and Style-aware Mostly Basic Programming Problems (StyleMBPP) for code generation with up to six instructions. Our experiments with the created benchmarks across ten LLMs reveal that performance consistently degrades as the number of instructions increases. Furthermore, given the fact that evaluating all the possible combinations of multiple instructions is computationally impractical in actual use cases, we developed three types of regression models that can estimate performance on both unseen instruction combinations and different numbers of instructions which are not used during training. We demonstrate that a logistic regression model using instruction count as an explanatory variable can predict performance of following multiple instructions with approximately 10% error, even for unseen instruction combinations. We show that relatively modest sample sizes (500 for ManyIFEval and 300 for StyleMBPP) are sufficient for performance estimation, enabling efficient evaluation of LLMs under various instruction combinations.
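The logistic regression on instruction count can be sketched as below. This is an illustration rather than the authors' code: the data are synthetic, drawn from a made-up curve (intercept 2.0, slope -0.4) purely so there is something to fit, and the model is fitted with a few Newton steps in NumPy instead of a statistics library.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic(counts, followed, steps=25):
    """Fit P(all instructions followed) = sigmoid(b0 + b1 * instruction_count)
    by Newton's method. Returns (b0, b1)."""
    X = np.column_stack([np.ones(len(counts)), np.asarray(counts, dtype=float)])
    y = np.asarray(followed, dtype=float)
    w = np.zeros(2)
    for _ in range(steps):
        p = sigmoid(X @ w)
        gradient = X.T @ (y - p)
        hessian = (X * (p * (1.0 - p))[:, None]).T @ X
        w = w + np.linalg.solve(hessian, gradient)
    return w

# Synthetic outcomes (NOT results from the paper): 200 prompts per instruction count.
rng = np.random.default_rng(0)
counts = np.repeat(np.arange(1, 11), 200)
followed = rng.random(counts.size) < sigmoid(2.0 - 0.4 * counts)

intercept, slope = fit_logistic(counts, followed)
predicted = sigmoid(intercept + slope * np.arange(1, 11))  # estimated success per count
```

A negative fitted slope corresponds to the degradation the abstract reports, and `predicted` gives an estimated success rate for any instruction count, including counts that were held out of the fit.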

[43] SoM-1K: A Thousand-Problem Benchmark Dataset for Strength of Materials

Qixin Wan,Zilong Wang,Jingwen Zhou,Wanting Wang,Ziheng Geng,Jiachen Liu,Ran Cao,Minghui Cheng,Lu Cheng

Main category: cs.CL

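TL;DR: This paper introduces SoM-1K, the first large-scale multimodal benchmark for strength of materials, to evaluate foundation models on complex engineering problems, together with a Descriptions of Images (DoI) prompting strategy intended to help models interpret diagrams. Current models perform poorly on these tasks, and language models given DoI often beat vision language models given the diagrams directly, suggesting that precise text descriptions are more effective for current models.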

Details Motivation: The performance of foundation models on complex multimodal engineering problems is largely unexplored, especially in areas such as strength of materials that require joint reasoning over text and diagrams. There is no standard benchmark, and existing models struggle to read complicated diagrams accurately, so a dedicated dataset and more effective input formats are needed. Method: The SoM-1K dataset of 1,065 annotated problems is built, covering textual problem statements and schematic diagrams from realistic engineering scenarios. The Descriptions of Images (DoI) strategy supplies rigorous expert-written text descriptions of the diagrams as context. Eight representative foundation models, including large language models and vision language models, are evaluated, comparing raw image input with DoI text input. Result: Current foundation models perform poorly on SoM-1K overall, with the best model reaching only 56.6% accuracy. Text-only language models given DoI often outperform vision language models given the images directly, and error analysis shows that DoI markedly reduces visual misinterpretation errors. Conclusion: SoM-1K provides a rigorous benchmark for engineering AI, exposes the limits of current foundation models in multimodal scientific and engineering reasoning, and shows the need for stronger multimodal understanding, with particular attention to high-quality supporting text in complex tasks. Abstract: Foundation models have shown remarkable capabilities in various domains, but their performance on complex, multimodal engineering problems remains largely unexplored. We introduce SoM-1K, the first large-scale multimodal benchmark dataset dedicated to evaluating foundation models on problems in the strength of materials (SoM). The dataset, which contains 1,065 annotated SoM problems, mirrors real-world engineering tasks by including both textual problem statements and schematic diagrams. Due to the limited capabilities of current foundation models in understanding complicated visual information, we propose a novel prompting strategy called Descriptions of Images (DoI), which provides rigorous expert-generated text descriptions of the visual diagrams as the context. We evaluate eight representative foundation models, including both large language models (LLMs) and vision language models (VLMs). Our results show that current foundation models struggle significantly with these engineering problems, with the best-performing model achieving only 56.6% accuracy. Interestingly, we found that LLMs, when provided with DoI, often outperform VLMs provided with visual diagrams. A detailed error analysis reveals that DoI plays a crucial role in mitigating visual misinterpretation errors, suggesting that accurate text-based descriptions can be more effective than direct image input for current foundation models. This work establishes a rigorous benchmark for engineering AI and highlights a critical need for developing more robust multimodal reasoning capabilities in foundation models, particularly in scientific and engineering contexts.

[44] Which Cultural Lens Do Models Adopt? On Cultural Positioning Bias and Agentic Mitigation in LLMs

Yixin Wan,Xingrun Chen,Kai-Wei Chang

Main category: cs.CL

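TL;DR: This paper introduces the notion of culture positioning bias and uses the CultureLens benchmark to show that large language models tend to write from a mainstream US perspective while treating other cultures as outsiders. To address this, the authors design MFA, an agent-based fairness intervention framework that reduces the bias at inference time.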

Details Motivation: Large language models marginalize non-mainstream cultures in their generations, a problem the authors call culture positioning bias, which calls for systematic study and mitigation. Method: The CultureLens benchmark (4000 generation prompts and 3 evaluation metrics) assesses a model's cultural stance through a culturally situated interview script generation task. Two inference-time debiasing methods are proposed: the prompt-based FIP and the agent-based MFA framework, which includes a single-agent self-reflection and rewriting pipeline and a multi-agent pipeline with divided roles. Result: Current mainstream LLMs adopt an insider perspective in over 88 percent of US-context scripts but mostly take an outsider perspective for less dominant cultures. The agent-based MFA methods prove effective at reducing culture positioning bias, with the multi-agent structure performing better. Conclusion: Large language models exhibit culture positioning bias, and the agent-based fairness intervention framework (MFA) mitigates it effectively, offering a workable path toward fairer generative AI. Abstract: Large language models (LLMs) have unlocked a wide range of downstream generative applications. However, we found that they also risk perpetuating subtle fairness issues tied to culture, positioning their generations from the perspectives of the mainstream US culture while demonstrating salient externality towards non-mainstream ones. In this work, we identify and systematically investigate this novel culture positioning bias, in which an LLM's default generative stance aligns with a mainstream view and treats other cultures as outsiders. We propose the CultureLens benchmark with 4000 generation prompts and 3 evaluation metrics for quantifying this bias through the lens of a culturally situated interview script generation task, in which an LLM is positioned as an onsite reporter interviewing local people across 10 diverse cultures. Empirical evaluation on 5 state-of-the-art LLMs reveals a stark pattern: while models adopt insider tones in over 88 percent of US-contexted scripts on average, they disproportionately adopt mainly outsider stances for less dominant cultures. To resolve these biases, we propose 2 inference-time mitigation methods: a baseline prompt-based Fairness Intervention Pillars (FIP) method, and a structured Mitigation via Fairness Agents (MFA) framework consisting of 2 pipelines: (1) MFA-SA (Single-Agent) introduces a self-reflection and rewriting loop based on fairness guidelines. (2) MFA-MA (Multi-Agent) structures the process into a hierarchy of specialized agents: a Planner Agent (initial script generation), a Critique Agent (evaluates initial script against fairness pillars), and a Refinement Agent (incorporates feedback to produce a polished, unbiased script). Empirical results showcase the effectiveness of agent-based methods as a promising direction for mitigating biases in generative LLMs.
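The three-stage MFA-MA flow (plan, critique, refine) can be outlined as a plain function that chains three model calls. This is a structural sketch only: the prompt wording, the `mfa_ma` name, and the `llm` callable (any function mapping a prompt string to a completion string) are assumptions, not the authors' prompts or API.

```python
from typing import Callable

def mfa_ma(task: str, fairness_pillars: str, llm: Callable[[str], str]) -> str:
    """Chain a planner, a critique, and a refinement call and return the final script."""
    # Planner Agent: initial script generation.
    draft = llm(f"[Planner] Write an interview script for this task:\n{task}")
    # Critique Agent: evaluate the draft against the fairness pillars.
    feedback = llm(
        "[Critique] Evaluate the script against these fairness pillars:\n"
        f"{fairness_pillars}\n\nScript:\n{draft}"
    )
    # Refinement Agent: fold the feedback into a revised script.
    return llm(
        "[Refinement] Revise the script using the feedback.\n\n"
        f"Script:\n{draft}\n\nFeedback:\n{feedback}"
    )

# Stand-in model for demonstration: records each prompt and echoes its role tag.
calls = []
def fake_llm(prompt: str) -> str:
    calls.append(prompt)
    return "output of " + prompt.split("]")[0] + "]"

final_script = mfa_ma("Interview a street vendor.", "Avoid outsider framing.", fake_llm)
```

Passing the model in as a callable keeps the control flow testable without a real LLM; swapping `fake_llm` for an actual client function would not change `mfa_ma`.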

[45] PerHalluEval: Persian Hallucination Evaluation Benchmark for Large Language Models

Mohammad Hosseini,Kimia Hosseini,Shayan Bali,Zahra Zanjani,Saeedeh Momtazi

Main category: cs.CL

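TL;DR: PerHalluEval is the first dynamic hallucination evaluation benchmark for Persian. A three-stage LLM-driven pipeline with human validation generates plausible answers and summaries for detecting extrinsic and intrinsic hallucinations, and the log probabilities of generated tokens are used to select the most believable hallucinated instances.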

Details Motivation: Hallucination is widespread when large language models handle low-resource languages such as Persian, and no dedicated Persian benchmark exists, so a targeted evaluation tool is needed to identify and analyze it. Method: PerHalluEval uses a three-stage LLM-driven pipeline to generate hallucinated content for QA and summarization tasks, with human validation to improve quality. Token log probabilities are used to pick the most believable hallucinated samples, and human annotators mark contexts specific to Persian culture. Result: An evaluation of 12 open- and closed-source LLMs finds that models generally struggle to detect hallucinated Persian text. Providing external knowledge, such as the original document, partially mitigates hallucination, and models trained specifically for Persian show no significant advantage. Conclusion: Current large language models still hallucinate heavily in Persian and need better evaluation and mitigation strategies. External knowledge helps, but specialization in Persian is not a decisive factor. Abstract: Hallucination is a persistent issue affecting all large language models (LLMs), particularly within low-resource languages such as Persian. PerHalluEval (Persian Hallucination Evaluation) is the first dynamic hallucination evaluation benchmark tailored for the Persian language. Our benchmark leverages a three-stage LLM-driven pipeline, augmented with human validation, to generate plausible answers and summaries regarding QA and summarization tasks, focusing on detecting extrinsic and intrinsic hallucinations. Moreover, we used the log probabilities of generated tokens to select the most believable hallucinated instances. In addition, we engaged human annotators to highlight Persian-specific contexts in the QA dataset in order to evaluate LLMs' performance on content specifically related to Persian culture. Our evaluation of 12 LLMs, including open- and closed-source models using PerHalluEval, revealed that the models generally struggle in detecting hallucinated Persian text. We showed that providing external knowledge, i.e., the original document for the summarization task, could mitigate hallucination partially. Furthermore, there was no significant difference in terms of hallucination when comparing LLMs specifically trained for Persian with others.
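Selecting the "most believable" hallucination by token log probability might look like the sketch below. It assumes that believability is scored as the length-normalized mean log probability of the generated tokens; the abstract does not say how the log probabilities are aggregated, and the candidate records and values are invented.

```python
def mean_logprob(token_logprobs):
    # Length-normalized score so longer candidates are not penalized for length alone.
    return sum(token_logprobs) / len(token_logprobs)

def most_believable(candidates):
    """Return the candidate whose tokens the generator found most probable on average."""
    return max(candidates, key=lambda c: mean_logprob(c["token_logprobs"]))

# Invented candidates; in practice the log probabilities would come from the generating LLM.
candidates = [
    {"text": "hallucinated answer A", "token_logprobs": [-0.2, -0.4, -0.3]},
    {"text": "hallucinated answer B", "token_logprobs": [-1.5, -2.0, -0.9]},
]
chosen = most_believable(candidates)
```

Summing instead of averaging would favor shorter candidates, so the choice of aggregation is a design decision worth checking against the paper.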

[46] BESPOKE: Benchmark for Search-Augmented Large Language Model Personalization via Diagnostic Feedback

Hyunseo Kim,Sangam Lee,Kwangwook Seo,Dongha Lee

Main category: cs.CL

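TL;DR: This paper introduces BESPOKE, a realistic and diagnostic benchmark for evaluating personalization in search-augmented large language models, which uses authentic human interaction data and fine-grained feedback to analyze personalization needs systematically.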

Details Motivation: Search-augmented large language models still fall short of meeting diverse user needs, and the effect of personalization has not been evaluated systematically. Method: The BESPOKE benchmark is built from authentic human chat and search histories, queries written with detailed information needs, and responses annotated over a long period with fine-grained scores and diagnostic feedback. Result: BESPOKE enables fine-grained evaluation of personalized responses and reveals key requirements for effective personalization in information-seeking tasks. Conclusion: BESPOKE provides a realistic and diagnostic basis for evaluating personalized search-augmented LLMs and supports more systematic work in this area. Abstract: Search-augmented large language models (LLMs) have advanced information-seeking tasks by integrating retrieval into generation, reducing users' cognitive burden compared to traditional search systems. Yet they remain insufficient for fully addressing diverse user needs, which requires recognizing how the same query can reflect different intents across users and delivering information in preferred forms. While recent systems such as ChatGPT and Gemini attempt personalization by leveraging user histories, systematic evaluation of such personalization is under-explored. To address this gap, we propose BESPOKE, a realistic benchmark for evaluating personalization in search-augmented LLMs. BESPOKE is designed to be both realistic, by collecting authentic chat and search histories directly from humans, and diagnostic, by pairing responses with fine-grained preference scores and feedback. The benchmark is constructed through long-term, deeply engaged human annotation, where human annotators contributed their own histories, authored queries with detailed information needs, and evaluated responses with scores and diagnostic feedback. Leveraging BESPOKE, we conduct systematic analyses that reveal key requirements for effective personalization in information-seeking tasks, providing a foundation for fine-grained evaluation of personalized search-augmented LLMs. Our code and data are available at https://augustinlib.github.io/BESPOKE/.

[47] VoiceBBQ: Investigating Effect of Content and Acoustics in Social Bias of Spoken Language Model

Junhyuk Choi,Ro-hoon Oh,Jihwan Seol,Bugeun Kim

Main category: cs.CL

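TL;DR: VoiceBBQ is a spoken extension of the BBQ dataset for measuring social bias in spoken language models along both content and acoustic dimensions. Under controlled voice conditions, it evaluates two models, LLaMA-Omni and Qwen2-Audio, and finds that they handle gender and accent bias differently.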

Details Motivation: Because of the nature of speech, social bias in spoken language models can arise from both content and acoustics, which existing text benchmarks cannot fully assess, so a spoken dataset that measures both kinds of bias is needed. Method: Every BBQ context is converted into controlled voice conditions to build VoiceBBQ, and two spoken language models (LLaMA-Omni and Qwen2-Audio) are evaluated for accuracy, bias, and consistency along the content and acoustic axes. Result: LLaMA-Omni resists acoustic bias but amplifies gender and accent bias, whereas Qwen2-Audio substantially dampens these cues while preserving content fidelity. Conclusion: VoiceBBQ is a compact, ready-to-use testbed for jointly diagnosing content and acoustic social bias in spoken language models, supporting more complete evaluation and improvement of model fairness. Abstract: We introduce VoiceBBQ, a spoken extension of BBQ (Bias Benchmark for Question Answering), a dataset that measures social bias by presenting ambiguous or disambiguated contexts followed by questions that may elicit stereotypical responses. Due to the nature of speech, social bias in Spoken Language Models (SLMs) can emerge from two distinct sources: 1) content aspect and 2) acoustic aspect. The dataset converts every BBQ context into controlled voice conditions, enabling per-axis accuracy, bias, and consistency scores that remain comparable to the original text benchmark. Using VoiceBBQ, we evaluate two SLMs, LLaMA-Omni and Qwen2-Audio, and observe architectural contrasts: LLaMA-Omni resists acoustic bias while amplifying gender and accent bias, whereas Qwen2-Audio substantially dampens these cues while preserving content fidelity. VoiceBBQ thus provides a compact, drop-in testbed for jointly diagnosing content and acoustic bias across spoken language models.

[48] Acoustic-based Gender Differentiation in Speech-aware Language Models

Junhyuk Choi,Jihwan Seol,Nayeon Kim,Chanhee Cho,EunBin Cho,Bugeun Kim

Main category: cs.CL

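TL;DR: This paper studies gender bias in speech-aware language models (SpeechLMs). Although overall responses look the same regardless of gender, there is a paradoxical bias: models lean toward male-oriented responses on gender-stereotypical questions, yet ignore gender in situations where differentiating would be appropriate. The bias stems mainly from the acoustic tokens produced by the Whisper speech encoder, which points to a need for more careful handling of gender information.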

Details Motivation: The paper investigates whether speech-aware language models show implicit gender bias when processing voices of different genders, in particular whether identical questions receive different responses depending on the speaker's gender. Method: A new dataset of 9,208 speech samples is built, covering three question categories (Gender-Independent, Gender-Stereotypical, and Gender-Dependent). The LLaMA-Omni series is evaluated for its response patterns under different-gender speech input, and the SpeechLMs are compared with their backbone LLMs. Result: Models consistently show a male orientation on Gender-Stereotypical questions and ignore gender on questions where differentiation is appropriate. The bias does not come from neutral options or the perceived gender of the voice, and it persists after gender neutralization is applied. Further analysis attributes it mainly to male-oriented acoustic tokens produced by the Whisper speech encoder. Conclusion: Current SpeechLMs pursue general fairness but fail to use gender information appropriately, which produces the paradoxical bias; more sophisticated techniques are needed to handle gender in speech properly. Abstract: Speech-aware Language Models (SpeechLMs) have fundamentally transformed human-AI interaction by enabling voice-based communication, yet they may exhibit acoustic-based gender differentiation where identical questions lead to different responses based on the speaker's gender. This paper proposes a new dataset that enables systematic analysis of this phenomenon, containing 9,208 speech samples across three categories: Gender-Independent, Gender-Stereotypical, and Gender-Dependent. We further evaluated the LLaMA-Omni series and discovered a paradoxical pattern: while overall responses seem identical regardless of gender, the pattern is far from unbiased. Specifically, in Gender-Stereotypical questions, all models consistently exhibited male-oriented responses; meanwhile, in Gender-Dependent questions where gender differentiation would be contextually appropriate, models instead exhibited responses independent of gender. We also confirm that this pattern does not result from neutral options or from the perceived gender of a voice. When we allow neutral responses, models tend to respond neutrally in Gender-Dependent questions as well. The paradoxical pattern also persists when we apply gender neutralization methods to the speech. Through comparison between SpeechLMs and their corresponding backbone LLMs, we confirmed that these paradoxical patterns primarily stem from Whisper speech encoders, which generate male-oriented acoustic tokens. These findings reveal that current SpeechLMs may not successfully remove gender biases though they prioritized general fairness principles over contextual appropriateness, highlighting the need for more sophisticated techniques to utilize gender information properly in speech technology.

[49] AutoIntent: AutoML for Text Classification

Ilya Alekseev,Roman Solomatin,Darina Rustamova,Denis Kuznetsov

Main category: cs.CL

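TL;DR: AutoIntent is an automated machine learning tool for text classification. It provides end-to-end automation, covering embedding model selection, classifier optimization, and decision threshold tuning, supports multi-label classification and out-of-scope detection, and outperforms existing AutoML tools on standard intent classification datasets.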

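Details Motivation: Existing AutoML tools lack end-to-end automation for text classification, particularly for embedding model selection, classifier optimization, and decision threshold tuning, and offer limited support for multi-label classification and out-of-scope detection. Method: AutoIntent has a modular design with an sklearn-like interface that integrates embedding model selection, classifier optimization, and decision threshold tuning into one automated pipeline, with support for multi-label classification and out-of-scope detection. Result: On standard intent classification datasets AutoIntent outperforms existing AutoML tools while letting users trade effectiveness against resource consumption. Conclusion: Through full automation and a flexible modular architecture, AutoIntent delivers better performance and practicality for text classification, especially in applications that need multi-label classification and out-of-scope detection. Abstract: AutoIntent is an automated machine learning tool for text classification tasks. Unlike existing solutions, AutoIntent offers end-to-end automation with embedding model selection, classifier optimization, and decision threshold tuning, all within a modular, sklearn-like interface. The framework is designed to support multi-label classification and out-of-scope detection. AutoIntent demonstrates superior performance compared to existing AutoML tools on standard intent classification datasets and enables users to balance effectiveness and resource consumption.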

[50] Retrieval over Classification: Integrating Relation Semantics for Multimodal Relation Extraction

Lei Hei,Tingjing Liao,Yingxin Pei,Yiyang Qi,Jiaqi Wang,Ruiting Li,Feiliang Ren

Main category: cs.CL

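TL;DR: The paper proposes ROC, a framework that recasts multimodal relation extraction from a classification task into a semantics-driven retrieval task. It incorporates entity type and positional information, expands relation labels into natural language descriptions, and aligns entity-relation pairs through semantic similarity-based contrastive learning, achieving state-of-the-art performance.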

Details Motivation: Conventional multimodal relation extraction relies on a classification paradigm that represents relations as discrete labels, overlooking structural constraints such as entity types and positional cues and lacking the semantic expressiveness needed for fine-grained relation understanding. Method: The ROC framework (1) integrates entity type and positional information through a multimodal encoder, (2) uses a large language model to expand relation labels into natural language descriptions, and (3) aligns entity-relation pairs via semantic similarity-based contrastive learning. Result: ROC achieves state-of-the-art performance on the MNRE and MORE benchmark datasets and shows stronger robustness and interpretability. Conclusion: By moving multimodal relation extraction from classification to retrieval, ROC improves semantic expressiveness and structural modelling and offers a new direction for fine-grained relation understanding. Abstract: Relation extraction (RE) aims to identify semantic relations between entities in unstructured text. Although recent work extends traditional RE to multimodal scenarios, most approaches still adopt classification-based paradigms with fused multimodal features, representing relations as discrete labels. This paradigm has two significant limitations: (1) it overlooks structural constraints like entity types and positional cues, and (2) it lacks semantic expressiveness for fine-grained relation understanding. We propose Retrieval Over Classification (ROC), a novel framework that reformulates multimodal RE as a retrieval task driven by relation semantics. ROC integrates entity type and positional information through a multimodal encoder, expands relation labels into natural language descriptions using a large language model, and aligns entity-relation pairs via semantic similarity-based contrastive learning. Experiments show that our method achieves state-of-the-art performance on the benchmark datasets MNRE and MORE and exhibits stronger robustness and interpretability.
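The retrieval step at inference time can be pictured as a nearest-neighbor lookup over relation-description embeddings. The sketch below covers only that lookup, under the assumption that cosine similarity is the scoring function; the encoder and the contrastive training are out of scope, and the relation names and 3-dimensional vectors are toy placeholders, not MNRE or MORE labels.

```python
import numpy as np

def cosine_similarity(query, candidates):
    """Cosine similarity between one query vector and each row of candidates."""
    query = query / np.linalg.norm(query)
    candidates = candidates / np.linalg.norm(candidates, axis=1, keepdims=True)
    return candidates @ query

def retrieve_relation(pair_embedding, description_embeddings, relation_names):
    """Pick the relation whose description embedding is closest to the entity pair."""
    scores = cosine_similarity(pair_embedding, description_embeddings)
    return relation_names[int(np.argmax(scores))], scores

# Toy placeholders: in practice these would come from the trained encoders.
relation_names = ["member_of", "lives_in", "located_in"]
description_embeddings = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]])
pair_embedding = np.array([0.2, 0.9, 0.1])

predicted_relation, scores = retrieve_relation(
    pair_embedding, description_embeddings, relation_names
)
```

One practical consequence of the retrieval framing is that a new relation can be added by embedding its description, without resizing a classifier head.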

[51] Learning the Wrong Lessons: Syntactic-Domain Spurious Correlations in Language Models

Chantal Shaib,Vinith M. Suriyakumar,Levent Sagun,Byron C. Wallace,Marzyeh Ghassemi

Main category: cs.CL

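TL;DR: This paper studies the roles of syntactic templates, domain, and semantics in task-instruction pairs. It finds that spurious correlations between syntax and domain in training data can degrade model performance and can be exploited to bypass safety refusals, so training data should be syntactically diverse and such correlations should be tested for explicitly.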

Details Motivation: To improve how large language models understand instructions, the interplay of syntax, domain, and semantics needs to be examined, in particular how spurious correlations between syntactic templates and domains affect performance and safety. Method: A synthetic training dataset is built to analyze the effect of syntactic-domain correlation on performance, an evaluation framework is proposed to detect the phenomenon in several open and closed models, and a case study on safety fine-tuning is presented. Result: Syntactic-domain correlation lowers performance of OLMo-2 models on entity knowledge tasks (mean 0.51 +/- 0.06). The phenomenon is confirmed in OLMo-2, Llama-4-Maverick, and GPT-4o, and it can be used to bypass safety refusals. Conclusion: Syntactic-domain correlations should be tested for explicitly, and training data should be syntactically diverse, especially within each domain, so that such spurious correlations do not harm performance or safety. Abstract: For an LLM to correctly respond to an instruction it must understand both the semantics and the domain (i.e., subject area) of a given task-instruction pair. However, syntax can also convey implicit information. Recent work shows that syntactic templates (frequent sequences of Part-of-Speech (PoS) tags) are prevalent in training data and often appear in model outputs. In this work we characterize syntactic templates, domain, and semantics in task-instruction pairs. We identify cases of spurious correlations between syntax and domain, where models learn to associate a domain with syntax during training; this can sometimes override prompt semantics. Using a synthetic training dataset, we find that the syntactic-domain correlation can lower performance (mean 0.51 +/- 0.06) on entity knowledge tasks in OLMo-2 models (1B-13B). We introduce an evaluation framework to detect this phenomenon in trained models, and show that it occurs on a subset of the FlanV2 dataset in open (OLMo-2-7B; Llama-4-Maverick), and closed (GPT-4o) models. Finally, we present a case study on the implications for safety finetuning, showing that unintended syntactic-domain correlations can be used to bypass refusals in OLMo-2-7B Instruct and GPT-4o. Our findings highlight two needs: (1) to explicitly test for syntactic-domain correlations, and (2) to ensure syntactic diversity in training data, specifically within domains, to prevent such spurious correlations.
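Since the abstract defines syntactic templates as frequent sequences of PoS tags, one simple way to surface them is to count PoS n-grams over a tagged corpus. The sketch below assumes the tagging has already been done; the trigram length, the frequency cutoff, and the three hand-tagged sentences are arbitrary illustrative choices, not the paper's settings.

```python
from collections import Counter

def pos_ngrams(tags, n):
    """All contiguous length-n PoS tag sequences in one sentence."""
    return [tuple(tags[i:i + n]) for i in range(len(tags) - n + 1)]

def frequent_templates(tagged_sentences, n=3, min_count=2):
    """PoS n-grams that occur at least min_count times across the corpus."""
    counts = Counter()
    for tags in tagged_sentences:
        counts.update(pos_ngrams(tags, n))
    return {template: c for template, c in counts.items() if c >= min_count}

# Hand-tagged toy corpus using Universal PoS tag names.
corpus = [
    ["DET", "NOUN", "VERB", "DET", "NOUN"],
    ["DET", "NOUN", "VERB", "ADJ"],
    ["PRON", "VERB", "DET", "NOUN"],
]
templates = frequent_templates(corpus)
```

Running the same count separately per domain and comparing the resulting template sets would be one rough way to look for the syntax-domain association the paper warns about.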

[52] Who's Laughing Now? An Overview of Computational Humour Generation and Explanation

Tyler Loakman,William Thorne,Chenghua Lin

Main category: cs.CL

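TL;DR: This paper surveys computational humour as it relates to generating and explaining humour. Although understanding humour has the hallmarks of a foundational NLP task, work on generating and explaining humour beyond puns remains sparse, state-of-the-art models still fall short of human ability, and future research should account for the subjective and ethically ambiguous nature of humour.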

Details Motivation: Humour is an abstract, creative, and context-dependent construct that demands extensive reasoning, which makes it a fitting task for assessing the common-sense knowledge and reasoning abilities of modern large language models. Yet research on generating and explaining humour other than puns is limited and models remain well below human level, so a systematic review is needed. Method: A literature survey reviews the state of research on the two generative tasks of computational humour, creation (such as joke writing) and explanation (such as analyzing how a joke works), and summarizes the limitations of existing methods. Result: Although humour understanding is of foundational importance, progress on generating and explaining non-pun humour is limited, and state-of-the-art models still lag significantly behind humans. Conclusion: Computational humour should be treated as an important subdiscipline of NLP, and future research should take the subjectivity and ethical ambiguity of humour seriously while pushing for deeper and broader models. Abstract: The creation and perception of humour is a fundamental human trait, positioning its computational understanding as one of the most challenging tasks in natural language processing (NLP). As an abstract, creative, and frequently context-dependent construct, humour requires extensive reasoning to understand and create, making it a pertinent task for assessing the common-sense knowledge and reasoning abilities of modern large language models (LLMs). In this work, we survey the landscape of computational humour as it pertains to the generative tasks of creation and explanation. We observe that, despite the task of understanding humour bearing all the hallmarks of a foundational NLP task, work on generating and explaining humour beyond puns remains sparse, while state-of-the-art models continue to fall short of human capabilities. We bookend our literature survey by motivating the importance of computational humour processing as a subdiscipline of NLP and presenting an extensive discussion of future directions for research in the area that takes into account the subjective and ethically ambiguous nature of humour.

[53] GEP: A GCG-Based method for extracting personally identifiable information from chatbots built on small language models

Jieli Zhu,Vi Ngoc-Nha Tran

Main category: cs.CL

TL;DR: This paper studies leakage of personally identifiable information (PII) from chatbots built on small language models (SLMs) for downstream tasks. It proposes GEP, a method based on greedy coordinate gradient (GCG), that markedly increases the amount of PII extracted and remains effective in more complex, realistic settings.

Details Motivation: SLMs are attractive for their efficiency and performance, but the risk that they leak PII in downstream tasks has not been studied adequately, particularly in sensitive domains such as healthcare. Method: The authors first finetune ChatBioGPT, built on the BioGPT backbone, using the medical datasets Alpaca and HealthCareMagic; they then propose GEP, a GCG-based PII extraction method, and evaluate it under both fixed-template and free-style insertion. Result: GEP yields up to 60× more PII leakage than earlier template-based attacks, and still reveals a PII leakage rate of up to 4.53% in the harder free-style insertion setting. Conclusion: GEP is an effective PII leakage detection method for SLMs and exposes privacy risks in current SLMs; medical dialogue systems in particular need stronger data de-identification and model security. Abstract: Small language models (SLMs) have become unprecedentedly appealing because, in certain fields, they perform roughly on par with large language models (LLMs) while consuming less energy and time during training and inference. However, the personally identifiable information (PII) leakage of SLMs for downstream tasks has yet to be explored. In this study, we investigate the PII leakage of a chatbot based on an SLM. We first finetune a new chatbot, ChatBioGPT, based on the backbone of BioGPT using the medical datasets Alpaca and HealthCareMagic. Its BERTScore performance is comparable to that reported in previous studies of ChatDoctor and ChatGPT. Based on this model, we show that previous template-based PII attack methods cannot effectively extract the PII in the dataset for leakage detection in the SLM setting. We then propose GEP, a greedy coordinate gradient-based (GCG) method specifically designed for PII extraction. We conduct experimental studies of GEP, and the results show up to 60× more leakage than the previous template-based methods. We further extend GEP to a more complicated and realistic situation by conducting free-style insertion, where the PII inserted in the dataset takes various syntactic forms instead of fixed templates, and GEP is still able to reveal a PII leakage rate of up to 4.53%.

[54] Eigen-1: Adaptive Multi-Agent Refinement with Monitor-Based RAG for Scientific Reasoning

Xiangru Tang,Wanghan Xu,Yujie Wang,Zijie Guo,Daniel Shao,Jiapeng Chen,Cixuan Zhang,Ziyi Wang,Lixin Zhang,Guancheng Wan,Wenlong Zhang,Lei Bai,Zhenfei Yin,Philip Torr,Hanrui Wang,Di Jin

Main category: cs.CL

TL;DR: Proposes a unified framework that combines implicit retrieval with structured collaboration. Using a Monitor-based retrieval module and hierarchical solution refinement, it reaches the highest accuracy reported to date on scientific reasoning tasks while substantially reducing token usage and agent steps.

Details Motivation: Address two bottlenecks of LLMs in scientific reasoning: explicit retrieval interrupts reasoning and adds overhead, and multi-agent pipelines dilute strong solutions by averaging across candidates. Method: Introduces a Monitor-based implicit retrieval module (token-level knowledge integration) together with Hierarchical Solution Refinement (HSR) and Quality-Aware Iterative Reasoning (QAIR), giving implicit knowledge acquisition and quality-aware iterative refinement. Result: 48.3% accuracy on HLE Bio/Chem Gold (13.4 points above the strongest agent baseline and up to 18.1 points above frontier LLMs), with token usage down 53.5% and agent steps down 43.7%; results on SuperGPQA and TRQA confirm cross-domain robustness; error analysis shows that over 85% of failures involve both reasoning and knowledge deficits. Conclusion: Implicit augmentation and structured refinement overcome the inefficiencies of explicit tool use and uniform aggregation, yielding a more efficient and accurate framework for scientific reasoning. Abstract: Large language models (LLMs) have recently shown strong progress on scientific reasoning, yet two major bottlenecks remain. First, explicit retrieval fragments reasoning, imposing a hidden "tool tax" of extra tokens and steps. Second, multi-agent pipelines often dilute strong solutions by averaging across all candidates. We address these challenges with a unified framework that combines implicit retrieval and structured collaboration. At its foundation, a Monitor-based retrieval module operates at the token level, integrating external knowledge with minimal disruption to reasoning. On top of this substrate, Hierarchical Solution Refinement (HSR) iteratively designates each candidate as an anchor to be repaired by its peers, while Quality-Aware Iterative Reasoning (QAIR) adapts refinement to solution quality. On Humanity's Last Exam (HLE) Bio/Chem Gold, our framework achieves 48.3% accuracy -- the highest reported to date, surpassing the strongest agent baseline by 13.4 points and leading frontier LLMs by up to 18.1 points, while simultaneously reducing token usage by 53.5% and agent steps by 43.7%. Results on SuperGPQA and TRQA confirm robustness across domains. Error analysis shows that reasoning failures and knowledge gaps co-occur in over 85% of cases, while diversity analysis reveals a clear dichotomy: retrieval tasks benefit from solution variety, whereas reasoning tasks favor consensus. Together, these findings demonstrate how implicit augmentation and structured refinement overcome the inefficiencies of explicit tool use and uniform aggregation.
Code is available at: https://github.com/tangxiangru/Eigen-1.

Xinzhe Xu,Liang Zhao,Hongshen Xu,Chen Chen

Main category: cs.CL

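TL;DR: This paper introduces CLaw, a new benchmark for evaluating LLMs on Chinese legal knowledge and its application in reasoning. CLaw comprises a fine-grained corpus of 306 Chinese national statutes and 254 case-based reasoning instances. Experiments show that most current LLMs struggle markedly to reproduce legal provisions accurately, underscoring that reliable legal reasoning needs both accurate knowledge retrieval and strong general reasoning.

Details Motivation: LLMs handle legal text unreliably because their general pre-training lacks specialized focus, so they have difficulty citing provisions accurately; a dedicated benchmark is needed to measure how well they know and apply Chinese law. Method: CLaw has two parts: (1) a fine-grained corpus covering all 306 Chinese national statutes, segmented to the subparagraph level and annotated with historical revision times (64,849 entries); and (2) 254 case-based reasoning tasks derived from materials curated by China's Supreme Court, used to assess the application of legal knowledge. Result: Most current LLMs perform poorly at recalling and citing legal provisions accurately, which seriously undermines the trustworthiness of their legal reasoning; knowledge retrieval errors are the main bottleneck. Conclusion: Trustworthy legal reasoning requires accurate knowledge retrieval (which may be improved through supervised fine-tuning or retrieval-augmented generation) combined with strong reasoning ability; CLaw provides a benchmark and insights for advancing domain-specific legal reasoning in LLMs. Abstract: Large Language Models (LLMs) are increasingly tasked with analyzing legal texts and citing relevant statutes, yet their reliability is often compromised by general pre-training that ingests legal texts without specialized focus, obscuring the true depth of their legal knowledge. This paper introduces CLaw, a novel benchmark specifically engineered to meticulously evaluate LLMs on Chinese legal knowledge and its application in reasoning. CLaw comprises two key components: (1) a comprehensive, fine-grained corpus of all 306 Chinese national statutes, segmented to the subparagraph level and incorporating precise historical revision timesteps for rigorous recall evaluation (64,849 entries), and (2) a challenging set of 254 case-based reasoning instances derived from China Supreme Court curated materials to assess the practical application of legal knowledge. Our empirical evaluation reveals that most contemporary LLMs significantly struggle to faithfully reproduce legal provisions. As accurate retrieval and citation of legal provisions form the basis of legal reasoning, this deficiency critically undermines the reliability of their responses. We contend that achieving trustworthy legal reasoning in LLMs requires a robust synergy of accurate knowledge retrieval--potentially enhanced through supervised fine-tuning (SFT) or retrieval-augmented generation (RAG)--and strong general reasoning capabilities. This work provides an essential benchmark and critical insights for advancing domain-specific LLM reasoning, particularly within the complex legal sphere.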

[56] SGMem: Sentence Graph Memory for Long-Term Conversational Agents

Yaxiong Wu,Yongyue Zhang,Sheng Liang,Yong Liu

Main category: cs.CL

TL;DR: This paper proposes SGMem (Sentence Graph Memory), a memory management method for long conversations whose history exceeds the context window of large language models.

Details Motivation: Existing methods based on fact extraction or summarization reduce redundancy but struggle to organize and retrieve relevant information across different granularities of dialogue and generated memory. Method: SGMem represents dialogue as sentence-level graphs within chunked units, capturing associations across turn-, round-, and session-level contexts, and combines raw dialogue with generated memory (summaries, facts, and insights) to supply coherent, relevant context. Result: On LongMemEval and LoCoMo, SGMem improves accuracy in long-term conversational question answering and outperforms several strong baselines. Conclusion: By integrating information at multiple granularities in a graph structure, SGMem improves memory retrieval and response generation in long-term conversational systems. Abstract: Long-term conversational agents require effective memory management to handle dialogue histories that exceed the context window of large language models (LLMs). Existing methods based on fact extraction or summarization reduce redundancy but struggle to organize and retrieve relevant information across different granularities of dialogue and generated memory. We introduce SGMem (Sentence Graph Memory), which represents dialogue as sentence-level graphs within chunked units, capturing associations across turn-, round-, and session-level contexts. By combining retrieved raw dialogue with generated memory such as summaries, facts and insights, SGMem supplies LLMs with coherent and relevant context for response generation. Experiments on LongMemEval and LoCoMo show that SGMem consistently improves accuracy and outperforms strong baselines in long-term conversational question answering.

[57] Query-Centric Graph Retrieval Augmented Generation

Yaxiong Wu,Jianyuan Bo,Yongyue Zhang,Sheng Liang,Yong Liu

Main category: cs.CL

TL;DR: QCG-RAG is a query-centric graph retrieval-augmented generation framework. By building query-centric graphs with controllable granularity and using a multi-hop retrieval mechanism, it outperforms existing methods on multi-hop question answering.

Details Motivation: Existing graph-based RAG methods face a granularity dilemma: fine-grained entity-level graphs are costly and lose context, while coarse document-level graphs fail to capture complex relations. Method: QCG-RAG uses Doc2Query to build query-centric graphs and adds a tailored multi-hop retrieval mechanism, enabling query-granular indexing and retrieval of relevant text chunks. Result: On LiHuaWorld and MultiHop-RAG, QCG-RAG consistently beats existing chunk-based and graph-based RAG methods in question answering accuracy. Conclusion: With query-centric graphs and multi-hop retrieval, QCG-RAG offers a new paradigm for multi-hop reasoning that balances retrieval granularity against contextual completeness. Abstract: Graph-based retrieval-augmented generation (RAG) enriches large language models (LLMs) with external knowledge for long-context understanding and multi-hop reasoning, but existing methods face a granularity dilemma: fine-grained entity-level graphs incur high token costs and lose context, while coarse document-level graphs fail to capture nuanced relations. We introduce QCG-RAG, a query-centric graph RAG framework that enables query-granular indexing and multi-hop chunk retrieval. Our query-centric approach leverages Doc2Query and Doc2Query-- to construct query-centric graphs with controllable granularity, improving graph quality and interpretability. A tailored multi-hop retrieval mechanism then selects relevant chunks via the generated queries. Experiments on LiHuaWorld and MultiHop-RAG show that QCG-RAG consistently outperforms prior chunk-based and graph-based RAG methods in question answering accuracy, establishing a new paradigm for multi-hop reasoning.

[58] Un-Doubling Diffusion: LLM-guided Disambiguation of Homonym Duplication

Evgeny Kaskov,Elizaveta Petrova,Petr Surovtsev,Anna Kostikova,Ilya Mistiurin,Alexander Kapitanov,Alexander Nagaev

Main category: cs.CL

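TL;DR: This paper proposes a way to measure homonym duplication in diffusion models and evaluates different models with both vision-language models and human raters. Prompt expansion is found to mitigate the generation ambiguity caused by homonyms and by Anglocentric bias.

Details Motivation: In text-to-image generation, a homonym can make the model render several senses at once, causing confusion; in addition, non-English words may acquire new homonym ambiguity once translated into English, which harms generation quality. Method: The authors propose an evaluation method that quantifies the homonym duplication rate, using VLM-based automatic evaluation complemented by human evaluation, and apply prompt expansion as a mitigation strategy. Result: Homonym duplication is common across diffusion models, and prompt expansion markedly reduces the ambiguity caused by homonyms and by Anglocentric bias. Conclusion: Prompt expansion is an effective way to mitigate homonym duplication and cross-lingual loss of meaning, improving text-to-image accuracy for polysemous words and non-English contexts. Abstract: Homonyms are words with identical spelling but distinct meanings, which pose challenges for many generative models. When a homonym appears in a prompt, diffusion models may generate multiple senses of the word simultaneously, which is known as homonym duplication. This issue is further complicated by an Anglocentric bias, which includes an additional translation step before the text-to-image model pipeline. As a result, even words that are not homonymous in the original language may become homonyms and lose their meaning after translation into English. In this paper, we introduce a method for measuring duplication rates and conduct evaluations of different diffusion models using both automatic evaluation utilizing Vision-Language Models (VLM) and human evaluation. Additionally, we investigate methods to mitigate the homonym duplication problem through prompt expansion, demonstrating that this approach also effectively reduces duplication related to Anglocentric bias. The code for the automatic evaluation pipeline is publicly available.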
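
As a rough illustration of what a duplication rate could look like, the sketch below counts the share of generated images in which two or more senses of the homonym were detected. The abstract does not give the exact formula, so this definition, the function name `duplication_rate`, and the toy judgments are all assumptions made for illustration; in practice the per-image sense sets would come from a VLM or from human raters.

```python
from typing import Iterable, Set


def duplication_rate(detected_senses: Iterable[Set[str]]) -> float:
    """Fraction of generated images in which two or more senses were detected.

    Assumed definition for illustration; not taken from the paper.
    """
    images = list(detected_senses)
    if not images:
        raise ValueError("need at least one image")
    duplicated = sum(1 for senses in images if len(senses) >= 2)
    return duplicated / len(images)


# Hypothetical judgments for four images generated from the prompt "a bat".
judgments = [
    {"animal"},
    {"animal", "baseball bat"},
    {"baseball bat"},
    {"animal", "baseball bat"},
]
rate = duplication_rate(judgments)  # 2 of the 4 images show both senses
```

Comparing this rate before and after prompt expansion would be one simple way to quantify the mitigation the abstract describes.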

[59] LLM Output Homogenization is Task Dependent

Shomik Jain,Jack Lanchantin,Maximilian Nickel,Karen Ullrich,Ashia Wilson,Jamelle Watson-Daniels

Main category: cs.CL

TL;DR: This paper presents a task taxonomy and introduces task-anchored functional diversity to evaluate and mitigate output homogenization in LLMs, along with a task-anchored sampling technique that raises functional diversity while preserving response quality.

Details Motivation: Prior work on output homogenization does not distinguish diversity requirements by task type, so its evaluation and mitigation methods are imprecise. Method: The authors propose a taxonomy of eight task categories, define a task-dependent functional diversity measure, and design a task-anchored sampling technique that increases diversity where the task calls for it. Result: The method raises functional diversity across task categories while maintaining output quality, challenging the common view that diversity and quality trade off against each other. Conclusion: Accounting for task dependence improves both the evaluation and the mitigation of output homogenization and gives a finer-grained framework for managing LLM diversity. Abstract: A large language model can be less helpful if it exhibits output response homogenization. But whether two responses are considered homogeneous, and whether such homogenization is problematic, both depend on the task category. For instance, in objective math tasks, we often expect no variation in the final answer but anticipate variation in the problem-solving strategy. For creative writing tasks, by contrast, we may expect variation in key narrative components (e.g. plot, genre, setting, etc), beyond the vocabulary or embedding diversity produced by temperature-sampling. Previous work addressing output homogenization often fails to conceptualize diversity in a task-dependent way. We address this gap in the literature directly by making the following contributions. (1) We present a task taxonomy comprised of eight task categories that each have distinct conceptualizations of output homogenization. (2) We introduce task-anchored functional diversity to better evaluate output homogenization. (3) We propose a task-anchored sampling technique that increases functional diversity for task categories where homogenization is undesired, while preserving homogenization where it is desired. (4) We challenge the perceived existence of a diversity-quality trade-off by increasing functional diversity while maintaining response quality. Overall, we demonstrate how task dependence improves the evaluation and mitigation of output homogenization.

[60] LLMTrace: A Corpus for Classification and Fine-Grained Localization of AI-Written Text

Irina Tolstykh,Aleksandra Tsybina,Sergey Yakubson,Maksim Kuprashevich

Main category: cs.CL

TL;DR: This paper introduces LLMTrace, a large-scale bilingual (English and Russian) corpus for AI-generated text detection that supports both full-text binary classification and character-level localization of AI-generated segments.

Details Motivation: Existing datasets for AI-generated text detection are mostly produced with outdated models, are predominantly English, and lack support for precisely locating the AI-written parts of mixed human-AI text. Method: A bilingual corpus with character-level annotations is built using a range of modern proprietary and open-source LLMs, supporting full-text classification and AI-generated interval detection. Result: LLMTrace fills gaps in language diversity, model recency, and fine-grained annotation, and can be used to train more nuanced, practical AI detection models. Conclusion: LLMTrace is a valuable resource for research on AI-generated text detection, particularly for mixed authorship and precise segment localization. Abstract: The widespread use of human-like text from Large Language Models (LLMs) necessitates the development of robust detection systems. However, progress is limited by a critical lack of suitable training data; existing datasets are often generated with outdated models, are predominantly in English, and fail to address the increasingly common scenario of mixed human-AI authorship. Crucially, while some datasets address mixed authorship, none provide the character-level annotations required for the precise localization of AI-generated segments within a text. To address these gaps, we introduce LLMTrace, a new large-scale, bilingual (English and Russian) corpus for AI-generated text detection. Constructed using a diverse range of modern proprietary and open-source LLMs, our dataset is designed to support two key tasks: traditional full-text binary classification (human vs. AI) and the novel task of AI-generated interval detection, facilitated by character-level annotations. We believe LLMTrace will serve as a vital resource for training and evaluating the next generation of more nuanced and practical AI detection models. The project page is available at https://sweetdream779.github.io/LLMTrace-info/ (iitolstykh/LLMTrace).
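
To make the interval detection task concrete, here is a minimal sketch of how predicted AI-written spans might be scored against character-level gold annotations. The half-open `(start, end)` span format and the character-level intersection-over-union metric are assumptions chosen for illustration; the abstract does not state the corpus's annotation schema or its official evaluation metric.

```python
from typing import List, Set, Tuple

Span = Tuple[int, int]  # half-open character interval [start, end) (assumed format)


def covered_chars(spans: List[Span]) -> Set[int]:
    """Set of character offsets covered by a list of spans."""
    chars: Set[int] = set()
    for start, end in spans:
        chars.update(range(start, end))
    return chars


def char_iou(predicted: List[Span], gold: List[Span]) -> float:
    """Character-level intersection-over-union between predicted and gold spans."""
    pred_chars, gold_chars = covered_chars(predicted), covered_chars(gold)
    union = pred_chars | gold_chars
    if not union:
        return 1.0  # both empty: a fully human text was predicted as fully human
    return len(pred_chars & gold_chars) / len(union)


# Gold says characters 0-4 are AI-written; the detector predicted 3-7.
score = char_iou(predicted=[(3, 8)], gold=[(0, 5)])  # 2 shared / 8 covered
```

Full-text binary classification falls out as a special case: a text is labelled AI-written when the gold span covers the whole string.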

[61] Bounds of Chain-of-Thought Robustness: Reasoning Steps, Embed Norms, and Beyond

Dingzirui Wang,Xuanliang Zhang,Keyan Xu,Qingfu Zhu,Wanxiang Che,Yang Deng

Main category: cs.CL

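TL;DR: This paper theoretically analyzes how input perturbations affect fluctuations in Chain-of-Thought (CoT) outputs. It derives an upper bound on the perturbation, proves that the bound is positively correlated with the number of reasoning steps and that even infinitely long reasoning cannot fully remove the effect of perturbations, and shows for the simplified Transformer model LSA that the bound is negatively correlated with the norms of the input embedding and hidden state vectors. Experiments support the theory.

Details Motivation: There is no theoretical explanation of how input perturbations affect CoT outputs, which limits understanding of how perturbations propagate during reasoning and holds back improvements to prompt optimization methods. Method: The authors derive an upper bound on input perturbations, analyze how it relates to the number of reasoning steps and to the norms of the input embedding and hidden state vectors, verify the analysis on the Linear Self-Attention (LSA) model, and run experiments on three mainstream datasets and four mainstream models. Result: The upper bound is positively correlated with the number of reasoning steps, and infinitely long reasoning cannot eliminate the perturbation; in the LSA model the bound is negatively correlated with the norms of the input embedding and hidden state vectors; the experimental results agree with the theory. Conclusion: The effect of input perturbations on CoT outputs has a theoretically explainable bound, with reasoning length and vector norms as the key factors, which provides a theoretical basis for later work on prompt optimization and robustness. Abstract: Existing research indicates that the output of Chain-of-Thought (CoT) is significantly affected by input perturbations. Although many methods aim to mitigate such impact by optimizing prompts, a theoretical explanation of how these perturbations influence CoT outputs remains an open area of research. This gap limits our in-depth understanding of how input perturbations propagate during the reasoning process and hinders further improvements in prompt optimization methods. Therefore, in this paper, we theoretically analyze the effect of input perturbations on the fluctuation of CoT outputs. We first derive an upper bound for input perturbations under the condition that the output fluctuation is within an acceptable range, based on which we prove that: (i) This upper bound is positively correlated with the number of reasoning steps in the CoT; (ii) Even an infinitely long reasoning process cannot eliminate the impact of input perturbations. We then apply these conclusions to the Linear Self-Attention (LSA) model, which can be viewed as a simplified version of the Transformer. For the LSA model, we prove that the upper bound for input perturbation is negatively correlated with the norms of the input embedding and hidden state vectors. To validate this theoretical analysis, we conduct experiments on three mainstream datasets and four mainstream models. The experimental results align with our theoretical analysis, empirically demonstrating the correctness of our findings.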

[62] DisCoCLIP: A Distributional Compositional Tensor Network Encoder for Vision-Language Understanding

Kin Ian Lo,Hala Hawashin,Mina Abbaszadeh,Tilen Limback-Stokin,Hadi Wazni,Mehrnoosh Sadrzadeh

Main category: cs.CL

TL;DR: DisCoCLIP is a multimodal model that pairs a frozen CLIP vision transformer with a novel tensor network text encoder; by explicitly encoding syntactic structure, it improves compositional reasoning in vision-language tasks.

Details Motivation: Current vision-language models excel at large-scale image-text alignment but often ignore the compositional structure of language, so they perform poorly on tasks that depend on word order and predicate-argument structure. Method: DisCoCLIP parses sentences with a Combinatory Categorial Grammar parser to produce distributional word tensors, reduces the parameter count of high-order tensors through tensor decompositions, and is trained end-to-end with a self-supervised contrastive loss. Result: DisCoCLIP is markedly more sensitive to verb semantics and word order: SVO-Probes verb accuracy rises from 77.6% to 82.4%, ARO attribution and relation scores improve by over 9% and over 4% respectively, and accuracy on the newly introduced SVO-Swap benchmark reaches 93.7%. Conclusion: Embedding explicit linguistic structure through tensor networks yields interpretable, parameter-efficient representations that substantially strengthen compositional reasoning in vision-language tasks. Abstract: Recent vision-language models excel at large-scale image-text alignment but often neglect the compositional structure of language, leading to failures on tasks that hinge on word order and predicate-argument structure. We introduce DisCoCLIP, a multimodal encoder that combines a frozen CLIP vision transformer with a novel tensor network text encoder that explicitly encodes syntactic structure. Sentences are parsed with a Combinatory Categorial Grammar parser to yield distributional word tensors whose contractions mirror the sentence's grammatical derivation. To keep the model efficient, high-order tensors are factorized with tensor decompositions, reducing parameter count from tens of millions to under one million. Trained end-to-end with a self-supervised contrastive loss, DisCoCLIP markedly improves sensitivity to verb semantics and word order: it raises CLIP's SVO-Probes verb accuracy from 77.6% to 82.4%, boosts ARO attribution and relation scores by over 9% and 4%, and achieves 93.7% on a newly introduced SVO-Swap benchmark. These results demonstrate that embedding explicit linguistic structure via tensor networks yields interpretable, parameter-efficient representations that substantially improve compositional reasoning in vision-language tasks.
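
The idea of word tensors whose contractions follow the grammar can be sketched for a single subject-verb-object sentence: the verb is an order-3 tensor that is contracted with the subject and object vectors. The abstract does not say which tensor decomposition DisCoCLIP uses, so the CP (rank-R) factorization below, the function names, and the tiny integer values are assumptions for illustration only.

```python
from typing import List

Vector = List[int]
Matrix = List[List[int]]


def contract_cp(subj: Vector, A: Matrix, B: Matrix, C: Matrix, obj: Vector) -> Vector:
    """Sentence vector from a CP-factorized verb tensor.

    The verb is V[i][j][k] = sum_r A[i][r] * B[j][r] * C[k][r] (assumed CP form),
    so the contraction sum_{i,j} subj[i] * V[i][j][k] * obj[j] can be computed
    without ever materializing V.
    """
    rank = len(A[0])
    s = [sum(subj[i] * A[i][r] for i in range(len(subj))) for r in range(rank)]
    o = [sum(obj[j] * B[j][r] for j in range(len(obj))) for r in range(rank)]
    return [sum(C[k][r] * s[r] * o[r] for r in range(rank)) for k in range(len(C))]


def contract_full(subj: Vector, A: Matrix, B: Matrix, C: Matrix, obj: Vector) -> Vector:
    """Same contraction, but via the explicit d x d x d verb tensor (for checking)."""
    d, rank = len(A), len(A[0])
    verb = [[[sum(A[i][r] * B[j][r] * C[k][r] for r in range(rank))
              for k in range(len(C))] for j in range(d)] for i in range(d)]
    return [sum(subj[i] * verb[i][j][k] * obj[j] for i in range(d) for j in range(d))
            for k in range(len(C))]


# Toy 2-d word vectors and a rank-1 verb (values made up).
A, B, C = [[1], [2]], [[1], [0]], [[3], [1]]
dog, cat = [1, 1], [2, 5]
dog_chases_cat = contract_cp(dog, A, B, C, cat)  # [18, 6]
cat_chases_dog = contract_cp(cat, A, B, C, dog)  # [36, 12]: word order matters
```

With dimension d and rank R, the full verb tensor has d^3 entries while the three factors have 3·d·R, which is the kind of saving a decomposition buys; the specific reduction reported in the abstract depends on the authors' actual architecture.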

[63] The role of synthetic data in Multilingual, Multi-cultural AI systems: Lessons from Indic Languages

Pranjal A. Chitale,Varun Gumma,Sanchit Ahuja,Prashant Kodali,Manan Uppadhyay,Deepthi Sudharsan,Sunayana Sitaram

Main category: cs.CL

TL;DR: Proposes a bottom-up generation strategy grounded in language-specific Wikipedia content and builds Updesh, a large-scale synthetic instruction-following dataset of 9.5M data points across 13 Indian languages. Experiments show the approach significantly improves multilingual models, especially for low- and medium-resource languages.

Details Motivation: AI systems in low-resource multilingual settings lack cultural grounding, and the effectiveness of synthetic data in multilingual, multicultural contexts is underexplored. Method: Open-source LLMs with at least 235B parameters are prompted to generate culturally contextualized synthetic data grounded in the Wikipedia content of each Indian language, in a bottom-up fashion, yielding the Updesh dataset; models are then fine-tuned on it and evaluated on downstream tasks across 15 multilingual datasets. Result: The generated data is of high quality, though human evaluation points to room for improvement; models improve significantly on generative tasks and remain competitive on multiple-choice NLU tasks, with the largest gains in low- and medium-resource languages. Conclusion: Effective multilingual AI requires multi-faceted data curation and generation strategies that are context-aware and culturally grounded; bottom-up, culturally contextualized synthetic data can narrow the performance gap between high- and low-resource languages. Abstract: Developing AI systems that operate effectively across languages while remaining culturally grounded is a long-standing challenge, particularly in low-resource settings. Synthetic data provides a promising avenue, yet its effectiveness in multilingual and multicultural contexts remains underexplored. We investigate the creation and impact of synthetic, culturally contextualized datasets for Indian languages through a bottom-up generation strategy that prompts large open-source LLMs (>= 235B parameters) to ground data generation in language-specific Wikipedia content. This approach complements the dominant top-down paradigm of translating synthetic datasets from high-resource languages such as English. We introduce Updesh, a high-quality large-scale synthetic instruction-following dataset comprising 9.5M data points across 13 Indian languages, encompassing diverse reasoning and generative tasks with an emphasis on long-context, multi-turn capabilities, and alignment with Indian cultural contexts. A comprehensive evaluation incorporating both automated metrics and human annotation across 10k assessments indicates that the generated data is of high quality; however, human evaluation highlights areas for further improvement. Additionally, we perform downstream evaluations by fine-tuning models on our dataset and assessing the performance across 15 diverse multilingual datasets. Models trained on Updesh consistently achieve significant gains on generative tasks and remain competitive on multiple-choice style NLU tasks.
Notably, relative improvements are most pronounced in low and medium-resource languages, narrowing their gap with high-resource languages. These findings provide empirical evidence that effective multilingual AI requires multi-faceted data curation and generation strategies that incorporate context-aware, culturally grounded methodologies.

[64] Sycophancy Is Not One Thing: Causal Separation of Sycophantic Behaviors in LLMs

Daniel Vennemeyer,Phan Anh Duong,Tiffany Zhan,Tianyu Jiang

Main category: cs.CL

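TL;DR: This study decomposes sycophancy in LLMs into sycophantic agreement and sycophantic praise and contrasts both with genuine agreement. The three behaviors occupy distinct, separable linear directions in latent space and can be steered independently across models, indicating that they are driven by different representational mechanisms.

Details Motivation: It is unclear whether sycophantic behavior in LLMs arises from a single mechanism or from several distinct processes, so the behavior needs to be decomposed and analyzed. Method: Difference-in-means directions, activation additions, and subspace geometry are used across multiple models and datasets to analyze how the representations of sycophantic agreement, sycophantic praise, and genuine agreement differ. Result: (1) The three behaviors are encoded along distinct linear directions in latent space; (2) each behavior can be amplified or suppressed independently without affecting the others; (3) their representational structure is consistent across model families and scales. Conclusion: Sycophantic behaviors correspond to distinct, independently steerable representations, pointing to several separate mechanisms rather than a single source. Abstract: Large language models (LLMs) often exhibit sycophantic behaviors -- such as excessive agreement with or flattery of the user -- but it is unclear whether these behaviors arise from a single mechanism or multiple distinct processes. We decompose sycophancy into sycophantic agreement and sycophantic praise, contrasting both with genuine agreement. Using difference-in-means directions, activation additions, and subspace geometry across multiple models and datasets, we show that: (1) the three behaviors are encoded along distinct linear directions in latent space; (2) each behavior can be independently amplified or suppressed without affecting the others; and (3) their representational structure is consistent across model families and scales. These results suggest that sycophantic behaviors correspond to distinct, independently steerable representations.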
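
Two of the tools named in the abstract, difference-in-means directions and activation addition, are simple enough to sketch on toy data. The version below is a generic illustration and not the authors' code: the 3-dimensional "activations", the helper names, and the steering coefficient `alpha` are all made up, and real use would read hidden states from a chosen layer of an actual model.

```python
from typing import List

Vector = List[float]


def mean_vector(rows: List[Vector]) -> Vector:
    """Column-wise mean of equal-length activation vectors."""
    return [sum(col) / len(rows) for col in zip(*rows)]


def diff_in_means(positive: List[Vector], negative: List[Vector]) -> Vector:
    """Direction from the mean of the 'negative' set to the mean of the 'positive' set."""
    return [p - q for p, q in zip(mean_vector(positive), mean_vector(negative))]


def add_activation(hidden: Vector, direction: Vector, alpha: float) -> Vector:
    """Steer a hidden state by adding alpha * direction (alpha < 0 suppresses)."""
    return [h + alpha * d for h, d in zip(hidden, direction)]


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity; near 0 means the two directions are close to orthogonal."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (sum(a * a for a in u) ** 0.5 * sum(b * b for b in v) ** 0.5)


# Toy activations (illustrative values only).
agree_acts = [[1.0, 0.0, 0.0], [3.0, 0.0, 0.0]]
praise_acts = [[0.0, 2.0, 0.0], [0.0, 4.0, 0.0]]
neutral_acts = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]

agree_dir = diff_in_means(agree_acts, neutral_acts)    # [2.0, 0.0, 0.0]
praise_dir = diff_in_means(praise_acts, neutral_acts)  # [0.0, 3.0, 0.0]
steered = add_activation([1.0, 1.0, 1.0], agree_dir, alpha=-0.5)
```

In this toy setup the two directions are orthogonal, so subtracting along `agree_dir` leaves the component along `praise_dir` unchanged, which mirrors the kind of independent steerability the abstract reports.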

[65] RLBFF: Binary Flexible Feedback to bridge between Human Feedback & Verifiable Rewards

Zhilin Wang,Jiaqi Zeng,Olivier Delalleau,Ellie Evans,Daniel Egert,Hoo-Chang Shin,Felipe Soares,Yi Dong,Oleksii Kuchaiev

Main category: cs.CL

TL;DR: Proposes RLBFF, a new reinforcement learning paradigm that combines the flexibility of human feedback with the precision of rule-based verification. Binary principles improve the interpretability and performance of reward models and enable efficient, customizable LLM alignment.

Details Motivation: RLHF suffers from poor interpretability and reward hacking because human judgments lack explicit criteria, while RLVR is confined to correctness checks and cannot capture finer aspects of response quality; a method with the strengths of both is needed. Method: Natural language feedback is converted into principles that can be judged in a binary fashion (e.g., accuracy of information, code readability), and reward model training is framed as an entailment task (does the response satisfy a given principle), so the principles of interest can be specified at inference time. Result: The resulting reward models score 86.2% on RM-Bench and 81.4% on JudgeBench (#1 on the leaderboard as of September 24, 2025) and outperform Bradley-Terry models; a fully open-source recipe aligns Qwen3-32B to match or exceed o3-mini and DeepSeek R1 at under 5% of the inference cost. Conclusion: RLBFF combines the strengths of human preferences and rule-based verification, improving the accuracy, interpretability, and flexibility of reward models and offering an efficient, customizable route to LLM alignment. Abstract: Reinforcement Learning with Human Feedback (RLHF) and Reinforcement Learning with Verifiable Rewards (RLVR) are the main RL paradigms used in LLM post-training, each offering distinct advantages. However, RLHF struggles with interpretability and reward hacking because it relies on human judgments that usually lack explicit criteria, whereas RLVR is limited in scope by its focus on correctness-based verifiers. We propose Reinforcement Learning with Binary Flexible Feedback (RLBFF), which combines the versatility of human-driven preferences with the precision of rule-based verification, enabling reward models to capture nuanced aspects of response quality beyond mere correctness. RLBFF extracts principles that can be answered in a binary fashion (e.g. accuracy of information: yes, or code readability: no) from natural language feedback. Such principles can then be used to ground Reward Model training as an entailment task (response satisfies or does not satisfy an arbitrary principle). We show that Reward Models trained in this manner can outperform Bradley-Terry models when matched for data and achieve top performance on RM-Bench (86.2%) and JudgeBench (81.4%, #1 on leaderboard as of September 24, 2025). Additionally, users can specify principles of interest at inference time to customize the focus of our reward models, in contrast to Bradley-Terry models.
Finally, we present a fully open source recipe (including data) to align Qwen3-32B using RLBFF and our Reward Model, to match or exceed the performance of o3-mini and DeepSeek R1 on general alignment benchmarks of MT-Bench, WildBench, and Arena Hard v2 (at <5% of the inference cost).
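
The contrast between a Bradley-Terry reward model and a principle-conditioned binary one can be illustrated with their per-example losses. The Bradley-Terry loss below is the standard pairwise form; the binary loss is plain binary cross-entropy on a yes/no principle label, which is an assumption about how an entailment-style objective might be written and not the authors' exact formulation. Function names and scores are hypothetical.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def bradley_terry_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Pairwise loss: -log P(chosen beats rejected), with P = sigmoid(r_c - r_r)."""
    return -math.log(sigmoid(reward_chosen - reward_rejected))


def binary_principle_loss(logit: float, satisfied: bool) -> float:
    """Binary cross-entropy for 'does the response satisfy this principle?'.

    `logit` is the model's score for one (response, principle) pair; the loss
    form is an illustrative assumption, not taken from the paper.
    """
    p_yes = sigmoid(logit)
    return -math.log(p_yes if satisfied else 1.0 - p_yes)


# Bradley-Terry needs a pair of responses and yields one undifferentiated preference.
pair_loss = bradley_terry_loss(reward_chosen=2.0, reward_rejected=0.0)

# The binary form scores one response against one named principle at a time.
accuracy_loss = binary_principle_loss(logit=2.0, satisfied=True)
readability_loss = binary_principle_loss(logit=2.0, satisfied=False)
```

Because the binary form takes the principle as an input, the same response can be scored against different principles at inference time, which is the customization the abstract highlights.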

[66] SciReasoner: Laying the Scientific Reasoning Ground Across Disciplines

Yizhou Wang,Chen Tang,Han Deng,Jiabei Xiao,Jiaqi Liu,Jianyu Wu,Jun Yao,Pengze Li,Encheng Su,Lintao Wang,Guohang Zhuang,Yuchen Ren,Ben Fei,Ming Hu,Xin Chen,Dongzhan Zhou,Junjun He,Xiangyu Yue,Zhenfei Yin,Jiamin Wu,Qihao Zheng,Yuhao Zhou,Huihui Xu,Chenglong Ma,Yan Lu,Wenlong Zhang,Chunfeng Song,Philip Torr,Shixiang Tang,Xinzhu Ma,Wanli Ouyang,Lei Bai

Main category: cs.CL

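TL;DR: Presents a scientific reasoning foundation model that aligns natural language with scientific representations through multi-stage training; it shows strong generalization and high fidelity on 103 cross-domain tasks, and the model and data are open-sourced.

Details Motivation: To improve cross-modal alignment, reasoning, and cross-discipline generalization in the scientific domain, a unified foundation model is needed that can understand and generate many forms of scientific representation. Method: The model is pretrained on a 206B-token corpus of scientific text, pure sequences, and sequence-text pairs, then aligned through supervised fine-tuning (SFT), annealed cold-start bootstrapping to elicit long-form chain-of-thought (CoT), and reinforcement learning with task-specific reward shaping, to strengthen scientific reasoning. Result: The model supports 103 tasks in five capability families: translation between text and scientific formats, knowledge extraction, property prediction and classification, and sequence generation and design; compared with specialist systems it offers broader instruction coverage, better cross-domain generalization, and higher output fidelity. Conclusion: Cross-discipline learning markedly strengthens transfer and downstream reliability; the approach lays a scalable, interpretable reasoning foundation for scientific AI, and all models and tools are open-sourced. Abstract: We present a scientific reasoning foundation model that aligns natural language with heterogeneous scientific representations. The model is pretrained on a 206B-token corpus spanning scientific text, pure sequences, and sequence-text pairs, then aligned via SFT on 40M instructions, annealed cold-start bootstrapping to elicit long-form chain-of-thought, and reinforcement learning with task-specific reward shaping, which instills deliberate scientific reasoning. It supports five capability families, covering up to 103 tasks across workflows: (i) faithful translation between text and scientific formats, (ii) text/knowledge extraction, (iii) property prediction, (iv) property classification, (v) unconditional and conditional sequence generation and design. Compared with specialist systems, our approach broadens instruction coverage, improves cross-domain generalization, and enhances fidelity. We detail data curation and training and show that cross-discipline learning strengthens transfer and downstream reliability. The model, instruct tuning datasets and the evaluation code are open-sourced at https://huggingface.co/SciReason and https://github.com/open-sciencelab/SciReason.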

cs.CV [Back]

[67] Leveraging NTPs for Efficient Hallucination Detection in VLMs

Ofir Azachi,Kfir Eliyahu,Eyal El Ani,Rom Himelstein,Roi Reichart,Yuval Pinter,Nitay Calderon

Main category: cs.CV

TL;DR: Proposes a lightweight hallucination detection method based on the next-token probabilities (NTPs) of a vision-language model (VLM). NTPs serve as a signal of model uncertainty and feed traditional ML models for efficient, on-the-fly detection; adding linguistic NTPs and VLM prediction scores improves performance further.

Details Motivation: VLMs can generate text that is inconsistent with the visual content (hallucinations), which undermines their reliability; existing approaches that use a VLM as the detector are computationally expensive and add latency, so a more efficient detector is needed. Method: The authors build a dataset of 1,400 human-annotated statements, extract uncertainty features from the NTPs of the VLM's generations, and train traditional ML models to detect hallucinations; they then add linguistic NTPs computed from the generated text alone and integrate the VLM's own hallucination prediction scores. Result: NTP-based features are effective predictors of hallucinations, and lightweight models reach performance comparable to strong VLMs; adding linguistic NTPs and VLM prediction scores improves results beyond using either VLMs or NTPs alone. Conclusion: The NTP-based lightweight approach detects VLM hallucinations efficiently and offers a simple, low-latency way to improve VLM reliability. Abstract: Hallucinations of vision-language models (VLMs), which are misalignments between visual content and generated text, undermine the reliability of VLMs. One common approach for detecting them employs the same VLM, or a different one, to assess generated outputs. This process is computationally intensive and increases model latency. In this paper, we explore an efficient on-the-fly method for hallucination detection by training traditional ML models over signals based on the VLM's next-token probabilities (NTPs). NTPs provide a direct quantification of model uncertainty. We hypothesize that high uncertainty (i.e., a low NTP value) is strongly associated with hallucinations. To test this, we introduce a dataset of 1,400 human-annotated statements derived from VLM-generated content, each labeled as hallucinated or not, and use it to test our NTP-based lightweight method. Our results demonstrate that NTP-based features are valuable predictors of hallucinations, enabling fast and simple ML models to achieve performance comparable to that of strong VLMs. Furthermore, augmenting these NTPs with linguistic NTPs, computed by feeding only the generated text back into the VLM, enhances hallucination detection performance. Finally, integrating hallucination prediction scores from VLMs into the NTP-based models led to better performance than using either VLMs or NTPs alone. We hope this study paves the way for simple, lightweight solutions that enhance the reliability of VLMs.
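
A minimal sketch of the kind of signal involved: given the probabilities a VLM assigned to each token of a generated statement, summarise them into a few uncertainty features. The specific features, the function names, and the threshold rule are assumptions for illustration; the abstract says the authors train traditional ML models on NTP-based signals but does not list the features, and the threshold here merely stands in for such a trained classifier.

```python
import math
from typing import Dict, List


def ntp_features(token_probs: List[float]) -> Dict[str, float]:
    """Summarise the next-token probabilities of one generated statement."""
    if not token_probs:
        raise ValueError("need at least one token probability")
    log_probs = [math.log(p) for p in token_probs]
    return {
        "min_prob": min(token_probs),                      # least certain token
        "mean_prob": sum(token_probs) / len(token_probs),
        "mean_log_prob": sum(log_probs) / len(log_probs),  # length-normalised log-likelihood
    }


def flag_hallucination(token_probs: List[float], min_prob_threshold: float = 0.2) -> bool:
    """Toy stand-in for a trained classifier: flag if any token was very unlikely."""
    return ntp_features(token_probs)["min_prob"] < min_prob_threshold


# Hypothetical per-token probabilities for two statements.
confident = [0.9, 0.8, 0.95]
uncertain = [0.9, 0.1, 0.8]
flags = (flag_hallucination(confident), flag_hallucination(uncertain))
```

In a setup closer to the abstract, the feature dictionaries for the 1,400 labeled statements would be the inputs to a conventional classifier, and the same features computed from text-only inputs would give the linguistic NTPs.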

[68] Quasi-Synthetic Riemannian Data Generation for Writer-Independent Offline Signature Verification

Elias N. Zois,Moises Diaz,Salem Said,Miguel A. Ferrer

Main category: cs.CV

TL;DR: Proposes a quasi-synthetic data generation framework based on Riemannian geometry for offline handwritten signature verification. Synthetic data are generated in the space of symmetric positive definite matrices, and the approach achieves low error rates across datasets and writing styles.

Details Motivation: Offline handwritten signature verification is challenging in the writer-independent setting, and existing methods depend on real signature datasets for training, which limits generalization. Method: Exploiting the Riemannian geometry of symmetric positive definite (SPD) matrices, a small set of genuine samples seeds a Riemannian Gaussian mixture model whose centers act as synthetic writers; Riemannian Gaussian sampling then generates positive and negative synthetic SPD data, which are used with a metric learning framework for training and testing. Result: Experiments on two real datasets covering Western and Asian writing styles show low error rates under both cross-dataset and intra-dataset evaluation protocols. Conclusion: Generating synthetic data in Riemannian spaces is feasible and effective, and offers a new direction for writer-independent signature verification systems. Abstract: Offline handwritten signature verification remains a challenging task, particularly in writer-independent settings where models must generalize across unseen individuals. Recent developments have highlighted the advantage of geometrically inspired representations, such as covariance descriptors on Riemannian manifolds. However, past or present, handcrafted or data-driven methods usually depend on real-world signature datasets for classifier training. We introduce a quasi-synthetic data generation framework leveraging the Riemannian geometry of Symmetric Positive Definite (SPD) matrices. A small set of genuine samples in the SPD space is the seed to a Riemannian Gaussian Mixture which identifies Riemannian centers as synthetic writers and variances as their properties. Riemannian Gaussian sampling on each center generates positive as well as negative synthetic SPD populations. A metric learning framework is trained on pairs of similar and dissimilar SPD points and subsequently tested on real-world datasets. Experiments conducted on two popular signature datasets, encompassing Western and Asian writing styles, demonstrate the efficacy of the proposed approach under both intra- and cross-dataset evaluation protocols. The results indicate that our quasi-synthetic approach achieves low error rates, highlighting the potential of generating synthetic data in Riemannian spaces for writer-independent signature verification systems.

[69] Seedream 4.0: Toward Next-generation Multimodal Image Generation

Team Seedream,Yunpeng Chen,Yu Gao,Lixue Gong,Meng Guo,Qiushan Guo,Zhiyao Guo,Xiaoxia Hou,Weilin Huang,Yixuan Huang,Xiaowen Jian,Huafeng Kuang,Zhichao Lai,Fanshi Li,Liang Li,Xiaochen Lian,Chao Liao,Liyang Liu,Wei Liu,Yanzuo Lu,Zhengxiong Luo,Tongtong Ou,Guang Shi,Yichun Shi,Shiqi Sun,Yu Tian,Zhi Tian,Peng Wang,Rui Wang,Xun Wang,Ye Wang,Guofeng Wu,Jie Wu,Wenxu Wu,Yonghui Wu,Xin Xia,Xuefeng Xiao,Shuang Xu,Xin Yan,Ceyuan Yang,Jianchao Yang,Zhonghua Zhai,Chenlin Zhang,Heng Zhang,Qi Zhang,Xinyu Zhang,Yuwei Zhang,Shijia Zhao,Wenliang Zhao,Wenjia Zhu

Main category: cs.CV

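TL;DR: Seedream 4.0 is an efficient, high-performance multimodal image generation system that unifies text-to-image generation, image editing, and multi-image composition, supports fast generation of high-resolution images, and achieves state-of-the-art performance across multiple tasks.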

Details Motivation: To improve the efficiency and performance of multimodal image generation systems, unify text-to-image generation, image editing, and multi-image composition in a single framework, and make generative models more interactive and practical in complex creative and professional scenarios. Method: Uses an efficient diffusion transformer and a powerful VAE to reduce the number of image tokens, combined with pretraining on large-scale text-image pairs, multimodal post-training, adversarial distillation, distribution matching, quantization, and speculative decoding. Result: Seedream 4.0 generates a 2K image within 1.8 seconds, supports multi-image reference and multi-output generation, achieves state-of-the-art results on text-to-image generation and multimodal image editing, and offers precise editing and in-context reasoning. Conclusion: Seedream 4.0 extends traditional T2I systems into a more intelligent and interactive multidimensional creative tool, pushing the boundary of generative AI in creative and professional applications. Abstract: We introduce Seedream 4.0, an efficient and high-performance multimodal image generation system that unifies text-to-image (T2I) synthesis, image editing, and multi-image composition within a single framework. We develop a highly efficient diffusion transformer with a powerful VAE which can also reduce the number of image tokens considerably. This allows for efficient training of our model, and enables it to quickly generate native high-resolution images (e.g., 1K-4K). Seedream 4.0 is pretrained on billions of text-image pairs spanning diverse taxonomies and knowledge-centric concepts. Comprehensive data collection across hundreds of vertical scenarios, coupled with optimized strategies, ensures stable and large-scale training, with strong generalization. By incorporating a carefully fine-tuned VLM model, we perform multi-modal post-training for training both T2I and image editing tasks jointly. For inference acceleration, we integrate adversarial distillation, distribution matching, and quantization, as well as speculative decoding. It achieves an inference time of up to 1.8 seconds for generating a 2K image (without an LLM/VLM as PE model). Comprehensive evaluations reveal that Seedream 4.0 can achieve state-of-the-art results on both T2I and multimodal image editing. In particular, it demonstrates exceptional multimodal capabilities in complex tasks, including precise image editing and in-context reasoning; it also allows for multi-image reference and can generate multiple output images.
This extends traditional T2I systems into a more interactive and multidimensional creative tool, pushing the boundary of generative AI for both creativity and professional applications. Seedream 4.0 is now accessible on https://www.volcengine.com/experience/ark?launch=seedream.

[70] A Contrastive Learning Framework for Breast Cancer Detection

Samia Saeed,Khuram Naveed

Main category: cs.CV

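TL;DR: This study proposes a contrastive-learning-based semi-supervised framework that trains ResNet-50 on a small amount of labeled data and a large amount of unlabeled mammograms, uses data augmentation to improve performance, and achieves 96.7% accuracy on the INbreast and MIAS datasets, outperforming existing methods.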

Details Motivation: Deep learning methods for breast cancer detection are limited by the scarcity of large labeled datasets, which hurts their accuracy, so a method that performs well with little labeled data is needed. Method: Uses a ResNet-50 network within a contrastive learning (CL) framework, pretraining in a semi-supervised manner on a large amount of unlabeled mammograms using a similarity index, with various data augmentation and transformation strategies; the model is then fine-tuned on a small amount of labeled data. Result: The method achieves 96.7% classification accuracy on the INbreast and MIAS benchmark datasets, exceeding existing state-of-the-art methods. Conclusion: The proposed contrastive-learning-based semi-supervised method effectively mitigates the shortage of labeled data, performs strongly in early breast cancer detection, and has potential for clinical application. Abstract: Breast cancer, the second leading cause of cancer-related deaths globally, accounts for a quarter of all cancer cases [1]. To lower this death rate, it is crucial to detect tumors early, as early-stage detection significantly improves treatment outcomes. Advances in non-invasive imaging techniques have made early detection possible through computer-aided detection (CAD) systems which rely on traditional image analysis to identify malignancies. However, there is a growing shift towards deep learning methods due to their superior effectiveness. Despite their potential, deep learning methods often struggle with accuracy due to the limited availability of large labeled datasets for training. To address this issue, our study introduces a Contrastive Learning (CL) framework, which excels with smaller labeled datasets. In this regard, we train ResNet-50 in a semi-supervised CL approach using a similarity index on a large amount of unlabeled mammogram data. We also use various augmentations and transformations, which help improve the performance of our approach. Finally, we fine-tune our model on a small set of labeled data, and the resulting model outperforms the existing state of the art. Specifically, we observed a 96.7% accuracy in detecting breast cancer on the benchmark datasets INbreast and MIAS.

[71] Are Foundation Models Ready for Industrial Defect Recognition? A Reality Check on Real-World Data

Simon Baeuerle,Pratik Khanna,Nils Friederich,Angelo Jovin Yamachui Sitcheu,Damir Shakirov,Andreas Steimer,Ralf Mikut

Main category: cs.CV

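TL;DR: This paper examines the potential of foundation models (FMs) for quality inspection in industrial manufacturing and finds that, although these models perform well on public benchmark datasets, all of them fail on real-world industrial image data.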

Details Motivation: To explore whether foundation models, in a zero-shot setting that requires no labeled data, are suitable for automated quality inspection across many products and image types, reducing the dependence of traditional supervised models on large amounts of labeled data. Method: Tests multiple recent foundation models on custom real-world industrial image data and on public image data, comparing their performance across datasets. Result: All tested foundation models perform poorly on the real-world industrial data and fail to detect anomalies effectively, while performing well on public benchmark datasets. Conclusion: Current foundation models show promise but cannot yet be applied directly to real industrial quality inspection; further improvement is needed to handle the complexity of real industrial environments. Abstract: Foundation Models (FMs) have shown impressive performance on various text and image processing tasks. They can generalize across domains and datasets in a zero-shot setting. This could make them suitable for automated quality inspection during series manufacturing, where various types of images are being evaluated for many different products. Replacing tedious labeling tasks with a simple text prompt to describe anomalies and utilizing the same models across many products would save significant effort during model setup and implementation. This is a strong advantage over supervised Artificial Intelligence (AI) models, which are trained for individual applications and require labeled training data. We test multiple recent FMs on both custom real-world industrial image data and public image data. We show that all of those models fail on our real-world data, while the very same models perform well on public benchmark datasets.

[72] Shared Neural Space: Unified Precomputed Feature Encoding for Multi-Task and Cross Domain Vision

Jing Li,Oskar Bartosz,Chengyu Wang,Michal Wnuczynski,Dilshan Godaliyadda,Michael Polley

Main category: cs.CV

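TL;DR: Proposes a universal Neural Space (NS) framework that enables efficient processing of multiple vision and imaging tasks through a shared feature space.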

Details Motivation: To address the inefficiency that arises when existing AI models for a series of modular tasks each map into a different latent domain. Method: Uses a lightweight CNN-based encoder-decoder framework that precomputes features across tasks; the encoder learns transformation-aware, generalizable representations. Result: Multiple downstream AI modules (such as demosaicing, denoising, depth estimation, and semantic segmentation) run efficiently in a unified feature space, reducing redundancy and improving generalization across domains. Conclusion: The universal Neural Space provides a foundation for efficient multi-task vision systems and is compatible with a wide range of hardware platforms. Abstract: The majority of AI models in imaging and vision are customized to perform on a specific high-precision task. However, this strategy is inefficient for applications with a series of modular tasks, since each requires a mapping into a disparate latent domain. To address this inefficiency, we propose a universal Neural Space (NS), where an encoder-decoder framework pre-computes features across vision and imaging tasks. Our encoder learns transformation-aware, generalizable representations, which enable multiple downstream AI modules to share the same feature space. This architecture reduces redundancy, improves generalization across domain shift, and establishes a foundation for efficient multi-task vision pipelines. Furthermore, as opposed to larger transformer backbones, our backbone is lightweight and CNN-based, allowing for wider deployment across hardware. We further demonstrate that imaging and vision modules, such as demosaicing, denoising, depth estimation and semantic segmentation, can be performed efficiently in the NS.

[73] Data-Efficient Stream-Based Active Distillation for Scalable Edge Model Deployment

Dani Manjah,Tim Bary,Benoît Gérin,Benoît Macq,Christophe de Vleeschouwer

Main category: cs.CV

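TL;DR: Proposes an image selection scheme that combines a high-confidence stream-based strategy with a diversity-based approach to maximize edge-device model quality at low transmission cost.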

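Details Motivation: Edge camera systems face dynamic environments and need frequent model updates, but edge devices have limited computational power, so training data must be selected efficiently to balance model quality against communication overhead. Method: Combines a stream-based high-confidence strategy with diversity-based sampling to select the most useful images for training a lightweight edge model, with annotations produced by a complex teacher model on a central server. Result: For a similar number of training iterations, the method trains a high-quality model with very few dataset queries, substantially reducing transmission cost. Conclusion: Combining high confidence with diversity in image selection effectively improves edge-device model performance while reducing data transmission overhead, making it suitable for resource-constrained edge computing. Abstract: Edge camera-based systems are continuously expanding, facing ever-evolving environments that require regular model updates. In practice, complex teacher models are run on a central server to annotate data, which is then used to train smaller models tailored to the edge devices with limited computational power. This work explores how to select the most useful images for training to maximize model quality while keeping transmission costs low. Our work shows that, for a similar training load (i.e., iterations), a high-confidence stream-based strategy coupled with a diversity-based approach produces a high-quality model with minimal dataset queries.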
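
A minimal sketch of how a high-confidence stream filter might be combined with a diversity check is given below. The abstract does not specify the selection rule, so everything here is an assumption made for illustration: the function name `select_frames`, the thresholds `conf_threshold` and `min_distance`, the query `budget`, and the idea that each frame arrives with a confidence score and a feature vector.

```python
import math


def select_frames(stream, conf_threshold=0.8, min_distance=0.5, budget=100):
    """Pick frames to send to the server for teacher annotation.

    `stream` yields (frame_id, confidence, feature) tuples, where `feature`
    is a sequence of floats. All names and thresholds are hypothetical.
    """
    selected = []  # list of (frame_id, feature) already chosen
    for frame_id, confidence, feature in stream:
        if len(selected) >= budget:
            break  # stop querying once the transmission budget is used up
        if confidence < conf_threshold:
            continue  # high-confidence filter: skip low-confidence frames
        # Diversity filter: keep the frame only if it is far enough from
        # every frame that has already been selected.
        if all(math.dist(feature, kept) >= min_distance for _, kept in selected):
            selected.append((frame_id, feature))
    return [frame_id for frame_id, _ in selected]
```

For example, with the stream `[("a", 0.9, (0, 0)), ("b", 0.95, (0.1, 0)), ("c", 0.5, (5, 5)), ("d", 0.85, (1, 1))]`, frame `b` is dropped as a near-duplicate of `a` and frame `c` is dropped for low confidence. The greedy distance check is only one of several possible diversity criteria.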

[74] InstructVTON: Optimal Auto-Masking and Natural-Language-Guided Interactive Style Control for Inpainting-Based Virtual Try-On

Julien Han,Shuwen Qiu,Qi Li,Xingzi Xu,Mehmet Saygin Seyfioglu,Kavosh Asadi,Karim Bouyarmane

Main category: cs.CV

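TL;DR: InstructVTON is an interactive virtual try-on system driven by natural-language instructions; it automatically generates binary masks using vision language models and image segmentation, enabling fine-grained and complex styling control over single or multiple garments.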

Details Motivation: Traditional mask-based virtual try-on methods struggle with complex styling scenarios (such as rolled-up sleeves) and require users to draw precise masks by hand, which is difficult and model dependent. Method: Formulates virtual try-on as an image-guided inpainting task and uses vision language models (VLMs) and image segmentation models to automatically generate binary masks from user-provided images and free-text instructions, with support for multiple rounds of generation to achieve complex styling. Result: InstructVTON achieves fine-grained styling control without manually designed masks, is compatible with existing virtual try-on models, and achieves leading results in complex styling scenarios. Conclusion: By combining VLMs with image segmentation, InstructVTON simplifies the user experience, improves the controllability and applicability of virtual try-on systems, and advances instruction-driven interactive virtual try-on. Abstract: We present InstructVTON, an instruction-following interactive virtual try-on system that allows fine-grained and complex styling control of the resulting generation, guided by natural language, on single or multiple garments. A computationally efficient and scalable formulation of virtual try-on formulates the problem as an image-guided or image-conditioned inpainting task. These inpainting-based virtual try-on models commonly use a binary mask to control the generation layout. Producing a mask that yields a desirable result is difficult, requires background knowledge, might be model dependent, and is in some cases impossible with the masking-based approach (e.g. trying on a long-sleeve shirt with "sleeves rolled up" styling on a person wearing a long-sleeve shirt with sleeves down, where the mask will necessarily cover the entire sleeve). InstructVTON leverages Vision Language Models (VLMs) and image segmentation models for automated binary mask generation. These masks are generated based on user-provided images and free-text style instructions. InstructVTON simplifies the end-user experience by removing the necessity of a precisely drawn mask, and by automating execution of multiple rounds of image generation for try-on scenarios that cannot be achieved with masking-based virtual try-on models alone. We show that InstructVTON is interoperable with existing virtual try-on models to achieve state-of-the-art results with styling control.

[75] Innovative Deep Learning Architecture for Enhanced Altered Fingerprint Recognition

Dana A Abdullah,Dana Rasul Hamad,Bishar Rasheed Ibrahim,Sirwan Abdulwahid Aula,Aso Khaleel Ameen,Sabat Salih Hamadamin

Main category: cs.CV

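TL;DR: This paper presents DeepAFRNet, a deep learning model for recognizing altered fingerprints that uses a VGG16 backbone and cosine similarity for feature matching; it achieves high recognition accuracy on the real altered fingerprint dataset SOCOFing and highlights the importance of threshold selection for system performance.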

Details Motivation: Because adversaries may deliberately alter fingerprint ridge patterns to evade detection, traditional fingerprint recognition systems are challenged, so a method that robustly recognizes altered fingerprints is needed to improve the security of biometric verification systems. Method: Proposes DeepAFRNet, which uses VGG16 as the feature-extraction backbone and matches fingerprints by computing the cosine similarity between embeddings. Evaluation uses the Real-Altered subset of the SOCOFing dataset, split into three difficulty levels (Easy, Medium, Hard). Result: With strict thresholds, DeepAFRNet achieves accuracies of 96.7%, 98.76%, and 99.54% on the three difficulty levels; when the threshold is relaxed from 0.92 to 0.72, accuracy drops sharply to 7.86%, 27.05%, and 29.51%, showing threshold sensitivity. Conclusion: DeepAFRNet performs well on real altered fingerprint recognition, overcomes the limitations of prior work that relied on synthetic data or limited verification protocols, and shows potential for practical deployment in high-security settings. Abstract: Altered fingerprint recognition (AFR) is challenging for biometric verification in applications such as border control, forensics, and fiscal admission. Adversaries can deliberately modify ridge patterns to evade detection, so robust recognition of altered prints is essential. We present DeepAFRNet, a deep learning recognition model that matches and recognizes distorted fingerprint samples. The approach uses a VGG16 backbone to extract high-dimensional features and cosine similarity to compare embeddings. We evaluate on the SOCOFing Real-Altered subset with three difficulty levels (Easy, Medium, Hard). With strict thresholds, DeepAFRNet achieves accuracies of 96.7 percent, 98.76 percent, and 99.54 percent for the three levels. A threshold-sensitivity study shows that relaxing the threshold from 0.92 to 0.72 sharply degrades accuracy to 7.86 percent, 27.05 percent, and 29.51 percent, underscoring the importance of threshold selection in biometric systems. By using real altered samples and reporting per-level metrics, DeepAFRNet addresses limitations of prior work based on synthetic alterations or limited verification protocols, and indicates readiness for real-world deployments where both security and recognition resilience are critical.
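
The matching step described here, cosine similarity between embeddings followed by a threshold, can be illustrated with a short sketch. This is not the authors' implementation: the helper names `cosine_similarity` and `is_match` are hypothetical, the toy two-dimensional vectors stand in for VGG16 embeddings, and only the threshold values 0.92 and 0.72 come from the abstract.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen here for degenerate all-zero inputs
    return dot / (norm_a * norm_b)


def is_match(a, b, threshold=0.92):
    """Declare a match when the similarity reaches the threshold."""
    return cosine_similarity(a, b) >= threshold


# Two toy embeddings with similarity of about 0.8.
probe, candidate = [1.0, 0.0], [0.8, 0.6]
strict = is_match(probe, candidate, threshold=0.92)   # rejected
relaxed = is_match(probe, candidate, threshold=0.72)  # accepted
```

The toy pair shows one plausible reading of the reported sensitivity: a looser threshold accepts pairs that are only moderately similar, which would increase false matches.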

[76] Large Pre-Trained Models for Bimanual Manipulation in 3D

Hanna Yurchyk,Wei-Di Chang,Gregory Dudek,David Meger

Main category: cs.CV

TL;DR: Integrates attention maps from a pre-trained Vision Transformer into voxel representations to improve bimanual robotic manipulation performance.

Details Motivation: To strengthen semantic understanding in bimanual robotic manipulation by using the attention mechanism of a Vision Transformer to provide more effective perceptual input. Method: Extracts attention maps from DINOv2, a self-supervised ViT model, as pixel-level saliency scores over RGB images and lifts them into a 3D voxel grid, forming voxel-level semantic cues that are incorporated into a behavior cloning policy. Result: On the RLBench bimanual benchmark, the method yields an average absolute improvement of 8.2% and a relative gain of 21.9%. Conclusion: Attention-guided voxel featurization significantly improves the performance of an existing voxel-based policy on bimanual manipulation tasks. Abstract: We investigate the integration of attention maps from a pre-trained Vision Transformer into voxel representations to enhance bimanual robotic manipulation. Specifically, we extract attention maps from DINOv2, a self-supervised ViT model, and interpret them as pixel-level saliency scores over RGB images. These maps are lifted into a 3D voxel grid, resulting in voxel-level semantic cues that are incorporated into a behavior cloning policy. When integrated into a state-of-the-art voxel-based policy, our attention-guided featurization yields an average absolute improvement of 8.2% and a relative gain of 21.9% across all tasks in the RLBench bimanual benchmark.
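
The two reported figures can be related by simple arithmetic. The abstract does not state the baseline score, so the derivation below is only a consistency check under one assumption: that both numbers are computed from the same task-averaged success rates. Here $a$ is the absolute improvement in percentage points, $r$ the relative gain, and $b$ the implied baseline; all three symbols are introduced for this illustration.

```latex
\[
r = \frac{a}{b}
\quad\Longrightarrow\quad
b = \frac{a}{r} = \frac{8.2}{0.219} \approx 37.4,
\qquad
b + a \approx 37.4 + 8.2 = 45.6 .
\]
```

Under that assumption the baseline would average roughly 37.4% and the attention-guided policy roughly 45.6%. If the relative gain was instead averaged per task, these implied values need not hold.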

[77] A Comparative Benchmark of Real-time Detectors for Blueberry Detection towards Precision Orchard Management

Xinyang Mu,Yuzhen Lu,Boyang Deng

Main category: cs.CV

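TL;DR: This study presents a new benchmark analysis of real-time object detection models for blueberry detection, using a new dataset of 85,879 annotated instances to compare 36 model variants from the YOLO and RT-DETR families; semi-supervised learning further improves performance, with RT-DETRv2-X reaching 94.8% mAP@50 after fine-tuning, and the dataset and software are publicly released to support follow-up research.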

Details Motivation: Blueberries are hard to detect accurately in natural environments because of variable lighting, occlusions, and motion blur, and existing deep learning models need large, diverse datasets as well as a suitable accuracy/speed/memory trade-off. Method: Builds a new blueberry detection dataset of 661 images with 85,879 annotated instances and systematically evaluates 36 models from the YOLO (v8-v12) and RT-DETR (v1-v2) families; applies Unbiased Mean Teacher-based semi-supervised learning to fine-tune on 1,035 unlabeled images to improve performance. Result: YOLOv12m reaches a mAP@50 of 93.3%; RT-DETRv2-X performs best, with an initial mAP@50 of 93.6% that rises to 94.8% after semi-supervised fine-tuning; mid-sized models offer a good balance between accuracy and inference speed; inference time varies with model complexity. Conclusion: RT-DETRv2-X performs best on blueberry detection, and semi-supervised learning can further improve performance, although effective use of cross-domain unlabeled data needs further study; the dataset and code are released to support follow-up research. Abstract: Blueberry detection in natural environments remains challenging because of variable lighting, occlusions, and motion blur caused by environmental factors and imaging devices. Deep learning-based object detectors promise to address these challenges, but they demand a large-scale, diverse dataset that captures the real-world complexities. Moreover, deploying these models in practical scenarios often requires the right accuracy/speed/memory trade-off in model selection. This study presents a novel comparative benchmark analysis of advanced real-time object detectors, including YOLO (You Only Look Once) (v8-v12) and RT-DETR (Real-Time Detection Transformers) (v1-v2) families, consisting of 36 model variants, evaluated on a newly curated dataset for blueberry detection. This dataset comprises 661 canopy images collected with smartphones during the 2022-2023 seasons, consisting of 85,879 labelled instances (including 36,256 ripe and 49,623 unripe blueberries) across a wide range of lighting conditions, occlusions, and fruit maturity stages. Among the YOLO models, YOLOv12m achieved the best accuracy with a mAP@50 of 93.3%, while RT-DETRv2-X obtained a mAP@50 of 93.6%, the highest among all the RT-DETR variants. The inference time varied with the model scale and complexity, and the mid-sized models appeared to offer a good accuracy-speed balance.
To further enhance detection performance, all the models were fine-tuned using Unbiased Mean Teacher-based semi-supervised learning (SSL) on a separate set of 1,035 unlabeled images acquired by a ground-based machine vision platform in 2024. This resulted in accuracy changes ranging from -1.4% to 2.9%, with RT-DETRv2-X achieving the best mAP@50 of 94.8%. More in-depth research into SSL is needed to better leverage cross-domain unlabeled data. Both the dataset and software programs of this study are made publicly available to support further research.

[78] Region-of-Interest Augmentation for Mammography Classification under Patient-Level Cross-Validation

Farbod Bigdeli,Mohsen Mohammadagha,Ali Bigdeli

Main category: cs.CV

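TL;DR: Proposes a lightweight ROI augmentation strategy that uses random ROI crops from a label-free bounding-box bank during training to improve breast cancer screening performance on the Mini-DDSM dataset, without additional annotations or model modifications.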

Details Motivation: Deep learning for mammogram interpretation is constrained by limited-resolution datasets and small sample sizes, which restrict model performance. Method: Introduces a lightweight, training-only ROI augmentation: ROI regions are randomly sampled from a precomputed, label-free bounding-box bank and replace the full image with a given probability, with optional jitter to increase variability. Evaluation uses strict patient-level cross-validation. Result: On Mini-DDSM, the method yields slight ROC-AUC gains (best parameters: p_roi = 0.10, alpha = 0.10), while PR-AUC is flat or slightly lower; performance varies across folds; training-efficiency metrics show manageable throughput and GPU memory use. Inference cost is unchanged. Conclusion: A simple, data-centric ROI augmentation strategy can improve mammography classification in resource-constrained settings without adding inference overhead or requiring additional labels. Abstract: Breast cancer screening with mammography remains central to early detection and mortality reduction. Deep learning has shown strong potential for automating mammogram interpretation, yet limited-resolution datasets and small sample sizes continue to restrict performance. We revisit the Mini-DDSM dataset (9,684 images; 2,414 patients) and introduce a lightweight region-of-interest (ROI) augmentation strategy. During training, full images are probabilistically replaced with random ROI crops sampled from a precomputed, label-free bounding-box bank, with optional jitter to increase variability. We evaluate under strict patient-level cross-validation and report ROC-AUC, PR-AUC, and training-time efficiency metrics (throughput and GPU memory). Because ROI augmentation is training-only, inference-time cost remains unchanged. On Mini-DDSM, ROI augmentation (best: p_roi = 0.10, alpha = 0.10) yields modest average ROC-AUC gains, with performance varying across folds; PR-AUC is flat to slightly lower. These results demonstrate that simple, data-centric ROI strategies can enhance mammography classification in constrained settings without requiring additional labels or architectural modifications.
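
The probabilistic replacement step might look roughly like the sketch below, which returns the crop rectangle to use for one training sample. Several things are assumptions for illustration only: the function name `sample_training_view`, the `(x0, y0, x1, y1)` box format, and in particular the reading of `alpha` as a jitter fraction of the box size, since the abstract reports `alpha = 0.10` without defining it.

```python
import random


def sample_training_view(image_size, box_bank, p_roi=0.10, alpha=0.10, rng=random):
    """Return the crop (x0, y0, x1, y1) to use for one training sample.

    With probability `p_roi` a box from the precomputed bank replaces the
    full image; each edge is jittered by up to `alpha` times the box size.
    """
    width, height = image_size
    if not box_bank or rng.random() >= p_roi:
        return (0, 0, width, height)  # keep the full image

    x0, y0, x1, y1 = rng.choice(box_bank)
    dx, dy = alpha * (x1 - x0), alpha * (y1 - y0)
    x0 += rng.uniform(-dx, dx)
    x1 += rng.uniform(-dx, dx)
    y0 += rng.uniform(-dy, dy)
    y1 += rng.uniform(-dy, dy)

    # Clamp to the image and guarantee a crop of at least one pixel.
    x0 = int(max(0, min(x0, width - 1)))
    y0 = int(max(0, min(y0, height - 1)))
    x1 = int(min(width, max(x1, x0 + 1)))
    y1 = int(min(height, max(y1, y0 + 1)))
    return (x0, y0, x1, y1)
```

Because the function is only called inside the training data loader, the inference path is untouched, which matches the abstract's point that inference-time cost remains unchanged.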

[79] Reflect3r: Single-View 3D Stereo Reconstruction Aided by Mirror Reflections

Jing Wu,Zirui Wang,Iro Laina,Victor Adrian Prisacariu

Main category: cs.CV

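TL;DR: This paper proposes a method that uses mirror reflections to construct a virtual stereo view from a single image: a physically valid virtual camera transformation generates the virtual view in the pixel domain, and a symmetric-aware loss refines pose estimation, enabling generalizable and robust 3D reconstruction from a single image.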

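Details Motivation: Mirror reflections are common in everyday environments but are often treated as visual distractions. A mirror, however, shows the real and virtual viewpoints at once and thus carries stereo information. This paper uses that property, treating the reflection as an auxiliary view to extract 3D geometry from a single image and overcome the need of traditional multi-view stereo for multiple images. Method: Treats the mirror reflection as an auxiliary view and designs a learnable transformation to construct a physically valid virtual camera, generating the virtual-view image directly in the pixel domain in a way consistent with the real imaging process; introduces a symmetric-aware loss that exploits mirror symmetry to refine camera pose estimation; the framework extends naturally to dynamic scenes containing mirror reflections for per-frame geometry recovery; and a customizable synthetic dataset of 16 Blender scenes is built for quantitative evaluation. Result: Extensive experiments on synthetic and real-world data show that the method effectively exploits mirror reflections to achieve high-quality single-image 3D reconstruction and pose estimation, outperforming existing related methods; the proposed symmetric-aware loss markedly improves pose estimation accuracy; the released dataset provides a benchmark for future research. Conclusion: The paper turns mirror reflections from visual noise into a useful source of stereo information and proposes a novel single-image multi-view stereo reconstruction framework that is both generalizable and robust, offering a new approach to reflection-based geometric understanding and showing potential in both static and dynamic scenes. Abstract: Mirror reflections are common in everyday environments and can provide stereo information within a single capture, as the real and reflected virtual views are visible simultaneously. We exploit this property by treating the reflection as an auxiliary view and designing a transformation that constructs a physically valid virtual camera, allowing direct pixel-domain generation of the virtual view while adhering to the real-world imaging process. This enables a multi-view stereo setup from a single image, simplifying the imaging process and making it compatible with powerful feed-forward reconstruction models for generalizable and robust 3D reconstruction. To further exploit the geometric symmetry introduced by mirrors, we propose a symmetric-aware loss to refine pose estimation. Our framework also naturally extends to dynamic scenes, where each frame contains a mirror reflection, enabling efficient per-frame geometry recovery. For quantitative evaluation, we provide a fully customizable synthetic dataset of 16 Blender scenes, each with ground-truth point clouds and camera poses. Extensive experiments on real-world data and synthetic data are conducted to illustrate the effectiveness of our method.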
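
The geometric fact behind a mirror-induced virtual camera is the reflection of a point across a plane. The sketch below shows only that standard formula, applied to a camera center; it is not the paper's transformation, whose details the abstract does not give. The function name `reflect_point` and the plane parameterization (normal `n`, offset `d`, plane defined by n·x = d) are choices made for this illustration.

```python
import math


def reflect_point(point, normal, offset):
    """Reflect a 3D point across the plane {x : normal . x = offset}.

    The normal does not need to be unit length; it is normalized here.
    """
    length = math.sqrt(sum(c * c for c in normal))
    if length == 0.0:
        raise ValueError("plane normal must be non-zero")
    n = [c / length for c in normal]
    d = offset / length
    # Signed distance from the point to the plane.
    signed = sum(ni * pi for ni, pi in zip(n, point)) - d
    # Move the point twice that distance back through the plane.
    return tuple(pi - 2.0 * signed * ni for pi, ni in zip(point, n))


# A camera at the origin facing a mirror on the plane x = 2 has its
# virtual counterpart at x = 4 on the other side of the mirror.
virtual_center = reflect_point((0.0, 0.0, 0.0), (1.0, 0.0, 0.0), 2.0)
```

A reflection reverses handedness, so a full virtual-camera model also has to account for the left-right flip of the reflected image; that step is omitted here.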

[80] Recov-Vision: Linking Street View Imagery and Vision-Language Models for Post-Disaster Recovery

Yiming Xiao,Archit Gupta,Miguel Esparza,Yu-Hsuan Ho,Antonia Sebastian,Hannah Weas,Rose Houck,Ali Mostafavi

Main category: cs.CV

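TL;DR: This paper proposes FacadeTrack, a street-level, language-guided framework for building-level occupancy assessment after disasters; by linking panoramic video to parcels and extracting interpretable attributes, it achieves high-precision occupancy determination and supports auditable, scalable integration into emergency-management workflows.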

Details Motivation: Post-disaster building occupancy information is critical for resource allocation, safety inspections, and related tasks, but existing approaches fall short in coverage or detail: overhead imagery lacks facade detail, and street-view imagery is sparse and hard to match to parcels. Method: Proposes the FacadeTrack framework, which combines street-view panoramic video with parcel data, uses language guidance, rectifies views to facades, and extracts interpretable attributes that affect habitability (such as entry blockage and temporary coverings); it defines two decision strategies: a transparent one-stage rule and a two-stage strategy that separates perception from reasoning. Result: Across two post-Hurricane Helene surveys, the two-stage approach achieves a precision of 0.927, a recall of 0.781, and an F1 score of 0.848, improving recall and F1 over the one-stage baseline (precision 0.943, recall 0.728, F1 score 0.822); intermediate attributes and spatial diagnostics help locate and analyze sources of error. Conclusion: FacadeTrack provides an auditable, scalable occupancy assessment approach that integrates into geospatial and emergency-management workflows and improves the efficiency and equity of post-disaster response. Abstract: Building-level occupancy after disasters is vital for triage, inspections, utility re-energization, and equitable resource allocation. Overhead imagery provides rapid coverage but often misses facade and access cues that determine habitability, while street-view imagery captures those details but is sparse and difficult to align with parcels. We present FacadeTrack, a street-level, language-guided framework that links panoramic video to parcels, rectifies views to facades, and elicits interpretable attributes (for example, entry blockage, temporary coverings, localized debris) that drive two decision strategies: a transparent one-stage rule and a two-stage design that separates perception from conservative reasoning. Evaluated across two post-Hurricane Helene surveys, the two-stage approach achieves a precision of 0.927, a recall of 0.781, and an F-1 score of 0.848, compared with the one-stage baseline at a precision of 0.943, a recall of 0.728, and an F-1 score of 0.822. Beyond accuracy, intermediate attributes and spatial diagnostics reveal where and why residual errors occur, enabling targeted quality control. The pipeline provides auditable, scalable occupancy assessments suitable for integration into geospatial and emergency-management workflows.
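
The reported F1 scores follow from the reported precision and recall through the usual harmonic-mean definition, F1 = 2PR / (P + R). The short check below recomputes them; the helper name `f1_score` is chosen for this sketch, and only the precision and recall values come from the abstract.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0  # convention for the degenerate case
    return 2 * precision * recall / (precision + recall)


two_stage = round(f1_score(0.927, 0.781), 3)  # matches the reported 0.848
one_stage = round(f1_score(0.943, 0.728), 3)  # matches the reported 0.822
```

The comparison makes the trade-off explicit: the two-stage design gives up a little precision (0.927 versus 0.943) for a larger gain in recall (0.781 versus 0.728), which is what lifts its F1.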

[81] Human Semantic Representations of Social Interactions from Moving Shapes

Yiling Yun,Hongjing Lu

Main category: cs.CV

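TL;DR: This study examines how semantic representations complement visual features when humans recognize social interactions depicted by simple moving shapes. Two studies find that semantic embeddings based on descriptive verbs best explain human similarity judgments, indicating that social perception reflects the semantic structure of social interactions.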

Details Motivation: To investigate the semantic representations that humans rely on, beyond visual features, when recognizing social interactions in simple dynamic displays. Method: Study 1 asked participants to label animations based on their impressions of the moving shapes; Study 2 measured the representational geometry of 27 social interactions through human similarity judgments and compared it with model predictions based on visual features, labels, and semantic embeddings. Result: Human responses were distributed; semantic models provided information complementary to visual features, and verb-based semantic embeddings best explained human similarity judgments. Conclusion: Social perception in simple dynamic displays reflects the semantic structure of social interactions, with semantic representations bridging visual and abstract cognition. Abstract: Humans are social creatures who readily recognize various social interactions from simple displays of moving shapes. While previous research has often focused on visual features, we examine what semantic representations humans employ to complement visual features. In Study 1, we directly asked human participants to label the animations based on their impression of moving shapes. We found that human responses were distributed. In Study 2, we measured the representational geometry of 27 social interactions through human similarity judgments and compared it with model predictions based on visual features, labels, and semantic embeddings from animation descriptions. We found that semantic models provided complementary information to visual features in explaining human judgments. Among the semantic models, verb-based embeddings extracted from descriptions best account for human similarity judgments. These results suggest that social perception in simple displays reflects the semantic structure of social interactions, bridging visual and abstract representations.

[82] Enhancing Cross-View Geo-Localization Generalization via Global-Local Consistency and Geometric Equivariance

Xiaowei Wang,Di Wang,Ke Li,Yifeng Wang,Chengjian Wang,Libin Sun,Zhihong Wu,Yiming Zhang,Quan Wang

Main category: cs.CV

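TL;DR: This paper proposes EGS, a new cross-view geo-localization framework that improves cross-domain generalization through an E(2)-Steerable CNN and a graph structure with a virtual super-node, achieving state-of-the-art performance on multiple benchmarks.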

Details Motivation: Existing methods are not robust enough to the large appearance changes caused by varying UAV orientations and fields of view, and struggle to establish reliable correspondences that capture both global semantics and local details. Method: Proposes the EGS framework, which uses an E(2)-Steerable CNN to extract features robust to rotation and viewpoint changes and builds a graph with a virtual super-node that aggregates global semantics and redistributes them to local regions, enforcing global-local consistency. Result: Experiments on the University-1652 and SUES-200 benchmarks show that EGS substantially outperforms existing methods and sets a new state of the art. Conclusion: EGS effectively improves cross-domain generalization in cross-view geo-localization, addressing large appearance variation and the difficulty of global-local matching, and offers a new solution for the field. Abstract: Cross-view geo-localization (CVGL) aims to match images of the same location captured from drastically different viewpoints. Despite recent progress, existing methods still face two key challenges: (1) achieving robustness under severe appearance variations induced by diverse UAV orientations and fields of view, which hinders cross-domain generalization, and (2) establishing reliable correspondences that capture both global scene-level semantics and fine-grained local details. In this paper, we propose EGS, a novel CVGL framework designed to enhance cross-domain generalization. Specifically, we introduce an E(2)-Steerable CNN encoder to extract stable and reliable features under rotation and viewpoint shifts. Furthermore, we construct a graph with a virtual super-node that connects to all local nodes, enabling global semantics to be aggregated and redistributed to local regions, thereby enforcing global-local consistency. Extensive experiments on the University-1652 and SUES-200 benchmarks demonstrate that EGS consistently achieves substantial performance gains and establishes a new state of the art in cross-domain CVGL.

[83] DENet: Dual-Path Edge Network with Global-Local Attention for Infrared Small Target Detection

Jiayi Zuo,Songwei Pei,Qian Li

Main category: cs.CV

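TL;DR: Proposes a Dual-Path Edge Network for infrared small target detection that decouples edge enhancement from semantic modeling to improve detection accuracy for faint targets across multiple scales.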

Details Motivation: Infrared small targets lack distinctive texture and morphological features and blend easily into complex backgrounds; existing methods struggle to extract edge information accurately under high noise and low contrast, and spatial detail conflicts with semantic context. Method: Designs a dual-path structure: one path uses a Bidirectional Interaction Module (combining local and global self-attention) to capture multi-scale feature dependencies; the other introduces a Multi-Edge Refiner that performs multi-scale edge enhancement with cascaded Taylor finite difference operators and an attention-driven gating mechanism. Result: The proposed method achieves better detection performance than existing methods on multiple datasets, effectively enhancing edge details, suppressing noise, and improving localization of targets of different sizes. Conclusion: By combining structural semantics with edge refinement, the dual-path framework offers an effective solution for precise infrared small target detection. Abstract: Infrared small target detection is crucial for remote sensing applications like disaster warning and maritime surveillance. However, due to the lack of distinctive texture and morphological features, infrared small targets are highly susceptible to blending into cluttered and noisy backgrounds. A fundamental challenge in designing deep models for this task lies in the inherent conflict between capturing high-resolution spatial details for minute targets and extracting robust semantic context for larger targets, often leading to feature misalignment and suboptimal performance. Existing methods often rely on fixed gradient operators or simplistic attention mechanisms, which are inadequate for accurately extracting target edges under low contrast and high noise. In this paper, we propose a novel Dual-Path Edge Network that explicitly addresses this challenge by decoupling edge enhancement and semantic modeling into two complementary processing paths. The first path employs a Bidirectional Interaction Module, which uses both Local Self-Attention and Global Self-Attention to capture multi-scale local and global feature dependencies. The global attention mechanism, based on a Transformer architecture, integrates long-range semantic relationships and contextual information, ensuring robust scene understanding. The second path introduces the Multi-Edge Refiner, which enhances fine-grained edge details using cascaded Taylor finite difference operators at multiple scales.
This mathematical approach, along with an attention-driven gating mechanism, enables precise edge localization and feature enhancement for targets of varying sizes, while effectively suppressing noise. Our method provides a promising solution for precise infrared small target detection and localization, combining structural semantics and edge refinement in a unified framework.
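
As background for the Taylor finite difference operators mentioned above, the sketch below applies the standard central difference at several step sizes to a 1D signal. The stencil follows from the Taylor expansion f(x+h) - f(x-h) = 2h f'(x) + O(h^3). The abstract does not specify the paper's actual operators, cascade, or gating, so the function names `central_difference` and `multiscale_edges`, the step sizes, and the max-over-scales fusion are all assumptions for illustration.

```python
def central_difference(signal, step=1):
    """First-derivative estimate (f[i+h] - f[i-h]) / (2h) at step size h.

    Positions closer than `step` to either end are left at 0.0.
    """
    n = len(signal)
    out = [0.0] * n
    for i in range(step, n - step):
        out[i] = (signal[i + step] - signal[i - step]) / (2 * step)
    return out


def multiscale_edges(signal, steps=(1, 2, 4)):
    """Fuse edge responses across scales by taking the largest magnitude."""
    responses = [central_difference(signal, s) for s in steps]
    return [max(abs(r[i]) for r in responses) for i in range(len(signal))]


# A step edge between index 3 and index 4 produces the strongest response there.
step_signal = [0, 0, 0, 0, 1, 1, 1, 1]
edges = multiscale_edges(step_signal, steps=(1, 2))
```

Small steps localize sharp edges precisely, while larger steps respond to broader transitions and average out more noise, which is one common motivation for combining several scales.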

[84] Beyond the Individual: Introducing Group Intention Forecasting with SHOT Dataset

Ruixu Zhang,Yuran Wang,Xinyi Hu,Chaoyu Mai,Wenxuan Liu,Danni Xu,Xian Zhong,Zheng Wang

Main category: cs.CV

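TL;DR: This paper introduces Group Intention Forecasting (GIF), a new task, and builds SHOT, the first large-scale dataset for it, together with the forecasting framework GIFT, to predict the emergence of group intentions from individual actions and interactions before the group goal becomes apparent.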

Details Motivation: Traditional intention recognition focuses mainly on individual intentions and overlooks the complexity of collective intentions in group settings. This paper therefore introduces the concept of group intention and promotes research on forecasting group intentions in advance. Method: Proposes the Group Intention Forecasting (GIF) task, builds SHOT, a large-scale basketball video dataset with multi-view, multi-individual information, and designs the GIFT framework, which forecasts the emergence of group intentions by extracting fine-grained individual features and modeling evolving group dynamics. Result: Experimental results confirm the effectiveness of the SHOT dataset and the GIFT framework for group intention forecasting, providing a solid foundation for future research. Conclusion: Group intention forecasting is a promising new direction, and the results with SHOT and GIFT support the feasibility of predicting group intentions from individual actions. Abstract: Intention recognition has traditionally focused on individual intentions, overlooking the complexities of collective intentions in group settings. To address this limitation, we introduce the concept of group intention, which represents shared goals emerging through the actions of multiple individuals, and Group Intention Forecasting (GIF), a novel task that forecasts when group intentions will occur by analyzing individual actions and interactions before the collective goal becomes apparent. To investigate GIF in a specific scenario, we propose SHOT, the first large-scale dataset for GIF, consisting of 1,979 basketball video clips captured from 5 camera views and annotated with 6 types of individual attributes. SHOT is designed with 3 key characteristics: multi-individual information, multi-view adaptability, and multi-level intention, making it well-suited for studying emerging group intentions. Furthermore, we introduce GIFT (Group Intention ForecasTer), a framework that extracts fine-grained individual features and models evolving group dynamics to forecast intention emergence. Experimental results confirm the effectiveness of SHOT and GIFT, establishing a strong foundation for future research in group intention forecasting. The dataset is available at https://xinyi-hu.github.io/SHOT_DATASET.

[85] Neptune-X: Active X-to-Maritime Generation for Universal Maritime Object Detection

Yu Guo,Shengfeng He,Yuxu Lu,Haonan An,Yihang Tao,Huilin Zhu,Jingxian Liu,Yuguang Fang

Main category: cs.CV

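TL;DR: This paper proposes Neptune-X, a data-centric generative-selection framework that improves maritime object detection performance through task-aware sample generation and selection.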

Details Motivation: Annotated maritime data are scarce and models generalize poorly across maritime attributes (such as category, viewpoint, and environment), so existing methods perform poorly in scenarios such as the open sea. Method: Proposes X-to-Maritime, a multimodal generative model with a Bidirectional Object-Water Attention module that improves the realism of synthesized scenes, and designs Attribute-correlated Active Sampling, which dynamically selects synthetic samples that benefit the task. Result: Detection accuracy improves significantly across experiments, especially in challenging and underrepresented scenarios, and the Maritime Generation Dataset, the first dataset for generative maritime learning, is constructed. Conclusion: Through synthetic data generation and intelligent selection, Neptune-X effectively alleviates data scarcity and generalization problems in maritime object detection, providing a new benchmark and solution for the field. Abstract: Maritime object detection is essential for navigation safety, surveillance, and autonomous operations, yet constrained by two key challenges: the scarcity of annotated maritime data and poor generalization across various maritime attributes (e.g., object category, viewpoint, location, and imaging environment). In particular, models trained on existing datasets often underperform in underrepresented scenarios such as open-sea environments. To address these challenges, we propose Neptune-X, a data-centric generative-selection framework that enhances training effectiveness by leveraging synthetic data generation with task-aware sample selection. From the generation perspective, we develop X-to-Maritime, a multi-modality-conditioned generative model that synthesizes diverse and realistic maritime scenes. A key component is the Bidirectional Object-Water Attention module, which captures boundary interactions between objects and their aquatic surroundings to improve visual fidelity. To further improve downstream task performance, we propose Attribute-correlated Active Sampling, which dynamically selects synthetic samples based on their task relevance. To support robust benchmarking, we construct the Maritime Generation Dataset, the first dataset tailored for generative maritime learning, encompassing a wide range of semantic conditions. Extensive experiments demonstrate that our approach sets a new benchmark in maritime scene synthesis, significantly improving detection accuracy, particularly in challenging and previously underrepresented settings. The code is available at https://github.com/gy65896/Neptune-X.

[86] AI-Enabled Crater-Based Navigation for Lunar Mapping

Sofia McLeod,Chee-Kheng Chng,Matthew Rodda,Tat-Jun Chin

Main category: cs.CV

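TL;DR: This paper presents STELLA, the first end-to-end crater-based navigation (CBN) pipeline for long-duration lunar mapping missions, and releases CRESENT-365, a public dataset that emulates a year-long lunar mapping mission. Experiments show that STELLA maintains metre-level position accuracy and sub-degree attitude accuracy across a range of conditions.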

Details Motivation: Existing crater-based navigation research focuses mainly on the landing phase and does not transfer well to lunar mapping missions, which are long in duration and rely on sparse, oblique imagery under widely varying illumination; a navigation system suited to these conditions is needed. Method: STELLA combines a Mask R-CNN-based crater detector, a descriptor-less crater identification module, a robust perspective-n-crater (PnC) pose solver, and a batch orbit determination back-end, and is validated on the newly built CRESENT-365 dataset. Result: On CRESENT+ and CRESENT-365, STELLA achieves metre-level position accuracy and sub-degree attitude accuracy on average across viewing angles, illumination conditions, and latitudes. Conclusion: This is the first comprehensive assessment of CBN in a true lunar mapping setting, and the results inform navigation design for future long-duration lunar mapping missions. Abstract: Crater-Based Navigation (CBN) uses the ubiquitous impact craters of the Moon observed in images as natural landmarks to determine the six-degrees-of-freedom pose of a spacecraft. To date, CBN has primarily been studied in the context of powered descent and landing. These missions are typically short in duration, with high-frequency imagery captured from a nadir viewpoint over well-lit terrain. In contrast, lunar mapping missions involve sparse, oblique imagery acquired under varying illumination conditions over potentially year-long campaigns, posing significantly greater challenges for pose estimation. We bridge this gap with STELLA - the first end-to-end CBN pipeline for long-duration lunar mapping. STELLA combines a Mask R-CNN-based crater detector, a descriptor-less crater identification module, a robust perspective-n-crater pose solver, and a batch orbit determination back-end. To rigorously test STELLA, we introduce CRESENT-365 - the first public dataset that emulates a year-long lunar mapping mission. Each of its 15,283 images is rendered from high-resolution digital elevation models with SPICE-derived Sun angles and Moon motion, delivering realistic global coverage, illumination cycles, and viewing geometries. Experiments on CRESENT+ and CRESENT-365 show that STELLA maintains metre-level position accuracy and sub-degree attitude accuracy on average across wide ranges of viewing angles, illumination conditions, and lunar latitudes. These results constitute the first comprehensive assessment of CBN in a true lunar mapping setting and inform operational conditions that should be considered for future missions.

[87] Seeing Through Words, Speaking Through Pixels: Deep Representational Alignment Between Vision and Language Models

Zoe Wanying He,Sean Trott,Meenakshi Khosla

Main category: cs.CV

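TL;DR: The study shows that vision and language models, although trained on different modalities, still form a partially aligned semantic representational space in their mid-to-late layers. This alignment is sensitive to semantics, agrees with human preferences in many-to-many image-text matching, and is further strengthened by exemplar aggregation.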

Details Motivation: To understand why unimodal deep models (vision and language) produce aligned representations without any cross-modal training: where the alignment emerges, which cues support it, whether it reflects human preferences, and how exemplar aggregation affects it. Method: Analyzes the representational alignment between vision and language models layer by layer, tests the robustness of the alignment to semantic and appearance changes, and uses a forced-choice "Pick-a-Pic" task to assess how well model embedding spaces agree with human preferences in many-to-many image-text matching. Result: Alignment peaks in mid-to-late layers and depends on semantics rather than appearance; disrupting semantics collapses the alignment; model embedding spaces mirror human preferences in image-text matching, and the pattern holds bidirectionally in many-to-one settings; averaging embeddings across multiple exemplars strengthens alignment rather than weakening it. Conclusion: Unimodal vision and language models spontaneously converge on a shared semantic code that agrees with human semantic judgments and is strengthened by exemplar aggregation. Abstract: Recent studies show that deep vision-only and language-only models--trained on disjoint modalities--nonetheless project their inputs into a partially aligned representational space. Yet we still lack a clear picture of where in each network this convergence emerges, what visual or linguistic cues support it, whether it captures human preferences in many-to-many image-text scenarios, and how aggregating exemplars of the same concept affects alignment. Here, we systematically investigate these questions. We find that alignment peaks in mid-to-late layers of both model types, reflecting a shift from modality-specific to conceptually shared representations. This alignment is robust to appearance-only changes but collapses when semantics are altered (e.g., object removal or word-order scrambling), highlighting that the shared code is truly semantic. Moving beyond the one-to-one image-caption paradigm, a forced-choice "Pick-a-Pic" task shows that human preferences for image-caption matches are mirrored in the embedding spaces across all vision-language model pairs. This pattern holds bidirectionally when multiple captions correspond to a single image, demonstrating that models capture fine-grained semantic distinctions akin to human judgments. Surprisingly, averaging embeddings across exemplars amplifies alignment rather than blurring detail. Together, our results demonstrate that unimodal networks converge on a shared semantic code that aligns with human judgments and strengthens with exemplar aggregation.

[88] FreeInsert: Personalized Object Insertion with Geometric and Style Control

Yuhong Zhang,Han Wang,Yiwen Wang,Rong Xie,Li Song

Main category: cs.CV

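TL;DR: Proposes FreeInsert, a training-free framework that uses 3D geometric information to insert customized objects into arbitrary scenes, addressing insufficient geometric control and style inconsistency in image editing.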

Details Motivation: Existing image editing methods offer insufficient geometric control and poor style consistency in personalized image composition, and they usually require extensive training. Method: The 2D object is first converted into 3D and edited interactively at the 3D level, then re-rendered into a 2D image from a specified view; this rendering is combined with style and content control from diffusion adapters, and the final result is produced by a diffusion model. Result: The framework achieves geometrically controllable, style-consistent image editing and can insert objects into arbitrary scenes without additional training. Conclusion: FreeInsert improves geometric control and style consistency in image editing while avoiding a complex training process, and shows good application potential. Abstract: Text-to-image diffusion models have made significant progress in image generation, allowing for effortless customized generation. However, existing image editing methods still face certain limitations when dealing with personalized image composition tasks. First, there is the issue of lack of geometric control over the inserted objects. Current methods are confined to 2D space and typically rely on textual instructions, making it challenging to maintain precise geometric control over the objects. Second, there is the challenge of style consistency. Existing methods often overlook the style consistency between the inserted object and the background, resulting in a lack of realism. In addition, the challenge of inserting objects into images without extensive training remains significant. To address these issues, we propose FreeInsert, a novel training-free framework that customizes object insertion into arbitrary scenes by leveraging 3D geometric information. Benefiting from the advances in existing 3D generation models, we first convert the 2D object into 3D, perform interactive editing at the 3D level, and then re-render it into a 2D image from a specified view. This process introduces geometric controls such as shape or view. The rendered image, serving as geometric control, is combined with style and content control achieved through diffusion adapters, ultimately producing geometrically controlled, style-consistent edited images via the diffusion model.

[89] CusEnhancer: A Zero-Shot Scene and Controllability Enhancement Method for Photo Customization via ResInversion

Maoye Ren,Praneetha Vaddamanu,Jianjin Xu,Fernando De la Torre Frade

Main category: cs.CV

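TL;DR: This paper proposes CustomEnhancer, a zero-shot framework for augmenting existing identity customization models. It unifies generation and reconstruction through a triple-flow fused PerGeneration approach and introduces ResInversion, which substantially speeds up image inversion.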

Details Motivation: Existing text-to-image diffusion models suffer from degraded scenes, insufficient control, and suboptimal identity fidelity in human photo synthesis, which calls for an identity customization method that is more efficient, controllable, and training-free. Method: Uses face swapping techniques and a pretrained diffusion model to obtain additional representations; proposes the triple-flow fused PerGeneration approach, which combines two counter-directional latent spaces to manipulate a pivotal space of the personalized model; introduces ResInversion, which rectifies noise to speed up inversion. Result: CustomEnhancer reaches SOTA results in scene diversity, identity fidelity, and training-free control, and ResInversion reduces inversion time by a factor of 129 relative to NTI. Conclusion: CustomEnhancer enables efficient, precisely controlled personalized image generation without retraining controllers, and ResInversion greatly improves inversion efficiency, offering a practical option for real-world use. Abstract: Recently remarkable progress has been made in synthesizing realistic human photos using text-to-image diffusion models. However, current approaches face degraded scenes, insufficient control, and suboptimal perceptual identity. We introduce CustomEnhancer, a novel framework to augment existing identity customization models. CustomEnhancer is a zero-shot enhancement pipeline that leverages face swapping techniques and a pretrained diffusion model to obtain additional representations in a zero-shot manner for encoding into personalized models. Through our proposed triple-flow fused PerGeneration approach, which identifies and combines two compatible counter-directional latent spaces to manipulate a pivotal space of the personalized model, we unify the generation and reconstruction processes, realizing generation from three flows. Our pipeline also enables comprehensive training-free control over the generation process of personalized models, offering precise controlled personalization for them and eliminating the need for controller retraining for each model. Besides, to address the high time complexity of null-text inversion (NTI), we introduce ResInversion, a novel inversion method that performs noise rectification via a pre-diffusion mechanism, reducing the inversion time by 129 times. Experiments demonstrate that CustomEnhancer reaches SOTA results in scene diversity, identity fidelity, and training-free control, while also showing the efficiency of our ResInversion over NTI. The code will be made publicly available upon paper acceptance.

[90] CompressAI-Vision: Open-source software to evaluate compression methods for computer vision tasks

Hyomin Choi,Heeji Han,Chris Rosewarne,Fabien Racapé

Main category: cs.CV

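TL;DR: CompressAI-Vision is an evaluation platform for video compression technology optimized for vision tasks. It supports remote and split inference scenarios, has been released as open-source software, and has been adopted by MPEG for developing the FCM standard.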

Details Motivation: As neural networks become widespread in computer vision applications, a unified platform is needed to implement and evaluate compression methods optimized for downstream vision tasks. Method: Presents the CompressAI-Vision platform, which integrates several standard codecs and evaluates compression gain on different datasets in terms of bit-rate versus task accuracy. Result: The platform demonstrates compression gains across a variety of use cases and has been adopted by MPEG for developing the Feature Coding for Machines (FCM) standard. Conclusion: CompressAI-Vision provides an open, comprehensive evaluation environment that advances the development and standardization of efficient video compression for machine vision tasks. Abstract: With the increasing use of neural network (NN)-based computer vision applications that process image and video data as input, interest has emerged in video compression technology optimized for computer vision tasks. In fact, given the variety of vision tasks, associated NN models and datasets, a consolidated platform is needed as a common ground to implement and evaluate compression methods optimized for downstream vision tasks. CompressAI-Vision is introduced as a comprehensive evaluation platform where new coding tools compete to efficiently compress the input of a vision network while retaining task accuracy in the context of two different inference scenarios: "remote" and "split" inferencing. Our study showcases various use cases of the evaluation platform incorporated with standard codecs (under development) by examining the compression gain on several datasets in terms of bit-rate versus task accuracy. This evaluation platform has been developed as open-source software and is adopted by the Moving Picture Experts Group (MPEG) for the development of the Feature Coding for Machines (FCM) standard. The software is available publicly at https://github.com/InterDigitalInc/CompressAI-Vision.

[91] Dual-supervised Asymmetric Co-training for Semi-supervised Medical Domain Generalization

Jincai Song,Haipeng Chen,Jun Qin,Na Zhao

Main category: cs.CV

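TL;DR: This paper proposes a dual-supervised asymmetric co-training (DAC) framework for cross-domain semi-supervised domain generalization (CD-SSDG), targeting medical image segmentation with limited annotations and domain shift.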

Details Motivation: Conventional SSDG methods assume that every source domain provides both labeled and unlabeled data, which often does not hold in practice. This paper addresses the more realistic case in which a domain shift also exists between the labeled and unlabeled training data. Method: Proposes the DAC framework, built on co-training with two sub-models that supervise each other through cross pseudo supervision, and adds feature-level supervision and asymmetric self-supervised auxiliary tasks to mitigate the inaccurate pseudo-labels caused by domain shift and to improve domain-invariant feature learning. Result: Experiments on real-world medical image datasets (Fundus, Polyp, SCGM) show that DAC generalizes well in the cross-domain semi-supervised domain generalization setting and clearly outperforms existing methods. Conclusion: DAC handles domain shift and annotation scarcity in CD-SSDG, strengthens model robustness through feature-level supervision and asymmetric auxiliary tasks, and offers a more practical solution for medical image segmentation. Abstract: Semi-supervised domain generalization (SSDG) in medical image segmentation offers a promising solution for generalizing to unseen domains during testing, addressing domain shift challenges and minimizing annotation costs. However, conventional SSDG methods assume labeled and unlabeled data are available for each source domain in the training set, a condition that is not always met in practice. The coexistence of limited annotation and domain shift in the training set is a prevalent issue. Thus, this paper explores a more practical and challenging scenario, cross-domain semi-supervised domain generalization (CD-SSDG), where domain shifts occur between labeled and unlabeled training data, in addition to shifts between training and testing sets. Existing SSDG methods exhibit sub-optimal performance under such domain shifts because of inaccurate pseudolabels. To address this issue, we propose a novel dual-supervised asymmetric co-training (DAC) framework tailored for CD-SSDG. Building upon the co-training paradigm with two sub-models offering cross pseudo supervision, our DAC framework integrates extra feature-level supervision and asymmetric auxiliary tasks for each sub-model. This feature-level supervision serves to address inaccurate pseudo supervision caused by domain shifts between labeled and unlabeled data, utilizing complementary supervision from the rich feature space. Additionally, two distinct auxiliary self-supervised tasks are integrated into each sub-model to enhance domain-invariant discriminative feature learning and prevent model collapse. Extensive experiments on real-world medical image segmentation datasets, i.e., Fundus, Polyp, and SCGM, demonstrate the robust generalizability of the proposed DAC framework.
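
The cross pseudo supervision that the co-training paradigm builds on can be sketched as follows. This is a minimal pure-Python illustration and not the DAC implementation: the function names, the toy two-class per-pixel probabilities, and the plain cross-entropy are assumptions, and the feature-level supervision and auxiliary tasks described in the abstract are omitted.

```python
import math

def argmax(probs):
    """Index of the largest probability, used as the hard pseudo-label."""
    return max(range(len(probs)), key=lambda i: probs[i])

def cross_entropy(probs, label, eps=1e-12):
    """Negative log-likelihood of `label` under the distribution `probs`."""
    return -math.log(probs[label] + eps)

def cross_pseudo_supervision_loss(probs_a, probs_b):
    """Each sub-model is trained on the other's hard pseudo-labels.

    probs_a, probs_b: per-pixel class probabilities predicted by the
    (hypothetical) sub-models A and B for the same unlabeled image.
    """
    loss_a = sum(cross_entropy(pa, argmax(pb)) for pa, pb in zip(probs_a, probs_b))
    loss_b = sum(cross_entropy(pb, argmax(pa)) for pa, pb in zip(probs_a, probs_b))
    return (loss_a + loss_b) / len(probs_a)

# Two pixels, two classes: the sub-models agree in the first call, disagree in the second.
agree = cross_pseudo_supervision_loss([[0.9, 0.1], [0.2, 0.8]], [[0.8, 0.2], [0.3, 0.7]])
disagree = cross_pseudo_supervision_loss([[0.9, 0.1], [0.2, 0.8]], [[0.1, 0.9], [0.7, 0.3]])
print(agree < disagree)
```

When the sub-models disagree the loss grows, which is also why pseudo-labels degraded by domain shift can mislead training and why the abstract adds complementary supervision.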

[92] Real-Time Object Detection Meets DINOv3

Shihua Huang,Yongjie Hou,Longfei Liu,Xuanlong Yu,Xi Shen

Main category: cs.CV

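TL;DR: DEIMv2 is a new real-time object detector that builds on the DEIM framework and adds DINOv3 features. It spans eight model sizes from X to Atto, achieves a superior performance-cost trade-off, and clearly outperforms existing methods.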

Details Motivation: To further improve real-time DETR performance and meet the needs of deployment across many scenarios, the authors extend the existing DEIM framework with stronger feature extraction and tune the efficiency-accuracy balance for models of different sizes. Method: Uses DINOv3-pretrained or distilled backbones with a Spatial Tuning Adapter (STA) that produces multi-scale features; uses HGNetv2 with depth and width pruning for the ultra-lightweight models; and pairs these with a simplified decoder and an upgraded Dense O2O for a unified, efficient design. Result: DEIMv2-X reaches 57.8 AP with only 50.3 million parameters; DEIMv2-S reaches 50.9 AP with 9.71 million parameters, the first sub-10-million-parameter model to exceed 50 AP; DEIMv2-Pico reaches 38.5 AP with only 1.5 million parameters, matching YOLOv10-Nano with around 50 percent fewer parameters. Conclusion: Through a unified architecture, DEIMv2 achieves a state-of-the-art performance-cost trade-off across a wide range of deployment scenarios, positioning it as one of the standard frameworks for the next generation of efficient real-time object detection. Abstract: Benefiting from the simplicity and effectiveness of Dense O2O and MAL, DEIM has become the mainstream training framework for real-time DETRs, significantly outperforming the YOLO series. In this work, we extend it with DINOv3 features, resulting in DEIMv2. DEIMv2 spans eight model sizes from X to Atto, covering GPU, edge, and mobile deployment. For the X, L, M, and S variants, we adopt DINOv3-pretrained or distilled backbones and introduce a Spatial Tuning Adapter (STA), which efficiently converts DINOv3's single-scale output into multi-scale features and complements strong semantics with fine-grained details to enhance detection. For ultra-lightweight models (Nano, Pico, Femto, and Atto), we employ HGNetv2 with depth and width pruning to meet strict resource budgets. Together with a simplified decoder and an upgraded Dense O2O, this unified design enables DEIMv2 to achieve a superior performance-cost trade-off across diverse scenarios, establishing new state-of-the-art results. Notably, our largest model, DEIMv2-X, achieves 57.8 AP with only 50.3 million parameters, surpassing prior X-scale models that require over 60 million parameters for just 56.5 AP. On the compact side, DEIMv2-S is the first sub-10 million model (9.71 million) to exceed the 50 AP milestone on COCO, reaching 50.9 AP. Even the ultra-lightweight DEIMv2-Pico, with just 1.5 million parameters, delivers 38.5 AP, matching YOLOv10-Nano (2.3 million) with around 50 percent fewer parameters.

[93] DAC-LoRA: Dynamic Adversarial Curriculum for Efficient and Robust Few-Shot Adaptation

Ved Umrajkar

Main category: cs.CV

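TL;DR: Proposes DAC-LoRA, a framework that introduces a dynamic adversarial curriculum and a FOSC-guided loss to substantially improve the adversarial robustness of LoRA-adapted vision-language models while preserving clean accuracy.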

Details Motivation: Vision-language models remain vulnerable to adversarial attacks after parameter-efficient fine-tuning, and the weaknesses of core models such as CLIP can propagate through the whole multimodal ecosystem, so a lightweight and effective defense is needed. Method: Proposes DAC-LoRA, which combines adversarial training with PEFT and uses FOSC together with a TRADES-inspired loss to design a curriculum of progressively harder adversarial attacks that strengthens robustness. Result: Across multiple tasks, DAC-LoRA substantially improves adversarial robustness with almost no loss in clean performance and integrates easily into a standard PEFT pipeline. Conclusion: DAC-LoRA offers an efficient, general, and practical way to strengthen the adversarial robustness of vision-language models, suitable for deployment in safety-critical settings. Abstract: Vision-Language Models (VLMs) are foundational to critical applications like autonomous driving, medical diagnosis, and content moderation. While Parameter-Efficient Fine-Tuning (PEFT) methods like LoRA enable their efficient adaptation to specialized tasks, these models remain vulnerable to adversarial attacks that can compromise safety-critical decisions. CLIP, the backbone for numerous downstream VLMs, is a high-value target whose vulnerabilities can cascade across the multimodal AI ecosystem. We propose Dynamic Adversarial Curriculum DAC-LoRA, a novel framework that integrates adversarial training into PEFT. The core principle of our method, i.e., an intelligent curriculum of progressively challenging attacks, is general and can potentially be applied to any iterative attack method. Guided by the First-Order Stationary Condition (FOSC) and a TRADES-inspired loss, DAC-LoRA achieves substantial improvements in adversarial robustness without significantly compromising clean accuracy. Our work presents an effective, lightweight, and broadly applicable method to demonstrate that the DAC-LoRA framework can be easily integrated into a standard PEFT pipeline to significantly enhance robustness.
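
To make the "TRADES-inspired loss" concrete, the sketch below computes the standard TRADES objective for a single example: cross-entropy on the clean input plus a weighted KL term between clean and adversarial predictions. It is an illustration only; the abstract does not give DAC-LoRA's exact loss, the value `beta=6.0` and all function names are assumptions, and the FOSC-guided curriculum is not modeled.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def trades_style_loss(clean_logits, adv_logits, label, beta=6.0):
    """Clean cross-entropy plus beta * KL(clean prediction || adversarial prediction).

    beta trades clean accuracy against robustness; 6.0 is an assumed value.
    """
    p_clean = softmax(clean_logits)
    p_adv = softmax(adv_logits)
    natural = -math.log(p_clean[label])
    robust = kl_divergence(p_clean, p_adv)
    return natural + beta * robust

unchanged = trades_style_loss([2.0, 0.5, -1.0], [2.0, 0.5, -1.0], label=0)
flipped = trades_style_loss([2.0, 0.5, -1.0], [0.5, 2.0, -1.0], label=0)
print(unchanged < flipped)
```

When the adversarial input leaves the prediction unchanged, only the clean cross-entropy remains; the robustness term grows as the adversarial prediction drifts away from the clean one.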

[94] Federated Domain Generalization with Domain-specific Soft Prompts Generation

Jianhan Wu,Xiaoyang Qu,Zhangcheng Huang,Jianzong Wang

Main category: cs.CV

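TL;DR: Proposes FedDSPG, a new federated domain generalization method based on generating domain-specific soft prompts. By introducing domain-specific soft prompts during training and combining content and domain knowledge, it outperforms existing methods on several public datasets.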

Details Motivation: The soft prompts learned by existing prompt-learning-based federated domain generalization methods lack diversity and ignore information from unknown domains, which limits generalization. Method: Proposes FedDSPG, which introduces domain-specific soft prompts (DSPs) for each domain during training and integrates content and domain knowledge into a generative model shared among clients; at inference, the generator produces DSPs for unseen target domains to guide downstream tasks. Result: Experiments on several public datasets show that the method outperforms existing strong baselines in federated domain generalization and achieves state-of-the-art results. Conclusion: By generating domain-specific soft prompts, FedDSPG improves domain generalization in federated learning, and it performs especially well when adapting downstream tasks to unknown domains. Abstract: Prompt learning has become an efficient paradigm for adapting CLIP to downstream tasks. Compared with traditional fine-tuning, prompt learning optimizes a few parameters yet yields highly competitive results, especially appealing in federated learning for computational efficiency. engendering domain shift among clients and posing a formidable challenge for downstream-task adaptation. Existing federated domain generalization (FDG) methods based on prompt learning typically learn soft prompts from training samples, replacing manually designed prompts to enhance the generalization ability of federated models. However, these learned prompts exhibit limited diversity and tend to ignore information from unknown domains. We propose a novel and effective method from a generative perspective for handling FDG tasks, namely federated domain generalization with domain-specific soft prompts generation (FedDSPG). Specifically, during training, we introduce domain-specific soft prompts (DSPs) for each domain and integrate content and domain knowledge into the generative model among clients. In the inference phase, the generator is utilized to obtain DSPs for unseen target domains, thus guiding downstream tasks in unknown domains. Comprehensive evaluations across several public datasets confirm that our method outperforms existing strong baselines in FDG, achieving state-of-the-art results.

[95] Revolutionizing Precise Low Back Pain Diagnosis via Contrastive Learning

Thanh Binh Le,Hoang Nhat Khang Vo,Tan-Ha Mai,Trong Nhan Phan

Main category: cs.CV

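TL;DR: LumbarCLIP is a multimodal framework based on contrastive language-image pretraining that aligns lumbar spine MRI scans with radiology text reports, reaching 95.00% accuracy on a classification task.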

Details Motivation: Low back pain is widespread, which calls for diagnostic models that can jointly analyze medical images and text reports. Method: Pairs vision encoders (ResNet-50, ViT, Swin) with a BERT text encoder, maps their features into a shared embedding space through learnable projection heads, and trains contrastively with a soft CLIP loss. Result: Reaches up to 95.00% accuracy and a 94.75% F1-score on the test set; linear projection heads outperform non-linear variants. Conclusion: LumbarCLIP offers a strong foundation for automated musculoskeletal diagnosis and clinical decision support. Abstract: Low back pain affects millions worldwide, driving the need for robust diagnostic models that can jointly analyze complex medical images and accompanying text reports. We present LumbarCLIP, a novel multimodal framework that leverages contrastive language-image pretraining to align lumbar spine MRI scans with corresponding radiological descriptions. Built upon a curated dataset containing axial MRI views paired with expert-written reports, LumbarCLIP integrates vision encoders (ResNet-50, Vision Transformer, Swin Transformer) with a BERT-based text encoder to extract dense representations. These are projected into a shared embedding space via learnable projection heads, configurable as linear or non-linear, and normalized to facilitate stable contrastive training using a soft CLIP loss. Our model achieves state-of-the-art performance on downstream classification, reaching up to 95.00% accuracy and 94.75% F1-score on the test set, despite inherent class imbalance. Extensive ablation studies demonstrate that linear projection heads yield more effective cross-modal alignment than non-linear variants. LumbarCLIP offers a promising foundation for automated musculoskeletal diagnosis and clinical decision support.
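
The contrastive objective described above can be sketched in a few lines. The abstract does not define its "soft CLIP loss", so this is an assumed form: the usual symmetric CLIP cross-entropy over normalized embeddings, with an optional `targets` matrix that lets soft labels replace the one-hot pairing. The function names, the temperature of 0.07, and the toy 2-D embeddings are all illustrative assumptions.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def log_softmax(row):
    """Numerically stable log-softmax of one row of logits."""
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]

def soft_cross_entropy(logits, targets):
    """Mean cross-entropy between target rows and softmax(logits) rows."""
    total = 0.0
    for row_logits, row_targets in zip(logits, targets):
        total -= sum(t * lp for t, lp in zip(row_targets, log_softmax(row_logits)))
    return total / len(logits)

def clip_loss(image_embs, text_embs, temperature=0.07, targets=None):
    """Symmetric image-to-text and text-to-image contrastive loss."""
    imgs = [l2_normalize(v) for v in image_embs]
    txts = [l2_normalize(v) for v in text_embs]
    n = len(imgs)
    logits = [[sum(a * b for a, b in zip(i, t)) / temperature for t in txts] for i in imgs]
    if targets is None:
        # One-hot targets: the i-th image is paired with the i-th report.
        targets = [[1.0 if r == c else 0.0 for c in range(n)] for r in range(n)]
    logits_t = [list(col) for col in zip(*logits)]
    targets_t = [list(col) for col in zip(*targets)]
    return 0.5 * (soft_cross_entropy(logits, targets) + soft_cross_entropy(logits_t, targets_t))

matched = clip_loss([[1.0, 0.0], [0.0, 1.0]], [[0.9, 0.1], [0.1, 0.9]])
mismatched = clip_loss([[1.0, 0.0], [0.0, 1.0]], [[0.1, 0.9], [0.9, 0.1]])
print(matched < mismatched)
```

Passing a row-stochastic `targets` matrix, for example one that spreads weight across reports describing the same finding, turns the same function into a soft-label variant.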

[96] Poisoning Prompt-Guided Sampling in Video Large Language Models

Yuxin Cao,Wei Song,Jingling Xue,Jin Song Dong

Main category: cs.CV

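TL;DR: This paper presents PoisonVID, the first black-box poisoning attack on the prompt-guided sampling mechanism in video large language models. It uses a closed-loop optimization strategy to produce a universal perturbation that suppresses the relevance scores of harmful frames, achieving attack success rates of 82% to 99% across several advanced VideoLLMs.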

Details Motivation: Vulnerabilities have been found in earlier frame sampling strategies, but the safety of prompt-guided sampling in video large language models has not been explored, so its potential risks need study. Method: Proposes PoisonVID, a closed-loop optimization strategy that uses a shadow VideoLLM and a lightweight language model (GPT-4o-mini) to build a depiction set from paraphrased harmful descriptions, and iteratively optimizes a universal perturbation to suppress the relevance scores of harmful frames. Result: Evaluated on three prompt-guided sampling strategies and three advanced VideoLLMs, PoisonVID achieves attack success rates of 82% to 99%. Conclusion: Prompt-guided sampling has serious security weaknesses; PoisonVID exposes the fragility of current VideoLLM sampling methods and underlines the need for safer, more advanced sampling strategies. Abstract: Video Large Language Models (VideoLLMs) have emerged as powerful tools for understanding videos, supporting tasks such as summarization, captioning, and question answering. Their performance has been driven by advances in frame sampling, progressing from uniform-based to semantic-similarity-based and, most recently, prompt-guided strategies. While vulnerabilities have been identified in earlier sampling strategies, the safety of prompt-guided sampling remains unexplored. We close this gap by presenting PoisonVID, the first black-box poisoning attack that undermines prompt-guided sampling in VideoLLMs. PoisonVID compromises the underlying prompt-guided sampling mechanism through a closed-loop optimization strategy that iteratively optimizes a universal perturbation to suppress harmful frame relevance scores, guided by a depiction set constructed from paraphrased harmful descriptions leveraging a shadow VideoLLM and a lightweight language model, i.e., GPT-4o-mini. Comprehensively evaluated on three prompt-guided sampling strategies and across three advanced VideoLLMs, PoisonVID achieves 82% - 99% attack success rate, highlighting the importance of developing future advanced sampling strategies for VideoLLMs.

[97] Punching Above Precision: Small Quantized Model Distillation with Learnable Regularizer

Abdur Rehman,S M A Sharif,Md Abdur Rahaman,Mohamed Jismy Aashik Rasool,Seongwan Kim,Jaeho Lee

Main category: cs.CV

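TL;DR: Proposes Game of Regularizer (GoR), a learnable regularization method that uses two trainable parameters to dynamically balance the task-specific and knowledge distillation losses, substantially improving low-bit quantized models; combined with the ensemble distillation framework QAT-EKD-GoR, it enables efficient inference on edge devices.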

Details Motivation: Existing quantization-aware training with knowledge distillation (QAT-KD) struggles to balance the task loss and the distillation loss under low-bit quantization because their gradient magnitudes differ, which causes conflicting training signals and degraded performance. Method: Designs GoR, a dynamic loss weighting mechanism with only two learnable parameters that adaptively balances the task-specific and distillation objectives; further proposes QAT-EKD-GoR, which performs ensemble distillation from multiple heterogeneous teacher models. Result: On image classification, object detection, and large language model compression, GoR consistently outperforms existing QAT-KD methods; on low-power edge devices it delivers faster inference while maintaining full-precision accuracy; under optimal conditions EKD-GoR can even exceed full-precision models. Conclusion: GoR provides an efficient, lightweight, and general loss-balancing scheme for small quantized models, and combined with ensemble distillation it supports deploying AI models in resource-constrained settings. Abstract: Quantization-aware training (QAT) combined with knowledge distillation (KD) is a promising strategy for compressing Artificial Intelligence (AI) models for deployment on resource-constrained hardware. However, existing QAT-KD methods often struggle to balance task-specific (TS) and distillation losses due to heterogeneous gradient magnitudes, especially under low-bit quantization. We propose Game of Regularizer (GoR), a novel learnable regularization method that adaptively balances TS and KD objectives using only two trainable parameters for dynamic loss weighting. GoR reduces conflict between supervision signals, improves convergence, and boosts the performance of small quantized models (SQMs). Experiments on image classification, object detection (OD), and large language model (LLM) compression show that GoR consistently outperforms state-of-the-art QAT-KD methods. On low-power edge devices, it delivers faster inference while maintaining full-precision accuracy. We also introduce QAT-EKD-GoR, an ensemble distillation framework that uses multiple heterogeneous teacher models. Under optimal conditions, the proposed EKD-GoR can outperform full-precision models, providing a robust solution for real-world deployment.
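
One way to picture "two trainable parameters for dynamic loss weighting" is the sketch below. It is not GoR's formulation, which the abstract does not give; it substitutes the well-known uncertainty-style weighting, in which each loss is scaled by `exp(-s)` and `s` itself is added as a penalty. The loss values 0.5 and 8.0, the learning rate, and all names are assumptions chosen to mimic losses of very different magnitude.

```python
import math

def weighted_loss(task_loss, kd_loss, s_task, s_kd):
    """Combined objective: exp(-s) * loss + s for each of the two terms."""
    return (math.exp(-s_task) * task_loss + s_task
            + math.exp(-s_kd) * kd_loss + s_kd)

def update_weights(task_loss, kd_loss, s_task, s_kd, lr=0.1):
    """One gradient-descent step on the two learnable parameters."""
    grad_task = 1.0 - math.exp(-s_task) * task_loss  # d/ds of exp(-s) * L + s
    grad_kd = 1.0 - math.exp(-s_kd) * kd_loss
    return s_task - lr * grad_task, s_kd - lr * grad_kd

# Hold the two losses fixed at very different scales and let the weights adapt.
s_task, s_kd = 0.0, 0.0
for _ in range(200):
    s_task, s_kd = update_weights(task_loss=0.5, kd_loss=8.0, s_task=s_task, s_kd=s_kd)

w_task, w_kd = math.exp(-s_task), math.exp(-s_kd)
print(round(w_task * 0.5, 3), round(w_kd * 8.0, 3))
```

In this toy setting the learned weights shrink the large loss and amplify the small one until both contribute equally, which is the kind of balance between heterogeneous magnitudes that the abstract describes.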

[98] Plant identification based on noisy web data: the amazing performance of deep learning (LifeCLEF 2017)

Herve Goeau,Pierre Bonnet,Alexis Joly

Main category: cs.CV

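TL;DR: The LifeCLEF 2017 plant identification challenge compared a large, noisy web-collected training set with a smaller expert-checked training set for plant identification, using data from the Pl@ntNet application as the test set.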

Details Motivation: Despite large plant image collections, most species still have no pictures or only a few; meanwhile, the web holds many non-institutional plant images of uneven labelling quality, so the value of noisy data for large-scale plant identification needs to be assessed. Method: A large noisy training set built by web crawling is compared with a smaller, high-quality training set checked by experts, and images from an independent source, the Pl@ntNet application, serve as the test set to keep the comparison fair. Result: The challenge provides a systematic evaluation of the two training strategies; several research groups submitted deep-learning-based plant identification systems, and the results show both the potential and the limits of noisy data for real identification performance. Conclusion: Large web-collected noisy datasets are competitive for plant identification, but labelling accuracy and model robustness still need improvement, and combining them with high-quality data may be more promising. Abstract: The 2017 edition of the LifeCLEF plant identification challenge is an important milestone towards automated plant identification systems working at the scale of continental floras with 10,000 plant species living mainly in Europe and North America, illustrated by a total of 1.1M images. Nowadays, such ambitious systems are enabled thanks to the conjunction of the dazzling recent progress in image classification with deep learning and several outstanding international initiatives, such as the Encyclopedia of Life (EOL), aggregating the visual knowledge on plant species coming from the main national botany institutes. However, despite all these efforts the majority of the plant species still remain without pictures or are poorly illustrated. Outside the institutional channels, a much larger number of plant pictures are available and spread on the web through botanist blogs, plant lovers' web pages, image hosting websites and on-line plant retailers. The LifeCLEF 2017 plant challenge presented in this paper aimed at evaluating to what extent a large noisy training dataset collected through the web and containing a lot of labelling errors can compete with a smaller but trusted training dataset checked by experts. To fairly compare both training strategies, the test dataset was created from a third data source, i.e. the Pl@ntNet mobile application that collects millions of plant image queries all over the world. This paper presents more precisely the resources and assessments of the challenge, summarizes the approaches and systems employed by the participating research groups, and provides an analysis of the main outcomes.

[99] TasselNetV4: A vision foundation model for cross-scene, cross-scale, and cross-species plant counting

Xiaonan Hu,Xuebing Li,Jinyu Xu,Abdulkadir Duran Adan,Letian Zhou,Xuhui Zhu,Yanan Li,Wei Guo,Shouyang Liu,Wenzhong Liu,Hao Lu

Main category: cs.CV

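TL;DR: This paper proposes TasselNetV4, a vision-transformer-based cross-species plant counting model that combines local counting with the extract-and-match paradigm to improve plant counting across scenes, scales, and species.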

Details Motivation: Conventional plant counting models are tied to specific species and cannot keep up with newly bred cultivars. Inspired by class-agnostic counting, the authors reframe plant counting from "what to count" to "how to count" in order to handle the dynamic, non-rigid nature of plants. Method: Inherits the local counting idea of TasselNet, adopts the extract-and-match paradigm from class-agnostic counting, builds on a plain vision transformer, and adds multi-branch box-aware local counters to improve cross-scale robustness. Two new datasets, PAC-105 and PAC-Somalia, are built for the experiments. Result: Compared with several state-of-the-art class-agnostic counting models, TasselNetV4 achieves better counting performance and high efficiency, and works reliably across scenes, scales, and species. Conclusion: TasselNetV4 can serve as a vision foundation model for plant counting, offering a general and efficient solution for agricultural tasks such as crop yield prediction and density assessment. Abstract: Accurate plant counting provides valuable information for agriculture such as crop yield prediction, plant density assessment, and phenotype quantification. Vision-based approaches are currently the mainstream solution. Prior art typically uses a detection or a regression model to count a specific plant. However, plants have biodiversity, and new cultivars are increasingly bred each year. It is almost impossible to exhaust and build all species-dependent counting models. Inspired by class-agnostic counting (CAC) in computer vision, we argue that it is time to rethink the problem formulation of plant counting, from what plants to count to how to count plants. In contrast to most daily objects with spatial and temporal invariance, plants are dynamic, changing with time and space. Their non-rigid structure often leads to worse performance than counting rigid instances like heads and cars such that current CAC and open-world detection models are suboptimal to count plants. In this work, we inherit the vein of the TasselNet plant counting model and introduce a new extension, TasselNetV4, shifting from species-specific counting to cross-species counting. TasselNetV4 marries the local counting idea of TasselNet with the extract-and-match paradigm in CAC. It builds upon a plain vision transformer and incorporates novel multi-branch box-aware local counters used to enhance cross-scale robustness. Two challenging datasets, PAC-105 and PAC-Somalia, are harvested. Extensive experiments against state-of-the-art CAC models show that TasselNetV4 achieves not only superior counting performance but also high efficiency. Our results indicate that TasselNetV4 emerges to be a vision foundation model for cross-scene, cross-scale, and cross-species plant counting.

[100] SD-RetinaNet: Topologically Constrained Semi-Supervised Retinal Lesion and Layer Segmentation in OCT

Botond Fazekas,Guilherme Aresta,Philipp Seeböck,Julia Mai,Ursula Schmidt-Erfurth,Hrvoje Bogunović

Main category: cs.CV

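TL;DR: Proposes a new semi-supervised model that introduces a fully differentiable biomarker topology engine to produce anatomically correct retinal layer and lesion segmentations, substantially improving the accuracy and robustness of layer and lesion segmentation in OCT images.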

Details Motivation: Existing semi-supervised methods for retinal OCT segmentation often produce anatomically implausible structures, fail to model layer-lesion interactions effectively, and offer no guarantee of topological correctness. Method: Proposes a new semi-supervised model with a fully differentiable biomarker topology engine that enables joint learning of layers and lesions with bidirectional influence, learns a disentangled representation separating spatial and style factors, and trains on unlabeled and partially labeled data. Result: On public and internal OCT datasets, the model outperforms the current state of the art in both layer and lesion segmentation, and it generalizes layer segmentation to pathological cases using partially annotated data. Conclusion: Bringing anatomical constraints into semi-supervised learning can substantially improve the accuracy, robustness, and trustworthiness of retinal biomarker segmentation, with potential for clinical use. Abstract: Optical coherence tomography (OCT) is widely used for diagnosing and monitoring retinal diseases, such as age-related macular degeneration (AMD). The segmentation of biomarkers such as layers and lesions is essential for patient diagnosis and follow-up. Recently, semi-supervised learning has shown promise in improving retinal segmentation performance. However, existing methods often produce anatomically implausible segmentations, fail to effectively model layer-lesion interactions, and lack guarantees on topological correctness. To address these limitations, we propose a novel semi-supervised model that introduces a fully differentiable biomarker topology engine to enforce anatomically correct segmentation of lesions and layers. This enables joint learning with bidirectional influence between layers and lesions, leveraging unlabeled and diverse partially labeled datasets. Our model learns a disentangled representation, separating spatial and style factors. This approach enables more realistic layer segmentations and improves lesion segmentation, while strictly enforcing lesion location in their anatomically plausible positions relative to the segmented layers. We evaluate the proposed model on public and internal datasets of OCT scans and show that it outperforms the current state-of-the-art in both lesion and layer segmentation, while demonstrating the ability to generalize layer segmentation to pathological cases using partially annotated training data. Our results demonstrate the potential of using anatomical constraints in semi-supervised learning for accurate, robust, and trustworthy retinal biomarker segmentation.

[101] Plant identification in an open-world (LifeCLEF 2016)

Herve Goeau,Pierre Bonnet,Alexis Joly

Main category: cs.CV

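TL;DR: The LifeCLEF 2016 plant identification challenge evaluated plant identification methods at large scale under conditions close to real-world biodiversity monitoring. For the first time the task was framed as open-set recognition, requiring systems to reject misclassifications caused by unknown species.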

Details Motivation: To advance automated plant identification, in particular the ability to handle unknown species under open-set conditions, so as to better support real-world biodiversity monitoring. Method: Uses a dataset of more than 110K images covering 1000 plant species of West Europe, built through a large-scale participatory sensing platform; the identification task is defined as an open-set recognition problem, and systems are evaluated on their robustness to unknown categories. Result: Several research groups submitted systems based on different approaches, some combining deep learning with a rejection mechanism for unknown classes; the results show that existing systems face significant challenges under open-set conditions, especially in controlling false positives. Conclusion: Open-set recognition is a key challenge on the way to practical automated plant identification, and future work needs to strengthen models' ability to detect and reject unknown samples. Abstract: The LifeCLEF plant identification challenge aims at evaluating plant identification methods and systems at a very large scale, close to the conditions of a real-world biodiversity monitoring scenario. The 2016 edition was conducted on a set of more than 110K images illustrating 1000 plant species living in West Europe, built through a large-scale participatory sensing platform initiated in 2011 and which now involves tens of thousands of contributors. The main novelty over the previous years is that the identification task was evaluated as an open-set recognition problem, i.e. a problem in which the recognition system has to be robust to unknown and never seen categories. Beyond the brute-force classification across the known classes of the training set, the big challenge was thus to automatically reject the false positive classification hits that are caused by the unknown classes. This overview presents more precisely the resources and assessments of the challenge, summarizes the approaches and systems employed by the participating research groups, and provides an analysis of the main outcomes.
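
The rejection step that open-set recognition requires can be illustrated with a confidence threshold. This is only one simple baseline and not necessarily what any participant used: the function name, the 0.5 threshold, and the three-species toy classifier output are assumptions made for the example.

```python
def predict_open_set(class_probs, class_names, threshold=0.5):
    """Return the top known class, or None when confidence is too low.

    A None result means the image is treated as an unknown species
    instead of being forced into one of the known classes.
    """
    best = max(range(len(class_probs)), key=lambda i: class_probs[i])
    if class_probs[best] < threshold:
        return None
    return class_names[best]

# Illustrative known classes; the probabilities stand in for a classifier's output.
known_species = ["Quercus robur", "Acer campestre", "Fagus sylvatica"]

confident = predict_open_set([0.85, 0.10, 0.05], known_species)
uncertain = predict_open_set([0.40, 0.35, 0.25], known_species)
print(confident, uncertain)
```

Raising the threshold rejects more unknown-class images at the cost of also rejecting some correct identifications, which is the false-positive trade-off the challenge set out to measure.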

[102] SCRA-VQA: Summarized Caption-Rerank for Augmented Large Language Models in Visual Question Answering

Yan Zhang,Jiaqing Lin,Miao Zhang,Kui Xiao,Xiaoju Hou,Yue Zhao,Zhifei Li

Main category: cs.CV

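TL;DR: Proposes SCRA-VQA, a method that generates image captions with a pre-trained vision-language model and then summarizes and reranks them to remove irrelevant information, improving the reasoning ability and task adaptability of large language models on knowledge-based visual question answering.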

Details Motivation: Existing methods pair large language models with image captions for knowledge-based visual question answering, but the captions often contain noise irrelevant to the question, and LLMs lack an understanding of the VQA task, which limits their reasoning. Method: The SCRA-VQA framework generates image captions with a pre-trained vision-language model and refines them through contextual example generation, summarization, and reranking, so that the LLM understands the image and the question more accurately without end-to-end training. Result: With a 6.7B-parameter LLM, SCRA-VQA reaches accuracies of 38.8% on OK-VQA and 34.6% on A-OKVQA. Conclusion: By improving how captions are generated and organized, SCRA-VQA raises LLM performance on knowledge-based VQA without expensive end-to-end training and adapts well to the task. Abstract: Acquiring high-quality knowledge is a central focus in Knowledge-Based Visual Question Answering (KB-VQA). Recent methods use large language models (LLMs) as knowledge engines for answering. These methods generally employ image captions as visual text descriptions to assist LLMs in interpreting images. However, the captions frequently include excessive noise irrelevant to the question, and LLMs generally do not comprehend VQA tasks, limiting their reasoning capabilities. To address this issue, we propose the Summarized Caption-Rerank Augmented VQA (SCRA-VQA), which employs a pre-trained visual language model to convert images into captions. Moreover, SCRA-VQA generates contextual examples for the captions while simultaneously summarizing and reordering them to exclude unrelated information. The caption-rerank process enables LLMs to understand the image information and questions better, thus enhancing the model's reasoning ability and task adaptability without expensive end-to-end training. Based on an LLM with 6.7B parameters, SCRA-VQA performs excellently on two challenging knowledge-based VQA datasets: OK-VQA and A-OKVQA, achieving accuracies of 38.8% and 34.6%. Our code is available at https://github.com/HubuKG/SCRA-VQA.

[103] The Unanticipated Asymmetry Between Perceptual Optimization and Assessment

Jiabei Zhang,Qi Wang,Siyu Wu,Du Chen,Tianhe Wu

Main category: cs.CV

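TL;DR: A systematic analysis of the relationship between perceptual optimization and image quality assessment (IQA) reveals an unexpected asymmetry between optimization objectives and assessment metrics: fidelity metrics that excel at IQA are not necessarily suitable for perceptual optimization, and the mismatch is more pronounced under adversarial training. Discriminators suppress artifacts effectively during optimization, yet their learned representations help IQA models only marginally. Discriminator design is decisive for optimization quality, with patch-level and convolutional architectures reconstructing detail better than vanilla or Transformer-based ones.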

Details Motivation: Fidelity and adversarial objectives are central to perceptual optimization, but the relationship between their effectiveness as optimization objectives and their capability as image quality assessment metrics has not been explored in depth. The paper sets out to analyze this relationship systematically and expose the asymmetry within it. Method: Systematic experiments compare how different fidelity and adversarial objectives behave in perceptual optimization versus image quality assessment, study how discriminator architecture affects optimization, and evaluate how well discriminator representations transfer to IQA. Result: Fidelity metrics that perform well in IQA are not necessarily good perceptual optimization objectives, with a clear mismatch under adversarial training; discriminators suppress artifacts, but their representations give limited benefit as initializations for IQA models; patch-level and convolutional discriminators reconstruct detail better than other architectures. Conclusion: There is an overlooked asymmetry between perceptual optimization and image quality assessment, and discriminator design is critical to optimization quality; these findings support more principled loss function design and inform the transferability of IQA. Abstract: Perceptual optimization is primarily driven by the fidelity objective, which enforces both semantic consistency and overall visual realism, while the adversarial objective provides complementary refinement by enhancing perceptual sharpness and fine-grained detail. Despite their central role, the correlation between their effectiveness as optimization objectives and their capability as image quality assessment (IQA) metrics remains underexplored. In this work, we conduct a systematic analysis and reveal an unanticipated asymmetry between perceptual optimization and assessment: fidelity metrics that excel in IQA are not necessarily effective for perceptual optimization, with this misalignment emerging more distinctly under adversarial training. In addition, while discriminators effectively suppress artifacts during optimization, their learned representations offer only limited benefits when reused as backbone initializations for IQA models. Beyond this asymmetry, our findings further demonstrate that discriminator design plays a decisive role in shaping optimization, with patch-level and convolutional architectures providing more faithful detail reconstruction than vanilla or Transformer-based alternatives. These insights advance the understanding of loss function design and its connection to IQA transferability, paving the way for more principled approaches to perceptual optimization.

[104] Integrating Object Interaction Self-Attention and GAN-Based Debiasing for Visual Question Answering

Zhifei Li,Feng Qiu,Yiran Wang,Yujing Xia,Kui Xiao,Miao Zhang,Yan Zhang

Main category: cs.CV

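TL;DR: Proposes IOG-VQA, a VQA model that combines object interaction self-attention with GAN-based debiasing to improve visual question answering performance in the presence of dataset bias.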

Details Motivation: Existing VQA models are easily affected by biases in the training data, which leads to over-reliance on superficial patterns and poor generalization. Method: An object interaction self-attention mechanism captures complex interactions between objects in an image, and a GAN-based debiasing framework generates unbiased data distributions, improving robustness and generalization. Result: Experiments on VQA-CP v1 and v2 show that the model clearly outperforms existing methods when handling biased and imbalanced data. Conclusion: Modeling object interactions and mitigating dataset bias together is essential for improving the generalization of VQA models. Abstract: Visual Question Answering (VQA) presents a unique challenge by requiring models to understand and reason about visual content to answer questions accurately. Existing VQA models often struggle with biases introduced by the training data, leading to over-reliance on superficial patterns and inadequate generalization to diverse questions and images. This paper presents a novel model, IOG-VQA, which integrates Object Interaction Self-Attention and GAN-Based Debiasing to enhance VQA model performance. The self-attention mechanism allows our model to capture complex interactions between objects within an image, providing a more comprehensive understanding of the visual context. Meanwhile, the GAN-based debiasing framework generates unbiased data distributions, helping the model to learn more robust and generalizable features. By leveraging these two components, IOG-VQA effectively combines visual and textual information to address the inherent biases in VQA datasets. Extensive experiments on the VQA-CP v1 and VQA-CP v2 datasets demonstrate that our model shows excellent performance compared with the existing methods, particularly in handling biased and imbalanced data distributions, highlighting the importance of addressing both object interactions and dataset biases in advancing VQA tasks. Our code is available at https://github.com/HubuKG/IOG-VQA.

[105] Nuclear Diffusion Models for Low-Rank Background Suppression in Videos

Tristan S. W. Stevens,Oisín Nolan,Jean-Luc Robert,Ruud J. G. van Sloun

Main category: cs.CV

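TL;DR: Proposes Nuclear Diffusion, a hybrid framework that combines low-rank temporal modeling with diffusion posterior sampling for video denoising and restoration, and that outperforms traditional RPCA on cardiac ultrasound dehazing.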

Details Motivation: Robust principal component analysis (RPCA) relies on a sparsity assumption that fails to capture the rich dynamics of real video data, which limits its performance when structured noise and background artifacts are present. Method: The Nuclear Diffusion framework combines low-rank temporal modeling with the generative prior of a diffusion model: the low-rank structure accounts for the background, and diffusion posterior sampling recovers the dynamic content and its detail. Result: On real cardiac ultrasound dehazing, Nuclear Diffusion outperforms traditional RPCA in both contrast enhancement (gCNR) and signal preservation (KS statistic). Conclusion: Combining model-based temporal modeling with deep generative priors improves video restoration quality and is particularly suited to high-fidelity denoising and enhancement of complex medical video. Abstract: Video sequences often contain structured noise and background artifacts that obscure dynamic content, posing challenges for accurate analysis and restoration. Robust principal component methods address this by decomposing data into low-rank and sparse components. Still, the sparsity assumption often fails to capture the rich variability present in real video data. To overcome this limitation, a hybrid framework that integrates low-rank temporal modeling with diffusion posterior sampling is proposed. The proposed method, Nuclear Diffusion, is evaluated on a real-world medical imaging problem, namely cardiac ultrasound dehazing, and demonstrates improved dehazing performance compared to traditional RPCA concerning contrast enhancement (gCNR) and signal preservation (KS statistic). These results highlight the potential of combining model-based temporal models with deep generative priors for high-fidelity video restoration.

[106] FerretNet: Efficient Synthetic Image Detection via Local Pixel Dependencies

Shuqiao Liang,Jian Liu,Renzhang Chen,Quanlong Guan

Main category: cs.CV

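TL;DR: Proposes FerretNet, a lightweight neural network for synthetic image detection based on local pixel dependencies (LPD), which reaches 97.1% average accuracy on an open-world benchmark spanning 22 generative models.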

Details Motivation: As synthetic images generated by VAEs, GANs, and LDMs become increasingly realistic, traditional detection methods struggle and more robust detectors are needed. Method: The approach exploits two artifact types introduced during generation, latent distribution deviations and decoding-induced smoothing. Using local pixel dependencies (LPD) rooted in Markov Random Fields, it reconstructs images from neighboring pixels to expose inconsistencies in texture and edges, and it introduces FerretNet, a lightweight detection network with only 1.1M parameters. Result: Trained only on the 4-class ProGAN dataset, FerretNet reaches an average accuracy of 97.1% on an open-world benchmark of 22 generative models, surpassing the previous best methods by 10.6%. Conclusion: FerretNet is an efficient and robust synthetic image detector with good generalization and practical potential. Abstract: The increasing realism of synthetic images generated by advanced models such as VAEs, GANs, and LDMs poses significant challenges for synthetic image detection. To address this issue, we explore two artifact types introduced during the generation process: (1) latent distribution deviations and (2) decoding-induced smoothing effects, which manifest as inconsistencies in local textures, edges, and color transitions. Leveraging local pixel dependencies (LPD) properties rooted in Markov Random Fields, we reconstruct synthetic images using neighboring pixel information to expose disruptions in texture continuity and edge coherence. Building upon LPD, we propose FerretNet, a lightweight neural network with only 1.1M parameters that delivers efficient and robust synthetic image detection. Extensive experiments demonstrate that FerretNet, trained exclusively on the 4-class ProGAN dataset, achieves an average accuracy of 97.1% on an open-world benchmark comprising 22 generative models, surpassing state-of-the-art methods by 10.6%.
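The idea of reconstructing each pixel from its neighbors and inspecting what is left over can be sketched as follows. This is a toy illustration under stated assumptions, not FerretNet's actual preprocessing: the choice of the median over the 8-neighborhood, the border handling, and the function names are all assumptions made for the example.

```python
import statistics


def neighbour_reconstruction(img):
    """Predict every pixel as the median of its (up to 8) neighbours.

    `img` is a list of rows of numbers (a single-channel image). Border pixels
    simply use the neighbours that exist. The median is an assumed stand-in for
    whatever neighbour-based predictor a real detector would use.
    """
    h, w = len(img), len(img[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            vals = [
                img[ny][nx]
                for ny in range(max(0, y - 1), min(h, y + 2))
                for nx in range(max(0, x - 1), min(w, x + 2))
                if (ny, nx) != (y, x)
            ]
            out[y][x] = statistics.median(vals)
    return out


def lpd_residual(img):
    """Residual between the image and its neighbour-based reconstruction."""
    recon = neighbour_reconstruction(img)
    return [[img[y][x] - recon[y][x] for x in range(len(img[0]))]
            for y in range(len(img))]


flat = [[5, 5, 5], [5, 5, 5], [5, 5, 5]]    # perfectly smooth patch
spike = [[0, 0, 0], [0, 9, 0], [0, 0, 0]]   # one pixel breaks local continuity
flat_res = lpd_residual(flat)
spike_res = lpd_residual(spike)
```

A smooth patch yields a zero residual, while a pixel that disagrees with its surroundings stands out; a detector would then be trained on such residual maps rather than on raw pixels.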

[107] Concepts in Motion: Temporal Bottlenecks for Interpretable Video Classification

Patrick Knab,Sascha Marton,Philipp J. Schubert,Drago Guggiana,Christian Bartelt

Main category: cs.CV

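TL;DR: Introduces MoTIF, a Transformer-inspired interpretable framework that extends concept bottleneck models to video classification by modeling semantic concepts and their dynamics over time, providing interpretable analysis of actions in video while keeping performance competitive.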

Details Motivation: Existing concept bottleneck models mainly target static images and cannot handle the temporal dependencies that matter in video. Interpretable video classification needs a model that captures concepts recurring across time together with their temporal dynamics. Method: The MoTIF framework uses a Transformer-like architecture to extend the concept bottleneck mechanism to video. It models three perspectives: global concept importance across the whole video, local concept relevance within specific time windows, and the temporal dependencies of a concept over time. Result: Experiments show that MoTIF transfers the concept-based modeling paradigm to video effectively, exposes concept contributions clearly at several temporal scales, and offers strong interpretability while retaining good classification performance. Conclusion: MoTIF extends interpretable concept bottleneck models to the video domain, supports sequences of arbitrary length, and provides semantic, time-aware explanations of actions in video. Abstract: Conceptual models such as Concept Bottleneck Models (CBMs) have driven substantial progress in improving interpretability for image classification by leveraging human-interpretable concepts. However, extending these models from static images to sequences of images, such as video data, introduces a significant challenge due to the temporal dependencies inherent in videos, which are essential for capturing actions and events. In this work, we introduce MoTIF (Moving Temporal Interpretable Framework), an architectural design inspired by a transformer that adapts the concept bottleneck framework for video classification and handles sequences of arbitrary length. Within the video domain, concepts refer to semantic entities such as objects, attributes, or higher-level components (e.g., 'bow', 'mount', 'shoot') that reoccur across time - forming motifs collectively describing and explaining actions. Our design explicitly enables three complementary perspectives: global concept importance across the entire video, local concept relevance within specific windows, and temporal dependencies of a concept over time. Our results demonstrate that the concept-based modeling paradigm can be effectively transferred to video data, enabling a better understanding of concept contributions in temporal contexts while maintaining competitive performance. Code available at github.com/patrick-knab/MoTIF.

[108] FSMODNet: A Closer Look at Few-Shot Detection in Multispectral Data

Manuel Nkegoum,Minh-Tan Pham,Élisa Fromont,Bruno Avignon,Sébastien Lefèvre

Main category: cs.CV

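TL;DR: Proposes FSMODNet, a framework for few-shot multispectral object detection (FSMOD) that fuses visible and thermal features with deformable attention and achieves robust detection with very little annotated data.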

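Details Motivation: Annotating multispectral data is costly and time-consuming, so cross-modal (visible and thermal) object detection from only a few labeled samples is a key challenge. Existing methods generalize poorly in low-data regimes, which calls for a more effective cross-modal feature fusion mechanism. Method: FSMODNet fuses cross-modal features with deformable attention, combining the fine detail of visible imagery with the strengths of thermal imagery under poor illumination, and adds a few-shot learning strategy within a two-stage detection framework to improve adaptation to novel classes. Result: On two public datasets, FSMODNet clearly outperforms several baselines built from state-of-the-art models under extremely limited annotation, confirming its effectiveness and robustness in complex illumination and environmental conditions. Conclusion: By integrating cross-modal features through deformable attention, FSMODNet offers a new solution for few-shot multispectral object detection with strong potential in practical applications. Abstract: Few-shot multispectral object detection (FSMOD) addresses the challenge of detecting objects across visible and thermal modalities with minimal annotated data. In this paper, we explore this complex task and introduce a framework named "FSMODNet" that leverages cross-modality feature integration to improve detection performance even with limited labels. By effectively combining the unique strengths of visible and thermal imagery using deformable attention, the proposed method demonstrates robust adaptability in complex illumination and environmental conditions. Experimental results on two public datasets show effective object detection performance in challenging low-data regimes, outperforming several baselines we established from state-of-the-art models. All code, models, and experimental data splits can be found at https://anonymous.4open.science/r/Test-B48D.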

[109] Finding 3D Positions of Distant Objects from Noisy Camera Movement and Semantic Segmentation Sequences

Julius Pesonen,Arno Solin,Eija Honkavaara

Main category: cs.CV

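TL;DR: Proposes using particle filters to localize objects in 3D from camera poses and image segmentation when compute is limited or the targets are far away, with application to safety-critical tasks such as drone-based wildfire monitoring.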

Details Motivation: For distant objects or under tight compute budgets, conventional dense depth estimation and 3D reconstruction cannot deliver usable 3D object localization, so a more feasible alternative is needed. Method: Particle filters localize single and multiple targets in 3D by combining GNSS-based camera poses with image segmentation results; the method is validated in simulation and on real drone data. Result: The experiments show that a particle filter can solve practical localization tasks where other methods fail, and that the approach is independent of the detection method, which makes it flexible. Conclusion: Particle filters are an effective method for 3D object localization from camera sequences in resource-constrained settings and are well suited to applications such as drone-based wildfire monitoring. Abstract: 3D object localisation based on a sequence of camera measurements is essential for safety-critical surveillance tasks, such as drone-based wildfire monitoring. Localisation of objects detected with a camera can typically be solved with dense depth estimation or 3D scene reconstruction. However, in the context of distant objects or tasks limited by the amount of available computational resources, neither solution is feasible. In this paper, we show that the task can be solved using particle filters for both single and multiple target scenarios. The method was studied using a 3D simulation and a drone-based image segmentation sequence with global navigation satellite system (GNSS)-based camera pose estimates. The results showed that a particle filter can be used to solve practical localisation tasks based on camera poses and image segments in situations where other solutions fail. The particle filter is independent of the detection method, making it flexible for new tasks. The study also demonstrates that drone-based wildfire monitoring can be conducted using the proposed method paired with a pre-existing image segmentation model.
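As a rough illustration of how a particle filter can localize a static target from a moving camera, the sketch below works in 2D and replaces the segmentation-mask measurement with a single noisy bearing per frame. Both simplifications, along with every name, the noise levels, and the scene geometry, are assumptions made for this example and are not taken from the paper.

```python
import math
import random


def bearing(cam, point):
    """Angle (radians) from the camera position to a point."""
    return math.atan2(point[1] - cam[1], point[0] - cam[0])


def angle_diff(a, b):
    """Smallest signed difference between two angles."""
    d = a - b
    return math.atan2(math.sin(d), math.cos(d))


def systematic_resample(particles, weights, rng):
    """Draw len(particles) samples in proportion to the normalised weights."""
    n = len(particles)
    step = 1.0 / n
    u = rng.random() * step
    out, cum, i = [], weights[0], 0
    for _ in range(n):
        while u > cum and i < n - 1:
            i += 1
            cum += weights[i]
        out.append(particles[i])
        u += step
    return out


def localise(cams, bearings, n=4000, sigma=0.03, jitter=0.3, area=100.0, seed=0):
    """Estimate a static target position from camera positions and bearings."""
    rng = random.Random(seed)
    particles = [(rng.uniform(0, area), rng.uniform(0, area)) for _ in range(n)]
    weights = [1.0 / n] * n
    for cam, z in zip(cams, bearings):
        # Re-weight each particle by the Gaussian likelihood of the bearing.
        weights = [
            w * math.exp(-0.5 * (angle_diff(z, bearing(cam, p)) / sigma) ** 2)
            for w, p in zip(weights, particles)
        ]
        total = sum(weights)
        # If every particle is inconsistent, fall back to uniform weights.
        weights = [w / total for w in weights] if total > 0 else [1.0 / n] * n
        # Resample (with a little jitter) when the effective sample size drops.
        if 1.0 / sum(w * w for w in weights) < n / 2:
            particles = systematic_resample(particles, weights, rng)
            particles = [(x + rng.gauss(0, jitter), y + rng.gauss(0, jitter))
                         for x, y in particles]
            weights = [1.0 / n] * n
    est_x = sum(w * p[0] for w, p in zip(weights, particles))
    est_y = sum(w * p[1] for w, p in zip(weights, particles))
    return est_x, est_y


def demo():
    """Simulate a camera moving along the x-axis past a fixed target."""
    rng = random.Random(1)
    target = (50.0, 30.0)
    cams = [(2.0 * k, 0.0) for k in range(31)]
    noisy = [bearing(c, target) + rng.gauss(0, 0.03) for c in cams]
    return localise(cams, noisy), target
```

Calling `demo()` returns the estimated and true positions. Because the filter only needs a likelihood for each measurement, the bearing model could be swapped for one based on segmentation masks without changing the rest of the loop, which mirrors the detector independence described in the abstract.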

[110] SwinMamba: A hybrid local-global mamba framework for enhancing semantic segmentation of remotely sensed images

Qinfeng Zhu,Han Li,Liang He,Lei Fan

Main category: cs.CV

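TL;DR: Proposes SwinMamba, a semantic segmentation framework for remote sensing imagery that combines local and global scanning to improve perception of both local detail and global context at low computational complexity, outperforming existing methods.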

Details Motivation: Because it relies on global scanning, Vision Mamba overlooks critical local features such as textures and edges in remote sensing segmentation, which hurts accuracy. Method: Inspired by the Swin Transformer, SwinMamba introduces localized Mamba-style scanning within shifted windows. The first two stages scan locally to capture fine-grained features, the last two stages scan globally to fuse contextual information, and overlapping shifted windows strengthen information exchange between regions. Result: Experiments on the LoveDA and ISPRS Potsdam datasets show that SwinMamba outperforms current state-of-the-art methods. Conclusion: SwinMamba balances the modeling of local detail and global context and is an efficient, capable solution for semantic segmentation of remote sensing imagery. Abstract: Semantic segmentation of remote sensing imagery is a fundamental task in computer vision, supporting a wide range of applications such as land use classification, urban planning, and environmental monitoring. However, this task is often challenged by the high spatial resolution, complex scene structures, and diverse object scales present in remote sensing data. To address these challenges, various deep learning architectures have been proposed, including convolutional neural networks, Vision Transformers, and the recently introduced Vision Mamba. Vision Mamba features a global receptive field and low computational complexity, demonstrating both efficiency and effectiveness in image segmentation. However, its reliance on global scanning tends to overlook critical local features, such as textures and edges, which are essential for achieving accurate segmentation in remote sensing contexts. To tackle this limitation, we propose SwinMamba, a novel framework inspired by the Swin Transformer. SwinMamba integrates localized Mamba-style scanning within shifted windows with a global receptive field, to enhance the model's perception of both local and global features. Specifically, the first two stages of SwinMamba perform local scanning to capture fine-grained details, while its subsequent two stages leverage global scanning to fuse broader contextual information. In our model, the use of overlapping shifted windows enhances inter-region information exchange, facilitating more robust feature integration across the entire image. Extensive experiments on the LoveDA and ISPRS Potsdam datasets demonstrate that SwinMamba outperforms state-of-the-art methods, underscoring its effectiveness and potential as a superior solution for semantic segmentation of remotely sensed imagery.

[111] Revisiting Data Challenges of Computational Pathology: A Pack-based Multiple Instance Learning Framework

Wenhao Tang,Heng Fang,Ge Wu,Xiang Li,Ming-Ming Cheng

Main category: cs.CV

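TL;DR: Proposes PackMIL, a pack-based multiple instance learning framework that addresses the extremely long, highly variable sequence lengths and limited supervision of whole slide images in computational pathology, markedly improving training efficiency and accuracy.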

Details Motivation: Whole slide images (WSIs) have extremely long and variable sequence lengths, high heterogeneity, and heavy redundancy, and under limited supervision conventional methods struggle to achieve both training efficiency and good optimization. Method: The PackMIL framework packs multiple variable-length feature sequences into fixed-length sequences to enable batched training, adds a residual branch that composes the discarded features of several slides into a "hyperslide" providing multi-slide supervision, and uses an attention-driven downsampler to reduce redundancy. Result: On PANDA (UNI), accuracy improves by up to 8% while training takes only 12% of the original time. Conclusion: By directly targeting data heterogeneity, redundancy, and limited supervision, PackMIL substantially improves the training efficiency and performance of computational pathology models, indicating that attention to data challenges remains important in the era of foundation models. Abstract: Computational pathology (CPath) digitizes pathology slides into whole slide images (WSIs), enabling analysis for critical healthcare tasks such as cancer diagnosis and prognosis. However, WSIs possess extremely long sequence lengths (up to 200K), significant length variations (from 200 to 200K), and limited supervision. These extreme variations in sequence length lead to high data heterogeneity and redundancy. Conventional methods often compromise on training efficiency and optimization to preserve such heterogeneity under limited supervision. To comprehensively address these challenges, we propose a pack-based MIL framework. It packs multiple sampled, variable-length feature sequences into fixed-length ones, enabling batched training while preserving data heterogeneity. Moreover, we introduce a residual branch that composes discarded features from multiple slides into a hyperslide which is trained with tailored labels. It offers multi-slide supervision while mitigating feature loss from sampling. Meanwhile, an attention-driven downsampler is introduced to compress features in both branches to reduce redundancy. By alleviating these challenges, our approach achieves an accuracy improvement of up to 8% while using only 12% of the training time in the PANDA(UNI). Extensive experiments demonstrate that focusing on data challenges in CPath holds significant potential in the era of foundation models. The code is available at https://github.com/FangHeng/PackMIL
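The packing step can be pictured with a small bin-packing sketch. This is a generic first-fit-decreasing packer, not PackMIL's procedure: the paper samples features and routes discarded ones to a residual branch, whereas this example simply splits over-long sequences into chunks. The function name, the chunking rule, and the pack length of 1024 are assumptions for illustration.

```python
def pack_sequences(lengths, pack_len):
    """Pack variable-length sequences into packs of capacity `pack_len`.

    Returns (packs, free) where each pack is a list of (seq_id, start, end)
    slices and `free[i]` is the unused capacity (padding) left in pack i.
    Sequences longer than `pack_len` are first cut into chunks (an assumption
    of this sketch), then placed with first-fit-decreasing.
    """
    pieces = []
    for seq_id, n in enumerate(lengths):
        for start in range(0, n, pack_len):
            pieces.append((seq_id, start, min(start + pack_len, n)))
    # Largest pieces first, so small ones fill the remaining gaps.
    pieces.sort(key=lambda p: p[2] - p[1], reverse=True)

    packs, free = [], []
    for piece in pieces:
        size = piece[2] - piece[1]
        for i, room in enumerate(free):
            if size <= room:
                packs[i].append(piece)
                free[i] -= size
                break
        else:
            # No existing pack has room: open a new one.
            packs.append([piece])
            free.append(pack_len - size)
    return packs, free


# Hypothetical per-slide patch counts with a wide spread of lengths.
slide_lengths = [200, 1500, 700, 3000, 90]
packs, free = pack_sequences(slide_lengths, pack_len=1024)
```

Every pack now has the same nominal length, so packs can be stacked into a batch; an attention mask built from the `(seq_id, start, end)` slices would keep tokens from different slides from attending to each other.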

[112] SimDiff: Simulator-constrained Diffusion Model for Physically Plausible Motion Generation

Akihisa Watanabe,Jiawei Ren,Li Siyao,Yichen Peng,Erwin Wu,Edgar Simo-Serra

Main category: cs.CV

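TL;DR: Proposes SimDiff, which integrates environment parameters directly into the denoising process to generate physically plausible motion efficiently.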

Details Motivation: Existing methods rely on a simulator for motion projection, which is computationally expensive and hard to parallelize, so a more efficient way to generate physically plausible motion is needed. Method: Simulator constraints are interpreted as a form of guidance within the diffusion process, and SimDiff conditions the denoising process directly on environment parameters such as gravity and wind. Result: SimDiff needs no repeated simulator calls at inference, generates physically plausible motion efficiently, offers fine-grained control over different physical coefficients, and generalizes to unseen combinations of environment parameters. Conclusion: SimDiff balances physical plausibility with computational efficiency and shows good generalization and adaptability to different environments. Abstract: Generating physically plausible human motion is crucial for applications such as character animation and virtual reality. Existing approaches often incorporate a simulator-based motion projection layer into the diffusion process to enforce physical plausibility. However, such methods are computationally expensive due to the sequential nature of the simulator, which prevents parallelization. We show that simulator-based motion projection can be interpreted as a form of guidance, either classifier-based or classifier-free, within the diffusion process. Building on this insight, we propose SimDiff, a Simulator-constrained Diffusion Model that integrates environment parameters (e.g., gravity, wind) directly into the denoising process. By conditioning on these parameters, SimDiff generates physically plausible motions efficiently, without repeated simulator calls at inference, and also provides fine-grained control over different physical coefficients. Moreover, SimDiff successfully generalizes to unseen combinations of environmental parameters, demonstrating compositional generalization.

[113] Unlocking Noise-Resistant Vision: Key Architectural Secrets for Robust Models

Bum Jun Kim,Makoto Kawano,Yusuke Iwasawa,Yutaka Matsuo

Main category: cs.CV

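TL;DR: By analyzing 1,174 pretrained vision models, the paper identifies and theoretically explains four design patterns that improve robustness to Gaussian noise: larger stem kernels, smaller input resolutions, average pooling, and supervised ViTs rather than CLIP ViTs. It distills them into interpretable, practical design guidelines.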

Details Motivation: The robustness of vision models is measured often, but its dependence on architectural design choices is rarely analyzed in depth. The authors aim to explain why some architectures are inherently more robust and to turn empirical observations into actionable design rules. Method: A large-scale empirical evaluation of 1,174 pretrained vision models identifies the design factors that drive robustness to Gaussian noise. A theoretical analysis then explains why they work, covering low-pass filtering, anti-aliased downsampling, the bias of different pooling operators, and a pixel-space Lipschitz bound for CLIP ViTs. Result: Four design patterns improve robustness: larger stem kernels (noise gain decreases quadratically with kernel size), smaller input resolutions (lower noise energy), average pooling (noise suppressed in proportion to the window area), and supervised ViTs over CLIP ViTs (the smaller normalization standard deviations of the latter raise worst-case sensitivity by up to 1.91 times). The experiments show gains of up to 506 ranks and 21.6 percentage points of accuracy. Conclusion: The study decomposes vision model robustness into interpretable modular components, backs the empirical trends with theory, and provides plug-and-play guidelines for building vision models that are more robust to Gaussian noise. Abstract: While the robustness of vision models is often measured, their dependence on specific architectural design choices is rarely dissected. We investigate why certain vision architectures are inherently more robust to additive Gaussian noise and convert these empirical insights into simple, actionable design rules. Specifically, we performed extensive evaluations on 1,174 pretrained vision models, empirically identifying four consistent design patterns for improved robustness against Gaussian noise: larger stem kernels, smaller input resolutions, average pooling, and supervised vision transformers (ViTs) rather than CLIP ViTs, which yield up to 506 rank improvements and 21.6%p accuracy gains. We then develop a theoretical analysis that explains these findings, converting observed correlations into causal mechanisms. First, we prove that low-pass stem kernels attenuate noise with a gain that decreases quadratically with kernel size and that anti-aliased downsampling reduces noise energy roughly in proportion to the square of the downsampling factor. Second, we demonstrate that average pooling is unbiased and suppresses noise in proportion to the pooling window area, whereas max pooling incurs a positive bias that grows slowly with window size and yields a relatively higher mean-squared error and greater worst-case sensitivity. Third, we reveal and explain the vulnerability of CLIP ViTs via a pixel-space Lipschitz bound: The smaller normalization standard deviations used in CLIP preprocessing amplify worst-case sensitivity by up to 1.91 times relative to the Inception-style preprocessing common in supervised ViTs. Our results collectively disentangle robustness into interpretable modules, provide a theory that explains the observed trends, and build practical, plug-and-play guidelines for designing vision models more robust against Gaussian noise.
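Two of the quantitative claims above can be checked with a few lines of arithmetic. The sketch assumes the widely used CLIP preprocessing constants and an Inception-style standard deviation of 0.5 per channel; that the paper's 1.91 figure is derived exactly this way is an assumption, although the numbers agree. The helper names are hypothetical.

```python
# Per-channel normalisation standard deviations (assumed constants: the
# values commonly used in CLIP preprocessing and in Inception-style
# preprocessing, respectively).
CLIP_STD = (0.26862954, 0.26130258, 0.27577711)
INCEPTION_STD = (0.5, 0.5, 0.5)


def normalization_gain(std):
    """Worst-case amplification of a pixel perturbation by x -> (x - mean) / std.

    The map is linear per channel with slope 1/std, so its Lipschitz constant
    is the largest 1/std over the channels.
    """
    return max(1.0 / s for s in std)


def avg_pool_noise_variance(sigma, k):
    """Variance of the mean of k*k i.i.d. noise samples with std `sigma`.

    Averaging n independent samples divides the variance by n; here n is the
    pooling window area k*k.
    """
    return sigma ** 2 / (k * k)


ratio = normalization_gain(CLIP_STD) / normalization_gain(INCEPTION_STD)
pooled = avg_pool_noise_variance(sigma=1.0, k=2)
```

Here `ratio` is 0.5 / 0.26130258, which rounds to 1.91, and `pooled` shows a 2x2 average pool cutting unit noise variance to a quarter, consistent with suppression in proportion to the window area.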

[114] Decoding the Surgical Scene: A Scoping Review of Scene Graphs in Surgery

Angelo Henriques,Korab Hoxha,Daniel Zapp,Peter C. Issa,Nassir Navab,M. Ali Nasseri

Main category: cs.CV

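TL;DR: A review of scene graph (SG) research in surgery. Despite rapid growth, the field shows a "data divide": internal-view studies mostly use real 2D video, while external-view 4D modeling depends on simulated data. Methods have moved from basic graph neural networks to specialized foundation models that now outperform generalist vision-language models in surgical settings. SGs have become a key technology for workflow recognition, safety monitoring, and controllable surgical simulation, and are emerging as a core component of intelligent systems that improve surgical safety, efficiency, and training.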

Details Motivation: To map the current applications and challenges of scene graphs in surgical environments, expose the data gap between internal-view and external-view research, and move the field toward more efficient and reliable intelligent surgical systems. Method: A scoping review guided by PRISMA-ScR systematically analyzes the literature and summarizes application scenarios, technical evolution, and future directions. Result: Scene graph research is growing rapidly; methods have progressed from conventional graph neural networks to specialized foundation models that significantly outperform generalist large models; the review identifies the "data divide" and confirms the central role of SGs in both analysis and generative tasks. Conclusion: Surgical scene graphs are maturing into a bridge between perception and semantics and are a key technology for next-generation intelligent surgical systems, with the potential to improve surgical safety, efficiency, and training substantially. Abstract: Scene graphs (SGs) provide structured relational representations crucial for decoding complex, dynamic surgical environments. This PRISMA-ScR-guided scoping review systematically maps the evolving landscape of SG research in surgery, charting its applications, methodological advancements, and future directions. Our analysis reveals rapid growth, yet uncovers a critical 'data divide': internal-view research (e.g., triplet recognition) almost exclusively uses real-world 2D video, while external-view 4D modeling relies heavily on simulated data, exposing a key translational research gap. Methodologically, the field has advanced from foundational graph neural networks to specialized foundation models that now significantly outperform generalist large vision-language models in surgical contexts. This progress has established SGs as a cornerstone technology for both analysis, such as workflow recognition and automated safety monitoring, and generative tasks like controllable surgical simulation. Although challenges in data annotation and real-time implementation persist, they are actively being addressed through emerging techniques. Surgical SGs are maturing into an essential semantic bridge, enabling a new generation of intelligent systems to improve surgical safety, efficiency, and training.

[115] A Real-Time On-Device Defect Detection Framework for Laser Power-Meter Sensors via Unsupervised Learning

Dongqi Zheng,Wenjin Fu,Guangzong Chen

Main category: cs.CV

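TL;DR: Proposes an automated vision-based system for detecting and classifying coating defects on laser power-meter sensors. It uses an unsupervised anomaly detection framework built on a UFlow network with StyleGAN2 data augmentation and achieves high detection accuracy on real images.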

Details Motivation: Coating defects on laser power-meter sensors, such as thermal damage and scratches, compromise the accuracy of laser energy measurement in medical and industrial applications. Conventional approaches depend on manual inspection or large labeled datasets and are inefficient and costly, so an efficient automated defect detection solution is needed. Method: The system uses an unsupervised anomaly detection framework trained only on "good" samples. It has three parts: (1) a preprocessing pipeline based on Laplacian edge detection and K-means clustering, (2) synthetic data augmentation with StyleGAN2, and (3) a UFlow-based neural network for multi-scale feature extraction and anomaly map generation. Result: On 366 real sensor images, accuracy is 93.8% on defective samples and 89.3% on good samples, with an image-level AUROC of 0.957 and a pixel-level AUROC of 0.961. Processing takes only 0.5 seconds per image, and the system offers potential annual cost savings. Conclusion: The method detects both known and novel defects without large labeled datasets and is suitable for automated quality control in real industrial settings. Abstract: We present an automated vision-based system for defect detection and classification of laser power meter sensor coatings. Our approach addresses the critical challenge of identifying coating defects such as thermal damage and scratches that can compromise laser energy measurement accuracy in medical and industrial applications. The system employs an unsupervised anomaly detection framework that trains exclusively on "good" sensor images to learn normal coating distribution patterns, enabling detection of both known and novel defect types without requiring extensive labeled defect datasets. Our methodology consists of three key components: (1) a robust preprocessing pipeline using Laplacian edge detection and K-means clustering to segment the area of interest, (2) synthetic data augmentation via StyleGAN2, and (3) a UFlow-based neural network architecture for multi-scale feature extraction and anomaly map generation. Experimental evaluation on 366 real sensor images demonstrates 93.8% accuracy on defective samples and 89.3% accuracy on good samples, with image-level AUROC of 0.957 and pixel-level AUROC of 0.961. The system provides potential annual cost savings through automated quality control and processing times of 0.5 seconds per image in on-device implementation.

[116] Unlocking Financial Insights: An advanced Multimodal Summarization with Multimodal Output Framework for Financial Advisory Videos

Sarmistha Das,R E Zera Marveen Lyngkhoi,Sriparna Saha,Alka Maurya

Main category: cs.CV

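TL;DR: Proposes FASTER, a modular framework for multimodal summarization of long financial advisory videos that combines text, speech, and image features, and introduces the Fin-APT dataset to support research in this area.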

Details Motivation: Financial advisory videos on social media are long and multimodal, which makes key information hard to extract; a system is needed that produces accurate, concise summaries while keeping the modalities consistent. Method: The FASTER framework combines BLIP for semantic visual descriptions, OCR for textual patterns, and Whisper for speech transcription with speaker diarization. A modified DPO-based loss with fact-checking optimizes summary quality, and a ranker-based retrieval mechanism aligns keyframes with the text. Result: In cross-domain experiments, FASTER outperforms existing large language models and vision-language models in summary accuracy, relevance, and factual consistency, and it shows good robustness and generalization. Conclusion: FASTER sets a new standard for multimodal summarization of financial advisory videos and makes the content more accessible and actionable; the released Fin-APT dataset supports follow-up research. Abstract: The dynamic propagation of social media has broadened the reach of financial advisory content through podcast videos, yet extracting insights from lengthy, multimodal segments (30-40 minutes) remains challenging. We introduce FASTER (Financial Advisory Summariser with Textual Embedded Relevant images), a modular framework that tackles three key challenges: (1) extracting modality-specific features, (2) producing optimized, concise summaries, and (3) aligning visual keyframes with associated textual points. FASTER employs BLIP for semantic visual descriptions, OCR for textual patterns, and Whisper-based transcription with Speaker diarization as BOS features. A modified Direct Preference Optimization (DPO)-based loss function, equipped with BOS-specific fact-checking, ensures precision, relevance, and factual consistency against the human-aligned summary. A ranker-based retrieval mechanism further aligns keyframes with summarized content, enhancing interpretability and cross-modal coherence. To acknowledge data resource scarcity, we introduce Fin-APT, a dataset comprising 470 publicly accessible financial advisory pep-talk videos for robust multimodal research. Comprehensive cross-domain experiments confirm FASTER's strong performance, robustness, and generalizability when compared to Large Language Models (LLMs) and Vision-Language Models (VLMs). By establishing a new standard for multimodal summarization, FASTER makes financial advisory content more accessible and actionable, thereby opening new avenues for research. The dataset and code are available at: https://github.com/sarmistha-D/FASTER

[117] An Adaptor for Triggering Semi-Supervised Learning to Out-of-Box Serve Deep Image Clustering

Yue Duan,Lei Qi,Yinghuan Shi,Yang Gao

Main category: cs.CV

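TL;DR: ASD is an adaptor that needs no pretraining or other prerequisites and enables the cold start of semi-supervised learning (SSL) learners for deep image clustering. It uses pseudo-labeled data and an instance-level classifier to extract high-level similarities, assigns cluster-level labels, and thereby triggers an SSL learner to perform clustering. It performs strongly on several benchmarks and trails SSL methods that use ground-truth labels by only a small margin.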

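Details Motivation: Existing methods that integrate semi-supervised learning (SSL) into deep clustering usually require pretraining, clustering learning, or an already trained clustering model as a prerequisite, which limits the flexible, out-of-the-box use of SSL learners for image clustering. A method that can start an SSL learner for clustering without any prerequisite is therefore needed. Method: The ASD adaptor first randomly samples pseudo-labeled data from the unlabeled data and trains an instance-level classifier on it with semantically aligned instance-level labels. It then tracks the class transitions of the classifier's predictions on unlabeled data to extract high-level similarities between instance-level classes and uses them to assign cluster-level labels to the pseudo-labeled data. Finally, the pseudo-labeled data with cluster-level labels triggers a general SSL learner, trained on the unlabeled data, to perform image clustering. Result: ASD outperforms the latest deep image clustering methods on several benchmarks and shows only a very small accuracy gap to SSL methods that use ground-truth labels (for example, 1.33% on CIFAR-10); it can also further improve existing SSL-embedded clustering methods. Conclusion: ASD achieves a cold start of SSL learners for deep image clustering without any prerequisite, is general and plug-and-play, and improves clustering performance while remaining compatible with existing methods. Abstract: Recently, some works integrate SSL techniques into deep clustering frameworks to enhance image clustering performance. However, they all need pretraining, clustering learning, or a trained clustering model as prerequisites, limiting the flexible and out-of-box application of SSL learners in the image clustering task. This work introduces ASD, an adaptor that enables the cold-start of SSL learners for deep image clustering without any prerequisites. Specifically, we first randomly sample pseudo-labeled data from all unlabeled data, and set an instance-level classifier to learn them with semantically aligned instance-level labels. With the ability of instance-level classification, we track the class transitions of predictions on unlabeled data to extract high-level similarities of instance-level classes, which can be utilized to assign cluster-level labels to pseudo-labeled data. Finally, we use the pseudo-labeled data with assigned cluster-level labels to trigger a general SSL learner trained on the unlabeled data for image clustering. We show the superior performance of ASD across various benchmarks against the latest deep image clustering approaches and very slight accuracy gaps compared to SSL methods using ground-truth, e.g., only 1.33% on CIFAR-10. Moreover, ASD can also further boost the performance of existing SSL-embedded deep image clustering methods.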

[118] SiNGER: A Clearer Voice Distills Vision Transformers Further

Geunhyeok Yu,Sunjae Jeong,Yoonyoung Choi,Jaeseung Kim,Hyoseok Hwang

Main category: cs.CV

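TL;DR: Proposes SiNGER, a knowledge distillation framework that uses nullspace-guided perturbation to suppress high-norm artifacts in the teacher model while preserving useful information, improving student performance.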

Details Motivation: Vision Transformers, the backbone of vision foundation models, produce high-norm artifacts that degrade representation quality. In knowledge distillation these artifacts dominate the objective, so the student overfits to them and underweights informative signals, which limits how much of a large model's advantage is transferred. Method: The SiNGER framework centers on principled refinement of teacher features: a nullspace-guided perturbation suppresses artifacts while preserving information and is implemented efficiently with a LoRA-based adapter; the refined teacher features are then distilled into the student. Result: Experiments show that SiNGER consistently improves student models across multiple downstream tasks, reaches state-of-the-art performance, and yields clearer, more interpretable representations. Conclusion: SiNGER resolves the trade-off between suppressing artifacts and preserving information in knowledge distillation and markedly improves distillation results and representation quality. Abstract: Vision Transformers are widely adopted as the backbone of vision foundation models, but they are known to produce high-norm artifacts that degrade representation quality. When knowledge distillation transfers these features to students, high-norm artifacts dominate the objective, so students overfit to artifacts and underweight informative signals, diminishing the gains from larger models. Prior work attempted to remove artifacts but encountered an inherent trade-off between artifact suppression and preserving informative signals from teachers. To address this, we introduce Singular Nullspace-Guided Energy Reallocation (SiNGER), a novel distillation framework that suppresses artifacts while preserving informative signals. The key idea is principled teacher feature refinement: during refinement, we leverage the nullspace-guided perturbation to preserve information while suppressing artifacts. Then, the refined teacher's features are distilled to a student. We implement this perturbation efficiently with a LoRA-based adapter that requires minimal structural modification. Extensive experiments show that SiNGER consistently improves student models, achieving state-of-the-art performance in multiple downstream tasks and producing clearer and more interpretable representations.

[119] Fast-SEnSeI: Lightweight Sensor-Independent Cloud Masking for On-board Multispectral Sensors

Jan Kněžík,Jonáš Herec,Rado Pitoňák

Main category: cs.CV

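TL;DR: Proposes Fast-SEnSeI, a lightweight, sensor-independent encoder module for on-board cloud segmentation with multispectral sensors; it accepts arbitrary band combinations and is deployed as an efficient hybrid pipeline on embedded CPUs and FPGAs.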

Details Motivation: Existing cloud segmentation models usually depend on specific sensor configurations and require ground-based processing, which makes them hard to adapt to different sensors and to on-board real-time processing. Method: Building on SEnSeI-v2, the module adds an improved spectral descriptor, a lightweight architecture, and robust padding-band handling, and is combined with a quantized U-Net to produce fixed-size feature maps and efficient inference. Result: Evaluations on the Sentinel-2 and Landsat 8 datasets show good adaptability to different band configurations and accurate segmentation, with efficient execution on embedded CPUs (via Apache TVM) and FPGAs. Conclusion: Fast-SEnSeI enables flexible, efficient on-board cloud segmentation, generalizes across sensors, and suits resource-constrained space missions. Abstract: Cloud segmentation is a critical preprocessing step for many Earth observation tasks, yet most models are tightly coupled to specific sensor configurations and rely on ground-based processing. In this work, we propose Fast-SEnSeI, a lightweight, sensor-independent encoder module that enables flexible, on-board cloud segmentation across multispectral sensors with varying band configurations. Building upon SEnSeI-v2, Fast-SEnSeI integrates an improved spectral descriptor, lightweight architecture, and robust padding-band handling. It accepts arbitrary combinations of spectral bands and their wavelengths, producing fixed-size feature maps that feed into a compact, quantized segmentation model based on a modified U-Net. The module runs efficiently on embedded CPUs using Apache TVM, while the segmentation model is deployed on FPGA, forming a CPU-FPGA hybrid pipeline suitable for space-qualified hardware. Evaluations on Sentinel-2 and Landsat 8 datasets demonstrate accurate segmentation across diverse input configurations.

[120] A Single Neuron Works: Precise Concept Erasure in Text-to-Image Diffusion Models

Qinqin He,Jiaqi Weng,Jialing Tao,Hui Xue

Main category: cs.CV

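TL;DR: Proposes SNCE, a single-neuron method for erasing harmful concepts from text-to-image models; a sparse autoencoder and modulated frequency scoring of activation patterns locate the harmful-concept neuron, and suppressing it erases the concept precisely while preserving image quality.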

Details Motivation: Existing concept erasure methods struggle to remove harmful content thoroughly while preserving image generation quality; a more precise, less invasive method is needed to improve the safety of text-to-image models. Method: Single Neuron-based Concept Erasure (SNCE) trains a Sparse Autoencoder (SAE) to map text embeddings into a sparse, disentangled latent space, uses a neuron identification method based on modulated frequency scoring to locate the single neuron associated with a harmful concept, and erases the concept by suppressing only that neuron. Result: On multiple benchmarks, SNCE achieves state-of-the-art erasure of target concepts with minimal impact on the generation of non-target concepts, and it is strongly robust to adversarial attacks, significantly outperforming existing methods. Conclusion: By manipulating a single neuron, SNCE achieves precise, low-damage concept erasure and offers an efficient, safe content-control option for text-to-image generation models. Abstract: Text-to-image models exhibit remarkable capabilities in image generation. However, they also pose safety risks of generating harmful content. A key challenge of existing concept erasure methods is the precise removal of target concepts while minimizing degradation of image quality. In this paper, we propose Single Neuron-based Concept Erasure (SNCE), a novel approach that can precisely prevent harmful content generation by manipulating only a single neuron. Specifically, we train a Sparse Autoencoder (SAE) to map text embeddings into a sparse, disentangled latent space, where individual neurons align tightly with atomic semantic concepts. To accurately locate neurons responsible for harmful concepts, we design a novel neuron identification method based on the modulated frequency scoring of activation patterns. By suppressing activations of the harmful concept-specific neuron, SNCE achieves surgical precision in concept erasure with minimal disruption to image quality. Experiments on various benchmarks demonstrate that SNCE achieves state-of-the-art results in target concept erasure, while preserving the model's generation capabilities for non-target concepts. Additionally, our method exhibits strong robustness against adversarial attacks, significantly outperforming existing methods.
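To make the "suppress one SAE neuron" step concrete, here is a minimal sketch under stated assumptions: a toy ReLU sparse autoencoder with hand-written weights, and a neuron index that is assumed to be already known. The paper's identification step (modulated frequency scoring) is not reproduced, and `suppress_neuron` and all weights are hypothetical names and values.

```python
import numpy as np

def suppress_neuron(e, W_enc, b_enc, W_dec, idx):
    """Subtract one SAE latent's decoded contribution from embeddings e.

    e:      (n, d) text embeddings
    W_enc:  (d, k) encoder weights, b_enc: (k,) encoder bias
    W_dec:  (k, d) decoder weights
    idx:    index of the latent to suppress (assumed known)
    """
    z = np.maximum(e @ W_enc + b_enc, 0.0)        # ReLU latent activations
    contribution = np.outer(z[:, idx], W_dec[idx])
    return e - contribution                        # untouched where z[:, idx] == 0

# Toy weights chosen so the arithmetic is easy to follow.
W_enc = np.array([[1.0, 0.0, 0.0],
                  [0.0, 1.0, 0.0]])
b_enc = np.zeros(3)
W_dec = np.array([[1.0, 0.0],
                  [0.0, 1.0],
                  [0.0, 0.0]])
e = np.array([[1.0, 0.0],     # activates latent 0
              [-1.0, 0.0]])   # does not activate latent 0
edited = suppress_neuron(e, W_enc, b_enc, W_dec, idx=0)
```

Subtracting the contribution, rather than decoding the edited latent from scratch, is a design choice of this sketch: embeddings that never activate the target neuron pass through unchanged, which mirrors the low-disruption goal described above.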

[121] OmniPlantSeg: Species Agnostic 3D Point Cloud Organ Segmentation for High-Resolution Plant Phenotyping Across Modalities

Andreas Gilson,Lukas Meyer,Oliver Scholz,Ute Schmid

Main category: cs.CV

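TL;DR: Proposes KD-SS, a lightweight sub-sampling algorithm that is agnostic to sensor and plant species, for resolution-preserving point cloud segmentation of plant organs.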

Details Motivation: Existing methods are limited to particular plant species or sensor modalities and often need heavy pre-processing and down-sampling, which prevents segmentation of full-resolution point clouds. Method: The KD-SS algorithm processes biological point clouds without down-sampling, is compatible with multiple sensor modalities, and is validated in combination with current state-of-the-art segmentation models. Result: Satisfying segmentation results are obtained across modalities (photogrammetry, laser triangulation, and LiDAR) and across plant species, while the original point cloud resolution is retained. Conclusion: KD-SS is a simple yet effective lightweight alternative for full-resolution plant organ point cloud segmentation across species and sensor modalities. Abstract: Accurate point cloud segmentation for plant organs is crucial for 3D plant phenotyping. Existing solutions are problem-specific, focusing on certain plant species or specific sensor modalities for data acquisition. Furthermore, it is common to use extensive pre-processing and down-sample the plant point clouds to meet hardware or neural network input size requirements. We propose a simple yet effective algorithm, KD-SS, for sub-sampling of biological point clouds that is agnostic to sensor data and plant species. The main benefit of this approach is that we do not need to down-sample our input data and thus enable segmentation of the full-resolution point cloud. Combining KD-SS with current state-of-the-art segmentation models shows satisfying results evaluated on different modalities such as photogrammetry, laser triangulation and LiDAR for various plant species. We propose KD-SS as a lightweight, resolution-retaining alternative to intensive pre-processing and down-sampling methods for plant organ segmentation regardless of the species and sensor modality used.

[122] Background Prompt for Few-Shot Out-of-Distribution Detection

Songyue Cai,Zongqian Wu,Yujie Mo,Liang Peng,Ping Hu,Xiaoshuang Shi,Xiaofeng Zhu

Main category: cs.CV

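TL;DR: Proposes Mambo, a new foreground-background decomposition framework for few-shot out-of-distribution detection that learns a background prompt and uses a self-calibrated tuning strategy to improve detection robustness and performance.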

Details Motivation: Existing methods have low robustness because they rely too heavily on the local class similarity and on a fixed background patch extraction strategy. Method: Mambo learns a background prompt to obtain a local background similarity that contains both background and semantic information and refines it with the local class similarity; a self-calibrated tuning strategy then flexibly selects the number of background patches for each sample. Result: Experiments on multiple real-world datasets show that Mambo outperforms existing state-of-the-art methods on OOD and near-OOD detection. Conclusion: By improving background extraction and accounting for sample diversity, Mambo improves the performance and robustness of few-shot OOD detection. Abstract: Existing foreground-background (FG-BG) decomposition methods for few-shot out-of-distribution (FS-OOD) detection often suffer from low robustness due to over-reliance on the local class similarity and a fixed background patch extraction strategy. To address these challenges, we propose a new FG-BG decomposition framework, namely Mambo, for FS-OOD detection. Specifically, we propose to first learn a background prompt to obtain the local background similarity containing both the background and image semantic information, and then refine the local background similarity using the local class similarity. As a result, we use both the refined local background similarity and the local class similarity to conduct background extraction, reducing the dependence on the local class similarity in previous methods. Furthermore, we propose the patch self-calibrated tuning to consider the sample diversity to flexibly select numbers of background patches for different samples, thereby addressing the issue of fixed background extraction strategies in previous methods. Extensive experiments on real-world datasets demonstrate that our proposed Mambo achieves the best performance compared to SOTA methods in terms of OOD detection and near OOD detection setting. The source code will be released at https://github.com/YuzunoKawori/Mambo.

[123] Stratify or Die: Rethinking Data Splits in Image Segmentation

Naga Venkata Sai Jitin Jami,Thomas Altstidl,Jonas Mueller,Jindong Li,Dario Zanca,Bjoern Eskofier,Heike Leutheuser

Main category: cs.CV

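TL;DR: Proposes WDES, a new dataset splitting method that minimizes the Wasserstein distance to make label distributions similar across splits in segmentation tasks; it yields more representative splits than random sampling, especially for small, imbalanced, and low-diversity datasets.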

Details Motivation: Random dataset splits in image segmentation easily produce unrepresentative test sets, which leads to biased evaluation and poor generalization; existing stratified sampling methods are hard to apply directly to segmentation tasks with multi-label structure and class imbalance. Method: Building on Iterative Pixel Stratification (IPS), the authors propose Wasserstein-Driven Evolutionary Stratification (WDES), a genetic algorithm that minimizes the Wasserstein distance to make the label distributions of the training and test sets similar, and prove that it is globally optimal given enough generations. Result: Evaluated with newly proposed statistical heterogeneity metrics, WDES produces more consistent splits than random sampling across street-scene, medical, and satellite segmentation tasks, with lower performance variance and more reliable model evaluation; the advantage is most pronounced on small and imbalanced data. Conclusion: WDES is an effective dataset splitting method for image segmentation that improves the accuracy and stability of model evaluation, particularly in real-world settings with complex label distributions and scarce data. Abstract: Random splitting of datasets in image segmentation often leads to unrepresentative test sets, resulting in biased evaluations and poor model generalization. While stratified sampling has proven effective for addressing label distribution imbalance in classification tasks, extending these ideas to segmentation remains challenging due to the multi-label structure and class imbalance typically present in such data. Building on existing stratification concepts, we introduce Iterative Pixel Stratification (IPS), a straightforward, label-aware sampling method tailored for segmentation tasks. Additionally, we present Wasserstein-Driven Evolutionary Stratification (WDES), a novel genetic algorithm designed to minimize the Wasserstein distance, thereby optimizing the similarity of label distributions across dataset splits. We prove that WDES is globally optimal given enough generations. Using newly proposed statistical heterogeneity metrics, we evaluate both methods against random sampling and find that WDES consistently produces more representative splits. Applying WDES across diverse segmentation tasks, including street scenes, medical imaging, and satellite imagery, leads to lower performance variance and improved model evaluation. Our results also highlight the particular value of WDES in handling small, imbalanced, and low-diversity datasets, where conventional splitting strategies are most prone to bias.
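The core objective, making the train and test label distributions close in Wasserstein distance, can be sketched as follows. This is not the authors' WDES: it substitutes a simple swap-mutation search with greedy acceptance for the genetic algorithm, and it assumes classes are placed on an ordered axis 0..K-1 so that the 1-D Wasserstein-1 distance reduces to a sum of absolute CDF differences. The per-image pixel counts are synthetic.

```python
import numpy as np

def w1_discrete(p, q):
    """Wasserstein-1 distance between two histograms on points 0..K-1."""
    return float(np.abs(np.cumsum(p - q)).sum())

def split_distance(pixel_counts, test_mask):
    """Distance between train and test class-pixel distributions."""
    test = pixel_counts[test_mask].sum(axis=0)
    train = pixel_counts[~test_mask].sum(axis=0)
    return w1_discrete(train / train.sum(), test / test.sum())

def evolve_split(pixel_counts, n_test, generations=500, seed=0):
    """Swap one train and one test image per step; keep non-worsening swaps."""
    rng = np.random.default_rng(seed)
    n = len(pixel_counts)
    mask = np.zeros(n, dtype=bool)
    mask[rng.choice(n, n_test, replace=False)] = True
    best = split_distance(pixel_counts, mask)
    history = [best]
    for _ in range(generations):
        child = mask.copy()
        i = rng.choice(np.flatnonzero(child))    # image leaving the test set
        j = rng.choice(np.flatnonzero(~child))   # image entering the test set
        child[i], child[j] = False, True
        d = split_distance(pixel_counts, child)
        if d <= best:
            mask, best = child, d
        history.append(best)
    return mask, history

# Synthetic per-image pixel counts for 40 images and 4 classes.
counts = np.random.default_rng(1).integers(1, 100, size=(40, 4))
test_mask, history = evolve_split(counts, n_test=10)
```

Because swaps keep the test-set size fixed, the split ratio is preserved while the distance can only stay the same or decrease from one generation to the next.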

[124] EnGraf-Net: Multiple Granularity Branch Network with Fine-Coarse Graft Grained for Classification Task

Riccardo La Grassa,Ignazio Gallo,Nicola Landro

Main category: cs.CV

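TL;DR: Proposes EnGraf-Net, an end-to-end deep neural network that uses hierarchical semantic associations (a taxonomy) as supervision to improve fine-grained classification; it needs no cropping techniques or manual annotations and is competitive with state-of-the-art methods on several datasets.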

Details Motivation: Existing fine-grained classification methods usually rely on part annotations or automatic attention mechanisms, which often yield incomplete local feature representations. The authors argue that semantic associations and hierarchical structure better mimic how humans recognize objects and can therefore improve classification. Method: EnGraf-Net embeds a semantic hierarchy (taxonomy) as a supervised signal in an end-to-end deep neural network and uses the hierarchical semantic relations to guide feature learning, avoiding extra annotations such as bounding boxes, part locations, or textual attributes. Result: Extensive experiments on CIFAR-100, CUB-200-2011, and FGVC-Aircraft show that EnGraf-Net outperforms many existing methods and is competitive with the most recent state-of-the-art approaches without any cropping or manual annotation. Conclusion: Using hierarchical semantic associations as supervision strengthens feature representations for fine-grained classification and provides an efficient solution that does not depend on local part annotations. Abstract: Fine-grained classification models are designed to focus on the relevant details necessary to distinguish highly similar classes, particularly when intra-class variance is high and inter-class variance is low. Most existing models rely on part annotations such as bounding boxes, part locations, or textual attributes to enhance classification performance, while others employ sophisticated techniques to automatically extract attention maps. We posit that part-based approaches, including automatic cropping methods, suffer from an incomplete representation of local features, which are fundamental for distinguishing similar objects. While fine-grained classification aims to recognize the leaves of a hierarchical structure, humans recognize objects by also forming semantic associations. In this paper, we leverage semantic associations structured as a hierarchy (taxonomy) as supervised signals within an end-to-end deep neural network model, termed EnGraf-Net. Extensive experiments on three well-known datasets CIFAR-100, CUB-200-2011, and FGVC-Aircraft demonstrate the superiority of EnGraf-Net over many existing fine-grained models, showing competitive performance with the most recent state-of-the-art approaches, without requiring cropping techniques or manual annotations.

[125] Vision Transformers: the threat of realistic adversarial patches

Kasper Cools,Clara Maathuis,Alexander M. van Oers,Claudia S. Hübner,Nikos Deligiannis,Marijke Vandewal,Geert De Cubber

Main category: cs.CV

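TL;DR: Studies the cross-architecture transferability of adversarial patch attacks from convolutional neural networks (CNNs) to Vision Transformer (ViT) models; vulnerability differs markedly across ViT models, with attack success rates from 40% to nearly 100%, indicating that pre-training dataset scale and methodology strongly influence resilience.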

Details Motivation: As machine learning systems are widely used in critical applications, their security becomes essential. Although Vision Transformers (ViTs) improve on CNNs in performance and robustness, their vulnerability to real-world adversarial attacks such as adversarial patches remains unclear, so their security risk needs systematic evaluation. Method: The study uses the Creases Transformation (CT) technique to generate adversarial patches with natural crease features, applies them to four fine-tuned ViT models on a binary (person vs. non-person) classification task, measures attack success rates, and analyzes how well attacks transfer from CNNs to ViTs and what influences this. Result: Sensitivity to adversarial patches varies widely across ViT models: the attack success rate is 40.04% for google/vit-base-patch16-224-in21k and 99.97% for facebook/dino-vitb16, with the other two models at 66.40% and 65.17%, confirming cross-architecture transferability between CNNs and ViTs. Conclusion: Although ViTs are more robust by design, they remain vulnerable to realistic adversarial patch attacks, and their security depends heavily on pre-training strategy and data scale; dedicated defenses for ViT architectures are needed for real deployments. Abstract: The increasing reliance on machine learning systems has made their security a critical concern. Evasion attacks enable adversaries to manipulate the decision-making processes of AI systems, potentially causing security breaches or misclassification of targets. Vision Transformers (ViTs) have gained significant traction in modern machine learning due to increased 1) performance compared to Convolutional Neural Networks (CNNs) and 2) robustness against adversarial perturbations. However, ViTs remain vulnerable to evasion attacks, particularly to adversarial patches, unique patterns designed to manipulate AI classification systems. These vulnerabilities are investigated by designing realistic adversarial patches to cause misclassification in person vs. non-person classification tasks using the Creases Transformation (CT) technique, which adds subtle geometric distortions similar to those occurring naturally when wearing clothing. This study investigates the transferability of adversarial attack techniques used in CNNs when applied to ViT classification models. Experimental evaluation across four fine-tuned ViT models on a binary person classification task reveals significant vulnerability variations: attack success rates ranged from 40.04% (google/vit-base-patch16-224-in21k) to 99.97% (facebook/dino-vitb16), with google/vit-base-patch16-224 achieving 66.40% and facebook/dinov3-vitb16 reaching 65.17%. These results confirm the cross-architectural transferability of adversarial patches from CNNs to ViTs, with pre-training dataset scale and methodology strongly influencing model resilience to adversarial attacks.

[126] UniTransfer: Video Concept Transfer via Progressive Spatial and Timestep Decomposition

Guojun Lei,Rong Zhang,Chi Wang,Tianhang Liu,Hong Li,Zhiyuan Ma,Weiwei Xu

Main category: cs.CV

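TL;DR: Proposes the UniTransfer architecture, which achieves precise and controllable video concept transfer through spatial and timestep decomposition.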

Details Motivation: To enable finer-grained, more controllable video content editing and concept transfer with better generation quality and flexibility. Method: The method introduces spatial decomposition (foreground, background, motion flow) and diffusion timestep decomposition, adopts a dual-to-single-stream DiT architecture, and combines self-supervised pretraining with an LLM-based Chain-of-Prompt mechanism that guides generation stage by stage. Result: The method achieves high-quality, highly controllable video concept transfer across diverse reference images and scenes and surpasses existing methods in visual fidelity and editability. Conclusion: Through spatial and timestep decomposition and LLM-guided progressive generation, UniTransfer improves video concept transfer and suggests a new direction for video editing. Abstract: We propose a novel architecture UniTransfer, which introduces both spatial and diffusion timestep decomposition in a progressive paradigm, achieving precise and controllable video concept transfer. Specifically, in terms of spatial decomposition, we decouple videos into three key components: the foreground subject, the background, and the motion flow. Building upon this decomposed formulation, we further introduce a dual-to-single-stream DiT-based architecture for supporting fine-grained control over different components in the videos. We also introduce a self-supervised pretraining strategy based on random masking to enhance the decomposed representation learning from large-scale unlabeled video data. Inspired by the Chain-of-Thought reasoning paradigm, we further revisit the denoising diffusion process and propose a Chain-of-Prompt (CoP) mechanism to achieve the timestep decomposition. We decompose the denoising process into three stages of different granularity and leverage large language models (LLMs) for stage-specific instructions to guide the generation progressively. We also curate an animal-centric video dataset called OpenAnimal to facilitate the advancement and benchmarking of research in video concept transfer. Extensive experiments demonstrate that our method achieves high-quality and controllable video concept transfer across diverse reference images and scenes, surpassing existing baselines in both visual fidelity and editability. Web Page: https://yu-shaonian.github.io/UniTransfer-Web/

[127] VideoChat-R1.5: Visual Test-Time Scaling to Reinforce Multimodal Reasoning by Iterative Perception

Ziang Yan,Xinhao Li,Yinan He,Zhengrong Yue,Xiangyu Zeng,Yali Wang,Yu Qiao,Limin Wang,Yi Wang

Main category: cs.CV

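TL;DR: Proposes Visual Test-Time Scaling (VTTS), a new method that strengthens the reasoning of multimodal large language models (MLLMs) through iterative perception during inference, and introduces the VTTS-80K dataset to support this paradigm.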

Details Motivation: Existing multimodal large language models are limited by a static perception stage and fall short of human-level perception and understanding, so a method is needed that dynamically refines visual perception during reasoning. Method: VTTS uses an Iterative Perception (ITP) mechanism that combines reinforcement learning with spatio-temporal supervision and progressively refines attention to high-confidence spatio-temporal regions based on updated textual predictions; the VTTS-80K dataset is built for training and validation. Result: Across more than 15 benchmarks covering video conversation, video reasoning, and spatio-temporal perception, the VTTS-based Videochat-R1.5 model improves by over 5% on average compared with strong baselines such as Qwen2.5VL-3B and -7B. Conclusion: VTTS scales MLLM performance by increasing perceptual compute and shows good generalization and application potential. Abstract: Inducing reasoning in multimodal large language models (MLLMs) is critical for achieving human-level perception and understanding. Existing methods mainly leverage LLM reasoning to analyze parsed visuals, often limited by static perception stages. This paper introduces Visual Test-Time Scaling (VTTS), a novel approach to enhance MLLMs' reasoning via iterative perception during inference. VTTS mimics humans' hierarchical attention by progressively refining focus on high-confidence spatio-temporal regions, guided by updated textual predictions. Specifically, VTTS employs an Iterative Perception (ITP) mechanism, incorporating reinforcement learning with spatio-temporal supervision to optimize reasoning. To support this paradigm, we also present VTTS-80K, a dataset tailored for iterative perception. These designs allow an MLLM to enhance its performance by increasing its perceptual compute. Extensive experiments validate VTTS's effectiveness and generalization across diverse tasks and benchmarks. Our newly introduced Videochat-R1.5 model has achieved remarkable improvements, with an average increase of over 5%, compared to robust baselines such as Qwen2.5VL-3B and -7B, across more than 15 benchmarks that encompass video conversation, video reasoning, and spatio-temporal perception.

[128] Mammo-CLIP Dissect: A Framework for Analysing Mammography Concepts in Vision-Language Models

Suaiba Amina Salahuddin,Teresa Dorszewski,Marit Almenning Martiniussen,Tone Hovda,Antonio Portaluri,Solveig Thrun,Michael Kampffmeyer,Elisabeth Wetzer,Kristoffer Wickstrøm,Robert Jenssen

Main category: cs.CV

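TL;DR: Presents Mammo-CLIP Dissect, the first concept-based explainability framework for deep learning models in mammography; it uses a mammography-specific vision-language model to label neurons and quantify their alignment with domain knowledge, revealing how training data and fine-tuning strategy affect concept learning.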

Details Motivation: Deploying AI safely in clinical settings requires understanding what deep learning models learn, in particular the textual concepts that align more closely with clinicians' reasoning, rather than relying only on pixel-level explainability methods. Method: The Mammo-CLIP Dissect framework uses a mammography-specific vision-language model (Mammo-CLIP) as a "dissector" to align neurons at specified layers with human-interpretable textual concepts and to quantify their agreement with domain knowledge. Result: Models trained on mammography data capture more clinically relevant concepts than models trained on general images; fine-tuning strengthens some concepts (e.g., benign calcifications) but can weaken coverage of others (e.g., density-related features); some mammography-relevant concepts remain underrepresented. Conclusion: Mammo-CLIP Dissect helps explain how CNNs capture mammography-specific knowledge, shows how domain-specific training and task adaptation shape concept learning, and offers a new tool for improving the explainability and reliability of medical AI. Abstract: Understanding what deep learning (DL) models learn is essential for the safe deployment of artificial intelligence (AI) in clinical settings. While previous work has focused on pixel-based explainability methods, less attention has been paid to the textual concepts learned by these models, which may better reflect the reasoning used by clinicians. We introduce Mammo-CLIP Dissect, the first concept-based explainability framework for systematically dissecting DL vision models trained for mammography. Leveraging a mammography-specific vision-language model (Mammo-CLIP) as a "dissector," our approach labels neurons at specified layers with human-interpretable textual concepts and quantifies their alignment to domain knowledge. Using Mammo-CLIP Dissect, we investigate three key questions: (1) how concept learning differs between DL vision models trained on general image datasets versus mammography-specific datasets; (2) how fine-tuning for downstream mammography tasks affects concept specialisation; and (3) which mammography-relevant concepts remain underrepresented. We show that models trained on mammography data capture more clinically relevant concepts and align more closely with radiologists' workflows than models not trained on mammography data. Fine-tuning for task-specific classification enhances the capture of certain concept categories (e.g., benign calcifications) but can reduce coverage of others (e.g., density-related features), indicating a trade-off between specialisation and generalisation. Our findings show that Mammo-CLIP Dissect provides insights into how convolutional neural networks (CNNs) capture mammography-specific knowledge. By comparing models across training data and fine-tuning regimes, we reveal how domain-specific training and task-specific adaptation shape concept learning. Code and concept set are available: https://github.com/Suaiba/Mammo-CLIP-Dissect.

[129] MOSS-ChatV: Reinforcement Learning with Process Reasoning Reward for Video Temporal Reasoning

Sicheng Tao,Jungang Li,Yibo Yan,Junyan Zhang,Yubo Gao,Hanqian Li,ShuHang Xun,Yuxuan Fan,Hong Chen,Jianxiang He,Xuming Hu

Main category: cs.CV

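TL;DR: Proposes MOSS-ChatV, a reinforcement learning framework with a Dynamic Time Warping (DTW)-based process reward that improves the process consistency of multimodal large language models in video reasoning, and validates it with the newly built MOSS-Video benchmark.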

Details Motivation: Existing MLLMs often produce intermediate reasoning that is inconsistent with the video dynamics, which harms interpretability and robustness even when the final answer is correct. Method: A DTW-based process reward aligns reasoning traces with temporally grounded references during reinforcement learning; the MOSS-Video benchmark, with annotated reasoning traces, is built for training and evaluation. Result: MOSS-ChatV reaches 87.2% on the MOSS-Video test split, improves on general video benchmarks such as MVBench and MMVU, yields consistent gains across model architectures, and produces more consistent and stable reasoning according to GPT-4o evaluation. Conclusion: The framework improves the consistency between the model's reasoning process and the actual video dynamics and shows good generality and application potential. Abstract: Video reasoning has emerged as a critical capability for multimodal large language models (MLLMs), requiring models to move beyond static perception toward coherent understanding of temporal dynamics in complex scenes. Yet existing MLLMs often exhibit process inconsistency, where intermediate reasoning drifts from video dynamics even when the final answer is correct, undermining interpretability and robustness. To address this issue, we introduce MOSS-ChatV, a reinforcement learning framework with a Dynamic Time Warping (DTW)-based process reward. This rule-based reward aligns reasoning traces with temporally grounded references, enabling efficient process supervision without auxiliary reward models. We further identify dynamic state prediction as a key measure of video reasoning and construct MOSS-Video, a benchmark with annotated reasoning traces, where the training split is used to fine-tune MOSS-ChatV and the held-out split is reserved for evaluation. MOSS-ChatV achieves 87.2% on MOSS-Video (test) and improves performance on general video benchmarks such as MVBench and MMVU. The framework consistently yields gains across different architectures, including Qwen2.5-VL and Phi-2, confirming its broad applicability. Evaluations with GPT-4o-as-judge further show that MOSS-ChatV produces more consistent and stable reasoning traces.
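Dynamic Time Warping itself is a standard dynamic program, and a small sketch shows how a rule-based reward could be derived from it. The DTW recurrence below is the textbook one; the numeric encoding of reasoning states and the `1 / (1 + distance)` reward mapping are assumptions made for illustration, not the reward defined in the paper.

```python
import numpy as np

def dtw_distance(a, b):
    """Classic DTW between two 1-D sequences with absolute-difference cost."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            # Extend the cheapest of: insertion, deletion, or match.
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return float(D[n, m])

def process_reward(trace, reference):
    """Hypothetical reward in (0, 1]: 1.0 when the trace warps onto the reference."""
    return 1.0 / (1.0 + dtw_distance(trace, reference))

# States are encoded as integers here purely for illustration.
reference = [1, 2, 2, 3]
aligned_trace = [1, 2, 3]      # same order of states, different pacing
shuffled_trace = [3, 2, 1]     # states visited in the wrong order
```

DTW tolerates differences in pacing (a state held for more or fewer steps) but penalizes states that occur in the wrong order, which is why it is a natural fit for comparing a reasoning trace with a temporally grounded reference.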

[130] MotionFlow: Learning Implicit Motion Flow for Complex Camera Trajectory Control in Video Generation

Guojun Lei,Chi Wang,Yikai Wang,Hong Li,Ying Song,Weiwei Xu

Main category: cs.CV

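TL;DR: Proposes a new method that converts camera and object motion into pixel motion and combines a stable diffusion network with an image-to-video network to generate videos that follow a specified camera trajectory while keeping object motion consistent.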

Details Motivation: When the camera and objects move at the same time, existing methods struggle with consistency and generalization and tend to confuse the relative motion. Method: Camera and object motion are unified as pixel motion; a stable diffusion network learns reference motion maps, which are fed together with a semantic object prior into an image-to-video network to generate the video. Result: Experiments show that the model clearly outperforms existing state-of-the-art methods in following the specified camera trajectory and keeping object motion consistent. Conclusion: The method resolves the generation inconsistency caused by coupled camera and object motion and improves the quality and controllability of video generation. Abstract: Generating videos guided by camera trajectories poses significant challenges in achieving consistency and generalizability, particularly when both camera and object motions are present. Existing approaches often attempt to learn these motions separately, which may lead to confusion regarding the relative motion between the camera and the objects. To address this challenge, we propose a novel approach that integrates both camera and object motions by converting them into the motion of corresponding pixels. Utilizing a stable diffusion network, we effectively learn reference motion maps in relation to the specified camera trajectory. These maps, along with an extracted semantic object prior, are then fed into an image-to-video network to generate the desired video that can accurately follow the designated camera trajectory while maintaining consistent object motions. Extensive experiments verify that our model outperforms SOTA methods by a large margin.

[131] The Unwinnable Arms Race of AI Image Detection

Till Aczel,Lorenzo Vettor,Andreas Plesner,Roger Wattenhofer

Main category: cs.CV

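TL;DR: Studies the conditions under which discriminators are at a disadvantage as image generative AI advances. Analyzing data dimensionality and complexity shows that very simple or highly complex datasets reduce the detectability of synthetic images, while datasets of intermediate complexity are the most favorable for detection.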

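Details Motivation: Progress in image generative AI has blurred the boundary between synthetic and real images and intensified the competition between generators and discriminators. The paper examines the conditions under which discriminators are at a disadvantage. Method: Kolmogorov complexity is used as a measure of intrinsic dataset structure to analyze how data dimensionality and complexity affect the discriminator's detection ability. Result: Higher dimensionality generally helps the discriminator detect subtle inconsistencies, whereas complexity has a more nuanced effect: both overly simple and highly complex datasets reduce detectability, and intermediate-complexity datasets are the most favorable for detection because generators cannot fully capture the distribution and their errors remain visible. Conclusion: Data complexity plays a key role in the generator-discriminator competition, and intermediate-complexity datasets give discriminators the best detection conditions. Abstract: The rapid progress of image generative AI has blurred the boundary between synthetic and real images, fueling an arms race between generators and discriminators. This paper investigates the conditions under which discriminators are most disadvantaged in this competition. We analyze two key factors: data dimensionality and data complexity. While increased dimensionality often strengthens the discriminator's ability to detect subtle inconsistencies, complexity introduces a more nuanced effect. Using Kolmogorov complexity as a measure of intrinsic dataset structure, we show that both very simple and highly complex datasets reduce the detectability of synthetic images; generators can learn simple datasets almost perfectly, whereas extreme diversity masks imperfections. In contrast, intermediate-complexity datasets create the most favorable conditions for detection, as generators fail to fully capture the distribution and their errors remain visible.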

[132] WAVECLIP: Wavelet Tokenization for Adaptive-Resolution CLIP

Moshe Kimhi,Erez Koifman,Ehud Rivlin,Eli Schwartz,Chaim Baskin

Main category: cs.CV

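TL;DR: Proposes WAVECLIP, a single unified model based on wavelet tokenization that supports adaptive-resolution inference in CLIP, using multi-level wavelet decomposition and key-value caching to trade off compute against accuracy.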

Details Motivation: To improve the inference efficiency of CLIP models on images of different resolutions and reduce compute while maintaining accuracy. Method: Standard patch embeddings are replaced with a multi-level wavelet decomposition, combined with key-value caching and causal cross-level attention for coarse-to-fine adaptive inference, and the model is trained with lightweight distillation. Result: The method is validated on zero-shot classification, where confidence-based gating enables adaptive early exits that substantially reduce computation while keeping accuracy competitive. Conclusion: WAVECLIP provides a flexible compute-accuracy trade-off with a single model and suits practical applications that need efficient inference. Abstract: We introduce WAVECLIP, a single unified model for adaptive resolution inference in CLIP, enabled by wavelet-based tokenization. WAVECLIP replaces standard patch embeddings with a multi-level wavelet decomposition, enabling the model to process images from coarse to fine while naturally supporting multiple resolutions within the same model. At inference time, the model begins with low resolution tokens and refines only when needed, using key-value caching and causal cross-level attention to reuse computation, so that only new information is introduced to the model when needed. We evaluate WAVECLIP in zero-shot classification, demonstrating that a simple confidence-based gating mechanism enables adaptive early exits. This allows users to dynamically choose a compute-accuracy trade-off using a single deployed model. Our approach requires only lightweight distillation from a frozen CLIP teacher and achieves competitive accuracy with significant computational savings.
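The coarse-to-fine structure that a multi-level wavelet decomposition provides can be sketched with an orthonormal 2-D Haar transform. This is only an illustration of the decomposition, assuming a square grayscale image whose side is divisible by 2 at every level; the actual wavelet family, tokenization, and attention design of WAVECLIP are not specified here, and the function names are hypothetical.

```python
import numpy as np

def haar2d(img):
    """One Haar level: a half-resolution approximation plus three detail bands."""
    a, b = img[0::2, 0::2], img[0::2, 1::2]
    c, d = img[1::2, 0::2], img[1::2, 1::2]
    approx = (a + b + c + d) / 2
    details = ((a - b + c - d) / 2,   # differences across columns
               (a + b - c - d) / 2,   # differences across rows
               (a - b - c + d) / 2)   # diagonal differences
    return approx, details

def inverse_haar2d(approx, details):
    """Exactly invert haar2d."""
    dx, dy, dxy = details
    out = np.empty((approx.shape[0] * 2, approx.shape[1] * 2))
    out[0::2, 0::2] = (approx + dx + dy + dxy) / 2
    out[0::2, 1::2] = (approx - dx + dy - dxy) / 2
    out[1::2, 0::2] = (approx + dx - dy - dxy) / 2
    out[1::2, 1::2] = (approx - dx - dy + dxy) / 2
    return out

def decompose(img, levels):
    """Return the coarsest approximation and details ordered coarse to fine."""
    pyramid = []
    for _ in range(levels):
        img, details = haar2d(img)
        pyramid.append(details)
    return img, pyramid[::-1]

def reconstruct(coarse, pyramid):
    for details in pyramid:
        coarse = inverse_haar2d(coarse, details)
    return coarse

image = np.random.default_rng(0).random((8, 8))
coarse, pyramid = decompose(image, levels=2)
```

A model could start from `coarse` and consume the entries of `pyramid` one level at a time, which matches the idea of refining only when the low-resolution prediction is not confident enough.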

[133] Can Less Precise Be More Reliable? A Systematic Evaluation of Quantization's Impact on CLIP Beyond Accuracy

Aymen Bouguerra,Daniel Montoya,Alexandra Gomez-Villa,Fabio Arnez,Chokri Mraidha

Main category: cs.CV

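TL;DR: Studies how quantization affects reliability metrics of CLIP models beyond accuracy; quantization improves the calibration of underconfident models, can still improve OOD detection even when calibration degrades, and, with quantization-aware training, yields joint gains in efficiency and performance.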

Details Motivation: Although vision-language models such as CLIP show zero-shot generalization on safety-related tasks like out-of-distribution detection, their reliability under quantization is underexplored, especially given the need for computational efficiency and stability in real deployments. Method: The authors run a large-scale quantization evaluation of CLIP models that covers in-distribution accuracy and a range of reliability metrics, including calibration and OOD detection, and compare models from different pre-training sources under quantization and quantization-aware training. Result: Quantization consistently improves calibration for typically underconfident models but degrades it for overconfident ones; this degradation does not prevent gains in OOD detection; specific quantization-aware training methods improve zero-shot accuracy, calibration, and OOD robustness at the same time. Conclusion: Quantization is not only a way to improve efficiency but can also strengthen the reliability and robustness of vision-language models in multi-objective settings, challenging the conventional view of an efficiency-performance trade-off. Abstract: The powerful zero-shot generalization capabilities of vision-language models (VLMs) like CLIP have enabled new paradigms for safety-related tasks such as out-of-distribution (OOD) detection. However, additional aspects crucial for the computationally efficient and reliable deployment of CLIP are still overlooked. In particular, the impact of quantization on CLIP's performance beyond accuracy remains underexplored. This work presents a large-scale evaluation of quantization on CLIP models, assessing not only in-distribution accuracy but a comprehensive suite of reliability metrics and revealing counterintuitive results driven by pre-training source. We demonstrate that quantization consistently improves calibration for typically underconfident pre-trained models, while often degrading it for overconfident variants. Intriguingly, this degradation in calibration does not preclude gains in other reliability metrics; we find that OOD detection can still improve for these same poorly calibrated models. Furthermore, we identify specific quantization-aware training (QAT) methods that yield simultaneous gains in zero-shot accuracy, calibration, and OOD robustness, challenging the view of a strict efficiency-performance trade-off. These findings offer critical insights for navigating the multi-objective problem of deploying efficient, reliable, and robust VLMs by utilizing quantization beyond its conventional role.
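Calibration, one of the reliability metrics discussed above, is commonly summarized by the expected calibration error (ECE). The sketch below is a standard equal-width-bin ECE and is offered only as background; the abstract does not state which calibration metric or binning the authors use, so the bin count and function name are assumptions.

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """Weighted average of |accuracy - mean confidence| over confidence bins.

    confidences: predicted top-1 probabilities in (0, 1]
    correct:     1 where the prediction was right, 0 otherwise
    """
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (confidences > lo) & (confidences <= hi)
        if in_bin.any():
            gap = abs(correct[in_bin].mean() - confidences[in_bin].mean())
            ece += in_bin.mean() * gap   # weight by the fraction of samples in the bin
    return float(ece)

# An overconfident toy model: 95% confidence but only 75% accuracy.
overconfident_ece = expected_calibration_error([0.95] * 4, [1, 1, 1, 0])
```

Computing this quantity for the same model before and after quantization is one simple way to check whether calibration improved or degraded; a model whose mean confidence sits below its accuracy is underconfident, and one whose confidence sits above it is overconfident.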

[134] TABLET: A Large-Scale Dataset for Robust Visual Table Understanding

Iñigo Alonso,Imanol Miranda,Eneko Agirre,Mirella Lapata

Main category: cs.CV

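TL;DR: TABLET is a large-scale visual table understanding (VTU) dataset with 4 million examples across 20 tasks, grounded in 2 million unique tables, 88% of which preserve their original visualization. It provides paired image-HTML representations, metadata, and provenance information, supporting traceability and diverse evaluation.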

Details Motivation: Existing VTU benchmarks mostly use synthetically rendered tables that lack real-world complexity, and their datasets are fixed with no access to the underlying data, which limits model robustness and generalization. Method: The authors build TABLET, a large-scale dataset of real-world table visualizations in which each example includes an image-HTML pair, metadata, and provenance information, and use it to fine-tune vision-language models such as Qwen2.5-VL-7B. Result: Models fine-tuned on TABLET perform better on seen and unseen VTU tasks and are more robust on real-world table images. Conclusion: By preserving original visualizations and data provenance, TABLET lays a foundation for robust training and extensible evaluation of VTU models. Abstract: While table understanding increasingly relies on pixel-only settings where tables are processed as visual representations, current benchmarks predominantly use synthetic renderings that lack the complexity and visual diversity of real-world tables. Additionally, existing visual table understanding (VTU) datasets offer fixed examples with single visualizations and pre-defined instructions, providing no access to underlying serialized data for reformulation. We introduce TABLET, a large-scale VTU dataset with 4 million examples across 20 tasks, grounded in 2 million unique tables where 88% preserve original visualizations. Each example includes paired image-HTML representations, comprehensive metadata, and provenance information linking back to the source datasets. Fine-tuning vision-language models like Qwen2.5-VL-7B on TABLET improves performance on seen and unseen VTU tasks while increasing robustness on real-world table visualizations. By preserving original visualizations and maintaining example traceability in a unified large-scale collection, TABLET establishes a foundation for robust training and extensible evaluation of future VTU models.

[135] Learning Conformal Explainers for Image Classifiers

Amr Alkhatib,Stephanie Lowry

Main category: cs.CV

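TL;DR: Proposes a new conformal prediction-based method that lets users directly control the fidelity of generated explanations, and validates on several image datasets that it outperforms existing methods in fidelity and informational efficiency.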

Details Motivation: Existing feature attribution methods for explaining image predictions fall short in robustness and fidelity and usually depend on ground-truth explanations for calibration, which limits their reliability and practicality. Method: The conformal prediction-based method generates explanations by identifying a subset of salient features sufficient to preserve the model's prediction, and defines four conformity functions that measure how well explanations conform to the model's predictions, without access to ground-truth explanations for calibration. Result: Five explainers are evaluated on six image datasets; FastSHAP consistently outperforms the competing methods in fidelity and informational efficiency (size of the explanation regions), and super-pixel-based conformity measures are more effective than pixel-wise ones. Conclusion: The method improves the fidelity and robustness of explanations, lets users control explanation quality, and does not rely on ground-truth explanations for calibration, which makes it practical. Abstract: Feature attribution methods are widely used for explaining image-based predictions, as they provide feature-level insights that can be intuitively visualized. However, such explanations often vary in their robustness and may fail to faithfully reflect the reasoning of the underlying black-box model. To address these limitations, we propose a novel conformal prediction-based approach that enables users to directly control the fidelity of the generated explanations. The method identifies a subset of salient features that is sufficient to preserve the model's prediction, regardless of the information carried by the excluded features, and without demanding access to ground-truth explanations for calibration. Four conformity functions are proposed to quantify the extent to which explanations conform to the model's predictions. The approach is empirically evaluated using five explainers across six image datasets. The empirical results demonstrate that FastSHAP consistently outperforms the competing methods in terms of both fidelity and informational efficiency, the latter measured by the size of the explanation regions. Furthermore, the results reveal that conformity measures based on super-pixels are more effective than their pixel-wise counterparts.
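One plausible reading of "user-controlled fidelity" is split conformal calibration of a threshold on explanation size. The sketch below follows that reading and is not the authors' procedure: it assumes each calibration image has a nonconformity score equal to the fraction of top-attributed features needed to preserve the model's prediction, and the four conformity functions from the paper are not reproduced. All names are hypothetical.

```python
import numpy as np

def conformal_quantile(scores, alpha):
    """Split-conformal threshold: the ceil((n + 1)(1 - alpha))-th smallest score."""
    scores = np.sort(np.asarray(scores, dtype=float))
    n = len(scores)
    k = int(np.ceil((n + 1) * (1 - alpha)))
    if k > n:
        return float("inf")   # too few calibration points for this alpha
    return float(scores[k - 1])

def select_features(attributions, fraction):
    """Boolean mask keeping the top `fraction` of features by attribution."""
    attributions = np.asarray(attributions, dtype=float)
    k = min(len(attributions), int(np.ceil(fraction * len(attributions))))
    mask = np.zeros(len(attributions), dtype=bool)
    mask[np.argsort(-attributions)[:k]] = True
    return mask

# Hypothetical calibration scores: fraction of features each image needed.
calibration_scores = [0.1, 0.4, 0.7]
threshold = conformal_quantile(calibration_scores, alpha=0.5)
mask = select_features([0.1, 0.9, 0.5, 0.3], fraction=0.5)
```

Under the usual exchangeability assumption, keeping `threshold` of the features on a new image would preserve the prediction with probability at least `1 - alpha`, which is the kind of guarantee that lets a user trade explanation size against fidelity.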

[136] Sigma: Semantically Informative Pre-training for Skeleton-based Sign Language Understanding

Muxin Pu,Mei Kuan Lim,Chun Yong Chong,Chen Change Loy

Main category: cs.CV

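TL;DR: Proposes Sigma, a unified skeleton-based sign language understanding framework that uses sign-aware early fusion, hierarchical alignment learning, and a multi-task pre-training strategy to improve sign language recognition and translation.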

Details Motivation: Existing sign language understanding methods suffer from weak semantic grounding, an imbalance between local details and global context, and inefficient cross-modal learning, so they need more effective semantic alignment and fusion mechanisms. Method: The Sigma framework comprises a sign-aware early fusion mechanism, a hierarchical alignment learning strategy, and a unified pre-training framework that combines contrastive learning, text matching, and language modelling. Result: Sigma reaches state-of-the-art performance on isolated and continuous sign language recognition and on gloss-free translation across multiple benchmarks, supporting skeletal data as a stand-alone solution. Conclusion: By strengthening semantic grounding and cross-modal alignment, Sigma markedly improves skeleton-based sign language understanding and shows the importance of semantic information in pre-training. Abstract: Pre-training has proven effective for learning transferable features in sign language understanding (SLU) tasks. Recently, skeleton-based methods have gained increasing attention because they can robustly handle variations in subjects and backgrounds without being affected by appearance or environmental factors. Current SLU methods continue to face three key limitations: 1) weak semantic grounding, as models often capture low-level motion patterns from skeletal data but struggle to relate them to linguistic meaning; 2) imbalance between local details and global context, with models either focusing too narrowly on fine-grained cues or overlooking them for broader context; and 3) inefficient cross-modal learning, as constructing semantically aligned representations across modalities remains difficult. To address these, we propose Sigma, a unified skeleton-based SLU framework featuring: 1) a sign-aware early fusion mechanism that facilitates deep interaction between visual and textual modalities, enriching visual features with linguistic context; 2) a hierarchical alignment learning strategy that jointly maximises agreements across different levels of paired features from different modalities, effectively capturing both fine-grained details and high-level semantic relationships; and 3) a unified pre-training framework that combines contrastive learning, text matching and language modelling to promote semantic consistency and generalisation.
Sigma achieves new state-of-the-art results on isolated sign language recognition, continuous sign language recognition, and gloss-free sign language translation on multiple benchmarks spanning different sign and spoken languages, demonstrating the impact of semantically informative pre-training and the effectiveness of skeletal data as a stand-alone solution for SLU.

[137] CARINOX: Inference-time Scaling with Category-Aware Reward-based Initial Noise Optimization and Exploration

Seyed Amir Kasaei,Ali Aghayari,Arash Marioriyad,Niki Sepasian,Shayan Baghayi Nejad,MohammadAmin Fazli,Mahdieh Soleymani Baghshah,Mohammad Hossein Rohban

Main category: cs.CV

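TL;DR: Proposes CARINOX, a unified framework that combines initial-noise optimization and exploration with a reward selection procedure grounded in correlation with human judgments, substantially improving compositional alignment in text-to-image generation.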

Details Motivation: Existing text-to-image diffusion models struggle to achieve compositional alignment when prompts describe complex object relationships, attributes, or spatial layouts, and current inference-time methods are limited when optimization or exploration is used alone. Method: The CARINOX framework combines optimization and exploration of the initial noise with a principled reward selection procedure that screens reward functions by their correlation with human judgments. Result: Average alignment scores rise by +16% on T2I-CompBench++ and +11% on the HRS benchmark, with gains over state-of-the-art methods across all major categories while image quality and diversity are preserved. Conclusion: By merging optimization with exploration and using more reliable rewards, CARINOX addresses compositional challenges in text-to-image generation with greater robustness and generalization. Abstract: Text-to-image diffusion models, such as Stable Diffusion, can produce high-quality and diverse images but often fail to achieve compositional alignment, particularly when prompts describe complex object relationships, attributes, or spatial arrangements. Recent inference-time approaches address this by optimizing or exploring the initial noise under the guidance of reward functions that score text-image alignment without requiring model fine-tuning. While promising, each strategy has intrinsic limitations when used alone: optimization can stall due to poor initialization or unfavorable search trajectories, whereas exploration may require a prohibitively large number of samples to locate a satisfactory output. Our analysis further shows that neither single reward metrics nor ad-hoc combinations reliably capture all aspects of compositionality, leading to weak or inconsistent guidance. To overcome these challenges, we present Category-Aware Reward-based Initial Noise Optimization and Exploration (CARINOX), a unified framework that combines noise optimization and exploration with a principled reward selection procedure grounded in correlation with human judgments. Evaluations on two complementary benchmarks covering diverse compositional challenges show that CARINOX raises average alignment scores by +16% on T2I-CompBench++ and +11% on the HRS benchmark, consistently outperforming state-of-the-art optimization and exploration-based methods across all major categories, while preserving image quality and diversity. The project page is available at https://amirkasaei.com/carinox/.
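
The interplay of exploration and optimization described in the abstract can be sketched on a toy problem. This is an illustrative sketch only: the quadratic `reward`, the `target` vector, and all function names are hypothetical stand-ins, no diffusion model is involved, and the category-aware reward selection is not modeled.

```python
import random

def reward(z, target):
    """Toy stand-in for a text-image alignment reward (higher is better)."""
    return -sum((zi - ti) ** 2 for zi, ti in zip(z, target))

def explore(target, num_samples, dim, rng):
    """Exploration: sample several initial noises and keep the best one."""
    candidates = [[rng.gauss(0.0, 1.0) for _ in range(dim)]
                  for _ in range(num_samples)]
    return max(candidates, key=lambda z: reward(z, target))

def optimize(z, target, steps=10, lr=0.1):
    """Optimization: gradient ascent on the toy reward from a given start."""
    for _ in range(steps):
        # Gradient of -(z - t)^2 with respect to z is 2 * (t - z).
        z = [zi + lr * 2.0 * (ti - zi) for zi, ti in zip(z, target)]
    return z

rng = random.Random(0)
target = [0.5, -1.0, 2.0, 0.0]
best_start = explore(target, num_samples=8, dim=4, rng=rng)
refined = optimize(best_start, target)
print(reward(best_start, target), reward(refined, target))
```

Exploration supplies a reasonable starting point so the optimizer is less likely to stall, and optimization refines that point without requiring many more samples, which mirrors the complementary roles the abstract assigns to the two strategies.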

[138] Evaluating the Evaluators: Metrics for Compositional Text-to-Image Generation

Seyed Amir Kasaei,Ali Aghayari,Arash Marioriyad,Niki Sepasian,MohammadAmin Fazli,Mahdieh Soleymani Baghshah,Mohammad Hossein Rohban

Main category: cs.CV

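TL;DR: Studies how well widely used evaluation metrics for text-image generation agree with human judgments. Metrics behave differently across compositional tasks, and no single metric is best on all of them. VQA-based metrics are not uniformly superior, certain embedding-based metrics are stronger in specific cases, and image-only metrics contribute little to compositional evaluation. The study calls for careful, transparent metric selection.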

Details Motivation: Evaluation of text-image generation relies heavily on automated metrics, yet it is unclear whether these metrics reflect human preferences. Because evaluation results and reported progress depend on them, their validity needs systematic analysis. Method: A broad study of widely used metrics for compositional text-image evaluation that goes beyond simple correlation, examining metric behavior across diverse compositional challenges and comparing how different metric families align with human judgments. Result: No single metric performs consistently best across tasks; performance varies with the type of compositional problem. VQA-based metrics are not uniformly superior, certain embedding-based metrics are stronger in specific cases, and image-only metrics contribute little to compositional evaluation. Conclusion: Metrics should be chosen carefully for the task at hand rather than by popularity. Transparent, targeted metric selection matters both for trustworthy evaluation and for using metrics as reward models in generation. Abstract: Text-image generation has advanced rapidly, but assessing whether outputs truly capture the objects, attributes, and relations described in prompts remains a central challenge. Evaluation in this space relies heavily on automated metrics, yet these are often adopted by convention or popularity rather than validated against human judgment. Because evaluation and reported progress in the field depend directly on these metrics, it is critical to understand how well they reflect human preferences. To address this, we present a broad study of widely used metrics for compositional text-image evaluation. Our analysis goes beyond simple correlation, examining their behavior across diverse compositional challenges and comparing how different metric families align with human judgments. The results show that no single metric performs consistently across tasks: performance varies with the type of compositional problem. Notably, VQA-based metrics, though popular, are not uniformly superior, while certain embedding-based metrics prove stronger in specific cases. Image-only metrics, as expected, contribute little to compositional evaluation, as they are designed for perceptual quality rather than alignment. These findings underscore the importance of careful and transparent metric selection, both for trustworthy evaluation and for their use as reward models in generation. Project page is available at https://amirkasaei.com/eval-the-evals/.

[139] SlideMamba: Entropy-Based Adaptive Fusion of GNN and Mamba for Enhanced Representation Learning in Digital Pathology

Shakib Khan,Fariba Dambandkhameneh,Nazim Shaikh,Yao Nie,Raghavan Venugopal,Xiao Li

Main category: cs.CV

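TL;DR: Proposes SlideMamba, a generalizable deep learning framework that combines the Mamba architecture with graph neural networks (GNNs) for whole slide image (WSI) analysis and integrates local and global information through an entropy-based adaptive fusion strategy. It achieves a higher PRAUC than the compared methods on gene fusion and mutation status prediction.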

Details Motivation: Better whole slide image (WSI) analysis requires capturing both local spatial relationships and long-range contextual dependencies, and existing methods are limited in modeling the two together. Method: The SlideMamba framework combines Mamba modules (strong at long-range dependencies) with GNNs (focused on fine-grained local interactions) and uses an entropy-based confidence weighting mechanism to adaptively fuse the outputs of the two branches. Result: On gene fusion and mutation prediction, SlideMamba reaches a PRAUC of 0.751 ± 0.05, ahead of MIL, Trans-MIL, Mamba-only, GNN-only, and GAT-Mamba; its ROC AUC, sensitivity, and specificity are also competitive. Conclusion: Integrating Mamba and GNNs with entropy-driven adaptive fusion improves WSI analysis and shows potential for broad use in computational pathology. Abstract: Advances in computational pathology increasingly rely on extracting meaningful representations from Whole Slide Images (WSIs) to support various clinical and biological tasks. In this study, we propose a generalizable deep learning framework that integrates the Mamba architecture with Graph Neural Networks (GNNs) for enhanced WSI analysis. Our method is designed to capture both local spatial relationships and long-range contextual dependencies, offering a flexible architecture for digital pathology analysis. Mamba modules excel in capturing long-range global dependencies, while GNNs emphasize fine-grained short-range spatial interactions. To effectively combine these complementary signals, we introduce an adaptive fusion strategy that uses an entropy-based confidence weighting mechanism. This approach dynamically balances contributions from both branches by assigning higher weight to the branch with more confident (lower-entropy) predictions, depending on the contextual importance of local versus global information for different downstream tasks. We demonstrate the utility of our approach on a representative task: predicting gene fusion and mutation status from WSIs. Our framework, SlideMamba, achieves an area under the precision recall curve (PRAUC) of 0.751 ± 0.05, outperforming MIL (0.491 ± 0.042), Trans-MIL (0.39 ± 0.017), Mamba-only (0.664 ± 0.063), GNN-only (0.748 ± 0.091), and a prior similar work GAT-Mamba (0.703 ± 0.075). SlideMamba also achieves competitive results across ROC AUC (0.738 ± 0.055), sensitivity (0.662 ± 0.083), and specificity (0.725 ± 0.094).
These results highlight the strength of the integrated architecture, enhanced by the proposed entropy-based adaptive fusion strategy, and suggest promising potential for application to spatially resolved predictive modeling tasks in computational pathology.
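
The idea of giving more weight to the lower-entropy branch can be sketched as follows. This is a minimal illustration under assumptions: the softmax over negative entropies, the function names, and the example probabilities are hypothetical choices, and the abstract does not specify the exact weighting formula used in SlideMamba.

```python
import math

def entropy(probs):
    """Shannon entropy (in nats) of a probability vector."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def entropy_weighted_fusion(probs_a, probs_b):
    """Fuse two branch predictions, favouring the lower-entropy branch."""
    h_a, h_b = entropy(probs_a), entropy(probs_b)
    # Softmax over negative entropies: lower entropy -> larger weight.
    e_a, e_b = math.exp(-h_a), math.exp(-h_b)
    w_a = e_a / (e_a + e_b)
    w_b = 1.0 - w_a
    fused = [w_a * pa + w_b * pb for pa, pb in zip(probs_a, probs_b)]
    return fused, w_a, w_b

mamba_probs = [0.9, 0.1]   # confident branch (hypothetical values)
gnn_probs = [0.5, 0.5]     # uncertain branch (hypothetical values)
fused, w_mamba, w_gnn = entropy_weighted_fusion(mamba_probs, gnn_probs)
print(fused, w_mamba, w_gnn)
```

Because the weights sum to one, the fused output remains a valid probability vector, and the more confident branch dominates without fully silencing the other.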

[140] Hunyuan3D-Omni: A Unified Framework for Controllable Generation of 3D Assets

Team Hunyuan3D: Bowen Zhang,Chunchao Guo,Haolin Liu,Hongyu Yan,Huiwen Shi,Jingwei Huang,Junlin Yu,Kunhong Li,Linus,Penghao Wang,Qingxiang Lin,Sicong Liu,Xianghui Yang,Yixuan Tang,Yunfei Zhao,Zeqiang Lai,Zhihao Liang,Zibo Zhao

Main category: cs.CV

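TL;DR: Hunyuan3D-Omni is a unified framework built on Hunyuan3D 2.1 for fine-grained, controllable 3D asset generation. It accepts several conditioning signals (point clouds, voxels, bounding boxes, and skeletal pose priors) for precise control over geometry, topology, and pose, and uses a difficulty-aware sampling strategy to improve multi-modal fusion and robustness.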

Details Motivation: Existing 3D generative models rely mainly on image or text conditioning and lack fine-grained, cross-modal control, which limits controllability and practical adoption. Method: Hunyuan3D-Omni unifies multiple conditioning signals (images, point clouds, voxels, bounding boxes, skeletal pose) in a single cross-modal architecture and is trained with a progressive, difficulty-aware sampling strategy. Result: Experiments show improved generation accuracy, support for geometry-aware transformations, and greater robustness in production workflows. Conclusion: Hunyuan3D-Omni enables finer control over 3D assets and improves the performance and practicality of multi-modal conditional generation. Abstract: Recent advances in 3D-native generative models have accelerated asset creation for games, film, and design. However, most methods still rely primarily on image or text conditioning and lack fine-grained, cross-modal controls, which limits controllability and practical adoption. To address this gap, we present Hunyuan3D-Omni, a unified framework for fine-grained, controllable 3D asset generation built on Hunyuan3D 2.1. In addition to images, Hunyuan3D-Omni accepts point clouds, voxels, bounding boxes, and skeletal pose priors as conditioning signals, enabling precise control over geometry, topology, and pose. Instead of separate heads for each modality, our model unifies all signals in a single cross-modal architecture. We train with a progressive, difficulty-aware sampling strategy that selects one control modality per example and biases sampling toward harder signals (e.g., skeletal pose) while downweighting easier ones (e.g., point clouds), encouraging robust multi-modal fusion and graceful handling of missing inputs. Experiments show that these additional controls improve generation accuracy, enable geometry-aware transformations, and increase robustness for production workflows.
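
Selecting one control modality per example with a bias toward harder signals can be sketched as weighted sampling. The weights below are hypothetical, as is the relative ordering of bounding boxes and voxels; the abstract only states that skeletal pose is treated as harder and point clouds as easier, and the progressive schedule is not modeled here.

```python
import random
from collections import Counter

# Hypothetical sampling weights: harder signals get larger weights.
MODALITY_WEIGHTS = {
    "skeletal_pose": 0.4,
    "bounding_box": 0.3,
    "voxel": 0.2,
    "point_cloud": 0.1,
}

def sample_control_modality(rng):
    """Pick exactly one control modality for a training example."""
    names = list(MODALITY_WEIGHTS)
    weights = [MODALITY_WEIGHTS[name] for name in names]
    return rng.choices(names, weights=weights, k=1)[0]

rng = random.Random(0)
counts = Counter(sample_control_modality(rng) for _ in range(10000))
print(counts)
```

Over many draws the harder modality appears more often in training batches, while every modality is still sampled, so the model continues to see each kind of control signal on its own.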

[141] Learning to Look: Cognitive Attention Alignment with Vision-Language Models

Ryan L. Yang,Dipkamal Bhusal,Nidhi Rastogi

Main category: cs.CV

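TL;DR: Proposes a scalable framework that uses vision-language models to automatically generate semantic attention maps and aligns CNN attention with them through an auxiliary loss, improving generalization and cognitive plausibility.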

Details Motivation: Existing methods guide model attention with expert-annotated concept supervision and explanation regularization, but the annotations are costly and hard to scale. Method: Vision-language models and natural language prompts are used to generate semantic attention maps automatically, and an auxiliary loss aligns CNN attention with these maps. Result: The method is state-of-the-art on ColoredMNIST and competitive with annotation-heavy methods on DecoyMNIST, with less reliance on shortcuts and attention that better matches human intuition. Conclusion: The approach improves the reliability and interpretability of model decisions without manual annotation and scales well. Abstract: Convolutional Neural Networks (CNNs) frequently "cheat" by exploiting superficial correlations, raising concerns about whether they make predictions for the right reasons. Inspired by cognitive science, which highlights the role of attention in robust human perception, recent methods have sought to guide model attention using concept-based supervision and explanation regularization. However, these techniques depend on labor-intensive, expert-provided annotations, limiting their scalability. We propose a scalable framework that leverages vision-language models to automatically generate semantic attention maps using natural language prompts. By introducing an auxiliary loss that aligns CNN attention with these language-guided maps, our approach promotes more reliable and cognitively plausible decision-making without manual annotation. Experiments on challenging datasets, ColoredMNIST and DecoyMNIST, show that our method achieves state-of-the-art performance on ColoredMNIST and remains competitive with annotation-heavy baselines on DecoyMNIST, demonstrating improved generalization, reduced shortcut reliance, and model attention that better reflects human intuition.

[142] Decipher-MR: A Vision-Language Foundation Model for 3D MRI Representations

Zhijian Yang,Noel DSouza,Istvan Megyeri,Xiaojian Xu,Amin Honarmandi Shandiz,Farzin Haddadpour,Krisztian Koos,Laszlo Rusko,Emanuele Valeriano,Bharadwaj Swaninathan,Lei Wu,Parminder Bhatia,Taha Kass-Hout,Erhan Bas

Main category: cs.CV

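TL;DR: Decipher-MR is a 3D vision-language foundation model trained on large-scale MRI data spanning many anatomical regions. It combines self-supervised learning with report-guided text supervision, supports modular fine-tuning of lightweight decoders, and shows strong generalization and performance across a range of clinical tasks.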

Details Motivation: Data scarcity and narrow anatomical focus limit the use of existing foundation models in MRI analysis, so a scalable, general-purpose MRI-specific foundation model is needed. Method: Decipher-MR uses a 3D vision-language architecture that combines self-supervised vision learning with report-text supervision, is trained on 200,000 MRI series from over 22,000 studies, and supports modular fine-tuning with a frozen encoder plus lightweight task-specific decoders. Result: On disease classification, demographic prediction, anatomical localization, and cross-modal retrieval, Decipher-MR consistently outperforms existing foundation models and task-specific approaches. Conclusion: Decipher-MR is a scalable, versatile MRI foundation model that supports efficient development of MRI-based AI in clinical and research settings. Abstract: Magnetic Resonance Imaging (MRI) is a critical medical imaging modality in clinical diagnosis and research, yet its complexity and heterogeneity pose challenges for automated analysis, particularly in scalable and generalizable machine learning applications. While foundation models have revolutionized natural language and vision tasks, their application to MRI remains limited due to data scarcity and narrow anatomical focus. In this work, we present Decipher-MR, a 3D MRI-specific vision-language foundation model trained on a large-scale dataset comprising 200,000 MRI series from over 22,000 studies spanning diverse anatomical regions, sequences, and pathologies. Decipher-MR integrates self-supervised vision learning with report-guided text supervision to build robust, generalizable representations, enabling effective adaptation across broad applications. To enable robust and diverse clinical tasks with minimal computational overhead, Decipher-MR supports a modular design that enables tuning of lightweight, task-specific decoders attached to a frozen pretrained encoder. Following this setting, we evaluate Decipher-MR across diverse benchmarks including disease classification, demographic prediction, anatomical localization, and cross-modal retrieval, demonstrating consistent performance gains over existing foundation models and task-specific approaches. Our results establish Decipher-MR as a scalable and versatile foundation for MRI-based AI, facilitating efficient development across clinical and research domains.

[143] Instruction-tuned Self-Questioning Framework for Multimodal Reasoning

You-Won Jang,Yu-Jung Heo,Jaeseok Kim,Minsu Lee,Du-Seong Chang,Byoung-Tak Zhang

Main category: cs.CV

TL;DR: Proposes SQ-InstructBLIP, which improves multi-step reasoning in visual question answering by generating image-aware sub-questions and sub-answers.

Details Motivation: Existing methods are limited on vision-language understanding tasks that require multi-step reasoning: they cannot make effective use of fine-grained image information, and their reliance on black-box large language models hides the underlying mechanism. Method: SQ-InstructBLIP consists of a Questioner, an Answerer, and a Reasoner that share the same architecture and respectively generate sub-questions, generate sub-answers, and perform the final reasoning, giving image-aware iterative reasoning. Result: Experiments show that SQ-InstructBLIP, which uses the generated sub-questions as additional information on VQA tasks, reasons more accurately than previous methods. Conclusion: Through an interpretable modular design, SQ-InstructBLIP improves multi-step vision-language reasoning and avoids the problems of black-box models and missing fine-grained visual information. Abstract: The field of vision-language understanding has been actively researched in recent years, thanks to the development of Large Language Models (LLMs). However, it still struggles with problems requiring multi-step reasoning, even for very simple questions. Recent studies adopt LLMs to tackle this problem by iteratively generating sub-questions and answers. However, there are disadvantages: 1) the fine-grained visual contents of images are not available to LLMs that cannot read visual information, and 2) the internal mechanisms of black-box LLMs are inaccessible and difficult to reproduce. To solve these problems, we propose the SQ (Self-Questioning)-InstructBLIP, which improves inference performance by generating image-aware informative sub-questions and sub-answers iteratively. The SQ-InstructBLIP consists of a Questioner, Answerer, and Reasoner that share the same architecture. Questioner and Answerer generate sub-questions and sub-answers to help infer the main-question, and Reasoner performs reasoning on the main-question considering the generated sub-question information. Our experiments show that the proposed method SQ-InstructBLIP, which uses the generated sub-questions as additional information when solving the VQA task, performs more accurate reasoning than the previous works.

[144] Hallucination as an Upper Bound: A New Perspective on Text-to-Image Evaluation

Seyed Amir Kasaei,Mohammad Hossein Rohban

Main category: cs.CV

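TL;DR: Proposes a new definition of hallucination in text-to-image generative models as bias-driven deviation, together with a taxonomy of three categories (attribute, relation, and object hallucinations) to support deeper model evaluation.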

Details Motivation: Existing evaluations focus on whether generated content matches the prompt and overlook what the model generates beyond the prompt, leaving hallucination in text-to-image models without a clear definition. Method: Hallucination in text-to-image generation is defined as deviation from the prompt driven by the model's prior knowledge or biases, and a taxonomy with attribute, relation, and object hallucinations is constructed. Result: The framing introduces an upper bound for evaluating text-to-image models and surfaces hidden model biases, supporting a more complete assessment of generated content. Conclusion: A structured hallucination taxonomy helps in understanding and evaluating bias in text-to-image models and provides a basis for future work. Abstract: In language and vision-language models, hallucination is broadly understood as content generated from a model's prior knowledge or biases rather than from the given input. While this phenomenon has been studied in those domains, it has not been clearly framed for text-to-image (T2I) generative models. Existing evaluations mainly focus on alignment, checking whether prompt-specified elements appear, but overlook what the model generates beyond the prompt. We argue for defining hallucination in T2I as bias-driven deviations and propose a taxonomy with three categories: attribute, relation, and object hallucinations. This framing introduces an upper bound for evaluation and surfaces hidden biases, providing a foundation for richer assessment of T2I models.

[145] Every Subtlety Counts: Fine-grained Person Independence Micro-Action Recognition via Distributionally Robust Optimization

Feng-Qi Cui,Jinyang Huang,Anyang Tong,Ziyu Jia,Jie Zhang,Zhi Liu,Dan Guo,Jianwei Lu,Meng Wang

Main category: cs.CV

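TL;DR: Proposes a person-independent universal micro-action recognition framework that uses distributionally robust optimization to learn representations robust across individuals, adding one module at the feature level and one at the loss level to improve generalization.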

Details Motivation: Because of large inter-person variability, existing micro-action recognition methods generalize poorly in real-world scenarios and struggle to recognize the same action reliably across different people. Method: The Person Independence Universal Micro-action Recognition Framework contains a feature-level Temporal-Frequency Alignment Module (a temporal branch that aligns dynamic trajectories with Wasserstein regularization and a frequency branch that introduces variance-guided perturbations) and a loss-level Group-Invariant Regularized Loss (pseudo-groups simulate unseen person-specific distributions, boundary cases are up-weighted, and subgroup variance is regularized). Result: Experiments on the large-scale MA-52 dataset show that the method outperforms existing methods in both accuracy and robustness and generalizes stably under fine-grained conditions. Conclusion: The framework reduces the effect of individual differences on micro-action recognition and achieves more robust, generalizable performance through feature alignment and loss regularization. Abstract: Micro-action Recognition is vital for psychological assessment and human-computer interaction. However, existing methods often fail in real-world scenarios because inter-person variability causes the same action to manifest differently, hindering robust generalization. To address this, we propose the Person Independence Universal Micro-action Recognition Framework, which integrates Distributionally Robust Optimization principles to learn person-agnostic representations. Our framework contains two plug-and-play components operating at the feature and loss levels. At the feature level, the Temporal-Frequency Alignment Module normalizes person-specific motion characteristics with a dual-branch design: the temporal branch applies Wasserstein-regularized alignment to stabilize dynamic trajectories, while the frequency branch introduces variance-guided perturbations to enhance robustness against person-specific spectral differences. A consistency-driven fusion mechanism integrates both branches. At the loss level, the Group-Invariant Regularized Loss partitions samples into pseudo-groups to simulate unseen person-specific distributions. By up-weighting boundary cases and regularizing subgroup variance, it forces the model to generalize beyond easy or frequent samples, thus enhancing robustness to difficult variations. Experiments on the large-scale MA-52 dataset demonstrate that our framework outperforms existing methods in both accuracy and robustness, achieving stable generalization under fine-grained conditions.

[146] Dense Semantic Matching with VGGT Prior

Songlin Yang,Tianyi Wei,Yushi Lan,Zeqi Xiao,Anyi Rao,Xingang Pan

Main category: cs.CV

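TL;DR: Proposes a semantic matching method built on the 3D geometric foundation model VGGT. It reuses early features, fine-tunes later ones, and adds a semantic head, and it combines cycle-consistent training with synthetic data augmentation to achieve cross-instance pixel-level semantic matching under data scarcity, with better geometry awareness and matching reliability.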

Details Motivation: Existing semantic matching methods suffer from geometric ambiguity and depend on the nearest-neighbor rule. They handle symmetric structures poorly, generalize weakly, and ignore cross-image invisibility and manifold preservation, which calls for more robust geometry-aware descriptors and holistic dense matching. Method: The approach uses VGGT's geometry-grounded features and holistic matching ability, keeps its early feature stages, fine-tunes the later stages, and adds a bidirectional semantic matching head; a cycle-consistent training strategy, synthetic data augmentation, and a progressive training recipe with aliasing artifact mitigation adapt it to semantic matching under data scarcity. Result: Experiments show better geometry awareness, matching reliability, and manifold preservation than previous baselines. Conclusion: The method addresses geometric ambiguity and the shortcomings of local matching in existing semantic matching by adapting the 3D geometric foundation model VGGT and training it effectively under data scarcity, offering a new and effective paradigm for semantic matching. Abstract: Semantic matching aims to establish pixel-level correspondences between instances of the same category and represents a fundamental task in computer vision. Existing approaches suffer from two limitations: (i) Geometric Ambiguity: Their reliance on 2D foundation model features (e.g., Stable Diffusion, DINO) often fails to disambiguate symmetric structures, requiring extra fine-tuning yet lacking generalization; (ii) Nearest-Neighbor Rule: Their pixel-wise matching ignores cross-image invisibility and neglects manifold preservation. These challenges call for geometry-aware pixel descriptors and holistic dense correspondence mechanisms. Inspired by recent advances in 3D geometric foundation models, we turn to VGGT, which provides geometry-grounded features and holistic dense matching capabilities well aligned with these needs. However, directly transferring VGGT is challenging, as it was originally designed for geometry matching within cross views of a single instance, misaligned with cross-instance semantic matching, and further hindered by the scarcity of dense semantic annotations. To address this, we propose an approach that (i) retains VGGT's intrinsic strengths by reusing early feature stages, fine-tuning later ones, and adding a semantic head for bidirectional correspondences; and (ii) adapts VGGT to the semantic matching scenario under data scarcity through a cycle-consistent training strategy, synthetic data augmentation, and a progressive training recipe with aliasing artifact mitigation.
Extensive experiments demonstrate that our approach achieves superior geometry awareness, matching reliability, and manifold preservation, outperforming previous baselines.

[147] MedVSR: Medical Video Super-Resolution with Cross State-Space Propagation

Xinyu Liu,Guolei Sun,Cheng Wang,Yixuan Yuan,Ender Konukoglu

Main category: cs.CV

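TL;DR: Proposes MedVSR, a framework for medical video super-resolution (VSR) whose Cross State-Space Propagation (CSSP) and Inner State-Space Reconstruction (ISSR) modules address alignment difficulties and artifacts in low-resolution medical videos and significantly improve reconstruction performance.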

Details Motivation: High-resolution medical videos are vital for diagnosis but hard to acquire because of hardware and physiological constraints; on low-resolution medical videos with noise, camera shake, and abrupt frame transitions, existing VSR models suffer from large optical flow errors and distorted tissue structures. Method: The MedVSR framework has two core modules: 1) Cross State-Space Propagation (CSSP), which projects distant frames as control matrices within state-space models to propagate consistent features across frames and improve alignment; and 2) Inner State-Space Reconstruction (ISSR), which combines long-range spatial feature learning with large-kernel short-range information aggregation to enhance tissue structures and reduce artifacts. Result: Experiments on four datasets from different medical scenarios (including endoscopy and cataract surgery) show that MedVSR significantly outperforms existing VSR methods in reconstruction quality and efficiency. Conclusion: By introducing state-space mechanisms, MedVSR addresses the alignment and structure-preservation problems of medical video super-resolution, has clinical practical value, and offers a new direction for medical video enhancement. Abstract: High-resolution (HR) medical videos are vital for accurate diagnosis, yet are hard to acquire due to hardware limitations and physiological constraints. Clinically, the collected low-resolution (LR) medical videos present unique challenges for video super-resolution (VSR) models, including camera shake, noise, and abrupt frame transitions, which result in significant optical flow errors and alignment difficulties. Additionally, tissues and organs exhibit continuous and nuanced structures, but current VSR models are prone to introducing artifacts and distorted features that can mislead doctors. To this end, we propose MedVSR, a tailored framework for medical VSR. It first employs Cross State-Space Propagation (CSSP) to address the imprecise alignment by projecting distant frames as control matrices within state-space models, enabling the selective propagation of consistent and informative features to neighboring frames for effective alignment. Moreover, we design an Inner State-Space Reconstruction (ISSR) module that enhances tissue structures and reduces artifacts with joint long-range spatial feature learning and large-kernel short-range information aggregation. Experiments across four datasets in diverse medical scenarios, including endoscopy and cataract surgeries, show that MedVSR significantly outperforms existing VSR models in reconstruction performance and efficiency. Code released at https://github.com/CUHK-AIM-Group/MedVSR.

[148] MMR1: Enhancing Multimodal Reasoning with Variance-Aware Sampling and Open Resources

Sicong Leng,Jing Wang,Jiaxi Li,Hao Zhang,Zhiqiang Hu,Boqiang Zhang,Yuming Jiang,Hang Zhang,Xin Li,Lidong Bing,Deli Zhao,Wei Lu,Yu Rong,Aixin Sun,Shijian Lu

Main category: cs.CV

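TL;DR: Proposes Variance-Aware Sampling (VAS) to address the gradient vanishing caused by low reward variance when training large multimodal reasoning models, and releases a high-quality long chain-of-thought dataset and a family of open-source models, improving the stability and effectiveness of RL post-training.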

Details Motivation: Progress in multimodal reasoning is held back by the lack of high-quality long chain-of-thought data and by the instability of reinforcement learning algorithms in post-training, in particular gradient vanishing under low reward variance. Method: Variance-Aware Sampling (VAS) selects data using a Variance Promotion Score (VPS); the authors curate about 1.6M long chain-of-thought cold-start examples and about 15k RL QA pairs, and show theoretically that reward variance lower-bounds the expected policy gradient magnitude. Result: Experiments on mathematical reasoning benchmarks show that VAS and the curated data are effective; ablation studies examine the contribution of each component; fully reproducible code and open-source models at multiple scales are released. Conclusion: VAS mitigates the optimization difficulty caused by low variance, and the released data and models provide important resources for multimodal reasoning and support reproducible research in the field. Abstract: Large multimodal reasoning models have achieved rapid progress, but their advancement is constrained by two major limitations: the absence of open, large-scale, high-quality long chain-of-thought (CoT) data, and the instability of reinforcement learning (RL) algorithms in post-training. Group Relative Policy Optimization (GRPO), the standard framework for RL fine-tuning, is prone to gradient vanishing when reward variance is low, which weakens optimization signals and impairs convergence. This work makes three contributions: (1) We propose Variance-Aware Sampling (VAS), a data selection strategy guided by Variance Promotion Score (VPS) that combines outcome variance and trajectory diversity to promote reward variance and stabilize policy optimization. (2) We release large-scale, carefully curated resources containing ~1.6M long CoT cold-start data and ~15k RL QA pairs, designed to ensure quality, difficulty, and diversity, along with a fully reproducible end-to-end training codebase. (3) We open-source a family of multimodal reasoning models in multiple scales, establishing standardized baselines for the community. Experiments across mathematical reasoning benchmarks demonstrate the effectiveness of both the curated data and the proposed VAS. Comprehensive ablation studies and analyses provide further insight into the contributions of each component. In addition, we theoretically establish that reward variance lower-bounds the expected policy gradient magnitude, with VAS serving as a practical mechanism to realize this guarantee. Our code, data, and checkpoints are available at https://github.com/LengSicong/MMR1.
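
Why low reward variance starves GRPO of signal can be seen from its group-normalized advantage, which is zero for every rollout when all rewards in a group are equal. The sketch below is an illustration under assumptions, not the authors' VPS: the prompt names and rewards are hypothetical, and only the outcome-variance part is shown, while trajectory diversity is omitted.

```python
import math

def group_advantages(rewards, eps=1e-8):
    """Group-normalized advantages: (r - mean) / (std + eps)."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(var)
    return [(r - mean) / (std + eps) for r in rewards]

def outcome_variance(rewards):
    """Population variance of rollout rewards for one prompt."""
    mean = sum(rewards) / len(rewards)
    return sum((r - mean) ** 2 for r in rewards) / len(rewards)

# Hypothetical binary rewards from four rollouts per prompt.
rollouts = {
    "too_easy": [1, 1, 1, 1],
    "too_hard": [0, 0, 0, 0],
    "informative": [1, 0, 1, 0],
}
ranked = sorted(rollouts, key=lambda k: outcome_variance(rollouts[k]), reverse=True)
print(ranked[0], group_advantages(rollouts["too_easy"]))
```

Prompts whose rollouts all succeed or all fail contribute zero advantage, so preferring prompts with mixed outcomes keeps the policy gradient informative.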

[149] A Sentinel-3 foundation model for ocean colour

Geoffrey Dawson,Remy Vandaele,Andrew Taylor,David Moffat,Helen Tamura-Wicks,Sarah Jackson,Rosie Lickorish,Paolo Fraccaro,Hywel Williams,Chunbo Luo,Anne Jones

Main category: cs.CV

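TL;DR: Presents a new geospatial AI foundation model based on the Prithvi-EO Vision Transformer architecture, pre-trained in a self-supervised way to reconstruct Sentinel-3 OLCI data, and evaluates it on two downstream tasks: chlorophyll concentration estimation and refinement of remote sensing-based estimates of ocean primary production.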

Details Motivation: Labelled data in ocean science are sparse and expensive to collect, and conventional models struggle to make full use of them; foundation models could exploit large unlabelled datasets to improve remote sensing applications. Method: A foundation model is built on the Prithvi-EO Vision Transformer architecture, pre-trained with self-supervision to reconstruct Sentinel-3 OLCI data, and then fine-tuned and evaluated on two downstream marine remote sensing tasks. Result: The model is assessed against current baselines for chlorophyll concentration and for refining estimates of ocean primary production; it makes effective use of small amounts of high-quality labelled data and captures detailed spatial patterns of ocean colour while matching point observations. Conclusion: This new generation of geospatial AI foundation models can provide more robust, data-driven insights into ocean ecosystems and their role in global climate processes. Abstract: Artificial Intelligence (AI) Foundation models (FMs), pre-trained on massive unlabelled datasets, have the potential to drastically change AI applications in ocean science, where labelled data are often sparse and expensive to collect. In this work, we describe a new foundation model using the Prithvi-EO Vision Transformer architecture which has been pre-trained to reconstruct data from the Sentinel-3 Ocean and Land Colour Instrument (OLCI). We evaluate the model by fine-tuning on two downstream marine earth observation tasks. We first assess model performance compared to current baseline models used to quantify chlorophyll concentration. We then evaluate the FM's ability to refine remote sensing-based estimates of ocean primary production. Our results demonstrate the utility of self-trained FMs for marine monitoring, in particular for making use of small amounts of high quality labelled data and in capturing detailed spatial patterns of ocean colour whilst matching point observations. We conclude that this new generation of geospatial AI models has the potential to provide more robust, data-driven insights into ocean ecosystems and their role in global climate processes.

[150] Does FLUX Already Know How to Perform Physically Plausible Image Composition?

Shilin Lu,Zhuming Lian,Zihan Zhou,Shaocong Zhang,Chen Zhao,Adams Wai-Kin Kong

Main category: cs.CV

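TL;DR: Proposes SHINE, a training-free image composition framework that uses pretrained adapters and a new loss to achieve high-quality, seamless object insertion under complex lighting and at high resolution.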

Details Motivation: With complex lighting and high-resolution inputs, existing methods produce low-quality results or unnatural object poses, or depend on brittle attention manipulation, and there is no framework that makes effective use of the priors in diffusion models. Method: The SHINE framework introduces a manifold-steered anchor loss that uses pretrained customization adapters (e.g., IP-Adapter) to guide latents, and adds degradation-suppression guidance and adaptive background blending to improve quality and consistency. Result: On the newly introduced ComplexCompo benchmark and on DreamEditBench, SHINE reaches state-of-the-art performance on standard metrics such as DINOv2 and on human-aligned scores such as DreamSim and ImageReward. Conclusion: SHINE offers an efficient, training-free solution for image composition that exploits the physical and resolution priors of existing diffusion models to achieve high-fidelity, seamless object insertion in complex scenes. Abstract: Image composition aims to seamlessly insert a user-specified object into a new scene, but existing models struggle with complex lighting (e.g., accurate shadows, water reflections) and diverse, high-resolution inputs. Modern text-to-image diffusion models (e.g., SD3.5, FLUX) already encode essential physical and resolution priors, yet lack a framework to unleash them without resorting to latent inversion, which often locks object poses into contextually inappropriate orientations, or brittle attention surgery. We propose SHINE, a training-free framework for Seamless, High-fidelity Insertion with Neutralized Errors. SHINE introduces manifold-steered anchor loss, leveraging pretrained customization adapters (e.g., IP-Adapter) to guide latents for faithful subject representation while preserving background integrity. Degradation-suppression guidance and adaptive background blending are proposed to further eliminate low-quality outputs and visible seams. To address the lack of rigorous benchmarks, we introduce ComplexCompo, featuring diverse resolutions and challenging conditions such as low lighting, strong illumination, intricate shadows, and reflective surfaces. Experiments on ComplexCompo and DreamEditBench show state-of-the-art performance on standard metrics (e.g., DINOv2) and human-aligned scores (e.g., DreamSim, ImageReward, VisionReward). Code and benchmark will be publicly available upon publication.

[151] Quantized Visual Geometry Grounded Transformer

Weilun Feng,Haotong Qin,Mingqiang Wu,Chuanguang Yang,Yuqi Li,Xiangqi Li,Zhulin An,Libo Huang,Yulun Zhang,Michele Magno,Yongjun Xu

Main category: cs.CV

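TL;DR: Proposes QuantVGGT, the first quantization framework for Visual Geometry Grounded Transformers (VGGTs). Dual-smoothed fine-grained quantization and noise-filtered diverse sampling address the heavy-tailed distributions and unstable calibration that arise in post-training quantization of large 3D reconstruction models; at 4 bits it delivers a 3.7× memory reduction and 2.5× acceleration while keeping accuracy above 98% of full precision.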

Details Motivation: Large VGGT models are hard to deploy because of their computational and memory costs, and existing post-training quantization methods face heavy-tailed activation distributions and calibration instability caused by multi-view data when applied to billion-scale VGGTs, so a dedicated quantization scheme is needed. Method: The QuantVGGT framework includes two techniques: 1) Dual-Smoothed Fine-Grained Quantization, which combines a global Hadamard rotation with local channel smoothing to mitigate heavy-tailed distributions and inter-channel variance; and 2) Noise-Filtered Diverse Sampling, which filters outliers using deep-layer statistics and builds frame-aware diverse calibration clusters to stabilize quantization ranges. Result: QuantVGGT reaches state-of-the-art results across benchmarks and bit-widths; at 4 bits it achieves a 3.7× memory reduction and 2.5× acceleration in real-hardware inference while keeping reconstruction accuracy above 98% of the full-precision model. Conclusion: QuantVGGT makes VGGT models more practical and efficient in resource-constrained settings and provides an effective quantization solution for deploying large-scale 3D reconstruction models. Abstract: Learning-based 3D reconstruction models, represented by Visual Geometry Grounded Transformers (VGGTs), have made remarkable progress with the use of large-scale transformers. Their prohibitive computational and memory costs severely hinder real-world deployment. Post-Training Quantization (PTQ) has become a common practice for compressing and accelerating models. However, we empirically observe that PTQ faces unique obstacles when compressing billion-scale VGGTs: the data-independent special tokens induce heavy-tailed activation distributions, while the multi-view nature of 3D data makes calibration sample selection highly unstable. This paper proposes the first Quantization framework for VGGTs, namely QuantVGGT. This mainly relies on two technical contributions: First, we introduce Dual-Smoothed Fine-Grained Quantization, which integrates pre-global Hadamard rotation and post-local channel smoothing to mitigate heavy-tailed distributions and inter-channel variance robustly. Second, we design Noise-Filtered Diverse Sampling, which filters outliers via deep-layer statistics and constructs frame-aware diverse calibration clusters to ensure stable quantization ranges. Comprehensive experiments demonstrate that QuantVGGT achieves state-of-the-art results across different benchmarks and bit-widths, surpassing the previous state-of-the-art generic quantization method by a large margin.
We highlight that our 4-bit QuantVGGT can deliver a 3.7$\times$ memory reduction and 2.5$\times$ acceleration in real-hardware inference, while maintaining reconstruction accuracy above 98\% of its full-precision counterpart. This demonstrates the vast advantages and practicality of QuantVGGT in resource-constrained scenarios. Our code is released in https://github.com/wlfeng0509/QuantVGGT.
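
The two ingredients of Dual-Smoothed Fine-Grained Quantization can be illustrated with a small sketch. It assumes a Sylvester-type orthonormal Hadamard matrix for the rotation and a SmoothQuant-style per-channel scale for the smoothing; the function names, the toy tensors, and the `alpha` parameter are illustrative assumptions and are not taken from the QuantVGGT code, whose actual formulation may differ. The sketch only shows that rotating first and smoothing second lowers the activation peak while leaving the layer output unchanged.

```python
import math

def hadamard(n):
    """Orthonormal Hadamard matrix via the Sylvester construction (n must be a power of two)."""
    assert n > 0 and n & (n - 1) == 0
    h = [[1.0]]
    while len(h) < n:
        h = [row + row for row in h] + [row + [-v for v in row] for row in h]
    scale = 1.0 / math.sqrt(n)
    return [[v * scale for v in row] for row in h]

def matmul(a, b):
    """Plain matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def peak(m):
    """Largest absolute entry, which sets the range of a per-tensor quantizer."""
    return max(abs(v) for row in m for v in row)

def smooth_scales(acts, weights, alpha=0.5):
    """Assumed SmoothQuant-style scale: s_j = max|X[:, j]|**alpha / max|W[j, :]|**(1 - alpha)."""
    scales = []
    for j, w_row in enumerate(weights):
        a_max = max(abs(row[j]) for row in acts)
        w_max = max(abs(v) for v in w_row)
        scales.append(a_max ** alpha / w_max ** (1 - alpha))
    return scales

# Toy activations (2 tokens x 4 channels) with one outlier channel, and toy weights (4 x 2).
X = [[0.5, -0.3, 40.0, 0.2], [0.1, 0.4, -35.0, -0.2]]
W = [[0.2, -0.1], [0.3, 0.4], [-0.5, 0.1], [0.05, -0.2]]
reference = matmul(X, W)

# Step 1: global rotation. The Sylvester matrix is symmetric and orthonormal,
# so (X H)(H W) = X W while the outlier is spread across all channels.
H = hadamard(4)
X_rot = matmul(X, H)
W_rot = matmul(H, W)

# Step 2: local per-channel smoothing applied after the rotation.
s = smooth_scales(X_rot, W_rot)
X_s = [[v / s[j] for j, v in enumerate(row)] for row in X_rot]
W_s = [[v * s[j] for v in row] for j, row in enumerate(W_rot)]
smoothed = matmul(X_s, W_s)

print("activation peak before/after rotation:", peak(X), peak(X_rot))
```

Both transformations are exact reparameterizations in full precision, so any accuracy change comes only from the quantizer applied afterwards; the sketch deliberately stops before quantization and makes no claim about the resulting error.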

[152] NewtonGen: Physics-Consistent and Controllable Text-to-Video Generation via Neural Newtonian Dynamics

Yu Yuan,Xijun Wang,Tharindu Wickremasinghe,Zeeshan Nadir,Bole Ma,Stanley H. Chan

Main category: cs.CV

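TL;DR: This paper proposes NewtonGen, a framework that combines data-driven synthesis with learnable physical principles. It introduces trainable Neural Newtonian Dynamics (NND) to model and predict Newtonian motion, injecting latent dynamical constraints into video generation to achieve physically consistent, parameter-controllable video synthesis.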

Details Motivation: Existing text-to-video generation models are bottlenecked by physical consistency and controllability: they often produce unrealistic motion and lack precise control over dynamics under different initial conditions. The root cause is that these models learn motion distributions from appearance alone, without understanding the underlying dynamics. Method: The NewtonGen framework is built around trainable Neural Newtonian Dynamics (NND), which combines data priors with dynamical guidance and brings physical laws into the generation process in order to model and predict complex Newtonian motion. Result: NewtonGen generates more physically consistent videos and supports precise control over motion parameters, showing greater stability and realism under different initial conditions. Conclusion: By incorporating learnable physical principles, NewtonGen addresses the implausible motion and poor controllability of current text-to-video generation and points to a new direction for physically plausible video generation. Abstract: A primary bottleneck in large-scale text-to-video generation today is physical consistency and controllability. Despite recent advances, state-of-the-art models often produce unrealistic motions, such as objects falling upward, or abrupt changes in velocity and direction. Moreover, these models lack precise parameter control, struggling to generate physically consistent dynamics under different initial conditions. We argue that this fundamental limitation stems from current models learning motion distributions solely from appearance, while lacking an understanding of the underlying dynamics. In this work, we propose NewtonGen, a framework that integrates data-driven synthesis with learnable physical principles. At its core lies trainable Neural Newtonian Dynamics (NND), which can model and predict a variety of Newtonian motions, thereby injecting latent dynamical constraints into the video generation process. By jointly leveraging data priors and dynamical guidance, NewtonGen enables physically consistent video synthesis with precise parameter control.
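
To make the idea of learning dynamics and then predicting motion from new initial conditions concrete, here is a deliberately tiny sketch. It is not the paper's NND: instead of a neural network it fits a single constant acceleration from an observed 1D trajectory using second finite differences, then rolls the fitted dynamics forward from a different initial state. The names `rollout` and `fit_accel` and all numeric values are hypothetical.

```python
def rollout(y0, v0, accel, dt, steps):
    """Positions of a point mass under constant acceleration (exact update for this case)."""
    ys = []
    y, v = y0, v0
    for _ in range(steps):
        ys.append(y)
        y = y + v * dt + 0.5 * accel * dt * dt
        v = v + accel * dt
    return ys

def fit_accel(ys, dt):
    """Estimate a constant acceleration as the mean second finite difference of positions."""
    diffs = [(ys[i + 1] - 2 * ys[i] + ys[i - 1]) / (dt * dt) for i in range(1, len(ys) - 1)]
    return sum(diffs) / len(diffs)

dt = 0.1

# "Training" trajectory: an object dropped from rest at height 10 under gravity.
observed = rollout(y0=10.0, v0=0.0, accel=-9.8, dt=dt, steps=20)
g_hat = fit_accel(observed, dt)

# Prediction under new initial conditions: thrown upward from height 5 at 3 units/s.
predicted = rollout(y0=5.0, v0=3.0, accel=g_hat, dt=dt, steps=20)

print("estimated acceleration:", g_hat)
```

Because the fitted quantity is a parameter of the dynamics and not of any particular trajectory, the same estimate carries over to any initial position and velocity, which is the kind of parameter control the abstract describes at a much larger scale.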

[153] SD3.5-Flash: Distribution-Guided Distillation of Generative Flows

Hmrishav Bandyopadhyay,Rahim Entezari,Jim Scott,Reshinth Adithyan,Yi-Zhe Song,Varun Jampani

Main category: cs.CV

TL;DR: SD3.5-Flash is an efficient few-step distillation framework that enables high-quality image generation on consumer devices.

Details Motivation: Bring computationally expensive rectified flow models to resource-constrained consumer devices and broaden access to generative AI. Method: Reformulates the distribution matching objective and combines it with timestep sharing and split-timestep fine-tuning, together with text encoder and quantization optimizations, to achieve efficient few-step generation. Result: Fast generation and low memory usage across a range of hardware; user studies show that it outperforms existing few-step methods. Conclusion: SD3.5-Flash makes practical deployment of advanced generative AI possible on devices from mobile phones to desktops, significantly improving accessibility. Abstract: We present SD3.5-Flash, an efficient few-step distillation framework that brings high-quality image generation to accessible consumer devices. Our approach distills computationally prohibitive rectified flow models through a reformulated distribution matching objective tailored specifically for few-step generation. We introduce two key innovations: "timestep sharing" to reduce gradient noise and "split-timestep fine-tuning" to improve prompt alignment. Combined with comprehensive pipeline optimizations like text encoder restructuring and specialized quantization, our system enables both rapid generation and memory-efficient deployment across different hardware configurations. This democratizes access across the full spectrum of devices, from mobile phones to desktop computers. Through extensive evaluation including large-scale user studies, we demonstrate that SD3.5-Flash consistently outperforms existing few-step methods, making advanced generative AI truly accessible for practical deployment.