Table of Contents
cs.CL
[1] Interpreting Public Sentiment in Diplomacy Events: A Counterfactual Analysis Framework Using Large Language Models
Leyi Ouyang
Main category: cs.CL
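TL;DR: This paper proposes an LLM-based counterfactual generation framework that adjusts specific textual features of diplomatic event narratives to shift public sentiment from negative to neutral or positive, reporting a 70% success rate.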
Details
Motivation: Traditional methods for measuring public sentiment are time-consuming and labor-intensive and cannot support forward-looking analysis; there is also no data-driven approach for guiding public opinion on diplomatic events.
Method: The authors first build a dataset of diplomatic event descriptions paired with the associated public discussions and train a language model to predict public reaction. Guided by communication theory and domain experts, they then fix a set of modifiable textual features and design a counterfactual generation algorithm in which a large language model produces versions of the text that keep the core facts but change the narrative framing.
Result: The framework shifted negative public sentiment to neutral or positive in 70% of cases, supporting the claim that narrative framing can influence public sentiment.
Conclusion: The framework can serve diplomats, policymakers, and communication specialists as a practical tool that offers data-driven suggestions for narrative framing and a more favorable international opinion environment.
Abstract: Diplomatic events consistently prompt widespread public discussion and debate. Public sentiment plays a critical role in diplomacy, as favorable sentiment provides vital support for policy implementation, helps resolve international issues, and shapes a nation's international image. Traditional methods for gauging public sentiment, such as large-scale surveys or manual content analysis of media, are typically time-consuming, labor-intensive, and lack the capacity for forward-looking analysis. We propose a novel framework that identifies specific modifications for diplomatic event narratives to shift public sentiment from negative to neutral or positive. First, we train a language model to predict public reaction towards diplomatic events. To this end, we construct a dataset comprising descriptions of diplomatic events and their associated public discussions. Second, guided by communication theories and in collaboration with domain experts, we predetermine several textual features for modification, ensuring that any alterations change the event's narrative framing while preserving its core facts. We develop a counterfactual generation algorithm that employs a large language model to systematically produce modified versions of an original text. The results show that this framework successfully shifted public sentiment to a more favorable state with a 70% success rate. This framework can therefore serve as a practical tool for diplomats, policymakers, and communication specialists, offering data-driven insights on how to frame diplomatic initiatives or report on events to foster a more desirable public sentiment.
[2] Speaker Style-Aware Phoneme Anchoring for Improved Cross-Lingual Speech Emotion Recognition
Shreya G. Upadhyay,Carlos Busso,Chi-Chun Lee
Main category: cs.CL
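TL;DR: Proposes a speaker-style-aware phoneme anchoring framework for cross-lingual speech emotion recognition that improves cross-lingual emotion transfer through dual-space anchoring in the speaker and phoneme spaces.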
Details
Motivation: Cross-lingual speech emotion recognition is difficult because languages differ phonetically and speakers differ in expressive style, so emotional expression must be aligned across both languages and speakers.
Method: Emotion-specific speaker communities are built with graph-based clustering to capture shared traits, and dual-space anchoring in the speaker space and the phoneme space aligns emotional expression across languages.
Result: Experiments on MSP-Podcast (English) and BIIC-Podcast (Taiwanese Mandarin) show that the method outperforms existing baselines and generalizes better.
Conclusion: The framework captures commonalities in emotional expression across languages and improves cross-lingual speech emotion recognition.
Abstract: Cross-lingual speech emotion recognition (SER) remains a challenging task due to differences in phonetic variability and speaker-specific expressive styles across languages. Effectively capturing emotion under such diverse conditions requires a framework that can align the externalization of emotions across different speakers and languages. To address this problem, we propose a speaker-style aware phoneme anchoring framework that aligns emotional expression at the phonetic and speaker levels. Our method builds emotion-specific speaker communities via graph-based clustering to capture shared speaker traits. Using these groups, we apply dual-space anchoring in speaker and phonetic spaces to enable better emotion transfer across languages. Evaluations on the MSP-Podcast (English) and BIIC-Podcast (Taiwanese Mandarin) corpora demonstrate improved generalization over competitive baselines and provide valuable insights into the commonalities in cross-lingual emotion representation.
[3] CFD-LLMBench: A Benchmark Suite for Evaluating Large Language Models in Computational Fluid Dynamics
Nithin Somasekharan,Ling Yue,Yadi Cao,Weichao Li,Patrick Emami,Pochinapeddi Sai Bhargav,Anurag Acharya,Xingyu Xie,Shaowu Pan
Main category: cs.CL
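TL;DR: This paper presents CFDLLMBench, a benchmark suite for evaluating how well large language models automate numerical experiments in computational fluid dynamics (CFD). Its three components, CFDQuery, CFDCodeBench, and FoamBench, assess CFD knowledge, physical and numerical reasoning, and workflow implementation.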
Details
Motivation: Large language models perform well on natural language processing tasks, but their use in automating numerical experiments on complex physical systems remains underexplored. As a core area of computational science, computational fluid dynamics offers a challenging testbed for the scientific capabilities of LLMs.
Method: The authors build CFDLLMBench, a benchmark suite grounded in real CFD practice with three complementary parts: CFDQuery (graduate-level CFD knowledge), CFDCodeBench (numerical and physical reasoning), and FoamBench (context-dependent implementation of CFD workflows), combined with a detailed task taxonomy and a rigorous evaluation framework.
Result: The benchmark quantifies LLM performance in terms of code executability, solution accuracy, and numerical convergence behavior, and yields reproducible evaluation results.
Conclusion: CFDLLMBench provides a foundation for developing and evaluating LLM-driven automation of numerical experiments on complex physical systems.
Abstract: Large Language Models (LLMs) have demonstrated strong performance across general NLP tasks, but their utility in automating numerical experiments of complex physical systems -- a critical and labor-intensive component -- remains underexplored. As the major workhorse of computational science over the past decades, Computational Fluid Dynamics (CFD) offers a uniquely challenging testbed for evaluating the scientific capabilities of LLMs. We introduce CFDLLMBench, a benchmark suite comprising three complementary components -- CFDQuery, CFDCodeBench, and FoamBench -- designed to holistically evaluate LLM performance across three key competencies: graduate-level CFD knowledge, numerical and physical reasoning of CFD, and context-dependent implementation of CFD workflows. Grounded in real-world CFD practices, our benchmark combines a detailed task taxonomy with a rigorous evaluation framework to deliver reproducible results and quantify LLM performance across code executability, solution accuracy, and numerical convergence behavior. CFDLLMBench establishes a solid foundation for the development and evaluation of LLM-driven automation of numerical experiments for complex physical systems. Code and data are available at https://github.com/NREL-Theseus/cfdllmbench/.
[4] Assessing Classical Machine Learning and Transformer-based Approaches for Detecting AI-Generated Research Text
Sharanya Parimanoharan,Ruwan D. Nawarathna
Main category: cs.CL
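TL;DR: This study evaluates several machine learning methods for distinguishing ChatGPT-3.5-generated text from human-written research abstracts. DistilBERT performs best, and an ensemble fails to beat the best single model.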
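The max voting ensemble mentioned in this entry can be illustrated with a minimal sketch. It assumes each detector has already produced a hard label per abstract; the detector names and label values below are made up for illustration and are not results from the paper.

```python
from collections import Counter


def max_vote(predictions):
    """Return the majority label for each sample.

    `predictions` is a list of per-model label lists of equal length.
    """
    n_samples = len(predictions[0])
    if any(len(p) != n_samples for p in predictions):
        raise ValueError("all models must label the same number of samples")
    voted = []
    for labels in zip(*predictions):
        # With three models and two classes a strict majority always exists.
        voted.append(Counter(labels).most_common(1)[0][0])
    return voted


# Hypothetical outputs for five abstracts (1 = AI-generated, 0 = human-written).
distilbert = [1, 0, 1, 1, 0]
logreg = [1, 0, 0, 1, 1]
bert_custom = [0, 0, 1, 1, 1]

ensemble = max_vote([distilbert, logreg, bert_custom])
print(ensemble)
```

In the last sample the two weaker detectors outvote the first one, which shows how a majority vote can override the strongest single model when the others agree on a different label.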
Details
Motivation: With the wide adoption of large language models, the line between AI-generated and human-written text is blurring, which threatens academic integrity and the credibility of information and calls for reliable AI-text detection.
Method: Using a dataset of 250 pairs of human-written and ChatGPT-generated research abstracts, the study compares classical methods (Logistic Regression with Bag-of-Words, POS, and TF-IDF features), transformer-based methods (BERT with N-grams, DistilBERT, BERT with a custom classifier, and LSTM-based N-gram models), and a voting ensemble of the three best models.
Result: DistilBERT performs best overall; Logistic Regression and BERT-Custom are solid and balanced; the LSTM- and BERT-N-gram models lag. The max voting ensemble of the three best models does not surpass DistilBERT.
Conclusion: A single strong transformer model outperforms the ensemble strategy; future work should build larger, richer datasets to develop more robust detection frameworks that keep pace with generative AI.
Abstract: The rapid adoption of large language models (LLMs) such as ChatGPT has blurred the line between human and AI-generated texts, raising urgent questions about academic integrity, intellectual property, and the spread of misinformation. Thus, reliable AI-text detection is needed for fair assessment to safeguard human authenticity and cultivate trust in digital communication. In this study, we investigate how well current machine learning (ML) approaches can distinguish ChatGPT-3.5-generated texts from human-written texts employing a labeled data set of 250 pairs of abstracts from a wide range of research topics. We test and compare both classical (Logistic Regression armed with classical Bag-of-Words, POS, and TF-IDF features) and transformer-based (BERT augmented with N-grams, DistilBERT, BERT with a lightweight custom classifier, and LSTM-based N-gram models) ML detection techniques. As we aim to assess each model's performance in detecting AI-generated research texts, we also aim to test whether an ensemble of these models can outperform any single detector. Results show DistilBERT achieves the overall best performance, while Logistic Regression and BERT-Custom offer solid, balanced alternatives; LSTM- and BERT-N-gram approaches lag. The max voting ensemble of the three best models fails to surpass DistilBERT itself, highlighting the primacy of a single transformer-based representation over mere model diversity. By comprehensively assessing the strengths and weaknesses of these AI-text detection approaches, this work lays a foundation for more robust transformer frameworks with larger, richer datasets to keep pace with ever-improving generative AI models.
[5] ConceptViz: A Visual Analytics Approach for Exploring Concepts in Large Language Models
Haoxuan Li,Zhen Wen,Qiqi Jiang,Chenxiao Li,Yuwei Wu,Yuchen Yang,Yiyao Wang,Xiuqi Huang,Minfeng Zhu,Wei Chen
Main category: cs.CL
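TL;DR: This paper presents ConceptViz, a visual analytics system for exploring concepts in large language models. Its identify, interpret, and validate workflow helps researchers understand and verify how features extracted by sparse autoencoders correspond to human-understandable concepts.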
Details
Motivation: Sparse autoencoders (SAEs) are promising for extracting interpretable features from large language models, but these features do not align directly with human-understandable concepts, which makes interpretation difficult and slow. A tool is needed to bridge this gap.
Method: The authors design and implement ConceptViz, a visual analytics system built around a three-step "Identification => Interpretation => Validation" workflow: users query SAEs with concepts of interest, interactively explore concept-to-feature alignments, and validate the correspondences through model behavior.
Result: Two usage scenarios and a user study indicate that ConceptViz makes the discovery and validation of concept representations in LLMs more efficient.
Conclusion: ConceptViz strengthens interpretability research on LLM features and helps researchers build more accurate mental models, supporting transparent and trustworthy AI.
Abstract: Large language models (LLMs) have achieved remarkable performance across a wide range of natural language tasks. Understanding how LLMs internally represent knowledge remains a significant challenge. Although Sparse Autoencoders (SAEs) have emerged as a promising technique for extracting interpretable features from LLMs, SAE features do not inherently align with human-understandable concepts, making their interpretation cumbersome and labor-intensive. To bridge the gap between SAE features and human concepts, we present ConceptViz, a visual analytics system designed for exploring concepts in LLMs. ConceptViz implements a novel Identification => Interpretation => Validation pipeline, enabling users to query SAEs using concepts of interest, interactively explore concept-to-feature alignments, and validate the correspondences through model behavior verification. We demonstrate the effectiveness of ConceptViz through two usage scenarios and a user study. Our results show that ConceptViz enhances interpretability research by streamlining the discovery and validation of meaningful concept representations in LLMs, ultimately aiding researchers in building more accurate mental models of LLM features. Our code and user guide are publicly available at https://github.com/Happy-Hippo209/ConceptViz.
[6] SKILL-RAG: Self-Knowledge Induced Learning and Filtering for Retrieval-Augmented Generation
Tomoaki Isoda
Main category: cs.CL
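TL;DR: Proposes SKILL-RAG, which uses the model's self-knowledge within a reinforcement learning framework to select helpful retrieved content and improve RAG performance.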
Details
Motivation: Retrieval systems may return irrelevant content, which leads to hallucinations; unhelpful information must be identified and filtered out to improve RAG performance.
Method: A reinforcement learning-based training framework explicitly elicits self-knowledge from the model, and irrelevant content is filtered at sentence level while useful knowledge is kept.
Result: On several question answering benchmarks with Llama2-7B and Qwen3-8B, SKILL-RAG improves generation quality and significantly reduces the number of input documents.
Conclusion: Self-knowledge is important for guiding the selection of high-quality retrievals and improves the performance of RAG systems.
Abstract: Retrieval-Augmented Generation (RAG) has significantly improved the performance of large language models (LLMs) on knowledge-intensive tasks in recent years. However, since retrieval systems may return irrelevant content, incorporating such information into the model often leads to hallucinations. Thus, identifying and filtering out unhelpful retrieved content is a key challenge for improving RAG performance. To better integrate the internal knowledge of the model with external knowledge from retrieval, it is essential to understand what the model "knows" and "does not know" (which is also called "self-knowledge"). Based on this insight, we propose SKILL-RAG (Self-Knowledge Induced Learning and Filtering for RAG), a novel method that leverages the model's self-knowledge to determine which retrieved documents are beneficial for answering a given query. We design a reinforcement learning-based training framework to explicitly elicit self-knowledge from the model and employ sentence-level granularity to filter out irrelevant content while preserving useful knowledge. We evaluate SKILL-RAG using Llama2-7B and Qwen3-8B on several question answering benchmarks. Experimental results demonstrate that SKILL-RAG not only improves generation quality but also significantly reduces the number of input documents, validating the importance of self-knowledge in guiding the selection of high-quality retrievals.
[7] Beyond Global Emotion: Fine-Grained Emotional Speech Synthesis with Dynamic Word-Level Modulation
Sirui Wang,Andong Chen,Tiejun Zhao
Main category: cs.CL
TL;DR: Proposes Emo-FiLM, a framework for fine-grained, word-level emotion control in LLM-based text-to-speech.
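Feature-wise Linear Modulation (FiLM), the layer this entry relies on, scales and shifts each feature of an embedding using parameters computed from a conditioning vector. The sketch below shows only that generic operation in plain Python; the dimensions, weights, and the helper names `linear` and `film` are illustrative assumptions, not Emo-FiLM's actual architecture.

```python
def linear(weights, bias, vec):
    """Apply an affine map: one output per row of `weights`."""
    return [sum(w * v for w, v in zip(row, vec)) + b
            for row, b in zip(weights, bias)]


def film(text_embedding, emotion_vector, gamma_layer, beta_layer):
    """FiLM: y_i = gamma_i * x_i + beta_i, with gamma and beta
    predicted from the conditioning (emotion) vector."""
    gamma = linear(*gamma_layer, emotion_vector)
    beta = linear(*beta_layer, emotion_vector)
    return [g * x + b for g, x, b in zip(gamma, text_embedding, beta)]


# Hypothetical 2-dimensional word embedding and 2-dimensional emotion vector.
gamma_layer = ([[1.0, 0.0], [0.0, 1.0]], [1.0, 1.0])   # (weights, bias)
beta_layer = ([[0.0, 0.0], [0.0, 0.0]], [0.5, -0.5])   # (weights, bias)

word = [2.0, 4.0]
modulated = film(word, [0.5, 0.0], gamma_layer, beta_layer)
neutral = film(word, [0.0, 0.0], gamma_layer, beta_layer)
print(modulated, neutral)
```

Because the conditioning vector can differ for every word, the same layer can modulate each word embedding independently, which is the property word-level control depends on.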
Details
Motivation: Existing emotional TTS systems mostly rely on sentence-level emotion control and struggle to capture emotional changes within a sentence.
Method: Frame-level features are extracted with emotion2vec and aligned to words, and a FiLM layer modulates the text embeddings to give word-level emotion control.
Result: The method outperforms existing approaches on both global and fine-grained emotion control tasks, and the FEDD dataset is constructed for evaluating fine-grained emotion dynamics.
Conclusion: Emo-FiLM models emotional dynamics within a sentence and improves the expressiveness and naturalness of emotional TTS.
Abstract: Emotional text-to-speech (E-TTS) is central to creating natural and trustworthy human-computer interaction. Existing systems typically rely on sentence-level control through predefined labels, reference audio, or natural language prompts. While effective for global emotion expression, these approaches fail to capture dynamic shifts within a sentence. To address this limitation, we introduce Emo-FiLM, a fine-grained emotion modeling framework for LLM-based TTS. Emo-FiLM aligns frame-level features from emotion2vec to words to obtain word-level emotion annotations, and maps them through a Feature-wise Linear Modulation (FiLM) layer, enabling word-level emotion control by directly modulating text embeddings. To support evaluation, we construct the Fine-grained Emotion Dynamics Dataset (FEDD) with detailed annotations of emotional transitions. Experiments show that Emo-FiLM outperforms existing approaches on both global and fine-grained tasks, demonstrating its effectiveness and generality for expressive speech synthesis.
[8] USB-Rec: An Effective Framework for Improving Conversational Recommendation Capability of Large Language Model
Jianyu Wen,Jingyun Wang,Cilin Yan,Jiayin Cai,Xiaolong Jiang,Ying Zhang
Main category: cs.CL
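TL;DR: Proposes USB-Rec, a user-simulator-based training-inference framework that builds a preference optimization dataset for reinforcement learning and adds a self-enhancement strategy at inference time, improving LLM performance in conversational recommendation.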
Details
Motivation: Existing LLM-based conversational recommender systems focus on exploiting the model's analysis capabilities and neglect training, so the model never learns recommendation strategies in depth.
Method: An LLM-based Preference Optimization (PO) dataset construction strategy is designed for reinforcement learning training, and a Self-Enhancement Strategy (SES) is applied at inference time to make full use of the recommendation ability gained in training.
Result: Experiments on multiple datasets show that the method consistently outperforms previous state-of-the-art methods.
Conclusion: By adding a training mechanism and inference-time optimization, USB-Rec improves LLM performance on conversational recommendation and shows the value of coordinating training and inference.
Abstract: Recently, Large Language Models (LLMs) have been widely employed in Conversational Recommender Systems (CRSs). Unlike traditional language model approaches that focus on training, all existing LLM-based approaches are mainly centered around how to leverage the summarization and analysis capabilities of LLMs while ignoring the issue of training. Therefore, in this work, we propose an integrated training-inference framework, User-Simulator-Based framework (USB-Rec), for improving the performance of LLMs in conversational recommendation at the model level. Firstly, we design an LLM-based Preference Optimization (PO) dataset construction strategy for RL training, which helps the LLMs understand the strategies and methods in conversational recommendation. Secondly, we propose a Self-Enhancement Strategy (SES) at the inference stage to further exploit the conversational recommendation potential obtained from RL training. Extensive experiments on various datasets demonstrate that our method consistently outperforms previous state-of-the-art methods.
[9] Document Summarization with Conformal Importance Guarantees
Bruce Kuwahara,Chen-Yuan Lin,Xiao Shi Huang,Kin Kwan Leung,Jullian Arta Yapeter,Ilya Stanevich,Felipe Perez,Jesse C. Cresswell
Main category: cs.CL
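TL;DR: Proposes Conformal Importance Summarization, an importance-preserving summarization framework based on conformal prediction that provides reliable coverage guarantees for critical content in high-stakes domains.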
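The idea of calibrating a threshold on sentence-level importance scores can be illustrated with a generic split-conformal sketch. This is not the paper's algorithm: the function names, the per-document threshold definition, and the toy data are all assumptions made for illustration. Under the usual exchangeability assumption (and ignoring ties), choosing the k-th smallest calibration threshold with k = floor(alpha * (n + 1)) is the standard way to aim for a new document meeting the recall target with probability at least 1 - alpha.

```python
import math


def per_document_threshold(scores, important, target_recall):
    """Largest threshold t such that keeping sentences with score >= t
    retains at least `target_recall` of the important sentences."""
    important_scores = sorted(
        (s for s, flag in zip(scores, important) if flag), reverse=True)
    if not important_scores:
        return math.inf  # nothing to cover, so any threshold works
    needed = max(1, math.ceil(target_recall * len(important_scores)))
    return important_scores[needed - 1]


def calibrate_threshold(calibration_docs, target_recall, alpha):
    """Pick a lower conformal quantile of the per-document thresholds."""
    thresholds = sorted(per_document_threshold(s, y, target_recall)
                        for s, y in calibration_docs)
    k = math.floor(alpha * (len(thresholds) + 1))
    if k == 0:
        return -math.inf  # too few calibration documents: keep everything
    return thresholds[k - 1]


def summarize(sentences, scores, threshold):
    """Extractive summary: keep sentences scoring at or above the threshold."""
    return [sent for sent, s in zip(sentences, scores) if s >= threshold]


# Hypothetical calibration set: (importance scores, important-sentence flags).
calibration_docs = [
    ([0.9, 0.2, 0.7], [1, 0, 1]),
    ([0.4, 0.8], [1, 1]),
    ([0.6, 0.1, 0.5], [1, 0, 0]),
]
q_hat = calibrate_threshold(calibration_docs, target_recall=1.0, alpha=0.5)
summary = summarize(["A", "B", "C"], [0.65, 0.3, 0.9], q_hat)
print(q_hat, summary)
```

A lower threshold keeps more sentences, so smaller values of `alpha` trade summary length for a stronger coverage target; the scoring model itself stays a black box, which matches the model-agnostic framing in this entry.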
Details
Motivation: Existing automatic summarization systems offer no reliable guarantee that critical content is included, which matters in high-stakes domains such as healthcare, law, and finance.
Method: Thresholds on sentence-level importance scores are calibrated with conformal prediction to produce extractive summaries with user-specified coverage and recall rates. The method is model-agnostic and needs only a small calibration set to integrate with existing large language models.
Result: Experiments on standard summarization benchmarks show that the method achieves the theoretically guaranteed information coverage rate and can be combined with existing techniques to make summaries more controllable and reliable.
Conclusion: Conformal Importance Summarization offers a workable path toward safe deployment of AI summarization tools in critical applications.
Abstract: Automatic summarization systems have advanced rapidly with large language models (LLMs), yet they still lack reliable guarantees on inclusion of critical content in high-stakes domains like healthcare, law, and finance. In this work, we introduce Conformal Importance Summarization, the first framework for importance-preserving summary generation which uses conformal prediction to provide rigorous, distribution-free coverage guarantees. By calibrating thresholds on sentence-level importance scores, we enable extractive document summarization with user-specified coverage and recall rates over critical content. Our method is model-agnostic, requires only a small calibration set, and seamlessly integrates with existing black-box LLMs. Experiments on established summarization benchmarks demonstrate that Conformal Importance Summarization achieves the theoretically assured information coverage rate. Our work suggests that Conformal Importance Summarization can be combined with existing techniques to achieve reliable, controllable automatic summarization, paving the way for safer deployment of AI summarization tools in critical applications. Code is available at https://github.com/layer6ai-labs/conformal-importance-summarization.
[10] ShortCheck: Checkworthiness Detection of Multilingual Short-Form Videos
Henrik Vatndal,Vinay Setty
Main category: cs.CL
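TL;DR: ShortCheck is a modular, inference-only automated pipeline that identifies checkworthy videos on short-form platforms such as TikTok. It integrates several detection and analysis techniques and performs well in a multilingual setting.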
Details
Motivation: Content on short-form video platforms is multimodal, dynamic, and noisy, which poses particular challenges for misinformation detection and calls for efficient tools that assist human fact-checkers.
Method: The ShortCheck system integrates speech transcription, OCR, object and deepfake detection, video-to-text summarization, and claim verification into a user-friendly automated checking pipeline.
Result: Evaluated on two manually annotated multilingual TikTok video datasets, the pipeline achieves a weighted F1 score above 70%.
Conclusion: ShortCheck identifies checkworthy short-form video content, performs well in multilingual settings, and can make human fact-checking more efficient.
Abstract: Short-form video platforms like TikTok present unique challenges for misinformation detection due to their multimodal, dynamic, and noisy content. We present ShortCheck, a modular, inference-only pipeline with a user-friendly interface that automatically identifies checkworthy short-form videos to help human fact-checkers. The system integrates speech transcription, OCR, object and deepfake detection, video-to-text summarization, and claim verification. ShortCheck is validated by evaluating it on two manually annotated datasets with TikTok videos in a multilingual setting. The pipeline achieves promising results with a weighted F1 score over 70%.
[11] MARS: toward more efficient multi-agent collaboration for LLM reasoning
Xiao Wang,Jia Wang,Yijie Wang,Pengtao Dang,Sha Cao,Chi Zhang
Main category: cs.CL
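TL;DR: This paper proposes MARS (Multi-Agent Review System), a role-based collaborative reasoning framework that divides work among an author, reviewers, and a meta-reviewer. It matches the accuracy of Multi-Agent Debate (MAD) while cutting token usage and inference time by about 50%.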
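The author, reviewer, and meta-reviewer roles in this entry can be sketched as a simple control loop. The sketch below uses toy stand-in functions instead of LLM calls; `mars_round`, the agent signatures, and the revision limit are hypothetical choices for illustration, not the interface of the MARS code release.

```python
def mars_round(question, author, reviewers, meta_reviewer, max_revisions=2):
    """Author drafts, reviewers comment independently, and the
    meta-reviewer decides whether the author must revise."""
    answer = author(question, feedback="")
    for _ in range(max_revisions):
        # Reviewers see only the question and the answer, never each other.
        reviews = [review(question, answer) for review in reviewers]
        accept, feedback = meta_reviewer(question, answer, reviews)
        if accept:
            break
        answer = author(question, feedback=feedback)
    return answer


# Toy agents standing in for LLM calls (assumptions for illustration only).
def toy_author(question, feedback):
    return "4" if feedback else "5"


def toy_reviewer(question, answer):
    ok = answer == "4"
    return (ok, "looks right" if ok else "recheck the arithmetic")


def toy_meta_reviewer(question, answer, reviews):
    accepts = sum(1 for ok, _ in reviews if ok)
    feedback = "; ".join(comment for ok, comment in reviews if not ok)
    return accepts > len(reviews) / 2, feedback


final = mars_round("What is 2 + 2?", toy_author,
                   [toy_reviewer, toy_reviewer], toy_meta_reviewer)
print(final)
```

Each round costs one call per reviewer plus one meta-reviewer call, and no reviewer ever reads another reviewer's output, which is the structural reason the entry gives for the lower communication overhead compared with round-table debate.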
Details
Motivation: A single large language model has limited reasoning ability, and existing Multi-Agent Debate (MAD) methods are effective but computationally expensive, so communication and computation costs need to be reduced.
Method: In the MARS framework an author agent generates a solution, reviewer agents give feedback independently, and a meta-reviewer integrates the feedback and guides revision; avoiding direct reviewer-to-reviewer interaction lowers the overhead.
Result: Experiments on multiple benchmarks show that MARS matches the accuracy of MAD while reducing token usage and inference time by about 50%.
Conclusion: MARS improves efficiency without sacrificing reasoning performance and is a more economical approach to multi-agent collaborative reasoning.
Abstract: Large language models (LLMs) have achieved impressive results in natural language understanding, yet their reasoning capabilities remain limited when operating as single agents. Multi-Agent Debate (MAD) has been proposed to address this limitation by enabling collaborative reasoning among multiple models in a round-table debate manner. While effective, MAD introduces substantial computational overhead due to the number of agents involved and the frequent communication required. In this paper, we propose MARS (Multi-Agent Review System), a role-based collaboration framework inspired by the review process. In MARS, an author agent generates an initial solution, reviewer agents provide decisions and comments independently, and a meta-reviewer integrates the feedback to make the final decision and guide further revision. This design enhances reasoning quality while avoiding costly reviewer-to-reviewer interactions, thereby controlling token consumption and inference time. We compared MARS with both MAD and other state-of-the-art reasoning strategies across multiple benchmarks. Extensive experiments with different LLMs show that MARS matches the accuracy of MAD while reducing both token usage and inference time by approximately 50%. Code is available at https://github.com/xwang97/MARS.
[12] SiniticMTError: A Machine Translation Dataset with Error Annotations for Sinitic Languages
Hannah Liu,Junghyun Min,Ethan Yue Heng Cheung,Shou-Yi Hung,Syed Mekael Wasti,Runtong Liang,Shiyao Qian,Shizhao Zheng,Elsie Chan,Ka Ieng Charlotte Lo,Wing Yu Yip,Richard Tzong-Han Tsai,En-Shiun Annie Lee
Main category: cs.CL
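TL;DR: This paper introduces SiniticMTError, a machine translation dataset with error annotations for translations from English into Mandarin, Cantonese, and Wu Chinese, intended to support translation quality estimation and error-aware generation for low-resource languages.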
Details
Motivation: Machine translation progress is limited for low-resource languages that lack large-scale training data and linguistic resources, including widely spoken but under-served languages such as Cantonese and Wu Chinese.
Method: Building on existing parallel corpora, native speakers annotate error spans, error types, and severity in machine translations from English into Mandarin, Cantonese, and Wu Chinese to construct SiniticMTError; annotation agreement and error patterns are then analyzed.
Result: The resulting dataset provides detailed error annotations; the authors report high inter-annotator agreement and describe common error types and severity patterns across the languages.
Conclusion: SiniticMTError is a useful resource for translation quality estimation and model improvement for low-resource Sinitic languages and supports the development of error-aware translation models.
Abstract: Despite major advances in machine translation (MT) in recent years, progress remains limited for many low-resource languages that lack large-scale training data and linguistic resources. Cantonese and Wu Chinese are two Sinitic examples, although each enjoys more than 80 million speakers around the world. In this paper, we introduce SiniticMTError, a novel dataset that builds on existing parallel corpora to provide error span, error type, and error severity annotations in machine-translated examples from English to Mandarin, Cantonese, and Wu Chinese. Our dataset serves as a resource for the MT community to utilize in fine-tuning models with error detection capabilities, supporting research on translation quality estimation, error-aware generation, and low-resource language evaluation. We report our rigorous annotation process by native speakers, with analyses on inter-annotator agreement, iterative feedback, and patterns in error type and severity.
[13] SwasthLLM: a Unified Cross-Lingual, Multi-Task, and Meta-Learning Zero-Shot Framework for Medical Diagnosis Using Contrastive Representations
Ayan Sar,Pranav Singh Puri,Sumit Aich,Tanupriya Choudhury,Abhijit Kumar
Main category: cs.CL
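TL;DR: This paper proposes SwasthLLM, a unified, zero-shot, cross-lingual, multi-task learning framework for medical diagnosis in English, Hindi, and Bengali that needs no language-specific fine-tuning and performs well in both supervised and zero-shot settings.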
Details
Motivation: Annotated medical data is scarce in low-resource languages and language use varies across populations, which makes automatic disease diagnosis from clinical text difficult in multilingual healthcare settings.
Method: SwasthLLM builds on the multilingual XLM-RoBERTa encoder and adds a language-aware attention mechanism, a Siamese contrastive learning module, a translation consistency module, and a contrastive projection head. A multi-task strategy jointly optimizes disease classification, translation alignment, and contrastive learning objectives, and MAML is used for rapid adaptation.
Result: The model reaches 97.22% accuracy and a 97.17% F1-score in the supervised setting; in zero-shot scenarios it achieves 92.78% accuracy on Hindi and 73.33% on Bengali.
Conclusion: SwasthLLM generalizes across languages without language-specific fine-tuning and is effective for diagnosis from low-resource medical text.
Abstract: In multilingual healthcare environments, automatic disease diagnosis from clinical text remains a challenging task due to the scarcity of annotated medical data in low-resource languages and the linguistic variability across populations. This paper proposes SwasthLLM, a unified, zero-shot, cross-lingual, and multi-task learning framework for medical diagnosis that operates effectively across English, Hindi, and Bengali without requiring language-specific fine-tuning. At its core, SwasthLLM leverages the multilingual XLM-RoBERTa encoder augmented with a language-aware attention mechanism and a disease classification head, enabling the model to extract medically relevant information regardless of the language structure. To align semantic representations across languages, a Siamese contrastive learning module is introduced, ensuring that equivalent medical texts in different languages produce similar embeddings. Further, a translation consistency module and a contrastive projection head reinforce language-invariant representation learning. SwasthLLM is trained using a multi-task learning strategy, jointly optimizing disease classification, translation alignment, and contrastive learning objectives. Additionally, we employ Model-Agnostic Meta-Learning (MAML) to equip the model with rapid adaptation capabilities for unseen languages or tasks with minimal data. Our phased training pipeline emphasizes robust representation alignment before task-specific fine-tuning. Extensive evaluation shows that SwasthLLM achieves high diagnostic performance, with a test accuracy of 97.22% and an F1-score of 97.17% in supervised settings. Crucially, in zero-shot scenarios, it attains 92.78% accuracy on Hindi and 73.33% accuracy on Bengali medical text, demonstrating strong generalization in low-resource contexts.
[14] Dynamic Reasoning Chains through Depth-Specialized Mixture-of-Experts in Transformer Architectures
Sampurna Roy,Ayan Sar,Anurag Kaushish,Kanav Gupta,Tanupriya Choudhury,Abhijit Kumar
Main category: cs.CL
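TL;DR: This paper proposes a Depth-Specialised Mixture of Experts (DS-MoE) model that dynamically assembles reasoning chains so that processing depth adapts to input complexity, improving inference efficiency and accuracy while making the model more interpretable.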
Details
Motivation: Conventional transformers apply the same processing depth to every input, so simple queries consume the same computation as complex logical problems, which is inefficient and limits deep reasoning.
Method: The DS-MoE architecture introduces expert modules optimised for different reasoning depths (such as shallow pattern recognition, compositional reasoning, and logical inference), and a learned routing network dynamically assembles custom reasoning chains that activate only the necessary experts.
Result: On The Pile, DS-MoE saves up to 16 per cent of computation compared with fixed-depth models, runs inference 35 per cent faster, and improves accuracy by 2.8 per cent on complex multi-step reasoning tasks; the routing process also yields interpretable reasoning paths.
Conclusion: Through depth specialisation and dynamic routing, DS-MoE delivers efficient, high-quality, and interpretable reasoning and represents an advance in adaptive neural architectures.
Abstract: Contemporary transformer architectures apply identical processing depth to all inputs, creating inefficiencies and limiting reasoning quality. Simple factual queries are subjected to the same multilayered computation as complex logical problems, wasting resources while constraining deep inference. To overcome this, we propose Dynamic Reasoning Chains through Depth Specialised Mixture of Experts (DS-MoE), a modular framework that extends the Mixture of Experts paradigm from width-based to depth specialised computation. DS-MoE introduces expert modules optimised for distinct reasoning depths: shallow pattern recognition, compositional reasoning, logical inference, memory integration, and meta-cognitive supervision. A learned routing network dynamically assembles custom reasoning chains, activating only the necessary experts to match input complexity. We train and evaluate DS-MoE on The Pile, an 800GB corpus covering diverse domains such as scientific papers, legal texts, programming code, and web content, enabling systematic assessment across reasoning depths. Experimental results demonstrate that DS-MoE achieves up to 16 per cent computational savings and 35 per cent faster inference compared to uniform-depth transformers, while delivering 2.8 per cent higher accuracy on complex multi-step reasoning benchmarks. Furthermore, routing decisions yield interpretable reasoning chains, enhancing transparency and scalability. These findings establish DS-MoE as a significant advancement in adaptive neural architectures, demonstrating that depth-specialised modular processing can simultaneously improve efficiency, reasoning quality, and interpretability in large-scale language models.
[15] Hierarchical Resolution Transformers: A Wavelet-Inspired Architecture for Multi-Scale Language Understanding
Ayan Sar,Sampurna Roy,Kanav Gupta,Anurag Kaushish,Tanupriya Choudhury,Abhijit Kumar
Main category: cs.CL
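TL;DR: Proposes the Hierarchical Resolution Transformer (HRT), a wavelet-inspired architecture that processes language at multiple resolutions so that computation mirrors the hierarchical structure of human language. It outperforms standard transformers on several benchmarks while being markedly more efficient.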
Details
Motivation: Transformers treat text as a flat token sequence and fail to model the hierarchical nature of language, which leads to high computational cost, weak compositional generalization, and inadequate discourse modeling.
Method: The Hierarchical Resolution Transformer (HRT) uses a multi-resolution attention mechanism that combines bottom-up composition with top-down contextualization, and exponential sequence reduction across scales gives O(n log n) complexity.
Result: HRT improves on baselines by an average of 3.8% to 6.1% on benchmarks including GLUE, SuperGLUE, and Long Range Arena, reduces memory usage by 42%, and lowers inference latency by 37%; ablations confirm the contribution of cross-resolution attention and scale-specialized modules.
Conclusion: The authors present HRT as the first model to align computational structure with the hierarchical organization of human language, and argue that multi-scale, wavelet-inspired processing brings both theoretical efficiency gains and practical performance improvements.
Abstract: Transformer architectures have achieved state-of-the-art performance across natural language tasks, yet they fundamentally misrepresent the hierarchical nature of human language by processing text as flat token sequences. This results in quadratic computational cost, weak compositional generalization, and inadequate discourse-level modeling. We propose Hierarchical Resolution Transformer (HRT), a novel wavelet-inspired neural architecture that processes language simultaneously across multiple resolutions, from characters to discourse-level units. HRT constructs multi-resolution attention, enabling bottom-up composition and top-down contextualization. By employing exponential sequence reduction across scales, HRT achieves O(n log n) complexity, offering significant efficiency improvements over standard transformers. We evaluated HRT on a diverse suite of benchmarks, including GLUE, SuperGLUE, Long Range Arena, and WikiText-103, and results demonstrated that HRT outperforms standard transformer baselines by an average of +3.8% on GLUE, +4.5% on SuperGLUE, and +6.1% on Long Range Arena, while reducing memory usage by 42% and inference latency by 37% compared to BERT and GPT style models of similar parameter count. Ablation studies confirm the effectiveness of cross-resolution attention and scale-specialized modules, showing that each contributes independently to both efficiency and accuracy. Our findings establish HRT as the first architecture to align computational structure with the hierarchical organization of human language, demonstrating that multi-scale, wavelet-inspired processing yields both theoretical efficiency gains and practical improvements in language understanding.
[16] FS-DFM: Fast and Accurate Long Text Generation with Few-Step Diffusion Language Models
Amin Karimi Monsefi,Nikhil Bhendawade,Manuel Rafael Ciosici,Dominic Culver,Yizhe Zhang,Irina Belousova
Main category: cs.CL
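TL;DR: FS-DFM is a discrete flow-matching model designed for fast generation. By making the number of sampling steps an explicit parameter and combining a reliable update rule with teacher guidance, it matches the quality of a conventional 1,024-step model in only 8 steps, for up to 128 times faster sampling.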
Details
Motivation: Autoregressive language models generate slowly, while non-autoregressive diffusion models can run in parallel but need many iterative steps, so a language model is needed that keeps quality high while generating much faster.
Method: FS-DFM treats the number of sampling steps as an explicit parameter and trains the model to be consistent across step budgets; it uses an update rule that avoids overshooting and guidance from a strong teacher distilled from long-run trajectories.
Result: On language modeling benchmarks, FS-DFM with 8 sampling steps reaches perplexity comparable to a 1,024-step baseline, and sampling is up to 128 times faster when generating 1,024 tokens.
Conclusion: With its consistent few-step generation mechanism, FS-DFM greatly improves generation efficiency without sacrificing quality and points to a new direction for efficient text generation.
Abstract: Autoregressive language models (ARMs) deliver strong likelihoods, but are inherently serial: they generate one token per forward pass, which limits throughput and inflates latency for long sequences. Diffusion Language Models (DLMs) parallelize across positions and thus appear promising for language generation, yet standard discrete diffusion typically needs hundreds to thousands of model evaluations to reach high quality, trading serial depth for iterative breadth. We introduce FS-DFM (Few-Step Discrete Flow-Matching), a discrete flow-matching model designed for speed without sacrificing quality. The core idea is simple: make the number of sampling steps an explicit parameter and train the model to be consistent across step budgets, so one big move lands where many small moves would. We pair this with a reliable update rule that moves probability in the right direction without overshooting, and with strong teacher guidance distilled from long-run trajectories. Together, these choices make few-step sampling stable, accurate, and easy to control. On language modeling benchmarks, FS-DFM with 8 sampling steps achieves perplexity parity with a 1,024-step discrete-flow baseline for generating 1,024 tokens using a similar-size model, delivering up to 128 times faster sampling and corresponding latency/throughput gains.
[17] Look Before you Leap: Estimating LLM Benchmark Scores from Descriptions
Jungsoo Park,Ethan Mendes,Gabriel Stanovsky,Alan Ritter
Main category: cs.CL
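TL;DR: This paper proposes forecasting large language model performance without running experiments. It builds the PRECOG dataset for text-only performance forecasting, supporting anticipatory model evaluation and experiment prioritization.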
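This entry reports mean absolute error at high-confidence thresholds. A minimal sketch of that kind of metric follows, assuming each forecast carries a predicted score, the true score, and a confidence value; the record layout, the function name `mae_at_confidence`, and the numbers are hypothetical and are not taken from PRECOG.

```python
def mae_at_confidence(records, min_confidence):
    """Mean absolute error over forecasts whose confidence meets the cutoff.

    `records` holds (predicted_score, true_score, confidence) tuples.
    Returns (mae, number_of_forecasts_kept); mae is None if none are kept.
    """
    kept = [(p, t) for p, t, c in records if c >= min_confidence]
    if not kept:
        return None, 0
    mae = sum(abs(p - t) for p, t in kept) / len(kept)
    return mae, len(kept)


# Hypothetical forecasts on a 0-100 accuracy scale.
records = [
    (70.0, 62.0, 0.9),
    (55.0, 80.0, 0.3),
    (41.0, 45.0, 0.8),
    (90.0, 60.0, 0.5),
]

all_mae, all_n = mae_at_confidence(records, 0.0)
confident_mae, confident_n = mae_at_confidence(records, 0.75)
print(all_mae, all_n, confident_mae, confident_n)
```

Raising the cutoff lowers the error in this toy data only because the confident forecasts happen to be the accurate ones; that relationship is what a well-calibrated forecaster is expected to show, and the count of kept forecasts makes the coverage cost of the cutoff visible.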
Details
Motivation: Progress in large language models is held back by an evaluation bottleneck that requires repeatedly building benchmarks and iterating on experiments. The authors therefore want to forecast model performance before running experiments in order to speed up research.
Method: The paper defines a text-only performance forecasting task, builds PRECOG, a dataset of redacted description-performance pairs covering diverse tasks, domains, and metrics, uses models equipped with a retrieval module to make forecasts, and analyzes the reasoning and evidence-gathering behavior of different models.
Result: The task is challenging but feasible: models with a retrieval module reach a mean absolute error as low as 8.7 on the Accuracy subset at high confidence. Stronger reasoning models issue diverse, iterative queries while current open-source models lag, and in a zero-leakage setting GPT-5 with web search still attains nontrivial prediction accuracy.
Conclusion: The PRECOG dataset and accompanying analyses are an initial step toward open-ended anticipatory evaluation, supporting task difficulty estimation and smarter experiment prioritization.
Abstract: Progress in large language models is constrained by an evaluation bottleneck: build a benchmark, evaluate models and settings, then iterate. We therefore ask a simple question: can we forecast outcomes before running any experiments? We study text-only performance forecasting: estimating a model's score from a redacted task description and intended configuration, with no access to dataset instances. To support systematic study, we curate PRECOG, a corpus of redacted description-performance pairs spanning diverse tasks, domains, and metrics. Experiments show the task is challenging but feasible: models equipped with a retrieval module that excludes source papers achieve moderate prediction performance with well-calibrated uncertainty, reaching mean absolute error as low as 8.7 on the Accuracy subset at high-confidence thresholds. Our analysis indicates that stronger reasoning models engage in diverse, iterative querying, whereas current open-source models lag and often skip retrieval or gather evidence with limited diversity. We further test a zero-leakage setting, forecasting on newly released datasets or experiments before their papers are indexed, where GPT-5 with built-in web search still attains nontrivial prediction accuracy. Overall, our corpus and analyses offer an initial step toward open-ended anticipatory evaluation, supporting difficulty estimation and smarter experiment prioritization.
[18] Building Tailored Speech Recognizers for Japanese Speaking Assessment
Yotaro Kubo,Richard Sproat,Chihiro Taguchi,Llion Jones
Main category: cs.CL
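TL;DR: Proposes two methods that mitigate data sparsity in order to build an accurate Japanese speech recognizer that outputs pitch-accent markers, outperforming generic multilingual recognizers.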
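This entry reports results as a mora-label error rate. The sketch below shows the usual way a label error rate is computed, as edit distance divided by reference length. It assumes the paper's metric follows that standard definition, and the label strings in the example are hypothetical notation.

```python
from typing import List, Sequence


def edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Levenshtein distance between two label sequences (unit costs)."""
    prev: List[int] = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def label_error_rate(ref: Sequence[str], hyp: Sequence[str]) -> float:
    """Edit distance normalised by the number of reference labels."""
    if not ref:
        raise ValueError("reference sequence must not be empty")
    return edit_distance(ref, hyp) / len(ref)


# Hypothetical mora labels; the apostrophe stands in for an accent marker.
reference = ["ha", "shi'", "ga", "a", "ru"]
hypothesis = ["ha", "shi", "ga", "ru"]
print(label_error_rate(reference, hypothesis))  # one substitution + one deletion over 5 labels
```

For scale, the reported drop from 12.3% to 7.1% corresponds to a relative reduction of (12.3 - 7.1) / 12.3, or about 42%.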
Details
Motivation: Annotated data with accent marks is scarce, which makes it hard to train an accurate Japanese speech recognition model for pronunciation assessment. Method: Uses multitask learning with auxiliary loss functions and fuses two estimators, one over phoneme strings and one over text sequences, combining their estimates within the finite-state transducer framework. Result: On the CSJ core evaluation sets, the average mora-label error rate drops from 12.3% to 7.1%, supporting the effectiveness of multitask learning and fusion. Conclusion: The proposed multitask learning and model fusion methods improve the accuracy of accent-marked Japanese speech recognition, outperform generic multilingual recognizers, and suit pronunciation assessment tasks. Abstract: This paper presents methods for building speech recognizers tailored for Japanese speaking assessment tasks. Specifically, we build a speech recognizer that outputs phonemic labels with accent markers. Although Japanese is resource-rich, there is only a small amount of data for training models to produce accurate phonemic transcriptions that include accent marks. We propose two methods to mitigate data sparsity. First, a multitask training scheme introduces auxiliary loss functions to estimate orthographic text labels and pitch patterns of the input signal, so that utterances with only orthographic annotations can be leveraged in training. The second fuses two estimators, one over phonetic alphabet strings, and the other over text token sequences. To combine these estimates we develop an algorithm based on the finite-state transducer framework. Our results indicate that the use of multitask learning and fusion is effective for building an accurate phonemic recognizer. We show that this approach is advantageous compared to the use of generic multilingual recognizers. The relative advantages of the proposed methods were also compared. Our proposed methods reduced the average of mora-label error rates from 12.3% to 7.1% over the CSJ core evaluation sets.[19] Enhancing Molecular Property Prediction with Knowledge from Large Language Models
Peng Zhou,Lai Hou Tim,Zhixiang Cheng,Kun Xie,Chaoyi Li,Wei Liu,Xiangxiang Zeng
Main category: cs.CL
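TL;DR: Proposes a new framework that, for the first time, combines knowledge extracted from large language models (LLMs) with structural features from pre-trained molecular models for molecular property prediction (MPP), enriching the feature representation with LLM-generated domain knowledge and executable code.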
Details
Motivation: Graph neural networks and self-supervised learning have advanced molecular property prediction, yet human prior knowledge remains indispensable; LLMs can supply it but suffer from knowledge gaps and hallucinations, especially for less-studied molecular properties. Method: Prompts LLMs (GPT-4o, GPT-4.1, DeepSeek-R1) to generate domain-relevant knowledge and executable molecular vectorization code, builds knowledge-based features from them, and fuses these with structural features from pre-trained molecular models into a unified representation for molecular property prediction. Result: Across extensive experiments, the integrated method outperforms existing approaches, supporting the value of fusing LLM-derived knowledge with structural information. Conclusion: Combining LLM-extracted knowledge with molecular structural features is an effective and robust strategy for molecular property prediction and suggests a direction for AI in drug discovery. Abstract: Predicting molecular properties is a critical component of drug discovery. Recent advances in deep learning, particularly Graph Neural Networks (GNNs), have enabled end-to-end learning from molecular structures, reducing reliance on manual feature engineering. However, while GNNs and self-supervised learning approaches have advanced molecular property prediction (MPP), the integration of human prior knowledge remains indispensable, as evidenced by recent methods that leverage large language models (LLMs) for knowledge extraction. Despite their strengths, LLMs are constrained by knowledge gaps and hallucinations, particularly for less-studied molecular properties. In this work, we propose a novel framework that, for the first time, integrates knowledge extracted from LLMs with structural features derived from pre-trained molecular models to enhance MPP. Our approach prompts LLMs to generate both domain-relevant knowledge and executable code for molecular vectorization, producing knowledge-based features that are subsequently fused with structural representations. We employ three state-of-the-art LLMs, GPT-4o, GPT-4.1, and DeepSeek-R1, for knowledge extraction. Extensive experiments demonstrate that our integrated method outperforms existing approaches, confirming that the combination of LLM-derived knowledge and structural information provides a robust and effective solution for MPP.[20] RedHerring Attack: Testing the Reliability of Attack Detection
Jonathan Rusert
Main category: cs.CL
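TL;DR: Proposes the RedHerring attack, which undermines the reliability of attack detection models by making them raise false alarms while the classifier stays accurate. Experiments show that it substantially lowers detection accuracy while classification performance is maintained or improved; a confidence check that requires no retraining is proposed as an initial defense.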
Details
Motivation: The reliability of detection models for adversarial text attacks has not been thoroughly studied; this paper explores a new attack that exposes the threats such detectors may face. Method: Proposes and tests a new attack setting and method, RedHerring, which modifies text so that the detector wrongly predicts an attack while the classifier still classifies correctly, creating a conflict between classifier and detector. Result: Tested on 4 datasets against 3 detectors and 4 classifiers, RedHerring lowers detection accuracy by 20 to 71 points while maintaining or improving classification accuracy; the proposed confidence check greatly increases detection accuracy. Conclusion: RedHerring exposes the fragility of detection models in adversarial settings, underscores the need for more reliable detection systems, and offers insights for designing future defenses. Abstract: In response to adversarial text attacks, attack detection models have been proposed and shown to successfully identify text modified by adversaries. Attack detection models can be leveraged to provide an additional check for NLP models and give signals for human input. However, the reliability of these models has not yet been thoroughly explored. Thus, we propose and test a novel attack setting and attack, RedHerring. RedHerring aims to make attack detection models unreliable by modifying a text to cause the detection model to predict an attack, while keeping the classifier correct. This creates a tension between the classifier and detector. If a human sees that the detector is giving an "incorrect" prediction, but the classifier a correct one, then the human will see the detector as unreliable. We test this novel threat model on 4 datasets against 3 detectors defending 4 classifiers. We find that RedHerring is able to drop detection accuracy by between 20 and 71 points, while maintaining (or improving) classifier accuracy. As an initial defense, we propose a simple confidence check which requires no retraining of the classifier or detector and increases detection accuracy greatly. This novel threat model offers new insights into how adversaries may target detection models.[21] Overcoming Black-box Attack Inefficiency with Hybrid and Dynamic Select Algorithms
Abhinay Shankar Belde,Rohit Ramkumar,Jonathan Rusert
Main category: cs.CL
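TL;DR: Proposes two new attack selection strategies, Hybrid Select and Dynamic Select, that reduce the number of queries by up to 25.82% on average while preserving attack effectiveness, making adversarial text attacks more efficient.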
Details
Motivation: Existing black-box attack methods need many queries and are computationally expensive, which is a burden for researchers with limited resources and is aggravated by complex transformer architectures. Method: Proposes two strategies: Hybrid Select combines BinarySelect with GreedySelect through a size threshold, and Dynamic Select learns which algorithm to apply at each text length. Result: Validated on 4 datasets and 6 target models, sentence-level Hybrid Select reduces the number of queries by up to 25.82% on average while preserving attack effectiveness. Conclusion: The proposed methods substantially lower the query cost of adversarial attacks, apply to both encoder models and LLMs, and improve attack efficiency without sacrificing effectiveness. Abstract: Adversarial text attack research plays a crucial role in evaluating the robustness of NLP models. However, the increasing complexity of transformer-based architectures has dramatically raised the computational cost of attack testing, especially for researchers with limited resources (e.g., GPUs). Existing popular black-box attack methods often require a large number of queries, which can make them inefficient and impractical for researchers. To address these challenges, we propose two new attack selection strategies called Hybrid and Dynamic Select, which better combine the strengths of previous selection algorithms. Hybrid Select merges generalized BinarySelect techniques with GreedySelect by introducing a size threshold to decide which selection algorithm to use. Dynamic Select provides an alternative approach of combining the generalized Binary and GreedySelect by learning which lengths of texts each selection method should be applied to. This greatly reduces the number of queries needed while maintaining attack effectiveness (a limitation of BinarySelect). Across 4 datasets and 6 target models, our best method (sentence-level Hybrid Select) is able to reduce the number of required queries per attack by up to 25.82% on average against both encoder models and LLMs, without losing the effectiveness of the attack.[22] MI-Fuse: Label Fusion for Unsupervised Domain Adaptation with Closed-Source Large-Audio Language Model
Hsiao-Ying Huang,Yi-Cheng Lin,Hung-yi Lee
Main category: cs.CL
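TL;DR: Proposes MI-Fuse, a denoised label fusion framework weighted by mutual information, which improves cross-domain speech emotion recognition when source-domain data are unavailable and the large audio-language model is accessible only through an API.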
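A minimal sketch of mutual-information-weighted label fusion, under stated assumptions: each teacher's uncertainty is taken to be the entropy of its mean prediction minus the mean entropy of its individual stochastic predictions, and the weight `exp(-MI)` is a hypothetical choice made for illustration. The paper's exact weighting and its exponential moving average teacher are not reproduced here.

```python
import math
from typing import List

Distribution = List[float]


def entropy(p: Distribution) -> float:
    """Shannon entropy in nats, skipping zero-probability classes."""
    return -sum(x * math.log(x) for x in p if x > 0.0)


def mean_distribution(samples: List[Distribution]) -> Distribution:
    """Class-wise average of several stochastic predictions."""
    n = len(samples)
    return [sum(s[i] for s in samples) / n for i in range(len(samples[0]))]


def mutual_information(samples: List[Distribution]) -> float:
    """H(mean prediction) - mean H(prediction); larger means the samples disagree more."""
    expected_entropy = sum(entropy(s) for s in samples) / len(samples)
    return max(0.0, entropy(mean_distribution(samples)) - expected_entropy)


def fuse(teacher_samples: List[List[Distribution]]) -> Distribution:
    """Weight each teacher's mean distribution by exp(-MI) (assumed form) and mix."""
    means = [mean_distribution(s) for s in teacher_samples]
    raw = [math.exp(-mutual_information(s)) for s in teacher_samples]
    total = sum(raw)
    weights = [r / total for r in raw]
    return [sum(w * m[i] for w, m in zip(weights, means)) for i in range(len(means[0]))]


# Toy example: a self-consistent teacher and one whose samples disagree.
stable_teacher = [[0.8, 0.2], [0.8, 0.2]]
unstable_teacher = [[1.0, 0.0], [0.0, 1.0]]
fused = fuse([stable_teacher, unstable_teacher])
print(fused)  # the stable teacher dominates the fused label
```

The fused distribution can then serve as a soft pseudo-label for the student; a teacher whose repeated predictions disagree contributes less to it.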
Details
Motivation: Speech emotion recognition degrades under domain mismatch in real deployments; the goal is to adapt a model when source-domain data cannot be accessed and a powerful LALM is reachable only through an API. Method: Proposes MI-Fuse, which uses the LALM and a source-domain-trained SER classifier as two teachers, draws multiple stochastic predictions from each to form distributions, weights them by mutual-information-based uncertainty, and stabilizes training with an exponential moving average. Result: Experiments on three public emotion datasets and six cross-domain transfers show that the method consistently surpasses the LALM and outperforms the strongest baseline by 3.9%. Conclusion: MI-Fuse improves the student model's speech emotion recognition in the target domain without sharing source-domain data, making it suitable for realistic model adaptation. Abstract: Large audio-language models (LALMs) show strong zero-shot ability on speech tasks, suggesting promise for speech emotion recognition (SER). However, SER in real-world deployments often fails under domain mismatch, where source data are unavailable and powerful LALMs are accessible only through an API. We ask: given only unlabeled target-domain audio and an API-only LALM, can a student model be adapted to outperform the LALM in the target domain? To this end, we propose MI-Fuse, a denoised label fusion framework that supplements the LALM with a source-domain trained SER classifier as an auxiliary teacher. The framework draws multiple stochastic predictions from both teachers, weights their mean distributions by mutual-information-based uncertainty, and stabilizes training with an exponential moving average teacher. Experiments across three public emotion datasets and six cross-domain transfers show consistent gains, with the student surpassing the LALM and outperforming the strongest baseline by 3.9%. This approach strengthens emotion-aware speech systems without sharing source data, enabling realistic adaptation.[23] Probability Distribution Collapse: A Critical Bottleneck to Compact Unsupervised Neural Grammar Induction
Jinwook Park,Kangil Kim
Main category: cs.CL
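TL;DR: Proposes a new method that mitigates probability distribution collapse in neural parameterization, substantially improving parsing performance in unsupervised neural grammar induction while yielding more compact grammars.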
Details
Motivation: Existing models face an expressiveness bottleneck that leads to oversized yet underperforming grammars, mainly because of probability distribution collapse. Method: Analyzes how the collapse arises in neural parameterization and proposes collapse-relaxing neural parameterization to address it. Result: The new method substantially improves parsing performance across many languages while allowing more compact grammars. Conclusion: Mitigating probability distribution collapse improves both the efficiency and the performance of unsupervised neural grammar induction. Abstract: Unsupervised neural grammar induction aims to learn interpretable hierarchical structures from language data. However, existing models face an expressiveness bottleneck, often resulting in unnecessarily large yet underperforming grammars. We identify a core issue, $\textit{probability distribution collapse}$, as the underlying cause of this limitation. We analyze when and how the collapse emerges across key components of neural parameterization and introduce a targeted solution, $\textit{collapse-relaxing neural parameterization}$, to mitigate it. Our approach substantially improves parsing performance while enabling the use of significantly more compact grammars across a wide range of languages, as demonstrated through extensive empirical analysis.[24] Confidence-guided Refinement Reasoning for Zero-shot Question Answering
Youwon Jang,Woo Suk Choi,Minjoon Jung,Minsu Lee,Byoung-Tak Zhang
Main category: cs.CL
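TL;DR: Proposes C2R, a training-free framework that constructs and refines sub-questions and their answers and uses the model's own confidence scores to improve question answering across text, image, and video.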
Details
Motivation: To improve the reliability and robustness of reasoning in multimodal question answering without additional training, by exploring how sub-questions and confidence can guide final answer selection. Method: C2R curates diverse sub-question paths, generates multiple answer candidates, compares their confidence scores, and selects the most reliable final answer; the whole process is training-free and compatible with existing models. Result: C2R improves performance across a range of QA models and benchmarks and sheds light on how the quantity and quality of sub-questions affect reasoning reliability. Conclusion: C2R is a general, training-free reasoning framework that uses confidence to guide reasoning in multimodal question answering, improving answer accuracy and stability. Abstract: We propose Confidence-guided Refinement Reasoning (C2R), a novel training-free framework applicable to question-answering (QA) tasks across text, image, and video domains. C2R strategically constructs and refines sub-questions and their answers (sub-QAs), deriving a better confidence score for the target answer. C2R first curates a subset of sub-QAs to explore diverse reasoning paths, then compares the confidence scores of the resulting answer candidates to select the most reliable final answer. Since C2R relies solely on confidence scores derived from the model itself, it can be seamlessly integrated with various existing QA models, demonstrating consistent performance improvements across diverse models and benchmarks. Furthermore, we provide essential yet underexplored insights into how leveraging sub-QAs affects model behavior, specifically analyzing the impact of both the quantity and quality of sub-QAs on achieving robust and reliable reasoning.[25] SFT Doesn't Always Hurt General Capabilities: Revisiting Domain-Specific Fine-Tuning in LLMs
Jiacheng Lin,Zhongruo Wang,Kun Qian,Tian Wang,Arvind Srinivasan,Hansi Zeng,Ruochen Jiao,Xie Zhou,Jiri Gesi,Dakuo Wang,Yufan Guo,Kai Zhong,Weiqi Zhang,Sujay Sanghavi,Changyou Chen,Hyokun Yun,Lihong Li
Main category: cs.CL
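TL;DR: Studies how supervised fine-tuning (SFT) on domain-specific datasets affects the general capabilities of large language models, finds that a smaller learning rate substantially mitigates the degradation, and proposes a new method, TALR, that balances domain-specific gains and general capabilities better than existing methods.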
Details
Motivation: SFT can improve domain performance while harming the general capabilities of large language models; the paper revisits this trade-off. Method: Empirically evaluates learning rates and mitigation strategies (L2 regularization, LoRA, model averaging, FLOW), proposes Token-Adaptive Loss Reweighting (TALR), and explains the observations with a theoretical analysis. Result: Smaller learning rates substantially mitigate general-performance degradation, and TALR retains general capabilities better than the other baselines while keeping domain performance. Conclusion: SFT does not necessarily hurt general capabilities; what matters is learning rate control and loss reweighting, so a small learning rate is recommended, with TALR added when a stronger balance is needed. Abstract: Supervised Fine-Tuning (SFT) on domain-specific datasets is a common approach to adapt Large Language Models (LLMs) to specialized tasks but is often believed to degrade their general capabilities. In this work, we revisit this trade-off and present both empirical and theoretical insights. First, we show that SFT does not always hurt: using a smaller learning rate can substantially mitigate general performance degradation while preserving comparable target-domain performance. We then provide a theoretical analysis that explains these phenomena and further motivates a new method, Token-Adaptive Loss Reweighting (TALR). Building on this, and recognizing that smaller learning rates alone do not fully eliminate general-performance degradation in all cases, we evaluate a range of strategies for reducing general capability loss, including L2 regularization, LoRA, model averaging, FLOW, and our proposed TALR. Experimental results demonstrate that while no method completely eliminates the trade-off, TALR consistently outperforms these baselines in balancing domain-specific gains and general capabilities. Finally, we distill our findings into practical guidelines for adapting LLMs to new domains: (i) use a small learning rate to achieve a favorable trade-off, and (ii) when a stronger balance is desired, adopt TALR as an effective strategy.[26] Towards Atoms of Large Language Models
Chenhui Hu,Pengfei Cao,Yubo Chen,Kang Liu,Jun Zhao
Main category: cs.CL
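TL;DR: Proposes the Atoms Theory, which defines atoms as the fundamental units of internal representations in large language models (LLMs), corrects representation shifting with the atomic inner product (AIP), and proves that atoms satisfy stability and uniqueness conditions for sparse representations and can be reliably identified by sparse autoencoders (SAEs) with threshold activations. Experiments on several mainstream LLMs support the theory: sparse reconstruction reaches 99.9% on average, and atoms meet the uniqueness condition far more often than neurons or features.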
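For reference, the Restricted Isometry Property that this entry invokes is usually stated as follows in compressed sensing. This is the standard Euclidean form for a dictionary matrix $D$; how the paper's atomic inner product changes the norm is not given in this entry, so the symbols below are assumptions that follow the textbook definition.

```latex
% Standard RIP of order k with constant \delta_k \in (0, 1) for a dictionary D:
(1 - \delta_k)\,\lVert x \rVert_2^2 \;\le\; \lVert D x \rVert_2^2 \;\le\; (1 + \delta_k)\,\lVert x \rVert_2^2
\quad \text{for all } x \text{ with } \lVert x \rVert_0 \le k .
```

Informally, $D$ approximately preserves the length of every $k$-sparse vector, which is what makes sparse codes over $D$ stable.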
Details
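Motivation: The fundamental units of internal representations in LLMs are not yet well defined: neurons suffer from polysemy, and features suffer from unreliable reconstruction and instability, which limits understanding of model mechanisms. A more stable, interpretable, and recoverable theory of basic representational units is needed. Method: Proposes the Atoms Theory, introduces the atomic inner product (AIP) to correct representation shifting, formally defines atoms, and proves conditions under which they satisfy the Restricted Isometry Property (RIP), which ensures stable sparse representations; under stronger conditions, further proves the uniqueness and ℓ₁ recoverability of the sparse representations; shows theoretically that single-layer SAEs with threshold activations can reliably identify atoms; and trains SAEs on Gemma2-2B, Gemma2-9B, and Llama3.1-8B for empirical validation. Result: Achieves 99.9% sparse reconstruction on average across these models; more than 99.8% of atoms satisfy the uniqueness condition, compared with 0.5% of neurons and 68.2% of features; scaling experiments reveal the link between SAE size and recovery capacity; atoms capture the intrinsic representations of LLMs more faithfully than neurons or features. Conclusion: The Atoms Theory provides a systematic theoretical framework for the internal representations of LLMs, defines basic representational units that are stable, unique, and recoverable, links to compressed sensing, and lays a foundation for mechanistic interpretability. Abstract: The fundamental units of internal representations in large language models (LLMs) remain undefined, limiting further understanding of their mechanisms. Neurons or features are often regarded as such units, yet neurons suffer from polysemy, while features face concerns of unreliable reconstruction and instability. To address this issue, we propose the Atoms Theory, which defines such units as atoms. We introduce the atomic inner product (AIP) to correct representation shifting, formally define atoms, and prove conditions under which atoms satisfy the Restricted Isometry Property (RIP), ensuring stable sparse representations over the atom set and linking to compressed sensing. Under stronger conditions, we further establish the uniqueness and exact $\ell_1$ recoverability of the sparse representations, and provide guarantees that single-layer sparse autoencoders (SAEs) with threshold activations can reliably identify the atoms. To validate the Atoms Theory, we train threshold-activated SAEs on Gemma2-2B, Gemma2-9B, and Llama3.1-8B, achieving 99.9% sparse reconstruction across layers on average, and more than 99.8% of atoms satisfy the uniqueness condition, compared to 0.5% for neurons and 68.2% for features, showing that atoms more faithfully capture intrinsic representations of LLMs. Scaling experiments further reveal the link between SAE size and recovery capacity.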
Overall, this work systematically introduces and validates Atoms Theory of LLMs, providing a theoretical framework for understanding internal representations and a foundation for mechanistic interpretability. Code available at https://github.com/ChenhuiHu/towards_atoms.[27] Few-Shot and Training-Free Review Generation via Conversational Prompting
Genki Kusano
Main category: cs.CL
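TL;DR: Proposes Conversational Prompting, a lightweight method for generating personalized reviews in few-shot, training-free settings; both its simple variant (SCP) and its contrastive variant (CCP) make generated reviews much more consistent with the target user's style.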
Details
Motivation: Existing personalized review generation methods usually rely on extensive user review histories or model fine-tuning, which does not fit real applications where reviews are scarce and training is infeasible. Method: Reformulates user reviews as multi-turn conversations used as the prompt, with two variants: Simple Conversational Prompting (SCP), which uses only the user's own reviews, and Contrastive Conversational Prompting (CCP), which inserts reviews from other users or LLMs as incorrect replies and asks the model to correct them. Result: Experiments on eight product domains and five LLMs show that the conventional non-conversational prompt produces reviews similar to those of random users, whereas SCP and CCP bring generated reviews much closer to the target user's real reviews on ROUGE-L, BERTScore, user identity matching, and sentiment analysis, even with only two reviews per user. CCP does better when high-quality negative examples are available, and SCP remains competitive when they are not. Conclusion: Conversational prompting is an effective and practical way to generate personalized reviews under few-shot, training-free conditions. Abstract: Personalized review generation helps businesses understand user preferences, yet most existing approaches assume extensive review histories of the target user or require additional model training. Real-world applications often face few-shot and training-free situations, where only a few user reviews are available and fine-tuning is infeasible. It is well known that large language models (LLMs) can address such low-resource settings, but their effectiveness depends on prompt engineering. In this paper, we propose Conversational Prompting, a lightweight method that reformulates user reviews as multi-turn conversations. Its simple variant, Simple Conversational Prompting (SCP), relies solely on the user's own reviews, while the contrastive variant, Contrastive Conversational Prompting (CCP), inserts reviews from other users or LLMs as incorrect replies and then asks the model to correct them, encouraging the model to produce text in the user's style. Experiments on eight product domains and five LLMs showed that the conventional non-conversational prompt often produced reviews similar to those written by random users, based on text-based metrics such as ROUGE-L and BERTScore, and application-oriented tasks like user identity matching and sentiment analysis. In contrast, both SCP and CCP produced reviews much closer to those of the target user, even when each user had only two reviews. CCP brings further improvements when high-quality negative examples are available, whereas SCP remains competitive when such data cannot be collected.
These results suggest that conversational prompting offers a practical solution for review generation under few-shot and training-free constraints.[28] Enrich-on-Graph: Query-Graph Alignment for Complex Reasoning with LLM Enriching
Songze Li,Zhiqiang Liu,Zhengke Gui,Huajun Chen,Wen Zhang
Main category: cs.CL
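TL;DR: Proposes Enrich-on-Graph (EoG), a flexible framework that uses the prior knowledge of large language models to enrich knowledge graphs and bridge the semantic gap between queries and graphs, achieving state-of-the-art performance on knowledge graph question answering.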
Details
Motivation: Large language models hallucinate and make factual errors on knowledge-intensive tasks, which the authors attribute mainly to the semantic gap between structured knowledge graphs and unstructured queries. Method: Proposes the Enrich-on-Graph (EoG) framework, which uses LLMs' prior knowledge to enrich knowledge graphs dynamically, and designs three graph quality evaluation metrics that measure query-graph alignment. Result: Experiments on two KGQA benchmark datasets show that EoG generates high-quality knowledge graphs and achieves state-of-the-art performance with low computational cost, scalability, and adaptability. Conclusion: By bridging the semantic gap between queries and knowledge graphs, EoG improves the accuracy and robustness of knowledge graph question answering and offers an efficient, scalable approach to reasoning over knowledge graphs. Abstract: Large Language Models (LLMs) exhibit strong reasoning capabilities in complex tasks. However, they still struggle with hallucinations and factual errors in knowledge-intensive scenarios like knowledge graph question answering (KGQA). We attribute this to the semantic gap between structured knowledge graphs (KGs) and unstructured queries, caused by inherent differences in their focuses and structures. Existing methods usually employ resource-intensive, non-scalable workflows reasoning on vanilla KGs, but overlook this gap. To address this challenge, we propose a flexible framework, Enrich-on-Graph (EoG), which leverages LLMs' prior knowledge to enrich KGs and bridge the semantic gap between graphs and queries. EoG enables efficient evidence extraction from KGs for precise and robust reasoning, while ensuring low computational costs, scalability, and adaptability across different methods. Furthermore, we propose three graph quality evaluation metrics to analyze query-graph alignment in the KGQA task, supported by theoretical validation of our optimization objectives. Extensive experiments on two KGQA benchmark datasets indicate that EoG can effectively generate high-quality KGs and achieve state-of-the-art performance. Our code and data are available at https://github.com/zjukg/Enrich-on-Graph.[29] Leveraging What's Overfixed: Post-Correction via LLM Grammatical Error Overcorrection
Taehee Park,Heejin Do,Gary Geunbae Lee
Main category: cs.CL
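TL;DR: Proposes Post-Correction via Overcorrection (PoCO), a new method that combines the generative power of large models with the reliability of small models to balance recall and precision in grammatical error correction.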
Details
Motivation: Small language models are precise but have low recall in grammatical error correction, while large language models tend to overcorrect and lose precision, so a method that combines the strengths of both is needed. Method: PoCO first uses an LLM to trigger overcorrection and raise recall, then post-processes the result with a fine-tuned small model that fixes erroneous outputs, balancing precision and recall. Result: Experiments show that PoCO increases recall while keeping precision competitive, improving the overall quality of grammatical error correction. Conclusion: PoCO combines the generative power of large models with the reliability of small models, addressing both the low recall of small models and the overcorrection of large models. Abstract: Robust supervised fine-tuned small Language Models (sLMs) often show high reliability but tend to undercorrect. They achieve high precision at the cost of low recall. Conversely, Large Language Models (LLMs) often show the opposite tendency, making excessive overcorrection, leading to low precision. To effectively harness the strengths of LLMs to address the recall challenges in sLMs, we propose Post-Correction via Overcorrection (PoCO), a novel approach that strategically balances recall and precision. PoCO first intentionally triggers overcorrection via an LLM to maximize recall by allowing comprehensive revisions, then applies a targeted post-correction step via fine-tuning smaller models to identify and refine erroneous outputs. We aim to harmonize both aspects by leveraging the generative power of LLMs while preserving the reliability of smaller supervised models. Our extensive experiments demonstrate that PoCO effectively balances GEC performance by increasing recall with competitive precision, ultimately improving the overall quality of grammatical error correction.[30] Distilling Many-Shot In-Context Learning into a Cheat Sheet
Ukyo Honda,Soichiro Murakami,Peinan Zhang
Main category: cs.CL
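TL;DR: Proposes cheat-sheet ICL, which compresses the information in many-shot in-context learning into a concise textual summary, matching or exceeding many-shot ICL performance with far fewer input tokens.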
Details
Motivation: Many-shot in-context learning makes LLM inputs long and therefore computationally expensive. Method: Distills the information from many-shot ICL into a concise textual summary (a cheat sheet) that is used as the context at inference time. Result: Experiments on challenging reasoning tasks show that cheat-sheet ICL performs comparably to or better than many-shot ICL with far fewer tokens, and matches retrieval-based ICL without test-time retrieval. Conclusion: Cheat-sheet ICL is a practical, efficient way to use LLMs for downstream tasks. Abstract: Recent advances in large language models (LLMs) enable effective in-context learning (ICL) with many-shot examples, but at the cost of high computational demand due to longer input tokens. To address this, we propose cheat-sheet ICL, which distills the information from many-shot ICL into a concise textual summary (cheat sheet) used as the context at inference time. Experiments on challenging reasoning tasks show that cheat-sheet ICL achieves comparable or better performance than many-shot ICL with far fewer tokens, and matches retrieval-based ICL without requiring test-time retrieval. These findings demonstrate that cheat-sheet ICL is a practical alternative for leveraging LLMs in downstream tasks.[31] Zero-Shot Privacy-Aware Text Rewriting via Iterative Tree Search
Shuo Huang,Xingliang Yuan,Gholamreza Haffari,Lizhen Qu
Main category: cs.CL
TL;DR: Proposes a zero-shot, tree-search-based iterative sentence rewriting algorithm that protects privacy while keeping text coherent and natural.
Details
Motivation: Existing text de-identification methods struggle to balance privacy protection with text naturalness and may still leak sensitive information. Method: Uses a tree search guided by a reward model to rewrite privacy-sensitive segments step by step, systematically obfuscating or deleting private information. Result: Experiments on privacy-sensitive datasets show that the method significantly outperforms baselines in both privacy protection and utility preservation. Conclusion: The method balances privacy protection and text quality and suits protecting the privacy of LLM inputs in cloud services. Abstract: The increasing adoption of large language models (LLMs) in cloud-based services has raised significant privacy concerns, as user inputs may inadvertently expose sensitive information. Existing text anonymization and de-identification techniques, such as rule-based redaction and scrubbing, often struggle to balance privacy preservation with text naturalness and utility. In this work, we propose a zero-shot, tree-search-based iterative sentence rewriting algorithm that systematically obfuscates or deletes private information while preserving coherence, relevance, and naturalness. Our method incrementally rewrites privacy-sensitive segments through a structured search guided by a reward model, enabling dynamic exploration of the rewriting space. Experiments on privacy-sensitive datasets show that our approach significantly outperforms existing baselines, achieving a superior balance between privacy protection and utility preservation.[32] Concise and Sufficient Sub-Sentence Citations for Retrieval-Augmented Generation
Guo Chen,Qiuyuan Li,Qiuxian Li,Hongliang Dai,Xiang Chen,Piji Li
Main category: cs.CL
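TL;DR: Proposes a method for generating sub-sentence citations that makes citations in retrieval-augmented generation (RAG) systems more precise and readable, reducing the effort users spend verifying generated content.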
Details
Motivation: Existing attribution methods usually cite at the sentence or paragraph level, which yields redundant or incomplete information and makes LLM outputs harder to verify. Method: Develops annotation guidelines for sub-sentence citations and builds a corresponding dataset; uses LLMs to generate fine-tuning data automatically and a credit model to filter out low-quality examples, forming a citation generation framework. Result: Experiments show that the method generates higher-quality citations that are more concise, sufficient, and readable. Conclusion: Sub-sentence citations are more precise and effective than conventional sentence-level citations and improve answer verifiability and user experience in RAG systems. Abstract: In retrieval-augmented generation (RAG) question answering systems, generating citations for large language model (LLM) outputs enhances verifiability and helps users identify potential hallucinations. However, we observe two problems in the citations produced by existing attribution methods. First, the citations are typically provided at the sentence or even paragraph level. Long sentences or paragraphs may include a substantial amount of irrelevant content. Second, sentence-level citations may omit information that is essential for verifying the output, forcing users to read the surrounding context. In this paper, we propose generating sub-sentence citations that are both concise and sufficient, thereby reducing the effort required by users to confirm the correctness of the generated output. To this end, we first develop annotation guidelines for such citations and construct a corresponding dataset. Then, we propose an attribution framework for generating citations that adhere to our standards. This framework leverages LLMs to automatically generate fine-tuning data for our task and employs a credit model to filter out low-quality examples. Our experiments on the constructed dataset demonstrate that the proposed approach can generate high-quality and more readable citations.[33] WeFT: Weighted Entropy-driven Fine-Tuning for dLLMs
Guowei Xu,Wenxin Xu,Jiawang Zhao,Kaisheng Ma
Main category: cs.CL
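TL;DR: Proposes WeFT, a weighted supervised fine-tuning method that assigns tokens different weights based on their entropy to improve the training of diffusion language models, substantially outperforming standard SFT on several reasoning benchmarks.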
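A minimal sketch of entropy-based token weighting for a fine-tuning loss. The entry only says that weights depend on token entropy, so the specific choice here (weight proportional to predictive entropy, normalised to mean 1) and the function names are assumptions for illustration and should not be read as WeFT's actual weighting rule.

```python
import math
from typing import List

Distribution = List[float]


def entropy(p: Distribution) -> float:
    """Shannon entropy in nats, skipping zero-probability entries."""
    return -sum(x * math.log(x) for x in p if x > 0.0)


def entropy_weights(token_dists: List[Distribution]) -> List[float]:
    """One weight per token, proportional to its predictive entropy (assumed form)."""
    ents = [entropy(p) for p in token_dists]
    mean_ent = sum(ents) / len(ents)
    if mean_ent == 0.0:
        return [1.0] * len(ents)  # fall back to uniform weights
    return [e / mean_ent for e in ents]


def weighted_nll(token_dists: List[Distribution], targets: List[int],
                 weights: List[float]) -> float:
    """Weighted average negative log-likelihood of the target tokens."""
    losses = [-math.log(p[t]) for p, t in zip(token_dists, targets)]
    return sum(w * l for w, l in zip(weights, losses)) / sum(weights)


# Toy example: the first token is uncertain, the second is fully confident.
dists = [[0.5, 0.5], [1.0, 0.0]]
weights = entropy_weights(dists)
print(weights)
print(weighted_nll(dists, [0, 0], weights))
```

Under this assumed rule, uncertain tokens dominate the loss while tokens the model already predicts confidently contribute little.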
Details
Motivation: Diffusion models show promise for language modeling but lack precise probability estimates at each denoising step, which makes generation less predictable and often inconsistent and makes supervised fine-tuning (SFT) hard to apply. Method: Proposes WeFT, derived from diffusion theory, which assigns tokens different weights according to their entropy so that fine-tuning exerts more control over the key tokens that guide the direction of generation. Result: Trained on s1K, s1K-1.1, and 3k samples from open-r1, WeFT achieves relative improvements of 39%, 64%, and 83% over standard SFT on four reasoning benchmarks (Sudoku, Countdown, GSM8K, and MATH-500). Conclusion: Entropy-based weighting improves the reasoning ability of diffusion language models under small-sample fine-tuning and supports the importance of controlling key tokens for generation quality. Abstract: Diffusion models have recently shown strong potential in language modeling, offering faster generation compared to traditional autoregressive approaches. However, applying supervised fine-tuning (SFT) to diffusion models remains challenging, as they lack precise probability estimates at each denoising step. While the diffusion mechanism enables the model to reason over entire sequences, it also makes the generation process less predictable and often inconsistent. This highlights the importance of controlling key tokens that guide the direction of generation. To address this issue, we propose WeFT, a weighted SFT method for diffusion language models, where tokens are assigned different weights based on their entropy. Derived from diffusion theory, WeFT delivers substantial gains: training on s1K, s1K-1.1, and 3k samples from open-r1, it achieves relative improvements of 39%, 64%, and 83% over standard SFT on four widely used reasoning benchmarks (Sudoku, Countdown, GSM8K, and MATH-500). The code and models will be made publicly available.[34] Single Answer is Not Enough: On Generating Ranked Lists with Medical Reasoning Models
Pittawat Taveekitworachai,Natpatchara Pongjirapat,Krittaphas Chaisutyakorn,Piyalitt Ittichaiwong,Tossaporn Saengja,Kunat Pipatanakul
Main category: cs.CL
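TL;DR: Presents the first systematic study of how to make medical reasoning models (MRMs) generate ranked lists of answers to open-ended questions, comparing prompting and fine-tuning approaches and finding that models trained with reinforcement fine-tuning (RFT) are more robust across answer formats.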
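A minimal sketch of one reward that could score a ranked-list answer: the reciprocal rank of the first item matching the reference. This is a stand-in chosen for illustration. The entry does not specify the paper's reward functions, and the name `reciprocal_rank_reward` and the exact-match normalisation are assumptions.

```python
from typing import List


def reciprocal_rank_reward(ranked_answers: List[str], gold: str) -> float:
    """Return 1/rank of the first answer matching the reference, or 0.0 if none match.

    Matching is a case-insensitive exact match after stripping whitespace,
    which is a simplification; real answers may need fuzzier comparison.
    """
    target = gold.strip().lower()
    for rank, answer in enumerate(ranked_answers, start=1):
        if answer.strip().lower() == target:
            return 1.0 / rank
    return 0.0


# Hypothetical model output for an open-ended diagnostic question.
ranked = ["pneumonia", "asthma", "bronchitis"]
print(reciprocal_rank_reward(ranked, "Asthma"))        # 0.5
print(reciprocal_rank_reward(ranked, "tuberculosis"))  # 0.0
```

A reward of this shape gives partial credit when the reference appears lower in the list, which a single-answer exact-match reward cannot do.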
Details
Motivation: Clinical decisions usually weigh several options rather than a single answer, yet existing medical reasoning models are mostly trained to output one answer, which limits their breadth and safety in practice. Method: Proposes ranked lists as a new answer format, experiments with prompting and with two fine-tuning methods, supervised fine-tuning (SFT) and reinforcement fine-tuning (RFT), designs new reward functions for ranked lists, and evaluates RFT through ablation studies. Result: Some SFT models generalize to certain answer formats, while RFT-trained models are more robust across formats; in a case study on a modified MedQA, models may miss the benchmark's preferred answer but still recognize valid answers. Conclusion: RFT better helps medical reasoning models adapt to diverse answer formats, and ranked lists are a promising alternative output format that can make models more useful for clinical decision-making. Abstract: This paper presents a systematic study on enabling medical reasoning models (MRMs) to generate ranked lists of answers for open-ended questions. Clinical decision-making rarely relies on a single answer but instead considers multiple options, reducing the risks of narrow perspectives. Yet current MRMs are typically trained to produce only one answer, even in open-ended settings. We propose an alternative format, ranked lists, and investigate two approaches: prompting and fine-tuning. While prompting is a cost-effective way to steer an MRM's response, not all MRMs generalize well across different answer formats: choice, short text, and list answers. Based on our prompting findings, we train and evaluate MRMs using supervised fine-tuning (SFT) and reinforcement fine-tuning (RFT). SFT teaches a model to imitate annotated responses, and RFT incentivizes exploration through the responses that maximize a reward. We propose new reward functions targeted at ranked-list answer formats, and conduct ablation studies for RFT. Our results show that while some SFT models generalize to certain answer formats, models trained with RFT are more robust across multiple formats. We also present a case study on a modified MedQA with multiple valid answers, finding that although MRMs might fail to select the benchmark's preferred ground truth, they can recognize valid answers. To the best of our knowledge, this is the first systematic investigation of approaches for enabling MRMs to generate answers as ranked lists.
We hope this work provides a first step toward developing alternative answer formats that are beneficial beyond single answers in medical domains.[35] Learning to Summarize by Learning to Quiz: Adversarial Agentic Collaboration for Long Document Summarization
Weixuan Wang,Minghao Wu,Barry Haddow,Alexandra Birch
Main category: cs.CL
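TL;DR: Proposes SummQ, a long-document summarization framework based on adversarial multi-agent collaboration, in which summarization and quizzing agents cooperate and exchange feedback to substantially improve summary quality.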
Details
Motivation: Existing long-document summarization methods suffer from information loss, factual inconsistencies, and coherence problems and handle very long texts poorly. Method: Designs an adversarial multi-agent framework with summary generators and reviewers, quiz generators and reviewers, and an examinee agent, using quizzes for continuous quality checks and iterative refinement. Result: On three widely used long-document summarization benchmarks, SummQ significantly outperforms state-of-the-art methods on ROUGE, BERTScore, LLM-as-a-Judge, and human evaluation. Conclusion: Adversarial multi-agent collaboration combined with quiz-based verification improves the quality and reliability of long-document summaries. Abstract: Long document summarization remains a significant challenge for current large language models (LLMs), as existing approaches commonly struggle with information loss, factual inconsistencies, and coherence issues when processing excessively long documents. We propose SummQ, a novel adversarial multi-agent framework that addresses these limitations through collaborative intelligence between specialized agents operating in two complementary domains: summarization and quizzing. Our approach employs summary generators and reviewers that work collaboratively to create and evaluate comprehensive summaries, while quiz generators and reviewers create comprehension questions that serve as continuous quality checks for the summarization process. This adversarial dynamic, enhanced by an examinee agent that validates whether the generated summary contains the information needed to answer the quiz questions, enables iterative refinement through multifaceted feedback mechanisms. We evaluate SummQ on three widely used long document summarization benchmarks. Experimental results demonstrate that our framework significantly outperforms existing state-of-the-art methods across ROUGE and BERTScore metrics, as well as in LLM-as-a-Judge and human evaluations. Our comprehensive analyses reveal the effectiveness of the multi-agent collaboration dynamics, the influence of different agent configurations, and the impact of the quizzing mechanism. This work establishes a new approach for long document summarization that uses adversarial agentic collaboration to improve summarization quality.[36] MemLens: Uncovering Memorization in LLMs with Activation Trajectories
Zirui He,Haiyan Zhao,Ali Payani,Mengnan Du
Main category: cs.CL
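TL;DR: This paper proposes MemLens, a new method that detects memorization in large language models by analyzing the probability trajectories of numeric tokens during generation. Unlike traditional methods that rely on surface-level lexical overlap or perplexity, MemLens finds that contaminated samples lock onto an answer in the model's early layers, a "shortcut" behavior, whereas clean samples accumulate evidence gradually across the model's full depth. Injecting designed samples through LoRA fine-tuning further confirms that the method captures genuine memorization signals rather than spurious correlations.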
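The contrast between early lock-in and gradual evidence accumulation can be made concrete with a toy statistic. The sketch below is not the MemLens detector: the function name `lock_in_layer`, the 0.9 threshold, and both trajectories are hypothetical, and the per-layer probabilities are invented numbers rather than values read from a model.

```python
from typing import Optional, Sequence


def lock_in_layer(probs: Sequence[float], threshold: float = 0.9) -> Optional[int]:
    """Return the earliest layer from which the answer-token probability
    stays at or above `threshold` through the final layer, or None if the
    trajectory never settles above it.

    `probs[k]` is assumed to be the probability assigned to the answer
    token when decoding from layer k (hypothetical input format).
    """
    lock = None
    for layer, p in enumerate(probs):
        if p >= threshold:
            if lock is None:
                lock = layer  # candidate start of the locked-in stretch
        else:
            lock = None  # probability dropped, so discard the candidate
    return lock


# Synthetic trajectories over 8 layers (illustrative values only).
contaminated = [0.05, 0.40, 0.93, 0.97, 0.98, 0.99, 0.99, 0.99]
clean = [0.01, 0.03, 0.08, 0.20, 0.45, 0.70, 0.88, 0.95]

print(lock_in_layer(contaminated))  # early lock-in
print(lock_in_layer(clean))         # late lock-in
```

A small lock-in index relative to model depth would correspond to the shortcut pattern described for contaminated samples. How such a statistic would be thresholded or combined across tokens is left open here.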
Details
Motivation: Benchmarks are susceptible to contamination and memorization, and existing memorization detection methods generalize poorly to implicitly contaminated data, so a more robust method that can identify deeper memorization behavior is needed. Method: Analyzes how the probabilities of numeric tokens change across layers during generation to identify differences between contaminated and clean samples, and injects controlled samples through LoRA fine-tuning for causal validation. Result: Contaminated samples lock onto an answer with high confidence in early layers (shortcut behavior), while clean samples proceed more gradually; the two kinds of trajectories are clearly separated; the injection experiments confirm that MemLens captures genuine memorization signals. Conclusion: MemLens effectively distinguishes memorization from normal reasoning and provides an interpretable, robust tool for detecting data contamination in LLMs. Abstract: Large language models (LLMs) are commonly evaluated on challenging benchmarks such as AIME and Math500, which are susceptible to contamination and risk of being memorized. Existing detection methods, which primarily rely on surface-level lexical overlap and perplexity, demonstrate low generalization and degrade significantly when encountering implicitly contaminated data. In this paper, we propose MemLens (An Activation Lens for Memorization Detection) to detect memorization by analyzing the probability trajectories of numeric tokens during generation. Our method reveals that contaminated samples exhibit "shortcut" behaviors, locking onto an answer with high confidence in the model's early layers, whereas clean samples show more gradual evidence accumulation across the model's full depth. We observe that contaminated and clean samples exhibit distinct and well-separated reasoning trajectories. To further validate this, we inject carefully designed samples into the model through LoRA fine-tuning and observe the same trajectory patterns as in naturally contaminated data. These results provide strong evidence that MemLens captures genuine signals of memorization rather than spurious correlations.
[37] Cross-Linguistic Analysis of Memory Load in Sentence Comprehension: Linear Distance and Structural Density
Krishna Aggarwal
Main category: cs.CL
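TL;DR: Using cross-linguistic dependency treebanks and mixed-effects models, this study examines whether memory load in sentence comprehension is driven more by linear distance or by structural density, and proposes Intervener Complexity (the number of intervening heads) as a structural measure that improves on linear distance.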
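The two arc-level measures contrasted in this entry can be computed directly from a dependency parse. The following is a minimal sketch, assuming CoNLL-U style head indices and assuming that a "head" is any token with at least one dependent; the function name `arc_measures` and the example parse are illustrative and not taken from the paper.

```python
from typing import List, Tuple


def arc_measures(heads: List[int]) -> List[Tuple[int, int, int, int]]:
    """For each dependency arc, return
    (dependent, head, dependency_length, intervener_complexity).

    `heads[i]` is the 1-based index of the head of token i + 1, with 0
    marking the root (the CoNLL-U convention). Dependency length is the
    linear distance between head and dependent. Intervener complexity is
    counted here as the number of tokens strictly between them that
    themselves govern at least one dependent (an assumed reading of
    "intervening heads").
    """
    governing = {h for h in heads if h != 0}  # tokens that act as heads
    results = []
    for dep, head in enumerate(heads, start=1):
        if head == 0:
            continue  # the root has no incoming arc
        lo, hi = sorted((dep, head))
        length = hi - lo
        interveners = sum(1 for t in range(lo + 1, hi) if t in governing)
        results.append((dep, head, length, interveners))
    return results


# "The dog that barked loudly slept" with an assumed UD-style parse:
# The->dog, dog->slept, that->barked, barked->dog, loudly->barked, slept=root
example_heads = [2, 6, 4, 2, 4, 0]
for arc in arc_measures(example_heads):
    print(arc)
```

In the example, the arc from "dog" to "slept" spans four positions but crosses only one intervening head ("barked"), which shows how the structural count can diverge from linear distance.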
Details
Motivation: To clarify the relative contributions of linear distance and structural complexity in syntactic dependencies to memory load, and to give locality theories a finer-grained structural account. Method: Uses harmonized Universal Dependencies (UD) treebanks, with sentence length, dependency length, and Intervener Complexity as predictors, and fits mixed-effects models of memory load (operationalized as the sum of feature interference and feature misbinding). Result: All three predictors are positively associated with memory load; sentence length has the broadest influence, while Intervener Complexity retains significant explanatory power beyond linear distance. Conclusion: The number of intervening heads is a closer indicator of cognitive processing demands than linear distance; the study integrates linear and hierarchical perspectives and offers a graph-based, cross-linguistic modelling path for evaluating theories of memory load. Abstract: This study examines whether sentence-level memory load in comprehension is better explained by linear proximity between syntactically related words or by the structural density of the intervening material. Building on locality-based accounts and cross-linguistic evidence for dependency length minimization, the work advances Intervener Complexity (the number of intervening heads between a head and its dependent) as a structurally grounded lens that refines linear distance measures. Using harmonized dependency treebanks and a mixed-effects framework across multiple languages, the analysis jointly evaluates sentence length, dependency length, and Intervener Complexity as predictors of the memory-load measure. Studies in psycholinguistics have reported the contributions of feature interference and misbinding to memory load during processing. For this study, I operationalized sentence-level memory load as the linear sum of feature misbinding and feature interference for tractability; current evidence does not establish that their cognitive contributions combine additively. All three factors are positively associated with memory load, with sentence length exerting the broadest influence and Intervener Complexity offering explanatory power beyond linear distance. Conceptually, the findings reconcile linear and hierarchical perspectives on locality by treating dependency length as an important surface signature while identifying intervening heads as a more proximate indicator of integration and maintenance demands. Methodologically, the study illustrates how UD-based graph measures and cross-linguistic mixed-effects modelling can disentangle linear and structural contributions to processing efficiency, providing a principled path for evaluating competing theories of memory load in sentence comprehension.
[38] Tool Calling for Arabic LLMs: Data Strategies and Instruction Tuning
Asim Ersoy,Enes Altinisik,Husrev Taha Sencar,Kareem Darwish
Main category: cs.CL
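TL;DR: This paper studies how to enable tool calling for Arabic, examining three key questions: whether Arabic tool-calling data is necessary, how general-purpose instruction tuning affects performance, and what fine-tuning on specific high-priority tools adds.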
Details
Motivation: Tool-calling research and resources are predominantly English-centric and offer little support for other languages such as Arabic; this paper aims to fill that gap. Method: Translates and adapts two open-source tool-calling datasets into Arabic and runs extensive experiments on base and post-trained variants of an open-weight Arabic LLM. Result: The experiments indicate the best strategies for building robust tool-augmented agents in Arabic, including the effect of cross-lingual transfer and the benefit of tool-specific fine-tuning. Conclusion: Combining in-language data, general-purpose instruction tuning, and fine-tuning on key tools is the most effective route to efficient Arabic tool calling. Abstract: Tool calling is a critical capability that allows Large Language Models (LLMs) to interact with external systems, significantly expanding their utility. However, research and resources for tool calling are predominantly English-centric, leaving a gap in our understanding of how to enable this functionality for other languages, such as Arabic. This paper investigates three key research questions: (1) the necessity of in-language (Arabic) tool-calling data versus relying on cross-lingual transfer, (2) the effect of general-purpose instruction tuning on tool-calling performance, and (3) the value of fine-tuning on specific, high-priority tools. To address these questions, we conduct extensive experiments using base and post-trained variants of an open-weight Arabic LLM. To enable this study, we bridge the resource gap by translating and adapting two open-source tool-calling datasets into Arabic. Our findings provide crucial insights into the optimal strategies for developing robust tool-augmented agents for Arabic.
[39] Analysis of instruction-based LLMs' capabilities to score and judge text-input problems in an academic setting
Valeria Ramirez-Garcia,David de-Fitero-Dominguez,Antonio Garcia-Cabot,Eva Garcia-Lopez
Main category: cs.CL
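TL;DR: This study proposes and tests five LLM-based systems for automatically scoring academic Text-Input Problems. Reference Aided Evaluation agrees best with human scoring, with low deviation and high-quality evaluations, showing the potential of AI as a complementary tool in educational assessment.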
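The two agreement statistics reported for this entry can be sketched as follows. This is a minimal illustration that assumes "median absolute deviation" means the median of the absolute differences between model and human scores, and that root mean square deviation is taken over the same paired differences; the function names and the score lists are made up for the example and are not the paper's data.

```python
import math
import statistics
from typing import Sequence


def _check(model: Sequence[float], human: Sequence[float]) -> None:
    if len(model) != len(human) or not model:
        raise ValueError("score lists must be non-empty and of equal length")


def median_absolute_deviation(model: Sequence[float], human: Sequence[float]) -> float:
    """Median of |model score - human score| over paired answers."""
    _check(model, human)
    return statistics.median(abs(m - h) for m, h in zip(model, human))


def root_mean_square_deviation(model: Sequence[float], human: Sequence[float]) -> float:
    """Square root of the mean squared difference over paired answers."""
    _check(model, human)
    return math.sqrt(sum((m - h) ** 2 for m, h in zip(model, human)) / len(model))


# Hypothetical scores on a 0-10 scale for four answers.
llm_scores = [7.0, 5.0, 9.0, 4.0]
human_scores = [8.0, 5.0, 7.0, 8.0]

print(median_absolute_deviation(llm_scores, human_scores))
print(root_mean_square_deviation(llm_scores, human_scores))
```

The median is insensitive to a single badly scored answer, whereas the root mean square deviation penalizes large misses, so reporting both gives a fuller picture of agreement with the human grader.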
Details
Motivation: To explore how effectively large language models can automatically score open-ended text answers in higher education, and to design automated evaluation methods that track human scoring closely. Method: Proposes five LLM-based evaluation systems (JudgeLM Evaluation, Reference Aided Evaluation, No Reference Evaluation, Additive Evaluation, and Adaptive Evaluation) and tests them on a custom dataset of 110 computer science answers from students with three models (JudgeLM, Llama-3.1-8B, and DeepSeek-R1-Distill-Llama-8B), comparing all results with human scoring. Result: Reference Aided Evaluation performs best, with the lowest median absolute deviation (0.945) and the lowest root mean square deviation (1.214), fair scoring, and detailed evaluations; Additive and Adaptive Evaluation perform poorly on concise answers, No Reference Evaluation lacks information, and the native JudgeLM evaluation performs poorly because of the model's limitations. Conclusion: With proper methodology, LLM-based automatic scoring systems have the potential to serve as effective complementary tools in educational assessment, with Reference Aided Evaluation the most promising. Abstract: Large language models (LLMs) can act as evaluators, a role studied by methods like LLM-as-a-Judge and fine-tuned judging LLMs. In the field of education, LLMs have been studied as assistant tools for students and teachers. Our research investigates LLM-driven automatic evaluation systems for academic Text-Input Problems using rubrics. We propose five evaluation systems that have been tested on a custom dataset of 110 answers about computer science from higher education students with three models: JudgeLM, Llama-3.1-8B and DeepSeek-R1-Distill-Llama-8B. The evaluation systems include: JudgeLM Evaluation, which uses the model's single answer prompt to obtain a score; Reference Aided Evaluation, which uses a correct answer as a guide aside from the original context of the question; No Reference Evaluation, which omits the reference answer; Additive Evaluation, which uses atomic criteria; and Adaptive Evaluation, which is an evaluation done with generated criteria fitted to each question. All evaluation methods have been compared with the results of a human evaluator. Results show that the best method to automatically evaluate and score Text-Input Problems using LLMs is Reference Aided Evaluation. With the lowest median absolute deviation (0.945) and the lowest root mean square deviation (1.214) when compared to human evaluation, Reference Aided Evaluation offers fair scoring as well as insightful and complete evaluations. Other methods such as Additive and Adaptive Evaluation fail to provide good results in concise answers, No Reference Evaluation lacks information needed to correctly assess questions, and JudgeLM Evaluation has not provided good results due to the model's limitations. As a result, we conclude that Artificial Intelligence-driven automatic evaluation systems, aided with proper methodologies, show potential to work as complementary tools to other academic resources.
[40] Generative AI for FFRDCs
Arun S. Maiya
Main category: cs.CL
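TL;DR: This paper shows how large language models, given only a few input-output examples, can accelerate text-heavy work at federally funded research and development centers, and applies the OnPrem.LLM framework to use generative AI securely and flexibly in sensitive government settings. Case studies show that the approach enhances oversight and strategic analysis while maintaining auditability and data sovereignty.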
Details
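Motivation: FFRDCs face heavy text workloads that are slow and inefficient to analyze manually, and need automation that improves throughput while meeting the security and compliance requirements of government environments. Method: Uses the open-source OnPrem.LLM framework and few-shot use of large language models to summarize, classify, extract information from, and make sense of policy documents and scientific literature. Result: In case studies including the National Defense Authorization Act (NDAA) and National Science Foundation (NSF) Awards, the approach improved the speed and quality of text analysis and supported oversight and strategic decision-making while preserving data security and audit trails. Conclusion: An LLM approach built on OnPrem.LLM can significantly improve the efficiency and insight of government-related text processing while preserving data sovereignty and auditability, and is suited to knowledge management and decision support in sensitive domains. Abstract: Federally funded research and development centers (FFRDCs) face text-heavy workloads, from policy documents to scientific and engineering papers, that are slow to analyze manually. We show how large language models can accelerate summarization, classification, extraction, and sense-making with only a few input-output examples. To enable use in sensitive government contexts, we apply OnPrem.LLM, an open-source framework for secure and flexible application of generative AI. Case studies on defense policy documents and scientific corpora, including the National Defense Authorization Act (NDAA) and National Science Foundation (NSF) Awards, demonstrate how this approach enhances oversight and strategic analysis while maintaining auditability and data sovereignty.
[41] Behind RoPE: How Does Causal Mask Encode Positional Information?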
Junu Kim,Xiao Liu,Zhenghao Lin,Lei Ji,Yeyun Gong,Edward Choi
Main category: cs.CL
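TL;DR: This paper studies how the causal mask in Transformer decoders makes attention scores position-dependent. Even without parameters or causal dependency in the input, the causal mask induces a locality-favoring pattern similar to positional encodings, and its interaction with RoPE distorts RoPE's relative patterns.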
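A toy calculation shows why a causal mask alone carries positional information. The sketch below is only an illustration of the starting observation, not the paper's analysis: all attention logits are set to zero, so any difference between rows comes purely from the mask. The function name `causal_softmax` is hypothetical.

```python
import math
from typing import List


def causal_softmax(scores: List[List[float]]) -> List[List[float]]:
    """Row-wise softmax in which query i may attend only to keys 0..i.

    Masked (future) positions receive weight 0.0.
    """
    out = []
    for i, row in enumerate(scores):
        visible = row[: i + 1]  # keys at or before the query position
        m = max(visible)  # subtract the max for numerical stability
        exps = [math.exp(s - m) for s in visible]
        z = sum(exps)
        out.append([e / z for e in exps] + [0.0] * (len(row) - i - 1))
    return out


n = 4
uniform_scores = [[0.0] * n for _ in range(n)]  # no content or position signal
weights = causal_softmax(uniform_scores)

for i, row in enumerate(weights):
    print(i, [round(w, 3) for w in row])
```

With identical logits, query `i` spreads its attention evenly over `i + 1` visible keys, so the weight on any given key shrinks as the query moves later in the sequence. The attention output therefore differs by position even though no positional encoding or learned parameter is involved. The preference for nearby query-key pairs described in the entry emerges from further analysis that this sketch does not reproduce.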
Details
Motivation: To investigate whether the causal mask supplies positional information in Transformers and how it interacts with explicit positional encodings such as RoPE. Method: Proves theoretically that the causal mask can induce position-dependent attention patterns, and verifies empirically that trained models show the phenomenon and its interaction with RoPE. Result: Theory and experiments show that the causal mask alone produces attention patterns favoring nearby query-key pairs, and that combining it with RoPE breaks RoPE's relative property. Conclusion: The causal mask is an important source of positional information and should be considered alongside explicit positional encodings. Abstract: While explicit positional encodings such as RoPE are a primary source of positional information in Transformer decoders, the causal mask also provides positional information. In this work, we prove that the causal mask can induce position-dependent patterns in attention scores, even without parameters or causal dependency in the input. Our theoretical analysis indicates that the induced attention pattern tends to favor nearby query-key pairs, mirroring the behavior of common positional encodings. Empirical analysis confirms that trained models exhibit the same behavior, with learned parameters further amplifying these patterns. Notably, we found that the interaction of causal mask and RoPE distorts RoPE's relative attention score patterns into non-relative ones. We consistently observed this effect in modern large language models, suggesting the importance of considering the causal mask as a source of positional information alongside explicit positional encodings.
[42] When Instructions Multiply: Measuring and Estimating LLM Capabilities of Multiple Instructions Following
Keno Harada,Yudai Yamazaki,Masachika Taniguchi,Edison Marrese-Taylor,Takeshi Kojima,Yusuke Iwasawa,Yutaka Matsuo
Main category: cs.CL
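TL;DR: This paper introduces two specialized benchmarks, ManyIFEval and StyleMBPP, for evaluating how well large language models follow multiple instructions, and finds that performance degrades steadily as the number of instructions increases. It also proposes three regression models that can estimate performance on unseen instruction combinations from a small number of samples.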
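The idea of predicting success from instruction count alone can be sketched with a one-variable logistic regression. This is a minimal illustration under stated assumptions: the success counts are synthetic, the fit uses plain gradient ascent written from scratch, and the names `fit_logistic` and `predict` are hypothetical; it is not the authors' model or data.

```python
import math
from typing import List, Tuple

Model = Tuple[float, float, float]  # (intercept, slope, centering constant)


def _sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(counts: List[int], successes: List[int], trials: List[int],
                 lr: float = 0.1, steps: int = 5000) -> Model:
    """Fit P(success) = sigmoid(a + b * (n - center)) by gradient ascent on
    the binomial log-likelihood, where n is the number of instructions."""
    center = sum(counts) / len(counts)  # centering keeps the fit well conditioned
    a = b = 0.0
    total = sum(trials)
    for _ in range(steps):
        grad_a = grad_b = 0.0
        for n, s, t in zip(counts, successes, trials):
            x = n - center
            residual = s - t * _sigmoid(a + b * x)  # observed minus expected successes
            grad_a += residual
            grad_b += residual * x
        a += lr * grad_a / total
        b += lr * grad_b / total
    return a, b, center


def predict(model: Model, n: int) -> float:
    """Estimated probability of following all n instructions."""
    a, b, center = model
    return _sigmoid(a + b * (n - center))


# Synthetic data: 20 prompts per instruction count, fewer successes as n grows.
counts = list(range(1, 11))
successes = [19, 18, 17, 15, 13, 11, 9, 7, 5, 4]
trials = [20] * 10

model = fit_logistic(counts, successes, trials)
print([round(predict(model, n), 2) for n in (1, 5, 10, 12)])
```

Because the only feature is the instruction count, the fitted curve can be evaluated at counts or combinations that were never sampled, which is the property the entry relies on for cheap performance estimation.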
Details
Motivation: As large language models are widely deployed in real-world settings, understanding their ability to follow multiple instructions at once becomes crucial. Existing evaluations struggle to cover diverse instruction combinations, so systematic benchmarks are needed to measure this ability. Method: Designs two new benchmarks, ManyIFEval (up to ten text-generation instructions) and StyleMBPP (up to six coding instructions), runs experiments on ten LLMs, and builds three regression models (with features such as instruction count) to predict performance under different instruction combinations. Result: Performance degrades consistently as the number of instructions grows; a logistic regression model using instruction count as an explanatory variable predicts performance on unseen instruction combinations with about 10% error; 500 samples (ManyIFEval) and 300 samples (StyleMBPP) suffice for effective estimation. Conclusion: Following multiple instructions is a weak point of current LLMs, and performance falls as the instruction count rises; the proposed regression models predict performance across instruction combinations efficiently from small samples, offering a practical route to performance evaluation in real applications. Abstract: As large language models (LLMs) are increasingly applied to real-world scenarios, it becomes crucial to understand their ability to follow multiple instructions simultaneously. To systematically evaluate these capabilities, we introduce two specialized benchmarks for fundamental domains where following multiple instructions is important: Many Instruction-Following Eval (ManyIFEval) for text generation with up to ten instructions, and Style-aware Mostly Basic Programming Problems (StyleMBPP) for code generation with up to six instructions. Our experiments with the created benchmarks across ten LLMs reveal that performance consistently degrades as the number of instructions increases. Furthermore, because evaluating all possible combinations of multiple instructions is computationally impractical in actual use cases, we developed three types of regression models that can estimate performance on both unseen instruction combinations and different numbers of instructions which are not used during training. We demonstrate that a logistic regression model using instruction count as an explanatory variable can predict performance of following multiple instructions with approximately 10% error, even for unseen instruction combinations. We show that relatively modest sample sizes (500 for ManyIFEval and 300 for StyleMBPP) are sufficient for performance estimation, enabling efficient evaluation of LLMs under various instruction combinations.
[43] SoM-1K: A Thousand-Problem Benchmark Dataset for Strength of Materials
Qixin Wan,Zilong Wang,Jingwen Zhou,Wanting Wang,Ziheng Geng,Jiachen Liu,Ran Cao,Minghui Cheng,Lu Cheng
Main category: cs.CL
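TL;DR: This paper presents SoM-1K, the first multimodal benchmark dataset for strength of materials, built to evaluate foundation models on complex engineering problems, and proposes the Descriptions of Images (DoI) prompting strategy to improve model understanding. Experiments show that current models perform poorly on these tasks, with a best accuracy of only 56.6%, and that LLMs given DoI often outperform vision language models given the images, which exposes the weakness of current models in multimodal engineering reasoning.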
Details
Motivation: Foundation models perform well in many domains, but their performance on complex multimodal engineering problems remains unclear, and tasks such as strength of materials, which require joint reasoning over text and diagrams, lack systematic evaluation. A dedicated benchmark and effective methods are needed to measure and improve their engineering capabilities. Method: Builds SoM-1K, a multimodal dataset of 1,065 annotated problems combining textual statements and schematic diagrams from realistic engineering settings; proposes the Descriptions of Images (DoI) prompting strategy, which replaces or supplements image input with expert-written text descriptions; evaluates eight representative foundation models, including LLMs and VLMs. Result: Current foundation models perform poorly overall on SoM-1K, with a top accuracy of 56.6%; LLMs given DoI usually outperform VLMs given the images directly; error analysis shows that DoI reduces visual misinterpretation errors, indicating that high-quality text descriptions are more effective than raw images for current models. Conclusion: SoM-1K provides a rigorous benchmark for engineering AI, exposes the limits of existing foundation models in scientific and engineering multimodal reasoning, and points to the need for stronger multimodal understanding and reasoning, especially in domains that depend on precise visual information. Abstract: Foundation models have shown remarkable capabilities in various domains, but their performance on complex, multimodal engineering problems remains largely unexplored. We introduce SoM-1K, the first large-scale multimodal benchmark dataset dedicated to evaluating foundation models on problems in the strength of materials (SoM). The dataset, which contains 1,065 annotated SoM problems, mirrors real-world engineering tasks by including both textual problem statements and schematic diagrams. Due to the limited capabilities of current foundation models in understanding complicated visual information, we propose a novel prompting strategy called Descriptions of Images (DoI), which provides rigorous expert-generated text descriptions of the visual diagrams as the context. We evaluate eight representative foundation models, including both large language models (LLMs) and vision language models (VLMs). Our results show that current foundation models struggle significantly with these engineering problems, with the best-performing model achieving only 56.6% accuracy. Interestingly, we found that LLMs, when provided with DoI, often outperform VLMs provided with visual diagrams. A detailed error analysis reveals that DoI plays a crucial role in mitigating visual misinterpretation errors, suggesting that accurate text-based descriptions can be more effective than direct image input for current foundation models. This work establishes a rigorous benchmark for engineering AI and highlights a critical need for developing more robust multimodal reasoning capabilities in foundation models, particularly in scientific and engineering contexts.
[44] Which Cultural Lens Do Models Adopt? On Cultural Positioning Bias and Agentic Mitigation in LLMs
Yixin Wan,Xingrun Chen,Kai-Wei Chang
Main category: cs.CL
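TL;DR: This paper identifies and systematically studies culture positioning bias in large language models: models tend to generate from the perspective of mainstream US culture and treat non-mainstream cultures as outsiders. The authors build the CultureLens benchmark and propose a prompt-based Fairness Intervention Pillars (FIP) method and two inference-time mitigation pipelines, MFA-SA and MFA-MA; experiments show that the agent-based methods effectively mitigate the bias.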
Details
Motivation: Large language models tend to marginalize non-mainstream cultures in generated content and fail to represent diverse cultural perspectives fairly, so this culture positioning bias needs to be identified and quantified and effective mitigation strategies explored. Method: Proposes the CultureLens benchmark, with 4000 generation prompts and 3 evaluation metrics, which measures bias through a culturally situated interview-script generation task; designs two inference-time mitigation methods: the prompt-based FIP method and the agent-based MFA framework (MFA-SA, a self-reflection and rewriting loop, and MFA-MA, a hierarchy of specialized agents). Result: Across 5 state-of-the-art LLMs, models adopt an insider perspective in over 88% of US-context scripts but lean markedly toward outsider stances for non-mainstream cultures; the proposed MFA methods, MFA-MA in particular, reduce cultural bias significantly more than the baseline. Conclusion: Large language models exhibit pronounced culture positioning bias, and structured multi-agent fairness interventions such as MFA are an effective and promising mitigation route. Abstract: Large language models (LLMs) have unlocked a wide range of downstream generative applications. However, we found that they also risk perpetuating subtle fairness issues tied to culture, positioning their generations from the perspectives of the mainstream US culture while demonstrating salient externality towards non-mainstream ones. In this work, we identify and systematically investigate this novel culture positioning bias, in which an LLM's default generative stance aligns with a mainstream view and treats other cultures as outsiders. We propose the CultureLens benchmark with 4000 generation prompts and 3 evaluation metrics for quantifying this bias through the lens of a culturally situated interview script generation task, in which an LLM is positioned as an onsite reporter interviewing local people across 10 diverse cultures. Empirical evaluation on 5 state-of-the-art LLMs reveals a stark pattern: while models adopt insider tones in over 88 percent of US-contexted scripts on average, they disproportionately adopt mainly outsider stances for less dominant cultures. To resolve these biases, we propose 2 inference-time mitigation methods: a baseline prompt-based Fairness Intervention Pillars (FIP) method, and a structured Mitigation via Fairness Agents (MFA) framework consisting of 2 pipelines: (1) MFA-SA (Single-Agent) introduces a self-reflection and rewriting loop based on fairness guidelines. (2) MFA-MA (Multi-Agent) structures the process into a hierarchy of specialized agents: a Planner Agent (initial script generation), a Critique Agent (evaluates initial script against fairness pillars), and a Refinement Agent (incorporates feedback to produce a polished, unbiased script). Empirical results showcase the effectiveness of agent-based methods as a promising direction for mitigating biases in generative LLMs.
[45] PerHalluEval: Persian Hallucination Evaluation Benchmark for Large Language Models
Mohammad Hosseini,Kimia Hosseini,Shayan Bali,Zahra Zanjani,Saeedeh Momtazi
Main category: cs.CL
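TL;DR: PerHalluEval is the first dynamic hallucination evaluation benchmark for Persian. Built with a three-stage LLM-driven pipeline plus human validation, it is used to evaluate 12 large language models on hallucination in Persian text; providing external knowledge partially mitigates the problem, and models trained specifically for Persian show no clear advantage.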
Details
Motivation: Hallucination is widespread when large language models handle low-resource languages such as Persian, yet dedicated evaluation benchmarks are lacking, especially ones that reflect Persian culture and language. Method: Builds a three-stage LLM-driven generation pipeline, combined with human validation, to produce plausible question-answering and summarization samples, and uses the log probabilities of generated tokens to select the most believable hallucinated instances; human annotators also mark contexts specific to Persian culture. Result: Existing large language models generally struggle to detect hallucination in Persian; external knowledge such as the original document partially mitigates hallucination; models trained specifically for Persian do not differ significantly from other models. Conclusion: PerHalluEval provides an effective benchmark for Persian hallucination evaluation, reveals the limits of current LLMs in detecting hallucination in low-resource languages, and shows that external knowledge helps; more targeted work on hallucination control for low-resource languages is needed. Abstract: Hallucination is a persistent issue affecting all large language models (LLMs), particularly within low-resource languages such as Persian. PerHalluEval (Persian Hallucination Evaluation) is the first dynamic hallucination evaluation benchmark tailored for the Persian language. Our benchmark leverages a three-stage LLM-driven pipeline, augmented with human validation, to generate plausible answers and summaries regarding QA and summarization tasks, focusing on detecting extrinsic and intrinsic hallucinations. Moreover, we used the log probabilities of generated tokens to select the most believable hallucinated instances. In addition, we engaged human annotators to highlight Persian-specific contexts in the QA dataset in order to evaluate LLMs' performance on content specifically related to Persian culture. Our evaluation of 12 LLMs, including open- and closed-source models using PerHalluEval, revealed that the models generally struggle in detecting hallucinated Persian text. We showed that providing external knowledge, i.e., the original document for the summarization task, could mitigate hallucination partially. Furthermore, there was no significant difference in terms of hallucination when comparing LLMs specifically trained for Persian with others.
[46] BESPOKE: Benchmark for Search-Augmented Large Language Model Personalization via Diagnostic Feedback
Hyunseo Kim,Sangam Lee,Kwangwook Seo,Dongha Lee
Main category: cs.CL
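TL;DR: This paper proposes BESPOKE, a realistic and diagnostic benchmark for evaluating personalization in search-augmented large language models. The benchmark is built from authentic human chat and search histories paired with fine-grained preference scores and feedback, and is used to analyze systematically the key requirements for effective personalization in information-seeking tasks.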
Details
Motivation: Existing search-augmented large language models still fall short of meeting diverse user needs, and the effect of personalization has not been evaluated systematically. A realistic, diagnostic benchmark is needed to measure personalization performance. Method: Proposes the BESPOKE benchmark, built through long-term, deeply engaged human annotation: annotators contribute their own chat and search histories, write queries with detailed information needs, and score model responses with diagnostic feedback. Result: BESPOKE enables fine-grained evaluation of personalized search-augmented LLMs, reveals the key factors for effective personalization, and supports systematic analysis of existing systems such as ChatGPT and Gemini. Conclusion: BESPOKE offers a realistic, diagnostic benchmark for evaluating personalized search-augmented LLMs and advances fine-grained study of diverse user intents and preferred forms of information delivery. Abstract: Search-augmented large language models (LLMs) have advanced information-seeking tasks by integrating retrieval into generation, reducing users' cognitive burden compared to traditional search systems. Yet they remain insufficient for fully addressing diverse user needs, which requires recognizing how the same query can reflect different intents across users and delivering information in preferred forms. While recent systems such as ChatGPT and Gemini attempt personalization by leveraging user histories, systematic evaluation of such personalization is under-explored. To address this gap, we propose BESPOKE, a realistic benchmark for evaluating personalization in search-augmented LLMs. BESPOKE is designed to be both realistic, by collecting authentic chat and search histories directly from humans, and diagnostic, by pairing responses with fine-grained preference scores and feedback. The benchmark is constructed through long-term, deeply engaged human annotation, where human annotators contributed their own histories, authored queries with detailed information needs, and evaluated responses with scores and diagnostic feedback. Leveraging BESPOKE, we conduct systematic analyses that reveal key requirements for effective personalization in information-seeking tasks, providing a foundation for fine-grained evaluation of personalized search-augmented LLMs. Our code and data are available at https://augustinlib.github.io/BESPOKE/.
[47] VoiceBBQ: Investigating Effect of Content and Acoustics in Social Bias of Spoken Language Model
Junhyuk Choi,Ro-hoon Oh,Jihwan Seol,Bugeun Kim
Main category: cs.CL
TL;DR: VoiceBBQ is a spoken extension of the BBQ dataset for evaluating social bias in spoken language models along both content and acoustic dimensions.
Details
Motivation: Because of the nature of speech, social bias in spoken language models can arise from both content and acoustics, so a dataset that measures both kinds of bias is needed. Method: Converts every BBQ context into controlled voice conditions to build the VoiceBBQ dataset, and uses it to evaluate two spoken language models, LLaMA-Omni and Qwen2-Audio. Result: LLaMA-Omni resists acoustic bias but amplifies gender and accent bias, whereas Qwen2-Audio substantially dampens these cues while preserving content fidelity. Conclusion: VoiceBBQ provides a compact, drop-in testbed for jointly diagnosing content and acoustic bias in spoken language models. Abstract: We introduce VoiceBBQ, a spoken extension of BBQ (Bias Benchmark for Question Answering), a dataset that measures social bias by presenting ambiguous or disambiguated contexts followed by questions that may elicit stereotypical responses. Due to the nature of speech, social bias in Spoken Language Models (SLMs) can emerge from two distinct sources: 1) content aspect and 2) acoustic aspect. The dataset converts every BBQ context into controlled voice conditions, enabling per-axis accuracy, bias, and consistency scores that remain comparable to the original text benchmark. Using VoiceBBQ, we evaluate two SLMs, LLaMA-Omni and Qwen2-Audio, and observe architectural contrasts: LLaMA-Omni resists acoustic bias while amplifying gender and accent bias, whereas Qwen2-Audio substantially dampens these cues while preserving content fidelity. VoiceBBQ thus provides a compact, drop-in testbed for jointly diagnosing content and acoustic bias across spoken language models.
[48] Acoustic-based Gender Differentiation in Speech-aware Language Models
Junhyuk Choi,Jihwan Seol,Nayeon Kim,Chanhee Cho,EunBin Cho,Bugeun Kim
Main category: cs.CL
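TL;DR: This paper studies gender bias in speech-aware language models (SpeechLMs). Although overall responses appear not to differ by gender, a paradoxical pattern emerges: models lean male on gender-stereotypical questions yet ignore gender in contexts where differentiating would be appropriate. The bias stems mainly from male-oriented acoustic tokens produced by the Whisper speech encoder, indicating that current techniques fail to use gender information properly.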
Details
Motivation: To investigate whether speech-aware language models carry implicit gender bias when processing speech from speakers of different genders, in particular whether identical questions receive different responses depending on the speaker's gender. Method: Builds a new dataset of 9,208 speech samples covering Gender-Independent, Gender-Stereotypical, and Gender-Dependent questions, evaluates the LLaMA-Omni series, compares the models with their backbone LLMs, and analyzes the influence of the Whisper speech encoder. Result: Models consistently lean male on gender-stereotypical questions, yet ignore gender where differentiation would be appropriate; the pattern is not explained by neutral options or the perceived gender of the voice, and it persists after gender neutralization methods are applied; the bias stems mainly from male-oriented acoustic tokens generated by the Whisper encoder. Conclusion: Current SpeechLMs pursue overall fairness but fail to handle gender information in context, which produces a paradoxical bias; more refined techniques are needed to use gender information appropriately. Abstract: Speech-aware Language Models (SpeechLMs) have fundamentally transformed human-AI interaction by enabling voice-based communication, yet they may exhibit acoustic-based gender differentiation where identical questions lead to different responses based on the speaker's gender. This paper proposes a new dataset that enables systematic analysis of this phenomenon, containing 9,208 speech samples across three categories: Gender-Independent, Gender-Stereotypical, and Gender-Dependent. We further evaluated the LLaMA-Omni series and discovered a paradoxical pattern: while overall responses seem identical regardless of gender, the pattern is far from unbiased. Specifically, in Gender-Stereotypical questions, all models consistently exhibited male-oriented responses; meanwhile, in Gender-Dependent questions where gender differentiation would be contextually appropriate, models exhibited responses independent of gender instead. We also confirm that this pattern does not result from neutral options or the perceived gender of a voice. When we allow neutral responses, models tend to respond neutrally in Gender-Dependent questions as well. The paradoxical pattern persists even when we apply gender neutralization methods to speech. Through comparison between SpeechLMs and their corresponding backbone LLMs, we confirmed that these paradoxical patterns primarily stem from Whisper speech encoders, which generate male-oriented acoustic tokens. These findings reveal that current SpeechLMs may not successfully remove gender biases though they prioritized general fairness principles over contextual appropriateness, highlighting the need for more sophisticated techniques to utilize gender information properly in speech technology.
[49] AutoIntent: AutoML for Text Classification
Ilya Alekseev,Roman Solomatin,Darina Rustamova,Denis Kuznetsov
Main category: cs.CL
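TL;DR: AutoIntent is an automated machine learning tool for text classification that provides end-to-end automation, including embedding model selection, classifier optimization, and decision threshold tuning. It supports multi-label classification and out-of-scope detection, and outperforms existing AutoML tools on standard intent classification datasets.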
Details
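Motivation: Existing AutoML tools lack end-to-end automation for text classification, integrate embedding model selection, classifier optimization, and decision threshold tuning poorly, and offer limited support for multi-label classification and out-of-scope detection. Method: AutoIntent delivers end-to-end automation through a modular, sklearn-like interface that combines embedding model selection, classifier optimization, and decision threshold tuning, and is designed to support multi-label classification and out-of-scope detection. Result: Experiments on standard intent classification datasets show that AutoIntent outperforms existing AutoML tools while letting users trade off effectiveness against resource consumption. Conclusion: AutoIntent offers an efficient, flexible, and easy-to-use automated solution for text classification that improves on existing tools in both performance and functionality. Abstract: AutoIntent is an automated machine learning tool for text classification tasks. Unlike existing solutions, AutoIntent offers end-to-end automation with embedding model selection, classifier optimization, and decision threshold tuning, all within a modular, sklearn-like interface. The framework is designed to support multi-label classification and out-of-scope detection. AutoIntent demonstrates superior performance compared to existing AutoML tools on standard intent classification datasets and enables users to balance effectiveness and resource consumption.
[50] Retrieval over Classification: Integrating Relation Semantics for Multimodal Relation Extraction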
Lei Hei,Tingjing Liao,Yingxin Pei,Yiyang Qi,Jiaqi Wang,Ruiting Li,Feiliang Ren
Main category: cs.CL
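TL;DR: Proposes ROC, a new framework that recasts multimodal relation extraction from a classification task into a semantics-driven retrieval task. It integrates entity type and positional information, expands relation labels into natural language descriptions, and aligns entity-relation pairs with contrastive learning, reaching state-of-the-art performance on the MNRE and MORE datasets.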
Details
Motivation: Traditional multimodal relation extraction relies on a classification paradigm that overlooks structural constraints and has limited semantic expressiveness, which makes fine-grained relation understanding difficult. Method: Proposes the ROC framework: 1) a multimodal encoder integrates entity type and positional information; 2) a large language model expands relation labels into natural language descriptions; 3) contrastive learning based on semantic similarity aligns entity-relation pairs. Result: Achieves state-of-the-art performance on the MNRE and MORE benchmark datasets, with stronger robustness and interpretability. Conclusion: By moving multimodal relation extraction from classification to retrieval, ROC improves semantic expressiveness and structural modelling and offers a new approach to fine-grained relation understanding. Abstract: Relation extraction (RE) aims to identify semantic relations between entities in unstructured text. Although recent work extends traditional RE to multimodal scenarios, most approaches still adopt classification-based paradigms with fused multimodal features, representing relations as discrete labels. This paradigm has two significant limitations: (1) it overlooks structural constraints like entity types and positional cues, and (2) it lacks semantic expressiveness for fine-grained relation understanding. We propose Retrieval Over Classification (ROC), a novel framework that reformulates multimodal RE as a retrieval task driven by relation semantics. ROC integrates entity type and positional information through a multimodal encoder, expands relation labels into natural language descriptions using a large language model, and aligns entity-relation pairs via semantic similarity-based contrastive learning. Experiments show that our method achieves state-of-the-art performance on the benchmark datasets MNRE and MORE and exhibits stronger robustness and interpretability.
[51] Learning the Wrong Lessons: Syntactic-Domain Spurious Correlations in Language Models
Chantal Shaib,Vinith M. Suriyakumar,Levent Sagun,Byron C. Wallace,Marzyeh Ghassemi
Main category: cs.CL
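TL;DR: This paper studies the relationship among syntactic templates, domain, and semantics in large language models. Spurious correlations between syntax and domain in training data can lower model performance and affect safety fine-tuning, letting prompts bypass refusals. The authors recommend testing for such correlations and increasing syntactic diversity in training data.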
Details
Motivation: To understand how the interaction of syntax, domain, and semantics affects LLMs when they process task instructions, in particular the spurious correlations that syntactic templates can introduce. Method: Builds a synthetic training dataset to analyze how syntactic-domain correlation affects model performance, proposes an evaluation framework that detects the phenomenon in several models (including OLMo, Llama, and GPT-4o), and presents a case study on safety fine-tuning. Result: Syntactic-domain correlation lowers performance on entity knowledge tasks (mean 0.51 ± 0.06), and the phenomenon is confirmed in several open and closed models; the correlation can also be exploited to bypass safety refusals. Conclusion: Syntactic-domain correlations should be tested for explicitly, and training data should contain syntactic diversity within each domain so that models do not learn harmful spurious associations. Abstract: For an LLM to correctly respond to an instruction it must understand both the semantics and the domain (i.e., subject area) of a given task-instruction pair. However, syntax can also convey implicit information. Recent work shows that syntactic templates (frequent sequences of Part-of-Speech (PoS) tags) are prevalent in training data and often appear in model outputs. In this work we characterize syntactic templates, domain, and semantics in task-instruction pairs. We identify cases of spurious correlations between syntax and domain, where models learn to associate a domain with syntax during training; this can sometimes override prompt semantics. Using a synthetic training dataset, we find that the syntactic-domain correlation can lower performance (mean 0.51 ± 0.06) on entity knowledge tasks in OLMo-2 models (1B-13B). We introduce an evaluation framework to detect this phenomenon in trained models, and show that it occurs on a subset of the FlanV2 dataset in open (OLMo-2-7B; Llama-4-Maverick), and closed (GPT-4o) models. Finally, we present a case study on the implications for safety finetuning, showing that unintended syntactic-domain correlations can be used to bypass refusals in OLMo-2-7B Instruct and GPT-4o. Our findings highlight two needs: (1) to explicitly test for syntactic-domain correlations, and (2) to ensure syntactic diversity in training data, specifically within domains, to prevent such spurious correlations.
[52] Who's Laughing Now? An Overview of Computational Humour Generation and Explanation
Tyler Loakman,William Thorne,Chenghua Lin
Main category: cs.CL
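TL;DR: This paper surveys computational humour as it relates to generating and explaining humour. Although understanding humour has the hallmarks of a foundational NLP task, work on generation and explanation beyond puns remains sparse and state-of-the-art models still fall short of human ability; the paper argues for the importance of computational humour processing and outlines future research directions.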
Details
Motivation: Humour is an abstract, creative, and often context-dependent construct that takes extensive reasoning to understand and create, which makes it an important task for assessing the common-sense and reasoning abilities of modern large language models. Research remains limited, however, especially on generating and explaining humour other than puns. Method: A literature survey that reviews the state of research on humour generation and explanation and analyzes the performance and limitations of existing models. Result: Work on generating and explaining non-pun humour remains sparse, and state-of-the-art language models are still far from human level on these tasks. Conclusion: Computational humour processing should be treated as an important subdiscipline of NLP, and future research needs to take the subjective and ethically ambiguous nature of humour into account. Abstract: The creation and perception of humour is a fundamental human trait, positioning its computational understanding as one of the most challenging tasks in natural language processing (NLP). As an abstract, creative, and frequently context-dependent construct, humour requires extensive reasoning to understand and create, making it a pertinent task for assessing the common-sense knowledge and reasoning abilities of modern large language models (LLMs). In this work, we survey the landscape of computational humour as it pertains to the generative tasks of creation and explanation. We observe that, despite the task of understanding humour bearing all the hallmarks of a foundational NLP task, work on generating and explaining humour beyond puns remains sparse, while state-of-the-art models continue to fall short of human capabilities. We bookend our literature survey by motivating the importance of computational humour processing as a subdiscipline of NLP and presenting an extensive discussion of future directions for research in the area that takes into account the subjective and ethically ambiguous nature of humour.
[53] GEP: A GCG-Based method for extracting personally identifiable information from chatbots built on small language models
Jieli Zhu,Vi Ngoc-Nha Tran
Main category: cs.CL
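TL;DR: This study investigates personally identifiable information (PII) leakage in chatbots built on small language models (SLMs) for downstream tasks. It proposes GEP, a greedy coordinate gradient (GCG)-based method that substantially increases PII extraction, and validates it in more complex, realistic scenarios.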
Details
Motivation: SLMs are more efficient than LLMs, but their PII leakage risk in downstream applications is under-studied, so PII leakage detection methods designed specifically for SLMs are needed.
Method: First finetunes ChatBioGPT on the BioGPT backbone using the medical datasets Alpaca and HealthCareMagic; then proposes GEP, a GCG-based PII extraction technique, and evaluates it with PII inserted both via templates and in free style.
Result: GEP reveals up to 60× more leakage than previous template-based attacks, and in the harder free-style insertion setting it still reveals a PII leakage rate of up to 4.53%.
Conclusion: GEP is an efficient and adaptable PII leakage detection method; it exposes the privacy risks SLMs may carry in practice and underlines the need for stronger privacy protection when deploying them.
Abstract: Small language models (SLMs) have become increasingly appealing because, in certain fields, they perform roughly on par with large language models (LLMs) while consuming less energy and time during training and inference. However, the personally identifiable information (PII) leakage of SLMs for downstream tasks has yet to be explored. In this study, we investigate the PII leakage of a chatbot based on an SLM. We first finetune a new chatbot, ChatBioGPT, based on the backbone of BioGPT using the medical datasets Alpaca and HealthCareMagic. It achieves BERTScore performance comparable to previous studies of ChatDoctor and ChatGPT. Based on this model, we show that previous template-based PII attacking methods cannot effectively extract the PII in the dataset for leakage detection under the SLM condition. We then propose GEP, a greedy coordinate gradient-based (GCG) method specifically designed for PII extraction. We conduct experimental studies of GEP, and the results show up to 60× more leakage than the previous template-based methods. We further extend GEP to a more complicated and realistic situation by conducting free-style insertion, where the PII inserted in the dataset takes various syntactic forms instead of fixed templates, and GEP is still able to reveal a PII leakage rate of up to 4.53%.
[54] Eigen-1: Adaptive Multi-Agent Refinement with Monitor-Based RAG for Scientific Reasoning
Xiangru Tang,Wanghan Xu,Yujie Wang,Zijie Guo,Daniel Shao,Jiapeng Chen,Cixuan Zhang,Ziyi Wang,Lixin Zhang,Guancheng Wan,Wenlong Zhang,Lei Bai,Zhenfei Yin,Philip Torr,Hanrui Wang,Di Jin
Main category: cs.CL
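TL;DR: Proposes a unified framework combining implicit retrieval with structured collaboration. Using a Monitor-based retrieval module together with Hierarchical Solution Refinement (HSR) and Quality-Aware Iterative Reasoning (QAIR), it substantially improves the accuracy and efficiency of LLMs on scientific reasoning tasks and achieves state-of-the-art results on several benchmarks.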
Details
Motivation: In scientific reasoning with LLMs, explicit retrieval fragments the reasoning process, and multi-agent pipelines dilute strong solutions by averaging over candidates; the paper aims to remove this "tool tax" and improve reasoning quality.
Method: A unified framework that integrates external knowledge through a Monitor-based, token-level implicit retrieval module, with HSR and QAIR providing structured collaboration and adaptive refinement.
Result: 48.3% accuracy on Humanity's Last Exam Bio/Chem Gold, 13.4 points above the strongest baseline and up to 18.1 points above frontier LLMs, with 53.5% fewer tokens and 43.7% fewer agent steps; results on SuperGPQA and TRQA confirm cross-domain robustness.
Conclusion: Implicit augmentation and structured refinement overcome the inefficiencies of explicit tool use and uniform aggregation, offering a more efficient and accurate approach to scientific reasoning.
Abstract: Large language models (LLMs) have recently shown strong progress on scientific reasoning, yet two major bottlenecks remain. First, explicit retrieval fragments reasoning, imposing a hidden "tool tax" of extra tokens and steps. Second, multi-agent pipelines often dilute strong solutions by averaging across all candidates. We address these challenges with a unified framework that combines implicit retrieval and structured collaboration. At its foundation, a Monitor-based retrieval module operates at the token level, integrating external knowledge with minimal disruption to reasoning. On top of this substrate, Hierarchical Solution Refinement (HSR) iteratively designates each candidate as an anchor to be repaired by its peers, while Quality-Aware Iterative Reasoning (QAIR) adapts refinement to solution quality. On Humanity's Last Exam (HLE) Bio/Chem Gold, our framework achieves 48.3% accuracy -- the highest reported to date, surpassing the strongest agent baseline by 13.4 points and leading frontier LLMs by up to 18.1 points, while simultaneously reducing token usage by 53.5% and agent steps by 43.7%. Results on SuperGPQA and TRQA confirm robustness across domains. Error analysis shows that reasoning failures and knowledge gaps co-occur in over 85% of cases, while diversity analysis reveals a clear dichotomy: retrieval tasks benefit from solution variety, whereas reasoning tasks favor consensus. Together, these findings demonstrate how implicit augmentation and structured refinement overcome the inefficiencies of explicit tool use and uniform aggregation.
Code is available at: https://github.com/tangxiangru/Eigen-1.
[55] CLaw: Benchmarking Chinese Legal Knowledge in Large Language Models - A Fine-grained Corpus and Reasoning Analysis
Xinzhe Xu,Liang Zhao,Hongshen Xu,Chen Chen
Main category: cs.CL
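TL;DR: Introduces CLaw, a new benchmark for evaluating LLMs on Chinese legal knowledge and its application in reasoning. CLaw comprises a fine-grained corpus of 306 Chinese national statutes and 254 case-based reasoning instances; experiments show that most current LLMs struggle significantly to reproduce legal provisions accurately, underscoring that reliable legal reasoning needs both precise knowledge retrieval and strong general reasoning.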
Details
Motivation: Existing LLMs are not reliable on legal text because they lack specialized training and struggle to cite and reason over legal provisions accurately; a dedicated benchmark is needed to measure their actual legal knowledge.
Method: Builds a fine-grained corpus of 306 Chinese national statutes, segmented to the subparagraph level and annotated with historical revision times (64,849 entries), plus 254 case-based reasoning tasks derived from Supreme People's Court materials, to evaluate both recall of legal knowledge and its practical application.
Result: Current mainstream LLMs perform poorly at faithfully reproducing legal provisions, exposing a serious weakness in legal knowledge retrieval that in turn undermines the credibility of their legal reasoning.
Conclusion: Trustworthy legal reasoning requires accurate knowledge retrieval (for example via supervised fine-tuning or retrieval-augmented generation) working together with strong general reasoning; CLaw provides a key benchmark and insights for advancing domain-specific LLMs, particularly in law.
Abstract: Large Language Models (LLMs) are increasingly tasked with analyzing legal texts and citing relevant statutes, yet their reliability is often compromised by general pre-training that ingests legal texts without specialized focus, obscuring the true depth of their legal knowledge. This paper introduces CLaw, a novel benchmark specifically engineered to meticulously evaluate LLMs on Chinese legal knowledge and its application in reasoning. CLaw comprises two key components: (1) a comprehensive, fine-grained corpus of all 306 Chinese national statutes, segmented to the subparagraph level and incorporating precise historical revision timesteps for rigorous recall evaluation (64,849 entries), and (2) a challenging set of 254 case-based reasoning instances derived from China Supreme Court curated materials to assess the practical application of legal knowledge. Our empirical evaluation reveals that most contemporary LLMs significantly struggle to faithfully reproduce legal provisions. As accurate retrieval and citation of legal provisions form the basis of legal reasoning, this deficiency critically undermines the reliability of their responses. We contend that achieving trustworthy legal reasoning in LLMs requires a robust synergy of accurate knowledge retrieval--potentially enhanced through supervised fine-tuning (SFT) or retrieval-augmented generation (RAG)--and strong general reasoning capabilities.
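To make the idea of provision-level recall evaluation more concrete, here is a minimal sketch of how a model's reproduction of a statute entry might be scored against the reference text. This is an illustration only: the abstract does not specify CLaw's metric, so the character-level similarity from Python's `difflib`, the whitespace normalization, and the names `provision_similarity` and `mean_recall_score` are all assumptions made for this example.

```python
from difflib import SequenceMatcher
from typing import Dict


def normalize(text: str) -> str:
    """Remove all whitespace so layout differences are not penalised (an assumption)."""
    return "".join(text.split())


def provision_similarity(reference: str, generated: str) -> float:
    """Character-level similarity in [0, 1] between a reference provision and a model output."""
    return SequenceMatcher(None, normalize(reference), normalize(generated)).ratio()


def mean_recall_score(references: Dict[str, str], outputs: Dict[str, str]) -> float:
    """Average similarity over all reference entries; a missing output is scored as empty text."""
    if not references:
        raise ValueError("references must not be empty")
    total = sum(
        provision_similarity(ref, outputs.get(key, ""))
        for key, ref in references.items()
    )
    return total / len(references)
```

A stricter variant could require an exact match after normalization; which notion of "faithful reproduction" is appropriate depends on the benchmark's own definition.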
This work provides an essential benchmark and critical insights for advancing domain-specific LLM reasoning, particularly within the complex legal sphere.
[56] SGMem: Sentence Graph Memory for Long-Term Conversational Agents
Yaxiong Wu,Yongyue Zhang,Sheng Liang,Yong Liu
Main category: cs.CL
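TL;DR: Proposes SGMem (Sentence Graph Memory), which represents dialogue as sentence-level graphs to manage long-term conversational memory and improve LLM performance on long-conversation question answering.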
Details
Motivation: Existing methods based on fact extraction or summarization struggle to organize and retrieve dialogue information at different granularities, and cannot cope well with dialogue histories that exceed the context window.
Method: SGMem splits dialogue into chunked units and builds sentence-level graphs, combining raw dialogue with generated memory (such as summaries, facts, and insights) to capture associations across turn-, round-, and session-level contexts.
Result: On LongMemEval and LoCoMo, SGMem consistently improves accuracy and outperforms strong baselines in long-term conversational question answering.
Conclusion: By integrating raw dialogue and generated memory in a multi-level graph structure, SGMem addresses information redundancy and retrieval difficulty in long-term conversations and helps LLMs make better use of memory.
Abstract: Long-term conversational agents require effective memory management to handle dialogue histories that exceed the context window of large language models (LLMs). Existing methods based on fact extraction or summarization reduce redundancy but struggle to organize and retrieve relevant information across different granularities of dialogue and generated memory. We introduce SGMem (Sentence Graph Memory), which represents dialogue as sentence-level graphs within chunked units, capturing associations across turn-, round-, and session-level contexts. By combining retrieved raw dialogue with generated memory such as summaries, facts and insights, SGMem supplies LLMs with coherent and relevant context for response generation. Experiments on LongMemEval and LoCoMo show that SGMem consistently improves accuracy and outperforms strong baselines in long-term conversational question answering.
[57] Query-Centric Graph Retrieval Augmented Generation
Yaxiong Wu,Jianyuan Bo,Yongyue Zhang,Sheng Liang,Yong Liu
Main category: cs.CL
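TL;DR: QCG-RAG is a query-centric graph retrieval-augmented generation framework. Through query-centric graph construction and a multi-hop retrieval mechanism, it offers controllable granularity between fine-grained and coarse-grained indexing and improves multi-hop question answering.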
Details
Motivation: Existing graph-based RAG methods face a granularity dilemma: fine-grained entity-level graphs are costly and lose context, while coarse document-level graphs miss nuanced relations, so a method that balances efficiency and semantic completeness is needed.
Method: Proposes QCG-RAG, which uses Doc2Query and its variant to build query-centric graphs for query-granular indexing, plus a tailored multi-hop retrieval mechanism that selects relevant chunks via the generated queries.
Result: On LiHuaWorld and MultiHop-RAG, QCG-RAG consistently outperforms prior chunk-based and graph-based RAG methods in question answering accuracy.
Conclusion: Through query-centric graph modeling and controllable-granularity indexing, QCG-RAG offers a more efficient and interpretable paradigm for multi-hop reasoning.
Abstract: Graph-based retrieval-augmented generation (RAG) enriches large language models (LLMs) with external knowledge for long-context understanding and multi-hop reasoning, but existing methods face a granularity dilemma: fine-grained entity-level graphs incur high token costs and lose context, while coarse document-level graphs fail to capture nuanced relations. We introduce QCG-RAG, a query-centric graph RAG framework that enables query-granular indexing and multi-hop chunk retrieval. Our query-centric approach leverages Doc2Query and Doc2Query-- to construct query-centric graphs with controllable granularity, improving graph quality and interpretability. A tailored multi-hop retrieval mechanism then selects relevant chunks via the generated queries. Experiments on LiHuaWorld and MultiHop-RAG show that QCG-RAG consistently outperforms prior chunk-based and graph-based RAG methods in question answering accuracy, establishing a new paradigm for multi-hop reasoning.
[58] Un-Doubling Diffusion: LLM-guided Disambiguation of Homonym Duplication
Evgeny Kaskov,Elizaveta Petrova,Petr Surovtsev,Anna Kostikova,Ilya Mistiurin,Alexander Kapitanov,Alexander Nagaev
Main category: cs.CL
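TL;DR: Proposes a method for measuring homonym duplication and evaluates several diffusion models with both vision-language models and human evaluation; prompt expansion is found to effectively mitigate duplication caused by homonyms and by Anglocentric bias.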
Details
Motivation: In text-to-image generation, a homonym can make the model render several senses of the word at once; in addition, non-English words may become homonyms once translated into English, further distorting meaning. A systematic way to measure and mitigate the problem is needed.
Method: Proposes an evaluation method that quantifies homonym duplication rates, combining automatic evaluation with vision-language models and human evaluation; also explores prompt expansion as a mitigation.
Result: Diffusion models commonly exhibit homonym duplication, and Anglocentric translation bias introduces additional errors; prompt expansion markedly lowers the duplication rate and improves generation accuracy.
Conclusion: Prompt expansion is an effective strategy against homonym duplication and Anglocentric bias, and combining automatic and human evaluation gives a fuller picture of model behavior. The code has been open-sourced.
Abstract: Homonyms are words with identical spelling but distinct meanings, which pose challenges for many generative models. When a homonym appears in a prompt, diffusion models may generate multiple senses of the word simultaneously, which is known as homonym duplication. This issue is further complicated by an Anglocentric bias, which includes an additional translation step before the text-to-image model pipeline. As a result, even words that are not homonymous in the original language may become homonyms and lose their meaning after translation into English. In this paper, we introduce a method for measuring duplication rates and conduct evaluations of different diffusion models using both automatic evaluation utilizing Vision-Language Models (VLMs) and human evaluation. Additionally, we investigate methods to mitigate the homonym duplication problem through prompt expansion, demonstrating that this approach also effectively reduces duplication related to Anglocentric bias. The code for the automatic evaluation pipeline is publicly available.
[59] LLM Output Homogenization is Task Dependent
Shomik Jain,Jack Lanchantin,Maximilian Nickel,Karen Ullrich,Ashia Wilson,Jamelle Watson-Daniels
Main category: cs.CL
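TL;DR: Proposes a task-taxonomy-based framework for evaluating output diversity. By introducing task-anchored functional diversity and a task-anchored sampling technique, it improves the evaluation and mitigation of LLM output homogenization across tasks and shows that functional diversity can be increased while maintaining response quality.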
Details
Motivation: Prior work on LLM output homogenization does not adequately account for task-dependent diversity, which makes evaluation and intervention methods imprecise.
Method: Proposes a taxonomy of eight task categories, defines a task-anchored functional diversity metric, and designs a matching sampling technique to adjust output diversity per task.
Result: The method increases functional diversity in tasks where diversity is wanted, preserves homogenization where consistency is wanted, and does not sacrifice generation quality.
Conclusion: Accounting for task dependence improves the evaluation and mitigation of LLM output homogenization.
Abstract: A large language model can be less helpful if it exhibits output response homogenization. But whether two responses are considered homogeneous, and whether such homogenization is problematic, both depend on the task category. For instance, in objective math tasks, we often expect no variation in the final answer but anticipate variation in the problem-solving strategy. For creative writing tasks, by contrast, we may expect variation in key narrative components (e.g., plot, genre, setting), beyond the vocabulary or embedding diversity produced by temperature-sampling. Previous work addressing output homogenization often fails to conceptualize diversity in a task-dependent way. We address this gap in the literature directly by making the following contributions. (1) We present a task taxonomy comprising eight task categories that each have distinct conceptualizations of output homogenization. (2) We introduce task-anchored functional diversity to better evaluate output homogenization. (3) We propose a task-anchored sampling technique that increases functional diversity for task categories where homogenization is undesired, while preserving homogenization where it is desired. (4) We challenge the perceived existence of a diversity-quality trade-off by increasing functional diversity while maintaining response quality. Overall, we demonstrate how task dependence improves the evaluation and mitigation of output homogenization.
[60] LLMTrace: A Corpus for Classification and Fine-Grained Localization of AI-Written Text
Irina Tolstykh,Aleksandra Tsybina,Sergey Yakubson,Maksim Kuprashevich
Main category: cs.CL
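TL;DR: LLMTrace is a large-scale bilingual (English and Russian) corpus for AI-generated text detection, supporting full-text binary classification and character-level localization of AI-generated segments.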
Details
Motivation: Existing datasets are mostly generated with outdated models, are largely limited to English, and lack precise character-level annotation of the AI-generated parts of mixed human-AI text.
Method: Builds a bilingual text dataset using a range of modern proprietary and open-source LLMs, with character-level annotations to support AI-generated interval detection.
Result: LLMTrace supports full-text classification and AI-generated interval localization, filling gaps in language diversity, model recency, and fine-grained annotation.
Conclusion: LLMTrace is expected to become an important resource for training and evaluating next-generation, more nuanced and practical AI detection models.
Abstract: The widespread use of human-like text from Large Language Models (LLMs) necessitates the development of robust detection systems. However, progress is limited by a critical lack of suitable training data; existing datasets are often generated with outdated models, are predominantly in English, and fail to address the increasingly common scenario of mixed human-AI authorship. Crucially, while some datasets address mixed authorship, none provide the character-level annotations required for the precise localization of AI-generated segments within a text. To address these gaps, we introduce LLMTrace, a new large-scale, bilingual (English and Russian) corpus for AI-generated text detection. Constructed using a diverse range of modern proprietary and open-source LLMs, our dataset is designed to support two key tasks: traditional full-text binary classification (human vs. AI) and the novel task of AI-generated interval detection, facilitated by character-level annotations. We believe LLMTrace will serve as a vital resource for training and evaluating the next generation of more nuanced and practical AI detection models. The project page, iitolstykh/LLMTrace, is available at https://sweetdream779.github.io/LLMTrace-info/.
[61] Bounds of Chain-of-Thought Robustness: Reasoning Steps, Embed Norms, and Beyond
Dingzirui Wang,Xuanliang Zhang,Keyan Xu,Qingfu Zhu,Wanxiang Che,Yang Deng
Main category: cs.CL
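TL;DR: Theoretically analyzes how input perturbations affect fluctuations in chain-of-thought (CoT) outputs. It derives an upper bound on input perturbation under acceptable output fluctuation, proves that the bound is positively correlated with the number of reasoning steps and that even infinitely long reasoning cannot fully eliminate the perturbation's effect, and further proves for Linear Self-Attention (LSA), a simplified Transformer, that the bound is negatively correlated with the norms of the input embedding and hidden state vectors. Experiments support the theoretical analysis.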
Details
Motivation: Existing research lacks a theoretical explanation of how input perturbations affect CoT outputs, which limits understanding of how perturbations propagate during reasoning and hinders further improvement of prompt optimization methods.
Method: Derives an upper bound on input perturbation under acceptable output fluctuation, analyzes how the bound relates to the number of reasoning steps and to vector norms in a model architecture (LSA), and validates the analysis with experiments on mainstream datasets and models.
Result: (1) The upper bound on input perturbation is positively correlated with the number of CoT reasoning steps; (2) even an unlimited number of reasoning steps cannot fully eliminate the effect of input perturbations; (3) in the LSA model, the bound is negatively correlated with the norms of the input embedding and hidden state vectors. Experimental results agree with the theory.
Conclusion: The effect of input perturbations on CoT outputs admits a theoretical upper bound; reasoning length and vector norms are key factors for robustness, which provides a theoretical basis for prompt optimization and model design.
Abstract: Existing research indicates that the output of Chain-of-Thought (CoT) is significantly affected by input perturbations. Although many methods aim to mitigate such impact by optimizing prompts, a theoretical explanation of how these perturbations influence CoT outputs remains an open area of research. This gap limits our in-depth understanding of how input perturbations propagate during the reasoning process and hinders further improvements in prompt optimization methods. Therefore, in this paper, we theoretically analyze the effect of input perturbations on the fluctuation of CoT outputs. We first derive an upper bound for input perturbations under the condition that the output fluctuation is within an acceptable range, based on which we prove that: (i) This upper bound is positively correlated with the number of reasoning steps in the CoT; (ii) Even an infinitely long reasoning process cannot eliminate the impact of input perturbations. We then apply these conclusions to the Linear Self-Attention (LSA) model, which can be viewed as a simplified version of the Transformer. For the LSA model, we prove that the upper bound for input perturbation is negatively correlated with the norms of the input embedding and hidden state vectors. To validate this theoretical analysis, we conduct experiments on three mainstream datasets and four mainstream models. The experimental results align with our theoretical analysis, empirically demonstrating the correctness of our findings.
[62] DisCoCLIP: A Distributional Compositional Tensor Network Encoder for Vision-Language Understanding
Kin Ian Lo,Hala Hawashin,Mina Abbaszadeh,Tilen Limback-Stokin,Hadi Wazni,Mehrnoosh Sadrzadeh
Main category: cs.CL
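TL;DR: DisCoCLIP combines a frozen CLIP vision transformer with a tensor-network text encoder that explicitly models syntactic structure, markedly improving compositional reasoning on vision-language tasks while remaining efficient.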
Details
Motivation: Existing vision-language models excel at large-scale image-text alignment but often neglect the compositional structure of language, leading to poor results on tasks that depend on word order and predicate-argument structure.
Method: Proposes DisCoCLIP, which parses sentences with a Combinatory Categorial Grammar parser to produce distributional word tensors whose contractions mirror the syntactic derivation; tensor decompositions reduce the parameter count of high-order tensors, and the text encoder is trained end-to-end with a contrastive loss against the CLIP vision encoder.
Result: SVO-Probes verb accuracy rises from 77.6% to 82.4%, ARO attribution and relation scores improve by over 9% and 4% respectively, and the model reaches 93.7% on the newly introduced SVO-Swap benchmark.
Conclusion: Explicitly embedding linguistic structure via tensor networks yields interpretable, parameter-efficient representations that substantially strengthen compositional reasoning in vision-language models.
Abstract: Recent vision-language models excel at large-scale image-text alignment but often neglect the compositional structure of language, leading to failures on tasks that hinge on word order and predicate-argument structure. We introduce DisCoCLIP, a multimodal encoder that combines a frozen CLIP vision transformer with a novel tensor network text encoder that explicitly encodes syntactic structure. Sentences are parsed with a Combinatory Categorial Grammar parser to yield distributional word tensors whose contractions mirror the sentence's grammatical derivation. To keep the model efficient, high-order tensors are factorized with tensor decompositions, reducing parameter count from tens of millions to under one million. Trained end-to-end with a self-supervised contrastive loss, DisCoCLIP markedly improves sensitivity to verb semantics and word order: it raises CLIP's SVO-Probes verb accuracy from 77.6% to 82.4%, boosts ARO attribution and relation scores by over 9% and 4%, and achieves 93.7% on a newly introduced SVO-Swap benchmark. These results demonstrate that embedding explicit linguistic structure via tensor networks yields interpretable, parameter-efficient representations that substantially improve compositional reasoning in vision-language tasks.
[63] The role of synthetic data in Multilingual, Multi-cultural AI systems: Lessons from Indic Languages
Pranjal A. Chitale,Varun Gumma,Sanchit Ahuja,Prashant Kodali,Manan Uppadhyay,Deepthi Sudharsan,Sunayana Sitaram
Main category: cs.CL
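TL;DR: Proposes a bottom-up synthetic data generation approach grounded in language-specific Wikipedia content and builds Updesh, an instruction-following dataset of 9.5M data points across 13 Indian languages. It validates the effectiveness of culturally contextualized synthetic data for multilingual AI systems, with the largest gains in low- and medium-resource languages.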
Details
Motivation: Existing multilingual AI systems largely rely on synthetic data translated from high-resource languages such as English, which does not adequately reflect local cultural context and performs poorly in low-resource languages in particular. A synthetic data generation approach that incorporates linguistic and cultural context is therefore needed.
Method: Uses large open-source LLMs (>= 235B parameters) to generate culturally contextualized synthetic data bottom-up from each Indian language's Wikipedia content, yielding the Updesh dataset; assesses data quality with automated metrics and human evaluation; and fine-tunes models on the data and evaluates them on 15 multilingual downstream datasets.
Result: Updesh data is of high quality, though human evaluation still finds room for improvement. Models fine-tuned on Updesh achieve significant gains on generative tasks and remain competitive on multiple-choice NLU tasks, with the largest improvements in low- and medium-resource languages, narrowing the gap with high-resource languages.
Conclusion: Effective multilingual AI requires multi-faceted data curation and generation strategies that are context-aware and culturally grounded; bottom-up, culturally contextualized synthetic data is a promising direction.
Abstract: Developing AI systems that operate effectively across languages while remaining culturally grounded is a long-standing challenge, particularly in low-resource settings. Synthetic data provides a promising avenue, yet its effectiveness in multilingual and multicultural contexts remains underexplored. We investigate the creation and impact of synthetic, culturally contextualized datasets for Indian languages through a bottom-up generation strategy that prompts large open-source LLMs (>= 235B parameters) to ground data generation in language-specific Wikipedia content. This approach complements the dominant top-down paradigm of translating synthetic datasets from high-resource languages such as English. We introduce Updesh, a high-quality large-scale synthetic instruction-following dataset comprising 9.5M data points across 13 Indian languages, encompassing diverse reasoning and generative tasks with an emphasis on long-context, multi-turn capabilities, and alignment with Indian cultural contexts. A comprehensive evaluation incorporating both automated metrics and human annotation across 10k assessments indicates that the generated data is of high quality, though human evaluation highlights areas for further improvement. Additionally, we perform downstream evaluations by fine-tuning models on our dataset and assessing the performance across 15 diverse multilingual datasets. Models trained on Updesh consistently achieve significant gains on generative tasks and remain competitive on multiple-choice style NLU tasks.
Notably, relative improvements are most pronounced in low and medium-resource languages, narrowing their gap with high-resource languages. These findings provide empirical evidence that effective multilingual AI requires multi-faceted data curation and generation strategies that incorporate context-aware, culturally grounded methodologies.
[64] Sycophancy Is Not One Thing: Causal Separation of Sycophantic Behaviors in LLMs
Daniel Vennemeyer,Phan Anh Duong,Tiffany Zhan,Tianyu Jiang
Main category: cs.CL
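TL;DR: Decomposes sycophancy in LLMs into sycophantic agreement and sycophantic praise and contrasts both with genuine agreement. The three behaviors are found to lie along distinct, separable linear directions in latent space and to be independently steerable across models, suggesting they are driven by different representational mechanisms.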
Details
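Motivation: It is unclear whether sycophantic behavior in LLMs arises from a single mechanism or from several distinct processes, so the behavior needs to be decomposed and its types and underlying mechanisms identified.
Method: Uses difference-in-means directions, activation additions, and subspace geometry across multiple models and datasets to analyze the representational structure of sycophantic agreement, sycophantic praise, and genuine agreement.
Result: (1) The three behaviors are encoded along distinct linear directions in latent space; (2) each behavior can be independently amplified or suppressed without affecting the others; (3) their representational structure is consistent across model families and scales.
Conclusion: Sycophantic behaviors correspond to distinct, independently steerable representations, which suggests they are driven by multiple independent mechanisms and not a single one.
Abstract: Large language models (LLMs) often exhibit sycophantic behaviors -- such as excessive agreement with or flattery of the user -- but it is unclear whether these behaviors arise from a single mechanism or multiple distinct processes. We decompose sycophancy into sycophantic agreement and sycophantic praise, contrasting both with genuine agreement. Using difference-in-means directions, activation additions, and subspace geometry across multiple models and datasets, we show that: (1) the three behaviors are encoded along distinct linear directions in latent space; (2) each behavior can be independently amplified or suppressed without affecting the others; and (3) their representational structure is consistent across model families and scales. These results suggest that sycophantic behaviors correspond to distinct, independently steerable representations.
[65] RLBFF: Binary Flexible Feedback to bridge between Human Feedback & Verifiable Rewards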
Zhilin Wang,Jiaqi Zeng,Olivier Delalleau,Ellie Evans,Daniel Egert,Hoo-Chang Shin,Felipe Soares,Yi Dong,Oleksii Kuchaiev
Main category: cs.CL
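TL;DR: Proposes RLBFF, a new reinforcement learning paradigm that combines the flexibility of human feedback with the precision of rule-based verification. Binary principles make reward models more interpretable and better performing, with leading results on several benchmarks.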
Details
Motivation: Existing RLHF methods lack interpretability and are prone to reward hacking, while RLVR is limited to correctness verification and cannot capture the many dimensions of response quality. An approach that combines the flexibility of human preferences with the precision of rules is needed.
Method: Converts natural language feedback into principles that can be judged in a binary way (e.g., accuracy of information, code readability) and frames reward model training as an entailment task (does the response satisfy a given principle), which allows principles of interest to be specified at inference time.
Result: The trained reward models achieve top performance on RM-Bench (86.2%) and JudgeBench (81.4%), outperform Bradley-Terry models when matched for data, and let users customize the evaluation focus; a full open-source recipe for aligning Qwen3-32B with RLBFF is released.
Conclusion: RLBFF combines the strengths of human feedback and rule-based verification, improving the interpretability, flexibility, and performance of reward models and offering an efficient, low-cost path for LLM alignment.
Abstract: Reinforcement Learning with Human Feedback (RLHF) and Reinforcement Learning with Verifiable Rewards (RLVR) are the main RL paradigms used in LLM post-training, each offering distinct advantages. However, RLHF struggles with interpretability and reward hacking because it relies on human judgments that usually lack explicit criteria, whereas RLVR is limited in scope by its focus on correctness-based verifiers. We propose Reinforcement Learning with Binary Flexible Feedback (RLBFF), which combines the versatility of human-driven preferences with the precision of rule-based verification, enabling reward models to capture nuanced aspects of response quality beyond mere correctness. RLBFF extracts principles that can be answered in a binary fashion (e.g. accuracy of information: yes, or code readability: no) from natural language feedback. Such principles can then be used to ground Reward Model training as an entailment task (response satisfies or does not satisfy an arbitrary principle). We show that Reward Models trained in this manner can outperform Bradley-Terry models when matched for data and achieve top performance on RM-Bench (86.2%) and JudgeBench (81.4%, #1 on leaderboard as of September 24, 2025). Additionally, users can specify principles of interest at inference time to customize the focus of our reward models, in contrast to Bradley-Terry models.
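As a rough illustration of the entailment framing described above, the sketch below shows one possible shape for binary-principle training examples and for scoring a response against principles chosen at inference time. It is not the authors' implementation: the `PrincipleExample` record, the `principle_reward` function, the averaging of per-principle scores, and the rule-based `toy_judge` that stands in for a trained reward model are all assumptions made for this example.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class PrincipleExample:
    """One hypothetical training record: does `response` satisfy `principle`?"""
    prompt: str
    response: str
    principle: str
    satisfied: bool


# A judge maps (prompt, response, principle) to an estimated P(principle is satisfied).
Judge = Callable[[str, str, str], float]


def principle_reward(judge: Judge, prompt: str, response: str,
                     principles: Sequence[str]) -> float:
    """Average the judge's scores over the principles selected at inference time."""
    if not principles:
        raise ValueError("at least one principle is required")
    scores = [judge(prompt, response, p) for p in principles]
    return sum(scores) / len(scores)


def toy_judge(prompt: str, response: str, principle: str) -> float:
    """Rule-based stand-in for a trained reward model (illustration only)."""
    if principle == "response is concise":
        return 1.0 if len(response.split()) <= 10 else 0.0
    if principle == "response contains code":
        return 1.0 if "def " in response else 0.0
    return 0.5  # unknown principle: no evidence either way
```

Because the principles are passed in per call, the same judge can be pointed at different quality criteria without retraining, which is the inference-time flexibility the abstract contrasts with Bradley-Terry models.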
Finally, we present a fully open source recipe (including data) to align Qwen3-32B using RLBFF and our Reward Model, to match or exceed the performance of o3-mini and DeepSeek R1 on general alignment benchmarks of MT-Bench, WildBench, and Arena Hard v2 (at <5% of the inference cost).
[66] SciReasoner: Laying the Scientific Reasoning Ground Across Disciplines
Yizhou Wang,Chen Tang,Han Deng,Jiabei Xiao,Jiaqi Liu,Jianyu Wu,Jun Yao,Pengze Li,Encheng Su,Lintao Wang,Guohang Zhuang,Yuchen Ren,Ben Fei,Ming Hu,Xin Chen,Dongzhan Zhou,Junjun He,Xiangyu Yue,Zhenfei Yin,Jiamin Wu,Qihao Zheng,Yuhao Zhou,Huihui Xu,Chenglong Ma,Yan Lu,Wenlong Zhang,Chunfeng Song,Philip Torr,Shixiang Tang,Xinzhu Ma,Wanli Ouyang,Lei Bai
Main category: cs.CL
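TL;DR: Presents a scientific reasoning foundation model that aligns natural language with heterogeneous scientific representations through multi-stage training and supports a range of scientific workflow capabilities across up to 103 tasks.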
Details
Motivation: To improve semantic alignment between natural language and specialized scientific formats, and to strengthen cross-discipline generalization and reasoning reliability.
Method: Pretrains on a 206B-token scientific corpus, then aligns the model with SFT, annealed cold-start bootstrapping to elicit long-form chain-of-thought, and reinforcement learning with task-specific reward shaping.
Result: The model performs well on translation, extraction, prediction, classification, and generation tasks, with broader instruction coverage and higher fidelity than specialist systems.
Conclusion: Cross-discipline learning strengthens transfer and downstream reliability; the model and related resources are open-sourced.
Abstract: We present a scientific reasoning foundation model that aligns natural language with heterogeneous scientific representations. The model is pretrained on a 206B-token corpus spanning scientific text, pure sequences, and sequence-text pairs, then aligned via SFT on 40M instructions, annealed cold-start bootstrapping to elicit long-form chain-of-thought, and reinforcement learning with task-specific reward shaping, which instills deliberate scientific reasoning. It supports five capability families, covering up to 103 tasks across workflows: (i) faithful translation between text and scientific formats, (ii) text/knowledge extraction, (iii) property prediction, (iv) property classification, (v) unconditional and conditional sequence generation and design. Compared with specialist systems, our approach broadens instruction coverage, improves cross-domain generalization, and enhances fidelity. We detail data curation and training and show that cross-discipline learning strengthens transfer and downstream reliability. The model, instruction-tuning datasets, and the evaluation code are open-sourced at https://huggingface.co/SciReason and https://github.com/open-sciencelab/SciReason.
cs.CV [Back]
[67] Leveraging NTPs for Efficient Hallucination Detection in VLMs
Ofir Azachi,Kfir Eliyahu,Eyal El Ani,Rom Himelstein,Roi Reichart,Yuval Pinter,Nitay Calderon
Main category: cs.CV
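TL;DR: Proposes a lightweight hallucination detection method based on a vision-language model's (VLM's) next-token probabilities (NTPs). NTPs serve as a signal of model uncertainty and are fed to traditional ML models for efficient, on-the-fly detection; experiments show performance comparable to detection with strong VLMs, with further gains from adding linguistic NTPs and VLM prediction scores.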
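The NTP idea summarized above can be pictured with a small sketch: summarize the next-token probabilities of one generated statement into a few features, then apply a simple decision rule. This is only an illustration under stated assumptions. The paper trains traditional ML models on NTP-based signals, whereas the fixed threshold used here is a stand-in, and the feature set, the function names `ntp_features` and `flag_hallucination`, and the default threshold value are all hypothetical.

```python
import math
from typing import Dict, List


def ntp_features(token_probs: List[float]) -> Dict[str, float]:
    """Summarise the next-token probabilities of one generated statement."""
    if not token_probs:
        raise ValueError("token_probs must be non-empty")
    if any(p <= 0.0 or p > 1.0 for p in token_probs):
        raise ValueError("probabilities must lie in (0, 1]")
    log_probs = [math.log(p) for p in token_probs]
    return {
        "min_prob": min(token_probs),                      # the least confident token
        "mean_prob": sum(token_probs) / len(token_probs),
        "mean_log_prob": sum(log_probs) / len(log_probs),
    }


def flag_hallucination(token_probs: List[float],
                       min_prob_threshold: float = 0.2) -> bool:
    """Threshold rule standing in for a trained classifier (an assumption)."""
    return ntp_features(token_probs)["min_prob"] < min_prob_threshold
```

In practice the feature dictionary would be passed to a fitted classifier, for example logistic regression or gradient-boosted trees trained on labeled statements, so that the decision boundary is learned from data.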
Details
Motivation: VLMs can produce hallucinations, text that is inconsistent with the image content, which undermines their reliability; existing detection methods mostly rely on a VLM to assess the output, which is computationally expensive and adds latency, so a more efficient approach is needed.
Method: Extracts next-token probabilities (NTPs) from the VLM's generation process as an uncertainty signal and trains traditional ML models on them for hallucination detection; introduces a dataset of 1,400 human-annotated statements to validate the method; and explores two enhancements, adding linguistic NTPs (computed by feeding only the generated text back into the VLM) and integrating VLM prediction scores.
Result: NTP-based features are effective predictors of hallucinations, and lightweight models reach performance comparable to strong VLMs; adding linguistic NTPs improves results, and integrating VLM prediction scores improves them further.
Conclusion: The NTP-based lightweight method is an efficient, practical solution for VLM hallucination detection that balances speed and accuracy, and may help enable more reliable, low-latency vision-language systems.
Abstract: Hallucinations of vision-language models (VLMs), which are misalignments between visual content and generated text, undermine the reliability of VLMs. One common approach for detecting them employs the same VLM, or a different one, to assess generated outputs. This process is computationally intensive and increases model latency. In this paper, we explore an efficient on-the-fly method for hallucination detection by training traditional ML models over signals based on the VLM's next-token probabilities (NTPs). NTPs provide a direct quantification of model uncertainty. We hypothesize that high uncertainty (i.e., a low NTP value) is strongly associated with hallucinations. To test this, we introduce a dataset of 1,400 human-annotated statements derived from VLM-generated content, each labeled as hallucinated or not, and use it to test our NTP-based lightweight method. Our results demonstrate that NTP-based features are valuable predictors of hallucinations, enabling fast and simple ML models to achieve performance comparable to that of strong VLMs. Furthermore, augmenting these NTPs with linguistic NTPs, computed by feeding only the generated text back into the VLM, enhances hallucination detection performance. Finally, integrating hallucination prediction scores from VLMs into the NTP-based models led to better performance than using either VLMs or NTPs alone. We hope this study paves the way for simple, lightweight solutions that enhance the reliability of VLMs.
[68] Quasi-Synthetic Riemannian Data Generation for Writer-Independent Offline Signature Verification
Elias N. Zois,Moises Diaz,Salem Said,Miguel A. Ferrer
Main category: cs.CV
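TL;DR: Proposes a quasi-synthetic data generation framework based on Riemannian geometry for offline handwritten signature verification. Synthetic data is generated in the space of symmetric positive definite matrices, and the approach achieves low error rates across datasets and writing styles.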
Details
Motivation: To address the generalization problem of offline handwritten signature verification for unseen writers and to reduce dependence on real-world signature datasets.
Method: Exploits the Riemannian geometry of SPD matrices to build a Riemannian Gaussian mixture model, generates positive and negative synthetic SPD data by sampling on the Riemannian manifold, and trains and tests within a metric learning framework.
Result: The method is validated on two public datasets covering Western and Asian writing styles, achieving low error rates and performing well in cross-dataset evaluation.
Conclusion: Generating quasi-synthetic data in Riemannian spaces is promising and can effectively support writer-independent handwritten signature verification systems.
Abstract: Offline handwritten signature verification remains a challenging task, particularly in writer-independent settings where models must generalize across unseen individuals. Recent developments have highlighted the advantage of geometrically inspired representations, such as covariance descriptors on Riemannian manifolds. However, past or present, handcrafted or data-driven methods usually depend on real-world signature datasets for classifier training. We introduce a quasi-synthetic data generation framework leveraging the Riemannian geometry of Symmetric Positive Definite (SPD) matrices. A small set of genuine samples in the SPD space seeds a Riemannian Gaussian mixture, which identifies Riemannian centers as synthetic writers and variances as their properties. Riemannian Gaussian sampling on each center generates positive as well as negative synthetic SPD populations. A metric learning framework is trained on pairs of similar and dissimilar SPD points and subsequently tested on real-world datasets. Experiments conducted on two popular signature datasets, encompassing Western and Asian writing styles, demonstrate the efficacy of the proposed approach under both intra- and cross-dataset evaluation protocols. The results indicate that our quasi-synthetic approach achieves low error rates, highlighting the potential of generating synthetic data in Riemannian spaces for writer-independent signature verification systems.
[69] Seedream 4.0: Toward Next-generation Multimodal Image Generation
Team Seedream,Yunpeng Chen,Yu Gao,Lixue Gong,Meng Guo,Qiushan Guo,Zhiyao Guo,Xiaoxia Hou,Weilin Huang,Yixuan Huang,Xiaowen Jian,Huafeng Kuang,Zhichao Lai,Fanshi Li,Liang Li,Xiaochen Lian,Chao Liao,Liyang Liu,Wei Liu,Yanzuo Lu,Zhengxiong Luo,Tongtong Ou,Guang Shi,Yichun Shi,Shiqi Sun,Yu Tian,Zhi Tian,Peng Wang,Rui Wang,Xun Wang,Ye Wang,Guofeng Wu,Jie Wu,Wenxu Wu,Yonghui Wu,Xin Xia,Xuefeng Xiao,Shuang Xu,Xin Yan,Ceyuan Yang,Jianchao Yang,Zhonghua Zhai,Chenlin Zhang,Heng Zhang,Qi Zhang,Xinyu Zhang,Yuwei Zhang,Shijia Zhao,Wenliang Zhao,Wenjia Zhu
Main category: cs.CV
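TL;DR: Seedream 4.0 is an efficient, high-performance multimodal image generation system that unifies text-to-image generation, image editing, and multi-image composition, supports fast generation of high-resolution images, and achieves SOTA performance on multiple tasks.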
Details
Motivation: To build a unified and efficient multimodal image generation framework that overcomes the limitations of traditional T2I systems in editing capability, interactivity, and multi-image composition.
Method: An efficient diffusion transformer paired with a powerful VAE reduces the number of image tokens; multimodal post-training, adversarial distillation, distribution matching, quantization, and speculative decoding are combined for training and inference acceleration.
Result: Inference takes up to 1.8 seconds to generate a 2K image; the system reaches SOTA on text-to-image generation and multimodal image editing, and supports precise editing, in-context reasoning, and multi-image reference generation.
Conclusion: Seedream 4.0 extends traditional T2I systems into a more interactive, multidimensional creative tool, showing strong generative AI capability in creative and professional applications.
Abstract: We introduce Seedream 4.0, an efficient and high-performance multimodal image generation system that unifies text-to-image (T2I) synthesis, image editing, and multi-image composition within a single framework. We develop a highly efficient diffusion transformer with a powerful VAE that can also considerably reduce the number of image tokens. This allows for efficient training of our model and enables it to quickly generate native high-resolution images (e.g., 1K-4K). Seedream 4.0 is pretrained on billions of text-image pairs spanning diverse taxonomies and knowledge-centric concepts. Comprehensive data collection across hundreds of vertical scenarios, coupled with optimized strategies, ensures stable and large-scale training, with strong generalization. By incorporating a carefully fine-tuned VLM, we perform multi-modal post-training to train T2I and image editing tasks jointly. For inference acceleration, we integrate adversarial distillation, distribution matching, and quantization, as well as speculative decoding. It achieves an inference time of up to 1.8 seconds for generating a 2K image (without an LLM/VLM as PE model). Comprehensive evaluations reveal that Seedream 4.0 can achieve state-of-the-art results on both T2I and multimodal image editing. In particular, it demonstrates exceptional multimodal capabilities in complex tasks, including precise image editing and in-context reasoning; it also allows multi-image reference and can generate multiple output images. This extends traditional T2I systems into a more interactive and multidimensional creative tool, pushing the boundary of generative AI for both creativity and professional applications. Seedream 4.0 is now accessible on https://www.volcengine.com/experience/ark?launch=seedream.
[70] A Contrastive Learning Framework for Breast Cancer Detection
Samia Saeed,Khuram Naveed
Main category: cs.CV
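TL;DR: This study proposes a semi-supervised framework based on contrastive learning that trains ResNet-50 on a small amount of labeled data and a large amount of unlabeled mammograms, uses data augmentation to improve performance, and achieves 96.7% accuracy on the INbreast and MIAS datasets, outperforming existing methods.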
Details
Motivation: Breast cancer is the second leading cause of cancer death worldwide, so early detection is critical for improving treatment outcomes; however, deep learning methods are limited by the lack of large labeled datasets, which hurts their accuracy.
Method: A semi-supervised contrastive learning framework pretrains ResNet-50 with a similarity index on a large amount of unlabeled mammogram data, applies various augmentations and transformations to improve performance, and finally fine-tunes on a small amount of labeled data.
Result: On the INbreast and MIAS benchmark datasets, the method achieves 96.7% accuracy in breast cancer detection, surpassing existing state-of-the-art methods.
Conclusion: The proposed contrastive learning framework makes effective use of unlabeled data, reduces dependence on large labeled datasets, and still reaches high accuracy with few labeled samples, offering a better deep learning solution for early breast cancer detection.
Abstract: Breast cancer, the second leading cause of cancer-related deaths globally, accounts for a quarter of all cancer cases [1]. To lower this death rate, it is crucial to detect tumors early, as early-stage detection significantly improves treatment outcomes. Advances in non-invasive imaging techniques have made early detection possible through computer-aided detection (CAD) systems, which rely on traditional image analysis to identify malignancies. However, there is a growing shift towards deep learning methods due to their superior effectiveness. Despite their potential, deep learning methods often struggle with accuracy due to the limited availability of large labeled datasets for training. To address this issue, our study introduces a Contrastive Learning (CL) framework, which excels with smaller labeled datasets. We train ResNet-50 with a semi-supervised CL approach using a similarity index on a large amount of unlabeled mammogram data. We also apply various augmentations and transformations, which help improve the performance of our approach. Finally, we fine-tune our model on a small set of labeled data, and the resulting model outperforms the existing state of the art. Specifically, we observed a 96.7% accuracy in detecting breast cancer on the benchmark datasets INbreast and MIAS.
[71] Are Foundation Models Ready for Industrial Defect Recognition? A Reality Check on Real-World Data
Simon Baeuerle,Pratik Khanna,Nils Friederich,Angelo Jovin Yamachui Sitcheu,Damir Shakirov,Andreas Steimer,Ralf Mikut
Main category: cs.CV
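TL;DR: This paper examines the potential of foundation models (FMs) for industrial quality inspection and finds that, although these models perform well on public benchmark datasets, all of them fail on real-world industrial image data.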
Details
Motivation: To explore whether foundation models can be used for automated quality inspection in a zero-shot setting, reducing dependence on labeled data and the cost of model deployment.
Method: Several recent foundation models are tested on custom real-world industrial image data and on public image data, and their performance is compared.
Result: All tested foundation models perform poorly on the real-world industrial data, even though they perform well on public benchmark datasets.
Conclusion: Current foundation models cannot yet be applied directly to industrial quality inspection in real-world settings and need further improvement to suit actual industrial environments.
Abstract: Foundation Models (FMs) have shown impressive performance on various text and image processing tasks. They can generalize across domains and datasets in a zero-shot setting. This could make them suitable for automated quality inspection during series manufacturing, where various types of images are being evaluated for many different products. Replacing tedious labeling tasks with a simple text prompt to describe anomalies and utilizing the same models across many products would save significant effort during model setup and implementation. This is a strong advantage over supervised Artificial Intelligence (AI) models, which are trained for individual applications and require labeled training data. We test multiple recent FMs on both custom real-world industrial image data and public image data. We show that all of those models fail on our real-world data, while the very same models perform well on public benchmark datasets.
[72] Shared Neural Space: Unified Precomputed Feature Encoding for Multi-Task and Cross Domain Vision
Jing Li,Oskar Bartosz,Chengyu Wang,Michal Wnuczynski,Dilshan Godaliyadda,Michael Polley
Main category: cs.CV
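TL;DR: Proposes a universal Neural Space (NS) in which an encoder-decoder framework precomputes features for vision and imaging tasks, so that multiple tasks share one feature space, improving efficiency and generalization.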
Details
Motivation: Traditional AI models are tailored to specific high-precision tasks and cannot efficiently handle a series of modular tasks, because each task maps into a different latent space, causing redundancy and inefficiency.
Method: A lightweight CNN-based encoder-decoder architecture builds a universal Neural Space (NS); the encoder learns transformation-aware, generalizable representations so that multiple downstream tasks share one feature space.
Result: The approach is validated on several imaging and vision tasks, including demosaicing, denoising, depth estimation, and semantic segmentation, showing good efficiency, cross-domain generalization, and hardware compatibility.
Conclusion: The proposed Neural Space offers an efficient, lightweight, and scalable multi-task vision framework that reduces model redundancy and lays a foundation for multi-task vision systems.
Abstract: The majority of AI models in imaging and vision are customized to perform a specific high-precision task. However, this strategy is inefficient for applications with a series of modular tasks, since each requires a mapping into a disparate latent domain. To address this inefficiency, we propose a universal Neural Space (NS), where an encoder-decoder framework pre-computes features across vision and imaging tasks. Our encoder learns transformation-aware, generalizable representations, which enable multiple downstream AI modules to share the same feature space. This architecture reduces redundancy, improves generalization across domain shift, and establishes a foundation for efficient multi-task vision pipelines. Furthermore, as opposed to larger transformer backbones, our backbone is lightweight and CNN-based, allowing for wider deployment across hardware. We further demonstrate that imaging and vision modules, such as demosaicing, denoising, depth estimation and semantic segmentation, can be performed efficiently in the NS.
[73] Data-Efficient Stream-Based Active Distillation for Scalable Edge Model Deployment
Dani Manjah,Tim Bary,Benoît Gérin,Benoît Macq,Christophe de Vleeschouwer
Main category: cs.CV
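TL;DR: Proposes an image selection scheme that combines a high-confidence stream-based strategy with a diversity-based approach to maximize edge-device model quality at low transmission cost.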
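To make the selection idea in this entry concrete, here is a minimal sketch of a one-pass filter that keeps a frame only if the teacher is confident about it and the frame differs enough from frames already kept. The function name, the thresholds, the Euclidean distance on toy 2D features, and the budget are all illustrative assumptions; the abstract does not specify the authors' actual criteria.

```python
import math

def select_stream(frames, conf_threshold=0.8, min_distance=0.5, budget=3):
    """Greedy one-pass selection over (frame_id, confidence, feature) tuples.

    Hypothetical rule: keep a frame when the teacher confidence is high
    and its feature is at least `min_distance` from every kept feature.
    """
    kept = []
    for frame_id, confidence, feature in frames:
        if len(kept) >= budget:
            break  # transmission budget exhausted
        if confidence < conf_threshold:
            continue  # low-confidence teacher annotation, skip
        if all(math.dist(feature, f) >= min_distance for _, f in kept):
            kept.append((frame_id, feature))
    return [frame_id for frame_id, _ in kept]

stream = [
    ("a", 0.95, (0.0, 0.0)),
    ("b", 0.97, (0.1, 0.0)),  # confident, but a near-duplicate of "a"
    ("c", 0.40, (5.0, 5.0)),  # diverse, but low confidence
    ("d", 0.90, (1.0, 1.0)),
]
selected = select_stream(stream)
print(selected)
```

In this toy stream only "a" and "d" are kept: "b" is rejected by the diversity check and "c" by the confidence check.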
Details
Motivation: In edge camera systems, a key problem is how to efficiently select the most useful images for training so as to balance model quality against transmission cost.
Method: A high-confidence stream-based strategy is combined with a diversity-based approach to select the best image subset for training under a similar training load.
Result: With fewer data queries, the method significantly improves the performance of lightweight models on edge devices.
Conclusion: An image selection strategy that combines high confidence and diversity improves training effectiveness while keeping transmission overhead under control.
Abstract: Edge camera-based systems are continuously expanding, facing ever-evolving environments that require regular model updates. In practice, complex teacher models are run on a central server to annotate data, which is then used to train smaller models tailored to the edge devices with limited computational power. This work explores how to select the most useful images for training to maximize model quality while keeping transmission costs low. Our work shows that, for a similar training load (i.e., iterations), a high-confidence stream-based strategy coupled with a diversity-based approach produces a high-quality model with minimal dataset queries.
[74] InstructVTON: Optimal Auto-Masking and Natural-Language-Guided Interactive Style Control for Inpainting-Based Virtual Try-On
Julien Han,Shuwen Qiu,Qi Li,Xingzi Xu,Mehmet Saygin Seyfioglu,Kavosh Asadi,Karim Bouyarmane
Main category: cs.CV
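TL;DR: Proposes InstructVTON, an interactive virtual try-on system driven by natural-language instructions that uses vision-language models and image segmentation to generate binary masks automatically, enabling fine-grained and complex garment styling control.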
Details
Motivation: Traditional mask-based virtual try-on methods struggle with complex styling (such as rolled-up sleeves) and require users to draw precise masks by hand, which limits flexibility and usability.
Method: Virtual try-on is modeled as an image-guided inpainting task; a vision-language model (VLM) and an image segmentation model automatically generate binary masks from user-provided images and free-text instructions, and multiple generation rounds are supported to achieve complex try-on results.
Result: InstructVTON needs no hand-drawn precise masks, handles complex try-on scenarios that traditional methods cannot, is compatible with existing virtual try-on models, and achieves state-of-the-art styling control.
Conclusion: By combining VLMs with image segmentation, InstructVTON improves the usability and expressiveness of virtual try-on systems and supports fine-grained garment editing driven by complex instructions.
Abstract: We present InstructVTON, an instruction-following interactive virtual try-on system that allows fine-grained and complex styling control of the resulting generation, guided by natural language, on single or multiple garments. A computationally efficient and scalable formulation of virtual try-on treats the problem as an image-guided or image-conditioned inpainting task. These inpainting-based virtual try-on models commonly use a binary mask to control the generation layout. Producing a mask that yields a desirable result is difficult, requires background knowledge, may be model-dependent, and is in some cases impossible with the masking-based approach (e.g. trying on a long-sleeve shirt with "sleeves rolled up" styling on a person wearing a long-sleeve shirt with sleeves down, where the mask will necessarily cover the entire sleeve). InstructVTON leverages Vision Language Models (VLMs) and image segmentation models for automated binary mask generation. These masks are generated based on user-provided images and free-text style instructions. InstructVTON simplifies the end-user experience by removing the necessity of a precisely drawn mask, and by automating execution of multiple rounds of image generation for try-on scenarios that cannot be achieved with masking-based virtual try-on models alone. We show that InstructVTON is interoperable with existing virtual try-on models to achieve state-of-the-art results with styling control.
[75] Innovative Deep Learning Architecture for Enhanced Altered Fingerprint Recognition
Dana A Abdullah,Dana Rasul Hamad,Bishar Rasheed Ibrahim,Sirwan Abdulwahid Aula,Aso Khaleel Ameen,Sabat Salih Hamadamin
Main category: cs.CV
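TL;DR: This paper proposes DeepAFRNet, a deep learning model for recognizing deliberately altered fingerprints; it achieves high accuracy on the real-altered fingerprint dataset SOCOFing and highlights the importance of threshold selection for biometric systems.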
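The abstract for this entry describes comparing embeddings with cosine similarity and reports that relaxing the decision threshold from 0.92 to 0.72 sharply degrades accuracy. The sketch below shows one plausible mechanism, a looser threshold accepting a non-matching print. The 2D vectors are made-up stand-ins for real embeddings and the helper names are hypothetical; this is not the authors' implementation.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def is_match(u, v, threshold):
    """Declare a match when similarity reaches the threshold."""
    return cosine_similarity(u, v) >= threshold

# Toy embeddings (assumed values, for illustration only).
probe = (1.0, 0.0)
same_finger = (0.95, 0.2)   # similarity about 0.98
other_finger = (0.8, 0.6)   # similarity about 0.80

strict = [is_match(probe, e, 0.92) for e in (same_finger, other_finger)]
relaxed = [is_match(probe, e, 0.72) for e in (same_finger, other_finger)]
print(strict, relaxed)
```

With the strict threshold only the genuine pair matches; with the relaxed threshold the impostor pair is accepted as well, which is the kind of error that lowers accuracy.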
Details
Motivation: Attackers may deliberately alter fingerprint ridges to evade detection, which challenges traditional fingerprint recognition systems, so a method that robustly recognizes altered fingerprints is needed to improve the security of biometric verification.
Method: VGG16 serves as the backbone for extracting high-dimensional features, and cosine similarity is used to compare fingerprint embeddings for matching and recognition.
Result: On the three difficulty levels (Easy, Medium, Hard) of the SOCOFing Real-Altered dataset, strict thresholds give accuracies of 96.7%, 98.76%, and 99.54%, respectively; when the threshold is lowered from 0.92 to 0.72, accuracy drops sharply to 7.86%, 27.05%, and 29.51%.
Conclusion: DeepAFRNet performs strongly on real altered fingerprint recognition, overcomes the reliance of earlier work on synthetic data or limited verification protocols, and shows potential for real-world deployment where security and recognition robustness are critical.
Abstract: Altered fingerprint recognition (AFR) is challenging for biometric verification in applications such as border control, forensics, and fiscal admission. Adversaries can deliberately modify ridge patterns to evade detection, so robust recognition of altered prints is essential. We present DeepAFRNet, a deep learning recognition model that matches and recognizes distorted fingerprint samples. The approach uses a VGG16 backbone to extract high-dimensional features and cosine similarity to compare embeddings. We evaluate on the SOCOFing Real-Altered subset with three difficulty levels (Easy, Medium, Hard). With strict thresholds, DeepAFRNet achieves accuracies of 96.7 percent, 98.76 percent, and 99.54 percent for the three levels. A threshold-sensitivity study shows that relaxing the threshold from 0.92 to 0.72 sharply degrades accuracy to 7.86 percent, 27.05 percent, and 29.51 percent, underscoring the importance of threshold selection in biometric systems. By using real altered samples and reporting per-level metrics, DeepAFRNet addresses limitations of prior work based on synthetic alterations or limited verification protocols, and indicates readiness for real-world deployments where both security and recognition resilience are critical.
[76] Large Pre-Trained Models for Bimanual Manipulation in 3D
Hanna Yurchyk,Wei-Di Chang,Gregory Dudek,David Meger
Main category: cs.CV
TL;DR: This paper proposes integrating attention maps from a pre-trained Vision Transformer into voxel representations to improve bimanual robotic manipulation.
Details
Motivation: To strengthen the focus on key visual regions in bimanual robotic manipulation by using attention maps extracted from a self-supervised model as semantic cues.
Method: Attention maps are extracted from the self-supervised ViT model DINOv2, interpreted as pixel-level saliency scores over RGB images, and lifted into a 3D voxel grid as input features for a behavior cloning policy.
Result: On the RLBench bimanual benchmark, the method yields an average absolute improvement of 8.2% and a relative gain of 21.9%.
Conclusion: Incorporating attention maps into voxel representations effectively improves existing voxel-based policies on complex bimanual manipulation tasks.
Abstract: We investigate the integration of attention maps from a pre-trained Vision Transformer into voxel representations to enhance bimanual robotic manipulation. Specifically, we extract attention maps from DINOv2, a self-supervised ViT model, and interpret them as pixel-level saliency scores over RGB images. These maps are lifted into a 3D voxel grid, resulting in voxel-level semantic cues that are incorporated into a behavior cloning policy. When integrated into a state-of-the-art voxel-based policy, our attention-guided featurization yields an average absolute improvement of 8.2% and a relative gain of 21.9% across all tasks in the RLBench bimanual benchmark.
[77] A Comparative Benchmark of Real-time Detectors for Blueberry Detection towards Precision Orchard Management
Xinyang Mu,Yuzhen Lu,Boyang Deng
Main category: cs.CV
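TL;DR: This study presents a new benchmark analysis of real-time object detectors for blueberry detection, using a new dataset of 661 images with 85,879 labeled instances to evaluate 36 model variants from the YOLO and RT-DETR families. YOLOv12m and RT-DETRv2-X perform best within YOLO and RT-DETR, respectively, and the latter reaches a mAP@50 of 94.8% after semi-supervised fine-tuning. The study also examines the accuracy-speed trade-off and releases the dataset and code.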
Details
Motivation: Blueberry detection in natural environments faces challenges such as variable lighting, occlusion, and motion blur; existing deep learning methods lack sufficiently diverse large-scale datasets, and practical deployment must balance model accuracy, speed, and memory use.
Method: A new blueberry detection dataset is built from 661 canopy images captured with smartphones, with 85,879 labeled instances; 36 models from the YOLO (v8-v12) and RT-DETR (v1-v2) families are benchmarked systematically; and Unbiased Mean Teacher-based semi-supervised learning on 1,035 unlabeled images is used for fine-tuning to improve performance.
Result: YOLOv12m achieves 93.3% mAP@50 and RT-DETRv2-X achieves 93.6% mAP@50, the highest in their respective families; after semi-supervised fine-tuning, RT-DETRv2-X improves further to 94.8% mAP@50; mid-sized models offer a good balance between accuracy and inference speed; some models degraded (by up to 1.4%), indicating that the effect of SSL varies.
Conclusion: RT-DETRv2-X performs best on blueberry detection, and semi-supervised learning can improve it further; mid-sized models are better suited to the accuracy-speed trade-off in practice; more research is needed on using cross-domain unlabeled data effectively; the dataset and code are released to support follow-up work.
Abstract: Blueberry detection in natural environments remains challenging due to variable lighting, occlusions, and motion blur caused by environmental factors and imaging devices. Deep learning-based object detectors promise to address these challenges, but they demand a large-scale, diverse dataset that captures the real-world complexities. Moreover, deploying these models in practical scenarios often requires the right accuracy/speed/memory trade-off in model selection. This study presents a novel comparative benchmark analysis of advanced real-time object detectors, including YOLO (You Only Look Once) (v8-v12) and RT-DETR (Real-Time Detection Transformers) (v1-v2) families, consisting of 36 model variants, evaluated on a newly curated dataset for blueberry detection. This dataset comprises 661 canopy images collected with smartphones during the 2022-2023 seasons, consisting of 85,879 labelled instances (including 36,256 ripe and 49,623 unripe blueberries) across a wide range of lighting conditions, occlusions, and fruit maturity stages. Among the YOLO models, YOLOv12m achieved the best accuracy with a mAP@50 of 93.3%, while RT-DETRv2-X obtained a mAP@50 of 93.6%, the highest among all the RT-DETR variants. The inference time varied with the model scale and complexity, and the mid-sized models appeared to offer a good accuracy-speed balance. To further enhance detection performance, all the models were fine-tuned using Unbiased Mean Teacher-based semi-supervised learning (SSL) on a separate set of 1,035 unlabeled images acquired by a ground-based machine vision platform in 2024. This resulted in accuracy changes ranging from -1.4% to 2.9%, with RT-DETRv2-X achieving the best mAP@50 of 94.8%. More in-depth research into SSL is needed to better leverage cross-domain unlabeled data. Both the dataset and software programs of this study are made publicly available to support further research.
[78] Region-of-Interest Augmentation for Mammography Classification under Patient-Level Cross-Validation
Farbod Bigdeli,Mohsen Mohammadagha,Ali Bigdeli
Main category: cs.CV
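TL;DR: Proposes a lightweight ROI augmentation strategy that, during training, replaces full images with random ROI crops drawn from a label-free bounding-box bank, aiming to improve mammography classification with small, low-resolution datasets.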
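The augmentation described in this entry, replacing the full image with a random ROI crop with some probability and optionally jittering the crop, can be sketched as below. The names `p_roi` and `alpha` follow the abstract, but the interpretation of `alpha` as a per-edge jitter fraction of the box size, the function name, and the box format are assumptions made for illustration.

```python
import random

def sample_view(image_size, bbox_bank, p_roi=0.10, alpha=0.10, rng=random):
    """Return the crop box (x0, y0, x1, y1) for one training sample.

    With probability p_roi a box is drawn from the precomputed bank and
    each edge is shifted by up to alpha * box size (assumed jitter rule);
    otherwise the full image is used.
    """
    width, height = image_size
    if not bbox_bank or rng.random() >= p_roi:
        return (0, 0, width, height)  # keep the full image
    x0, y0, x1, y1 = rng.choice(bbox_bank)
    jitter_x = alpha * (x1 - x0)
    jitter_y = alpha * (y1 - y0)
    x0 += rng.uniform(-jitter_x, jitter_x)
    x1 += rng.uniform(-jitter_x, jitter_x)
    y0 += rng.uniform(-jitter_y, jitter_y)
    y1 += rng.uniform(-jitter_y, jitter_y)
    # Clamp to the image so the crop is always valid.
    x0, x1 = max(0.0, x0), min(float(width), x1)
    y0, y1 = max(0.0, y0), min(float(height), y1)
    return (x0, y0, x1, y1)

bank = [(10, 10, 50, 40)]  # hypothetical label-free box bank
full = sample_view((100, 80), bank, p_roi=0.0, rng=random.Random(0))
crop = sample_view((100, 80), bank, p_roi=1.0, rng=random.Random(0))
print(full, crop)
```

Because the function is only called when building training batches, inference is untouched, which matches the abstract's point that inference-time cost does not change.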
Details
Motivation: Existing deep learning methods for mammography analysis are constrained by low dataset resolution and small sample sizes, which limits model performance.
Method: A training-only ROI augmentation strategy is introduced on the Mini-DDSM dataset, using random ROI crops from a precomputed, label-free bounding-box bank, with optional jitter to increase variability; evaluation uses strict patient-level cross-validation.
Result: On Mini-DDSM, the method (best parameters: p_roi = 0.10, alpha = 0.10) gives a slight average ROC-AUC gain, while PR-AUC is flat or slightly lower; performance varies across folds; training-efficiency metrics show reasonable throughput and GPU memory use.
Conclusion: A simple, data-centric ROI augmentation strategy can improve breast cancer screening models in resource-constrained settings without adding inference cost, extra annotation, or model changes.
Abstract: Breast cancer screening with mammography remains central to early detection and mortality reduction. Deep learning has shown strong potential for automating mammogram interpretation, yet limited-resolution datasets and small sample sizes continue to restrict performance. We revisit the Mini-DDSM dataset (9,684 images; 2,414 patients) and introduce a lightweight region-of-interest (ROI) augmentation strategy. During training, full images are probabilistically replaced with random ROI crops sampled from a precomputed, label-free bounding-box bank, with optional jitter to increase variability. We evaluate under strict patient-level cross-validation and report ROC-AUC, PR-AUC, and training-time efficiency metrics (throughput and GPU memory). Because ROI augmentation is training-only, inference-time cost remains unchanged. On Mini-DDSM, ROI augmentation (best: p_roi = 0.10, alpha = 0.10) yields modest average ROC-AUC gains, with performance varying across folds; PR-AUC is flat to slightly lower. These results demonstrate that simple, data-centric ROI strategies can enhance mammography classification in constrained settings without requiring additional labels or architectural modifications.
[79] Reflect3r: Single-View 3D Stereo Reconstruction Aided by Mirror Reflections
Jing Wu,Zirui Wang,Iro Laina,Victor Adrian Prisacariu
Main category: cs.CV
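TL;DR: This paper proposes using mirror reflections to generate a virtual viewpoint: a physically valid virtual camera is constructed to enable multi-view stereo reconstruction from a single image, and a symmetry-aware loss refines pose estimation; the method applies to both static and dynamic scenes.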
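The geometric fact behind treating a reflection as a second view is that a planar mirror maps the real camera to a virtual one on the other side of the mirror plane. The sketch below reflects a camera centre and viewing direction across a plane; it is textbook reflection geometry with an assumed mirror at x = 2, not the paper's specific transformation. A pure reflection also flips handedness, which a physically valid virtual camera has to account for separately.

```python
def reflect_point(point, normal, offset):
    """Reflect a 3D point across the plane normal . x + offset = 0.

    `normal` is assumed to be a unit vector.
    """
    signed_distance = sum(n * p for n, p in zip(normal, point)) + offset
    return tuple(p - 2.0 * signed_distance * n for p, n in zip(point, normal))

def reflect_direction(direction, normal):
    """Apply the Householder matrix I - 2 n n^T to a direction vector."""
    projection = sum(n * d for n, d in zip(normal, direction))
    return tuple(d - 2.0 * projection * n for d, n in zip(direction, normal))

# Assumed setup: mirror plane x = 2, camera at the origin looking along +x.
mirror_normal = (1.0, 0.0, 0.0)
mirror_offset = -2.0
virtual_center = reflect_point((0.0, 0.0, 0.0), mirror_normal, mirror_offset)
virtual_direction = reflect_direction((1.0, 0.0, 0.0), mirror_normal)
print(virtual_center, virtual_direction)
```

The virtual camera sits at x = 4 and looks back along -x, so the real and virtual cameras form a stereo pair with a baseline of twice the camera-to-mirror distance.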
Details
Motivation: Mirror reflections are common in everyday environments and provide both a real view and a reflected virtual view in one image, yet existing methods do not fully exploit this stereo information. The paper therefore aims to use mirror reflections for robust, generalizable 3D reconstruction from a single image.
Method: The mirror reflection is treated as an auxiliary view; a transformation constructs a physically valid virtual camera so that the virtual view is generated directly in the pixel domain; a symmetry-aware loss refines pose estimation; and the approach is extended to dynamic video frames containing mirror reflections for per-frame geometry recovery.
Result: Extensive experiments on synthetic data (16 Blender scenes) and real-world data validate the effectiveness and robustness of the method for 3D reconstruction and pose estimation.
Conclusion: The method uses mirror reflections to achieve multi-view stereo reconstruction from a single image, simplifies the imaging process, is compatible with feed-forward reconstruction models, and extends to dynamic scenes, offering a new approach to reflection-based geometric understanding.
Abstract: Mirror reflections are common in everyday environments and can provide stereo information within a single capture, as the real and reflected virtual views are visible simultaneously. We exploit this property by treating the reflection as an auxiliary view and designing a transformation that constructs a physically valid virtual camera, allowing direct pixel-domain generation of the virtual view while adhering to the real-world imaging process. This enables a multi-view stereo setup from a single image, simplifying the imaging process and making it compatible with powerful feed-forward reconstruction models for generalizable and robust 3D reconstruction. To further exploit the geometric symmetry introduced by mirrors, we propose a symmetric-aware loss to refine pose estimation. Our framework also naturally extends to dynamic scenes, where each frame contains a mirror reflection, enabling efficient per-frame geometry recovery. For quantitative evaluation, we provide a fully customizable synthetic dataset of 16 Blender scenes, each with ground-truth point clouds and camera poses. Extensive experiments on real-world data and synthetic data are conducted to illustrate the effectiveness of our method.
[80] Recov-Vision: Linking Street View Imagery and Vision-Language Models for Post-Disaster Recovery
Yiming Xiao,Archit Gupta,Miguel Esparza,Yu-Hsuan Ho,Antonia Sebastian,Hannah Weas,Rose Houck,Ali Mostafavi
Main category: cs.CV
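TL;DR: Proposes FacadeTrack, a framework that links street-view video to parcels and extracts building facade attributes to support post-disaster building occupancy assessment with high precision and interpretability.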
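The abstract for this entry reports precision, recall, and F1 for both decision strategies. As a quick consistency check, the F1 score is the harmonic mean of precision and recall, F1 = 2PR / (P + R), and the reported figures can be reproduced from that formula. The snippet uses only the numbers quoted in the abstract.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2.0 * precision * recall / (precision + recall)

two_stage = f1_score(0.927, 0.781)  # two-stage design
one_stage = f1_score(0.943, 0.728)  # one-stage rule baseline
print(round(two_stage, 3), round(one_stage, 3))
```

The two-stage design gives up a little precision for a larger gain in recall, which is why its F1 is higher even though the one-stage rule is the more precise of the two.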
Details
Motivation: Assessing building occupancy after disasters is essential for resource allocation and emergency response, but existing methods fall short in coverage and in capturing detail.
Method: FacadeTrack is a language-guided framework based on panoramic street-view video: it matches video to parcels, rectifies views to facades, and extracts interpretable attributes (such as entry blockage and temporary coverings), then decides using either a one-stage rule or a two-stage strategy that separates perception from reasoning.
Result: In surveys after two hurricanes, the two-stage approach reaches a precision of 0.927, a recall of 0.781, and an F1 score of 0.848, compared with the one-stage baseline at a precision of 0.943, a recall of 0.728, and an F1 score of 0.822; intermediate attributes and spatial diagnostics aid error analysis.
Conclusion: FacadeTrack provides an auditable, scalable building occupancy assessment suitable for integration into geospatial and emergency-management workflows.
Abstract: Building-level occupancy after disasters is vital for triage, inspections, utility re-energization, and equitable resource allocation. Overhead imagery provides rapid coverage but often misses facade and access cues that determine habitability, while street-view imagery captures those details but is sparse and difficult to align with parcels. We present FacadeTrack, a street-level, language-guided framework that links panoramic video to parcels, rectifies views to facades, and elicits interpretable attributes (for example, entry blockage, temporary coverings, localized debris) that drive two decision strategies: a transparent one-stage rule and a two-stage design that separates perception from conservative reasoning. Evaluated across two post-Hurricane Helene surveys, the two-stage approach achieves a precision of 0.927, a recall of 0.781, and an F1 score of 0.848, compared with the one-stage baseline at a precision of 0.943, a recall of 0.728, and an F1 score of 0.822. Beyond accuracy, intermediate attributes and spatial diagnostics reveal where and why residual errors occur, enabling targeted quality control. The pipeline provides auditable, scalable occupancy assessments suitable for integration into geospatial and emergency-management workflows.
[81] Human Semantic Representations of Social Interactions from Moving Shapes
Yiling Yun,Hongjing Lu
Main category: cs.CV
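TL;DR: This study examines how semantic representations complement visual features when humans recognize social interactions displayed by simple moving shapes. Across two studies, verb-based semantic embeddings best explain human similarity judgments, suggesting that social perception reflects the semantic structure of social interactions.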
Details
Motivation: To understand which semantic representations, beyond visual features, humans rely on when recognizing social interactions in simple moving shapes.
Method: In Study 1, participants labeled animations based on their impressions of the moving shapes; in Study 2, the representational geometry of 27 social interactions was measured through human similarity judgments and compared with model predictions based on visual features, labels, and semantic embeddings of description text.
Result: Human responses were distributed; semantic models provided information complementary to visual features, and verb-based semantic embeddings best explained human similarity judgments.
Conclusion: Social perception in simple dynamic displays depends not only on visual features but also reflects the semantic structure of social interactions, bridging visual and abstract representations.
Abstract: Humans are social creatures who readily recognize various social interactions from simple displays of moving shapes. While previous research has often focused on visual features, we examine which semantic representations humans employ to complement visual features. In Study 1, we directly asked human participants to label the animations based on their impression of moving shapes. We found that human responses were distributed. In Study 2, we measured the representational geometry of 27 social interactions through human similarity judgments and compared it with model predictions based on visual features, labels, and semantic embeddings from animation descriptions. We found that semantic models provided complementary information to visual features in explaining human judgments. Among the semantic models, verb-based embeddings extracted from descriptions best account for human similarity judgments. These results suggest that social perception in simple displays reflects the semantic structure of social interactions, bridging visual and abstract representations.
[82] Enhancing Cross-View Geo-Localization Generalization via Global-Local Consistency and Geometric Equivariance
Xiaowei Wang,Di Wang,Ke Li,Yifeng Wang,Chengjian Wang,Libin Sun,Zhihong Wu,Yiming Zhang,Quan Wang
Main category: cs.CV
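TL;DR: Proposes EGS, a new cross-view geo-localization framework that uses an E(2)-Steerable CNN and a graph with a virtual super-node to improve cross-domain generalization, reaching state-of-the-art performance on multiple benchmarks.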
Details
Motivation: Existing methods lack robustness under severe appearance changes and struggle to capture global semantics and local details together, which limits cross-domain generalization.
Method: An E(2)-Steerable CNN extracts features that are robust to rotation and viewpoint changes, and a graph with a virtual super-node aggregates and redistributes global semantic information to strengthen global-local consistency.
Result: Experiments on the University-1652 and SUES-200 datasets show that EGS substantially outperforms existing methods and sets a new state of the art.
Conclusion: EGS effectively improves the cross-domain generalization of cross-view geo-localization and achieves more robust feature matching through its structural design.
Abstract: Cross-view geo-localization (CVGL) aims to match images of the same location captured from drastically different viewpoints. Despite recent progress, existing methods still face two key challenges: (1) achieving robustness under severe appearance variations induced by diverse UAV orientations and fields of view, which hinders cross-domain generalization, and (2) establishing reliable correspondences that capture both global scene-level semantics and fine-grained local details. In this paper, we propose EGS, a novel CVGL framework designed to enhance cross-domain generalization. Specifically, we introduce an E(2)-Steerable CNN encoder to extract stable and reliable features under rotation and viewpoint shifts. Furthermore, we construct a graph with a virtual super-node that connects to all local nodes, enabling global semantics to be aggregated and redistributed to local regions, thereby enforcing global-local consistency. Extensive experiments on the University-1652 and SUES-200 benchmarks demonstrate that EGS consistently achieves substantial performance gains and establishes a new state of the art in cross-domain CVGL.
[83] DENet: Dual-Path Edge Network with Global-Local Attention for Infrared Small Target Detection
Jiayi Zuo,Songwei Pei,Qian Li
Main category: cs.CV
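TL;DR: Proposes a Dual-Path Edge Network that decouples edge enhancement from semantic modeling and combines local and global self-attention with multi-scale Taylor finite difference operators to improve the accuracy and robustness of infrared small target detection.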
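For readers unfamiliar with finite difference operators, the sketch below computes an edge map from central differences (the second-order formula obtained from a Taylor expansion) at several pixel steps and keeps the strongest response per pixel. This is a generic illustration: the choice of central differences, the step sizes, and the max-combination are assumptions, and the paper's cascaded operators and attention gating are not reproduced here.

```python
def central_difference_edges(image, step):
    """Gradient magnitude from central differences with the given pixel step.

    Border pixels closer than `step` to the edge are left at zero.
    """
    height, width = len(image), len(image[0])
    edges = [[0.0] * width for _ in range(height)]
    for y in range(step, height - step):
        for x in range(step, width - step):
            gx = (image[y][x + step] - image[y][x - step]) / (2.0 * step)
            gy = (image[y + step][x] - image[y - step][x]) / (2.0 * step)
            edges[y][x] = (gx * gx + gy * gy) ** 0.5
    return edges

def multi_scale_edges(image, steps=(1, 2)):
    """Per-pixel maximum of the edge maps computed at each step size."""
    maps = [central_difference_edges(image, s) for s in steps]
    height, width = len(image), len(image[0])
    return [[max(m[y][x] for m in maps) for x in range(width)]
            for y in range(height)]

# Toy 9x9 frame with a single bright pixel standing in for a small target.
image = [[0.0] * 9 for _ in range(9)]
image[4][4] = 1.0
edges = multi_scale_edges(image)
print(edges[4][3], edges[4][2], edges[4][4])
```

The step-1 operator responds right next to the target, while the step-2 operator adds a weaker response one pixel further out, which is the basic reason for combining several scales.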
Details
Motivation: Infrared small targets lack distinctive texture and morphological features and easily blend into complex backgrounds; existing methods struggle to extract target edges accurately under low contrast and high noise, and spatial detail conflicts with semantic context.
Method: A dual-path structure is designed: one path uses a Bidirectional Interaction Module (combining local and global self-attention) to capture multi-scale feature dependencies; the other introduces a Multi-Edge Refiner that uses cascaded multi-scale Taylor finite difference operators and an attention gating mechanism to enhance edge detail and suppress noise.
Result: The method outperforms existing methods on multiple datasets, locating small targets of different sizes more precisely while effectively suppressing background interference.
Conclusion: By fusing structural semantics with edge refinement, the dual-path framework offers an efficient and robust solution for infrared small target detection.
Abstract: Infrared small target detection is crucial for remote sensing applications like disaster warning and maritime surveillance. However, due to the lack of distinctive texture and morphological features, infrared small targets are highly susceptible to blending into cluttered and noisy backgrounds. A fundamental challenge in designing deep models for this task lies in the inherent conflict between capturing high-resolution spatial details for minute targets and extracting robust semantic context for larger targets, often leading to feature misalignment and suboptimal performance. Existing methods often rely on fixed gradient operators or simplistic attention mechanisms, which are inadequate for accurately extracting target edges under low contrast and high noise. In this paper, we propose a novel Dual-Path Edge Network that explicitly addresses this challenge by decoupling edge enhancement and semantic modeling into two complementary processing paths. The first path employs a Bidirectional Interaction Module, which uses both Local Self-Attention and Global Self-Attention to capture multi-scale local and global feature dependencies. The global attention mechanism, based on a Transformer architecture, integrates long-range semantic relationships and contextual information, ensuring robust scene understanding. The second path introduces the Multi-Edge Refiner, which enhances fine-grained edge details using cascaded Taylor finite difference operators at multiple scales. This mathematical approach, along with an attention-driven gating mechanism, enables precise edge localization and feature enhancement for targets of varying sizes, while effectively suppressing noise. Our method provides a promising solution for precise infrared small target detection and localization, combining structural semantics and edge refinement in a unified framework.
[84] Beyond the Individual: Introducing Group Intention Forecasting with SHOT Dataset
Ruixu Zhang,Yuran Wang,Xinyi Hu,Chaoyu Mai,Wenxuan Liu,Danni Xu,Xian Zhong,Zheng Wang
Main category: cs.CV
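TL;DR: This paper introduces Group Intention Forecasting (GIF) as a new task and builds SHOT, the first large-scale dataset for it, together with the GIFT forecasting framework, which predicts the emergence of group intentions from individual actions and interactions before the collective goal becomes apparent.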
Details
Motivation: Traditional intention recognition focuses on individual intentions and overlooks the complexity of collective intentions in group settings. The authors therefore introduce the concept of group intention to capture shared goals that emerge when several people act together.
Method: The SHOT dataset contains 1,979 basketball video clips covering 5 camera views and annotated with 6 types of individual attributes; the GIFT framework extracts fine-grained individual features and models evolving group relationships to forecast intention emergence.
Result: Experiments confirm the effectiveness of the SHOT dataset and the GIFT framework on group intention forecasting, laying a foundation for future research in this area.
Conclusion: Group intention forecasting is a promising new direction, and SHOT and GIFT provide effective tools and methods for studying how intentions emerge in group behavior.
Abstract: Intention recognition has traditionally focused on individual intentions, overlooking the complexities of collective intentions in group settings. To address this limitation, we introduce the concept of group intention, which represents shared goals emerging through the actions of multiple individuals, and Group Intention Forecasting (GIF), a novel task that forecasts when group intentions will occur by analyzing individual actions and interactions before the collective goal becomes apparent. To investigate GIF in a specific scenario, we propose SHOT, the first large-scale dataset for GIF, consisting of 1,979 basketball video clips captured from 5 camera views and annotated with 6 types of individual attributes. SHOT is designed with 3 key characteristics: multi-individual information, multi-view adaptability, and multi-level intention, making it well-suited for studying emerging group intentions. Furthermore, we introduce GIFT (Group Intention ForecasTer), a framework that extracts fine-grained individual features and models evolving group dynamics to forecast intention emergence. Experimental results confirm the effectiveness of SHOT and GIFT, establishing a strong foundation for future research in group intention forecasting. The dataset is available at https://xinyi-hu.github.io/SHOT_DATASET.
[85] Neptune-X: Active X-to-Maritime Generation for Universal Maritime Object Detection
Yu Guo,Shengfeng He,Yuxu Lu,Haonan An,Yihang Tao,Huilin Zhu,Jingxian Liu,Yuguang Fang
Main category: cs.CV
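TL;DR: This paper proposes Neptune-X, a data-centric generative-selection framework that improves maritime object detection through task-aware sample selection and the multi-modality-conditioned generative model X-to-Maritime, with particular gains in underrepresented scenarios such as open-sea environments.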
Details
Motivation: Annotated maritime data is scarce and existing models generalize poorly across maritime attributes, performing especially badly in scenarios such as the open sea, so detection in diverse and underrepresented scenarios needs to be improved.
Method: The Neptune-X framework comprises the X-to-Maritime multi-modality-conditioned generative model and Attribute-correlated Active Sampling; X-to-Maritime introduces a Bidirectional Object-Water Attention module to improve visual realism, and training is optimized by dynamically selecting task-relevant synthetic samples.
Result: Experiments show that the method outperforms existing approaches in both maritime scene synthesis and detection accuracy, with marked gains in challenging and underrepresented environments; the authors also release the Maritime Generation Dataset, the first dataset for generative maritime learning.
Conclusion: By synthesizing diverse, realistic maritime scenes and combining them with task-aware sample selection, Neptune-X mitigates data scarcity and poor generalization and sets a new benchmark for maritime object detection.
Abstract: Maritime object detection is essential for navigation safety, surveillance, and autonomous operations, yet constrained by two key challenges: the scarcity of annotated maritime data and poor generalization across various maritime attributes (e.g., object category, viewpoint, location, and imaging environment). In particular, models trained on existing datasets often underperform in underrepresented scenarios such as open-sea environments. To address these challenges, we propose Neptune-X, a data-centric generative-selection framework that enhances training effectiveness by leveraging synthetic data generation with task-aware sample selection. From the generation perspective, we develop X-to-Maritime, a multi-modality-conditioned generative model that synthesizes diverse and realistic maritime scenes. A key component is the Bidirectional Object-Water Attention module, which captures boundary interactions between objects and their aquatic surroundings to improve visual fidelity. To further improve downstream task performance, we propose Attribute-correlated Active Sampling, which dynamically selects synthetic samples based on their task relevance. To support robust benchmarking, we construct the Maritime Generation Dataset, the first dataset tailored for generative maritime learning, encompassing a wide range of semantic conditions. Extensive experiments demonstrate that our approach sets a new benchmark in maritime scene synthesis, significantly improving detection accuracy, particularly in challenging and previously underrepresented settings. The code is available at https://github.com/gy65896/Neptune-X.
[86] AI-Enabled Crater-Based Navigation for Lunar Mapping
Sofia McLeod,Chee-Kheng Chng,Matthew Rodda,Tat-Jun Chin
Main category: cs.CV
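TL;DR: This paper presents STELLA, the first end-to-end crater-based navigation (CBN) system for long-duration lunar mapping missions, and releases CRESENT-365, a public dataset emulating a year-long mission, on which STELLA achieves metre-level position accuracy and sub-degree attitude accuracy under challenging conditions.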
Details
Motivation: Existing crater-based navigation research focuses mainly on the landing phase and does not carry over to long-duration lunar mapping missions, whose imagery is sparse, oblique, and subject to large illumination changes; a navigation system suited to that setting is needed. Method: STELLA combines a Mask R-CNN crater detector, a descriptor-less crater identification module, a robust perspective-n-crater (PnC) pose solver, and a batch orbit determination back-end, and is tested on the newly built CRESENT-365 dataset. Result: Experiments show that STELLA achieves, on average, metre-level position accuracy and sub-degree attitude accuracy across viewing angles, illumination conditions, and latitudes, giving the first comprehensive evaluation of CBN in a realistic lunar mapping setting. Conclusion: STELLA offers a viable navigation solution for long-duration lunar mapping, and its results can guide the choice of operating conditions for future missions. Abstract: Crater-Based Navigation (CBN) uses the ubiquitous impact craters of the Moon observed on images as natural landmarks to determine the six degrees of freedom pose of a spacecraft. To date, CBN has primarily been studied in the context of powered descent and landing. These missions are typically short in duration, with high-frequency imagery captured from a nadir viewpoint over well-lit terrain. In contrast, lunar mapping missions involve sparse, oblique imagery acquired under varying illumination conditions over potentially year-long campaigns, posing significantly greater challenges for pose estimation. We bridge this gap with STELLA - the first end-to-end CBN pipeline for long-duration lunar mapping. STELLA combines a Mask R-CNN-based crater detector, a descriptor-less crater identification module, a robust perspective-n-crater pose solver, and a batch orbit determination back-end. To rigorously test STELLA, we introduce CRESENT-365 - the first public dataset that emulates a year-long lunar mapping mission. Each of its 15,283 images is rendered from high-resolution digital elevation models with SPICE-derived Sun angles and Moon motion, delivering realistic global coverage, illumination cycles, and viewing geometries. Experiments on CRESENT+ and CRESENT-365 show that STELLA maintains metre-level position accuracy and sub-degree attitude accuracy on average across wide ranges of viewing angles, illumination conditions, and lunar latitudes. These results constitute the first comprehensive assessment of CBN in a true lunar mapping setting and inform operational conditions that should be considered for future missions.
[87] Seeing Through Words, Speaking Through Pixels: Deep Representational Alignment Between Vision and Language Models
Zoe Wanying He,Sean Trott,Meenakshi Khosla
Main category: cs.CV
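TL;DR: The study shows that deep vision and language models, although trained on separate modalities, have partially aligned representation spaces; the alignment peaks in mid-to-late layers, is semantic in nature, agrees with human judgments, and is strengthened by exemplar aggregation.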
Details
Motivation: To investigate how unimodal models form shared representations without cross-modal training: where alignment emerges, which cues support it, whether it reflects human preferences, and how exemplar aggregation affects it. Method: Analyzes representational alignment between vision and language models layer by layer, tests its robustness by altering semantics or appearance, and designs a forced-choice "Pick-a-Pic" task to check whether the models' judgments on many-to-many image-text matching agree with human preferences. Result: Alignment is strongest in mid-to-late layers and depends on semantics rather than appearance; human preferences in image-text matching are mirrored in the models' embedding spaces, and the pattern holds bidirectionally when multiple captions correspond to a single image; averaging the embeddings of multiple exemplars strengthens alignment rather than weakening it. Conclusion: Unimodal networks spontaneously converge on a shared semantic code consistent with human judgments, and exemplar aggregation strengthens that alignment. Abstract: Recent studies show that deep vision-only and language-only models--trained on disjoint modalities--nonetheless project their inputs into a partially aligned representational space. Yet we still lack a clear picture of where in each network this convergence emerges, what visual or linguistic cues support it, whether it captures human preferences in many-to-many image-text scenarios, and how aggregating exemplars of the same concept affects alignment. Here, we systematically investigate these questions. We find that alignment peaks in mid-to-late layers of both model types, reflecting a shift from modality-specific to conceptually shared representations. This alignment is robust to appearance-only changes but collapses when semantics are altered (e.g., object removal or word-order scrambling), highlighting that the shared code is truly semantic. Moving beyond the one-to-one image-caption paradigm, a forced-choice "Pick-a-Pic" task shows that human preferences for image-caption matches are mirrored in the embedding spaces across all vision-language model pairs. This pattern holds bidirectionally when multiple captions correspond to a single image, demonstrating that models capture fine-grained semantic distinctions akin to human judgments. Surprisingly, averaging embeddings across exemplars amplifies alignment rather than blurring detail. Together, our results demonstrate that unimodal networks converge on a shared semantic code that aligns with human judgments and strengthens with exemplar aggregation.
[88] FreeInsert: Personalized Object Insertion with Geometric and Style Control
Yuhong Zhang,Han Wang,Yiwen Wang,Rong Xie,Li Song
Main category: cs.CV
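TL;DR: Proposes FreeInsert, a training-free framework that uses 3D geometric information to insert customized objects into arbitrary scenes, addressing the insufficient geometric control and style inconsistency of existing image editing methods.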
Details
Motivation: Existing image editing methods offer insufficient geometric control and poor style consistency in personalized image composition, and there is no effective way to insert objects without extensive training. Method: Uses existing 3D generation models to convert the 2D object into 3D, edits it interactively at the 3D level, re-renders it into a 2D image from a specified view, and combines the result with diffusion adapters to control geometry, style, and content. Result: Achieves precise geometric control over the inserted object and style consistency with the background, producing high-quality, realistic edited images without additional training. Conclusion: FreeInsert is an effective training-free solution for object insertion with text-to-image diffusion models and performs well in both geometric control and style consistency. Abstract: Text-to-image diffusion models have made significant progress in image generation, allowing for effortless customized generation. However, existing image editing methods still face certain limitations when dealing with personalized image composition tasks. First, there is the issue of lack of geometric control over the inserted objects. Current methods are confined to 2D space and typically rely on textual instructions, making it challenging to maintain precise geometric control over the objects. Second, there is the challenge of style consistency. Existing methods often overlook the style consistency between the inserted object and the background, resulting in a lack of realism. In addition, the challenge of inserting objects into images without extensive training remains significant. To address these issues, we propose FreeInsert, a novel training-free framework that customizes object insertion into arbitrary scenes by leveraging 3D geometric information. Benefiting from the advances in existing 3D generation models, we first convert the 2D object into 3D, perform interactive editing at the 3D level, and then re-render it into a 2D image from a specified view. This process introduces geometric controls such as shape or view. The rendered image, serving as geometric control, is combined with style and content control achieved through diffusion adapters, ultimately producing geometrically controlled, style-consistent edited images via the diffusion model.
[89] CusEnhancer: A Zero-Shot Scene and Controllability Enhancement Method for Photo Customization via ResInversion
Maoye Ren,Praneetha Vaddamanu,Jianjin Xu,Fernando De la Torre Frade
Main category: cs.CV
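TL;DR: This paper proposes CustomEnhancer, a framework that augments existing identity customization models in a zero-shot manner by combining face swapping techniques with a pretrained diffusion model, unifying generation and reconstruction, and introduces ResInversion, which substantially reduces inversion time.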
Details
Motivation: Existing text-to-image diffusion models suffer from degraded scenes, insufficient control, and suboptimal perceptual identity when generating realistic human photos, so a more efficient, controllable, and faithful personalization method is needed. Method: Proposes the CustomEnhancer framework, whose triple-flow fused PerGeneration approach combines two counter-directional latent spaces to manipulate the pivotal space of a personalized model; it also introduces ResInversion, which performs noise rectification to cut the time needed for inversion. Result: Experiments show that CustomEnhancer reaches SOTA results in scene diversity, identity fidelity, and training-free control, and that ResInversion is 129 times faster than NTI. Conclusion: CustomEnhancer delivers efficient, precisely controlled personalized portrait generation without training additional controllers, and markedly improves generation quality and efficiency. Abstract: Recently, remarkable progress has been made in synthesizing realistic human photos using text-to-image diffusion models. However, current approaches face degraded scenes, insufficient control, and suboptimal perceptual identity. We introduce CustomEnhancer, a novel framework to augment existing identity customization models. CustomEnhancer is a zero-shot enhancement pipeline that leverages face swapping techniques and a pretrained diffusion model to obtain additional representations in a zero-shot manner for encoding into personalized models. Through our proposed triple-flow fused PerGeneration approach, which identifies and combines two compatible counter-directional latent spaces to manipulate a pivotal space of the personalized model, we unify the generation and reconstruction processes, realizing generation from three flows. Our pipeline also enables comprehensive training-free control over the generation process of personalized models, offering precise controlled personalization for them and eliminating the need for per-model controller retraining. Besides, to address the high time complexity of null-text inversion (NTI), we introduce ResInversion, a novel inversion method that performs noise rectification via a pre-diffusion mechanism, reducing the inversion time by 129 times. Experiments demonstrate that CustomEnhancer reaches SOTA results in scene diversity, identity fidelity, and training-free control, while also showing the efficiency of our ResInversion over NTI. The code will be made publicly available upon paper acceptance.
[90] CompressAI-Vision: Open-source software to evaluate compression methods for computer vision tasks
Hyomin Choi,Heeji Han,Chris Rosewarne,Fabien Racapé
Main category: cs.CV
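TL;DR: CompressAI-Vision is an evaluation platform for video compression technology optimized for vision tasks; it supports remote and split inference scenarios and, as open-source software, has been adopted by MPEG for developing the Feature Coding for Machines (FCM) standard.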
Details
Motivation: As neural-network-based computer vision applications multiply, a unified platform is needed to evaluate compression methods optimized for downstream vision tasks. Method: Introduces the CompressAI-Vision platform, which integrates several standard codecs and evaluates the trade-off between bit-rate and task accuracy on different datasets. Result: The platform demonstrates compression gains on several datasets and has been adopted by MPEG for developing the FCM standard. Conclusion: CompressAI-Vision provides an open, unified evaluation framework for machine-vision-oriented compression and supports the development of related standards. Abstract: With the increasing use of neural network (NN)-based computer vision applications that process image and video data as input, interest has emerged in video compression technology optimized for computer vision tasks. In fact, given the variety of vision tasks, associated NN models and datasets, a consolidated platform is needed as a common ground to implement and evaluate compression methods optimized for downstream vision tasks. CompressAI-Vision is introduced as a comprehensive evaluation platform where new coding tools compete to efficiently compress the input of a vision network while retaining task accuracy in the context of two different inference scenarios: "remote" and "split" inferencing. Our study showcases various use cases of the evaluation platform incorporated with standard codecs (under development) by examining the compression gain on several datasets in terms of bit-rate versus task accuracy. This evaluation platform has been developed as open-source software and is adopted by the Moving Picture Experts Group (MPEG) for the development of the Feature Coding for Machines (FCM) standard. The software is available publicly at https://github.com/InterDigitalInc/CompressAI-Vision.
[91] Dual-supervised Asymmetric Co-training for Semi-supervised Medical Domain Generalization
Jincai Song,Haipeng Chen,Jun Qin,Na Zhao
Main category: cs.CV
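TL;DR: This paper proposes a dual-supervised asymmetric co-training (DAC) framework for cross-domain semi-supervised domain generalization (CD-SSDG) in medical image segmentation; feature-level supervision and asymmetric self-supervised tasks improve generalization when labeled and unlabeled data are separated by a domain shift.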
Details
Motivation: Conventional SSDG methods assume that every source domain has both labeled and unlabeled data, which rarely holds in practice; this paper targets the more realistic CD-SSDG setting, in which labeled training data is limited and subject to domain shift. Method: Proposes the DAC framework, in which two sub-models supervise each other with cross pseudo-labels; feature-level supervision is added to counter the inaccurate pseudo-labels caused by domain shift, and each sub-model is given a different auxiliary self-supervised task to strengthen domain-invariant feature learning and prevent model collapse. Result: Experiments on real-world medical image datasets (Fundus, Polyp, SCGM) show that DAC markedly improves segmentation performance and generalization over existing methods. Conclusion: DAC handles the domain shift and pseudo-label errors of CD-SSDG effectively and shows strong robustness and practical potential across several medical image segmentation tasks. Abstract: Semi-supervised domain generalization (SSDG) in medical image segmentation offers a promising solution for generalizing to unseen domains during testing, addressing domain shift challenges and minimizing annotation costs. However, conventional SSDG methods assume labeled and unlabeled data are available for each source domain in the training set, a condition that is not always met in practice. The coexistence of limited annotation and domain shift in the training set is a prevalent issue. Thus, this paper explores a more practical and challenging scenario, cross-domain semi-supervised domain generalization (CD-SSDG), where domain shifts occur between labeled and unlabeled training data, in addition to shifts between training and testing sets. Existing SSDG methods exhibit sub-optimal performance under such domain shifts because of inaccurate pseudolabels. To address this issue, we propose a novel dual-supervised asymmetric co-training (DAC) framework tailored for CD-SSDG. Building upon the co-training paradigm with two sub-models offering cross pseudo supervision, our DAC framework integrates extra feature-level supervision and asymmetric auxiliary tasks for each sub-model. This feature-level supervision serves to address inaccurate pseudo supervision caused by domain shifts between labeled and unlabeled data, utilizing complementary supervision from the rich feature space. Additionally, two distinct auxiliary self-supervised tasks are integrated into each sub-model to enhance domain-invariant discriminative feature learning and prevent model collapse. Extensive experiments on real-world medical image segmentation datasets, i.e., Fundus, Polyp, and SCGM, demonstrate the robust generalizability of the proposed DAC framework.
[92] Real-Time Object Detection Meets DINOv3
Shihua Huang,Yongjie Hou,Longfei Liu,Xuanlong Yu,Xi Shen
Main category: cs.CV
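TL;DR: This paper proposes DEIMv2, a real-time object detection framework built on DINOv3 features; with a Spatial Tuning Adapter (STA) and lightweight designs, it achieves a superior performance-cost trade-off across multiple model sizes and clearly surpasses existing methods.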
Details
Motivation: To further improve real-time DETR performance and enable efficient deployment on different hardware platforms, the authors extend the existing DEIM framework with stronger pretrained features and a more flexible way of generating multi-scale features. Method: Adopts DINOv3-pretrained or distilled backbones and introduces a Spatial Tuning Adapter (STA) that converts the single-scale output into multi-scale features; the ultra-lightweight models use HGNetv2 with depth and width pruning; all variants share a simplified decoder and an upgraded Dense O2O strategy. Result: DEIMv2 reaches SOTA results across model sizes: DEIMv2-X achieves 57.8 AP with 50.3M parameters; DEIMv2-S exceeds 50 AP with 9.71M parameters; DEIMv2-Pico reaches 38.5 AP with only 1.5M parameters, matching YOLOv10-Nano with fewer parameters. Conclusion: With a unified design, DEIMv2 achieves a strong performance-cost balance across a wide range of deployment scenarios and sets a new reference point for real-time DETRs. Abstract: Benefiting from the simplicity and effectiveness of Dense O2O and MAL, DEIM has become the mainstream training framework for real-time DETRs, significantly outperforming the YOLO series. In this work, we extend it with DINOv3 features, resulting in DEIMv2. DEIMv2 spans eight model sizes from X to Atto, covering GPU, edge, and mobile deployment. For the X, L, M, and S variants, we adopt DINOv3-pretrained or distilled backbones and introduce a Spatial Tuning Adapter (STA), which efficiently converts DINOv3's single-scale output into multi-scale features and complements strong semantics with fine-grained details to enhance detection. For ultra-lightweight models (Nano, Pico, Femto, and Atto), we employ HGNetv2 with depth and width pruning to meet strict resource budgets. Together with a simplified decoder and an upgraded Dense O2O, this unified design enables DEIMv2 to achieve a superior performance-cost trade-off across diverse scenarios, establishing new state-of-the-art results. Notably, our largest model, DEIMv2-X, achieves 57.8 AP with only 50.3 million parameters, surpassing prior X-scale models that require over 60 million parameters for just 56.5 AP. On the compact side, DEIMv2-S is the first sub-10 million model (9.71 million) to exceed the 50 AP milestone on COCO, reaching 50.9 AP. Even the ultra-lightweight DEIMv2-Pico, with just 1.5 million parameters, delivers 38.5 AP, matching YOLOv10-Nano (2.3 million) with around 50 percent fewer parameters.
[93] DAC-LoRA: Dynamic Adversarial Curriculum for Efficient and Robust Few-Shot Adaptation
Ved Umrajkar
Main category: cs.CV
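TL;DR: Proposes DAC-LoRA, a framework that integrates adversarial training into parameter-efficient fine-tuning (PEFT); a curriculum of progressively stronger attacks improves the adversarial robustness of vision-language models such as CLIP without significantly compromising clean accuracy.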
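A curriculum of progressively stronger attacks needs some measure of how strong the current attack is, and this entry's abstract names the First-Order Stationary Condition (FOSC) for that role. The sketch below computes the FOSC value as it is commonly defined for an L-infinity ball of radius eps, c(x_k) = eps * ||g||_1 - <x_k - x_0, g>, where g is the loss gradient at x_k. The function name `fosc` and the toy numbers are illustrative assumptions rather than the paper's code, and how DAC-LoRA turns this value into a schedule is not specified in this entry.

```python
def fosc(x_adv, x_clean, grad, eps):
    """FOSC value for an L-infinity ball of radius eps.

    Smaller values mean a stronger attack; 0 means the perturbation
    already maximizes the linearized loss inside the ball.
    """
    l1_norm = sum(abs(g) for g in grad)
    inner = sum((xa - xc) * g for xa, xc, g in zip(x_adv, x_clean, grad))
    return eps * l1_norm - inner


def sign(v):
    """Return -1, 0, or 1 according to the sign of v."""
    return (v > 0) - (v < 0)


# Toy example with hypothetical numbers: a 3-dimensional input and gradient.
x_clean = [0.2, 0.5, 0.8]
grad = [0.3, -0.1, 0.4]
eps = 0.1

# No perturbation at all: the FOSC value equals eps * ||grad||_1.
no_attack = fosc(x_clean, x_clean, grad, eps)

# One-step sign attack that moves every coordinate to the ball's boundary.
x_sign = [xc + eps * sign(g) for xc, g in zip(x_clean, grad)]
full_attack = fosc(x_sign, x_clean, grad, eps)
```

A curriculum could, for instance, start training with attacks that have a large FOSC value and lower the target value over time; that scheduling rule is an assumption here, not a detail taken from the abstract.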
Details
Motivation: Existing VLMs are vulnerable to adversarial attacks in safety-critical applications, and CLIP-based models in particular carry cascading risk, so robustness must be improved while keeping fine-tuning efficient. Method: Proposes DAC-LoRA, which combines the First-Order Stationary Condition (FOSC) with a TRADES-inspired loss to design a dynamic adversarial curriculum, training with progressively stronger attacks during PEFT procedures such as LoRA. Result: Substantially improves adversarial robustness while preserving accuracy, applies to a range of iterative attacks, and integrates easily into a standard PEFT pipeline. Conclusion: DAC-LoRA offers a lightweight, efficient, and general adversarial training scheme for VLMs that balances robustness and performance and supports reliable deployment in safety-critical domains. Abstract: Vision-Language Models (VLMs) are foundational to critical applications like autonomous driving, medical diagnosis, and content moderation. While Parameter-Efficient Fine-Tuning (PEFT) methods like LoRA enable their efficient adaptation to specialized tasks, these models remain vulnerable to adversarial attacks that can compromise safety-critical decisions. CLIP, the backbone for numerous downstream VLMs, is a high-value target whose vulnerabilities can cascade across the multimodal AI ecosystem. We propose Dynamic Adversarial Curriculum DAC-LoRA, a novel framework that integrates adversarial training into PEFT. The core principle of our method, i.e., an intelligent curriculum of progressively challenging attacks, is general and can potentially be applied to any iterative attack method. Guided by the First-Order Stationary Condition (FOSC) and a TRADES-inspired loss, DAC-LoRA achieves substantial improvements in adversarial robustness without significantly compromising clean accuracy. Our work presents an effective, lightweight, and broadly applicable method to demonstrate that the DAC-LoRA framework can be easily integrated into a standard PEFT pipeline to significantly enhance robustness.
[94] Federated Domain Generalization with Domain-specific Soft Prompts Generation
Jianhan Wu,Xiaoyang Qu,Zhangcheng Huang,Jianzong Wang
Main category: cs.CV
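TL;DR: Proposes FedDSPG, a federated domain generalization method based on generating domain-specific soft prompts; by introducing these prompts during training and combining content and domain knowledge, it outperforms existing methods on several public datasets.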
Details
Motivation: The soft prompts produced by existing prompt-learning methods for federated domain generalization lack diversity and ignore information from unknown domains, which limits generalization on downstream tasks. Method: Proposes FedDSPG, which introduces domain-specific soft prompts (DSPs) for each domain during training and integrates content and domain knowledge into a generative model shared among clients; at inference, the generator produces DSPs for unseen target domains to guide downstream tasks. Result: Experiments on several public datasets show that the method outperforms existing strong baselines in federated domain generalization and achieves state-of-the-art results. Conclusion: By generating domain-specific soft prompts, FedDSPG improves the adaptability and generalization of federated models on unknown domains and offers a new approach to federated domain generalization. Abstract: Prompt learning has become an efficient paradigm for adapting CLIP to downstream tasks. Compared with traditional fine-tuning, prompt learning optimizes a few parameters yet yields highly competitive results, especially appealing in federated learning for computational efficiency. [...] engendering domain shift among clients and posing a formidable challenge for downstream-task adaptation. Existing federated domain generalization (FDG) methods based on prompt learning typically learn soft prompts from training samples, replacing manually designed prompts to enhance the generalization ability of federated models. However, these learned prompts exhibit limited diversity and tend to ignore information from unknown domains. We propose a novel and effective method from a generative perspective for handling FDG tasks, namely federated domain generalization with domain-specific soft prompts generation (FedDSPG). Specifically, during training, we introduce domain-specific soft prompts (DSPs) for each domain and integrate content and domain knowledge into the generative model among clients. In the inference phase, the generator is utilized to obtain DSPs for unseen target domains, thus guiding downstream tasks in unknown domains. Comprehensive evaluations across several public datasets confirm that our method outperforms existing strong baselines in FDG, achieving state-of-the-art results.
[95] Revolutionizing Precise Low Back Pain Diagnosis via Contrastive Learning
Thanh Binh Le,Hoang Nhat Khang Vo,Tan-Ha Mai,Trong Nhan Phan
Main category: cs.CV
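TL;DR: LumbarCLIP is a multimodal framework based on contrastive language-image pretraining that aligns lumbar spine MRI images with radiological text descriptions; it reaches 95.00% accuracy and a 94.75% F1-score on classification.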
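The image-text alignment described in this entry rests on a contrastive objective over normalized embeddings. A minimal sketch of the standard symmetric CLIP-style loss with hard (one-hot) targets follows, in plain Python; the function names, the temperature of 0.07, and the toy 2-dimensional embeddings are illustrative assumptions. The "soft" CLIP loss named in the abstract is not reproduced, since its exact form is not given in this entry.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cross_entropy_rows(logits):
    """Mean cross-entropy where the correct class of row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_sum = m + math.log(sum(math.exp(z - m) for z in row))
        total += log_sum - row[i]
    return total / len(logits)


def clip_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric contrastive loss over a batch of paired embeddings."""
    img = [l2_normalize(v) for v in image_emb]
    txt = [l2_normalize(v) for v in text_emb]
    # Cosine similarity of every image with every text, scaled by temperature.
    logits = [[sum(a * b for a, b in zip(i, t)) / temperature for t in txt]
              for i in img]
    logits_t = [list(col) for col in zip(*logits)]
    # Average the image-to-text and text-to-image directions.
    return 0.5 * (cross_entropy_rows(logits) + cross_entropy_rows(logits_t))


# Toy batch of two pairs: correctly paired, then with the texts swapped.
images = [[1.0, 0.0], [0.0, 1.0]]
matched = clip_loss(images, [[2.0, 0.1], [0.1, 3.0]])
swapped = clip_loss(images, [[0.1, 3.0], [2.0, 0.1]])
```

The loss is small when each image is closest to its own report and large when the pairing is swapped, which is the behaviour a projection head is trained to produce.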
Details
Motivation: Low back pain is widespread globally, and improving automated diagnosis requires models that can jointly analyze medical images and text reports. Method: Pairs vision encoders (ResNet-50, Vision Transformer, Swin Transformer) with a BERT text encoder, maps features into a shared embedding space through learnable linear or non-linear projection heads, and trains contrastively with a soft CLIP loss. Result: Reaches up to 95.00% accuracy and a 94.75% F1-score on the test set, outperforming existing methods; ablations show that linear projection heads give better cross-modal alignment. Conclusion: LumbarCLIP provides an effective and promising foundation for automated musculoskeletal diagnosis and clinical decision support. Abstract: Low back pain affects millions worldwide, driving the need for robust diagnostic models that can jointly analyze complex medical images and accompanying text reports. We present LumbarCLIP, a novel multimodal framework that leverages contrastive language-image pretraining to align lumbar spine MRI scans with corresponding radiological descriptions. Built upon a curated dataset containing axial MRI views paired with expert-written reports, LumbarCLIP integrates vision encoders (ResNet-50, Vision Transformer, Swin Transformer) with a BERT-based text encoder to extract dense representations. These are projected into a shared embedding space via learnable projection heads, configurable as linear or non-linear, and normalized to facilitate stable contrastive training using a soft CLIP loss. Our model achieves state-of-the-art performance on downstream classification, reaching up to 95.00% accuracy and 94.75% F1-score on the test set, despite inherent class imbalance. Extensive ablation studies demonstrate that linear projection heads yield more effective cross-modal alignment than non-linear variants. LumbarCLIP offers a promising foundation for automated musculoskeletal diagnosis and clinical decision support.
[96] Poisoning Prompt-Guided Sampling in Video Large Language Models
Yuxin Cao,Wei Song,Jingling Xue,Jin Song Dong
Main category: cs.CV
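TL;DR: This paper presents PoisonVID, the first black-box poisoning attack on the prompt-guided sampling mechanism of video large language models; a closed-loop optimization strategy generates a universal perturbation that suppresses the relevance scores of harmful frames, and experiments report attack success rates of 82% to 99% across several sampling strategies and advanced VideoLLMs.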
Details
Motivation: Vulnerabilities have been found in earlier frame sampling strategies, but the safety of prompt-guided sampling has not been explored; this paper fills that gap. Method: Proposes PoisonVID, a closed-loop optimization strategy that builds a depiction set using a shadow VideoLLM and a lightweight language model (GPT-4o-mini) and iteratively optimizes a universal perturbation to suppress the relevance scores of harmful frames. Result: Evaluated on three prompt-guided sampling strategies and three advanced VideoLLMs, PoisonVID achieves an attack success rate of 82% to 99%. Conclusion: Prompt-guided sampling has serious security vulnerabilities, and more advanced sampling strategies are needed to make VideoLLMs safer. Abstract: Video Large Language Models (VideoLLMs) have emerged as powerful tools for understanding videos, supporting tasks such as summarization, captioning, and question answering. Their performance has been driven by advances in frame sampling, progressing from uniform-based to semantic-similarity-based and, most recently, prompt-guided strategies. While vulnerabilities have been identified in earlier sampling strategies, the safety of prompt-guided sampling remains unexplored. We close this gap by presenting PoisonVID, the first black-box poisoning attack that undermines prompt-guided sampling in VideoLLMs. PoisonVID compromises the underlying prompt-guided sampling mechanism through a closed-loop optimization strategy that iteratively optimizes a universal perturbation to suppress harmful frame relevance scores, guided by a depiction set constructed from paraphrased harmful descriptions leveraging a shadow VideoLLM and a lightweight language model, i.e., GPT-4o-mini. Comprehensively evaluated on three prompt-guided sampling strategies and across three advanced VideoLLMs, PoisonVID achieves 82% - 99% attack success rate, highlighting the importance of developing future advanced sampling strategies for VideoLLMs.
[97] Punching Above Precision: Small Quantized Model Distillation with Learnable Regularizer
Abdur Rehman,S M A Sharif,Md Abdur Rahaman,Mohamed Jismy Aashik Rasool,Seongwan Kim,Jaeho Lee
Main category: cs.CV
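TL;DR: Proposes Game of Regularizer (GoR), a learnable regularization method that uses two trainable parameters to adaptively balance the task-specific and knowledge distillation losses, markedly improving low-bit quantized models; combined with the ensemble distillation framework QAT-EKD-GoR, it can outperform full-precision models.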
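To make the idea of "two trainable parameters for dynamic loss weighting" concrete, the sketch below uses a well-known stand-in, uncertainty-style weighting, where each loss L is combined as exp(-s) * L + s with a learnable scalar s. This is explicitly not GoR's formulation, which this entry does not spell out; the loss values, learning rate, and function names are hypothetical, and the two losses are held fixed only to keep the example small.

```python
import math


def weighted_loss(loss_ts, loss_kd, s_ts, s_kd):
    """Combine two losses using learnable log-scale parameters s_ts and s_kd."""
    return (math.exp(-s_ts) * loss_ts + s_ts
            + math.exp(-s_kd) * loss_kd + s_kd)


def grad_wrt_s(loss, s):
    """Derivative of exp(-s) * loss + s with respect to s."""
    return 1.0 - math.exp(-s) * loss


# Hypothetical task-specific and distillation losses of very different scale.
loss_ts, loss_kd = 0.5, 8.0
s_ts, s_kd = 0.0, 0.0
before = weighted_loss(loss_ts, loss_kd, s_ts, s_kd)

# Plain gradient descent on the two weighting parameters only.
for _ in range(2000):
    s_ts -= 0.05 * grad_wrt_s(loss_ts, s_ts)
    s_kd -= 0.05 * grad_wrt_s(loss_kd, s_kd)

after = weighted_loss(loss_ts, loss_kd, s_ts, s_kd)
scaled_ts = math.exp(-s_ts) * loss_ts  # contribution of the task loss
scaled_kd = math.exp(-s_kd) * loss_kd  # contribution of the distillation loss
```

After optimization each s settles near log(L), so both scaled terms contribute about equally even though the raw losses differ by a factor of 16. In real training the losses change every step and the two parameters are updated jointly with the model weights.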
Details
Motivation: Existing methods that combine quantization-aware training with knowledge distillation (QAT-KD) struggle to balance the task loss and the distillation loss under low-bit quantization because their gradient magnitudes differ, which leads to unstable training and degraded performance. Method: Designs GoR, a dynamic loss weighting mechanism with only two learnable parameters that adaptively balances the task-specific and distillation objectives; further proposes the QAT-EKD-GoR framework, which performs ensemble distillation from multiple heterogeneous teacher models. Result: On image classification, object detection, and large language model compression, GoR consistently outperforms existing QAT-KD methods; on low-power edge devices it delivers faster inference while maintaining full-precision accuracy; under optimal conditions EKD-GoR outperforms full-precision models. Conclusion: GoR eases gradient conflict in multi-objective optimization, improves the convergence and performance of small quantized models, and offers an efficient, robust solution for AI deployment in resource-constrained settings. Abstract: Quantization-aware training (QAT) combined with knowledge distillation (KD) is a promising strategy for compressing Artificial Intelligence (AI) models for deployment on resource-constrained hardware. However, existing QAT-KD methods often struggle to balance task-specific (TS) and distillation losses due to heterogeneous gradient magnitudes, especially under low-bit quantization. We propose Game of Regularizer (GoR), a novel learnable regularization method that adaptively balances TS and KD objectives using only two trainable parameters for dynamic loss weighting. GoR reduces conflict between supervision signals, improves convergence, and boosts the performance of small quantized models (SQMs). Experiments on image classification, object detection (OD), and large language model (LLM) compression show that GoR consistently outperforms state-of-the-art QAT-KD methods. On low-power edge devices, it delivers faster inference while maintaining full-precision accuracy. We also introduce QAT-EKD-GoR, an ensemble distillation framework that uses multiple heterogeneous teacher models. Under optimal conditions, the proposed EKD-GoR can outperform full-precision models, providing a robust solution for real-world deployment.
[98] Plant identification based on noisy web data: the amazing performance of deep learning (LifeCLEF 2017)
Herve Goeau,Pierre Bonnet,Alexis Joly
Main category: cs.CV
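TL;DR: The LifeCLEF 2017 plant identification challenge compared a large, noisy training set collected from the web with a smaller, expert-checked training set, using data from the Pl@ntNet application as the test set.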
Details
Motivation: Despite the many plant image resources that exist, most species still have too few or poor-quality images and institutional data is limited, so the challenge explores whether abundant but noisy web images can improve plant identification systems. Method: Builds two training sets, a large noisy one collected from the web and a small high-quality one validated by experts, and compares them fairly on an independent test set from the Pl@ntNet mobile application, assessing how well different approaches cope with the change of data source. Result: The challenge attracted several research teams; results show that models trained on large-scale noisy data can in some cases match models trained on small clean data, while also exposing the difficulties caused by label noise and limited generalization. Conclusion: Web-crawled image data has potential for large-scale plant identification, provided label noise is handled effectively; combining expert knowledge with weakly supervised learning is a direction for future work. Abstract: The 2017 edition of the LifeCLEF plant identification challenge is an important milestone towards automated plant identification systems working at the scale of continental floras with 10,000 plant species living mainly in Europe and North America illustrated by a total of 1.1M images. Nowadays, such ambitious systems are enabled thanks to the conjunction of the dazzling recent progress in image classification with deep learning and several outstanding international initiatives, such as the Encyclopedia of Life (EOL), aggregating the visual knowledge on plant species coming from the main national botany institutes. However, despite all these efforts the majority of the plant species still remain without pictures or are poorly illustrated. Outside the institutional channels, a much larger number of plant pictures are available and spread on the web through botanist blogs, plant lovers web-pages, image hosting websites and on-line plant retailers. The LifeCLEF 2017 plant challenge presented in this paper aimed at evaluating to what extent a large noisy training dataset collected through the web and containing a lot of labelling errors can compete with a smaller but trusted training dataset checked by experts. To fairly compare both training strategies, the test dataset was created from a third data source, i.e. the Pl@ntNet mobile application that collects millions of plant image queries all over the world. This paper presents more precisely the resources and assessments of the challenge, summarizes the approaches and systems employed by the participating research groups, and provides an analysis of the main outcomes.
[99] TasselNetV4: A vision foundation model for cross-scene, cross-scale, and cross-species plant counting
Xiaonan Hu,Xuebing Li,Jinyu Xu,Abdulkadir Duran Adan,Letian Zhou,Xuhui Zhu,Yanan Li,Wei Guo,Shouyang Liu,Wenzhong Liu,Hao Lu
Main category: cs.CV
TL;DR: 本文提出了TasselNetV4,一种用于跨物种植物计数的视觉基础模型,结合局部计数与提取匹配范式,在新构建的PAC-105和PAC-Somalia数据集上表现出优越性能。
Details
Motivation: Conventional plant counting relies on species-specific models, which cannot keep up with plant diversity and the steady arrival of new cultivars, so a more general cross-species counting method is needed. Method: Builds on the TasselNet framework with a vision transformer architecture and multi-branch box-aware local counters, combined with the extract-and-match paradigm of class-agnostic counting (CAC), to count plants across species and scales. Result: On the two new datasets PAC-105 and PAC-Somalia, TasselNetV4 outperforms existing class-agnostic counting models in both counting accuracy and efficiency. Conclusion: TasselNetV4 can serve as a vision foundation model for cross-scene, cross-scale, and cross-species plant counting, offering a general solution for crop monitoring in agriculture. Abstract: Accurate plant counting provides valuable information for agriculture such as crop yield prediction, plant density assessment, and phenotype quantification. Vision-based approaches are currently the mainstream solution. Prior art typically uses a detection or a regression model to count a specific plant. However, plants have biodiversity, and new cultivars are increasingly bred each year. It is almost impossible to exhaust and build all species-dependent counting models. Inspired by class-agnostic counting (CAC) in computer vision, we argue that it is time to rethink the problem formulation of plant counting, from what plants to count to how to count plants. In contrast to most daily objects with spatial and temporal invariance, plants are dynamic, changing with time and space. Their non-rigid structure often leads to worse performance than counting rigid instances like heads and cars such that current CAC and open-world detection models are suboptimal to count plants. In this work, we inherit the vein of the TasselNet plant counting model and introduce a new extension, TasselNetV4, shifting from species-specific counting to cross-species counting. TasselNetV4 marries the local counting idea of TasselNet with the extract-and-match paradigm in CAC. It builds upon a plain vision transformer and incorporates novel multi-branch box-aware local counters used to enhance cross-scale robustness. Two challenging datasets, PAC-105 and PAC-Somalia, are harvested. Extensive experiments against state-of-the-art CAC models show that TasselNetV4 achieves not only superior counting performance but also high efficiency. Our results indicate that TasselNetV4 emerges to be a vision foundation model for cross-scene, cross-scale, and cross-species plant counting.
[100] SD-RetinaNet: Topologically Constrained Semi-Supervised Retinal Lesion and Layer Segmentation in OCT
Botond Fazekas,Guilherme Aresta,Philipp Seeböck,Julia Mai,Ursula Schmidt-Erfurth,Hrvoje Bogunović
Main category: cs.CV
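TL;DR: Proposes a new semi-supervised model that introduces a fully differentiable biomarker topology engine to produce anatomically correct retinal layer and lesion segmentations, markedly improving segmentation accuracy and robustness.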
Details
Motivation: Existing semi-supervised methods for retinal OCT segmentation often produce anatomically implausible structures, model layer-lesion interactions poorly, and offer no guarantee of topological correctness. Method: Proposes a new semi-supervised model that integrates a fully differentiable biomarker topology engine, enables joint learning with bidirectional influence between layers and lesions, and learns a disentangled representation (separating spatial and style factors), training on unlabeled and partially labeled data. Result: On public and internal OCT datasets, the model outperforms the current state of the art in both layer and lesion segmentation, and it generalizes layer segmentation to pathological cases using partially annotated data. Conclusion: Introducing anatomical constraints into semi-supervised learning markedly improves the accuracy, robustness, and trustworthiness of retinal biomarker segmentation and has potential for clinical use. Abstract: Optical coherence tomography (OCT) is widely used for diagnosing and monitoring retinal diseases, such as age-related macular degeneration (AMD). The segmentation of biomarkers such as layers and lesions is essential for patient diagnosis and follow-up. Recently, semi-supervised learning has shown promise in improving retinal segmentation performance. However, existing methods often produce anatomically implausible segmentations, fail to effectively model layer-lesion interactions, and lack guarantees on topological correctness. To address these limitations, we propose a novel semi-supervised model that introduces a fully differentiable biomarker topology engine to enforce anatomically correct segmentation of lesions and layers. This enables joint learning with bidirectional influence between layers and lesions, leveraging unlabeled and diverse partially labeled datasets. Our model learns a disentangled representation, separating spatial and style factors. This approach enables more realistic layer segmentations and improves lesion segmentation, while strictly enforcing lesion location in their anatomically plausible positions relative to the segmented layers. We evaluate the proposed model on public and internal datasets of OCT scans and show that it outperforms the current state-of-the-art in both lesion and layer segmentation, while demonstrating the ability to generalize layer segmentation to pathological cases using partially annotated training data. Our results demonstrate the potential of using anatomical constraints in semi-supervised learning for accurate, robust, and trustworthy retinal biomarker segmentation.
[101] Plant identification in an open-world (LifeCLEF 2016)
Herve Goeau,Pierre Bonnet,Alexis Joly
Main category: cs.CV
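TL;DR: The LifeCLEF 2016 plant identification challenge evaluated plant identification methods in a large-scale, realistic biodiversity monitoring scenario and, for the first time, framed the task as open-set recognition, requiring systems to reject misclassifications caused by unknown species.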
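Open-set recognition, as described in this entry, means a classifier must be able to answer "none of the known classes". One simple baseline is to reject any prediction whose top softmax probability falls below a threshold. The sketch below illustrates that baseline only; it is not a method reported by the challenge participants, and the threshold of 0.7, the scores, and the three species labels are illustrative assumptions.

```python
import math


def softmax(scores):
    """Convert raw class scores into probabilities."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def predict_open_set(scores, labels, threshold=0.7):
    """Return the top label, or "unknown" if its probability is below threshold."""
    probs = softmax(scores)
    best = max(range(len(probs)), key=probs.__getitem__)
    return labels[best] if probs[best] >= threshold else "unknown"


# Hypothetical known classes and classifier scores.
labels = ["Quercus robur", "Acer campestre", "Fagus sylvatica"]
confident = predict_open_set([6.0, 1.0, 0.5], labels)   # one clear winner
ambiguous = predict_open_set([1.2, 1.0, 1.1], labels)   # near-uniform scores
```

The first call keeps its prediction because one class dominates, while the second is rejected as unknown. Raising the threshold rejects more false positives from unseen species at the cost of rejecting more correct predictions.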
Details
Motivation: To advance automated plant identification in support of large-scale biodiversity monitoring, in particular the handling of unknown species in open environments. Method: Builds a dataset of more than 110K images covering 1000 Western European plant species and uses an open-set recognition setting to evaluate both classification of known classes and robustness to unknown classes. Result: Several research teams submitted different identification systems; the challenge exposed the limitations of existing methods in open-set recognition, especially in rejecting false positive classifications. Conclusion: Open-set recognition is a key obstacle to practical automated plant identification, and future models need a better ability to recognize unknown classes. Abstract: The LifeCLEF plant identification challenge aims at evaluating plant identification methods and systems at a very large scale, close to the conditions of a real-world biodiversity monitoring scenario. The 2016 edition was actually conducted on a set of more than 110K images illustrating 1000 plant species living in Western Europe, built through a large-scale participatory sensing platform initiated in 2011 and which now involves tens of thousands of contributors. The main novelty over the previous years is that the identification task was evaluated as an open-set recognition problem, i.e. a problem in which the recognition system has to be robust to unknown and never seen categories. Beyond the brute-force classification across the known classes of the training set, the big challenge was thus to automatically reject the false positive classification hits that are caused by the unknown classes. This overview presents more precisely the resources and assessments of the challenge, summarizes the approaches and systems employed by the participating research groups, and provides an analysis of the main outcomes.
[102] SCRA-VQA: Summarized Caption-Rerank for Augmented Large Language Models in Visual Question Answering
Yan Zhang,Jiaqing Lin,Miao Zhang,Kui Xiao,Xiaoju Hou,Yue Zhao,Zhifei Li
Main category: cs.CV
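TL;DR: Proposes SCRA-VQA, which uses a pre-trained visual language model to generate image captions and refines them through contextual example generation, summarization, and reranking. This improves the reasoning ability and task adaptability of large language models on knowledge-based visual question answering without expensive end-to-end training, with strong results on OK-VQA and A-OKVQA.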
Details
Motivation: Existing methods combine large language models with image captions for knowledge-based visual question answering, but the captions often contain noise unrelated to the question and LLMs do not readily understand the VQA task, which limits their reasoning. Method: SCRA-VQA uses a pre-trained visual language model to generate image captions, generates contextual examples, and simultaneously summarizes and reorders the captions to remove irrelevant information and improve the LLM's understanding of the image and the question. Result: On two challenging knowledge-based VQA datasets, OK-VQA and A-OKVQA, SCRA-VQA with a 6.7B-parameter LLM reaches accuracies of 38.8% and 34.6%, respectively. Conclusion: By improving how image captions are generated and organized, SCRA-VQA raises LLM performance on knowledge-based VQA without end-to-end training, with good task adaptability and reasoning ability. Abstract: Acquiring high-quality knowledge is a central focus in Knowledge-Based Visual Question Answering (KB-VQA). Recent methods use large language models (LLMs) as knowledge engines for answering. These methods generally employ image captions as visual text descriptions to assist LLMs in interpreting images. However, the captions frequently include excessive noise irrelevant to the question, and LLMs generally do not comprehend VQA tasks, limiting their reasoning capabilities. To address this issue, we propose the Summarized Caption-Rerank Augmented VQA (SCRA-VQA), which employs a pre-trained visual language model to convert images into captions. Moreover, SCRA-VQA generates contextual examples for the captions while simultaneously summarizing and reordering them to exclude unrelated information. The caption-rerank process enables LLMs to understand the image information and questions better, thus enhancing the model's reasoning ability and task adaptability without expensive end-to-end training. Based on an LLM with 6.7B parameters, SCRA-VQA performs excellently on two challenging knowledge-based VQA datasets: OK-VQA and A-OKVQA, achieving accuracies of 38.8% and 34.6%. Our code is available at https://github.com/HubuKG/SCRA-VQA.
[103] The Unanticipated Asymmetry Between Perceptual Optimization and Assessment
Jiabei Zhang,Qi Wang,Siyu Wu,Du Chen,Tianhe Wu
Main category: cs.CV
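TL;DR: The study reveals an asymmetry between perceptual optimization and image quality assessment (IQA): fidelity metrics that excel at IQA are not necessarily effective as optimization objectives, and discriminator design has a decisive effect on optimization results.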
Details
Motivation: To examine how the effectiveness of perceptual optimization objectives (fidelity and adversarial objectives) relates to their capability as image quality assessment metrics. Method: A systematic analysis of fidelity and adversarial objectives in both perceptual optimization and IQA, together with a comparison of how different discriminator architectures affect detail reconstruction. Result: Fidelity metrics that perform well in IQA are not necessarily effective for perceptual optimization, and the mismatch is more pronounced under adversarial training. Discriminators suppress artifacts, but their learned representations help little as initializations for IQA models. Patch-level and convolutional architectures reconstruct detail more faithfully than vanilla or Transformer-based ones. Conclusion: There is an under-recognized asymmetry between perceptual optimization and IQA, and discriminator design plays a decisive role in optimization outcomes, with implications for loss function design and IQA transferability. Abstract: Perceptual optimization is primarily driven by the fidelity objective, which enforces both semantic consistency and overall visual realism, while the adversarial objective provides complementary refinement by enhancing perceptual sharpness and fine-grained detail. Despite their central role, the correlation between their effectiveness as optimization objectives and their capability as image quality assessment (IQA) metrics remains underexplored. In this work, we conduct a systematic analysis and reveal an unanticipated asymmetry between perceptual optimization and assessment: fidelity metrics that excel in IQA are not necessarily effective for perceptual optimization, with this misalignment emerging more distinctly under adversarial training. In addition, while discriminators effectively suppress artifacts during optimization, their learned representations offer only limited benefits when reused as backbone initializations for IQA models. Beyond this asymmetry, our findings further demonstrate that discriminator design plays a decisive role in shaping optimization, with patch-level and convolutional architectures providing more faithful detail reconstruction than vanilla or Transformer-based alternatives. These insights advance the understanding of loss function design and its connection to IQA transferability, paving the way for more principled approaches to perceptual optimization.
[104] Integrating Object Interaction Self-Attention and GAN-Based Debiasing for Visual Question Answering
Zhifei Li,Feng Qiu,Yiran Wang,Yujing Xia,Kui Xiao,Miao Zhang,Yan Zhang
Main category: cs.CV
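TL;DR: Proposes IOG-VQA, a new VQA model that combines object interaction self-attention with GAN-based debiasing to improve visual question answering in the presence of dataset bias.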
Details
Motivation: Existing VQA models over-rely on superficial patterns because of biases in the training data, so they generalize poorly to diverse questions and images. Method: An object interaction self-attention mechanism captures complex interactions between objects in an image, and a GAN-based debiasing framework generates unbiased data distributions, improving robustness and generalization. Result: Experiments on VQA-CP v1 and VQA-CP v2 show that the model outperforms existing methods when handling biased and imbalanced data. Conclusion: Modeling object interactions and mitigating dataset bias together is important for improving VQA performance. Abstract: Visual Question Answering (VQA) presents a unique challenge by requiring models to understand and reason about visual content to answer questions accurately. Existing VQA models often struggle with biases introduced by the training data, leading to over-reliance on superficial patterns and inadequate generalization to diverse questions and images. This paper presents a novel model, IOG-VQA, which integrates Object Interaction Self-Attention and GAN-Based Debiasing to enhance VQA model performance. The self-attention mechanism allows our model to capture complex interactions between objects within an image, providing a more comprehensive understanding of the visual context. Meanwhile, the GAN-based debiasing framework generates unbiased data distributions, helping the model to learn more robust and generalizable features. By leveraging these two components, IOG-VQA effectively combines visual and textual information to address the inherent biases in VQA datasets. Extensive experiments on the VQA-CP v1 and VQA-CP v2 datasets demonstrate that our model shows excellent performance compared with the existing methods, particularly in handling biased and imbalanced data distributions, highlighting the importance of addressing both object interactions and dataset biases in advancing VQA tasks. Our code is available at https://github.com/HubuKG/IOG-VQA.
[105] Nuclear Diffusion Models for Low-Rank Background Suppression in Videos
Tristan S. W. Stevens,Oisín Nolan,Jean-Luc Robert,Ruud J. G. van Sloun
Main category: cs.CV
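TL;DR: Proposes Nuclear Diffusion, a hybrid framework that combines low-rank temporal modeling with diffusion posterior sampling for video denoising and restoration, and outperforms traditional RPCA on cardiac ultrasound dehazing.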
Details
Motivation: The sparsity assumption in traditional robust principal component analysis (RPCA) struggles to capture the rich variability of real video data, which limits performance under complex noise and background interference. Method: Nuclear Diffusion combines low-rank temporal modeling with posterior sampling from a diffusion model, using a deep generative prior to better model dynamic content. Result: On a real cardiac ultrasound dehazing task, the method outperforms traditional RPCA in both contrast enhancement (gCNR) and signal preservation (KS statistic). Conclusion: Combining model-based temporal modeling with deep generative priors can improve video restoration quality, particularly in settings that demand high fidelity such as medical imaging. Abstract: Video sequences often contain structured noise and background artifacts that obscure dynamic content, posing challenges for accurate analysis and restoration. Robust principal component methods address this by decomposing data into low-rank and sparse components. Still, the sparsity assumption often fails to capture the rich variability present in real video data. To overcome this limitation, a hybrid framework that integrates low-rank temporal modeling with diffusion posterior sampling is proposed. The proposed method, Nuclear Diffusion, is evaluated on a real-world medical imaging problem, namely cardiac ultrasound dehazing, and demonstrates improved dehazing performance compared to traditional RPCA in terms of contrast enhancement (gCNR) and signal preservation (KS statistic). These results highlight the potential of combining model-based temporal models with deep generative priors for high-fidelity video restoration.
[106] FerretNet: Efficient Synthetic Image Detection via Local Pixel Dependencies
Shuqiao Liang,Jian Liu,Renzhang Chen,Quanlong Guan
Main category: cs.CV
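TL;DR: Proposes FerretNet, a lightweight neural network based on local pixel dependencies (LPD) for detecting synthetic images. It captures latent distribution deviations and decoding-induced smoothing introduced during generation, and reaches 97.1% average accuracy on an open-world benchmark spanning 22 generative models.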
Details
Motivation: As synthetic images from models such as VAEs, GANs, and LDMs become more realistic, traditional detection methods are challenged, and new methods are needed that can identify images from advanced generative models. Method: Using the local pixel dependency (LPD) property of Markov Random Fields, images are reconstructed to expose texture and edge inconsistencies. On this basis the authors design FerretNet, a lightweight network with only 1.1M parameters. Result: Trained only on the 4-class ProGAN data, FerretNet achieves 97.1% average accuracy on an open-world benchmark covering 22 generative models, surpassing the previous best methods by 10.6%. Conclusion: FerretNet is an efficient and robust synthetic image detector that generalizes well to unseen generative models and has practical potential. Abstract: The increasing realism of synthetic images generated by advanced models such as VAEs, GANs, and LDMs poses significant challenges for synthetic image detection. To address this issue, we explore two artifact types introduced during the generation process: (1) latent distribution deviations and (2) decoding-induced smoothing effects, which manifest as inconsistencies in local textures, edges, and color transitions. Leveraging local pixel dependencies (LPD) properties rooted in Markov Random Fields, we reconstruct synthetic images using neighboring pixel information to expose disruptions in texture continuity and edge coherence. Building upon LPD, we propose FerretNet, a lightweight neural network with only 1.1M parameters that delivers efficient and robust synthetic image detection. Extensive experiments demonstrate that FerretNet, trained exclusively on the 4-class ProGAN dataset, achieves an average accuracy of 97.1% on an open-world benchmark comprising 22 generative models, surpassing state-of-the-art methods by 10.6%.
[107] Concepts in Motion: Temporal Bottlenecks for Interpretable Video Classification
Patrick Knab,Sascha Marton,Philipp J. Schubert,Drago Guggiana,Christian Bartelt
Main category: cs.CV
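TL;DR: MoTIF is a Transformer-inspired interpretable framework that extends concept bottleneck models to video classification. It models semantic concepts and their dynamics over time, giving interpretable analysis of actions while keeping competitive performance.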
Details
Motivation: Extending interpretable concept models for images (such as CBMs) to video requires modeling temporal dependencies, and existing methods struggle to capture concepts that recur over time and how they evolve. Method: The MoTIF framework uses a transformer-like architecture on top of the concept bottleneck and models global concept importance, local concept relevance within temporal windows, and temporal dependencies of concepts, handling video sequences of arbitrary length. Result: Experiments show that MoTIF transfers the concept bottleneck paradigm to video data, supports interpretable analysis at several granularities (global, local, temporal), and remains competitive on video classification. Conclusion: MoTIF applies concept-based interpretable modeling to video classification and, by explicitly modeling concept dynamics over time, balances semantic explanation of actions with classification performance. Abstract: Conceptual models such as Concept Bottleneck Models (CBMs) have driven substantial progress in improving interpretability for image classification by leveraging human-interpretable concepts. However, extending these models from static images to sequences of images, such as video data, introduces a significant challenge due to the temporal dependencies inherent in videos, which are essential for capturing actions and events. In this work, we introduce MoTIF (Moving Temporal Interpretable Framework), an architectural design inspired by a transformer that adapts the concept bottleneck framework for video classification and handles sequences of arbitrary length. Within the video domain, concepts refer to semantic entities such as objects, attributes, or higher-level components (e.g., 'bow', 'mount', 'shoot') that reoccur across time, forming motifs that collectively describe and explain actions. Our design explicitly enables three complementary perspectives: global concept importance across the entire video, local concept relevance within specific windows, and temporal dependencies of a concept over time. Our results demonstrate that the concept-based modeling paradigm can be effectively transferred to video data, enabling a better understanding of concept contributions in temporal contexts while maintaining competitive performance. Code available at github.com/patrick-knab/MoTIF.
[108] FSMODNet: A Closer Look at Few-Shot Detection in Multispectral Data
Manuel Nkegoum,Minh-Tan Pham,Élisa Fromont,Bruno Avignon,Sébastien Lefèvre
Main category: cs.CV
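TL;DR: Proposes FSMODNet, a framework for few-shot multispectral object detection (FSMOD) that fuses visible and thermal features with deformable attention and achieves robust detection with very little annotated data.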
Details
Motivation: When annotated data is scarce, making effective use of the complementary strengths of visible and thermal imagery for object detection is difficult, especially under complex illumination and environmental conditions. Method: FSMODNet adopts a cross-modality feature fusion strategy that uses deformable attention to integrate visible and thermal features, improving detection in the few-shot setting. Result: Experiments on two public datasets show that the method clearly outperforms several baselines built from state-of-the-art models in low-data regimes. Conclusion: Through effective cross-modality feature integration, FSMODNet performs well on few-shot multispectral object detection and has potential for use in complex environments. Abstract: Few-shot multispectral object detection (FSMOD) addresses the challenge of detecting objects across visible and thermal modalities with minimal annotated data. In this paper, we explore this complex task and introduce a framework named "FSMODNet" that leverages cross-modality feature integration to improve detection performance even with limited labels. By effectively combining the unique strengths of visible and thermal imagery using deformable attention, the proposed method demonstrates robust adaptability in complex illumination and environmental conditions. Experimental results on two public datasets show effective object detection performance in challenging low-data regimes, outperforming several baselines we established from state-of-the-art models. All code, models, and experimental data splits can be found at https://anonymous.4open.science/r/Test-B48D.
[109] Finding 3D Positions of Distant Objects from Noisy Camera Movement and Semantic Segmentation Sequences
Julius Pesonen,Arno Solin,Eija Honkavaara
Main category: cs.CV
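TL;DR: Proposes using particle filters to localize objects in 3D from camera poses and image segmentation when computational resources are limited or targets are distant, for safety-critical tasks such as drone-based wildfire monitoring.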
Details
Motivation: For distant objects or under limited computational resources, dense depth estimation and 3D scene reconstruction are not feasible for 3D object localization, so a more practical and flexible method is needed. Method: Particle filters handle 3D object localization in both single-target and multi-target scenarios, combining GNSS-based camera poses with image segmentation results. The method is validated on a 3D simulation and a real drone image sequence. Result: The particle filter completes practical localization tasks in cases where other methods fail, and because it does not depend on a particular detection method it is general and flexible. Conclusion: Particle filters are an effective approach to camera-based 3D object localization in resource-constrained settings and can be paired with existing image segmentation models for applications such as drone-based wildfire monitoring. Abstract: 3D object localisation based on a sequence of camera measurements is essential for safety-critical surveillance tasks, such as drone-based wildfire monitoring. Localisation of objects detected with a camera can typically be solved with dense depth estimation or 3D scene reconstruction. However, in the context of distant objects or tasks limited by the amount of available computational resources, neither solution is feasible. In this paper, we show that the task can be solved using particle filters for both single and multiple target scenarios. The method was studied using a 3D simulation and a drone-based image segmentation sequence with global navigation satellite system (GNSS)-based camera pose estimates. The results showed that a particle filter can be used to solve practical localisation tasks based on camera poses and image segments in situations where other solutions fail. The particle filter is independent of the detection method, making it flexible for new tasks. The study also demonstrates that drone-based wildfire monitoring can be conducted using the proposed method paired with a pre-existing image segmentation model.
[110] SwinMamba: A hybrid local-global mamba framework for enhancing semantic segmentation of remotely sensed images
Qinfeng Zhu,Han Li,Liang He,Lei Fan
Main category: cs.CV
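TL;DR: Proposes SwinMamba, a new framework for semantic segmentation of remote sensing images. It combines local and global scanning: the first two stages scan locally to capture fine detail, and the last two stages scan globally to fuse context, which improves segmentation performance.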
Details
Motivation: Existing Vision Mamba models rely on global scanning and therefore overlook local texture and edge features in remote sensing images, which limits segmentation accuracy. Method: Inspired by the Swin Transformer, SwinMamba introduces localized Mamba-style scanning within shifted windows and combines it with a global receptive field. The first two stages focus on local detail, the last two fuse global context, and overlapping shifted windows strengthen information exchange between regions. Result: Experiments on the LoveDA and ISPRS Potsdam datasets show that SwinMamba outperforms current state-of-the-art methods. Conclusion: SwinMamba balances the modeling of local detail and global context and offers an efficient and accurate framework for semantic segmentation of remote sensing images. Abstract: Semantic segmentation of remote sensing imagery is a fundamental task in computer vision, supporting a wide range of applications such as land use classification, urban planning, and environmental monitoring. However, this task is often challenged by the high spatial resolution, complex scene structures, and diverse object scales present in remote sensing data. To address these challenges, various deep learning architectures have been proposed, including convolutional neural networks, Vision Transformers, and the recently introduced Vision Mamba. Vision Mamba features a global receptive field and low computational complexity, demonstrating both efficiency and effectiveness in image segmentation. However, its reliance on global scanning tends to overlook critical local features, such as textures and edges, which are essential for achieving accurate segmentation in remote sensing contexts. To tackle this limitation, we propose SwinMamba, a novel framework inspired by the Swin Transformer. SwinMamba integrates localized Mamba-style scanning within shifted windows with a global receptive field, to enhance the model's perception of both local and global features. Specifically, the first two stages of SwinMamba perform local scanning to capture fine-grained details, while its subsequent two stages leverage global scanning to fuse broader contextual information. In our model, the use of overlapping shifted windows enhances inter-region information exchange, facilitating more robust feature integration across the entire image. 
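To make the contrast between global and window-local scanning concrete, the following is a minimal sketch of the token visiting order each one produces on a small grid. It is an illustration under stated assumptions, not SwinMamba's implementation: the function names `global_scan_order` and `window_scan_order` are hypothetical, the shift is a Swin-style cyclic roll, and the abstract does not specify the actual scan paths.

```python
import numpy as np

def global_scan_order(h, w):
    """Row-major order over the whole h x w token grid (one long sequence)."""
    return np.arange(h * w)

def window_scan_order(h, w, win, shift=0):
    """Visit tokens window by window, row-major inside each window.

    Assumes h and w are divisible by `win`. A non-zero `shift` cyclically
    rolls the grid first, so window borders fall in different places.
    """
    idx = np.arange(h * w).reshape(h, w)
    idx = np.roll(idx, shift=(-shift, -shift), axis=(0, 1))
    # Split into (rows of windows, rows in window, cols of windows, cols in window),
    # then bring the two window axes to the front before flattening.
    blocks = idx.reshape(h // win, win, w // win, win).transpose(0, 2, 1, 3)
    return blocks.reshape(-1)

local_order = window_scan_order(4, 4, win=2)
shifted_order = window_scan_order(4, 4, win=2, shift=1)
```

In the local order, the four tokens of each 2x2 window are adjacent in the sequence, so a sequence model sees neighboring pixels close together. In the global order, vertically adjacent tokens are a full row apart.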
Extensive experiments on the LoveDA and ISPRS Potsdam datasets demonstrate that SwinMamba outperforms state-of-the-art methods, underscoring its effectiveness and potential as a superior solution for semantic segmentation of remotely sensed imagery.
[111] Revisiting Data Challenges of Computational Pathology: A Pack-based Multiple Instance Learning Framework
Wenhao Tang,Heng Fang,Ge Wu,Xiang Li,Ming-Ming Cheng
Main category: cs.CV
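TL;DR: Proposes a pack-based multiple instance learning (pack-based MIL) framework for computational pathology, where whole slide image sequences are extremely long, vary widely in length, and have limited supervision. The framework improves both training efficiency and accuracy.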
Details
Motivation: Whole slide image (WSI) sequences are extremely long and highly variable in length. Under limited supervision, conventional methods struggle to achieve training efficiency while preserving data heterogeneity, which leads to optimization difficulties and feature loss. Method: The pack-based MIL framework packs multiple variable-length feature sequences into fixed-length sequences to enable batched training, adds a residual branch that builds a hyperslide from discarded features and provides multi-slide supervision, and uses an attention-driven downsampler to reduce redundancy. Result: Accuracy improves by up to 8% while using only 12% of the training time, as validated on PANDA(UNI). Conclusion: By systematically addressing data heterogeneity, redundancy, and limited supervision, the method improves the training efficiency and performance of computational pathology models, which suggests that attention to data challenges matters in the era of foundation models. Abstract: Computational pathology (CPath) digitizes pathology slides into whole slide images (WSIs), enabling analysis for critical healthcare tasks such as cancer diagnosis and prognosis. However, WSIs possess extremely long sequence lengths (up to 200K), significant length variations (from 200 to 200K), and limited supervision. These extreme variations in sequence length lead to high data heterogeneity and redundancy. Conventional methods often compromise on training efficiency and optimization to preserve such heterogeneity under limited supervision. To comprehensively address these challenges, we propose a pack-based MIL framework. It packs multiple sampled, variable-length feature sequences into fixed-length ones, enabling batched training while preserving data heterogeneity. Moreover, we introduce a residual branch that composes discarded features from multiple slides into a hyperslide which is trained with tailored labels. It offers multi-slide supervision while mitigating feature loss from sampling. Meanwhile, an attention-driven downsampler is introduced to compress features in both branches to reduce redundancy. By alleviating these challenges, our approach achieves an accuracy improvement of up to 8% while using only 12% of the training time in the PANDA(UNI). Extensive experiments demonstrate that focusing on data challenges in CPath holds significant potential in the era of foundation models. The code is available at https://github.com/FangHeng/PackMIL
[112] SimDiff: Simulator-constrained Diffusion Model for Physically Plausible Motion Generation
Akihisa Watanabe,Jiawei Ren,Li Siyao,Yichen Peng,Erwin Wu,Edgar Simo-Serra
Main category: cs.CV
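TL;DR: Proposes SimDiff, a simulator-constrained diffusion model that integrates environment parameters directly into the denoising process to generate physically plausible motion efficiently.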
Details
Motivation: Existing methods use a simulator-based motion projection layer in the diffusion process to enforce physical plausibility, but the sequential nature of the simulator makes this computationally expensive and prevents parallelization. Method: Simulator-based motion projection is interpreted as a form of guidance within the diffusion process. Building on this, SimDiff conditions the denoising process directly on environment parameters such as gravity and wind. Result: SimDiff needs no repeated simulator calls at inference, generates physically plausible motion efficiently, gives fine-grained control over different physical coefficients, and generalizes to unseen combinations of environment parameters. Conclusion: By integrating physical environment parameters into the diffusion model, SimDiff improves efficiency and controllability while retaining physical interpretability, and it shows good compositional generalization. Abstract: Generating physically plausible human motion is crucial for applications such as character animation and virtual reality. Existing approaches often incorporate a simulator-based motion projection layer into the diffusion process to enforce physical plausibility. However, such methods are computationally expensive due to the sequential nature of the simulator, which prevents parallelization. We show that simulator-based motion projection can be interpreted as a form of guidance, either classifier-based or classifier-free, within the diffusion process. Building on this insight, we propose SimDiff, a Simulator-constrained Diffusion Model that integrates environment parameters (e.g., gravity, wind) directly into the denoising process. By conditioning on these parameters, SimDiff generates physically plausible motions efficiently, without repeated simulator calls at inference, and also provides fine-grained control over different physical coefficients. Moreover, SimDiff successfully generalizes to unseen combinations of environmental parameters, demonstrating compositional generalization.
[113] Unlocking Noise-Resistant Vision: Key Architectural Secrets for Robust Models
Bum Jun Kim,Makoto Kawano,Yusuke Iwasawa,Yutaka Matsuo
Main category: cs.CV
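TL;DR: By analyzing 1,174 pretrained vision models, the paper identifies and theoretically explains four design patterns that improve robustness to Gaussian noise: larger stem kernels, smaller input resolutions, average pooling, and supervised ViTs rather than CLIP ViTs. It turns these into interpretable, practical design rules.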
Details
Motivation: Prior work mostly measures the robustness of vision models without analyzing how architectural design choices affect it. This paper asks which architectural factors actually improve robustness to Gaussian noise and seeks a causal explanation. Method: A large-scale evaluation of 1,174 pretrained vision models identifies architectural patterns associated with robustness. A theoretical analysis then models noise attenuation, downsampling, pooling, and preprocessing to explain the observations. Result: Four design principles improve robustness: larger stem kernels reduce noise gain quadratically, anti-aliased downsampling reduces noise energy roughly in proportion to the square of the downsampling factor, average pooling suppresses noise without bias, and CLIP ViTs have up to 1.91 times higher worst-case sensitivity because of their smaller normalization standard deviations. These patterns yield rank improvements of up to 506 and accuracy gains of up to 21.6 percentage points. Conclusion: Robustness to Gaussian noise can be decomposed into interpretable, modular design factors. The paper provides both empirical regularities and a theoretical basis, and distills them into practical plug-and-play design guidelines. Abstract: While the robustness of vision models is often measured, their dependence on specific architectural design choices is rarely dissected. We investigate why certain vision architectures are inherently more robust to additive Gaussian noise and convert these empirical insights into simple, actionable design rules. Specifically, we performed extensive evaluations on 1,174 pretrained vision models, empirically identifying four consistent design patterns for improved robustness against Gaussian noise: larger stem kernels, smaller input resolutions, average pooling, and supervised vision transformers (ViTs) rather than CLIP ViTs, which yield up to 506 rank improvements and 21.6%p accuracy gains. We then develop a theoretical analysis that explains these findings, converting observed correlations into causal mechanisms. First, we prove that low-pass stem kernels attenuate noise with a gain that decreases quadratically with kernel size and that anti-aliased downsampling reduces noise energy roughly in proportion to the square of the downsampling factor. Second, we demonstrate that average pooling is unbiased and suppresses noise in proportion to the pooling window area, whereas max pooling incurs a positive bias that grows slowly with window size and yields a relatively higher mean-squared error and greater worst-case sensitivity. 
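The pooling claim can be checked numerically with a small Monte Carlo experiment. The sketch below is an illustration written for this digest, not code from the paper: it pools pure zero-mean Gaussian noise (a flat, all-zero clean signal is assumed), so any non-zero mean in the pooled output is bias. The window size `k` and the sample count are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)
sigma = 1.0      # noise standard deviation
k = 3            # pooling window side, so the window area is k * k
n_windows = 200_000

# Each row is one pooling window containing only noise (clean signal = 0).
noise = rng.normal(0.0, sigma, size=(n_windows, k * k))

avg_out = noise.mean(axis=1)   # average pooling
max_out = noise.max(axis=1)    # max pooling

avg_bias = avg_out.mean()               # theory: 0 (unbiased)
avg_var = avg_out.var()                 # theory: sigma**2 / (k * k)
max_bias = max_out.mean()               # theory: positive, grows slowly with k
avg_mse = np.mean(avg_out ** 2)         # error against the clean value 0
max_mse = np.mean(max_out ** 2)

print(f"avg pooling: bias={avg_bias:.4f}, var={avg_var:.4f}, mse={avg_mse:.4f}")
print(f"max pooling: bias={max_bias:.4f}, mse={max_mse:.4f}")
```

Averaging `k * k` independent samples divides the noise variance by the window area, which is the "in proportion to the pooling window area" statement. The maximum of several zero-mean samples has a positive expected value, which is the bias the abstract attributes to max pooling.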
Third, we reveal and explain the vulnerability of CLIP ViTs via a pixel-space Lipschitz bound: The smaller normalization standard deviations used in CLIP preprocessing amplify worst-case sensitivity by up to 1.91 times relative to the Inception-style preprocessing common in supervised ViTs. Our results collectively disentangle robustness into interpretable modules, provide a theory that explains the observed trends, and build practical, plug-and-play guidelines for designing vision models more robust against Gaussian noise.
[114] Decoding the Surgical Scene: A Scoping Review of Scene Graphs in Surgery
Angelo Henriques,Korab Hoxha,Daniel Zapp,Peter C. Issa,Nassir Navab,M. Ali Nasseri
Main category: cs.CV
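TL;DR: The paper reviews research on scene graphs (SGs) in surgery. It identifies a "data divide" between internal-view work that uses real 2D video and external-view work that relies on simulated 4D data, notes that specialized foundation models outperform generalist vision-language models in surgical settings, and argues that SGs are becoming a key technology for surgical safety, efficiency, and training.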
Details
Motivation: To map the current state and challenges of scene graph applications in surgical environments, expose research gaps, and guide future work. Method: A scoping review following the PRISMA-ScR guidelines, analyzing applications, methodological evolution, and data usage in the existing literature. Result: The field is growing rapidly but shows a "data divide": internal-view studies mostly use real 2D video, while external-view 4D modeling relies on simulated data. Methods have progressed from graph neural networks to specialized foundation models that outperform generalist large models. SGs are now used for surgical workflow recognition, safety monitoring, and generative simulation. Conclusion: Scene graphs are becoming a bridge between perception and semantics and a core technology for next-generation intelligent surgical systems, although data annotation and real-time implementation remain challenging. Abstract: Scene graphs (SGs) provide structured relational representations crucial for decoding complex, dynamic surgical environments. This PRISMA-ScR-guided scoping review systematically maps the evolving landscape of SG research in surgery, charting its applications, methodological advancements, and future directions. Our analysis reveals rapid growth, yet uncovers a critical 'data divide': internal-view research (e.g., triplet recognition) almost exclusively uses real-world 2D video, while external-view 4D modeling relies heavily on simulated data, exposing a key translational research gap. Methodologically, the field has advanced from foundational graph neural networks to specialized foundation models that now significantly outperform generalist large vision-language models in surgical contexts. This progress has established SGs as a cornerstone technology for both analysis, such as workflow recognition and automated safety monitoring, and generative tasks like controllable surgical simulation. Although challenges in data annotation and real-time implementation persist, they are actively being addressed through emerging techniques. Surgical SGs are maturing into an essential semantic bridge, enabling a new generation of intelligent systems to improve surgical safety, efficiency, and training.
[115] A Real-Time On-Device Defect Detection Framework for Laser Power-Meter Sensors via Unsupervised Learning
Dongqi Zheng,Wenjin Fu,Guangzong Chen
Main category: cs.CV
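TL;DR: Proposes an automated vision-based system for detecting and classifying coating defects on laser power meter sensors, using an unsupervised anomaly detection framework with a UFlow network to achieve high accuracy and fast processing.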
Details
Motivation: To detect coating defects on laser power meter sensors (such as thermal damage and scratches) so that laser energy measurements in medical and industrial applications remain accurate. Method: An unsupervised anomaly detection model is trained only on images of normal samples. Preprocessing combines Laplacian edge detection with K-means clustering, StyleGAN2 provides data augmentation, and a UFlow architecture performs multi-scale feature extraction and anomaly map generation. Result: On 366 real images, accuracy is 93.8% on defective samples and 89.3% on good samples, with image-level AUROC of 0.957, pixel-level AUROC of 0.961, and a processing time of 0.5 seconds per image. Conclusion: The system detects coating defects accurately and efficiently, is suitable for practical deployment, and could reduce costs through automated quality control. Abstract: We present an automated vision-based system for defect detection and classification of laser power meter sensor coatings. Our approach addresses the critical challenge of identifying coating defects such as thermal damage and scratches that can compromise laser energy measurement accuracy in medical and industrial applications. The system employs an unsupervised anomaly detection framework that trains exclusively on "good" sensor images to learn normal coating distribution patterns, enabling detection of both known and novel defect types without requiring extensive labeled defect datasets. Our methodology consists of three key components: (1) a robust preprocessing pipeline using Laplacian edge detection and K-means clustering to segment the area of interest, (2) synthetic data augmentation via StyleGAN2, and (3) a UFlow-based neural network architecture for multi-scale feature extraction and anomaly map generation. Experimental evaluation on 366 real sensor images demonstrates 93.8% accuracy on defective samples and 89.3% accuracy on good samples, with image-level AUROC of 0.957 and pixel-level AUROC of 0.961. The system provides potential annual cost savings through automated quality control and achieves processing times of 0.5 seconds per image in an on-device implementation.
[116] Unlocking Financial Insights: An advanced Multimodal Summarization with Multimodal Output Framework for Financial Advisory Videos
Sarmistha Das,R E Zera Marveen Lyngkhoi,Sriparna Saha,Alka Maurya
Main category: cs.CV
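TL;DR: Proposes FASTER, a multimodal summarization framework for financial advisory videos that combines text, speech, and image features to produce accurate and interpretable summaries, and releases the Fin-APT dataset to support related research.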
Details
Motivation: Key information is hard to extract efficiently from long, multimodal financial advisory videos, and existing methods fall short on cross-modal alignment and factual consistency. Method: FASTER uses BLIP for semantic visual descriptions, OCR for on-screen text, and Whisper for transcription with speaker diarization. A modified DPO loss is combined with BOS features and a fact-checking mechanism, and a ranker-based retrieval mechanism aligns keyframes with the text summary. Result: In cross-domain experiments FASTER outperforms existing large language models and vision-language models, with better robustness and generalization. The authors also release Fin-APT, a dataset of 470 financial advisory videos. Conclusion: FASTER sets a new standard for multimodal summarization of financial content, makes that content more accessible and usable, and advances research on multimodal financial content analysis. Abstract: The dynamic propagation of social media has broadened the reach of financial advisory content through podcast videos, yet extracting insights from lengthy, multimodal segments (30-40 minutes) remains challenging. We introduce FASTER (Financial Advisory Summariser with Textual Embedded Relevant images), a modular framework that tackles three key challenges: (1) extracting modality-specific features, (2) producing optimized, concise summaries, and (3) aligning visual keyframes with associated textual points. FASTER employs BLIP for semantic visual descriptions, OCR for textual patterns, and Whisper-based transcription with Speaker diarization as BOS features. A modified Direct Preference Optimization (DPO)-based loss function, equipped with BOS-specific fact-checking, ensures precision, relevance, and factual consistency against the human-aligned summary. A ranker-based retrieval mechanism further aligns keyframes with summarized content, enhancing interpretability and cross-modal coherence. To acknowledge data resource scarcity, we introduce Fin-APT, a dataset comprising 470 publicly accessible financial advisory pep-talk videos for robust multimodal research. Comprehensive cross-domain experiments confirm FASTER's strong performance, robustness, and generalizability when compared to Large Language Models (LLMs) and Vision-Language Models (VLMs). By establishing a new standard for multimodal summarization, FASTER makes financial advisory content more accessible and actionable, thereby opening new avenues for research. 
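The keyframe alignment step can be pictured as a similarity ranking between summary points and candidate frames. The sketch below is a generic cosine-similarity ranker written for illustration only. It is not FASTER's ranker, whose details the abstract does not give. The function name `align_keyframes` and the toy two-dimensional embeddings are hypothetical, and real embeddings would come from a text or vision encoder.

```python
import numpy as np

def align_keyframes(summary_emb, frame_emb):
    """Return, for each summary point, the index of the most similar keyframe.

    summary_emb: (n_points, d) array of summary-point embeddings.
    frame_emb:   (n_frames, d) array of keyframe embeddings.
    Both are assumed to live in a shared embedding space.
    """
    s = summary_emb / np.linalg.norm(summary_emb, axis=1, keepdims=True)
    f = frame_emb / np.linalg.norm(frame_emb, axis=1, keepdims=True)
    sim = s @ f.T                       # cosine similarity matrix
    return sim.argmax(axis=1), sim

# Toy example: two summary points, three candidate keyframes.
summary = np.array([[1.0, 0.0], [0.0, 1.0]])
frames = np.array([[0.0, 2.0], [3.0, 0.1], [1.0, 1.0]])
best, sim = align_keyframes(summary, frames)
```

Normalizing before the dot product makes the score independent of embedding magnitude, so a frame is chosen for its direction in the embedding space rather than its norm.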
The dataset and code are available at: https://github.com/sarmistha-D/FASTER
[117] An Adaptor for Triggering Semi-Supervised Learning to Out-of-Box Serve Deep Image Clustering
Yue Duan,Lei Qi,Yinghuan Shi,Yang Gao
Main category: cs.CV
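TL;DR: Proposes ASD, an adaptor that cold-starts semi-supervised learning (SSL) learners for deep image clustering without pretraining or any other prerequisite.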
Details
Motivation: Existing methods require pretraining or a trained clustering model as a prerequisite, which limits the flexible use of semi-supervised learning (SSL) in image clustering. Method: Semantically aligned instance-level labels are learned from pseudo-labeled data with an instance-level classifier. Class transitions of predictions are tracked to extract high-level similarities, and the resulting cluster-level labels are used to trigger an SSL learner. Result: ASD outperforms the latest deep image clustering methods on several benchmarks, shows only a small gap to SSL methods that use ground-truth labels (for example 1.33% on CIFAR-10), and can boost existing methods. Conclusion: ASD enables cold-start clustering with SSL learners without any prerequisites, with strong performance and broad applicability. Abstract: Recently, some works integrate semi-supervised learning (SSL) techniques into deep clustering frameworks to enhance image clustering performance. However, they all need pretraining, clustering learning, or a trained clustering model as prerequisites, limiting the flexible and out-of-box application of SSL learners in the image clustering task. This work introduces ASD, an adaptor that enables the cold-start of SSL learners for deep image clustering without any prerequisites. Specifically, we first randomly sample pseudo-labeled data from all unlabeled data, and set an instance-level classifier to learn them with semantically aligned instance-level labels. With the ability of instance-level classification, we track the class transitions of predictions on unlabeled data to extract high-level similarities of instance-level classes, which can be utilized to assign cluster-level labels to pseudo-labeled data. Finally, we use the pseudo-labeled data with assigned cluster-level labels to trigger a general SSL learner trained on the unlabeled data for image clustering. We show the superior performance of ASD across various benchmarks against the latest deep image clustering approaches and very slight accuracy gaps compared to SSL methods using ground-truth, e.g., only 1.33% on CIFAR-10. Moreover, ASD can also further boost the performance of existing SSL-embedded deep image clustering methods.
[118] SiNGER: A Clearer Voice Distills Vision Transformers Further
Geunhyeok Yu,Sunjae Jeong,Yoonyoung Choi,Jaeseung Kim,Hyoseok Hwang
Main category: cs.CV
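TL;DR: Proposes SiNGER, a new knowledge distillation framework that uses nullspace-guided perturbation to suppress high-norm artifacts in the teacher model while preserving useful information, improving student model performance.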
Details
Motivation: Vision Transformers, the backbone of vision foundation models, produce high-norm artifacts that degrade representation quality. During knowledge distillation these artifacts dominate the objective, so students overfit to them and underweight informative signals, which limits the gains from larger models. Existing methods face a trade-off between suppressing artifacts and preserving information. Method: SiNGER refines the teacher features in a principled way, using nullspace-guided perturbation to suppress artifacts while preserving information, and then distills the refined features to the student. The perturbation is implemented efficiently with a LoRA-based adapter that requires little structural modification. Result: Extensive experiments show that SiNGER consistently improves student models across multiple downstream tasks, reaches state-of-the-art performance, and produces clearer, more interpretable representations. Conclusion: SiNGER resolves the conflict between artifact suppression and information preservation in knowledge distillation and offers an efficient plug-and-play improvement for distilling Vision Transformers. Abstract: Vision Transformers are widely adopted as the backbone of vision foundation models, but they are known to produce high-norm artifacts that degrade representation quality. When knowledge distillation transfers these features to students, high-norm artifacts dominate the objective, so students overfit to artifacts and underweight informative signals, diminishing the gains from larger models. Prior work attempted to remove artifacts but encountered an inherent trade-off between artifact suppression and preserving informative signals from teachers. To address this, we introduce Singular Nullspace-Guided Energy Reallocation (SiNGER), a novel distillation framework that suppresses artifacts while preserving informative signals. The key idea is principled teacher feature refinement: during refinement, we leverage the nullspace-guided perturbation to preserve information while suppressing artifacts. Then, the refined teacher's features are distilled to a student. We implement this perturbation efficiently with a LoRA-based adapter that requires minimal structural modification. Extensive experiments show that SiNGER consistently improves student models, achieving state-of-the-art performance in multiple downstream tasks and producing clearer and more interpretable representations.
[119] Fast-SEnSeI: Lightweight Sensor-Independent Cloud Masking for On-board Multispectral Sensors
Jan Kněžík,Jonáš Herec,Rado Pitoňák
Main category: cs.CV
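TL;DR: Proposes Fast-SEnSeI, a lightweight, sensor-independent encoder module for on-board cloud segmentation with multispectral sensors; it supports arbitrary band combinations and runs efficiently on a hybrid CPU-FPGA architecture.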
Details
Motivation: Existing cloud segmentation models usually depend on a specific sensor configuration and on ground-based processing. They lack flexibility and real-time capability and adapt poorly to the varied band setups of different satellite sensors.
Method: Building on SEnSeI-v2, the module adds an improved spectral descriptor, a lightweight architecture, and robust padding-band handling, and accepts arbitrary input bands. Paired with a quantized U-Net variant, the encoder runs on an embedded CPU with Apache TVM while the segmentation model is deployed on an FPGA, enabling efficient on-board processing.
Result: Experiments on the Sentinel-2 and Landsat 8 datasets confirm good adaptability to different band configurations and accurate cloud segmentation, with efficient inference at low resource cost.
Conclusion: Fast-SEnSeI delivers flexible, efficient on-board cloud segmentation that generalizes across sensors, suits resource-constrained space missions, and advances real-time autonomous processing of remote sensing data.
Abstract: Cloud segmentation is a critical preprocessing step for many Earth observation tasks, yet most models are tightly coupled to specific sensor configurations and rely on ground-based processing. In this work, we propose Fast-SEnSeI, a lightweight, sensor-independent encoder module that enables flexible, on-board cloud segmentation across multispectral sensors with varying band configurations. Building upon SEnSeI-v2, Fast-SEnSeI integrates an improved spectral descriptor, lightweight architecture, and robust padding-band handling. It accepts arbitrary combinations of spectral bands and their wavelengths, producing fixed-size feature maps that feed into a compact, quantized segmentation model based on a modified U-Net. The module runs efficiently on embedded CPUs using Apache TVM, while the segmentation model is deployed on FPGA, forming a CPU-FPGA hybrid pipeline suitable for space-qualified hardware. Evaluations on Sentinel-2 and Landsat 8 datasets demonstrate accurate segmentation across diverse input configurations.
[120] A Single Neuron Works: Precise Concept Erasure in Text-to-Image Diffusion Models
Qinqin He,Jiaqi Weng,Jialing Tao,Hui Xue
Main category: cs.CV
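TL;DR: Proposes SNCE, a method that erases harmful concepts from text-to-image models by acting on a single neuron: a sparse autoencoder and activation-pattern scoring locate and suppress the harmful-concept neuron, giving precise erasure while preserving image quality.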
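The locate-and-suppress pipeline in the TL;DR can be sketched on a toy sparse autoencoder. Everything here is an assumption made for illustration: the identity weights, the tiny embeddings, and in particular the scoring rule in `pick_concept_neuron`, which is a plain activation-frequency difference standing in for the paper's modulated frequency scoring.

```python
import numpy as np

def encode(x, W_enc, b_enc):
    """SAE encoder: linear map followed by ReLU, giving sparse latents."""
    return np.maximum(x @ W_enc + b_enc, 0.0)

def decode(z, W_dec):
    """SAE decoder: linear map back to the embedding space."""
    return z @ W_dec

def pick_concept_neuron(z_target, z_other, eps=1e-6):
    """Pick the latent that fires often on target prompts and rarely elsewhere.

    Simplified stand-in for the paper's scoring, not the authors' formula.
    """
    fires_on_target = (z_target > eps).mean(axis=0)
    fires_on_other = (z_other > eps).mean(axis=0)
    return int(np.argmax(fires_on_target - fires_on_other))

def erase(x, W_enc, b_enc, W_dec, neuron):
    """Encode, zero one latent neuron, and decode back to an embedding."""
    z = encode(x, W_enc, b_enc)
    z[..., neuron] = 0.0
    return decode(z, W_dec)

# Hypothetical 4-dimensional SAE with identity weights, for readability only.
W_enc, b_enc, W_dec = np.eye(4), np.zeros(4), np.eye(4)
target_emb = np.array([[1.0, 0.0, 2.0, 0.0], [0.0, 1.0, 3.0, 0.0]])  # prompts with the concept
other_emb = np.array([[1.0, 1.0, 0.0, 0.0], [0.0, 1.0, 0.0, 1.0]])   # unrelated prompts

neuron = pick_concept_neuron(encode(target_emb, W_enc, b_enc),
                             encode(other_emb, W_enc, b_enc))
cleaned_target = erase(target_emb, W_enc, b_enc, W_dec, neuron)
cleaned_other = erase(other_emb, W_enc, b_enc, W_dec, neuron)
```

In this toy setup only the third latent separates the two prompt groups, so zeroing it changes the target embeddings and leaves the unrelated ones untouched.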
Details
Motivation: Existing concept erasure methods struggle to remove harmful content thoroughly while preserving image generation quality, so a more precise, less intrusive safety control is needed.
Method: Train a sparse autoencoder (SAE) that maps text embeddings into a sparse, disentangled latent space, and design a neuron identification method based on modulated frequency scoring to locate the single neuron associated with a harmful concept; suppressing its activation erases the concept.
Result: On multiple benchmarks SNCE achieves state-of-the-art target concept erasure, clearly outperforming existing methods, while preserving generation for non-target concepts and showing strong robustness to adversarial attacks.
Conclusion: By operating on a single neuron, SNCE achieves efficient, precise concept erasure and offers a scalable, minimally invasive approach to safety control in text-to-image models.
Abstract: Text-to-image models exhibit remarkable capabilities in image generation. However, they also pose safety risks of generating harmful content. A key challenge of existing concept erasure methods is the precise removal of target concepts while minimizing degradation of image quality. In this paper, we propose Single Neuron-based Concept Erasure (SNCE), a novel approach that can precisely prevent harmful content generation by manipulating only a single neuron. Specifically, we train a Sparse Autoencoder (SAE) to map text embeddings into a sparse, disentangled latent space, where individual neurons align tightly with atomic semantic concepts. To accurately locate neurons responsible for harmful concepts, we design a novel neuron identification method based on the modulated frequency scoring of activation patterns. By suppressing activations of the harmful concept-specific neuron, SNCE achieves surgical precision in concept erasure with minimal disruption to image quality. Experiments on various benchmarks demonstrate that SNCE achieves state-of-the-art results in target concept erasure, while preserving the model's generation capabilities for non-target concepts. Additionally, our method exhibits strong robustness against adversarial attacks, significantly outperforming existing methods.
[121] OmniPlantSeg: Species Agnostic 3D Point Cloud Organ Segmentation for High-Resolution Plant Phenotyping Across Modalities
Andreas Gilson,Lukas Meyer,Oliver Scholz,Ute Schmid
Main category: cs.CV
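TL;DR: Proposes KD-SS, a simple yet effective algorithm for sub-sampling biological point clouds that works across plant species and sensor modalities and enables full-resolution point cloud segmentation without down-sampling.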
Details
Motivation: Existing point cloud segmentation methods for plant organs usually target specific species or sensor modalities and require heavy pre-processing and down-sampling, which limits generality and accuracy.
Method: Proposes KD-SS, a sub-sampling method that is agnostic to sensor and plant species and retains the original resolution; combined with current state-of-the-art segmentation models, it enables segmentation of the full-resolution point cloud.
Result: KD-SS was validated on several sensor modalities (photogrammetry, laser triangulation, and LiDAR) and on different plant species, with satisfying results.
Conclusion: KD-SS is a lightweight, resolution-retaining alternative to intensive pre-processing that applies broadly across species and sensor modalities.
Abstract: Accurate point cloud segmentation for plant organs is crucial for 3D plant phenotyping. Existing solutions are problem-specific, focusing on certain plant species or particular sensor modalities for data acquisition. Furthermore, it is common to use extensive pre-processing and down-sample the plant point clouds to meet hardware or neural network input size requirements. We propose a simple yet effective algorithm, KD-SS, for sub-sampling of biological point clouds that is agnostic to sensor data and plant species. The main benefit of this approach is that we do not need to down-sample our input data and thus enable segmentation of the full-resolution point cloud. Combining KD-SS with current state-of-the-art segmentation models shows satisfying results evaluated on different modalities such as photogrammetry, laser triangulation and LiDAR for various plant species. We propose KD-SS as a lightweight, resolution-retaining alternative to intensive pre-processing and down-sampling methods for plant organ segmentation regardless of the species and sensor modality used.
[122] Background Prompt for Few-Shot Out-of-Distribution Detection
Songyue Cai,Zongqian Wu,Yujie Mo,Liang Peng,Ping Hu,Xiaoshuang Shi,Xiaofeng Zhu
Main category: cs.CV
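TL;DR: Proposes Mambo, a new foreground-background decomposition framework for few-shot out-of-distribution detection; it combines local class similarity with a refined background similarity and adds patch self-calibrated tuning to improve robustness and performance.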
Details
Motivation: Existing methods have low robustness because they over-rely on local class similarity and on a fixed background patch extraction strategy.
Method: First learn a background prompt that contains both background and semantic information to obtain the local background similarity, then refine it with the local class similarity. Both similarities are used for background extraction, and patch self-calibrated tuning flexibly chooses the number of background patches for each sample.
Result: Experiments on real-world datasets show that Mambo outperforms state-of-the-art methods on OOD detection and near-OOD detection.
Conclusion: Mambo mitigates the reliance on local similarity and the fixed background extraction strategy of earlier methods, markedly improving the performance and robustness of few-shot OOD detection.
Abstract: Existing foreground-background (FG-BG) decomposition methods for few-shot out-of-distribution (FS-OOD) detection often suffer from low robustness due to over-reliance on the local class similarity and a fixed background patch extraction strategy. To address these challenges, we propose a new FG-BG decomposition framework, namely Mambo, for FS-OOD detection. Specifically, we propose to first learn a background prompt to obtain the local background similarity containing both the background and image semantic information, and then refine the local background similarity using the local class similarity. As a result, we use both the refined local background similarity and the local class similarity to conduct background extraction, reducing the dependence on the local class similarity in previous methods. Furthermore, we propose the patch self-calibrated tuning to consider the sample diversity to flexibly select numbers of background patches for different samples, thus addressing the issue of fixed background extraction strategies in previous methods. Extensive experiments on real-world datasets demonstrate that our proposed Mambo achieves the best performance, compared to SOTA methods in terms of OOD detection and near OOD detection setting. The source code will be released at https://github.com/YuzunoKawori/Mambo.
[123] Stratify or Die: Rethinking Data Splits in Image Segmentation
Naga Venkata Sai Jitin Jami,Thomas Altstidl,Jonas Mueller,Jindong Li,Dario Zanca,Bjoern Eskofier,Heike Leutheuser
Main category: cs.CV
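TL;DR: This paper proposes WDES, a new dataset splitting method that minimizes the Wasserstein distance to make label distributions similar across splits in segmentation tasks. Compared with random sampling it yields more representative splits, especially for small, class-imbalanced, and low-diversity datasets.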
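A minimal sketch of the idea in the TL;DR, under stated assumptions: each image is summarised by its per-class pixel fractions (the hypothetical array `fracs`), split quality is the mean 1-D Wasserstein distance between train and test fractions per class, and a simple swap-mutation evolutionary loop stands in for the paper's genetic algorithm. The cost function and search loop are illustrative choices, not the authors' definitions.

```python
import numpy as np

def wasserstein_1d(u, v):
    """1-D Wasserstein-1 distance between two empirical samples."""
    u, v = np.sort(u), np.sort(v)
    all_vals = np.sort(np.concatenate([u, v]))
    deltas = np.diff(all_vals)
    u_cdf = np.searchsorted(u, all_vals[:-1], side="right") / len(u)
    v_cdf = np.searchsorted(v, all_vals[:-1], side="right") / len(v)
    return float(np.sum(np.abs(u_cdf - v_cdf) * deltas))

def split_cost(fracs, test_mask):
    """Mean per-class distance between train and test label distributions."""
    train, test = fracs[~test_mask], fracs[test_mask]
    return float(np.mean([wasserstein_1d(train[:, c], test[:, c])
                          for c in range(fracs.shape[1])]))

def mutate(test_mask, rng):
    """Swap one test image with one train image, keeping the split sizes fixed."""
    child = test_mask.copy()
    i = rng.choice(np.flatnonzero(child))
    j = rng.choice(np.flatnonzero(~child))
    child[i], child[j] = False, True
    return child

def evolve_split(fracs, n_test, generations=200, children=8, seed=0):
    """Start from a random split and keep any mutation that lowers the cost."""
    rng = np.random.default_rng(seed)
    mask = np.zeros(len(fracs), dtype=bool)
    mask[rng.choice(len(fracs), size=n_test, replace=False)] = True
    start_cost = best_cost = split_cost(fracs, mask)
    for _ in range(generations):
        for _ in range(children):
            child = mutate(mask, rng)
            cost = split_cost(fracs, child)
            if cost < best_cost:
                mask, best_cost = child, cost
    return mask, start_cost, best_cost

data_rng = np.random.default_rng(1)
fracs = data_rng.dirichlet([0.3, 0.3, 0.3], size=40)  # hypothetical class pixel fractions
test_mask, random_cost, evolved_cost = evolve_split(fracs, n_test=10)
```

Because only improving mutations are kept, `evolved_cost` can never exceed the cost of the initial random split; how much it improves depends on the data.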
Details
Motivation: Random splitting in image segmentation often produces unrepresentative test sets, leading to biased evaluation and poor generalization. Stratified sampling works for classification but is hard to apply to segmentation because of the multi-label structure and class imbalance.
Method: Building on the idea of Iterative Pixel Stratification (IPS), the paper proposes Wasserstein-Driven Evolutionary Stratification (WDES), a genetic algorithm that minimizes the Wasserstein distance between splits to make their label distributions similar, and proves that it is globally optimal given enough generations.
Result: Measured with newly proposed statistical heterogeneity metrics, WDES produces more consistent splits than random sampling across street scenes, medical imaging, and satellite imagery, significantly lowering performance variance and making model evaluation more reliable.
Conclusion: WDES is an effective splitting method for image segmentation that yields more representative train/test sets. Its advantage is most pronounced on small, class-imbalanced, and low-diversity data, where it improves the accuracy and stability of model evaluation.
Abstract: Random splitting of datasets in image segmentation often leads to unrepresentative test sets, resulting in biased evaluations and poor model generalization. While stratified sampling has proven effective for addressing label distribution imbalance in classification tasks, extending these ideas to segmentation remains challenging due to the multi-label structure and class imbalance typically present in such data. Building on existing stratification concepts, we introduce Iterative Pixel Stratification (IPS), a straightforward, label-aware sampling method tailored for segmentation tasks. Additionally, we present Wasserstein-Driven Evolutionary Stratification (WDES), a novel genetic algorithm designed to minimize the Wasserstein distance, thereby optimizing the similarity of label distributions across dataset splits. We prove that WDES is globally optimal given enough generations. Using newly proposed statistical heterogeneity metrics, we evaluate both methods against random sampling and find that WDES consistently produces more representative splits. Applying WDES across diverse segmentation tasks, including street scenes, medical imaging, and satellite imagery, leads to lower performance variance and improved model evaluation. Our results also highlight the particular value of WDES in handling small, imbalanced, and low-diversity datasets, where conventional splitting strategies are most prone to bias.
[124] EnGraf-Net: Multiple Granularity Branch Network with Fine-Coarse Graft Grained for Classification Task
Riccardo La Grassa,Ignazio Gallo,Nicola Landro
Main category: cs.CV
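TL;DR: This paper proposes EnGraf-Net, an end-to-end deep neural network that uses hierarchical semantic associations (a taxonomy) as supervision. Without cropping or manual annotations, it matches or exceeds existing methods on fine-grained classification.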
Details
Motivation: Existing fine-grained classification methods rely on part annotations or automatic attention mechanisms, which often yield incomplete local feature representations. The authors argue that semantic associations and hierarchical structure can better separate highly similar classes.
Method: EnGraf-Net feeds a semantic hierarchy (taxonomy) into an end-to-end deep neural network as a supervision signal, using hierarchical semantic relations to strengthen feature representations and avoiding external annotations such as bounding boxes, part locations, or textual attributes.
Result: Extensive experiments on CIFAR-100, CUB-200-2011, and FGVC-Aircraft show that EnGraf-Net outperforms many existing fine-grained models and is competitive with the most recent state-of-the-art methods.
Conclusion: By using hierarchical semantic associations as supervision, EnGraf-Net achieves efficient and robust fine-grained image classification without part annotations or automatic attention mechanisms, confirming the value of semantic hierarchy for fine-grained recognition.
Abstract: Fine-grained classification models are designed to focus on the relevant details necessary to distinguish highly similar classes, particularly when intra-class variance is high and inter-class variance is low. Most existing models rely on part annotations such as bounding boxes, part locations, or textual attributes to enhance classification performance, while others employ sophisticated techniques to automatically extract attention maps. We posit that part-based approaches, including automatic cropping methods, suffer from an incomplete representation of local features, which are fundamental for distinguishing similar objects. While fine-grained classification aims to recognize the leaves of a hierarchical structure, humans recognize objects by also forming semantic associations. In this paper, we leverage semantic associations structured as a hierarchy (taxonomy) as supervised signals within an end-to-end deep neural network model, termed EnGraf-Net. Extensive experiments on three well-known datasets (CIFAR-100, CUB-200-2011, and FGVC-Aircraft) demonstrate the superiority of EnGraf-Net over many existing fine-grained models, showing competitive performance with the most recent state-of-the-art approaches, without requiring cropping techniques or manual annotations.
[125] Vision Transformers: the threat of realistic adversarial patches
Kasper Cools,Clara Maathuis,Alexander M. van Oers,Claudia S. Hübner,Nikos Deligiannis,Marijke Vandewal,Geert De Cubber
Main category: cs.CV
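TL;DR: This study examines how vulnerable Vision Transformers (ViTs) are to CNN-based adversarial patch attacks, in particular realistic patches generated with the Creases Transformation (CT) technique for person classification. Attack success rates differ markedly across ViT models, indicating that adversarial patches transfer across architectures and that the scale and methodology of pre-training strongly affect robustness.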
Details
Motivation: As machine learning systems spread into critical domains, their security becomes essential. Although ViTs improve on CNNs in performance and adversarial robustness, they can still be threatened by evasion attacks such as adversarial patches, so studying their vulnerabilities in realistic settings matters.
Method: Use the Creases Transformation (CT) technique to generate adversarial patches with natural crease patterns that mimic the geometric distortion of worn clothing, then test how well these patches transfer to several fine-tuned ViT models, measuring attack success on a binary person/non-person classification task.
Result: Across four ViT models, attack success rates range from 40.04% (google/vit-base-patch16-224-in21k) to 99.97% (facebook/dino-vitb16), with 66.40% for google/vit-base-patch16-224 and 65.17% for facebook/dinov3-vitb16, confirming that CNN-generated adversarial patches transfer effectively to ViTs.
Conclusion: ViT models remain vulnerable to adversarial patches crafted on CNNs. The attacks transfer well, and robustness depends closely on pre-training dataset scale and training methodology, so further work is needed to improve security.
Abstract: The increasing reliance on machine learning systems has made their security a critical concern. Evasion attacks enable adversaries to manipulate the decision-making processes of AI systems, potentially causing security breaches or misclassification of targets. Vision Transformers (ViTs) have gained significant traction in modern machine learning due to increased 1) performance compared to Convolutional Neural Networks (CNNs) and 2) robustness against adversarial perturbations. However, ViTs remain vulnerable to evasion attacks, particularly to adversarial patches, unique patterns designed to manipulate AI classification systems. These vulnerabilities are investigated by designing realistic adversarial patches to cause misclassification in person vs. non-person classification tasks using the Creases Transformation (CT) technique, which adds subtle geometric distortions similar to those occurring naturally when wearing clothing. This study investigates the transferability of adversarial attack techniques used in CNNs when applied to ViT classification models. Experimental evaluation across four fine-tuned ViT models on a binary person classification task reveals significant vulnerability variations: attack success rates ranged from 40.04% (google/vit-base-patch16-224-in21k) to 99.97% (facebook/dino-vitb16), with google/vit-base-patch16-224 achieving 66.40% and facebook/dinov3-vitb16 reaching 65.17%. These results confirm the cross-architectural transferability of adversarial patches from CNNs to ViTs, with pre-training dataset scale and methodology strongly influencing model resilience to adversarial attacks.
[126] UniTransfer: Video Concept Transfer via Progressive Spatial and Timestep Decomposition
Guojun Lei,Rong Zhang,Chi Wang,Tianhang Liu,Hong Li,Zhiyuan Ma,Weiwei Xu
Main category: cs.CV
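TL;DR: Proposes UniTransfer, an architecture that achieves precise and controllable video concept transfer through spatial and diffusion-timestep decomposition.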
Details
Motivation: Achieve finer, more controllable video concept transfer and improve generation quality and editability.
Method: Introduces spatial decomposition (foreground, background, motion flow) and a DiT-based dual-to-single-stream architecture, together with self-supervised pretraining and a Chain-of-Prompt mechanism for timestep decomposition, in which an LLM guides generation stage by stage.
Result: Delivers high-quality, highly controllable video concept transfer across diverse reference images and scenes, outperforming existing baselines.
Conclusion: Through its dual spatial and temporal decomposition, UniTransfer markedly improves the precision and controllability of video concept transfer.
Abstract: We propose a novel architecture UniTransfer, which introduces both spatial and diffusion timestep decomposition in a progressive paradigm, achieving precise and controllable video concept transfer. Specifically, in terms of spatial decomposition, we decouple videos into three key components: the foreground subject, the background, and the motion flow. Building upon this decomposed formulation, we further introduce a dual-to-single-stream DiT-based architecture for supporting fine-grained control over different components in the videos. We also introduce a self-supervised pretraining strategy based on random masking to enhance the decomposed representation learning from large-scale unlabeled video data. Inspired by the Chain-of-Thought reasoning paradigm, we further revisit the denoising diffusion process and propose a Chain-of-Prompt (CoP) mechanism to achieve the timestep decomposition. We decompose the denoising process into three stages of different granularity and leverage large language models (LLMs) for stage-specific instructions to guide the generation progressively. We also curate an animal-centric video dataset called OpenAnimal to facilitate the advancement and benchmarking of research in video concept transfer. Extensive experiments demonstrate that our method achieves high-quality and controllable video concept transfer across diverse reference images and scenes, surpassing existing baselines in both visual fidelity and editability. Web Page: https://yu-shaonian.github.io/UniTransfer-Web/
[127] VideoChat-R1.5: Visual Test-Time Scaling to Reinforce Multimodal Reasoning by Iterative Perception
Ziang Yan,Xinhao Li,Yinan He,Zhengrong Yue,Xiangyu Zeng,Yali Wang,Yu Qiao,Limin Wang,Yi Wang
Main category: cs.CV
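TL;DR: This paper proposes Visual Test-Time Scaling (VTTS), a new method that strengthens the reasoning of multimodal large language models (MLLMs) through iterative perception during inference, and introduces the accompanying VTTS-80K dataset.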
Details
Motivation: Existing MLLMs mainly analyze visual information through a static perception stage, which limits their reasoning. A method that supports dynamic, progressive perception is needed to move toward human-level understanding.
Method: VTTS uses an Iterative Perception (ITP) mechanism that combines reinforcement learning with spatio-temporal supervision. During inference it progressively focuses on high-confidence spatio-temporal regions, with updated textual predictions guiding the refinement. The VTTS-80K dataset is built to support this paradigm.
Result: On more than 15 benchmarks covering video conversation, video reasoning, and spatio-temporal perception, the VTTS-based Videochat-R1.5 model improves on strong baselines such as Qwen2.5VL-3B and -7B by more than 5% on average.
Conclusion: By spending more perceptual compute at inference time, VTTS improves performance and strengthens the multimodal reasoning of MLLMs, with good generalization and application potential.
Abstract: Inducing reasoning in multimodal large language models (MLLMs) is critical for achieving human-level perception and understanding. Existing methods mainly leverage LLM reasoning to analyze parsed visuals, often limited by static perception stages. This paper introduces Visual Test-Time Scaling (VTTS), a novel approach to enhance MLLMs' reasoning via iterative perception during inference. VTTS mimics humans' hierarchical attention by progressively refining focus on high-confidence spatio-temporal regions, guided by updated textual predictions. Specifically, VTTS employs an Iterative Perception (ITP) mechanism, incorporating reinforcement learning with spatio-temporal supervision to optimize reasoning. To support this paradigm, we also present VTTS-80K, a dataset tailored for iterative perception. These designs allow an MLLM to enhance its performance by increasing its perceptual compute. Extensive experiments validate VTTS's effectiveness and generalization across diverse tasks and benchmarks. Our newly introduced Videochat-R1.5 model has achieved remarkable improvements, with an average increase of over 5%, compared to robust baselines such as Qwen2.5VL-3B and -7B, across more than 15 benchmarks that encompass video conversation, video reasoning, and spatio-temporal perception.
[128] Mammo-CLIP Dissect: A Framework for Analysing Mammography Concepts in Vision-Language Models
Suaiba Amina Salahuddin,Teresa Dorszewski,Marit Almenning Martiniussen,Tone Hovda,Antonio Portaluri,Solveig Thrun,Michael Kampffmeyer,Elisabeth Wetzer,Kristoffer Wickstrøm,Robert Jenssen
Main category: cs.CV
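TL;DR: This paper presents Mammo-CLIP Dissect, the first concept-based explainability framework for deep learning models in mammography. It uses a mammography-specific vision-language model to label neurons and quantify their alignment with domain knowledge.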
Details
Motivation: Understanding what deep learning models learn is essential in clinical settings. Prior work focuses on pixel-based explainability and pays less attention to the textual concepts a model learns, which may be closer to how clinicians reason.
Method: Mammo-CLIP Dissect uses a mammography-specific vision-language model (Mammo-CLIP) as a "dissector" to label neurons at specified layers with human-interpretable textual concepts and to quantify their alignment with domain knowledge. It then systematically compares concept learning across training data and fine-tuning strategies.
Result: Models trained on mammography data capture more clinically relevant concepts than models trained on general images and align more closely with radiologists' workflows. Task-specific fine-tuning strengthens some concepts (e.g., benign calcifications) but can weaken coverage of others (e.g., density-related features), revealing a trade-off between specialisation and generalisation.
Conclusion: Mammo-CLIP Dissect helps reveal how CNNs capture mammography-specific knowledge, and comparing training and fine-tuning regimes clarifies how domain-specific training and task adaptation shape concept learning.
Abstract: Understanding what deep learning (DL) models learn is essential for the safe deployment of artificial intelligence (AI) in clinical settings. While previous work has focused on pixel-based explainability methods, less attention has been paid to the textual concepts learned by these models, which may better reflect the reasoning used by clinicians. We introduce Mammo-CLIP Dissect, the first concept-based explainability framework for systematically dissecting DL vision models trained for mammography. Leveraging a mammography-specific vision-language model (Mammo-CLIP) as a "dissector," our approach labels neurons at specified layers with human-interpretable textual concepts and quantifies their alignment to domain knowledge. Using Mammo-CLIP Dissect, we investigate three key questions: (1) how concept learning differs between DL vision models trained on general image datasets versus mammography-specific datasets; (2) how fine-tuning for downstream mammography tasks affects concept specialisation; and (3) which mammography-relevant concepts remain underrepresented. We show that models trained on mammography data capture more clinically relevant concepts and align more closely with radiologists' workflows than models not trained on mammography data. Fine-tuning for task-specific classification enhances the capture of certain concept categories (e.g., benign calcifications) but can reduce coverage of others (e.g., density-related features), indicating a trade-off between specialisation and generalisation. Our findings show that Mammo-CLIP Dissect provides insights into how convolutional neural networks (CNNs) capture mammography-specific knowledge. By comparing models across training data and fine-tuning regimes, we reveal how domain-specific training and task-specific adaptation shape concept learning. Code and concept set are available: https://github.com/Suaiba/Mammo-CLIP-Dissect.
[129] MOSS-ChatV: Reinforcement Learning with Process Reasoning Reward for Video Temporal Reasoning
Sicheng Tao,Jungang Li,Yibo Yan,Junyan Zhang,Yubo Gao,Hanqian Li,ShuHang Xun,Yuxuan Fan,Hong Chen,Jianxiang He,Xuming Hu
Main category: cs.CV
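TL;DR: This paper proposes MOSS-ChatV, a reinforcement learning framework with a Dynamic Time Warping (DTW)-based process reward that improves the process consistency of multimodal large language models in video reasoning, and builds MOSS-Video, a benchmark with annotated reasoning traces, for evaluation.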
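To make the DTW-based process reward in the TL;DR concrete, here is a minimal sketch of classic dynamic time warping over reasoning steps. The step distance (token-level Jaccard) and the mapping from DTW cost to a reward in (0, 1] are assumptions chosen for illustration; the digest does not specify the authors' exact formulation.

```python
def step_distance(a, b):
    """Hypothetical distance between two reasoning steps: 1 - token Jaccard overlap."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 0.0
    return 1.0 - len(ta & tb) / len(ta | tb)

def dtw(seq_a, seq_b, dist):
    """Classic DTW: minimal cumulative cost of a monotonic alignment of two sequences."""
    n, m = len(seq_a), len(seq_b)
    inf = float("inf")
    table = [[inf] * (m + 1) for _ in range(n + 1)]
    table[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = dist(seq_a[i - 1], seq_b[j - 1])
            table[i][j] = cost + min(table[i - 1][j],      # repeat a reference step
                                     table[i][j - 1],      # repeat a trace step
                                     table[i - 1][j - 1])  # advance both
    return table[n][m]

def process_reward(trace, reference):
    """Map DTW cost to a reward; 1.0 means the trace follows the reference exactly."""
    if not trace or not reference:
        return 0.0
    # An alignment path has fewer than n + m cells and each cell costs at most 1,
    # so dividing by n + m keeps the reward strictly positive.
    return 1.0 - dtw(trace, reference, step_distance) / (len(trace) + len(reference))

reference = ["the ball rolls left", "the ball hits the wall", "the ball bounces back"]
in_order = list(reference)
slow_start = [reference[0]] + reference  # same order, first step stated twice
reversed_trace = reference[::-1]         # right statements, wrong temporal order

reward_in_order = process_reward(in_order, reference)
reward_slow_start = process_reward(slow_start, reference)
reward_reversed = process_reward(reversed_trace, reference)
```

The warping tolerates a trace that lingers on one step, while a trace that states the same events in the wrong temporal order is penalised.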
Details
Motivation: In video reasoning, existing MLLMs often produce intermediate reasoning that is inconsistent with the video dynamics, which hurts interpretability and robustness even when the final answer is correct.
Method: Design a DTW-based process reward that uses a rule-based procedure to align reasoning traces with temporally grounded references, apply reinforcement learning for process supervision, and build the MOSS-Video benchmark for training and evaluation.
Result: MOSS-ChatV reaches 87.2% accuracy on the MOSS-Video test set, improves performance on general video benchmarks such as MVBench and MMVU, shows gains across different model architectures, and produces more consistent and stable reasoning according to GPT-4o evaluation.
Conclusion: The framework effectively improves the consistency and stability of video reasoning processes and shows good generality and application potential.
Abstract: Video reasoning has emerged as a critical capability for multimodal large language models (MLLMs), requiring models to move beyond static perception toward coherent understanding of temporal dynamics in complex scenes. Yet existing MLLMs often exhibit process inconsistency, where intermediate reasoning drifts from video dynamics even when the final answer is correct, undermining interpretability and robustness. To address this issue, we introduce MOSS-ChatV, a reinforcement learning framework with a Dynamic Time Warping (DTW)-based process reward. This rule-based reward aligns reasoning traces with temporally grounded references, enabling efficient process supervision without auxiliary reward models. We further identify dynamic state prediction as a key measure of video reasoning and construct MOSS-Video, a benchmark with annotated reasoning traces, where the training split is used to fine-tune MOSS-ChatV and the held-out split is reserved for evaluation. MOSS-ChatV achieves 87.2% on MOSS-Video (test) and improves performance on general video benchmarks such as MVBench and MMVU. The framework consistently yields gains across different architectures, including Qwen2.5-VL and Phi-2, confirming its broad applicability. Evaluations with GPT-4o-as-judge further show that MOSS-ChatV produces more consistent and stable reasoning traces.
[130] CARINOX: Inference-time Scaling with Category-Aware Reward-based Initial Noise Optimization and Exploration
Seyed Amir Kasaei,Ali Aghayari,Arash Marioriyad,Niki Sepasian,Shayan Baghayi Nejad,MohammadAmin Fazli,Mahdieh Soleymani Baghshah,Mohammad Hossein Rohban
Main category: cs.CV
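TL;DR: This paper proposes CARINOX, a unified framework that combines noise optimization and exploration with a reward selection procedure based on correlation with human judgments, substantially improving compositional alignment in text-to-image generation.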
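The combination of exploration and optimization described in the TL;DR can be sketched on a toy problem. This is only an illustration under loud assumptions: the "generator" is a fixed `tanh(A @ z)` map, the reward is a negative squared error to a hypothetical `target`, and the category-aware reward selection is not modelled at all.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 8)) / np.sqrt(8)  # hypothetical frozen "generator" weights
target = np.array([0.5, -0.5, 0.2, 0.0])      # hypothetical stand-in for the prompt

def reward(z):
    """Toy alignment reward: higher (closer to 0) when the output matches the target."""
    return -float(np.sum((np.tanh(A @ z) - target) ** 2))

def reward_grad(z):
    """Analytic gradient of the toy reward with respect to the initial noise z."""
    y = np.tanh(A @ z)
    return -2.0 * A.T @ ((y - target) * (1.0 - y ** 2))

def explore(n_samples):
    """Exploration: draw several initial noises and keep the best-scoring one."""
    seeds = rng.standard_normal((n_samples, 8))
    scores = [reward(z) for z in seeds]
    best = int(np.argmax(scores))
    return seeds[best], scores[best], scores[0]

def optimize(z, steps=100, lr=0.1):
    """Optimization: gradient ascent on the noise, returning the best iterate seen."""
    best_z, best_r = z.copy(), reward(z)
    for _ in range(steps):
        z = z + lr * reward_grad(z)
        r = reward(z)
        if r > best_r:
            best_z, best_r = z.copy(), r
    return best_z, best_r

z0, explored_reward, first_reward = explore(n_samples=16)
z_star, final_reward = optimize(z0)
```

Exploration supplies a good starting point so that gradient ascent is less likely to stall, and optimization then refines that point instead of relying on ever more samples.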
Details
Motivation: Existing text-to-image diffusion models struggle with compositional alignment when prompts involve complex object relationships, attributes, or spatial layouts, and existing approaches (optimization or exploration) each have limitations when used alone.
Method: CARINOX combines optimization and exploration of the initial noise and introduces a principled reward selection strategy based on correlation with human judgments to provide more consistent and effective guidance.
Result: Average alignment scores rise by 16% on T2I-CompBench++ and 11% on HRS, consistently beating state-of-the-art optimization and exploration methods while preserving image quality and diversity.
Conclusion: By fusing optimization with exploration and using a more reliable reward mechanism, CARINOX improves the compositional quality of text-to-image generation without any model fine-tuning.
Abstract: Text-to-image diffusion models, such as Stable Diffusion, can produce high-quality and diverse images but often fail to achieve compositional alignment, particularly when prompts describe complex object relationships, attributes, or spatial arrangements. Recent inference-time approaches address this by optimizing or exploring the initial noise under the guidance of reward functions that score text-image alignment without requiring model fine-tuning. While promising, each strategy has intrinsic limitations when used alone: optimization can stall due to poor initialization or unfavorable search trajectories, whereas exploration may require a prohibitively large number of samples to locate a satisfactory output. Our analysis further shows that neither single reward metrics nor ad-hoc combinations reliably capture all aspects of compositionality, leading to weak or inconsistent guidance. To overcome these challenges, we present Category-Aware Reward-based Initial Noise Optimization and Exploration (CARINOX), a unified framework that combines noise optimization and exploration with a principled reward selection procedure grounded in correlation with human judgments. Evaluations on two complementary benchmarks covering diverse compositional challenges show that CARINOX raises average alignment scores by +16% on T2I-CompBench++ and +11% on the HRS benchmark, consistently outperforming state-of-the-art optimization and exploration-based methods across all major categories, while preserving image quality and diversity. The project page is available at https://amirkasaei.com/carinox/.
[131] MotionFlow: Learning Implicit Motion Flow for Complex Camera Trajectory Control in Video Generation
Guojun Lei,Chi Wang,Yikai Wang,Hong Li,Ying Song,Weiwei Xu
Main category: cs.CV
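TL;DR: Proposes a method that converts camera and object motion into pixel motion and combines a stable diffusion network with a semantic prior to generate videos that follow a specified camera trajectory with consistent object motion.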
Details
Motivation: When the camera and objects move at the same time, existing methods have trouble separating their relative motion, which leads to inconsistent video generation and poor generalization.
Method: Unify camera and object motion as pixel motion, use a stable diffusion network to learn reference motion maps, and feed them together with a semantic object prior into an image-to-video network to generate the video.
Result: Experiments show that the model follows the specified camera trajectory while keeping object motion consistent, clearly outperforming current state-of-the-art methods.
Conclusion: The method resolves the consistency problems caused by coupled camera and object motion and improves the quality and controllability of video generation.
Abstract: Generating videos guided by camera trajectories poses significant challenges in achieving consistency and generalizability, particularly when both camera and object motions are present. Existing approaches often attempt to learn these motions separately, which may lead to confusion regarding the relative motion between the camera and the objects. To address this challenge, we propose a novel approach that integrates both camera and object motions by converting them into the motion of corresponding pixels. Utilizing a stable diffusion network, we effectively learn reference motion maps in relation to the specified camera trajectory. These maps, along with an extracted semantic object prior, are then fed into an image-to-video network to generate the desired video that can accurately follow the designated camera trajectory while maintaining consistent object motions. Extensive experiments verify that our model outperforms SOTA methods by a large margin.
[132] The Unwinnable Arms Race of AI Image Detection
Till Aczel,Lorenzo Vettor,Andreas Plesner,Roger Wattenhofer
Main category: cs.CV
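TL;DR: Against the backdrop of rapidly advancing image-generating AI, this paper studies the conditions under which discriminators are at a disadvantage, analyzing two key factors: data dimensionality and data complexity.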
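Data complexity is the less familiar of the two factors in the TL;DR. Kolmogorov complexity itself is uncomputable, so practical work usually falls back on a proxy; the sketch below uses compressed size, which gives an upper bound. This proxy is an assumption for illustration, since the digest does not say how the authors estimate complexity, and the two byte strings are hypothetical stand-ins for datasets.

```python
import random
import zlib

def compression_ratio(data: bytes) -> float:
    """Compressed size divided by raw size; lower means more regular data."""
    return len(zlib.compress(data, 9)) / len(data)

# Two hypothetical "datasets" at opposite ends of the complexity range.
simple = bytes(4096)                                      # all zeros: highly regular
gen = random.Random(0)
noisy = bytes(gen.getrandbits(8) for _ in range(4096))    # pseudo-random: incompressible

ratio_simple = compression_ratio(simple)
ratio_noisy = compression_ratio(noisy)
```

The all-zero buffer compresses to a tiny fraction of its size while the pseudo-random one barely compresses, which gives a rough way to place a dataset between "very simple" and "highly complex".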
Details
Motivation: As generative models advance, the line between real and synthetic images blurs. Understanding when discriminators struggle to tell them apart is needed to drive more robust detection methods.
Method: Using Kolmogorov complexity as a measure of a dataset's intrinsic structure, the paper analyzes theoretically and studies empirically how data dimensionality and complexity affect discriminator performance.
Result: Datasets of very high or very low complexity make synthetic images harder to detect, while intermediate-complexity datasets are the most favorable for discriminators to expose generator flaws.
Conclusion: Discriminators have the greatest advantage on intermediate-complexity data. The result identifies key factors in the contest between generative and discriminative models and offers guidance for designing future detection techniques.
Abstract: The rapid progress of image generative AI has blurred the boundary between synthetic and real images, fueling an arms race between generators and discriminators. This paper investigates the conditions under which discriminators are most disadvantaged in this competition. We analyze two key factors: data dimensionality and data complexity. While increased dimensionality often strengthens the discriminator's ability to detect subtle inconsistencies, complexity introduces a more nuanced effect. Using Kolmogorov complexity as a measure of intrinsic dataset structure, we show that both very simple and highly complex datasets reduce the detectability of synthetic images; generators can learn simple datasets almost perfectly, whereas extreme diversity masks imperfections. In contrast, intermediate-complexity datasets create the most favorable conditions for detection, as generators fail to fully capture the distribution and their errors remain visible.
[133] WAVECLIP: Wavelet Tokenization for Adaptive-Resolution CLIP
Moshe Kimhi,Erez Koifman,Ehud Rivlin,Eli Schwartz,Chaim Baskin
Main category: cs.CV
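TL;DR: Proposes WAVECLIP, a unified model based on wavelet decomposition that supports adaptive-resolution inference in CLIP, gaining computational efficiency through multi-level wavelet decomposition and key-value caching.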
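A rough sketch of the coarse-to-fine idea in the TL;DR, using a hand-written 2-D Haar transform. It simplifies heavily: the real model feeds wavelet tokens to a transformer with key-value caching, whereas this example reconstructs a higher-resolution image at each refinement and calls a hypothetical `classify` callback. The function names, the threshold, and `toy_classify` are all assumptions.

```python
import numpy as np

def haar_level(x):
    """One level of a 2-D Haar transform: a coarse band plus three detail bands."""
    a, b = x[0::2, 0::2], x[0::2, 1::2]
    c, d = x[1::2, 0::2], x[1::2, 1::2]
    ll = (a + b + c + d) / 2.0
    details = ((a - b + c - d) / 2.0, (a + b - c - d) / 2.0, (a - b - c + d) / 2.0)
    return ll, details

def inverse_haar_level(ll, details):
    """Exactly invert haar_level, doubling the resolution."""
    lh, hl, hh = details
    x = np.empty((ll.shape[0] * 2, ll.shape[1] * 2), dtype=ll.dtype)
    x[0::2, 0::2] = (ll + lh + hl + hh) / 2.0
    x[0::2, 1::2] = (ll - lh + hl - hh) / 2.0
    x[1::2, 0::2] = (ll + lh - hl - hh) / 2.0
    x[1::2, 1::2] = (ll - lh - hl + hh) / 2.0
    return x

def adaptive_predict(image, classify, levels=2, threshold=0.8):
    """Start from the coarsest band; add detail only while confidence is too low."""
    pyramid, current = [], image
    for _ in range(levels):
        current, det = haar_level(current)
        pyramid.append(det)
    probs, refinements = classify(current), 0
    while probs.max() < threshold and pyramid:
        current = inverse_haar_level(current, pyramid.pop())
        probs = classify(current)
        refinements += 1
    return int(probs.argmax()), refinements

def toy_classify(x):
    """Hypothetical classifier that is only confident at 8x8 resolution or above."""
    p = 0.95 if x.shape[0] >= 8 else 0.6
    return np.array([p, 1.0 - p])

image = np.arange(64, dtype=float).reshape(8, 8)
label, refinements = adaptive_predict(image, toy_classify, levels=2, threshold=0.8)
```

Lowering `threshold` makes the loop exit earlier at a coarser resolution, which is the compute-accuracy dial the abstract describes.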
Details
Motivation: Standard CLIP models use fixed-resolution patch embeddings, which prevents a flexible trade-off between compute and accuracy and makes it hard to adjust resource use at inference time.
Method: Replace standard patch embeddings with a multi-level wavelet decomposition and add causal cross-level attention with key-value caching, so inference starts at low resolution, refines on demand, and supports early exit.
Result: Zero-shot classification experiments validate the approach: confidence-based gating enables adaptive exits, cutting computation substantially while keeping accuracy competitive.
Conclusion: WAVECLIP offers an adjustable compute-accuracy trade-off from a single deployed model, needs only lightweight distillation to reach efficient inference, and suits resource-sensitive settings.
Abstract: We introduce WAVECLIP, a single unified model for adaptive resolution inference in CLIP, enabled by wavelet-based tokenization. WAVECLIP replaces standard patch embeddings with a multi-level wavelet decomposition, enabling the model to process images coarse to fine while naturally supporting multiple resolutions within the same model. At inference time, the model begins with low resolution tokens and refines only when needed, using key-value caching and causal cross-level attention to reuse computation, effectively introducing to the model only new information when needed. We evaluate WAVECLIP in zero-shot classification, demonstrating that a simple confidence-based gating mechanism enables adaptive early exits. This allows users to dynamically choose a compute-accuracy trade-off using a single deployed model. Our approach requires only lightweight distillation from a frozen CLIP teacher and achieves competitive accuracy with significant computational savings.
[134] Can Less Precise Be More Reliable? A Systematic Evaluation of Quantization's Impact on CLIP Beyond Accuracy
Aymen Bouguerra,Daniel Montoya,Alexandra Gomez-Villa,Fabio Arnez,Chokri Mraidha
Main category: cs.CV
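TL;DR: This paper studies how quantization affects reliability metrics beyond accuracy in CLIP models. Quantization improves the calibration of underconfident models and can improve OOD detection even when calibration degrades, and specific quantization-aware training methods deliver gains in both efficiency and performance.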
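Two ingredients of such a study can be sketched briefly: simulated weight quantization and a calibration metric. Both choices are assumptions. Symmetric per-tensor rounding is one common scheme, and expected calibration error (ECE) is one common calibration measure; the digest does not say which schemes or metrics the authors use.

```python
import numpy as np

def fake_quantize(w, n_bits=8):
    """Round weights to a symmetric n-bit grid, then map back to floats."""
    qmax = 2 ** (n_bits - 1) - 1
    scale = np.abs(w).max() / qmax
    if scale == 0:
        return w.copy()
    q = np.clip(np.round(w / scale), -qmax, qmax)
    return q * scale

def expected_calibration_error(probs, labels, n_bins=10):
    """ECE: bin-weighted average gap between confidence and accuracy."""
    conf = probs.max(axis=1)
    correct = (probs.argmax(axis=1) == labels).astype(float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (conf > lo) & (conf <= hi)
        if in_bin.any():
            ece += in_bin.mean() * abs(correct[in_bin].mean() - conf[in_bin].mean())
    return float(ece)

weights = np.array([-1.0, 0.3, 0.5, 1.0])   # hypothetical layer weights
quantized = fake_quantize(weights)
max_error = float(np.max(np.abs(quantized - weights)))

probs = np.array([[1.0, 0.0], [1.0, 0.0]])  # fully confident predictions of class 0
ece_all_correct = expected_calibration_error(probs, np.array([0, 0]))
ece_half_wrong = expected_calibration_error(probs, np.array([0, 1]))
```

Comparing ECE before and after applying `fake_quantize` to a model's weights is one simple way to check whether quantization moves calibration up or down for a given checkpoint.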
Details
Motivation: Vision-language models such as CLIP show zero-shot generalization on safety-related tasks like out-of-distribution detection, but their reliability under quantization is underexplored, particularly given the need for both computational efficiency and reliability in deployment.
Method: Run a large-scale evaluation of quantized CLIP models that measures not only in-distribution accuracy but also a range of reliability metrics, including calibration and OOD detection, and examine the effect of different quantization-aware training methods.
Result: Quantization consistently improves calibration for typically underconfident pre-trained models but can degrade it for overconfident ones. This degradation does not prevent gains in OOD detection, and specific quantization-aware training methods improve zero-shot accuracy, calibration, and OOD robustness together.
Conclusion: Quantization is not only an efficiency tool; it can also strengthen reliability and robustness in a multi-objective setting, which challenges the usual efficiency-performance trade-off and offers a new perspective on deploying VLMs efficiently and reliably.
Abstract: The powerful zero-shot generalization capabilities of vision-language models (VLMs) like CLIP have enabled new paradigms for safety-related tasks such as out-of-distribution (OOD) detection. However, additional aspects crucial for the computationally efficient and reliable deployment of CLIP are still overlooked. In particular, the impact of quantization on CLIP's performance beyond accuracy remains underexplored. This work presents a large-scale evaluation of quantization on CLIP models, assessing not only in-distribution accuracy but a comprehensive suite of reliability metrics and revealing counterintuitive results driven by pre-training source. We demonstrate that quantization consistently improves calibration for typically underconfident pre-trained models, while often degrading it for overconfident variants. Intriguingly, this degradation in calibration does not preclude gains in other reliability metrics; we find that OOD detection can still improve for these same poorly calibrated models. Furthermore, we identify specific quantization-aware training (QAT) methods that yield simultaneous gains in zero-shot accuracy, calibration, and OOD robustness, challenging the view of a strict efficiency-performance trade-off. These findings offer critical insights for navigating the multi-objective problem of deploying efficient, reliable, and robust VLMs by utilizing quantization beyond its conventional role.
[135] TABLET: A Large-Scale Dataset for Robust Visual Table Understanding
Iñigo Alonso,Imanol Miranda,Eneko Agirre,Mirella Lapata
Main category: cs.CV
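TL;DR: TABLET is a large-scale visual table understanding (VTU) dataset with 4 million examples across 20 tasks, built on 2 million real tables, 88% of which preserve their original visualization. It provides paired image-HTML representations, metadata, and provenance information, supporting model training and evaluation in realistic settings.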
Details
Motivation: Existing VTU datasets mostly use synthetic renderings that lack the complexity and diversity of real-world tables, and they give no access to the underlying serialized data, which limits generalization and extensibility.
Method: Build TABLET, a large-scale dataset of 4 million examples covering 20 tasks, with 88% of tables keeping their original visualization, and provide paired image-HTML representations, metadata, and provenance information. The dataset is used to fine-tune vision-language models such as Qwen2.5-VL-7B.
Result: Models fine-tuned on TABLET perform better on both seen and unseen VTU tasks and are more robust on real-world table images.
Conclusion: By preserving real visualizations and data provenance, TABLET gives VTU models a more reliable and extensible basis for training and evaluation.
Abstract: While table understanding increasingly relies on pixel-only settings where tables are processed as visual representations, current benchmarks predominantly use synthetic renderings that lack the complexity and visual diversity of real-world tables. Additionally, existing visual table understanding (VTU) datasets offer fixed examples with single visualizations and pre-defined instructions, providing no access to underlying serialized data for reformulation. We introduce TABLET, a large-scale VTU dataset with 4 million examples across 20 tasks, grounded in 2 million unique tables where 88% preserve original visualizations. Each example includes paired image-HTML representations, comprehensive metadata, and provenance information linking back to the source datasets. Fine-tuning vision-language models like Qwen2.5-VL-7B on TABLET improves performance on seen and unseen VTU tasks while increasing robustness on real-world table visualizations. By preserving original visualizations and maintaining example traceability in a unified large-scale collection, TABLET establishes a foundation for robust training and extensible evaluation of future VTU models.
[136] Sigma: Semantically Informative Pre-training for Skeleton-based Sign Language Understanding
Muxin Pu,Mei Kuan Lim,Chun Yong Chong,Chen Change Loy
Main category: cs.CV
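TL;DR: Proposes Sigma, a unified skeleton-based sign language understanding framework that achieves state-of-the-art performance on multiple tasks through sign-aware early fusion, hierarchical alignment learning, and multi-task pre-training.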
Details
Motivation: Existing sign language understanding methods suffer from three problems: weak semantic grounding, an imbalance between local and global information, and inefficient cross-modal learning. Method: A sign-aware early fusion mechanism, a hierarchical alignment learning strategy, and a unified pre-training framework that combines contrastive learning, text matching, and language modelling. Result: State-of-the-art results on multiple benchmarks for isolated sign language recognition, continuous sign language recognition, and gloss-free translation. Conclusion: Skeletal data can serve as an effective stand-alone modality for sign language understanding, and semantically informative pre-training substantially improves model performance. Abstract: Pre-training has proven effective for learning transferable features in sign language understanding (SLU) tasks. Recently, skeleton-based methods have gained increasing attention because they can robustly handle variations in subjects and backgrounds without being affected by appearance or environmental factors. Current SLU methods continue to face three key limitations: 1) weak semantic grounding, as models often capture low-level motion patterns from skeletal data but struggle to relate them to linguistic meaning; 2) imbalance between local details and global context, with models either focusing too narrowly on fine-grained cues or overlooking them for broader context; and 3) inefficient cross-modal learning, as constructing semantically aligned representations across modalities remains difficult. To address these, we propose Sigma, a unified skeleton-based SLU framework featuring: 1) a sign-aware early fusion mechanism that facilitates deep interaction between visual and textual modalities, enriching visual features with linguistic context; 2) a hierarchical alignment learning strategy that jointly maximises agreements across different levels of paired features from different modalities, effectively capturing both fine-grained details and high-level semantic relationships; and 3) a unified pre-training framework that combines contrastive learning, text matching and language modelling to promote semantic consistency and generalisation. Sigma achieves new state-of-the-art results on isolated sign language recognition, continuous sign language recognition, and gloss-free sign language translation on multiple benchmarks spanning different sign and spoken languages, demonstrating the impact of semantically informative pre-training and the effectiveness of skeletal data as a stand-alone solution for SLU.
[137] Learning Conformal Explainers for Image Classifiers
Amr Alkhatib,Stephanie Lowry
Main category: cs.CV
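TL;DR: Proposes a new conformal prediction-based method that lets users directly control the fidelity of generated explanations, and validates on several image datasets that it outperforms existing methods in fidelity and informational efficiency.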
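As a rough illustration of the calibration step behind a conformal approach like the one summarized above, the sketch below computes a split-conformal threshold from a set of calibration scores. The score definition (the smallest fraction of top-ranked features that keeps the model's prediction unchanged) and the names `conformal_threshold` and `calib_scores` are assumptions made for illustration; the paper's four conformity functions are not reproduced here.

```python
import math


def conformal_threshold(calib_scores, alpha):
    """Return the split-conformal quantile of the calibration scores.

    With n scores, the k-th smallest score with k = ceil((n + 1) * (1 - alpha))
    covers a new exchangeable score with probability at least 1 - alpha.
    """
    n = len(calib_scores)
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        # Too few calibration points for the requested coverage level.
        return float("inf")
    return sorted(calib_scores)[k - 1]


# Hypothetical calibration scores: fraction of top-ranked features needed
# to keep the classifier's prediction unchanged on each calibration image.
calib_scores = [0.10, 0.35, 0.20, 0.50, 0.15, 0.40, 0.25]
threshold = conformal_threshold(calib_scores, alpha=0.25)
```

Under these assumptions, keeping the top `threshold` fraction of features for a new image would be expected to preserve the prediction at roughly the requested rate; a smaller `alpha` asks for higher fidelity and yields larger explanation regions.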
Details
Motivation: Existing feature attribution methods for explaining image predictions fall short in robustness and fidelity, and typically require ground-truth explanations for calibration, which limits their reliability and practicality. Method: A conformal prediction-based approach that produces trustworthy explanations by identifying a subset of salient features sufficient to preserve the model's prediction; four conformity functions quantify how well explanations conform to the model's predictions, with no need for ground-truth explanations during calibration. Result: Five explainers were evaluated on six image datasets; FastSHAP consistently outperforms competing methods in fidelity and informational efficiency (the size of the explanation regions), and superpixel-based conformity measures are more effective than pixel-wise ones. Conclusion: The method improves the fidelity and robustness of explanations, lets users control explanation quality, and does not depend on ground-truth explanations, which makes it practical. Abstract: Feature attribution methods are widely used for explaining image-based predictions, as they provide feature-level insights that can be intuitively visualized. However, such explanations often vary in their robustness and may fail to faithfully reflect the reasoning of the underlying black-box model. To address these limitations, we propose a novel conformal prediction-based approach that enables users to directly control the fidelity of the generated explanations. The method identifies a subset of salient features that is sufficient to preserve the model's prediction, regardless of the information carried by the excluded features, and without demanding access to ground-truth explanations for calibration. Four conformity functions are proposed to quantify the extent to which explanations conform to the model's predictions. The approach is empirically evaluated using five explainers across six image datasets. The empirical results demonstrate that FastSHAP consistently outperforms the competing methods in terms of both fidelity and informational efficiency, the latter measured by the size of the explanation regions. Furthermore, the results reveal that conformity measures based on super-pixels are more effective than their pixel-wise counterparts.
[138] Evaluating the Evaluators: Metrics for Compositional Text-to-Image Generation
Seyed Amir Kasaei,Ali Aghayari,Arash Marioriyad,Niki Sepasian,MohammadAmin Fazli,Mahdieh Soleymani Baghshah,Mohammad Hossein Rohban
Main category: cs.CV
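TL;DR: Studies how widely used evaluation metrics for text-image generation behave on compositional tasks, finding that no single metric is consistently superior across tasks and that different metric families perform differently across scenarios, which underscores the need for careful and transparent metric selection.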
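One common way to check how well an automated metric tracks human judgment is rank correlation. The sketch below is a minimal pure-Python Spearman correlation; the function names and the toy scores are illustrative assumptions, and the study itself is described as going beyond simple correlation, so this covers only the most basic building block.

```python
def ranks(values):
    """Average ranks (1-based); tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation; assumes neither input is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = sum((a - mean_x) ** 2 for a in x) ** 0.5
    norm_y = sum((b - mean_y) ** 2 for b in y) ** 0.5
    return cov / (norm_x * norm_y)


def spearman(metric_scores, human_scores):
    """Spearman rank correlation between a metric and human ratings."""
    return pearson(ranks(metric_scores), ranks(human_scores))


# Hypothetical scores for four generated images on one compositional category.
metric_scores = [0.1, 0.4, 0.3, 0.9]
human_scores = [1, 3, 2, 5]
rho = spearman(metric_scores, human_scores)
```

Computing `rho` separately for each compositional category (for example attributes versus relations) is one simple way to see whether a metric's agreement with humans varies by task.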
Details
Motivation: Existing evaluation metrics for text-image generation are mostly adopted by convention or popularity without being validated against human judgment, so their effectiveness for compositional understanding needs systematic evaluation. Method: A broad analysis of commonly used metrics across different compositional challenges, comparing how metric families (VQA-based, embedding-based, and image-only) align with human judgments. Result: (1) no single metric performs best consistently across tasks; (2) VQA-based metrics are not always superior; (3) certain embedding-based metrics are stronger in specific cases; (4) image-only metrics contribute little to compositional evaluation. Conclusion: Metrics should be chosen carefully for the task at hand rather than by popularity, and reported transparently, to improve the trustworthiness of evaluation and of their use in generative models. Abstract: Text-image generation has advanced rapidly, but assessing whether outputs truly capture the objects, attributes, and relations described in prompts remains a central challenge. Evaluation in this space relies heavily on automated metrics, yet these are often adopted by convention or popularity rather than validated against human judgment. Because evaluation and reported progress in the field depend directly on these metrics, it is critical to understand how well they reflect human preferences. To address this, we present a broad study of widely used metrics for compositional text-image evaluation. Our analysis goes beyond simple correlation, examining their behavior across diverse compositional challenges and comparing how different metric families align with human judgments. The results show that no single metric performs consistently across tasks: performance varies with the type of compositional problem. Notably, VQA-based metrics, though popular, are not uniformly superior, while certain embedding-based metrics prove stronger in specific cases. Image-only metrics, as expected, contribute little to compositional evaluation, as they are designed for perceptual quality rather than alignment. These findings underscore the importance of careful and transparent metric selection, both for trustworthy evaluation and for their use as reward models in generation. Project page is available at https://amirkasaei.com/eval-the-evals/.
[139] Hallucination as an Upper Bound: A New Perspective on Text-to-Image Evaluation
Seyed Amir Kasaei,Mohammad Hossein Rohban
Main category: cs.CV
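TL;DR: Proposes a new definition of hallucination in text-to-image generative models, framing it as bias-driven deviation, and introduces a taxonomy of attribute, relation, and object hallucinations to support deeper evaluation of such models.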
Details
Motivation: Existing evaluations mainly check consistency between the prompt and the generated content and overlook what the model generates beyond the prompt; hallucination in text-to-image models still lacks a clear definition. Method: Define hallucination in text-to-image generation as deviation driven by the model's prior knowledge or biases, and build a taxonomy of three categories: attribute, relation, and object hallucinations. Result: The framing provides an upper bound for evaluating text-to-image models and surfaces hidden model biases, supporting richer model analysis. Conclusion: The new definition and taxonomy allow hallucination in text-to-image models to be identified and evaluated more systematically, laying groundwork for improving model reliability. Abstract: In language and vision-language models, hallucination is broadly understood as content generated from a model's prior knowledge or biases rather than from the given input. While this phenomenon has been studied in those domains, it has not been clearly framed for text-to-image (T2I) generative models. Existing evaluations mainly focus on alignment, checking whether prompt-specified elements appear, but overlook what the model generates beyond the prompt. We argue for defining hallucination in T2I as bias-driven deviations and propose a taxonomy with three categories: attribute, relation, and object hallucinations. This framing introduces an upper bound for evaluation and surfaces hidden biases, providing a foundation for richer assessment of T2I models.
[140] SlideMamba: Entropy-Based Adaptive Fusion of GNN and Mamba for Enhanced Representation Learning in Digital Pathology
Shakib Khan,Fariba Dambandkhameneh,Nazim Shaikh,Yao Nie,Raghavan Venugopal,Xiao Li
Main category: cs.CV
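TL;DR: Proposes SlideMamba, a generalizable deep learning framework that combines the Mamba architecture with graph neural networks (GNNs) for whole slide image (WSI) analysis; an entropy-based adaptive fusion strategy integrates local and global information, and the model outperforms existing methods at predicting gene fusion and mutation status.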
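The entropy-based fusion idea summarized above can be sketched in a few lines: each branch outputs class probabilities, and the branch with lower predictive entropy receives the larger weight. The softmax-over-negative-entropy weighting and the names `entropy_weighted_fusion`, `p_local`, and `p_global` are assumptions for illustration, not the paper's exact formulation.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of a probability vector."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def entropy_weighted_fusion(p_local, p_global):
    """Fuse two class-probability vectors, favoring the lower-entropy branch."""
    h_local, h_global = entropy(p_local), entropy(p_global)
    # Softmax over negative entropies: lower entropy gives a larger weight.
    e_local, e_global = math.exp(-h_local), math.exp(-h_global)
    w_local = e_local / (e_local + e_global)
    w_global = 1.0 - w_local
    fused = [w_local * a + w_global * b for a, b in zip(p_local, p_global)]
    return fused, (w_local, w_global)


# Hypothetical branch outputs: a confident local (GNN-like) branch and an
# uncertain global (Mamba-like) branch on a binary task.
fused, (w_local, w_global) = entropy_weighted_fusion([0.95, 0.05], [0.5, 0.5])
```

Because the weights sum to one and each input is a probability vector, the fused output remains a valid distribution; when both branches are equally uncertain the weights reduce to a plain average.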
Details
Motivation: To better model both local spatial relationships and long-range contextual dependencies in WSI analysis, addressing the limits of conventional methods in capturing multi-scale pathological features. Method: SlideMamba combines Mamba modules (strong at long-range dependencies) with GNNs (focused on fine-grained local interactions) and uses an entropy-based confidence weighting mechanism to adaptively fuse the two branches' outputs. Result: On gene fusion and mutation prediction, SlideMamba reaches a PRAUC of 0.751 ± 0.05, outperforming MIL, Trans-MIL, Mamba-only, GNN-only, and GAT-Mamba; it is also competitive in ROC AUC, sensitivity, and specificity. Conclusion: Integrating Mamba and GNNs through entropy-based adaptive fusion improves WSI analysis and shows potential for spatially resolved predictive modeling in computational pathology. Abstract: Advances in computational pathology increasingly rely on extracting meaningful representations from Whole Slide Images (WSIs) to support various clinical and biological tasks. In this study, we propose a generalizable deep learning framework that integrates the Mamba architecture with Graph Neural Networks (GNNs) for enhanced WSI analysis. Our method is designed to capture both local spatial relationships and long-range contextual dependencies, offering a flexible architecture for digital pathology analysis. Mamba modules excel in capturing long-range global dependencies, while GNNs emphasize fine-grained short-range spatial interactions. To effectively combine these complementary signals, we introduce an adaptive fusion strategy that uses an entropy-based confidence weighting mechanism. This approach dynamically balances contributions from both branches by assigning higher weight to the branch with more confident (lower-entropy) predictions, depending on the contextual importance of local versus global information for different downstream tasks. We demonstrate the utility of our approach on a representative task: predicting gene fusion and mutation status from WSIs. Our framework, SlideMamba, achieves an area under the precision recall curve (PRAUC) of 0.751 ± 0.05, outperforming MIL (0.491 ± 0.042), Trans-MIL (0.39 ± 0.017), Mamba-only (0.664 ± 0.063), GNN-only (0.748 ± 0.091), and a prior similar work GAT-Mamba (0.703 ± 0.075). SlideMamba also achieves competitive results across ROC AUC (0.738 ± 0.055), sensitivity (0.662 ± 0.083), and specificity (0.725 ± 0.094). These results highlight the strength of the integrated architecture, enhanced by the proposed entropy-based adaptive fusion strategy, and suggest promising potential for spatially-resolved predictive modeling tasks in computational pathology.
[141] Hunyuan3D-Omni: A Unified Framework for Controllable Generation of 3D Assets
Team Hunyuan3D,Bowen Zhang,Chunchao Guo,Haolin Liu,Hongyu Yan,Huiwen Shi,Jingwei Huang,Junlin Yu,Kunhong Li,Linus,Penghao Wang,Qingxiang Lin,Sicong Liu,Xianghui Yang,Yixuan Tang,Yunfei Zhao,Zeqiang Lai,Zhihao Liang,Zibo Zhao
Main category: cs.CV
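TL;DR: Hunyuan3D-Omni is a unified framework built on Hunyuan3D 2.1 for fine-grained, controllable 3D asset generation; it fuses multiple conditioning signals (point clouds, voxels, bounding boxes, and skeletal poses) to give precise control over geometry, topology, and pose.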
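This entry's abstract describes training that selects one control modality per example and biases the selection toward harder signals. A minimal sketch of that kind of weighted sampling follows; the weight values, the `MODALITY_WEIGHTS` table, and the function name are hypothetical, and the progressive part of the schedule (how the weights change during training) is not modeled.

```python
import random

# Hypothetical difficulty weights: a larger weight means sampled more often.
MODALITY_WEIGHTS = {
    "skeleton_pose": 0.4,  # treated as a harder control signal
    "bounding_box": 0.3,
    "voxel": 0.2,
    "point_cloud": 0.1,    # treated as an easier control signal
}


def sample_control_modality(rng, weights=MODALITY_WEIGHTS):
    """Pick exactly one control modality for a training example."""
    names = list(weights)
    return rng.choices(names, weights=[weights[n] for n in names], k=1)[0]


# Draw modalities for 10,000 simulated training examples.
rng = random.Random(0)
counts = {name: 0 for name in MODALITY_WEIGHTS}
for _ in range(10_000):
    counts[sample_control_modality(rng)] += 1
```

Sampling a single modality per example, rather than supplying all of them, is what exposes the model to missing inputs during training under this scheme.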
Details
Motivation: Existing 3D generative models rely mainly on image or text conditioning and lack fine-grained cross-modal control, which limits controllability and practical use; a unified framework that accepts multiple input modalities and offers fine control is needed. Method: Hunyuan3D-Omni handles multiple input signals (images, point clouds, voxels, bounding boxes, skeletal poses) in a single cross-modal architecture and is trained with a progressive, difficulty-aware sampling strategy that favors harder control signals (such as skeletal pose) and downweights easier ones (such as point clouds), strengthening multi-modal fusion and robustness to missing inputs. Result: Experiments show improved generation accuracy, support for geometry-aware transformations, and greater robustness in production workflows; the additional control signals improve control over geometry and pose. Conclusion: With a unified cross-modal architecture and a new training strategy, Hunyuan3D-Omni enables finer, more controllable 3D asset generation and advances practical use in games, film, and design. Abstract: Recent advances in 3D-native generative models have accelerated asset creation for games, film, and design. However, most methods still rely primarily on image or text conditioning and lack fine-grained, cross-modal controls, which limits controllability and practical adoption. To address this gap, we present Hunyuan3D-Omni, a unified framework for fine-grained, controllable 3D asset generation built on Hunyuan3D 2.1. In addition to images, Hunyuan3D-Omni accepts point clouds, voxels, bounding boxes, and skeletal pose priors as conditioning signals, enabling precise control over geometry, topology, and pose. Instead of separate heads for each modality, our model unifies all signals in a single cross-modal architecture. We train with a progressive, difficulty-aware sampling strategy that selects one control modality per example and biases sampling toward harder signals (e.g., skeletal pose) while downweighting easier ones (e.g., point clouds), encouraging robust multi-modal fusion and graceful handling of missing inputs. Experiments show that these additional controls improve generation accuracy, enable geometry-aware transformations, and increase robustness for production workflows.
[142] Learning to Look: Cognitive Attention Alignment with Vision-Language Models
Ryan L. Yang,Dipkamal Bhusal,Nidhi Rastogi
Main category: cs.CV
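TL;DR: Proposes a scalable framework that uses vision-language models to automatically generate semantic attention maps and aligns CNN attention with them through an auxiliary loss, improving generalization and cognitive plausibility.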
Details
Motivation: Existing methods guide model attention with manually annotated concept supervision and explanation regularization, but the annotations are costly and hard to scale. Method: Use vision-language models with natural language prompts to automatically generate semantic attention maps, and introduce an auxiliary loss that aligns CNN attention with these language-guided maps. Result: State of the art on ColoredMNIST and performance close to annotation-heavy baselines on DecoyMNIST, with less reliance on shortcut features and attention that better matches human intuition. Conclusion: The method improves the reliability and interpretability of CNN decisions without manual annotation and scales well. Abstract: Convolutional Neural Networks (CNNs) frequently "cheat" by exploiting superficial correlations, raising concerns about whether they make predictions for the right reasons. Inspired by cognitive science, which highlights the role of attention in robust human perception, recent methods have sought to guide model attention using concept-based supervision and explanation regularization. However, these techniques depend on labor-intensive, expert-provided annotations, limiting their scalability. We propose a scalable framework that leverages vision-language models to automatically generate semantic attention maps using natural language prompts. By introducing an auxiliary loss that aligns CNN attention with these language-guided maps, our approach promotes more reliable and cognitively plausible decision-making without manual annotation. Experiments on challenging datasets, ColoredMNIST and DecoyMNIST, show that our method achieves state-of-the-art performance on ColoredMNIST and remains competitive with annotation-heavy baselines on DecoyMNIST, demonstrating improved generalization, reduced shortcut reliance, and model attention that better reflects human intuition.
[143] Decipher-MR: A Vision-Language Foundation Model for 3D MRI Representations
Zhijian Yang,Noel DSouza,Istvan Megyeri,Xiaojian Xu,Amin Honarmandi Shandiz,Farzin Haddadpour,Krisztian Koos,Laszlo Rusko,Emanuele Valeriano,Bharadwaj Swaninathan,Lei Wu,Parminder Bhatia,Taha Kass-Hout,Erhan Bas
Main category: cs.CV
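TL;DR: Presents Decipher-MR, a vision-language foundation model for 3D MRI trained on a large-scale multi-region, multi-sequence MRI dataset; it combines self-supervised learning with report-guided text supervision, adapts efficiently to a range of clinical tasks, and outperforms existing methods on benchmarks such as disease classification and anatomical localization.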
Details
Motivation: Data scarcity and narrow anatomical focus limit the use of existing foundation models for MRI analysis; a scalable, generalizable MRI-specific foundation model is needed. Method: Decipher-MR combines self-supervised vision learning with report-guided text supervision and is trained on a large-scale dataset of 200,000 MRI series; a modular design keeps the encoder frozen and tunes lightweight task-specific decoders for each task. Result: On benchmarks covering disease classification, demographic prediction, anatomical localization, and cross-modal retrieval, Decipher-MR consistently outperforms existing foundation models and task-specific methods, showing stronger generalization and adaptability. Conclusion: Decipher-MR is a scalable, general-purpose MRI foundation model that supports efficient, low-overhead multi-task adaptation and could broaden the use of MRI-based AI in clinical and research settings. Abstract: Magnetic Resonance Imaging (MRI) is a critical medical imaging modality in clinical diagnosis and research, yet its complexity and heterogeneity pose challenges for automated analysis, particularly in scalable and generalizable machine learning applications. While foundation models have revolutionized natural language and vision tasks, their application to MRI remains limited due to data scarcity and narrow anatomical focus. In this work, we present Decipher-MR, a 3D MRI-specific vision-language foundation model trained on a large-scale dataset comprising 200,000 MRI series from over 22,000 studies spanning diverse anatomical regions, sequences, and pathologies. Decipher-MR integrates self-supervised vision learning with report-guided text supervision to build robust, generalizable representations, enabling effective adaptation across broad applications. To enable robust and diverse clinical tasks with minimal computational overhead, Decipher-MR supports a modular design that enables tuning of lightweight, task-specific decoders attached to a frozen pretrained encoder. Following this setting, we evaluate Decipher-MR across diverse benchmarks including disease classification, demographic prediction, anatomical localization, and cross-modal retrieval, demonstrating consistent performance gains over existing foundation models and task-specific approaches. Our results establish Decipher-MR as a scalable and versatile foundation for MRI-based AI, facilitating efficient development across clinical and research domains.
[144] Instruction-tuned Self-Questioning Framework for Multimodal Reasoning
You-Won Jang,Yu-Jung Heo,Jaeseok Kim,Minsu Lee,Du-Seong Chang,Byoung-Tak Zhang
Main category: cs.CV
TL;DR: Proposes SQ-InstructBLIP, a vision-language understanding model that improves multi-step reasoning by generating image-aware sub-questions and sub-answers.
Details
Motivation: Existing methods are limited on visual question answering tasks that require multi-step reasoning: they cannot make effective use of fine-grained image information, and they depend on black-box large language models that are hard to reproduce and interpret. Method: A unified architecture with a Questioner, an Answerer, and a Reasoner that share the same model structure; it iteratively generates image-related sub-questions and sub-answers and uses them when reasoning about the main question. Result: Experiments show that SQ-InstructBLIP reasons more accurately on VQA tasks than previous methods, confirming the value of image-aware sub-questions. Conclusion: Through an interpretable modular design, SQ-InstructBLIP improves the multi-step reasoning of vision-language models and addresses both the black-box problem and the underuse of visual information. Abstract: The field of vision-language understanding has been actively researched in recent years, thanks to the development of Large Language Models (LLMs). However, the field still struggles with problems requiring multi-step reasoning, even for very simple questions. Recent studies adopt LLMs to tackle this problem by iteratively generating sub-questions and answers. However, this approach has disadvantages: 1) the fine-grained visual content of images is unavailable to LLMs that cannot read visual information, and 2) the internal mechanisms of black-box LLMs are inaccessible and difficult to reproduce. To solve these problems, we propose the SQ (Self-Questioning)-InstructBLIP, which improves inference performance by generating image-aware informative sub-questions and sub-answers iteratively. SQ-InstructBLIP consists of a Questioner, Answerer, and Reasoner that share the same architecture. Questioner and Answerer generate sub-questions and sub-answers to help infer the main-question, and Reasoner performs reasoning on the main-question considering the generated sub-question information. Our experiments show that the proposed method SQ-InstructBLIP, which uses the generated sub-questions as additional information when solving the VQA task, performs more accurate reasoning than the previous works.
[145] Every Subtlety Counts: Fine-grained Person Independence Micro-Action Recognition via Distributionally Robust Optimization
Feng-Qi Cui,Jinyang Huang,Anyang Tong,Ziyu Jia,Jie Zhang,Zhi Liu,Dan Guo,Jianwei Lu,Meng Wang
Main category: cs.CV
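TL;DR: Proposes a person-independent universal micro-action recognition framework that learns person-agnostic representations through distributionally robust optimization; two plug-and-play modules at the feature and loss levels improve accuracy and robustness in real-world scenarios.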
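To make the distributionally robust idea concrete, the sketch below scores a batch by its worst pseudo-group rather than its average, and adds a penalty on how much group means differ. This is a generic group-robust objective written under stated assumptions; `group_robust_loss`, `variance_weight`, and the pseudo-group ids are hypothetical and this is not the paper's Group-Invariant Regularized Loss.

```python
def group_robust_loss(losses, group_ids, variance_weight=0.1):
    """Worst-group mean loss plus a penalty on the variance of group means."""
    sums, counts = {}, {}
    for loss, group in zip(losses, group_ids):
        sums[group] = sums.get(group, 0.0) + loss
        counts[group] = counts.get(group, 0) + 1
    means = [sums[g] / counts[g] for g in sums]
    average = sum(means) / len(means)
    variance = sum((m - average) ** 2 for m in means) / len(means)
    # Optimizing the maximum focuses training on the hardest pseudo-group.
    return max(means) + variance_weight * variance


# Hypothetical per-sample losses with two pseudo-groups standing in for
# unseen persons: group 1 is much harder than group 0.
batch_loss = group_robust_loss([0.0, 0.0, 4.0, 4.0], [0, 0, 1, 1])
```

An average loss over this batch would be 2.0, whereas the robust objective is dominated by the harder group, which is the behavior a person-independent model needs when some individuals are poorly represented.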
Details
Motivation: In existing micro-action recognition methods, inter-person variability makes the same action look different across individuals, which prevents robust generalization in real-world scenarios. Method: The Person Independence Universal Micro-action Recognition Framework comprises a Temporal-Frequency Alignment Module (the temporal branch uses Wasserstein-regularized alignment to stabilize dynamic trajectories; the frequency branch introduces variance-guided perturbations) and a Group-Invariant Regularized Loss (pseudo-groups simulate unseen person-specific distributions, boundary samples are up-weighted, and subgroup variance is regularized). Result: Experiments on the large-scale MA-52 dataset show that the framework outperforms existing methods in accuracy and robustness and generalizes stably under fine-grained conditions. Conclusion: The framework mitigates the effect of individual differences on micro-action recognition and strengthens generalization through joint feature-level and loss-level optimization, making it suitable for real-world micro-action recognition. Abstract: Micro-action Recognition is vital for psychological assessment and human-computer interaction. However, existing methods often fail in real-world scenarios because inter-person variability causes the same action to manifest differently, hindering robust generalization. To address this, we propose the Person Independence Universal Micro-action Recognition Framework, which integrates Distributionally Robust Optimization principles to learn person-agnostic representations. Our framework contains two plug-and-play components operating at the feature and loss levels. At the feature level, the Temporal-Frequency Alignment Module normalizes person-specific motion characteristics with a dual-branch design: the temporal branch applies Wasserstein-regularized alignment to stabilize dynamic trajectories, while the frequency branch introduces variance-guided perturbations to enhance robustness against person-specific spectral differences. A consistency-driven fusion mechanism integrates both branches. At the loss level, the Group-Invariant Regularized Loss partitions samples into pseudo-groups to simulate unseen person-specific distributions. By up-weighting boundary cases and regularizing subgroup variance, it forces the model to generalize beyond easy or frequent samples, thus enhancing robustness to difficult variations. Experiments on the large-scale MA-52 dataset demonstrate that our framework outperforms existing methods in both accuracy and robustness, achieving stable generalization under fine-grained conditions.
[146] Dense Semantic Matching with VGGT Prior
Songlin Yang,Tianyi Wei,Yushi Lan,Zeqi Xiao,Anyi Rao,Xingang Pan
Main category: cs.CV
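TL;DR: Proposes a semantic matching method built on the 3D geometric foundation model VGGT. By reusing early features, fine-tuning later ones, and adding a semantic head with a cycle-consistent training strategy, it achieves cross-instance pixel-level semantic matching under data scarcity and improves geometry awareness and matching reliability.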
Details
Motivation: Existing semantic matching methods suffer from geometric ambiguity and depend on nearest-neighbor rules; they struggle with symmetric structures, generalize poorly, and ignore cross-image invisibility and manifold preservation, so more robust geometry-aware descriptors and holistic dense matching mechanisms are needed. Method: Build on VGGT's geometry-aware features and holistic matching ability by keeping its early feature stages, fine-tuning the later stages, and adding a semantic head for bidirectional correspondences; adapt it to semantic matching through cycle-consistent training, synthetic data augmentation, and a progressive training recipe that includes aliasing artifact mitigation. Result: Experiments show that the method surpasses previous approaches in geometry awareness, matching reliability, and manifold preservation, with better dense semantic matching performance on multiple benchmarks. Conclusion: The paper adapts VGGT, originally designed for multi-view geometric matching of a single instance, to cross-instance semantic matching, addressing model transfer under data scarcity and offering a new effective paradigm for semantic matching. Abstract: Semantic matching aims to establish pixel-level correspondences between instances of the same category and represents a fundamental task in computer vision. Existing approaches suffer from two limitations: (i) Geometric Ambiguity: Their reliance on 2D foundation model features (e.g., Stable Diffusion, DINO) often fails to disambiguate symmetric structures, requiring extra fine-tuning yet lacking generalization; (ii) Nearest-Neighbor Rule: Their pixel-wise matching ignores cross-image invisibility and neglects manifold preservation. These challenges call for geometry-aware pixel descriptors and holistic dense correspondence mechanisms. Inspired by recent advances in 3D geometric foundation models, we turn to VGGT, which provides geometry-grounded features and holistic dense matching capabilities well aligned with these needs. However, directly transferring VGGT is challenging, as it was originally designed for geometry matching within cross views of a single instance, misaligned with cross-instance semantic matching, and further hindered by the scarcity of dense semantic annotations. To address this, we propose an approach that (i) retains VGGT's intrinsic strengths by reusing early feature stages, fine-tuning later ones, and adding a semantic head for bidirectional correspondences; and (ii) adapts VGGT to the semantic matching scenario under data scarcity through a cycle-consistent training strategy, synthetic data augmentation, and a progressive training recipe with aliasing artifact mitigation. Extensive experiments demonstrate that our approach achieves superior geometry awareness, matching reliability, and manifold preservation, outperforming previous baselines.
[147] MedVSR: Medical Video Super-Resolution with Cross State-Space Propagation
Xinyu Liu,Guolei Sun,Cheng Wang,Yixuan Yuan,Ender Konukoglu
Main category: cs.CV
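TL;DR: Proposes MedVSR, a new framework for medical video super-resolution (VSR) that handles the camera shake, noise, and abrupt frame transitions found in low-resolution medical videos and improves the reconstruction of tissue structures.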
Details
Motivation: High-resolution medical videos are hard to acquire because of hardware limitations and physiological constraints; on clinical low-resolution videos, existing VSR models suffer from large optical flow errors, alignment difficulties, and artifacts that may mislead doctors. Method: MedVSR has two core modules: Cross State-Space Propagation (CSSP), which achieves precise alignment by projecting distant frames as control matrices within state-space models, and Inner State-Space Reconstruction (ISSR), which jointly uses long-range spatial feature learning and large-kernel short-range information aggregation to improve tissue structure reconstruction and reduce artifacts. Result: Experiments on four medical datasets, including endoscopy and cataract surgery, show that MedVSR significantly outperforms existing VSR methods in reconstruction performance and efficiency. Conclusion: MedVSR addresses the alignment and structural distortion problems in medical video super-resolution and has good potential for clinical use. Abstract: High-resolution (HR) medical videos are vital for accurate diagnosis, yet are hard to acquire due to hardware limitations and physiological constraints. Clinically, the collected low-resolution (LR) medical videos present unique challenges for video super-resolution (VSR) models, including camera shake, noise, and abrupt frame transitions, which result in significant optical flow errors and alignment difficulties. Additionally, tissues and organs exhibit continuous and nuanced structures, but current VSR models are prone to introducing artifacts and distorted features that can mislead doctors. To this end, we propose MedVSR, a tailored framework for medical VSR. It first employs Cross State-Space Propagation (CSSP) to address the imprecise alignment by projecting distant frames as control matrices within state-space models, enabling the selective propagation of consistent and informative features to neighboring frames for effective alignment. Moreover, we design an Inner State-Space Reconstruction (ISSR) module that enhances tissue structures and reduces artifacts with joint long-range spatial feature learning and large-kernel short-range information aggregation. Experiments across four datasets in diverse medical scenarios, including endoscopy and cataract surgeries, show that MedVSR significantly outperforms existing VSR models in reconstruction performance and efficiency. Code released at https://github.com/CUHK-AIM-Group/MedVSR.
[148] MMR1: Enhancing Multimodal Reasoning with Variance-Aware Sampling and Open Resources
Sicong Leng,Jing Wang,Jiaxi Li,Hao Zhang,Zhiqiang Hu,Boqiang Zhang,Yuming Jiang,Hang Zhang,Xin Li,Lidong Bing,Deli Zhao,Wei Lu,Yu Rong,Aixin Sun,Shijian Lu
Main category: cs.CV
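TL;DR: Proposes Variance-Aware Sampling (VAS), a strategy that stabilizes reinforcement learning post-training for multimodal reasoning models, and releases a large-scale, high-quality long chain-of-thought dataset together with a family of open-source models.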
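The motivation for variance-aware selection can be seen in a small numeric sketch: with group-relative advantages, a prompt whose sampled rollouts all receive the same reward contributes zero advantage, while mixed outcomes give a usable signal. The helper names and the toy reward groups are assumptions for illustration; the sketch ranks prompts by outcome variance only and omits the trajectory-diversity term that the abstract says is part of the Variance Promotion Score.

```python
def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one prompt's group of rollouts."""
    n = len(rewards)
    mean = sum(rewards) / n
    std = (sum((r - mean) ** 2 for r in rewards) / n) ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]


def outcome_variance(rewards):
    """Variance of binary rewards; equals p * (1 - p) for pass rate p."""
    p = sum(rewards) / len(rewards)
    return p * (1 - p)


def select_prompts(reward_groups, k):
    """Return the k prompt ids whose rollouts show the highest reward variance."""
    ranked = sorted(
        reward_groups,
        key=lambda pid: outcome_variance(reward_groups[pid]),
        reverse=True,
    )
    return ranked[:k]


# Hypothetical binary rewards for four rollouts per prompt.
reward_groups = {
    "easy": [1, 1, 1, 1],   # always solved: no learning signal
    "mixed": [1, 0, 1, 0],  # half solved: maximal outcome variance
    "hard": [0, 0, 0, 1],   # rarely solved
}
chosen = select_prompts(reward_groups, k=1)
```

In this toy setup the always-solved prompt yields all-zero advantages, which mirrors the vanishing-gradient issue the entry attributes to low reward variance.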
Details
Motivation: Large multimodal reasoning models are held back by the lack of high-quality long chain-of-thought data and by the instability of reinforcement learning algorithms in post-training, in particular the vanishing gradients caused by low reward variance. Method: Variance-Aware Sampling (VAS) selects data using a Variance Promotion Score (VPS) that combines outcome variance and trajectory diversity to raise reward variance; the authors also release about 1.6 million long chain-of-thought examples, about 15k RL QA pairs, and a complete end-to-end training codebase. Result: Experiments show that VAS improves performance on mathematical reasoning benchmarks; ablation studies confirm the contribution of each component, and theoretical analysis shows that reward variance lower-bounds the policy gradient magnitude. Conclusion: VAS stabilizes policy optimization, and the released data and models give the community standardized baselines that advance multimodal reasoning. Abstract: Large multimodal reasoning models have achieved rapid progress, but their advancement is constrained by two major limitations: the absence of open, large-scale, high-quality long chain-of-thought (CoT) data, and the instability of reinforcement learning (RL) algorithms in post-training. Group Relative Policy Optimization (GRPO), the standard framework for RL fine-tuning, is prone to gradient vanishing when reward variance is low, which weakens optimization signals and impairs convergence. This work makes three contributions: (1) We propose Variance-Aware Sampling (VAS), a data selection strategy guided by Variance Promotion Score (VPS) that combines outcome variance and trajectory diversity to promote reward variance and stabilize policy optimization. (2) We release large-scale, carefully curated resources containing ~1.6M long CoT cold-start data and ~15k RL QA pairs, designed to ensure quality, difficulty, and diversity, along with a fully reproducible end-to-end training codebase. (3) We open-source a family of multimodal reasoning models in multiple scales, establishing standardized baselines for the community. Experiments across mathematical reasoning benchmarks demonstrate the effectiveness of both the curated data and the proposed VAS. Comprehensive ablation studies and analyses provide further insight into the contributions of each component. In addition, we theoretically establish that reward variance lower-bounds the expected policy gradient magnitude, with VAS serving as a practical mechanism to realize this guarantee. Our code, data, and checkpoints are available at https://github.com/LengSicong/MMR1.
[149] A Sentinel-3 foundation model for ocean colour
Geoffrey Dawson,Remy Vandaele,Andrew Taylor,David Moffat,Helen Tamura-Wicks,Sarah Jackson,Rosie Lickorish,Paolo Fraccaro,Hywel Williams,Chunbo Luo,Anne Jones
Main category: cs.CV
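TL;DR: Presents a new geospatial AI foundation model for ocean remote sensing based on the Prithvi-EO Vision Transformer architecture; with self-supervised pre-training and fine-tuning on small amounts of labelled data, it performs strongly at estimating chlorophyll concentration and ocean primary production.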
Details
Motivation: Labelled data in ocean science are sparse and expensive to collect, and conventional models cannot make full use of limited data, so foundation models that exploit large unlabelled datasets are needed to improve remote sensing applications. Method: Use the Prithvi-EO Vision Transformer architecture, pre-train it in a self-supervised way on Sentinel-3 OLCI data with a reconstruction task, and fine-tune and evaluate it on two downstream tasks: chlorophyll concentration estimation and ocean primary production estimation. Result: The model outperforms existing baselines on both tasks, makes effective use of small amounts of high-quality labelled data, captures detailed spatial patterns of ocean colour, and matches in situ point observations well. Conclusion: This new generation of geospatial AI foundation models could provide more robust, data-driven insights into ocean ecosystems and their role in global climate processes, especially where labelled data are scarce. Abstract: Artificial Intelligence (AI) Foundation models (FMs), pre-trained on massive unlabelled datasets, have the potential to drastically change AI applications in ocean science, where labelled data are often sparse and expensive to collect. In this work, we describe a new foundation model using the Prithvi-EO Vision Transformer architecture which has been pre-trained to reconstruct data from the Sentinel-3 Ocean and Land Colour Instrument (OLCI). We evaluate the model by fine-tuning on two downstream marine earth observation tasks. We first assess model performance compared to current baseline models used to quantify chlorophyll concentration. We then evaluate the FM's ability to refine remote sensing-based estimates of ocean primary production. Our results demonstrate the utility of self-trained FMs for marine monitoring, in particular for making use of small amounts of high quality labelled data and in capturing detailed spatial patterns of ocean colour whilst matching point observations. We conclude that this new generation of geospatial AI models has the potential to provide more robust, data-driven insights into ocean ecosystems and their role in global climate processes.
[150] Does FLUX Already Know How to Perform Physically Plausible Image Composition?
Shilin Lu,Zhuming Lian,Zihan Zhou,Shaocong Zhang,Chen Zhao,Adams Wai-Kin Kong
Main category: cs.CV
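TL;DR: Proposes SHINE, a training-free image composition framework that uses a manifold-steered anchor loss, degradation-suppression guidance, and adaptive background blending to achieve high-quality, seamless object insertion under complex lighting and at high resolution.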
Details
Motivation: Existing image composition methods perform poorly under complex lighting (such as shadows and water reflections) and with diverse high-resolution inputs, and they rely on latent inversion or brittle attention manipulation, which leads to unnatural object poses or low generation quality. Method: SHINE uses pretrained customization adapters (such as IP-Adapter) to introduce a manifold-steered anchor loss, combined with degradation-suppression guidance and adaptive background blending, to achieve high-fidelity insertion without training; a new benchmark, ComplexCompo, evaluates performance in complex scenes. Result: On ComplexCompo and DreamEditBench, SHINE reaches state-of-the-art results on standard metrics such as DINOv2 and on human-aligned scores such as DreamSim and ImageReward, with markedly fewer artifacts and misalignments. Conclusion: SHINE offers an efficient, training-free solution for image composition that unlocks the physical and resolution priors of diffusion models and improves composition quality and realism in complex scenes. Abstract: Image composition aims to seamlessly insert a user-specified object into a new scene, but existing models struggle with complex lighting (e.g., accurate shadows, water reflections) and diverse, high-resolution inputs. Modern text-to-image diffusion models (e.g., SD3.5, FLUX) already encode essential physical and resolution priors, yet lack a framework to unleash them without resorting to latent inversion, which often locks object poses into contextually inappropriate orientations, or brittle attention surgery. We propose SHINE, a training-free framework for Seamless, High-fidelity Insertion with Neutralized Errors. SHINE introduces manifold-steered anchor loss, leveraging pretrained customization adapters (e.g., IP-Adapter) to guide latents for faithful subject representation while preserving background integrity. Degradation-suppression guidance and adaptive background blending are proposed to further eliminate low-quality outputs and visible seams. To address the lack of rigorous benchmarks, we introduce ComplexCompo, featuring diverse resolutions and challenging conditions such as low lighting, strong illumination, intricate shadows, and reflective surfaces. Experiments on ComplexCompo and DreamEditBench show state-of-the-art performance on standard metrics (e.g., DINOv2) and human-aligned scores (e.g., DreamSim, ImageReward, VisionReward). Code and benchmark will be publicly available upon publication.
[151] Quantized Visual Geometry Grounded Transformer
Weilun Feng,Haotong Qin,Mingqiang Wu,Chuanguang Yang,Yuqi Li,Xiangqi Li,Zhulin An,Libo Huang,Yulun Zhang,Michele Magno,Yongjun Xu
Main category: cs.CV
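TL;DR: Proposes QuantVGGT, the first quantization framework for Visual Geometry Grounded Transformers (VGGTs). Dual-smoothed fine-grained quantization and noise-filtered diverse sampling address the heavy-tailed activation distributions and unstable calibration that large-scale 3D reconstruction models face under post-training quantization; at 4 bits the method delivers substantial memory compression and speedup while retaining over 98% of full-precision accuracy.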
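This entry's abstract mentions a global Hadamard rotation applied before quantization to tame heavy-tailed activations. The toy sketch below shows why such a rotation can help: an orthonormal Walsh-Hadamard transform spreads a single outlier across all coordinates, so a shared quantization step wastes less range. The functions and the four-element example are illustrative assumptions; per-tensor symmetric quantization is used for simplicity, and the channel-smoothing stage is not modeled.

```python
def hadamard_rotate(x):
    """Orthonormal fast Walsh-Hadamard transform; len(x) must be a power of two."""
    n = len(x)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    out = list(x)
    h = 1
    while h < n:
        for start in range(0, n, 2 * h):
            for i in range(start, start + h):
                a, b = out[i], out[i + h]
                out[i], out[i + h] = a + b, a - b
        h *= 2
    scale = n ** -0.5
    return [v * scale for v in out]


def quantize(x, bits=4):
    """Symmetric uniform quantization with a single per-tensor step size."""
    qmax = 2 ** (bits - 1) - 1
    peak = max(abs(v) for v in x)
    if peak == 0:
        return list(x)
    step = peak / qmax
    return [round(v / step) * step for v in x]


def l2_error(a, b):
    """Euclidean distance between two equal-length vectors."""
    return sum((u - v) ** 2 for u, v in zip(a, b)) ** 0.5


# Hypothetical activation vector with one large outlier.
x = [8.0, 0.1, -0.2, 0.3]
err_direct = l2_error(x, quantize(x))
# Rotate, quantize, then rotate back (the transform is its own inverse).
x_rotated_back = hadamard_rotate(quantize(hadamard_rotate(x)))
err_rotated = l2_error(x, x_rotated_back)
```

In this example the rotated path has the smaller reconstruction error because the outlier no longer dictates the step size for the small entries; that holds for this input and is not a general guarantee.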
Details
Motivation: Existing post-training quantization methods struggle with billion-parameter Visual Geometry Grounded Transformers (VGGTs): special tokens produce heavy-tailed activation distributions, and multi-view 3D data makes calibration sample selection unstable, which limits real-world deployment. Method: QuantVGGT has two key techniques: (1) Dual-Smoothed Fine-Grained Quantization, which combines a global Hadamard rotation with local channel smoothing to mitigate heavy-tailed distributions and inter-channel variance; and (2) Noise-Filtered Diverse Sampling, which removes outliers using deep-layer statistics and builds frame-aware diverse calibration clusters to stabilize quantization ranges. Result: QuantVGGT achieves state-of-the-art quantization performance across benchmarks and bit-widths; at 4 bits it gives a 3.7× memory reduction and a 2.5× speedup in real-hardware inference while keeping reconstruction accuracy above 98% of the full-precision model. Conclusion: QuantVGGT makes large-scale VGGT models far more practical to deploy in resource-constrained settings and demonstrates its value for efficient 3D reconstruction. Abstract: Learning-based 3D reconstruction models, represented by Visual Geometry Grounded Transformers (VGGTs), have made remarkable progress with the use of large-scale transformers. Their prohibitive computational and memory costs severely hinder real-world deployment. Post-Training Quantization (PTQ) has become a common practice for compressing and accelerating models. However, we empirically observe that PTQ faces unique obstacles when compressing billion-scale VGGTs: the data-independent special tokens induce heavy-tailed activation distributions, while the multi-view nature of 3D data makes calibration sample selection highly unstable. This paper proposes the first Quantization framework for VGGTs, namely QuantVGGT. This mainly relies on two technical contributions: First, we introduce Dual-Smoothed Fine-Grained Quantization, which integrates pre-global Hadamard rotation and post-local channel smoothing to mitigate heavy-tailed distributions and inter-channel variance robustly. Second, we design Noise-Filtered Diverse Sampling, which filters outliers via deep-layer statistics and constructs frame-aware diverse calibration clusters to ensure stable quantization ranges. Comprehensive experiments demonstrate that QuantVGGT achieves state-of-the-art results across different benchmarks and bit-widths, surpassing the previous state-of-the-art generic quantization method by a large margin. We highlight that our 4-bit QuantVGGT can deliver a 3.7× memory reduction and 2.5× acceleration in real-hardware inference, while maintaining reconstruction accuracy above 98% of its full-precision counterpart. This demonstrates the vast advantages and practicality of QuantVGGT in resource-constrained scenarios. Our code is released at https://github.com/wlfeng0509/QuantVGGT.
[152] NewtonGen: Physics-Consistent and Controllable Text-to-Video Generation via Neural Newtonian Dynamics
Yu Yuan,Xijun Wang,Tharindu Wickremasinghe,Zeeshan Nadir,Bole Ma,Stanley H. Chan
Main category: cs.CV
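TL;DR: Proposes NewtonGen, a framework that combines data-driven synthesis with learnable physical principles, using trainable Neural Newtonian Dynamics (NND) to achieve physically consistent and controllable video generation.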
Details
Motivation: Existing text-to-video generation models are bottlenecked by physical consistency and controllability: they often produce unrealistic motion and lack precise control over dynamic behavior under different initial conditions.
Method: Introduces trainable Neural Newtonian Dynamics (NND) to incorporate physical laws into the video generation process, jointly leveraging data priors and dynamical guidance.
Result: NewtonGen generates more physically consistent videos and supports precise control over motion parameters, significantly improving the plausibility of the dynamics.
Conclusion: By integrating learned physical principles, NewtonGen addresses the lack of dynamical understanding that results from learning from appearance alone, advancing controllable video generation.
Abstract: A primary bottleneck in large-scale text-to-video generation today is physical consistency and controllability. Despite recent advances, state-of-the-art models often produce unrealistic motions, such as objects falling upward, or abrupt changes in velocity and direction. Moreover, these models lack precise parameter control, struggling to generate physically consistent dynamics under different initial conditions. We argue that this fundamental limitation stems from current models learning motion distributions solely from appearance, while lacking an understanding of the underlying dynamics. In this work, we propose NewtonGen, a framework that integrates data-driven synthesis with learnable physical principles. At its core lies trainable Neural Newtonian Dynamics (NND), which can model and predict a variety of Newtonian motions, thereby injecting latent dynamical constraints into the video generation process. By jointly leveraging data priors and dynamical guidance, NewtonGen enables physically consistent video synthesis with precise parameter control.
[153] SD3.5-Flash: Distribution-Guided Distillation of Generative Flows
Hmrishav Bandyopadhyay,Rahim Entezari,Jim Scott,Reshinth Adithyan,Yi-Zhe Song,Varun Jampani
Main category: cs.CV
TL;DR: SD3.5-Flash is an efficient few-step distillation framework that enables high-quality image generation on consumer-grade devices.