cs.CL

[1] Interpreting Public Sentiment in Diplomacy Events: A Counterfactual Analysis Framework Using Large Language Models

Leyi Ouyang

Main category: cs.CL

TL;DR: Proposes an LLM-based counterfactual generation framework that adjusts textual features of diplomatic event narratives while keeping core facts intact, shifting public sentiment from negative to neutral or positive; experiments report a 70% success rate.

Motivation: Traditional methods for measuring public sentiment are time-consuming and labor-intensive and offer no forward-looking analysis; there is also no data-driven solution for guiding public opinion around diplomatic events.

Method: First, a dataset of diplomatic event descriptions and their associated public discussions is built and used to train a language model that predicts public reaction. Next, guided by communication theory and domain experts, a set of modifiable textual features is predefined. Finally, an LLM-based counterfactual generation algorithm systematically produces modified versions of the original text.

Result: The framework shifted negative public sentiment to neutral or positive in 70% of cases, supporting the feasibility of influencing sentiment by adjusting narrative framing.

Conclusion: The framework can serve as a practical tool for diplomats, policymakers, and communication specialists, offering data-driven strategic advice for shaping a more favorable public opinion environment.

Abstract: Diplomatic events consistently prompt widespread public discussion and debate. Public sentiment plays a critical role in diplomacy, as favorable sentiment provides vital support for policy implementation, helps resolve international issues, and shapes a nation's international image. Traditional methods for gauging public sentiment, such as large-scale surveys or manual content analysis of media, are typically time-consuming, labor-intensive, and lack the capacity for forward-looking analysis. We propose a novel framework that identifies specific modifications for diplomatic event narratives to shift public sentiment from negative to neutral or positive. First, we train a language model to predict public reaction towards diplomatic events. To this end, we construct a dataset comprising descriptions of diplomatic events and their associated public discussions. Second, guided by communication theories and in collaboration with domain experts, we predetermine several textual features for modification, ensuring that any alterations change the event's narrative framing while preserving its core facts. Third, we develop a counterfactual generation algorithm that employs a large language model to systematically produce modified versions of an original text. The results show that this framework successfully shifted public sentiment to a more favorable state with a 70% success rate. This framework can therefore serve as a practical tool for diplomats, policymakers, and communication specialists, offering data-driven insights on how to frame diplomatic initiatives or report on events to foster a more desirable public sentiment.

[2] Speaker Style-Aware Phoneme Anchoring for Improved Cross-Lingual Speech Emotion Recognition

Shreya G. Upadhyay,Carlos Busso,Chi-Chun Lee

Main category: cs.CL

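TL;DR: Proposes a speaker-style-aware phoneme anchoring framework for cross-lingual speech emotion recognition that improves cross-lingual emotion transfer through dual-space anchoring in speaker and phoneme spaces.

Motivation: Cross-lingual speech emotion recognition is challenging because of phonetic differences across languages and differences in speakers' expressive styles; a framework is needed that aligns emotional expression across speakers and languages.

Method: Emotion-specific speaker communities are built via graph-based clustering to capture shared traits, and dual-space anchoring is applied in the speaker and phoneme spaces to achieve cross-lingual emotion alignment.

Result: Experiments on the MSP-Podcast (English) and BIIC-Podcast (Taiwanese Mandarin) corpora show that the method outperforms existing baselines and improves the generalization of cross-lingual emotion recognition.

Conclusion: The proposed dual-space anchoring framework captures commonalities in cross-lingual emotional expression and strengthens transfer performance across languages.

Abstract: Cross-lingual speech emotion recognition (SER) remains a challenging task due to differences in phonetic variability and speaker-specific expressive styles across languages. Effectively capturing emotion under such diverse conditions requires a framework that can align the externalization of emotions across different speakers and languages. To address this problem, we propose a speaker-style aware phoneme anchoring framework that aligns emotional expression at the phonetic and speaker levels. Our method builds emotion-specific speaker communities via graph-based clustering to capture shared speaker traits. Using these groups, we apply dual-space anchoring in speaker and phonetic spaces to enable better emotion transfer across languages. Evaluations on the MSP-Podcast (English) and BIIC-Podcast (Taiwanese Mandarin) corpora demonstrate improved generalization over competitive baselines and provide valuable insights into the commonalities in cross-lingual emotion representation.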

[3] CFD-LLMBench: A Benchmark Suite for Evaluating Large Language Models in Computational Fluid Dynamics

Nithin Somasekharan,Ling Yue,Yadi Cao,Weichao Li,Patrick Emami,Pochinapeddi Sai Bhargav,Anurag Acharya,Xingyu Xie,Shaowu Pan

Main category: cs.CL

TL;DR: Introduces CFDLLMBench, a benchmark suite for evaluating how well large language models can automate numerical experiments in computational fluid dynamics (CFD); its three components are designed to assess knowledge, reasoning, and implementation ability together.

Motivation: Large language models perform well on natural language processing tasks, but their use in automating numerical experiments for complex physical systems remains underexplored. CFD, a critical and labor-intensive field, poses a distinctive challenge for evaluating the scientific capabilities of LLMs.

Method: A benchmark suite with three parts (CFDQuery, CFDCodeBench, and FoamBench) is designed around real-world CFD practice. A detailed task taxonomy and a rigorous evaluation framework measure LLM performance on code executability, solution accuracy, and numerical convergence behavior.

Result: CFDLLMBench evaluates LLMs' CFD knowledge, numerical and physical reasoning, and ability to implement practical workflows, laying groundwork for LLM-driven automation of numerical experiments.

Conclusion: CFDLLMBench provides a reliable foundation and tooling for evaluating and advancing the use of LLMs in automating numerical experiments for complex physical systems.

Abstract: Large Language Models (LLMs) have demonstrated strong performance across general NLP tasks, but their utility in automating numerical experiments of complex physical systems -- a critical and labor-intensive component -- remains underexplored. As the major workhorse of computational science over the past decades, Computational Fluid Dynamics (CFD) offers a uniquely challenging testbed for evaluating the scientific capabilities of LLMs. We introduce CFDLLMBench, a benchmark suite comprising three complementary components -- CFDQuery, CFDCodeBench, and FoamBench -- designed to holistically evaluate LLM performance across three key competencies: graduate-level CFD knowledge, numerical and physical reasoning of CFD, and context-dependent implementation of CFD workflows. Grounded in real-world CFD practices, our benchmark combines a detailed task taxonomy with a rigorous evaluation framework to deliver reproducible results and quantify LLM performance across code executability, solution accuracy, and numerical convergence behavior. CFDLLMBench establishes a solid foundation for the development and evaluation of LLM-driven automation of numerical experiments for complex physical systems. Code and data are available at https://github.com/NREL-Theseus/cfdllmbench/.

[4] Assessing Classical Machine Learning and Transformer-based Approaches for Detecting AI-Generated Research Text

Sharanya Parimanoharan,Ruwan D. Nawarathna

Main category: cs.CL

TL;DR: Evaluates several machine learning methods for distinguishing ChatGPT-3.5-generated text from human-written research abstracts. DistilBERT performs best, and a voting ensemble fails to beat the single best model, suggesting that a strong transformer model matters more than simple model ensembling for AI-text detection.

Motivation: With the wide adoption of large language models such as ChatGPT, the line between AI-generated and human text is increasingly blurred, raising concerns about academic integrity, intellectual property, and the spread of misinformation. Reliable AI-text detection is needed to preserve authenticity and trust.

Method: Using a dataset of 250 pairs of human-written and ChatGPT-3.5-generated abstracts across many research topics, the study compares classical machine learning (Logistic Regression with Bag-of-Words, POS, and TF-IDF features) against transformer-based models (BERT with N-grams, DistilBERT, BERT with a custom classifier, and LSTM-based N-gram models), and tests whether a voting ensemble of the three best models improves performance.

Result: DistilBERT performs best overall, Logistic Regression and BERT-Custom are solid and balanced, and the LSTM and BERT-N-gram approaches lag. The max voting ensemble of the three best models does not surpass DistilBERT alone.

Conclusion: Transformer models, DistilBERT in particular, perform best at detecting AI-generated research text; future work should build more robust transformer frameworks on larger, richer datasets to keep pace with improving generative models.

Abstract: The rapid adoption of large language models (LLMs) such as ChatGPT has blurred the line between human and AI-generated texts, raising urgent questions about academic integrity, intellectual property, and the spread of misinformation. Thus, reliable AI-text detection is needed for fair assessment to safeguard human authenticity and cultivate trust in digital communication. In this study, we investigate how well current machine learning (ML) approaches can distinguish ChatGPT-3.5-generated texts from human-written texts employing a labeled data set of 250 pairs of abstracts from a wide range of research topics. We test and compare both classical (Logistic Regression armed with classical Bag-of-Words, POS, and TF-IDF features) and transformer-based (BERT augmented with N-grams, DistilBERT, BERT with a lightweight custom classifier, and LSTM-based N-gram models) ML detection techniques. As we aim to assess each model's performance in detecting AI-generated research texts, we also aim to test whether an ensemble of these models can outperform any single detector. Results show DistilBERT achieves the overall best performance, while Logistic Regression and BERT-Custom offer solid, balanced alternatives; LSTM- and BERT-N-gram approaches lag. The max voting ensemble of the three best models fails to surpass DistilBERT itself, highlighting the primacy of a single transformer-based representation over mere model diversity. By comprehensively assessing the strengths and weaknesses of these AI-text detection approaches, this work lays a foundation for more robust transformer frameworks with larger, richer datasets to keep pace with ever-improving generative AI models.
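
The max voting ensemble mentioned in the abstract can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `max_vote` and the three prediction lists are hypothetical, and the labels are invented toy values (1 = AI-generated, 0 = human-written).

```python
from collections import Counter

def max_vote(predictions):
    """Return the majority label per sample.

    predictions: list of per-model label lists, all of equal length.
    """
    n_samples = len(predictions[0])
    voted = []
    for i in range(n_samples):
        # Count the label each model assigned to sample i.
        votes = Counter(model_preds[i] for model_preds in predictions)
        voted.append(votes.most_common(1)[0][0])
    return voted

# Hypothetical outputs of three detectors on four abstracts.
distilbert = [1, 0, 1, 1]
logreg = [1, 0, 0, 1]
bert_custom = [0, 0, 1, 1]

ensemble = max_vote([distilbert, logreg, bert_custom])
```

In this toy case the ensemble output coincides with the strongest single model's predictions, which is one way a voting ensemble can end up adding nothing: the weaker models only change the vote when they agree with each other against the best one.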

[5] ConceptViz: A Visual Analytics Approach for Exploring Concepts in Large Language Models

Haoxuan Li,Zhen Wen,Qiqi Jiang,Chenxiao Li,Yuwei Wu,Yuchen Yang,Yiyao Wang,Xiuqi Huang,Minfeng Zhu,Wei Chen

Main category: cs.CL

TL;DR: ConceptViz is a visual analytics system for exploring concepts in large language models (LLMs); it connects sparse autoencoder (SAE) features with human-understandable concepts to make LLMs' internal knowledge representations more interpretable.

Motivation: SAEs can extract interpretable features from LLMs, but those features do not align with human concepts, which makes interpretation difficult and time-consuming. A tool is needed to bridge this gap.

Method: ConceptViz follows a new Identification => Interpretation => Validation pipeline: users query SAEs with concepts of interest, interactively explore concept-to-feature correspondences, and validate them through model behavior.

Result: Two usage scenarios and a user study indicate that ConceptViz supports the discovery and validation of concept representations in LLMs and makes interpretability research more efficient.

Conclusion: ConceptViz helps researchers build more accurate mental models of LLM features and advances interpretability research on LLM knowledge representation.

Abstract: Large language models (LLMs) have achieved remarkable performance across a wide range of natural language tasks. Understanding how LLMs internally represent knowledge remains a significant challenge. Although Sparse Autoencoders (SAEs) have emerged as a promising technique for extracting interpretable features from LLMs, SAE features do not inherently align with human-understandable concepts, making their interpretation cumbersome and labor-intensive. To bridge the gap between SAE features and human concepts, we present ConceptViz, a visual analytics system designed for exploring concepts in LLMs. ConceptViz implements a novel Identification => Interpretation => Validation pipeline, enabling users to query SAEs using concepts of interest, interactively explore concept-to-feature alignments, and validate the correspondences through model behavior verification. We demonstrate the effectiveness of ConceptViz through two usage scenarios and a user study. Our results show that ConceptViz enhances interpretability research by streamlining the discovery and validation of meaningful concept representations in LLMs, ultimately aiding researchers in building more accurate mental models of LLM features. Our code and user guide are publicly available at https://github.com/Happy-Hippo209/ConceptViz.

[6] SKILL-RAG: Self-Knowledge Induced Learning and Filtering for Retrieval-Augmented Generation

Tomoaki Isoda

Main category: cs.CL

TL;DR: Proposes SKILL-RAG, which uses the model's self-knowledge within a reinforcement learning framework to select useful retrieved content and improve RAG performance.

Motivation: Retrieval-augmented generation (RAG) systems may bring in irrelevant content that leads to hallucinations, so unhelpful information must be identified and filtered out. The key challenge is combining the model's internal knowledge with external retrieved knowledge, which requires understanding what the model "knows" and "does not know" (its self-knowledge).

Method: SKILL-RAG uses a reinforcement learning-based training framework to explicitly elicit self-knowledge from the model and filters irrelevant content at sentence-level granularity while preserving useful knowledge, guiding the selection of high-quality retrieved results.

Result: On several question answering benchmarks with Llama2-7B and Qwen3-8B, SKILL-RAG improves generation quality and significantly reduces the number of input documents.

Conclusion: The model's self-knowledge is important for selecting retrieved content in RAG, and SKILL-RAG uses it to improve performance on knowledge-intensive tasks.

Abstract: Retrieval-Augmented Generation (RAG) has significantly improved the performance of large language models (LLMs) on knowledge-intensive tasks in recent years. However, since retrieval systems may return irrelevant content, incorporating such information into the model often leads to hallucinations. Thus, identifying and filtering out unhelpful retrieved content is a key challenge for improving RAG performance. To better integrate the internal knowledge of the model with external knowledge from retrieval, it is essential to understand what the model "knows" and "does not know" (which is also called "self-knowledge"). Based on this insight, we propose SKILL-RAG (Self-Knowledge Induced Learning and Filtering for RAG), a novel method that leverages the model's self-knowledge to determine which retrieved documents are beneficial for answering a given query. We design a reinforcement learning-based training framework to explicitly elicit self-knowledge from the model and employ sentence-level granularity to filter out irrelevant content while preserving useful knowledge. We evaluate SKILL-RAG using Llama2-7B and Qwen3-8B on several question answering benchmarks. Experimental results demonstrate that SKILL-RAG not only improves generation quality but also significantly reduces the number of input documents, validating the importance of self-knowledge in guiding the selection of high-quality retrievals.

[7] Beyond Global Emotion: Fine-Grained Emotional Speech Synthesis with Dynamic Word-Level Modulation

Sirui Wang,Andong Chen,Tiejun Zhao

Main category: cs.CL

TL;DR: Proposes Emo-FiLM, a framework for fine-grained, word-level emotion control in LLM-based text-to-speech.

Motivation: Existing emotional TTS systems mostly rely on sentence-level emotion control and struggle to capture emotional dynamics within a sentence.

Method: Frame-level features from emotion2vec are aligned to words to obtain word-level emotion annotations, and a FiLM layer modulates the text embeddings to achieve word-level emotion control.

Result: Emo-FiLM outperforms existing methods on both global and fine-grained emotion control tasks; its effectiveness and generality are shown on the newly constructed FEDD dataset.

Conclusion: Emo-FiLM achieves fine-grained emotion control and improves the naturalness and expressiveness of emotional TTS.

Abstract: Emotional text-to-speech (E-TTS) is central to creating natural and trustworthy human-computer interaction. Existing systems typically rely on sentence-level control through predefined labels, reference audio, or natural language prompts. While effective for global emotion expression, these approaches fail to capture dynamic shifts within a sentence. To address this limitation, we introduce Emo-FiLM, a fine-grained emotion modeling framework for LLM-based TTS. Emo-FiLM aligns frame-level features from emotion2vec to words to obtain word-level emotion annotations, and maps them through a Feature-wise Linear Modulation (FiLM) layer, enabling word-level emotion control by directly modulating text embeddings. To support evaluation, we construct the Fine-grained Emotion Dynamics Dataset (FEDD) with detailed annotations of emotional transitions. Experiments show that Emo-FiLM outperforms existing approaches on both global and fine-grained tasks, demonstrating its effectiveness and generality for expressive speech synthesis.
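
Feature-wise Linear Modulation, as named in the abstract, scales and shifts each feature of an embedding using parameters computed from a conditioning vector. The sketch below shows only that generic operation in plain Python; the weight values, the two-dimensional emotion vector, and the helper names `linear` and `film` are assumptions for illustration and do not reflect Emo-FiLM's actual architecture or dimensions.

```python
def linear(vec, weights, bias):
    """Apply y = W v + b, with W given as a list of rows."""
    return [sum(w * v for w, v in zip(row, vec)) + b
            for row, b in zip(weights, bias)]

def film(text_embedding, emotion_vec, params):
    """Modulate one word embedding with FiLM: out = gamma * x + beta.

    gamma and beta are produced from the word's emotion vector.
    """
    gamma = linear(emotion_vec, params["W_gamma"], params["b_gamma"])
    beta = linear(emotion_vec, params["W_beta"], params["b_beta"])
    return [g * x + b for x, g, b in zip(text_embedding, gamma, beta)]

# Hypothetical parameters: 2-dim emotion vector, 3-dim word embedding.
params = {
    "W_gamma": [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]],
    "b_gamma": [1.0, 1.0, 1.0],  # bias of 1 keeps gamma near identity
    "W_beta": [[0.0, 1.0], [1.0, 0.0], [0.0, 0.0]],
    "b_beta": [0.0, 0.0, 0.0],
}

word_embedding = [2.0, -1.0, 4.0]
neutral_out = film(word_embedding, [0.0, 0.0], params)
excited_out = film(word_embedding, [1.0, 0.0], params)
```

With a zero emotion vector the embedding passes through unchanged, so the modulation acts as a per-word adjustment on top of the text representation; applying a different emotion vector to each word is what gives word-level rather than sentence-level control.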

[8] USB-Rec: An Effective Framework for Improving Conversational Recommendation Capability of Large Language Model

Jianyu Wen,Jingyun Wang,Cilin Yan,Jiayin Cai,Xiaolong Jiang,Ying Zhang

Main category: cs.CL

TL;DR: Proposes a user-simulator-based training-inference framework (USB-Rec) that improves large language models in conversational recommender systems through reinforcement learning training and a self-enhancement strategy.

Motivation: Existing LLM-based conversational recommender systems focus on using the model's analysis and summarization abilities while neglecting training, so recommendation strategies are never learned in depth.

Method: An LLM-based Preference Optimization (PO) dataset construction strategy is designed for reinforcement learning training, and a Self-Enhancement Strategy (SES) is introduced at inference time to make full use of the recommendation ability gained in training.

Result: Experiments on several datasets show that the method significantly outperforms existing state-of-the-art methods.

Conclusion: By adding training at the model level and optimization at inference, USB-Rec improves LLM performance on conversational recommendation.

Abstract: Recently, Large Language Models (LLMs) have been widely employed in Conversational Recommender Systems (CRSs). Unlike traditional language model approaches that focus on training, all existing LLM-based approaches are mainly centered around how to leverage the summarization and analysis capabilities of LLMs while ignoring the issue of training. Therefore, in this work, we propose an integrated training-inference framework, User-Simulator-Based framework (USB-Rec), for improving the performance of LLMs in conversational recommendation at the model level. Firstly, we design an LLM-based Preference Optimization (PO) dataset construction strategy for RL training, which helps the LLMs understand the strategies and methods in conversational recommendation. Secondly, we propose a Self-Enhancement Strategy (SES) at the inference stage to further exploit the conversational recommendation potential obtained from RL training. Extensive experiments on various datasets demonstrate that our method consistently outperforms previous state-of-the-art methods.

[9] Document Summarization with Conformal Importance Guarantees

Bruce Kuwahara,Chen-Yuan Lin,Xiao Shi Huang,Kin Kwan Leung,Jullian Arta Yapeter,Ilya Stanevich,Felipe Perez,Jesse C. Cresswell

Main category: cs.CL

TL;DR: Proposes Conformal Importance Summarization, a framework that uses conformal prediction to provide rigorous, distribution-free guarantees on coverage and recall of critical content, enabling reliable and controllable automatic summarization.

Motivation: Existing LLM-based automatic summarization systems lack reliable guarantees that critical content is included when used in high-stakes domains.

Method: Thresholds on sentence-level importance scores are calibrated on a calibration set using conformal prediction, yielding extractive summarization with user-specified coverage and recall rates.

Result: Experiments on several standard summarization benchmarks confirm that the method reaches the theoretically guaranteed information coverage rate while being model-agnostic and requiring little data.

Conclusion: Conformal Importance Summarization offers a practical route to safely deploying AI summarization tools in high-stakes settings and can be combined with existing techniques for more reliable summarization.

Abstract: Automatic summarization systems have advanced rapidly with large language models (LLMs), yet they still lack reliable guarantees on inclusion of critical content in high-stakes domains like healthcare, law, and finance. In this work, we introduce Conformal Importance Summarization, the first framework for importance-preserving summary generation which uses conformal prediction to provide rigorous, distribution-free coverage guarantees. By calibrating thresholds on sentence-level importance scores, we enable extractive document summarization with user-specified coverage and recall rates over critical content. Our method is model-agnostic, requires only a small calibration set, and seamlessly integrates with existing black-box LLMs. Experiments on established summarization benchmarks demonstrate that Conformal Importance Summarization achieves the theoretically assured information coverage rate. Our work suggests that Conformal Importance Summarization can be combined with existing techniques to achieve reliable, controllable automatic summarization, paving the way for safer deployment of AI summarization tools in critical applications. Code is available at https://github.com/layer6ai-labs/conformal-importance-summarization.
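
Calibrating a threshold on sentence-level importance scores can be sketched with a standard split-conformal quantile. The code below is an assumption-laden illustration, not the authors' procedure (see their repository for that): `doc_threshold`, `calibrate`, `summarize`, the toy scores, and the choice of alpha and beta are all hypothetical. For each calibration document it finds the largest threshold that still keeps a fraction beta of the important sentences, then picks the k-th smallest of those thresholds with k = floor(alpha * (n + 1)), so that under exchangeability a new document meets the recall target with probability at least 1 - alpha.

```python
import math

def doc_threshold(scores, important, beta):
    """Largest threshold t such that keeping sentences with score >= t
    retains at least a fraction beta of the important sentences."""
    imp = sorted((s for s, is_imp in zip(scores, important) if is_imp),
                 reverse=True)
    if not imp:
        return math.inf  # nothing important to lose
    need = max(1, math.ceil(beta * len(imp)))
    return imp[need - 1]

def calibrate(calibration_docs, alpha, beta):
    """Split-conformal threshold: k-th smallest per-document threshold."""
    thresholds = sorted(doc_threshold(s, imp, beta)
                        for s, imp in calibration_docs)
    k = math.floor(alpha * (len(thresholds) + 1))
    if k < 1:
        return -math.inf  # too few calibration docs: keep every sentence
    return thresholds[k - 1]

def summarize(sentences, scores, threshold):
    """Extractive summary: keep sentences scoring at or above the threshold."""
    return [sent for sent, s in zip(sentences, scores) if s >= threshold]

# Toy calibration set: (importance scores, ground-truth importance labels).
calibration_docs = [
    ([0.9, 0.2, 0.7], [True, False, True]),
    ([0.4, 0.8], [True, False]),
    ([0.6, 0.5, 0.1], [True, True, False]),
    ([0.3, 0.95], [False, True]),
]
q = calibrate(calibration_docs, alpha=0.4, beta=1.0)
summary = summarize(["A", "B", "C"], [0.65, 0.3, 0.5], q)
```

The importance scores can come from any scorer, which is consistent with the model-agnostic claim in the abstract; only the threshold depends on the calibration set.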

[10] ShortCheck: Checkworthiness Detection of Multilingual Short-Form Videos

Henrik Vatndal,Vinay Setty

Main category: cs.CL

TL;DR: Proposes ShortCheck, a modular, inference-only pipeline that automatically identifies checkworthy videos on short-form video platforms such as TikTok to assist human fact-checkers.

Motivation: Content on short-form video platforms is multimodal, dynamic, and noisy, which poses distinctive challenges for misinformation detection.

Method: The system integrates speech transcription, OCR, object and deepfake detection, video-to-text summarization, and claim verification into a user-friendly automated pipeline.

Result: Evaluated on two manually annotated multilingual TikTok video datasets, the system achieves a weighted F1 score above 70%.

Conclusion: ShortCheck identifies checkworthy short-form video content, performs well in a multilingual setting, and can make human fact-checking more efficient.

Abstract: Short-form video platforms like TikTok present unique challenges for misinformation detection due to their multimodal, dynamic, and noisy content. We present ShortCheck, a modular, inference-only pipeline with a user-friendly interface that automatically identifies checkworthy short-form videos to help human fact-checkers. The system integrates speech transcription, OCR, object and deepfake detection, video-to-text summarization, and claim verification. ShortCheck is validated by evaluating it on two manually annotated datasets with TikTok videos in a multilingual setting. The pipeline achieves promising results with F1-weighted score over 70%.

[11] MARS: toward more efficient multi-agent collaboration for LLM reasoning

Xiao Wang,Jia Wang,Yijie Wang,Pengtao Dang,Sha Cao,Chi Zhang

Main category: cs.CL

TL;DR: Proposes MARS (Multi-Agent Review System), a role-based collaborative reasoning framework in which an author, reviewers, and a meta-reviewer divide the work; it matches the accuracy of Multi-Agent Debate (MAD) while cutting token usage and inference time by about 50%.

Motivation: Large language models have limited reasoning ability as single agents, and Multi-Agent Debate (MAD), while effective, carries a heavy computational cost; a more efficient collaborative reasoning method is needed.

Method: In MARS, an author agent generates an initial solution, several reviewer agents independently give comments, and a meta-reviewer agent integrates the feedback and guides revision, which avoids direct communication between reviewers.

Result: Experiments on several benchmarks show that MARS matches MAD in accuracy while reducing token consumption and inference time by about 50%, and that it outperforms other advanced reasoning strategies.

Conclusion: Through a structured division of roles, MARS achieves efficient, high-quality multi-agent collaborative reasoning at substantially lower computational cost, offering a more scalable approach to LLM reasoning.

Abstract: Large language models (LLMs) have achieved impressive results in natural language understanding, yet their reasoning capabilities remain limited when operating as single agents. Multi-Agent Debate (MAD) has been proposed to address this limitation by enabling collaborative reasoning among multiple models in a round-table debate manner. While effective, MAD introduces substantial computational overhead due to the number of agents involved and the frequent communication required. In this paper, we propose MARS (Multi-Agent Review System), a role-based collaboration framework inspired by the review process. In MARS, an author agent generates an initial solution, reviewer agents provide decisions and comments independently, and a meta-reviewer integrates the feedback to make the final decision and guide further revision. This design enhances reasoning quality while avoiding costly reviewer-to-reviewer interactions, thereby controlling token consumption and inference time. We compared MARS with both MAD and other state-of-the-art reasoning strategies across multiple benchmarks. Extensive experiments with different LLMs show that MARS matches the accuracy of MAD while reducing both token usage and inference time by approximately 50%. Code is available at https://github.com/xwang97/MARS.
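
The author / reviewer / meta-reviewer control flow described in the abstract can be sketched as a short loop. This is an illustrative skeleton only: `run_mars`, the `max_rounds` parameter, and the three `toy_*` functions are hypothetical stand-ins for LLM calls and are not taken from the MARS repository.

```python
def run_mars(question, author, reviewers, meta_reviewer, max_rounds=2):
    """Author drafts, reviewers comment independently, meta-reviewer decides."""
    solution = author(question, None)
    calls = 1
    for _ in range(max_rounds):
        # Each reviewer sees only the question and the current solution,
        # never the other reviewers' comments.
        reviews = [reviewer(question, solution) for reviewer in reviewers]
        calls += len(reviewers)
        accepted, feedback = meta_reviewer(question, solution, reviews)
        calls += 1
        if accepted:
            break
        solution = author(question, feedback)
        calls += 1
    return solution, calls

# Deterministic stand-ins for LLM calls (hypothetical, for illustration).
def toy_author(question, feedback):
    return "5" if feedback is None else "4"

def toy_reviewer(question, solution):
    ok = solution == "4"
    return {"accept": ok, "comment": "" if ok else "Recheck the arithmetic."}

def toy_meta_reviewer(question, solution, reviews):
    accepted = all(r["accept"] for r in reviews)
    feedback = " ".join(r["comment"] for r in reviews if not r["accept"])
    return accepted, feedback

answer, n_calls = run_mars("What is 2 + 2?", toy_author,
                           [toy_reviewer, toy_reviewer], toy_meta_reviewer)
```

Because reviewers never exchange messages, the number of calls per round grows linearly with the number of reviewers, which is the property the abstract credits for the lower token and time cost relative to round-table debate.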

[12] SiniticMTError: A Machine Translation Dataset with Error Annotations for Sinitic Languages

Hannah Liu,Junghyun Min,Ethan Yue Heng Cheung,Shou-Yi Hung,Syed Mekael Wasti,Runtong Liang,Shiyao Qian,Shizhao Zheng,Elsie Chan,Ka Ieng Charlotte Lo,Wing Yu Yip,Richard Tzong-Han Tsai,En-Shiun Annie Lee

Main category: cs.CL

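TL;DR: Introduces SiniticMTError, an error-annotated dataset for machine translation from English to Mandarin, Cantonese, and Wu Chinese, covering error span, type, and severity, to support translation quality estimation and error-aware generation research for low-resource languages.

Motivation: Despite major recent advances in machine translation, progress remains limited for many low-resource languages that lack large-scale training data and linguistic resources; Cantonese and Wu Chinese, for example, are widely spoken but under-studied.

Method: Building on existing parallel corpora, native speakers carry out rigorous annotation to create SiniticMTError with error span, type, and severity labels; inter-annotator agreement, iterative feedback, and error patterns are analyzed.

Result: The work yields an error-annotated machine translation dataset for three Sinitic languages, with detailed error analyses and a report on inter-annotator agreement.

Conclusion: SiniticMTError is a useful resource for translation quality estimation, error-aware generation, and model fine-tuning for low-resource languages.

Abstract: Despite major advances in machine translation (MT) in recent years, progress remains limited for many low-resource languages that lack large-scale training data and linguistic resources. Cantonese and Wu Chinese are two Sinitic examples, although each enjoys more than 80 million speakers around the world. In this paper, we introduce SiniticMTError, a novel dataset that builds on existing parallel corpora to provide error span, error type, and error severity annotations in machine-translated examples from English to Mandarin, Cantonese, and Wu Chinese. Our dataset serves as a resource for the MT community to utilize in fine-tuning models with error detection capabilities, supporting research on translation quality estimation, error-aware generation, and low-resource language evaluation. We report our rigorous annotation process by native speakers, with analyses on inter-annotator agreement, iterative feedback, and patterns in error type and severity.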

[13] SwasthLLM: a Unified Cross-Lingual, Multi-Task, and Meta-Learning Zero-Shot Framework for Medical Diagnosis Using Contrastive Representations

Ayan Sar,Pranav Singh Puri,Sumit Aich,Tanupriya Choudhury,Abhijit Kumar

Main category: cs.CL

TL;DR: Proposes SwasthLLM, a unified, zero-shot, cross-lingual, multi-task learning framework for medical diagnosis in English, Hindi, and Bengali without language-specific fine-tuning.

Motivation: Automatic disease diagnosis from clinical text in multilingual healthcare settings remains challenging because annotated medical data is scarce in low-resource languages and language varies across populations.

Method: A multilingual XLM-RoBERTa encoder is combined with a language-aware attention mechanism, a Siamese contrastive learning module, a translation consistency module, and a contrastive projection head. Multi-task learning jointly optimizes disease classification, translation alignment, and contrastive learning objectives, and MAML is used to improve rapid adaptation.

Result: The model reaches 97.22% accuracy and a 97.17% F1-score in the supervised setting; in zero-shot scenarios it reaches 92.78% accuracy on Hindi and 73.33% on Bengali.

Conclusion: SwasthLLM shows strong cross-lingual generalization and high diagnostic performance in low-resource language settings and is suited to multilingual medical diagnosis applications.

Abstract: In multilingual healthcare environments, automatic disease diagnosis from clinical text remains a challenging task due to the scarcity of annotated medical data in low-resource languages and the linguistic variability across populations. This paper proposes SwasthLLM, a unified, zero-shot, cross-lingual, and multi-task learning framework for medical diagnosis that operates effectively across English, Hindi, and Bengali without requiring language-specific fine-tuning. At its core, SwasthLLM leverages the multilingual XLM-RoBERTa encoder augmented with a language-aware attention mechanism and a disease classification head, enabling the model to extract medically relevant information regardless of the language structure. To align semantic representations across languages, a Siamese contrastive learning module is introduced, ensuring that equivalent medical texts in different languages produce similar embeddings. Further, a translation consistency module and a contrastive projection head reinforce language-invariant representation learning. SwasthLLM is trained using a multi-task learning strategy, jointly optimizing disease classification, translation alignment, and contrastive learning objectives. Additionally, we employ Model-Agnostic Meta-Learning (MAML) to equip the model with rapid adaptation capabilities for unseen languages or tasks with minimal data. Our phased training pipeline emphasizes robust representation alignment before task-specific fine-tuning. Extensive evaluation shows that SwasthLLM achieves high diagnostic performance, with a test accuracy of 97.22% and an F1-score of 97.17% in supervised settings. Crucially, in zero-shot scenarios, it attains 92.78% accuracy on Hindi and 73.33% accuracy on Bengali medical text, demonstrating strong generalization in low-resource contexts.

[14] Dynamic Reasoning Chains through Depth-Specialized Mixture-of-Experts in Transformer Architectures

Sampurna Roy,Ayan Sar,Anurag Kaushish,Kanav Gupta,Tanupriya Choudhury,Abhijit Kumar

Main category: cs.CL

TL;DR: Proposes Dynamic Reasoning Chains through a Depth-Specialized Mixture-of-Experts (DS-MoE), which dynamically combines expert modules of different reasoning depths according to input complexity to improve the efficiency, accuracy, and interpretability of language models.

Motivation: Conventional transformers apply the same processing depth to every input, so simple queries and complex logical problems consume the same compute, which is inefficient and limits deep reasoning.

Method: The Mixture of Experts paradigm is extended from width-based scaling to depth specialization: several expert modules are optimized for different reasoning levels (such as shallow pattern recognition, compositional reasoning, and logical inference), and a learned routing network dynamically assembles a custom reasoning chain.

Result: Experiments on The Pile show that, compared with uniform-depth models, DS-MoE saves up to 16% of computation, runs inference 35% faster, and improves accuracy by 2.8% on complex multi-step reasoning tasks; the routing also yields interpretable reasoning paths.

Conclusion: Through depth specialization and dynamic reasoning chains, DS-MoE improves efficiency and interpretability without sacrificing performance.

Abstract: Contemporary transformer architectures apply identical processing depth to all inputs, creating inefficiencies and limiting reasoning quality. Simple factual queries are subjected to the same multilayered computation as complex logical problems, wasting resources while constraining deep inference. To overcome this, we propose Dynamic Reasoning Chains through Depth Specialised Mixture of Experts (DS-MoE), a modular framework that extends the Mixture of Experts paradigm from width-based to depth specialised computation. DS-MoE introduces expert modules optimised for distinct reasoning depths: shallow pattern recognition, compositional reasoning, logical inference, memory integration, and meta-cognitive supervision. A learned routing network dynamically assembles custom reasoning chains, activating only the necessary experts to match input complexity. We train and evaluate DS-MoE on The Pile, an 800GB corpus covering diverse domains such as scientific papers, legal texts, programming code, and web content, enabling systematic assessment across reasoning depths. Experimental results demonstrate that DS-MoE achieves up to 16 per cent computational savings and 35 per cent faster inference compared to uniform-depth transformers, while delivering 2.8 per cent higher accuracy on complex multi-step reasoning benchmarks. Furthermore, routing decisions yield interpretable reasoning chains, enhancing transparency and scalability. These findings establish DS-MoE as a significant advancement in adaptive neural architectures, demonstrating that depth-specialised modular processing can simultaneously improve efficiency, reasoning quality, and interpretability in large-scale language models.

[15] Hierarchical Resolution Transformers: A Wavelet-Inspired Architecture for Multi-Scale Language Understanding

Ayan Sar,Sampurna Roy,Kanav Gupta,Anurag Kaushish,Tanupriya Choudhury,Abhijit Kumar

Main category: cs.CL

TL;DR: Proposes the wavelet-inspired Hierarchical Resolution Transformer (HRT), which processes language at multiple resolutions with O(n log n) complexity and outperforms standard transformers on several benchmarks while improving efficiency and language understanding.

Motivation: Transformers treat text as a flat token sequence and fail to model the hierarchical structure of language, which leads to high computational cost, weak compositional generalization, and inadequate discourse-level modeling.

Method: A multi-resolution attention mechanism processes language simultaneously from characters to discourse-level units, using exponential sequence reduction across scales together with cross-resolution attention and scale-specialized modules.

Result: HRT improves on baselines by an average of 3.8% on GLUE, 4.5% on SuperGLUE, and 6.1% on Long Range Arena, reduces memory usage by 42% and inference latency by 37%, and has O(n log n) computational complexity.

Conclusion: According to the authors, HRT is the first architecture to align computational structure with the hierarchical structure of human language, supporting multi-scale, wavelet-inspired processing for both theoretical efficiency and practical language understanding.

Abstract: Transformer architectures have achieved state-of-the-art performance across natural language tasks, yet they fundamentally misrepresent the hierarchical nature of human language by processing text as flat token sequences. This results in quadratic computational cost, weak compositional generalization, and inadequate discourse-level modeling. We propose Hierarchical Resolution Transformer (HRT), a novel wavelet-inspired neural architecture that processes language simultaneously across multiple resolutions, from characters to discourse-level units. HRT constructs a multi-resolution attention, enabling bottom-up composition and top-down contextualization. By employing exponential sequence reduction across scales, HRT achieves O(n log n) complexity, offering significant efficiency improvements over standard transformers. We evaluated HRT on a diverse suite of benchmarks, including GLUE, SuperGLUE, Long Range Arena, and WikiText-103, and results demonstrated that HRT outperforms standard transformer baselines by an average of +3.8% on GLUE, +4.5% on SuperGLUE, and +6.1% on Long Range Arena, while reducing memory usage by 42% and inference latency by 37% compared to BERT and GPT style models of similar parameter count. Ablation studies confirm the effectiveness of cross-resolution attention and scale-specialized modules, showing that each contributes independently to both efficiency and accuracy. Our findings establish HRT as the first architecture to align computational structure with the hierarchical organization of human language, demonstrating that multi-scale, wavelet-inspired processing yields both theoretical efficiency gains and practical improvements in language understanding.

[16] FS-DFM: Fast and Accurate Long Text Generation with Few-Step Diffusion Language Models

Amin Karimi Monsefi,Nikhil Bhendawade,Manuel Rafael Ciosici,Dominic Culver,Yizhe Zhang,Irina Belousova

Main category: cs.CL

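TL;DR: Proposes FS-DFM, a few-step discrete flow-matching model for fast, high-quality language generation. The number of sampling steps is an explicit parameter, and the model is trained to be consistent across step budgets so that one large update replaces many small ones. Combined with a reliable update rule and strong teacher guidance distilled from long trajectories, FS-DFM with only 8 sampling steps matches the perplexity of a 1,024-step baseline, with up to 128 times faster sampling.

Motivation: Autoregressive language models perform well but generate serially, which lowers throughput and raises latency for long sequences. Conventional discrete diffusion models can be parallelized but usually need hundreds to thousands of steps to reach high quality, so a model is needed that keeps generation quality while sharply reducing the number of sampling steps.

Method: FS-DFM (Few-Step Discrete Flow-Matching) makes the number of sampling steps an explicit parameter and trains the model to be consistent across step counts. It uses a reliable update rule that avoids overshooting and guidance from a strong teacher distilled from long-run trajectories to make few-step sampling stable and accurate.

Result: On language modeling benchmarks, FS-DFM with 8 sampling steps reaches perplexity comparable to a 1,024-step discrete-flow baseline when generating 1,024 tokens, with sampling up to 128 times faster and corresponding gains in latency and throughput.

Conclusion: With consistency training and an efficient update strategy, FS-DFM balances high-quality language generation against a very small number of sampling steps.

Abstract: Autoregressive language models (ARMs) deliver strong likelihoods, but are inherently serial: they generate one token per forward pass, which limits throughput and inflates latency for long sequences. Diffusion Language Models (DLMs) parallelize across positions and thus appear promising for language generation, yet standard discrete diffusion typically needs hundreds to thousands of model evaluations to reach high quality, trading serial depth for iterative breadth. We introduce FS-DFM (Few-Step Discrete Flow-Matching), a discrete flow-matching model designed for speed without sacrificing quality. The core idea is simple: make the number of sampling steps an explicit parameter and train the model to be consistent across step budgets, so one big move lands where many small moves would. We pair this with a reliable update rule that moves probability in the right direction without overshooting, and with strong teacher guidance distilled from long-run trajectories. Together, these choices make few-step sampling stable, accurate, and easy to control. On language modeling benchmarks, FS-DFM with 8 sampling steps achieves perplexity parity with a 1,024-step discrete-flow baseline for generating 1,024 tokens using a similar-size model, delivering up to 128 times faster sampling and corresponding latency/throughput gains.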
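
The "up to 128 times" figure is consistent with a simple count of model evaluations, assuming sampling cost is proportional to the number of steps and the per-step cost is the same for both models (the abstract says they are of similar size). The helper name `step_ratio` below is hypothetical, and the calculation ignores any fixed overhead, which is one reason the abstract says "up to".

```python
def step_ratio(baseline_steps, few_steps):
    """Ratio of model evaluations between two sampling budgets."""
    return baseline_steps / few_steps

# Step counts taken from the abstract: 1,024-step baseline vs 8-step FS-DFM.
speedup_upper_bound = step_ratio(1024, 8)
```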

[17] Look Before you Leap: Estimating LLM Benchmark Scores from Descriptions

Jungsoo Park,Ethan Mendes,Gabriel Stanovsky,Alan Ritter

Main category: cs.CL

TL;DR: Proposes forecasting large language model performance without running experiments; the authors build PRECOG, a corpus of redacted task descriptions and configurations paired with results, to explore the feasibility of text-only performance forecasting.

Motivation: Progress in large language models is limited by an evaluation bottleneck: conventional practice requires repeatedly building benchmarks and iterating on experiments. The authors aim to forecast model performance before any experiments are run, to speed up research.

Method: The work defines the text-only performance forecasting task, builds the PRECOG dataset, forecasts with models equipped with a retrieval module, and evaluates them under different settings, including a zero-leakage scenario.

Result: The task is challenging but feasible. Models with a retrieval module reach a mean absolute error as low as 8.7 on the Accuracy subset at high-confidence thresholds. Stronger reasoning models show diverse, iterative querying, while current open-source models lag. In zero-leakage forecasting on new datasets, GPT-5 with web search still shows nontrivial predictive ability.

Conclusion: The PRECOG dataset and accompanying analysis provide an initial basis for open-ended anticipatory evaluation, supporting difficulty estimation and smarter experiment prioritization.

Abstract: Progress in large language models is constrained by an evaluation bottleneck: build a benchmark, evaluate models and settings, then iterate. We therefore ask a simple question: can we forecast outcomes before running any experiments? We study text-only performance forecasting: estimating a model's score from a redacted task description and intended configuration, with no access to dataset instances. To support systematic study, we curate PRECOG, a corpus of redacted description-performance pairs spanning diverse tasks, domains, and metrics. Experiments show the task is challenging but feasible: models equipped with a retrieval module that excludes source papers achieve moderate prediction performance with well-calibrated uncertainty, reaching mean absolute error as low as 8.7 on the Accuracy subset at high-confidence thresholds. Our analysis indicates that stronger reasoning models engage in diverse, iterative querying, whereas current open-source models lag and often skip retrieval or gather evidence with limited diversity. We further test a zero-leakage setting, forecasting on newly released datasets or experiments before their papers are indexed, where GPT-5 with built-in web search still attains nontrivial prediction accuracy. Overall, our corpus and analyses offer an initial step toward open-ended anticipatory evaluation, supporting difficulty estimation and smarter experiment prioritization.

[18] Building Tailored Speech Recognizers for Japanese Speaking Assessment

Yotaro Kubo,Richard Sproat,Chihiro Taguchi,Llion Jones

Main category: cs.CL

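TL;DR: This paper presents a speech recognition approach for Japanese pronunciation assessment that uses multitask learning and model fusion to mitigate the scarcity of phonemic transcriptions with accent marks.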

Details Motivation: Japanese phonemic transcriptions annotated with accent information are scarce, which makes it hard to train an accurate pronunciation assessment system, so data with only orthographic annotations must be used effectively to improve the model. Method: A multitask training framework adds auxiliary losses for orthographic labels and fundamental-frequency (pitch) patterns. Two estimators, one over phonetic alphabet strings and one over text token sequences, are then fused with an algorithm built on the finite-state transducer framework. Result: On the CSJ core evaluation sets, the proposed methods reduce the average mora-label error rate from 12.3% to 7.1%, and the approach clearly outperforms generic multilingual recognizers. Conclusion: Multitask learning and model fusion effectively improve accent-marked Japanese phonemic recognition and suit pronunciation assessment tasks with limited resources. Abstract: This paper presents methods for building speech recognizers tailored for Japanese speaking assessment tasks. Specifically, we build a speech recognizer that outputs phonemic labels with accent markers. Although Japanese is resource-rich, there is only a small amount of data for training models to produce accurate phonemic transcriptions that include accent marks. We propose two methods to mitigate data sparsity. First, a multitask training scheme introduces auxiliary loss functions to estimate orthographic text labels and pitch patterns of the input signal, so that utterances with only orthographic annotations can be leveraged in training. The second fuses two estimators, one over phonetic alphabet strings, and the other over text token sequences. To combine these estimates we develop an algorithm based on the finite-state transducer framework. Our results indicate that the use of multitask learning and fusion is effective for building an accurate phonemic recognizer. We show that this approach is advantageous compared to the use of generic multilingual recognizers. The relative advantages of the proposed methods were also compared. Our proposed methods reduced the average of mora-label error rates from 12.3% to 7.1% over the CSJ core evaluation sets.
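A label error rate of this kind is commonly computed like word error rate: the edit distance between reference and hypothesis label sequences divided by the reference length. The sketch below assumes that convention (the abstract does not spell out the exact scoring), and the mora labels, including the `]` accent marker, are made up for illustration.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two label sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def label_error_rate(refs, hyps):
    """Total edit errors divided by total reference labels."""
    errors = sum(edit_distance(r, h) for r, h in zip(refs, hyps))
    total = sum(len(r) for r in refs)
    return errors / total


# Hypothetical mora labels; "shi]" carries an assumed accent marker.
refs = [["ha", "shi]", "ga"], ["a", "me"]]
hyps = [["ha", "shi", "ga"], ["a", "me"]]
ler = label_error_rate(refs, hyps)

# Relative reduction implied by the reported 12.3% -> 7.1% improvement.
relative_reduction = (12.3 - 7.1) / 12.3
```

Under this reading, a missed accent marker counts as a full substitution, and the reported absolute drop of 5.2 points corresponds to a relative reduction of roughly 42%.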

[19] Enhancing Molecular Property Prediction with Knowledge from Large Language Models

Peng Zhou,Lai Hou Tim,Zhixiang Cheng,Kun Xie,Chaoyi Li,Wei Liu,Xiangxiang Zeng

Main category: cs.CL

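TL;DR: Proposes a new framework that, for the first time, combines knowledge extracted from large language models (LLMs) with structural features from pre-trained molecular models for molecular property prediction (MPP). Knowledge is extracted with several state-of-the-art LLMs, and experiments show the method outperforms existing approaches.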

Details Motivation: Graph neural networks and self-supervised learning have advanced molecular property prediction, but human prior knowledge still needs to be integrated to improve predictions, and the LLMs used to supply that knowledge suffer from knowledge gaps and hallucinations, especially for less-studied molecular properties. Method: Prompt LLMs to generate domain-relevant knowledge and executable molecular vectorization code, extract knowledge-based features from them, and fuse these with structural features from pre-trained molecular models to build an end-to-end MPP framework. Result: Across multiple experiments the method clearly outperforms existing MPP methods, confirming the value of fusing LLM-derived knowledge with structural information. Conclusion: Combining LLM-extracted knowledge with molecular structural features is a robust and effective approach to molecular property prediction and opens a new direction for molecular modeling in drug discovery. Abstract: Predicting molecular properties is a critical component of drug discovery. Recent advances in deep learning, particularly Graph Neural Networks (GNNs), have enabled end-to-end learning from molecular structures, reducing reliance on manual feature engineering. However, while GNNs and self-supervised learning approaches have advanced molecular property prediction (MPP), the integration of human prior knowledge remains indispensable, as evidenced by recent methods that leverage large language models (LLMs) for knowledge extraction. Despite their strengths, LLMs are constrained by knowledge gaps and hallucinations, particularly for less-studied molecular properties. In this work, we propose a novel framework that, for the first time, integrates knowledge extracted from LLMs with structural features derived from pre-trained molecular models to enhance MPP. Our approach prompts LLMs to generate both domain-relevant knowledge and executable code for molecular vectorization, producing knowledge-based features that are subsequently fused with structural representations. We employ three state-of-the-art LLMs, GPT-4o, GPT-4.1, and DeepSeek-R1, for knowledge extraction. Extensive experiments demonstrate that our integrated method outperforms existing approaches, confirming that the combination of LLM-derived knowledge and structural information provides a robust and effective solution for MPP.

[20] RedHerring Attack: Testing the Reliability of Attack Detection

Jonathan Rusert

Main category: cs.CL

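TL;DR: Proposes and tests RedHerring, a new attack that aims to make attack detection models unreliable by causing them to raise false alarms while the classifier stays accurate.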

Details Motivation: Explore the reliability of attack detection models and propose a new attack that exposes a potential threat. Method: Design the RedHerring attack, which modifies a text without changing its meaning so that the detector wrongly predicts an attack, creating a conflict between the classifier and the detector. Result: Tested on 4 datasets against 3 detectors defending 4 classifiers, RedHerring lowers detection accuracy by 20 to 71 points while maintaining or improving classifier accuracy. A confidence-check defense that needs no retraining raises detection accuracy substantially. Conclusion: RedHerring reveals a new avenue by which adversaries may target detection models and underscores the need for more robust detectors. Abstract: In response to adversarial text attacks, attack detection models have been proposed and shown to successfully identify text modified by adversaries. Attack detection models can be leveraged to provide an additional check for NLP models and give signals for human input. However, the reliability of these models has not yet been thoroughly explored. Thus, we propose and test a novel attack setting and attack, RedHerring. RedHerring aims to make attack detection models unreliable by modifying a text to cause the detection model to predict an attack, while keeping the classifier correct. This creates a tension between the classifier and detector. If a human sees that the detector is giving an "incorrect" prediction, but the classifier a correct one, then the human will see the detector as unreliable. We test this novel threat model on 4 datasets against 3 detectors defending 4 classifiers. We find that RedHerring is able to drop detection accuracy by between 20 and 71 points, while maintaining (or improving) classifier accuracy. As an initial defense, we propose a simple confidence check which requires no retraining of the classifier or detector and increases detection accuracy greatly. This novel threat model offers new insights into how adversaries may target detection models.

[21] Overcoming Black-box Attack Inefficiency with Hybrid and Dynamic Select Algorithms

Abhinay Shankar Belde,Rohit Ramkumar,Jonathan Rusert

Main category: cs.CL

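TL;DR: Proposes two new attack selection strategies, Hybrid Select and Dynamic Select, which reduce the number of queries needed in adversarial text attacks while maintaining attack effectiveness.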

Details Motivation: Transformer-based models are computationally expensive, and existing black-box attack methods need many queries, which makes them inefficient and limits attack testing for researchers with few resources. Method: Propose Hybrid Select and Dynamic Select, which combine the strengths of BinarySelect and GreedySelect by using either a size threshold or a learned rule to decide which selection algorithm to apply at each text length. Result: Experiments on 4 datasets and 6 target models show that sentence-level Hybrid Select cuts queries by 25.82% on average without losing attack effectiveness. Conclusion: The proposed strategies substantially lower the query cost of adversarial attacks and improve attack efficiency, and they apply to a range of NLP models, including encoder models and LLMs. Abstract: Adversarial text attack research plays a crucial role in evaluating the robustness of NLP models. However, the increasing complexity of transformer-based architectures has dramatically raised the computational cost of attack testing, especially for researchers with limited resources (e.g., GPUs). Existing popular black-box attack methods often require a large number of queries, which can make them inefficient and impractical for researchers. To address these challenges, we propose two new attack selection strategies called Hybrid and Dynamic Select, which better combine the strengths of previous selection algorithms. Hybrid Select merges generalized BinarySelect techniques with GreedySelect by introducing a size threshold to decide which selection algorithm to use. Dynamic Select provides an alternative approach of combining the generalized Binary and GreedySelect by learning which lengths of texts each selection method should be applied to. This greatly reduces the number of queries needed while maintaining attack effectiveness (a limitation of BinarySelect). Across 4 datasets and 6 target models, our best method (sentence-level Hybrid Select) is able to reduce the number of required queries per attack by up to 25.82% on average against both encoder models and LLMs, without losing the effectiveness of the attack.
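The threshold idea behind Hybrid Select can be sketched as a dispatcher between a per-word greedy search and a halving search. This is a toy sketch under stated assumptions, not the paper's algorithm: the names `greedy_select`, `binary_select`, and `hybrid_select`, the toy `score` function, and the choice to use the halving search on longer texts are all assumptions.

```python
def greedy_select(words, score):
    """Remove each word in turn; return (most influential index, queries used)."""
    base = score(words)
    drops = [base - score(words[:i] + words[i + 1:]) for i in range(len(words))]
    best = max(range(len(words)), key=drops.__getitem__)
    return best, len(words) + 1


def binary_select(words, score):
    """Halve the span repeatedly, following the half whose removal lowers the score most."""
    lo, hi, queries = 0, len(words), 0
    while hi - lo > 1:
        mid = (lo + hi) // 2
        without_left = score(words[:lo] + words[mid:])    # drop words[lo:mid]
        without_right = score(words[:mid] + words[hi:])   # drop words[mid:hi]
        queries += 2
        if without_left <= without_right:
            hi = mid   # removing the left half hurt more, so search there
        else:
            lo = mid
    return lo, queries


def hybrid_select(words, score, threshold):
    """Assumed rule: greedy for short inputs, halving search for long ones."""
    if len(words) <= threshold:
        return greedy_select(words, score)
    return binary_select(words, score)


def score(words):
    """Toy classifier confidence that depends on one trigger word."""
    return 0.9 if "terrible" in words else 0.2


long_text = ["word"] * 11 + ["terrible"] + ["word"] * 4   # 16 tokens, trigger at index 11
short_text = ["this", "was", "terrible", "food"]

long_result = hybrid_select(long_text, score, threshold=8)
short_result = hybrid_select(short_text, score, threshold=8)
```

On the 16-token toy input the halving search needs 8 queries where the greedy search would need 17, which is the kind of saving a size threshold is meant to capture.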

[22] MI-Fuse: Label Fusion for Unsupervised Domain Adaptation with Closed-Source Large-Audio Language Model

Hsiao-Ying Huang,Yi-Cheng Lin,Hung-yi Lee

Main category: cs.CL

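TL;DR: Proposes MI-Fuse, a framework that fuses labels from an API-only large audio-language model and an auxiliary teacher trained on the source domain to improve cross-domain speech emotion recognition without sharing source data.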

Details Motivation: Speech emotion recognition performs poorly under domain mismatch, the source data are inaccessible, and large models can be used only through an API. Method: Propose MI-Fuse, which weights the predictive distributions of multiple teachers by mutual-information-based uncertainty and stabilizes training with an exponential moving average. Result: Consistent gains on three public emotion datasets and six cross-domain transfer settings; the student surpasses the large audio-language model and beats the strongest baseline by 3.9%. Conclusion: MI-Fuse improves target-domain speech emotion recognition without sharing source data, making it suitable for model adaptation in realistic settings. Abstract: Large audio-language models (LALMs) show strong zero-shot ability on speech tasks, suggesting promise for speech emotion recognition (SER). However, SER in real-world deployments often fails under domain mismatch, where source data are unavailable and powerful LALMs are accessible only through an API. We ask: given only unlabeled target-domain audio and an API-only LALM, can a student model be adapted to outperform the LALM in the target domain? To this end, we propose MI-Fuse, a denoised label fusion framework that supplements the LALM with a source-domain trained SER classifier as an auxiliary teacher. The framework draws multiple stochastic predictions from both teachers, weights their mean distributions by mutual-information-based uncertainty, and stabilizes training with an exponential moving average teacher. Experiments across three public emotion datasets and six cross-domain transfers show consistent gains, with the student surpassing the LALM and outperforming the strongest baseline by 3.9%. This approach strengthens emotion-aware speech systems without sharing source data, enabling realistic adaptation.
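One way to read "weights their mean distributions by mutual-information-based uncertainty" is sketched below. It uses the common estimate MI = H(mean prediction) - mean H(prediction) over a teacher's stochastic samples. The `exp(-MI)` weighting and the function names are assumptions for illustration; the abstract does not give the exact formula.

```python
import math


def entropy(p):
    """Shannon entropy (nats) of a discrete distribution."""
    return -sum(x * math.log(x) for x in p if x > 0)


def mean_dist(samples):
    """Average several predictive distributions class by class."""
    n = len(samples)
    return [sum(s[k] for s in samples) / n for k in range(len(samples[0]))]


def mutual_information(samples):
    """Disagreement across stochastic predictions: H(mean) - mean(H)."""
    return entropy(mean_dist(samples)) - sum(entropy(s) for s in samples) / len(samples)


def fuse(teacher_samples):
    """Weight each teacher's mean distribution by exp(-MI) (assumed form), then mix."""
    means = [mean_dist(s) for s in teacher_samples]
    raw = [math.exp(-mutual_information(s)) for s in teacher_samples]
    total = sum(raw)
    weights = [r / total for r in raw]
    fused = [sum(w * m[k] for w, m in zip(weights, means)) for k in range(len(means[0]))]
    return fused, weights


# Two hypothetical teachers on a 2-class problem, two stochastic draws each.
consistent_teacher = [[0.8, 0.2], [0.8, 0.2]]    # draws agree: MI is zero
unstable_teacher = [[0.9, 0.1], [0.1, 0.9]]      # draws disagree: MI is high
fused, weights = fuse([consistent_teacher, unstable_teacher])
```

The teacher whose draws disagree with one another receives the smaller weight, so the fused pseudo-label leans toward the more self-consistent teacher.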

[23] Probability Distribution Collapse: A Critical Bottleneck to Compact Unsupervised Neural Grammar Induction

Jinwook Park,Kangil Kim

Main category: cs.CL

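TL;DR: Proposes a neural parameterization that mitigates probability distribution collapse, substantially improving parsing performance in unsupervised neural grammar induction while allowing more compact grammars.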

Details Motivation: Existing models face an expressiveness bottleneck that yields oversized yet underperforming grammars, chiefly because of probability distribution collapse. Method: Analyze how probability distribution collapse arises in neural parameterization and propose "collapse-relaxing neural parameterization" to address it. Result: The new method substantially improves parsing performance in unsupervised grammar induction across many languages while allowing more compact grammars. Conclusion: Mitigating probability distribution collapse improves both model expressiveness and efficiency, offering a new optimization direction for unsupervised grammar induction. Abstract: Unsupervised neural grammar induction aims to learn interpretable hierarchical structures from language data. However, existing models face an expressiveness bottleneck, often resulting in unnecessarily large yet underperforming grammars. We identify a core issue, $\textit{probability distribution collapse}$, as the underlying cause of this limitation. We analyze when and how the collapse emerges across key components of neural parameterization and introduce a targeted solution, $\textit{collapse-relaxing neural parameterization}$, to mitigate it. Our approach substantially improves parsing performance while enabling the use of significantly more compact grammars across a wide range of languages, as demonstrated through extensive empirical analysis.

[24] Confidence-guided Refinement Reasoning for Zero-shot Question Answering

Youwon Jang,Woo Suk Choi,Minjoon Jung,Minsu Lee,Byoung-Tak Zhang

Main category: cs.CL

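TL;DR: Proposes C2R, a training-free framework that constructs and refines sub-questions and their answers and uses the model's own confidence scores to improve question answering across modalities.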

Details Motivation: Existing question-answering systems are unstable on complex reasoning tasks and lack an effective way to assess the reliability of a reasoning path. Method: C2R selects high-confidence combinations of sub-questions and answers to explore diverse reasoning paths, then uses the confidence scores output by the model to pick the best final answer. Result: Performance improves across a variety of models and benchmarks, supporting the method's generality and effectiveness; the paper also analyzes how the quantity and quality of sub-questions affect reasoning reliability. Conclusion: C2R is a general, training-free reasoning framework that improves the accuracy and robustness of multimodal question-answering systems. Abstract: We propose Confidence-guided Refinement Reasoning (C2R), a novel training-free framework applicable to question-answering (QA) tasks across text, image, and video domains. C2R strategically constructs and refines sub-questions and their answers (sub-QAs), deriving a better confidence score for the target answer. C2R first curates a subset of sub-QAs to explore diverse reasoning paths, then compares the confidence scores of the resulting answer candidates to select the most reliable final answer. Since C2R relies solely on confidence scores derived from the model itself, it can be seamlessly integrated with various existing QA models, demonstrating consistent performance improvements across diverse models and benchmarks. Furthermore, we provide essential yet underexplored insights into how leveraging sub-QAs affects model behavior, specifically analyzing the impact of both the quantity and quality of sub-QAs on achieving robust and reliable reasoning.

[25] SFT Doesn't Always Hurt General Capabilities: Revisiting Domain-Specific Fine-Tuning in LLMs

Jiacheng Lin,Zhongruo Wang,Kun Qian,Tian Wang,Arvind Srinivasan,Hansi Zeng,Ruochen Jiao,Xie Zhou,Jiri Gesi,Dakuo Wang,Yufan Guo,Kai Zhong,Weiqi Zhang,Sujay Sanghavi,Changyou Chen,Hyokun Yun,Lihong Li

Main category: cs.CL

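TL;DR: This paper studies how supervised fine-tuning (SFT) on domain-specific datasets affects the general capabilities of large language models. It shows that a smaller learning rate mitigates the degradation and proposes Token-Adaptive Loss Reweighting (TALR), which better preserves general capabilities while keeping domain performance.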

Details Motivation: SFT can degrade a large model's general capabilities while improving domain performance; the paper revisits this trade-off and looks for effective mitigation strategies. Method: Analyze the effect of different learning rates experimentally, derive the TALR method from a theoretical analysis, and compare it with L2 regularization, LoRA, model averaging, and FLOW. Result: A smaller learning rate substantially reduces general-performance degradation; TALR outperforms the other methods across multiple benchmarks and balances domain specialization and general capability more effectively. Conclusion: SFT does not necessarily harm general capabilities. Choosing the learning rate sensibly and adopting TALR eases the trade-off, giving practical guidance for adapting LLMs to new domains. Abstract: Supervised Fine-Tuning (SFT) on domain-specific datasets is a common approach to adapt Large Language Models (LLMs) to specialized tasks but is often believed to degrade their general capabilities. In this work, we revisit this trade-off and present both empirical and theoretical insights. First, we show that SFT does not always hurt: using a smaller learning rate can substantially mitigate general performance degradation while preserving comparable target-domain performance. We then provide a theoretical analysis that explains these phenomena and further motivates a new method, Token-Adaptive Loss Reweighting (TALR). Building on this, and recognizing that smaller learning rates alone do not fully eliminate general-performance degradation in all cases, we evaluate a range of strategies for reducing general capability loss, including L2 regularization, LoRA, model averaging, FLOW, and our proposed TALR. Experimental results demonstrate that while no method completely eliminates the trade-off, TALR consistently outperforms these baselines in balancing domain-specific gains and general capabilities. Finally, we distill our findings into practical guidelines for adapting LLMs to new domains: (i) use a small learning rate to achieve a favorable trade-off, and (ii) when a stronger balance is desired, adopt TALR as an effective strategy.

[26] Towards Atoms of Large Language Models

Chenhui Hu,Pengfei Cao,Yubo Chen,Kang Liu,Jun Zhao

Main category: cs.CL

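TL;DR: This paper proposes the Atoms Theory, which defines atoms as the fundamental units of internal representations in language models. It corrects representation shifting with the atomic inner product (AIP), proves that atoms satisfy conditions for stable and unique sparse representations, and shows that threshold-activated sparse autoencoders can reliably identify them. Experiments on several large models support the theory.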

Details Motivation: The units currently treated as fundamental to a language model's internal representations, such as neurons or features, suffer from polysemy and unstable reconstruction, which limits understanding of model mechanisms, so a more reliable definition of the representational unit is needed. Method: Propose the Atoms Theory, introduce the atomic inner product (AIP) to correct representation shifting, formally define atoms and prove that they satisfy the Restricted Isometry Property (RIP), establish uniqueness and ℓ₁ recoverability of the sparse representations under stronger conditions, and combine the theory with threshold-activated sparse autoencoders (SAEs) to identify atoms. Result: SAEs trained on Gemma2-2B, Gemma2-9B, and Llama3.1-8B achieve 99.9% sparse reconstruction on average, and more than 99.8% of atoms satisfy the uniqueness condition, far above neurons (0.5%) and features (68.2%), indicating that atoms capture the internal representations of LLMs more faithfully. Conclusion: The Atoms Theory provides a stable, unique, and recoverable theoretical framework for the internal representations of LLMs, supports mechanistic interpretability research, and lays groundwork for understanding how LLMs work internally. Abstract: The fundamental units of internal representations in large language models (LLMs) remain undefined, limiting further understanding of their mechanisms. Neurons or features are often regarded as such units, yet neurons suffer from polysemy, while features face concerns of unreliable reconstruction and instability. To address this issue, we propose the Atoms Theory, which defines such units as atoms. We introduce the atomic inner product (AIP) to correct representation shifting, formally define atoms, and prove the conditions under which atoms satisfy the Restricted Isometry Property (RIP), ensuring stable sparse representations over the atom set and linking to compressed sensing. Under stronger conditions, we further establish the uniqueness and exact $\ell_1$ recoverability of the sparse representations, and provide guarantees that single-layer sparse autoencoders (SAEs) with threshold activations can reliably identify the atoms. To validate the Atoms Theory, we train threshold-activated SAEs on Gemma2-2B, Gemma2-9B, and Llama3.1-8B, achieving 99.9% sparse reconstruction across layers on average, and more than 99.8% of atoms satisfy the uniqueness condition, compared to 0.5% for neurons and 68.2% for features, showing that atoms more faithfully capture intrinsic representations of LLMs. Scaling experiments further reveal the link between SAE size and recovery capacity. Overall, this work systematically introduces and validates the Atoms Theory of LLMs, providing a theoretical framework for understanding internal representations and a foundation for mechanistic interpretability.
Code available at https://github.com/ChenhuiHu/towards_atoms.
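For readers unfamiliar with the Restricted Isometry Property mentioned in the abstract, the textbook compressed-sensing statement is given below. The symbols $D$ (a dictionary whose columns are the atoms), $s$ (sparsity level), and $\delta_s$ (RIP constant) are notation assumed here. The paper states the property with respect to its atomic inner product, so its exact form may differ from this Euclidean version.

```latex
% Standard Euclidean RIP of order s with constant \delta_s (assumed notation).
(1 - \delta_s)\,\lVert x \rVert_2^2
  \;\le\; \lVert D x \rVert_2^2
  \;\le\; (1 + \delta_s)\,\lVert x \rVert_2^2
\quad \text{for all } x \text{ with } \lVert x \rVert_0 \le s,
\qquad 0 \le \delta_s < 1 .
```

A small $\delta_s$ means that $D$ nearly preserves the length of every $s$-sparse vector, which is what makes sparse codes over the dictionary stable to recover.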

[27] Few-Shot and Training-Free Review Generation via Conversational Prompting

Genki Kusano

Main category: cs.CL

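TL;DR: This paper proposes Conversational Prompting, a lightweight method for generating personalized reviews in few-shot, training-free settings. By reformulating user reviews as multi-turn conversations, it makes LLM-generated reviews markedly more personalized.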

Details Motivation: Existing personalized review generation methods usually depend on extensive user review histories or additional model training, which does not fit real applications where reviews are scarce and fine-tuning is infeasible. An effective training-free method for few-shot settings is therefore needed. Method: Propose two conversational prompting variants: the simple version (SCP) builds a multi-turn conversation from the target user's own reviews only, while the contrastive version (CCP) inserts reviews from other users or LLMs as incorrect replies and asks the model to correct them, so that it picks up the user's writing style. Result: Experiments on eight product domains and five LLMs show that conventional non-conversational prompts produce reviews similar to those of random users, whereas SCP and CCP produce reviews much more consistent with the target user's style. The method works well even with only two reviews per user; CCP is better when high-quality negative examples are available, and SCP stays competitive when they are not. Conclusion: Conversational prompting is an effective and practical way to generate personalized reviews under few-shot, training-free conditions, outperforming conventional prompting and showing broad potential for application. Abstract: Personalized review generation helps businesses understand user preferences, yet most existing approaches assume extensive review histories of the target user or require additional model training. Real-world applications often face few-shot and training-free situations, where only a few user reviews are available and fine-tuning is infeasible. It is well known that large language models (LLMs) can address such low-resource settings, but their effectiveness depends on prompt engineering. In this paper, we propose Conversational Prompting, a lightweight method that reformulates user reviews as multi-turn conversations. Its simple variant, Simple Conversational Prompting (SCP), relies solely on the user's own reviews, while the contrastive variant, Contrastive Conversational Prompting (CCP), inserts reviews from other users or LLMs as incorrect replies and then asks the model to correct them, encouraging the model to produce text in the user's style. Experiments on eight product domains and five LLMs showed that the conventional non-conversational prompt often produced reviews similar to those written by random users, based on text-based metrics such as ROUGE-L and BERTScore, and application-oriented tasks like user identity matching and sentiment analysis. In contrast, both SCP and CCP produced reviews much closer to those of the target user, even when each user had only two reviews. CCP brings further improvements when high-quality negative examples are available, whereas SCP remains competitive when such data cannot be collected.
These results suggest that conversational prompting offers a practical solution for review generation under few-shot and training-free constraints.

[28] Enrich-on-Graph: Query-Graph Alignment for Complex Reasoning with LLM Enriching

Songze Li,Zhiqiang Liu,Zhengke Gui,Huajun Chen,Wen Zhang

Main category: cs.CL

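TL;DR: Proposes Enrich-on-Graph (EoG), a flexible framework that uses the prior knowledge of large language models to enrich knowledge graphs, bridging the semantic gap between queries and graphs and improving knowledge graph question answering.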

Details Motivation: LLMs hallucinate and make factual errors on knowledge-intensive tasks, mainly because of a semantic gap between structured knowledge graphs and unstructured queries. Existing methods usually reason over the original knowledge graph and overlook this gap. Method: Propose the Enrich-on-Graph (EoG) framework, which uses the prior knowledge of LLMs to enrich knowledge graphs dynamically, and design three graph quality metrics to analyze query-graph alignment, backed by theoretical validation of the optimization objectives. Result: Experiments on two KGQA benchmark datasets show that EoG generates high-quality knowledge graphs and achieves state-of-the-art performance, with low computational cost, scalability, and adaptability across methods. Conclusion: By bridging the semantic gap between queries and knowledge graphs, EoG substantially improves the accuracy and robustness of knowledge graph question answering and offers an efficient, scalable solution for knowledge-enhanced reasoning. Abstract: Large Language Models (LLMs) exhibit strong reasoning capabilities in complex tasks. However, they still struggle with hallucinations and factual errors in knowledge-intensive scenarios like knowledge graph question answering (KGQA). We attribute this to the semantic gap between structured knowledge graphs (KGs) and unstructured queries, caused by inherent differences in their focuses and structures. Existing methods usually employ resource-intensive, non-scalable workflows reasoning on vanilla KGs, but overlook this gap. To address this challenge, we propose a flexible framework, Enrich-on-Graph (EoG), which leverages LLMs' prior knowledge to enrich KGs and bridge the semantic gap between graphs and queries. EoG enables efficient evidence extraction from KGs for precise and robust reasoning, while ensuring low computational costs, scalability, and adaptability across different methods. Furthermore, we propose three graph quality evaluation metrics to analyze query-graph alignment in the KGQA task, supported by theoretical validation of our optimization objectives. Extensive experiments on two KGQA benchmark datasets indicate that EoG can effectively generate high-quality KGs and achieve state-of-the-art performance. Our code and data are available at https://github.com/zjukg/Enrich-on-Graph.

[29] Leveraging What's Overfixed: Post-Correction via LLM Grammatical Error Overcorrection

Taehee Park,Heejin Do,Gary Geunbae Lee

Main category: cs.CL

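TL;DR: Proposes Post-Correction via Overcorrection (PoCO), a new method that balances recall and precision in grammatical error correction by combining the generative power of large models with the reliability of small models.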

Details Motivation: Small language models are precise but have low recall in grammatical error correction, whereas large language models tend to overcorrect and lose precision, so a method that combines the strengths of both is needed. Method: First use an LLM to overcorrect deliberately and raise recall, then apply a fine-tuned small model as a post-correction step that fixes erroneous corrections to preserve precision. Result: Experiments show that PoCO raises recall while keeping precision competitive, improving overall grammatical error correction performance. Conclusion: PoCO balances the strengths of large and small models and yields higher-quality corrections in grammatical error correction. Abstract: Robust supervised fine-tuned small Language Models (sLMs) often show high reliability but tend to undercorrect. They achieve high precision at the cost of low recall. Conversely, Large Language Models (LLMs) often show the opposite tendency, making excessive overcorrection, leading to low precision. To effectively harness the strengths of LLMs to address the recall challenges in sLMs, we propose Post-Correction via Overcorrection (PoCO), a novel approach that strategically balances recall and precision. PoCO first intentionally triggers overcorrection via LLM to maximize recall by allowing comprehensive revisions, then applies a targeted post-correction step via fine-tuning smaller models to identify and refine erroneous outputs. We aim to harmonize both aspects by leveraging the generative power of LLMs while preserving the reliability of smaller supervised models. Our extensive experiments demonstrate that PoCO effectively balances GEC performance by increasing recall with competitive precision, ultimately improving the overall quality of grammatical error correction.

[30] Distilling Many-Shot In-Context Learning into a Cheat Sheet

Ukyo Honda,Soichiro Murakami,Peinan Zhang

Main category: cs.CL

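TL;DR: Proposes "cheat-sheet ICL", which compresses the information in many-shot in-context learning into a concise textual summary, cutting input tokens while matching or improving performance.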

Details Motivation: Address the high computational cost that long inputs impose on large language models in many-shot in-context learning. Method: Distill the information from many-shot in-context learning into a concise textual summary (the "cheat sheet") that serves as the context at inference time. Result: Experiments on challenging reasoning tasks show that cheat-sheet ICL performs as well as or better than many-shot ICL with far fewer tokens, and matches retrieval-based ICL without test-time retrieval. Conclusion: Cheat-sheet ICL is a practical and efficient way to use large language models for downstream tasks. Abstract: Recent advances in large language models (LLMs) enable effective in-context learning (ICL) with many-shot examples, but at the cost of high computational demand due to longer input tokens. To address this, we propose cheat-sheet ICL, which distills the information from many-shot ICL into a concise textual summary (cheat sheet) used as the context at inference time. Experiments on challenging reasoning tasks show that cheat-sheet ICL achieves comparable or better performance than many-shot ICL with far fewer tokens, and matches retrieval-based ICL without requiring test-time retrieval. These findings demonstrate that cheat-sheet ICL is a practical alternative for leveraging LLMs in downstream tasks.

Shuo Huang,Xingliang Yuan,Gholamreza Haffari,Lizhen Qu

Main category: cs.CL

TL;DR: Proposes a zero-shot, tree-search-based iterative sentence rewriting algorithm that protects privacy while keeping text coherent and natural.

Details Motivation: Existing text anonymization techniques struggle to balance privacy protection with text naturalness and can harm the text's utility. Method: Use a tree search guided by a reward model to rewrite privacy-sensitive segments step by step, obfuscating or deleting private information through zero-shot iterative rewriting. Result: Experiments on privacy-sensitive datasets show that the method clearly outperforms existing baselines on both privacy protection and utility preservation. Conclusion: The method balances privacy protection and text quality and suits input privacy protection for LLMs in cloud services. Abstract: The increasing adoption of large language models (LLMs) in cloud-based services has raised significant privacy concerns, as user inputs may inadvertently expose sensitive information. Existing text anonymization and de-identification techniques, such as rule-based redaction and scrubbing, often struggle to balance privacy preservation with text naturalness and utility. In this work, we propose a zero-shot, tree-search-based iterative sentence rewriting algorithm that systematically obfuscates or deletes private information while preserving coherence, relevance, and naturalness. Our method incrementally rewrites privacy-sensitive segments through a structured search guided by a reward model, enabling dynamic exploration of the rewriting space. Experiments on privacy-sensitive datasets show that our approach significantly outperforms existing baselines, achieving a superior balance between privacy protection and utility preservation.

[32] Concise and Sufficient Sub-Sentence Citations for Retrieval-Augmented Generation

Guo Chen,Qiuyuan Li,Qiuxian Li,Hongliang Dai,Xiang Chen,Piji Li

Main category: cs.CL

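TL;DR: This paper proposes a method for generating fine-grained, sub-sentence citations that make citations in retrieval-augmented generation (RAG) systems easier to read and faster to verify.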

Details Motivation: Existing citation methods are too coarse (sentence or paragraph level) and can omit needed information, which makes it harder for users to verify LLM outputs. Method: Write annotation guidelines for sub-sentence citations and build a dataset; design an LLM-based attribution framework that generates fine-tuning data automatically and uses a credit model to filter out low-quality examples. Result: Experiments show that the method produces higher-quality citations that are concise and sufficient, improving readability and user verification efficiency. Conclusion: Sub-sentence citations, combined with automatic data generation and quality filtering, improve citation quality in RAG systems. Abstract: In retrieval-augmented generation (RAG) question answering systems, generating citations for large language model (LLM) outputs enhances verifiability and helps users identify potential hallucinations. However, we observe two problems in the citations produced by existing attribution methods. First, the citations are typically provided at the sentence or even paragraph level. Long sentences or paragraphs may include a substantial amount of irrelevant content. Second, sentence-level citations may omit information that is essential for verifying the output, forcing users to read the surrounding context. In this paper, we propose generating sub-sentence citations that are both concise and sufficient, thereby reducing the effort required by users to confirm the correctness of the generated output. To this end, we first develop annotation guidelines for such citations and construct a corresponding dataset. Then, we propose an attribution framework for generating citations that adhere to our standards. This framework leverages LLMs to automatically generate fine-tuning data for our task and employs a credit model to filter out low-quality examples. Our experiments on the constructed dataset demonstrate that the proposed approach can generate high-quality and more readable citations.

[33] WeFT: Weighted Entropy-driven Fine-Tuning for dLLMs

Guowei Xu,Wenxin Xu,Jiawang Zhao,Kaisheng Ma

Main category: cs.CL

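TL;DR: Proposes WeFT, a weighted supervised fine-tuning method that assigns tokens different weights based on their entropy to improve the reasoning ability of diffusion language models trained on small sample sets. It clearly outperforms standard SFT on several benchmarks.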

Details Motivation: Diffusion models generate language quickly but lack precise probability estimates, which makes generation hard to control and inconsistent and makes supervised fine-tuning difficult. Method: Propose WeFT, derived from diffusion theory, which weights each token by its entropy so that key tokens have more influence during fine-tuning and steer the direction of generation. Result: Trained on s1K, s1K-1.1, and 3k samples, WeFT achieves relative improvements of 39%, 64%, and 83% over standard SFT on four reasoning benchmarks: Sudoku, Countdown, GSM8K, and MATH-500. Conclusion: WeFT improves the reasoning performance of diffusion language models in low-sample settings and suggests a new approach to controllable fine-tuning of diffusion models. Abstract: Diffusion models have recently shown strong potential in language modeling, offering faster generation compared to traditional autoregressive approaches. However, applying supervised fine-tuning (SFT) to diffusion models remains challenging, as they lack precise probability estimates at each denoising step. While the diffusion mechanism enables the model to reason over entire sequences, it also makes the generation process less predictable and often inconsistent. This highlights the importance of controlling key tokens that guide the direction of generation. To address this issue, we propose WeFT, a weighted SFT method for diffusion language models, where tokens are assigned different weights based on their entropy. Derived from diffusion theory, WeFT delivers substantial gains: training on s1K, s1K-1.1, and 3k samples from open-r1, it achieves relative improvements of 39%, 64%, and 83% over standard SFT on four widely used reasoning benchmarks (Sudoku, Countdown, GSM8K, and MATH-500). The code and models will be made publicly available.
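The idea of weighting tokens by entropy can be illustrated with a toy weighted negative log-likelihood. This is only a sketch under stated assumptions: the abstract does not give WeFT's weighting formula or how it enters the diffusion objective, so the mean-normalized, entropy-proportional weights and the function names below are hypothetical.

```python
import math


def entropy(p):
    """Shannon entropy (nats) of a discrete distribution."""
    return -sum(x * math.log(x) for x in p if x > 0)


def entropy_weights(token_dists):
    """Weights proportional to each token's predictive entropy, scaled to mean 1."""
    ents = [entropy(p) for p in token_dists]
    mean = sum(ents) / len(ents)
    if mean == 0:
        return [1.0] * len(ents)   # all tokens certain: fall back to uniform weights
    return [e / mean for e in ents]


def weighted_nll(token_dists, targets):
    """Per-token negative log-likelihood, weighted by entropy and averaged."""
    weights = entropy_weights(token_dists)
    losses = [-math.log(p[t]) for p, t in zip(token_dists, targets)]
    return sum(w * l for w, l in zip(weights, losses)) / len(losses)


# Two hypothetical token positions over a 4-word vocabulary.
token_dists = [
    [0.97, 0.01, 0.01, 0.01],   # confident position: low entropy
    [0.25, 0.25, 0.25, 0.25],   # uncertain position: high entropy
]
targets = [0, 2]

weights = entropy_weights(token_dists)
loss = weighted_nll(token_dists, targets)
```

In this toy setup the uncertain position dominates the loss, which matches the abstract's emphasis on "key tokens that guide the direction of generation".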

[34] Single Answer is Not Enough: On Generating Ranked Lists with Medical Reasoning Models

Pittawat Taveekitworachai,Natpatchara Pongjirapat,Krittaphas Chaisutyakorn,Piyalitt Ittichaiwong,Tossaporn Saengja,Kunat Pipatanakul

Main category: cs.CL

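TL;DR: This paper is the first systematic study of how to make medical reasoning models (MRMs) generate ranked lists of answers to open-ended questions. It proposes and compares prompting and fine-tuning approaches, finds that models trained with reinforcement fine-tuning (RFT) are more robust across answer formats, and shows their potential in clinical scenarios with multiple correct answers.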

Details Motivation: Clinical decisions usually weigh several possible answers to avoid one-sided judgments, but existing medical reasoning models are mostly trained to output a single answer and fall short of practical needs. New methods that support ranked multi-answer output are therefore needed. Method: Study two approaches: prompting and fine-tuning (supervised fine-tuning, SFT, and reinforcement fine-tuning, RFT). Design new reward functions for ranked lists and run RFT ablation studies. Conduct a case study on a modified MedQA dataset. Result: Some SFT models generalize to certain formats, but RFT-trained models are more robust across multiple answer formats; a model may fail to select the benchmark's designated answer yet still recognize other valid answers. Conclusion: RFT beats SFT at making medical reasoning models adaptable and robust across diverse answer formats, offering a viable path toward output formats that go beyond a single answer in medicine. Abstract: This paper presents a systematic study on enabling medical reasoning models (MRMs) to generate ranked lists of answers for open-ended questions. Clinical decision-making rarely relies on a single answer but instead considers multiple options, reducing the risks of narrow perspectives. Yet current MRMs are typically trained to produce only one answer, even in open-ended settings. We propose an alternative format, ranked lists, and investigate two approaches: prompting and fine-tuning. While prompting is a cost-effective way to steer an MRM's response, not all MRMs generalize well across different answer formats: choice, short text, and list answers. Based on our prompting findings, we train and evaluate MRMs using supervised fine-tuning (SFT) and reinforcement fine-tuning (RFT). SFT teaches a model to imitate annotated responses, and RFT incentivizes exploration through the responses that maximize a reward. We propose new reward functions targeted at ranked-list answer formats, and conduct ablation studies for RFT. Our results show that while some SFT models generalize to certain answer formats, models trained with RFT are more robust across multiple formats. We also present a case study on a modified MedQA with multiple valid answers, finding that although MRMs might fail to select the benchmark's preferred ground truth, they can recognize valid answers. To the best of our knowledge, this is the first systematic investigation of approaches for enabling MRMs to generate answers as ranked lists.
We hope this work provides a first step toward developing alternative answer formats that are beneficial beyond single answers in medical domains.
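To make the notion of a reward "targeted at ranked-list answer formats" concrete, the sketch below scores a ranked list by the reciprocal rank of the first correct answer, a standard ranking measure. It is not the paper's reward function, whose definition the abstract does not give; the function name and the clinical example strings are hypothetical.

```python
def reciprocal_rank_reward(ranked_answers, gold_answers):
    """Return 1/rank of the first answer that matches any gold answer, else 0.0.

    Matching is case-insensitive exact match after stripping whitespace,
    an assumption made to keep the sketch simple.
    """
    gold = {g.strip().lower() for g in gold_answers}
    for rank, answer in enumerate(ranked_answers, start=1):
        if answer.strip().lower() in gold:
            return 1.0 / rank
    return 0.0


# Hypothetical ranked differential produced by a model.
ranked = ["migraine", "tension headache", "cluster headache"]

reward_second = reciprocal_rank_reward(ranked, ["Tension headache"])
reward_first = reciprocal_rank_reward(ranked, ["migraine", "cluster headache"])
reward_miss = reciprocal_rank_reward(ranked, ["sinusitis"])
```

A reward of this shape pays most when a valid answer is ranked first but still gives partial credit when it appears lower in the list, unlike a single-answer exact-match reward.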

[35] Learning to Summarize by Learning to Quiz: Adversarial Agentic Collaboration for Long Document Summarization

Weixuan Wang,Minghao Wu,Barry Haddow,Alexandra Birch

Main category: cs.CL

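TL;DR: This paper proposes SummQ, an adversarial multi-agent framework that tackles information loss, factual inconsistency, and coherence problems in long document summarization. Agents working in two domains, summarization and quizzing, collaborate to refine summaries iteratively, and the method clearly outperforms existing approaches in both automatic and human evaluation.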

Details Motivation: Current large language models commonly lose information, introduce factual errors, and produce incoherent summaries when processing very long documents, so they fall short of high-quality long document summarization. Method: Design an adversarial multi-agent framework (SummQ) with summary generators and reviewers, quiz generators and reviewers, and an examinee agent. Quizzing acts as a quality check, and multi-round feedback drives iterative refinement of the summary. Result: On three widely used long document summarization benchmarks, SummQ clearly outperforms current state-of-the-art methods on ROUGE, BERTScore, LLM-as-a-Judge, and human evaluation. Conclusion: Through adversarial multi-agent collaboration, and in particular continuous quality checks via quizzing, SummQ improves long document summarization quality and offers a new paradigm for the field. Abstract: Long document summarization remains a significant challenge for current large language models (LLMs), as existing approaches commonly struggle with information loss, factual inconsistencies, and coherence issues when processing excessively long documents. We propose SummQ, a novel adversarial multi-agent framework that addresses these limitations through collaborative intelligence between specialized agents operating in two complementary domains: summarization and quizzing. Our approach employs summary generators and reviewers that work collaboratively to create and evaluate comprehensive summaries, while quiz generators and reviewers create comprehension questions that serve as continuous quality checks for the summarization process. This adversarial dynamic, enhanced by an examinee agent that validates whether the generated summary contains the information needed to answer the quiz questions, enables iterative refinement through multifaceted feedback mechanisms. We evaluate SummQ on three widely used long document summarization benchmarks. Experimental results demonstrate that our framework significantly outperforms existing state-of-the-art methods across ROUGE and BERTScore metrics, as well as in LLM-as-a-Judge and human evaluations. Our comprehensive analyses reveal the effectiveness of the multi-agent collaboration dynamics, the influence of different agent configurations, and the impact of the quizzing mechanism. This work establishes a new approach for long document summarization that uses adversarial agentic collaboration to improve summarization quality.

[36] MemLens: Uncovering Memorization in LLMs with Activation Trajectories

Zirui He,Haiyan Zhao,Ali Payani,Mengnan du

Main category: cs.CL

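TL;DR: Proposes MemLens, which detects memorization in large language models by analyzing the probability trajectories of numeric tokens during generation. Contaminated samples lock onto the answer in early layers, whereas clean samples accumulate evidence gradually.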

Details Motivation: Existing detection methods generalize poorly to implicitly contaminated data and struggle to identify memorization by large language models on challenging benchmarks. Method: Analyze how the probability of numeric tokens changes across the model's layers during generation, and inject carefully designed samples through LoRA fine-tuning to validate the trajectory patterns. Result: Contaminated and clean samples exhibit clearly separated reasoning trajectories, and MemLens captures genuine memorization signals rather than spurious correlations. Conclusion: MemLens reveals characteristic features of memorization behavior and provides an interpretable and robust tool for detecting it. Abstract: Large language models (LLMs) are commonly evaluated on challenging benchmarks such as AIME and Math500, which are susceptible to contamination and risk of being memorized. Existing detection methods, which primarily rely on surface-level lexical overlap and perplexity, demonstrate low generalization and degrade significantly when encountering implicitly contaminated data. In this paper, we propose MemLens (An Activation Lens for Memorization Detection) to detect memorization by analyzing the probability trajectories of numeric tokens during generation. Our method reveals that contaminated samples exhibit "shortcut" behaviors, locking onto an answer with high confidence in the model's early layers, whereas clean samples show more gradual evidence accumulation across the model's full depth. We observe that contaminated and clean samples exhibit distinct and well-separated reasoning trajectories. To further validate this, we inject carefully designed samples into the model through LoRA fine-tuning and observe the same trajectory patterns as in naturally contaminated data. These results provide strong evidence that MemLens captures genuine signals of memorization rather than spurious correlations.
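
A per-layer probability trajectory of the kind the MemLens abstract describes might be computed roughly as follows. This is a hedged sketch, not the paper's procedure: it assumes a logit-lens style projection (each layer's hidden state multiplied by the unembedding matrix), and the function names, the 0.9 threshold, and the toy two-token vocabulary are all illustrative assumptions.

```python
import math
from typing import List, Optional, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def token_trajectory(hidden_by_layer: Sequence[Sequence[float]],
                     unembed: Sequence[Sequence[float]],
                     token_id: int) -> List[float]:
    """Probability of `token_id` when each layer's hidden state is projected
    through the unembedding matrix (one row per vocabulary entry)."""
    probs = []
    for h in hidden_by_layer:
        logits = [sum(w * x for w, x in zip(row, h)) for row in unembed]
        probs.append(softmax(logits)[token_id])
    return probs


def lock_in_layer(trajectory: Sequence[float], threshold: float = 0.9) -> Optional[int]:
    """First layer index whose probability reaches `threshold`, else None."""
    for layer, p in enumerate(trajectory):
        if p >= threshold:
            return layer
    return None


# Toy data (assumption): 2-token vocabulary, 2-dimensional hidden states, 3 layers.
UNEMBED = [[1.0, 0.0], [0.0, 1.0]]
early_lock = [[5.0, 0.0], [6.0, 0.0], [7.0, 0.0]]   # confident from the first layer
gradual = [[0.0, 0.0], [1.0, 0.0], [5.0, 0.0]]      # evidence builds with depth

early_layer = lock_in_layer(token_trajectory(early_lock, UNEMBED, 0))
gradual_layer = lock_in_layer(token_trajectory(gradual, UNEMBED, 0))
```

With a real model, the hidden states would come from the residual stream at the answer position; how MemLens normalizes or aggregates them is not stated in the abstract.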

[37] Cross-Linguistic Analysis of Memory Load in Sentence Comprehension: Linear Distance and Structural Density

Krishna Aggarwal

Main category: cs.CL

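TL;DR: Using cross-linguistic dependency treebanks and mixed-effects models, this study compares how well linear distance and structural density explain memory load in sentence comprehension, and proposes Intervener Complexity as a structural measure that is finer-grained than linear distance.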

Details Motivation: To determine whether the linear distance between syntactically related words or the structural density of the intervening material better explains memory load in sentence comprehension, and to address the shortcomings of existing linear distance measures. Method: Using Universal Dependencies (UD) treebanks and mixed-effects models across multiple languages, the study analyzes sentence length, dependency length, and Intervener Complexity as predictors of memory load, which is operationalized as the linear sum of feature interference and feature misbinding. Result: All three are positively associated with memory load; sentence length has the broadest influence, while Intervener Complexity adds explanatory power beyond linear distance, suggesting that the number of intervening heads is a more direct indicator of cognitive load. Conclusion: The study reconciles linear and hierarchical accounts of locality, arguing that dependency length is a surface signature while the number of intervening heads better reflects the demands of syntactic integration and memory maintenance; methodologically, it shows how graph-based measures and cross-linguistic modelling can separate linear from structural effects. Abstract: This study examines whether sentence-level memory load in comprehension is better explained by linear proximity between syntactically related words or by the structural density of the intervening material. Building on locality-based accounts and cross-linguistic evidence for dependency length minimization, the work advances Intervener Complexity (the number of intervening heads between a head and its dependent) as a structurally grounded lens that refines linear distance measures. Using harmonized dependency treebanks and a mixed-effects framework across multiple languages, the analysis jointly evaluates sentence length, dependency length, and Intervener Complexity as predictors of the memory-load measure. Studies in psycholinguistics have reported the contributions of feature interference and misbinding to memory load during processing. For this study, I operationalized sentence-level memory load as the linear sum of feature misbinding and feature interference for tractability; current evidence does not establish that their cognitive contributions combine additively. All three factors are positively associated with memory load, with sentence length exerting the broadest influence and Intervener Complexity offering explanatory power beyond linear distance. Conceptually, the findings reconcile linear and hierarchical perspectives on locality by treating dependency length as an important surface signature while identifying intervening heads as a more proximate indicator of integration and maintenance demands.
Methodologically, the study illustrates how UD-based graph measures and cross-linguistic mixed-effects modelling can disentangle linear and structural contributions to processing efficiency, providing a principled path for evaluating competing theories of memory load in sentence comprehension.
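
The two distance measures contrasted in this abstract can be made concrete on a single parsed sentence. The sketch below is an interpretation rather than the author's code: it assumes UD-style 1-based token IDs with 0 for the root, and it counts as an "intervening head" any token strictly between the head and the dependent that governs at least one other token. The function names and the example parse are illustrative.

```python
from typing import List


def dependency_length(head: int, dep: int) -> int:
    """Linear distance between a head and its dependent (token IDs are 1-based)."""
    return abs(head - dep)


def intervener_complexity(heads: List[int], head: int, dep: int) -> int:
    """Number of tokens strictly between `head` and `dep` that are themselves heads.

    `heads[i - 1]` is the head of token i, with 0 marking the root.
    """
    governors = {h for h in heads if h != 0}
    lo, hi = sorted((head, dep))
    return sum(1 for tok in range(lo + 1, hi) if tok in governors)


# "The dog that chased the cat barked" (hand-written parse, for illustration only)
#   1:The 2:dog 3:that 4:chased 5:the 6:cat 7:barked
HEADS = [2, 7, 4, 2, 6, 4, 0]

# Subject arc: barked (7) -> dog (2)
subject_length = dependency_length(7, 2)                      # tokens 3..6 intervene
subject_interveners = intervener_complexity(HEADS, 7, 2)      # "chased" and "cat" are heads
```

Two arcs with the same linear length can therefore differ in Intervener Complexity, which is the contrast the study's mixed-effects models are meant to separate.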

[38] Tool Calling for Arabic LLMs: Data Strategies and Instruction Tuning

Asim Ersoy,Enes Altinisik,Husrev Taha Sencar,Kareem Darwish

Main category: cs.CL

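TL;DR: This paper examines key questions for Arabic tool calling: whether in-language Arabic data is needed, how general-purpose instruction tuning affects performance, and what value fine-tuning on specific tools adds. By translating two open-source tool-calling datasets into Arabic and running experiments, the authors evaluate how different strategies affect the tool-calling performance of Arabic LLMs.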

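Details Motivation: Existing tool-calling research and resources are concentrated on English and offer little support for other languages such as Arabic, which limits multilingual applications. It is therefore necessary to explore how to enable tool calling effectively in Arabic. Method: The authors translate and adapt two open-source tool-calling datasets into Arabic and, using base and post-trained variants of an open-weight Arabic LLM, design experiments that evaluate cross-lingual transfer, general-purpose instruction tuning, and tool-specific fine-tuning. Result: For Arabic tool calling, in-language data clearly outperforms pure cross-lingual transfer; general-purpose instruction tuning helps to some extent, and fine-tuning on high-priority tools improves performance further. Conclusion: Building effective Arabic tool-augmented agents calls for combining in-language tool-calling data, general-purpose instruction tuning, and dedicated fine-tuning on key tools, which offers a practical path for multilingual tool-calling systems. Abstract: Tool calling is a critical capability that allows Large Language Models (LLMs) to interact with external systems, significantly expanding their utility. However, research and resources for tool calling are predominantly English-centric, leaving a gap in our understanding of how to enable this functionality for other languages, such as Arabic. This paper investigates three key research questions: (1) the necessity of in-language (Arabic) tool-calling data versus relying on cross-lingual transfer, (2) the effect of general-purpose instruction tuning on tool-calling performance, and (3) the value of fine-tuning on specific, high-priority tools. To address these questions, we conduct extensive experiments using base and post-trained variants of an open-weight Arabic LLM. To enable this study, we bridge the resource gap by translating and adapting two open-source tool-calling datasets into Arabic. Our findings provide crucial insights into the optimal strategies for developing robust tool-augmented agents for Arabic.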

[39] Analysis of instruction-based LLMs' capabilities to score and judge text-input problems in an academic setting

Valeria Ramirez-Garcia,David de-Fitero-Dominguez,Antonio Garcia-Cabot,Eva Garcia-Lopez

Main category: cs.CL

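TL;DR: This study examines LLM-based automatic scoring of text-input problems in higher education. It proposes five rubric-based evaluation methods and compares them with human scoring on a custom dataset. Reference Aided Evaluation performs best, with the lowest deviation and the highest-quality evaluations, suggesting that AI-assisted scoring has potential as a complementary teaching tool.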

Details Motivation: To improve the accuracy and fairness of automatic assessment in education, to examine how effective LLMs are as graders of academic text answers, and to address the shortcomings of existing methods in scoring consistency and completeness of information. Method: Five LLM-driven evaluation methods are proposed and compared: JudgeLM Evaluation, Reference Aided Evaluation, No Reference Evaluation, Additive Evaluation, and Adaptive Evaluation. Experiments use a custom dataset of 110 answers from computer science students and three models (JudgeLM, Llama-3.1-8B, and DeepSeek-R1-Distill-Llama-8B); results are compared against human scores using median absolute deviation and root mean square deviation. Result: Reference Aided Evaluation performs best, with a median absolute deviation of 0.945 and a root mean square deviation of 1.214, the closest to human scoring; the other methods perform worse on concise answers or when no reference answer is available, and JudgeLM does not produce good results because of model limitations. Conclusion: With suitable methodology, LLM-based automatic evaluation systems have potential as complementary tools for educational assessment; Reference Aided Evaluation in particular stands out for its accuracy and the quality of its feedback and merits wider adoption. Abstract: Large language models (LLMs) can act as evaluators, a role studied by methods like LLM-as-a-Judge and fine-tuned judging LLMs. In the field of education, LLMs have been studied as assistant tools for students and teachers. Our research investigates LLM-driven automatic evaluation systems for academic Text-Input Problems using rubrics. We propose five evaluation systems that have been tested on a custom dataset of 110 answers about computer science from higher education students with three models: JudgeLM, Llama-3.1-8B and DeepSeek-R1-Distill-Llama-8B. The evaluation systems include: The JudgeLM evaluation, which uses the model's single answer prompt to obtain a score; Reference Aided Evaluation, which uses a correct answer as a guide aside from the original context of the question; No Reference Evaluation, which omits the reference answer; Additive Evaluation, which uses atomic criteria; and Adaptive Evaluation, which is an evaluation done with generated criteria fitted to each question. All evaluation methods have been compared with the results of a human evaluator. Results show that the best method to automatically evaluate and score Text-Input Problems using LLMs is Reference Aided Evaluation. With the lowest median absolute deviation (0.945) and the lowest root mean square deviation (1.214) when compared to human evaluation, Reference Aided Evaluation offers fair scoring as well as insightful and complete evaluations.
Other methods such as Additive and Adaptive Evaluation fail to provide good results in concise answers, No Reference Evaluation lacks information needed to correctly assess questions and JudgeLM Evaluations have not provided good results due to the model's limitations. As a result, we conclude that Artificial Intelligence-driven automatic evaluation systems, aided with proper methodologies, show potential to work as complementary tools to other academic resources.
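
The two agreement metrics reported above can be computed in a few lines. The sketch below assumes that both are taken over per-answer differences between the LLM score and the human score, which is the most natural reading of "when compared to human evaluation" but is not spelled out in the abstract; the score lists are made-up illustration data, not values from the paper.

```python
import math
from statistics import median
from typing import Sequence


def median_absolute_deviation(model: Sequence[float], human: Sequence[float]) -> float:
    """Median of |model score - human score| over all answers."""
    return median(abs(m - h) for m, h in zip(model, human))


def root_mean_square_deviation(model: Sequence[float], human: Sequence[float]) -> float:
    """Square root of the mean squared difference between model and human scores."""
    squared = [(m - h) ** 2 for m, h in zip(model, human)]
    return math.sqrt(sum(squared) / len(squared))


# Hypothetical scores on a 0-10 scale for four answers.
llm_scores = [7, 5, 9, 4]
human_scores = [8, 5, 7, 4]

mad = median_absolute_deviation(llm_scores, human_scores)
rmsd = root_mean_square_deviation(llm_scores, human_scores)
```

The median is robust to a few badly mis-scored answers, while the root mean square deviation penalizes them heavily, so reporting both gives a fuller picture of how far a grader strays from the human reference.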

[40] Generative AI for FFRDCs

Arun S. Maiya

Main category: cs.CL

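TL;DR: This paper shows how large language models, given only a few examples, can accelerate text-heavy work at federally funded research and development centers (FFRDCs), and how the OnPrem.LLM framework allows generative AI to be applied securely in sensitive government settings.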

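Details Motivation: FFRDCs face large volumes of text analysis that are slow to process manually, and they need automated tools that improve the speed and quality of analysis while meeting the security and compliance requirements of government environments. Method: The open-source OnPrem.LLM framework is used to run large language models locally, with a few input-output examples driving summarization, classification, information extraction, and sense-making. Result: Case studies on defense policy documents (such as the NDAA) and scientific corpora (such as NSF Awards) show that the approach improves oversight and strategic analysis while preserving auditability and data sovereignty. Conclusion: An LLM approach built on OnPrem.LLM can substantially improve text analysis at FFRDCs in sensitive settings while maintaining security and compliance, and it has broad applicability. Abstract: Federally funded research and development centers (FFRDCs) face text-heavy workloads, from policy documents to scientific and engineering papers, that are slow to analyze manually. We show how large language models can accelerate summarization, classification, extraction, and sense-making with only a few input-output examples. To enable use in sensitive government contexts, we apply OnPrem.LLM, an open-source framework for secure and flexible application of generative AI. Case studies on defense policy documents and scientific corpora, including the National Defense Authorization Act (NDAA) and National Science Foundation (NSF) Awards, demonstrate how this approach enhances oversight and strategic analysis while maintaining auditability and data sovereignty.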

[41] Behind RoPE: How Does Causal Mask Encode Positional Information?

Junu Kim,Xiao Liu,Zhenghao Lin,Lei Ji,Yeyun Gong,Edward Choi

Main category: cs.CL

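TL;DR: This paper studies how the causal mask in Transformer decoders makes attention scores depend on position. Even without parameters or causal dependency in the input, the causal mask induces a locality-favoring pattern similar to that of positional encodings, and its interaction with RoPE distorts RoPE's relative attention patterns.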

Details Motivation: To investigate whether the causal mask, in addition to explicit positional encodings such as RoPE, also supplies important positional information. Method: A theoretical analysis proves that the causal mask can induce position-dependent patterns in attention, and an empirical analysis confirms this behavior in trained models. Result: The causal mask gives rise to attention patterns that favor nearby query-key pairs, and when combined with RoPE it disrupts RoPE's relative-position property; the effect is widespread in modern large language models. Conclusion: The causal mask is an important source of positional information and should be considered alongside explicit positional encodings. Abstract: While explicit positional encodings such as RoPE are a primary source of positional information in Transformer decoders, the causal mask also provides positional information. In this work, we prove that the causal mask can induce position-dependent patterns in attention scores, even without parameters or causal dependency in the input. Our theoretical analysis indicates that the induced attention pattern tends to favor nearby query-key pairs, mirroring the behavior of common positional encodings. Empirical analysis confirms that trained models exhibit the same behavior, with learned parameters further amplifying these patterns. Notably, we found that the interaction of causal mask and RoPE distorts RoPE's relative attention score patterns into non-relative ones. We consistently observed this effect in modern large language models, suggesting the importance of considering the causal mask as a source of positional information alongside explicit positional encodings.
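
The simplest ingredient of this argument can be seen without any learned parameters. The sketch below is not the paper's proof and does not reproduce its result about nearby query-key pairs; it only illustrates, under the assumption of identical (content-free) attention scores, that a causal softmax already produces rows that differ by position, because query i spreads its weight over exactly i + 1 visible keys.

```python
import math
from typing import List


def causal_softmax(scores: List[List[float]]) -> List[List[float]]:
    """Row-wise softmax in which query i may only attend to keys j <= i."""
    weights = []
    for i, row in enumerate(scores):
        visible = row[: i + 1]
        m = max(visible)
        exps = [math.exp(s - m) for s in visible]
        total = sum(exps)
        # Masked (future) positions receive zero weight.
        weights.append([e / total for e in exps] + [0.0] * (len(row) - i - 1))
    return weights


n = 4
uniform_scores = [[0.0] * n for _ in range(n)]  # no parameters, no content signal
weights = causal_softmax(uniform_scores)

for i, row in enumerate(weights):
    print(i, [round(w, 3) for w in row])
```

Each output token is therefore an average over a position-dependent number of inputs, so later layers receive representations that carry positional information even though no positional encoding was added.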

[42] When Instructions Multiply: Measuring and Estimating LLM Capabilities of Multiple Instructions Following

Keno Harada,Yudai Yamazaki,Masachika Taniguchi,Edison Marrese-Taylor,Takeshi Kojima,Yusuke Iwasawa,Yutaka Matsuo

Main category: cs.CL

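TL;DR: This paper introduces two benchmarks for evaluating how well LLMs follow multiple instructions at once, ManyIFEval (text generation) and StyleMBPP (code generation), and finds that performance degrades steadily as the number of instructions grows. It also develops three types of regression models that predict performance on unseen instruction combinations from a small number of samples; a logistic regression model based on instruction count has an error of about 10%.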

Details Motivation: As large language models are deployed widely in real-world scenarios, their ability to follow several instructions simultaneously needs systematic evaluation, especially in key areas such as text and code generation. Method: Two new benchmarks, ManyIFEval and StyleMBPP, are built with up to ten text instructions and up to six code instructions; experiments are run on ten LLMs, and three types of regression models (including logistic regression) are trained to predict performance for different instruction counts and unseen combinations. Result: Every model's performance drops as the number of instructions increases; the logistic regression model predicts performance with roughly 10% error; and only 500 samples (ManyIFEval) and 300 samples (StyleMBPP) are needed for useful estimates. Conclusion: Following multiple instructions remains a challenge for current LLMs; the proposed regression models can efficiently estimate model performance under complex instruction combinations, offering a practical way to predict performance in real applications. Abstract: As large language models (LLMs) are increasingly applied to real-world scenarios, it becomes crucial to understand their ability to follow multiple instructions simultaneously. To systematically evaluate these capabilities, we introduce two specialized benchmarks for fundamental domains where multiple instructions following is important: Many Instruction-Following Eval (ManyIFEval) for text generation with up to ten instructions, and Style-aware Mostly Basic Programming Problems (StyleMBPP) for code generation with up to six instructions. Our experiments with the created benchmarks across ten LLMs reveal that performance consistently degrades as the number of instructions increases. Furthermore, given the fact that evaluating all the possible combinations of multiple instructions is computationally impractical in actual use cases, we developed three types of regression models that can estimate performance on both unseen instruction combinations and different numbers of instructions which are not used during training. We demonstrate that a logistic regression model using instruction count as an explanatory variable can predict performance of following multiple instructions with approximately 10% error, even for unseen instruction combinations. We show that relatively modest sample sizes (500 for ManyIFEval and 300 for StyleMBPP) are sufficient for performance estimation, enabling efficient evaluation of LLMs under various instruction combinations.
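
A logistic regression with instruction count as its only explanatory variable, as mentioned in the abstract, can be sketched in plain Python. Everything concrete here is an assumption for illustration: the training data are synthetic (not from ManyIFEval or StyleMBPP), and the gradient-descent fit with its learning rate and epoch count is just one simple way to estimate the two parameters.

```python
import math
from typing import List, Tuple


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(counts: List[int], successes: List[int],
                 lr: float = 0.1, epochs: int = 5000) -> Tuple[float, float]:
    """Fit P(all instructions followed) = sigmoid(w * count + b) by gradient descent."""
    w, b = 0.0, 0.0
    n = len(counts)
    for _ in range(epochs):
        grad_w = grad_b = 0.0
        for x, y in zip(counts, successes):
            p = sigmoid(w * x + b)
            grad_w += (p - y) * x
            grad_b += p - y
        w -= lr * grad_w / n
        b -= lr * grad_b / n
    return w, b


def predict(w: float, b: float, count: int) -> float:
    """Estimated success probability for a prompt with `count` instructions."""
    return sigmoid(w * count + b)


# Synthetic outcomes (assumption): success rate falls from 100% at 1 instruction
# to 0% at 5 instructions, four samples per count.
COUNTS = [c for c in range(1, 6) for _ in range(4)]
SUCCESSES = [1, 1, 1, 1,  1, 1, 1, 0,  1, 1, 0, 0,  1, 0, 0, 0,  0, 0, 0, 0]

w, b = fit_logistic(COUNTS, SUCCESSES)
```

Because the model depends only on the count, it can extrapolate to instruction combinations and counts that were never sampled, for example `predict(w, b, 7)`, which is what makes this kind of estimator cheap compared with exhaustively evaluating every combination.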

[43] SoM-1K: A Thousand-Problem Benchmark Dataset for Strength of Materials

Qixin Wan,Zilong Wang,Jingwen Zhou,Wanting Wang,Ziheng Geng,Jiachen Liu,Ran Cao,Minghui Cheng,Lu Cheng

Main category: cs.CL

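TL;DR: SoM-1K is the first large-scale multimodal benchmark dataset for strength of materials, designed to evaluate foundation models on complex engineering problems. Current models perform poorly, with a best accuracy of only 56.6%.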

Details Motivation: To examine how foundation models perform on complex, multimodal engineering problems, given that strength of materials in particular lacks a systematic evaluation benchmark. Method: The authors build SoM-1K, a dataset of 1,065 annotated problems, and propose the Descriptions of Images (DoI) prompting strategy, which replaces the visual diagram with an expert-written text description; eight representative foundation models, including LLMs and VLMs, are evaluated. Result: Current foundation models perform poorly on the task, with the best model reaching 56.6% accuracy; LLMs given DoI often outperform VLMs given the images; and error analysis shows that DoI reduces visual misinterpretation errors. Conclusion: SoM-1K establishes a rigorous benchmark for engineering AI, shows that current foundation models fall short in multimodal reasoning (especially in scientific and engineering settings), and underlines the need for stronger multimodal reasoning capabilities. Abstract: Foundation models have shown remarkable capabilities in various domains, but their performance on complex, multimodal engineering problems remains largely unexplored. We introduce SoM-1K, the first large-scale multimodal benchmark dataset dedicated to evaluating foundation models on problems in the strength of materials (SoM). The dataset, which contains 1,065 annotated SoM problems, mirrors real-world engineering tasks by including both textual problem statements and schematic diagrams. Due to the limited capabilities of current foundation models in understanding complicated visual information, we propose a novel prompting strategy called Descriptions of Images (DoI), which provides rigorous expert-generated text descriptions of the visual diagrams as the context. We evaluate eight representative foundation models, including both large language models (LLMs) and vision language models (VLMs). Our results show that current foundation models struggle significantly with these engineering problems, with the best-performing model achieving only 56.6% accuracy. Interestingly, we found that LLMs, when provided with DoI, often outperform VLMs provided with visual diagrams. A detailed error analysis reveals that DoI plays a crucial role in mitigating visual misinterpretation errors, suggesting that accurate text-based descriptions can be more effective than direct image input for current foundation models.
This work establishes a rigorous benchmark for engineering AI and highlights a critical need for developing more robust multimodal reasoning capabilities in foundation models, particularly in scientific and engineering contexts.

[44] Which Cultural Lens Do Models Adopt? On Cultural Positioning Bias and Agentic Mitigation in LLMs

Yixin Wan,Xingrun Chen,Kai-Wei Chang

Main category: cs.CL

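TL;DR: This paper identifies and systematically studies culture positioning bias in large language models: when generating content, models tend to adopt the perspective of mainstream US culture and treat other, non-mainstream cultures as outsiders. The authors build the CultureLens benchmark and propose a prompt-based Fairness Intervention Pillars (FIP) method and an agent-based Mitigation via Fairness Agents (MFA) framework; experiments show that MFA is markedly effective at reducing cultural bias.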

Details Motivation: Large language models tend to marginalize non-mainstream cultures in their generations and do not represent diverse cultural perspectives fairly, so this culture positioning bias needs to be identified and quantified and effective mitigation strategies explored. Method: The authors propose the CultureLens benchmark, with 4000 generation prompts and 3 evaluation metrics, which measures bias through a culturally situated interview script generation task; they design two inference-time mitigation methods, the prompt-based FIP method and the agent-based MFA framework (comprising a single-agent self-reflection and rewriting loop and a multi-agent pipeline with specialized roles). Result: Experiments on 5 state-of-the-art LLMs find that models adopt an insider perspective in over 88 percent of US-contexted scripts but mostly take an outsider perspective for non-mainstream cultures; the proposed MFA method, especially the multi-agent version MFA-MA, is the most effective at reducing cultural bias and clearly outperforms the baseline. Conclusion: Large language models exhibit substantial culture positioning bias that deserves attention; structured, agent-based fairness frameworks such as MFA are an effective and promising route to mitigation and can improve the cultural fairness of generated content. Abstract: Large language models (LLMs) have unlocked a wide range of downstream generative applications. However, we found that they also risk perpetuating subtle fairness issues tied to culture, positioning their generations from the perspectives of the mainstream US culture while demonstrating salient externality towards non-mainstream ones. In this work, we identify and systematically investigate this novel culture positioning bias, in which an LLM's default generative stance aligns with a mainstream view and treats other cultures as outsiders. We propose the CultureLens benchmark with 4000 generation prompts and 3 evaluation metrics for quantifying this bias through the lens of a culturally situated interview script generation task, in which an LLM is positioned as an onsite reporter interviewing local people across 10 diverse cultures. Empirical evaluation on 5 state-of-the-art LLMs reveals a stark pattern: while models adopt insider tones in over 88 percent of US-contexted scripts on average, they disproportionately adopt mainly outsider stances for less dominant cultures. To resolve these biases, we propose 2 inference-time mitigation methods: a baseline prompt-based Fairness Intervention Pillars (FIP) method, and a structured Mitigation via Fairness Agents (MFA) framework consisting of 2 pipelines: (1) MFA-SA (Single-Agent) introduces a self-reflection and rewriting loop based on fairness guidelines.
(2) MFA-MA (Multi-Agent) structures the process into a hierarchy of specialized agents: a Planner Agent (initial script generation), a Critique Agent (evaluates initial script against fairness pillars), and a Refinement Agent (incorporates feedback to produce a polished, unbiased script). Empirical results showcase the effectiveness of agent-based methods as a promising direction for mitigating biases in generative LLMs.

[45] PerHalluEval: Persian Hallucination Evaluation Benchmark for Large Language Models

Mohammad Hosseini,Kimia Hosseini,Shayan Bali,Zahra Zanjani,Saeedeh Momtazi

Main category: cs.CL

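TL;DR: PerHalluEval is the first dynamic hallucination evaluation benchmark for Persian. Built with a three-stage LLM-driven pipeline plus human validation, it is used to evaluate hallucination in 12 large language models on Persian tasks. Providing external knowledge partially mitigates the problem, and models trained specifically for Persian show no significant advantage.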

Details Motivation: To address hallucination by large language models in low-resource languages such as Persian, where the problem is widespread and hard to evaluate, and in particular the lack of a dedicated evaluation benchmark for Persian. Method: A three-stage LLM-driven pipeline, combined with human validation, generates plausible question-answering and summarization samples; the log probabilities of generated tokens are used to select the most believable hallucinated instances, and human annotators mark contexts specific to Persian culture. Result: Existing LLMs generally struggle to detect hallucinations in Persian; providing external knowledge such as the original document partially reduces hallucination; and models trained specifically for Persian do not differ significantly from other models. Conclusion: Current LLMs still face serious hallucination problems in Persian; more effective mitigation needs to combine external knowledge with cultural sensitivity, and whether a model was trained specifically for Persian is not a decisive factor. Abstract: Hallucination is a persistent issue affecting all large language models (LLMs), particularly within low-resource languages such as Persian. PerHalluEval (Persian Hallucination Evaluation) is the first dynamic hallucination evaluation benchmark tailored for the Persian language. Our benchmark leverages a three-stage LLM-driven pipeline, augmented with human validation, to generate plausible answers and summaries regarding QA and summarization tasks, focusing on detecting extrinsic and intrinsic hallucinations. Moreover, we used the log probabilities of generated tokens to select the most believable hallucinated instances. In addition, we engaged human annotators to highlight Persian-specific contexts in the QA dataset in order to evaluate LLMs' performance on content specifically related to Persian culture. Our evaluation of 12 LLMs, including open- and closed-source models using PerHalluEval, revealed that the models generally struggle in detecting hallucinated Persian text. We showed that providing external knowledge, i.e., the original document for the summarization task, could mitigate hallucination partially. Furthermore, there was no significant difference in terms of hallucination when comparing LLMs specifically trained for Persian with others.
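
Selecting the "most believable" hallucinated candidate from token log probabilities could look like the sketch below. The abstract states only that log probabilities of generated tokens were used; the length-normalized mean and the arg-max rule here are assumptions, and the candidate sentences and their log-probability values are invented for illustration.

```python
from typing import List, Sequence, Tuple

Candidate = Tuple[str, Sequence[float]]  # (generated text, per-token log probabilities)


def mean_logprob(token_logprobs: Sequence[float]) -> float:
    """Length-normalized log probability, so longer candidates are not penalized."""
    return sum(token_logprobs) / len(token_logprobs)


def most_believable(candidates: List[Candidate]) -> str:
    """Return the candidate text that the generator found most probable on average."""
    best_text, _ = max(candidates, key=lambda c: mean_logprob(c[1]))
    return best_text


# Hypothetical hallucinated answers with made-up token log probabilities.
candidates = [
    ("The poet Hafez was born in Isfahan.", [-0.2, -0.1, -0.3, -2.5]),
    ("The poet Hafez was born in Tabriz.", [-0.2, -0.4, -0.3, -0.5]),
]

selected = most_believable(candidates)
```

The intuition is that a fluent, high-probability falsehood is harder for a detector to spot than an implausible one, which makes it a more demanding test instance.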

[46] BESPOKE: Benchmark for Search-Augmented Large Language Model Personalization via Diagnostic Feedback

Hyunseo Kim,Sangam Lee,Kwangwook Seo,Dongha Lee

Main category: cs.CL

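TL;DR: This paper proposes BESPOKE, a realistic benchmark for evaluating personalization in search-augmented large language models, which uses authentic human chat and search histories together with fine-grained preference scores and feedback to assess personalization systematically.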

Details Motivation: Existing search-augmented LLMs still fall short of meeting diverse user needs: they do not recognize differing user intents or deliver information in the user's preferred form, and personalization in current systems has not been adequately evaluated. Method: The authors propose the BESPOKE benchmark, collecting authentic human chat and search histories and pairing them with fine-grained preference scores and diagnostic feedback; the dataset is built through long-term, deeply engaged human annotation and is then analyzed systematically. Result: BESPOKE reveals key requirements for effective personalization in information-seeking tasks and supports fine-grained evaluation of personalized search-augmented LLMs. Conclusion: BESPOKE provides a realistic and diagnostic benchmark for evaluating the personalization ability of search-augmented LLMs and advances systematic research in this area. Abstract: Search-augmented large language models (LLMs) have advanced information-seeking tasks by integrating retrieval into generation, reducing users' cognitive burden compared to traditional search systems. Yet they remain insufficient for fully addressing diverse user needs, which requires recognizing how the same query can reflect different intents across users and delivering information in preferred forms. While recent systems such as ChatGPT and Gemini attempt personalization by leveraging user histories, systematic evaluation of such personalization is under-explored. To address this gap, we propose BESPOKE, a realistic benchmark for evaluating personalization in search-augmented LLMs. BESPOKE is designed to be both realistic, by collecting authentic chat and search histories directly from humans, and diagnostic, by pairing responses with fine-grained preference scores and feedback. The benchmark is constructed through long-term, deeply engaged human annotation, where human annotators contributed their own histories, authored queries with detailed information needs, and evaluated responses with scores and diagnostic feedback. Leveraging BESPOKE, we conduct systematic analyses that reveal key requirements for effective personalization in information-seeking tasks, providing a foundation for fine-grained evaluation of personalized search-augmented LLMs. Our code and data are available at https://augustinlib.github.io/BESPOKE/.

[47] VoiceBBQ: Investigating Effect of Content and Acoustics in Social Bias of Spoken Language Model

Junhyuk Choi,Ro-hoon Oh,Jihwan Seol,Bugeun Kim

Main category: cs.CL

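TL;DR: VoiceBBQ is a spoken extension of the BBQ dataset for evaluating social bias in spoken language models. It separates two sources of bias, content and acoustics, and its usefulness is demonstrated by testing LLaMA-Omni and Qwen2-Audio.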

Details Motivation: Because social bias in speech can arise from both content and acoustics, a dataset that measures the two separately is needed for a fuller evaluation of bias in spoken language models. Method: Every BBQ context is converted into controlled voice conditions to build VoiceBBQ, which allows content bias and acoustic bias to be evaluated independently and yields accuracy, bias, and consistency scores. Result: Experiments on LLaMA-Omni and Qwen2-Audio show that LLaMA-Omni resists acoustic bias while amplifying gender and accent bias, whereas Qwen2-Audio substantially dampens these cues while preserving content fidelity. Conclusion: VoiceBBQ offers a compact, ready-to-use testbed for jointly diagnosing content and acoustic bias in spoken language models and supports more comprehensive bias evaluation and mitigation research. Abstract: We introduce VoiceBBQ, a spoken extension of BBQ (Bias Benchmark for Question Answering), a dataset that measures social bias by presenting ambiguous or disambiguated contexts followed by questions that may elicit stereotypical responses. Due to the nature of speech, social bias in Spoken Language Models (SLMs) can emerge from two distinct sources: 1) content aspect and 2) acoustic aspect. The dataset converts every BBQ context into controlled voice conditions, enabling per-axis accuracy, bias, and consistency scores that remain comparable to the original text benchmark. Using VoiceBBQ, we evaluate two SLMs, LLaMA-Omni and Qwen2-Audio, and observe architectural contrasts: LLaMA-Omni resists acoustic bias while amplifying gender and accent bias, whereas Qwen2-Audio substantially dampens these cues while preserving content fidelity. VoiceBBQ thus provides a compact, drop-in testbed for jointly diagnosing content and acoustic bias across spoken language models.

[48] Acoustic-based Gender Differentiation in Speech-aware Language Models

Junhyuk Choi,Jihwan Seol,Nayeon Kim,Chanhee Cho,EunBin Cho,Bugeun Kim

Main category: cs.CL

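TL;DR: This paper studies potential gender bias in speech-aware language models (SpeechLMs) when they process voices of different genders. Although overall responses appear not to differ, there is a systematic bias: models lean male on gender-stereotypical questions yet ignore gender where differentiation would be appropriate. This paradoxical pattern stems mainly from male-oriented acoustic tokens produced by the Whisper speech encoder.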

Details Motivation: To examine whether SpeechLMs are biased by the gender of a voice, in particular whether identical questions receive different responses depending on the speaker's gender, and to analyze the source and mechanism of any such bias. Method: The authors build a new dataset of 9,208 speech samples covering Gender-Independent, Gender-Stereotypical, and Gender-Dependent questions, evaluate the LLaMA-Omni series, and compare each SpeechLM with its backbone LLM to locate the source of the bias. Result: Models consistently give male-oriented responses to gender-stereotypical questions, yet ignore gender in contexts where differentiation is appropriate; the pattern is not caused by neutral options or by the perceived gender of the voice, and it persists after gender neutralization is applied; the comparison shows that the bias comes mainly from male-oriented acoustic tokens generated by the Whisper speech encoder. Conclusion: Although current SpeechLMs aim for general fairness, they handle gender information in a contradictory way and fail to use gender context properly, so more refined techniques are needed for genuinely fair speech interaction systems. Abstract: Speech-aware Language Models (SpeechLMs) have fundamentally transformed human-AI interaction by enabling voice-based communication, yet they may exhibit acoustic-based gender differentiation where identical questions lead to different responses based on the speaker's gender. This paper proposes a new dataset that enables systematic analysis of this phenomenon, containing 9,208 speech samples across three categories: Gender-Independent, Gender-Stereotypical, and Gender-Dependent. We further evaluated the LLaMA-Omni series and discovered a paradoxical pattern: while overall responses seem identical regardless of gender, the pattern is far from unbiased. Specifically, in Gender-Stereotypical questions, all models consistently exhibited male-oriented responses; meanwhile, in Gender-Dependent questions where gender differentiation would be contextually appropriate, models exhibited responses independent of gender instead. We also confirm that this pattern does not result from neutral options nor from the perceived gender of a voice. When we allow neutral responses, models tend to respond neutrally in Gender-Dependent questions as well. The paradoxical pattern persists even when we apply gender neutralization methods to the speech. Through comparison between SpeechLMs and their corresponding backbone LLMs, we confirmed that these paradoxical patterns primarily stem from Whisper speech encoders, which generate male-oriented acoustic tokens.
These findings reveal that current SpeechLMs may not successfully remove gender biases though they prioritized general fairness principles over contextual appropriateness, highlighting the need for more sophisticated techniques to utilize gender information properly in speech technology.

[49] AutoIntent: AutoML for Text Classification

Ilya Alekseev,Roman Solomatin,Darina Rustamova,Denis Kuznetsov

Main category: cs.CL

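TL;DR: AutoIntent is an automated machine learning tool for text classification that provides end-to-end automation, including embedding model selection, classifier optimization, and decision threshold tuning.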

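Details Motivation: Existing AutoML tools lack end-to-end automation for text classification, and are especially weak in multi-label classification and out-of-scope detection. Method: AutoIntent uses a modular design with an sklearn-like interface and supports embedding model selection, classifier optimization, and decision threshold tuning for multi-label classification and out-of-scope detection. Result: On standard intent classification datasets, AutoIntent outperforms existing AutoML tools and lets users balance effectiveness against resource consumption. Conclusion: AutoIntent offers an efficient, flexible, and easy-to-use automated solution for text classification, and is particularly suited to applications that need multi-label classification and out-of-scope detection. Abstract: AutoIntent is an automated machine learning tool for text classification tasks. Unlike existing solutions, AutoIntent offers end-to-end automation with embedding model selection, classifier optimization, and decision threshold tuning, all within a modular, sklearn-like interface. The framework is designed to support multi-label classification and out-of-scope detection. AutoIntent demonstrates superior performance compared to existing AutoML tools on standard intent classification datasets and enables users to balance effectiveness and resource consumption.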
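
Decision threshold tuning for out-of-scope detection, one of the steps the abstract lists, can be illustrated generically. The sketch below does not use or imitate the AutoIntent API: the function names, the `"oos"` label, the candidate grid, and the validation examples are all assumptions chosen to show the idea of picking the confidence cut-off that maximizes validation accuracy.

```python
from typing import Dict, List, Sequence, Tuple

Example = Tuple[Dict[str, float], str]  # (class probabilities, gold label)


def predict_with_threshold(probs: Dict[str, float], threshold: float,
                           oos_label: str = "oos") -> str:
    """Return the top class, or `oos_label` if its probability is below the threshold."""
    label, p = max(probs.items(), key=lambda kv: kv[1])
    return label if p >= threshold else oos_label


def tune_threshold(validation: List[Example], candidates: Sequence[float]) -> float:
    """Pick the candidate threshold with the best accuracy on the validation set."""
    def accuracy(t: float) -> float:
        hits = sum(predict_with_threshold(p, t) == gold for p, gold in validation)
        return hits / len(validation)
    return max(candidates, key=accuracy)


# Hypothetical validation set with two in-scope intents and out-of-scope queries.
VALIDATION: List[Example] = [
    ({"book_flight": 0.90, "cancel": 0.10}, "book_flight"),
    ({"book_flight": 0.55, "cancel": 0.45}, "oos"),
    ({"book_flight": 0.20, "cancel": 0.80}, "cancel"),
    ({"book_flight": 0.50, "cancel": 0.50}, "oos"),
]

best_threshold = tune_threshold(VALIDATION, [0.3, 0.5, 0.7, 0.9])
```

A threshold that is too low lets uncertain queries through as in-scope intents, while one that is too high rejects confident in-scope predictions, which is why the cut-off is tuned on held-out data rather than fixed in advance.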

[50] Retrieval over Classification: Integrating Relation Semantics for Multimodal Relation Extraction

Lei Hei,Tingjing Liao,Yingxin Pei,Yiyang Qi,Jiaqi Wang,Ruiting Li,Feiliang Ren

Main category: cs.CL

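TL;DR: Proposes ROC, a new framework that recasts multimodal relation extraction from a classification task into a semantics-driven retrieval task. It fuses entity type and positional information, expands relation labels into natural language descriptions, and aligns entity-relation pairs with contrastive learning, achieving state-of-the-art performance on the MNRE and MORE datasets.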

Details Motivation: Conventional multimodal relation extraction relies on a classification paradigm that represents relations as discrete labels, overlooking structural constraints such as entity types and positional cues and lacking the semantic expressiveness needed for fine-grained relation understanding. Method: The ROC framework (1) integrates entity type and positional information through a multimodal encoder, (2) uses a large language model to expand relation labels into natural language descriptions, and (3) aligns entity-relation pairs with contrastive learning based on semantic similarity. Result: ROC achieves state-of-the-art performance on the MNRE and MORE benchmark datasets and shows stronger robustness and interpretability. Conclusion: By turning multimodal relation extraction into a semantics-driven retrieval task, ROC improves the semantic expressiveness and structural soundness of relation understanding and offers a new paradigm for future work. Abstract: Relation extraction (RE) aims to identify semantic relations between entities in unstructured text. Although recent work extends traditional RE to multimodal scenarios, most approaches still adopt classification-based paradigms with fused multimodal features, representing relations as discrete labels. This paradigm has two significant limitations: (1) it overlooks structural constraints like entity types and positional cues, and (2) it lacks semantic expressiveness for fine-grained relation understanding. We propose Retrieval Over Classification (ROC), a novel framework that reformulates multimodal RE as a retrieval task driven by relation semantics. ROC integrates entity type and positional information through a multimodal encoder, expands relation labels into natural language descriptions using a large language model, and aligns entity-relation pairs via semantic similarity-based contrastive learning. Experiments show that our method achieves state-of-the-art performance on the benchmark datasets MNRE and MORE and exhibits stronger robustness and interpretability.
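
Treating relation extraction as retrieval, as this abstract describes, amounts to comparing an entity-pair embedding with embeddings of relation descriptions. The sketch below shows that idea with cosine similarity and an InfoNCE-style contrastive loss. It is an assumption-laden illustration rather than the ROC implementation: the three-dimensional vectors, the relation names, and the temperature are invented, and real embeddings would come from the multimodal and text encoders.

```python
import math
from typing import Dict, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def retrieve_relation(pair_vec: Sequence[float],
                      relation_vecs: Dict[str, Sequence[float]]) -> str:
    """Return the relation whose description embedding is closest to the entity pair."""
    return max(relation_vecs, key=lambda name: cosine(pair_vec, relation_vecs[name]))


def info_nce(pair_vec: Sequence[float], relation_vecs: Dict[str, Sequence[float]],
             gold: str, temperature: float = 0.1) -> float:
    """Contrastive loss: the gold description is the positive, all others are negatives."""
    logits = {name: cosine(pair_vec, vec) / temperature
              for name, vec in relation_vecs.items()}
    m = max(logits.values())
    log_z = m + math.log(sum(math.exp(v - m) for v in logits.values()))
    return log_z - logits[gold]


# Hypothetical embeddings of natural-language relation descriptions.
RELATIONS = {
    "member_of": [1.0, 0.0, 0.0],
    "located_in": [0.0, 1.0, 0.0],
    "no_relation": [0.0, 0.0, 1.0],
}
pair_embedding = [0.9, 0.1, 0.0]  # stand-in for an encoded (head, tail, image) triple

predicted = retrieve_relation(pair_embedding, RELATIONS)
loss_if_gold_is_member = info_nce(pair_embedding, RELATIONS, "member_of")
loss_if_gold_is_located = info_nce(pair_embedding, RELATIONS, "located_in")
```

Because relations are represented by description embeddings rather than by classifier output indices, a new relation can in principle be scored by embedding its description, without changing the output layer.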

[51] Learning the Wrong Lessons: Syntactic-Domain Spurious Correlations in Language Models

Chantal Shaib,Vinith M. Suriyakumar,Levent Sagun,Byron C. Wallace,Marzyeh Ghassemi

Main category: cs.CL

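TL;DR: This study examines the relationship between syntactic templates, domain, and semantics in large language models. It finds that spurious correlations between syntax and domain can lower model performance and affect safety, and recommends increasing syntactic diversity in training data to avoid them.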

Details Motivation: To understand the roles of syntax, domain, and semantics in task-instruction pairs, and to identify cases where models learn spurious syntax-domain correlations that degrade performance or create safety risks. Method: A synthetic training dataset is built to analyze the effect of syntactic-domain correlations; an evaluation framework is proposed to detect the phenomenon in trained models and is validated on a subset of FlanV2 and several mainstream models; and a case study on safety fine-tuning is conducted. Result: Syntactic-domain correlations lower the performance of OLMo-2 models on entity knowledge tasks (mean 0.51 +/- 0.06), and the phenomenon is detected in models including OLMo, Llama-4-Maverick, and GPT-4o; the case study shows that the correlation can be exploited to bypass safety refusals. Conclusion: Syntactic-domain correlations should be tested for explicitly, and training data should be syntactically diverse, especially within each domain, to prevent models from learning harmful spurious correlations. Abstract: For an LLM to correctly respond to an instruction it must understand both the semantics and the domain (i.e., subject area) of a given task-instruction pair. However, syntax can also convey implicit information. Recent work shows that syntactic templates--frequent sequences of Part-of-Speech (PoS) tags--are prevalent in training data and often appear in model outputs. In this work we characterize syntactic templates, domain, and semantics in task-instruction pairs. We identify cases of spurious correlations between syntax and domain, where models learn to associate a domain with syntax during training; this can sometimes override prompt semantics. Using a synthetic training dataset, we find that the syntactic-domain correlation can lower performance (mean 0.51 +/- 0.06) on entity knowledge tasks in OLMo-2 models (1B-13B). We introduce an evaluation framework to detect this phenomenon in trained models, and show that it occurs on a subset of the FlanV2 dataset in open (OLMo-2-7B; Llama-4-Maverick), and closed (GPT-4o) models. Finally, we present a case study on the implications for safety finetuning, showing that unintended syntactic-domain correlations can be used to bypass refusals in OLMo-2-7B Instruct and GPT-4o. Our findings highlight two needs: (1) to explicitly test for syntactic-domain correlations, and (2) to ensure syntactic diversity in training data, specifically within domains, to prevent such spurious correlations.
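
The abstract defines syntactic templates as frequent sequences of Part-of-Speech tags, which suggests a simple counting procedure over tagged text. The sketch below assumes the texts have already been tagged by some external tagger; the template length of 4, the minimum count of 2, the choice to count each template once per text, and the tiny tagged corpus are all illustrative assumptions rather than the paper's settings.

```python
from collections import Counter
from typing import Dict, List, Sequence, Tuple

Template = Tuple[str, ...]


def pos_ngrams(tags: Sequence[str], n: int) -> List[Template]:
    """All contiguous length-n tag sequences in one text."""
    return [tuple(tags[i:i + n]) for i in range(len(tags) - n + 1)]


def frequent_templates(tagged_corpus: List[Sequence[str]], n: int = 4,
                       min_count: int = 2) -> Dict[Template, int]:
    """Tag n-grams that occur in at least `min_count` texts (counted once per text)."""
    counts: Counter = Counter()
    for tags in tagged_corpus:
        counts.update(set(pos_ngrams(tags, n)))
    return {t: c for t, c in counts.items() if c >= min_count}


# Hypothetical Penn Treebank-style tag sequences for three instructions.
TAGGED = [
    ["WRB", "VBZ", "DT", "NN", "IN", "NNP"],  # e.g. "Where is the capital of France"
    ["WRB", "VBZ", "DT", "NN", "IN", "NN"],   # e.g. "Where is the source of energy"
    ["VB", "DT", "NN", "RB"],                 # e.g. "Summarize the article briefly"
]

templates = frequent_templates(TAGGED)
```

Grouping such counts by domain would then show whether a template is concentrated in one subject area, which is the kind of syntax-domain association the paper warns can become a spurious cue.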

[52] Who's Laughing Now? An Overview of Computational Humour Generation and Explanation

Tyler Loakman,William Thorne,Chenghua Lin

Main category: cs.CL

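TL;DR: This paper surveys computational humour as it relates to generating and explaining humour. Although understanding humour has the hallmarks of a foundational NLP task, work on generating and explaining humour beyond puns remains sparse and state-of-the-art models still fall well short of human ability; the survey stresses that future research should account for humour's subjective and ethically ambiguous nature.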

Details Motivation: Humour is an abstract, creative, and highly context-dependent human trait, and its computational understanding matters for assessing the common-sense knowledge and reasoning abilities of large language models, yet existing research on generating and explaining non-pun humour is very limited. Method: Through a literature survey, the paper systematically reviews the state of research on computational humour generation and explanation and analyzes how current large language models perform on these tasks and what challenges remain. Result: Research on humour generation and explanation is still concentrated on simple forms such as puns; understanding and generating more complex humour remains difficult, and state-of-the-art models perform well below human level. Conclusion: Computational humour deserves attention as an important subdiscipline of NLP; future research needs to account for the subjectivity and ethical issues of humour and to push toward more creative and context-sensitive models. Abstract: The creation and perception of humour is a fundamental human trait, positioning its computational understanding as one of the most challenging tasks in natural language processing (NLP). As an abstract, creative, and frequently context-dependent construct, humour requires extensive reasoning to understand and create, making it a pertinent task for assessing the common-sense knowledge and reasoning abilities of modern large language models (LLMs). In this work, we survey the landscape of computational humour as it pertains to the generative tasks of creation and explanation. We observe that, despite the task of understanding humour bearing all the hallmarks of a foundational NLP task, work on generating and explaining humour beyond puns remains sparse, while state-of-the-art models continue to fall short of human capabilities. We bookend our literature survey by motivating the importance of computational humour processing as a subdiscipline of NLP and presenting an extensive discussion of future directions for research in the area that takes into account the subjective and ethically ambiguous nature of humour.

[53] GEP: A GCG-Based method for extracting personally identifiable information from chatbots built on small language models

Jieli Zhu,Vi Ngoc-Nha Tran

Main category: cs.CL

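TL;DR: This paper studies personally identifiable information (PII) leakage in chatbots built on small language models (SLMs) for downstream tasks. It proposes GEP, a new PII extraction method based on greedy coordinate gradient (GCG), validates it in complex, realistic scenarios, and shows that it detects far more PII leakage than traditional template-based methods.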

Details Motivation: Although SLMs offer advantages in efficiency and performance, their risk of leaking PII in downstream applications has not been adequately studied, particularly for chatbots in sensitive domains such as healthcare. Method: The authors first finetune ChatBioGPT, built on the BioGPT architecture, using the medical datasets Alpaca and HealthCareMagic; they then propose GEP, a new GCG-based method for extracting PII from SLMs, and evaluate it under both fixed-template and free-style insertion. Result: GEP increases detected PII leakage by up to 60× over traditional template-based attacks and still reaches a PII leakage rate of up to 4.53% in the more complex free-style insertion setting. Conclusion: GEP is an effective and powerful PII leakage detection method that is especially suited to assessing privacy risk in SLMs, and it reveals potentially serious privacy leakage in current SLM-based medical dialogue systems. Abstract: Small language models (SLMs) become unprecedentedly appealing due to their approximately equivalent performance compared to large language models (LLMs) in certain fields with less energy and time consumption during training and inference. However, the personally identifiable information (PII) leakage of SLMs for downstream tasks has yet to be explored. In this study, we investigate the PII leakage of the chatbot based on SLM. We first finetune a new chatbot, i.e., ChatBioGPT based on the backbone of BioGPT using medical datasets Alpaca and HealthCareMagic. It shows a matchable performance in BERTscore compared with previous studies of ChatDoctor and ChatGPT. Based on this model, we prove that the previous template-based PII attacking methods cannot effectively extract the PII in the dataset for leakage detection under the SLM condition. We then propose GEP, which is a greedy coordinate gradient-based (GCG) method specifically designed for PII extraction. We conduct experimental studies of GEP and the results show an increment of up to 60× more leakage compared with the previous template-based methods. We further expand the capability of GEP in the case of a more complicated and realistic situation by conducting free-style insertion where the inserted PII in the dataset is in the form of various syntactic expressions instead of fixed templates, and GEP is still able to reveal a PII leakage rate of up to 4.53%.

[54] Eigen-1: Adaptive Multi-Agent Refinement with Monitor-Based RAG for Scientific Reasoning

Xiangru Tang,Wanghan Xu,Yujie Wang,Zijie Guo,Daniel Shao,Jiapeng Chen,Cixuan Zhang,Ziyi Wang,Lixin Zhang,Guancheng Wan,Wenlong Zhang,Lei Bai,Zhenfei Yin,Philip Torr,Hanrui Wang,Di Jin

Main category: cs.CL

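TL;DR: This paper proposes a unified framework combining implicit retrieval and structured collaboration to address two bottlenecks of large language models in scientific reasoning: explicit retrieval fragments reasoning, and multi-agent pipelines dilute strong solutions by averaging across candidates. The framework integrates external knowledge through a Monitor-based token-level retrieval module and improves solution quality with Hierarchical Solution Refinement (HSR) and Quality-Aware Iterative Reasoning (QAIR). It achieves the best results on several benchmarks, markedly raising accuracy while lowering resource consumption.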

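Details Motivation: Large language models have made progress on scientific reasoning but remain limited by the extra overhead of explicit retrieval and by multi-agent systems that dilute strong solutions through simple averaging over candidates. A more efficient, collaborative reasoning mechanism is needed to overcome these bottlenecks. Method: A unified framework with two core components: (1) a Monitor-based implicit retrieval module that fuses external knowledge at the token level with minimal disruption to the reasoning flow; and (2) a structured collaboration mechanism comprising Hierarchical Solution Refinement (HSR) and Quality-Aware Iterative Reasoning (QAIR), which designates candidates as anchors to be repaired by their peers and adapts refinement to solution quality. Result: The framework reaches 48.3% accuracy on Humanity's Last Exam Bio/Chem Gold, the highest reported to date, surpassing the strongest agent baseline by 13.4 points and leading frontier LLMs by up to 18.1 points, while cutting token usage by 53.5% and agent steps by 43.7%. Results on SuperGPQA and TRQA confirm robustness across domains. Error analysis shows that knowledge gaps and reasoning failures co-occur in over 85% of failure cases, and diversity analysis shows that retrieval tasks benefit from solution variety whereas reasoning tasks rely more on consensus. Conclusion: Implicit knowledge augmentation and structured refinement overcome the inefficiencies of explicit tool calls and uniform aggregation, offering a more efficient and accurate approach to scientific reasoning with large models. Abstract: Large language models (LLMs) have recently shown strong progress on scientific reasoning, yet two major bottlenecks remain. First, explicit retrieval fragments reasoning, imposing a hidden "tool tax" of extra tokens and steps. Second, multi-agent pipelines often dilute strong solutions by averaging across all candidates. We address these challenges with a unified framework that combines implicit retrieval and structured collaboration. At its foundation, a Monitor-based retrieval module operates at the token level, integrating external knowledge with minimal disruption to reasoning. On top of this substrate, Hierarchical Solution Refinement (HSR) iteratively designates each candidate as an anchor to be repaired by its peers, while Quality-Aware Iterative Reasoning (QAIR) adapts refinement to solution quality. On Humanity's Last Exam (HLE) Bio/Chem Gold, our framework achieves 48.3% accuracy -- the highest reported to date, surpassing the strongest agent baseline by 13.4 points and leading frontier LLMs by up to 18.1 points, while simultaneously reducing token usage by 53.5% and agent steps by 43.7%. Results on SuperGPQA and TRQA confirm robustness across domains. Error analysis shows that reasoning failures and knowledge gaps co-occur in over 85% of cases, while diversity analysis reveals a clear dichotomy: retrieval tasks benefit from solution variety, whereas reasoning tasks favor consensus.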
Together, these findings demonstrate how implicit augmentation and structured refinement overcome the inefficiencies of explicit tool use and uniform aggregation. Code is available at: https://github.com/tangxiangru/Eigen-1.

Xinzhe Xu,Liang Zhao,Hongshen Xu,Chen Chen

Main category: cs.CL

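TL;DR: This paper introduces CLaw, a benchmark designed specifically to evaluate large language models on Chinese legal knowledge and its application in reasoning.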

Details Motivation: Existing large language models are unreliable on legal text because they lack specialized training, and they struggle to cite and interpret Chinese statutes accurately. Method: The authors build a fine-grained corpus of 306 Chinese national statutes, segmented to the subparagraph level and annotated with historical revision times (64,849 entries), together with 254 case-based reasoning tasks derived from materials curated by China's Supreme Court. Result: Experiments show that most current large language models perform poorly at faithfully reproducing legal provisions, which undermines the credibility of their legal reasoning. Conclusion: Trustworthy legal reasoning requires accurate knowledge retrieval (for example via supervised fine-tuning or retrieval-augmented generation) combined with strong general reasoning; CLaw provides an important benchmark for research on domain-specific legal reasoning. Abstract: Large Language Models (LLMs) are increasingly tasked with analyzing legal texts and citing relevant statutes, yet their reliability is often compromised by general pre-training that ingests legal texts without specialized focus, obscuring the true depth of their legal knowledge. This paper introduces CLaw, a novel benchmark specifically engineered to meticulously evaluate LLMs on Chinese legal knowledge and its application in reasoning. CLaw comprises two key components: (1) a comprehensive, fine-grained corpus of all 306 Chinese national statutes, segmented to the subparagraph level and incorporating precise historical revision timesteps for rigorous recall evaluation (64,849 entries), and (2) a challenging set of 254 case-based reasoning instances derived from China Supreme Court curated materials to assess the practical application of legal knowledge. Our empirical evaluation reveals that most contemporary LLMs significantly struggle to faithfully reproduce legal provisions. As accurate retrieval and citation of legal provisions form the basis of legal reasoning, this deficiency critically undermines the reliability of their responses. We contend that achieving trustworthy legal reasoning in LLMs requires a robust synergy of accurate knowledge retrieval--potentially enhanced through supervised fine-tuning (SFT) or retrieval-augmented generation (RAG)--and strong general reasoning capabilities. This work provides an essential benchmark and critical insights for advancing domain-specific LLM reasoning, particularly within the complex legal sphere.

[56] SGMem: Sentence Graph Memory for Long-Term Conversational Agents

Yaxiong Wu,Yongyue Zhang,Sheng Liang,Yong Liu

Main category: cs.CL

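TL;DR: Proposes SGMem (Sentence Graph Memory), which represents dialogue as sentence-level graphs within chunked units and combines raw dialogue with generated memory to improve the accuracy and retrieval quality of large language models in long-term conversational question answering.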

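Details Motivation: Existing methods based on fact extraction or summarization struggle to organize and retrieve dialogue information across different granularities, and they handle poorly the management of long-term conversational memory that exceeds the context window of large language models. Method: The dialogue is split into chunked units; within each chunk a sentence-level graph captures associations across turn-, round-, and session-level contexts; and raw dialogue is combined with generated memory (such as summaries, facts, and insights) for memory storage and retrieval. Result: Experiments on LongMemEval and LoCoMo show that SGMem clearly outperforms strong baselines in long-term conversational question answering and improves answer accuracy. Conclusion: Through fine-grained graph representations and multi-level context modeling, SGMem improves how large language models use memory and generate responses in long conversations. Abstract: Long-term conversational agents require effective memory management to handle dialogue histories that exceed the context window of large language models (LLMs). Existing methods based on fact extraction or summarization reduce redundancy but struggle to organize and retrieve relevant information across different granularities of dialogue and generated memory. We introduce SGMem (Sentence Graph Memory), which represents dialogue as sentence-level graphs within chunked units, capturing associations across turn-, round-, and session-level contexts. By combining retrieved raw dialogue with generated memory such as summaries, facts and insights, SGMem supplies LLMs with coherent and relevant context for response generation. Experiments on LongMemEval and LoCoMo show that SGMem consistently improves accuracy and outperforms strong baselines in long-term conversational question answering.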

[57] Query-Centric Graph Retrieval Augmented Generation

Yaxiong Wu,Jianyuan Bo,Yongyue Zhang,Sheng Liang,Yong Liu

Main category: cs.CL

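TL;DR: QCG-RAG is a query-centric graph retrieval-augmented generation framework that, through query-centric graph construction with controllable granularity and a multi-hop retrieval mechanism, outperforms existing methods on multi-hop question answering.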

Details Motivation: Existing graph-based RAG methods face a granularity dilemma: fine-grained entity-level graphs are costly and lose context, while coarse document-level graphs fail to capture complex relations. Method: The QCG-RAG framework uses Doc2Query to build query-centric graphs and a tailored multi-hop retrieval mechanism to retrieve relevant text chunks. Result: Experiments on LiHuaWorld and MultiHop-RAG show that QCG-RAG consistently outperforms existing chunk-based and graph-based RAG methods in question answering accuracy. Conclusion: With query-centric graph construction at controllable granularity, QCG-RAG offers a new and effective paradigm for multi-hop reasoning. Abstract: Graph-based retrieval-augmented generation (RAG) enriches large language models (LLMs) with external knowledge for long-context understanding and multi-hop reasoning, but existing methods face a granularity dilemma: fine-grained entity-level graphs incur high token costs and lose context, while coarse document-level graphs fail to capture nuanced relations. We introduce QCG-RAG, a query-centric graph RAG framework that enables query-granular indexing and multi-hop chunk retrieval. Our query-centric approach leverages Doc2Query and Doc2Query-- to construct query-centric graphs with controllable granularity, improving graph quality and interpretability. A tailored multi-hop retrieval mechanism then selects relevant chunks via the generated queries. Experiments on LiHuaWorld and MultiHop-RAG show that QCG-RAG consistently outperforms prior chunk-based and graph-based RAG methods in question answering accuracy, establishing a new paradigm for multi-hop reasoning.

[58] Un-Doubling Diffusion: LLM-guided Disambiguation of Homonym Duplication

Evgeny Kaskov,Elizaveta Petrova,Petr Surovtsev,Anna Kostikova,Ilya Mistiurin,Alexander Kapitanov,Alexander Nagaev

Main category: cs.CL

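TL;DR: This paper proposes a method for measuring homonym duplication rates, evaluates different diffusion models with both vision-language models and human evaluation, and studies prompt expansion as a way to mitigate the problem.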

Details Motivation: Homonyms easily cause sense confusion in generative models. Under an Anglocentric bias in particular, non-English words may acquire additional ambiguity once translated into English, which harms the quality of text-to-image generation. Method: A method for quantifying homonym duplication, automatic evaluation with vision-language models, and prompt expansion as a mitigation technique. Result: Experiments show that the proposed prompt expansion effectively lowers the homonym duplication rate, including duplication caused by Anglocentric bias. Conclusion: Prompt expansion is an effective strategy for reducing homonym duplication in diffusion models and its negative effects in cross-lingual translation. Abstract: Homonyms are words with identical spelling but distinct meanings, which pose challenges for many generative models. When a homonym appears in a prompt, diffusion models may generate multiple senses of the word simultaneously, which is known as homonym duplication. This issue is further complicated by an Anglocentric bias, which includes an additional translation step before the text-to-image model pipeline. As a result, even words that are not homonymous in the original language may become homonyms and lose their meaning after translation into English. In this paper, we introduce a method for measuring duplication rates and conduct evaluations of different diffusion models using both automatic evaluation utilizing Vision-Language Models (VLM) and human evaluation. Additionally, we investigate methods to mitigate the homonym duplication problem through prompt expansion, demonstrating that this approach also effectively reduces duplication related to Anglocentric bias. The code for the automatic evaluation pipeline is publicly available.

[59] LLM Output Homogenization is Task Dependent

Shomik Jain,Jack Lanchantin,Maximilian Nickel,Karen Ullrich,Ashia Wilson,Jamelle Watson-Daniels

Main category: cs.CL

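TL;DR: This paper presents a task taxonomy and a task-anchored functional diversity measure to evaluate and mitigate output homogenization in large language models more accurately, and proposes a new sampling technique that increases functional diversity while maintaining response quality.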

Details Motivation: Prior work on output homogenization does not define diversity according to task type, which makes both evaluation and mitigation inaccurate. This paper aims to fill that gap. Method: A taxonomy of eight task categories, task-anchored functional diversity as an evaluation metric, and a task-anchored sampling technique designed to increase diversity. Result: The method increases functional diversity for tasks that call for it, preserves homogenization for tasks that call for consistency, and does not sacrifice generation quality. Conclusion: Accounting for task dependence improves both the evaluation and the mitigation of output homogenization, challenging the conventional view that diversity and quality trade off against each other. Abstract: A large language model can be less helpful if it exhibits output response homogenization. But whether two responses are considered homogeneous, and whether such homogenization is problematic, both depend on the task category. For instance, in objective math tasks, we often expect no variation in the final answer but anticipate variation in the problem-solving strategy. Whereas, for creative writing tasks, we may expect variation in key narrative components (e.g. plot, genre, setting, etc), beyond the vocabulary or embedding diversity produced by temperature-sampling. Previous work addressing output homogenization often fails to conceptualize diversity in a task-dependent way. We address this gap in the literature directly by making the following contributions. (1) We present a task taxonomy comprised of eight task categories that each have distinct conceptualizations of output homogenization. (2) We introduce task-anchored functional diversity to better evaluate output homogenization. (3) We propose a task-anchored sampling technique that increases functional diversity for task categories where homogenization is undesired, while preserving homogenization where it is desired. (4) We challenge the perceived existence of a diversity-quality trade-off by increasing functional diversity while maintaining response quality. Overall, we demonstrate how task dependence improves the evaluation and mitigation of output homogenization.

[60] LLMTrace: A Corpus for Classification and Fine-Grained Localization of AI-Written Text

Irina Tolstykh,Aleksandra Tsybina,Sergey Yakubson,Maksim Kuprashevich

Main category: cs.CL

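TL;DR: This paper introduces LLMTrace, a large-scale bilingual (English and Russian) corpus for AI-generated text detection that supports both full-text classification and character-level localization of AI-generated segments, addressing the outdated generators, single-language coverage, and missing mixed-authorship annotations of existing datasets.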

Details Motivation: Existing datasets for AI-generated text detection rely on outdated models, are predominantly English, and lack fine-grained annotation of text written jointly by humans and AI, all of which hold back the development of detection systems. Method: A bilingual corpus is built with a range of modern proprietary and open-source large language models and annotated at the character level, supporting two tasks: full-text binary classification and AI-generated interval detection. Result: LLMTrace supports finer-grained detection of AI-generated content, in particular the precise localization of AI-generated segments, and provides an important resource for the next generation of detection models. Conclusion: LLMTrace fills gaps in current detection datasets with respect to multilingual coverage, modern models, and fine-grained annotation, and is expected to advance more precise and practical detection techniques. Abstract: The widespread use of human-like text from Large Language Models (LLMs) necessitates the development of robust detection systems. However, progress is limited by a critical lack of suitable training data; existing datasets are often generated with outdated models, are predominantly in English, and fail to address the increasingly common scenario of mixed human-AI authorship. Crucially, while some datasets address mixed authorship, none provide the character-level annotations required for the precise localization of AI-generated segments within a text. To address these gaps, we introduce LLMTrace, a new large-scale, bilingual (English and Russian) corpus for AI-generated text detection. Constructed using a diverse range of modern proprietary and open-source LLMs, our dataset is designed to support two key tasks: traditional full-text binary classification (human vs. AI) and the novel task of AI-generated interval detection, facilitated by character-level annotations. We believe LLMTrace will serve as a vital resource for training and evaluating the next generation of more nuanced and practical AI detection models. The project page is available at https://sweetdream779.github.io/LLMTrace-info/ (iitolstykh/LLMTrace).
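
To make the interval detection task concrete, here is a minimal sketch of how character-level labels could be collapsed into AI-written intervals. The label encoding (1 for AI-written, 0 for human) and the function name `ai_intervals` are assumptions for illustration; the abstract does not describe the corpus's actual annotation schema.

```python
def ai_intervals(char_labels):
    """Collapse per-character labels (1 = AI-written, 0 = human) into
    half-open [start, end) intervals of AI-written text."""
    intervals = []
    start = None
    for i, label in enumerate(char_labels):
        if label == 1 and start is None:
            start = i                      # an AI-written run begins here
        elif label != 1 and start is not None:
            intervals.append((start, i))   # the run ended at i (exclusive)
            start = None
    if start is not None:                  # run extends to the end of the text
        intervals.append((start, len(char_labels)))
    return intervals

# Hypothetical mixed-authorship example: the first sentence is human-written.
text = "I wrote this. The model wrote this."
labels = [0] * 14 + [1] * (len(text) - 14)
spans = ai_intervals(labels)
```

Half-open intervals are used so that `text[start:end]` recovers each flagged segment directly.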

[61] Bounds of Chain-of-Thought Robustness: Reasoning Steps, Embed Norms, and Beyond

Dingzirui Wang,Xuanliang Zhang,Keyan Xu,Qingfu Zhu,Wanxiang Che,Yang Deng

Main category: cs.CL

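TL;DR: This paper theoretically analyzes how input perturbations affect fluctuations in Chain-of-Thought (CoT) outputs. It derives an upper bound on the perturbation, proves that the bound is positively correlated with the number of reasoning steps and that even infinitely long reasoning cannot fully eliminate the effect of perturbations, and further proves for the simplified Transformer model LSA that the bound is negatively correlated with the norms of the input embedding and hidden state vectors; experiments support the theory.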

Details Motivation: Existing research lacks a theoretical explanation of how input perturbations affect CoT outputs, which limits both the understanding of how perturbations propagate during reasoning and the improvement of prompt optimization methods. Method: The authors derive an upper bound on input perturbations under the condition that the output fluctuation stays within an acceptable range, then apply it to the Linear Self-Attention (LSA) model to analyze how the bound relates to model parameters. Result: The bound is positively correlated with the number of CoT reasoning steps; infinitely long reasoning still cannot eliminate the perturbation; in the LSA model the bound is negatively correlated with the norms of the input embedding and hidden state vectors; and the experimental results agree with the theoretical analysis. Conclusion: The influence of input perturbations on CoT outputs admits a theoretical bound; more reasoning steps improve robustness but cannot remove perturbations entirely, and in LSA the norms of the input and hidden state vectors play a key moderating role. Abstract: Existing research indicates that the output of Chain-of-Thought (CoT) is significantly affected by input perturbations. Although many methods aim to mitigate such impact by optimizing prompts, a theoretical explanation of how these perturbations influence CoT outputs remains an open area of research. This gap limits our in-depth understanding of how input perturbations propagate during the reasoning process and hinders further improvements in prompt optimization methods. Therefore, in this paper, we theoretically analyze the effect of input perturbations on the fluctuation of CoT outputs. We first derive an upper bound for input perturbations under the condition that the output fluctuation is within an acceptable range, based on which we prove that: (i) This upper bound is positively correlated with the number of reasoning steps in the CoT; (ii) Even an infinitely long reasoning process cannot eliminate the impact of input perturbations. We then apply these conclusions to the Linear Self-Attention (LSA) model, which can be viewed as a simplified version of the Transformer. For the LSA model, we prove that the upper bound for input perturbation is negatively correlated with the norms of the input embedding and hidden state vectors. To validate this theoretical analysis, we conduct experiments on three mainstream datasets and four mainstream models. The experimental results align with our theoretical analysis, empirically demonstrating the correctness of our findings.

[62] DisCoCLIP: A Distributional Compositional Tensor Network Encoder for Vision-Language Understanding

Kin Ian Lo,Hala Hawashin,Mina Abbaszadeh,Tilen Limback-Stokin,Hadi Wazni,Mehrnoosh Sadrzadeh

Main category: cs.CL

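TL;DR: DisCoCLIP pairs a frozen CLIP vision encoder with a tensor-network text encoder that explicitly models syntactic structure, substantially improving a vision-language model's grasp of word order and predicate-argument structure while remaining parameter-efficient.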

Details Motivation: Existing vision-language models excel at large-scale image-text alignment but often overlook the compositional structure of language, so they perform poorly on tasks that depend on word order and predicate-argument structure. Method: DisCoCLIP parses sentences with Combinatory Categorial Grammar, builds distributional word tensors, and reduces the parameter count of high-order tensors through tensor decompositions; it is combined with a frozen CLIP vision encoder and trained end-to-end with self-supervised contrastive learning. Result: Verb accuracy on SVO-Probes rises from 77.6% to 82.4%, ARO attribution and relation scores improve by over 9% and 4% respectively, and the model reaches 93.7% accuracy on the newly proposed SVO-Swap benchmark. Conclusion: Explicitly embedding linguistic structure via tensor networks yields interpretable, parameter-efficient representations that substantially strengthen compositional reasoning in vision-language tasks. Abstract: Recent vision-language models excel at large-scale image-text alignment but often neglect the compositional structure of language, leading to failures on tasks that hinge on word order and predicate-argument structure. We introduce DisCoCLIP, a multimodal encoder that combines a frozen CLIP vision transformer with a novel tensor network text encoder that explicitly encodes syntactic structure. Sentences are parsed with a Combinatory Categorial Grammar parser to yield distributional word tensors whose contractions mirror the sentence's grammatical derivation. To keep the model efficient, high-order tensors are factorized with tensor decompositions, reducing parameter count from tens of millions to under one million. Trained end-to-end with a self-supervised contrastive loss, DisCoCLIP markedly improves sensitivity to verb semantics and word order: it raises CLIP's SVO-Probes verb accuracy from 77.6% to 82.4%, boosts ARO attribution and relation scores by over 9% and 4%, and achieves 93.7% on a newly introduced SVO-Swap benchmark. These results demonstrate that embedding explicit linguistic structure via tensor networks yields interpretable, parameter-efficient representations that substantially improve compositional reasoning in vision-language tasks.

[63] The role of synthetic data in Multilingual, Multi-cultural AI systems: Lessons from Indic Languages

Pranjal A. Chitale,Varun Gumma,Sanchit Ahuja,Prashant Kodali,Manan Uppadhyay,Deepthi Sudharsan,Sunayana Sitaram

Main category: cs.CL

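TL;DR: Proposes a bottom-up generation strategy grounded in language-specific Wikipedia content and uses it to build Updesh, a large-scale synthetic instruction-following dataset with 9.5M data points across 13 Indian languages; a multi-dimensional evaluation confirms its quality and shows significant gains for low- and medium-resource languages.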

Details Motivation: To address the lack of cultural grounding in multilingual AI systems in low-resource settings and to explore how effective synthetic data is in multilingual, multicultural contexts. Method: Open-source large language models with at least 235B parameters generate culturally contextualized synthetic data bottom-up from each language's Wikipedia content, producing the Updesh dataset; quality is verified with a combination of automated metrics and human evaluation, and models are then fine-tuned and evaluated on 15 multilingual downstream tasks. Result: The generated data is of high quality; fine-tuned models improve significantly on generative tasks and remain competitive on multiple-choice NLU tasks, with the largest relative gains in low- and medium-resource languages, narrowing the gap with high-resource languages. Conclusion: Effective multilingual AI requires multi-faceted data curation and generation strategies that are context-aware and culturally grounded; bottom-up, culturally sensitive synthetic data is a key path for advancing AI in low-resource languages. Abstract: Developing AI systems that operate effectively across languages while remaining culturally grounded is a long-standing challenge, particularly in low-resource settings. Synthetic data provides a promising avenue, yet its effectiveness in multilingual and multicultural contexts remains underexplored. We investigate the creation and impact of synthetic, culturally contextualized datasets for Indian languages through a bottom-up generation strategy that prompts large open-source LLMs (>= 235B parameters) to ground data generation in language-specific Wikipedia content. This approach complements the dominant top-down paradigm of translating synthetic datasets from high-resource languages such as English. We introduce Updesh, a high-quality large-scale synthetic instruction-following dataset comprising 9.5M data points across 13 Indian languages, encompassing diverse reasoning and generative tasks with an emphasis on long-context, multi-turn capabilities, and alignment with Indian cultural contexts. A comprehensive evaluation incorporating both automated metrics and human annotation across 10k assessments indicates that generated data is high quality; though, human evaluation highlights areas for further improvement. Additionally, we perform downstream evaluations by fine-tuning models on our dataset and assessing the performance across 15 diverse multilingual datasets. Models trained on Updesh consistently achieve significant gains on generative tasks and remain competitive on multiple-choice style NLU tasks.
Notably, relative improvements are most pronounced in low and medium-resource languages, narrowing their gap with high-resource languages. These findings provide empirical evidence that effective multilingual AI requires multi-faceted data curation and generation strategies that incorporate context-aware, culturally grounded methodologies.

[64] Sycophancy Is Not One Thing: Causal Separation of Sycophantic Behaviors in LLMs

Daniel Vennemeyer,Phan Anh Duong,Tiffany Zhan,Tianyu Jiang

Main category: cs.CL

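TL;DR: This study decomposes sycophancy in large language models into sycophantic agreement and sycophantic praise and contrasts both with genuine agreement, finding that the three behaviors lie along distinct linear directions in latent space, can be steered independently, and are consistent across models.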

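Details Motivation: To determine whether sycophantic behavior in large language models arises from a single mechanism or from multiple independent processes. Method: Difference-in-means directions, activation additions, and subspace geometry are used to analyze the representations of sycophantic agreement, sycophantic praise, and genuine agreement across multiple models and datasets. Result: The three behaviors are encoded along distinct linear directions in latent space; each can be amplified or suppressed independently without affecting the others; and their representational structure is consistent across model families and scales. Conclusion: Sycophantic behaviors correspond to distinct, independently steerable representations, indicating that they stem from multiple independent mechanisms. Abstract: Large language models (LLMs) often exhibit sycophantic behaviors -- such as excessive agreement with or flattery of the user -- but it is unclear whether these behaviors arise from a single mechanism or multiple distinct processes. We decompose sycophancy into sycophantic agreement and sycophantic praise, contrasting both with genuine agreement. Using difference-in-means directions, activation additions, and subspace geometry across multiple models and datasets, we show that: (1) the three behaviors are encoded along distinct linear directions in latent space; (2) each behavior can be independently amplified or suppressed without affecting the others; and (3) their representational structure is consistent across model families and scales. These results suggest that sycophantic behaviors correspond to distinct, independently steerable representations.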
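
The two generic techniques named in the abstract, difference-in-means directions and activation addition, can be sketched on toy vectors as follows. The 2-d activations, the function names, and the scaling factor `alpha` are hypothetical; which layers, prompts, and scales the authors actually use is not stated in the abstract.

```python
def mean_vector(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def difference_in_means(positive, negative):
    """Direction pointing from the mean of `negative` to the mean of `positive`."""
    mu_pos, mu_neg = mean_vector(positive), mean_vector(negative)
    return [p - q for p, q in zip(mu_pos, mu_neg)]

def add_activation(hidden, direction, alpha):
    """Activation addition: shift a hidden state along `direction` by `alpha`."""
    return [h + alpha * d for h, d in zip(hidden, direction)]

# Toy 2-d activations (hypothetical): prompts with vs. without the behavior.
with_behavior = [[2.0, 0.0], [4.0, 0.0]]
without_behavior = [[0.0, 0.0], [2.0, 0.0]]
direction = difference_in_means(with_behavior, without_behavior)

# A negative alpha pushes the hidden state away from the behavior.
steered = add_activation([1.0, 1.0], direction, alpha=-1.0)
```

In this toy setup the direction lies entirely along the first coordinate, so steering leaves the second coordinate untouched, which loosely mirrors the claim that one behavior can be changed without affecting the others.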

[65] RLBFF: Binary Flexible Feedback to bridge between Human Feedback & Verifiable Rewards

Zhilin Wang,Jiaqi Zeng,Olivier Delalleau,Ellie Evans,Daniel Egert,Hoo-Chang Shin,Felipe Soares,Yi Dong,Oleksii Kuchaiev

Main category: cs.CL

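TL;DR: Proposes RLBFF, a new reinforcement learning paradigm that combines the flexibility of human feedback with the precision of rule-based verification, uses binary principles to improve the interpretability and performance of reward models, and delivers an open-source alignment recipe.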

Details Motivation: Existing RLHF methods suffer from poor interpretability and reward hacking because they lack explicit criteria, while RLVR is confined to correctness verification and cannot capture finer aspects of response quality. A method that combines the strengths of both is needed. Method: The RLBFF framework extracts principles that can be judged in a binary fashion (such as accuracy of information or code readability) from natural language feedback, frames reward model training as an entailment task (does the response satisfy a given principle or not), and lets users specify principles of interest at inference time to customize the reward model. Result: The trained reward models reach 86.2% on RM-Bench and 81.4% on JudgeBench (#1 on the leaderboard as of September 24, 2025) and outperform Bradley-Terry models when matched for data. A fully open-source recipe aligns Qwen3-32B to match or exceed o3-mini and DeepSeek R1 on MT-Bench, WildBench, and Arena Hard v2 at under 5% of the inference cost. Conclusion: By fusing human preferences with rule-based verification, RLBFF improves the flexibility, interpretability, and performance of reward models, supports customized evaluation, and offers an efficient, low-cost open-source path for aligning large models. Abstract: Reinforcement Learning with Human Feedback (RLHF) and Reinforcement Learning with Verifiable Rewards (RLVR) are the main RL paradigms used in LLM post-training, each offering distinct advantages. However, RLHF struggles with interpretability and reward hacking because it relies on human judgments that usually lack explicit criteria, whereas RLVR is limited in scope by its focus on correctness-based verifiers. We propose Reinforcement Learning with Binary Flexible Feedback (RLBFF), which combines the versatility of human-driven preferences with the precision of rule-based verification, enabling reward models to capture nuanced aspects of response quality beyond mere correctness. RLBFF extracts principles that can be answered in a binary fashion (e.g. accuracy of information: yes, or code readability: no) from natural language feedback. Such principles can then be used to ground Reward Model training as an entailment task (response satisfies or does not satisfy an arbitrary principle). We show that Reward Models trained in this manner can outperform Bradley-Terry models when matched for data and achieve top performance on RM-Bench (86.2%) and JudgeBench (81.4%, #1 on leaderboard as of September 24, 2025). Additionally, users can specify principles of interest at inference time to customize the focus of our reward models, in contrast to Bradley-Terry models.
Finally, we present a fully open source recipe (including data) to align Qwen3-32B using RLBFF and our Reward Model, to match or exceed the performance of o3-mini and DeepSeek R1 on general alignment benchmarks of MT-Bench, WildBench, and Arena Hard v2 (at <5% of the inference cost).

[66] SciReasoner: Laying the Scientific Reasoning Ground Across Disciplines

Yizhou Wang,Chen Tang,Han Deng,Jiabei Xiao,Jiaqi Liu,Jianyu Wu,Jun Yao,Pengze Li,Encheng Su,Lintao Wang,Guohang Zhuang,Yuchen Ren,Ben Fei,Ming Hu,Xin Chen,Dongzhan Zhou,Junjun He,Xiangyu Yue,Zhenfei Yin,Jiamin Wu,Qihao Zheng,Yuhao Zhou,Huihui Xu,Chenglong Ma,Yan Lu,Wenlong Zhang,Chunfeng Song,Philip Torr,Shixiang Tang,Xinzhu Ma,Wanli Ouyang,Lei Bai

Main category: cs.CL

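TL;DR: Presents a scientific reasoning foundation model that aligns natural language with scientific representations through multi-stage training, showing broad instruction coverage and strong cross-domain generalization across many tasks.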

Details Motivation: To build a foundation model that understands and generates scientific content, bridging the gap between natural language and heterogeneous scientific representations and improving cross-disciplinary reasoning and generalization. Method: The model is pretrained on 206B tokens of multimodal scientific data, then trained with supervised fine-tuning (SFT), annealed cold-start bootstrapping to elicit long-form chain-of-thought, and reinforcement learning with task-specific reward shaping to strengthen scientific reasoning. Result: It supports 103 tasks across five capability families, covering translation between text and scientific formats, information extraction, property prediction and classification, and sequence generation and design, with better generalization and fidelity than specialist systems. Conclusion: Cross-discipline learning markedly improves transfer and downstream reliability; the model, the instruction tuning datasets, and the evaluation code are open-sourced. Abstract: We present a scientific reasoning foundation model that aligns natural language with heterogeneous scientific representations. The model is pretrained on a 206B-token corpus spanning scientific text, pure sequences, and sequence-text pairs, then aligned via SFT on 40M instructions, annealed cold-start bootstrapping to elicit long-form chain-of-thought, and reinforcement learning with task-specific reward shaping, which instills deliberate scientific reasoning. It supports four capability families, covering up to 103 tasks across workflows: (i) faithful translation between text and scientific formats, (ii) text/knowledge extraction, (iii) property prediction, (iv) property classification, (v) unconditional and conditional sequence generation and design. Compared with specialist systems, our approach broadens instruction coverage, improves cross-domain generalization, and enhances fidelity. We detail data curation and training and show that cross-discipline learning strengthens transfer and downstream reliability. The model, instruct tuning datasets and the evaluation code are open-sourced at https://huggingface.co/SciReason and https://github.com/open-sciencelab/SciReason.

cs.CV [Back]

[67] Leveraging NTPs for Efficient Hallucination Detection in VLMs

Ofir Azachi,Kfir Eliyahu,Eyal El Ani,Rom Himelstein,Roi Reichart,Yuval Pinter,Nitay Calderon

Main category: cs.CV

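TL;DR: Proposes a lightweight hallucination detection method based on the next-token probabilities (NTPs) of vision-language models (VLMs): NTPs serve as a measure of model uncertainty and feed traditional machine learning models for efficient, on-the-fly detection, and combining them with linguistic NTPs and VLM prediction scores improves performance further.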

Details Motivation: VLMs can hallucinate text that is inconsistent with the visual content, which undermines their reliability; existing detection methods rely on the VLM itself or on other large models, incurring high compute cost and latency, so a more efficient detector is needed. Method: The next-token probabilities (NTPs) produced during VLM generation are extracted as uncertainty signals and used to train traditional machine learning models for hallucination detection; a dataset of 1,400 human-annotated samples is introduced to validate the method; and the features are further augmented with linguistic NTPs computed from the generated text alone and with the VLM's own hallucination prediction scores. Result: NTP features predict hallucinations effectively, and lightweight models match the performance of strong VLMs; adding linguistic NTPs improves detection; and integrating VLM prediction scores outperforms using either VLMs or NTPs alone. Conclusion: The lightweight NTP-based method detects VLM hallucinations efficiently, combining multiple signals improves performance further, and the approach offers a simple, practical way to make VLMs more reliable. Abstract: Hallucinations of vision-language models (VLMs), which are misalignments between visual content and generated text, undermine the reliability of VLMs. One common approach for detecting them employs the same VLM, or a different one, to assess generated outputs. This process is computationally intensive and increases model latency. In this paper, we explore an efficient on-the-fly method for hallucination detection by training traditional ML models over signals based on the VLM's next-token probabilities (NTPs). NTPs provide a direct quantification of model uncertainty. We hypothesize that high uncertainty (i.e., a low NTP value) is strongly associated with hallucinations. To test this, we introduce a dataset of 1,400 human-annotated statements derived from VLM-generated content, each labeled as hallucinated or not, and use it to test our NTP-based lightweight method. Our results demonstrate that NTP-based features are valuable predictors of hallucinations, enabling fast and simple ML models to achieve performance comparable to that of strong VLMs. Furthermore, augmenting these NTPs with linguistic NTPs, computed by feeding only the generated text back into the VLM, enhances hallucination detection performance. Finally, integrating hallucination prediction scores from VLMs into the NTP-based models led to better performance than using either VLMs or NTPs alone. We hope this study paves the way for simple, lightweight solutions that enhance the reliability of VLMs.
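
As a rough illustration of turning next-token probabilities into signals for a lightweight classifier, the sketch below summarizes the NTPs of one generated statement as a few scalar features. The specific features (`min_prob`, `mean_log_prob`, `perplexity`) and the function name `ntp_features` are assumptions; the abstract does not list the features or the ML models the authors use.

```python
import math

def ntp_features(token_probs):
    """Summarize a statement's next-token probabilities as simple features.

    `token_probs` holds the probability the model assigned to each generated
    token; every value must lie in (0, 1].
    """
    log_probs = [math.log(p) for p in token_probs]
    mean_log_prob = sum(log_probs) / len(log_probs)
    return {
        "min_prob": min(token_probs),          # least confident token
        "mean_log_prob": mean_log_prob,        # average confidence (log scale)
        "perplexity": math.exp(-mean_log_prob),
    }

# Hypothetical NTPs for two statements.
confident = ntp_features([0.9, 0.8, 0.95])
uncertain = ntp_features([0.9, 0.1, 0.6])
```

Feature dictionaries like these could then be fed to any standard classifier trained on statements labeled as hallucinated or not; under the paper's hypothesis, lower-probability statements such as `uncertain` would more often carry the hallucinated label.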

[68] Quasi-Synthetic Riemannian Data Generation for Writer-Independent Offline Signature Verification

Elias N. Zois,Moises Diaz,Salem Said,Miguel A. Ferrer

Main category: cs.CV

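TL;DR: Proposes a quasi-synthetic data generation framework based on Riemannian geometry for offline handwritten signature verification; it generates positive and negative samples from a Gaussian mixture model over SPD matrices and is validated on real-world datasets.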

Details Motivation: To address the shortage of real training data for writer-independent offline handwritten signature verification and to explore general modeling methods that do not depend on large amounts of real signature data. Method: Exploiting the Riemannian geometry of symmetric positive definite (SPD) matrices, a small set of genuine samples seeds a Riemannian Gaussian mixture model that identifies Riemannian centers representing different writers together with their variances; Gaussian sampling around each center then generates positive and negative synthetic SPD data, and a metric learning framework is trained on pairs of similar and dissimilar SPD samples. Result: Experiments on two real-world signature datasets covering Western and Asian writing styles show that the method achieves low error rates under both intra-dataset and cross-dataset evaluation protocols. Conclusion: The proposed quasi-synthetic data generation in Riemannian space effectively improves writer-independent signature verification and demonstrates the potential of generating synthetic data on Riemannian manifolds. Abstract: Offline handwritten signature verification remains a challenging task, particularly in writer-independent settings where models must generalize across unseen individuals. Recent developments have highlighted the advantage of geometrically inspired representations, such as covariance descriptors on Riemannian manifolds. However, past or present, handcrafted or data-driven methods usually depend on real-world signature datasets for classifier training. We introduce a quasi-synthetic data generation framework leveraging the Riemannian geometry of Symmetric Positive Definite matrices (SPD). A small set of genuine samples in the SPD space is the seed to a Riemannian Gaussian Mixture which identifies Riemannian centers as synthetic writers and variances as their properties. Riemannian Gaussian sampling on each center generates positive as well as negative synthetic SPD populations. A metric learning framework utilizes pairs of similar and dissimilar SPD points, subsequently testing it over on real-world datasets. Experiments conducted on two popular signature datasets, encompassing Western and Asian writing styles, demonstrate the efficacy of the proposed approach under both intra- and cross- dataset evaluation protocols. The results indicate that our quasi-synthetic approach achieves low error rates, highlighting the potential of generating synthetic data in Riemannian spaces for writer-independent signature verification systems.

[69] Seedream 4.0: Toward Next-generation Multimodal Image Generation

Team Seedream,Yunpeng Chen,Yu Gao,Lixue Gong,Meng Guo,Qiushan Guo,Zhiyao Guo,Xiaoxia Hou,Weilin Huang,Yixuan Huang,Xiaowen Jian,Huafeng Kuang,Zhichao Lai,Fanshi Li,Liang Li,Xiaochen Lian,Chao Liao,Liyang Liu,Wei Liu,Yanzuo Lu,Zhengxiong Luo,Tongtong Ou,Guang Shi,Yichun Shi,Shiqi Sun,Yu Tian,Zhi Tian,Peng Wang,Rui Wang,Xun Wang,Ye Wang,Guofeng Wu,Jie Wu,Wenxu Wu,Yonghui Wu,Xin Xia,Xuefeng Xiao,Shuang Xu,Xin Yan,Ceyuan Yang,Jianchao Yang,Zhonghua Zhai,Chenlin Zhang,Heng Zhang,Qi Zhang,Xinyu Zhang,Yuwei Zhang,Shijia Zhao,Wenliang Zhao,Wenjia Zhu

Main category: cs.CV

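TL;DR: Seedream 4.0 is an efficient, high-performance multimodal image generation system that unifies text-to-image generation, image editing, and multi-image composition. It supports fast generation of high-resolution images and reaches industry-leading results on a range of tasks.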

Details Motivation: Build a unified and efficient multimodal image generation framework that handles text-to-image generation, image editing, and multi-image composition together, improving generation efficiency and interactivity. Method: Proposes an efficient diffusion transformer and a powerful VAE that reduce the number of image tokens, combined with pretraining on large-scale text-image pairs, multimodal post-training, adversarial distillation, distribution matching, quantization, and speculative decoding. Result: Seedream 4.0 generates a 2K-resolution image within 1.8 seconds, reaches SOTA performance on text-to-image generation and multimodal image editing, and supports precise editing, in-context reasoning, and multi-image reference. Conclusion: Seedream 4.0 extends traditional text-to-image systems into a more interactive, multidimensional creative tool, pushing the boundary of generative AI in creative and professional applications. Abstract: We introduce Seedream 4.0, an efficient and high-performance multimodal image generation system that unifies text-to-image (T2I) synthesis, image editing, and multi-image composition within a single framework. We develop a highly efficient diffusion transformer with a powerful VAE that can also considerably reduce the number of image tokens. This allows for efficient training of our model and enables it to quickly generate native high-resolution images (e.g., 1K-4K). Seedream 4.0 is pretrained on billions of text-image pairs spanning diverse taxonomies and knowledge-centric concepts. Comprehensive data collection across hundreds of vertical scenarios, coupled with optimized strategies, ensures stable and large-scale training, with strong generalization. By incorporating a carefully fine-tuned VLM model, we perform multi-modal post-training for training both T2I and image editing tasks jointly. For inference acceleration, we integrate adversarial distillation, distribution matching, and quantization, as well as speculative decoding. It achieves an inference time of up to 1.8 seconds for generating a 2K image (without an LLM/VLM as PE model). Comprehensive evaluations reveal that Seedream 4.0 can achieve state-of-the-art results on both T2I and multimodal image editing. In particular, it demonstrates exceptional multimodal capabilities in complex tasks, including precise image editing and in-context reasoning, allows for multi-image reference, and can generate multiple output images.
This extends traditional T2I systems into a more interactive and multidimensional creative tool, pushing the boundary of generative AI for both creativity and professional applications. Seedream 4.0 is now accessible on https://www.volcengine.com/experience/ark?launch=seedream.

[70] A Contrastive Learning Framework for Breast Cancer Detection

Samia Saeed,Khuram Naveed

Main category: cs.CV

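TL;DR: This study proposes a semi-supervised framework based on contrastive learning that trains ResNet-50 on a small amount of labeled data and a large amount of unlabeled mammograms, using data augmentation to improve performance. It reports 96.7% accuracy on the INbreast and MIAS datasets, outperforming existing methods.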

Details Motivation: Deep learning for breast cancer detection is limited by the lack of large labeled datasets, which makes high accuracy hard to reach, so a method that performs well with few labeled samples is needed. Method: Uses a ResNet-50 network within a contrastive learning framework, trained in a semi-supervised manner on a large amount of unlabeled mammograms with various augmentation and transformation strategies, then fine-tuned on a small labeled dataset. Result: On the INbreast and MIAS benchmark datasets the method reaches 96.7% classification accuracy, exceeding existing state-of-the-art methods. Conclusion: The proposed contrastive-learning-based semi-supervised method makes effective use of unlabeled data and markedly improves breast cancer detection with small labeled datasets, showing potential for clinical use. Abstract: Breast cancer, the second leading cause of cancer-related deaths globally, accounts for a quarter of all cancer cases [1]. To lower this death rate, it is crucial to detect tumors early, as early-stage detection significantly improves treatment outcomes. Advances in non-invasive imaging techniques have made early detection possible through computer-aided detection (CAD) systems which rely on traditional image analysis to identify malignancies. However, there is a growing shift towards deep learning methods due to their superior effectiveness. Despite their potential, deep learning methods often struggle with accuracy due to the limited availability of large labeled datasets for training. To address this issue, our study introduces a Contrastive Learning (CL) framework, which excels with smaller labeled datasets. We train ResNet-50 in a semi-supervised CL approach using a similarity index on a large amount of unlabeled mammogram data, and we use various augmentations and transformations that help improve the performance of our approach. Finally, we fine-tune our model on a small set of labeled data, and the resulting model outperforms the existing state of the art. Specifically, we observed a 96.7% accuracy in detecting breast cancer on the benchmark datasets INbreast and MIAS.

[71] Are Foundation Models Ready for Industrial Defect Recognition? A Reality Check on Real-World Data

Simon Baeuerle,Pratik Khanna,Nils Friederich,Angelo Jovin Yamachui Sitcheu,Damir Shakirov,Andreas Steimer,Ralf Mikut

Main category: cs.CV

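TL;DR: This paper studies the use of foundation models (FMs) for industrial quality inspection and finds that, although these models perform well on public benchmark datasets, they consistently fail on real-world industrial image data.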

Details Motivation: Explore the potential of foundation models for industrial quality inspection in a label-free zero-shot setting, reducing the dependence on large labeled datasets and the cost of model deployment. Method: Tests several recent foundation models on custom real-world industrial image data and on public image data, evaluating their zero-shot anomaly detection performance. Result: All tested foundation models perform poorly on the real-world industrial data, even though they perform well on public benchmark datasets. Conclusion: Current foundation models are not yet suitable for industrial quality inspection in real-world settings; there is a significant gap between public data and actual industrial applications, and further research on solutions adapted to industrial environments is needed. Abstract: Foundation Models (FMs) have shown impressive performance on various text and image processing tasks. They can generalize across domains and datasets in a zero-shot setting. This could make them suitable for automated quality inspection during series manufacturing, where various types of images are being evaluated for many different products. Replacing tedious labeling tasks with a simple text prompt to describe anomalies and utilizing the same models across many products would save significant effort during model setup and implementation. This is a strong advantage over supervised Artificial Intelligence (AI) models, which are trained for individual applications and require labeled training data. We test multiple recent FMs on both custom real-world industrial image data and public image data. We show that all of those models fail on our real-world data, while the very same models perform well on public benchmark datasets.

[72] Shared Neural Space: Unified Precomputed Feature Encoding for Multi-Task and Cross Domain Vision

Jing Li,Oskar Bartosz,Chengyu Wang,Michal Wnuczynski,Dilshan Godaliyadda,Michael Polley

Main category: cs.CV

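TL;DR: Proposes a universal Neural Space (NS) in which an encoder-decoder framework precomputes features across vision and imaging tasks, so that multiple tasks share one feature space, improving efficiency and generalization.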

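Details Motivation: Existing AI models are customized for specific high-precision tasks and handle a series of modular tasks inefficiently, because each task requires a mapping into a different latent space, which causes redundancy. Method: Designs a lightweight CNN-based encoder-decoder framework that builds a universal Neural Space (NS); the encoder learns transformation-aware, generalizable representations so that multiple downstream tasks share the same feature space. Result: Shows that the NS runs efficiently across several imaging and vision tasks (such as demosaicing, denoising, depth estimation, and semantic segmentation), reducing redundancy, improving cross-domain generalization, and supporting deployment on a wider range of hardware. Conclusion: The universal Neural Space architecture provides a foundation for efficient multi-task vision systems, combining a lightweight design with high performance, and suits diverse application scenarios. Abstract: The majority of AI models in imaging and vision are customized to perform specific high-precision tasks. However, this strategy is inefficient for applications with a series of modular tasks, since each requires a mapping into a disparate latent domain. To address this inefficiency, we propose a universal Neural Space (NS), where an encoder-decoder framework pre-computes features across vision and imaging tasks. Our encoder learns transformation-aware, generalizable representations, which enable multiple downstream AI modules to share the same feature space. This architecture reduces redundancy, improves generalization across domain shift, and establishes a foundation for efficient multi-task vision pipelines. Furthermore, as opposed to larger transformer backbones, our backbone is lightweight and CNN-based, allowing for wider deployment across hardware. We further demonstrate that imaging and vision modules, such as demosaicing, denoising, depth estimation and semantic segmentation, can be performed efficiently in the NS.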

[73] Data-Efficient Stream-Based Active Distillation for Scalable Edge Model Deployment

Dani Manjah,Tim Bary,Benoît Gérin,Benoît Macq,Christophe de Vleeschouwer

Main category: cs.CV

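TL;DR: Proposes an image selection scheme that combines a high-confidence stream-based strategy with a diversity-based approach to maximize model quality for edge camera systems at low transmission cost.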

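Details Motivation: Edge camera systems must adapt to continually changing environments and need frequent model updates, but edge devices have limited compute, so training data must be selected efficiently to reduce transmission costs. Method: A complex teacher model on a central server annotates data, and a high-confidence stream-based strategy combined with a diversity-based approach selects the most useful images for training. Result: For a similar training load, the method obtains a high-quality model with very few dataset queries. Conclusion: An image selection strategy that combines high confidence with diversity can improve edge-device model performance while substantially reducing data transmission overhead. Abstract: Edge camera-based systems are continuously expanding, facing ever-evolving environments that require regular model updates. In practice, complex teacher models are run on a central server to annotate data, which is then used to train smaller models tailored to the edge devices with limited computational power. This work explores how to select the most useful images for training to maximize model quality while keeping transmission costs low. Our work shows that, for a similar training load (i.e., iterations), a high-confidence stream-based strategy coupled with a diversity-based approach produces a high-quality model with minimal dataset queries.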
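
As a rough illustration of the selection idea in [73], the sketch below filters a stream of frames by teacher confidence and then keeps a frame only if it is sufficiently far from everything already selected. The feature vectors, both thresholds, and the function name `select_stream` are assumptions made for illustration; the abstract does not describe the authors' actual criteria.

```python
import math

def select_stream(stream, conf_threshold=0.8, min_distance=0.5):
    """Greedy one-pass selection over (features, teacher_confidence) pairs.

    A frame is kept when the teacher is confident about it AND its feature
    vector is at least `min_distance` away from every frame kept so far.
    Both thresholds are hypothetical values chosen for this toy example.
    """
    selected = []
    for features, confidence in stream:
        if confidence < conf_threshold:
            continue  # low-confidence pseudo-labels are skipped
        if all(math.dist(features, kept) >= min_distance for kept in selected):
            selected.append(features)
    return selected

frames = [
    ((0.0, 0.0), 0.95),  # kept: confident, nothing selected yet
    ((0.1, 0.0), 0.90),  # dropped: too close to the first frame
    ((1.0, 1.0), 0.60),  # dropped: teacher not confident
    ((1.0, 0.0), 0.85),  # kept: confident and far from the first frame
]
chosen = select_stream(frames)
```

Because the pass is single-shot and keeps only the selected features in memory, a filter of this shape could in principle run on the device before anything is transmitted.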

[74] InstructVTON: Optimal Auto-Masking and Natural-Language-Guided Interactive Style Control for Inpainting-Based Virtual Try-On

Julien Han,Shuwen Qiu,Qi Li,Xingzi Xu,Mehmet Saygin Seyfioglu,Kavosh Asadi,Karim Bouyarmane

Main category: cs.CV

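TL;DR: InstructVTON is an interactive virtual try-on system driven by natural-language instructions. By combining vision language models with image segmentation, it automatically generates binary masks and enables fine-grained, complex styling control over single or multiple garments, overcoming the limitations of conventional masking approaches while remaining compatible with existing virtual try-on models and reaching state-of-the-art results.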

Details Motivation: Conventional mask-based virtual try-on methods struggle to produce precise masks and cannot handle complex dressing scenarios (such as rolled-up sleeves), which limits ease of use and the flexibility of the generated results. A smarter, easier-to-use method for fine-grained garment styling control is therefore needed. Method: Formulates virtual try-on as an image-guided or image-conditioned inpainting task and uses a vision language model (VLM) together with an image segmentation model to generate binary masks automatically from user-provided images and free-text style instructions, with support for multiple rounds of image generation to achieve complex try-on scenarios. Result: InstructVTON automates mask generation without manual drawing by the user, supports complex styling control (such as sleeves rolled up), is compatible with existing virtual try-on models across a variety of scenarios, and achieves better generation results than existing methods. Conclusion: By combining a VLM with image segmentation, InstructVTON simplifies user operation, improves the controllability and applicability of virtual try-on systems, and shows superior performance under complex, fine-grained styling control. Abstract: We present InstructVTON, an instruction-following interactive virtual try-on system that allows fine-grained and complex styling control of the resulting generation, guided by natural language, on single or multiple garments. A computationally efficient and scalable formulation of virtual try-on formulates the problem as an image-guided or image-conditioned inpainting task. These inpainting-based virtual try-on models commonly use a binary mask to control the generation layout. Producing a mask that yields a desirable result is difficult, requires background knowledge, might be model dependent, and is in some cases impossible with the masking-based approach (e.g. trying on a long-sleeve shirt with "sleeves rolled up" styling on a person wearing a long-sleeve shirt with sleeves down, where the mask will necessarily cover the entire sleeve). InstructVTON leverages Vision Language Models (VLMs) and image segmentation models for automated binary mask generation. These masks are generated based on user-provided images and free-text style instructions. InstructVTON simplifies the end-user experience by removing the necessity of a precisely drawn mask, and by automating execution of multiple rounds of image generation for try-on scenarios that cannot be achieved with masking-based virtual try-on models alone. We show that InstructVTON is interoperable with existing virtual try-on models to achieve state-of-the-art results with styling control.

[75] Innovative Deep Learning Architecture for Enhanced Altered Fingerprint Recognition

Dana A Abdullah,Dana Rasul Hamad,Bishar Rasheed Ibrahim,Sirwan Abdulwahid Aula,Aso Khaleel Ameen,Sabat Salih Hamadamin

Main category: cs.CV

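TL;DR: This paper proposes DeepAFRNet, a deep-learning model for altered fingerprint recognition that extracts features with a VGG16 backbone and matches them by cosine similarity. It achieves high recognition accuracy on the real altered fingerprint dataset SOCOFing, is validated across difficulty levels, and highlights how strongly threshold selection affects biometric system performance.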

Details Motivation: Attackers may deliberately modify fingerprint ridges to evade detection, which challenges conventional fingerprint recognition systems, so a method that robustly recognizes altered fingerprints is needed to improve the security and reliability of biometric systems. Method: Proposes the DeepAFRNet model, which uses VGG16 as the feature-extraction backbone and matches fingerprints by computing the cosine similarity between embedding vectors. The model is trained and evaluated on the Real-Altered subset of the SOCOFing dataset, covering three alteration difficulty levels (Easy, Medium, Hard). Result: With strict thresholds, DeepAFRNet reaches accuracies of 96.7%, 98.76%, and 99.54% on the three difficulty levels; when the threshold is relaxed from 0.92 to 0.72, accuracy drops sharply to 7.86%, 27.05%, and 29.51%, which shows that threshold selection is critical. Conclusion: DeepAFRNet performs well on real altered fingerprint recognition, overcomes the reliance of prior work on synthetic data or limited verification protocols, and has potential for practical deployment in scenarios with high demands on security and recognition robustness. Abstract: Altered fingerprint recognition (AFR) is challenging for biometric verification in applications such as border control, forensics, and fiscal admission. Adversaries can deliberately modify ridge patterns to evade detection, so robust recognition of altered prints is essential. We present DeepAFRNet, a deep learning recognition model that matches and recognizes distorted fingerprint samples. The approach uses a VGG16 backbone to extract high-dimensional features and cosine similarity to compare embeddings. We evaluate on the SOCOFing Real-Altered subset with three difficulty levels (Easy, Medium, Hard). With strict thresholds, DeepAFRNet achieves accuracies of 96.7 percent, 98.76 percent, and 99.54 percent for the three levels. A threshold-sensitivity study shows that relaxing the threshold from 0.92 to 0.72 sharply degrades accuracy to 7.86 percent, 27.05 percent, and 29.51 percent, underscoring the importance of threshold selection in biometric systems. By using real altered samples and reporting per-level metrics, DeepAFRNet addresses limitations of prior work based on synthetic alterations or limited verification protocols, and indicates readiness for real-world deployments where both security and recognition resilience are critical.
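
The matching step in [75] compares embeddings by cosine similarity against a decision threshold. The sketch below shows that decision rule on toy 2-D vectors that stand in for VGG16 embeddings; the vectors and the helper names `cosine_similarity` and `is_match` are assumptions for illustration, and only the two threshold values (0.92 and 0.72) come from the abstract.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def is_match(a, b, threshold):
    """Declare a match when the similarity reaches the threshold."""
    return cosine_similarity(a, b) >= threshold

reference = [1.0, 0.0]
probe = [0.8, 0.6]  # toy embedding; similarity to `reference` is about 0.8

strict = is_match(reference, probe, threshold=0.92)   # pair is rejected
relaxed = is_match(reference, probe, threshold=0.72)  # same pair is accepted
```

The same pair flips from rejected to accepted as the threshold is lowered, which is one plausible reading of why the reported accuracy is so sensitive to the threshold: a looser threshold admits many more pairs, including non-matching ones.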

[76] Large Pre-Trained Models for Bimanual Manipulation in 3D

Hanna Yurchyk,Wei-Di Chang,Gregory Dudek,David Meger

Main category: cs.CV

TL;DR: Integrates attention maps from a pre-trained Vision Transformer into voxel representations to improve bimanual robotic manipulation.

Details Motivation: Use the attention mechanism of Vision Transformer models to strengthen a robot's perception and decision-making in bimanual manipulation tasks. Method: Extracts attention maps from the self-supervised ViT model DINOv2, interprets them as pixel-level saliency scores over RGB images, lifts them into a 3D voxel grid, and feeds them as semantic cues into a behavior cloning policy. Result: On the RLBench bimanual benchmark, the method yields an average absolute improvement of 8.2% and a relative gain of 21.9% over an existing state-of-the-art voxel-based policy. Conclusion: Attention-guided feature representations effectively improve voxel-based policies on complex bimanual manipulation tasks. Abstract: We investigate the integration of attention maps from a pre-trained Vision Transformer into voxel representations to enhance bimanual robotic manipulation. Specifically, we extract attention maps from DINOv2, a self-supervised ViT model, and interpret them as pixel-level saliency scores over RGB images. These maps are lifted into a 3D voxel grid, resulting in voxel-level semantic cues that are incorporated into a behavior cloning policy. When integrated into a state-of-the-art voxel-based policy, our attention-guided featurization yields an average absolute improvement of 8.2% and a relative gain of 21.9% across all tasks in the RLBench bimanual benchmark.
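
One way to picture the "lifting" step in [76] is to back-project each pixel with a depth value through a pinhole camera model and record its saliency in the voxel it lands in. The sketch below does exactly that and nothing more. The pinhole intrinsics, the max-pooling rule per voxel, and the function name `lift_saliency_to_voxels` are assumptions for illustration; the abstract does not say how the authors perform the lifting.

```python
def lift_saliency_to_voxels(pixels, fx, fy, cx, cy, voxel_size):
    """Map per-pixel saliency into a sparse voxel grid.

    pixels: iterable of (u, v, depth, saliency) tuples.
    Each pixel is back-projected with a pinhole model, and each voxel keeps
    the maximum saliency of the pixels that fall into it (an assumed rule).
    Returns a dict keyed by integer (i, j, k) voxel indices.
    """
    grid = {}
    for u, v, depth, saliency in pixels:
        x = (u - cx) * depth / fx
        y = (v - cy) * depth / fy
        z = depth
        key = (int(x // voxel_size), int(y // voxel_size), int(z // voxel_size))
        grid[key] = max(grid.get(key, 0.0), saliency)
    return grid

pixels = [
    (64, 64, 1.0, 0.2),   # principal point, 1 m away
    (65, 64, 1.0, 0.9),   # neighbouring pixel, lands in the same voxel
    (128, 64, 2.0, 0.5),  # farther away and to the right, different voxel
]
voxels = lift_saliency_to_voxels(
    pixels, fx=64.0, fy=64.0, cx=64.0, cy=64.0, voxel_size=0.5
)
```

A sparse dictionary keeps the example short; a dense grid with one saliency channel per voxel would be the more natural input for a voxel-based policy.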

[77] A Comparative Benchmark of Real-time Detectors for Blueberry Detection towards Precision Orchard Management

Xinyang Mu,Yuzhen Lu,Boyang Deng

Main category: cs.CV

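TL;DR: This study presents a comparative benchmark of real-time object detectors for blueberry detection, using a new dataset of 661 images with 85,879 labeled instances and evaluating 36 model variants from the YOLO and RT-DETR families. RT-DETRv2-X achieves the best mAP@50 of 93.6%, which rises to 94.8% after fine-tuning with semi-supervised learning.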

Details Motivation: Blueberry detection in natural environments is difficult because of lighting changes, occlusions, and motion blur; existing deep learning models lack sufficiently diverse data, and practical deployment requires balancing accuracy, speed, and memory use. Method: Builds a new blueberry detection dataset of 661 canopy images captured with smartphones and systematically evaluates 36 model variants from the YOLO (v8-v12) and RT-DETR (v1-v2) families; the Unbiased Mean Teacher framework is then used for semi-supervised fine-tuning on 1,035 unlabeled images. Result: YOLOv12m reaches 93.3% mAP@50 and RT-DETRv2-X performs best at 93.6% mAP@50, rising to 94.8% after fine-tuning; mid-sized models balance accuracy and inference speed well; some models lose accuracy after fine-tuning, and the largest gain is 2.9%. Conclusion: RT-DETRv2-X performs best on blueberry detection, and semi-supervised learning can improve it further; the publicly released dataset and code support research in agricultural vision, and more work is needed on making effective use of cross-domain unlabeled data. Abstract: Blueberry detection in natural environments remains challenging because of variable lighting, occlusions, and motion blur caused by environmental factors and imaging devices. Deep learning-based object detectors promise to address these challenges, but they demand a large-scale, diverse dataset that captures the real-world complexities. Moreover, deploying these models in practical scenarios often requires the right accuracy/speed/memory trade-off in model selection. This study presents a novel comparative benchmark analysis of advanced real-time object detectors, including YOLO (You Only Look Once) (v8-v12) and RT-DETR (Real-Time Detection Transformers) (v1-v2) families, consisting of 36 model variants, evaluated on a newly curated dataset for blueberry detection. This dataset comprises 661 canopy images collected with smartphones during the 2022-2023 seasons, consisting of 85,879 labelled instances (including 36,256 ripe and 49,623 unripe blueberries) across a wide range of lighting conditions, occlusions, and fruit maturity stages. Among the YOLO models, YOLOv12m achieved the best accuracy with a mAP@50 of 93.3%, while RT-DETRv2-X obtained a mAP@50 of 93.6%, the highest among all the RT-DETR variants. The inference time varied with the model scale and complexity, and the mid-sized models appeared to offer a good accuracy-speed balance.
To further enhance detection performance, all the models were fine-tuned using Unbiased Mean Teacher-based semi-supervised learning (SSL) on a separate set of 1,035 unlabeled images acquired by a ground-based machine vision platform in 2024. This resulted in accuracy changes ranging from -1.4% to +2.9%, with RT-DETRv2-X achieving the best mAP@50 of 94.8%. More in-depth research into SSL is needed to better leverage cross-domain unlabeled data. Both the dataset and software programs of this study are made publicly available to support further research.
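
Mean Teacher-style methods, on which the Unbiased Mean Teacher used in [77] builds, keep a teacher model whose weights are an exponential moving average (EMA) of the student's weights. The sketch below shows only that update on a flat list of weights. The decay value and the name `ema_update` are assumptions for illustration, and the rest of the semi-supervised pipeline (pseudo-labelling, augmentation, domain adaptation) is not shown.

```python
def ema_update(teacher, student, decay=0.99):
    """Return new teacher weights: decay * teacher + (1 - decay) * student."""
    return [decay * t + (1.0 - decay) * s for t, s in zip(teacher, student)]

teacher = [0.0, 1.0]
student = [1.0, 1.0]

# A small decay is used here so the drift is visible after a few steps.
for _ in range(3):
    teacher = ema_update(teacher, student, decay=0.5)
```

After three steps the first teacher weight has moved from 0.0 to 0.875, most of the way toward the student's 1.0, while the second weight, which already agreed with the student, is unchanged.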

[78] Region-of-Interest Augmentation for Mammography Classification under Patient-Level Cross-Validation

Farbod Bigdeli,Mohsen Mohammadagha,Ali Bigdeli

Main category: cs.CV

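TL;DR: Proposes a lightweight ROI augmentation strategy that, during training, uses random ROI crops drawn from a label-free bounding-box bank to improve breast cancer screening performance on the Mini-DDSM dataset.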

Details Motivation: Deep learning for mammogram interpretation is held back by low-resolution datasets and small sample sizes, which limit model performance. Method: Introduces a lightweight, training-only ROI augmentation that randomly samples ROIs from a precomputed, label-free bounding-box bank, with optional jitter to increase variability; evaluation uses strict patient-level cross-validation. Result: On Mini-DDSM, the best parameters (p_roi = 0.10, alpha = 0.10) give a slight gain in average ROC-AUC, while PR-AUC is flat or slightly lower; performance varies across folds; the training-efficiency metrics show good throughput and GPU memory use. Conclusion: A simple, data-centric ROI augmentation strategy can improve mammography classification in resource-constrained settings without additional annotation cost or changes to the network architecture. Abstract: Breast cancer screening with mammography remains central to early detection and mortality reduction. Deep learning has shown strong potential for automating mammogram interpretation, yet limited-resolution datasets and small sample sizes continue to restrict performance. We revisit the Mini-DDSM dataset (9,684 images; 2,414 patients) and introduce a lightweight region-of-interest (ROI) augmentation strategy. During training, full images are probabilistically replaced with random ROI crops sampled from a precomputed, label-free bounding-box bank, with optional jitter to increase variability. We evaluate under strict patient-level cross-validation and report ROC-AUC, PR-AUC, and training-time efficiency metrics (throughput and GPU memory). Because ROI augmentation is training-only, inference-time cost remains unchanged. On Mini-DDSM, ROI augmentation (best: p_roi = 0.10, alpha = 0.10) yields modest average ROC-AUC gains, with performance varying across folds; PR-AUC is flat to slightly lower. These results demonstrate that simple, data-centric ROI strategies can enhance mammography classification in constrained settings without requiring additional labels or architectural modifications.
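
The training-time replacement described in [78] can be sketched as a function that decides, per sample, which crop box to use. In the sketch below `p_roi` is the replacement probability from the abstract, while treating `alpha` as a jitter fraction of the box size is an assumption, since the abstract does not define it. The function name `roi_augment` and the box format `(x0, y0, x1, y1)` are likewise hypothetical.

```python
import random

def roi_augment(image_size, roi_bank, p_roi=0.10, alpha=0.10, rng=random):
    """Return the crop box (x0, y0, x1, y1) to use for one training sample.

    With probability p_roi a box is drawn from roi_bank and each edge is
    jittered by up to alpha times the box width or height (assumed meaning
    of alpha); otherwise the full image is used. The box is clamped to the
    image bounds.
    """
    width, height = image_size
    if not roi_bank or rng.random() >= p_roi:
        return (0, 0, width, height)
    x0, y0, x1, y1 = rng.choice(roi_bank)
    dx, dy = alpha * (x1 - x0), alpha * (y1 - y0)
    x0 = max(0, x0 + rng.uniform(-dx, dx))
    y0 = max(0, y0 + rng.uniform(-dy, dy))
    x1 = min(width, x1 + rng.uniform(-dx, dx))
    y1 = min(height, y1 + rng.uniform(-dy, dy))
    return (x0, y0, x1, y1)

rng = random.Random(0)              # seeded for reproducibility
bank = [(100, 100, 300, 300)]       # hypothetical precomputed box bank
full = roi_augment((512, 512), bank, p_roi=0.0, rng=rng)
crop = roi_augment((512, 512), bank, p_roi=1.0, alpha=0.10, rng=rng)
```

Since the function only runs inside the training data loader, inference sees full images exactly as before, which matches the abstract's point that inference-time cost is unchanged.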

[79] Reflect3r: Single-View 3D Stereo Reconstruction Aided by Mirror Reflections

Jing Wu,Zirui Wang,Iro Laina,Victor Adrian Prisacariu

Main category: cs.CV

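TL;DR: This paper proposes a method that uses mirror reflections to build a virtual stereo view within a single image. A physically valid virtual camera transformation enables virtual view generation in the pixel domain, and a symmetric-aware loss refines pose estimation, giving generalizable and robust 3D reconstruction from a single image that also extends to dynamic scenes containing mirrors.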

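Details Motivation: Mirror reflections are common in everyday environments but are often treated as visual clutter. A mirror image nevertheless contains both the real and the virtual viewpoint and therefore carries stereo information. The paper aims to exploit this to extract multi-view stereo information from a single image, simplifying the conventional multi-view capture process and improving the practicality and generalization of 3D reconstruction. Method: Treats the reflection as an auxiliary view and designs a transformation that yields a physically plausible virtual camera, synthesizing the virtual view directly in the pixel domain while staying consistent with the real imaging process; introduces a symmetric-aware loss that uses the geometric symmetry of mirrors to refine pose estimation; the framework extends naturally to dynamic scenes in which each frame contains a mirror, giving per-frame geometry recovery. Result: Extensive experiments on synthetic data (16 Blender scenes with ground-truth point clouds and camera poses) and real-world data confirm the effectiveness and robustness of the method for 3D reconstruction and pose estimation, achieving high-quality single-image stereo reconstruction. Conclusion: The paper uses mirror reflections to build a virtual stereo system and achieves multi-view stereo reconstruction from a single image, offering a new direction for reflection-based geometric reasoning with good generalization and practical potential. Abstract: Mirror reflections are common in everyday environments and can provide stereo information within a single capture, as the real and reflected virtual views are visible simultaneously. We exploit this property by treating the reflection as an auxiliary view and designing a transformation that constructs a physically valid virtual camera, allowing direct pixel-domain generation of the virtual view while adhering to the real-world imaging process. This enables a multi-view stereo setup from a single image, simplifying the imaging process and making it compatible with powerful feed-forward reconstruction models for generalizable and robust 3D reconstruction. To further exploit the geometric symmetry introduced by mirrors, we propose a symmetric-aware loss to refine pose estimation. Our framework also naturally extends to dynamic scenes, where each frame contains a mirror reflection, enabling efficient per-frame geometry recovery. For quantitative evaluation, we provide a fully customizable synthetic dataset of 16 Blender scenes, each with ground-truth point clouds and camera poses. Extensive experiments on real-world data and synthetic data are conducted to illustrate the effectiveness of our method.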
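
The geometric fact behind the virtual camera in [79] is that a planar mirror maps every 3D point to its reflection across the mirror plane, so the virtual camera centre is the reflection of the real one. The sketch below implements only that standard point reflection; the mirror placement and the name `reflect_point` are assumptions for illustration, and the authors' full transformation (including how the flipped handedness of the reflected view is handled) is not reproduced here.

```python
def reflect_point(point, normal, offset):
    """Reflect a 3D point across the plane {x : normal . x = offset}.

    `normal` must be a unit vector. Uses x' = x - 2 (n . x - d) n.
    """
    signed_dist = sum(p * n for p, n in zip(point, normal)) - offset
    return tuple(p - 2.0 * signed_dist * n for p, n in zip(point, normal))

# Hypothetical setup: the mirror is the plane z = 2 and the real camera
# sits at the origin looking toward it.
mirror_normal = (0.0, 0.0, 1.0)
mirror_offset = 2.0
real_camera = (0.0, 0.0, 0.0)

virtual_camera = reflect_point(real_camera, mirror_normal, mirror_offset)
```

The virtual camera ends up at z = 4, as far behind the mirror as the real camera is in front of it, which gives a stereo baseline of twice the camera-to-mirror distance. Reflecting a second time returns the original point.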

[80] Recov-Vision: Linking Street View Imagery and Vision-Language Models for Post-Disaster Recovery

Yiming Xiao,Archit Gupta,Miguel Esparza,Yu-Hsuan Ho,Antonia Sebastian,Hannah Weas,Rose Houck,Ali Mostafavi

Main category: cs.CV

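TL;DR: This paper proposes FacadeTrack, a street-level, language-guided framework for building-level occupancy assessment after disasters. By linking panoramic video to parcels and extracting interpretable attributes, it achieves high-precision occupancy decisions and supports auditable, scalable integration into emergency-management workflows.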

Details Motivation: Building occupancy information after a disaster is critical for resource allocation, safety inspections, and similar tasks, but existing methods trade coverage against detail, so a reliable method that combines street-view detail with parcel matching is needed. Method: Proposes the FacadeTrack framework, which matches street-view panoramic video to parcels, rectifies views to facades under language guidance, and extracts interpretable attributes that affect habitability (such as entry blockage and temporary coverings); two decision strategies are designed, a one-stage rule and a two-stage design that separates perception from reasoning. Result: Across two post-Hurricane Helene surveys, the two-stage approach reaches a precision of 0.927, a recall of 0.781, and an F1 score of 0.848, ahead of the one-stage baseline on recall and F1 (0.943 precision, 0.728 recall, 0.822 F1); the intermediate attributes and spatial diagnostics also help locate the sources of error. Conclusion: FacadeTrack provides an interpretable, auditable, and scalable approach to post-disaster occupancy assessment that combines street-view detail with geospatial analysis and suits integration into real emergency-management and geographic information systems. Abstract: Building-level occupancy after disasters is vital for triage, inspections, utility re-energization, and equitable resource allocation. Overhead imagery provides rapid coverage but often misses facade and access cues that determine habitability, while street-view imagery captures those details but is sparse and difficult to align with parcels. We present FacadeTrack, a street-level, language-guided framework that links panoramic video to parcels, rectifies views to facades, and elicits interpretable attributes (for example, entry blockage, temporary coverings, localized debris) that drive two decision strategies: a transparent one-stage rule and a two-stage design that separates perception from conservative reasoning. Evaluated across two post-Hurricane Helene surveys, the two-stage approach achieves a precision of 0.927, a recall of 0.781, and an F-1 score of 0.848, compared with the one-stage baseline at a precision of 0.943, a recall of 0.728, and an F-1 score of 0.822. Beyond accuracy, intermediate attributes and spatial diagnostics reveal where and why residual errors occur, enabling targeted quality control. The pipeline provides auditable, scalable occupancy assessments suitable for integration into geospatial and emergency-management workflows.
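
The F-1 scores reported in [80] can be checked directly from the stated precision and recall, since F-1 is their harmonic mean. The short sketch below recomputes both values; the function name `f1_score` is just a local helper.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2.0 * precision * recall / (precision + recall)

two_stage = round(f1_score(0.927, 0.781), 3)  # reported as 0.848
one_stage = round(f1_score(0.943, 0.728), 3)  # reported as 0.822
```

Both recomputed values agree with the abstract to three decimals. They also make the trade-off explicit: the one-stage rule is slightly more precise, but the two-stage design recovers enough recall to come out ahead on F-1.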

[81] Human Semantic Representations of Social Interactions from Moving Shapes

Yiling Yun,Hongjing Lu

Main category: cs.CV

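TL;DR: This study examines how semantic representations complement visual features when humans recognize social interactions shown by simple moving shapes. Two studies find that verb-based semantic embeddings best explain human similarity judgments, indicating that social perception reflects the semantic structure of social interactions.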

Details Motivation: Understand which semantic representations humans rely on, beyond visual features, when recognizing social interactions in simple dynamic shapes. Method: In Study 1, participants labeled animations according to their impression of the moving shapes; in Study 2, the representational geometry of 27 social interactions was measured through human similarity judgments and compared with model predictions based on visual features, labels, and semantic embeddings. Result: Human responses were distributed; semantic models (especially verb-based embeddings) best explained human similarity judgments and provided information complementary to visual features. Conclusion: Social perception in simple dynamic displays reflects the semantic structure of social interactions, with semantic representations bridging visual and abstract cognition. Abstract: Humans are social creatures who readily recognize various social interactions from simple displays of moving shapes. While previous research has often focused on visual features, we examine which semantic representations humans employ to complement visual features. In Study 1, we directly asked human participants to label the animations based on their impression of moving shapes. We found that human responses were distributed. In Study 2, we measured the representational geometry of 27 social interactions through human similarity judgments and compared it with model predictions based on visual features, labels, and semantic embeddings from animation descriptions. We found that semantic models provided complementary information to visual features in explaining human judgments. Among the semantic models, verb-based embeddings extracted from descriptions account for human similarity judgments the best. These results suggest that social perception in simple displays reflects the semantic structure of social interactions, bridging visual and abstract representations.

[82] Enhancing Cross-View Geo-Localization Generalization via Global-Local Consistency and Geometric Equivariance

Xiaowei Wang,Di Wang,Ke Li,Yifeng Wang,Chengjian Wang,Libin Sun,Zhihong Wu,Yiming Zhang,Quan Wang

Main category: cs.CV

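TL;DR: Proposes EGS, a new cross-view geo-localization framework that improves cross-domain generalization with an E(2)-Steerable CNN and a graph with a virtual super-node, reaching state-of-the-art performance on University-1652 and SUES-200.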

Details Motivation: Existing methods are not robust enough to the large appearance changes caused by different UAV orientations and fields of view, and they struggle to model global semantics and local detail together, which limits cross-domain generalization. Method: Proposes the EGS framework: an E(2)-Steerable CNN extracts features that are robust to rotation and viewpoint changes, and a graph with a virtual super-node aggregates global semantics and redistributes them to local regions, enforcing global-local consistency. Result: Experiments on the University-1652 and SUES-200 benchmarks show that EGS clearly outperforms existing methods and sets a new state of the art in cross-domain CVGL. Conclusion: By strengthening feature stability and global-local consistency, EGS improves cross-domain generalization in cross-view geo-localization and offers a new solution for the field. Abstract: Cross-view geo-localization (CVGL) aims to match images of the same location captured from drastically different viewpoints. Despite recent progress, existing methods still face two key challenges: (1) achieving robustness under severe appearance variations induced by diverse UAV orientations and fields of view, which hinders cross-domain generalization, and (2) establishing reliable correspondences that capture both global scene-level semantics and fine-grained local details. In this paper, we propose EGS, a novel CVGL framework designed to enhance cross-domain generalization. Specifically, we introduce an E(2)-Steerable CNN encoder to extract stable and reliable features under rotation and viewpoint shifts. Furthermore, we construct a graph with a virtual super-node that connects to all local nodes, enabling global semantics to be aggregated and redistributed to local regions, thereby enforcing global-local consistency. Extensive experiments on the University-1652 and SUES-200 benchmarks demonstrate that EGS consistently achieves substantial performance gains and establishes a new state of the art in cross-domain CVGL.
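
The aggregate-then-redistribute role of the virtual super-node in [82] can be pictured as one message-passing step on a star graph. In the sketch below the super-node takes the mean of all local features and each local node is then blended with that global vector. Mean aggregation, the linear blend, the `mix` weight, and the name `super_node_pass` are all assumptions for illustration; the abstract does not specify the actual graph operations.

```python
def super_node_pass(local_features, mix=0.5):
    """One aggregate-then-redistribute step over a star graph.

    The virtual super-node receives the mean of all local feature vectors,
    then every local node is blended with that global vector using the
    (hypothetical) weight `mix`.
    """
    count = len(local_features)
    dim = len(local_features[0])
    global_feature = [
        sum(f[i] for f in local_features) / count for i in range(dim)
    ]
    updated = [
        [(1.0 - mix) * f[i] + mix * global_feature[i] for i in range(dim)]
        for f in local_features
    ]
    return global_feature, updated

local_nodes = [[0.0, 2.0], [4.0, 2.0]]
global_feature, updated = super_node_pass(local_nodes, mix=0.5)
```

After the step, the two local vectors have each moved halfway toward the shared global vector in the dimension where they disagreed, which is the sense in which such a step pulls local descriptors toward a common scene-level context.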

[83] DENet: Dual-Path Edge Network with Global-Local Attention for Infrared Small Target Detection

Jiayi Zuo,Songwei Pei,Qian Li

Main category: cs.CV

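TL;DR: Proposes a Dual-Path Edge Network for infrared small target detection that decouples edge enhancement from semantic modeling and improves the detection of faint targets across multiple scales.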

Details Motivation: Infrared small targets lack distinctive texture and morphological features and blend easily into complex backgrounds; existing methods struggle to combine high-resolution detail with strong semantic context, which leads to feature misalignment and degraded performance. Method: Designs a dual-path structure: one path uses a Bidirectional Interaction Module that combines local and global self-attention to capture multi-scale feature dependencies; the other introduces a Multi-Edge Refiner that performs multi-scale edge enhancement with cascaded Taylor finite difference operators and an attention-driven gating mechanism. Result: The method achieves better detection accuracy than existing methods on several datasets, suppressing noise effectively and precisely localizing small targets of different sizes. Conclusion: By fusing structural semantics with edge refinement, the dual-path framework markedly improves infrared small target detection in complex backgrounds and has good application prospects. Abstract: Infrared small target detection is crucial for remote sensing applications like disaster warning and maritime surveillance. However, due to the lack of distinctive texture and morphological features, infrared small targets are highly susceptible to blending into cluttered and noisy backgrounds. A fundamental challenge in designing deep models for this task lies in the inherent conflict between capturing high-resolution spatial details for minute targets and extracting robust semantic context for larger targets, often leading to feature misalignment and suboptimal performance. Existing methods often rely on fixed gradient operators or simplistic attention mechanisms, which are inadequate for accurately extracting target edges under low contrast and high noise. In this paper, we propose a novel Dual-Path Edge Network that explicitly addresses this challenge by decoupling edge enhancement and semantic modeling into two complementary processing paths. The first path employs a Bidirectional Interaction Module, which uses both Local Self-Attention and Global Self-Attention to capture multi-scale local and global feature dependencies. The global attention mechanism, based on a Transformer architecture, integrates long-range semantic relationships and contextual information, ensuring robust scene understanding. The second path introduces the Multi-Edge Refiner, which enhances fine-grained edge details using cascaded Taylor finite difference operators at multiple scales. This mathematical approach, along with an attention-driven gating mechanism, enables precise edge localization and feature enhancement for targets of varying sizes, while effectively suppressing noise.
Our method provides a promising solution for precise infrared small target detection and localization, combining structural semantics and edge refinement in a unified framework.
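
Finite difference operators derived from Taylor expansions, which [83] names as the basis of its Multi-Edge Refiner, approximate derivatives with small stencils whose accuracy grows with stencil width. The sketch below compares the standard second-order and fourth-order central differences on a 1-D row of samples. These are textbook stencils shown for orientation only; the abstract does not say which operators, orders, or scales the authors cascade.

```python
def central_diff_2nd_order(signal, i):
    """(f[i+1] - f[i-1]) / 2, with truncation error O(h^2) for unit spacing."""
    return (signal[i + 1] - signal[i - 1]) / 2.0

def central_diff_4th_order(signal, i):
    """(-f[i+2] + 8 f[i+1] - 8 f[i-1] + f[i-2]) / 12, with error O(h^4)."""
    return (
        -signal[i + 2] + 8.0 * signal[i + 1] - 8.0 * signal[i - 1] + signal[i - 2]
    ) / 12.0

# Samples of f(x) = x**3 at x = 0..6; the exact derivative at x = 3 is 27.
row = [float(x ** 3) for x in range(7)]
coarse = central_diff_2nd_order(row, 3)
fine = central_diff_4th_order(row, 3)
```

On this cubic the narrow stencil returns 28 while the wider one returns exactly 27, because the fourth-order stencil cancels the third-derivative term in the Taylor expansion. Applied along image rows and columns, stencils of this kind act as gradient (edge) filters.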

[84] Beyond the Individual: Introducing Group Intention Forecasting with SHOT Dataset

Ruixu Zhang,Yuran Wang,Xinyi Hu,Chaoyu Mai,Wenxuan Liu,Danni Xu,Xian Zhong,Zheng Wang

Main category: cs.CV

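TL;DR: This paper proposes the new task of Group Intention Forecasting (GIF) and introduces the SHOT dataset and the GIFT framework, which forecast the emergence of group intentions by analyzing individual actions and interactions.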

Details Motivation: Conventional intention recognition focuses on individual intentions and overlooks the complexity of collective intentions in group settings, so group intention and methods for forecasting it need to be studied. Method: Introduces the concept of group intention and the GIF task, builds the SHOT dataset with multi-view, multi-individual information, and designs the GIFT framework to extract fine-grained individual features and model evolving group dynamics. Result: Experiments confirm the effectiveness of the SHOT dataset and the GIFT framework for group intention forecasting, laying a foundation for related research. Conclusion: Group intention forecasting is a promising research direction, and SHOT and GIFT provide an important data resource and model basis for the field. Abstract: Intention recognition has traditionally focused on individual intentions, overlooking the complexities of collective intentions in group settings. To address this limitation, we introduce the concept of group intention, which represents shared goals emerging through the actions of multiple individuals, and Group Intention Forecasting (GIF), a novel task that forecasts when group intentions will occur by analyzing individual actions and interactions before the collective goal becomes apparent. To investigate GIF in a specific scenario, we propose SHOT, the first large-scale dataset for GIF, consisting of 1,979 basketball video clips captured from 5 camera views and annotated with 6 types of individual attributes. SHOT is designed with 3 key characteristics: multi-individual information, multi-view adaptability, and multi-level intention, making it well-suited for studying emerging group intentions. Furthermore, we introduce GIFT (Group Intention ForecasTer), a framework that extracts fine-grained individual features and models evolving group dynamics to forecast intention emergence. Experimental results confirm the effectiveness of SHOT and GIFT, establishing a strong foundation for future research in group intention forecasting. The dataset is available at https://xinyi-hu.github.io/SHOT_DATASET.

[85] Neptune-X: Active X-to-Maritime Generation for Universal Maritime Object Detection

Yu Guo,Shengfeng He,Yuxu Lu,Haonan An,Yihang Tao,Huilin Zhu,Jingxian Liu,Yuguang Fang

Main category: cs.CV

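TL;DR: This paper proposes Neptune-X, a data generation and selection framework for improving maritime object detection. The multi-modality-conditioned generative model X-to-Maritime synthesizes diverse, realistic maritime scenes, and an attribute-correlated active sampling strategy improves training effectiveness.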

Details Motivation: Annotated maritime data are scarce, and existing models generalize poorly across maritime attributes (such as object category, viewpoint, location, and imaging environment), particularly in underrepresented scenarios such as open-sea environments, so a more effective way to augment training data is needed. Method: Proposes the Neptune-X framework, comprising the X-to-Maritime generative model (with a Bidirectional Object-Water Attention module that improves boundary realism) and an Attribute-correlated Active Sampling strategy that dynamically selects task-relevant synthetic samples; also builds the Maritime Generation Dataset, the first benchmark dataset for generative maritime learning. Result: Experiments show that the method sets a new level of quality in maritime scene synthesis and markedly improves detection accuracy, especially in challenging and underrepresented scenarios. Conclusion: Through a data-centric generation-selection paradigm, Neptune-X alleviates data scarcity and poor generalization in maritime object detection and offers an efficient data augmentation solution for future maritime vision tasks. Abstract: Maritime object detection is essential for navigation safety, surveillance, and autonomous operations, yet it is constrained by two key challenges: the scarcity of annotated maritime data and poor generalization across various maritime attributes (e.g., object category, viewpoint, location, and imaging environment). In particular, models trained on existing datasets often underperform in underrepresented scenarios such as open-sea environments. To address these challenges, we propose Neptune-X, a data-centric generative-selection framework that enhances training effectiveness by leveraging synthetic data generation with task-aware sample selection. From the generation perspective, we develop X-to-Maritime, a multi-modality-conditioned generative model that synthesizes diverse and realistic maritime scenes. A key component is the Bidirectional Object-Water Attention module, which captures boundary interactions between objects and their aquatic surroundings to improve visual fidelity. To further improve downstream task performance, we propose Attribute-correlated Active Sampling, which dynamically selects synthetic samples based on their task relevance. To support robust benchmarking, we construct the Maritime Generation Dataset, the first dataset tailored for generative maritime learning, encompassing a wide range of semantic conditions.
Extensive experiments demonstrate that our approach sets a new benchmark in maritime scene synthesis, significantly improving detection accuracy, particularly in challenging and previously underrepresented settings.The code is available at https://github.com/gy65896/Neptune-X.

[86] AI-Enabled Crater-Based Navigation for Lunar Mapping

Sofia McLeod,Chee-Kheng Chng,Matthew Rodda,Tat-Jun Chin

Main category: cs.CV

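TL;DR: STELLA is the first end-to-end crater-based navigation (CBN) system for long-duration lunar mapping. It combines a Mask R-CNN detector, a descriptor-less identification module, and a robust pose solver, and its metre-level position accuracy and sub-degree attitude accuracy are validated on CRESENT-365, a newly built dataset that emulates a year-long mission.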

Details Motivation: Conventional CBN mainly targets landing missions and struggles with long-duration lunar mapping, where imagery is sparse, oblique, and captured under varying illumination, so a navigation system suited to these conditions is needed. Method: The STELLA system detects craters with Mask R-CNN, identifies them with a descriptor-less matching method, and combines a perspective-n-crater (PnC) pose solver with a batch orbit determination back-end to achieve high-accuracy pose estimation. Result: Experiments on the CRESENT+ and CRESENT-365 datasets show that STELLA maintains metre-level position accuracy and sub-degree attitude accuracy on average across viewing angles, illumination conditions, and latitudes. Conclusion: This is the first comprehensive assessment of CBN in a true lunar mapping setting; it demonstrates feasibility and informs the operational conditions of future missions. Abstract: Crater-Based Navigation (CBN) uses the ubiquitous impact craters of the Moon observed on images as natural landmarks to determine the six degrees of freedom pose of a spacecraft. To date, CBN has primarily been studied in the context of powered descent and landing. These missions are typically short in duration, with high-frequency imagery captured from a nadir viewpoint over well-lit terrain. In contrast, lunar mapping missions involve sparse, oblique imagery acquired under varying illumination conditions over potentially year-long campaigns, posing significantly greater challenges for pose estimation. We bridge this gap with STELLA - the first end-to-end CBN pipeline for long-duration lunar mapping. STELLA combines a Mask R-CNN-based crater detector, a descriptor-less crater identification module, a robust perspective-n-crater pose solver, and a batch orbit determination back-end. To rigorously test STELLA, we introduce CRESENT-365 - the first public dataset that emulates a year-long lunar mapping mission. Each of its 15,283 images is rendered from high-resolution digital elevation models with SPICE-derived Sun angles and Moon motion, delivering realistic global coverage, illumination cycles, and viewing geometries. Experiments on CRESENT+ and CRESENT-365 show that STELLA maintains metre-level position accuracy and sub-degree attitude accuracy on average across wide ranges of viewing angles, illumination conditions, and lunar latitudes. These results constitute the first comprehensive assessment of CBN in a true lunar mapping setting and inform operational conditions that should be considered for future missions.

[87] Seeing Through Words, Speaking Through Pixels: Deep Representational Alignment Between Vision and Language Models

Zoe Wanying He,Sean Trott,Meenakshi Khosla

Main category: cs.CV

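TL;DR: Although deep vision and language models are trained on separate modalities, their representational spaces are partially aligned. The alignment emerges in mid-to-late layers, is semantic in nature, agrees with human judgments, and is further strengthened by exemplar aggregation.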

Details Motivation: To investigate how unimodal models form a shared representational space without cross-modal training, and to understand where in the network alignment emerges, what semantic cues support it, whether it matches human preferences, and how exemplar aggregation affects it. Method: The study analyzes layer-wise representational alignment between vision and language models, tests robustness by altering semantics or appearance, uses a forced-choice "Pick-a-Pic" task to assess whether model judgments on many-to-many image-text matching agree with human preferences, and studies the effect of averaging exemplar embeddings. Result: Alignment is strongest in mid-to-late layers and depends on semantics rather than appearance; model preferences in image-text matching agree with human preferences, including when multiple captions correspond to one image; and averaging embeddings over multiple exemplars amplifies alignment rather than weakening it. Conclusion: Unimodal vision and language models spontaneously form a shared semantic code that agrees with human semantic judgments and is strengthened by exemplar aggregation. Abstract: Recent studies show that deep vision-only and language-only models, trained on disjoint modalities, nonetheless project their inputs into a partially aligned representational space. Yet we still lack a clear picture of where in each network this convergence emerges, what visual or linguistic cues support it, whether it captures human preferences in many-to-many image-text scenarios, and how aggregating exemplars of the same concept affects alignment. Here, we systematically investigate these questions. We find that alignment peaks in mid-to-late layers of both model types, reflecting a shift from modality-specific to conceptually shared representations. This alignment is robust to appearance-only changes but collapses when semantics are altered (e.g., object removal or word-order scrambling), highlighting that the shared code is truly semantic. Moving beyond the one-to-one image-caption paradigm, a forced-choice "Pick-a-Pic" task shows that human preferences for image-caption matches are mirrored in the embedding spaces across all vision-language model pairs. This pattern holds bidirectionally when multiple captions correspond to a single image, demonstrating that models capture fine-grained semantic distinctions akin to human judgments. Surprisingly, averaging embeddings across exemplars amplifies alignment rather than blurring detail. Together, our results demonstrate that unimodal networks converge on a shared semantic code that aligns with human judgments and strengthens with exemplar aggregation.
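
Illustration (not from the paper): the abstract does not say which alignment metric is used, so the sketch below assumes a simple representational-similarity score, namely the Pearson correlation between the pairwise cosine-similarity structure of a vision space and that of a text space. The function names and toy embeddings are hypothetical. `average_exemplars` shows the kind of exemplar aggregation the abstract refers to.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def pairwise_similarities(embeddings):
    """Cosine similarity for every unordered pair of concepts."""
    n = len(embeddings)
    return [cosine(embeddings[i], embeddings[j])
            for i in range(n) for j in range(i + 1, n)]


def pearson(xs, ys):
    """Pearson correlation between two equal-length sequences."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def alignment_score(vision_embs, text_embs):
    """Assumed metric: correlation of the two spaces' similarity structure.

    Row i of each list must describe the same concept; the two spaces
    may have different dimensionality.
    """
    return pearson(pairwise_similarities(vision_embs),
                   pairwise_similarities(text_embs))


def average_exemplars(exemplar_embs):
    """Aggregate several exemplar embeddings of one concept by averaging."""
    count = len(exemplar_embs)
    return [sum(column) / count for column in zip(*exemplar_embs)]


# Toy data: three concepts, a 2-D "vision" space and a 3-D "text" space.
vision = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
text = [[2.0, 0.0, 0.0], [0.0, 2.0, 0.0], [2.0, 2.0, 0.0]]
score = alignment_score(vision, text)
```

Because the score compares similarity structure rather than raw coordinates, the two spaces do not need to share dimensionality or scale, which is why it is a convenient stand-in here.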

[88] FreeInsert: Personalized Object Insertion with Geometric and Style Control

Yuhong Zhang,Han Wang,Yiwen Wang,Rong Xie,Li Song

Main category: cs.CV

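TL;DR: Proposes FreeInsert, a training-free framework that uses 3D geometric information to provide geometric control and style consistency when inserting objects into images.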

Details Motivation: Existing image editing methods lack geometric control and style consistency in personalized image composition tasks, and usually require extensive training. Method: The 2D object is converted to 3D, edited interactively at the 3D level, and re-rendered to a 2D image from a specified view; this is combined with diffusion adapters in a diffusion model to control geometry, style, and content. Result: The framework achieves training-free object insertion, supports geometric controls such as shape or view, and keeps the inserted object stylistically consistent with the background, producing more realistic edited images. Conclusion: FreeInsert addresses insufficient geometric control and style inconsistency in image editing, offering a flexible, high-quality approach to customized object insertion. Abstract: Text-to-image diffusion models have made significant progress in image generation, allowing for effortless customized generation. However, existing image editing methods still face certain limitations when dealing with personalized image composition tasks. First, there is the issue of lack of geometric control over the inserted objects. Current methods are confined to 2D space and typically rely on textual instructions, making it challenging to maintain precise geometric control over the objects. Second, there is the challenge of style consistency. Existing methods often overlook the style consistency between the inserted object and the background, resulting in a lack of realism. In addition, the challenge of inserting objects into images without extensive training remains significant. To address these issues, we propose FreeInsert, a novel training-free framework that customizes object insertion into arbitrary scenes by leveraging 3D geometric information. Benefiting from the advances in existing 3D generation models, we first convert the 2D object into 3D, perform interactive editing at the 3D level, and then re-render it into a 2D image from a specified view. This process introduces geometric controls such as shape or view. The rendered image, serving as geometric control, is combined with style and content control achieved through diffusion adapters, ultimately producing geometrically controlled, style-consistent edited images via the diffusion model.

[89] CusEnhancer: A Zero-Shot Scene and Controllability Enhancement Method for Photo Customization via ResInversion

Maoye Ren,Praneetha Vaddamanu,Jianjin Xu,Fernando De la Torre Frade

Main category: cs.CV

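TL;DR: This paper proposes CustomEnhancer, a zero-shot framework for enhancing existing identity customization models. It combines face swapping techniques with pretrained diffusion models, unifies generation and reconstruction through the triple-flow fused PerGeneration approach, and introduces ResInversion, which substantially reduces inversion time.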

Details Motivation: Existing text-to-image diffusion models suffer from degraded scenes, insufficient control, and inaccurate identity perception when generating realistic human photos, so a more efficient, training-free personalization method is needed. Method: The CustomEnhancer framework obtains additional representations in a zero-shot manner using face swapping and a pretrained diffusion model; the triple-flow fused PerGeneration approach combines two compatible counter-directional latent spaces to manipulate a pivotal space of the personalized model; and ResInversion performs noise rectification via a pre-diffusion mechanism, greatly improving inversion efficiency. Result: Experiments show that CustomEnhancer reaches SOTA results in scene diversity, identity fidelity, and training-free control, and that ResInversion reduces inversion time by a factor of 129 compared with NTI. Conclusion: CustomEnhancer achieves high-quality, efficient face personalization without additional training, offering a general, precise, and efficient enhancement scheme for existing models. Abstract: Recently, remarkable progress has been made in synthesizing realistic human photos using text-to-image diffusion models. However, current approaches face degraded scenes, insufficient control, and suboptimal perceptual identity. We introduce CustomEnhancer, a novel framework to augment existing identity customization models. CustomEnhancer is a zero-shot enhancement pipeline that leverages face swapping techniques and a pretrained diffusion model to obtain additional representations in a zero-shot manner for encoding into personalized models. Through our proposed triple-flow fused PerGeneration approach, which identifies and combines two compatible counter-directional latent spaces to manipulate a pivotal space of the personalized model, we unify the generation and reconstruction processes, realizing generation from three flows. Our pipeline also enables comprehensive training-free control over the generation process of personalized models, offering precise controlled personalization for them and eliminating the need for per-model controller retraining. Besides, to address the high time complexity of null-text inversion (NTI), we introduce ResInversion, a novel inversion method that performs noise rectification via a pre-diffusion mechanism, reducing the inversion time by a factor of 129. Experiments demonstrate that CustomEnhancer reaches SOTA results in scene diversity, identity fidelity, and training-free control, while also showing the efficiency of our ResInversion over NTI. The code will be made publicly available upon paper acceptance.

[90] CompressAI-Vision: Open-source software to evaluate compression methods for computer vision tasks

Hyomin Choi,Heeji Han,Chris Rosewarne,Fabien Racapé

Main category: cs.CV

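TL;DR: CompressAI-Vision is an evaluation platform for video compression optimized for vision tasks. It supports remote and split inference scenarios, has been adopted by MPEG for developing the Feature Coding for Machines (FCM) standard, and is open source.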

Details Motivation: With the rise of neural-network-based vision applications, a unified platform is needed to evaluate compression techniques optimized for downstream vision tasks. Method: The CompressAI-Vision platform integrates multiple coding tools and standard codecs and evaluates the trade-off between compression efficiency and task accuracy under different inference scenarios. Result: The platform shows the relationship between compression gain and task accuracy on several datasets and has been adopted by MPEG for developing the FCM standard. Conclusion: CompressAI-Vision provides an open, standardized evaluation framework for machine-vision-oriented compression and advances the development of related standards. Abstract: With the increasing use of neural network (NN)-based computer vision applications that process image and video data as input, interest has emerged in video compression technology optimized for computer vision tasks. In fact, given the variety of vision tasks, associated NN models and datasets, a consolidated platform is needed as a common ground to implement and evaluate compression methods optimized for downstream vision tasks. CompressAI-Vision is introduced as a comprehensive evaluation platform where new coding tools compete to efficiently compress the input of a vision network while retaining task accuracy in the context of two different inference scenarios: "remote" and "split" inferencing. Our study showcases various use cases of the evaluation platform incorporated with standard codecs (under development) by examining the compression gain on several datasets in terms of bit-rate versus task accuracy. This evaluation platform has been developed as open-source software and is adopted by the Moving Picture Experts Group (MPEG) for the development of the Feature Coding for Machines (FCM) standard. The software is available publicly at https://github.com/InterDigitalInc/CompressAI-Vision.

[91] Dual-supervised Asymmetric Co-training for Semi-supervised Medical Domain Generalization

Jincai Song,Haipeng Chen,Jun Qin,Na Zhao

Main category: cs.CV

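TL;DR: This paper proposes a dual-supervised asymmetric co-training (DAC) framework for cross-domain semi-supervised domain generalization (CD-SSDG), addressing medical image segmentation with limited annotations and domain shift.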

Details Motivation: Conventional SSDG methods assume that every source domain has both labeled and unlabeled data, a condition often unmet in practice. The paper targets the more realistic scenario in which domain shift also exists between the labeled and unlabeled data in the training set. Method: The DAC framework builds on the co-training paradigm with two sub-models that provide cross pseudo supervision, and adds feature-level supervision and asymmetric auxiliary self-supervised tasks to mitigate inaccurate pseudo-labels caused by domain shift and to strengthen domain-invariant feature learning. Result: Experiments on real-world medical image datasets (Fundus, Polyp, SCGM) show that DAC generalizes well under the CD-SSDG setting. Conclusion: Through complementary feature-level supervision and asymmetric self-supervised tasks, DAC improves model robustness and segmentation performance under CD-SSDG and suits domain generalization in practical medical settings. Abstract: Semi-supervised domain generalization (SSDG) in medical image segmentation offers a promising solution for generalizing to unseen domains during testing, addressing domain shift challenges and minimizing annotation costs. However, conventional SSDG methods assume labeled and unlabeled data are available for each source domain in the training set, a condition that is not always met in practice. The coexistence of limited annotation and domain shift in the training set is a prevalent issue. Thus, this paper explores a more practical and challenging scenario, cross-domain semi-supervised domain generalization (CD-SSDG), where domain shifts occur between labeled and unlabeled training data, in addition to shifts between training and testing sets. Existing SSDG methods exhibit sub-optimal performance under such domain shifts because of inaccurate pseudo-labels. To address this issue, we propose a novel dual-supervised asymmetric co-training (DAC) framework tailored for CD-SSDG. Building upon the co-training paradigm with two sub-models offering cross pseudo supervision, our DAC framework integrates extra feature-level supervision and asymmetric auxiliary tasks for each sub-model. This feature-level supervision serves to address inaccurate pseudo supervision caused by domain shifts between labeled and unlabeled data, utilizing complementary supervision from the rich feature space. Additionally, two distinct auxiliary self-supervised tasks are integrated into each sub-model to enhance domain-invariant discriminative feature learning and prevent model collapse.
Extensive experiments on real-world medical image segmentation datasets, i.e., Fundus, Polyp, and SCGM, demonstrate the robust generalizability of the proposed DAC framework.

[92] Real-Time Object Detection Meets DINOv3

Shihua Huang,Yongjie Hou,Longfei Liu,Xuanlong Yu,Xi Shen

Main category: cs.CV

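TL;DR: DEIMv2 upgrades the DEIM framework with DINOv3 features and a Spatial Tuning Adapter (STA), achieving a superior performance-cost trade-off across model sizes and clearly outperforming the YOLO series and other existing models.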

Details Motivation: To further improve the performance and efficiency of real-time DETR detectors, especially deployment flexibility across hardware platforms, while overcoming the limitations of existing methods in multi-scale feature use and lightweight design. Method: The approach uses DINOv3-pretrained or distilled backbones and introduces a Spatial Tuning Adapter (STA) that converts single-scale output into multi-scale features; ultra-lightweight models use HGNetv2 with depth and width pruning; together with a simplified decoder and an upgraded Dense O2O strategy, this forms a unified architecture. Result: DEIMv2-X reaches 57.8 AP with only 50.3 million parameters; DEIMv2-S reaches 50.9 AP with 9.71 million parameters, making it the first sub-10-million-parameter model to exceed 50 AP; DEIMv2-Pico reaches 38.5 AP with 1.5 million parameters, matching YOLOv10-Nano with around 50 percent fewer parameters. Conclusion: With a unified design, DEIMv2 achieves a state-of-the-art performance-cost balance across model scales from large to ultra-lightweight, setting a new benchmark for real-time DETR. Abstract: Benefiting from the simplicity and effectiveness of Dense O2O and MAL, DEIM has become the mainstream training framework for real-time DETRs, significantly outperforming the YOLO series. In this work, we extend it with DINOv3 features, resulting in DEIMv2. DEIMv2 spans eight model sizes from X to Atto, covering GPU, edge, and mobile deployment. For the X, L, M, and S variants, we adopt DINOv3-pretrained or distilled backbones and introduce a Spatial Tuning Adapter (STA), which efficiently converts DINOv3's single-scale output into multi-scale features and complements strong semantics with fine-grained details to enhance detection. For ultra-lightweight models (Nano, Pico, Femto, and Atto), we employ HGNetv2 with depth and width pruning to meet strict resource budgets. Together with a simplified decoder and an upgraded Dense O2O, this unified design enables DEIMv2 to achieve a superior performance-cost trade-off across diverse scenarios, establishing new state-of-the-art results. Notably, our largest model, DEIMv2-X, achieves 57.8 AP with only 50.3 million parameters, surpassing prior X-scale models that require over 60 million parameters for just 56.5 AP. On the compact side, DEIMv2-S is the first sub-10 million model (9.71 million) to exceed the 50 AP milestone on COCO, reaching 50.9 AP. Even the ultra-lightweight DEIMv2-Pico, with just 1.5 million parameters, delivers 38.5 AP, matching YOLOv10-Nano (2.3 million) with around 50 percent fewer parameters.

[93] DAC-LoRA: Dynamic Adversarial Curriculum for Efficient and Robust Few-Shot Adaptation

Ved Umrajkar

Main category: cs.CV

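TL;DR: Proposes DAC-LoRA, a framework that integrates adversarial training into parameter-efficient fine-tuning (PEFT). A curriculum of progressively stronger attacks substantially improves the adversarial robustness of vision-language models while preserving good clean accuracy.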

Details Motivation: Existing vision-language models (VLMs) are vulnerable to adversarial attacks in safety-critical applications; CLIP-based models are a particular concern because their vulnerabilities can cascade across the whole multimodal AI ecosystem, so an efficient and robust fine-tuning method is needed. Method: The DAC-LoRA framework combines the First-Order Stationary Condition (FOSC) with a TRADES-inspired loss to design a dynamic adversarial curriculum, introducing progressively stronger adversarial attacks during PEFT training such as LoRA. Result: DAC-LoRA substantially improves adversarial robustness across tasks without noticeably reducing accuracy on clean data, and integrates easily into standard PEFT pipelines. Conclusion: DAC-LoRA is an effective, lightweight, and broadly applicable method that strengthens VLM robustness in safety-critical scenarios without sacrificing performance, with broad deployment potential. Abstract: Vision-Language Models (VLMs) are foundational to critical applications like autonomous driving, medical diagnosis, and content moderation. While Parameter-Efficient Fine-Tuning (PEFT) methods like LoRA enable their efficient adaptation to specialized tasks, these models remain vulnerable to adversarial attacks that can compromise safety-critical decisions. CLIP, the backbone for numerous downstream VLMs, is a high-value target whose vulnerabilities can cascade across the multimodal AI ecosystem. We propose DAC-LoRA (Dynamic Adversarial Curriculum), a novel framework that integrates adversarial training into PEFT. The core principle of our method, i.e., an intelligent curriculum of progressively challenging attacks, is general and can potentially be applied to any iterative attack method. Guided by the First-Order Stationary Condition (FOSC) and a TRADES-inspired loss, DAC-LoRA achieves substantial improvements in adversarial robustness without significantly compromising clean accuracy. Our work presents an effective, lightweight, and broadly applicable method and demonstrates that the DAC-LoRA framework can be easily integrated into a standard PEFT pipeline to significantly enhance robustness.
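
Illustration (not from the paper): a minimal sketch of how an FOSC-guided curriculum could be expressed, assuming an L-infinity threat model. It uses the closed form commonly given in the adversarial-training literature, FOSC = eps * ||grad||_1 - <x_adv - x_clean, grad>, where a value near zero means the attack has nearly converged (a stronger adversarial example). Whether DAC-LoRA uses exactly this form is an assumption, and `curriculum_threshold` with its linear decay is a hypothetical schedule, not the authors'.

```python
def fosc(x_adv, x_clean, grad, eps):
    """First-Order Stationary Condition for an L-infinity ball of radius eps.

    x_adv, x_clean, and grad are flat lists of floats; grad is the loss
    gradient with respect to the input, evaluated at x_adv.
    Smaller values indicate a better-converged (stronger) attack.
    """
    grad_l1 = sum(abs(g) for g in grad)
    inner = sum((a - c) * g for a, c, g in zip(x_adv, x_clean, grad))
    return eps * grad_l1 - inner


def curriculum_threshold(epoch, warmup_epochs, c_max):
    """Hypothetical schedule: the allowed FOSC decays linearly from c_max to 0.

    Early epochs accept weak attacks (high FOSC); later epochs demand
    stronger ones (FOSC near 0).
    """
    return max(c_max * (1.0 - epoch / warmup_epochs), 0.0)


def attack_is_strong_enough(x_adv, x_clean, grad, eps, epoch,
                            warmup_epochs=10, c_max=0.5):
    """Stop the inner attack loop once FOSC falls below the current threshold."""
    threshold = curriculum_threshold(epoch, warmup_epochs, c_max)
    return fosc(x_adv, x_clean, grad, eps) <= threshold


# Toy example: a full-strength sign step drives FOSC to (about) zero.
x_clean = [0.5, 0.5]
grad = [2.0, -1.0]
x_full_step = [0.6, 0.4]  # x_clean + eps * sign(grad) with eps = 0.1
```

In this toy setup the unperturbed input has FOSC 0.3, so it passes the threshold at epoch 0 (0.5) but not at epoch 10 (0.0), which is the "progressively challenging" behavior the abstract describes.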

[94] Federated Domain Generalization with Domain-specific Soft Prompts Generation

Jianhan Wu,Xiaoyang Qu,Zhangcheng Huang,Jianzong Wang

Main category: cs.CV

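TL;DR: Proposes FedDSPG, a generative approach to federated domain generalization. It introduces domain-specific soft prompts (DSPs) for each domain, integrates content and domain knowledge during training, and generates DSPs for unseen domains at inference, markedly improving downstream-task adaptation to unknown domains.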

Details Motivation: Existing prompt-learning-based federated domain generalization methods produce soft prompts with limited diversity and ignore information from unknown domains, which limits generalization. Method: During training, FedDSPG introduces domain-specific soft prompts (DSPs) for each client domain and integrates content and domain knowledge through a generative model; at inference, the generator produces DSPs for unseen target domains to guide downstream tasks. Result: Experiments on several public datasets show that the method outperforms existing strong baselines on federated domain generalization and achieves state-of-the-art performance. Conclusion: By generating soft prompts, FedDSPG increases their diversity and their adaptability to unknown domains, improving federated domain generalization, with strong practical value. Abstract: Prompt learning has become an efficient paradigm for adapting CLIP to downstream tasks. Compared with traditional fine-tuning, prompt learning optimizes a few parameters yet yields highly competitive results, especially appealing in federated learning for computational efficiency. engendering domain shift among clients and posing a formidable challenge for downstream-task adaptation. Existing federated domain generalization (FDG) methods based on prompt learning typically learn soft prompts from training samples, replacing manually designed prompts to enhance the generalization ability of federated models. However, these learned prompts exhibit limited diversity and tend to ignore information from unknown domains. We propose a novel and effective method from a generative perspective for handling FDG tasks, namely federated domain generalization with domain-specific soft prompts generation (FedDSPG). Specifically, during training, we introduce domain-specific soft prompts (DSPs) for each domain and integrate content and domain knowledge into the generative model among clients. In the inference phase, the generator is utilized to obtain DSPs for unseen target domains, thus guiding downstream tasks in unknown domains. Comprehensive evaluations across several public datasets confirm that our method outperforms existing strong baselines in FDG, achieving state-of-the-art results.

[95] Revolutionizing Precise Low Back Pain Diagnosis via Contrastive Learning

Thanh Binh Le,Hoang Nhat Khang Vo,Tan-Ha Mai,Trong Nhan Phan

Main category: cs.CV

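TL;DR: LumbarCLIP is a multimodal framework based on contrastive language-image pretraining that aligns lumbar spine MRI scans with radiology report text, reaching 95.00% accuracy and a 94.75% F1-score on classification.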

Details Motivation: Low back pain affects many patients, creating a need for diagnostic models that jointly analyze complex medical images and their text reports to improve clinical efficiency and accuracy. Method: The LumbarCLIP framework extracts features with vision encoders such as ResNet-50 and Vision Transformer and a BERT text encoder, maps them into a shared embedding space through learnable projection heads, and trains contrastively with a soft CLIP loss. Result: The model reaches up to 95.00% accuracy and a 94.75% F1-score on the test set, outperforming existing methods, and linear projection heads prove more effective than non-linear ones. Conclusion: LumbarCLIP provides a strong foundation for automated musculoskeletal diagnosis and clinical decision support. Abstract: Low back pain affects millions worldwide, driving the need for robust diagnostic models that can jointly analyze complex medical images and accompanying text reports. We present LumbarCLIP, a novel multimodal framework that leverages contrastive language-image pretraining to align lumbar spine MRI scans with corresponding radiological descriptions. Built upon a curated dataset containing axial MRI views paired with expert-written reports, LumbarCLIP integrates vision encoders (ResNet-50, Vision Transformer, Swin Transformer) with a BERT-based text encoder to extract dense representations. These are projected into a shared embedding space via learnable projection heads, configurable as linear or non-linear, and normalized to facilitate stable contrastive training using a soft CLIP loss. Our model achieves state-of-the-art performance on downstream classification, reaching up to 95.00% accuracy and 94.75% F1-score on the test set, despite inherent class imbalance. Extensive ablation studies demonstrate that linear projection heads yield more effective cross-modal alignment than non-linear variants. LumbarCLIP offers a promising foundation for automated musculoskeletal diagnosis and clinical decision support.
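
Illustration (not from the paper): the abstract describes normalized embeddings trained with a "soft CLIP loss" but does not define the soft variant. The sketch below therefore shows the standard hard-target symmetric contrastive (InfoNCE) loss used in CLIP-style training, as a stand-in. The temperature value and the toy embeddings are assumptions.

```python
import math


def l2_normalize(vector):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def log_softmax(row):
    """Numerically stable log-softmax over one row of logits."""
    peak = max(row)
    log_sum = peak + math.log(sum(math.exp(x - peak) for x in row))
    return [x - log_sum for x in row]


def clip_contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric InfoNCE loss; row i of each list is a matched image/report pair."""
    images = [l2_normalize(v) for v in image_embs]
    texts = [l2_normalize(v) for v in text_embs]
    n = len(images)
    # logits[i][j]: scaled cosine similarity between image i and text j.
    logits = [[sum(a * b for a, b in zip(images[i], texts[j])) / temperature
               for j in range(n)] for i in range(n)]
    transposed = [[logits[i][j] for i in range(n)] for j in range(n)]
    # Cross-entropy with the matching index as the target, in both directions.
    image_to_text = -sum(log_softmax(logits[i])[i] for i in range(n)) / n
    text_to_image = -sum(log_softmax(transposed[j])[j] for j in range(n)) / n
    return 0.5 * (image_to_text + text_to_image)


# Toy check: correctly paired embeddings give a much lower loss than swapped ones.
image_embs = [[1.0, 0.0], [0.0, 1.0]]
matched_loss = clip_contrastive_loss(image_embs, [[1.0, 0.0], [0.0, 1.0]])
swapped_loss = clip_contrastive_loss(image_embs, [[0.0, 1.0], [1.0, 0.0]])
```

A soft variant would replace the one-hot targets with a softened target distribution; that part is left out because the abstract does not specify it.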

[96] Poisoning Prompt-Guided Sampling in Video Large Language Models

Yuxin Cao,Wei Song,Jingling Xue,Jin Song Dong

Main category: cs.CV

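TL;DR: This paper presents PoisonVID, the first black-box poisoning attack on the prompt-guided sampling mechanism in video large language models. It uses a closed-loop optimization strategy to generate a universal perturbation that suppresses the relevance scores of harmful frames.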

Details Motivation: Vulnerabilities have been found in earlier frame sampling strategies, but the safety of prompt-guided sampling has not been explored; the paper aims to close this gap. Method: PoisonVID uses a closed-loop optimization strategy: a shadow VideoLLM and a lightweight language model (GPT-4o-mini) build a depiction set, which guides the iterative optimization of a universal perturbation that suppresses the relevance scores of harmful frames. Result: Evaluated on three prompt-guided sampling strategies and three advanced VideoLLMs, PoisonVID achieves an attack success rate of 82% to 99%. Conclusion: Prompt-guided sampling has serious security vulnerabilities, and more advanced sampling strategies are needed to make VideoLLMs safer. Abstract: Video Large Language Models (VideoLLMs) have emerged as powerful tools for understanding videos, supporting tasks such as summarization, captioning, and question answering. Their performance has been driven by advances in frame sampling, progressing from uniform-based to semantic-similarity-based and, most recently, prompt-guided strategies. While vulnerabilities have been identified in earlier sampling strategies, the safety of prompt-guided sampling remains unexplored. We close this gap by presenting PoisonVID, the first black-box poisoning attack that undermines prompt-guided sampling in VideoLLMs. PoisonVID compromises the underlying prompt-guided sampling mechanism through a closed-loop optimization strategy that iteratively optimizes a universal perturbation to suppress harmful frame relevance scores, guided by a depiction set constructed from paraphrased harmful descriptions leveraging a shadow VideoLLM and a lightweight language model, i.e., GPT-4o-mini. Comprehensively evaluated on three prompt-guided sampling strategies and across three advanced VideoLLMs, PoisonVID achieves an attack success rate of 82% to 99%, highlighting the importance of developing future advanced sampling strategies for VideoLLMs.
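
Illustration (not from the paper): a toy sketch of why suppressing relevance scores matters, assuming prompt-guided sampling simply keeps the k frames with the highest prompt-relevance scores. The top-k rule, the function name, and the scores are hypothetical; the actual sampling strategies evaluated in the paper may differ.

```python
def sample_frames(relevance_scores, k):
    """Return the indices of the k most prompt-relevant frames, in temporal order."""
    ranked = sorted(range(len(relevance_scores)),
                    key=lambda index: relevance_scores[index],
                    reverse=True)
    return sorted(ranked[:k])


# Frame 1 is the frame that is highly relevant to the prompt.
clean_scores = [0.10, 0.90, 0.30, 0.80]
# After a perturbation that suppresses frame 1's relevance score,
# the sampler no longer passes that frame to the VideoLLM.
poisoned_scores = [0.10, 0.05, 0.30, 0.80]

clean_selection = sample_frames(clean_scores, k=2)
poisoned_selection = sample_frames(poisoned_scores, k=2)
```

Under this assumption, the downstream model never sees the suppressed frame, so any safety check that depends on the sampled frames is bypassed without touching the model itself.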

[97] Punching Above Precision: Small Quantized Model Distillation with Learnable Regularizer

Abdur Rehman,S M A Sharif,Md Abdur Rahaman,Mohamed Jismy Aashik Rasool,Seongwan Kim,Jaeho Lee

Main category: cs.CV

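TL;DR: Proposes Game of Regularizer (GoR), a learnable regularization method that uses two trainable parameters to dynamically balance the task-specific and knowledge distillation losses, substantially improving low-bit quantized models. Combined with the ensemble distillation framework QAT-EKD-GoR, it can outperform full-precision models.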

Details Motivation: Existing methods that combine quantization-aware training with knowledge distillation (QAT-KD) struggle to balance the task-specific and distillation losses under low-bit quantization, because differing gradient magnitudes cause conflicting optimization. Method: Game of Regularizer (GoR) weights the losses dynamically with two learnable parameters to reduce conflict between supervision signals; QAT-EKD-GoR is an ensemble distillation framework built on multiple heterogeneous teacher models. Result: On image classification, object detection, and large language model compression, GoR consistently outperforms existing QAT-KD methods; on edge devices it delivers faster inference while maintaining full-precision accuracy; under optimal conditions EKD-GoR can outperform full-precision models. Conclusion: GoR gives low-bit quantized models an efficient, lightweight loss-balancing mechanism that, combined with ensemble distillation, substantially improves compressed-model performance and suits practical deployment in resource-constrained settings. Abstract: Quantization-aware training (QAT) combined with knowledge distillation (KD) is a promising strategy for compressing Artificial Intelligence (AI) models for deployment on resource-constrained hardware. However, existing QAT-KD methods often struggle to balance task-specific (TS) and distillation losses due to heterogeneous gradient magnitudes, especially under low-bit quantization. We propose Game of Regularizer (GoR), a novel learnable regularization method that adaptively balances TS and KD objectives using only two trainable parameters for dynamic loss weighting. GoR reduces conflict between supervision signals, improves convergence, and boosts the performance of small quantized models (SQMs). Experiments on image classification, object detection (OD), and large language model (LLM) compression show that GoR consistently outperforms state-of-the-art QAT-KD methods. On low-power edge devices, it delivers faster inference while maintaining full-precision accuracy. We also introduce QAT-EKD-GoR, an ensemble distillation framework that uses multiple heterogeneous teacher models. Under optimal conditions, the proposed EKD-GoR can outperform full-precision models, providing a robust solution for real-world deployment.
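
Illustration (not from the paper): the abstract says GoR balances the TS and KD losses with two trainable parameters but gives no formula. The sketch below substitutes a different, well-known scheme with the same shape, uncertainty-style weighting with two learnable log-scale scalars, to show how such parameters can adapt to losses of different magnitude. It is not the GoR formulation; all names and values are hypothetical.

```python
import math


def weighted_loss(loss_ts, loss_kd, s_ts, s_kd):
    """Combine two losses with learnable log-scale parameters s_ts and s_kd.

    exp(-s) is the effective weight of each loss; the "+ s" terms stop the
    weights from collapsing to zero.
    """
    return (math.exp(-s_ts) * loss_ts + s_ts
            + math.exp(-s_kd) * loss_kd + s_kd)


def weight_gradients(loss_ts, loss_kd, s_ts, s_kd):
    """Analytic gradients of weighted_loss with respect to s_ts and s_kd."""
    return (1.0 - math.exp(-s_ts) * loss_ts,
            1.0 - math.exp(-s_kd) * loss_kd)


def update_weights(loss_ts, loss_kd, s_ts, s_kd, learning_rate=0.1):
    """One gradient-descent step on the two weighting parameters."""
    grad_ts, grad_kd = weight_gradients(loss_ts, loss_kd, s_ts, s_kd)
    return s_ts - learning_rate * grad_ts, s_kd - learning_rate * grad_kd


# Toy run with fixed loss values: the task loss is 8x larger than the KD loss.
s_ts, s_kd = 0.0, 0.0
for _ in range(500):
    s_ts, s_kd = update_weights(4.0, 0.5, s_ts, s_kd)

effective_weight_ts = math.exp(-s_ts)
effective_weight_kd = math.exp(-s_kd)
```

With the loss values held fixed, each parameter settles at the log of its loss, so the larger loss receives the smaller effective weight and both weighted terms end up on the same scale. That is one simple way to counter the heterogeneous magnitudes the abstract mentions.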

[98] Plant identification based on noisy web data: the amazing performance of deep learning (LifeCLEF 2017)

Herve Goeau,Pierre Bonnet,Alexis Joly

Main category: cs.CV

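TL;DR: The LifeCLEF 2017 plant identification challenge compared plant identification performance when training on a large, noisy web-collected dataset versus a small expert-annotated dataset, using real query images from the Pl@ntNet application as the test set.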

Details Motivation: Despite abundant plant image resources, most species still have no pictures or only a few; meanwhile, the web holds many plant images from non-institutional sources whose labels may be wrong. The study assesses whether such large, noisy web data can support effective automated plant identification systems. Method: Two training-data strategies are compared: a large but noisy training set collected from the web, and a small but high-quality training set verified by experts. The test set comes from an independent source, user-uploaded images from the Pl@ntNet mobile application, to ensure a fair evaluation. Result: The paper presents the challenge's data resources and evaluation method, summarizes the approaches of the participating teams, and analyzes how the different methods performed, showing the potential and limits of noisy data for large-scale plant identification. Conclusion: Large, noisy web-collected data is reasonably competitive for plant identification, but accuracy and stability still benefit from combining it with expert-verified data. Abstract: The 2017 edition of the LifeCLEF plant identification challenge is an important milestone towards automated plant identification systems working at the scale of continental floras with 10,000 plant species living mainly in Europe and North America illustrated by a total of 1.1M images. Nowadays, such ambitious systems are enabled thanks to the conjunction of the dazzling recent progress in image classification with deep learning and several outstanding international initiatives, such as the Encyclopedia of Life (EOL), aggregating the visual knowledge on plant species coming from the main national botany institutes. However, despite all these efforts, the majority of the plant species still remain without pictures or are poorly illustrated. Outside the institutional channels, a much larger number of plant pictures are available and spread on the web through botanist blogs, plant lovers' web pages, image hosting websites and on-line plant retailers. The LifeCLEF 2017 plant challenge presented in this paper aimed at evaluating to what extent a large noisy training dataset collected through the web and containing a lot of labelling errors can compete with a smaller but trusted training dataset checked by experts. To fairly compare both training strategies, the test dataset was created from a third data source, i.e. the Pl@ntNet mobile application that collects millions of plant image queries all over the world.
This paper presents more precisely the resources and assessments of the challenge, summarizes the approaches and systems employed by the participating research groups, and provides an analysis of the main outcomes.

[99] TasselNetV4: A vision foundation model for cross-scene, cross-scale, and cross-species plant counting

Xiaonan Hu,Xuebing Li,Jinyu Xu,Abdulkadir Duran Adan,Letian Zhou,Xuhui Zhu,Yanan Li,Wei Guo,Shouyang Liu,Wenzhong Liu,Hao Lu

Main category: cs.CV

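TL;DR: This paper proposes TasselNetV4, a vision foundation model for cross-species plant counting that combines local counting with the extract-and-match paradigm, improving robustness across scenes, scales, and species.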

Details Motivation: Traditional plant counting relies on species-specific models and cannot keep up with plant diversity and the steady arrival of new cultivars; existing general-purpose counting models perform poorly on non-rigid plants that change over time and space, so the problem formulation of plant counting needs rethinking. Method: Building on the TasselNet framework, TasselNetV4 combines the local counting idea with the extract-and-match paradigm from class-agnostic counting (CAC), uses a plain vision transformer architecture, and adds multi-branch box-aware local counters to improve cross-scale robustness. Result: On two newly built, challenging datasets, PAC-105 and PAC-Somalia, TasselNetV4 outperforms state-of-the-art class-agnostic counting models in both counting accuracy and efficiency. Conclusion: TasselNetV4 counts plants effectively across scenes, scales, and species, showing potential as a vision foundation model for plant counting. Abstract: Accurate plant counting provides valuable information for agriculture such as crop yield prediction, plant density assessment, and phenotype quantification. Vision-based approaches are currently the mainstream solution. Prior art typically uses a detection or a regression model to count a specific plant. However, plants have biodiversity, and new cultivars are increasingly bred each year. It is almost impossible to exhaust and build all species-dependent counting models. Inspired by class-agnostic counting (CAC) in computer vision, we argue that it is time to rethink the problem formulation of plant counting, from what plants to count to how to count plants. In contrast to most daily objects with spatial and temporal invariance, plants are dynamic, changing with time and space. Their non-rigid structure often leads to worse performance than counting rigid instances like heads and cars, such that current CAC and open-world detection models are suboptimal for counting plants. In this work, we inherit the vein of the TasselNet plant counting model and introduce a new extension, TasselNetV4, shifting from species-specific counting to cross-species counting. TasselNetV4 marries the local counting idea of TasselNet with the extract-and-match paradigm in CAC. It builds upon a plain vision transformer and incorporates novel multi-branch box-aware local counters used to enhance cross-scale robustness. Two challenging datasets, PAC-105 and PAC-Somalia, are harvested.
Extensive experiments against state-of-the-art CAC models show that TasselNetV4 achieves not only superior counting performance but also high efficiency. Our results indicate that TasselNetV4 emerges as a vision foundation model for cross-scene, cross-scale, and cross-species plant counting.

[100] SD-RetinaNet: Topologically Constrained Semi-Supervised Retinal Lesion and Layer Segmentation in OCT

Botond Fazekas,Guilherme Aresta,Philipp Seeböck,Julia Mai,Ursula Schmidt-Erfurth,Hrvoje Bogunović

Main category: cs.CV

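TL;DR: Proposes a new semi-supervised model that introduces a fully differentiable biomarker topology engine to produce anatomically correct segmentation of retinal layers and lesions, markedly improving the accuracy and robustness of layer and lesion segmentation in OCT images.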

Details Motivation: Existing semi-supervised methods for retinal OCT segmentation often produce anatomically implausible structures, fail to model layer-lesion interactions effectively, and lack guarantees of topological correctness. Method: A new semi-supervised model incorporates a fully differentiable biomarker topology engine, enabling joint learning with bidirectional influence between layers and lesions; it learns a disentangled representation that separates spatial and style factors and trains on unlabeled and partially labeled data. Result: Validated on public and internal OCT datasets, the model outperforms the current state of the art in both layer and lesion segmentation and can generalize layer segmentation to pathological cases using partially annotated data. Conclusion: Introducing anatomical constraints into semi-supervised learning improves the accuracy, robustness, and trustworthiness of retinal biomarker segmentation, with potential for clinical use. Abstract: Optical coherence tomography (OCT) is widely used for diagnosing and monitoring retinal diseases, such as age-related macular degeneration (AMD). The segmentation of biomarkers such as layers and lesions is essential for patient diagnosis and follow-up. Recently, semi-supervised learning has shown promise in improving retinal segmentation performance. However, existing methods often produce anatomically implausible segmentations, fail to effectively model layer-lesion interactions, and lack guarantees on topological correctness. To address these limitations, we propose a novel semi-supervised model that introduces a fully differentiable biomarker topology engine to enforce anatomically correct segmentation of lesions and layers. This enables joint learning with bidirectional influence between layers and lesions, leveraging unlabeled and diverse partially labeled datasets. Our model learns a disentangled representation, separating spatial and style factors. This approach enables more realistic layer segmentations and improves lesion segmentation, while strictly enforcing lesion location in their anatomically plausible positions relative to the segmented layers. We evaluate the proposed model on public and internal datasets of OCT scans and show that it outperforms the current state-of-the-art in both lesion and layer segmentation, while demonstrating the ability to generalize layer segmentation to pathological cases using partially annotated training data. Our results demonstrate the potential of using anatomical constraints in semi-supervised learning for accurate, robust, and trustworthy retinal biomarker segmentation.

[101] Plant identification in an open-world (LifeCLEF 2016)

Herve Goeau,Pierre Bonnet,Alexis Joly

Main category: cs.CV

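TL;DR: The LifeCLEF 2016 plant identification challenge evaluated large-scale plant identification methods under conditions close to a real-world biodiversity monitoring scenario. For the first time it was framed as an open-set recognition task, requiring systems to reject misclassifications caused by unknown species.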

Details Motivation: To advance the use of plant identification technology in real-world settings, particularly its robustness at large scale and under open-set conditions. Method: The challenge uses a dataset of more than 110K images covering 1000 plant species in Western Europe, built through a participatory sensing platform; the identification task is defined as an open-set recognition problem that evaluates how systems handle unknown categories. Result: Several research groups submitted different approaches; the challenge exposed the limits of existing systems in handling unknown categories and encouraged progress in rejection mechanisms and open-set recognition. Conclusion: Open-set recognition is a key direction for automated plant identification, and models need to become more robust to unknown species and better at rejecting them. Abstract: The LifeCLEF plant identification challenge aims at evaluating plant identification methods and systems at a very large scale, close to the conditions of a real-world biodiversity monitoring scenario. The 2016 edition was conducted on a set of more than 110K images illustrating 1000 plant species living in Western Europe, built through a large-scale participatory sensing platform initiated in 2011 and which now involves tens of thousands of contributors. The main novelty over the previous years is that the identification task was evaluated as an open-set recognition problem, i.e. a problem in which the recognition system has to be robust to unknown and never seen categories. Beyond the brute-force classification across the known classes of the training set, the big challenge was thus to automatically reject the false positive classification hits that are caused by the unknown classes. This overview presents more precisely the resources and assessments of the challenge, summarizes the approaches and systems employed by the participating research groups, and provides an analysis of the main outcomes.
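
Illustration (not from the paper): a minimal sketch of the rejection idea behind open-set recognition, assuming the simplest possible baseline, thresholding the maximum softmax probability. The overview does not prescribe this rule; the threshold, class names, and logits are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predict_open_set(logits, class_names, threshold=0.5):
    """Return the predicted known class, or None to reject as unknown.

    A prediction is rejected when the classifier's top probability falls
    below the threshold, which is treated as a sign of an unseen species.
    """
    probabilities = softmax(logits)
    best = max(range(len(probabilities)), key=lambda i: probabilities[i])
    if probabilities[best] < threshold:
        return None
    return class_names[best]


known_species = ["Quercus robur", "Acer campestre", "Fagus sylvatica"]
confident = predict_open_set([4.0, 0.5, 0.1], known_species)
uncertain = predict_open_set([1.0, 0.9, 1.1], known_species)
```

Raising the threshold rejects more unknown-class false positives at the cost of also rejecting some correct predictions on known species, which is the trade-off an open-set evaluation measures.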

[102] SCRA-VQA: Summarized Caption-Rerank for Augmented Large Language Models in Visual Question Answering

Yan Zhang,Jiaqing Lin,Miao Zhang,Kui Xiao,Xiaoju Hou,Yue Zhao,Zhifei Li

Main category: cs.CV

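TL;DR: Proposes SCRA-VQA, a new knowledge-based visual question answering method that summarizes and reranks image captions to strengthen a large language model's understanding and reasoning, with strong results on the OK-VQA and A-OKVQA datasets.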

Details Motivation: Existing methods feed image captions to the LLM as visual text input, but the captions often contain noise irrelevant to the question, and LLMs have difficulty understanding the VQA task, which limits their reasoning. Method: A pre-trained vision-language model generates image captions; contextual example generation, caption summarization, and reranking then refine the input so that the LLM better understands the image and the question. Result: With a 6.7B-parameter LLM, SCRA-VQA reaches 38.8% accuracy on OK-VQA and 34.6% on A-OKVQA. Conclusion: By denoising and structuring captions, SCRA-VQA improves the reasoning ability and task adaptability of LLMs in knowledge-based VQA without expensive end-to-end training. Abstract: Acquiring high-quality knowledge is a central focus in Knowledge-Based Visual Question Answering (KB-VQA). Recent methods use large language models (LLMs) as knowledge engines for answering. These methods generally employ image captions as visual text descriptions to assist LLMs in interpreting images. However, the captions frequently include excessive noise irrelevant to the question, and LLMs generally do not comprehend VQA tasks, limiting their reasoning capabilities. To address this issue, we propose the Summarized Caption-Rerank Augmented VQA (SCRA-VQA), which employs a pre-trained visual language model to convert images into captions. Moreover, SCRA-VQA generates contextual examples for the captions while simultaneously summarizing and reordering them to exclude unrelated information. The caption-rerank process enables LLMs to understand the image information and questions better, thus enhancing the model's reasoning ability and task adaptability without expensive end-to-end training. Based on an LLM with 6.7B parameters, SCRA-VQA performs excellently on two challenging knowledge-based VQA datasets, OK-VQA and A-OKVQA, achieving accuracies of 38.8% and 34.6%, respectively. Our code is available at https://github.com/HubuKG/SCRA-VQA.

[103] The Unanticipated Asymmetry Between Perceptual Optimization and Assessment

Jiabei Zhang,Qi Wang,Siyu Wu,Du Chen,Tianhe Wu

Main category: cs.CV

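TL;DR: Studies the asymmetry between perceptual optimization and image quality assessment (IQA): fidelity metrics that excel at IQA are not necessarily suited to perceptual optimization, and the mismatch is more pronounced under adversarial training. Discriminator architecture also has a decisive effect on optimization, with patch-level and convolutional designs reconstructing detail better than vanilla or Transformer-based ones.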

Details Motivation: Fidelity and adversarial objectives are central to perceptual optimization, yet the relationship between their effectiveness as optimization objectives and their capability as IQA metrics has not been explored in depth. Method: A systematic analysis compares how different fidelity and adversarial objectives behave in perceptual optimization versus IQA, and evaluates how discriminator architecture affects both the optimization process and transfer to IQA models. Result: The study finds an unexpected asymmetry between optimization and assessment: strong IQA metrics are not necessarily good optimization objectives; discriminators suppress artifacts effectively, but their representations help little as initializations for IQA models; and discriminator design strongly affects optimization, with patch-level and convolutional architectures giving better detail reconstruction. Conclusion: Perceptual optimization and IQA are substantially misaligned, loss design must account for task-specific needs, and the choice of discriminator architecture is critical, which points toward more principled perceptual optimization methods. Abstract: Perceptual optimization is primarily driven by the fidelity objective, which enforces both semantic consistency and overall visual realism, while the adversarial objective provides complementary refinement by enhancing perceptual sharpness and fine-grained detail. Despite their central role, the correlation between their effectiveness as optimization objectives and their capability as image quality assessment (IQA) metrics remains underexplored. In this work, we conduct a systematic analysis and reveal an unanticipated asymmetry between perceptual optimization and assessment: fidelity metrics that excel in IQA are not necessarily effective for perceptual optimization, with this misalignment emerging more distinctly under adversarial training. In addition, while discriminators effectively suppress artifacts during optimization, their learned representations offer only limited benefits when reused as backbone initializations for IQA models. Beyond this asymmetry, our findings further demonstrate that discriminator design plays a decisive role in shaping optimization, with patch-level and convolutional architectures providing more faithful detail reconstruction than vanilla or Transformer-based alternatives. These insights advance the understanding of loss function design and its connection to IQA transferability, paving the way for more principled approaches to perceptual optimization.

[104] Integrating Object Interaction Self-Attention and GAN-Based Debiasing for Visual Question Answering

Zhifei Li,Feng Qiu,Yiran Wang,Yujing Xia,Kui Xiao,Miao Zhang,Yan Zhang

Main category: cs.CV

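TL;DR: Proposes IOG-VQA, a new VQA model that combines object interaction self-attention with GAN-based debiasing to improve visual question answering in the presence of dataset bias.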

Details Motivation: Existing VQA models are easily affected by biases in the training data, which leads to poor generalization across diverse images and questions. Method: An object interaction self-attention mechanism captures complex interactions between objects in an image, and a GAN-based debiasing framework generates unbiased data distributions, improving robustness and generalization. Result: Experiments on VQA-CP v1 and v2 show that the model clearly outperforms existing methods on biased and imbalanced data. Conclusion: Modeling object interactions and mitigating data bias together is essential for improving the generalization of VQA models. Abstract: Visual Question Answering (VQA) presents a unique challenge by requiring models to understand and reason about visual content to answer questions accurately. Existing VQA models often struggle with biases introduced by the training data, leading to over-reliance on superficial patterns and inadequate generalization to diverse questions and images. This paper presents a novel model, IOG-VQA, which integrates Object Interaction Self-Attention and GAN-Based Debiasing to enhance VQA model performance. The self-attention mechanism allows our model to capture complex interactions between objects within an image, providing a more comprehensive understanding of the visual context. Meanwhile, the GAN-based debiasing framework generates unbiased data distributions, helping the model to learn more robust and generalizable features. By leveraging these two components, IOG-VQA effectively combines visual and textual information to address the inherent biases in VQA datasets. Extensive experiments on the VQA-CP v1 and VQA-CP v2 datasets demonstrate that our model shows excellent performance compared with the existing methods, particularly in handling biased and imbalanced data distributions, highlighting the importance of addressing both object interactions and dataset biases in advancing VQA tasks. Our code is available at https://github.com/HubuKG/IOG-VQA.

[105] Nuclear Diffusion Models for Low-Rank Background Suppression in Videos

Tristan S. W. Stevens,Oisín Nolan,Jean-Luc Robert,Ruud J. G. van Sloun

Main category: cs.CV

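TL;DR: Proposes Nuclear Diffusion, a hybrid framework that combines low-rank temporal modeling with diffusion posterior sampling for video denoising and restoration, and outperforms traditional RPCA on cardiac ultrasound dehazing.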

Details Motivation: The sparsity assumption of traditional robust principal component analysis (RPCA) fails to capture the rich variability of real video data, which limits restoration quality under complex noise and background artifacts. Method: Nuclear Diffusion combines low-rank temporal modeling with the generative prior of a diffusion model, decomposing video data into a low-rank part and dynamic content driven by the diffusion model. Result: On real cardiac ultrasound dehazing, Nuclear Diffusion outperforms traditional RPCA in contrast enhancement (gCNR) and signal fidelity (KS statistic). Conclusion: Combining model-based temporal modeling with deep generative priors improves video restoration quality, particularly for medical video with complex noise. Abstract: Video sequences often contain structured noise and background artifacts that obscure dynamic content, posing challenges for accurate analysis and restoration. Robust principal component methods address this by decomposing data into low-rank and sparse components. Still, the sparsity assumption often fails to capture the rich variability present in real video data. To overcome this limitation, a hybrid framework that integrates low-rank temporal modeling with diffusion posterior sampling is proposed. The proposed method, Nuclear Diffusion, is evaluated on a real-world medical imaging problem, namely cardiac ultrasound dehazing, and demonstrates improved dehazing performance compared to traditional RPCA in terms of contrast enhancement (gCNR) and signal preservation (KS statistic). These results highlight the potential of combining model-based temporal models with deep generative priors for high-fidelity video restoration.
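For context on the RPCA baseline mentioned above, the following is a minimal sketch of the two classical proximal steps used by RPCA-style solvers: singular value thresholding for the low-rank (nuclear-norm) part and entrywise soft thresholding for the sparse part. It is not the Nuclear Diffusion method, which according to the abstract replaces the sparse prior with diffusion posterior sampling; the function names and the toy matrix are assumptions for illustration.

```python
import numpy as np

def svt(matrix, tau):
    """Singular value thresholding: proximal operator of tau * nuclear norm.
    Shrinks every singular value by tau and drops those that reach zero."""
    u, s, vt = np.linalg.svd(matrix, full_matrices=False)
    return (u * np.maximum(s - tau, 0.0)) @ vt

def soft_threshold(matrix, tau):
    """Entrywise soft thresholding: proximal operator of tau * L1 norm."""
    return np.sign(matrix) * np.maximum(np.abs(matrix) - tau, 0.0)

# Toy "video": 4 pixels x 3 frames with a static (rank-1) background.
background = np.outer(np.ones(4), np.ones(3))
low_rank = svt(background, tau=1.0)               # still rank 1, singular value reduced by 1
sparse = soft_threshold(np.array([[3.0, -0.5]]), tau=1.0)  # small entries are zeroed
```

In a full RPCA solver these two steps alternate inside an iterative scheme; only the building blocks are shown here.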

[106] FerretNet: Efficient Synthetic Image Detection via Local Pixel Dependencies

Shuqiao Liang,Jian Liu,Renzhang Chen,Quanlong Guan

Main category: cs.CV

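TL;DR: Proposes FerretNet, a lightweight network for synthetic image detection based on local pixel dependencies (LPD). By capturing latent distribution deviations and decoding-induced smoothing from the generation process, it reaches 97.1% average accuracy on an open-world benchmark spanning 22 generative models.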

Details Motivation: As synthetic images from models such as VAEs, GANs, and LDMs become more realistic, traditional detection methods struggle, so more robust detectors are needed. Method: Using the local pixel dependency (LPD) property of Markov Random Fields, images are reconstructed from neighboring pixels to expose inconsistencies in textures and edges, and the lightweight FerretNet is built on this. Result: Trained only on the 4-class ProGAN dataset, FerretNet reaches 97.1% average accuracy on an open-world benchmark of 22 generative models, 10.6% above the previous best methods. Conclusion: FerretNet is an efficient and robust synthetic image detector that generalizes to unseen generative models and has practical potential. Abstract: The increasing realism of synthetic images generated by advanced models such as VAEs, GANs, and LDMs poses significant challenges for synthetic image detection. To address this issue, we explore two artifact types introduced during the generation process: (1) latent distribution deviations and (2) decoding-induced smoothing effects, which manifest as inconsistencies in local textures, edges, and color transitions. Leveraging local pixel dependencies (LPD) properties rooted in Markov Random Fields, we reconstruct synthetic images using neighboring pixel information to expose disruptions in texture continuity and edge coherence. Building upon LPD, we propose FerretNet, a lightweight neural network with only 1.1M parameters that delivers efficient and robust synthetic image detection. Extensive experiments demonstrate that FerretNet, trained exclusively on the 4-class ProGAN dataset, achieves an average accuracy of 97.1% on an open-world benchmark comprising 22 generative models, surpassing state-of-the-art methods by 10.6%.

[107] Concepts in Motion: Temporal Bottlenecks for Interpretable Video Classification

Patrick Knab,Sascha Marton,Philipp J. Schubert,Drago Guggiana,Christian Bartelt

Main category: cs.CV

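TL;DR: MoTIF is a Transformer-inspired interpretable framework that extends concept bottleneck models to video classification. It captures concept importance at the global, local, and temporal levels, giving interpretable models of actions in video while keeping competitive performance.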

Details Motivation: Extending interpretable concept models for images (such as CBMs) to video requires modeling temporal dependencies, and existing methods handle the dynamically changing actions and events in video poorly. Method: MoTIF uses a Transformer-like architecture on top of the concept bottleneck model and models global concept importance across the video, local concept relevance within temporal windows, and temporal dependencies of concepts, supporting videos of arbitrary length. Result: Experiments show that MoTIF transfers the concept-based modeling paradigm to video data, keeping competitive classification performance while revealing how concepts contribute over time. Conclusion: MoTIF applies interpretable concept bottleneck models to video classification and, through multi-perspective concept analysis, gives semantic explanations of video actions, offering a new approach to interpretable AI for temporal data. Abstract: Conceptual models such as Concept Bottleneck Models (CBMs) have driven substantial progress in improving interpretability for image classification by leveraging human-interpretable concepts. However, extending these models from static images to sequences of images, such as video data, introduces a significant challenge due to the temporal dependencies inherent in videos, which are essential for capturing actions and events. In this work, we introduce MoTIF (Moving Temporal Interpretable Framework), an architectural design inspired by a transformer that adapts the concept bottleneck framework for video classification and handles sequences of arbitrary length. Within the video domain, concepts refer to semantic entities such as objects, attributes, or higher-level components (e.g., 'bow', 'mount', 'shoot') that reoccur across time, forming motifs that collectively describe and explain actions. Our design explicitly enables three complementary perspectives: global concept importance across the entire video, local concept relevance within specific windows, and temporal dependencies of a concept over time. Our results demonstrate that the concept-based modeling paradigm can be effectively transferred to video data, enabling a better understanding of concept contributions in temporal contexts while maintaining competitive performance. Code available at github.com/patrick-knab/MoTIF.

[108] FSMODNet: A Closer Look at Few-Shot Detection in Multispectral Data

Manuel Nkegoum,Minh-Tan Pham,Élisa Fromont,Bruno Avignon,Sébastien Lefèvre

Main category: cs.CV

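TL;DR: Proposes FSMODNet, a framework for few-shot multispectral object detection (FSMOD) that fuses visible and thermal features with deformable attention and clearly improves detection when annotated data is very scarce.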

Details Motivation: Annotating multispectral data is costly and difficult, so robust detection from a few labeled samples is a key challenge; the paper explores cross-modality feature fusion to improve few-shot detection. Method: FSMODNet integrates cross-modality features with deformable attention, combining the strengths of visible and thermal imagery to stay robust under complex illumination and environmental conditions. Result: On two public datasets, the method outperforms several baselines built from state-of-the-art models in low-data regimes. Conclusion: Through effective cross-modality fusion, FSMODNet performs well on few-shot multispectral object detection and offers a workable solution for multimodal detection when annotations are scarce. Abstract: Few-shot multispectral object detection (FSMOD) addresses the challenge of detecting objects across visible and thermal modalities with minimal annotated data. In this paper, we explore this complex task and introduce a framework named "FSMODNet" that leverages cross-modality feature integration to improve detection performance even with limited labels. By effectively combining the unique strengths of visible and thermal imagery using deformable attention, the proposed method demonstrates robust adaptability in complex illumination and environmental conditions. Experimental results on two public datasets show effective object detection performance in challenging low-data regimes, outperforming several baselines we established from state-of-the-art models. All code, models, and experimental data splits can be found at https://anonymous.4open.science/r/Test-B48D.

[109] Finding 3D Positions of Distant Objects from Noisy Camera Movement and Semantic Segmentation Sequences

Julius Pesonen,Arno Solin,Eija Honkavaara

Main category: cs.CV

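TL;DR: Proposes using particle filters to localise objects in 3D from camera poses and image segmentation when computational resources are limited or the objects are distant, for safety-critical tasks such as drone-based wildfire monitoring.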

Details Motivation: For distant objects or under limited computational resources, conventional dense depth estimation and 3D scene reconstruction cannot localise targets effectively, so a more flexible and feasible solution is needed. Method: Particle filters handle 3D localisation in single- and multi-target scenarios, combining GNSS-based camera poses with image segmentation results, and are validated in simulation and on real drone data. Result: The particle filter completes practical localisation tasks where other methods fail, and because it does not depend on a specific detection method it generalizes well. Conclusion: Particle filters are an effective approach to vision-based localisation in resource-limited settings and can be paired with existing segmentation models for applications such as drone-based wildfire monitoring. Abstract: 3D object localisation based on a sequence of camera measurements is essential for safety-critical surveillance tasks, such as drone-based wildfire monitoring. Localisation of objects detected with a camera can typically be solved with dense depth estimation or 3D scene reconstruction. However, in the context of distant objects or tasks limited by the amount of available computational resources, neither solution is feasible. In this paper, we show that the task can be solved using particle filters for both single and multiple target scenarios. The method was studied using a 3D simulation and a drone-based image segmentation sequence with global navigation satellite system (GNSS)-based camera pose estimates. The results showed that a particle filter can be used to solve practical localisation tasks based on camera poses and image segments in these situations where other solutions fail. The particle filter is independent of the detection method, making it flexible for new tasks. The study also demonstrates that drone-based wildfire monitoring can be conducted using the proposed method paired with a pre-existing image segmentation model.
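To make the particle-filter idea concrete, here is a minimal sketch for a single static target. It assumes each measurement reduces to a unit bearing vector from a known camera position toward the detected object (for example, the direction through a segment centroid); the prior box, noise levels, flight path, and all function names are hypothetical and not taken from the paper.

```python
import numpy as np

def unit(v):
    """Normalise vectors along the last axis."""
    return v / np.linalg.norm(v, axis=-1, keepdims=True)

def simulate_flight(n_steps=40, noise=0.005, seed=1):
    """Hypothetical toy data: a camera flying a straight line past a static target."""
    rng = np.random.default_rng(seed)
    target = np.array([30.0, 20.0, 0.0])
    cams = np.stack([np.linspace(-40.0, 40.0, n_steps),
                     np.zeros(n_steps),
                     np.full(n_steps, 50.0)], axis=1)
    # Noisy unit bearings from each camera position toward the target.
    bearings = unit(unit(target - cams) + rng.normal(0.0, noise, size=cams.shape))
    return cams, bearings, target

def localise(cams, bearings, n_particles=4000, sigma=0.05, jitter=1.0, seed=0):
    """Bootstrap particle filter over candidate 3D target positions."""
    rng = np.random.default_rng(seed)
    low = np.array([-60.0, -60.0, -5.0])    # assumed prior box
    high = np.array([60.0, 60.0, 5.0])
    particles = rng.uniform(low, high, size=(n_particles, 3))
    for cam, bearing in zip(cams, bearings):
        cos_angle = unit(particles - cam) @ bearing
        # 1 - cos(theta) ~ theta^2 / 2, so this is a Gaussian in angular error.
        weights = np.exp(-(1.0 - cos_angle) / sigma**2)
        weights /= weights.sum()
        keep = rng.choice(n_particles, size=n_particles, p=weights)  # resample
        particles = particles[keep] + rng.normal(0.0, jitter, size=(n_particles, 3))
    return particles.mean(axis=0)

cams, bearings, target = simulate_flight()
estimate = localise(cams, bearings)
```

Because the filter only needs a camera position and a direction per frame, any detector that yields image coordinates could feed it, which is consistent with the abstract's remark that the filter is independent of the detection method.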

[110] SwinMamba: A hybrid local-global mamba framework for enhancing semantic segmentation of remotely sensed images

Qinfeng Zhu,Han Li,Liang He,Lei Fan

Main category: cs.CV

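TL;DR: Proposes SwinMamba, a framework for semantic segmentation of remote sensing images that combines local and global scanning to improve perception of both fine detail and context at low computational complexity, and outperforms existing methods on several datasets.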

Details Motivation: Vision Mamba relies on global scanning and therefore overlooks key local features of remote sensing images, such as textures and edges, which hurts segmentation accuracy; a method that captures both local and global information is needed. Method: Inspired by the Swin Transformer, SwinMamba combines localized Mamba-style scanning within shifted windows with a global receptive field. The first two stages scan locally to capture fine-grained features, the last two stages scan globally to fuse context, and overlapping shifted windows improve information exchange between regions. Result: Extensive experiments on the LoveDA and ISPRS Potsdam datasets show that SwinMamba outperforms state-of-the-art semantic segmentation methods. Conclusion: SwinMamba balances local detail and global context, improves the accuracy of remote sensing semantic segmentation, and shows potential as an efficient segmentation model. Abstract: Semantic segmentation of remote sensing imagery is a fundamental task in computer vision, supporting a wide range of applications such as land use classification, urban planning, and environmental monitoring. However, this task is often challenged by the high spatial resolution, complex scene structures, and diverse object scales present in remote sensing data. To address these challenges, various deep learning architectures have been proposed, including convolutional neural networks, Vision Transformers, and the recently introduced Vision Mamba. Vision Mamba features a global receptive field and low computational complexity, demonstrating both efficiency and effectiveness in image segmentation. However, its reliance on global scanning tends to overlook critical local features, such as textures and edges, which are essential for achieving accurate segmentation in remote sensing contexts. To tackle this limitation, we propose SwinMamba, a novel framework inspired by the Swin Transformer. SwinMamba integrates localized Mamba-style scanning within shifted windows with a global receptive field, to enhance the model's perception of both local and global features. Specifically, the first two stages of SwinMamba perform local scanning to capture fine-grained details, while its subsequent two stages leverage global scanning to fuse broader contextual information. In our model, the use of overlapping shifted windows enhances inter-region information exchange, facilitating more robust feature integration across the entire image. Extensive experiments on the LoveDA and ISPRS Potsdam datasets demonstrate that SwinMamba outperforms state-of-the-art methods, underscoring its effectiveness and potential as a superior solution for semantic segmentation of remotely sensed imagery.

[111] Revisiting Data Challenges of Computational Pathology: A Pack-based Multiple Instance Learning Framework

Wenhao Tang,Heng Fang,Ge Wu,Xiang Li,Ming-Ming Cheng

Main category: cs.CV

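TL;DR: Proposes a pack-based multiple instance learning (MIL) framework that addresses the extremely long and highly variable sequence lengths and the limited supervision of whole slide images in computational pathology, clearly improving training efficiency and accuracy.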

Details Motivation: Whole slide images (WSIs) in computational pathology have extremely long and highly variable sequence lengths and limited annotations, so conventional methods are constrained in training efficiency and optimization and handle data heterogeneity and redundancy poorly. Method: The pack-based MIL framework packs multiple variable-length feature sequences into fixed-length sequences for batched training; a residual branch composes features discarded from multiple slides into a hyperslide, providing multi-slide supervision and reducing the feature loss caused by sampling; and an attention-driven downsampler compresses features in both branches to reduce redundancy. Result: On PANDA (UNI), accuracy improves by up to 8% while using only 12% of the training time, clearly better than conventional methods. Conclusion: By systematically addressing data heterogeneity and redundancy in computational pathology, the method trains efficiently while keeping high performance, which supports the importance of focusing on data challenges in the era of foundation models. Abstract: Computational pathology (CPath) digitizes pathology slides into whole slide images (WSIs), enabling analysis for critical healthcare tasks such as cancer diagnosis and prognosis. However, WSIs possess extremely long sequence lengths (up to 200K), significant length variations (from 200 to 200K), and limited supervision. These extreme variations in sequence length lead to high data heterogeneity and redundancy. Conventional methods often compromise on training efficiency and optimization to preserve such heterogeneity under limited supervision. To comprehensively address these challenges, we propose a pack-based MIL framework. It packs multiple sampled, variable-length feature sequences into fixed-length ones, enabling batched training while preserving data heterogeneity. Moreover, we introduce a residual branch that composes discarded features from multiple slides into a hyperslide which is trained with tailored labels. It offers multi-slide supervision while mitigating feature loss from sampling. Meanwhile, an attention-driven downsampler is introduced to compress features in both branches to reduce redundancy. By alleviating these challenges, our approach achieves an accuracy improvement of up to 8% while using only 12% of the training time on PANDA (UNI). Extensive experiments demonstrate that focusing on data challenges in CPath holds significant potential in the era of foundation models. The code is available at https://github.com/FangHeng/PackMIL
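The packing step, placing several variable-length sequences into fixed-length packs so they can be batched, can be sketched with a greedy first-fit-decreasing heuristic. This is only an assumed illustration of the general idea: the abstract does not say how packs are formed, and the function name and the rule of capping over-long sequences at the pack length (standing in for sampling) are hypothetical.

```python
def pack_sequences(lengths, pack_len):
    """Greedy first-fit-decreasing packing.

    Returns a list of packs; each pack is a list of (sequence_id, kept_length)
    whose kept lengths sum to at most pack_len.
    """
    packs, free = [], []
    # Place the longest sequences first; ties keep their original order.
    order = sorted(range(len(lengths)), key=lambda i: -lengths[i])
    for i in order:
        keep = min(lengths[i], pack_len)  # over-long sequences are cut down (stand-in for sampling)
        for p, room in enumerate(free):
            if keep <= room:              # first pack with enough room
                packs[p].append((i, keep))
                free[p] -= keep
                break
        else:                             # no pack fits: open a new one
            packs.append([(i, keep)])
            free.append(pack_len - keep)
    return packs

packs = pack_sequences([5, 3, 2, 7, 1], pack_len=8)
# packs == [[(3, 7), (4, 1)], [(0, 5), (1, 3)], [(2, 2)]]
```

In the framework described above, the features dropped when a sequence is cut down would not be thrown away but routed to the residual hyperslide branch; that part is not modelled here.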

[112] SimDiff: Simulator-constrained Diffusion Model for Physically Plausible Motion Generation

Akihisa Watanabe,Jiawei Ren,Li Siyao,Yichen Peng,Erwin Wu,Edgar Simo-Serra

Main category: cs.CV

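TL;DR: Proposes SimDiff, which integrates environment parameters directly into the denoising process to generate physically plausible motion efficiently, avoiding repeated simulator calls at inference.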

Details Motivation: Existing simulator-based motion projection methods are computationally expensive because the simulator runs sequentially and is hard to parallelize, which limits generation efficiency. Method: Treating simulator constraints as a guidance signal in the diffusion process, SimDiff conditions the denoising process directly on environment parameters (e.g., gravity, wind), generating physically plausible motion without repeated simulator calls. Result: SimDiff keeps physical plausibility while clearly improving generation efficiency, supports fine-grained control over physical coefficients, and generalizes to unseen combinations of environment parameters. Conclusion: By folding physical environment parameters into the diffusion model, SimDiff achieves efficient, controllable, and well-generalizing generation of physically plausible motion. Abstract: Generating physically plausible human motion is crucial for applications such as character animation and virtual reality. Existing approaches often incorporate a simulator-based motion projection layer into the diffusion process to enforce physical plausibility. However, such methods are computationally expensive due to the sequential nature of the simulator, which prevents parallelization. We show that simulator-based motion projection can be interpreted as a form of guidance, either classifier-based or classifier-free, within the diffusion process. Building on this insight, we propose SimDiff, a Simulator-constrained Diffusion Model that integrates environment parameters (e.g., gravity, wind) directly into the denoising process. By conditioning on these parameters, SimDiff generates physically plausible motions efficiently, without repeated simulator calls at inference, and also provides fine-grained control over different physical coefficients. Moreover, SimDiff successfully generalizes to unseen combinations of environmental parameters, demonstrating compositional generalization.

[113] Unlocking Noise-Resistant Vision: Key Architectural Secrets for Robust Models

Bum Jun Kim,Makoto Kawano,Yusuke Iwasawa,Yutaka Matsuo

Main category: cs.CV

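TL;DR: Studies how the robustness of vision models to Gaussian noise depends on architectural design. From an analysis of 1,174 pretrained models, the paper distills four design patterns that improve robustness and provides a theoretical explanation and practical design guidelines.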

Details Motivation: Prior work mostly measures the robustness of vision models but rarely analyzes how architectural design choices affect it; the paper aims to identify which choices yield stronger noise robustness and to give interpretable, actionable design rules. Method: A large-scale evaluation of 1,174 pretrained vision models identifies the key design factors for robustness to Gaussian noise; a theoretical analysis then turns the empirical observations into causal mechanisms, with mathematical modeling and proofs for stem kernels, downsampling, pooling, and ViT type. Result: Four design patterns clearly improve robustness: larger stem kernels, smaller input resolutions, average pooling, and supervised ViTs rather than CLIP ViTs. The theory shows that low-pass stem kernels attenuate noise quadratically, anti-aliased downsampling reduces noise energy, average pooling beats max pooling, and CLIP ViTs are more sensitive because of their smaller normalization standard deviations; experiments show gains of up to 506 ranks and 21.6 percentage points in accuracy. Conclusion: The robustness of vision models can be improved through modular design; by combining empirical observation with theory, the paper gives interpretable, plug-and-play design rules for building noise-robust vision models. Abstract: While the robustness of vision models is often measured, their dependence on specific architectural design choices is rarely dissected. We investigate why certain vision architectures are inherently more robust to additive Gaussian noise and convert these empirical insights into simple, actionable design rules. Specifically, we performed extensive evaluations on 1,174 pretrained vision models, empirically identifying four consistent design patterns for improved robustness against Gaussian noise: larger stem kernels, smaller input resolutions, average pooling, and supervised vision transformers (ViTs) rather than CLIP ViTs, which yield up to 506 rank improvements and accuracy gains of 21.6 percentage points. We then develop a theoretical analysis that explains these findings, converting observed correlations into causal mechanisms. First, we prove that low-pass stem kernels attenuate noise with a gain that decreases quadratically with kernel size and that anti-aliased downsampling reduces noise energy roughly in proportion to the square of the downsampling factor. Second, we demonstrate that average pooling is unbiased and suppresses noise in proportion to the pooling window area, whereas max pooling incurs a positive bias that grows slowly with window size and yields a relatively higher mean-squared error and greater worst-case sensitivity. Third, we reveal and explain the vulnerability of CLIP ViTs via a pixel-space Lipschitz bound: The smaller normalization standard deviations used in CLIP preprocessing amplify worst-case sensitivity by up to 1.91 times relative to the Inception-style preprocessing common in supervised ViTs. Our results collectively disentangle robustness into interpretable modules, provide a theory that explains the observed trends, and build practical, plug-and-play guidelines for designing vision models more robust against Gaussian noise.
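Two of the claims above lend themselves to a quick numerical check. The sketch below simulates pure Gaussian noise inside a pooling window to compare average and max pooling, and then computes a ratio of normalization standard deviations. The second part rests on an assumption: it uses the commonly published CLIP preprocessing standard deviations and a standard deviation of 0.5 for Inception-style preprocessing, and the abstract does not confirm that this is how the 1.91 figure was derived.

```python
import numpy as np

def pooled_noise_stats(window=4, n_trials=200_000, sigma=1.0, seed=0):
    """Monte Carlo statistics of zero-mean Gaussian noise after pooling
    over a window x window patch (the clean signal is taken to be zero)."""
    rng = np.random.default_rng(seed)
    noise = rng.normal(0.0, sigma, size=(n_trials, window * window))
    avg = noise.mean(axis=1)   # average pooling output
    mx = noise.max(axis=1)     # max pooling output
    return {
        "avg_mean": avg.mean(),        # close to 0: unbiased
        "avg_var": avg.var(),          # close to sigma^2 / window area
        "max_mean": mx.mean(),         # positive bias
        "max_mse": np.mean(mx ** 2),   # error relative to the true value 0
    }

stats = pooled_noise_stats(window=4)

# Assumed preprocessing constants (not stated in the abstract).
CLIP_STD = (0.26862954, 0.26130258, 0.27577711)
INCEPTION_STD = 0.5
sensitivity_ratio = INCEPTION_STD / min(CLIP_STD)  # about 1.91
```

Dividing pixels by a smaller standard deviation scales any input perturbation up by the inverse factor, which is the intuition behind comparing the two constants.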

[114] Decoding the Surgical Scene: A Scoping Review of Scene Graphs in Surgery

Angelo Henriques,Korab Hoxha,Daniel Zapp,Peter C. Issa,Nassir Navab,M. Ali Nasseri

Main category: cs.CV

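TL;DR: Reviews research on scene graphs (SGs) in surgery, revealing a "data divide" between internal-view and external-view data and noting that specialized foundation models now outperform generalist vision-language models in surgical settings.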

Details Motivation: To systematically map the current state and challenges of scene graphs in surgical environments and to advance their development and translation in complex, dynamic surgical scenes. Method: A PRISMA-ScR-guided scoping review systematically analyzes the literature, covering applications, methodological evolution, and future directions. Result: The field is growing quickly but shows a data divide: internal-view research mostly uses real 2D video, while external-view 4D modeling relies on simulated data. Specialized foundation models clearly outperform generalist large models, and SGs have become a key technology for tasks such as workflow recognition, safety monitoring, and surgical simulation. Conclusion: Surgical scene graphs are maturing into a semantic bridge that supports intelligent systems and can improve surgical safety, efficiency, and training, although data annotation and real-time operation remain open challenges. Abstract: Scene graphs (SGs) provide structured relational representations crucial for decoding complex, dynamic surgical environments. This PRISMA-ScR-guided scoping review systematically maps the evolving landscape of SG research in surgery, charting its applications, methodological advancements, and future directions. Our analysis reveals rapid growth, yet uncovers a critical 'data divide': internal-view research (e.g., triplet recognition) almost exclusively uses real-world 2D video, while external-view 4D modeling relies heavily on simulated data, exposing a key translational research gap. Methodologically, the field has advanced from foundational graph neural networks to specialized foundation models that now significantly outperform generalist large vision-language models in surgical contexts. This progress has established SGs as a cornerstone technology for both analysis, such as workflow recognition and automated safety monitoring, and generative tasks like controllable surgical simulation. Although challenges in data annotation and real-time implementation persist, they are actively being addressed through emerging techniques. Surgical SGs are maturing into an essential semantic bridge, enabling a new generation of intelligent systems to improve surgical safety, efficiency, and training.

[115] A Real-Time On-Device Defect Detection Framework for Laser Power-Meter Sensors via Unsupervised Learning

Dongqi Zheng,Wenjin Fu,Guangzong Chen

Main category: cs.CV

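TL;DR: Proposes an automated vision-based system for detecting and classifying coating defects on laser power-meter sensors, using an unsupervised anomaly detection framework with a UFlow network, and achieves high detection accuracy on real images.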

Details Motivation: Coating defects such as thermal damage and scratches compromise the accuracy of laser energy measurement in medical and industrial applications. Method: An unsupervised anomaly detection approach trained only on "good" samples, with preprocessing by Laplacian edge detection and K-means clustering, synthetic data augmentation via StyleGAN2, and a UFlow architecture for multi-scale feature extraction and anomaly map generation. Result: On 366 real sensor images, accuracy is 93.8% on defective samples and 89.3% on good samples, with image-level AUROC of 0.957, pixel-level AUROC of 0.961, and a processing time of 0.5 seconds per image. Conclusion: The system detects known and novel defects efficiently and accurately, is ready for practical deployment, and may bring cost savings through automated quality control. Abstract: We present an automated vision-based system for defect detection and classification of laser power meter sensor coatings. Our approach addresses the critical challenge of identifying coating defects such as thermal damage and scratches that can compromise laser energy measurement accuracy in medical and industrial applications. The system employs an unsupervised anomaly detection framework that trains exclusively on "good" sensor images to learn normal coating distribution patterns, enabling detection of both known and novel defect types without requiring extensive labeled defect datasets. Our methodology consists of three key components: (1) a robust preprocessing pipeline using Laplacian edge detection and K-means clustering to segment the area of interest, (2) synthetic data augmentation via StyleGAN2, and (3) a UFlow-based neural network architecture for multi-scale feature extraction and anomaly map generation. Experimental evaluation on 366 real sensor images demonstrates 93.8% accuracy on defective samples and 89.3% accuracy on good samples, with image-level AUROC of 0.957 and pixel-level AUROC of 0.961. The system provides potential annual cost savings through automated quality control and processing times of 0.5 seconds per image in on-device implementation.

[116] Unlocking Financial Insights: An advanced Multimodal Summarization with Multimodal Output Framework for Financial Advisory Videos

Sarmistha Das,R E Zera Marveen Lyngkhoi,Sriparna Saha,Alka Maurya

Main category: cs.CV

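TL;DR: Proposes FASTER, a modular framework for multimodal summarization of long financial advisory videos that combines text, speech, and image information to produce precise, concise, and factually consistent summaries, and releases Fin-APT, a new dataset of 470 publicly accessible videos.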

Details Motivation: Financial advisory videos spread widely on social media, but their length and multimodal nature make information extraction difficult, and existing methods struggle to keep summaries accurate and consistent across modalities. Method: FASTER combines BLIP for semantic visual descriptions, OCR for on-screen text, and Whisper transcription with speaker diarization, and optimizes summary quality with a modified DPO-based loss that includes fact-checking; a ranker-based retrieval mechanism aligns keyframes with the text summary. Result: In cross-domain experiments, FASTER outperforms existing large language models and vision-language models in summary quality, robustness, and generalization. Conclusion: FASTER sets a new standard for multimodal summarization of financial advisory videos, making the content more accessible and useful and opening further research in the area. Abstract: The dynamic propagation of social media has broadened the reach of financial advisory content through podcast videos, yet extracting insights from lengthy, multimodal segments (30-40 minutes) remains challenging. We introduce FASTER (Financial Advisory Summariser with Textual Embedded Relevant images), a modular framework that tackles three key challenges: (1) extracting modality-specific features, (2) producing optimized, concise summaries, and (3) aligning visual keyframes with associated textual points. FASTER employs BLIP for semantic visual descriptions, OCR for textual patterns, and Whisper-based transcription with Speaker diarization as BOS features. A modified Direct Preference Optimization (DPO)-based loss function, equipped with BOS-specific fact-checking, ensures precision, relevance, and factual consistency against the human-aligned summary. A ranker-based retrieval mechanism further aligns keyframes with summarized content, enhancing interpretability and cross-modal coherence. To address data resource scarcity, we introduce Fin-APT, a dataset comprising 470 publicly accessible financial advisory pep-talk videos for robust multimodal research. Comprehensive cross-domain experiments confirm FASTER's strong performance, robustness, and generalizability when compared to Large Language Models (LLMs) and Vision-Language Models (VLMs). By establishing a new standard for multimodal summarization, FASTER makes financial advisory content more accessible and actionable, thereby opening new avenues for research. The dataset and code are available at: https://github.com/sarmistha-D/FASTER

[117] An Adaptor for Triggering Semi-Supervised Learning to Out-of-Box Serve Deep Image Clustering

Yue Duan,Lei Qi,Yinghuan Shi,Yang Gao

Main category: cs.CV

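TL;DR: Proposes ASD, an adaptor that needs no pretraining or other prerequisites and enables a cold start of semi-supervised learning (SSL) learners for deep image clustering. Pseudo-labeled data and an instance-level classifier are used to extract high-level similarities and assign cluster-level labels, which then trigger the SSL learner; ASD outperforms existing methods and comes very close to SSL methods that use ground-truth labels.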

Details Motivation: Existing methods that bring semi-supervised learning (SSL) into deep clustering all require pretraining, clustering learning, or a trained clustering model as prerequisites, which limits flexible, out-of-box use of SSL learners for image clustering. Method: ASD first randomly samples pseudo-labeled data from the unlabeled data and trains an instance-level classifier to learn semantically aligned instance-level labels; it then tracks class transitions in the classifier's predictions to extract high-level similarities between instance-level classes and uses them to assign cluster-level labels to the pseudo-labeled data; finally, these labeled samples trigger a general SSL learner trained on the unlabeled data for image clustering. Result: ASD outperforms the latest deep image clustering methods on several benchmarks and shows only a slight accuracy gap to SSL methods that use ground-truth labels (e.g., only 1.33% on CIFAR-10); it can also further improve existing SSL-embedded clustering methods. Conclusion: ASD gives SSL learners a plug-and-play cold start for deep image clustering without any prerequisites, combining high performance with broad compatibility. Abstract: Recently, some works integrate SSL techniques into deep clustering frameworks to enhance image clustering performance. However, they all need pretraining, clustering learning, or a trained clustering model as prerequisites, limiting the flexible and out-of-box application of SSL learners in the image clustering task. This work introduces ASD, an adaptor that enables the cold-start of SSL learners for deep image clustering without any prerequisites. Specifically, we first randomly sample pseudo-labeled data from all unlabeled data, and set an instance-level classifier to learn them with semantically aligned instance-level labels. With the ability of instance-level classification, we track the class transitions of predictions on unlabeled data to extract high-level similarities of instance-level classes, which can be utilized to assign cluster-level labels to pseudo-labeled data. Finally, we use the pseudo-labeled data with assigned cluster-level labels to trigger a general SSL learner trained on the unlabeled data for image clustering. We show the superior performance of ASD across various benchmarks against the latest deep image clustering approaches and very slight accuracy gaps compared to SSL methods using ground-truth labels, e.g., only 1.33% on CIFAR-10. Moreover, ASD can also further boost the performance of existing SSL-embedded deep image clustering methods.

[118] SiNGER: A Clearer Voice Distills Vision Transformers Further

Geunhyeok Yu,Sunjae Jeong,Yoonyoung Choi,Jaeseung Kim,Hyoseok Hwang

Main category: cs.CV

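TL;DR: Proposes SiNGER, a knowledge distillation framework that uses nullspace-guided perturbation to suppress high-norm artifacts in the teacher model while preserving useful information, improving student performance.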

Details Motivation: Vision Transformers, the backbone of vision foundation models, produce high-norm artifacts that degrade representation quality; during knowledge distillation these artifacts dominate the objective, so students overfit to artifacts and learn less from informative signals, which limits the gains from larger models. Existing artifact-removal methods struggle to balance artifact suppression against information preservation. Method: The Singular Nullspace-Guided Energy Reallocation (SiNGER) framework refines teacher features in a principled way before distillation: a nullspace-guided perturbation suppresses artifacts while preserving key information, and it is implemented efficiently with a LoRA-based adapter that needs minimal structural change. Result: Extensive experiments show that SiNGER clearly improves student models on multiple downstream tasks, reaches state-of-the-art performance, and yields clearer, more interpretable representations. Conclusion: SiNGER resolves the tension between suppressing high-norm artifacts and preserving useful signals in knowledge distillation, offering an efficient, practical solution for distilling Vision Transformers. Abstract: Vision Transformers are widely adopted as the backbone of vision foundation models, but they are known to produce high-norm artifacts that degrade representation quality. When knowledge distillation transfers these features to students, high-norm artifacts dominate the objective, so students overfit to artifacts and underweight informative signals, diminishing the gains from larger models. Prior work attempted to remove artifacts but encountered an inherent trade-off between artifact suppression and preserving informative signals from teachers. To address this, we introduce Singular Nullspace-Guided Energy Reallocation (SiNGER), a novel distillation framework that suppresses artifacts while preserving informative signals. The key idea is principled teacher feature refinement: during refinement, we leverage the nullspace-guided perturbation to preserve information while suppressing artifacts. Then, the refined teacher's features are distilled to a student. We implement this perturbation efficiently with a LoRA-based adapter that requires minimal structural modification. Extensive experiments show that SiNGER consistently improves student models, achieving state-of-the-art performance in multiple downstream tasks and producing clearer and more interpretable representations.

[119] Fast-SEnSeI: Lightweight Sensor-Independent Cloud Masking for On-board Multispectral Sensors

Jan Kněžík,Jonáš Herec,Rado Pitoňák

Main category: cs.CV

TL;DR: Proposes Fast-SEnSeI, a lightweight, sensor-independent encoder module for flexible on-board cloud segmentation across multispectral sensors.

Details Motivation: Existing cloud segmentation models are typically tied to specific sensor configurations and require ground-based processing, so they lack flexibility and real-time capability. Method: Builds on SEnSeI-v2 with an improved spectral descriptor, a lightweight architecture, and robust padding-band handling. The module accepts arbitrary band combinations and is paired with a compact, quantized segmentation model based on a modified U-Net, deployed as a CPU-FPGA hybrid pipeline. Result: Evaluations on the Sentinel-2 and Landsat 8 datasets show accurate cloud segmentation across different input configurations, with efficient execution on embedded CPUs and FPGA. Conclusion: Fast-SEnSeI enables efficient, flexible on-board cloud segmentation across multispectral sensors and suits resource-constrained space missions. Abstract: Cloud segmentation is a critical preprocessing step for many Earth observation tasks, yet most models are tightly coupled to specific sensor configurations and rely on ground-based processing. In this work, we propose Fast-SEnSeI, a lightweight, sensor-independent encoder module that enables flexible, on-board cloud segmentation across multispectral sensors with varying band configurations. Building upon SEnSeI-v2, Fast-SEnSeI integrates an improved spectral descriptor, lightweight architecture, and robust padding-band handling. It accepts arbitrary combinations of spectral bands and their wavelengths, producing fixed-size feature maps that feed into a compact, quantized segmentation model based on a modified U-Net. The module runs efficiently on embedded CPUs using Apache TVM, while the segmentation model is deployed on FPGA, forming a CPU-FPGA hybrid pipeline suitable for space-qualified hardware. Evaluations on Sentinel-2 and Landsat 8 datasets demonstrate accurate segmentation across diverse input configurations.

[120] A Single Neuron Works: Precise Concept Erasure in Text-to-Image Diffusion Models

Qinqin He,Jiaqi Weng,Jialing Tao,Hui Xue

Main category: cs.CV

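TL;DR: This paper proposes Single Neuron-based Concept Erasure (SNCE), which uses a sparse autoencoder and a new neuron identification technique to precisely suppress harmful content generation while preserving image quality.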

Details Motivation: Existing concept erasure methods struggle to remove harmful concepts thoroughly while preserving image generation quality, so a more precise, less disruptive solution is needed. Method: Trains a Sparse Autoencoder (SAE) to map text embeddings into a sparse, disentangled latent space, and designs a neuron identification method based on modulated frequency scoring to locate and suppress the single neuron specific to a harmful concept. Result: SNCE achieves state-of-the-art target concept erasure on multiple benchmarks, is significantly more robust to adversarial attacks than existing methods, and preserves generation capability for non-target concepts. Conclusion: By manipulating a single neuron, SNCE achieves precise, minimally disruptive concept erasure, offering an effective and efficient route to safety control in text-to-image models. Abstract: Text-to-image models exhibit remarkable capabilities in image generation. However, they also pose safety risks of generating harmful content. A key challenge of existing concept erasure methods is the precise removal of target concepts while minimizing degradation of image quality. In this paper, we propose Single Neuron-based Concept Erasure (SNCE), a novel approach that can precisely prevent harmful content generation by manipulating only a single neuron. Specifically, we train a Sparse Autoencoder (SAE) to map text embeddings into a sparse, disentangled latent space, where individual neurons align tightly with atomic semantic concepts. To accurately locate neurons responsible for harmful concepts, we design a novel neuron identification method based on the modulated frequency scoring of activation patterns. By suppressing activations of the harmful concept-specific neuron, SNCE achieves surgical precision in concept erasure with minimal disruption to image quality. Experiments on various benchmarks demonstrate that SNCE achieves state-of-the-art results in target concept erasure, while preserving the model's generation capabilities for non-target concepts. Additionally, our method exhibits strong robustness against adversarial attacks, significantly outperforming existing methods.
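The suppression step described in the SNCE abstract (encode a text embedding with a sparse autoencoder, silence one latent neuron, decode) can be illustrated with a toy sketch. Everything here is an assumption for illustration: the function names, the ReLU encoder, and the identity weights are hypothetical, and the paper's modulated frequency scoring for choosing the neuron is not reproduced.

```python
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def matvec(m: Matrix, v: Vector) -> Vector:
    """Multiply a row-major matrix by a vector."""
    return [sum(w * x for w, x in zip(row, v)) for row in m]


def sae_encode(x: Vector, w_enc: Matrix, b_enc: Vector) -> Vector:
    """Hypothetical SAE encoder: ReLU(W_enc x + b_enc)."""
    return [max(0.0, h + b) for h, b in zip(matvec(w_enc, x), b_enc)]


def sae_decode(z: Vector, w_dec: Matrix) -> Vector:
    """Hypothetical SAE decoder: W_dec z."""
    return matvec(w_dec, z)


def erase_concept(x: Vector, w_enc: Matrix, b_enc: Vector,
                  w_dec: Matrix, neuron: int) -> Vector:
    """Zero a single latent neuron, then map back to embedding space."""
    z = sae_encode(x, w_enc, b_enc)
    z[neuron] = 0.0  # suppress the (assumed) concept-specific neuron
    return sae_decode(z, w_dec)


# Toy weights: identity matrices, so latent neuron i mirrors embedding dim i.
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
edited = erase_concept([0.5, 2.0, 1.0], identity, [0.0, 0.0, 0.0],
                       identity, neuron=1)
```

With these toy weights only the second component of the embedding is removed; the other components pass through unchanged, which is the "minimal disruption" property the abstract claims for the real model.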

[121] OmniPlantSeg: Species Agnostic 3D Point Cloud Organ Segmentation for High-Resolution Plant Phenotyping Across Modalities

Andreas Gilson,Lukas Meyer,Oliver Scholz,Ute Schmid

Main category: cs.CV

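TL;DR: Proposes KD-SS, a simple yet effective algorithm for sub-sampling biological point clouds that works across sensor data and plant species and enables full-resolution point cloud segmentation without down-sampling.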

Details Motivation: Existing point cloud segmentation methods for plant organs usually target specific species or sensor modalities and require extensive pre-processing and down-sampling, which limits generality and accuracy. Method: Designs KD-SS, a sub-sampling algorithm that is agnostic to sensor and plant species and retains the original resolution, and combines it with current state-of-the-art segmentation models to segment the full-resolution point cloud. Result: The approach is validated on several sensor modalities (photogrammetry, laser triangulation, and LiDAR) and on various plant species, with satisfying results. Conclusion: KD-SS is a lightweight, resolution-retaining alternative to intensive pre-processing that generalizes across species and sensors for plant organ segmentation. Abstract: Accurate point cloud segmentation for plant organs is crucial for 3D plant phenotyping. Existing solutions are designed problem-specific with a focus on certain plant species or specified sensor-modalities for data acquisition. Furthermore, it is common to use extensive pre-processing and down-sample the plant point clouds to meet hardware or neural network input size requirements. We propose a simple, yet effective algorithm KD-SS for sub-sampling of biological point clouds that is agnostic to sensor data and plant species. The main benefit of this approach is that we do not need to down-sample our input data and thus, enable segmentation of the full-resolution point cloud. Combining KD-SS with current state-of-the-art segmentation models shows satisfying results evaluated on different modalities such as photogrammetry, laser triangulation and LiDAR for various plant species. We propose KD-SS as a lightweight, resolution-retaining alternative to intensive pre-processing and down-sampling methods for plant organ segmentation regardless of used species and sensor modality.

[122] Background Prompt for Few-Shot Out-of-Distribution Detection

Songyue Cai,Zongqian Wu,Yujie Mo,Liang Peng,Ping Hu,Xiaoshuang Shi,Xiaofeng Zhu

Main category: cs.CV

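TL;DR: Proposes Mambo, a new foreground-background decomposition framework for few-shot out-of-distribution (FS-OOD) detection that improves robustness and performance by learning a background prompt and using a self-calibrated tuning strategy.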

Details Motivation: Existing methods have low robustness in few-shot OOD detection because they over-rely on local class similarity and use a fixed background patch extraction strategy. Method: Learns a background prompt to obtain a local background similarity that contains both background and semantic information, then refines it with the local class similarity; introduces patch self-calibrated tuning to flexibly select the number of background patches for each sample. Result: Experiments on multiple real-world datasets show that Mambo outperforms state-of-the-art methods in both OOD detection and near-OOD detection settings. Conclusion: By improving background extraction and exploiting semantic information, Mambo increases the robustness and accuracy of few-shot OOD detection. Abstract: Existing foreground-background (FG-BG) decomposition methods for the few-shot out-of-distribution (FS-OOD) detection often suffer from low robustness due to over-reliance on the local class similarity and a fixed background patch extraction strategy. To address these challenges, we propose a new FG-BG decomposition framework, namely Mambo, for FS-OOD detection. Specifically, we propose to first learn a background prompt to obtain the local background similarity containing both the background and image semantic information, and then refine the local background similarity using the local class similarity. As a result, we use both the refined local background similarity and the local class similarity to conduct background extraction, reducing the dependence on the local class similarity in previous methods. Furthermore, we propose the patch self-calibrated tuning to consider the sample diversity to flexibly select numbers of background patches for different samples, and thus exploring the issue of fixed background extraction strategies in previous methods. Extensive experiments on real-world datasets demonstrate that our proposed Mambo achieves the best performance, compared to SOTA methods in terms of OOD detection and near OOD detection setting. The source code will be released at https://github.com/YuzunoKawori/Mambo.

[123] Stratify or Die: Rethinking Data Splits in Image Segmentation

Naga Venkata Sai Jitin Jami,Thomas Altstidl,Jonas Mueller,Jindong Li,Dario Zanca,Bjoern Eskofier,Heike Leutheuser

Main category: cs.CV

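TL;DR: Proposes Wasserstein-Driven Evolutionary Stratification (WDES) and Iterative Pixel Stratification (IPS) for splitting datasets in image segmentation, making test sets more representative and model evaluation more stable.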

Details Motivation: Random dataset splits in image segmentation often yield unrepresentative test sets, causing biased evaluation and poor generalization; existing stratified sampling methods struggle with the multi-label structure and class imbalance of segmentation data. Method: Proposes two methods: Iterative Pixel Stratification (IPS), a simple label-aware sampling method, and Wasserstein-Driven Evolutionary Stratification (WDES), a genetic algorithm that minimizes the Wasserstein distance to make label distributions similar across splits, with a proof of global optimality given enough generations. Result: Across segmentation tasks including street scenes, medical imaging, and satellite imagery, WDES produces more representative splits than random sampling, lowers performance variance, and makes model evaluation more reliable, with the largest benefit on small, imbalanced, and low-diversity datasets. Conclusion: WDES is an effective, provably optimal data splitting method that improves evaluation quality in image segmentation, especially in challenging data regimes. Abstract: Random splitting of datasets in image segmentation often leads to unrepresentative test sets, resulting in biased evaluations and poor model generalization. While stratified sampling has proven effective for addressing label distribution imbalance in classification tasks, extending these ideas to segmentation remains challenging due to the multi-label structure and class imbalance typically present in such data. Building on existing stratification concepts, we introduce Iterative Pixel Stratification (IPS), a straightforward, label-aware sampling method tailored for segmentation tasks. Additionally, we present Wasserstein-Driven Evolutionary Stratification (WDES), a novel genetic algorithm designed to minimize the Wasserstein distance, thereby optimizing the similarity of label distributions across dataset splits. We prove that WDES is globally optimal given enough generations. Using newly proposed statistical heterogeneity metrics, we evaluate both methods against random sampling and find that WDES consistently produces more representative splits. Applying WDES across diverse segmentation tasks, including street scenes, medical imaging, and satellite imagery, leads to lower performance variance and improved model evaluation. Our results also highlight the particular value of WDES in handling small, imbalanced, and low-diversity datasets, where conventional splitting strategies are most prone to bias.
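To make the WDES objective concrete, the sketch below scores a candidate train/test split by the 1D Wasserstein distance between per-image class-pixel fractions, averaged over classes. This is only one plausible reading of "similarity of label distributions across dataset splits"; the function names and the per-class averaging are assumptions, and the genetic search itself is not shown.

```python
from typing import List, Sequence


def wasserstein_1d(a: Sequence[float], b: Sequence[float]) -> float:
    """W1 distance between two 1D empirical distributions.

    Computed as the integral of |F_a(x) - F_b(x)| over x, where F is the
    empirical CDF. Sample sizes may differ.
    """
    a, b = sorted(a), sorted(b)
    points = sorted(set(a) | set(b))
    total, i, j = 0.0, 0, 0
    for left, right in zip(points, points[1:]):
        while i < len(a) and a[i] <= left:
            i += 1
        while j < len(b) and b[j] <= left:
            j += 1
        total += abs(i / len(a) - j / len(b)) * (right - left)
    return total


def split_distance(train: List[List[float]], test: List[List[float]]) -> float:
    """Mean per-class W1 distance between two splits.

    Each image is a vector of class-pixel fractions (hypothetical format).
    A lower score means the two splits have more similar label statistics.
    """
    n_classes = len(train[0])
    per_class = [
        wasserstein_1d([img[c] for img in train], [img[c] for img in test])
        for c in range(n_classes)
    ]
    return sum(per_class) / n_classes


# Toy example: two classes, pixel fractions per image.
train_split = [[0.9, 0.1], [0.8, 0.2], [0.5, 0.5]]
test_split = [[0.9, 0.1], [0.5, 0.5]]
score = split_distance(train_split, test_split)
```

A search procedure such as the genetic algorithm named in the abstract would propose many candidate splits and keep those with the lowest score.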

[124] EnGraf-Net: Multiple Granularity Branch Network with Fine-Coarse Graft Grained for Classification Task

Riccardo La Grassa,Ignazio Gallo,Nicola Landro

Main category: cs.CV

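TL;DR: This paper proposes EnGraf-Net, an end-to-end deep neural network that uses hierarchical semantic associations (a taxonomy) as supervision for fine-grained classification. It needs no cropping techniques or manual annotations and matches or exceeds existing methods on several datasets.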

Details Motivation: Existing fine-grained classification methods mostly rely on part annotations or automatic attention mechanisms, but they represent local features incompletely. Inspired by how humans recognize objects through semantic associations, the authors introduce hierarchical semantic information to improve classification. Method: Proposes EnGraf-Net, which embeds a semantic hierarchy (taxonomy) as a supervision signal in an end-to-end deep neural network and uses hierarchical semantic associations to guide feature learning, avoiding extra annotations such as bounding boxes, part locations, or textual attributes. Result: Extensive experiments on CIFAR-100, CUB-200-2011, and FGVC-Aircraft show that EnGraf-Net outperforms many existing methods and is competitive with the most recent state-of-the-art approaches. Conclusion: By using hierarchical semantic associations as supervision, EnGraf-Net achieves effective fine-grained classification without manual annotations or automatic cropping, confirming the value of semantic structure for fine-grained recognition. Abstract: Fine-grained classification models are designed to focus on the relevant details necessary to distinguish highly similar classes, particularly when intra-class variance is high and inter-class variance is low. Most existing models rely on part annotations such as bounding boxes, part locations, or textual attributes to enhance classification performance, while others employ sophisticated techniques to automatically extract attention maps. We posit that part-based approaches, including automatic cropping methods, suffer from an incomplete representation of local features, which are fundamental for distinguishing similar objects. While fine-grained classification aims to recognize the leaves of a hierarchical structure, humans recognize objects by also forming semantic associations. In this paper, we leverage semantic associations structured as a hierarchy (taxonomy) as supervised signals within an end-to-end deep neural network model, termed EnGraf-Net. Extensive experiments on three well-known datasets CIFAR-100, CUB-200-2011, and FGVC-Aircraft demonstrate the superiority of EnGraf-Net over many existing fine-grained models, showing competitive performance with the most recent state-of-the-art approaches, without requiring cropping techniques or manual annotations.

[125] Vision Transformers: the threat of realistic adversarial patches

Kasper Cools,Clara Maathuis,Alexander M. van Oers,Claudia S. Hübner,Nikos Deligiannis,Marijke Vandewal,Geert De Cubber

Main category: cs.CV

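TL;DR: This study examines how adversarial patch attacks developed for convolutional neural networks (CNNs) transfer to Vision Transformer (ViT) models. Vulnerability varies widely across ViT models, with attack success rates ranging from 40.04% to 99.97%, indicating that pre-training dataset scale and methodology strongly affect robustness.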

Details Motivation: As machine learning systems are widely deployed, their security has become a critical concern. Although ViTs outperform CNNs in accuracy and robustness to adversarial perturbations, they may still be vulnerable to evasion attacks such as adversarial patches, so their practical security needs to be evaluated. Method: Uses the Creases Transformation (CT) technique to generate adversarial patches with natural crease-like distortions, and runs cross-architecture attack experiments on four fine-tuned ViT models, measuring attack success rate on a binary person vs. non-person classification task. Result: Sensitivity to adversarial patches differs markedly across ViT models: the attack success rate is 40.04% for google/vit-base-patch16-224-in21k and 99.97% for facebook/dino-vitb16, with the remaining two models at 66.40% and 65.17%, confirming that adversarial patches generated for CNNs transfer to ViTs. Conclusion: ViTs are not inherently immune to adversarial patch attacks; their resilience depends heavily on pre-training dataset scale and training methodology, so targeted defenses are needed. Abstract: The increasing reliance on machine learning systems has made their security a critical concern. Evasion attacks enable adversaries to manipulate the decision-making processes of AI systems, potentially causing security breaches or misclassification of targets. Vision Transformers (ViTs) have gained significant traction in modern machine learning due to increased 1) performance compared to Convolutional Neural Networks (CNNs) and 2) robustness against adversarial perturbations. However, ViTs remain vulnerable to evasion attacks, particularly to adversarial patches, unique patterns designed to manipulate AI classification systems. These vulnerabilities are investigated by designing realistic adversarial patches to cause misclassification in person vs. non-person classification tasks using the Creases Transformation (CT) technique, which adds subtle geometric distortions similar to those occurring naturally when wearing clothing. This study investigates the transferability of adversarial attack techniques used in CNNs when applied to ViT classification models. Experimental evaluation across four fine-tuned ViT models on a binary person classification task reveals significant vulnerability variations: attack success rates ranged from 40.04% (google/vit-base-patch16-224-in21k) to 99.97% (facebook/dino-vitb16), with google/vit-base-patch16-224 achieving 66.40% and facebook/dinov3-vitb16 reaching 65.17%.
These results confirm the cross-architectural transferability of adversarial patches from CNNs to ViTs, with pre-training dataset scale and methodology strongly influencing model resilience to adversarial attacks.

[126] UniTransfer: Video Concept Transfer via Progressive Spatial and Timestep Decomposition

Guojun Lei,Rong Zhang,Chi Wang,Tianhang Liu,Hong Li,Zhiyuan Ma,Weiwei Xu

Main category: cs.CV

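TL;DR: Proposes UniTransfer, a new architecture that achieves precise, controllable video concept transfer through spatial and diffusion timestep decomposition.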

Details Motivation: To achieve finer-grained, more controllable video concept transfer and address the limited editability and fidelity of existing methods. Method: Introduces spatial decomposition (foreground, background, motion flow) with a dual-to-single-stream DiT-based architecture, proposes a Chain-of-Prompt mechanism for timestep decomposition in which LLMs guide generation, and uses self-supervised pretraining to strengthen representation learning. Result: Achieves high-quality, highly controllable video concept transfer across diverse reference images and scenes, surpassing existing baselines in visual fidelity and editability. Conclusion: UniTransfer's combined spatial and timestep decomposition improves video concept transfer and provides new ideas and data for future research. Abstract: We propose a novel architecture UniTransfer, which introduces both spatial and diffusion timestep decomposition in a progressive paradigm, achieving precise and controllable video concept transfer. Specifically, in terms of spatial decomposition, we decouple videos into three key components: the foreground subject, the background, and the motion flow. Building upon this decomposed formulation, we further introduce a dual-to-single-stream DiT-based architecture for supporting fine-grained control over different components in the videos. We also introduce a self-supervised pretraining strategy based on random masking to enhance the decomposed representation learning from large-scale unlabeled video data. Inspired by the Chain-of-Thought reasoning paradigm, we further revisit the denoising diffusion process and propose a Chain-of-Prompt (CoP) mechanism to achieve the timestep decomposition. We decompose the denoising process into three stages of different granularity and leverage large language models (LLMs) for stage-specific instructions to guide the generation progressively. We also curate an animal-centric video dataset called OpenAnimal to facilitate the advancement and benchmarking of research in video concept transfer. Extensive experiments demonstrate that our method achieves high-quality and controllable video concept transfer across diverse reference images and scenes, surpassing existing baselines in both visual fidelity and editability. Web Page: https://yu-shaonian.github.io/UniTransfer-Web/

[127] VideoChat-R1.5: Visual Test-Time Scaling to Reinforce Multimodal Reasoning by Iterative Perception

Ziang Yan,Xinhao Li,Yinan He,Zhengrong Yue,Xiangyu Zeng,Yali Wang,Yu Qiao,Limin Wang,Yi Wang

Main category: cs.CV

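TL;DR: This paper proposes Visual Test-Time Scaling (VTTS), which improves the reasoning of multimodal large language models (MLLMs) through iterative perception at inference time. It combines reinforcement learning with spatio-temporal supervision, introduces the VTTS-80K dataset, and significantly outperforms existing baselines on multiple tasks.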

Details Motivation: Existing MLLMs reason over statically parsed visuals and cannot match human-like dynamic perception and understanding, so a method is needed that keeps refining perception during reasoning. Method: Proposes VTTS with an Iterative Perception (ITP) mechanism that uses reinforcement learning and spatio-temporal supervision to adjust attention to spatio-temporal regions of the video based on textual predictions, jointly optimizing perception and reasoning. The VTTS-80K dataset is built to support training under this paradigm. Result: Across more than 15 benchmarks covering video conversation, video reasoning, and spatio-temporal perception, the VTTS-based Videochat-R1.5 model improves by more than 5% on average over strong baselines such as Qwen2.5VL-3B and -7B. Conclusion: VTTS improves performance by spending more perceptual compute at inference time, confirming that iterative perception is an effective and generalizable way to strengthen MLLM reasoning. Abstract: Inducing reasoning in multimodal large language models (MLLMs) is critical for achieving human-level perception and understanding. Existing methods mainly leverage LLM reasoning to analyze parsed visuals, often limited by static perception stages. This paper introduces Visual Test-Time Scaling (VTTS), a novel approach to enhance MLLMs' reasoning via iterative perception during inference. VTTS mimics humans' hierarchical attention by progressively refining focus on high-confidence spatio-temporal regions, guided by updated textual predictions. Specifically, VTTS employs an Iterative Perception (ITP) mechanism, incorporating reinforcement learning with spatio-temporal supervision to optimize reasoning. To support this paradigm, we also present VTTS-80K, a dataset tailored for iterative perception. These designs allow an MLLM to enhance its performance by increasing its perceptual compute. Extensive experiments validate VTTS's effectiveness and generalization across diverse tasks and benchmarks. Our newly introduced Videochat-R1.5 model has achieved remarkable improvements, with an average increase of over 5%, compared to robust baselines such as Qwen2.5VL-3B and -7B, across more than 15 benchmarks that encompass video conversation, video reasoning, and spatio-temporal perception.

[128] Mammo-CLIP Dissect: A Framework for Analysing Mammography Concepts in Vision-Language Models

Suaiba Amina Salahuddin,Teresa Dorszewski,Marit Almenning Martiniussen,Tone Hovda,Antonio Portaluri,Solveig Thrun,Michael Kampffmeyer,Elisabeth Wetzer,Kristoffer Wickstrøm,Robert Jenssen

Main category: cs.CV

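TL;DR: This paper proposes Mammo-CLIP Dissect, the first concept-based explainability framework for mammography deep learning models. It uses a mammography-specific vision-language model to label neurons and quantify their alignment with domain knowledge, revealing how training data and fine-tuning strategy affect concept learning.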

Details Motivation: Understanding what deep learning models learn in mammography diagnosis is essential for the safe deployment of clinical AI. Pixel-based explainability methods reflect clinical reasoning poorly, so an explainability method based on textual concepts is needed to better match how radiologists think. Method: Proposes the Mammo-CLIP Dissect framework, which uses a mammography-specific vision-language model (Mammo-CLIP) as a "dissector" to label neurons at specified layers with human-interpretable textual concepts, quantifies their alignment with domain knowledge, and systematically analyzes how concept learning differs across training data and fine-tuning strategies. Result: Models trained on mammography data capture more clinically relevant concepts and align more closely with radiologists' workflows. Fine-tuning strengthens certain concept categories (e.g., benign calcifications) but can reduce coverage of others (e.g., density-related features), a trade-off between specialisation and generalisation. Conclusion: Mammo-CLIP Dissect reveals how CNNs capture domain knowledge in mammography tasks and shows that domain-specific training and task adaptation strongly shape concept learning, offering a new perspective for designing safer, more interpretable AI-assisted diagnosis systems. Abstract: Understanding what deep learning (DL) models learn is essential for the safe deployment of artificial intelligence (AI) in clinical settings. While previous work has focused on pixel-based explainability methods, less attention has been paid to the textual concepts learned by these models, which may better reflect the reasoning used by clinicians. We introduce Mammo-CLIP Dissect, the first concept-based explainability framework for systematically dissecting DL vision models trained for mammography. Leveraging a mammography-specific vision-language model (Mammo-CLIP) as a "dissector," our approach labels neurons at specified layers with human-interpretable textual concepts and quantifies their alignment to domain knowledge. Using Mammo-CLIP Dissect, we investigate three key questions: (1) how concept learning differs between DL vision models trained on general image datasets versus mammography-specific datasets; (2) how fine-tuning for downstream mammography tasks affects concept specialisation; and (3) which mammography-relevant concepts remain underrepresented. We show that models trained on mammography data capture more clinically relevant concepts and align more closely with radiologists' workflows than models not trained on mammography data. Fine-tuning for task-specific classification enhances the capture of certain concept categories (e.g., benign calcifications) but can reduce coverage of others (e.g., density-related features), indicating a trade-off between specialisation and generalisation.
Our findings show that Mammo-CLIP Dissect provides insights into how convolutional neural networks (CNNs) capture mammography-specific knowledge. By comparing models across training data and fine-tuning regimes, we reveal how domain-specific training and task-specific adaptation shape concept learning. Code and concept set are available: https://github.com/Suaiba/Mammo-CLIP-Dissect.

[129] MOSS-ChatV: Reinforcement Learning with Process Reasoning Reward for Video Temporal Reasoning

Sicheng Tao,Jungang Li,Yibo Yan,Junyan Zhang,Yubo Gao,Hanqian Li,ShuHang Xun,Yuxuan Fan,Hong Chen,Jianxiang He,Xuming Hu

Main category: cs.CV

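TL;DR: This paper proposes MOSS-ChatV, a reinforcement learning framework with a Dynamic Time Warping (DTW)-based process reward that improves the process consistency of multimodal large language models in video reasoning, and builds the MOSS-Video benchmark for evaluation.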

Details Motivation: Existing MLLMs show process inconsistency in video reasoning: intermediate reasoning drifts from the video dynamics even when the final answer is correct, which hurts interpretability and robustness. Method: Designs a DTW-based process reward that uses a rule-based procedure to align reasoning traces with temporally grounded reference text, and trains with reinforcement learning on the MOSS-Video dataset. Result: MOSS-ChatV reaches 87.2% accuracy on the MOSS-Video test set, improves on general video benchmarks such as MVBench and MMVU, generalizes across several model architectures, and produces more consistent and stable reasoning according to GPT-4o-based evaluation. Conclusion: The framework improves the consistency between a model's reasoning and the actual video dynamics without an auxiliary reward model, and is efficient and broadly applicable. Abstract: Video reasoning has emerged as a critical capability for multimodal large language models (MLLMs), requiring models to move beyond static perception toward coherent understanding of temporal dynamics in complex scenes. Yet existing MLLMs often exhibit process inconsistency, where intermediate reasoning drifts from video dynamics even when the final answer is correct, undermining interpretability and robustness. To address this issue, we introduce MOSS-ChatV, a reinforcement learning framework with a Dynamic Time Warping (DTW)-based process reward. This rule-based reward aligns reasoning traces with temporally grounded references, enabling efficient process supervision without auxiliary reward models. We further identify dynamic state prediction as a key measure of video reasoning and construct MOSS-Video, a benchmark with annotated reasoning traces, where the training split is used to fine-tune MOSS-ChatV and the held-out split is reserved for evaluation. MOSS-ChatV achieves 87.2% on MOSS-Video (test) and improves performance on general video benchmarks such as MVBench and MMVU. The framework consistently yields gains across different architectures, including Qwen2.5-VL and Phi-2, confirming its broad applicability. Evaluations with GPT-4o-as-judge further show that MOSS-ChatV produces more consistent and stable reasoning traces.
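As a rough illustration of a DTW-based process reward, the sketch below runs classic Dynamic Time Warping over two numeric sequences and maps the alignment cost to a reward in (0, 1]. The abstract does not say how reasoning traces are encoded or how the cost becomes a reward, so the numeric encoding (for example, timestamps of the events a trace mentions) and the `1 / (1 + cost)` mapping are assumptions made here for illustration.

```python
from typing import Sequence


def dtw_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Classic DTW alignment cost with absolute difference as step cost."""
    inf = float("inf")
    n, m = len(a), len(b)
    # cost[i][j] = best cost of aligning a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = abs(a[i - 1] - b[j - 1])
            cost[i][j] = step + min(
                cost[i - 1][j],      # a[i-1] repeats against b[j-1]
                cost[i][j - 1],      # b[j-1] repeats against a[i-1]
                cost[i - 1][j - 1],  # both sequences advance
            )
    return cost[n][m]


def process_reward(trace: Sequence[float], reference: Sequence[float]) -> float:
    """Hypothetical reward: 1 for a perfect alignment, decaying with cost."""
    return 1.0 / (1.0 + dtw_distance(trace, reference))


# Toy example: event timestamps (seconds) mentioned by a reasoning trace
# versus the annotated reference order.
reference_events = [1.0, 4.0, 9.0]
aligned_trace = [1.0, 4.0, 4.0, 9.0]   # repeats a step, same order
drifting_trace = [9.0, 1.0, 4.0]       # events out of order

good = process_reward(aligned_trace, reference_events)
bad = process_reward(drifting_trace, reference_events)
```

Because DTW tolerates repeated or stretched steps but penalizes reordering, a trace that follows the video's temporal order scores higher than one that drifts, which matches the rule-based, model-free supervision the abstract describes.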

[130] CARINOX: Inference-time Scaling with Category-Aware Reward-based Initial Noise Optimization and Exploration

Seyed Amir Kasaei,Ali Aghayari,Arash Marioriyad,Niki Sepasian,Shayan Baghayi Nejad,MohammadAmin Fazli,Mahdieh Soleymani Baghshah,Mohammad Hossein Rohban

Main category: cs.CV

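TL;DR: This paper proposes CARINOX, a unified framework that combines noise optimization and exploration with a reward selection procedure grounded in correlation with human judgments, significantly improving the alignment of text-to-image diffusion models on complex compositional prompts.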

Details Motivation: Existing text-to-image diffusion models struggle with compositional consistency when prompts involve complex object relationships, attributes, or spatial layouts. Current reward-guided optimization and exploration methods each have limitations, and reliable metrics for compositionality are lacking. Method: Proposes the CARINOX framework, which combines optimization and exploration of the initial noise and introduces a principled reward selection method based on correlation with human judgments to guide the search more effectively. Result: Raises average alignment scores by 16% on T2I-CompBench++ and 11% on the HRS benchmark, consistently outperforming state-of-the-art methods while preserving image quality and diversity. Conclusion: By combining optimization with exploration and selecting rewards in a more principled way, CARINOX substantially improves compositional alignment in text-to-image generation without model fine-tuning. Abstract: Text-to-image diffusion models, such as Stable Diffusion, can produce high-quality and diverse images but often fail to achieve compositional alignment, particularly when prompts describe complex object relationships, attributes, or spatial arrangements. Recent inference-time approaches address this by optimizing or exploring the initial noise under the guidance of reward functions that score text-image alignment without requiring model fine-tuning. While promising, each strategy has intrinsic limitations when used alone: optimization can stall due to poor initialization or unfavorable search trajectories, whereas exploration may require a prohibitively large number of samples to locate a satisfactory output. Our analysis further shows that neither single reward metrics nor ad-hoc combinations reliably capture all aspects of compositionality, leading to weak or inconsistent guidance. To overcome these challenges, we present Category-Aware Reward-based Initial Noise Optimization and Exploration (CARINOX), a unified framework that combines noise optimization and exploration with a principled reward selection procedure grounded in correlation with human judgments. Evaluations on two complementary benchmarks covering diverse compositional challenges show that CARINOX raises average alignment scores by +16% on T2I-CompBench++ and +11% on the HRS benchmark, consistently outperforming state-of-the-art optimization and exploration-based methods across all major categories, while preserving image quality and diversity. The project page is available at https://amirkasaei.com/carinox/.

[131] MotionFlow: Learning Implicit Motion Flow for Complex Camera Trajectory Control in Video Generation

Guojun Lei,Chi Wang,Yikai Wang,Hong Li,Ying Song,Weiwei Xu

Main category: cs.CV

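TL;DR: Proposes a new method that converts camera and object motion into pixel motion and combines a stable diffusion network with a semantic prior to generate videos that follow a specified camera trajectory with consistent object motion.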

Details Motivation: Existing methods struggle to maintain consistency and generalizability when the camera and objects move at the same time, and tend to confuse their relative motion. Method: Unifies camera and object motion as pixel motion, uses a stable diffusion network to learn reference motion maps, and feeds these maps together with a semantic object prior into an image-to-video network to generate the video. Result: Experiments show that the model outperforms state-of-the-art methods by a large margin in following the specified camera trajectory and keeping object motion consistent. Conclusion: The proposed method addresses video generation under coupled camera and object motion, achieving higher realism and consistency. Abstract: Generating videos guided by camera trajectories poses significant challenges in achieving consistency and generalizability, particularly when both camera and object motions are present. Existing approaches often attempt to learn these motions separately, which may lead to confusion regarding the relative motion between the camera and the objects. To address this challenge, we propose a novel approach that integrates both camera and object motions by converting them into the motion of corresponding pixels. Utilizing a stable diffusion network, we effectively learn reference motion maps in relation to the specified camera trajectory. These maps, along with an extracted semantic object prior, are then fed into an image-to-video network to generate the desired video that can accurately follow the designated camera trajectory while maintaining consistent object motions. Extensive experiments verify that our model outperforms SOTA methods by a large margin.

[132] The Unwinnable Arms Race of AI Image Detection

Till Aczel,Lorenzo Vettor,Andreas Plesner,Roger Wattenhofer

Main category: cs.CV

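TL;DR: This paper studies the conditions under which discriminators are most disadvantaged in the arms race between image generators and detectors, analyzing the effects of data dimensionality and data complexity.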

Details Motivation: Rapid progress in image generative AI has blurred the boundary between synthetic and real images, so it is important to understand when a discriminator's ability to detect synthetic images is limited. Method: Analyzes data dimensionality and uses Kolmogorov complexity as a measure of intrinsic dataset structure to study their effect on discriminator performance. Result: Both very simple and highly complex datasets reduce the detectability of synthetic images, whereas intermediate-complexity datasets are the most favorable for detection. Conclusion: Data complexity strongly affects discriminator performance, and intermediate-complexity data provide the best conditions for detecting synthetic images. Abstract: The rapid progress of image generative AI has blurred the boundary between synthetic and real images, fueling an arms race between generators and discriminators. This paper investigates the conditions under which discriminators are most disadvantaged in this competition. We analyze two key factors: data dimensionality and data complexity. While increased dimensionality often strengthens the discriminator's ability to detect subtle inconsistencies, complexity introduces a more nuanced effect. Using Kolmogorov complexity as a measure of intrinsic dataset structure, we show that both very simple and highly complex datasets reduce the detectability of synthetic images; generators can learn simple datasets almost perfectly, whereas extreme diversity masks imperfections. In contrast, intermediate-complexity datasets create the most favorable conditions for detection, as generators fail to fully capture the distribution and their errors remain visible.

[133] WAVECLIP: Wavelet Tokenization for Adaptive-Resolution CLIP

Moshe Kimhi,Erez Koifman,Ehud Rivlin,Eli Schwartz,Chaim Baskin

Main category: cs.CV

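TL;DR: WAVECLIP is a single unified, wavelet-based model for adaptive-resolution inference in CLIP. A multi-level wavelet decomposition lets it process images from coarse to fine and supports multiple input resolutions within the same model.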

Details Motivation: Standard CLIP models use fixed-resolution patch embeddings, which are computationally expensive and inflexible, making it hard to trade accuracy for compute at inference time. Method: Replaces standard patch embeddings with a multi-level wavelet decomposition and introduces key-value caching and causal cross-level attention. Inference starts at low resolution and refines only when needed, with confidence-based gating enabling adaptive early exits. Result: Zero-shot classification experiments confirm the approach: with only lightweight distillation it reaches competitive accuracy while substantially reducing compute. Conclusion: WAVECLIP enables adaptive-resolution inference in a single model with a flexible compute-accuracy trade-off, offering a new direction for deploying efficient vision-language models. Abstract: We introduce WAVECLIP, a single unified model for adaptive resolution inference in CLIP, enabled by wavelet-based tokenization. WAVECLIP replaces standard patch embeddings with a multi-level wavelet decomposition, enabling the model to process images coarse to fine while naturally supporting multiple resolutions within the same model. At inference time, the model begins with low resolution tokens and refines only when needed, using key-value caching and causal cross-level attention to reuse computation, effectively introducing to the model only new information when needed. We evaluate WAVECLIP in zero-shot classification, demonstrating that a simple confidence-based gating mechanism enables adaptive early exits. This allows users to dynamically choose a compute-accuracy trade-off using a single deployed model. Our approach requires only lightweight distillation from a frozen CLIP teacher and achieves competitive accuracy with significant computational savings.
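The confidence-based early exit described in the WAVECLIP abstract can be sketched as a loop over resolution levels that stops as soon as the top softmax probability clears a threshold. This is a minimal sketch under stated assumptions: `adaptive_predict`, the threshold value, and the representation of each level as a zero-argument callable returning logits are all hypothetical, and the wavelet tokenization, key-value caching, and cross-level attention are not modeled.

```python
import math
from typing import Callable, List, Sequence, Tuple


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def adaptive_predict(
    levels: Sequence[Callable[[], Sequence[float]]],
    threshold: float = 0.8,
) -> Tuple[int, int]:
    """Return (predicted_class, level_used).

    `levels` is ordered coarse to fine. Each entry computes logits lazily,
    so finer (more expensive) levels run only if earlier ones are unsure.
    The finest level always produces an answer.
    """
    if not levels:
        raise ValueError("at least one resolution level is required")
    last = len(levels) - 1
    for level, compute_logits in enumerate(levels):
        probs = softmax(compute_logits())
        confidence = max(probs)
        if confidence >= threshold or level == last:
            return probs.index(confidence), level
    raise AssertionError("unreachable: the last level always returns")


# Toy example: the coarse level is unsure, the fine level is confident.
prediction, level_used = adaptive_predict(
    [lambda: [1.0, 1.1, 0.9], lambda: [0.0, 5.0, 0.0]],
    threshold=0.8,
)
```

Raising the threshold sends more inputs to the finer levels (higher accuracy, more compute); lowering it exits earlier, which is the compute-accuracy dial the abstract refers to.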

[134] Can Less Precise Be More Reliable? A Systematic Evaluation of Quantization's Impact on CLIP Beyond Accuracy

Aymen Bouguerra,Daniel Montoya,Alexandra Gomez-Villa,Fabio Arnez,Chokri Mraidha

Main category: cs.CV

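TL;DR: This paper studies how quantization affects reliability metrics of CLIP models beyond accuracy. Quantization can improve calibration for some models without harming, and sometimes while improving, out-of-distribution detection, and quantization-aware training can deliver gains in both efficiency and performance.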

Details Motivation: Vision-language models such as CLIP show zero-shot generalization on safety-related tasks like out-of-distribution detection, but efficient and reliable deployment remains challenging, and the effect of quantization on their behavior is still unclear. Method: Runs a large-scale quantization evaluation across many CLIP models covering classification accuracy, calibration, and out-of-distribution detection, analyzes the influence of pre-training source, and explores quantization-aware training. Result: Quantization consistently improves calibration for typically underconfident models but degrades it for overconfident ones; even when calibration degrades, OOD detection can still improve; specific quantization-aware training methods improve zero-shot accuracy, calibration, and OOD robustness simultaneously. Conclusion: Quantization is not only an efficiency tool: it can also improve reliability and robustness in a multi-objective setting, challenging the conventional efficiency-performance trade-off and offering a new perspective on deploying efficient, reliable vision-language models. Abstract: The powerful zero-shot generalization capabilities of vision-language models (VLMs) like CLIP have enabled new paradigms for safety-related tasks such as out-of-distribution (OOD) detection. However, additional aspects crucial for the computationally efficient and reliable deployment of CLIP are still overlooked. In particular, the impact of quantization on CLIP's performance beyond accuracy remains underexplored. This work presents a large-scale evaluation of quantization on CLIP models, assessing not only in-distribution accuracy but a comprehensive suite of reliability metrics and revealing counterintuitive results driven by pre-training source. We demonstrate that quantization consistently improves calibration for typically underconfident pre-trained models, while often degrading it for overconfident variants. Intriguingly, this degradation in calibration does not preclude gains in other reliability metrics; we find that OOD detection can still improve for these same poorly calibrated models. Furthermore, we identify specific quantization-aware training (QAT) methods that yield simultaneous gains in zero-shot accuracy, calibration, and OOD robustness, challenging the view of a strict efficiency-performance trade-off. These findings offer critical insights for navigating the multi-objective problem of deploying efficient, reliable, and robust VLMs by utilizing quantization beyond its conventional role.
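Calibration, one of the reliability metrics discussed above, is commonly summarized by the expected calibration error (ECE). The sketch below shows the standard binned ECE as a point of reference; the abstract does not state which calibration metric or binning the authors use, so treating ECE with ten equal-width bins as the metric is an assumption.

```python
from typing import List, Sequence, Tuple


def expected_calibration_error(
    confidences: Sequence[float],
    correct: Sequence[bool],
    n_bins: int = 10,
) -> float:
    """Binned ECE: weighted mean of |accuracy - mean confidence| per bin.

    Bins are equal-width over [0, 1]; a confidence of exactly 1.0 falls
    into the last bin.
    """
    bins: List[List[Tuple[float, bool]]] = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        index = min(int(conf * n_bins), n_bins - 1)
        bins[index].append((conf, ok))

    total = len(confidences)
    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        mean_conf = sum(conf for conf, _ in bucket) / len(bucket)
        accuracy = sum(1 for _, ok in bucket if ok) / len(bucket)
        ece += (len(bucket) / total) * abs(accuracy - mean_conf)
    return ece


# Toy example: an overconfident model (90% confidence, 50% accuracy).
overconfident = expected_calibration_error([0.9, 0.9], [True, False])
# A well-calibrated case: full confidence and always correct.
calibrated = expected_calibration_error([1.0, 1.0], [True, True])
```

Comparing this score before and after quantization, separately for models whose mean confidence sits below or above their accuracy, would be one way to reproduce the under- versus overconfident contrast the abstract reports.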

[135] TABLET: A Large-Scale Dataset for Robust Visual Table Understanding

Iñigo Alonso,Imanol Miranda,Eneko Agirre,Mirella Lapata

Main category: cs.CV

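TL;DR: TABLET is a large-scale visual table understanding (VTU) dataset with 4 million examples across 20 tasks, built on 2 million unique tables, 88% of which preserve their original visualization. It provides paired image-HTML representations, metadata, and provenance information, supporting more robust training and extensible evaluation of vision-language models.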

Details Motivation: Existing VTU benchmarks mostly use synthetic table renderings that lack real-world complexity, and their datasets are fixed with no access to the underlying data, which limits generalization to real scenarios. A large-scale, traceable dataset based on real visualizations is needed to advance VTU research. Method: Builds the TABLET dataset by collecting 2 million real tables with their original visualizations and generating 4 million paired image-HTML examples across 20 tasks, with detailed metadata and provenance information; fine-tunes vision-language models such as Qwen2.5-VL-7B on the dataset. Result: Models fine-tuned on TABLET improve on both seen and unseen VTU tasks and are more robust on real-world table images. Conclusion: By preserving original visual features and data traceability, TABLET provides a more realistic, extensible foundation for training and evaluating VTU models and improves their performance in real-world settings. Abstract: While table understanding increasingly relies on pixel-only settings where tables are processed as visual representations, current benchmarks predominantly use synthetic renderings that lack the complexity and visual diversity of real-world tables. Additionally, existing visual table understanding (VTU) datasets offer fixed examples with single visualizations and pre-defined instructions, providing no access to underlying serialized data for reformulation. We introduce TABLET, a large-scale VTU dataset with 4 million examples across 20 tasks, grounded in 2 million unique tables where 88% preserve original visualizations. Each example includes paired image-HTML representations, comprehensive metadata, and provenance information linking back to the source datasets. Fine-tuning vision-language models like Qwen2.5-VL-7B on TABLET improves performance on seen and unseen VTU tasks while increasing robustness on real-world table visualizations. By preserving original visualizations and maintaining example traceability in a unified large-scale collection, TABLET establishes a foundation for robust training and extensible evaluation of future VTU models.

[136] Sigma: Semantically Informative Pre-training for Skeleton-based Sign Language Understanding

Muxin Pu,Mei Kuan Lim,Chun Yong Chong,Chen Change Loy

Main category: cs.CV

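TL;DR: This paper proposes Sigma, a unified skeleton-based sign language understanding framework. Through sign-aware early fusion, hierarchical alignment learning, and multi-task pre-training, it addresses the limitations of current methods in semantic grounding, the balance between local and global information, and cross-modal learning, and reaches state-of-the-art performance on multiple tasks.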

Details Motivation: Existing sign language understanding methods suffer from weak semantic grounding, an imbalance between local details and global context, and inefficient cross-modal learning; a unified framework that fully exploits skeletal data and achieves strong semantic alignment is needed. Method: The Sigma framework comprises 1) a sign-aware early fusion mechanism that enriches visual features with linguistic context; 2) a hierarchical alignment learning strategy that jointly optimizes cross-modal feature matching at multiple levels; and 3) a unified pre-training framework that combines contrastive learning, text matching, and language modelling. Result: Sigma sets new state-of-the-art results on isolated sign language recognition, continuous sign language recognition, and gloss-free translation across multiple benchmarks covering different sign and spoken languages. Conclusion: Semantically informative pre-training and a stand-alone skeleton-based solution markedly improve sign language understanding, and Sigma offers an effective route to cross-modal semantic alignment. Abstract: Pre-training has proven effective for learning transferable features in sign language understanding (SLU) tasks. Recently, skeleton-based methods have gained increasing attention because they can robustly handle variations in subjects and backgrounds without being affected by appearance or environmental factors. Current SLU methods continue to face three key limitations: 1) weak semantic grounding, as models often capture low-level motion patterns from skeletal data but struggle to relate them to linguistic meaning; 2) imbalance between local details and global context, with models either focusing too narrowly on fine-grained cues or overlooking them for broader context; and 3) inefficient cross-modal learning, as constructing semantically aligned representations across modalities remains difficult. To address these, we propose Sigma, a unified skeleton-based SLU framework featuring: 1) a sign-aware early fusion mechanism that facilitates deep interaction between visual and textual modalities, enriching visual features with linguistic context; 2) a hierarchical alignment learning strategy that jointly maximises agreements across different levels of paired features from different modalities, effectively capturing both fine-grained details and high-level semantic relationships; and 3) a unified pre-training framework that combines contrastive learning, text matching and language modelling to promote semantic consistency and generalisation.
Sigma achieves new state-of-the-art results on isolated sign language recognition, continuous sign language recognition, and gloss-free sign language translation on multiple benchmarks spanning different sign and spoken languages, demonstrating the impact of semantically informative pre-training and the effectiveness of skeletal data as a stand-alone solution for SLU.

[137] Learning Conformal Explainers for Image Classifiers

Amr Alkhatib,Stephanie Lowry

Main category: cs.CV

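TL;DR: Proposes a new conformal prediction-based method for controlling the fidelity of explanations of image predictions; it identifies a subset of key features that preserves the model's prediction and needs no ground-truth explanations for calibration.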

Details Motivation: Existing feature attribution methods fall short in robustness and in faithfulness to the model's reasoning, so a method that makes explanation fidelity quantifiable and controllable is needed. Method: A conformal prediction-based approach with four conformity functions that measure how well explanations agree with the model's predictions; it identifies a subset of salient features sufficient to preserve the model's prediction, without access to ground-truth explanations for calibration. Result: Five explainers are evaluated on six image datasets; FastSHAP outperforms the competing methods in both fidelity and informational efficiency (size of the explanation region), and super-pixel-based conformity measures are more effective than pixel-wise ones. Conclusion: The method improves the fidelity and efficiency of explanations and provides more reliable, controllable visual explanations for black-box models. Abstract: Feature attribution methods are widely used for explaining image-based predictions, as they provide feature-level insights that can be intuitively visualized. However, such explanations often vary in their robustness and may fail to faithfully reflect the reasoning of the underlying black-box model. To address these limitations, we propose a novel conformal prediction-based approach that enables users to directly control the fidelity of the generated explanations. The method identifies a subset of salient features that is sufficient to preserve the model's prediction, regardless of the information carried by the excluded features, and without demanding access to ground-truth explanations for calibration. Four conformity functions are proposed to quantify the extent to which explanations conform to the model's predictions. The approach is empirically evaluated using five explainers across six image datasets. The empirical results demonstrate that FastSHAP consistently outperforms the competing methods in terms of both fidelity and informational efficiency, the latter measured by the size of the explanation regions. Furthermore, the results reveal that conformity measures based on super-pixels are more effective than their pixel-wise counterparts.
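The calibration step behind a conformal guarantee can be sketched with the standard split-conformal quantile. This is a generic illustration, not one of the paper's four conformity functions. The score used here is a hypothetical stand-in: for each calibration image, the smallest fraction of top-ranked features that must be kept for the model's prediction to stay unchanged.

```python
import math


def conformal_threshold(calibration_scores, alpha):
    """Split-conformal quantile of nonconformity scores.

    Returns the k-th smallest score with k = ceil((n + 1) * (1 - alpha)),
    or infinity when the calibration set is too small for the requested level.
    """
    n = len(calibration_scores)
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        return math.inf
    return sorted(calibration_scores)[k - 1]


# Hypothetical calibration scores: fraction of features needed per image.
needed_fraction = [0.3, 0.1, 0.2]
keep_fraction = conformal_threshold(needed_fraction, alpha=0.5)
```

Under the usual exchangeability assumption, keeping `keep_fraction` of the top-ranked features on a new image preserves the prediction with probability at least `1 - alpha`. A smaller `alpha` asks for higher fidelity and therefore a larger explanation region, which is the fidelity versus region-size trade-off the abstract describes.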

[138] Evaluating the Evaluators: Metrics for Compositional Text-to-Image Generation

Seyed Amir Kasaei,Ali Aghayari,Arash Marioriyad,Niki Sepasian,MohammadAmin Fazli,Mahdieh Soleymani Baghshah,Mohammad Hossein Rohban

Main category: cs.CV

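TL;DR: This paper studies how well commonly used evaluation metrics for text-image generation agree with human judgment. No single metric is consistent across all compositional tasks; different metrics have different strengths on different task types, so metrics should be chosen with care.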

Details Motivation: Evaluation of text-image generation relies heavily on automated metrics, but whether they truly reflect human preferences is unclear, so a systematic assessment of their validity is needed. Method: A broad study of widely used metrics for compositional text-image evaluation that goes beyond simple correlation analysis, examining metric behavior across diverse compositional challenges and comparing how well different metric families align with human judgments. Result: No metric performs best consistently across all tasks; VQA-based metrics are not uniformly superior, certain embedding-based metrics are stronger in specific cases, and image-only metrics contribute little to compositional evaluation. Conclusion: Metrics should be selected carefully and transparently for the task at hand, both to keep evaluation trustworthy and to use them effectively in generative models. Abstract: Text-image generation has advanced rapidly, but assessing whether outputs truly capture the objects, attributes, and relations described in prompts remains a central challenge. Evaluation in this space relies heavily on automated metrics, yet these are often adopted by convention or popularity rather than validated against human judgment. Because evaluation and reported progress in the field depend directly on these metrics, it is critical to understand how well they reflect human preferences. To address this, we present a broad study of widely used metrics for compositional text-image evaluation. Our analysis goes beyond simple correlation, examining their behavior across diverse compositional challenges and comparing how different metric families align with human judgments. The results show that no single metric performs consistently across tasks: performance varies with the type of compositional problem. Notably, VQA-based metrics, though popular, are not uniformly superior, while certain embedding-based metrics prove stronger in specific cases. Image-only metrics, as expected, contribute little to compositional evaluation, as they are designed for perceptual quality rather than alignment. These findings underscore the importance of careful and transparent metric selection, both for trustworthy evaluation and for their use as reward models in generation. The project page is available at https://amirkasaei.com/eval-the-evals/.
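The "simple correlation" baseline that the study builds on can be sketched as a Spearman rank correlation between a metric's scores and human ratings. The function names and toy numbers below are illustrative assumptions; the paper's own analysis protocol is not reproduced here.

```python
import math


def average_ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; raises if either input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (sx * sy)


def spearman(metric_scores, human_ratings):
    """Spearman correlation = Pearson correlation of the rank vectors."""
    return pearson(average_ranks(metric_scores), average_ranks(human_ratings))
```

Computing `spearman` separately per compositional category (for example attribute binding versus spatial relations) rather than once over the pooled data is one simple way to see the task-dependent behavior the abstract reports.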

[139] Hallucination as an Upper Bound: A New Perspective on Text-to-Image Evaluation

Seyed Amir Kasaei,Mohammad Hossein Rohban

Main category: cs.CV

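TL;DR: This paper proposes a definition of "hallucination" for text-to-image generative models and divides it into three categories: attribute, relation, and object. It stresses that existing evaluations overlook what the model generates beyond the prompt and suggests understanding and assessing hallucination as bias-driven deviation.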

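Details Motivation: Existing evaluations of text-to-image models focus on whether prompt content is aligned but overlook content generated beyond the prompt, and there is no systematic definition or analysis of generation bias and hallucination. Method: Define hallucination as a generation deviation driven by model bias and build a taxonomy with three categories (attribute, relation, and object) for a more complete evaluation of text-to-image models. Result: A classification framework for hallucination in text-to-image models that surfaces hidden model biases and provides a theoretical basis and an upper bound for subsequent evaluation. Conclusion: A bias-driven definition and taxonomy of hallucination enables deeper assessment of text-to-image model behavior and a fuller understanding of generated content. Abstract: In language and vision-language models, hallucination is broadly understood as content generated from a model's prior knowledge or biases rather than from the given input. While this phenomenon has been studied in those domains, it has not been clearly framed for text-to-image (T2I) generative models. Existing evaluations mainly focus on alignment, checking whether prompt-specified elements appear, but overlook what the model generates beyond the prompt. We argue for defining hallucination in T2I as bias-driven deviations and propose a taxonomy with three categories: attribute, relation, and object hallucinations. This framing introduces an upper bound for evaluation and surfaces hidden biases, providing a foundation for richer assessment of T2I models.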

[140] SlideMamba: Entropy-Based Adaptive Fusion of GNN and Mamba for Enhanced Representation Learning in Digital Pathology

Shakib Khan,Fariba Dambandkhameneh,Nazim Shaikh,Yao Nie,Raghavan Venugopal,Xiao Li

Main category: cs.CV

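TL;DR: This paper proposes SlideMamba, a deep learning framework that combines the Mamba architecture with graph neural networks (GNNs) and uses an entropy-based adaptive fusion strategy to integrate long-range dependencies with local spatial relationships for whole slide image (WSI) analysis. It outperforms several existing methods on gene fusion and mutation prediction.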

Details Motivation: Improving whole slide image (WSI) analysis requires modeling both local spatial relationships and long-range contextual dependencies; existing methods struggle to capture both, so a generalizable fusion framework is needed. Method: The SlideMamba framework combines Mamba modules (for long-range dependencies) with GNNs (for fine-grained local interactions) and an adaptive fusion strategy based on prediction entropy that dynamically balances the two branches. Result: On gene fusion and mutation prediction, SlideMamba reaches a PRAUC of 0.751 ± 0.05, outperforming MIL, Trans-MIL, Mamba-only, GNN-only, and GAT-Mamba; ROC AUC, sensitivity, and specificity are also competitive. Conclusion: Integrating Mamba and GNNs with entropy-driven adaptive fusion improves WSI analysis and shows potential for spatially resolved predictive modeling in computational pathology. Abstract: Advances in computational pathology increasingly rely on extracting meaningful representations from Whole Slide Images (WSIs) to support various clinical and biological tasks. In this study, we propose a generalizable deep learning framework that integrates the Mamba architecture with Graph Neural Networks (GNNs) for enhanced WSI analysis. Our method is designed to capture both local spatial relationships and long-range contextual dependencies, offering a flexible architecture for digital pathology analysis. Mamba modules excel in capturing long-range global dependencies, while GNNs emphasize fine-grained short-range spatial interactions. To effectively combine these complementary signals, we introduce an adaptive fusion strategy that uses an entropy-based confidence weighting mechanism. This approach dynamically balances contributions from both branches by assigning higher weight to the branch with more confident (lower-entropy) predictions, depending on the contextual importance of local versus global information for different downstream tasks. We demonstrate the utility of our approach on a representative task: predicting gene fusion and mutation status from WSIs. Our framework, SlideMamba, achieves an area under the precision recall curve (PRAUC) of 0.751 ± 0.05, outperforming MIL (0.491 ± 0.042), Trans-MIL (0.39 ± 0.017), Mamba-only (0.664 ± 0.063), GNN-only (0.748 ± 0.091), and a prior similar work GAT-Mamba (0.703 ± 0.075). SlideMamba also achieves competitive results across ROC AUC (0.738 ± 0.055), sensitivity (0.662 ± 0.083), and specificity (0.725 ± 0.094).
These results highlight the strength of the integrated architecture, enhanced by the proposed entropy-based adaptive fusion strategy, and suggest promising potential for spatially resolved predictive modeling tasks in computational pathology.
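The entropy-based confidence weighting can be illustrated with a small sketch. The abstract does not give the exact formula, so the normalized confidence `1 - H / log(K)` and the function names below are assumptions made only for illustration.

```python
import math


def entropy(probs):
    """Shannon entropy (natural log) of a discrete distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def entropy_weighted_fusion(probs_a, probs_b):
    """Fuse two class distributions, giving the lower-entropy branch more weight.

    Returns (fused_distribution, weight_of_branch_a).
    """
    max_entropy = math.log(len(probs_a))
    conf_a = 1.0 - entropy(probs_a) / max_entropy
    conf_b = 1.0 - entropy(probs_b) / max_entropy
    total = conf_a + conf_b
    # If both branches are maximally uncertain, fall back to equal weights.
    weight_a = 0.5 if total == 0 else conf_a / total
    fused = [weight_a * a + (1.0 - weight_a) * b for a, b in zip(probs_a, probs_b)]
    return fused, weight_a


# Hypothetical outputs: a confident global (Mamba) branch and an unsure local (GNN) branch.
fused, weight_mamba = entropy_weighted_fusion([0.9, 0.1], [0.5, 0.5])
```

Because the weight is recomputed per sample, the same model can lean on the local branch for one slide and on the global branch for another, which is the dynamic balancing the abstract describes.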

[141] Hunyuan3D-Omni: A Unified Framework for Controllable Generation of 3D Assets

Team Hunyuan3D,Bowen Zhang,Chunchao Guo,Haolin Liu,Hongyu Yan,Huiwen Shi,Jingwei Huang,Junlin Yu,Kunhong Li,Linus,Penghao Wang,Qingxiang Lin,Sicong Liu,Xianghui Yang,Yixuan Tang,Yunfei Zhao,Zeqiang Lai,Zhihao Liang,Zibo Zhao

Main category: cs.CV

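TL;DR: Hunyuan3D-Omni is a unified framework built on Hunyuan3D 2.1 that accepts multiple input modalities, including images, point clouds, voxels, bounding boxes, and skeletal poses, for fine-grained, controllable 3D asset generation.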

Details Motivation: Existing 3D generative models rely mainly on image or text conditioning and lack fine-grained cross-modal control, which limits controllability and practical use. Method: A unified cross-modal architecture that fuses several conditioning signals (point clouds, voxels, bounding boxes, skeletal poses), trained with a progressive, difficulty-aware sampling strategy that favors harder modalities (e.g., skeletal pose) and downweights easier ones (e.g., point clouds). Result: Experiments show improved generation accuracy, support for geometry-aware transformations, and greater robustness in production workflows. Conclusion: By unifying multi-modal inputs and optimizing the training strategy, Hunyuan3D-Omni achieves more precise, controllable, and robust 3D asset generation, making it more practical for games, film, and design. Abstract: Recent advances in 3D-native generative models have accelerated asset creation for games, film, and design. However, most methods still rely primarily on image or text conditioning and lack fine-grained, cross-modal controls, which limits controllability and practical adoption. To address this gap, we present Hunyuan3D-Omni, a unified framework for fine-grained, controllable 3D asset generation built on Hunyuan3D 2.1. In addition to images, Hunyuan3D-Omni accepts point clouds, voxels, bounding boxes, and skeletal pose priors as conditioning signals, enabling precise control over geometry, topology, and pose. Instead of separate heads for each modality, our model unifies all signals in a single cross-modal architecture. We train with a progressive, difficulty-aware sampling strategy that selects one control modality per example and biases sampling toward harder signals (e.g., skeletal pose) while downweighting easier ones (e.g., point clouds), encouraging robust multi-modal fusion and graceful handling of missing inputs. Experiments show that these additional controls improve generation accuracy, enable geometry-aware transformations, and increase robustness for production workflows.
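The "one control modality per example, biased toward harder signals" idea can be sketched as weighted sampling. The difficulty scores below are hypothetical placeholders (the abstract gives no numbers), and the progressive schedule over training is not modeled.

```python
import random

# Hypothetical difficulty scores; the abstract only states that skeletal pose
# is treated as harder and point clouds as easier.
MODALITY_DIFFICULTY = {
    "skeletal_pose": 4.0,
    "bounding_box": 2.0,
    "voxel": 2.0,
    "point_cloud": 1.0,
}


def sampling_probabilities(difficulty=MODALITY_DIFFICULTY):
    """Normalize difficulty scores into sampling probabilities."""
    total = sum(difficulty.values())
    return {name: score / total for name, score in difficulty.items()}


def sample_control_modality(rng, difficulty=MODALITY_DIFFICULTY):
    """Pick exactly one control modality for a training example."""
    names = list(difficulty)
    weights = [difficulty[name] for name in names]
    return rng.choices(names, weights=weights, k=1)[0]


rng = random.Random(0)  # fixed seed so the sketch is reproducible
batch_modalities = [sample_control_modality(rng) for _ in range(8)]
```

Since each example sees only one control signal, the model also learns to operate when the other signals are absent, which matches the abstract's remark about graceful handling of missing inputs.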

[142] Learning to Look: Cognitive Attention Alignment with Vision-Language Models

Ryan L. Yang,Dipkamal Bhusal,Nidhi Rastogi

Main category: cs.CV

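TL;DR: Proposes a scalable framework that uses vision-language models to generate semantic attention maps automatically and aligns CNN attention with them through an auxiliary loss, improving generalization and cognitive plausibility.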

Details Motivation: Existing methods guide model attention with expert-annotated concept supervision and explanation regularization, but the annotations are costly and hard to scale. Method: Use vision-language models and natural language prompts to generate semantic attention maps automatically, and design an auxiliary loss that aligns CNN attention with these language-guided maps. Result: State-of-the-art performance on ColoredMNIST and results on DecoyMNIST comparable to annotation-heavy methods, with less reliance on shortcuts and attention that better matches human intuition. Conclusion: The method improves model reliability and cognitive interpretability without manual annotation and scales well. Abstract: Convolutional Neural Networks (CNNs) frequently "cheat" by exploiting superficial correlations, raising concerns about whether they make predictions for the right reasons. Inspired by cognitive science, which highlights the role of attention in robust human perception, recent methods have sought to guide model attention using concept-based supervision and explanation regularization. However, these techniques depend on labor-intensive, expert-provided annotations, limiting their scalability. We propose a scalable framework that leverages vision-language models to automatically generate semantic attention maps using natural language prompts. By introducing an auxiliary loss that aligns CNN attention with these language-guided maps, our approach promotes more reliable and cognitively plausible decision-making without manual annotation. Experiments on challenging datasets, ColoredMNIST and DecoyMNIST, show that our method achieves state-of-the-art performance on ColoredMNIST and remains competitive with annotation-heavy baselines on DecoyMNIST, demonstrating improved generalization, reduced shortcut reliance, and model attention that better reflects human intuition.

[143] Decipher-MR: A Vision-Language Foundation Model for 3D MRI Representations

Zhijian Yang,Noel DSouza,Istvan Megyeri,Xiaojian Xu,Amin Honarmandi Shandiz,Farzin Haddadpour,Krisztian Koos,Laszlo Rusko,Emanuele Valeriano,Bharadwaj Swaninathan,Lei Wu,Parminder Bhatia,Taha Kass-Hout,Erhan Bas

Main category: cs.CV

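TL;DR: Decipher-MR is a 3D MRI-specific vision-language foundation model trained on large-scale, multi-region MRI data. It combines self-supervised vision learning with report-guided text supervision, supports modular lightweight task decoders, and performs strongly across a range of clinical tasks.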

Details Motivation: Because of data scarcity and a narrow anatomical focus, existing foundation models are of limited use for MRI and lack scalability and generalization. Method: Decipher-MR is trained jointly with self-supervised vision learning and report-guided text supervision, and uses a modular design in which lightweight, tunable task-specific decoders are attached to a frozen pretrained encoder. Result: On disease classification, demographic prediction, anatomical localization, and cross-modal retrieval, Decipher-MR consistently outperforms existing foundation models and task-specific approaches. Conclusion: Decipher-MR is a scalable, versatile foundation model for MRI that supports efficient development of MRI-based AI in clinical and research settings. Abstract: Magnetic Resonance Imaging (MRI) is a critical medical imaging modality in clinical diagnosis and research, yet its complexity and heterogeneity pose challenges for automated analysis, particularly in scalable and generalizable machine learning applications. While foundation models have revolutionized natural language and vision tasks, their application to MRI remains limited due to data scarcity and narrow anatomical focus. In this work, we present Decipher-MR, a 3D MRI-specific vision-language foundation model trained on a large-scale dataset comprising 200,000 MRI series from over 22,000 studies spanning diverse anatomical regions, sequences, and pathologies. Decipher-MR integrates self-supervised vision learning with report-guided text supervision to build robust, generalizable representations, enabling effective adaptation across broad applications. To enable robust and diverse clinical tasks with minimal computational overhead, Decipher-MR supports a modular design that enables tuning of lightweight, task-specific decoders attached to a frozen pretrained encoder. Following this setting, we evaluate Decipher-MR across diverse benchmarks including disease classification, demographic prediction, anatomical localization, and cross-modal retrieval, demonstrating consistent performance gains over existing foundation models and task-specific approaches. Our results establish Decipher-MR as a scalable and versatile foundation for MRI-based AI, facilitating efficient development across clinical and research domains.

[144] Instruction-tuned Self-Questioning Framework for Multimodal Reasoning

You-Won Jang,Yu-Jung Heo,Jaeseok Kim,Minsu Lee,Du-Seong Chang,Byoung-Tak Zhang

Main category: cs.CV

TL;DR: Proposes SQ-InstructBLIP, which improves multi-step reasoning in visual question answering by generating image-aware sub-questions and sub-answers.

Details Motivation: Existing methods are limited on vision-language understanding tasks that need multi-step reasoning: they cannot access fine-grained visual information, and black-box models are hard to reproduce. Method: SQ-InstructBLIP consists of a Questioner, an Answerer, and a Reasoner that share the same architecture; it iteratively generates sub-questions and sub-answers and reasons over them together with the main question. Result: Experiments show that SQ-InstructBLIP, using the generated sub-questions as additional information, reasons more accurately than previous methods. Conclusion: SQ-InstructBLIP improves multi-step reasoning in visual question answering and addresses the shortcomings of existing methods in using visual content and in reproducibility. Abstract: The field of vision-language understanding has been actively researched in recent years, thanks to the development of Large Language Models (LLMs). However, it still struggles with problems requiring multi-step reasoning, even for very simple questions. Recent studies adopt LLMs to tackle this problem by iteratively generating sub-questions and answers. However, there are disadvantages such as 1) the fine-grained visual contents of images are not available using LLMs that cannot read visual information, 2) internal mechanisms are inaccessible and difficult to reproduce by using black-box LLMs. To solve these problems, we propose the SQ (Self-Questioning)-InstructBLIP, which improves inference performance by generating image-aware informative sub-questions and sub-answers iteratively. SQ-InstructBLIP consists of a Questioner, Answerer, and Reasoner that share the same architecture. Questioner and Answerer generate sub-questions and sub-answers to help infer the main-question, and Reasoner performs reasoning on the main-question considering the generated sub-question information. Our experiments show that the proposed method SQ-InstructBLIP, which uses the generated sub-questions as additional information when solving the VQA task, performs more accurate reasoning than the previous works.

[145] Every Subtlety Counts: Fine-grained Person Independence Micro-Action Recognition via Distributionally Robust Optimization

Feng-Qi Cui,Jinyang Huang,Anyang Tong,Ziyu Jia,Jie Zhang,Zhi Liu,Dan Guo,Jianwei Lu,Meng Wang

Main category: cs.CV

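TL;DR: Proposes a person-independent universal micro-action recognition framework that learns person-agnostic representations through distributionally robust optimization, with two plug-and-play modules at the feature and loss levels that improve accuracy and robustness in real-world scenarios.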

Details Motivation: In existing micro-action recognition methods, inter-person variability makes the same action appear differently, which prevents robust generalization in real-world scenarios. Method: The Person Independence Universal Micro-action Recognition Framework includes a Temporal-Frequency Alignment Module (a temporal branch that aligns dynamic trajectories with Wasserstein regularization and a frequency branch that introduces variance-guided perturbations) and a Group-Invariant Regularized Loss (pseudo-groups simulate unseen person-specific distributions, boundary samples are up-weighted, and subgroup variance is regularized). Result: Experiments on the large-scale MA-52 dataset show that the framework outperforms existing methods in both accuracy and robustness and generalizes stably under fine-grained conditions. Conclusion: The framework mitigates the effect of individual differences on micro-action recognition and achieves more robust generalization through joint feature-level and loss-level optimization. Abstract: Micro-action Recognition is vital for psychological assessment and human-computer interaction. However, existing methods often fail in real-world scenarios because inter-person variability causes the same action to manifest differently, hindering robust generalization. To address this, we propose the Person Independence Universal Micro-action Recognition Framework, which integrates Distributionally Robust Optimization principles to learn person-agnostic representations. Our framework contains two plug-and-play components operating at the feature and loss levels. At the feature level, the Temporal-Frequency Alignment Module normalizes person-specific motion characteristics with a dual-branch design: the temporal branch applies Wasserstein-regularized alignment to stabilize dynamic trajectories, while the frequency branch introduces variance-guided perturbations to enhance robustness against person-specific spectral differences. A consistency-driven fusion mechanism integrates both branches. At the loss level, the Group-Invariant Regularized Loss partitions samples into pseudo-groups to simulate unseen person-specific distributions. By up-weighting boundary cases and regularizing subgroup variance, it forces the model to generalize beyond easy or frequent samples, thus enhancing robustness to difficult variations. Experiments on the large-scale MA-52 dataset demonstrate that our framework outperforms existing methods in both accuracy and robustness, achieving stable generalization under fine-grained conditions.

[146] Dense Semantic Matching with VGGT Prior

Songlin Yang,Tianyi Wei,Yushi Lan,Zeqi Xiao,Anyi Rao,Xingang Pan

Main category: cs.CV

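TL;DR: This paper proposes a semantic matching method built on the 3D geometric foundation model VGGT. By reusing early features, fine-tuning later ones, adding a semantic head, and combining cycle-consistent training with synthetic data augmentation, it achieves cross-instance pixel-level semantic matching under data scarcity and improves geometry awareness and matching reliability.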

Details Motivation: Existing semantic matching methods suffer from geometric ambiguity and reliance on the nearest-neighbor rule; they struggle with symmetric structures, do not preserve manifolds, and are constrained by the scarcity of dense semantic annotations. Method: Reuse VGGT's early features, fine-tune the later ones, and add a semantic head for bidirectional correspondences; adapt to semantic matching under data scarcity with cycle-consistent training, synthetic data augmentation, and a progressive training recipe. Result: Experiments show that the method is superior to previous approaches in geometry awareness, matching reliability, and manifold preservation, and outperforms previous baselines. Conclusion: The method adapts VGGT, originally designed for geometry matching, to semantic matching, addressing the challenges of cross-instance matching and annotation scarcity. Abstract: Semantic matching aims to establish pixel-level correspondences between instances of the same category and represents a fundamental task in computer vision. Existing approaches suffer from two limitations: (i) Geometric Ambiguity: Their reliance on 2D foundation model features (e.g., Stable Diffusion, DINO) often fails to disambiguate symmetric structures, requiring extra fine-tuning yet lacking generalization; (ii) Nearest-Neighbor Rule: Their pixel-wise matching ignores cross-image invisibility and neglects manifold preservation. These challenges call for geometry-aware pixel descriptors and holistic dense correspondence mechanisms. Inspired by recent advances in 3D geometric foundation models, we turn to VGGT, which provides geometry-grounded features and holistic dense matching capabilities well aligned with these needs. However, directly transferring VGGT is challenging, as it was originally designed for geometry matching within cross views of a single instance, misaligned with cross-instance semantic matching, and further hindered by the scarcity of dense semantic annotations. To address this, we propose an approach that (i) retains VGGT's intrinsic strengths by reusing early feature stages, fine-tuning later ones, and adding a semantic head for bidirectional correspondences; and (ii) adapts VGGT to the semantic matching scenario under data scarcity through cycle-consistent training strategy, synthetic data augmentation, and progressive training recipe with aliasing artifact mitigation. Extensive experiments demonstrate that our approach achieves superior geometry awareness, matching reliability, and manifold preservation, outperforming previous baselines.

[147] MedVSR: Medical Video Super-Resolution with Cross State-Space Propagation

Xinyu Liu,Guolei Sun,Cheng Wang,Yixuan Yuan,Ender Konukoglu

Main category: cs.CV

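TL;DR: This paper proposes MedVSR, a new framework for medical video super-resolution (VSR) that handles the camera shake, noise, and abrupt frame transitions found in low-resolution medical videos and improves both the quality and the efficiency of high-resolution reconstruction.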

Details Motivation: Hardware limitations and physiological constraints make high-resolution medical videos hard to acquire, and clinically collected low-resolution videos exhibit camera shake, noise, and abrupt frame transitions. Existing VSR methods therefore perform poorly at feature alignment and structure recovery and tend to introduce artifacts that can mislead doctors. Method: MedVSR has two core modules. Cross State-Space Propagation (CSSP) projects distant frames as control matrices within state-space models so that consistent features are propagated selectively across frames, improving alignment. Inner State-Space Reconstruction (ISSR) combines long-range spatial feature learning with large-kernel short-range information aggregation to enhance tissue structures and suppress artifacts. Result: On four medical video datasets, including endoscopy and cataract surgery, MedVSR significantly outperforms existing VSR methods in reconstruction quality (e.g., PSNR, SSIM) and computational efficiency. Conclusion: Through state-space modeling, MedVSR addresses alignment errors and structural distortion in medical video super-resolution and provides clearer, more reliable reconstructions for clinical diagnosis. Abstract: High-resolution (HR) medical videos are vital for accurate diagnosis, yet are hard to acquire due to hardware limitations and physiological constraints. Clinically, the collected low-resolution (LR) medical videos present unique challenges for video super-resolution (VSR) models, including camera shake, noise, and abrupt frame transitions, which result in significant optical flow errors and alignment difficulties. Additionally, tissues and organs exhibit continuous and nuanced structures, but current VSR models are prone to introducing artifacts and distorted features that can mislead doctors. To this end, we propose MedVSR, a tailored framework for medical VSR. It first employs Cross State-Space Propagation (CSSP) to address the imprecise alignment by projecting distant frames as control matrices within state-space models, enabling the selective propagation of consistent and informative features to neighboring frames for effective alignment. Moreover, we design an Inner State-Space Reconstruction (ISSR) module that enhances tissue structures and reduces artifacts with joint long-range spatial feature learning and large-kernel short-range information aggregation. Experiments across four datasets in diverse medical scenarios, including endoscopy and cataract surgeries, show that MedVSR significantly outperforms existing VSR models in reconstruction performance and efficiency. Code released at https://github.com/CUHK-AIM-Group/MedVSR.

[148] MMR1: Enhancing Multimodal Reasoning with Variance-Aware Sampling and Open Resources

Sicong Leng,Jing Wang,Jiaxi Li,Hao Zhang,Zhiqiang Hu,Boqiang Zhang,Yuming Jiang,Hang Zhang,Xin Li,Lidong Bing,Deli Zhao,Wei Lu,Yu Rong,Aixin Sun,Shijian Lu

Main category: cs.CV

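TL;DR: This paper proposes Variance-Aware Sampling (VAS), which addresses the gradient vanishing caused by low reward variance when training multimodal reasoning models. It also releases large-scale, high-quality long chain-of-thought data and RL QA pairs and open-sources multimodal reasoning models at multiple scales; experiments validate both VAS and the data.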

Details Motivation: Large multimodal reasoning models are held back by the lack of high-quality long chain-of-thought data and by the instability of reinforcement learning algorithms in post-training, in particular the gradient vanishing caused by low reward variance, which weakens optimization. Method: Variance-Aware Sampling (VAS) selects data using a Variance Promotion Score (VPS) that combines outcome variance and trajectory diversity to raise reward variance; the work also builds about 1.6M long chain-of-thought examples and about 15k RL QA pairs and open-sources the training code and multimodal models. Result: Experiments on mathematical reasoning benchmarks validate VAS and the curated data, ablations show that each component contributes, and theoretical analysis shows that reward variance lower-bounds the expected policy gradient magnitude, a bound that VAS helps raise in practice. Conclusion: VAS stabilizes policy optimization in reinforcement learning, and the released data and open-source models are a significant resource for multimodal reasoning research. Abstract: Large multimodal reasoning models have achieved rapid progress, but their advancement is constrained by two major limitations: the absence of open, large-scale, high-quality long chain-of-thought (CoT) data, and the instability of reinforcement learning (RL) algorithms in post-training. Group Relative Policy Optimization (GRPO), the standard framework for RL fine-tuning, is prone to gradient vanishing when reward variance is low, which weakens optimization signals and impairs convergence. This work makes three contributions: (1) We propose Variance-Aware Sampling (VAS), a data selection strategy guided by Variance Promotion Score (VPS) that combines outcome variance and trajectory diversity to promote reward variance and stabilize policy optimization. (2) We release large-scale, carefully curated resources containing ~1.6M long CoT cold-start data and ~15k RL QA pairs, designed to ensure quality, difficulty, and diversity, along with a fully reproducible end-to-end training codebase. (3) We open-source a family of multimodal reasoning models in multiple scales, establishing standardized baselines for the community. Experiments across mathematical reasoning benchmarks demonstrate the effectiveness of both the curated data and the proposed VAS. Comprehensive ablation studies and analyses provide further insight into the contributions of each component. In addition, we theoretically establish that reward variance lower-bounds the expected policy gradient magnitude, with VAS serving as a practical mechanism to realize this guarantee. Our code, data, and checkpoints are available at https://github.com/LengSicong/MMR1.
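Why low reward variance starves GRPO of signal can be seen from the group-relative advantage itself. The sketch below uses the commonly cited normalization (reward minus group mean, divided by group standard deviation). It covers only the outcome-variance part of the idea and is not the paper's VPS, which also includes trajectory diversity; the function names are illustrative.

```python
import math


def outcome_variance(rewards):
    """Population variance of rewards; for 0/1 rewards this equals p * (1 - p)."""
    n = len(rewards)
    mean = sum(rewards) / n
    return sum((r - mean) ** 2 for r in rewards) / n


def group_relative_advantages(rewards, eps=1e-6):
    """Advantage of each rollout relative to the other rollouts for the same prompt."""
    mean = sum(rewards) / len(rewards)
    std = math.sqrt(outcome_variance(rewards))
    return [(r - mean) / (std + eps) for r in rewards]


# A prompt the policy always solves gives no learning signal at all.
saturated = group_relative_advantages([1, 1, 1, 1])   # all zeros
# A prompt solved half the time gives the largest outcome variance.
informative = group_relative_advantages([1, 0, 1, 0])
```

With binary rewards the variance `p * (1 - p)` peaks at a pass rate of 0.5, so prompts that are neither always solved nor always failed carry the most gradient signal. Preferring such prompts during sampling is the intuition behind selecting data by variance.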

[149] A Sentinel-3 foundation model for ocean colour

Geoffrey Dawson,Remy Vandaele,Andrew Taylor,David Moffat,Helen Tamura-Wicks,Sarah Jackson,Rosie Lickorish,Paolo Fraccaro,Hywel Williams,Chunbo Luo,Anne Jones

Main category: cs.CV

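TL;DR: Proposes a new AI foundation model based on the Prithvi-EO Vision Transformer architecture, pre-trained to reconstruct ocean remote sensing data, and shows strong performance on estimating chlorophyll concentration and ocean primary production.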

Details Motivation: Labelled data in ocean science are sparse and expensive to collect, and conventional models struggle to make full use of them; foundation models could use ocean remote sensing data more efficiently through pre-training. Method: Use the Prithvi-EO Vision Transformer architecture, pre-train on unlabelled Sentinel-3 OLCI data with a self-supervised reconstruction objective, and evaluate by fine-tuning on two downstream tasks (chlorophyll concentration estimation and ocean primary production estimation). Result: The model performs well with small amounts of high-quality labelled data, captures detailed spatial patterns of ocean colour while matching point observations, and is compared against current baseline models for chlorophyll concentration. Conclusion: This new generation of geospatial AI foundation models can make better use of sparse labelled data and provide more robust, data-driven insights for ocean ecosystem monitoring and global climate research. Abstract: Artificial Intelligence (AI) Foundation models (FMs), pre-trained on massive unlabelled datasets, have the potential to drastically change AI applications in ocean science, where labelled data are often sparse and expensive to collect. In this work, we describe a new foundation model using the Prithvi-EO Vision Transformer architecture which has been pre-trained to reconstruct data from the Sentinel-3 Ocean and Land Colour Instrument (OLCI). We evaluate the model by fine-tuning on two downstream marine earth observation tasks. We first assess model performance compared to current baseline models used to quantify chlorophyll concentration. We then evaluate the FM's ability to refine remote sensing-based estimates of ocean primary production. Our results demonstrate the utility of self-trained FMs for marine monitoring, in particular for making use of small amounts of high quality labelled data and in capturing detailed spatial patterns of ocean colour whilst matching point observations. We conclude that this new generation of geospatial AI models has the potential to provide more robust, data-driven insights into ocean ecosystems and their role in global climate processes.

[150] Does FLUX Already Know How to Perform Physically Plausible Image Composition?

Shilin Lu,Zhuming Lian,Zihan Zhou,Shaocong Zhang,Chen Zhao,Adams Wai-Kin Kong

Main category: cs.CV

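TL;DR: This paper proposes SHINE, a training-free framework for high-fidelity image composition. Using a manifold-steered anchor loss, degradation-suppression guidance, and adaptive background blending, it inserts objects seamlessly under complex lighting and at high resolution, and it introduces ComplexCompo, a new benchmark covering diverse challenging conditions.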

Details Motivation: Existing image composition methods handle complex lighting (e.g., shadows, water reflections) and high-resolution inputs poorly, and they depend on latent inversion or brittle attention manipulation, which can produce unnatural object poses or lower generation quality. Method: SHINE introduces a manifold-steered anchor loss that uses pretrained customization adapters (e.g., IP-Adapter) to guide latents toward faithful subject representation while preserving background integrity; degradation-suppression guidance and adaptive background blending further reduce low-quality outputs and visible seams. Result: On the new ComplexCompo benchmark and on DreamEditBench, SHINE reaches state-of-the-art performance on standard metrics such as DINOv2 and on human-aligned scores such as DreamSim and ImageReward. Conclusion: SHINE offers an effective training-free solution for image composition, improving insertion quality and robustness in complex real-world scenes and advancing evaluation benchmarks for the field. Abstract: Image composition aims to seamlessly insert a user-specified object into a new scene, but existing models struggle with complex lighting (e.g., accurate shadows, water reflections) and diverse, high-resolution inputs. Modern text-to-image diffusion models (e.g., SD3.5, FLUX) already encode essential physical and resolution priors, yet lack a framework to unleash them without resorting to latent inversion, which often locks object poses into contextually inappropriate orientations, or brittle attention surgery. We propose SHINE, a training-free framework for Seamless, High-fidelity Insertion with Neutralized Errors. SHINE introduces manifold-steered anchor loss, leveraging pretrained customization adapters (e.g., IP-Adapter) to guide latents for faithful subject representation while preserving background integrity. Degradation-suppression guidance and adaptive background blending are proposed to further eliminate low-quality outputs and visible seams. To address the lack of rigorous benchmarks, we introduce ComplexCompo, featuring diverse resolutions and challenging conditions such as low lighting, strong illumination, intricate shadows, and reflective surfaces. Experiments on ComplexCompo and DreamEditBench show state-of-the-art performance on standard metrics (e.g., DINOv2) and human-aligned scores (e.g., DreamSim, ImageReward, VisionReward). Code and benchmark will be publicly available upon publication.

[151] Quantized Visual Geometry Grounded Transformer

Weilun Feng,Haotong Qin,Mingqiang Wu,Chuanguang Yang,Yuqi Li,Xiangqi Li,Zhulin An,Libo Huang,Yulun Zhang,Michele Magno,Yongjun Xu

Main category: cs.CV

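TL;DR: This paper proposes QuantVGGT, the first quantization framework for Visual Geometry Grounded Transformers (VGGTs). With Dual-Smoothed Fine-Grained Quantization and Noise-Filtered Diverse Sampling, it addresses the heavy-tailed activation distributions and unstable calibration that large 3D reconstruction models face in post-training quantization, delivering substantial memory compression and speed-up while maintaining high accuracy.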

Details Motivation: When applied to billion-scale VGGT models, existing post-training quantization methods face heavy-tailed activation distributions and unstable calibration sample selection caused by multi-view data, which makes real-world deployment difficult. Method: The QuantVGGT framework has two key techniques: 1) Dual-Smoothed Fine-Grained Quantization, which combines a global Hadamard rotation with local channel smoothing to mitigate heavy-tailed distributions and inter-channel variance; 2) Noise-Filtered Diverse Sampling, which removes outliers using deep-layer statistics and builds frame-aware diverse calibration clusters. Result: QuantVGGT reaches state-of-the-art results across benchmarks and bit-widths; 4-bit quantization gives a 3.7× memory reduction and a 2.5× speed-up in real-hardware inference while keeping reconstruction accuracy above 98% of the full-precision model. Conclusion: QuantVGGT makes large VGGT models considerably more practical to deploy in resource-constrained settings and offers an effective way to make 3D reconstruction models lightweight. Abstract: Learning-based 3D reconstruction models, represented by Visual Geometry Grounded Transformers (VGGTs), have made remarkable progress with the use of large-scale transformers. Their prohibitive computational and memory costs severely hinder real-world deployment. Post-Training Quantization (PTQ) has become a common practice for compressing and accelerating models. However, we empirically observe that PTQ faces unique obstacles when compressing billion-scale VGGTs: the data-independent special tokens induce heavy-tailed activation distributions, while the multi-view nature of 3D data makes calibration sample selection highly unstable. This paper proposes the first Quantization framework for VGGTs, namely QuantVGGT. This mainly relies on two technical contributions: First, we introduce Dual-Smoothed Fine-Grained Quantization, which integrates pre-global Hadamard rotation and post-local channel smoothing to mitigate heavy-tailed distributions and inter-channel variance robustly. Second, we design Noise-Filtered Diverse Sampling, which filters outliers via deep-layer statistics and constructs frame-aware diverse calibration clusters to ensure stable quantization ranges. Comprehensive experiments demonstrate that QuantVGGT achieves state-of-the-art results across different benchmarks and bit-widths, surpassing the previous state-of-the-art generic quantization method by a large margin.
We highlight that our 4-bit QuantVGGT can deliver a 3.7$\times$ memory reduction and 2.5$\times$ acceleration in real-hardware inference, while maintaining reconstruction accuracy above 98\% of its full-precision counterpart. This demonstrates the vast advantages and practicality of QuantVGGT in resource-constrained scenarios. Our code is released in https://github.com/wlfeng0509/QuantVGGT.
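
To illustrate why a Hadamard rotation followed by channel smoothing can help with heavy-tailed activations, here is a minimal pure-Python sketch. It is not the QuantVGGT implementation: the SmoothQuant-style scaling rule, the `alpha` parameter, the per-tensor symmetric `fake_quant`, and all function names are assumptions made for illustration. The sketch only shows that both transforms leave the product `x @ w` unchanged while shrinking the activation range that the quantizer has to cover.

```python
import math

def hadamard(n):
    """Normalized Sylvester Hadamard matrix (orthogonal); n must be a power of two."""
    if n < 1 or n & (n - 1):
        raise ValueError("n must be a power of two")
    h = [[1.0]]
    while len(h) < n:
        h = [r + r for r in h] + [r + [-v for v in r] for r in h]
    scale = 1.0 / math.sqrt(n)
    return [[v * scale for v in r] for r in h]

def matmul(a, b):
    return [[sum(p * q for p, q in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(m):
    return [list(r) for r in zip(*m)]

def max_abs(m):
    return max(abs(v) for row in m for v in row)

def mse(a, b):
    diffs = [(p - q) ** 2 for ra, rb in zip(a, b) for p, q in zip(ra, rb)]
    return sum(diffs) / len(diffs)

def rotate(x, w):
    """Global rotation: (x H)(H^T w) == x w because H is orthogonal."""
    h = hadamard(len(w))
    return matmul(x, h), matmul(transpose(h), w)

def smooth(x, w, alpha=0.5):
    """Assumed SmoothQuant-style scaling: divide channel j of x by s_j, multiply row j of w by s_j."""
    n = len(w)
    scales = []
    for j in range(n):
        ax = max(abs(row[j]) for row in x)
        aw = max(abs(v) for v in w[j])
        scales.append(ax ** alpha / aw ** (1 - alpha) if ax > 0 and aw > 0 else 1.0)
    xs = [[row[j] / scales[j] for j in range(n)] for row in x]
    ws = [[v * scales[j] for v in w[j]] for j in range(n)]
    return xs, ws

def fake_quant(m, bits=4):
    """Symmetric per-tensor quantize-then-dequantize (an assumed, simplest-case quantizer)."""
    qmax = 2 ** (bits - 1) - 1
    amax = max_abs(m)
    step = amax / qmax if amax > 0 else 1.0
    return [[max(-qmax, min(qmax, round(v / step))) * step for v in row] for row in m]

# Toy activations (2 tokens x 4 channels) with one outlier channel, and toy weights (4 x 2).
x = [[8.0, 0.1, -0.2, 0.3], [7.5, -0.1, 0.2, 0.1]]
w = [[0.5, -0.2], [0.1, 0.4], [-0.3, 0.2], [0.2, 0.1]]

x_rot, w_rot = rotate(x, w)          # spread the outlier energy across channels
x_s, w_s = smooth(x_rot, w_rot)      # then balance per-channel ranges

print("max |x| before / after rotation:", max_abs(x), max_abs(x_rot))
print("4-bit activation MSE before / after rotation:",
      mse(fake_quant(x), x), mse(fake_quant(x_rot), x_rot))
```

Because the Hadamard matrix is orthogonal, quantization error measured in the rotated space has the same Frobenius norm after rotating back, so the two MSE values printed above are directly comparable.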

[152] NewtonGen: Physics-Consistent and Controllable Text-to-Video Generation via Neural Newtonian Dynamics

Yu Yuan,Xijun Wang,Tharindu Wickremasinghe,Zeeshan Nadir,Bole Ma,Stanley H. Chan

Main category: cs.CV

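TL;DR: Proposes NewtonGen, a framework that combines data-driven synthesis with learnable physical principles and introduces trainable Neural Newtonian Dynamics (NND) to achieve physically consistent and controllable video generation.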

Details Motivation: Existing text-to-video generation models are bottlenecked by physical consistency and controllability: they struggle to generate motion that obeys real physical laws and lack precise control over dynamic behavior under different initial conditions. Method: Proposes the NewtonGen framework, whose core is trainable Neural Newtonian Dynamics (NND); it combines data priors with dynamical guidance and injects latent dynamical constraints into the video generation process. Result: NewtonGen generates videos that better conform to physical laws, reducing unrealistic motion such as objects falling upward or abrupt changes in velocity direction, and supports precise control over parameters such as initial conditions. Conclusion: By combining learned physical principles with data-driven methods, NewtonGen significantly improves the physical consistency and controllability of large-scale text-to-video generation. Abstract: A primary bottleneck in large-scale text-to-video generation today is physical consistency and controllability. Despite recent advances, state-of-the-art models often produce unrealistic motions, such as objects falling upward, or abrupt changes in velocity and direction. Moreover, these models lack precise parameter control, struggling to generate physically consistent dynamics under different initial conditions. We argue that this fundamental limitation stems from current models learning motion distributions solely from appearance, while lacking an understanding of the underlying dynamics. In this work, we propose NewtonGen, a framework that integrates data-driven synthesis with learnable physical principles. At its core lies trainable Neural Newtonian Dynamics (NND), which can model and predict a variety of Newtonian motions, thereby injecting latent dynamical constraints into the video generation process. By jointly leveraging data priors and dynamical guidance, NewtonGen enables physically consistent video synthesis with precise parameter control.
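
As a rough illustration of what "Newtonian motion under different initial conditions" means as a dynamical constraint, the sketch below rolls out a 2D point mass from a given initial position and velocity. It is only an assumption-laden stand-in: NND in the paper is trainable, whereas `gravity` here is a hand-written acceleration function, and `rollout`, its parameters, and the constant `g` are hypothetical names chosen for this example.

```python
def gravity(pos, vel, g=9.81):
    """Hand-written acceleration field standing in for a learned dynamics model."""
    return (0.0, -g)

def rollout(pos, vel, accel_fn, dt, steps):
    """Integrate a 2D point mass; returns the list of (x, y) positions.

    The update is exact when the acceleration is constant over each step.
    """
    x, y = pos
    vx, vy = vel
    traj = [(x, y)]
    for _ in range(steps):
        ax, ay = accel_fn((x, y), (vx, vy))
        x += vx * dt + 0.5 * ax * dt * dt
        y += vy * dt + 0.5 * ay * dt * dt
        vx += ax * dt
        vy += ay * dt
        traj.append((x, y))
    return traj

# Same physics, two different initial velocities: the trajectories differ,
# but both follow the same parabolic law.
slow = rollout((0.0, 0.0), (2.0, 5.0), gravity, dt=0.01, steps=100)
fast = rollout((0.0, 0.0), (4.0, 8.0), gravity, dt=0.01, steps=100)
print("end positions after 1 s:", slow[-1], fast[-1])
```

A trajectory produced this way can never "fall upward" under `gravity`, which is the kind of guarantee an appearance-only motion prior does not provide.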

[153] SD3.5-Flash: Distribution-Guided Distillation of Generative Flows

Hmrishav Bandyopadhyay,Rahim Entezari,Jim Scott,Reshinth Adithyan,Yi-Zhe Song,Varun Jampani

Main category: cs.CV

TL;DR: SD3.5-Flash is an efficient few-step distillation framework that enables high-quality image generation on consumer-grade devices.

Details Motivation: To compress computationally expensive rectified flow models into models that run in a few steps and fit low-resource devices, improving generation efficiency and device compatibility. Method: Reformulates the distribution matching objective, introduces "timestep sharing" to reduce gradient noise and "split-timestep fine-tuning" to improve prompt alignment, and combines these with optimizations such as text encoder restructuring and specialized quantization. Result: Achieves fast generation and memory-efficient deployment across a range of hardware; user studies show that it outperforms existing few-step generation methods. Conclusion: SD3.5-Flash achieves high-quality, fast, cross-device image generation and advances the adoption of generative AI in practical settings. Abstract: We present SD3.5-Flash, an efficient few-step distillation framework that brings high-quality image generation to accessible consumer devices. Our approach distills computationally prohibitive rectified flow models through a reformulated distribution matching objective tailored specifically for few-step generation. We introduce two key innovations: "timestep sharing" to reduce gradient noise and "split-timestep fine-tuning" to improve prompt alignment. Combined with comprehensive pipeline optimizations like text encoder restructuring and specialized quantization, our system enables both rapid generation and memory-efficient deployment across different hardware configurations. This democratizes access across the full spectrum of devices, from mobile phones to desktop computers. Through extensive evaluation including large-scale user studies, we demonstrate that SD3.5-Flash consistently outperforms existing few-step methods, making advanced generative AI truly accessible for practical deployment.
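
For readers unfamiliar with few-step sampling, the sketch below shows plain Euler integration of a rectified flow from noise (t = 1) to data (t = 0). It does not implement the paper's distillation objective, "timestep sharing", or "split-timestep fine-tuning". The toy velocity field, which assumes the data distribution is a single point `mu` so that the flow is exactly straight, and the names `euler_sample` and `toy_velocity` are assumptions for illustration; in a real system the velocity would come from the distilled network.

```python
def euler_sample(x, velocity_fn, num_steps):
    """Integrate dx/dt = v(x, t) from t = 1 (noise) down to t = 0 (data) with Euler steps."""
    ts = [1.0 - i / num_steps for i in range(num_steps + 1)]
    for t, t_next in zip(ts, ts[1:]):
        v = velocity_fn(x, t)
        x = [xi - (t - t_next) * vi for xi, vi in zip(x, v)]
    return x

def toy_velocity(mu):
    """Exact velocity for x_t = (1 - t) * mu + t * noise, i.e. v = noise - mu = (x_t - mu) / t."""
    def v(x, t):
        return [(xi - mi) / t for xi, mi in zip(x, mu)]
    return v

mu = [1.0, -2.0, 0.5]        # hypothetical "data" point
noise = [0.3, 0.9, -1.4]     # hypothetical starting sample at t = 1

for steps in (1, 2, 4):
    print(steps, "step(s):", euler_sample(noise, toy_velocity(mu), steps))
```

With a perfectly straight flow, even a single Euler step lands on the target, which is why straightening or distilling the flow makes few-step generation viable; real models are only approximately straight, so quality still depends on the step count.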