Table of Contents
cs.CL
[1] CoBA: Counterbias Text Augmentation for Mitigating Various Spurious Correlations via Semantic Triples
Kyohoon Jin,Juhwan Choi,Jungmin Yun,Junho Lee,Soojin Jang,Youngbin Kim
Main category: cs.CL
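TL;DR: CoBA is a data augmentation framework that addresses spurious correlations by operating at the semantic-triple level, improving task performance and reducing bias.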
Details
Motivation: Deep learning models often exploit spurious correlations in training data, which degrades generalization to unseen data. To address this, the authors propose CoBA. Method: CoBA decomposes text into subject-predicate-object triples, selectively modifies those triples to break spurious correlations, and then reconstructs the text to produce counterbias data. Result: Experiments show that CoBA not only improves downstream task performance but also effectively reduces bias and strengthens out-of-distribution robustness. Conclusion: CoBA offers a versatile and robust solution to the challenges posed by spurious correlations. Abstract: Deep learning models often learn and exploit spurious correlations in training data, using these non-target features to inform their predictions. Such reliance leads to performance degradation and poor generalization on unseen data. To address these limitations, we introduce a more general form of counterfactual data augmentation, termed counterbias data augmentation, which simultaneously tackles multiple biases (e.g., gender bias, simplicity bias) and enhances out-of-distribution robustness. We present CoBA: CounterBias Augmentation, a unified framework that operates at the semantic triple level: first decomposing text into subject-predicate-object triples, then selectively modifying these triples to disrupt spurious correlations. By reconstructing the text from these adjusted triples, CoBA generates counterbias data that mitigates spurious patterns. Through extensive experiments, we demonstrate that CoBA not only improves downstream task performance, but also effectively reduces biases and strengthens out-of-distribution resilience, offering a versatile and robust solution to the challenges posed by spurious correlations.
[2] Mapping Toxic Comments Across Demographics: A Dataset from German Public Broadcasting
Jan Fillies,Michael Peter Hoffmann,Rebecca Reichel,Roman Salzwedel,Sven Bodemer,Adrian Paschke
Main category: cs.CL
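TL;DR: In collaboration with funk, a German public service content network, this study introduces the first large-scale German dataset annotated for toxicity and enriched with age estimates, comprising 3,024 human-annotated and 30,024 LLM-annotated comments from Instagram, TikTok, and YouTube.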
Details
Motivation: Existing toxic speech datasets lack demographic context, which limits our understanding of how different age groups communicate online. Method: Comments were consolidated using predefined toxic keywords and annotated by combining human expertise with state-of-the-art language models; the annotation pipeline covers key categories such as insults, disinformation, and criticism of broadcasting fees. Result: 16.7% of the comments in the dataset were labeled as problematic, and the study reveals age-based differences in toxic speech patterns: younger users favor expressive language, while older users more often engage in disinformation and devaluation. Conclusion: The resource opens new opportunities for studying linguistic variation across demographics and supports the development of more equitable, age-aware content moderation systems. Abstract: A lack of demographic context in existing toxic speech datasets limits our understanding of how different age groups communicate online. In collaboration with funk, a German public service content network, this research introduces the first large-scale German dataset annotated for toxicity and enriched with platform-provided age estimates. The dataset includes 3,024 human-annotated and 30,024 LLM-annotated anonymized comments from Instagram, TikTok, and YouTube. To ensure relevance, comments were consolidated using predefined toxic keywords, resulting in 16.7% labeled as problematic. The annotation pipeline combined human expertise with state-of-the-art language models, identifying key categories such as insults, disinformation, and criticism of broadcasting fees. The dataset reveals age-based differences in toxic speech patterns, with younger users favoring expressive language and older users more often engaging in disinformation and devaluation. This resource provides new opportunities for studying linguistic variation across demographics and supports the development of more equitable and age-aware content moderation systems.
[3] Granite Embedding R2 Models
Parul Awasthy,Aashka Trivedi,Yulong Li,Meet Doshi,Riyaz Bhat,Vignesh P,Vishwajeet Kumar,Yushu Yang,Bhavani Iyer,Abraham Daniels,Rudra Murthy,Ken Barker,Martin Franz,Madison Lee,Todd Ward,Salim Roukos,David Cox,Luis Lastras,Jaydeep Sen,Radu Florian
Main category: cs.CL
TL;DR: The Granite R2 models are a family of high-performance, open-source embedding models for enterprise-scale retrieval applications, offering expanded context length, superior performance across diverse domains, and speed advantages; they are publicly available under an enterprise-friendly license.
Details
Motivation: To provide high-performance, enterprise-scale embedding models for dense retrieval applications that offer improved retrieval speed and accuracy, versatility across various domains, and transparent data provenance with enterprise-ready licensing. Method: The authors built upon their first-generation embedding models to develop the Granite R2 models, which include bi-encoder and cross-encoder architectures with improved context length, performance across retrieval domains, and speed. The models were trained on enterprise-appropriate data with governance oversight. Result: The Granite Embedding R2 models offer a 16x expanded context length, state-of-the-art performance in multiple retrieval domains, speed advantages of 19-44% over competitors, and demonstrate exceptional versatility across benchmarks and real-world enterprise use cases. Conclusion: The Granite R2 models are publicly available under the Apache 2.0 license, allowing unrestricted research and commercial use, and represent a significant advancement in embedding models for enterprise applications. Abstract: We introduce the Granite Embedding R2 models, a comprehensive family of high-performance English encoder-based embedding models engineered for enterprise-scale dense retrieval applications. Building upon our first-generation release, these models deliver substantial improvements, including 16x expanded context length (8,192 tokens), state-of-the-art performance across diverse retrieval domains - text, code, long-document search, multi-turn conversational, and tabular data - and measurable speed advantages of 19-44% over leading competitors while maintaining superior accuracy. Our release encompasses both bi-encoder and cross-encoder architectures, featuring a highly effective 22-layer retriever model and its efficient 12-layer counterpart, alongside a high-quality reranker model, all trained exclusively on enterprise-appropriate data with comprehensive governance oversight.
The models demonstrate exceptional versatility across standard benchmarks, IBM-developed evaluation suites, and real-world enterprise use cases, establishing new performance standards for open-source embedding models. In an era where retrieval speed and accuracy are paramount for competitive advantage, the Granite R2 models deliver a compelling combination of cutting-edge performance, enterprise-ready licensing, and transparent data provenance that organizations require for mission-critical deployments. All models are publicly available under the Apache 2.0 license at https://huggingface.co/collections/ibm-granite, enabling unrestricted research and commercial use.
[4] TrInk: Ink Generation with Transformer Network
Zezhong Jin,Shubhang Desai,Xu Chen,Biyi Fang,Zhuoyi Huang,Zhe Li,Chong-Xin Gan,Xiao Tu,Man-Wai Mak,Yan Lu,Shujie Liu
Main category: cs.CL
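TL;DR: This paper proposes TrInk, a Transformer-based handwriting generation model that substantially improves generation quality through an improved alignment mechanism.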
Details
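A minimal sketch of what a Gaussian memory mask in cross-attention could look like, as named in the abstract below. This is an illustration under stated assumptions, not the authors' implementation: the mask is taken to be an additive bias on the attention logits, and the expected text position for each stroke step is assumed to advance linearly. The function names, the `sigma` parameter, and the linear alignment are all hypothetical.

```python
import math


def gaussian_memory_mask(num_queries, num_keys, sigma=2.0):
    """Additive bias (num_queries x num_keys) for cross-attention logits.

    Assumption (not from the paper): query step i should attend near key
    position c_i, where c_i moves linearly from 0 to num_keys - 1.
    """
    if num_queries == 1:
        centers = [0.0]
    else:
        step = (num_keys - 1) / (num_queries - 1)
        centers = [i * step for i in range(num_queries)]
    # Log of an unnormalized Gaussian centered on each expected position.
    return [
        [-((j - c) ** 2) / (2.0 * sigma ** 2) for j in range(num_keys)]
        for c in centers
    ]


def softmax(row):
    """Numerically stable softmax over one row of logits."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def masked_attention_weights(scores, sigma=2.0):
    """Add the Gaussian bias to raw attention scores, then normalize per row."""
    mask = gaussian_memory_mask(len(scores), len(scores[0]), sigma)
    return [
        softmax([s + b for s, b in zip(score_row, bias_row)])
        for score_row, bias_row in zip(scores, mask)
    ]


# Three stroke steps attending over five text positions, with flat content scores.
scores = [[0.0] * 5 for _ in range(3)]
weights = masked_attention_weights(scores, sigma=1.0)
peaks = [row.index(max(row)) for row in weights]
```

With flat content scores, the bias alone decides where each step looks, so the attention peaks move monotonically across the text positions.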
Motivation: To improve existing handwriting generation models so that they better capture global dependencies and improve the legibility and style consistency of the generated text. Method: Introduces TrInk, a Transformer-based model that combines scaled positional embeddings with a Gaussian memory mask to improve the alignment between text and strokes. Result: On the IAM-OnDB dataset, the character error rate drops by 35.56% and the word error rate by 29.66%. Conclusion: TrInk achieves higher-quality handwritten text generation, with substantially lower character and word error rates. Abstract: In this paper, we propose TrInk, a Transformer-based model for ink generation, which effectively captures global dependencies. To better facilitate the alignment between the input text and generated stroke points, we introduce scaled positional embeddings and a Gaussian memory mask in the cross-attention module. Additionally, we design both subjective and objective evaluation pipelines to comprehensively assess the legibility and style consistency of the generated handwriting. Experiments demonstrate that our Transformer-based model achieves a 35.56% reduction in character error rate (CER) and a 29.66% reduction in word error rate (WER) on the IAM-OnDB dataset compared to previous methods. We provide a demo page with handwriting samples from TrInk and baseline models at: https://akahello-a11y.github.io/trink-demo/
[5] How Does Cognitive Bias Affect Large Language Models? A Case Study on the Anchoring Effect in Price Negotiation Simulations
Yoshiki Takenami,Yin Jou Huang,Yugo Murawaki,Chenhui Chu
Main category: cs.CL
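TL;DR: This paper studies the anchoring effect exhibited by LLMs in price negotiations, finding that reasoning models are less susceptible to it but that personality traits show no significant association with susceptibility.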
Details
Motivation: Cognitive biases have been studied extensively in humans, but their role in the reliability of LLMs in real-world applications is not yet fully understood. Method: The paper instructs seller LLM agents to apply the anchoring effect and evaluates negotiation outcomes with both an objective and a subjective metric, thereby studying the anchoring effect in LLM-driven price negotiations. Result: Experiments show that LLMs, like humans, are influenced by the anchoring effect. Reasoning models are less susceptible to it, suggesting that a long chain of thought mitigates the effect. However, no significant correlation was found between personality traits and susceptibility to the anchoring effect. Conclusion: The paper concludes that large language models (LLMs), like humans, are influenced by the anchoring effect, but that the effect is weaker in reasoning models, suggesting that a long chain of thought mitigates it. In addition, no significant correlation was found between personality traits and susceptibility to the anchoring effect. Abstract: Cognitive biases, well-studied in humans, can also be observed in LLMs, affecting their reliability in real-world applications. This paper investigates the anchoring effect in LLM-driven price negotiations. To this end, we instructed seller LLM agents to apply the anchoring effect and evaluated negotiations using not only an objective metric but also a subjective metric. Experimental results show that LLMs are influenced by the anchoring effect like humans. Additionally, we investigated the relationship between the anchoring effect and factors such as reasoning and personality. It was shown that reasoning models are less prone to the anchoring effect, suggesting that the long chain of thought mitigates the effect. However, we found no significant correlation between personality traits and susceptibility to the anchoring effect. These findings contribute to a deeper understanding of cognitive biases in LLMs and to the realization of safe and responsible application of LLMs in society.
[6] Can Multimodal LLMs Solve the Basic Perception Problems of Percept-V?
Samrajnee Ghosh,Naman Agarwal,Hemanshu Garg,Chinmay Mittal,Mausam,Parag Singla
Main category: cs.CL
TL;DR: The study introduces Percept-V, a dataset to evaluate basic visual perception skills of MLLMs, revealing a drop in performance as task complexity increases.
Details
Motivation: MLLMs have shown strong performance in complex tasks like coding and mathematics, but there is limited research on their performance in basic visual perception tasks. Method: A dataset (Percept-V) of 7200 generated images across 30 categories was created to test visual perception abilities of MLLMs and LRMs. Result: Experiments revealed a consistent decline in model performance as task complexity increased, with some cognitive skills proving more challenging than others. Conclusion: MLLMs show a significant drop in performance with increasing problem complexity on basic visual perception tasks, despite their success in more complex domains. Abstract: The reasoning abilities of Multimodal Large Language Models (MLLMs) have garnered a lot of attention in recent times, with advances made in frontiers like coding, mathematics, and science. However, very limited experiments have been done to assess their performance in simple perception tasks performed over uncontaminated, generated images containing basic shapes and structures. To address this issue, the paper introduces a dataset, Percept-V, containing a total of 7200 program-generated images equally divided into 30 categories, each testing a combination of visual perception skills. Unlike previously proposed datasets, Percept-V comprises very basic tasks of varying complexity that test the perception abilities of MLLMs. This dataset is then tested on state-of-the-art MLLMs like GPT-4o, Gemini, and Claude as well as Large Reasoning Models (LRMs) like OpenAI o4-mini and DeepSeek R1 to gauge their performance. Contrary to the evidence that MLLMs excel in many complex tasks, our experiments show a significant drop in the models' performance with increasing problem complexity across all categories. 
An analysis of the results also reveals that the tested MLLMs exhibit a similar trend in accuracy across categories that test a particular cognitive skill, and that some skills are more difficult than others.
[7] A Survey of Scientific Large Language Models: From Data Foundations to Agent Frontiers
Ming Hu,Chenglong Ma,Wei Li,Wanghan Xu,Jiamin Wu,Jucheng Hu,Tianbin Li,Guohang Zhuang,Jiaqi Liu,Yingzhou Lu,Ying Chen,Chaoyang Zhang,Cheng Tan,Jie Ying,Guocheng Wu,Shujian Gao,Pengcheng Chen,Jiashi Lin,Haitao Wu,Lulu Chen,Fengxiang Wang,Yuanyuan Zhang,Xiangyu Zhao,Feilong Tang,Encheng Su,Junzhi Ning,Xinyao Liu,Ye Du,Changkai Ji,Cheng Tang,Huihui Xu,Ziyang Chen,Ziyan Huang,Jiyao Liu,Pengfei Jiang,Yizhou Wang,Chen Tang,Jianyu Wu,Yuchen Ren,Siyuan Yan,Zhonghua Wang,Zhongxing Xu,Shiyan Su,Shangquan Sun,Runkai Zhao,Zhisheng Zhang,Yu Liu,Fudi Wang,Yuanfeng Ji,Yanzhou Su,Hongming Shan,Chunmei Feng,Jiahao Xu,Jiangtao Yan,Wenhao Tang,Diping Song,Lihao Liu,Yanyan Huang,Lequan Yu,Bin Fu,Shujun Wang,Xiaomeng Li,Xiaowei Hu,Yun Gu,Ben Fei,Zhongying Deng,Benyou Wang,Yuewen Cao,Minjie Shen,Haodong Duan,Jie Xu,Yirong Chen,Fang Yan,Hongxia Hao,Jielan Li,Jiajun Du,Yanbo Wang,Imran Razzak,Chi Zhang,Lijun Wu,Conghui He,Zhaohui Lu,Jinhai Huang,Yihao Liu,Fenghua Ling,Yuqiang Li,Aoran Wang,Qihao Zheng,Nanqing Dong,Tianfan Fu,Dongzhan Zhou,Yan Lu,Wenlong Zhang,Jin Ye,Jianfei Cai,Wanli Ouyang,Yu Qiao,Zongyuan Ge,Shixiang Tang,Junjun He,Chunfeng Song,Lei Bai,Bowen Zhou
Main category: cs.CL
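TL;DR: This paper examines the development of scientific large language models, emphasizing their co-evolution with scientific data, and proposes a roadmap for building trustworthy AI systems that accelerate scientific discovery.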
Details
Motivation: The complexity of scientific data shapes the development of scientific large language models, which must address multimodal, cross-scale, and domain-specific challenges. Method: The paper takes a comprehensive, data-centric perspective to synthesize the development of scientific large language models (Sci-LLMs), including a unified taxonomy of scientific data and a hierarchical model of scientific knowledge. Result: The paper systematically reviews recent Sci-LLMs and analyzes over 270 pre-training and post-training datasets and over 190 benchmark datasets, showing the distinct demands of Sci-LLMs and the evolution of evaluation methods. Conclusion: The paper proposes building trustworthy, continually evolving scientific AI systems that act as true partners in accelerating scientific discovery. Abstract: Scientific Large Language Models (Sci-LLMs) are transforming how knowledge is represented, integrated, and applied in scientific research, yet their progress is shaped by the complex nature of scientific data. This survey presents a comprehensive, data-centric synthesis that reframes the development of Sci-LLMs as a co-evolution between models and their underlying data substrate. We formulate a unified taxonomy of scientific data and a hierarchical model of scientific knowledge, emphasizing the multimodal, cross-scale, and domain-specific challenges that differentiate scientific corpora from general natural language processing datasets. We systematically review recent Sci-LLMs, from general-purpose foundations to specialized models across diverse scientific disciplines, alongside an extensive analysis of over 270 pre-/post-training datasets, showing why Sci-LLMs pose distinct demands -- heterogeneous, multi-scale, uncertainty-laden corpora that require representations preserving domain invariance and enabling cross-modal reasoning. On evaluation, we examine over 190 benchmark datasets and trace a shift from static exams toward process- and discovery-oriented assessments with advanced evaluation protocols. These data-centric analyses highlight persistent issues in scientific data development and discuss emerging solutions involving semi-automated annotation pipelines and expert validation. Finally, we outline a paradigm shift toward closed-loop systems where autonomous agents based on Sci-LLMs actively experiment, validate, and contribute to a living, evolving knowledge base.
Collectively, this work provides a roadmap for building trustworthy, continually evolving artificial intelligence (AI) systems that function as a true partner in accelerating scientific discovery.
[8] Quantifying Label-Induced Bias in Large Language Model Self- and Cross-Evaluations
Muskan Saraf,Sajjad Rezvani Boroujeni,Justin Beaudry,Hossein Abedi,Tom Bush
Main category: cs.CL
TL;DR: This study finds that model identity perception heavily biases large language model evaluations, with false labels significantly altering preference rankings and quality ratings, highlighting the need for fairer evaluation protocols.
Details
Motivation: The motivation is to understand how bias in self- and cross-model evaluations by large language models affects the reliability of their judgments, particularly when labels are manipulated. Method: The study evaluated blog posts authored by ChatGPT, Gemini, and Claude using both overall preference voting and quality ratings for Coherence, Informativeness, and Conciseness under four labeling conditions: no labels, true labels, and two false-label scenarios. Result: Results show significant asymmetries in scoring: the 'Claude' label boosted scores, while the 'Gemini' label depressed them. False labels altered rankings significantly, with shifts up to 50 percentage points in preference votes. Gemini's self-scores dropped under true labels, while Claude's self-preference increased. Conclusion: The study concludes that perceived model identity significantly biases the evaluation of large language model outputs, emphasizing the necessity for blind or multi-model evaluation protocols to ensure fairness. Abstract: Large language models (LLMs) are increasingly used to evaluate outputs, yet their judgments may be influenced. This study examines bias in self- and cross-model evaluations by ChatGPT, Gemini, and Claude under four conditions: no labels, true labels, and two false-label scenarios. Blog posts authored by each model were evaluated by all three using both overall preference voting and quality ratings for Coherence, Informativeness, and Conciseness, with all scores expressed as percentages for direct comparison. Results reveal striking asymmetries: the "Claude" label consistently boosts scores, while the "Gemini" label consistently depresses them, regardless of actual content. False labels frequently reversed rankings, producing shifts of up to 50 percentage points in preference votes and up to 12 percentage points in converted quality ratings. Gemini's self-scores collapsed under true labels, while Claude's self-preference intensified. 
These findings show that perceived model identity can heavily distort high-level judgments and subtly influence detailed quality ratings, underscoring the need for blind or multi-model evaluation protocols to ensure fairness in LLM benchmarking.
[9] BED-LLM: Intelligent Information Gathering with LLMs and Bayesian Experimental Design
Deepro Choudhury,Sinead Williamson,Adam Goliński,Ning Miao,Freddie Bickford Smith,Michael Kirchhof,Yizhe Zhang,Tom Rainforth
Main category: cs.CL
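TL;DR: This paper proposes BED-LLM, a new approach based on Bayesian experimental design that enables large language models to interact with users or environments more intelligently and adaptively, substantially improving performance by maximizing information gain.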
Details
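The expected information gain (EIG) that the abstract below builds on can be worked through on a toy 20-questions setup. For a yes/no question it is the entropy of the answer minus the expected entropy of the answer given each hypothesis: EIG = H[p(y)] - E_h H[p(y | h)]. This sketch assumes a small hand-written hypothesis set and answer table; in the paper the model is derived from the LLM's belief distribution and a purpose-built estimator is used, neither of which is reproduced here. All names and numbers are illustrative.

```python
import math


def entropy(probs):
    """Shannon entropy in bits, skipping zero-probability outcomes."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


def expected_information_gain(prior, p_yes_given_h):
    """EIG of a yes/no question: H[p(y)] - sum_h p(h) * H[p(y | h)]."""
    p_yes = sum(prior[h] * p_yes_given_h[h] for h in prior)
    marginal = entropy([p_yes, 1.0 - p_yes])
    conditional = sum(
        prior[h] * entropy([p_yes_given_h[h], 1.0 - p_yes_given_h[h]])
        for h in prior
    )
    return marginal - conditional


# Hypothetical belief over four candidate answers (uniform prior).
prior = {"cat": 0.25, "dog": 0.25, "eagle": 0.25, "shark": 0.25}

# Hypothetical probability of a "yes" answer under each hypothesis.
questions = {
    "Is it a mammal?": {"cat": 1.0, "dog": 1.0, "eagle": 0.0, "shark": 0.0},
    "Can it fly?": {"cat": 0.0, "dog": 0.0, "eagle": 1.0, "shark": 0.0},
}

eig = {q: expected_information_gain(prior, table) for q, table in questions.items()}
best_question = max(eig, key=eig.get)
```

The question that splits the hypotheses evenly yields one full bit, while the uneven split yields about 0.81 bits, so a greedy EIG-maximizing policy asks the even-split question first.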
Motivation: To improve the ability of large language models (LLMs) to gather information intelligently and adaptively when interacting with a user or external environment, so that they can act as effective multi-turn conversational agents. Method: BED-LLM builds on the sequential Bayesian experimental design framework, iteratively choosing questions or queries that maximize the expected information gain (EIG). The EIG is computed with a probabilistic model derived from the LLM's belief distribution, and the approach introduces several innovations, such as a carefully designed EIG estimator and a targeted strategy for proposing candidate queries. Result: BED-LLM performs well across multiple test tasks, including the 20-questions game and user preference inference, achieving substantial gains over conventional methods. Conclusion: BED-LLM achieves substantial performance gains across a variety of tests, outperforming direct prompting of the LLM and other adaptive design strategies on the 20-questions game and on active inference of user preferences. Abstract: We propose a general-purpose approach for improving the ability of Large Language Models (LLMs) to intelligently and adaptively gather information from a user or other external source using the framework of sequential Bayesian experimental design (BED). This enables LLMs to act as effective multi-turn conversational agents and interactively interface with external environments. Our approach, which we call BED-LLM (Bayesian Experimental Design with Large Language Models), is based on iteratively choosing questions or queries that maximize the expected information gain (EIG) about the task of interest given the responses gathered previously. We show how this EIG can be formulated in a principled way using a probabilistic model derived from the LLM's belief distribution and provide detailed insights into key decisions in its construction. Further key to the success of BED-LLM are a number of specific innovations, such as a carefully designed estimator for the EIG, not solely relying on in-context updates for conditioning on previous responses, and a targeted strategy for proposing candidate queries. We find that BED-LLM achieves substantial gains in performance across a wide range of tests based on the 20-questions game and using the LLM to actively infer user preferences, compared to direct prompting of the LLM and other adaptive design strategies.
[10] Improving Aviation Safety Analysis: Automated HFACS Classification Using Reinforcement Learning with Group Relative Policy Optimization
Arash Ahmadi,Sarah Sharif,Yaser Banad
Main category: cs.CL
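TL;DR: This paper proposes a reinforcement-learning-based framework for automated HFACS classification that optimizes a language model for aviation safety analysis, improving accuracy while reducing computational resource requirements.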
Details
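Three quantities in the abstract below are easy to check by hand: the reported 350% relative increase, exact match accuracy for multi-label predictions, and the group-relative advantage that gives GRPO its name. The sketch assumes the commonly used GRPO normalization (reward minus group mean, divided by group standard deviation) and treats exact match as set equality between predicted and reference labels; the paper's multi-component reward and its partial-match definition are not reproduced. The sample labels and rewards are made up for illustration.

```python
import statistics


def relative_increase(before, after):
    """Relative change, e.g. 3.5 means a 350% increase."""
    return (after - before) / before


def exact_match_accuracy(predictions, references):
    """Share of samples whose predicted label set equals the reference set."""
    hits = sum(set(p) == set(r) for p, r in zip(predictions, references))
    return hits / len(references)


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against its own sampling group (assumed GRPO form)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# The abstract's headline number: 0.0400 -> 0.1800 exact match accuracy.
increase_pct = round(relative_increase(0.04, 0.18) * 100)

# Hypothetical multi-label predictions using HFACS tier names as labels.
predictions = [["Unsafe Acts"], ["Unsafe Acts", "Unsafe Supervision"]]
references = [["Unsafe Acts"], ["Unsafe Acts"]]
accuracy = exact_match_accuracy(predictions, references)

# Hypothetical rewards for four completions sampled from the same prompt.
advantages = group_relative_advantages([1.0, 0.0, 0.5, 0.5])
```

Exact match is strict: the second prediction contains the correct label plus one extra, so it counts as a miss, which is one reason exact match scores can sit far below partial match scores.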
Motivation: Traditional HFACS-based analysis of aviation accidents suffers from limited scalability and consistency, calling for an automated solution. Method: A Llama-3.1 8B language model is fine-tuned with reinforcement learning (GRPO), combined with a multi-component reward system and synthetic data generation to address class imbalance in accident datasets. Result: Exact match accuracy improved by 350% (from 0.0400 to 0.1800), partial match accuracy reached 0.8800, and the model outperformed state-of-the-art models such as GPT-5-mini and Gemini-2.5-flash on several metrics. Conclusion: The study validates the effectiveness of small, optimized models for aviation safety analysis, offering an efficient solution that can be deployed on resource-constrained edge devices. Abstract: Analyzing the human factors behind aviation accidents is crucial for preventing future incidents, yet traditional methods using the Human Factors Analysis and Classification System (HFACS) are limited by scalability and consistency. To address this, we introduce an automated HFACS classification framework for aviation safety analysis that utilizes Reinforcement Learning with Group Relative Policy Optimization (GRPO) to fine-tune a Llama-3.1 8B language model. Our approach incorporates a multi-component reward system tailored for aviation safety analysis and integrates synthetic data generation to overcome class imbalance in accident datasets. The resulting GRPO-optimized model achieved noticeable performance gains, including a 350% increase in exact match accuracy (from 0.0400 to 0.1800) and an improved partial match accuracy of 0.8800. Significantly, our specialized model outperforms state-of-the-art LLMs (Large Language Models), including GPT-5-mini and Gemini-2.5-flash, on key metrics. This research also proposes exact match accuracy on the multi-label HFACS classification problem as a new benchmarking methodology to evaluate the advanced reasoning capabilities of language models. Ultimately, our work validates that smaller, domain-optimized models can provide a computationally efficient and better solution for critical safety analysis. This approach makes powerful, low-latency deployment on resource-constrained edge devices feasible.
[11] Enhancing Robustness of Autoregressive Language Models against Orthographic Attacks via Pixel-based Approach
Han Yang,Jian Lan,Yihong Liu,Hinrich Schütze,Thomas Seidl
Main category: cs.CL
TL;DR: This paper introduces a pixel-based language model to improve robustness against orthographic attacks and enhance multilingual support, showing promising results on multiple benchmarks.
Details
Motivation: Autoregressive language models are vulnerable to orthographic attacks due to subword tokenizers' limitations, which inspired the development of a more robust pixel-based approach. Method: The method involves replacing text-based embeddings with pixel-based representations by rendering words as images, addressing out-of-vocabulary issues caused by subword tokenizers. Result: The proposed model was evaluated on the multilingual LAMBADA dataset, WMT24 dataset, and SST-2 benchmark, showing resilience to orthographic noise and effectiveness in multilingual settings. Conclusion: A pixel-based generative language model is proposed to enhance robustness against orthographic attacks and support multilingual text processing effectively. Abstract: Autoregressive language models are vulnerable to orthographic attacks, where input text is perturbed with characters from multilingual alphabets, leading to substantial performance degradation. This vulnerability primarily stems from the out-of-vocabulary issue inherent in subword tokenizers and their embeddings. To address this limitation, we propose a pixel-based generative language model that replaces the text-based embeddings with pixel-based representations by rendering words as individual images. This design provides stronger robustness to noisy inputs, while also extending compatibility to multilingual text across diverse writing systems. We evaluate the proposed method on the multilingual LAMBADA dataset, WMT24 dataset and the SST-2 benchmark, demonstrating both its resilience to orthographic noise and its effectiveness in multilingual settings.
[12] Do Self-Supervised Speech Models Exhibit the Critical Period Effects in Language Acquisition?
Yurie Koga,Shunsuke Kando,Yusuke Miyao
Main category: cs.CL
TL;DR: This paper explores whether the Critical Period effects of human language acquisition appear in self-supervised speech models. It finds no clear evidence of such effects: models with delayed L2 exposure onset tend to perform better on L2, and delayed L1 exposure offset leads to L1 forgetting.
Details
Motivation: To explore whether Critical Period effects, commonly studied in textual language models, are present in self-supervised speech models, given the central role of spoken language in human language acquisition. Method: Train S3Ms with varying L2 training onsets and L1 training offsets on child-directed speech and evaluate their phone discrimination performance. Result: The study found no clear evidence of CP effects in S3Ms in terms of phonological acquisition. Conclusion: S3Ms do not exhibit clear evidence of CP effects in phonological acquisition; instead, models with delayed L2 exposure onset tend to perform better on L2, and delayed L1 exposure offset leads to L1 forgetting. Abstract: This paper investigates whether the Critical Period (CP) effects in human language acquisition are observed in self-supervised speech models (S3Ms). CP effects refer to greater difficulty in acquiring a second language (L2) with delayed L2 exposure onset, and greater retention of their first language (L1) with delayed L1 exposure offset. While previous work has studied these effects using textual language models, their presence in speech models remains underexplored despite the central role of spoken language in human language acquisition. We train S3Ms with varying L2 training onsets and L1 training offsets on child-directed speech and evaluate their phone discrimination performance. We find that S3Ms do not exhibit clear evidence of either CP effect in terms of phonological acquisition. Notably, models with delayed L2 exposure onset tend to perform better on L2, and delayed L1 exposure offset leads to L1 forgetting.
[13] Decoding Memories: An Efficient Pipeline for Self-Consistency Hallucination Detection
Weizhi Gao,Xiaorui Liu,Feiyi Wang,Dan Lu,Junqi Yin
Main category: cs.CL
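TL;DR: This paper proposes the Decoding Memory Pipeline (DMP), a new method that accelerates generation in large language models while maintaining performance, addressing the high computational cost of existing methods.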
Details
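The redundancy that the abstract below attributes to self-consistency sampling, shared prefix tokens across generations, can be measured with a few lines of code. This sketch only illustrates that observation on made-up, whitespace-tokenized samples; it is not the Decoding Memory Pipeline itself, and it uses the simplest possible notion of sharing (a prefix common to all samples) rather than whatever structure the authors exploit.

```python
def shared_prefix_length(sequences):
    """Number of leading tokens that are identical across all sequences."""
    if not sequences:
        return 0
    length = 0
    for tokens in zip(*sequences):  # stops at the shortest sequence
        if len(set(tokens)) != 1:
            break
        length += 1
    return length


def redundant_token_fraction(sequences):
    """Fraction of generated tokens that repeat the shared prefix.

    The prefix has to be decoded once; every additional sample that
    re-decodes it spends compute on tokens already known.
    """
    total = sum(len(s) for s in sequences)
    if total == 0:
        return 0.0
    prefix = shared_prefix_length(sequences)
    return prefix * (len(sequences) - 1) / total


# Hypothetical sampled responses to the same prompt.
samples = [
    "The capital of France is Paris".split(),
    "The capital of France is Lyon".split(),
    "The capital of France is Paris".split(),
]
prefix_len = shared_prefix_length(samples)
redundancy = redundant_token_fraction(samples)
```

In this toy case five of the six tokens in each response are shared, and only the final answer token differs, which mirrors the abstract's point that the exact-answer tokens carry most of the semantic content.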
Motivation: Existing hallucination detection methods perform poorly on sentence-level generation or rely heavily on domain-specific knowledge; self-consistency methods alleviate these problems but incur high computational costs because they require repeated generation. Method: By identifying redundancy in the shared prefix tokens produced by self-consistency methods, and building on the observation that non-exact-answer tokens contribute little to semantic content, the paper proposes a new Decoding Memory Pipeline (DMP). Result: Experiments show that the method achieves up to a 3x speedup without sacrificing AUROC performance. Conclusion: The proposed DMP effectively improves the efficiency of multi-response generation and shows promise for extension to alignment and reasoning tasks. Abstract: Large language models (LLMs) have demonstrated impressive performance in both research and real-world applications, but they still struggle with hallucination. Existing hallucination detection methods often perform poorly on sentence-level generation or rely heavily on domain-specific knowledge. While self-consistency approaches help address these limitations, they incur high computational costs due to repeated generation. In this paper, we conduct the first study on identifying redundancy in self-consistency methods, manifested as shared prefix tokens across generations, and observe that non-exact-answer tokens contribute minimally to the semantic content. Based on these insights, we propose a novel Decoding Memory Pipeline (DMP) that accelerates generation through selective inference and annealed decoding. Being orthogonal to the model, dataset, decoding strategy, and self-consistency baseline, our DMP consistently improves the efficiency of multi-response generation and holds promise for extension to alignment and reasoning tasks. Extensive experiments show that our method achieves up to a 3x speedup without sacrificing AUROC performance.
[14] Efficient Code Embeddings from Code Generation Models
Daria Kryvosheieva,Saba Sturua,Michael Günther,Scott Martens,Han Xiao
Main category: cs.CL
TL;DR: This paper introduces jina-code-embeddings, a new code embedding model that excels at retrieving code from natural language queries, answering technical questions, and finding similar code snippets across programming languages, all while maintaining a relatively small model size thanks to its pre-training approach.
Details
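Last-token pooling, which the abstract below names as the way embeddings are produced, can be sketched in a few lines. In an autoregressive model with causal attention, only the final token has attended to the whole input, so its hidden state is the natural summary vector. The sketch assumes per-token hidden states and an attention mask are already available as plain lists; the values, the L2 normalization step, and the function names are illustrative and not taken from the jina-code-embeddings code.

```python
import math


def last_token_pool(hidden_states, attention_mask):
    """Return the hidden state of the last non-padding token.

    Picking the last index where the mask is 1 works for both left- and
    right-padded sequences.
    """
    last_index = max(i for i, m in enumerate(attention_mask) if m == 1)
    return hidden_states[last_index]


def l2_normalize(vec):
    """Scale a vector to unit length so dot products equal cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


# Hypothetical 4-token sequence with 2-dimensional hidden states;
# the final position is padding and must be ignored.
hidden_states = [[1.0, 0.0], [0.0, 2.0], [3.0, 4.0], [9.0, 9.0]]
attention_mask = [1, 1, 1, 0]

pooled = last_token_pool(hidden_states, attention_mask)
embedding = l2_normalize(pooled)
```

Normalizing the pooled vector is a common convention for retrieval, since it lets a plain dot product rank documents by cosine similarity.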
Motivation: The motivation behind the paper is to develop a model that can effectively retrieve code from natural language queries, perform technical question-answering, and identify semantically similar code snippets across different programming languages. Method: The method involves creating a code embedding model using an autoregressive backbone pre-trained on both text and code, with embeddings generated through last-token pooling. Result: The result is a novel code embedding model suite called jina-code-embeddings that demonstrates innovative use of pre-training on text and code and achieves superior performance in multiple code-related tasks. Conclusion: The paper concludes that the jina-code-embeddings model achieves state-of-the-art performance in code retrieval, technical question-answering, and identifying semantically similar code snippets across programming languages, despite its relatively small size. Abstract: jina-code-embeddings is a novel code embedding model suite designed to retrieve code from natural language queries, perform technical question-answering, and identify semantically similar code snippets across programming languages. It makes innovative use of an autoregressive backbone pre-trained on both text and code, generating embeddings via last-token pooling. We outline the training recipe and demonstrate state-of-the-art performance despite the relatively small size of the models, validating this approach to code embedding model construction.
[15] BLUEX Revisited: Enhancing Benchmark Coverage with Automatic Captioning
João Guilherme Alves Santos,Giovana Kerche Bonás,Thales Sales Almeida
Main category: cs.CL
TL;DR: This paper updates the BLUEX dataset with new exams and AI-generated captions, significantly improving the evaluation of LLMs in multilingual and non-English contexts by enhancing visual context utilization.
Details
Motivation: The motivation stems from the growing capabilities of Large Language Models (LLMs) and the need for more robust evaluation methods, especially in multilingual and non-English settings. Method: The researchers expanded the BLUEX dataset by incorporating 2024-2025 exams and generating image captions using state-of-the-art models. They evaluated both commercial and open-source LLMs to assess their ability to utilize visual context. Result: The updated BLUEX dataset produced 1,422 usable questions, increasing accessibility to text-only models by more than 40%, and more than doubling the original number of questions. Conclusion: The study concludes that the updated BLUEX dataset significantly enhances the evaluation of LLMs, particularly in leveraging visual context through captions for improved performance in multilingual and non-English contexts. Abstract: With the growing capabilities of Large Language Models (LLMs), there is an increasing need for robust evaluation methods, especially in multilingual and non-English contexts. We present an updated version of the BLUEX dataset, now including 2024-2025 exams and automatically generated image captions using state-of-the-art models, enhancing its relevance for data contamination studies in LLM pretraining. Captioning strategies increase accessibility to text-only models by more than 40%, producing 1,422 usable questions, more than doubling the number in the original BLUEX. We evaluated commercial and open-source LLMs and their ability to leverage visual context through captions.
[16] Challenges and Applications of Large Language Models: A Comparison of GPT and DeepSeek family of models
Shubham Sharma,Sneha Tuli,Narendra Badam
Main category: cs.CL
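TL;DR: This survey reviews 16 key challenges of large language models (LLMs) and, by comparing two state-of-the-art models, OpenAI's closed-source GPT-4o and DeepSeek-V3-0324, examines the trade-offs between closed-source and open-source models and their applications across domains.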
Details
Motivation: Large Language Models (LLMs) are transforming AI across industries, but their development and deployment remain complex. Method: The survey compares two state-of-the-art models, OpenAI's closed-source GPT-4o and DeepSeek-V3-0324, a large open-source Mixture-of-Experts model, to illustrate the trade-offs between closed-source and open-source models. Result: The paper shows the trade-offs between closed-source models (robust safety, fine-tuned reliability) and open-source models (efficiency, adaptability), and explores LLM applications across domains, highlighting which model attributes best suit each use case. Conclusion: The paper aims to guide AI researchers, developers, and decision-makers in understanding current LLM capabilities, limitations, and best practices. Abstract: Large Language Models (LLMs) are transforming AI across industries, but their development and deployment remain complex. This survey reviews 16 key challenges in building and using LLMs and examines how these challenges are addressed by two state-of-the-art models with unique approaches: OpenAI's closed source GPT-4o (May 2024 update) and DeepSeek-V3-0324 (March 2025), a large open source Mixture-of-Experts model. Through this comparison, we showcase the trade-offs between closed source models (robust safety, fine-tuned reliability) and open source models (efficiency, adaptability). We also explore LLM applications across different domains (from chatbots and coding tools to healthcare and education), highlighting which model attributes are best suited for each use case. This article aims to guide AI researchers, developers, and decision-makers in understanding current LLM capabilities, limitations, and best practices.[17] Normality and the Turing Test
Alexandre Kabbach
Main category: cs.CL
TL;DR: The paper reinterprets the Turing test through the lens of normality, suggesting that it evaluates average human intelligence and interrogators, and argues that current language models do not pass this test as they represent exceptional intelligence. It raises deeper questions about the normalist paradigm's ability to understand human cognition.
Details
Motivation: The motivation is to understand the Turing test in the context of normal/average human intelligence and interrogators, rather than focusing on exceptional intelligence. Method: The paper revisits the Turing test through the concept of normality and analyzes its implications on artificial intelligence and human cognition. Result: The paper argues that the Turing test assesses normal intelligence through the judgments of an average judge, and that large language models represent artificial smartness rather than artificial intelligence. It also highlights the broader question of whether the normalist paradigm can truly represent human cognition. Conclusion: The paper concludes that large language models like ChatGPT are unlikely to pass the Turing test as they target exceptional rather than normal/average human intelligence. It also raises the question of whether the Turing test can contribute to understanding human cognition by questioning if the human mind can be reduced to the normal/average mind. Abstract: This paper proposes to revisit the Turing test through the concept of normality. Its core argument is that the statistical interpretation of the normal--understood as the average both in the normative and mathematical sense of the term--proves useful for understanding the Turing test in at least two ways. First, in the sense that the Turing test targets normal/average rather than exceptional human intelligence, so that successfully passing the test requires building machines that "make mistakes" and display imperfect behavior just like normal/average humans. Second, in the sense that the Turing test is a statistical test where judgments of intelligence are never carried out by a single "average" judge (understood as non-expert) but always by a full jury. 
As such, the notion of "average human interrogator" that Turing talks about in his original paper should be understood primarily as referring to a mathematical abstraction made of the normalized aggregate of individual judgments of multiple judges. In short, this paper argues that the Turing test is a test of normal intelligence as assessed by a normal judge characterizing the average judgment of a pool of human interrogators. Its conclusions are twofold. First, it argues that large language models such as ChatGPT are unlikely to pass the Turing test as those models precisely target exceptional rather than normal/average human intelligence. As such, they constitute models of what it proposes to call artificial smartness rather than artificial intelligence per se. Second, it argues that the core question of whether the Turing test can contribute anything to the understanding of human cognition is that of whether the human mind is really reducible to the normal/average mind--a question which largely extends beyond the Turing test itself and questions the conceptual underpinnings of the normalist paradigm it belongs to.[18] AllSummedUp: un framework open-source pour comparer les metriques d'evaluation de resume
Tanguy Herserant,Vincent Guigue
Main category: cs.CL
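TL;DR: This paper studies reproducibility problems in the evaluation of automatic text summarization, experimentally compares the performance of different evaluation metrics, and proposes a unified open-source framework for fair comparison, while highlighting the limitations of LLM-based evaluation and directions for improvement.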
Details
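Motivation: The paper is motivated by reproducibility challenges in automatic summarization evaluation, in particular the marked differences in performance across evaluation metrics and the instability and randomness that LLM-based methods can introduce. Method: The authors run experiments on six representative evaluation metrics (including ROUGE, G-Eval, and SEval-Ex), compare the performance reported in the literature with the performance observed in their own experimental setting, and introduce a unified open-source framework that supports fair and transparent comparison on the SummEval dataset. Result: The metrics that align best with human judgments tend to be more computationally expensive and less stable across runs. The study also exposes key problems of LLM-based evaluation regarding randomness, technical dependencies, and reproducibility. Conclusion: The paper concludes that automatic summarization evaluation faces significant reproducibility challenges, especially for LLM-based methods, which suffer from randomness, technical dependencies, and limited reproducibility. It advocates more robust evaluation protocols, including exhaustive documentation and methodological standardization, to make summarization assessment more reliable. Abstract: This paper investigates reproducibility challenges in automatic text summarization evaluation. Based on experiments conducted across six representative metrics ranging from classical approaches like ROUGE to recent LLM-based methods (G-Eval, SEval-Ex), we highlight significant discrepancies between reported performances in the literature and those observed in our experimental setting. We introduce a unified, open-source framework, applied to the SummEval dataset and designed to support fair and transparent comparison of evaluation metrics. Our results reveal a structural trade-off: metrics with the highest alignment with human judgments tend to be computationally intensive and less stable across runs. Beyond comparative analysis, this study highlights key concerns about relying on LLMs for evaluation, stressing their randomness, technical dependencies, and limited reproducibility. We advocate for more robust evaluation protocols including exhaustive documentation and methodological standardization to ensure greater reliability in automatic summarization assessment.[19] Automatic Reviewers Fail to Detect Faulty Reasoning in Research Papers: A New Counterfactual Evaluation Framework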
Nils Dycke,Iryna Gurevych
Main category: cs.CL
TL;DR: This study evaluates how well automatic review generators (ARGs) detect faulty research logic using a counterfactual framework, finding that logic flaws have little impact on their reviews.
Details
Motivation: Understanding the capabilities and limitations of automatic review generators (ARGs) in detecting faulty research logic is crucial for ensuring scientific integrity. Method: Developed a counterfactual evaluation framework to assess the ability of ARGs to detect faulty research logic. Result: Tests revealed that flaws in research logic do not significantly affect the output of ARGs. Conclusion: ARGs' performance is not significantly affected by flaws in research logic, and improvements require focused strategies. Abstract: Large Language Models (LLMs) have great potential to accelerate and support scholarly peer review and are increasingly used as fully automatic review generators (ARGs). However, potential biases and systematic errors may pose significant risks to scientific integrity; understanding the specific capabilities and limitations of state-of-the-art ARGs is essential. We focus on a core reviewing skill that underpins high-quality peer review: detecting faulty research logic. This involves evaluating the internal consistency between a paper's results, interpretations, and claims. We present a fully automated counterfactual evaluation framework that isolates and tests this skill under controlled conditions. Testing a range of ARG approaches, we find that, contrary to expectation, flaws in research logic have no significant effect on their output reviews. Based on our findings, we derive three actionable recommendations for future work and release our counterfactual dataset and evaluation framework publicly.[20] Discovering Semantic Subdimensions through Disentangled Conceptual Representations
Yunhao Zhang,Shaonan Wang,Nan Lin,Xinyi Dong,Chong Li,Chengqing Zong
Main category: cs.CL
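TL;DR: This paper proposes a Disentangled Continuous Semantic Representation Model (DCSRM) that decomposes word embeddings from large language models to discover fine-grained semantic subdimensions, and assesses the neural plausibility of these subdimensions with voxel-wise encoding models.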
Details
Motivation: Existing semantic analysis methods rely on predefined coarse-grained semantic dimensions and overlook finer conceptual distinctions, so a new framework that can uncover semantic subdimensions is needed. Method: The paper introduces the Disentangled Continuous Semantic Representation Model (DCSRM), which decomposes word embeddings into multiple sub-embeddings, and uses voxel-wise encoding models to map these sub-embeddings to brain activation patterns in order to assess their neural plausibility. Result: The study identifies interpretable fine-grained semantic subdimensions, finds that polarity influences how semantic dimensions decompose, and shows that the subdimensions are cognitively and neuroscientifically plausible. Conclusion: The study provides finer-grained semantic subdimensions, reveals the principles by which semantic dimensions are structured, and supports their neural plausibility. Abstract: Understanding the core dimensions of conceptual semantics is fundamental to uncovering how meaning is organized in language and the brain. Existing approaches often rely on predefined semantic dimensions that offer only broad representations, overlooking finer conceptual distinctions. This paper proposes a novel framework to investigate the subdimensions underlying coarse-grained semantic dimensions. Specifically, we introduce a Disentangled Continuous Semantic Representation Model (DCSRM) that decomposes word embeddings from large language models into multiple sub-embeddings, each encoding specific semantic information. Using these sub-embeddings, we identify a set of interpretable semantic subdimensions. To assess their neural plausibility, we apply voxel-wise encoding models to map these subdimensions to brain activation. Our work offers more fine-grained interpretable semantic subdimensions of conceptual meaning. Further analyses reveal that semantic dimensions are structured according to distinct principles, with polarity emerging as a key factor driving their decomposition into subdimensions. The neural correlates of the identified subdimensions support their cognitive and neuroscientific plausibility.[21] Beyond the Surface: Probing the Ideological Depth of Large Language Models
Shariar Kabir,Kevin Esterling,Yue Dong
Main category: cs.CL
TL;DR: This paper explores the concept of 'ideological depth' in Large Language Models (LLMs), finding that some models have more stable and complex political representations than others. Using prompting techniques and Sparse Autoencoders, the study shows that ideological depth is measurable and that models with deeper ideological structures respond differently to interventions compared to shallower ones.
Details
Motivation: The motivation behind this study is to understand the stability and depth of ideological leanings in LLMs, as surface-level responses can often be manipulated through prompt engineering, raising questions about the coherence of underlying ideologies. Method: The paper uses a dual approach: first, measuring the 'steerability' of LLMs using instruction prompting and activation steering; second, probing the internal mechanisms of these models using Sparse Autoencoders (SAEs). Result: The results show that some models can easily switch between liberal and conservative viewpoints, while others resist or refuse, indicating more entrenched ideologies. Additionally, models with lower steerability possess more distinct and abstract ideological features. Targeted ablation in 'deep' models leads to logical shifts in reasoning, while similar interventions in 'shallow' models increase refusal outputs. Conclusion: This paper concludes that ideological depth is a quantifiable property of Large Language Models (LLMs), and that steerability provides insights into their latent political architecture. Abstract: Large Language Models (LLMs) have demonstrated pronounced ideological leanings, yet the stability and depth of these positions remain poorly understood. Surface-level responses can often be manipulated through simple prompt engineering, calling into question whether they reflect a coherent underlying ideology. This paper investigates the concept of "ideological depth" in LLMs, defined as the robustness and complexity of their internal political representations. We employ a dual approach: first, we measure the "steerability" of two well-known open-source LLMs using instruction prompting and activation steering. We find that while some models can easily switch between liberal and conservative viewpoints, others exhibit resistance or an increased rate of refusal, suggesting a more entrenched ideological structure. 
Second, we probe the internal mechanisms of these models using Sparse Autoencoders (SAEs). Preliminary analysis reveals that models with lower steerability possess more distinct and abstract ideological features. Our evaluations reveal that one model can contain 7.3x more political features than another model of similar size. This allows targeted ablation of a core political feature in an ideologically "deep" model, leading to consistent, logical shifts in its reasoning across related topics, whereas the same intervention in a "shallow" model results in an increase in refusal outputs. Our findings suggest that ideological depth is a quantifiable property of LLMs and that steerability serves as a valuable window into their latent political architecture.[22] Igniting Creative Writing in Small Language Models: LLM-as-a-Judge versus Multi-Agent Refined Rewards
Xiaolong Wei,Bo Lu,Xingyu Zhang,Zhejun Zhao,Dongdong Shen,Long Xia,Dawei Yin
Main category: cs.CL
TL;DR: This paper introduces two AI-driven reward strategies using a Reinforcement Learning from AI Feedback (RLAIF) framework to enhance creative writing in a 7B-parameter small language model, with the principle-guided LLM-as-a-Judge proving most effective.
Details
Motivation: The motivation is to address the limitations of current methods like Supervised Fine-Tuning (SFT) and Reinforcement Learning from Human Feedback (RLHF), which struggle with novelty and are costly, respectively, in enhancing the creative writing capabilities of small language models (SLMs). Method: This paper explores two AI-driven reward strategies within a Reinforcement Learning from AI Feedback (RLAIF) framework: one uses a Reward Model (RM) trained on high-quality preference data curated by a multi-agent rejection sampling framework, while the other utilizes a principle-guided LLM-as-a-Judge with an adversarially optimized reward function and a reflection mechanism. Result: Both approaches significantly enhance creative output compared to baselines, but the principle-guided LLM-as-a-Judge yields superior generation quality. Additionally, it improves training efficiency and reduces dependency on human-annotated data. Conclusion: The paper concludes that the principle-guided LLM-as-a-Judge approach provides a more scalable and effective method for enhancing creative writing in small language models, offering advantages in training efficiency and reduced reliance on human-annotated data. Abstract: Large Language Models (LLMs) have demonstrated remarkable creative writing capabilities, yet their substantial computational demands hinder widespread use. Enhancing Small Language Models (SLMs) offers a promising alternative, but current methods like Supervised Fine-Tuning (SFT) struggle with novelty, and Reinforcement Learning from Human Feedback (RLHF) is costly. This paper explores two distinct AI-driven reward strategies within a Reinforcement Learning from AI Feedback (RLAIF) framework to ignite the creative writing of a 7B-parameter SLM, specifically for generating Chinese greetings. The first strategy employs a RM trained on high-quality preference data curated by a novel multi-agent rejection sampling framework designed for creative tasks. 
The second, more novel strategy utilizes a principle-guided LLM-as-a-Judge, whose reward function is optimized via an adversarial training scheme with a reflection mechanism, to directly provide reward signals. Comprehensive experiments reveal that while both approaches significantly enhance creative output over baselines, the principle-guided LLM-as-a-Judge demonstrably yields superior generation quality. Furthermore, it offers notable advantages in training efficiency and reduced dependency on human-annotated data, presenting a more scalable and effective path towards creative SLMs. Our automated evaluation methods also exhibit strong alignment with human judgments. Our code and data are publicly available at https://github.com/weixiaolong94-hub/Igniting-Creative-Writing-in-Small-Language-Models.[23] HSFN: Hierarchical Selection for Fake News Detection building Heterogeneous Ensemble
Sara B. Coutinho,Rafael M. O. Cruz,Francimaria R. S. Nascimento,George D. C. Cavalcanti
Main category: cs.CL
TL;DR: This paper proposes a new ensemble learning method that prioritizes the selection of diverse classifiers to improve accuracy and generalization.
Details
Motivation: Ensemble methods are particularly effective at combining multiple classifiers to improve robustness, but their performance depends heavily on the diversity of the constituent classifiers. Selecting genuinely diverse models remains a key challenge, especially when models tend to learn redundant patterns. Method: The method first computes pairwise diversity between classifiers, then applies hierarchical clustering to organize them into groups at different levels of granularity. HierarchySelect explores these hierarchical levels and selects one pool of classifiers per level, each pool representing a distinct intra-pool diversity. The most diverse pool is then identified and used to build the ensemble. Result: Compared against the Elbow heuristic and state-of-the-art baselines on six datasets from different application domains, the method achieves the highest accuracy on two of the six datasets. Conclusion: The paper proposes a new automatic classifier selection approach that prioritizes diversity and is extended with a performance criterion, with the aim of building more effective ensembles. Abstract: Psychological biases, such as confirmation bias, make individuals particularly vulnerable to believing and spreading fake news on social media, leading to significant consequences in domains such as public health and politics. Machine learning-based fact-checking systems have been widely studied to mitigate this problem. Among them, ensemble methods are particularly effective in combining multiple classifiers to improve robustness. However, their performance heavily depends on the diversity of the constituent classifiers; selecting genuinely diverse models remains a key challenge, especially when models tend to learn redundant patterns. In this work, we propose a novel automatic classifier selection approach that prioritizes diversity, also extended by performance. The method first computes pairwise diversity between classifiers and applies hierarchical clustering to organize them into groups at different levels of granularity. A HierarchySelect then explores these hierarchical levels to select one pool of classifiers per level, each representing a distinct intra-pool diversity. The most diverse pool is identified and selected for ensemble construction from these. The selection process incorporates an evaluation metric reflecting each classifier's performance to ensure the ensemble also generalises well. We conduct experiments with 40 heterogeneous classifiers across six datasets from different application domains and with varying numbers of classes. Our method is compared against the Elbow heuristic and state-of-the-art baselines.
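The selection procedure described in this abstract can be illustrated with a minimal pure-Python sketch. Everything in it is an assumption made for illustration and is not taken from the HSFN repository: the disagreement rate as the pairwise diversity measure, average-linkage clustering, picking the most accurate classifier of each cluster as its representative, and all function names.

```python
from itertools import combinations

def disagreement(p, q):
    """Pairwise diversity (assumed measure): fraction of samples where predictions differ."""
    return sum(a != b for a, b in zip(p, q)) / len(p)

def accuracy(pred, y):
    """Fraction of correct predictions, used as the performance criterion."""
    return sum(a == b for a, b in zip(pred, y)) / len(y)

def cluster_levels(preds):
    """Average-linkage agglomerative clustering; yields the partition at every level."""
    clusters = [[i] for i in range(len(preds))]
    yield [c[:] for c in clusters]
    while len(clusters) > 1:
        def linkage(pair):
            a, b = pair
            total = sum(disagreement(preds[i], preds[j])
                        for i in clusters[a] for j in clusters[b])
            return total / (len(clusters[a]) * len(clusters[b]))
        # Merge the two clusters whose members disagree the least (a < b always holds).
        a, b = min(combinations(range(len(clusters)), 2), key=linkage)
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
        yield [c[:] for c in clusters]

def pool_diversity(pool, preds):
    """Mean pairwise disagreement inside a pool (0.0 for a single classifier)."""
    pairs = list(combinations(pool, 2))
    if not pairs:
        return 0.0
    return sum(disagreement(preds[i], preds[j]) for i, j in pairs) / len(pairs)

def select_pool(preds, y):
    """Build one pool per hierarchy level and return the most diverse one."""
    best_pool, best_div = None, -1.0
    for partition in cluster_levels(preds):
        # One representative per cluster: its most accurate member (assumed rule).
        pool = [max(c, key=lambda i: accuracy(preds[i], y)) for c in partition]
        div = pool_diversity(pool, preds)
        if div > best_div:
            best_pool, best_div = sorted(pool), div
    return best_pool, best_div

# Toy validation set: classifiers 0 and 1 are redundant, classifier 2 differs.
y = [0, 1, 0, 1]
preds = [[0, 1, 0, 1], [0, 1, 0, 1], [1, 1, 0, 1]]
pool, div = select_pool(preds, y)
```

In this toy case the redundant pair is merged at the first clustering step, so the selected pool keeps only one of the two identical classifiers alongside the dissimilar one.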
Results show that our approach achieves the highest accuracy on two of six datasets. The implementation details are available on the project's repository: https://github.com/SaraBCoutinho/HSFN .[24] L3Cube-MahaSTS: A Marathi Sentence Similarity Dataset and Models
Aishwarya Mirashi,Ananya Joshi,Raviraj Joshi
Main category: cs.CL
TL;DR: This paper presents the MahaSTS dataset and the MahaSBERT-STS-v2 model to improve performance on Marathi sentence similarity tasks.
Details
Motivation: Effective sentence similarity modeling in Marathi, a low-resource language, requires a high-quality dataset and targeted model optimization. Method: The authors build MahaSTS, a dataset of 16,860 Marathi sentence pairs, and fine-tune the MahaSBERT model with a regression objective. Result: MahaSBERT-STS-v2 outperforms alternatives such as MahaBERT, MuRIL, IndicBERT, and IndicSBERT on the MahaSTS dataset. Conclusion: Together, MahaSTS and MahaSBERT-STS-v2 offer an effective solution for Marathi sentence similarity, underscoring the impact of human annotation, targeted fine-tuning, and structured supervision in low-resource settings. Abstract: We present MahaSTS, a human-annotated Sentence Textual Similarity (STS) dataset for Marathi, along with MahaSBERT-STS-v2, a fine-tuned Sentence-BERT model optimized for regression-based similarity scoring. The MahaSTS dataset consists of 16,860 Marathi sentence pairs labeled with continuous similarity scores in the range of 0-5. To ensure balanced supervision, the dataset is uniformly distributed across six score-based buckets spanning the full 0-5 range, thus reducing label bias and enhancing model stability. We fine-tune the MahaSBERT model on this dataset and benchmark its performance against other alternatives like MahaBERT, MuRIL, IndicBERT, and IndicSBERT. Our experiments demonstrate that MahaSTS enables effective training for sentence similarity tasks in Marathi, highlighting the impact of human-curated annotations, targeted fine-tuning, and structured supervision in low-resource settings. The dataset and model are publicly shared at https://github.com/l3cube-pune/MarathiNLP [25] A Survey on Current Trends and Recent Advances in Text Anonymization
Tobias Deußer,Lorenz Sparrenberg,Armin Berger,Max Hahnbück,Christian Bauckhage,Rafet Sifa
Main category: cs.CL
TL;DR: This survey reviews recent advances in text anonymization techniques, emphasizing the role of Named Entity Recognition, Large Language Models, and formal privacy frameworks. It addresses domain-specific challenges and evaluates the balance between data privacy and usability, aiming to guide future research and practical applications in the field.
Details
Motivation: The motivation stems from the growing volume of textual data containing sensitive personal information across various domains. Effective anonymization is crucial to protect privacy, comply with regulations, and maintain data usability for downstream tasks. Method: The study uses a survey methodology, providing a comprehensive review of existing literature on text anonymization. It categorizes techniques based on foundational approaches like Named Entity Recognition, domain-specific applications, and advanced frameworks incorporating privacy models. Result: The paper presents an organized overview of current trends and challenges in text anonymization, highlighting the dual role of Large Language Models in both advancing anonymization techniques and posing new risks. It also reviews domain-specific solutions, evaluation metrics, and toolkits for practical deployment. Conclusion: The paper concludes that while text anonymization techniques have advanced significantly, especially with the integration of Large Language Models and formal privacy frameworks, challenges like the privacy-utility trade-off and quasi-identifier handling persist. The survey aims to guide future research and practical implementations in this evolving field. Abstract: The proliferation of textual data containing sensitive personal information across various domains requires robust anonymization techniques to protect privacy and comply with regulations, while preserving data usability for diverse and crucial downstream tasks. This survey provides a comprehensive overview of current trends and recent advances in text anonymization techniques. We begin by discussing foundational approaches, primarily centered on Named Entity Recognition, before examining the transformative impact of Large Language Models, detailing their dual role as sophisticated anonymizers and potent de-anonymization threats. 
The survey further explores domain-specific challenges and tailored solutions in critical sectors such as healthcare, law, finance, and education. We investigate advanced methodologies incorporating formal privacy models and risk-aware frameworks, and address the specialized subfield of authorship anonymization. Additionally, we review evaluation frameworks, comprehensive metrics, benchmarks, and practical toolkits for real-world deployment of anonymization solutions. This review consolidates current knowledge, identifies emerging trends and persistent challenges, including the evolving privacy-utility trade-off, the need to address quasi-identifiers, and the implications of LLM capabilities, and aims to guide future research directions for both academics and practitioners in this field.[26] Middo: Model-Informed Dynamic Data Optimization for Enhanced LLM Fine-Tuning via Closed-Loop Learning
Zinan Tang,Xin Gao,Qizhi Pei,Zhuoshi Pan,Mengzhang Cai,Jiang Wu,Conghui He,Lijun Wu
Main category: cs.CL
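TL;DR: This paper proposes Middo, a dynamic data optimization framework that continually improves large language model performance through model-aware data selection and semantics-preserving data refinement.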
Details
Motivation: Traditional static dataset curation cannot adapt to evolving model capabilities, so a dynamic data optimization framework is needed. Method: The Middo framework comprises a self-diagnostic module and an adaptive optimization engine, and optimizes the training data dynamically through a closed-loop optimization system. Result: Experiments show an average accuracy improvement of 7.15% while keeping the dataset size unchanged. Conclusion: Middo offers a new paradigm for sustainable LLM training, improving model performance through dynamic human-AI co-evolution of data and models. Abstract: Supervised Fine-Tuning (SFT) of Large Language Models (LLMs) fundamentally relies on high-quality training data. While data selection and data synthesis are two common strategies to improve data quality, existing approaches often face limitations in static dataset curation that fail to adapt to evolving model capabilities. In this paper, we introduce Middo, a self-evolving Model-informed dynamic data optimization framework that uses model-aware data selection and context-preserving data refinement. Unlike conventional one-off filtering/synthesis methods, our framework establishes a closed-loop optimization system: (1) A self-referential diagnostic module proactively identifies suboptimal samples through tri-axial model signals - loss patterns (complexity), embedding cluster dynamics (diversity), and self-alignment scores (quality); (2) An adaptive optimization engine then transforms suboptimal samples into pedagogically valuable training points while preserving semantic integrity; (3) This optimization process continuously evolves with model capability through dynamic learning principles. Experiments on multiple benchmarks demonstrate that Middo consistently enhances the quality of seed data and boosts LLM performance, improving accuracy by 7.15% on average while maintaining the original dataset scale. This work establishes a new paradigm for sustainable LLM training through dynamic human-AI co-evolution of data and models. Our datasets, models, and code are coming soon.[27] Personality Matters: User Traits Predict LLM Preferences in Multi-Turn Collaborative Tasks
Sarfaroz Yunusov,Kaige Chen,Kazi Nishat Anwar,Ali Emami
Main category: cs.CL
TL;DR: This study finds that users' personality traits influence their preference for LLMs like GPT-4 or Claude 3.5, revealing differences that traditional evaluations overlook.
Details
Motivation: As Large Language Models (LLMs) integrate into everyday workflows, understanding whether users with different personality traits systematically prefer certain LLMs over others is critical for optimizing human-AI collaboration. Method: A study with 32 participants across four Keirsey personality types evaluated interactions with GPT-4 and Claude 3.5 on four collaborative tasks: data analysis, creative writing, information retrieval, and writing assistance. Result: Results showed significant personality-driven preferences: Rationals preferred GPT-4, especially for goal-oriented tasks, while Idealists favored Claude 3.5, particularly for creative and analytical tasks. Other types exhibited task-dependent preferences, with sentiment analysis supporting these patterns. Conclusion: Personality-based analysis reveals LLM differences that traditional evaluations miss, with Rationals preferring GPT-4 for goal-oriented tasks and Idealists favoring Claude 3.5 for creative and analytical tasks. Abstract: As Large Language Models (LLMs) increasingly integrate into everyday workflows, where users shape outcomes through multi-turn collaboration, a critical question emerges: do users with different personality traits systematically prefer certain LLMs over others? We conducted a study with 32 participants evenly distributed across four Keirsey personality types, evaluating their interactions with GPT-4 and Claude 3.5 across four collaborative tasks: data analysis, creative writing, information retrieval, and writing assistance. Results revealed significant personality-driven preferences: Rationals strongly preferred GPT-4, particularly for goal-oriented tasks, while Idealists favored Claude 3.5, especially for creative and analytical tasks. Other personality types showed task-dependent preferences. Sentiment analysis of qualitative feedback confirmed these patterns.
Notably, aggregate helpfulness ratings were similar across models, showing how personality-based analysis reveals LLM differences that traditional evaluations miss.[28] QZhou-Embedding Technical Report
Peng Yu,En Xu,Bin Chen,Haibiao Chen,Yinfei Xu
Main category: cs.CL
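TL;DR: QZhou-Embedding is a general-purpose contextual text embedding model built on the Qwen2.5-7B-Instruct foundation model; with a unified multi-task framework and a data synthesis pipeline, it reaches state-of-the-art results on the MTEB and CMTEB benchmarks.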
Details
Motivation: The goal is to develop a general-purpose contextual text embedding model with strong text representation capabilities in order to improve retrieval model performance. Method: The authors design a unified multi-task framework with specialized data transformation and training strategies, and develop a data synthesis pipeline that uses LLM APIs to increase the semantic richness and sample difficulty of the training set. Result: QZhou-Embedding ranks first on both the MTEB and CMTEB benchmarks and performs strongly on tasks such as reranking and clustering. Conclusion: High-quality, diverse data is crucial for improving retrieval model performance, and the generative capabilities of LLMs can further improve the data quality and performance of embedding models. Abstract: We present QZhou-Embedding, a general-purpose contextual text embedding model with exceptional text representation capabilities. Built upon the Qwen2.5-7B-Instruct foundation model, we designed a unified multi-task framework comprising specialized data transformation and training strategies. The data transformation scheme enables the incorporation of more diverse textual training datasets, while the task-specific training strategies enhance model learning efficiency. We developed a data synthesis pipeline leveraging LLM APIs, incorporating techniques such as paraphrasing, augmentation, and hard negative example generation to improve the semantic richness and sample difficulty of the training set. Additionally, we employ a two-stage training strategy, comprising initial retrieval-focused pretraining followed by full-task fine-tuning, enabling the embedding model to extend its capabilities based on robust retrieval performance. Our model achieves state-of-the-art results on the MTEB and CMTEB benchmarks, ranking first on both leaderboards (August 27 2025), and simultaneously achieves state-of-the-art performance on tasks including reranking, clustering, etc. Our findings demonstrate that higher-quality, more diverse data is crucial for advancing retrieval model performance, and that leveraging LLMs' generative capabilities can further optimize data quality for embedding model breakthroughs. Our model weights are released on HuggingFace under Apache 2.0 license. For reproducibility, we provide evaluation code and instructions on GitHub.[29] Is this chart lying to me? Automating the detection of misleading visualizations
Jonathan Tonglet,Jan Zimny,Tinne Tuytelaars,Iryna Gurevych
Main category: cs.CL
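TL;DR: This paper introduces the Misviz and Misviz-synth datasets for detecting misleading visualizations and evaluates existing methods on them; the results show that the task remains highly challenging.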
Details
Motivation: Misleading visualizations are an important driver of misinformation on social media and the web; automatically detecting them and identifying the design rules they violate could reduce their impact. Method: The authors introduce two datasets, Misviz and Misviz-synth, and evaluate them comprehensively with state-of-the-art MLLMs, rule-based systems, and fine-tuned classifiers. Result: The task remains highly challenging, even with the new datasets available for model training and evaluation. Conclusion: The release of Misviz and Misviz-synth provides a useful resource for detecting misleading visualizations, but the task remains challenging. Abstract: Misleading visualizations are a potent driver of misinformation on social media and the web. By violating chart design principles, they distort data and lead readers to draw inaccurate conclusions. Prior work has shown that both humans and multimodal large language models (MLLMs) are frequently deceived by such visualizations. Automatically detecting misleading visualizations and identifying the specific design rules they violate could help protect readers and reduce the spread of misinformation. However, the training and evaluation of AI models has been limited by the absence of large, diverse, and openly available datasets. In this work, we introduce Misviz, a benchmark of 2,604 real-world visualizations annotated with 12 types of misleaders. To support model training, we also release Misviz-synth, a synthetic dataset of 81,814 visualizations generated using Matplotlib and based on real-world data tables. We perform a comprehensive evaluation on both datasets using state-of-the-art MLLMs, rule-based systems, and fine-tuned classifiers. Our results reveal that the task remains highly challenging. We release Misviz, Misviz-synth, and the accompanying code.[30] Not All Parameters Are Created Equal: Smart Isolation Boosts Fine-Tuning Performance
Yao Wang,Di Liang,Minlong Peng
Main category: cs.CL
TL;DR: This paper introduces CPI-FT, a novel fine-tuning framework that isolates core model parameters for individual tasks and uses parameter fusion to reduce interference and forgetting during multi-task adaptation of large language models.
Details
Motivation: The motivation stems from the challenge of the 'seesaw phenomenon' in traditional supervised fine-tuning, where improving performance on certain tasks leads to degradation in others due to indiscriminate parameter updates. Method: The CPI-FT framework identifies core parameter regions for each task through independent fine-tuning, groups tasks based on parameter region overlap, and applies parameter fusion using SLERP for non-core parameters while freezing core regions during a pipelined SFT training phase. Result: Experiments show that the CPI-FT framework consistently outperforms standard multi-task and multi-stage fine-tuning baselines by mitigating task interference and catastrophic forgetting. Conclusion: The proposed CPI-FT framework effectively addresses the 'seesaw phenomenon' in supervised fine-tuning of large language models, significantly reducing task interference and forgetting compared to conventional methods. Abstract: Supervised fine-tuning (SFT) is a pivotal approach to adapting large language models (LLMs) for downstream tasks; however, performance often suffers from the "seesaw phenomenon", where indiscriminate parameter updates yield progress on certain tasks at the expense of others. To address this challenge, we propose a novel Core Parameter Isolation Fine-Tuning (CPI-FT) framework. Specifically, we first independently fine-tune the LLM on each task to identify its core parameter regions by quantifying parameter update magnitudes. Tasks with similar core regions are then grouped based on region overlap, forming clusters for joint modeling. We further introduce a parameter fusion technique: for each task, core parameters from its individually fine-tuned model are directly transplanted into a unified backbone, while non-core parameters from different tasks are smoothly integrated via Spherical Linear Interpolation (SLERP), mitigating destructive interference.
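The fusion step described above can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `slerp` and `fuse`, the flat-list parameter layout, the boolean `core_mask`, and the interpolation weight `t` are all assumptions made for illustration. The sketch keeps "core" entries from one task's fine-tuned parameters unchanged and blends the remaining entries with Spherical Linear Interpolation, falling back to linear interpolation when the two vectors are nearly parallel or one of them is near zero.

```python
import math


def slerp(v0, v1, t, eps=1e-8):
    """Spherical linear interpolation between two equal-length vectors.

    The angle is measured between the normalized vectors, and the resulting
    weights are applied to the raw vectors (a common convention in model
    merging; assumed here, not taken from the paper).
    """
    norm0 = math.sqrt(sum(x * x for x in v0))
    norm1 = math.sqrt(sum(x * x for x in v1))
    linear = [(1 - t) * a + t * b for a, b in zip(v0, v1)]
    if norm0 < eps or norm1 < eps:
        return linear  # the angle is undefined for a zero vector
    cos_omega = sum(a * b for a, b in zip(v0, v1)) / (norm0 * norm1)
    cos_omega = max(-1.0, min(1.0, cos_omega))  # guard against rounding
    omega = math.acos(cos_omega)
    sin_omega = math.sin(omega)
    if abs(sin_omega) < eps:
        return linear  # nearly parallel vectors: SLERP degenerates
    w0 = math.sin((1 - t) * omega) / sin_omega
    w1 = math.sin(t * omega) / sin_omega
    return [w0 * a + w1 * b for a, b in zip(v0, v1)]


def fuse(task_params, other_params, core_mask, t=0.5):
    """Keep core entries from task_params; SLERP the non-core entries.

    core_mask[i] is True where parameter i is treated as "core" for the task
    (hypothetical representation of a core parameter region).
    """
    non_core = [i for i, is_core in enumerate(core_mask) if not is_core]
    blended = slerp(
        [task_params[i] for i in non_core],
        [other_params[i] for i in non_core],
        t,
    )
    fused = list(task_params)
    for i, value in zip(non_core, blended):
        fused[i] = value
    return fused
```

For example, `fuse([5.0, 1.0, 0.0], [9.0, 0.0, 1.0], [True, False, False])` leaves the first (core) entry at 5.0 and interpolates the other two halfway along the arc between `[1.0, 0.0]` and `[0.0, 1.0]`.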
A lightweight, pipelined SFT training phase using mixed-task data is subsequently employed, while freezing core regions from prior tasks to prevent catastrophic forgetting. Extensive experiments on multiple public benchmarks demonstrate that our approach significantly alleviates task interference and forgetting, consistently outperforming vanilla multi-task and multi-stage fine-tuning baselines.
[31] Reasoning-Intensive Regression
Diane Tchuindjo,Omar Khattab
Main category: cs.CL
TL;DR: This paper introduces MENTAT, a new method combining prompt optimization and ensemble learning, which significantly improves performance on reasoning-intensive regression tasks compared to existing approaches.
Details
Motivation: The paper addresses the challenge of applying large language models to reasoning-intensive regression tasks, which require deeper text analysis with limited training data and computation. Current methods like prompting frozen LLMs and gradient descent-based fine-tuning are hypothesized to perform poorly in these scenarios. Method: The study casts three realistic problems as reasoning-intensive regression (RiR) tasks to test the hypothesis that prompting frozen LLMs and fine-tuning Transformer encoders via gradient descent struggle in RiR. The proposed solution, MENTAT, combines batch-reflective prompt optimization with neural ensemble learning. Result: MENTAT achieves up to a 65% improvement over the baselines in reasoning-intensive regression tasks. Conclusion: MENTAT is an effective method for reasoning-intensive regression tasks, showing significant improvement over baselines, but there is still room for future advances in this area. Abstract: AI researchers and practitioners increasingly apply large language models (LLMs) to what we call reasoning-intensive regression (RiR), i.e. deducing subtle numerical properties from text. Unlike standard language regression tasks, e.g. for sentiment or similarity, RiR often appears instead in ad-hoc problems like rubric-based scoring or domain-specific retrieval, where much deeper analysis of text is required while only limited task-specific training data and computation are available. We cast three realistic problems as RiR tasks to establish an initial benchmark, and use that to test our hypothesis that prompting frozen LLMs and finetuning Transformer encoders via gradient descent will both often struggle in RiR. We then propose MENTAT, a simple and lightweight method that combines batch-reflective prompt optimization with neural ensemble learning. 
MENTAT achieves up to 65% improvement over both baselines, though substantial room remains for future advances in RiR.
[32] PiCSAR: Probabilistic Confidence Selection And Ranking
Joshua Ong Jun Leang,Zheng Zhao,Aryo Pradipta Gema,Sohee Yang,Wai-Chung Kwan,Xuanli He,Wenda Li,Pasquale Minervini,Eleonora Giunchiglia,Shay B. Cohen
Main category: cs.CL
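TL;DR: PiCSAR is a training-free method that improves the accuracy of large language models and large reasoning models by scoring candidates with the joint log-likelihood of the reasoning chain and the final answer.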
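The scoring idea in this entry follows from the chain rule: for a prompt x, reasoning r, and answer y, log p(r, y | x) = log p(r | x) + log p(y | r, x), that is, reasoning confidence plus answer confidence. The sketch below is a minimal illustration of best-of-n selection under that rule, assuming per-token log-probabilities have already been obtained from the model. The dictionary keys and function names are hypothetical, and any normalization or other details the authors may use are not shown.

```python
def joint_log_likelihood(reasoning_logprobs, answer_logprobs):
    """Sum token log-probabilities of the reasoning and of the final answer.

    The first sum plays the role of reasoning confidence, log p(r | x);
    the second plays the role of answer confidence, log p(y | r, x).
    """
    reasoning_confidence = sum(reasoning_logprobs)
    answer_confidence = sum(answer_logprobs)
    return reasoning_confidence + answer_confidence


def select_best(candidates):
    """Return the candidate with the highest joint log-likelihood."""
    return max(
        candidates,
        key=lambda c: joint_log_likelihood(
            c["reasoning_logprobs"], c["answer_logprobs"]
        ),
    )


# Hypothetical candidates sampled for one prompt (values are made up).
candidates = [
    {"answer": "42", "reasoning_logprobs": [-0.1, -0.2], "answer_logprobs": [-0.05]},
    {"answer": "41", "reasoning_logprobs": [-0.5, -0.4], "answer_logprobs": [-0.01]},
]
best = select_best(candidates)
```

In this toy input the first candidate is selected: its answer tokens are slightly less likely than the second candidate's, but its reasoning is far more likely, so its joint score is higher.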
Details
Motivation: The key challenge is designing a scoring function that can identify correct reasoning chains without access to ground-truth answers. Method: PiCSAR scores each candidate generation using the joint log-likelihood of the reasoning and the final answer, which naturally decomposes into reasoning confidence and answer confidence. Result: PiCSAR achieves substantial gains across diverse benchmarks and outperforms baselines with at least 2x fewer samples in 16 out of 20 comparisons. Conclusion: The analysis shows that correct reasoning chains exhibit higher reasoning and answer confidence, which supports the effectiveness of PiCSAR. Abstract: Best-of-n sampling improves the accuracy of large language models (LLMs) and large reasoning models (LRMs) by generating multiple candidate solutions and selecting the one with the highest reward. The key challenge for reasoning tasks is designing a scoring function that can identify correct reasoning chains without access to ground-truth answers. We propose Probabilistic Confidence Selection And Ranking (PiCSAR): a simple, training-free method that scores each candidate generation using the joint log-likelihood of the reasoning and final answer. The joint log-likelihood of the reasoning and final answer naturally decomposes into reasoning confidence and answer confidence. PiCSAR achieves substantial gains across diverse benchmarks (+10.18 on MATH500, +9.81 on AIME2025), outperforming baselines with at least 2x fewer samples in 16 out of 20 comparisons. Our analysis reveals that correct reasoning chains exhibit significantly higher reasoning and answer confidence, justifying the effectiveness of PiCSAR.
[33] Going over Fine Web with a Fine-Tooth Comb: Technical Report of Indexing Fine Web for Problematic Content Search and Retrieval
Inés Altemir Marinas,Anastasiia Kucherenko,Andrei Kucharavy
Main category: cs.CL
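TL;DR: This work develops an efficient ElasticSearch-based framework for real-time analysis of large language model training data, aiming to make AI systems safer and more accountable.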
Details
Motivation: Large language models rely on web-scale datasets such as Common Crawl, and web-crawled content carries quality problems and ethical risks, so training data needs efficient, real-time analysis to ensure data quality and safety. Method: Builds a framework for indexing and analyzing LLM training datasets using an ElasticSearch-based pipeline and applies it to SwissAI's FineWeb-2 corpus. Result: On SwissAI's FineWeb-2 corpus the framework achieves fast query performance, with most searches completing in milliseconds and all under 2 seconds. Conclusion: The work demonstrates a real-time analysis framework for LLM training datasets, offering practical tools for building safer, more accountable AI systems. Abstract: Large language models (LLMs) rely heavily on web-scale datasets like Common Crawl, which provides over 80% of training data for some modern models. However, the indiscriminate nature of web crawling raises challenges in data quality, safety, and ethics. Despite the critical importance of training data quality, prior research on harmful content has been limited to small samples due to computational constraints. This project presents a framework for indexing and analyzing LLM training datasets using an ElasticSearch-based pipeline. We apply it to SwissAI's FineWeb-2 corpus (1.5TB, four languages), achieving fast query performance (most searches in milliseconds, all under 2 seconds). Our work demonstrates real-time dataset analysis, offering practical tools for safer, more accountable AI systems.
[34] Med-RewardBench: Benchmarking Reward Models and Judges for Medical Multimodal Large Language Models
Meidan Ding,Jipeng Zhang,Wenxuan Wang,Cheng-Yi Li,Wei-Chieh Fang,Hsin-Yu Wu,Haiqin Zhong,Wenting Chen,Linlin Shen
Main category: cs.CL
TL;DR: Med-RewardBench is introduced as the first benchmark for evaluating medical reward models and judges, addressing the gap in assessing multimodal large language models for clinical accuracy and relevance.
Details
Motivation: The lack of dedicated benchmarks for evaluating medical reward models and judges in clinical contexts necessitates the development of Med-RewardBench. Method: Constructed Med-RewardBench with a multimodal dataset across 13 organ systems and 8 clinical departments, using a three-step process to ensure quality, and evaluated 32 MLLMs and baseline models. Result: Evaluation of 32 state-of-the-art MLLMs showed significant challenges in aligning outputs with expert judgment, while baseline models demonstrated improved performance through fine-tuning. Conclusion: Med-RewardBench successfully provides a comprehensive and high-quality benchmark for evaluating medical reward models and judges, highlighting the need for further improvements in aligning MLLM outputs with expert judgment. Abstract: Multimodal large language models (MLLMs) hold significant potential in medical applications, including disease diagnosis and clinical decision-making. However, these tasks require highly accurate, context-sensitive, and professionally aligned responses, making reliable reward models and judges critical. Despite their importance, medical reward models (MRMs) and judges remain underexplored, with no dedicated benchmarks addressing clinical requirements. Existing benchmarks focus on general MLLM capabilities or evaluate models as solvers, neglecting essential evaluation dimensions like diagnostic accuracy and clinical relevance. To address this, we introduce Med-RewardBench, the first benchmark specifically designed to evaluate MRMs and judges in medical scenarios. Med-RewardBench features a multimodal dataset spanning 13 organ systems and 8 clinical departments, with 1,026 expert-annotated cases. A rigorous three-step process ensures high-quality evaluation data across six clinically critical dimensions. 
We evaluate 32 state-of-the-art MLLMs, including open-source, proprietary, and medical-specific models, revealing substantial challenges in aligning outputs with expert judgment. Additionally, we develop baseline models that demonstrate substantial performance improvements through fine-tuning.
cs.CV [Back]
[35] 2COOOL: 2nd Workshop on the Challenge Of Out-Of-Label Hazards in Autonomous Driving
Ali K. AlShami,Ryan Rabinowitz,Maged Shoman,Jianwu Fang,Lukas Picek,Shao-Yuan Lo,Steve Cruz,Khang Nhut Lam,Nachiket Kamod,Lei-Lei Li,Jugal Kalita,Terrance E. Boult
Main category: cs.CV
TL;DR: The 2COOOL workshop focuses on addressing novel and out-of-distribution scenarios in autonomous driving to improve safety, featuring discussions on hazard detection, modeling, and methodologies.
Details
Motivation: Despite advances in autonomous driving algorithms, entirely safe self-driving cars have not yet been achieved due to challenges in handling novel and out-of-distribution scenarios. Method: Providing a forum for researchers and industry experts to discuss novelty handling, out-of-distribution hazard detection, vision-language models for hazard understanding, new benchmarking methodologies, and safe autonomous driving practices. Result: The workshop will be held at ICCV 2025, building on the success of its inaugural edition at WACV 2025, with the goal of pushing the state of the art in novelty handling for autonomous driving. Conclusion: The 2COOOL workshop aims to inspire the development of new algorithms and systems for hazard avoidance in autonomous driving, drawing on various techniques and fields, and will feature a mix of academic and industry participation. Abstract: As the computer vision community advances autonomous driving algorithms, integrating vision-based insights with sensor data remains essential for improving perception, decision making, planning, prediction, simulation, and control. Yet we must ask: Why don't we have entirely safe self-driving cars yet? A key part of the answer lies in addressing novel scenarios, one of the most critical barriers to real-world deployment. Our 2COOOL workshop provides a dedicated forum for researchers and industry experts to push the state of the art in novelty handling, including out-of-distribution hazard detection, vision-language models for hazard understanding, new benchmarking and methodologies, and safe autonomous driving practices. The 2nd Workshop on the Challenge of Out-of-Label Hazards in Autonomous Driving (2COOOL) will be held at the International Conference on Computer Vision (ICCV) 2025 in Honolulu, Hawaii, on October 19, 2025. 
We aim to inspire the development of new algorithms and systems for hazard avoidance, drawing on ideas from anomaly detection, open-set recognition, open-vocabulary modeling, domain adaptation, and related fields. Building on the success of its inaugural edition at the Winter Conference on Applications of Computer Vision (WACV) 2025, the workshop will feature a mix of academic and industry participation.
[36] Advanced Deep Learning Techniques for Classifying Dental Conditions Using Panoramic X-Ray Images
Alireza Golkarieh,Kiana Kiashemshaki,Sajjad Rezvani Boroujeni
Main category: cs.CV
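TL;DR: This study tests several deep learning models for automated diagnosis of dental conditions and finds that hybrid models (CNN plus traditional classifiers) perform best, offering a practical approach to automated dental diagnosis.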
Details
Motivation: The study explores deep learning methods for automated classification of dental conditions in panoramic X-ray images, with the aim of improving diagnostic efficiency and reliability. Method: Three approaches were evaluated with 5-fold cross-validation: a custom convolutional neural network (CNN), hybrid models combining CNN feature extraction with traditional classifiers, and fine-tuned pre-trained architectures. Result: The hybrid CNN Random Forest model performed best, reaching 85.4% accuracy and surpassing the custom CNN baseline of 74.3%. Conclusion: Hybrid models that combine CNN feature extraction with traditional classifiers give the best performance for classifying dental conditions in panoramic X-ray images and offer a viable path toward automated dental diagnosis. Abstract: This study investigates deep learning methods for automated classification of dental conditions in panoramic X-ray images. A dataset of 1,512 radiographs with 11,137 expert-verified annotations across four conditions (fillings, cavities, implants, and impacted teeth) was used. After preprocessing and class balancing, three approaches were evaluated: a custom convolutional neural network (CNN), hybrid models combining CNN feature extraction with traditional classifiers, and fine-tuned pre-trained architectures. Experiments employed 5-fold cross-validation with accuracy, precision, recall, and F1 score as evaluation metrics. The hybrid CNN Random Forest model achieved the highest performance with 85.4% accuracy, surpassing the custom CNN baseline of 74.3%. Among pre-trained models, VGG16 performed best at 82.3% accuracy, followed by Xception and ResNet50. Results show that hybrid models improve discrimination of morphologically similar conditions and provide efficient, reliable performance. These findings suggest that combining CNN-based feature extraction with ensemble classifiers offers a practical path toward automated dental diagnostic support, while also highlighting the need for larger datasets and further clinical validation.
[37] Q-Align: Alleviating Attention Leakage in Zero-Shot Appearance Transfer via Query-Query Alignment
Namu Kim,Wonbin Kweon,Minsoo Kim,Hwanjo Yu
Main category: cs.CV
TL;DR: Q-Align addresses Attention Leakage in zero-shot appearance transfer through Query-Query alignment, Key-Value rearrangement, and Attention refinement, improving performance over existing methods.
Details
Motivation: The motivation is to solve the Attention Leakage problem in zero-shot appearance transfer with large-scale image generation models, which occurs due to semantic mapping captured by Query-Key alignment. Method: Q-Align introduces three core contributions: Query-Query alignment, Key-Value rearrangement, and Attention refinement to mitigate attention leakage and improve semantic alignment. Result: Extensive experiments validate that Q-Align outperforms existing state-of-the-art methods in terms of appearance fidelity while preserving structure competitively. Conclusion: Q-Align effectively addresses the Attention Leakage issue in zero-shot appearance transfer, outperforming state-of-the-art methods in appearance fidelity while maintaining structure preservation. Abstract: We observe that zero-shot appearance transfer with large-scale image generation models faces a significant challenge: Attention Leakage. This challenge arises when the semantic mapping between two images is captured by the Query-Key alignment. To tackle this issue, we introduce Q-Align, utilizing Query-Query alignment to mitigate attention leakage and improve the semantic alignment in zero-shot appearance transfer. Q-Align incorporates three core contributions: (1) Query-Query alignment, facilitating the sophisticated spatial semantic mapping between two images; (2) Key-Value rearrangement, enhancing feature correspondence through realignment; and (3) Attention refinement using rearranged keys and values to maintain semantic consistency. We validate the effectiveness of Q-Align through extensive experiments and analysis, and Q-Align outperforms state-of-the-art methods in appearance fidelity while maintaining competitive structure preservation.
[38] ERTACache: Error Rectification and Timesteps Adjustment for Efficient Diffusion
Xurui Peng,Hong Liu,Chenqian Yan,Rui Ma,Fangmin Chen,Xing Wang,Zhihua Wu,Songwei Liu,Mingbao Lin
Main category: cs.CV
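TL;DR: ERTACache is a caching framework for accelerating diffusion model inference; by addressing cache-induced errors, it achieves efficient inference while preserving generation quality.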
Details
Motivation: Diffusion models incur substantial computational overhead because of their iterative inference process. Feature caching offers an acceleration strategy, but naive cache reuse often causes noticeable quality degradation. Method: ERTACache jointly rectifies the two main types of cache-induced error through offline residual profiling, dynamic adjustment of integration intervals, and a closed-form model that analytically approximates cache error. Result: Experiments show that ERTACache achieves up to 2x inference speedup across standard image and video generation benchmarks while preserving or improving visual quality. On the Wan2.1 video diffusion model, it delivers a substantial speedup with minimal VBench degradation. Conclusion: ERTACache is a new caching framework that effectively addresses cache-induced quality degradation in diffusion model inference while significantly improving inference efficiency. Abstract: Diffusion models suffer from substantial computational overhead due to their inherently iterative inference process. While feature caching offers a promising acceleration strategy by reusing intermediate outputs across timesteps, naive reuse often incurs noticeable quality degradation. In this work, we formally analyze the cumulative error introduced by caching and decompose it into two principal components: feature shift error, caused by inaccuracies in cached outputs, and step amplification error, which arises from error propagation under fixed timestep schedules. To address these issues, we propose ERTACache, a principled caching framework that jointly rectifies both error types. Our method employs an offline residual profiling stage to identify reusable steps, dynamically adjusts integration intervals via a trajectory-aware correction coefficient, and analytically approximates cache-induced errors through a closed-form residual linearization model. Together, these components enable accurate and efficient sampling under aggressive cache reuse. Extensive experiments across standard image and video generation benchmarks show that ERTACache achieves up to 2x inference speedup while consistently preserving or even improving visual quality. Notably, on the state-of-the-art Wan2.1 video diffusion model, ERTACache delivers 2x acceleration with minimal VBench degradation, effectively maintaining baseline fidelity while significantly improving efficiency. The code is available at https://github.com/bytedance/ERTACache.
[39] Video-LLMs with Temporal Visual Screening
Zheyu Fan,Jiateng Liu,Yuji Zhang,Zihan Wang,Yi R. Fung,Manling Li,Heng Ji
Main category: cs.CV
TL;DR: Temporal Visual Screening (TVS) is proposed to improve the performance of Video Large Language Models (Video-LLMs) in understanding video-language by focusing on salient temporal segments and simplifying queries, resulting in significant performance gains during training and inference.
Details
Motivation: Current Video Large Language Models (Video-LLMs) struggle with capturing fine-grained temporal semantics due to sparse frame sampling and insufficient inter-frame reasoning supervision. Humans naturally perform temporal screening by focusing on salient temporal segments, which inspired the development of TVS. Method: Temporal Visual Screening (TVS) is introduced as a modular front-end adapter task. It preprocesses video question answering and instruction tuning data by focusing on salient temporal segments, reconstructing queries to their most direct form, and maintaining answer consistency. The first benchmark for TVS was curated, and a baseline called ReSimplifyIt was proposed. Result: Experiments showed that incorporating TVS resulted in relative performance gains of 7.33% during training and 34.6% during inference. ReSimplifyIt, the baseline for TVS, outperformed prior approaches by 0.47 in F-1 score on video trimming and demonstrated competitive query rewriting performance. Conclusion: Incorporating Temporal Visual Screening (TVS) into Video Instruction Tuning and Video Question Answering pipelines enhances the understanding of video-language by optimizing the distribution of reasoning burden and cognitive load, as evidenced by significant relative gains in performance. Abstract: Humans naturally perform temporal screening by dragging the progress bar and focusing on salient temporal segments, but current Video Large Language Models (Video-LLMs) struggle to capture fine-grained temporal semantics due to sparse frame sampling and insufficient inter-frame reasoning supervision during their training.
To address this, and inspired by well-established cognitive science principles, we propose Temporal Visual Screening (TVS), a new task that universally pre-processes video question answering and instruction tuning data by: (1) retaining focus-critical video segments, (2) synchronously reconstructing queries to their most direct form while preserving answer consistency, and (3) keeping the invariance and consistency for any possible answer. TVS is formulated as a modular front-end adapter task that can be seamlessly integrated into both Video Instruction Tuning (training) and Video Question Answering (inference) pipelines. TVS optimizes the distribution of reasoning burden and cognitive load; during training, it aligns queries with focus-critical visual information; at inference, it enables query-aware segment focus and streamlined query representations. In particular, we curate the first benchmark for TVS and propose ReSimplifyIt, a baseline outperforming prior approaches on seemingly similar tasks by 0.47 in F-1 score on video trimming while achieving competitive query rewriting performance. Experiments show that incorporating TVS yields relative gains of 7.33% (training) and 34.6% (inference), demonstrating the effectiveness of temporal information screening for improving video-language understanding.
[40] ROBUST-MIPS: A Combined Skeletal Pose and Instance Segmentation Dataset for Laparoscopic Surgical Instruments
Zhe Han,Charlie Budd,Gongyu Zhang,Huanyu Tian,Christos Bergeles,Tom Vercauteren
Main category: cs.CV
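TL;DR: This paper presents the ROBUST-MIPS dataset and a pose annotation approach to improve the training of deep learning models for surgical tool localisation, and releases the associated tools and benchmark results.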
Details
Motivation: Deep-learning-based surgical tool localisation depends on diverse annotated data, and annotation must balance richness of information against ease of annotation. The authors argue that skeletal pose annotation is a more efficient approach that can accelerate the growth of annotated data. Method: The authors build ROBUST-MIPS on top of the existing ROBUST-MIS dataset and set up a simple benchmark using popular pose estimation methods to demonstrate the adequacy of pose annotations for surgical tool localisation. Result: The benchmark yields high-quality results, supporting the adequacy of pose annotations for surgical tool localisation. The authors also release the dataset, benchmark models, and custom tool pose annotation software. Conclusion: The authors argue that skeletal pose annotation is a more efficient annotation approach for surgical tools, and they encourage its adoption through the ROBUST-MIPS dataset and accompanying tools. Abstract: Localisation of surgical tools constitutes a foundational building block for computer-assisted interventional technologies. Works in this field typically focus on training deep learning models to perform segmentation tasks. Performance of learning-based approaches is limited by the availability of diverse annotated data. We argue that skeletal pose annotations are a more efficient annotation approach for surgical tools, striking a balance between richness of semantic information and ease of annotation, thus allowing for accelerated growth of available annotated data. To encourage adoption of this annotation style, we present ROBUST-MIPS, a combined tool pose and tool instance segmentation dataset derived from the existing ROBUST-MIS dataset. Our enriched dataset facilitates the joint study of these two annotation styles and allows head-to-head comparison on various downstream tasks. To demonstrate the adequacy of pose annotations for surgical tool localisation, we set up a simple benchmark using popular pose estimation methods and observe high-quality results. To ease adoption, together with the dataset, we release our benchmark models and custom tool pose annotation software.
[41] Safe-Control: A Safety Patch for Mitigating Unsafe Content in Text-to-Image Generation Models
Xiangtao Meng,Yingkai Dong,Ning Yu,Li Wang,Zheng Li,Shanqing Guo
Main category: cs.CV
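TL;DR: Safe-Control is an innovative plug-and-play safety patch that injects safety control signals using data-driven strategies and safety-aware conditions, effectively mitigating unsafe content generation in text-to-image models.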
Details
Motivation: Existing safety mechanisms either remain susceptible to evasion under distribution shifts or require extensive model-specific adjustments. Method: Using data-driven strategies and safety-aware conditions, Safe-Control injects safety control signals into the locked T2I model, acting as an update in a patch-like manner. Result: Empirical results show that Safe-Control effectively reduces unsafe content generation across six diverse, public T2I models while maintaining the quality and text alignment of benign images. Conclusion: Safe-Control is an innovative plug-and-play safety patch designed to mitigate unsafe content generation in T2I models. Abstract: Despite the advancements in Text-to-Image (T2I) generation models, their potential for misuse or even abuse raises serious safety concerns. Model developers have made tremendous efforts to introduce safety mechanisms that can address these concerns in T2I models. However, the existing safety mechanisms, whether external or internal, either remain susceptible to evasion under distribution shifts or require extensive model-specific adjustments. To address these limitations, we introduce Safe-Control, an innovative plug-and-play safety patch designed to mitigate unsafe content generation in T2I models. Using data-driven strategies and safety-aware conditions, Safe-Control injects safety control signals into the locked T2I model, acting as an update in a patch-like manner. Model developers can also construct various safety patches to meet the evolving safety requirements, which can be flexibly merged into a single, unified patch. Its plug-and-play design further ensures adaptability, making it compatible with other T2I models of similar denoising architecture. We conduct extensive evaluations on six diverse and public T2I models. Empirical results highlight that Safe-Control is effective in reducing unsafe content generation across six diverse T2I models with similar generative architectures, yet it successfully maintains the quality and text alignment of benign images. Compared to seven state-of-the-art safety mechanisms, including both external and internal defenses, Safe-Control significantly outperforms all baselines in reducing unsafe content generation.
For example, it reduces the probability of unsafe content generation to 7%, compared to approximately 20% for most baseline methods, under both unsafe prompts and the latest adversarial attacks.
[42] GENNAV: Polygon Mask Generation for Generalized Referring Navigable Regions
Kei Katsumata,Yui Iioka,Naoki Hosomi,Teruhisa Misu,Kentaro Yamada,Komei Sugiura
Main category: cs.CV
TL;DR: GENNAV is a new method for identifying target regions from natural language instructions and images, particularly for stuff-type regions with ambiguous boundaries. It outperforms existing methods and demonstrates robustness across various real-world environments.
Details
Motivation: Existing methods often underperform in handling stuff-type target regions, in addition to absent or multiple targets. There is a need for a more robust method for identifying target regions from natural language instructions and images. Method: GENNAV predicts target existence and generates segmentation masks for multiple stuff-type target regions. A novel benchmark called GRiN-Drive was also constructed for evaluation. Result: GENNAV achieved superior performance over baseline methods on standard evaluation metrics and demonstrated robustness across diverse real-world environments. Conclusion: GENNAV is an effective method for identifying target regions from natural language instructions and front camera images, particularly for stuff-type regions with ambiguous boundaries. It demonstrates robustness across various real-world environments. Abstract: We focus on the task of identifying the location of target regions from a natural language instruction and a front camera image captured by a mobility. This task is challenging because it requires both existence prediction and segmentation, particularly for stuff-type target regions with ambiguous boundaries. Existing methods often underperform in handling stuff-type target regions, in addition to absent or multiple targets. To overcome these limitations, we propose GENNAV, which predicts target existence and generates segmentation masks for multiple stuff-type target regions. To evaluate GENNAV, we constructed a novel benchmark called GRiN-Drive, which includes three distinct types of samples: no-target, single-target, and multi-target. GENNAV achieved superior performance over baseline methods on standard evaluation metrics. Furthermore, we conducted real-world experiments with four automobiles operated in five geographically distinct urban areas to validate its zero-shot transfer performance. 
In these experiments, GENNAV outperformed baseline methods and demonstrated its robustness across diverse real-world environments. The project page is available at https://gennav.vercel.app/.
[43] R-4B: Incentivizing General-Purpose Auto-Thinking Capability in MLLMs via Bi-Mode Annealing and Reinforce Learning
Jie Jiang,Qi Yang,Bolin Ni,Shiming Xiang,Han Hu,Houwen Peng
Main category: cs.CV
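TL;DR: R-4B is an auto-thinking multimodal large language model that decides whether to think based on problem complexity, achieving state-of-the-art performance on multiple tasks.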
Details
Motivation: To address the redundancy of the thinking process of multimodal large language models on simple problems. Method: The central idea of R-4B is to give the model both thinking and non-thinking capabilities using bi-mode annealing, and to apply Bi-mode Policy Optimization (BPO) to improve the model's accuracy in deciding whether to activate the thinking process. Result: Experiments show that R-4B performs strongly across multiple tasks, reaching state-of-the-art performance and matching larger models at lower computational cost. Conclusion: R-4B achieves state-of-the-art performance across 25 challenging benchmarks, outperforms Qwen2.5-VL-7B on most tasks, and matches larger models such as Kimi-VL-A3B-Thinking-2506 (16B) on reasoning-intensive benchmarks at lower computational cost. Abstract: Multimodal Large Language Models (MLLMs) equipped with step-by-step thinking capabilities have demonstrated remarkable performance on complex reasoning problems. However, this thinking process is redundant for simple problems solvable without complex reasoning. To address this inefficiency, we propose R-4B, an auto-thinking MLLM, which can adaptively decide when to think based on problem complexity. The central idea of R-4B is to empower the model with both thinking and non-thinking capabilities using bi-mode annealing, and apply Bi-mode Policy Optimization (BPO) to improve the model's accuracy in determining whether to activate the thinking process. Specifically, we first train the model on a carefully curated dataset spanning various topics, which contains samples from both thinking and non-thinking modes. Then it undergoes a second phase of training under an improved GRPO framework, where the policy model is forced to generate responses from both modes for each input query. Experimental results show that R-4B achieves state-of-the-art performance across 25 challenging benchmarks. It outperforms Qwen2.5-VL-7B in most tasks and achieves performance comparable to larger models such as Kimi-VL-A3B-Thinking-2506 (16B) on reasoning-intensive benchmarks with lower computational cost.
[44] HiddenObject: Modality-Agnostic Fusion for Multimodal Hidden Object Detection
Harris Song,Tuan-Anh Vu,Sanjith Menon,Sriram Narasimhan,M. Khalid Jawed
Main category: cs.CV
TL;DR: This paper introduces HiddenObject, a Mamba-based fusion framework that integrates RGB, thermal, and depth data to enhance detection of concealed or camouflaged objects. It achieves strong performance across benchmark datasets, demonstrating the potential of modality-agnostic approaches in challenging environments.
Details
Motivation: Detecting hidden or partially concealed objects in multimodal environments is challenging due to factors like occlusion, camouflage, and lighting variations. Traditional RGB-based detection methods perform poorly under such conditions, necessitating more robust, modality-agnostic approaches. Method: The paper proposes HiddenObject, a fusion framework that integrates RGB, thermal, and depth data using a Mamba-based fusion mechanism. This method identifies modality-specific features and combines them into a unified representation to improve detection performance. Result: HiddenObject was validated across multiple benchmark datasets, achieving state-of-the-art or competitive performance compared to existing methods. The results demonstrate the effectiveness of the proposed fusion design in challenging scenarios involving obscured or camouflaged targets. Conclusion: The study concludes that the HiddenObject fusion framework, leveraging a Mamba-based mechanism, significantly enhances detection performance for hidden or camouflaged objects in multimodal environments. It demonstrates the efficacy of modality-agnostic approaches and highlights limitations in unimodal and naive fusion strategies. Abstract: Detecting hidden or partially concealed objects remains a fundamental challenge in multimodal environments, where factors like occlusion, camouflage, and lighting variations significantly hinder performance. Traditional RGB-based detection methods often fail under such adverse conditions, motivating the need for more robust, modality-agnostic approaches. In this work, we present HiddenObject, a fusion framework that integrates RGB, thermal, and depth data using a Mamba-based fusion mechanism. Our method captures complementary signals across modalities, enabling enhanced detection of obscured or camouflaged targets. 
Specifically, the proposed approach identifies modality-specific features and fuses them in a unified representation that generalizes well across challenging scenarios. We validate HiddenObject across multiple benchmark datasets, demonstrating state-of-the-art or competitive performance compared to existing methods. These results highlight the efficacy of our fusion design and expose key limitations in current unimodal and naïve fusion strategies. More broadly, our findings suggest that Mamba-based fusion architectures can significantly advance the field of multimodal object detection, especially under visually degraded or complex conditions.
[45] RadGS-Reg: Registering Spine CT with Biplanar X-rays via Joint 3D Radiative Gaussians Reconstruction and 3D/3D Registration
Ao Shen,Xueming Fu,Junfeng Jiang,Qiang Zeng,Ye Tang,Zhengming Chen,Luming Nong,Feng Wang,S. Kevin Zhou
Main category: cs.CV
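TL;DR: RadGS-Reg is a new CT/X-ray registration framework that combines 3D Radiative Gaussians reconstruction with 3D/3D registration, significantly improving the accuracy and performance of vertebral-level image registration.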
Details
Motivation: CT/X-ray registration is challenging due to high accuracy and real-time performance requirements, spatial information loss, and domain gaps in traditional methods. Method: RadGS-Reg introduces a joint 3D Radiative Gaussians (RadGS) reconstruction and 3D/3D registration framework, using a Counterfactual Attention Learning (CAL) mechanism and a patient-specific pre-training strategy. Result: Experiments show that RadGS-Reg achieves superior performance on both RadGS reconstruction and 3D/3D registration tasks compared to existing methods. Conclusion: The proposed RadGS-Reg framework demonstrates state-of-the-art performance for vertebral-level CT/X-ray registration and addresses limitations in existing methods. Abstract: Computed Tomography (CT)/X-ray registration in image-guided navigation remains challenging because of its stringent requirements for high accuracy and real-time performance. Traditional "render and compare" methods, relying on iterative projection and comparison, suffer from spatial information loss and a domain gap. 3D reconstruction from biplanar X-rays supplements spatial and shape information for 2D/3D registration, but current methods are limited by dense-view requirements and struggle with noisy X-rays. To address these limitations, we introduce RadGS-Reg, a novel framework for vertebral-level CT/X-ray registration through joint 3D Radiative Gaussians (RadGS) reconstruction and 3D/3D registration. Specifically, our biplanar X-ray vertebral RadGS reconstruction module explores a learning-based RadGS reconstruction method with a Counterfactual Attention Learning (CAL) mechanism, focusing on vertebral regions in noisy X-rays. Additionally, a patient-specific pre-training strategy progressively adapts RadGS-Reg from simulated to real data while simultaneously learning vertebral shape prior knowledge. Experiments on in-house datasets demonstrate state-of-the-art performance for both tasks, surpassing existing methods. 
The code is available at: https://github.com/shenao1995/RadGS_Reg.
[46] SYNBUILD-3D: A large, multi-modal, and semantically rich synthetic dataset of 3D building models at Level of Detail 4
Kevin Mayer,Alex Vesel,Xinyi Zhao,Martin Fischer
Main category: cs.CV
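TL;DR: SYNBUILD-3D is a large-scale, multi-modal synthetic dataset of 3D buildings intended to support the development of generative AI algorithms that automate the creation of semantically rich, geometrically consistent 3D building models.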
Details
Motivation: Automatically generating accurate, semantically rich 3D building models remains a major challenge because the public domain lacks large-scale annotated datasets. The authors introduce the SYNBUILD-3D dataset to address this. Method: Inspired by the success of synthetic data in computer vision, the authors created a dataset of over 6.2 million multi-modal synthetic 3D residential buildings annotated at Level of Detail (LoD) 4. Each building is represented through three distinct modalities: a semantically enriched 3D wireframe graph (Modality I), the corresponding floor plan images (Modality II), and a LiDAR-like roof point cloud (Modality III). Result: The release of SYNBUILD-3D provides a foundation for developing new generative AI algorithms that automate the creation of LoD 4 3D building models subject to predefined floor plan layouts and roof geometries, while enforcing semantic-geometric consistency. Conclusion: SYNBUILD-3D is an important data resource for 3D building modeling and helps advance research on the automated generation of high-quality, semantically rich 3D building models. Abstract: 3D building models are critical for applications in architecture, energy simulation, and navigation. Yet, generating accurate and semantically rich 3D buildings automatically remains a major challenge due to the lack of large-scale annotated datasets in the public domain. Inspired by the success of synthetic data in computer vision, we introduce SYNBUILD-3D, a large, diverse, and multi-modal dataset of over 6.2 million synthetic 3D residential buildings at Level of Detail (LoD) 4. In the dataset, each building is represented through three distinct modalities: a semantically enriched 3D wireframe graph at LoD 4 (Modality I), the corresponding floor plan images (Modality II), and a LiDAR-like roof point cloud (Modality III). The semantic annotations for each building wireframe are derived from the corresponding floor plan images and include information on rooms, doors, and windows. Through its tri-modal nature, future work can use SYNBUILD-3D to develop novel generative AI algorithms that automate the creation of 3D building models at LoD 4, subject to predefined floor plan layouts and roof geometries, while enforcing semantic-geometric consistency. Dataset and code samples are publicly available at https://github.com/kdmayer/SYNBUILD-3D.
[47] Radially Distorted Homographies, Revisited
Mårten Wadenbäck,Marcus Valtonen Örnhag,Johan Edstedt
Main category: cs.CV
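TL;DR: This paper proposes a unified approach to three radial distortion configurations and constructs new minimal solvers for radially distorted homographies that are faster while remaining accurate.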
Details
Motivation: When working with real images, geometric distortion from the camera lens makes it necessary to determine the homography and the lens distortion, particularly radial distortion, simultaneously to obtain useful estimates. Method: The authors construct new fast, stable, and accurate minimal solvers covering three radial distortion configurations and validate them on standard benchmarks. Result: In all three cases, the proposed solvers are faster than existing state-of-the-art solvers while maintaining similar accuracy; they were tested on standard benchmarks, including images taken with fisheye cameras. Conclusion: The paper presents a novel, unified approach to three conceptually distinct radial distortion configurations and shows how to construct new fast, stable, and accurate minimal solvers for radially distorted homographies. Abstract: Homographies are among the most prevalent transformations occurring in geometric computer vision and projective geometry, and homography estimation is consequently a crucial step in a wide assortment of computer vision tasks. When working with real images, which are often afflicted with geometric distortions caused by the camera lens, it may be necessary to determine both the homography and the lens distortion (particularly the radial component, called radial distortion) simultaneously to obtain anything resembling useful estimates. When considering a homography with radial distortion between two images, there are three conceptually distinct configurations for the radial distortion: (i) distortion in only one image, (ii) identical distortion in the two images, and (iii) independent distortion in the two images. While these cases have been addressed separately in the past, the present paper provides a novel and unified approach to solve all three cases. We demonstrate how the proposed approach can be used to construct new fast, stable, and accurate minimal solvers for radially distorted homographies. In all three cases, our proposed solvers are faster than the existing state-of-the-art solvers while maintaining similar accuracy. The solvers are tested on well-established benchmarks including images taken with fisheye cameras. The source code for our solvers will be made available in the event our paper is accepted for publication.
[48] GCAV: A Global Concept Activation Vector Framework for Cross-Layer Consistency in Interpretability
Zhenghao He,Sanchit Sinha,Guangzhi Xiong,Aidong Zhang
Main category: cs.CV
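TL;DR: GCAV is a new framework that unifies concept activation vectors across layers, improving the stability and reliability of deep neural network interpretation.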
Details
Motivation: CAVs computed independently at different layers are often inconsistent, making cross-layer comparisons unreliable, so a unified framework is needed. Method: GCAV uses contrastive learning to align concept representations across layers and an attention-based fusion mechanism to construct a globally integrated CAV. Result: GCAV significantly reduces the variance of TCAV scores, enhances concept localization, and improves robustness against adversarial perturbations. Conclusion: By unifying CAVs across layers, GCAV offers a more stable and reliable way to interpret the sensitivity of deep neural networks to human-defined concepts. Abstract: Concept Activation Vectors (CAVs) provide a powerful approach for interpreting deep neural networks by quantifying their sensitivity to human-defined concepts. However, when computed independently at different layers, CAVs often exhibit inconsistencies, making cross-layer comparisons unreliable. To address this issue, we propose the Global Concept Activation Vector (GCAV), a novel framework that unifies CAVs into a single, semantically consistent representation. Our method leverages contrastive learning to align concept representations across layers and employs an attention-based fusion mechanism to construct a globally integrated CAV. By doing so, our method significantly reduces the variance in TCAV scores while preserving concept relevance, ensuring more stable and reliable concept attributions. To evaluate the effectiveness of GCAV, we introduce Testing with Global Concept Activation Vectors (TGCAV) as a method to apply TCAV to GCAV-based representations. We conduct extensive experiments on multiple deep neural networks, demonstrating that our method effectively mitigates concept inconsistency across layers, enhances concept localization, and improves robustness against adversarial perturbations. By integrating cross-layer information into a coherent framework, our method offers a more comprehensive and interpretable understanding of how deep learning models encode human-defined concepts. Code and models are available at https://github.com/Zhenghao-He/GCAV.
[49] Generalizable Object Re-Identification via Visual In-Context Prompting
Zhizhong Huang,Xiaoming Liu
Main category: cs.CV
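TL;DR: This paper proposes Visual In-Context Prompting (VICP), a new method that combines vision foundation models with large language models to generalize to new categories from a few examples, without retraining.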
Details
Motivation: Current object re-identification methods train domain-specific models that lack generalization and require large amounts of labeled data for new categories. Self-supervised learning reduces annotation needs but struggles to capture the identity-sensitive features critical for re-identification. Method: The paper proposes the Visual In-Context Prompting (VICP) framework: through task-specific prompting, a large language model infers semantic identity rules from a few positive/negative pairs, and these rules guide a vision model to extract identity-discriminative features via dynamic visual prompts. Result: Experiments on ShopID10K and diverse ReID benchmarks show that VICP clearly outperforms baselines on unseen categories. Conclusion: VICP is a new framework that combines vision foundation models with large language models so that models can generalize to new categories without parameter adaptation, eliminating the need for dataset-specific retraining. Abstract: Current object re-identification (ReID) methods train domain-specific models (e.g., for persons or vehicles), which lack generalization and demand costly labeled data for new categories. While self-supervised learning reduces annotation needs by learning instance-wise invariance, it struggles to capture identity-sensitive features critical for ReID. This paper proposes Visual In-Context Prompting (VICP), a novel framework where models trained on seen categories can directly generalize to unseen novel categories using only in-context examples as prompts, without requiring parameter adaptation. VICP synergizes LLMs and vision foundation models (VFM): LLMs infer semantic identity rules from few-shot positive/negative pairs through task-specific prompting, which then guides a VFM (e.g., DINO) to extract ID-discriminative features via dynamic visual prompts. By aligning LLM-derived semantic concepts with the VFM's pre-trained prior, VICP enables generalization to novel categories, eliminating the need for dataset-specific retraining. To support evaluation, we introduce ShopID10K, a dataset of 10K object instances from e-commerce platforms, featuring multi-view images and cross-domain testing. Experiments on ShopID10K and diverse ReID benchmarks demonstrate that VICP outperforms baselines by a clear margin on unseen categories. Code is available at https://github.com/Hzzone/VICP.
[50] Lightweight MRI-Based Automated Segmentation of Pancreatic Cancer with Auto3DSeg
Keshav Jha,William Sharp,Dominic LaBella
Main category: cs.CV
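TL;DR: This study examines the challenges of automated MRI-based pancreatic tumor segmentation, in particular how SegResNet models within the Auto3DSeg architecture perform on small datasets.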
Details
Motivation: Accurate pancreatic tumor segmentation is critical for diagnosis, treatment planning, and outcome assessment, but automated segmentation remains challenging because of anatomical variability and limited datasets. Method: The study used SegResNet models as part of the Auto3DSeg architecture, with 5-fold cross-validation and STAPLE ensembling, focusing on an anatomically relevant region of interest. Result: For Task 1, the algorithm achieved a DSC of 0.56, 5 mm DSC of 0.73, HD95 of 41.1 mm, MASD of 26.0 mm, and RMSE of 5164 mm. For Task 2, performance decreased, with a DSC of 0.33, 5 mm DSC of 0.50, HD95 of 20.1 mm, MASD of 7.2 mm, and RMSE of 17,203 mm. Conclusion: MRI-based pancreatic tumor segmentation is challenging with small datasets, and different MRI sequences introduce variability. Despite modest performance, the results show the potential of automated segmentation and underline the need for larger, standardized MRI datasets to improve model robustness and clinical utility. Abstract: Accurate delineation of pancreatic tumors is critical for diagnosis, treatment planning, and outcome assessment, yet automated segmentation remains challenging due to anatomical variability and limited dataset availability. In this study, SegResNet models, as part of the Auto3DSeg architecture, were trained and evaluated on two MRI-based pancreatic tumor segmentation tasks as part of the 2025 PANTHER Challenge. Algorithm methodology included 5-fold cross-validation with STAPLE ensembling after focusing on an anatomically relevant region-of-interest. The Pancreatic Tumor Segmentation on Diagnostic MRI task 1 training set included 91 T1-weighted arterial contrast-enhanced MRI with expert annotated pancreas and tumor labels. The Pancreatic Tumor Segmentation on MR-Linac task 2 training set used 50 T2-weighted MR-Linac cases with expert annotated pancreas and tumor labels. Algorithm-automated segmentation performance of pancreatic tumor was assessed using Dice Similarity Coefficient (DSC), 5 mm DSC, 95th percentile Hausdorff Distance (HD95), Mean Average Surface Distance (MASD), and Root Mean Square Error (RMSE). For Task 1, the algorithm achieved a DSC of 0.56, 5 mm DSC of 0.73, HD95 of 41.1 mm, MASD of 26.0 mm, and RMSE of 5164 mm. For Task 2, performance decreased, with a DSC of 0.33, 5 mm DSC of 0.50, HD95 of 20.1 mm, MASD of 7.2 mm, and RMSE of 17,203 mm. These findings illustrate the challenges of MRI-based pancreatic tumor segmentation with small datasets, highlighting variability introduced by different MRI sequences. 
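For readers unfamiliar with the headline metric, the Dice Similarity Coefficient (DSC) reported above is twice the overlap between the predicted and reference masks divided by the sum of their sizes. The following is a minimal sketch only, not the challenge's evaluation code: the function name `dice_coefficient`, the flat 0/1 list representation, and the convention that two empty masks score 1.0 are all assumptions made for illustration.

```python
def dice_coefficient(pred, truth):
    """Dice Similarity Coefficient between two binary masks.

    Both masks are flat sequences of 0/1 values (hypothetical representation;
    real pipelines usually operate on 3D arrays).
    """
    if len(pred) != len(truth):
        raise ValueError("masks must contain the same number of voxels")
    # Voxels labelled as tumor in both masks.
    intersection = sum(1 for p, t in zip(pred, truth) if p and t)
    # Total number of tumor voxels across the two masks.
    total = sum(1 for p in pred if p) + sum(1 for t in truth if t)
    # Assumed convention: two empty masks count as a perfect match.
    return 1.0 if total == 0 else 2.0 * intersection / total


predicted = [0, 1, 1, 1, 0, 0]
reference = [0, 0, 1, 1, 1, 0]
print(dice_coefficient(predicted, reference))
```

In this toy case the masks share two of their three foreground voxels each, so the score is 2·2/(3+3) = 2/3.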
Despite modest performance, the results demonstrate potential for automated delineation and emphasize the need for larger, standardized MRI datasets to improve model robustness and clinical utility.
[51] Reverse Imaging for Wide-spectrum Generalization of Cardiac MRI Segmentation
Yidong Zhao,Peter Kellman,Hui Xue,Tongyun Yang,Yi Zhang,Yuchi Han,Orlando Simonetti,Qian Tao
Main category: cs.CV
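TL;DR: Reverse Imaging is a new physics-driven method for cardiac MRI data augmentation and domain adaptation that solves an inverse problem with a prior distribution over spin properties, enabling highly accurate segmentation across different imaging protocols and contrasts.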
Details
Motivation: Pretrained cardiac MRI segmentation models generalize poorly because changes in imaging protocol alter image contrast; this paper aims to solve the generalization problem at its root. Method: The method infers the underlying spin properties from cardiac MRI images by solving a regularized nonlinear inverse problem, using a generative diffusion model learned from the mSASHA dataset as the spin prior. Result: Reverse Imaging obtains approximate but meaningful spin-property estimates from arbitrary MR images, enabling flexible synthesis of new images and highly accurate segmentation across different contrasts. Conclusion: Reverse Imaging generalizes cardiac MRI segmentation across a wide range of imaging protocols and contrasts, addressing the limitations of existing models. Abstract: Pretrained segmentation models for cardiac magnetic resonance imaging (MRI) struggle to generalize across different imaging sequences due to significant variations in image contrast. These variations arise from changes in imaging protocols, yet the same fundamental spin properties, including proton density, T1, and T2 values, govern all acquired images. With this core principle, we introduce Reverse Imaging, a novel physics-driven method for cardiac MRI data augmentation and domain adaptation to fundamentally solve the generalization problem. Our method reversely infers the underlying spin properties from observed cardiac MRI images, by solving ill-posed nonlinear inverse problems regularized by the prior distribution of spin properties. We acquire this "spin prior" by learning a generative diffusion model from the multiparametric SAturation-recovery single-SHot acquisition sequence (mSASHA) dataset, which offers joint cardiac T1 and T2 maps. Our method enables approximate but meaningful spin-property estimates from MR images, which provide an interpretable "latent variable" that leads to highly flexible image synthesis of arbitrary novel sequences. We show that Reverse Imaging enables highly accurate segmentation across vastly different image contrasts and imaging protocols, realizing wide-spectrum generalization of cardiac MRI segmentation.
[52] PHD: Personalized 3D Human Body Fitting with Point Diffusion
Hsuan-I Ho,Chen Guo,Po-Chen Wu,Ivan Shugurov,Chengcheng Tang,Abhay Mittal,Sizhe An,Manuel Kaufmann,Linguang Zhang
Main category: cs.CV
TL;DR: PHD is a new method that uses user-specific shape information to improve the accuracy of 3D human pose estimation from videos.
Details
Motivation: Traditional HMR methods are designed to be user-agnostic and optimized for generalization. They typically refine alignment using constraints derived from the 2D image, but this compromises 3D accuracy because it fails to jointly account for person-specific body shape and the plausibility of 3D poses. Method: PHD first calibrates the user's body shape and then runs a personalized pose fitting process conditioned on that shape. A body shape-conditioned 3D pose prior, implemented as a Point Diffusion Transformer, iteratively guides the pose fitting via a Point Distillation Sampling loss. Result: PHD improves not only pelvis-aligned pose accuracy but also absolute pose accuracy. The method requires only synthetic data for training and can serve as a plug-and-play module integrated with existing 3D pose estimators. Conclusion: PHD is an improved approach to personalized 3D human mesh recovery and body fitting that leverages user-specific shape information to improve pose estimation accuracy from videos. Abstract: We introduce PHD, a novel approach for personalized 3D human mesh recovery (HMR) and body fitting that leverages user-specific shape information to improve pose estimation accuracy from videos. Traditional HMR methods are designed to be user-agnostic and optimized for generalization. While these methods often refine poses using constraints derived from the 2D image to improve alignment, this process compromises 3D accuracy by failing to jointly account for person-specific body shapes and the plausibility of 3D poses. In contrast, our pipeline decouples this process by first calibrating the user's body shape and then employing a personalized pose fitting process conditioned on that shape. To achieve this, we develop a body shape-conditioned 3D pose prior, implemented as a Point Diffusion Transformer, which iteratively guides the pose fitting via a Point Distillation Sampling loss. This learned 3D pose prior effectively mitigates errors arising from an over-reliance on 2D constraints. Consequently, our approach improves not only pelvis-aligned pose accuracy but also absolute pose accuracy, an important metric often overlooked by prior work. Furthermore, our method is highly data-efficient, requiring only synthetic data for training, and serves as a versatile plug-and-play module that can be seamlessly integrated with existing 3D pose estimators to enhance their performance. Project page: https://phd-pose.github.io/
[53] Efficient Diffusion-Based 3D Human Pose Estimation with Hierarchical Temporal Pruning
Yuquan Bi,Hongsong Wang,Xinli Shi,Zhipeng Gui,Jie Gui,Yuan Yan Tang
Main category: cs.CV
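TL;DR: This paper proposes an efficient diffusion-based 3D human pose estimation framework whose hierarchical temporal pruning strategy significantly reduces computational cost while maintaining state-of-the-art performance.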
Details
Motivation: Diffusion models excel at generating high-quality 3D human poses, but their iterative nature and multi-hypothesis requirements make them computationally expensive, so a more efficient approach is needed. Method: The paper proposes a Hierarchical Temporal Pruning (HTP) strategy with three stages: (1) Temporal Correlation-Enhanced Pruning (TCEP) identifies key frames by analyzing inter-frame motion correlations through adaptive temporal graph construction; (2) Sparse-Focused Temporal MHSA (SFT MHSA) uses the resulting frame-level sparsity to reduce attention computation, focusing on motion-relevant tokens; (3) Mask-Guided Pose Token Pruner (MGPTP) performs fine-grained semantic pruning via clustering, retaining the most informative pose tokens. Result: Experiments on Human3.6M and MPI-INF-3DHP show that, compared with prior diffusion-based methods, HTP reduces training MACs by 38.5% and inference MACs by 56.8% and improves inference speed by an average of 81.1%, while achieving state-of-the-art performance. Conclusion: HTP is an efficient diffusion-based approach that significantly reduces computational cost while maintaining high-quality output for 3D human pose estimation. Abstract: Diffusion models have demonstrated strong capabilities in generating high-fidelity 3D human poses, yet their iterative nature and multi-hypothesis requirements incur substantial computational cost. In this paper, we propose an Efficient Diffusion-Based 3D Human Pose Estimation framework with a Hierarchical Temporal Pruning (HTP) strategy, which dynamically prunes redundant pose tokens across both frame and semantic levels while preserving critical motion dynamics. HTP operates in a staged, top-down manner: (1) Temporal Correlation-Enhanced Pruning (TCEP) identifies essential frames by analyzing inter-frame motion correlations through adaptive temporal graph construction; (2) Sparse-Focused Temporal MHSA (SFT MHSA) leverages the resulting frame-level sparsity to reduce attention computation, focusing on motion-relevant tokens; and (3) Mask-Guided Pose Token Pruner (MGPTP) performs fine-grained semantic pruning via clustering, retaining only the most informative pose tokens. Experiments on Human3.6M and MPI-INF-3DHP show that HTP reduces training MACs by 38.5%, inference MACs by 56.8%, and improves inference speed by an average of 81.1% compared to prior diffusion-based methods, while achieving state-of-the-art performance.
[54] Print2Volume: Generating Synthetic OCT-based 3D Fingerprint Volume from 2D Fingerprint Image
Qingran Miao,Haixia Wang,Haohao Sun,Yilong Zhang
Main category: cs.CV
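TL;DR: This paper presents Print2Volume, a framework that generates synthetic OCT 3D fingerprint data in three stages (2D style transfer, 3D structure expansion, and OCT realism refinement), producing high-quality data that addresses data scarcity and significantly improves recognition performance.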
Details
Motivation: OCT can capture high-resolution 3D fingerprint data, but acquisition is costly and time-consuming, so large-scale public datasets are scarce, which hinders the development of deep learning models. Method: The Print2Volume framework has three stages: (1) a 2D style transfer module converts a binary fingerprint into a grayscale image that mimics the style of an OCT scan; (2) a 3D Structure Expansion Network extrapolates the 2D image into a plausible 3D anatomical volume; (3) an OCT Realism Refiner based on a 3D GAN generates 3D fingerprint data with realistic textures and noise. Result: Print2Volume was used to generate a synthetic dataset of 420,000 samples. Quantitative experiments show that the synthetic data is of high quality and significantly improves recognition performance. On the ZJUT-EIFD benchmark, the EER drops from 15.62% to 2.50%. Conclusion: Print2Volume effectively alleviates data scarcity: pre-training on the synthetic data and then fine-tuning substantially improves a recognition model's performance on real data, demonstrating the effectiveness of the approach. Abstract: Optical Coherence Tomography (OCT) enables the acquisition of high-resolution, three-dimensional fingerprint data, capturing rich subsurface structures for robust biometric recognition. However, the high cost and time-consuming nature of OCT data acquisition have led to a scarcity of large-scale public datasets, significantly hindering the development of advanced algorithms, particularly data-hungry deep learning models. To address this critical bottleneck, this paper introduces Print2Volume, a novel framework for generating realistic, synthetic OCT-based 3D fingerprints from a 2D fingerprint image. Our framework operates in three sequential stages: (1) a 2D style transfer module that converts a binary fingerprint into a grayscale image mimicking the style of a Z-direction mean-projected OCT scan; (2) a 3D Structure Expansion Network that extrapolates the 2D image into a plausible 3D anatomical volume; and (3) an OCT Realism Refiner, based on a 3D GAN, that renders the structural volume with authentic textures, speckle noise, and other imaging characteristics. Using Print2Volume, we generated a large-scale synthetic dataset of 420,000 samples. Quantitative experiments demonstrate the high quality of our synthetic data and its significant impact on recognition performance. 
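The Equal Error Rate (EER) cited for this entry is the operating point at which the false acceptance rate and the false rejection rate of a verification system coincide. The sketch below shows one simple way to approximate it from similarity scores; it is not the authors' evaluation code, and the function name `equal_error_rate`, the score lists, and the "higher score means same finger" convention are assumptions for illustration.

```python
def equal_error_rate(genuine, impostor):
    """Approximate the EER from two lists of similarity scores.

    genuine:  scores of matching pairs (assumed: higher = more similar).
    impostor: scores of non-matching pairs.
    """
    thresholds = sorted(set(genuine) | set(impostor))
    best_gap, eer = float("inf"), None
    for t in thresholds:
        # False rejection rate: genuine pairs scoring below the threshold.
        frr = sum(1 for s in genuine if s < t) / len(genuine)
        # False acceptance rate: impostor pairs scoring at or above it.
        far = sum(1 for s in impostor if s >= t) / len(impostor)
        gap = abs(far - frr)
        if gap < best_gap:
            # Keep the threshold where the two error rates are closest.
            best_gap, eer = gap, (far + frr) / 2.0
    return eer


genuine_scores = [0.9, 0.8, 0.7, 0.4]
impostor_scores = [0.6, 0.3, 0.2, 0.1]
print(equal_error_rate(genuine_scores, impostor_scores))
```

With these toy scores, one genuine pair falls below and one impostor pair rises above the best threshold, so both error rates are 0.25. Production evaluations typically interpolate between thresholds rather than taking the closest discrete point.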
By pre-training a recognition model on our synthetic data and fine-tuning it on a small real-world dataset, we achieved a remarkable reduction in the Equal Error Rate (EER) from 15.62% to 2.50% on the ZJUT-EIFD benchmark, proving the effectiveness of our approach in overcoming data scarcity.
[55] Identifying Surgical Instruments in Laparoscopy Using Deep Learning Instance Segmentation
Sabrina Kletz,Klaus Schoeffmann,Jenny Benois-Pineau,Heinrich Husslein
Main category: cs.CV
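TL;DR: This paper studies the segmentation and recognition of surgical instruments in videos of laparoscopic gynecological surgery, finding that instrument localization and segmentation work well while recognizing the instrument type remains challenging.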
Details
Motivation: Recorded surgery videos have become an important information source in medical endoscopy, but automatic content indexing remains a major challenge because of the very specific video content. Methods for segmenting and recognizing surgical instruments are therefore needed to enable content-based retrieval of medical videos. Method: The paper uses a region-based fully convolutional network for instance-aware surgical instrument segmentation and recognition, covering (1) instrument instance segmentation as binary segmentation and (2) instrument type recognition as multi-class recognition. Result: Experiments show that instrument regions can be localized and segmented with high accuracy even with relatively few training examples, but accuracy in determining the specific instrument type remains low. Conclusion: Although a moderate number of training examples suffices for accurate instrument localization and segmentation, determining the specific instrument type remains challenging because surgical instruments are inherently very similar. Abstract: Recorded videos from surgeries have become an increasingly important information source for the field of medical endoscopy, since the recorded footage shows every single detail of the surgery. However, while video recording is straightforward these days, automatic content indexing - the basis for content-based search in a medical video archive - is still a great challenge due to the very special video content. In this work, we investigate segmentation and recognition of surgical instruments in videos recorded from laparoscopic gynecology. More precisely, we evaluate the achievable performance of segmenting surgical instruments from their background by using a region-based fully convolutional network for instance-aware (1) instrument segmentation as well as (2) instrument recognition. While the first part addresses only binary segmentation of instances (i.e., distinguishing between instrument or background) we also investigate multi-class instrument recognition (i.e., identifying the type of instrument). Our evaluation results show that even with a moderately low number of training examples, we are able to localize and segment instrument regions with a pretty high accuracy. However, the results also reveal that determining the particular instrument is still very challenging, due to the inherently high similarity of surgical instruments.
[56] SatDINO: A Deep Dive into Self-Supervised Pretraining for Remote Sensing
Jakub Straka,Ivan Gruber
Main category: cs.CV
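TL;DR: SatDINO is a new self-supervised learning model for remote sensing image analysis that uses contrastive learning to outperform existing masked autoencoder methods and performs well on multiple benchmarks.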
Details
Motivation: Self-supervised learning has become a powerful tool for remote sensing, where large amounts of unlabeled data are available. Method: The study uses DINO, a contrastive self-supervised method, for pretraining on remote sensing imagery and introduces the SatDINO model. Result: In extensive experiments on multiple datasets, SatDINO outperforms other methods, and a rigorous ablation study evaluates its individual components. Conclusion: SatDINO is a model tailored for representation learning in satellite imagery; it outperforms other state-of-the-art methods based on masked autoencoders (MAE) and achieves competitive results on multiple benchmarks. Abstract: Self-supervised learning has emerged as a powerful tool for remote sensing, where large amounts of unlabeled data are available. In this work, we investigate the use of DINO, a contrastive self-supervised method, for pretraining on remote sensing imagery. We introduce SatDINO, a model tailored for representation learning in satellite imagery. Through extensive experiments on multiple datasets in multiple testing setups, we demonstrate that SatDINO outperforms other state-of-the-art methods based on much more common masked autoencoders (MAE) and achieves competitive results in multiple benchmarks. We also provide a rigorous ablation study evaluating SatDINO's individual components. Finally, we propose a few novel enhancements, such as a new way to incorporate ground sample distance (GSD) encoding and adaptive view sampling. These enhancements can be used independently on our SatDINO model. Our code and trained models are available at: https://github.com/strakaj/SatDINO.
[57] Standardized Multi-Layer Tissue Maps for Enhanced Artificial Intelligence Integration and Search in Large-Scale Whole Slide Image Archives
Gernot Fiala,Markus Plass,Robert Harb,Peter Regitnig,Kristijan Skok,Wael Al Zoughbi,Carmen Zerner,Paul Torke,Michaela Kargl,Heimo Müller,Tomas Brazdil,Matej Gallo,Jaroslav Kubín,Roman Stoklasa,Rudolf Nenutil,Norman Zerbe,Andreas Holzinger,Petr Holub
Main category: cs.CV
TL;DR: This paper proposes a standardized framework for generating detailed tissue maps of Whole Slide Images (WSIs), enhancing their usability in AI development by providing structured, interoperable metadata.
Details
Motivation: The lack of metadata standardization for Whole Slide Images (WSIs) necessitates manual inspection, which is impractical for large-scale AI training and validation. This work aims to establish a standardized and automated approach for describing WSI content. Method: A general framework was developed to generate a 2D index map for WSIs, along with a profiling mechanism tailored for specific application domains. The framework was demonstrated in clinical pathology using standardized syntax and semantics for interoperability. Result: A three-layer tissue map (source, tissue type, and pathological alterations) was successfully developed to provide detailed, fine-grained WSI content information. The proposed standard demonstrated advantages in WSI cataloging, machine learning, and graph-based representations. Conclusion: The proposed framework enhances WSI collections with detailed tissue maps, offering fine-grained content information organized into three layers: source, tissue type, and pathological alterations. This facilitates interoperability and efficient utilization of WSIs in clinical pathology and other domains. Abstract: A Whole Slide Image (WSI) is a high-resolution digital image created by scanning an entire glass slide containing a biological specimen, such as tissue sections or cell samples, at multiple magnifications. These images can be viewed, analyzed, shared digitally, and are used today for Artificial Intelligence (AI) algorithm development. WSIs are used in a variety of fields, including pathology for diagnosing diseases and oncology for cancer research. They are also utilized in neurology, veterinary medicine, hematology, microbiology, dermatology, pharmacology, toxicology, immunology, and forensic science. When assembling cohorts for the training or validation of an AI algorithm, it is essential to know what is present on such a WSI. 
However, there is currently no standard for this metadata, so such selection has mainly been done through manual inspection, which is not suitable for large collections with several million objects. We propose a general framework to generate a 2D index map for WSI and a profiling mechanism for specific application domains. We demonstrate this approach in the field of clinical pathology, using common syntax and semantics to achieve interoperability between different catalogs. Our approach augments each WSI collection with a detailed tissue map that provides fine-grained information about the WSI content. The tissue map is organized into three layers: source, tissue type, and pathological alterations, with each layer assigning segments of the WSI to specific classes. We illustrate the advantages and applicability of the proposed standard through specific examples in WSI catalogs, Machine Learning (ML), and graph-based WSI representations.
[58] Unsupervised Incremental Learning Using Confidence-Based Pseudo-Labels
Lucas Rakotoarivony
Main category: cs.CV
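TL;DR: This paper proposes an unsupervised incremental learning method based on confidence-based pseudo-labels (ICPL), addressing the unrealistic assumption in existing CIL methods that data is fully labeled.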
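The general idea of confidence-based pseudo-labelling named in this entry can be sketched as follows: a model's prediction on an unlabeled sample is kept as its label only when the top softmax probability clears a threshold. This is a generic illustration, not the ICPL selection rule itself, which the entry does not spell out; the function names `softmax` and `select_pseudo_labels` and the threshold of 0.9 are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of raw scores."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def select_pseudo_labels(batch_logits, threshold=0.9):
    """Return (sample_index, predicted_class) pairs for confident predictions.

    The threshold value is an assumed hyperparameter, chosen for illustration.
    """
    selected = []
    for index, logits in enumerate(batch_logits):
        probs = softmax(logits)
        confidence = max(probs)
        if confidence >= threshold:
            # Keep the argmax class as the pseudo-label for this sample.
            selected.append((index, probs.index(confidence)))
    return selected


unlabeled_logits = [
    [5.0, 0.0, 0.0],  # confident prediction for class 0
    [1.0, 1.1, 0.9],  # ambiguous prediction, discarded
    [0.0, 0.0, 6.0],  # confident prediction for class 2
]
print(select_pseudo_labels(unlabeled_logits))
```

Only the first and third samples pass the threshold here; the ambiguous middle sample is left out of the incremental training set.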
Details
Motivation: In real-world scenarios, novel classes unseen during training often emerge, yet existing CIL methods assume that the incremental dataset is fully labeled, which is unrealistic in practice. Method: The paper proposes an unsupervised Incremental Learning method using Confidence-based Pseudo-labels (ICPL), which integrates pseudo-labels into various CIL methods with confidence-based selection. Result: Performance degradation was evaluated on CIFAR100 and ImageNet100, and a comparison with popular class-iNCD methods showed ICPL to be practical and competitive. Conclusion: ICPL achieves results competitive with supervised methods and outperforms state-of-the-art class-iNCD methods by more than 5% in final accuracy. Abstract: Deep learning models have achieved state-of-the-art performance in many computer vision tasks. However, in real-world scenarios, novel classes that were unseen during training often emerge, requiring models to acquire new knowledge incrementally. Class-Incremental Learning (CIL) methods enable a model to learn novel classes while retaining knowledge of previous classes. However, these methods make the strong assumption that the incremental dataset is fully labeled, which is unrealistic in practice. In this work, we propose an unsupervised Incremental Learning method using Confidence-based Pseudo-labels (ICPL), which replaces human annotations with pseudo-labels, enabling incremental learning from unlabeled datasets. We integrate these pseudo-labels into various CIL methods with confidence-based selection and evaluate performance degradation on CIFAR100 and ImageNet100. Then, we compare our approach to popular Class Incremental Novel Category Discovery (class-iNCD) methods addressing similar challenges. Additionally, we apply our method to fine-grained datasets to demonstrate its real-world practicality and measure its computational complexity to validate its suitability for resource-constrained environments. ICPL achieves competitive results compared to supervised methods and outperforms state-of-the-art class-iNCD methods by more than 5% in final accuracy.
[59] MedShift: Implicit Conditional Transport for X-Ray Domain Adaptation
Francisco Caetano,Christiaan Viviers,Peter H. H. de With,Fons van der Sommen
Main category: cs.CV
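TL;DR: This paper presents MedShift, an efficient generative model for cross-domain translation between synthetic and real X-ray images, addressing domain adaptation in medical imaging.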
Details
Motivation: Synthetic medical data is promising for training models, but domain gaps limit its generalization to real clinical settings, especially the marked differences in attenuation behavior, noise characteristics, and soft tissue representation in X-ray images. Method: The paper proposes MedShift, a unified class-conditional generative model based on Flow Matching and Schrodinger Bridges that learns a shared domain-agnostic latent space and supports seamless translation between any pair of training domains. Result: Despite being smaller than diffusion-based approaches, MedShift performs strongly in experiments and is flexible at inference time, as it can be tuned to prioritize either perceptual fidelity or structural consistency. Conclusion: MedShift is a unified class-conditional generative approach that enables high-fidelity, unpaired cross-domain image translation for domain adaptation in medical imaging. Abstract: Synthetic medical data offers a scalable solution for training robust models, but significant domain gaps limit its generalizability to real-world clinical settings. This paper addresses the challenge of cross-domain translation between synthetic and real X-ray images of the head, focusing on bridging discrepancies in attenuation behavior, noise characteristics, and soft tissue representation. We propose MedShift, a unified class-conditional generative model based on Flow Matching and Schrodinger Bridges, which enables high-fidelity, unpaired image translation across multiple domains. Unlike prior approaches that require domain-specific training or rely on paired data, MedShift learns a shared domain-agnostic latent space and supports seamless translation between any pair of domains seen during training. We introduce X-DigiSkull, a new dataset comprising aligned synthetic and real skull X-rays under varying radiation doses, to benchmark domain translation models. Experimental results demonstrate that, despite its smaller model size compared to diffusion-based approaches, MedShift offers strong performance and remains flexible at inference time, as it can be tuned to prioritize either perceptual fidelity or structural consistency, making it a scalable and generalizable solution for domain adaptation in medical imaging. The code and dataset are available at https://caetas.github.io/medshift.html
[60] Trees as Gaussians: Large-Scale Individual Tree Mapping
Dimitri Gominski,Martin Brandt,Xiaoye Tong,Siyu Liu,Maurice Mugabowindekwe,Sizhuo Li,Florian Reiner,Andrew Davies,Rasmus Fensholt
Main category: cs.CV
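TL;DR: This study develops a deep learning method that detects large individual trees at global scale in PlanetScope imagery, trained with lidar data, and achieves accurate tree cover monitoring.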
Details
Motivation: Existing global tree monitoring products focus on binary tree cover or canopy height and cannot identify individual trees, so a method for monitoring individual trees at large scale is needed. Method: Using 3-m resolution PlanetScope imagery, the approach simulates tree crowns with Gaussian kernels, extracts crown centers, and generates binary tree cover maps; training uses billions of points automatically extracted from airborne lidar data. Result: The method compares well against existing tree cover maps and airborne lidar (fractional cover R² = 0.81), shows balanced detection metrics across biomes, and improves further when fine-tuned with manual labels. Conclusion: The study presents a deep learning approach for detecting large individual trees at global scale, offering a scalable framework for future high-resolution tree monitoring and satellite missions. Abstract: Trees are key components of the terrestrial biosphere, playing vital roles in ecosystem function, climate regulation, and the bioeconomy. However, large-scale monitoring of individual trees remains limited by inadequate modelling. Available global products have focused on binary tree cover or canopy height, which do not explicitly identify trees at the individual level. In this study, we present a deep learning approach for detecting large individual trees in 3-m resolution PlanetScope imagery at a global scale. We simulate tree crowns with Gaussian kernels of scalable size, allowing the extraction of crown centers and the generation of binary tree cover maps. Training is based on billions of points automatically extracted from airborne lidar data, enabling the model to successfully identify trees both inside and outside forests. We compare against existing tree cover maps and airborne lidar with state-of-the-art performance (fractional cover $R^2 = 0.81$ against aerial lidar), report balanced detection metrics across biomes, and demonstrate how detection can be further improved through fine-tuning with manual labels. Our method offers a scalable framework for global, high-resolution tree monitoring, and is adaptable to future satellite missions offering improved imagery.
[61] Scale-GS: Efficient Scalable Gaussian Splatting via Redundancy-filtering Training on Streaming Content
Jiayu Yang,Weijian Su,Songqian Zhang,Yuqi Han,Jinli Suo,Qiang Zhang
Main category: cs.CV
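TL;DR: Scale-GS is an efficient 3D Gaussian Splatting framework that uses a hierarchical structure, hybrid deformation, and adaptive masking to achieve fast training and high-quality rendering in streaming tasks.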
Details
Motivation: Extending 3D Gaussian Splatting (3DGS) to dynamic scenes is limited by the large data volume of dense Gaussians and the long per-frame training time, so a scalable and efficient training framework is needed. Method: Scale-GS organizes Gaussian spheres hierarchically within an anchor-based structure and combines a hybrid deformation and spawning strategy with a bidirectional adaptive masking mechanism to improve training efficiency. Result: Experiments show that Scale-GS significantly reduces training time while maintaining high visual quality, outperforming existing methods. Conclusion: Scale-GS achieves better visual quality than existing state-of-the-art methods while significantly reducing training time, making it suitable for efficient training in streaming tasks. Abstract: 3D Gaussian Splatting (3DGS) enables high-fidelity real-time rendering, a key requirement for immersive applications. However, the extension of 3DGS to dynamic scenes remains limited by the substantial data volume of dense Gaussians and the prolonged training time required for each frame. This paper presents Scale-GS, a scalable Gaussian Splatting framework designed for efficient training in streaming tasks. Specifically, Gaussian spheres are hierarchically organized by scale within an anchor-based structure. Coarser-level Gaussians represent the low-resolution structure of the scene, while finer-level Gaussians, responsible for detailed high-fidelity rendering, are selectively activated by the coarser-level Gaussians. To further reduce computational overhead, we introduce a hybrid deformation and spawning strategy that models inter-frame motion through Gaussian deformation and triggers Gaussian spawning to characterize wide-range motion. Additionally, a bidirectional adaptive masking mechanism enhances training efficiency by removing static regions and prioritizing informative viewpoints. Extensive experiments demonstrate that Scale-GS achieves superior visual quality while significantly reducing training time compared to state-of-the-art methods.
[62] One More Glance with Sharp Eyes: Rethinking Lightweight Captioning as a Practical Visual Specialist
Junha Song,Yongsik Jo,So Yeon Min,Quanting Xie,Taehwan Kim,Yonatan Bisk,Jaegul Choo
Main category: cs.CV
TL;DR: This paper introduces a lightweight image captioning model and proposes the Sharp-Eyed Refinement framework to improve visual grounding and caption quality for on-device applications.
Details
Motivation: Deploying multimodal large language models (MLLMs) on local devices is challenging due to their high computational demands, and existing models suffer from visual blindness, causing semantic captioning errors. Method: The study implements a lightweight captioning model based on a 125M-parameter language model and develops a novel framework called Sharp-Eyed Refinement, which uses DeepLens to extract detailed visual representations. Result: The lightweight model achieves performance comparable to large multimodal generalists, and the Sharp-Eyed Refinement framework effectively addresses visual blindness, outperforming prior small captioning models and large generalists. Conclusion: The Sharp-Eyed Refinement framework significantly improves caption quality by enhancing visual grounding through detailed visual representations. Abstract: Image captioning is fundamental for applications like video instruction systems and exploration robots, yet deploying such models on local devices is challenging due to the high computational demands of multimodal large language models (MLLMs). To address this, we first explore lightweight captioning by implementing a specialist based on a 125M-parameter language model, 56 times smaller than LLaMA-7B, and evaluating its performance on both single-sentence and detailed captioning tasks. Surprisingly, we find that our model can achieve performance comparable to large multimodal generalists, suggesting its potential to serve as a strong visual specialist for on-device applications. While promising, our model also exhibits a limitation: like other MLLMs, it suffers from visual blindness, occasionally resulting in semantic captioning errors. We carry out toy experiments and investigate the underlying causes, where we observe that the problems arise from ineffective attention mechanisms and limited visual representations. 
To alleviate them, we develop a novel captioning framework, Sharp-Eyed Refinement, which enhances caption quality through improved visual grounding. At its core, our DeepLens extracts detailed visual representations by concentrating on informative regions identified during the initial glance. Our experiments confirm both the advantages of our specialist over prior small captioning models and large generalists and the effectiveness of our framework.
[63] Federated Fine-tuning of SAM-Med3D for MRI-based Dementia Classification
Kaouther Mouheb,Marawan Elbatel,Janne Papma,Geert Jan Biessels,Jurgen Claassen,Huub Middelkoop,Barbara van Munster,Wiesje van der Flier,Inez Ramakers,Stefan Klein,Esther E. Bron
Main category: cs.CV
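TL;DR: This paper studies key design choices for fine-tuning foundation models in federated learning systems, finds that classification head architecture, fine-tuning strategy, and aggregation method significantly affect performance and efficiency, and offers insights for deploying foundation models in decentralized clinical settings.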
Details
Motivation: Although foundation models hold great potential for AI-based dementia diagnosis, their integration into federated learning systems remains underexplored. Method: Using brain MRI data, the study systematically evaluates how classification head architecture, fine-tuning strategy, and aggregation method affect the performance and efficiency of federated foundation model fine-tuning. Result: The classification head architecture substantially influences performance, freezing the foundation model encoder performs comparably to full fine-tuning, and advanced aggregation methods outperform standard federated averaging. Conclusion: Key design choices materially affect foundation model fine-tuning in federated learning systems, and the study offers practical insights for deploying foundation models in decentralized clinical settings. Abstract: While foundation models (FMs) offer strong potential for AI-based dementia diagnosis, their integration into federated learning (FL) systems remains underexplored. In this benchmarking study, we systematically evaluate the impact of key design choices: classification head architecture, fine-tuning strategy, and aggregation method, on the performance and efficiency of federated FM tuning using brain MRI data. Using a large multi-cohort dataset, we find that the architecture of the classification head substantially influences performance, freezing the FM encoder achieves comparable results to full fine-tuning, and advanced aggregation methods outperform standard federated averaging. Our results offer practical insights for deploying FMs in decentralized clinical settings and highlight trade-offs that should guide future method development.
[64] Multi-Method Ensemble for Out-of-Distribution Detection
Lucas Rakotoarivony
Main category: cs.CV
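TL;DR: This paper proposes MME, a new out-of-distribution detection method that combines feature truncation with scoring functions and performs strongly across multiple benchmarks.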
Details
Motivation: Existing methods usually focus on a single technique or a specific type of out-of-distribution dataset, overlooking the potential of combining multiple existing solutions. Method: The paper proposes the Multi-Method Ensemble (MME) score, which unifies state-of-the-art out-of-distribution detectors. Result: MME significantly outperforms recent state-of-the-art methods on all benchmarks; with the BiT model it reaches an average FPR95 of 27.57% on the ImageNet-1K benchmark, a 6% improvement over the best existing baseline. Conclusion: Combining state-of-the-art feature truncation with scoring functions effectively improves out-of-distribution detection, and aggregating multiple scoring functions strengthens robustness to varied out-of-distribution samples. Abstract: Detecting out-of-distribution (OOD) samples is essential for neural networks operating in open-world settings, particularly in safety-critical applications. Existing methods have improved OOD detection by leveraging two main techniques: feature truncation, which increases the separation between in-distribution (ID) and OOD samples, and scoring functions, which assign scores to distinguish between ID and OOD data. However, most approaches either focus on a single family of techniques or evaluate their effectiveness on a specific type of OOD dataset, overlooking the potential of combining multiple existing solutions. Motivated by this observation, we theoretically and empirically demonstrate that state-of-the-art feature truncation and scoring functions can be effectively combined. Moreover, we show that aggregating multiple scoring functions enhances robustness against various types of OOD samples. Based on these insights, we propose the Multi-Method Ensemble (MME) score, which unifies state-of-the-art OOD detectors into a single, more effective scoring function. Extensive experiments on both large-scale and small-scale benchmarks, covering near-OOD and far-OOD scenarios, show that MME significantly outperforms recent state-of-the-art methods across all benchmarks. Notably, using the BiT model, our method achieves an average FPR95 of 27.57% on the challenging ImageNet-1K benchmark, improving performance by 6% over the best existing baseline.
[65] Why Stop at Words? Unveiling the Bigger Picture through Line-Level OCR
Shashank Vempati,Nishit Anand,Gaurav Talebailkar,Arpan Garai,Chetan Arora
Main category: cs.CV
TL;DR: This paper proposes a line-level OCR approach to bypass word segmentation errors, improving accuracy by 5.4% and efficiency by 4 times compared to traditional word-level methods.
Details
Motivation: Modern OCR techniques rely on word segmentation, which introduces a new bottleneck in accuracy. This paper aims to explore a progression to line-level OCR to improve context and accuracy. Method: The authors propose a shift from word-level to line-level OCR to bypass errors in word detection, using a dataset of 251 English page images with line-level annotations. They compare the performance of their approach to word-based OCR pipelines. Result: The proposed line-level OCR technique showed a 5.4% improvement in end-to-end accuracy and a 4 times improvement in efficiency compared to word-based pipelines. Conclusion: Transitioning from word-level to line-level OCR improves accuracy and efficiency, especially for document images, while allowing better exploitation of large language models. Abstract: Conventional optical character recognition (OCR) techniques segmented each character and then recognized it. This made them prone to errors in character segmentation and left them without the context needed to exploit language models. Advances in sequence-to-sequence translation in the last decade led to modern techniques that first detect words and then input one word at a time to a model that directly outputs full words as sequences of characters. This allowed better utilization of language models and bypassed the error-prone character segmentation step. We observe that the above transition in style has moved the bottleneck in accuracy to word segmentation. Hence, in this paper, we propose a natural and logical progression from word-level OCR to line-level OCR. The proposal bypasses errors in word detection and provides larger sentence context for better utilization of language models. We show that the proposed technique improves not only the accuracy but also the efficiency of OCR. Despite our thorough literature survey, we did not find any public dataset to train and benchmark such a shift from word-level to line-level OCR.
Hence, we also contribute a meticulously curated dataset of 251 English page images with line-level annotations. Our experimentation revealed a notable end-to-end accuracy improvement of 5.4%, underscoring the potential benefits of transitioning towards line-level OCR, especially for document images. We also report a 4 times improvement in efficiency compared to word-based pipelines. With continuous improvements in large language models, our methodology also holds potential to exploit such advances. Project Website: https://nishitanand.github.io/line-level-ocr-website
[66] Adversarial Patch Attack for Ship Detection via Localized Augmentation
Chun Liu,Panpan Ding,Zheng Zheng,Hailong Wang,Bingqian Zhu,Tao Xu,Zhigang Han,Jiayao Wang
Main category: cs.CV
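TL;DR: This paper proposes an adversarial patch attack on ship detection in remote sensing imagery that uses a localized augmentation strategy to reduce background interference, improving attack success rate and transferability.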
Details
Motivation: Existing adversarial example generation methods based on data transformation improve transferability, but over-augmenting the background or non-target regions can introduce unnecessary interference and cause false detections by the detection model. Method: The paper proposes a localized augmentation method that augments only the target regions and avoids interfering with non-target regions. Result: Experiments show that the method significantly raises the success rate and transferability of adversarial patch attacks on the HRSC2016 dataset. Conclusion: Localized augmentation effectively improves the success rate and transferability of adversarial patch attacks, as validated on the HRSC2016 dataset. Abstract: Current ship detection techniques based on remote sensing imagery primarily rely on the object detection capabilities of deep neural networks (DNNs). However, DNNs are vulnerable to adversarial patch attacks, which can lead to misclassification by the detection model or complete evasion of the targets. Numerous studies have demonstrated that data transformation-based methods can improve the transferability of adversarial examples. However, excessive augmentation of image backgrounds or irrelevant regions may introduce unnecessary interference, resulting in false detections of the object detection model. These errors are not caused by the adversarial patches themselves but rather by the over-augmentation of background and non-target areas. This paper proposes a localized augmentation method that applies augmentation only to the target regions, avoiding any influence on non-target areas. By reducing background interference, this approach enables the loss function to focus more directly on the impact of the adversarial patch on the detection model, thereby improving the attack success rate. Experiments conducted on the HRSC2016 dataset demonstrate that the proposed method effectively increases the success rate of adversarial patch attacks and enhances their transferability.
[67] ELV-Halluc: Benchmarking Semantic Aggregation Hallucinations in Long Video Understanding
Hao Lu,Jiahao Wang,Yaolun Zhang,Ruohui Wang,Xuanyu Zheng,Yepeng Tang,Dahua Lin,Lewei Lu
Main category: cs.CV
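TL;DR: The study introduces the ELV-Halluc benchmark to systematically investigate Semantic Aggregation Hallucination (SAH) in long videos, and mitigates it with positional encoding and DPO strategies, reducing the SAH ratio by 27.7%.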
Details
Motivation: Video multimodal large language models have made remarkable progress in video understanding, but they still suffer from Semantic Aggregation Hallucination (SAH) in long videos, which calls for systematic study and mitigation. Method: The authors build a dataset of 8K adversarial data pairs, apply a positional encoding strategy and a DPO strategy to mitigate SAH, and validate the approach on ELV-Halluc and Video-MME. Result: Experiments confirm that SAH exists and increases with semantic complexity; the positional encoding and DPO strategies effectively mitigate SAH, yielding improvements on both ELV-Halluc and Video-MME, including a 27.7% reduction in SAH ratio. Conclusion: The ELV-Halluc benchmark and its systematic study of SAH offer a new perspective on, and approaches to, hallucination in long-video understanding. Abstract: Video multimodal large language models (Video-MLLMs) have achieved remarkable progress in video understanding. However, they remain vulnerable to hallucination, producing content inconsistent with or unrelated to video inputs. Previous video hallucination benchmarks primarily focus on short videos. They attribute hallucinations to factors such as strong language priors, missing frames, or vision-language biases introduced by the visual encoder. While these causes indeed account for most hallucinations in short videos, they still oversimplify the cause of hallucinations. Sometimes, models generate incorrect outputs but with correct frame-level semantics. We refer to this type of hallucination as Semantic Aggregation Hallucination (SAH), which arises during the process of aggregating frame-level semantics into event-level semantic groups. Given that SAH becomes particularly critical in long videos due to increased semantic complexity across multiple events, it is essential to separate and thoroughly investigate the causes of this type of hallucination. To address the above issues, we introduce ELV-Halluc, the first benchmark dedicated to long-video hallucination, enabling a systematic investigation of SAH. Our experiments confirm the existence of SAH and show that it increases with semantic complexity. Additionally, we find that models are more prone to SAH on rapidly changing semantics. Moreover, we discuss potential approaches to mitigate SAH. We demonstrate that positional encoding strategy contributes to alleviating SAH, and further adopt DPO strategy to enhance the model's ability to distinguish semantics within and across events.
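The DPO strategy mentioned above trains on pairs of a preferred and a rejected response. As a minimal sketch, assuming the standard DPO objective and hypothetical sequence log-probabilities (the abstract does not say how the pairs are scored, and `beta=0.1` is an assumed value), the loss for one pair could be computed as follows:

```python
import math


def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss for a single preference pair.

    The four log-probabilities are hypothetical inputs: sequence
    log-likelihoods of the preferred and rejected answers under the policy
    being trained and under a frozen reference model.
    """
    # Implicit reward margin: how much more the policy favors the chosen
    # answer over the rejected one, relative to the reference model.
    margin = (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    z = beta * margin
    # -log(sigmoid(z)), written in a numerically stable form.
    if z > 0:
        return math.log1p(math.exp(-z))
    return -z + math.log1p(math.exp(z))
```

The loss shrinks as the policy assigns relatively more probability to the preferred answer than the reference model does, which is the behavior one would want when the pair contrasts correct and incorrect event-level semantics.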
To support this, we curate a dataset of 8K adversarial data pairs and achieve improvements on both ELV-Halluc and Video-MME, including a substantial 27.7% reduction in SAH ratio.
[68] Maybe you don't need a U-Net: convolutional feature upsampling for materials micrograph segmentation
Ronan Docherty,Antonis Vamvakeros,Samuel J. Cooper
Main category: cs.CV
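TL;DR: This study proposes a convolutional neural network upsampler that strengthens foundation model features for micrograph segmentation, delivering efficient, high-quality results.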
Details
Motivation: Foundation model feature descriptors are usually patch-based, so they struggle to represent the fine features in micrographs and to handle the large image sizes common in materials and biological image analysis. A method that raises feature resolution and improves segmentation is therefore needed. Method: A convolutional neural network is trained to upsample low-resolution foundation model features; it is validated on microscopy image datasets, and result quality is assessed through interactive segmentation. Result: The method efficiently featurises and segments micrographs and can separate hard-to-segment phases such as hairline cracks. Conclusion: A convolutional neural network is trained to upsample low-resolution foundation model features with reference to the input image. The approach efficiently featurises and segments a variety of microscopy images, and interactive segmentation yields high-quality results with fewer labels and less time than a traditional convolutional network. Abstract: Feature foundation models, usually vision transformers, offer rich semantic descriptors of images, useful for downstream tasks such as (interactive) segmentation and object detection. For computational efficiency these descriptors are often patch-based, and so struggle to represent the fine features often present in micrographs; they also struggle with the large image sizes present in materials and biological image analysis. In this work, we train a convolutional neural network to upsample low-resolution (i.e., large patch size) foundation model features with reference to the input image. We apply this upsampler network (without any further training) to efficiently featurise and then segment a variety of microscopy images, including plant cells, a lithium-ion battery cathode and organic crystals. The richness of these upsampled features admits separation of hard-to-segment phases, like hairline cracks. We demonstrate that interactive segmentation with these deep features produces high-quality segmentations far faster and with far fewer labels than training or finetuning a more traditional convolutional network.
[69] HCCM: Hierarchical Cross-Granularity Contrastive and Matching Learning for Natural Language-Guided Drones
Hao Ruan,Jinliang Lin,Yingxin Lai,Zhiming Luo,Shaozi Li
Main category: cs.CV
TL;DR: The paper proposes HCCM, a novel framework for vision-language understanding in Natural Language-Guided Drones, which improves hierarchical semantic alignment and robustness, achieving superior performance in retrieval tasks and zero-shot generalization.
Details
Motivation: Mainstream Vision-Language Models (VLMs) focus on global alignment but lack fine-grained semantics, while existing hierarchical methods struggle with precise entity partitioning and strict containment in dynamic drone environments. This limits their effectiveness in complex vision-language tasks. Method: The HCCM framework introduces two components: (1) RG-ITC for hierarchical local-to-global semantic alignment without precise scene partitioning, and (2) RG-ITM for evaluating local semantic consistency without rigid constraints. Additionally, a Momentum Contrast and Distillation (MCD) mechanism is used to improve robustness against incomplete or ambiguous text descriptions. Result: Experiments on GeoText-1652 achieved state-of-the-art Recall@1 scores of 28.8% (image retrieval) and 14.7% (text retrieval). On the unseen ERA dataset, HCCM showed strong zero-shot generalization with a mean recall (mR) of 39.93%, outperforming fine-tuned baselines. Conclusion: The proposed HCCM framework enhances vision-language understanding for Natural Language-Guided Drones by addressing challenges in fine-grained semantics and dynamic environments, achieving state-of-the-art performance and strong zero-shot generalization. Abstract: Natural Language-Guided Drones (NLGD) provide a novel paradigm for tasks such as target matching and navigation. However, the wide field of view and complex compositional semantics in drone scenarios pose challenges for vision-language understanding. Mainstream Vision-Language Models (VLMs) emphasize global alignment while lacking fine-grained semantics, and existing hierarchical methods depend on precise entity partitioning and strict containment, limiting effectiveness in dynamic environments. 
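The Method summary above describes RG-ITC as contrasting local visual regions with global text and vice versa. The exact loss is not given here, so the following is only a sketch of a generic symmetric image-text contrastive (InfoNCE-style) loss over a similarity matrix; the function name, the temperature value, and the toy matrices in the usage note are assumptions for illustration.

```python
import math


def symmetric_contrastive_loss(sim, temperature=0.07):
    """Symmetric contrastive loss over a square similarity matrix.

    sim[i][j] is the similarity between visual item i (an image or region)
    and text j; matched pairs are assumed to sit on the diagonal.
    """
    n = len(sim)

    def mean_cross_entropy(matrix):
        total = 0.0
        for i in range(n):
            logits = [value / temperature for value in matrix[i]]
            top = max(logits)
            # Log of the softmax normalizer, computed stably.
            log_z = top + math.log(sum(math.exp(v - top) for v in logits))
            total += log_z - logits[i]
        return total / n

    transposed = [[sim[i][j] for i in range(n)] for j in range(n)]
    # Average the visual-to-text and text-to-visual directions.
    return 0.5 * (mean_cross_entropy(sim) + mean_cross_entropy(transposed))
```

With a matrix such as `[[1.0, 0.0], [0.0, 1.0]]`, where matched pairs score highest, the loss is lower than with `[[0.0, 1.0], [1.0, 0.0]]`, where each item is most similar to the wrong partner.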
To address this, we propose the Hierarchical Cross-Granularity Contrastive and Matching learning (HCCM) framework with two components: (1) Region-Global Image-Text Contrastive Learning (RG-ITC), which avoids precise scene partitioning and captures hierarchical local-to-global semantics by contrasting local visual regions with global text and vice versa; (2) Region-Global Image-Text Matching (RG-ITM), which dispenses with rigid constraints and instead evaluates local semantic consistency within global cross-modal representations, enhancing compositional reasoning. Moreover, drone text descriptions are often incomplete or ambiguous, destabilizing alignment. HCCM introduces a Momentum Contrast and Distillation (MCD) mechanism to improve robustness. Experiments on GeoText-1652 show HCCM achieves state-of-the-art Recall@1 of 28.8% (image retrieval) and 14.7% (text retrieval). On the unseen ERA dataset, HCCM demonstrates strong zero-shot generalization with 39.93% mean recall (mR), outperforming fine-tuned baselines.
[70] Complete Gaussian Splats from a Single Image with Denoising Diffusion Models
Ziwei Liao,Mohamed Sayed,Steven L. Waslander,Sara Vicente,Daniyar Turmukhambetov,Michael Firman
Main category: cs.CV
TL;DR: This paper proposes a generative method using a diffusion model to reconstruct complete 3D scenes, including occluded parts, from a single image.
Details
Motivation: Gaussian splatting struggles with reconstructing occluded and unobserved areas, and conventional regression-based methods lead to implausible or blurry results. Method: A latent diffusion model combined with a Variational AutoReconstructor is used to learn a distribution of 3D Gaussian splats from 2D images in a self-supervised manner. Result: The method generates faithful and diverse 3D reconstructions, capable of completing occluded surfaces from a single input image. Conclusion: The proposed generative formulation effectively reconstructs complete 3D scenes, including occluded areas, and produces high-quality 360-degree renderings. Abstract: Gaussian splatting typically requires dense observations of the scene and can fail to reconstruct occluded and unobserved areas. We propose a latent diffusion model to reconstruct a complete 3D scene with Gaussian splats, including the occluded parts, from only a single image during inference. Completing the unobserved surfaces of a scene is challenging due to the ambiguity of the plausible surfaces. Conventional methods use a regression-based formulation to predict a single "mode" for occluded and out-of-frustum surfaces, leading to blurriness, implausibility, and failure to capture multiple possible explanations. Thus, they often address this problem partially, focusing either on objects isolated from the background, reconstructing only visible surfaces, or failing to extrapolate far from the input views. In contrast, we propose a generative formulation to learn a distribution of 3D representations of Gaussian splats conditioned on a single input image. To address the lack of ground-truth training data, we propose a Variational AutoReconstructor to learn a latent space only from 2D images in a self-supervised manner, over which a diffusion model is trained. 
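The point above, that a regression formulation collapses several plausible surfaces into one blurry prediction, can be illustrated with a toy one-dimensional case. This is a sketch with made-up numbers, not the paper's setup: if an occluded surface is equally likely to lie at depth 1.0 or 3.0, the single prediction that minimizes squared error is the mean, which matches neither plausible surface.

```python
import statistics

# Hypothetical toy data: an occluded surface is equally likely to lie at
# depth 1.0 or depth 3.0 (two plausible explanations of the same image).
plausible_depths = [1.0] * 50 + [3.0] * 50


def mse(prediction, targets):
    """Mean squared error of one constant prediction against all targets."""
    return sum((prediction - t) ** 2 for t in targets) / len(targets)


# Brute-force search over candidate predictions on a 0.01 grid from 0 to 4.
candidates = [i / 100 for i in range(0, 401)]
best = min(candidates, key=lambda p: mse(p, plausible_depths))

# The squared-error minimizer coincides with the mean of the targets.
mean_depth = statistics.mean(plausible_depths)
```

Here `best` equals `mean_depth` (2.0), a depth that never occurs in the data. A generative formulation would instead sample either 1.0 or 3.0, which is the motivation for learning a distribution rather than a point estimate.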
Our method generates faithful reconstructions and diverse samples with the ability to complete the occluded surfaces for high-quality 360-degree renderings.
[71] EZ-Sort: Efficient Pairwise Comparison via Zero-Shot CLIP-Based Pre-Ordering and Human-in-the-Loop Sorting
Yujin Park,Haejun Chung,Ikbeom Jang
Main category: cs.CV
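TL;DR: EZ-Sort is an efficient pairwise ranking method that substantially reduces human annotation by combining zero-shot CLIP pre-ordering with an uncertainty-guided human-in-the-loop MergeSort.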
Details
Motivation: Pairwise comparison is more reliable than absolute rating or ordinal classification in subjective or difficult annotation tasks, but exhaustive comparison requires a massive number of annotations (O(n^2)), so annotation efficiency needs to improve. Method: The proposed EZ-Sort first produces a zero-shot pre-ordering with CLIP, then initializes bucket-aware Elo scores, and finally runs an uncertainty-guided human-in-the-loop MergeSort. Result: EZ-Sort reduces human annotation cost by 90.5% compared to exhaustive pairwise comparison and by 19.8% compared to prior work at n = 100, while maintaining or improving inter-rater reliability. Conclusion: By combining CLIP-based priors with uncertainty-aware sampling, EZ-Sort provides an efficient and scalable solution for pairwise ranking that markedly lowers annotation cost while maintaining or improving inter-rater reliability. Abstract: Pairwise comparison is often favored over absolute rating or ordinal classification in subjective or difficult annotation tasks due to its improved reliability. However, exhaustive comparisons require a massive number of annotations (O(n^2)). Recent work has greatly reduced the annotation burden (O(n log n)) by actively sampling pairwise comparisons using a sorting algorithm. We further improve annotation efficiency by (1) roughly pre-ordering items using the Contrastive Language-Image Pre-training (CLIP) model hierarchically without training, and (2) replacing easy, obvious human comparisons with automated comparisons. The proposed EZ-Sort first produces a CLIP-based zero-shot pre-ordering, then initializes bucket-aware Elo scores, and finally runs an uncertainty-guided human-in-the-loop MergeSort. Validation was conducted using various datasets: face-age estimation (FGNET), historical image chronology (DHCI), and retinal image quality assessment (EyePACS). It showed that EZ-Sort reduced human annotation cost by 90.5% compared to exhaustive pairwise comparisons and by 19.8% compared to prior work (when n = 100), while improving or maintaining inter-rater reliability. These results demonstrate that combining CLIP-based priors with uncertainty-aware sampling yields an efficient and scalable solution for pairwise ranking.
[72] ECHO: Ego-Centric modeling of Human-Object interactions
Ilya A. Petrov,Vladimir Guzov,Riccardo Marin,Emre Aksan,Xu Chen,Daniel Cremers,Thabo Beeler,Gerard Pons-Moll
Main category: cs.CV
TL;DR: ECHO is a new method for modeling human-object interactions that recovers interaction information from head and wrist tracking alone.
Details
Motivation: With the growing adoption of wearable devices, modeling human-object interactions from an egocentric perspective is an important yet underexplored problem. Method: ECHO uses a Diffusion Transformer architecture and a unique three-variate diffusion process to jointly model human motion, object trajectory, and contact sequence. Result: ECHO performs strongly on sequences of arbitrary length and sets the state of the art in egocentric human-object interaction reconstruction. Conclusion: With its Diffusion Transformer architecture and three-variate diffusion process, ECHO achieves state-of-the-art results in egocentric human-object interaction modeling. Abstract: Modeling human-object interactions (HOI) from an egocentric perspective is a largely unexplored yet important problem due to the increasing adoption of wearable devices, such as smart glasses and watches. We investigate how much information about interaction can be recovered from only head and wrist tracking. Our answer is ECHO (Ego-Centric modeling of Human-Object interactions), which, for the first time, proposes a unified framework to recover three modalities: human pose, object motion, and contact from such minimal observation. ECHO employs a Diffusion Transformer architecture and a unique three-variate diffusion process, which jointly models human motion, object trajectory, and contact sequence, allowing for flexible input configurations. Our method operates in a head-centric canonical space, enhancing robustness to global orientation. We propose a conveyor-based inference, which progressively increases the diffusion timestamp with the frame position, allowing us to process sequences of any length. Through extensive evaluation, we demonstrate that ECHO outperforms existing methods that do not offer the same flexibility, setting a new state of the art in egocentric HOI reconstruction.
[73] How Well Do Vision-Language Models Understand Cities? A Comparative Study on Spatial Reasoning from Street-View Images
Juneyoung Ro,Namwoo Kim,Yoonjin Yoon
Main category: cs.CV
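TL;DR: This paper examines how well vision-language models perform spatial reasoning on urban scenes and proposes fine-tuning on a synthetic dataset with chain-of-thought supervision to improve performance on complex questions.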
Details
Motivation: Understanding urban scenes requires fine-grained spatial reasoning about objects, layouts, and depth cues. It remains unclear, however, how well current vision-language models (VLMs) pretrained on general scenes transfer these abilities to the urban domain. This study aims to fill that gap. Method: The authors compare three off-the-shelf VLMs (BLIP-2, InstructBLIP, and LLaVA-1.5), evaluating zero-shot performance and the effect of fine-tuning on a synthetic VQA dataset specific to urban scenes. The dataset is built from segmentation, depth, and object detection predictions on street-view images, with each question paired with LLM-generated Chain-of-Thought (CoT) answers for step-by-step reasoning supervision. Result: VLMs perform reasonably well in zero-shot settings, but fine-tuning on the synthetic CoT-supervised dataset substantially boosts performance, especially on challenging question types such as negation and counterfactuals. Conclusion: Fine-tuning VLMs on a synthetic dataset with CoT supervision substantially improves their spatial reasoning on urban scenes, particularly for challenging questions involving negation and counterfactuals. Abstract: Effectively understanding urban scenes requires fine-grained spatial reasoning about objects, layouts, and depth cues. However, how well current vision-language models (VLMs), pretrained on general scenes, transfer these abilities to the urban domain remains underexplored. To address this gap, we conduct a comparative study of three off-the-shelf VLMs (BLIP-2, InstructBLIP, and LLaVA-1.5), evaluating both zero-shot performance and the effects of fine-tuning with a synthetic VQA dataset specific to urban scenes. We construct such a dataset from segmentation, depth, and object detection predictions of street-view images, pairing each question with LLM-generated Chain-of-Thought (CoT) answers for step-by-step reasoning supervision. Results show that while VLMs perform reasonably well in zero-shot settings, fine-tuning with our synthetic CoT-supervised dataset substantially boosts performance, especially for challenging question types such as negation and counterfactuals. This study introduces urban spatial reasoning as a new challenge for VLMs and demonstrates synthetic dataset construction as a practical path for adapting general-purpose models to specialized domains.
[74] Temporal Flow Matching for Learning Spatio-Temporal Trajectories in 4D Longitudinal Medical Imaging
Nico Albert Disch,Yannick Kirchhoff,Robin Peretzke,Maximilian Rokuss,Saikat Roy,Constantin Ulrich,David Zimmerer,Klaus Maier-Hein
Main category: cs.CV
TL;DR: Temporal Flow Matching (TFM) is a new method for medical imaging that improves the modeling of temporal dynamics, outperforming current approaches in predicting 4D medical images.
Details
Motivation: Existing deep learning methods either consider only single temporal contexts or focus on classification or regression tasks, which limits their ability to make fine-grained spatial predictions. TFM aims to model temporal dynamics more effectively for applications such as disease progression modeling, treatment planning, and anatomical development tracking. Method: Temporal Flow Matching (TFM) is a generative trajectory method that learns the underlying temporal distribution and can fall back to predicting the last context image (LCI) as a special case. It supports 3D volumes, multiple prior scans, and irregular sampling. Result: TFM consistently surpasses spatio-temporal methods from natural imaging in extensive benchmarks on three public longitudinal datasets, showing its effectiveness in 4D medical image prediction. Conclusion: Temporal Flow Matching (TFM) is a unified generative trajectory method that addresses the limitations of current deep learning approaches in handling temporal dynamics for medical imaging. It establishes a new state-of-the-art and robust baseline for 4D medical image prediction. Abstract: Understanding temporal dynamics in medical imaging is crucial for applications such as disease progression modeling, treatment planning and anatomical development tracking. However, most deep learning methods either consider only single temporal contexts, or focus on tasks like classification or regression, limiting their ability for fine-grained spatial predictions. While some approaches have been explored, they are often limited to single timepoints, specific diseases or have other technical restrictions. To address this fundamental gap, we introduce Temporal Flow Matching (TFM), a unified generative trajectory method that (i) aims to learn the underlying temporal distribution, (ii) by design can fall back to a nearest image predictor, i.e. predicting the last context image (LCI), as a special case, and (iii) supports 3D volumes, multiple prior scans, and irregular sampling. Extensive benchmarks on three public longitudinal datasets show that TFM consistently surpasses spatio-temporal methods from natural imaging, establishing a new state-of-the-art and robust baseline for 4D medical image prediction.
[75] Integrating Pathology and CT Imaging for Personalized Recurrence Risk Prediction in Renal Cancer
Daniël Boeke,Cedrik Blommestijn,Rebecca N. Wray,Kalina Chupetlovska,Shangqi Gao,Zeyu Gao,Regina G. H. Beets-Tan,Mireia Crispin-Ortuzar,James O. Jones,Wilson Silva,Ines P. Machado
Main category: cs.CV
TL;DR: This study integrates preoperative CT and postoperative histopathology data using a deep learning framework to improve personalized recurrence risk prediction in clear cell renal cell carcinoma, showing that multimodal approaches can approach the performance of established clinical scores.
Details
Motivation: The motivation is to improve patient-level recurrence risk estimation in ccRCC by integrating imaging and pathology data, overcoming the limitations of the Leibovich score which excludes imaging information and offers limited resolution. Method: A modular deep learning framework with pretrained encoders and Cox-based survival modeling was employed, testing unimodal, late fusion, and intermediate fusion setups for integrating preoperative CT scans and postoperative histopathology WSIs. Result: WSI-based models outperformed CT-only models; intermediate fusion improved performance, with the best model (TITAN-CONCH with ResNet-18) approaching the adjusted Leibovich score. Fusion techniques enhanced the predictive value of radiology data. Conclusion: The study demonstrates the feasibility of using a multimodal deep learning framework for personalized recurrence risk prediction in ccRCC, with pathology data being particularly prognostic and imaging data adding value through fusion. Abstract: Recurrence risk estimation in clear cell renal cell carcinoma (ccRCC) is essential for guiding postoperative surveillance and treatment. The Leibovich score remains widely used for stratifying distant recurrence risk but offers limited patient-level resolution and excludes imaging information. This study evaluates multimodal recurrence prediction by integrating preoperative computed tomography (CT) and postoperative histopathology whole-slide images (WSIs). A modular deep learning framework with pretrained encoders and Cox-based survival modeling was tested across unimodal, late fusion, and intermediate fusion setups. In a real-world ccRCC cohort, WSI-based models consistently outperformed CT-only models, underscoring the prognostic strength of pathology. Intermediate fusion further improved performance, with the best model (TITAN-CONCH with ResNet-18) approaching the adjusted Leibovich score. 
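As a rough illustration of the Cox-based survival modeling mentioned above, the sketch below computes the negative Cox partial log-likelihood for a set of predicted risk scores. It is a minimal sketch under stated assumptions, not the authors' implementation: the function name, the plain-list inputs, and the absence of any special handling for tied event times are all choices made here for illustration.

```python
import math

def cox_neg_partial_log_likelihood(risk_scores, times, events):
    """Negative Cox partial log-likelihood, averaged over observed events.

    risk_scores: predicted log-hazard per patient (hypothetical model output)
    times:       follow-up time per patient
    events:      1 if recurrence was observed, 0 if the patient was censored
    """
    total = 0.0
    n_events = 0
    for i in range(len(times)):
        if not events[i]:
            continue  # censored patients contribute only through risk sets
        # Risk set: every patient still under observation at time t_i.
        risk_set = [risk_scores[j] for j in range(len(times)) if times[j] >= times[i]]
        # Log-sum-exp over the risk set, shifted by the max for stability.
        m = max(risk_set)
        log_denominator = m + math.log(sum(math.exp(s - m) for s in risk_set))
        total += risk_scores[i] - log_denominator
        n_events += 1
    return -total / max(n_events, 1)
```

The loss is lower when patients who recur earlier receive higher risk scores, which is the ordering property a survival head is trained to produce regardless of whether its input is a unimodal or a fused embedding.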
Random tie-breaking narrowed the gap between the clinical baseline and learned models, suggesting discretization may overstate individualized performance. Using simple embedding concatenation, radiology added value primarily through fusion. These findings demonstrate the feasibility of foundation model-based multimodal integration for personalized ccRCC risk prediction. Future work should explore more expressive fusion strategies, larger multimodal datasets, and general-purpose CT encoders to better match pathology modeling capacity.
[76] Unfolding Framework with Complex-Valued Deformable Attention for High-Quality Computer-Generated Hologram Generation
Haomiao Zhang,Zhangyuan Li,Yanling Piao,Zhi Li,Xiaodong Wang,Miao Cao,Xiongfei Su,Qiang Song,Xin Yuan
Main category: cs.CV
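TL;DR: This paper proposes a new computer-generated holography algorithm that uses a deep unfolding network to improve reconstruction flexibility and performance, with experiments showing its superiority on multiple datasets.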
Details
Motivation: Conventional end-to-end networks treat the reconstruction model as a black box, CNN-based holography algorithms have limited receptive fields, and angular spectrum method models are constrained to finite near-fields, so an algorithm with greater flexibility and performance is needed. Method: A Deep Unfolding Network (DUN) is proposed that decomposes gradient descent into two modules: an adaptive bandwidth-preserving model (ABPM) and a phase-domain complex-valued denoiser (PCD). Result: Experiments show that the proposed algorithm achieves state-of-the-art results on both simulated and real data, with a PSNR above 35 dB. Conclusion: The paper presents a deep unfolding network-based computer-generated holography algorithm that splits the gradient descent process into two modules, allowing a more flexible model design and achieving state-of-the-art results in experiments on simulated and real data. Abstract: Computer-generated holography (CGH) has gained wide attention with deep learning-based algorithms. However, due to its nonlinear and ill-posed nature, challenges remain in achieving accurate and stable reconstruction. Specifically, (i) the widely used end-to-end networks treat the reconstruction model as a black box, ignoring underlying physical relationships, which reduces interpretability and flexibility. (ii) CNN-based CGH algorithms have limited receptive fields, hindering their ability to capture long-range dependencies and global context. (iii) Angular spectrum method (ASM)-based models are constrained to finite near-fields. In this paper, we propose a Deep Unfolding Network (DUN) that decomposes gradient descent into two modules: an adaptive bandwidth-preserving model (ABPM) and a phase-domain complex-valued denoiser (PCD), providing more flexibility. ABPM allows for wider working distances compared to ASM-based methods. At the same time, PCD leverages its complex-valued deformable self-attention module to capture global features and enhance performance, achieving a PSNR over 35 dB. Experiments on simulated and real data show state-of-the-art results.
[77] Towards Interactive Lesion Segmentation in Whole-Body PET/CT with Promptable Models
Maximilian Rokuss,Yannick Kirchhoff,Fabian Isensee,Klaus H. Maier-Hein
Main category: cs.CV
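TL;DR: This paper presents an interactive segmentation method built on the autoPET III nnU-Net framework that uses user prompts to improve lesion segmentation accuracy in whole-body PET/CT, and shows that Euclidean Distance Transform (EDT) encodings outperform Gaussian kernels.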
Details
Motivation: Accurate lesion segmentation in whole-body PET/CT remains challenging. Although fully automated methods have advanced substantially, clinical practice still benefits from keeping humans in the loop to refine predicted masks efficiently. Method: Building on the winning autoPET III nnU-Net framework, the approach adds promptable capabilities by encoding user-provided foreground and background clicks as additional input channels, and compares Euclidean Distance Transform (EDT) encodings against Gaussian kernels. Result: The ensemble of EDT-based models achieves the strongest cross-validation performance, reducing both false positives and false negatives compared to baseline models. Conclusion: Promptable models, evaluated on interactive segmentation tasks with simulated user prompts, show potential to enable efficient, user-guided segmentation workflows in multi-tracer, multi-center PET/CT. Abstract: Whole-body PET/CT is a cornerstone of oncological imaging, yet accurate lesion segmentation remains challenging due to tracer heterogeneity, physiological uptake, and multi-center variability. While fully automated methods have advanced substantially, clinical practice benefits from approaches that keep humans in the loop to efficiently refine predicted masks. The autoPET/CT IV challenge addresses this need by introducing interactive segmentation tasks based on simulated user prompts. In this work, we present our submission to Task 1. Building on the winning autoPET III nnU-Net pipeline, we extend the framework with promptable capabilities by encoding user-provided foreground and background clicks as additional input channels. We systematically investigate representations for spatial prompts and demonstrate that Euclidean Distance Transform (EDT) encodings consistently outperform Gaussian kernels. Furthermore, we propose online simulation of user interactions and a custom point sampling strategy to improve robustness under realistic prompting conditions. Our ensemble of EDT-based models, trained with and without external data, achieves the strongest cross-validation performance, reducing both false positives and false negatives compared to baseline models. These results highlight the potential of promptable models to enable efficient, user-guided segmentation workflows in multi-tracer, multi-center PET/CT. Code is publicly available at https://github.com/MIC-DKFZ/autoPET-interactive
[78] Mapping like a Skeptic: Probabilistic BEV Projection for Online HD Mapping
Fatih Erdoğan,Merve Rabia Barın,Fatma Güney
Main category: cs.CV
TL;DR: This paper proposes a probabilistic projection mechanism with confidence scores to improve the accuracy and generalization of high-definition (HD) map generation from camera images, outperforming existing methods on nuScenes and Argoverse2 datasets.
Details
Motivation: Existing HD mapping approaches struggle with accuracy due to generalization problems and hallucination of non-existent road elements, necessitating a more reliable and accurate method for mapping road elements from image space to BEV space. Method: A novel probabilistic projection mechanism with confidence scores is proposed to refine mapping alignment with the scene and filter irrelevant elements. Temporal processing is improved by selectively accumulating reliable information over time. Result: Experiments on new splits of nuScenes and Argoverse2 datasets show improved performance over state-of-the-art approaches, particularly in terms of generalization and long perception range. Conclusion: The proposed probabilistic projection mechanism with confidence scores improves HD map generation by refining mapping alignment and filtering irrelevant elements, leading to better generalization and performance on nuScenes and Argoverse2 datasets. Abstract: Constructing high-definition (HD) maps from sensory input requires accurately mapping the road elements in image space to the Bird's Eye View (BEV) space. The precision of this mapping directly impacts the quality of the final vectorized HD map. Existing HD mapping approaches outsource the projection to standard mapping techniques, such as attention-based ones. However, these methods struggle with accuracy due to generalization problems, often hallucinating non-existent road elements. Our key idea is to start with a geometric mapping based on camera parameters and adapt it to the scene to extract relevant map information from camera images. To implement this, we propose a novel probabilistic projection mechanism with confidence scores to (i) refine the mapping to better align with the scene and (ii) filter out irrelevant elements that should not influence HD map generation. In addition, we improve temporal processing by using confidence scores to selectively accumulate reliable information over time. 
Experiments on new splits of the nuScenes and Argoverse2 datasets demonstrate improved performance over state-of-the-art approaches, indicating better generalization. The improvements are particularly pronounced on nuScenes and in the challenging long perception range. Our code and model checkpoints are available at https://github.com/Fatih-Erdogan/mapping-like-skeptic
[79] FLORA: Efficient Synthetic Data Generation for Object Detection in Low-Data Regimes via finetuning Flux LoRA
Alvaro Patricio,Atabak Dehban,Rodrigo Ventura
Main category: cs.CV
TL;DR: FLORA is a lightweight synthetic data generation method that achieves superior object detection performance with significantly reduced computational costs and data requirements.
Details
Motivation: Recent diffusion models for synthetic data generation require extensive resources, limiting their accessibility. Method: FLORA uses the Flux 1.1 Dev diffusion model fine-tuned with Low-Rank Adaptation (LoRA) to reduce computational requirements. Result: FLORA outperforms the ODGEN baseline by achieving up to 21.3% improvement in mAP@.50:.95 while using only 10% of the data and a consumer-grade GPU. Conclusion: FLORA provides a more efficient and practical solution for synthetic data generation compared to existing methods. Abstract: Recent advances in diffusion-based generative models have demonstrated significant potential in augmenting scarce datasets for object detection tasks. Nevertheless, most recent models rely on resource-intensive full fine-tuning of large-scale diffusion models, requiring enterprise-grade GPUs (e.g., NVIDIA V100) and thousands of synthetic images. To address these limitations, we propose Flux LoRA Augmentation (FLORA), a lightweight synthetic data generation pipeline. Our approach uses the Flux 1.1 Dev diffusion model, fine-tuned exclusively through Low-Rank Adaptation (LoRA). This dramatically reduces computational requirements, enabling synthetic dataset generation with a consumer-grade GPU (e.g., NVIDIA RTX 4090). We empirically evaluate our approach on seven diverse object detection datasets. Our results demonstrate that training object detectors with just 500 synthetic images generated by our approach yields superior detection performance compared to models trained on 5000 synthetic images from the ODGEN baseline, achieving improvements of up to 21.3% in mAP@.50:.95. This work demonstrates that it is possible to surpass state-of-the-art performance with far greater efficiency, as FLORA achieves superior results using only 10% of the data and a fraction of the computational cost. 
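To make the efficiency argument for Low-Rank Adaptation concrete, the sketch below shows the general LoRA idea on a single linear layer: the frozen weight W is left untouched and a low-rank product B·A, scaled by alpha / rank, is added to its output. This is a generic pure-Python illustration, not FLORA's training code; the helper names are hypothetical and the layer dimensions used in the usage note are illustrative, not taken from the paper.

```python
def matvec(matrix, vector):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, vector)) for row in matrix]

def lora_forward(W, A, B, x, alpha):
    """Output of a linear layer with a LoRA adapter: W x + (alpha / r) * B (A x).

    W: frozen weight, shape d_out x d_in
    A: trainable down-projection, shape r x d_in
    B: trainable up-projection, shape d_out x r
    """
    rank = len(A)
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    scale = alpha / rank
    return [b + scale * u for b, u in zip(base, update)]

def lora_trainable_params(d_out, d_in, rank):
    """Number of trainable parameters in the adapter (A and B only)."""
    return rank * (d_in + d_out)

def full_finetune_params(d_out, d_in):
    """Number of parameters updated when the whole weight matrix is trained."""
    return d_out * d_in
```

For a hypothetical 3072 x 3072 layer with rank 16, the adapter trains 98,304 parameters instead of 9,437,184, roughly 1% of the layer, which is the kind of saving that makes fine-tuning on a single consumer-grade GPU plausible.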
This work demonstrates that a quality and efficiency-focused approach is more effective than brute-force generation, making advanced synthetic data creation more practical and accessible for real-world scenarios.
[80] Entropy-Based Non-Invasive Reliability Monitoring of Convolutional Neural Networks
Amirhossein Nazeri,Wael Hafez
Main category: cs.CV
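TL;DR: Proposes an adversarial-input detection method that requires no model modification, achieving real-time detection by monitoring CNN activation entropy.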
Details
Motivation: Existing adversarial-input detection methods require expensive retraining, modify the network architecture, or degrade performance on clean inputs. Method: Parallel entropy monitoring on VGG-16 detects the changes in activation entropy caused by adversarial perturbations. Result: Adversarial inputs shift activation entropy by 7% in early convolutional layers, enabling 90% detection accuracy with false positive and false negative rates below 20%. Conclusion: CNN reliability can be assessed through activation entropy alone, allowing adversarial inputs to be detected in real time without modifying the model. Abstract: Convolutional Neural Networks (CNNs) have become the foundation of modern computer vision, achieving unprecedented accuracy across diverse image recognition tasks. While these networks excel on in-distribution data, they remain vulnerable to adversarial perturbations, imperceptible input modifications that cause misclassification with high confidence. However, existing detection methods either require expensive retraining, modify network architecture, or degrade performance on clean inputs. Here we show that adversarial perturbations create immediate, detectable entropy signatures in CNN activations that can be monitored without any model modification. Using parallel entropy monitoring on VGG-16, we demonstrate that adversarial inputs consistently shift activation entropy by 7% in early convolutional layers, enabling 90% detection accuracy with false positive and false negative rates below 20%. The complete separation between clean and adversarial entropy distributions reveals that CNNs inherently encode distribution shifts in their activation patterns. This work establishes that CNN reliability can be assessed through activation entropy alone, enabling practical deployment of self-diagnostic vision systems that detect adversarial inputs in real-time without compromising original model performance.
[81] CAD2DMD-SET: Synthetic Generation Tool of Digital Measurement Device CAD Model Datasets for fine-tuning Large Vision-Language Models
João Valente,Atabak Dehban,Rodrigo Ventura
Main category: cs.CV
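TL;DR: This paper introduces CAD2DMD-SET, a new synthetic data generation tool, and DMDBench, a validation set, which together substantially improve the ability of large vision-language models to read digital measurement devices under complex real-world conditions.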
Details
Motivation: Current LVLMs struggle to read digital measurement devices (DMDs) accurately under complex real-world conditions such as clutter, occlusions, extreme viewpoints, and motion blur, especially in head-mounted camera and Augmented Reality (AR) applications. Method: The tool uses 3D CAD models, advanced rendering, and high-fidelity image composition to generate VQA-labelled synthetic DMD datasets, and the authors build DMDBench, a validation set of 1,000 real-world images. Result: After benchmarking with Average Normalised Levenshtein Similarity (ANLS) and further fine-tuning LoRAs, InternVL's score increased by 200% without degrading performance on other tasks. Conclusion: By generating diverse synthetic datasets, CAD2DMD-SET substantially improves the ability of LVLMs to read digital measurement devices under complex real-world conditions, and it is planned for release as an open-source tool to support the community. Abstract: Recent advancements in Large Vision-Language Models (LVLMs) have demonstrated impressive capabilities across various multimodal tasks. They continue, however, to struggle with trivial scenarios such as reading values from Digital Measurement Devices (DMDs), particularly in real-world conditions involving clutter, occlusions, extreme viewpoints, and motion blur, which are common in head-mounted cameras and Augmented Reality (AR) applications. Motivated by these limitations, this work introduces CAD2DMD-SET, a synthetic data generation tool designed to support visual question answering (VQA) tasks involving DMDs. By leveraging 3D CAD models, advanced rendering, and high-fidelity image composition, our tool produces diverse, VQA-labelled synthetic DMD datasets suitable for fine-tuning LVLMs. Additionally, we present DMDBench, a curated validation set of 1,000 annotated real-world images designed to evaluate model performance under practical constraints. Benchmarking three state-of-the-art LVLMs using Average Normalised Levenshtein Similarity (ANLS) and further fine-tuning LoRAs of these models with CAD2DMD-SET's generated dataset yielded substantial improvements, with InternVL showcasing a score increase of 200% without degrading on other tasks. This demonstrates that the CAD2DMD-SET training dataset substantially improves the robustness and performance of LVLMs when operating under the previously stated challenging conditions.
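For readers unfamiliar with the ANLS metric named above, the following sketch shows how it is commonly computed: each prediction is compared to its reference answers by normalised Levenshtein distance, similarities whose distance reaches a threshold are zeroed, and the best per-question scores are averaged. The 0.5 threshold and the lowercase/strip normalisation are common conventions assumed here; the abstract does not state which settings the authors used, and the function names are hypothetical.

```python
def levenshtein(a, b):
    """Edit distance between two strings (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, 1):
        current = [i]
        for j, char_b in enumerate(b, 1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]

def anls(predictions, references, threshold=0.5):
    """Average Normalised Levenshtein Similarity.

    predictions: one predicted string per question
    references:  one list of acceptable answers per question
    threshold:   distances at or above this value score zero (assumed 0.5)
    """
    scores = []
    for prediction, answers in zip(predictions, references):
        best = 0.0
        for answer in answers:
            p, a = prediction.strip().lower(), answer.strip().lower()
            longest = max(len(p), len(a))
            distance = levenshtein(p, a) / longest if longest else 0.0
            similarity = 1.0 - distance if distance < threshold else 0.0
            best = max(best, similarity)
        scores.append(best)
    return sum(scores) / len(scores) if scores else 0.0
```

A reading of "12.5" against a reference of "12.6" differs by one of four characters and scores 0.75, so the metric gives partial credit for near-miss digit readings while an unrelated answer scores zero.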
The CAD2DMD-SET tool is expected to be released as open-source once the final version of this manuscript is prepared, allowing the community to add different measurement devices and generate their own datasets.
[82] Learning from Silence and Noise for Visual Sound Source Localization
Xavier Juanola,Giovana Morais,Magdalena Fuentes,Gloria Haro
Main category: cs.CV
TL;DR: This paper introduces SSL-SaN, a self-supervised model that improves visual sound source localization, especially in challenging audio scenarios like silence and noise.
Details
Motivation: Current methods perform poorly in scenarios with low audio-visual semantic correspondence (e.g., silence, noise, offscreen sounds) and are primarily evaluated on positive cases with a single visible sound source. Method: The authors introduce a new training strategy that incorporates silence and noise, a new metric for quantifying the trade-off between alignment and separability of auditory and visual features, and an extended dataset named IS3+ for better evaluation. Result: SSL-SaN outperforms existing self-supervised models in sound localization and cross-modal retrieval tasks, especially in the presence of negative audio. The proposed metric and IS3+ dataset provide improved evaluation capabilities. Conclusion: The paper proposes SSL-SaN, a self-supervised model that effectively localizes sound sources in videos, particularly robust in cases with negative audio. The model achieves state-of-the-art performance in sound localization and cross-modal retrieval. Abstract: Visual sound source localization is a fundamental perception task that aims to detect the location of sounding sources in a video given its audio. Despite recent progress, we identify two shortcomings in current methods: 1) most approaches perform poorly in cases with low audio-visual semantic correspondence such as silence, noise, and offscreen sounds, i.e. in the presence of negative audio; and 2) most prior evaluations are limited to positive cases, where both datasets and metrics convey scenarios with a single visible sound source in the scene. To address this, we introduce three key contributions. First, we propose a new training strategy that incorporates silence and noise, which improves performance in positive cases, while being more robust against negative sounds. Our resulting self-supervised model, SSL-SaN, achieves state-of-the-art performance compared to other self-supervised models, both in sound localization and cross-modal retrieval. 
Second, we propose a new metric that quantifies the trade-off between alignment and separability of auditory and visual features across positive and negative audio-visual pairs. Third, we present IS3+, an extended and improved version of the IS3 synthetic dataset with negative audio. Our data, metrics and code are available at https://xavijuanola.github.io/SSL-SaN/
[83] UItron: Foundational GUI Agent with Advanced Perception and Planning
Zhixiong Zeng,Jing Huang,Liming Zheng,Wenkang Han,Yufeng Zhong,Lei Chen,Longrong Yang,Yingjie Chu,Yuzhi He,Lin Ma
Main category: cs.CV
TL;DR: UItron is an open-source foundational model for GUI agents that advances GUI perception, grounding, and planning. It achieves strong performance, especially in Chinese mobile app scenarios, through data engineering, interactive infrastructure, and a reinforcement learning framework.
Details
Motivation: Building effective GUI agents is challenging due to limited operation trajectories, lack of interactive infrastructure, and limitations in initial foundation models. UItron addresses these issues to advance the development of GUI agents. Method: UItron employs supervised fine-tuning on perception and planning tasks across various GUI scenarios, followed by a curriculum reinforcement learning framework to enhance complex reasoning and exploration in online environments. It also includes systemic data engineering and an interactive infrastructure connecting mobile and PC devices. Result: UItron achieves superior performance in benchmarks related to GUI perception, grounding, and planning. It demonstrates significant progress in handling Chinese app scenarios, where even state-of-the-art solutions previously lacked proficiency. Conclusion: UItron represents a significant advancement in GUI agent development, particularly in handling Chinese mobile apps, and moves the field closer to practical real-world applications. Abstract: GUI agents aim to enable automated operations on mobile and PC devices, an important task toward achieving artificial general intelligence. The rapid advancement of VLMs accelerates the development of GUI agents, owing to their powerful capabilities in visual understanding and task planning. However, building a GUI agent remains a challenging task due to the scarcity of operation trajectories, the limited availability of interactive infrastructure, and the limited initial capabilities of foundation models. In this work, we introduce UItron, an open-source foundational model for automatic GUI agents, featuring advanced GUI perception, grounding, and planning capabilities. UItron highlights the necessity of systemic data engineering and interactive infrastructure as foundational components for advancing GUI agent development.
It not only systematically studies a series of data engineering strategies to enhance training effects, but also establishes an interactive environment connecting both mobile and PC devices. In training, UItron adopts supervised finetuning over perception and planning tasks in various GUI scenarios, and then develops a curriculum reinforcement learning framework to enable complex reasoning and exploration for online environments. As a result, UItron achieves superior performance in benchmarks of GUI perception, grounding, and planning. In particular, UItron emphasizes interaction proficiency with top-tier Chinese mobile apps, as we identified a general lack of Chinese capabilities even in state-of-the-art solutions. To this end, we manually collect over one million steps of operation trajectories across the top 100 most popular apps, and build offline and online agent evaluation environments. Experimental results demonstrate that UItron achieves significant progress in Chinese app scenarios, propelling GUI agents one step closer to real-world application.
[84] Domain Generalization in-the-Wild: Disentangling Classification from Domain-Aware Representations
Ha Min Son,Zhe Zhao,Shahbaz Rezaei,Xin Liu
Main category: cs.CV
TL;DR: This paper proposes CLIP-DCA, a new method that improves a foundation model's generalization to unseen-domain data by enhancing its domain awareness.
Details
Motivation: Standard domain invariance losses try to make representations domain-invariant, which can force foundation models to discard domain-aware representations that are beneficial for generalization. CLIP-DCA hypothesizes that enhancing domain awareness is a prerequisite for effective domain-invariant classification in foundation models. Method: Using a separate domain head and synthetically generated diverse domain data, CLIP-DCA identifies and enhances domain awareness within the foundation model's encoders. At the same time, it encourages domain-invariant classification through disentanglement from the domain features. Result: CLIP-DCA shows significant improvements over existing methods in this more challenging unseen-data setting, particularly on datasets that are more out-of-distribution. Conclusion: CLIP-DCA offers a new way to improve foundation models on domain generalization, with significant gains over existing methods on unseen-domain data that is closer to real-world scenarios. Abstract: Evaluating domain generalization (DG) for foundational models like CLIP is challenging, as web-scale pretraining data potentially covers many existing benchmarks. Consequently, current DG evaluation may neither be sufficiently challenging nor adequately test genuinely unseen data scenarios. To better assess the performance of CLIP on DG in-the-wild, a scenario where CLIP encounters challenging unseen data, we consider two approaches: (1) evaluating on 33 diverse datasets with quantified out-of-distribution (OOD) scores after fine-tuning CLIP on ImageNet, and (2) using unlearning to make CLIP 'forget' some domains as an approximation. We observe that CLIP's performance deteriorates significantly on more OOD datasets. To address this, we present CLIP-DCA (Disentangling Classification from enhanced domain Aware representations). Our approach is motivated by the observation that while standard domain invariance losses aim to make representations domain-invariant, this can be harmful to foundation models by forcing the discarding of domain-aware representations beneficial for generalization. We instead hypothesize that enhancing domain awareness is a prerequisite for effective domain-invariant classification in foundation models. CLIP-DCA identifies and enhances domain awareness within CLIP's encoders using a separate domain head and synthetically generated diverse domain data. Simultaneously, it encourages domain-invariant classification through disentanglement from the domain features.
CLIP-DCA shows significant improvements within this challenging evaluation compared to existing methods, particularly on datasets that are more OOD.
[85] What Can We Learn from Harry Potter? An Exploratory Study of Visual Representation Learning from Atypical Videos
Qiyue Sun,Qiming Huang,Yang Yang,Hongjun Wang,Jianbo Jiao
Main category: cs.CV
TL;DR: This paper introduces a new dataset of atypical videos and shows that incorporating them into training improves open-world learning tasks like OOD detection, novel category discovery, and zero-shot recognition, highlighting the importance of semantic diversity.
Details
Motivation: Humans excel at understanding uncommon concepts in the open world, but most existing studies focus on typical data from closed sets. This paper explores how atypical videos can improve open-world learning and addresses the lack of exploration in this area. Method: The authors collected a new video dataset containing various atypical data (e.g., sci-fi, animation) and integrated them into the model training process for representation learning. They evaluated the impact of these atypical videos on three key open-world learning tasks: OOD detection, NCD, and ZSAR. Result: Using atypical data consistently improved performance across multiple open-world learning tasks. Increasing categorical diversity in atypical samples boosted OOD detection, and semantically diverse smaller datasets outperformed larger typical datasets in NCD. In ZSAR, atypical videos improved generalization to unseen actions. Conclusion: The paper concludes that incorporating atypical videos into the training process enhances open-world learning performance in tasks like OOD detection, NCD, and ZSAR. It emphasizes the importance of semantic diversity in atypical samples and proposes a new dataset to encourage further research in this area. Abstract: Humans usually show exceptional generalisation and discovery ability in the open world, when being shown uncommon new concepts. Whereas most existing studies in the literature focus on common typical data from closed sets, open-world novel discovery is under-explored in videos. In this paper, we are interested in asking: What if atypical unusual videos are exposed in the learning process? To this end, we collect a new video dataset consisting of various types of unusual atypical data (e.g., sci-fi, animation, etc.). To study how such atypical data may benefit open-world learning, we feed them into the model training process for representation learning.
Focusing on three key tasks in open-world learning: out-of-distribution (OOD) detection, novel category discovery (NCD), and zero-shot action recognition (ZSAR), we found that even straightforward learning approaches with atypical data consistently improve performance across various settings. Furthermore, we found that increasing the categorical diversity of the atypical samples further boosts OOD detection performance. Additionally, in the NCD task, using a smaller yet more semantically diverse set of atypical samples leads to better performance compared to using a larger but more typical dataset. In the ZSAR setting, the semantic diversity of atypical videos helps the model generalise better to unseen action classes. These observations in our extensive experimental evaluations reveal the benefits of atypical videos for visual representation learning in the open world, together with the newly proposed dataset, encouraging further studies in this direction.
[86] Unsupervised Video Continual Learning via Non-Parametric Deep Embedded Clustering
Nattapong Kurpukdee,Adrian G. Bors
Main category: cs.CV
TL;DR: This paper proposes a non-parametric learning solution for unsupervised video continual learning, using Kernel Density Estimation and unsupervised video transformer networks, enhancing model performance when learning multiple tasks without labels or task boundaries.
Details
Motivation: Prior work has focused on supervised continual learning, which relies on labels and task boundaries that are costly and impractical to obtain; this paper addresses unsupervised video continual learning, where neither is provided. Method: A non-parametric learning solution using Kernel Density Estimation (KDE) of deep embedded video features extracted by unsupervised video transformer networks is proposed. A novelty detection criterion and transfer learning are also utilized. Result: The study found that the proposed methodology substantially enhances model performance in successive task learning, evaluated on three standard video action recognition datasets. Conclusion: The proposed methodology enhances the performance of the model when successively learning many tasks in unsupervised video continual learning. Abstract: We propose a realistic scenario for unsupervised video learning in which neither task boundaries nor labels are provided when learning a succession of tasks. We also provide a non-parametric learning solution for the under-explored problem of unsupervised video continual learning. Videos are a complex and rich form of spatio-temporal media, widely used in many applications, but they have not been sufficiently explored in unsupervised continual learning. Prior studies have only focused on supervised continual learning, relying on the knowledge of labels and task boundaries, even though labeled data is costly and often impractical to obtain. To address this gap, we study unsupervised video continual learning (uVCL). uVCL raises more challenges due to the additional computational and memory requirements of processing videos when compared to images. We introduce a general benchmark experimental protocol for uVCL by considering the learning of unstructured video data categories during each task. We propose to use the Kernel Density Estimation (KDE) of deep embedded video features extracted by unsupervised video transformer networks as a non-parametric probabilistic representation of the data.
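The KDE representation described above can be illustrated with a small sketch: given embeddings stored for a memory cluster, a Gaussian kernel density estimate assigns a log-density to any new embedding. This is a generic isotropic-Gaussian KDE written for illustration only; the function name, the single shared bandwidth, and the use of plain Python lists in place of real video transformer features are assumptions, not details from the paper.

```python
import math

def gaussian_kde_log_density(x, samples, bandwidth):
    """Log-density of point x under a Gaussian KDE fitted to `samples`.

    x:         query embedding (list of floats)
    samples:   stored embeddings, each the same length as x
    bandwidth: kernel standard deviation, shared across dimensions (assumed)
    """
    dim = len(x)
    # Log of the Gaussian normalising constant in `dim` dimensions.
    log_norm = -0.5 * dim * math.log(2.0 * math.pi * bandwidth ** 2)
    log_kernels = []
    for sample in samples:
        squared_distance = sum((xi - si) ** 2 for xi, si in zip(x, sample))
        log_kernels.append(log_norm - squared_distance / (2.0 * bandwidth ** 2))
    # Log-mean-exp over kernels, shifted by the max for numerical stability.
    m = max(log_kernels)
    log_sum = m + math.log(sum(math.exp(v - m) for v in log_kernels))
    return log_sum - math.log(len(samples))
```

An embedding far from every stored sample receives a much lower log-density than one inside the cluster, which is one plausible signal for deciding that incoming data is new; the exact criterion the authors use is not given in the abstract.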
We introduce a novelty detection criterion for the incoming new task data, dynamically enabling the expansion of memory clusters, aiming to capture new knowledge when learning a succession of tasks. We use transfer learning from previous tasks as the initial state for knowledge transfer to the current task. We found that the proposed methodology substantially enhances the performance of the model when successively learning many tasks. We perform in-depth evaluations on three standard video action recognition datasets, including UCF101, HMDB51, and Something-Something V2, without using any labels or class boundaries.
[87] A Multi-Stage Fine-Tuning and Ensembling Strategy for Pancreatic Tumor Segmentation in Diagnostic and Therapeutic MRI
Omer Faruk Durugol,Maximilian Rokuss,Yannick Kirchhoff,Klaus H. Maier-Hein
Main category: cs.CV
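TL;DR: This paper targets automated MRI segmentation of pancreatic ductal adenocarcinoma (PDAC) with an adapted nnU-Net framework, using a multi-stage cascaded pre-training strategy and custom heterogeneous ensembling to achieve strong segmentation results with limited data.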
Details
Motivation: Poor tumor-tissue contrast in MRI and scarce annotated data are the main obstacles to automated PDAC segmentation; this paper aims to address them in order to improve clinical workflows. Method: Built on the nnU-Net framework, the approach uses a deep multi-stage cascaded pre-training strategy that starts from a general anatomical foundation model and fine-tunes sequentially on CT pancreatic lesion datasets and the target MRI modalities, with data augmentation schemes and training schedules evaluated through five-fold cross-validation. Result: Aggressive data augmentation gave the highest volumetric accuracy, while default augmentations gave better boundary precision (MASD of 5.46 mm and HD95 of 17.33 mm for Task 1); the final heterogeneous ensembles reached a Tumor Dice score of 0.661 for Task 1 and 0.523 for Task 2. Conclusion: The study presents a robust methodology that uses multi-stage pre-training and ensembling to develop high-performance models for complex medical imaging tasks with limited data. Abstract: Automated segmentation of Pancreatic Ductal Adenocarcinoma (PDAC) from MRI is critical for clinical workflows but is hindered by poor tumor-tissue contrast and a scarcity of annotated data. This paper details our submission to the PANTHER challenge, addressing both diagnostic T1-weighted (Task 1) and therapeutic T2-weighted (Task 2) segmentation. Our approach is built upon the nnU-Net framework and leverages a deep, multi-stage cascaded pre-training strategy, starting from a general anatomical foundation model and sequentially fine-tuning on CT pancreatic lesion datasets and the target MRI modalities. Through extensive five-fold cross-validation, we systematically evaluated data augmentation schemes and training schedules. Our analysis revealed a critical trade-off, where aggressive data augmentation produced the highest volumetric accuracy, while default augmentations yielded superior boundary precision (achieving a state-of-the-art MASD of 5.46 mm and HD95 of 17.33 mm for Task 1). For our final submission, we exploited this finding by constructing custom, heterogeneous ensembles of specialist models, essentially creating a mix of experts. This metric-aware ensembling strategy proved highly effective, achieving a top cross-validation Tumor Dice score of 0.661 for Task 1 and 0.523 for Task 2. Our work presents a robust methodology for developing specialized, high-performance models in the context of limited data and complex medical imaging tasks (Team MIC-DKFZ).
[88] Benchmarking GPT-5 in Radiation Oncology: Measurable Gains, but Persistent Need for Expert Oversight
Ugur Dinc,Jibak Sarkar,Philipp Schubert,Sabine Semrau,Thomas Weissmann,Andre Karius,Johann Brand,Bernd-Niklas Axer,Ahmed Gomaa,Pluvio Stephan,Ishita Sheth,Sogand Beirami,Annette Schwarz,Udo Gaipl,Benjamin Frey,Christoph Bert,Stefanie Corradini,Rainer Fietkau,Florian Putz
Main category: cs.CV
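TL;DR: GPT-5 reached 92.8% mean accuracy on the ACR Radiation Oncology In-Training Examination (TXIT), ahead of GPT-4 (78.8%) and GPT-3.5 (62.1%), and its treatment recommendations for 60 clinical vignettes were rated highly for correctness and comprehensiveness, but substantive errors mean expert oversight remains necessary.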
Details
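Motivation: Large language models have shown great potential in clinical decision support, and GPT-5 is a novel LLM system that has been specifically marketed towards oncology use. Method: Performance was assessed on the ACR Radiation Oncology In-Training Examination (TXIT, 2021; 300 multiple-choice items) and on a curated set of 60 authentic radiation oncologic vignettes, for which GPT-5 generated concise therapeutic plans that four board-certified radiation oncologists rated for correctness, comprehensiveness, and hallucinations; inter-rater reliability was quantified with Fleiss' κ. Result: On TXIT, GPT-5 achieved a mean accuracy of 92.8%, versus 78.8% for GPT-4 and 62.1% for GPT-3.5. Vignette recommendations were rated 3.24/4 for correctness (95% CI: 3.11-3.38) and 3.59/4 for comprehensiveness (95% CI: 3.49-3.69); hallucinations were rare, and inter-rater agreement was low (Fleiss' κ 0.083 for correctness). Conclusion: GPT-5 clearly outperformed prior model variants on the multiple-choice benchmark, but the presence of substantive errors means its recommendations require rigorous expert oversight before clinical implementation. Abstract: Introduction: Large language models (LLMs) have shown great potential in clinical decision support. GPT-5 is a novel LLM system that has been specifically marketed towards oncology use. Methods: Performance was assessed using two complementary benchmarks: (i) the ACR Radiation Oncology In-Training Examination (TXIT, 2021), comprising 300 multiple-choice items, and (ii) a curated set of 60 authentic radiation oncologic vignettes representing diverse disease sites and treatment indications. For the vignette evaluation, GPT-5 was instructed to generate concise therapeutic plans. Four board-certified radiation oncologists rated correctness, comprehensiveness, and hallucinations. Inter-rater reliability was quantified using Fleiss' κ. Results: On the TXIT benchmark, GPT-5 achieved a mean accuracy of 92.8%, outperforming GPT-4 (78.8%) and GPT-3.5 (62.1%). Domain-specific gains were most pronounced in Dose and Diagnosis. In the vignette evaluation, GPT-5's treatment recommendations were rated highly for correctness (mean 3.24/4, 95% CI: 3.11-3.38) and comprehensiveness (3.59/4, 95% CI: 3.49-3.69). Hallucinations were rare, with no case reaching majority consensus for their presence. Inter-rater agreement was low (Fleiss' κ 0.083 for correctness), reflecting inherent variability in clinical judgment. Errors clustered in complex scenarios requiring precise trial knowledge or detailed clinical adaptation. Discussion: GPT-5 clearly outperformed prior model variants on the radiation oncology multiple-choice benchmark. Although GPT-5 exhibited favorable performance in generating real-world radiation oncology treatment recommendations, correctness ratings indicate room for further improvement. While hallucinations were infrequent, the presence of substantive errors underscores that GPT-5-generated recommendations require rigorous expert oversight before clinical implementation.
[89] TMUAD: Enhancing Logical Capabilities in Unified Anomaly Detection Models with a Text Memory Bank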
Jiawei Liu,Jiahe Hou,Wei Wang,Jinsong Du,Yang Cong,Huijie Fan
Main category: cs.CV
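TL;DR: This paper proposes TMUAD, a framework that performs unified structural and logical anomaly detection with three memory banks, improving detection performance; the code is open source.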
Details
Motivation: Existing unified methods rely on carefully designed image feature extractors and memory banks, and normal data is limited, so a new approach is needed to improve logical anomaly detection. Method: The proposed three-memory framework, TMUAD, combines a class-level text memory bank, an object-level image memory bank, and a patch-level image memory bank for multi-level anomaly detection. Result: TMUAD achieves state-of-the-art performance on multiple datasets and detects both structural and logical anomalies effectively. Conclusion: By unifying structural and logical anomaly detection, TMUAD achieves state-of-the-art performance on seven public datasets from industrial and medical domains; the model and code are publicly available. Abstract: Anomaly detection, which aims to identify anomalies deviating from normal patterns, is challenging due to the limited amount of normal data available. Unlike most existing unified methods that rely on carefully designed image feature extractors and memory banks to capture logical relationships between objects, we introduce a text memory bank to enhance the detection of logical anomalies. Specifically, we propose a Three-Memory framework for Unified structural and logical Anomaly Detection (TMUAD). First, we build a class-level text memory bank for logical anomaly detection by the proposed logic-aware text extractor, which can capture rich logical descriptions of objects from input images. Second, we construct an object-level image memory bank that preserves complete object contours by extracting features from segmented objects. Third, we employ visual encoders to extract patch-level image features for constructing a patch-level memory bank for structural anomaly detection. These three complementary memory banks are used to retrieve and compare normal images that are most similar to the query image, compute anomaly scores at multiple levels, and fuse them into a final anomaly score. By unifying structural and logical anomaly detection through collaborative memory banks, TMUAD achieves state-of-the-art performance across seven publicly available datasets involving industrial and medical domains. The model and code are available at https://github.com/SIA-IDE/TMUAD.
[90] VoCap: Video Object Captioning and Segmentation from Any Prompt
Jasper Uijlings,Xingyi Zhou,Xiuye Gu,Arsha Nagrani,Anurag Arnab,Alireza Fathi,David Ross,Cordelia Schmid
Main category: cs.CV
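TL;DR: VoCap takes multimodal prompts and produces spatio-temporal masks and captions for video objects, achieving state-of-the-art results on referring expression video object segmentation and establishing a new benchmark for video object captioning.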
Details
Motivation: Video understanding requires describing objects with fine-grained localization masks and detailed semantic properties, but obtaining such data is tedious and expensive. Method: VoCap takes a multimodal prompt (text, box, or mask) and produces a spatio-temporal masklet with an object-centric caption. To build the SAV-Caption dataset, videos are preprocessed with the ground-truth masks of an existing dataset to highlight the object of interest and fed to a large Vision Language Model (VLM) to generate pseudo captions. Result: VoCap simultaneously addresses promptable video object segmentation, referring expression segmentation, and object captioning; the SAV-Caption dataset includes manual annotations on the validation set for unbiased evaluation. Conclusion: VoCap achieves state-of-the-art results on referring expression video object segmentation, is competitive on semi-supervised video object segmentation, and establishes a benchmark for video object captioning. Abstract: Understanding objects in videos in terms of fine-grained localization masks and detailed semantic properties is a fundamental task in video understanding. In this paper, we propose VoCap, a flexible video model that consumes a video and a prompt of various modalities (text, box or mask), and produces a spatio-temporal masklet with a corresponding object-centric caption. As such, our model simultaneously addresses the tasks of promptable video object segmentation, referring expression segmentation, and object captioning. Since obtaining data for this task is tedious and expensive, we propose to annotate an existing large-scale segmentation dataset (SAV) with pseudo object captions. We do so by preprocessing videos with their ground-truth masks to highlight the object of interest and feed this to a large Vision Language Model (VLM). For an unbiased evaluation, we collect manual annotations on the validation set. We call the resulting dataset SAV-Caption. We train our VoCap model at scale on SAV-Caption together with a mix of other image and video datasets. Our model yields state-of-the-art results on referring expression video object segmentation, is competitive on semi-supervised video object segmentation, and establishes a benchmark for video object captioning. Our dataset will be made available at https://github.com/google-deepmind/vocap.
[91] The Demon is in Ambiguity: Revisiting Situation Recognition with Single Positive Multi-Label Learning
Yiming Lin,Yuchen Niu,Shang Wang,Kaizhu Huang,Qiufeng Wang,Xiao-Bo Jin
Main category: cs.CV
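TL;DR: This paper reformulates verb classification in situation recognition as a single positive multi-label learning problem and proposes a corresponding model and evaluation benchmark; experiments show a MAP improvement of more than 3%.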
Details
Motivation: Existing situation recognition methods treat verb classification as a single-label problem, which fails to address the inherent ambiguity in visual event recognition, since multiple verb categories may reasonably describe the same image. Method: The authors show through empirical analysis that verb classification is inherently a multi-label problem and propose the new perspective of single positive multi-label learning (SPMLL). They design a comprehensive multi-label evaluation benchmark and develop the GE-VerbMLP model, which combines graph neural networks and adversarial training, to solve the SPMLL problem. Result: GE-VerbMLP achieves more than 3% MAP improvement on real-world datasets while remaining competitive on traditional top-1 and top-5 accuracy metrics. Conclusion: The paper proposes a new approach to situation recognition that reformulates verb classification as a single positive multi-label learning problem, together with a matching benchmark and model (GE-VerbMLP); experiments on real-world datasets show a clear gain in MAP. Abstract: Situation recognition (SR) is a fundamental task in computer vision that aims to extract structured semantic summaries from images by identifying key events and their associated entities. Specifically, given an input image, the model must first classify the main visual events (verb classification), then identify the participating entities and their semantic roles (semantic role labeling), and finally localize these entities in the image (semantic role localization). Existing methods treat verb classification as a single-label problem, but we show through a comprehensive analysis that this formulation fails to address the inherent ambiguity in visual event recognition, as multiple verb categories may reasonably describe the same image. This paper makes three key contributions: First, we reveal through empirical analysis that verb classification is inherently a multi-label problem due to the ubiquitous semantic overlap between verb categories. Second, given the impracticality of fully annotating large-scale datasets with multiple labels, we propose to reformulate verb classification as a single positive multi-label learning (SPMLL) problem - a novel perspective in SR research. Third, we build a comprehensive multi-label evaluation benchmark for SR that is carefully designed to fairly evaluate model performance in a multi-label setting. To address the challenges of SPMLL, we further develop the Graph Enhanced Verb Multilayer Perceptron (GE-VerbMLP), which combines graph neural networks to capture label correlations and adversarial training to optimize decision boundaries. Extensive experiments on real-world datasets show that our approach achieves more than 3% MAP improvement while remaining competitive on traditional top-1 and top-5 accuracy metrics.
[92] DriveQA: Passing the Driving Knowledge Test
Maolin Wei,Wanzhou Liu,Eshed Ohn-Bar
Main category: cs.CV
TL;DR: DriveQA is introduced as a benchmark to evaluate LLMs' understanding of driving knowledge, showing that while current models have limitations in complex driving scenarios, fine-tuning and pretraining significantly improve their performance and generalization abilities.