Table of Contents
cs.CL
[1] CoBA: Counterbias Text Augmentation for Mitigating Various Spurious Correlations via Semantic Triples
Kyohoon Jin,Juhwan Choi,Jungmin Yun,Junho Lee,Soojin Jang,Youngbin Kim
Main category: cs.CL
TL;DR: CoBA tackles multiple biases and enhances out-of-distribution robustness by generating counterbias data that mitigates spurious patterns.
Details
Motivation: Deep learning models often learn and exploit spurious correlations in training data, leading to performance degradation and poor generalization on unseen data. Method: CoBA: CounterBias Augmentation, a unified framework that operates at the semantic triple level. Result: CoBA not only improves downstream task performance, but also effectively reduces biases and strengthens out-of-distribution resilience. Conclusion: CoBA offers a versatile and robust solution to the challenges posed by spurious correlations. Abstract: Deep learning models often learn and exploit spurious correlations in training data, using these non-target features to inform their predictions. Such reliance leads to performance degradation and poor generalization on unseen data. To address these limitations, we introduce a more general form of counterfactual data augmentation, termed counterbias data augmentation, which simultaneously tackles multiple biases (e.g., gender bias, simplicity bias) and enhances out-of-distribution robustness. We present CoBA: CounterBias Augmentation, a unified framework that operates at the semantic triple level: first decomposing text into subject-predicate-object triples, then selectively modifying these triples to disrupt spurious correlations. By reconstructing the text from these adjusted triples, CoBA generates counterbias data that mitigates spurious patterns. Through extensive experiments, we demonstrate that CoBA not only improves downstream task performance, but also effectively reduces biases and strengthens out-of-distribution resilience, offering a versatile and robust solution to the challenges posed by spurious correlations.
[2] Mapping Toxic Comments Across Demographics: A Dataset from German Public Broadcasting
Jan Fillies,Michael Peter Hoffmann,Rebecca Reichel,Roman Salzwedel,Sven Bodemer,Adrian Paschke
Main category: cs.CL
TL;DR: This paper presents the first large-scale German dataset of online comments annotated for toxicity and enriched with age data to understand age-based differences in toxic speech patterns.
Details
Motivation: The lack of demographic context in existing toxic speech datasets limits understanding of how different age groups communicate online. Method: Collaboration with funk to collect 3,024 human-annotated and 30,024 LLM-annotated comments from Instagram, TikTok, and YouTube. Comments were selected using toxic keywords and annotated for toxicity and categorized into insults, disinformation, and criticism of broadcasting fees. Result: 16.7% of comments were labeled as problematic. The study found that younger users favor expressive language while older users engage more in disinformation and devaluation. Conclusion: The dataset provides new opportunities to study linguistic variation across demographics and supports the development of more equitable and age-aware content moderation systems. Abstract: A lack of demographic context in existing toxic speech datasets limits our understanding of how different age groups communicate online. In collaboration with funk, a German public service content network, this research introduces the first large-scale German dataset annotated for toxicity and enriched with platform-provided age estimates. The dataset includes 3,024 human-annotated and 30,024 LLM-annotated anonymized comments from Instagram, TikTok, and YouTube. To ensure relevance, comments were consolidated using predefined toxic keywords, resulting in 16.7% labeled as problematic. The annotation pipeline combined human expertise with state-of-the-art language models, identifying key categories such as insults, disinformation, and criticism of broadcasting fees. The dataset reveals age-based differences in toxic speech patterns, with younger users favoring expressive language and older users more often engaging in disinformation and devaluation. This resource provides new opportunities for studying linguistic variation across demographics and supports the development of more equitable and age-aware content moderation systems.
[3] Granite Embedding R2 Models
Parul Awasthy,Aashka Trivedi,Yulong Li,Meet Doshi,Riyaz Bhat,Vignesh P,Vishwajeet Kumar,Yushu Yang,Bhavani Iyer,Abraham Daniels,Rudra Murthy,Ken Barker,Martin Franz,Madison Lee,Todd Ward,Salim Roukos,David Cox,Luis Lastras,Jaydeep Sen,Radu Florian
Main category: cs.CL
TL;DR: The Granite Embedding R2 models are high-performance English encoder-based embedding models designed for enterprise-scale dense retrieval applications, offering improved context length, performance, and speed advantages over competitors.
Details
Motivation: To provide high-performance English encoder-based embedding models for enterprise-scale dense retrieval applications with improved context length, performance across diverse retrieval domains, and speed advantages over competitors. Method: The models are trained exclusively on enterprise-appropriate data with comprehensive governance oversight, encompassing both bi-encoder and cross-encoder architectures. Result: The Granite Embedding R2 models deliver substantial improvements including 16x expanded context length, state-of-the-art performance across diverse retrieval domains, and speed advantages of 19-44% over leading competitors while maintaining superior accuracy. Conclusion: The Granite Embedding R2 models are publicly available under the Apache 2.0 license, offering unrestricted research and commercial use while setting new performance standards for open-source embedding models. Abstract: We introduce the Granite Embedding R2 models, a comprehensive family of high-performance English encoder-based embedding models engineered for enterprise-scale dense retrieval applications. Building upon our first-generation release, these models deliver substantial improvements, including 16x expanded context length (8,192 tokens), state-of-the-art performance across diverse retrieval domains - text, code, long-document search, multi-turn conversational, and tabular data - and measurable speed advantages of 19-44% over leading competitors while maintaining superior accuracy. Our release encompasses both bi-encoder and cross-encoder architectures, featuring a highly effective 22-layer retriever model and its efficient 12-layer counterpart, alongside a high-quality reranker model, all trained exclusively on enterprise-appropriate data with comprehensive governance oversight.
The models demonstrate exceptional versatility across standard benchmarks, IBM-developed evaluation suites, and real-world enterprise use cases, establishing new performance standards for open-source embedding models. In an era where retrieval speed and accuracy are paramount for competitive advantage, the Granite R2 models deliver a compelling combination of cutting-edge performance, enterprise-ready licensing, and transparent data provenance that organizations require for mission-critical deployments. All models are publicly available under the Apache 2.0 license at https://huggingface.co/collections/ibm-granite, enabling unrestricted research and commercial use.
[4] TrInk: Ink Generation with Transformer Network
Zezhong Jin,Shubhang Desai,Xu Chen,Biyi Fang,Zhuoyi Huang,Zhe Li,Chong-Xin Gan,Xiao Tu,Man-Wai Mak,Yan Lu,Shujie Liu
Main category: cs.CL
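TL;DR: TrInk is a Transformer-based handwriting generation model that significantly improves generation quality and style consistency through an improved attention mechanism and positional embedding strategy.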
Details
Motivation: Existing methods fail to effectively capture global dependencies when generating handwritten text, leaving the generated results lacking in coherence and style consistency. Method: TrInk adopts a Transformer-based architecture and introduces scaled positional embeddings and a Gaussian memory mask to improve the alignment between the text and the stroke points. Result: On the IAM-OnDB dataset, TrInk reduces character error rate (CER) by 35.56% and word error rate (WER) by 29.66% compared to previous methods. Conclusion: TrInk significantly outperforms existing methods on handwriting generation, with subjective and objective evaluations confirming its effectiveness in coherence and style consistency. Abstract: In this paper, we propose TrInk, a Transformer-based model for ink generation, which effectively captures global dependencies. To better facilitate the alignment between the input text and generated stroke points, we introduce scaled positional embeddings and a Gaussian memory mask in the cross-attention module. Additionally, we design both subjective and objective evaluation pipelines to comprehensively assess the legibility and style consistency of the generated handwriting. Experiments demonstrate that our Transformer-based model achieves a 35.56% reduction in character error rate (CER) and a 29.66% reduction in word error rate (WER) on the IAM-OnDB dataset compared to previous methods. We provide a demo page with handwriting samples from TrInk and baseline models at: https://akahello-a11y.github.io/trink-demo/
[5] How Does Cognitive Bias Affect Large Language Models? A Case Study on the Anchoring Effect in Price Negotiation Simulations
Yoshiki Takenami,Yin Jou Huang,Yugo Murawaki,Chenhui Chu
Main category: cs.CL
TL;DR: This paper explores how LLMs exhibit human-like cognitive biases, specifically the anchoring effect in price negotiations, and finds that reasoning models are less susceptible, while personality traits show no significant correlation.
Details
Motivation: Cognitive biases like the anchoring effect can affect the reliability of LLMs in real-world applications. Understanding these biases is essential for ensuring the safe and responsible use of LLMs in society. Method: The study involved instructing seller LLM agents to apply the anchoring effect during price negotiations. Evaluations were conducted using both objective and subjective metrics to assess the impact of the anchoring effect on LLM-driven negotiations. Result: Experimental results showed that LLMs are indeed influenced by the anchoring effect. Reasoning models were found to be less susceptible, likely due to their longer chain of thought, while no significant correlation was found between personality traits and the anchoring effect. Conclusion: This paper concludes that LLMs are susceptible to cognitive biases such as the anchoring effect, similar to humans. However, models with better reasoning capabilities are less prone to these biases, and there is no significant link between personality traits and susceptibility to the anchoring effect. Abstract: Cognitive biases, well-studied in humans, can also be observed in LLMs, affecting their reliability in real-world applications. This paper investigates the anchoring effect in LLM-driven price negotiations. To this end, we instructed seller LLM agents to apply the anchoring effect and evaluated negotiations using not only an objective metric but also a subjective metric. Experimental results show that LLMs are influenced by the anchoring effect like humans. Additionally, we investigated the relationship between the anchoring effect and factors such as reasoning and personality. It was shown that reasoning models are less prone to the anchoring effect, suggesting that the long chain of thought mitigates the effect. However, we found no significant correlation between personality traits and susceptibility to the anchoring effect. 
These findings contribute to a deeper understanding of cognitive biases in LLMs and to the realization of safe and responsible application of LLMs in society.
[6] Can Multimodal LLMs Solve the Basic Perception Problems of Percept-V?
Samrajnee Ghosh,Naman Agarwal,Hemanshu Garg,Chinmay Mittal,Mausam,Parag Singla
Main category: cs.CL
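TL;DR: This paper introduces the Percept-V dataset for evaluating Multimodal Large Language Models (MLLMs) and Large Reasoning Models (LRMs) on simple visual perception tasks. Results show that model performance drops significantly as problem complexity increases, and that some cognitive skills are more challenging than others.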
Details
Motivation: Although Multimodal Large Language Models (MLLMs) show strong reasoning abilities in areas such as coding, mathematics, and science, research on their performance in simple perception tasks is very limited. To address this, the paper introduces the Percept-V dataset for evaluating the perception abilities of MLLMs on generated images of basic shapes and structures. Method: The paper introduces a new dataset, Percept-V, containing 7200 program-generated images divided equally into 30 categories, each testing different visual perception skills. Experiments use state-of-the-art MLLMs (such as GPT-4o, Gemini, and Claude) as well as Large Reasoning Models (LRMs) to evaluate their performance on the dataset. Result: Experiments show that the performance of MLLMs and LRMs on Percept-V drops significantly as problem complexity increases. Different models also show similar accuracy trends across the cognitive skills tested, and some skills are harder than others. Conclusion: The results indicate that MLLM performance drops significantly as problem complexity increases. In addition, different models exhibit similar accuracy trends across the cognitive skills tested, and some skills are more challenging for the models. Abstract: The reasoning abilities of Multimodal Large Language Models (MLLMs) have garnered a lot of attention in recent times, with advances made in frontiers like coding, mathematics, and science. However, very limited experiments have been done to assess their performance in simple perception tasks performed over uncontaminated, generated images containing basic shapes and structures. To address this issue, the paper introduces a dataset, Percept-V, containing a total of 7200 program-generated images equally divided into 30 categories, each testing a combination of visual perception skills. Unlike previously proposed datasets, Percept-V comprises very basic tasks of varying complexity that test the perception abilities of MLLMs. This dataset is then tested on state-of-the-art MLLMs like GPT-4o, Gemini, and Claude as well as Large Reasoning Models (LRMs) like OpenAI o4-mini and DeepSeek R1 to gauge their performance. Contrary to the evidence that MLLMs excel in many complex tasks, our experiments show a significant drop in the models' performance with increasing problem complexity across all categories. An analysis of the performances also reveals that the tested MLLMs exhibit a similar trend in accuracy across categories testing a particular cognitive skill, and finds some skills to be more difficult than others.
[7] A Survey of Scientific Large Language Models: From Data Foundations to Agent Frontiers
Ming Hu,Chenglong Ma,Wei Li,Wanghan Xu,Jiamin Wu,Jucheng Hu,Tianbin Li,Guohang Zhuang,Jiaqi Liu,Yingzhou Lu,Ying Chen,Chaoyang Zhang,Cheng Tan,Jie Ying,Guocheng Wu,Shujian Gao,Pengcheng Chen,Jiashi Lin,Haitao Wu,Lulu Chen,Fengxiang Wang,Yuanyuan Zhang,Xiangyu Zhao,Feilong Tang,Encheng Su,Junzhi Ning,Xinyao Liu,Ye Du,Changkai Ji,Cheng Tang,Huihui Xu,Ziyang Chen,Ziyan Huang,Jiyao Liu,Pengfei Jiang,Yizhou Wang,Chen Tang,Jianyu Wu,Yuchen Ren,Siyuan Yan,Zhonghua Wang,Zhongxing Xu,Shiyan Su,Shangquan Sun,Runkai Zhao,Zhisheng Zhang,Yu Liu,Fudi Wang,Yuanfeng Ji,Yanzhou Su,Hongming Shan,Chunmei Feng,Jiahao Xu,Jiangtao Yan,Wenhao Tang,Diping Song,Lihao Liu,Yanyan Huang,Lequan Yu,Bin Fu,Shujun Wang,Xiaomeng Li,Xiaowei Hu,Yun Gu,Ben Fei,Zhongying Deng,Benyou Wang,Yuewen Cao,Minjie Shen,Haodong Duan,Jie Xu,Yirong Chen,Fang Yan,Hongxia Hao,Jielan Li,Jiajun Du,Yanbo Wang,Imran Razzak,Chi Zhang,Lijun Wu,Conghui He,Zhaohui Lu,Jinhai Huang,Yihao Liu,Fenghua Ling,Yuqiang Li,Aoran Wang,Qihao Zheng,Nanqing Dong,Tianfan Fu,Dongzhan Zhou,Yan Lu,Wenlong Zhang,Jin Ye,Jianfei Cai,Wanli Ouyang,Yu Qiao,Zongyuan Ge,Shixiang Tang,Junjun He,Chunfeng Song,Lei Bai,Bowen Zhou
Main category: cs.CL
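TL;DR: This survey reviews the development of Scientific Large Language Models (Sci-LLMs), outlines their future evolution toward closed-loop systems, and emphasizes the role of data-centric approaches in addressing the challenges of scientific data.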
Details
Motivation: Sci-LLMs are changing how scientific knowledge is represented, integrated, and applied, but their development is constrained by the complexity of scientific data. This survey therefore examines the development path of Sci-LLMs and the challenges they face. Method: The survey takes a data-centric approach to synthesizing the development of Sci-LLMs, including constructing a unified taxonomy of scientific data and a hierarchical model of scientific knowledge, systematically reviewing a large number of Sci-LLMs and related datasets, and analyzing trends in evaluation benchmarks. Result: The authors analyze more than 270 pre-/post-training datasets and 190 benchmark datasets, showing the distinct demands Sci-LLMs face with heterogeneous, multi-scale, uncertainty-laden data, and identifying a shift from static evaluation toward process- and discovery-oriented evaluation. Conclusion: The survey summarizes the current state of Sci-LLMs and proposes closed-loop systems as their future direction: systems that actively experiment, validate, and contribute to an evolving knowledge base, making AI a true partner in accelerating scientific discovery. Abstract: Scientific Large Language Models (Sci-LLMs) are transforming how knowledge is represented, integrated, and applied in scientific research, yet their progress is shaped by the complex nature of scientific data. This survey presents a comprehensive, data-centric synthesis that reframes the development of Sci-LLMs as a co-evolution between models and their underlying data substrate. We formulate a unified taxonomy of scientific data and a hierarchical model of scientific knowledge, emphasizing the multimodal, cross-scale, and domain-specific challenges that differentiate scientific corpora from general natural language processing datasets. We systematically review recent Sci-LLMs, from general-purpose foundations to specialized models across diverse scientific disciplines, alongside an extensive analysis of over 270 pre-/post-training datasets, showing why Sci-LLMs pose distinct demands -- heterogeneous, multi-scale, uncertainty-laden corpora that require representations preserving domain invariance and enabling cross-modal reasoning. On evaluation, we examine over 190 benchmark datasets and trace a shift from static exams toward process- and discovery-oriented assessments with advanced evaluation protocols. These data-centric analyses highlight persistent issues in scientific data development and discuss emerging solutions involving semi-automated annotation pipelines and expert validation.
Finally, we outline a paradigm shift toward closed-loop systems where autonomous agents based on Sci-LLMs actively experiment, validate, and contribute to a living, evolving knowledge base. Collectively, this work provides a roadmap for building trustworthy, continually evolving artificial intelligence (AI) systems that function as a true partner in accelerating scientific discovery.
[8] Quantifying Label-Induced Bias in Large Language Model Self- and Cross-Evaluations
Muskan Saraf,Sajjad Rezvani Boroujeni,Justin Beaudry,Hossein Abedi,Tom Bush
Main category: cs.CL
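TL;DR: The study finds that LLM evaluation results are significantly affected by model identity labels: the 'Claude' label raises scores, the 'Gemini' label lowers them, and false labels can reverse rankings. To ensure fairness, it recommends blind or multi-model evaluation.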
Details
Motivation: LLMs are increasingly used to evaluate output quality, but their judgments may be influenced by model identity. The study aims to expose these biases and examine their effect on evaluation fairness. Method: ChatGPT, Gemini, and Claude perform self- and cross-model evaluations under four conditions (no labels, true labels, and two false-label scenarios). Model-generated blog posts are evaluated through overall preference voting and quality ratings (Coherence, Informativeness, and Conciseness), with all scores expressed as percentages for direct comparison. Result: The results reveal striking asymmetries: the 'Claude' label consistently boosts scores, while the 'Gemini' label consistently depresses them, regardless of actual content. False labels frequently reverse rankings, with shifts of up to 50 percentage points in preference votes and up to 12 percentage points in quality ratings. Under true labels, Gemini's self-scores drop sharply, while Claude's self-preference intensifies. Conclusion: LLM evaluation results can be significantly influenced by model identity labels, and this bias can severely distort evaluation outcomes. Blind or multi-model evaluation protocols should be adopted to ensure fairness. Abstract: Large language models (LLMs) are increasingly used to evaluate outputs, yet their judgments may be influenced by perceived model identity. This study examines bias in self- and cross-model evaluations by ChatGPT, Gemini, and Claude under four conditions: no labels, true labels, and two false-label scenarios. Blog posts authored by each model were evaluated by all three using both overall preference voting and quality ratings for Coherence, Informativeness, and Conciseness, with all scores expressed as percentages for direct comparison. Results reveal striking asymmetries: the "Claude" label consistently boosts scores, while the "Gemini" label consistently depresses them, regardless of actual content. False labels frequently reversed rankings, producing shifts of up to 50 percentage points in preference votes and up to 12 percentage points in converted quality ratings. Gemini's self-scores collapsed under true labels, while Claude's self-preference intensified. These findings show that perceived model identity can heavily distort high-level judgments and subtly influence detailed quality ratings, underscoring the need for blind or multimodel evaluation protocols to ensure fairness in LLM benchmarking.
[9] BED-LLM: Intelligent Information Gathering with LLMs and Bayesian Experimental Design
Deepro Choudhury,Sinead Williamson,Adam Goliński,Ning Miao,Freddie Bickford Smith,Michael Kirchhof,Yizhe Zhang,Tom Rainforth
Main category: cs.CL
TL;DR: BED-LLM is a framework that enhances the ability of Large Language Models to adaptively gather information by selecting queries that maximize expected information gain, resulting in significant performance improvements.
Details
Motivation: The motivation is to improve the adaptive information-gathering ability of LLMs so they can act as effective multi-turn conversational agents interacting with external environments. Method: The method involves using BED-LLM, which iteratively selects queries maximizing expected information gain (EIG) about the task, using a probabilistic model derived from the LLM's belief distribution. Result: BED-LLM demonstrated substantial performance improvements in tasks like the 20-questions game and active inference of user preferences. Conclusion: BED-LLM proves to be successful in enhancing the ability of LLMs to gather information adaptively, showing significant improvements over direct prompting and other adaptive design strategies. Abstract: We propose a general-purpose approach for improving the ability of Large Language Models (LLMs) to intelligently and adaptively gather information from a user or other external source using the framework of sequential Bayesian experimental design (BED). This enables LLMs to act as effective multi-turn conversational agents and interactively interface with external environments. Our approach, which we call BED-LLM (Bayesian Experimental Design with Large Language Models), is based on iteratively choosing questions or queries that maximize the expected information gain (EIG) about the task of interest given the responses gathered previously. We show how this EIG can be formulated in a principled way using a probabilistic model derived from the LLM's belief distribution and provide detailed insights into key decisions in its construction. Further key to the success of BED-LLM are a number of specific innovations, such as a carefully designed estimator for the EIG, not solely relying on in-context updates for conditioning on previous responses, and a targeted strategy for proposing candidate queries. 
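As a rough illustration of the EIG-based query selection described above, the following minimal sketch scores yes/no questions in a 20-questions-style game. It is not the BED-LLM estimator: the belief is a hand-written dictionary rather than a distribution derived from an LLM, and answers are assumed to be deterministic given the hidden item, in which case the EIG reduces to the entropy of the predicted answer distribution. All names (`expected_information_gain`, `update_belief`, the toy items and questions) are hypothetical.

```python
import math


def entropy(probs):
    """Shannon entropy, in bits, of a discrete distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


def expected_information_gain(belief, answers):
    """EIG of one question about a hidden item.

    belief:  dict mapping each candidate item to its current probability
    answers: dict mapping each candidate item to the answer it would give
    Assumption: answers are deterministic given the item, so the EIG equals
    the entropy of the predicted answer distribution.
    """
    answer_probs = {}
    for item, p in belief.items():
        answer_probs[answers[item]] = answer_probs.get(answers[item], 0.0) + p
    return entropy(answer_probs.values())


def update_belief(belief, answers, observed):
    """Bayes update: keep items consistent with the observed answer, renormalize."""
    kept = {item: p for item, p in belief.items() if answers[item] == observed}
    total = sum(kept.values())
    return {item: p / total for item, p in kept.items()}


# Toy belief over four candidate items (uniform prior).
belief = {"cat": 0.25, "dog": 0.25, "eagle": 0.25, "shark": 0.25}

# Candidate questions and the answer each item would give.
questions = {
    "Is it a mammal?": {"cat": "yes", "dog": "yes", "eagle": "no", "shark": "no"},
    "Can it fly?": {"cat": "no", "dog": "no", "eagle": "yes", "shark": "no"},
}

# Pick the question with the highest EIG, then condition on a reply.
best = max(questions, key=lambda q: expected_information_gain(belief, questions[q]))
posterior = update_belief(belief, questions[best], "yes")
```

Here the even split of "Is it a mammal?" yields 1 bit of expected information, while "Can it fly?" yields about 0.81 bits, so the first question is selected. With noisy answers the EIG would also need to subtract the expected entropy of the answer given the item.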
We find that BED-LLM achieves substantial gains in performance across a wide range of tests based on the 20-questions game and using the LLM to actively infer user preferences, compared to direct prompting of the LLM and other adaptive design strategies.
[10] Improving Aviation Safety Analysis: Automated HFACS Classification Using Reinforcement Learning with Group Relative Policy Optimization
Arash Ahmadi,Sarah Sharif,Yaser Banad
Main category: cs.CL
TL;DR: This paper presents an automated HFACS classification framework using GRPO to fine-tune a Llama-3.1 8B model, achieving significant performance improvements and outperforming existing state-of-the-art LLMs in aviation safety analysis.
Details
Motivation: Traditional methods of analyzing aviation accidents using HFACS are limited by scalability and consistency. This research aims to overcome these limitations through automation and advanced machine learning techniques. Method: The study introduces an automated HFACS classification framework using Reinforcement Learning with Group Relative Policy Optimization (GRPO) to fine-tune a Llama-3.1 8B language model. It employs a multi-component reward system and synthetic data generation to address class imbalance. Result: The GRPO-optimized model showed significant performance improvements, including a 350% increase in exact match accuracy (from 0.0400 to 0.1800) and a partial match accuracy of 0.8800. It outperformed state-of-the-art LLMs like GPT-5-mini and Gemini-2.5-flash on key metrics. Conclusion: The study concludes that domain-optimized, smaller language models can offer computationally efficient and effective solutions for critical safety analysis, with potential for deployment on edge devices. Abstract: Analyzing the human factors behind aviation accidents is crucial for preventing future incidents, yet traditional methods using the Human Factors Analysis and Classification System (HFACS) are limited by scalability and consistency. To address this, we introduce an automated HFACS classification framework for aviation safety analysis that utilizes Reinforcement Learning with Group Relative Policy Optimization (GRPO) to fine-tune a Llama-3.1 8B language model. Our approach incorporates a multi-component reward system tailored for aviation safety analysis and integrates synthetic data generation to overcome class imbalance in accident datasets. The resulting GRPO-optimized model achieved noticeable performance gains, including a 350% increase in exact match accuracy (from 0.0400 to 0.1800) and an improved partial match accuracy of 0.8800. 
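To make the reported metrics concrete, here is a minimal sketch of exact match and partial match accuracy for multi-label classification, plus the arithmetic behind the 350% figure. The abstract does not define partial match, so the definition used here (at least one predicted label overlaps the gold set) is an assumption, and the toy labels and predictions are invented for illustration only.

```python
def exact_match_accuracy(gold, pred):
    """Fraction of reports whose predicted label set equals the gold label set."""
    return sum(set(g) == set(p) for g, p in zip(gold, pred)) / len(gold)


def partial_match_accuracy(gold, pred):
    """Fraction of reports sharing at least one label with the gold set.

    Assumed definition; the source does not spell out how partial match is scored.
    """
    return sum(bool(set(g) & set(p)) for g, p in zip(gold, pred)) / len(gold)


def relative_increase_percent(before, after):
    """Relative change from `before` to `after`, in percent."""
    return (after - before) / before * 100


# Invented toy data using the four top-level HFACS tiers as labels.
gold = [
    {"Unsafe Acts", "Preconditions for Unsafe Acts"},
    {"Unsafe Supervision"},
    {"Organizational Influences"},
    {"Unsafe Acts"},
]
pred = [
    {"Unsafe Acts"},
    {"Unsafe Supervision"},
    {"Unsafe Acts"},
    {"Unsafe Acts", "Unsafe Supervision"},
]

exact = exact_match_accuracy(gold, pred)      # only the second report matches exactly
partial = partial_match_accuracy(gold, pred)  # three of four reports overlap

# Exact match accuracy going from 0.0400 to 0.1800 is a relative gain of
# (0.18 - 0.04) / 0.04, i.e. roughly 350%.
gain = relative_increase_percent(0.0400, 0.1800)
```

Exact match is the stricter of the two metrics because a single missing or extra label makes the whole prediction count as wrong, which is consistent with the large gap between the reported 0.1800 and 0.8800.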
Significantly, our specialized model outperforms state-of-the-art LLMs (Large Language Models), including GPT-5-mini and Gemini-2.5-flash, on key metrics. This research also proposes exact match accuracy in the multi-label HFACS classification problem as a new benchmarking methodology to evaluate the advanced reasoning capabilities of language models. Ultimately, our work validates that smaller, domain-optimized models can provide a computationally efficient and better solution for critical safety analysis. This approach makes powerful, low-latency deployment on resource-constrained edge devices feasible.
[11] Enhancing Robustness of Autoregressive Language Models against Orthographic Attacks via Pixel-based Approach
Han Yang,Jian Lan,Yihong Liu,Hinrich Schütze,Thomas Seidl
Main category: cs.CL
TL;DR: A new pixel-based language model improves robustness against orthographic attacks and supports multiple languages effectively.
Details
Motivation: Autoregressive language models are vulnerable to orthographic attacks due to subword tokenizer limitations. This study aims to enhance model robustness and multilingual compatibility. Method: A pixel-based generative language model was developed, replacing traditional text-based embeddings with pixel-based representations by rendering words as images. Result: The proposed model demonstrated resilience to orthographic noise and effectiveness in multilingual settings, evaluated on the multilingual LAMBADA dataset, WMT24 dataset, and SST-2 benchmark. Conclusion: The proposed pixel-based generative language model offers improved robustness against orthographic attacks and enhanced multilingual capabilities compared to traditional autoregressive models. Abstract: Autoregressive language models are vulnerable to orthographic attacks, where input text is perturbed with characters from multilingual alphabets, leading to substantial performance degradation. This vulnerability primarily stems from the out-of-vocabulary issue inherent in subword tokenizers and their embeddings. To address this limitation, we propose a pixel-based generative language model that replaces the text-based embeddings with pixel-based representations by rendering words as individual images. This design provides stronger robustness to noisy inputs while also extending compatibility to multilingual text across diverse writing systems. We evaluate the proposed method on the multilingual LAMBADA dataset, WMT24 dataset and the SST-2 benchmark, demonstrating both its resilience to orthographic noise and its effectiveness in multilingual settings.
[12] Do Self-Supervised Speech Models Exhibit the Critical Period Effects in Language Acquisition?
Yurie Koga,Shunsuke Kando,Yusuke Miyao
Main category: cs.CL
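TL;DR: This paper investigates whether self-supervised speech models (S3Ms) exhibit the critical period effects of language acquisition, and finds that S3Ms show no clear critical period effects in phonological learning.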
Details
Motivation: Although critical period effects have been studied in textual language models, their presence in speech models remains unclear, despite the central role of speech in human language acquisition. Method: S3Ms are trained on child-directed speech with varying second-language (L2) training onsets and first-language (L1) training offsets, and their phone discrimination performance is evaluated. Result: S3Ms show no clear critical period effects in phonological acquisition: delaying L2 exposure onset actually improves L2 performance, while delaying L1 exposure offset leads to L1 forgetting. Conclusion: S3Ms do not exhibit clear critical period effects in phonological learning, which differs from results for textual language models. Abstract: This paper investigates whether the Critical Period (CP) effects in human language acquisition are observed in self-supervised speech models (S3Ms). CP effects refer to greater difficulty in acquiring a second language (L2) with delayed L2 exposure onset, and greater retention of the first language (L1) with delayed L1 exposure offset. While previous work has studied these effects using textual language models, their presence in speech models remains underexplored despite the central role of spoken language in human language acquisition. We train S3Ms with varying L2 training onsets and L1 training offsets on child-directed speech and evaluate their phone discrimination performance. We find that S3Ms do not exhibit clear evidence of either CP effect in terms of phonological acquisition. Notably, models with delayed L2 exposure onset tend to perform better on L2 and delayed L1 exposure offset leads to L1 forgetting.
[13] Decoding Memories: An Efficient Pipeline for Self-Consistency Hallucination Detection
Weizhi Gao,Xiaorui Liu,Feiyi Wang,Dan Lu,Junqi Yin
Main category: cs.CL
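TL;DR: This paper proposes the Decoding Memory Pipeline (DMP), a new method that accelerates generation through selective inference and annealed decoding, addressing both hallucination in sentence-level LLM generation and the high computational cost of existing methods.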
Details
Motivation: LLMs perform impressively in research and real-world applications but still suffer from hallucination. Existing hallucination detection methods perform poorly on sentence-level generation or rely heavily on domain-specific knowledge. Self-consistency approaches help with these problems but incur high computational costs due to repeated generation. Method: The paper proposes the Decoding Memory Pipeline (DMP), which accelerates generation through selective inference and annealed decoding. Result: Experiments show that the method achieves up to a 3x speedup without sacrificing AUROC performance. Conclusion: DMP consistently improves the efficiency of multi-response generation and holds promise for extension to alignment and reasoning tasks. Abstract: Large language models (LLMs) have demonstrated impressive performance in both research and real-world applications, but they still struggle with hallucination. Existing hallucination detection methods often perform poorly on sentence-level generation or rely heavily on domain-specific knowledge. While self-consistency approaches help address these limitations, they incur high computational costs due to repeated generation. In this paper, we conduct the first study on identifying redundancy in self-consistency methods, manifested as shared prefix tokens across generations, and observe that non-exact-answer tokens contribute minimally to the semantic content. Based on these insights, we propose a novel Decoding Memory Pipeline (DMP) that accelerates generation through selective inference and annealed decoding. Being orthogonal to the model, dataset, decoding strategy, and self-consistency baseline, our DMP consistently improves the efficiency of multi-response generation and holds promise for extension to alignment and reasoning tasks. Extensive experiments show that our method achieves up to a 3x speedup without sacrificing AUROC performance.
[14] Efficient Code Embeddings from Code Generation Models
Daria Kryvosheieva,Saba Sturua,Michael Günther,Scott Martens,Han Xiao
Main category: cs.CL
TL;DR: jina-code-embeddings is a compact yet powerful code embedding model that excels at retrieving code from natural language queries and identifying similar code snippets across languages using a pre-trained autoregressive backbone.
Details
Motivation: The motivation is to develop a model that can efficiently retrieve code from natural language queries, perform technical question-answering, and identify semantically similar code snippets across programming languages. Method: The model uses an autoregressive backbone pre-trained on both text and code, generating embeddings via last-token pooling. Result: The model demonstrates state-of-the-art performance, validating the approach to code embedding model construction. Conclusion: jina-code-embeddings is an effective and innovative code embedding model suite that achieves state-of-the-art performance despite its relatively small size. Abstract: jina-code-embeddings is a novel code embedding model suite designed to retrieve code from natural language queries, perform technical question-answering, and identify semantically similar code snippets across programming languages. It makes innovative use of an autoregressive backbone pre-trained on both text and code, generating embeddings via last-token pooling. We outline the training recipe and demonstrate state-of-the-art performance despite the relatively small size of the models, validating this approach to code embedding model construction.
[15] BLUEX Revisited: Enhancing Benchmark Coverage with Automatic Captioning
João Guilherme Alves Santos,Giovana Kerche Bonás,Thales Sales Almeida
Main category: cs.CL
TL;DR: The updated BLUEX dataset improves LLM evaluation by incorporating visual context through automatically generated image captions, significantly increasing accessibility and usability for multilingual and non-English contexts.
Details
Motivation: The motivation stems from the growing capabilities of Large Language Models (LLMs) and the increasing need for robust evaluation methods, especially in multilingual and non-English contexts. Method: The study updated the BLUEX dataset with 2024-2025 exams and used state-of-the-art models to generate image captions. Commercial and open-source LLMs were evaluated based on their ability to utilize visual context provided by these captions. Result: The captioning strategies increased accessibility to text-only models by more than 40%, producing 1,422 usable questions, more than doubling the original BLUEX dataset. The updated dataset was used to evaluate LLMs' ability to leverage visual context through captions. Conclusion: The updated BLUEX dataset significantly enhances the evaluation of LLMs, particularly in leveraging visual context through captions, demonstrating improved accessibility and usability for multilingual and non-English contexts. Abstract: With the growing capabilities of Large Language Models (LLMs), there is an increasing need for robust evaluation methods, especially in multilingual and non-English contexts. We present an updated version of the BLUEX dataset, now including 2024-2025 exams and automatically generated image captions using state-of-the-art models, enhancing its relevance for data contamination studies in LLM pretraining. Captioning strategies increase accessibility to text-only models by more than 40%, producing 1,422 usable questions, more than doubling the number in the original BLUEX. We evaluated commercial and open-source LLMs and their ability to leverage visual context through captions.
[16] Challenges and Applications of Large Language Models: A Comparison of GPT and DeepSeek family of models
Shubham Sharma,Sneha Tuli,Narendra Badam
Main category: cs.CL
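TL;DR: This paper surveys the main challenges of building and using large language models and, by comparing GPT-4o with DeepSeek-V3-0324, examines the strengths and weaknesses of open-source and closed-source models, providing a reference for AI researchers and decision-makers.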
Details
Motivation: Large language models are transforming AI, but developing and deploying them remains complex, so it is important to understand the strengths, weaknesses, and suitable use cases of different models. Method: The paper takes a survey approach, reviewing the challenges of LLMs and analyzing how two state-of-the-art models (GPT-4o and DeepSeek-V3-0324) address them. Result: The comparison highlights the advantages of closed-source models in safety and reliability and of open-source models in efficiency and adaptability, and identifies which model attributes best suit applications such as chatbots, coding tools, healthcare, and education. Conclusion: The paper summarizes 16 key challenges in building and using LLMs and, by comparing GPT-4o and DeepSeek-V3-0324, shows the trade-offs between open-source and closed-source models, aiming to guide researchers, developers, and decision-makers in understanding LLM capabilities, limitations, and best practices. Abstract: Large Language Models (LLMs) are transforming AI across industries, but their development and deployment remain complex. This survey reviews 16 key challenges in building and using LLMs and examines how these challenges are addressed by two state-of-the-art models with unique approaches: OpenAI's closed source GPT-4o (May 2024 update) and DeepSeek-V3-0324 (March 2025), a large open source Mixture-of-Experts model. Through this comparison, we showcase the trade-offs between closed source models (robust safety, fine-tuned reliability) and open source models (efficiency, adaptability). We also explore LLM applications across different domains (from chatbots and coding tools to healthcare and education), highlighting which model attributes are best suited for each use case. This article aims to guide AI researchers, developers, and decision-makers in understanding current LLM capabilities, limitations, and best practices.[17] Normality and the Turing Test
Alexandre Kabbach
Main category: cs.CL
TL;DR: This paper reinterprets the Turing test through the lens of 'normality,' arguing that the test measures average human intelligence and raises questions about whether artificial intelligence should be measured against this standard or if human cognition extends beyond it.
Details
Motivation: The motivation is to reinterpret the Turing test by focusing on the concept of 'normality' or 'average human intelligence', exploring how this perspective influences our understanding of artificial intelligence and human cognition. Method: The paper revisits the Turing test through the concept of normality, interpreting it statistically and normatively to understand its implications on human and artificial intelligence. Result: The paper argues that the Turing test assesses 'normal intelligence' through the judgments of an average judge, derived from a pool of human interrogators. It also suggests that current large language models represent artificial smartness, not artificial intelligence, as they focus on exceptional intelligence. Conclusion: The paper concludes that large language models like ChatGPT are unlikely to pass the Turing test as they target exceptional human intelligence rather than normal/average intelligence. It also raises the question of whether human cognition can be reduced to the normal/average mind, extending beyond the Turing test itself. Abstract: This paper proposes to revisit the Turing test through the concept of normality. Its core argument is that the statistical interpretation of the normal--understood as the average both in the normative and mathematical sense of the term--proves useful for understanding the Turing test in at least two ways. First, in the sense that the Turing test targets normal/average rather than exceptional human intelligence, so that successfully passing the test requires building machines that "make mistakes" and display imperfect behavior just like normal/average humans. Second, in the sense that the Turing test is a statistical test where judgments of intelligence are never carried out by a single "average" judge (understood as non-expert) but always by a full jury. 
As such, the notion of "average human interrogator" that Turing talks about in his original paper should be understood primarily as referring to a mathematical abstraction made of the normalized aggregate of individual judgments of multiple judges. In short, this paper argues that the Turing test is a test of normal intelligence as assessed by a normal judge characterizing the average judgment of a pool of human interrogators. Its conclusions are twofold. First, it argues that large language models such as ChatGPT are unlikely to pass the Turing test as those models precisely target exceptional rather than normal/average human intelligence. As such, they constitute models of what it proposes to call artificial smartness rather than artificial intelligence per se. Second, it argues that the core question of whether the Turing test can contribute anything to the understanding of human cognition is that of whether the human mind is really reducible to the normal/average mind--a question which largely extends beyond the Turing test itself and questions the conceptual underpinnings of the normalist paradigm it belongs to.[18] AllSummedUp: un framework open-source pour comparer les metriques d'evaluation de resume
Tanguy Herserant,Vincent Guigue
Main category: cs.CL
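TL;DR: This paper studies reproducibility challenges in automatic text summarization evaluation, introduces a unified open-source framework for fair comparison of evaluation metrics, and calls for more robust evaluation protocols.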
Details
Motivation: The work is motivated by significant discrepancies between the performance reported in the literature and the performance observed in the authors' experimental setting, particularly in automatic text summarization evaluation. Method: Experiments are run on six representative metrics, from classical ROUGE to recent LLM-based methods (G-Eval, SEval-Ex), and a unified open-source framework is introduced for fair and transparent metric comparison. Result: The metrics that align best with human judgments tend to be computationally intensive and less stable across runs; the study also highlights key concerns about relying on LLMs for evaluation, including randomness, technical dependencies, and limited reproducibility. Conclusion: Automatic summarization evaluation faces reproducibility challenges; the authors recommend more robust evaluation protocols, including exhaustive documentation and methodological standardization, to improve reliability. Abstract: This paper investigates reproducibility challenges in automatic text summarization evaluation. Based on experiments conducted across six representative metrics ranging from classical approaches like ROUGE to recent LLM-based methods (G-Eval, SEval-Ex), we highlight significant discrepancies between reported performances in the literature and those observed in our experimental setting. We introduce a unified, open-source framework, applied to the SummEval dataset and designed to support fair and transparent comparison of evaluation metrics. Our results reveal a structural trade-off: metrics with the highest alignment with human judgments tend to be computationally intensive and less stable across runs. Beyond comparative analysis, this study highlights key concerns about relying on LLMs for evaluation, stressing their randomness, technical dependencies, and limited reproducibility. We advocate for more robust evaluation protocols including exhaustive documentation and methodological standardization to ensure greater reliability in automatic summarization assessment.[19] Automatic Reviewers Fail to Detect Faulty Reasoning in Research Papers: A New Counterfactual Evaluation Framework
Nils Dycke,Iryna Gurevych
Main category: cs.CL
TL;DR: Automated review systems fail to reliably detect faulty research logic, a critical skill in peer review, suggesting the need for improvement in this area.
Details
Motivation: Understanding the capabilities and limitations of automated review generators is essential to ensure scientific integrity as their use increases. Method: Developed a counterfactual evaluation framework to test the ability of ARGs to detect faulty research logic under controlled conditions. Result: Flaws in research logic did not significantly affect the output of tested ARGs, highlighting limitations in their current capabilities. Conclusion: ARGs do not significantly respond to flaws in research logic, indicating a gap in their ability to emulate a core peer review skill. Abstract: Large Language Models (LLMs) have great potential to accelerate and support scholarly peer review and are increasingly used as fully automatic review generators (ARGs). However, potential biases and systematic errors may pose significant risks to scientific integrity; understanding the specific capabilities and limitations of state-of-the-art ARGs is essential. We focus on a core reviewing skill that underpins high-quality peer review: detecting faulty research logic. This involves evaluating the internal consistency between a paper's results, interpretations, and claims. We present a fully automated counterfactual evaluation framework that isolates and tests this skill under controlled conditions. Testing a range of ARG approaches, we find that, contrary to expectation, flaws in research logic have no significant effect on their output reviews. Based on our findings, we derive three actionable recommendations for future work and release our counterfactual dataset and evaluation framework publicly.[20] Med-RewardBench: Benchmarking Reward Models and Judges for Medical Multimodal Large Language Models
Meidan Ding,Jipeng Zhang,Wenxuan Wang,Cheng-Yi Li,Wei-Chieh Fang,Hsin-Yu Wu,Haiqin Zhong,Wenting Chen,Linlin Shen
Main category: cs.CL
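TL;DR: Med-RewardBench is the first benchmark designed specifically to evaluate medical reward models and judges; it provides a multimodal dataset spanning 13 organ systems and 8 clinical departments and reveals substantial challenges in aligning with expert judgment.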
Details
Motivation: Although medical reward models and judges are critical for disease diagnosis and clinical decision-making, no dedicated benchmark currently addresses clinical requirements. Method: Med-RewardBench comprises a multimodal dataset spanning 13 organ systems and 8 clinical departments, uses a rigorous three-step process to ensure the quality of the evaluation data, and evaluates 32 state-of-the-art multimodal large language models. Result: The evaluation of 32 state-of-the-art MLLMs, including open-source, proprietary, and medical-specific models, reveals substantial challenges in aligning with expert judgment; baseline models developed by the authors show substantial performance improvements through fine-tuning. Conclusion: Med-RewardBench is a dedicated benchmark for evaluating medical reward models and judges; it reveals substantial challenges in aligning with expert judgment and shows that fine-tuning yields substantial performance improvements. Abstract: Multimodal large language models (MLLMs) hold significant potential in medical applications, including disease diagnosis and clinical decision-making. However, these tasks require highly accurate, context-sensitive, and professionally aligned responses, making reliable reward models and judges critical. Despite their importance, medical reward models (MRMs) and judges remain underexplored, with no dedicated benchmarks addressing clinical requirements. Existing benchmarks focus on general MLLM capabilities or evaluate models as solvers, neglecting essential evaluation dimensions like diagnostic accuracy and clinical relevance. To address this, we introduce Med-RewardBench, the first benchmark specifically designed to evaluate MRMs and judges in medical scenarios. Med-RewardBench features a multimodal dataset spanning 13 organ systems and 8 clinical departments, with 1,026 expert-annotated cases. A rigorous three-step process ensures high-quality evaluation data across six clinically critical dimensions. We evaluate 32 state-of-the-art MLLMs, including open-source, proprietary, and medical-specific models, revealing substantial challenges in aligning outputs with expert judgment. Additionally, we develop baseline models that demonstrate substantial performance improvements through fine-tuning.[21] Discovering Semantic Subdimensions through Disentangled Conceptual Representations
Yunhao Zhang,Shaonan Wang,Nan Lin,Xinyi Dong,Chong Li,Chengqing Zong
Main category: cs.CL
TL;DR: This paper introduces a novel model to uncover fine-grained, interpretable semantic subdimensions from word embeddings, showing that polarity is a key factor in their organization and demonstrating their neural plausibility through brain activation mapping.
Details
Motivation: Existing approaches rely on predefined, coarse semantic dimensions that overlook finer conceptual distinctions; this study aims to uncover more detailed and interpretable semantic subdimensions to better understand how meaning is organized in language and the brain. Method: The researchers introduced a Disentangled Continuous Semantic Representation Model (DCSRM) to decompose word embeddings into sub-embeddings, identified interpretable semantic subdimensions, and used voxel-wise encoding models to assess neural plausibility by mapping these subdimensions to brain activation. Result: The DCSRM successfully identified interpretable semantic subdimensions, and further analyses revealed that these subdimensions are structured according to distinct principles, with polarity playing a key role. Neural mapping confirmed the cognitive and neuroscientific plausibility of these subdimensions. Conclusion: The study concludes that semantic dimensions can be decomposed into finer, interpretable subdimensions, with polarity being a key factor in this decomposition, offering new insights into how meaning is organized in language and the brain. Abstract: Understanding the core dimensions of conceptual semantics is fundamental to uncovering how meaning is organized in language and the brain. Existing approaches often rely on predefined semantic dimensions that offer only broad representations, overlooking finer conceptual distinctions. This paper proposes a novel framework to investigate the subdimensions underlying coarse-grained semantic dimensions. Specifically, we introduce a Disentangled Continuous Semantic Representation Model (DCSRM) that decomposes word embeddings from large language models into multiple sub-embeddings, each encoding specific semantic information. Using these sub-embeddings, we identify a set of interpretable semantic subdimensions. 
To assess their neural plausibility, we apply voxel-wise encoding models to map these subdimensions to brain activation. Our work offers more fine-grained interpretable semantic subdimensions of conceptual meaning. Further analyses reveal that semantic dimensions are structured according to distinct principles, with polarity emerging as a key factor driving their decomposition into subdimensions. The neural correlates of the identified subdimensions support their cognitive and neuroscientific plausibility.[22] Beyond the Surface: Probing the Ideological Depth of Large Language Models
Shariar Kabir,Kevin Esterling,Yue Dong
Main category: cs.CL
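TL;DR: This study examines the "ideological depth" of large language models, i.e., the robustness and complexity of their internal political representations, and finds that their steerability can reflect their latent political architecture.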
Details
Motivation: Large language models show pronounced ideological leanings, yet the stability and depth of these positions remain poorly understood. Surface-level responses can be manipulated through simple prompt engineering, which raises the question of whether they reflect a coherent underlying ideology. Method: A dual approach is used: first, the "steerability" of two well-known open-source LLMs is measured using instruction prompting and activation steering; second, the internal mechanisms of these models are probed with Sparse Autoencoders (SAEs). Result: Some models switch easily between liberal and conservative viewpoints, while others show resistance or increased refusal rates, suggesting a more entrenched ideological structure. Sparse autoencoder analysis shows that models with lower steerability have more distinct and abstract ideological features. The evaluation reveals that one model can contain 7.3x more political features than another model of similar size. Conclusion: Ideological depth is a quantifiable property of LLMs, and steerability offers a valuable window into their latent political architecture. Abstract: Large Language Models (LLMs) have demonstrated pronounced ideological leanings, yet the stability and depth of these positions remain poorly understood. Surface-level responses can often be manipulated through simple prompt engineering, calling into question whether they reflect a coherent underlying ideology. This paper investigates the concept of "ideological depth" in LLMs, defined as the robustness and complexity of their internal political representations. We employ a dual approach: first, we measure the "steerability" of two well-known open-source LLMs using instruction prompting and activation steering. We find that while some models can easily switch between liberal and conservative viewpoints, others exhibit resistance or an increased rate of refusal, suggesting a more entrenched ideological structure. Second, we probe the internal mechanisms of these models using Sparse Autoencoders (SAEs). Preliminary analysis reveals that models with lower steerability possess more distinct and abstract ideological features. Our evaluations reveal that one model can contain 7.3x more political features than another model of similar size. This allows targeted ablation of a core political feature in an ideologically "deep" model, leading to consistent, logical shifts in its reasoning across related topics, whereas the same intervention in a "shallow" model results in an increase in refusal outputs. 
Our findings suggest that ideological depth is a quantifiable property of LLMs and that steerability serves as a valuable window into their latent political architecture.[23] Igniting Creative Writing in Small Language Models: LLM-as-a-Judge versus Multi-Agent Refined Rewards
Xiaolong Wei,Bo Lu,Xingyu Zhang,Zhejun Zhao,Dongdong Shen,Long Xia,Dawei Yin
Main category: cs.CL
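TL;DR: This paper studies two reinforcement learning strategies based on AI feedback for improving the creative writing of small language models, specifically Chinese greeting generation.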
Details
Motivation: Large language models have high computational demands, while conventional methods such as supervised fine-tuning and reinforcement learning from human feedback fall short on novelty and cost, respectively. Method: Two AI-driven reward strategies are used: a reward model (RM) training framework based on multi-agent rejection sampling, and an LLM-as-a-Judge approach optimized through adversarial training with a reflection mechanism. Result: Both methods significantly improve creative output, but LLM-as-a-Judge performs better in generation quality and training efficiency while reducing dependence on human-annotated data. Conclusion: Reinforcement learning strategies based on AI feedback offer a scalable and efficient approach to creative writing with small language models. Abstract: Large Language Models (LLMs) have demonstrated remarkable creative writing capabilities, yet their substantial computational demands hinder widespread use. Enhancing Small Language Models (SLMs) offers a promising alternative, but current methods like Supervised Fine-Tuning (SFT) struggle with novelty, and Reinforcement Learning from Human Feedback (RLHF) is costly. This paper explores two distinct AI-driven reward strategies within a Reinforcement Learning from AI Feedback (RLAIF) framework to ignite the creative writing of a 7B-parameter SLM, specifically for generating Chinese greetings. The first strategy employs a reward model (RM) trained on high-quality preference data curated by a novel multi-agent rejection sampling framework designed for creative tasks. The second, more novel strategy utilizes a principle-guided LLM-as-a-Judge, whose reward function is optimized via an adversarial training scheme with a reflection mechanism, to directly provide reward signals. Comprehensive experiments reveal that while both approaches significantly enhance creative output over baselines, the principle-guided LLM-as-a-Judge demonstrably yields superior generation quality. Furthermore, it offers notable advantages in training efficiency and reduced dependency on human-annotated data, presenting a more scalable and effective path towards creative SLMs. Our automated evaluation methods also exhibit strong alignment with human judgments. Our code and data are publicly available at https://github.com/weixiaolong94-hub/Igniting-Creative-Writing-in-Small-Language-Models.[24] HSFN: Hierarchical Selection for Fake News Detection building Heterogeneous Ensemble
Sara B. Coutinho,Rafael M. O. Cruz,Francimaria R. S. Nascimento,George D. C. Cavalcanti
Main category: cs.CL
TL;DR: This paper proposes HierarchySelect, a new method for selecting diverse and effective classifiers for ensemble models in fake news detection, demonstrating improved performance on two datasets.
Details
Motivation: The motivation is to enhance the performance of ensemble-based fact-checking systems by addressing the challenge of selecting genuinely diverse classifiers that do not learn redundant patterns. Method: The method computes pairwise diversity between classifiers, uses hierarchical clustering to group them, and selects the most diverse pool of classifiers while incorporating performance evaluation. Result: Experiments on six datasets with 40 classifiers showed that the proposed method achieved the highest accuracy on two datasets compared to baselines. Conclusion: The proposed HierarchySelect method effectively selects diverse and high-performing classifiers for ensemble construction, showing improved accuracy on two out of six datasets compared to existing approaches. Abstract: Psychological biases, such as confirmation bias, make individuals particularly vulnerable to believing and spreading fake news on social media, leading to significant consequences in domains such as public health and politics. Machine learning-based fact-checking systems have been widely studied to mitigate this problem. Among them, ensemble methods are particularly effective in combining multiple classifiers to improve robustness. However, their performance heavily depends on the diversity of the constituent classifiers; selecting genuinely diverse models remains a key challenge, especially when models tend to learn redundant patterns. In this work, we propose a novel automatic classifier selection approach that prioritizes diversity and also accounts for performance. The method first computes pairwise diversity between classifiers and applies hierarchical clustering to organize them into groups at different levels of granularity. HierarchySelect then explores these hierarchical levels to select one pool of classifiers per level, each representing a distinct level of intra-pool diversity. From these, the most diverse pool is identified and selected for ensemble construction. 
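The selection procedure described above can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: the disagreement rate as the diversity measure, average linkage, picking the most accurate classifier per cluster, and the names `disagreement`, `cluster_levels`, and `select_pool` are all assumptions made for the example, as is the toy data.

```python
from itertools import combinations

def disagreement(a, b):
    """Fraction of samples on which two prediction lists differ (assumed diversity measure)."""
    return sum(x != y for x, y in zip(a, b)) / len(a)

def accuracy(pred, y):
    """Fraction of correct predictions."""
    return sum(p == t for p, t in zip(pred, y)) / len(y)

def cluster_levels(dist):
    """Average-linkage agglomerative clustering; returns the partition at every level."""
    clusters = [[i] for i in range(len(dist))]
    levels = [[c[:] for c in clusters]]
    while len(clusters) > 1:
        def link(pair):
            a, b = clusters[pair[0]], clusters[pair[1]]
            return sum(dist[i][j] for i in a for j in b) / (len(a) * len(b))
        # Merge the two closest clusters (ties go to the first pair found).
        i, j = min(combinations(range(len(clusters)), 2), key=link)
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
        levels.append([c[:] for c in clusters])
    return levels

def select_pool(preds, y):
    """Build one pool per hierarchy level and return the most diverse one."""
    n = len(preds)
    dist = [[disagreement(preds[i], preds[j]) for j in range(n)] for i in range(n)]
    acc = [accuracy(p, y) for p in preds]
    best_pool, best_div = None, -1.0
    for partition in cluster_levels(dist):
        if len(partition) < 2:
            continue  # a single-member pool has no pairwise diversity
        # Assumption: each cluster is represented by its most accurate classifier.
        pool = sorted(max(c, key=lambda i: acc[i]) for c in partition)
        pairs = list(combinations(pool, 2))
        div = sum(dist[i][j] for i, j in pairs) / len(pairs)
        if div > best_div:
            best_pool, best_div = pool, div
    return best_pool, best_div

# Toy validation labels and predictions from four hypothetical classifiers.
y = [0, 1, 0, 1, 0, 1, 0, 1]
preds = [
    [0, 1, 0, 1, 1, 0, 1, 0],
    [0, 1, 0, 1, 1, 0, 1, 1],
    [1, 0, 1, 0, 0, 1, 0, 1],
    [1, 0, 1, 1, 0, 1, 0, 1],
]
pool, div = select_pool(preds, y)
print(pool, div)  # pool == [1, 3], diversity == 0.75
```

In this toy run the two near-duplicate pairs collapse into two clusters, and the pool built at that level is more diverse than the full set of four classifiers.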
The selection process incorporates an evaluation metric reflecting each classifier's performance to ensure the ensemble also generalises well. We conduct experiments with 40 heterogeneous classifiers across six datasets from different application domains and with varying numbers of classes. Our method is compared against the Elbow heuristic and state-of-the-art baselines. Results show that our approach achieves the highest accuracy on two of six datasets. The implementation details are available on the project's repository: https://github.com/SaraBCoutinho/HSFN .[25] L3Cube-MahaSTS: A Marathi Sentence Similarity Dataset and Models
Aishwarya Mirashi,Ananya Joshi,Raviraj Joshi
Main category: cs.CL
TL;DR: This paper introduces MahaSTS, a sentence textual similarity dataset for Marathi, and MahaSBERT-STS-v2, a model optimized for regression-based similarity scoring.
Details
Motivation: To improve model performance on Marathi sentence similarity tasks, reduce label bias, and enhance model stability, the authors created the MahaSTS dataset. Method: They built a dataset of 16,860 Marathi sentence pairs annotated with continuous similarity scores and fine-tuned the MahaSBERT model to optimize similarity scoring. Result: Experiments show that the MahaSTS dataset and the MahaSBERT model perform well on Marathi sentence similarity tasks, particularly in low-resource settings. Conclusion: Combining the MahaSTS dataset with the MahaSBERT model not only improves performance on Marathi sentence similarity tasks but also underscores the importance of human-annotated data and targeted fine-tuning. Abstract: We present MahaSTS, a human-annotated Sentence Textual Similarity (STS) dataset for Marathi, along with MahaSBERT-STS-v2, a fine-tuned Sentence-BERT model optimized for regression-based similarity scoring. The MahaSTS dataset consists of 16,860 Marathi sentence pairs labeled with continuous similarity scores in the range of 0-5. To ensure balanced supervision, the dataset is uniformly distributed across six score-based buckets spanning the full 0-5 range, thus reducing label bias and enhancing model stability. We fine-tune the MahaSBERT model on this dataset and benchmark its performance against other alternatives like MahaBERT, MuRIL, IndicBERT, and IndicSBERT. Our experiments demonstrate that MahaSTS enables effective training for sentence similarity tasks in Marathi, highlighting the impact of human-curated annotations, targeted fine-tuning, and structured supervision in low-resource settings. The dataset and model are publicly shared at https://github.com/l3cube-pune/MarathiNLP .[26] A Survey on Current Trends and Recent Advances in Text Anonymization
Tobias Deußer,Lorenz Sparrenberg,Armin Berger,Max Hahnbück,Christian Bauckhage,Rafet Sifa
Main category: cs.CL
TL;DR: This survey paper explores current trends and recent advances in text anonymization techniques, emphasizing the role of Named Entity Recognition, Large Language Models, and formal privacy models. It addresses domain-specific challenges in sectors like healthcare, law, finance, and education, and reviews evaluation frameworks and toolkits for real-world deployment. The paper aims to guide future research by identifying emerging trends and persistent challenges in the field of text anonymization.
Details
Motivation: The motivation for the paper is the increasing volume of textual data containing sensitive personal information across various domains, which necessitates robust anonymization techniques to protect privacy, comply with regulations, and preserve data usability for downstream tasks. Method: The paper uses a survey methodology to provide a comprehensive overview of current trends and recent advances in text anonymization techniques. It discusses foundational approaches, examines the impact of Large Language Models, explores domain-specific challenges, investigates advanced methodologies, and reviews evaluation frameworks and practical toolkits. Result: The result is a detailed survey that consolidates current knowledge in text anonymization, highlights emerging trends, identifies persistent challenges such as the evolving privacy-utility trade-off and the need to address quasi-identifiers, and explores the implications of Large Language Model capabilities in both anonymization and de-anonymization. Conclusion: The paper concludes that while text anonymization techniques have advanced significantly, especially with the integration of Large Language Models and formal privacy models, there remain persistent challenges such as managing the privacy-utility trade-off, addressing quasi-identifiers, and adapting to the evolving capabilities of LLMs. The survey aims to guide future research directions for academics and practitioners in this field. Abstract: The proliferation of textual data containing sensitive personal information across various domains requires robust anonymization techniques to protect privacy and comply with regulations, while preserving data usability for diverse and crucial downstream tasks. This survey provides a comprehensive overview of current trends and recent advances in text anonymization techniques. 
We begin by discussing foundational approaches, primarily centered on Named Entity Recognition, before examining the transformative impact of Large Language Models, detailing their dual role as sophisticated anonymizers and potent de-anonymization threats. The survey further explores domain-specific challenges and tailored solutions in critical sectors such as healthcare, law, finance, and education. We investigate advanced methodologies incorporating formal privacy models and risk-aware frameworks, and address the specialized subfield of authorship anonymization. Additionally, we review evaluation frameworks, comprehensive metrics, benchmarks, and practical toolkits for real-world deployment of anonymization solutions. This review consolidates current knowledge, identifies emerging trends and persistent challenges, including the evolving privacy-utility trade-off, the need to address quasi-identifiers, and the implications of LLM capabilities, and aims to guide future research directions for both academics and practitioners in this field.[27] Middo: Model-Informed Dynamic Data Optimization for Enhanced LLM Fine-Tuning via Closed-Loop Learning
Zinan Tang,Xin Gao,Qizhi Pei,Zhuoshi Pan,Mengzhang Cai,Jiang Wu,Conghui He,Lijun Wu
Main category: cs.CL
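TL;DR: Middo is a dynamic data optimization framework that continuously refines training data through a closed-loop system, significantly improving the performance of large language models.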
Details
Motivation: Existing static dataset curation methods struggle to adapt to evolving model capabilities; Middo addresses this with dynamic data optimization to improve model performance. Method: Middo uses a closed-loop optimization system comprising a self-diagnostic module and an adaptive optimization engine; it identifies suboptimal samples via loss patterns, embedding cluster dynamics, and self-alignment scores and transforms them into valuable training points. Result: Experiments show that Middo improves accuracy by 7.15% on average while maintaining the original dataset scale. Conclusion: By dynamically optimizing data and co-evolving data and models, the Middo framework offers a new paradigm for sustainable LLM training. Abstract: Supervised Fine-Tuning (SFT) of Large Language Models (LLMs) fundamentally relies on high-quality training data. While data selection and data synthesis are two common strategies to improve data quality, existing approaches often face limitations in static dataset curation that fail to adapt to evolving model capabilities. In this paper, we introduce Middo, a self-evolving Model-informed dynamic data optimization framework that uses model-aware data selection and context-preserving data refinement. Unlike conventional one-off filtering/synthesis methods, our framework establishes a closed-loop optimization system: (1) A self-referential diagnostic module proactively identifies suboptimal samples through tri-axial model signals - loss patterns (complexity), embedding cluster dynamics (diversity), and self-alignment scores (quality); (2) An adaptive optimization engine then transforms suboptimal samples into pedagogically valuable training points while preserving semantic integrity; (3) This optimization process continuously evolves with model capability through dynamic learning principles. Experiments on multiple benchmarks demonstrate that Middo consistently enhances the quality of seed data and boosts LLM performance, improving accuracy by 7.15% on average while maintaining the original dataset scale. This work establishes a new paradigm for sustainable LLM training through dynamic human-AI co-evolution of data and models. Our datasets, models, and code are coming soon.[28] Personality Matters: User Traits Predict LLM Preferences in Multi-Turn Collaborative Tasks
Sarfaroz Yunusov,Kaige Chen,Kazi Nishat Anwar,Ali Emami
Main category: cs.CL
TL;DR: The study finds that personality traits influence preferences for LLMs like GPT-4 and Claude 3.5, with distinct choices emerging based on task types, which traditional evaluations may overlook.
Details
Motivation: To understand if users with different personality traits systematically prefer certain Large Language Models (LLMs) over others in multi-turn collaboration scenarios. Method: A study involving 32 participants across four Keirsey personality types evaluated interactions with GPT-4 and Claude 3.5 on four collaborative tasks, combined with sentiment analysis of qualitative feedback. Result: Rationals preferred GPT-4, especially for goal-oriented tasks, while Idealists favored Claude 3.5 for creative and analytical tasks. Other types showed task-dependent preferences despite similar overall helpfulness ratings. Conclusion: Personality-based analysis reveals differences between LLMs that traditional evaluations miss, with distinct preferences observed based on personality traits and task requirements. Abstract: As Large Language Models (LLMs) increasingly integrate into everyday workflows, where users shape outcomes through multi-turn collaboration, a critical question emerges: do users with different personality traits systematically prefer certain LLMs over others? We conducted a study with 32 participants evenly distributed across four Keirsey personality types, evaluating their interactions with GPT-4 and Claude 3.5 across four collaborative tasks: data analysis, creative writing, information retrieval, and writing assistance. Results revealed significant personality-driven preferences: Rationals strongly preferred GPT-4, particularly for goal-oriented tasks, while Idealists favored Claude 3.5, especially for creative and analytical tasks. Other personality types showed task-dependent preferences. Sentiment analysis of qualitative feedback confirmed these patterns. Notably, aggregate helpfulness ratings were similar across models, showing how personality-based analysis reveals LLM differences that traditional evaluations miss.[29] QZhou-Embedding Technical Report
Peng Yu,En Xu,Bin Chen,Haibiao Chen,Yinfei Xu
Main category: cs.CL
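TL;DR: QZhou-Embedding is a text embedding model built on Qwen2.5-7B-Instruct that improves retrieval performance through a multi-task framework and high-quality data, achieving state-of-the-art results on multiple benchmarks.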
Details
Motivation: To improve the representation capability and efficiency of text embedding models by using more diverse datasets and improving data quality. Method: A unified multi-task framework is built on the Qwen2.5-7B-Instruct foundation model, with specialized data transformation and training strategies and a two-stage training strategy. Result: QZhou-Embedding ranks first on both the MTEB and CMTEB leaderboards and performs strongly on tasks such as reranking and clustering. Conclusion: QZhou-Embedding achieves state-of-the-art performance on multiple benchmarks, demonstrating the importance of high-quality, diverse data for retrieval model performance. Abstract: We present QZhou-Embedding, a general-purpose contextual text embedding model with exceptional text representation capabilities. Built upon the Qwen2.5-7B-Instruct foundation model, we designed a unified multi-task framework comprising specialized data transformation and training strategies. The data transformation scheme enables the incorporation of more diverse textual training datasets, while the task-specific training strategies enhance model learning efficiency. We developed a data synthesis pipeline leveraging an LLM API, incorporating techniques such as paraphrasing, augmentation, and hard negative example generation to improve the semantic richness and sample difficulty of the training set. Additionally, we employ a two-stage training strategy, comprising initial retrieval-focused pretraining followed by full-task fine-tuning, enabling the embedding model to extend its capabilities based on robust retrieval performance. Our model achieves state-of-the-art results on the MTEB and CMTEB benchmarks, ranking first on both leaderboards (as of August 27, 2025), and simultaneously achieves state-of-the-art performance on tasks including reranking and clustering. Our findings demonstrate that higher-quality, more diverse data is crucial for advancing retrieval model performance, and that leveraging LLMs' generative capabilities can further optimize data quality for embedding model breakthroughs. Our model weights are released on HuggingFace under Apache 2.0 license. For reproducibility, we provide evaluation code and instructions on GitHub.[30] Is this chart lying to me? Automating the detection of misleading visualizations
Jonathan Tonglet,Jan Zimny,Tinne Tuytelaars,Iryna Gurevych
Main category: cs.CL
TL;DR: This work releases the Misviz and Misviz-synth datasets for automatically detecting misleading visualizations; the task remains challenging.
Details
Motivation: Misleading visualizations are a potent driver of misinformation on social media and the web; by violating chart design principles, they lead readers to draw inaccurate conclusions. Method: Introduces the Misviz and Misviz-synth datasets and evaluates them comprehensively using state-of-the-art MLLMs, rule-based systems, and fine-tuned classifiers. Result: The task remains highly challenging, even with the new datasets available to support model training and evaluation. Conclusion: The release of Misviz and Misviz-synth provides new resources for automatically detecting misleading visualizations and reducing the spread of misinformation, but the task remains challenging. Abstract: Misleading visualizations are a potent driver of misinformation on social media and the web. By violating chart design principles, they distort data and lead readers to draw inaccurate conclusions. Prior work has shown that both humans and multimodal large language models (MLLMs) are frequently deceived by such visualizations. Automatically detecting misleading visualizations and identifying the specific design rules they violate could help protect readers and reduce the spread of misinformation. However, the training and evaluation of AI models has been limited by the absence of large, diverse, and openly available datasets. In this work, we introduce Misviz, a benchmark of 2,604 real-world visualizations annotated with 12 types of misleaders. To support model training, we also release Misviz-synth, a synthetic dataset of 81,814 visualizations generated using Matplotlib and based on real-world data tables. We perform a comprehensive evaluation on both datasets using state-of-the-art MLLMs, rule-based systems, and fine-tuned classifiers. Our results reveal that the task remains highly challenging. We release Misviz, Misviz-synth, and the accompanying code.[31] Not All Parameters Are Created Equal: Smart Isolation Boosts Fine-Tuning Performance
Yao Wang,Di Liang,Minlong Peng
Main category: cs.CL
TL;DR: This paper introduces CPI-FT, a novel fine-tuning framework that isolates and preserves core parameters for different tasks, thereby reducing interference and forgetting in large language models during multi-task adaptation.
Details
Motivation: The motivation stems from the 'seesaw phenomenon' in supervised fine-tuning of large language models, where parameter updates benefit some tasks while harming others. This work aims to develop a method that enables more balanced and effective multi-task adaptation. Method: The Core Parameter Isolation Fine-Tuning (CPI-FT) framework identifies core parameter regions for each task through independent fine-tuning, groups tasks with similar core regions, and employs a parameter fusion technique using SLERP for non-core parameters. A pipelined SFT training phase with mixed-task data and freezing of prior task core regions is also introduced. Result: Extensive experiments show that CPI-FT consistently outperforms vanilla multi-task and multi-stage fine-tuning baselines, demonstrating its effectiveness in reducing task interference and catastrophic forgetting. Conclusion: The proposed CPI-FT framework effectively addresses the seesaw phenomenon in supervised fine-tuning of large language models, significantly alleviating task interference and forgetting, and outperforming traditional multi-task and multi-stage fine-tuning baselines. Abstract: Supervised fine-tuning (SFT) is a pivotal approach to adapting large language models (LLMs) for downstream tasks; however, performance often suffers from the "seesaw phenomenon", where indiscriminate parameter updates yield progress on certain tasks at the expense of others. To address this challenge, we propose a novel Core Parameter Isolation Fine-Tuning (CPI-FT) framework. Specifically, we first independently fine-tune the LLM on each task to identify its core parameter regions by quantifying parameter update magnitudes. Tasks with similar core regions are then grouped based on region overlap, forming clusters for joint modeling. 
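The first two steps described above (finding each task's core region from parameter update magnitudes, then grouping tasks by region overlap) can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the flat parameter lists, the `top_fraction` cutoff, the Jaccard overlap measure, and the greedy grouping rule are all hypothetical choices made for the example.

```python
from typing import Dict, List, Set


def core_region(base: List[float], tuned: List[float],
                top_fraction: float = 0.1) -> Set[int]:
    """Return indices of the parameters with the largest |tuned - base| update.

    Assumption: the "core region" is the top `top_fraction` of parameters
    ranked by absolute update magnitude after single-task fine-tuning.
    """
    if len(base) != len(tuned):
        raise ValueError("base and tuned must have the same length")
    k = max(1, int(len(base) * top_fraction))
    deltas = [abs(t - b) for b, t in zip(base, tuned)]
    ranked = sorted(range(len(deltas)), key=lambda i: deltas[i], reverse=True)
    return set(ranked[:k])


def region_overlap(a: Set[int], b: Set[int]) -> float:
    """Jaccard overlap between two core regions (an assumed overlap measure)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def group_tasks(regions: Dict[str, Set[int]],
                threshold: float = 0.5) -> List[List[str]]:
    """Greedy grouping: a task joins the first cluster whose first member
    overlaps with it by at least `threshold`; otherwise it starts a new one."""
    clusters: List[List[str]] = []
    for task, region in regions.items():
        for cluster in clusters:
            if region_overlap(region, regions[cluster[0]]) >= threshold:
                cluster.append(task)
                break
        else:
            clusters.append([task])
    return clusters


# Toy usage: three "tasks" fine-tuned from the same 10-parameter base model.
base = [0.0] * 10
tuned = {
    "math": [1.0, 0.9] + [0.0] * 8,   # updates concentrated in indices 0, 1
    "code": [0.8, 1.2] + [0.0] * 8,   # same region as "math"
    "chat": [0.0] * 8 + [0.7, 1.1],   # updates concentrated in indices 8, 9
}
regions = {name: core_region(base, w, top_fraction=0.2) for name, w in tuned.items()}
clusters = group_tasks(regions, threshold=0.5)
```

In this toy setup the two tasks whose updates land on the same indices end up in one cluster, while the third forms its own. A real model would apply the same idea per tensor or per layer rather than to a single flat list.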
We further introduce a parameter fusion technique: for each task, core parameters from its individually fine-tuned model are directly transplanted into a unified backbone, while non-core parameters from different tasks are smoothly integrated via Spherical Linear Interpolation (SLERP), mitigating destructive interference. A lightweight, pipelined SFT training phase using mixed-task data is subsequently employed, while freezing core regions from prior tasks to prevent catastrophic forgetting. Extensive experiments on multiple public benchmarks demonstrate that our approach significantly alleviates task interference and forgetting, consistently outperforming vanilla multi-task and multi-stage fine-tuning baselines.[32] Reasoning-Intensive Regression
Diane Tchuindjo,Omar Khattab
Main category: cs.CL
TL;DR: This paper introduces MENTAT, a new method combining batch-reflective prompt optimization and neural ensemble learning, which significantly improves performance on reasoning-intensive regression tasks compared to existing approaches like prompting frozen LLMs or finetuning Transformers.
Details
Motivation: The authors observe that reasoning-intensive regression (RiR) tasks, which require deducing subtle numerical properties from text, are increasingly being tackled with large language models (LLMs). However, these tasks are distinct from standard language regression tasks due to their need for deeper text analysis and limited availability of task-specific training data and computation. This leads to challenges for existing approaches like prompting frozen LLMs or finetuning via gradient descent. Method: The authors propose MENTAT, which combines batch-reflective prompt optimization with neural ensemble learning. They evaluate this approach on three realistic reasoning-intensive regression tasks and compare it against baselines such as prompting frozen LLMs and finetuning Transformer encoders via gradient descent. Result: MENTAT achieves up to a 65% improvement over the baselines tested, showing its effectiveness in handling reasoning-intensive regression tasks. Conclusion: MENTAT provides a simple and lightweight method that significantly improves performance on reasoning-intensive regression tasks, although there is still room for future advancements in this area. Abstract: AI researchers and practitioners increasingly apply large language models (LLMs) to what we call reasoning-intensive regression (RiR), i.e. deducing subtle numerical properties from text. Unlike standard language regression tasks, e.g. for sentiment or similarity, RiR often appears instead in ad-hoc problems like rubric-based scoring or domain-specific retrieval, where much deeper analysis of text is required while only limited task-specific training data and computation are available. We cast three realistic problems as RiR tasks to establish an initial benchmark, and use that to test our hypothesis that prompting frozen LLMs and finetuning Transformer encoders via gradient descent will both often struggle in RiR. 
We then propose MENTAT, a simple and lightweight method that combines batch-reflective prompt optimization with neural ensemble learning. MENTAT achieves up to 65% improvement over both baselines, though substantial room remains for future advances in RiR.[33] PiCSAR: Probabilistic Confidence Selection And Ranking
Joshua Ong Jun Leang,Zheng Zhao,Aryo Pradipta Gema,Sohee Yang,Wai-Chung Kwan,Xuanli He,Wenda Li,Pasquale Minervini,Eleonora Giunchiglia,Shay B. Cohen
Main category: cs.CL
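TL;DR: PiCSAR is an effective training-free method that scores candidate generations by the joint log-likelihood of the reasoning and the final answer, achieving substantial gains across diverse benchmarks.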
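The scoring rule summarized in this TL;DR can be illustrated with a short sketch. By the chain rule, the joint log-likelihood of reasoning and answer splits into a reasoning term and an answer term, so selection reduces to summing token log-probabilities and taking the maximum. The `Candidate` type, the field names, and the absence of any length normalization are assumptions made for the example; the paper may differ in these details.

```python
from typing import List, NamedTuple


class Candidate(NamedTuple):
    """One sampled generation (hypothetical container for this sketch)."""
    text: str
    reasoning_logprobs: List[float]  # per-token log-probs of the reasoning chain
    answer_logprobs: List[float]     # per-token log-probs of the answer, given the reasoning


def reasoning_confidence(c: Candidate) -> float:
    """log p(reasoning | prompt), as a sum of token log-probs."""
    return sum(c.reasoning_logprobs)


def answer_confidence(c: Candidate) -> float:
    """log p(answer | reasoning, prompt), as a sum of token log-probs."""
    return sum(c.answer_logprobs)


def joint_log_likelihood(c: Candidate) -> float:
    """log p(reasoning, answer | prompt) = reasoning term + answer term."""
    return reasoning_confidence(c) + answer_confidence(c)


def select_best(candidates: List[Candidate]) -> Candidate:
    """Best-of-n selection: keep the candidate with the highest joint score."""
    if not candidates:
        raise ValueError("need at least one candidate")
    return max(candidates, key=joint_log_likelihood)


# Toy usage with made-up log-probabilities.
candidates = [
    Candidate("chain A -> 42", [-0.1, -0.2], [-0.05]),
    Candidate("chain B -> 17", [-1.0, -0.5], [-0.7]),
]
best = select_best(candidates)
```

No reward model or ground-truth answer is needed here: the only inputs are the log-probabilities that the sampling model already produces.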
Details
Motivation: For reasoning tasks, a key challenge is designing a scoring function that can identify correct reasoning chains without access to ground-truth answers. Method: Proposes Probabilistic Confidence Selection And Ranking (PiCSAR), which scores each candidate generation using the joint log-likelihood of the reasoning and final answer. Result: PiCSAR achieves substantial gains across diverse benchmarks (e.g., +10.18 on MATH500, +9.81 on AIME2025), and the analysis reveals that correct reasoning chains exhibit significantly higher reasoning and answer confidence. Conclusion: PiCSAR is an effective training-free method that scores candidate generations by the joint log-likelihood of the reasoning and final answer, achieving substantial gains across diverse benchmarks and outperforming baselines with at least 2x fewer samples in 16 out of 20 comparisons. Abstract: Best-of-n sampling improves the accuracy of large language models (LLMs) and large reasoning models (LRMs) by generating multiple candidate solutions and selecting the one with the highest reward. The key challenge for reasoning tasks is designing a scoring function that can identify correct reasoning chains without access to ground-truth answers. We propose Probabilistic Confidence Selection And Ranking (PiCSAR): a simple, training-free method that scores each candidate generation using the joint log-likelihood of the reasoning and final answer. The joint log-likelihood of the reasoning and final answer naturally decomposes into reasoning confidence and answer confidence. PiCSAR achieves substantial gains across diverse benchmarks (+10.18 on MATH500, +9.81 on AIME2025), outperforming baselines with at least 2x fewer samples in 16 out of 20 comparisons. Our analysis reveals that correct reasoning chains exhibit significantly higher reasoning and answer confidence, justifying the effectiveness of PiCSAR.[34] Going over Fine Web with a Fine-Tooth Comb: Technical Report of Indexing Fine Web for Problematic Content Search and Retrieval
Inés Altemir Marinas,Anastasiia Kucherenko,Andrei Kucharavy
Main category: cs.CL
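TL;DR: This work develops an ElasticSearch-based framework that enables real-time analysis of LLM training datasets, addressing data quality, safety, and ethics concerns, and scales to large multilingual corpora.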
Details
Motivation: The indiscriminate nature of web crawling raises challenges in data quality, safety, and ethics, while training data quality is critical for large language models. Method: Uses an ElasticSearch-based pipeline to index and analyze LLM training datasets, and applies it to SwissAI's FineWeb-2 corpus. Result: Achieves real-time dataset analysis on a 1.5TB, four-language corpus, with most searches completing in milliseconds and all in under 2 seconds. Conclusion: The work demonstrates a framework for indexing and analyzing LLM training datasets, offering practical tools for building safer, more accountable AI systems. Abstract: Large language models (LLMs) rely heavily on web-scale datasets like Common Crawl, which provides over 80% of training data for some modern models. However, the indiscriminate nature of web crawling raises challenges in data quality, safety, and ethics. Despite the critical importance of training data quality, prior research on harmful content has been limited to small samples due to computational constraints. This project presents a framework for indexing and analyzing LLM training datasets using an ElasticSearch-based pipeline. We apply it to SwissAI's FineWeb-2 corpus (1.5TB, four languages), achieving fast query performance: most searches in milliseconds, all under 2 seconds. Our work demonstrates real-time dataset analysis, offering practical tools for safer, more accountable AI systems.
cs.CV [Back]
[35] 2COOOL: 2nd Workshop on the Challenge Of Out-Of-Label Hazards in Autonomous Driving
Ali K. AlShami,Ryan Rabinowitz,Maged Shoman,Jianwu Fang,Lukas Picek,Shao-Yuan Lo,Steve Cruz,Khang Nhut Lam,Nachiket Kamod,Lei-Lei Li,Jugal Kalita,Terrance E. Boult
Main category: cs.CV
TL;DR: The 2COOOL workshop at ICCV 2025 focuses on improving autonomous driving safety by addressing novel scenarios through new algorithms and systems inspired by anomaly detection, open-set recognition, and domain adaptation.
Details
Motivation: The lack of entirely safe self-driving cars is attributed to the challenge of addressing novel scenarios, which is a critical barrier to real-world deployment. Method: The workshop brings together academic and industry experts to explore novelty handling techniques such as anomaly detection, open-set recognition, and domain adaptation. Result: The workshop will feature discussions on out-of-distribution hazard detection, vision-language models, benchmarking methodologies, and safe autonomous driving practices. Conclusion: The 2COOOL workshop aims to advance algorithms and systems for hazard avoidance in autonomous driving by addressing novelty handling and incorporating insights from various fields. Abstract: As the computer vision community advances autonomous driving algorithms, integrating vision-based insights with sensor data remains essential for improving perception, decision making, planning, prediction, simulation, and control. Yet we must ask: Why don't we have entirely safe self-driving cars yet? A key part of the answer lies in addressing novel scenarios, one of the most critical barriers to real-world deployment. Our 2COOOL workshop provides a dedicated forum for researchers and industry experts to push the state of the art in novelty handling, including out-of-distribution hazard detection, vision-language models for hazard understanding, new benchmarking and methodologies, and safe autonomous driving practices. The 2nd Workshop on the Challenge of Out-of-Label Hazards in Autonomous Driving (2COOOL) will be held at the International Conference on Computer Vision (ICCV) 2025 in Honolulu, Hawaii, on October 19, 2025. We aim to inspire the development of new algorithms and systems for hazard avoidance, drawing on ideas from anomaly detection, open-set recognition, open-vocabulary modeling, domain adaptation, and related fields. 
Building on the success of its inaugural edition at the Winter Conference on Applications of Computer Vision (WACV) 2025, the workshop will feature a mix of academic and industry participation.[36] Advanced Deep Learning Techniques for Classifying Dental Conditions Using Panoramic X-Ray Images
Alireza Golkarieh,Kiana Kiashemshaki,Sajjad Rezvani Boroujeni
Main category: cs.CV
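TL;DR: This study evaluates several deep learning models for classifying dental panoramic X-ray images and finds that a hybrid model (CNN + Random Forest) performs best, reaching 85.4% accuracy and offering a practical route to automated dental diagnosis.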
Details
Motivation: To raise the level of automation in dental diagnosis, the study evaluates how different deep learning approaches perform at classifying dental X-ray images. Method: Uses a dataset of 1,512 panoramic X-ray images with 11,137 expert-verified annotations and 5-fold cross-validation to evaluate three approaches: a custom CNN, hybrid models combining a CNN with traditional classifiers, and fine-tuned pre-trained models. Result: The hybrid CNN + Random Forest model performed best at 85.4% accuracy, ahead of the custom CNN at 74.3%; among pre-trained models, VGG16 performed best at 82.3% accuracy. Conclusion: The study concludes that a hybrid model combining CNN feature extraction with a Random Forest classifier gives the best performance for automated classification of dental conditions, offering a viable path toward automated dental diagnosis. Abstract: This study investigates deep learning methods for automated classification of dental conditions in panoramic X-ray images. A dataset of 1,512 radiographs with 11,137 expert-verified annotations across four conditions (fillings, cavities, implants, and impacted teeth) was used. After preprocessing and class balancing, three approaches were evaluated: a custom convolutional neural network (CNN), hybrid models combining CNN feature extraction with traditional classifiers, and fine-tuned pre-trained architectures. Experiments employed 5-fold cross-validation with accuracy, precision, recall, and F1 score as evaluation metrics. The hybrid CNN + Random Forest model achieved the highest performance with 85.4% accuracy, surpassing the custom CNN baseline of 74.3%. Among pre-trained models, VGG16 performed best at 82.3% accuracy, followed by Xception and ResNet50. Results show that hybrid models improve discrimination of morphologically similar conditions and provide efficient, reliable performance. These findings suggest that combining CNN-based feature extraction with ensemble classifiers offers a practical path toward automated dental diagnostic support, while also highlighting the need for larger datasets and further clinical validation.[37] Q-Align: Alleviating Attention Leakage in Zero-Shot Appearance Transfer via Query-Query Alignment
Namu Kim,Wonbin Kweon,Minsoo Kim,Hwanjo Yu
Main category: cs.CV
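TL;DR: This paper proposes Q-Align, which addresses attention leakage in zero-shot appearance transfer through Query-Query alignment and Key-Value rearrangement, improving semantic consistency and appearance fidelity.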
Details
Motivation: Zero-shot appearance transfer with large-scale image generation models faces the significant challenge of attention leakage, which requires solving the semantic mapping problem. Method: Proposes Q-Align, with three core contributions: Query-Query alignment, Key-Value rearrangement, and attention refinement. Result: Extensive experiments validate the effectiveness of Q-Align and show that it outperforms existing techniques in appearance fidelity. Conclusion: Q-Align effectively addresses attention leakage in zero-shot appearance transfer and outperforms existing methods in appearance fidelity while maintaining structure preservation. Abstract: We observe that zero-shot appearance transfer with large-scale image generation models faces a significant challenge: Attention Leakage. This challenge arises when the semantic mapping between two images is captured by the Query-Key alignment. To tackle this issue, we introduce Q-Align, utilizing Query-Query alignment to mitigate attention leakage and improve the semantic alignment in zero-shot appearance transfer. Q-Align incorporates three core contributions: (1) Query-Query alignment, facilitating the sophisticated spatial semantic mapping between two images; (2) Key-Value rearrangement, enhancing feature correspondence through realignment; and (3) Attention refinement using rearranged keys and values to maintain semantic consistency. We validate the effectiveness of Q-Align through extensive experiments and analysis, and Q-Align outperforms state-of-the-art methods in appearance fidelity while maintaining competitive structure preservation.[38] ERTACache: Error Rectification and Timesteps Adjustment for Efficient Diffusion
Xurui Peng,Hong Liu,Chenqian Yan,Rui Ma,Fangmin Chen,Xing Wang,Zhihua Wu,Songwei Liu,Mingbao Lin
Main category: cs.CV
TL;DR: ERTACache accelerates diffusion model inference by addressing caching errors, achieving faster performance with minimal impact on quality.
Details
Motivation: Diffusion models are computationally expensive due to their iterative inference process, and while feature caching can accelerate this, it often leads to quality degradation, which this work aims to solve. Method: The authors propose ERTACache, which uses an offline residual profiling stage, dynamic adjustment of integration intervals, and a closed-form residual linearization model to address caching errors. Result: ERTACache achieves up to 2x inference speedup on image and video generation benchmarks while maintaining or improving visual quality. Conclusion: ERTACache is an effective caching framework that improves the efficiency of diffusion models without compromising visual quality, making it a promising solution for accelerating iterative inference processes. Abstract: Diffusion models suffer from substantial computational overhead due to their inherently iterative inference process. While feature caching offers a promising acceleration strategy by reusing intermediate outputs across timesteps, naive reuse often incurs noticeable quality degradation. In this work, we formally analyze the cumulative error introduced by caching and decompose it into two principal components: feature shift error, caused by inaccuracies in cached outputs, and step amplification error, which arises from error propagation under fixed timestep schedules. To address these issues, we propose ERTACache, a principled caching framework that jointly rectifies both error types. Our method employs an offline residual profiling stage to identify reusable steps, dynamically adjusts integration intervals via a trajectory-aware correction coefficient, and analytically approximates cache-induced errors through a closed-form residual linearization model. Together, these components enable accurate and efficient sampling under aggressive cache reuse. 
Extensive experiments across standard image and video generation benchmarks show that ERTACache achieves up to 2x inference speedup while consistently preserving or even improving visual quality. Notably, on the state-of-the-art Wan2.1 video diffusion model, ERTACache delivers 2x acceleration with minimal VBench degradation, effectively maintaining baseline fidelity while significantly improving efficiency. The code is available at https://github.com/bytedance/ERTACache.[39] Video-LLMs with Temporal Visual Screening
Zheyu Fan,Jiateng Liu,Yuji Zhang,Zihan Wang,Yi R. Fung,Manling Li,Heng Ji
Main category: cs.CV
TL;DR: Temporal Visual Screening (TVS) is introduced to enhance video-language understanding by focusing on critical video segments and simplifying queries, leading to improved performance in both training and inference.
Details
Motivation: Current Video-LLMs struggle with capturing fine-grained temporal semantics due to sparse frame sampling and insufficient inter-frame reasoning supervision. Humans naturally perform temporal screening, which inspired the development of TVS. Method: TVS retains focus-critical video segments, reconstructs queries to their most direct form while preserving answer consistency, and maintains answer invariance and consistency. It is designed as a modular front-end adapter task integrated into Video Instruction Tuning and Video Question Answering pipelines. Result: TVS achieved relative gains of 7.33% during training and 34.6% during inference. ReSimplifyIt, a baseline for TVS, outperformed prior approaches by 0.47 in F-1 score on video trimming and showed competitive query rewriting performance. Conclusion: The introduction of TVS demonstrates the effectiveness of temporal information screening in improving video-language understanding, providing a universal preprocessing approach for video QA and instruction tuning. Abstract: Humans naturally perform temporal screening by dragging the progress bar and focusing on salient temporal segments, but current Video Large Language Models (Video-LLMs) struggle to capture fine-grained temporal semantics due to sparse frame sampling and insufficient inter-frame reasoning supervision during their training. To address this, inspired by well-established cognitive science principles, we propose Temporal Visual Screening (TVS), a new task that universally pre-processes video question answering and instruction tuning data by: (1) retaining focus-critical video segments, (2) synchronously reconstructing queries to their most direct form while preserving answer consistency, and (3) keeping the invariance and consistency for any possible answer. TVS is formulated as a modular front-end adapter task that can be seamlessly integrated into both Video Instruction Tuning (training) and Video Question Answering (inference) pipelines. 
TVS optimizes distribution of reasoning burden and cognitive load; during training, it aligns queries with focus-critical visual information; at inference, it enables query-aware segment focus and streamlined query representations. In particular, we curate the first benchmark for TVS and propose ReSimplifyIt, a baseline outperforming prior approaches on seemingly similar tasks by 0.47 in F-1 score on video trimming while achieving competitive query rewriting performance. Experiments demonstrate that incorporating TVS yields relative gains of 7.33% (training) and 34.6% (inference), demonstrating the effectiveness of temporal information screening for improving video-language understanding.[40] ROBUST-MIPS: A Combined Skeletal Pose and Instance Segmentation Dataset for Laparoscopic Surgical Instruments
Zhe Han,Charlie Budd,Gongyu Zhang,Huanyu Tian,Christos Bergeles,Tom Vercauteren
Main category: cs.CV
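TL;DR: This paper proposes a new approach to surgical tool localisation based on skeletal pose annotations, building a new dataset and benchmark that demonstrate its advantages in performance and annotation efficiency.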
Details
Motivation: Existing learning-based methods for surgical tool localisation are limited by the scarcity of diverse annotated data, whereas skeletal pose annotations balance richness of semantic information against ease of annotation, accelerating the growth of annotated data. Method: Presents ROBUST-MIPS, a new dataset combining tool pose and instance segmentation, and sets up a benchmark using popular pose estimation methods. Result: The benchmark yields high-quality results, demonstrating the adequacy of pose annotations for surgical tool localisation. The authors also release the dataset, benchmark models, and custom tool pose annotation software. Conclusion: The authors argue that skeletal pose annotation is a more efficient way to annotate surgical tools, and validate the adequacy of pose annotations for surgical tool localisation with the ROBUST-MIPS dataset and benchmark. Abstract: Localisation of surgical tools constitutes a foundational building block for computer-assisted interventional technologies. Works in this field typically focus on training deep learning models to perform segmentation tasks. Performance of learning-based approaches is limited by the availability of diverse annotated data. We argue that skeletal pose annotations are a more efficient annotation approach for surgical tools, striking a balance between richness of semantic information and ease of annotation, thus allowing for accelerated growth of available annotated data. To encourage adoption of this annotation style, we present ROBUST-MIPS, a combined tool pose and tool instance segmentation dataset derived from the existing ROBUST-MIS dataset. Our enriched dataset facilitates the joint study of these two annotation styles and allows head-to-head comparison on various downstream tasks. To demonstrate the adequacy of pose annotations for surgical tool localisation, we set up a simple benchmark using popular pose estimation methods and observe high-quality results. To ease adoption, together with the dataset, we release our benchmark models and custom tool pose annotation software.[41] Safe-Control: A Safety Patch for Mitigating Unsafe Content in Text-to-Image Generation Models
Xiangtao Meng,Yingkai Dong,Ning Yu,Li Wang,Zheng Li,Shanqing Guo
Main category: cs.CV
TL;DR: Safe-Control is a plug-and-play safety patch that effectively reduces unsafe content generation in text-to-image generation models.
Details
Motivation: Existing safety mechanisms for T2I models either remain susceptible to evasion under distribution shifts or require extensive model-specific adjustments; Safe-Control is introduced to address these limitations. Method: Uses data-driven strategies and safety-aware conditions to inject safety control signals into the locked T2I model; multiple safety patches can be constructed to meet evolving safety requirements. Result: Extensive evaluations on six diverse, public T2I models show that Safe-Control effectively reduces unsafe content generation and is compatible with other T2I models of similar denoising architecture. Conclusion: Safe-Control is an innovative plug-and-play safety patch that mitigates unsafe content generation in T2I models. Abstract: Despite the advancements in Text-to-Image (T2I) generation models, their potential for misuse or even abuse raises serious safety concerns. Model developers have made tremendous efforts to introduce safety mechanisms that can address these concerns in T2I models. However, the existing safety mechanisms, whether external or internal, either remain susceptible to evasion under distribution shifts or require extensive model-specific adjustments. To address these limitations, we introduce Safe-Control, an innovative plug-and-play safety patch designed to mitigate unsafe content generation in T2I models. Using data-driven strategies and safety-aware conditions, Safe-Control injects safety control signals into the locked T2I model, acting as an update in a patch-like manner. Model developers can also construct various safety patches to meet the evolving safety requirements, which can be flexibly merged into a single, unified patch. Its plug-and-play design further ensures adaptability, making it compatible with other T2I models of similar denoising architecture. We conduct extensive evaluations on six diverse and public T2I models. Empirical results highlight that Safe-Control is effective in reducing unsafe content generation across six diverse T2I models with similar generative architectures, yet it successfully maintains the quality and text alignment of benign images. Compared to seven state-of-the-art safety mechanisms, including both external and internal defenses, Safe-Control significantly outperforms all baselines in reducing unsafe content generation. 
For example, it reduces the probability of unsafe content generation to 7%, compared to approximately 20% for most baseline methods, under both unsafe prompts and the latest adversarial attacks.[42] GENNAV: Polygon Mask Generation for Generalized Referring Navigable Regions
Kei Katsumata,Yui Iioka,Naoki Hosomi,Teruhisa Misu,Kentaro Yamada,Komei Sugiura
Main category: cs.CV
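TL;DR: This paper proposes GENNAV, which predicts target existence and generates segmentation masks to handle target regions with ambiguous boundaries, and performs strongly on a new benchmark and in real-world experiments.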
Details
Motivation: Existing methods underperform on stuff-type target regions with ambiguous boundaries and when targets are absent or multiple. Method: Proposes GENNAV, which predicts target existence and generates segmentation masks for multiple stuff-type target regions, and constructs a new benchmark, GRiN-Drive, containing no-target, single-target, and multi-target samples. Result: GENNAV outperforms baseline methods on standard evaluation metrics, and real-world experiments validate its zero-shot transfer performance. Conclusion: GENNAV achieves superior performance over baseline methods and demonstrates robustness across diverse real-world environments. Abstract: We focus on the task of identifying the location of target regions from a natural language instruction and a front camera image captured by a mobility. This task is challenging because it requires both existence prediction and segmentation, particularly for stuff-type target regions with ambiguous boundaries. Existing methods often underperform in handling stuff-type target regions, in addition to absent or multiple targets. To overcome these limitations, we propose GENNAV, which predicts target existence and generates segmentation masks for multiple stuff-type target regions. To evaluate GENNAV, we constructed a novel benchmark called GRiN-Drive, which includes three distinct types of samples: no-target, single-target, and multi-target. GENNAV achieved superior performance over baseline methods on standard evaluation metrics. Furthermore, we conducted real-world experiments with four automobiles operated in five geographically distinct urban areas to validate its zero-shot transfer performance. In these experiments, GENNAV outperformed baseline methods and demonstrated its robustness across diverse real-world environments. The project page is available at https://gennav.vercel.app/.[43] R-4B: Incentivizing General-Purpose Auto-Thinking Capability in MLLMs via Bi-Mode Annealing and Reinforce Learning
Jie Jiang,Qi Yang,Bolin Ni,Shiming Xiang,Han Hu,Houwen Peng
Main category: cs.CV
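TL;DR: This paper proposes R-4B, a multimodal large language model that adaptively decides whether to perform complex reasoning based on problem complexity; using bi-mode annealing and Bi-mode Policy Optimization, it reduces computational cost while maintaining performance.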
Details
Motivation: To make multimodal large language models more efficient on simple problems by avoiding the redundancy of unnecessary complex reasoning. Method: Proposes R-4B, an auto-thinking MLLM that gains both thinking and non-thinking capabilities through bi-mode annealing, and uses Bi-mode Policy Optimization (BPO) under an improved GRPO framework to improve the model's accuracy in deciding whether to activate the thinking process. Result: R-4B outperforms Qwen2.5-VL-7B on most tasks and achieves performance comparable to larger models on reasoning-intensive benchmarks at lower computational cost. Conclusion: R-4B achieves state-of-the-art performance across 25 challenging benchmarks; at lower computational cost, it outperforms Qwen2.5-VL-7B and matches larger models such as Kimi-VL-A3B-Thinking-2506 (16B) on reasoning-intensive benchmarks. Abstract: Multimodal Large Language Models (MLLMs) equipped with step-by-step thinking capabilities have demonstrated remarkable performance on complex reasoning problems. However, this thinking process is redundant for simple problems solvable without complex reasoning. To address this inefficiency, we propose R-4B, an auto-thinking MLLM, which can adaptively decide when to think based on problem complexity. The central idea of R-4B is to empower the model with both thinking and non-thinking capabilities using bi-mode annealing, and apply Bi-mode Policy Optimization (BPO) to improve the model's accuracy in determining whether to activate the thinking process. Specifically, we first train the model on a carefully curated dataset spanning various topics, which contains samples from both thinking and non-thinking modes. Then it undergoes a second phase of training under an improved GRPO framework, where the policy model is forced to generate responses from both modes for each input query. Experimental results show that R-4B achieves state-of-the-art performance across 25 challenging benchmarks. It outperforms Qwen2.5-VL-7B in most tasks and achieves performance comparable to larger models such as Kimi-VL-A3B-Thinking-2506 (16B) on reasoning-intensive benchmarks with lower computational cost.[44] HiddenObject: Modality-Agnostic Fusion for Multimodal Hidden Object Detection
Harris Song,Tuan-Anh Vu,Sanjith Menon,Sriram Narasimhan,M. Khalid Jawed
Main category: cs.CV
TL;DR: This paper introduces HiddenObject, a Mamba-based fusion framework that improves object detection in challenging multimodal environments by integrating RGB, thermal, and depth data to detect hidden or camouflaged objects more effectively than existing methods.
Details
Motivation: Traditional RGB-based object detection methods struggle in adverse conditions such as occlusion, camouflage, and varying lighting. This limitation necessitates a more robust, modality-agnostic solution for detecting hidden or partially concealed objects in multimodal environments. Method: The authors introduced HiddenObject, a fusion framework integrating RGB, thermal, and depth data through a Mamba-based fusion mechanism. This approach captures complementary signals from different modalities and fuses modality-specific features into a unified representation for enhanced detection of obscured or camouflaged objects. Result: HiddenObject was evaluated on multiple benchmark datasets and demonstrated state-of-the-art or competitive performance compared to existing methods, validating the effectiveness of the Mamba-based fusion approach in detecting obscured or camouflaged targets. Conclusion: The study concludes that the proposed HiddenObject framework, which uses a Mamba-based fusion mechanism, significantly improves the detection of hidden or partially concealed objects in multimodal environments, surpassing or competing with existing methods. It highlights the shortcomings of unimodal and naive fusion strategies and suggests that Mamba-based fusion can advance multimodal object detection, especially under degraded visual conditions. Abstract: Detecting hidden or partially concealed objects remains a fundamental challenge in multimodal environments, where factors like occlusion, camouflage, and lighting variations significantly hinder performance. Traditional RGB-based detection methods often fail under such adverse conditions, motivating the need for more robust, modality-agnostic approaches. In this work, we present HiddenObject, a fusion framework that integrates RGB, thermal, and depth data using a Mamba-based fusion mechanism. Our method captures complementary signals across modalities, enabling enhanced detection of obscured or camouflaged targets. 
Specifically, the proposed approach identifies modality-specific features and fuses them in a unified representation that generalizes well across challenging scenarios. We validate HiddenObject across multiple benchmark datasets, demonstrating state-of-the-art or competitive performance compared to existing methods. These results highlight the efficacy of our fusion design and expose key limitations in current unimodal and naïve fusion strategies. More broadly, our findings suggest that Mamba-based fusion architectures can significantly advance the field of multimodal object detection, especially under visually degraded or complex conditions.
[45] RadGS-Reg: Registering Spine CT with Biplanar X-rays via Joint 3D Radiative Gaussians Reconstruction and 3D/3D Registration
Ao Shen,Xueming Fu,Junfeng Jiang,Qiang Zeng,Ye Tang,Zhengming Chen,Luming Nong,Feng Wang,S. Kevin Zhou
Main category: cs.CV
TL;DR: This paper introduces RadGS-Reg, an improved framework for vertebral-level CT/X-ray registration that addresses the limitations of traditional methods and achieves superior performance.
Details
Motivation: Traditional "render and compare" methods and 3D reconstruction from biplanar X-rays suffer from spatial information loss, domain gaps, dense-view requirements, and difficulties with noisy X-rays. Method: The paper proposes RadGS-Reg, a framework for CT/X-ray registration using joint 3D Radiative Gaussians (RadGS) reconstruction and 3D/3D registration, including a Counterfactual Attention Learning (CAL) mechanism and a patient-specific pre-training strategy. Result: Experiments on in-house datasets show that RadGS-Reg achieves state-of-the-art performance for CT/X-ray registration tasks, outperforming existing methods. Conclusion: The paper concludes that RadGS-Reg achieves state-of-the-art performance in vertebral-level CT/X-ray registration and is effective in handling noisy X-rays. Abstract: Computed Tomography (CT)/X-ray registration in image-guided navigation remains challenging because of its stringent requirements for high accuracy and real-time performance. Traditional "render and compare" methods, relying on iterative projection and comparison, suffer from spatial information loss and domain gap. 3D reconstruction from biplanar X-rays supplements spatial and shape information for 2D/3D registration, but current methods are limited by dense-view requirements and struggle with noisy X-rays. To address these limitations, we introduce RadGS-Reg, a novel framework for vertebral-level CT/X-ray registration through joint 3D Radiative Gaussians (RadGS) reconstruction and 3D/3D registration. Specifically, our biplanar X-rays vertebral RadGS reconstruction module explores a learning-based RadGS reconstruction method with a Counterfactual Attention Learning (CAL) mechanism, focusing on vertebral regions in noisy X-rays. Additionally, a patient-specific pre-training strategy progressively adapts the RadGS-Reg from simulated to real data while simultaneously learning vertebral shape prior knowledge.
Experiments on in-house datasets demonstrate the state-of-the-art performance for both tasks, surpassing existing methods. The code is available at: https://github.com/shenao1995/RadGS_Reg.
[46] SYNBUILD-3D: A large, multi-modal, and semantically rich synthetic dataset of 3D building models at Level of Detail 4
Kevin Mayer,Alex Vesel,Xinyi Zhao,Martin Fischer
Main category: cs.CV
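TL;DR: SYNBUILD-3D is a large-scale, diverse, multi-modal dataset of over 6.2 million synthetic 3D residential buildings with semantic annotations, intended to support new generative AI algorithms that automate the creation of 3D building models subject to predefined floor plan layouts and roof geometries.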
Details
Motivation: Automatically generating accurate and semantically rich 3D buildings remains a major challenge because large-scale annotated datasets are lacking in the public domain. Inspired by the success of synthetic data in computer vision, the authors propose SYNBUILD-3D. Method: Using synthetic data generation, the authors created a dataset of over 6.2 million synthetic 3D residential buildings, each represented through three distinct modalities: a semantically enriched 3D wireframe graph at LoD 4, the corresponding floor plan images, and a LiDAR-like roof point cloud. Result: SYNBUILD-3D is a large-scale, diverse, multi-modal dataset of over 6.2 million synthetic 3D residential buildings, each with three modality representations and semantic annotations, suitable for developing new generative AI algorithms. Conclusion: SYNBUILD-3D provides a large-scale, diverse, multi-modal dataset that future work can use to develop new generative AI algorithms that automate the creation of 3D building models subject to predefined floor plan layouts and roof geometries while enforcing semantic-geometric consistency. Abstract: 3D building models are critical for applications in architecture, energy simulation, and navigation. Yet, generating accurate and semantically rich 3D buildings automatically remains a major challenge due to the lack of large-scale annotated datasets in the public domain. Inspired by the success of synthetic data in computer vision, we introduce SYNBUILD-3D, a large, diverse, and multi-modal dataset of over 6.2 million synthetic 3D residential buildings at Level of Detail (LoD) 4. In the dataset, each building is represented through three distinct modalities: a semantically enriched 3D wireframe graph at LoD 4 (Modality I), the corresponding floor plan images (Modality II), and a LiDAR-like roof point cloud (Modality III). The semantic annotations for each building wireframe are derived from the corresponding floor plan images and include information on rooms, doors, and windows. Through its tri-modal nature, future work can use SYNBUILD-3D to develop novel generative AI algorithms that automate the creation of 3D building models at LoD 4, subject to predefined floor plan layouts and roof geometries, while enforcing semantic-geometric consistency. Dataset and code samples are publicly available at https://github.com/kdmayer/SYNBUILD-3D.
[47] Radially Distorted Homographies, Revisited
Mårten Wadenbäck,Marcus Valtonen Örnhag,Johan Edstedt
Main category: cs.CV
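TL;DR: This paper proposes a unified approach to homography estimation under different radial distortion configurations and constructs new solvers that are faster than the existing state-of-the-art solvers.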
Details
Motivation: When working with real images affected by geometric lens distortion, obtaining useful estimates often requires determining the homography and the lens distortion, particularly radial distortion, simultaneously. Method: A novel, unified approach is proposed for homography estimation between two images under three different radial distortion configurations, and new minimal solvers are constructed. Result: In all three cases, the proposed solvers are faster than the existing state-of-the-art solvers while maintaining similar accuracy. Conclusion: The paper provides a unified solution to homography estimation under radial distortion and demonstrates its speed advantage at comparable accuracy. Abstract: Homographies are among the most prevalent transformations occurring in geometric computer vision and projective geometry, and homography estimation is consequently a crucial step in a wide assortment of computer vision tasks. When working with real images, which are often afflicted with geometric distortions caused by the camera lens, it may be necessary to determine both the homography and the lens distortion (particularly the radial component, called radial distortion) simultaneously to obtain anything resembling useful estimates. When considering a homography with radial distortion between two images, there are three conceptually distinct configurations for the radial distortion: (i) distortion in only one image, (ii) identical distortion in the two images, and (iii) independent distortion in the two images. While these cases have been addressed separately in the past, the present paper provides a novel and unified approach to solve all three cases. We demonstrate how the proposed approach can be used to construct new fast, stable, and accurate minimal solvers for radially distorted homographies. In all three cases, our proposed solvers are faster than the existing state-of-the-art solvers while maintaining similar accuracy. The solvers are tested on well-established benchmarks including images taken with fisheye cameras. The source code for our solvers will be made available in the event our paper is accepted for publication.
[48] GCAV: A Global Concept Activation Vector Framework for Cross-Layer Consistency in Interpretability
Zhenghao He,Sanchit Sinha,Guangzhi Xiong,Aidong Zhang
Main category: cs.CV
TL;DR: GCAV improves the stability and consistency of concept-based explanations for deep neural networks through cross-layer fusion.
Details
Motivation: CAVs are inconsistent across layers, which undermines the reliability of cross-layer comparisons. Method: Contrastive learning aligns concept representations across layers, and an attention-based fusion mechanism constructs a global CAV. Result: GCAV reduces the variance of TCAV scores and improves concept localization and robustness against adversarial perturbations. Conclusion: By unifying CAVs across layers, GCAV significantly improves the consistency and reliability of concept-based explanations for deep neural networks. Abstract: Concept Activation Vectors (CAVs) provide a powerful approach for interpreting deep neural networks by quantifying their sensitivity to human-defined concepts. However, when computed independently at different layers, CAVs often exhibit inconsistencies, making cross-layer comparisons unreliable. To address this issue, we propose the Global Concept Activation Vector (GCAV), a novel framework that unifies CAVs into a single, semantically consistent representation. Our method leverages contrastive learning to align concept representations across layers and employs an attention-based fusion mechanism to construct a globally integrated CAV. By doing so, our method significantly reduces the variance in TCAV scores while preserving concept relevance, ensuring more stable and reliable concept attributions. To evaluate the effectiveness of GCAV, we introduce Testing with Global Concept Activation Vectors (TGCAV) as a method to apply TCAV to GCAV-based representations. We conduct extensive experiments on multiple deep neural networks, demonstrating that our method effectively mitigates concept inconsistency across layers, enhances concept localization, and improves robustness against adversarial perturbations. By integrating cross-layer information into a coherent framework, our method offers a more comprehensive and interpretable understanding of how deep learning models encode human-defined concepts. Code and models are available at https://github.com/Zhenghao-He/GCAV.
[49] Generalizable Object Re-Identification via Visual In-Context Prompting
Zhizhong Huang,Xiaoming Liu
Main category: cs.CV
TL;DR: This paper proposes VICP, a framework combining LLMs and vision models to generalize object re-identification to unseen categories using in-context examples, without parameter adaptation.
Details
Motivation: Current object re-identification (ReID) methods are domain-specific, lack generalization, and require costly labeled data for new categories. Self-supervised learning struggles to capture identity-sensitive features, necessitating a solution that reduces annotation needs while improving generalization. Method: The paper introduces Visual In-Context Prompting (VICP), which synergizes large language models (LLMs) and vision foundation models (VFM) by using in-context examples as prompts. LLMs infer semantic identity rules from few-shot pairs, guiding a VFM like DINO to extract ID-discriminative features via dynamic visual prompts. Result: VICP outperforms baselines by a clear margin on unseen categories in object re-identification tasks, as demonstrated on the introduced ShopID10K dataset and other ReID benchmarks. Conclusion: The proposed VICP framework successfully enables generalization to unseen categories in object re-identification tasks without requiring dataset-specific retraining, outperforming baselines on ShopID10K and other ReID benchmarks. Abstract: Current object re-identification (ReID) methods train domain-specific models (e.g., for persons or vehicles), which lack generalization and demand costly labeled data for new categories. While self-supervised learning reduces annotation needs by learning instance-wise invariance, it struggles to capture identity-sensitive features critical for ReID. This paper proposes Visual In-Context Prompting (VICP), a novel framework where models trained on seen categories can directly generalize to unseen novel categories using only in-context examples as prompts, without requiring parameter adaptation. VICP synergizes LLMs and vision foundation models (VFM): LLMs infer semantic identity rules from few-shot positive/negative pairs through task-specific prompting, which then guides a VFM (e.g., DINO) to extract ID-discriminative features via dynamic visual prompts.
By aligning LLM-derived semantic concepts with the VFM's pre-trained prior, VICP enables generalization to novel categories, eliminating the need for dataset-specific retraining. To support evaluation, we introduce ShopID10K, a dataset of 10K object instances from e-commerce platforms, featuring multi-view images and cross-domain testing. Experiments on ShopID10K and diverse ReID benchmarks demonstrate that VICP outperforms baselines by a clear margin on unseen categories. Code is available at https://github.com/Hzzone/VICP.
[50] Lightweight MRI-Based Automated Segmentation of Pancreatic Cancer with Auto3DSeg
Keshav Jha,William Sharp,Dominic LaBella
Main category: cs.CV
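TL;DR: This study applies SegResNet models to pancreatic tumor segmentation on MRI; despite limited performance, the results show that automated segmentation has potential and underscore the importance of larger, standardized datasets.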
Details
Motivation: Automated segmentation of pancreatic tumors is critical for diagnosis, treatment planning, and outcome assessment, yet it remains challenging because of anatomical variability and limited dataset availability. Method: SegResNet models with STAPLE ensembling were trained using 5-fold cross-validation and evaluated on two MRI-based pancreatic tumor segmentation tasks after focusing on an anatomically relevant region of interest. Result: For Task 1, DSC was 0.56, 5 mm DSC 0.73, HD95 41.1 mm, MASD 26.0 mm, and RMSE 5164 mm; for Task 2, performance decreased, with DSC 0.33, 5 mm DSC 0.50, HD95 20.1 mm, MASD 7.2 mm, and RMSE 17,203 mm. Conclusion: Despite modest performance, the results demonstrate the potential of automated delineation and emphasize the need for larger, standardized MRI datasets to improve model robustness and clinical utility. Abstract: Accurate delineation of pancreatic tumors is critical for diagnosis, treatment planning, and outcome assessment, yet automated segmentation remains challenging due to anatomical variability and limited dataset availability. In this study, SegResNet models, as part of the Auto3DSeg architecture, were trained and evaluated on two MRI-based pancreatic tumor segmentation tasks as part of the 2025 PANTHER Challenge. Algorithm methodology included 5-fold cross-validation with STAPLE ensembling after focusing on an anatomically relevant region-of-interest. The Pancreatic Tumor Segmentation on Diagnostic MRI task 1 training set included 91 T1-weighted arterial contrast-enhanced MRI with expert annotated pancreas and tumor labels. The Pancreatic Tumor Segmentation on MR-Linac task 2 training set used 50 T2-weighted MR-Linac cases with expert annotated pancreas and tumor labels. Algorithm-automated segmentation performance of pancreatic tumor was assessed using Dice Similarity Coefficient (DSC), 5 mm DSC, 95th percentile Hausdorff Distance (HD95), Mean Average Surface Distance (MASD), and Root Mean Square Error (RMSE). For Task 1, the algorithm achieved a DSC of 0.56, 5 mm DSC of 0.73, HD95 of 41.1 mm, MASD of 26.0 mm, and RMSE of 5164 mm. For Task 2, performance decreased, with a DSC of 0.33, 5 mm DSC of 0.50, HD95 of 20.1 mm, MASD of 7.2 mm, and RMSE of 17,203 mm. These findings illustrate the challenges of MRI-based pancreatic tumor segmentation with small datasets, highlighting variability introduced by different MRI sequences.
Despite modest performance, the results demonstrate potential for automated delineation and emphasize the need for larger, standardized MRI datasets to improve model robustness and clinical utility.
[51] Reverse Imaging for Wide-spectrum Generalization of Cardiac MRI Segmentation
Yidong Zhao,Peter Kellman,Hui Xue,Tongyun Yang,Yi Zhang,Yuchi Han,Orlando Simonetti,Qian Tao
Main category: cs.CV
TL;DR: Reverse Imaging improves cardiac MRI segmentation generalization by leveraging spin properties through a physics-driven approach.
Details
Motivation: Pretrained cardiac MRI segmentation models struggle with generalization due to variations in image contrast from differing imaging protocols, despite being governed by consistent spin properties. Method: Reverse Imaging solves ill-posed nonlinear inverse problems regularized by a spin property prior learned from the mSASHA dataset using a generative diffusion model. Result: The method allows meaningful spin-property estimation and flexible image synthesis for novel sequences, achieving highly accurate segmentation across different image contrasts. Conclusion: Reverse Imaging offers a physics-driven solution for domain adaptation and data augmentation in cardiac MRI, enabling accurate segmentation across diverse imaging protocols. Abstract: Pretrained segmentation models for cardiac magnetic resonance imaging (MRI) struggle to generalize across different imaging sequences due to significant variations in image contrast. These variations arise from changes in imaging protocols, yet the same fundamental spin properties, including proton density, T1, and T2 values, govern all acquired images. With this core principle, we introduce Reverse Imaging, a novel physics-driven method for cardiac MRI data augmentation and domain adaptation to fundamentally solve the generalization problem. Our method reversely infers the underlying spin properties from observed cardiac MRI images, by solving ill-posed nonlinear inverse problems regularized by the prior distribution of spin properties. We acquire this "spin prior" by learning a generative diffusion model from the multiparametric SAturation-recovery single-SHot acquisition sequence (mSASHA) dataset, which offers joint cardiac T1 and T2 maps. Our method enables approximate but meaningful spin-property estimates from MR images, which provide an interpretable "latent variable" that leads to highly flexible image synthesis of arbitrary novel sequences.
We show that Reverse Imaging enables highly accurate segmentation across vastly different image contrasts and imaging protocols, realizing wide-spectrum generalization of cardiac MRI segmentation.
[52] PHD: Personalized 3D Human Body Fitting with Point Diffusion
Hsuan-I Ho,Chen Guo,Po-Chen Wu,Ivan Shugurov,Chengcheng Tang,Abhay Mittal,Sizhe An,Manuel Kaufmann,Linguang Zhang
Main category: cs.CV
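TL;DR: PHD is a new approach to personalized 3D human mesh recovery and body fitting that leverages user-specific shape information to improve the accuracy of pose estimation from videos.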
Details
Motivation: Traditional HMR methods are designed to be user-agnostic and optimized for generalization. Although they often refine poses using constraints derived from the 2D image to improve alignment, this process compromises 3D accuracy because it fails to jointly account for person-specific body shape and the plausibility of 3D poses. Method: The method first calibrates the user's body shape and then applies a personalized pose fitting process conditioned on that shape. A body shape-conditioned 3D pose prior iteratively guides the pose fitting via a Point Distillation Sampling loss. Result: The method improves both pelvis-aligned and absolute pose accuracy. It is also highly data-efficient, requiring only synthetic data for training, and works as a versatile plug-and-play module that integrates with existing 3D pose estimators. Conclusion: PHD is a personalized approach to 3D human mesh recovery and body fitting that leverages user-specific shape information to improve the accuracy of pose estimation from videos. Abstract: We introduce PHD, a novel approach for personalized 3D human mesh recovery (HMR) and body fitting that leverages user-specific shape information to improve pose estimation accuracy from videos. Traditional HMR methods are designed to be user-agnostic and optimized for generalization. While these methods often refine poses using constraints derived from the 2D image to improve alignment, this process compromises 3D accuracy by failing to jointly account for person-specific body shapes and the plausibility of 3D poses. In contrast, our pipeline decouples this process by first calibrating the user's body shape and then employing a personalized pose fitting process conditioned on that shape. To achieve this, we develop a body shape-conditioned 3D pose prior, implemented as a Point Diffusion Transformer, which iteratively guides the pose fitting via a Point Distillation Sampling loss. This learned 3D pose prior effectively mitigates errors arising from an over-reliance on 2D constraints. Consequently, our approach improves not only pelvis-aligned pose accuracy but also absolute pose accuracy, an important metric often overlooked by prior work. Furthermore, our method is highly data-efficient, requiring only synthetic data for training, and serves as a versatile plug-and-play module that can be seamlessly integrated with existing 3D pose estimators to enhance their performance. Project page: https://phd-pose.github.io/
[53] Efficient Diffusion-Based 3D Human Pose Estimation with Hierarchical Temporal Pruning
Yuquan Bi,Hongsong Wang,Xinli Shi,Zhipeng Gui,Jie Gui,Yuan Yan Tang
Main category: cs.CV
TL;DR: This paper proposes a Hierarchical Temporal Pruning (HTP) strategy to reduce computational costs in diffusion-based 3D human pose estimation while maintaining high performance.
Details
Motivation: Diffusion models have strong capabilities in generating high-fidelity 3D human poses, but their iterative nature and multi-hypothesis requirements incur substantial computational cost. Method: Proposed an Efficient Diffusion-Based 3D Human Pose Estimation framework with a Hierarchical Temporal Pruning (HTP) strategy. Result: Experiments on Human3.6M and MPI-INF-3DHP show that HTP reduces training MACs by 38.5%, inference MACs by 56.8%, and improves inference speed by an average of 81.1% compared to prior diffusion-based methods, while achieving state-of-the-art performance. Conclusion: Hierarchical Temporal Pruning (HTP) strategy can effectively reduce computational costs while maintaining high performance in 3D human pose estimation. Abstract: Diffusion models have demonstrated strong capabilities in generating high-fidelity 3D human poses, yet their iterative nature and multi-hypothesis requirements incur substantial computational cost. In this paper, we propose an Efficient Diffusion-Based 3D Human Pose Estimation framework with a Hierarchical Temporal Pruning (HTP) strategy, which dynamically prunes redundant pose tokens across both frame and semantic levels while preserving critical motion dynamics. HTP operates in a staged, top-down manner: (1) Temporal Correlation-Enhanced Pruning (TCEP) identifies essential frames by analyzing inter-frame motion correlations through adaptive temporal graph construction; (2) Sparse-Focused Temporal MHSA (SFT MHSA) leverages the resulting frame-level sparsity to reduce attention computation, focusing on motion-relevant tokens; and (3) Mask-Guided Pose Token Pruner (MGPTP) performs fine-grained semantic pruning via clustering, retaining only the most informative pose tokens. 
Experiments on Human3.6M and MPI-INF-3DHP show that HTP reduces training MACs by 38.5%, inference MACs by 56.8%, and improves inference speed by an average of 81.1% compared to prior diffusion-based methods, while achieving state-of-the-art performance.
[54] Print2Volume: Generating Synthetic OCT-based 3D Fingerprint Volume from 2D Fingerprint Image
Qingran Miao,Haixia Wang,Haohao Sun,Yilong Zhang
Main category: cs.CV
TL;DR: Print2Volume addresses the lack of OCT-based 3D fingerprint data by generating realistic synthetic samples, significantly improving biometric recognition performance.
Details
Motivation: The scarcity of large-scale public OCT datasets due to high costs and time-consuming data acquisition hinders the development of advanced algorithms, especially deep learning models. Method: Print2Volume operates in three stages: 2D style transfer, 3D structure expansion, and OCT realism refinement using a 3D GAN. Result: The framework generated a dataset of 420,000 synthetic samples, achieving a reduction in Equal Error Rate (EER) from 15.62% to 2.50% on the ZJUT-EIFD benchmark. Conclusion: Print2Volume is an effective solution for generating high-quality synthetic OCT-based 3D fingerprints, significantly improving recognition performance while overcoming the limitations of data scarcity. Abstract: Optical Coherence Tomography (OCT) enables the acquisition of high-resolution, three-dimensional fingerprint data, capturing rich subsurface structures for robust biometric recognition. However, the high cost and time-consuming nature of OCT data acquisition have led to a scarcity of large-scale public datasets, significantly hindering the development of advanced algorithms, particularly data-hungry deep learning models. To address this critical bottleneck, this paper introduces Print2Volume, a novel framework for generating realistic, synthetic OCT-based 3D fingerprints from a 2D fingerprint image. Our framework operates in three sequential stages: (1) a 2D style transfer module that converts a binary fingerprint into a grayscale image mimicking the style of a Z-direction mean-projected OCT scan; (2) a 3D Structure Expansion Network that extrapolates the 2D image into a plausible 3D anatomical volume; and (3) an OCT Realism Refiner, based on a 3D GAN, that renders the structural volume with authentic textures, speckle noise, and other imaging characteristics. Using Print2Volume, we generated a large-scale synthetic dataset of 420,000 samples.
Quantitative experiments demonstrate the high quality of our synthetic data and its significant impact on recognition performance. By pre-training a recognition model on our synthetic data and fine-tuning it on a small real-world dataset, we achieved a remarkable reduction in the Equal Error Rate (EER) from 15.62% to 2.50% on the ZJUT-EIFD benchmark, proving the effectiveness of our approach in overcoming data scarcity.
[55] GLENDA: Gynecologic Laparoscopy Endometriosis Dataset
Andreas Leibetseder,Sabrina Kletz,Klaus Schoeffmann,Simon Keckstein,Jörg Keckstein
Main category: cs.CV
TL;DR: This paper introduces GLENDA, the first dataset with region-based annotations for endometriosis in gynecologic laparoscopy, aimed at improving machine learning applications in this medical domain.
Details
Motivation: The manual analysis of surgical recordings in gynecologic laparoscopy is time-consuming and current machine learning approaches are limited by sparse data availability. This work aims to address these challenges by introducing a specialized dataset. Method: The authors collaborated with leading medical experts to create the Gynecologic Laparoscopy ENdometriosis DAtaset (GLENDA), which includes region-based annotations for endometriosis. Result: The GLENDA dataset was developed, containing region-based annotations for endometriosis, marking the first dataset of its kind in this medical field. Conclusion: The publication of the GLENDA dataset aims to enhance the development of computer vision and machine learning approaches in gynecologic laparoscopy by providing the first region-based annotated dataset for endometriosis. Abstract: Gynecologic laparoscopy as a type of minimally invasive surgery (MIS) is performed via a live feed of a patient's abdomen surveying the insertion and handling of various instruments for conducting treatment. Adopting this kind of surgical intervention not only facilitates a great variety of treatments, the possibility of recording said video streams is as well essential for numerous post-surgical activities, such as treatment planning, case documentation and education. Nonetheless, the process of manually analyzing surgical recordings, as it is carried out in current practice, usually proves tediously time-consuming. In order to improve upon this situation, more sophisticated computer vision as well as machine learning approaches are actively developed. Since most of such approaches heavily rely on sample data, which especially in the medical field is only sparsely available, with this work we publish the Gynecologic Laparoscopy ENdometriosis DAtaset (GLENDA) - an image dataset containing region-based annotations of a common medical condition named endometriosis, i.e. the dislocation of uterine-like tissue. 
The dataset is the first of its kind and it has been created in collaboration with leading medical experts in the field.
[56] Identifying Surgical Instruments in Laparoscopy Using Deep Learning Instance Segmentation
Sabrina Kletz,Klaus Schoeffmann,Jenny Benois-Pineau,Heinrich Husslein
Main category: cs.CV
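TL;DR: This paper studies a region-based fully convolutional network for surgical instrument segmentation and recognition; segmentation works well, but recognition remains challenging because the instruments look alike.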
Details
Motivation: Automatic content indexing of surgical videos is essential for medical video archiving and retrieval, but the very special video content makes it challenging. Method: A region-based fully convolutional network is used for instance-aware surgical instrument segmentation and recognition, and its performance is evaluated. Result: With a moderately low number of training examples, instrument regions are localized and segmented with high accuracy, but recognizing the instrument type remains difficult. Conclusion: Although region-based convolutional networks can perform instance segmentation of surgical instruments efficiently, instrument recognition remains challenging because surgical instruments are inherently highly similar. Abstract: Recorded videos from surgeries have become an increasingly important information source for the field of medical endoscopy, since the recorded footage shows every single detail of the surgery. However, while video recording is straightforward these days, automatic content indexing - the basis for content-based search in a medical video archive - is still a great challenge due to the very special video content. In this work, we investigate segmentation and recognition of surgical instruments in videos recorded from laparoscopic gynecology. More precisely, we evaluate the achievable performance of segmenting surgical instruments from their background by using a region-based fully convolutional network for instance-aware (1) instrument segmentation as well as (2) instrument recognition. While the first part addresses only binary segmentation of instances (i.e., distinguishing between instrument or background) we also investigate multi-class instrument recognition (i.e., identifying the type of instrument). Our evaluation results show that even with a moderately low number of training examples, we are able to localize and segment instrument regions with a pretty high accuracy. However, the results also reveal that determining the particular instrument is still very challenging, due to the inherently high similarity of surgical instruments.
[57] SatDINO: A Deep Dive into Self-Supervised Pretraining for Remote Sensing
Jakub Straka,Ivan Gruber
Main category: cs.CV
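TL;DR: SatDINO is a contrastive self-supervised model tailored for representation learning on satellite imagery; it performs strongly on multiple benchmarks and introduces new enhancements such as GSD encoding and adaptive view sampling.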
Details
Motivation: Remote sensing offers large amounts of unlabeled data, which calls for powerful pretraining tools to improve model performance. Method: The authors introduce SatDINO, based on the contrastive self-supervised method DINO, and propose enhancements such as GSD encoding and adaptive view sampling. Result: SatDINO outperforms existing state-of-the-art methods based on masked autoencoders (MAE) on multiple datasets and achieves competitive results on multiple benchmarks. Conclusion: SatDINO is an effective approach to representation learning for remote sensing imagery, with enhancements that can be used independently; the code and models are publicly available. Abstract: Self-supervised learning has emerged as a powerful tool for remote sensing, where large amounts of unlabeled data are available. In this work, we investigate the use of DINO, a contrastive self-supervised method, for pretraining on remote sensing imagery. We introduce SatDINO, a model tailored for representation learning in satellite imagery. Through extensive experiments on multiple datasets in multiple testing setups, we demonstrate that SatDINO outperforms other state-of-the-art methods based on much more common masked autoencoders (MAE) and achieves competitive results in multiple benchmarks. We also provide a rigorous ablation study evaluating SatDINO's individual components. Finally, we propose a few novel enhancements, such as a new way to incorporate ground sample distance (GSD) encoding and adaptive view sampling. These enhancements can be used independently on our SatDINO model. Our code and trained models are available at: https://github.com/strakaj/SatDINO.
[58] Standardized Multi-Layer Tissue Maps for Enhanced Artificial Intelligence Integration and Search in Large-Scale Whole Slide Image Archives
Gernot Fiala,Markus Plass,Robert Harb,Peter Regitnig,Kristijan Skok,Wael Al Zoughbi,Carmen Zerner,Paul Torke,Michaela Kargl,Heimo Müller,Tomas Brazdil,Matej Gallo,Jaroslav Kubín,Roman Stoklasa,Rudolf Nenutil,Norman Zerbe,Andreas Holzinger,Petr Holub
Main category: cs.CV
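TL;DR: The paper proposes a general framework for content analysis and indexing of whole slide images (WSIs) that generates tissue maps and index maps to make large-scale WSI data more usable and interoperable.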
Details
Motivation: Training and validating AI algorithms requires accurate metadata describing WSI content, but no standard currently exists and selection relies mainly on manual inspection, which does not scale to large collections. Method: A framework is proposed that generates 2D index maps and tissue maps; the tissue map is organized into three layers (source, tissue type, and pathological alterations) and provides fine-grained information about WSI content. Result: A general framework was developed that generates 2D index maps and tissue maps for WSIs, and its advantages are shown in WSI catalogs, machine learning, and graph-based WSI representations. Conclusion: The paper proposes a general framework for generating WSI index maps and domain-specific profiles, and demonstrates its interoperability and practicality through an application in clinical pathology. Abstract: A Whole Slide Image (WSI) is a high-resolution digital image created by scanning an entire glass slide containing a biological specimen, such as tissue sections or cell samples, at multiple magnifications. These images can be viewed, analyzed, shared digitally, and are used today for Artificial Intelligence (AI) algorithm development. WSIs are used in a variety of fields, including pathology for diagnosing diseases and oncology for cancer research. They are also utilized in neurology, veterinary medicine, hematology, microbiology, dermatology, pharmacology, toxicology, immunology, and forensic science. When assembling cohorts for the training or validation of an AI algorithm, it is essential to know what is present on such a WSI. However, there is currently no standard for this metadata, so such selection has mainly been done through manual inspection, which is not suitable for large collections with several million objects. We propose a general framework to generate a 2D index map for WSI and a profiling mechanism for specific application domains. We demonstrate this approach in the field of clinical pathology, using common syntax and semantics to achieve interoperability between different catalogs. Our approach augments each WSI collection with a detailed tissue map that provides fine-grained information about the WSI content. The tissue map is organized into three layers: source, tissue type, and pathological alterations, with each layer assigning segments of the WSI to specific classes.
We illustrate the advantages and applicability of the proposed standard through specific examples in WSI catalogs, Machine Learning (ML), and graph-based WSI representations.
[59] Unsupervised Incremental Learning Using Confidence-Based Pseudo-Labels
Lucas Rakotoarivony
Main category: cs.CV
TL;DR: ICPL proposes an unsupervised incremental learning approach using pseudo-labels, effectively reducing reliance on labeled data and outperforming existing methods in accuracy while being computationally efficient.
Details
Motivation: Existing Class-Incremental Learning (CIL) methods rely on fully labeled datasets, which is unrealistic in real-world scenarios. This work aims to enable incremental learning from unlabeled datasets by replacing human annotations with pseudo-labels. Method: ICPL (unsupervised Incremental Learning using Confidence-based Pseudo-labels) generates pseudo-labels to replace human annotations, integrates them into CIL methods with confidence-based selection, and evaluates performance on datasets like CIFAR100, ImageNet100, and fine-grained datasets while measuring computational complexity. Result: ICPL achieves competitive results compared to supervised CIL methods and outperforms class-iNCD methods by more than 5% in final accuracy, while also demonstrating efficiency in computational complexity and real-world applicability on fine-grained datasets. Conclusion: ICPL is a practical and efficient unsupervised incremental learning method that outperforms state-of-the-art class-iNCD methods by more than 5% in final accuracy, making it suitable for real-world and resource-constrained environments. Abstract: Deep learning models have achieved state-of-the-art performance in many computer vision tasks. However, in real-world scenarios, novel classes that were unseen during training often emerge, requiring models to acquire new knowledge incrementally. Class-Incremental Learning (CIL) methods enable a model to learn novel classes while retaining knowledge of previous classes. However, these methods make the strong assumption that the incremental dataset is fully labeled, which is unrealistic in practice. In this work, we propose an unsupervised Incremental Learning method using Confidence-based Pseudo-labels (ICPL), which replaces human annotations with pseudo-labels, enabling incremental learning from unlabeled datasets. 
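The confidence-based selection of pseudo-labels mentioned above can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say how confidence is computed, so the code assumes the maximum softmax probability as the confidence and a fixed cut-off; the names `select_pseudo_labels` and `threshold` are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def select_pseudo_labels(batch_logits, threshold=0.9):
    """Keep only samples whose top softmax probability reaches the threshold.

    Assumption: confidence = max softmax probability (not stated in the source).
    Returns a list of (sample_index, pseudo_label, confidence) tuples.
    """
    selected = []
    for i, logits in enumerate(batch_logits):
        probs = softmax(logits)
        confidence = max(probs)
        label = probs.index(confidence)
        if confidence >= threshold:
            selected.append((i, label, confidence))
    return selected


# Three unlabeled samples scored by a hypothetical classifier over 3 classes.
logits = [[4.0, 0.0, 0.0], [0.2, 0.1, 0.0], [0.0, 0.0, 5.0]]
picked = select_pseudo_labels(logits, threshold=0.9)
```

In this toy batch the second sample has a nearly flat distribution and is dropped, so only the first and third samples would be passed on to the incremental learner with pseudo-labels 0 and 2.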
We integrate these pseudo-labels into various CIL methods with confidence-based selection and evaluate performance degradation on CIFAR100 and ImageNet100. Then, we compare our approach to popular Class Incremental Novel Category Discovery (class-iNCD) methods addressing similar challenges. Additionally, we apply our method to fine-grained datasets to demonstrate its real-world practicality and measure its computational complexity to validate its suitability for resource-constrained environments. ICPL achieves competitive results compared to supervised methods and outperforms state-of-the-art class-iNCD methods by more than 5% in final accuracy.[60] MedShift: Implicit Conditional Transport for X-Ray Domain Adaptation
Francisco Caetano,Christiaan Viviers,Peter H. H. de With,Fons van der Sommen
Main category: cs.CV
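TL;DR: This paper proposes MedShift, a unified class-conditional generative model for high-fidelity image translation between synthetic and real X-ray images, addressing domain adaptation in medical imaging.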
Details
Motivation: Synthetic medical data has potential for training robust models, but its generalization to real clinical settings is limited by significant domain gaps. This paper addresses cross-domain translation between synthetic and real X-ray images. Method: MedShift, a unified class-conditional generative model based on Flow Matching and Schrodinger Bridges, is proposed, and a new dataset, X-DigiSkull, is introduced for benchmarking. Result: Experiments show that, despite its smaller model size, MedShift performs strongly on image translation and remains flexible at inference time, where it can be tuned between perceptual fidelity and structural consistency. Conclusion: MedShift provides a scalable and generalizable solution for domain adaptation in medical imaging. Abstract: Synthetic medical data offers a scalable solution for training robust models, but significant domain gaps limit its generalizability to real-world clinical settings. This paper addresses the challenge of cross-domain translation between synthetic and real X-ray images of the head, focusing on bridging discrepancies in attenuation behavior, noise characteristics, and soft tissue representation. We propose MedShift, a unified class-conditional generative model based on Flow Matching and Schrodinger Bridges, which enables high-fidelity, unpaired image translation across multiple domains. Unlike prior approaches that require domain-specific training or rely on paired data, MedShift learns a shared domain-agnostic latent space and supports seamless translation between any pair of domains seen during training. We introduce X-DigiSkull, a new dataset comprising aligned synthetic and real skull X-rays under varying radiation doses, to benchmark domain translation models. Experimental results demonstrate that, despite its smaller model size compared to diffusion-based approaches, MedShift offers strong performance and remains flexible at inference time, as it can be tuned to prioritize either perceptual fidelity or structural consistency, making it a scalable and generalizable solution for domain adaptation in medical imaging. The code and dataset are available at https://caetas.github.io/medshift.html [61] Trees as Gaussians: Large-Scale Individual Tree Mapping
Dimitri Gominski,Martin Brandt,Xiaoye Tong,Siyu Liu,Maurice Mugabowindekwe,Sizhuo Li,Florian Reiner,Andrew Davies,Rasmus Fensholt
Main category: cs.CV
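TL;DR: This study develops a deep learning method that detects individual trees globally from high-resolution satellite imagery, with high accuracy and broad adaptability, and can support tree monitoring with future satellite missions.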
Details
Motivation: Current global tree monitoring products focus mainly on binary tree cover or canopy height and cannot explicitly identify individual trees, so a new method is needed to improve large-scale monitoring of individual trees. Method: The study uses a deep learning approach to detect large individual trees in 3-m resolution PlanetScope imagery, training the model on billions of points automatically extracted from airborne lidar data. Tree crowns are simulated with Gaussian kernels, from which crown centers are extracted and binary tree cover maps are generated. Result: The method performs well against existing tree cover maps and airborne lidar data (fractional cover R² = 0.81 against aerial lidar), shows balanced detection performance across biomes, and demonstrates that fine-tuning with manual labels can further improve detection. Conclusion: The study presents a deep-learning-based method for detecting large individual trees globally from high-resolution satellite imagery, providing a scalable framework for global tree monitoring with strong performance across biomes. Abstract: Trees are key components of the terrestrial biosphere, playing vital roles in ecosystem function, climate regulation, and the bioeconomy. However, large-scale monitoring of individual trees remains limited by inadequate modelling. Available global products have focused on binary tree cover or canopy height, which do not explicitly identify trees at the individual level. In this study, we present a deep learning approach for detecting large individual trees in 3-m resolution PlanetScope imagery at a global scale. We simulate tree crowns with Gaussian kernels of scalable size, allowing the extraction of crown centers and the generation of binary tree cover maps. Training is based on billions of points automatically extracted from airborne lidar data, enabling the model to successfully identify trees both inside and outside forests. We compare against existing tree cover maps and airborne lidar with state-of-the-art performance (fractional cover $R^2 = 0.81$ against aerial lidar), report balanced detection metrics across biomes, and demonstrate how detection can be further improved through fine-tuning with manual labels. Our method offers a scalable framework for global, high-resolution tree monitoring, and is adaptable to future satellite missions offering improved imagery.[62] Scale-GS: Efficient Scalable Gaussian Splatting via Redundancy-filtering Training on Streaming Content
Jiayu Yang,Weijian Su,Songqian Zhang,Yuqi Han,Jinli Suo,Qiang Zhang
Main category: cs.CV
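TL;DR: This paper proposes Scale-GS, an efficient 3D Gaussian Splatting framework for streaming tasks that improves training efficiency and visual quality through hierarchically organized Gaussian spheres, a hybrid deformation and spawning strategy, and an adaptive masking mechanism.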
Details
Motivation: The application of 3D Gaussian Splatting to dynamic scenes is limited by the large data volume of dense Gaussians and the long per-frame training time, so an efficient training framework is needed. Method: Gaussian spheres are organized hierarchically by scale within an anchor-based structure, inter-frame motion is modeled with a hybrid deformation and spawning strategy, and a bidirectional adaptive masking mechanism is introduced to improve training efficiency. Result: Experiments show that, compared with state-of-the-art methods, the approach achieves better visual quality while significantly reducing training time. Conclusion: The paper proposes Scale-GS, a scalable Gaussian Splatting framework for efficient training in streaming tasks, addressing the limitations of existing 3D Gaussian Splatting in dynamic scenes. Abstract: 3D Gaussian Splatting (3DGS) enables high-fidelity real-time rendering, a key requirement for immersive applications. However, the extension of 3DGS to dynamic scenes remains limited by the substantial data volume of dense Gaussians and the prolonged training time required for each frame. This paper presents Scale-GS, a scalable Gaussian Splatting framework designed for efficient training in streaming tasks. Specifically, Gaussian spheres are hierarchically organized by scale within an anchor-based structure. Coarser-level Gaussians represent the low-resolution structure of the scene, while finer-level Gaussians, responsible for detailed high-fidelity rendering, are selectively activated by the coarser-level Gaussians. To further reduce computational overhead, we introduce a hybrid deformation and spawning strategy that models inter-frame motion through Gaussian deformation and triggers Gaussian spawning to characterize wide-range motion. Additionally, a bidirectional adaptive masking mechanism enhances training efficiency by removing static regions and prioritizing informative viewpoints. Extensive experiments demonstrate that Scale-GS achieves superior visual quality while significantly reducing training time compared to state-of-the-art methods.[63] One More Glance with Sharp Eyes: Rethinking Lightweight Captioning as a Practical Visual Specialist
Junha Song,Yongsik Jo,So Yeon Min,Quanting Xie,Taehwan Kim,Yonatan Bisk,Jaegul Choo
Main category: cs.CV
TL;DR: A lightweight image captioning model with the Sharp-Eyed Refinement framework achieves competitive performance by improving visual grounding and reducing errors caused by ineffective attention and limited visual representation.
Details
Motivation: Deploying multimodal large language models (MLLMs) on local devices is challenging due to their high computational demands, prompting the need for lightweight and efficient alternatives. Method: The authors implemented a lightweight 125M-parameter language model and introduced the Sharp-Eyed Refinement framework, which utilizes DeepLens to extract detailed visual representations from informative regions. Result: The lightweight model achieved performance comparable to large MLLMs but suffered from visual blindness; the proposed Sharp-Eyed Refinement framework significantly improved captioning accuracy by enhancing visual grounding. Conclusion: The Sharp-Eyed Refinement framework effectively improves caption quality through enhanced visual grounding, making the smaller model outperform both prior small models and larger generalists. Abstract: Image captioning is fundamental for applications like video instruction systems and exploration robots, yet deploying such models on local devices is challenging due to the high computational demands of multimodal large language models (MLLMs). To address this, we first explore lightweight captioning by implementing a specialist based on a 125M-parameter language model, 56 times smaller than LLaMA-7B, and evaluating its performance on both single-sentence and detailed captioning tasks. Surprisingly, we find that our model can achieve performance comparable to large multimodal generalists, suggesting its potential to serve as a strong visual specialist for on-device applications. While promising, our model also exhibits a limitation: like other MLLMs, it suffers from visual blindness, occasionally resulting in semantic captioning errors. We carry out toy experiments and investigate the underlying causes, where we observe that the problems arise from ineffective attention mechanisms and limited visual representations. 
To alleviate them, we develop a novel captioning framework, Sharp-Eyed Refinement, which enhances caption quality through improved visual grounding. At its core, our DeepLens extracts detailed visual representations by concentrating on informative regions identified during the initial glance. Our experiments confirm both the advantages of our specialist over prior small captioning models and large generalists and the effectiveness of our framework.[64] Federated Fine-tuning of SAM-Med3D for MRI-based Dementia Classification
Kaouther Mouheb,Marawan Elbatel,Janne Papma,Geert Jan Biessels,Jurgen Claassen,Huub Middelkoop,Barbara van Munster,Wiesje van der Flier,Inez Ramakers,Stefan Klein,Esther E. Bron
Main category: cs.CV
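TL;DR: This study evaluates key design choices for fine-tuning foundation models in federated learning, revealing how classification head architecture, encoder freezing, and advanced aggregation methods affect performance and efficiency.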
Details
Motivation: Although foundation models offer great potential for AI-based dementia diagnosis, their integration into federated learning systems remains underexplored. Method: The study systematically evaluates how classification head architecture, fine-tuning strategy, and aggregation method affect the performance and efficiency of federated foundation model fine-tuning on brain MRI data. Result: The classification head architecture has a substantial effect on performance, freezing the foundation model encoder performs comparably to full fine-tuning, and advanced aggregation methods outperform standard federated averaging. Conclusion: Several key design choices strongly influence foundation model fine-tuning in federated learning systems, and the study offers practical insights for deploying foundation models in decentralized clinical settings. Abstract: While foundation models (FMs) offer strong potential for AI-based dementia diagnosis, their integration into federated learning (FL) systems remains underexplored. In this benchmarking study, we systematically evaluate the impact of key design choices: classification head architecture, fine-tuning strategy, and aggregation method, on the performance and efficiency of federated FM tuning using brain MRI data. Using a large multi-cohort dataset, we find that the architecture of the classification head substantially influences performance, freezing the FM encoder achieves comparable results to full fine-tuning, and advanced aggregation methods outperform standard federated averaging. Our results offer practical insights for deploying FMs in decentralized clinical settings and highlight trade-offs that should guide future method development.[65] Multi-Method Ensemble for Out-of-Distribution Detection
Lucas Rakotoarivony
Main category: cs.CV
TL;DR: This paper proposes a Multi-Method Ensemble (MME) score that combines feature truncation and multiple scoring functions for improved out-of-distribution (OOD) detection. Experiments show MME significantly outperforms existing methods across various benchmarks.
Details
Motivation: Existing OOD detection methods typically focus on either feature truncation or scoring functions and are evaluated on limited OOD datasets. This work aims to combine these techniques to improve robustness and effectiveness across different OOD scenarios. Method: The method involves combining state-of-the-art feature truncation techniques with multiple scoring functions into a unified scoring framework called Multi-Method Ensemble (MME). Result: Extensive experiments show that MME significantly outperforms recent state-of-the-art methods on both large-scale and small-scale benchmarks, including near-OOD and far-OOD scenarios. Using the BiT model, MME achieves an average FPR95 of 27.57% on ImageNet-1K, improving performance by 6% over the best existing baseline. Conclusion: The proposed Multi-Method Ensemble (MME) score combines feature truncation and scoring functions to improve OOD detection, demonstrating robustness and superior performance across multiple benchmarks. Abstract: Detecting out-of-distribution (OOD) samples is essential for neural networks operating in open-world settings, particularly in safety-critical applications. Existing methods have improved OOD detection by leveraging two main techniques: feature truncation, which increases the separation between in-distribution (ID) and OOD samples, and scoring functions, which assign scores to distinguish between ID and OOD data. However, most approaches either focus on a single family of techniques or evaluate their effectiveness on a specific type of OOD dataset, overlooking the potential of combining multiple existing solutions. Motivated by this observation, we theoretically and empirically demonstrate that state-of-the-art feature truncation and scoring functions can be effectively combined. Moreover, we show that aggregating multiple scoring functions enhances robustness against various types of OOD samples. 
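To make the idea of combining feature truncation with several scoring functions concrete, here is a small sketch. It is not the MME score itself, whose exact form the abstract does not give. It assumes one common kind of truncation (clipping activations at an upper bound), two common scores (the log-sum-exp "energy" of the logits and the maximum softmax probability), and an unweighted average as the aggregation; `cap`, `ensemble_score`, and the toy weights are hypothetical.

```python
import math


def truncate(features, cap):
    """Feature truncation: clip every activation at an upper bound `cap`."""
    return [min(f, cap) for f in features]


def linear(features, weights, bias):
    """Classification head: logits = W @ features + b (pure-Python version)."""
    return [sum(w * f for w, f in zip(row, features)) + b
            for row, b in zip(weights, bias)]


def energy_score(logits):
    """Log-sum-exp of the logits; larger values suggest in-distribution."""
    m = max(logits)
    return m + math.log(sum(math.exp(z - m) for z in logits))


def msp_score(logits):
    """Maximum softmax probability; larger values suggest in-distribution."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    return max(exps) / sum(exps)


def ensemble_score(features, weights, bias, cap=1.0):
    """Assumed aggregation: plain average of the two scores after truncation.

    The scores live on different scales, so a real system would likely
    normalize them first; that step is omitted here for brevity.
    """
    logits = linear(truncate(features, cap), weights, bias)
    return 0.5 * energy_score(logits) + 0.5 * msp_score(logits)


# Toy two-class head and two hypothetical penultimate-layer feature vectors.
weights = [[5.0, 0.0], [0.0, 5.0]]
bias = [0.0, 0.0]
id_score = ensemble_score([1.0, 0.0], weights, bias)   # peaked, ID-like
ood_score = ensemble_score([0.1, 0.1], weights, bias)  # flat, OOD-like
```

A sample would then be flagged as OOD when its score falls below a threshold chosen on held-out in-distribution data, for example the threshold that keeps 95% of in-distribution samples, which is the operating point behind the FPR95 metric.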
Based on these insights, we propose the Multi-Method Ensemble (MME) score, which unifies state-of-the-art OOD detectors into a single, more effective scoring function. Extensive experiments on both large-scale and small-scale benchmarks, covering near-OOD and far-OOD scenarios, show that MME significantly outperforms recent state-of-the-art methods across all benchmarks. Notably, using the BiT model, our method achieves an average FPR95 of 27.57% on the challenging ImageNet-1K benchmark, improving performance by 6% over the best existing baseline.[66] Why Stop at Words? Unveiling the Bigger Picture through Line-Level OCR
Shashank Vempati,Nishit Anand,Gaurav Talebailkar,Arpan Garai,Chetan Arora
Main category: cs.CV
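TL;DR: This paper proposes a line-level OCR approach that improves OCR accuracy and efficiency by bypassing the error-prone word detection step and providing larger sentence context for better use of language models.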
Details
Motivation: The authors observe that the shift from character segmentation to word segmentation has moved the accuracy bottleneck to word segmentation. They therefore propose a progression from word-level OCR to line-level OCR. Method: The approach performs line-level OCR by bypassing the error-prone word detection step and providing larger sentence context for better use of language models. Result: Experiments show a 5.4% improvement in end-to-end accuracy and a 4-fold improvement in efficiency over word-level OCR. The authors also contribute a dataset of 251 English page images with line-level annotations. Conclusion: The paper proposes a natural and logical progression from word-level OCR to line-level OCR that improves both accuracy and efficiency, and that has the potential to benefit from advances in large language models. Abstract: Conventional optical character recognition (OCR) techniques segmented each character and then recognized it. This made them prone to errors in character segmentation, and devoid of context to exploit language models. Advances in sequence-to-sequence translation in the last decade led to modern techniques that first detect words and then input one word at a time to a model to directly output full words as a sequence of characters. This allowed better utilization of language models and bypassed the error-prone character segmentation step. We observe that the above transition in style has moved the bottleneck in accuracy to word segmentation. Hence, in this paper, we propose a natural and logical progression from word-level OCR to line-level OCR. The proposal bypasses errors in word detection and provides larger sentence context for better utilization of language models. We show that the proposed technique improves not only the accuracy but also the efficiency of OCR. Despite our thorough literature survey, we did not find any public dataset to train and benchmark such a shift from word-level to line-level OCR. Hence, we also contribute a meticulously curated dataset of 251 English page images with line-level annotations. Our experimentation revealed a notable end-to-end accuracy improvement of 5.4%, underscoring the potential benefits of transitioning towards line-level OCR, especially for document images. We also report a 4 times improvement in efficiency compared to word-based pipelines. With continuous improvements in large language models, our methodology also holds potential to exploit such advances. 
Project Website: https://nishitanand.github.io/line-level-ocr-website [67] Adversarial Patch Attack for Ship Detection via Localized Augmentation
Chun Liu,Panpan Ding,Zheng Zheng,Hailong Wang,Bingqian Zhu,Tao Xu,Zhigang Han,Jiayao Wang
Main category: cs.CV
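TL;DR: The paper proposes a localized augmentation method that augments only the target regions, reducing background interference and thereby improving the success rate and transferability of adversarial patch attacks.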
Details
Motivation: The paper is motivated by the vulnerability of DNN-based ship detection to adversarial patch attacks, and by the need to avoid the unnecessary interference that data transformation-based methods introduce when they over-augment image backgrounds or irrelevant regions. Method: Augmentation is applied only to the target regions, leaving non-target regions unaffected; this reduces background interference and lets the loss function focus more directly on the effect of the adversarial patch on the detection model. Result: Experiments show that the method effectively improves the success rate and transferability of adversarial patch attacks. Conclusion: Localized augmentation effectively improves the success rate and transferability of adversarial patch attacks. Abstract: Current ship detection techniques based on remote sensing imagery primarily rely on the object detection capabilities of deep neural networks (DNNs). However, DNNs are vulnerable to adversarial patch attacks, which can lead to misclassification by the detection model or complete evasion of the targets. Numerous studies have demonstrated that data transformation-based methods can improve the transferability of adversarial examples. However, excessive augmentation of image backgrounds or irrelevant regions may introduce unnecessary interference, resulting in false detections by the object detection model. These errors are not caused by the adversarial patches themselves but rather by the over-augmentation of background and non-target areas. This paper proposes a localized augmentation method that applies augmentation only to the target regions, avoiding any influence on non-target areas. By reducing background interference, this approach enables the loss function to focus more directly on the impact of the adversarial patch on the detection model, thereby improving the attack success rate. Experiments conducted on the HRSC2016 dataset demonstrate that the proposed method effectively increases the success rate of adversarial patch attacks and enhances their transferability.[68] ELV-Halluc: Benchmarking Semantic Aggregation Hallucinations in Long Video Understanding
Hao Lu,Jiahao Wang,Yaolun Zhang,Ruohui Wang,Xuanyu Zheng,Yepeng Tang,Dahua Lin,Lewei Lu
Main category: cs.CV
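TL;DR: The paper introduces the ELV-Halluc benchmark and adversarial data to study Semantic Aggregation Hallucination (SAH) in long videos, and effectively mitigates SAH through positional encoding and a DPO strategy.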
Details
Motivation: Current video hallucination benchmarks focus mainly on short videos, whereas long videos, with semantic complexity spread across multiple events, exhibit a specific Semantic Aggregation Hallucination (SAH) that needs dedicated study. Method: The paper introduces the ELV-Halluc benchmark for systematically studying SAH in long videos, and optimizes the model with 8K adversarial data pairs combined with a DPO strategy to strengthen its ability to distinguish semantics. Result: Experiments confirm that SAH exists and worsens as semantic complexity increases, and show that positional encoding and the DPO strategy significantly reduce the SAH ratio; the ELV-Halluc benchmark and adversarial data are effective for improving model performance. Conclusion: Semantic Aggregation Hallucination (SAH) is especially pronounced in long videos, increasing semantic complexity exacerbates it, and positional encoding and DPO strategies effectively mitigate it. Abstract: Video multimodal large language models (Video-MLLMs) have achieved remarkable progress in video understanding. However, they remain vulnerable to hallucination, producing content inconsistent with or unrelated to video inputs. Previous video hallucination benchmarks primarily focus on short videos. They attribute hallucinations to factors such as strong language priors, missing frames, or vision-language biases introduced by the visual encoder. While these causes indeed account for most hallucinations in short videos, they still oversimplify the cause of hallucinations. Sometimes, models generate incorrect outputs but with correct frame-level semantics. We refer to this type of hallucination as Semantic Aggregation Hallucination (SAH), which arises during the process of aggregating frame-level semantics into event-level semantic groups. Given that SAH becomes particularly critical in long videos due to increased semantic complexity across multiple events, it is essential to separate and thoroughly investigate the causes of this type of hallucination. To address the above issues, we introduce ELV-Halluc, the first benchmark dedicated to long-video hallucination, enabling a systematic investigation of SAH. Our experiments confirm the existence of SAH and show that it increases with semantic complexity. Additionally, we find that models are more prone to SAH on rapidly changing semantics. Moreover, we discuss potential approaches to mitigate SAH. We demonstrate that the positional encoding strategy contributes to alleviating SAH, and further adopt a DPO strategy to enhance the model's ability to distinguish semantics within and across events. 
To support this, we curate a dataset of 8K adversarial data pairs and achieve improvements on both ELV-Halluc and Video-MME, including a substantial 27.7% reduction in SAH ratio.[69] Maybe you don't need a U-Net: convolutional feature upsampling for materials micrograph segmentation
Ronan Docherty,Antonis Vamvakeros,Samuel J. Cooper
Main category: cs.CV
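TL;DR: This paper proposes using a convolutional neural network to upsample low-resolution features, significantly improving the quality and efficiency of micrograph segmentation.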
Details
Motivation: Existing patch-based feature descriptors struggle to represent fine details in micrograph analysis and face computational challenges with large images. Method: A convolutional neural network is trained to upsample low-resolution foundation model features, using the input image as a reference to refine the features. Result: The method efficiently featurises and segments a variety of microscopy images (such as plant cells, lithium-ion battery cathodes, and organic crystals), and performs particularly well at separating fine structures such as hairline cracks. Conclusion: The study proposes a convolutional upsampling method that improves the detail captured by low-resolution features, making them better suited to micrograph analysis and segmentation. Abstract: Feature foundation models - usually vision transformers - offer rich semantic descriptors of images, useful for downstream tasks such as (interactive) segmentation and object detection. For computational efficiency these descriptors are often patch-based, and so struggle to represent the fine features often present in micrographs; they also struggle with the large image sizes present in materials and biological image analysis. In this work, we train a convolutional neural network to upsample low-resolution (i.e., large patch size) foundation model features with reference to the input image. We apply this upsampler network (without any further training) to efficiently featurise and then segment a variety of microscopy images, including plant cells, a lithium-ion battery cathode and organic crystals. The richness of these upsampled features admits separation of hard-to-segment phases, like hairline cracks. We demonstrate that interactive segmentation with these deep features produces high-quality segmentations far faster and with far fewer labels than training or finetuning a more traditional convolutional network.[70] HCCM: Hierarchical Cross-Granularity Contrastive and Matching Learning for Natural Language-Guided Drones
Hao Ruan,Jinliang Lin,Yingxin Lai,Zhiming Luo,Shaozi Li
Main category: cs.CV
TL;DR: The paper proposes the Hierarchical Cross-Granularity Contrastive and Matching learning (HCCM) framework to improve vision-language understanding for Natural Language-Guided Drones (NLGD), achieving state-of-the-art results and strong generalization in dynamic environments.
Details
Motivation: The limitations of mainstream Vision-Language Models (VLMs) and existing hierarchical methods in capturing fine-grained semantics and compositional reasoning under dynamic drone scenarios motivated the development of the HCCM framework. Method: The HCCM framework includes two components: (1) Region-Global Image-Text Contrastive Learning (RG-ITC), which captures hierarchical local-to-global semantics without precise scene partitioning, and (2) Region-Global Image-Text Matching (RG-ITM), which evaluates local semantic consistency within global cross-modal representations. Additionally, the Momentum Contrast and Distillation (MCD) mechanism improves robustness against incomplete or ambiguous drone text descriptions. Result: Experiments on GeoText-1652 showed HCCM achieves state-of-the-art performance with Recall@1 of 28.8% for image retrieval and 14.7% for text retrieval. On the unseen ERA dataset, it demonstrated strong zero-shot generalization with a mean recall of 39.93%, outperforming fine-tuned baselines. Conclusion: HCCM addresses the limitations of mainstream VLMs and existing hierarchical methods by introducing RG-ITC and RG-ITM, which enhance fine-grained semantics and compositional reasoning for better vision-language understanding in dynamic drone environments. Abstract: Natural Language-Guided Drones (NLGD) provide a novel paradigm for tasks such as target matching and navigation. However, the wide field of view and complex compositional semantics in drone scenarios pose challenges for vision-language understanding. Mainstream Vision-Language Models (VLMs) emphasize global alignment while lacking fine-grained semantics, and existing hierarchical methods depend on precise entity partitioning and strict containment, limiting effectiveness in dynamic environments. 
To address this, we propose the Hierarchical Cross-Granularity Contrastive and Matching learning (HCCM) framework with two components: (1) Region-Global Image-Text Contrastive Learning (RG-ITC), which avoids precise scene partitioning and captures hierarchical local-to-global semantics by contrasting local visual regions with global text and vice versa; (2) Region-Global Image-Text Matching (RG-ITM), which dispenses with rigid constraints and instead evaluates local semantic consistency within global cross-modal representations, enhancing compositional reasoning. Moreover, drone text descriptions are often incomplete or ambiguous, destabilizing alignment. HCCM introduces a Momentum Contrast and Distillation (MCD) mechanism to improve robustness. Experiments on GeoText-1652 show HCCM achieves state-of-the-art Recall@1 of 28.8% (image retrieval) and 14.7% (text retrieval). On the unseen ERA dataset, HCCM demonstrates strong zero-shot generalization with 39.93% mean recall (mR), outperforming fine-tuned baselines.[71] Complete Gaussian Splats from a Single Image with Denoising Diffusion Models
Ziwei Liao,Mohamed Sayed,Steven L. Waslander,Sara Vicente,Daniyar Turmukhambetov,Michael Firman
Main category: cs.CV
TL;DR: This paper proposes a latent diffusion model for reconstructing full 3D scenes, including occluded regions, from a single image, outperforming regression-based approaches in quality and diversity.
Details
Motivation: To overcome the limitations of Gaussian splatting in reconstructing occluded and unobserved areas and to provide more plausible, diverse, and high-quality 3D reconstructions. Method: A generative formulation using a latent diffusion model conditioned on a single input image, trained using a Variational AutoReconstructor over a self-supervised latent space. Result: Faithful and diverse 3D reconstructions with the ability to complete occluded surfaces for high-quality 360-degree renderings. Conclusion: The proposed method successfully reconstructs complete 3D scenes, including occluded areas, from a single image by using a generative latent diffusion model. Abstract: Gaussian splatting typically requires dense observations of the scene and can fail to reconstruct occluded and unobserved areas. We propose a latent diffusion model to reconstruct a complete 3D scene with Gaussian splats, including the occluded parts, from only a single image during inference. Completing the unobserved surfaces of a scene is challenging due to the ambiguity of the plausible surfaces. Conventional methods use a regression-based formulation to predict a single "mode" for occluded and out-of-frustum surfaces, leading to blurriness, implausibility, and failure to capture multiple possible explanations. Thus, they often address this problem partially, focusing either on objects isolated from the background, reconstructing only visible surfaces, or failing to extrapolate far from the input views. In contrast, we propose a generative formulation to learn a distribution of 3D representations of Gaussian splats conditioned on a single input image. To address the lack of ground-truth training data, we propose a Variational AutoReconstructor to learn a latent space only from 2D images in a self-supervised manner, over which a diffusion model is trained. 
Our method generates faithful reconstructions and diverse samples with the ability to complete the occluded surfaces for high-quality 360-degree renderings.[72] EZ-Sort: Efficient Pairwise Comparison via Zero-Shot CLIP-Based Pre-Ordering and Human-in-the-Loop Sorting
Yujin Park,Haejun Chung,Ikbeom Jang
Main category: cs.CV
TL;DR: EZ-Sort reduces annotation costs and improves efficiency in pairwise comparison tasks by using CLIP-based pre-ordering and uncertainty-aware sampling.
Details
Motivation: The motivation is to reduce the annotation burden in pairwise comparison tasks, which typically require a large number of annotations (O(n^2)). Method: The method involves three steps: zero-shot pre-ordering based on CLIP, initializing bucket-aware Elo scores, and conducting uncertainty-guided human-in-the-loop MergeSort. Result: EZ-Sort reduced human annotation cost by 90.5% compared to exhaustive pairwise comparisons and by 19.8% compared to prior work when n = 100, while improving or maintaining inter-rater reliability. Conclusion: The study concludes that EZ-Sort is an efficient and scalable solution for pairwise ranking by combining CLIP-based priors with uncertainty-aware sampling, significantly reducing human annotation costs while maintaining inter-rater reliability. Abstract: Pairwise comparison is often favored over absolute rating or ordinal classification in subjective or difficult annotation tasks due to its improved reliability. However, exhaustive comparisons require a massive number of annotations (O(n^2)). Recent work has greatly reduced the annotation burden (O(n log n)) by actively sampling pairwise comparisons using a sorting algorithm. We further improve annotation efficiency by (1) roughly pre-ordering items using the Contrastive Language-Image Pre-training (CLIP) model hierarchically without training, and (2) replacing easy, obvious human comparisons with automated comparisons. The proposed EZ-Sort first produces a CLIP-based zero-shot pre-ordering, then initializes bucket-aware Elo scores, and finally runs an uncertainty-guided human-in-the-loop MergeSort. Validation was conducted using various datasets: face-age estimation (FGNET), historical image chronology (DHCI), and retinal image quality assessment (EyePACS). It showed that EZ-Sort reduced human annotation cost by 90.5% compared to exhaustive pairwise comparisons and by 19.8% compared to prior work (when n = 100), while improving or maintaining inter-rater reliability. 
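The gap between exhaustive comparison and sorting-based comparison, and the effect of automating the obvious pairs, can be seen in a short sketch. It is a simplification under stated assumptions, not EZ-Sort itself: it uses a plain merge sort without the bucket-aware Elo initialization, and a fixed score gap stands in for the uncertainty estimate. The names `prior` (a stand-in for a CLIP-derived pre-ordering score), `ask_human`, and `margin` are hypothetical, and the toy ages are made up for illustration.

```python
def merge_sort_with_oracle(items, prior, ask_human, margin):
    """Sort item ids in ascending order, asking a human only for close calls.

    prior:     dict id -> rough automatic score (assumed, e.g. from a model)
    ask_human: callable (a, b) -> True if a should come before b
    margin:    pairs whose prior scores differ by at least this much are
               decided automatically; closer pairs go to the human
    """
    counts = {"human": 0, "auto": 0}

    def less(a, b):
        if abs(prior[a] - prior[b]) >= margin:
            counts["auto"] += 1
            return prior[a] < prior[b]
        counts["human"] += 1
        return ask_human(a, b)

    def sort(seq):
        if len(seq) <= 1:
            return seq
        mid = len(seq) // 2
        left, right = sort(seq[:mid]), sort(seq[mid:])
        merged, i, j = [], 0, 0
        while i < len(left) and j < len(right):
            if less(right[j], left[i]):
                merged.append(right[j])
                j += 1
            else:
                merged.append(left[i])
                i += 1
        return merged + left[i:] + right[j:]

    return sort(list(items)), counts


# Toy face-age example: true ages are known only to the (simulated) annotator.
truth = {0: 5, 1: 30, 2: 31, 3: 60, 4: 8, 5: 29, 6: 62, 7: 33}
prior = {0: 7, 1: 28, 2: 33, 3: 58, 4: 6, 5: 31, 6: 65, 7: 30}

order, counts = merge_sort_with_oracle(
    truth.keys(), prior, lambda a, b: truth[a] < truth[b], margin=10
)
n = len(truth)
exhaustive = n * (n - 1) // 2  # every pair compared once: O(n^2)
```

Here `exhaustive` is 28 for eight items, while the merge sort needs far fewer comparisons in total and routes only the close pairs to the annotator; the remaining ones are settled by the prior. The ordering stays correct only as long as the prior agrees with the annotator on every pair it decides automatically, which is why the choice of margin matters.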
These results demonstrate that combining CLIP-based priors with uncertainty-aware sampling yields an efficient and scalable solution for pairwise ranking.[73] ECHO: Ego-Centric modeling of Human-Object interactions
Ilya A. Petrov,Vladimir Guzov,Riccardo Marin,Emre Aksan,Xu Chen,Daniel Cremers,Thabo Beeler,Gerard Pons-Moll
Main category: cs.CV
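TL;DR: ECHO is a new unified framework that recovers human pose, object motion, and contact from head and wrist tracking, offering good flexibility and performance.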
Details
Motivation: With the growing adoption of wearable devices such as smart glasses and watches, modeling human-object interactions from an egocentric perspective is an important yet underexplored problem. Method: ECHO employs a Diffusion Transformer architecture and a unique three-variate diffusion process, operates in a head-centric canonical space, and proposes a conveyor-based inference scheme that can process sequences of any length. Result: Extensive evaluation shows that ECHO outperforms existing methods that do not offer the same flexibility, reaching the state of the art in egocentric HOI reconstruction. Conclusion: ECHO is a new unified framework that recovers three modalities (human pose, object motion, and contact) from head and wrist tracking, achieving state-of-the-art performance in egocentric HOI reconstruction. Abstract: Modeling human-object interactions (HOI) from an egocentric perspective is a largely unexplored yet important problem due to the increasing adoption of wearable devices, such as smart glasses and watches. We investigate how much information about interaction can be recovered from only head and wrist tracking. Our answer is ECHO (Ego-Centric modeling of Human-Object interactions), which, for the first time, proposes a unified framework to recover three modalities: human pose, object motion, and contact from such minimal observation. ECHO employs a Diffusion Transformer architecture and a unique three-variate diffusion process, which jointly models human motion, object trajectory, and contact sequence, allowing for flexible input configurations. Our method operates in a head-centric canonical space, enhancing robustness to global orientation. We propose a conveyor-based inference, which progressively increases the diffusion timestamp with the frame position, allowing us to process sequences of any length. Through extensive evaluation, we demonstrate that ECHO outperforms existing methods that do not offer the same flexibility, setting a new state of the art in egocentric HOI reconstruction.
[74] How Well Do Vision-Language Models Understand Cities? A Comparative Study on Spatial Reasoning from Street-View Images
Juneyoung Ro,Namwoo Kim,Yoonjin Yoon
Main category: cs.CV
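TL;DR: This study examines how existing vision-language models perform on urban spatial reasoning tasks and proposes fine-tuning them on a synthetic dataset to improve their domain-specific performance.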
Details
Motivation: It remains unclear how well current vision-language models perform fine-grained spatial reasoning about objects, layouts, and depth cues in the urban domain, so their transfer ability needs to be studied. Method: The study compares three existing VLMs (BLIP-2, InstructBLIP, and LLaVA-1.5) in the zero-shot setting and after fine-tuning on a synthetic VQA dataset specific to urban scenes. The synthetic dataset is built from segmentation, depth, and object detection predictions on street-view images, and each question is paired with an LLM-generated Chain-of-Thought (CoT) answer. Result: Although the VLMs perform reasonably well in the zero-shot setting, fine-tuning on the synthetic CoT-supervised dataset substantially improves performance, especially for challenging question types such as negation and counterfactuals. Conclusion: The study introduces urban spatial reasoning as a new challenge for VLMs and shows that synthetic dataset construction is a practical path for adapting models to specialized domains. Abstract: Effectively understanding urban scenes requires fine-grained spatial reasoning about objects, layouts, and depth cues. However, how well current vision-language models (VLMs), pretrained on general scenes, transfer these abilities to the urban domain remains underexplored. To address this gap, we conduct a comparative study of three off-the-shelf VLMs (BLIP-2, InstructBLIP, and LLaVA-1.5), evaluating both zero-shot performance and the effects of fine-tuning with a synthetic VQA dataset specific to urban scenes. We construct this dataset from segmentation, depth, and object detection predictions of street-view images, pairing each question with LLM-generated Chain-of-Thought (CoT) answers for step-by-step reasoning supervision. Results show that while VLMs perform reasonably well in zero-shot settings, fine-tuning with our synthetic CoT-supervised dataset substantially boosts performance, especially for challenging question types such as negation and counterfactuals. This study introduces urban spatial reasoning as a new challenge for VLMs and demonstrates synthetic dataset construction as a practical path for adapting general-purpose models to specialized domains.
[75] Temporal Flow Matching for Learning Spatio-Temporal Trajectories in 4D Longitudinal Medical Imaging
Nico Albert Disch,Yannick Kirchhoff,Robin Peretzke,Maximilian Rokuss,Saikat Roy,Constantin Ulrich,David Zimmerer,Klaus Maier-Hein
Main category: cs.CV
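TL;DR: Temporal Flow Matching (TFM) is a new unified generative trajectory method for 4D medical image prediction that outperforms existing spatio-temporal methods from natural imaging.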
Details
Motivation: Existing deep learning methods for modeling temporal dynamics in medical imaging are limited: they consider only single temporal contexts, focus on classification or regression tasks, and have limited ability to make fine-grained spatial predictions. Method: The paper introduces Temporal Flow Matching (TFM), a unified generative trajectory method that aims to learn the underlying temporal distribution, can fall back to a nearest image predictor, and supports 3D volumes, multiple prior scans, and irregular sampling. Result: Extensive benchmarks on three public longitudinal datasets show that TFM consistently outperforms spatio-temporal methods from natural imaging, establishing a new state of the art and a robust baseline for 4D medical image prediction. Conclusion: TFM offers an effective solution for modeling temporal dynamics in medical imaging, with broad application potential. Abstract: Understanding temporal dynamics in medical imaging is crucial for applications such as disease progression modeling, treatment planning and anatomical development tracking. However, most deep learning methods either consider only single temporal contexts, or focus on tasks like classification or regression, limiting their ability for fine-grained spatial predictions. While some approaches have been explored, they are often limited to single timepoints, specific diseases or have other technical restrictions. To address this fundamental gap, we introduce Temporal Flow Matching (TFM), a unified generative trajectory method that (i) aims to learn the underlying temporal distribution, (ii) by design can fall back to a nearest image predictor, i.e. predicting the last context image (LCI), as a special case, and (iii) supports 3D volumes, multiple prior scans, and irregular sampling. Extensive benchmarks on three public longitudinal datasets show that TFM consistently surpasses spatio-temporal methods from natural imaging, establishing a new state-of-the-art and robust baseline for 4D medical image prediction.
[76] Integrating Pathology and CT Imaging for Personalized Recurrence Risk Prediction in Renal Cancer
Daniël Boeke,Cedrik Blommestijn,Rebecca N. Wray,Kalina Chupetlovska,Shangqi Gao,Zeyu Gao,Regina G. H. Beets-Tan,Mireia Crispin-Ortuzar,James O. Jones,Wilson Silva,Ines P. Machado
Main category: cs.CV
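TL;DR: This study combines CT and pathology slide images in a deep learning framework to predict recurrence risk in ccRCC patients. Pathology-based models outperform CT-only models, and the best multimodal model approaches the adjusted Leibovich score, highlighting the prognostic value of pathology.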
Details
Motivation: The study aims to improve the accuracy of postoperative recurrence risk assessment in clear cell renal cell carcinoma (ccRCC) in order to guide surveillance and treatment. The existing Leibovich score offers limited patient-level resolution and excludes imaging information. Method: The study uses a modular deep learning framework that combines pretrained encoders with Cox-based survival modeling, and tests recurrence prediction in unimodal, late fusion, and intermediate fusion setups. Result: Models based on whole-slide images (WSIs) consistently outperformed CT-only models in predicting recurrence. Intermediate fusion further improved performance, and the best model (TITAN-CONCH with ResNet-18) approached the adjusted Leibovich score. Conclusion: Foundation model-based multimodal integration is feasible for personalized ccRCC risk prediction, and pathology has strong prognostic power for predicting recurrence. Abstract: Recurrence risk estimation in clear cell renal cell carcinoma (ccRCC) is essential for guiding postoperative surveillance and treatment. The Leibovich score remains widely used for stratifying distant recurrence risk but offers limited patient-level resolution and excludes imaging information. This study evaluates multimodal recurrence prediction by integrating preoperative computed tomography (CT) and postoperative histopathology whole-slide images (WSIs). A modular deep learning framework with pretrained encoders and Cox-based survival modeling was tested across unimodal, late fusion, and intermediate fusion setups. In a real-world ccRCC cohort, WSI-based models consistently outperformed CT-only models, underscoring the prognostic strength of pathology. Intermediate fusion further improved performance, with the best model (TITAN-CONCH with ResNet-18) approaching the adjusted Leibovich score. Random tie-breaking narrowed the gap between the clinical baseline and learned models, suggesting discretization may overstate individualized performance. Using simple embedding concatenation, radiology added value primarily through fusion. These findings demonstrate the feasibility of foundation model-based multimodal integration for personalized ccRCC risk prediction. Future work should explore more expressive fusion strategies, larger multimodal datasets, and general-purpose CT encoders to better match pathology modeling capacity.
[77] Unfolding Framework with Complex-Valued Deformable Attention for High-Quality Computer-Generated Hologram Generation
Haomiao Zhang,Zhangyuan Li,Yanling Piao,Zhi Li,Xiaodong Wang,Miao Cao,Xiongfei Su,Qiang Song,Xin Yuan
Main category: cs.CV
TL;DR: This paper introduces a Deep Unfolding Network (DUN) for computer-generated holography that improves reconstruction accuracy and flexibility by combining an adaptive bandwidth-preserving model and a complex-valued denoiser with self-attention.
Details
Motivation: Current deep learning-based CGH algorithms face challenges such as limited interpretability, restricted receptive fields, and constraints on working distances. This work aims to address these issues by introducing a more flexible and physically meaningful reconstruction framework. Method: The authors propose a Deep Unfolding Network (DUN) that decomposes gradient descent into two modules: an adaptive bandwidth-preserving model (ABPM) and a phase-domain complex-valued denoiser (PCD). ABPM enables wider working distances, while PCD uses a complex-valued deformable self-attention module to capture global features. Result: The proposed method achieves a PSNR over 35 dB and demonstrates superior performance on both simulated and real data, showing state-of-the-art results in computer-generated holography. Conclusion: The proposed DUN method outperforms existing approaches by offering greater flexibility and better performance in CGH, achieving state-of-the-art results on simulated and real data. Abstract: Computer-generated holography (CGH) has gained wide attention with deep learning-based algorithms. However, due to its nonlinear and ill-posed nature, challenges remain in achieving accurate and stable reconstruction. Specifically, (i) the widely used end-to-end networks treat the reconstruction model as a black box, ignoring underlying physical relationships, which reduces interpretability and flexibility. (ii) CNN-based CGH algorithms have limited receptive fields, hindering their ability to capture long-range dependencies and global context. (iii) Angular spectrum method (ASM)-based models are constrained to finite near-fields. In this paper, we propose a Deep Unfolding Network (DUN) that decomposes gradient descent into two modules: an adaptive bandwidth-preserving model (ABPM) and a phase-domain complex-valued denoiser (PCD), providing more flexibility. ABPM allows for wider working distances compared to ASM-based methods.
At the same time, PCD leverages its complex-valued deformable self-attention module to capture global features and enhance performance, achieving a PSNR over 35 dB. Experiments on simulated and real data show state-of-the-art results.
[78] Towards Interactive Lesion Segmentation in Whole-Body PET/CT with Promptable Models
Maximilian Rokuss,Yannick Kirchhoff,Fabian Isensee,Klaus H. Maier-Hein
Main category: cs.CV
TL;DR: This paper improves interactive lesion segmentation in PET/CT by extending the nnU-Net with promptable capabilities using Euclidean Distance Transform (EDT) encodings, achieving better performance and robustness in multi-tracer, multi-center settings.
Details
Motivation: Accurate lesion segmentation in whole-body PET/CT is challenging due to tracer heterogeneity, physiological uptake, and multi-center variability. Fully automated methods, while advanced, benefit from human-in-the-loop approaches for refinement. This work aims to improve interactive segmentation performance. Method: Building on the nnU-Net pipeline, the authors extended it with promptable capabilities by encoding user input (foreground/background clicks) using Euclidean Distance Transform (EDT) and Gaussian kernels. They also introduced online simulation of user interactions and a custom point sampling strategy to enhance robustness. Result: An ensemble of EDT-based models achieved the best cross-validation performance, reducing both false positives and false negatives compared to baseline models. EDT encodings outperformed Gaussian kernels consistently. Conclusion: The study concludes that promptable models, particularly those using EDT encodings, can significantly improve lesion segmentation accuracy and robustness in PET/CT imaging, enabling more efficient and user-guided workflows. Abstract: Whole-body PET/CT is a cornerstone of oncological imaging, yet accurate lesion segmentation remains challenging due to tracer heterogeneity, physiological uptake, and multi-center variability. While fully automated methods have advanced substantially, clinical practice benefits from approaches that keep humans in the loop to efficiently refine predicted masks. The autoPET/CT IV challenge addresses this need by introducing interactive segmentation tasks based on simulated user prompts. In this work, we present our submission to Task 1. Building on the winning autoPET III nnU-Net pipeline, we extend the framework with promptable capabilities by encoding user-provided foreground and background clicks as additional input channels. 
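To make the two click representations concrete, here is a minimal 2D sketch of turning click coordinates into an extra input channel, once as a Euclidean distance map and once as Gaussian blobs. It is brute force and for illustration only. The function names, the `sigma` default, and the fallback value for an empty click list are assumptions, and the actual pipeline works on 3D volumes and may normalise or clip these maps differently.

```python
import math


def edt_channel(shape, clicks):
    """Distance from every pixel to its nearest click (brute-force EDT)."""
    h, w = shape
    far = math.hypot(h, w)  # assumed fallback value when there are no clicks
    return [
        [min((math.hypot(y - cy, x - cx) for cy, cx in clicks), default=far)
         for x in range(w)]
        for y in range(h)
    ]


def gaussian_channel(shape, clicks, sigma=1.0):
    """Strongest Gaussian response over all clicks at each pixel."""
    h, w = shape
    return [
        [max((math.exp(-((y - cy) ** 2 + (x - cx) ** 2) / (2 * sigma ** 2))
              for cy, cx in clicks), default=0.0)
         for x in range(w)]
        for y in range(h)
    ]
```

One visible difference is that the distance map keeps changing across the whole image, whereas the Gaussian response fades to almost zero a few `sigma` away from the click. For real volumes, a library routine such as SciPy's `distance_transform_edt` would replace the nested loops.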
We systematically investigate representations for spatial prompts and demonstrate that Euclidean Distance Transform (EDT) encodings consistently outperform Gaussian kernels. Furthermore, we propose online simulation of user interactions and a custom point sampling strategy to improve robustness under realistic prompting conditions. Our ensemble of EDT-based models, trained with and without external data, achieves the strongest cross-validation performance, reducing both false positives and false negatives compared to baseline models. These results highlight the potential of promptable models to enable efficient, user-guided segmentation workflows in multi-tracer, multi-center PET/CT. Code is publicly available at https://github.com/MIC-DKFZ/autoPET-interactive
[79] Mapping like a Skeptic: Probabilistic BEV Projection for Online HD Mapping
Fatih Erdoğan,Merve Rabia Barın,Fatma Güney
Main category: cs.CV
TL;DR: This paper proposes a probabilistic projection mechanism with confidence scores to improve high-definition map generation by refining the mapping of road elements and filtering irrelevant ones, resulting in better generalization and accuracy.
Details
Motivation: Existing HD mapping approaches struggle with accuracy due to generalization problems, often hallucinating non-existent road elements. This work aims to improve accuracy by introducing a geometric mapping approach adapted to the scene. Method: A novel probabilistic projection mechanism with confidence scores is proposed to refine the mapping of road elements from image space to BEV space and filter out irrelevant elements. Temporal processing is improved by selectively accumulating reliable information over time. Result: Experiments on new splits of the nuScenes and Argoverse2 datasets demonstrate improved performance over state-of-the-art approaches, particularly on nuScenes and in long perception range scenarios. Conclusion: The proposed probabilistic projection mechanism with confidence scores improves HD map generation by refining the mapping and filtering irrelevant elements, leading to better generalization. Abstract: Constructing high-definition (HD) maps from sensory input requires accurately mapping the road elements in image space to the Bird's Eye View (BEV) space. The precision of this mapping directly impacts the quality of the final vectorized HD map. Existing HD mapping approaches outsource the projection to standard mapping techniques, such as attention-based ones. However, these methods struggle with accuracy due to generalization problems, often hallucinating non-existent road elements. Our key idea is to start with a geometric mapping based on camera parameters and adapt it to the scene to extract relevant map information from camera images. To implement this, we propose a novel probabilistic projection mechanism with confidence scores to (i) refine the mapping to better align with the scene and (ii) filter out irrelevant elements that should not influence HD map generation. In addition, we improve temporal processing by using confidence scores to selectively accumulate reliable information over time. 
Experiments on new splits of the nuScenes and Argoverse2 datasets demonstrate improved performance over state-of-the-art approaches, indicating better generalization. The improvements are particularly pronounced on nuScenes and in the challenging long perception range. Our code and model checkpoints are available at https://github.com/Fatih-Erdogan/mapping-like-skeptic
[80] FLORA: Efficient Synthetic Data Generation for Object Detection in Low-Data Regimes via finetuning Flux LoRA
Alvaro Patricio,Atabak Dehban,Rodrigo Ventura
Main category: cs.CV
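TL;DR: This paper proposes FLORA, a lightweight synthetic data generation method for object detection that runs on a consumer-grade GPU and surpasses existing methods with far less data.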
Details
Motivation: Generating synthetic data with existing diffusion models requires substantial computational resources, which limits their use in real-world scenarios. Method: The approach uses the Flux 1.1 Dev diffusion model, fine-tuned exclusively through Low-Rank Adaptation (LoRA), to build a lightweight synthetic data generation pipeline. Result: Evaluated on seven diverse object detection datasets, detectors trained on 500 synthetic images generated by FLORA outperform models trained on 5000 images from the ODGEN baseline, with improvements of up to 21.3% in mAP@.50:.95. Conclusion: FLORA surpasses existing methods with less data and lower computational cost, showing that a quality- and efficiency-focused approach is more effective than brute-force generation. Abstract: Recent advances in diffusion-based generative models have demonstrated significant potential in augmenting scarce datasets for object detection tasks. Nevertheless, most recent models rely on resource-intensive full fine-tuning of large-scale diffusion models, requiring enterprise-grade GPUs (e.g., NVIDIA V100) and thousands of synthetic images. To address these limitations, we propose Flux LoRA Augmentation (FLORA), a lightweight synthetic data generation pipeline. Our approach uses the Flux 1.1 Dev diffusion model, fine-tuned exclusively through Low-Rank Adaptation (LoRA). This dramatically reduces computational requirements, enabling synthetic dataset generation with a consumer-grade GPU (e.g., NVIDIA RTX 4090). We empirically evaluate our approach on seven diverse object detection datasets. Our results demonstrate that training object detectors with just 500 synthetic images generated by our approach yields superior detection performance compared to models trained on 5000 synthetic images from the ODGEN baseline, achieving improvements of up to 21.3% in mAP@.50:.95. This work demonstrates that it is possible to surpass state-of-the-art performance with far greater efficiency, as FLORA achieves superior results using only 10% of the data and a fraction of the computational cost. It also shows that a quality- and efficiency-focused approach is more effective than brute-force generation, making advanced synthetic data creation more practical and accessible for real-world scenarios.
[81] Entropy-Based Non-Invasive Reliability Monitoring of Convolutional Neural Networks
Amirhossein Nazeri,Wael Hafez
Main category: cs.CV
TL;DR: This paper proposes a method to detect adversarial inputs in CNNs by monitoring activation entropy, achieving high detection accuracy without model modification.
Details
Motivation: Convolutional Neural Networks (CNNs) have become the foundation of modern computer vision but remain vulnerable to adversarial perturbations. Existing detection methods require expensive retraining, modify network architecture, or degrade performance on clean inputs. Method: Parallel entropy monitoring on VGG-16 shows that adversarial inputs consistently shift activation entropy by 7% in early convolutional layers, enabling 90% detection accuracy with false positive and false negative rates below 20%. Result: Adversarial perturbations create immediate, detectable entropy signatures in CNN activations that can be monitored without any model modification. The complete separation between clean and adversarial entropy distributions reveals that CNNs inherently encode distribution shifts in their activation patterns. Conclusion: This work establishes that CNN reliability can be assessed through activation entropy alone, enabling practical deployment of self-diagnostic vision systems that detect adversarial inputs in real-time without compromising original model performance. Abstract: Convolutional Neural Networks (CNNs) have become the foundation of modern computer vision, achieving unprecedented accuracy across diverse image recognition tasks. While these networks excel on in-distribution data, they remain vulnerable to adversarial perturbations: imperceptible input modifications that cause misclassification with high confidence. However, existing detection methods either require expensive retraining, modify network architecture, or degrade performance on clean inputs. Here we show that adversarial perturbations create immediate, detectable entropy signatures in CNN activations that can be monitored without any model modification.
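As a rough illustration of what monitoring activation entropy could look like, the sketch below computes the Shannon entropy of a histogram of activation values and flags an input whose entropy deviates from a clean reference by more than a threshold. The binning scheme, the value range, the 5% threshold, and all function names are assumptions made for this example; the abstract does not specify how the authors compute or threshold entropy.

```python
import math


def activation_entropy(values, bins=16, lo=0.0, hi=1.0):
    """Shannon entropy (in bits) of a fixed-range histogram of activations."""
    counts = [0] * bins
    width = (hi - lo) / bins
    for v in values:
        idx = int((v - lo) / width)
        counts[min(max(idx, 0), bins - 1)] += 1  # clamp out-of-range values
    total = len(values)
    return -sum((c / total) * math.log2(c / total) for c in counts if c)


def entropy_shift(reference, observed):
    """Relative change in entropy; reference must be non-zero."""
    return abs(observed - reference) / reference


def is_suspicious(reference, observed, threshold=0.05):
    """Hypothetical decision rule: flag inputs whose entropy moved too far."""
    return entropy_shift(reference, observed) > threshold
```

Because the check only reads activations, it can run alongside the network without touching its weights, which is the non-invasive property the abstract emphasises.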
Using parallel entropy monitoring on VGG-16, we demonstrate that adversarial inputs consistently shift activation entropy by 7% in early convolutional layers, enabling 90% detection accuracy with false positive and false negative rates below 20%. The complete separation between clean and adversarial entropy distributions reveals that CNNs inherently encode distribution shifts in their activation patterns. This work establishes that CNN reliability can be assessed through activation entropy alone, enabling practical deployment of self-diagnostic vision systems that detect adversarial inputs in real-time without compromising original model performance.
[82] CAD2DMD-SET: Synthetic Generation Tool of Digital Measurement Device CAD Model Datasets for fine-tuning Large Vision-Language Models
João Valente,Atabak Dehban,Rodrigo Ventura
Main category: cs.CV
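TL;DR: This paper presents CAD2DMD-SET, a tool for generating synthetic digital measurement device datasets with visual question answering labels, and DMDBench, a validation set for evaluating model performance.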
Details
Motivation: LVLMs struggle to read Digital Measurement Devices (DMDs), particularly under complex real-world conditions such as clutter, occlusions, extreme viewpoints, and motion blur. Method: The tool uses 3D CAD models, advanced rendering, and high-fidelity image composition to generate diverse, VQA-labelled synthetic DMD datasets. The paper also presents the DMDBench validation set. Result: After LoRA fine-tuning on the dataset generated by CAD2DMD-SET, InternVL's score increased by 200% without degrading performance on other tasks. Conclusion: CAD2DMD-SET substantially improves the robustness and performance of LVLMs under challenging conditions. It is expected to be released as an open-source tool so that the community can add different measurement devices and generate their own datasets. Abstract: Recent advancements in Large Vision-Language Models (LVLMs) have demonstrated impressive capabilities across various multimodal tasks. They continue, however, to struggle with trivial scenarios such as reading values from Digital Measurement Devices (DMDs), particularly in real-world conditions involving clutter, occlusions, extreme viewpoints, and motion blur, which are common in head-mounted cameras and Augmented Reality (AR) applications. Motivated by these limitations, this work introduces CAD2DMD-SET, a synthetic data generation tool designed to support visual question answering (VQA) tasks involving DMDs. By leveraging 3D CAD models, advanced rendering, and high-fidelity image composition, our tool produces diverse, VQA-labelled synthetic DMD datasets suitable for fine-tuning LVLMs. Additionally, we present DMDBench, a curated validation set of 1,000 annotated real-world images designed to evaluate model performance under practical constraints. Benchmarking three state-of-the-art LVLMs using Average Normalised Levenshtein Similarity (ANLS) and further fine-tuning LoRA adapters for these models on the dataset generated by CAD2DMD-SET yielded substantial improvements, with InternVL showcasing a score increase of 200% without degrading on other tasks. This demonstrates that the CAD2DMD-SET training dataset substantially improves the robustness and performance of LVLMs when operating under the previously stated challenging conditions.
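For readers unfamiliar with the ANLS metric named above, the following sketch shows the commonly used formulation: each prediction is scored by its best normalised Levenshtein similarity against the accepted answers, scores whose normalised distance reaches a threshold are zeroed, and the results are averaged. The threshold of 0.5 and the lowercasing and whitespace stripping follow common VQA benchmark practice and are assumptions here; the abstract does not state the exact settings used for DMDBench.

```python
def levenshtein(a, b):
    """Edit distance between two strings (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]


def anls(predictions, references, tau=0.5):
    """Average Normalised Levenshtein Similarity.

    predictions: list of predicted strings.
    references:  list of lists of accepted answers, one list per prediction.
    tau:         assumed cut-off; distances at or above it score zero.
    """
    scores = []
    for pred, refs in zip(predictions, references):
        best = 0.0
        p = pred.strip().lower()
        for ref in refs:
            r = ref.strip().lower()
            denom = max(len(p), len(r))
            nl = levenshtein(p, r) / denom if denom else 0.0
            best = max(best, 1.0 - nl if nl < tau else 0.0)
        scores.append(best)
    return sum(scores) / len(scores) if scores else 0.0
```

Under this formulation, a reading of "12.6" against a ground truth of "12.5" scores 0.75 rather than zero, so near-miss digit readings still earn partial credit.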
The CAD2DMD-SET tool is expected to be released as open-source once the final version of this manuscript is prepared, allowing the community to add different measurement devices and generate their own datasets.
[83] Learning from Silence and Noise for Visual Sound Source Localization
Xavier Juanola,Giovana Morais,Magdalena Fuentes,Gloria Haro
Main category: cs.CV
TL;DR: This paper proposes SSL-SaN, a self-supervised model for visual sound source localization that addresses poor performance in negative audio scenarios, introduces a new evaluation metric, and presents an improved dataset named IS3+.
Details
Motivation: The motivation is to address the limitations of current visual sound source localization methods, particularly their poor performance with negative audio and the lack of comprehensive evaluation metrics and datasets. Method: The method involves a new training strategy incorporating silence and noise, a new metric for auditory-visual feature alignment and separability, and an extended dataset named IS3+. Result: The proposed SSL-SaN model achieves state-of-the-art performance in sound localization and cross-modal retrieval, and the new metric and dataset enhance the evaluation of models under varied audio conditions. Conclusion: The paper introduces SSL-SaN, a self-supervised model for visual sound source localization that performs well in both positive and negative audio scenarios. Abstract: Visual sound source localization is a fundamental perception task that aims to detect the location of sounding sources in a video given its audio. Despite recent progress, we identify two shortcomings in current methods: 1) most approaches perform poorly in cases with low audio-visual semantic correspondence such as silence, noise, and offscreen sounds, i.e. in the presence of negative audio; and 2) most prior evaluations are limited to positive cases, where both datasets and metrics convey scenarios with a single visible sound source in the scene. To address this, we introduce three key contributions. First, we propose a new training strategy that incorporates silence and noise, which improves performance in positive cases, while being more robust against negative sounds. Our resulting self-supervised model, SSL-SaN, achieves state-of-the-art performance compared to other self-supervised models, both in sound localization and cross-modal retrieval. Second, we propose a new metric that quantifies the trade-off between alignment and separability of auditory and visual features across positive and negative audio-visual pairs. 
Third, we present IS3+, an extended and improved version of the IS3 synthetic dataset with negative audio. Our data, metrics and code are available at https://xavijuanola.github.io/SSL-SaN/
[84] UItron: Foundational GUI Agent with Advanced Perception and Planning
Zhixiong Zeng,Jing Huang,Liming Zheng,Wenkang Han,Yufeng Zhong,Lei Chen,Longrong Yang,Yingjie Chu,Yuzhi He,Lin Ma
Main category: cs.CV
TL;DR: UItron is an open-source foundational model for automatic GUI agents that advances GUI perception, grounding, and planning capabilities through data engineering and interactive infrastructure, achieving significant results in Chinese app scenarios.
Details
Motivation: The motivation behind UItron is to advance the development of GUI agents for automated operations on Mobile/PC devices by addressing challenges such as scarcity of operation trajectories, lack of interactive infrastructure, and limitations in foundation models. Method: UItron uses supervised finetuning over perception and planning tasks across various GUI scenarios, followed by a curriculum reinforcement learning framework to enable complex reasoning and exploration in online environments. Result: UItron achieves superior performance in benchmarks of GUI perception, grounding, and planning, and demonstrates significant progress in handling Chinese app scenarios. Conclusion: UItron represents significant progress in the development of GUI agents, particularly in Chinese app scenarios, bringing automated operations on Mobile/PC devices closer to real-world application. Abstract: GUI agents aim to enable automated operations on Mobile/PC devices, which is an important task toward achieving artificial general intelligence. The rapid advancement of VLMs accelerates the development of GUI agents, owing to their powerful capabilities in visual understanding and task planning. However, building a GUI agent remains a challenging task due to the scarcity of operation trajectories, the limited availability of interactive infrastructure, and the limitation of initial capabilities in foundation models. In this work, we introduce UItron, an open-source foundational model for automatic GUI agents, featuring advanced GUI perception, grounding, and planning capabilities. UItron highlights the necessity of systemic data engineering and interactive infrastructure as foundational components for advancing GUI agent development. It not only systematically studies a series of data engineering strategies to enhance training effects, but also establishes an interactive environment connecting both Mobile and PC devices.
In training, UItron adopts supervised finetuning over perception and planning tasks in various GUI scenarios, and then develops a curriculum reinforcement learning framework to enable complex reasoning and exploration for online environments. As a result, UItron achieves superior performance in benchmarks of GUI perception, grounding, and planning. In particular, UItron highlights the interaction proficiency with top-tier Chinese mobile apps, as we identified a general lack of Chinese capabilities even in state-of-the-art solutions. To this end, we manually collect over one million steps of operation trajectories across the top 100 most popular apps, and build the offline and online agent evaluation environments. Experimental results demonstrate that UItron achieves significant progress in Chinese app scenarios, propelling GUI agents one step closer to real-world application.
[85] Domain Generalization in-the-Wild: Disentangling Classification from Domain-Aware Representations
Ha Min Son,Zhe Zhao,Shahbaz Rezaei,Xin Liu
Main category: cs.CV
TL;DR: CLIP-DCA improves domain generalization for foundational models like CLIP by enhancing domain awareness and disentangling domain-invariant classification, showing strong performance on challenging unseen data scenarios.
Details
Motivation: Standard domain invariance losses may harm foundational models by discarding domain-aware representations beneficial for generalization. This work explores enhancing domain awareness as a prerequisite for effective domain-invariant classification. Method: CLIP-DCA introduces a separate domain head to identify and enhance domain awareness using synthetically generated diverse domain data, while encouraging domain-invariant classification through disentanglement from domain features. Result: CLIP-DCA shows significant improvements over existing methods in challenging DG evaluations, especially for datasets that are more OOD. Conclusion: CLIP-DCA enhances domain awareness and improves domain-invariant classification in foundational models like CLIP, particularly for datasets that are more out-of-distribution (OOD). Abstract: Evaluating domain generalization (DG) for foundational models like CLIP is challenging, as web-scale pretraining data potentially covers many existing benchmarks. Consequently, current DG evaluation may neither be sufficiently challenging nor adequately test genuinely unseen data scenarios. To better assess the performance of CLIP on DG in-the-wild, a scenario where CLIP encounters challenging unseen data, we consider two approaches: (1) evaluating on 33 diverse datasets with quantified out-of-distribution (OOD) scores after fine-tuning CLIP on ImageNet, and (2) using unlearning to make CLIP 'forget' some domains as an approximation. We observe that CLIP's performance deteriorates significantly on more OOD datasets. To address this, we present CLIP-DCA (Disentangling Classification from enhanced domain Aware representations). Our approach is motivated by the observation that while standard domain invariance losses aim to make representations domain-invariant, this can be harmful to foundation models by forcing the discarding of domain-aware representations beneficial for generalization.
We instead hypothesize that enhancing domain awareness is a prerequisite for effective domain-invariant classification in foundation models. CLIP-DCA identifies and enhances domain awareness within CLIP's encoders using a separate domain head and synthetically generated diverse domain data. Simultaneously, it encourages domain-invariant classification through disentanglement from the domain features. CLIP-DCA shows significant improvements within this challenging evaluation compared to existing methods, particularly on datasets that are more OOD.
[86] What Can We Learn from Harry Potter? An Exploratory Study of Visual Representation Learning from Atypical Videos
Qiyue Sun,Qiming Huang,Yang Yang,Hongjun Wang,Jianbo Jiao
Main category: cs.CV
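TL;DR: This paper studies the role of atypical video data in open-world learning, finds that it significantly benefits visual representation learning, and proposes a new dataset to encourage related research.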
Details
Motivation: Humans show exceptional generalisation and discovery ability when faced with uncommon new concepts, whereas most existing studies focus on typical data from closed sets, and discovery in the open world remains under-explored. Method: The authors collect a video dataset containing various kinds of atypical data and use it in model training for representation learning, studying its effect on open-world learning tasks. Result: Experiments show that using atypical data consistently improves performance on open-world learning tasks, including out-of-distribution detection, novel category discovery, and zero-shot action recognition. In addition, the semantic diversity of atypical samples helps the model generalise better to unseen action classes. Conclusion: The paper concludes that atypical video data significantly benefits visual representation learning in the open world, and it proposes a new dataset to encourage further research. Abstract: Humans usually show exceptional generalisation and discovery ability in the open world when shown uncommon new concepts. Whereas most existing studies in the literature focus on common typical data from closed sets, open-world novel discovery is under-explored in videos. In this paper, we are interested in asking: What if atypical unusual videos are exposed in the learning process? To this end, we collect a new video dataset consisting of various types of unusual atypical data (e.g., sci-fi, animation, etc.). To study how such atypical data may benefit open-world learning, we feed them into the model training process for representation learning. Focusing on three key tasks in open-world learning: out-of-distribution (OOD) detection, novel category discovery (NCD), and zero-shot action recognition (ZSAR), we found that even straightforward learning approaches with atypical data consistently improve performance across various settings. Furthermore, we found that increasing the categorical diversity of the atypical samples further boosts OOD detection performance. Additionally, in the NCD task, using a smaller yet more semantically diverse set of atypical samples leads to better performance compared to using a larger but more typical dataset. In the ZSAR setting, the semantic diversity of atypical videos helps the model generalise better to unseen action classes.
These observations in our extensive experimental evaluations reveal the benefits of atypical videos for visual representation learning in the open world, together with the newly proposed dataset, encouraging further studies in this direction.[87] Unsupervised Video Continual Learning via Non-Parametric Deep Embedded Clustering
Nattapong Kurpukdee,Adrian G. Bors
Main category: cs.CV
TL;DR: This paper introduces an unsupervised video continual learning framework using non-parametric methods and transfer learning, significantly improving performance on standard datasets without labels or task boundaries.
Details
Motivation: Unsupervised video continual learning, where neither labels nor task boundaries are available, is under-explored even though it is a more practical setting than supervised continual learning. Method: The authors propose a non-parametric approach that applies Kernel Density Estimation (KDE) to deep embedded video features extracted by unsupervised video transformer networks. The method includes a novelty detection criterion for dynamically expanding memory clusters and leverages transfer learning from previous tasks. Result: The proposed methodology substantially improves the model's performance in unsupervised continual learning scenarios, as evaluated on three standard video action recognition datasets: UCF101, HMDB51, and Something-to-Something V2. Conclusion: The proposed unsupervised video continual learning methodology enhances model performance when learning multiple tasks without labels or class boundaries, as demonstrated through evaluations on standard datasets. Abstract: We propose a realistic scenario for unsupervised video learning where neither task boundaries nor labels are provided when learning a succession of tasks. We also provide a non-parametric learning solution for the under-explored problem of unsupervised video continual learning. Videos are a complex and rich form of spatio-temporal media, widely used in many applications, but they have not been sufficiently explored in unsupervised continual learning. Prior studies have only focused on supervised continual learning, relying on the knowledge of labels and task boundaries, even though labeled data is costly and often impractical to obtain. To address this gap, we study unsupervised video continual learning (uVCL). uVCL raises more challenges due to the additional computational and memory requirements of processing videos when compared to images.
We introduce a general benchmark experimental protocol for uVCL by considering the learning of unstructured video data categories during each task. We propose to use the Kernel Density Estimation (KDE) of deep embedded video features extracted by unsupervised video transformer networks as a non-parametric probabilistic representation of the data. We introduce a novelty detection criterion for the incoming new task data, dynamically enabling the expansion of memory clusters, aiming to capture new knowledge when learning a succession of tasks. We leverage the use of transfer learning from the previous tasks as an initial state for the knowledge transfer to the current learning task. We found that the proposed methodology substantially enhances the performance of the model when successively learning many tasks. We perform in-depth evaluations on three standard video action recognition datasets, including UCF101, HMDB51, and Something-to-Something V2, without using any labels or class boundaries.[88] A Multi-Stage Fine-Tuning and Ensembling Strategy for Pancreatic Tumor Segmentation in Diagnostic and Therapeutic MRI
Omer Faruk Durugol,Maximilian Rokuss,Yannick Kirchhoff,Klaus H. Maier-Hein
Main category: cs.CV
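TL;DR: This paper proposes a deep, multi-stage cascaded pre-training strategy that, combined with data augmentation and model ensembling, effectively addresses automated segmentation of pancreatic ductal adenocarcinoma in MRI.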
Details
Motivation: Automated segmentation of pancreatic ductal adenocarcinoma in MRI is critical for clinical workflows but is hindered by poor tumor-tissue contrast and a scarcity of annotated data. Method: The authors use the nnU-Net framework, starting from a general anatomical foundation model and sequentially fine-tuning on CT pancreatic lesion datasets and the target MRI modalities. Through extensive five-fold cross-validation, they systematically evaluate data augmentation schemes and training schedules and build heterogeneous ensembles of specialist models. Result: Aggressive data augmentation produced the highest volumetric accuracy, while default augmentations yielded superior boundary precision (a state-of-the-art MASD of 5.46 mm and HD95 of 17.33 mm for Task 1). The final submission exploited this finding by building heterogeneous ensembles of specialist models, achieving cross-validation Tumor Dice scores of 0.661 for Task 1 and 0.523 for Task 2. Conclusion: The paper presents a deep, multi-stage cascaded pre-training strategy built on the nnU-Net framework for automated segmentation of pancreatic ductal adenocarcinoma in MRI, and develops high-performance models by building heterogeneous ensembles of specialist models. Abstract: Automated segmentation of Pancreatic Ductal Adenocarcinoma (PDAC) from MRI is critical for clinical workflows but is hindered by poor tumor-tissue contrast and a scarcity of annotated data. This paper details our submission to the PANTHER challenge, addressing both diagnostic T1-weighted (Task 1) and therapeutic T2-weighted (Task 2) segmentation. Our approach is built upon the nnU-Net framework and leverages a deep, multi-stage cascaded pre-training strategy, starting from a general anatomical foundation model and sequentially fine-tuning on CT pancreatic lesion datasets and the target MRI modalities. Through extensive five-fold cross-validation, we systematically evaluated data augmentation schemes and training schedules. Our analysis revealed a critical trade-off, where aggressive data augmentation produced the highest volumetric accuracy, while default augmentations yielded superior boundary precision (achieving a state-of-the-art MASD of 5.46 mm and HD95 of 17.33 mm for Task 1). For our final submission, we exploited this finding by constructing custom, heterogeneous ensembles of specialist models, essentially creating a mix of experts. This metric-aware ensembling strategy proved highly effective, achieving a top cross-validation Tumor Dice score of 0.661 for Task 1 and 0.523 for Task 2.
Our work presents a robust methodology for developing specialized, high-performance models in the context of limited data and complex medical imaging tasks (Team MIC-DKFZ).[89] Benchmarking GPT-5 in Radiation Oncology: Measurable Gains, but Persistent Need for Expert Oversight
Ugur Dinc,Jibak Sarkar,Philipp Schubert,Sabine Semrau,Thomas Weissmann,Andre Karius,Johann Brand,Bernd-Niklas Axer,Ahmed Gomaa,Pluvio Stephan,Ishita Sheth,Sogand Beirami,Annette Schwarz,Udo Gaipl,Benjamin Frey,Christoph Bert,Stefanie Corradini,Rainer Fietkau,Florian Putz
Main category: cs.CV
TL;DR: GPT-5, a large language model marketed towards oncology use, outperforms GPT-4 and GPT-3.5 in radiation oncology assessments but still requires expert oversight due to errors in complex cases.
Details
Motivation: To evaluate the performance of GPT-5, a new large language model designed for oncology applications, in clinical decision-making tasks compared to earlier models. Method: GPT-5 was evaluated using the ACR Radiation Oncology In-Training Examination (TXIT) with 300 multiple-choice questions and a curated set of 60 radiation oncology vignettes. Its performance was compared to GPT-4 and GPT-3.5. The vignette evaluation involved generating treatment plans rated by four board-certified radiation oncologists for correctness, comprehensiveness, and hallucinations. Result: GPT-5 achieved a mean accuracy of 92.8% on TXIT, surpassing GPT-4 (78.8%) and GPT-3.5 (62.1%). In the vignette evaluation, GPT-5 received high scores for correctness (3.24/4) and comprehensiveness (3.59/4). Hallucinations were rare, but errors occurred in complex cases requiring precise clinical adaptation or trial knowledge. Inter-rater reliability was low (Fleiss' kappa 0.083). Conclusion: GPT-5 significantly outperforms earlier models in radiation oncology benchmarks, showing promise for clinical decision support. However, its errors in complex scenarios and the variability in clinical judgment highlight the need for rigorous expert review before clinical use. Abstract: Introduction: Large language models (LLM) have shown great potential in clinical decision support. GPT-5 is a novel LLM system that has been specifically marketed towards oncology use. Methods: Performance was assessed using two complementary benchmarks: (i) the ACR Radiation Oncology In-Training Examination (TXIT, 2021), comprising 300 multiple-choice items, and (ii) a curated set of 60 authentic radiation oncologic vignettes representing diverse disease sites and treatment indications. For the vignette evaluation, GPT-5 was instructed to generate concise therapeutic plans. Four board-certified radiation oncologists rated correctness, comprehensiveness, and hallucinations. 
Inter-rater reliability was quantified using Fleiss' kappa. Results: On the TXIT benchmark, GPT-5 achieved a mean accuracy of 92.8%, outperforming GPT-4 (78.8%) and GPT-3.5 (62.1%). Domain-specific gains were most pronounced in Dose and Diagnosis. In the vignette evaluation, GPT-5's treatment recommendations were rated highly for correctness (mean 3.24/4, 95% CI: 3.11-3.38) and comprehensiveness (3.59/4, 95% CI: 3.49-3.69). Hallucinations were rare with no case reaching majority consensus for their presence. Inter-rater agreement was low (Fleiss' kappa 0.083 for correctness), reflecting inherent variability in clinical judgment. Errors clustered in complex scenarios requiring precise trial knowledge or detailed clinical adaptation. Discussion: GPT-5 clearly outperformed prior model variants on the radiation oncology multiple-choice benchmark. Although GPT-5 exhibited favorable performance in generating real-world radiation oncology treatment recommendations, correctness ratings indicate room for further improvement. While hallucinations were infrequent, the presence of substantive errors underscores that GPT-5-generated recommendations require rigorous expert oversight before clinical implementation.[90] TMUAD: Enhancing Logical Capabilities in Unified Anomaly Detection Models with a Text Memory Bank
Jiawei Liu,Jiahe Hou,Wei Wang,Jinsong Du,Yang Cong,Huijie Fan
Main category: cs.CV
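TL;DR: This paper proposes TMUAD, a new anomaly detection framework that combines structural and logical information, improves detection with three memory banks, and achieves the best performance on multiple datasets.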
Details
Motivation: Existing methods rely on carefully designed image feature extractors and memory banks to capture logical relationships between objects, and normal data is limited, so a new approach is needed to improve logical anomaly detection. Method: A Three-Memory framework (TMUAD) is proposed for unified structural and logical anomaly detection, comprising a class-level text memory bank, an object-level image memory bank, and a patch-level image feature memory bank. Result: TMUAD achieves state-of-the-art anomaly detection performance across multiple domains and provides interpretable anomaly scores. Conclusion: By combining structural and logical anomaly detection, TMUAD achieves state-of-the-art performance on seven publicly available datasets from industrial and medical domains. Abstract: Anomaly detection, which aims to identify anomalies deviating from normal patterns, is challenging due to the limited amount of normal data available. Unlike most existing unified methods that rely on carefully designed image feature extractors and memory banks to capture logical relationships between objects, we introduce a text memory bank to enhance the detection of logical anomalies. Specifically, we propose a Three-Memory framework for Unified structural and logical Anomaly Detection (TMUAD). First, we build a class-level text memory bank for logical anomaly detection by the proposed logic-aware text extractor, which can capture rich logical descriptions of objects from input images. Second, we construct an object-level image memory bank that preserves complete object contours by extracting features from segmented objects. Third, we employ visual encoders to extract patch-level image features for constructing a patch-level memory bank for structural anomaly detection. These three complementary memory banks are used to retrieve and compare normal images that are most similar to the query image, compute anomaly scores at multiple levels, and fuse them into a final anomaly score. By unifying structural and logical anomaly detection through collaborative memory banks, TMUAD achieves state-of-the-art performance across seven publicly available datasets involving industrial and medical domains. The model and code are available at https://github.com/SIA-IDE/TMUAD.[91] VoCap: Video Object Captioning and Segmentation from Any Prompt
Jasper Uijlings,Xingyi Zhou,Xiuye Gu,Arsha Nagrani,Anurag Arnab,Alireza Fathi,David Ross,Cordelia Schmid
Main category: cs.CV
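TL;DR: VoCap is a novel video model that takes a video and a multimodal prompt, handles several video object understanding tasks simultaneously, and performs strongly on multiple tasks.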
Details
Motivation: To understand objects in videos in terms of fine-grained localization and detailed semantic properties, a new video model, VoCap, is proposed. Method: VoCap consumes a video and a prompt of various modalities (text, box, or mask) and produces a spatio-temporal masklet with a corresponding object-centric caption. Result: VoCap achieves state-of-the-art results on referring expression video object segmentation, is competitive on semi-supervised video object segmentation, and establishes a benchmark for video object captioning. Conclusion: VoCap is a flexible video model that simultaneously performs promptable video object segmentation, referring expression segmentation, and object captioning, and achieves state-of-the-art results on multiple tasks. Abstract: Understanding objects in videos in terms of fine-grained localization masks and detailed semantic properties is a fundamental task in video understanding. In this paper, we propose VoCap, a flexible video model that consumes a video and a prompt of various modalities (text, box or mask), and produces a spatio-temporal masklet with a corresponding object-centric caption. As such, our model simultaneously addresses the tasks of promptable video object segmentation, referring expression segmentation, and object captioning. Since obtaining data for this task is tedious and expensive, we propose to annotate an existing large-scale segmentation dataset (SAV) with pseudo object captions. We do so by preprocessing videos with their ground-truth masks to highlight the object of interest and feeding the result to a large Vision Language Model (VLM). For an unbiased evaluation, we collect manual annotations on the validation set. We call the resulting dataset SAV-Caption. We train our VoCap model at scale on SAV-Caption together with a mix of other image and video datasets. Our model yields state-of-the-art results on referring expression video object segmentation, is competitive on semi-supervised video object segmentation, and establishes a benchmark for video object captioning. Our dataset will be made available at https://github.com/google-deepmind/vocap.[92] The Demon is in Ambiguity: Revisiting Situation Recognition with Single Positive Multi-Label Learning
Yiming Lin,Yuchen Niu,Shang Wang,Kaizhu Huang,Qiufeng Wang,Xiao-Bo Jin
Main category: cs.CV
TL;DR: This paper addresses the multi-label nature of verb classification in situation recognition and proposes a novel method (GE-VerbMLP) that improves model performance on real-world datasets.
Details
Motivation: Existing methods treat verb classification as a single-label problem, which fails to address the inherent ambiguity and semantic overlap in visual event recognition. This motivates the need for a multi-label learning approach. Method: The authors propose the Graph Enhanced Verb Multilayer Perceptron (GE-VerbMLP), combining graph neural networks and adversarial training to tackle the single positive multi-label learning (SPMLL) problem in situation recognition. Result: The proposed GE-VerbMLP achieves more than a 3% improvement in mean average precision (MAP) while maintaining competitiveness on traditional top-1 and top-5 accuracy metrics. Conclusion: The paper concludes that verb classification in situation recognition is inherently multi-label, and the proposed GE-VerbMLP model effectively addresses this challenge, achieving improved performance on real-world datasets. Abstract: Situation recognition (SR) is a fundamental task in computer vision that aims to extract structured semantic summaries from images by identifying key events and their associated entities. Specifically, given an input image, the model must first classify the main visual events (verb classification), then identify the participating entities and their semantic roles (semantic role labeling), and finally localize these entities in the image (semantic role localization). Existing methods treat verb classification as a single-label problem, but we show through a comprehensive analysis that this formulation fails to address the inherent ambiguity in visual event recognition, as multiple verb categories may reasonably describe the same image. This paper makes three key contributions: First, we reveal through empirical analysis that verb classification is inherently a multi-label problem due to the ubiquitous semantic overlap between verb categories.
Second, given the impracticality of fully annotating large-scale datasets with multiple labels, we propose to reformulate verb classification as a single positive multi-label learning (SPMLL) problem - a novel perspective in SR research. Third, we design a comprehensive multi-label evaluation benchmark for SR that fairly evaluates model performance in a multi-label setting. To address the challenges of SPMLL, we further develop the Graph Enhanced Verb Multilayer Perceptron (GE-VerbMLP), which combines graph neural networks to capture label correlations and adversarial training to optimize decision boundaries. Extensive experiments on real-world datasets show that our approach achieves more than 3% MAP improvement while remaining competitive on traditional top-1 and top-5 accuracy metrics.[93] DriveQA: Passing the Driving Knowledge Test
Maolin Wei,Wanzhou Liu,Eshed Ohn-Bar
Main category: cs.CV
TL;DR: DriveQA is an open-source benchmark for evaluating LLMs and MLLMs on comprehensive traffic regulations and scenarios, revealing their strengths and weaknesses in driving knowledge. Fine-tuning and pretraining on DriveQA enhance performance on real-world driving datasets and improve generalization across QA tasks.