Table of Contents
cs.CL
[1] Towards Probabilistic Question Answering Over Tabular Data
Chen Shen,Sajjadur Rahman,Estevam Hruschka
Main category: cs.CL
TL;DR: This paper introduces LUCARIO, a new benchmark and framework for probabilistic question answering over tabular data by combining Bayesian Networks with large language models.
Details
Motivation: Current QA systems like NL2SQL struggle with probabilistic questions requiring reasoning under uncertainty, which this paper aims to address. Method: The method involves inducing Bayesian Networks from tabular data, translating natural language queries into probabilistic queries, and using large language models (LLMs) to produce answers. Result: Empirical results show significant improvements over baseline approaches, demonstrating the effectiveness of the proposed hybrid symbolic-neural reasoning framework. Conclusion: The study concludes that integrating symbolic reasoning (Bayesian Networks) with neural methods (LLMs) enhances probabilistic QA performance over tabular data. Abstract: Current approaches for question answering (QA) over tabular data, such as NL2SQL systems, perform well for factual questions where answers are directly retrieved from tables. However, they fall short on probabilistic questions requiring reasoning under uncertainty. In this paper, we introduce a new benchmark LUCARIO and a framework for probabilistic QA over large tabular data. Our method induces Bayesian Networks from tables, translates natural language queries into probabilistic queries, and uses large language models (LLMs) to generate final answers. Empirical results demonstrate significant improvements over baselines, highlighting the benefits of hybrid symbolic-neural reasoning.
[2] Multi-lingual Functional Evaluation for Large Language Models
Victor Ojewale,Inioluwa Deborah Raji,Suresh Venkatasubramanian
Main category: cs.CL
TL;DR: This paper introduces new cross-lingual benchmarks to better evaluate the functional performance and robustness of multilingual language models, revealing significant variations across languages and benchmarks.
Details
Motivation: Existing static data benchmarks often fail to provide an adequate understanding of practical performance and robustness of models in multi-lingual settings. Method: The authors created new multi-lingual functional benchmarks by translating existing templates into five additional languages: French, Spanish, Hindi, Arabic, and Yoruba. Result: Some static benchmarks closely reflect functional performance, with notable performance drops observed between certain benchmarks and languages. Model robustness also varies significantly across languages, with some languages consistently performing better. Conclusion: Static multi-lingual benchmarks vary in their ability to capture functional performance, and model robustness differs significantly across languages. Abstract: Multi-lingual competence in large language models is often evaluated via static data benchmarks such as Belebele, M-MMLU and M-GSM. However, these evaluations often fail to provide an adequate understanding of the practical performance and robustness of models across multi-lingual settings. In response, we create multi-lingual functional benchmarks -- Cross-Lingual Grade School Math Symbolic (CL-GSM Symbolic) and Cross-Lingual Instruction-Following Eval (CL-IFEval)-- by translating existing functional benchmark templates from English to five additional languages that span the range of resources available for NLP: French, Spanish, Hindi, Arabic and Yoruba. Our results reveal that some static multi-lingual benchmarks capture functional performance much more closely than others (i.e. across models, there is a 24%, 17% and 18% decrease in performance between M-GSM and CL-GSM Symbolic in English, French and Spanish respectively; similarly there's a 15 - 24% performance drop across languages between Belebele and CL-IFEval, and only a 0.5% to 3% performance drop between M-MMLU and CL-IFEval). Similarly, we find that model robustness across languages varies significantly, with certain languages (eg. 
Arabic, English) being the most consistently well performing across evaluation iterations.
[3] The Ideation-Execution Gap: Execution Outcomes of LLM-Generated versus Human Research Ideas
Chenglei Si,Tatsunori Hashimoto,Diyi Yang
Main category: cs.CL
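TL;DR: Although research ideas generated by large language models (LLMs) are rated as more novel in initial evaluations, their scores drop significantly once the ideas are actually executed, revealing the limits of current LLMs in generating effective research ideas.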
Details
Motivation: To test whether AI-generated research ideas lead to better research outcomes than ideas written by human experts. Method: 43 expert researchers were recruited to execute randomly assigned ideas, either written by experts or generated by an LLM, and the resulting projects were reviewed blindly. Result: After execution, the scores of LLM-generated ideas dropped significantly more than those of expert-written ideas on all evaluation metrics, and on many metrics the ranking flipped so that human ideas scored higher than LLM ideas. Conclusion: The drop in scores for AI-generated ideas after execution shows that current LLMs are limited in producing truly effective research ideas. Abstract: Large Language Models (LLMs) have shown promise in accelerating the scientific research pipeline. A key capability for this process is the ability to generate novel research ideas, and prior studies have found settings in which LLM-generated research ideas were judged as more novel than human-expert ideas. However, a good idea should not simply appear to be novel, it should also result in better research after being executed. To test whether AI-generated ideas lead to better research outcomes, we conduct an execution study by recruiting 43 expert researchers to execute randomly-assigned ideas, either written by experts or generated by an LLM. Each expert spent over 100 hours implementing the idea and wrote a 4-page short paper to document the experiments. All the executed projects are then reviewed blindly by expert NLP researchers. Comparing the review scores of the same ideas before and after execution, the scores of the LLM-generated ideas decrease significantly more than expert-written ideas on all evaluation metrics (novelty, excitement, effectiveness, and overall; p < 0.05), closing the gap between LLM and human ideas observed at the ideation stage. When comparing the aggregated review scores from the execution study, we even observe that for many metrics there is a flip in rankings where human ideas score higher than LLM ideas. This ideation-execution gap highlights the limitations of current LLMs in generating truly effective research ideas and the challenge of evaluating research ideas in the absence of execution outcomes.
[4] MultiFinRAG: An Optimized Multimodal Retrieval-Augmented Generation (RAG) Framework for Financial Question Answering
Chinmay Gondhalekar,Urjitkumar Patel,Fang-Chun Yeh
Main category: cs.CL
TL;DR: MultiFinRAG is an efficient, purpose-built framework for financial document QA that improves accuracy by enabling cross-modal reasoning across text, tables, and images.
Details
Motivation: Financial documents combine diverse modalities like narrative text, tables, and figures, making question answering difficult for traditional LLMs and RAG pipelines due to token limits, layout loss, and fragmented context. Method: MultiFinRAG performs multimodal extraction by processing batches of tables and figures through a lightweight, quantized open-source multimodal LLM to generate structured JSON outputs and textual summaries. These, along with narrative text, are embedded and indexed with modality-aware thresholds. A tiered fallback strategy dynamically escalates from text-only to multi-context when necessary. Result: Despite running on commodity hardware, MultiFinRAG outperforms ChatGPT-4o (free-tier) by 19 percentage points in accuracy on complex financial QA tasks involving multiple modalities. Conclusion: MultiFinRAG is a retrieval-augmented generation framework designed for financial QA that enables cross-modal reasoning while reducing irrelevant context, and achieves higher accuracy compared to ChatGPT-4o on complex financial tasks. Abstract: Financial documents--such as 10-Ks, 10-Qs, and investor presentations--span hundreds of pages and combine diverse modalities, including dense narrative text, structured tables, and complex figures. Answering questions over such content often requires joint reasoning across modalities, which strains traditional large language models (LLMs) and retrieval-augmented generation (RAG) pipelines due to token limitations, layout loss, and fragmented cross-modal context. We introduce MultiFinRAG, a retrieval-augmented generation framework purpose-built for financial QA. MultiFinRAG first performs multimodal extraction by grouping table and figure images into batches and sending them to a lightweight, quantized open-source multimodal LLM, which produces both structured JSON outputs and concise textual summaries. 
These outputs, along with narrative text, are embedded and indexed with modality-aware similarity thresholds for precise retrieval. A tiered fallback strategy then dynamically escalates from text-only to text+table+image contexts when necessary, enabling cross-modal reasoning while reducing irrelevant context. Despite running on commodity hardware, MultiFinRAG achieves 19 percentage points higher accuracy than ChatGPT-4o (free-tier) on complex financial QA tasks involving text, tables, images, and combined multimodal reasoning.
[5] Uncovering Hidden Violent Tendencies in LLMs: A Demographic Analysis via Behavioral Vignettes
Quintin Myers,Yanjun Gao
Main category: cs.CL
TL;DR: This study finds that large language models often handle morally ambiguous violent scenarios inconsistently and show demographic biases, raising concerns about their use in moderating online content.
Details
Motivation: As LLMs are increasingly used to detect and respond to online violence, it's important to assess their reasoning in morally complex, real-world situations and potential biases in their responses. Method: The study evaluates six LLMs using the Violent Behavior Vignette Questionnaire (VBVQ), a tool from social science, along with persona-based prompting to test bias related to race, age, and geography. Result: Two main findings emerged: (1) LLMs' generated text often differs from their internal preference for violent responses, and (2) their tendencies toward violent reactions vary across demographic factors, contradicting established research in related fields. Conclusion: Large language models (LLMs) show discrepancies in handling violent content, particularly in morally ambiguous scenarios, and their responses vary across demographics. Abstract: Large language models (LLMs) are increasingly proposed for detecting and responding to violent content online, yet their ability to reason about morally ambiguous, real-world scenarios remains underexamined. We present the first study to evaluate LLMs using a validated social science instrument designed to measure human response to everyday conflict, namely the Violent Behavior Vignette Questionnaire (VBVQ). To assess potential bias, we introduce persona-based prompting that varies race, age, and geographic identity within the United States. Six LLMs developed across different geopolitical and organizational contexts are evaluated under a unified zero-shot setting. Our study reveals two key findings: (1) LLMs' surface-level text generation often diverges from their internal preference for violent responses; (2) their violent tendencies vary across demographics, frequently contradicting established findings in criminology, social science, and psychology.
[6] Decide less, communicate more: On the construct validity of end-to-end fact-checking in medicine
Sebastian Joseph,Lily Chen,Barry Wei,Michael Mackert,Iain J. Marshall,Paul Pu Liang,Ramez Kouzy,Byron C. Wallace,Junyi Jessy Li
Main category: cs.CL
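TL;DR: The study examines the obstacles to applying end-to-end fact-checking systems in medicine and argues that fact-checking should be treated as an interactive communication problem rather than an end-to-end process.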
Details
Motivation: Interest in applying automatic fact-checking systems to public health and medicine has grown because of the high-stakes nature of medical decisions and the difficulty of appraising a vast and diverse medical literature. Yet these systems remain largely unused. Method: The authors present the first study of how clinical experts verify real claims from social media by synthesizing medical evidence, probing the upper bound of medical fact-checking. Result: The study reveals fundamental challenges for end-to-end fact-checking in medicine: difficulty connecting claims in the wild to scientific evidence in the form of clinical trials; ambiguous, underspecified claims mixed with mismatched intentions; and inherently subjective veracity labels. Conclusion: Fact-checking should be approached and evaluated as an interactive communication problem rather than an end-to-end process. Abstract: Technological progress has led to concrete advancements in tasks that were regarded as challenging, such as automatic fact-checking. Interest in adopting these systems for public health and medicine has grown due to the high-stakes nature of medical decisions and challenges in critically appraising a vast and diverse medical literature. Evidence-based medicine connects to every individual, and yet the nature of it is highly technical, rendering the medical literacy of majority users inadequate to sufficiently navigate the domain. Such problems with medical communication ripens the ground for end-to-end fact-checking agents: check a claim against current medical literature and return with an evidence-backed verdict. And yet, such systems remain largely unused. To understand this, we present the first study examining how clinical experts verify real claims from social media by synthesizing medical evidence. In searching for this upper-bound, we reveal fundamental challenges in end-to-end fact-checking when applied to medicine: Difficulties connecting claims in the wild to scientific evidence in the form of clinical trials; ambiguities in underspecified claims mixed with mismatched intentions; and inherently subjective veracity labels. We argue that fact-checking should be approached and evaluated as an interactive communication problem, rather than an end-to-end process.
[7] Optimising Language Models for Downstream Tasks: A Post-Training Perspective
Zhengyan Shi
Main category: cs.CL
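TL;DR: This thesis addresses the challenges of adapting language models to specific tasks and proposes a series of new methods, including a continued pre-training technique, a parameter-efficient fine-tuning method, and improved supervised fine-tuning methods, to improve the adaptability, efficiency, and performance of LMs.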
Details
Motivation: Language models perform remarkably well in NLP, but adapting them efficiently and robustly to specific tasks remains challenging. As their scale and complexity grow, fine-tuning on labelled data often underutilizes available unlabelled data, leads to overfitting on small task-specific sets, and imposes significant computational costs. Method: The thesis first introduces a novel continued pre-training technique, then proposes a parameter-efficient fine-tuning method, and develops improved supervised fine-tuning methods along with new evaluation methods and benchmarks, such as multi-hop spatial reasoning tasks. Result: Extensive empirical studies show that these methods substantially improve LM robustness, efficiency, and generalization, making LMs more adaptable to a broad range of applications. Conclusion: The thesis proposes a series of methods to better adapt language models (LMs) to downstream applications, including a new continued pre-training technique that extracts task-relevant knowledge from unlabelled data, a parameter-efficient fine-tuning method, and improved supervised fine-tuning methods. These methods improve LM robustness, efficiency, and generalization, making LMs better suited to a wide range of applications. Abstract: Language models (LMs) have demonstrated remarkable capabilities in NLP, yet adapting them efficiently and robustly to specific tasks remains challenging. As their scale and complexity grow, fine-tuning LMs on labelled data often underutilizes available unlabelled data, leads to overfitting on small task-specific sets, and imposes significant computational costs. These limitations hamper their application to the open-ended landscape of real-world language tasks. This thesis proposes a series of methods to better adapt LMs to downstream applications. First, we explore strategies for extracting task-relevant knowledge from unlabelled data, introducing a novel continued pre-training technique that outperforms state-of-the-art semi-supervised approaches. Next, we present a parameter-efficient fine-tuning method that substantially reduces memory and compute costs while maintaining competitive performance. We also introduce improved supervised fine-tuning methods that enable LMs to better follow instructions, especially when labelled data is scarce, enhancing their performance across a range of NLP tasks, including open-ended generation. Finally, we develop new evaluation methods and benchmarks, such as multi-hop spatial reasoning tasks, to assess LM capabilities and adaptation more comprehensively. Through extensive empirical studies across diverse NLP tasks, our results demonstrate that these approaches substantially improve LM robustness, efficiency, and generalization, making them more adaptable to a broad range of applications.
These advances mark a significant step towards more robust and efficient LMs, bringing us closer to the goal of artificial general intelligence.
[8] FineWeb2: One Pipeline to Scale Them All -- Adapting Pre-Training Data Processing to Every Language
Guilherme Penedo,Hynek Kydlíček,Vinko Sabolčec,Bettina Messmer,Negar Foroutan,Amir Hossein Kargaran,Colin Raffel,Martin Jaggi,Leandro Von Werra,Thomas Wolf
Main category: cs.CL
TL;DR: This paper presents FineWeb2, a large-scale multilingual pre-training dataset that improves language model performance on non-English languages.
Details
Motivation: Training multilingual large language models is currently limited by the availability of high-quality, diverse non-English data, so a filtering and deduplication pipeline that can be applied to any language is needed. Method: Building on FineWeb, the authors construct a pre-training dataset curation pipeline that adapts automatically to many languages, and propose a dataset rebalancing method that considers both duplication count and quality. Result: Experiments on nine diverse languages show that the non-English corpora produced by the pipeline yield more performant models than prior datasets; the pipeline was then scaled to over 1000 languages to build FineWeb2, a dataset of 5 billion documents. Conclusion: The release of FineWeb2 provides higher-quality data for multilingual LLM pre-training and advances the development of non-English language models. Abstract: Pre-training state-of-the-art large language models (LLMs) requires vast amounts of clean and diverse text data. While the open development of large high-quality English pre-training datasets has seen substantial recent progress, training performant multilingual LLMs remains a challenge, in large part due to the inherent difficulty of tailoring filtering and deduplication pipelines to a large number of languages. In this work, we introduce a new pre-training dataset curation pipeline based on FineWeb that can be automatically adapted to support any language. We extensively ablate our pipeline design choices on a set of nine diverse languages, guided by a set of meaningful and informative evaluation tasks that were chosen through a novel selection process based on measurable criteria. Ultimately, we show that our pipeline can be used to create non-English corpora that produce more performant models than prior datasets. We additionally introduce a straightforward and principled approach to rebalance datasets that takes into consideration both duplication count and quality, providing an additional performance uplift. Finally, we scale our pipeline to over 1000 languages using almost 100 Common Crawl snapshots to produce FineWeb2, a new 20 terabyte (5 billion document) multilingual dataset which we release along with our pipeline, training, and evaluation codebases.
[9] KaLM-Embedding-V2: Superior Training Techniques and Data Inspire A Versatile Embedding Model
Xinping Zhao,Xinshuo Hu,Zifei Shan,Shouzheng Huang,Yao Zhou,Zetian Sun,Zhenyu Liu,Dongfang Li,Xinyuan Wei,Qian Chen,Youcheng Pan,Yang Xiang,Meishan Zhang,Haofen Wang,Jun Yu,Baotian Hu,Min Zhang
Main category: cs.CL
TL;DR: This paper introduces KaLM-Embedding-V2, a compact and high-performing text embedding model that uses innovative training techniques and data strategies to achieve superior results.
Details
Motivation: To develop a compact embedding model that performs well in general-purpose text embedding tasks. Method: The method involves using a fully bidirectional transformer with mean-pooling, a multi-stage training pipeline, a focal-style reweighting mechanism, and online hard-negative mixing strategy. Result: KaLM-Embedding-V2 achieved impressive results on the MTEB benchmark, surpassing larger models in performance while having less than 1B parameters. Conclusion: KaLM-Embedding-V2 is a versatile and compact embedding model that significantly outperforms other models of comparable size and competes with much larger models. Abstract: In this paper, we propose KaLM-Embedding-V2, a versatile and compact embedding model, which achieves impressive performance in general-purpose text embedding tasks by leveraging superior training techniques and data. Our key innovations include: (1) To better align the architecture with representation learning, we remove the causal attention mask and adopt a fully bidirectional transformer with simple yet effective mean-pooling to produce fixed-length embeddings; (2) We employ a multi-stage training pipeline: (i) pre-training on large-scale weakly supervised open-source corpora; (ii) fine-tuning on high-quality retrieval and non-retrieval datasets; and (iii) model-soup parameter averaging for robust generalization. Besides, we introduce a focal-style reweighting mechanism that concentrates learning on difficult samples and an online hard-negative mixing strategy to continuously enrich hard negatives without expensive offline mining; (3) We collect over 20 categories of data for pre-training and 100 categories of data for fine-tuning, to boost both the performance and generalization of the embedding model. 
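As a rough illustration of two ingredients named in the abstract above, mean-pooling over token embeddings and focal-style reweighting of a contrastive loss, here is a minimal pure-Python sketch. The function names, the InfoNCE-style loss, the `(1 - p) ** gamma` weight, and the default `temperature` and `gamma` values are assumptions borrowed from the general focal-loss idea; the abstract does not give KaLM-Embedding-V2's exact formulation.

```python
import math


def mean_pool(token_vectors, attention_mask):
    """Average the token vectors whose mask entry is 1, ignoring padding."""
    dim = len(token_vectors[0])
    total = [0.0] * dim
    count = 0
    for vec, keep in zip(token_vectors, attention_mask):
        if keep:
            count += 1
            for i, x in enumerate(vec):
                total[i] += x
    if count == 0:
        raise ValueError("attention_mask selects no tokens")
    return [x / count for x in total]


def focal_weighted_infonce(pos_score, neg_scores, temperature=0.05, gamma=2.0):
    """InfoNCE loss for one query, scaled by a focal-style weight.

    Assumed form: weight = (1 - p_pos) ** gamma, so samples the model
    already ranks correctly (p_pos near 1) contribute little to the loss.
    """
    logits = [pos_score / temperature] + [s / temperature for s in neg_scores]
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(l - peak) for l in logits]
    p_pos = exps[0] / sum(exps)
    weight = (1.0 - p_pos) ** gamma
    return weight * -math.log(p_pos)


# Hypothetical inputs: three token vectors, the last one is padding.
embedding = mean_pool([[1.0, 2.0], [3.0, 4.0], [9.0, 9.0]], [1, 1, 0])
easy = focal_weighted_infonce(0.9, [0.1])   # positive clearly ahead
hard = focal_weighted_infonce(0.5, [0.45])  # positive barely ahead
```

With `gamma=0` the weight is 1 and the function reduces to plain InfoNCE, which makes the effect of the reweighting easy to compare.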
Extensive evaluations on the Massive Text Embedding Benchmark (MTEB) Chinese and English show that our model significantly outperforms others of comparable size, and competes with 3x, 14x, 18x, and 26x larger embedding models, setting a new standard for a versatile and compact embedding model with less than 1B parameters.
[10] Can Gradient Descent Simulate Prompting?
Eric Zhang,Leshem Choshen,Jacob Andreas
Main category: cs.CL
TL;DR: This paper presents a method for meta-training language models so that parameter updates via gradient descent can effectively emulate the performance of prompting, achieving comparable results on certain tasks without needing real labels.
Details
Motivation: The motivation is to explore whether fine-tuning can be made as effective as prompting in language models when incorporating new information, particularly since parameter updates through fine-tuning do not incur long-term storage costs. Method: The method involves meta-training language models using gradient-based meta-learning techniques, where the model's own prompted predictions are used as targets. This eliminates the need for ground-truth labels. Result: The results show that subsequent gradient descent training can recover some or all of the performance of prompted models. This includes improvements on tasks like the 'reversal curse' and answering questions about text passages after a single gradient update. Conclusion: The paper concludes that with appropriate initialization, gradient descent can emulate the effectiveness of prompting in language models, opening up new possibilities for long-context modeling and understanding the generalization capabilities of gradient-based learning. Abstract: There are two primary ways of incorporating new information into a language model (LM): changing its prompt or changing its parameters, e.g. via fine-tuning. Parameter updates incur no long-term storage cost for model changes. However, for many model updates, prompting is significantly more effective: prompted models can generalize robustly from single examples and draw logical inferences that do not occur under standard fine-tuning. Can models be modified so that fine-tuning does emulate prompting? This paper describes a method for meta-training LMs such that gradient updates emulate the effects of conditioning on new information. Our approach uses tools from gradient-based meta-learning but uses an LM's own prompted predictions as targets, eliminating the need for ground-truth labels. 
Subsequent gradient descent training recovers some (and occasionally all) of prompted model performance -- showing improvement on the "reversal curse" tasks, and answering questions about text passages after a single gradient update. These results suggest that, with appropriate initialization, gradient descent can be surprisingly expressive. Our results suggest new avenues for long-context modeling and offer insight into the generalization capabilities of gradient-based learning.
[11] SAC: A Framework for Measuring and Inducing Personality Traits in LLMs with Dynamic Intensity Control
Adithya Chittem,Aishna Shrivastava,Sai Tarun Pendela,Jagat Sesh Challa,Dhruv Kumar
Main category: cs.CL
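TL;DR: This paper proposes a new approach to modelling and controlling personality traits in large language models, introducing the 16PF model and the SAC framework to achieve finer-grained and more controllable personality expression.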
Details
Motivation: LLMs are increasingly expected to display human-like personalities during interactions, but existing models have two major limitations: they provide only coarse personality dimensions and they lack mechanisms for controlling trait intensity. Method: The authors extend the MPI, which originally used the Big Five model, to incorporate the 16PF model, and develop a structured framework called Specific Attribute Control (SAC) that uses adjective-based semantic anchoring and behavioural questions to guide trait intensity expression. Result: Modelling trait intensity as a continuous spectrum yields more consistent and controllable personality expression than binary trait toggling; changes in target trait intensity systematically influence psychologically related traits in coherent directions. Conclusion: By introducing the 16PF model and developing the SAC framework, the paper offers a new way to control LLM personality traits at a fine-grained level, enabling controlled and nuanced human-machine interaction in domains such as healthcare, education, and interviewing. Abstract: Large language models (LLMs) have gained significant traction across a wide range of fields in recent years. There is also a growing expectation for them to display human-like personalities during interactions. To meet this expectation, numerous studies have proposed methods for modelling LLM personalities through psychometric evaluations. However, most existing models face two major limitations: they rely on the Big Five (OCEAN) framework, which only provides coarse personality dimensions, and they lack mechanisms for controlling trait intensity. In this paper, we address this gap by extending the Machine Personality Inventory (MPI), which originally used the Big Five model, to incorporate the 16 Personality Factor (16PF) model, allowing expressive control over sixteen distinct traits. We also developed a structured framework known as Specific Attribute Control (SAC) for evaluating and dynamically inducing trait intensity in LLMs. Our method introduces adjective-based semantic anchoring to guide trait intensity expression and leverages behavioural questions across five intensity factors: Frequency, Depth, Threshold, Effort, and Willingness. Through experimentation, we find that modelling intensity as a continuous spectrum yields substantially more consistent and controllable personality expression compared to binary trait toggling. Moreover, we observe that changes in target trait intensity systematically influence closely related traits in psychologically coherent directions, suggesting that LLMs internalize multi-dimensional personality structures rather than treating traits in isolation.
Our work opens new pathways for controlled and nuanced human-machine interactions in domains such as healthcare, education, and interviewing processes, bringing us one step closer to truly human-like social machines.
[12] Large Language Models Acing Chartered Accountancy
Jatin Gupta,Akhil Sharma,Saransh Singhania,Mohammad Adnan,Sakshi Deo,Ali Imam Abidi,Keshav Gupta
Main category: cs.CL
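TL;DR: This paper introduces CA-Ben, a benchmark for evaluating the financial, legal, and quantitative reasoning capabilities of large language models (LLMs), identifies the limitations of current models in certain areas, and suggests possible directions for improvement.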
Details
Motivation: To fill a research gap on how effectively LLMs capture and apply domain-specific financial knowledge in the Indian financial context. Method: The study builds CA-Ben, a benchmark designed to evaluate the financial, legal, and quantitative reasoning capabilities of LLMs, and evaluates six prominent LLMs using standardized protocols. Result: Performance varies across LLMs; Claude 3.5 Sonnet and GPT-4o perform best, especially in conceptual and legal reasoning, but numerical computation and legal interpretation remain challenging. Conclusion: The paper highlights the strengths and limitations of current LLMs in financial applications and suggests improving them through hybrid reasoning and retrieval-augmented generation, particularly for quantitative analysis and legal interpretation. Abstract: Advanced intelligent systems, particularly Large Language Models (LLMs), are significantly reshaping financial practices through advancements in Natural Language Processing (NLP). However, the extent to which these models effectively capture and apply domain-specific financial knowledge remains uncertain. Addressing a critical gap in the expansive Indian financial context, this paper introduces CA-Ben, a Chartered Accountancy benchmark specifically designed to evaluate the financial, legal, and quantitative reasoning capabilities of LLMs. CA-Ben comprises structured question-answer datasets derived from the rigorous examinations conducted by the Institute of Chartered Accountants of India (ICAI), spanning foundational, intermediate, and advanced CA curriculum stages. Six prominent LLMs i.e. GPT 4o, LLAMA 3.3 70B, LLAMA 3.1 405B, MISTRAL Large, Claude 3.5 Sonnet, and Microsoft Phi 4 were evaluated using standardized protocols. Results indicate variations in performance, with Claude 3.5 Sonnet and GPT-4o outperforming others, especially in conceptual and legal reasoning. Notable challenges emerged in numerical computations and legal interpretations. The findings emphasize the strengths and limitations of current LLMs, suggesting future improvements through hybrid reasoning and retrieval-augmented generation methods, particularly for quantitative analysis and accurate legal interpretation.
[13] A Semi-supervised Scalable Unified Framework for E-commerce Query Classification
Chunyuan Yuan,Chong Zhang,Zheng Fang,Ming Pang,Xue Jiang,Changping Peng,Zhangang Lin,Ching Law
Main category: cs.CL
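TL;DR: This paper proposes SSUF, an efficient framework for e-commerce query classification that integrates several enhancement modules to address insufficient query information and dependence on posterior labels, and that outperforms existing models.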
Details
Motivation: E-commerce queries are usually short and lack context, and traditional methods rely on users' click behavior to construct training samples, which creates a Matthew-effect vicious cycle; in addition, the subtasks of query classification lack a unified framework, which hurts modelling efficiency. Method: The authors propose a Semi-supervised Scalable Unified Framework (SSUF) comprising knowledge-enhanced, label-enhanced, and structure-enhanced modules to address insufficient query information, dependence on posterior labels, and the lack of a unified framework. Result: By enhancing query representations, reducing dependence on posterior labels, and exploiting complex label relations, SSUF improves query classification and performs well in extensive experiments. Conclusion: SSUF significantly outperforms existing models in offline and online A/B experiments and provides a semi-supervised, scalable, unified framework for e-commerce query classification. Abstract: Query classification, including multiple subtasks such as intent and category prediction, is vital to e-commerce applications. E-commerce queries are usually short and lack context, and the information between labels cannot be used, resulting in insufficient prior information for modeling. Most existing industrial query classification methods rely on users' posterior click behavior to construct training samples, resulting in a Matthew vicious cycle. Furthermore, the subtasks of query classification lack a unified framework, leading to low efficiency for algorithm optimization. In this paper, we propose a novel Semi-supervised Scalable Unified Framework (SSUF), containing multiple enhanced modules to unify the query classification tasks. The knowledge-enhanced module uses world knowledge to enhance query representations and solve the problem of insufficient query information. The label-enhanced module uses label semantics and semi-supervised signals to reduce the dependence on posterior labels. The structure-enhanced module enhances the label representation based on the complex label relations. Each module is highly pluggable, and input features can be added or removed as needed according to each subtask. We conduct extensive offline and online A/B experiments, and the results show that SSUF significantly outperforms the state-of-the-art models.
[14] MT2-CSD: A New Dataset and Multi-Semantic Knowledge Fusion Method for Conversational Stance Detection
Fuqiang Niu,Genan Dai,Yisha Lu,Jiayu Liao,Xiang Li,Hu Huang,Bowen Zhang
Main category: cs.CL
TL;DR: This paper introduces the MT2-CSD dataset and the LLM-CRAN model, which effectively addresses multi-target, multi-turn conversational stance detection challenges by leveraging large language models.
Details
Motivation: Traditional stance detection methods are limited in modeling multi-party social media discussions due to a lack of realistic datasets and the complexity of conversational dynamics. Method: A novel dataset (MT2-CSD) was introduced, along with the development of the LLM-CRAN model that leverages large language models to enhance conversational understanding for multi-target, multi-turn stance detection. Result: MT2-CSD is the largest and most conversationally complex dataset for stance detection, and experiments show that LLM-CRAN achieves superior performance compared to baseline models. Conclusion: The proposed LLM-CRAN model significantly outperforms existing baseline models in conversational stance detection on the MT2-CSD dataset. Abstract: In the realm of contemporary social media, automatic stance detection is pivotal for opinion mining, as it synthesizes and examines user perspectives on contentious topics to uncover prevailing trends and sentiments. Traditional stance detection research often targets individual instances, thereby limiting its capacity to model multi-party discussions typical in real social media scenarios. This shortcoming largely stems from the scarcity of datasets that authentically capture the dynamics of social media interactions, hindering advancements in conversational stance detection. In this paper, we introduce MT2-CSD, a comprehensive dataset for multi-target, multi-turn conversational stance detection. To the best of our knowledge, MT2-CSD is the largest dataset available for this purpose, comprising 24,457 annotated instances and exhibiting the greatest conversational depth, thereby presenting new challenges for stance detection. To address these challenges, we propose the Large Language model enhanced Conversational Relational Attention Network (LLM-CRAN), which exploits the reasoning capabilities of LLMs to improve conversational understanding. 
We conduct extensive experiments to evaluate the efficacy of LLM-CRAN on the MT2-CSD dataset. The experimental results indicate that LLM-CRAN significantly outperforms strong baseline models in the task of conversational stance detection.
[15] DALR: Dual-level Alignment Learning for Multimodal Sentence Representation Learning
Kang He,Yuzhe Ding,Haining Wang,Fei Li,Chong Teng,Donghong Ji
Main category: cs.CL
TL;DR: This paper introduces DALR, a novel method for multimodal sentence representation that addresses cross-modal misalignment and intra-modal divergence, achieving superior performance on key tasks.
Details
Motivation: Existing methods align images and text at a coarse level, leading to cross-modal misalignment bias and intra-modal semantic divergence that degrade representation quality. There is a need to address these issues for improved performance. Method: DALR employs a dual-level alignment learning approach, including consistency learning for fine-grained cross-modal alignment and ranking distillation combined with global intra-modal alignment learning to better capture intricate sentence relationships. Result: Experiments on semantic textual similarity (STS) and transfer (TR) tasks demonstrate the effectiveness and superiority of DALR over state-of-the-art baselines. Conclusion: The proposed DALR method overcomes cross-modal misalignment bias and intra-modal semantic divergence, significantly enhancing the quality of multimodal sentence representation. Abstract: Previous multimodal sentence representation learning methods have achieved impressive performance. However, most approaches focus on aligning images and text at a coarse level, facing two critical challenges: cross-modal misalignment bias and intra-modal semantic divergence, which significantly degrade sentence representation quality. To address these challenges, we propose DALR (Dual-level Alignment Learning for Multimodal Sentence Representation). For cross-modal alignment, we propose a consistency learning module that softens negative samples and utilizes semantic similarity from an auxiliary task to achieve fine-grained cross-modal alignment. Additionally, we contend that sentence relationships go beyond binary positive-negative labels, exhibiting a more intricate ranking structure. To better capture these relationships and enhance representation quality, we integrate ranking distillation with global intra-modal alignment learning. 
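To make the idea of "softened" negatives concrete, the sketch below shows one generic way to do it: blend the usual one-hot contrastive target with a soft distribution derived from an auxiliary similarity matrix. This is an illustration only, not DALR's actual loss; the function names and the `alpha` and `temperature` parameters are assumptions introduced here.

```python
import math
from typing import List


def softmax(row: List[float], temperature: float = 1.0) -> List[float]:
    """Numerically stable softmax over one row of similarities."""
    scaled = [x / temperature for x in row]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def soft_contrastive_loss(
    sim: List[List[float]],      # sim[i][j]: similarity of image i and sentence j
    aux_sim: List[List[float]],  # auxiliary semantic similarity (hypothetical input)
    alpha: float = 0.5,          # 0.0 recovers the ordinary one-hot target
    temperature: float = 1.0,
) -> float:
    """Cross-entropy between blended soft targets and predicted matches."""
    n = len(sim)
    loss = 0.0
    for i in range(n):
        probs = softmax(sim[i], temperature)
        soft = softmax(aux_sim[i], temperature)
        # Blend the hard target (diagonal positive) with the soft target.
        target = [
            (1 - alpha) * (1.0 if j == i else 0.0) + alpha * soft[j]
            for j in range(n)
        ]
        loss -= sum(t * math.log(p) for t, p in zip(target, probs))
    return loss / n


sim = [[1.0, 0.0], [0.0, 1.0]]
aux = [[1.0, 0.9], [0.9, 1.0]]
hard_loss = soft_contrastive_loss(sim, aux, alpha=0.0)
soft_loss = soft_contrastive_loss(sim, aux, alpha=0.5)
```

With `alpha=0.0` the function reduces to the standard contrastive cross-entropy; a larger `alpha` moves target mass onto semantically similar "negatives" instead of treating them as fully wrong.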
Comprehensive experiments on semantic textual similarity (STS) and transfer (TR) tasks validate the effectiveness of our approach, consistently demonstrating its superiority over state-of-the-art baselines.[16] ComRAG: Retrieval-Augmented Generation with Dynamic Vector Stores for Real-time Community Question Answering in Industry
Qinwen Chen,Wenbiao Tao,Zhiwei Zhu,Mingfan Xi,Liangzhong Guo,Yuan Wang,Wei Wang,Yunshi Lan
Main category: cs.CL
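TL;DR: This paper presents ComRAG, a retrieval-augmented generation framework for real-time industrial community question answering that integrates static knowledge with dynamic historical QA pairs and performs strongly on multiple industrial CQA datasets.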
Details
Motivation: Existing methods often underutilize external knowledge, fail to incorporate dynamic historical QA context, or lack memory mechanisms suited for industrial deployment. Method: The authors propose ComRAG, a framework that combines static knowledge with dynamic historical QA pairs through a centroid-based memory mechanism, and evaluate it on multiple industrial CQA datasets. Result: ComRAG improves vector similarity by up to 25.9%, reduces latency by 8.7% to 23.3%, and lowers chunk growth from 20.23% to 2.06% over iterations. Conclusion: ComRAG is a retrieval-augmented generation framework for real-time industrial community question answering that integrates static knowledge with dynamic historical QA pairs via a centroid-based memory mechanism and consistently outperforms all baselines on three industrial CQA datasets. Abstract: Community Question Answering (CQA) platforms can be regarded as important community knowledge bases, but effectively leveraging historical interactions and domain knowledge in real-time remains a challenge. Existing methods often underutilize external knowledge, fail to incorporate dynamic historical QA context, or lack memory mechanisms suited for industrial deployment. We propose ComRAG, a retrieval-augmented generation framework for real-time industrial CQA that integrates static knowledge with dynamic historical QA pairs via a centroid-based memory mechanism designed for retrieval, generation, and efficient storage. Evaluated on three industrial CQA datasets, ComRAG consistently outperforms all baselines--achieving up to 25.9% improvement in vector similarity, reducing latency by 8.7% to 23.3%, and lowering chunk growth from 20.23% to 2.06% over iterations.[17] Progtuning: Progressive Fine-tuning Framework for Transformer-based Language Models
Xiaoshuang Ji,Zhendong Zhao,Xiaojun Chen,Xin Zhao,Zeyao Liu
Main category: cs.CL
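TL;DR: This paper proposes Progtuning, a fine-tuning framework that progressively reduces the number of updated Transformer blocks according to their contribution, improving resource allocation and the efficiency of parameter updates; it is compatible with existing parameter-efficient fine-tuning methods and performs well across a range of adaptation scenarios.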
Details
Motivation: As model sizes grow, updating all model parameters becomes increasingly costly, yet standard fine-tuning and most existing parameter-efficient fine-tuning methods ignore the unequal contribution of different Transformer blocks, which leads to inefficient allocation of computing resources. Method: The authors propose Progtuning, a framework that combines contribution-based progressive learning with fine-tuning of Transformer-based language models. Result: Progtuning reduces the number of updated parameters by approximately 25% while maintaining competitive performance, adapts well to parameter-efficient fine-tuning methods, and performs well across various adaptation scenarios. Conclusion: Progtuning is a novel fine-tuning framework that optimizes resource allocation by progressively reducing the number of updated Transformer blocks, cutting the number of updated parameters by approximately 25% while maintaining competitive performance; it is highly adaptable to parameter-efficient fine-tuning methods and performs well across various adaptation scenarios. Abstract: Fine-tuning is a promising technique for leveraging Transformer-based language models in downstream tasks. As model sizes continue to grow, updating all model parameters becomes increasingly costly. Parameter-efficient fine-tuning methods effectively address this issue by selectively updating a small subset of parameters. However, fine-tuning and most existing parameter-efficient fine-tuning methods require updating the same number of parameters as the initial size, ignoring the unequal contribution across Transformer blocks and leading to extremely inefficient allocation of computing resources. In this paper, we propose Progtuning, a novel fine-tuning framework combined with progressive learning for Transformer-based language models. Specifically, Progtuning progressively reduces the number of updated transformer blocks based on the contribution. Remarkably, Progtuning optimizes resource allocation and reduces the number of updated parameters by approximately 25%, while still maintaining competitive performance. It also exhibits high adaptability with parameter-efficient fine-tuning methods, demonstrating excellent performance across various adaptation scenarios.[18] Compressed and Smooth Latent Space for Text Diffusion Modeling
Viacheslav Meshchaninov,Egor Chimbulatov,Alexander Shabalin,Aleksandr Abramov,Dmitry Vetrov
Main category: cs.CL
TL;DR: Cosmos is a new text generation method leveraging a compressed latent space that enables efficient and effective diffusion-based generation, outperforming autoregressive and other diffusion models.
Details
Motivation: Autoregressive models are slow and struggle with global coherence, while diffusion models face challenges due to the high dimensionality of token-level representations. Method: Cosmos operates in a compressed latent space designed for diffusion, learned using an autoencoder aligned with frozen activations from a pretrained language encoder. Result: Cosmos achieves comparable or superior generation quality compared to existing models, compresses text representations by 8x, surpasses baselines with increased latent sequence length, and offers more than 2x faster inference. Conclusion: Cosmos provides a novel and efficient approach to text generation with advantages over traditional autoregressive and diffusion-based models. Abstract: Autoregressive language models dominate modern text generation, yet their sequential nature introduces fundamental limitations: decoding is slow, and maintaining global coherence remains challenging. Diffusion models offer a promising alternative by enabling parallel generation and flexible control; however, their application to text generation is hindered by the high dimensionality of token-level representations. We introduce Cosmos, a novel approach to text generation that operates entirely in a compressed, smooth latent space tailored specifically for diffusion. This space is learned using an autoencoder trained simultaneously for token-level reconstruction and alignment with frozen activations from a pretrained language encoder, providing robust semantic grounding and enabling effective perturbation-based augmentations. Empirically, we demonstrate that text representations can be compressed by $8\times$ while maintaining generation quality comparable to token-level diffusion models. Furthermore, increasing the latent sequence length allows Cosmos to surpass both diffusion-based and autoregressive baselines. 
We evaluate Cosmos on four diverse generative tasks including story generation, question generation, summarization, and detoxification and compare it with various generative paradigms. Cosmos achieves comparable or superior generation quality while offering more than $2\times$ faster inference.[19] Maintaining MTEB: Towards Long Term Usability and Reproducibility of Embedding Benchmarks
Isaac Chung,Imene Kerboua,Marton Kardos,Roman Solomatin,Kenneth Enevoldsen
Main category: cs.CL
TL;DR: This paper details the engineering practices that support the reproducibility, extensibility, and usability of the Massive Text Embedding Benchmark (MTEB), highlighting continuous integration pipelines, dataset validation, test automation, and strategies for handling community contributions.
Details
Motivation: As MTEB becomes a standard evaluation platform for text embedding models, ensuring its reproducibility, extensibility, and usability is critical. This paper aims to address the engineering challenges faced in maintaining such a benchmark. Method: The paper discusses the engineering approaches used to enhance MTEB's reproducibility and usability, including continuous integration pipelines, dataset integrity validation, test execution automation, and strategies for incorporating community contributions and extending tasks and datasets. Result: The implementation of rigorous engineering practices has enabled MTEB to scale effectively while preserving its quality and relevance to the field. Conclusion: MTEB has successfully scaled in comprehensiveness while maintaining quality and relevance through robust engineering practices, offering valuable insights for maintainers of other machine learning evaluation frameworks. Abstract: The Massive Text Embedding Benchmark (MTEB) has become a standard evaluation platform for text embedding models. While previous work has established the core benchmark methodology, this paper focuses on the engineering aspects that ensure MTEB's continued reproducibility and extensibility. We present our approach to maintaining robust continuous integration pipelines that validate dataset integrity, automate test execution, and assess benchmark results' generalizability. We detail the design choices that collectively enhance reproducibility and usability. Furthermore, we discuss our strategies for handling community contributions and extending the benchmark with new tasks and datasets. These engineering practices have been instrumental in scaling MTEB to become more comprehensive while maintaining quality and, ultimately, relevance to the field. Our experiences offer valuable insights for benchmark maintainers facing similar challenges in ensuring reproducibility and usability in machine learning evaluation frameworks. 
The MTEB repository is available at: https://github.com/embeddings-benchmark/mteb[20] Prompt-Guided Turn-Taking Prediction
Koji Inoue,Mikey Elmers,Yahui Fu,Zi Haur Pang,Divesh Lala,Keiko Ochi,Tatsuya Kawahara
Main category: cs.CL
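TL;DR: This paper proposes a novel Transformer model whose turn-taking prediction can be dynamically adjusted through textual prompts, and demonstrates its effectiveness in both prediction accuracy and flexibility.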
Details
Motivation: The goal is to let textual prompts dynamically control turn-taking prediction, giving intuitive and explicit control through instructions such as "faster" or "calmer" so that the model can adapt to different conversational partners and contexts. Method: The authors propose a Transformer-based voice activity projection (VAP) model that incorporates textual prompt embeddings into both the channel-wise Transformers and the cross-channel Transformer. Result: The approach was evaluated for feasibility on more than 950 hours of human-human spoken dialogue data, with synthetic prompt sentences generated by a large language model (LLM). Conclusion: Experimental results show that the proposed model improves prediction accuracy and effectively varies turn-taking timing behavior according to the textual prompts. Abstract: Turn-taking prediction models are essential components in spoken dialogue systems and conversational robots. Recent approaches leverage transformer-based architectures to predict speech activity continuously and in real-time. In this study, we propose a novel model that enables turn-taking prediction to be dynamically controlled via textual prompts. This approach allows intuitive and explicit control through instructions such as "faster" or "calmer", adapting dynamically to conversational partners and contexts. The proposed model builds upon a transformer-based voice activity projection (VAP) model, incorporating textual prompt embeddings into both channel-wise transformers and a cross-channel transformer. We evaluated the feasibility of our approach using over 950 hours of human-human spoken dialogue data. Since textual prompt data for the proposed approach was not available in existing datasets, we utilized a large language model (LLM) to generate synthetic prompt sentences. Experimental results demonstrated that the proposed model improved prediction accuracy and effectively varied turn-taking timing behaviors according to the textual prompts.[21] Enhancing Automatic Term Extraction with Large Language Models via Syntactic Retrieval
Yongchan Chun,Minhyuk Kim,Dongjun Kim,Chanjun Park,Heuiseok Lim
Main category: cs.CL
TL;DR: This paper explores using syntactic retrieval instead of semantic similarity to select demonstrations for few-shot Automatic Term Extraction with large language models, and shows improved performance.
Details
Motivation: Large language models have advanced many NLP tasks, but their potential for Automatic Term Extraction has scarcely been examined. Method: The authors propose a retrieval-based prompting strategy that selects few-shot demonstrations according to syntactic rather than semantic similarity. Result: The approach is evaluated in both in-domain and cross-domain settings, and experiments on three specialized ATE benchmarks show that syntactic retrieval improves F1-score. Conclusion: Syntactic retrieval improves F1-score, highlighting the importance of syntactic cues when adapting LLMs to terminology-extraction tasks. Abstract: Automatic Term Extraction (ATE) identifies domain-specific expressions that are crucial for downstream tasks such as machine translation and information retrieval. Although large language models (LLMs) have significantly advanced various NLP tasks, their potential for ATE has scarcely been examined. We propose a retrieval-based prompting strategy that, in the few-shot setting, selects demonstrations according to syntactic rather than semantic similarity. This syntactic retrieval method is domain-agnostic and provides more reliable guidance for capturing term boundaries. We evaluate the approach in both in-domain and cross-domain settings, analyzing how lexical overlap between the query sentence and its retrieved examples affects performance. Experiments on three specialized ATE benchmarks show that syntactic retrieval improves F1-score. These findings highlight the importance of syntactic cues when adapting LLMs to terminology-extraction tasks.[22] Agent-RewardBench: Towards a Unified Benchmark for Reward Modeling across Perception, Planning, and Safety in Real-World Multimodal Agents
Tianyi Men,Zhuoran Jin,Pengfei Cao,Yubo Chen,Kang Liu,Jun Zhao
Main category: cs.CL
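TL;DR: This paper proposes Agent-RewardBench, a benchmark for evaluating reward modeling ability in multimodal large language models; results show that current models still need specialized training to perform well.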
Details
Motivation: Existing MLLM-based agents struggle with self-correction and generalization; external feedback such as reward models can help, but there is currently no guidance or benchmark for selecting reward models for agents. Method: The authors propose Agent-RewardBench, a benchmark featuring multiple evaluation dimensions, real-world agent scenarios, step-level reward evaluation, and high-quality data samples. Result: Experiments show that even state-of-the-art multimodal models achieve limited performance, underscoring the importance of training for reward modeling. Conclusion: Agent-RewardBench highlights the need for specialized reward-modeling training for multimodal language models and provides a benchmark for evaluating and improving reward modeling ability. Abstract: As Multimodal Large Language Models (MLLMs) advance, multimodal agents show promise in real-world tasks like web navigation and embodied intelligence. However, due to a lack of external feedback, these agents struggle with self-correction and generalization. A promising approach is to use reward models as external feedback, but there is no clear guidance on how to select reward models for agents. Thus, there is an urgent need to build a reward bench targeted at agents. To address these challenges, we propose Agent-RewardBench, a benchmark designed to evaluate reward modeling ability in MLLMs. The benchmark is characterized by three key features: (1) Multiple dimensions and real-world agent scenarios evaluation. It covers perception, planning, and safety with 7 scenarios; (2) Step-level reward evaluation. It allows for the assessment of agent capabilities at the individual steps of a task, providing a more granular view of performance during the planning process; and (3) Appropriate difficulty and high quality. We carefully sample from 10 diverse models, apply difficulty control to maintain task challenges, and use manual verification to ensure the integrity of the data. Experiments demonstrate that even state-of-the-art multimodal models show limited performance, highlighting the need for specialized training in agent reward modeling. Code is available at github.[23] Cat and Mouse -- Can Fake Text Generation Outpace Detector Systems?
Andrea McGlinchey,Peter J Barclay
Main category: cs.CL
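TL;DR: This paper examines whether statistical classifiers can still detect AI-generated fake text as large language models keep being upgraded, and finds that detection may remain effective even as model scale grows.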
Details
Motivation: Large language models (LLMs) use ever more parameters, training data, and energy, while relatively simple classifiers still achieve good detection accuracy with modest resources; the paper asks whether models can keep outpacing detectors. Method: The study examines how well statistical classifiers identify "fake text" written in the style of classical detective fiction and compares how the ability of Gemini and GPT to generate deceptive text changes across versions. Result: Over a 0.5 version increase, Gemini showed an increased ability to generate deceptive text, while GPT did not. Conclusion: Reliable detection of fake text may remain feasible even as models grow larger, although new model architectures may improve their deceptiveness. Abstract: Large language models can produce convincing "fake text" in domains such as academic writing, product reviews, and political news. Many approaches have been investigated for the detection of artificially generated text. While this may seem to presage an endless "arms race", we note that newer LLMs use ever more parameters, training data, and energy, while relatively simple classifiers demonstrate a good level of detection accuracy with modest resources. To approach the question of whether the models' ability to beat the detectors may therefore reach a plateau, we examine the ability of statistical classifiers to identify "fake text" in the style of classical detective fiction. Over a 0.5 version increase, we found that Gemini showed an increased ability to generate deceptive text, while GPT did not. This suggests that reliable detection of fake text may remain feasible even for ever-larger models, though new model architectures may improve their deceptiveness.[24] Double-Checker: Enhancing Reasoning of Slow-Thinking LLMs via Self-Critical Fine-Tuning
Xin Xu,Tianhao Chen,Fan Zhang,Wanlong Liu,Pengxiang Li,Ajay Kumar Jaiswal,Yuchen Yan,Jishan Hu,Yang Wang,Hao Chen,Shiwei Liu,Shizhe Diao,Can Yang,Lu Yin
Main category: cs.CL
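TL;DR: This paper proposes Double-Checker, a framework that improves the reasoning of slow-thinking large language models through explicit self-critique and iterative refinement.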
Details
Motivation: Although slow-thinking large language models (LLMs) exhibit reflection-like reasoning, their ability to generate informative critiques and refine prior solutions remains limited. Method: The model is fine-tuned on 1,730 self-critical instances and then iteratively critiques and refines its outputs during inference. Result: On AIME benchmarks, pass@1 performance rises from 4.4% to 18.2%. Conclusion: Double-Checker is a principled framework that enhances the reasoning of slow-thinking LLMs by fostering explicit self-critique and iterative refinement of previous solutions; the results show that iterative self-critique significantly improves the reasoning capabilities of long-CoT LLMs. Abstract: While slow-thinking large language models (LLMs) exhibit reflection-like reasoning, commonly referred to as the "aha moment", their ability to generate informative critiques and refine prior solutions remains limited. In this paper, we introduce Double-Checker, a principled framework designed to enhance the reasoning capabilities of slow-thinking LLMs by fostering explicit self-critique and iterative refinement of their previous solutions. By fine-tuning on our curated 1,730 self-critical instances, Double-Checker empowers long-CoT LLMs to iteratively critique and refine their outputs during inference until they evaluate their solutions as correct under self-generated critiques. We validate the efficacy of Double-Checker across a comprehensive suite of reasoning benchmarks, demonstrating that iterative self-critique significantly enhances the reasoning capabilities of long-CoT LLMs. Notably, our Double-Checker increases the pass@1 performance on challenging AIME benchmarks from 4.4% to 18.2% compared to the original long-CoT LLMs. These results highlight a promising direction for developing more trustworthy and effective LLMs capable of structured self-critique.[25] Small Encoders Can Rival Large Decoders in Detecting Groundedness
Istabrak Abbes,Gabriele Prato,Quentin Fournier,Fernando Rodriguez,Alaa Boukhary,Adam Elwood,Sarath Chandar
Main category: cs.CL
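TL;DR: This study proposes an efficient approach to groundedness detection that uses lightweight models such as RoBERTa and NomicBERT, preserving accuracy while greatly reducing inference time and resource consumption.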
Details
Motivation: When the provided context lacks information, LLMs often resort to ungrounded speculation or internal knowledge, so a mechanism is needed to detect whether a query is grounded in the given document in order to ensure factual consistency and trustworthiness. Method: Lightweight, task-specific encoder models such as RoBERTa and NomicBERT are fine-tuned on curated datasets to perform groundedness detection. Result: The lightweight task-specific encoders reach accuracy comparable to state-of-the-art LLMs such as Llama3 8B and GPT4o in groundedness detection while reducing inference latency by orders of magnitude. Conclusion: Lightweight, task-specific encoder models can match the accuracy of state-of-the-art LLMs in groundedness detection while significantly lowering inference latency. Abstract: Augmenting large language models (LLMs) with external context significantly improves their performance in natural language processing (NLP) tasks. However, LLMs struggle to answer queries reliably when the provided context lacks information, often resorting to ungrounded speculation or internal knowledge. Groundedness - generating responses strictly supported by the context - is essential for ensuring factual consistency and trustworthiness. This study focuses on detecting whether a given query is grounded in a document provided in context before the costly answer generation by LLMs. Such a detection mechanism can significantly reduce both inference time and resource consumption. We show that lightweight, task-specific encoder models such as RoBERTa and NomicBERT, fine-tuned on curated datasets, can achieve accuracy comparable to state-of-the-art LLMs, such as Llama3 8B and GPT4o, in groundedness detection while reducing inference latency by orders of magnitude. The code is available at: https://github.com/chandarlab/Hallucinate-less[26] Detecting Referring Expressions in Visually Grounded Dialogue with Autoregressive Language Models
Bram Willemsen,Gabriel Skantze
Main category: cs.CL
TL;DR: This paper investigates the effectiveness of a text-only approach in detecting referring expressions in visually grounded dialogues, finding it useful but inherently limited without multimodal integration.
Details
Motivation: The motivation is to determine how much linguistic context alone can aid in identifying mentions with visual referents in dialogues, without relying on visual data. Method: A pretrained large language model (LLM) was adapted to annotate mention spans in conversations through next-token prediction, using a text-only, autoregressive language modeling approach. Result: The results show that even with moderate-sized LLMs, small datasets, and parameter-efficient fine-tuning, the text-only approach can be effective for this task. Conclusion: The paper concludes that while a text-only approach can be effective for extracting referring expressions, the task fundamentally requires multimodal consideration. Abstract: In this paper, we explore the use of a text-only, autoregressive language modeling approach for the extraction of referring expressions from visually grounded dialogue. More specifically, the aim is to investigate the extent to which the linguistic context alone can inform the detection of mentions that have a (visually perceivable) referent in the visual context of the conversation. To this end, we adapt a pretrained large language model (LLM) to perform a relatively coarse-grained annotation of mention spans in unfolding conversations by demarcating mention span boundaries in text via next-token prediction. Our findings indicate that even when using a moderately sized LLM, relatively small datasets, and parameter-efficient fine-tuning, a text-only approach can be effective, highlighting the relative importance of the linguistic context for this task. Nevertheless, we argue that the task represents an inherently multimodal problem and discuss limitations fundamental to unimodal approaches.[27] Structuralist Approach to AI Literary Criticism: Leveraging Greimas Semiotic Square for Large Language Models
Fangzhou Dong,Yifan Zeng,Yingpeng Sang,Hong Shen
Main category: cs.CL
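TL;DR: This paper introduces GLASS (Greimas Literary Analysis via Semiotic Square), a structured analytical framework that strengthens the ability of large language models to perform in-depth literary analysis, and demonstrates its effectiveness and application potential.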
Details
Motivation: Large language models excel at understanding and generating text but still struggle to provide professional literary criticism of works with profound thoughts and complex narratives. Method: The authors propose GLASS, a structured analytical framework based on the Greimas Semiotic Square (GSS), create the first dataset for GSS-based literary criticism, and propose quantitative metrics using the LLM-as-a-judge paradigm. Result: Compared with expert criticism across multiple works and LLMs, the proposed framework shows high performance; applying GLASS to 39 classic works produced original, high-quality analyses that address existing research gaps. Conclusion: The research provides an AI-based tool for literary research and education and offers new insights into the cognitive mechanisms underlying literary engagement. Abstract: Large Language Models (LLMs) excel in understanding and generating text but struggle with providing professional literary criticism for works with profound thoughts and complex narratives. This paper proposes GLASS (Greimas Literary Analysis via Semiotic Square), a structured analytical framework based on Greimas Semiotic Square (GSS), to enhance LLMs' ability to conduct in-depth literary analysis. GLASS facilitates the rapid dissection of narrative structures and deep meanings in narrative works. We propose the first dataset for GSS-based literary criticism, featuring detailed analyses of 48 works. Then we propose quantitative metrics for GSS-based literary criticism using the LLM-as-a-judge paradigm. Our framework's results, compared with expert criticism across multiple works and LLMs, show high performance. Finally, we applied GLASS to 39 classic works, producing original and high-quality analyses that address existing research gaps. This research provides an AI-based tool for literary research and education, offering insights into the cognitive mechanisms underlying literary engagement.[28] Leveraging LLM-Assisted Query Understanding for Live Retrieval-Augmented Generation
Guanting Dong,Xiaoxi Li,Yuyao Zhang,Mengjie Deng
Main category: cs.CL
TL;DR: This paper presents Omni-RAG, a new framework for improving retrieval-augmented generation systems' performance in real-world scenarios involving complex and noisy user inputs.
Details
Motivation: Real-world live RAG systems struggle with noisy, ambiguous user queries containing multiple intents, while current systems are typically trained or evaluated on cleaner data. Method: Omni-RAG employs LLM-assisted query understanding through three modules: Deep Query Understanding and Decomposition, Intent-Aware Knowledge Retrieval, and Reranking and Generation. Result: The paper introduces Omni-RAG, a novel framework designed to improve the robustness and effectiveness of RAG systems in live, open-domain settings. Conclusion: Omni-RAG aims to bridge the gap between current RAG capabilities and the demands of real-world applications by robustly handling complex and noisy queries. Abstract: Real-world live retrieval-augmented generation (RAG) systems face significant challenges when processing user queries that are often noisy, ambiguous, and contain multiple intents. While RAG enhances large language models (LLMs) with external knowledge, current systems typically struggle with such complex inputs, as they are often trained or evaluated on cleaner data. This paper introduces Omni-RAG, a novel framework designed to improve the robustness and effectiveness of RAG systems in live, open-domain settings. Omni-RAG employs LLM-assisted query understanding to preprocess user inputs through three key modules: (1) Deep Query Understanding and Decomposition, which utilizes LLMs with tailored prompts to denoise queries (e.g., correcting spelling errors) and decompose multi-intent queries into structured sub-queries; (2) Intent-Aware Knowledge Retrieval, which performs retrieval for each sub-query from a corpus (i.e., FineWeb using OpenSearch) and aggregates the results; and (3) Reranking and Generation, where a reranker (i.e., BGE) refines document selection before a final response is generated by an LLM (i.e., Falcon-10B) using a chain-of-thought prompt. 
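The three modules listed above can be read as a simple pipeline. The sketch below wires them together with toy stand-ins so the control flow is visible; it is not the Omni-RAG implementation. Every function name, the tiny `CORPUS`, and the keyword-matching logic are hypothetical placeholders for the LLM, OpenSearch, BGE, and Falcon components named in the abstract.

```python
from typing import Callable, List, Set

# Toy corpus standing in for a real retrieval index (illustrative only).
CORPUS = [
    "Apples are rich in fiber.",
    "Bananas contain potassium.",
    "Cherries are red.",
]


def content_words(text: str) -> Set[str]:
    """Lowercased words longer than three characters, punctuation stripped."""
    cleaned = (w.strip(".,?!").lower() for w in text.split())
    return {w for w in cleaned if len(w) > 3}


def toy_understand(query: str) -> List[str]:
    """Module 1 stand-in: split a multi-intent query on ' and '."""
    return [part.strip() for part in query.split(" and ") if part.strip()]


def toy_retrieve(sub_query: str) -> List[str]:
    """Module 2 stand-in: keep documents sharing a content word."""
    wanted = content_words(sub_query)
    return [doc for doc in CORPUS if wanted & content_words(doc)]


def toy_rerank(query: str, docs: List[str]) -> List[str]:
    """Module 3a stand-in: order documents by word overlap with the query."""
    wanted = content_words(query)
    return sorted(docs, key=lambda d: len(wanted & content_words(d)), reverse=True)


def toy_generate(query: str, docs: List[str]) -> str:
    """Module 3b stand-in: a real system would prompt an LLM here."""
    return " ".join(docs)


def answer_query(
    query: str,
    understand: Callable[[str], List[str]] = toy_understand,
    retrieve: Callable[[str], List[str]] = toy_retrieve,
    rerank: Callable[[str, List[str]], List[str]] = toy_rerank,
    generate: Callable[[str, List[str]], str] = toy_generate,
    top_k: int = 2,
) -> str:
    sub_queries = understand(query)
    # Aggregate per-sub-query results, dropping duplicates but keeping order.
    pooled = list(dict.fromkeys(d for sq in sub_queries for d in retrieve(sq)))
    return generate(query, rerank(query, pooled)[:top_k])


answer = answer_query("Are apples healthy and do bananas contain potassium?")
```

Passing the stages in as callables keeps the orchestration independent of any particular model or search backend, which is the main point of the illustration.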
Omni-RAG aims to bridge the gap between current RAG capabilities and the demands of real-world applications, such as those highlighted by the SIGIR 2025 LiveRAG Challenge, by robustly handling complex and noisy queries.[29] Domain Knowledge-Enhanced LLMs for Fraud and Concept Drift Detection
Ali Şenol,Garima Agrawal,Huan Liu
Main category: cs.CL
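TL;DR: This paper proposes an LLM framework that combines domain knowledge with concept drift detection; it performs well at detecting fraudulent conversations and addresses the limitations of existing models on dynamic platforms.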
Details
Motivation: Evolving language patterns and Concept Drift (CD) make deceptive conversations on dynamic platforms increasingly hard to detect, and existing large language models (LLMs) suffer from contextual ambiguity and hallucinations in risk-sensitive scenarios. Method: The authors propose a DK-Enhanced LLM framework that combines pretrained LLMs with structured, task-specific insights for fraud and concept drift detection, and validate it on the SEConvo multi-turn dialogue dataset. Result: The system detects fake conversations with high accuracy and effectively classifies the nature of drift; the LLaMA-based implementation reaches 98% classification accuracy. Conclusion: Integrating domain knowledge and concept drift detection can significantly improve the performance, interpretability, and robustness of high-stakes NLP applications. Abstract: Detecting deceptive conversations on dynamic platforms is increasingly difficult due to evolving language patterns and Concept Drift (CD)-i.e., semantic or topical shifts that alter the context or intent of interactions over time. These shifts can obscure malicious intent or mimic normal dialogue, making accurate classification challenging. While Large Language Models (LLMs) show strong performance in natural language tasks, they often struggle with contextual ambiguity and hallucinations in risk-sensitive scenarios. To address these challenges, we present a Domain Knowledge (DK)-Enhanced LLM framework that integrates pretrained LLMs with structured, task-specific insights to perform fraud and concept drift detection. The proposed architecture consists of three main components: (1) a DK-LLM module to detect fake or deceptive conversations; (2) a drift detection unit (OCDD) to determine whether a semantic shift has occurred; and (3) a second DK-LLM module to classify the drift as either benign or fraudulent. We first validate the value of domain knowledge using a fake review dataset and then apply our full framework to SEConvo, a multiturn dialogue dataset that includes various types of fraud and spam attacks. Results show that our system detects fake conversations with high accuracy and effectively classifies the nature of drift. Guided by structured prompts, the LLaMA-based implementation achieves 98% classification accuracy. 
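As a rough illustration of how the three components might fit together, the sketch below chains a fake-conversation detector, a drift detector, and a drift classifier. It is a guess at the control flow under stated assumptions, not the paper's architecture: running the drift classifier only when drift is detected is an assumption, and the keyword rules are toy stand-ins for the DK-LLM modules and the OCDD unit.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Set

SUSPICIOUS_TERMS = ("password", "gift card")  # hypothetical domain knowledge


@dataclass
class Verdict:
    is_fake: bool
    drift_detected: bool
    drift_label: Optional[str]  # "benign" or "fraudulent"; None without drift


def words(text: str) -> Set[str]:
    return {w.strip(".,?!").lower() for w in text.split()}


def toy_detect_fake(turns: List[str]) -> bool:
    """Component 1 stand-in: flag any turn containing a suspicious term."""
    return any(term in turn.lower() for turn in turns for term in SUSPICIOUS_TERMS)


def toy_detect_drift(turns: List[str]) -> bool:
    """Component 2 stand-in: drift if first and last turn share no words."""
    return not (words(turns[0]) & words(turns[-1]))


def toy_classify_drift(turns: List[str]) -> str:
    """Component 3 stand-in: label the shift using the final turn."""
    last = turns[-1].lower()
    return "fraudulent" if any(t in last for t in SUSPICIOUS_TERMS) else "benign"


def analyse_conversation(
    turns: List[str],
    detect_fake: Callable[[List[str]], bool] = toy_detect_fake,
    detect_drift: Callable[[List[str]], bool] = toy_detect_drift,
    classify_drift: Callable[[List[str]], str] = toy_classify_drift,
) -> Verdict:
    is_fake = detect_fake(turns)
    drifted = detect_drift(turns)
    # Assumption: the drift classifier is only consulted when drift is found.
    label = classify_drift(turns) if drifted else None
    return Verdict(is_fake, drifted, label)


scam = analyse_conversation([
    "Hi, I have a question about my order.",
    "Sure, what is the order number?",
    "Actually, please send me your password to verify.",
])
benign = analyse_conversation([
    "What time do you open?",
    "We open at nine.",
    "Great, what time do you close?",
])
```

Keeping each component behind a callable means the toy rules could be swapped for prompted LLM calls without changing the orchestration.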
Comparative studies against zero-shot baselines demonstrate that incorporating domain knowledge and drift awareness significantly improves performance, interpretability, and robustness in high-stakes NLP applications.[30] Text2Cypher Across Languages: Evaluating Foundational Models Beyond English
Makbule Gulcin Ozsoy,William Tai
Main category: cs.CL
TL;DR: This paper investigates how foundational large language models perform on the Text2Cypher task across English, Spanish, and Turkish. It finds varying levels of performance and highlights the need for more inclusive multilingual evaluation and development.
Details
Motivation: Recent research on large language models (LLMs) has primarily focused on English, with limited evaluation in other languages. This paper aims to investigate the performance of LLMs on the Text2Cypher task across multiple languages to address this gap. Method: The authors created a multilingual test set by translating English questions into Spanish and Turkish while preserving the original Cypher queries. They evaluated multiple foundational models using standardized prompts and metrics and also explored the impact of translating task prompts into Spanish and Turkish. Result: The results show a consistent performance pattern: highest on English, then Spanish, and lowest on Turkish. Additionally, translating task prompts into Spanish and Turkish had little to no impact on evaluation metrics. Conclusion: The paper concludes that the performance of foundational LLMs in the Text2Cypher task varies across languages, with English showing the best results followed by Spanish and then Turkish. The findings emphasize the need for more inclusive evaluation and development in multilingual query generation. Abstract: Recent advances in large language models have enabled natural language interfaces that translate user questions into database queries, such as Text2SQL, Text2SPARQL, and Text2Cypher. While these interfaces enhance database accessibility, most research today focuses solely on English, with limited evaluation in other languages. This paper investigates the performance of foundational LLMs on the Text2Cypher task across multiple languages. We create and release a multilingual test set by translating English questions into Spanish and Turkish while preserving the original Cypher queries, enabling fair cross-lingual comparison. We evaluate multiple foundational models using standardized prompts and metrics. Our results show a consistent performance pattern: highest on English, then Spanish, and lowest on Turkish. 
We attribute this to differences in training data availability and linguistic characteristics. Additionally, we explore the impact of translating task prompts into Spanish and Turkish. Results show little to no change in evaluation metrics, suggesting prompt translation has minor impact. Our findings highlight the need for more inclusive evaluation and development in multilingual query generation. Future work includes schema localization and fine-tuning across diverse languages.
[31] Aligning Spoken Dialogue Models from User Interactions
Anne Wu,Laurent Mazaré,Neil Zeghidour,Alexandre Défossez
Main category: cs.CL
TL;DR: This paper introduces a novel preference alignment framework for improving spoken dialogue models in real-time conversations by creating a large-scale dataset and leveraging offline alignment methods to fine-tune a speech-to-speech model.
Details
Motivation: Current preference learning methods primarily focus on text-based language models, which are not directly suited to the complexities of real-time speech interactions with richer dynamics (e.g., interruption, interjection) and no explicit segmentation between speaker turns. Method: The researchers created a large-scale dataset of over 150,000 preference pairs from raw multi-turn speech conversations, annotated with AI feedback. They leveraged offline alignment methods to fine-tune a full-duplex autoregressive speech-to-speech model and conducted extensive experiments and holistic human evaluations. Result: The proposed framework effectively improves spoken dialogue models in producing more factual, safer, and contextually aligned interactions, as demonstrated by extensive experiments and human evaluations. Conclusion: The study concludes that feedback on generic conversations can consistently improve spoken dialogue models, and a well-calibrated balance among various dynamics is crucial for natural real-time speech dialogue systems. Abstract: We propose a novel preference alignment framework for improving spoken dialogue models on real-time conversations from user interactions. Current preference learning methods primarily focus on text-based language models, and are not directly suited to the complexities of real-time speech interactions, with richer dynamics (e.g., interruption, interjection) and no explicit segmentation between speaker turns. We create a large-scale dataset of more than 150,000 preference pairs from raw multi-turn speech conversations, annotated with AI feedback, to cover preferences over both linguistic content and temporal context variations. We leverage offline alignment methods to finetune a full-duplex autoregressive speech-to-speech model. 
Extensive experiments demonstrate that feedback on generic conversations can be consistently effective in improving spoken dialogue models to produce more factual, safer and more contextually aligned interactions. We deploy the finetuned model and conduct holistic human evaluations to assess the impact beyond single-turn conversations. Our findings shed light on the importance of a well-calibrated balance among various dynamics, crucial for natural real-time speech dialogue systems.
[32] TopK Language Models
Ryosuke Takahashi,Tatsuro Inaba,Kentaro Inui,Benjamin Heinzerling
Main category: cs.CL
TL;DR: The paper introduces TopK LMs, which incorporate a TopK activation function into the transformer architecture to produce sparse representations that offer improved interpretability without sacrificing model performance.
Details
Motivation: Sparse autoencoders (SAEs) have shortcomings in utility and internal validity because they are trained post-hoc, leading to uncertainties about whether failure to discover a concept is due to the SAE or the underlying LM. Feature instability also complicates comparing SAE features across different checkpoints. Method: A modification to the transformer architecture was introduced by incorporating a TopK activation function at chosen layers, making the model's hidden states equivalent to the latent features of a TopK SAE. Result: TopK LMs maintain their original capabilities while providing robust interpretability benefits. The sparse representations learned by TopK LMs enable successful steering through targeted neuron interventions and facilitate detailed analysis of neuron formation processes across checkpoints and layers. Conclusion: TopK LMs provide a stable and reliable method for understanding how language models learn and represent concepts, offering a favorable trade-off between model size, computational efficiency, and interpretability. Abstract: Sparse autoencoders (SAEs) have become an important tool for analyzing and interpreting the activation space of transformer-based language models (LMs). However, SAEs suffer from several shortcomings that diminish their utility and internal validity. Since SAEs are trained post-hoc, it is unclear if the failure to discover a particular concept is a failure on the SAE's side or due to the underlying LM not representing this concept. This problem is exacerbated by training conditions and architecture choices affecting which features an SAE learns. When tracing how LMs learn concepts during training, the lack of feature stability also makes it difficult to compare SAE features across different checkpoints. 
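As a rough illustration of the TopK activation named in the Method summary above, the sketch below keeps the k largest entries of a hidden vector and zeroes the rest. The function name `topk_activation`, the selection by raw value rather than magnitude, and the tie-breaking rule are assumptions made for illustration; the summary does not specify these details, and this is not the authors' implementation.

```python
def topk_activation(hidden, k):
    """Keep the k largest entries of a hidden vector and zero out the rest.

    Hypothetical helper: selection is by raw value (an assumption), and ties
    are resolved in favour of the earlier position.
    """
    if k <= 0:
        return [0.0] * len(hidden)
    if k >= len(hidden):
        return list(hidden)
    # Indices ordered by value, largest first; sorted() is stable, so equal
    # values keep their original left-to-right order.
    order = sorted(range(len(hidden)), key=lambda i: hidden[i], reverse=True)
    keep = set(order[:k])
    return [v if i in keep else 0.0 for i, v in enumerate(hidden)]


sparse = topk_activation([0.2, -1.0, 3.5, 0.7, 2.1], k=2)
print(sparse)  # [0.0, 0.0, 3.5, 0.0, 2.1]
```

With at most k non-zero entries per vector, each surviving coordinate can be inspected or intervened on individually, which is the property the summary relies on for steering.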
To address these limitations, we introduce a modification to the transformer architecture that incorporates a TopK activation function at chosen layers, making the model's hidden states equivalent to the latent features of a TopK SAE. This approach eliminates the need for post-hoc training while providing interpretability comparable to SAEs. The resulting TopK LMs offer a favorable trade-off between model size, computational efficiency, and interpretability. Despite this simple architectural change, TopK LMs maintain their original capabilities while providing robust interpretability benefits. Our experiments demonstrate that the sparse representations learned by TopK LMs enable successful steering through targeted neuron interventions and facilitate detailed analysis of neuron formation processes across checkpoints and layers. These features make TopK LMs stable and reliable tools for understanding how language models learn and represent concepts, which we believe will significantly advance future research on model interpretability and controllability.
[33] Bridging Offline and Online Reinforcement Learning for LLMs
Jack Lanchantin,Angelica Chen,Janice Lan,Xian Li,Swarnadeep Saha,Tianlu Wang,Jing Xu,Ping Yu,Weizhe Yuan,Jason E Weston,Sainbayar Sukhbaatar,Ilia Kulikov
Main category: cs.CL
TL;DR: This paper shows that online and semi-online reinforcement learning methods, like Direct Preference Optimization and Group Reward Policy Optimization, are more effective than offline approaches for training large language models on both verifiable (e.g., math) and non-verifiable (e.g., instruction-following) tasks.
Details
Motivation: The motivation stems from the need to improve the adaptability of large language models in transitioning from static datasets (offline) to dynamic environments (online), especially when dealing with both verifiable and non-verifiable tasks. Method: Experiments were conducted using online, semi-online, and offline reinforcement learning techniques—specifically Direct Preference Optimization and Group Reward Policy Optimization—across both verifiable (e.g., math) and non-verifiable (e.g., instruction-following) tasks. Training dynamics and hyperparameter strategies were also analyzed. Result: Reinforcement learning methods showed strong performance in semi-online and fully online setups, surpassing offline approaches. Multi-tasking with verifiable and non-verifiable rewards improved performance across both task types. Conclusion: Online and semi-online reinforcement learning methods, including Direct Preference Optimization and Group Reward Policy Optimization, perform similarly and outperform offline methods in training large language models for verifiable and non-verifiable tasks. Abstract: We investigate the effectiveness of reinforcement learning methods for finetuning large language models when transitioning from offline to semi-online to fully online regimes for both verifiable and non-verifiable tasks. Our experiments cover training on verifiable math as well as non-verifiable instruction following with a set of benchmark evaluations for both. Across these settings, we extensively compare online and semi-online Direct Preference Optimization and Group Reward Policy Optimization objectives, and surprisingly find similar performance and convergence between these variants, which all strongly outperform offline methods. We provide a detailed analysis of the training dynamics and hyperparameter selection strategies to achieve optimal results. 
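For orientation, the standard Direct Preference Optimization objective for a single preference pair can be sketched as follows. This is the commonly published form of the loss, not code from the paper; the function name `dpo_loss`, the argument names, and the default `beta=0.1` are assumptions chosen for illustration.

```python
import math


def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss for one (chosen, rejected) response pair.

    Inputs are sequence log-probabilities under the policy being trained and
    under a frozen reference model. beta=0.1 is an illustrative default.
    """
    # How much more the policy prefers the chosen response than the reference does.
    margin = (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    x = beta * margin
    # -log(sigmoid(x)), written in a numerically stable way for either sign of x.
    if x >= 0:
        return math.log1p(math.exp(-x))
    return -x + math.log1p(math.exp(x))


# When the policy matches the reference, the loss equals log(2).
print(dpo_loss(-1.0, -1.0, -1.0, -1.0))
```

In the offline regime the pairs are fixed in advance, whereas the semi-online and online regimes compared in the abstract regenerate pairs from a more recent policy; the per-pair loss itself has the same form in each case.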
Finally, we show that multi-tasking with verifiable and non-verifiable rewards jointly yields improved performance across both task types.
[34] Enhancing User Engagement in Socially-Driven Dialogue through Interactive LLM Alignments
Jiashuo Wang,Kaitao Song,Chunpu Xu,Changhe Song,Yang Xiao,Dongsheng Li,Lili Qiu,Wenjie Li
Main category: cs.CL
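TL;DR: This paper proposes a new method based on a user simulator and the i×MCTS algorithm that uses direct preference optimization (DPO) to improve user engagement with interactive large language models in socially-driven dialogue.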
Details
Motivation: Although prior work has optimized models to reason over relevant knowledge or to plan a dialogue act flow, the relationship between user engagement and knowledge or dialogue acts is subtle and does not guarantee high user engagement in socially-driven dialogue. This paper therefore aims to improve the user engagement of interactive LLMs by learning from signals in the future development of the conversation. Method: A user simulator is developed to interact with the target interactive LLM, and the i×MCTS algorithm explores interactions between the user and the system to collect pairs of higher- and lower-quality experiences. Direct preference optimization (DPO) is then used to align the interactive LLM for higher user engagement. Result: Experiments on two socially-driven dialogue scenarios (emotional support conversations and persuasion for good) show that the method effectively enhances the user engagement of interactive LLMs. Conclusion: The experimental results indicate that the proposed method effectively enhances user engagement for interactive LLMs in social dialogue. Abstract: Enhancing user engagement through interactions plays an essential role in socially-driven dialogues. While prior works have optimized models to reason over relevant knowledge or plan a dialogue act flow, the relationship between user engagement and knowledge or dialogue acts is subtle and does not guarantee user engagement in socially-driven dialogues. To this end, we enable interactive LLMs to learn user engagement by leveraging signals from the future development of conversations. Specifically, we adopt a more direct and relevant indicator of user engagement, i.e., the user's reaction related to dialogue intention after the interaction, as a reward to align interactive LLMs. To achieve this, we develop a user simulator to interact with target interactive LLMs and explore interactions between the user and the interactive LLM system via i×MCTS (Monte Carlo Tree Search for interaction). In this way, we collect a dataset containing pairs of higher and lower-quality experiences using i×MCTS, and align interactive LLMs for high-level user engagement by direct preference optimization (DPO) accordingly. Experiments conducted on two socially-driven dialogue scenarios (emotional support conversations and persuasion for good) demonstrate that our method effectively enhances user engagement in interactive LLMs.
[35] skLEP: A Slovak General Language Understanding Benchmark
Marek Šuppa,Andrej Ridzik,Daniel Hládek,Tomáš Javůrek,Viktória Ondrejová,Kristína Sásiková,Martin Tamajka,Marián Šimko
Main category: cs.CL
TL;DR: This paper introduces skLEP, a comprehensive benchmark designed for evaluating Slovak natural language understanding models.
Details
Motivation: Evaluating Slovak natural language understanding models requires a dedicated benchmark. Method: The authors compiled a benchmark of nine diverse tasks, curating new, original datasets and translating established English NLU resources. Result: The complete benchmark data, an open-source toolkit, and a public leaderboard are released. Conclusion: skLEP is a comprehensive Slovak NLU benchmark intended to foster reproducibility and drive future research in Slovak NLU. Abstract: In this work, we introduce skLEP, the first comprehensive benchmark specifically designed for evaluating Slovak natural language understanding (NLU) models. We have compiled skLEP to encompass nine diverse tasks that span token-level, sentence-pair, and document-level challenges, thereby offering a thorough assessment of model capabilities. To create this benchmark, we curated new, original datasets tailored for Slovak and meticulously translated established English NLU resources. Within this paper, we also present the first systematic and extensive evaluation of a wide array of Slovak-specific, multilingual, and English pre-trained language models using the skLEP tasks. Finally, we also release the complete benchmark data, an open-source toolkit facilitating both fine-tuning and evaluation of models, and a public leaderboard at https://github.com/slovak-nlp/sklep in the hopes of fostering reproducibility and driving future research in Slovak NLU.
[36] Potemkin Understanding in Large Language Models
Marina Mancoridis,Bec Weeks,Keyon Vafa,Sendhil Mullainathan
Main category: cs.CL
TL;DR: This paper argues that current benchmarks may mislead us into thinking LLMs understand concepts when they only mimic understanding through incoherent reasoning, and it provides methods to detect such 'potemkin' responses.
Details
Motivation: The motivation stems from the concern that benchmark datasets used for evaluating LLMs might not be reliable indicators of their actual conceptual understanding since these models could produce correct answers without truly grasping the underlying concepts. Method: The authors present two procedures: one involves a specially designed benchmark across three domains, while the other is a general method to estimate the minimum prevalence of 'potemkins.' They analyze whether LLMs' errors align with human misunderstandings. Result: The study finds that 'potemkin understanding' is widespread across different models, tasks, and domains. Moreover, these errors indicate deeper internal inconsistencies in how LLMs represent concepts. Conclusion: The paper concludes that current benchmarks may not accurately assess LLMs' true understanding, as they can result in 'potemkin understanding,' where models give answers that seem correct but are based on incoherent concept representations. Abstract: Large language models (LLMs) are regularly evaluated using benchmark datasets. But what justifies making inferences about an LLM's capabilities based on its answers to a curated set of questions? This paper first introduces a formal framework to address this question. The key is to note that the benchmarks used to test LLMs -- such as AP exams -- are also those used to test people. However, this raises an implication: these benchmarks are only valid tests if LLMs misunderstand concepts in ways that mirror human misunderstandings. Otherwise, success on benchmarks only demonstrates potemkin understanding: the illusion of understanding driven by answers irreconcilable with how any human would interpret a concept. We present two procedures for quantifying the existence of potemkins: one using a specially designed benchmark in three domains, the other using a general procedure that provides a lower-bound on their prevalence. 
We find that potemkins are ubiquitous across models, tasks, and domains. We also find that these failures reflect not just incorrect understanding, but deeper internal incoherence in concept representations.
[37] "What's Up, Doc?": Analyzing How Users Seek Health Information in Large-Scale Conversational AI Datasets
Akshay Paruchuri,Maryam Aziz,Rohit Vartak,Ayman Ali,Best Uchehara,Xin Liu,Ishan Chatterjee,Monica Agrawal
Main category: cs.CL
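TL;DR: This paper studies how users seek medical information from large language models through chatbots, finds a variety of risks in these interactions, and argues that LLMs' capabilities in healthcare applications need to improve.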
Details
Motivation: As more people obtain medical information from large language models via chatbots, studying the nature of these conversations and their potential risks has become especially important. Method: By filtering large-scale conversational AI datasets, the authors construct HealthChat-11K, a dataset of 11,000 real-world conversations spanning 21 distinct health specialties. They use this dataset together with a clinician-driven taxonomy to analyze how users interact with LLMs to obtain medical information. Result: The study reveals how and why users seek medical information, including common interaction patterns, instances of incomplete context, affective behaviors, and question types that can induce sycophancy. Conclusion: The paper highlights the challenges and risks of using LLMs as chatbots for healthcare information seeking and the need to improve their healthcare support capabilities. Abstract: People are increasingly seeking healthcare information from large language models (LLMs) via interactive chatbots, yet the nature and inherent risks of these conversations remain largely unexplored. In this paper, we filter large-scale conversational AI datasets to obtain HealthChat-11K, a curated dataset of 11K real-world conversations composed of 25K user messages. We use HealthChat-11K and a clinician-driven taxonomy for how users interact with LLMs when seeking healthcare information in order to systematically study user interactions across 21 distinct health specialties. Our analysis reveals insights into the nature of how and why users seek health information, such as common interactions, instances of incomplete context, affective behaviors, and interactions (e.g., leading questions) that can induce sycophancy, underscoring the need for improvements in the healthcare support capabilities of LLMs deployed as conversational AI. Code and artifacts to retrieve our analyses and combine them into a curated dataset can be found here: https://github.com/yahskapar/HealthChat
[38] Data Efficacy for Language Model Training
Yalun Dai,Yangyu Huang,Xin Zhang,Wenshan Wu,Chong Li,Wenhui Lu,Shijie Cao,Li Dong,Scarlett Li
Main category: cs.CL
TL;DR: This paper introduces DELT, a new approach to language model training that focuses on data efficacy by optimizing how training data is organized, scored, selected, and ordered, resulting in improved model performance.
Details
Motivation: While data efficiency has been widely researched in language model training, data efficacy - optimizing the organization of training data - remains relatively underexplored. Method: The study introduces DELT, a paradigm for considering data efficacy in LM training. It includes Data Scoring (specifically Learnability-Quality Scoring), Data Selection, and Data Ordering (specifically Folding Ordering). Result: Experiments showed that instances of DELT improved LM performance without increasing data scale or model size. The combination of LQS for data scoring and Folding for data ordering achieved the most significant improvement. Conclusion: Data efficacy is a promising foundational area in LM training that can be achieved alongside data efficiency through techniques like data scoring and ordering. Abstract: Data is fundamental to the training of language models (LM). Recent research has been dedicated to data efficiency, which aims to maximize performance by selecting a minimal or optimal subset of training data. Techniques such as data filtering, sampling, and selection play a crucial role in this area. To complement it, we define Data Efficacy, which focuses on maximizing performance by optimizing the organization of training data and remains relatively underexplored. This work introduces a general paradigm, DELT, for considering data efficacy in LM training, which highlights the significance of training data organization. DELT comprises three components: Data Scoring, Data Selection, and Data Ordering. Among these components, we design Learnability-Quality Scoring (LQS), as a new instance of Data Scoring, which considers both the learnability and quality of each data sample from the gradient consistency perspective. We also devise Folding Ordering (FO), as a novel instance of Data Ordering, which addresses issues such as model forgetting and data distribution bias. 
Comprehensive experiments validate the data efficacy in LM training, which demonstrates the following: Firstly, various instances of the proposed DELT enhance LM performance to varying degrees without increasing the data scale and model size. Secondly, among these instances, the combination of our proposed LQS for data scoring and Folding for data ordering achieves the most significant improvement. Lastly, data efficacy can be achieved together with data efficiency by applying data selection. Therefore, we believe that data efficacy is a promising foundational area in LM training.
cs.CV [Back]
[39] OTSurv: A Novel Multiple Instance Learning Framework for Survival Prediction with Heterogeneity-aware Optimal Transport
Qin Ren,Yifan Wang,Ruogu Fang,Haibin Ling,Chenyu You
Main category: cs.CV
TL;DR: The paper introduces OTSurv, a novel multiple instance learning framework from an optimal transport perspective, which explicitly captures pathological heterogeneity within whole slide images for survival prediction in digital pathology.
Details
Motivation: Existing MIL methods often fail to explicitly capture pathological heterogeneity within WSIs, both globally, through long-tailed morphological distributions, and locally, through tile-level prediction uncertainty. Optimal transport (OT) provides a principled way of modeling such heterogeneity by incorporating marginal distribution constraints. Method: Building on this insight, we propose OTSurv, a novel MIL framework from an optimal transport perspective. Specifically, OTSurv formulates survival predictions as a heterogeneity-aware OT problem with two constraints: (1) global long-tail constraint that models prior morphological distributions to avert both mode collapse and excessive uniformity by regulating transport mass allocation, and (2) local uncertainty-aware constraint that prioritizes high-confidence patches while suppressing noise by progressively raising the total transport mass. Result: Empirically, OTSurv sets new state-of-the-art results across six popular benchmarks, achieving an absolute 3.6% improvement in average C-index. Conclusion: OTSurv achieves statistical significance in log-rank tests and offers high interpretability, making it a powerful tool for survival prediction in digital pathology. Abstract: Survival prediction using whole slide images (WSIs) can be formulated as a multiple instance learning (MIL) problem. However, existing MIL methods often fail to explicitly capture pathological heterogeneity within WSIs, both globally, through long-tailed morphological distributions, and locally, through tile-level prediction uncertainty. Optimal transport (OT) provides a principled way of modeling such heterogeneity by incorporating marginal distribution constraints. Building on this insight, we propose OTSurv, a novel MIL framework from an optimal transport perspective. 
Specifically, OTSurv formulates survival predictions as a heterogeneity-aware OT problem with two constraints: (1) global long-tail constraint that models prior morphological distributions to avert both mode collapse and excessive uniformity by regulating transport mass allocation, and (2) local uncertainty-aware constraint that prioritizes high-confidence patches while suppressing noise by progressively raising the total transport mass. We then recast the initial OT problem, augmented by these constraints, into an unbalanced OT formulation that can be solved with an efficient, hardware-friendly matrix scaling algorithm. Empirically, OTSurv sets new state-of-the-art results across six popular benchmarks, achieving an absolute 3.6% improvement in average C-index. In addition, OTSurv achieves statistical significance in log-rank tests and offers high interpretability, making it a powerful tool for survival prediction in digital pathology. Our code is available at https://github.com/Y-Research-SBU/OTSurv.
[40] StereoDiff: Stereo-Diffusion Synergy for Video Depth Estimation
Haodong Li,Chen Wang,Jiahui Lei,Kostas Daniilidis,Lingjie Liu
Main category: cs.CV
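TL;DR: StereoDiff is a two-stage video depth estimator that combines the strengths of stereo matching and video depth diffusion to improve the consistency and accuracy of video depth estimation.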
Details
Motivation: Video depth estimation differs fundamentally from image depth estimation, especially in its temporal consistency requirements. Static regions can obtain stronger global 3D cues through stereo matching, whereas dynamic regions still need to be learned from large-scale video depth data to ensure smooth transitions. Method: The paper proposes StereoDiff, a two-stage video depth estimation method that combines stereo matching (for static regions) with video depth diffusion (for dynamic regions), and uses frequency domain analysis to show their complementary strengths. Result: The method performs strongly on zero-shot, real-world, dynamic video depth benchmarks, showing superior consistency and accuracy in both indoor and outdoor scenes. Conclusion: StereoDiff successfully combines the advantages of stereo matching and video depth diffusion, achieving state-of-the-art performance on video depth estimation. Abstract: Recent video depth estimation methods achieve great performance by following the paradigm of image depth estimation, i.e., typically fine-tuning pre-trained video diffusion models with massive data. However, we argue that video depth estimation is not a naive extension of image depth estimation. The temporal consistency requirements for dynamic and static regions in videos are fundamentally different. Consistent video depth in static regions, typically backgrounds, can be more effectively achieved via stereo matching across all frames, which provides much stronger global 3D cues. In contrast, consistency for dynamic regions should still be learned from large-scale video depth data to ensure smooth transitions, because these regions violate triangulation constraints. Based on these insights, we introduce StereoDiff, a two-stage video depth estimator that synergizes stereo matching for mainly the static areas with video depth diffusion for maintaining consistent depth transitions in dynamic areas. We mathematically demonstrate how stereo matching and video depth diffusion offer complementary strengths through frequency domain analysis, highlighting the effectiveness of their synergy in capturing the advantages of both. Experimental results on zero-shot, real-world, dynamic video depth benchmarks, both indoor and outdoor, demonstrate StereoDiff's SoTA performance, showcasing its superior consistency and accuracy in video depth estimation.
[41] ConViTac: Aligning Visual-Tactile Fusion with Contrastive Representations
Zhiyuan Wu,Yongqiang Zhao,Shan Luo
Main category: cs.CV
TL;DR: ConViTac improves robot perception by effectively fusing visual and tactile data using contrastive representation learning, achieving better performance than existing methods.
Details
Motivation: The motivation is to overcome the limitations of existing approaches that use simple feature combination techniques like addition or concatenation for visual-tactile fusion, which often lead to poor integration of features. Method: The paper proposes ConViTac, a network that uses a contrastive encoder pretrained through self-supervised contrastive learning to project visual and tactile inputs into unified latent embeddings. These embeddings are then fused using cross-modal attention for improved performance on downstream tasks. Result: Experiments show that ConViTac achieves superior performance compared to current state-of-the-art methods in real-world scenarios, with up to a 12.0% improvement in accuracy for material classification and grasping prediction tasks. Conclusion: The paper concludes that ConViTac, with its Contrastive Embedding Conditioning mechanism, significantly enhances visual-tactile feature fusion and outperforms state-of-the-art methods in material classification and grasping prediction tasks. Abstract: Vision and touch are two fundamental sensory modalities for robots, offering complementary information that enhances perception and manipulation tasks. Previous research has attempted to jointly learn visual-tactile representations to extract more meaningful information. However, these approaches often rely on direct combination, such as feature addition and concatenation, for modality fusion, which tend to result in poor feature integration. In this paper, we propose ConViTac, a visual-tactile representation learning network designed to enhance the alignment of features during fusion using contrastive representations. Our key contribution is a Contrastive Embedding Conditioning (CEC) mechanism that leverages a contrastive encoder pretrained through self-supervised contrastive learning to project visual and tactile inputs into unified latent embeddings. 
These embeddings are used to couple visual-tactile feature fusion through cross-modal attention, aiming at aligning the unified representations and enhancing performance on downstream tasks. We conduct extensive experiments to demonstrate the superiority of ConViTac in the real world over current state-of-the-art methods and the effectiveness of our proposed CEC mechanism, which improves accuracy by up to 12.0% in material classification and grasping prediction tasks.
[42] AI-Driven MRI-based Brain Tumour Segmentation Benchmarking
Connor Ludwig,Khashayar Namdar,Farzad Khalvati
Main category: cs.CV
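TL;DR: This paper compares several promptable models for medical image segmentation and finds that although SAM-family models perform very well with high-quality prompts, nnU-Net remains the best choice because it needs no precise prompts; fine-tuning substantially improves point-prompt performance.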
Details
Motivation: Many general promptable models and medical image segmentation variants have appeared in recent years, but there is no evaluation and comparison of their performance across prompt qualities on a common medical dataset. Method: On the BraTS 2023 adult glioma and pediatrics datasets, the study evaluates zero-shot inference with Segment Anything Model (SAM), Segment Anything Model 2 (SAM 2), MedSAM, SAM-Med-3D, and nnU-Net under multiple prompt conditions (points and bounding boxes), and then fine-tunes these models on the pediatrics dataset to compare the performance gains. Result: Several models show good Dice scores with precise bounding box prompts; SAM and SAM 2 reach 0.894 and 0.893, respectively, exceeding nnU-Net. However, in practical settings that cannot rely on highly accurate prompts, nnU-Net remains the dominant medical image segmentation network. Point-prompt performance improves markedly after fine-tuning, showing promise for future research. Conclusion: Although SAM and its variants can match or exceed nnU-Net's segmentation when given precise prompts, nnU-Net remains dominant because it does not require highly accurate prompts in practice. Fine-tuning substantially improves point-prompt performance, but it still cannot surpass bounding boxes or nnU-Net. Abstract: Medical image segmentation has greatly aided medical diagnosis, with U-Net based architectures and nnU-Net providing state-of-the-art performance. There have been numerous general promptable models and medical variations introduced in recent years, but there is currently a lack of evaluation and comparison of these models across a variety of prompt qualities on a common medical dataset. This research uses Segment Anything Model (SAM), Segment Anything Model 2 (SAM 2), MedSAM, SAM-Med-3D, and nnU-Net to obtain zero-shot inference on the BraTS 2023 adult glioma and pediatrics dataset across multiple prompt qualities for both points and bounding boxes. Several of these models exhibit promising Dice scores, particularly SAM and SAM 2, which achieve scores of up to 0.894 and 0.893, respectively, when given extremely accurate bounding box prompts, exceeding nnU-Net's segmentation performance. However, nnU-Net remains the dominant medical image segmentation network due to the impracticality of providing highly accurate prompts to the models. The model and prompt evaluation, as well as the comparison, are extended through fine-tuning SAM, SAM 2, MedSAM, and SAM-Med-3D on the pediatrics dataset. 
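Since the results above are reported as Dice scores, a minimal sketch of the metric may help when reading the numbers. It uses the usual definition, Dice = 2|A ∩ B| / (|A| + |B|), on flattened binary masks. The function name `dice_score` and the convention of returning 1.0 when both masks are empty are assumptions for illustration, not details taken from the paper.

```python
def dice_score(pred, target):
    """Dice coefficient between two flat binary masks of equal length."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same number of elements")
    # Count voxels that are foreground in both masks.
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    total = sum(1 for p in pred if p) + sum(1 for t in target if t)
    if total == 0:
        # Both masks are empty; treating this as perfect agreement is an assumption.
        return 1.0
    return 2.0 * intersection / total


print(dice_score([1, 1, 0, 0], [1, 0, 1, 0]))  # 0.5
```

A score of 1.0 means the predicted and reference masks coincide exactly, and 0.0 means they share no foreground voxels.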
The improvements in point prompt performance after fine-tuning are substantial and show promise for future investigation, but are unable to achieve better segmentation than bounding boxes or nnU-Net.
[43] How do Foundation Models Compare to Skeleton-Based Approaches for Gesture Recognition in Human-Robot Interaction?
Stephanie Käs,Anton Burenko,Louis Markert,Onur Alp Culha,Dennis Mack,Timm Linder,Bastian Leibe
Main category: cs.CV
TL;DR: This paper explores using Vision Foundation Models (like V-JEPA) and Vision Language Models (like Gemini Flash 2.0) for dynamic, full-body gesture recognition, comparing them with the top-performing skeleton-based method HD-GCN. While HD-GCN performs best, V-JEPA shows promise for reducing system complexity. Gemini struggles in zero-shot settings, highlighting the need for better input representations.
Details
Motivation: Traditional deep learning-based gesture recognition systems rely on task-specific architectures and modalities such as images, videos, or skeletal pose estimates. Vision Foundation Models and Vision Language Models have shown strong generalization abilities, potentially reducing system complexity by replacing dedicated modules. This study investigates their applicability to gesture recognition. Method: The study adapts Vision Foundation Models (V-JEPA), Vision Language Models (Gemini Flash 2.0), and a skeleton-based method (HD-GCN) for dynamic, full-body gesture recognition. It introduces the NUGGET dataset tailored for human-robot communication in intralogistics environments to evaluate these approaches. Result: HD-GCN achieves the best performance in dynamic, full-body gesture recognition. V-JEPA comes close when equipped with a simple, task-specific classification head, suggesting potential for multi-task modeling. Gemini Flash 2.0 struggles to differentiate gestures based solely on textual descriptions in a zero-shot setting. Conclusion: The study concludes that while HD-GCN currently performs best for dynamic, full-body gesture recognition, Vision Foundation Models like V-JEPA offer a promising path to reduce system complexity when used with a task-specific classification head. Gemini Flash 2.0 struggles in zero-shot gesture recognition, indicating the need for further research on input representations. Abstract: Gestures enable non-verbal human-robot communication, especially in noisy environments like agile production. Traditional deep learning-based gesture recognition relies on task-specific architectures using images, videos, or skeletal pose estimates as input. Meanwhile, Vision Foundation Models (VFMs) and Vision Language Models (VLMs) with their strong generalization abilities offer potential to reduce system complexity by replacing dedicated task-specific modules. 
This study investigates adapting such models for dynamic, full-body gesture recognition, comparing V-JEPA (a state-of-the-art VFM), Gemini Flash 2.0 (a multimodal VLM), and HD-GCN (a top-performing skeleton-based approach). We introduce NUGGET, a dataset tailored for human-robot communication in intralogistics environments, to evaluate the different gesture recognition approaches. In our experiments, HD-GCN achieves the best performance, but V-JEPA comes close with a simple, task-specific classification head, thus paving a possible way towards reducing system complexity by using it as a shared multi-task model. In contrast, Gemini struggles to differentiate gestures based solely on textual descriptions in the zero-shot setting, highlighting the need for further research on suitable input representations for gestures.[44] Leveraging Vision-Language Models to Select Trustworthy Super-Resolution Samples Generated by Diffusion Models
Cansu Korkmaz,Ahmet Murat Tekalp,Zafer Dogan
Main category: cs.CV
TL;DR: This paper proposes a novel framework using vision-language models (VLMs) to automatically select the most trustworthy super-resolution (SR) image from a diffusion-generated set, introducing a new hybrid metric called Trustworthiness Score (TWS) that aligns with human perception and semantic correctness.
Details
Motivation: Super-resolution (SR) is an ill-posed problem with multiple feasible solutions. While regressive models struggle with balancing fidelity and perceptual quality, diffusion models generate diverse SR images but lack a principled way to select the most trustworthy one. This work aims to address this challenge by leveraging VLMs for better decision-making in SR tasks. Method: The paper introduces an automated framework using VLMs like BLIP-2 and GPT-4o to evaluate SR images based on semantic correctness, visual quality, and artifact presence. A novel hybrid metric called Trustworthiness Score (TWS) is proposed, combining semantic similarity via CLIP embeddings, structural integrity using SSIM on edge maps, and artifact sensitivity through multi-level wavelet decomposition. VLM-guided selection and TWS are validated against human preference. Result: The paper demonstrates that VLM-guided selections yield high Trustworthiness Scores (TWS), which strongly correlate with human preferences. Compared to traditional metrics like PSNR and LPIPS, the proposed TWS offers a more reliable measure of SR trustworthiness. The method effectively identifies high-quality SR images while minimizing artifacts in a cost-effective manner. Conclusion: This paper concludes that leveraging vision-language models (VLMs) to select the most trustworthy super-resolution (SR) image from a diffusion-generated set significantly improves the reliability and trustworthiness of SR outputs. The proposed Trustworthiness Score (TWS) aligns well with human preferences, offering a scalable and generalizable solution for generative SR. Abstract: Super-resolution (SR) is an ill-posed inverse problem with many feasible solutions consistent with a given low-resolution image. 
On one hand, regressive SR models aim to balance fidelity and perceptual quality to yield a single solution, but this trade-off often introduces artifacts that create ambiguity in information-critical applications such as recognizing digits or letters. On the other hand, diffusion models generate a diverse set of SR images, but selecting the most trustworthy solution from this set remains a challenge. This paper introduces a robust, automated framework for identifying the most trustworthy SR sample from a diffusion-generated set by leveraging the semantic reasoning capabilities of vision-language models (VLMs). Specifically, VLMs such as BLIP-2, GPT-4o, and their variants are prompted with structured queries to assess semantic correctness, visual quality, and artifact presence. The top-ranked SR candidates are then ensembled to yield a single trustworthy output in a cost-effective manner. To rigorously assess the validity of VLM-selected samples, we propose a novel Trustworthiness Score (TWS), a hybrid metric that quantifies SR reliability based on three complementary components: semantic similarity via CLIP embeddings, structural integrity using SSIM on edge maps, and artifact sensitivity through multi-level wavelet decomposition. We empirically show that TWS correlates strongly with human preference in both ambiguous and natural images, and that VLM-guided selections consistently yield high TWS values. Compared to conventional metrics like PSNR and LPIPS, which fail to reflect information fidelity, our approach offers a principled, scalable, and generalizable solution for navigating the uncertainty of the diffusion SR space. By aligning outputs with human expectations and semantic correctness, this work sets a new benchmark for trustworthiness in generative SR.[45] FixCLR: Negative-Class Contrastive Learning for Semi-Supervised Domain Generalization
Ha Min Son,Shahbaz Rezaei,Xin Liu
Main category: cs.CV
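TL;DR: FixCLR is a semi-supervised domain generalization method that uses contrastive learning for explicit domain-invariance regularization and can improve the performance of most existing semi-supervised methods.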
Details
Motivation: Because labels are scarce, existing domain generalization methods underperform, so a method is needed that explicitly regularizes toward representations that are invariant across all domains. Method: Inspired by the success of self-supervised learning, FixCLR adapts two key components of contrastive learning: it uses class information from pseudo-labels, and it regularizes with only a repelling term. Result: FixCLR effectively improves semi-supervised domain generalization performance, particularly when combined with other semi-supervised methods, and it also works on datasets with many domains. Conclusion: FixCLR is an effective semi-supervised domain generalization method that offers a new direction and a basis for improvement in future research. Abstract: Semi-supervised domain generalization (SSDG) aims to solve the problem of generalizing to out-of-distribution data when only a few labels are available. Due to label scarcity, domain generalization methods often underperform. Consequently, existing SSDG methods combine semi-supervised learning methods with various regularization terms. However, these methods do not explicitly regularize to learn domain-invariant representations across all domains, which is a key goal for domain generalization. To address this, we introduce FixCLR. Inspired by success in self-supervised learning, we change two crucial components to adapt contrastive learning for explicit domain invariance regularization: utilization of class information from pseudo-labels and using only a repelling term. FixCLR can also be added on top of most existing SSDG and semi-supervised methods for complementary performance improvements. Our research includes extensive experiments that have not been previously explored in SSDG studies. These experiments include benchmarking different improvements to semi-supervised methods, evaluating the performance of pretrained versus non-pretrained models, and testing on datasets with many domains. Overall, FixCLR proves to be an effective SSDG method, especially when combined with other semi-supervised methods.[46] Vector Contrastive Learning For Pixel-Wise Pretraining In Medical Vision
Yuting He,Shuo Li
Main category: cs.CV
TL;DR: This paper introduces COVER, a novel contrastive learning framework for pixel-wise representation in medical imaging, solving over-dispersion issues through vector regression.
Details
Motivation: Extending contrastive learning to pixel-wise representation in medical vision is an open problem due to the over-dispersion issue from binary optimization, which breaks feature correlations. Method: The COntrast in VEctor Regression (COVER) framework reformulates contrastive learning as a vector regression problem for dispersion quantification, using a vector pyramid architecture for granularity adaptation. Result: COVER enables better modeling of feature distances through vector regression, preserving intra-class distribution and pixel-wise feature correlation across 8 tasks spanning 2 dimensions and 4 modalities. Conclusion: COVER significantly improves pixel-wise self-supervised pretraining, advancing generalizable medical visual foundation models. Abstract: Contrastive learning (CL) has become a cornerstone of self-supervised pretraining (SSP) in foundation models; however, extending CL to pixel-wise representation, crucial for medical vision, remains an open problem. Standard CL formulates SSP as a binary optimization problem (binary CL) where the excessive pursuit of feature dispersion leads to an over-dispersion problem, breaking pixel-wise feature correlation and thus disrupting the intra-class distribution. Our vector CL reformulates CL as a vector regression problem, enabling dispersion quantification in pixel-wise pretraining via modeling feature distances in regressing displacement vectors. To implement this novel paradigm, we propose the COntrast in VEctor Regression (COVER) framework. COVER establishes an extendable vector-based self-learning, enforces a consistent optimization flow from vector regression to distance modeling, and leverages a vector pyramid architecture for granularity adaptation, thus preserving pixel-wise feature correlations in SSP. 
Extensive experiments across 8 tasks, spanning 2 dimensions and 4 modalities, show that COVER significantly improves pixel-wise SSP, advancing generalizable medical visual foundation models.[47] Enhancing Ambiguous Dynamic Facial Expression Recognition with Soft Label-based Data Augmentation
Ryosuke Kawamura,Hideaki Hayashi,Shunsuke Otake,Noriko Takemura,Hajime Nagahara
Main category: cs.CV
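TL;DR: This paper proposes MIDAS, a data augmentation method for dynamic facial expression recognition that targets ambiguous expressions in particular; experiments show superior performance.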
Details
Motivation: To improve the accurate recognition of ambiguous facial expressions in practical applications, a problem frequently encountered in in-the-wild data. Method: Proposes MIDAS, which augments training data by combining video frames with their corresponding soft labels, extending mixup to soft-labeled video data. Result: Models trained on MIDAS-augmented data outperform existing methods on both the DFEW and FERV39k-Plus datasets. Conclusion: MIDAS is an effective data augmentation method that improves recognition of ambiguous facial expressions in DFER tasks. Abstract: Dynamic facial expression recognition (DFER) is a task that estimates emotions from facial expression video sequences. For practical applications, accurately recognizing ambiguous facial expressions -- frequently encountered in in-the-wild data -- is essential. In this study, we propose MIDAS, a data augmentation method designed to enhance DFER performance for ambiguous facial expression data using soft labels representing probabilities of multiple emotion classes. MIDAS augments training data by convexly combining pairs of video frames and their corresponding emotion class labels. This approach extends mixup to soft-labeled video data, offering a simple yet highly effective method for handling ambiguity in DFER. To evaluate MIDAS, we conducted experiments on both the DFEW dataset and FERV39k-Plus, a newly constructed dataset that assigns soft labels to an existing DFER dataset. The results demonstrate that models trained with MIDAS-augmented data achieve superior performance compared to the state-of-the-art method trained on the original dataset.[48] THIRDEYE: Cue-Aware Monocular Depth Estimation via Brain-Inspired Multi-Stage Fusion
Calin Teodor Ioan
Main category: cs.CV
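TL;DR: ThirdEye is a new monocular depth estimation method that explicitly exploits cues the human visual system relies on, such as occlusion boundaries, shading, and perspective, through specialized pre-trained networks, and fuses them in a three-stage cortical hierarchy.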
Details
Motivation: Traditional monocular depth estimation methods usually overlook the explicit monocular cues that the human visual system relies on and instead expect the network to discover them on its own. ThirdEye aims to improve on this by supplying these cues explicitly. Method: ThirdEye uses specialized, pre-trained, and frozen networks to extract monocular cues and fuses them in a three-stage cortical hierarchy (V1->V2->V3) equipped with a key-value working-memory module that weights the cues by reliability; an adaptive-bins transformer head then produces a high-resolution disparity map. Result: Because the cue experts are frozen, ThirdEye inherits large amounts of external supervision while requiring only modest fine-tuning; quantitative results will be provided in a future revision. Conclusion: By explicitly integrating the monocular cues of the human visual system, ThirdEye offers a more efficient and better-supervised approach to monocular depth estimation. Abstract: Monocular depth estimation methods traditionally train deep models to infer depth directly from RGB pixels. This implicit learning often overlooks explicit monocular cues that the human visual system relies on, such as occlusion boundaries, shading, and perspective. Rather than expecting a network to discover these cues unaided, we present ThirdEye, a cue-aware pipeline that deliberately supplies each cue through specialised, pre-trained, and frozen networks. These cues are fused in a three-stage cortical hierarchy (V1->V2->V3) equipped with a key-value working-memory module that weights them by reliability. An adaptive-bins transformer head then produces a high-resolution disparity map. Because the cue experts are frozen, ThirdEye inherits large amounts of external supervision while requiring only modest fine-tuning. This extended version provides additional architectural detail, neuroscientific motivation, and an expanded experimental protocol; quantitative results will appear in a future revision.[49] MultiHuman-Testbench: Benchmarking Image Generation for Multiple Humans
Shubhankar Borse,Seokeon Choi,Sunghyun Park,Jeongho Kim,Shreya Kadambi,Risheek Garrepalli,Sungrack Yun,Munawar Hayat,Fatih Porikli
Main category: cs.CV
TL;DR: This paper introduces MultiHuman-Testbench, a new benchmark for evaluating multi-human image generation models, featuring diverse data and evaluation metrics alongside improved identity preservation techniques.
Details
Motivation: Generating images with multiple humans performing complex actions while preserving facial identities is challenging due to the lack of a dedicated benchmark. Method: Introduced MultiHuman-Testbench, which includes 1800 samples with text prompts, 5550 diverse face images, pose conditioning images, and a four-metric evaluation suite. Also employed zero-shot and training-based models with regional priors, alongside human segmentation and Hungarian matching techniques. Result: A comprehensive benchmark with diverse data samples and evaluation metrics was developed. Model evaluations showed improvements in ID similarity using novel incorporation techniques. Conclusion: The proposed MultiHuman-Testbench provides a standardized tool for evaluating generative models in multi-human image generation, offering insights into model performance and improvements through novel techniques like image and region isolation. Abstract: Generation of images containing multiple humans, performing complex actions, while preserving their facial identities, is a significant challenge. A major factor contributing to this is the lack of a dedicated benchmark. To address this, we introduce MultiHuman-Testbench, a novel benchmark for rigorously evaluating generative models for multi-human generation. The benchmark comprises 1800 samples, including carefully curated text prompts, describing a range of simple to complex human actions. These prompts are matched with a total of 5,550 unique human face images, sampled uniformly to ensure diversity across age, ethnic background, and gender. Alongside captions, we provide human-selected pose conditioning images which accurately match the prompt. We propose a multi-faceted evaluation suite employing four key metrics to quantify face count, ID similarity, prompt alignment, and action detection. 
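To make the role of Hungarian matching in an ID-similarity measurement more concrete, here is a minimal sketch. It is not the benchmark's implementation: the 2-D "embeddings", the use of cosine similarity, and the function names are all assumptions for illustration, and a brute-force search over permutations stands in for the Hungarian algorithm (it reaches the same optimal one-to-one assignment, but only scales to a handful of faces).

```python
from itertools import permutations
from math import sqrt


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def match_identities(reference, generated):
    """Return (assignment, mean_similarity) for the best one-to-one matching.

    assignment[i] is the index of the generated face paired with reference i.
    Brute force over permutations is a stand-in for the Hungarian algorithm.
    """
    if not reference or len(reference) != len(generated):
        raise ValueError("this sketch assumes equal, non-zero face counts")
    n = len(reference)
    sim = [[cosine(r, g) for g in generated] for r in reference]
    best = max(permutations(range(n)),
               key=lambda p: sum(sim[i][p[i]] for i in range(n)))
    mean_sim = sum(sim[i][best[i]] for i in range(n)) / n
    return list(best), mean_sim


# Hypothetical embeddings; the generated faces appear in a different order.
reference = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
generated = [[0.1, 0.9], [0.9, 1.1], [0.95, 0.05]]
assignment, score = match_identities(reference, generated)
```

Matching before scoring means each reference identity is compared against exactly one generated face, so the score does not depend on the order in which people appear in the image.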
We conduct a thorough evaluation of a diverse set of models, including zero-shot approaches and training-based methods, with and without regional priors. We also propose novel techniques to incorporate image and region isolation using human segmentation and Hungarian matching, significantly improving ID similarity. Our proposed benchmark and key findings provide valuable insights and a standardized tool for advancing research in multi-human image generation.[50] The Role of Cyclopean-Eye in Stereo Vision
Sherlon Almeida da Silva,Davi Geiger,Luiz Velho,Moacir Antonelli Ponti
Main category: cs.CV
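TL;DR: This paper studies the geometric foundations of stereo vision systems, proposes new geometric constraints, and examines the role of attention mechanisms and deep learning features in 3D depth reconstruction.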
Details
Motivation: To explore how 3D structure and human perception contribute to accurate depth reconstruction. Method: Revisits the Cyclopean Eye model through theoretical analysis and empirical studies and proposes new geometric constraints. Result: Demonstrates the role of attention mechanisms in recovering meaningful 3D surfaces and evaluates the stereo feature matching quality of deep learning models. Conclusion: Combining geometric priors with learned features provides internal abstractions for understanding stereo vision systems. Abstract: This work investigates the geometric foundations of modern stereo vision systems, with a focus on how 3D structure and human-inspired perception contribute to accurate depth reconstruction. We revisit the Cyclopean Eye model and propose novel geometric constraints that account for occlusions and depth discontinuities. Our analysis includes the evaluation of stereo feature matching quality derived from deep learning models, as well as the role of attention mechanisms in recovering meaningful 3D surfaces. Through both theoretical insights and empirical studies on real datasets, we demonstrate that combining strong geometric priors with learned features provides internal abstractions for understanding stereo vision systems.[51] FaSTA$^*$: Fast-Slow Toolpath Agent with Subroutine Mining for Efficient Multi-turn Image Editing
Advait Gupta,Rishie Raj,Dang Nguyen,Tianyi Zhou
Main category: cs.CV
TL;DR: FaSTA$^*$: A fast-slow planning agent for multi-turn image editing that uses LLMs for high-level planning and A$^*$ search for low-level execution, achieving cost-efficiency and competitive performance.
Details
Motivation: The motivation is to develop a cost-efficient neurosymbolic agent for complex multi-turn image editing tasks by combining the strengths of fast planning and accurate low-level search. Method: FaSTA* combines fast, high-level subtask planning using LLMs with slow, accurate tool-use and local A* search. It extracts/refines frequently used subroutines via inductive reasoning on previously successful toolpaths to enable adaptive fast-slow planning. Result: FaSTA* achieves computational efficiency through reusable symbolic subroutines and maintains competitiveness with state-of-the-art methods in terms of success rate. Conclusion: FaSTA* saves exploration cost by reusing symbolic subroutines and is significantly more computationally efficient while maintaining a high success rate in multi-turn image editing tasks. Abstract: We develop a cost-efficient neurosymbolic agent to address challenging multi-turn image editing tasks such as "Detect the bench in the image while recoloring it to pink. Also, remove the cat for a clearer view and recolor the wall to yellow." It combines the fast, high-level subtask planning by large language models (LLMs) with the slow, accurate, tool-use, and local A$^*$ search per subtask to find a cost-efficient toolpath -- a sequence of calls to AI tools. To save the cost of A$^*$ on similar subtasks, we perform inductive reasoning on previously successful toolpaths via LLMs to continuously extract/refine frequently used subroutines and reuse them as new tools for future tasks in an adaptive fast-slow planning, where the higher-level subroutines are explored first, and only when they fail, the low-level A$^*$ search is activated. 
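The fast-slow control flow described above can be sketched roughly as follows. This is an illustrative toy under stated assumptions, not the authors' system: the tool graph, its costs, the heuristic values, and the `verify` callback are all hypothetical, and simply caching the last successful toolpath stands in for the LLM-based subroutine mining.

```python
import heapq

# Hypothetical tool graph for one subtask: state -> [(next_state, tool, cost)].
TOOL_GRAPH = {
    "start": [("detected", "detector", 1.0), ("masked", "segmenter", 4.0)],
    "detected": [("masked", "box_to_mask", 1.0)],
    "masked": [("done", "recolor", 2.0)],
    "done": [],
}
# Assumed admissible heuristic: a lower bound on the remaining cost per state.
HEURISTIC = {"start": 3.0, "detected": 3.0, "masked": 2.0, "done": 0.0}


def a_star(graph, heuristic, start, goal):
    """Slow path: return (cost, [tools]) for the cheapest toolpath, or None."""
    frontier = [(heuristic[start], 0.0, start, [])]
    best_cost = {start: 0.0}
    while frontier:
        _, cost, state, path = heapq.heappop(frontier)
        if state == goal:
            return cost, path
        if cost > best_cost.get(state, float("inf")):
            continue  # stale queue entry, a cheaper route was already found
        for nxt, tool, step in graph[state]:
            new_cost = cost + step
            if new_cost < best_cost.get(nxt, float("inf")):
                best_cost[nxt] = new_cost
                heapq.heappush(
                    frontier,
                    (new_cost + heuristic[nxt], new_cost, nxt, path + [tool]),
                )
    return None


def plan_subtask(kind, subroutines, verify):
    """Try a stored subroutine first; fall back to A* only if it is missing or fails."""
    cached = subroutines.get(kind)
    if cached is not None and verify(cached):
        return cached, "fast"
    result = a_star(TOOL_GRAPH, HEURISTIC, "start", "done")
    if result is None:
        raise RuntimeError("no toolpath found")
    _, path = result
    subroutines[kind] = path  # stand-in for LLM-based subroutine mining
    return path, "slow"


subroutines = {}
first = plan_subtask("recolor", subroutines, lambda path: True)
second = plan_subtask("recolor", subroutines, lambda path: True)
```

In this toy run the first call pays for the search and the second reuses the stored toolpath, which is the cost saving the fast-slow design aims for.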
The reusable symbolic subroutines considerably save exploration cost on the same types of subtasks applied to similar images, yielding a human-like fast-slow toolpath agent "FaSTA$^*$": fast subtask planning followed by rule-based subroutine selection per subtask is attempted by LLMs at first, which is expected to cover most tasks, while slow A$^*$ search is only triggered for novel and challenging subtasks. By comparing with recent image editing approaches, we demonstrate FaSTA$^*$ is significantly more computationally efficient while remaining competitive with the state-of-the-art baseline in terms of success rate.[52] M2SFormer: Multi-Spectral and Multi-Scale Attention with Edge-Aware Difficulty Guidance for Image Forgery Localization
Ju-Hyeon Nam,Dong-Hyun Moon,Sang-Chul Lee
Main category: cs.CV
TL;DR: This paper proposes M2SFormer, an improved image forgery localization method based on Transformer encoder, which achieves better performance than existing methods by unifying multi-frequency and multi-scale attentions and preserving fine details through a global prior map.
Details
Motivation: Deep learning-based methods struggle with computational overhead and limited representation power for subtle or complex tampering. Method: Proposed M2SFormer, a novel Transformer encoder-based framework that unifies multi-frequency and multi-scale attentions in the skip connection and utilizes a global prior map to preserve fine details during upsampling. Result: Extensive experiments on multiple benchmark datasets demonstrate that M2SFormer outperforms existing state-of-the-art models. Conclusion: M2SFormer offers superior generalization in detecting and localizing forgeries across unseen domains. Abstract: Image editing techniques have rapidly advanced, facilitating both innovative use cases and malicious manipulation of digital images. Deep learning-based methods have recently achieved high accuracy in pixel-level forgery localization, yet they frequently struggle with computational overhead and limited representation power, particularly for subtle or complex tampering. In this paper, we propose M2SFormer, a novel Transformer encoder-based framework designed to overcome these challenges. Unlike approaches that process spatial and frequency cues separately, M2SFormer unifies multi-frequency and multi-scale attentions in the skip connection, harnessing global context to better capture diverse forgery artifacts. Additionally, our framework addresses the loss of fine detail during upsampling by utilizing a global prior map, a curvature metric indicating the difficulty of forgery localization, which then guides a difficulty-guided attention module to preserve subtle manipulations more effectively. Extensive experiments on multiple benchmark datasets demonstrate that M2SFormer outperforms existing state-of-the-art models, offering superior generalization in detecting and localizing forgeries across unseen domains.[53] PhysRig: Differentiable Physics-Based Skinning and Rigging Framework for Realistic Articulated Object Modeling
Hao Zhang,Haolan Xu,Chun Feng,Varun Jampani,Narendra Ahuja
Main category: cs.CV
TL;DR: This work proposes PhysRig, a skinning and rigging framework for animation, articulated object reconstruction, motion transfer, and 4D generation.
Details
Motivation: Existing methods rely mainly on Linear Blend Skinning (LBS) for its simplicity and differentiability, but LBS causes volume loss and unnatural deformations and cannot model elastic materials such as soft tissues, fur, and flexible appendages. Method: PhysRig leverages continuum mechanics and discretizes the object as particles embedded in an Eulerian background grid to ensure differentiability with respect to material properties and skeletal motion. It also introduces material prototypes, which significantly reduce the learning space while maintaining high expressiveness. Result: Evaluation on a comprehensive synthetic dataset shows that PhysRig produces more realistic and physically plausible results than traditional LBS-based methods. Conclusion: PhysRig is a differentiable physics-based skinning and rigging framework that overcomes the limitations of existing methods by embedding the rigid skeleton into a volumetric representation. Abstract: Skinning and rigging are fundamental components in animation, articulated object reconstruction, motion transfer, and 4D generation. Existing approaches predominantly rely on Linear Blend Skinning (LBS), due to its simplicity and differentiability. However, LBS introduces artifacts such as volume loss and unnatural deformations, and it fails to model elastic materials like soft tissues, fur, and flexible appendages (e.g., elephant trunks, ears, and fatty tissues). In this work, we propose PhysRig: a differentiable physics-based skinning and rigging framework that overcomes these limitations by embedding the rigid skeleton into a volumetric representation (e.g., a tetrahedral mesh), which is simulated as a deformable soft-body structure driven by the animated skeleton. Our method leverages continuum mechanics and discretizes the object as particles embedded in an Eulerian background grid to ensure differentiability with respect to both material properties and skeletal motion. Additionally, we introduce material prototypes, significantly reducing the learning space while maintaining high expressiveness. To evaluate our framework, we construct a comprehensive synthetic dataset using meshes from Objaverse, The Amazing Animals Zoo, and MixaMo, covering diverse object categories and motion patterns. Our method consistently outperforms traditional LBS-based approaches, generating more realistic and physically plausible results. 
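The volume loss attributed to LBS can be seen in a toy 2D case. The sketch below is illustrative only and is not taken from the paper: it applies standard LBS, where a vertex is transformed by the weighted sum of its bone matrices, to two bones that share a joint at the origin, with translations omitted for brevity. The function names and the test vertex are assumptions.

```python
import math


def rotation(theta):
    """2x2 rotation matrix for angle theta (radians)."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s], [s, c]]


def lbs(point, transforms, weights):
    """Linear Blend Skinning: blend the bone matrices, then transform the point."""
    blended = [
        [sum(w * t[r][c] for t, w in zip(transforms, weights)) for c in range(2)]
        for r in range(2)
    ]
    x, y = point
    return (
        blended[0][0] * x + blended[0][1] * y,
        blended[1][0] * x + blended[1][1] * y,
    )


# Two bones sharing a joint at the origin: one at rest, one twisted by 180 degrees.
bones = [rotation(0.0), rotation(math.pi)]
vertex = (1.0, 0.0)

rigid = lbs(vertex, bones, [1.0, 0.0])      # fully bound to the resting bone
collapsed = lbs(vertex, bones, [0.5, 0.5])  # equal weights near the joint
```

With equal weights the blended matrix is numerically zero, so the vertex lands on the joint instead of rotating around it. Averaging rotation matrices does not generally yield a rotation, which is where this kind of collapse comes from.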
Furthermore, we demonstrate the applicability of our framework in the pose transfer task, highlighting its versatility for articulated object modeling.[54] AIR-VIEW: The Aviation Image Repository for Visibility Estimation of Weather, A Dataset and Benchmark
Chad Mourning,Zhewei Wang,Justin Murray
Main category: cs.CV
TL;DR: This paper introduces a new dataset for aviation weather monitoring and evaluates several machine learning methods on visibility estimation.
Details
Motivation: Traditional weather sensors are expensive, and there are not enough publicly available datasets for supervised learning, so a low-cost alternative is needed. Method: Image data is collected from the FAA weather camera network over a year-long data collection campaign and used for benchmarking. Result: A new dataset suited to aviation atmospheric visibility estimation is released, along with performance benchmarks across different datasets. Conclusion: The paper introduces a new dataset suitable for aviation atmospheric visibility estimation and provides benchmark results for three commonly used approaches and a general-purpose baseline. Abstract: Machine Learning for aviation weather is a growing area of research for providing low-cost alternatives for traditional, expensive weather sensors; however, in the area of atmospheric visibility estimation, publicly available datasets, tagged with visibility estimates, of distances relevant for aviation, of diverse locations, of sufficient size for use in supervised learning, are absent. This paper introduces a new dataset which represents the culmination of a year-long data collection campaign of images from the FAA weather camera network suitable for this purpose. We also present a benchmark when applying three commonly used approaches and a general-purpose baseline when trained and tested on three publicly available datasets, in addition to our own, when compared against a recently ratified ASTM standard.[55] Hierarchical Sub-action Tree for Continuous Sign Language Recognition
Dejie Yang,Zhu Xu,Xinjie Gao,Yang Liu
Main category: cs.CV
TL;DR: This paper proposes HST-CSLR, a method that combines visual and textual data using hierarchical sub-action trees and contrastive alignment for improved continuous sign language recognition.
Details
Motivation: Lack of large datasets and precise annotations has limited progress in continuous sign language recognition; thus, leveraging textual gloss knowledge from language models could enhance performance. Method: A Hierarchical Sub-action Tree (HST) is constructed to represent textual information, aligning visual and textual modalities step-by-step with contrastive alignment enhancement. Result: The HST-CSLR approach demonstrates effectiveness across four datasets: PHOENIX-2014, PHOENIX-2014T, CSL-Daily, and Sign Language Gesture. Conclusion: The proposed HST-CSLR method effectively integrates textual and visual information for continuous sign language recognition, showing promising results on multiple datasets. Abstract: Continuous sign language recognition (CSLR) aims to transcribe untrimmed videos into glosses, which are typically textual words. Recent studies indicate that the lack of large datasets and precise annotations has become a bottleneck for CSLR due to insufficient training data. To address this, some works have developed cross-modal solutions to align visual and textual modalities. However, they typically extract textual features from glosses without fully utilizing their knowledge. In this paper, we propose the Hierarchical Sub-action Tree (HST), termed HST-CSLR, to efficiently combine gloss knowledge with visual representation learning. By incorporating gloss-specific knowledge from large language models, our approach leverages textual information more effectively. Specifically, we construct an HST for textual information representation, aligning visual and textual modalities step-by-step and benefiting from the tree structure to reduce computational complexity. Additionally, we impose a contrastive alignment enhancement to bridge the gap between the two modalities. 
Experiments on four datasets (PHOENIX-2014, PHOENIX-2014T, CSL-Daily, and Sign Language Gesture) demonstrate the effectiveness of our HST-CSLR.[56] OmniEval: A Benchmark for Evaluating Omni-modal Models with Visual, Auditory, and Textual Inputs
Yiman Zhang,Ziheng Luo,Qiangyu Yan,Wei He,Borui Jiang,Xinghao Chen,Kai Han
Main category: cs.CV
TL;DR: OmniEval is a new benchmark for omni-modality models, featuring collaborative perception, diverse video content, and detailed task categorization for comprehensive model evaluation.
Details
Motivation: To address limitations in existing benchmarks by offering a comprehensive platform that evaluates full-modal collaboration, diversity of content, and granular task performance. Method: Designing evaluation tasks that emphasize audio-video coupling and utilizing 810 synchronized videos along with 2617 question-answer pairs across diverse task types. Result: The OmniEval benchmark includes features like full-modal collaboration, video diversity (285 Chinese and 525 English videos), and a wide range of tasks (2617 QA pairs, 3 major task types, and 12 sub-task types) including a new localization task called Grounding. Conclusion: OmniEval is introduced as a benchmark for evaluating omni-modality models, aiming to assess the ability to construct and understand coherence from multi-modal contexts. Abstract: In this paper, we introduce OmniEval, a benchmark for evaluating omni-modality models like MiniCPM-O 2.6, which encompasses visual, auditory, and textual inputs. Compared with existing benchmarks, our OmniEval has several distinctive features: (i) Full-modal collaboration: We design evaluation tasks that highlight the strong coupling between audio and video, requiring models to effectively leverage the collaborative perception of all modalities; (ii) Diversity of videos: OmniEval includes 810 audio-visual synchronized videos, 285 Chinese videos and 525 English videos; (iii) Diversity and granularity of tasks: OmniEval contains 2617 question-answer pairs, comprising 1412 open-ended questions and 1205 multiple-choice questions. These questions are divided into 3 major task types and 12 sub-task types to achieve comprehensive evaluation. Among them, we introduce a more granular video localization task named Grounding. Then we conduct experiments on OmniEval with several omni-modality models. We hope that our OmniEval can provide a platform for evaluating the ability to construct and understand coherence from the context of all modalities. 
Code and data can be found at https://omnieval.github.io/.[57] Evidence-based diagnostic reasoning with multi-agent copilot for human pathology
Chengkuan Chen,Luca L. Weishaupt,Drew F. K. Williamson,Richard J. Chen,Tong Ding,Bowen Chen,Anurag Vaidya,Long Phi Le,Guillaume Jaume,Ming Y. Lu,Faisal Mahmood
Main category: cs.CV
TL;DR: This paper introduces PathChat+, a specialized multimodal language model for pathology, and SlideSeek, an AI system capable of autonomous diagnostic reasoning on whole-slide images, both achieving state-of-the-art performance.
Details
Motivation: Traditional computational pathology models focus only on image analysis without integrating natural language or contextual understanding, while existing MLLMs face challenges such as limited training data, poor multi-image support, and lack of autonomous diagnostic reasoning capabilities. Method: Development of PathChat+, a multimodal large language model trained on over 1 million pathology-specific instruction samples and nearly 5.5 million question-answer turns, and SlideSeek, a multi-agent AI system that uses iterative, hierarchical diagnostic reasoning to evaluate gigapixel whole-slide images. Result: PathChat+ outperformed prior models including the original PathChat copilot, state-of-the-art general-purpose models, and other pathology-specific models. SlideSeek achieved high accuracy on DDxBench and can generate visually grounded, interpretable summary reports. Conclusion: PathChat+ and SlideSeek represent significant advancements in computational pathology by overcoming limitations of previous models, enabling more accurate and autonomous diagnostic reasoning on whole-slide images while generating interpretable reports. Abstract: Pathology is experiencing rapid digital transformation driven by whole-slide imaging and artificial intelligence (AI). While deep learning-based computational pathology has achieved notable success, traditional models primarily focus on image analysis without integrating natural language instruction or rich, text-based context. Current multimodal large language models (MLLMs) in computational pathology face limitations, including insufficient training data, inadequate support and evaluation for multi-image understanding, and a lack of autonomous, diagnostic reasoning capabilities. To address these limitations, we introduce PathChat+, a new MLLM specifically designed for human pathology, trained on over 1 million diverse, pathology-specific instruction samples and nearly 5.5 million question answer turns. 
Extensive evaluations across diverse pathology benchmarks demonstrated that PathChat+ substantially outperforms the prior PathChat copilot, as well as both state-of-the-art (SOTA) general-purpose and other pathology-specific models. Furthermore, we present SlideSeek, a reasoning-enabled multi-agent AI system leveraging PathChat+ to autonomously evaluate gigapixel whole-slide images (WSIs) through iterative, hierarchical diagnostic reasoning, reaching high accuracy on DDxBench, a challenging open-ended differential diagnosis benchmark, while also capable of generating visually grounded, humanly-interpretable summary reports.[58] DFVEdit: Conditional Delta Flow Vector for Zero-shot Video Editing
Lingling Cai,Kang Zhao,Hangjie Yuan,Xiang Wang,Yingya Zhang,Kejie Huang
Main category: cs.CV
TL;DR: DFVEdit is an efficient zero-shot video editing method for Video DiTs that unifies editing and sampling under a continuous flow perspective, eliminating the need for attention modification or fine-tuning.
Details
Motivation: Existing video editing methods incur substantial computational overhead when applied to Video DiTs due to resource-intensive attention modification or fine-tuning. DFVEdit aims to address this inefficiency by proposing a zero-shot editing approach tailored for Video DiTs. Method: DFVEdit operates on clean latents via flow transformation, introducing Conditional Delta Flow Vector (CDFV), Implicit Cross Attention (ICA) guidance, and Embedding Reinforcement (ER) to enhance editing quality without attention modification or fine-tuning. Result: DFVEdit achieves at least 20x inference speed-up and 85% memory reduction compared to attention-engineering-based editing methods. It also demonstrates superior performance across structural fidelity, spatial-temporal consistency, and editing quality on popular Video DiTs like CogVideoX and Wan2.1. Conclusion: DFVEdit provides a highly efficient and zero-shot video editing solution for Video DiTs, achieving state-of-the-art performance in structural fidelity, spatial-temporal consistency, and editing quality while significantly reducing computational overhead. Abstract: The advent of Video Diffusion Transformers (Video DiTs) marks a milestone in video generation. However, directly applying existing video editing methods to Video DiTs often incurs substantial computational overhead, due to resource-intensive attention modification or finetuning. To alleviate this problem, we present DFVEdit, an efficient zero-shot video editing method tailored for Video DiTs. DFVEdit eliminates the need for both attention modification and fine-tuning by directly operating on clean latents via flow transformation. To be more specific, we observe that editing and sampling can be unified under the continuous flow perspective. 
Building upon this foundation, we propose the Conditional Delta Flow Vector (CDFV) -- a theoretically unbiased estimation of DFV -- and integrate Implicit Cross Attention (ICA) guidance as well as Embedding Reinforcement (ER) to further enhance editing quality. DFVEdit excels in practical efficiency, offering at least 20x inference speed-up and 85% memory reduction on Video DiTs compared to attention-engineering-based editing methods. Extensive quantitative and qualitative experiments demonstrate that DFVEdit can be seamlessly applied to popular Video DiTs (e.g., CogVideoX and Wan2.1), attaining state-of-the-art performance on structural fidelity, spatial-temporal consistency, and editing quality.
[59] From Cradle to Cane: A Two-Pass Framework for High-Fidelity Lifespan Face Aging
Tao Liu,Dafeng Zhang,Gengchen Li,Shizhuo Liu,Yongqi Song,Senmao Li,Shiqi Yang,Boqian Li,Kai Wang,Yaxing Wang
Main category: cs.CV
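TL;DR: This paper proposes Cradle2Cane, a face aging framework whose two-pass approach addresses the trade-off between age accuracy and identity preservation; experiments show it outperforms existing methods.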
Details
Motivation: Existing methods struggle to achieve realistic, seamless face aging across the entire lifespan, especially with large age gaps or extreme head poses, and have difficulty balancing age accuracy against identity preservation. Method: A two-pass face aging framework, Cradle2Cane, built on few-step text-to-image diffusion models. The first pass introduces an adaptive noise injection mechanism to address age accuracy; the second pass strengthens identity preservation through two identity-aware embeddings. Result: Extensive experiments on the CelebA-HQ test dataset, evaluated with the Face++ and Qwen-VL protocols, show that Cradle2Cane outperforms existing methods in age accuracy and identity consistency. Conclusion: With its two-pass design, Cradle2Cane achieves more accurate age transformation while preserving identity consistency, outperforming existing face aging methods. Abstract: Face aging has become a crucial task in computer vision, with applications ranging from entertainment to healthcare. However, existing methods struggle with achieving a realistic and seamless transformation across the entire lifespan, especially when handling large age gaps or extreme head poses. The core challenge lies in balancing age accuracy and identity preservation--what we refer to as the Age-ID trade-off. Most prior methods either prioritize age transformation at the expense of identity consistency or vice versa. In this work, we address this issue by proposing a two-pass face aging framework, named Cradle2Cane, based on few-step text-to-image (T2I) diffusion models. The first pass focuses on solving age accuracy by introducing an adaptive noise injection (AdaNI) mechanism. This mechanism is guided by including prompt descriptions of age and gender for the given person as the textual condition. Also, by adjusting the noise level, we can control the strength of aging while allowing more flexibility in transforming the face. However, identity preservation is weakly ensured here to facilitate stronger age transformations. In the second pass, we enhance identity preservation while maintaining age-specific features by conditioning the model on two identity-aware embeddings (IDEmb): SVR-ArcFace and Rotate-CLIP. This pass allows for denoising the transformed image from the first pass, ensuring stronger identity preservation without compromising the aging accuracy. Both passes are jointly trained in an end-to-end way.
Extensive experiments on the CelebA-HQ test dataset, evaluated through Face++ and Qwen-VL protocols, show that our Cradle2Cane outperforms existing face aging methods in age accuracy and identity consistency.
[60] 3D Scene-Camera Representation with Joint Camera Photometric Optimization
Weichen Dai,Kangcheng Ma,Jiaxin Wang,Kecen Pan,Yuhang Ming,Hua Zhang,Wanzeng Kong
Main category: cs.CV
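TL;DR: This paper proposes a 3D scene-camera representation with joint photometric optimization that yields high-quality 3D scene representations even in the presence of photometric distortions.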
Details
Motivation: Inherent photometric distortions in camera imaging can significantly degrade image quality and, in turn, the accuracy of 3D scene representations, so a method that accounts for and corrects these distortions is needed. Method: Internal and external photometric models are introduced to form a full photometric model and a corresponding camera representation, and a depth regularization is applied while optimizing the photometric parameters. Result: Experiments show that the proposed method achieves high-quality 3D scene representations even under imaging degradation such as vignetting and dirt. Conclusion: The paper proposes a new 3D scene-camera representation that, through joint camera photometric optimization, effectively separates scene-unrelated information and thus achieves high-quality 3D scene representations even under imaging degradation. Abstract: Representing scenes from multi-view images is a crucial task in computer vision with extensive applications. However, inherent photometric distortions in the camera imaging can significantly degrade image quality. Without accounting for these distortions, the 3D scene representation may inadvertently incorporate erroneous information unrelated to the scene, diminishing the quality of the representation. In this paper, we propose a novel 3D scene-camera representation with joint camera photometric optimization. By introducing internal and external photometric models, we propose a full photometric model and corresponding camera representation. Based on simultaneously optimizing the parameters of the camera representation, the proposed method effectively separates scene-unrelated information from the 3D scene representation. Additionally, during the optimization of the photometric parameters, we introduce a depth regularization to prevent the 3D scene representation from fitting scene-unrelated information. By incorporating the camera model as part of the mapping process, the proposed method constructs a complete map that includes both the scene radiance field and the camera photometric model. Experimental results demonstrate that the proposed method can achieve high-quality 3D scene representations, even under conditions of imaging degradation, such as vignetting and dirt.
[61] Rethink Sparse Signals for Pose-guided Text-to-image Generation
Wenjie Xuan,Jing Zhang,Juhua Liu,Bo Du,Dacheng Tao
Main category: cs.CV
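TL;DR: This work proposes SP-Ctrl, a new method that uses sparse signals for effective pose-guided image generation, addressing the editing difficulty and prompt inconsistency of dense-signal methods.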
Details
Motivation: Dense signals provide detailed spatial guidance but are hard to edit and can be inconsistent with text prompts, which motivates revisiting sparse signals for pose guidance. Method: A new Spatial-Pose ControlNet (SP-Ctrl) is proposed that extends OpenPose to a learnable spatial representation and introduces keypoint concept learning to improve pose alignment. Result: Experiments show that SP-Ctrl outperforms recent spatially controllable T2I generation methods under sparse-pose guidance and matches the performance of dense signal-based methods. Conclusion: SP-Ctrl achieves strong controllability for pose-guided image generation with sparse signals and shows potential for diverse and cross-species generation. Abstract: Recent works favored dense signals (e.g., depth, DensePose), as an alternative to sparse signals (e.g., OpenPose), to provide detailed spatial guidance for pose-guided text-to-image generation. However, dense representations raised new challenges, including editing difficulties and potential inconsistencies with textual prompts. This fact motivates us to revisit sparse signals for pose guidance, owing to their simplicity and shape-agnostic nature, which remains underexplored. This paper proposes a novel Spatial-Pose ControlNet (SP-Ctrl), equipping sparse signals with robust controllability for pose-guided image generation. Specifically, we extend OpenPose to a learnable spatial representation, making keypoint embeddings discriminative and expressive. Additionally, we introduce keypoint concept learning, which encourages keypoint tokens to attend to the spatial positions of each keypoint, thus improving pose alignment. Experiments on animal- and human-centric image generation tasks demonstrate that our method outperforms recent spatially controllable T2I generation approaches under sparse-pose guidance and even matches the performance of dense signal-based methods. Moreover, SP-Ctrl shows promising capabilities in diverse and cross-species generation through sparse signals. Codes will be available at https://github.com/DREAMXFAR/SP-Ctrl.
[62] EVA: Mixture-of-Experts Semantic Variant Alignment for Compositional Zero-Shot Learning
Xiao Zhang,Yongqiang Ma,Haodong Jing,Nanning Zheng
Main category: cs.CV
TL;DR: This paper proposes EVA, a new Compositional Zero-Shot Learning (CZSL) framework that introduces domain-expert adaptation and semantic variant alignment and achieves leading performance on multiple benchmarks.
Details
Motivation: Existing CZSL methods extract primitive features through a simple composition-prototype mapping, which performs poorly for a set of individuals that can be divided into distinct semantic subsets. In addition, all-to-one cross-modal primitive matching ignores compositional divergence within identical states or objects, limiting fine-grained image-composition alignment. Method: EVA, a Mixture-of-Experts Semantic Variant Alignment framework comprising domain-expert adaptation and semantic variant alignment. Result: EVA significantly outperforms other SOTA CZSL methods on three popular benchmarks in both closed- and open-world settings. Conclusion: EVA performs strongly on CZSL and significantly outperforms other state-of-the-art methods, demonstrating its effectiveness. Abstract: Compositional Zero-Shot Learning (CZSL) investigates compositional generalization capacity to recognize unknown state-object pairs based on learned primitive concepts. Existing CZSL methods typically derive primitive features through a simple composition-prototype mapping, which is suboptimal for a set of individuals that can be divided into distinct semantic subsets. Moreover, the all-to-one cross-modal primitives matching neglects compositional divergence within identical states or objects, limiting fine-grained image-composition alignment. In this study, we propose EVA, a Mixture-of-Experts Semantic Variant Alignment framework for CZSL. Specifically, we introduce domain-expert adaption, leveraging multiple experts to achieve token-aware learning and model high-quality primitive representations. To enable accurate compositional generalization, we further present semantic variant alignment to select semantically relevant representation for image-primitives matching. Our method significantly outperforms other state-of-the-art CZSL methods on three popular benchmarks in both closed- and open-world settings, demonstrating the efficacy of the proposed insight.
[63] Segment Anything in Pathology Images with Natural Language
Zhixuan Chen,Junlin Hou,Liqi Lin,Yihui Wang,Yequan Bie,Xi Wang,Yanning Zhou,Ronald Cheong Kin Chan,Hao Chen
Main category: cs.CV
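TL;DR: PathSegmentor is a new method for pathology image segmentation that performs semantic segmentation from natural language prompts, addressing the limited annotated data and restricted category definitions of conventional methods.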
Details
Motivation: Current pathology image segmentation methods face major challenges in clinical applications due to limited annotated data and restricted category definitions. Method: PathSegmentor is proposed as the first text-prompted segmentation foundation model designed specifically for pathology images, together with PathSeg, the largest and most comprehensive pathology segmentation dataset. Result: In extensive experiments, PathSegmentor outperforms existing models in accuracy and applicability, exceeding existing spatial- and text-prompted models by 0.145 and 0.429 in overall Dice score, respectively. Conclusion: PathSegmentor advances explainable AI in precision oncology, improving the interpretability of diagnostic models and support for clinical decision-making. Abstract: Pathology image segmentation is crucial in computational pathology for analyzing histological features relevant to cancer diagnosis and prognosis. However, current methods face major challenges in clinical applications due to limited annotated data and restricted category definitions. To address these limitations, we propose PathSegmentor, the first text-prompted segmentation foundation model designed specifically for pathology images. We also introduce PathSeg, the largest and most comprehensive dataset for pathology segmentation, built from 17 public sources and containing 275k image-mask-label triples across 160 diverse categories. With PathSegmentor, users can perform semantic segmentation using natural language prompts, eliminating the need for laborious spatial inputs such as points or boxes. Extensive experiments demonstrate that PathSegmentor outperforms specialized models with higher accuracy and broader applicability, while maintaining a compact architecture. It significantly surpasses existing spatial- and text-prompted models by 0.145 and 0.429 in overall Dice scores, respectively, showing strong robustness in segmenting complex structures and generalizing to external datasets. Moreover, PathSegmentor's outputs enhance the interpretability of diagnostic models through feature importance estimation and imaging biomarker discovery, offering pathologists evidence-based support for clinical decision-making. This work advances the development of explainable AI in precision oncology.
[64] TSDASeg: A Two-Stage Model with Direct Alignment for Interactive Point Cloud Segmentation
Chade Li,Pengju Zhang,Yihong Wu
Main category: cs.CV
TL;DR: This paper proposes TSDASeg for interactive point cloud segmentation, using direct cross-modal alignment and memory modules to improve 3D-text alignment and achieve state-of-the-art results.
Details
Motivation: Existing methods underperform in point-level tasks like segmentation due to missing direct 3D-text alignment, limiting the ability to link local 3D features with textual context. This necessitates an improved approach for better alignment and performance in real-world applications. Method: TSDASeg, a Two-Stage model coupled with a Direct cross-modal Alignment module and memory module for interactive point cloud Segmentation. The direct cross-modal alignment module establishes explicit alignment between 3D point clouds and textual/2D image data. Dedicated memory banks store text features, visual features, and their cross-modal correspondence mappings, which are dynamically leveraged through attention mechanisms to update scene-specific features. Result: Experiments on multiple datasets show that the proposed method achieves state-of-the-art performance in interactive point cloud processing tasks. Conclusion: The proposed TSDASeg model achieves state-of-the-art performance in interactive point cloud segmentation by effectively aligning 3D point clouds with textual and visual data through a direct cross-modal alignment module and memory module. Abstract: The rapid advancement of 3D vision-language models (VLMs) has spurred significant interest in interactive point cloud processing tasks, particularly for real-world applications. However, existing methods often underperform in point-level tasks, such as segmentation, due to missing direct 3D-text alignment, limiting their ability to link local 3D features with textual context. To solve this problem, we propose TSDASeg, a Two-Stage model coupled with a Direct cross-modal Alignment module and memory module for interactive point cloud Segmentation. We introduce the direct cross-modal alignment module to establish explicit alignment between 3D point clouds and textual/2D image data. 
Within the memory module, we employ multiple dedicated memory banks to separately store text features, visual features, and their cross-modal correspondence mappings. These memory banks are dynamically leveraged through self-attention and cross-attention mechanisms to update scene-specific features based on prior stored data, effectively addressing inconsistencies in interactive segmentation results across diverse scenarios. Experiments conducted on multiple 3D instruction, reference, and semantic segmentation datasets demonstrate that the proposed method achieves state-of-the-art performance.
[65] Step-by-Step Video-to-Audio Synthesis via Negative Audio Guidance
Akio Hayakawa,Masato Ishii,Takashi Shibuya,Yuki Mitsufuji
Main category: cs.CV
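TL;DR: This paper proposes a new method that generates the audio for a video step by step, producing high-quality composite audio without requiring specialized data.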
Details
Motivation: Inspired by traditional Foley workflows, the work aims to comprehensively capture all sound events induced by a video while avoiding the need for specialized paired datasets. Method: Drawing on the idea of concept negation, the method formulates each generation step as a guided video-to-audio synthesis task and introduces a training framework that leverages pre-trained video-to-audio models. Result: The method generates multiple semantically distinct audio tracks for a single input video, significantly improving the quality of the synthesized audio. Conclusion: Experiments show that the proposed method generates multiple semantically distinct audio tracks for a single input video, yielding higher-quality composite audio than existing baselines. Abstract: We propose a novel step-by-step video-to-audio generation method that sequentially produces individual audio tracks, each corresponding to a specific sound event in the video. Our approach mirrors traditional Foley workflows, aiming to capture all sound events induced by a given video comprehensively. Each generation step is formulated as a guided video-to-audio synthesis task, conditioned on a target text prompt and previously generated audio tracks. This design is inspired by the idea of concept negation from prior compositional generation frameworks. To enable this guided generation, we introduce a training framework that leverages pre-trained video-to-audio models and eliminates the need for specialized paired datasets, allowing training on more accessible data. Experimental results demonstrate that our method generates multiple semantically distinct audio tracks for a single input video, leading to higher-quality composite audio synthesis than existing baselines.
[66] DBMovi-GS: Dynamic View Synthesis from Blurry Monocular Video via Sparse-Controlled Gaussian Splatting
Yeon-Ji Song,Jaein Kim,Byung-Ju Kim,Byoung-Tak Zhang
Main category: cs.CV
TL;DR: This paper proposes DBMovi-GS, a method for dynamic view synthesis from blurry monocular videos, which generates dense 3D Gaussians to restore sharpness and reconstruct detailed 3D geometry, achieving state-of-the-art results.
Details
Motivation: Synthesizing dynamic scenes from blurry monocular videos remains an unresolved challenge due to reliance on high-resolution images or strong assumptions about static geometry and rigid scene priors. This limits robustness in real-world environments with dynamic object and camera motion. Method: Motion-aware Dynamic View Synthesis from Blurry Monocular Video via Sparse-Controlled Gaussian Splatting (DBMovi-GS) is proposed, which generates dense 3D Gaussians to restore sharpness and reconstruct detailed 3D geometry affected by dynamic motion variations. Result: The model achieves robust performance in novel view synthesis under dynamic blurry scenes and sets a new benchmark in realistic novel view synthesis for blurry monocular video inputs. Conclusion: The proposed DBMovi-GS method effectively addresses the challenge of dynamic view synthesis from blurry monocular videos, achieving robust performance and setting a new benchmark in realistic novel view synthesis for such inputs. Abstract: Novel view synthesis is a task of generating scenes from unseen perspectives; however, synthesizing dynamic scenes from blurry monocular videos remains an unresolved challenge that has yet to be effectively addressed. Existing novel view synthesis methods are often constrained by their reliance on high-resolution images or strong assumptions about static geometry and rigid scene priors. Consequently, their approaches lack robustness in real-world environments with dynamic object and camera motion, leading to instability and degraded visual fidelity. To address this, we propose Motion-aware Dynamic View Synthesis from Blurry Monocular Video via Sparse-Controlled Gaussian Splatting (DBMovi-GS), a method designed for dynamic view synthesis from blurry monocular videos. Our model generates dense 3D Gaussians, restoring sharpness from blurry videos and reconstructing detailed 3D geometry of the scene affected by dynamic motion variations. 
Our model achieves robust performance in novel view synthesis under dynamic blurry scenes and sets a new benchmark in realistic novel view synthesis for blurry monocular video inputs.
[67] Style-Aligned Image Composition for Robust Detection of Abnormal Cells in Cytopathology
Qiuyi Qi,Xin Li,Ming Kong,Zikang Xu,Bingdi Chen,Qiang Zhu,S Kevin Zhou
Main category: cs.CV
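TL;DR: This paper proposes a new style-aligned image composition (SAIC) method to improve the effectiveness and robustness of abnormal cell detection models.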
Details
Motivation: The lack of high-quality annotations, long-tailed data distributions, and inconsistent staining styles hinder neural networks from robustly detecting abnormal cells in cytopathology. Method: A style-aligned image composition (SAIC) method is proposed that selects candidates from an abnormal cell bank, applies high-frequency feature reconstruction for style-aligned, high-fidelity composition, and uses a large vision-language model to filter high-quality synthesized images. Result: Experiments show that SAIC-synthesized images significantly improve detection performance for tail categories and styles, and a comprehensive quality evaluation confirms the method's generalizability and practicality in clinical applications. Conclusion: By composing high-quality pathological images, SAIC effectively improves the performance and robustness of detection models on tail categories and across staining styles. Abstract: Challenges such as the lack of high-quality annotations, long-tailed data distributions, and inconsistent staining styles pose significant obstacles to training neural networks to detect abnormal cells in cytopathology robustly. This paper proposes a style-aligned image composition (SAIC) method that composes high-fidelity and style-preserved pathological images to enhance the effectiveness and robustness of detection models. Without additional training, SAIC first selects an appropriate candidate from the abnormal cell bank based on attribute guidance. Then, it employs a high-frequency feature reconstruction to achieve a style-aligned and high-fidelity composition of abnormal cells and pathological backgrounds. Finally, it introduces a large vision-language model to filter high-quality synthesis images. Experimental results demonstrate that incorporating SAIC-synthesized images effectively enhances the performance and robustness of abnormal cell detection for tail categories and styles, thereby improving overall detection performance. The comprehensive quality evaluation further confirms the generalizability and practicality of SAIC in clinical application scenarios. Our code will be released at https://github.com/Joey-Qi/SAIC.
[68] Inverse Scene Text Removal
Takumi Yoshimatsu,Shumpei Takezaki,Seiichi Uchida
Main category: cs.CV
TL;DR: This paper introduces Inverse Scene Text Removal (ISTR) to detect and analyze images processed by STR, achieving high accuracy in detection and localization while exploring text recovery.
Details
Motivation: Motivated by the increased misuse risks of Scene Text Removal (STR) due to advancements in neural networks and synthetic data, this paper aims to explore methods for detecting and understanding STR-processed images. Method: The paper investigates Inverse Scene Text Removal (ISTR) through experiments involving binary classification and localization techniques, as well as training a text recognizer to recover removed text content. Result: The experiments demonstrate that ISTR achieves high accuracy in both detecting STR-processed images and localizing removed text regions, offering insights into the recovery of removed text content. Conclusion: The paper concludes that ISTR can effectively detect whether an image has undergone STR and localize removed text regions, with potential for improving STR and detecting its misuse. Abstract: Scene text removal (STR) aims to erase textual elements from images. It was originally intended for removing privacy-sensitive or undesired texts from natural scene images, but is now also applied to typographic images. STR typically detects text regions and then inpaints them. Although STR has advanced through neural networks and synthetic data, misuse risks have increased. This paper investigates Inverse STR (ISTR), which analyzes STR-processed images and focuses on binary classification (detecting whether an image has undergone STR) and localizing removed text regions. We demonstrate in experiments that these tasks are achievable with high accuracies, enabling detection of potential misuse and improving STR. We also attempt to recover the removed text content by training a text recognizer to understand its difficulty.
[69] VisionGuard: Synergistic Framework for Helmet Violation Detection
Lam-Huy Nguyen,Thinh-Phuc Nguyen,Thanh-Hai Nguyen,Gia-Huy Dinh,Minh-Triet Tran,Trung-Nghia Le
Main category: cs.CV
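TL;DR: This paper proposes VisionGuard, a new multi-stage framework that improves accuracy and consistency in helmet violation detection, addresses data imbalance and environmental interference, and outperforms existing methods in experiments.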
Details
Motivation: Enforcing helmet regulations for motorcyclists is essential for improving road safety and the effectiveness of traffic management systems, but environmental variability, camera angles, and data inconsistencies pose major challenges for automatic detection of helmet violations. Method: VisionGuard, a synergistic multi-stage framework comprising an Adaptive Labeling module and a Contextual Expander module. The Adaptive Labeling module uses a tracking algorithm to assign persistent labels and correct misclassifications; the Contextual Expander module generates virtual bounding boxes to improve recall for underrepresented classes (such as riders without helmets). Result: Experiments show that VisionGuard improves overall mAP by 3.1% over baseline detectors. Conclusion: VisionGuard is a multi-stage framework that effectively addresses data imbalance and inconsistent classification in helmet violation detection and has potential for real-world deployment in traffic surveillance systems. Abstract: Enforcing helmet regulations among motorcyclists is essential for enhancing road safety and ensuring the effectiveness of traffic management systems. However, automatic detection of helmet violations faces significant challenges due to environmental variability, camera angles, and inconsistencies in the data. These factors hinder reliable detection of motorcycles and riders and disrupt consistent object classification. To address these challenges, we propose VisionGuard, a synergistic multi-stage framework designed to overcome the limitations of frame-wise detectors, especially in scenarios with class imbalance and inconsistent annotations. VisionGuard integrates two key components: Adaptive Labeling and Contextual Expander modules. The Adaptive Labeling module is a tracking-based refinement technique that enhances classification consistency by leveraging a tracking algorithm to assign persistent labels across frames and correct misclassifications. The Contextual Expander module improves recall for underrepresented classes by generating virtual bounding boxes with appropriate confidence scores, effectively addressing the impact of data imbalance. Experimental results show that VisionGuard improves overall mAP by 3.1% compared to baseline detectors, demonstrating its effectiveness and potential for real-world deployment in traffic surveillance systems, ultimately promoting safety and regulatory compliance.
[70] Detection of Breast Cancer Lumpectomy Margin with SAM-incorporated Forward-Forward Contrastive Learning
Tyler Ward,Xiaoqin Wang,Braxton McFarland,Md Atik Ahamed,Sahar Nozad,Talal Arshad,Hafsa Nebbache,Jin Chen,Abdullah Imran
Main category: cs.CV
TL;DR: A new deep learning framework, FFCL-SAM, is proposed to improve the accuracy and speed of margin assessment in breast cancer surgery.
Details
Motivation: 2D specimen radiography (SR) has limited accuracy in assessing intraoperative specimen margin status, so many patients require additional surgery. Method: The approach combines the Segment Anything Model (SAM) with Forward-Forward Contrastive Learning (FFCL), using a ResNet-18 to classify SR images and generating coarse binary masks to prompt SAM for tumor margin segmentation. Result: The approach achieves an AUC of 0.8455 for margin classification and a 27.4% improvement in Dice similarity over baseline models, while reducing inference time to 47 milliseconds per image. Conclusion: FFCL-SAM is a promising approach for improving the speed and accuracy of margin assessment in breast cancer surgery. Abstract: Complete removal of cancer tumors with a negative specimen margin during lumpectomy is essential in reducing breast cancer recurrence. However, 2D specimen radiography (SR), the current method used to assess intraoperative specimen margin status, has limited accuracy, resulting in nearly a quarter of patients requiring additional surgery. To address this, we propose a novel deep learning framework combining the Segment Anything Model (SAM) with Forward-Forward Contrastive Learning (FFCL), a pre-training strategy leveraging both local and global contrastive learning for patch-level classification of SR images. After annotating SR images with regions of known malignancy, non-malignant tissue, and pathology-confirmed margins, we pre-train a ResNet-18 backbone with FFCL to classify margin status, then reconstruct coarse binary masks to prompt SAM for refined tumor margin segmentation. Our approach achieved an AUC of 0.8455 for margin classification and segmented margins with a 27.4% improvement in Dice similarity over baseline models, while reducing inference time to 47 milliseconds per image. These results demonstrate that FFCL-SAM significantly enhances both the speed and accuracy of intraoperative margin assessment, with strong potential to reduce re-excision rates and improve surgical outcomes in breast cancer treatment. Our code is available at https://github.com/tbwa233/FFCL-SAM/.
[71] The Aging Multiverse: Generating Condition-Aware Facial Aging Tree via Training-Free Diffusion
Bang Gong,Luchao Qi,Jiaye Wu,Zhicheng Fu,Chunbo Song,David W. Jacobs,John Nicholson,Roni Sengupta
Main category: cs.CV
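TL;DR: This paper proposes Aging Multiverse, which uses a training-free diffusion method to generate diverse facial aging trajectories conditioned on external factors, for more realistic aging simulation.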
Details
Motivation: Traditional facial aging methods typically model aging as a single deterministic path, whereas real-world aging is influenced by many external factors. The authors therefore seek a method that generates diverse aging outcomes reflecting the range of plausible futures. Method: The paper uses a training-free diffusion-based method with attention mixing to modulate editing strength and a Simulated Aging Regularization strategy to stabilize edits. Result: Experiments and user studies show strong performance in identity preservation, aging realism, and conditional alignment, surpassing existing models. The method also turns aging into a controllable, multi-dimensional process that satisfies multiple editing criteria. Conclusion: The paper proposes Aging Multiverse, a new framework that generates multiple plausible facial aging trajectories from a single image, conditioned on external factors such as environment, health, and lifestyle. The approach modulates editing strength and makes aging a multi-dimensional, controllable, and interpretable process, opening new creative and practical avenues in digital storytelling, health education, and personalized visualization. Abstract: We introduce the Aging Multiverse, a framework for generating multiple plausible facial aging trajectories from a single image, each conditioned on external factors such as environment, health, and lifestyle. Unlike prior methods that model aging as a single deterministic path, our approach creates an aging tree that visualizes diverse futures. To enable this, we propose a training-free diffusion-based method that balances identity preservation, age accuracy, and condition control. Our key contributions include attention mixing to modulate editing strength and a Simulated Aging Regularization strategy to stabilize edits. Extensive experiments and user studies demonstrate state-of-the-art performance across identity preservation, aging realism, and conditional alignment, outperforming existing editing and age-progression models, which often fail to account for one or more of the editing criteria. By transforming aging into a multi-dimensional, controllable, and interpretable process, our approach opens up new creative and practical avenues in digital storytelling, health education, and personalized visualization.
[72] User-in-the-Loop View Sampling with Error Peaking Visualization
Ayaka Yasunaga,Hideo Saito,Shohei Mori
Main category: cs.CV
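TL;DR: This paper proposes a new mobile view synthesis approach that uses light field reconstruction and error visualization to make capture less demanding for users and scene reconstruction more flexible.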
Details
Motivation: Existing capture approaches for novel view synthesis rely on 3D annotations that are demanding for users and restrict scene exploration to small areas, so a more flexible and less invasive approach is needed. Method: Instead of conventional 3D annotations, the approach visualizes the errors that inserting new views would remove. Result: Experiments show that the approach reduces users' cognitive burden, improves satisfaction with final results, and can be applied to larger scene reconstruction tasks such as 3D Gaussian splatting. Conclusion: The paper proposes a mobile view synthesis system based on locally reconstructed light fields and error visualization, which reduces reliance on 3D annotations and increases the freedom of scene exploration. Abstract: Augmented reality (AR) provides ways to visualize missing view samples for novel view synthesis. Existing approaches present 3D annotations for new view samples and task users with taking images by aligning the AR display. This data collection task is known to be mentally demanding and limits capture areas to pre-defined small areas due to the ideal but restrictive underlying sampling theory. To free users from 3D annotations and limited scene exploration, we propose using locally reconstructed light fields and visualizing errors to be removed by inserting new views. Our results show that the error-peaking visualization is less invasive, reduces disappointment in final results, and is satisfactory with fewer view samples in our mobile view synthesis system. We also show that our approach can contribute to recent radiance field reconstruction for larger scenes, such as 3D Gaussian splatting.
[73] Bridging Video Quality Scoring and Justification via Large Multimodal Models
Qizhi Xie,Kun Yuan,Yunpeng Qu,Jiachao Gong,Mingda Wu,Ming Sun,Chao Zhou,Jihong Zhu
Main category: cs.CV
TL;DR: This paper proposes an automated pipeline for generating instruction data to improve video quality assessment using large multimodal models, resulting in better scoring and justification capabilities.
Details
Motivation: Traditional VQA methods rely on numerical scores that fail to capture complex quality dimensions. Current instruction data generation heavily depends on human annotations and proprietary systems, limiting scalability. This work aims to automate and scale the process while improving video LMMs' quality assessment capabilities. Method: The study introduces a Score-based Instruction Generation (SIG) pipeline that automatically generates instruction-response pairs using hierarchical Chain-of-Thought reasoning. It maps video quality scores to textual descriptions and creates a large-scale dataset (S2I) with over 320K pairs. A progressive tuning strategy and a new benchmark (S2I-Bench) are also developed. Result: The SIG pipeline successfully generated the S2I dataset containing over 320K instruction-response pairs. Experimental results show consistent improvement in both quality scoring and justification abilities across multiple video LMMs on S2I-Bench and existing benchmarks. Conclusion: The proposed SIG pipeline and S2I dataset effectively enhance video LMMs' abilities in quality scoring and justification, as demonstrated by improvements on the S2I-Bench and existing benchmarks. Abstract: Classical video quality assessment (VQA) methods generate a numerical score to judge a video's perceived visual fidelity and clarity. Yet, a score fails to describe the video's complex quality dimensions, restricting its applicability. Benefiting from the linguistic output, adapting video large multimodal models (LMMs) to VQA via instruction tuning has the potential to address this issue. The core of the approach lies in the video quality-centric instruction data. Previous explorations mainly focus on the image domain, and their data generation processes heavily rely on human quality annotations and proprietary systems, limiting data scalability and effectiveness. To address these challenges, we propose the Score-based Instruction Generation (SIG) pipeline. 
Specifically, SIG first scores multiple quality dimensions of an unlabeled video and maps scores to text-defined levels. It then explicitly incorporates a hierarchical Chain-of-Thought (CoT) to model the correlation between specific dimensions and overall quality, mimicking the human visual system's reasoning process. The automated pipeline eliminates the reliance on expert-written quality descriptions and proprietary systems, ensuring data scalability and generation efficiency. To this end, the resulting Score2Instruct (S2I) dataset contains over 320K diverse instruction-response pairs, laying the basis for instruction tuning. Moreover, to advance video LMMs' quality scoring and justification abilities simultaneously, we devise a progressive tuning strategy to fully unleash the power of S2I. Built upon SIG, we further curate a benchmark termed S2I-Bench with 400 open-ended questions to better evaluate the quality justification capacity of video LMMs. Experimental results on the S2I-Bench and existing benchmarks indicate that our method consistently improves quality scoring and justification capabilities across multiple video LMMs.[74] HumanOmniV2: From Understanding to Omni-Modal Reasoning with Context
Qize Yang,Shimin Yao,Weixuan Chen,Shenghao Fu,Detao Bai,Jiaxing Zhao,Boyuan Sun,Bowen Yin,Xihan Wei,Jingren Zhou
Main category: cs.CV
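TL;DR: This study addresses key problems in the reasoning of multimodal large language models, proposing new reward mechanisms and a benchmark that significantly improve multimodal reasoning.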
Details
Motivation: Existing multimodal reasoning work suffers from insufficient global context understanding and from shortcut problems; deeper reasoning is needed to accurately understand and interpret human intentions. Method: Introduces a context reward, format and accuracy rewards, and a logical reward, using a large language model to drive in-depth reasoning over multimodal information. Result: The proposed method outperforms other open-source multimodal models on multiple multimodal benchmarks. Conclusion: The paper proposes a reinforcement learning approach that improves the reasoning ability of multimodal large language models and designs the IntentBench benchmark to evaluate understanding of complex human intentions and emotions. Abstract: With the rapid evolution of multimodal large language models, the capacity to deeply understand and interpret human intentions has emerged as a critical capability, which demands detailed and thoughtful reasoning. In recent studies, Reinforcement Learning (RL) has demonstrated potential in enhancing the reasoning capabilities of Large Language Models (LLMs). Nonetheless, the challenges associated with adapting RL to multimodal data and formats remain largely unaddressed. In this paper, we identify two issues in existing multimodal reasoning models: insufficient global context understanding and shortcut problems. Insufficient context understanding can happen when a model misinterprets multimodal context, resulting in incorrect answers. The shortcut problem occurs when the model overlooks crucial clues in multimodal inputs, directly addressing the query without considering the multimodal information. To tackle these issues, we emphasize the necessity for the model to reason with a clear understanding of the global context within multimodal inputs. This global context understanding can effectively prevent the model from overlooking key multimodal cues and ensure a thorough reasoning process. To ensure the accurate interpretation of multimodal context information, we implement a context reward judged by a large language model, alongside format and accuracy rewards. Additionally, to improve complex reasoning capability, we employ the LLM to assess the logical reward, determining whether the reasoning process successfully integrates multimodal information with logical methods. We also introduce a reasoning omni-modal benchmark, IntentBench, aimed at evaluating models in understanding complex human intentions and emotions. 
Our proposed method demonstrates advanced performance across multiple omni-modal benchmarks compared to other open-source omni-modal models.[75] FedSC: Federated Learning with Semantic-Aware Collaboration
Huan Wang,Haoran Li,Huaming Chen,Jun Yan,Jiahua Shi,Jun Shen
Main category: cs.CV
TL;DR: This paper proposes FedSC, a new federated learning method that addresses data heterogeneity through semantic-aware collaboration.
Details
Motivation: To address data heterogeneity in federated learning by exploiting the semantic information within each client. Method: Proposes a federated learning method built on relational prototypes and consistent prototypes, introducing a contrastive learning strategy and a discrepancy aggregation scheme. Result: Experiments show that FedSC performs well across a variety of challenging scenarios and confirm the effectiveness of its key components. Conclusion: FedSC handles data heterogeneity through semantic-aware collaboration, improving federated learning performance. Abstract: Federated learning (FL) aims to train models collaboratively across clients without sharing data, in order to preserve privacy. However, one major challenge is the data heterogeneity issue, which refers to the biased labeling preferences at multiple clients. A number of existing FL methods attempt to tackle data heterogeneity locally (e.g., regularizing local models) or globally (e.g., fine-tuning the global model), often neglecting inherent semantic information contained in each client. To explore the possibility of using intra-client semantically meaningful knowledge in handling data heterogeneity, in this paper, we propose Federated Learning with Semantic-Aware Collaboration (FedSC) to capture client-specific and class-relevant knowledge across heterogeneous clients. The core idea of FedSC is to construct relational prototypes and consistent prototypes at the semantic level, aiming to provide rich underlying class knowledge and stable convergence signals in a prototype-wise collaborative way. On the one hand, FedSC introduces an inter-contrastive learning strategy to bring instance-level embeddings closer to relational prototypes with the same semantics and away from distinct classes. On the other hand, FedSC devises consistent prototypes via discrepancy aggregation, which serve as a regularization penalty to constrain the optimization region of the local model. Moreover, a theoretical analysis for FedSC is provided to ensure a convergence guarantee. Experimental results on various challenging scenarios demonstrate the effectiveness of FedSC and the efficiency of crucial components.[76] Logios : An open source Greek Polytonic Optical Character Recognition system
Perifanos Konstantinos,Goutsos Dionisis
Main category: cs.CV
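TL;DR: This paper presents an optical character recognition (OCR) system designed specifically for Greek polytonic text; it combines the strengths of convolutional and recurrent layers to improve recognition accuracy and efficiency, and is released as open source for academic use.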
Details
Motivation: Traditional OCR methods are limited in accuracy and efficiency on Greek polytonic text, so a more effective approach is needed. Method: Uses convolutional layers for feature extraction and recurrent layers for sequence learning to capture the distinctive structure of Greek polytonic text. Result: An efficient OCR system that recognizes Greek polytonic text with higher accuracy and efficiency. Conclusion: The OCR system offers an effective solution for Greek polytonic text, and its open-source library supports use in academic research. Abstract: In this paper, we present an Optical Character Recognition (OCR) system specifically designed for the accurate recognition and digitization of Greek polytonic texts. By leveraging the combined strengths of convolutional layers for feature extraction and recurrent layers for sequence learning, our system addresses the unique challenges posed by Greek polytonic scripts. This approach aims to overcome the limitations of traditional OCR methods, offering significant improvements in accuracy and efficiency. We release the underlying model as an open-source library and make our OCR platform available for academic use.[77] HybridQ: Hybrid Classical-Quantum Generative Adversarial Network for Skin Disease Image Generation
Qingyue Jiao,Kangyu Zheng,Yiyu Shi,Zhiding Liang
Main category: cs.CV
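TL;DR: This paper proposes a new classical-quantum generative adversarial network (GAN) for generating color medical images, addressing the computational-resource and training-time limits of traditional methods and showing its potential for skin disease detection.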
Details
Motivation: Existing classical generative models require extensive computational resources and long training, while quantum-based image generation methods can only produce low-quality grayscale images; a more efficient way to generate high-quality color medical images is needed. Method: Using a novel classical-quantum latent space fusion technique, the work builds the first classical-quantum GAN capable of generating color medical images. Result: The model outperforms classical deep convolutional GANs and existing hybrid classical-quantum GANs in both image generation quality and the classification gain obtained when it is used for data augmentation; compared with state-of-the-art classical generative models, it uses over 25 times fewer parameters and 10 times fewer training epochs. It also performs robustly on a real IBM quantum machine with hardware noise. Conclusion: As quantum hardware advances, quantum image generation has promising prospects. Abstract: Machine learning-assisted diagnosis is gaining traction in skin disease detection, but training effective models requires large amounts of high-quality data. Skin disease datasets often suffer from class imbalance, privacy concerns, and object bias, making data augmentation essential. While classical generative models are widely used, they demand extensive computational resources and lengthy training time. Quantum computing offers a promising alternative, but existing quantum-based image generation methods can only yield grayscale low-quality images. Through a novel classical-quantum latent space fusion technique, our work overcomes this limitation and introduces the first classical-quantum generative adversarial network (GAN) capable of generating color medical images. Our model outperforms classical deep convolutional GANs and existing hybrid classical-quantum GANs in both image generation quality and classification performance boost when used as data augmentation. Moreover, the performance boost is comparable with that achieved using state-of-the-art classical generative models, yet with over 25 times fewer parameters and 10 times fewer training epochs. Such results suggest a promising future for quantum image generation as quantum hardware advances. Finally, we demonstrate the robust performance of our model on a real IBM quantum machine with hardware noise.[78] HalluSegBench: Counterfactual Visual Reasoning for Segmentation Hallucination Evaluation
Xinzhuo Li,Adheesh Juvekar,Xingyou Liu,Muntasir Wahed,Kiet A. Nguyen,Ismini Lourentzou
Main category: cs.CV
TL;DR: This paper introduces HalluSegBench to evaluate hallucinations in vision-language segmentation, showing that current models often hallucinate visually driven errors.
Details
Motivation: Current evaluation protocols inadequately address vision-driven hallucinations in vision-language segmentation models. Method: Introduced HalluSegBench, a benchmark with 1340 counterfactual instance pairs and new metrics to evaluate hallucinations in visual grounding. Result: Experiments revealed that vision-driven hallucinations are more common than label-driven ones, with models frequently producing false segmentations. Conclusion: Vision-language segmentation models show significant hallucinations, especially vision-driven ones, and require counterfactual reasoning for accurate evaluation. Abstract: Recent progress in vision-language segmentation has significantly advanced grounded visual understanding. However, these models often exhibit hallucinations by producing segmentation masks for objects not grounded in the image content or by incorrectly labeling irrelevant regions. Existing evaluation protocols for segmentation hallucination primarily focus on label or textual hallucinations without manipulating the visual context, limiting their capacity to diagnose critical failures. In response, we introduce HalluSegBench, the first benchmark specifically designed to evaluate hallucinations in visual grounding through the lens of counterfactual visual reasoning. Our benchmark consists of a novel dataset of 1340 counterfactual instance pairs spanning 281 unique object classes, and a set of newly introduced metrics that quantify hallucination sensitivity under visually coherent scene edits. Experiments on HalluSegBench with state-of-the-art vision-language segmentation models reveal that vision-driven hallucinations are significantly more prevalent than label-driven ones, with models often persisting in false segmentation, highlighting the need for counterfactual reasoning to diagnose grounding fidelity.[79] Multimodal Prompt Alignment for Facial Expression Recognition
Fuyan Ma,Yiran He,Bin Sun,Shutao Li
Main category: cs.CV
TL;DR: This paper proposes MPA-FER, a multimodal prompt alignment framework for facial expression recognition, which generates detailed descriptions using an LLM and aligns visual features with class-specific prototypes to improve performance while minimizing computational costs.
Details
Motivation: Current VLM-based FER methods struggle to capture fine-grained textual-visual relationships which are essential for distinguishing subtle differences between facial expressions. Method: A multimodal prompt alignment framework for FER, called MPA-FER, is introduced. It uses a multi-granularity hard prompt generation strategy with an LLM to generate detailed descriptions for each facial expression. The framework minimizes feature discrepancy between soft and hard prompts, incorporates prototype-guided visual feature alignment, and includes a cross-modal global-local alignment module. Result: Extensive experiments demonstrate that the proposed framework surpasses state-of-the-art methods on three FER benchmark datasets. Conclusion: The proposed MPA-FER framework outperforms state-of-the-art methods on three FER benchmark datasets while maintaining the benefits of the pretrained model and minimizing computational costs. Abstract: Prompt learning has been widely adopted to efficiently adapt vision-language models (VLMs) like CLIP for various downstream tasks. Despite their success, current VLM-based facial expression recognition (FER) methods struggle to capture fine-grained textual-visual relationships, which are essential for distinguishing subtle differences between facial expressions. To address this challenge, we propose a multimodal prompt alignment framework for FER, called MPA-FER, that provides fine-grained semantic guidance to the learning process of prompted visual features, resulting in more precise and interpretable representations. Specifically, we introduce a multi-granularity hard prompt generation strategy that utilizes a large language model (LLM) like ChatGPT to generate detailed descriptions for each facial expression. The LLM-based external knowledge is injected into the soft prompts by minimizing the feature discrepancy between the soft prompts and the hard prompts. 
To preserve the generalization abilities of the pretrained CLIP model, our approach incorporates prototype-guided visual feature alignment, ensuring that the prompted visual features from the frozen image encoder align closely with class-specific prototypes. Additionally, we propose a cross-modal global-local alignment module that focuses on expression-relevant facial features, further improving the alignment between textual and visual features. Extensive experiments demonstrate our framework outperforms state-of-the-art methods on three FER benchmark datasets, while retaining the benefits of the pretrained model and minimizing computational costs.[80] LASFNet: A Lightweight Attention-Guided Self-Modulation Feature Fusion Network for Multimodal Object Detection
Lei Hao,Lina Xu,Chang Liu,Yanni Dong
Main category: cs.CV
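TL;DR: This paper proposes LASFNet, which performs efficient multimodal object detection with a single feature-level fusion unit, balancing efficiency and accuracy.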
Details
Motivation: Existing multimodal object detection methods typically require complex training procedures and significant computational overhead, so a more efficient feature fusion scheme is needed. Method: Proposes a new fusion detection baseline and designs a lightweight attention-guided self-modulation feature fusion network (LASFNet) containing an ASFF module and a FATM module. Result: Experiments show that, compared with state-of-the-art methods, LASFNet reduces the number of parameters by up to 90% and the computational cost by up to 85%, while improving detection accuracy (mAP) by 1%-3%. Conclusion: LASFNet achieves efficient feature fusion for multimodal object detection; by introducing the ASFF and FATM modules, it lowers computational cost while improving detection accuracy. Abstract: Effective deep feature extraction via feature-level fusion is crucial for multimodal object detection. However, previous studies often involve complex training processes that integrate modality-specific features by stacking multiple feature-level fusion units, leading to significant computational overhead. To address this issue, we propose a new fusion detection baseline that uses a single feature-level fusion unit to enable high-performance detection, thereby simplifying the training process. Based on this approach, we propose a lightweight attention-guided self-modulation feature fusion network (LASFNet), which introduces a novel attention-guided self-modulation feature fusion (ASFF) module that adaptively adjusts the responses of fusion features at both global and local levels based on attention information from different modalities, thereby promoting comprehensive and enriched feature generation. Additionally, a lightweight feature attention transformation module (FATM) is designed at the neck of LASFNet to enhance the focus on fused features and minimize information loss. Extensive experiments on three representative datasets demonstrate that, compared to state-of-the-art methods, our approach achieves a favorable efficiency-accuracy trade-off, reducing the number of parameters and computational cost by as much as 90% and 85%, respectively, while improving detection accuracy (mAP) by 1%-3%. The code will be open-sourced at https://github.com/leileilei2000/LASFNet.[81] Instella-T2I: Pushing the Limits of 1D Discrete Latent Space Image Generation
Ze Wang,Hao Chen,Benran Hu,Jiang Liu,Ximeng Sun,Jialian Wu,Yusheng Su,Xiaodong Yu,Emad Barsoum,Zicheng Liu
Main category: cs.CV
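TL;DR: This paper proposes an efficient image tokenization method based on a 1D binary latent space, sharply reducing the token count and speeding up training and inference for text-to-image generation.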
Details
Motivation: To reduce the computational demands of modeling high-resolution images and improve the efficiency of image and multimodal understanding and generation. Method: Represents images in a 1D binary latent space, using sequences of binary vectors in place of traditional one-hot codebook tokens, combined with simple model architectures for efficient text-to-image generation. Result: The proposed text-to-image models match the performance of modern image generation models using only 128 discrete tokens, with markedly faster training and inference and a global batch size of 4096 on a single GPU node. Conclusion: The paper proposes a new image tokenization method based on a 1D binary latent space that, compared with traditional methods, significantly reduces the number of tokens required while preserving high-resolution detail, and speeds up training and inference. Abstract: Image tokenization plays a critical role in reducing the computational demands of modeling high-resolution images, significantly improving the efficiency of image and multimodal understanding and generation. Recent advances in 1D latent spaces have reduced the number of tokens required by eliminating the need for a 2D grid structure. In this paper, we further advance compact discrete image representation by introducing 1D binary image latents. By representing each image as a sequence of binary vectors, rather than using traditional one-hot codebook tokens, our approach preserves high-resolution details while maintaining the compactness of 1D latents. To the best of our knowledge, our text-to-image models are the first to achieve competitive performance in both diffusion and auto-regressive generation using just 128 discrete tokens for images up to 1024x1024, demonstrating up to a 32-fold reduction in token numbers compared to standard VQ-VAEs. The proposed 1D binary latent space, coupled with simple model architectures, achieves marked improvements in training and inference speed. Our text-to-image models allow for a global batch size of 4096 on a single GPU node with 8 AMD MI300X GPUs, and the training can be completed within 200 GPU days. Our models achieve competitive performance compared to modern image generation models without any in-house private training data or post-training refinements, offering a scalable and efficient alternative to conventional tokenization methods.[82] DidSee: Diffusion-Based Depth Completion for Material-Agnostic Robotic Perception and Manipulation
Wenzhou Lyu,Jialing Lin,Wenqi Ren,Ruihao Xia,Feng Qian,Yang Tang
Main category: cs.CV
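TL;DR: DidSee is an improved diffusion-based framework that addresses bias and accuracy problems in depth completion for non-Lambertian objects; it combines a rescaled noise schedule, an improved training strategy, and a semantic enhancement module to achieve strong performance and broad application potential.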
Details
Motivation: Commercial RGB-D cameras often produce noisy, incomplete depth maps for non-Lambertian objects. Traditional methods generalize poorly because of the limited diversity and scale of training data. Recent methods use visual priors from pre-trained text-to-image diffusion models to improve generalization in dense prediction tasks, but they suffer from bias and accuracy problems in depth completion. Method: 1. Introduces a rescaled noise scheduler to eliminate signal leakage bias. 2. Designs a noise-agnostic single-step training formulation to mitigate error accumulation. 3. Adds a semantic enhancer that performs depth completion and semantic segmentation jointly. Result: DidSee achieves state-of-the-art performance on multiple benchmarks, shows strong real-world generalization, and improves downstream tasks such as category-level pose estimation and robotic grasping. Conclusion: DidSee is a diffusion-based depth completion framework designed for non-Lambertian objects. By improving the noise scheduler and training strategy and adding a semantic enhancer, it achieves state-of-the-art performance on multiple benchmarks and performs well in practical applications. Abstract: Commercial RGB-D cameras often produce noisy, incomplete depth maps for non-Lambertian objects. Traditional depth completion methods struggle to generalize due to the limited diversity and scale of training data. Recent advances exploit visual priors from pre-trained text-to-image diffusion models to enhance generalization in dense prediction tasks. However, we find that biases arising from training-inference mismatches in the vanilla diffusion framework significantly impair depth completion performance. Additionally, the lack of distinct visual features in non-Lambertian regions further hinders precise prediction. To address these issues, we propose DidSee, a diffusion-based framework for depth completion on non-Lambertian objects. First, we integrate a rescaled noise scheduler enforcing a zero terminal signal-to-noise ratio to eliminate signal leakage bias. Second, we devise a noise-agnostic single-step training formulation to alleviate error accumulation caused by exposure bias and optimize the model with a task-specific loss. Finally, we incorporate a semantic enhancer that enables joint depth completion and semantic segmentation, distinguishing objects from backgrounds and yielding precise, fine-grained depth maps. 
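As an illustration of the zero terminal signal-to-noise ratio idea mentioned in the DidSee abstract, the sketch below uses a commonly used rescaling: shift and scale the square root of the cumulative alpha product so that its last value is exactly zero while its first value is preserved. The abstract does not give DidSee's formula, so treating this particular rescaling as the one used is an assumption, as are the linear beta schedule and the names `linear_betas`, `rescale_zero_terminal_snr`, and `snr`.

```python
import math


def linear_betas(num_steps, beta_start=1e-4, beta_end=0.02):
    """Hypothetical linear beta schedule, used only as an input for the demo."""
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]


def rescale_zero_terminal_snr(betas):
    """Rescale a beta schedule so the final cumulative alpha is exactly zero.

    Returns (new_betas, new_alphas_bar). The first cumulative alpha is kept
    (up to rounding) and the last one becomes 0, i.e. SNR(T) = 0.
    """
    # Cumulative product of alphas: alpha_bar_t = prod_{s<=t} (1 - beta_s).
    alphas_bar = []
    prod = 1.0
    for beta in betas:
        prod *= 1.0 - beta
        alphas_bar.append(prod)

    sqrt_ab = [math.sqrt(a) for a in alphas_bar]
    first, last = sqrt_ab[0], sqrt_ab[-1]

    # Shift so the last value is zero, then scale so the first is unchanged.
    scale = first / (first - last)
    sqrt_ab = [(s - last) * scale for s in sqrt_ab]
    new_alphas_bar = [s * s for s in sqrt_ab]

    # Recover per-step betas from the rescaled cumulative products.
    new_betas = [1.0 - new_alphas_bar[0]]
    for prev, cur in zip(new_alphas_bar, new_alphas_bar[1:]):
        new_betas.append(1.0 - cur / prev)
    return new_betas, new_alphas_bar


def snr(alpha_bar):
    """Signal-to-noise ratio alpha_bar / (1 - alpha_bar) for alpha_bar < 1."""
    return alpha_bar / (1.0 - alpha_bar)


betas = linear_betas(1000)
new_betas, new_alphas_bar = rescale_zero_terminal_snr(betas)
print(snr(new_alphas_bar[-1]))  # terminal SNR after rescaling
```

With an unmodified schedule the last cumulative alpha is small but nonzero, so a little of the clean signal survives at the final timestep; after the rescaling the final step is pure noise, which is what "zero terminal SNR" refers to.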
DidSee achieves state-of-the-art performance on multiple benchmarks, demonstrates robust real-world generalization, and effectively improves downstream tasks such as category-level pose estimation and robotic grasping. Project page: https://wenzhoulyu.github.io/DidSee/[83] Boosting Domain Generalized and Adaptive Detection with Diffusion Models: Fitness, Generalization, and Transferability
Boyong He,Yuxiang Ji,Zhuoyue Tan,Liaoni Wu
Main category: cs.CV
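TL;DR: This paper proposes an efficient diffusion-based cross-domain detection method that improves feature extraction and fusion and adds an object-centered auxiliary branch with a consistency loss, lowering inference cost and improving cross-domain detection performance.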
Details
Motivation: Detectors often lose performance because of the domain gap between training and test data, and existing methods fall short in inference cost and in exploiting the capabilities of diffusion models. Method: Extracts intermediate features from a single-step diffusion process to make feature collection and fusion more efficient, builds an object-centered auxiliary branch, and applies a consistency loss to align the auxiliary and ordinary branches, achieving feature-level and object-level alignment. Result: The method reduces inference time by 75%, shows clear advantages on cross-domain detection across source domains, target domains, and low-data scenarios, and achieves competitive results on 3 DA benchmarks and 5 DG benchmarks. Conclusion: The paper proposes a unified diffusion-based framework that addresses performance degradation in cross-domain detection and achieves competitive results on multiple benchmarks. Abstract: Detectors often suffer from a performance drop due to the domain gap between training and testing data. Recent methods explore diffusion models applied to domain generalization (DG) and adaptation (DA) tasks, but still struggle with large inference costs and have not yet fully leveraged the capabilities of diffusion models. We propose to tackle these problems by extracting intermediate features from a single-step diffusion process, improving feature collection and fusion to reduce inference time by 75% while enhancing performance on source domains (i.e., Fitness). Then, we construct an object-centered auxiliary branch by applying box-masked images with class prompts to extract robust and domain-invariant features that focus on objects. We also apply consistency loss to align the auxiliary and ordinary branch, balancing fitness and generalization while preventing overfitting and improving performance on target domains (i.e., Generalization). Furthermore, within a unified framework, standard detectors are guided by diffusion detectors through feature-level and object-level alignment on source domains (for DG) and unlabeled target domains (for DA), thereby improving cross-domain detection performance (i.e., Transferability). Our method achieves competitive results on 3 DA benchmarks and 5 DG benchmarks. Additionally, experiments on the COCO generalization benchmark demonstrate that our method maintains significant advantages and shows remarkable efficiency in large domain shifts and low-data scenarios. 
Our work shows the superiority of applying diffusion models to domain generalized and adaptive detection tasks and offers valuable insights for visual perception tasks across diverse domains. The code is available at https://github.com/heboyong/Fitness-Generalization-Transferability.[84] Improving Diffusion-Based Image Editing Faithfulness via Guidance and Scheduling
Hansam Cho,Seoung Bum Kim
Main category: cs.CV
TL;DR: This paper proposes FGS, a new method that addresses the trade-off between editability and faithfulness in text-guided diffusion models.
Details
Motivation: Text-guided diffusion models enable high-quality image synthesis and dynamic editing, but there is a trade-off between editability and faithfulness. Method: Proposes Faithfulness Guidance and a Scheduling Strategy. Result: FGS significantly improves faithfulness while maintaining editability and is compatible with a variety of editing methods. Conclusion: FGS balances high faithfulness with editability and improves image editing quality. Abstract: Text-guided diffusion models have become essential for high-quality image synthesis, enabling dynamic image editing. In image editing, two crucial aspects are editability, which determines the extent of modification, and faithfulness, which reflects how well unaltered elements are preserved. However, achieving optimal results is challenging because of the inherent trade-off between editability and faithfulness. To address this, we propose Faithfulness Guidance and Scheduling (FGS), which enhances faithfulness with minimal impact on editability. FGS incorporates faithfulness guidance to strengthen the preservation of input image information and introduces a scheduling strategy to resolve misalignment between editability and faithfulness. Experimental results demonstrate that FGS achieves superior faithfulness while maintaining editability. Moreover, its compatibility with various editing methods enables precise, high-quality image edits across diverse tasks.[85] Boosting Generative Adversarial Transferability with Self-supervised Vision Transformer Features
Shangbo Wu,Yu-an Tan,Ruinan Ma,Wencong Ma,Dehua Zhu,Yuanzhang Li
Main category: cs.CV
TL;DR: This paper proposes dSVA, an adversarial attack method using dual self-supervised Vision Transformer features that significantly improves black-box transferability across different model architectures.
Details
Motivation: The motivation stems from the synergy between self-supervised learning and Transformers, aiming to enhance adversarial perturbation generalization by utilizing intermediate DNN features instead of hard labels. Method: The paper introduces dSVA, a generative attack framework leveraging both global structural features from contrastive learning (CL) and local textural features from masked image modeling (MIM), combined with a generator trained to exploit these joint features and the attention mechanism in self-supervised ViTs. Result: The results demonstrate that dSVA achieves remarkable black-box transferability, outperforming state-of-the-art methods by disrupting dual deep features derived from self-supervised ViTs. Conclusion: The paper concludes that by exploiting dual self-supervised Vision Transformer (ViT) features, the proposed dSVA attack effectively generates black-box adversarial examples with improved transferability across various model architectures. Abstract: The ability of deep neural networks (DNNs) comes from extracting and interpreting features from the data provided. By exploiting intermediate features in DNNs instead of relying on hard labels, we craft adversarial perturbations that generalize more effectively, boosting black-box transferability. These features ubiquitously come from supervised learning in previous work. Inspired by the exceptional synergy between self-supervised learning and the Transformer architecture, this paper explores whether exploiting self-supervised Vision Transformer (ViT) representations can improve adversarial transferability. We present dSVA, a generative dual self-supervised ViT features attack that exploits both global structural features from contrastive learning (CL) and local textural features from masked image modeling (MIM), the self-supervised learning paradigm duo for ViTs. 
We design a novel generative training framework that incorporates a generator to create black-box adversarial examples, and strategies to train the generator by exploiting joint features and the attention mechanism of self-supervised ViTs. Our findings show that CL and MIM enable ViTs to attend to distinct feature tendencies, which, when exploited in tandem, yield strong adversarial generalizability. By disrupting dual deep features distilled by self-supervised ViTs, we are rewarded with remarkable black-box transferability to models of various architectures, outperforming state-of-the-art methods. Code is available at https://github.com/spencerwooo/dSVA.[86] Class-Agnostic Region-of-Interest Matching in Document Images
Demin Zhang,Jiahao Lyu,Zhijie Shen,Yu Zhou
Main category: cs.CV
TL;DR: This paper proposes RoI-Matcher, a novel framework for class-agnostic region-of-interest matching in document images, enabling flexible, multi-granularity analysis across domains.
Details
Motivation: Existing document analysis solutions are limited by fixed category definitions and granularities, making them unsuitable for customized applications. This paper addresses this limitation by defining a new, more flexible task called RoI-Matching. Method: The paper introduces a new framework called RoI-Matcher, which utilizes a siamese network to extract multi-level features and cross-attention layers to align similar semantics in reference and target domains. Result: Experiments show that the proposed method is effective on the newly constructed RoI-Matching-Bench benchmark, serving as a baseline for future research. Conclusion: The paper concludes that the proposed RoI-Matcher framework effectively meets the requirements of the new RoI-Matching task, providing a flexible and efficient solution for document understanding and analysis. Abstract: Document understanding and analysis have received a lot of attention due to their widespread application. However, existing document analysis solutions, such as document layout analysis and key information extraction, are only suitable for fixed category definitions and granularities, and cannot achieve flexible applications customized by users. Therefore, this paper defines a new task named "Class-Agnostic Region-of-Interest Matching" ("RoI-Matching" for short), which aims to match the customized regions in a flexible, efficient, multi-granularity, and open-set manner. The visual prompt of the reference document and the target document images are fed into our model, and the output is the corresponding bounding boxes in the target document images. To meet the above requirements, we construct a benchmark, RoI-Matching-Bench, which sets three difficulty levels following real-world conditions, and propose macro and micro metrics for evaluation. 
Furthermore, we also propose a new framework RoI-Matcher, which employs a siamese network to extract multi-level features both in the reference and target domains, and cross-attention layers to integrate and align similar semantics in different domains. Experiments show that our method with a simple procedure is effective on RoI-Matching-Bench, and serves as the baseline for further research. The code is available at https://github.com/pd162/RoI-Matching.[87] SAMURAI: Shape-Aware Multimodal Retrieval for 3D Object Identification
Dinh-Khoi Vo,Van-Loc Nguyen,Minh-Triet Tran,Trung-Nghia Le
Main category: cs.CV
TL;DR: This paper introduces SAMURAI, a new approach for 3D object retrieval in challenging indoor environments, combining language understanding with shape analysis to improve accuracy.
Details
Motivation: The motivation stems from the challenges in retrieving 3D objects in complex indoor environments using only a masked 2D image and natural language description, especially under conditions like distorted viewpoints, textureless masked regions, ambiguous language prompts, and noisy segmentation masks. Method: The authors propose SAMURAI, a hybrid retrieval framework integrating CLIP-based semantic matching with shape-guided re-ranking derived from binary silhouettes of masked regions, along with a robust majority voting strategy. A preprocessing pipeline enhances mask quality by extracting the largest connected component and removing background noise. Result: The proposed method achieves competitive performance on the ROOMELSA private test set, demonstrating the effectiveness of leveraging both language and shape cues for 3D object retrieval. Conclusion: The study concludes that combining shape priors with language understanding is crucial for robust open-world 3D object retrieval, as demonstrated by the competitive performance of the proposed SAMURAI framework on the ROOMELSA private test set. Abstract: Retrieving 3D objects in complex indoor environments using only a masked 2D image and a natural language description presents significant challenges. The ROOMELSA challenge limits access to full 3D scene context, complicating reasoning about object appearance, geometry, and semantics. These challenges are intensified by distorted viewpoints, textureless masked regions, ambiguous language prompts, and noisy segmentation masks. To address this, we propose SAMURAI: Shape-Aware Multimodal Retrieval for 3D Object Identification. SAMURAI integrates CLIP-based semantic matching with shape-guided re-ranking derived from binary silhouettes of masked regions, alongside a robust majority voting strategy. A dedicated preprocessing pipeline enhances mask quality by extracting the largest connected component and removing background noise. 
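The mask-cleaning step described above (keep the largest connected component, drop the rest as noise) can be sketched in a few lines. This is a minimal illustration, not the SAMURAI implementation: the abstract does not specify the connectivity or the mask format, so 4-connectivity, a nested-list binary mask, and the function name `largest_connected_component` are all assumptions.

```python
from collections import deque


def largest_connected_component(mask):
    """Return a copy of a binary mask keeping only its largest 4-connected blob.

    `mask` is a list of equal-length rows of 0/1 values (hypothetical format).
    """
    rows = len(mask)
    cols = len(mask[0]) if rows else 0
    seen = [[False] * cols for _ in range(rows)]
    best = []  # pixel coordinates of the largest component found so far

    for r in range(rows):
        for c in range(cols):
            if not mask[r][c] or seen[r][c]:
                continue
            # Breadth-first flood fill from an unvisited foreground pixel.
            component = []
            queue = deque([(r, c)])
            seen[r][c] = True
            while queue:
                y, x = queue.popleft()
                component.append((y, x))
                for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                    ny, nx = y + dy, x + dx
                    if (0 <= ny < rows and 0 <= nx < cols
                            and mask[ny][nx] and not seen[ny][nx]):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            if len(component) > len(best):
                best = component

    # Rebuild a clean mask containing only the winning component.
    cleaned = [[0] * cols for _ in range(rows)]
    for y, x in best:
        cleaned[y][x] = 1
    return cleaned


noisy = [
    [1, 1, 0, 0],
    [1, 1, 0, 1],
    [0, 0, 0, 0],
    [0, 1, 0, 0],
]
for row in largest_connected_component(noisy):
    print(row)
```

In the example, the 2x2 block in the top-left corner is kept and the two isolated pixels are removed, which is the kind of stray segmentation noise the preprocessing is meant to suppress before silhouettes are compared.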
Our hybrid retrieval framework leverages both language and shape cues, achieving competitive performance on the ROOMELSA private test set. These results highlight the importance of combining shape priors with language understanding for robust open-world 3D object retrieval.[88] PoseMaster: Generating 3D Characters in Arbitrary Poses from a Single Image
Hongyu Yan,Kunming Luo,Weiyu Li,Yixun Liang,Shengming Li,Jingwei Huang,Chunchao Guo,Ping Tan
Main category: cs.CV
TL;DR: PoseMaster is an end-to-end framework for 3D character generation that enables accurate pose control by integrating pose transformation into a flow-based generation process, solving distortion issues present in prior methods.
Details
Motivation: Existing image-based methods suffer from distorted images during pose standardization due to self-occlusion and viewpoint issues, which affects the quality of 3D reconstruction. Method: PoseMaster unifies pose transformation and 3D character generation into a flow-based 3D native generation framework, using the 3D body bones of the skeleton as the pose condition. Random emptying of pose and image conditions during training improves generalizability. Result: Extensive experiments demonstrate that PoseMaster achieves better performance than current techniques for A-pose character generation and provides strong arbitrary-pose control capabilities. Conclusion: PoseMaster is an effective and efficient framework for 3D character generation that allows precise arbitrary-pose control while outperforming state-of-the-art methods in both qualitative and quantitative evaluations. Abstract: 3D characters play a crucial role in our daily entertainment. To improve the efficiency of 3D character modeling, recent image-based methods use two separate models to achieve pose standardization and 3D reconstruction of the A-pose character. However, these methods are prone to generating distorted and degraded images in the pose standardization stage due to self-occlusion and viewpoints, which further affects the geometric quality of the subsequent reconstruction process. To tackle these problems, we propose PoseMaster, an end-to-end controllable 3D character generation framework. Specifically, we unify pose transformation and 3D character generation into a flow-based 3D native generation framework. To achieve accurate arbitrary-pose control, we propose to leverage the 3D body bones existing in the skeleton of an animatable character as the pose condition. Furthermore, considering the specificity of multi-condition control, we randomly empty the pose condition and the image condition during training to improve the effectiveness and generalizability of pose control. 
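The random emptying of conditions described above can be sketched as a simple condition-dropout step. This is an illustration under stated assumptions only: the abstract does not give the drop probabilities, whether the two conditions are dropped independently, or how an "empty" condition is represented, so the function name `drop_conditions`, the default probabilities, and the use of `None` as the empty condition are all hypothetical.

```python
import random


def drop_conditions(pose, image, p_drop_pose=0.1, p_drop_image=0.1, rng=random):
    """Independently replace each condition with None (standing in for an
    'empty' condition) with the given probability. All defaults are
    illustrative assumptions, not values from the paper."""
    pose_out = None if rng.random() < p_drop_pose else pose
    image_out = None if rng.random() < p_drop_image else image
    return pose_out, image_out


# Hypothetical usage inside a training loop: the model would then be
# trained on whichever conditions survive this step.
pose_cond, image_cond = drop_conditions("skeleton_bones", "reference_image")
```

Training with occasionally missing conditions is a common way to keep a multi-condition model from relying on any single input, which is consistent with the generalizability goal stated in the abstract.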
Finally, we create a high-quality pose-control dataset derived from realistic character animation data to help the model learn the implicit relationships between skeleton and skinning weights. Extensive experiments show that PoseMaster outperforms current state-of-the-art techniques in both qualitative and quantitative evaluations for A-pose character generation while demonstrating its powerful ability to achieve precise control for arbitrary poses.[89] EgoAdapt: Adaptive Multisensory Distillation and Policy Learning for Efficient Egocentric Perception
Sanjoy Chowdhury,Subrata Biswas,Sayan Nag,Tushar Nagarajan,Calvin Murdock,Ishwarya Ananthabhotla,Yijun Qian,Vamsi Krishna Ithapu,Dinesh Manocha,Ruohan Gao
Main category: cs.CV
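TL;DR: This paper proposes EgoAdapt, a framework that enables efficient inference for egocentric perception tasks through adaptive cross-modal distillation and policy learning.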
Details
Motivation: Modern perception models perform well on multisensory egocentric tasks, but their high computational cost makes deployment difficult in resource-constrained environments. Method: The paper introduces EgoAdapt, a framework that adaptively performs cross-modal distillation and policy learning to enable efficient inference. Result: Experiments show that EgoAdapt reduces GMACs by up to 89.09%, parameters by up to 82.02%, and energy by up to 9.6x. Conclusion: EgoAdapt substantially improves efficiency while matching or exceeding the performance of existing state-of-the-art models. Abstract: Modern perception models, particularly those designed for multisensory egocentric tasks, have achieved remarkable performance but often come with substantial computational costs. These high demands pose challenges for real-world deployment, especially in resource-constrained environments. In this paper, we introduce EgoAdapt, a framework that adaptively performs cross-modal distillation and policy learning to enable efficient inference across different egocentric perception tasks, including egocentric action recognition, active speaker localization, and behavior anticipation. Our proposed policy module is adaptable to task-specific action spaces, making it broadly applicable. Experimental results on three challenging egocentric datasets (EPIC-Kitchens, EasyCom, and Aria Everyday Activities) demonstrate that our method significantly enhances efficiency, reducing GMACs by up to 89.09%, parameters by up to 82.02%, and energy by up to 9.6x, while remaining on par with, and in many cases outperforming, the corresponding state-of-the-art models.[90] ESMStereo: Enhanced ShuffleMixer Disparity Upsampling for Real-Time and Accurate Stereo Matching
Mahmoud Tahmasebi,Saif Huq,Kevin Meehan,Marion McAfee
Main category: cs.CV
TL;DR: This paper proposes ESM, a novel method for stereo matching that balances real-time performance and high accuracy by improving small-scale cost volume processing.
Details
Motivation: The motivation is to overcome the limitations of large-scale cost volumes (redundant information and computational intensity) and improve the accuracy of small-scale cost volumes for real-time performance. Method: The method involves integrating primary features into a disparity upsampling unit, extracting and fusing features using a compact feature-guided hourglass network, and focusing on local contextual connectivity for disparity map reconstruction. Result: The compact version of ESMStereo achieves 116 FPS on high-end GPUs and 91 FPS on the AGX Orin while delivering highly accurate disparity maps. Conclusion: The proposed Enhanced Shuffle Mixer (ESM) effectively mitigates information loss in small-scale cost volumes, achieving real-time stereo matching with high accuracy. Abstract: Stereo matching has become an increasingly important component of modern autonomous systems. Developing deep learning-based stereo matching models that deliver high accuracy while operating in real-time continues to be a major challenge in computer vision. In the domain of cost-volume-based stereo matching, accurate disparity estimation depends heavily on large-scale cost volumes. However, such large volumes store substantial redundant information and also require computationally intensive aggregation units for processing and regression, making real-time performance unattainable. Conversely, small-scale cost volumes followed by lightweight aggregation units provide a promising route for real-time performance, but lack sufficient information to ensure highly accurate disparity estimation. To address this challenge, we propose the Enhanced Shuffle Mixer (ESM) to mitigate information loss associated with small-scale cost volumes. ESM restores critical details by integrating primary features into the disparity upsampling unit. It quickly extracts features from the initial disparity estimation and fuses them with image features. 
These features are mixed by shuffling and layer splitting, then refined through a compact feature-guided hourglass network to recover more detailed scene geometry. The ESM focuses on local contextual connectivity with a large receptive field and low computational cost, leading to the reconstruction of a highly accurate disparity map in real time. The compact version of ESMStereo achieves an inference speed of 116 FPS on high-end GPUs and 91 FPS on the AGX Orin.[91] OracleFusion: Assisting the Decipherment of Oracle Bone Script with Structurally Constrained Semantic Typography
Caoshuo Li,Zengmao Ding,Xiaobin Hu,Bang Li,Donghao Luo,AndyPian Wu,Chaoyang Wang,Chengjie Wang,Taisong Jin,SevenShu,Yunsheng Wu,Yongge Liu,Rongrong Ji
Main category: cs.CV
TL;DR: This paper proposes OracleFusion, a novel two-stage semantic typography framework designed to assist experts in deciphering Oracle Bone Script (OBS), demonstrating superior performance in semantics, visual appeal, and glyph maintenance.
Details
Motivation: Approximately 4,500 Oracle Bone Script (OBS) characters have been discovered, but only about 1,600 have been deciphered. The undeciphered characters pose significant challenges due to their complex structure and abstract imagery. Method: A two-stage semantic typography framework named OracleFusion is proposed. In the first stage, Multimodal Large Language Model (MLLM) with enhanced Spatial Awareness Reasoning (SAR) is used to analyze glyph structure and perform visual localization of key components. In the second stage, Oracle Structural Vector Fusion (OSVF) is introduced to generate semantically enriched vector fonts by incorporating glyph structure and maintenance constraints. Result: Extensive qualitative and quantitative experiments show that OracleFusion outperforms state-of-the-art baseline models in terms of semantics, visual appeal, and glyph maintenance, significantly enhancing readability and aesthetic quality. Conclusion: OracleFusion provides expert-like insights on unseen oracle characters, making it a valuable tool for advancing the decipherment of OBS. Abstract: As one of the earliest ancient languages, Oracle Bone Script (OBS) encapsulates the cultural records and intellectual expressions of ancient civilizations. Despite the discovery of approximately 4,500 OBS characters, only about 1,600 have been deciphered. The remaining undeciphered ones, with their complex structure and abstract imagery, pose significant challenges for interpretation. To address these challenges, this paper proposes a novel two-stage semantic typography framework, named OracleFusion. In the first stage, this approach leverages the Multimodal Large Language Model (MLLM) with enhanced Spatial Awareness Reasoning (SAR) to analyze the glyph structure of the OBS character and perform visual localization of key components. 
In the second stage, we introduce Oracle Structural Vector Fusion (OSVF), incorporating glyph structure constraints and glyph maintenance constraints to ensure the accurate generation of semantically enriched vector fonts. This approach preserves the objective integrity of the glyph structure, offering visually enhanced representations that assist experts in deciphering OBS. Extensive qualitative and quantitative experiments demonstrate that OracleFusion outperforms state-of-the-art baseline models in terms of semantics, visual appeal, and glyph maintenance, significantly enhancing both readability and aesthetic quality. Furthermore, OracleFusion provides expert-like insights on unseen oracle characters, making it a valuable tool for advancing the decipherment of OBS.[92] Pushing Trade-Off Boundaries: Compact yet Effective Remote Sensing Change Detection
Luosheng Xu,Dalin Zhang,Zhaohui Song
Main category: cs.CV
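TL;DR: This study proposes FlickCD, an efficient change detection method that maintains high accuracy while substantially reducing computational resource requirements.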
Details
Motivation: Modern deep learning models are highly complex and computationally demanding yet deliver limited accuracy gains, so the study explores a change detection method that maintains high performance in resource-constrained settings such as on-satellite processing. Method: The paper proposes FlickCD, which comprises an Enhanced Difference Module (EDM) and Local-Global Fusion Blocks that combine Shifted Window Self-Attention (SWSA) with Enhanced Global Self-Attention (EGSA). Result: Experiments on four benchmark datasets show that FlickCD reduces computational and storage overheads by more than an order of magnitude while achieving state-of-the-art performance or incurring only a minor accuracy loss (less than 1% F1). Conclusion: FlickCD achieves efficient remote sensing change detection, balancing performance against resource consumption. Abstract: Remote sensing change detection is essential for monitoring urban expansion, disaster assessment, and resource management, offering timely, accurate, and large-scale insights into dynamic landscape transformations. While deep learning has revolutionized change detection, the increasing complexity and computational demands of modern models have not necessarily translated into significant accuracy gains. Instead of following this trend, this study explores a more efficient approach, focusing on lightweight models that maintain high accuracy while minimizing resource consumption, which is an essential requirement for on-satellite processing. To this end, we propose FlickCD, which means quick flick then get great results, pushing the boundaries of the performance-resource trade-off. FlickCD introduces an Enhanced Difference Module (EDM) to amplify critical feature differences between temporal phases while suppressing irrelevant variations such as lighting and weather changes, thereby reducing computational costs in the subsequent change decoder. Additionally, the FlickCD decoder incorporates Local-Global Fusion Blocks, leveraging Shifted Window Self-Attention (SWSA) and Enhanced Global Self-Attention (EGSA) to efficiently capture semantic information at multiple scales, preserving both coarse- and fine-grained changes. Extensive experiments on four benchmark datasets demonstrate that FlickCD reduces computational and storage overheads by more than an order of magnitude while achieving state-of-the-art (SOTA) performance or incurring only a minor (<1% F1) accuracy trade-off. 
The implementation code is publicly available at https://github.com/xulsh8/FlickCD.[93] IPFormer-VideoLLM: Enhancing Multi-modal Video Understanding for Multi-shot Scenes
Yujia Liang,Jile Jiao,Zhicheng Wang,Xuetao Feng,Zixuan Ye,Yuan Wang,Hao Lu
Main category: cs.CV
TL;DR: This paper introduces a new dataset and model to improve VideoLLMs' performance in multi-shot scenarios by addressing the lack of multi-shot annotations and incorporating instance-level features.
Details
Motivation: Current VideoLLMs struggle with multi-shot scenarios due to the lack of multi-shot annotations in existing datasets. Method: The authors introduce the MultiClip-Bench dataset, with dense descriptions and question-answering pairs, and propose the IPFormer-VideoLLM model, which injects instance prompts via an attention-based connector. Result: The new dataset boosts multi-shot performance and the testing benchmark reliably measures model capability, while the new model effectively aggregates instance-specific information across scenes. Conclusion: The proposed MultiClip-Bench dataset and IPFormer-VideoLLM model significantly enhance multi-scene video understanding and offer advantages across various video benchmarks. Abstract: Video Large Language Models (VideoLLMs) have demonstrated remarkable understanding capabilities, but struggle with multi-shot scenarios, e.g., video clips with varying camera angles or scene changes. This challenge can cause failures such as instance identity forgetting and key frame negligence. In this work, we first attribute the challenge to the lack of multi-shot annotations among existing datasets and therefore we introduce a new dataset termed MultiClip-Bench, featuring dense descriptions and instruction-based question-answering pairs tailored for multi-shot scenarios. We empirically find that the training set significantly boosts the multi-shot performance, while the testing benchmark provides a reliable measure of the model capability in multi-shot scenarios. By further analyzing and discovering that current models only encode instance features in a discrete or lossy manner, at the risk of missing identity information, we then contribute a new model IPFormer-VideoLLM. Its key idea is the injection of instance-level features as instance prompts through an efficient attention-based connector. This allows for the aggregation of instance-specific information across scenes. 
Experiments demonstrate that our proposed dataset and model not only enhance the multi-scene video understanding significantly, but also offer distinct advantages across various video benchmarks.[94] CL-Splats: Continual Learning of Gaussian Splatting with Local Optimization
Jan Ackermann,Jonas Kulhanek,Shengqu Cai,Haofei Xu,Marc Pollefeys,Gordon Wetzstein,Leonidas Guibas,Songyou Peng
Main category: cs.CV
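TL;DR: CL-Splats is a new method for updating Gaussian splatting-based 3D representations; by integrating a change-detection module with local optimization, it achieves more efficient updates and better reconstruction quality than existing methods.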
Details
Motivation: In dynamic 3D environments, accurately updating scene representations over time is crucial for applications in robotics, mixed reality, and embodied AI. As scenes evolve, efficient methods are needed to incorporate changes and maintain up-to-date, high-quality reconstructions without the computational overhead of re-optimizing the entire scene. Method: CL-Splats integrates a robust change-detection module that segments updated and static components within the scene, enabling focused, local optimization that avoids unnecessary re-computation, and it supports storing and recovering previous scene states. Result: Extensive experiments show that CL-Splats achieves more efficient updates and improved reconstruction quality compared with state-of-the-art methods. Conclusion: CL-Splats provides an efficient way to update Gaussian splatting-based 3D representations, establishing a solid foundation for future real-time adaptation in 3D scene reconstruction tasks. Abstract: In dynamic 3D environments, accurately updating scene representations over time is crucial for applications in robotics, mixed reality, and embodied AI. As scenes evolve, efficient methods to incorporate changes are needed to maintain up-to-date, high-quality reconstructions without the computational overhead of re-optimizing the entire scene. This paper introduces CL-Splats, which incrementally updates Gaussian splatting-based 3D representations from sparse scene captures. CL-Splats integrates a robust change-detection module that segments updated and static components within the scene, enabling focused, local optimization that avoids unnecessary re-computation. Moreover, CL-Splats supports storing and recovering previous scene states, facilitating temporal segmentation and new scene-analysis applications. Our extensive experiments demonstrate that CL-Splats achieves efficient updates with improved reconstruction quality over the state-of-the-art. This establishes a robust foundation for future real-time adaptation in 3D scene reconstruction tasks.[95] GoIRL: Graph-Oriented Inverse Reinforcement Learning for Multimodal Trajectory Prediction
Muleilan Pei,Shaoshuai Shi,Lu Zhang,Peiliang Li,Shaojie Shen
Main category: cs.CV
TL;DR: This paper proposes a new Graph-oriented Inverse Reinforcement Learning framework for trajectory prediction in autonomous driving, combining vectorized context representations with effective feature aggregation and trajectory generation techniques to achieve superior performance and generalization.
Details
Motivation: Trajectory prediction for surrounding agents in autonomous driving is challenging due to uncertainty and multimodality, which current data-driven supervised methods struggle to handle effectively. Method: A novel Graph-oriented Inverse Reinforcement Learning (GoIRL) framework is introduced. It includes a feature adaptor to aggregate lane-graph features, integration with the maximum entropy IRL paradigm to infer reward distribution, and a hierarchical parameterized trajectory generator with refinement and probability fusion strategies. Result: Extensive experimental results show that the approach achieves state-of-the-art performance on Argoverse and nuScenes benchmarks while exhibiting better generalization than existing supervised models. Conclusion: The proposed GoIRL framework outperforms existing supervised models in trajectory prediction for autonomous driving, demonstrating state-of-the-art performance and superior generalization abilities. Abstract: Trajectory prediction for surrounding agents is a challenging task in autonomous driving due to its inherent uncertainty and underlying multimodality. Unlike prevailing data-driven methods that primarily rely on supervised learning, in this paper, we introduce a novel Graph-oriented Inverse Reinforcement Learning (GoIRL) framework, which is an IRL-based predictor equipped with vectorized context representations. We develop a feature adaptor to effectively aggregate lane-graph features into grid space, enabling seamless integration with the maximum entropy IRL paradigm to infer the reward distribution and obtain the policy that can be sampled to induce multiple plausible plans. Furthermore, conditioned on the sampled plans, we implement a hierarchical parameterized trajectory generator with a refinement module to enhance prediction accuracy and a probability fusion strategy to boost prediction confidence. 
Extensive experimental results show that our approach not only achieves state-of-the-art performance on the large-scale Argoverse and nuScenes motion forecasting benchmarks but also exhibits superior generalization abilities compared to existing supervised models.[96] Learning to See in the Extremely Dark
Hai Jiang,Binhao Guan,Zhen Liu,Xiaohong Liu,Jian Yu,Zheng Liu,Songchen Han,Shuaicheng Liu
Main category: cs.CV
TL;DR: This paper introduces a novel dataset (SIED) and a diffusion-based framework to enhance extremely low-light RAW images, enabling better performance in very dark environments.
Details
Motivation: The motivation stems from the lack of datasets for extremely dark scenes (down to 0.0001 lux), limiting the exploration of learning-based methods in such conditions. Method: A diffusion-based framework with an Adaptive Illumination Correction Module (AICM) and color consistency loss was developed. Additionally, a paired-to-paired data synthesis pipeline was used to create the SIED dataset. Result: The proposed method achieved effective restoration of visually pleasing results from extremely low-SNR RAW images, validated through experiments on the new SIED dataset and existing benchmarks. Conclusion: The study concludes that the proposed diffusion-based framework and the SIED dataset significantly advance low-light RAW image enhancement, particularly in extremely dark environments. Abstract: Learning-based methods have made promising advances in low-light RAW image enhancement, but their capability in extremely dark scenes, where the environmental illuminance drops as low as 0.0001 lux, remains to be explored due to the lack of corresponding datasets. To this end, we propose a paired-to-paired data synthesis pipeline capable of generating well-calibrated extremely low-light RAW images at three precise illuminance ranges of 0.01-0.1 lux, 0.001-0.01 lux, and 0.0001-0.001 lux, together with high-quality sRGB references, to comprise a large-scale paired dataset named See-in-the-Extremely-Dark (SIED) for benchmarking low-light RAW image enhancement approaches. Furthermore, we propose a diffusion-based framework that leverages the generative ability and intrinsic denoising property of diffusion models to restore visually pleasing results from extremely low-SNR RAW inputs, in which an Adaptive Illumination Correction Module (AICM) and a color consistency loss are introduced to ensure accurate exposure correction and color restoration. Extensive experiments on the proposed SIED and publicly available benchmarks demonstrate the effectiveness of our method. 
The code and dataset are available at https://github.com/JianghaiSCU/SIED.[97] YOLO-FDA: Integrating Hierarchical Attention and Detail Enhancement for Surface Defect Detection
Jiawei Hu
Main category: cs.CV
TL;DR: This paper proposes YOLO-FDA, a new method that improves surface defect detection in industrial scenarios through novel feature fusion and attention mechanisms.
Details
Motivation: Surface defect detection in industrial scenarios is both crucial and technically demanding because of diverse defect types, irregular shapes and sizes, fine-grained detection requirements, and complex material textures. Existing AI methods suffer from redundant features, limited detail sensitivity, and weak robustness under multiscale conditions. Method: The study adopts a BiFPN-style architecture and introduces a Detail-directional Fusion Module (DDFM) together with two new attention-based fusion strategies: Attention-weighted Concatenation (AC) and Cross-layer Attention Fusion (CAF). Result: Experiments show that YOLO-FDA outperforms existing state-of-the-art methods in both accuracy and robustness across diverse defect types and scales. Conclusion: YOLO-FDA is a new YOLO-based detection framework for surface defect detection in industrial scenarios that improves detection performance by integrating fine-grained detail enhancement with attention-guided feature fusion. Abstract: Surface defect detection in industrial scenarios is both crucial and technically demanding due to the wide variability in defect types, irregular shapes and sizes, fine-grained requirements, and complex material textures. Although recent advances in AI-based detectors have improved performance, existing methods often suffer from redundant features, limited detail sensitivity, and weak robustness under multiscale conditions. To address these challenges, we propose YOLO-FDA, a novel YOLO-based detection framework that integrates fine-grained detail enhancement and attention-guided feature fusion. Specifically, we adopt a BiFPN-style architecture to strengthen bidirectional multilevel feature aggregation within the YOLOv5 backbone. To better capture fine structural changes, we introduce a Detail-directional Fusion Module (DDFM) that introduces a directional asymmetric convolution in the second-lowest layer to enrich spatial details and fuses the second-lowest layer with low-level features to enhance semantic consistency. Furthermore, we propose two novel attention-based fusion strategies, Attention-weighted Concatenation (AC) and Cross-layer Attention Fusion (CAF), to improve contextual representation and reduce feature noise. Extensive experiments on benchmark datasets demonstrate that YOLO-FDA consistently outperforms existing state-of-the-art methods in terms of both accuracy and robustness across diverse types of defects and scales.[98] Tree-based Semantic Losses: Application to Sparsely-supervised Large Multi-class Hyperspectral Segmentation
Junwen Wang,Oscar Maccormac,William Rochford,Aaron Kujawa,Jonathan Shapey,Tom Vercauteren
Main category: cs.CV
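TL;DR: This paper proposes new semantic loss functions for biomedical segmentation of hyperspectral images, improving accuracy and enabling the identification of anomalous regions.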
Details
Motivation: Conventional learning methods penalise all errors equally and cannot exploit inter-class semantic relationships in the label space. Method: The paper introduces two tree-based semantic loss functions and integrates them into a training approach for sparse annotations. Result: The proposed method reaches state-of-the-art performance on a sparsely annotated hyperspectral image dataset comprising 107 classes and effectively detects out-of-distribution pixels. Conclusion: The paper proposes tree-based semantic loss functions that improve performance on biomedical segmentation tasks and enable effective detection of out-of-distribution pixels. Abstract: Hyperspectral imaging (HSI) shows great promise for surgical applications, offering detailed insights into biological tissue differences beyond what the naked eye can perceive. Refined labelling efforts are underway to train vision systems to distinguish large numbers of subtly varying classes. However, commonly used learning methods for biomedical segmentation tasks penalise all errors equivalently and thus fail to exploit any inter-class semantics in the label space. In this work, we introduce two tree-based semantic loss functions which take advantage of a hierarchical organisation of the labels. We further incorporate our losses in a recently proposed approach for training with sparse, background-free annotations. Extensive experiments demonstrate that our proposed method reaches state-of-the-art performance on a sparsely annotated HSI dataset comprising 107 classes organised in a clinically-defined semantic tree structure. Furthermore, our method enables effective detection of out-of-distribution (OOD) pixels without compromising segmentation performance on in-distribution (ID) pixels.[99] Robust Deep Learning for Myocardial Scar Segmentation in Cardiac MRI with Noisy Labels
Aida Moafi,Danial Moafi,Evgeny M. Mirkes,Gerry P. McCann,Abbas S. Alatrany,Jayanth R. Arnold,Mostafa Mehdipour Ghazi
Main category: cs.CV
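TL;DR: This study develops an effective deep-learning method for myocardial scar segmentation that addresses issues such as label noise and shows strong performance and generalization across a variety of conditions.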
Details
Motivation: Accurate segmentation of myocardial scars in cardiac MRI is essential for clinical assessment and treatment planning, but label noise, data heterogeneity, and class imbalance pose challenges that call for a robust method. Method: By fine-tuning state-of-the-art models, the authors propose a robust deep-learning pipeline for fully automated myocardial scar detection and segmentation, using Kullback-Leibler loss and extensive data augmentation to address label noise from semi-automatic annotations, data heterogeneity, and class imbalance. Result: The method performs well on both acute and chronic cases, produces accurate and smooth segmentations even with noisy labels, outperforms state-of-the-art models such as nnU-Net, and generalizes strongly to an out-of-distribution test set. Conclusion: The study provides a reliable deep-learning framework for automated myocardial scar segmentation and demonstrates its robustness and broad applicability across imaging conditions and clinical tasks. Abstract: The accurate segmentation of myocardial scars from cardiac MRI is essential for clinical assessment and treatment planning. In this study, we propose a robust deep-learning pipeline for fully automated myocardial scar detection and segmentation by fine-tuning state-of-the-art models. The method explicitly addresses challenges of label noise from semi-automatic annotations, data heterogeneity, and class imbalance through the use of Kullback-Leibler loss and extensive data augmentation. We evaluate the model's performance on both acute and chronic cases and demonstrate its ability to produce accurate and smooth segmentations despite noisy labels. In particular, our approach outperforms state-of-the-art models like nnU-Net and shows strong generalizability in an out-of-distribution test set, highlighting its robustness across various imaging conditions and clinical tasks. These results establish a reliable foundation for automated myocardial scar quantification and support the broader clinical adoption of deep learning in cardiac imaging.[100] Geometry and Perception Guided Gaussians for Multiview-consistent 3D Generation from a Single Image
Pufan Li,Bi'an Du,Wei Hu
Main category: cs.CV
TL;DR: This paper introduces an innovative method for generating realistic 3D objects from single-view images by integrating geometry and perception priors without extra training, achieving superior multiview consistency and geometric detail.
Details
Motivation: The motivation is to overcome the limitations of existing methods for generating 3D objects from single-view images, which often suffer from poor multiview consistency and lack of geometric detail. The authors aim to create a method that ensures both natural appearance and detailed 3D reconstruction. Method: The paper proposes a method that trains three different Gaussian branches initialized from geometry prior, perception prior, and Gaussian noise. The geometry prior captures rough 3D shapes, while the perception prior leverages a pretrained 2D diffusion model to enhance multiview information. These branches are refined through mutual interaction and a reprojection-based strategy to enforce depth consistency. Result: Experiments show that the proposed method outperforms existing approaches in novel view synthesis and 3D reconstruction, demonstrating robust, consistent, and high-fidelity 3D object generation. Conclusion: The paper concludes that their novel method, which integrates geometry and perception priors without requiring additional model training, achieves higher-fidelity 3D object reconstructions with improved multiview consistency and geometric detail compared to existing approaches. Abstract: Generating realistic 3D objects from single-view images requires natural appearance, 3D consistency, and the ability to capture multiple plausible interpretations of unseen regions. Existing approaches often rely on fine-tuning pretrained 2D diffusion models or directly generating 3D information through fast network inference or 3D Gaussian Splatting, but their results generally suffer from poor multiview consistency and lack geometric detail. To tackle these issues, we present a novel method that seamlessly integrates geometry and perception priors without requiring additional model training to reconstruct detailed 3D objects from a single image. 
Specifically, we train three different Gaussian branches initialized from the geometry prior, perception prior and Gaussian noise, respectively. The geometry prior captures the rough 3D shapes, while the perception prior utilizes the 2D pretrained diffusion model to enhance multiview information. Subsequently, we refine 3D Gaussian branches through mutual interaction between geometry and perception priors, further enhanced by a reprojection-based strategy that enforces depth consistency. Experiments demonstrate the higher-fidelity reconstruction results of our method, outperforming existing methods on novel view synthesis and 3D reconstruction, demonstrating robust and consistent 3D object generation.[101] Topology-Aware Modeling for Unsupervised Simulation-to-Reality Point Cloud Recognition
Longkun Zou,Kangjun Liu,Ke Chen,Kailing Guo,Kui Jia,Yaowei Wang
Main category: cs.CV
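TL;DR: This paper proposes a topology-aware modeling framework to address the simulation-to-reality domain gap, improving the generalization of point cloud classification by leveraging global spatial topology and the topological relations of local geometric features.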
Details
Motivation: Because of differences in data acquisition methods, learning semantic representations from point sets of 3D object shapes is challenged by significant geometric variations, which makes it difficult for current unsupervised domain adaptation techniques to bridge the simulation-to-reality gap. Method: The paper introduces a novel Topology-Aware Modeling (TAM) framework that leverages global spatial topology and the topological relations of local geometric features for Sim2Real UDA, and proposes an advanced self-training strategy that combines cross-domain contrastive learning with self-training. Result: On three public Sim2Real benchmarks, the TAM framework is shown to be effective and consistently outperforms existing methods across all evaluated tasks. Conclusion: Experimental results show that the TAM framework outperforms existing techniques on Sim2Real UDA tasks, and the source code will be released to facilitate further research. Abstract: Learning semantic representations from point sets of 3D object shapes is often challenged by significant geometric variations, primarily due to differences in data acquisition methods. Typically, training data is generated using point simulators, while testing data is collected with distinct 3D sensors, leading to a simulation-to-reality (Sim2Real) domain gap that limits the generalization ability of point classifiers. Current unsupervised domain adaptation (UDA) techniques struggle with this gap, as they often lack robust, domain-insensitive descriptors capable of capturing global topological information, resulting in overfitting to the limited semantic patterns of the source domain. To address this issue, we introduce a novel Topology-Aware Modeling (TAM) framework for Sim2Real UDA on object point clouds. Our approach mitigates the domain gap by leveraging global spatial topology, characterized by low-level, high-frequency 3D structures, and by modeling the topological relations of local geometric features through a novel self-supervised learning task. Additionally, we propose an advanced self-training strategy that combines cross-domain contrastive learning with self-training, effectively reducing the impact of noisy pseudo-labels and enhancing the robustness of the adaptation process. Experimental results on three public Sim2Real benchmarks validate the effectiveness of our TAM framework, showing consistent improvements over state-of-the-art methods across all evaluated tasks. The source code of this work will be available at https://github.com/zou-longkun/TAG.git.[102] Task-Aware KV Compression For Cost-Effective Long Video Understanding
Minghao Qin,Yan Shu,Peitian Zhang,Kun Lun,Huaying Yuan,Juenjie Zhou,Shitao Xiao,Bo Zhao,Zheng Liu
Main category: cs.CV
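TL;DR: This paper proposes Video-X^2L, an efficient long-video understanding method that uses bi-level KV compression and selective KV re-loading to reduce computational cost while preserving critical video information.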
Details
Motivation: Existing multimodal large language models (MLLMs) face a prohibitive computational cost on long-video understanding tasks, and existing KV compression methods often lose significant information at high compression ratios. Method: The paper proposes bi-level KV compression and selective KV re-loading: low-compression and high-compression KVs are both generated to retain key video information, and the decoding stage dynamically selects which KVs to use for each video chunk. Result: Video-X^2L performs strongly on several popular LVU benchmarks, with a clear advantage over existing KV compression methods and substantial savings in computation cost. Conclusion: Video-X^2L requires no additional training, is compatible with existing KV-compressible MLLMs, and addresses both the computational cost and the information loss of long-video understanding. Abstract: Long-video understanding (LVU) remains a severe challenge for existing multimodal large language models (MLLMs), primarily due to the prohibitive computational cost. Recent approaches have explored KV compression to mitigate this issue, but they often suffer from significant information loss at high compression ratios. In this paper, we introduce Video-X^2L, which flexibly preserves critical video information for each LVU task. Video-X^2L involves two key operations. The first one is called bi-level KV compression. During the MLLM's pre-filling stage, Video-X^2L generates two types of compressed KVs: low-compression KVs (L-KVs) to capture fine-grained video details and high-compression KVs (H-KVs) to offer compact video representations. The second one is called selective KV re-loading. During the MLLM's decoding stage, Video-X^2L selectively re-loads L-KVs for the most critical video chunks while using H-KVs for other less important ones. This allows the MLLM to fully utilize task-specific information while maintaining the overall compactness. Video-X^2L is simple yet effective: it is free from additional training and directly compatible with existing KV-compressible MLLMs. We evaluate Video-X^2L with a variety of popular LVU benchmarks, including VideoMME, MLVU, LongVideoBench, and VNBench. Our experimental results show that Video-X^2L outperforms existing KV-compression methods by a large margin while substantially saving computation cost.[103] Out-of-Distribution Semantic Occupancy Prediction
Yuheng Zhang,Mengfei Duan,Kunyu Peng,Yuhang Wang,Ruiping Liu,Fei Teng,Kai Luo,Zhiyong Li,Kailun Yang
Main category: cs.CV
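TL;DR: This paper proposes OccOoD, a new 3D semantic occupancy prediction framework for detecting out-of-distribution (OoD) objects, and introduces a synthetic anomaly integration pipeline used to create two new datasets, VAA-KITTI and VAA-KITTI-360.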
Details
Motivation: Existing 3D semantic occupancy prediction methods focus on in-distribution scenes and are susceptible to OoD objects and long-tail distributions, so anomalies may go undetected or be misinterpreted, which poses safety hazards. Method: The authors propose a Synthetic Anomaly Integration Pipeline that injects synthetic anomalies while preserving realistic spatial and occlusion patterns, and design a new framework, OccOoD, in which Voxel-BEV Progressive Fusion (VBPF) uses an RWKV-based branch to strengthen OoD detection. Result: OccOoD reaches an AuROC of 67.34% and an AuPRCr of 29.21% within a 1.2m region, which is state-of-the-art OoD detection, while maintaining competitive occupancy prediction performance. Conclusion: The study integrates OoD detection into 3D semantic occupancy prediction, offering safer and more reliable environment perception for autonomous driving. Abstract: 3D Semantic Occupancy Prediction is crucial for autonomous driving, providing a dense, semantically rich environmental representation. However, existing methods focus on in-distribution scenes, making them susceptible to Out-of-Distribution (OoD) objects and long-tail distributions, which increases the risk of undetected anomalies and misinterpretations, posing safety hazards. To address these challenges, we introduce Out-of-Distribution Semantic Occupancy Prediction, targeting OoD detection in 3D voxel space. To fill the gaps in the dataset, we propose a Synthetic Anomaly Integration Pipeline that injects synthetic anomalies while preserving realistic spatial and occlusion patterns, enabling the creation of two datasets: VAA-KITTI and VAA-KITTI-360. We introduce OccOoD, a novel framework integrating OoD detection into 3D semantic occupancy prediction, with Voxel-BEV Progressive Fusion (VBPF) leveraging an RWKV-based branch to enhance OoD detection via geometry-semantic fusion. Experimental results demonstrate that OccOoD achieves state-of-the-art OoD detection with an AuROC of 67.34% and an AuPRCr of 29.21% within a 1.2m region, while maintaining competitive occupancy prediction performance. The established datasets and source code will be made publicly available at https://github.com/7uHeng/OccOoD.[104] GroundFlow: A Plug-in Module for Temporal Reasoning on 3D Point Cloud Sequential Grounding
Zijun Lin,Shuting He,Cheston Tan,Bihan Wen
Main category: cs.CV
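TL;DR: This paper proposes GroundFlow, a new method that improves the ability of 3D visual grounding models to locate objects under complex multi-step instructions, with significant gains on the SG3D task.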
Details
Motivation: Current 3D visual grounding methods cannot effectively extract temporal information from each step of a text instruction, whereas the SG3D task requires understanding context and retrieving relevant information from previous steps to locate object sequences correctly. Method: The paper proposes GroundFlow, a plug-in module for temporal reasoning on 3D point cloud sequential grounding, which is combined with existing 3DVG baseline methods. Result: On the SG3D benchmark, GroundFlow substantially improves the task accuracy of 3DVG baselines (+7.5% and +10.2%), even outperforming a pre-trained 3D large language model. Conclusion: GroundFlow improves the temporal reasoning of 3DVG models on the SG3D task and achieves state-of-the-art performance across five datasets. Abstract: Sequential grounding in 3D point clouds (SG3D) refers to locating sequences of objects by following text instructions for a daily activity with detailed steps. Current 3D visual grounding (3DVG) methods treat text instructions with multiple steps as a whole, without extracting useful temporal information from each step. However, the instructions in SG3D often contain pronouns such as "it", "here" and "the same" to make language expressions concise. This requires grounding methods to understand the context and retrieve relevant information from previous steps to correctly locate object sequences. Due to the lack of an effective module for collecting related historical information, state-of-the-art 3DVG methods face significant challenges in adapting to the SG3D task. To fill this gap, we propose GroundFlow -- a plug-in module for temporal reasoning on 3D point cloud sequential grounding. Firstly, we demonstrate that integrating GroundFlow improves the task accuracy of 3DVG baseline methods by a large margin (+7.5% and +10.2%) in the SG3D benchmark, even outperforming a 3D large language model pre-trained on various datasets. Furthermore, we selectively extract both short-term and long-term step information based on its relevance to the current instruction, enabling GroundFlow to take a comprehensive view of historical information and maintain its temporal understanding advantage as step counts increase. Overall, our work introduces temporal reasoning capabilities to existing 3DVG models and achieves state-of-the-art performance in the SG3D benchmark across five datasets.[105] Unlocking Constraints: Source-Free Occlusion-Aware Seamless Segmentation
Yihong Cao,Jiaming Zhang,Xu Zheng,Hao Shi,Kunyu Peng,Hang Liu,Kailun Yang,Hui Zhang
Main category: cs.CV
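TL;DR: This paper proposes UNLOCK, a new method that achieves seamless segmentation of panoramic images without access to source data, addresses distortion and occlusion, and performs strongly in experiments.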
Details
Motivation: Panoramic image processing faces distortion, occlusion, and limited annotations, and existing unsupervised domain adaptation methods require access to source data, so the authors propose a more practical task, SFOASS. Method: The paper introduces two modules, Omni Pseudo-Labeling Learning and Amodal-Driven Context Learning, to handle panoramic image segmentation under the source-free constraint. Result: Experiments show scores of 10.9 in mAAP and 11.6 in mAP, along with an improvement of +4.3 in mAPQ over the source-only method. Conclusion: The paper proposes UNLOCK, a source-free panoramic image segmentation method whose performance is comparable to source-dependent methods and which reaches state-of-the-art scores on specific metrics. Abstract: Panoramic image processing is essential for omni-context perception, yet faces constraints like distortions, perspective occlusions, and limited annotations. Previous unsupervised domain adaptation methods transfer knowledge from labeled pinhole data to unlabeled panoramic images, but they require access to source pinhole data. To address these, we introduce a more practical task, i.e., Source-Free Occlusion-Aware Seamless Segmentation (SFOASS), and propose its first solution, called UNconstrained Learning Omni-Context Knowledge (UNLOCK). Specifically, UNLOCK includes two key modules: Omni Pseudo-Labeling Learning and Amodal-Driven Context Learning. While adapting without relying on source data or target labels, this framework enhances models to achieve segmentation with 360° viewpoint coverage and occlusion-aware reasoning. Furthermore, we benchmark the proposed SFOASS task through both real-to-real and synthetic-to-real adaptation settings. Experimental results show that our source-free method achieves performance comparable to source-dependent methods, yielding state-of-the-art scores of 10.9 in mAAP and 11.6 in mAP, along with an absolute improvement of +4.3 in mAPQ over the source-only method. All data and code will be made publicly available at https://github.com/yihong-97/UNLOCK.[106] MedPrompt: LLM-CNN Fusion with Weight Routing for Medical Image Segmentation and Classification
Shadman Sobhan,Kazi Abrar Mahmud,Abduz Zami
Main category: cs.CV
TL;DR: MedPrompt is a unified framework for medical image analysis that interprets user instructions and dynamically routes task-specific pretrained weights without retraining the entire framework.
Details
Motivation: Current medical image analysis systems are typically task-specific and lack the flexibility to support user-defined workflows. Method: MedPrompt combines a few-shot prompted Large Language Model (Llama-4-17B) for high-level task planning with a modular Convolutional Neural Network (DeepFusionLab) for low-level image processing. Result: The system achieves 97% end-to-end correctness in interpreting and executing prompt-driven instructions, with an average inference latency of 2.5 seconds. DeepFusionLab achieves competitive segmentation accuracy (e.g., Dice 0.9856 on lungs) and strong classification performance (F1 0.9744 on tuberculosis). Conclusion: MedPrompt enables scalable, prompt-driven medical imaging by combining the interpretability of LLMs with the efficiency of modular CNNs. Abstract: Current medical image analysis systems are typically task-specific, requiring separate models for classification and segmentation, and lack the flexibility to support user-defined workflows. To address these challenges, we introduce MedPrompt, a unified framework that combines a few-shot prompted Large Language Model (Llama-4-17B) for high-level task planning with a modular Convolutional Neural Network (DeepFusionLab) for low-level image processing. The LLM interprets user instructions and generates structured output to dynamically route task-specific pretrained weights. This weight routing approach avoids retraining the entire framework when adding new tasks: only task-specific weights are required, which improves scalability and eases deployment. We evaluated MedPrompt across 19 public datasets, covering 12 tasks spanning 5 imaging modalities. The system achieves 97% end-to-end correctness in interpreting and executing prompt-driven instructions, with an average inference latency of 2.5 seconds, making it suitable for near real-time applications.
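As a rough illustration of the weight-routing idea described above, the sketch below parses a planner's structured output and looks up task-specific weight files in a registry. The JSON schema, the registry contents, the file paths, and the function names are all hypothetical and are not taken from MedPrompt; the point is only that adding a task means registering one more entry rather than retraining anything.

```python
import json

# Hypothetical registry: (task, target) -> path of task-specific pretrained weights.
WEIGHT_REGISTRY = {
    ("segmentation", "lungs"): "weights/seg_lungs.pt",
    ("classification", "tuberculosis"): "weights/cls_tb.pt",
}


def register_task(task, target, path):
    """Add a new task by registering its weights; nothing else changes."""
    WEIGHT_REGISTRY[(task, target)] = path


def route_weights(llm_output):
    """Parse the planner's JSON (assumed schema) and return weight paths in order."""
    plan = json.loads(llm_output)
    paths = []
    for step in plan["steps"]:
        key = (step["task"], step["target"])
        if key not in WEIGHT_REGISTRY:
            raise KeyError(f"no pretrained weights registered for {key}")
        paths.append(WEIGHT_REGISTRY[key])
    return paths
```

A caller would pass the returned paths to whatever model loader the CNN backbone uses; that loading step is deliberately left out because it depends on the framework.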
DeepFusionLab achieves competitive segmentation accuracy (e.g., Dice 0.9856 on lungs) and strong classification performance (F1 0.9744 on tuberculosis). Overall, MedPrompt enables scalable, prompt-driven medical imaging by combining the interpretability of LLMs with the efficiency of modular CNNs.[107] BitMark for Infinity: Watermarking Bitwise Autoregressive Image Generative Models
Louis Kerner,Michel Meintz,Bihe Zhao,Franziska Boenisch,Adam Dziedzic
Main category: cs.CV
TL;DR: This paper introduces BitMark, a robust bitwise watermarking method for text-to-image models like Infinity, designed to prevent model collapse by embedding detectable watermarks that remain traceable even after being used to train other models.
Details
Motivation: The increasing use of generated images as training data poses a risk of model collapse, where performance degrades due to repeated training on synthetic content. A robust watermarking solution is needed to identify and mitigate the impact of such generated data. Method: BitMark embeds a watermark directly at the bit level of the token stream across multiple scales during Infinity's image generation process. It subtly influences bits to maintain visual fidelity and generation speed while resisting removal techniques. Result: BitMark successfully embeds watermarks that are imperceptible yet detectable, robust against various removal methods, and highly radioactive, so that even fine-tuned models trained on watermarked images retain detectable traces of the watermark. Conclusion: BitMark is an effective bitwise watermarking framework that helps prevent model collapse in image generative models by embedding robust and radioactive watermarks during the generation process, ensuring reliable detection of generated content. Abstract: State-of-the-art text-to-image models like Infinity generate photorealistic images at an unprecedented speed. These models operate in a bitwise autoregressive manner over a discrete set of tokens that is practically infinite in size. However, their impressive generative power comes with a growing risk: as their outputs increasingly populate the Internet, they are likely to be scraped and reused as training data, potentially by the very same models. This phenomenon has been shown to lead to model collapse, where repeated training on generated content, especially from the models' own previous versions, causes a gradual degradation in performance. A promising mitigation strategy is watermarking, which embeds human-imperceptible yet detectable signals into generated images, enabling the identification of generated content. In this work, we introduce BitMark, a robust bitwise watermarking framework for Infinity.
Our method embeds a watermark directly at the bit level of the token stream across multiple scales (also referred to as resolutions) during Infinity's image generation process. Our bitwise watermark subtly influences the bits to preserve visual fidelity and generation speed while remaining robust against a spectrum of removal techniques. Furthermore, it exhibits high radioactivity, i.e., when watermarked generated images are used to train another image generative model, this second model's outputs will also carry the watermark. The radioactive traces remain detectable even when only fine-tuning diffusion or image autoregressive models on images watermarked with our BitMark. Overall, our approach provides a principled step toward preventing model collapse in image generative models by enabling reliable detection of generated outputs.[108] ReME: A Data-Centric Framework for Training-Free Open-Vocabulary Segmentation
Xiwei Xuan,Ziquan Deng,Kwan-Liu Ma
Main category: cs.CV
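TL;DR: This paper proposes a new training-free open-vocabulary semantic segmentation method that significantly improves on existing techniques by constructing a high-quality reference set and using similarity-based retrieval.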
Details
Motivation: Existing training-free open-vocabulary semantic segmentation solutions are limited by the capability of the models they rely on or by the suboptimal quality of their reference sets, so the data quality problem needs to be examined in order to improve performance. Method: The authors build a data pipeline that constructs a reference set with well-paired segment-text embeddings, and use a simple similarity-based retrieval to expose the essential effect of the data. Result: Extensive evaluations on ten benchmark datasets show that the method outperforms all existing training-free open-vocabulary semantic segmentation approaches. Conclusion: The authors propose a data-quality-oriented framework for training-free open-vocabulary semantic segmentation and demonstrate the importance of a high-quality reference set for this task. Abstract: Training-free open-vocabulary semantic segmentation (OVS) aims to segment images given a set of arbitrary textual categories without costly model fine-tuning. Existing solutions often explore attention mechanisms of pre-trained models, such as CLIP, or generate synthetic data and design complex retrieval processes to perform OVS. However, their performance is limited by the capability of the models they rely on or the suboptimal quality of reference sets. In this work, we investigate the largely overlooked data quality problem for this challenging dense scene understanding task, and identify that a high-quality reference set can significantly benefit training-free OVS. With this observation, we introduce a data-quality-oriented framework, comprising a data pipeline to construct a reference set with well-paired segment-text embeddings and a simple similarity-based retrieval to unveil the essential effect of data. Remarkably, extensive evaluations on ten benchmark datasets demonstrate that our method outperforms all existing training-free OVS approaches, highlighting the importance of data-centric design for advancing OVS without training. Our code is available at https://github.com/xiweix/ReME.[109] Real-Time ESFP: Estimating, Smoothing, Filtering, and Pose-Mapping
Qifei Cui,Yuang Zhou,Ruichen Deng
Main category: cs.CV
TL;DR: This paper introduces ESFP, a comprehensive pipeline that translates RGB video into joint trajectories for a desktop robotic arm using a combination of estimation, smoothing, filtering, and mapping techniques.
Details
Motivation: The goal is an end-to-end pipeline that turns simple video input into usable joint movements for a low-cost robotic arm, enabling accessible robotics applications. Method: The method has four modules: Estimating 3-D skeletons from each frame, Smoothing with the HPSTM Transformer, Filtering by variance-weighting trajectories, and Pose-Mapping to the arm's workspace. Result: The resulting pipeline, ESFP, estimates, smooths, filters, and maps joint trajectories while maintaining anatomical plausibility and wrist orientation. Conclusion: ESFP converts monocular RGB video into executable joint trajectories for a low-cost 4-DoF desktop arm through four sequential modules. Abstract: This paper presents ESFP, an end-to-end pipeline that converts monocular RGB video into executable joint trajectories for a low-cost 4-DoF desktop arm. ESFP comprises four sequential modules. (1) Estimating: ROMP lifts each frame to a 24-joint 3-D skeleton. (2) Smoothing: the proposed HPSTM, a sequence-to-sequence Transformer with self-attention, combines long-range temporal context with a differentiable forward-kinematics decoder, enforcing constant bone lengths and anatomical plausibility while jointly predicting joint means and full covariances. (3) Filtering: root-normalized trajectories are variance-weighted according to HPSTM's uncertainty estimates, suppressing residual noise. (4) Pose-Mapping: a geometric retargeting layer transforms shoulder-elbow-wrist triples into the uArm's polar workspace, preserving wrist orientation.[110] DiMPLe -- Disentangled Multi-Modal Prompt Learning: Enhancing Out-Of-Distribution Alignment with Invariant and Spurious Feature Separation
Umaima Rahman,Mohammad Yaqub,Dwarikanath Mahapatra
Main category: cs.CV
TL;DR: DiMPLe improves multi-modal learning by disentangling invariant and spurious features, enhancing generalization and robustness.
Details
Motivation: Spurious correlations in visual data hinder out-of-distribution performance, and prior methods focus solely on image features. Method: DiMPLe uses mutual information minimization, spurious feature regularization, and contrastive learning on invariant features to disentangle features within and across modalities. Result: DiMPLe achieves absolute gains of 15.27 in base class accuracy and 44.31 in novel class accuracy, averaged across 11 datasets. Conclusion: DiMPLe outperforms CoOp-OOD in generalization to novel classes and robustness to distribution shifts. Abstract: We introduce DiMPLe (Disentangled Multi-Modal Prompt Learning), a novel approach to disentangle invariant and spurious features across vision and language modalities in multi-modal learning. Spurious correlations in visual data often hinder out-of-distribution (OOD) performance. Unlike prior methods focusing solely on image features, DiMPLe disentangles features within and across modalities while maintaining consistent alignment, enabling better generalization to novel classes and robustness to distribution shifts. Our method combines three key objectives: (1) mutual information minimization between invariant and spurious features, (2) spurious feature regularization, and (3) contrastive learning on invariant features. Extensive experiments show that DiMPLe outperforms CoOp-OOD when averaged across 11 diverse datasets, achieving absolute gains of 15.27 in base class accuracy and 44.31 in novel class accuracy.[111] Temporal Rate Reduction Clustering for Human Motion Segmentation
Xianghan Meng,Zhengyu Tong,Zhiyuan Huang,Chun-Guang Li
Main category: cs.CV
TL;DR: This paper introduces a new method called $\text{TR}^2\text{C}$ for Human Motion Segmentation that outperforms existing methods on benchmark datasets.
Details
Motivation: Existing approaches to Human Motion Segmentation (HMS) are dominated by subspace clustering methods, whose Union-of-Subspaces (UoS) assumption may not hold for videos that capture complex human motions against cluttered backgrounds. Method: The paper proposes a novel approach for HMS called Temporal Rate Reduction Clustering ($\text{TR}^2\text{C}$), which jointly learns structured representations and affinity to segment frame sequences in video. Result: The $\text{TR}^2\text{C}$ method achieves state-of-the-art performance on five benchmark HMS datasets using different feature extractors. Conclusion: The paper concludes that the proposed $\text{TR}^2\text{C}$ method achieves state-of-the-art performance on five benchmark HMS datasets with different feature extractors. Abstract: Human Motion Segmentation (HMS), which aims to partition videos into non-overlapping human motions, has attracted increasing research attention recently. Existing approaches for HMS are mainly dominated by subspace clustering methods, which are grounded on the assumption that high-dimensional temporal data align with a Union-of-Subspaces (UoS) distribution. However, the frames in video capturing complex human motions with cluttered backgrounds may not align well with the UoS distribution. In this paper, we propose a novel approach for HMS, named Temporal Rate Reduction Clustering ($\text{TR}^2\text{C}$), which jointly learns structured representations and affinity to segment the frame sequences in video. Specifically, the structured representations learned by $\text{TR}^2\text{C}$ remain temporally consistent and align well with a UoS structure, which is favorable for the HMS task.
We conduct extensive experiments on five benchmark HMS datasets and achieve state-of-the-art performances with different feature extractors.[112] DuET: Dual Incremental Object Detection via Exemplar-Free Task Arithmetic
Munish Monga,Vishal Chudasama,Pankaj Wasnik,Biplab Banerjee
Main category: cs.CV
TL;DR: This paper introduces DuET, a method for Dual Incremental Object Detection that simultaneously addresses class and domain shifts without requiring prior task data.
Details
Motivation: Real-world object detection systems need to learn new categories and adapt to changing environments without exemplars from previous tasks. Method: DuET uses a Task Arithmetic-based merging framework with a Directional Consistency Loss to enable stable incremental learning. Result: DuET achieves significant improvements in RAI while maintaining high retention and adaptation across multiple datasets. Conclusion: DuET effectively handles both class and domain shifts in object detection, outperforming existing methods. Abstract: Real-world object detection systems, such as those in autonomous driving and surveillance, must continuously learn new object categories and simultaneously adapt to changing environmental conditions. Existing approaches, Class Incremental Object Detection (CIOD) and Domain Incremental Object Detection (DIOD) only address one aspect of this challenge. CIOD struggles in unseen domains, while DIOD suffers from catastrophic forgetting when learning new classes, limiting their real-world applicability. To overcome these limitations, we introduce Dual Incremental Object Detection (DuIOD), a more practical setting that simultaneously handles class and domain shifts in an exemplar-free manner. We propose DuET, a Task Arithmetic-based model merging framework that enables stable incremental learning while mitigating sign conflicts through a novel Directional Consistency Loss. Unlike prior methods, DuET is detector-agnostic, allowing models like YOLO11 and RT-DETR to function as real-time incremental object detectors. To comprehensively evaluate both retention and adaptation, we introduce the Retention-Adaptability Index (RAI), which combines the Average Retention Index (Avg RI) for catastrophic forgetting and the Average Generalization Index for domain adaptability into a common ground. 
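For readers unfamiliar with task arithmetic, the sketch below shows the generic technique that model-merging frameworks of this kind build on: a task vector is the difference between fine-tuned and base weights, and merging adds scaled task vectors back onto the base. It also counts sign conflicts, the situation a directional-consistency term is meant to discourage. This is plain task arithmetic on toy lists, not DuET's implementation; the function names and the `scale` parameter are assumptions.

```python
def task_vector(base, finetuned):
    """Per-parameter difference between fine-tuned and base weights."""
    return {k: [f - b for f, b in zip(finetuned[k], base[k])] for k in base}


def merge(base, task_vectors, scale=1.0):
    """Add the scaled sum of task vectors onto the base weights."""
    merged = {}
    for k, b in base.items():
        merged[k] = [
            bi + scale * sum(tv[k][i] for tv in task_vectors)
            for i, bi in enumerate(b)
        ]
    return merged


def sign_conflicts(tv_a, tv_b):
    """Count parameters that the two task vectors push in opposite directions."""
    return sum(
        1
        for k in tv_a
        for a, b in zip(tv_a[k], tv_b[k])
        if a * b < 0
    )
```

With opposing signs the two updates partly cancel after merging, which is why a high conflict count is a warning sign for the merged model.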
Extensive experiments on the Pascal Series and Diverse Weather Series demonstrate DuET's effectiveness, achieving a +13.12% RAI improvement while preserving 89.3% Avg RI on the Pascal Series (4 tasks), as well as a +11.39% RAI improvement with 88.57% Avg RI on the Diverse Weather Series (3 tasks), outperforming existing methods.[113] Video Virtual Try-on with Conditional Diffusion Transformer Inpainter
Cheng Zou,Senlin Cheng,Bolei Xu,Dandan Zheng,Xiaobo Li,Jingdong Chen,Ming Yang
Main category: cs.CV
TL;DR: This paper introduces ViTI, a new method for video virtual try-on that improves spatial-temporal consistency by using a video inpainting framework based on Diffusion Transformer.
Details
Motivation: Applying an image-based try-on approach naively frame by frame yields poor spatial-temporal consistency in video virtual try-on. Recent diffusion-based video try-on methods have shown improvements, but inconsistency problems remain. Method: ViTI (Video Try-on Inpainter) formulates and implements video virtual try-on as a conditional video inpainting task. It uses a video inpainting framework based on Diffusion Transformer with full 3D spatial-temporal attention, which is progressively adapted for video garment inpainting through masking strategies and multi-stage training. Result: ViTI is shown to be superior to previous works in both quantitative and qualitative experiments. Conclusion: ViTI inpaints the masked garment area with appropriate garment pixels according to the prompt, with good spatial-temporal consistency, making it superior to previous works. Abstract: Video virtual try-on aims to naturally fit a garment to a target person in consecutive video frames. The task is challenging: on the one hand, the output video should have good spatial-temporal consistency; on the other hand, the details of the given garment need to be preserved well in all frames. Naively applying image-based try-on methods frame by frame gives poor results due to severe inconsistency. The few recent diffusion-based video try-on methods converge on a similar solution: inserting temporal attention into an image-based try-on model to adapt it to the video try-on task. This has shown improvements, but inconsistency problems remain. In this paper, we propose ViTI (Video Try-on Inpainter), which formulates and implements video virtual try-on as a conditional video inpainting task, unlike previous methods.
In this way, we start with a video generation problem instead of an image-based try-on problem, which gives better spatial-temporal consistency from the beginning. Specifically, we first build a video inpainting framework based on Diffusion Transformer with full 3D spatial-temporal attention, and then progressively adapt it for video garment inpainting with a collection of masking strategies and multi-stage training. After these steps, the model can inpaint the masked garment area with appropriate garment pixels according to the prompt with good spatial-temporal consistency. Finally, as in other try-on methods, a garment condition is added to the model to ensure that the inpainted garment appearance and details are as expected. Both quantitative and qualitative experimental results show that ViTI is superior to previous works.[114] WordCon: Word-level Typography Control in Scene Text Rendering
Wenda Shi,Yiren Song,Zihan Rao,Dengming Zhang,Jiaming Liu,Xingxing Zou
Main category: cs.CV
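TL;DR: This paper introduces the Text-Image Alignment (TIA) framework and the WordCon method to address word-level typography control in generated images, and reports significant results.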
Details
Motivation: Achieving precise word-level typography control in generated images remains a challenge, so a new method is needed to improve the controllability and quality of text-to-image models. Method: The authors construct a word-level controlled scene text dataset, propose the Text-Image Alignment (TIA) framework and WordCon, a hybrid parameter-efficient fine-tuning method, and use a masked loss and a joint-attention loss to guide model training. Result: The method outperforms prior work in both qualitative and quantitative experiments, and the datasets and source code will be released for academic use. Conclusion: The paper proposes a new text-image alignment framework and the WordCon method, achieving more precise word-level typography control, with controllability further enhanced by the masked loss and the joint-attention loss. Abstract: Achieving precise word-level typography control within generated images remains a persistent challenge. To address it, we construct a new word-level controlled scene text dataset and introduce the Text-Image Alignment (TIA) framework. This framework leverages cross-modal correspondence between text and local image regions provided by grounding models to enhance the Text-to-Image (T2I) model training. Furthermore, we propose WordCon, a hybrid parameter-efficient fine-tuning (PEFT) method. WordCon reparameterizes selective key parameters, improving both efficiency and portability. This allows seamless integration into diverse pipelines, including artistic text rendering, text editing, and image-conditioned text rendering. To further enhance controllability, the masked loss at the latent level is applied to guide the model to concentrate on learning the text region in the image, and the joint-attention loss provides feature-level supervision to promote disentanglement between different words. Both qualitative and quantitative results demonstrate the superiority of our method to the state of the art. The datasets and source code will be available for academic use.[115] HieraSurg: Hierarchy-Aware Diffusion Model for Surgical Video Generation
Diego Biagini,Nassir Navab,Azade Farshad
Main category: cs.CV
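TL;DR: This paper proposes HieraSurg, a surgical video generation framework that uses two diffusion models for coarse-to-fine video generation and performs strongly on a cholecystectomy dataset.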
Details
Motivation: Most existing surgical video generation methods are unconditional and fail to stay consistent with surgical actions and phases, lacking the surgical understanding and fine-grained guidance needed for factual simulation. Method: The paper proposes HieraSurg, a hierarchy-aware surgical video generation framework with two specialized diffusion models. The first model predicts future coarse-grained semantic changes, and the second combines the temporal segmentation maps with fine-grained visual features to generate the final video. Result: Experiments on cholecystectomy surgical video generation show that the model significantly outperforms prior work both quantitatively and qualitatively. Conclusion: HieraSurg generalizes well and can generate higher frame-rate videos, and its behavior when given existing segmentation maps suggests potential for practical surgical applications. Abstract: Surgical Video Synthesis has emerged as a promising research direction following the success of diffusion models in general-domain video generation. Although existing approaches achieve high-quality video generation, most are unconditional and fail to maintain consistency with surgical actions and phases, lacking the surgical understanding and fine-grained guidance necessary for factual simulation. We address these challenges by proposing HieraSurg, a hierarchy-aware surgical video generation framework consisting of two specialized diffusion models. Given a surgical phase and an initial frame, HieraSurg first predicts future coarse-grained semantic changes through a segmentation prediction model. The final video is then generated by a second-stage model that augments these temporal segmentation maps with fine-grained visual features, leading to effective texture rendering and integration of semantic information in the video space. Our approach leverages surgical information at multiple levels of abstraction, including surgical phase, action triplets, and panoptic segmentation maps. The experimental results on Cholecystectomy Surgical Video Generation demonstrate that the model significantly outperforms prior work both quantitatively and qualitatively, showing strong generalization capabilities and the ability to generate higher frame-rate videos. The model exhibits particularly fine-grained adherence when provided with existing segmentation maps, suggesting its potential for practical surgical applications.[116] Continual Self-Supervised Learning with Masked Autoencoders in Remote Sensing
Lars Möllenbrok,Behnood Rasti,Begüm Demir
Main category: cs.CV
TL;DR: CoSMAE enhances continual learning for remote sensing by combining data and model mixup techniques to reduce forgetting and improve task generalization.
Details
Motivation: Existing continual learning methods in remote sensing require a large number of labeled samples, which is costly and often infeasible. Method: CoSMAE uses data mixup and model mixup knowledge distillation to regularize masked autoencoders at the data and model levels. Result: CoSMAE achieves up to 4.94% improvement over state-of-the-art continual learning methods applied to masked autoencoders. Conclusion: CoSMAE effectively improves continual learning in remote sensing by reducing catastrophic forgetting and enhancing generalization across tasks. Abstract: The development of continual learning (CL) methods, which aim to learn new tasks in a sequential manner from the training data acquired continuously, has gained great attention in remote sensing (RS). The existing CL methods in RS, while learning new tasks, enhance robustness towards catastrophic forgetting. This is achieved by using a large number of labeled training samples, which is costly and not always feasible to gather in RS. To address this problem, we propose a novel continual self-supervised learning method in the context of masked autoencoders (denoted as CoSMAE). The proposed CoSMAE consists of two components: i) data mixup; and ii) model mixup knowledge distillation. Data mixup is associated with retaining information on previous data distributions by interpolating images from the current task with those from the previous tasks. Model mixup knowledge distillation is associated with distilling knowledge from past models and the current model simultaneously by interpolating their model weights to form a teacher for the knowledge distillation. The two components complement each other to regularize the MAE at the data and model levels to facilitate better generalization across tasks and reduce the risk of catastrophic forgetting. Experimental results show that CoSMAE achieves significant improvements of up to 4.94% over state-of-the-art CL methods applied to MAE. 
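The two interpolation steps described above can be written down in a few lines. The sketch below is a minimal, framework-free illustration on flat lists of floats: `data_mixup` blends a current-task image with a previous-task image, and `model_mixup` blends past and current weights to form a teacher. The function names, the coefficients `lam` and `alpha`, and the way they would be chosen are assumptions for illustration and are not taken from the CoSMAE code.

```python
def data_mixup(current_img, previous_img, lam):
    """Interpolate a current-task image with a previous-task image (pixelwise)."""
    return [lam * c + (1.0 - lam) * p for c, p in zip(current_img, previous_img)]


def model_mixup(past_weights, current_weights, alpha):
    """Interpolate past and current model weights to form a distillation teacher."""
    return {
        name: [
            alpha * p + (1.0 - alpha) * c
            for p, c in zip(past_weights[name], current_weights[name])
        ]
        for name in current_weights
    }
```

In a real training loop the mixed image would be fed to the student, and the interpolated weights would be loaded into a frozen teacher whose outputs supervise the student; both steps are omitted here.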
Our code is publicly available at: https://git.tu-berlin.de/rsim/CoSMAE.
[117] DrishtiKon: Multi-Granular Visual Grounding for Text-Rich Document Images
Badri Vishal Kasuba,Parag Chaudhuri,Ganesh Ramakrishnan
Main category: cs.CV
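TL;DR: This paper proposes DrishtiKon, a multi-granular visual grounding framework for complex multilingual documents that significantly improves the accuracy and interpretability of document intelligence and visual question answering systems.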
Details
Motivation: Visual grounding in text-rich document images is an important but underexplored challenge, particularly for document intelligence and visual question answering systems that require high-precision localization. Method: The approach integrates robust multilingual OCR, large language models, and a novel region matching algorithm, and is evaluated on a new benchmark with fine-grained, human-verified annotations. Result: The method achieves state-of-the-art grounding accuracy, with line-level granularity offering the best trade-off between precision and recall; ablation studies further highlight the benefits of multi-block and multi-line reasoning. Conclusion: The study demonstrates the effectiveness of DrishtiKon, a multi-granular visual grounding framework, in enhancing document intelligence and visual question answering systems, providing more robust and interpretable document understanding for real-world, text-centric scenarios. Abstract: Visual grounding in text-rich document images is a critical yet underexplored challenge for document intelligence and visual question answering (VQA) systems. We present DrishtiKon, a multi-granular visual grounding framework designed to enhance interpretability and trust in VQA for complex, multilingual documents. Our approach integrates robust multi-lingual OCR, large language models, and a novel region matching algorithm to accurately localize answer spans at block, line, word, and point levels. We curate a new benchmark from the CircularsVQA test set, providing fine-grained, human-verified annotations across multiple granularities. Extensive experiments demonstrate that our method achieves state-of-the-art grounding accuracy, with line-level granularity offering the best trade-off between precision and recall. Ablation studies further highlight the benefits of multi-block and multi-line reasoning. Comparative evaluations with leading vision-language models reveal the limitations of current VLMs in precise localization, underscoring the effectiveness of our structured, alignment-based approach. Our findings pave the way for more robust and interpretable document understanding systems in real-world, text-centric scenarios. Code and dataset have been made available at https://github.com/kasuba-badri-vishal/DhrishtiKon.
[118] LLaVA-Pose: Enhancing Human Pose and Action Understanding via Keypoint-Integrated Instruction Tuning
Dewen Zhang,Tahir Hussain,Wangpeng An,Hayaru Shouno
Main category: cs.CV
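TL;DR: This paper proposes a method that uses human keypoints to generate vision-language instruction-following data, improving the performance of vision-language models on human-centric tasks.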
Details
Motivation: Current vision-language models perform inadequately on complex visual tasks related to human poses and actions because specialized vision-language instruction-following data is lacking. Method: The method generates specialized vision-language instruction-following data by combining traditional visual features (such as captions and bounding boxes) with human keypoints, and builds a dataset of 200,328 samples for fine-tuning. Result: The fine-tuned LLaVA-Pose model achieves an overall improvement of 33.2% over the original LLaVA-1.5-7B model on the E-HPAUB benchmark. Conclusion: Keypoint-integrated data effectively enhances multimodal models for human-centric visual understanding. Abstract: Current vision-language models (VLMs) are well-adapted for general visual understanding tasks. However, they perform inadequately when handling complex visual tasks related to human poses and actions due to the lack of specialized vision-language instruction-following data. We introduce a method for generating such data by integrating human keypoints with traditional visual features such as captions and bounding boxes, enabling more precise understanding of human-centric scenes. Our approach constructs a dataset comprising 200,328 samples tailored to fine-tune models for human-centric tasks, focusing on three areas: conversation, detailed description, and complex reasoning. We establish an Extended Human Pose and Action Understanding Benchmark (E-HPAUB) to assess model performance on human pose and action understanding. We fine-tune the LLaVA-1.5-7B model using this dataset and evaluate our resulting LLaVA-Pose model on the benchmark, achieving significant improvements. Experimental results show an overall improvement of 33.2% compared to the original LLaVA-1.5-7B model. These findings highlight the effectiveness of keypoint-integrated data in enhancing multimodal models for human-centric visual understanding. Code is available at https://github.com/Ody-trek/LLaVA-Pose.
[119] Holistic Surgical Phase Recognition with Hierarchical Input Dependent State Space Models
Haoyang Wu,Tsun-Hsuan Wang,Mathias Lechner,Ramin Hasani,Jennifer A. Eckhoff,Paul Pak,Ozanan R. Meireles,Guy Rosman,Yutong Ban,Daniela Rus
Main category: cs.CV
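TL;DR: To address the inefficiency of conventional Transformer models on long surgical videos, this paper proposes a new hierarchical input-dependent state space model framework that models both local and global dynamics and is trained with a hybrid supervision strategy, achieving significant performance gains on multiple surgical data analysis tasks.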
Details
Motivation: Surgical workflow analysis is essential in robot-assisted surgery, but conventional Transformer models struggle to process long surgical videos efficiently because of the limits of their attention mechanism, so a more efficient solution is needed. Method: The paper designs a framework comprising a local-aggregation state space model block and a global-relation state space model block, combined with a temporally consistent visual feature extractor that propagates temporal information. Result: Experiments show that the proposed method outperforms the current state-of-the-art methods by +2.8%, +4.3%, and +12.9% on the Cholec80, MICCAI2016, and Heichole datasets, respectively. Conclusion: The paper proposes a hierarchical input-dependent framework based on state space models that captures the local and global dynamics of surgical videos through local-aggregation and global-relation modules and is trained with a hybrid discrete-continuous supervision strategy, enabling decision making on full-length videos. Abstract: Surgical workflow analysis is essential in robot-assisted surgeries, yet the long duration of such procedures poses significant challenges for comprehensive video analysis. Recent approaches have predominantly relied on transformer models; however, their quadratic attention mechanism restricts efficient processing of lengthy surgical videos. In this paper, we propose a novel hierarchical input-dependent state space model that leverages the linear scaling property of state space models to enable decision making on full-length videos while capturing both local and global dynamics. Our framework incorporates a temporally consistent visual feature extractor, which appends a state space model head to a visual feature extractor to propagate temporal information. The proposed model consists of two key modules: a local-aggregation state space model block that effectively captures intricate local dynamics, and a global-relation state space model block that models temporal dependencies across the entire video. The model is trained using a hybrid discrete-continuous supervision strategy, where both signals of discrete phase labels and continuous phase progresses are propagated through the network. Experiments have shown that our method outperforms the current state-of-the-art methods by a large margin (+2.8% on Cholec80, +4.3% on MICCAI2016, and +12.9% on Heichole datasets). Code will be publicly available after paper acceptance.
[120] PanSt3R: Multi-view Consistent Panoptic Segmentation
Lojze Zust,Yohann Cabon,Juliette Marrie,Leonid Antsfeld,Boris Chidlovskii,Jerome Revaud,Gabriela Csurka
Main category: cs.CV
TL;DR: This paper presents PanSt3R, an efficient and scalable approach for panoptic segmentation of 3D scenes that avoids test-time optimization by jointly predicting 3D geometry and multi-view panoptic segmentation in a single forward pass.
Details
Motivation: Existing approaches to panoptic segmentation of 3D scenes are suboptimal as they rely on 2D panoptic segmentations, failing to fully leverage spatial relationships across views. They also require camera parameters and computationally expensive test-time optimization. Method: The work proposes PanSt3R, which jointly predicts 3D geometry and multi-view panoptic segmentation in a single forward pass without test-time optimization. It builds upon MUSt3R, enhances it with semantic awareness, and introduces a more principled approach for multi-view segmentation. Additionally, a method for generating novel-view predictions based on PanSt3R and vanilla 3DGS is introduced. Result: The proposed PanSt3R achieves state-of-the-art performance on several benchmarks and is significantly faster than existing methods. Conclusion: PanSt3R is conceptually simple, fast, scalable, and achieves state-of-the-art performance on several benchmarks while being significantly faster than existing methods. Abstract: Panoptic segmentation of 3D scenes, involving the segmentation and classification of object instances in a dense 3D reconstruction of a scene, is a challenging problem, especially when relying solely on unposed 2D images. Existing approaches typically leverage off-the-shelf models to extract per-frame 2D panoptic segmentations, before optimizing an implicit geometric representation (often based on NeRF) to integrate and fuse the 2D predictions. We argue that relying on 2D panoptic segmentation for a problem inherently 3D and multi-view is likely suboptimal as it fails to leverage the full potential of spatial relationships across views. In addition to requiring camera parameters, these approaches also necessitate computationally expensive test-time optimization for each scene. 
Instead, in this work, we propose a unified and integrated approach PanSt3R, which eliminates the need for test-time optimization by jointly predicting 3D geometry and multi-view panoptic segmentation in a single forward pass. Our approach builds upon recent advances in 3D reconstruction, specifically upon MUSt3R, a scalable multi-view version of DUSt3R, and enhances it with semantic awareness and multi-view panoptic segmentation capabilities. We additionally revisit the standard post-processing mask merging procedure and introduce a more principled approach for multi-view segmentation. We also introduce a simple method for generating novel-view predictions based on the predictions of PanSt3R and vanilla 3DGS. Overall, the proposed PanSt3R is conceptually simple, yet fast and scalable, and achieves state-of-the-art performance on several benchmarks, while being orders of magnitude faster than existing methods.
[121] Generalizable Neural Electromagnetic Inverse Scattering
Yizhe Cheng,Chunxun Tian,Haoru Wang,Wentao Zhu,Xiaoxuan Ma,Yizhou Wang
Main category: cs.CV
TL;DR: This paper introduces a generalizable, physics-driven framework for solving electromagnetic inverse scattering problems, achieving better accuracy and robustness than existing methods.
Details
Motivation: The authors aim to overcome limitations in existing methods like Img-Interiors, particularly regarding generalization, case-specific optimization, and performance under sparse transmitter setups. Method: They propose a two-stage framework consisting of a current estimator and a permittivity solver, which decouples the scattering process from the inverse problem. Result: The experiments demonstrate that the proposed method outperforms state-of-the-art approaches in reconstruction accuracy, generalization, and robustness. Conclusion: The paper concludes that their proposed method offers a new physics-informed perspective for solving EISPs, leading to improved practical solutions in electromagnetic imaging. Abstract: Solving Electromagnetic Inverse Scattering Problems (EISP) is fundamental in applications such as medical imaging, where the goal is to reconstruct the relative permittivity from scattered electromagnetic field. This inverse process is inherently ill-posed and highly nonlinear, making it particularly challenging. A recent machine learning-based approach, Img-Interiors, shows promising results by leveraging continuous implicit functions. However, it requires case-specific optimization, lacks generalization to unseen data, and fails under sparse transmitter setups (e.g., with only one transmitter). To address these limitations, we revisit EISP from a physics-informed perspective, reformulating it as a two stage inverse transmission-scattering process. This formulation reveals the induced current as a generalizable intermediate representation, effectively decoupling the nonlinear scattering process from the ill-posed inverse problem. Built on this insight, we propose the first generalizable physics-driven framework for EISP, comprising a current estimator and a permittivity solver, working in an end-to-end manner. 
The current estimator explicitly learns the induced current as a physical bridge between the incident and scattered field, while the permittivity solver computes the relative permittivity directly from the estimated induced current. This design enables data-driven training and generalizable feed-forward prediction of relative permittivity on unseen data while maintaining strong robustness to transmitter sparsity. Extensive experiments show that our method outperforms state-of-the-art approaches in reconstruction accuracy, generalization, and robustness. This work offers a fundamentally new perspective on electromagnetic inverse scattering and represents a major step toward cost-effective practical solutions for electromagnetic imaging.
[122] ShotBench: Expert-Level Cinematic Understanding in Vision-Language Models
Hongbo Liu,Jingwen He,Yi Jin,Dian Zheng,Yuhao Dong,Fan Zhang,Ziqi Huang,Yinan He,Yangguang Li,Weichao Chen,Yu Qiao,Wanli Ouyang,Shengjie Zhao,Ziwei Liu
Main category: cs.CV
TL;DR: This paper introduces ShotBench, a benchmark for evaluating Vision-Language Models' understanding of cinematic grammar. It also presents ShotQA, a large dataset, and ShotVL, a model that achieves state-of-the-art results in this domain.
Details
Motivation: The motivation is to address the gap in Vision-Language Models' ability to understand nuanced cinematic grammar, which limits advancements in fine-grained visual comprehension and AI-assisted video generation. Method: The researchers introduced ShotBench, a benchmark for cinematic language understanding with over 3.5k expert-annotated QA pairs from acclaimed films. They evaluated 24 leading VLMs and identified their shortcomings. To improve performance, they created ShotQA, a large-scale multimodal dataset, and developed ShotVL through supervised fine-tuning and Group Relative Policy Optimization. Result: The evaluation showed that even top-performing VLMs had less than 60% average accuracy on ShotBench, especially struggling with fine-grained visual cues and spatial reasoning. Using ShotQA, the ShotVL model significantly outperformed all existing models, setting a new state-of-the-art. Conclusion: The study concludes that despite the limitations of current Vision-Language Models in understanding cinematic grammar, the proposed ShotVL model, developed using the ShotQA dataset and advanced training techniques, achieves state-of-the-art performance on the ShotBench benchmark. Abstract: Cinematography, the fundamental visual language of film, is essential for conveying narrative, emotion, and aesthetic quality. While recent Vision-Language Models (VLMs) demonstrate strong general visual understanding, their proficiency in comprehending the nuanced cinematic grammar embedded within individual shots remains largely unexplored and lacks robust evaluation. This critical gap limits both fine-grained visual comprehension and the precision of AI-assisted video generation. To address this, we introduce ShotBench, a comprehensive benchmark specifically designed for cinematic language understanding.
It features over 3.5k expert-annotated QA pairs from images and video clips, meticulously curated from over 200 acclaimed (predominantly Oscar-nominated) films and spanning eight key cinematography dimensions. Our evaluation of 24 leading VLMs on ShotBench reveals their substantial limitations: even the top-performing model achieves less than 60% average accuracy, particularly struggling with fine-grained visual cues and complex spatial reasoning. To catalyze advancement in this domain, we construct ShotQA, a large-scale multimodal dataset comprising approximately 70k cinematic QA pairs. Leveraging ShotQA, we develop ShotVL through supervised fine-tuning and Group Relative Policy Optimization. ShotVL significantly outperforms all existing open-source and proprietary models on ShotBench, establishing new state-of-the-art performance. We open-source our models, data, and code to foster rapid progress in this crucial area of AI-driven cinematic understanding and generation.
[123] CoPa-SG: Dense Scene Graphs with Parametric and Proto-Relations
Julian Lorenz,Mrunmai Phatak,Robin Schön,Katja Ludwig,Nico Hörmann,Annemarie Friedrich,Rainer Lienhart
Main category: cs.CV
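TL;DR: This paper presents CoPa-SG, a synthetic scene graph dataset with highly precise ground truth and exhaustive relation annotations, together with two new concepts, parametric relations and proto-relations, that enhance planning and reasoning for scene understanding.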
Details
Motivation: Current research is limited by the lack of accurate scene graph data, so a new dataset is needed to remove this bottleneck. Method: The paper presents a synthetic scene graph dataset named CoPa-SG and introduces two new concepts: parametric relations and proto-relations. Result: CoPa-SG is used to compare the performance of various scene graph generation models, and the paper shows how the new relation types can be integrated into downstream applications. Conclusion: CoPa-SG overcomes the inaccuracy of existing scene graph data and, by introducing parametric and proto-relations, provides more fine-grained and hypothetical relation representations that enhance planning and reasoning capabilities. Abstract: 2D scene graphs provide a structural and explainable framework for scene understanding. However, current work still struggles with the lack of accurate scene graph data. To overcome this data bottleneck, we present CoPa-SG, a synthetic scene graph dataset with highly precise ground truth and exhaustive relation annotations between all objects. Moreover, we introduce parametric and proto-relations, two new fundamental concepts for scene graphs. The former provides a much more fine-grained representation than its traditional counterpart by enriching relations with additional parameters such as angles or distances. The latter encodes hypothetical relations in a scene graph and describes how relations would form if new objects are placed in the scene. Using CoPa-SG, we compare the performance of various scene graph generation models. We demonstrate how our new relation types can be integrated in downstream applications to enhance planning and reasoning capabilities.
[124] ToosiCubix: Monocular 3D Cuboid Labeling via Vehicle Part Annotations
Behrooz Nasihatkon,Hossein Resani,Amirreza Mehrzadian
Main category: cs.CV
TL;DR: ToosiCubix is an efficient 3D cuboid annotation method that requires only monocular images and camera intrinsics, making it practical and scalable.
Details
Motivation: Many existing 3D cuboid annotation methods for vehicles rely on expensive, carefully calibrated camera-LiDAR or stereo setups, which limits their accessibility for large-scale data collection. Method: The method annotates specific features (e.g., wheels, car badge, symmetries) across different vehicle parts, formulates the geometric constraints as an optimization problem, and solves it with a coordinate descent strategy. Result: Validation shows that the method performs well on the KITTI and Cityscapes3D datasets, offering a cost-effective and scalable solution. Conclusion: ToosiCubix provides a cost-effective and scalable solution for high-quality 3D cuboid annotation using only monocular images and camera intrinsics. Abstract: Many existing methods for 3D cuboid annotation of vehicles rely on expensive and carefully calibrated camera-LiDAR or stereo setups, limiting their accessibility for large-scale data collection. We introduce ToosiCubix, a simple yet powerful approach for annotating ground-truth cuboids using only monocular images and intrinsic camera parameters. Our method requires only about 10 user clicks per vehicle, making it highly practical for adding 3D annotations to existing datasets originally collected without specialized equipment. By annotating specific features (e.g., wheels, car badge, symmetries) across different vehicle parts, we accurately estimate each vehicle's position, orientation, and dimensions up to a scale ambiguity (8 DoF). The geometric constraints are formulated as an optimization problem, which we solve using a coordinate descent strategy, alternating between Perspective-n-Points (PnP) and least-squares subproblems. To handle common ambiguities such as scale and unobserved dimensions, we incorporate probabilistic size priors, enabling 9 DoF cuboid placements. We validate our annotations against the KITTI and Cityscapes3D datasets, demonstrating that our method offers a cost-effective and scalable solution for high-quality 3D cuboid annotation.
[125] CA-I2P: Channel-Adaptive Registration Network with Global Optimal Selection
Zhixin Cheng,Jiacheng Deng,Xinjun Li,Xiaotian Yin,Bohao Liao,Baoqun Yin,Wenfei Yang,Tianzhu Zhang
Main category: cs.CV
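TL;DR: This paper proposes a new cross-modal registration method that enhances intra-modal features with a CAA module and performs global optimization with a GOS module, effectively addressing feature channel attention differences and redundant correspondences and improving registration accuracy.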
Details
Motivation: Existing detection-free methods suffer from differences in feature channel attention between images and point clouds, which degrade matching, and from similar structures in the scene, which cause redundant correspondences in cross-modal matching; registration accuracy therefore needs to improve. Method: The paper proposes the Channel Adaptive Adjustment Module (CAA), which enhances intra-modal features and suppresses cross-modal sensitivity, and the Global Optimal Selection Module (GOS), which replaces local selection with global optimization. Result: Experiments on the RGB-D Scenes V2 and 7-Scenes datasets validate the effectiveness of the proposed method. Conclusion: The experimental results demonstrate the method's superiority on RGB-D Scenes V2 and 7-Scenes, achieving state-of-the-art performance in image-to-point cloud registration. Abstract: Detection-free methods typically follow a coarse-to-fine pipeline, extracting image and point cloud features for patch-level matching and refining dense pixel-to-point correspondences. However, differences in feature channel attention between images and point clouds may lead to degraded matching results, ultimately impairing registration accuracy. Furthermore, similar structures in the scene could lead to redundant correspondences in cross-modal matching. To address these issues, we propose Channel Adaptive Adjustment Module (CAA) and Global Optimal Selection Module (GOS). CAA enhances intra-modal features and suppresses cross-modal sensitivity, while GOS replaces local selection with global optimization. Experiments on RGB-D Scenes V2 and 7-Scenes demonstrate the superiority of our method, achieving state-of-the-art performance in image-to-point cloud registration.
[126] GenFlow: Interactive Modular System for Image Generation
Duc-Hung Nguyen,Huu-Phuc Huynh,Minh-Triet Tran,Trung-Nghia Le
Main category: cs.CV
TL;DR: GenFlow is an easy-to-use generative art framework that significantly improves creative efficiency and accessibility.
Details
Motivation: Because specialized technical knowledge is required, the potential of generative art has not been fully realized, so a tool suited to users of all skill levels is needed. Method: The authors develop a modular framework built on a node-based editor and integrate an intelligent assistant powered by natural language processing to simplify workflows. Result: A user study shows that GenFlow optimizes workflows, reduces task completion times, and improves user understanding. Conclusion: GenFlow is an innovative modular framework whose intuitive design and automated processes let users work with generative art tools more efficiently and conveniently. Abstract: Generative art unlocks boundless creative possibilities, yet its full potential remains untapped due to the technical expertise required for advanced architectural concepts and computational workflows. To bridge this gap, we present GenFlow, a novel modular framework that empowers users of all skill levels to generate images with precision and ease. Featuring a node-based editor for seamless customization and an intelligent assistant powered by natural language processing, GenFlow transforms the complexity of workflow creation into an intuitive and accessible experience. By automating deployment processes and minimizing technical barriers, our framework makes cutting-edge generative art tools available to everyone. A user study demonstrated GenFlow's ability to optimize workflows, reduce task completion times, and enhance user understanding through its intuitive interface and adaptive features. These results position GenFlow as a groundbreaking solution that redefines accessibility and efficiency in the realm of generative art.
[127] FastRef: Fast Prototype Refinement for Few-Shot Industrial Anomaly Detection
Long Tian,Yufei Li,Yuyang Dai,Wenchao Chen,Xiyang Liu,Bo Chen
Main category: cs.CV
TL;DR: FastRef enhances few-shot industrial anomaly detection by refining prototypes through query feature transfer and anomaly suppression, achieving strong results on benchmark datasets.
Details
Motivation: Existing methods neglect query image statistics in prototype-based few-shot industrial anomaly detection, leading to suboptimal performance in data-scarce environments. Method: Proposed FastRef, an iterative two-stage prototype refinement framework involving characteristic transfer and anomaly suppression using optimal transport. Result: Integration of FastRef with existing methods (PatchCore, FastRecon, WinCLIP, AnomalyDINO) showed improved performance across benchmark datasets like MVTec, ViSA, MPDD, and RealIAD under 1/2/4-shots. Conclusion: FastRef is effective and computationally efficient for few-shot industrial anomaly detection when integrated with existing methods. Abstract: Few-shot industrial anomaly detection (FS-IAD) presents a critical challenge for practical automated inspection systems operating in data-scarce environments. While existing approaches predominantly focus on deriving prototypes from limited normal samples, they typically neglect to systematically incorporate query image statistics to enhance prototype representativeness. To address this issue, we propose FastRef, a novel and efficient prototype refinement framework for FS-IAD. Our method operates through an iterative two-stage process: (1) characteristic transfer from query features to prototypes via an optimizable transformation matrix, and (2) anomaly suppression through prototype alignment. The characteristic transfer is achieved through linear reconstruction of query features from prototypes, while the anomaly suppression addresses a key observation in FS-IAD that unlike conventional IAD with abundant normal prototypes, the limited-sample setting makes anomaly reconstruction more probable. Therefore, we employ optimal transport (OT) for non-Gaussian sampled features to measure and minimize the gap between prototypes and their refined counterparts for anomaly suppression. 
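As an illustration only of the "linear reconstruction of query features from prototypes" step described above, the sketch below solves a ridge-regularized least-squares problem with NumPy. The function name, the ridge term `lam`, and the closed-form solve are assumptions made for this example; the abstract does not specify FastRef's actual objective or optimizer, and the optimal-transport anomaly suppression stage is not shown.

```python
import numpy as np


def reconstruct_queries(queries, prototypes, lam=1e-3):
    """Find W minimizing ||Q - W P||_F^2 + lam * ||W||_F^2.

    queries:    (n, d) array of query features Q
    prototypes: (m, d) array of prototype features P
    Returns W of shape (n, m) and the reconstruction W @ P of shape (n, d).
    This is a hypothetical stand-in, not the method from the paper.
    """
    m = prototypes.shape[0]
    gram = prototypes @ prototypes.T + lam * np.eye(m)   # (m, m), symmetric
    # Normal equations: W (P P^T + lam I) = Q P^T, solved via the transpose.
    weights = np.linalg.solve(gram, prototypes @ queries.T).T
    return weights, weights @ prototypes


# Toy data: queries that lie exactly in the span of three prototypes.
rng = np.random.default_rng(0)
P = rng.standard_normal((3, 5))
C = rng.standard_normal((4, 3))
Q = C @ P
W, Q_hat = reconstruct_queries(Q, P, lam=1e-8)
residual = np.linalg.norm(Q - Q_hat)
```

A per-patch residual of this kind is one common way prototype-based detectors turn reconstruction quality into an anomaly score; whether FastRef scores anomalies this way is not stated in the abstract.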
For comprehensive evaluation, we integrate FastRef with four competitive prototype-based FS-IAD methods: PatchCore, FastRecon, WinCLIP, and AnomalyDINO. Extensive experiments across four benchmark datasets of MVTec, ViSA, MPDD and RealIAD demonstrate both the effectiveness and computational efficiency of our approach under 1/2/4-shots.
[128] Curve-Aware Gaussian Splatting for 3D Parametric Curve Reconstruction
Zhirui Gao,Renjiao Yi,Yaqiao Dai,Xuening Zhu,Wei Chen,Chenyang Zhu,Kai Xu
Main category: cs.CV
TL;DR: This paper introduces CurveGaussian, a one-stage end-to-end framework for reconstructing 3D parametric curves directly from 2D edge maps, outperforming traditional two-stage approaches in accuracy and efficiency.
Details
Motivation: Existing methods use a two-stage pipeline involving edge point cloud reconstruction followed by parametric curve fitting, which leads to error accumulation due to the optimization gap between stages. Furthermore, parametric curves are not inherently suitable for rendering-based multi-view optimization, necessitating an improved representation that preserves geometric properties while enabling differentiable rendering. Method: A bi-directional coupling mechanism between parametric curves and edge-oriented Gaussian components is introduced to create a curve-aware Gaussian representation (CurveGaussian). This allows differentiable rendering of 3D curves for direct optimization using multi-view evidence. Additionally, a dynamically adaptive topology optimization framework is used during training, enabling refinement through linearization, merging, splitting, and pruning operations. Result: Comprehensive evaluations on the ABC dataset and real-world benchmarks show that the one-stage approach produces cleaner and more robust reconstructions compared to two-stage alternatives. It also significantly reduces parameter count during training, achieving higher efficiency and superior performance over existing methods. Conclusion: The proposed one-stage method, CurveGaussian, outperforms two-stage approaches by directly optimizing 3D parametric curves from 2D edge maps. It reduces training parameters and achieves better efficiency and performance. Abstract: This paper presents an end-to-end framework for reconstructing 3D parametric curves directly from multi-view edge maps. Contrasting with existing two-stage methods that follow a sequential "edge point cloud reconstruction and parametric curve fitting" pipeline, our one-stage approach optimizes 3D parametric curves directly from 2D edge maps, eliminating error accumulation caused by the inherent optimization gap between disconnected stages.
However, parametric curves inherently lack suitability for rendering-based multi-view optimization, necessitating a complementary representation that preserves their geometric properties while enabling differentiable rendering. We propose a novel bi-directional coupling mechanism between parametric curves and edge-oriented Gaussian components. This tight correspondence formulates a curve-aware Gaussian representation, CurveGaussian, that enables differentiable rendering of 3D curves, allowing direct optimization guided by multi-view evidence. Furthermore, we introduce a dynamically adaptive topology optimization framework during training to refine curve structures through linearization, merging, splitting, and pruning operations. Comprehensive evaluations on the ABC dataset and real-world benchmarks demonstrate our one-stage method's superiority over two-stage alternatives, particularly in producing cleaner and more robust reconstructions. Additionally, by directly optimizing parametric curves, our method significantly reduces the parameter count during training, achieving both higher efficiency and superior performance compared to existing approaches.
[129] XVerse: Consistent Multi-Subject Control of Identity and Semantic Attributes via DiT Modulation
Bowen Chen,Mengyi Zhao,Haomiao Sun,Li Chen,Xu Wang,Kang Du,Xinglong Wu
Main category: cs.CV
TL;DR: XVerse improves multi-subject text-to-image generation by enabling precise and independent control of subject identity and semantic attributes without disrupting image features.
Details
Motivation: Existing methods often undermine the editability and coherence of Diffusion Transformers (DiTs), introduce artifacts, or suffer from attribute entanglement. A better solution is needed for multi-subject controlled generation. Method: XVerse transforms reference images into offsets for token-specific text-stream modulation to allow precise and independent control of specific subjects. Result: XVerse achieves high-fidelity, editable multi-subject image synthesis with robust control over individual subject characteristics and semantic attributes. Conclusion: XVerse provides a solution for fine-grained control over subject identity and semantic attributes in multi-subject text-to-image generation, improving personalized and complex scene generation capabilities. Abstract: Achieving fine-grained control over subject identity and semantic attributes (pose, style, lighting) in text-to-image generation, particularly for multiple subjects, often undermines the editability and coherence of Diffusion Transformers (DiTs). Many approaches introduce artifacts or suffer from attribute entanglement. To overcome these challenges, we propose a novel multi-subject controlled generation model XVerse. By transforming reference images into offsets for token-specific text-stream modulation, XVerse allows for precise and independent control of specific subjects without disrupting image latents or features. Consequently, XVerse offers high-fidelity, editable multi-subject image synthesis with robust control over individual subject characteristics and semantic attributes. This advancement significantly improves personalized and complex scene generation capabilities.
[130] EndoFlow-SLAM: Real-Time Endoscopic SLAM with Flow-Constrained Gaussian Splatting
Taoyu Wu,Yiyi Miao,Zhuoxiao Li,Haocheng Zhao,Kang Dang,Jionglong Su,Limin Yu,Haoang Li
Main category: cs.CV
TL;DR: This paper addresses challenges in endoscopic 3D reconstruction and visualization by introducing optical flow loss and depth regularisation into 3DGS-based SLAM, achieving better performance than existing methods.
Details
Motivation: The motivation stems from challenges in endoscopic scenarios, such as photometric inconsistencies due to non-Lambertian surfaces and dynamic motion caused by breathing, which affect the performance of SLAM systems. Method: The method introduces optical flow loss as a geometric constraint and proposes a depth regularisation strategy to address photometric inconsistencies. Additionally, it enhances the 3DGS refinement process by focusing on keyframes with suboptimal rendering quality. Result: Extensive experiments on the C3VD static dataset and the StereoMIS dynamic dataset show that the method outperforms existing approaches in novel view synthesis and pose estimation. Conclusion: The paper concludes that the proposed method, which integrates optical flow loss and depth regularisation into 3DGS-based SLAM, significantly improves performance in both static and dynamic surgical scenes, outperforming existing state-of-the-art methods. Abstract: Efficient three-dimensional reconstruction and real-time visualization are critical in surgical scenarios such as endoscopy. In recent years, 3D Gaussian Splatting (3DGS) has demonstrated remarkable performance in efficient 3D reconstruction and rendering. Most 3DGS-based Simultaneous Localization and Mapping (SLAM) methods only rely on the appearance constraints for optimizing both 3DGS and camera poses. However, in endoscopic scenarios, challenges such as photometric inconsistencies caused by non-Lambertian surfaces and dynamic motion from breathing affect the performance of SLAM systems. To address these issues, we additionally introduce optical flow loss as a geometric constraint, which effectively constrains both the 3D structure of the scene and the camera motion. Furthermore, we propose a depth regularisation strategy to mitigate the problem of photometric inconsistencies and ensure the validity of 3DGS depth rendering in endoscopic scenes.
In addition, to improve scene representation in the SLAM system, we improve the 3DGS refinement strategy by focusing on viewpoints corresponding to Keyframes with suboptimal rendering quality frames, achieving better rendering results. Extensive experiments on the C3VD static dataset and the StereoMIS dynamic dataset demonstrate that our method outperforms existing state-of-the-art methods in novel view synthesis and pose estimation, exhibiting high performance in both static and dynamic surgical scenes. The source code will be publicly available upon paper acceptance.[131] HyperSORT: Self-Organising Robust Training with hyper-networks
Samuel Joutard,Marijn Stollenga,Marc Balle Sanchez,Mohammad Farid Azampour,Raphael Prevost
Main category: cs.CV
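TL;DR: This paper proposes HyperSORT, a framework that uses a hyper-network to predict UNet parameters, for identifying and characterizing heterogeneous biases in medical imaging datasets.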
Details
Motivation: Medical imaging datasets contain biases ranging from erroneous labels to inconsistent annotation styles, which can degrade the performance of deep segmentation networks. Identifying and characterizing such biases, however, is particularly tedious and challenging. Method: HyperSORT uses a hyper-network to predict UNet parameters from latent vectors that represent image and annotation variability. The hyper-network parameters and the collection of latent vectors, one per training sample, are learned jointly. Result: Experiments show that HyperSORT creates a structured mapping of the dataset that allows relevant systematic biases and erroneous samples to be identified. UNet parameters generated from latent-space clusters perform segmentation in accordance with the underlying learned systematic bias. The method is validated on a synthetically perturbed version of the AMOS dataset and on the large-scale TotalSegmentator dataset, which contains real unknown biases. Conclusion: HyperSORT offers an effective way to analyze biases in medical imaging datasets and shows potential for application across different datasets. Abstract: Medical imaging datasets often contain heterogeneous biases ranging from erroneous labels to inconsistent labeling styles. Such biases can negatively impact the performance of deep segmentation networks. Yet, the identification and characterization of such biases is a particularly tedious and challenging task. In this paper, we introduce HyperSORT, a framework using a hyper-network predicting UNets' parameters from latent vectors representing both the image and annotation variability. The hyper-network parameters and the latent vector collection corresponding to each data sample from the training set are jointly learned. Hence, instead of optimizing a single neural network to fit a dataset, HyperSORT learns a complex distribution of UNet parameters where low density areas can capture noise-specific patterns while larger modes robustly segment organs in differentiated but meaningful manners. We validate our method on two 3D abdominal CT public datasets: first a synthetically perturbed version of the AMOS dataset, and TotalSegmentator, a large scale dataset containing real unknown biases and errors. Our experiments show that HyperSORT creates a structured mapping of the dataset allowing the identification of relevant systematic biases and erroneous samples. Latent space clusters yield UNet parameters performing the segmentation task in accordance with the underlying learned systematic bias. 
The code and our analysis of the TotalSegmentator dataset are made available: https://github.com/ImFusionGmbH/HyperSORT[132] Benchmarking Deep Learning and Vision Foundation Models for Atypical vs. Normal Mitosis Classification with Cross-Dataset Evaluation
Sweta Banerjee,Viktoria Weiss,Taryn A. Donovan,Rutger A. Fick,Thomas Conrad,Jonas Ammeling,Nils Porsche,Robert Klopfleisch,Christopher Kaltenecker,Katharina Breininger,Marc Aubreville,Christof A. Bertram
Main category: cs.CV
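TL;DR: This study presents a deep learning benchmark for automated classification of atypical mitoses and shows that transfer learning and model fine-tuning can substantially improve classification performance.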
Details
Motivation: Identifying atypical mitoses remains challenging because of their low prevalence, subtle morphological differences from normal mitoses, low inter-rater agreement among pathologists, and class imbalance in datasets, so more effective automated solutions are needed. Method: Building on the AMi-Br dataset, the study compares deep learning approaches for atypical mitotic figure (AMF) classification, including baseline models, foundation models with linear probing, and foundation models fine-tuned with low-rank adaptation (LoRA). Two new hold-out AMF datasets, AtNorM-Br and AtNorM-MD, are introduced for rigorous evaluation. Result: Average balanced accuracy reaches up to 0.8135, 0.7696, and 0.7705 on the in-domain AMi-Br and the out-of-domain AtNorM-Br and AtNorM-MD datasets, respectively, with LoRA-based adaptation of the Virchow line of foundation models performing particularly well. Conclusion: The study shows that transfer learning and model fine-tuning can effectively address the challenging problem of atypical mitosis classification. Abstract: Atypical mitoses mark a deviation in the cell division process that can be an independent prognostically relevant marker for tumor malignancy. However, their identification remains challenging due to low prevalence, at times subtle morphological differences from normal mitoses, low inter-rater agreement among pathologists, and class imbalance in datasets. Building on the Atypical Mitosis dataset for Breast Cancer (AMi-Br), this study presents a comprehensive benchmark comparing deep learning approaches for automated atypical mitotic figure (AMF) classification, including baseline models, foundation models with linear probing, and foundation models fine-tuned with low-rank adaptation (LoRA). For rigorous evaluation, we further introduce two new hold-out AMF datasets: AtNorM-Br, a dataset of mitoses from the TCGA breast cancer cohort, and AtNorM-MD, a multi-domain dataset of mitoses from the MIDOG++ training set. We found average balanced accuracy values of up to 0.8135, 0.7696, and 0.7705 on the in-domain AMi-Br and the out-of-domain AtNorM-Br and AtNorM-MD datasets, respectively, with the results being particularly good for LoRA-based adaptation of the Virchow-line of foundation models. Our work shows that atypical mitosis classification, while being a challenging problem, can be effectively addressed through the use of recent advances in transfer learning and model fine-tuning techniques. 
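The results above are reported as balanced accuracy, which is the mean of per-class recall and is therefore less flattering than plain accuracy when one class is rare. The following is a minimal sketch of that metric in plain Python; the label names and the toy predictions are illustrative assumptions and are not taken from the paper's data.

```python
from collections import defaultdict


def balanced_accuracy(y_true, y_pred):
    """Return the mean of per-class recall over the classes present in y_true."""
    hits = defaultdict(int)    # correct predictions per true class
    totals = defaultdict(int)  # number of samples per true class
    for truth, pred in zip(y_true, y_pred):
        totals[truth] += 1
        if truth == pred:
            hits[truth] += 1
    recalls = [hits[c] / totals[c] for c in totals]
    return sum(recalls) / len(recalls)


# Toy, imbalanced example (hypothetical labels): 8 normal, 2 atypical.
y_true = ["normal"] * 8 + ["atypical"] * 2
y_pred = ["normal"] * 8 + ["normal", "atypical"]

plain_accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(plain_accuracy)                     # 9 of 10 correct
print(balanced_accuracy(y_true, y_pred))  # mean of recall 1.0 and recall 0.5
```

In this toy case the classifier misses half of the rare class, which plain accuracy largely hides and balanced accuracy exposes.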
We make available all code and data used in this paper in this GitHub repository: https://github.com/DeepMicroscopy/AMi-Br_Benchmark.[133] Controllable 3D Placement of Objects with Scene-Aware Diffusion Models
Mohamed Omran,Dimitris Kalatzis,Jens Petersen,Amirhossein Habibian,Auke Wiggers
Main category: cs.CV
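TL;DR: This paper proposes a new method based on a visual map and coarse object masks for high-quality object placement, addressing the challenge of placing objects at a precise location and orientation.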
Details
Motivation: Although text-conditioned generative models have made image editing more powerful and flexible, placing objects at a precise location and orientation remains a challenge, because it typically requires carefully crafted inpainting masks or prompts. Method: The authors design a conditioning signal that combines a visual map with coarse object masks to resolve ambiguities while remaining flexible enough to allow changes in shape or object orientation. Because the method builds on an inpainting model, it leaves the background intact by design. Result: The method is validated in the automotive setting by comparing different conditioning signals on novel object placement tasks. These tasks measure not only appearance quality but also pose and location accuracy, and the experiments show that fine location control can be combined with appearance control. Conclusion: The results indicate that the proposed method achieves high-quality object placement, avoids the complex inputs that conventional methods require, and offers greater flexibility and precision. Abstract: Image editing approaches have become more powerful and flexible with the advent of powerful text-conditioned generative models. However, placing objects in an environment with a precise location and orientation still remains a challenge, as this typically requires carefully crafted inpainting masks or prompts. In this work, we show that a carefully designed visual map, combined with coarse object masks, is sufficient for high quality object placement. We design a conditioning signal that resolves ambiguities, while being flexible enough to allow for changing of shapes or object orientations. By building on an inpainting model, we leave the background intact by design, in contrast to methods that model objects and background jointly. We demonstrate the effectiveness of our method in the automotive setting, where we compare different conditioning signals in novel object placement tasks. These tasks are designed to measure edit quality not only in terms of appearance, but also in terms of pose and location accuracy, including cases that require non-trivial shape changes. Lastly, we show that fine location control can be combined with appearance control to place existing objects in precise locations in a scene.[134] A Comprehensive Dataset for Underground Miner Detection in Diverse Scenario
Cyrus Addy,Ajay Kumar Gurumadaiah,Yixiang Gao,Kwame Awuah-Offei
Main category: cs.CV
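TL;DR: This paper introduces a new thermal imaging dataset intended to advance reliable miner detection systems and address safety challenges in underground mining.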
Details
Motivation: Underground mining operations face significant safety challenges, making emergency response capabilities crucial. Robots have shown promise in assisting search and rescue operations, but their effectiveness depends on reliable miner detection. Deep learning algorithms offer a potential solution but require comprehensive training datasets, which are still lacking for underground mining environments. Method: The authors systematically capture thermal imagery of various mining activities and scenarios to create a robust dataset for detection algorithms, and evaluate the baseline performance of state-of-the-art object detectors including YOLOv8, YOLOv10, YOLO11, and RT-DETR. Result: The paper presents a new, purpose-built thermal imaging dataset to support the development and validation of miner detection systems, and reports the baseline performance of existing object detection algorithms on it. Conclusion: The paper concludes that miner detection with thermal imaging is feasible and lays a foundation for future research in this critical safety application. Abstract: Underground mining operations face significant safety challenges that make emergency response capabilities crucial. While robots have shown promise in assisting with search and rescue operations, their effectiveness depends on reliable miner detection capabilities. Deep learning algorithms offer potential solutions for automated miner detection, but require comprehensive training datasets, which are currently lacking for underground mining environments. This paper presents a novel thermal imaging dataset specifically designed to enable the development and validation of miner detection systems for potential emergency applications. We systematically captured thermal imagery of various mining activities and scenarios to create a robust foundation for detection algorithms. To establish baseline performance metrics, we evaluated several state-of-the-art object detection algorithms including YOLOv8, YOLOv10, YOLO11, and RT-DETR on our dataset. While not exhaustive of all possible emergency situations, this dataset serves as a crucial first step toward developing reliable thermal-based miner detection systems that could eventually be deployed in real emergency scenarios. This work demonstrates the feasibility of using thermal imaging for miner detection and establishes a foundation for future research in this critical safety application.[135] Rethinking Oversaturation in Classifier-Free Guidance via Low Frequency
Kaiyu Song,Hanjiang Lai
Main category: cs.CV
TL;DR: This paper proposes Low-Frequency Improved Classifier-Free Guidance (LF-CFG) to mitigate oversaturation and unrealistic artifacts in diffusion models by identifying and reducing redundant information in low-frequency signals.
Details
Motivation: High guidance scales in classifier-free guidance often lead to oversaturation and unrealistic artifacts in diffusion models, necessitating a novel approach to address this issue. Method: The paper proposes LF-CFG, which analyzes low-frequency signals to identify redundant information. An adaptive threshold-based measurement is introduced to pinpoint such redundancy, and a down-weight strategy is applied to reduce its impact. Result: Experimental results show that LF-CFG successfully reduces oversaturation and unrealistic artifacts across various diffusion models, including Stable Diffusion-XL, Stable Diffusion 2.1, 3.0, 3.5, and SiT-XL. Conclusion: LF-CFG effectively alleviates oversaturation and unrealistic artifacts in diffusion models, enhancing the performance of conditional terms without compromising realism. Abstract: Classifier-free guidance (CFG) has proven successful in conditional diffusion models, where a guidance scale balances the influence of the conditional and unconditional terms. A high guidance scale is used to enhance the performance of the conditional term. However, the high guidance scale often results in oversaturation and unrealistic artifacts. In this paper, we introduce a new perspective based on low-frequency signals, identifying the accumulation of redundant information in these signals as the key factor behind oversaturation and unrealistic artifacts. Building on this insight, we propose low-frequency improved classifier-free guidance (LF-CFG) to mitigate these issues. Specifically, we introduce an adaptive threshold-based measurement to pinpoint the locations of redundant information. We determine a reasonable threshold by analyzing the change rate of low-frequency information between prior and current steps. We then apply a down-weight strategy to reduce the impact of redundant information in the low-frequency signals. 
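For context, the guidance scale mentioned above enters the standard classifier-free guidance rule, which extrapolates from the unconditional prediction towards the conditional one. The sketch below shows only that standard rule on plain Python lists; it is not the LF-CFG method, whose thresholding and down-weighting details are not given in the abstract. The function name and the toy values are assumptions for illustration.

```python
def cfg_combine(eps_uncond, eps_cond, scale):
    """Standard classifier-free guidance: uncond + scale * (cond - uncond).

    eps_uncond and eps_cond are the model's noise predictions without and
    with the condition, given here as flat lists of floats.
    """
    return [u + scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]


# Toy predictions (hypothetical values).
eps_uncond = [0.0, 1.0]
eps_cond = [1.0, 3.0]

print(cfg_combine(eps_uncond, eps_cond, 0.0))  # equals the unconditional prediction
print(cfg_combine(eps_uncond, eps_cond, 1.0))  # equals the conditional prediction
print(cfg_combine(eps_uncond, eps_cond, 7.5))  # extrapolates well past the conditional one
```

With a scale above 1 the combined prediction moves beyond the conditional one, which illustrates why large scales can push outputs towards the oversaturation the paper targets.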
Experimental results demonstrate that LF-CFG effectively alleviates oversaturation and unrealistic artifacts across various diffusion models, including Stable Diffusion-XL, Stable Diffusion 2.1, 3.0, 3.5, and SiT-XL.[136] Evaluation of Traffic Signals for Daily Traffic Pattern
Mohammad Shokrolah Shirazi,Hung-Fu Chang
Main category: cs.CV
TL;DR: This paper proposes dynamic, static, and hybrid traffic signal methods based on TMC data, showing that hybrid signals improve traffic flow during peak hours and uneven traffic distribution.
Details
Motivation: Turning movement count (TMC) data is essential for optimizing traffic signals and intersection design, prompting the need for adaptive signal control strategies to manage varying traffic patterns. Method: Three methods (dynamic, static, and hybrid) were evaluated using a vision-based tracking system and traffic simulations. Realistic data was generated using a built-in traffic generator and signal design module. Result: Dynamic signal timing performed best at four intersections, while the hybrid method outperformed others under high-traffic conditions and uneven zone-based traffic distributions. Conclusion: The hybrid method is effective for traffic flow management during peak and off-peak times, especially when there are significant differences in traffic volume between intersection zones. Abstract: The turning movement count data is crucial for traffic signal design, intersection geometry planning, traffic flow, and congestion analysis. This work proposes three methods called dynamic, static, and hybrid configuration for TMC-based traffic signals. A vision-based tracking system is developed to estimate the TMC of six intersections in Las Vegas using traffic cameras. The intersection design, route (e.g. vehicle movement directions), and signal configuration files with compatible formats are synthesized and imported into Simulation of Urban MObility for signal evaluation with realistic data. The initial experimental results based on estimated waiting times indicate that the cycle time of 90 and 120 seconds works best for all intersections. In addition, four intersections show better performance for dynamic signal timing configuration, and the other two with lower performance have a lower ratio of total vehicle count to total lanes of the intersection leg. 
Since daily traffic flow often exhibits a bimodal pattern, we propose a hybrid signal method that switches between dynamic and static methods, adapting to peak and off-peak traffic conditions for improved flow management. Accordingly, a built-in traffic generator module creates vehicle routes for 4 hours, including peak hours, and a signal design module produces signal schedule cycles according to static, dynamic, and hybrid methods. Vehicle count distributions are weighted differently for each zone (i.e., West, North, East, South) to generate diverse traffic patterns. The extended experimental results for 6 intersections with 4 hours of simulation time imply that zone-based traffic pattern distributions affect signal design selection. Although the static method works well when traffic is evenly distributed across zones, the hybrid method works well for highly weighted traffic at intersection pairs of the West-East and North-South zones.[137] Global and Local Entailment Learning for Natural World Imagery
Srikumar Sastry,Aayush Dhakal,Eric Xing,Subash Khanal,Nathan Jacobs
Main category: cs.CV
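TL;DR: This paper proposes RCME, a new vision-language framework that better learns the hierarchical structure of data and achieves improved performance on multiple tasks.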
Details
Motivation: Existing vision-language models do not explicitly model the transitivity of entailment, which limits their ability to learn the hierarchical structure of data. Method: The paper proposes the Radial Cross-Modal Embeddings (RCME) framework, which explicitly models the transitivity of entailment by optimizing for the partial order of concepts. Result: Experiments on hierarchical species classification and hierarchical retrieval show that the proposed model outperforms existing state-of-the-art models. Conclusion: RCME effectively models the transitivity of entailment and improves the performance of vision-language models on hierarchical tasks. Abstract: Learning the hierarchical structure of data in vision-language models is a significant challenge. Previous works have attempted to address this challenge by employing entailment learning. However, these approaches fail to model the transitive nature of entailment explicitly, which establishes the relationship between order and semantics within a representation space. In this work, we introduce Radial Cross-Modal Embeddings (RCME), a framework that enables the explicit modeling of transitivity-enforced entailment. Our proposed framework optimizes for the partial order of concepts within vision-language models. By leveraging our framework, we develop a hierarchical vision-language foundation model capable of representing the hierarchy in the Tree of Life. Our experiments on hierarchical species classification and hierarchical retrieval tasks demonstrate the enhanced performance of our models compared to the existing state-of-the-art models. Our code and models are open-sourced at https://vishu26.github.io/RCME/index.html.[138] TITAN: Query-Token based Domain Adaptive Adversarial Learning
Tajamul Ashraf,Janibul Bashir
Main category: cs.CV
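TL;DR: This paper proposes TITAN, a method for unsupervised domain adaptive object detection when source-domain data is unavailable. By splitting target-domain images into "easy" and "hard" subsets and using a query-token-based adversarial module to reduce the domain gap, TITAN outperforms existing state-of-the-art methods on multiple datasets.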
Details
Motivation: Existing source-free domain adaptive object detection methods train on noisy pseudo-labels. This noise, which stems from domain shift and discrepancies, causes the teacher model to collapse and degrades the student model. A method is therefore needed that reduces pseudo-label noise and improves adaptation. Method: The authors propose TITAN (Target-based Iterative Query-Token Adversarial Network). It first estimates detection variance to split target-domain images into a subset similar to the source domain (easy) and a dissimilar subset (hard). Easy samples are trained with more reliable pseudo-labels; for hard samples, a query-token-based adversarial module reduces the domain gap between feature representations. The model is optimized iteratively. Result: TITAN is evaluated on four natural image datasets and two medical image datasets and significantly outperforms current state-of-the-art methods. On the C2F, C2B, S2C, and K2C benchmarks, mAP improves by 22.7, 22.2, 21.1, and 3.7 percent, respectively. Conclusion: TITAN is an effective source-free domain adaptive object detection method. By partitioning target-domain images and combining this with adversarial learning, it mitigates the effect of pseudo-label noise and improves performance on multiple datasets. Abstract: We focus on the source-free domain adaptive object detection (SF-DAOD) problem when source data is unavailable during adaptation and the model must adapt to an unlabeled target domain. The majority of approaches for the problem employ a self-supervised approach using a student-teacher (ST) framework where pseudo-labels are generated via a source-pretrained model for further fine-tuning. We observe that the performance of a student model often degrades drastically, due to the collapse of the teacher model, primarily caused by high noise in pseudo-labels, resulting from domain bias, discrepancies, and a significant domain shift across domains. To obtain reliable pseudo-labels, we propose a Target-based Iterative Query-Token Adversarial Network (TITAN), which separates the target images into two subsets: those similar to the source (easy) and those dissimilar (hard). We propose a strategy to estimate variance to partition the target domain. This approach leverages the insight that higher detection variances correspond to higher recall and greater similarity to the source domain. Also, we incorporate query-token-based adversarial modules into a student-teacher baseline framework to reduce the domain gaps between two feature representations. 
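The easy/hard partition described above can be pictured as a simple thresholding step over per-image detection variance. The sketch below is only an illustration under stated assumptions: the abstract does not say how the variance is estimated or how the cut-off is chosen, so the precomputed variances, the median threshold, and the function name are all hypothetical. It follows the abstract's stated insight that higher detection variance indicates greater similarity to the source domain.

```python
import statistics


def partition_by_variance(variances, threshold=None):
    """Split image ids into (easy, hard) lists by detection variance.

    variances maps an image id to a precomputed detection variance.
    Images at or above the threshold are treated as easy (source-like),
    the rest as hard. The median default is an assumption, not the paper's rule.
    """
    if threshold is None:
        threshold = statistics.median(variances.values())
    easy = [img for img, var in variances.items() if var >= threshold]
    hard = [img for img, var in variances.items() if var < threshold]
    return easy, hard


# Hypothetical per-image variances.
variances = {"a": 0.9, "b": 0.1, "c": 0.5, "d": 0.3}
easy, hard = partition_by_variance(variances)
print(easy, hard)
```

A fixed threshold can be passed instead of the median when a validated cut-off is available.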
Experiments conducted on four natural imaging datasets and two challenging medical datasets have substantiated the superior performance of TITAN compared to existing state-of-the-art (SOTA) methodologies. We report an mAP improvement of +22.7, +22.2, +21.1, and +3.7 percent over the current SOTA on C2F, C2B, S2C, and K2C benchmarks, respectively.[139] Towards Reliable Detection of Empty Space: Conditional Marked Point Processes for Object Detection
Tobias J. Riedlinger,Kira Maag,Hanno Gottschalk
Main category: cs.CV
TL;DR: This paper introduces an object detection model based on spatial statistics that better quantifies uncertainty in regions without detected objects, improving safety and reliability in applications like autonomous driving.
Details
Motivation: Deep neural networks often produce miscalibrated confidence estimates due to architectures and loss functions tailored for task performance rather than probabilistic accuracy. Additionally, existing object detectors fail to quantify uncertainty in regions without detected objects, which poses safety risks, especially in applications like automated driving. Method: The authors use a statistical framework based on marked point processes to model bounding box data. This approach allows likelihood-based training and provides well-defined confidence estimates for whether a region is drivable (i.e., free of objects). Result: The proposed method demonstrates effectiveness through improved calibration assessments and performance evaluations, offering reliable uncertainty quantification in object-free regions. Conclusion: The paper concludes that their proposed object detection model, grounded in spatial statistics, provides well-calibrated confidence estimates for drivable regions, addressing the uncertainty quantification issue outside detected bounding boxes. Abstract: Deep neural networks have set the state-of-the-art in computer vision tasks such as bounding box detection and semantic segmentation. Object detectors and segmentation models assign confidence scores to predictions, reflecting the model's uncertainty in object detection or pixel-wise classification. However, these confidence estimates are often miscalibrated, as their architectures and loss functions are tailored to task performance rather than probabilistic foundation. Even with well calibrated predictions, object detectors fail to quantify uncertainty outside detected bounding boxes, i.e., the model does not make a probability assessment of whether an area without detected objects is truly free of obstacles. This poses a safety risk in applications such as automated driving, where uncertainty in empty areas remains unexplored. 
In this work, we propose an object detection model grounded in spatial statistics. Bounding box data matches realizations of a marked point process, commonly used to describe the probabilistic occurrence of spatial point events identified as bounding box centers, where marks are used to describe the spatial extension of bounding boxes and classes. Our statistical framework enables a likelihood-based training and provides well-defined confidence estimates for whether a region is drivable, i.e., free of objects. We demonstrate the effectiveness of our method through calibration assessments and evaluation of performance.[140] Mitigating Hallucination of Large Vision-Language Models via Dynamic Logits Calibration
Jiahe Chen,Jiaying He,Qian Shao,Qiyuan Chen,Jiahe Ying,Hongxia Xu,Jintai Chen,Jianwei Zheng,Jian Wu
Main category: cs.CV
TL;DR: This paper proposes Dynamic Logits Calibration (DLC), a training-free decoding framework that effectively reduces hallucinations in Large Vision-Language Models (LVLMs) by dynamically aligning text generation with visual evidence.
Details
Motivation: LVLMs suffer from hallucinations due to static constraints, inefficiency, and detail degradation in existing training-free decoding strategies. DLC aims to overcome these limitations. Method: Dynamic Logits Calibration (DLC) uses CLIP during the decoding phase to dynamically align text generation with visual evidence, employing a Relative Visual Advantage (RVA) and adaptive weighting mechanism. Result: Experiments show that DLC significantly reduces hallucinations across various LVLM architectures while maintaining high inference efficiency by avoiding multiple forward passes. Conclusion: DLC provides an effective and efficient decoding-time solution to mitigate hallucinations in LVLMs, enhancing their reliability for practical applications. Abstract: Large Vision-Language Models (LVLMs) have demonstrated significant advancements in multimodal understanding, yet they are frequently hampered by hallucination, the generation of text that contradicts visual input. Existing training-free decoding strategies exhibit critical limitations, including the use of static constraints that do not adapt to semantic drift during generation, inefficiency stemming from the need for multiple forward passes, and degradation of detail due to overly rigid intervention rules. To overcome these challenges, this paper introduces Dynamic Logits Calibration (DLC), a novel training-free decoding framework designed to dynamically align text generation with visual evidence at inference time. At the decoding phase, DLC step-wise employs CLIP to assess the semantic alignment between the input image and the generated text sequence. Then, the Relative Visual Advantage (RVA) of candidate tokens is evaluated against a dynamically updated contextual baseline, adaptively adjusting output logits to favor tokens that are visually grounded. 
Furthermore, an adaptive weighting mechanism, informed by a real-time context alignment score, carefully balances the visual guidance while ensuring the overall quality of the textual output. Extensive experiments conducted across diverse benchmarks and various LVLM architectures (such as LLaVA, InstructBLIP, and MiniGPT-4) demonstrate that DLC significantly reduces hallucinations, outperforming current methods while maintaining high inference efficiency by avoiding multiple forward passes. Overall, we present an effective and efficient decoding-time solution to mitigate hallucinations, thereby enhancing the reliability of LVLMs for practical applications. Code will be released on GitHub.[141] GGTalker: Talking Head Systhesis with Generalizable Gaussian Priors and Identity-Specific Adaptation
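The logit adjustment described above can be pictured roughly as follows. This is a sketch under assumptions, not the paper's formulation: the abstract does not give the exact RVA formula, so the additive form, the fixed `weight`, and the function name are hypothetical, and the alignment scores stand in for values that an image-text model such as CLIP would supply.

```python
def calibrate_logits(logits, alignment_scores, baseline, weight=1.0):
    """Shift each candidate token's logit by its advantage over a baseline.

    alignment_scores[i] is an image-text alignment score for the text
    extended with candidate i; baseline is the score of the current context.
    The difference plays the role of a relative visual advantage (assumed form).
    """
    return [
        logit + weight * (score - baseline)
        for logit, score in zip(logits, alignment_scores)
    ]


# Two candidates the language model rates equally (hypothetical numbers).
logits = [2.0, 2.0]
alignment_scores = [0.30, 0.20]  # candidate 0 matches the image better
adjusted = calibrate_logits(logits, alignment_scores, baseline=0.25, weight=10.0)
print(adjusted)
```

Here the better-grounded candidate ends up with the higher logit. A fixed weight is used only to keep the sketch short; it does not model any adaptive balancing.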
Wentao Hu,Shunkai Li,Ziqiao Peng,Haoxian Zhang,Fan Shi,Xiaoqiang Liu,Pengfei Wan,Di Zhang,Hui Tian
Main category: cs.CV
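TL;DR: This paper proposes GGTalker, a speech-driven 3D talking head synthesis method that combines generalizable priors with identity-specific adaptation and achieves strong results in both quality and efficiency.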
Details
Motivation: Existing methods perform poorly under large audio variations and large head rotations and require time-consuming identity-specific training; the lack of sufficient 3D priors limits the extrapolation capability of synthesized talking heads. Method: The paper proposes a two-stage Prior-Adaptation training strategy, introduces Audio-Expression and Expression-Visual priors, and uses a color MLP and a Body Inpainter to generate fine-grained textures and blend rendered results with the background. Result: Experiments show that GGTalker achieves state-of-the-art performance in rendering quality, 3D consistency, lip-sync accuracy, and training efficiency. Conclusion: By combining generalizable priors with identity-specific adaptation, GGTalker effectively addresses the problem of generating high-quality, generalizable speech-driven 3D talking heads. Abstract: Creating high-quality, generalizable speech-driven 3D talking heads remains a persistent challenge. Previous methods achieve satisfactory results for fixed viewpoints and small-scale audio variations, but they struggle with large head rotations and out-of-distribution (OOD) audio. Moreover, they are constrained by the need for time-consuming, identity-specific training. We believe the core issue lies in the lack of sufficient 3D priors, which limits the extrapolation capabilities of synthesized talking heads. To address this, we propose GGTalker, which synthesizes talking heads through a combination of generalizable priors and identity-specific adaptation. We introduce a two-stage Prior-Adaptation training strategy to learn Gaussian head priors and adapt to individual characteristics. We train Audio-Expression and Expression-Visual priors to capture the universal patterns of lip movements and the general distribution of head textures. During the Customized Adaptation, individual speaking styles and texture details are precisely modeled. Additionally, we introduce a color MLP to generate fine-grained, motion-aligned textures and a Body Inpainter to blend rendered results with the background, producing indistinguishable, photorealistic video frames. Comprehensive experiments show that GGTalker achieves state-of-the-art performance in rendering quality, 3D consistency, lip-sync accuracy, and training efficiency.[142] G$^{2}$D: Boosting Multimodal Learning with Gradient-Guided Distillation
Mohammed Rakib,Arunkumar Bagavathi
Main category: cs.CV
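TL;DR: This paper proposes Gradient-Guided Distillation (G$^{2}$D), a knowledge distillation framework that addresses modality imbalance in multimodal learning through a loss function fusing unimodal and multimodal objectives and a dynamic sequential modality prioritization technique.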
Details
Motivation: Conventional multimodal models suffer from modality imbalance, which leaves weak modalities underutilized; this paper aims to address that challenge. Method: The paper introduces the Gradient-Guided Distillation (G$^{2}$D) framework, which combines a custom loss function with a dynamic sequential modality prioritization (SMP) technique so that every modality participates equally in training. Result: G$^{2}$D is validated on multiple real-world datasets, amplifies the significance of weak modalities, and outperforms existing methods in classification and regression tasks. Conclusion: G$^{2}$D mitigates modality imbalance in multimodal learning and improves overall model performance. Abstract: Multimodal learning aims to leverage information from diverse data modalities to achieve more comprehensive performance. However, conventional multimodal models often suffer from modality imbalance, where one or a few modalities dominate model optimization, leading to suboptimal feature representation and underutilization of weak modalities. To address this challenge, we introduce Gradient-Guided Distillation (G$^{2}$D), a knowledge distillation framework that optimizes the multimodal model with a custom-built loss function that fuses both unimodal and multimodal objectives. G$^{2}$D further incorporates a dynamic sequential modality prioritization (SMP) technique in the learning process to ensure each modality leads the learning process, avoiding the pitfall of stronger modalities overshadowing weaker ones. We validate G$^{2}$D on multiple real-world datasets and show that G$^{2}$D amplifies the significance of weak modalities while training and outperforms state-of-the-art methods in classification and regression tasks. Our code is available at https://github.com/rAIson-Lab/G2D.[143] MADrive: Memory-Augmented Driving Scene Modeling
Polina Karpikova,Daniil Selikhanovych,Kirill Struminsky,Ruslan Musaev,Maria Golitsyna,Dmitry Baranchuk
Main category: cs.CV
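TL;DR: This paper proposes MADrive, a framework that uses a large-scale bank of 3D vehicle assets to enable photorealistic reconstruction and flexible modification of autonomous driving scenes.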
Details
Motivation: Existing 3D Gaussian splatting approaches to driving scene reconstruction struggle to support photorealistic synthesis of significantly altered or novel driving scenarios. Method: The paper introduces the MADrive framework and the MAD-Cars dataset, together with a retrieval module that fetches visually similar 3D assets from a large-scale external memory bank for integration into the scene. Result: Experiments show that MADrive provides complete multi-view representations of vehicles in the scene, enabling photorealistic configuration changes. Conclusion: By replacing observed vehicles, MADrive extends the capabilities of existing scene reconstruction methods and enables photorealistic synthesis of driving environments. Abstract: Recent advances in scene reconstruction have pushed toward highly realistic modeling of autonomous driving (AD) environments using 3D Gaussian splatting. However, the resulting reconstructions remain closely tied to the original observations and struggle to support photorealistic synthesis of significantly altered or novel driving scenarios. This work introduces MADrive, a memory-augmented reconstruction framework designed to extend the capabilities of existing scene reconstruction methods by replacing observed vehicles with visually similar 3D assets retrieved from a large-scale external memory bank. Specifically, we release MAD-Cars, a curated dataset of ${\sim}70$K 360° car videos captured in the wild and present a retrieval module that finds the most similar car instances in the memory bank, reconstructs the corresponding 3D assets from video, and integrates them into the target scene through orientation alignment and relighting. The resulting replacements provide complete multi-view representations of vehicles in the scene, enabling photorealistic synthesis of substantially altered configurations, as demonstrated in our experiments. Project page: https://yandex-research.github.io/madrive/[144] WAFT: Warping-Alone Field Transforms for Optical Flow
Yihan Wang,Jia Deng
Main category: cs.CV
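TL;DR: This paper proposes WAFT, a new optical flow estimation method that replaces the cost volume with high-resolution warping, achieving higher accuracy with lower memory consumption.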
Details
Motivation: The paper challenges the conventional wisdom that constructing cost volumes is necessary for strong performance and proposes a more efficient approach to optical flow. Method: It introduces Warping-Alone Field Transforms (WAFT), which replace the cost volume with high-resolution warping and adopt a simple, flexible meta-architecture. Result: WAFT ranks first on the Spring and KITTI benchmarks, achieves the best zero-shot generalization on KITTI, and is up to 4.1x faster than methods with similar performance. Conclusion: WAFT is a simple and effective optical flow method that outperforms existing techniques, with good generalization and runtime efficiency. Abstract: We introduce Warping-Alone Field Transforms (WAFT), a simple and effective method for optical flow. WAFT is similar to RAFT but replaces cost volume with high-resolution warping, achieving better accuracy with lower memory cost. This design challenges the conventional wisdom that constructing cost volumes is necessary for strong performance. WAFT is a simple and flexible meta-architecture with minimal inductive biases and reliance on custom designs. Compared with existing methods, WAFT ranks 1st on Spring and KITTI benchmarks, achieves the best zero-shot generalization on KITTI, while being up to 4.1x faster than methods with similar performance. Code and model weights are available at https://github.com/princeton-vl/WAFT.[145] Maximal Matching Matters: Preventing Representation Collapse for Robust Cross-Modal Retrieval
Hani Alomari,Anushka Sivakumar,Andrew Zhang,Chris Thomas
Main category: cs.CV
TL;DR: This paper proposes a novel method for cross-modal image-text retrieval by optimizing set-based embeddings through Maximal Pair Assignment Similarity and two new loss functions, achieving state-of-the-art results.
Details
Motivation: Traditional single-vector embeddings struggle to capture nuanced cross-modal relationships. Set-based approaches offer richer representations but face issues like sparse supervision and set collapse. Method: The paper introduces Maximal Pair Assignment Similarity to optimize one-to-one matching between embedding sets, along with two loss functions: Global Discriminative Loss and Intra-Set Divergence Loss. Result: The method addresses the limitations of set-based approaches and improves cross-modal image-text retrieval performance. Conclusion: The proposed method achieves state-of-the-art performance on MS-COCO and Flickr30k without relying on external data. Abstract: Cross-modal image-text retrieval is challenging because of the diverse possible associations between content from different modalities. Traditional methods learn a single-vector embedding to represent semantics of each sample, but struggle to capture nuanced and diverse relationships that can exist across modalities. Set-based approaches, which represent each sample with multiple embeddings, offer a promising alternative, as they can capture richer and more diverse relationships. In this paper, we show that, despite their promise, these set-based representations continue to face issues including sparse supervision and set collapse, which limit their effectiveness. To address these challenges, we propose Maximal Pair Assignment Similarity to optimize one-to-one matching between embedding sets, which preserves semantic diversity within the set. We also introduce two loss functions to further enhance the representations: Global Discriminative Loss to enhance distinction among embeddings, and Intra-Set Divergence Loss to prevent collapse within each set. Our method achieves state-of-the-art performance on MS-COCO and Flickr30k without relying on external data.[146] StruMamba3D: Exploring Structural Mamba for Self-supervised Point Cloud Representation Learning
Chuxin Wang,Yixin Zha,Wenfei Yang,Tianzhu Zhang
Main category: cs.CV
TL;DR: StruMamba3D improves Mamba-based point cloud learning by preserving spatial structure and enhancing long-sequence memory, achieving top performance on 3D classification tasks.
Details
Motivation: Mamba-based methods have shown promise in point cloud representation learning but suffer from two key issues: destruction of 3D point adjacency during SSM processing and loss of long-sequence memory as input length increases. Method: The authors propose StruMamba3D, which preserves spatial dependencies using spatial states, enhances SSM with a state-wise update strategy and lightweight convolution, and introduces a sequence length-adaptive strategy to reduce sensitivity to input length. Result: StruMamba3D achieves state-of-the-art results with 95.1% accuracy on ModelNet40 and 92.75% accuracy on the most challenging split of ScanObjectNN without using a voting strategy, across four downstream tasks. Conclusion: StruMamba3D addresses the limitations of Mamba-based methods in preserving adjacency and retaining long-sequence memory in point cloud representation learning, achieving superior performance and state-of-the-art accuracy on ModelNet40 and ScanObjectNN. Abstract: Recently, Mamba-based methods have demonstrated impressive performance in point cloud representation learning by leveraging State Space Model (SSM) with the efficient context modeling ability and linear complexity. However, these methods still face two key issues that limit the potential of SSM: Destroying the adjacency of 3D points during SSM processing and failing to retain long-sequence memory as the input length increases in downstream tasks. To address these issues, we propose StruMamba3D, a novel paradigm for self-supervised point cloud representation learning. It enjoys several merits. First, we design spatial states and use them as proxies to preserve spatial dependencies among points. Second, we enhance the SSM with a state-wise update strategy and incorporate a lightweight convolution to facilitate interactions between spatial states for efficient structure modeling. 
Third, our method reduces the sensitivity of pre-trained Mamba-based models to varying input lengths by introducing a sequence length-adaptive strategy. Experimental results across four downstream tasks showcase the superior performance of our method. In addition, our method attains the SOTA 95.1% accuracy on ModelNet40 and 92.75% accuracy on the most challenging split of ScanObjectNN without voting strategy.[147] DeOcc-1-to-3: 3D De-Occlusion from a Single Image via Self-Supervised Multi-View Diffusion
Yansong Qu,Shaohui Dai,Xinyang Li,Yuze Wang,You Shen,Liujuan Cao,Rongrong Ji
Main category: cs.CV
TL;DR: This paper introduces an end-to-end framework for generating consistent multi-views from occluded images, improving 3D reconstruction without prior inpainting or annotations, supported by a new benchmark for occlusion-aware methods.
Details
Motivation: The motivation is to address the challenge of reconstructing 3D objects from single RGB images under real-world occlusions, which current diffusion-based view synthesis models struggle with due to their assumption of fully visible inputs. Method: The method involves constructing a self-supervised training pipeline using the Pix2Gestalt dataset with occluded-unoccluded image pairs and pseudo-ground-truth views to teach structure-aware completion and view consistency. The model is fine-tuned without modifying the original architecture. Result: The result is an improved ability to generate structurally consistent novel views from occluded images, along with the introduction of a benchmark for occlusion-aware reconstruction that includes diverse occlusion levels, object categories, and mask patterns. Conclusion: The paper proposes an end-to-end framework for occlusion-aware multi-view generation, enabling 3D reconstruction from partially occluded images without prior inpainting or manual annotations. Abstract: Reconstructing 3D objects from a single image is a long-standing challenge, especially under real-world occlusions. While recent diffusion-based view synthesis models can generate consistent novel views from a single RGB image, they generally assume fully visible inputs and fail when parts of the object are occluded. This leads to inconsistent views and degraded 3D reconstruction quality. To overcome this limitation, we propose an end-to-end framework for occlusion-aware multi-view generation. Our method directly synthesizes six structurally consistent novel views from a single partially occluded image, enabling downstream 3D reconstruction without requiring prior inpainting or manual annotations. We construct a self-supervised training pipeline using the Pix2Gestalt dataset, leveraging occluded-unoccluded image pairs and pseudo-ground-truth views to teach the model structure-aware completion and view consistency. 
Without modifying the original architecture, we fully fine-tune the view synthesis model to jointly learn completion and multi-view generation. Additionally, we introduce the first benchmark for occlusion-aware reconstruction, encompassing diverse occlusion levels, object categories, and mask patterns. This benchmark provides a standardized protocol for evaluating future methods under partial occlusions. Our code is available at https://github.com/Quyans/DeOcc123.[148] SAM4D: Segment Anything in Camera and LiDAR Streams
Jianyun Xu,Song Wang,Ziqian Ni,Chunyong Hu,Sheng Yang,Jianke Zhu,Qiang Li
Main category: cs.CV
TL;DR: SAM4D is a multi-modal and temporal foundation model for promptable segmentation across camera and LiDAR streams.
Details
Motivation: To overcome annotation bottlenecks and ensure robust segmentation in dynamically changing autonomous driving scenes, the paper introduces Unified Multi-modal Positional Encoding (UMPE) and Motion-aware Cross-modal Memory Attention (MCMA). Method: The paper proposes SAM4D, a multi-modal and temporal foundation model that includes Unified Multi-modal Positional Encoding (UMPE) and Motion-aware Cross-modal Memory Attention (MCMA). It also develops a multi-modal automated data engine that combines VFM-driven video masklets, spatiotemporal 4D reconstruction, and cross-modal masklet fusion. Result: Extensive experiments validate the effectiveness of the proposed SAM4D, demonstrating its strong cross-modal segmentation ability and great potential for data annotation. Conclusion: SAM4D is an effective solution that aligns camera and LiDAR features in a shared 3D space and generates pseudo-labels orders of magnitude faster than human annotation. Abstract: We present SAM4D, a multi-modal and temporal foundation model designed for promptable segmentation across camera and LiDAR streams. Unified Multi-modal Positional Encoding (UMPE) is introduced to align camera and LiDAR features in a shared 3D space, enabling seamless cross-modal prompting and interaction. Additionally, we propose Motion-aware Cross-modal Memory Attention (MCMA), which leverages ego-motion compensation to enhance temporal consistency and long-horizon feature retrieval, ensuring robust segmentation across dynamically changing autonomous driving scenes. To avoid annotation bottlenecks, we develop a multi-modal automated data engine that synergizes VFM-driven video masklets, spatiotemporal 4D reconstruction, and cross-modal masklet fusion. This framework generates camera-LiDAR aligned pseudo-labels at a speed orders of magnitude faster than human annotation while preserving VFM-derived semantic fidelity in point cloud representations. We conduct extensive experiments on the constructed Waymo-4DSeg, which demonstrate the powerful cross-modal segmentation ability and great potential in data annotation of proposed SAM4D.[149] SiM3D: Single-instance Multiview Multimodal and Multisetup 3D Anomaly Detection Benchmark
Alex Costanzino,Pierluigi Zama Ramirez,Luigi Lella,Matteo Ragaglia,Alessandro Oliva,Giuseppe Lisanti,Luigi Di Stefano
Main category: cs.CV
TL;DR: The paper introduces SiM3D, the first benchmark for 3D anomaly detection and segmentation using multiview and multimodal data, focusing on single-instance training and synthetic-to-real generalization in manufacturing contexts.
Details
Motivation: To address the lack of benchmarks integrating multiview and multimodal data for 3D anomaly detection and segmentation, especially in scenarios where only one object instance is available for training, which is common in manufacturing. Method: The authors propose SiM3D, a new benchmark with a dataset acquired using industrial sensors and robots. It includes multiview images, point clouds, CAD models, and manual annotations. They adapt existing singleview methods as baselines and introduce new metrics for evaluation. Result: SiM3D provides a comprehensive dataset with high-resolution multiview images, point clouds, CAD models, and ground truth annotations for 333 object instances across eight types. It establishes baseline performances for multiview 3D ADS tasks. Conclusion: SiM3D is the first benchmark that integrates multiview and multimodal information for 3D anomaly detection and segmentation, particularly addressing single-instance anomaly detection and generalization from synthetic to real data. Abstract: We propose SiM3D, the first benchmark considering the integration of multiview and multimodal information for comprehensive 3D anomaly detection and segmentation (ADS), where the task is to produce a voxel-based Anomaly Volume. Moreover, SiM3D focuses on a scenario of high interest in manufacturing: single-instance anomaly detection, where only one object, either real or synthetic, is available for training. In this respect, SiM3D stands out as the first ADS benchmark that addresses the challenge of generalising from synthetic training data to real test data. SiM3D includes a novel multimodal multiview dataset acquired using top-tier industrial sensors and robots. The dataset features multiview high-resolution images (12 Mpx) and point clouds (7M points) for 333 instances of eight types of objects, alongside a CAD model for each type. We also provide manually annotated 3D segmentation GTs for anomalous test samples. 
To establish reference baselines for the proposed multiview 3D ADS task, we adapt prominent singleview methods and assess their performance using novel metrics that operate on Anomaly Volumes.[150] Whole-Body Conditioned Egocentric Video Prediction
Yutong Bai,Danny Tran,Amir Bar,Yann LeCun,Trevor Darrell,Jitendra Malik
Main category: cs.CV
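TL;DR: This paper proposes the PEVA model, which predicts egocentric video from 3D body pose and past video, offering a new approach to understanding how human actions interact with the environment.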