cs.CL

[1] Towards Probabilistic Question Answering Over Tabular Data

Chen Shen,Sajjadur Rahman,Estevam Hruschka

Main category: cs.CL

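TL;DR: This paper introduces LUCARIO, a new benchmark, together with a framework for probabilistic question answering over tabular data that combines Bayesian networks with large language models to better answer questions involving uncertainty.

Details Motivation: Existing systems such as NL2SQL fall short on probabilistic questions that require reasoning under uncertainty, which motivates the new approach. Method: Induce Bayesian networks from tables, translate natural-language questions into probabilistic queries, and use large language models to generate the final answers. Result: Experiments show significant improvements over baselines, highlighting the benefits of hybrid symbolic-neural reasoning. Conclusion: The paper contributes LUCARIO, a new benchmark, and a method for probabilistic QA over tabular data based on Bayesian networks and large language models. Abstract: Current approaches for question answering (QA) over tabular data, such as NL2SQL systems, perform well for factual questions where answers are directly retrieved from tables. However, they fall short on probabilistic questions requiring reasoning under uncertainty. In this paper, we introduce a new benchmark LUCARIO and a framework for probabilistic QA over large tabular data. Our method induces Bayesian Networks from tables, translates natural language queries into probabilistic queries, and uses large language models (LLMs) to generate final answers. Empirical results demonstrate significant improvements over baselines, highlighting the benefits of hybrid symbolic-neural reasoning.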
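
As a rough illustration of the kind of question such a system answers, the sketch below estimates a conditional probability from table rows by counting. It is not the LUCARIO method: the toy table, its column names, and the `conditional_probability` helper are invented for this example, and a real Bayesian network would factor the joint distribution instead of counting over full rows.

```python
# Hypothetical toy table; columns and values are invented for illustration.
ROWS = [
    {"region": "north", "plan": "basic", "churned": "yes"},
    {"region": "north", "plan": "basic", "churned": "no"},
    {"region": "north", "plan": "premium", "churned": "no"},
    {"region": "south", "plan": "basic", "churned": "yes"},
    {"region": "south", "plan": "basic", "churned": "yes"},
    {"region": "south", "plan": "premium", "churned": "no"},
]


def conditional_probability(rows, target_col, target_val, evidence):
    """Estimate P(target_col = target_val | evidence) by counting rows.

    `evidence` maps column names to required values. Returns None when
    no row matches the evidence, since the estimate is then undefined.
    """
    matching = [r for r in rows if all(r[c] == v for c, v in evidence.items())]
    if not matching:
        return None
    hits = sum(1 for r in matching if r[target_col] == target_val)
    return hits / len(matching)


# "How likely is a basic-plan customer to churn?"
p_churn_basic = conditional_probability(ROWS, "churned", "yes", {"plan": "basic"})
```

A factual NL2SQL query would return rows or aggregates; the probabilistic query above instead returns a degree of belief conditioned on evidence, which is the distinction the abstract draws.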

[2] Multi-lingual Functional Evaluation for Large Language Models

Victor Ojewale,Inioluwa Deborah Raji,Suresh Venkatasubramanian

Main category: cs.CL

TL;DR: This paper introduces new cross-lingual functional benchmarks to better assess the real-world performance and robustness of large language models across multiple languages, showing that current benchmarks often fail to capture true functional capabilities.

Details Motivation: Current multi-lingual evaluations using static data benchmarks like Belebele, M-MMLU, and M-GSM do not adequately reflect real-world performance and robustness of large language models across various linguistic contexts. Method: The researchers created new cross-lingual benchmarks by translating existing functional benchmark templates into five additional languages: French, Spanish, Hindi, Arabic, and Yoruba. These include the Cross-Lingual Grade School Math Symbolic (CL-GSM Symbolic) and Cross-Lingual Instruction-Following Eval (CL-IFEval). Result: Results showed that some static benchmarks poorly capture functional performance compared to the new functional benchmarks. For instance, there were significant performance drops between M-GSM and CL-GSM Symbolic in English, French, and Spanish, and between Belebele and CL-IFEval, while M-MMLU showed minimal drop against CL-IFEval. Model robustness also varied significantly across languages. Conclusion: The study concludes that multi-lingual functional benchmarks provide a more accurate understanding of practical performance and robustness of language models across different languages, revealing significant variations in model performance across languages. Abstract: Multi-lingual competence in large language models is often evaluated via static data benchmarks such as Belebele, M-MMLU and M-GSM. However, these evaluations often fail to provide an adequate understanding of the practical performance and robustness of models across multi-lingual settings. In response, we create multi-lingual functional benchmarks -- Cross-Lingual Grade School Math Symbolic (CL-GSM Symbolic) and Cross-Lingual Instruction-Following Eval (CL-IFEval)-- by translating existing functional benchmark templates from English to five additional languages that span the range of resources available for NLP: French, Spanish, Hindi, Arabic and Yoruba. 
Our results reveal that some static multi-lingual benchmarks capture functional performance much more closely than others (i.e. across models, there is a 24%, 17% and 18% decrease in performance between M-GSM and CL-GSM Symbolic in English, French and Spanish respectively; similarly there's a 15-24% performance drop across languages between Belebele and CL-IFEval, and only a 0.5% to 3% performance drop between M-MMLU and CL-IFEval). Similarly, we find that model robustness across languages varies significantly, with certain languages (e.g. Arabic, English) being the most consistently well performing across evaluation iterations.
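
To make the idea of a functional benchmark template concrete, here is a minimal sketch in which one symbolic template is rendered in two languages with freshly sampled values, so the ground truth is computed rather than stored. The templates, names, and the `make_instance` helper are hypothetical and are not taken from CL-GSM Symbolic.

```python
import random

# Hypothetical templates, invented for illustration only.
TEMPLATES = {
    "en": "{name} has {a} apples and buys {b} more. How many apples does {name} have now?",
    "es": "{name} tiene {a} manzanas y compra {b} más. ¿Cuántas manzanas tiene {name} ahora?",
}
NAMES = ["Amina", "Luis", "Priya"]


def make_instance(lang, rng):
    """Sample symbolic values, render the template, and compute the answer."""
    a = rng.randint(2, 50)
    b = rng.randint(2, 50)
    name = rng.choice(NAMES)
    question = TEMPLATES[lang].format(name=name, a=a, b=b)
    return {"question": question, "a": a, "b": b, "answer": a + b}


# The same seed yields the same underlying problem in each language,
# which allows a like-for-like comparison across languages.
english = make_instance("en", random.Random(0))
spanish = make_instance("es", random.Random(0))
```

Because each evaluation iteration can draw new values, a model that memorized a static test item gains nothing, which is one way such benchmarks can probe robustness.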

[3] The Ideation-Execution Gap: Execution Outcomes of LLM-Generated versus Human Research Ideas

Chenglei Si,Tatsunori Hashimoto,Diyi Yang

Main category: cs.CL

TL;DR: Research ideas generated by LLMs hold up poorly once executed, losing the advantage they had over human experts' ideas at the ideation stage.

Details Motivation: To test whether AI-generated research ideas lead to better research outcomes than human experts' ideas, and whether their apparent novelty survives actual execution. Method: The authors recruited 43 expert researchers to execute randomly assigned ideas, either written by experts or generated by an LLM, and compared review scores before and after execution. Result: After execution, the scores of LLM-generated ideas dropped significantly more than those of expert-written ideas, and on many metrics the human ideas ended up ranked higher, revealing a gap between ideation and execution. Conclusion: Current large language models remain limited in generating truly effective research ideas, and evaluating research ideas without execution outcomes is challenging. Abstract: Large Language Models (LLMs) have shown promise in accelerating the scientific research pipeline. A key capability for this process is the ability to generate novel research ideas, and prior studies have found settings in which LLM-generated research ideas were judged as more novel than human-expert ideas. However, a good idea should not simply appear to be novel, it should also result in better research after being executed. To test whether AI-generated ideas lead to better research outcomes, we conduct an execution study by recruiting 43 expert researchers to execute randomly-assigned ideas, either written by experts or generated by an LLM. Each expert spent over 100 hours implementing the idea and wrote a 4-page short paper to document the experiments. All the executed projects are then reviewed blindly by expert NLP researchers. Comparing the review scores of the same ideas before and after execution, the scores of the LLM-generated ideas decrease significantly more than expert-written ideas on all evaluation metrics (novelty, excitement, effectiveness, and overall; p < 0.05), closing the gap between LLM and human ideas observed at the ideation stage. When comparing the aggregated review scores from the execution study, we even observe that for many metrics there is a flip in rankings where human ideas score higher than LLM ideas. This ideation-execution gap highlights the limitations of current LLMs in generating truly effective research ideas and the challenge of evaluating research ideas in the absence of execution outcomes.

[4] MultiFinRAG: An Optimized Multimodal Retrieval-Augmented Generation (RAG) Framework for Financial Question Answering

Chinmay Gondhalekar,Urjitkumar Patel,Fang-Chun Yeh

Main category: cs.CL

TL;DR: This paper presents MultiFinRAG, a retrieval-augmented generation framework designed for financial QA, achieving higher accuracy compared to existing models by enabling efficient cross-modal reasoning.

Details Motivation: Financial documents are complex and multimodal, making traditional LLMs and RAG pipelines inefficient due to token limits, layout loss, and fragmented context. Method: The paper introduces MultiFinRAG, a framework that uses multimodal extraction, structured outputs, modality-aware indexing, and a tiered fallback strategy for financial QA. Result: MultiFinRAG achieves 19 percentage points higher accuracy than ChatGPT-4o on complex financial QA tasks involving text, tables, images, and combined multimodal reasoning. Conclusion: MultiFinRAG provides an effective solution for complex financial QA tasks by enabling cross-modal reasoning and outperforms traditional models like ChatGPT-4o. Abstract: Financial documents--such as 10-Ks, 10-Qs, and investor presentations--span hundreds of pages and combine diverse modalities, including dense narrative text, structured tables, and complex figures. Answering questions over such content often requires joint reasoning across modalities, which strains traditional large language models (LLMs) and retrieval-augmented generation (RAG) pipelines due to token limitations, layout loss, and fragmented cross-modal context. We introduce MultiFinRAG, a retrieval-augmented generation framework purpose-built for financial QA. MultiFinRAG first performs multimodal extraction by grouping table and figure images into batches and sending them to a lightweight, quantized open-source multimodal LLM, which produces both structured JSON outputs and concise textual summaries. These outputs, along with narrative text, are embedded and indexed with modality-aware similarity thresholds for precise retrieval. A tiered fallback strategy then dynamically escalates from text-only to text+table+image contexts when necessary, enabling cross-modal reasoning while reducing irrelevant context. 
Despite running on commodity hardware, MultiFinRAG achieves 19 percentage points higher accuracy than ChatGPT-4o (free-tier) on complex financial QA tasks involving text, tables, images, and combined multimodal reasoning.
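
The tiered fallback described above can be sketched roughly as follows. The abstract does not state the thresholds or the escalation criterion, so the values in `THRESHOLDS`, the `min_text_hits` rule, and the `select_context` helper are all assumptions made for illustration.

```python
# Hypothetical per-modality similarity thresholds (not from the paper).
THRESHOLDS = {"text": 0.70, "table": 0.65, "image": 0.55}


def select_context(scored_chunks, min_text_hits=2):
    """Pick retrieval context, escalating modalities only when needed.

    `scored_chunks` is a list of (modality, similarity, content) tuples.
    Assumed rule: stay text-only if enough text chunks clear the text
    threshold; otherwise include tables and images as well.
    """
    def passing(modalities):
        return [
            chunk for chunk in scored_chunks
            if chunk[0] in modalities and chunk[1] >= THRESHOLDS[chunk[0]]
        ]

    text_hits = passing({"text"})
    if len(text_hits) >= min_text_hits:
        return "text-only", text_hits
    return "text+table+image", passing({"text", "table", "image"})


tier, context = select_context([
    ("text", 0.90, "Revenue discussion"),
    ("text", 0.50, "Unrelated footnote"),
    ("table", 0.66, "Segment revenue table summary"),
    ("image", 0.50, "Chart summary"),
])
```

In this example only one text chunk clears its threshold, so the selector escalates and adds the table summary while still dropping the low-scoring image, which mirrors the stated goal of enabling cross-modal reasoning while reducing irrelevant context.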

[5] Uncovering Hidden Violent Tendencies in LLMs: A Demographic Analysis via Behavioral Vignettes

Quintin Myers,Yanjun Gao

Main category: cs.CL

TL;DR: This study evaluates how LLMs respond to violent content using a validated social science instrument, finding inconsistencies in their responses and demographic biases.

Details Motivation: As large language models (LLMs) are increasingly proposed for detecting and responding to violent content online, their ability to handle morally ambiguous, real-world scenarios remains underexamined, especially regarding potential biases. Method: The study used the Violent Behavior Vignette Questionnaire (VBVQ), a validated social science tool, along with persona-based prompting to evaluate six LLMs across different geopolitical and organizational contexts in a zero-shot setting. Result: Two key findings emerged: (1) LLMs' surface-level text generation often diverges from their internal preference for violent responses; and (2) their violent tendencies vary across demographics, frequently contradicting established findings in criminology, social science, and psychology. Conclusion: LLMs show discrepancies in generating violent content responses and exhibit demographic bias, highlighting the need for further scrutiny of their application in socially sensitive areas. Abstract: Large language models (LLMs) are increasingly proposed for detecting and responding to violent content online, yet their ability to reason about morally ambiguous, real-world scenarios remains underexamined. We present the first study to evaluate LLMs using a validated social science instrument designed to measure human response to everyday conflict, namely the Violent Behavior Vignette Questionnaire (VBVQ). To assess potential bias, we introduce persona-based prompting that varies race, age, and geographic identity within the United States. Six LLMs developed across different geopolitical and organizational contexts are evaluated under a unified zero-shot setting. Our study reveals two key findings: (1) LLMs' surface-level text generation often diverges from their internal preference for violent responses; (2) their violent tendencies vary across demographics, frequently contradicting established findings in criminology, social science, and psychology.

[6] Decide less, communicate more: On the construct validity of end-to-end fact-checking in medicine

Sebastian Joseph,Lily Chen,Barry Wei,Michael Mackert,Iain J. Marshall,Paul Pu Liang,Ramez Kouzy,Byron C. Wallace,Junyi Jessy Li

Main category: cs.CL

TL;DR: This study investigates why end-to-end medical fact-checking systems remain largely unused and identifies fundamental challenges: difficulty connecting claims to evidence, ambiguous claims with mismatched intentions, and subjective veracity judgments.

Details Motivation: Interest in applying automatic fact-checking systems to public health and medicine has grown because of the high-stakes nature of medical decisions and the difficulty of appraising a vast and diverse medical literature. Yet such systems remain largely unused. Method: An empirical study of how clinical experts verify real claims from social media by synthesizing medical evidence. Result: The study reveals fundamental challenges in applying end-to-end fact-checking to medicine: difficulty connecting claims in the wild to scientific evidence in the form of clinical trials, ambiguities in underspecified claims mixed with mismatched intentions, and inherently subjective veracity labels. Conclusion: Medical fact-checking should be approached as an interactive communication problem rather than an end-to-end process. Abstract: Technological progress has led to concrete advancements in tasks that were regarded as challenging, such as automatic fact-checking. Interest in adopting these systems for public health and medicine has grown due to the high-stakes nature of medical decisions and challenges in critically appraising a vast and diverse medical literature. Evidence-based medicine connects to every individual, and yet the nature of it is highly technical, rendering the medical literacy of the majority of users inadequate to sufficiently navigate the domain. Such problems with medical communication ripen the ground for end-to-end fact-checking agents: check a claim against current medical literature and return with an evidence-backed verdict. And yet, such systems remain largely unused. To understand this, we present the first study examining how clinical experts verify real claims from social media by synthesizing medical evidence. In searching for this upper-bound, we reveal fundamental challenges in end-to-end fact-checking when applied to medicine: Difficulties connecting claims in the wild to scientific evidence in the form of clinical trials; ambiguities in underspecified claims mixed with mismatched intentions; and inherently subjective veracity labels. We argue that fact-checking should be approached and evaluated as an interactive communication problem, rather than an end-to-end process.

[7] Optimising Language Models for Downstream Tasks: A Post-Training Perspective

Zhengyan Shi

Main category: cs.CL

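TL;DR: This thesis addresses the efficiency and robustness problems that arise when adapting language models to specific tasks. It proposes a series of methods, including a new continued pre-training technique that exploits unlabelled data, a parameter-efficient fine-tuning method, and improved supervised fine-tuning methods, to improve performance and adaptability across diverse NLP tasks.

Details Motivation: Although language models have shown remarkable capabilities in NLP, adapting them efficiently and robustly to specific tasks remains challenging. Conventional fine-tuning often underutilizes available unlabelled data and can lead to overfitting and high computational cost. Method: The thesis explores strategies for extracting task-relevant knowledge from unlabelled data and introduces a new continued pre-training technique; proposes a parameter-efficient fine-tuning method; introduces improved supervised fine-tuning methods; and develops new evaluation methods and benchmarks. Result: The continued pre-training technique outperforms state-of-the-art semi-supervised approaches; the parameter-efficient fine-tuning method substantially reduces memory and compute costs; the improved supervised fine-tuning methods improve performance when labelled data is scarce; and the new evaluation methods assess LM capabilities and adaptation more comprehensively. Conclusion: The thesis proposes a series of methods for better adapting language models (LMs) to downstream applications, and its empirical studies show that they substantially improve LM robustness, efficiency, and generalization. Abstract: Language models (LMs) have demonstrated remarkable capabilities in NLP, yet adapting them efficiently and robustly to specific tasks remains challenging. As their scale and complexity grow, fine-tuning LMs on labelled data often underutilizes available unlabelled data, leads to overfitting on small task-specific sets, and imposes significant computational costs. These limitations hamper their application to the open-ended landscape of real-world language tasks. This thesis proposes a series of methods to better adapt LMs to downstream applications. First, we explore strategies for extracting task-relevant knowledge from unlabelled data, introducing a novel continued pre-training technique that outperforms state-of-the-art semi-supervised approaches. Next, we present a parameter-efficient fine-tuning method that substantially reduces memory and compute costs while maintaining competitive performance. We also introduce improved supervised fine-tuning methods that enable LMs to better follow instructions, especially when labelled data is scarce, enhancing their performance across a range of NLP tasks, including open-ended generation. Finally, we develop new evaluation methods and benchmarks, such as multi-hop spatial reasoning tasks, to assess LM capabilities and adaptation more comprehensively. Through extensive empirical studies across diverse NLP tasks, our results demonstrate that these approaches substantially improve LM robustness, efficiency, and generalization, making them more adaptable to a broad range of applications.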
These advances mark a significant step towards more robust and efficient LMs, bringing us closer to the goal of artificial general intelligence.

[8] FineWeb2: One Pipeline to Scale Them All -- Adapting Pre-Training Data Processing to Every Language

Guilherme Penedo,Hynek Kydlíček,Vinko Sabolčec,Bettina Messmer,Negar Foroutan,Amir Hossein Kargaran,Colin Raffel,Martin Jaggi,Leandro Von Werra,Thomas Wolf

Main category: cs.CL

TL;DR: This work develops a pre-training data curation pipeline that can be automatically adapted to any language, and releases FineWeb2, a large-scale multilingual dataset.

Details Motivation: Open, large, high-quality English pre-training datasets have progressed rapidly, but training performant multilingual LLMs remains challenging, largely because filtering and deduplication pipelines are hard to tailor to a large number of languages. Method: A pre-training dataset curation pipeline based on FineWeb that can be automatically adapted to any language, combined with an approach for rebalancing datasets that accounts for both duplication count and quality. Result: Design choices were extensively ablated on nine diverse languages; the pipeline was then scaled to over 1000 languages to produce FineWeb2, a 20-terabyte multilingual dataset of 5 billion documents. Conclusion: The paper presents a new curation pipeline for multilingual LLM pre-training data and shows that it yields non-English corpora that produce more performant models than prior datasets. Abstract: Pre-training state-of-the-art large language models (LLMs) requires vast amounts of clean and diverse text data. While the open development of large high-quality English pre-training datasets has seen substantial recent progress, training performant multilingual LLMs remains a challenge, in large part due to the inherent difficulty of tailoring filtering and deduplication pipelines to a large number of languages. In this work, we introduce a new pre-training dataset curation pipeline based on FineWeb that can be automatically adapted to support any language. We extensively ablate our pipeline design choices on a set of nine diverse languages, guided by a set of meaningful and informative evaluation tasks that were chosen through a novel selection process based on measurable criteria. Ultimately, we show that our pipeline can be used to create non-English corpora that produce more performant models than prior datasets. We additionally introduce a straightforward and principled approach to rebalance datasets that takes into consideration both duplication count and quality, providing an additional performance uplift. Finally, we scale our pipeline to over 1000 languages using almost 100 Common Crawl snapshots to produce FineWeb2, a new 20 terabyte (5 billion document) multilingual dataset which we release along with our pipeline, training, and evaluation codebases.

[9] KaLM-Embedding-V2: Superior Training Techniques and Data Inspire A Versatile Embedding Model

Xinping Zhao,Xinshuo Hu,Zifei Shan,Shouzheng Huang,Yao Zhou,Zetian Sun,Zhenyu Liu,Dongfang Li,Xinyuan Wei,Qian Chen,Youcheng Pan,Yang Xiang,Meishan Zhang,Haofen Wang,Jun Yu,Baotian Hu,Min Zhang

Main category: cs.CL

TL;DR: This paper proposes KaLM-Embedding-V2, an efficient and compact text embedding model that achieves state-of-the-art performance by leveraging advanced training techniques and diverse datasets.

Details Motivation: To develop a versatile and compact embedding model that performs impressively in general-purpose text embedding tasks while being significantly smaller than existing models. Method: The paper introduces KaLM-Embedding-V2, which uses a fully bidirectional transformer with mean-pooling, a multi-stage training pipeline, focal-style reweighting, online hard-negative mixing, and diverse data collection for pre-training and fine-tuning. Result: KaLM-Embedding-V2 outperforms other models of comparable size and competes with much larger models across extensive evaluations on the MTEB benchmark for both Chinese and English. Conclusion: The paper concludes that KaLM-Embedding-V2, with its innovative architecture and training methods, sets a new standard for compact embedding models under 1B parameters. Abstract: In this paper, we propose KaLM-Embedding-V2, a versatile and compact embedding model, which achieves impressive performance in general-purpose text embedding tasks by leveraging superior training techniques and data. Our key innovations include: (1) To better align the architecture with representation learning, we remove the causal attention mask and adopt a fully bidirectional transformer with simple yet effective mean-pooling to produce fixed-length embeddings; (2) We employ a multi-stage training pipeline: (i) pre-training on large-scale weakly supervised open-source corpora; (ii) fine-tuning on high-quality retrieval and non-retrieval datasets; and (iii) model-soup parameter averaging for robust generalization. Besides, we introduce a focal-style reweighting mechanism that concentrates learning on difficult samples and an online hard-negative mixing strategy to continuously enrich hard negatives without expensive offline mining; (3) We collect over 20 categories of data for pre-training and 100 categories of data for fine-tuning, to boost both the performance and generalization of the embedding model. 
Extensive evaluations on the Massive Text Embedding Benchmark (MTEB) Chinese and English show that our model significantly outperforms others of comparable size, and competes with 3x, 14x, 18x, and 26x larger embedding models, setting a new standard for a versatile and compact embedding model with less than 1B parameters.
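
The mean-pooling step mentioned in the abstract can be illustrated with a small, dependency-free sketch: token vectors are averaged into one fixed-length embedding, with padding positions excluded through an attention mask. The `mean_pool` function and the toy vectors are illustrative assumptions, not code from KaLM-Embedding-V2.

```python
def mean_pool(token_embeddings, attention_mask):
    """Average token vectors where the mask is 1, ignoring padding.

    `token_embeddings` is a list of equal-length float lists (one per
    token); `attention_mask` is a list of 0/1 flags of the same length.
    """
    dim = len(token_embeddings[0])
    total = [0.0] * dim
    count = 0
    for vector, keep in zip(token_embeddings, attention_mask):
        if keep:
            count += 1
            for i, value in enumerate(vector):
                total[i] += value
    if count == 0:
        raise ValueError("attention_mask selects no tokens")
    return [component / count for component in total]


# Two real tokens followed by one padding token (masked out).
sentence_embedding = mean_pool(
    [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]],
    [1, 1, 0],
)
```

With a bidirectional encoder every token has seen the whole sentence, so an unweighted average over real tokens is a reasonable summary; under a causal mask, early tokens would carry only partial context.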

[10] Can Gradient Descent Simulate Prompting?

Eric Zhang,Leshem Choshen,Jacob Andreas

Main category: cs.CL

TL;DR: This paper explores a method for meta-training language models such that gradient updates can effectively emulate the process of conditioning on new information, achieving performance comparable to prompting.

Details Motivation: The motivation is to explore whether fine-tuning can be made to emulate prompting, which is often more effective for model updates, enabling robust generalization from single examples and logical inferences not typically seen under standard fine-tuning. Method: The method involves meta-training language models using gradient-based meta-learning tools, with the model's own prompted predictions as targets, aiming to make gradient updates mimic the effects of conditioning on new information. Result: The results show that gradient descent training can recover some or all of the performance of prompted models, demonstrating improvements on tasks like the 'reversal curse' and answering questions about text passages after a single gradient update. Conclusion: The paper concludes that gradient descent can emulate the effects of prompting in language models, opening new avenues for long-context modeling and offering insights into gradient-based learning's generalization capabilities. Abstract: There are two primary ways of incorporating new information into a language model (LM): changing its prompt or changing its parameters, e.g. via fine-tuning. Parameter updates incur no long-term storage cost for model changes. However, for many model updates, prompting is significantly more effective: prompted models can generalize robustly from single examples and draw logical inferences that do not occur under standard fine-tuning. Can models be modified so that fine-tuning does emulate prompting? This paper describes a method for meta-training LMs such that gradient updates emulate the effects of conditioning on new information. Our approach uses tools from gradient-based meta-learning but uses an LM's own prompted predictions as targets, eliminating the need for ground-truth labels. 
Subsequent gradient descent training recovers some (and occasionally all) of prompted model performance -- showing improvement on the "reversal curse" tasks, and answering questions about text passages after a single gradient update. These results suggest that, with appropriate initialization, gradient descent can be surprisingly expressive. Our results suggest new avenues for long-context modeling and offer insight into the generalization capabilities of gradient-based learning.

[11] SAC: A Framework for Measuring and Inducing Personality Traits in LLMs with Dynamic Intensity Control

Adithya Chittem,Aishna Shrivastava,Sai Tarun Pendela,Jagat Sesh Challa,Dhruv Kumar

Main category: cs.CL

TL;DR: This paper extends the Machine Personality Inventory (MPI) to incorporate the 16 Personality Factor (16PF) model and develops a structured framework called Specific Attribute Control (SAC), enabling expressive control over sixteen distinct traits.

Details Motivation: Most existing models rely on the Big Five (OCEAN) framework, which provides only coarse personality dimensions, and lack mechanisms for controlling trait intensity. Method: The authors develop a structured framework called Specific Attribute Control (SAC) and introduce adjective-based semantic anchoring to guide trait intensity expression. Result: Modelling intensity as a continuous spectrum yields more consistent and controllable personality expression than binary trait toggling. In addition, changes in target trait intensity systematically influence closely related traits, suggesting that LLMs internalize multi-dimensional personality structures. Conclusion: This work opens new pathways for controlled and nuanced human-machine interaction in domains such as healthcare, education, and interviewing, bringing us closer to truly human-like social machines. Abstract: Large language models (LLMs) have gained significant traction across a wide range of fields in recent years. There is also a growing expectation for them to display human-like personalities during interactions. To meet this expectation, numerous studies have proposed methods for modelling LLM personalities through psychometric evaluations. However, most existing models face two major limitations: they rely on the Big Five (OCEAN) framework, which only provides coarse personality dimensions, and they lack mechanisms for controlling trait intensity. In this paper, we address this gap by extending the Machine Personality Inventory (MPI), which originally used the Big Five model, to incorporate the 16 Personality Factor (16PF) model, allowing expressive control over sixteen distinct traits. We also developed a structured framework known as Specific Attribute Control (SAC) for evaluating and dynamically inducing trait intensity in LLMs. Our method introduces adjective-based semantic anchoring to guide trait intensity expression and leverages behavioural questions across five intensity factors: Frequency, Depth, Threshold, Effort, and Willingness. Through experimentation, we find that modelling intensity as a continuous spectrum yields substantially more consistent and controllable personality expression compared to binary trait toggling. Moreover, we observe that changes in target trait intensity systematically influence closely related traits in psychologically coherent directions, suggesting that LLMs internalize multi-dimensional personality structures rather than treating traits in isolation.
Our work opens new pathways for controlled and nuanced human-machine interactions in domains such as healthcare, education, and interviewing processes, bringing us one step closer to truly human-like social machines.

[12] Large Language Models Acing Chartered Accountancy

Jatin Gupta,Akhil Sharma,Saransh Singhania,Mohammad Adnan,Sakshi Deo,Ali Imam Abidi,Keshav Gupta

Main category: cs.CL

TL;DR: This paper introduces CA-Ben, a Chartered Accountancy benchmark to evaluate LLMs' financial, legal, and quantitative reasoning capabilities. Claude 3.5 Sonnet and GPT-4o performed best, but challenges persist in numerical and legal tasks.

Details Motivation: This paper addresses the gap in assessing how effectively LLMs capture and apply domain-specific financial knowledge within the Indian financial context. Method: The study evaluates six prominent LLMs using CA-Ben, a benchmark derived from Chartered Accountancy exams conducted by ICAI, with standardized protocols across foundational, intermediate, and advanced curriculum stages. Result: Claude 3.5 Sonnet and GPT-4o outperformed other models in conceptual and legal reasoning, but all models faced notable challenges in numerical computations and legal interpretations. Conclusion: The paper concludes that while some LLMs perform well in conceptual and legal reasoning, challenges remain in numerical computations and legal interpretations. It suggests future improvements through hybrid reasoning and retrieval-augmented generation methods. Abstract: Advanced intelligent systems, particularly Large Language Models (LLMs), are significantly reshaping financial practices through advancements in Natural Language Processing (NLP). However, the extent to which these models effectively capture and apply domain-specific financial knowledge remains uncertain. Addressing a critical gap in the expansive Indian financial context, this paper introduces CA-Ben, a Chartered Accountancy benchmark specifically designed to evaluate the financial, legal, and quantitative reasoning capabilities of LLMs. CA-Ben comprises structured question-answer datasets derived from the rigorous examinations conducted by the Institute of Chartered Accountants of India (ICAI), spanning foundational, intermediate, and advanced CA curriculum stages. Six prominent LLMs i.e. GPT 4o, LLAMA 3.3 70B, LLAMA 3.1 405B, MISTRAL Large, Claude 3.5 Sonnet, and Microsoft Phi 4 were evaluated using standardized protocols. Results indicate variations in performance, with Claude 3.5 Sonnet and GPT-4o outperforming others, especially in conceptual and legal reasoning. 
Notable challenges emerged in numerical computations and legal interpretations. The findings emphasize the strengths and limitations of current LLMs, suggesting future improvements through hybrid reasoning and retrieval-augmented generation methods, particularly for quantitative analysis and accurate legal interpretation.

[13] A Semi-supervised Scalable Unified Framework for E-commerce Query Classification

Chunyuan Yuan,Chong Zhang,Zheng Fang,Ming Pang,Xue Jiang,Changping Peng,Zhangang Lin,Ching Law

Main category: cs.CL

TL;DR: This paper proposes a novel semi-supervised unified framework for e-commerce query classification, which uses several enhanced modules to address insufficient query information and reliance on posterior labels; experiments show it outperforms existing methods.

Details Motivation: E-commerce queries are usually short and lack context, leaving insufficient prior information for modeling, while existing methods rely on users' posterior click behavior and lack a unified framework, which lowers optimization efficiency. Method: A knowledge-enhanced module uses world knowledge to enrich query representations; a label-enhanced module uses label semantics and semi-supervised signals to reduce dependence on posterior labels; a structure-enhanced module enhances label representations based on complex label relations. Result: Extensive offline and online A/B experiments show that SSUF significantly outperforms state-of-the-art models. Conclusion: The paper proposes a novel Semi-supervised Scalable Unified Framework (SSUF) that unifies query classification tasks and, through its enhanced modules, significantly outperforms existing models. Abstract: Query classification, including multiple subtasks such as intent and category prediction, is vital to e-commerce applications. E-commerce queries are usually short and lack context, and the information between labels cannot be used, resulting in insufficient prior information for modeling. Most existing industrial query classification methods rely on users' posterior click behavior to construct training samples, resulting in a Matthew vicious cycle. Furthermore, the subtasks of query classification lack a unified framework, leading to low efficiency for algorithm optimization. In this paper, we propose a novel Semi-supervised Scalable Unified Framework (SSUF), containing multiple enhanced modules to unify the query classification tasks. The knowledge-enhanced module uses world knowledge to enhance query representations and solve the problem of insufficient query information. The label-enhanced module uses label semantics and semi-supervised signals to reduce the dependence on posterior labels. The structure-enhanced module enhances the label representation based on the complex label relations. Each module is highly pluggable, and input features can be added or removed as needed according to each subtask. We conduct extensive offline and online A/B experiments, and the results show that SSUF significantly outperforms the state-of-the-art models.

[14] MT2-CSD: A New Dataset and Multi-Semantic Knowledge Fusion Method for Conversational Stance Detection

Fuqiang Niu,Genan Dai,Yisha Lu,Jiayu Liao,Xiang Li,Hu Huang,Bowen Zhang

Main category: cs.CL

TL;DR: This paper introduces the MT2-CSD dataset for multi-target, multi-turn conversational stance detection and proposes LLM-CRAN, a model leveraging large language models to improve conversational understanding, showing superior performance over baselines.

Details Motivation: Traditional stance detection research targets individual instances, limiting its capacity to model multi-party discussions typical in real social media scenarios. This shortcoming largely stems from the scarcity of datasets that authentically capture the dynamics of social media interactions, hindering advancements in conversational stance detection. Method: The authors introduced the MT2-CSD dataset and proposed the Large Language Model enhanced Conversational Relational Attention Network (LLM-CRAN) to address the challenges of conversational stance detection. They conducted extensive experiments to evaluate the efficacy of LLM-CRAN on the MT2-CSD dataset. Result: MT2-CSD is the largest dataset available for multi-target, multi-turn conversational stance detection, comprising 24,457 annotated instances and exhibiting the greatest conversational depth. The experimental results indicate that LLM-CRAN significantly outperforms strong baseline models in the task of conversational stance detection. Conclusion: The paper concludes that LLM-CRAN significantly outperforms strong baseline models in conversational stance detection, indicating its effectiveness in addressing the challenges posed by the MT2-CSD dataset. Abstract: In the realm of contemporary social media, automatic stance detection is pivotal for opinion mining, as it synthesizes and examines user perspectives on contentious topics to uncover prevailing trends and sentiments. Traditional stance detection research often targets individual instances, thereby limiting its capacity to model multi-party discussions typical in real social media scenarios. This shortcoming largely stems from the scarcity of datasets that authentically capture the dynamics of social media interactions, hindering advancements in conversational stance detection. In this paper, we introduce MT2-CSD, a comprehensive dataset for multi-target, multi-turn conversational stance detection. 
To the best of our knowledge, MT2-CSD is the largest dataset available for this purpose, comprising 24,457 annotated instances and exhibiting the greatest conversational depth, thereby presenting new challenges for stance detection. To address these challenges, we propose the Large Language Model enhanced Conversational Relational Attention Network (LLM-CRAN), which exploits the reasoning capabilities of LLMs to improve conversational understanding. We conduct extensive experiments to evaluate the efficacy of LLM-CRAN on the MT2-CSD dataset. The experimental results indicate that LLM-CRAN significantly outperforms strong baseline models in the task of conversational stance detection.

[15] DALR: Dual-level Alignment Learning for Multimodal Sentence Representation Learning

Kang He,Yuzhe Ding,Haining Wang,Fei Li,Chong Teng,Donghong Ji

Main category: cs.CL

TL;DR: This paper proposes DALR, a dual-level alignment approach that improves multimodal sentence representation by addressing cross-modal and intra-modal challenges, achieving superior performance over existing methods.

Details Motivation: Previous methods face two critical challenges: cross-modal misalignment bias and intra-modal semantic divergence, which degrade sentence representation quality. Method: DALR uses a consistency learning module for fine-grained cross-modal alignment and integrates ranking distillation with global intra-modal alignment learning to enhance representation quality. Result: Experiments on semantic textual similarity (STS) and transfer (TR) tasks show that DALR consistently surpasses existing approaches. Conclusion: The proposed DALR method outperforms state-of-the-art baselines in multimodal sentence representation learning by addressing cross-modal misalignment bias and intra-modal semantic divergence. Abstract: Previous multimodal sentence representation learning methods have achieved impressive performance. However, most approaches focus on aligning images and text at a coarse level, facing two critical challenges: cross-modal misalignment bias and intra-modal semantic divergence, which significantly degrade sentence representation quality. To address these challenges, we propose DALR (Dual-level Alignment Learning for Multimodal Sentence Representation). For cross-modal alignment, we propose a consistency learning module that softens negative samples and utilizes semantic similarity from an auxiliary task to achieve fine-grained cross-modal alignment. Additionally, we contend that sentence relationships go beyond binary positive-negative labels, exhibiting a more intricate ranking structure. To better capture these relationships and enhance representation quality, we integrate ranking distillation with global intra-modal alignment learning. Comprehensive experiments on semantic textual similarity (STS) and transfer (TR) tasks validate the effectiveness of our approach, consistently demonstrating its superiority over state-of-the-art baselines.

[16] ComRAG: Retrieval-Augmented Generation with Dynamic Vector Stores for Real-time Community Question Answering in Industry

Qinwen Chen,Wenbiao Tao,Zhiwei Zhu,Mingfan Xi,Liangzhong Guo,Yuan Wang,Wei Wang,Yunshi Lan

Main category: cs.CL

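TL;DR: This paper proposes ComRAG, a new framework that improves the effectiveness and efficiency of industrial community question answering platforms through a novel memory mechanism.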

Details Motivation: Existing methods often underutilize external knowledge, fail to incorporate dynamic historical QA context, or lack memory mechanisms suited for industrial deployment. Method: The paper proposes a centroid-based memory mechanism that integrates static knowledge with dynamic historical QA pairs in a retrieval-augmented generation framework. Result: Evaluated on three industrial community question answering datasets, ComRAG consistently outperforms all baselines, achieving up to 25.9% improvement in vector similarity, reducing latency by 8.7% to 23.3%, and lowering chunk growth from 20.23% to 2.06% over iterations. Conclusion: ComRAG is a retrieval-augmented generation framework for real-time industrial community question answering that outperforms all baselines. Abstract: Community Question Answering (CQA) platforms can be regarded as important community knowledge bases, but effectively leveraging historical interactions and domain knowledge in real-time remains a challenge. Existing methods often underutilize external knowledge, fail to incorporate dynamic historical QA context, or lack memory mechanisms suited for industrial deployment. We propose ComRAG, a retrieval-augmented generation framework for real-time industrial CQA that integrates static knowledge with dynamic historical QA pairs via a centroid-based memory mechanism designed for retrieval, generation, and efficient storage. Evaluated on three industrial CQA datasets, ComRAG consistently outperforms all baselines--achieving up to 25.9% improvement in vector similarity, reducing latency by 8.7% to 23.3%, and lowering chunk growth from 20.23% to 2.06% over iterations.

[17] Progtuning: Progressive Fine-tuning Framework for Transformer-based Language Models

Xiaoshuang Ji,Zhendong Zhao,Xiaojun Chen,Xin Zhao,Zeyao Liu

Main category: cs.CL

TL;DR: This paper introduces Progtuning, a method for efficiently fine-tuning large language models by progressively updating only the most impactful Transformer blocks, significantly reducing computational costs without sacrificing performance.

Details Motivation: As Transformer-based language models grow in size, updating all parameters becomes costly. Existing fine-tuning and parameter-efficient methods do not account for the varying contributions of different Transformer blocks, leading to inefficient resource allocation. Method: The paper proposes Progtuning, which uses progressive learning to selectively update Transformer blocks based on their contribution, combined with parameter-efficient fine-tuning methods. Result: Progtuning reduces the number of updated parameters by approximately 25% while maintaining strong performance and demonstrates adaptability across various fine-tuning scenarios. Conclusion: Progtuning is a novel fine-tuning framework that optimizes resource allocation by progressively reducing the number of updated Transformer blocks, thereby reducing the number of updated parameters while maintaining competitive performance. Abstract: Fine-tuning is a promising technique for leveraging Transformer-based language models in downstream tasks. As model sizes continue to grow, updating all model parameters becomes increasingly costly. Parameter-efficient fine-tuning methods effectively address this issue by selectively updating a small subset of parameters. However, fine-tuning and most existing parameter-efficient fine-tuning methods require updating the same number of parameters as the initial size, ignoring the unequal contribution across Transformer blocks and leading to extremely inefficient allocation of computing resources. In this paper, we propose Progtuning, a novel fine-tuning framework combined with progressive learning for Transformer-based language models. Specifically, Progtuning progressively reduces the number of updated Transformer blocks based on their contribution. Remarkably, Progtuning optimizes resource allocation and reduces the number of updated parameters by approximately 25%, while still maintaining competitive performance. 
It also exhibits high adaptability with parameter-efficient fine-tuning methods, demonstrating excellent performance across various adaptation scenarios.
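
To illustrate what "progressively reducing the number of updated blocks" could look like, here is a minimal sketch. It assumes a hypothetical schedule in which training is split into equal stages and each later stage freezes a further share of the lowest-indexed blocks. The summary above does not specify how Progtuning measures block contribution, so the freezing order and the names `trainable_blocks` and `updated_fraction` are illustrative assumptions, not the authors' method.

```python
from typing import List


def trainable_blocks(num_blocks: int, num_stages: int, stage: int) -> List[int]:
    """Return indices of the Transformer blocks updated at a given stage.

    Assumed schedule (not from the paper): stage 0 updates every block,
    and each later stage freezes an equal extra share of the lowest blocks.
    """
    if not 0 <= stage < num_stages:
        raise ValueError("stage out of range")
    frozen = (num_blocks * stage) // num_stages
    return list(range(frozen, num_blocks))


def updated_fraction(num_blocks: int, num_stages: int) -> float:
    """Average fraction of blocks updated per stage under the assumed schedule."""
    total = sum(
        len(trainable_blocks(num_blocks, num_stages, s)) for s in range(num_stages)
    )
    return total / (num_blocks * num_stages)


if __name__ == "__main__":
    for s in range(4):
        print(s, trainable_blocks(12, 4, s))
    print(updated_fraction(12, 4))
```

In a real training loop the returned indices would decide which blocks have gradients enabled. The savings under this toy schedule are not meant to reproduce the roughly 25% figure reported above.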

[18] Compressed and Smooth Latent Space for Text Diffusion Modeling

Viacheslav Meshchaninov,Egor Chimbulatov,Alexander Shabalin,Aleksandr Abramov,Dmitry Vetrov

Main category: cs.CL

TL;DR: Cosmos is a new text generation method using diffusion in a compressed latent space, offering faster inference and better performance compared to traditional approaches.

Details Motivation: Autoregressive language models are slow in decoding and struggle with global coherence, while diffusion models face challenges due to the high dimensionality of token-level representations. Method: Cosmos uses an autoencoder trained for token-level reconstruction and alignment with frozen activations from a pretrained language encoder to learn a compressed latent space for diffusion-based text generation. Result: Cosmos achieves 8× compression of text representations while maintaining generation quality comparable to token-level diffusion models. It also surpasses both diffusion-based and autoregressive baselines when increasing latent sequence length and provides more than 2× faster inference. Conclusion: Cosmos offers a novel approach to text generation by operating in a compressed, smooth latent space for diffusion models, overcoming limitations of autoregressive and token-level diffusion models. Abstract: Autoregressive language models dominate modern text generation, yet their sequential nature introduces fundamental limitations: decoding is slow, and maintaining global coherence remains challenging. Diffusion models offer a promising alternative by enabling parallel generation and flexible control; however, their application to text generation is hindered by the high dimensionality of token-level representations. We introduce Cosmos, a novel approach to text generation that operates entirely in a compressed, smooth latent space tailored specifically for diffusion. This space is learned using an autoencoder trained simultaneously for token-level reconstruction and alignment with frozen activations from a pretrained language encoder, providing robust semantic grounding and enabling effective perturbation-based augmentations. Empirically, we demonstrate that text representations can be compressed by $8\times$ while maintaining generation quality comparable to token-level diffusion models. 
Furthermore, increasing the latent sequence length allows Cosmos to surpass both diffusion-based and autoregressive baselines. We evaluate Cosmos on four diverse generative tasks including story generation, question generation, summarization, and detoxification and compare it with various generative paradigms. Cosmos achieves comparable or superior generation quality while offering more than $2\times$ faster inference.

[19] Maintaining MTEB: Towards Long Term Usability and Reproducibility of Embedding Benchmarks

Isaac Chung,Imene Kerboua,Marton Kardos,Roman Solomatin,Kenneth Enevoldsen

Main category: cs.CL

TL;DR: This paper describes the engineering practices that keep the MTEB text embedding benchmark reproducible and extensible over time.

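Details Motivation: While previous work has established MTEB's core benchmark methodology, this paper focuses on the engineering aspects that ensure its continued reproducibility and extensibility, addressing the challenges of ensuring reproducibility and usability in machine learning evaluation frameworks. Method: The paper presents engineering approaches for keeping MTEB reproducible and extensible, including maintaining robust continuous integration pipelines, validating dataset integrity, automating test execution, and assessing the generalizability of benchmark results. It also discusses strategies for handling community contributions and extending the benchmark with new tasks and datasets. Result: These engineering practices have been instrumental in scaling MTEB to become more comprehensive while maintaining quality and relevance to the field. Conclusion: Through strong engineering practices such as continuous integration pipelines and community contribution strategies, MTEB achieves extensibility while maintaining quality, offering valuable insights for other benchmark maintainers. Abstract: The Massive Text Embedding Benchmark (MTEB) has become a standard evaluation platform for text embedding models. While previous work has established the core benchmark methodology, this paper focuses on the engineering aspects that ensure MTEB's continued reproducibility and extensibility. We present our approach to maintaining robust continuous integration pipelines that validate dataset integrity, automate test execution, and assess benchmark results' generalizability. We detail the design choices that collectively enhance reproducibility and usability. Furthermore, we discuss our strategies for handling community contributions and extending the benchmark with new tasks and datasets. These engineering practices have been instrumental in scaling MTEB to become more comprehensive while maintaining quality and, ultimately, relevance to the field. Our experiences offer valuable insights for benchmark maintainers facing similar challenges in ensuring reproducibility and usability in machine learning evaluation frameworks. The MTEB repository is available at: https://github.com/embeddings-benchmark/mteb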

[20] Prompt-Guided Turn-Taking Prediction

Koji Inoue,Mikey Elmers,Yahui Fu,Zi Haur Pang,Divesh Lala,Keiko Ochi,Tatsuya Kawahara

Main category: cs.CL

TL;DR: This paper proposes a transformer-based turn-taking prediction model enhanced with textual prompt embeddings, enabling dynamic adjustment of conversational behaviors and improving prediction accuracy.

Details Motivation: To enable more intuitive and explicit control over turn-taking prediction models in spoken dialogue systems, adapting to different conversational partners and contexts in real-time. Method: A novel model based on a transformer-based voice activity projection (VAP) model was developed, integrating textual prompt embeddings into channel-wise and cross-channel transformers to dynamically control turn-taking prediction. Result: Using over 950 hours of human-human spoken dialogue data and synthetic prompts generated by an LLM, the model demonstrated improved prediction accuracy and successfully adapted turn-taking timing according to textual prompts. Conclusion: The proposed model enhances turn-taking prediction in dialogue systems by incorporating textual prompt embeddings into a transformer-based architecture, allowing dynamic behavior adjustments. Abstract: Turn-taking prediction models are essential components in spoken dialogue systems and conversational robots. Recent approaches leverage transformer-based architectures to predict speech activity continuously and in real-time. In this study, we propose a novel model that enables turn-taking prediction to be dynamically controlled via textual prompts. This approach allows intuitive and explicit control through instructions such as "faster" or "calmer", adapting dynamically to conversational partners and contexts. The proposed model builds upon a transformer-based voice activity projection (VAP) model, incorporating textual prompt embeddings into both channel-wise transformers and a cross-channel transformer. We evaluated the feasibility of our approach using over 950 hours of human-human spoken dialogue data. Since textual prompt data for the proposed approach was not available in existing datasets, we utilized a large language model (LLM) to generate synthetic prompt sentences. 
Experimental results demonstrated that the proposed model improved prediction accuracy and effectively varied turn-taking timing behaviors according to the textual prompts.

[21] Enhancing Automatic Term Extraction with Large Language Models via Syntactic Retrieval

Yongchan Chun,Minhyuk Kim,Dongjun Kim,Chanjun Park,Heuiseok Lim

Main category: cs.CL

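TL;DR: This paper studies how a syntax-based retrieval strategy can improve large language models on automatic term extraction, and finds that the approach effectively improves accuracy.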

Details Motivation: Although large language models (LLMs) have significantly advanced various NLP tasks, their potential for automatic term extraction (ATE) has scarcely been examined. Method: The paper proposes a retrieval-based prompting strategy that, in the few-shot setting, selects demonstrations according to syntactic rather than semantic similarity. The method is domain-agnostic and provides more reliable guidance for capturing term boundaries. Result: Experiments show that syntactic retrieval improves F1-score on three specialized ATE benchmarks. Conclusion: Syntactic cues play an important role when adapting large language models to term extraction, and syntactic retrieval improves F1-score. Abstract: Automatic Term Extraction (ATE) identifies domain-specific expressions that are crucial for downstream tasks such as machine translation and information retrieval. Although large language models (LLMs) have significantly advanced various NLP tasks, their potential for ATE has scarcely been examined. We propose a retrieval-based prompting strategy that, in the few-shot setting, selects demonstrations according to syntactic rather than semantic similarity. This syntactic retrieval method is domain-agnostic and provides more reliable guidance for capturing term boundaries. We evaluate the approach in both in-domain and cross-domain settings, analyzing how lexical overlap between the query sentence and its retrieved examples affects performance. Experiments on three specialized ATE benchmarks show that syntactic retrieval improves F1-score. These findings highlight the importance of syntactic cues when adapting LLMs to terminology-extraction tasks.
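
A minimal sketch of selecting few-shot demonstrations by syntactic similarity, under stated assumptions: sentences are already part-of-speech tagged, and similarity is measured by matching tag sequences with Python's standard `difflib.SequenceMatcher`. The summary does not say which syntactic representation or similarity measure the paper uses, so both choices, along with the names `syntactic_similarity` and `select_demonstrations`, are hypothetical.

```python
from difflib import SequenceMatcher
from typing import List, Sequence, Tuple


def syntactic_similarity(tags_a: Sequence[str], tags_b: Sequence[str]) -> float:
    """Similarity in [0, 1] between two POS-tag sequences (assumed measure)."""
    return SequenceMatcher(None, list(tags_a), list(tags_b)).ratio()


def select_demonstrations(
    query_tags: Sequence[str],
    pool: List[Tuple[str, Sequence[str]]],
    k: int = 2,
) -> List[str]:
    """Pick the k pool sentences whose tag sequences best match the query.

    `pool` holds (sentence, pos_tags) pairs; the words themselves are ignored,
    so the ranking depends on structure rather than topic.
    """
    ranked = sorted(
        pool,
        key=lambda item: syntactic_similarity(query_tags, item[1]),
        reverse=True,
    )
    return [sentence for sentence, _ in ranked[:k]]


if __name__ == "__main__":
    pool = [
        ("The red car stopped", ["DET", "ADJ", "NOUN", "VERB"]),
        ("Run quickly", ["VERB", "ADV"]),
        ("A small dog barked loudly", ["DET", "ADJ", "NOUN", "VERB", "ADV"]),
    ]
    print(select_demonstrations(["DET", "ADJ", "NOUN", "VERB"], pool, k=2))
```

Because only tag sequences are compared, a demonstration from an unrelated domain can still rank highly, which is one way to read the "domain-agnostic" claim above.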

[22] Agent-RewardBench: Towards a Unified Benchmark for Reward Modeling across Perception, Planning, and Safety in Real-World Multimodal Agents

Tianyi Men,Zhuoran Jin,Pengfei Cao,Yubo Chen,Kang Liu,Jun Zhao

Main category: cs.CL

TL;DR: This paper introduces Agent-RewardBench, a new benchmark for evaluating reward models in multimodal agents, revealing that current models have limited effectiveness.

Details Motivation: Multimodal agents struggle with self-correction and generalization due to a lack of external feedback, making it important to develop effective reward models. Method: The authors developed Agent-RewardBench, a benchmark with multiple dimensions and step-level evaluation to assess reward modeling in MLLMs. Result: Experiments showed that even advanced multimodal models perform poorly on the benchmark, indicating the need for better reward modeling techniques. Conclusion: The study concludes that current reward models are insufficient for multimodal agents, and specialized training is needed to improve their performance. Abstract: As Multimodal Large Language Models (MLLMs) advance, multimodal agents show promise in real-world tasks like web navigation and embodied intelligence. However, due to a lack of external feedback, these agents struggle with self-correction and generalization. A promising approach is to use reward models as external feedback, but there is no clear guidance on how to select reward models for agents. Thus, there is an urgent need to build a reward bench targeted at agents. To address these challenges, we propose Agent-RewardBench, a benchmark designed to evaluate reward modeling ability in MLLMs. The benchmark is characterized by three key features: (1) Multiple dimensions and real-world agent scenarios evaluation. It covers perception, planning, and safety with 7 scenarios; (2) Step-level reward evaluation. It allows for the assessment of agent capabilities at the individual steps of a task, providing a more granular view of performance during the planning process; and (3) Appropriate difficulty and high quality. We carefully sample from 10 diverse models, apply difficulty control to maintain task challenge, and perform manual verification to ensure the integrity of the data. 
Experiments demonstrate that even state-of-the-art multimodal models show limited performance, highlighting the need for specialized training in agent reward modeling. Code is available at github.

[23] Cat and Mouse -- Can Fake Text Generation Outpace Detector Systems?

Andrea McGlinchey,Peter J Barclay

Main category: cs.CL

TL;DR: This paper investigates whether the increasing size and complexity of language models will make fake text detection impractical, finding that while some models become more deceptive, effective detection remains feasible.

Details Motivation: Motivated by the concern that detecting artificially generated text might become increasingly difficult as language models grow more advanced, this study explores whether detection capabilities reach a plateau. Method: The researchers examined the ability of statistical classifiers to identify 'fake text' in the style of classical detective fiction and observed changes in performance with an increase in model versions. Result: Over a 0.5 version increase, Gemini showed improved ability to generate deceptive text, whereas GPT did not, indicating varying behaviors across model architectures. Conclusion: The study concludes that reliable detection of fake text may remain feasible even as models become larger, although new model architectures could enhance the deceptiveness of generated text. Abstract: Large language models can produce convincing "fake text" in domains such as academic writing, product reviews, and political news. Many approaches have been investigated for the detection of artificially generated text. While this may seem to presage an endless "arms race", we note that newer LLMs use ever more parameters, training data, and energy, while relatively simple classifiers demonstrate a good level of detection accuracy with modest resources. To approach the question of whether the models' ability to beat the detectors may therefore reach a plateau, we examine the ability of statistical classifiers to identify "fake text" in the style of classical detective fiction. Over a 0.5 version increase, we found that Gemini showed an increased ability to generate deceptive text, while GPT did not. This suggests that reliable detection of fake text may remain feasible even for ever-larger models, though new model architectures may improve their deceptiveness.

[24] Double-Checker: Enhancing Reasoning of Slow-Thinking LLMs via Self-Critical Fine-Tuning

Xin Xu,Tianhao Chen,Fan Zhang,Wanlong Liu,Pengxiang Li,Ajay Kumar Jaiswal,Yuchen Yan,Jishan Hu,Yang Wang,Hao Chen,Shiwei Liu,Shizhe Diao,Can Yang,Lu Yin

Main category: cs.CL

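TL;DR: This paper introduces Double-Checker, a framework that significantly improves the reasoning ability of slow-thinking large language models through iterative self-critique and refinement.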

Details Motivation: Although slow-thinking large language models (LLMs) exhibit reflection-like reasoning, their ability to generate informative critiques and refine prior solutions remains limited. Method: By fine-tuning on 1,730 curated self-critical instances, Double-Checker enables long-CoT LLMs to iteratively critique and refine their outputs during inference until they evaluate their solutions as correct under self-generated critiques. Result: Double-Checker's effectiveness is validated across a comprehensive suite of reasoning benchmarks, showing that iterative self-critique significantly enhances the reasoning capabilities of long-CoT LLMs. Notably, on the challenging AIME benchmarks, Double-Checker raises pass@1 from 4.4% to 18.2% compared to the original long-CoT LLMs. Conclusion: Double-Checker is a principled framework designed to enhance the reasoning capabilities of slow-thinking LLMs by fostering explicit self-critique and iterative refinement of previous solutions. Abstract: While slow-thinking large language models (LLMs) exhibit reflection-like reasoning, commonly referred to as the "aha moment", their ability to generate informative critiques and refine prior solutions remains limited. In this paper, we introduce Double-Checker, a principled framework designed to enhance the reasoning capabilities of slow-thinking LLMs by fostering explicit self-critique and iterative refinement of their previous solutions. By fine-tuning on our curated 1,730 self-critical instances, Double-Checker empowers long-CoT LLMs to iteratively critique and refine their outputs during inference until they evaluate their solutions as correct under self-generated critiques. We validate the efficacy of Double-Checker across a comprehensive suite of reasoning benchmarks, demonstrating that iterative self-critique significantly enhances the reasoning capabilities of long-CoT LLMs. Notably, our Double-Checker increases the pass@1 performance on challenging AIME benchmarks from 4.4% to 18.2% compared to the original long-CoT LLMs. These results highlight a promising direction for developing more trustworthy and effective LLMs capable of structured self-critique.
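
The inference-time loop described above (critique, then refine, until the model judges its own solution correct) can be sketched as follows. This is an illustration only: `solve`, `critique`, and `refine` are hypothetical callables standing in for model calls, and the `max_rounds` cap is an added assumption so the loop always terminates. The summary does not state how the paper bounds the number of iterations.

```python
from typing import Callable, Tuple


def critique_and_refine(
    problem: str,
    solve: Callable[[str], str],
    critique: Callable[[str, str], Tuple[bool, str]],
    refine: Callable[[str, str, str], str],
    max_rounds: int = 3,
) -> Tuple[str, int]:
    """Iteratively critique and refine a solution.

    Returns the final solution and the number of refinement rounds used.
    `critique` returns (is_correct, feedback) for the current solution.
    """
    solution = solve(problem)
    for rounds_used in range(max_rounds):
        is_correct, feedback = critique(problem, solution)
        if is_correct:
            return solution, rounds_used
        solution = refine(problem, solution, feedback)
    # Round budget exhausted: return the last refinement without re-checking it.
    return solution, max_rounds


if __name__ == "__main__":
    # Toy stand-ins: the "model" counts upward until it reaches the target "2".
    result = critique_and_refine(
        "toy",
        solve=lambda p: "0",
        critique=lambda p, s: (s == "2", "too small"),
        refine=lambda p, s, f: str(int(s) + 1),
    )
    print(result)
```

A single model could play all three roles by switching prompts, which is one way to read "self-generated critiques".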

[25] Small Encoders Can Rival Large Decoders in Detecting Groundedness

Istabrak Abbes,Gabriele Prato,Quentin Fournier,Fernando Rodriguez,Alaa Boukhary,Adam Elwood,Sarath Chandar

Main category: cs.CL

TL;DR: This paper proposes using small, efficient models to check if a query can be answered from given context before using larger LLMs, saving time and resources without sacrificing accuracy.

Details Motivation: LLMs often generate unreliable responses when context is insufficient, making groundedness detection crucial for factual consistency and efficiency improvements. Method: The study uses lightweight encoder models like RoBERTa and NomicBERT, fine-tuned on curated datasets, to detect groundedness before answer generation by LLMs. Result: RoBERTa and NomicBERT achieved performance levels similar to state-of-the-art LLMs like Llama3 8B and GPT4o in groundedness detection while greatly lowering latency. Conclusion: Lightweight encoder models can effectively detect groundedness in queries, offering accuracy comparable to advanced LLMs while significantly reducing inference time and resource use. Abstract: Augmenting large language models (LLMs) with external context significantly improves their performance in natural language processing (NLP) tasks. However, LLMs struggle to answer queries reliably when the provided context lacks information, often resorting to ungrounded speculation or internal knowledge. Groundedness - generating responses strictly supported by the context - is essential for ensuring factual consistency and trustworthiness. This study focuses on detecting whether a given query is grounded in a document provided in context before the costly answer generation by LLMs. Such a detection mechanism can significantly reduce both inference time and resource consumption. We show that lightweight, task-specific encoder models such as RoBERTa and NomicBERT, fine-tuned on curated datasets, can achieve accuracy comparable to state-of-the-art LLMs, such as Llama3 8B and GPT4o, in groundedness detection while reducing inference latency by orders of magnitude. The code is available at: https://github.com/chandarlab/Hallucinate-less
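
The "check before you generate" idea can be sketched as a simple gate. In this illustration `groundedness_score` stands in for a small fine-tuned encoder classifier and `generate` for the expensive LLM call. Both are hypothetical callables, and the 0.5 threshold is an assumed default rather than a value taken from the paper.

```python
from typing import Callable, Optional


def answer_if_grounded(
    query: str,
    context: str,
    groundedness_score: Callable[[str, str], float],
    generate: Callable[[str, str], str],
    threshold: float = 0.5,
) -> Optional[str]:
    """Run the cheap groundedness check first; call the LLM only if it passes."""
    if groundedness_score(query, context) < threshold:
        return None  # abstain: the context does not appear to support an answer
    return generate(query, context)


if __name__ == "__main__":
    # Toy scorer: treats the query as grounded only if the context mentions Paris.
    score = lambda q, c: 1.0 if "Paris" in c else 0.0
    llm = lambda q, c: "Paris"
    print(answer_if_grounded("Capital of France?", "Paris is the capital.", score, llm))
    print(answer_if_grounded("Capital of Peru?", "No relevant text.", score, llm))
```

The saving comes from the second case: the generator is never invoked when the classifier rejects the query, so the cost of that request is one encoder forward pass.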

[26] Detecting Referring Expressions in Visually Grounded Dialogue with Autoregressive Language Models

Bram Willemsen,Gabriel Skantze

Main category: cs.CL

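TL;DR: This study examines how well a language model alone can extract referring expressions from visually grounded dialogue, and finds that linguistic context plays an important role in the task even though the task remains inherently multimodal.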

Details Motivation: To investigate the extent to which linguistic context alone can inform the detection of mentions that have a visually perceivable referent in visually grounded dialogue. Method: A pretrained large language model (LLM) is adapted to perform coarse-grained annotation of mention spans in dialogue via next-token prediction. Result: A text-only approach remains effective even with a moderately sized LLM, relatively small datasets, and parameter-efficient fine-tuning. Conclusion: A text-only approach is reasonably effective for referring expression extraction, but the task is inherently multimodal and unimodal approaches have fundamental limitations. Abstract: In this paper, we explore the use of a text-only, autoregressive language modeling approach for the extraction of referring expressions from visually grounded dialogue. More specifically, the aim is to investigate the extent to which the linguistic context alone can inform the detection of mentions that have a (visually perceivable) referent in the visual context of the conversation. To this end, we adapt a pretrained large language model (LLM) to perform a relatively coarse-grained annotation of mention spans in unfolding conversations by demarcating mention span boundaries in text via next-token prediction. Our findings indicate that even when using a moderately sized LLM, relatively small datasets, and parameter-efficient fine-tuning, a text-only approach can be effective, highlighting the relative importance of the linguistic context for this task. Nevertheless, we argue that the task represents an inherently multimodal problem and discuss limitations fundamental to unimodal approaches.

[27] Structuralist Approach to AI Literary Criticism: Leveraging Greimas Semiotic Square for Large Language Models

Fangzhou Dong,Yifan Zeng,Yingpeng Sang,Hong Shen

Main category: cs.CL

TL;DR: This paper introduces GLASS, a new framework for literary analysis using AI, which improves the ability of large language models to provide professional literary criticism.

Details Motivation: Large Language Models (LLMs) excel in understanding and generating text but struggle with providing professional literary criticism for works with profound thoughts and complex narratives. Method: This paper proposes GLASS (Greimas Literary Analysis via Semiotic Square), a structured analytical framework based on Greimas Semiotic Square (GSS), to enhance LLMs' ability to conduct in-depth literary analysis. We propose the first dataset for GSS-based literary criticism, featuring detailed analyses of 48 works. Then we propose quantitative metrics for GSS-based literary criticism using the LLM-as-a-judge paradigm. Result: Our framework's results, compared with expert criticism across multiple works and LLMs, show high performance. Finally, we applied GLASS to 39 classic works, producing original and high-quality analyses that address existing research gaps. Conclusion: GLASS provides an AI-based tool for literary research and education, offering insights into the cognitive mechanisms underlying literary engagement. Abstract: Large Language Models (LLMs) excel in understanding and generating text but struggle with providing professional literary criticism for works with profound thoughts and complex narratives. This paper proposes GLASS (Greimas Literary Analysis via Semiotic Square), a structured analytical framework based on Greimas Semiotic Square (GSS), to enhance LLMs' ability to conduct in-depth literary analysis. GLASS facilitates the rapid dissection of narrative structures and deep meanings in narrative works. We propose the first dataset for GSS-based literary criticism, featuring detailed analyses of 48 works. Then we propose quantitative metrics for GSS-based literary criticism using the LLM-as-a-judge paradigm. Our framework's results, compared with expert criticism across multiple works and LLMs, show high performance. 
Finally, we applied GLASS to 39 classic works, producing original and high-quality analyses that address existing research gaps. This research provides an AI-based tool for literary research and education, offering insights into the cognitive mechanisms underlying literary engagement.

[28] Leveraging LLM-Assisted Query Understanding for Live Retrieval-Augmented Generation

Guanting Dong,Xiaoxi Li,Yuyao Zhang,Mengjie Deng

Main category: cs.CL

TL;DR: This paper introduces Omni-RAG, a framework that enhances retrieval-augmented generation systems' ability to handle complex, noisy user queries in real-world scenarios.

Details Motivation: Real-world live retrieval-augmented generation (RAG) systems face challenges processing noisy, ambiguous user queries with multiple intents. Current systems struggle with such inputs as they are typically trained or evaluated on cleaner data. Method: Omni-RAG preprocesses user inputs through three modules: (1) Deep Query Understanding and Decomposition using LLMs with tailored prompts to denoise and decompose queries; (2) Intent-Aware Knowledge Retrieval for retrieving relevant information per sub-query; and (3) Reranking and Generation where a reranker refines document selection before generating a final response via an LLM using a chain-of-thought prompt. Result: Omni-RAG improves the robustness and effectiveness of RAG systems in handling complex and noisy queries, addressing the demands highlighted by real-world applications like the SIGIR 2025 LiveRAG Challenge. Conclusion: Omni-RAG aims to bridge the gap between current RAG capabilities and real-world demands by robustly handling complex and noisy queries in live, open-domain settings. Abstract: Real-world live retrieval-augmented generation (RAG) systems face significant challenges when processing user queries that are often noisy, ambiguous, and contain multiple intents. While RAG enhances large language models (LLMs) with external knowledge, current systems typically struggle with such complex inputs, as they are often trained or evaluated on cleaner data. This paper introduces Omni-RAG, a novel framework designed to improve the robustness and effectiveness of RAG systems in live, open-domain settings. 
Omni-RAG employs LLM-assisted query understanding to preprocess user inputs through three key modules: (1) Deep Query Understanding and Decomposition, which utilizes LLMs with tailored prompts to denoise queries (e.g., correcting spelling errors) and decompose multi-intent queries into structured sub-queries; (2) Intent-Aware Knowledge Retrieval, which performs retrieval for each sub-query from a corpus (i.e., FineWeb using OpenSearch) and aggregates the results; and (3) Reranking and Generation, where a reranker (i.e., BGE) refines document selection before a final response is generated by an LLM (i.e., Falcon-10B) using a chain-of-thought prompt. Omni-RAG aims to bridge the gap between current RAG capabilities and the demands of real-world applications, such as those highlighted by the SIGIR 2025 LiveRAG Challenge, by robustly handling complex and noisy queries.
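
The three modules above compose into a straightforward pipeline. The sketch below shows only the control flow. `decompose`, `retrieve`, `rerank`, and `generate` are hypothetical callables standing in for the LLM, search, and reranker components named in the abstract, and the de-duplication step and `top_k` cutoff are assumptions about how the per-sub-query results might be aggregated.

```python
from typing import Callable, List


def run_pipeline(
    query: str,
    decompose: Callable[[str], List[str]],
    retrieve: Callable[[str], List[str]],
    rerank: Callable[[str, List[str]], List[str]],
    generate: Callable[[str, List[str]], str],
    top_k: int = 3,
) -> str:
    """Decompose the query, retrieve per sub-query, rerank, then generate."""
    # (1) Query understanding: denoise and split into sub-queries.
    sub_queries = decompose(query)

    # (2) Retrieval per sub-query, aggregated with order-preserving de-duplication.
    candidates: List[str] = []
    seen = set()
    for sub_query in sub_queries:
        for doc in retrieve(sub_query):
            if doc not in seen:
                seen.add(doc)
                candidates.append(doc)

    # (3) Rerank against the original query and keep the best documents.
    top_docs = rerank(query, candidates)[:top_k]
    return generate(query, top_docs)


if __name__ == "__main__":
    corpus = {
        "who wrote Hamlet": ["doc_shakespeare", "doc_plays"],
        "when was it written": ["doc_1600", "doc_plays"],
    }
    answer = run_pipeline(
        "who wrote Hamlet and when was it written",
        decompose=lambda q: [part.strip() for part in q.split(" and ")],
        retrieve=lambda sq: corpus.get(sq, []),
        rerank=lambda q, docs: sorted(docs),
        generate=lambda q, docs: "|".join(docs),
        top_k=2,
    )
    print(answer)
```

Passing the components in as callables keeps the sketch independent of any particular search backend or model.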

[29] Domain Knowledge-Enhanced LLMs for Fraud and Concept Drift Detection

Ali Şenol,Garima Agrawal,Huan Liu

Main category: cs.CL

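TL;DR: This paper proposes a new framework that combines pretrained large language models with task-specific domain knowledge to detect deceptive conversations and concept drift, validating the value of domain knowledge and demonstrating the framework's effectiveness in high-stakes NLP applications.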

Details Motivation: Detecting deceptive conversations on dynamic platforms is increasingly difficult due to evolving language patterns and concept drift. Existing large language models show strong performance in natural language tasks but often struggle with contextual ambiguity and hallucinations in risk-sensitive scenarios. Method: The paper proposes a DK-enhanced LLM framework with three main components: a DK-LLM module to detect fake or deceptive conversations; a drift detection unit (OCDD) to determine whether a semantic shift has occurred; and a second DK-LLM module to classify the drift as benign or fraudulent. Result: The system detects fake conversations with 98% classification accuracy, and comparative studies show that incorporating domain knowledge and drift awareness significantly improves performance. Conclusion: An LLM framework that integrates domain knowledge and concept drift detection improves performance, interpretability, and robustness in high-stakes NLP applications. Abstract: Detecting deceptive conversations on dynamic platforms is increasingly difficult due to evolving language patterns and Concept Drift (CD), i.e., semantic or topical shifts that alter the context or intent of interactions over time. These shifts can obscure malicious intent or mimic normal dialogue, making accurate classification challenging. While Large Language Models (LLMs) show strong performance in natural language tasks, they often struggle with contextual ambiguity and hallucinations in risk-sensitive scenarios. To address these challenges, we present a Domain Knowledge (DK)-Enhanced LLM framework that integrates pretrained LLMs with structured, task-specific insights to perform fraud and concept drift detection. The proposed architecture consists of three main components: (1) a DK-LLM module to detect fake or deceptive conversations; (2) a drift detection unit (OCDD) to determine whether a semantic shift has occurred; and (3) a second DK-LLM module to classify the drift as either benign or fraudulent. We first validate the value of domain knowledge using a fake review dataset and then apply our full framework to SEConvo, a multiturn dialogue dataset that includes various types of fraud and spam attacks. Results show that our system detects fake conversations with high accuracy and effectively classifies the nature of drift. Guided by structured prompts, the LLaMA-based implementation achieves 98% classification accuracy. Comparative studies against zero-shot baselines demonstrate that incorporating domain knowledge and drift awareness significantly improves performance, interpretability, and robustness in high-stakes NLP applications.

[30] Text2Cypher Across Languages: Evaluating Foundational Models Beyond English

Makbule Gulcin Ozsoy,William Tai

Main category: cs.CL

TL;DR: This paper evaluates foundational LLMs on the Text2Cypher task across English, Spanish, and Turkish, finding performance differences due to training data and linguistic traits. It emphasizes the need for more inclusive multilingual development.

Details Motivation: Most research on natural language interfaces for database queries, like Text2SQL, focuses only on English. This paper aims to investigate the performance of foundational LLMs in other languages, addressing the lack of multilingual evaluation. Method: The authors created a multilingual test set by translating English questions into Spanish and Turkish while preserving original Cypher queries. They evaluated multiple foundational models using standardized prompts and metrics to compare performances across languages. Result: Foundational LLMs showed varying performance across languages: best on English, lower on Spanish, and lowest on Turkish. Prompt translation had minimal impact on results. Conclusion: The paper concludes that the performance of foundational LLMs in the Text2Cypher task varies across languages, with English showing the highest performance followed by Spanish and then Turkish. The study highlights the need for more inclusive evaluation and development in multilingual query generation. Abstract: Recent advances in large language models have enabled natural language interfaces that translate user questions into database queries, such as Text2SQL, Text2SPARQL, and Text2Cypher. While these interfaces enhance database accessibility, most research today focuses solely on English, with limited evaluation in other languages. This paper investigates the performance of foundational LLMs on the Text2Cypher task across multiple languages. We create and release a multilingual test set by translating English questions into Spanish and Turkish while preserving the original Cypher queries, enabling fair cross-lingual comparison. We evaluate multiple foundational models using standardized prompts and metrics. Our results show a consistent performance pattern: highest on English, then Spanish, and lowest on Turkish. We attribute this to differences in training data availability and linguistic characteristics. 
Additionally, we explore the impact of translating task prompts into Spanish and Turkish. Results show little to no change in evaluation metrics, suggesting prompt translation has minor impact. Our findings highlight the need for more inclusive evaluation and development in multilingual query generation. Future work includes schema localization and fine-tuning across diverse languages.

[31] Aligning Spoken Dialogue Models from User Interactions

Anne Wu,Laurent Mazaré,Neil Zeghidour,Alexandre Défossez

Main category: cs.CL

TL;DR: This paper introduces a new framework for aligning preferences in spoken dialogue models, significantly enhancing their performance in real-time conversations through comprehensive feedback and fine-tuning.

Details Motivation: Current preference learning methods are focused on text-based language models and are not suited to the complexities of real-time speech interactions, which involve richer dynamics such as interruptions and interjections, and lack explicit segmentation between speaker turns. Method: A novel preference alignment framework was developed using a large-scale dataset of over 150,000 preference pairs derived from multi-turn speech conversations. Offline alignment methods were used to fine-tune a full-duplex autoregressive speech-to-speech model. Result: Extensive experiments showed that feedback on generic conversations can consistently improve spoken dialogue models, making them more factual, safer, and contextually aligned. Holistic human evaluations confirmed the positive impact beyond single-turn conversations. Conclusion: The proposed preference alignment framework effectively improves spoken dialogue models by incorporating feedback on generic conversations, highlighting the importance of balancing various dynamics for natural real-time speech systems. Abstract: We propose a novel preference alignment framework for improving spoken dialogue models on real-time conversations from user interactions. Current preference learning methods primarily focus on text-based language models, and are not directly suited to the complexities of real-time speech interactions, with richer dynamics (e.g. interruption, interjection) and no explicit segmentation between speaker turns. We create a large-scale dataset of more than 150,000 preference pairs from raw multi-turn speech conversations, annotated with AI feedback, to cover preferences over both linguistic content and temporal context variations. We leverage offline alignment methods to finetune a full-duplex autoregressive speech-to-speech model. 
Extensive experiments demonstrate that feedback on generic conversations can be consistently effective in improving spoken dialogue models to produce more factual, safer and more contextually aligned interactions. We deploy the finetuned model and conduct holistic human evaluations to assess the impact beyond single-turn conversations. Our findings shed light on the importance of a well-calibrated balance among various dynamics, crucial for natural real-time speech dialogue systems.

[32] TopK Language Models

Ryosuke Takahashi,Tatsuro Inaba,Kentaro Inui,Benjamin Heinzerling

Main category: cs.CL

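TL;DR: This paper proposes TopK LM, a Transformer architecture variant that incorporates a TopK activation function, addressing the shortcomings of post-hoc sparse autoencoders while offering better interpretability and stability.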

Details Motivation: Sparse autoencoders (SAEs) have several shortcomings: training conditions and architecture choices affect which features they learn, and their lack of feature stability makes it difficult to compare features across checkpoints. Method: Introduces a modified Transformer architecture that incorporates a TopK activation function at chosen layers, making the model's hidden states equivalent to the latent features of a TopK SAE. Result: TopK LMs maintain their original capabilities while providing robust interpretability benefits, including successful steering through targeted neuron interventions and detailed analysis of neuron formation across checkpoints and layers. Conclusion: TopK LMs offer a favorable trade-off between model size, computational efficiency, and interpretability, and serve as stable, reliable tools for understanding how language models learn and represent concepts. Abstract: Sparse autoencoders (SAEs) have become an important tool for analyzing and interpreting the activation space of transformer-based language models (LMs). However, SAEs suffer several shortcomings that diminish their utility and internal validity. Since SAEs are trained post-hoc, it is unclear if the failure to discover a particular concept is a failure on the SAE's side or due to the underlying LM not representing this concept. This problem is exacerbated by training conditions and architecture choices affecting which features an SAE learns. When tracing how LMs learn concepts during training, the lack of feature stability also makes it difficult to compare SAE features across different checkpoints. To address these limitations, we introduce a modification to the transformer architecture that incorporates a TopK activation function at chosen layers, making the model's hidden states equivalent to the latent features of a TopK SAE. This approach eliminates the need for post-hoc training while providing interpretability comparable to SAEs. The resulting TopK LMs offer a favorable trade-off between model size, computational efficiency, and interpretability. Despite this simple architectural change, TopK LMs maintain their original capabilities while providing robust interpretability benefits. Our experiments demonstrate that the sparse representations learned by TopK LMs enable successful steering through targeted neuron interventions and facilitate detailed analysis of neuron formation processes across checkpoints and layers. 
These features make TopK LMs stable and reliable tools for understanding how language models learn and represent concepts, which we believe will significantly advance future research on model interpretability and controllability.
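
As a rough illustration of the TopK activation mentioned in the abstract above, the sketch below keeps the k largest entries of a hidden vector and zeroes the rest. This is a minimal sketch and not the authors' implementation: the function name `topk_activation`, the use of plain Python lists, and the choice to rank by raw value rather than magnitude are all assumptions, since the abstract does not specify these details.

```python
def topk_activation(hidden, k):
    """Keep the k largest entries of `hidden` and set the others to 0.0.

    Hypothetical helper for illustration; ranking by raw value (not
    absolute value) is an assumption, not something the abstract states.
    """
    if k <= 0:
        return [0.0] * len(hidden)
    if k >= len(hidden):
        return list(hidden)
    # Indices of the k largest values; ties go to the lower index
    # because Python's sort is stable.
    order = sorted(range(len(hidden)), key=lambda i: hidden[i], reverse=True)
    keep = set(order[:k])
    return [v if i in keep else 0.0 for i, v in enumerate(hidden)]


sparse = topk_activation([0.1, 2.0, -1.0, 0.7], k=2)
```

With k much smaller than the hidden size, each token activates only a handful of units, which is what makes the hidden state readable in the same way as the latent code of a TopK SAE.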

[33] Bridging Offline and Online Reinforcement Learning for LLMs

Jack Lanchantin,Angelica Chen,Janice Lan,Xian Li,Swarnadeep Saha,Tianlu Wang,Jing Xu,Ping Yu,Weizhe Yuan,Jason E Weston,Sainbayar Sukhbaatar,Ilia Kulikov

Main category: cs.CL

TL;DR: The study finds that reinforcement learning methods are effective in fine-tuning large language models across different regimes, with multi-tasking improving performance on both verifiable and non-verifiable tasks.

Details Motivation: To understand the effectiveness of reinforcement learning methods for fine-tuning large language models when transitioning from offline to semi-online to fully online regimes for both verifiable and non-verifiable tasks. Method: Experiments were conducted on verifiable math as well as non-verifiable instruction following, comparing online and semi-online Direct Preference Optimization and Group Reward Policy Optimization objectives. Result: Similar performance and convergence were found between online and semi-online variants, which all strongly outperformed offline methods. Optimal results were achieved through detailed analysis of training dynamics and hyperparameter selection strategies. Conclusion: Multi-tasking with verifiable and non-verifiable rewards jointly yields improved performance across both task types. Abstract: We investigate the effectiveness of reinforcement learning methods for finetuning large language models when transitioning from offline to semi-online to fully online regimes for both verifiable and non-verifiable tasks. Our experiments cover training on verifiable math as well as non-verifiable instruction following with a set of benchmark evaluations for both. Across these settings, we extensively compare online and semi-online Direct Preference Optimization and Group Reward Policy Optimization objectives, and surprisingly find similar performance and convergence between these variants, which all strongly outperform offline methods. We provide a detailed analysis of the training dynamics and hyperparameter selection strategies to achieve optimal results. Finally, we show that multi-tasking with verifiable and non-verifiable rewards jointly yields improved performance across both task types.

[34] Enhancing User Engagement in Socially-Driven Dialogue through Interactive LLM Alignments

Jiashuo Wang,Kaitao Song,Chunpu Xu,Changhe Song,Yang Xiao,Dongsheng Li,Lili Qiu,Wenjie Li

Main category: cs.CL

TL;DR: This paper proposes a method to enhance user engagement in interactive LLMs by leveraging future conversation signals and aligning models using direct preference optimization based on user reactions.

Details Motivation: Prior works have not guaranteed user engagement in socially-driven dialogues by optimizing models over relevant knowledge or dialogue act flow. This research aims to bridge the subtle relationship between user engagement and dialogue systems by learning from future conversation developments. Method: The study introduces a user simulator and employs i×MCTS (Monte Carlo Tree Search for interaction) to gather pairs of high- and low-quality interaction experiences. Interactive LLMs are then aligned using direct preference optimization (DPO) based on user reactions related to dialogue intention as a reward signal. Result: Experiments show that leveraging signals from the future development of conversations and aligning LLMs via DPO improves user engagement in two socially-driven dialogue scenarios: emotional support conversations and persuasion for good. Conclusion: The proposed method effectively enhances user engagement in interactive LLMs, as demonstrated through experiments on emotional support conversations and persuasion for good scenarios. Abstract: Enhancing user engagement through interactions plays an essential role in socially-driven dialogues. While prior works have optimized models to reason over relevant knowledge or plan a dialogue act flow, the relationship between user engagement and knowledge or dialogue acts is subtle and does not guarantee user engagement in socially-driven dialogues. To this end, we enable interactive LLMs to learn user engagement by leveraging signals from the future development of conversations. Specifically, we adopt a more direct and relevant indicator of user engagement, i.e., the user's reaction related to dialogue intention after the interaction, as a reward to align interactive LLMs. 
To achieve this, we develop a user simulator to interact with target interactive LLMs and explore interactions between the user and the interactive LLM system via i×MCTS (Monte Carlo Tree Search for interaction). In this way, we collect a dataset containing pairs of higher and lower-quality experiences using i×MCTS, and align interactive LLMs for high-level user engagement by direct preference optimization (DPO) accordingly. Experiments conducted on two socially-driven dialogue scenarios (emotional support conversations and persuasion for good) demonstrate that our method effectively enhances user engagement in interactive LLMs.
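
For readers unfamiliar with the DPO step named in the abstract above, the following is a minimal sketch of the standard per-pair DPO loss. It shows the general objective only and is not the authors' training code: the function name `dpo_loss`, the argument names, and the default `beta=0.1` are assumptions. Each argument is the summed log-probability of a response under the policy being trained or under a frozen reference model.

```python
import math


def dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Standard DPO loss for one (chosen, rejected) pair of log-probabilities.

    loss = -log(sigmoid(beta * ((pi_w - ref_w) - (pi_l - ref_l))))
    The argument names and beta default are illustrative assumptions.
    """
    margin = beta * ((policy_chosen - ref_chosen) - (policy_rejected - ref_rejected))
    # -log(sigmoid(m)) == log(1 + exp(-m)); branch to avoid overflow.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Policy identical to the reference: the loss equals log(2).
baseline = dpo_loss(-5.0, -5.0, -5.0, -5.0)
# Policy prefers the chosen response more than the reference does: lower loss.
improved = dpo_loss(-4.0, -6.0, -5.0, -5.0)
```

In the setting described above, the chosen and rejected responses would be the higher- and lower-quality interaction experiences collected through the tree search.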

[35] skLEP: A Slovak General Language Understanding Benchmark

Marek Šuppa,Andrej Ridzik,Daniel Hládek,Tomáš Javůrek,Viktória Ondrejová,Kristína Sásiková,Martin Tamajka,Marián Šimko

Main category: cs.CL

TL;DR: This paper introduces skLEP, a comprehensive benchmark designed for evaluating Slovak natural language understanding models.

Details Motivation: To fill the gap left by the lack of a comprehensive evaluation benchmark for Slovak natural language understanding models. Method: Built a benchmark covering nine diverse tasks, combining newly curated datasets with careful translations of existing English resources, and systematically evaluated a range of language models. Result: Released the complete benchmark data, an open-source toolkit supporting model fine-tuning and evaluation, and a public leaderboard. Conclusion: skLEP is a comprehensive benchmark that fosters future research and reproducibility in Slovak NLU. Abstract: In this work, we introduce skLEP, the first comprehensive benchmark specifically designed for evaluating Slovak natural language understanding (NLU) models. We have compiled skLEP to encompass nine diverse tasks that span token-level, sentence-pair, and document-level challenges, thereby offering a thorough assessment of model capabilities. To create this benchmark, we curated new, original datasets tailored for Slovak and meticulously translated established English NLU resources. Within this paper, we also present the first systematic and extensive evaluation of a wide array of Slovak-specific, multilingual, and English pre-trained language models using the skLEP tasks. Finally, we also release the complete benchmark data, an open-source toolkit facilitating both fine-tuning and evaluation of models, and a public leaderboard at https://github.com/slovak-nlp/sklep in the hopes of fostering reproducibility and driving future research in Slovak NLU.

[36] Potemkin Understanding in Large Language Models

Marina Mancoridis,Bec Weeks,Keyon Vafa,Sendhil Mullainathan

Main category: cs.CL

TL;DR: This paper challenges the validity of using human-centric benchmarks to assess LLM understanding, revealing that LLMs often exhibit 'potemkin understanding,' which creates illusions of comprehension mirroring human errors.

Details Motivation: The motivation is to question the validity of using traditional benchmark datasets to infer LLM capabilities, particularly since these benchmarks are also used to test humans and might not capture true conceptual understanding in LLMs. Method: The paper presents a formal framework to evaluate LLMs and introduces two procedures: one for quantifying 'potemkins' (illusions of understanding) in specific domains, and another general procedure to estimate their prevalence across models and tasks. Result: The study finds that 'potemkins'—false impressions of understanding—are common across models, tasks, and domains. These failures indicate deeper internal incoherence in how LLMs represent concepts, beyond mere incorrect answers. Conclusion: The paper concludes that current benchmarks may not accurately reflect an LLM's true understanding, as they can create 'potemkin understanding' mirroring human misinterpretations. Abstract: Large language models (LLMs) are regularly evaluated using benchmark datasets. But what justifies making inferences about an LLM's capabilities based on its answers to a curated set of questions? This paper first introduces a formal framework to address this question. The key is to note that the benchmarks used to test LLMs -- such as AP exams -- are also those used to test people. However, this raises an implication: these benchmarks are only valid tests if LLMs misunderstand concepts in ways that mirror human misunderstandings. Otherwise, success on benchmarks only demonstrates potemkin understanding: the illusion of understanding driven by answers irreconcilable with how any human would interpret a concept. We present two procedures for quantifying the existence of potemkins: one using a specially designed benchmark in three domains, the other using a general procedure that provides a lower-bound on their prevalence. We find that potemkins are ubiquitous across models, tasks, and domains. 
We also find that these failures reflect not just incorrect understanding, but deeper internal incoherence in concept representations.

[37] "What's Up, Doc?": Analyzing How Users Seek Health Information in Large-Scale Conversational AI Datasets

Akshay Paruchuri,Maryam Aziz,Rohit Vartak,Ayman Ali,Best Uchehara,Xin Liu,Ishan Chatterjee,Monica Agrawal

Main category: cs.CL

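TL;DR: This paper builds the HealthChat-11K dataset to systematically analyze how users seek healthcare information from large language models (LLMs) via chatbots, and the potential problems in those interactions.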

Details Motivation: As more people seek healthcare information from large language models (LLMs) via chatbots, this study explores the nature and potential risks of these conversations. Method: The authors filtered large-scale conversational AI datasets to obtain HealthChat-11K and used a clinician-driven taxonomy to analyze user interactions with LLMs across 21 distinct health specialties. Result: The analysis reveals patterns in how and why users seek health information, including common interactions, incomplete context, affective behaviors, and interactions that can induce sycophancy. Conclusion: HealthChat-11K highlights problems in how users interact with LLMs for health information, such as leading questions and affective behaviors, underscoring the need to improve the healthcare support capabilities of LLMs. Abstract: People are increasingly seeking healthcare information from large language models (LLMs) via interactive chatbots, yet the nature and inherent risks of these conversations remain largely unexplored. In this paper, we filter large-scale conversational AI datasets to obtain HealthChat-11K, a curated dataset of 11K real-world conversations composed of 25K user messages. We use HealthChat-11K and a clinician-driven taxonomy for how users interact with LLMs when seeking healthcare information in order to systematically study user interactions across 21 distinct health specialties. Our analysis reveals insights into the nature of how and why users seek health information, such as common interactions, instances of incomplete context, affective behaviors, and interactions (e.g., leading questions) that can induce sycophancy, underscoring the need for improvements in the healthcare support capabilities of LLMs deployed as conversational AI. Code and artifacts to retrieve our analyses and combine them into a curated dataset can be found here: https://github.com/yahskapar/HealthChat

[38] Data Efficacy for Language Model Training

Yalun Dai,Yangyu Huang,Xin Zhang,Wenshan Wu,Chong Li,Wenhui Lu,Shijie Cao,Li Dong,Scarlett Li

Main category: cs.CL

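TL;DR: This paper studies the importance of training data organization in language model training, proposes the DELT paradigm with its instances LQS and Folding, and shows experimentally that these methods improve model performance.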

Details Motivation: Optimizing the organization of training data to improve language model performance remains relatively underexplored. Method: Proposes DELT, a general paradigm comprising three components (Data Scoring, Data Selection, and Data Ordering), and designs the LQS scoring method and the Folding ordering method. Result: Experiments show that DELT improves language model performance without increasing data scale or model size, with the combination of LQS and Folding performing best. Conclusion: Data efficacy, together with data efficiency, is a promising foundational area in language model training. Abstract: Data is fundamental to the training of language models (LM). Recent research has been dedicated to data efficiency, which aims to maximize performance by selecting a minimal or optimal subset of training data. Techniques such as data filtering, sampling, and selection play a crucial role in this area. To complement it, we define Data Efficacy, which focuses on maximizing performance by optimizing the organization of training data and remains relatively underexplored. This work introduces a general paradigm, DELT, for considering data efficacy in LM training, which highlights the significance of training data organization. DELT comprises three components: Data Scoring, Data Selection, and Data Ordering. Among these components, we design Learnability-Quality Scoring (LQS), as a new instance of Data Scoring, which considers both the learnability and quality of each data sample from the gradient consistency perspective. We also devise Folding Ordering (FO), as a novel instance of Data Ordering, which addresses issues such as model forgetting and data distribution bias. Comprehensive experiments validate the data efficacy in LM training, which demonstrates the following: Firstly, various instances of the proposed DELT enhance LM performance to varying degrees without increasing the data scale and model size. Secondly, among these instances, the combination of our proposed LQS for data scoring and Folding for data ordering achieves the most significant improvement. Lastly, data efficacy can be achieved together with data efficiency by applying data selection. Therefore, we believe that data efficacy is a promising foundational area in LM training.

cs.CV [Back]

[39] OTSurv: A Novel Multiple Instance Learning Framework for Survival Prediction with Heterogeneity-aware Optimal Transport

Qin Ren,Yifan Wang,Ruogu Fang,Haibin Ling,Chenyu You

Main category: cs.CV

TL;DR: OTSurv addresses the issue of pathological heterogeneity in whole slide images for survival prediction by incorporating optimal transport with two constraints, achieving superior performance over existing methods.

Details Motivation: Existing MIL methods for survival prediction fail to explicitly capture pathological heterogeneity within WSIs both globally through long-tailed morphological distributions and locally through tile-level prediction uncertainty. Method: OTSurv formulates survival predictions as a heterogeneity-aware OT problem with two constraints: (1) global long-tail constraint to model morphological distributions and (2) local uncertainty-aware constraint to prioritize high-confidence patches. The OT problem is recast into an unbalanced OT formulation solved via a matrix scaling algorithm. Result: OTSurv sets new state-of-the-art results across six benchmarks with a 3.6% improvement in average C-index and achieves statistical significance in log-rank tests while offering high interpretability. Conclusion: OTSurv is a novel MIL framework for survival prediction in digital pathology that achieves state-of-the-art results, statistical significance in log-rank tests, and high interpretability. Abstract: Survival prediction using whole slide images (WSIs) can be formulated as a multiple instance learning (MIL) problem. However, existing MIL methods often fail to explicitly capture pathological heterogeneity within WSIs, both globally -- through long-tailed morphological distributions -- and locally -- through tile-level prediction uncertainty. Optimal transport (OT) provides a principled way of modeling such heterogeneity by incorporating marginal distribution constraints. Building on this insight, we propose OTSurv, a novel MIL framework from an optimal transport perspective. 
Specifically, OTSurv formulates survival predictions as a heterogeneity-aware OT problem with two constraints: (1) global long-tail constraint that models prior morphological distributions to avert both mode collapse and excessive uniformity by regulating transport mass allocation, and (2) local uncertainty-aware constraint that prioritizes high-confidence patches while suppressing noise by progressively raising the total transport mass. We then recast the initial OT problem, augmented by these constraints, into an unbalanced OT formulation that can be solved with an efficient, hardware-friendly matrix scaling algorithm. Empirically, OTSurv sets new state-of-the-art results across six popular benchmarks, achieving an absolute 3.6% improvement in average C-index. In addition, OTSurv achieves statistical significance in log-rank tests and offers high interpretability, making it a powerful tool for survival prediction in digital pathology. Our codes are available at https://github.com/Y-Research-SBU/OTSurv.
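
The C-index reported above is the usual ranking metric for survival models. The sketch below computes Harrell's concordance index in plain Python as an illustration of what the metric measures. It is not taken from the OTSurv code base: the function name `concordance_index` and the convention that a higher risk score should correspond to a shorter survival time are assumptions made for this example.

```python
def concordance_index(times, events, risks):
    """Harrell's C-index: fraction of comparable pairs ranked correctly.

    A pair (i, j) is comparable when subject i had an observed event
    (events[i] is truthy) and a strictly shorter time than subject j.
    It is concordant when the model gave i the higher risk; tied risks
    count as half. Names and conventions here are illustrative.
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            continue  # censored subjects cannot anchor a comparable pair
        for j in range(n):
            if times[i] < times[j]:
                comparable += 1
                if risks[i] > risks[j]:
                    concordant += 1.0
                elif risks[i] == risks[j]:
                    concordant += 0.5
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


times = [2, 4, 6, 8]            # follow-up times
events = [1, 1, 0, 1]           # 1 = event observed, 0 = censored
risks = [0.9, 0.7, 0.4, 0.1]    # higher = predicted to fail sooner
c_perfect = concordance_index(times, events, risks)
c_reversed = concordance_index(times, events, risks[::-1])
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect ranking, so an absolute gain of 3.6% means the average C-index rose by 0.036.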

[40] StereoDiff: Stereo-Diffusion Synergy for Video Depth Estimation

Haodong Li,Chen Wang,Jiahui Lei,Kostas Daniilidis,Lingjie Liu

Main category: cs.CV

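TL;DR: StereoDiff is a new video depth estimation method that combines stereo matching with video depth diffusion to achieve consistent and accurate depth in both dynamic and static regions.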

Details Motivation: Video depth estimation is not a naive extension of image depth estimation, because dynamic and static regions have fundamentally different temporal consistency requirements. Method: StereoDiff uses a two-stage strategy: the first stage applies stereo matching mainly to static regions, and the second stage uses video depth diffusion to maintain consistent depth transitions in dynamic regions. Result: Experiments show that StereoDiff achieves state-of-the-art performance on multiple real-world dynamic video depth benchmarks, with superior accuracy and consistency. Conclusion: StereoDiff successfully combines the strengths of stereo matching and video depth diffusion, providing a more effective approach to video depth estimation. Abstract: Recent video depth estimation methods achieve great performance by following the paradigm of image depth estimation, i.e., typically fine-tuning pre-trained video diffusion models with massive data. However, we argue that video depth estimation is not a naive extension of image depth estimation. The temporal consistency requirements for dynamic and static regions in videos are fundamentally different. Consistent video depth in static regions, typically backgrounds, can be more effectively achieved via stereo matching across all frames, which provides much stronger global 3D cues. In contrast, the consistency for dynamic regions should still be learned from large-scale video depth data to ensure smooth transitions, due to the violation of triangulation constraints. Based on these insights, we introduce StereoDiff, a two-stage video depth estimator that synergizes stereo matching for mainly the static areas with video depth diffusion for maintaining consistent depth transitions in dynamic areas. We mathematically demonstrate how stereo matching and video depth diffusion offer complementary strengths through frequency domain analysis, highlighting the effectiveness of their synergy in capturing the advantages of both. Experimental results on zero-shot, real-world, dynamic video depth benchmarks, both indoor and outdoor, demonstrate StereoDiff's SoTA performance, showcasing its superior consistency and accuracy in video depth estimation.

[41] ConViTac: Aligning Visual-Tactile Fusion with Contrastive Representations

Zhiyuan Wu,Yongqiang Zhao,Shan Luo

Main category: cs.CV

TL;DR: This paper introduces ConViTac, a novel approach for fusing visual and tactile data in robotics using contrastive learning and cross-modal attention, achieving significant performance improvements over existing methods.

Details Motivation: Existing approaches for combining visual and tactile data often use simplistic fusion techniques like feature addition or concatenation, leading to suboptimal performance. This work aims to improve feature alignment during fusion using contrastive representation learning. Method: The authors propose a network called ConViTac, which uses a Contrastive Embedding Conditioning (CEC) mechanism to align visual and tactile features through unified latent embeddings and cross-modal attention. Result: Experiments demonstrate that ConViTac achieves superior performance compared to current methods, with up to a 12.0% improvement in accuracy on material classification and grasping prediction tasks. Conclusion: The paper concludes that the proposed ConViTac method significantly enhances visual-tactile feature integration and outperforms state-of-the-art methods in real-world tasks like material classification and grasping prediction. Abstract: Vision and touch are two fundamental sensory modalities for robots, offering complementary information that enhances perception and manipulation tasks. Previous research has attempted to jointly learn visual-tactile representations to extract more meaningful information. However, these approaches often rely on direct combination, such as feature addition and concatenation, for modality fusion, which tend to result in poor feature integration. In this paper, we propose ConViTac, a visual-tactile representation learning network designed to enhance the alignment of features during fusion using contrastive representations. Our key contribution is a Contrastive Embedding Conditioning (CEC) mechanism that leverages a contrastive encoder pretrained through self-supervised contrastive learning to project visual and tactile inputs into unified latent embeddings. 
These embeddings are used to couple visual-tactile feature fusion through cross-modal attention, aiming at aligning the unified representations and enhancing performance on downstream tasks. We conduct extensive experiments to demonstrate the superiority of ConViTac in real world over current state-of-the-art methods and the effectiveness of our proposed CEC mechanism, which improves accuracy by up to 12.0% in material classification and grasping prediction tasks.

[42] AI-Driven MRI-based Brain Tumour Segmentation Benchmarking

Connor Ludwig,Khashayar Namdar,Farzad Khalvati

Main category: cs.CV

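TL;DR: This study evaluates several promptable medical image segmentation models (SAM, SAM 2, MedSAM, SAM-Med-3D) against the conventional nnU-Net on the BraTS 2023 dataset. With high-quality bounding box prompts, the SAM-family models can surpass nnU-Net, but providing prompts this accurate is impractical in real use.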

Details Motivation: Many general promptable models and their medical variants have appeared in recent years, but there is a lack of comprehensive evaluation and comparison of these models across prompt qualities on a common medical dataset. Method: The study uses Segment Anything Model (SAM), Segment Anything Model 2 (SAM 2), MedSAM, SAM-Med-3D, and nnU-Net for zero-shot inference on the BraTS 2023 adult glioma and pediatrics datasets, and evaluates and compares the models across multiple prompt qualities. SAM, SAM 2, MedSAM, and SAM-Med-3D are additionally fine-tuned to further test their performance. Result: Several models show promising Dice scores; in particular, SAM and SAM 2 reach 0.894 and 0.893, respectively, when given highly accurate bounding box prompts, exceeding nnU-Net. After fine-tuning, point prompt performance improves substantially but still does not surpass bounding box prompts or nnU-Net. Conclusion: Although SAM-family models segment very well with accurate bounding box prompts, even exceeding nnU-Net, nnU-Net remains the dominant medical image segmentation network because providing highly accurate prompts is impractical in real applications. Fine-tuning substantially improves point prompt performance, but it still cannot beat bounding boxes or nnU-Net. Abstract: Medical image segmentation has greatly aided medical diagnosis, with U-Net based architectures and nnU-Net providing state-of-the-art performance. There have been numerous general promptable models and medical variations introduced in recent years, but there is currently a lack of evaluation and comparison of these models across a variety of prompt qualities on a common medical dataset. This research uses Segment Anything Model (SAM), Segment Anything Model 2 (SAM 2), MedSAM, SAM-Med-3D, and nnU-Net to obtain zero-shot inference on the BraTS 2023 adult glioma and pediatrics dataset across multiple prompt qualities for both points and bounding boxes. Several of these models exhibit promising Dice scores, particularly SAM and SAM 2 achieving scores of up to 0.894 and 0.893, respectively, when given extremely accurate bounding box prompts, which exceeds nnU-Net's segmentation performance. However, nnU-Net remains the dominant medical image segmentation network due to the impracticality of providing highly accurate prompts to the models. The model and prompt evaluation, as well as the comparison, are extended through fine-tuning SAM, SAM 2, MedSAM, and SAM-Med-3D on the pediatrics dataset. The improvements in point prompt performance after fine-tuning are substantial and show promise for future investigation, but are unable to achieve better segmentation than bounding boxes or nnU-Net.
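
The Dice scores quoted above measure overlap between a predicted mask and the ground-truth mask. The sketch below shows the standard definition on flattened binary masks. It is an illustration only, not the evaluation code used in the paper: the function name `dice_score` and the convention of returning 1.0 when both masks are empty are assumptions.

```python
def dice_score(pred, target):
    """Dice coefficient 2*|A & B| / (|A| + |B|) for two flat binary masks.

    `pred` and `target` are equal-length sequences of 0/1 values.
    Returning 1.0 for two empty masks is a convention assumed here.
    """
    if len(pred) != len(target):
        raise ValueError("masks must have the same length")
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    total = sum(1 for p in pred if p) + sum(1 for t in target if t)
    if total == 0:
        return 1.0
    return 2.0 * intersection / total


pred_mask = [1, 1, 0, 0]
true_mask = [1, 0, 1, 0]
score = dice_score(pred_mask, true_mask)  # one shared voxel out of 2 + 2
```

A score of 0.894 therefore means that twice the overlapping volume is about 89% of the combined predicted and reference volumes.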

[43] How do Foundation Models Compare to Skeleton-Based Approaches for Gesture Recognition in Human-Robot Interaction?

Stephanie Käs,Anton Burenko,Louis Markert,Onur Alp Culha,Dennis Mack,Timm Linder,Bastian Leibe

Main category: cs.CV

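TL;DR: This study explores using vision foundation models and vision language models for dynamic, full-body gesture recognition and introduces NUGGET, a new dataset for human-robot communication.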

Details Motivation: Traditional deep learning-based gesture recognition relies on task-specific architectures, whereas Vision Foundation Models (VFMs) and Vision Language Models (VLMs), with their strong generalization abilities, could reduce system complexity by replacing dedicated task-specific modules. Method: The study introduces the NUGGET dataset to evaluate different gesture recognition approaches and compares V-JEPA, Gemini Flash 2.0, and HD-GCN on dynamic, full-body gesture recognition. Result: HD-GCN achieves the best performance and V-JEPA comes close; Gemini struggles to differentiate gestures from textual descriptions alone in the zero-shot setting. Conclusion: HD-GCN performs best, but V-JEPA comes close with a simple task-specific classification head, suggesting it could serve as a shared multi-task model that reduces system complexity. Gemini's difficulty in the zero-shot setting indicates a need for further research on suitable input representations for gestures. Abstract: Gestures enable non-verbal human-robot communication, especially in noisy environments like agile production. Traditional deep learning-based gesture recognition relies on task-specific architectures using images, videos, or skeletal pose estimates as input. Meanwhile, Vision Foundation Models (VFMs) and Vision Language Models (VLMs) with their strong generalization abilities offer potential to reduce system complexity by replacing dedicated task-specific modules. This study investigates adapting such models for dynamic, full-body gesture recognition, comparing V-JEPA (a state-of-the-art VFM), Gemini Flash 2.0 (a multimodal VLM), and HD-GCN (a top-performing skeleton-based approach). We introduce NUGGET, a dataset tailored for human-robot communication in intralogistics environments, to evaluate the different gesture recognition approaches. In our experiments, HD-GCN achieves best performance, but V-JEPA comes close with a simple, task-specific classification head - thus paving a possible way towards reducing system complexity, by using it as a shared multi-task model. In contrast, Gemini struggles to differentiate gestures based solely on textual descriptions in the zero-shot setting, highlighting the need for further research on suitable input representations for gestures.

[44] Leveraging Vision-Language Models to Select Trustworthy Super-Resolution Samples Generated by Diffusion Models

Cansu Korkmaz,Ahmet Murat Tekalp,Zafer Dogan

Main category: cs.CV

TL;DR: This paper proposes a framework using vision-language models (VLMs) to identify the most trustworthy super-resolution image from a diffusion model's output, introducing a novel Trustworthiness Score (TWS) to evaluate reliability. It demonstrates strong alignment with human preferences and sets a new benchmark in generative SR.

Details Motivation: The motivation stems from the challenge of selecting the most reliable SR image from multiple feasible solutions produced by diffusion models, especially in information-critical applications where ambiguity must be minimized. Method: The method involves using VLMs like BLIP-2 and GPT-4o to assess and rank SR images based on structured queries evaluating semantic correctness, visual quality, and artifact presence. A Trustworthiness Score (TWS) is introduced as a hybrid metric combining semantic similarity, structural integrity, and artifact sensitivity. Top-ranked SR candidates are then ensembled to produce a single output. Result: The results show that the proposed Trustworthiness Score (TWS) strongly correlates with human preference, and VLM-guided selections yield high TWS values. The method outperforms conventional metrics like PSNR and LPIPS in reflecting information fidelity and offers a generalizable solution for navigating uncertainty in diffusion SR space. Conclusion: The paper concludes that leveraging vision-language models (VLMs) to select the most trustworthy super-resolution (SR) image from a set generated by diffusion models provides a principled and scalable solution for enhancing SR trustworthiness. This approach aligns outputs with human expectations and semantic correctness, setting a new benchmark in generative SR. Abstract: Super-resolution (SR) is an ill-posed inverse problem with many feasible solutions consistent with a given low-resolution image. On one hand, regressive SR models aim to balance fidelity and perceptual quality to yield a single solution, but this trade-off often introduces artifacts that create ambiguity in information-critical applications such as recognizing digits or letters. On the other hand, diffusion models generate a diverse set of SR images, but selecting the most trustworthy solution from this set remains a challenge. 
This paper introduces a robust, automated framework for identifying the most trustworthy SR sample from a diffusion-generated set by leveraging the semantic reasoning capabilities of vision-language models (VLMs). Specifically, VLMs such as BLIP-2, GPT-4o, and their variants are prompted with structured queries to assess semantic correctness, visual quality, and artifact presence. The top-ranked SR candidates are then ensembled to yield a single trustworthy output in a cost-effective manner. To rigorously assess the validity of VLM-selected samples, we propose a novel Trustworthiness Score (TWS), a hybrid metric that quantifies SR reliability based on three complementary components: semantic similarity via CLIP embeddings, structural integrity using SSIM on edge maps, and artifact sensitivity through multi-level wavelet decomposition. We empirically show that TWS correlates strongly with human preference in both ambiguous and natural images, and that VLM-guided selections consistently yield high TWS values. Compared to conventional metrics like PSNR and LPIPS, which fail to reflect information fidelity, our approach offers a principled, scalable, and generalizable solution for navigating the uncertainty of the diffusion SR space. By aligning outputs with human expectations and semantic correctness, this work sets a new benchmark for trustworthiness in generative SR.

[45] FixCLR: Negative-Class Contrastive Learning for Semi-Supervised Domain Generalization

Ha Min Son,Shahbaz Rezaei,Xin Liu

Main category: cs.CV

TL;DR: This paper introduces FixCLR, a method that enhances semi-supervised domain generalization by adapting contrastive learning for domain invariance regularization, showing effectiveness particularly when combined with other semi-supervised methods.

Details Motivation: Existing SSDG methods underperform due to label scarcity and do not explicitly regularize to learn domain-invariant representations across all domains, which is a key goal for domain generalization. Method: FixCLR adapts contrastive learning for explicit domain invariance regularization by utilizing class information from pseudo-labels and using only a repelling term. It can be added on top of most existing SSDG and semi-supervised methods. Result: FixCLR is evaluated through extensive experiments including benchmarking different improvements to semi-supervised methods, evaluating the performance of pretrained versus non-pretrained models, and testing on datasets with many domains. Conclusion: FixCLR proves to be an effective SSDG method, especially when combined with other semi-supervised methods. Abstract: Semi-supervised domain generalization (SSDG) aims to solve the problem of generalizing to out-of-distribution data when only a few labels are available. Due to label scarcity, domain generalization methods often underperform when applied directly. Consequently, existing SSDG methods combine semi-supervised learning methods with various regularization terms. However, these methods do not explicitly regularize to learn domain-invariant representations across all domains, which is a key goal for domain generalization. To address this, we introduce FixCLR. Inspired by success in self-supervised learning, we change two crucial components to adapt contrastive learning for explicit domain invariance regularization: utilization of class information from pseudo-labels and using only a repelling term. FixCLR can also be added on top of most existing SSDG and semi-supervised methods for complementary performance improvements. Our research includes extensive experiments that have not been previously explored in SSDG studies. 
These experiments include benchmarking different improvements to semi-supervised methods, evaluating the performance of pretrained versus non-pretrained models, and testing on datasets with many domains. Overall, FixCLR proves to be an effective SSDG method, especially when combined with other semi-supervised methods.
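The abstract describes a contrastive objective that uses pseudo-label class information and keeps only a repelling term. The sketch below shows one way such a loss could look; it is not the FixCLR loss itself, whose exact form the abstract does not give. The log-sum-exp form, the cosine similarity, the temperature value, and all function names are assumptions made for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def repel_only_loss(features, pseudo_labels, temperature=0.5):
    """Hypothetical repelling-only contrastive loss.

    For each sample, the negatives are all other samples whose pseudo-label
    differs, regardless of which domain they came from. There is no
    attracting (positive-pair) term; the loss only penalizes similarity
    to negatives.
    """
    per_sample = []
    for i, (f_i, y_i) in enumerate(zip(features, pseudo_labels)):
        negatives = [
            math.exp(cosine(f_i, f_j) / temperature)
            for j, (f_j, y_j) in enumerate(zip(features, pseudo_labels))
            if j != i and y_j != y_i
        ]
        if negatives:  # samples with no negatives contribute nothing
            per_sample.append(math.log(sum(negatives)))
    return sum(per_sample) / len(per_sample) if per_sample else 0.0


features = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
# Pseudo-labels that agree with the feature clusters ...
loss_separated = repel_only_loss(features, [0, 0, 1, 1])
# ... versus pseudo-labels that cut across them.
loss_mixed = repel_only_loss(features, [0, 1, 0, 1])
```

In this toy case the loss is lower when samples from different pseudo-classes are already far apart, which is the behaviour a repelling term is meant to encourage.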

[46] Vector Contrastive Learning For Pixel-Wise Pretraining In Medical Vision

Yuting He,Shuo Li

Main category: cs.CV

TL;DR: This paper introduces COVER, a novel contrastive learning framework for medical vision that reformulates contrastive learning as a vector regression problem to better preserve pixel-wise feature correlations.

Details Motivation: Extending contrastive learning to pixel-wise representation is essential for medical vision, but standard binary CL causes over-dispersion, disrupting intra-class distributions and pixel-wise feature correlations. Method: The COntrast in VEctor Regression (COVER) framework reformulates CL as a vector regression problem, modeling feature distances through displacement vectors and incorporating a vector pyramid architecture for granularity adaptation. Result: COVER outperforms existing methods across 8 tasks spanning 2 dimensions and 4 modalities, demonstrating superior performance in pixel-wise self-supervised pretraining. Conclusion: COVER provides a new paradigm for contrastive learning in pixel-wise representation, significantly improving self-supervised pretraining for medical vision tasks. Abstract: Contrastive learning (CL) has become a cornerstone of self-supervised pretraining (SSP) in foundation models; however, extending CL to pixel-wise representation, crucial for medical vision, remains an open problem. Standard CL formulates SSP as a binary optimization problem (binary CL) where the excessive pursuit of feature dispersion leads to an over-dispersion problem, breaking pixel-wise feature correlation and thus disrupting the intra-class distribution. Our vector CL reformulates CL as a vector regression problem, enabling dispersion quantification in pixel-wise pretraining via modeling feature distances in regressing displacement vectors. To implement this novel paradigm, we propose the COntrast in VEctor Regression (COVER) framework. COVER establishes an extendable vector-based self-learning, enforces a consistent optimization flow from vector regression to distance modeling, and leverages a vector pyramid architecture for granularity adaptation, thus preserving pixel-wise feature correlations in SSP. 
Extensive experiments across 8 tasks, spanning 2 dimensions and 4 modalities, show that COVER significantly improves pixel-wise SSP, advancing generalizable medical visual foundation models.

[47] Enhancing Ambiguous Dynamic Facial Expression Recognition with Soft Label-based Data Augmentation

Ryosuke Kawamura,Hideaki Hayashi,Shunsuke Otake,Noriko Takemura,Hajime Nagahara

Main category: cs.CV

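TL;DR: This study proposes MIDAS, a data augmentation method that improves recognition of ambiguous expressions in dynamic facial expression recognition, and validates its effectiveness experimentally.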

Details Motivation: In practical applications, accurately recognizing ambiguous facial expressions, which are frequently encountered in in-the-wild data, is essential for dynamic facial expression recognition. Method: The paper proposes MIDAS, a data augmentation method that uses soft labels representing multiple emotion classes to enhance DFER performance, and evaluates it on the DFEW dataset and the newly constructed FERV39k-Plus. Result: Experiments show that models trained on MIDAS-augmented data outperform the state-of-the-art method trained on the original data. Conclusion: MIDAS is a data augmentation method for improving DFER performance on ambiguous facial expression data; it extends mixup to soft-labeled video data by convexly combining video frames and their corresponding emotion class labels. Abstract: Dynamic facial expression recognition (DFER) is a task that estimates emotions from facial expression video sequences. For practical applications, accurately recognizing ambiguous facial expressions -- frequently encountered in in-the-wild data -- is essential. In this study, we propose MIDAS, a data augmentation method designed to enhance DFER performance for ambiguous facial expression data using soft labels representing probabilities of multiple emotion classes. MIDAS augments training data by convexly combining pairs of video frames and their corresponding emotion class labels. This approach extends mixup to soft-labeled video data, offering a simple yet highly effective method for handling ambiguity in DFER. To evaluate MIDAS, we conducted experiments on both the DFEW dataset and FERV39k-Plus, a newly constructed dataset that assigns soft labels to an existing DFER dataset. The results demonstrate that models trained with MIDAS-augmented data achieve superior performance compared to the state-of-the-art method trained on the original dataset.
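The convex combination of frames and soft labels described above can be sketched in a few lines. This is a generic mixup-style illustration under stated assumptions, not the authors' implementation: clips are represented as lists of frames, each frame as a flat list of pixel values, and the mixing weight `lam` is fixed by hand (mixup commonly samples it from a Beta distribution; whether MIDAS does so is not stated in the abstract).

```python
def mix_soft_labeled_clips(clip_a, label_a, clip_b, label_b, lam):
    """Convexly combine two clips frame by frame, and their soft labels.

    clip_*  : list of frames, each frame a flat list of pixel values
    label_* : soft label, i.e. a list of class probabilities summing to 1
    lam     : mixing weight in [0, 1] applied to clip_a / label_a
    """
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    mixed_clip = [
        [lam * pa + (1.0 - lam) * pb for pa, pb in zip(frame_a, frame_b)]
        for frame_a, frame_b in zip(clip_a, clip_b)
    ]
    mixed_label = [lam * la + (1.0 - lam) * lb for la, lb in zip(label_a, label_b)]
    return mixed_clip, mixed_label


# Two toy clips: 2 frames of 3 pixels each, with 3 emotion classes.
clip_a = [[0.0, 0.4, 0.8], [0.2, 0.6, 1.0]]
clip_b = [[1.0, 1.0, 1.0], [0.0, 0.0, 0.0]]
label_a = [0.5, 0.5, 0.0]  # ambiguous between the first two classes
label_b = [0.0, 0.0, 1.0]
mixed_clip, mixed_label = mix_soft_labeled_clips(clip_a, label_a, clip_b, label_b, lam=0.25)
```

Because both inputs are probability vectors and the weights sum to one, the mixed label is again a valid probability vector.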

[48] THIRDEYE: Cue-Aware Monocular Depth Estimation via Brain-Inspired Multi-Stage Fusion

Calin Teodor Ioan

Main category: cs.CV

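TL;DR: ThirdEye is a new monocular depth estimation method that supplies explicit monocular cues through pre-trained networks and fuses them in a multi-stage architecture to improve performance.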

Details Motivation: Traditional monocular depth estimation methods usually infer depth directly from RGB pixels, overlooking the explicit monocular cues that the human visual system relies on, such as occlusion boundaries, shading, and perspective. ThirdEye aims to address this by supplying these cues explicitly. Method: ThirdEye uses pre-trained, frozen networks to supply specific monocular cues and fuses them in a three-stage cortical hierarchy (V1->V2->V3) equipped with a key-value working-memory module; an adaptive-bins transformer head then produces a high-resolution disparity map. Result: Because the cue experts are frozen, ThirdEye inherits large amounts of external supervision while requiring only modest fine-tuning. Conclusion: ThirdEye is a cue-aware monocular depth estimation method that supplies explicit monocular cues through specialised pre-trained networks and fuses them in a three-stage cortical hierarchy. Abstract: Monocular depth estimation methods traditionally train deep models to infer depth directly from RGB pixels. This implicit learning often overlooks explicit monocular cues that the human visual system relies on, such as occlusion boundaries, shading, and perspective. Rather than expecting a network to discover these cues unaided, we present ThirdEye, a cue-aware pipeline that deliberately supplies each cue through specialised, pre-trained, and frozen networks. These cues are fused in a three-stage cortical hierarchy (V1->V2->V3) equipped with a key-value working-memory module that weights them by reliability. An adaptive-bins transformer head then produces a high-resolution disparity map. Because the cue experts are frozen, ThirdEye inherits large amounts of external supervision while requiring only modest fine-tuning. This extended version provides additional architectural detail, neuroscientific motivation, and an expanded experimental protocol; quantitative results will appear in a future revision.

[49] MultiHuman-Testbench: Benchmarking Image Generation for Multiple Humans

Shubhankar Borse,Seokeon Choi,Sunghyun Park,Jeongho Kim,Shreya Kadambi,Risheek Garrepalli,Sungrack Yun,Munawar Hayat,Fatih Porikli

Main category: cs.CV

TL;DR: This paper proposes MultiHuman-Testbench, a new benchmark for evaluating generative models on multi-human image generation.

Details Motivation: Generating images that contain multiple humans while preserving their facial identities is a significant challenge, partly because no dedicated benchmark exists. Method: The paper introduces a new benchmark of 1800 samples, comprising text prompts matched with 5,550 unique human face images, and proposes a multi-faceted evaluation suite built on four key metrics. Result: The authors thoroughly evaluate a range of models, including zero-shot and training-based methods, and propose new techniques incorporating image and region isolation that significantly improve ID similarity. Conclusion: The proposed benchmark and key findings provide valuable insights and a standardized tool for advancing research in multi-human image generation. Abstract: Generation of images containing multiple humans, performing complex actions, while preserving their facial identities, is a significant challenge. A major factor contributing to this is the lack of a dedicated benchmark. To address this, we introduce MultiHuman-Testbench, a novel benchmark for rigorously evaluating generative models for multi-human generation. The benchmark comprises 1800 samples, including carefully curated text prompts, describing a range of simple to complex human actions. These prompts are matched with a total of 5,550 unique human face images, sampled uniformly to ensure diversity across age, ethnic background, and gender. Alongside captions, we provide human-selected pose conditioning images which accurately match the prompt. We propose a multi-faceted evaluation suite employing four key metrics to quantify face count, ID similarity, prompt alignment, and action detection. We conduct a thorough evaluation of a diverse set of models, including zero-shot approaches and training-based methods, with and without regional priors. We also propose novel techniques to incorporate image and region isolation using human segmentation and Hungarian matching, significantly improving ID similarity. Our proposed benchmark and key findings provide valuable insights and a standardized tool for advancing research in multi-human image generation.
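Matching generated faces to reference identities is an assignment problem: each reference should be paired with a distinct generated face so that total similarity is maximal. The sketch below illustrates that objective with a brute-force search over permutations, which returns the same optimum the Hungarian algorithm would for small inputs but is not the Hungarian algorithm itself. The similarity values and function name are made up for illustration, and how the paper computes similarities is not specified here.

```python
from itertools import permutations


def best_assignment(similarity):
    """Return the one-to-one matching that maximizes total similarity.

    similarity[i][j] is the similarity between reference identity i and
    generated face j. Requires at least as many generated faces as
    references. Brute force is O(n!), so this is for small n only.
    """
    n_ref = len(similarity)
    n_gen = len(similarity[0])
    best_perm, best_total = None, float("-inf")
    for perm in permutations(range(n_gen), n_ref):
        total = sum(similarity[i][perm[i]] for i in range(n_ref))
        if total > best_total:
            best_perm, best_total = perm, total
    return best_perm, best_total


# Hypothetical face-embedding similarities (3 references x 3 generated faces).
similarity = [
    [0.90, 0.80, 0.10],
    [0.85, 0.20, 0.30],
    [0.20, 0.70, 0.60],
]
assignment, total = best_assignment(similarity)
mean_id_similarity = total / len(similarity)
```

Note that greedily giving reference 0 its best face (index 0) would force a poor match for reference 1; the optimal assignment instead pairs reference 0 with face 1, which is why a global matching step matters when scoring ID similarity.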

[50] The Role of Cyclopean-Eye in Stereo Vision

Sherlon Almeida da Silva,Davi Geiger,Luiz Velho,Moacir Antonelli Ponti

Main category: cs.CV

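TL;DR: This paper examines how 3D structure and human perception affect depth reconstruction in stereo vision systems, and proposes new geometric constraints, combined with deep learning models and attention mechanisms, to improve accuracy.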

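Details Motivation: To study the geometric foundations of modern stereo vision systems, in particular how 3D structure and human perception contribute to accurate depth reconstruction. Method: The paper revisits the Cyclopean Eye model and proposes new geometric constraints that account for occlusions and depth discontinuities. It evaluates the quality of stereo feature matching derived from deep learning models and the role of attention mechanisms in recovering meaningful 3D surfaces. Result: Theoretical insights and empirical studies on real datasets demonstrate the effectiveness of combining strong geometric priors with learned features. Conclusion: Combining geometric priors with learned features provides internal abstractions for understanding stereo vision systems. Abstract: This work investigates the geometric foundations of modern stereo vision systems, with a focus on how 3D structure and human-inspired perception contribute to accurate depth reconstruction. We revisit the Cyclopean Eye model and propose novel geometric constraints that account for occlusions and depth discontinuities. Our analysis includes the evaluation of stereo feature matching quality derived from deep learning models, as well as the role of attention mechanisms in recovering meaningful 3D surfaces. Through both theoretical insights and empirical studies on real datasets, we demonstrate that combining strong geometric priors with learned features provides internal abstractions for understanding stereo vision systems.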

[51] FaSTA$^*$: Fast-Slow Toolpath Agent with Subroutine Mining for Efficient Multi-turn Image Editing

Advait Gupta,Rishie Raj,Dang Nguyen,Tianyi Zhou

Main category: cs.CV

TL;DR: This paper proposes FaSTA$^*$, a cost-efficient neurosymbolic agent with human-like fast-slow planning, for challenging multi-turn image editing tasks.

Details Motivation: Conventional multi-turn image editing methods face challenges in computational cost and efficiency, so an approach is needed that reduces exploration cost while maintaining high performance. Method: Fast, high-level subtask planning by large language models (LLMs) is combined with slow, accurate tool use and local A$^*$ search per subtask, and frequently used subroutines are extracted and refined through inductive reasoning for reuse. Result: FaSTA$^*$ is significantly more computationally efficient than existing methods while remaining competitive with the state-of-the-art baseline in success rate. Conclusion: By combining fast, high-level subtask planning by LLMs with slow, accurate tool use and local A$^*$ search per subtask, FaSTA$^*$ achieves cost-efficient multi-turn image editing. Abstract: We develop a cost-efficient neurosymbolic agent to address challenging multi-turn image editing tasks such as "Detect the bench in the image while recoloring it to pink. Also, remove the cat for a clearer view and recolor the wall to yellow." It combines the fast, high-level subtask planning by large language models (LLMs) with the slow, accurate, tool-use, and local A$^*$ search per subtask to find a cost-efficient toolpath -- a sequence of calls to AI tools. To save the cost of A$^*$ on similar subtasks, we perform inductive reasoning on previously successful toolpaths via LLMs to continuously extract/refine frequently used subroutines and reuse them as new tools for future tasks in an adaptive fast-slow planning, where the higher-level subroutines are explored first, and only when they fail, the low-level A$^*$ search is activated. The reusable symbolic subroutines considerably save exploration cost on the same types of subtasks applied to similar images, yielding a human-like fast-slow toolpath agent "FaSTA$^*$": fast subtask planning followed by rule-based subroutine selection per subtask is attempted by LLMs at first, which is expected to cover most tasks, while slow A$^*$ search is only triggered for novel and challenging subtasks. By comparing with recent image editing approaches, we demonstrate FaSTA$^*$ is significantly more computationally efficient while remaining competitive with the state-of-the-art baseline in terms of success rate.
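The "slow" half of the planner is an A$^*$ search for a cheap sequence of tool calls. A minimal, generic A$^*$ over a toy tool graph is sketched below to make the idea of a cost-efficient toolpath concrete. The states, tool names, costs, and heuristic values are all hypothetical, and the sketch omits the LLM planning and subroutine reuse that the paper layers on top.

```python
import heapq


def a_star(start, goal, edges, heuristic):
    """Find the cheapest tool sequence from start to goal.

    edges     : dict mapping a state to a list of (tool_name, next_state, cost)
    heuristic : dict mapping a state to an estimate of the remaining cost,
                assumed admissible (it never overestimates)
    Returns (list of tool names, total cost), or (None, inf) if unreachable.
    """
    frontier = [(heuristic[start], 0.0, start, [])]
    best_cost = {start: 0.0}
    while frontier:
        _, cost, state, path = heapq.heappop(frontier)
        if state == goal:
            return path, cost
        if cost > best_cost.get(state, float("inf")):
            continue  # stale entry: a cheaper route to this state was found
        for tool, nxt, step_cost in edges.get(state, []):
            new_cost = cost + step_cost
            if new_cost < best_cost.get(nxt, float("inf")):
                best_cost[nxt] = new_cost
                priority = new_cost + heuristic[nxt]
                heapq.heappush(frontier, (priority, new_cost, nxt, path + [tool]))
    return None, float("inf")


# Hypothetical subtask: recolor an object, via a cheap or an expensive route.
edges = {
    "input": [("detector_fast", "detected", 1.0), ("segmenter_slow", "masked", 4.0)],
    "detected": [("mask_from_box", "masked", 1.0)],
    "masked": [("recolor", "done", 2.0)],
}
heuristic = {"input": 2.0, "detected": 2.0, "masked": 2.0, "done": 0.0}
toolpath, toolpath_cost = a_star("input", "done", edges, heuristic)
```

Here the search prefers the three-step route through the cheap detector (total cost 4.0) over the two-step route through the expensive segmenter (total cost 6.0), illustrating that the shortest toolpath is not necessarily the cheapest.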

[52] M2SFormer: Multi-Spectral and Multi-Scale Attention with Edge-Aware Difficulty Guidance for Image Forgery Localization

Ju-Hyeon Nam,Dong-Hyun Moon,Sang-Chul Lee

Main category: cs.CV

TL;DR: This paper proposes M2SFormer, a new image forgery localization method that improves computational efficiency and representation power by integrating multi-frequency and multi-scale attentions with a difficulty-guided attention module.

Details Motivation: To address the challenges of computational overhead and limited representation power in deep learning-based image editing techniques, particularly for subtle or complex tampering. Method: M2SFormer, a novel Transformer encoder-based framework that unifies multi-frequency and multi-scale attentions in the skip connection and utilizes a global prior map to preserve fine details during upsampling. Result: Extensive experiments on multiple benchmark datasets demonstrate that M2SFormer outperforms existing state-of-the-art models. Conclusion: M2SFormer is effective in overcoming computational overhead and limited representation power in image forgery localization, showing superior generalization in detecting and localizing forgeries across unseen domains. Abstract: Image editing techniques have rapidly advanced, facilitating both innovative use cases and malicious manipulation of digital images. Deep learning-based methods have recently achieved high accuracy in pixel-level forgery localization, yet they frequently struggle with computational overhead and limited representation power, particularly for subtle or complex tampering. In this paper, we propose M2SFormer, a novel Transformer encoder-based framework designed to overcome these challenges. Unlike approaches that process spatial and frequency cues separately, M2SFormer unifies multi-frequency and multi-scale attentions in the skip connection, harnessing global context to better capture diverse forgery artifacts. Additionally, our framework addresses the loss of fine detail during upsampling by utilizing a global prior map, a curvature metric indicating the difficulty of forgery localization, which then guides a difficulty-guided attention module to preserve subtle manipulations more effectively. 
Extensive experiments on multiple benchmark datasets demonstrate that M2SFormer outperforms existing state-of-the-art models, offering superior generalization in detecting and localizing forgeries across unseen domains.

[53] PhysRig: Differentiable Physics-Based Skinning and Rigging Framework for Realistic Articulated Object Modeling

Hao Zhang,Haolan Xu,Chun Feng,Varun Jampani,Narendra Ahuja

Main category: cs.CV

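TL;DR: The paper proposes PhysRig, a new framework that addresses skinning and rigging for animation, articulated object reconstruction, motion transfer, and 4D generation.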

Details Motivation: Existing approaches predominantly rely on Linear Blend Skinning (LBS) for its simplicity and differentiability, but LBS suffers from volume loss and unnatural deformations, and it cannot model elastic materials such as soft tissues, fur, and flexible appendages (e.g., elephant trunks, ears, and fatty tissues). Method: PhysRig embeds the rigid skeleton into a volumetric representation (e.g., a tetrahedral mesh) and simulates it as a deformable soft-body structure driven by the animated skeleton. The method leverages continuum mechanics and discretizes the object as particles embedded in an Eulerian background grid to ensure differentiability with respect to both material properties and skeletal motion. It also introduces material prototypes, which greatly reduce the learning space while maintaining high expressiveness. Result: By addressing the limitations of traditional LBS, PhysRig generates more realistic and physically plausible results; a comprehensive synthetic dataset is constructed to evaluate the framework. Conclusion: PhysRig is a differentiable physics-based skinning and rigging framework that resolves the problems of traditional LBS and demonstrates its versatility on the pose transfer task. Abstract: Skinning and rigging are fundamental components in animation, articulated object reconstruction, motion transfer, and 4D generation. Existing approaches predominantly rely on Linear Blend Skinning (LBS), due to its simplicity and differentiability. However, LBS introduces artifacts such as volume loss and unnatural deformations, and it fails to model elastic materials like soft tissues, fur, and flexible appendages (e.g., elephant trunks, ears, and fatty tissues). In this work, we propose PhysRig: a differentiable physics-based skinning and rigging framework that overcomes these limitations by embedding the rigid skeleton into a volumetric representation (e.g., a tetrahedral mesh), which is simulated as a deformable soft-body structure driven by the animated skeleton. Our method leverages continuum mechanics and discretizes the object as particles embedded in an Eulerian background grid to ensure differentiability with respect to both material properties and skeletal motion. Additionally, we introduce material prototypes, significantly reducing the learning space while maintaining high expressiveness. To evaluate our framework, we construct a comprehensive synthetic dataset using meshes from Objaverse, The Amazing Animals Zoo, and MixaMo, covering diverse object categories and motion patterns. Our method consistently outperforms traditional LBS-based approaches, generating more realistic and physically plausible results. 
Furthermore, we demonstrate the applicability of our framework in the pose transfer task highlighting its versatility for articulated object modeling.

[54] AIR-VIEW: The Aviation Image Repository for Visibility Estimation of Weather, A Dataset and Benchmark

Chad Mourning,Zhewei Wang,Justin Murray

Main category: cs.CV

TL;DR: This paper presents a new dataset for aviation atmospheric visibility estimation and benchmarks common approaches on it.

Details Motivation: There are currently no publicly available datasets, labeled with visibility estimates, that are suitable for supervised learning, which hinders the development of low-cost aviation weather sensing. Method: Image data were gathered from the Federal Aviation Administration (FAA) weather camera network over a year-long collection campaign, and benchmarks were run on multiple datasets. Result: A new dataset was developed, and benchmark results on three publicly available datasets and the new dataset are reported, compared against a recently ratified ASTM standard. Conclusion: The paper introduces a new dataset for aviation atmospheric visibility estimation and provides benchmark results for three commonly used approaches and a general-purpose baseline. Abstract: Machine Learning for aviation weather is a growing area of research for providing low-cost alternatives for traditional, expensive weather sensors; however, in the area of atmospheric visibility estimation, publicly available datasets, tagged with visibility estimates, of distances relevant for aviation, of diverse locations, of sufficient size for use in supervised learning, are absent. This paper introduces a new dataset which represents the culmination of a year-long data collection campaign of images from the FAA weather camera network suitable for this purpose. We also present a benchmark when applying three commonly used approaches and a general-purpose baseline when trained and tested on three publicly available datasets, in addition to our own, when compared against a recently ratified ASTM standard.

[55] Hierarchical Sub-action Tree for Continuous Sign Language Recognition

Dejie Yang,Zhu Xu,Xinjie Gao,Yang Liu

Main category: cs.CV

TL;DR: This paper proposes HST-CSLR, which uses a hierarchical tree structure and contrastive alignment to enhance continuous sign language recognition by better leveraging textual gloss knowledge.

Details Motivation: Lack of large datasets and precise annotations has become a bottleneck for CSLR due to insufficient training data. Method: A Hierarchical Sub-action Tree (HST) is constructed to align visual and textual modalities step-by-step while reducing computational complexity, along with contrastive alignment enhancement. Result: Experiments on four datasets (PHOENIX-2014, PHOENIX-2014T, CSL-Daily, and Sign Language Gesture) demonstrate the effectiveness of the proposed HST-CSLR approach. Conclusion: The proposed HST-CSLR method effectively combines gloss knowledge with visual representation learning for continuous sign language recognition. Abstract: Continuous sign language recognition (CSLR) aims to transcribe untrimmed videos into glosses, which are typically textual words. Recent studies indicate that the lack of large datasets and precise annotations has become a bottleneck for CSLR due to insufficient training data. To address this, some works have developed cross-modal solutions to align visual and textual modalities. However, they typically extract textual features from glosses without fully utilizing their knowledge. In this paper, we propose the Hierarchical Sub-action Tree (HST), termed HST-CSLR, to efficiently combine gloss knowledge with visual representation learning. By incorporating gloss-specific knowledge from large language models, our approach leverages textual information more effectively. Specifically, we construct an HST for textual information representation, aligning visual and textual modalities step-by-step and benefiting from the tree structure to reduce computational complexity. Additionally, we impose a contrastive alignment enhancement to bridge the gap between the two modalities. Experiments on four datasets (PHOENIX-2014, PHOENIX-2014T, CSL-Daily, and Sign Language Gesture) demonstrate the effectiveness of our HST-CSLR.

[56] OmniEval: A Benchmark for Evaluating Omni-modal Models with Visual, Auditory, and Textual Inputs

Yiman Zhang,Ziheng Luo,Qiangyu Yan,Wei He,Borui Jiang,Xinghao Chen,Kai Han

Main category: cs.CV

TL;DR: OmniEval is a comprehensive benchmark for evaluating omni-modality models, featuring full-modal collaboration, diverse audio-visual content, and granular task evaluation.

Details Motivation: To overcome the limitations of existing benchmarks in evaluating omni-modality models, which require strong coupling and collaboration between visual, auditory, and textual inputs. Method: The OmniEval benchmark includes 810 audio-visual synchronized videos (285 Chinese and 525 English) and 2617 question-answer pairs (1412 open-ended and 1205 multiple-choice), categorized into 3 major task types and 12 sub-task types. A detailed video localization task called Grounding is introduced. Result: Experiments were conducted on several omni-modality models using the OmniEval benchmark, highlighting its effectiveness in evaluating multi-modal integration and understanding. Conclusion: OmniEval is a new benchmark designed to evaluate omni-modality models, providing a platform for assessing the ability to integrate and understand multiple modalities. Abstract: In this paper, we introduce OmniEval, a benchmark for evaluating omni-modality models like MiniCPM-O 2.6, which encompasses visual, auditory, and textual inputs. Compared with existing benchmarks, our OmniEval has several distinctive features: (i) Full-modal collaboration: We design evaluation tasks that highlight the strong coupling between audio and video, requiring models to effectively leverage the collaborative perception of all modalities; (ii) Diversity of videos: OmniEval includes 810 audio-visual synchronized videos, 285 Chinese videos and 525 English videos; (iii) Diversity and granularity of tasks: OmniEval contains 2617 question-answer pairs, comprising 1412 open-ended questions and 1205 multiple-choice questions. These questions are divided into 3 major task types and 12 sub-task types to achieve comprehensive evaluation. Among them, we introduce a more granular video localization task named Grounding. Then we conduct experiments on OmniEval with several omni-modality models. 
We hope that our OmniEval can provide a platform for evaluating the ability to construct and understand coherence from the context of all modalities. Code and data can be found at https://omnieval.github.io/.

[57] Evidence-based diagnostic reasoning with multi-agent copilot for human pathology

Chengkuan Chen,Luca L. Weishaupt,Drew F. K. Williamson,Richard J. Chen,Tong Ding,Bowen Chen,Anurag Vaidya,Long Phi Le,Guillaume Jaume,Ming Y. Lu,Faisal Mahmood

Main category: cs.CV

TL;DR: PathChat+ and SlideSeek improve computational pathology by integrating advanced AI models with autonomous diagnostic reasoning, outperforming previous approaches and generating human-interpretable reports.

Details Motivation: To address the limitations of current multimodal large language models (MLLMs) in computational pathology, such as insufficient training data, inadequate support for multi-image understanding, and lack of autonomous diagnostic reasoning capabilities, while integrating natural language instruction and rich textual context. Method: The authors introduced PathChat+, a new MLLM trained on over 1 million diverse, pathology-specific instruction samples and nearly 5.5 million question-answer turns, along with SlideSeek, a reasoning-enabled multi-agent AI system that leverages PathChat+ for iterative, hierarchical diagnostic reasoning on gigapixel WSIs. Result: Extensive evaluations showed that PathChat+ significantly outperforms prior models including state-of-the-art general-purpose and other pathology-specific models. SlideSeek achieved high accuracy on DDxBench, an open-ended differential diagnosis benchmark, and can generate interpretable summary reports. Conclusion: PathChat+ and SlideSeek represent significant advancements in computational pathology by overcoming existing limitations of multimodal large language models, enabling autonomous diagnostic reasoning and high-accuracy evaluation of whole-slide images. Abstract: Pathology is experiencing rapid digital transformation driven by whole-slide imaging and artificial intelligence (AI). While deep learning-based computational pathology has achieved notable success, traditional models primarily focus on image analysis without integrating natural language instruction or rich, text-based context. Current multimodal large language models (MLLMs) in computational pathology face limitations, including insufficient training data, inadequate support and evaluation for multi-image understanding, and a lack of autonomous, diagnostic reasoning capabilities. 
To address these limitations, we introduce PathChat+, a new MLLM specifically designed for human pathology, trained on over 1 million diverse, pathology-specific instruction samples and nearly 5.5 million question-answer turns. Extensive evaluations across diverse pathology benchmarks demonstrated that PathChat+ substantially outperforms the prior PathChat copilot, as well as both state-of-the-art (SOTA) general-purpose and other pathology-specific models. Furthermore, we present SlideSeek, a reasoning-enabled multi-agent AI system leveraging PathChat+ to autonomously evaluate gigapixel whole-slide images (WSIs) through iterative, hierarchical diagnostic reasoning, reaching high accuracy on DDxBench, a challenging open-ended differential diagnosis benchmark, while also being capable of generating visually grounded, human-interpretable summary reports.

[58] DFVEdit: Conditional Delta Flow Vector for Zero-shot Video Editing

Lingling Cai,Kang Zhao,Hangjie Yuan,Xiang Wang,Yingya Zhang,Kejie Huang

Main category: cs.CV

TL;DR: DFVEdit is a highly efficient, zero-shot video editing method for Video DiTs that avoids attention modification and fine-tuning, offering significant speed and memory improvements without compromising quality.

Details Motivation: Existing video editing methods applied to Video DiTs incur substantial computational overhead due to resource-intensive operations like attention modification or fine-tuning. Method: DFVEdit operates on clean latents via flow transformation, unifying editing and sampling under a continuous flow perspective. It introduces Conditional Delta Flow Vector (CDFV), Implicit Cross Attention (ICA) guidance, and Embedding Reinforcement (ER). Result: DFVEdit achieves at least 20x inference speed-up and 85% memory reduction compared to existing methods while maintaining superior performance on structural fidelity, spatial-temporal consistency, and editing quality. Conclusion: DFVEdit provides an efficient zero-shot video editing method for Video DiTs without attention modification or fine-tuning, achieving high efficiency and performance. Abstract: The advent of Video Diffusion Transformers (Video DiTs) marks a milestone in video generation. However, directly applying existing video editing methods to Video DiTs often incurs substantial computational overhead, due to resource-intensive attention modification or fine-tuning. To alleviate this problem, we present DFVEdit, an efficient zero-shot video editing method tailored for Video DiTs. DFVEdit eliminates the need for both attention modification and fine-tuning by directly operating on clean latents via flow transformation. To be more specific, we observe that editing and sampling can be unified under the continuous flow perspective. Building upon this foundation, we propose the Conditional Delta Flow Vector (CDFV) -- a theoretically unbiased estimation of DFV -- and integrate Implicit Cross Attention (ICA) guidance as well as Embedding Reinforcement (ER) to further enhance editing quality. DFVEdit excels in practical efficiency, offering at least 20x inference speed-up and 85% memory reduction on Video DiTs compared to attention-engineering-based editing methods. 
Extensive quantitative and qualitative experiments demonstrate that DFVEdit can be seamlessly applied to popular Video DiTs (e.g., CogVideoX and Wan2.1), attaining state-of-the-art performance on structural fidelity, spatial-temporal consistency, and editing quality.

[59] From Cradle to Cane: A Two-Pass Framework for High-Fidelity Lifespan Face Aging

Tao Liu,Dafeng Zhang,Gengchen Li,Shizhuo Liu,Yongqi Song,Senmao Li,Shiqi Yang,Boqian Li,Kai Wang,Yaxing Wang

Main category: cs.CV

TL;DR: This paper proposes a two-pass face aging framework called Cradle2Cane that effectively balances age accuracy and identity preservation, outperforming existing methods.

Details Motivation: To overcome the core challenge in face aging, known as the Age-ID trade-off, where existing methods struggle to balance age accuracy with identity preservation. Method: A two-pass face aging framework based on few-step text-to-image diffusion models, with the first pass focusing on age accuracy using an adaptive noise injection mechanism and textual conditions, and the second pass enhancing identity preservation through conditioning on identity-aware embeddings. Result: Extensive experiments on the CelebA-HQ test dataset showed that Cradle2Cane outperforms existing face aging methods in both age accuracy and identity consistency. Conclusion: The proposed Cradle2Cane framework effectively addresses the Age-ID trade-off in face aging tasks, demonstrating superior performance in age accuracy and identity consistency compared to existing methods. Abstract: Face aging has become a crucial task in computer vision, with applications ranging from entertainment to healthcare. However, existing methods struggle with achieving a realistic and seamless transformation across the entire lifespan, especially when handling large age gaps or extreme head poses. The core challenge lies in balancing age accuracy and identity preservation--what we refer to as the Age-ID trade-off. Most prior methods either prioritize age transformation at the expense of identity consistency or vice versa. In this work, we address this issue by proposing a two-pass face aging framework, named Cradle2Cane, based on few-step text-to-image (T2I) diffusion models. The first pass focuses on solving age accuracy by introducing an adaptive noise injection (AdaNI) mechanism. This mechanism is guided by including prompt descriptions of age and gender for the given person as the textual condition. Also, by adjusting the noise level, we can control the strength of aging while allowing more flexibility in transforming the face. 
However, identity preservation is weakly ensured here to facilitate stronger age transformations. In the second pass, we enhance identity preservation while maintaining age-specific features by conditioning the model on two identity-aware embeddings (IDEmb): SVR-ArcFace and Rotate-CLIP. This pass allows for denoising the transformed image from the first pass, ensuring stronger identity preservation without compromising the aging accuracy. Both passes are jointly trained in an end-to-end way. Extensive experiments on the CelebA-HQ test dataset, evaluated through Face++ and Qwen-VL protocols, show that our Cradle2Cane outperforms existing face aging methods in age accuracy and identity consistency.

[60] 3D Scene-Camera Representation with Joint Camera Photometric Optimization

Weichen Dai,Kangcheng Ma,Jiaxin Wang,Kecen Pan,Yuhang Ming,Hua Zhang,Wanzeng Kong

Main category: cs.CV

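TL;DR: This paper proposes a new 3D scene-camera representation with joint photometric optimization that effectively removes the influence of imaging degradation on 3D reconstruction and improves reconstruction quality.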

Details Motivation: Inherent photometric distortions in camera imaging can significantly degrade image quality, so a method that separates these distortions is needed to improve the quality of 3D scene representations. Method: The parameters of the camera representation are optimized simultaneously, and a depth regularization is introduced during the optimization of the photometric parameters to prevent the 3D scene representation from fitting scene-unrelated information. Result: Experiments show that the proposed method achieves high-quality 3D scene representations even under imaging degradation such as vignetting and dirt. Conclusion: The paper proposes a novel 3D scene-camera representation that, by introducing internal and external photometric models, effectively separates scene-unrelated information from the 3D scene representation. Abstract: Representing scenes from multi-view images is a crucial task in computer vision with extensive applications. However, inherent photometric distortions in the camera imaging can significantly degrade image quality. Without accounting for these distortions, the 3D scene representation may inadvertently incorporate erroneous information unrelated to the scene, diminishing the quality of the representation. In this paper, we propose a novel 3D scene-camera representation with joint camera photometric optimization. By introducing internal and external photometric model, we propose a full photometric model and corresponding camera representation. Based on simultaneously optimizing the parameters of the camera representation, the proposed method effectively separates scene-unrelated information from the 3D scene representation. Additionally, during the optimization of the photometric parameters, we introduce a depth regularization to prevent the 3D scene representation from fitting scene-unrelated information. By incorporating the camera model as part of the mapping process, the proposed method constructs a complete map that includes both the scene radiance field and the camera photometric model. Experimental results demonstrate that the proposed method can achieve high-quality 3D scene representations, even under conditions of imaging degradation, such as vignetting and dirt.

[61] Rethink Sparse Signals for Pose-guided Text-to-image Generation

Wenjie Xuan,Jing Zhang,Juhua Liu,Bo Du,Dacheng Tao

Main category: cs.CV

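TL;DR: This paper proposes SP-Ctrl, which uses sparse signals for effective pose-guided image generation, addresses the problems of dense signals, and achieves strong results.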

Details Motivation: Dense signals suffer from editing difficulties and potential inconsistencies with textual prompts, which motivates revisiting sparse signals for pose guidance. Method: A novel Spatial-Pose ControlNet (SP-Ctrl) is proposed that extends OpenPose to a learnable spatial representation and introduces a keypoint concept learning mechanism. Result: Experiments show that SP-Ctrl outperforms recent spatially controllable T2I generation methods under sparse-pose guidance and performs well on animal- and human-centric image generation tasks. Conclusion: Using sparse signals, SP-Ctrl achieves performance comparable to dense signal-based methods and shows potential for diverse and cross-species generation. Abstract: Recent works favored dense signals (e.g., depth, DensePose), as an alternative to sparse signals (e.g., OpenPose), to provide detailed spatial guidance for pose-guided text-to-image generation. However, dense representations raised new challenges, including editing difficulties and potential inconsistencies with textual prompts. This fact motivates us to revisit sparse signals for pose guidance, owing to their simplicity and shape-agnostic nature, which remains underexplored. This paper proposes a novel Spatial-Pose ControlNet (SP-Ctrl), equipping sparse signals with robust controllability for pose-guided image generation. Specifically, we extend OpenPose to a learnable spatial representation, making keypoint embeddings discriminative and expressive. Additionally, we introduce keypoint concept learning, which encourages keypoint tokens to attend to the spatial positions of each keypoint, thus improving pose alignment. Experiments on animal- and human-centric image generation tasks demonstrate that our method outperforms recent spatially controllable T2I generation approaches under sparse-pose guidance and even matches the performance of dense signal-based methods. Moreover, SP-Ctrl shows promising capabilities in diverse and cross-species generation through sparse signals. Codes will be available at https://github.com/DREAMXFAR/SP-Ctrl.

[62] EVA: Mixture-of-Experts Semantic Variant Alignment for Compositional Zero-Shot Learning

Xiao Zhang,Yongqiang Ma,Haodong Jing,Nanning Zheng

Main category: cs.CV

TL;DR: This paper proposes EVA, a new framework that improves CZSL performance through domain-expert adaption and semantic variant alignment.

Details Motivation: Existing CZSL methods cannot effectively handle sets of individuals that can be divided into distinct semantic subsets, and they neglect compositional divergence within identical states or objects. Method: The EVA framework is proposed, comprising domain-expert adaption and semantic variant alignment, to improve the generalization ability of CZSL. Result: EVA significantly outperforms state-of-the-art methods on three popular datasets in both closed- and open-world settings. Conclusion: EVA significantly outperforms existing CZSL methods, demonstrating its effectiveness for image-composition alignment. Abstract: Compositional Zero-Shot Learning (CZSL) investigates compositional generalization capacity to recognize unknown state-object pairs based on learned primitive concepts. Existing CZSL methods typically derive primitives features through a simple composition-prototype mapping, which is suboptimal for a set of individuals that can be divided into distinct semantic subsets. Moreover, the all-to-one cross-modal primitives matching neglects compositional divergence within identical states or objects, limiting fine-grained image-composition alignment. In this study, we propose EVA, a Mixture-of-Experts Semantic Variant Alignment framework for CZSL. Specifically, we introduce domain-expert adaption, leveraging multiple experts to achieve token-aware learning and model high-quality primitive representations. To enable accurate compositional generalization, we further present semantic variant alignment to select semantically relevant representation for image-primitives matching. Our method significantly outperforms other state-of-the-art CZSL methods on three popular benchmarks in both closed- and open-world settings, demonstrating the efficacy of the proposed insight.

[63] Segment Anything in Pathology Images with Natural Language

Zhixuan Chen,Junlin Hou,Liqi Lin,Yihui Wang,Yequan Bie,Xi Wang,Yanning Zhou,Ronald Cheong Kin Chan,Hao Chen

Main category: cs.CV

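TL;DR: This paper proposes PathSegmentor, a text-prompted segmentation model, together with the large-scale PathSeg dataset, enabling more accurate, broadly applicable, and interpretable pathology image segmentation.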

Details Motivation: To address the problems that current pathology image segmentation methods face in clinical applications due to limited annotated data and restricted category definitions. Method: The PathSegmentor model and the PathSeg dataset are proposed, and their performance is validated through extensive experiments. Result: PathSegmentor surpasses existing spatial- and text-prompted models by 0.145 and 0.429 in overall Dice score, respectively, showing stronger robustness and applicability. Conclusion: PathSegmentor is the first text-prompted segmentation foundation model designed specifically for pathology images; it enhances the interpretability of diagnostic models and advances explainable AI in precision oncology. Abstract: Pathology image segmentation is crucial in computational pathology for analyzing histological features relevant to cancer diagnosis and prognosis. However, current methods face major challenges in clinical applications due to limited annotated data and restricted category definitions. To address these limitations, we propose PathSegmentor, the first text-prompted segmentation foundation model designed specifically for pathology images. We also introduce PathSeg, the largest and most comprehensive dataset for pathology segmentation, built from 17 public sources and containing 275k image-mask-label triples across 160 diverse categories. With PathSegmentor, users can perform semantic segmentation using natural language prompts, eliminating the need for laborious spatial inputs such as points or boxes. Extensive experiments demonstrate that PathSegmentor outperforms specialized models with higher accuracy and broader applicability, while maintaining a compact architecture. It significantly surpasses existing spatial- and text-prompted models by 0.145 and 0.429 in overall Dice scores, respectively, showing strong robustness in segmenting complex structures and generalizing to external datasets. Moreover, PathSegmentor's outputs enhance the interpretability of diagnostic models through feature importance estimation and imaging biomarker discovery, offering pathologists evidence-based support for clinical decision-making. This work advances the development of explainable AI in precision oncology.
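
For readers less familiar with the Dice score behind the 0.145 and 0.429 gains reported above, a minimal sketch of the metric on flat binary masks is given here. The function name `dice_score`, the toy masks, and the convention for two empty masks are illustrative assumptions, not details taken from the paper.

```python
def dice_score(pred, target):
    """Dice coefficient 2*|A & B| / (|A| + |B|) for two flat binary masks."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same length")
    # Count pixels that are foreground in both masks.
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    # Sum of foreground pixel counts of the two masks.
    total = sum(1 for p in pred if p) + sum(1 for t in target if t)
    # Convention chosen here (assumption): two empty masks match perfectly.
    if total == 0:
        return 1.0
    return 2.0 * intersection / total


# Toy example: 2 overlapping foreground pixels, 3 foreground pixels per mask.
pred = [1, 1, 0, 0, 1, 0]
target = [1, 0, 0, 0, 1, 1]
print(dice_score(pred, target))
```

Because Dice lies in [0, 1], an absolute gain of 0.429 corresponds to a large share of the metric's full range.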

[64] TSDASeg: A Two-Stage Model with Direct Alignment for Interactive Point Cloud Segmentation

Chade Li,Pengju Zhang,Yihong Wu

Main category: cs.CV

TL;DR: This paper introduces TSDASeg, a novel approach for interactive point cloud segmentation that improves 3D-text alignment and cross-modal feature consistency using a two-stage model with direct alignment and memory modules.

Details Motivation: Existing methods underperform in point-level tasks like segmentation due to missing direct 3D-text alignment, which limits their ability to link local 3D features with textual context. Method: TSDASeg uses a Two-Stage model with a Direct cross-modal Alignment module and a memory module containing multiple dedicated memory banks to align 3D point clouds with textual/2D image data, leveraging self-attention and cross-attention mechanisms for dynamic feature updates. Result: Experiments on multiple datasets show that TSDASeg outperforms existing approaches, achieving superior results in 3D instruction, reference, and semantic segmentation tasks. Conclusion: The proposed TSDASeg method achieves state-of-the-art performance in interactive point cloud segmentation by addressing the issue of direct 3D-text alignment and improving cross-modal feature consistency. Abstract: The rapid advancement of 3D vision-language models (VLMs) has spurred significant interest in interactive point cloud processing tasks, particularly for real-world applications. However, existing methods often underperform in point-level tasks, such as segmentation, due to missing direct 3D-text alignment, limiting their ability to link local 3D features with textual context. To solve this problem, we propose TSDASeg, a Two-Stage model coupled with a Direct cross-modal Alignment module and memory module for interactive point cloud Segmentation. We introduce the direct cross-modal alignment module to establish explicit alignment between 3D point clouds and textual/2D image data. Within the memory module, we employ multiple dedicated memory banks to separately store text features, visual features, and their cross-modal correspondence mappings. 
These memory banks are dynamically leveraged through self-attention and cross-attention mechanisms to update scene-specific features based on prior stored data, effectively addressing inconsistencies in interactive segmentation results across diverse scenarios. Experiments conducted on multiple 3D instruction, reference, and semantic segmentation datasets demonstrate that the proposed method achieves state-of-the-art performance.

[65] Step-by-Step Video-to-Audio Synthesis via Negative Audio Guidance

Akio Hayakawa,Masato Ishii,Takashi Shibuya,Yuki Mitsufuji

Main category: cs.CV

TL;DR: This paper introduces a novel sequential method for generating distinct audio tracks from a video, using guided synthesis and pre-trained models without requiring specialized datasets.

Details Motivation: To comprehensively capture all sound events induced by a given video, mirroring traditional Foley workflows, and eliminating the need for specialized paired datasets. Method: A step-by-step video-to-audio generation approach inspired by concept negation from compositional generation frameworks. It uses pre-trained video-to-audio models and formulates each generation step as a guided synthesis task conditioned on text prompts and previously generated audio tracks. Result: Experimental results show that the method successfully generates individual audio tracks corresponding to specific sound events in the video, achieving higher-quality composite audio synthesis. Conclusion: The proposed method can generate multiple semantically distinct audio tracks for a single input video, leading to higher-quality composite audio synthesis compared to existing baselines. Abstract: We propose a novel step-by-step video-to-audio generation method that sequentially produces individual audio tracks, each corresponding to a specific sound event in the video. Our approach mirrors traditional Foley workflows, aiming to capture all sound events induced by a given video comprehensively. Each generation step is formulated as a guided video-to-audio synthesis task, conditioned on a target text prompt and previously generated audio tracks. This design is inspired by the idea of concept negation from prior compositional generation frameworks. To enable this guided generation, we introduce a training framework that leverages pre-trained video-to-audio models and eliminates the need for specialized paired datasets, allowing training on more accessible data. Experimental results demonstrate that our method generates multiple semantically distinct audio tracks for a single input video, leading to higher-quality composite audio synthesis than existing baselines.
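
The guided synthesis step described above can be pictured with the generic concept-negation form of classifier-free guidance: one term pushes the prediction toward the target text prompt, and a second term pushes it away from audio that was already generated. The sketch below is an assumption about the general shape of such guidance, not the paper's exact formulation; `guided_prediction`, the weights `w_text` and `w_neg`, and the toy vectors are all hypothetical.

```python
def guided_prediction(uncond, text_cond, audio_cond, w_text=3.0, w_neg=1.5):
    """Combine three model outputs element-wise.

    uncond:     prediction with no conditioning
    text_cond:  prediction conditioned on the target text prompt
    audio_cond: prediction conditioned on previously generated audio
    The result moves toward the text condition and away from the audio one.
    """
    return [
        u + w_text * (t - u) - w_neg * (a - u)
        for u, t, a in zip(uncond, text_cond, audio_cond)
    ]


# Toy 2-D example (hypothetical values).
uncond = [0.0, 0.0]
text_cond = [1.0, 0.0]
audio_cond = [0.0, 1.0]
print(guided_prediction(uncond, text_cond, audio_cond))
```

Setting `w_neg` to zero recovers ordinary text-conditioned guidance, which makes the negative term easy to ablate in such a sketch.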

[66] DBMovi-GS: Dynamic View Synthesis from Blurry Monocular Video via Sparse-Controlled Gaussian Splatting

Yeon-Ji Song,Jaein Kim,Byung-Ju Kim,Byoung-Tak Zhang

Main category: cs.CV

TL;DR: This paper introduces DBMovi-GS, a novel method for synthesizing views from dynamic, blurry monocular videos by leveraging Gaussian splatting to reconstruct sharp, detailed 3D scenes.

Details Motivation: Synthesizing dynamic scenes from blurry monocular videos is an unresolved challenge due to existing methods' reliance on high-resolution images or strong assumptions about static geometry, leading to instability and degraded visual fidelity in real-world dynamic environments. Method: The method, called Motion-aware Dynamic View Synthesis from Blurry Monocular Video via Sparse-Controlled Gaussian Splatting (DBMovi-GS), generates dense 3D Gaussians to restore sharpness from blurry videos and reconstructs detailed 3D geometry affected by dynamic motion variations. Result: The model achieves robust performance in novel view synthesis under dynamic blurry scenes and provides improved visual fidelity and stability compared to previous approaches. Conclusion: The proposed DBMovi-GS method effectively addresses the challenge of dynamic view synthesis from blurry monocular videos, achieving robust performance and setting a new benchmark in realistic novel view synthesis for such inputs. Abstract: Novel view synthesis is a task of generating scenes from unseen perspectives; however, synthesizing dynamic scenes from blurry monocular videos remains an unresolved challenge that has yet to be effectively addressed. Existing novel view synthesis methods are often constrained by their reliance on high-resolution images or strong assumptions about static geometry and rigid scene priors. Consequently, their approaches lack robustness in real-world environments with dynamic object and camera motion, leading to instability and degraded visual fidelity. To address this, we propose Motion-aware Dynamic View Synthesis from Blurry Monocular Video via Sparse-Controlled Gaussian Splatting (DBMovi-GS), a method designed for dynamic view synthesis from blurry monocular videos. Our model generates dense 3D Gaussians, restoring sharpness from blurry videos and reconstructing detailed 3D geometry of the scene affected by dynamic motion variations. 
Our model achieves robust performance in novel view synthesis under dynamic blurry scenes and sets a new benchmark in realistic novel view synthesis for blurry monocular video inputs.

[67] Style-Aligned Image Composition for Robust Detection of Abnormal Cells in Cytopathology

Qiuyi Qi,Xin Li,Ming Kong,Zikang Xu,Bingdi Chen,Qiang Zhu,S Kevin Zhou

Main category: cs.CV

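TL;DR: This paper proposes a style-aligned image composition method to improve the performance and robustness of abnormal cell detection in cytopathology.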

Details Motivation: The lack of high-quality annotations, long-tailed data distributions, and inconsistent staining styles hinder neural networks from robustly detecting abnormal cells in cytopathology. Method: The style-aligned image composition (SAIC) method consists of three steps: candidate selection, high-frequency feature reconstruction, and filtering with a large vision-language model. Result: Experiments show that incorporating SAIC-synthesized images effectively improves detection performance, especially for tail categories and styles. Conclusion: SAIC is generalizable and practical for clinical applications and effectively improves the performance and robustness of abnormal cell detection. Abstract: Challenges such as the lack of high-quality annotations, long-tailed data distributions, and inconsistent staining styles pose significant obstacles to training neural networks to detect abnormal cells in cytopathology robustly. This paper proposes a style-aligned image composition (SAIC) method that composes high-fidelity and style-preserved pathological images to enhance the effectiveness and robustness of detection models. Without additional training, SAIC first selects an appropriate candidate from the abnormal cell bank based on attribute guidance. Then, it employs a high-frequency feature reconstruction to achieve a style-aligned and high-fidelity composition of abnormal cells and pathological backgrounds. Finally, it introduces a large vision-language model to filter high-quality synthesis images. Experimental results demonstrate that incorporating SAIC-synthesized images effectively enhances the performance and robustness of abnormal cell detection for tail categories and styles, thereby improving overall detection performance. The comprehensive quality evaluation further confirms the generalizability and practicality of SAIC in clinical application scenarios. Our code will be released at https://github.com/Joey-Qi/SAIC.

[68] Inverse Scene Text Removal

Takumi Yoshimatsu,Shumpei Takezaki,Seiichi Uchida

Main category: cs.CV

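TL;DR: This paper investigates Inverse Scene Text Removal (ISTR), which detects whether an image has undergone text removal and attempts to recover the removed text content.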

Details Motivation: As scene text removal (STR) has advanced, the risk of its misuse has grown, so it is necessary to study how to detect STR-processed images and localize the removed text regions in order to prevent misuse. Method: The paper takes an experimental approach, using neural network models to analyze STR-processed images and attempting to train a text recognizer to recover the removed text content. Result: Experiments show that ISTR achieves high accuracy on both binary classification (detecting whether an image has undergone STR) and localization of removed text regions, and that the removed text content can be recovered to some extent. Conclusion: The paper studies Inverse Scene Text Removal (ISTR), demonstrating its effectiveness in detecting STR-processed images and localizing removed text regions, and also attempts to recover the removed text content by training a text recognizer. Abstract: Scene text removal (STR) aims to erase textual elements from images. It was originally intended for removing privacy-sensitive or undesired texts from natural scene images, but is now also applied to typographic images. STR typically detects text regions and then inpaints them. Although STR has advanced through neural networks and synthetic data, misuse risks have increased. This paper investigates Inverse STR (ISTR), which analyzes STR-processed images and focuses on binary classification (detecting whether an image has undergone STR) and localizing removed text regions. We demonstrate in experiments that these tasks are achievable with high accuracies, enabling detection of potential misuse and improving STR. We also attempt to recover the removed text content by training a text recognizer to understand its difficulty.

[69] VisionGuard: Synergistic Framework for Helmet Violation Detection

Lam-Huy Nguyen,Thinh-Phuc Nguyen,Thanh-Hai Nguyen,Gia-Huy Dinh,Minh-Triet Tran,Trung-Nghia Le

Main category: cs.CV

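TL;DR: This paper proposes VisionGuard, a new framework that addresses data imbalance and inconsistent annotations in automatic motorcycle helmet violation detection, significantly improving detection accuracy and practical applicability.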

Details Motivation: Because of environmental variability, camera angles, and data inconsistencies, existing frame-wise detectors face major challenges in automatically identifying whether motorcyclists are wearing helmets, which undermines detection reliability and classification consistency. A more effective method is therefore needed to improve detection accuracy and support traffic management systems. Method: VisionGuard integrates two key components: the Adaptive Labeling and Contextual Expander modules. Adaptive Labeling uses a tracking algorithm to assign persistent labels across video frames and correct misclassifications, while Contextual Expander improves recall for minority classes by generating virtual bounding boxes with corresponding confidence scores. Result: Experiments show that VisionGuard improves overall mAP by 3.1% over baseline detectors, demonstrating its effectiveness and deployment potential in real-world traffic surveillance systems. Conclusion: VisionGuard is a multi-stage framework that, by integrating the Adaptive Labeling and Contextual Expander modules, effectively addresses the challenges of data imbalance and inconsistent annotations and improves helmet violation detection in traffic surveillance systems. Abstract: Enforcing helmet regulations among motorcyclists is essential for enhancing road safety and ensuring the effectiveness of traffic management systems. However, automatic detection of helmet violations faces significant challenges due to environmental variability, camera angles, and inconsistencies in the data. These factors hinder reliable detection of motorcycles and riders and disrupt consistent object classification. To address these challenges, we propose VisionGuard, a synergistic multi-stage framework designed to overcome the limitations of frame-wise detectors, especially in scenarios with class imbalance and inconsistent annotations. VisionGuard integrates two key components: Adaptive Labeling and Contextual Expander modules. The Adaptive Labeling module is a tracking-based refinement technique that enhances classification consistency by leveraging a tracking algorithm to assign persistent labels across frames and correct misclassifications. The Contextual Expander module improves recall for underrepresented classes by generating virtual bounding boxes with appropriate confidence scores, effectively addressing the impact of data imbalance. Experimental results show that VisionGuard improves overall mAP by 3.1% compared to baseline detectors, demonstrating its effectiveness and potential for real-world deployment in traffic surveillance systems, ultimately promoting safety and regulatory compliance.
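
One simple way to picture tracking-based label refinement is a per-track majority vote: every detection that shares a track ID receives the label seen most often along that track. The abstract does not state which rule the Adaptive Labeling module actually uses, so the majority vote, the function `refine_labels`, and the `(frame, track_id, label)` tuple layout below are assumptions made purely for illustration.

```python
from collections import Counter, defaultdict


def refine_labels(detections):
    """Relabel detections so each track carries one persistent label.

    detections: list of (frame, track_id, label) tuples, where track_id is
    assumed to come from some external tracking algorithm.
    """
    # Tally the labels observed along each track.
    votes = defaultdict(Counter)
    for _, track_id, label in detections:
        votes[track_id][label] += 1
    # Pick the most frequent label per track (assumed rule: majority vote).
    winner = {tid: counts.most_common(1)[0][0] for tid, counts in votes.items()}
    # Overwrite per-frame labels with the track-level label.
    return [(frame, tid, winner[tid]) for frame, tid, _ in detections]


# Toy example: track 1 is misclassified once in frame 1.
detections = [
    (0, 1, "helmet"),
    (1, 1, "no_helmet"),
    (2, 1, "helmet"),
    (0, 2, "no_helmet"),
]
print(refine_labels(detections))
```

A vote of this kind corrects isolated frame-level errors but relies on the tracker keeping identities stable, so identity switches would propagate wrong labels.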

[70] Detection of Breast Cancer Lumpectomy Margin with SAM-incorporated Forward-Forward Contrastive Learning

Tyler Ward,Xiaoqin Wang,Braxton McFarland,Md Atik Ahamed,Sahar Nozad,Talal Arshad,Hafsa Nebbache,Jin Chen,Abdullah Imran

Main category: cs.CV

TL;DR: This paper proposes FFCL-SAM, a deep learning framework combining SAM and FFCL for improved intraoperative margin assessment in breast cancer surgery, achieving higher accuracy and faster processing.

Details Motivation: 2D specimen radiography (SR), the current method for assessing intraoperative specimen margin status, has limited accuracy, leading to nearly 25% of patients requiring additional surgery. This study aims to improve margin assessment during lumpectomy to reduce breast cancer recurrence. Method: A deep learning framework combining the Segment Anything Model (SAM) with Forward-Forward Contrastive Learning (FFCL) was developed. It uses a ResNet-18 backbone pre-trained with FFCL for margin status classification, followed by refined tumor margin segmentation using SAM prompted by reconstructed binary masks. Result: The approach achieved an AUC of 0.8455 for margin classification, improved Dice similarity by 27.4% over baseline models, and reduced inference time to 47 milliseconds per image. Conclusion: The proposed FFCL-SAM framework significantly improves the accuracy and speed of intraoperative margin assessment, showing potential to reduce re-excision rates and enhance surgical outcomes in breast cancer treatment. Abstract: Complete removal of cancer tumors with a negative specimen margin during lumpectomy is essential in reducing breast cancer recurrence. However, 2D specimen radiography (SR), the current method used to assess intraoperative specimen margin status, has limited accuracy, resulting in nearly a quarter of patients requiring additional surgery. To address this, we propose a novel deep learning framework combining the Segment Anything Model (SAM) with Forward-Forward Contrastive Learning (FFCL), a pre-training strategy leveraging both local and global contrastive learning for patch-level classification of SR images. After annotating SR images with regions of known malignancy, non-malignant tissue, and pathology-confirmed margins, we pre-train a ResNet-18 backbone with FFCL to classify margin status, then reconstruct coarse binary masks to prompt SAM for refined tumor margin segmentation. 
Our approach achieved an AUC of 0.8455 for margin classification and segmented margins with a 27.4% improvement in Dice similarity over baseline models, while reducing inference time to 47 milliseconds per image. These results demonstrate that FFCL-SAM significantly enhances both the speed and accuracy of intraoperative margin assessment, with strong potential to reduce re-excision rates and improve surgical outcomes in breast cancer treatment. Our code is available at https://github.com/tbwa233/FFCL-SAM/.

[71] The Aging Multiverse: Generating Condition-Aware Facial Aging Tree via Training-Free Diffusion

Bang Gong,Luchao Qi,Jiaye Wu,Zhicheng Fu,Chunbo Song,David W. Jacobs,John Nicholson,Roni Sengupta

Main category: cs.CV

TL;DR: The paper proposes the Aging Multiverse, a novel framework for generating multiple realistic facial aging paths from a single image while considering external conditions, offering superior performance and broader applications.

Details Motivation: Prior methods modeled aging as a single deterministic path. This work aims to create a more diverse and realistic representation of potential aging outcomes by incorporating external factors such as environment, health, and lifestyle. Method: A training-free diffusion-based method was proposed, incorporating attention mixing to modulate editing strength and Simulated Aging Regularization to stabilize edits, generating multiple plausible facial aging trajectories conditioned on external factors. Result: Extensive experiments and user studies showed state-of-the-art performance across identity preservation, aging realism, and conditional alignment, outperforming existing models that often fail to meet one or more criteria. Conclusion: The Aging Multiverse framework successfully transforms facial aging into a multi-dimensional, controllable, and interpretable process, opening new possibilities in digital storytelling, health education, and personalized visualization. Abstract: We introduce the Aging Multiverse, a framework for generating multiple plausible facial aging trajectories from a single image, each conditioned on external factors such as environment, health, and lifestyle. Unlike prior methods that model aging as a single deterministic path, our approach creates an aging tree that visualizes diverse futures. To enable this, we propose a training-free diffusion-based method that balances identity preservation, age accuracy, and condition control. Our key contributions include attention mixing to modulate editing strength and a Simulated Aging Regularization strategy to stabilize edits. Extensive experiments and user studies demonstrate state-of-the-art performance across identity preservation, aging realism, and conditional alignment, outperforming existing editing and age-progression models, which often fail to account for one or more of the editing criteria. 
By transforming aging into a multi-dimensional, controllable, and interpretable process, our approach opens up new creative and practical avenues in digital storytelling, health education, and personalized visualization.

[72] User-in-the-Loop View Sampling with Error Peaking Visualization

Ayaka Yasunaga,Hideo Saito,Shohei Mori

Main category: cs.CV

TL;DR: This paper introduces a novel approach for novel view synthesis that eliminates the need for 3D annotations by using locally reconstructed light fields and error visualization, resulting in improved flexibility and efficiency.

Details Motivation: To overcome the limitations of existing approaches that require mentally demanding tasks and restrict capture areas due to predefined sampling theories. Method: Utilizing locally reconstructed light fields and visualizing errors to guide the insertion of new views. Result: The error-peaking visualization technique is less invasive, minimizes disappointment in final results, and successfully integrates with recent radiance field reconstruction methods for larger scenes. Conclusion: The proposed method effectively reduces the need for 3D annotations and allows for more flexible scene exploration, achieving satisfactory results with fewer view samples. Abstract: Augmented reality (AR) provides ways to visualize missing view samples for novel view synthesis. Existing approaches present 3D annotations for new view samples and task users with taking images by aligning the AR display. This data collection task is known to be mentally demanding and limits capture areas to pre-defined small areas due to the ideal but restrictive underlying sampling theory. To free users from 3D annotations and limited scene exploration, we propose using locally reconstructed light fields and visualizing errors to be removed by inserting new views. Our results show that the error-peaking visualization is less invasive, reduces disappointment in final results, and is satisfactory with fewer view samples in our mobile view synthesis system. We also show that our approach can contribute to recent radiance field reconstruction for larger scenes, such as 3D Gaussian splatting.

[73] Bridging Video Quality Scoring and Justification via Large Multimodal Models

Qizhi Xie,Kun Yuan,Yunpeng Qu,Jiachao Gong,Mingda Wu,Ming Sun,Chao Zhou,Jihong Zhu

Main category: cs.CV

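TL;DR: This paper studies video quality assessment (VQA), proposes a new Score-based Instruction Generation (SIG) pipeline, and builds the large-scale S2I dataset and the S2I-Bench benchmark, significantly improving the quality scoring and justification abilities of video large multimodal models (LMMs).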

Details Motivation: Classical video quality assessment (VQA) methods provide only a numerical score and cannot describe a video's complex quality dimensions; adapting video large multimodal models (LMMs) to VQA via instruction tuning has the potential to address this issue. Method: A Score-based Instruction Generation (SIG) pipeline is proposed, along with a progressive tuning strategy designed to fully exploit the S2I dataset. Result: The Score2Instruct (S2I) dataset, containing over 320K diverse instruction-response pairs, was developed, and a new benchmark, S2I-Bench, was created to better evaluate the quality justification capacity of video LMMs. Conclusion: Experimental results show that the method consistently improves quality scoring and justification capabilities across multiple video LMMs. Abstract: Classical video quality assessment (VQA) methods generate a numerical score to judge a video's perceived visual fidelity and clarity. Yet, a score fails to describe the video's complex quality dimensions, restricting its applicability. Benefiting from the linguistic output, adapting video large multimodal models (LMMs) to VQA via instruction tuning has the potential to address this issue. The core of the approach lies in the video quality-centric instruction data. Previous explorations mainly focus on the image domain, and their data generation processes heavily rely on human quality annotations and proprietary systems, limiting data scalability and effectiveness. To address these challenges, we propose the Score-based Instruction Generation (SIG) pipeline. Specifically, SIG first scores multiple quality dimensions of an unlabeled video and maps scores to text-defined levels. It then explicitly incorporates a hierarchical Chain-of-Thought (CoT) to model the correlation between specific dimensions and overall quality, mimicking the human visual system's reasoning process. The automated pipeline eliminates the reliance on expert-written quality descriptions and proprietary systems, ensuring data scalability and generation efficiency. To this end, the resulting Score2Instruct (S2I) dataset contains over 320K diverse instruction-response pairs, laying the basis for instruction tuning. Moreover, to advance video LMMs' quality scoring and justification abilities simultaneously, we devise a progressive tuning strategy to fully unleash the power of S2I. 
Built upon SIG, we further curate a benchmark termed S2I-Bench with 400 open-ended questions to better evaluate the quality justification capacity of video LMMs. Experimental results on the S2I-Bench and existing benchmarks indicate that our method consistently improves quality scoring and justification capabilities across multiple video LMMs.
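The score-to-level step mentioned in the abstract above can be pictured with a minimal sketch. The level names, the [0, 1] score range, the equal-width bins, and the dimension names used in the test are all assumptions made for illustration; the abstract does not specify how SIG defines its levels.

```python
# Hypothetical level names and equal-width bins; not taken from the paper.
LEVELS = ["bad", "poor", "fair", "good", "excellent"]


def score_to_level(score, low=0.0, high=1.0):
    """Map a numeric quality score in [low, high] to a text-defined level."""
    if not low <= score <= high:
        raise ValueError(f"score {score} outside [{low}, {high}]")
    width = (high - low) / len(LEVELS)
    # min() keeps score == high inside the last bin instead of overflowing.
    index = min(int((score - low) / width), len(LEVELS) - 1)
    return LEVELS[index]


def describe(dimension_scores):
    """Turn per-dimension scores into per-dimension text levels."""
    return {name: score_to_level(s) for name, s in dimension_scores.items()}
```

Discretizing scores this way gives a language model fixed vocabulary to condition on, which is one plausible reason to map scores to levels before generating instruction text.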

[74] FedSC: Federated Learning with Semantic-Aware Collaboration

Huan Wang,Haoran Li,Huaming Chen,Jun Yan,Jiahua Shi,Jun Shen

Main category: cs.CV

TL;DR: This paper introduces FedSC, a novel federated learning method that leverages semantic-aware collaboration to address data heterogeneity by using relational and consistent prototypes.

Details Motivation: Data heterogeneity is a major challenge in federated learning, where existing methods often neglect semantic information within clients. This work aims to explore intra-client semantic knowledge to improve collaboration across heterogeneous clients. Method: The paper proposes FedSC, which constructs relational and consistent prototypes at the semantic level to capture client-specific and class-relevant knowledge. It uses inter-contrastive learning and discrepancy aggregation for model regularization. Result: Experimental results show that FedSC outperforms existing methods in various challenging scenarios, demonstrating its effectiveness and efficiency in handling data heterogeneity. Conclusion: FedSC effectively addresses data heterogeneity in federated learning by leveraging semantic-level prototypes and provides a convergence guarantee. Abstract: Federated learning (FL) aims to train models collaboratively across clients without sharing data for privacy-preserving. However, one major challenge is the data heterogeneity issue, which refers to the biased labeling preferences at multiple clients. A number of existing FL methods attempt to tackle data heterogeneity locally (e.g., regularizing local models) or globally (e.g., fine-tuning global model), often neglecting inherent semantic information contained in each client. To explore the possibility of using intra-client semantically meaningful knowledge in handling data heterogeneity, in this paper, we propose Federated Learning with Semantic-Aware Collaboration (FedSC) to capture client-specific and class-relevant knowledge across heterogeneous clients. The core idea of FedSC is to construct relational prototypes and consistent prototypes at semantic-level, aiming to provide fruitful class underlying knowledge and stable convergence signals in a prototype-wise collaborative way. 
On the one hand, FedSC introduces an inter-contrastive learning strategy to bring instance-level embeddings closer to relational prototypes with the same semantics and away from distinct classes. On the other hand, FedSC devises consistent prototypes via a discrepancy aggregation manner, as a regularization penalty to constrain the optimization region of the local model. Moreover, a theoretical analysis for FedSC is provided to ensure a convergence guarantee. Experimental results on various challenging scenarios demonstrate the effectiveness of FedSC and the efficiency of crucial components.
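The idea of pulling an instance embedding toward the prototype of its own class and away from other classes can be sketched with a generic InfoNCE-style loss over prototypes. This is an illustrative stand-in, not FedSC's actual objective: the cosine similarity, the temperature value, and the function names are assumptions.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def prototype_contrastive_loss(embedding, label, prototypes, temperature=0.5):
    """Negative log-softmax of the similarity to the same-class prototype.

    `prototypes` maps each class label to one prototype vector (assumed layout).
    """
    logits = {c: cosine(embedding, p) / temperature for c, p in prototypes.items()}
    # Log-sum-exp with the maximum subtracted for numerical stability.
    m = max(logits.values())
    log_denominator = m + math.log(sum(math.exp(v - m) for v in logits.values()))
    return log_denominator - logits[label]
```

The loss shrinks as the embedding aligns with its own class prototype and grows as it drifts toward a different one, which matches the "closer to the same semantics, away from distinct classes" behavior the abstract describes.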

[75] HybridQ: Hybrid Classical-Quantum Generative Adversarial Network for Skin Disease Image Generation

Qingyue Jiao,Kangyu Zheng,Yiyu Shi,Zhiding Liang

Main category: cs.CV

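TL;DR: This paper proposes a generative adversarial network (GAN) that combines classical and quantum computing to generate color medical images, overcoming the limitation that existing quantum image generation methods produce only low-quality grayscale images.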

Details Motivation: Skin disease datasets often suffer from class imbalance, privacy concerns, and object bias, so data augmentation is needed. Classical generative models demand extensive computational resources and long training times, while existing quantum image generation methods yield only low-quality grayscale images. Method: Using a novel classical-quantum latent space fusion technique, the authors build the first classical-quantum GAN capable of generating color medical images. Result: The model outperforms classical deep convolutional GANs and existing hybrid classical-quantum GANs in both image generation quality and the classification performance boost when used for data augmentation. The boost is comparable to that of state-of-the-art classical generative models, with over 25 times fewer parameters and 10 times fewer training epochs. The model also performs robustly on a real IBM quantum machine despite hardware noise. Conclusion: As quantum hardware advances, quantum image generation has a promising future. Abstract: Machine learning-assisted diagnosis is gaining traction in skin disease detection, but training effective models requires large amounts of high-quality data. Skin disease datasets often suffer from class imbalance, privacy concerns, and object bias, making data augmentation essential. While classical generative models are widely used, they demand extensive computational resources and lengthy training time. Quantum computing offers a promising alternative, but existing quantum-based image generation methods can only yield grayscale low-quality images. Through a novel classical-quantum latent space fusion technique, our work overcomes this limitation and introduces the first classical-quantum generative adversarial network (GAN) capable of generating color medical images. Our model outperforms classical deep convolutional GANs and existing hybrid classical-quantum GANs in both image generation quality and classification performance boost when used as data augmentation. Moreover, the performance boost is comparable with that achieved using state-of-the-art classical generative models, yet with over 25 times fewer parameters and 10 times fewer training epochs. Such results suggest a promising future for quantum image generation as quantum hardware advances. Finally, we demonstrate the robust performance of our model on a real IBM quantum machine with hardware noise.

[76] Multimodal Prompt Alignment for Facial Expression Recognition

Fuyan Ma,Yiran He,Bin Sun,Shutao Li

Main category: cs.CV

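TL;DR: This paper proposes a multimodal prompt alignment framework for facial expression recognition (MPA-FER) that combines detailed LLM-generated descriptions, prototype-guided feature alignment, and a cross-modal alignment module, improving performance while keeping computational cost low.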

Details Motivation: Current VLM-based facial expression recognition methods struggle to capture the fine-grained textual-visual relationships needed to distinguish subtle differences between expressions, so a more precise and interpretable approach is needed. Method: The authors propose a multi-granularity hard prompt generation strategy that uses a large language model to generate detailed facial expression descriptions, introduce prototype-guided visual feature alignment, and design a cross-modal global-local alignment module. Result: MPA-FER outperforms state-of-the-art methods on three FER benchmark datasets while retaining the benefits of the pretrained model and minimizing computational cost. Conclusion: Through its multimodal prompt alignment framework, MPA-FER improves the ability of CLIP-based FER methods to capture fine-grained textual-visual relationships. Abstract: Prompt learning has been widely adopted to efficiently adapt vision-language models (VLMs) like CLIP for various downstream tasks. Despite their success, current VLM-based facial expression recognition (FER) methods struggle to capture fine-grained textual-visual relationships, which are essential for distinguishing subtle differences between facial expressions. To address this challenge, we propose a multimodal prompt alignment framework for FER, called MPA-FER, that provides fine-grained semantic guidance to the learning process of prompted visual features, resulting in more precise and interpretable representations. Specifically, we introduce a multi-granularity hard prompt generation strategy that utilizes a large language model (LLM) like ChatGPT to generate detailed descriptions for each facial expression. The LLM-based external knowledge is injected into the soft prompts by minimizing the feature discrepancy between the soft prompts and the hard prompts. To preserve the generalization abilities of the pretrained CLIP model, our approach incorporates prototype-guided visual feature alignment, ensuring that the prompted visual features from the frozen image encoder align closely with class-specific prototypes. Additionally, we propose a cross-modal global-local alignment module that focuses on expression-relevant facial features, further improving the alignment between textual and visual features. Extensive experiments demonstrate our framework outperforms state-of-the-art methods on three FER benchmark datasets, while retaining the benefits of the pretrained model and minimizing computational costs.

[77] LASFNet: A Lightweight Attention-Guided Self-Modulation Feature Fusion Network for Multimodal Object Detection

Lei Hao,Lina Xu,Chang Liu,Yanni Dong

Main category: cs.CV

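TL;DR: This paper proposes LASFNet, an efficient multimodal object detection method that significantly reduces computational overhead while maintaining high accuracy.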

Details Motivation: Conventional multimodal object detection methods typically require complex training processes and substantial computational resources, so a more efficient feature fusion scheme is needed. Method: The authors propose a new fusion detection baseline and design LASFNet, which contains an ASFF module and an FATM module for efficient feature fusion. Result: On three representative datasets, LASFNet reduces parameters by up to 90% and computational cost by up to 85% compared with state-of-the-art methods, while improving detection accuracy (mAP) by 1%-3%. Conclusion: By introducing an attention-guided self-modulation feature fusion module and a lightweight feature attention transformation module, LASFNet achieves efficient and accurate feature fusion for multimodal object detection. Abstract: Effective deep feature extraction via feature-level fusion is crucial for multimodal object detection. However, previous studies often involve complex training processes that integrate modality-specific features by stacking multiple feature-level fusion units, leading to significant computational overhead. To address this issue, we propose a new fusion detection baseline that uses a single feature-level fusion unit to enable high-performance detection, thereby simplifying the training process. Based on this approach, we propose a lightweight attention-guided self-modulation feature fusion network (LASFNet), which introduces a novel attention-guided self-modulation feature fusion (ASFF) module that adaptively adjusts the responses of fusion features at both global and local levels based on attention information from different modalities, thereby promoting comprehensive and enriched feature generation. Additionally, a lightweight feature attention transformation module (FATM) is designed at the neck of LASFNet to enhance the focus on fused features and minimize information loss. Extensive experiments on three representative datasets demonstrate that, compared to state-of-the-art methods, our approach achieves a favorable efficiency-accuracy trade-off, reducing the number of parameters and computational cost by as much as 90% and 85%, respectively, while improving detection accuracy (mAP) by 1%-3%. The code will be open-sourced at https://github.com/leileilei2000/LASFNet.

[78] Instella-T2I: Pushing the Limits of 1D Discrete Latent Space Image Generation

Ze Wang,Hao Chen,Benran Hu,Jiang Liu,Ximeng Sun,Jialian Wu,Yusheng Su,Xiaodong Yu,Emad Barsoum,Zicheng Liu

Main category: cs.CV

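TL;DR: This paper proposes a new 1D binary image latent representation that reduces the computational demands of modeling high-resolution images and enables efficient training and inference for text-to-image generation.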

Details Motivation: To reduce the computational demands of high-resolution image modeling and improve the efficiency of image and multimodal understanding and generation, a more compact discrete image representation is needed. Method: The authors introduce 1D binary image latents, representing each image as a sequence of binary vectors rather than traditional one-hot codebook tokens, which preserves high-resolution detail while staying compact. Result: With only 128 discrete tokens, the method achieves competitive performance in both diffusion and auto-regressive generation, up to a 32-fold reduction in token count compared with standard VQ-VAEs. It also markedly improves training and inference speed: training runs with a global batch size of 4096 on a single GPU node and completes within 200 GPU days. Conclusion: The proposed 1D binary latent space, coupled with simple model architectures, offers a scalable and efficient alternative to conventional tokenization, achieving performance comparable to modern image generation models without in-house private training data or post-training refinements. Abstract: Image tokenization plays a critical role in reducing the computational demands of modeling high-resolution images, significantly improving the efficiency of image and multimodal understanding and generation. Recent advances in 1D latent spaces have reduced the number of tokens required by eliminating the need for a 2D grid structure. In this paper, we further advance compact discrete image representation by introducing 1D binary image latents. By representing each image as a sequence of binary vectors, rather than using traditional one-hot codebook tokens, our approach preserves high-resolution details while maintaining the compactness of 1D latents. To the best of our knowledge, our text-to-image models are the first to achieve competitive performance in both diffusion and auto-regressive generation using just 128 discrete tokens for images up to 1024x1024, demonstrating up to a 32-fold reduction in token numbers compared to standard VQ-VAEs. The proposed 1D binary latent space, coupled with simple model architectures, achieves marked improvements in training and inference speed. Our text-to-image models allow for a global batch size of 4096 on a single GPU node with 8 AMD MI300X GPUs, and the training can be completed within 200 GPU days. Our models achieve competitive performance compared to modern image generation models without any in-house private training data or post-training refinements, offering a scalable and efficient alternative to conventional tokenization methods.
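The token-count claim above can be checked with back-of-the-envelope arithmetic. The sketch assumes the "standard VQ-VAE" baseline uses 16x spatial downsampling on a 1024x1024 image; that factor, the 16384-entry codebook, and the 64-bit binary token are hypothetical values chosen for illustration and are not stated in the abstract.

```python
import math


def grid_token_count(height, width, downsample):
    """Tokens produced by a 2D-grid tokenizer with the given downsampling."""
    return (height // downsample) * (width // downsample)


def codebook_bits(codebook_size):
    """Bits of information carried by one one-hot codebook token."""
    return math.log2(codebook_size)


# Assumed baseline: 16x spatial downsampling on a 1024x1024 image.
grid_tokens = grid_token_count(1024, 1024, downsample=16)
# 128 is the token count quoted in the abstract.
reduction = grid_tokens / 128

# Hypothetical sizes: a 16384-entry codebook versus a 64-bit binary token.
bits_one_hot = codebook_bits(16384)
bits_binary = 64  # one bit per binary dimension
```

Under these assumptions the baseline needs 4096 tokens, so 128 tokens is a 32-fold reduction, consistent with the "up to 32-fold" figure. The bit comparison suggests why binary vectors can stay compact: each token can carry more information than a single codebook index.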

[79] DidSee: Diffusion-Based Depth Completion for Material-Agnostic Robotic Perception and Manipulation

Wenzhou Lyu,Jialing Lin,Wenqi Ren,Ruihao Xia,Feng Qian,Yang Tang

Main category: cs.CV

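TL;DR: DidSee is a diffusion-based depth completion framework that addresses signal leakage bias and error accumulation and adds a semantic enhancer, achieving strong depth completion on non-Lambertian objects and performing well on multiple benchmarks and in real-world applications.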

Details Motivation: Commercial RGB-D cameras often produce noisy, incomplete depth maps for non-Lambertian objects, and traditional depth completion methods generalize poorly because of the limited diversity and scale of training data. In addition, training-inference mismatches in diffusion models and the lack of distinct visual features in non-Lambertian regions hinder precise prediction. Method: DidSee uses a diffusion framework with three key components: (1) a rescaled noise scheduler that eliminates signal leakage bias; (2) a noise-agnostic single-step training formulation that alleviates error accumulation caused by exposure bias; and (3) a semantic enhancer that enables joint depth completion and semantic segmentation. Result: DidSee achieves state-of-the-art performance on multiple benchmarks, demonstrates robust real-world generalization, and improves downstream tasks such as category-level pose estimation and robotic grasping. Conclusion: DidSee is a diffusion-based depth completion framework for non-Lambertian objects that achieves state-of-the-art performance, robust real-world generalization, and effective improvements on downstream tasks. Abstract: Commercial RGB-D cameras often produce noisy, incomplete depth maps for non-Lambertian objects. Traditional depth completion methods struggle to generalize due to the limited diversity and scale of training data. Recent advances exploit visual priors from pre-trained text-to-image diffusion models to enhance generalization in dense prediction tasks. However, we find that biases arising from training-inference mismatches in the vanilla diffusion framework significantly impair depth completion performance. Additionally, the lack of distinct visual features in non-Lambertian regions further hinders precise prediction. To address these issues, we propose DidSee, a diffusion-based framework for depth completion on non-Lambertian objects. First, we integrate a rescaled noise scheduler enforcing a zero terminal signal-to-noise ratio to eliminate signal leakage bias. Second, we devise a noise-agnostic single-step training formulation to alleviate error accumulation caused by exposure bias and optimize the model with a task-specific loss. Finally, we incorporate a semantic enhancer that enables joint depth completion and semantic segmentation, distinguishing objects from backgrounds and yielding precise, fine-grained depth maps. DidSee achieves state-of-the-art performance on multiple benchmarks, demonstrates robust real-world generalization, and effectively improves downstream tasks such as category-level pose estimation and robotic grasping. Project page: https://wenzhoulyu.github.io/DidSee/
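A zero terminal signal-to-noise ratio means the cumulative signal coefficient reaches exactly zero at the last timestep, so no trace of the clean input leaks into the final noisy sample. The sketch below shows one commonly used way to rescale an existing beta schedule to get this property (shift and scale the square root of the cumulative alphas). Whether DidSee uses exactly this rescaling is an assumption; the linear schedule and its endpoints are also illustrative defaults.

```python
import math


def rescale_zero_terminal_snr(betas):
    """Rescale a beta schedule so the final cumulative alpha is exactly zero."""
    # Square root of the cumulative product of alphas = signal coefficient.
    alphas_bar_sqrt = []
    prod = 1.0
    for b in betas:
        prod *= 1.0 - b
        alphas_bar_sqrt.append(math.sqrt(prod))
    first, last = alphas_bar_sqrt[0], alphas_bar_sqrt[-1]
    # Shift so the last value becomes zero, then scale so the first is kept.
    scaled = [(s - last) * first / (first - last) for s in alphas_bar_sqrt]
    alphas_bar = [s * s for s in scaled]
    # Recover per-step alphas from consecutive cumulative products.
    alphas = [alphas_bar[0]] + [
        alphas_bar[t] / alphas_bar[t - 1] for t in range(1, len(alphas_bar))
    ]
    return [1.0 - a for a in alphas]


def linear_betas(steps, start=1e-4, end=0.02):
    """Plain linear beta schedule (illustrative default values)."""
    return [start + (end - start) * t / (steps - 1) for t in range(steps)]
```

With an unmodified linear schedule the final cumulative alpha is small but non-zero, which is the leakage the abstract refers to; after rescaling, the last beta equals 1 and the terminal sample is pure noise.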

[80] Boosting Domain Generalized and Adaptive Detection with Diffusion Models: Fitness, Generalization, and Transferability

Boyong He,Yuxiang Ji,Zhuoyue Tan,Liaoni Wu

Main category: cs.CV

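TL;DR: This paper proposes an efficient way to apply diffusion models to the performance drop in cross-domain detection, optimizing feature extraction and alignment strategies and performing well on multiple benchmarks.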

Details Motivation: Detectors often suffer performance drops because of the domain gap between training and testing data. Recent methods apply diffusion models to domain generalization and adaptation tasks, but they still incur large inference costs and do not fully exploit the capabilities of diffusion models. Method: (1) Extract intermediate features from a single-step diffusion process and improve feature collection and fusion, reducing inference time and improving source-domain performance; (2) build an object-centered auxiliary branch using box-masked images and class prompts to extract robust, domain-invariant features; (3) apply a consistency loss to align the auxiliary and ordinary branches, balancing fitness and generalization; (4) within a unified framework, guide standard detectors through feature-level and object-level alignment to improve cross-domain detection. Result: (1) Inference time is reduced by 75% while source-domain performance improves; (2) the method achieves competitive results on three DA benchmarks and five DG benchmarks; (3) experiments on the COCO generalization benchmark show significant advantages and efficiency under large domain shifts and in low-data scenarios. Conclusion: The paper proposes a new approach to domain generalization (DG) and domain adaptation (DA) that uses intermediate diffusion features and an object-centered auxiliary branch to improve cross-domain detection, achieving competitive results on multiple benchmarks. Abstract: Detectors often suffer from a performance drop due to the domain gap between training and testing data. Recent methods explore diffusion models applied to domain generalization (DG) and adaptation (DA) tasks, but still struggle with large inference costs and have not yet fully leveraged the capabilities of diffusion models. We propose to tackle these problems by extracting intermediate features from a single-step diffusion process, improving feature collection and fusion to reduce inference time by 75% while enhancing performance on source domains (i.e., Fitness). Then, we construct an object-centered auxiliary branch by applying box-masked images with class prompts to extract robust and domain-invariant features that focus on objects. We also apply consistency loss to align the auxiliary and ordinary branch, balancing fitness and generalization while preventing overfitting and improving performance on target domains (i.e., Generalization). Furthermore, within a unified framework, standard detectors are guided by diffusion detectors through feature-level and object-level alignment on source domains (for DG) and unlabeled target domains (for DA), thereby improving cross-domain detection performance (i.e., Transferability). Our method achieves competitive results on 3 DA benchmarks and 5 DG benchmarks. 
Additionally, experiments on the COCO generalization benchmark demonstrate that our method maintains significant advantages and shows remarkable efficiency in large domain shifts and low-data scenarios. Our work shows the superiority of applying diffusion models to domain generalized and adaptive detection tasks and offers valuable insights for visual perception tasks across diverse domains. The code is available at https://github.com/heboyong/Fitness-Generalization-Transferability.

[81] HumanOmniV2: From Understanding to Omni-Modal Reasoning with Context

Qize Yang,Shimin Yao,Weixuan Chen,Shenghao Fu,Detao Bai,Jiaxing Zhao,Boyuan Sun,Bowen Yin,Xihan Wei,Jingren Zhou

Main category: cs.CV

TL;DR: This paper addresses challenges in multimodal reasoning by proposing a reinforcement learning-based method and a new benchmark, IntentBench, achieving superior performance in understanding complex human intentions.

Details Motivation: The study aims to address two critical issues in existing multimodal reasoning models—inadequate global context understanding and shortcut problems—to improve the interpretation of complex human intentions and emotions. Method: The paper introduces a reinforcement learning approach with context, format, accuracy, and logical rewards to enhance the global context understanding of multimodal models. It also proposes a new benchmark, IntentBench, for evaluation. Result: The proposed method outperforms other open-source omni-modal models on multiple benchmarks, showcasing enhanced reasoning capabilities through effective integration of multimodal information. Conclusion: The proposed method demonstrates advanced performance across multiple omni-modal benchmarks compared to other open-source models, indicating its effectiveness in enhancing multimodal reasoning capabilities. Abstract: With the rapid evolution of multimodal large language models, the capacity to deeply understand and interpret human intentions has emerged as a critical capability, which demands detailed and thoughtful reasoning. In recent studies, Reinforcement Learning (RL) has demonstrated potential in enhancing the reasoning capabilities of Large Language Models (LLMs). Nonetheless, the challenges associated with adapting RL to multimodal data and formats remain largely unaddressed. In this paper, we identify two issues in existing multimodal reasoning models: insufficient global context understanding and shortcut problems. Insufficient context understanding can happen when a model misinterprets multimodal context, resulting in incorrect answers. The shortcut problem occurs when the model overlooks crucial clues in multimodal inputs, directly addressing the query without considering the multimodal information. To tackle these issues, we emphasize the necessity for the model to reason with a clear understanding of the global context within multimodal inputs. 
This global context understanding can effectively prevent the model from overlooking key multimodal cues and ensure a thorough reasoning process. To ensure the accurate interpretation of multimodal context information, we implement a context reward judged by a large language model, alongside format and accuracy rewards. Additionally, to improve complex reasoning capability, we employ the LLM to assess the logical reward, determining whether the reasoning process successfully integrates multimodal information with logical methods. We also introduce a reasoning omni-modal benchmark, IntentBench, aimed at evaluating models in understanding complex human intentions and emotions. Our proposed method demonstrates advanced performance across multiple omni-modal benchmarks compared to other open-source omni-modal models.

[82] Improving Diffusion-Based Image Editing Faithfulness via Guidance and Scheduling

Hansam Cho,Seoung Bum Kim

Main category: cs.CV

TL;DR: This paper proposes Faithfulness Guidance and Scheduling (FGS) to improve faithfulness without compromising editability in image editing using text-guided diffusion models.

Details Motivation: The challenge of balancing editability and faithfulness in text-guided diffusion models for image editing. Method: Faithfulness Guidance and Scheduling (FGS) that includes faithfulness guidance and a scheduling strategy. Result: Experimental results show FGS enhances faithfulness with minimal impact on editability and works well with various editing methods. Conclusion: FGS is an effective method for achieving superior faithfulness while maintaining editability in image editing with compatibility across diverse tasks. Abstract: Text-guided diffusion models have become essential for high-quality image synthesis, enabling dynamic image editing. In image editing, two crucial aspects are editability, which determines the extent of modification, and faithfulness, which reflects how well unaltered elements are preserved. However, achieving optimal results is challenging because of the inherent trade-off between editability and faithfulness. To address this, we propose Faithfulness Guidance and Scheduling (FGS), which enhances faithfulness with minimal impact on editability. FGS incorporates faithfulness guidance to strengthen the preservation of input image information and introduces a scheduling strategy to resolve misalignment between editability and faithfulness. Experimental results demonstrate that FGS achieves superior faithfulness while maintaining editability. Moreover, its compatibility with various editing methods enables precise, high-quality image edits across diverse tasks.

[83] Logios : An open source Greek Polytonic Optical Character Recognition system

Perifanos Konstantinos,Goutsos Dionisis

Main category: cs.CV

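TL;DR: This paper presents a deep learning-based OCR system for Greek polytonic text that significantly improves accuracy and efficiency over traditional methods and is released as open source for academic use.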

Details Motivation: Traditional OCR methods are limited in accuracy and efficiency on Greek polytonic text, so a more effective approach is needed. Method: The system combines convolutional layers for feature extraction with recurrent layers for sequence learning to build an OCR system specialized for Greek polytonic text. Result: The system addresses the unique challenges posed by Greek polytonic scripts and delivers significant improvements in accuracy and efficiency. Conclusion: The proposed OCR system outperforms traditional OCR methods on Greek polytonic text, and the model is released as an open-source library available for academic use. Abstract: In this paper, we present an Optical Character Recognition (OCR) system specifically designed for the accurate recognition and digitization of Greek polytonic texts. By leveraging the combined strengths of convolutional layers for feature extraction and recurrent layers for sequence learning, our system addresses the unique challenges posed by Greek polytonic scripts. This approach aims to overcome the limitations of traditional OCR methods, offering significant improvements in accuracy and efficiency. We release the underlying model as an open-source library and make our OCR platform available for academic use.

[84] Boosting Generative Adversarial Transferability with Self-supervised Vision Transformer Features

Shangbo Wu,Yu-an Tan,Ruinan Ma,Wencong Ma,Dehua Zhu,Yuanzhang Li

Main category: cs.CV

TL;DR: This paper introduces dSVA, an adversarial attack leveraging dual self-supervised Vision Transformer features for improved cross-architecture black-box transferability.

Details Motivation: The research is motivated by the limitations of previous adversarial attacks relying on hard labels and supervised features, aiming to improve black-box transferability through self-supervised learning's synergy with Vision Transformers. Method: The authors proposed dSVA, a generative attack framework utilizing both global structural features (from contrastive learning) and local textural features (from masked image modeling) in self-supervised ViTs to craft adversarial examples. Result: The dSVA method achieved remarkable black-box transferability, outperforming state-of-the-art approaches against various model architectures. Conclusion: The study concludes that leveraging dual self-supervised Vision Transformer features significantly improves adversarial transferability across different model architectures. Abstract: The ability of deep neural networks (DNNs) comes from extracting and interpreting features from the data provided. By exploiting intermediate features in DNNs instead of relying on hard labels, we craft adversarial perturbations that generalize more effectively, boosting black-box transferability. These features ubiquitously come from supervised learning in previous work. Inspired by the exceptional synergy between self-supervised learning and the Transformer architecture, this paper explores whether exploiting self-supervised Vision Transformer (ViT) representations can improve adversarial transferability. We present dSVA -- a generative dual self-supervised ViT features attack that exploits both global structural features from contrastive learning (CL) and local textural features from masked image modeling (MIM), the self-supervised learning paradigm duo for ViTs. We design a novel generative training framework that incorporates a generator to create black-box adversarial examples, and strategies to train the generator by exploiting joint features and the attention mechanism of self-supervised ViTs. 
Our findings show that CL and MIM enable ViTs to attend to distinct feature tendencies, which, when exploited in tandem, yield strong adversarial generalizability. By disrupting dual deep features distilled by self-supervised ViTs, we are rewarded with remarkable black-box transferability to models of various architectures that outperforms state-of-the-art methods. Code is available at https://github.com/spencerwooo/dSVA.

[85] HalluSegBench: Counterfactual Visual Reasoning for Segmentation Hallucination Evaluation

Xinzhuo Li,Adheesh Juvekar,Xingyou Liu,Muntasir Wahed,Kiet A. Nguyen,Ismini Lourentzou

Main category: cs.CV

TL;DR: This paper introduces HalluSegBench, a new benchmark for evaluating hallucinations in vision-language segmentation models, showing that vision-driven errors are more prevalent than label-based ones.

Details Motivation: To address the limitations of existing evaluation protocols that focus only on label or textual hallucinations, which do not manipulate visual context, thus limiting their ability to diagnose critical failures in grounded visual understanding. Method: The researchers introduced HalluSegBench, a benchmark designed to evaluate hallucinations in visual grounding through counterfactual visual reasoning. It includes a dataset of 1340 counterfactual instance pairs and new metrics to assess hallucination sensitivity. Result: Experiments revealed that state-of-the-art vision-language segmentation models exhibit significant vision-driven hallucinations, persisting even with visually coherent scene edits. Conclusion: The study concludes that vision-driven hallucinations are more common than label-driven ones in vision-language segmentation models, emphasizing the importance of counterfactual reasoning for diagnosing grounding fidelity. Abstract: Recent progress in vision-language segmentation has significantly advanced grounded visual understanding. However, these models often exhibit hallucinations by producing segmentation masks for objects not grounded in the image content or by incorrectly labeling irrelevant regions. Existing evaluation protocols for segmentation hallucination primarily focus on label or textual hallucinations without manipulating the visual context, limiting their capacity to diagnose critical failures. In response, we introduce HalluSegBench, the first benchmark specifically designed to evaluate hallucinations in visual grounding through the lens of counterfactual visual reasoning. Our benchmark consists of a novel dataset of 1340 counterfactual instance pairs spanning 281 unique object classes, and a set of newly introduced metrics that quantify hallucination sensitivity under visually coherent scene edits. 
Experiments on HalluSegBench with state-of-the-art vision-language segmentation models reveal that vision-driven hallucinations are significantly more prevalent than label-driven ones, with models often persisting in false segmentation, highlighting the need for counterfactual reasoning to diagnose grounding fidelity.

[86] Class-Agnostic Region-of-Interest Matching in Document Images

Demin Zhang,Jiahao Lyu,Zhijie Shen,Yu Zhou

Main category: cs.CV

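TL;DR: This paper proposes a new document understanding task, class-agnostic region-of-interest matching (RoI-Matching), and designs the RoI-Matcher framework and the RoI-Matching-Bench benchmark, validating their flexibility and effectiveness.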

Details Motivation: Existing document analysis solutions, such as document layout analysis and key information extraction, only suit fixed category definitions and granularities and cannot support flexible, user-customized applications. The paper therefore defines a new task: class-agnostic region-of-interest matching (RoI-Matching). Method: The paper proposes RoI-Matcher, a framework that uses a siamese network to extract multi-level features in the reference and target domains and cross-attention layers to integrate and align similar semantics across domains. Result: The method is effective on the newly built RoI-Matching-Bench with a simple procedure and can serve as a basis for future research. Conclusion: The paper proposes a flexible, efficient, multi-granularity, open-set RoI matching task for documents, builds the RoI-Matching-Bench benchmark, and designs macro and micro evaluation metrics. Experiments show the method works well on RoI-Matching-Bench and provides a baseline for follow-up research. Abstract: Document understanding and analysis have received a lot of attention due to their widespread application. However, existing document analysis solutions, such as document layout analysis and key information extraction, are only suitable for fixed category definitions and granularities, and cannot achieve flexible applications customized by users. Therefore, this paper defines a new task named "Class-Agnostic Region-of-Interest Matching" ("RoI-Matching" for short), which aims to match the customized regions in a flexible, efficient, multi-granularity, and open-set manner. The visual prompt of the reference document and target document images are fed into our model, while the output is the corresponding bounding boxes in the target document images. To meet the above requirements, we construct a benchmark RoI-Matching-Bench, which sets three levels of difficulties following real-world conditions, and propose the macro and micro metrics to evaluate. Furthermore, we also propose a new framework RoI-Matcher, which employs a siamese network to extract multi-level features both in the reference and target domains, and cross-attention layers to integrate and align similar semantics in different domains. Experiments show that our method with a simple procedure is effective on RoI-Matching-Bench, and serves as the baseline for further research. The code is available at https://github.com/pd162/RoI-Matching.

[87] SAMURAI: Shape-Aware Multimodal Retrieval for 3D Object Identification

Dinh-Khoi Vo,Van-Loc Nguyen,Minh-Triet Tran,Trung-Nghia Le

Main category: cs.CV

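TL;DR: This paper proposes SAMURAI, which fuses language understanding with shape priors for robust 3D object retrieval in complex indoor environments, and validates it on the ROOMELSA dataset.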

Details Motivation: Retrieving 3D objects in complex indoor environments using only a masked 2D image and a natural language description is challenging because of limited 3D scene context, distorted viewpoints, textureless masked regions, ambiguous language prompts, and noisy segmentation masks. Method: The authors propose SAMURAI, a shape-aware multimodal retrieval method that combines CLIP-based semantic matching, shape-guided re-ranking based on binary silhouettes, and a robust majority voting strategy. A dedicated preprocessing pipeline improves mask quality. Result: The hybrid retrieval framework achieves competitive performance on the ROOMELSA private test set. Conclusion: Combining shape priors with language understanding is important for robust open-world 3D object retrieval. Abstract: Retrieving 3D objects in complex indoor environments using only a masked 2D image and a natural language description presents significant challenges. The ROOMELSA challenge limits access to full 3D scene context, complicating reasoning about object appearance, geometry, and semantics. These challenges are intensified by distorted viewpoints, textureless masked regions, ambiguous language prompts, and noisy segmentation masks. To address this, we propose SAMURAI: Shape-Aware Multimodal Retrieval for 3D Object Identification. SAMURAI integrates CLIP-based semantic matching with shape-guided re-ranking derived from binary silhouettes of masked regions, alongside a robust majority voting strategy. A dedicated preprocessing pipeline enhances mask quality by extracting the largest connected component and removing background noise. Our hybrid retrieval framework leverages both language and shape cues, achieving competitive performance on the ROOMELSA private test set. These results highlight the importance of combining shape priors with language understanding for robust open-world 3D object retrieval.
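The mask-cleaning step mentioned in the abstract above (keeping the largest connected component) can be sketched in a few lines. This is an illustration under stated assumptions rather than SAMURAI's preprocessing code: the mask is a list of rows of 0/1 values, and 4-connectivity is assumed.

```python
from collections import deque


def largest_connected_component(mask):
    """Return a binary mask keeping only the largest 4-connected region."""
    rows = len(mask)
    cols = len(mask[0]) if rows else 0
    seen = [[False] * cols for _ in range(rows)]
    best = []
    for r in range(rows):
        for c in range(cols):
            if not mask[r][c] or seen[r][c]:
                continue
            # Breadth-first search over one foreground region.
            component, queue = [], deque([(r, c)])
            seen[r][c] = True
            while queue:
                y, x = queue.popleft()
                component.append((y, x))
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    inside = 0 <= ny < rows and 0 <= nx < cols
                    if inside and mask[ny][nx] and not seen[ny][nx]:
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            if len(component) > len(best):
                best = component
    cleaned = [[0] * cols for _ in range(rows)]
    for y, x in best:
        cleaned[y][x] = 1
    return cleaned
```

Dropping every region except the largest removes isolated speckles from a noisy segmentation mask, which gives a cleaner silhouette for any shape-based comparison that follows.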

[88] PoseMaster: Generating 3D Characters in Arbitrary Poses from a Single Image

Hongyu Yan,Kunming Luo,Weiyu Li,Yixun Liang,Shengming Li,Jingwei Huang,Chunchao Guo,Ping Tan

Main category: cs.CV

TL;DR: PoseMaster is an end-to-end controllable 3D character generation framework that improves upon existing methods by enabling accurate arbitrary-pose control while delivering high-quality A-pose character generation.

Details Motivation: Existing image-based methods for 3D character modeling suffer from distorted images due to self-occlusion and viewpoint issues during pose standardization, which affects the quality of 3D reconstruction. Method: PoseMaster unifies pose transformation and 3D character generation into a flow-based 3D native generation framework, leveraging the 3D body bones in the skeleton of animatable characters as pose conditions. It also uses random emptying of pose and image conditions during training to enhance generalizability. Result: PoseMaster outperforms current state-of-the-art techniques in both qualitative and quantitative evaluations for A-pose character generation and demonstrates strong ability in achieving precise control for arbitrary poses. Conclusion: PoseMaster is an effective and generalizable framework for 3D character generation that enables precise control for arbitrary poses. Abstract: 3D characters play a crucial role in our daily entertainment. To improve the efficiency of 3D character modeling, recent image-based methods use two separate models to achieve pose standardization and 3D reconstruction of the A-pose character. However, these methods are prone to generating distorted and degraded images in the pose standardization stage due to self-occlusion and viewpoints, which further affects the geometric quality of the subsequent reconstruction process. To tackle these problems, we propose PoseMaster, an end-to-end controllable 3D character generation framework. Specifically, we unify pose transformation and 3D character generation into a flow-based 3D native generation framework. To achieve accurate arbitrary-pose control, we propose to leverage the 3D body bones existing in the skeleton of an animatable character as the pose condition. 
Furthermore, considering the specificity of multi-condition control, we randomly empty the pose condition and the image condition during training to improve the effectiveness and generalizability of pose control. Finally, we create a high-quality pose-control dataset derived from realistic character animation data to help the model learn the implicit relationships between skeleton and skinning weights. Extensive experiments show that PoseMaster outperforms current state-of-the-art techniques in both qualitative and quantitative evaluations for A-pose character generation while demonstrating its powerful ability to achieve precise control for arbitrary poses.

[89] EgoAdapt: Adaptive Multisensory Distillation and Policy Learning for Efficient Egocentric Perception

Sanjoy Chowdhury,Subrata Biswas,Sayan Nag,Tushar Nagarajan,Calvin Murdock,Ishwarya Ananthabhotla,Yijun Qian,Vamsi Krishna Ithapu,Dinesh Manocha,Ruohan Gao

Main category: cs.CV

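TL;DR: This paper proposes EgoAdapt, an efficient egocentric perception framework that combines cross-modal distillation with policy learning to reduce computational cost while maintaining high performance.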

Details Motivation: Existing perception models for multisensory egocentric tasks perform well but are computationally expensive, which makes them hard to deploy in resource-constrained environments. Method: The paper introduces EgoAdapt, an adaptive framework that combines cross-modal distillation with policy learning and can adapt to task-specific action spaces. Result: On three challenging egocentric datasets (EPIC-Kitchens, EasyCom, and Aria Everyday Activities), EgoAdapt significantly improves efficiency, reducing GMACs by up to 89.09%, parameters by up to 82.02%, and energy consumption by up to 9.6x, while matching or even surpassing the performance of existing state-of-the-art models. Conclusion: EgoAdapt is a framework that enables efficient inference across different egocentric perception tasks; its adaptive policy module reduces computational cost while preserving model performance. Abstract: Modern perception models, particularly those designed for multisensory egocentric tasks, have achieved remarkable performance but often come with substantial computational costs. These high demands pose challenges for real-world deployment, especially in resource-constrained environments. In this paper, we introduce EgoAdapt, a framework that adaptively performs cross-modal distillation and policy learning to enable efficient inference across different egocentric perception tasks, including egocentric action recognition, active speaker localization, and behavior anticipation. Our proposed policy module is adaptable to task-specific action spaces, making it broadly applicable. Experimental results on three challenging egocentric datasets (EPIC-Kitchens, EasyCom, and Aria Everyday Activities) demonstrate that our method significantly enhances efficiency, reducing GMACs by up to 89.09%, parameters by up to 82.02%, and energy by up to 9.6x, while remaining on par with, and in many cases outperforming, the corresponding state-of-the-art models.

[90] ESMStereo: Enhanced ShuffleMixer Disparity Upsampling for Real-Time and Accurate Stereo Matching

Mahmoud Tahmasebi,Saif Huq,Kevin Meehan,Marion McAfee

Main category: cs.CV

TL;DR: This paper proposes Enhanced Shuffle Mixer (ESM) for stereo matching to recover detailed scene geometry and achieve real-time performance with high accuracy.

Details Motivation: To address the challenge of achieving both real-time performance and high accuracy in stereo matching by overcoming information loss in small-scale cost volumes. Method: The Enhanced Shuffle Mixer (ESM) is proposed to mitigate information loss from small-scale cost volumes by integrating primary features into the disparity upsampling unit. Features are extracted, fused, shuffled, split, and refined through a compact feature-guided hourglass network. Result: ESMStereo achieves a real-time inference speed of 116 FPS on high-end GPUs and 91 FPS on the AGX Orin. Conclusion: The compact version of ESMStereo achieves high inference speed on high-end GPUs and AGX Orin while reconstructing highly accurate disparity maps, focusing on local contextual connectivity with a large receptive field and low computational cost. Abstract: Stereo matching has become an increasingly important component of modern autonomous systems. Developing deep learning-based stereo matching models that deliver high accuracy while operating in real-time continues to be a major challenge in computer vision. In the domain of cost-volume-based stereo matching, accurate disparity estimation depends heavily on large-scale cost volumes. However, such large volumes store substantial redundant information and also require computationally intensive aggregation units for processing and regression, making real-time performance unattainable. Conversely, small-scale cost volumes followed by lightweight aggregation units provide a promising route for real-time performance, but lack sufficient information to ensure highly accurate disparity estimation. To address this challenge, we propose the Enhanced Shuffle Mixer (ESM) to mitigate information loss associated with small-scale cost volumes. ESM restores critical details by integrating primary features into the disparity upsampling unit. It quickly extracts features from the initial disparity estimation and fuses them with image features. 
These features are mixed by shuffling and layer splitting, then refined through a compact feature-guided hourglass network to recover more detailed scene geometry. The ESM focuses on local contextual connectivity with a large receptive field and low computational cost, leading to the reconstruction of a highly accurate disparity map in real time. The compact version of ESMStereo achieves an inference speed of 116 FPS on high-end GPUs and 91 FPS on the AGX Orin.

[91] OracleFusion: Assisting the Decipherment of Oracle Bone Script with Structurally Constrained Semantic Typography

Caoshuo Li,Zengmao Ding,Xiaobin Hu,Bang Li,Donghao Luo,AndyPian Wu,Chaoyang Wang,Chengjie Wang,Taisong Jin,SevenShu,Yunsheng Wu,Yongge Liu,Rongrong Ji

Main category: cs.CV

TL;DR: This paper introduces OracleFusion, a new framework that helps in deciphering Oracle Bone Script characters by using advanced models and methods to enhance the understanding and visual representation of these ancient texts.

Details Motivation: The motivation behind this paper is the challenge posed by the undeciphered Oracle Bone Script (OBS) characters due to their complex structure and abstract imagery. Despite about 1,600 OBS characters being deciphered out of approximately 4,500 discovered, there remains a significant need for interpretation assistance. Method: This paper proposes a novel two-stage semantic typography framework named OracleFusion. In the first stage, it leverages the Multimodal Large Language Model (MLLM) with enhanced Spatial Awareness Reasoning (SAR) to analyze the glyph structure of the OBS character and perform visual localization of key components. In the second stage, Oracle Structural Vector Fusion (OSVF) is introduced, incorporating glyph structure constraints and glyph maintenance constraints to ensure the accurate generation of semantically enriched vector fonts. Result: Extensive qualitative and quantitative experiments demonstrate that OracleFusion outperforms state-of-the-art baseline models in terms of semantics, visual appeal, and glyph maintenance, significantly enhancing both readability and aesthetic quality. Conclusion: OracleFusion provides expert-like insights on unseen oracle characters, making it a valuable tool for advancing the decipherment of OBS. Abstract: As one of the earliest ancient languages, Oracle Bone Script (OBS) encapsulates the cultural records and intellectual expressions of ancient civilizations. Despite the discovery of approximately 4,500 OBS characters, only about 1,600 have been deciphered. The remaining undeciphered ones, with their complex structure and abstract imagery, pose significant challenges for interpretation. To address these challenges, this paper proposes a novel two-stage semantic typography framework, named OracleFusion. 
In the first stage, this approach leverages the Multimodal Large Language Model (MLLM) with enhanced Spatial Awareness Reasoning (SAR) to analyze the glyph structure of the OBS character and perform visual localization of key components. In the second stage, we introduce Oracle Structural Vector Fusion (OSVF), incorporating glyph structure constraints and glyph maintenance constraints to ensure the accurate generation of semantically enriched vector fonts. This approach preserves the objective integrity of the glyph structure, offering visually enhanced representations that assist experts in deciphering OBS. Extensive qualitative and quantitative experiments demonstrate that OracleFusion outperforms state-of-the-art baseline models in terms of semantics, visual appeal, and glyph maintenance, significantly enhancing both readability and aesthetic quality. Furthermore, OracleFusion provides expert-like insights on unseen oracle characters, making it a valuable tool for advancing the decipherment of OBS.

[92] Pushing Trade-Off Boundaries: Compact yet Effective Remote Sensing Change Detection

Luosheng Xu,Dalin Zhang,Zhaohui Song

Main category: cs.CV

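TL;DR: This study proposes FlickCD, an efficient remote sensing change detection method that substantially reduces computational and storage requirements while maintaining high performance.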

Details Motivation: Although deep learning has transformed change detection, the complexity and computational demands of modern models have not yielded significant accuracy gains, which motivates lightweight models that minimize resource consumption while maintaining high accuracy. Method: The paper proposes FlickCD, comprising an Enhanced Difference Module (EDM) and Local-Global Fusion Blocks. The EDM amplifies critical feature differences and suppresses irrelevant variations, while Shifted Window Self-Attention (SWSA) and Enhanced Global Self-Attention (EGSA) capture semantic information at multiple scales. Result: Experiments show that, on four benchmark datasets, FlickCD reduces computational and storage overheads by more than an order of magnitude while achieving state-of-the-art performance or incurring only a minor accuracy loss (<1% F1). Conclusion: FlickCD achieves efficient remote sensing change detection, cutting computational and storage overheads by more than an order of magnitude while maintaining state-of-the-art performance or accepting only a small accuracy trade-off. Abstract: Remote sensing change detection is essential for monitoring urban expansion, disaster assessment, and resource management, offering timely, accurate, and large-scale insights into dynamic landscape transformations. While deep learning has revolutionized change detection, the increasing complexity and computational demands of modern models have not necessarily translated into significant accuracy gains. Instead of following this trend, this study explores a more efficient approach, focusing on lightweight models that maintain high accuracy while minimizing resource consumption, which is an essential requirement for on-satellite processing. To this end, we propose FlickCD, a name meaning "quick flick, then great results," pushing the boundaries of the performance-resource trade-off. FlickCD introduces an Enhanced Difference Module (EDM) to amplify critical feature differences between temporal phases while suppressing irrelevant variations such as lighting and weather changes, thereby reducing computational costs in the subsequent change decoder. Additionally, the FlickCD decoder incorporates Local-Global Fusion Blocks, leveraging Shifted Window Self-Attention (SWSA) and Enhanced Global Self-Attention (EGSA) to efficiently capture semantic information at multiple scales, preserving both coarse- and fine-grained changes. 
Extensive experiments on four benchmark datasets demonstrate that FlickCD reduces computational and storage overheads by more than an order of magnitude while achieving state-of-the-art (SOTA) performance or incurring only a minor (<1% F1) accuracy trade-off. The implementation code is publicly available at https://github.com/xulsh8/FlickCD.

[93] IPFormer-VideoLLM: Enhancing Multi-modal Video Understanding for Multi-shot Scenes

Yujia Liang,Jile Jiao,Zhicheng Wang,Xuetao Feng,Zixuan Ye,Yuan Wang,Hao Lu

Main category: cs.CV

TL;DR: This paper introduces MultiClip-Bench and IPFormer-VideoLLM to address multi-shot video understanding challenges in VideoLLMs.

Details Motivation: Existing VideoLLMs struggle with multi-shot scenarios due to lack of appropriate annotations in datasets, leading to issues like identity forgetting and key frame negligence. Method: A new dataset (MultiClip-Bench) and a model (IPFormer-VideoLLM) that injects instance-level features through an attention-based connector to improve multi-shot video understanding. Result: The new dataset improves training effectiveness for multi-shot scenarios, while the proposed model successfully aggregates instance-specific information across scenes, enhancing performance on video understanding tasks. Conclusion: The proposed dataset and model significantly enhance multi-scene video understanding and offer advantages across various video benchmarks. Abstract: Video Large Language Models (VideoLLMs) have demonstrated remarkable understanding capabilities, but struggle with multi-shot scenarios, e.g., video clips with varying camera angles or scene changes. This challenge can lead to failures such as instance identity forgetting and key frame negligence. In this work, we first attribute the challenge to the lack of multi-shot annotations among existing datasets and therefore we introduce a new dataset termed MultiClip-Bench, featuring dense descriptions and instruction-based question-answering pairs tailored for multi-shot scenarios. We empirically find that the training set significantly boosts the multi-shot performance, while the testing benchmark provides a reliable measure of the model capability in multi-shot scenarios. By further analyzing and discovering that current models only encode instance features in a discrete or lossy manner, at the risk of missing identity information, we then contribute a new model IPFormer-VideoLLM. Its key idea is the injection of instance-level features as instance prompts through an efficient attention-based connector. This allows for the aggregation of instance-specific information across scenes. 
Experiments demonstrate that our proposed dataset and model not only significantly enhance multi-scene video understanding, but also offer distinct advantages across various video benchmarks.

[94] CL-Splats: Continual Learning of Gaussian Splatting with Local Optimization

Jan Ackermann,Jonas Kulhanek,Shengqu Cai,Haofei Xu,Marc Pollefeys,Gordon Wetzstein,Leonidas Guibas,Songyou Peng

Main category: cs.CV

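TL;DR: This paper proposes CL-Splats, a method for updating scene representations in dynamic 3D environments that improves reconstruction quality and efficiency through incremental updates and local optimization.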

Details Motivation: In dynamic 3D environments, accurately updating scene representations over time is crucial for applications in robotics, mixed reality, and embodied AI. As scenes evolve, efficient methods are needed to incorporate changes and maintain up-to-date, high-quality reconstructions without the computational overhead of re-optimizing the entire scene. Method: The paper introduces CL-Splats, which incrementally updates Gaussian splatting-based 3D representations from sparse scene captures by integrating a robust change-detection module that segments updated and static components of the scene, enabling focused local optimization. Result: Experiments show that CL-Splats achieves efficient updates and surpasses the state of the art in reconstruction quality. Conclusion: CL-Splats lays a solid foundation for future real-time adaptation in 3D scene reconstruction tasks and outperforms existing techniques in both update efficiency and reconstruction quality. Abstract: In dynamic 3D environments, accurately updating scene representations over time is crucial for applications in robotics, mixed reality, and embodied AI. As scenes evolve, efficient methods to incorporate changes are needed to maintain up-to-date, high-quality reconstructions without the computational overhead of re-optimizing the entire scene. This paper introduces CL-Splats, which incrementally updates Gaussian splatting-based 3D representations from sparse scene captures. CL-Splats integrates a robust change-detection module that segments updated and static components within the scene, enabling focused, local optimization that avoids unnecessary re-computation. Moreover, CL-Splats supports storing and recovering previous scene states, facilitating temporal segmentation and new scene-analysis applications. Our extensive experiments demonstrate that CL-Splats achieves efficient updates with improved reconstruction quality over the state-of-the-art. This establishes a robust foundation for future real-time adaptation in 3D scene reconstruction tasks.

[95] GoIRL: Graph-Oriented Inverse Reinforcement Learning for Multimodal Trajectory Prediction

Muleilan Pei,Shaoshuai Shi,Lu Zhang,Peiliang Li,Shaojie Shen

Main category: cs.CV

TL;DR: This paper introduces a novel GoIRL framework for trajectory prediction in autonomous driving, combining graph-based feature aggregation with inverse reinforcement learning and achieving superior performance over existing methods.

Details Motivation: Trajectory prediction for surrounding agents in autonomous driving is inherently uncertain and multimodal, which presents challenges for prevailing data-driven, supervised learning methods. Method: A Graph-oriented Inverse Reinforcement Learning (GoIRL) framework is introduced, incorporating a feature adaptor to aggregate lane-graph features into grid space, combined with maximum entropy IRL to infer reward distribution. A hierarchical parameterized trajectory generator with refinement and probability fusion strategies is also implemented. Result: Extensive experimental results show that the approach achieves state-of-the-art performance on Argoverse and nuScenes motion forecasting benchmarks while exhibiting better generalization than existing supervised models. Conclusion: The proposed GoIRL framework outperforms existing supervised models in trajectory prediction for autonomous driving, demonstrating state-of-the-art performance and superior generalization abilities. Abstract: Trajectory prediction for surrounding agents is a challenging task in autonomous driving due to its inherent uncertainty and underlying multimodality. Unlike prevailing data-driven methods that primarily rely on supervised learning, in this paper, we introduce a novel Graph-oriented Inverse Reinforcement Learning (GoIRL) framework, which is an IRL-based predictor equipped with vectorized context representations. We develop a feature adaptor to effectively aggregate lane-graph features into grid space, enabling seamless integration with the maximum entropy IRL paradigm to infer the reward distribution and obtain the policy that can be sampled to induce multiple plausible plans. Furthermore, conditioned on the sampled plans, we implement a hierarchical parameterized trajectory generator with a refinement module to enhance prediction accuracy and a probability fusion strategy to boost prediction confidence. 
Extensive experimental results show that our approach not only achieves state-of-the-art performance on the large-scale Argoverse and nuScenes motion forecasting benchmarks but also exhibits superior generalization abilities compared to existing supervised models.

[96] Learning to See in the Extremely Dark

Hai Jiang,Binhao Guan,Zhen Liu,Xiaohong Liu,Jian Yu,Zheng Liu,Songchen Han,Shuaicheng Liu

Main category: cs.CV

TL;DR: This paper introduces a new dataset (SIED) for extremely low-light RAW image enhancement and proposes a diffusion-based framework with specialized modules to effectively restore high-quality images from very noisy inputs.

Details Motivation: Learning-based methods have advanced low-light RAW image enhancement, but their performance in extremely dark scenes (e.g., 0.0001 lux) remains unexplored due to the lack of appropriate datasets. Method: A paired-to-paired data synthesis pipeline is proposed for generating extremely low-light RAW images across precise illuminance ranges, along with a diffusion-based enhancement framework that exploits generative capabilities and intrinsic denoising properties. Result: The proposed method generates a large-scale dataset named See-in-the-Extremely-Dark (SIED) and demonstrates effective restoration of visually pleasing results from extremely low-light inputs on both SIED and public benchmarks. Conclusion: The paper proposes a diffusion-based framework with an Adaptive Illumination Correction Module and color consistency loss to restore visually pleasing results from extremely low-SNR RAW inputs, showing effectiveness through experiments. Abstract: Learning-based methods have made promising advances in low-light RAW image enhancement, while their capability in extremely dark scenes, where the environmental illuminance drops as low as 0.0001 lux, remains to be explored due to the lack of corresponding datasets. To this end, we propose a paired-to-paired data synthesis pipeline capable of generating well-calibrated extremely low-light RAW images at three precise illuminance ranges of 0.01-0.1 lux, 0.001-0.01 lux, and 0.0001-0.001 lux, together with high-quality sRGB references to comprise a large-scale paired dataset named See-in-the-Extremely-Dark (SIED) to benchmark low-light RAW image enhancement approaches. 
Furthermore, we propose a diffusion-based framework that leverages the generative ability and intrinsic denoising property of diffusion models to restore visually pleasing results from extremely low-SNR RAW inputs, in which an Adaptive Illumination Correction Module (AICM) and a color consistency loss are introduced to ensure accurate exposure correction and color restoration. Extensive experiments on the proposed SIED and publicly available benchmarks demonstrate the effectiveness of our method. The code and dataset are available at https://github.com/JianghaiSCU/SIED.

[97] YOLO-FDA: Integrating Hierarchical Attention and Detail Enhancement for Surface Defect Detection

Jiawei Hu

Main category: cs.CV

TL;DR: YOLO-FDA improves surface defect detection by integrating fine-grained detail enhancement and attention-guided feature fusion, offering better accuracy and robustness than current methods.

Details Motivation: Surface defect detection in industrial scenarios is challenging due to variability in defect types, irregular shapes and sizes, fine-grained requirements, and complex material textures. Existing AI-based methods face issues like redundant features, limited detail sensitivity, and weak robustness under multiscale conditions. Method: YOLO-FDA integrates fine-grained detail enhancement and attention-guided feature fusion using a BiFPN-style architecture, Detail-directional Fusion Module (DDFM), and two novel attention-based fusion strategies: Attention-weighted Concatenation (AC) and Cross-layer Attention Fusion (CAF). Result: Extensive experiments on benchmark datasets demonstrate that YOLO-FDA consistently outperforms state-of-the-art methods in terms of accuracy and robustness across diverse defects and scales. Conclusion: YOLO-FDA is a highly effective framework for surface defect detection, offering improved accuracy and robustness compared to existing methods. Abstract: Surface defect detection in industrial scenarios is both crucial and technically demanding due to the wide variability in defect types, irregular shapes and sizes, fine-grained requirements, and complex material textures. Although recent advances in AI-based detectors have improved performance, existing methods often suffer from redundant features, limited detail sensitivity, and weak robustness under multiscale conditions. To address these challenges, we propose YOLO-FDA, a novel YOLO-based detection framework that integrates fine-grained detail enhancement and attention-guided feature fusion. Specifically, we adopt a BiFPN-style architecture to strengthen bidirectional multilevel feature aggregation within the YOLOv5 backbone. 
To better capture fine structural changes, we introduce a Detail-directional Fusion Module (DDFM) that introduces a directional asymmetric convolution in the second-lowest layer to enrich spatial details and fuses the second-lowest layer with low-level features to enhance semantic consistency. Furthermore, we propose two novel attention-based fusion strategies, Attention-weighted Concatenation (AC) and Cross-layer Attention Fusion (CAF) to improve contextual representation and reduce feature noise. Extensive experiments on benchmark datasets demonstrate that YOLO-FDA consistently outperforms existing state-of-the-art methods in terms of both accuracy and robustness across diverse types of defects and scales.

[98] Tree-based Semantic Losses: Application to Sparsely-supervised Large Multi-class Hyperspectral Segmentation

Junwen Wang,Oscar Maccormac,William Rochford,Aaron Kujawa,Jonathan Shapey,Tom Vercauteren

Main category: cs.CV

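TL;DR: This paper proposes new semantic loss functions for hyperspectral image segmentation that improve fine-grained multi-class segmentation and can identify anomalous regions.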

Details Motivation: Existing biomedical segmentation methods penalize all errors equally and cannot exploit inter-class semantic relationships in the label space. Method: The paper introduces two tree-based semantic loss functions and combines them with an approach for training on sparse, background-free annotations. Result: The method reaches state-of-the-art performance on a sparsely annotated HSI dataset comprising 107 classes organized in a clinically defined semantic tree, and it effectively detects out-of-distribution pixels without compromising segmentation performance on in-distribution pixels. Conclusion: The paper proposes tree-structured semantic loss functions for biomedical segmentation in hyperspectral imaging and demonstrates their effectiveness on a sparsely annotated dataset. Abstract: Hyperspectral imaging (HSI) shows great promise for surgical applications, offering detailed insights into biological tissue differences beyond what the naked eye can perceive. Refined labelling efforts are underway to train vision systems to distinguish large numbers of subtly varying classes. However, commonly used learning methods for biomedical segmentation tasks penalise all errors equivalently and thus fail to exploit any inter-class semantics in the label space. In this work, we introduce two tree-based semantic loss functions which take advantage of a hierarchical organisation of the labels. We further incorporate our losses in a recently proposed approach for training with sparse, background-free annotations. Extensive experiments demonstrate that our proposed method reaches state-of-the-art performance on a sparsely annotated HSI dataset comprising 107 classes organised in a clinically-defined semantic tree structure. Furthermore, our method enables effective detection of out-of-distribution (OOD) pixels without compromising segmentation performance on in-distribution (ID) pixels.
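To make the idea of a loss that respects a label hierarchy concrete, here is a minimal sketch of a generic hierarchical cross-entropy. It is not the authors' formulation, which the abstract does not spell out: the class names, the `parent` map, and the equal weighting of tree levels are all assumptions chosen for illustration.

```python
import math


def ancestors(node, parent):
    """Path from a node up to, but excluding, the root."""
    path = []
    while node in parent:  # the root has no entry in the parent map
        path.append(node)
        node = parent[node]
    return path


def tree_semantic_loss(leaf_probs, true_leaf, parent):
    """Average negative log-probability along the true leaf's path.

    A node's probability is the sum of the predicted probabilities of
    the leaves beneath it, so confusing two siblings costs less than
    confusing classes in different branches.
    """
    node_prob = {}
    for leaf, p in leaf_probs.items():
        for node in ancestors(leaf, parent):
            node_prob[node] = node_prob.get(node, 0.0) + p
    path = ancestors(true_leaf, parent)
    return -sum(math.log(max(node_prob[n], 1e-12)) for n in path) / len(path)


# Hypothetical two-level label tree.
parent = {
    "artery": "vessel", "vein": "vessel",
    "tumour": "tissue",
    "vessel": "root", "tissue": "root",
}

# Both predictions give the true class "artery" a probability of 0.1.
near_miss = tree_semantic_loss(
    {"artery": 0.1, "vein": 0.8, "tumour": 0.1}, "artery", parent)
far_miss = tree_semantic_loss(
    {"artery": 0.1, "vein": 0.1, "tumour": 0.8}, "artery", parent)
```

A plain cross-entropy would score both predictions identically, since only the probability of "artery" enters it. The tree-aware variant penalises the prediction that leaves the "vessel" branch more heavily.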

[99] Robust Deep Learning for Myocardial Scar Segmentation in Cardiac MRI with Noisy Labels

Aida Moafi,Danial Moafi,Evgeny M. Mirkes,Gerry P. McCann,Abbas S. Alatrany,Jayanth R. Arnold,Mostafa Mehdipour Ghazi

Main category: cs.CV

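TL;DR: This study develops a robust deep learning method for fully automated myocardial scar segmentation that addresses label noise and related issues and performs well under a variety of conditions.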

Details Motivation: Accurately segmenting myocardial scars from cardiac MRI is essential for clinical assessment and treatment planning, but challenges remain, including label noise, data heterogeneity, and class imbalance. Method: A deep learning pipeline for myocardial scar detection and segmentation is built by fine-tuning state-of-the-art models, using Kullback-Leibler loss and extensive data augmentation to address label noise, data heterogeneity, and class imbalance. Result: The method performs well on both acute and chronic cases, produces accurate and smooth segmentations despite noisy labels, and generalizes better than state-of-the-art models such as nnU-Net on an out-of-distribution test set. Conclusion: The study provides a reliable approach to automated myocardial scar segmentation and demonstrates its potential for broad application in clinical cardiac imaging. Abstract: The accurate segmentation of myocardial scars from cardiac MRI is essential for clinical assessment and treatment planning. In this study, we propose a robust deep-learning pipeline for fully automated myocardial scar detection and segmentation by fine-tuning state-of-the-art models. The method explicitly addresses challenges of label noise from semi-automatic annotations, data heterogeneity, and class imbalance through the use of Kullback-Leibler loss and extensive data augmentation. We evaluate the model's performance on both acute and chronic cases and demonstrate its ability to produce accurate and smooth segmentations despite noisy labels. In particular, our approach outperforms state-of-the-art models like nnU-Net and shows strong generalizability in an out-of-distribution test set, highlighting its robustness across various imaging conditions and clinical tasks. These results establish a reliable foundation for automated myocardial scar quantification and support the broader clinical adoption of deep learning in cardiac imaging.
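The abstract names a Kullback-Leibler loss without saying how it is applied, so the following is only a sketch of the underlying quantity. It assumes per-pixel class distributions given as plain lists and a hypothetical `soften` helper that turns a hard label into a soft target; neither is taken from the paper.

```python
import math


def soften(hard_label, num_classes, smoothing=0.1):
    """Turn a hard class index into a soft target distribution
    (the smoothing value is an illustrative assumption)."""
    off_value = smoothing / (num_classes - 1)
    return [1.0 - smoothing if k == hard_label else off_value
            for k in range(num_classes)]


def kl_divergence(target, predicted, eps=1e-12):
    """KL(target || predicted) for two discrete distributions."""
    return sum(t * math.log((t + eps) / (p + eps))
               for t, p in zip(target, predicted) if t > 0)


# Two classes: background (0) and scar (1), for a pixel labelled as scar.
target = soften(hard_label=1, num_classes=2)          # [0.1, 0.9]
confident_loss = kl_divergence(target, [0.1, 0.9])    # prediction matches target
uncertain_loss = kl_divergence(target, [0.5, 0.5])
```

Because the soft target never reaches exactly 0 or 1, a single mislabelled pixel pulls the prediction less strongly than it would under a hard-label cross-entropy, which is one common motivation for distribution-matching losses when annotations are noisy.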

[100] Geometry and Perception Guided Gaussians for Multiview-consistent 3D Generation from a Single Image

Pufan Li,Bi'an Du,Wei Hu

Main category: cs.CV

TL;DR: This paper proposes a new method for generating realistic and detailed 3D objects from single-view images by combining geometry and perception priors without extra training, achieving better multiview consistency and 3D reconstruction than existing approaches.

Details Motivation: The motivation is to address the shortcomings of existing methods for generating 3D objects from single-view images, such as poor multiview consistency and lack of geometric detail. Method: The authors propose a method that uses three Gaussian branches initialized from geometry prior, perception prior, and Gaussian noise. The geometry prior captures rough 3D shapes, while the perception prior leverages a pretrained 2D diffusion model to enhance multiview information. These branches are refined through mutual interaction and a reprojection-based strategy to enforce depth consistency. Result: The experiments show that the proposed method delivers higher-fidelity reconstruction results, demonstrating robust and consistent 3D object generation with improved performance on novel view synthesis and 3D reconstruction. Conclusion: The paper concludes that their proposed method, which integrates geometry and perception priors without requiring additional model training, achieves higher-fidelity 3D object reconstruction and outperforms existing methods in novel view synthesis and 3D reconstruction. Abstract: Generating realistic 3D objects from single-view images requires natural appearance, 3D consistency, and the ability to capture multiple plausible interpretations of unseen regions. Existing approaches often rely on fine-tuning pretrained 2D diffusion models or directly generating 3D information through fast network inference or 3D Gaussian Splatting, but their results generally suffer from poor multiview consistency and lack geometric detail. To tackle these issues, we present a novel method that seamlessly integrates geometry and perception priors without requiring additional model training to reconstruct detailed 3D objects from a single image. Specifically, we train three different Gaussian branches initialized from the geometry prior, perception prior and Gaussian noise, respectively. 
The geometry prior captures the rough 3D shapes, while the perception prior utilizes the 2D pretrained diffusion model to enhance multiview information. Subsequently, we refine 3D Gaussian branches through mutual interaction between geometry and perception priors, further enhanced by a reprojection-based strategy that enforces depth consistency. Experiments demonstrate the higher-fidelity reconstruction results of our method, outperforming existing methods on novel view synthesis and 3D reconstruction, demonstrating robust and consistent 3D object generation.

[101] Topology-Aware Modeling for Unsupervised Simulation-to-Reality Point Cloud Recognition

Longkun Zou,Kangjun Liu,Ke Chen,Kailing Guo,Kui Jia,Yaowei Wang

Main category: cs.CV

TL;DR: A novel Topology-Aware Modeling framework improves unsupervised domain adaptation for 3D point cloud classification by leveraging global spatial topology and advanced self-training strategies.

Details Motivation: To overcome the limitations of current UDA techniques in handling simulation-to-reality domain gaps for 3D object shape analysis. Method: Introduces a Topology-Aware Modeling framework with self-supervised learning and a self-training strategy combining cross-domain contrastive learning. Result: Validated on three public benchmarks, showing consistent improvements over existing methods. Conclusion: The proposed TAM framework effectively addresses the Sim2Real domain gap in unsupervised domain adaptation for 3D point cloud classification. Abstract: Learning semantic representations from point sets of 3D object shapes is often challenged by significant geometric variations, primarily due to differences in data acquisition methods. Typically, training data is generated using point simulators, while testing data is collected with distinct 3D sensors, leading to a simulation-to-reality (Sim2Real) domain gap that limits the generalization ability of point classifiers. Current unsupervised domain adaptation (UDA) techniques struggle with this gap, as they often lack robust, domain-insensitive descriptors capable of capturing global topological information, resulting in overfitting to the limited semantic patterns of the source domain. To address this issue, we introduce a novel Topology-Aware Modeling (TAM) framework for Sim2Real UDA on object point clouds. Our approach mitigates the domain gap by leveraging global spatial topology, characterized by low-level, high-frequency 3D structures, and by modeling the topological relations of local geometric features through a novel self-supervised learning task. Additionally, we propose an advanced self-training strategy that combines cross-domain contrastive learning with self-training, effectively reducing the impact of noisy pseudo-labels and enhancing the robustness of the adaptation process. 
Experimental results on three public Sim2Real benchmarks validate the effectiveness of our TAM framework, showing consistent improvements over state-of-the-art methods across all evaluated tasks. The source code of this work will be available at https://github.com/zou-longkun/TAG.git.

[102] Task-Aware KV Compression For Cost-Effective Long Video Understanding

Minghao Qin,Yan Shu,Peitian Zhang,Kun Lun,Huaying Yuan,Juenjie Zhou,Shitao Xiao,Bo Zhao,Zheng Liu

Main category: cs.CV

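TL;DR: This paper proposes Video-X^2L, an efficient long-video understanding method that uses bi-level KV compression and selective KV re-loading to address the information loss of existing KV compression methods at high compression ratios while substantially reducing computation cost.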

Details Motivation: Current multimodal large language models face prohibitive computational cost when processing long videos, and existing KV compression methods suffer significant information loss at high compression ratios. Method: Proposes Video-X^2L, which involves two key operations: 1) bi-level KV compression, which generates low-compression KVs (L-KVs) to capture fine-grained video details and high-compression KVs (H-KVs) to provide compact video representations; 2) selective KV re-loading, which re-loads L-KVs for the most critical video chunks during decoding while using H-KVs for the rest. Result: Experiments show that Video-X^2L outperforms existing KV compression methods by a large margin on several popular LVU benchmarks while substantially saving computation cost. Conclusion: Video-X^2L is an efficient long-video understanding method that requires no additional training and is directly compatible with existing KV-compressible MLLMs; through bi-level KV compression and selective KV re-loading, it preserves task-specific information while substantially saving computation cost. Abstract: Long-video understanding (LVU) remains a severe challenge for existing multimodal large language models (MLLMs), primarily due to the prohibitive computational cost. Recent approaches have explored KV compression to mitigate this issue, but they often suffer from significant information loss at high compression ratios. In this paper, we introduce Video-X^2L, which flexibly preserves critical video information for each LVU task. Video-X^2L involves two key operations. The first one is called bi-level KV compression. During the MLLM's pre-filling stage, Video-X^2L generates two types of compressed KVs: low-compression KVs (L-KVs) to capture fine-grained video details and high-compression KVs (H-KVs) to offer compact video representations. The second one is called selective KV re-loading. During the MLLM's decoding stage, Video-X^2L selectively re-loads L-KVs for the most critical video chunks while using H-KVs for other less important ones. This allows the MLLM to fully utilize task-specific information while maintaining the overall compactness. Video-X^2L is simple yet effective: it is free from additional training and directly compatible with existing KV-compressible MLLMs. We evaluate Video-X^2L with a variety of popular LVU benchmarks, including VideoMME, MLVU, LongVideoBench, and VNBench. Our experimental results show that Video-X^2L outperforms existing KV-compression methods by a large margin while substantially saving computation cost.
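The selective re-loading step described above can be illustrated with a small sketch. It assumes each video chunk already has a relevance score for the current task and a precomputed L-KV and H-KV; the function name `select_kvs`, the `top_k` parameter, and the scoring itself are hypothetical and are not taken from the paper.

```python
from typing import List, Sequence, TypeVar

KV = TypeVar("KV")


def select_kvs(
    relevance: Sequence[float],
    l_kvs: Sequence[KV],
    h_kvs: Sequence[KV],
    top_k: int,
) -> List[KV]:
    """Pick the low-compression KV for the top_k most relevant chunks
    and the high-compression KV for every other chunk.

    Hypothetical helper: how relevance is scored is an assumption.
    """
    if not (len(relevance) == len(l_kvs) == len(h_kvs)):
        raise ValueError("relevance, l_kvs and h_kvs must have equal length")
    # Rank chunk indices from most to least relevant.
    ranked = sorted(range(len(relevance)), key=lambda i: relevance[i], reverse=True)
    critical = set(ranked[: max(top_k, 0)])
    # Keep the original chunk order so the temporal layout is unchanged.
    return [l_kvs[i] if i in critical else h_kvs[i] for i in range(len(relevance))]


mixed = select_kvs([0.1, 0.9, 0.3, 0.7], ["L0", "L1", "L2", "L3"], ["H0", "H1", "H2", "H3"], top_k=2)
```

Here the strings stand in for real KV tensors; only the two highest-scoring chunks keep their detailed L-KVs, so the overall cache stays compact.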

[103] Out-of-Distribution Semantic Occupancy Prediction

Yuheng Zhang,Mengfei Duan,Kunyu Peng,Yuhang Wang,Ruiping Liu,Fei Teng,Kai Luo,Zhiyong Li,Kailun Yang

Main category: cs.CV

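TL;DR: OccOoD is a new framework that integrates OoD detection into 3D semantic occupancy prediction for autonomous driving; evaluated on two datasets, VAA-KITTI and VAA-KITTI-360, generated by a synthetic anomaly integration pipeline, it shows strong OoD detection and occupancy prediction performance.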

Details Motivation: Existing methods focus on in-distribution scenes and are vulnerable to OoD objects and long-tail distributions, which increases safety risks from undetected anomalies and misinterpretations. Addressing this requires more robust 3D semantic occupancy prediction methods and evaluation datasets. Method: Introduces Out-of-Distribution Semantic Occupancy Prediction and uses a Synthetic Anomaly Integration Pipeline to inject synthetic anomalies and generate datasets while preserving realistic spatial and occlusion patterns. It also proposes OccOoD, a new framework that enhances OoD detection through geometry-semantic fusion using Voxel-BEV Progressive Fusion (VBPF) and an RWKV-based branch. Result: OccOoD achieves state-of-the-art OoD detection with an AuROC of 67.34% and an AuPRCr of 29.21% within a 1.2m region, while remaining competitive in occupancy prediction. Conclusion: OccOoD is a new framework that integrates OoD detection into 3D semantic occupancy prediction; together with the two datasets to be released, VAA-KITTI and VAA-KITTI-360, it has the potential to improve the safety and reliability of environment perception in autonomous driving. Abstract: 3D Semantic Occupancy Prediction is crucial for autonomous driving, providing a dense, semantically rich environmental representation. However, existing methods focus on in-distribution scenes, making them susceptible to Out-of-Distribution (OoD) objects and long-tail distributions, which increases the risk of undetected anomalies and misinterpretations, posing safety hazards. To address these challenges, we introduce Out-of-Distribution Semantic Occupancy Prediction, targeting OoD detection in 3D voxel space. To fill the gaps in the dataset, we propose a Synthetic Anomaly Integration Pipeline that injects synthetic anomalies while preserving realistic spatial and occlusion patterns, enabling the creation of two datasets: VAA-KITTI and VAA-KITTI-360. We introduce OccOoD, a novel framework integrating OoD detection into 3D semantic occupancy prediction, with Voxel-BEV Progressive Fusion (VBPF) leveraging an RWKV-based branch to enhance OoD detection via geometry-semantic fusion. Experimental results demonstrate that OccOoD achieves state-of-the-art OoD detection with an AuROC of 67.34% and an AuPRCr of 29.21% within a 1.2m region, while maintaining competitive occupancy prediction performance. The established datasets and source code will be made publicly available at https://github.com/7uHeng/OccOoD.
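For readers unfamiliar with the AuROC figure quoted above, the following sketch computes it from per-voxel anomaly scores using the standard pairwise definition: the probability that a randomly chosen OoD voxel scores higher than a randomly chosen in-distribution voxel, with ties counted as one half. The function name `auroc` and the toy inputs are illustrative; the paper's AuPRCr metric and its 1.2m region masking are not reproduced here.

```python
from typing import Sequence


def auroc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Area under the ROC curve via pairwise comparison.

    labels: 1 for OoD (positive), 0 for in-distribution (negative).
    scores: higher means "more likely OoD".
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative sample")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(positives) * len(negatives))


toy_auroc = auroc([1, 0, 1, 0], [0.9, 0.8, 0.4, 0.1])
```

The quadratic loop is fine for a toy example; a rank-based implementation would be preferable for full voxel grids.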

[104] GroundFlow: A Plug-in Module for Temporal Reasoning on 3D Point Cloud Sequential Grounding

Zijun Lin,Shuting He,Cheston Tan,Bihan Wen

Main category: cs.CV

TL;DR: This paper introduces GroundFlow, a module that adds temporal reasoning to 3D visual grounding models, enhancing their ability to handle sequential object localization tasks involving multi-step text instructions.

Details Motivation: Current 3DVG methods struggle with SG3D tasks due to their inability to extract temporal information from multi-step instructions involving pronouns. GroundFlow addresses this gap by enabling context understanding and historical information retrieval. Method: GroundFlow was developed as a plug-in module for temporal reasoning in 3D point cloud sequential grounding. It extracts both short-term and long-term step information based on relevance to the current instruction. Result: Integrating GroundFlow significantly improves the task accuracy of 3DVG baseline methods (+7.5% and +10.2%) on the SG3D benchmark, outperforming a pre-trained 3D large language model. Conclusion: GroundFlow enhances 3DVG models by introducing temporal reasoning capabilities, leading to improved performance on the SG3D benchmark. Abstract: Sequential grounding in 3D point clouds (SG3D) refers to locating sequences of objects by following text instructions for a daily activity with detailed steps. Current 3D visual grounding (3DVG) methods treat text instructions with multiple steps as a whole, without extracting useful temporal information from each step. However, the instructions in SG3D often contain pronouns such as "it", "here" and "the same" to make language expressions concise. This requires grounding methods to understand the context and retrieve relevant information from previous steps to correctly locate object sequences. Due to the lack of an effective module for collecting related historical information, state-of-the-art 3DVG methods face significant challenges in adapting to the SG3D task. To fill this gap, we propose GroundFlow -- a plug-in module for temporal reasoning on 3D point cloud sequential grounding. Firstly, we demonstrate that integrating GroundFlow improves the task accuracy of 3DVG baseline methods by a large margin (+7.5% and +10.2%) in the SG3D benchmark, even outperforming a 3D large language model pre-trained on various datasets. 
Furthermore, we selectively extract both short-term and long-term step information based on its relevance to the current instruction, enabling GroundFlow to take a comprehensive view of historical information and maintain its temporal understanding advantage as step counts increase. Overall, our work introduces temporal reasoning capabilities to existing 3DVG models and achieves state-of-the-art performance in the SG3D benchmark across five datasets.

[105] Unlocking Constraints: Source-Free Occlusion-Aware Seamless Segmentation

Yihong Cao,Jiaming Zhang,Xu Zheng,Hao Shi,Kunyu Peng,Hang Liu,Kailun Yang,Hui Zhang

Main category: cs.CV

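TL;DR: This work proposes UNLOCK, a new source-free panoramic image segmentation framework whose novel module design enables high-quality semantic segmentation with 360-degree viewpoint coverage and occlusion awareness.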

Details Motivation: To address the challenges that distortion, perspective occlusion, and limited annotations pose for panoramic image processing, and to remove the dependence of existing unsupervised domain adaptation methods on source data. Method: Introduces two key modules, Omni Pseudo-Labeling Learning and Amodal-Driven Context Learning, to adapt without source data or target labels. Result: Experiments on the SFOASS task show scores of 10.9 in mAAP and 11.6 in mAP, with an improvement of +4.3 in mAPQ over the source-only method. Conclusion: The paper proposes UNLOCK, a source-free panoramic image segmentation method that achieves performance comparable to source-dependent methods. Abstract: Panoramic image processing is essential for omni-context perception, yet faces constraints like distortions, perspective occlusions, and limited annotations. Previous unsupervised domain adaptation methods transfer knowledge from labeled pinhole data to unlabeled panoramic images, but they require access to source pinhole data. To address these constraints, we introduce a more practical task, i.e., Source-Free Occlusion-Aware Seamless Segmentation (SFOASS), and propose its first solution, called UNconstrained Learning Omni-Context Knowledge (UNLOCK). Specifically, UNLOCK includes two key modules: Omni Pseudo-Labeling Learning and Amodal-Driven Context Learning. While adapting without relying on source data or target labels, this framework enhances models to achieve segmentation with 360° viewpoint coverage and occlusion-aware reasoning. Furthermore, we benchmark the proposed SFOASS task through both real-to-real and synthetic-to-real adaptation settings. Experimental results show that our source-free method achieves performance comparable to source-dependent methods, yielding state-of-the-art scores of 10.9 in mAAP and 11.6 in mAP, along with an absolute improvement of +4.3 in mAPQ over the source-only method. All data and code will be made publicly available at https://github.com/yihong-97/UNLOCK.

[106] MedPrompt: LLM-CNN Fusion with Weight Routing for Medical Image Segmentation and Classification

Shadman Sobhan,Kazi Abrar Mahmud,Abduz Zami

Main category: cs.CV

TL;DR: MedPrompt is a unified framework for medical image analysis that dynamically supports user-defined workflows without retraining the entire system, making it scalable and suitable for near real-time applications.

Details Motivation: Current medical image analysis systems are typically task-specific and lack the flexibility to support user-defined workflows. Method: MedPrompt combines a few-shot prompted Large Language Model (Llama-4-17B) for high-level task planning with a modular Convolutional Neural Network (DeepFusionLab) for low-level image processing. Result: The system achieves 97% end-to-end correctness in interpreting and executing prompt-driven instructions, with an average inference latency of 2.5 seconds. DeepFusionLab achieves competitive segmentation accuracy and strong classification performance. Conclusion: MedPrompt enables scalable, prompt-driven medical imaging by combining the interpretability of LLMs with the efficiency of modular CNNs. Abstract: Current medical image analysis systems are typically task-specific, requiring separate models for classification and segmentation, and lack the flexibility to support user-defined workflows. To address these challenges, we introduce MedPrompt, a unified framework that combines a few-shot prompted Large Language Model (Llama-4-17B) for high-level task planning with a modular Convolutional Neural Network (DeepFusionLab) for low-level image processing. The LLM interprets user instructions and generates structured output to dynamically route task-specific pretrained weights. This weight routing approach avoids retraining the entire framework when adding new tasks; only task-specific weights are required, which enhances scalability and eases deployment. We evaluated MedPrompt across 19 public datasets, covering 12 tasks spanning 5 imaging modalities. The system achieves 97% end-to-end correctness in interpreting and executing prompt-driven instructions, with an average inference latency of 2.5 seconds, making it suitable for near real-time applications. DeepFusionLab achieves competitive segmentation accuracy (e.g., Dice 0.9856 on lungs) and strong classification performance (F1 0.9744 on tuberculosis). 
Overall, MedPrompt enables scalable, prompt-driven medical imaging by combining the interpretability of LLMs with the efficiency of modular CNNs.

[107] BitMark for Infinity: Watermarking Bitwise Autoregressive Image Generative Models

Louis Kerner,Michel Meintz,Bihe Zhao,Franziska Boenisch,Adam Dziedzic

Main category: cs.CV

TL;DR: This paper introduces BitMark, a robust bitwise watermarking framework for Infinity, which embeds watermarks directly at the bit level during image generation to combat model collapse by reliably identifying generated content.

Details Motivation: As text-to-image models like Infinity generate images that increasingly populate the Internet, there's a growing risk these outputs will be scraped and reused as training data, leading to model collapse. A mitigation strategy like watermarking is needed to identify generated content. Method: The method embeds a watermark at the bit level of the token stream during the image generation process across multiple scales, subtly influencing bits to maintain visual fidelity and speed while ensuring robustness against removal techniques. Result: BitMark demonstrates high radioactivity, meaning when watermarked images are used to train other models, those models' outputs also carry the watermark. The radioactive traces remain detectable even with fine-tuning diffusion or image autoregressive models on watermarked images. Conclusion: BitMark offers a robust bitwise watermarking framework that helps prevent model collapse in image generative models by reliably detecting generated outputs. Abstract: State-of-the-art text-to-image models like Infinity generate photorealistic images at an unprecedented speed. These models operate in a bitwise autoregressive manner over a discrete set of tokens that is practically infinite in size. However, their impressive generative power comes with a growing risk: as their outputs increasingly populate the Internet, they are likely to be scraped and reused as training data, potentially by the very same models. This phenomenon has been shown to lead to model collapse, where repeated training on generated content, especially from the models' own previous versions, causes a gradual degradation in performance. A promising mitigation strategy is watermarking, which embeds human-imperceptible yet detectable signals into generated images, enabling the identification of generated content. In this work, we introduce BitMark, a robust bitwise watermarking framework for Infinity. 
Our method embeds a watermark directly at the bit level of the token stream across multiple scales (also referred to as resolutions) during Infinity's image generation process. Our bitwise watermark subtly influences the bits to preserve visual fidelity and generation speed while remaining robust against a spectrum of removal techniques. Furthermore, it exhibits high radioactivity, i.e., when watermarked generated images are used to train another image generative model, this second model's outputs will also carry the watermark. The radioactive traces remain detectable even when only fine-tuning diffusion or image autoregressive models on images watermarked with our BitMark. Overall, our approach provides a principled step toward preventing model collapse in image generative models by enabling reliable detection of generated outputs.

[108] ReME: A Data-Centric Framework for Training-Free Open-Vocabulary Segmentation

Xiwei Xuan,Ziquan Deng,Kwan-Liu Ma

Main category: cs.CV

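TL;DR: By improving the quality of the reference data, this work proposes a simple yet effective training-free open-vocabulary semantic segmentation method that outperforms existing approaches.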

Details Motivation: Existing training-free OVS methods are limited by model capability or by the quality of the reference set, so it is worth exploring how better data quality can improve performance. Method: Constructs a reference set of high-quality paired segment-text embeddings and applies a simple similarity-based retrieval mechanism. Result: Extensive evaluations on ten benchmark datasets show that the proposed method significantly outperforms all existing training-free OVS methods. Conclusion: The study proposes a data-quality-oriented framework for training-free open-vocabulary semantic segmentation that significantly outperforms existing methods and underscores the importance of data-centric design. Abstract: Training-free open-vocabulary semantic segmentation (OVS) aims to segment images given a set of arbitrary textual categories without costly model fine-tuning. Existing solutions often explore attention mechanisms of pre-trained models, such as CLIP, or generate synthetic data and design complex retrieval processes to perform OVS. However, their performance is limited by the capability of the models they rely on or by the suboptimal quality of reference sets. In this work, we investigate the largely overlooked data quality problem for this challenging dense scene understanding task, and identify that a high-quality reference set can significantly benefit training-free OVS. With this observation, we introduce a data-quality-oriented framework, comprising a data pipeline to construct a reference set with well-paired segment-text embeddings and a simple similarity-based retrieval to unveil the essential effect of data. Remarkably, extensive evaluations on ten benchmark datasets demonstrate that our method outperforms all existing training-free OVS approaches, highlighting the importance of data-centric design for advancing OVS without training. Our code is available at https://github.com/xiweix/ReME.
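A similarity-based retrieval over a reference set can be sketched in a few lines. This is only an illustration of the general idea, assuming cosine similarity and a nearest-neighbour lookup; the names `cosine` and `retrieve_label`, the toy embeddings, and the choice of a single nearest neighbour are assumptions, not the retrieval procedure from the ReME repository.

```python
import math
from typing import List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as dissimilar to everything
    return dot / (norm_a * norm_b)


def retrieve_label(
    segment_embedding: Sequence[float],
    reference_set: List[Tuple[Sequence[float], str]],
) -> Tuple[str, float]:
    """Return the label of the most similar reference segment and its score."""
    if not reference_set:
        raise ValueError("reference set is empty")
    best_label, best_sim = "", -math.inf
    for embedding, label in reference_set:
        sim = cosine(segment_embedding, embedding)
        if sim > best_sim:
            best_label, best_sim = label, sim
    return best_label, best_sim


reference = [([1.0, 0.0], "cat"), ([0.0, 1.0], "road")]
label, score = retrieve_label([0.9, 0.1], reference)
```

Because the retrieval itself is this simple, the result depends almost entirely on how well the segment-text pairs in the reference set are matched, which is the data-quality point the abstract makes.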

[109] Real-Time ESFP: Estimating, Smoothing, Filtering, and Pose-Mapping

Qifei Cui,Yuang Zhou,Ruichen Deng

Main category: cs.CV

TL;DR: This paper introduces ESFP, a pipeline that translates video of human motion into accurate, executable instructions for a low-cost robotic arm.

Details Motivation: To enable low-cost robotic arms to accurately follow human movements captured via monocular RGB video. Method: The method involves four modules: estimating 3D skeletons, smoothing trajectories with HPSTM, filtering noise using variance weights, and mapping poses to the arm's workspace. Result: ESFP successfully converts human motion into executable trajectories while ensuring anatomical plausibility and precise wrist orientation control. Conclusion: ESFP is an effective end-to-end pipeline for converting monocular RGB video into executable joint trajectories for a low-cost desktop arm. Abstract: This paper presents ESFP, an end-to-end pipeline that converts monocular RGB video into executable joint trajectories for a low-cost 4-DoF desktop arm. ESFP comprises four sequential modules. (1) Estimating: ROMP lifts each frame to a 24-joint 3-D skeleton. (2) Smoothing: the proposed HPSTM, a sequence-to-sequence Transformer with self-attention, combines long-range temporal context with a differentiable forward-kinematics decoder, enforcing constant bone lengths and anatomical plausibility while jointly predicting joint means and full covariances. (3) Filtering: root-normalized trajectories are variance-weighted according to HPSTM's uncertainty estimates, suppressing residual noise. (4) Pose-Mapping: a geometric retargeting layer transforms shoulder-elbow-wrist triples into the uArm's polar workspace, preserving wrist orientation.
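The filtering step (3) weights trajectory samples by their predicted uncertainty. The abstract does not give the exact formula, so the sketch below assumes one common choice: an inverse-variance weighted moving average over a single joint coordinate. The function name `variance_weighted_filter`, the window size, and the `eps` term are all hypothetical.

```python
from typing import List, Sequence


def variance_weighted_filter(
    means: Sequence[float],
    variances: Sequence[float],
    window: int = 3,
    eps: float = 1e-8,
) -> List[float]:
    """Smooth a 1-D trajectory with an inverse-variance weighted moving average.

    Frames with high predicted variance contribute less to the output.
    Assumed formulation, not the one from the ESFP paper.
    """
    if len(means) != len(variances):
        raise ValueError("means and variances must have equal length")
    if window < 1 or window % 2 == 0:
        raise ValueError("window must be a positive odd integer")
    half = window // 2
    smoothed = []
    for t in range(len(means)):
        lo, hi = max(0, t - half), min(len(means), t + half + 1)
        weights = [1.0 / (variances[i] + eps) for i in range(lo, hi)]
        total = sum(weights)
        value = sum(w * means[i] for w, i in zip(weights, range(lo, hi))) / total
        smoothed.append(value)
    return smoothed


# The spike at index 2 has a large variance, so it is almost entirely suppressed.
filtered = variance_weighted_filter([0.0, 0.0, 10.0, 0.0, 0.0], [1.0, 1.0, 100.0, 1.0, 1.0])
```

With equal variances this reduces to a plain moving average, which makes the effect of the uncertainty weights easy to isolate.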

[110] DiMPLe -- Disentangled Multi-Modal Prompt Learning: Enhancing Out-Of-Distribution Alignment with Invariant and Spurious Feature Separation

Umaima Rahman,Mohammad Yaqub,Dwarikanath Mahapatra

Main category: cs.CV

TL;DR: DiMPLe improves multi-modal learning by disentangling invariant and spurious features, offering strong performance gains across diverse datasets.

Details Motivation: Spurious correlations in visual data hinder out-of-distribution performance, and prior methods focus only on image features without addressing cross-modal disentanglement. Method: DiMPLe combines mutual information minimization, spurious feature regularization, and contrastive learning on invariant features for multi-modal vision and language learning. Result: DiMPLe demonstrates superior performance compared to CoOp-OOD, achieving gains of 15.27 in base class accuracy and 44.31 in novel class accuracy across 11 datasets. Conclusion: DiMPLe enables better generalization to novel classes and robustness to distribution shifts by disentangling invariant and spurious features within and across modalities. Abstract: We introduce DiMPLe (Disentangled Multi-Modal Prompt Learning), a novel approach to disentangle invariant and spurious features across vision and language modalities in multi-modal learning. Spurious correlations in visual data often hinder out-of-distribution (OOD) performance. Unlike prior methods focusing solely on image features, DiMPLe disentangles features within and across modalities while maintaining consistent alignment, enabling better generalization to novel classes and robustness to distribution shifts. Our method combines three key objectives: (1) mutual information minimization between invariant and spurious features, (2) spurious feature regularization, and (3) contrastive learning on invariant features. Extensive experiments show that DiMPLe outperforms CoOp-OOD when averaged across 11 diverse datasets, achieving absolute gains of 15.27 in base class accuracy and 44.31 in novel class accuracy.

[111] Temporal Rate Reduction Clustering for Human Motion Segmentation

Xianghan Meng,Zhengyu Tong,Zhiyuan Huang,Chun-Guang Li

Main category: cs.CV

TL;DR: This paper proposes $\text{TR}^2\text{C}$, a novel method for HMS that improves segmentation accuracy by learning temporally consistent structured representations aligned with a UoS structure.

Details Motivation: Existing subspace clustering methods for HMS are based on the assumption of a Union-of-Subspaces distribution, which may not hold for videos capturing complex human motions with cluttered backgrounds. Method: Temporal Rate Reduction Clustering ($\text{TR}^2\text{C}$) is introduced, which jointly learns structured representations and affinity to segment frame sequences in video. Result: Extensive experiments on five benchmark HMS datasets show state-of-the-art performance using different feature extractors. Conclusion: The proposed Temporal Rate Reduction Clustering ($\text{TR}^2\text{C}$) approach for Human Motion Segmentation (HMS) outperforms existing methods by learning structured representations that align well with a Union-of-Subspaces structure. Abstract: Human Motion Segmentation (HMS), which aims to partition videos into non-overlapping human motions, has attracted increasing research attention recently. Existing approaches for HMS are mainly dominated by subspace clustering methods, which are grounded on the assumption that high-dimensional temporal data align with a Union-of-Subspaces (UoS) distribution. However, the frames in video capturing complex human motions with cluttered backgrounds may not align well with the UoS distribution. In this paper, we propose a novel approach for HMS, named Temporal Rate Reduction Clustering ($\text{TR}^2\text{C}$), which jointly learns structured representations and affinity to segment the frame sequences in video. Specifically, the structured representations learned by $\text{TR}^2\text{C}$ remain temporally consistent and align well with a UoS structure, which is favorable for the HMS task. We conduct extensive experiments on five benchmark HMS datasets and achieve state-of-the-art performance with different feature extractors.

[112] DuET: Dual Incremental Object Detection via Exemplar-Free Task Arithmetic

Munish Monga,Vishal Chudasama,Pankaj Wasnik,Biplab Banerjee

Main category: cs.CV

TL;DR: This paper introduces DuET, a novel framework for Dual Incremental Object Detection that simultaneously handles class and domain changes efficiently, achieving strong performance without needing previous data.

Details Motivation: Existing methods for object detection either focus only on Class Incremental Object Detection (CIOD) or Domain Incremental Object Detection (DIOD), which limits their real-world applicability due to catastrophic forgetting or poor performance in unseen domains. A more practical approach is needed. Method: The paper introduces DuET, a Task Arithmetic-based model merging framework with a Directional Consistency Loss to mitigate sign conflicts and enable stable incremental learning. It also proposes the Retention-Adaptability Index (RAI) for evaluation. Result: Extensive experiments showed that DuET achieved significant improvements: +13.12% RAI while preserving 89.3% Average Retention Index on the Pascal Series (4 tasks), and +11.39% RAI with 88.57% Average Retention Index on the Diverse Weather Series (3 tasks). Conclusion: DuET is a new framework that effectively addresses both class and domain shifts in object detection without requiring exemplars, outperforming existing methods in retention and adaptability. Abstract: Real-world object detection systems, such as those in autonomous driving and surveillance, must continuously learn new object categories and simultaneously adapt to changing environmental conditions. Existing approaches, Class Incremental Object Detection (CIOD) and Domain Incremental Object Detection (DIOD) only address one aspect of this challenge. CIOD struggles in unseen domains, while DIOD suffers from catastrophic forgetting when learning new classes, limiting their real-world applicability. To overcome these limitations, we introduce Dual Incremental Object Detection (DuIOD), a more practical setting that simultaneously handles class and domain shifts in an exemplar-free manner. We propose DuET, a Task Arithmetic-based model merging framework that enables stable incremental learning while mitigating sign conflicts through a novel Directional Consistency Loss. 
Unlike prior methods, DuET is detector-agnostic, allowing models like YOLO11 and RT-DETR to function as real-time incremental object detectors. To comprehensively evaluate both retention and adaptation, we introduce the Retention-Adaptability Index (RAI), which combines the Average Retention Index (Avg RI) for catastrophic forgetting and the Average Generalization Index for domain adaptability into a common ground. Extensive experiments on the Pascal Series and Diverse Weather Series demonstrate DuET's effectiveness, achieving a +13.12% RAI improvement while preserving 89.3% Avg RI on the Pascal Series (4 tasks), as well as a +11.39% RAI improvement with 88.57% Avg RI on the Diverse Weather Series (3 tasks), outperforming existing methods.
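As background for the Task Arithmetic-based merging mentioned above, the sketch below shows plain task arithmetic on flat weight lists: each task vector is the difference between fine-tuned and base weights, and the merged model adds a scaled sum of task vectors back to the base. It also counts sign conflicts between two task vectors, the issue that DuET's Directional Consistency Loss is said to mitigate. This is generic task arithmetic, not DuET's method; the function names and the `scale` parameter are illustrative.

```python
from typing import List, Sequence


def task_vector(base: Sequence[float], finetuned: Sequence[float]) -> List[float]:
    """Task vector: element-wise difference between fine-tuned and base weights."""
    if len(base) != len(finetuned):
        raise ValueError("weight lists must have equal length")
    return [f - b for b, f in zip(base, finetuned)]


def merge(base: Sequence[float], vectors: Sequence[Sequence[float]], scale: float = 1.0) -> List[float]:
    """Add the scaled sum of task vectors to the base weights."""
    return [b + scale * sum(v[i] for v in vectors) for i, b in enumerate(base)]


def sign_conflicts(vec_a: Sequence[float], vec_b: Sequence[float]) -> int:
    """Count parameters that two task vectors push in opposite directions."""
    return sum(1 for a, b in zip(vec_a, vec_b) if a * b < 0)


base = [1.0, 1.0, 1.0]
tv_a = task_vector(base, [1.5, 0.5, 1.0])  # e.g. weights after a first task
tv_b = task_vector(base, [1.5, 1.5, 2.0])  # e.g. weights after a second task
merged = merge(base, [tv_a, tv_b])
conflicts = sign_conflicts(tv_a, tv_b)
```

In the toy example the second parameter is pulled down by one task and up by the other, so the two updates cancel in the merged model; this is the kind of interference that motivates handling sign conflicts explicitly.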

[113] Video Virtual Try-on with Conditional Diffusion Transformer Inpainter

Cheng Zou,Senlin Cheng,Bolei Xu,Dandan Zheng,Xiaobo Li,Jingdong Chen,Ming Yang

Main category: cs.CV

TL;DR: This paper introduces ViTI, a novel approach to video virtual try-on formulated as a conditional video inpainting task using a Diffusion Transformer with spatial-temporal attention, achieving superior results over existing methods.

Details Motivation: Video virtual try-on is challenging because of the need for spatial-temporal consistency and preservation of garment details across consecutive frames. Existing image-based or diffusion-based video try-on methods often produce inconsistent results, motivating the development of a more effective solution like ViTI. Method: The paper proposes ViTI (Video Try-on Inpainter), which formulates video virtual try-on as a conditional video inpainting task. It builds a video inpainting framework based on Diffusion Transformer with full 3D spatial-temporal attention and progressively adapts it for video garment inpainting through masking strategies and multi-stage training. Result: The proposed method, ViTI, achieves better performance compared to previous works in terms of spatial-temporal consistency and preservation of garment appearance and details, as demonstrated by both quantitative metrics and qualitative evaluation. Conclusion: ViTI is able to inpaint the masked garment area with appropriate garment pixels according to the prompt while maintaining good spatial-temporal consistency, and it outperforms previous methods based on both quantitative and qualitative experimental results. Abstract: Video virtual try-on aims to naturally fit a garment to a target person in consecutive video frames. It is a challenging task: on the one hand, the output video should have good spatial-temporal consistency; on the other hand, the details of the given garment need to be well preserved in all frames. Naively applying image-based try-on methods frame by frame yields poor results due to severe inconsistency. The few recent diffusion-based video try-on methods converge on a similar solution: inserting temporal attention into an image-based try-on model to adapt it to the video try-on task. This has shown improvements, but inconsistency problems remain. 
In this paper, we propose ViTI (Video Try-on Inpainter), which formulates and implements video virtual try-on as a conditional video inpainting task, unlike previous methods. In this way, we start from a video generation problem instead of an image-based try-on problem, which provides better spatial-temporal consistency from the beginning. Specifically, we first build a video inpainting framework based on Diffusion Transformer with full 3D spatial-temporal attention, and then progressively adapt it for video garment inpainting with a collection of masking strategies and multi-stage training. After these steps, the model can inpaint the masked garment area with appropriate garment pixels according to the prompt, with good spatial-temporal consistency. Finally, as in other try-on methods, a garment condition is added to the model to ensure that the inpainted garment appearance and details are as expected. Both quantitative and qualitative experimental results show that ViTI is superior to previous works.

[114] WordCon: Word-level Typography Control in Scene Text Rendering

Wenda Shi,Yiren Song,Zihan Rao,Dengming Zhang,Jiaming Liu,Xingxing Zou

Main category: cs.CV

TL;DR: This paper presents a novel approach to improve word-level typography control in text-to-image generation using a new framework and training method.

Details Motivation: The motivation is to address the persistent challenge of achieving precise word-level typography control within generated images. Method: The paper introduces the Text-Image Alignment (TIA) framework and WordCon, a hybrid parameter-efficient fine-tuning method. It also incorporates masked loss at the latent level and joint-attention loss for enhanced controllability. Result: Both qualitative and quantitative results show that the proposed method outperforms the state of the art in tasks such as artistic text rendering, text editing, and image-conditioned text rendering. Conclusion: The paper concludes that the proposed Text-Image Alignment framework and WordCon method effectively achieve precise word-level typography control in generated images, demonstrating superiority over existing methods. Abstract: Achieving precise word-level typography control within generated images remains a persistent challenge. To address it, we newly construct a word-level controlled scene text dataset and introduce the Text-Image Alignment (TIA) framework. This framework leverages cross-modal correspondence between text and local image regions provided by grounding models to enhance the Text-to-Image (T2I) model training. Furthermore, we propose WordCon, a hybrid parameter-efficient fine-tuning (PEFT) method. WordCon reparameterizes selective key parameters, improving both efficiency and portability. This allows seamless integration into diverse pipelines, including artistic text rendering, text editing, and image-conditioned text rendering. To further enhance controllability, the masked loss at the latent level is applied to guide the model to concentrate on learning the text region in the image, and the joint-attention loss provides feature-level supervision to promote disentanglement between different words. Both qualitative and quantitative results demonstrate the superiority of our method to the state of the art. 
The datasets and source code will be available for academic use.

[115] HieraSurg: Hierarchy-Aware Diffusion Model for Surgical Video Generation

Diego Biagini,Nassir Navab,Azade Farshad

Main category: cs.CV

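TL;DR: This paper proposes HieraSurg, a hierarchy-aware surgical video generation framework that uses two specialized diffusion models to generate high-quality videos consistent with surgical actions and phases.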

Details Motivation: Most existing surgical video generation methods are unconditional and lack the surgical understanding and fine-grained guidance needed for consistency, so they cannot achieve realistic simulation. Method: HieraSurg has two stages: the first predicts future coarse-grained semantic changes with a segmentation prediction model; the second generates the final video by augmenting these temporal segmentation maps with fine-grained visual features. Result: Experiments show that HieraSurg significantly outperforms prior methods on cholecystectomy surgical video generation, with strong generalization and the ability to generate higher frame-rate videos. Conclusion: HieraSurg leverages surgical information at multiple levels of abstraction to achieve more precise texture rendering and integration of semantic information, showing its potential for practical surgical applications. Abstract: Surgical Video Synthesis has emerged as a promising research direction following the success of diffusion models in general-domain video generation. Although existing approaches achieve high-quality video generation, most are unconditional and fail to maintain consistency with surgical actions and phases, lacking the surgical understanding and fine-grained guidance necessary for factual simulation. We address these challenges by proposing HieraSurg, a hierarchy-aware surgical video generation framework consisting of two specialized diffusion models. Given a surgical phase and an initial frame, HieraSurg first predicts future coarse-grained semantic changes through a segmentation prediction model. The final video is then generated by a second-stage model that augments these temporal segmentation maps with fine-grained visual features, leading to effective texture rendering and integration of semantic information in the video space. Our approach leverages surgical information at multiple levels of abstraction, including surgical phase, action triplets, and panoptic segmentation maps. The experimental results on Cholecystectomy Surgical Video Generation demonstrate that the model significantly outperforms prior work both quantitatively and qualitatively, showing strong generalization capabilities and the ability to generate higher frame-rate videos. The model exhibits particularly fine-grained adherence when provided with existing segmentation maps, suggesting its potential for practical surgical applications.

[116] Continual Self-Supervised Learning with Masked Autoencoders in Remote Sensing

Lars Möllenbrok,Behnood Rasti,Begüm Demir

Main category: cs.CV

TL;DR: This paper proposes CoSMAE, a continual self-supervised learning method combining data mixup and model mixup knowledge distillation, achieving significant improvements in continual learning for remote sensing.

Details Motivation: To address the issue of catastrophic forgetting in existing continual learning methods for remote sensing, which rely heavily on costly labeled training samples. Method: The method combines data mixup and model mixup knowledge distillation to regularize masked autoencoders (MAE) at the data and model levels. Result: Experimental results show that CoSMAE achieves up to 4.94% improvement over state-of-the-art continual learning methods applied to MAE. Conclusion: CoSMAE effectively improves continual learning performance in remote sensing by reducing catastrophic forgetting and enhancing generalization across tasks. Abstract: The development of continual learning (CL) methods, which aim to learn new tasks in a sequential manner from the training data acquired continuously, has gained great attention in remote sensing (RS). The existing CL methods in RS, while learning new tasks, enhance robustness towards catastrophic forgetting. This is achieved by using a large number of labeled training samples, which is costly and not always feasible to gather in RS. To address this problem, we propose a novel continual self-supervised learning method in the context of masked autoencoders (denoted as CoSMAE). The proposed CoSMAE consists of two components: i) data mixup; and ii) model mixup knowledge distillation. Data mixup is associated with retaining information on previous data distributions by interpolating images from the current task with those from the previous tasks. Model mixup knowledge distillation is associated with distilling knowledge from past models and the current model simultaneously by interpolating their model weights to form a teacher for the knowledge distillation. The two components complement each other to regularize the MAE at the data and model levels to facilitate better generalization across tasks and reduce the risk of catastrophic forgetting. 
Experimental results show that CoSMAE achieves significant improvements of up to 4.94% over state-of-the-art CL methods applied to MAE. Our code is publicly available at: https://git.tu-berlin.de/rsim/CoSMAE.
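The two mixup components described in the abstract can be sketched as plain interpolation. This is a minimal illustration, not the authors' code: the function names, the fixed coefficients `lam` and `alpha`, and the use of flat Python lists in place of image tensors and network weights are all assumptions made for the example.

```python
def data_mixup(current, previous, lam):
    """Interpolate a current-task sample with a previous-task sample.

    `lam` is a hypothetical mixing coefficient in [0, 1].
    """
    return [lam * c + (1.0 - lam) * p for c, p in zip(current, previous)]


def model_mixup(current_weights, past_weights, alpha):
    """Interpolate two weight dictionaries to form a distillation teacher.

    Both dictionaries are assumed to share the same parameter names
    and shapes; `alpha` is a hypothetical mixing coefficient.
    """
    return {
        name: [
            alpha * c + (1.0 - alpha) * p
            for c, p in zip(current_weights[name], past_weights[name])
        ]
        for name in current_weights
    }


mixed_image = data_mixup([1.0, 0.0], [0.0, 1.0], lam=0.25)
teacher = model_mixup({"w": [2.0]}, {"w": [4.0]}, alpha=0.5)
```

In this sketch the teacher is simply a weighted average of the past and current models; how the coefficients are chosen or scheduled is not stated in the abstract.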

[117] DrishtiKon: Multi-Granular Visual Grounding for Text-Rich Document Images

Badri Vishal Kasuba,Parag Chaudhuri,Ganesh Ramakrishnan

Main category: cs.CV

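TL;DR: This paper presents DrishtiKon, a multi-granular visual grounding framework for document intelligence and visual question answering systems.

Details Motivation: Visual grounding in text-rich document images is a critical yet underexplored challenge, and existing vision-language models are limited in precise localization. A structured, alignment-based approach is therefore needed to improve the performance and interpretability of document understanding and visual question answering systems. Method: The approach combines robust multilingual OCR, large language models, and a novel region matching algorithm to localize answer spans at block, line, word, and point levels. The authors also curate a new benchmark from the CircularsVQA test set and run extensive experiments and ablation studies to evaluate the method. Result: Experiments show that the method achieves state-of-the-art grounding accuracy across multiple granularities, with line-level granularity offering the best trade-off between precision and recall. Comparative evaluations with leading vision-language models reveal the limitations of current models and demonstrate the effectiveness of the proposed structured approach. Conclusion: The paper presents DrishtiKon, a multi-granular visual grounding framework that enhances interpretability and trust in document intelligence and visual question answering (VQA) systems. By integrating multilingual OCR, large language models, and a new region matching algorithm, it achieves state-of-the-art grounding accuracy across multiple granularities and paves the way for document understanding systems in real-world, text-centric scenarios. Abstract: Visual grounding in text-rich document images is a critical yet underexplored challenge for document intelligence and visual question answering (VQA) systems. We present DrishtiKon, a multi-granular visual grounding framework designed to enhance interpretability and trust in VQA for complex, multilingual documents. Our approach integrates robust multi-lingual OCR, large language models, and a novel region matching algorithm to accurately localize answer spans at block, line, word, and point levels. We curate a new benchmark from the CircularsVQA test set, providing fine-grained, human-verified annotations across multiple granularities. Extensive experiments demonstrate that our method achieves state-of-the-art grounding accuracy, with line-level granularity offering the best trade-off between precision and recall. Ablation studies further highlight the benefits of multi-block and multi-line reasoning. Comparative evaluations with leading vision-language models reveal the limitations of current VLMs in precise localization, underscoring the effectiveness of our structured, alignment-based approach. Our findings pave the way for more robust and interpretable document understanding systems in real-world, text-centric scenarios. Code and dataset have been made available at https://github.com/kasuba-badri-vishal/DhrishtiKon.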

[118] LLaVA-Pose: Enhancing Human Pose and Action Understanding via Keypoint-Integrated Instruction Tuning

Dewen Zhang,Tahir Hussain,Wangpeng An,Hayaru Shouno

Main category: cs.CV

TL;DR: This paper proposes a method for generating vision-language data by integrating human keypoints, which significantly improves vision-language models on human pose and action understanding tasks.

Details Motivation: Current vision-language models perform poorly on complex visual tasks related to human poses and actions because specialized vision-language instruction-following data is lacking. Method: The authors generate such data by integrating human keypoints with traditional visual features such as captions and bounding boxes, and construct a dataset of 200,328 samples for fine-tuning. Result: Experiments show an overall improvement of 33.2% compared to the original LLaVA-1.5-7B model. Conclusion: Keypoint-integrated data effectively improves multimodal models on human-centric visual understanding tasks. Abstract: Current vision-language models (VLMs) are well-adapted for general visual understanding tasks. However, they perform inadequately when handling complex visual tasks related to human poses and actions due to the lack of specialized vision-language instruction-following data. We introduce a method for generating such data by integrating human keypoints with traditional visual features such as captions and bounding boxes, enabling more precise understanding of human-centric scenes. Our approach constructs a dataset comprising 200,328 samples tailored to fine-tune models for human-centric tasks, focusing on three areas: conversation, detailed description, and complex reasoning. We establish an Extended Human Pose and Action Understanding Benchmark (E-HPAUB) to assess model performance on human pose and action understanding. We fine-tune the LLaVA-1.5-7B model using this dataset and evaluate our resulting LLaVA-Pose model on the benchmark, achieving significant improvements. Experimental results show an overall improvement of 33.2% compared to the original LLaVA-1.5-7B model. These findings highlight the effectiveness of keypoint-integrated data in enhancing multimodal models for human-centric visual understanding. Code is available at https://github.com/Ody-trek/LLaVA-Pose.

[119] Holistic Surgical Phase Recognition with Hierarchical Input Dependent State Space Models

Haoyang Wu,Tsun-Hsuan Wang,Mathias Lechner,Ramin Hasani,Jennifer A. Eckhoff,Paul Pak,Ozanan R. Meireles,Guy Rosman,Yutong Ban,Daniela Rus

Main category: cs.CV

TL;DR: This paper proposes an efficient hierarchical state space model for analyzing full-length surgical videos, effectively capturing both local and global dynamics while outperforming existing methods.

Details Motivation: Surgical workflow analysis in robot-assisted surgeries faces challenges due to the long duration of procedures, which limits efficient video analysis using traditional transformer models because of their quadratic attention mechanism. Method: The method introduces a hierarchical input-dependent state space model with two key modules: a local-aggregation block for capturing local dynamics and a global-relation block for modeling long-term temporal dependencies. It also incorporates a temporally consistent visual feature extractor and is trained using a hybrid discrete-continuous supervision strategy. Result: Experiments demonstrated significant improvements over existing methods, achieving performance gains of +2.8% on Cholec80, +4.3% on MICCAI2016, and +12.9% on Heichole datasets. Conclusion: The proposed hierarchical input-dependent state space model outperforms current state-of-the-art methods in surgical workflow analysis on multiple datasets, including Cholec80, MICCAI2016, and Heichole. Abstract: Surgical workflow analysis is essential in robot-assisted surgeries, yet the long duration of such procedures poses significant challenges for comprehensive video analysis. Recent approaches have predominantly relied on transformer models; however, their quadratic attention mechanism restricts efficient processing of lengthy surgical videos. In this paper, we propose a novel hierarchical input-dependent state space model that leverages the linear scaling property of state space models to enable decision making on full-length videos while capturing both local and global dynamics. Our framework incorporates a temporally consistent visual feature extractor, which appends a state space model head to a visual feature extractor to propagate temporal information. 
The proposed model consists of two key modules: a local-aggregation state space model block that effectively captures intricate local dynamics, and a global-relation state space model block that models temporal dependencies across the entire video. The model is trained using a hybrid discrete-continuous supervision strategy, where both signals of discrete phase labels and continuous phase progresses are propagated through the network. Experiments have shown that our method outperforms the current state-of-the-art methods by a large margin (+2.8% on Cholec80, +4.3% on MICCAI2016, and +12.9% on Heichole datasets). Code will be publicly available after paper acceptance.
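The linear-scaling property mentioned in the abstract can be illustrated with a toy input-dependent recurrence. This is only a sketch under stated assumptions: the scalar state, the sigmoid gate, and the parameter `w_gate` are hypothetical and are not taken from the paper, whose hierarchical blocks are not specified here.

```python
import math


def selective_scan(xs, w_gate=1.0):
    """Run a scalar, input-dependent linear recurrence over a sequence.

    Each step costs O(1), so the whole pass is O(len(xs)), unlike
    self-attention, whose cost grows quadratically with length.
    """
    h = 0.0
    outputs = []
    for x in xs:
        # Input-dependent gate in (0, 1): how much of the old state to keep.
        a = 1.0 / (1.0 + math.exp(-w_gate * x))
        h = a * h + (1.0 - a) * x
        outputs.append(h)
    return outputs


states = selective_scan([0.5, -1.0, 2.0])
```

Because the gate depends on the current input, the recurrence can decide per frame how much history to retain, which is the general idea behind input-dependent state space models.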

[120] PanSt3R: Multi-view Consistent Panoptic Segmentation

Lojze Zust,Yohann Cabon,Juliette Marrie,Leonid Antsfeld,Boris Chidlovskii,Jerome Revaud,Gabriela Csurka

Main category: cs.CV

TL;DR: PanSt3R: A unified and integrated approach for panoptic segmentation of 3D scenes that jointly predicts 3D geometry and multi-view panoptic segmentation in a single forward pass, achieving state-of-the-art performance while being significantly faster than existing methods.

Details Motivation: Existing approaches for panoptic segmentation of 3D scenes rely on 2D images and off-the-shelf models to extract per-frame 2D panoptic segmentations before optimizing an implicit geometric representation. This is suboptimal as it fails to leverage the full potential of spatial relationships across views and requires computationally expensive test-time optimization for each scene. Method: The proposed method, PanSt3R, builds upon recent advances in 3D reconstruction (MUSt3R and DUSt3R) to jointly predict 3D geometry and multi-view panoptic segmentation in a single forward pass. It enhances these models with semantic awareness and introduces a more principled approach for multi-view segmentation post-processing. A simple method for generating novel-view predictions based on PanSt3R and vanilla 3DGS is also introduced. Result: The proposed PanSt3R eliminates the need for test-time optimization, making it faster and more scalable than existing methods. It achieves state-of-the-art performance on several benchmarks. Conclusion: PanSt3R is conceptually simple, fast, scalable, and achieves state-of-the-art performance on several benchmarks while being significantly faster than existing methods. Abstract: Panoptic segmentation of 3D scenes, involving the segmentation and classification of object instances in a dense 3D reconstruction of a scene, is a challenging problem, especially when relying solely on unposed 2D images. Existing approaches typically leverage off-the-shelf models to extract per-frame 2D panoptic segmentations, before optimizing an implicit geometric representation (often based on NeRF) to integrate and fuse the 2D predictions. We argue that relying on 2D panoptic segmentation for a problem inherently 3D and multi-view is likely suboptimal as it fails to leverage the full potential of spatial relationships across views. 
In addition to requiring camera parameters, these approaches also necessitate computationally expensive test-time optimization for each scene. Instead, in this work, we propose a unified and integrated approach PanSt3R, which eliminates the need for test-time optimization by jointly predicting 3D geometry and multi-view panoptic segmentation in a single forward pass. Our approach builds upon recent advances in 3D reconstruction, specifically upon MUSt3R, a scalable multi-view version of DUSt3R, and enhances it with semantic awareness and multi-view panoptic segmentation capabilities. We additionally revisit the standard post-processing mask merging procedure and introduce a more principled approach for multi-view segmentation. We also introduce a simple method for generating novel-view predictions based on the predictions of PanSt3R and vanilla 3DGS. Overall, the proposed PanSt3R is conceptually simple, yet fast and scalable, and achieves state-of-the-art performance on several benchmarks, while being orders of magnitude faster than existing methods.

[121] Generalizable Neural Electromagnetic Inverse Scattering

Yizhe Cheng,Chunxun Tian,Haoru Wang,Wentao Zhu,Xiaoxuan Ma,Yizhou Wang

Main category: cs.CV

TL;DR: The paper introduces a novel physics-informed framework for solving electromagnetic inverse scattering problems, offering improved accuracy, generalization, and robustness over existing methods.

Details Motivation: The motivation is to overcome limitations of existing methods like Img-Interiors, which require case-specific optimization, lack generalization, and fail under sparse transmitter setups. The goal is to develop a robust, generalizable solution for reconstructing relative permittivity in electromagnetic inverse scattering problems. Method: The paper proposes a two-stage inverse transmission-scattering process, introducing a generalizable intermediate representation called induced current. This approach decouples the nonlinear scattering process from the ill-posed inverse problem. Result: Extensive experiments show that the proposed method outperforms state-of-the-art approaches in reconstruction accuracy, generalization, and robustness, particularly under sparse transmitter setups. Conclusion: This paper concludes that their proposed physics-driven framework provides a new perspective on electromagnetic inverse scattering and significantly improves practical solutions for electromagnetic imaging. Abstract: Solving Electromagnetic Inverse Scattering Problems (EISP) is fundamental in applications such as medical imaging, where the goal is to reconstruct the relative permittivity from scattered electromagnetic field. This inverse process is inherently ill-posed and highly nonlinear, making it particularly challenging. A recent machine learning-based approach, Img-Interiors, shows promising results by leveraging continuous implicit functions. However, it requires case-specific optimization, lacks generalization to unseen data, and fails under sparse transmitter setups (e.g., with only one transmitter). To address these limitations, we revisit EISP from a physics-informed perspective, reformulating it as a two stage inverse transmission-scattering process. This formulation reveals the induced current as a generalizable intermediate representation, effectively decoupling the nonlinear scattering process from the ill-posed inverse problem. 
Built on this insight, we propose the first generalizable physics-driven framework for EISP, comprising a current estimator and a permittivity solver, working in an end-to-end manner. The current estimator explicitly learns the induced current as a physical bridge between the incident and scattered field, while the permittivity solver computes the relative permittivity directly from the estimated induced current. This design enables data-driven training and generalizable feed-forward prediction of relative permittivity on unseen data while maintaining strong robustness to transmitter sparsity. Extensive experiments show that our method outperforms state-of-the-art approaches in reconstruction accuracy, generalization, and robustness. This work offers a fundamentally new perspective on electromagnetic inverse scattering and represents a major step toward cost-effective practical solutions for electromagnetic imaging.

[122] ShotBench: Expert-Level Cinematic Understanding in Vision-Language Models

Hongbo Liu,Jingwen He,Yi Jin,Dian Zheng,Yuhao Dong,Fan Zhang,Ziqi Huang,Yinan He,Yangguang Li,Weichao Chen,Yu Qiao,Wanli Ouyang,Shengjie Zhao,Ziwei Liu

Main category: cs.CV

TL;DR: This paper introduces ShotBench, a benchmark for evaluating AI's understanding of cinematic language, and ShotVL, a model that achieves state-of-the-art results on this benchmark.

Details Motivation: Current Vision-Language Models lack robust evaluation in understanding the nuanced visual grammar of cinematography, limiting their effectiveness in visual comprehension and video generation. Method: The authors developed ShotBench, a benchmark with 3.5k QA pairs from Oscar-nominated films, and evaluated 24 VLMs. They also created ShotQA, a dataset of 70k QA pairs, and trained ShotVL using supervised fine-tuning and Group Relative Policy Optimization. Result: Top-performing VLMs achieved less than 60% average accuracy on ShotBench, especially struggling with spatial reasoning and fine-grained cues. ShotVL outperformed all existing models on ShotBench. Conclusion: The authors conclude that ShotBench effectively evaluates Vision-Language Models' understanding of cinematic language and that ShotVL sets a new state-of-the-art performance in this domain. Abstract: Cinematography, the fundamental visual language of film, is essential for conveying narrative, emotion, and aesthetic quality. While recent Vision-Language Models (VLMs) demonstrate strong general visual understanding, their proficiency in comprehending the nuanced cinematic grammar embedded within individual shots remains largely unexplored and lacks robust evaluation. This critical gap limits both fine-grained visual comprehension and the precision of AI-assisted video generation. To address this, we introduce ShotBench, a comprehensive benchmark specifically designed for cinematic language understanding. It features over 3.5k expert-annotated QA pairs from images and video clips, meticulously curated from over 200 acclaimed (predominantly Oscar-nominated) films and spanning eight key cinematography dimensions. Our evaluation of 24 leading VLMs on ShotBench reveals their substantial limitations: even the top-performing model achieves less than 60% average accuracy, particularly struggling with fine-grained visual cues and complex spatial reasoning. 
To catalyze advancement in this domain, we construct ShotQA, a large-scale multimodal dataset comprising approximately 70k cinematic QA pairs. Leveraging ShotQA, we develop ShotVL through supervised fine-tuning and Group Relative Policy Optimization. ShotVL significantly outperforms all existing open-source and proprietary models on ShotBench, establishing new state-of-the-art performance. We open-source our models, data, and code to foster rapid progress in this crucial area of AI-driven cinematic understanding and generation.
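Group Relative Policy Optimization scores each sampled answer relative to the other answers drawn for the same prompt. The sketch below shows only that normalization step, as commonly described for the technique; the function name, the `eps` constant, the choice of population standard deviation, and the 0/1 rewards are assumptions, and nothing here reflects ShotVL's actual training setup.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of sampled answers.

    Each advantage is the reward's distance from the group mean,
    scaled by the group standard deviation (eps avoids division by zero).
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards: 1.0 for a correct answer, 0.0 otherwise.
advantages = group_relative_advantages([1.0, 0.0, 1.0, 0.0])
```

Answers that beat the group average receive a positive advantage and the rest a negative one, so no separate value network is needed to provide a baseline.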

[123] CoPa-SG: Dense Scene Graphs with Parametric and Proto-Relations

Julian Lorenz,Mrunmai Phatak,Robin Schön,Katja Ludwig,Nico Hörmann,Annemarie Friedrich,Rainer Lienhart

Main category: cs.CV

TL;DR: This paper presents CoPa-SG, a new synthetic scene graph dataset, together with two new fundamental concepts for scene graphs, parametric relations and proto-relations, to address the shortage of accurate scene graph data and make scene graphs more expressive.

Details Motivation: Scene graph research is held back by a lack of accurate data, so a dataset with precise ground truth and rich relation annotations is needed. Method: The authors present CoPa-SG, a synthetic scene graph dataset with highly precise ground truth and exhaustive relation annotations, and introduce two new concepts: parametric relations and proto-relations. Result: Using CoPa-SG, the authors compare various scene graph generation models and show how the new relation types can enhance planning and reasoning capabilities. Conclusion: CoPa-SG provides high-quality scene graph data, extends the expressiveness of scene graphs through parametric and proto-relations, and demonstrates their potential in downstream applications. Abstract: 2D scene graphs provide a structural and explainable framework for scene understanding. However, current work still struggles with the lack of accurate scene graph data. To overcome this data bottleneck, we present CoPa-SG, a synthetic scene graph dataset with highly precise ground truth and exhaustive relation annotations between all objects. Moreover, we introduce parametric and proto-relations, two new fundamental concepts for scene graphs. The former provides a much more fine-grained representation than its traditional counterpart by enriching relations with additional parameters such as angles or distances. The latter encodes hypothetical relations in a scene graph and describes how relations would form if new objects are placed in the scene. Using CoPa-SG, we compare the performance of various scene graph generation models. We demonstrate how our new relation types can be integrated in downstream applications to enhance planning and reasoning capabilities.
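A parametric relation, as the abstract describes it, enriches a subject-predicate-object triple with parameters such as angles or distances. The following is a minimal sketch of one possible data structure; the class name, field names, the `relate` helper, and the use of 2D object centers are assumptions for illustration and do not reflect the CoPa-SG data format.

```python
import math
from dataclasses import dataclass


@dataclass
class ParametricRelation:
    """A scene graph edge carrying extra geometric parameters."""
    subject: str
    predicate: str
    obj: str
    distance: float   # Euclidean distance between object centers
    angle_deg: float  # direction from subject to object, in degrees


def relate(subject, subj_xy, obj, obj_xy, predicate="near"):
    """Build a parametric relation from two hypothetical 2D centers."""
    dx = obj_xy[0] - subj_xy[0]
    dy = obj_xy[1] - subj_xy[1]
    return ParametricRelation(
        subject=subject,
        predicate=predicate,
        obj=obj,
        distance=math.hypot(dx, dy),
        angle_deg=math.degrees(math.atan2(dy, dx)),
    )


edge = relate("cup", (0.0, 0.0), "table", (3.0, 4.0))
```

A traditional relation would record only the triple ("cup", "near", "table"); the added fields are what make the representation more fine-grained.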

[124] ToosiCubix: Monocular 3D Cuboid Labeling via Vehicle Part Annotations

Behrooz Nasihatkon,Hossein Resani,Amirreza Mehrzadian

Main category: cs.CV

TL;DR: ToosiCubix is a low-cost method for 3D cuboid annotation from monocular images: users click key feature points and an optimization procedure produces high-quality annotations, making it suitable for extending large-scale datasets.

Details Motivation: Existing 3D cuboid annotation methods rely on expensive multi-sensor setups (such as LiDAR or stereo cameras), which limits large-scale data collection, so a low-cost, easy-to-use method is needed. Method: Using monocular images and camera intrinsics, users click vehicle features (e.g., wheels, car badge, symmetries) to estimate position, orientation, and dimensions. An optimization that alternates between PnP and least-squares subproblems solves the 8 DoF estimate, and probabilistic size priors enable 9 DoF placements. Result: ToosiCubix produces accurate 3D cuboid annotations when validated against KITTI and Cityscapes3D, requiring only about 10 user clicks per vehicle. Conclusion: ToosiCubix is an efficient and practical 3D cuboid annotation method suited to existing datasets collected without specialized equipment, with annotation quality validated against KITTI and Cityscapes3D. Abstract: Many existing methods for 3D cuboid annotation of vehicles rely on expensive and carefully calibrated camera-LiDAR or stereo setups, limiting their accessibility for large-scale data collection. We introduce ToosiCubix, a simple yet powerful approach for annotating ground-truth cuboids using only monocular images and intrinsic camera parameters. Our method requires only about 10 user clicks per vehicle, making it highly practical for adding 3D annotations to existing datasets originally collected without specialized equipment. By annotating specific features (e.g., wheels, car badge, symmetries) across different vehicle parts, we accurately estimate each vehicle's position, orientation, and dimensions up to a scale ambiguity (8 DoF). The geometric constraints are formulated as an optimization problem, which we solve using a coordinate descent strategy, alternating between Perspective-n-Points (PnP) and least-squares subproblems. To handle common ambiguities such as scale and unobserved dimensions, we incorporate probabilistic size priors, enabling 9 DoF cuboid placements. We validate our annotations against the KITTI and Cityscapes3D datasets, demonstrating that our method offers a cost-effective and scalable solution for high-quality 3D cuboid annotation.

[125] CA-I2P: Channel-Adaptive Registration Network with Global Optimal Selection

Zhixin Cheng,Jiacheng Deng,Xinjun Li,Xiaotian Yin,Bohao Liao,Baoqun Yin,Wenfei Yang,Tianzhu Zhang

Main category: cs.CV

TL;DR: This work proposes two modules for image-to-point cloud registration, CAA and GOS, to improve matching accuracy and enable global optimization.

Details Motivation: Differences in feature channel attention between images and point clouds can degrade matching results, and similar structures in a scene can produce redundant correspondences in cross-modal matching, so a new method is needed to improve registration accuracy. Method: Following a detection-free coarse-to-fine pipeline, the authors introduce CAA to enhance intra-modal features and suppress cross-modal sensitivity, and GOS to replace local selection with global optimization. Result: Experiments show state-of-the-art image-to-point cloud registration performance on RGB-D Scenes V2 and 7-Scenes. Conclusion: The paper proposes the Channel Adaptive Adjustment Module (CAA) and the Global Optimal Selection Module (GOS) to address these issues in image-to-point cloud registration. Abstract: Detection-free methods typically follow a coarse-to-fine pipeline, extracting image and point cloud features for patch-level matching and refining dense pixel-to-point correspondences. However, differences in feature channel attention between images and point clouds may lead to degraded matching results, ultimately impairing registration accuracy. Furthermore, similar structures in the scene could lead to redundant correspondences in cross-modal matching. To address these issues, we propose Channel Adaptive Adjustment Module (CAA) and Global Optimal Selection Module (GOS). CAA enhances intra-modal features and suppresses cross-modal sensitivity, while GOS replaces local selection with global optimization. Experiments on RGB-D Scenes V2 and 7-Scenes demonstrate the superiority of our method, achieving state-of-the-art performance in image-to-point cloud registration.

[126] GenFlow: Interactive Modular System for Image Generation

Duc-Hung Nguyen,Huu-Phuc Huynh,Minh-Triet Tran,Trung-Nghia Le

Main category: cs.CV

TL;DR: This paper presents GenFlow, a new modular framework that lowers the technical barrier to generative art, featuring a node-based editor and an intelligent assistant.

Details Motivation: Generative art has great creative potential, but that potential remains untapped because of the technical expertise required for advanced architectural concepts and computational workflows. Method: The authors build a node-based editor and an intelligent assistant powered by natural language processing, and run a user study to evaluate them. Result: The user study demonstrated GenFlow's ability to optimize workflows, reduce task completion times, and enhance user understanding. Conclusion: GenFlow is a modular framework that improves the accessibility and efficiency of generative art through intuitive design and intelligent assistance. Abstract: Generative art unlocks boundless creative possibilities, yet its full potential remains untapped due to the technical expertise required for advanced architectural concepts and computational workflows. To bridge this gap, we present GenFlow, a novel modular framework that empowers users of all skill levels to generate images with precision and ease. Featuring a node-based editor for seamless customization and an intelligent assistant powered by natural language processing, GenFlow transforms the complexity of workflow creation into an intuitive and accessible experience. By automating deployment processes and minimizing technical barriers, our framework makes cutting-edge generative art tools available to everyone. A user study demonstrated GenFlow's ability to optimize workflows, reduce task completion times, and enhance user understanding through its intuitive interface and adaptive features. These results position GenFlow as a groundbreaking solution that redefines accessibility and efficiency in the realm of generative art.

[127] FastRef: Fast Prototype Refinement for Few-Shot Industrial Anomaly Detection

Long Tian,Yufei Li,Yuyang Dai,Wenchao Chen,Xiyang Liu,Bo Chen

Main category: cs.CV

TL;DR: The paper proposes FastRef, a new method for few-shot industrial anomaly detection that refines prototypes through an iterative two-stage process, improving detection quality and computational efficiency in data-scarce settings.

Details Motivation: Existing few-shot industrial anomaly detection methods do not systematically use query image statistics to make prototypes more representative, which matters most in data-scarce settings. Method: FastRef transfers characteristics of query features to prototypes through an optimizable transformation matrix, and uses optimal transport (OT) on non-Gaussian sampled features to align prototypes with their refined counterparts and suppress anomalies. Result: Integrated with four competitive prototype-based methods, FastRef is effective and computationally efficient in 1/2/4-shot settings on four benchmark datasets: MVTec, ViSA, MPDD, and RealIAD. Conclusion: FastRef provides an effective and efficient prototype refinement framework for few-shot industrial anomaly detection that applies to a range of existing methods. Abstract: Few-shot industrial anomaly detection (FS-IAD) presents a critical challenge for practical automated inspection systems operating in data-scarce environments. While existing approaches predominantly focus on deriving prototypes from limited normal samples, they typically neglect to systematically incorporate query image statistics to enhance prototype representativeness. To address this issue, we propose FastRef, a novel and efficient prototype refinement framework for FS-IAD. Our method operates through an iterative two-stage process: (1) characteristic transfer from query features to prototypes via an optimizable transformation matrix, and (2) anomaly suppression through prototype alignment. The characteristic transfer is achieved through linear reconstruction of query features from prototypes, while the anomaly suppression addresses a key observation in FS-IAD that unlike conventional IAD with abundant normal prototypes, the limited-sample setting makes anomaly reconstruction more probable. Therefore, we employ optimal transport (OT) for non-Gaussian sampled features to measure and minimize the gap between prototypes and their refined counterparts for anomaly suppression. For comprehensive evaluation, we integrate FastRef with four competitive prototype-based FS-IAD methods: PatchCore, FastRecon, WinCLIP, and AnomalyDINO. Extensive experiments across four benchmark datasets of MVTec, ViSA, MPDD and RealIAD demonstrate both the effectiveness and computational efficiency of our approach under 1/2/4-shots.

[128] Curve-Aware Gaussian Splatting for 3D Parametric Curve Reconstruction

Zhirui Gao,Renjiao Yi,Yaqiao Dai,Xuening Zhu,Wei Chen,Chenyang Zhu,Kai Xu

Main category: cs.CV

TL;DR: This paper proposes an end-to-end framework for reconstructing 3D parametric curves directly from multi-view edge maps by introducing a novel representation called CurveGaussian and a dynamic topology optimization approach, resulting in improved accuracy and efficiency over traditional two-stage methods.

Details Motivation: The motivation is to overcome the limitations of existing two-stage methods for reconstructing 3D parametric curves, which suffer from error accumulation due to disconnected stages. The authors aim to develop a more efficient and accurate one-stage approach. Method: The paper introduces a bi-directional coupling mechanism between parametric curves and edge-oriented Gaussian components, called CurveGaussian, which allows differentiable rendering of 3D curves. They also propose a dynamically adaptive topology optimization framework during training to refine curve structures through various operations. Result: The results show that the proposed one-stage method achieves superior performance on datasets like ABC and real-world benchmarks, producing cleaner and more robust 3D curve reconstructions while significantly reducing training parameters. Conclusion: The paper concludes that their one-stage method outperforms two-stage alternatives in reconstructing 3D parametric curves, offering cleaner and more robust reconstructions while reducing training parameters, thus achieving higher efficiency and performance. Abstract: This paper presents an end-to-end framework for reconstructing 3D parametric curves directly from multi-view edge maps. Contrasting with existing two-stage methods that follow a sequential "edge point cloud reconstruction and parametric curve fitting" pipeline, our one-stage approach optimizes 3D parametric curves directly from 2D edge maps, eliminating error accumulation caused by the inherent optimization gap between disconnected stages. However, parametric curves inherently lack suitability for rendering-based multi-view optimization, necessitating a complementary representation that preserves their geometric properties while enabling differentiable rendering. We propose a novel bi-directional coupling mechanism between parametric curves and edge-oriented Gaussian components. 
This tight correspondence formulates a curve-aware Gaussian representation, CurveGaussian, that enables differentiable rendering of 3D curves, allowing direct optimization guided by multi-view evidence. Furthermore, we introduce a dynamically adaptive topology optimization framework during training to refine curve structures through linearization, merging, splitting, and pruning operations. Comprehensive evaluations on the ABC dataset and real-world benchmarks demonstrate our one-stage method's superiority over two-stage alternatives, particularly in producing cleaner and more robust reconstructions. Additionally, by directly optimizing parametric curves, our method significantly reduces the parameter count during training, achieving both higher efficiency and superior performance compared to existing approaches.

[129] XVerse: Consistent Multi-Subject Control of Identity and Semantic Attributes via DiT Modulation

Bowen Chen,Mengyi Zhao,Haomiao Sun,Li Chen,Xu Wang,Kang Du,Xinglong Wu

Main category: cs.CV

TL;DR: XVerse is a novel multi-subject controlled generation model that enables precise and independent control for specific subjects in text-to-image generation, improving editability, coherence, and control over individual subject characteristics and semantic attributes.

Details Motivation: Achieving fine-grained control over subject identity and semantic attributes in text-to-image generation often undermines the editability and coherence of Diffusion Transformers (DiTs). Many approaches introduce artifacts or suffer from attribute entanglement. Method: XVerse transforms reference images into offsets for token-specific text-stream modulation, enabling precise and independent control for specific subjects without disrupting image latents or features. Result: XVerse offers high-fidelity, editable multi-subject image synthesis with robust control over individual subject characteristics and semantic attributes. It significantly improves personalized and complex scene generation capabilities. Conclusion: XVerse provides high-fidelity, editable multi-subject image synthesis with robust control over individual subject characteristics and semantic attributes. Abstract: Achieving fine-grained control over subject identity and semantic attributes (pose, style, lighting) in text-to-image generation, particularly for multiple subjects, often undermines the editability and coherence of Diffusion Transformers (DiTs). Many approaches introduce artifacts or suffer from attribute entanglement. To overcome these challenges, we propose a novel multi-subject controlled generation model XVerse. By transforming reference images into offsets for token-specific text-stream modulation, XVerse allows for precise and independent control for specific subjects without disrupting image latents or features. Consequently, XVerse offers high-fidelity, editable multi-subject image synthesis with robust control over individual subject characteristics and semantic attributes. This advancement significantly improves personalized and complex scene generation capabilities.

[130] EndoFlow-SLAM: Real-Time Endoscopic SLAM with Flow-Constrained Gaussian Splatting

Taoyu Wu,Yiyi Miao,Zhuoxiao Li,Haocheng Zhao,Kang Dang,Jionglong Su,Limin Yu,Haoang Li

Main category: cs.CV

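TL;DR: This paper proposes a SLAM method based on 3D Gaussian Splatting (3DGS) that adds an optical flow loss and a depth regularization strategy for endoscopic scenes, addressing photometric inconsistency and dynamic motion.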

Details Motivation: In endoscopic surgical scenes, photometric inconsistencies caused by non-Lambertian surfaces and dynamic motion such as breathing degrade the performance of SLAM systems, so a new method is needed to improve the accuracy of scene reconstruction and camera pose estimation. Method: The method introduces an optical flow loss as a geometric constraint within a 3DGS-based SLAM framework and proposes a depth regularization strategy to mitigate photometric inconsistencies. It also refines the 3DGS optimization strategy to focus on keyframe viewpoints with poor rendering quality, improving overall rendering. Result: Extensive experiments on the C3VD static dataset and the StereoMIS dynamic dataset show that the method outperforms existing state-of-the-art methods in novel view synthesis and pose estimation, performing well in both static and dynamic surgical scenes. Conclusion: By introducing an optical flow loss and a depth regularization strategy, the proposed 3DGS-SLAM method effectively addresses photometric inconsistency and dynamic motion in endoscopic scenes and improves scene reconstruction and real-time visualization. Abstract: Efficient three-dimensional reconstruction and real-time visualization are critical in surgical scenarios such as endoscopy. In recent years, 3D Gaussian Splatting (3DGS) has demonstrated remarkable performance in efficient 3D reconstruction and rendering. Most 3DGS-based Simultaneous Localization and Mapping (SLAM) methods rely only on appearance constraints for optimizing both 3DGS and camera poses. However, in endoscopic scenarios, photometric inconsistencies caused by non-Lambertian surfaces and dynamic motion from breathing degrade the performance of SLAM systems. To address these issues, we additionally introduce optical flow loss as a geometric constraint, which effectively constrains both the 3D structure of the scene and the camera motion. Furthermore, we propose a depth regularisation strategy to mitigate the problem of photometric inconsistencies and ensure the validity of 3DGS depth rendering in endoscopic scenes. In addition, to improve scene representation in the SLAM system, we improve the 3DGS refinement strategy by focusing on viewpoints corresponding to keyframes with suboptimal rendering quality, achieving better rendering results. Extensive experiments on the C3VD static dataset and the StereoMIS dynamic dataset demonstrate that our method outperforms existing state-of-the-art methods in novel view synthesis and pose estimation, exhibiting high performance in both static and dynamic surgical scenes. The source code will be publicly available upon paper acceptance.

[131] HyperSORT: Self-Organising Robust Training with hyper-networks

Samuel Joutard,Marijn Stollenga,Marc Balle Sanchez,Mohammad Farid Azampour,Raphael Prevost

Main category: cs.CV

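TL;DR: This paper proposes HyperSORT, which uses a hyper-network to learn a distribution of UNet parameters in order to identify biases in medical imaging datasets and improve segmentation robustness.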

Details Motivation: Medical imaging datasets contain heterogeneous biases, such as erroneous labels and inconsistent annotation styles, which hurt the performance of deep segmentation networks, and identifying these biases is tedious and challenging. Method: The paper introduces HyperSORT, a framework in which a hyper-network predicts UNet parameters from latent vectors representing image and annotation variability; the hyper-network parameters and the collection of latent vectors are learned jointly. Result: Experiments on two public 3D abdominal CT datasets (AMOS and TotalSegmentator) show that HyperSORT creates a structured mapping of the dataset, allowing relevant systematic biases and erroneous samples to be identified. Conclusion: HyperSORT can identify and characterize systematic biases in medical imaging datasets, and it segments organs robustly by learning a complex distribution of UNet parameters. Abstract: Medical imaging datasets often contain heterogeneous biases ranging from erroneous labels to inconsistent labeling styles. Such biases can negatively impact the performance of deep segmentation networks. Yet, the identification and characterization of such biases is a particularly tedious and challenging task. In this paper, we introduce HyperSORT, a framework using a hyper-network predicting UNets' parameters from latent vectors representing both the image and annotation variability. The hyper-network parameters and the latent vector collection corresponding to each data sample from the training set are jointly learned. Hence, instead of optimizing a single neural network to fit a dataset, HyperSORT learns a complex distribution of UNet parameters where low density areas can capture noise-specific patterns while larger modes robustly segment organs in differentiated but meaningful manners. We validate our method on two 3D abdominal CT public datasets: first a synthetically perturbed version of the AMOS dataset, and TotalSegmentator, a large scale dataset containing real unknown biases and errors. Our experiments show that HyperSORT creates a structured mapping of the dataset allowing the identification of relevant systematic biases and erroneous samples. Latent space clusters yield UNet parameters performing the segmentation task in accordance with the underlying learned systematic bias. The code and our analysis of the TotalSegmentator dataset are made available: https://github.com/ImFusionGmbH/HyperSORT

[132] Benchmarking Deep Learning and Vision Foundation Models for Atypical vs. Normal Mitosis Classification with Cross-Dataset Evaluation

Sweta Banerjee,Viktoria Weiss,Taryn A. Donovan,Rutger A. Fick,Thomas Conrad,Jonas Ammeling,Nils Porsche,Robert Klopfleisch,Christopher Kaltenecker,Katharina Breininger,Marc Aubreville,Christof A. Bertram

Main category: cs.CV

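TL;DR: This study presents a deep learning benchmark for atypical mitosis classification in breast cancer, shows that transfer learning and model fine-tuning can effectively address the problem, and releases the associated code and data.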

Details Motivation: Atypical mitoses are an important marker of tumor malignancy, but they are hard to identify because of low prevalence, subtle morphological differences, low inter-rater agreement among pathologists, and class imbalance in datasets. Method: Building on the AMi-Br dataset, the study compares several deep learning approaches to atypical mitosis classification: baseline models, foundation models with linear probing, and foundation models fine-tuned with low-rank adaptation (LoRA). It also introduces two new hold-out AMF datasets, AtNorM-Br and AtNorM-MD, for rigorous evaluation. Result: The best models reach average balanced accuracy of up to 0.8135 on AMi-Br, and 0.7696 and 0.7705 on the two new out-of-domain datasets AtNorM-Br and AtNorM-MD, respectively, with LoRA-based adaptation of the Virchow line of foundation models performing particularly well. Conclusion: The study shows that atypical mitosis classification, although challenging, can be effectively addressed with recent advances in transfer learning and model fine-tuning. The authors also make all code and data publicly available. Abstract: Atypical mitoses mark a deviation in the cell division process that can be an independent prognostically relevant marker for tumor malignancy. However, their identification remains challenging due to low prevalence, at times subtle morphological differences from normal mitoses, low inter-rater agreement among pathologists, and class imbalance in datasets. Building on the Atypical Mitosis dataset for Breast Cancer (AMi-Br), this study presents a comprehensive benchmark comparing deep learning approaches for automated atypical mitotic figure (AMF) classification, including baseline models, foundation models with linear probing, and foundation models fine-tuned with low-rank adaptation (LoRA). For rigorous evaluation, we further introduce two new hold-out AMF datasets: AtNorM-Br, a dataset of mitoses from the TCGA breast cancer cohort, and AtNorM-MD, a multi-domain dataset of mitoses from the MIDOG++ training set. We found average balanced accuracy values of up to 0.8135, 0.7696, and 0.7705 on the in-domain AMi-Br and the out-of-domain AtNorM-Br and AtNorM-MD datasets, respectively, with the results being particularly good for LoRA-based adaptation of the Virchow-line of foundation models. Our work shows that atypical mitosis classification, while being a challenging problem, can be effectively addressed through the use of recent advances in transfer learning and model fine-tuning techniques. 
We make available all code and data used in this paper in this github repository: https://github.com/DeepMicroscopy/AMi-Br_Benchmark.
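The results above are reported as balanced accuracy, which matters here because atypical mitoses are the minority class. As a minimal sketch of the metric (the mean of per-class recall), the following uses made-up labels rather than anything from the AMi-Br or AtNorM datasets; the function name and the toy data are illustrative assumptions.

```python
def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall, so each class counts equally."""
    recalls = []
    for cls in sorted(set(y_true)):
        # Indices of all samples whose true label is this class.
        idx = [i for i, t in enumerate(y_true) if t == cls]
        correct = sum(1 for i in idx if y_pred[i] == cls)
        recalls.append(correct / len(idx))
    return sum(recalls) / len(recalls)


# Toy, imbalanced example (hypothetical): 8 normal (0) and 2 atypical (1).
y_true = [0] * 8 + [1] * 2
y_pred = [0] * 8 + [0, 1]  # one atypical figure is missed

plain_accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
score = balanced_accuracy(y_true, y_pred)
```

In this toy case plain accuracy is 0.9 while balanced accuracy is 0.75, because recall on the minority class is only 0.5.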

[133] Controllable 3D Placement of Objects with Scene-Aware Diffusion Models

Mohamed Omran,Dimitris Kalatzis,Jens Petersen,Amirhossein Habibian,Auke Wiggers

Main category: cs.CV

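TL;DR: This paper presents an image editing method that places objects with high precision by combining coarse object masks with a visual map, while leaving the background unchanged.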

Details Motivation: Although text-conditioned generative models have made image editing more powerful and flexible, placing an object at a specific location and orientation remains challenging and typically requires carefully crafted inpainting masks or prompts. Method: The authors design a conditioning signal that combines coarse object masks with a visual map, and build on an inpainting model to perform the edit. Result: The authors demonstrate the method in the automotive setting, comparing different conditioning signals on novel object placement tasks and showing its advantages in appearance, pose, and location accuracy. Conclusion: The paper proposes a new method built on an inpainting model that, by constructing a suitable conditioning signal, achieves high-quality edits of object location, shape, and orientation while keeping the background intact. Abstract: Image editing approaches have become more powerful and flexible with the advent of powerful text-conditioned generative models. However, placing objects in an environment with a precise location and orientation still remains a challenge, as this typically requires carefully crafted inpainting masks or prompts. In this work, we show that a carefully designed visual map, combined with coarse object masks, is sufficient for high quality object placement. We design a conditioning signal that resolves ambiguities, while being flexible enough to allow for changing of shapes or object orientations. By building on an inpainting model, we leave the background intact by design, in contrast to methods that model objects and background jointly. We demonstrate the effectiveness of our method in the automotive setting, where we compare different conditioning signals in novel object placement tasks. These tasks are designed to measure edit quality not only in terms of appearance, but also in terms of pose and location accuracy, including cases that require non-trivial shape changes. Lastly, we show that fine location control can be combined with appearance control to place existing objects in precise locations in a scene.

[134] A Comprehensive Dataset for Underground Miner Detection in Diverse Scenario

Cyrus Addy,Ajay Kumar Gurumadaiah,Yixiang Gao,Kwame Awuah-Offei

Main category: cs.CV

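TL;DR: This paper introduces a new thermal imaging dataset intended to advance miner detection systems, evaluates several state-of-the-art object detection algorithms on it, and demonstrates the feasibility of thermal imaging for miner detection.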

Details Motivation: Underground mining operations face major safety challenges, which makes emergency response capability critical. Robots show promise for search and rescue, but their effectiveness depends on reliable miner detection. Deep learning algorithms offer a potential solution but need comprehensive training datasets, which are currently lacking for underground mining environments. Method: The paper presents a new thermal imaging dataset and evaluates state-of-the-art object detection algorithms on it, including YOLOv8, YOLOv10, YOLO11, and RT-DETR. Result: The authors systematically captured thermal imagery of various mining activities and scenarios to create a robust foundation for detection algorithms. Although it does not cover every possible emergency situation, the dataset is a crucial first step toward reliable thermal-based miner detection systems. Conclusion: The paper concludes that thermal imaging is feasible for miner detection and lays a foundation for future research in this critical safety application. Abstract: Underground mining operations face significant safety challenges that make emergency response capabilities crucial. While robots have shown promise in assisting with search and rescue operations, their effectiveness depends on reliable miner detection capabilities. Deep learning algorithms offer potential solutions for automated miner detection, but require comprehensive training datasets, which are currently lacking for underground mining environments. This paper presents a novel thermal imaging dataset specifically designed to enable the development and validation of miner detection systems for potential emergency applications. We systematically captured thermal imagery of various mining activities and scenarios to create a robust foundation for detection algorithms. To establish baseline performance metrics, we evaluated several state-of-the-art object detection algorithms including YOLOv8, YOLOv10, YOLO11, and RT-DETR on our dataset. While not exhaustive of all possible emergency situations, this dataset serves as a crucial first step toward developing reliable thermal-based miner detection systems that could eventually be deployed in real emergency scenarios. This work demonstrates the feasibility of using thermal imaging for miner detection and establishes a foundation for future research in this critical safety application.

[135] Rethinking Oversaturation in Classifier-Free Guidance via Low Frequency

Kaiyu Song,Hanjiang Lai

Main category: cs.CV

TL;DR: This paper proposes LF-CFG to reduce oversaturation and artifacts in diffusion models by managing redundant low-frequency information.

Details Motivation: High guidance scales in classifier-free guidance cause oversaturation and artifacts; addressing this requires analyzing low-frequency signal behavior. Method: An adaptive threshold-based measurement identifies redundant information, and a down-weight strategy reduces its impact in low-frequency signals. Result: LF-CFG successfully mitigates issues across multiple diffusion models, including Stable Diffusion-XL, Stable Diffusion 2.1, 3.0, 3.5, and SiT-XL. Conclusion: LF-CFG effectively addresses oversaturation and unrealistic artifacts in diffusion models by targeting redundant low-frequency information. Abstract: Classifier-free guidance (CFG) succeeds in conditional diffusion models, which use a guidance scale to balance the influence of conditional and unconditional terms. A high guidance scale is used to enhance the performance of the conditional term. However, a high guidance scale often results in oversaturation and unrealistic artifacts. In this paper, we introduce a new perspective based on low-frequency signals, identifying the accumulation of redundant information in these signals as the key factor behind oversaturation and unrealistic artifacts. Building on this insight, we propose low-frequency improved classifier-free guidance (LF-CFG) to mitigate these issues. Specifically, we introduce an adaptive threshold-based measurement to pinpoint the locations of redundant information. We determine a reasonable threshold by analyzing the change rate of low-frequency information between prior and current steps. We then apply a down-weight strategy to reduce the impact of redundant information in the low-frequency signals. Experimental results demonstrate that LF-CFG effectively alleviates oversaturation and unrealistic artifacts across various diffusion models, including Stable Diffusion-XL, Stable Diffusion 2.1, 3.0, 3.5, and SiT-XL.
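To make the guidance-scale trade-off concrete, here is a minimal sketch of the standard classifier-free guidance combination on a 1-D signal, followed by a deliberately simplified low-frequency down-weighting. The moving-average filter, the fixed `low_weight`, and all function names are illustrative assumptions; the paper's adaptive threshold and its actual down-weight rule are not reproduced here.

```python
def cfg(eps_uncond, eps_cond, scale):
    """Standard CFG: move from the unconditional prediction toward the conditional one."""
    return [u + scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]


def moving_average(x, k=3):
    """Crude 1-D low-pass filter (hypothetical stand-in for a real frequency split)."""
    half = k // 2
    out = []
    for i in range(len(x)):
        window = x[max(0, i - half): min(len(x), i + half + 1)]
        out.append(sum(window) / len(window))
    return out


def cfg_low_freq_downweighted(eps_uncond, eps_cond, scale, low_weight=0.5):
    """Assumed simplification: shrink only the low-frequency part of the guidance term."""
    diff = [c - u for u, c in zip(eps_uncond, eps_cond)]
    low = moving_average(diff)
    high = [d - l for d, l in zip(diff, low)]
    return [u + scale * (low_weight * l + h)
            for u, l, h in zip(eps_uncond, low, high)]


eps_u = [0.0, 1.0, 0.5, -0.5]
eps_c = [1.0, 3.0, 0.0, 0.5]
guided = cfg(eps_u, eps_c, scale=7.5)
damped = cfg_low_freq_downweighted(eps_u, eps_c, scale=7.5, low_weight=0.5)
```

With `low_weight=1.0` the second function reduces to plain CFG, which gives a quick sanity check on the low/high split.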

[136] Evaluation of Traffic Signals for Daily Traffic Pattern

Mohammad Shokrolah Shirazi,Hung-Fu Chang

Main category: cs.CV

TL;DR: This study proposes dynamic, static, and hybrid signal timing methods for traffic management, showing that hybrid methods are best suited for uneven traffic distributions across intersection zones.

Details Motivation: To improve traffic flow management at intersections by adapting signal timing configurations to match varying traffic patterns throughout the day. Method: A vision-based tracking system and simulation using Simulation of Urban MObility were employed to evaluate dynamic, static, and hybrid signal timing methods based on vehicle count distributions across different zones. Result: Initial results showed that cycle times of 90 and 120 seconds were most effective. The dynamic configuration performed better at four intersections, while the hybrid method improved performance during peak and off-peak hours. Conclusion: The hybrid method is more effective for managing traffic in zones with highly weighted traffic distribution, while the static method suits even traffic distribution. Abstract: The turning movement count data is crucial for traffic signal design, intersection geometry planning, traffic flow, and congestion analysis. This work proposes three methods called dynamic, static, and hybrid configuration for TMC-based traffic signals. A vision-based tracking system is developed to estimate the TMC of six intersections in Las Vegas using traffic cameras. The intersection design, route (e.g. vehicle movement directions), and signal configuration files with compatible formats are synthesized and imported into Simulation of Urban MObility for signal evaluation with realistic data. The initial experimental results based on estimated waiting times indicate that the cycle time of 90 and 120 seconds works best for all intersections. In addition, four intersections show better performance for dynamic signal timing configuration, and the other two with lower performance have a lower ratio of total vehicle count to total lanes of the intersection leg. 
Since daily traffic flow often exhibits a bimodal pattern, we propose a hybrid signal method that switches between dynamic and static methods, adapting to peak and off-peak traffic conditions for improved flow management. To this end, a built-in traffic generator module creates vehicle routes for 4 hours, including peak hours, and a signal design module produces signal schedule cycles according to static, dynamic, and hybrid methods. Vehicle count distributions are weighted differently for each zone (i.e., West, North, East, South) to generate diverse traffic patterns. The extended experimental results for 6 intersections with 4 hours of simulation time imply that zone-based traffic pattern distributions affect signal design selection. Although the static method works well when traffic is evenly distributed across zones, the hybrid method works well for highly weighted traffic at intersection pairs of the West-East and North-South zones.
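The switching idea can be sketched in a few lines. Everything below is a hypothetical simplification: the abstract does not say how green time is allocated or which mode is used in which period, so the equal split, the count-proportional split, the choice of dynamic timing at peak, and the `peak_threshold` parameter are all assumptions, and real signal plans work with phases, yellow intervals, and minimum greens.

```python
def static_split(cycle_s, zones):
    """Assumed static plan: every zone gets an equal share of the cycle."""
    share = cycle_s / len(zones)
    return {zone: share for zone in zones}


def dynamic_split(cycle_s, counts):
    """Assumed dynamic plan: green time proportional to each zone's vehicle count."""
    total = sum(counts.values())
    if total == 0:
        return static_split(cycle_s, list(counts))
    return {zone: cycle_s * n / total for zone, n in counts.items()}


def hybrid_split(cycle_s, counts, peak_threshold):
    """Assumed hybrid rule: dynamic timing at peak load, static timing otherwise."""
    if sum(counts.values()) >= peak_threshold:
        return dynamic_split(cycle_s, counts)
    return static_split(cycle_s, list(counts))


# Made-up counts with West-East dominant traffic and a 90-second cycle.
peak_counts = {"West": 30, "North": 10, "East": 40, "South": 10}
off_peak_counts = {"West": 3, "North": 2, "East": 4, "South": 1}
peak_plan = hybrid_split(90, peak_counts, peak_threshold=50)
off_peak_plan = hybrid_split(90, off_peak_counts, peak_threshold=50)
```

The 90-second cycle is one of the two cycle times the abstract reports as working best; the counts and the threshold of 50 vehicles are invented for illustration.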

[137] Global and Local Entailment Learning for Natural World Imagery

Srikumar Sastry,Aayush Dhakal,Eric Xing,Subash Khanal,Nathan Jacobs

Main category: cs.CV

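TL;DR: This paper proposes the Radial Cross-Modal Embeddings (RCME) framework to address the problem of learning hierarchical data structure in vision-language models, and demonstrates improved performance on practical tasks.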

Details Motivation: To address the significant challenge of learning the hierarchical structure of data in vision-language models, and to model the transitivity of entailment explicitly. Method: The paper introduces the Radial Cross-Modal Embeddings (RCME) framework, which optimizes for the partial order of concepts within vision-language models. Result: The authors develop a vision-language foundation model capable of representing the hierarchy in the Tree of Life and show improved performance on hierarchical species classification and hierarchical retrieval tasks. Conclusion: RCME addresses hierarchical structure learning in vision-language models and improves on existing state-of-the-art models. Abstract: Learning the hierarchical structure of data in vision-language models is a significant challenge. Previous works have attempted to address this challenge by employing entailment learning. However, these approaches fail to model the transitive nature of entailment explicitly, which establishes the relationship between order and semantics within a representation space. In this work, we introduce Radial Cross-Modal Embeddings (RCME), a framework that enables the explicit modeling of transitivity-enforced entailment. Our proposed framework optimizes for the partial order of concepts within vision-language models. By leveraging our framework, we develop a hierarchical vision-language foundation model capable of representing the hierarchy in the Tree of Life. Our experiments on hierarchical species classification and hierarchical retrieval tasks demonstrate the enhanced performance of our models compared to the existing state-of-the-art models. Our code and models are open-sourced at https://vishu26.github.io/RCME/index.html.

[138] TITAN: Query-Token based Domain Adaptive Adversarial Learning

Tajamul Ashraf,Janibul Bashir

Main category: cs.CV

TL;DR: TITAN improves domain adaptive object detection by generating reliable pseudo-labels and reducing domain gaps, achieving state-of-the-art performance.

Details Motivation: The motivation is to address the degradation of student models due to noisy pseudo-labels caused by domain bias and shifts when adapting object detection models without access to source data. Method: TITAN uses a Target-based Iterative Query-Token Adversarial Network to separate target images into easy and hard subsets and incorporates adversarial modules into a student-teacher framework for domain adaptation. Result: Experiments showed TITAN outperformed state-of-the-art methods with mAP improvements of +22.7%, +22.2%, +21.1%, and +3.7% on specific benchmarks. Conclusion: TITAN demonstrates superior performance in SF-DAOD by effectively addressing domain shifts and reducing domain gaps, achieving significant mAP improvements over existing methodologies. Abstract: We focus on the source-free domain adaptive object detection (SF-DAOD) problem when source data is unavailable during adaptation and the model must adapt to an unlabeled target domain. The majority of approaches for the problem employ a self-supervised approach using a student-teacher (ST) framework where pseudo-labels are generated via a source-pretrained model for further fine-tuning. We observe that the performance of a student model often degrades drastically, due to the collapse of the teacher model, primarily caused by high noise in pseudo-labels, resulting from domain bias, discrepancies, and a significant domain shift across domains. To obtain reliable pseudo-labels, we propose a Target-based Iterative Query-Token Adversarial Network (TITAN), which separates the target images into two subsets: those similar to the source (easy) and those dissimilar (hard). We propose a strategy to estimate variance to partition the target domain. This approach leverages the insight that higher detection variances correspond to higher recall and greater similarity to the source domain. 
Also, we incorporate query-token-based adversarial modules into a student-teacher baseline framework to reduce the domain gaps between two feature representations. Experiments conducted on four natural imaging datasets and two challenging medical datasets have substantiated the superior performance of TITAN compared to existing state-of-the-art (SOTA) methodologies. We report an mAP improvement of +22.7, +22.2, +21.1, and +3.7 percent over the current SOTA on C2F, C2B, S2C, and K2C benchmarks, respectively.

[139] Towards Reliable Detection of Empty Space: Conditional Marked Point Processes for Object Detection

Tobias J. Riedlinger,Kira Maag,Hanno Gottschalk

Main category: cs.CV

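TL;DR: This paper proposes an object detection model grounded in spatial statistics to address shortcomings of deep neural networks in uncertainty estimation.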

Details Motivation: Confidence estimates from existing object detectors and segmentation models are often miscalibrated, and they cannot quantify uncertainty in regions where no object was detected, which poses a risk in safety-relevant applications such as automated driving. Method: The approach uses the probabilistic properties of marked point processes, combining bounding box data with a spatial statistical model to enable likelihood-based training and well-defined confidence estimates for whether a region is drivable. Result: The method's effectiveness is verified through calibration assessments and performance evaluation. Conclusion: The study provides a new statistical framework for object detection that improves the model's uncertainty estimation and suits scenarios with high safety requirements. Abstract: Deep neural networks have set the state-of-the-art in computer vision tasks such as bounding box detection and semantic segmentation. Object detectors and segmentation models assign confidence scores to predictions, reflecting the model's uncertainty in object detection or pixel-wise classification. However, these confidence estimates are often miscalibrated, as their architectures and loss functions are tailored to task performance rather than probabilistic foundation. Even with well-calibrated predictions, object detectors fail to quantify uncertainty outside detected bounding boxes, i.e., the model does not make a probability assessment of whether an area without detected objects is truly free of obstacles. This poses a safety risk in applications such as automated driving, where uncertainty in empty areas remains unexplored. In this work, we propose an object detection model grounded in spatial statistics. Bounding box data matches realizations of a marked point process, commonly used to describe the probabilistic occurrence of spatial point events identified as bounding box centers, where marks are used to describe the spatial extension of bounding boxes and classes. Our statistical framework enables a likelihood-based training and provides well-defined confidence estimates for whether a region is drivable, i.e., free of objects. We demonstrate the effectiveness of our method through calibration assessments and evaluation of performance.
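One way to see why a point-process view yields a probability for empty space: for a Poisson point process with intensity λ, the probability that a region A contains no points is exp(-∫_A λ). The sketch below evaluates this on a grid. It assumes a plain Poisson process and a made-up intensity map; the paper's conditional marked point process, which also models box extents and classes, may define this quantity differently.

```python
import math


def void_probability(intensity_grid, cell_area):
    """P(no object centers in the region) for a Poisson process.

    intensity_grid holds the intensity (expected centers per unit area)
    at each grid cell of the region; the integral is a Riemann sum.
    """
    expected_count = sum(sum(row) for row in intensity_grid) * cell_area
    return math.exp(-expected_count)


# Hypothetical intensity maps over a 2x2 patch with unit-area cells.
quiet_patch = [[0.0, 0.0], [0.0, 0.0]]
busy_patch = [[0.5, 0.5], [0.5, 0.5]]

p_quiet = void_probability(quiet_patch, cell_area=1.0)
p_busy = void_probability(busy_patch, cell_area=1.0)
```

Here `p_quiet` is 1 because no objects are expected, while `p_busy` is exp(-2), roughly 0.135, because two object centers are expected in the patch.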

[140] Mitigating Hallucination of Large Vision-Language Models via Dynamic Logits Calibration

Jiahe Chen,Jiaying He,Qian Shao,Qiyuan Chen,Jiahe Ying,Hongxia Xu,Jintai Chen,Jianwei Zheng,Jian Wu

Main category: cs.CV

TL;DR: This paper proposes Dynamic Logits Calibration (DLC), a training-free decoding method that reduces hallucinations in vision-language models by dynamically aligning text generation with visual evidence while maintaining efficiency and output quality.

Details Motivation: The motivation stems from the issue of hallucinations in LVLMs, where generated text contradicts visual input. Existing training-free decoding strategies face limitations such as static constraints, inefficiency due to multiple forward passes, and loss of detail from rigid intervention rules. The study aims to overcome these challenges by proposing a dynamic and efficient solution. Method: The paper introduces a training-free decoding framework called Dynamic Logits Calibration (DLC). It dynamically aligns text generation with visual evidence during inference. DLC uses CLIP to evaluate semantic alignment between input images and generated text, assesses the Relative Visual Advantage (RVA) of candidate tokens, and adaptively adjusts output logits. An adaptive weighting mechanism balances visual guidance and textual output quality based on a real-time context alignment score. Result: Extensive experiments across various LVLM architectures (e.g., LLaVA, InstructBLIP, MiniGPT-4) show that DLC significantly reduces hallucinations, outperforms current methods, and maintains high inference efficiency by avoiding multiple forward passes. Conclusion: This paper concludes that the proposed Dynamic Logits Calibration (DLC) framework effectively and efficiently mitigates hallucinations in Large Vision-Language Models (LVLMs), enhancing their reliability for practical applications. Abstract: Large Vision-Language Models (LVLMs) have demonstrated significant advancements in multimodal understanding, yet they are frequently hampered by hallucination-the generation of text that contradicts visual input. Existing training-free decoding strategies exhibit critical limitations, including the use of static constraints that do not adapt to semantic drift during generation, inefficiency stemming from the need for multiple forward passes, and degradation of detail due to overly rigid intervention rules. 
To overcome these challenges, this paper introduces Dynamic Logits Calibration (DLC), a novel training-free decoding framework designed to dynamically align text generation with visual evidence at inference time. At each decoding step, DLC employs CLIP to assess the semantic alignment between the input image and the generated text sequence. Then, the Relative Visual Advantage (RVA) of candidate tokens is evaluated against a dynamically updated contextual baseline, adaptively adjusting output logits to favor tokens that are visually grounded. Furthermore, an adaptive weighting mechanism, informed by a real-time context alignment score, carefully balances the visual guidance while ensuring the overall quality of the textual output. Extensive experiments conducted across diverse benchmarks and various LVLM architectures (such as LLaVA, InstructBLIP, and MiniGPT-4) demonstrate that DLC significantly reduces hallucinations, outperforming current methods while maintaining high inference efficiency by avoiding multiple forward passes. Overall, we present an effective and efficient decoding-time solution to mitigate hallucinations, thereby enhancing the reliability of LVLMs for practical applications. Code will be released on GitHub.

[141] GGTalker: Talking Head Systhesis with Generalizable Gaussian Priors and Identity-Specific Adaptation

Wentao Hu,Shunkai Li,Ziqiao Peng,Haoxian Zhang,Fan Shi,Xiaoqiang Liu,Pengfei Wan,Di Zhang,Hui Tian

Main category: cs.CV

TL;DR: GGTalker improves speech-driven 3D talking heads by leveraging general and identity-specific models, delivering high-quality, realistic results efficiently.

Details Motivation: Previous methods struggle with large head rotations, out-of-distribution audio, and require time-consuming, identity-specific training due to insufficient 3D priors. This work aims to improve generalizability and synthesis quality. Method: The paper proposes GGTalker using a two-stage Prior-Adaptation strategy to learn Gaussian head priors and adapt to individual traits. It includes training Audio-Expression and Expression-Visual priors, along with a color MLP for textures and a Body Inpainter for blending. Result: Comprehensive experiments show that GGTalker achieves superior performance in rendering quality, 3D consistency, lip-sync accuracy, and training efficiency compared to existing approaches. Conclusion: GGTalker addresses the challenge of creating high-quality, generalizable speech-driven 3D talking heads by combining generalizable priors with identity-specific adaptation, achieving state-of-the-art performance in rendering quality, 3D consistency, lip-sync accuracy, and training efficiency. Abstract: Creating high-quality, generalizable speech-driven 3D talking heads remains a persistent challenge. Previous methods achieve satisfactory results for fixed viewpoints and small-scale audio variations, but they struggle with large head rotations and out-of-distribution (OOD) audio. Moreover, they are constrained by the need for time-consuming, identity-specific training. We believe the core issue lies in the lack of sufficient 3D priors, which limits the extrapolation capabilities of synthesized talking heads. To address this, we propose GGTalker, which synthesizes talking heads through a combination of generalizable priors and identity-specific adaptation. We introduce a two-stage Prior-Adaptation training strategy to learn Gaussian head priors and adapt to individual characteristics. We train Audio-Expression and Expression-Visual priors to capture the universal patterns of lip movements and the general distribution of head textures. 
During the Customized Adaptation, individual speaking styles and texture details are precisely modeled. Additionally, we introduce a color MLP to generate fine-grained, motion-aligned textures and a Body Inpainter to blend rendered results with the background, producing indistinguishable, photorealistic video frames. Comprehensive experiments show that GGTalker achieves state-of-the-art performance in rendering quality, 3D consistency, lip-sync accuracy, and training efficiency.

[142] G$^{2}$D: Boosting Multimodal Learning with Gradient-Guided Distillation

Mohammed Rakib,Arunkumar Bagavathi

Main category: cs.CV

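TL;DR: This paper proposes G$^{2}$D, a new method for the modality imbalance problem in multimodal learning that substantially improves model performance by dynamically prioritizing modalities.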

Details Motivation: Conventional multimodal models suffer from modality imbalance, which leaves weak modalities underused, so a new method is needed to balance modality contributions. Method: The paper proposes G$^{2}$D, a gradient-guided knowledge distillation framework that combines a dynamic sequential modality prioritization (SMP) technique to optimize the learning process of multimodal models. Result: G$^{2}$D is validated on multiple real-world datasets, and the results show that it outperforms existing state-of-the-art methods in classification and regression tasks. Conclusion: G$^{2}$D effectively addresses modality imbalance in multimodal learning and improves model performance by amplifying the significance of weak modalities. Abstract: Multimodal learning aims to leverage information from diverse data modalities to achieve more comprehensive performance. However, conventional multimodal models often suffer from modality imbalance, where one or a few modalities dominate model optimization, leading to suboptimal feature representation and underutilization of weak modalities. To address this challenge, we introduce Gradient-Guided Distillation (G$^{2}$D), a knowledge distillation framework that optimizes the multimodal model with a custom-built loss function that fuses both unimodal and multimodal objectives. G$^{2}$D further incorporates a dynamic sequential modality prioritization (SMP) technique in the learning process to ensure each modality leads the learning process, avoiding the pitfall of stronger modalities overshadowing weaker ones. We validate G$^{2}$D on multiple real-world datasets and show that G$^{2}$D amplifies the significance of weak modalities while training and outperforms state-of-the-art methods in classification and regression tasks. Our code is available at https://github.com/rAIson-Lab/G2D.

[143] MADrive: Memory-Augmented Driving Scene Modeling

Polina Karpikova,Daniil Selikhanovych,Kirill Struminsky,Ruslan Musaev,Maria Golitsyna,Dmitry Baranchuk

Main category: cs.CV

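TL;DR: This paper proposes MADrive, a new reconstruction framework for autonomous driving environments that replaces observed vehicles with 3D assets from a large-scale external memory bank, enabling photorealistic synthesis of substantially altered configurations.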

Details Motivation: Existing scene reconstruction methods for autonomous driving environments remain closely tied to the original observations and struggle to support photorealistic synthesis of significantly altered or novel driving scenarios. Method: The authors release the MAD-Cars dataset and present a retrieval module that finds the most similar car instances in the memory bank, reconstructs the corresponding 3D assets from video, and integrates them into the target scene through orientation alignment and relighting. Result: The replacements provide complete multi-view representations of vehicles in the scene, enabling photorealistic synthesis of substantially altered configurations. Conclusion: MADrive extends the capabilities of existing scene reconstruction methods by replacing observed vehicles with 3D assets retrieved from a large-scale external memory bank. Abstract: Recent advances in scene reconstruction have pushed toward highly realistic modeling of autonomous driving (AD) environments using 3D Gaussian splatting. However, the resulting reconstructions remain closely tied to the original observations and struggle to support photorealistic synthesis of significantly altered or novel driving scenarios. This work introduces MADrive, a memory-augmented reconstruction framework designed to extend the capabilities of existing scene reconstruction methods by replacing observed vehicles with visually similar 3D assets retrieved from a large-scale external memory bank. Specifically, we release MAD-Cars, a curated dataset of ~70K 360° car videos captured in the wild, and present a retrieval module that finds the most similar car instances in the memory bank, reconstructs the corresponding 3D assets from video, and integrates them into the target scene through orientation alignment and relighting. The resulting replacements provide complete multi-view representations of vehicles in the scene, enabling photorealistic synthesis of substantially altered configurations, as demonstrated in our experiments. Project page: https://yandex-research.github.io/madrive/

[144] WAFT: Warping-Alone Field Transforms for Optical Flow

Yihan Wang,Jia Deng

Main category: cs.CV

TL;DR: WAFT is a new optical flow method that replaces cost volume with high-resolution warping, achieving state-of-the-art results while being more efficient.

Details Motivation: To develop a more efficient and effective approach for optical flow estimation by rethinking traditional methods that rely on constructing cost volumes. Method: Introducing Warping-Alone Field Transforms (WAFT), which replaces the cost volume used in RAFT with high-resolution warping to improve accuracy and reduce memory usage. Result: WAFT achieves first place on Spring and KITTI benchmarks, demonstrates the best zero-shot generalization on KITTI, and is up to 4.1x faster than similar-performance methods. Conclusion: WAFT is a simple and effective method for optical flow that challenges conventional approaches by replacing cost volume with high-resolution warping, achieving better performance while being more efficient. Abstract: We introduce Warping-Alone Field Transforms (WAFT), a simple and effective method for optical flow. WAFT is similar to RAFT but replaces cost volume with high-resolution warping, achieving better accuracy with lower memory cost. This design challenges the conventional wisdom that constructing cost volumes is necessary for strong performance. WAFT is a simple and flexible meta-architecture with minimal inductive biases and reliance on custom designs. Compared with existing methods, WAFT ranks 1st on Spring and KITTI benchmarks, achieves the best zero-shot generalization on KITTI, while being up to 4.1x faster than methods with similar performance. Code and model weights are available at https://github.com/princeton-vl/WAFT.
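
As a rough illustration of what "warping" means in optical flow, the sketch below backward-warps a second frame toward the first using a flow field and bilinear sampling. It is a generic textbook operation written in plain Python, not WAFT's implementation; the function names `bilinear_sample` and `backward_warp`, the border clamping, and the list-of-rows image layout are all assumptions made for this example.

```python
import math


def bilinear_sample(img, x, y):
    """Sample a 2D grid (list of rows) at real-valued (x, y), clamping to the border."""
    h, w = len(img), len(img[0])
    x = min(max(x, 0.0), w - 1.0)
    y = min(max(y, 0.0), h - 1.0)
    x0, y0 = int(math.floor(x)), int(math.floor(y))
    x1, y1 = min(x0 + 1, w - 1), min(y0 + 1, h - 1)
    ax, ay = x - x0, y - y0
    # Interpolate along x on the two neighbouring rows, then along y.
    top = (1 - ax) * img[y0][x0] + ax * img[y0][x1]
    bottom = (1 - ax) * img[y1][x0] + ax * img[y1][x1]
    return (1 - ay) * top + ay * bottom


def backward_warp(frame2, flow):
    """Warp frame2 toward frame1: out[y][x] = frame2 sampled at (x + u, y + v).

    flow[y][x] is a (u, v) pair giving the displacement of pixel (x, y)
    from frame1 to frame2.
    """
    h, w = len(frame2), len(frame2[0])
    return [
        [bilinear_sample(frame2, x + flow[y][x][0], y + flow[y][x][1]) for x in range(w)]
        for y in range(h)
    ]


# Hypothetical toy data: frame2 is frame1 shifted one pixel to the right.
frame1 = [[0, 1, 2, 3], [4, 5, 6, 7]]
frame2 = [[0, 0, 1, 2], [4, 4, 5, 6]]
flow = [[(1.0, 0.0)] * 4 for _ in range(2)]
warped = backward_warp(frame2, flow)
```

With a correct flow, the warped second frame lines up with the first frame everywhere except where the content has left the image, which is why warped features can be compared directly at each pixel instead of through a table of all candidate matches.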

[145] Maximal Matching Matters: Preventing Representation Collapse for Robust Cross-Modal Retrieval

Hani Alomari,Anushka Sivakumar,Andrew Zhang,Chris Thomas

Main category: cs.CV

TL;DR: This paper proposes a new method for cross-modal image-text retrieval that better captures complex relationships between different types of data by improving set-based representation learning.

Details Motivation: Traditional methods struggle to capture nuanced and diverse relationships across modalities, and set-based approaches face issues like sparse supervision and set collapse. Method: The paper introduces Maximal Pair Assignment Similarity to optimize one-to-one matching between embedding sets, along with two loss functions: Global Discriminative Loss and Intra-Set Divergence Loss. Result: The method addresses the limitations of current set-based representations by preserving semantic diversity and enhancing distinction among embeddings. Conclusion: The proposed method achieves state-of-the-art performance on MS-COCO and Flickr30k without relying on external data. Abstract: Cross-modal image-text retrieval is challenging because of the diverse possible associations between content from different modalities. Traditional methods learn a single-vector embedding to represent semantics of each sample, but struggle to capture nuanced and diverse relationships that can exist across modalities. Set-based approaches, which represent each sample with multiple embeddings, offer a promising alternative, as they can capture richer and more diverse relationships. In this paper, we show that, despite their promise, these set-based representations continue to face issues including sparse supervision and set collapse, which limits their effectiveness. To address these challenges, we propose Maximal Pair Assignment Similarity to optimize one-to-one matching between embedding sets which preserve semantic diversity within the set. We also introduce two loss functions to further enhance the representations: Global Discriminative Loss to enhance distinction among embeddings, and Intra-Set Divergence Loss to prevent collapse within each set. Our method achieves state-of-the-art performance on MS-COCO and Flickr30k without relying on external data.
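
To make the idea of one-to-one matching between embedding sets concrete, here is a minimal sketch that scores two equal-sized sets by the best permutation of pairwise cosine similarities. The abstract does not give the exact formulation of Maximal Pair Assignment Similarity, so the averaging, the use of cosine similarity, the brute-force search, and the name `max_assignment_similarity` are assumptions for illustration only.

```python
import itertools
import math


def cosine(a, b):
    """Cosine similarity between two non-zero vectors given as lists of floats."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def max_assignment_similarity(set_a, set_b):
    """Return (score, perm) for the best one-to-one matching of two equal-sized sets.

    perm[i] is the index in set_b matched to set_a[i]; score is the mean
    cosine similarity over the matched pairs. Brute force over permutations,
    so this is only practical for small sets.
    """
    k = len(set_a)
    if k != len(set_b):
        raise ValueError("both sets must contain the same number of embeddings")
    sim = [[cosine(a, b) for b in set_b] for a in set_a]
    return max(
        (sum(sim[i][perm[i]] for i in range(k)) / k, perm)
        for perm in itertools.permutations(range(k))
    )


# Hypothetical toy sets: each embedding in set_a has exactly one good partner in set_b.
set_a = [[1.0, 0.0], [0.0, 1.0]]
set_b = [[0.0, 2.0], [3.0, 0.0]]
score, perm = max_assignment_similarity(set_a, set_b)
```

Because every embedding must be paired with a distinct partner, a set whose members have all collapsed onto the same vector cannot score well against a diverse set, which is the intuition behind using an assignment rather than a max over all pairs. For larger sets, a polynomial-time assignment solver would replace the permutation search.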

[146] StruMamba3D: Exploring Structural Mamba for Self-supervised Point Cloud Representation Learning

Chuxin Wang,Yixin Zha,Wenfei Yang,Tianzhu Zhang

Main category: cs.CV

TL;DR: This paper proposes StruMamba3D, a novel self-supervised point cloud representation learning approach that improves upon Mamba-based methods by preserving spatial dependencies and reducing sensitivity to input length variations, leading to state-of-the-art results.

Details Motivation: To overcome key issues with existing Mamba-based methods in point cloud representation learning, specifically the destruction of 3D point adjacency during SSM processing and failure to retain long-sequence memory as input lengths increase. Method: The method introduces spatial states to preserve spatial dependencies among points, enhances SSM with a state-wise update strategy, incorporates a lightweight convolution for efficient structure modeling, and employs a sequence length-adaptive strategy to reduce sensitivity to varying input lengths. Result: Experimental results demonstrate superior performance across four downstream tasks, achieving 95.1% accuracy on ModelNet40 and 92.75% accuracy on the most challenging split of ScanObjectNN without using a voting strategy. Conclusion: StruMamba3D effectively addresses the limitations of Mamba-based methods in point cloud representation learning, achieving state-of-the-art performance on ModelNet40 and ScanObjectNN without relying on a voting strategy. Abstract: Recently, Mamba-based methods have demonstrated impressive performance in point cloud representation learning by leveraging State Space Model (SSM) with the efficient context modeling ability and linear complexity. However, these methods still face two key issues that limit the potential of SSM: Destroying the adjacency of 3D points during SSM processing and failing to retain long-sequence memory as the input length increases in downstream tasks. To address these issues, we propose StruMamba3D, a novel paradigm for self-supervised point cloud representation learning. It enjoys several merits. First, we design spatial states and use them as proxies to preserve spatial dependencies among points. Second, we enhance the SSM with a state-wise update strategy and incorporate a lightweight convolution to facilitate interactions between spatial states for efficient structure modeling. 
Third, our method reduces the sensitivity of pre-trained Mamba-based models to varying input lengths by introducing a sequence length-adaptive strategy. Experimental results across four downstream tasks showcase the superior performance of our method. In addition, our method attains the SOTA 95.1% accuracy on ModelNet40 and 92.75% accuracy on the most challenging split of ScanObjectNN without voting strategy.

[147] DeOcc-1-to-3: 3D De-Occlusion from a Single Image via Self-Supervised Multi-View Diffusion

Yansong Qu,Shaohui Dai,Xinyang Li,Yuze Wang,You Shen,Liujuan Cao,Rongrong Ji

Main category: cs.CV

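TL;DR: This paper proposes an end-to-end framework that generates six structurally consistent novel views from a single partially occluded image, enabling 3D reconstruction without prior inpainting or manual annotation. The authors also build a self-supervised training pipeline and release the first benchmark for occlusion-aware reconstruction.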

Details Motivation: Reconstructing 3D objects from a single image is challenging under real-world occlusions. Existing diffusion-based view synthesis models generally assume fully visible inputs; when occlusions are present, they produce inconsistent views and degrade 3D reconstruction quality. Method: The paper proposes an end-to-end framework for occlusion-aware multi-view generation that directly synthesizes six structurally consistent novel views from a partially occluded image. A self-supervised training pipeline is built on the Pix2Gestalt dataset, using occluded-unoccluded image pairs and pseudo-ground-truth views. Without modifying the original architecture, the view synthesis model is fully fine-tuned to jointly learn completion and multi-view generation. Result: The method generates structurally consistent novel views and supports downstream 3D reconstruction from partially occluded inputs without prior inpainting or manual annotation. The authors also introduce the first benchmark for occlusion-aware reconstruction, covering diverse occlusion levels, object categories, and mask patterns. Conclusion: The paper addresses 3D reconstruction from partially occluded images with an end-to-end framework and provides a standardized evaluation protocol and benchmark for future research. Abstract: Reconstructing 3D objects from a single image is a long-standing challenge, especially under real-world occlusions. While recent diffusion-based view synthesis models can generate consistent novel views from a single RGB image, they generally assume fully visible inputs and fail when parts of the object are occluded. This leads to inconsistent views and degraded 3D reconstruction quality. To overcome this limitation, we propose an end-to-end framework for occlusion-aware multi-view generation. Our method directly synthesizes six structurally consistent novel views from a single partially occluded image, enabling downstream 3D reconstruction without requiring prior inpainting or manual annotations. We construct a self-supervised training pipeline using the Pix2Gestalt dataset, leveraging occluded-unoccluded image pairs and pseudo-ground-truth views to teach the model structure-aware completion and view consistency. Without modifying the original architecture, we fully fine-tune the view synthesis model to jointly learn completion and multi-view generation. Additionally, we introduce the first benchmark for occlusion-aware reconstruction, encompassing diverse occlusion levels, object categories, and mask patterns. This benchmark provides a standardized protocol for evaluating future methods under partial occlusions. Our code is available at https://github.com/Quyans/DeOcc123.

[148] SAM4D: Segment Anything in Camera and LiDAR Streams

Jianyun Xu,Song Wang,Ziqian Ni,Chunyong Hu,Sheng Yang,Jianke Zhu,Qiang Li

Main category: cs.CV

TL;DR: SAM4D is a multi-modal and temporal foundation model for promptable segmentation using camera and LiDAR streams. It introduces UMPE for feature alignment and MCMA for motion-aware attention, supported by an automated data engine to accelerate pseudo-label generation. The framework demonstrates robust performance in autonomous driving scene understanding.

Details Motivation: The motivation is to overcome annotation bottlenecks and achieve robust, promptable segmentation across multi-modal (camera and LiDAR) and dynamic scenes in autonomous driving contexts. Method: SAM4D uses Unified Multi-modal Positional Encoding (UMPE) to align camera and LiDAR features in a shared 3D space and Motion-aware Cross-modal Memory Attention (MCMA), which improves temporal consistency by compensating for ego-motion. It also employs an automated data engine combining video foundation model (VFM)-driven masklets, spatiotemporal 4D reconstruction, and cross-modal fusion. Result: Extensive experiments on the Waymo-4DSeg dataset demonstrated the effectiveness of SAM4D in cross-modal segmentation and fast, accurate pseudo-label generation, significantly outperforming human annotation in speed while maintaining semantic fidelity. Conclusion: The paper concludes that SAM4D, through its innovative mechanisms and automated data engine, shows strong performance in cross-modal segmentation and significant potential in efficient data annotation for autonomous driving environments. Abstract: We present SAM4D, a multi-modal and temporal foundation model designed for promptable segmentation across camera and LiDAR streams. Unified Multi-modal Positional Encoding (UMPE) is introduced to align camera and LiDAR features in a shared 3D space, enabling seamless cross-modal prompting and interaction. Additionally, we propose Motion-aware Cross-modal Memory Attention (MCMA), which leverages ego-motion compensation to enhance temporal consistency and long-horizon feature retrieval, ensuring robust segmentation across dynamically changing autonomous driving scenes. To avoid annotation bottlenecks, we develop a multi-modal automated data engine that synergizes VFM-driven video masklets, spatiotemporal 4D reconstruction, and cross-modal masklet fusion. 
This framework generates camera-LiDAR aligned pseudo-labels at a speed orders of magnitude faster than human annotation while preserving VFM-derived semantic fidelity in point cloud representations. We conduct extensive experiments on the constructed Waymo-4DSeg, which demonstrate the powerful cross-modal segmentation ability and great potential in data annotation of proposed SAM4D.
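
The ego-motion compensation mentioned in the abstract can be pictured as re-expressing points from an earlier frame in the current vehicle frame before comparing features over time. The sketch below shows that rigid transform in 2D (position plus yaw) in plain Python. It is a generic geometry example under simplifying assumptions, not SAM4D's MCMA module; the pose format `(tx, ty, yaw)`, the x-forward/y-left ego convention, and the name `compensate_ego_motion` are hypothetical.

```python
import math


def compensate_ego_motion(points_prev, pose_prev, pose_cur):
    """Re-express points from the previous ego frame in the current ego frame.

    Each pose is (tx, ty, yaw): the vehicle's world position and heading in
    radians. points_prev is a list of (x, y) tuples in the previous ego frame.
    """
    tx_p, ty_p, yaw_p = pose_prev
    tx_c, ty_c, yaw_c = pose_cur
    cos_p, sin_p = math.cos(yaw_p), math.sin(yaw_p)
    cos_c, sin_c = math.cos(yaw_c), math.sin(yaw_c)
    out = []
    for x, y in points_prev:
        # Previous ego frame -> world frame: rotate by yaw_prev, then translate.
        wx = cos_p * x - sin_p * y + tx_p
        wy = sin_p * x + cos_p * y + ty_p
        # World frame -> current ego frame: subtract position, apply inverse rotation.
        dx, dy = wx - tx_c, wy - ty_c
        out.append((cos_c * dx + sin_c * dy, -sin_c * dx + cos_c * dy))
    return out


# A static point 2 m ahead; the vehicle then drives 1 m forward without turning.
moved = compensate_ego_motion([(2.0, 0.0)], (0.0, 0.0, 0.0), (1.0, 0.0, 0.0))
# Same start, but the vehicle turns 90 degrees to the left in place.
turned = compensate_ego_motion([(1.0, 0.0)], (0.0, 0.0, 0.0), (0.0, 0.0, math.pi / 2))
```

After compensation, a static object keeps the same coordinates across frames regardless of how the vehicle moved, so any remaining displacement can be attributed to the object's own motion.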

[149] SiM3D: Single-instance Multiview Multimodal and Multisetup 3D Anomaly Detection Benchmark

Alex Costanzino,Pierluigi Zama Ramirez,Luigi Lella,Matteo Ragaglia,Alessandro Oliva,Giuseppe Lisanti,Luigi Di Stefano

Main category: cs.CV

TL;DR: This paper proposes SiM3D, the first benchmark for 3D anomaly detection and segmentation integrating multiview and multimodal data, particularly addressing single-instance anomaly detection in manufacturing.

Details Motivation: There was a need for a benchmark that can integrate multiview and multimodal information for comprehensive 3D anomaly detection and segmentation, particularly in scenarios where only one object instance is available for training. Method: The paper introduces SiM3D, which uses multiview and multimodal data for 3D anomaly detection and segmentation. They adapt prominent singleview methods to create baselines and evaluate performance using new metrics based on Anomaly Volumes. Result: SiM3D provides a new dataset with multiview high-resolution images and point clouds for 333 instances of eight object types, including CAD models and manually annotated 3D segmentation ground truths. The authors establish reference baselines using adapted singleview methods. Conclusion: SiM3D is the first benchmark for 3D anomaly detection and segmentation that integrates multiview and multimodal information, especially focusing on single-instance anomaly detection in manufacturing. It includes a novel dataset with high-resolution images, point clouds, and CAD models. Abstract: We propose SiM3D, the first benchmark considering the integration of multiview and multimodal information for comprehensive 3D anomaly detection and segmentation (ADS), where the task is to produce a voxel-based Anomaly Volume. Moreover, SiM3D focuses on a scenario of high interest in manufacturing: single-instance anomaly detection, where only one object, either real or synthetic, is available for training. In this respect, SiM3D stands out as the first ADS benchmark that addresses the challenge of generalising from synthetic training data to real test data. SiM3D includes a novel multimodal multiview dataset acquired using top-tier industrial sensors and robots. The dataset features multiview high-resolution images (12 Mpx) and point clouds (7M points) for 333 instances of eight types of objects, alongside a CAD model for each type. 
We also provide manually annotated 3D segmentation GTs for anomalous test samples. To establish reference baselines for the proposed multiview 3D ADS task, we adapt prominent singleview methods and assess their performance using novel metrics that operate on Anomaly Volumes.

[150] Whole-Body Conditioned Egocentric Video Prediction

Yutong Bai,Danny Tran,Amir Bar,Yann LeCun,Trevor Darrell,Jitendra Malik

Main category: cs.CV

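TL;DR: The researchers train a model that predicts first-person (egocentric) video from human body actions, simulating how human behavior affects the environment.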

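Details Motivation: Understanding and simulating how physical human actions shape the environment from a first-person point of view requires methods that can handle the challenges of modeling complex real-world environments and embodied agent behaviors. Method: An auto-regressive conditional diffusion transformer is trained on Nymeria, a large-scale dataset of real-world egocentric video and body pose capture, and a hierarchical evaluation protocol is designed to analyze the model's prediction and control abilities. Result: The paper presents a model that predicts egocentric video from past video and action information, and evaluates its embodied prediction and control abilities through hierarchical tasks. Conclusion: By conditioning on kinematic pose trajectories structured by the joint hierarchy of the body, the model learns to simulate how human actions shape the environment from a first-person point of view; the work represents an initial attempt to model complex real-world environments and embodied agent behaviors from the perspective of a human. Abstract: We train models to Predict Ego-centric Video from human Actions (PEVA), given the past video and an action represented by the relative 3D body pose. By conditioning on kinematic pose trajectories, structured by the joint hierarchy of the body, our model learns to simulate how physical human actions shape the environment from a first-person point of view. We train an auto-regressive conditional diffusion transformer on Nymeria, a large-scale dataset of real-world egocentric video and body pose capture. We further design a hierarchical evaluation protocol with increasingly challenging tasks, enabling a comprehensive analysis of the model's embodied prediction and control abilities. Our work represents an initial attempt to tackle the challenges of modeling complex real-world environments and embodied agent behaviors with video prediction from the perspective of a human.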