cs.CL

[1] Towards Probabilistic Question Answering Over Tabular Data

Chen Shen,Sajjadur Rahman,Estevam Hruschka

Main category: cs.CL

TL;DR: This paper introduces a framework for probabilistic question answering over tabular data that uses Bayesian Networks and LLMs, showing improved performance over existing methods.

Details Motivation: The motivation is to address the shortcomings of current QA approaches like NL2SQL systems on probabilistic questions requiring reasoning under uncertainty. Method: The method involves inducing Bayesian Networks from tables, translating natural language queries into probabilistic queries, and using large language models (LLMs) to generate final answers. Result: Empirical results show significant improvements over baselines in handling probabilistic questions. Conclusion: The paper concludes that their introduced framework for probabilistic QA over large tabular data outperforms baselines, showcasing the advantages of hybrid symbolic-neural reasoning. Abstract: Current approaches for question answering (QA) over tabular data, such as NL2SQL systems, perform well for factual questions where answers are directly retrieved from tables. However, they fall short on probabilistic questions requiring reasoning under uncertainty. In this paper, we introduce a new benchmark LUCARIO and a framework for probabilistic QA over large tabular data. Our method induces Bayesian Networks from tables, translates natural language queries into probabilistic queries, and uses large language models (LLMs) to generate final answers. Empirical results demonstrate significant improvements over baselines, highlighting the benefits of hybrid symbolic-neural reasoning.
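
To make the idea of a "probabilistic query" over a table concrete, here is a minimal sketch that estimates a conditional probability by counting rows. It is not the paper's method: the framework above induces a Bayesian Network, whereas this toy counts rows directly. The table, its column names, and the `conditional_probability` helper are all hypothetical.

```python
from fractions import Fraction

# Hypothetical toy table; columns and values are invented for illustration.
ROWS = [
    {"smoker": "yes", "exercise": "low", "disease": "yes"},
    {"smoker": "yes", "exercise": "high", "disease": "no"},
    {"smoker": "yes", "exercise": "low", "disease": "yes"},
    {"smoker": "no", "exercise": "low", "disease": "no"},
    {"smoker": "no", "exercise": "high", "disease": "no"},
    {"smoker": "no", "exercise": "low", "disease": "yes"},
]


def conditional_probability(rows, target, evidence):
    """Estimate P(target | evidence) by counting matching rows.

    Returns None when no row matches the evidence, since the
    estimate is undefined in that case.
    """
    matching = [r for r in rows if all(r[k] == v for k, v in evidence.items())]
    if not matching:
        return None
    hits = sum(1 for r in matching if all(r[k] == v for k, v in target.items()))
    return Fraction(hits, len(matching))


# A question such as "How likely is disease among smokers?" becomes:
p = conditional_probability(ROWS, {"disease": "yes"}, {"smoker": "yes"})
print(p)
```

Direct counting breaks down when the evidence matches few or no rows. That is one reason a factored model such as a Bayesian Network may be preferable for large tables.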

[2] Multi-lingual Functional Evaluation for Large Language Models

Victor Ojewale,Inioluwa Deborah Raji,Suresh Venkatasubramanian

Main category: cs.CL

TL;DR: The paper introduces new multi-lingual functional benchmarks to better evaluate the practical performance and robustness of large language models across various languages.

Details Motivation: Existing multi-lingual competence evaluations often fail to provide an adequate understanding of practical performance and model robustness across different languages. Method: The authors created two new multi-lingual functional benchmarks (CL-GSM Symbolic and CL-IFEval) by translating existing English templates into five additional languages: French, Spanish, Hindi, Arabic, and Yoruba. They then compared the performance of models on these new benchmarks against existing static benchmarks. Result: There is a significant decrease in performance between some static and functional benchmarks across languages. For example, there is a 24%, 17%, and 18% decrease in performance between M-GSM and CL-GSM Symbolic in English, French, and Spanish respectively. There is also a 15-24% performance drop across languages between Belebele and CL-IFEval, while only a 0.5% to 3% drop between M-MMLU and CL-IFEval. Model robustness also varies, with Arabic and English showing the most consistent performance. Conclusion: Static multi-lingual benchmarks vary in their ability to capture functional performance, with some being much closer than others. Model robustness also varies significantly across languages. Abstract: Multi-lingual competence in large language models is often evaluated via static data benchmarks such as Belebele, M-MMLU and M-GSM. However, these evaluations often fail to provide an adequate understanding of the practical performance and robustness of models across multi-lingual settings. In response, we create multi-lingual functional benchmarks -- Cross-Lingual Grade School Math Symbolic (CL-GSM Symbolic) and Cross-Lingual Instruction-Following Eval (CL-IFEval)-- by translating existing functional benchmark templates from English to five additional languages that span the range of resources available for NLP: French, Spanish, Hindi, Arabic and Yoruba. 
Our results reveal that some static multi-lingual benchmarks capture functional performance much more closely than others (i.e. across models, there is a 24%, 17% and 18% decrease in performance between M-GSM and CL-GSM Symbolic in English, French and Spanish respectively; similarly there's a 15-24% performance drop across languages between Belebele and CL-IFEval, and only a 0.5% to 3% performance drop between M-MMLU and CL-IFEval). Similarly, we find that model robustness across languages varies significantly, with certain languages (e.g. Arabic, English) being the most consistently well performing across evaluation iterations.

[3] The Ideation-Execution Gap: Execution Outcomes of LLM-Generated versus Human Research Ideas

Chenglei Si,Tatsunori Hashimoto,Diyi Yang

Main category: cs.CL

TL;DR: This study shows that although Large Language Models can generate novel research ideas, they often do not translate into better research outcomes when compared to human-generated ideas.

Details Motivation: The motivation behind this study is to assess whether AI-generated research ideas can result in better research outcomes after execution, beyond just appearing novel. Method: The researchers conducted an execution study involving 43 expert researchers who executed randomly assigned research ideas, either written by experts or generated by an LLM. Each expert spent over 100 hours implementing the idea and wrote a 4-page paper documenting their work. The papers were reviewed blindly by expert NLP researchers. Result: The results showed that the scores for LLM-generated ideas decreased significantly more than those for expert-written ideas across all evaluation metrics (novelty, excitement, effectiveness, and overall). There was also a shift in rankings where human ideas scored higher than LLM ideas after execution. Conclusion: The study concludes that while LLMs can generate novel research ideas, these ideas often do not lead to better research outcomes when compared to those generated by human experts. This highlights the limitations of current LLMs in producing truly effective research ideas. Abstract: Large Language Models (LLMs) have shown promise in accelerating the scientific research pipeline. A key capability for this process is the ability to generate novel research ideas, and prior studies have found settings in which LLM-generated research ideas were judged as more novel than human-expert ideas. However, a good idea should not simply appear to be novel, it should also result in better research after being executed. To test whether AI-generated ideas lead to better research outcomes, we conduct an execution study by recruiting 43 expert researchers to execute randomly-assigned ideas, either written by experts or generated by an LLM. Each expert spent over 100 hours implementing the idea and wrote a 4-page short paper to document the experiments. All the executed projects are then reviewed blindly by expert NLP researchers. 
Comparing the review scores of the same ideas before and after execution, the scores of the LLM-generated ideas decrease significantly more than expert-written ideas on all evaluation metrics (novelty, excitement, effectiveness, and overall; p < 0.05), closing the gap between LLM and human ideas observed at the ideation stage. When comparing the aggregated review scores from the execution study, we even observe that for many metrics there is a flip in rankings where human ideas score higher than LLM ideas. This ideation-execution gap highlights the limitations of current LLMs in generating truly effective research ideas and the challenge of evaluating research ideas in the absence of execution outcomes.

[4] MultiFinRAG: An Optimized Multimodal Retrieval-Augmented Generation (RAG) Framework for Financial Question Answering

Chinmay Gondhalekar,Urjitkumar Patel,Fang-Chun Yeh

Main category: cs.CL

TL;DR: The paper proposes MultiFinRAG, a retrieval-augmented generation framework designed specifically for financial question answering that efficiently handles multimodal content like text, tables, and figures, significantly outperforming existing models.

Details Motivation: Financial documents combine diverse modalities such as narrative text, tables, and figures, requiring joint reasoning across these modalities. Traditional LLMs and RAG pipelines struggle due to token limitations, layout loss, and fragmented cross-modal context. Method: The paper introduces MultiFinRAG, which utilizes multimodal extraction by grouping table and figure images, generating structured JSON outputs and textual summaries using a lightweight, quantized open-source multimodal LLM. These outputs are then embedded and indexed with modality-aware similarity thresholds, while a tiered fallback strategy dynamically escalates context when necessary. Result: MultiFinRAG achieves 19 percentage points higher accuracy than ChatGPT-4o (free-tier) on complex financial QA tasks involving text, tables, images, and combined multimodal reasoning, despite running on commodity hardware. Conclusion: MultiFinRAG is a more efficient framework for financial QA compared to traditional LLMs and RAG pipelines, achieving higher accuracy on complex tasks involving multiple modalities. Abstract: Financial documents--such as 10-Ks, 10-Qs, and investor presentations--span hundreds of pages and combine diverse modalities, including dense narrative text, structured tables, and complex figures. Answering questions over such content often requires joint reasoning across modalities, which strains traditional large language models (LLMs) and retrieval-augmented generation (RAG) pipelines due to token limitations, layout loss, and fragmented cross-modal context. We introduce MultiFinRAG, a retrieval-augmented generation framework purpose-built for financial QA. MultiFinRAG first performs multimodal extraction by grouping table and figure images into batches and sending them to a lightweight, quantized open-source multimodal LLM, which produces both structured JSON outputs and concise textual summaries. 
These outputs, along with narrative text, are embedded and indexed with modality-aware similarity thresholds for precise retrieval. A tiered fallback strategy then dynamically escalates from text-only to text+table+image contexts when necessary, enabling cross-modal reasoning while reducing irrelevant context. Despite running on commodity hardware, MultiFinRAG achieves 19 percentage points higher accuracy than ChatGPT-4o (free-tier) on complex financial QA tasks involving text, tables, images, and combined multimodal reasoning.
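
The retrieval step described above (modality-aware similarity thresholds plus a tiered fallback) could be sketched roughly as follows. This is an illustration under stated assumptions, not MultiFinRAG's implementation: the threshold values, the `MIN_TEXT_HITS` escalation rule, and the `Hit` and `select_context` names are hypothetical, and the abstract does not say what triggers escalation.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    modality: str  # "text", "table", or "image"
    score: float   # similarity between the query and this chunk
    content: str


# Hypothetical per-modality thresholds; the source gives no values.
THRESHOLDS = {"text": 0.70, "table": 0.60, "image": 0.55}
# Hypothetical escalation rule: fall back when too few text chunks qualify.
MIN_TEXT_HITS = 2


def select_context(hits, thresholds=THRESHOLDS, min_text_hits=MIN_TEXT_HITS):
    """Return (tier, chunks): text-only if enough text passes, else all modalities."""
    kept = [h for h in hits if h.score >= thresholds[h.modality]]
    text_hits = [h for h in kept if h.modality == "text"]
    if len(text_hits) >= min_text_hits:
        chosen, tier = text_hits, "text-only"
    else:
        chosen, tier = kept, "text+table+image"
    return tier, sorted(chosen, key=lambda h: h.score, reverse=True)


hits = [
    Hit("text", 0.90, "narrative paragraph"),
    Hit("text", 0.50, "weakly related paragraph"),
    Hit("table", 0.65, "table summary"),
    Hit("image", 0.40, "figure summary"),
]
tier, chunks = select_context(hits)
print(tier, [c.content for c in chunks])
```

Starting from the text-only tier keeps the prompt short when narrative text is sufficient. Table and image summaries are added only when it is not.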

[5] Uncovering Hidden Violent Tendencies in LLMs: A Demographic Analysis via Behavioral Vignettes

Quintin Myers,Yanjun Gao

Main category: cs.CL

TL;DR: This study evaluates how large language models respond to violent scenarios using a psychological tool and finds inconsistencies and demographic biases.

Details Motivation: As LLMs are increasingly used for detecting and responding to online violence, their ability to handle morally ambiguous real-world situations remains unclear. This study aims to examine their reasoning in such contexts and potential biases. Method: The study used the Violent Behavior Vignette Questionnaire (VBVQ), a validated social science tool, to evaluate six LLMs under a unified zero-shot setting. Persona-based prompting was introduced to assess bias by varying race, age, and geographic identity within the U.S. Result: Two key findings emerged: (1) LLMs' text generation often diverges from their internal preference for violent responses; (2) their responses vary across demographic factors, contradicting established findings in criminology, psychology, and social sciences. Conclusion: Large language models (LLMs) show discrepancies in handling morally ambiguous, violent scenarios, with variations across demographics and contexts. Abstract: Large language models (LLMs) are increasingly proposed for detecting and responding to violent content online, yet their ability to reason about morally ambiguous, real-world scenarios remains underexamined. We present the first study to evaluate LLMs using a validated social science instrument designed to measure human response to everyday conflict, namely the Violent Behavior Vignette Questionnaire (VBVQ). To assess potential bias, we introduce persona-based prompting that varies race, age, and geographic identity within the United States. Six LLMs developed across different geopolitical and organizational contexts are evaluated under a unified zero-shot setting. Our study reveals two key findings: (1) LLMs' surface-level text generation often diverges from their internal preference for violent responses; (2) their violent tendencies vary across demographics, frequently contradicting established findings in criminology, social science, and psychology.

[6] Decide less, communicate more: On the construct validity of end-to-end fact-checking in medicine

Sebastian Joseph,Lily Chen,Barry Wei,Michael Mackert,Iain J. Marshall,Paul Pu Liang,Ramez Kouzy,Byron C. Wallace,Junyi Jessy Li

Main category: cs.CL

TL;DR: This paper explores the challenges of applying automated fact-checking systems to medicine and argues that such processes should focus on interactive communication between users and experts.

Details Motivation: The motivation is driven by the high-stakes nature of medical decisions and challenges in navigating complex medical literature, which calls for end-to-end fact-checking systems. Method: A study was conducted to examine how clinical experts verify real-world claims from social media by synthesizing medical evidence. Result: The study revealed key difficulties in connecting informal claims to scientific evidence, including ambiguous or underspecified claims, mismatched intentions, and subjective veracity labels. Conclusion: Fact-checking in medicine should be approached as an interactive communication problem, not just an end-to-end process. Abstract: Technological progress has led to concrete advancements in tasks that were regarded as challenging, such as automatic fact-checking. Interest in adopting these systems for public health and medicine has grown due to the high-stakes nature of medical decisions and challenges in critically appraising a vast and diverse medical literature. Evidence-based medicine connects to every individual, and yet the nature of it is highly technical, rendering the medical literacy of the majority of users inadequate to sufficiently navigate the domain. Such problems with medical communication ripen the ground for end-to-end fact-checking agents: check a claim against current medical literature and return with an evidence-backed verdict. And yet, such systems remain largely unused. To understand this, we present the first study examining how clinical experts verify real claims from social media by synthesizing medical evidence. In searching for this upper bound, we reveal fundamental challenges in end-to-end fact-checking when applied to medicine: difficulties connecting claims in the wild to scientific evidence in the form of clinical trials; ambiguities in underspecified claims mixed with mismatched intentions; and inherently subjective veracity labels. 
We argue that fact-checking should be approached and evaluated as an interactive communication problem, rather than an end-to-end process.

[7] Optimising Language Models for Downstream Tasks: A Post-Training Perspective

Zhengyan Shi

Main category: cs.CL

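TL;DR: This thesis addresses the inefficiency and high cost of adapting language models to specific tasks. It proposes a series of methods, including continued pre-training, parameter-efficient fine-tuning, improved supervised fine-tuning, and new evaluation benchmarks, and shows that they substantially improve the practicality and adaptability of language models.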

Details Motivation: Although language models have shown remarkable capabilities in natural language processing, adapting them efficiently and robustly to specific tasks remains challenging. Conventional fine-tuning often underutilizes unlabelled data, tends to overfit on small task-specific datasets, and incurs substantial computational cost. Method: The thesis first explores ways of extracting task-relevant knowledge from unlabelled data and introduces a new continued pre-training technique; it then proposes a parameter-efficient fine-tuning method that reduces memory and compute costs; next, it introduces improved supervised fine-tuning methods that strengthen instruction following; finally, it develops new evaluation methods and benchmarks to assess LM capabilities comprehensively. Result: Extensive empirical studies show that the proposed methods substantially improve the robustness, efficiency, and generalization of language models, making them better suited to a variety of real-world applications. Conclusion: The thesis proposes a series of methods for better adapting language models (LMs) to downstream applications. These methods substantially improve LM robustness, efficiency, and generalization, make LMs applicable to a broader range of uses, and mark a significant step towards the goal of artificial general intelligence. Abstract: Language models (LMs) have demonstrated remarkable capabilities in NLP, yet adapting them efficiently and robustly to specific tasks remains challenging. As their scale and complexity grow, fine-tuning LMs on labelled data often underutilizes available unlabelled data, leads to overfitting on small task-specific sets, and imposes significant computational costs. These limitations hamper their application to the open-ended landscape of real-world language tasks. This thesis proposes a series of methods to better adapt LMs to downstream applications. First, we explore strategies for extracting task-relevant knowledge from unlabelled data, introducing a novel continued pre-training technique that outperforms state-of-the-art semi-supervised approaches. Next, we present a parameter-efficient fine-tuning method that substantially reduces memory and compute costs while maintaining competitive performance. We also introduce improved supervised fine-tuning methods that enable LMs to better follow instructions, especially when labelled data is scarce, enhancing their performance across a range of NLP tasks, including open-ended generation. Finally, we develop new evaluation methods and benchmarks, such as multi-hop spatial reasoning tasks, to assess LM capabilities and adaptation more comprehensively. Through extensive empirical studies across diverse NLP tasks, our results demonstrate that these approaches substantially improve LM robustness, efficiency, and generalization, making them more adaptable to a broad range of applications. 
These advances mark a significant step towards more robust and efficient LMs, bringing us closer to the goal of artificial general intelligence.

[8] FineWeb2: One Pipeline to Scale Them All -- Adapting Pre-Training Data Processing to Every Language

Guilherme Penedo,Hynek Kydlíček,Vinko Sabolčec,Bettina Messmer,Negar Foroutan,Amir Hossein Kargaran,Colin Raffel,Martin Jaggi,Leandro Von Werra,Thomas Wolf

Main category: cs.CL

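TL;DR: This work introduces a new pre-training data curation pipeline that adapts automatically to any language, and uses it to build FineWeb2, a 20 TB multilingual dataset covering over 1000 languages, improving the performance of multilingual large language models.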

Details Motivation: Training performant multilingual LLMs is challenging because of the inherent difficulty of tailoring filtering and deduplication pipelines to a large number of languages. This work aims to address that problem and advance multilingual LLMs. Method: A new pre-training dataset curation pipeline based on FineWeb. Its design choices are ablated under the guidance of meaningful and informative evaluation tasks selected by measurable criteria, and it is paired with a principled approach to rebalancing datasets that accounts for both duplication count and quality. Result: The pipeline can be used to create non-English corpora that yield more performant models than prior datasets; the rebalancing approach, which accounts for duplication count and quality, provides an additional performance uplift. Conclusion: With a curation pipeline that adapts automatically to any language, FineWeb2 yields non-English corpora that produce more performant models than prior datasets. It also introduces a rebalancing approach that accounts for duplication count and quality, and it scales to over 1000 languages using almost 100 Common Crawl snapshots, producing a 20 TB multilingual dataset. Abstract: Pre-training state-of-the-art large language models (LLMs) requires vast amounts of clean and diverse text data. While the open development of large high-quality English pre-training datasets has seen substantial recent progress, training performant multilingual LLMs remains a challenge, in large part due to the inherent difficulty of tailoring filtering and deduplication pipelines to a large number of languages. In this work, we introduce a new pre-training dataset curation pipeline based on FineWeb that can be automatically adapted to support any language. We extensively ablate our pipeline design choices on a set of nine diverse languages, guided by a set of meaningful and informative evaluation tasks that were chosen through a novel selection process based on measurable criteria. Ultimately, we show that our pipeline can be used to create non-English corpora that produce more performant models than prior datasets. We additionally introduce a straightforward and principled approach to rebalance datasets that takes into consideration both duplication count and quality, providing an additional performance uplift. Finally, we scale our pipeline to over 1000 languages using almost 100 Common Crawl snapshots to produce FineWeb2, a new 20 terabyte (5 billion document) multilingual dataset which we release along with our pipeline, training, and evaluation codebases.

[9] KaLM-Embedding-V2: Superior Training Techniques and Data Inspire A Versatile Embedding Model

Xinping Zhao,Xinshuo Hu,Zifei Shan,Shouzheng Huang,Yao Zhou,Zetian Sun,Zhenyu Liu,Dongfang Li,Xinyuan Wei,Qian Chen,Youcheng Pan,Yang Xiang,Meishan Zhang,Haofen Wang,Jun Yu,Baotian Hu,Min Zhang

Main category: cs.CL

TL;DR: The paper proposes KaLM-Embedding-V2, a versatile and compact embedding model that performs strongly on general-purpose text embedding tasks.

Details Motivation: To develop a versatile and compact embedding model that achieves strong performance on general-purpose text embedding tasks by leveraging superior training techniques and data. Method: (1) To better align the architecture with representation learning, the authors remove the causal attention mask and adopt a fully bidirectional transformer with simple yet effective mean-pooling to produce fixed-length embeddings. (2) They use a multi-stage training pipeline: (i) pre-training on large-scale weakly supervised open-source corpora; (ii) fine-tuning on high-quality retrieval and non-retrieval datasets; and (iii) model-soup parameter averaging to improve generalization. (3) They introduce a focal-style reweighting mechanism and an online hard-negative mixing strategy. (4) They collect over 20 categories of data for pre-training and 100 categories for fine-tuning. Result: On the MTEB Chinese and English evaluations, the model significantly outperforms other models of comparable size and competes with embedding models that are 3x, 14x, 18x, and 26x larger. Conclusion: The work sets a new standard for versatile and compact embedding models with fewer than 1B parameters. Abstract: In this paper, we propose KaLM-Embedding-V2, a versatile and compact embedding model, which achieves impressive performance in general-purpose text embedding tasks by leveraging superior training techniques and data. Our key innovations include: (1) To better align the architecture with representation learning, we remove the causal attention mask and adopt a fully bidirectional transformer with simple yet effective mean-pooling to produce fixed-length embeddings; (2) We employ a multi-stage training pipeline: (i) pre-training on large-scale weakly supervised open-source corpora; (ii) fine-tuning on high-quality retrieval and non-retrieval datasets; and (iii) model-soup parameter averaging for robust generalization. Besides, we introduce a focal-style reweighting mechanism that concentrates learning on difficult samples and an online hard-negative mixing strategy to continuously enrich hard negatives without expensive offline mining; (3) We collect over 20 categories of data for pre-training and 100 categories of data for fine-tuning, to boost both the performance and generalization of the embedding model. Extensive evaluations on the Massive Text Embedding Benchmark (MTEB) Chinese and English show that our model significantly outperforms others of comparable size, and competes with 3x, 14x, 18x, and 26x larger embedding models, setting a new standard for a versatile and compact embedding model with less than 1B parameters.
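
Two of the ingredients named in the abstract, mean-pooling over token embeddings and parameter averaging across checkpoints, are simple enough to sketch in plain Python. This is a toy illustration with lists standing in for tensors, not the KaLM-Embedding-V2 code. The function names `mean_pool` and `average_checkpoints` are hypothetical, and uniform averaging is an assumption, since the abstract does not say how checkpoints are weighted.

```python
def mean_pool(token_embeddings, attention_mask):
    """Average the token vectors whose mask entry is 1 (padding is skipped)."""
    dim = len(token_embeddings[0])
    total = [0.0] * dim
    count = 0
    for vector, keep in zip(token_embeddings, attention_mask):
        if keep:
            count += 1
            for i, value in enumerate(vector):
                total[i] += value
    if count == 0:
        raise ValueError("attention mask selects no tokens")
    return [t / count for t in total]


def average_checkpoints(checkpoints):
    """Uniformly average same-named parameter vectors across checkpoints."""
    averaged = {}
    for name in checkpoints[0]:
        columns = zip(*(ckpt[name] for ckpt in checkpoints))
        averaged[name] = [sum(col) / len(checkpoints) for col in columns]
    return averaged


# Three token vectors; the last one is padding and is masked out.
tokens = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
print(mean_pool(tokens, [1, 1, 0]))

# Two hypothetical fine-tuned checkpoints with one parameter vector each.
print(average_checkpoints([{"w": [1.0, 3.0]}, {"w": [3.0, 5.0]}]))
```

Mean-pooling yields a fixed-length vector regardless of sequence length, which is what makes the output usable as a sentence embedding.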

[10] Can Gradient Descent Simulate Prompting?

Eric Zhang,Leshem Choshen,Jacob Andreas

Main category: cs.CL

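TL;DR: This paper studies how gradient updates can make a fine-tuned model emulate the effect of prompting. It proposes a gradient-based meta-learning method that uses the model's own prompted predictions as targets, so no ground-truth labels are needed.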

Details Motivation: Parameter updates have the advantage in storage cost, but for some tasks prompting is more effective than conventional fine-tuning. The paper asks whether models can be modified so that fine-tuning achieves the effect of prompting. Method: The paper uses tools from gradient-based meta-learning and takes the language model's own prompted predictions as targets, so that gradient updates emulate the effect of prompting. Result: Experiments show that subsequent gradient descent training recovers some, and occasionally all, of the prompted model's performance, with improvements on "reversal curse" tasks and on answering questions about text passages after a single gradient update. Conclusion: With appropriate initialization, gradient descent can be surprisingly expressive; the results suggest new avenues for long-context modeling and offer insight into the generalization capabilities of gradient-based learning. Abstract: There are two primary ways of incorporating new information into a language model (LM): changing its prompt or changing its parameters, e.g. via fine-tuning. Parameter updates incur no long-term storage cost for model changes. However, for many model updates, prompting is significantly more effective: prompted models can generalize robustly from single examples and draw logical inferences that do not occur under standard fine-tuning. Can models be modified so that fine-tuning does emulate prompting? This paper describes a method for meta-training LMs such that gradient updates emulate the effects of conditioning on new information. Our approach uses tools from gradient-based meta-learning but uses an LM's own prompted predictions as targets, eliminating the need for ground-truth labels. Subsequent gradient descent training recovers some (and occasionally all) of prompted model performance -- showing improvement on the "reversal curse" tasks, and answering questions about text passages after a single gradient update. These results suggest that, with appropriate initialization, gradient descent can be surprisingly expressive. Our results suggest new avenues for long-context modeling and offer insight into the generalization capabilities of gradient-based learning.

[11] SAC: A Framework for Measuring and Inducing Personality Traits in LLMs with Dynamic Intensity Control

Adithya Chittem,Aishna Shrivastava,Sai Tarun Pendela,Jagat Sesh Challa,Dhruv Kumar

Main category: cs.CL

TL;DR: This paper presents an improved method for modelling personality expression in large language models by extending the Machine Personality Inventory to incorporate the 16 Personality Factor model and developing a framework for evaluating and inducing trait intensity, resulting in more controlled and nuanced human-machine interactions.

Details Motivation: The motivation behind this study is the growing expectation for large language models to display human-like personalities during interactions and the limitations of existing models that rely on the Big Five framework and lack mechanisms for controlling trait intensity. Method: The researchers developed an extended MPI framework called Specific Attribute Control (SAC), which uses adjective-based semantic anchoring and behavioural questions across five intensity factors to evaluate and dynamically induce trait intensity in LLMs. Result: The research found that modelling intensity as a continuous spectrum led to more consistent and controllable personality expression compared to binary trait toggling. Changes in target trait intensity were shown to systematically influence closely related traits in psychologically coherent directions. Conclusion: This paper concludes that extending the Machine Personality Inventory (MPI) to incorporate the 16 Personality Factor model and introducing adjective-based semantic anchoring significantly improves personality expression in large language models, allowing for controlled and nuanced human-machine interactions. Abstract: Large language models (LLMs) have gained significant traction across a wide range of fields in recent years. There is also a growing expectation for them to display human-like personalities during interactions. To meet this expectation, numerous studies have proposed methods for modelling LLM personalities through psychometric evaluations. However, most existing models face two major limitations: they rely on the Big Five (OCEAN) framework, which only provides coarse personality dimensions, and they lack mechanisms for controlling trait intensity. In this paper, we address this gap by extending the Machine Personality Inventory (MPI), which originally used the Big Five model, to incorporate the 16 Personality Factor (16PF) model, allowing expressive control over sixteen distinct traits. 
We also developed a structured framework known as Specific Attribute Control (SAC) for evaluating and dynamically inducing trait intensity in LLMs. Our method introduces adjective-based semantic anchoring to guide trait intensity expression and leverages behavioural questions across five intensity factors: Frequency, Depth, Threshold, Effort, and Willingness. Through experimentation, we find that modelling intensity as a continuous spectrum yields substantially more consistent and controllable personality expression compared to binary trait toggling. Moreover, we observe that changes in target trait intensity systematically influence closely related traits in psychologically coherent directions, suggesting that LLMs internalize multi-dimensional personality structures rather than treating traits in isolation. Our work opens new pathways for controlled and nuanced human-machine interactions in domains such as healthcare, education, and interviewing processes, bringing us one step closer to truly human-like social machines.

[12] Large Language Models Acing Chartered Accountancy

Jatin Gupta,Akhil Sharma,Saransh Singhania,Mohammad Adnan,Sakshi Deo,Ali Imam Abidi,Keshav Gupta

Main category: cs.CL

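TL;DR: This paper develops the CA-Ben benchmark to evaluate large language models (LLMs) on financial, legal, and quantitative reasoning. It finds that although some models do well on reasoning tasks, they still show clear limitations in numerical computation and legal interpretation.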

Details Motivation: To fill the research gap on LLMs' financial, legal, and quantitative reasoning capabilities in the Indian financial context, and to assess how well they apply domain-specific knowledge. Method: The paper introduces the CA-Ben benchmark and evaluates six prominent LLMs using standardized protocols. Result: Performance varies across LLMs. Claude 3.5 Sonnet and GPT-4o perform best, especially in conceptual and legal reasoning, but notable challenges emerge in numerical computations and legal interpretations. Conclusion: Current LLMs do well in conceptual and legal reasoning but struggle with numerical computation and legal interpretation. Future improvements may come from hybrid reasoning and retrieval-augmented generation, particularly for quantitative analysis and accurate legal interpretation. Abstract: Advanced intelligent systems, particularly Large Language Models (LLMs), are significantly reshaping financial practices through advancements in Natural Language Processing (NLP). However, the extent to which these models effectively capture and apply domain-specific financial knowledge remains uncertain. Addressing a critical gap in the expansive Indian financial context, this paper introduces CA-Ben, a Chartered Accountancy benchmark specifically designed to evaluate the financial, legal, and quantitative reasoning capabilities of LLMs. CA-Ben comprises structured question-answer datasets derived from the rigorous examinations conducted by the Institute of Chartered Accountants of India (ICAI), spanning foundational, intermediate, and advanced CA curriculum stages. Six prominent LLMs i.e. GPT 4o, LLAMA 3.3 70B, LLAMA 3.1 405B, MISTRAL Large, Claude 3.5 Sonnet, and Microsoft Phi 4 were evaluated using standardized protocols. Results indicate variations in performance, with Claude 3.5 Sonnet and GPT-4o outperforming others, especially in conceptual and legal reasoning. Notable challenges emerged in numerical computations and legal interpretations. The findings emphasize the strengths and limitations of current LLMs, suggesting future improvements through hybrid reasoning and retrieval-augmented generation methods, particularly for quantitative analysis and accurate legal interpretation.

[13] A Semi-supervised Scalable Unified Framework for E-commerce Query Classification

Chunyuan Yuan,Chong Zhang,Zheng Fang,Ming Pang,Xue Jiang,Changping Peng,Zhangang Lin,Ching Law

Main category: cs.CL

TL;DR: This paper introduces SSUF, a unified semi-supervised framework with modular enhancements for query classification in e-commerce, achieving significant performance improvements over existing methods.

Details Motivation: E-commerce queries are typically short and lack context, leading to insufficient prior information for modeling. Existing methods rely heavily on posterior click behavior, creating a Matthew vicious cycle, while the lack of a unified framework hampers algorithm optimization efficiency. Method: Development of a semi-supervised scalable unified framework (SSUF) with three enhanced modules: knowledge-enhanced for query representation, label-enhanced for reducing reliance on posterior labels, and structure-enhanced for modeling complex label relations. Extensive offline and online A/B experiments were conducted to evaluate performance. Result: The SSUF framework demonstrates superior performance over existing models in both offline and online experimental settings. Conclusion: The proposed SSUF framework significantly outperforms state-of-the-art models in query classification by unifying multiple subtasks and enhancing query and label representations through knowledge, labels, and structure modules. Abstract: Query classification, including multiple subtasks such as intent and category prediction, is vital to e-commerce applications. E-commerce queries are usually short and lack context, and the information between labels cannot be used, resulting in insufficient prior information for modeling. Most existing industrial query classification methods rely on users' posterior click behavior to construct training samples, resulting in a Matthew vicious cycle. Furthermore, the subtasks of query classification lack a unified framework, leading to low efficiency for algorithm optimization. In this paper, we propose a novel Semi-supervised Scalable Unified Framework (SSUF), containing multiple enhanced modules to unify the query classification tasks. The knowledge-enhanced module uses world knowledge to enhance query representations and solve the problem of insufficient query information. 
The label-enhanced module uses label semantics and semi-supervised signals to reduce the dependence on posterior labels. The structure-enhanced module enhances the label representation based on the complex label relations. Each module is highly pluggable, and input features can be added or removed as needed according to each subtask. We conduct extensive offline and online A/B experiments, and the results show that SSUF significantly outperforms the state-of-the-art models.

[14] MT2-CSD: A New Dataset and Multi-Semantic Knowledge Fusion Method for Conversational Stance Detection

Fuqiang Niu,Genan Dai,Yisha Lu,Jiayu Liao,Xiang Li,Hu Huang,Bowen Zhang

Main category: cs.CL

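TL;DR: This paper introduces MT2-CSD, a large dataset for multi-target, multi-turn conversational stance detection, and a model called LLM-CRAN that outperforms other baselines on it.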

Details Motivation: Traditional stance detection research is limited to individual instances, which makes it hard to model the multi-party discussions found in social media scenarios, and datasets that authentically capture the dynamics of social media interactions are lacking. Method: Proposes LLM-CRAN, a large language model enhanced conversational relational attention network, and conducts extensive experiments to evaluate its effectiveness on the MT2-CSD dataset. Result: Constructs MT2-CSD, currently the largest dataset for multi-target, multi-turn conversational stance detection, containing 24,457 annotated instances. Conclusion: LLM-CRAN significantly outperforms strong baseline models on the MT2-CSD dataset. Abstract: In the realm of contemporary social media, automatic stance detection is pivotal for opinion mining, as it synthesizes and examines user perspectives on contentious topics to uncover prevailing trends and sentiments. Traditional stance detection research often targets individual instances, thereby limiting its capacity to model multi-party discussions typical in real social media scenarios. This shortcoming largely stems from the scarcity of datasets that authentically capture the dynamics of social media interactions, hindering advancements in conversational stance detection. In this paper, we introduce MT2-CSD, a comprehensive dataset for multi-target, multi-turn conversational stance detection. To the best of our knowledge, MT2-CSD is the largest dataset available for this purpose, comprising 24,457 annotated instances and exhibiting the greatest conversational depth, thereby presenting new challenges for stance detection. To address these challenges, we propose the Large Language model enhanced Conversational Relational Attention Network (LLM-CRAN), which exploits the reasoning capabilities of LLMs to improve conversational understanding. We conduct extensive experiments to evaluate the efficacy of LLM-CRAN on the MT2-CSD dataset. The experimental results indicate that LLM-CRAN significantly outperforms strong baseline models in the task of conversational stance detection.

[15] DALR: Dual-level Alignment Learning for Multimodal Sentence Representation Learning

Kang He,Yuzhe Ding,Haining Wang,Fei Li,Chong Teng,Donghong Ji

Main category: cs.CL

TL;DR: This paper proposes DALR, a dual-level alignment method for multimodal sentence representation that improves cross-modal alignment and captures complex sentence relationships, outperforming current state-of-the-art approaches.

Details Motivation: Most existing multimodal methods align images and text at a coarse level, leading to cross-modal misalignment bias and intra-modal semantic divergence that degrade sentence representation quality. This work aims to address these limitations through more refined alignment strategies. Method: DALR (Dual-level Alignment Learning) incorporates a consistency learning module for fine-grained cross-modal alignment and integrates ranking distillation with global intra-modal alignment learning to better capture intricate sentence relationships. Result: Comprehensive experiments on semantic textual similarity (STS) and transfer (TR) tasks confirm the effectiveness of DALR, consistently showing its superiority over advanced baseline methods. Conclusion: The proposed DALR approach effectively addresses the challenges of cross-modal misalignment bias and intra-modal semantic divergence, demonstrating superior performance over existing state-of-the-art methods in multimodal sentence representation learning. Abstract: Previous multimodal sentence representation learning methods have achieved impressive performance. However, most approaches focus on aligning images and text at a coarse level, facing two critical challenges: cross-modal misalignment bias and intra-modal semantic divergence, which significantly degrade sentence representation quality. To address these challenges, we propose DALR (Dual-level Alignment Learning for Multimodal Sentence Representation). For cross-modal alignment, we propose a consistency learning module that softens negative samples and utilizes semantic similarity from an auxiliary task to achieve fine-grained cross-modal alignment. Additionally, we contend that sentence relationships go beyond binary positive-negative labels, exhibiting a more intricate ranking structure. To better capture these relationships and enhance representation quality, we integrate ranking distillation with global intra-modal alignment learning. 
Comprehensive experiments on semantic textual similarity (STS) and transfer (TR) tasks validate the effectiveness of our approach, consistently demonstrating its superiority over state-of-the-art baselines.

[16] ComRAG: Retrieval-Augmented Generation with Dynamic Vector Stores for Real-time Community Question Answering in Industry

Qinwen Chen,Wenbiao Tao,Zhiwei Zhu,Mingfan Xi,Liangzhong Guo,Yuan Wang,Wei Wang,Yunshi Lan

Main category: cs.CL

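TL;DR: Proposes ComRAG, a framework that uses a memory mechanism to integrate static knowledge with dynamic historical QA, and performs strongly on industrial community question answering tasks.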

Details Motivation: Existing methods underutilize external knowledge, fail to incorporate dynamic historical QA context effectively, or lack memory mechanisms suited for industrial deployment. Method: A centroid-based memory mechanism combines retrieval, generation, and efficient storage strategies to make full use of external knowledge and dynamic historical context. Result: On three industrial CQA datasets, ComRAG improves vector similarity by up to 25.9%, reduces latency by 8.7% to 23.3%, and lowers chunk growth from 20.23% to 2.06%. Conclusion: ComRAG is a retrieval-augmented generation framework for real-time industrial CQA that effectively integrates static knowledge with dynamic historical QA pairs and outperforms the baselines. Abstract: Community Question Answering (CQA) platforms can be deemed as important knowledge bases in community, but effectively leveraging historical interactions and domain knowledge in real-time remains a challenge. Existing methods often underutilize external knowledge, fail to incorporate dynamic historical QA context, or lack memory mechanisms suited for industrial deployment. We propose ComRAG, a retrieval-augmented generation framework for real-time industrial CQA that integrates static knowledge with dynamic historical QA pairs via a centroid-based memory mechanism designed for retrieval, generation, and efficient storage. Evaluated on three industrial CQA datasets, ComRAG consistently outperforms all baselines--achieving up to 25.9% improvement in vector similarity, reducing latency by 8.7% to 23.3%, and lowering chunk growth from 20.23% to 2.06% over iterations.

[17] Progtuning: Progressive Fine-tuning Framework for Transformer-based Language Models

Xiaoshuang Ji,Zhendong Zhao,Xiaojun Chen,Xin Zhao,Zeyao Liu

Main category: cs.CL

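TL;DR: This paper proposes Progtuning, a new fine-tuning framework that uses progressive learning to reduce the number of updated parameters while maintaining competitive performance, and shows good adaptability.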

Details Motivation: As model sizes grow, updating all model parameters becomes increasingly costly. Existing fine-tuning methods ignore the unequal contributions of Transformer blocks, leading to inefficient allocation of computing resources. Method: Progressively reduces the number of updated Transformer blocks based on their contribution to optimize resource allocation. Result: Progtuning optimizes resource allocation and reduces the number of updated parameters by approximately 25% while maintaining competitive performance. Conclusion: Progtuning is a new fine-tuning framework combined with progressive learning that reduces the number of updated parameters while maintaining competitive performance and adapts well to various adaptation scenarios. Abstract: Fine-tuning is a promising technique for leveraging Transformer-based language models in downstream tasks. As model sizes continue to grow, updating all model parameters becomes increasingly costly. Parameter-efficient fine-tuning methods effectively address this issue by selectively updating a small subset of parameters. However, fine-tuning and most existing parameter-efficient fine-tuning methods require updating the same number of parameters as the initial size, ignoring the unequal contribution across Transformer blocks and leading to extremely inefficient allocation of computing resources. In this paper, we propose Progtuning, the novel fine-tuning framework combined with progressive learning for Transformer-based language models. Specifically, Progtuning progressively reduces the number of updated transformer blocks based on the contribution. Remarkably, Progtuning optimizes resource allocation and reduces the number of updated parameters by approximately 25%, while still maintaining competitive performance. And it also exhibits high adaptability with parameter-efficient fine-tuning methods, demonstrating excellent performance across various adaptation scenarios.
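
Editor's sketch (not from the paper): the idea of progressively shrinking the set of updated Transformer blocks can be illustrated with a toy schedule. The abstract does not say which blocks are dropped or how contribution is measured, so the code below assumes, purely for illustration, that training is split into equal stages and that the lowest blocks stop being updated first. The names `progressive_schedule` and `updated_fraction` are hypothetical.

```python
def progressive_schedule(num_blocks: int, num_stages: int) -> list[list[int]]:
    """For each training stage, list the block indices that are still updated.

    Assumption (not from the paper): blocks are frozen from the bottom up,
    one equal-sized group per stage.
    """
    if num_stages < 1 or num_blocks % num_stages != 0:
        raise ValueError("num_blocks must be a positive multiple of num_stages")
    group = num_blocks // num_stages
    schedule = []
    for stage in range(num_stages):
        first_trainable = stage * group  # blocks below this index are frozen
        schedule.append(list(range(first_trainable, num_blocks)))
    return schedule


def updated_fraction(schedule: list[list[int]], num_blocks: int) -> float:
    """Share of block updates performed, relative to updating every block in every stage."""
    performed = sum(len(blocks) for blocks in schedule)
    return performed / (len(schedule) * num_blocks)


schedule = progressive_schedule(num_blocks=12, num_stages=4)
print([len(blocks) for blocks in schedule])  # blocks updated per stage
print(updated_fraction(schedule, 12))        # share of full fine-tuning updates
```

In a real training loop, each stage would toggle gradient updates for the listed blocks only. The savings from this toy schedule depend entirely on the assumed stage layout and are not meant to reproduce the paper's reported figure.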

[18] Compressed and Smooth Latent Space for Text Diffusion Modeling

Viacheslav Meshchaninov,Egor Chimbulatov,Alexander Shabalin,Aleksandr Abramov,Dmitry Vetrov

Main category: cs.CL

TL;DR: Cosmos enables efficient and high-quality text generation using a compressed latent space tailored for diffusion models.

Details Motivation: To overcome the limitations of autoregressive models (slow decoding, coherence issues) and diffusion models (challenges due to high dimensionality in token-level representations). Method: Introducing Cosmos, which uses a compressed, smooth latent space for diffusion-based text generation, learned via an autoencoder aligned with pretrained language encoder activations. Result: Cosmos compresses text representations by 8×, maintains generation quality, surpasses diffusion-based and autoregressive baselines with longer latent sequences, and achieves over 2× faster inference. Conclusion: Cosmos provides a novel and effective approach to text generation, achieving comparable or superior quality while offering faster inference compared to existing methods. Abstract: Autoregressive language models dominate modern text generation, yet their sequential nature introduces fundamental limitations: decoding is slow, and maintaining global coherence remains challenging. Diffusion models offer a promising alternative by enabling parallel generation and flexible control; however, their application to text generation is hindered by the high dimensionality of token-level representations. We introduce Cosmos, a novel approach to text generation that operates entirely in a compressed, smooth latent space tailored specifically for diffusion. This space is learned using an autoencoder trained simultaneously for token-level reconstruction and alignment with frozen activations from a pretrained language encoder, providing robust semantic grounding and enabling effective perturbation-based augmentations. Empirically, we demonstrate that text representations can be compressed by $8\times$ while maintaining generation quality comparable to token-level diffusion models. Furthermore, increasing the latent sequence length allows Cosmos to surpass both diffusion-based and autoregressive baselines. 
We evaluate Cosmos on four diverse generative tasks including story generation, question generation, summarization, and detoxification and compare it with various generative paradigms. Cosmos achieves comparable or superior generation quality while offering more than $2\times$ faster inference.

[19] Maintaining MTEB: Towards Long Term Usability and Reproducibility of Embedding Benchmarks

Isaac Chung,Imene Kerboua,Marton Kardos,Roman Solomatin,Kenneth Enevoldsen

Main category: cs.CL

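TL;DR: This paper discusses the engineering practices behind the MTEB benchmark platform, aimed at improving its reproducibility, extensibility, and usability for the community.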

Details Motivation: To ensure that MTEB remains reproducible and extensible so that it can meet growing demand and stay relevant for evaluating text embedding models. Method: Describes MTEB's continuous integration approach to validating dataset integrity, automating test execution, and assessing the generalizability of benchmark results, and discusses design choices for handling community contributions and extending the benchmark with new tasks. Result: These engineering practices have allowed MTEB to scale while maintaining quality, further strengthening its support for evaluation frameworks in machine learning. Conclusion: Through continuous-integration engineering practices, MTEB has improved its reproducibility and usability and provides the community with a valuable benchmark platform. Abstract: The Massive Text Embedding Benchmark (MTEB) has become a standard evaluation platform for text embedding models. While previous work has established the core benchmark methodology, this paper focuses on the engineering aspects that ensure MTEB's continued reproducibility and extensibility. We present our approach to maintaining robust continuous integration pipelines that validate dataset integrity, automate test execution, and assess benchmark results' generalizability. We detail the design choices that collectively enhance reproducibility and usability. Furthermore, we discuss our strategies for handling community contributions and extending the benchmark with new tasks and datasets. These engineering practices have been instrumental in scaling MTEB to become more comprehensive while maintaining quality and, ultimately, relevance to the field. Our experiences offer valuable insights for benchmark maintainers facing similar challenges in ensuring reproducibility and usability in machine learning evaluation frameworks. The MTEB repository is available at: https://github.com/embeddings-benchmark/mteb

[20] Prompt-Guided Turn-Taking Prediction

Koji Inoue,Mikey Elmers,Yahui Fu,Zi Haur Pang,Divesh Lala,Keiko Ochi,Tatsuya Kawahara

Main category: cs.CL

TL;DR: This paper introduces a novel turn-taking prediction model for dialogue systems that can be dynamically controlled via textual prompts, improving prediction accuracy and adaptability.

Details Motivation: Turn-taking prediction models are crucial for dialogue systems and conversational robots. This study addresses the need for more intuitive and explicit control over prediction behavior to adapt dynamically to different conversational contexts. Method: The study proposes a novel model that integrates textual prompt embeddings into a transformer-based voice activity projection (VAP) model, using both channel-wise and cross-channel transformers. Synthetic prompt data was generated using a large language model (LLM). Result: Experimental results showed improved prediction accuracy and the ability of the model to adjust turn-taking timing based on textual prompts. Conclusion: The proposed model enhances turn-taking prediction in spoken dialogue systems by incorporating textual prompt embeddings, enabling dynamic control through intuitive instructions. Abstract: Turn-taking prediction models are essential components in spoken dialogue systems and conversational robots. Recent approaches leverage transformer-based architectures to predict speech activity continuously and in real-time. In this study, we propose a novel model that enables turn-taking prediction to be dynamically controlled via textual prompts. This approach allows intuitive and explicit control through instructions such as "faster" or "calmer" adapting dynamically to conversational partners and contexts. The proposed model builds upon a transformer-based voice activity projection (VAP) model, incorporating textual prompt embeddings into both channel-wise transformers and a cross-channel transformer. We evaluated the feasibility of our approach using over 950 hours of human-human spoken dialogue data. Since textual prompt data for the proposed approach was not available in existing datasets, we utilized a large language model (LLM) to generate synthetic prompt sentences. 
Experimental results demonstrated that the proposed model improved prediction accuracy and effectively varied turn-taking timing behaviors according to the textual prompts.

[21] Enhancing Automatic Term Extraction with Large Language Models via Syntactic Retrieval

Yongchan Chun,Minhyuk Kim,Dongjun Kim,Chanjun Park,Heuiseok Lim

Main category: cs.CL

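TL;DR: This paper explores the use of large language models for automatic term extraction, improves extraction with a retrieval strategy based on syntactic similarity, and highlights the importance of syntactic cues.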

Details Motivation: Although large language models have significantly advanced various NLP tasks, their potential for automatic term extraction has scarcely been examined. Method: Proposes a retrieval-based prompting strategy that, in the few-shot setting, selects demonstrations according to syntactic rather than semantic similarity. Result: Experiments on three specialized ATE benchmarks show that syntactic retrieval improves F1-score. Conclusion: Syntactic retrieval matters when adapting large language models to term extraction tasks. Abstract: Automatic Term Extraction (ATE) identifies domain-specific expressions that are crucial for downstream tasks such as machine translation and information retrieval. Although large language models (LLMs) have significantly advanced various NLP tasks, their potential for ATE has scarcely been examined. We propose a retrieval-based prompting strategy that, in the few-shot setting, selects demonstrations according to syntactic rather than semantic similarity. This syntactic retrieval method is domain-agnostic and provides more reliable guidance for capturing term boundaries. We evaluate the approach in both in-domain and cross-domain settings, analyzing how lexical overlap between the query sentence and its retrieved examples affects performance. Experiments on three specialized ATE benchmarks show that syntactic retrieval improves F1-score. These findings highlight the importance of syntactic cues when adapting LLMs to terminology-extraction tasks.
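
Editor's sketch (not from the paper): one simple way to picture demonstration selection by syntactic rather than semantic similarity is to compare part-of-speech tag sequences and ignore the words themselves. The abstract does not state which syntactic representation or similarity measure the authors use, so the choice of POS tags and of `difflib.SequenceMatcher` is an assumption, and the tags are written by hand instead of coming from a tagger. `syntactic_similarity` and `retrieve_demonstrations` are hypothetical names.

```python
from difflib import SequenceMatcher


def syntactic_similarity(tags_a: list[str], tags_b: list[str]) -> float:
    """Similarity in [0, 1] between two POS-tag sequences (words are ignored)."""
    return SequenceMatcher(None, tags_a, tags_b).ratio()


def retrieve_demonstrations(query_tags: list[str], pool: list[dict], k: int = 2) -> list[dict]:
    """Return the k pool examples whose tag sequences are closest to the query's."""
    ranked = sorted(
        pool,
        key=lambda example: syntactic_similarity(query_tags, example["tags"]),
        reverse=True,
    )
    return ranked[:k]


# Hand-tagged toy pool; in practice the tags would come from a POS tagger.
pool = [
    {"id": "A", "text": "The renal artery supplies the kidney.",
     "tags": ["DET", "ADJ", "NOUN", "VERB", "DET", "NOUN"]},
    {"id": "B", "text": "It works well.",
     "tags": ["PRON", "VERB", "ADV"]},
    {"id": "C", "text": "The valve controls the flow.",
     "tags": ["DET", "NOUN", "VERB", "DET", "NOUN"]},
]
query_tags = ["DET", "ADJ", "NOUN", "VERB", "DET", "NOUN"]  # e.g. "The neural network predicts the label."
selected = retrieve_demonstrations(query_tags, pool, k=2)
print([example["id"] for example in selected])
```

The selected examples would then be placed in the few-shot prompt. Because only tag sequences are compared, a demonstration from an unrelated domain can still rank highly, which matches the abstract's description of the method as domain-agnostic.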

[22] Agent-RewardBench: Towards a Unified Benchmark for Reward Modeling across Perception, Planning, and Safety in Real-World Multimodal Agents

Tianyi Men,Zhuoran Jin,Pengfei Cao,Yubo Chen,Kang Liu,Jun Zhao

Main category: cs.CL

TL;DR: This paper proposes Agent-RewardBench, a comprehensive benchmark for evaluating reward modeling in Multimodal Large Language Models, showing that current models require specialized training to improve performance.

Details Motivation: Due to limitations in external feedback, multimodal agents struggle with self-correction and generalization. Selecting suitable reward models as external feedback is crucial, necessitating a targeted benchmark. Method: The authors propose Agent-RewardBench, a benchmark with multiple dimensions and real-world agent scenarios, step-level reward evaluation, and appropriate difficulty and high-quality data sampling from 10 diverse models. Result: Experiments show that even state-of-the-art multimodal models perform poorly on the benchmark, emphasizing the importance of dedicated reward modeling training. Conclusion: Agent-RewardBench is an effective benchmark for evaluating reward modeling ability in MLLMs, highlighting the need for specialized training in agent reward modeling. Abstract: As Multimodal Large Language Models (MLLMs) advance, multimodal agents show promise in real-world tasks like web navigation and embodied intelligence. However, due to a lack of external feedback, these agents struggle with self-correction and generalization. A promising approach is to use reward models as external feedback, but there is no clear guidance on how to select reward models for agents. Thus, there is an urgent need to build a reward bench targeted at agents. To address these challenges, we propose Agent-RewardBench, a benchmark designed to evaluate reward modeling ability in MLLMs. The benchmark is characterized by three key features: (1) Multiple dimensions and real-world agent scenarios evaluation. It covers perception, planning, and safety with 7 scenarios; (2) Step-level reward evaluation. It allows for the assessment of agent capabilities at the individual steps of a task, providing a more granular view of performance during the planning process; and (3) Appropriate difficulty and high quality. 
We carefully sample from 10 diverse models, apply difficulty control to maintain task challenge, and use manual verification to ensure the integrity of the data. Experiments demonstrate that even state-of-the-art multimodal models show limited performance, highlighting the need for specialized training in agent reward modeling. Code is available on GitHub.

[23] Cat and Mouse -- Can Fake Text Generation Outpace Detector Systems?

Andrea McGlinchey,Peter J Barclay

Main category: cs.CL

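TL;DR: This study finds that, despite the continued development of large language models, simple statistical classifiers can still detect fake text effectively; newer versions of Gemini generate more deceptive text, while GPT shows no clear improvement.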

Details Motivation: As large language models (LLMs) keep growing in parameters, training data, and energy use, there is concern that fake text detection will become an endless "arms race"; this paper examines whether the models' ability to beat detectors may reach a plateau. Method: The researchers use statistical classifiers to detect "fake text" written in the style of classical detective fiction and assess the ability of different versions of Gemini and GPT to generate deceptive text. Result: Over a 0.5 version increase, Gemini showed an increased ability to generate deceptive text, while GPT showed no significant improvement. Conclusion: Reliable detection of fake text may remain feasible even for larger future models, although new model architectures may improve their deceptiveness. Abstract: Large language models can produce convincing "fake text" in domains such as academic writing, product reviews, and political news. Many approaches have been investigated for the detection of artificially generated text. While this may seem to presage an endless "arms race", we note that newer LLMs use ever more parameters, training data, and energy, while relatively simple classifiers demonstrate a good level of detection accuracy with modest resources. To approach the question of whether the models' ability to beat the detectors may therefore reach a plateau, we examine the ability of statistical classifiers to identify "fake text" in the style of classical detective fiction. Over a 0.5 version increase, we found that Gemini showed an increased ability to generate deceptive text, while GPT did not. This suggests that reliable detection of fake text may remain feasible even for ever-larger models, though new model architectures may improve their deceptiveness.

[24] Double-Checker: Enhancing Reasoning of Slow-Thinking LLMs via Self-Critical Fine-Tuning

Xin Xu,Tianhao Chen,Fan Zhang,Wanlong Liu,Pengxiang Li,Ajay Kumar Jaiswal,Yuchen Yan,Jishan Hu,Yang Wang,Hao Chen,Shiwei Liu,Shizhe Diao,Can Yang,Lu Yin

Main category: cs.CL

TL;DR: This study proposes Double-Checker, a new framework that enhances the reasoning of slow-thinking LLMs through explicit self-critique and iterative refinement.

Details Motivation: Although slow-thinking large language models (LLMs) exhibit reflection-like reasoning, their ability to generate informative critiques and refine prior solutions remains limited. Method: Introduces the Double-Checker framework and fine-tunes on 1,730 self-critical instances to improve model performance. Result: On AIME benchmarks, pass@1 performance rose from 4.4% to 18.2%. Conclusion: Double-Checker is a principled framework that enhances the reasoning capabilities of slow-thinking LLMs and demonstrates the effectiveness of iterative self-critique in improving model performance. Abstract: While slow-thinking large language models (LLMs) exhibit reflection-like reasoning, commonly referred to as the "aha moment", their ability to generate informative critiques and refine prior solutions remains limited. In this paper, we introduce Double-Checker, a principled framework designed to enhance the reasoning capabilities of slow-thinking LLMs by fostering explicit self-critique and iterative refinement of their previous solutions. By fine-tuning on our curated 1,730 self-critical instances, Double-Checker empowers long-CoT LLMs to iteratively critique and refine their outputs during inference until they evaluate their solutions as correct under self-generated critiques. We validate the efficacy of Double-Checker across a comprehensive suite of reasoning benchmarks, demonstrating that iterative self-critique significantly enhances the reasoning capabilities of long-CoT LLMs. Notably, our Double-Checker increases the pass@1 performance on challenging AIME benchmarks from 4.4% to 18.2% compared to the original long-CoT LLMs. These results highlight a promising direction for developing more trustworthy and effective LLMs capable of structured self-critique.
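
Editor's sketch (not from the paper): the inference-time behaviour described in the abstract, critiquing and refining until the model judges its own answer correct, has the shape of a simple loop. The function `double_check`, its three callables, and the `max_rounds` cap are hypothetical stand-ins. In the paper a single fine-tuned LLM plays all three roles, whereas here toy arithmetic functions are used so that the loop runs without a model.

```python
from typing import Callable, Tuple


def double_check(
    question: str,
    solve: Callable[[str], str],
    critique: Callable[[str, str], Tuple[bool, str]],
    refine: Callable[[str, str, str], str],
    max_rounds: int = 3,  # assumed safety cap, not taken from the paper
) -> Tuple[str, int]:
    """Critique and refine an answer until the critique accepts it or rounds run out.

    Returns the final answer and the number of refinement rounds used.
    """
    answer = solve(question)
    for rounds_used in range(max_rounds):
        is_correct, feedback = critique(question, answer)
        if is_correct:
            return answer, rounds_used
        answer = refine(question, answer, feedback)
    return answer, max_rounds


# Toy stand-ins for the model calls: the first attempt at "2+3" is off by one.
def toy_solve(question: str) -> str:
    return "6"


def toy_critique(question: str, answer: str) -> Tuple[bool, str]:
    left, right = question.split("+")
    target = int(left) + int(right)
    return int(answer) == target, f"expected the sum of {left} and {right}"


def toy_refine(question: str, answer: str, feedback: str) -> str:
    return str(int(answer) - 1)


result = double_check("2+3", toy_solve, toy_critique, toy_refine)
print(result)
```

The loop stops as soon as the critique step accepts the answer, so easy questions cost one critique call while harder ones use extra rounds up to the cap.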

[25] Small Encoders Can Rival Large Decoders in Detecting Groundedness

Istabrak Abbes,Gabriele Prato,Quentin Fournier,Fernando Rodriguez,Alaa Boukhary,Adam Elwood,Sarath Chandar

Main category: cs.CL

TL;DR: This paper proposes using lightweight encoder models to efficiently detect if a query is grounded in given context before using large language models to generate answers, reducing computational cost while maintaining accuracy.

Details Motivation: LLMs often generate ungrounded responses when context lacks information, leading to unreliable answers. Detecting groundedness beforehand reduces costly answer generation and improves efficiency. Method: Fine-tuning lightweight encoder models like RoBERTa and NomicBERT on curated datasets for groundedness detection in LLM responses. Result: Lightweight models achieved accuracy comparable to Llama3 8B and GPT4o in groundedness detection with significantly reduced inference time. Conclusion: Lightweight, task-specific encoder models can match the groundedness detection accuracy of state-of-the-art LLMs while significantly reducing inference latency. Abstract: Augmenting large language models (LLMs) with external context significantly improves their performance in natural language processing (NLP) tasks. However, LLMs struggle to answer queries reliably when the provided context lacks information, often resorting to ungrounded speculation or internal knowledge. Groundedness - generating responses strictly supported by the context - is essential for ensuring factual consistency and trustworthiness. This study focuses on detecting whether a given query is grounded in a document provided in context before the costly answer generation by LLMs. Such a detection mechanism can significantly reduce both inference time and resource consumption. We show that lightweight, task-specific encoder models such as RoBERTa and NomicBERT, fine-tuned on curated datasets, can achieve accuracy comparable to state-of-the-art LLMs, such as Llama3 8B and GPT4o, in groundedness detection while reducing inference latency by orders of magnitude. The code is available at: https://github.com/chandarlab/Hallucinate-less

[26] Detecting Referring Expressions in Visually Grounded Dialogue with Autoregressive Language Models

Bram Willemsen,Gabriel Skantze

Main category: cs.CL

TL;DR: This paper investigates if text-only language models can detect referring expressions in dialogue, showing effectiveness but highlighting the need for multimodal approaches.

Details Motivation: To understand how much linguistic context alone can help in identifying referring expressions without relying on visual information in dialogue systems. Method: A pretrained large language model (LLM) was adapted to detect mention spans in conversations using next-token prediction, focusing on a text-only, autoregressive language modeling approach. Result: The results show that even with moderate-sized LLMs, small datasets, and parameter-efficient fine-tuning, the text-only approach can be effective for this specific task. Conclusion: The study concludes that while text-only approaches can effectively detect referring expressions in visually grounded dialogues, the task fundamentally requires multimodal solutions. Abstract: In this paper, we explore the use of a text-only, autoregressive language modeling approach for the extraction of referring expressions from visually grounded dialogue. More specifically, the aim is to investigate the extent to which the linguistic context alone can inform the detection of mentions that have a (visually perceivable) referent in the visual context of the conversation. To this end, we adapt a pretrained large language model (LLM) to perform a relatively coarse-grained annotation of mention spans in unfolding conversations by demarcating mention span boundaries in text via next-token prediction. Our findings indicate that even when using a moderately sized LLM, relatively small datasets, and parameter-efficient fine-tuning, a text-only approach can be effective, highlighting the relative importance of the linguistic context for this task. Nevertheless, we argue that the task represents an inherently multimodal problem and discuss limitations fundamental to unimodal approaches.

[27] Structuralist Approach to AI Literary Criticism: Leveraging Greimas Semiotic Square for Large Language Models

Fangzhou Dong,Yifan Zeng,Yingpeng Sang,Hong Shen

Main category: cs.CL

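TL;DR: This paper proposes GLASS, a new structured analytical framework that enhances the ability of large language models to conduct in-depth literary analysis, and creates the first dataset for GSS-based literary criticism.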

Details Motivation: Large language models (LLMs) excel at understanding and generating text but struggle to provide professional literary criticism of works with profound thoughts and complex narratives. Method: Proposes GLASS (Greimas Literary Analysis via Semiotic Square), a structured analytical framework based on the Greimas Semiotic Square (GSS), along with the first dataset for GSS-based literary criticism and quantitative metrics that use the LLM-as-a-judge paradigm. Result: Compared with expert criticism across multiple works and LLMs, the framework's results show high performance; applying GLASS to 39 classic works produced original, high-quality analyses that address existing research gaps. Conclusion: This research provides an AI-based tool for literary research and education and offers insights into the cognitive mechanisms underlying literary engagement. Abstract: Large Language Models (LLMs) excel in understanding and generating text but struggle with providing professional literary criticism for works with profound thoughts and complex narratives. This paper proposes GLASS (Greimas Literary Analysis via Semiotic Square), a structured analytical framework based on Greimas Semiotic Square (GSS), to enhance LLMs' ability to conduct in-depth literary analysis. GLASS facilitates the rapid dissection of narrative structures and deep meanings in narrative works. We propose the first dataset for GSS-based literary criticism, featuring detailed analyses of 48 works. Then we propose quantitative metrics for GSS-based literary criticism using the LLM-as-a-judge paradigm. Our framework's results, compared with expert criticism across multiple works and LLMs, show high performance. Finally, we applied GLASS to 39 classic works, producing original and high-quality analyses that address existing research gaps. This research provides an AI-based tool for literary research and education, offering insights into the cognitive mechanisms underlying literary engagement.

[28] Leveraging LLM-Assisted Query Understanding for Live Retrieval-Augmented Generation

Guanting Dong,Xiaoxi Li,Yuyao Zhang,Mengjie Deng

Main category: cs.CL

TL;DR: This paper proposes Omni-RAG, a novel framework designed to improve the ability of RAG systems to handle real-world, noisy, and multi-intent user queries effectively.

Details Motivation: Real-world RAG systems struggle with noisy, ambiguous user queries containing multiple intents, while current systems are often trained on cleaner data, highlighting a gap between existing capabilities and practical demands. Method: The paper introduces Omni-RAG, which uses LLM-assisted query understanding with three modules: Deep Query Understanding and Decomposition, Intent-Aware Knowledge Retrieval, and Reranking and Generation. Result: Omni-RAG improves processing of complex inputs by denoising, decomposing, retrieving, reranking, and generating responses using tailored prompts and tools like OpenSearch and BGE. Conclusion: Omni-RAG aims to enhance the robustness and effectiveness of RAG systems in real-world applications by handling complex and noisy queries through its three key modules. Abstract: Real-world live retrieval-augmented generation (RAG) systems face significant challenges when processing user queries that are often noisy, ambiguous, and contain multiple intents. While RAG enhances large language models (LLMs) with external knowledge, current systems typically struggle with such complex inputs, as they are often trained or evaluated on cleaner data. This paper introduces Omni-RAG, a novel framework designed to improve the robustness and effectiveness of RAG systems in live, open-domain settings. 
Omni-RAG employs LLM-assisted query understanding to preprocess user inputs through three key modules: (1) Deep Query Understanding and Decomposition, which utilizes LLMs with tailored prompts to denoise queries (e.g., correcting spelling errors) and decompose multi-intent queries into structured sub-queries; (2) Intent-Aware Knowledge Retrieval, which performs retrieval for each sub-query from a corpus (i.e., FineWeb using OpenSearch) and aggregates the results; and (3) Reranking and Generation, where a reranker (i.e., BGE) refines document selection before a final response is generated by an LLM (i.e., Falcon-10B) using a chain-of-thought prompt. Omni-RAG aims to bridge the gap between current RAG capabilities and the demands of real-world applications, such as those highlighted by the SIGIR 2025 LiveRAG Challenge, by robustly handling complex and noisy queries.

[29] Domain Knowledge-Enhanced LLMs for Fraud and Concept Drift Detection

Ali Şenol,Garima Agrawal,Huan Liu

Main category: cs.CL

TL;DR: This paper introduces a Domain Knowledge-Enhanced LLM framework to detect deceptive conversations and classify concept drift, achieving high accuracy and improved performance over existing methods.

Details Motivation: The motivation stems from the increasing difficulty in detecting deceptive conversations due to evolving language patterns and concept drift, which can obscure malicious intent or mimic normal dialogue, making accurate classification challenging even for advanced LLMs. Method: The paper proposes a Domain Knowledge (DK)-Enhanced LLM framework consisting of a DK-LLM module for detecting deceptive conversations, an OCDD unit for detecting semantic shifts, and a second DK-LLM module to classify drift as benign or fraudulent. The framework is validated using a fake review dataset and applied to the SEConvo dataset for evaluation. Result: The proposed system achieved high accuracy in detecting fake conversations, with a 98% classification accuracy using a LLaMA-based implementation. It also effectively classified the nature of concept drift, outperforming zero-shot baselines when incorporating domain knowledge and drift awareness. Conclusion: The study concludes that integrating domain knowledge with LLMs enhances the detection of deceptive conversations and concept drift, significantly improving performance, interpretability, and robustness in high-stakes NLP applications. Abstract: Detecting deceptive conversations on dynamic platforms is increasingly difficult due to evolving language patterns and Concept Drift (CD), i.e., semantic or topical shifts that alter the context or intent of interactions over time. These shifts can obscure malicious intent or mimic normal dialogue, making accurate classification challenging. While Large Language Models (LLMs) show strong performance in natural language tasks, they often struggle with contextual ambiguity and hallucinations in risk-sensitive scenarios. To address these challenges, we present a Domain Knowledge (DK)-Enhanced LLM framework that integrates pretrained LLMs with structured, task-specific insights to perform fraud and concept drift detection. 
The proposed architecture consists of three main components: (1) a DK-LLM module to detect fake or deceptive conversations; (2) a drift detection unit (OCDD) to determine whether a semantic shift has occurred; and (3) a second DK-LLM module to classify the drift as either benign or fraudulent. We first validate the value of domain knowledge using a fake review dataset and then apply our full framework to SEConvo, a multiturn dialogue dataset that includes various types of fraud and spam attacks. Results show that our system detects fake conversations with high accuracy and effectively classifies the nature of drift. Guided by structured prompts, the LLaMA-based implementation achieves 98% classification accuracy. Comparative studies against zero-shot baselines demonstrate that incorporating domain knowledge and drift awareness significantly improves performance, interpretability, and robustness in high-stakes NLP applications.

[30] Text2Cypher Across Languages: Evaluating Foundational Models Beyond English

Makbule Gulcin Ozsoy,William Tai

Main category: cs.CL

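TL;DR: This paper studies how foundational large language models perform on the multilingual Text2Cypher task, finding that performance varies by language: best on English, followed by Spanish, and worst on Turkish. Translating the prompt has little effect on the results.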

Details Motivation: Although natural language interfaces improve database accessibility, most current research focuses only on English, with limited evaluation in other languages. This paper therefore investigates the performance of foundational LLMs on the Text2Cypher task across multiple languages. Method: The authors create and release a multilingual test set by translating English questions into Spanish and Turkish while preserving the original Cypher queries, enabling fair cross-lingual comparison. Multiple foundational models are evaluated using standardized prompts and metrics. Result: The results show a consistent performance pattern: highest on English, then Spanish, and lowest on Turkish. This is attributed to differences in training data availability and linguistic characteristics. In addition, translating the task prompts has almost no effect on the evaluation metrics. Conclusion: The paper highlights the need for more inclusive evaluation and development in multilingual query generation and proposes future research directions such as schema localization and fine-tuning across diverse languages. Abstract: Recent advances in large language models have enabled natural language interfaces that translate user questions into database queries, such as Text2SQL, Text2SPARQL, and Text2Cypher. While these interfaces enhance database accessibility, most research today focuses solely on English, with limited evaluation in other languages. This paper investigates the performance of foundational LLMs on the Text2Cypher task across multiple languages. We create and release a multilingual test set by translating English questions into Spanish and Turkish while preserving the original Cypher queries, enabling fair cross-lingual comparison. We evaluate multiple foundational models using standardized prompts and metrics. Our results show a consistent performance pattern: highest on English, then Spanish, and lowest on Turkish. We attribute this to differences in training data availability and linguistic characteristics. Additionally, we explore the impact of translating task prompts into Spanish and Turkish. Results show little to no change in evaluation metrics, suggesting prompt translation has minor impact. Our findings highlight the need for more inclusive evaluation and development in multilingual query generation. Future work includes schema localization and fine-tuning across diverse languages.

[31] Aligning Spoken Dialogue Models from User Interactions

Anne Wu,Laurent Mazaré,Neil Zeghidour,Alexandre Défossez

Main category: cs.CL

TL;DR: This paper introduces a preference alignment framework tailored for real-time spoken dialogue systems, addressing challenges such as interruptions and temporal context variations. Using a large dataset and offline alignment methods, the model improvements were validated through experiments and human evaluations.

Details Motivation: Current preference learning approaches are mainly focused on text-based language models, which do not adequately address the complexities of real-time speech interactions involving interruptions, interjections, and lack of clear speaker turn segmentation. Method: The researchers created a large-scale dataset with over 150,000 preference pairs from multi-turn speech conversations, annotated with AI feedback. They applied offline alignment techniques to fine-tune a full-duplex autoregressive speech-to-speech model and conducted extensive experiments and human evaluations. Result: Feedback derived from generic conversations was found to effectively improve spoken dialogue models. The finetuned model demonstrated better performance in producing factual, safe, and contextually aligned interactions across multi-turn dialogues. Conclusion: The study concludes that applying preference alignment methods to real-time spoken dialogue models can significantly enhance the factualness, safety, and contextual alignment of interactions. It emphasizes the importance of balancing various conversational dynamics for effective speech dialogue systems. Abstract: We propose a novel preference alignment framework for improving spoken dialogue models on real-time conversations from user interactions. Current preference learning methods primarily focus on text-based language models, and are not directly suited to the complexities of real-time speech interactions, with richer dynamics (e.g., interruption, interjection) and no explicit segmentation between speaker turns. We create a large-scale dataset of more than 150,000 preference pairs from raw multi-turn speech conversations, annotated with AI feedback, to cover preferences over both linguistic content and temporal context variations. We leverage offline alignment methods to finetune a full-duplex autoregressive speech-to-speech model. 
Extensive experiments demonstrate that feedback on generic conversations can be consistently effective in improving spoken dialogue models to produce more factual, safer and more contextually aligned interactions. We deploy the finetuned model and conduct holistic human evaluations to assess the impact beyond single-turn conversations. Our findings shed light on the importance of a well-calibrated balance among various dynamics, crucial for natural real-time speech dialogue systems.

[32] TopK Language Models

Ryosuke Takahashi,Tatsuro Inaba,Kentaro Inui,Benjamin Heinzerling

Main category: cs.CL

TL;DR: This paper introduces TopK LMs, a modified transformer architecture with improved interpretability and reliability for analyzing language model concepts.

Details Motivation: Sparse autoencoders (SAEs) have shortcomings in utility and internal validity due to their post-hoc training, making it unclear whether failures are due to SAEs or underlying LMs. Feature instability complicates comparisons across checkpoints. Method: A modification to the transformer architecture was introduced, incorporating a TopK activation function at chosen layers to make hidden states equivalent to the latent features of a TopK SAE. Result: TopK LMs maintain original capabilities while providing robust interpretability benefits, enabling successful steering through targeted neuron interventions and detailed analysis of neuron formation processes. Conclusion: TopK LMs provide a stable and reliable method for understanding how language models learn and represent concepts, advancing research on model interpretability and controllability. Abstract: Sparse autoencoders (SAEs) have become an important tool for analyzing and interpreting the activation space of transformer-based language models (LMs). However, SAEs suffer from several shortcomings that diminish their utility and internal validity. Since SAEs are trained post-hoc, it is unclear if the failure to discover a particular concept is a failure on the SAE's side or due to the underlying LM not representing this concept. This problem is exacerbated by training conditions and architecture choices affecting which features an SAE learns. When tracing how LMs learn concepts during training, the lack of feature stability also makes it difficult to compare SAE features across different checkpoints. To address these limitations, we introduce a modification to the transformer architecture that incorporates a TopK activation function at chosen layers, making the model's hidden states equivalent to the latent features of a TopK SAE. This approach eliminates the need for post-hoc training while providing interpretability comparable to SAEs. 
The resulting TopK LMs offer a favorable trade-off between model size, computational efficiency, and interpretability. Despite this simple architectural change, TopK LMs maintain their original capabilities while providing robust interpretability benefits. Our experiments demonstrate that the sparse representations learned by TopK LMs enable successful steering through targeted neuron interventions and facilitate detailed analysis of neuron formation processes across checkpoints and layers. These features make TopK LMs stable and reliable tools for understanding how language models learn and represent concepts, which we believe will significantly advance future research on model interpretability and controllability.
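
As a rough illustration of the TopK activation described in the abstract above, the following minimal sketch keeps the k largest entries of a hidden-state vector and zeroes the rest. The function name `topk_activation`, the plain-list representation, and the tie-breaking rule are assumptions made for illustration; the abstract does not specify the layers, the value of k, or the implementation used in the paper.

```python
def topk_activation(hidden, k):
    """Keep the k largest entries of `hidden` and set all others to zero.

    Illustrative sketch only: `hidden` is a plain list of floats standing in
    for one hidden-state vector; ties are broken in favor of earlier positions.
    """
    if k <= 0:
        return [0.0] * len(hidden)
    if k >= len(hidden):
        return list(hidden)
    # Rank positions by value, largest first (sorted is stable, so earlier
    # positions win ties), and remember the k winners.
    ranked = sorted(range(len(hidden)), key=lambda i: hidden[i], reverse=True)
    keep = set(ranked[:k])
    return [v if i in keep else 0.0 for i, v in enumerate(hidden)]


if __name__ == "__main__":
    print(topk_activation([0.1, 2.0, -1.0, 0.5], k=2))
```

With at most k non-zero entries per vector, each surviving entry can be inspected or intervened on directly, which is the sparsity property the abstract relies on for steering.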

[33] Bridging Offline and Online Reinforcement Learning for LLMs

Jack Lanchantin,Angelica Chen,Janice Lan,Xian Li,Swarnadeep Saha,Tianlu Wang,Jing Xu,Ping Yu,Weizhe Yuan,Jason E Weston,Sainbayar Sukhbaatar,Ilia Kulikov

Main category: cs.CL

TL;DR: Reinforcement learning enhances fine-tuning of large language models, with online and semi-online approaches outperforming offline methods, and multi-tasking boosts overall performance.

Details Motivation: To understand the effectiveness of reinforcement learning methods when transitioning from offline to semi-online to fully online regimes in both verifiable and non-verifiable tasks. Method: Experiments were conducted on verifiable math and non-verifiable instruction following tasks using Direct Preference Optimization and Group Reward Policy Optimization objectives across offline, semi-online, and fully online regimes. Result: Online and semi-online variants showed similar performance and convergence, outperforming offline methods. Multi-tasking improved performance across task types. Conclusion: Reinforcement learning methods are effective for fine-tuning large language models, especially in semi-online and fully online regimes. Multi-tasking with verifiable and non-verifiable rewards improves performance. Abstract: We investigate the effectiveness of reinforcement learning methods for finetuning large language models when transitioning from offline to semi-online to fully online regimes for both verifiable and non-verifiable tasks. Our experiments cover training on verifiable math as well as non-verifiable instruction following with a set of benchmark evaluations for both. Across these settings, we extensively compare online and semi-online Direct Preference Optimization and Group Reward Policy Optimization objectives, and surprisingly find similar performance and convergence between these variants, which all strongly outperform offline methods. We provide a detailed analysis of the training dynamics and hyperparameter selection strategies to achieve optimal results. Finally, we show that multi-tasking with verifiable and non-verifiable rewards jointly yields improved performance across both task types.
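
For readers unfamiliar with the Direct Preference Optimization objective mentioned above, it is commonly written as a logistic loss on the difference between policy-to-reference log-probability ratios for a chosen and a rejected response. The sketch below follows that standard formulation for a single preference pair; the function name, argument names, and the default `beta` are illustrative assumptions, and it does not show how the paper generates pairs in the offline, semi-online, or online regimes.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one preference pair (illustrative sketch).

    Each argument is the summed log-probability of a full response under
    the policy or the frozen reference model; `beta` is an assumed value.
    """
    chosen_ratio = policy_chosen_logp - ref_chosen_logp
    rejected_ratio = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_ratio - rejected_ratio)
    # -log(sigmoid(margin)) == log(1 + exp(-margin)), computed stably.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


if __name__ == "__main__":
    # The policy already prefers the chosen response more than the reference does.
    print(dpo_loss(-1.0, -5.0, -2.0, -2.0))
```

The loss falls as the policy raises the chosen response's likelihood relative to the reference and lowers the rejected one's.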

[34] Enhancing User Engagement in Socially-Driven Dialogue through Interactive LLM Alignments

Jiashuo Wang,Kaitao Song,Chunpu Xu,Changhe Song,Yang Xiao,Dongsheng Li,Lili Qiu,Wenjie Li

Main category: cs.CL

TL;DR: This paper introduces an approach to enhance user engagement in socially-driven dialogues by training interactive LLMs using a user simulator, i×MCTS, and DPO, demonstrating effectiveness in real-world scenarios.

Details Motivation: Prior works have optimized models based on knowledge reasoning or dialogue act planning, but these approaches do not necessarily guarantee user engagement in socially-driven dialogues. The need to directly optimize for user engagement motivated this research. Method: The authors use a user simulator and i×MCTS (Monte Carlo Tree Search for interaction) to explore interactions between users and LLMs. They collect a dataset of high- and low-quality experiences and apply Direct Preference Optimization (DPO) to align the LLMs toward maximizing user engagement. Result: Experiments show that the proposed method successfully improves user engagement in two socially-driven dialogue scenarios: emotional support conversations and persuasion for good. Conclusion: The proposed method effectively enhances user engagement in interactive LLMs, as demonstrated through experiments on emotional support conversations and persuasion for good. Abstract: Enhancing user engagement through interactions plays an essential role in socially-driven dialogues. While prior works have optimized models to reason over relevant knowledge or plan a dialogue act flow, the relationship between user engagement and knowledge or dialogue acts is subtle and does not guarantee user engagement in socially-driven dialogues. To this end, we enable interactive LLMs to learn user engagement by leveraging signals from the future development of conversations. Specifically, we adopt a more direct and relevant indicator of user engagement, i.e., the user's reaction related to dialogue intention after the interaction, as a reward to align interactive LLMs. To achieve this, we develop a user simulator to interact with target interactive LLMs and explore interactions between the user and the interactive LLM system via i×MCTS (Monte Carlo Tree Search for interaction). 
In this way, we collect a dataset containing pairs of higher- and lower-quality experiences using i×MCTS, and align interactive LLMs for high-level user engagement by direct preference optimization (DPO) accordingly. Experiments conducted on two socially-driven dialogue scenarios (emotional support conversations and persuasion for good) demonstrate that our method effectively enhances user engagement in interactive LLMs.

[35] skLEP: A Slovak General Language Understanding Benchmark

Marek Šuppa,Andrej Ridzik,Daniel Hládek,Tomáš Javůrek,Viktória Ondrejová,Kristína Sásiková,Martin Tamajka,Marián Šimko

Main category: cs.CL

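TL;DR: This paper introduces skLEP, a new comprehensive benchmark designed for evaluating Slovak NLU models. It comprises nine diverse tasks and comes with an open-source toolkit for model evaluation and fine-tuning and a public leaderboard.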

Details Motivation: To fill the gap left by the lack of a comprehensive benchmark for evaluating Slovak natural language understanding models. Method: The authors created a dataset of nine different tasks spanning token-level, sentence-pair, and document-level challenges, and systematically evaluated a range of Slovak-specific, multilingual, and English pre-trained language models. Result: The complete benchmark data, an open-source toolkit that facilitates fine-tuning and evaluating models, and a public leaderboard were released. Conclusion: skLEP is a comprehensive Slovak NLU benchmark intended to drive future research in Slovak natural language understanding and to foster reproducible research by providing an open-source toolkit and a public leaderboard. Abstract: In this work, we introduce skLEP, the first comprehensive benchmark specifically designed for evaluating Slovak natural language understanding (NLU) models. We have compiled skLEP to encompass nine diverse tasks that span token-level, sentence-pair, and document-level challenges, thereby offering a thorough assessment of model capabilities. To create this benchmark, we curated new, original datasets tailored for Slovak and meticulously translated established English NLU resources. Within this paper, we also present the first systematic and extensive evaluation of a wide array of Slovak-specific, multilingual, and English pre-trained language models using the skLEP tasks. Finally, we also release the complete benchmark data, an open-source toolkit facilitating both fine-tuning and evaluation of models, and a public leaderboard at https://github.com/slovak-nlp/sklep in the hopes of fostering reproducibility and driving future research in Slovak NLU.

[36] Potemkin Understanding in Large Language Models

Marina Mancoridis,Bec Weeks,Keyon Vafa,Sendhil Mullainathan

Main category: cs.CL

TL;DR: This paper explores how large language models (LLMs) should be evaluated, arguing that current benchmarks may only provide a false impression of understanding unless models misunderstand concepts similarly to humans.

Details Motivation: The motivation behind this research is the question of what justifies making inferences about an LLM's capabilities based on its answers to a curated set of questions. Method: Two procedures are presented for quantifying potemkins: one using a specially designed benchmark in three domains, and another using a general procedure providing a lower-bound on their prevalence. Result: The research finds that potemkins are ubiquitous across models, tasks, and domains. It also discovers that these failures reflect not just incorrect understanding, but deeper internal incoherence in concept representations. Conclusion: The study concludes that the benchmarks used to test LLMs can only validly assess their capabilities if the models misunderstand concepts in ways similar to humans. Otherwise, it's just 'potemkin understanding'. Abstract: Large language models (LLMs) are regularly evaluated using benchmark datasets. But what justifies making inferences about an LLM's capabilities based on its answers to a curated set of questions? This paper first introduces a formal framework to address this question. The key is to note that the benchmarks used to test LLMs -- such as AP exams -- are also those used to test people. However, this raises an implication: these benchmarks are only valid tests if LLMs misunderstand concepts in ways that mirror human misunderstandings. Otherwise, success on benchmarks only demonstrates potemkin understanding: the illusion of understanding driven by answers irreconcilable with how any human would interpret a concept. We present two procedures for quantifying the existence of potemkins: one using a specially designed benchmark in three domains, the other using a general procedure that provides a lower-bound on their prevalence. We find that potemkins are ubiquitous across models, tasks, and domains. 
We also find that these failures reflect not just incorrect understanding, but deeper internal incoherence in concept representations.

[37] "What's Up, Doc?": Analyzing How Users Seek Health Information in Large-Scale Conversational AI Datasets

Akshay Paruchuri,Maryam Aziz,Rohit Vartak,Ayman Ali,Best Uchehara,Xin Liu,Ishan Chatterjee,Monica Agrawal

Main category: cs.CL

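TL;DR: This paper studies how users seek medical information from large language models via chatbots, identifies several potential problems, and argues that the use of AI in healthcare needs improvement.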

Details Motivation: People increasingly obtain medical information from large language models through interactive chatbots, yet the nature and potential risks of these conversations remain unclear, so an in-depth study is needed. Method: The paper filters large-scale conversational AI datasets to build a curated dataset called HealthChat-11K and, combined with a clinician-driven taxonomy, systematically studies user interactions across 21 distinct health specialties. Result: The analysis reveals how and why users seek health information, including common interaction patterns, instances of incomplete context, affective behaviors, and interactions that can induce sycophancy. Conclusion: The paper concludes that there are problems in how users seek healthcare information from large language models (LLMs), such as incomplete context and leading questions, which underscore the need to improve the healthcare support capabilities of LLMs deployed as chatbots. Abstract: People are increasingly seeking healthcare information from large language models (LLMs) via interactive chatbots, yet the nature and inherent risks of these conversations remain largely unexplored. In this paper, we filter large-scale conversational AI datasets to achieve HealthChat-11K, a curated dataset of 11K real-world conversations composed of 25K user messages. We use HealthChat-11K and a clinician-driven taxonomy for how users interact with LLMs when seeking healthcare information in order to systematically study user interactions across 21 distinct health specialties. Our analysis reveals insights into the nature of how and why users seek health information, such as common interactions, instances of incomplete context, affective behaviors, and interactions (e.g., leading questions) that can induce sycophancy, underscoring the need for improvements in the healthcare support capabilities of LLMs deployed as conversational AI. Code and artifacts to retrieve our analyses and combine them into a curated dataset can be found here: https://github.com/yahskapar/HealthChat

[38] Data Efficacy for Language Model Training

Yalun Dai,Yangyu Huang,Xin Zhang,Wenshan Wu,Chong Li,Wenhui Lu,Shijie Cao,Li Dong,Scarlett Li

Main category: cs.CL

TL;DR: This paper introduces DELT, a new approach to language model training focusing on data efficacy through optimized data organization, showing significant performance improvements.

Details Motivation: While data efficiency has been widely researched to maximize performance with minimal data, data efficacy, which focuses on optimizing data organization, remains underexplored. Method: The study introduces DELT, a general paradigm comprising Data Scoring, Data Selection, and Data Ordering, with new instances like Learnability-Quality Scoring (LQS) and Folding Ordering (FO). Result: Experiments showed that DELT improves LM performance without increasing data scale or model size, with the combination of LQS and Folding yielding the best results. Conclusion: Data efficacy is a promising foundational area in language model training that can be achieved alongside data efficiency by optimizing the organization of training data. Abstract: Data is fundamental to the training of language models (LM). Recent research has been dedicated to data efficiency, which aims to maximize performance by selecting a minimal or optimal subset of training data. Techniques such as data filtering, sampling, and selection play a crucial role in this area. To complement it, we define Data Efficacy, which focuses on maximizing performance by optimizing the organization of training data and remains relatively underexplored. This work introduces a general paradigm, DELT, for considering data efficacy in LM training, which highlights the significance of training data organization. DELT comprises three components: Data Scoring, Data Selection, and Data Ordering. Among these components, we design Learnability-Quality Scoring (LQS), as a new instance of Data Scoring, which considers both the learnability and quality of each data sample from the gradient consistency perspective. We also devise Folding Ordering (FO), as a novel instance of Data Ordering, which addresses issues such as model forgetting and data distribution bias. 
Comprehensive experiments validate the data efficacy in LM training, which demonstrates the following: Firstly, various instances of the proposed DELT enhance LM performance to varying degrees without increasing the data scale and model size. Secondly, among these instances, the combination of our proposed LQS for data scoring and Folding for data ordering achieves the most significant improvement. Lastly, data efficacy can be achieved together with data efficiency by applying data selection. Therefore, we believe that data efficacy is a promising foundational area in LM training.

cs.CV [Back]

[39] OTSurv: A Novel Multiple Instance Learning Framework for Survival Prediction with Heterogeneity-aware Optimal Transport

Qin Ren,Yifan Wang,Ruogu Fang,Haibin Ling,Chenyu You

Main category: cs.CV

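TL;DR: OTSurv is a new MIL framework that models pathological heterogeneity via optimal transport, achieving an average C-index improvement of 3.6% across six benchmarks, with statistical significance and high interpretability.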

Details Motivation: Existing MIL methods fail to adequately capture global and local pathological heterogeneity in WSIs, such as long-tailed morphological distributions and tile-level prediction uncertainty. Method: OTSurv formulates survival prediction from an optimal transport perspective and introduces two constraints: (1) a global long-tail constraint that regulates transport mass allocation to avoid mode collapse and excessive uniformity; and (2) a local uncertainty-aware constraint that prioritizes high-confidence patches and suppresses noise by progressively raising the total transport mass. The resulting problem is recast as an unbalanced optimal transport problem that can be solved with an efficient matrix scaling algorithm. Result: OTSurv sets new state-of-the-art results on six popular benchmarks, improving the average C-index by 3.6% and achieving statistical significance in log-rank tests, while offering high interpretability. Conclusion: OTSurv is an effective tool for survival prediction, particularly in digital pathology, and its code is publicly available. Abstract: Survival prediction using whole slide images (WSIs) can be formulated as a multiple instance learning (MIL) problem. However, existing MIL methods often fail to explicitly capture pathological heterogeneity within WSIs, both globally, through long-tailed morphological distributions, and locally, through tile-level prediction uncertainty. Optimal transport (OT) provides a principled way of modeling such heterogeneity by incorporating marginal distribution constraints. Building on this insight, we propose OTSurv, a novel MIL framework from an optimal transport perspective. Specifically, OTSurv formulates survival predictions as a heterogeneity-aware OT problem with two constraints: (1) a global long-tail constraint that models prior morphological distributions to avert both mode collapse and excessive uniformity by regulating transport mass allocation, and (2) a local uncertainty-aware constraint that prioritizes high-confidence patches while suppressing noise by progressively raising the total transport mass. We then recast the initial OT problem, augmented by these constraints, into an unbalanced OT formulation that can be solved with an efficient, hardware-friendly matrix scaling algorithm. Empirically, OTSurv sets new state-of-the-art results across six popular benchmarks, achieving an absolute 3.6% improvement in average C-index. In addition, OTSurv achieves statistical significance in log-rank tests and offers high interpretability, making it a powerful tool for survival prediction in digital pathology. Our code is available at https://github.com/Y-Research-SBU/OTSurv.
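
The C-index reported above measures how often a model ranks pairs of patients in the correct order of survival. The sketch below implements one common form (Harrell's concordance index) in plain Python as an aid to reading the result. The function name and the handling of ties are assumptions for illustration, pairs with tied survival times are simply skipped, and the exact variant used in the paper is not stated in the abstract.

```python
def concordance_index(times, events, risks):
    """Harrell-style concordance index (illustrative sketch).

    times  -- observed follow-up time per subject
    events -- 1 if the event was observed, 0 if the subject was censored
    risks  -- predicted risk score per subject (higher means earlier event)
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            continue  # a censored subject cannot be the earlier member of a pair
        for j in range(n):
            if times[i] < times[j]:
                comparable += 1
                if risks[i] > risks[j]:
                    concordant += 1.0  # earlier event received the higher risk
                elif risks[i] == risks[j]:
                    concordant += 0.5  # tied predictions count as half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


if __name__ == "__main__":
    print(concordance_index([1, 2, 3], [1, 1, 0], [3.0, 2.0, 1.0]))
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect ranking, so an absolute gain of 3.6% means 0.036 on this scale.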

[40] StereoDiff: Stereo-Diffusion Synergy for Video Depth Estimation

Haodong Li,Chen Wang,Jiahui Lei,Kostas Daniilidis,Lingjie Liu

Main category: cs.CV

TL;DR: StereoDiff combines stereo matching and video depth diffusion for improved video depth estimation, achieving SoTA results with better consistency and accuracy than existing methods.

Details Motivation: Video depth estimation has different temporal consistency requirements for dynamic and static regions compared to image depth estimation. Stereo matching provides strong global 3D cues for static regions, while dynamic regions require smooth transitions learned from large-scale data. Method: StereoDiff uses a two-stage approach that synergizes stereo matching for static regions and video depth diffusion for dynamic areas in videos. The synergy is analyzed through frequency domain analysis. Result: StereoDiff achieves state-of-the-art (SoTA) performance on real-world, dynamic video depth benchmarks both indoors and outdoors, demonstrating superior consistency and accuracy. Conclusion: StereoDiff achieves superior performance in video depth estimation by combining stereo matching and video depth diffusion methods, outperforming existing approaches in consistency and accuracy. Abstract: Recent video depth estimation methods achieve great performance by following the paradigm of image depth estimation, i.e., typically fine-tuning pre-trained video diffusion models with massive data. However, we argue that video depth estimation is not a naive extension of image depth estimation. The temporal consistency requirements for dynamic and static regions in videos are fundamentally different. Consistent video depth in static regions, typically backgrounds, can be more effectively achieved via stereo matching across all frames, which provides much stronger global 3D cues. By contrast, consistency in dynamic regions should still be learned from large-scale video depth data to ensure smooth transitions, because these regions violate triangulation constraints. Based on these insights, we introduce StereoDiff, a two-stage video depth estimator that synergizes stereo matching for mainly the static areas with video depth diffusion for maintaining consistent depth transitions in dynamic areas. 
We mathematically demonstrate how stereo matching and video depth diffusion offer complementary strengths through frequency domain analysis, highlighting the effectiveness of their synergy in capturing the advantages of both. Experimental results on zero-shot, real-world, dynamic video depth benchmarks, both indoor and outdoor, demonstrate StereoDiff's SoTA performance, showcasing its superior consistency and accuracy in video depth estimation.

[41] ConViTac: Aligning Visual-Tactile Fusion with Contrastive Representations

Zhiyuan Wu,Yongqiang Zhao,Shan Luo

Main category: cs.CV

TL;DR: This paper introduces ConViTac, a novel method for integrating visual and tactile sensor data in robotics using contrastive representation learning and cross-modal attention, significantly improving task performance compared to existing techniques.

Details Motivation: The motivation stems from the limitations of existing approaches for combining visual and tactile sensory data, which often rely on simplistic methods like feature addition or concatenation, leading to suboptimal integration. Vision and touch offer complementary information crucial for robotic perception and manipulation, so improved fusion techniques are needed. Method: The authors proposed ConViTac, a network incorporating a Contrastive Embedding Conditioning (CEC) mechanism. This uses a contrastive encoder pretrained via self-supervised learning to project visual and tactile inputs into unified latent embeddings, which are then fused using cross-modal attention for better alignment of representations. Result: ConViTac demonstrated superior performance over state-of-the-art methods in real-world tasks, with an improvement of up to 12.0% in accuracy for material classification and grasping prediction tasks, validating the effectiveness of the proposed CEC mechanism. Conclusion: The paper concludes that ConViTac, with its Contrastive Embedding Conditioning mechanism, significantly enhances visual-tactile feature fusion and outperforms current methods in material classification and grasping prediction tasks. Abstract: Vision and touch are two fundamental sensory modalities for robots, offering complementary information that enhances perception and manipulation tasks. Previous research has attempted to jointly learn visual-tactile representations to extract more meaningful information. However, these approaches often rely on direct combination, such as feature addition and concatenation, for modality fusion, which tend to result in poor feature integration. In this paper, we propose ConViTac, a visual-tactile representation learning network designed to enhance the alignment of features during fusion using contrastive representations. 
Our key contribution is a Contrastive Embedding Conditioning (CEC) mechanism that leverages a contrastive encoder pretrained through self-supervised contrastive learning to project visual and tactile inputs into unified latent embeddings. These embeddings are used to couple visual-tactile feature fusion through cross-modal attention, aiming at aligning the unified representations and enhancing performance on downstream tasks. We conduct extensive experiments to demonstrate the superiority of ConViTac in real world over current state-of-the-art methods and the effectiveness of our proposed CEC mechanism, which improves accuracy by up to 12.0% in material classification and grasping prediction tasks.

[42] AI-Driven MRI-based Brain Tumour Segmentation Benchmarking

Connor Ludwig,Khashayar Namdar,Farzad Khalvati

Main category: cs.CV

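TL;DR: This paper evaluates medical image segmentation models (such as SAM, SAM 2, MedSAM, SAM-Med-3D, and nnU-Net) under a range of prompt qualities, finding that although the SAM family performs very well in some cases, nnU-Net remains the more practical choice.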

Details Motivation: Many general promptable models and medical variants have appeared in recent years, but they have not been evaluated and compared across a variety of prompt qualities on a common medical dataset. Method: The study runs zero-shot inference with Segment Anything Model (SAM), Segment Anything Model 2 (SAM 2), MedSAM, SAM-Med-3D, and nnU-Net on the BraTS 2023 adult glioma and pediatrics datasets and evaluates performance under different prompt qualities. The comparison is further extended by fine-tuning the promptable models. Result: Several models show promising Dice scores; in particular, SAM and SAM 2 reach 0.894 and 0.893, respectively, with highly accurate bounding box prompts, exceeding nnU-Net's segmentation performance. Conclusion: nnU-Net remains the dominant medical image segmentation network because providing highly accurate prompts to the models is impractical. Fine-tuning the SAM family on the pediatrics dataset improves point prompt performance but cannot surpass bounding boxes or nnU-Net. Abstract: Medical image segmentation has greatly aided medical diagnosis, with U-Net based architectures and nnU-Net providing state-of-the-art performance. There have been numerous general promptable models and medical variations introduced in recent years, but there is currently a lack of evaluation and comparison of these models across a variety of prompt qualities on a common medical dataset. This research uses Segment Anything Model (SAM), Segment Anything Model 2 (SAM 2), MedSAM, SAM-Med-3D, and nnU-Net to obtain zero-shot inference on the BraTS 2023 adult glioma and pediatrics dataset across multiple prompt qualities for both points and bounding boxes. Several of these models exhibit promising Dice scores, particularly SAM and SAM 2 achieving scores of up to 0.894 and 0.893, respectively, when given extremely accurate bounding box prompts, which exceeds nnU-Net's segmentation performance. However, nnU-Net remains the dominant medical image segmentation network due to the impracticality of providing highly accurate prompts to the models. The model and prompt evaluation, as well as the comparison, are extended through fine-tuning SAM, SAM 2, MedSAM, and SAM-Med-3D on the pediatrics dataset. The improvements in point prompt performance after fine-tuning are substantial and show promise for future investigation, but are unable to achieve better segmentation than bounding boxes or nnU-Net.
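
The Dice scores quoted above measure overlap between a predicted mask and a reference mask, defined as twice the intersection divided by the sum of the two mask sizes. The following sketch computes it for flattened binary masks. The function name, the list representation, and the convention of returning 1.0 when both masks are empty are assumptions for illustration, not details taken from the paper.

```python
def dice_score(pred, target):
    """Dice coefficient between two flattened binary masks (illustrative sketch)."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same number of elements")
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    total = sum(1 for p in pred if p) + sum(1 for t in target if t)
    if total == 0:
        return 1.0  # assumed convention: two empty masks agree perfectly
    return 2.0 * intersection / total


if __name__ == "__main__":
    # One of two predicted voxels overlaps the single reference voxel.
    print(dice_score([1, 1, 0, 0], [1, 0, 0, 0]))
```

Scores range from 0 (no overlap) to 1 (identical masks), so values near 0.89 indicate that most predicted and reference tumour voxels coincide.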

[43] How do Foundation Models Compare to Skeleton-Based Approaches for Gesture Recognition in Human-Robot Interaction?

Stephanie Käs,Anton Burenko,Louis Markert,Onur Alp Culha,Dennis Mack,Timm Linder,Bastian Leibe

Main category: cs.CV

TL;DR: This paper explores the use of Vision Foundation Models (VFMs) and Vision Language Models (VLMs) for dynamic gesture recognition in human-robot communication, comparing them against traditional methods like HD-GCN using a new dataset called NUGGET.

Details Motivation: Traditional deep learning-based gesture recognition methods rely on task-specific architectures, increasing system complexity. Vision Foundation Models (VFMs) and Vision Language Models (VLMs) offer potential to reduce this complexity by serving as shared multi-task models due to their strong generalization abilities. Method: This study compares three models—V-JEPA (a state-of-the-art Vision Foundation Model), Gemini Flash 2.0 (a multimodal Vision Language Model), and HD-GCN (a top-performing skeleton-based approach)—for dynamic, full-body gesture recognition using the NUGGET dataset. Result: HD-GCN achieved the best performance in dynamic, full-body gesture recognition. However, V-JEPA performed nearly as well with a simple task-specific classification head, suggesting its potential as a shared multi-task model. Gemini Flash 2.0 struggled to differentiate gestures based only on textual descriptions in a zero-shot setting. Conclusion: The study concludes that while HD-GCN currently performs best for dynamic, full-body gesture recognition, V-JEPA shows potential as a simpler, shared multi-task model. Gemini Flash 2.0 struggles with gesture differentiation in zero-shot settings, indicating the need for further research on input representations. Abstract: Gestures enable non-verbal human-robot communication, especially in noisy environments like agile production. Traditional deep learning-based gesture recognition relies on task-specific architectures using images, videos, or skeletal pose estimates as input. Meanwhile, Vision Foundation Models (VFMs) and Vision Language Models (VLMs) with their strong generalization abilities offer potential to reduce system complexity by replacing dedicated task-specific modules. This study investigates adapting such models for dynamic, full-body gesture recognition, comparing V-JEPA (a state-of-the-art VFM), Gemini Flash 2.0 (a multimodal VLM), and HD-GCN (a top-performing skeleton-based approach). 
We introduce NUGGET, a dataset tailored for human-robot communication in intralogistics environments, to evaluate the different gesture recognition approaches. In our experiments, HD-GCN achieves the best performance, but V-JEPA comes close with a simple, task-specific classification head, thus paving a possible way towards reducing system complexity by using it as a shared multi-task model. In contrast, Gemini struggles to differentiate gestures based solely on textual descriptions in the zero-shot setting, highlighting the need for further research on suitable input representations for gestures.

[44] Leveraging Vision-Language Models to Select Trustworthy Super-Resolution Samples Generated by Diffusion Models

Cansu Korkmaz,Ahmet Murat Tekalp,Zafer Dogan

Main category: cs.CV

TL;DR: This paper proposes an automated framework using vision-language models (VLMs) to identify the most trustworthy super-resolution image from a set generated by diffusion models, validated by a new Trustworthiness Score (TWS).

Details Motivation: The motivation stems from the ambiguity in selecting the most reliable SR solution from diffusion models, which generate diverse SR images but lack a principled way to identify trustworthy outputs for critical applications. Method: The method involves using VLMs like BLIP-2 and GPT-4o to evaluate semantic correctness, visual quality, and artifact presence in SR samples. A Trustworthiness Score (TWS) is proposed to quantify SR reliability based on CLIP embeddings, SSIM on edge maps, and wavelet decomposition. Result: The result shows that VLM-guided selection yields high TWS values, strongly correlating with human preference, outperforming conventional metrics like PSNR and LPIPS in reflecting information fidelity. Conclusion: The paper concludes that leveraging vision-language models (VLMs) to select the most trustworthy super-resolution (SR) sample from a diffusion-generated set significantly improves SR reliability and trustworthiness. Abstract: Super-resolution (SR) is an ill-posed inverse problem with many feasible solutions consistent with a given low-resolution image. On one hand, regressive SR models aim to balance fidelity and perceptual quality to yield a single solution, but this trade-off often introduces artifacts that create ambiguity in information-critical applications such as recognizing digits or letters. On the other hand, diffusion models generate a diverse set of SR images, but selecting the most trustworthy solution from this set remains a challenge. This paper introduces a robust, automated framework for identifying the most trustworthy SR sample from a diffusion-generated set by leveraging the semantic reasoning capabilities of vision-language models (VLMs). Specifically, VLMs such as BLIP-2, GPT-4o, and their variants are prompted with structured queries to assess semantic correctness, visual quality, and artifact presence. 
The top-ranked SR candidates are then ensembled to yield a single trustworthy output in a cost-effective manner. To rigorously assess the validity of VLM-selected samples, we propose a novel Trustworthiness Score (TWS), a hybrid metric that quantifies SR reliability based on three complementary components: semantic similarity via CLIP embeddings, structural integrity using SSIM on edge maps, and artifact sensitivity through multi-level wavelet decomposition. We empirically show that TWS correlates strongly with human preference in both ambiguous and natural images, and that VLM-guided selections consistently yield high TWS values. Compared to conventional metrics like PSNR and LPIPS, which fail to reflect information fidelity, our approach offers a principled, scalable, and generalizable solution for navigating the uncertainty of the diffusion SR space. By aligning outputs with human expectations and semantic correctness, this work sets a new benchmark for trustworthiness in generative SR.

[45] FixCLR: Negative-Class Contrastive Learning for Semi-Supervised Domain Generalization

Ha Min Son,Shahbaz Rezaei,Xin Liu

Main category: cs.CV

TL;DR: FixCLR enhances semi-supervised domain generalization by explicitly regularizing domain-invariant representations, offering significant performance improvements.

Details Motivation: Existing SSDG methods do not explicitly regularize to learn domain-invariant representations across all domains, which is crucial for domain generalization. Method: We introduce FixCLR, which adapts contrastive learning for explicit domain invariance regularization by utilizing class information from pseudo-labels and using only a repelling term. Result: FixCLR can improve the performance of most existing SSDG and semi-supervised methods. Extensive experiments show its effectiveness, particularly under conditions with limited label availability. Conclusion: FixCLR proves to be an effective SSDG method, especially when combined with other semi-supervised methods. Abstract: Semi-supervised domain generalization (SSDG) aims to solve the problem of generalizing to out-of-distribution data when only a few labels are available. Due to label scarcity, applying domain generalization methods often underperforms. Consequently, existing SSDG methods combine semi-supervised learning methods with various regularization terms. However, these methods do not explicitly regularize to learn domain-invariant representations across all domains, which is a key goal for domain generalization. To address this, we introduce FixCLR. Inspired by success in self-supervised learning, we change two crucial components to adapt contrastive learning for explicit domain invariance regularization: utilization of class information from pseudo-labels and using only a repelling term. FixCLR can also be added on top of most existing SSDG and semi-supervised methods for complementary performance improvements. Our research includes extensive experiments that have not been previously explored in SSDG studies. These experiments include benchmarking different improvements to semi-supervised methods, evaluating the performance of pretrained versus non-pretrained models, and testing on datasets with many domains. 
Overall, FixCLR proves to be an effective SSDG method, especially when combined with other semi-supervised methods.

[46] Vector Contrastive Learning For Pixel-Wise Pretraining In Medical Vision

Yuting He,Shuo Li

Main category: cs.CV

TL;DR: The COVER framework enhances pixel-wise self-supervised pretraining in medical vision by reformulating contrastive learning as a vector regression problem, improving feature correlation and granularity adaptation.

Details Motivation: Extending contrastive learning to pixel-wise representation is crucial for medical vision but remains an open problem due to the over-dispersion issue in binary CL, which disrupts intra-class distribution and pixel-wise feature correlation. Method: COntrast in VEctor Regression (COVER) framework reformulates contrastive learning as a vector regression problem, modeling feature distances in displacement vectors and leveraging a vector pyramid architecture for granularity adaptation. Result: COVER successfully addresses over-dispersion by enforcing consistent optimization flow and preserving pixel-wise feature correlations, validated through extensive experiments across 8 tasks spanning 2 dimensions and 4 modalities. Conclusion: COVER significantly improves pixel-wise self-supervised pretraining, advancing generalizable medical visual foundation models. Abstract: Contrastive learning (CL) has become a cornerstone of self-supervised pretraining (SSP) in foundation models, however, extending CL to pixel-wise representation, crucial for medical vision, remains an open problem. Standard CL formulates SSP as a binary optimization problem (binary CL) where the excessive pursuit of feature dispersion leads to an over-dispersion problem, breaking pixel-wise feature correlation thus disrupting the intra-class distribution. Our vector CL reformulates CL as a vector regression problem, enabling dispersion quantification in pixel-wise pretraining via modeling feature distances in regressing displacement vectors. To implement this novel paradigm, we propose the COntrast in VEctor Regression (COVER) framework. COVER establishes an extendable vector-based self-learning, enforces a consistent optimization flow from vector regression to distance modeling, and leverages a vector pyramid architecture for granularity adaptation, thus preserving pixel-wise feature correlations in SSP. 
Extensive experiments across 8 tasks, spanning 2 dimensions and 4 modalities, show that COVER significantly improves pixel-wise SSP, advancing generalizable medical visual foundation models.

[47] Enhancing Ambiguous Dynamic Facial Expression Recognition with Soft Label-based Data Augmentation

Ryosuke Kawamura,Hideaki Hayashi,Shunsuke Otake,Noriko Takemura,Hajime Nagahara

Main category: cs.CV

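TL;DR: MIDAS is a data augmentation method for dynamic facial expression recognition that improves recognition of ambiguous expressions by combining video frames together with their soft labels.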

Details Motivation: In practical applications, accurately recognizing the ambiguous facial expressions that are common in in-the-wild data is essential, which calls for an effective data augmentation method to improve dynamic facial expression recognition. Method: Soft labels represent the probabilities of multiple emotion classes, and the training data are augmented by convexly combining pairs of video frames and their corresponding emotion class labels. Result: Experiments on the DFEW dataset and FERV39k-Plus show that models trained with MIDAS-augmented data outperform the state-of-the-art method trained on the original dataset. Conclusion: MIDAS is a data augmentation method that improves DFER performance on ambiguous facial expression data; it extends mixup by convexly combining video frames and their emotion class labels and proves highly effective for handling ambiguity in DFER. Abstract: Dynamic facial expression recognition (DFER) is a task that estimates emotions from facial expression video sequences. For practical applications, accurately recognizing ambiguous facial expressions -- frequently encountered in in-the-wild data -- is essential. In this study, we propose MIDAS, a data augmentation method designed to enhance DFER performance for ambiguous facial expression data using soft labels representing probabilities of multiple emotion classes. MIDAS augments training data by convexly combining pairs of video frames and their corresponding emotion class labels. This approach extends mixup to soft-labeled video data, offering a simple yet highly effective method for handling ambiguity in DFER. To evaluate MIDAS, we conducted experiments on both the DFEW dataset and FERV39k-Plus, a newly constructed dataset that assigns soft labels to an existing DFER dataset. The results demonstrate that models trained with MIDAS-augmented data achieve superior performance compared to the state-of-the-art method trained on the original dataset.
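
The convex combination described above can be sketched in a few lines. This is an illustrative mixup-style sketch, not the authors' implementation: the names `mix_pair` and `sample_lambda`, the flat-list frame format, and the choice of drawing the mixing weight from a Beta(alpha, alpha) distribution (as in the original mixup formulation) are all assumptions.

```python
import random


def mix_pair(clip_a, label_a, clip_b, label_b, lam):
    """Convexly combine two clips and their soft labels with weight lam in [0, 1].

    A clip is a list of frames; a frame is a flat list of pixel values.
    A soft label is a list of class probabilities that sums to 1.
    """
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    mixed_clip = [
        [lam * pa + (1.0 - lam) * pb for pa, pb in zip(frame_a, frame_b)]
        for frame_a, frame_b in zip(clip_a, clip_b)
    ]
    # The same weight is applied to the labels, so the result is still a distribution.
    mixed_label = [lam * ya + (1.0 - lam) * yb for ya, yb in zip(label_a, label_b)]
    return mixed_clip, mixed_label


def sample_lambda(alpha=0.4, rng=random):
    """Draw a mixing weight from Beta(alpha, alpha); alpha=0.4 is an arbitrary example value."""
    return rng.betavariate(alpha, alpha)
```

Because both inputs already carry probability vectors rather than one-hot labels, the mixed label remains a valid soft label without any extra normalization.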

[48] THIRDEYE: Cue-Aware Monocular Depth Estimation via Brain-Inspired Multi-Stage Fusion

Calin Teodor Ioan

Main category: cs.CV

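TL;DR: This paper presents ThirdEye, a monocular depth estimation method that explicitly supplies the cues the human visual system relies on through specialised pre-trained networks, fuses them in a three-stage cortical hierarchy, and produces a high-resolution disparity map.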

Details Motivation: Traditional monocular depth estimation methods train deep models to learn depth implicitly from RGB pixels, overlooking the explicit monocular cues that the human visual system relies on (such as occlusion boundaries, shading, and perspective). ThirdEye aims to improve depth estimation by supplying these cues explicitly. Method: ThirdEye uses specialised, pre-trained, and frozen networks to extract monocular cues and fuses them in a three-stage cortical hierarchy (V1->V2->V3) that includes a key-value working-memory module for weighting cues by reliability. An adaptive-bins transformer head then produces a high-resolution disparity map. Result: Because the cue expert networks are frozen, ThirdEye inherits large amounts of external supervision and requires only modest fine-tuning. Quantitative results are deferred to a future revision. Conclusion: ThirdEye is a monocular depth estimation framework with a neuroscience-inspired architecture that explicitly uses monocular cues from pre-trained networks while reducing the need for extensive fine-tuning. Abstract: Monocular depth estimation methods traditionally train deep models to infer depth directly from RGB pixels. This implicit learning often overlooks explicit monocular cues that the human visual system relies on, such as occlusion boundaries, shading, and perspective. Rather than expecting a network to discover these cues unaided, we present ThirdEye, a cue-aware pipeline that deliberately supplies each cue through specialised, pre-trained, and frozen networks. These cues are fused in a three-stage cortical hierarchy (V1->V2->V3) equipped with a key-value working-memory module that weights them by reliability. An adaptive-bins transformer head then produces a high-resolution disparity map. Because the cue experts are frozen, ThirdEye inherits large amounts of external supervision while requiring only modest fine-tuning. This extended version provides additional architectural detail, neuroscientific motivation, and an expanded experimental protocol; quantitative results will appear in a future revision.

[49] MultiHuman-Testbench: Benchmarking Image Generation for Multiple Humans

Shubhankar Borse,Seokeon Choi,Sunghyun Park,Jeongho Kim,Shreya Kadambi,Risheek Garrepalli,Sungrack Yun,Munawar Hayat,Fatih Porikli

Main category: cs.CV

TL;DR: This paper introduces MultiHuman-Testbench, a comprehensive benchmark for evaluating multi-human image generation models, along with novel techniques to improve facial identity preservation.

Details Motivation: Lack of dedicated benchmarks for multi-human image generation makes it challenging to evaluate models effectively. Method: Introduction of MultiHuman-Testbench benchmark with curated text prompts, human face images, pose conditioning images, and a multi-faceted evaluation suite. Novel techniques like image and region isolation using human segmentation and Hungarian matching are incorporated. Result: Benchmark includes 1,800 samples, 5,550 unique face images, and pose conditioning images. Evaluation showed improvements in ID similarity using proposed techniques. Conclusion: The proposed benchmark and findings offer valuable insights and a standardized tool to advance research in multi-human image generation. Abstract: Generation of images containing multiple humans, performing complex actions, while preserving their facial identities, is a significant challenge. A major factor contributing to this is the lack of a dedicated benchmark. To address this, we introduce MultiHuman-Testbench, a novel benchmark for rigorously evaluating generative models for multi-human generation. The benchmark comprises 1,800 samples, including carefully curated text prompts, describing a range of simple to complex human actions. These prompts are matched with a total of 5,550 unique human face images, sampled uniformly to ensure diversity across age, ethnic background, and gender. Alongside captions, we provide human-selected pose conditioning images which accurately match the prompt. We propose a multi-faceted evaluation suite employing four key metrics to quantify face count, ID similarity, prompt alignment, and action detection. We conduct a thorough evaluation of a diverse set of models, including zero-shot approaches and training-based methods, with and without regional priors. We also propose novel techniques to incorporate image and region isolation using human segmentation and Hungarian matching, significantly improving ID similarity. 
Our proposed benchmark and key findings provide valuable insights and a standardized tool for advancing research in multi-human image generation.
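
The abstract does not spell out how Hungarian matching is applied, so the following is only a sketch of the underlying assignment problem under an assumed setup: given a square similarity matrix between reference identities and detected person regions, find the one-to-one pairing with the highest total similarity. The Hungarian algorithm solves this in polynomial time; to stay dependency-free, this sketch swaps in a brute-force search over permutations, which returns the same optimum for the small person counts used here. The name `best_assignment` and the matrix layout are hypothetical.

```python
from itertools import permutations


def best_assignment(similarity):
    """Return (assignment, total) maximizing total similarity.

    similarity[i][j] is the score between reference identity i and region j.
    assignment[i] is the region index paired with identity i.
    Brute force is O(n!); it stands in for the Hungarian algorithm on small n.
    """
    n = len(similarity)
    if any(len(row) != n for row in similarity):
        raise ValueError("similarity must be a square matrix")
    best_perm, best_total = None, float("-inf")
    for perm in permutations(range(n)):
        total = sum(similarity[i][perm[i]] for i in range(n))
        if total > best_total:
            best_perm, best_total = perm, total
    return list(best_perm), best_total
```

A global assignment matters because a greedy per-identity choice can claim a region that a later identity needs more, lowering the overall match quality.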

[50] The Role of Cyclopean-Eye in Stereo Vision

Sherlon Almeida da Silva,Davi Geiger,Luiz Velho,Moacir Antonelli Ponti

Main category: cs.CV

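TL;DR: This paper studies the geometric foundations of stereo vision systems, emphasizing the role of 3D structure and human-inspired perception, proposes new geometric constraints, and shows through theoretical analysis and experiments the benefit of combining geometric priors with learned features.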

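Details Motivation: To explore how 3D structure and human-inspired perception contribute to accurate depth reconstruction. Method: The paper revisits the Cyclopean Eye model, proposes novel geometric constraints that account for occlusions and depth discontinuities, and evaluates the quality of stereo feature matching derived from deep learning models as well as the role of attention mechanisms. Result: Theoretical insights and empirical studies on real datasets indicate that combining geometric priors with learned features aids the understanding of stereo vision systems. Conclusion: Combining strong geometric priors with learned features provides internal abstractions for understanding stereo vision systems. Abstract: This work investigates the geometric foundations of modern stereo vision systems, with a focus on how 3D structure and human-inspired perception contribute to accurate depth reconstruction. We revisit the Cyclopean Eye model and propose novel geometric constraints that account for occlusions and depth discontinuities. Our analysis includes the evaluation of stereo feature matching quality derived from deep learning models, as well as the role of attention mechanisms in recovering meaningful 3D surfaces. Through both theoretical insights and empirical studies on real datasets, we demonstrate that combining strong geometric priors with learned features provides internal abstractions for understanding stereo vision systems.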

[51] FaSTA$^*$: Fast-Slow Toolpath Agent with Subroutine Mining for Efficient Multi-turn Image Editing

Advait Gupta,Rishie Raj,Dang Nguyen,Tianyi Zhou

Main category: cs.CV

TL;DR: FaSTA* is a fast-slow neurosymbolic agent that efficiently performs complex image editing by reusing previously learned subroutines and selectively applying detailed search.

Details Motivation: Multi-turn image editing tasks require a balance between computational cost and accuracy. Existing methods may be inefficient when handling similar subtasks repeatedly. The goal is to create a cost-efficient solution that mimics human-like planning by prioritizing known strategies while selectively applying detailed search. Method: FaSTA* combines large language models (LLMs) for fast subtask planning and A* search for accurate tool-use per subtask. It uses inductive reasoning on successful toolpaths to extract reusable subroutines, which are prioritized over A* search unless they fail. Result: FaSTA* significantly improves computational efficiency compared to recent image editing approaches while remaining competitive in terms of success rate. Conclusion: FaSTA* is a neurosymbolic agent that improves computational efficiency while maintaining competitive success rates in complex, multi-turn image editing tasks by combining fast high-level planning with slow, accurate search. Abstract: We develop a cost-efficient neurosymbolic agent to address challenging multi-turn image editing tasks such as "Detect the bench in the image while recoloring it to pink. Also, remove the cat for a clearer view and recolor the wall to yellow." It combines the fast, high-level subtask planning by large language models (LLMs) with the slow, accurate, tool-use, and local A$^*$ search per subtask to find a cost-efficient toolpath -- a sequence of calls to AI tools. To save the cost of A$^*$ on similar subtasks, we perform inductive reasoning on previously successful toolpaths via LLMs to continuously extract/refine frequently used subroutines and reuse them as new tools for future tasks in adaptive fast-slow planning, where the higher-level subroutines are explored first, and only when they fail, the low-level A$^*$ search is activated. 
The reusable symbolic subroutines considerably save exploration cost on the same types of subtasks applied to similar images, yielding a human-like fast-slow toolpath agent "FaSTA$^*$": fast subtask planning followed by rule-based subroutine selection per subtask is attempted by LLMs at first, which is expected to cover most tasks, while slow A$^*$ search is only triggered for novel and challenging subtasks. By comparing with recent image editing approaches, we demonstrate FaSTA$^*$ is significantly more computationally efficient while remaining competitive with the state-of-the-art baseline in terms of success rate.
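
The fast-slow control flow described above can be sketched as follows: reuse a stored subroutine when one passes a check, and fall back to A* search over tool calls otherwise. This is a toy illustration under stated assumptions, not the authors' system: the names `a_star` and `plan_subtask`, the `neighbors`, `heuristic`, and `check` callbacks, and the use of a plain dictionary cache in place of LLM-based subroutine mining are all hypothetical simplifications.

```python
import heapq


def a_star(start, goal, neighbors, heuristic):
    """Slow path: A* over states; neighbors(state) yields (next_state, tool, cost)."""
    frontier = [(heuristic(start), 0.0, start, [])]
    best_cost = {start: 0.0}
    while frontier:
        _, cost, state, path = heapq.heappop(frontier)
        if state == goal:
            return path, cost
        if cost > best_cost.get(state, float("inf")):
            continue  # stale queue entry; a cheaper route was already found
        for nxt, tool, step_cost in neighbors(state):
            new_cost = cost + step_cost
            if new_cost < best_cost.get(nxt, float("inf")):
                best_cost[nxt] = new_cost
                priority = new_cost + heuristic(nxt)
                heapq.heappush(frontier, (priority, new_cost, nxt, path + [tool]))
    return None, float("inf")


def plan_subtask(subtask, start, goal, subroutines, neighbors, heuristic, check):
    """Fast path first: reuse a stored toolpath if it passes check; else run A*."""
    candidate = subroutines.get(subtask)
    if candidate is not None and check(candidate):
        return candidate, "fast"
    path, _ = a_star(start, goal, neighbors, heuristic)
    if path is not None:
        # Simplification: cache the toolpath verbatim instead of mining it with an LLM.
        subroutines[subtask] = path
    return path, "slow"
```

The expensive search runs only the first time a kind of subtask is seen (or when the stored subroutine fails its check), which is where the cost saving in this scheme comes from.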

[52] M2SFormer: Multi-Spectral and Multi-Scale Attention with Edge-Aware Difficulty Guidance for Image Forgery Localization

Ju-Hyeon Nam,Dong-Hyun Moon,Sang-Chul Lee

Main category: cs.CV

TL;DR: This paper proposes M2SFormer, a novel Transformer-based framework that improves pixel-level forgery localization by addressing computational inefficiency and enhancing detail preservation.

Details Motivation: Existing deep learning methods face challenges with computational overhead, limited representation power, and loss of fine detail when detecting complex or subtle image forgeries. Method: Proposed M2SFormer, a Transformer encoder-based framework integrating multi-frequency and multi-scale attentions with a difficulty-guided attention module. Result: M2SFormer outperforms state-of-the-art models and shows strong generalization in detecting and localizing forgeries across unseen domains. Conclusion: M2SFormer provides superior performance in forgery localization, overcoming the limitations of computational overhead and limited representation power in existing methods. Abstract: Image editing techniques have rapidly advanced, facilitating both innovative use cases and malicious manipulation of digital images. Deep learning-based methods have recently achieved high accuracy in pixel-level forgery localization, yet they frequently struggle with computational overhead and limited representation power, particularly for subtle or complex tampering. In this paper, we propose M2SFormer, a novel Transformer encoder-based framework designed to overcome these challenges. Unlike approaches that process spatial and frequency cues separately, M2SFormer unifies multi-frequency and multi-scale attentions in the skip connection, harnessing global context to better capture diverse forgery artifacts. Additionally, our framework addresses the loss of fine detail during upsampling by utilizing a global prior map, a curvature metric indicating the difficulty of forgery localization, which then guides a difficulty-guided attention module to preserve subtle manipulations more effectively. Extensive experiments on multiple benchmark datasets demonstrate that M2SFormer outperforms existing state-of-the-art models, offering superior generalization in detecting and localizing forgeries across unseen domains.

[53] PhysRig: Differentiable Physics-Based Skinning and Rigging Framework for Realistic Articulated Object Modeling

Hao Zhang,Haolan Xu,Chun Feng,Varun Jampani,Narendra Ahuja

Main category: cs.CV

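TL;DR: PhysRig is a differentiable physics-based skinning and rigging framework that embeds the skeleton in a volumetric representation simulated as a soft body, addressing the artifacts of Linear Blend Skinning and outperforming LBS-based approaches.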

Details Motivation: Linear Blend Skinning (LBS) is widely used because it is simple and differentiable, but it suffers from volume loss and unnatural deformations and cannot model elastic materials. Method: The rigid skeleton is embedded in a volumetric representation (e.g., a tetrahedral mesh) that is simulated as a deformable soft body; the method uses continuum mechanics and discretizes the object as particles embedded in an Eulerian background grid. Result: On a synthetic dataset, PhysRig consistently outperforms traditional LBS-based approaches, producing more realistic and physically plausible results. Conclusion: PhysRig is a differentiable physics-based skinning and rigging framework that overcomes the limitations of traditional LBS and demonstrates its versatility on the pose transfer task. Abstract: Skinning and rigging are fundamental components in animation, articulated object reconstruction, motion transfer, and 4D generation. Existing approaches predominantly rely on Linear Blend Skinning (LBS), due to its simplicity and differentiability. However, LBS introduces artifacts such as volume loss and unnatural deformations, and it fails to model elastic materials like soft tissues, fur, and flexible appendages (e.g., elephant trunks, ears, and fatty tissues). In this work, we propose PhysRig: a differentiable physics-based skinning and rigging framework that overcomes these limitations by embedding the rigid skeleton into a volumetric representation (e.g., a tetrahedral mesh), which is simulated as a deformable soft-body structure driven by the animated skeleton. Our method leverages continuum mechanics and discretizes the object as particles embedded in an Eulerian background grid to ensure differentiability with respect to both material properties and skeletal motion. Additionally, we introduce material prototypes, significantly reducing the learning space while maintaining high expressiveness. To evaluate our framework, we construct a comprehensive synthetic dataset using meshes from Objaverse, The Amazing Animals Zoo, and MixaMo, covering diverse object categories and motion patterns. Our method consistently outperforms traditional LBS-based approaches, generating more realistic and physically plausible results. Furthermore, we demonstrate the applicability of our framework in the pose transfer task highlighting its versatility for articulated object modeling.
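
To make the volume-loss artifact concrete, here is a small 2D sketch of the LBS baseline only (not PhysRig's physics simulation). LBS deforms a vertex as the weighted sum of per-bone rigid transforms, v' = sum_i w_i (R_i v + t_i). The function names and the 2D setup are assumptions chosen for brevity.

```python
import math


def rotation(theta):
    """2x2 rotation matrix for an angle in radians."""
    c, s = math.cos(theta), math.sin(theta)
    return ((c, -s), (s, c))


def apply_bone(bone, point):
    """Apply one rigid bone transform (rotation matrix, translation) to a 2D point."""
    r, t = bone
    x, y = point
    return (r[0][0] * x + r[0][1] * y + t[0], r[1][0] * x + r[1][1] * y + t[1])


def linear_blend_skinning(point, bones, weights):
    """Blend the per-bone transformed positions with skinning weights that sum to 1."""
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("skinning weights must sum to 1")
    x = y = 0.0
    for bone, w in zip(bones, weights):
        bx, by = apply_bone(bone, point)
        x += w * bx
        y += w * by
    return (x, y)
```

For example, a vertex at (1, 0) weighted equally between an unrotated bone and a bone rotated by 180 degrees about the origin lands at approximately (0, 0): the blended position collapses onto the joint instead of staying at unit distance, which is the kind of volume loss the abstract attributes to LBS.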

[54] AIR-VIEW: The Aviation Image Repository for Visibility Estimation of Weather, A Dataset and Benchmark

Chad Mourning,Zhewei Wang,Justin Murray

Main category: cs.CV

TL;DR: This paper introduces a new image dataset for aviation weather monitoring and benchmarks several approaches on it.

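Details Motivation: Traditional weather sensors are expensive, and there are no publicly available datasets labeled with visibility estimates that are suitable for supervised learning, which limits the use of machine learning in aviation weather monitoring. Method: Images were gathered from the Federal Aviation Administration (FAA) weather camera network over a year-long data collection campaign, and three commonly used approaches plus a general-purpose baseline were benchmarked. Result: The paper presents a new dataset of sufficient size for supervised aviation visibility estimation and evaluates the approaches on three publicly available datasets in addition to the new one. Conclusion: The paper introduces a new dataset for aviation weather monitoring and establishes a benchmark, filling the gap in public datasets for atmospheric visibility estimation. Abstract: Machine Learning for aviation weather is a growing area of research for providing low-cost alternatives for traditional, expensive weather sensors; however, in the area of atmospheric visibility estimation, publicly available datasets, tagged with visibility estimates, of distances relevant for aviation, of diverse locations, of sufficient size for use in supervised learning, are absent. This paper introduces a new dataset which represents the culmination of a year-long data collection campaign of images from the FAA weather camera network suitable for this purpose. We also present a benchmark when applying three commonly used approaches and a general-purpose baseline when trained and tested on three publicly available datasets, in addition to our own, when compared against a recently ratified ASTM standard.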

[55] Hierarchical Sub-action Tree for Continuous Sign Language Recognition

Dejie Yang,Zhu Xu,Xinjie Gao,Yang Liu

Main category: cs.CV

TL;DR: This paper proposes HST-CSLR, a method for continuous sign language recognition that effectively uses textual gloss knowledge through a hierarchical tree structure and contrastive alignment.

Details Motivation: CSLR progress is hindered by limited large datasets and precise annotations. Existing cross-modal methods underutilize textual gloss knowledge. Method: A Hierarchical Sub-action Tree (HST) is constructed to align textual and visual modalities step-by-step while reducing computational complexity. Contrastive alignment enhancement is also applied. Result: The HST-CSLR approach successfully integrates textual information using large language models and improves modality alignment. Conclusion: The proposed HST-CSLR effectively combines gloss knowledge with visual representation learning, demonstrating effectiveness on four CSLR datasets. Abstract: Continuous sign language recognition (CSLR) aims to transcribe untrimmed videos into glosses, which are typically textual words. Recent studies indicate that the lack of large datasets and precise annotations has become a bottleneck for CSLR due to insufficient training data. To address this, some works have developed cross-modal solutions to align visual and textual modalities. However, they typically extract textual features from glosses without fully utilizing their knowledge. In this paper, we propose the Hierarchical Sub-action Tree (HST), termed HST-CSLR, to efficiently combine gloss knowledge with visual representation learning. By incorporating gloss-specific knowledge from large language models, our approach leverages textual information more effectively. Specifically, we construct an HST for textual information representation, aligning visual and textual modalities step-by-step and benefiting from the tree structure to reduce computational complexity. Additionally, we impose a contrastive alignment enhancement to bridge the gap between the two modalities. Experiments on four datasets (PHOENIX-2014, PHOENIX-2014T, CSL-Daily, and Sign Language Gesture) demonstrate the effectiveness of our HST-CSLR.

[56] OmniEval: A Benchmark for Evaluating Omni-modal Models with Visual, Auditory, and Textual Inputs

Yiman Zhang,Ziheng Luo,Qiangyu Yan,Wei He,Borui Jiang,Xinghao Chen,Kai Han

Main category: cs.CV

TL;DR: This paper introduces OmniEval, a benchmark for evaluating multi-modal AI models, featuring full-modal collaboration, diverse video content, and detailed task categorization.

Details Motivation: The motivation behind OmniEval is to address the limitations of existing benchmarks by incorporating full-modal collaboration, diverse video content, and granular task evaluation for omni-modality models. Method: The authors designed evaluation tasks that require full-modal collaboration, collected diverse audio-visual videos, and categorized questions into major and sub-task types, including a more granular video localization task named Grounding. Result: OmniEval includes 810 audio-visual synchronized videos, 2617 question-answer pairs across 3 major task types and 12 sub-task types, with results showing its effectiveness in evaluating omni-modality models. Conclusion: OmniEval is a new benchmark designed to evaluate omni-modality models, and it provides a platform for assessing the ability of models to integrate and understand multiple modalities. Abstract: In this paper, we introduce OmniEval, a benchmark for evaluating omni-modality models like MiniCPM-O 2.6, which encompasses visual, auditory, and textual inputs. Compared with existing benchmarks, our OmniEval has several distinctive features: (i) Full-modal collaboration: We design evaluation tasks that highlight the strong coupling between audio and video, requiring models to effectively leverage the collaborative perception of all modalities; (ii) Diversity of videos: OmniEval includes 810 audio-visual synchronized videos, 285 Chinese videos and 525 English videos; (iii) Diversity and granularity of tasks: OmniEval contains 2617 question-answer pairs, comprising 1412 open-ended questions and 1205 multiple-choice questions. These questions are divided into 3 major task types and 12 sub-task types to achieve comprehensive evaluation. Among them, we introduce a more granular video localization task named Grounding. Then we conduct experiments on OmniEval with several omni-modality models. 
We hope that OmniEval can provide a platform for evaluating the ability to construct and understand coherence from the context of all modalities. Code and data can be found at https://omnieval.github.io/.

[57] Evidence-based diagnostic reasoning with multi-agent copilot for human pathology

Chengkuan Chen,Luca L. Weishaupt,Drew F. K. Williamson,Richard J. Chen,Tong Ding,Bowen Chen,Anurag Vaidya,Long Phi Le,Guillaume Jaume,Ming Y. Lu,Faisal Mahmood

Main category: cs.CV

TL;DR: PathChat+ is a new multimodal AI model for pathology that integrates text and image analysis to improve diagnostic reasoning, outperforming previous methods, and works with SlideSeek to analyze complex whole-slide images effectively.

Details Motivation: Current computational pathology models focus mainly on image analysis without integrating natural language or context, while existing MLLMs face challenges such as limited training data, poor multi-image understanding, and lack of autonomous diagnostic reasoning capabilities. Method: The authors introduced PathChat+, a multimodal large language model trained on over 1 million pathology-specific instruction samples and nearly 5.5 million question-answer turns, along with SlideSeek, a reasoning-enabled multi-agent AI system that uses PathChat+ for iterative, hierarchical diagnostic reasoning on gigapixel WSIs. Result: PathChat+ outperformed prior PathChat, state-of-the-art general-purpose models, and other pathology-specific models across diverse benchmarks. SlideSeek demonstrated high accuracy on DDxBench and generated interpretable summary reports through visual grounding. Conclusion: PathChat+ and SlideSeek significantly advance computational pathology by overcoming limitations of existing models, enabling more accurate and autonomous diagnostic reasoning on whole-slide images. Abstract: Pathology is experiencing rapid digital transformation driven by whole-slide imaging and artificial intelligence (AI). While deep learning-based computational pathology has achieved notable success, traditional models primarily focus on image analysis without integrating natural language instruction or rich, text-based context. Current multimodal large language models (MLLMs) in computational pathology face limitations, including insufficient training data, inadequate support and evaluation for multi-image understanding, and a lack of autonomous, diagnostic reasoning capabilities. To address these limitations, we introduce PathChat+, a new MLLM specifically designed for human pathology, trained on over 1 million diverse, pathology-specific instruction samples and nearly 5.5 million question answer turns. 
Extensive evaluations across diverse pathology benchmarks demonstrated that PathChat+ substantially outperforms the prior PathChat copilot, as well as both state-of-the-art (SOTA) general-purpose and other pathology-specific models. Furthermore, we present SlideSeek, a reasoning-enabled multi-agent AI system leveraging PathChat+ to autonomously evaluate gigapixel whole-slide images (WSIs) through iterative, hierarchical diagnostic reasoning, reaching high accuracy on DDxBench, a challenging open-ended differential diagnosis benchmark, while also capable of generating visually grounded, humanly-interpretable summary reports.

[58] DFVEdit: Conditional Delta Flow Vector for Zero-shot Video Editing

Lingling Cai,Kang Zhao,Hangjie Yuan,Xiang Wang,Yingya Zhang,Kejie Huang

Main category: cs.CV

TL;DR: DFVEdit is an efficient, zero-shot video editing method for Video DiTs that avoids costly attention modifications or fine-tuning, offering significant gains in speed and memory usage while maintaining high-quality results.

Details Motivation: Existing video editing methods applied to Video DiTs are computationally expensive due to reliance on attention modification or fine-tuning, prompting the need for a more efficient approach. Method: The method introduces Conditional Delta Flow Vector (CDFV) and enhances it with Implicit Cross Attention (ICA) guidance and Embedding Reinforcement (ER), operating directly on clean latents through flow transformation without attention modification or fine-tuning. Result: DFVEdit achieves at least 20x inference speed-up and 85% memory reduction compared to existing methods, while maintaining high structural fidelity, spatial-temporal consistency, and editing quality across popular Video DiTs like CogVideoX and Wan2.1. Conclusion: DFVEdit provides a highly efficient and zero-shot video editing solution for Video DiTs, achieving superior performance in speed, memory efficiency, and editing quality. Abstract: The advent of Video Diffusion Transformers (Video DiTs) marks a milestone in video generation. However, directly applying existing video editing methods to Video DiTs often incurs substantial computational overhead, due to resource-intensive attention modification or finetuning. To alleviate this problem, we present DFVEdit, an efficient zero-shot video editing method tailored for Video DiTs. DFVEdit eliminates the need for both attention modification and fine-tuning by directly operating on clean latents via flow transformation. To be more specific, we observe that editing and sampling can be unified under the continuous flow perspective. Building upon this foundation, we propose the Conditional Delta Flow Vector (CDFV) -- a theoretically unbiased estimation of DFV -- and integrate Implicit Cross Attention (ICA) guidance as well as Embedding Reinforcement (ER) to further enhance editing quality. 
DFVEdit excels in practical efficiency, offering at least 20x inference speed-up and 85% memory reduction on Video DiTs compared to attention-engineering-based editing methods. Extensive quantitative and qualitative experiments demonstrate that DFVEdit can be seamlessly applied to popular Video DiTs (e.g., CogVideoX and Wan2.1), attaining state-of-the-art performance on structural fidelity, spatial-temporal consistency, and editing quality.

[59] From Cradle to Cane: A Two-Pass Framework for High-Fidelity Lifespan Face Aging

Tao Liu,Dafeng Zhang,Gengchen Li,Shizhuo Liu,Yongqi Song,Senmao Li,Shiqi Yang,Boqian Li,Kai Wang,Yaxing Wang

Main category: cs.CV

TL;DR: This paper proposes a two-pass face aging framework called Cradle2Cane that successfully balances the trade-off between age accuracy and identity preservation, demonstrating superior performance over existing methods.

Details Motivation: Existing face aging methods struggle with achieving a realistic and seamless transformation across the entire lifespan, especially when handling large age gaps or extreme head poses. This challenge, known as the Age-ID trade-off, involves balancing age accuracy and identity preservation. Method: A two-pass face aging framework based on few-step text-to-image (T2I) diffusion models. The first pass introduces an adaptive noise injection (AdaNI) mechanism guided by textual conditions of age and gender to focus on age accuracy. The second pass enhances identity preservation using identity-aware embeddings (IDEmb): SVR-ArcFace and Rotate-CLIP, while maintaining age-specific features. Result: Extensive experiments on the CelebA-HQ test dataset, evaluated through Face++ and Qwen-VL protocols, show that Cradle2Cane outperforms existing face aging methods in terms of both age accuracy and identity consistency. Conclusion: The proposed Cradle2Cane framework effectively addresses the Age-ID trade-off in face aging tasks, outperforming existing methods in age accuracy and identity consistency. Abstract: Face aging has become a crucial task in computer vision, with applications ranging from entertainment to healthcare. However, existing methods struggle with achieving a realistic and seamless transformation across the entire lifespan, especially when handling large age gaps or extreme head poses. The core challenge lies in balancing age accuracy and identity preservation--what we refer to as the Age-ID trade-off. Most prior methods either prioritize age transformation at the expense of identity consistency or vice versa. In this work, we address this issue by proposing a two-pass face aging framework, named Cradle2Cane, based on few-step text-to-image (T2I) diffusion models. The first pass focuses on solving age accuracy by introducing an adaptive noise injection (AdaNI) mechanism. 
This mechanism is guided by including prompt descriptions of age and gender for the given person as the textual condition. Also, by adjusting the noise level, we can control the strength of aging while allowing more flexibility in transforming the face. However, identity preservation is weakly ensured here to facilitate stronger age transformations. In the second pass, we enhance identity preservation while maintaining age-specific features by conditioning the model on two identity-aware embeddings (IDEmb): SVR-ArcFace and Rotate-CLIP. This pass allows for denoising the transformed image from the first pass, ensuring stronger identity preservation without compromising the aging accuracy. Both passes are jointly trained in an end-to-end way. Extensive experiments on the CelebA-HQ test dataset, evaluated through Face++ and Qwen-VL protocols, show that our Cradle2Cane outperforms existing face aging methods in age accuracy and identity consistency.

[60] 3D Scene-Camera Representation with Joint Camera Photometric Optimization

Weichen Dai,Kangcheng Ma,Jiaxin Wang,Kecen Pan,Yuhang Ming,Hua Zhang,Wanzeng Kong

Main category: cs.CV

TL;DR: This paper proposes a 3D scene-camera representation method that jointly optimizes camera photometric parameters to remove scene-unrelated information, resulting in high-quality 3D scene representations even in the presence of imaging distortions.

Details Motivation: Photometric distortions in camera imaging can degrade image quality and introduce erroneous information into 3D scene representations. This paper aims to address this issue by developing a method that accounts for these distortions. Method: A novel 3D scene-camera representation with joint camera photometric optimization is proposed. Internal and external photometric models are introduced to develop a full photometric model and corresponding camera representation. Depth regularization is also applied during photometric parameter optimization. Result: Experimental results show that the proposed method can produce high-quality 3D scene representations even under conditions of imaging degradation such as vignetting and dirt. Conclusion: The proposed method can achieve high-quality 3D scene representations by incorporating a camera photometric model and optimizing its parameters, which effectively separates scene-unrelated information from the 3D scene representation. Abstract: Representing scenes from multi-view images is a crucial task in computer vision with extensive applications. However, inherent photometric distortions in the camera imaging can significantly degrade image quality. Without accounting for these distortions, the 3D scene representation may inadvertently incorporate erroneous information unrelated to the scene, diminishing the quality of the representation. In this paper, we propose a novel 3D scene-camera representation with joint camera photometric optimization. By introducing internal and external photometric model, we propose a full photometric model and corresponding camera representation. Based on simultaneously optimizing the parameters of the camera representation, the proposed method effectively separates scene-unrelated information from the 3D scene representation. 
Additionally, during the optimization of the photometric parameters, we introduce a depth regularization to prevent the 3D scene representation from fitting scene-unrelated information. By incorporating the camera model as part of the mapping process, the proposed method constructs a complete map that includes both the scene radiance field and the camera photometric model. Experimental results demonstrate that the proposed method can achieve high-quality 3D scene representations, even under conditions of imaging degradation, such as vignetting and dirt.

[61] Rethink Sparse Signals for Pose-guided Text-to-image Generation

Wenjie Xuan,Jing Zhang,Juhua Liu,Bo Du,Dacheng Tao

Main category: cs.CV

TL;DR: This paper proposes SP-Ctrl, a new method built on sparse signals that shows strong controllability and diversity in pose-guided text-to-image generation.

Details Motivation: Dense signals provide detailed spatial guidance but are difficult to edit and can be inconsistent with textual prompts, which motivates revisiting sparse signals for pose guidance. Method: The paper proposes a new Spatial-Pose ControlNet (SP-Ctrl) that extends OpenPose to a learnable spatial representation and introduces keypoint concept learning to improve pose alignment. Result: Experiments show that the method outperforms recent spatially controllable T2I generation methods under sparse-pose guidance and matches the performance of dense signal-based methods. Conclusion: SP-Ctrl achieves strong controllability for pose-guided image generation using sparse signals and shows promising capabilities in diverse and cross-species generation. Abstract: Recent works favored dense signals (e.g., depth, DensePose), as an alternative to sparse signals (e.g., OpenPose), to provide detailed spatial guidance for pose-guided text-to-image generation. However, dense representations raised new challenges, including editing difficulties and potential inconsistencies with textual prompts. This fact motivates us to revisit sparse signals for pose guidance, owing to their simplicity and shape-agnostic nature, which remains underexplored. This paper proposes a novel Spatial-Pose ControlNet (SP-Ctrl), equipping sparse signals with robust controllability for pose-guided image generation. Specifically, we extend OpenPose to a learnable spatial representation, making keypoint embeddings discriminative and expressive. Additionally, we introduce keypoint concept learning, which encourages keypoint tokens to attend to the spatial positions of each keypoint, thus improving pose alignment. Experiments on animal- and human-centric image generation tasks demonstrate that our method outperforms recent spatially controllable T2I generation approaches under sparse-pose guidance and even matches the performance of dense signal-based methods. Moreover, SP-Ctrl shows promising capabilities in diverse and cross-species generation through sparse signals. Codes will be available at https://github.com/DREAMXFAR/SP-Ctrl.

[62] EVA: Mixture-of-Experts Semantic Variant Alignment for Compositional Zero-Shot Learning

Xiao Zhang,Yongqiang Ma,Haodong Jing,Nanning Zheng

Main category: cs.CV

TL;DR: This paper proposes EVA, a novel framework for Compositional Zero-Shot Learning that improves compositional generalization by leveraging multiple experts and semantic alignment techniques.

Details Motivation: Existing CZSL methods are suboptimal due to simple composition-prototype mapping and neglect of compositional divergence in cross-modal matching. Method: EVA uses a Mixture-of-Experts Semantic Variant Alignment framework, incorporating domain-expert adaptation for token-aware learning and semantic variant alignment for precise image-primitive matching. Result: EVA significantly surpasses other CZSL approaches on three benchmarks in both closed- and open-world settings. Conclusion: The proposed EVA framework outperforms state-of-the-art CZSL methods by achieving better compositional generalization through domain-expert adaptation and semantic variant alignment. Abstract: Compositional Zero-Shot Learning (CZSL) investigates compositional generalization capacity to recognize unknown state-object pairs based on learned primitive concepts. Existing CZSL methods typically derive primitives features through a simple composition-prototype mapping, which is suboptimal for a set of individuals that can be divided into distinct semantic subsets. Moreover, the all-to-one cross-modal primitives matching neglects compositional divergence within identical states or objects, limiting fine-grained image-composition alignment. In this study, we propose EVA, a Mixture-of-Experts Semantic Variant Alignment framework for CZSL. Specifically, we introduce domain-expert adaption, leveraging multiple experts to achieve token-aware learning and model high-quality primitive representations. To enable accurate compositional generalization, we further present semantic variant alignment to select semantically relevant representation for image-primitives matching. Our method significantly outperforms other state-of-the-art CZSL methods on three popular benchmarks in both closed- and open-world settings, demonstrating the efficacy of the proposed insight.

[63] Segment Anything in Pathology Images with Natural Language

Zhixuan Chen,Junlin Hou,Liqi Lin,Yihui Wang,Yequan Bie,Xi Wang,Yanning Zhou,Ronald Cheong Kin Chan,Hao Chen

Main category: cs.CV

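TL;DR: The paper proposes PathSegmentor, a text-prompted segmentation foundation model, together with its accompanying large-scale dataset PathSeg, for pathology image analysis.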

Details Motivation: Current methods face major challenges in clinical applications because of limited annotated data and restricted category definitions. Method: The authors introduce PathSegmentor, a text-prompted segmentation model designed specifically for pathology images, and build PathSeg, a dataset containing 275k image-mask-label triples. Result: Experiments show that PathSegmentor surpasses existing models in overall Dice score, with higher accuracy and broader applicability, and improves the interpretability of diagnostic models. Conclusion: This work advances the development of explainable AI in precision oncology. Abstract: Pathology image segmentation is crucial in computational pathology for analyzing histological features relevant to cancer diagnosis and prognosis. However, current methods face major challenges in clinical applications due to limited annotated data and restricted category definitions. To address these limitations, we propose PathSegmentor, the first text-prompted segmentation foundation model designed specifically for pathology images. We also introduce PathSeg, the largest and most comprehensive dataset for pathology segmentation, built from 17 public sources and containing 275k image-mask-label triples across 160 diverse categories. With PathSegmentor, users can perform semantic segmentation using natural language prompts, eliminating the need for laborious spatial inputs such as points or boxes. Extensive experiments demonstrate that PathSegmentor outperforms specialized models with higher accuracy and broader applicability, while maintaining a compact architecture. It significantly surpasses existing spatial- and text-prompted models by 0.145 and 0.429 in overall Dice scores, respectively, showing strong robustness in segmenting complex structures and generalizing to external datasets. Moreover, PathSegmentor's outputs enhance the interpretability of diagnostic models through feature importance estimation and imaging biomarker discovery, offering pathologists evidence-based support for clinical decision-making. This work advances the development of explainable AI in precision oncology.
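For readers unfamiliar with the Dice score cited above, the following is a minimal sketch of the standard Dice coefficient for two binary masks. It is not taken from the PathSegmentor code; the function name dice_score and the convention of returning 1.0 when both masks are empty are assumptions made for illustration.

```python
import numpy as np

def dice_score(pred, target):
    """Dice coefficient 2|A ∩ B| / (|A| + |B|) for two binary masks.

    Hypothetical helper for illustration only; returns 1.0 when both
    masks are empty (an assumed convention, not one from the paper).
    """
    pred = np.asarray(pred, dtype=bool)
    target = np.asarray(target, dtype=bool)
    if pred.shape != target.shape:
        raise ValueError("masks must have the same shape")
    intersection = np.logical_and(pred, target).sum()
    denom = pred.sum() + target.sum()
    if denom == 0:
        return 1.0
    return float(2.0 * intersection / denom)

# Toy example: 2 predicted pixels, 1 ground-truth pixel, 1 pixel shared.
pred_mask = [[1, 1], [0, 0]]
true_mask = [[1, 0], [0, 0]]
score = dice_score(pred_mask, true_mask)
```

On this scale, which runs from 0 (no overlap) to 1 (identical masks), the reported gaps of 0.145 and 0.429 are absolute differences in the score.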

[64] TSDASeg: A Two-Stage Model with Direct Alignment for Interactive Point Cloud Segmentation

Chade Li,Pengju Zhang,Yihong Wu

Main category: cs.CV

TL;DR: TSDASeg improves interactive point cloud segmentation by directly aligning 3D point clouds with textual and visual data using a novel two-stage framework and memory module.

Details Motivation: Existing methods underperform in point-level tasks like segmentation due to missing direct 3D-text alignment, limiting their ability to link local 3D features with textual context. This motivates the development of a solution that explicitly aligns 3D point clouds with textual and visual data. Method: TSDASeg uses a two-stage framework with a direct cross-modal alignment module and a memory module containing multiple dedicated memory banks. These components work together to enhance the alignment between 3D point clouds, textual data, and 2D images through self-attention and cross-attention mechanisms. Result: Experiments on multiple datasets show that the proposed method achieves superior performance, establishing state-of-the-art results for interactive point cloud segmentation. Conclusion: The proposed TSDASeg model achieves state-of-the-art performance in interactive point cloud segmentation by addressing the issue of 3D-text misalignment and improving cross-modal feature consistency. Abstract: The rapid advancement of 3D vision-language models (VLMs) has spurred significant interest in interactive point cloud processing tasks, particularly for real-world applications. However, existing methods often underperform in point-level tasks, such as segmentation, due to missing direct 3D-text alignment, limiting their ability to link local 3D features with textual context. To solve this problem, we propose TSDASeg, a Two-Stage model coupled with a Direct cross-modal Alignment module and memory module for interactive point cloud Segmentation. We introduce the direct cross-modal alignment module to establish explicit alignment between 3D point clouds and textual/2D image data. Within the memory module, we employ multiple dedicated memory banks to separately store text features, visual features, and their cross-modal correspondence mappings. 
These memory banks are dynamically leveraged through self-attention and cross-attention mechanisms to update scene-specific features based on prior stored data, effectively addressing inconsistencies in interactive segmentation results across diverse scenarios. Experiments conducted on multiple 3D instruction, reference, and semantic segmentation datasets demonstrate that the proposed method achieves state-of-the-art performance.

[65] Step-by-Step Video-to-Audio Synthesis via Negative Audio Guidance

Akio Hayakawa,Masato Ishii,Takashi Shibuya,Yuki Mitsufuji

Main category: cs.CV

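TL;DR: This paper proposes a novel step-by-step video-to-audio generation method, formulated as a guided video-to-audio synthesis task, that can generate multiple semantically distinct audio tracks and thereby achieve higher-quality composite audio synthesis.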

Details Motivation: The goal is to comprehensively capture all sound events induced by a given video, mirroring traditional Foley workflows. Method: A new step-by-step video-to-audio generation method is proposed in which each step is a guided video-to-audio synthesis task conditioned on a target text prompt and previously generated audio tracks. Result: The design is inspired by the idea of concept negation from prior compositional generation frameworks, and the paper introduces a training framework that leverages pre-trained video-to-audio models and eliminates the need for specialized paired datasets. Conclusion: Experimental results show that the method generates multiple semantically distinct audio tracks for a single input video, with composite audio quality exceeding that of existing baselines. Abstract: We propose a novel step-by-step video-to-audio generation method that sequentially produces individual audio tracks, each corresponding to a specific sound event in the video. Our approach mirrors traditional Foley workflows, aiming to capture all sound events induced by a given video comprehensively. Each generation step is formulated as a guided video-to-audio synthesis task, conditioned on a target text prompt and previously generated audio tracks. This design is inspired by the idea of concept negation from prior compositional generation frameworks. To enable this guided generation, we introduce a training framework that leverages pre-trained video-to-audio models and eliminates the need for specialized paired datasets, allowing training on more accessible data. Experimental results demonstrate that our method generates multiple semantically distinct audio tracks for a single input video, leading to higher-quality composite audio synthesis than existing baselines.

[66] DBMovi-GS: Dynamic View Synthesis from Blurry Monocular Video via Sparse-Controlled Gaussian Splatting

Yeon-Ji Song,Jaein Kim,Byung-Ju Kim,Byoung-Tak Zhang

Main category: cs.CV

TL;DR: This paper proposes DBMovi-GS, a novel method for dynamic view synthesis from blurry monocular videos using sparse-controlled Gaussian splatting, achieving state-of-the-art results in realistic novel view synthesis for dynamic and low-quality inputs.

Details Motivation: Synthesizing dynamic scenes from blurry monocular videos is an unresolved challenge due to existing methods' reliance on high-resolution images or rigid scene priors, leading to instability and degraded visual fidelity in real-world environments with dynamic motion. Method: The paper introduces Motion-aware Dynamic View Synthesis from Blurry Monocular Video via Sparse-Controlled Gaussian Splatting (DBMovi-GS), which generates dense 3D Gaussians to restore sharpness and reconstruct detailed 3D scene geometry affected by dynamic motion variations. Result: The proposed model achieves robust performance in novel view synthesis under dynamic and blurry scenes, enabling realistic view synthesis for challenging monocular video inputs. Conclusion: The proposed DBMovi-GS method effectively addresses the challenge of dynamic novel view synthesis from blurry monocular videos, achieving robust performance and setting a new benchmark in realistic view synthesis for such inputs. Abstract: Novel view synthesis is a task of generating scenes from unseen perspectives; however, synthesizing dynamic scenes from blurry monocular videos remains an unresolved challenge that has yet to be effectively addressed. Existing novel view synthesis methods are often constrained by their reliance on high-resolution images or strong assumptions about static geometry and rigid scene priors. Consequently, their approaches lack robustness in real-world environments with dynamic object and camera motion, leading to instability and degraded visual fidelity. To address this, we propose Motion-aware Dynamic View Synthesis from Blurry Monocular Video via Sparse-Controlled Gaussian Splatting (DBMovi-GS), a method designed for dynamic view synthesis from blurry monocular videos. Our model generates dense 3D Gaussians, restoring sharpness from blurry videos and reconstructing detailed 3D geometry of the scene affected by dynamic motion variations. 
Our model achieves robust performance in novel view synthesis under dynamic blurry scenes and sets a new benchmark in realistic novel view synthesis for blurry monocular video inputs.

[67] Style-Aligned Image Composition for Robust Detection of Abnormal Cells in Cytopathology

Qiuyi Qi,Xin Li,Ming Kong,Zikang Xu,Bingdi Chen,Qiang Zhu,S Kevin Zhou

Main category: cs.CV

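TL;DR: This paper proposes a method called SAIC that synthesizes high-quality pathological images to improve the effectiveness and stability of abnormal cell detection models, making it suitable for practical clinical use.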

Details Motivation: The lack of high-quality annotations, long-tailed data distributions, and inconsistent staining styles are the main challenges in training neural networks to detect abnormal cells in cytopathology. Method: SAIC consists of three steps: selecting a suitable candidate from an abnormal cell bank based on attribute guidance, using high-frequency feature reconstruction to achieve style-aligned, high-fidelity composition, and finally introducing a large vision-language model to filter high-quality synthesized images. Result: Experimental results show that incorporating SAIC-synthesized images effectively improves the performance and robustness of detection models on tail categories and styles, thereby improving overall detection performance. Conclusion: SAIC can enhance the effectiveness and robustness of abnormal cell detection models and shows good generalizability and practicality in clinical applications. Abstract: Challenges such as the lack of high-quality annotations, long-tailed data distributions, and inconsistent staining styles pose significant obstacles to training neural networks to detect abnormal cells in cytopathology robustly. This paper proposes a style-aligned image composition (SAIC) method that composes high-fidelity and style-preserved pathological images to enhance the effectiveness and robustness of detection models. Without additional training, SAIC first selects an appropriate candidate from the abnormal cell bank based on attribute guidance. Then, it employs a high-frequency feature reconstruction to achieve a style-aligned and high-fidelity composition of abnormal cells and pathological backgrounds. Finally, it introduces a large vision-language model to filter high-quality synthesis images. Experimental results demonstrate that incorporating SAIC-synthesized images effectively enhances the performance and robustness of abnormal cell detection for tail categories and styles, thereby improving overall detection performance. The comprehensive quality evaluation further confirms the generalizability and practicality of SAIC in clinical application scenarios. Our code will be released at https://github.com/Joey-Qi/SAIC.

[68] Inverse Scene Text Removal

Takumi Yoshimatsu,Shumpei Takezaki,Seiichi Uchida

Main category: cs.CV

TL;DR: This paper investigates Inverse Scene Text Removal (ISTR), which focuses on detecting if images have had text removed and locating where that removal happened, aiming to curb misuse of STR technology.

Details Motivation: The motivation for this research is the increasing misuse risks of Scene Text Removal (STR) technology as it becomes more advanced through neural networks and synthetic data. The authors aim to investigate these risks and find ways to detect potential misuse. Method: The authors used neural networks and synthetic data to develop ISTR, which analyzes STR-processed images with a focus on binary classification and localizing removed text regions. They also trained a text recognizer in an attempt to recover the removed text content. Result: The experiments demonstrated that ISTR tasks, such as detecting whether an image has undergone STR and localizing removed text regions, can be achieved with high accuracies. This enables the detection of potential misuse and enhances STR technology. Conclusion: This paper concludes that ISTR can effectively detect whether an image has undergone STR and localize removed text regions, thus enabling the detection of potential misuse and improving STR. Abstract: Scene text removal (STR) aims to erase textual elements from images. It was originally intended for removing privacy-sensitive or undesired texts from natural scene images, but is now also applied to typographic images. STR typically detects text regions and then inpaints them. Although STR has advanced through neural networks and synthetic data, misuse risks have increased. This paper investigates Inverse STR (ISTR), which analyzes STR-processed images and focuses on binary classification (detecting whether an image has undergone STR) and localizing removed text regions. We demonstrate in experiments that these tasks are achievable with high accuracies, enabling detection of potential misuse and improving STR. We also attempt to recover the removed text content by training a text recognizer to understand its difficulty.

[69] VisionGuard: Synergistic Framework for Helmet Violation Detection

Lam-Huy Nguyen,Thinh-Phuc Nguyen,Thanh-Hai Nguyen,Gia-Huy Dinh,Minh-Triet Tran,Trung-Nghia Le

Main category: cs.CV

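TL;DR: This paper proposes the VisionGuard framework, which uses a multi-stage approach to improve the accuracy of motorcycle helmet violation detection and performs especially well in complex environments.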

Details Motivation: Automatic detection of motorcycle helmet violations faces challenges such as environmental variability, camera angles, and data inconsistencies, which limit reliable detection and consistent classification. Method: VisionGuard combines a tracking-based Adaptive Labeling module with a Contextual Expander module that generates virtual bounding boxes, improving classification consistency and recall for minority classes. Result: Experimental results show that VisionGuard improves overall mAP by 3.1% over baseline detectors, demonstrating its practical potential for traffic surveillance systems. Conclusion: VisionGuard is a multi-stage framework that integrates Adaptive Labeling and Contextual Expander modules to effectively address the limitations of frame-wise detectors in motorcycle helmet violation detection, especially under class imbalance and inconsistent annotations. Abstract: Enforcing helmet regulations among motorcyclists is essential for enhancing road safety and ensuring the effectiveness of traffic management systems. However, automatic detection of helmet violations faces significant challenges due to environmental variability, camera angles, and inconsistencies in the data. These factors hinder reliable detection of motorcycles and riders and disrupt consistent object classification. To address these challenges, we propose VisionGuard, a synergistic multi-stage framework designed to overcome the limitations of frame-wise detectors, especially in scenarios with class imbalance and inconsistent annotations. VisionGuard integrates two key components: Adaptive Labeling and Contextual Expander modules. The Adaptive Labeling module is a tracking-based refinement technique that enhances classification consistency by leveraging a tracking algorithm to assign persistent labels across frames and correct misclassifications. The Contextual Expander module improves recall for underrepresented classes by generating virtual bounding boxes with appropriate confidence scores, effectively addressing the impact of data imbalance. Experimental results show that VisionGuard improves overall mAP by 3.1% compared to baseline detectors, demonstrating its effectiveness and potential for real-world deployment in traffic surveillance systems, ultimately promoting safety and regulatory compliance.
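The idea of assigning persistent labels across frames can be illustrated with a small sketch. The abstract does not specify how Adaptive Labeling picks the persistent label, so the majority vote below is an assumption, as are the function name smooth_track_labels, the (frame, track_id, label) tuple layout, and the example class names. This is not the authors' implementation.

```python
from collections import Counter, defaultdict

def smooth_track_labels(detections):
    """Replace each detection's label with its track's majority label.

    detections: list of (frame_idx, track_id, label) tuples, where the
    track_id comes from some upstream tracker (assumed, not shown here).
    Majority voting is an illustrative choice, not the paper's rule.
    """
    votes = defaultdict(Counter)
    for _, track_id, label in detections:
        votes[track_id][label] += 1
    # Most frequent label per track.
    majority = {tid: counts.most_common(1)[0][0] for tid, counts in votes.items()}
    return [(frame, tid, majority[tid]) for frame, tid, _ in detections]

# Track 7 is misclassified in frame 2; the vote corrects it.
raw = [
    (0, 7, "no_helmet"),
    (1, 7, "no_helmet"),
    (2, 7, "helmet"),
    (0, 9, "helmet"),
]
smoothed = smooth_track_labels(raw)
```

A frame-wise detector would keep the frame 2 error, whereas a per-track vote overrides it as long as the tracker keeps the identity stable.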

[70] Detection of Breast Cancer Lumpectomy Margin with SAM-incorporated Forward-Forward Contrastive Learning

Tyler Ward,Xiaoqin Wang,Braxton McFarland,Md Atik Ahamed,Sahar Nozad,Talal Arshad,Hafsa Nebbache,Jin Chen,Abdullah Imran

Main category: cs.CV

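TL;DR: A new deep learning framework, FFCL-SAM, enables faster and more accurate assessment of specimen margin status during breast cancer surgery, with the potential to reduce the proportion of patients who need a second operation.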

Details Motivation: The work addresses the limited accuracy of current 2D specimen radiography in assessing specimen margin status during breast cancer lumpectomy, with the aim of avoiding the need for additional surgery. Method: A new deep learning framework combining the Segment Anything Model (SAM) with Forward-Forward Contrastive Learning (FFCL) is proposed for patch-level classification of SR images and tumor margin segmentation. Result: The method achieved an AUC of 0.8455 for margin classification and a 27.4% improvement in Dice similarity over baseline models, while reducing inference time to 47 milliseconds per image. Conclusion: FFCL-SAM significantly improves the speed and accuracy of intraoperative margin assessment, with the potential to reduce re-excision rates and improve surgical outcomes in breast cancer treatment. Abstract: Complete removal of cancer tumors with a negative specimen margin during lumpectomy is essential in reducing breast cancer recurrence. However, 2D specimen radiography (SR), the current method used to assess intraoperative specimen margin status, has limited accuracy, resulting in nearly a quarter of patients requiring additional surgery. To address this, we propose a novel deep learning framework combining the Segment Anything Model (SAM) with Forward-Forward Contrastive Learning (FFCL), a pre-training strategy leveraging both local and global contrastive learning for patch-level classification of SR images. After annotating SR images with regions of known malignancy, non-malignant tissue, and pathology-confirmed margins, we pre-train a ResNet-18 backbone with FFCL to classify margin status, then reconstruct coarse binary masks to prompt SAM for refined tumor margin segmentation. Our approach achieved an AUC of 0.8455 for margin classification and segmented margins with a 27.4% improvement in Dice similarity over baseline models, while reducing inference time to 47 milliseconds per image. These results demonstrate that FFCL-SAM significantly enhances both the speed and accuracy of intraoperative margin assessment, with strong potential to reduce re-excision rates and improve surgical outcomes in breast cancer treatment. Our code is available at https://github.com/tbwa233/FFCL-SAM/.

[71] The Aging Multiverse: Generating Condition-Aware Facial Aging Tree via Training-Free Diffusion

Bang Gong,Luchao Qi,Jiaye Wu,Zhicheng Fu,Chunbo Song,David W. Jacobs,John Nicholson,Roni Sengupta

Main category: cs.CV

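TL;DR: The paper proposes the Aging Multiverse framework, which generates multiple aging trajectories from a single facial image, each shaped by environment, health, and lifestyle, moving beyond conventional single-path aging.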

Details Motivation: Conventional methods treat facial aging as a single deterministic path, whereas this work aims to offer multiple plausible aging trajectories and so show future changes more fully. The approach is more controllable and interpretable, and suits digital storytelling, health education, and personalized visualization. Method: The paper uses a diffusion-based method that modulates editing strength through attention mixing and introduces a Simulated Aging Regularization strategy to stabilize edits. Result: Experiments show that the method outperforms existing models in identity preservation, aging realism, and conditional alignment, and user studies confirm its superior performance. Conclusion: The paper proposes a new method for generating facial aging trajectories that produces multiple aging outcomes conditioned on external factors from a single image and balances identity preservation, age accuracy, and condition control. Abstract: We introduce the Aging Multiverse, a framework for generating multiple plausible facial aging trajectories from a single image, each conditioned on external factors such as environment, health, and lifestyle. Unlike prior methods that model aging as a single deterministic path, our approach creates an aging tree that visualizes diverse futures. To enable this, we propose a training-free diffusion-based method that balances identity preservation, age accuracy, and condition control. Our key contributions include attention mixing to modulate editing strength and a Simulated Aging Regularization strategy to stabilize edits. Extensive experiments and user studies demonstrate state-of-the-art performance across identity preservation, aging realism, and conditional alignment, outperforming existing editing and age-progression models, which often fail to account for one or more of the editing criteria. By transforming aging into a multi-dimensional, controllable, and interpretable process, our approach opens up new creative and practical avenues in digital storytelling, health education, and personalized visualization.

[72] User-in-the-Loop View Sampling with Error Peaking Visualization

Ayaka Yasunaga,Hideo Saito,Shohei Mori

Main category: cs.CV

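TL;DR: This paper proposes a mobile view synthesis method that needs no 3D annotations, using locally reconstructed light fields and error visualization to reduce the number of view samples required and improve the user experience.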

Details Motivation: Existing AR-based tasks for capturing new view samples require 3D annotations and are confined to pre-defined small areas, which makes them mentally demanding, so a freer and more efficient method is needed. Method: Locally reconstruct light fields and visualize the errors that inserting new views would remove, thereby improving the data collection process. Result: Experiments show that the error-peaking visualization reduces disappointment in final results, remains satisfactory with fewer view samples, and applies to radiance field reconstruction for larger scenes (such as 3D Gaussian splatting). Conclusion: The paper proposes a mobile view synthesis method based on locally reconstructed light fields and error visualization that reduces reliance on 3D annotations and makes scene exploration more flexible. Abstract: Augmented reality (AR) provides ways to visualize missing view samples for novel view synthesis. Existing approaches present 3D annotations for new view samples and task users with taking images by aligning the AR display. This data collection task is known to be mentally demanding and limits capture areas to pre-defined small areas due to the ideal but restrictive underlying sampling theory. To free users from 3D annotations and limited scene exploration, we propose using locally reconstructed light fields and visualizing errors to be removed by inserting new views. Our results show that the error-peaking visualization is less invasive, reduces disappointment in final results, and is satisfactory with fewer view samples in our mobile view synthesis system. We also show that our approach can contribute to recent radiance field reconstruction for larger scenes, such as 3D Gaussian splatting.

[73] Bridging Video Quality Scoring and Justification via Large Multimodal Models

Qizhi Xie,Kun Yuan,Yunpeng Qu,Jiachao Gong,Mingda Wu,Ming Sun,Chao Zhou,Jihong Zhu

Main category: cs.CV

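TL;DR: This paper proposes a new video quality assessment method that uses a Score-based Instruction Generation (SIG) pipeline and a progressive tuning strategy to significantly improve the quality scoring and justification abilities of video large multimodal models.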

Details Motivation: Classical video quality assessment (VQA) methods only produce a numerical score and cannot describe a video's complex quality dimensions. Adapting video large multimodal models (LMMs) through instruction tuning has the potential to solve this. Method: Proposes a Score-based Instruction Generation (SIG) pipeline and designs a progressive tuning strategy to fully unleash the power of S2I. Result: The proposed SIG pipeline eliminates the reliance on expert-written quality descriptions and proprietary systems, ensuring data scalability and generation efficiency. In addition, a Score2Instruct (S2I) dataset with over 320K diverse instruction-response pairs is constructed. Conclusion: Experimental results show that the method consistently improves quality scoring and justification abilities across multiple video LMMs. Abstract: Classical video quality assessment (VQA) methods generate a numerical score to judge a video's perceived visual fidelity and clarity. Yet, a score fails to describe the video's complex quality dimensions, restricting its applicability. Benefiting from the linguistic output, adapting video large multimodal models (LMMs) to VQA via instruction tuning has the potential to address this issue. The core of the approach lies in the video quality-centric instruction data. Previous explorations mainly focus on the image domain, and their data generation processes heavily rely on human quality annotations and proprietary systems, limiting data scalability and effectiveness. To address these challenges, we propose the Score-based Instruction Generation (SIG) pipeline. Specifically, SIG first scores multiple quality dimensions of an unlabeled video and maps scores to text-defined levels. It then explicitly incorporates a hierarchical Chain-of-Thought (CoT) to model the correlation between specific dimensions and overall quality, mimicking the human visual system's reasoning process. The automated pipeline eliminates the reliance on expert-written quality descriptions and proprietary systems, ensuring data scalability and generation efficiency. To this end, the resulting Score2Instruct (S2I) dataset contains over 320K diverse instruction-response pairs, laying the basis for instruction tuning. Moreover, to advance video LMMs' quality scoring and justification abilities simultaneously, we devise a progressive tuning strategy to fully unleash the power of S2I.
Built upon SIG, we further curate a benchmark termed S2I-Bench with 400 open-ended questions to better evaluate the quality justification capacity of video LMMs. Experimental results on the S2I-Bench and existing benchmarks indicate that our method consistently improves quality scoring and justification capabilities across multiple video LMMs.
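The abstract says SIG "maps scores to text-defined levels" and then reasons from specific dimensions to overall quality. A minimal sketch of that idea follows. The five level names, the equal-width binning over [0, 1], the dimension names, and both function names are hypothetical; the abstract specifies none of them.

```python
# Hypothetical level names; the source only says "text-defined levels".
LEVELS = ["bad", "poor", "fair", "good", "excellent"]


def score_to_level(score, levels=LEVELS):
    """Map a score in [0, 1] to one of len(levels) equal-width bins."""
    if not 0.0 <= score <= 1.0:
        raise ValueError("score must lie in [0, 1]")
    # The top edge (score == 1.0) is clamped into the last bin.
    index = min(int(score * len(levels)), len(levels) - 1)
    return levels[index]


def build_response(dimension_scores, overall_score):
    """Describe each dimension first, then the overall quality."""
    parts = [
        f"The {name} is {score_to_level(score)}."
        for name, score in dimension_scores.items()
    ]
    parts.append(f"Overall, the video quality is {score_to_level(overall_score)}.")
    return " ".join(parts)


# Hypothetical dimension names and scores for one unlabeled video.
response = build_response({"sharpness": 0.85, "exposure": 0.30}, 0.62)
print(response)
```

The dimension-first ordering is a loose stand-in for the hierarchical Chain-of-Thought the abstract describes; the real pipeline presumably produces richer text than this template.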

[74] HumanOmniV2: From Understanding to Omni-Modal Reasoning with Context

Qize Yang,Shimin Yao,Weixuan Chen,Shenghao Fu,Detao Bai,Jiaxing Zhao,Boyuan Sun,Bowen Yin,Xihan Wei,Jingren Zhou

Main category: cs.CV

TL;DR: This paper proposes a method to enhance multimodal reasoning in large language models by addressing global context understanding and shortcut problems using reinforcement learning techniques and introduces a new benchmark called IntentBench.

Details Motivation: To address challenges in adapting Reinforcement Learning (RL) to multimodal data and improve the global context understanding and shortcut problems in existing multimodal reasoning models. Method: Implementing a context reward judged by an LLM, format and accuracy rewards, and logical reward assessment. Introduced a reasoning omni-modal benchmark called IntentBench. Result: Improved complex reasoning capability and accurate interpretation of multimodal context information in multimodal large language models. Conclusion: The proposed method demonstrates advanced performance across multiple omni-modal benchmarks compared to other open-source models. Abstract: With the rapid evolution of multimodal large language models, the capacity to deeply understand and interpret human intentions has emerged as a critical capability, which demands detailed and thoughtful reasoning. In recent studies, Reinforcement Learning (RL) has demonstrated potential in enhancing the reasoning capabilities of Large Language Models (LLMs). Nonetheless, the challenges associated with adapting RL to multimodal data and formats remain largely unaddressed. In this paper, we identify two issues in existing multimodal reasoning models: insufficient global context understanding and shortcut problems. Insufficient context understanding can happen when a model misinterprets multimodal context, resulting in incorrect answers. The shortcut problem occurs when the model overlooks crucial clues in multimodal inputs, directly addressing the query without considering the multimodal information. To tackle these issues, we emphasize the necessity for the model to reason with a clear understanding of the global context within multimodal inputs. This global context understanding can effectively prevent the model from overlooking key multimodal cues and ensure a thorough reasoning process. 
To ensure the accurate interpretation of multimodal context information, we implement a context reward judged by a large language model, alongside format and accuracy rewards. Additionally, to improve complex reasoning capability, we employ the LLM to assess the logical reward, determining whether the reasoning process successfully integrates multimodal information with logical methods. We also introduce a reasoning omni-modal benchmark, IntentBench, aimed at evaluating models in understanding complex human intentions and emotions. Our proposed method demonstrates advanced performance across multiple omni-modal benchmarks compared to other open-source omni-modal models.

[75] FedSC: Federated Learning with Semantic-Aware Collaboration

Huan Wang,Haoran Li,Huaming Chen,Jun Yan,Jiahua Shi,Jun Shen

Main category: cs.CV

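TL;DR: This paper proposes FedSC, a new federated learning framework that uses semantic-level prototype collaboration to address data heterogeneity, improving model performance and guaranteeing convergence.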

Details Motivation: Existing federated learning methods usually overlook the semantic information within each client, which may matter for handling data heterogeneity. Method: Proposes FedSC, a new federated learning method that constructs relational prototypes and consistent prototypes to capture client-specific and class-relevant knowledge. Result: Experiments show that FedSC performs well across a range of challenging scenarios, validating its effectiveness and the efficiency of its key components. Conclusion: FedSC effectively addresses data heterogeneity in federated learning through semantic-aware collaborative learning and provides a convergence guarantee. Abstract: Federated learning (FL) aims to train models collaboratively across clients without sharing data, in order to preserve privacy. However, one major challenge is the data heterogeneity issue, which refers to the biased labeling preferences at multiple clients. A number of existing FL methods attempt to tackle data heterogeneity locally (e.g., regularizing local models) or globally (e.g., fine-tuning global model), often neglecting inherent semantic information contained in each client. To explore the possibility of using intra-client semantically meaningful knowledge in handling data heterogeneity, in this paper, we propose Federated Learning with Semantic-Aware Collaboration (FedSC) to capture client-specific and class-relevant knowledge across heterogeneous clients. The core idea of FedSC is to construct relational prototypes and consistent prototypes at the semantic level, aiming to provide fruitful class underlying knowledge and stable convergence signals in a prototype-wise collaborative way. On the one hand, FedSC introduces an inter-contrastive learning strategy to bring instance-level embeddings closer to relational prototypes with the same semantics and away from distinct classes. On the other hand, FedSC devises consistent prototypes via a discrepancy aggregation manner, as a regularization penalty to constrain the optimization region of the local model. Moreover, a theoretical analysis for FedSC is provided to ensure a convergence guarantee. Experimental results on various challenging scenarios demonstrate the effectiveness of FedSC and the efficiency of crucial components.
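The abstract describes pulling instance-level embeddings toward the prototype of the same class and away from prototypes of other classes. One common way to express this is an InfoNCE-style loss over class prototypes, sketched below. This is a generic formulation and not FedSC's actual objective; the cosine similarity, the temperature value, and the function names are assumptions.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("vectors must be non-zero")
    return dot / (norm_u * norm_v)


def prototype_contrastive_loss(embedding, label, prototypes, temperature=0.5):
    """-log softmax over prototype similarities, evaluated at the true class.

    `prototypes` maps class id -> prototype vector. The temperature is a
    hypothetical hyperparameter, not a value from the paper.
    """
    logits = {c: cosine(embedding, p) / temperature for c, p in prototypes.items()}
    # Log-sum-exp with max subtraction for numerical stability.
    max_logit = max(logits.values())
    log_denominator = max_logit + math.log(
        sum(math.exp(value - max_logit) for value in logits.values())
    )
    return log_denominator - logits[label]


prototypes = {0: [1.0, 0.0], 1: [0.0, 1.0]}
embedding = [0.9, 0.1]  # lies close to the class-0 prototype
print(prototype_contrastive_loss(embedding, 0, prototypes))  # small loss
print(prototype_contrastive_loss(embedding, 1, prototypes))  # larger loss
```

Minimizing this loss raises the similarity to the matching prototype relative to all others, which is the behavior the abstract attributes to the inter-contrastive strategy.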

[76] Logios : An open source Greek Polytonic Optical Character Recognition system

Perifanos Konstantinos,Goutsos Dionisis

Main category: cs.CV

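TL;DR: This paper introduces a new deep-learning-based OCR system designed to improve the accuracy and efficiency of recognizing Greek polytonic text, and releases it as open source for academic use.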

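Details Motivation: Traditional OCR methods have limitations when handling Greek polytonic text, so a more accurate and efficient solution is needed. Method: Uses convolutional layers for feature extraction and recurrent layers for sequence learning to address the unique challenges posed by Greek polytonic scripts. Result: An OCR system dedicated to Greek polytonic text was developed, significantly improving recognition accuracy and processing efficiency. Conclusion: The paper presents an OCR system specifically for Greek polytonic text that combines the strengths of convolutional and recurrent layers to improve on the accuracy and efficiency of traditional OCR methods, releases the model as an open-source library, and provides an OCR platform for academic use. Abstract: In this paper, we present an Optical Character Recognition (OCR) system specifically designed for the accurate recognition and digitization of Greek polytonic texts. By leveraging the combined strengths of convolutional layers for feature extraction and recurrent layers for sequence learning, our system addresses the unique challenges posed by Greek polytonic scripts. This approach aims to overcome the limitations of traditional OCR methods, offering significant improvements in accuracy and efficiency. We release the underlying model as an open-source library and make our OCR platform available for academic use.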

[77] HybridQ: Hybrid Classical-Quantum Generative Adversarial Network for Skin Disease Image Generation

Qingyue Jiao,Kangyu Zheng,Yiyu Shi,Zhiding Liang

Main category: cs.CV

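TL;DR: This study proposes a new generative adversarial network that combines classical and quantum computing and generates high-quality color medical images, which is significant for data augmentation in skin disease detection.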

Details Motivation: Training effective machine learning models for skin disease detection requires large amounts of high-quality data, but existing datasets suffer from class imbalance, privacy concerns, and object bias, so data augmentation is needed. Classical generative models are computationally expensive and slow to train; quantum computing offers a promising alternative, but existing quantum image generation methods produce only low-quality grayscale images. Method: Proposes a new classical-quantum latent space fusion technique and builds the first classical-quantum generative adversarial network (GAN) capable of generating color medical images. Result: The model outperforms classical deep convolutional GANs and existing hybrid classical-quantum GANs in image generation quality and in the classification performance boost, and achieves a boost comparable to state-of-the-art classical generative models with over 25 times fewer parameters and 10 times fewer training epochs. It also shows robust performance under hardware noise on a real IBM quantum computer. Conclusion: Quantum image generation shows great potential for data augmentation in skin disease detection, with promising prospects as quantum hardware advances. Abstract: Machine learning-assisted diagnosis is gaining traction in skin disease detection, but training effective models requires large amounts of high-quality data. Skin disease datasets often suffer from class imbalance, privacy concerns, and object bias, making data augmentation essential. While classical generative models are widely used, they demand extensive computational resources and lengthy training time. Quantum computing offers a promising alternative, but existing quantum-based image generation methods can only yield grayscale low-quality images. Through a novel classical-quantum latent space fusion technique, our work overcomes this limitation and introduces the first classical-quantum generative adversarial network (GAN) capable of generating color medical images. Our model outperforms classical deep convolutional GANs and existing hybrid classical-quantum GANs in both image generation quality and classification performance boost when used as data augmentation. Moreover, the performance boost is comparable with that achieved using state-of-the-art classical generative models, yet with over 25 times fewer parameters and 10 times fewer training epochs. Such results suggest a promising future for quantum image generation as quantum hardware advances. Finally, we demonstrate the robust performance of our model on a real IBM quantum machine with hardware noise.

[78] HalluSegBench: Counterfactual Visual Reasoning for Segmentation Hallucination Evaluation

Xinzhuo Li,Adheesh Juvekar,Xingyou Liu,Muntasir Wahed,Kiet A. Nguyen,Ismini Lourentzou

Main category: cs.CV

TL;DR: HalluSegBench is introduced as the first benchmark to assess hallucinations in vision-language segmentation by using counterfactual visual reasoning, revealing that current models often misidentify objects due to vision-related errors rather than labeling mistakes.

Details Motivation: Current evaluation protocols for segmentation hallucination do not adequately address critical failures, especially regarding visual context manipulation. This research aims to provide a more effective diagnostic tool for grounded visual understanding. Method: The researchers introduced HalluSegBench, a benchmark with a dataset of counterfactual instance pairs and new metrics to evaluate hallucinations through visually coherent scene edits. Result: Experiments using HalluSegBench showed that state-of-the-art models frequently produce false segmentations, indicating the need for better grounding mechanisms. Conclusion: The study concludes that vision-driven hallucinations are more common than label-driven ones in vision-language segmentation models, emphasizing the importance of counterfactual reasoning for diagnosing grounding fidelity. Abstract: Recent progress in vision-language segmentation has significantly advanced grounded visual understanding. However, these models often exhibit hallucinations by producing segmentation masks for objects not grounded in the image content or by incorrectly labeling irrelevant regions. Existing evaluation protocols for segmentation hallucination primarily focus on label or textual hallucinations without manipulating the visual context, limiting their capacity to diagnose critical failures. In response, we introduce HalluSegBench, the first benchmark specifically designed to evaluate hallucinations in visual grounding through the lens of counterfactual visual reasoning. Our benchmark consists of a novel dataset of 1340 counterfactual instance pairs spanning 281 unique object classes, and a set of newly introduced metrics that quantify hallucination sensitivity under visually coherent scene edits. 
Experiments on HalluSegBench with state-of-the-art vision-language segmentation models reveal that vision-driven hallucinations are significantly more prevalent than label-driven ones, with models often persisting in false segmentation, highlighting the need for counterfactual reasoning to diagnose grounding fidelity.

[79] Multimodal Prompt Alignment for Facial Expression Recognition

Fuyan Ma,Yiran He,Bin Sun,Shutao Li

Main category: cs.CV

TL;DR: This paper proposes MPA-FER, a multimodal prompt alignment framework for facial expression recognition that leverages large language models and improves text-visual feature alignment, achieving state-of-the-art results on benchmark datasets.

Details Motivation: Current VLM-based facial expression recognition (FER) methods struggle to capture fine-grained textual-visual relationships, which are essential for distinguishing subtle differences between facial expressions. Method: MPA-FER introduces a multi-granularity hard prompt generation strategy using an LLM, aligns soft prompts with generated hard prompts, employs prototype-guided visual feature alignment, and implements a cross-modal global-local alignment module. Result: Extensive experiments demonstrate that the MPA-FER framework achieves superior performance on three FER benchmark datasets, highlighting its effectiveness in capturing fine-grained textual-visual relationships. Conclusion: The proposed MPA-FER framework outperforms state-of-the-art methods on three FER benchmark datasets while maintaining the benefits of the pretrained model and minimizing computational costs. Abstract: Prompt learning has been widely adopted to efficiently adapt vision-language models (VLMs) like CLIP for various downstream tasks. Despite their success, current VLM-based facial expression recognition (FER) methods struggle to capture fine-grained textual-visual relationships, which are essential for distinguishing subtle differences between facial expressions. To address this challenge, we propose a multimodal prompt alignment framework for FER, called MPA-FER, that provides fine-grained semantic guidance to the learning process of prompted visual features, resulting in more precise and interpretable representations. Specifically, we introduce a multi-granularity hard prompt generation strategy that utilizes a large language model (LLM) like ChatGPT to generate detailed descriptions for each facial expression. The LLM-based external knowledge is injected into the soft prompts by minimizing the feature discrepancy between the soft prompts and the hard prompts. 
To preserve the generalization abilities of the pretrained CLIP model, our approach incorporates prototype-guided visual feature alignment, ensuring that the prompted visual features from the frozen image encoder align closely with class-specific prototypes. Additionally, we propose a cross-modal global-local alignment module that focuses on expression-relevant facial features, further improving the alignment between textual and visual features. Extensive experiments demonstrate our framework outperforms state-of-the-art methods on three FER benchmark datasets, while retaining the benefits of the pretrained model and minimizing computational costs.

[80] LASFNet: A Lightweight Attention-Guided Self-Modulation Feature Fusion Network for Multimodal Object Detection

Lei Hao,Lina Xu,Chang Liu,Yanni Dong

Main category: cs.CV

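TL;DR: This paper proposes LASFNet, an efficient and lightweight multimodal object detection method that significantly reduces computational cost and improves detection accuracy.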

Details Motivation: To simplify the complex training process and reduce the computational overhead of multimodal object detection. Method: Proposes a lightweight attention-guided self-modulation feature fusion network (LASFNet) built on a single feature-level fusion unit, with an ASFF module and a FATM module. Result: Compared with existing methods, LASFNet reduces the number of parameters by up to 90% and computational cost by up to 85%, while improving detection accuracy (mAP) by 1%-3%. Conclusion: LASFNet achieves efficient multimodal object detection with a favorable trade-off between accuracy and computational cost, validated by experiments on three representative datasets. Abstract: Effective deep feature extraction via feature-level fusion is crucial for multimodal object detection. However, previous studies often involve complex training processes that integrate modality-specific features by stacking multiple feature-level fusion units, leading to significant computational overhead. To address this issue, we propose a new fusion detection baseline that uses a single feature-level fusion unit to enable high-performance detection, thereby simplifying the training process. Based on this approach, we propose a lightweight attention-guided self-modulation feature fusion network (LASFNet), which introduces a novel attention-guided self-modulation feature fusion (ASFF) module that adaptively adjusts the responses of fusion features at both global and local levels based on attention information from different modalities, thereby promoting comprehensive and enriched feature generation. Additionally, a lightweight feature attention transformation module (FATM) is designed at the neck of LASFNet to enhance the focus on fused features and minimize information loss. Extensive experiments on three representative datasets demonstrate that, compared to state-of-the-art methods, our approach achieves a favorable efficiency-accuracy trade-off, reducing the number of parameters and computational cost by as much as 90% and 85%, respectively, while improving detection accuracy (mAP) by 1%-3%. The code will be open-sourced at https://github.com/leileilei2000/LASFNet.

[81] Instella-T2I: Pushing the Limits of 1D Discrete Latent Space Image Generation

Ze Wang,Hao Chen,Benran Hu,Jiang Liu,Ximeng Sun,Jialian Wu,Yusheng Su,Xiaodong Yu,Emad Barsoum,Zicheng Liu

Main category: cs.CV

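TL;DR: This paper proposes a new 1D binary image latent representation that improves the efficiency and speed of image generation and greatly reduces the number of tokens required compared with conventional methods.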

Details Motivation: The motivation is to reduce the computational demands of modeling high-resolution images and improve the efficiency of image and multimodal understanding and generation. Method: The paper uses a 1D binary image latent representation that represents each image as a sequence of binary vectors rather than traditional one-hot codebook tokens, reducing the number of tokens required. Result: Using 128 discrete tokens for images up to 1024x1024, the approach achieves up to a 32-fold reduction in token count compared with standard VQ-VAEs and performs competitively with modern image generation models. Conclusion: The paper concludes that the 1D binary image latent space markedly improves the efficiency and speed of image generation, with efficient training and inference on a single GPU node at a global batch size of 4096. Abstract: Image tokenization plays a critical role in reducing the computational demands of modeling high-resolution images, significantly improving the efficiency of image and multimodal understanding and generation. Recent advances in 1D latent spaces have reduced the number of tokens required by eliminating the need for a 2D grid structure. In this paper, we further advance compact discrete image representation by introducing 1D binary image latents. By representing each image as a sequence of binary vectors, rather than using traditional one-hot codebook tokens, our approach preserves high-resolution details while maintaining the compactness of 1D latents. To the best of our knowledge, our text-to-image models are the first to achieve competitive performance in both diffusion and auto-regressive generation using just 128 discrete tokens for images up to 1024x1024, demonstrating up to a 32-fold reduction in token numbers compared to standard VQ-VAEs. The proposed 1D binary latent space, coupled with simple model architectures, achieves marked improvements in training and inference speed. Our text-to-image models allow for a global batch size of 4096 on a single GPU node with 8 AMD MI300X GPUs, and the training can be completed within 200 GPU days. Our models achieve competitive performance compared to modern image generation models without any in-house private training data or post-training refinements, offering a scalable and efficient alternative to conventional tokenization methods.
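The 32-fold figure in the abstract can be checked with simple arithmetic, provided one assumes the baseline VQ-VAE tokenizes on a 2D grid with 16x spatial downsampling. That factor is an assumption (the abstract does not state it), as are the codebook size and binary vector length used in the capacity comparison below.

```python
import math


def grid_token_count(resolution, downsample_factor):
    """Tokens produced by a 2D-grid tokenizer on a square image."""
    side = resolution // downsample_factor
    return side * side


# Assumed baseline: a VQ-VAE with 16x spatial downsampling on 1024x1024 input.
baseline_tokens = grid_token_count(1024, 16)   # 64 * 64 = 4096
compact_tokens = 128                           # figure stated in the abstract
reduction = baseline_tokens // compact_tokens
print(baseline_tokens, reduction)              # 4096 32


def bits_per_one_hot_token(codebook_size):
    """A one-hot token drawn from K codebook entries carries log2(K) bits."""
    return math.log2(codebook_size)


def bits_per_binary_token(vector_length):
    """A binary vector of length d carries d bits (one of 2**d patterns)."""
    return vector_length


# Hypothetical sizes, chosen only to illustrate the per-token difference.
print(bits_per_one_hot_token(16384))  # 14.0
print(bits_per_binary_token(64))      # 64
```

The second half illustrates why a binary vector can pack more information into a single token than a one-hot codebook index, which is one plausible reading of how 128 tokens can preserve high-resolution detail.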

[82] DidSee: Diffusion-Based Depth Completion for Material-Agnostic Robotic Perception and Manipulation

Wenzhou Lyu,Jialing Lin,Wenqi Ren,Ruihao Xia,Feng Qian,Yang Tang

Main category: cs.CV

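TL;DR: This paper proposes DidSee, a diffusion-based depth completion framework that targets the generalization and accuracy problems of depth completion for non-Lambertian objects.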

Details Motivation: Traditional depth completion methods struggle to generalize because of the limited diversity and scale of training data, existing diffusion models suffer from biases caused by training-inference mismatch in depth completion, and the lack of distinct visual features in non-Lambertian regions further hinders precise prediction. Method: DidSee integrates a rescaled noise scheduler to eliminate signal leakage bias, devises a noise-agnostic single-step training formulation to alleviate error accumulation, and introduces a semantic enhancer that enables joint optimization of depth completion and semantic segmentation. Result: DidSee achieves state-of-the-art performance on multiple benchmarks, demonstrates robust real-world generalization, and improves downstream tasks such as category-level pose estimation and robotic grasping. Conclusion: DidSee is a diffusion-based framework for depth completion on non-Lambertian objects that addresses the generalization and performance shortcomings of traditional methods and achieves state-of-the-art performance on multiple benchmarks. Abstract: Commercial RGB-D cameras often produce noisy, incomplete depth maps for non-Lambertian objects. Traditional depth completion methods struggle to generalize due to the limited diversity and scale of training data. Recent advances exploit visual priors from pre-trained text-to-image diffusion models to enhance generalization in dense prediction tasks. However, we find that biases arising from training-inference mismatches in the vanilla diffusion framework significantly impair depth completion performance. Additionally, the lack of distinct visual features in non-Lambertian regions further hinders precise prediction. To address these issues, we propose DidSee, a diffusion-based framework for depth completion on non-Lambertian objects. First, we integrate a rescaled noise scheduler enforcing a zero terminal signal-to-noise ratio to eliminate signal leakage bias. Second, we devise a noise-agnostic single-step training formulation to alleviate error accumulation caused by exposure bias and optimize the model with a task-specific loss. Finally, we incorporate a semantic enhancer that enables joint depth completion and semantic segmentation, distinguishing objects from backgrounds and yielding precise, fine-grained depth maps. DidSee achieves state-of-the-art performance on multiple benchmarks, demonstrates robust real-world generalization, and effectively improves downstream tasks such as category-level pose estimation and robotic grasping. Project page: https://wenzhoulyu.github.io/DidSee/
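The abstract mentions a rescaled noise scheduler that enforces a zero terminal signal-to-noise ratio. A widely used way to do this is to shift and scale the square root of the cumulative alpha product so that its final value is exactly zero while its first value is unchanged. The sketch below implements that generic rescaling; whether DidSee uses exactly this procedure is an assumption, and the linear beta schedule and its endpoints are illustrative defaults.

```python
import math


def linear_betas(num_steps, beta_start=1e-4, beta_end=0.02):
    """A plain linear beta schedule (illustrative default values)."""
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]


def rescale_zero_terminal_snr(betas):
    """Rescale betas so the cumulative alpha product reaches 0 at the last step."""
    sqrt_alpha_bar = []
    running = 1.0
    for beta in betas:
        running *= 1.0 - beta
        sqrt_alpha_bar.append(math.sqrt(running))

    first, last = sqrt_alpha_bar[0], sqrt_alpha_bar[-1]
    # Shift so the last value is zero, then scale so the first value is kept.
    shifted = [(s - last) * first / (first - last) for s in sqrt_alpha_bar]

    alpha_bar = [s * s for s in shifted]
    # Recover per-step alphas from ratios of consecutive cumulative products.
    alphas = [alpha_bar[0]] + [
        alpha_bar[i] / alpha_bar[i - 1] for i in range(1, len(alpha_bar))
    ]
    return [1.0 - a for a in alphas]


betas = linear_betas(1000)
rescaled = rescale_zero_terminal_snr(betas)
print(rescaled[-1])  # 1.0, so the final step is pure noise
```

With SNR defined as alpha_bar / (1 - alpha_bar), a final cumulative product of zero gives a terminal SNR of zero, so no trace of the clean signal leaks into the last training timestep.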

[83] Boosting Domain Generalized and Adaptive Detection with Diffusion Models: Fitness, Generalization, and Transferability

Boyong He,Yuxiang Ji,Zhuoyue Tan,Liaoni Wu

Main category: cs.CV

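TL;DR: This paper proposes an efficient diffusion-model-based method for cross-domain detection that reduces inference time and improves domain generalization and adaptation performance.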

Details Motivation: Existing methods that apply diffusion models to domain generalization and adaptation still incur large inference costs and do not fully exploit the capabilities of diffusion models. Method: Extracts intermediate features from a single-step diffusion process and combines them with an object-centered auxiliary branch and a consistency loss to improve performance and reduce inference time. In addition, within a unified framework, alignment on source and target domains is used to guide standard detectors. Result: The method achieves competitive results on 3 domain adaptation benchmarks and 5 domain generalization benchmarks, and shows its advantages on the COCO generalization benchmark, with notable efficiency under large domain shifts and in low-data scenarios. Conclusion: The paper proposes a new diffusion-model-based method that addresses the detector performance drop caused by the domain gap between training and testing data and performs well on domain generalization and adaptation tasks. Abstract: Detectors often suffer a performance drop due to the domain gap between training and testing data. Recent methods explore diffusion models applied to domain generalization (DG) and adaptation (DA) tasks, but still struggle with large inference costs and have not yet fully leveraged the capabilities of diffusion models. We propose to tackle these problems by extracting intermediate features from a single-step diffusion process, improving feature collection and fusion to reduce inference time by 75% while enhancing performance on source domains (i.e., Fitness). Then, we construct an object-centered auxiliary branch by applying box-masked images with class prompts to extract robust and domain-invariant features that focus on objects. We also apply consistency loss to align the auxiliary and ordinary branch, balancing fitness and generalization while preventing overfitting and improving performance on target domains (i.e., Generalization). Furthermore, within a unified framework, standard detectors are guided by diffusion detectors through feature-level and object-level alignment on source domains (for DG) and unlabeled target domains (for DA), thereby improving cross-domain detection performance (i.e., Transferability). Our method achieves competitive results on 3 DA benchmarks and 5 DG benchmarks. Additionally, experiments on COCO generalization benchmark demonstrate that our method maintains significant advantages and shows remarkable efficiency in large domain shifts and low-data scenarios.
Our work shows the superiority of applying diffusion models to domain generalized and adaptive detection tasks and offers valuable insights for visual perception tasks across diverse domains. The code is available at https://github.com/heboyong/Fitness-Generalization-Transferability.

[84] Improving Diffusion-Based Image Editing Faithfulness via Guidance and Scheduling

Hansam Cho,Seoung Bum Kim

Main category: cs.CV

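TL;DR: This paper proposes Faithfulness Guidance and Scheduling (FGS), a method that addresses the trade-off between editability and faithfulness in image editing and improves the quality and precision of edits.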

Details Motivation: In image editing, the two crucial aspects are editability and faithfulness, and achieving optimal results is challenging because of the inherent trade-off between them. Method: Proposes Faithfulness Guidance and Scheduling (FGS), which incorporates faithfulness guidance to strengthen the preservation of input image information and introduces a scheduling strategy to resolve misalignment between editability and faithfulness. Result: Experiments show that FGS achieves superior faithfulness while maintaining editability. Conclusion: FGS achieves superior faithfulness while maintaining editability, and its compatibility with various editing methods enables precise, high-quality image edits across diverse tasks. Abstract: Text-guided diffusion models have become essential for high-quality image synthesis, enabling dynamic image editing. In image editing, two crucial aspects are editability, which determines the extent of modification, and faithfulness, which reflects how well unaltered elements are preserved. However, achieving optimal results is challenging because of the inherent trade-off between editability and faithfulness. To address this, we propose Faithfulness Guidance and Scheduling (FGS), which enhances faithfulness with minimal impact on editability. FGS incorporates faithfulness guidance to strengthen the preservation of input image information and introduces a scheduling strategy to resolve misalignment between editability and faithfulness. Experimental results demonstrate that FGS achieves superior faithfulness while maintaining editability. Moreover, its compatibility with various editing methods enables precise, high-quality image edits across diverse tasks.

[85] Boosting Generative Adversarial Transferability with Self-supervised Vision Transformer Features

Shangbo Wu,Yu-an Tan,Ruinan Ma,Wencong Ma,Dehua Zhu,Yuanzhang Li

Main category: cs.CV

TL;DR: This paper introduces dSVA, an adversarial attack method leveraging self-supervised Vision Transformers to improve black-box transferability by combining global and local features.

Details Motivation: The motivation stems from the synergy between self-supervised learning and the Transformer architecture, aiming to enhance adversarial transferability by leveraging both global and local features extracted through self-supervised methods. Method: The paper presents dSVA, a generative dual self-supervised ViT features attack, which incorporates a generator to create black-box adversarial examples. It utilizes joint features and the attention mechanism of self-supervised ViTs for training. Result: The results demonstrate that the combination of contrastive learning and masked image modeling enables Vision Transformers to attend to distinct feature tendencies, leading to enhanced adversarial generalizability and outperforming state-of-the-art models in black-box transferability. Conclusion: The paper concludes that by exploiting self-supervised Vision Transformer (ViT) representations, particularly using both global structural features from contrastive learning and local textural features from masked image modeling, adversarial transferability can be significantly improved. Abstract: The ability of deep neural networks (DNNs) comes from extracting and interpreting features from the data provided. By exploiting intermediate features in DNNs instead of relying on hard labels, we craft adversarial perturbations that generalize more effectively, boosting black-box transferability. These features ubiquitously come from supervised learning in previous work. Inspired by the exceptional synergy between self-supervised learning and the Transformer architecture, this paper explores whether exploiting self-supervised Vision Transformer (ViT) representations can improve adversarial transferability. We present dSVA -- a generative dual self-supervised ViT features attack, that exploits both global structural features from contrastive learning (CL) and local textural features from masked image modeling (MIM), the self-supervised learning paradigm duo for ViTs.
We design a novel generative training framework that incorporates a generator to create black-box adversarial examples, and strategies to train the generator by exploiting joint features and the attention mechanism of self-supervised ViTs. Our findings show that CL and MIM enable ViTs to attend to distinct feature tendencies, which, when exploited in tandem, boast great adversarial generalizability. By disrupting dual deep features distilled by self-supervised ViTs, we are rewarded with remarkable black-box transferability to models of various architectures, outperforming state-of-the-art methods. Code is available at https://github.com/spencerwooo/dSVA.

[86] Class-Agnostic Region-of-Interest Matching in Document Images

Demin Zhang,Jiahao Lyu,Zhijie Shen,Yu Zhou

Main category: cs.CV

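TL;DR: This paper proposes a new task, "RoI-Matching", which aims to match document regions in a flexible, efficient, multi-granularity, and open-set manner, enabling user-customized applications.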

Details Motivation: Existing document analysis solutions only suit fixed category definitions and granularities and cannot support flexible, user-customized applications. The paper therefore proposes the RoI-Matching task. Method: The authors build a new framework, RoI-Matcher, which uses a siamese network to extract multi-level features in both the reference and target domains, and cross-attention layers to integrate and align similar semantics across the two domains. Result: Experiments show that the proposed method performs well on the RoI-Matching-Bench benchmark with a simple procedure, providing a baseline for follow-up research. Conclusion: The RoI-Matching task and the RoI-Matcher framework enable flexible and efficient document region matching and lay a foundation for future research. Abstract: Document understanding and analysis have received a lot of attention due to their widespread application. However, existing document analysis solutions, such as document layout analysis and key information extraction, are only suitable for fixed category definitions and granularities, and cannot achieve flexible applications customized by users. Therefore, this paper defines a new task named "Class-Agnostic Region-of-Interest Matching" ("RoI-Matching" for short), which aims to match the customized regions in a flexible, efficient, multi-granularity, and open-set manner. The visual prompt of the reference document and the target document images are fed into our model, while the output is the corresponding bounding boxes in the target document images. To meet the above requirements, we construct a benchmark, RoI-Matching-Bench, which sets three levels of difficulty following real-world conditions, and propose macro and micro metrics for evaluation. Furthermore, we also propose a new framework, RoI-Matcher, which employs a siamese network to extract multi-level features both in the reference and target domains, and cross-attention layers to integrate and align similar semantics in different domains. Experiments show that our method with a simple procedure is effective on RoI-Matching-Bench, and serves as the baseline for further research. The code is available at https://github.com/pd162/RoI-Matching.

[87] SAMURAI: Shape-Aware Multimodal Retrieval for 3D Object Identification

Dinh-Khoi Vo,Van-Loc Nguyen,Minh-Triet Tran,Trung-Nghia Le

Main category: cs.CV

TL;DR: This paper proposes SAMURAI, a Shape-Aware Multimodal Retrieval method that combines language understanding and shape priors to accurately identify 3D objects in complex indoor environments using masked 2D images and natural language descriptions.

Details Motivation: Retrieving 3D objects using only a masked 2D image and natural language description is challenging due to distorted viewpoints, textureless masked regions, ambiguous language prompts, and noisy segmentation masks. These issues are exacerbated in the ROOMELSA challenge due to limited access to full 3D scene context. Method: SAMURAI integrates CLIP-based semantic matching with shape-guided re-ranking derived from binary silhouettes of masked regions and uses a majority voting strategy. A preprocessing pipeline enhances mask quality by extracting the largest connected component and removing background noise. Result: The hybrid retrieval framework achieves competitive performance on the ROOMELSA private test set, demonstrating the effectiveness of leveraging both language and shape cues for 3D object identification. Conclusion: The proposed SAMURAI framework effectively addresses the challenges of retrieving 3D objects in complex indoor environments by combining shape priors with language understanding, highlighting its robustness for open-world 3D object retrieval. Abstract: Retrieving 3D objects in complex indoor environments using only a masked 2D image and a natural language description presents significant challenges. The ROOMELSA challenge limits access to full 3D scene context, complicating reasoning about object appearance, geometry, and semantics. These challenges are intensified by distorted viewpoints, textureless masked regions, ambiguous language prompts, and noisy segmentation masks. To address this, we propose SAMURAI: Shape-Aware Multimodal Retrieval for 3D Object Identification. SAMURAI integrates CLIP-based semantic matching with shape-guided re-ranking derived from binary silhouettes of masked regions, alongside a robust majority voting strategy. A dedicated preprocessing pipeline enhances mask quality by extracting the largest connected component and removing background noise. 
Our hybrid retrieval framework leverages both language and shape cues, achieving competitive performance on the ROOMELSA private test set. These results highlight the importance of combining shape priors with language understanding for robust open-world 3D object retrieval.
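
The mask-cleaning step mentioned in the SAMURAI abstract (keeping the largest connected component of a noisy mask) can be illustrated with a minimal pure-Python sketch. This is not the authors' implementation: the function name `largest_connected_component`, the nested-list mask format, and the use of 4-connectivity are assumptions made purely for illustration.

```python
from collections import deque


def largest_connected_component(mask):
    """Return a copy of a binary mask that keeps only its largest
    4-connected foreground region (hypothetical helper, for illustration)."""
    rows = len(mask)
    cols = len(mask[0]) if rows else 0
    seen = [[False] * cols for _ in range(rows)]
    best = []  # pixel coordinates of the largest region found so far

    for r in range(rows):
        for c in range(cols):
            if not mask[r][c] or seen[r][c]:
                continue
            # Breadth-first flood fill from an unvisited foreground pixel.
            component = []
            queue = deque([(r, c)])
            seen[r][c] = True
            while queue:
                y, x = queue.popleft()
                component.append((y, x))
                for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                    ny, nx = y + dy, x + dx
                    if (0 <= ny < rows and 0 <= nx < cols
                            and mask[ny][nx] and not seen[ny][nx]):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            if len(component) > len(best):
                best = component

    # Rebuild a clean mask containing only the largest region.
    cleaned = [[0] * cols for _ in range(rows)]
    for y, x in best:
        cleaned[y][x] = 1
    return cleaned


noisy = [
    [1, 1, 0, 0],
    [1, 0, 0, 1],
    [0, 0, 0, 0],
    [0, 1, 0, 0],
]
cleaned = largest_connected_component(noisy)
```

In this toy mask the two isolated pixels are dropped and only the three-pixel region in the top-left corner survives. In practice a library routine for connected-component labeling would likely be used instead of a hand-written flood fill.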

[88] PoseMaster: Generating 3D Characters in Arbitrary Poses from a Single Image

Hongyu Yan,Kunming Luo,Weiyu Li,Yixun Liang,Shengming Li,Jingwei Huang,Chunchao Guo,Ping Tan

Main category: cs.CV

TL;DR: PoseMaster is an end-to-end controllable 3D character generation framework that overcomes limitations in existing image-based methods, providing superior quality and precise arbitrary-pose control in 3D character generation.

Details Motivation: To improve the efficiency of 3D character modeling by overcoming issues like distorted images caused by self-occlusion and viewpoint problems in current image-based methods. Method: PoseMaster uses a flow-based 3D native generation framework that unifies pose transformation and 3D character generation. It leverages the 3D body bones as the pose condition and employs random emptying of conditions during training to enhance generalizability. Result: Extensive experiments show that PoseMaster outperforms state-of-the-art techniques in both qualitative and quantitative evaluations for A-pose character generation. Conclusion: PoseMaster demonstrates a superior ability to achieve precise control for arbitrary poses in 3D character generation. Abstract: 3D characters play a crucial role in our daily entertainment. To improve the efficiency of 3D character modeling, recent image-based methods use two separate models to achieve pose standardization and 3D reconstruction of the A-pose character. However, these methods are prone to generating distorted and degraded images in the pose standardization stage due to self-occlusion and viewpoints, which further affects the geometric quality of the subsequent reconstruction process. To tackle these problems, we propose PoseMaster, an end-to-end controllable 3D character generation framework. Specifically, we unify pose transformation and 3D character generation into a flow-based 3D native generation framework. To achieve accurate arbitrary-pose control, we propose to leverage the 3D body bones existing in the skeleton of an animatable character as the pose condition. Furthermore, considering the specificity of multi-condition control, we randomly empty the pose condition and the image condition during training to improve the effectiveness and generalizability of pose control. 
Finally, we create a high-quality pose-control dataset derived from realistic character animation data to make the model learn the implicit relationships between skeleton and skinning weights. Extensive experiments show that PoseMaster outperforms current state-of-the-art techniques in both qualitative and quantitative evaluations for A-pose character generation while demonstrating its powerful ability to achieve precise control for arbitrary poses.
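
The random emptying of the pose and image conditions described in the PoseMaster abstract can be sketched as a small training-time helper. This is only an illustration under stated assumptions: the function name `drop_conditions`, the use of `None` to represent an "empty" condition, the independent per-condition draws, and the default probabilities of 0.1 are all hypothetical and are not taken from the paper.

```python
import random


def drop_conditions(pose_cond, image_cond, p_pose=0.1, p_image=0.1, rng=None):
    """Independently replace each condition with None ("empty") at random.

    Hypothetical helper: the drop probabilities and the None placeholder
    are illustrative assumptions, not values reported by the authors.
    """
    if rng is None:
        rng = random.Random()
    if rng.random() < p_pose:
        pose_cond = None   # train without the skeleton condition
    if rng.random() < p_image:
        image_cond = None  # train without the reference image condition
    return pose_cond, image_cond


# Deterministic corner cases: never drop, always drop, drop pose only.
kept = drop_conditions("bones", "image", p_pose=0.0, p_image=0.0)
dropped = drop_conditions("bones", "image", p_pose=1.0, p_image=1.0)
pose_only = drop_conditions("bones", "image", p_pose=1.0, p_image=0.0)
```

Calling such a helper once per training sample would expose the model to every combination of present and absent conditions, which is the general idea behind condition dropout in multi-condition generative models.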

[89] EgoAdapt: Adaptive Multisensory Distillation and Policy Learning for Efficient Egocentric Perception

Sanjoy Chowdhury,Subrata Biswas,Sayan Nag,Tushar Nagarajan,Calvin Murdock,Ishwarya Ananthabhotla,Yijun Qian,Vamsi Krishna Ithapu,Dinesh Manocha,Ruohan Gao

Main category: cs.CV

TL;DR: EgoAdapt improves efficiency for egocentric perception tasks with adaptive cross-modal distillation and policy learning.

Details Motivation: Modern perception models have high computational costs, which hinders their deployment in resource-constrained environments. Method: EgoAdapt uses cross-modal distillation and policy learning to adaptively optimize inference efficiency across different egocentric perception tasks. Result: EgoAdapt reduced GMACs by up to 89.09%, parameters by up to 82.02%, and energy consumption by up to 9.6x on three egocentric datasets while matching or exceeding state-of-the-art performance. Conclusion: EgoAdapt is an efficient framework for multisensory egocentric tasks that significantly reduces computational costs while maintaining or improving performance. Abstract: Modern perception models, particularly those designed for multisensory egocentric tasks, have achieved remarkable performance but often come with substantial computational costs. These high demands pose challenges for real-world deployment, especially in resource-constrained environments. In this paper, we introduce EgoAdapt, a framework that adaptively performs cross-modal distillation and policy learning to enable efficient inference across different egocentric perception tasks, including egocentric action recognition, active speaker localization, and behavior anticipation. Our proposed policy module is adaptable to task-specific action spaces, making it broadly applicable. Experimental results on three challenging egocentric datasets, EPIC-Kitchens, EasyCom, and Aria Everyday Activities, demonstrate that our method significantly enhances efficiency, reducing GMACs by up to 89.09%, parameters by up to 82.02%, and energy by up to 9.6x, while remaining on par with, and in many cases outperforming, the corresponding state-of-the-art models.

[90] ESMStereo: Enhanced ShuffleMixer Disparity Upsampling for Real-Time and Accurate Stereo Matching

Mahmoud Tahmasebi,Saif Huq,Kevin Meehan,Marion McAfee

Main category: cs.CV

TL;DR: This paper proposes the Enhanced Shuffle Mixer (ESM) for stereo matching, combining high accuracy and real-time performance by efficiently restoring critical information in small-scale cost volumes.

Details Motivation: Accurate disparity estimation using large-scale cost volumes is computationally expensive and unsuitable for real-time applications, while small-scale cost volumes lack sufficient information for high accuracy. This work aims to bridge this gap by enhancing information retention and processing efficiency. Method: The proposed method introduces the Enhanced Shuffle Mixer (ESM), which restores critical details by integrating primary features into the disparity upsampling unit. Features from initial disparity estimation are fused with image features, shuffled and split across layers, and further refined through a compact feature-guided hourglass network for detailed scene geometry recovery. Result: The compact version of ESMStereo achieves 116 FPS on high-end GPUs and 91 FPS on the AGX Orin, demonstrating real-time performance while maintaining high accuracy in disparity map generation. Conclusion: The Enhanced Shuffle Mixer (ESM) addresses the challenge of achieving high accuracy and real-time performance in stereo matching by mitigating information loss in small-scale cost volumes, leading to highly accurate disparity maps at real-time speeds. Abstract: Stereo matching has become an increasingly important component of modern autonomous systems. Developing deep learning-based stereo matching models that deliver high accuracy while operating in real-time continues to be a major challenge in computer vision. In the domain of cost-volume-based stereo matching, accurate disparity estimation depends heavily on large-scale cost volumes. However, such large volumes store substantial redundant information and also require computationally intensive aggregation units for processing and regression, making real-time performance unattainable. Conversely, small-scale cost volumes followed by lightweight aggregation units provide a promising route for real-time performance, but lack sufficient information to ensure highly accurate disparity estimation. 
To address this challenge, we propose the Enhanced Shuffle Mixer (ESM) to mitigate information loss associated with small-scale cost volumes. ESM restores critical details by integrating primary features into the disparity upsampling unit. It quickly extracts features from the initial disparity estimation and fuses them with image features. These features are mixed by shuffling and layer splitting, then refined through a compact feature-guided hourglass network to recover more detailed scene geometry. The ESM focuses on local contextual connectivity with a large receptive field and low computational cost, leading to the reconstruction of a highly accurate disparity map in real time. The compact version of ESMStereo achieves an inference speed of 116 FPS on high-end GPUs and 91 FPS on the AGX Orin.

[91] OracleFusion: Assisting the Decipherment of Oracle Bone Script with Structurally Constrained Semantic Typography

Caoshuo Li,Zengmao Ding,Xiaobin Hu,Bang Li,Donghao Luo,AndyPian Wu,Chaoyang Wang,Chengjie Wang,Taisong Jin,SevenShu,Yunsheng Wu,Yongge Liu,Rongrong Ji

Main category: cs.CV

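TL;DR: This paper introduces OracleFusion, a new method that assists the decipherment of ancient Oracle Bone Script characters, and demonstrates its advantages in readability and aesthetic quality.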

Details Motivation: Oracle Bone Script is one of the earliest ancient languages, yet a large number of its characters remain undeciphered, which poses significant challenges for interpretation. Method: The paper proposes OracleFusion, a novel two-stage semantic typography framework that uses a Multimodal Large Language Model (MLLM) and Oracle Structural Vector Fusion (OSVF). Result: OracleFusion outperforms state-of-the-art baseline models in semantics, visual appeal, and glyph maintenance, significantly improving readability and aesthetic quality. Conclusion: OracleFusion is an effective tool that offers expert-like insights to support the decipherment of undeciphered oracle bone characters. Abstract: As one of the earliest ancient languages, Oracle Bone Script (OBS) encapsulates the cultural records and intellectual expressions of ancient civilizations. Despite the discovery of approximately 4,500 OBS characters, only about 1,600 have been deciphered. The remaining undeciphered ones, with their complex structure and abstract imagery, pose significant challenges for interpretation. To address these challenges, this paper proposes a novel two-stage semantic typography framework, named OracleFusion. In the first stage, this approach leverages the Multimodal Large Language Model (MLLM) with enhanced Spatial Awareness Reasoning (SAR) to analyze the glyph structure of the OBS character and perform visual localization of key components. In the second stage, we introduce Oracle Structural Vector Fusion (OSVF), incorporating glyph structure constraints and glyph maintenance constraints to ensure the accurate generation of semantically enriched vector fonts. This approach preserves the objective integrity of the glyph structure, offering visually enhanced representations that assist experts in deciphering OBS. Extensive qualitative and quantitative experiments demonstrate that OracleFusion outperforms state-of-the-art baseline models in terms of semantics, visual appeal, and glyph maintenance, significantly enhancing both readability and aesthetic quality. Furthermore, OracleFusion provides expert-like insights on unseen oracle characters, making it a valuable tool for advancing the decipherment of OBS.

[92] Pushing Trade-Off Boundaries: Compact yet Effective Remote Sensing Change Detection

Luosheng Xu,Dalin Zhang,Zhaohui Song

Main category: cs.CV

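TL;DR: This study proposes FlickCD, an efficient remote sensing change detection method that substantially reduces computational and storage requirements while maintaining high performance.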

Details Motivation: The growing complexity and computational demands of deep learning models have not yielded significant accuracy gains, so lightweight yet accurate models are needed to meet the requirements of on-satellite processing. Method: The paper proposes FlickCD, which comprises an Enhanced Difference Module (EDM) and Local-Global Fusion Blocks that combine Shifted Window Self-Attention (SWSA) and Enhanced Global Self-Attention (EGSA). Result: Experiments on four benchmark datasets show that FlickCD reduces computational and storage overheads by more than an order of magnitude while maintaining state-of-the-art performance or incurring only a minor accuracy loss (<1% F1). Conclusion: FlickCD achieves efficient remote sensing change detection, balancing performance against resource consumption. Abstract: Remote sensing change detection is essential for monitoring urban expansion, disaster assessment, and resource management, offering timely, accurate, and large-scale insights into dynamic landscape transformations. While deep learning has revolutionized change detection, the increasing complexity and computational demands of modern models have not necessarily translated into significant accuracy gains. Instead of following this trend, this study explores a more efficient approach, focusing on lightweight models that maintain high accuracy while minimizing resource consumption, which is an essential requirement for on-satellite processing. To this end, we propose FlickCD, which means quick flick then get great results, pushing the boundaries of the performance-resource trade-off. FlickCD introduces an Enhanced Difference Module (EDM) to amplify critical feature differences between temporal phases while suppressing irrelevant variations such as lighting and weather changes, thereby reducing computational costs in the subsequent change decoder. Additionally, the FlickCD decoder incorporates Local-Global Fusion Blocks, leveraging Shifted Window Self-Attention (SWSA) and Enhanced Global Self-Attention (EGSA) to efficiently capture semantic information at multiple scales, preserving both coarse- and fine-grained changes. Extensive experiments on four benchmark datasets demonstrate that FlickCD reduces computational and storage overheads by more than an order of magnitude while achieving state-of-the-art (SOTA) performance or incurring only a minor (<1% F1) accuracy trade-off.
The implementation code is publicly available at https://github.com/xulsh8/FlickCD.

[93] IPFormer-VideoLLM: Enhancing Multi-modal Video Understanding for Multi-shot Scenes

Yujia Liang,Jile Jiao,Zhicheng Wang,Xuetao Feng,Zixuan Ye,Yuan Wang,Hao Lu

Main category: cs.CV

TL;DR: This paper introduces MultiClip-Bench, a dataset for multi-shot video understanding, and proposes the IPFormer-VideoLLM model to address instance identity forgetting, both of which significantly enhance video understanding performance.

Details Motivation: Video Large Language Models (VideoLLMs) struggle with multi-shot scenarios due to the lack of multi-shot annotations in existing datasets. This limitation leads to issues like instance identity forgetting and key frame negligence. Method: The authors introduce a new dataset, MultiClip-Bench, with dense descriptions and instruction-based question-answering pairs for multi-shot scenarios. They also propose a new model, IPFormer-VideoLLM, which injects instance-level features through an attention-based connector to aggregate instance-specific information across scenes. Result: The experiments show that the training set boosts multi-shot performance and the testing benchmark reliably measures model capability. The proposed model enhances multi-scene video understanding and provides distinct advantages on different video benchmarks. Conclusion: The paper concludes that the proposed MultiClip-Bench dataset and IPFormer-VideoLLM model significantly improve multi-scene video understanding and offer advantages across various video benchmarks. Abstract: Video Large Language Models (VideoLLMs) have demonstrated remarkable understanding capabilities, but are found to struggle with multi-shot scenarios, e.g., video clips with varying camera angles or scene changes. This challenge can cause failures such as instance identity forgetting and key frame negligence. In this work, we first attribute the challenge to the lack of multi-shot annotations among existing datasets and therefore we introduce a new dataset termed MultiClip-Bench, featuring dense descriptions and instruction-based question-answering pairs tailored for multi-shot scenarios. We empirically find that the training set significantly boosts the multi-shot performance, while the testing benchmark provides a reliable measure of the model capability in multi-shot scenarios.
By further analyzing and discovering that current models only encode instance features in a discrete or lossy manner, at the risk of missing identity information, we then contribute a new model IPFormer-VideoLLM. Its key idea is the injection of instance-level features as instance prompts through an efficient attention-based connector. This allows for the aggregation of instance-specific information across scenes. Experiments demonstrate that our proposed dataset and model not only enhance the multi-scene video understanding significantly, but also offer distinct advantages across various video benchmarks.

[94] CL-Splats: Continual Learning of Gaussian Splatting with Local Optimization

Jan Ackermann,Jonas Kulhanek,Shengqu Cai,Haofei Xu,Marc Pollefeys,Gordon Wetzstein,Leonidas Guibas,Songyou Peng

Main category: cs.CV

TL;DR: This paper introduces CL-Splats, a method for efficiently updating 3D scene representations over time by incrementally modifying Gaussian splatting-based models, allowing for focused local optimization and improved reconstruction quality.

Details Motivation: In dynamic 3D environments, accurately updating scene representations over time is crucial for applications in robotics, mixed reality, and embodied AI. As scenes evolve, efficient methods to incorporate changes are needed to maintain up-to-date, high-quality reconstructions without the computational overhead of re-optimizing the entire scene. Method: CL-Splats incrementally updates Gaussian splatting-based 3D representations from sparse scene captures by integrating a robust change-detection module that segments updated and static components within the scene, enabling focused, local optimization. Result: Extensive experiments demonstrate that CL-Splats achieves efficient updates with improved reconstruction quality over the state-of-the-art. Conclusion: CL-Splats provides a robust foundation for future real-time adaptation in 3D scene reconstruction tasks by achieving efficient updates with improved reconstruction quality over the state-of-the-art. Abstract: In dynamic 3D environments, accurately updating scene representations over time is crucial for applications in robotics, mixed reality, and embodied AI. As scenes evolve, efficient methods to incorporate changes are needed to maintain up-to-date, high-quality reconstructions without the computational overhead of re-optimizing the entire scene. This paper introduces CL-Splats, which incrementally updates Gaussian splatting-based 3D representations from sparse scene captures. CL-Splats integrates a robust change-detection module that segments updated and static components within the scene, enabling focused, local optimization that avoids unnecessary re-computation. Moreover, CL-Splats supports storing and recovering previous scene states, facilitating temporal segmentation and new scene-analysis applications. Our extensive experiments demonstrate that CL-Splats achieves efficient updates with improved reconstruction quality over the state-of-the-art.
This establishes a robust foundation for future real-time adaptation in 3D scene reconstruction tasks.

[95] GoIRL: Graph-Oriented Inverse Reinforcement Learning for Multimodal Trajectory Prediction

Muleilan Pei,Shaoshuai Shi,Lu Zhang,Peiliang Li,Shaojie Shen

Main category: cs.CV

TL;DR: This paper introduces a novel GoIRL framework for trajectory prediction in autonomous driving, combining inverse reinforcement learning with vectorized context representations to achieve superior performance and generalization compared to traditional supervised methods.

Details Motivation: Trajectory prediction for surrounding agents in autonomous driving is inherently uncertain and multimodal. Prevailing data-driven methods primarily rely on supervised learning, which has limitations in capturing the complexity of this task. Method: The authors propose a Graph-oriented Inverse Reinforcement Learning (GoIRL) framework. This includes a feature adaptor for aggregating lane-graph features, integration with the maximum entropy IRL paradigm to infer reward distribution, and a hierarchical parameterized trajectory generator with refinement and probability fusion modules. Result: Extensive experiments show that the proposed approach achieves state-of-the-art performance on the Argoverse and nuScenes motion forecasting benchmarks and outperforms existing supervised models in terms of generalization ability. Conclusion: The paper concludes that their proposed GoIRL framework achieves state-of-the-art performance on trajectory prediction tasks and demonstrates superior generalization abilities compared to existing supervised models. Abstract: Trajectory prediction for surrounding agents is a challenging task in autonomous driving due to its inherent uncertainty and underlying multimodality. Unlike prevailing data-driven methods that primarily rely on supervised learning, in this paper, we introduce a novel Graph-oriented Inverse Reinforcement Learning (GoIRL) framework, which is an IRL-based predictor equipped with vectorized context representations. We develop a feature adaptor to effectively aggregate lane-graph features into grid space, enabling seamless integration with the maximum entropy IRL paradigm to infer the reward distribution and obtain the policy that can be sampled to induce multiple plausible plans. 
Furthermore, conditioned on the sampled plans, we implement a hierarchical parameterized trajectory generator with a refinement module to enhance prediction accuracy and a probability fusion strategy to boost prediction confidence. Extensive experimental results showcase our approach not only achieves state-of-the-art performance on the large-scale Argoverse & nuScenes motion forecasting benchmarks but also exhibits superior generalization abilities compared to existing supervised models.

[96] Learning to See in the Extremely Dark

Hai Jiang,Binhao Guan,Zhen Liu,Xiaohong Liu,Jian Yu,Zheng Liu,Songchen Han,Shuaicheng Liu

Main category: cs.CV

TL;DR: This paper introduces a new dataset (SIED) and a diffusion-based framework for enhancing extremely low-light RAW images, enabling better performance in near-total darkness.

Details Motivation: Current learning-based methods lack exploration in handling extremely dark scenes due to the absence of appropriate datasets. This work aims to bridge that gap by providing both a dataset and a novel enhancement framework. Method: A paired-to-paired data synthesis pipeline was developed to create the SIED dataset with images at varying low-light levels. A diffusion-based framework with an Adaptive Illumination Correction Module (AICM) and a color consistency loss was introduced for image restoration. Result: The experiments demonstrate the effectiveness of the proposed method on the new SIED dataset as well as existing benchmarks, achieving visually pleasing results from extremely low-SNR RAW inputs. Conclusion: The proposed diffusion-based framework and the SIED dataset advance the field of low-light RAW image enhancement, particularly for extremely dark environments. Abstract: Learning-based methods have made promising advances in low-light RAW image enhancement, while their capability in extremely dark scenes, where the environmental illuminance drops as low as 0.0001 lux, remains to be explored due to the lack of corresponding datasets. To this end, we propose a paired-to-paired data synthesis pipeline capable of generating well-calibrated extremely low-light RAW images at three precise illuminance ranges of 0.01-0.1 lux, 0.001-0.01 lux, and 0.0001-0.001 lux, together with high-quality sRGB references to comprise a large-scale paired dataset named See-in-the-Extremely-Dark (SIED) to benchmark low-light RAW image enhancement approaches. Furthermore, we propose a diffusion-based framework that leverages the generative ability and intrinsic denoising property of diffusion models to restore visually pleasing results from extremely low-SNR RAW inputs, in which an Adaptive Illumination Correction Module (AICM) and a color consistency loss are introduced to ensure accurate exposure correction and color restoration.
Extensive experiments on the proposed SIED and publicly available benchmarks demonstrate the effectiveness of our method. The code and dataset are available at https://github.com/JianghaiSCU/SIED.

[97] YOLO-FDA: Integrating Hierarchical Attention and Detail Enhancement for Surface Defect Detection

Jiawei Hu

Main category: cs.CV

TL;DR: YOLO-FDA improves industrial surface defect detection by combining enhanced detail capture and advanced feature fusion techniques.

Details Motivation: Surface defect detection is challenging due to variability in defects, irregular shapes, fine-grained requirements, and complex textures. Existing methods struggle with redundant features, limited detail sensitivity, and weak multiscale robustness. Method: YOLO-FDA integrates a BiFPN-style architecture, a Detail-directional Fusion Module (DDFM), and two attention-based fusion strategies: Attention-weighted Concatenation (AC) and Cross-layer Attention Fusion (CAF). Result: Extensive experiments show that YOLO-FDA outperforms state-of-the-art methods in accuracy and robustness for diverse defect types and scales. Conclusion: YOLO-FDA provides superior performance in surface defect detection by enhancing fine-grained details and leveraging attention-guided feature fusion. Abstract: Surface defect detection in industrial scenarios is both crucial and technically demanding due to the wide variability in defect types, irregular shapes and sizes, fine-grained requirements, and complex material textures. Although recent advances in AI-based detectors have improved performance, existing methods often suffer from redundant features, limited detail sensitivity, and weak robustness under multiscale conditions. To address these challenges, we propose YOLO-FDA, a novel YOLO-based detection framework that integrates fine-grained detail enhancement and attention-guided feature fusion. Specifically, we adopt a BiFPN-style architecture to strengthen bidirectional multilevel feature aggregation within the YOLOv5 backbone. To better capture fine structural changes, we introduce a Detail-directional Fusion Module (DDFM) that introduces a directional asymmetric convolution in the second-lowest layer to enrich spatial details and fuses the second-lowest layer with low-level features to enhance semantic consistency. 
Furthermore, we propose two novel attention-based fusion strategies, Attention-weighted Concatenation (AC) and Cross-layer Attention Fusion (CAF), to improve contextual representation and reduce feature noise. Extensive experiments on benchmark datasets demonstrate that YOLO-FDA consistently outperforms existing state-of-the-art methods in terms of both accuracy and robustness across diverse types of defects and scales.

[98] Tree-based Semantic Losses: Application to Sparsely-supervised Large Multi-class Hyperspectral Segmentation

Junwen Wang,Oscar Maccormac,William Rochford,Aaron Kujawa,Jonathan Shapey,Tom Vercauteren

Main category: cs.CV

TL;DR: This paper introduces tree-based semantic loss functions to improve biomedical segmentation by exploiting label hierarchies, achieving strong performance on a large hyperspectral imaging dataset while enabling OOD pixel detection.

Details Motivation: Commonly used learning methods for biomedical segmentation tasks treat all errors equally, failing to exploit inter-class semantics in the label space. This work addresses this limitation by leveraging semantic relationships between classes. Method: Introduction of two tree-based semantic loss functions that utilize a hierarchical organization of labels, incorporated into a training approach with sparse, background-free annotations. Result: Extensive experiments demonstrate that the proposed method performs exceptionally well on a hyperspectral imaging dataset with 107 classes organized in a clinically-defined semantic tree structure. Conclusion: The proposed method achieves state-of-the-art performance on a sparsely annotated HSI dataset and enables effective detection of out-of-distribution (OOD) pixels without compromising segmentation performance on in-distribution (ID) pixels. Abstract: Hyperspectral imaging (HSI) shows great promise for surgical applications, offering detailed insights into biological tissue differences beyond what the naked eye can perceive. Refined labelling efforts are underway to train vision systems to distinguish large numbers of subtly varying classes. However, commonly used learning methods for biomedical segmentation tasks penalise all errors equivalently and thus fail to exploit any inter-class semantics in the label space. In this work, we introduce two tree-based semantic loss functions which take advantage of a hierarchical organisation of the labels. We further incorporate our losses in a recently proposed approach for training with sparse, background-free annotations. Extensive experiments demonstrate that our proposed method reaches state-of-the-art performance on a sparsely annotated HSI dataset comprising 107 classes organised in a clinically-defined semantic tree structure.
Furthermore, our method enables effective detection of out-of-distribution (OOD) pixels without compromising segmentation performance on in-distribution (ID) pixels.

[99] Robust Deep Learning for Myocardial Scar Segmentation in Cardiac MRI with Noisy Labels

Aida Moafi,Danial Moafi,Evgeny M. Mirkes,Gerry P. McCann,Abbas S. Alatrany,Jayanth R. Arnold,Mostafa Mehdipour Ghazi

Main category: cs.CV

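TL;DR: This study develops a fully automated deep-learning method for myocardial scar detection and segmentation that handles label noise, data heterogeneity, and class imbalance, and shows strong performance and generalization across a range of imaging conditions.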

Details Motivation: Accurate segmentation of myocardial scars in cardiac MRI is essential for clinical assessment and treatment planning, but it is hampered by label noise, data heterogeneity, and class imbalance. Method: State-of-the-art models are fine-tuned with a Kullback-Leibler loss and extensive data augmentation to address label noise, data heterogeneity, and class imbalance. Result: The method performs well on both acute and chronic cases, produces accurate and smooth segmentations, and shows stronger generalization and robustness on an out-of-distribution test set than existing state-of-the-art models such as nnU-Net. Conclusion: The study presents a robust deep-learning pipeline for automated myocardial scar detection and segmentation that performs well in both accuracy and generalization, laying a foundation for the clinical adoption of deep learning in cardiac imaging. Abstract: The accurate segmentation of myocardial scars from cardiac MRI is essential for clinical assessment and treatment planning. In this study, we propose a robust deep-learning pipeline for fully automated myocardial scar detection and segmentation by fine-tuning state-of-the-art models. The method explicitly addresses challenges of label noise from semi-automatic annotations, data heterogeneity, and class imbalance through the use of Kullback-Leibler loss and extensive data augmentation. We evaluate the model's performance on both acute and chronic cases and demonstrate its ability to produce accurate and smooth segmentations despite noisy labels. In particular, our approach outperforms state-of-the-art models like nnU-Net and shows strong generalizability in an out-of-distribution test set, highlighting its robustness across various imaging conditions and clinical tasks. These results establish a reliable foundation for automated myocardial scar quantification and support the broader clinical adoption of deep learning in cardiac imaging.
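
The abstract does not give the exact form of the Kullback-Leibler loss, so the following is only a minimal sketch of one common reading: the hard (possibly noisy) label of each pixel is softened into a distribution, and the loss is the KL divergence from that target to the predicted class distribution. The smoothing factor `eps` and all function names are assumptions made for illustration, not the authors' implementation.

```python
import math

def softmax(logits):
    # Numerically stable softmax over one pixel's class logits.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def smooth_label(label, num_classes, eps=0.1):
    # Assumption: soften the hard label so a noisy annotation is not trusted fully.
    off = eps / (num_classes - 1)
    return [1.0 - eps if c == label else off for c in range(num_classes)]

def kl_divergence(target, pred, tiny=1e-12):
    # KL(target || pred) for two discrete distributions of equal length.
    return sum(t * math.log((t + tiny) / (p + tiny))
               for t, p in zip(target, pred) if t > 0)

def pixel_kl_loss(logits_per_pixel, labels, eps=0.1):
    # Mean KL loss over pixels; logits_per_pixel is a list of per-class logit lists.
    num_classes = len(logits_per_pixel[0])
    losses = [kl_divergence(smooth_label(y, num_classes, eps), softmax(z))
              for z, y in zip(logits_per_pixel, labels)]
    return sum(losses) / len(losses)
```

With a softened target, a confidently wrong prediction is still penalized more than a confidently right one, but the penalty for disagreeing with a single annotation is bounded by the mass placed on the other classes.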

[100] Geometry and Perception Guided Gaussians for Multiview-consistent 3D Generation from a Single Image

Pufan Li,Bi'an Du,Wei Hu

Main category: cs.CV

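TL;DR: This paper proposes a new method that integrates geometry and perception priors to generate high-fidelity 3D objects without additional model training.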

Details Motivation: Generating realistic 3D objects from a single-view image requires natural appearance, 3D consistency, and the ability to capture multiple plausible interpretations of unseen regions. Existing methods typically rely on fine-tuning pretrained 2D diffusion models or on directly generating 3D information, but their results often fall short in multiview consistency and geometric detail. Method: Three different Gaussian branches are trained, initialized from the geometry prior, the perception prior, and Gaussian noise, respectively. The geometry prior captures the rough 3D shape, while the perception prior uses a pretrained 2D diffusion model to enhance multiview information. The 3D Gaussian branches are then refined through mutual interaction between the geometry and perception priors and a reprojection-based strategy that further improves depth consistency. Result: Experiments show that the method achieves higher-fidelity reconstruction on novel view synthesis and 3D reconstruction, outperforming existing methods. Conclusion: The method demonstrates robust and consistent generation of high-fidelity 3D objects without additional model training. Abstract: Generating realistic 3D objects from single-view images requires natural appearance, 3D consistency, and the ability to capture multiple plausible interpretations of unseen regions. Existing approaches often rely on fine-tuning pretrained 2D diffusion models or directly generating 3D information through fast network inference or 3D Gaussian Splatting, but their results generally suffer from poor multiview consistency and lack geometric detail. To tackle these issues, we present a novel method that seamlessly integrates geometry and perception priors without requiring additional model training to reconstruct detailed 3D objects from a single image. Specifically, we train three different Gaussian branches initialized from the geometry prior, perception prior and Gaussian noise, respectively. The geometry prior captures the rough 3D shapes, while the perception prior utilizes the 2D pretrained diffusion model to enhance multiview information. Subsequently, we refine 3D Gaussian branches through mutual interaction between geometry and perception priors, further enhanced by a reprojection-based strategy that enforces depth consistency. Experiments demonstrate the higher-fidelity reconstruction results of our method, outperforming existing methods on novel view synthesis and 3D reconstruction, demonstrating robust and consistent 3D object generation.

[101] Topology-Aware Modeling for Unsupervised Simulation-to-Reality Point Cloud Recognition

Longkun Zou,Kangjun Liu,Ke Chen,Kailing Guo,Kui Jia,Yaowei Wang

Main category: cs.CV

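TL;DR: This paper proposes a new Topology-Aware Modeling (TAM) framework for Sim2Real unsupervised domain adaptation. By exploiting global spatial topology and the topological relations of local geometric features, combined with cross-domain contrastive learning and self-training, it narrows the domain gap and improves the generalization of point cloud classification.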

Details Motivation: Learning semantic representations from point sets of 3D object shapes faces significant geometric variation caused by differing data acquisition methods: training data is usually generated with point simulators, while test data is collected with different 3D sensors, so the resulting Sim2Real domain gap limits the generalization of point classifiers. Method: A new Topology-Aware Modeling (TAM) framework is introduced that narrows the domain gap by exploiting global spatial topology and modeling the topological relations of local geometric features, combined with a strategy that pairs cross-domain contrastive learning with self-training. Result: Experiments on three public Sim2Real benchmarks show consistent improvements across all evaluated tasks. Conclusion: The experimental results validate the effectiveness of the TAM framework on Sim2Real UDA, where it outperforms existing methods on all evaluated tasks. Abstract: Learning semantic representations from point sets of 3D object shapes is often challenged by significant geometric variations, primarily due to differences in data acquisition methods. Typically, training data is generated using point simulators, while testing data is collected with distinct 3D sensors, leading to a simulation-to-reality (Sim2Real) domain gap that limits the generalization ability of point classifiers. Current unsupervised domain adaptation (UDA) techniques struggle with this gap, as they often lack robust, domain-insensitive descriptors capable of capturing global topological information, resulting in overfitting to the limited semantic patterns of the source domain. To address this issue, we introduce a novel Topology-Aware Modeling (TAM) framework for Sim2Real UDA on object point clouds. Our approach mitigates the domain gap by leveraging global spatial topology, characterized by low-level, high-frequency 3D structures, and by modeling the topological relations of local geometric features through a novel self-supervised learning task. Additionally, we propose an advanced self-training strategy that combines cross-domain contrastive learning with self-training, effectively reducing the impact of noisy pseudo-labels and enhancing the robustness of the adaptation process. Experimental results on three public Sim2Real benchmarks validate the effectiveness of our TAM framework, showing consistent improvements over state-of-the-art methods across all evaluated tasks. The source code of this work will be available at https://github.com/zou-longkun/TAG.git.

[102] Task-Aware KV Compression For Cost-Effective Long Video Understanding

Minghao Qin,Yan Shu,Peitian Zhang,Kun Lun,Huaying Yuan,Juenjie Zhou,Shitao Xiao,Bo Zhao,Zheng Liu

Main category: cs.CV

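TL;DR: Video-X^2L proposes a new approach to long-video understanding that uses bi-level KV compression and selective KV re-loading to preserve information while reducing computation cost.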

Details Motivation: Long-video understanding is challenging for existing multimodal large language models, mainly because of the high computational cost, and existing KV compression methods tend to lose substantial information at high compression ratios. Method: Two key operations are introduced: bi-level KV compression and selective KV re-loading. Bi-level KV compression produces low-compression KVs, which capture fine-grained video details, and high-compression KVs, which provide a compact video representation; selective KV re-loading re-loads different KV data during the decoding stage according to importance. Result: Experiments show that Video-X^2L clearly outperforms existing KV compression methods on a range of popular long-video understanding benchmarks while substantially reducing computation cost. Conclusion: Video-X^2L is a video understanding method that requires no additional training and is compatible with existing KV-compressible multimodal large language models; it performs well on multiple long-video understanding benchmarks, with a significant advantage over existing KV compression methods and substantial savings in computation cost. Abstract: Long-video understanding (LVU) remains a severe challenge for existing multimodal large language models (MLLMs), primarily due to the prohibitive computational cost. Recent approaches have explored KV compression to mitigate this issue, but they often suffer from significant information loss at high compression ratios. In this paper, we introduce Video-X^2L, which flexibly preserves critical video information for each LVU task. Video-X^2L involves two key operations. The first one is called bi-level KV compression. During the MLLM's pre-filling stage, Video-X^2L generates two types of compressed KVs: low-compression KVs (L-KVs) to capture fine-grained video details and high-compression KVs (H-KVs) to offer compact video representations. The second one is called selective KV re-loading. During the MLLM's decoding stage, Video-X^2L selectively re-loads L-KVs for the most critical video chunks while using H-KVs for other less important ones. This allows the MLLM to fully utilize task-specific information while maintaining the overall compactness. Video-X^2L is simple yet effective: it is free from additional training and directly compatible with existing KV-compressible MLLMs. We evaluate Video-X^2L with a variety of popular LVU benchmarks, including VideoMME, MLVU, LongVideoBench, and VNBench. Our experimental results show that Video-X^2L outperforms existing KV-compression methods by a large margin while substantially reducing the computation cost.
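
The selective re-loading step described above can be sketched as a per-chunk choice between the two KV variants. The abstract does not say how chunk importance is scored, so the relevance scores, the `top_k` budget, and the function name below are hypothetical; the KV entries are stand-in strings rather than real tensors.

```python
def select_kv(chunk_scores, l_kvs, h_kvs, top_k):
    """Pick one KV entry per video chunk.

    chunk_scores: task relevance of each chunk (how it is computed is an assumption).
    l_kvs / h_kvs: low- and high-compression KVs, one per chunk.
    top_k: number of chunks allowed to use the larger L-KVs.
    """
    ranked = sorted(range(len(chunk_scores)),
                    key=lambda i: chunk_scores[i], reverse=True)
    critical = set(ranked[:top_k])
    # Critical chunks get fine-grained L-KVs; the rest stay compact with H-KVs.
    return [l_kvs[i] if i in critical else h_kvs[i]
            for i in range(len(chunk_scores))]

scores = [0.1, 0.9, 0.3, 0.7]
l_kvs = ["L0", "L1", "L2", "L3"]
h_kvs = ["H0", "H1", "H2", "H3"]
mixed = select_kv(scores, l_kvs, h_kvs, top_k=2)
```

Because the choice is made at decoding time, the same pre-filled KVs can serve different questions about one video, each with its own set of critical chunks.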

[103] Out-of-Distribution Semantic Occupancy Prediction

Yuheng Zhang,Mengfei Duan,Kunyu Peng,Yuhang Wang,Ruiping Liu,Fei Teng,Kai Luo,Zhiyong Li,Kailun Yang

Main category: cs.CV

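TL;DR: This paper proposes OccOoD, a new framework that integrates OoD detection into 3D semantic occupancy prediction, and builds two datasets containing synthetic anomalies, VAA-KITTI and VAA-KITTI-360.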

Details Motivation: Existing methods focus on in-distribution scenes and are susceptible to anomalous objects and long-tail distributions, which can create safety hazards. Method: Voxel-BEV Progressive Fusion (VBPF) is introduced, which uses an RWKV-based branch to enhance OoD detection through geometry-semantic fusion. Result: Within a 1.2 m region, OccOoD reaches an AuROC of 67.34% and an AuPRCr of 29.21%. Conclusion: OccOoD achieves state-of-the-art OoD detection while maintaining competitive occupancy prediction performance. Abstract: 3D Semantic Occupancy Prediction is crucial for autonomous driving, providing a dense, semantically rich environmental representation. However, existing methods focus on in-distribution scenes, making them susceptible to Out-of-Distribution (OoD) objects and long-tail distributions, which increases the risk of undetected anomalies and misinterpretations, posing safety hazards. To address these challenges, we introduce Out-of-Distribution Semantic Occupancy Prediction, targeting OoD detection in 3D voxel space. To fill the gaps in the dataset, we propose a Synthetic Anomaly Integration Pipeline that injects synthetic anomalies while preserving realistic spatial and occlusion patterns, enabling the creation of two datasets: VAA-KITTI and VAA-KITTI-360. We introduce OccOoD, a novel framework integrating OoD detection into 3D semantic occupancy prediction, with Voxel-BEV Progressive Fusion (VBPF) leveraging an RWKV-based branch to enhance OoD detection via geometry-semantic fusion. Experimental results demonstrate that OccOoD achieves state-of-the-art OoD detection with an AuROC of 67.34% and an AuPRCr of 29.21% within a 1.2m region, while maintaining competitive occupancy prediction performance. The established datasets and source code will be made publicly available at https://github.com/7uHeng/OccOoD.

[104] GroundFlow: A Plug-in Module for Temporal Reasoning on 3D Point Cloud Sequential Grounding

Zijun Lin,Shuting He,Cheston Tan,Bihan Wen

Main category: cs.CV

TL;DR: GroundFlow is introduced to enhance 3D visual grounding models by incorporating temporal reasoning, allowing them to better understand sequential instructions involving 3D point clouds.

Details Motivation: Current 3D visual grounding (3DVG) methods treat multi-step text instructions as a whole, failing to extract useful temporal information. SG3D requires contextual understanding due to pronouns like "it" and "the same," which existing methods struggle to interpret without historical reference. Method: GroundFlow selectively extracts both short-term and long-term step information based on its relevance to the current instruction. This approach allows for a comprehensive view of historical context while maintaining temporal understanding as step counts increase. Result: Integrating GroundFlow improves the task accuracy of 3DVG baseline methods significantly (+7.5% and +10.2%) in the SG3D benchmark. It also outperforms a 3D large language model pre-trained on multiple datasets. Conclusion: The proposed GroundFlow module enhances the temporal reasoning capabilities of existing 3DVG models, enabling them to better handle sequential grounding tasks by effectively utilizing historical information from previous steps. Abstract: Sequential grounding in 3D point clouds (SG3D) refers to locating sequences of objects by following text instructions for a daily activity with detailed steps. Current 3D visual grounding (3DVG) methods treat text instructions with multiple steps as a whole, without extracting useful temporal information from each step. However, the instructions in SG3D often contain pronouns such as "it", "here" and "the same" to make language expressions concise. This requires grounding methods to understand the context and retrieve relevant information from previous steps to correctly locate object sequences. Due to the lack of an effective module for collecting related historical information, state-of-the-art 3DVG methods face significant challenges in adapting to the SG3D task. To fill this gap, we propose GroundFlow -- a plug-in module for temporal reasoning on 3D point cloud sequential grounding. 
Firstly, we demonstrate that integrating GroundFlow improves the task accuracy of 3DVG baseline methods by a large margin (+7.5% and +10.2%) in the SG3D benchmark, even outperforming a 3D large language model pre-trained on various datasets. Furthermore, we selectively extract both short-term and long-term step information based on its relevance to the current instruction, enabling GroundFlow to take a comprehensive view of historical information and maintain its temporal understanding advantage as step counts increase. Overall, our work introduces temporal reasoning capabilities to existing 3DVG models and achieves state-of-the-art performance in the SG3D benchmark across five datasets.

[105] Unlocking Constraints: Source-Free Occlusion-Aware Seamless Segmentation

Yihong Cao,Jiaming Zhang,Xu Zheng,Hao Shi,Kunyu Peng,Hang Liu,Kailun Yang,Hui Zhang

Main category: cs.CV

TL;DR: This paper proposes UNLOCK, a novel framework for Source-Free Occlusion-Aware Seamless Segmentation (SFOASS), which achieves competitive performance without needing source data or target labels.

Details Motivation: The motivation is to overcome constraints in panoramic image processing such as distortions, perspective occlusions, and limited annotations by introducing a more practical and efficient method for seamless segmentation. Method: The paper introduces two key modules: Omni Pseudo-Labeling Learning and Amodal-Driven Context Learning. These modules enable adaptation without relying on source data or target labels, enhancing segmentation with 360° viewpoint coverage and occlusion-aware reasoning. Result: Experimental results demonstrate that the source-free method achieves state-of-the-art scores of 10.9 in mAAP and 11.6 in mAP, along with an absolute improvement of +4.3 in mAPQ over the source-only method. Conclusion: The paper concludes that the proposed UNLOCK framework successfully addresses the SFOASS task, achieving performance comparable to source-dependent methods while not requiring source data or target labels. Abstract: Panoramic image processing is essential for omni-context perception, yet faces constraints like distortions, perspective occlusions, and limited annotations. Previous unsupervised domain adaptation methods transfer knowledge from labeled pinhole data to unlabeled panoramic images, but they require access to source pinhole data. To address these, we introduce a more practical task, i.e., Source-Free Occlusion-Aware Seamless Segmentation (SFOASS), and propose its first solution, called UNconstrained Learning Omni-Context Knowledge (UNLOCK). Specifically, UNLOCK includes two key modules: Omni Pseudo-Labeling Learning and Amodal-Driven Context Learning. While adapting without relying on source data or target labels, this framework enhances models to achieve segmentation with 360° viewpoint coverage and occlusion-aware reasoning. Furthermore, we benchmark the proposed SFOASS task through both real-to-real and synthetic-to-real adaptation settings.
Experimental results show that our source-free method achieves performance comparable to source-dependent methods, yielding state-of-the-art scores of 10.9 in mAAP and 11.6 in mAP, along with an absolute improvement of +4.3 in mAPQ over the source-only method. All data and code will be made publicly available at https://github.com/yihong-97/UNLOCK.

[106] MedPrompt: LLM-CNN Fusion with Weight Routing for Medical Image Segmentation and Classification

Shadman Sobhan,Kazi Abrar Mahmud,Abduz Zami

Main category: cs.CV

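TL;DR: MedPrompt is a unified medical image analysis framework that combines a large language model with a convolutional neural network, integrating task planning with low-level image processing, and offers high scalability and potential for real-time use.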

Details Motivation: Current medical image analysis systems are typically task-specific, require separate models for classification and segmentation, and do not support user-defined workflows. Method: MedPrompt is introduced, combining a few-shot prompted large language model (Llama-4-17B) for high-level task planning with a modular convolutional neural network (DeepFusionLab) for low-level image processing. Result: MedPrompt was evaluated on 19 public datasets covering 12 tasks across 5 imaging modalities, achieving 97% end-to-end correctness with an average inference latency of 2.5 seconds. Conclusion: By combining the interpretability of LLMs with the efficiency of modular CNNs, MedPrompt enables scalable, prompt-driven medical image analysis. Abstract: Current medical image analysis systems are typically task-specific, requiring separate models for classification and segmentation, and lack the flexibility to support user-defined workflows. To address these challenges, we introduce MedPrompt, a unified framework that combines a few-shot prompted Large Language Model (Llama-4-17B) for high-level task planning with a modular Convolutional Neural Network (DeepFusionLab) for low-level image processing. The LLM interprets user instructions and generates structured output to dynamically route task-specific pretrained weights. This weight routing approach avoids retraining the entire framework when adding new tasks; only task-specific weights are required, enhancing scalability and deployment. We evaluated MedPrompt across 19 public datasets, covering 12 tasks spanning 5 imaging modalities. The system achieves 97% end-to-end correctness in interpreting and executing prompt-driven instructions, with an average inference latency of 2.5 seconds, making it suitable for near real-time applications. DeepFusionLab achieves competitive segmentation accuracy (e.g., Dice 0.9856 on lungs) and strong classification performance (F1 0.9744 on tuberculosis). Overall, MedPrompt enables scalable, prompt-driven medical imaging by combining the interpretability of LLMs with the efficiency of modular CNNs.
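
As a rough illustration of the weight routing idea, the sketch below maps a structured LLM output to a task-specific weight file. The JSON field names (`task`, `target`), the registry contents, and the file paths are all hypothetical; the abstract does not specify MedPrompt's actual output schema or storage layout.

```python
import json

# Hypothetical registry: (task, target) -> path of task-specific pretrained weights.
WEIGHT_REGISTRY = {
    ("segmentation", "lung"): "weights/seg_lung.pt",
    ("classification", "tuberculosis"): "weights/cls_tb.pt",
}

def route_weights(llm_output: str) -> str:
    # Parse the planner's structured output and look up the matching weights.
    plan = json.loads(llm_output)
    key = (plan["task"], plan["target"])
    if key not in WEIGHT_REGISTRY:
        raise KeyError(f"no weights registered for {key}")
    return WEIGHT_REGISTRY[key]

def register_task(task: str, target: str, path: str) -> None:
    # Adding a task only adds a registry entry; the shared network is untouched.
    WEIGHT_REGISTRY[(task, target)] = path

lung_path = route_weights('{"task": "segmentation", "target": "lung"}')
```

The design point this is meant to show is that supporting a new task becomes a registration step rather than a retraining step, which is the scalability argument the abstract makes.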

[107] BitMark for Infinity: Watermarking Bitwise Autoregressive Image Generative Models

Louis Kerner,Michel Meintz,Bihe Zhao,Franziska Boenisch,Adam Dziedzic

Main category: cs.CV

TL;DR: BitMark introduces a new watermarking technique for text-to-image models to combat model collapse by ensuring traceability of generated images.

Details Motivation: To mitigate the risk of model collapse caused by training on generated content, especially as outputs from such models increasingly populate the Internet and are reused as training data. Method: The method involves embedding a watermark at the bit level of the token stream across multiple scales during the image generation process, subtly influencing bits while maintaining visual fidelity and speed. Result: BitMark successfully embeds watermarks that remain detectable even after images are used to train other models, including when only fine-tuning diffusion or autoregressive models on watermarked images. Conclusion: BitMark is a robust bitwise watermarking framework that helps prevent model collapse in image generative models by enabling reliable detection of generated outputs. Abstract: State-of-the-art text-to-image models like Infinity generate photorealistic images at an unprecedented speed. These models operate in a bitwise autoregressive manner over a discrete set of tokens that is practically infinite in size. However, their impressive generative power comes with a growing risk: as their outputs increasingly populate the Internet, they are likely to be scraped and reused as training data, potentially by the very same models. This phenomenon has been shown to lead to model collapse, where repeated training on generated content, especially from the models' own previous versions, causes a gradual degradation in performance. A promising mitigation strategy is watermarking, which embeds human-imperceptible yet detectable signals into generated images, enabling the identification of generated content. In this work, we introduce BitMark, a robust bitwise watermarking framework for Infinity. Our method embeds a watermark directly at the bit level of the token stream across multiple scales (also referred to as resolutions) during Infinity's image generation process.
Our bitwise watermark subtly influences the bits to preserve visual fidelity and generation speed while remaining robust against a spectrum of removal techniques. Furthermore, it exhibits high radioactivity, i.e., when watermarked generated images are used to train another image generative model, this second model's outputs will also carry the watermark. The radioactive traces remain detectable even when only fine-tuning diffusion or image autoregressive models on images watermarked with our BitMark. Overall, our approach provides a principled step toward preventing model collapse in image generative models by enabling reliable detection of generated outputs.

[108] ReME: A Data-Centric Framework for Training-Free Open-Vocabulary Segmentation

Xiwei Xuan,Ziquan Deng,Kwan-Liu Ma

Main category: cs.CV

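TL;DR: This paper proposes a new training-free approach to open-vocabulary semantic segmentation that, by improving data quality and designing an efficient retrieval process, outperforms prior work on multiple datasets.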

Details Motivation: Existing solutions are limited by the capability of the models they rely on or by the suboptimal quality of reference sets, so this work explores the overlooked problem of data quality. Method: A data-quality-oriented framework is introduced, comprising a data pipeline that constructs a reference set with well-paired segment-text embeddings and a similarity-based retrieval method. Result: Extensive evaluation on ten benchmark datasets shows that the method outperforms all existing training-free OVS methods. Conclusion: The study proposes a data-quality-based framework that significantly improves training-free open-vocabulary semantic segmentation, highlighting the importance of data-centric design for advancing training-free OVS. Abstract: Training-free open-vocabulary semantic segmentation (OVS) aims to segment images given a set of arbitrary textual categories without costly model fine-tuning. Existing solutions often explore attention mechanisms of pre-trained models, such as CLIP, or generate synthetic data and design complex retrieval processes to perform OVS. However, their performance is limited by the capability of reliant models or the suboptimal quality of reference sets. In this work, we investigate the largely overlooked data quality problem for this challenging dense scene understanding task, and identify that a high-quality reference set can significantly benefit training-free OVS. With this observation, we introduce a data-quality-oriented framework, comprising a data pipeline to construct a reference set with well-paired segment-text embeddings and a simple similarity-based retrieval to unveil the essential effect of data. Remarkably, extensive evaluations on ten benchmark datasets demonstrate that our method outperforms all existing training-free OVS approaches, highlighting the importance of data-centric design for advancing OVS without training. Our code is available at https://github.com/xiweix/ReME .
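
A simple similarity-based retrieval of the kind the abstract mentions can be sketched as a nearest-neighbour lookup by cosine similarity. This is an illustrative stand-in, not ReME's code: the reference set format (a list of embedding and label pairs), the toy 2-D embeddings, and the function names are assumptions.

```python
import math

def cosine(a, b):
    # Cosine similarity between two equal-length vectors; 0.0 if either is all zeros.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def retrieve_label(segment_embedding, reference_set):
    # reference_set: list of (embedding, text_label) pairs (hypothetical format).
    best = max(reference_set, key=lambda item: cosine(segment_embedding, item[0]))
    return best[1]

reference = [([1.0, 0.0], "cat"), ([0.0, 1.0], "dog")]
label = retrieve_label([0.9, 0.1], reference)
```

With retrieval this simple, segmentation quality depends almost entirely on how well the reference embeddings are paired with their labels, which is the data-centric point the paper argues.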

[109] Real-Time ESFP: Estimating, Smoothing, Filtering, and Pose-Mapping

Qifei Cui,Yuang Zhou,Ruichen Deng

Main category: cs.CV

TL;DR: ESFP is an end-to-end pipeline that converts monocular RGB video into executable joint trajectories for a low-cost 4-DoF desktop robot arm.

Details Motivation: For low-cost desktop robot arms to perform complex tasks, an effective way to generate accurate joint trajectories from visual input is needed. Method: ESFP consists of four sequential modules: (1) Estimating: ROMP lifts each frame to a 24-joint 3-D skeleton; (2) Smoothing: the proposed HPSTM, a sequence-to-sequence Transformer, combines long-range temporal context with a differentiable forward-kinematics decoder; (3) Filtering: root-normalized trajectories are variance-weighted according to HPSTM's uncertainty estimates; (4) Pose-Mapping: a geometric retargeting layer transforms shoulder-elbow-wrist triples into the uArm's polar workspace. Result: The method converts monocular RGB video into joint trajectories suitable for a low-cost 4-DoF desktop arm while preserving wrist orientation. Conclusion: ESFP offers an effective and novel way to generate executable joint trajectories for low-cost desktop robot arms from monocular video input. Abstract: This paper presents ESFP, an end-to-end pipeline that converts monocular RGB video into executable joint trajectories for a low-cost 4-DoF desktop arm. ESFP comprises four sequential modules. (1) Estimating: ROMP lifts each frame to a 24-joint 3-D skeleton. (2) Smoothing: the proposed HPSTM, a sequence-to-sequence Transformer with self-attention, combines long-range temporal context with a differentiable forward-kinematics decoder, enforcing constant bone lengths and anatomical plausibility while jointly predicting joint means and full covariances. (3) Filtering: root-normalized trajectories are variance-weighted according to HPSTM's uncertainty estimates, suppressing residual noise. (4) Pose-Mapping: a geometric retargeting layer transforms shoulder-elbow-wrist triples into the uArm's polar workspace, preserving wrist orientation.
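
The Filtering step weights trajectory samples by predicted uncertainty. The abstract does not give the exact rule, so the sketch below assumes one plausible form: an inverse-variance weighted average over a short sliding window, applied to a single scalar coordinate. The window size and function name are illustrative choices, not taken from the paper.

```python
def variance_weighted_filter(means, variances, window=3, tiny=1e-9):
    """Smooth a 1-D trajectory, trusting low-variance samples more.

    means: predicted coordinate per frame.
    variances: predicted variance per frame (e.g. from an uncertainty head).
    """
    half = window // 2
    out = []
    for t in range(len(means)):
        lo, hi = max(0, t - half), min(len(means), t + half + 1)
        # Inverse-variance weights: uncertain frames contribute little.
        weights = [1.0 / (variances[i] + tiny) for i in range(lo, hi)]
        total = sum(weights)
        out.append(sum(w * means[i] for w, i in zip(weights, range(lo, hi))) / total)
    return out

# The middle frame is an outlier that the model itself flags as very uncertain.
filtered = variance_weighted_filter([0.0, 10.0, 0.0], [1.0, 1e6, 1.0])
```

In this toy case the outlier is pulled almost entirely toward its confident neighbours, which is the noise-suppression behaviour the abstract attributes to the filtering stage.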

[110] DiMPLe -- Disentangled Multi-Modal Prompt Learning: Enhancing Out-Of-Distribution Alignment with Invariant and Spurious Feature Separation

Umaima Rahman,Mohammad Yaqub,Dwarikanath Mahapatra

Main category: cs.CV

TL;DR: DiMPLe improves OOD performance in multi-modal learning by disentangling invariant and spurious features across modalities through a combination of objectives.

Details Motivation: Spurious correlations in visual data hinder out-of-distribution performance, necessitating a method that can disentangle such features across vision and language modalities. Method: DiMPLe incorporates mutual information minimization, spurious feature regularization, and contrastive learning on invariant features to achieve disentanglement within and across modalities. Result: DiMPLe outperforms CoOp-OOD on average across 11 datasets, achieving gains of 15.27 in base class accuracy and 44.31 in novel class accuracy. Conclusion: DiMPLe enables better generalization and robustness in multi-modal learning by disentangling invariant and spurious features across modalities. Abstract: We introduce DiMPLe (Disentangled Multi-Modal Prompt Learning), a novel approach to disentangle invariant and spurious features across vision and language modalities in multi-modal learning. Spurious correlations in visual data often hinder out-of-distribution (OOD) performance. Unlike prior methods focusing solely on image features, DiMPLe disentangles features within and across modalities while maintaining consistent alignment, enabling better generalization to novel classes and robustness to distribution shifts. Our method combines three key objectives: (1) mutual information minimization between invariant and spurious features, (2) spurious feature regularization, and (3) contrastive learning on invariant features. Extensive experiments show that DiMPLe outperforms CoOp-OOD when averaged across 11 diverse datasets, achieving absolute gains of 15.27 in base class accuracy and 44.31 in novel class accuracy.

[111] Temporal Rate Reduction Clustering for Human Motion Segmentation

Xianghan Meng,Zhengyu Tong,Zhiyuan Huang,Chun-Guang Li

Main category: cs.CV

TL;DR: This paper introduces $\text{TR}^2\text{C}$, a new method for Human Motion Segmentation that improves upon existing techniques by learning temporally consistent structured representations aligned with a UoS model, achieving superior results on benchmark datasets.

Details Motivation: Existing subspace clustering methods for Human Motion Segmentation (HMS) rely on the assumption that high-dimensional temporal data follow a Union-of-Subspaces (UoS) distribution, which may not hold for videos of complex human motions with cluttered backgrounds. Method: The paper proposes a new approach called Temporal Rate Reduction Clustering ($\text{TR}^2\text{C}$), which jointly learns structured representations and affinity to segment frame sequences in videos. These representations are designed to maintain temporal consistency and align with a Union-of-Subspaces (UoS) structure. Result: The experiments show that the $\text{TR}^2\text{C}$ method achieves state-of-the-art performance on five benchmark HMS datasets with different feature extractors. Conclusion: The paper concludes that the proposed $\text{TR}^2\text{C}$ method achieves state-of-the-art performance on five benchmark HMS datasets using different feature extractors. Abstract: Human Motion Segmentation (HMS), which aims to partition videos into non-overlapping human motions, has attracted increasing research attention recently. Existing approaches for HMS are mainly dominated by subspace clustering methods, which are grounded on the assumption that high-dimensional temporal data align with a Union-of-Subspaces (UoS) distribution. However, the frames in video capturing complex human motions with cluttered backgrounds may not align well with the UoS distribution. In this paper, we propose a novel approach for HMS, named Temporal Rate Reduction Clustering ($\text{TR}^2\text{C}$), which jointly learns structured representations and affinity to segment the frame sequences in video. Specifically, the structured representations learned by $\text{TR}^2\text{C}$ maintain temporal consistency and align well with a UoS structure, which is favorable for the HMS task. We conduct extensive experiments on five benchmark HMS datasets and achieve state-of-the-art performance with different feature extractors.

[112] DuET: Dual Incremental Object Detection via Exemplar-Free Task Arithmetic

Munish Monga,Vishal Chudasama,Pankaj Wasnik,Biplab Banerjee

Main category: cs.CV

TL;DR: This paper introduces DuET, a novel framework for Dual Incremental Object Detection that effectively handles both class and domain changes without requiring exemplars, significantly improving performance over existing methods.

Details Motivation: Real-world object detection systems need to continuously learn new object categories while adapting to changing environmental conditions, but existing approaches like CIOD and DIOD only address one aspect of this challenge, limiting their practicality. Method: The paper introduces DuET, a Task Arithmetic-based model merging framework with a Directional Consistency Loss to handle incremental learning stably and mitigate sign conflicts. It also proposes the Retention-Adaptability Index (RAI) to evaluate both retention and adaptation performance. Result: Extensive experiments show that DuET achieves significant improvements in RAI (+13.12% on the Pascal Series and +11.39% on the Diverse Weather Series) while maintaining high Average Retention Index, demonstrating its effectiveness in handling dual increments. Conclusion: DuET is an effective and detector-agnostic framework for Dual Incremental Object Detection that addresses both class and domain shifts, outperforming existing methods in real-world applicability. Abstract: Real-world object detection systems, such as those in autonomous driving and surveillance, must continuously learn new object categories and simultaneously adapt to changing environmental conditions. Existing approaches, Class Incremental Object Detection (CIOD) and Domain Incremental Object Detection (DIOD) only address one aspect of this challenge. CIOD struggles in unseen domains, while DIOD suffers from catastrophic forgetting when learning new classes, limiting their real-world applicability. To overcome these limitations, we introduce Dual Incremental Object Detection (DuIOD), a more practical setting that simultaneously handles class and domain shifts in an exemplar-free manner. We propose DuET, a Task Arithmetic-based model merging framework that enables stable incremental learning while mitigating sign conflicts through a novel Directional Consistency Loss. 
Unlike prior methods, DuET is detector-agnostic, allowing models like YOLO11 and RT-DETR to function as real-time incremental object detectors. To comprehensively evaluate both retention and adaptation, we introduce the Retention-Adaptability Index (RAI), which combines the Average Retention Index (Avg RI) for catastrophic forgetting and the Average Generalization Index for domain adaptability into a common ground. Extensive experiments on the Pascal Series and Diverse Weather Series demonstrate DuET's effectiveness, achieving a +13.12% RAI improvement while preserving 89.3% Avg RI on the Pascal Series (4 tasks), as well as a +11.39% RAI improvement with 88.57% Avg RI on the Diverse Weather Series (3 tasks), outperforming existing methods.
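
For readers unfamiliar with task arithmetic, the following minimal sketch shows the generic idea that DuET builds on: a task vector is the difference between fine-tuned and base weights, and merging adds scaled task vectors back to the base. It also flags sign conflicts, the problem the Directional Consistency Loss is said to mitigate. The flat weight lists, the `scale` parameter, and the function names are assumptions; this is not DuET's merging procedure and does not implement its loss.

```python
def task_vector(finetuned, base):
    # Direction in weight space that a task's fine-tuning moved the model.
    return [f - b for f, b in zip(finetuned, base)]

def merge(base, task_vectors, scale=1.0):
    # Generic task arithmetic: base + scale * sum of task vectors.
    merged = list(base)
    for tau in task_vectors:
        merged = [m + scale * t for m, t in zip(merged, tau)]
    return merged

def sign_conflicts(tau_a, tau_b):
    # Indices where two tasks push the same weight in opposite directions.
    return [i for i, (a, b) in enumerate(zip(tau_a, tau_b)) if a * b < 0]

base = [1.0, 1.0, 1.0]
tau_a = task_vector([1.5, 0.5, 1.0], base)   # [0.5, -0.5, 0.0]
tau_b = task_vector([1.25, 1.5, 2.0], base)  # [0.25, 0.5, 1.0]
merged = merge(base, [tau_a, tau_b])
conflicts = sign_conflicts(tau_a, tau_b)
```

In the toy example the second weight receives opposite updates from the two tasks, so naive addition cancels them out; conflicts of this kind are why merging methods add a mechanism for keeping update directions consistent.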

[113] Video Virtual Try-on with Conditional Diffusion Transformer Inpainter

Cheng Zou,Senlin Cheng,Bolei Xu,Dandan Zheng,Xiaobo Li,Jingdong Chen,Ming Yang

Main category: cs.CV

TL;DR: ViTI improves video virtual try-on by treating it as a conditional video inpainting task, offering better consistency and garment detail preservation than existing approaches.

Details Motivation: To address the challenges of spatial-temporal inconsistency and poor garment detail preservation in existing image-based and diffusion-based video try-on methods. Method: ViTI builds a video inpainting framework using a Diffusion Transformer with full 3D spatial-temporal attention, progressively adapted for video garment inpainting through masking strategies and multi-stage training, incorporating garment conditions for accurate appearance and details. Result: The proposed method achieves superior performance over previous works in both quantitative and qualitative evaluations, demonstrating effective spatial-temporal consistency and realistic garment fitting across consecutive video frames. Conclusion: ViTI provides a new approach for video virtual try-on by formulating it as a conditional video inpainting task, leading to better spatial-temporal consistency and garment detail preservation compared to previous methods. Abstract: Video virtual try-on aims to naturally fit a garment to a target person in consecutive video frames. It is a challenging task: on the one hand, the output video should have good spatial-temporal consistency; on the other hand, the details of the given garment need to be preserved well in all frames. Naively applying image-based try-on methods frame by frame gives poor results due to severe inconsistency. Recent diffusion-based video try-on methods, though very few, converge on a similar solution: inserting temporal attention into an image-based try-on model to adapt it to the video try-on task. This has shown improvements, but inconsistency problems remain. In this paper, we propose ViTI (Video Try-on Inpainter), which formulates and implements video virtual try-on as a conditional video inpainting task, unlike previous methods. In this way, we start from a video generation problem instead of an image-based try-on problem, which has better spatial-temporal consistency from the beginning.
Specifically, we first build a video inpainting framework based on a Diffusion Transformer with full 3D spatial-temporal attention, and then progressively adapt it for video garment inpainting with a collection of masking strategies and multi-stage training. After these steps, the model can inpaint the masked garment area with appropriate garment pixels according to the prompt, with good spatial-temporal consistency. Finally, as in other try-on methods, a garment condition is added to the model to make sure the inpainted garment appearance and details are as expected. Both quantitative and qualitative experimental results show that ViTI is superior to previous works.

[114] WordCon: Word-level Typography Control in Scene Text Rendering

Wenda Shi,Yiren Song,Zihan Rao,Dengming Zhang,Jiaming Liu,Xingxing Zou

Main category: cs.CV

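TL;DR: This paper proposes a new method that achieves more precise word-level typography control in generated images, and introduces an accompanying dataset along with plans to open-source the code.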

Details Motivation: Achieving precise word-level typography control in generated images remains a persistent challenge that existing methods struggle to meet. Method: The authors construct a word-level controlled scene text dataset, introduce the Text-Image Alignment (TIA) framework and the WordCon hybrid parameter-efficient fine-tuning method, and apply a masked loss and a joint-attention loss to improve model performance. Result: Qualitative and quantitative experiments confirm the method's superiority in precise word-level typography control, and it integrates seamlessly into pipelines for a variety of text-related tasks. Conclusion: The paper proposes a new text-image alignment framework and the WordCon method for precise word-level typography control in generated images, and demonstrates their superiority and effectiveness across multiple tasks. Abstract: Achieving precise word-level typography control within generated images remains a persistent challenge. To address it, we newly construct a word-level controlled scene text dataset and introduce the Text-Image Alignment (TIA) framework. This framework leverages cross-modal correspondence between text and local image regions provided by grounding models to enhance the Text-to-Image (T2I) model training. Furthermore, we propose WordCon, a hybrid parameter-efficient fine-tuning (PEFT) method. WordCon reparameterizes selective key parameters, improving both efficiency and portability. This allows seamless integration into diverse pipelines, including artistic text rendering, text editing, and image-conditioned text rendering. To further enhance controllability, the masked loss at the latent level is applied to guide the model to concentrate on learning the text region in the image, and the joint-attention loss provides feature-level supervision to promote disentanglement between different words. Both qualitative and quantitative results demonstrate the superiority of our method to the state of the art. The datasets and source code will be available for academic use.
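The abstract's "masked loss at the latent level" can be pictured with a small sketch. This is not the WordCon implementation: it assumes the loss is a mean squared error restricted to latent positions flagged by a binary text-region mask, and the names `masked_mse`, `pred`, `target`, and `text_mask` are hypothetical.

```python
def masked_mse(pred, target, mask):
    """Mean squared error over the positions where mask is 1 (assumed form)."""
    total = 0.0
    count = 0
    for p_row, t_row, m_row in zip(pred, target, mask):
        for p, t, m in zip(p_row, t_row, m_row):
            if m:  # only text-region positions contribute
                total += (p - t) ** 2
                count += 1
    return total / count if count else 0.0


# Toy 2x2 "latent": the mask selects the two positions that hold text.
pred = [[0.0, 1.0], [2.0, 3.0]]
target = [[0.0, 0.0], [0.0, 3.0]]
text_mask = [[0, 1], [1, 0]]

loss = masked_mse(pred, target, text_mask)
full_loss = masked_mse(pred, target, [[1, 1], [1, 1]])
```

Restricting the average to the masked positions keeps errors in the background from diluting the signal from the text region, which is the stated purpose of the masked loss.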

[115] HieraSurg: Hierarchy-Aware Diffusion Model for Surgical Video Generation

Diego Biagini,Nassir Navab,Azade Farshad

Main category: cs.CV

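TL;DR: HieraSurg is a hierarchy-aware surgical video generation framework that uses two diffusion models to generate high-quality surgical videos in stages, addressing the shortcomings of current methods in consistency and fine-grained guidance.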

Details Motivation: Most existing surgical video generation methods are unconditional and fail to stay consistent with surgical actions and phases, lacking the necessary surgical understanding and fine-grained guidance. Method: The HieraSurg framework consists of two specialized diffusion models: the first stage predicts future coarse-grained semantic changes, and the second stage combines them with fine-grained visual features to generate the final video. Result: Experiments show that the model significantly outperforms prior methods in both quantitative and qualitative evaluation, and exhibits particularly fine-grained adherence when provided with existing segmentation maps. Conclusion: By leveraging surgical information at multiple levels, HieraSurg achieves strong performance on cholecystectomy surgical video generation, with strong generalization and the ability to generate higher frame-rate videos. Abstract: Surgical Video Synthesis has emerged as a promising research direction following the success of diffusion models in general-domain video generation. Although existing approaches achieve high-quality video generation, most are unconditional and fail to maintain consistency with surgical actions and phases, lacking the surgical understanding and fine-grained guidance necessary for factual simulation. We address these challenges by proposing HieraSurg, a hierarchy-aware surgical video generation framework consisting of two specialized diffusion models. Given a surgical phase and an initial frame, HieraSurg first predicts future coarse-grained semantic changes through a segmentation prediction model. The final video is then generated by a second-stage model that augments these temporal segmentation maps with fine-grained visual features, leading to effective texture rendering and integration of semantic information in the video space. Our approach leverages surgical information at multiple levels of abstraction, including surgical phase, action triplets, and panoptic segmentation maps. The experimental results on Cholecystectomy Surgical Video Generation demonstrate that the model significantly outperforms prior work both quantitatively and qualitatively, showing strong generalization capabilities and the ability to generate higher frame-rate videos. The model exhibits particularly fine-grained adherence when provided with existing segmentation maps, suggesting its potential for practical surgical applications.

[116] Continual Self-Supervised Learning with Masked Autoencoders in Remote Sensing

Lars Möllenbrok,Behnood Rasti,Begüm Demir

Main category: cs.CV

TL;DR: CoSMAE is a new continual self-supervised learning method for reducing catastrophic forgetting in remote sensing.

Details Motivation: Existing CL methods need a large number of labeled training samples to be robust against catastrophic forgetting, which is costly and not always feasible in remote sensing. Method: CoSMAE has two components: data mixup and model mixup knowledge distillation. Data mixup retains information about previous data distributions by interpolating images from the current task with images from previous tasks. Model mixup knowledge distillation distills knowledge from past models and the current model by interpolating their weights to form the teacher for knowledge distillation. Result: Experiments show that CoSMAE improves over state-of-the-art CL methods by up to 4.94%. Conclusion: By regularizing at both the data and model levels, CoSMAE facilitates better generalization across tasks and reduces the risk of catastrophic forgetting. Abstract: The development of continual learning (CL) methods, which aim to learn new tasks in a sequential manner from the training data acquired continuously, has gained great attention in remote sensing (RS). The existing CL methods in RS, while learning new tasks, enhance robustness towards catastrophic forgetting. This is achieved by using a large number of labeled training samples, which is costly and not always feasible to gather in RS. To address this problem, we propose a novel continual self-supervised learning method in the context of masked autoencoders (denoted as CoSMAE). The proposed CoSMAE consists of two components: i) data mixup; and ii) model mixup knowledge distillation. Data mixup is associated with retaining information on previous data distributions by interpolating images from the current task with those from the previous tasks. Model mixup knowledge distillation is associated with distilling knowledge from past models and the current model simultaneously by interpolating their model weights to form a teacher for the knowledge distillation. The two components complement each other to regularize the MAE at the data and model levels to facilitate better generalization across tasks and reduce the risk of catastrophic forgetting. Experimental results show that CoSMAE achieves significant improvements of up to 4.94% over state-of-the-art CL methods applied to MAE. Our code is publicly available at: https://git.tu-berlin.de/rsim/CoSMAE.
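Both CoSMAE components rest on the same interpolation step, applied once to images and once to model weights. The sketch below illustrates only that step under simplifying assumptions: images and weights are flat lists of floats, the mixing coefficients (0.5 and 0.25) are arbitrary, and the names `interpolate`, `mixed_image`, and `teacher_weights` are hypothetical rather than taken from the released code.

```python
def interpolate(a, b, lam):
    """Element-wise lam * a + (1 - lam) * b for two equal-length lists."""
    return [lam * x + (1.0 - lam) * y for x, y in zip(a, b)]


# Data mixup: blend an image from the current task with one from a previous task.
current_image = [0.2, 0.8, 0.4]
previous_image = [0.6, 0.0, 0.4]
mixed_image = interpolate(current_image, previous_image, lam=0.5)

# Model mixup: blend current and past weights, parameter by parameter,
# to form the teacher used for knowledge distillation.
current_weights = {"encoder.w": [1.0, 2.0], "decoder.w": [0.0, 4.0]}
past_weights = {"encoder.w": [3.0, 2.0], "decoder.w": [2.0, 0.0]}
teacher_weights = {
    name: interpolate(current_weights[name], past_weights[name], lam=0.25)
    for name in current_weights
}
```

How the mixing coefficient is chosen, and how the teacher is then used in the distillation loss, is not stated in the abstract and is left out here.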

[117] DrishtiKon: Multi-Granular Visual Grounding for Text-Rich Document Images

Badri Vishal Kasuba,Parag Chaudhuri,Ganesh Ramakrishnan

Main category: cs.CV

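TL;DR: This paper introduces DrishtiKon, a new framework that combines multilingual OCR, large language models, and a novel region matching algorithm to improve the interpretability and accuracy of document understanding and visual question answering systems.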

Details Motivation: Visual grounding in text-rich document images is a critical yet underexplored challenge for document intelligence and visual question answering systems. Method: The approach integrates robust multilingual OCR, large language models, and a novel region matching algorithm to localize answers accurately. Result: Experiments show that the method achieves state-of-the-art grounding accuracy, with line-level granularity offering the best trade-off between precision and recall. Conclusion: The paper proposes DrishtiKon, a multi-granular visual grounding framework that enhances interpretability and trust in document intelligence and visual question answering systems. Experiments show state-of-the-art grounding accuracy, and the code and dataset have been released. Abstract: Visual grounding in text-rich document images is a critical yet underexplored challenge for document intelligence and visual question answering (VQA) systems. We present DrishtiKon, a multi-granular visual grounding framework designed to enhance interpretability and trust in VQA for complex, multilingual documents. Our approach integrates robust multi-lingual OCR, large language models, and a novel region matching algorithm to accurately localize answer spans at block, line, word, and point levels. We curate a new benchmark from the CircularsVQA test set, providing fine-grained, human-verified annotations across multiple granularities. Extensive experiments demonstrate that our method achieves state-of-the-art grounding accuracy, with line-level granularity offering the best trade-off between precision and recall. Ablation studies further highlight the benefits of multi-block and multi-line reasoning. Comparative evaluations with leading vision-language models reveal the limitations of current VLMs in precise localization, underscoring the effectiveness of our structured, alignment-based approach. Our findings pave the way for more robust and interpretable document understanding systems in real-world, text-centric scenarios. Code and dataset have been made available at https://github.com/kasuba-badri-vishal/DhrishtiKon.

[118] LLaVA-Pose: Enhancing Human Pose and Action Understanding via Keypoint-Integrated Instruction Tuning

Dewen Zhang,Tahir Hussain,Wangpeng An,Hayaru Shouno

Main category: cs.CV

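TL;DR: This paper proposes a new training approach for vision-language models that integrates human keypoint information, significantly improving performance on understanding human-centric scenes.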

Details Motivation: Current vision-language models perform inadequately on complex visual tasks involving human poses and actions because specialized vision-language instruction data is lacking. Method: The authors generate specialized vision-language instruction data by combining human keypoints with traditional visual features (such as captions and bounding boxes), and build a dataset of 200,328 samples to fine-tune the model. Result: On the E-HPAUB benchmark, the LLaVA-Pose model improves overall by 33.2% over the original LLaVA-1.5-7B model. Conclusion: Integrating keypoint-based data into vision-language models effectively improves performance on human-centric visual understanding tasks. Abstract: Current vision-language models (VLMs) are well-adapted for general visual understanding tasks. However, they perform inadequately when handling complex visual tasks related to human poses and actions due to the lack of specialized vision-language instruction-following data. We introduce a method for generating such data by integrating human keypoints with traditional visual features such as captions and bounding boxes, enabling more precise understanding of human-centric scenes. Our approach constructs a dataset comprising 200,328 samples tailored to fine-tune models for human-centric tasks, focusing on three areas: conversation, detailed description, and complex reasoning. We establish an Extended Human Pose and Action Understanding Benchmark (E-HPAUB) to assess model performance on human pose and action understanding. We fine-tune the LLaVA-1.5-7B model using this dataset and evaluate our resulting LLaVA-Pose model on the benchmark, achieving significant improvements. Experimental results show an overall improvement of 33.2% compared to the original LLaVA-1.5-7B model. These findings highlight the effectiveness of keypoint-integrated data in enhancing multimodal models for human-centric visual understanding. Code is available at https://github.com/Ody-trek/LLaVA-Pose.

[119] Holistic Surgical Phase Recognition with Hierarchical Input Dependent State Space Models

Haoyang Wu,Tsun-Hsuan Wang,Mathias Lechner,Ramin Hasani,Jennifer A. Eckhoff,Paul Pak,Ozanan R. Meireles,Guy Rosman,Yutong Ban,Daniela Rus

Main category: cs.CV

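TL;DR: This paper proposes an efficient surgical video analysis framework that uses state space models to handle long videos, and validates its strong performance experimentally.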

Details Motivation: Surgical workflow analysis is important in robot-assisted surgery, but existing transformer-based models are limited on long videos by their quadratic attention mechanism, so a more efficient method that captures both local and global dynamics is needed. Method: The study uses a hierarchical input-dependent state space model combined with a temporally consistent visual feature extractor, with a local-aggregation state space model block and a global-relation state space model block to capture local and global dynamics. Training uses a hybrid discrete-continuous supervision strategy. Result: Experiments show gains of +2.8%, +4.3%, and +12.9% on the Cholec80, MICCAI2016, and Heichole datasets respectively, a large margin over current state-of-the-art methods. Conclusion: The paper proposes a new state-space-model-based framework for holistic analysis of robot-assisted surgery videos and shows experimentally that it outperforms existing methods. Abstract: Surgical workflow analysis is essential in robot-assisted surgeries, yet the long duration of such procedures poses significant challenges for comprehensive video analysis. Recent approaches have predominantly relied on transformer models; however, their quadratic attention mechanism restricts efficient processing of lengthy surgical videos. In this paper, we propose a novel hierarchical input-dependent state space model that leverages the linear scaling property of state space models to enable decision making on full-length videos while capturing both local and global dynamics. Our framework incorporates a temporally consistent visual feature extractor, which appends a state space model head to a visual feature extractor to propagate temporal information. The proposed model consists of two key modules: a local-aggregation state space model block that effectively captures intricate local dynamics, and a global-relation state space model block that models temporal dependencies across the entire video. The model is trained using a hybrid discrete-continuous supervision strategy, where both signals of discrete phase labels and continuous phase progresses are propagated through the network. Experiments have shown that our method outperforms the current state-of-the-art methods by a large margin (+2.8% on Cholec80, +4.3% on MICCAI2016, and +12.9% on Heichole datasets). Code will be publicly available after paper acceptance.
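The "linear scaling property" mentioned in the abstract comes from the recurrent form of a state space model: each frame updates a fixed-size state once, so cost grows linearly with video length instead of quadratically as in full attention. The sketch below is a deliberately tiny scalar recurrence, not the paper's architecture. The update rule h_t = a(x_t) * h_{t-1} + b(x_t) * x_t and the functions `decay_fn` and `gain_fn` are illustrative assumptions standing in for learned, input-dependent parameters.

```python
def ssm_scan(inputs, decay_fn, gain_fn):
    """Run h_t = a(x_t) * h_{t-1} + b(x_t) * x_t over a sequence in one pass."""
    state = 0.0
    outputs = []
    for x in inputs:
        a = decay_fn(x)  # how much of the previous state to keep
        b = gain_fn(x)   # how strongly the current input is written
        state = a * state + b * x
        outputs.append(state)
    return outputs


features = [1.0, 0.0, 2.0, 0.0]  # stand-in for per-frame features

# Fixed parameters: a plain linear recurrence.
states = ssm_scan(features, decay_fn=lambda x: 0.5, gain_fn=lambda x: 1.0)

# Input-dependent decay: strong inputs overwrite more of the old state.
adaptive_states = ssm_scan(
    features,
    decay_fn=lambda x: 1.0 / (1.0 + abs(x)),
    gain_fn=lambda x: 1.0,
)
```

The loop touches each element exactly once and carries only `state` forward, which is what makes full-length videos tractable in this family of models.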

[120] PanSt3R: Multi-view Consistent Panoptic Segmentation

Lojze Zust,Yohann Cabon,Juliette Marrie,Leonid Antsfeld,Boris Chidlovskii,Jerome Revaud,Gabriela Csurka

Main category: cs.CV

TL;DR: PanSt3R is a new method for 3D scene panoptic segmentation that efficiently combines geometry prediction and segmentation in a single pass, outperforming existing methods in speed and effectiveness.

Details Motivation: The motivation is to overcome the limitations of existing approaches that rely on 2D panoptic segmentation for a fundamentally 3D and multi-view problem, and require expensive test-time optimization. Method: The method involves using an integrated approach called PanSt3R, which predicts 3D geometry and multi-view panoptic segmentation in one forward pass without test-time optimization. It builds on MUSt3R and includes semantic awareness and improved multi-view segmentation techniques. Result: PanSt3R achieves state-of-the-art performance on several benchmarks and is significantly faster than current methods while maintaining accuracy. Conclusion: The paper concludes that PanSt3R is a more efficient and effective method for 3D scene panoptic segmentation, as it is faster, scalable, and achieves state-of-the-art results compared to existing methods. Abstract: Panoptic segmentation of 3D scenes, involving the segmentation and classification of object instances in a dense 3D reconstruction of a scene, is a challenging problem, especially when relying solely on unposed 2D images. Existing approaches typically leverage off-the-shelf models to extract per-frame 2D panoptic segmentations, before optimizing an implicit geometric representation (often based on NeRF) to integrate and fuse the 2D predictions. We argue that relying on 2D panoptic segmentation for a problem inherently 3D and multi-view is likely suboptimal as it fails to leverage the full potential of spatial relationships across views. In addition to requiring camera parameters, these approaches also necessitate computationally expensive test-time optimization for each scene. Instead, in this work, we propose a unified and integrated approach PanSt3R, which eliminates the need for test-time optimization by jointly predicting 3D geometry and multi-view panoptic segmentation in a single forward pass. 
Our approach builds upon recent advances in 3D reconstruction, specifically upon MUSt3R, a scalable multi-view version of DUSt3R, and enhances it with semantic awareness and multi-view panoptic segmentation capabilities. We additionally revisit the standard post-processing mask merging procedure and introduce a more principled approach for multi-view segmentation. We also introduce a simple method for generating novel-view predictions based on the predictions of PanSt3R and vanilla 3DGS. Overall, the proposed PanSt3R is conceptually simple, yet fast and scalable, and achieves state-of-the-art performance on several benchmarks, while being orders of magnitude faster than existing methods.

[121] Generalizable Neural Electromagnetic Inverse Scattering

Yizhe Cheng,Chunxun Tian,Haoru Wang,Wentao Zhu,Xiaoxuan Ma,Yizhou Wang

Main category: cs.CV

TL;DR: This paper introduces a novel physics-informed framework for solving electromagnetic inverse scattering problems, enabling generalizable and robust predictions of relative permittivity with strong performance even under sparse transmitter setups.

Details Motivation: EISPs are inherently ill-posed and highly nonlinear, posing challenges such as poor generalization and failure under sparse transmitter setups. Existing methods like Img-Interiors lack robustness and require case-specific optimization. Method: The paper reformulates EISP as a two-stage inverse transmission-scattering process, introducing induced current as an intermediate representation, and proposes a current estimator and permittivity solver working in an end-to-end manner. Result: The proposed method outperforms state-of-the-art approaches in reconstruction accuracy, generalization, and robustness, particularly maintaining performance under transmitter sparsity. Conclusion: The paper proposes a new physics-driven framework for EISP that allows for generalizable and robust predictions of relative permittivity, offering a fundamentally new perspective on electromagnetic inverse scattering. Abstract: Solving Electromagnetic Inverse Scattering Problems (EISP) is fundamental in applications such as medical imaging, where the goal is to reconstruct the relative permittivity from scattered electromagnetic field. This inverse process is inherently ill-posed and highly nonlinear, making it particularly challenging. A recent machine learning-based approach, Img-Interiors, shows promising results by leveraging continuous implicit functions. However, it requires case-specific optimization, lacks generalization to unseen data, and fails under sparse transmitter setups (e.g., with only one transmitter). To address these limitations, we revisit EISP from a physics-informed perspective, reformulating it as a two stage inverse transmission-scattering process. This formulation reveals the induced current as a generalizable intermediate representation, effectively decoupling the nonlinear scattering process from the ill-posed inverse problem. 
Built on this insight, we propose the first generalizable physics-driven framework for EISP, comprising a current estimator and a permittivity solver, working in an end-to-end manner. The current estimator explicitly learns the induced current as a physical bridge between the incident and scattered field, while the permittivity solver computes the relative permittivity directly from the estimated induced current. This design enables data-driven training and generalizable feed-forward prediction of relative permittivity on unseen data while maintaining strong robustness to transmitter sparsity. Extensive experiments show that our method outperforms state-of-the-art approaches in reconstruction accuracy, generalization, and robustness. This work offers a fundamentally new perspective on electromagnetic inverse scattering and represents a major step toward cost-effective practical solutions for electromagnetic imaging.

[122] ShotBench: Expert-Level Cinematic Understanding in Vision-Language Models

Hongbo Liu,Jingwen He,Yi Jin,Dian Zheng,Yuhao Dong,Fan Zhang,Ziqi Huang,Yinan He,Yangguang Li,Weichao Chen,Yu Qiao,Wanli Ouyang,Shengjie Zhao,Ziwei Liu

Main category: cs.CV

TL;DR: This paper introduces ShotBench and ShotVL, advancing cinematic language understanding in AI by addressing gaps in visual-language models' comprehension of film grammar.

Details Motivation: There is a lack of robust evaluation for Vision-Language Models' ability to comprehend nuanced cinematic grammar, which limits visual comprehension and AI-assisted video generation. Method: The authors created ShotBench and ShotQA, evaluated 24 VLMs, and developed ShotVL through supervised fine-tuning and Group Relative Policy Optimization. Result: Even the best-performing VLMs showed substantial limitations on ShotBench, achieving less than 60% average accuracy, particularly struggling with spatial reasoning and visual cues. Conclusion: The paper concludes that ShotVL, developed using ShotQA, significantly outperforms existing models on ShotBench, establishing a new state-of-the-art in cinematic language understanding. Abstract: Cinematography, the fundamental visual language of film, is essential for conveying narrative, emotion, and aesthetic quality. While recent Vision-Language Models (VLMs) demonstrate strong general visual understanding, their proficiency in comprehending the nuanced cinematic grammar embedded within individual shots remains largely unexplored and lacks robust evaluation. This critical gap limits both fine-grained visual comprehension and the precision of AI-assisted video generation. To address this, we introduce ShotBench, a comprehensive benchmark specifically designed for cinematic language understanding. It features over 3.5k expert-annotated QA pairs from images and video clips, meticulously curated from over 200 acclaimed (predominantly Oscar-nominated) films and spanning eight key cinematography dimensions. Our evaluation of 24 leading VLMs on ShotBench reveals their substantial limitations: even the top-performing model achieves less than 60% average accuracy, particularly struggling with fine-grained visual cues and complex spatial reasoning. To catalyze advancement in this domain, we construct ShotQA, a large-scale multimodal dataset comprising approximately 70k cinematic QA pairs. 
Leveraging ShotQA, we develop ShotVL through supervised fine-tuning and Group Relative Policy Optimization. ShotVL significantly outperforms all existing open-source and proprietary models on ShotBench, establishing new state-of-the-art performance. We open-source our models, data, and code to foster rapid progress in this crucial area of AI-driven cinematic understanding and generation.
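Group Relative Policy Optimization scores each sampled answer relative to the other answers drawn for the same prompt, rather than against a learned value function. The sketch below shows only that advantage computation. The reward values (1 for a correct answer, 0 otherwise), the use of the population standard deviation, and the name `group_relative_advantages` are assumptions for illustration; the abstract does not describe ShotVL's reward design.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of samples for the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps guards against division by zero when all rewards are equal.
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled answers to one cinematography question: two right, two wrong.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)

# If every sample gets the same reward, no answer is preferred.
flat = group_relative_advantages([1.0, 1.0, 1.0])
```

Answers that beat the group average receive positive advantages and are reinforced, while below-average answers are pushed down.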

[123] CoPa-SG: Dense Scene Graphs with Parametric and Proto-Relations

Julian Lorenz,Mrunmai Phatak,Robin Schön,Katja Ludwig,Nico Hörmann,Annemarie Friedrich,Rainer Lienhart

Main category: cs.CV

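TL;DR: This paper presents CoPa-SG, a new high-precision synthetic scene graph dataset, and introduces two new scene graph concepts, parametric relations and proto-relations, to make scene graphs more expressive and more widely applicable.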

Details Motivation: Current research is still held back by the lack of accurate scene graph data, so a new, more fine-grained scene graph representation is needed to improve scene understanding. Method: The authors present a synthetic scene graph dataset called CoPa-SG and introduce two new fundamental scene graph concepts: parametric relations and proto-relations. Result: CoPa-SG is used to compare the performance of various scene graph generation models and to show how the new relation types improve planning and reasoning in downstream applications. Conclusion: CoPa-SG addresses the inaccuracy of existing scene graph data and, by introducing parametric and proto-relations, increases the expressiveness and application potential of scene graphs. Abstract: 2D scene graphs provide a structural and explainable framework for scene understanding. However, current work still struggles with the lack of accurate scene graph data. To overcome this data bottleneck, we present CoPa-SG, a synthetic scene graph dataset with highly precise ground truth and exhaustive relation annotations between all objects. Moreover, we introduce parametric and proto-relations, two new fundamental concepts for scene graphs. The former provides a much more fine-grained representation than its traditional counterpart by enriching relations with additional parameters such as angles or distances. The latter encodes hypothetical relations in a scene graph and describes how relations would form if new objects are placed in the scene. Using CoPa-SG, we compare the performance of various scene graph generation models. We demonstrate how our new relation types can be integrated in downstream applications to enhance planning and reasoning capabilities.
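One way to picture the two relation types is as extra fields on an ordinary subject-predicate-object triple. The data structure below is a hypothetical sketch, not the CoPa-SG schema: the field names `params` and `hypothetical`, the parameter keys, and the placeholder subject `<new object>` are all assumptions.

```python
from dataclasses import dataclass, field


@dataclass
class Relation:
    subject: str
    predicate: str
    obj: str
    # Parametric relations carry extra numbers such as angles or distances.
    params: dict = field(default_factory=dict)
    # Proto-relations describe where a relation would hold for a new object.
    hypothetical: bool = False


graph = [
    Relation("cup", "on", "table"),  # traditional relation
    Relation("chair", "next to", "table",
             params={"distance_m": 0.4, "angle_deg": 90.0}),
    Relation("<new object>", "on", "table",
             params={"region": "free tabletop area"}, hypothetical=True),
]

parametric = [r for r in graph if r.params and not r.hypothetical]
proto = [r for r in graph if r.hypothetical]
```

A planner could query `proto` to find where an object may be placed, and `parametric` to reason about exact spacing, which matches the downstream use the abstract mentions.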

[124] ToosiCubix: Monocular 3D Cuboid Labeling via Vehicle Part Annotations

Behrooz Nasihatkon,Hossein Resani,Amirreza Mehrzadian

Main category: cs.CV

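TL;DR: ToosiCubix is an efficient, low-cost 3D vehicle annotation method based on monocular images and camera parameters that produces high-quality cuboid annotations for large-scale datasets.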

Details Motivation: Existing 3D cuboid annotation methods for vehicles rely on expensive, carefully calibrated camera-LiDAR or stereo setups, which limits their use for large-scale data collection. Method: Specific features (such as wheels, car badge, and symmetries) are annotated and formulated as an optimization problem, solved with a coordinate descent strategy that alternates between Perspective-n-Points (PnP) and least-squares subproblems, with probabilistic size priors introduced to handle ambiguities. Result: ToosiCubix needs only about 10 user clicks per vehicle, and validation on the KITTI and Cityscapes3D datasets shows that it provides high-quality 3D cuboid annotations. Conclusion: ToosiCubix is a cost-effective and scalable solution for high-quality 3D cuboid annotation using monocular images and camera parameters. Abstract: Many existing methods for 3D cuboid annotation of vehicles rely on expensive and carefully calibrated camera-LiDAR or stereo setups, limiting their accessibility for large-scale data collection. We introduce ToosiCubix, a simple yet powerful approach for annotating ground-truth cuboids using only monocular images and intrinsic camera parameters. Our method requires only about 10 user clicks per vehicle, making it highly practical for adding 3D annotations to existing datasets originally collected without specialized equipment. By annotating specific features (e.g., wheels, car badge, symmetries) across different vehicle parts, we accurately estimate each vehicle's position, orientation, and dimensions up to a scale ambiguity (8 DoF). The geometric constraints are formulated as an optimization problem, which we solve using a coordinate descent strategy, alternating between Perspective-n-Points (PnP) and least-squares subproblems. To handle common ambiguities such as scale and unobserved dimensions, we incorporate probabilistic size priors, enabling 9 DoF cuboid placements. We validate our annotations against the KITTI and Cityscapes3D datasets, demonstrating that our method offers a cost-effective and scalable solution for high-quality 3D cuboid annotation.

[125] CA-I2P: Channel-Adaptive Registration Network with Global Optimal Selection

Zhixin Cheng,Jiacheng Deng,Xinjun Li,Xiaotian Yin,Bohao Liao,Baoqun Yin,Wenfei Yang,Tianzhu Zhang

Main category: cs.CV

TL;DR: This paper proposes a novel approach using CAA and GOS modules to improve image-to-point cloud registration accuracy by addressing feature mismatch and redundancy issues.

Details Motivation: The motivation stems from the challenges faced by detection-free methods in cross-modal feature matching due to differences in feature channel attention and redundant correspondences caused by similar scene structures. Method: The method involves introducing two modules: Channel Adaptive Adjustment Module (CAA) for enhancing intra-modal features and suppressing cross-modal sensitivity, and Global Optimal Selection Module (GOS) for replacing local selection with global optimization. Result: Experiments on RGB-D Scenes V2 and 7-Scenes datasets demonstrate the effectiveness of the proposed method, achieving state-of-the-art performance in registration accuracy. Conclusion: The paper concludes that the proposed CAA and GOS modules significantly enhance the accuracy of image-to-point cloud registration by improving feature matching and reducing redundancy. Abstract: Detection-free methods typically follow a coarse-to-fine pipeline, extracting image and point cloud features for patch-level matching and refining dense pixel-to-point correspondences. However, differences in feature channel attention between images and point clouds may lead to degraded matching results, ultimately impairing registration accuracy. Furthermore, similar structures in the scene could lead to redundant correspondences in cross-modal matching. To address these issues, we propose Channel Adaptive Adjustment Module (CAA) and Global Optimal Selection Module (GOS). CAA enhances intra-modal features and suppresses cross-modal sensitivity, while GOS replaces local selection with global optimization. Experiments on RGB-D Scenes V2 and 7-Scenes demonstrate the superiority of our method, achieving state-of-the-art performance in image-to-point cloud registration.
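The difference between local selection and global optimization can be shown on a small similarity matrix. The abstract does not say how GOS is implemented, so the sketch below substitutes a brute-force one-to-one assignment purely for illustration. The matrix values and the function names `local_selection` and `global_selection` are hypothetical, and the brute-force search is only practical for tiny square matrices.

```python
from itertools import permutations


def local_selection(scores):
    """Each image patch independently picks its best-scoring point patch."""
    return [max(range(len(row)), key=row.__getitem__) for row in scores]


def global_selection(scores):
    """Pick the one-to-one assignment with the highest total score."""
    n = len(scores)
    best = max(
        permutations(range(n)),
        key=lambda perm: sum(scores[i][perm[i]] for i in range(n)),
    )
    return list(best)


# Rows: image patches. Columns: point-cloud patches.
# Patches 0 and 1 look alike, so both prefer column 0 when chosen locally.
scores = [
    [0.90, 0.80, 0.10],
    [0.85, 0.20, 0.10],
    [0.10, 0.30, 0.70],
]

local_matches = local_selection(scores)    # column 0 is chosen twice
global_matches = global_selection(scores)  # every column is used once
```

With local selection, similar structures produce redundant correspondences to the same target. Scoring the whole assignment at once resolves the conflict by sending patch 0 to its second-best but still strong match.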

[126] GenFlow: Interactive Modular System for Image Generation

Duc-Hung Nguyen,Huu-Phuc Huynh,Minh-Triet Tran,Trung-Nghia Le

Main category: cs.CV

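TL;DR: This paper introduces GenFlow, a new modular framework whose intuitive interface and automation features make generative art tools easy to use for users of all skill levels.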

Details Motivation: Generative art has enormous creative potential, but that potential remains untapped because advanced architectural concepts and computational workflows require technical expertise. Method: The authors develop a node-based editor and an intelligent assistant powered by natural language processing to lower technical barriers and automate deployment. Result: A user study shows that GenFlow optimizes workflows, reduces task completion times, and improves user understanding through its intuitive interface and adaptive features. Conclusion: GenFlow is a modular framework designed to simplify generative art workflows so that users of all skill levels can work with advanced tools. Abstract: Generative art unlocks boundless creative possibilities, yet its full potential remains untapped due to the technical expertise required for advanced architectural concepts and computational workflows. To bridge this gap, we present GenFlow, a novel modular framework that empowers users of all skill levels to generate images with precision and ease. Featuring a node-based editor for seamless customization and an intelligent assistant powered by natural language processing, GenFlow transforms the complexity of workflow creation into an intuitive and accessible experience. By automating deployment processes and minimizing technical barriers, our framework makes cutting-edge generative art tools available to everyone. A user study demonstrated GenFlow's ability to optimize workflows, reduce task completion times, and enhance user understanding through its intuitive interface and adaptive features. These results position GenFlow as a groundbreaking solution that redefines accessibility and efficiency in the realm of generative art.

[127] FastRef: Fast Prototype Refinement for Few-Shot Industrial Anomaly Detection

Long Tian,Yufei Li,Yuyang Dai,Wenchao Chen,Xiyang Liu,Bo Chen

Main category: cs.CV

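TL;DR: FastRef is a novel and efficient prototype refinement framework for few-shot industrial anomaly detection (FS-IAD) that improves prototypes through an iterative two-stage process to boost detection performance.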

Details Motivation: Existing FS-IAD methods focus mainly on deriving prototypes from limited normal samples, but typically neglect to systematically incorporate query image statistics to make the prototypes more representative. Method: FastRef refines prototypes through an iterative two-stage process: 1) transferring characteristics of the query features to the prototypes via an optimizable transformation matrix; 2) suppressing anomalies through prototype alignment. Optimal transport (OT) is used to measure and minimize the gap between prototypes and their refined counterparts for anomaly suppression. Result: FastRef is integrated with four competitive prototype-based FS-IAD methods (PatchCore, FastRecon, WinCLIP, and AnomalyDINO), and extensive experiments on four benchmark datasets (MVTec, ViSA, MPDD, and RealIAD) demonstrate its effectiveness and computational efficiency under 1/2/4-shot settings. Conclusion: FastRef offers an effective solution for FS-IAD, particularly in data-scarce environments, by refining prototypes to significantly improve anomaly detection performance. Abstract: Few-shot industrial anomaly detection (FS-IAD) presents a critical challenge for practical automated inspection systems operating in data-scarce environments. While existing approaches predominantly focus on deriving prototypes from limited normal samples, they typically neglect to systematically incorporate query image statistics to enhance prototype representativeness. To address this issue, we propose FastRef, a novel and efficient prototype refinement framework for FS-IAD. Our method operates through an iterative two-stage process: (1) characteristic transfer from query features to prototypes via an optimizable transformation matrix, and (2) anomaly suppression through prototype alignment. The characteristic transfer is achieved through linear reconstruction of query features from prototypes, while the anomaly suppression addresses a key observation in FS-IAD that unlike conventional IAD with abundant normal prototypes, the limited-sample setting makes anomaly reconstruction more probable. Therefore, we employ optimal transport (OT) for non-Gaussian sampled features to measure and minimize the gap between prototypes and their refined counterparts for anomaly suppression. For comprehensive evaluation, we integrate FastRef with four competitive prototype-based FS-IAD methods: PatchCore, FastRecon, WinCLIP, and AnomalyDINO. Extensive experiments across four benchmark datasets of MVTec, ViSA, MPDD and RealIAD demonstrate both the effectiveness and computational efficiency of our approach under 1/2/4-shots.
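The first stage, "linear reconstruction of query features from prototypes", can be read as fitting a matrix W so that W times the prototype matrix approximates the query features. The sketch below fits W with plain gradient descent on a squared-error objective. It is an assumed reading of that single step, not FastRef itself: the optimal-transport alignment stage is left out, and the function names, learning rate, and step count are hypothetical.

```python
def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    return [list(col) for col in zip(*a)]


def fit_transform(query, prototypes, lr=0.1, steps=200):
    """Fit W (n_query x n_proto) so that W @ prototypes approximates query."""
    n, m = len(query), len(prototypes)
    w = [[0.0] * m for _ in range(n)]
    p_t = transpose(prototypes)
    for _ in range(steps):
        recon = matmul(w, prototypes)
        residual = [[q - r for q, r in zip(q_row, r_row)]
                    for q_row, r_row in zip(query, recon)]
        # Descent direction for 0.5 * ||query - W @ prototypes||^2.
        step = matmul(residual, p_t)
        w = [[w_ij + lr * s_ij for w_ij, s_ij in zip(w_row, s_row)]
             for w_row, s_row in zip(w, step)]
    return w


prototypes = [[1.0, 0.0], [0.0, 1.0]]  # two prototype features
query = [[0.5, 0.5], [2.0, 0.0]]       # two query features
w = fit_transform(query, prototypes)
reconstruction = matmul(w, prototypes)
```

The abstract notes that in the few-shot setting such a reconstruction can also reproduce anomalies, which is why the method pairs it with a separate suppression stage.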

[128] Curve-Aware Gaussian Splatting for 3D Parametric Curve Reconstruction

Zhirui Gao,Renjiao Yi,Yaqiao Dai,Xuening Zhu,Wei Chen,Chenyang Zhu,Kai Xu

Main category: cs.CV

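TL;DR: This paper proposes an end-to-end framework for reconstructing 3D parametric curves directly from multi-view edge maps. Unlike existing two-stage methods, it eliminates the error accumulation caused by the optimization gap between disconnected stages, yielding cleaner and more robust reconstructions while significantly reducing the parameter count during training and improving both efficiency and performance.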

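Details Motivation: Existing two-stage methods have an inherent optimization gap between edge point cloud reconstruction and parametric curve fitting, which causes error accumulation. In addition, parametric curves are not inherently suited to rendering-based multi-view optimization, so a complementary representation is needed that preserves their geometric properties while supporting differentiable rendering. Method: The paper proposes a new one-stage method that optimizes 3D parametric curves directly from 2D edge maps. It introduces a bi-directional coupling mechanism that ties parametric curves to edge-oriented Gaussian components, forming a curve-aware Gaussian representation (CurveGaussian) that enables differentiable rendering of 3D curves. A dynamically adaptive topology optimization framework is also applied during training, refining curve structures through linearization, merging, splitting, and pruning operations. Result: Comprehensive evaluations on the ABC dataset and real-world benchmarks show that the one-stage method outperforms two-stage methods, particularly in producing cleaner and more robust reconstructions. Because it optimizes parametric curves directly, the method also significantly reduces the parameter count during training, achieving higher efficiency and better performance. Conclusion: The proposed one-stage method effectively addresses the error accumulation of existing two-stage methods and, through CurveGaussian and the dynamically adaptive topology optimization framework, achieves more efficient and better-performing 3D parametric curve reconstruction. Abstract: This paper presents an end-to-end framework for reconstructing 3D parametric curves directly from multi-view edge maps. Contrasting with existing two-stage methods that follow a sequential "edge point cloud reconstruction and parametric curve fitting" pipeline, our one-stage approach optimizes 3D parametric curves directly from 2D edge maps, eliminating error accumulation caused by the inherent optimization gap between disconnected stages. However, parametric curves inherently lack suitability for rendering-based multi-view optimization, necessitating a complementary representation that preserves their geometric properties while enabling differentiable rendering. We propose a novel bi-directional coupling mechanism between parametric curves and edge-oriented Gaussian components. This tight correspondence formulates a curve-aware Gaussian representation, CurveGaussian, that enables differentiable rendering of 3D curves, allowing direct optimization guided by multi-view evidence. Furthermore, we introduce a dynamically adaptive topology optimization framework during training to refine curve structures through linearization, merging, splitting, and pruning operations. Comprehensive evaluations on the ABC dataset and real-world benchmarks demonstrate our one-stage method's superiority over two-stage alternatives, particularly in producing cleaner and more robust reconstructions. 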
Additionally, by directly optimizing parametric curves, our method significantly reduces the parameter count during training, achieving both higher efficiency and superior performance compared to existing approaches.

[129] XVerse: Consistent Multi-Subject Control of Identity and Semantic Attributes via DiT Modulation

Bowen Chen,Mengyi Zhao,Haomiao Sun,Li Chen,Xu Wang,Kang Du,Xinglong Wu

Main category: cs.CV

TL;DR: XVerse improves multi-subject text-to-image generation by allowing precise and independent control of subject attributes.

Details Motivation: To achieve fine-grained control over subject identity and semantic attributes without undermining the editability and coherence of Diffusion Transformers (DiTs). Method: Transforming reference images into offsets for token-specific text-stream modulation. Result: XVerse offers high-fidelity, editable multi-subject image synthesis with robust control over individual subject characteristics and semantic attributes. Conclusion: XVerse allows for precise and independent control of specific subjects in text-to-image generation, improving personalized and complex scene generation capabilities. Abstract: Achieving fine-grained control over subject identity and semantic attributes (pose, style, lighting) in text-to-image generation, particularly for multiple subjects, often undermines the editability and coherence of Diffusion Transformers (DiTs). Many approaches introduce artifacts or suffer from attribute entanglement. To overcome these challenges, we propose a novel multi-subject controlled generation model XVerse. By transforming reference images into offsets for token-specific text-stream modulation, XVerse allows for precise and independent control for specific subject without disrupting image latents or features. Consequently, XVerse offers high-fidelity, editable multi-subject image synthesis with robust control over individual subject characteristics and semantic attributes. This advancement significantly improves personalized and complex scene generation capabilities.

[130] EndoFlow-SLAM: Real-Time Endoscopic SLAM with Flow-Constrained Gaussian Splatting

Taoyu Wu,Yiyi Miao,Zhuoxiao Li,Haocheng Zhao,Kang Dang,Jionglong Su,Limin Yu,Haoang Li

Main category: cs.CV

TL;DR: This paper proposes an improved 3DGS-based SLAM method for endoscopic scenarios by incorporating optical flow loss and depth regularization, leading to superior performance in 3D reconstruction and camera pose estimation.

Details Motivation: The motivation stems from challenges in endoscopic scenarios, such as photometric inconsistencies due to non-Lambertian surfaces and dynamic motion caused by breathing, which affect the performance of SLAM systems. Method: The method introduces optical flow loss as a geometric constraint and incorporates a depth regularization strategy to address photometric inconsistencies. It also improves the 3DGS refinement process by focusing on frames with suboptimal rendering quality. Result: Experiments on the C3VD static dataset and the StereoMIS dynamic dataset show that the proposed method achieves better results in novel view synthesis and pose estimation compared to state-of-the-art approaches. Conclusion: The paper concludes that their proposed method, which integrates optical flow loss and depth regularization into 3DGS-based SLAM, outperforms existing methods in novel view synthesis and pose estimation for both static and dynamic surgical scenes. Abstract: Efficient three-dimensional reconstruction and real-time visualization are critical in surgical scenarios such as endoscopy. In recent years, 3D Gaussian Splatting (3DGS) has demonstrated remarkable performance in efficient 3D reconstruction and rendering. Most 3DGS-based Simultaneous Localization and Mapping (SLAM) methods only rely on the appearance constraints for optimizing both 3DGS and camera poses. However, in endoscopic scenarios, challenges such as photometric inconsistencies caused by non-Lambertian surfaces and dynamic motion from breathing affect the performance of SLAM systems. To address these issues, we additionally introduce optical flow loss as a geometric constraint, which effectively constrains both the 3D structure of the scene and the camera motion. Furthermore, we propose a depth regularisation strategy to mitigate the problem of photometric inconsistencies and ensure the validity of 3DGS depth rendering in endoscopic scenes. 
In addition, to improve scene representation in the SLAM system, we improve the 3DGS refinement strategy by focusing on viewpoints corresponding to keyframes with suboptimal rendering quality, achieving better rendering results. Extensive experiments on the C3VD static dataset and the StereoMIS dynamic dataset demonstrate that our method outperforms existing state-of-the-art methods in novel view synthesis and pose estimation, exhibiting high performance in both static and dynamic surgical scenes. The source code will be publicly available upon paper acceptance.

[131] HyperSORT: Self-Organising Robust Training with hyper-networks

Samuel Joutard,Marijn Stollenga,Marc Balle Sanchez,Mohammad Farid Azampour,Raphael Prevost

Main category: cs.CV

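TL;DR: This paper proposes HyperSORT, a new medical image segmentation framework that uses a hyper-network to learn UNet parameters, effectively identifying and handling systematic biases and errors in datasets.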

Details Motivation: Medical imaging datasets contain heterogeneous biases (such as erroneous labels and inconsistent annotation styles) that degrade the performance of deep segmentation networks, and identifying and characterizing these biases is a challenging task. Method: The paper introduces the HyperSORT framework, which uses a hyper-network to predict UNet parameters from latent vectors that represent both image and annotation variability. Result: The method is validated on two public 3D abdominal CT datasets. Experiments show that HyperSORT creates a structured mapping of the dataset that can be used to identify relevant systematic biases and erroneous samples. Conclusion: HyperSORT can identify and characterize systematic biases in medical imaging datasets, thereby improving the performance of deep segmentation networks. Abstract: Medical imaging datasets often contain heterogeneous biases ranging from erroneous labels to inconsistent labeling styles. Such biases can negatively impact the performance of deep segmentation networks. Yet, the identification and characterization of such biases is a particularly tedious and challenging task. In this paper, we introduce HyperSORT, a framework using a hyper-network predicting UNets' parameters from latent vectors representing both the image and annotation variability. The hyper-network parameters and the latent vector collection corresponding to each data sample from the training set are jointly learned. Hence, instead of optimizing a single neural network to fit a dataset, HyperSORT learns a complex distribution of UNet parameters where low density areas can capture noise-specific patterns while larger modes robustly segment organs in differentiated but meaningful manners. We validate our method on two 3D abdominal CT public datasets: first a synthetically perturbed version of the AMOS dataset, and TotalSegmentator, a large scale dataset containing real unknown biases and errors. Our experiments show that HyperSORT creates a structured mapping of the dataset allowing the identification of relevant systematic biases and erroneous samples. Latent space clusters yield UNet parameters performing the segmentation task in accordance with the underlying learned systematic bias. The code and our analysis of the TotalSegmentator dataset are made available: https://github.com/ImFusionGmbH/HyperSORT

[132] Benchmarking Deep Learning and Vision Foundation Models for Atypical vs. Normal Mitosis Classification with Cross-Dataset Evaluation

Sweta Banerjee,Viktoria Weiss,Taryn A. Donovan,Rutger A. Fick,Thomas Conrad,Jonas Ammeling,Nils Porsche,Robert Klopfleisch,Christopher Kaltenecker,Katharina Breininger,Marc Aubreville,Christof A. Bertram

Main category: cs.CV

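TL;DR: This paper presents a deep learning benchmark study on atypical mitosis classification in breast cancer, shows how transfer learning and LoRA fine-tuning improve model performance, and releases the associated code and datasets.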

Details Motivation: Atypical mitoses are abnormalities in the cell division process that can serve as an independent prognostic marker of tumor malignancy. Identifying them remains challenging, however, because of their low prevalence, subtle morphological differences, low inter-rater agreement among pathologists, and class imbalance in datasets. Method: Building on the AMi-Br dataset, the study compares deep learning approaches for automated atypical mitotic figure (AMF) classification, including baseline models, foundation models with linear probing, and foundation models fine-tuned with low-rank adaptation (LoRA). Two new hold-out AMF datasets, AtNorM-Br and AtNorM-MD, are also introduced for rigorous evaluation. Result: The experiments reach average balanced accuracy values of up to 0.8135, 0.7696, and 0.7705 on the AMi-Br, AtNorM-Br, and AtNorM-MD datasets, respectively, with LoRA-based adaptation of the Virchow line of foundation models performing particularly well. Conclusion: The study shows that atypical mitosis classification, although a challenging problem, can be effectively addressed with recent advances in transfer learning and model fine-tuning. LoRA fine-tuning of the Virchow line of foundation models performs especially well on the classification task. Abstract: Atypical mitoses mark a deviation in the cell division process that can be an independent prognostically relevant marker for tumor malignancy. However, their identification remains challenging due to low prevalence, at times subtle morphological differences from normal mitoses, low inter-rater agreement among pathologists, and class imbalance in datasets. Building on the Atypical Mitosis dataset for Breast Cancer (AMi-Br), this study presents a comprehensive benchmark comparing deep learning approaches for automated atypical mitotic figure (AMF) classification, including baseline models, foundation models with linear probing, and foundation models fine-tuned with low-rank adaptation (LoRA). For rigorous evaluation, we further introduce two new hold-out AMF datasets - AtNorM-Br, a dataset of mitoses from the TCGA breast cancer cohort, and AtNorM-MD, a multi-domain dataset of mitoses from the MIDOG++ training set. We found average balanced accuracy values of up to 0.8135, 0.7696, and 0.7705 on the in-domain AMi-Br and the out-of-domain AtNorM-Br and AtNorM-MD datasets, respectively, with the results being particularly good for LoRA-based adaptation of the Virchow-line of foundation models. Our work shows that atypical mitosis classification, while being a challenging problem, can be effectively addressed through the use of recent advances in transfer learning and model fine-tuning techniques. 
We make available all code and data used in this paper in this github repository: https://github.com/DeepMicroscopy/AMi-Br_Benchmark.
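The benchmark above reports balanced accuracy, which is the usual choice when classes are imbalanced. As a minimal sketch, assuming the standard definition (the mean of per-class recall) and using made-up toy labels rather than any data from the paper, the metric can be computed like this:

```python
from collections import defaultdict


def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall, so each class counts equally regardless of size."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        if t == p:
            correct[t] += 1
    recalls = [correct[c] / total[c] for c in total]
    return sum(recalls) / len(recalls)


# Toy, invented labels: 8 normal (0) and 2 atypical (1) mitotic figures.
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]  # a classifier that always says "normal"

plain_accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(plain_accuracy)                     # 0.8
print(balanced_accuracy(y_true, y_pred))  # 0.5
```

A classifier that ignores the rare class still scores 0.8 on plain accuracy here but only 0.5 on balanced accuracy, which is why the latter is the more informative number for low-prevalence classes.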

[133] Controllable 3D Placement of Objects with Scene-Aware Diffusion Models

Mohamed Omran,Dimitris Kalatzis,Jens Petersen,Amirhossein Habibian,Auke Wiggers

Main category: cs.CV

TL;DR: This paper introduces a novel method for precise object placement in images by combining visual maps and coarse masks, avoiding complex inpainting prompts.

Details Motivation: Precise object placement in images remains challenging with existing methods, which often require detailed inpainting masks or prompts. Method: A conditioning signal was designed using a visual map and coarse object masks, building on an inpainting model to maintain the background while enabling object placement. Result: The proposed method achieves high-quality object placement, allowing for changes in shape and orientation while maintaining background integrity. Conclusion: The paper concludes that a visual map combined with coarse object masks can enable high-quality, precise object placement in image editing without modifying the background. Abstract: Image editing approaches have become more powerful and flexible with the advent of powerful text-conditioned generative models. However, placing objects in an environment with a precise location and orientation still remains a challenge, as this typically requires carefully crafted inpainting masks or prompts. In this work, we show that a carefully designed visual map, combined with coarse object masks, is sufficient for high quality object placement. We design a conditioning signal that resolves ambiguities, while being flexible enough to allow for changing of shapes or object orientations. By building on an inpainting model, we leave the background intact by design, in contrast to methods that model objects and background jointly. We demonstrate the effectiveness of our method in the automotive setting, where we compare different conditioning signals in novel object placement tasks. These tasks are designed to measure edit quality not only in terms of appearance, but also in terms of pose and location accuracy, including cases that require non-trivial shape changes. Lastly, we show that fine location control can be combined with appearance control to place existing objects in precise locations in a scene.

[134] A Comprehensive Dataset for Underground Miner Detection in Diverse Scenario

Cyrus Addy,Ajay Kumar Gurumadaiah,Yixiang Gao,Kwame Awuah-Offei

Main category: cs.CV

TL;DR: This paper introduces a new thermal imaging dataset for miner detection in underground mining emergencies, demonstrating the potential of deep learning algorithms like YOLO and RT-DETR for improving safety.

Details Motivation: Underground mining operations face significant safety challenges, especially during emergency situations where reliable miner detection is crucial. Existing deep learning solutions lack comprehensive training datasets specific to underground mining environments. Method: The authors created a novel thermal imaging dataset by systematically capturing thermal imagery of various mining activities and scenarios. They evaluated state-of-the-art object detection algorithms like YOLOv8, YOLOv10, YOLO11, and RT-DETR to establish baseline performance metrics. Result: A robust thermal imaging dataset was developed specifically for miner detection, enabling the testing and validation of advanced detection algorithms. Baseline performance metrics were established using modern object detection models. Conclusion: The paper concludes that thermal imaging can be a feasible solution for miner detection in underground mining emergencies, laying the foundation for future research and development of reliable detection systems. Abstract: Underground mining operations face significant safety challenges that make emergency response capabilities crucial. While robots have shown promise in assisting with search and rescue operations, their effectiveness depends on reliable miner detection capabilities. Deep learning algorithms offer potential solutions for automated miner detection, but require comprehensive training datasets, which are currently lacking for underground mining environments. This paper presents a novel thermal imaging dataset specifically designed to enable the development and validation of miner detection systems for potential emergency applications. We systematically captured thermal imagery of various mining activities and scenarios to create a robust foundation for detection algorithms. 
To establish baseline performance metrics, we evaluated several state-of-the-art object detection algorithms including YOLOv8, YOLOv10, YOLO11, and RT-DETR on our dataset. While not exhaustive of all possible emergency situations, this dataset serves as a crucial first step toward developing reliable thermal-based miner detection systems that could eventually be deployed in real emergency scenarios. This work demonstrates the feasibility of using thermal imaging for miner detection and establishes a foundation for future research in this critical safety application.

[135] Rethinking Oversaturation in Classifier-Free Guidance via Low Frequency

Kaiyu Song,Hanjiang Lai

Main category: cs.CV

TL;DR: This paper introduces LF-CFG, a novel approach to address oversaturation and unrealistic artifacts in diffusion models by targeting redundant information in low-frequency signals using an adaptive threshold and down-weighting strategy.

Details Motivation: High guidance scales in classifier-free guidance often lead to oversaturation and unrealistic artifacts; this work addresses these issues by focusing on low-frequency signals. Method: An adaptive threshold-based measurement is introduced to pinpoint redundant information in low-frequency signals. A down-weight strategy is then applied to reduce its impact. Result: The proposed LF-CFG successfully mitigates oversaturation and unrealistic artifacts in various diffusion models, including Stable Diffusion-XL, Stable Diffusion 2.1, 3.0, 3.5, and SiT-XL. Conclusion: LF-CFG effectively alleviates oversaturation and unrealistic artifacts in diffusion models, as demonstrated by experimental results across multiple model versions. Abstract: Classifier-free guidance (CFG) has proven successful in conditional diffusion models, which use a guidance scale to balance the influence of conditional and unconditional terms. A high guidance scale is used to enhance the performance of the conditional term. However, the high guidance scale often results in oversaturation and unrealistic artifacts. In this paper, we introduce a new perspective based on low-frequency signals, identifying the accumulation of redundant information in these signals as the key factor behind oversaturation and unrealistic artifacts. Building on this insight, we propose low-frequency improved classifier-free guidance (LF-CFG) to mitigate these issues. Specifically, we introduce an adaptive threshold-based measurement to pinpoint the locations of redundant information. We determine a reasonable threshold by analyzing the change rate of low-frequency information between prior and current steps. We then apply a down-weight strategy to reduce the impact of redundant information in the low-frequency signals. 
Experimental results demonstrate that LF-CFG effectively alleviates oversaturation and unrealistic artifacts across various diffusion models, including Stable Diffusion-XL, Stable Diffusion 2.1, 3.0, 3.5, and SiT-XL.
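To make the idea concrete, here is a rough sketch on 1D toy signals. The `cfg` function is the standard classifier-free guidance combination. Everything in `lf_downweighted_cfg` is an assumption made for illustration, not the authors' algorithm: the moving-average low-pass filter, the rule that a position is "redundant" when its low-frequency value changed by less than `threshold` since the previous step, and the fixed `weight`.

```python
def cfg(eps_uncond, eps_cond, scale):
    """Standard classifier-free guidance: uncond + scale * (cond - uncond)."""
    return [u + scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]


def low_pass(x, radius=1):
    """Moving average as a stand-in low-pass filter (an assumption of this sketch)."""
    n = len(x)
    out = []
    for i in range(n):
        lo, hi = max(0, i - radius), min(n, i + radius + 1)
        out.append(sum(x[lo:hi]) / (hi - lo))
    return out


def lf_downweighted_cfg(eps_uncond, eps_cond, prev_low, scale, threshold, weight):
    """Hypothetical low-frequency down-weighting of the guidance term.

    Returns the guided prediction and the current low-frequency guidance,
    which the caller would pass back in as prev_low at the next step.
    """
    guidance = [c - u for u, c in zip(eps_uncond, eps_cond)]
    low = low_pass(guidance)
    high = [g - l for g, l in zip(guidance, low)]
    # Assumed rule: low-frequency content that barely changed is treated as redundant.
    low_adj = [weight * l if abs(l - p) < threshold else l
               for l, p in zip(low, prev_low)]
    guided = [u + scale * (l + h) for u, l, h in zip(eps_uncond, low_adj, high)]
    return guided, low


eps_u = [0.0, 0.0, 0.0, 0.0]
eps_c = [1.0, 2.0, 3.0, 4.0]
baseline = cfg(eps_u, eps_c, scale=2.0)
same, low_now = lf_downweighted_cfg(eps_u, eps_c, [0.0] * 4, 2.0, threshold=0.5, weight=1.0)
damped, _ = lf_downweighted_cfg(eps_u, eps_c, low_now, 2.0, threshold=0.5, weight=0.5)
```

With `weight=1.0` the sketch reduces to plain CFG; with a smaller weight, positions whose low-frequency guidance is unchanged from the previous step contribute less to the guided output.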

[136] Evaluation of Traffic Signals for Daily Traffic Pattern

Mohammad Shokrolah Shirazi,Hung-Fu Chang

Main category: cs.CV

TL;DR: This paper proposes and evaluates dynamic, static, and hybrid signal timing methods using real-world data to improve traffic flow at intersections, showing that a hybrid approach adapts best to varying traffic patterns.

Details Motivation: Turning movement count data is essential for effective traffic signal design, intersection planning, and congestion analysis, prompting the need for improved signal timing strategies. Method: The study develops a vision-based tracking system to estimate turning movement counts (TMC) and evaluates traffic signal methods using simulations with realistic data from six intersections in Las Vegas. Result: Initial experiments show that cycle times of 90 and 120 seconds perform best. The dynamic method works well for four intersections, while the hybrid method outperforms others under highly weighted traffic conditions in specific zones. Conclusion: This paper concludes that the hybrid method of dynamic and static signal timing configurations improves traffic flow management, especially under varying traffic conditions in different zones. Abstract: The turning movement count data is crucial for traffic signal design, intersection geometry planning, traffic flow, and congestion analysis. This work proposes three methods called dynamic, static, and hybrid configuration for TMC-based traffic signals. A vision-based tracking system is developed to estimate the TMC of six intersections in Las Vegas using traffic cameras. The intersection design, route (e.g. vehicle movement directions), and signal configuration files with compatible formats are synthesized and imported into Simulation of Urban MObility for signal evaluation with realistic data. The initial experimental results based on estimated waiting times indicate that the cycle time of 90 and 120 seconds works best for all intersections. In addition, four intersections show better performance for dynamic signal timing configuration, and the other two with lower performance have a lower ratio of total vehicle count to total lanes of the intersection leg. 
Since daily traffic flow often exhibits a bimodal pattern, we propose a hybrid signal method that switches between dynamic and static methods, adapting to peak and off-peak traffic conditions for improved flow management. So, a built-in traffic generator module creates vehicle routes for 4 hours, including peak hours, and a signal design module produces signal schedule cycles according to static, dynamic, and hybrid methods. Vehicle count distributions are weighted differently for each zone (i.e., West, North, East, South) to generate diverse traffic patterns. The extended experimental results for 6 intersections with 4 hours of simulation time imply that zone-based traffic pattern distributions affect signal design selection. Although the static method works great for evenly zone-based traffic distribution, the hybrid method works well for highly weighted traffic at intersection pairs of the West-East and North-South zones.
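The three configurations can be illustrated with a simplified green-time allocator. This is only a sketch under stated assumptions: the function names are hypothetical, the dynamic rule (green time proportional to turning movement counts) and the switching rule (dynamic when total demand reaches a threshold, static otherwise) are illustrative choices, and yellow, all-red, and minimum-green intervals are ignored.

```python
def static_split(cycle_s, n_phases):
    """Static configuration: every phase receives the same share of the cycle."""
    return [cycle_s / n_phases] * n_phases


def dynamic_split(cycle_s, phase_counts):
    """Dynamic configuration (assumed rule): green time proportional to counts."""
    total = sum(phase_counts)
    if total == 0:
        return static_split(cycle_s, len(phase_counts))
    return [cycle_s * c / total for c in phase_counts]


def hybrid_split(cycle_s, phase_counts, peak_threshold):
    """Hybrid configuration (assumed rule): dynamic at peak demand, static otherwise."""
    if sum(phase_counts) >= peak_threshold:
        return dynamic_split(cycle_s, phase_counts)
    return static_split(cycle_s, len(phase_counts))


# Invented counts for four phases over one interval, with a 90-second cycle.
peak = hybrid_split(90, [30, 10, 15, 5], peak_threshold=50)
off_peak = hybrid_split(90, [3, 1, 2, 0], peak_threshold=50)
print(peak)      # [45.0, 15.0, 22.5, 7.5]
print(off_peak)  # [22.5, 22.5, 22.5, 22.5]
```

In a SUMO-based evaluation such as the one described, splits like these would be written into the signal configuration files; that step is outside the scope of this sketch.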

[137] Global and Local Entailment Learning for Natural World Imagery

Srikumar Sastry,Aayush Dhakal,Eric Xing,Subash Khanal,Nathan Jacobs

Main category: cs.CV

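TL;DR: This paper proposes the RCME framework, which explicitly models the transitivity of entailment to improve how vision-language models learn hierarchically structured data, and achieves better performance on related tasks.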

Details Motivation: Existing methods fail to model the transitivity of entailment explicitly, which makes it hard to capture hierarchical structure in the representation space. Method: The paper introduces Radial Cross-Modal Embeddings (RCME), which explicitly model the transitivity of entailment and optimize for the partial order of concepts. Result: RCME outperforms existing state-of-the-art models on hierarchical species classification and retrieval tasks. Conclusion: RCME successfully models the transitivity of entailment and improves the performance of vision-language models on hierarchical species classification and retrieval tasks. Abstract: Learning the hierarchical structure of data in vision-language models is a significant challenge. Previous works have attempted to address this challenge by employing entailment learning. However, these approaches fail to model the transitive nature of entailment explicitly, which establishes the relationship between order and semantics within a representation space. In this work, we introduce Radial Cross-Modal Embeddings (RCME), a framework that enables the explicit modeling of transitivity-enforced entailment. Our proposed framework optimizes for the partial order of concepts within vision-language models. By leveraging our framework, we develop a hierarchical vision-language foundation model capable of representing the hierarchy in the Tree of Life. Our experiments on hierarchical species classification and hierarchical retrieval tasks demonstrate the enhanced performance of our models compared to the existing state-of-the-art models. Our code and models are open-sourced at https://vishu26.github.io/RCME/index.html.

[138] TITAN: Query-Token based Domain Adaptive Adversarial Learning

Tajamul Ashraf,Janibul Bashir

Main category: cs.CV

TL;DR: This paper introduces TITAN, a novel method for source-free domain adaptation in object detection that improves performance by generating reliable pseudo-labels through iterative partitioning and adversarial learning.

Details Motivation: The motivation is to address the degradation of the student model caused by unreliable pseudo-labels due to domain bias and shift when source data is unavailable. Method: TITAN employs a Target-based Iterative Query-Token Adversarial Network that separates target images into similar (easy) and dissimilar (hard) subsets based on variance estimation. It integrates query-token-based adversarial modules into a student-teacher framework to reduce domain gaps. Result: Experiments on four natural imaging datasets and two medical datasets showed mAP improvements of +22.7%, +22.2%, +21.1%, and +3.7% over current SOTA on C2F, C2B, S2C, and K2C benchmarks, respectively. Conclusion: TITAN outperforms existing state-of-the-art methodologies in source-free domain adaptive object detection, showing significant mAP improvements across multiple benchmarks. Abstract: We focus on the source-free domain adaptive object detection (SF-DAOD) problem when source data is unavailable during adaptation and the model must adapt to an unlabeled target domain. The majority of approaches for the problem employ a self-supervised approach using a student-teacher (ST) framework where pseudo-labels are generated via a source-pretrained model for further fine-tuning. We observe that the performance of a student model often degrades drastically, due to the collapse of the teacher model, primarily caused by high noise in pseudo-labels, resulting from domain bias, discrepancies, and a significant domain shift across domains. To obtain reliable pseudo-labels, we propose a Target-based Iterative Query-Token Adversarial Network (TITAN), which separates the target images into two subsets: those similar to the source (easy) and those dissimilar (hard). We propose a strategy to estimate variance to partition the target domain. This approach leverages the insight that higher detection variances correspond to higher recall and greater similarity to the source domain. 
Also, we incorporate query-token-based adversarial modules into a student-teacher baseline framework to reduce the domain gaps between two feature representations. Experiments conducted on four natural imaging datasets and two challenging medical datasets have substantiated the superior performance of TITAN compared to existing state-of-the-art (SOTA) methodologies. We report an mAP improvement of +22.7, +22.2, +21.1, and +3.7 percent over the current SOTA on C2F, C2B, S2C, and K2C benchmarks, respectively.
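The easy/hard split can be sketched as a threshold on per-image detection variance. This is an illustrative assumption, not the paper's estimator: the variance here is simply the population variance of detection confidence scores, the threshold is fixed by hand, and the function name is hypothetical. Following the abstract's stated insight, higher variance is mapped to the "easy" (source-like) subset.

```python
from statistics import pvariance


def partition_by_variance(scores_per_image, threshold):
    """Split image ids into (easy, hard) by variance of their detection scores.

    scores_per_image maps an image id to a non-empty list of confidence scores.
    """
    easy, hard = [], []
    for image_id, scores in scores_per_image.items():
        if pvariance(scores) >= threshold:
            easy.append(image_id)  # high variance: treated as source-like
        else:
            hard.append(image_id)  # low variance: treated as dissimilar
    return easy, hard


# Invented confidence scores for three target-domain images.
toy_scores = {
    "img_a": [0.1, 0.9],       # variance 0.16
    "img_b": [0.5, 0.5],       # variance 0.0
    "img_c": [0.2, 0.8, 0.5],  # variance 0.06
}
easy, hard = partition_by_variance(toy_scores, threshold=0.05)
print(easy)  # ['img_a', 'img_c']
print(hard)  # ['img_b']
```

In an iterative scheme, the two subsets would then feed the adversarial alignment stage, with the partition recomputed as the detector adapts.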

[139] Towards Reliable Detection of Empty Space: Conditional Marked Point Processes for Object Detection

Tobias J. Riedlinger,Kira Maag,Hanno Gottschalk

Main category: cs.CV

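TL;DR: This study proposes a new object detection framework that uses spatial statistics to address the shortcomings of conventional models in uncertainty estimation, particularly in regions where no objects are detected.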

Details Motivation: Existing object detection models cannot quantify uncertainty outside detected bounding boxes, which poses a safety risk in applications such as automated driving. Method: Bounding box data are modeled as realizations of a marked point process, the model is trained through a likelihood function, and the uncertainty of whether a region is drivable is evaluated. Result: Calibration assessments and performance evaluation confirm the effectiveness of the new method, which improves the quantification of uncertainty in empty regions. Conclusion: The study proposes an object detection model grounded in spatial statistics that provides well-defined confidence estimates for whether a region is drivable. Abstract: Deep neural networks have set the state-of-the-art in computer vision tasks such as bounding box detection and semantic segmentation. Object detectors and segmentation models assign confidence scores to predictions, reflecting the model's uncertainty in object detection or pixel-wise classification. However, these confidence estimates are often miscalibrated, as their architectures and loss functions are tailored to task performance rather than probabilistic foundation. Even with well calibrated predictions, object detectors fail to quantify uncertainty outside detected bounding boxes, i.e., the model does not make a probability assessment of whether an area without detected objects is truly free of obstacles. This poses a safety risk in applications such as automated driving, where uncertainty in empty areas remains unexplored. In this work, we propose an object detection model grounded in spatial statistics. Bounding box data matches realizations of a marked point process, commonly used to describe the probabilistic occurrence of spatial point events identified as bounding box centers, where marks are used to describe the spatial extension of bounding boxes and classes. Our statistical framework enables a likelihood-based training and provides well-defined confidence estimates for whether a region is drivable, i.e., free of objects. We demonstrate the effectiveness of our method through calibration assessments and evaluation of performance.

[140] Mitigating Hallucination of Large Vision-Language Models via Dynamic Logits Calibration

Jiahe Chen,Jiaying He,Qian Shao,Qiyuan Chen,Jiahe Ying,Hongxia Xu,Jintai Chen,Jianwei Zheng,Jian Wu

Main category: cs.CV

TL;DR: This paper proposes Dynamic Logits Calibration (DLC), a training-free decoding framework that effectively reduces hallucinations in Large Vision-Language Models (LVLMs) while maintaining inference efficiency.

Details Motivation: LVLMs often suffer from hallucinations where generated text contradicts visual input. Existing training-free decoding strategies have limitations like static constraints, inefficiency due to multiple forward passes, and loss of detail from rigid rules. Method: Dynamic Logits Calibration (DLC) uses CLIP to assess semantic alignment between input images and generated text. It evaluates the Relative Visual Advantage (RVA) of candidate tokens against a dynamically updated baseline and adjusts logits accordingly. An adaptive weighting mechanism balances visual guidance with output quality. Result: Experiments across various benchmarks and LVLM architectures show that DLC significantly reduces hallucinations, outperforms current methods, and maintains high inference efficiency by avoiding multiple forward passes. Conclusion: DLC is an effective and efficient decoding-time solution to reduce hallucinations in LVLMs, enhancing their reliability for practical applications. Abstract: Large Vision-Language Models (LVLMs) have demonstrated significant advancements in multimodal understanding, yet they are frequently hampered by hallucination-the generation of text that contradicts visual input. Existing training-free decoding strategies exhibit critical limitations, including the use of static constraints that do not adapt to semantic drift during generation, inefficiency stemming from the need for multiple forward passes, and degradation of detail due to overly rigid intervention rules. To overcome these challenges, this paper introduces Dynamic Logits Calibration (DLC), a novel training-free decoding framework designed to dynamically align text generation with visual evidence at inference time. At the decoding phase, DLC step-wise employs CLIP to assess the semantic alignment between the input image and the generated text sequence. 
Then, the Relative Visual Advantage (RVA) of candidate tokens is evaluated against a dynamically updated contextual baseline, adaptively adjusting output logits to favor tokens that are visually grounded. Furthermore, an adaptive weighting mechanism, informed by a real-time context alignment score, carefully balances the visual guidance while ensuring the overall quality of the textual output. Extensive experiments conducted across diverse benchmarks and various LVLM architectures (such as LLaVA, InstructBLIP, and MiniGPT-4) demonstrate that DLC significantly reduces hallucinations, outperforming current methods while maintaining high inference efficiency by avoiding multiple forward passes. Overall, we present an effective and efficient decoding-time solution to mitigate hallucinations, thereby enhancing the reliability of LVLMs in practical applications. Code will be released on GitHub.
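A single decoding step of this kind of calibration might look like the sketch below. It is a simplified illustration under assumptions, not the released method: the alignment scores stand in for CLIP image-text similarities that would be computed elsewhere, RVA is taken to be the plain difference between a candidate's score and the baseline, and `alpha` is a fixed weight rather than the adaptive one described in the abstract.

```python
def calibrate_logits(logits, alignment_scores, baseline, alpha):
    """Shift each candidate's logit by alpha times its assumed relative visual advantage.

    logits:           raw next-token logits for the candidate tokens
    alignment_scores: hypothetical image-text alignment score per candidate
    baseline:         alignment score of the text generated so far
    alpha:            guidance strength (fixed here; adaptive in the paper)
    """
    return [l + alpha * (s - baseline) for l, s in zip(logits, alignment_scores)]


def argmax(values):
    """Index of the largest value."""
    return max(range(len(values)), key=values.__getitem__)


# Invented numbers: candidate 0 is preferred by the language model,
# candidate 1 is better supported by the image.
raw = [2.0, 1.0]
scores = [0.2, 0.6]
adjusted = calibrate_logits(raw, scores, baseline=0.3, alpha=5.0)

print(argmax(raw))       # 0
print(argmax(adjusted))  # 1
```

Because the adjustment reuses logits from a single forward pass and only adds a per-candidate offset, a scheme of this shape avoids the repeated LVLM forward passes that the abstract identifies as a cost of earlier decoding strategies.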

[141] GGTalker: Talking Head Systhesis with Generalizable Gaussian Priors and Identity-Specific Adaptation

Wentao Hu,Shunkai Li,Ziqiao Peng,Haoxian Zhang,Fan Shi,Xiaoqiang Liu,Pengfei Wan,Di Zhang,Hui Tian

Main category: cs.CV

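TL;DR: GGTalker is a new speech-driven 3D talking head generation method that combines generalizable priors with identity-specific adaptation, addressing the limitations of current methods under large head rotations and out-of-distribution audio and achieving state-of-the-art results on multiple performance metrics.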

Details Motivation: Current methods perform well for fixed viewpoints and small-scale audio variations but struggle with large head rotations and out-of-distribution audio, and they require time-consuming identity-specific training. This stems mainly from a lack of sufficient 3D priors, which limits the extrapolation capability of synthesized talking heads. Method: The paper introduces a two-stage Prior-Adaptation training strategy that learns Gaussian head priors and adapts to individual characteristics. Audio-Expression and Expression-Visual priors are trained to capture the universal patterns of lip movements and the general distribution of head textures, and individual speaking styles and texture details are precisely modeled during Customized Adaptation. A color MLP generates fine-grained, motion-aligned textures, and a Body Inpainter blends rendered results with the background. Result: Comprehensive experiments show that GGTalker achieves state-of-the-art performance in rendering quality, 3D consistency, lip-sync accuracy, and training efficiency. Conclusion: By combining generalizable priors with identity-specific adaptation, GGTalker achieves state-of-the-art performance in creating high-quality, generalizable speech-driven 3D talking heads. Abstract: Creating high-quality, generalizable speech-driven 3D talking heads remains a persistent challenge. Previous methods achieve satisfactory results for fixed viewpoints and small-scale audio variations, but they struggle with large head rotations and out-of-distribution (OOD) audio. Moreover, they are constrained by the need for time-consuming, identity-specific training. We believe the core issue lies in the lack of sufficient 3D priors, which limits the extrapolation capabilities of synthesized talking heads. To address this, we propose GGTalker, which synthesizes talking heads through a combination of generalizable priors and identity-specific adaptation. We introduce a two-stage Prior-Adaptation training strategy to learn Gaussian head priors and adapt to individual characteristics. We train Audio-Expression and Expression-Visual priors to capture the universal patterns of lip movements and the general distribution of head textures. During the Customized Adaptation, individual speaking styles and texture details are precisely modeled. Additionally, we introduce a color MLP to generate fine-grained, motion-aligned textures and a Body Inpainter to blend rendered results with the background, producing indistinguishable, photorealistic video frames. Comprehensive experiments show that GGTalker achieves state-of-the-art performance in rendering quality, 3D consistency, lip-sync accuracy, and training efficiency.

[142] G$^{2}$D: Boosting Multimodal Learning with Gradient-Guided Distillation

Mohammed Rakib,Arunkumar Bagavathi

Main category: cs.CV

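TL;DR: This paper proposes G$^{2}$D, a new multimodal learning framework that addresses modality imbalance through gradient-guided distillation and dynamic modality prioritization, and performs strongly on classification and regression tasks.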

Details Motivation: Conventional multimodal models suffer from modality imbalance, which leaves weak modalities underutilized and hurts overall performance. Method: The paper introduces Gradient-Guided Distillation (G$^{2}$D), which combines a loss function fusing unimodal and multimodal objectives with a dynamic sequential modality prioritization (SMP) technique during training. Result: G$^{2}$D amplifies the significance of weak modalities during multimodal training, and its advantage is validated on real-world datasets. Conclusion: G$^{2}$D effectively addresses modality imbalance in multimodal learning and outperforms existing methods on classification and regression tasks across multiple real-world datasets. Abstract: Multimodal learning aims to leverage information from diverse data modalities to achieve more comprehensive performance. However, conventional multimodal models often suffer from modality imbalance, where one or a few modalities dominate model optimization, leading to suboptimal feature representation and underutilization of weak modalities. To address this challenge, we introduce Gradient-Guided Distillation (G$^{2}$D), a knowledge distillation framework that optimizes the multimodal model with a custom-built loss function that fuses both unimodal and multimodal objectives. G$^{2}$D further incorporates a dynamic sequential modality prioritization (SMP) technique in the learning process to ensure each modality leads the learning process, avoiding the pitfall of stronger modalities overshadowing weaker ones. We validate G$^{2}$D on multiple real-world datasets and show that G$^{2}$D amplifies the significance of weak modalities while training and outperforms state-of-the-art methods in classification and regression tasks. Our code is available at https://github.com/rAIson-Lab/G2D.

[143] MADrive: Memory-Augmented Driving Scene Modeling

Polina Karpikova,Daniil Selikhanovych,Kirill Struminsky,Ruslan Musaev,Maria Golitsyna,Dmitry Baranchuk

Main category: cs.CV

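TL;DR: MADrive uses an external 3D asset bank to improve autonomous driving scene reconstruction, enabling more photorealistic vehicle replacement and scene synthesis.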

Details Motivation: Existing scene reconstruction methods struggle to support photorealistic synthesis of significantly altered or novel driving scenarios, so MADrive is proposed to extend their capabilities. Method: A memory-augmented reconstruction framework replaces observed vehicles with visually similar 3D assets retrieved from a large-scale external memory bank, then applies orientation alignment and relighting. Result: Experiments show that MADrive provides complete multi-view vehicle representations, enabling photorealistic synthesis of substantially altered configurations. Conclusion: By drawing on a large-scale external 3D asset bank, MADrive extends existing scene reconstruction methods, achieving high-quality vehicle replacement and photorealistic synthesis of driving scenes. Abstract: Recent advances in scene reconstruction have pushed toward highly realistic modeling of autonomous driving (AD) environments using 3D Gaussian splatting. However, the resulting reconstructions remain closely tied to the original observations and struggle to support photorealistic synthesis of significantly altered or novel driving scenarios. This work introduces MADrive, a memory-augmented reconstruction framework designed to extend the capabilities of existing scene reconstruction methods by replacing observed vehicles with visually similar 3D assets retrieved from a large-scale external memory bank. Specifically, we release MAD-Cars, a curated dataset of ${\sim}70$K 360° car videos captured in the wild and present a retrieval module that finds the most similar car instances in the memory bank, reconstructs the corresponding 3D assets from video, and integrates them into the target scene through orientation alignment and relighting. The resulting replacements provide complete multi-view representations of vehicles in the scene, enabling photorealistic synthesis of substantially altered configurations, as demonstrated in our experiments. Project page: https://yandex-research.github.io/madrive/

[144] WAFT: Warping-Alone Field Transforms for Optical Flow

Yihan Wang,Jia Deng

Main category: cs.CV

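TL;DR: This paper introduces Warping-Alone Field Transforms (WAFT), a new optical flow method that replaces cost volumes with high-resolution warping, achieving better accuracy with lower memory consumption.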

Details Motivation: To provide an optical flow estimation method that is more accurate and cheaper in memory, while challenging the conventional view that constructing cost volumes is necessary for strong performance. Method: WAFT replaces the cost volume with high-resolution warping; it is a flexible meta-architecture with minimal inductive biases and little reliance on custom designs. Result: WAFT ranks 1st on the Spring and KITTI benchmarks, achieves the best zero-shot generalization on KITTI, and is up to 4.1x faster than methods with similar performance. Conclusion: WAFT is a simple and effective optical flow method that challenges the conventional wisdom that constructing cost volumes is necessary for strong performance. Abstract: We introduce Warping-Alone Field Transforms (WAFT), a simple and effective method for optical flow. WAFT is similar to RAFT but replaces cost volume with high-resolution warping, achieving better accuracy with lower memory cost. This design challenges the conventional wisdom that constructing cost volumes is necessary for strong performance. WAFT is a simple and flexible meta-architecture with minimal inductive biases and reliance on custom designs. Compared with existing methods, WAFT ranks 1st on Spring and KITTI benchmarks, achieves the best zero-shot generalization on KITTI, while being up to 4.1x faster than methods with similar performance. Code and model weights are available at https://github.com/princeton-vl/WAFT.
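
For readers unfamiliar with the warping operation that the WAFT abstract refers to, the following is a minimal sketch of generic backward warping with bilinear interpolation. It is not taken from the WAFT code base: the function names `bilinear_sample` and `warp`, the list-of-lists image layout, and the border clamping are all assumptions made for illustration. Each pixel of the first frame looks up the second frame at its position plus the current flow estimate, so a correct flow makes the warped second frame resemble the first.

```python
def bilinear_sample(img, x, y):
    """Sample a 2D grid (list of rows) at a real-valued (x, y) position.

    Coordinates are clamped to the image border (an illustrative choice).
    """
    h, w = len(img), len(img[0])
    x = min(max(float(x), 0.0), w - 1.0)
    y = min(max(float(y), 0.0), h - 1.0)
    x0, y0 = int(x), int(y)            # floor, since x and y are non-negative
    x1, y1 = min(x0 + 1, w - 1), min(y0 + 1, h - 1)
    fx, fy = x - x0, y - y0
    top = img[y0][x0] * (1 - fx) + img[y0][x1] * fx
    bottom = img[y1][x0] * (1 - fx) + img[y1][x1] * fx
    return top * (1 - fy) + bottom * fy


def warp(img2, flow):
    """Backward-warp img2 toward frame 1 using flow[y][x] = (dx, dy)."""
    h, w = len(img2), len(img2[0])
    return [
        [bilinear_sample(img2, x + flow[y][x][0], y + flow[y][x][1])
         for x in range(w)]
        for y in range(h)
    ]


# Hypothetical toy data: frame 2 is frame 1 shifted one pixel to the right.
frame1 = [[0, 1, 2, 3]]
frame2 = [[9, 0, 1, 2]]
flow = [[(1.0, 0.0)] * 4]
warped = warp(frame2, flow)
```

With the correct flow, `warped` agrees with `frame1` everywhere except the last column, where the lookup falls outside the image and is clamped to the border. A warping-based method can compare the warped features against frame 1 directly, without storing pairwise correlations between all positions.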

[145] Maximal Matching Matters: Preventing Representation Collapse for Robust Cross-Modal Retrieval

Hani Alomari,Anushka Sivakumar,Andrew Zhang,Chris Thomas

Main category: cs.CV

TL;DR: This paper proposes a novel method for cross-modal image-text retrieval by optimizing set-based embeddings through Maximal Pair Assignment Similarity and two new loss functions, achieving state-of-the-art results.

Details Motivation: Traditional single-vector embeddings struggle to capture nuanced cross-modal relationships. Set-based approaches offer more diversity but face challenges like sparse supervision and set collapse. Method: The paper introduces Maximal Pair Assignment Similarity to optimize one-to-one matching between embedding sets, along with two loss functions: Global Discriminative Loss and Intra-Set Divergence Loss. Result: The proposed method effectively addresses the limitations of set-based representations and improves cross-modal image-text retrieval performance. Conclusion: The proposed method achieves state-of-the-art performance on MS-COCO and Flickr30k without relying on external data. Abstract: Cross-modal image-text retrieval is challenging because of the diverse possible associations between content from different modalities. Traditional methods learn a single-vector embedding to represent semantics of each sample, but struggle to capture nuanced and diverse relationships that can exist across modalities. Set-based approaches, which represent each sample with multiple embeddings, offer a promising alternative, as they can capture richer and more diverse relationships. In this paper, we show that, despite their promise, these set-based representations continue to face issues including sparse supervision and set collapse, which limits their effectiveness. To address these challenges, we propose Maximal Pair Assignment Similarity to optimize one-to-one matching between embedding sets which preserve semantic diversity within the set. We also introduce two loss functions to further enhance the representations: Global Discriminative Loss to enhance distinction among embeddings, and Intra-Set Divergence Loss to prevent collapse within each set. Our method achieves state-of-the-art performance on MS-COCO and Flickr30k without relying on external data.
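
The one-to-one matching idea in the abstract above can be illustrated with a small sketch. The exact definition of Maximal Pair Assignment Similarity is not given in this summary, so everything here is an assumption: the function name `max_assignment_similarity` is hypothetical, cosine similarity is used as the pairwise score, and the optimal assignment is found by brute force over permutations, which is only practical for very small sets.

```python
from itertools import permutations
from math import sqrt


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def max_assignment_similarity(set_a, set_b):
    """Mean similarity under the best one-to-one pairing of two equal-size sets.

    Returns (score, perm), where set_a[i] is paired with set_b[perm[i]].
    Brute force over all pairings; an illustrative stand-in, not the paper's method.
    """
    if len(set_a) != len(set_b):
        raise ValueError("sets must have the same number of embeddings")
    n = len(set_a)
    sim = [[cosine(a, b) for b in set_b] for a in set_a]
    best_score, best_perm = float("-inf"), None
    for perm in permutations(range(n)):
        score = sum(sim[i][perm[i]] for i in range(n))
        if score > best_score:
            best_score, best_perm = score, perm
    return best_score / n, best_perm


# Hypothetical toy embeddings: a diverse set matches fully, a collapsed one does not.
diverse_score, diverse_perm = max_assignment_similarity(
    [[1, 0], [0, 1]], [[0, 2], [3, 0]])
collapsed_score, _ = max_assignment_similarity(
    [[1, 0], [1, 0]], [[1, 0], [0, 1]])
```

Because every embedding must be paired with a distinct partner, a set whose embeddings have collapsed onto one direction cannot match a diverse set well. In the toy data the collapsed set scores 0.5 while the diverse set scores 1.0, which is one way to see why a one-to-one constraint rewards diversity within a set.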

[146] StruMamba3D: Exploring Structural Mamba for Self-supervised Point Cloud Representation Learning

Chuxin Wang,Yixin Zha,Wenfei Yang,Tianzhu Zhang

Main category: cs.CV

TL;DR: This paper introduces StruMamba3D, a novel self-supervised point cloud representation learning approach that enhances Mamba-based models by preserving spatial dependencies, improving structure modeling, and adapting to varying input lengths, leading to state-of-the-art performance.

Details Motivation: The motivation is to improve upon Mamba-based methods in point cloud representation learning by addressing two key issues: the destruction of adjacency of 3D points during SSM processing and the inability to retain long-sequence memory as input lengths increase. Method: The paper proposes StruMamba3D which involves designing spatial states to preserve spatial dependencies among points, enhancing SSM with a state-wise update strategy combined with lightweight convolution, and introducing a sequence length-adaptive strategy to reduce model sensitivity to input lengths. Result: Experimental results show superior performance across four downstream tasks, achieving SOTA accuracy on ModelNet40 (95.1%) and ScanObjectNN (92.75%) without using a voting strategy. Conclusion: StruMamba3D is an effective method for point cloud representation learning that overcomes the limitations of existing Mamba-based methods by preserving spatial dependencies, enhancing SSM with a state-wise update strategy, and reducing sensitivity to varying input lengths. Abstract: Recently, Mamba-based methods have demonstrated impressive performance in point cloud representation learning by leveraging State Space Model (SSM) with the efficient context modeling ability and linear complexity. However, these methods still face two key issues that limit the potential of SSM: Destroying the adjacency of 3D points during SSM processing and failing to retain long-sequence memory as the input length increases in downstream tasks. To address these issues, we propose StruMamba3D, a novel paradigm for self-supervised point cloud representation learning. It enjoys several merits. First, we design spatial states and use them as proxies to preserve spatial dependencies among points. Second, we enhance the SSM with a state-wise update strategy and incorporate a lightweight convolution to facilitate interactions between spatial states for efficient structure modeling. Third, our method reduces the sensitivity of pre-trained Mamba-based models to varying input lengths by introducing a sequence length-adaptive strategy. Experimental results across four downstream tasks showcase the superior performance of our method. In addition, our method attains the SOTA 95.1% accuracy on ModelNet40 and 92.75% accuracy on the most challenging split of ScanObjectNN without voting strategy.

[147] DeOcc-1-to-3: 3D De-Occlusion from a Single Image via Self-Supervised Multi-View Diffusion

Yansong Qu,Shaohui Dai,Xinyang Li,Yuze Wang,You Shen,Liujuan Cao,Rongrong Ji

Main category: cs.CV

TL;DR: This paper introduces an innovative framework that generates consistent multi-views from a single occluded image, enabling improved 3D reconstruction without requiring prior inpainting or annotations.

Details Motivation: The motivation stems from the limitations of existing diffusion-based view synthesis models, which assume fully visible inputs and fail under real-world occlusions, leading to inconsistent views and degraded 3D reconstruction quality. Method: The method synthesizes six structurally consistent novel views from a single partially occluded image using a self-supervised training pipeline on the Pix2Gestalt dataset. It leverages occluded-unoccluded image pairs and pseudo-ground-truth views to train the model for structure-aware completion and view consistency. Result: The proposed approach enables downstream 3D reconstruction without additional preprocessing steps, while introducing the first benchmark for occlusion-aware reconstruction across diverse occlusion levels, object categories, and mask patterns. Conclusion: The paper proposes an end-to-end framework for occlusion-aware multi-view generation, successfully addressing the challenge of reconstructing 3D objects from partially occluded single images without prior inpainting or manual annotations. Abstract: Reconstructing 3D objects from a single image is a long-standing challenge, especially under real-world occlusions. While recent diffusion-based view synthesis models can generate consistent novel views from a single RGB image, they generally assume fully visible inputs and fail when parts of the object are occluded. This leads to inconsistent views and degraded 3D reconstruction quality. To overcome this limitation, we propose an end-to-end framework for occlusion-aware multi-view generation. Our method directly synthesizes six structurally consistent novel views from a single partially occluded image, enabling downstream 3D reconstruction without requiring prior inpainting or manual annotations. We construct a self-supervised training pipeline using the Pix2Gestalt dataset, leveraging occluded-unoccluded image pairs and pseudo-ground-truth views to teach the model structure-aware completion and view consistency. Without modifying the original architecture, we fully fine-tune the view synthesis model to jointly learn completion and multi-view generation. Additionally, we introduce the first benchmark for occlusion-aware reconstruction, encompassing diverse occlusion levels, object categories, and mask patterns. This benchmark provides a standardized protocol for evaluating future methods under partial occlusions. Our code is available at https://github.com/Quyans/DeOcc123.

[148] SAM4D: Segment Anything in Camera and LiDAR Streams

Jianyun Xu,Song Wang,Ziqian Ni,Chunyong Hu,Sheng Yang,Jianke Zhu,Qiang Li

Main category: cs.CV

TL;DR: SAM4D is a foundation model for cross-modal segmentation using camera and LiDAR data, combining novel techniques for feature alignment, temporal consistency, and efficient pseudo-label generation, showing promising results in autonomous driving applications.

Details Motivation: The motivation is to overcome annotation bottlenecks and achieve robust, promptable segmentation across camera and LiDAR streams in dynamic autonomous driving environments. Method: The method involves the development of SAM4D, incorporating Unified Multi-modal Positional Encoding (UMPE) for feature alignment and Motion-aware Cross-modal Memory Attention (MCMA) to improve temporal consistency. Additionally, a multi-modal automated data engine is used to generate pseudo-labels efficiently. Result: Extensive experiments on Waymo-4DSeg showcase SAM4D's powerful cross-modal segmentation performance and its ability to generate high-quality pseudo-labels much faster than human annotation. Conclusion: The paper concludes that SAM4D demonstrates strong cross-modal segmentation capabilities and significant potential in efficient data annotation for autonomous driving scenarios. Abstract: We present SAM4D, a multi-modal and temporal foundation model designed for promptable segmentation across camera and LiDAR streams. Unified Multi-modal Positional Encoding (UMPE) is introduced to align camera and LiDAR features in a shared 3D space, enabling seamless cross-modal prompting and interaction. Additionally, we propose Motion-aware Cross-modal Memory Attention (MCMA), which leverages ego-motion compensation to enhance temporal consistency and long-horizon feature retrieval, ensuring robust segmentation across dynamically changing autonomous driving scenes. To avoid annotation bottlenecks, we develop a multi-modal automated data engine that synergizes VFM-driven video masklets, spatiotemporal 4D reconstruction, and cross-modal masklet fusion. This framework generates camera-LiDAR aligned pseudo-labels at a speed orders of magnitude faster than human annotation while preserving VFM-derived semantic fidelity in point cloud representations. We conduct extensive experiments on the constructed Waymo-4DSeg, which demonstrate the powerful cross-modal segmentation ability and great potential in data annotation of proposed SAM4D.

[149] SiM3D: Single-instance Multiview Multimodal and Multisetup 3D Anomaly Detection Benchmark

Alex Costanzino,Pierluigi Zama Ramirez,Luigi Lella,Matteo Ragaglia,Alessandro Oliva,Giuseppe Lisanti,Luigi Di Stefano

Main category: cs.CV

TL;DR: This paper introduces SiM3D, the first benchmark for 3D anomaly detection and segmentation that integrates multiview and multimodal information, especially suited for single-instance anomaly detection in manufacturing contexts.

Details Motivation: The motivation behind this work is to address the challenge of performing 3D anomaly detection and segmentation in scenarios where only one object instance is available for training, particularly focusing on generalizing from synthetic training data to real test data. This is of high interest in manufacturing settings. Method: The paper introduces SiM3D, a new benchmark for 3D anomaly detection and segmentation. It utilizes a novel multimodal multiview dataset acquired with industrial sensors and robots, featuring high-resolution images, point clouds, and CAD models. The approach involves adapting existing singleview methods to establish baselines and evaluating them with new metrics based on Anomaly Volumes. Result: The result of the paper is the creation of SiM3D, which is the first benchmark to consider the integration of multiview and multimodal information for 3D anomaly detection and segmentation. It includes a rich dataset with high-resolution images, point clouds, and CAD models, along with manually annotated 3D segmentation ground truths for anomalous test samples. Conclusion: The paper concludes by establishing reference baselines for the proposed multiview 3D ADS task by adapting prominent singleview methods and assessing their performance using novel metrics that operate on Anomaly Volumes. It highlights the significance of integrating multiview and multimodal information for comprehensive 3D anomaly detection and segmentation. Abstract: We propose SiM3D, the first benchmark considering the integration of multiview and multimodal information for comprehensive 3D anomaly detection and segmentation (ADS), where the task is to produce a voxel-based Anomaly Volume. Moreover, SiM3D focuses on a scenario of high interest in manufacturing: single-instance anomaly detection, where only one object, either real or synthetic, is available for training. In this respect, SiM3D stands out as the first ADS benchmark that addresses the challenge of generalising from synthetic training data to real test data. SiM3D includes a novel multimodal multiview dataset acquired using top-tier industrial sensors and robots. The dataset features multiview high-resolution images (12 Mpx) and point clouds (7M points) for 333 instances of eight types of objects, alongside a CAD model for each type. We also provide manually annotated 3D segmentation GTs for anomalous test samples. To establish reference baselines for the proposed multiview 3D ADS task, we adapt prominent singleview methods and assess their performance using novel metrics that operate on Anomaly Volumes.

[150] Whole-Body Conditioned Egocentric Video Prediction

Yutong Bai,Danny Tran,Amir Bar,Yann LeCun,Trevor Darrell,Jitendra Malik

Main category: cs.CV

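TL;DR: This work trains an egocentric video prediction model conditioned on human actions and 3D body pose, presenting initial results on modeling complex real-world environments and embodied behavior.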

Details Motivation: To explore egocentric video prediction of human behavior in complex real-world environments, offering a new direction for modeling the behavior of embodied agents. Method: An auto-regressive conditional diffusion transformer predicts video from kinematic pose trajectories structured by the body's joint hierarchy. Result: The resulting model simulates how human actions shape the environment from a first-person view, and its prediction and control abilities are validated on tasks of increasing difficulty. Conclusion: The paper presents a method for predicting egocentric video from human actions and 3D body pose, trained on the large-scale Nymeria dataset, together with a hierarchical evaluation protocol for analyzing the model's prediction and control abilities. Abstract: We train models to Predict Ego-centric Video from human Actions (PEVA), given the past video and an action represented by the relative 3D body pose. By conditioning on kinematic pose trajectories, structured by the joint hierarchy of the body, our model learns to simulate how physical human actions shape the environment from a first-person point of view. We train an auto-regressive conditional diffusion transformer on Nymeria, a large-scale dataset of real-world egocentric video and body pose capture. We further design a hierarchical evaluation protocol with increasingly challenging tasks, enabling a comprehensive analysis of the model's embodied prediction and control abilities. Our work represents an initial attempt to tackle the challenges of modeling complex real-world environments and embodied agent behaviors with video prediction from the perspective of a human.