cs.CL

[1] Table Understanding and (Multimodal) LLMs: A Cross-Domain Case Study on Scientific vs. Non-Scientific Data

Ekaterina Borisova,Fabio Barth,Nils Feldhus,Raia Abu Ahmad,Malte Ostendorff,Pedro Ortiz Suarez,Georg Rehm,Sebastian Möller

Main category: cs.CL

TL;DR: This paper evaluates how well LLMs process tabular data across different domains and modalities, showing strengths and weaknesses, particularly in handling scientific tables.

Details Motivation: To explore the efficiency of LLMs in processing tabular data, which is widely used in various domains. Method: Cross-domain and cross-modality evaluation of text-based and multimodal LLMs on table understanding tasks, including interpretability analysis and introduction of the TableEval benchmark. Result: LLMs demonstrate robustness in handling tables represented as images and text but face significant challenges with scientific tables. Conclusion: LLMs show robustness across table modalities but struggle with scientific tables. Abstract: Tables are among the most widely used tools for representing structured data in research, business, medicine, and education. Although LLMs demonstrate strong performance in downstream tasks, their efficiency in processing tabular data remains underexplored. In this paper, we investigate the effectiveness of both text-based and multimodal LLMs on table understanding tasks through a cross-domain and cross-modality evaluation. Specifically, we compare their performance on tables from scientific vs. non-scientific contexts and examine their robustness on tables represented as images vs. text. Additionally, we conduct an interpretability analysis to measure context usage and input relevance. We also introduce the TableEval benchmark, comprising 3017 tables from scholarly publications, Wikipedia, and financial reports, where each table is provided in five different formats: Image, Dictionary, HTML, XML, and LaTeX. Our findings indicate that while LLMs maintain robustness across table modalities, they face significant challenges when processing scientific tables.
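
To make the idea of one table in several textual formats concrete, here is a minimal sketch that serialises a small table as a dictionary, as HTML, and as LaTeX. The toy table and the function names are illustrative assumptions, not taken from TableEval, and the benchmark's actual schemas may differ.

```python
from html import escape

# Hypothetical toy table; not drawn from TableEval.
HEADER = ["Model", "Accuracy"]
ROWS = [["A", "0.81"], ["B", "0.77"]]


def to_dict(header, rows):
    """Dictionary form: one {column: value} record per row."""
    return [dict(zip(header, row)) for row in rows]


def to_html(header, rows):
    """Minimal HTML table with a header row."""
    head = "".join(f"<th>{escape(h)}</th>" for h in header)
    body = "".join(
        "<tr>" + "".join(f"<td>{escape(c)}</td>" for c in row) + "</tr>"
        for row in rows
    )
    return f"<table><tr>{head}</tr>{body}</table>"


def to_latex(header, rows):
    """Minimal LaTeX tabular; assumes cells contain no special characters."""
    spec = "l" * len(header)
    lines = [" & ".join(header) + r" \\"]
    lines += [" & ".join(row) + r" \\" for row in rows]
    return "\\begin{tabular}{" + spec + "}\n" + "\n".join(lines) + "\n\\end{tabular}"


for render in (to_dict, to_html, to_latex):
    print(render(HEADER, ROWS))
```

Feeding the same table to a model in each of these forms is one way to probe the cross-format robustness the abstract describes.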

[2] Prompting as Scientific Inquiry

Ari Holtzman,Chenhao Tan

Main category: cs.CL

TL;DR: Prompting is an essential and scientific approach for studying large language models, akin to behavioral science.

Details Motivation: Prompting is a powerful method for studying and controlling large language models, but it is often not considered scientific. The authors aim to reframe its perception. Method: Argumentation based on the analysis of existing prompting applications and interpretability methods. Result: The paper establishes prompting as a form of behavioral science for understanding complex and opaque LLMs. Conclusion: Prompting is a key component in the science of LLMs and should be treated as behavioral science. Abstract: Prompting is the primary method by which we study and control large language models. It is also one of the most powerful: nearly every major capability attributed to LLMs (few-shot learning, chain-of-thought, constitutional AI) was first unlocked through prompting. Yet prompting is rarely treated as science and is frequently frowned upon as alchemy. We argue that this is a category error. If we treat LLMs as a new kind of complex and opaque organism that is trained rather than programmed, then prompting is not a workaround: it is behavioral science. Mechanistic interpretability peers into the neural substrate; prompting probes the model in its native interface: language. We contend that prompting is not inferior, but rather a key component in the science of LLMs.

[3] LineRetriever: Planning-Aware Observation Reduction for Web Agents

Imene Kerboua,Sahar Omidi Shayegan,Megh Thakkar,Xing Han Lù,Massimo Caccia,Véronique Eglin,Alexandre Aussem,Jérémy Espinas,Alexandre Lacoste

Main category: cs.CL

TL;DR: LineRetriever improves web navigation task efficiency by retrieving only the most relevant observation lines for future actions, reducing observation size while maintaining performance.

Details Motivation: Current retrieval approaches like bottom-up truncation or embedding-based retrieval lose crucial information about page state and action history, which is vital for adaptive planning in web agents. This necessitates a more effective retrieval method tailored for adaptive planning. Method: LineRetriever uses a language model to identify and retrieve the most relevant observation lines for future navigation steps, considering the planning horizon and prioritizing elements that aid action prediction. Result: Experiments show that LineRetriever effectively reduces the observation size at each step for web agents without compromising performance, addressing challenges posed by traditional methods. Conclusion: LineRetriever proves to be an effective method for optimizing retrieval in web navigation tasks, reducing observation size while maintaining performance within context limitations. Abstract: While large language models have demonstrated impressive capabilities in web navigation tasks, the extensive context of web pages, often represented as DOM or Accessibility Tree (AxTree) structures, frequently exceeds model context limits. Current approaches like bottom-up truncation or embedding-based retrieval lose critical information about page state and action history. This is particularly problematic for adaptive planning in web agents, where understanding the current state is essential for determining future actions. We hypothesize that embedding models lack sufficient capacity to capture plan-relevant information, especially when retrieving content that supports future action prediction. This raises a fundamental question: how can retrieval methods be optimized for adaptive planning in web navigation tasks? In response, we introduce LineRetriever, a novel approach that leverages a language model to identify and retrieve observation lines most relevant to future navigation steps. Unlike traditional retrieval methods that focus solely on semantic similarity, LineRetriever explicitly considers the planning horizon, prioritizing elements that contribute to action prediction. Our experiments demonstrate that LineRetriever can reduce the size of the observation at each step for the web agent while maintaining consistent performance within the context limitations.
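
The line-level reduction step can be pictured with a rough sketch like the one below. It only shows the mechanical part: given the line indices a selector returns, keep those lines and collapse everything else. The `select_lines` callable stands in for the language model, and the keyword selector used in the demo is a deliberately naive placeholder that does no planning; none of these names or the toy AxTree come from the paper.

```python
from typing import Callable, Iterable, List


def reduce_observation(
    observation: str,
    select_lines: Callable[[List[str]], Iterable[int]],
    placeholder: str = "...",
) -> str:
    """Keep only the selected lines; collapse each skipped run into one placeholder."""
    lines = observation.splitlines()
    keep = {i for i in select_lines(lines) if 0 <= i < len(lines)}
    out: List[str] = []
    for i, line in enumerate(lines):
        if i in keep:
            out.append(line)
        elif not out or out[-1] != placeholder:
            out.append(placeholder)
    return "\n".join(out)


def keyword_selector(keyword: str) -> Callable[[List[str]], List[int]]:
    """Hypothetical stand-in for the language model: pick lines containing a keyword."""
    return lambda lines: [i for i, text in enumerate(lines) if keyword in text.lower()]


# Toy accessibility-tree observation (invented for illustration).
AXTREE = """[1] RootWebArea 'Shop'
[2] navigation 'Main menu'
[3] link 'Home'
[4] textbox 'Search products'
[5] button 'Search'
[6] contentinfo 'Footer'"""

reduced = reduce_observation(AXTREE, keyword_selector("search"))
print(reduced)
```

In a real agent the selector would be a prompted language model that sees the goal and action history, which is where the planning awareness described in the abstract would enter.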

[4] Two-Stage Reasoning-Infused Learning: Improving Classification with LLM-Generated Reasoning

Mads Henrichsen,Rasmus Krebs

Main category: cs.CL

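TL;DR: This paper introduces a new two-stage method that uses LLM-generated reasoning to enhance text classification, significantly improving the accuracy and interpretability of emotion classification.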

Details Motivation: Standard classification models usually map inputs directly to labels without explicit reasoning, which may limit their performance, robustness, and interpretability. Method: A two-stage approach to enhancing text classification: in the first stage, a Llama-3.2-1B-Instruct model is fine-tuned to generate reasoning; in the second stage, that model is used offline to create an augmented training dataset for the downstream task. Result: Compared with a baseline generative model that outputs only the emotion, the generative model trained to output both reasoning and emotion improves emotion-prediction accuracy by 8.7 percentage points. Conclusion: By leveraging LLM-generated reasoning, the approach significantly improves emotion classification accuracy and shows the potential of generated reasoning for augmenting training datasets and providing explicit explanations. Abstract: Standard classification models often map inputs directly to labels without explicit reasoning, potentially limiting their performance, robustness, and interpretability. This paper introduces a novel two-stage approach to enhance text classification by leveraging Large Language Model (LLM)-generated reasonings. In the first stage, we fine-tune a Llama-3.2-1B-Instruct model (henceforth Llama-R-Gen) on a general-purpose reasoning dataset (syvai/reasoning-gen) to generate textual reasoning (R) given a question and its answer. In the second stage, this generally trained Llama-R-Gen is used offline to create an augmented training dataset for a downstream generative model. This downstream model, based on Llama-3.2-1B-Instruct, takes only the input text (Q) and is trained to output the generated reasoning (R) immediately followed by the predicted emotion (A). We demonstrate this methodology on the dair-ai/emotion dataset for emotion classification. Our experiments show that the generative model trained to output reasoning and the emotion (Classifier Q->RA) achieves a significant improvement of 8.7 percentage points in accuracy (for emotion prediction) compared to a baseline generative model trained solely to output the emotion (Classifier Q->A), highlighting the strong generalization capabilities of the reasoning generation and the benefit of explicit reasoning training. This work underscores the potential of LLM-generated reasonings for creating richer training datasets, thereby improving the performance of diverse downstream NLP tasks and providing explicit explanations.
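
The difference between the Q->A baseline and the Q->RA classifier comes down to what the training target contains. The sketch below builds both kinds of example; the prompt template, the sample sentence, and the reasoning string are invented for illustration and are not the authors' format.

```python
from typing import Dict, Optional


def make_example(text: str, label: str, reasoning: Optional[str] = None) -> Dict[str, str]:
    """Build one prompt/target pair.

    With `reasoning`, the target is R followed by A (Q->RA);
    without it, the target is the label alone (Q->A baseline).
    """
    prompt = f"Text: {text}\n"
    if reasoning is None:
        target = f"Emotion: {label}"
    else:
        target = f"Reasoning: {reasoning}\nEmotion: {label}"
    return {"prompt": prompt, "target": target}


# Hypothetical input; in the paper's setup the reasoning would come from Llama-R-Gen.
baseline = make_example("i feel so alone tonight", "sadness")
augmented = make_example(
    "i feel so alone tonight",
    "sadness",
    reasoning="The writer describes loneliness, which signals sadness.",
)

print(baseline["target"])
print(augmented["target"])
```

Because the label comes last in the Q->RA target, the downstream model generates its reasoning before committing to an emotion.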

[5] Towards Style Alignment in Cross-Cultural Translation

Shreya Havaldar,Adam Stein,Eric Wong,Lyle Ungar

Main category: cs.CL

TL;DR: RASTA enhances LLM translation by preserving cultural stylistic nuances, ensuring better alignment between intended and perceived communication styles across different cultures.

Details Motivation: Cultural differences often lead to miscommunication because the speaker's intended style doesn't align with the listener's interpretation. Current LLMs struggle to translate stylistic nuances, especially in non-Western languages, losing important cultural context like politeness. Method: The authors introduce RASTA (Retrieval-Augmented STylistic Alignment), which uses learned stylistic concepts to enhance LLM translations, ensuring they align with cultural communication norms. Result: RASTA successfully mitigates the tendency of LLMs to produce neutral translations by incorporating cultural stylistic elements, thereby improving cross-cultural communication accuracy. Conclusion: RASTA improves the ability of LLMs to maintain and convey cultural communication styles during translation, addressing the issue of misalignment between intended and interpreted styles due to cultural differences. Abstract: Successful communication depends on the speaker's intended style (i.e., what the speaker is trying to convey) aligning with the listener's interpreted style (i.e., what the listener perceives). However, cultural differences often lead to misalignment between the two; for example, politeness is often lost in translation. We characterize the ways that LLMs fail to translate style - biasing translations towards neutrality and performing worse in non-Western languages. We mitigate these failures with RASTA (Retrieval-Augmented STylistic Alignment), a method that leverages learned stylistic concepts to encourage LLM translation to appropriately convey cultural communication norms and align style.

[6] Linearly Decoding Refused Knowledge in Aligned Language Models

Aryan Shrivastava,Ari Holtzman

Main category: cs.CL

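TL;DR: The study finds that although language models are instruction-tuned to avoid generating harmful content, the harmful information obtained through jailbreak prompts can still be decoded by linear probes and indirectly influences model behavior in downstream tasks.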

Details Motivation: Most commonly used language models are aligned through instruction tuning and reinforcement learning, which leads them to refuse to generate harmful content. Jailbreak prompts can bypass this mechanism, however, so the decodability and persistence of such refused information need to be studied. Method: Linear probes are trained on model hidden states to test whether the information accessed via jailbreak prompts is decodable, and to analyze whether that information persists and is used in instruction-tuned models. Result: Much of the initially refused information is highly decodable by linear probes; probes trained on base models sometimes transfer to instruction-tuned models and reveal the information that jailbreaks decode; this information is not merely residual but also plays a role in downstream tasks. Conclusion: Instruction tuning does not fully eliminate or relocate harmful information in representation space; it merely suppresses its direct expression, leaving it accessible to linear probing and indirectly influential in downstream behavior. Abstract: Most commonly used language models (LMs) are instruction-tuned and aligned using a combination of fine-tuning and reinforcement learning, causing them to refuse user requests deemed harmful by the model. However, jailbreak prompts can often bypass these refusal mechanisms and elicit harmful responses. In this work, we study the extent to which information accessed via jailbreak prompts is decodable using linear probes trained on LM hidden states. We show that a great deal of initially refused information is linearly decodable. For example, across models, the response of a jailbroken LM for the average IQ of a country can be predicted by a linear probe with Pearson correlations exceeding $0.8$. Surprisingly, we find that probes trained on base models (which do not refuse) sometimes transfer to their instruction-tuned versions and are capable of revealing information that jailbreaks decode generatively, suggesting that the internal representations of many refused properties persist from base LMs through instruction-tuning. Importantly, we show that this information is not merely "leftover" in instruction-tuned models, but is actively used by them: we find that probe-predicted values correlate with LM generated pairwise comparisons, indicating that the information decoded by our probes aligns with suppressed generative behavior that may be expressed more subtly in other downstream tasks. Overall, our results suggest that instruction-tuning does not wholly eliminate or even relocate harmful information in representation space; it merely suppresses its direct expression, leaving it both linearly accessible and indirectly influential in downstream behavior.
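
For readers unfamiliar with linear probing, the following is a minimal sketch of the general technique: fit a linear map from feature vectors to a scalar target, then report the Pearson correlation on held-out data. No language model is involved here. The "hidden states" are synthetic Gaussian vectors and the target is a noisy linear function of them, both assumptions made purely so the example runs with the standard library.

```python
import random


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5


def fit_linear_probe(X, y, lr=0.05, epochs=200):
    """Least-squares linear probe trained with full-batch gradient descent."""
    d, n = len(X[0]), len(X)
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for x, t in zip(X, y):
            err = sum(wi * xi for wi, xi in zip(w, x)) + b - t
            for j in range(d):
                grad_w[j] += err * x[j] / n
            grad_b += err / n
        w = [wi - lr * g for wi, g in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b


def predict(w, b, X):
    return [sum(wi * xi for wi, xi in zip(w, x)) + b for x in X]


# Synthetic stand-in for hidden states and the property being probed.
rng = random.Random(0)
true_w = [0.8, -0.5, 0.3, 0.0]
X = [[rng.gauss(0, 1) for _ in true_w] for _ in range(300)]
y = [sum(a * v for a, v in zip(true_w, x)) + rng.gauss(0, 0.2) for x in X]

X_train, y_train, X_test, y_test = X[:200], y[:200], X[200:], y[200:]
w, b = fit_linear_probe(X_train, y_train)
r = pearson(predict(w, b, X_test), y_test)
print(f"held-out Pearson r = {r:.3f}")
```

In the setting the abstract describes, the rows of `X` would instead be hidden states extracted from a model and `y` the values elicited from its jailbroken responses; a probe fitted on base-model states could then be applied unchanged to instruction-tuned states to test transfer.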

[7] The Algebraic Structure of Morphosyntax

Isabella Senturia,Matilde Marcolli

Main category: cs.CL

TL;DR: This paper proposes a mathematical model for the morphology-syntax interface using operads and magma, showing how morphological and syntactic structures interact through coproduct decomposition and reinterpreting aspects of Distributed Morphology.

Details Motivation: The motivation stems from the need to better understand the morphology-syntax interface within the framework of the Strong Minimalist Thesis, particularly how morphological and syntactic components interact during word and structure formation. Method: The authors use mathematical structures such as magma and operads to model the morphology-syntax interface. They employ coproduct decomposition and extend morphological trees beyond those generated by the magma to form algebras over operads. Result: A mathematical model was developed that describes morphosyntactic trees using operadic correspondence, enabling a structured interaction between morphological and syntactic data through the coproduct decomposition. Conclusion: The paper concludes that morphosyntactic trees can be effectively modeled through an operadic correspondence, offering a reinterpretation of certain operations in Distributed Morphology and allowing flexibility in defining the boundary between syntax and morphology. Abstract: Within the context of the mathematical formulation of Merge and the Strong Minimalist Thesis, we present a mathematical model of the morphology-syntax interface. In this setting, morphology has compositional properties responsible for word formation, organized into a magma of morphological trees. However, unlike syntax, we do not have movement within morphology. A coproduct decomposition exists, but it requires extending the set of morphological trees beyond those which are generated solely by the magma, to a larger set of possible morphological inputs to syntactic trees. These participate in the formation of morphosyntactic trees as an algebra over an operad, and a correspondence between algebras over an operad. The process of structure formation for morphosyntactic trees can then be described in terms of this operadic correspondence that pairs syntactic and morphological data and the morphology coproduct. 
We reinterpret in this setting certain operations of Distributed Morphology as transformations that allow for flexibility in moving the boundary between syntax and morphology within the morphosyntactic objects.

[8] EfficientXLang: Towards Improving Token Efficiency Through Cross-Lingual Reasoning

Sanchit Ahuja,Praneetha Vaddamanu,Barun Patra

Main category: cs.CL

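TL;DR: The study finds that reasoning in non-English languages can be more token-efficient in multilingual models, reducing token usage without sacrificing accuracy, which points to the potential of multilingual reasoning.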

Details Motivation: Although many models are pretrained on multilingual data, most research focuses only on English; this study explores whether English is the most token-efficient language for reasoning. Method: Three open-source RLMs (DeepSeek R1, Qwen 2.5, and Qwen 3) are evaluated on four math datasets and seven typologically diverse languages. Result: Non-English reasoning reduces token usage while preserving accuracy, and the gains persist even after the reasoning traces are translated into English. Conclusion: Non-English reasoning reduces token usage while maintaining accuracy, highlighting the potential of multilingual reasoning and the importance of strong multilingual foundations. Abstract: Despite recent advances in Language Reasoning Models (LRMs), most research focuses solely on English, even though many models are pretrained on multilingual data. In this work, we investigate: Is English the most token-efficient language for reasoning? We evaluate three open-source RLMs: DeepSeek R1, Qwen 2.5 and Qwen 3, across four math datasets and seven typologically diverse languages. We find that reasoning in non-English languages not only reduces token usage, but also preserves accuracy. These gains persist even after translating the reasoning traces into English, suggesting genuine shifts in reasoning behavior rather than surface-level linguistic effects. The extent of improvement, however, depends on the model's multilingual strength. Our findings motivate a broader view of reasoning in language models, highlighting the potential of multilingual reasoning and the importance of strong multilingual foundations. The code for our work can be found at: https://github.com/microsoft/EfficientXLang.

[9] Impact of Fine-Tuning Methods on Memorization in Large Language Models

Jie Hou,Chuxiong Wu,Lannan Luo,Qiang Zeng

Main category: cs.CL

TL;DR: This paper investigates the privacy risks associated with memorization during fine-tuning of pre-trained large language models (LLMs). It finds that prompt-based fine-tuning is less susceptible to membership inference attacks (MIAs) than parameter-based fine-tuning, making it a more privacy-preserving approach.

Details Motivation: With the advancement of pre-trained large language models (LLMs) and the popularity of the 'pre-train and fine-tune' paradigm, there is a growing concern about privacy risks arising from memorization during fine-tuning. This study aims to address this gap by examining different fine-tuning methods. Method: The study categorizes popular fine-tuning approaches and evaluates their impact on memorization using membership inference attacks (MIAs). Result: Compared to parameter-based fine-tuning, prompt-based fine-tuning demonstrates competitive performance while being less vulnerable to MIAs. Additionally, prompt-based methods consistently maintain low memorization irrespective of model scale. Conclusion: Prompt-based fine-tuning is a more privacy-preserving option compared to parameter-based fine-tuning. Abstract: As the capabilities of pre-trained large language models (LLMs) continue to advance, the "pre-train and fine-tune" paradigm has become increasingly mainstream, leading to the development of various fine-tuning methods. However, the privacy risks arising from memorization during fine-tuning have received relatively little attention. To address this gap, we categorize popular fine-tuning approaches and assess their impact on memorization through the lens of membership inference attacks (MIAs). Our results show that, compared to parameter-based fine-tuning, prompt-based fine-tuning achieves competitive performance while exhibiting lower vulnerability to MIAs. Furthermore, prompt-based methods maintain low memorization regardless of model scale. These findings suggest that parameter-based fine-tuning is more prone to leaking private information, whereas prompt-based fine-tuning serves as a more privacy-preserving option.
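
As background on how a membership inference attack can quantify memorization, here is a sketch of a simple loss-based attack: examples with lower loss are guessed to be training members, and the attack's AUC summarizes how separable members and non-members are. The abstract does not say which MIA variants the authors use, so this is a common baseline chosen for illustration, and the per-example losses are synthetic numbers, not results from the paper.

```python
import random


def mia_auc(member_losses, nonmember_losses):
    """AUC of the attack score `-loss`: the probability that a random member
    has lower loss than a random non-member (ties count as half)."""
    wins = 0.0
    for m in member_losses:
        for n in nonmember_losses:
            if m < n:
                wins += 1.0
            elif m == n:
                wins += 0.5
    return wins / (len(member_losses) * len(nonmember_losses))


rng = random.Random(0)
# Synthetic losses (assumption): a model that memorises its fine-tuning data
# assigns members much lower loss than unseen examples.
nonmembers = [rng.gauss(2.0, 0.5) for _ in range(500)]
members_high_memorization = [rng.gauss(1.0, 0.5) for _ in range(500)]
members_low_memorization = [rng.gauss(1.95, 0.5) for _ in range(500)]

auc_high = mia_auc(members_high_memorization, nonmembers)
auc_low = mia_auc(members_low_memorization, nonmembers)
print(f"AUC, strong memorization: {auc_high:.2f}")
print(f"AUC, weak memorization:   {auc_low:.2f}")
```

An AUC near 0.5 means the attacker does no better than chance, which is the sense in which a fine-tuning method with low memorization is more privacy-preserving.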

[10] Natural language processing for African languages

David Ifeoluwa Adelani

Main category: cs.CL

TL;DR: This dissertation improves NLP for low-resource African languages by curating high-quality data, analyzing word embeddings and multilingual models, and creating large annotated datasets for under-represented languages.

Details Motivation: Multilingual NLP models face challenges with low-resource languages, particularly in Sub-Saharan Africa, due to limited labeled data, noisy corpora, and lack of evaluation resources. This work addresses these issues by focusing on improving NLP for African languages through better data and model adaptation. Method: The study analyzes noise in existing corpora, curates a high-quality corpus, and evaluates word embeddings and multilingual PLMs in low-resource scenarios. It also involves the development of annotated datasets for 21 African languages and conducts empirical evaluations using supervised, weakly-supervised, and transfer learning methods. Result: The research demonstrates the importance of data quality over quantity in semantic representation learning, shows the effectiveness of adapting PLMs for unseen African languages, and introduces large-scale annotated datasets for 21 African languages in named entity recognition and machine translation tasks. Conclusion: The dissertation concludes that the quality of semantic representations in word embeddings depends not only on data quantity but also on pre-training data quality. It highlights the limitations of word embeddings and the potential of multilingual PLMs for low-resource African languages, emphasizing the need for annotated datasets in under-represented languages. Abstract: Recent advances in word embeddings and language models use large-scale, unlabelled data and self-supervised learning to boost NLP performance. Multilingual models, often trained on web-sourced data like Wikipedia, face challenges: few low-resource languages are included, their data is often noisy, and lack of labeled datasets makes it hard to evaluate performance outside high-resource languages like English. 
In this dissertation, we focus on languages spoken in Sub-Saharan Africa where all the indigenous languages in this region can be regarded as low-resourced in terms of the availability of labelled data for NLP tasks and unlabelled data found on the web. We analyse the noise in the publicly available corpora, and curate a high-quality corpus, demonstrating that the quality of semantic representations learned in word embeddings does not only depend on the amount of data but on the quality of pre-training data. We demonstrate empirically the limitations of word embeddings, and the opportunities the multilingual pre-trained language model (PLM) offers especially for languages unseen during pre-training and low-resource scenarios. We further study how to adapt and specialize multilingual PLMs to unseen African languages using a small amount of monolingual texts. To address the under-representation of the African languages in NLP research, we developed large scale human-annotated labelled datasets for 21 African languages in two impactful NLP tasks: named entity recognition and machine translation. We conduct an extensive empirical evaluation using state-of-the-art methods across supervised, weakly-supervised, and transfer learning settings.

[11] Failure by Interference: Language Models Make Balanced Parentheses Errors When Faulty Mechanisms Overshadow Sound Ones

Daking Rai,Samuel Miller,Kevin Moran,Ziyu Yao

Main category: cs.CL

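TL;DR: This paper introduces RASteer, a new method that improves language model performance on syntactic and arithmetic reasoning tasks by systematically identifying and increasing the contribution of reliable components.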

Details Motivation: Despite notable progress in coding ability, language models still struggle with simple syntactic tasks such as generating balanced parentheses, which prompts the authors to study the mechanisms behind these errors and look for ways to mitigate them. Method: By analyzing the mechanisms implemented by attention heads and feed-forward neurons in language models, the authors propose RASteer, a method that systematically identifies reliable components and increases their contribution. Result: RASteer raises the accuracy of some models on balanced-parentheses tasks from 0% to nearly 100% and yields performance gains of up to around 20% on arithmetic reasoning tasks. Conclusion: RASteer effectively improves model performance on balanced-parentheses and arithmetic reasoning tasks without impairing general coding ability. Abstract: Despite remarkable advances in coding capabilities, language models (LMs) still struggle with simple syntactic tasks such as generating balanced parentheses. In this study, we investigate the underlying mechanisms behind the persistence of these errors across LMs of varying sizes (124M-7B) to both understand and mitigate the errors. Our study reveals that LMs rely on a number of components (attention heads and FF neurons) that independently make their own predictions. While some components reliably promote correct answers across a generalized range of inputs (i.e., implementing "sound mechanisms"), others are less reliable and introduce noise by promoting incorrect tokens (i.e., implementing "faulty mechanisms"). Errors occur when the faulty mechanisms overshadow the sound ones and dominantly affect the predictions. Motivated by this insight, we introduce RASteer, a steering method to systematically identify and increase the contribution of reliable components for improving model performance. RASteer substantially improves performance on balanced parentheses tasks, boosting accuracy of some models from $0$% to around $100$% without impairing the models' general coding ability. We further demonstrate its broader applicability in arithmetic reasoning tasks, achieving performance gains of up to around $20$%.

[12] Modeling Data Diversity for Joint Instance and Verbalizer Selection in Cold-Start Scenarios

Mohna Chakraborty,Adithya Kulkarni,Qi Li

Main category: cs.CL

TL;DR: This paper proposes COLDSELECT, a method for improving prompt-based language model performance in cold-start scenarios by jointly optimizing verbalizer and instance selection, leveraging embedding proximity and diversity modeling.

Details Motivation: Prompt-based methods are sensitive to template, verbalizer, and few-shot instance selection, particularly in cold-start settings with no labeled data. Existing studies overlook the dependency between instances and verbalizers. Method: COLDSELECT maps PLM vocabulary and $h_{[MASK]}$ embeddings into a shared space, applying dimensionality reduction and clustering for efficient and diverse selection. Result: Experiments on eight benchmarks show COLDSELECT's superiority in reducing uncertainty and enhancing generalization, outperforming baselines in verbalizer and few-shot instance selection for cold-start scenarios. Conclusion: COLDSELECT effectively reduces uncertainty and enhances generalization in cold-start settings by jointly optimizing verbalizer and instance selection. Abstract: Prompt-based methods leverage the knowledge of pre-trained language models (PLMs) trained with a masked language modeling (MLM) objective; however, these methods are sensitive to template, verbalizer, and few-shot instance selection, particularly in cold-start settings with no labeled data. Existing studies overlook the dependency between instances and verbalizers, where instance-label probabilities depend on verbalizer token proximity in the embedding space. To address this, we propose COLDSELECT, a joint verbalizer and instance selection approach that models data diversity. COLDSELECT maps PLM vocabulary and $h_{[MASK]}$ embeddings into a shared space, applying dimensionality reduction and clustering to ensure efficient and diverse selection. By optimizing for minimal uncertainty and maximal diversity, COLDSELECT captures data relationships effectively. Experiments on eight benchmarks demonstrate COLDSELECT's superiority in reducing uncertainty and enhancing generalization, outperforming baselines in verbalizer and few-shot instance selection for cold-start scenarios.

[13] Question Decomposition for Retrieval-Augmented Generation

Paul J. L. Ammann,Jonas Golde,Alan Akbik

Main category: cs.CL

TL;DR: This paper enhances RAG for multi-hop questions by combining LLM-based question decomposition and reranking, achieving better retrieval and answer accuracy without extra training.

Details Motivation: Multi-hop questions challenge standard RAG approaches because relevant information is often distributed across multiple documents. This work aims to address this limitation by improving retrieval coverage and reducing noise. Method: The approach involves decomposing the original query into sub-questions using an LLM, retrieving passages for each sub-question, and then reranking the merged candidate pool to enhance coverage and precision. Result: The method achieved significant improvements in retrieval performance (MRR@10: +36.7%) and answer accuracy (F1: +11.6%) over standard RAG baselines on datasets like MultiHop-RAG and HotpotQA. Conclusion: The proposed RAG pipeline with question decomposition and reranking improves retrieval effectiveness and answer accuracy for multi-hop questions without requiring additional training or specialized indexing. Abstract: Grounding large language models (LLMs) in verifiable external sources is a well-established strategy for generating reliable answers. Retrieval-augmented generation (RAG) is one such approach, particularly effective for tasks like question answering: it retrieves passages that are semantically related to the question and then conditions the model on this evidence. However, multi-hop questions, such as "Which company among NVIDIA, Apple, and Google made the biggest profit in 2023?," challenge RAG because relevant facts are often distributed across multiple documents rather than co-occurring in one source, making it difficult for standard RAG to retrieve sufficient information. To address this, we propose a RAG pipeline that incorporates question decomposition: (i) an LLM decomposes the original query into sub-questions, (ii) passages are retrieved for each sub-question, and (iii) the merged candidate pool is reranked to improve the coverage and precision of the retrieved evidence. 
We show that question decomposition effectively assembles complementary documents, while reranking reduces noise and promotes the most relevant passages before answer generation. Although reranking itself is standard, we show that pairing an off-the-shelf cross-encoder reranker with LLM-driven question decomposition bridges the retrieval gap on multi-hop questions and provides a practical, drop-in enhancement, without any extra training or specialized indexing. We evaluate our approach on MultiHop-RAG and HotpotQA, showing gains in retrieval (MRR@10: +36.7%) and answer accuracy (F1: +11.6%) over standard RAG baselines.
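
The three steps (decompose, retrieve per sub-question, rerank the merged pool) can be sketched as follows, together with the MRR@k metric in which the retrieval gain is reported. Every component here is a toy assumption: the decomposer is hard-coded where the paper uses an LLM, retrieval and reranking both use word overlap where the paper uses a retriever and a cross-encoder, and the four-sentence corpus is invented.

```python
import re
from typing import Callable, Dict, List, Set


def decompose_retrieve_rerank(
    question: str,
    decompose: Callable[[str], List[str]],
    retrieve: Callable[[str, int], List[str]],
    score: Callable[[str, str], float],
    k_retrieve: int = 1,
    k_final: int = 3,
) -> List[str]:
    """(i) split the question, (ii) retrieve per sub-question,
    (iii) rerank the merged, de-duplicated pool against the original question."""
    pool: Dict[str, None] = {}  # dict keeps insertion order and drops duplicates
    for sub_question in decompose(question):
        for passage in retrieve(sub_question, k_retrieve):
            pool.setdefault(passage, None)
    return sorted(pool, key=lambda p: score(question, p), reverse=True)[:k_final]


def mrr_at_k(ranked_lists: List[List[str]], relevant: List[Set[str]], k: int = 10) -> float:
    """Mean reciprocal rank of the first relevant item within the top k."""
    total = 0.0
    for ranked, gold in zip(ranked_lists, relevant):
        for rank, item in enumerate(ranked[:k], start=1):
            if item in gold:
                total += 1.0 / rank
                break
    return total / len(ranked_lists)


# --- Toy stand-ins (all hypothetical) ---
CORPUS = [
    "NVIDIA reported its 2023 profit in its annual report.",
    "Apple reported its 2023 profit in its annual report.",
    "Google reported its 2023 profit in its annual report.",
    "Bananas are rich in potassium.",
]


def overlap(query: str, passage: str) -> float:
    tokens = lambda s: set(re.findall(r"[a-z0-9]+", s.lower()))
    return float(len(tokens(query) & tokens(passage)))


def toy_retrieve(query: str, k: int) -> List[str]:
    return sorted(CORPUS, key=lambda p: overlap(query, p), reverse=True)[:k]


def toy_decompose(question: str) -> List[str]:
    return [f"What profit did {c} make in 2023?" for c in ("NVIDIA", "Apple", "Google")]


QUESTION = "Which company among NVIDIA, Apple, and Google made the biggest profit in 2023?"
evidence = decompose_retrieve_rerank(QUESTION, toy_decompose, toy_retrieve, overlap)
print(evidence)
```

With a single retrieval slot per query, retrieving on the original question alone can return only one company's passage, whereas the decomposed pipeline gathers one passage per company before reranking.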

[14] Gregorian melody, modality, and memory: Segmenting chant with Bayesian nonparametrics

Vojtěch Lanz,Jan Hajič jr

Main category: cs.CL

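TL;DR: This paper examines the "centonisation" theory, which holds that Gregorian chant melodies are built from a vocabulary of segments, using nested hierarchical Pitman-Yor language models for unsupervised segmentation. The resulting optimal segmentation performs strongly in mode classification and reveals that the beginnings and ends of melodies are more formulaic. Nevertheless, the results do not support the traditional notion of centonisation.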

Details Motivation: Inspired by the fact that Gregorian chant was transmitted through memory, the authors look for an optimal unsupervised segmentation of chant melodies in order to explore the relationship between melodic structure and memory efficiency. Method: Nested hierarchical Pitman-Yor language models are used to segment chant melodies without supervision, and the segmentation is evaluated on a mode classification task. Result: The segmentation achieves state-of-the-art performance in mode classification, and the beginnings and ends of melodies turn out to be more formulaic, reflecting the role of modality in memory and performance. Conclusion: Although the optimal segmentation improves mode classification and is linked to memory efficiency, it does not correspond to centonisation in the traditional sense. Abstract: The idea that Gregorian melodies are constructed from some vocabulary of segments has long been a part of chant scholarship. This so-called "centonisation" theory has received much musicological criticism, but frequent re-use of certain melodic segments has been observed in chant melodies, and the intractable number of possible segmentations allowed the option that some undiscovered segmentation exists that will yet prove the value of centonisation, and recent empirical results have shown that segmentations can outperform music-theoretical features in mode classification. Inspired by the fact that Gregorian chant was memorised, we search for an optimal unsupervised segmentation of chant melody using nested hierarchical Pitman-Yor language models. The segmentation we find achieves state-of-the-art performance in mode classification. Modeling a monk memorising the melodies from one liturgical manuscript, we then find empirical evidence for the link between mode classification and memory efficiency, and observe more formulaic areas at the beginnings and ends of melodies corresponding to the practical role of modality in performance. However, the resulting segmentations themselves indicate that even such a memory-optimal segmentation is not what is understood as centonisation.

[15] Causal Prompting for Implicit Sentiment Analysis with Large Language Models

Jing Ren,Wenhao Zhou,Bowen Li,Mujie Liu,Nguyen Linh Dan Le,Jiade Cen,Liping Chen,Ziqi Xu,Xiwei Xu,Xiaodong Li

Main category: cs.CL

TL;DR: This paper introduces CAPITAL, a causal prompting framework for Implicit Sentiment Analysis that enhances Large Language Models' reasoning by incorporating causal inference, significantly improving performance and robustness against biases.

Details Motivation: Current prompting-based methods for ISA rely on majority voting over reasoning paths without assessing causal validity, leading to biases and spurious correlations. There is a need for more robust and causally-aware approaches. Method: CAPITAL incorporates front-door adjustment into chain-of-thought reasoning, decomposing causal effects into two components: input prompt influence on reasoning chains and impact of those chains on output. It uses encoder-based clustering, NWGM approximation, and contrastive learning to align representations. Result: Experiments show that CAPITAL outperforms strong prompting baselines in accuracy and robustness, especially under adversarial conditions, demonstrating the benefits of causal inference in sentiment reasoning with LLMs. Conclusion: The proposed CAPITAL framework effectively improves the accuracy and robustness of implicit sentiment analysis by integrating causal inference into LLM prompting, addressing issues of bias and spurious correlations. Abstract: Implicit Sentiment Analysis (ISA) aims to infer sentiment that is implied rather than explicitly stated, requiring models to perform deeper reasoning over subtle contextual cues. While recent prompting-based methods using Large Language Models (LLMs) have shown promise in ISA, they often rely on majority voting over chain-of-thought (CoT) reasoning paths without evaluating their causal validity, making them susceptible to internal biases and spurious correlations. To address this challenge, we propose CAPITAL, a causal prompting framework that incorporates front-door adjustment into CoT reasoning. CAPITAL decomposes the overall causal effect into two components: the influence of the input prompt on the reasoning chains, and the impact of those chains on the final output. 
These components are estimated using encoder-based clustering and the NWGM approximation, with a contrastive learning objective used to better align the encoder's representation with the LLM's reasoning space. Experiments on benchmark ISA datasets with three LLMs demonstrate that CAPITAL consistently outperforms strong prompting baselines in both accuracy and robustness, particularly under adversarial conditions. This work offers a principled approach to integrating causal inference into LLM prompting and highlights its benefits for bias-aware sentiment reasoning. The source code and case study are available at: https://github.com/whZ62/CAPITAL.
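The two-component decomposition in the abstract matches the standard front-door adjustment formula, P(y | do(x)) = Σ_r P(r | x) Σ_x' P(y | r, x') P(x'), with the reasoning chain r as mediator. Below is a minimal numeric sketch of that formula over small discrete tables. It is not the CAPITAL implementation: the paper estimates these terms with encoder-based clustering and the NWGM approximation, whereas the probability tables and the function name here are hypothetical and chosen only for illustration.

```python
from typing import Dict, Tuple

def front_door_effect(
    x: int,
    p_x: Dict[int, float],
    p_r_given_x: Dict[int, Dict[int, float]],
    p_y_given_r_x: Dict[Tuple[int, int], float],
) -> float:
    """P(Y=1 | do(X=x)) via front-door adjustment over a discrete mediator R."""
    total = 0.0
    for r, p_r in p_r_given_x[x].items():
        # Component 1: how the input x influences the reasoning chain r.
        # Component 2: how r influences the output, averaged over all inputs x'.
        chain_to_output = sum(
            p_y_given_r_x[(r, x_prime)] * p_xp for x_prime, p_xp in p_x.items()
        )
        total += p_r * chain_to_output
    return total

# Hypothetical toy distributions (binary X, R, Y).
p_x = {0: 0.5, 1: 0.5}
p_r_given_x = {0: {0: 0.8, 1: 0.2}, 1: {0: 0.3, 1: 0.7}}
p_y_given_r_x = {(0, 0): 0.1, (0, 1): 0.3, (1, 0): 0.6, (1, 1): 0.8}

effect = front_door_effect(1, p_x, p_r_given_x, p_y_given_r_x)
```

With these toy tables the inner sums are 0.2 for r = 0 and 0.7 for r = 1, so the result is 0.3 × 0.2 + 0.7 × 0.7 = 0.55.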

[16] Beyond Sociodemographic Prompting: Using Supervision to Align LLMs with Human Response Distributions

Gauri Kambhatla,Sanjana Gautam,Angela Zhang,Alex Liu,Ravi Srinivasan,Junyi Jessy Li,Matthew Lease

Main category: cs.CL

TL;DR: This paper explores how simple supervision can improve language model alignment with different population groups, offering a benchmark for future research.

Details Motivation: The motivation of this paper is to address the challenge of accurately predicting responses from diverse population groups to subjective questions, which has significant practical value. Method: The authors use relatively simple supervision techniques to enhance model alignment and evaluate performance across multiple datasets, language models, and prompting strategies. Result: The study finds that simple supervision significantly improves model alignment with various population groups, and the approach's generality supports easy adoption. Conclusion: The paper concludes that their method offers practical guidance on when to use such approaches and encourages future research through the open-sourced benchmark they provide. Abstract: The ability to accurately predict how different population groups would answer subjective questions would have great value. In this work, we show that use of relatively simple supervision can greatly improve language model alignment with diverse population groups, as measured over three datasets spanning various topics. Beyond evaluating average performance, we also report how alignment varies across specific groups. The simplicity and generality of our approach promotes easy adoption, while our broad findings provide useful guidance for when to use or not use our approach in practice. By conducting evaluation over many LLMs and prompting strategies, along with open-sourcing our work, we provide a useful benchmark to stimulate future research.

[17] Pitfalls of Evaluating Language Models with Open Benchmarks

Md. Najib Hasan,Mohammad Fakhruddin Babar,Souvika Sarkar,Monowar Hasan,Santu Karmaker

Main category: cs.CL

TL;DR: This study reveals flaws in open LLM benchmarks by creating models that perform well on these benchmarks but fail in real-world applications, highlighting the need for improved evaluation methods.

Details Motivation: The motivation behind this study is to expose the underexplored pitfalls of open Large Language Model (LLM) benchmarks, such as HELM and BIG-bench, which could compromise fair comparison and reproducibility due to their openness. Method: The researchers systematically constructed 'cheating' models, which are smaller variants of BART, T5, and GPT-2 fine-tuned directly on public test sets, to evaluate their performance on open benchmarks like HELM. Result: The 'cheating' models achieved top rankings on the HELM benchmark despite poor generalization and limited practical utility, demonstrating that high leaderboard performance may not equate to real-world effectiveness. Conclusion: The study concludes that open LLM benchmarks may not accurately reflect real-world effectiveness and highlights the need for reevaluating current benchmarking practices to ensure robust and trustworthy LM assessments. Abstract: Open Large Language Model (LLM) benchmarks, such as HELM and BIG-bench, offer standardized, transparent protocols that facilitate the fair comparison, reproducibility, and iterative advancement of Language Models (LMs). However, their openness also introduces critical and underexplored pitfalls. This study exposes these weaknesses by systematically constructing "cheating" models -- smaller variants of BART, T5, and GPT-2 fine-tuned directly on public test sets -- which achieve top rankings on a prominent open, holistic benchmark (HELM) despite poor generalization and limited practical utility. Our findings underscore three key insights: (a) high leaderboard performance on open benchmarks may not always reflect real-world effectiveness; (b) private or dynamic benchmarks must complement open evaluations to safeguard integrity; and (c) a fundamental reevaluation of current benchmarking practices is essential to ensure robust and trustworthy LM assessments.

To Eun Kim,João Coelho,Gbemileke Onilude,Jai Singh

Main category: cs.CL

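TL;DR: This paper proposes a modular pipeline for advertisement management in RAG-based conversational systems, addressing the transparency and user-experience problems raised by ad integration.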

Details Motivation: Generative search systems blur the boundary between informational content and promotional material, raising concerns around transparency and trust. Method: Synthetic data is used to train high-performing classifiers, which then guide optimization of the ad-rewriter through supervised fine-tuning and best-of-N sampling. Result: Experiments show that the proposed ad-classifier achieves robust detection performance and that the classifier-guided optimization significantly improves ad stealth. Conclusion: The study proposes a modular advertisement-management pipeline for RAG-based conversational systems and demonstrates the effectiveness of classifier-guided optimization. Abstract: As conversational search engines increasingly adopt generation-based paradigms powered by Large Language Models (LLMs) and Retrieval-Augmented Generation (RAG), the integration of advertisements into generated responses presents both commercial opportunities and challenges for user experience. Unlike traditional search, where advertisements are clearly delineated, generative systems blur the boundary between informational content and promotional material, raising concerns around transparency and trust. In this work, we propose a modular pipeline for advertisement management in RAG-based conversational systems, consisting of an ad-rewriter for seamless ad integration and a robust ad-classifier for detection. We leverage synthetic data to train high-performing classifiers, which are then used to guide two complementary ad-integration strategies: supervised fine-tuning of the ad-rewriter and a best-of-N sampling approach that selects the least detectable ad-integrated response among multiple candidates. Our evaluation focuses on two core questions: the effectiveness of ad classifiers in detecting diverse ad integration strategies, and the training methods that best support coherent, minimally intrusive ad insertion. Experimental results show that our ad-classifier, trained on synthetic advertisement data inspired by marketing strategies and enhanced through curriculum learning, achieves robust detection performance. Additionally, we demonstrate that classifier-guided optimization, through both fine-tuning and best-of-N sampling, significantly improves ad stealth, enabling more seamless integration.
These findings contribute an adversarial co-evolution framework for developing more sophisticated ad-aware generative search systems and robust ad classifiers.
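The best-of-N strategy mentioned in the abstract (selecting the least detectable ad-integrated response among multiple candidates) can be sketched as below. This is a minimal illustration rather than the authors' code: `detect_prob` stands in for the trained ad-classifier, and the keyword-based `toy_detect_prob` scorer and the candidate strings are hypothetical placeholders.

```python
from typing import Callable, Sequence

def best_of_n(candidates: Sequence[str], detect_prob: Callable[[str], float]) -> str:
    """Return the candidate with the lowest ad-detection score."""
    if not candidates:
        raise ValueError("need at least one candidate")
    return min(candidates, key=detect_prob)

def toy_detect_prob(text: str) -> float:
    """Hypothetical stand-in for an ad-classifier: fraction of promo keywords present."""
    keywords = ("buy", "discount", "sale")
    hits = sum(word in text.lower() for word in keywords)
    return hits / len(keywords)

candidates = [
    "Buy now and get a discount on BrandX shoes!",
    "For trail running, cushioned shoes such as BrandX can reduce fatigue.",
    "BrandX shoes are on sale, buy today.",
]
chosen = best_of_n(candidates, toy_detect_prob)
```

In this toy run the second candidate is chosen because it triggers none of the keywords; in the setting described above the score would come from the trained classifier instead.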

[19] NIRANTAR: Continual Learning with New Languages and Domains on Real-world Speech Data

Tahir Javed,Kaushal Bhogale,Mitesh M. Khapra

Main category: cs.CL

TL;DR: This paper introduces Nirantar, a new framework for evaluating continual learning in multilingual and multi-domain ASR. Unlike previous frameworks, it uses real-world data from 22 languages and 208 districts in India, allowing for more accurate simulation of real-world CL challenges. The paper shows that no single CL method performs consistently well, highlighting the need for more robust strategies.

Details Motivation: Continual learning (CL) in multilingual and multi-domain ASR presents real-world challenges. Prior work relies on simulated episodes, which may not accurately reflect these challenges. The authors aim to introduce a more realistic framework for CL research. Method: The paper introduces Nirantar, a framework for evaluating continual learning in multilingual and multi-domain ASR. It leverages data collected incrementally across 22 languages and 208 districts in India through natural episodes. Result: Nirantar enables systematic benchmarking of CL methods with 3250 hours of human-transcribed speech. It allows evaluation across Language-Incremental (LIL), Domain-Incremental (DIL), and the novel Language-Incremental Domain-Incremental Learning (LIDIL) scenarios. Evaluation of existing approaches shows no single method performs consistently well. Conclusion: The paper concludes that Nirantar is an effective framework for CL research due to its dynamic, non-uniform language and domain shifts. It also finds that no single method performs consistently well, highlighting the need for more robust CL strategies. Abstract: We introduce Nirantar, a comprehensive framework for evaluating continual learning (CL) in multilingual and multi-domain ASR. Designed to reflect real-world CL challenges, Nirantar leverages data collected incrementally across 22 languages and 208 districts in India through natural episodes. This enables evaluation across Language-Incremental (LIL), Domain-Incremental (DIL), and the novel Language-Incremental Domain-Incremental Learning (LIDIL) scenarios. Unlike prior work that relies on simulated episodes, Nirantar presents dynamic, non-uniform language and domain shifts, making it an ideal testbed for CL research. With 3250 hours of human-transcribed speech, including 1720 hours newly introduced in this work, our framework enables systematic benchmarking of CL methods. 
We evaluate existing approaches and demonstrate that no single method performs consistently well, underscoring the need for more robust CL strategies.

[20] Capsule Network-Based Semantic Intent Modeling for Human-Computer Interaction

Shixiao Wang,Yifan Zhuang,Runsheng Zhang,Zhijun Song

Main category: cs.CL

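TL;DR: This study proposes a new Capsule Network-based approach that improves intent recognition through dynamic routing and a margin-based mechanism, substantially raising recognition accuracy and performance under complex semantic conditions.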

Details Motivation: To address insufficient accuracy in intent recognition for human-computer interaction, the paper explores Capsule Networks for semantic modeling, aiming to capture hierarchical relationships and part-whole structures between semantic entities more effectively. Method: The method uses a convolutional feature extraction module as the low-level encoder, transfers information across multiple capsule layers through a dynamic routing mechanism, and introduces a margin-based mechanism into the loss function to improve the model's ability to distinguish between intent classes. Result: Experiments show that the proposed model outperforms traditional methods and other deep learning structures in accuracy, F1-score, and intent detection rate. The study also analyzes the effect of the number of dynamic routing iterations on model performance and provides a convergence curve of the loss function during training. Conclusion: The paper proposes a Capsule Network-based user semantic intent modeling algorithm that improves the accuracy of intent recognition in human-computer interaction, and the experiments verify the stability and effectiveness of the method. Abstract: This paper proposes a user semantic intent modeling algorithm based on Capsule Networks to address the problem of insufficient accuracy in intent recognition for human-computer interaction. The method represents semantic features in input text through a vectorized capsule structure. It uses a dynamic routing mechanism to transfer information across multiple capsule layers. This helps capture hierarchical relationships and part-whole structures between semantic entities more effectively. The model uses a convolutional feature extraction module as the low-level encoder. After generating initial semantic capsules, it forms high-level abstract intent representations through an iterative routing process. To further enhance performance, a margin-based mechanism is introduced into the loss function. This improves the model's ability to distinguish between intent classes. Experiments are conducted using a public natural language understanding dataset. Multiple mainstream models are used for comparison. Results show that the proposed model outperforms traditional methods and other deep learning structures in terms of accuracy, F1-score, and intent detection rate. The study also analyzes the effect of the number of dynamic routing iterations on model performance. A convergence curve of the loss function during training is provided. These results verify the stability and effectiveness of the proposed method in semantic modeling. Overall, this study presents a new structured modeling approach to improve intent recognition under complex semantic conditions.
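The abstract does not give the exact form of its margin-based loss. As a hedged illustration, the sketch below uses the margin loss commonly paired with capsule networks, in which the length of each class capsule is pushed above an upper margin for the true class and below a lower margin for the others. The margins 0.9 and 0.1, the down-weighting factor 0.5, and the function name are assumptions, not values taken from the paper.

```python
import math
from typing import Sequence

def capsule_margin_loss(
    capsules: Sequence[Sequence[float]],
    target: int,
    m_pos: float = 0.9,        # assumed upper margin for the true class
    m_neg: float = 0.1,        # assumed lower margin for the other classes
    down_weight: float = 0.5,  # assumed weight on the absent-class term
) -> float:
    """Margin loss over class capsules; the capsule length acts as class confidence."""
    total = 0.0
    for k, vec in enumerate(capsules):
        length = math.sqrt(sum(x * x for x in vec))
        if k == target:
            # Penalize a true-class capsule that is shorter than m_pos.
            total += max(0.0, m_pos - length) ** 2
        else:
            # Penalize a wrong-class capsule that is longer than m_neg.
            total += down_weight * max(0.0, length - m_neg) ** 2
    return total

# Two intent classes: the first capsule is long, the second has zero length.
capsules = [[1.0, 0.0], [0.0, 0.0]]
loss_correct = capsule_margin_loss(capsules, target=0)
loss_wrong = capsule_margin_loss(capsules, target=1)
```

When the long capsule belongs to the true intent, neither term is active and the loss is zero; when the label is the other class, both terms are penalized.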

[21] Methodological Rigour in Algorithm Application: An Illustration of Topic Modelling Algorithm

Malmi Amadoru

Main category: cs.CL

TL;DR: This paper provides guidelines for ensuring methodological rigour in topic modelling studies to address challenges arising from the opacity of computational algorithms, particularly benefiting novice researchers, editors, and reviewers.

Details Motivation: The motivation stems from the challenges posed by the opacity and lack of transparency in advanced computational algorithms, which can undermine trust in research and necessitate methodological guidance. Method: The paper illustrates the application of structural topic modelling algorithms and proposes a set of guidelines to ensure rigour in topic modelling studies. Result: A comprehensive set of guidelines is presented to ensure methodological rigour in topic modelling studies, contributing to the discourse on computationally intensive theory construction research. Conclusion: The paper concludes that methodological rigour is essential when applying topic modelling algorithms, and the proposed guidelines can enhance trust in computationally intensive research, especially for novice researchers, editors, and reviewers. Abstract: The rise of advanced computational algorithms has opened new avenues for computationally intensive research approaches to theory development. However, the opacity of these algorithms and lack of transparency and rigour in their application pose methodological challenges, potentially undermining trust in research. The discourse on methodological rigour in this new genre of research is still emerging. Against this backdrop, I attempt to offer guidance on methodological rigour, particularly in the context of topic modelling algorithms. By illustrating the application of the structural topic modelling algorithm and presenting a set of guidelines, I discuss how to ensure rigour in topic modelling studies. Although the guidelines are for the application of topic modelling algorithms, they can be applied to other algorithms with context-specific adjustments. The guidelines are helpful, especially for novice researchers applying topic modelling, and editors and reviewers handling topic modelling manuscripts. 
I contribute to the literature on topic modelling and join the emerging dialogue on methodological rigour in computationally intensive theory construction research.

[22] TUM-MiKaNi at SemEval-2025 Task 3: Towards Multilingual and Knowledge-Aware Non-factual Hallucination Identification

Miriam Anschütz,Ekaterina Gikalo,Niklas Herbster,Georg Groh

Main category: cs.CL

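TL;DR: The paper presents a multilingual hallucination identification system that combines retrieval-based verification with BERT fine-tuning to achieve effective cross-lingual performance.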

Details Motivation: Hallucinations are one of the major problems hindering the trustworthiness and wider deployment of large language models (LLMs), yet most research on hallucinations focuses on English data and neglects the multilingual nature of LLMs. Method: The system combines retrieval-based fact verification with a fine-tuned BERT-based system to identify hallucination patterns. Result: The system achieves competitive results across all languages, ranks in the top 10 in eight languages, and supports multiple languages beyond the fourteen covered by the shared task. Conclusion: The paper proposes a two-part pipeline that combines retrieval-based fact verification against Wikipedia with a fine-tuned BERT-based system for multilingual hallucination identification, achieving competitive cross-lingual results. Abstract: Hallucinations are one of the major problems of LLMs, hindering their trustworthiness and deployment to wider use cases. However, most of the research on hallucinations focuses on English data, neglecting the multilingual nature of LLMs. This paper describes our submission to the SemEval-2025 Task-3 - Mu-SHROOM, the Multilingual Shared-task on Hallucinations and Related Observable Overgeneration Mistakes. We propose a two-part pipeline that combines retrieval-based fact verification against Wikipedia with a BERT-based system fine-tuned to identify common hallucination patterns. Our system achieves competitive results across all languages, reaching top-10 results in eight languages, including English. Moreover, it supports multiple languages beyond the fourteen covered by the shared task. This multilingual hallucination identifier can help to improve LLM outputs and their usefulness in the future.

[23] Transferable Modeling Strategies for Low-Resource LLM Tasks: A Prompt and Alignment-Based

Shuangquan Lyu,Yingnan Deng,Guiran Liu,Zhen Qi,Ruotong Wang

Main category: cs.CL

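TL;DR: This study proposes a unified framework that improves the transfer and adaptation capabilities of large language models in low-resource language scenarios, achieving higher performance and stability through knowledge transfer and parameter-efficient fine-tuning strategies.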

Details Motivation: To address the limited transfer and adaptation capabilities of large language models in low-resource language scenarios. Method: The paper proposes a unified framework that combines a knowledge transfer module with parameter-efficient fine-tuning strategies, introduces knowledge alignment loss and soft prompt tuning, and integrates freezing strategies and prompt injection during training. Result: Experiments show that, compared with existing multilingual pre-trained models and mainstream transfer methods, the approach performs better on cross-lingual tasks such as MLQA, XQuAD, and PAWS-X, with particularly strong advantages under extremely data-scarce conditions. Conclusion: The framework shows strong generality and scalability, enhancing task-specific adaptability while preserving the general capabilities of large language models, which makes it suitable for complex semantic modeling and multilingual processing tasks. Abstract: This paper addresses the limited transfer and adaptation capabilities of large language models in low-resource language scenarios. It proposes a unified framework that combines a knowledge transfer module with parameter-efficient fine-tuning strategies. The method introduces knowledge alignment loss and soft prompt tuning to guide the model in effectively absorbing the structural features of target languages or tasks under minimal annotation. This enhances both generalization performance and training stability. The framework includes lightweight adaptation modules to reduce computational costs. During training, it integrates freezing strategies and prompt injection to preserve the model's original knowledge while enabling quick adaptation to new tasks. The study also conducts stability analysis experiments and synthetic pseudo-data transfer experiments to systematically evaluate the method's applicability and robustness across different low-resource tasks. Experimental results show that compared with existing multilingual pre-trained models and mainstream transfer methods, the proposed approach achieves higher performance and stability on cross-lingual tasks such as MLQA, XQuAD, and PAWS-X. It demonstrates particularly strong advantages under extremely data-scarce conditions. The proposed method offers strong generality and scalability. It enhances task-specific adaptability while preserving the general capabilities of large language models. This makes it well-suited for complex semantic modeling and multilingual processing tasks.

[24] Mixture of Reasonings: Teach Large Language Models to Reason with Adaptive Strategies

Tao Xiong,Xavier Hu,Wenyan Fan,Shengyu Zhang

Main category: cs.CL

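TL;DR: The study proposes Mixture of Reasoning (MoR), a training framework that embeds diverse reasoning strategies into LLMs, enabling autonomous, task-adaptive reasoning without external prompt engineering.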

Details Motivation: Large language models (LLMs) perform well with advanced prompting techniques such as Chain-of-Thought (CoT) and Tree-of-Thought (ToT), but their reliance on manually crafted, task-specific prompts limits adaptability and efficiency, which motivates MoR. Method: Mixture of Reasoning (MoR) has two phases: Thought Generation and SFT Dataset Construction. Result: Experiments show that MoR significantly enhances performance, with MoR150 achieving 0.730 (a 2.2% improvement) using CoT prompting and 0.734 (a 13.5% improvement) compared to baselines. Conclusion: MoR offers a generalizable solution that strengthens reasoning across diverse tasks without task-specific prompts. Abstract: Large language models (LLMs) excel in complex tasks through advanced prompting techniques like Chain-of-Thought (CoT) and Tree-of-Thought (ToT), but their reliance on manually crafted, task-specific prompts limits adaptability and efficiency. We introduce Mixture of Reasoning (MoR), a training framework that embeds diverse reasoning strategies into LLMs for autonomous, task-adaptive reasoning without external prompt engineering. MoR has two phases: Thought Generation, creating reasoning chain templates with models like GPT-4o, and SFT Dataset Construction, pairing templates with benchmark datasets for supervised fine-tuning. Our experiments show that MoR significantly enhances performance, with MoR150 achieving 0.730 (2.2% improvement) using CoT prompting and 0.734 (13.5% improvement) compared to baselines. MoR eliminates the need for task-specific prompts, offering a generalizable solution for robust reasoning across diverse tasks.

[25] SAFER: Probing Safety in Reward Models with Sparse Autoencoder

Sihang Li,Wei Shi,Ziyuan Xie,Tao Liang,Guojun Ma,Xiang Wang

Main category: cs.CL

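TL;DR: This paper proposes SAFER, a framework that uses sparse autoencoders to interpret and improve reward models for the safety alignment of large language models.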

Details Motivation: Although RLHF is a key paradigm for aligning large language models with human values, the reward models at its core remain opaque, and mechanistic analysis is needed to improve their safety and interpretability. Method: Sparse Autoencoders (SAEs) are used to uncover human-interpretable features in reward model activations; the salience of each feature is quantified by activation differences, and targeted data poisoning and denoising strategies are designed from these signals. Result: Experiments show that SAFER can precisely degrade or enhance safety alignment with minimal data modification, without sacrificing general chat performance. Conclusion: SAFER is a new framework for interpreting and improving reward models, and it supports interpreting, auditing, and refining reward models in high-stakes LLM alignment tasks. Abstract: Reinforcement learning from human feedback (RLHF) is a key paradigm for aligning large language models (LLMs) with human values, yet the reward models at its core remain largely opaque. In this work, we present sparse Autoencoder For Enhanced Reward model (SAFER), a novel framework for interpreting and improving reward models through mechanistic analysis. Leveraging Sparse Autoencoders (SAEs), we uncover human-interpretable features in reward model activations, enabling insight into safety-relevant decision-making. We apply SAFER to safety-oriented preference datasets and quantify the salience of individual features by activation differences between chosen and rejected responses. Using these feature-level signals, we design targeted data poisoning and denoising strategies. Experiments show that SAFER can precisely degrade or enhance safety alignment with minimal data modification, without sacrificing general chat performance. Our approach contributes to interpreting, auditing and refining reward models in high-stakes LLM alignment tasks. Our codes are available at https://github.com/xzy-101/SAFER-code. This paper discusses topics related to large language model safety and may include discussions or examples that highlight potential risks or unsafe outcomes.
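The abstract says feature salience is quantified by activation differences between chosen and rejected responses but does not state the exact statistic. The sketch below assumes the simplest reading, a per-feature mean of (chosen minus rejected) over preference pairs. The function name and the toy activation values are hypothetical, and real SAE activations would come from a trained sparse autoencoder rather than hand-written lists.

```python
from typing import List, Sequence

def feature_salience(
    chosen_acts: Sequence[Sequence[float]],
    rejected_acts: Sequence[Sequence[float]],
) -> List[float]:
    """Mean per-feature activation difference (chosen - rejected) over preference pairs."""
    if len(chosen_acts) != len(rejected_acts) or not chosen_acts:
        raise ValueError("need equal, non-empty lists of paired activations")
    n_pairs = len(chosen_acts)
    n_features = len(chosen_acts[0])
    salience = [0.0] * n_features
    for chosen, rejected in zip(chosen_acts, rejected_acts):
        for j in range(n_features):
            salience[j] += (chosen[j] - rejected[j]) / n_pairs
    return salience

# Toy activations for two preference pairs and three SAE features (hypothetical values).
chosen = [[1.0, 0.0, 2.0], [3.0, 0.0, 2.0]]
rejected = [[0.0, 1.0, 2.0], [0.0, 3.0, 2.0]]
salience = feature_salience(chosen, rejected)

# Rank features by absolute salience, most discriminative first.
ranking = sorted(range(len(salience)), key=lambda j: abs(salience[j]), reverse=True)
```

Under this assumption, features with large positive values fire mostly on chosen responses and features with large negative values fire mostly on rejected ones, while a feature that activates equally on both gets a salience of zero.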

[26] Contrasting Cognitive Styles in Vision-Language Models: Holistic Attention in Japanese Versus Analytical Focus in English

Ahmed Sabir,Azinovič Gasper,Mengsay Loem,Rajesh Sharma

Main category: cs.CL

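TL;DR: This study finds that vision-language models trained on different languages (Japanese and English) exhibit culturally grounded attentional patterns similar to those of humans, suggesting that cultural cognition may implicitly shape model outputs.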

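Details Motivation: Cross-cultural research shows that people from different cultural backgrounds process visual information differently. East Asians, for example, tend to adopt a holistic perspective, whereas Westerners more often use an analytical approach. The study investigates whether vision-language models trained on different languages exhibit similar culturally grounded attentional patterns. Method: Using comparative analysis of image descriptions, the study examines whether vision-language models reflect differences in holistic versus analytic tendencies. Result: The findings suggest that vision-language models not only internalize the structural properties of language but also reproduce cultural behaviors embedded in the training data. Conclusion: Vision-language models trained predominantly on different languages (Japanese and English) not only internalize the structural properties of language but also reproduce cultural behaviors embedded in the training data, indicating that cultural cognition may implicitly shape model outputs. Abstract: Cross-cultural research in perception and cognition has shown that individuals from different cultural backgrounds process visual information in distinct ways. East Asians, for example, tend to adopt a holistic perspective, attending to contextual relationships, whereas Westerners often employ an analytical approach, focusing on individual objects and their attributes. In this study, we investigate whether Vision-Language Models (VLMs) trained predominantly on different languages, specifically Japanese and English, exhibit similar culturally grounded attentional patterns. Using comparative analysis of image descriptions, we examine whether these models reflect differences in holistic versus analytic tendencies. Our findings suggest that VLMs not only internalize the structural properties of language but also reproduce cultural behaviors embedded in the training data, indicating that cultural cognition may implicitly shape model outputs.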

[27] AI Analyst: Framework and Comprehensive Evaluation of Large Language Models for Financial Time Series Report Generation

Elizabeth Fons,Elena Kochkina,Rachneet Kaur,Zhen Zeng,Berowne Hlavaty,Charese Smiley,Svitlana Vyetrenko,Manuela Veloso

Main category: cs.CL

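TL;DR: This study develops a framework for generating financial reports with large language models (LLMs) and uses an automated categorization mechanism to evaluate the models' reasoning and factual grounding.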

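Details Motivation: To explore the potential of large language models to generate financial reports from time series data, with the aim of improving the quality and interpretability of generated reports. Method: The paper proposes a framework encompassing prompt engineering, model selection, and evaluation, and introduces an automated highlighting system that categorizes the content of generated reports. Result: Experiments show that LLMs can produce high-quality financial reports from real stock market index data and synthetic time series, and that the sources of different types of insight can be distinguished effectively. Conclusion: LLMs are capable of generating coherent and informative financial reports that combine time series data with financial reasoning, and insights from different sources can be told apart. Abstract: This paper explores the potential of large language models (LLMs) to generate financial reports from time series data. We propose a framework encompassing prompt engineering, model selection, and evaluation. We introduce an automated highlighting system to categorize information within the generated reports, differentiating between insights derived directly from time series data, stemming from financial reasoning, and those reliant on external knowledge. This approach aids in evaluating the factual grounding and reasoning capabilities of the models. Our experiments, utilizing both data from the real stock market indices and synthetic time series, demonstrate the capability of LLMs to produce coherent and informative financial reports.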

[28] LitBench: A Benchmark and Dataset for Reliable Evaluation of Creative Writing

Daniel Fein,Sebastian Russo,Violet Xiang,Kabir Jolly,Rafael Rafailov,Nick Haber

Main category: cs.CL

TL;DR: This paper introduces LitBench, a standardized benchmark for creative writing evaluation, demonstrating that trained reward models outperform zero-shot LLM judges in aligning with human preferences.

Details Motivation: Evaluating creative writing from large language models is challenging due to the lack of ground truths, necessitating robust automated evaluation methods. Method: LitBench was introduced as a benchmark dataset with a test set of 2,480 human-labeled story comparisons and a training corpus of 43,827 pairs. Zero-shot LLM judges were benchmarked, Bradley-Terry and Generative reward models were trained, and an online human study validated model rankings. Result: Claude-3.7-Sonnet achieved 73% agreement with human preferences as an off-the-shelf judge. Trained reward models (Bradley-Terry and Generative) reached 78% accuracy, outperforming all off-the-shelf judges. Human studies confirmed alignment with human preferences in novel stories. Conclusion: The LitBench benchmark and trained reward models provide a reliable, automated evaluation method for creative writing systems. Abstract: Evaluating creative writing generated by large language models (LLMs) remains challenging because open-ended narratives lack ground truths. Without performant automated evaluation methods, off-the-shelf (OTS) language models are employed as zero-shot judges, yet their reliability is unclear in this context. In pursuit of robust evaluation for creative writing, we introduce LitBench, the first standardized benchmark and paired dataset for creative writing verification, comprising a held-out test set of 2,480 debiased, human-labeled story comparisons drawn from Reddit and a 43,827-pair training corpus of human preference labels. Using LitBench, we (i) benchmark zero-shot LLM judges, (ii) train Bradley Terry and generative reward models, and (iii) conduct an online human study to validate reward model rankings on newly LLM-generated stories. 
Our benchmark identifies Claude-3.7-Sonnet as the strongest off-the-shelf judge, reaching 73% agreement with human preferences; among trained reward models, Bradley-Terry and Generative reward models both attain an accuracy of 78%, outperforming all off-the-shelf judges. An online human study further confirms that our trained reward models consistently align with human preferences in novel LLM-generated stories. We release LitBench and reward models at https://huggingface.co/collections/SAA-Lab/litbench-68267b5da3aafe58f9e43461, providing a vetted resource for reliable, automated evaluation and optimization of creative writing systems.
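For readers unfamiliar with Bradley-Terry reward models, the sketch below shows the standard pairwise formulation: the probability that story A is preferred over story B is the sigmoid of the difference of their scalar rewards, training minimizes the negative log of that probability for the human-chosen story, and accuracy is the share of pairs ranked the same way as the human label. It illustrates the general technique only; the reward values are hypothetical, and LitBench's actual training setup is not described at this level in the abstract.

```python
import math
from typing import Sequence, Tuple

def bt_preference_prob(reward_a: float, reward_b: float) -> float:
    """Bradley-Terry probability that A is preferred over B."""
    return 1.0 / (1.0 + math.exp(-(reward_a - reward_b)))

def bt_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Negative log-likelihood of the human-preferred story winning."""
    return -math.log(bt_preference_prob(reward_chosen, reward_rejected))

def pairwise_accuracy(pairs: Sequence[Tuple[float, float]]) -> float:
    """Fraction of (chosen, rejected) reward pairs ranked in the human-preferred order."""
    return sum(chosen > rejected for chosen, rejected in pairs) / len(pairs)

# Hypothetical rewards assigned by a reward model to four labeled comparisons.
scored_pairs = [(2.0, 1.0), (0.0, 1.0), (3.0, 0.0), (1.0, 0.5)]
accuracy = pairwise_accuracy(scored_pairs)
```

Here three of the four toy pairs are ordered correctly, giving 0.75; the 78% figure reported above is the same kind of agreement rate computed on the held-out test comparisons.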

[29] A Diagrammatic Calculus for a Functional Model of Natural Language Semantics

Matthieu Pierre Boyer

Main category: cs.CL

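TL;DR: This paper proposes a new approach to natural language semantics based on functional programming and category theory, and constructs a diagrammatic calculus for efficiently computing the denotations of sentences.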

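Details Motivation: To increase the expressivity of the traditional denotational style of natural language semantics. Method: The paper formalizes a category-based type and effect system and constructs a diagrammatic calculus to model parsing and the handling of effects. Result: A new functional programming approach to natural language semantics is developed, and the corresponding type and effect system is formalized. Conclusion: A functional programming approach can increase the expressivity of traditional denotational semantics and allows sentence denotations to be computed efficiently. Abstract: In this paper, we study a functional programming approach to natural language semantics, allowing us to increase the expressivity of a more traditional denotation style. We will formalize a category based type and effect system, and construct a diagrammatic calculus to model parsing and handling of effects, and use it to efficiently compute the denotations for sentences.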

[30] Generative AI and the future of scientometrics: current topics and future questions

Benedetto Lepori,Jens Peter Andersen,Karsten Donnay

Main category: cs.CL

TL;DR: This paper reviews the role of Generative AI (GenAI) in scientometrics, highlighting its strengths in language generation tasks while cautioning about its limitations in reasoning and domain-specific knowledge. It emphasizes the need for ongoing evaluation of GenAI's impact on how science is measured and understood.

Details Motivation: The paper aims to review the application of GenAI in scientometrics and initiate a debate on its broader implications, especially as it becomes more integrated into scientific knowledge production. Method: This paper reviews the use of GenAI in scientometrics, critically engages with its applications such as topic labeling, citation context analysis, predictive tasks, scholars' profiling, and research assessment, and explores its implications on textual characteristics used to measure science. Result: GenAI shows promise in language generation tasks like topic labeling but struggles with stable semantics and structured domain knowledge. It may fundamentally impact scientometrics by altering textual metrics such as authorship, word usage, and references. Conclusion: GenAI has potential in scientometrics, especially in language generation tasks, but its limitations and rapid evolution necessitate systematic comparisons of models' performance. Its growing influence on scientific language calls for empirical work and theoretical reflection. Abstract: The aim of this paper is to review the use of GenAI in scientometrics, and to begin a debate on the broader implications for the field. First, we provide an introduction on GenAI's generative and probabilistic nature as rooted in distributional linguistics. And we relate this to the debate on the extent to which GenAI might be able to mimic human 'reasoning'. Second, we leverage this distinction for a critical engagement with recent experiments using GenAI in scientometrics, including topic labelling, the analysis of citation contexts, predictive applications, scholars' profiling, and research assessment. GenAI shows promise in tasks where language generation dominates, such as labelling, but faces limitations in tasks that require stable semantics, pragmatic reasoning, or structured domain knowledge. However, these results might become quickly outdated. 
Our recommendation is, therefore, to always strive to systematically compare the performance of different GenAI models for specific tasks. Third, we inquire whether, by generating large amounts of scientific language, GenAI might have a fundamental impact on our field by affecting textual characteristics used to measure science, such as authors, words, and references. We argue that careful empirical work and theoretical reflection will be essential to remain capable of interpreting the evolving patterns of knowledge production.

[31] Many LLMs Are More Utilitarian Than One

Anita Keshmirian,Razan Baltaji,Babak Hemmatian,Hadi Asghari,Lav R. Varshney

Main category: cs.CL

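TL;DR: The study shows that large language models (LLMs) are more inclined to accept utilitarian moral decisions in group discussion than when answering alone, but the underlying mechanism differs from that of human groups: LLMs show reduced norm sensitivity or enhanced impartiality, whereas humans become more sensitive to decision outcomes.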

Details Motivation: With the rise of multi-agent systems, understanding how LLMs function collectively during collaboration has become crucial. Human moral judgment typically becomes more utilitarian after group deliberation, and this study examines whether a similar dynamic emerges in LLMs and how it compares with human judgment. Method: The researchers tested six LLMs on classic moral dilemmas under two conditions: (1) Solo, where models reasoned independently, and (2) Group, where models engaged in multi-turn discussions in pairs or triads. Result: In personal moral dilemmas, which involve directly harming one individual to maximize benefits for others, all models found moral violations more acceptable in groups than when working alone. Some models endorsed actions that maximized overall well-being even when this favored strangers over familiar individuals; others became more willing to violate moral norms in groups. Although LLM group behavior superficially resembles that of humans, the underlying mechanism differs. Conclusion: When LLMs deliberate in groups, their judgments in moral dilemmas become more utilitarian, meaning they more readily accept norm violations that benefit the majority. The mechanism behind this group effect differs from that in humans: the human shift comes from heightened sensitivity to decision outcomes, whereas LLMs show reduced norm sensitivity or enhanced impartiality. Abstract: Moral judgment is integral to large language model (LLM) alignment and social reasoning. As multi-agent systems gain prominence, it becomes crucial to understand how LLMs function collectively during collaboration, compared to individual agents. In human moral judgment, group deliberation leads to a utilitarian boost: a tendency to endorse norm violations that maximize benefits for the greatest number of people despite harms. We study whether a similar dynamic emerges in multi-agent LLM systems. We tested six models on well-established sets of moral dilemmas across two conditions: (1) Solo, where models reasoned independently, and (2) Group, where they engaged in multi-turn discussions in pairs or triads. In personal moral dilemmas, where agents must decide to directly harm one individual to maximize the utility for others, all models found moral violations to be more acceptable when part of a group than individually, similar to human experiments. Some models endorsed actions that maximized overall well-being, even if they benefited strangers over familiar individuals. Others became more willing to violate moral norms in groups. However, while human groups show a similar action bias, the mechanism for their utilitarian boost differs from LLMs. Whereas the human shift comes from heightened sensitivity to decision outcomes, LLM groups show either reduced norm sensitivity or enhanced impartiality.
This suggests that while the surface behavior of LLM collectives mimics human group reasoning, the underlying drivers differ. We discuss the implications for AI alignment, multi-agent design, and artificial moral reasoning.

[32] ProxAnn: Use-Oriented Evaluations of Topic Models and Document Clustering

Alexander Hoyle,Lorena Calvo-Bartolomé,Jordan Boyd-Graber,Philip Resnik

Main category: cs.CL

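TL;DR: This paper proposes a new, scalable evaluation method for topic models; by comparing LLM proxies with human annotators, it shows that some LLM proxies can effectively stand in for humans in automated evaluation.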

Details Motivation: Evaluations of topic models and document clustering typically rely either on automated metrics that align poorly with human preferences or on expert labels that are hard to scale, so an evaluation method that reflects real-world use is needed. Method: The authors design a scalable human evaluation protocol and a corresponding automated approximation, and validate the automated proxies against a large collection of crowdworker annotations. Result: Using the protocol, the authors collected extensive annotations of outputs from a diverse set of topic models and found that the best LLM proxies closely mimic human annotator behavior. Conclusion: The best LLM proxies are statistically indistinguishable from a human annotator and can therefore serve as a reasonable substitute in automated evaluations. Abstract: Topic model and document-clustering evaluations either use automated metrics that align poorly with human preferences or require expert labels that are intractable to scale. We design a scalable human evaluation protocol and a corresponding automated approximation that reflect practitioners' real-world usage of models. Annotators -- or an LLM-based proxy -- review text items assigned to a topic or cluster, infer a category for the group, then apply that category to other documents. Using this protocol, we collect extensive crowdworker annotations of outputs from a diverse set of topic models on two datasets. We then use these annotations to validate automated proxies, finding that the best LLM proxies are statistically indistinguishable from a human annotator and can therefore serve as a reasonable substitute in automated evaluations. Package, web interface, and data are at https://github.com/ahoho/proxann

[33] Stylometry recognizes human and LLM-generated texts in short samples

Karol Przystalski,Jan K. Argasiński,Iwona Grabska-Gradzińska,Jeremi K. Ochab

Main category: cs.CL

TL;DR: This paper explores the application of stylometry to differentiate between texts generated by Large Language Models (LLMs) and humans, showing promising results in classification performance and identifying unique features of LLM-generated texts.

Details Motivation: To address the issues of model attribution, intellectual property, and ethical AI use by exploring stylometry as a method to distinguish between texts created by Large Language Models (LLMs) and humans. Method: The paper uses stylometry on a benchmark dataset based on Wikipedia. Tree-based models like decision trees and LightGBM are used to classify 10-sentence long texts using both human-designed and n-gram-based stylometric features. Result: Cross-validated results reached up to .87 Matthews correlation coefficient in multiclass classification with 7 classes and accuracy between .79 and 1. In binary classification, the example of Wikipedia and GPT-4 achieved up to .98 accuracy. Shapley Additive Explanations identified features distinguishing LLM-generated from human-written texts. Conclusion: The paper concludes that stylometry can effectively distinguish between texts generated by LLMs and humans, which has implications for model attribution, intellectual property, and ethical AI use. Abstract: The paper explores stylometry as a method to distinguish between texts created by Large Language Models (LLMs) and humans, addressing issues of model attribution, intellectual property, and ethical AI use. Stylometry has been used extensively to characterise the style and attribute authorship of texts. By applying it to LLM-generated texts, we identify their emergent writing patterns. The paper involves creating a benchmark dataset based on Wikipedia, with (a) human-written term summaries, (b) texts generated purely by LLMs (GPT-3.5/4, LLaMa 2/3, Orca, and Falcon), (c) processed through multiple text summarisation methods (T5, BART, Gensim, and Sumy), and (d) rephrasing methods (Dipper, T5). The 10-sentence long texts were classified by tree-based models (decision trees and LightGBM) using human-designed (StyloMetrix) and n-gram-based (our own pipeline) stylometric features that encode lexical, grammatical, syntactic, and punctuation patterns. 
The cross-validated results reached a performance of up to .87 Matthews correlation coefficient in the multiclass scenario with 7 classes, and accuracy between .79 and 1.0 in binary classification, with the particular example of Wikipedia and GPT-4 reaching up to .98 accuracy on a balanced dataset. Shapley Additive Explanations pinpointed features characteristic of the encyclopaedic text type, individual overused words, as well as a greater grammatical standardisation of LLMs with respect to human-written texts. These results show -- crucially, in the context of the increasingly sophisticated LLMs -- that it is possible to distinguish machine- from human-generated texts at least for a well-defined text type.
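The multiclass Matthews correlation coefficient reported above can be computed directly from label counts. The following is a minimal sketch of the standard multiclass generalisation of MCC, not the authors' evaluation code; the function name `multiclass_mcc` and the toy labels are illustrative assumptions.

```python
import math
from collections import Counter


def multiclass_mcc(y_true, y_pred):
    """Multiclass Matthews correlation coefficient from two label sequences.

    Uses the count-based form:
        (c*s - sum_k p_k*t_k) / sqrt((s^2 - sum_k p_k^2) * (s^2 - sum_k t_k^2))
    where s is the sample count, c the number of correct predictions,
    t_k the number of samples whose true class is k, and p_k the number
    of samples predicted as class k.
    """
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    s = len(y_true)
    c = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    t_counts = Counter(y_true)
    p_counts = Counter(y_pred)
    labels = set(t_counts) | set(p_counts)
    sum_pt = sum(p_counts[k] * t_counts[k] for k in labels)
    sum_p2 = sum(v * v for v in p_counts.values())
    sum_t2 = sum(v * v for v in t_counts.values())
    denom = math.sqrt((s * s - sum_p2) * (s * s - sum_t2))
    # By convention, return 0 when the coefficient is undefined
    # (for example, when every prediction falls in a single class).
    return 0.0 if denom == 0 else (c * s - sum_pt) / denom


# Toy labels (hypothetical, not from the paper's dataset).
truth = ["human", "gpt4", "human", "gpt4"]
print(multiclass_mcc(truth, truth))
print(multiclass_mcc(truth, ["human"] * 4))
```

Unlike plain accuracy, the coefficient stays at 0 for a classifier that always predicts one class, which is one reason it is a common choice for multiclass settings.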

[34] TransLaw: Benchmarking Large Language Models in Multi-Agent Simulation of the Collaborative Translation

Xi Xuan,King-kui Sin,Yufei Zhou,Chunyu Kit

Main category: cs.CL

TL;DR: This paper introduces TransLaw, a multi-agent framework for improving the quality and efficiency of translating Hong Kong legal documents.

Details Motivation: Because of intricate legal terminology, culturally embedded nuances, and strict linguistic structures, the potential of LLMs for translating Hong Kong legal judgments remains uncertain. Method: TransLaw employs three specialized agents (Translator, Annotator, and Proofreader) that collaboratively produce translations, and it supports customizable LLM configurations. Result: In an evaluation using 13 open-source and commercial LLMs as agents, TransLaw surpassed GPT-4o in legal semantic accuracy, structural coherence, and stylistic fidelity, but trailed human experts in contextualizing complex terminology and in stylistic naturalness. Conclusion: TransLaw is a new multi-agent framework for real-world Hong Kong case law translation, targeting high accuracy in legal meaning, appropriate style, and coherent, cohesive structure. Abstract: Multi-agent systems empowered by large language models (LLMs) have demonstrated remarkable capabilities in a wide range of downstream applications, including machine translation. However, the potential of LLMs in translating Hong Kong legal judgments remains uncertain due to challenges such as intricate legal terminology, culturally embedded nuances, and strict linguistic structures. In this work, we introduce TransLaw, a novel multi-agent framework implemented for real-world Hong Kong case law translation. It employs three specialized agents, namely, Translator, Annotator, and Proofreader, to collaboratively produce translations for high accuracy in legal meaning, appropriateness in style, and adequate coherence and cohesion in structure. This framework supports customizable LLM configurations and achieves tremendous cost reduction compared to professional human translation services. We evaluated its performance using 13 open-source and commercial LLMs as agents and obtained interesting findings, including that it surpasses GPT-4o in legal semantic accuracy, structural coherence, and stylistic fidelity, yet trails human experts in contextualizing complex terminology and stylistic naturalness. Our platform website is available at CityUHK, and our bilingual judgment corpus used for the evaluation is available at Hugging Face.

[35] Mathematics Isn't Culture-Free: Probing Cultural Gaps via Entity and Scenario Perturbations

Aditya Tomar,Nihar Ranjan Sahoo,Ashish Mittal,Rudra Murthy,Pushpak Bhattacharyya

Main category: cs.CL

TL;DR: This paper explores how cultural context in mathematical problem presentation affects the performance of large language models, finding that models with reasoning capabilities are more resilient to these cultural variations.

Details Motivation: Mathematical problems often carry implicit cultural context, and existing benchmarks like GSM8K are predominantly rooted in Western norms, including names, currencies, and everyday scenarios. Method: The study created culturally adapted variants of the GSM8K test set for five regions (Africa, India, China, Korea, and Japan) using prompt-based transformations followed by manual verification. Six large language models were evaluated across five prompting strategies. Result: A consistent performance gap was found, with models performing best on the original US-centric dataset and comparatively worse on culturally adapted versions. Conclusion: The models perform best on the original US-centric dataset and worse on culturally adapted versions, but models with reasoning capabilities are more resilient to cultural variations. Abstract: Although mathematics is often considered culturally neutral, the way mathematical problems are presented can carry implicit cultural context. Existing benchmarks like GSM8K are predominantly rooted in Western norms, including names, currencies, and everyday scenarios. In this work, we create culturally adapted variants of the GSM8K test set for five regions (Africa, India, China, Korea, and Japan) using prompt-based transformations followed by manual verification. We evaluate six large language models (LLMs), ranging from 8B to 72B parameters, across five prompting strategies to assess their robustness to cultural variation in math problem presentation. Our findings reveal a consistent performance gap: models perform best on the original US-centric dataset and comparatively worse on culturally adapted versions. However, models with reasoning capabilities are more resilient to these shifts, suggesting that deeper reasoning helps bridge cultural presentation gaps in mathematical tasks.

[36] Scaling Laws Are Unreliable for Downstream Tasks: A Reality Check

Nicholas Lourie,Michael Y. Hu,Kyunghyun Cho

Main category: cs.CL

TL;DR: Downstream scaling laws are unreliable for predicting task performance from pretraining losses, as they show linear trends in only 39% of cases and are sensitive to experimental settings.

Details Motivation: The motivation stems from conflicting findings on whether task performance can be reliably predicted using downstream scaling laws, with some suggesting linear trends and others pointing out challenges like emergence and inverse scaling. Method: A meta-analysis was conducted on existing data regarding downstream scaling laws to assess the frequency and reliability of linear scaling trends. Result: Linear scaling trends were found to occur in only 39% of cases, and experimental changes could significantly alter scaling trends. Conclusion: Downstream scaling laws are not consistently reliable for predicting task performance from pretraining losses, as linear scaling trends only occur in 39% of cases and can be significantly affected by experimental settings. Abstract: Downstream scaling laws aim to predict task performance at larger scales from pretraining losses at smaller scales. Whether this prediction should be possible is unclear: some works demonstrate that task performance follows clear linear scaling trends under transformation, whereas others point out fundamental challenges to downstream scaling laws, such as emergence and inverse scaling. In this work, we conduct a meta-analysis of existing data on downstream scaling laws, finding that close fit to linear scaling laws only occurs in a minority of cases: 39% of the time. Furthermore, seemingly benign changes to the experimental setting can completely change the scaling trend. Our analysis underscores the need to understand the conditions under which scaling laws succeed. To fully model the relationship between pretraining loss and downstream task performance, we must embrace the cases in which scaling behavior deviates from linear trends.
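To make the notion of a "close fit to linear scaling laws" concrete, the sketch below fits a straight line to (pretraining loss, task score) pairs and reports the coefficient of determination. This is only an illustration: the abstract does not state the paper's fit criterion, so the use of R² and the data points below are assumptions.

```python
def linear_fit_r2(xs, ys):
    """Ordinary least-squares line through (xs, ys); returns (slope, intercept, r2)."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equally long sequences with at least 2 points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all x values are identical; slope is undefined")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    # A constant target is fit exactly by a horizontal line.
    r2 = 1.0 if ss_tot == 0 else 1.0 - ss_res / ss_tot
    return slope, intercept, r2


# Hypothetical data: pretraining loss (x) versus downstream task score (y).
losses = [3.0, 2.5, 2.0, 1.5]
smooth_scores = [0.2, 0.3, 0.4, 0.5]      # improves steadily as loss falls
emergent_scores = [0.25, 0.25, 0.25, 0.9]  # flat, then a sudden jump

print(linear_fit_r2(losses, smooth_scores)[2])
print(linear_fit_r2(losses, emergent_scores)[2])
```

A trend that looks flat and then jumps, as in the second series, yields a visibly poorer linear fit, which is the kind of deviation (emergence) the abstract refers to.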

[37] MemeCMD: An Automatically Generated Chinese Multi-turn Dialogue Dataset with Contextually Retrieved Memes

Yuheng Wang,Xianhe Tang,Pufeng Huang

Main category: cs.CL

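TL;DR: The paper presents MemeCMD, an automatically generated Chinese multi-turn dialogue dataset that combines a large-scale meme library with dialogue generation to improve the expressiveness and contextual relevance of multimodal dialogue.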

Details Motivation: Existing dialogue datasets are largely limited to manually annotated or text-only conversations and lack the expressiveness and contextual nuance of multimodal interaction. Method: The authors build a large-scale, MLLM-annotated meme library and auto-generate dialogues with dual agents, using a retrieval framework and an adaptive threshold to keep meme usage contextually relevant and naturally spaced. Result: Experiments show the approach is effective at generating contextually appropriate and diverse meme-incorporated dialogues, providing a scalable resource for advancing multimodal conversational AI. Conclusion: MemeCMD provides a framework for generating dialogues and retrieving memes that yields contextually appropriate and diverse meme-incorporated dialogues. Abstract: Memes are widely used in online social interactions, providing vivid, intuitive, and often humorous means to express intentions and emotions. Existing dialogue datasets are predominantly limited to either manually annotated or pure-text conversations, lacking the expressiveness and contextual nuance that multimodal interactions provide. To address these challenges, we introduce MemeCMD, an automatically generated Chinese Multi-turn Dialogue dataset with contextually retrieved memes. Our dataset combines a large-scale, MLLM-annotated meme library with dialogues auto-generated by dual agents across diverse scenarios. We introduce a retrieval framework and adaptive threshold to ensure contextually relevant, naturally spaced meme usage. Experiments demonstrate the effectiveness of our approach in generating contextually appropriate and diverse meme-incorporated dialogues, offering a scalable and privacy-preserving resource for advancing multimodal conversational AI.

[38] The Cognate Data Bottleneck in Language Phylogenetics

Luise Häuser,Alexandros Stamatakis

Main category: cs.CL

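TL;DR: The study examines the challenge of automatically extracting large-scale cognate data and finds that current methods cannot produce data suitable for phylogenetic analysis, which limits the use of computational methods in historical linguistics.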

Details Motivation: Fully exploiting computational phylogenetic methods on cognate data requires specific (complex) models and machine learning-based techniques, but existing manually collected cognate datasets are too small to support them. Method: Datasets are automatically extracted from BabelNet, a large multilingual encyclopedic dictionary, and phylogenetic inference is run on the resulting character matrices to assess agreement with established gold-standard trees. Result: Trees inferred from the automatically extracted BabelNet character matrices are largely inconsistent with the gold-standard trees, and the authors consider it unlikely that more suitable character matrices can be extracted from other multilingual resources. Conclusion: There is currently no feasible way to automatically generate large-scale cognate data suitable for phylogenetic analysis, so how, and whether, these computational methods can be applied in historical linguistics remains an open question. Abstract: To fully exploit the potential of computational phylogenetic methods for cognate data one needs to leverage specific (complex) models and machine learning-based techniques. However, both approaches require datasets that are substantially larger than the manually collected cognate data currently available. To the best of our knowledge, there exists no feasible approach to automatically generate larger cognate datasets. We substantiate this claim by automatically extracting datasets from BabelNet, a large multilingual encyclopedic dictionary. We demonstrate that phylogenetic inferences on the respective character matrices yield trees that are largely inconsistent with the established gold standard ground truth trees. We also discuss why we consider it as being unlikely to be able to extract more suitable character matrices from other multilingual resources. Phylogenetic data analysis approaches that require larger datasets can therefore not be applied to cognate data. Thus, it remains an open question how, and if, these computational approaches can be applied in historical linguistics.

[39] Discourse Heuristics For Paradoxically Moral Self-Correction

Guangliang Liu,Zimo Qi,Xitong Zhang,Kristen Marie Johnson

Main category: cs.CL

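TL;DR: This paper examines two main paradoxes of moral self-correction in large language models (LLMs), analyzes the discourse constructions in fine-tuning corpora, and proposes improving moral self-correction by leveraging the heuristics of curated datasets.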

Details Motivation: To better understand and address two paradoxes of moral self-correction in LLMs: the capability operates only at a superficial level, and LLMs struggle to identify the cause of moral inconsistency. Method: The authors analyze the discourse constructions in fine-tuning corpora designed to enhance moral self-correction, uncover the heuristics underlying effective constructions, and propose improvements based on them. Result: Moral self-correction relies on discourse constructions that reflect heuristic shortcuts, and these shortcuts lead to inconsistency when self-correction and self-diagnosis are enhanced jointly; the paper also highlights generalization challenges with respect to learning from situated context and model scale. Conclusion: The paper proposes improving LLM moral self-correction by leveraging the heuristics of curated datasets and points to directions for further research. Abstract: Moral self-correction has emerged as a promising approach for aligning the output of Large Language Models (LLMs) with human moral values. However, moral self-correction techniques are subject to two primary paradoxes. First, despite empirical and theoretical evidence to support the effectiveness of self-correction, this LLM capability only operates at a superficial level. Second, while LLMs possess the capability of self-diagnosing immoral aspects of their output, they struggle to identify the cause of this moral inconsistency during their self-correction process. To better understand and address these paradoxes, we analyze the discourse constructions in fine-tuning corpora designed to enhance moral self-correction, uncovering the existence of the heuristics underlying effective constructions. We demonstrate that moral self-correction relies on discourse constructions that reflect heuristic shortcuts, and that the presence of these heuristic shortcuts during self-correction leads to inconsistency when attempting to enhance both self-correction and self-diagnosis capabilities jointly. Based on our findings, we propose a solution to improve moral self-correction by leveraging the heuristics of curated datasets. We also highlight the generalization challenges of this capability, particularly in terms of learning from situated context and model scales.

[40] Should We Still Pretrain Encoders with Masked Language Modeling?

Hippolyte Gisserot-Boukhlef,Nicolas Boizard,Manuel Faysse,Duarte M. Alves,Emmanuel Malherbe,André F. T. Martins,Céline Hudelot,Pierre Colombo

Main category: cs.CL

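TL;DR: This paper compares pretraining with masked language modeling (MLM) and causal language modeling (CLM) for text representation tasks. MLM generally performs better across tasks, while CLM is more data-efficient and more stable under fine-tuning. A biphasic strategy that combines the two performs best under a fixed compute budget.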

Details Motivation: Encoder pretraining has traditionally relied on masked language modeling (MLM), but recent work suggests that decoder models pretrained with causal language modeling (CLM) can be effectively repurposed as encoders and often surpass traditional encoders on text representation benchmarks. It remains unclear whether these gains are inherent to the CLM objective, and this paper sets out to answer that question. Method: The authors run a series of large-scale, carefully controlled pretraining ablations, training 30 models ranging from 210 million to 1 billion parameters and conducting over 15,000 fine-tuning and evaluation runs. Result: MLM pretraining generally yields better performance on text representation tasks, while CLM-pretrained models are more data-efficient and more stable under fine-tuning. A biphasic strategy that applies CLM and then MLM achieves the best performance under a fixed compute budget. Conclusion: Although MLM pretraining generally performs better on text representation tasks, CLM-pretrained models are more data-efficient and fine-tune more stably. Applying CLM and then MLM gives the best performance under a fixed computational training budget, and the strategy becomes more appealing when initializing from existing pretrained CLM models. Abstract: Learning high-quality text representations is fundamental to a wide range of NLP tasks. While encoder pretraining has traditionally relied on Masked Language Modeling (MLM), recent evidence suggests that decoder models pretrained with Causal Language Modeling (CLM) can be effectively repurposed as encoders, often surpassing traditional encoders on text representation benchmarks. However, it remains unclear whether these gains reflect an inherent advantage of the CLM objective or arise from confounding factors such as model and data scale. In this paper, we address this question through a series of large-scale, carefully controlled pretraining ablations, training a total of 30 models ranging from 210 million to 1 billion parameters, and conducting over 15,000 fine-tuning and evaluation runs. We find that while training with MLM generally yields better performance across text representation tasks, CLM-trained models are more data-efficient and demonstrate improved fine-tuning stability. Building on these findings, we experimentally show that a biphasic training strategy that sequentially applies CLM and then MLM, achieves optimal performance under a fixed computational training budget. Moreover, we demonstrate that this strategy becomes more appealing when initializing from readily available pretrained CLM models (from the existing LLM ecosystem), reducing the computational burden needed to train best-in-class encoder models.
We release all project artifacts at https://hf.co/MLMvsCLM to foster further research.

[41] La Leaderboard: A Large Language Model Leaderboard for Spanish Varieties and Languages of Spain and Latin America

María Grandury,Javier Aula-Blasco,Júlia Falcão,Clémentine Fourrier,Miguel González,Gonzalo Martínez,Gonzalo Santamaría,Rodrigo Agerri,Nuria Aldama,Luis Chiruzzo,Javier Conde,Helena Gómez,Marta Guerrero,Guido Ivetta,Natalia López,Flor Miriam Plaza-del-Arco,María Teresa Martín-Valdivia,Helena Montoro,Carmen Muñoz,Pedro Reviriego,Leire Rosado,Alejandro Vaca,María Estrella Vallecillo-Rodríguez,Jorge Vallego,Irune Zubiaga

Main category: cs.CL

TL;DR: The paper introduces La Leaderboard, an open-source leaderboard for evaluating generative Large Language Models (LLMs) in languages of Spain and Latin America, aiming to represent linguistic and cultural diversity.

Details Motivation: To motivate the development of LLMs that represent the linguistic and cultural diversity of the Spanish-speaking community. Method: The paper presents La Leaderboard, combining 66 datasets across multiple languages and evaluates 50 models. They also provide guidance on selecting suitable evaluation setups and explain their methodology. Result: This initial version combines 66 datasets in Basque, Catalan, Galician, and different Spanish varieties, showcasing the evaluation results of 50 models. Conclusion: La Leaderboard is a community-driven project that aims to establish an evaluation standard for LLMs in the Spanish-speaking community and encourages similar developments in other languages. Abstract: Leaderboards showcase the current capabilities and limitations of Large Language Models (LLMs). To motivate the development of LLMs that represent the linguistic and cultural diversity of the Spanish-speaking community, we present La Leaderboard, the first open-source leaderboard to evaluate generative LLMs in languages and language varieties of Spain and Latin America. La Leaderboard is a community-driven project that aims to establish an evaluation standard for everyone interested in developing LLMs for the Spanish-speaking community. This initial version combines 66 datasets in Basque, Catalan, Galician, and different Spanish varieties, showcasing the evaluation results of 50 models. To encourage community-driven development of leaderboards in other languages, we explain our methodology, including guidance on selecting the most suitable evaluation setup for each downstream task. In particular, we provide a rationale for using fewer few-shot examples than typically found in the literature, aiming to reduce environmental impact and facilitate access to reproducible results for a broader research community.

[42] SciArena: An Open Evaluation Platform for Foundation Models in Scientific Literature Tasks

Yilun Zhao,Kaiyan Zhang,Tiansheng Hu,Sihong Wu,Ronan Le Bras,Taira Anderson,Jonathan Bragg,Joseph Chee Chang,Jesse Dodge,Matt Latzke,Yixin Liu,Charles McGrady,Xiangru Tang,Zihang Wang,Chen Zhao,Hannaneh Hajishirzi,Doug Downey,Arman Cohan

Main category: cs.CL

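TL;DR: SciArena is an open, collaborative platform that evaluates foundation models on scientific literature tasks through community voting, and it releases SciArena-Eval to advance research on automated evaluation systems.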

Details Motivation: Traditional benchmarks for scientific literature understanding and synthesis do not adequately reflect real-world needs, so a community-driven platform is needed to evaluate models more effectively. Method: SciArena follows the Chatbot Arena approach: researchers vote on model responses, the votes are aggregated into a model ranking, and the collected data is used to build the SciArena-Eval meta-evaluation benchmark. Result: The platform supports 23 open-source and proprietary foundation models and has collected over 13,000 votes from researchers. The submitted questions are diverse and aligned with real-world needs, and the participating researchers show strong self-consistency and inter-annotator agreement. Conclusion: SciArena demonstrates the effectiveness of community-driven evaluation, while SciArena-Eval underscores the need for more reliable automated evaluation methods. Abstract: We present SciArena, an open and collaborative platform for evaluating foundation models on scientific literature tasks. Unlike traditional benchmarks for scientific literature understanding and synthesis, SciArena engages the research community directly, following the Chatbot Arena evaluation approach of community voting on model comparisons. By leveraging collective intelligence, SciArena offers a community-driven evaluation of model performance on open-ended scientific tasks that demand literature-grounded, long-form responses. The platform currently supports 23 open-source and proprietary foundation models and has collected over 13,000 votes from trusted researchers across diverse scientific domains. We analyze the data collected so far and confirm that the submitted questions are diverse, aligned with real-world literature needs, and that participating researchers demonstrate strong self-consistency and inter-annotator agreement in their evaluations. We discuss the results and insights based on the model ranking leaderboard. To further promote research in building model-based automated evaluation systems for literature tasks, we release SciArena-Eval, a meta-evaluation benchmark based on our collected preference data. The benchmark measures the accuracy of models in judging answer quality by comparing their pairwise assessments with human votes. Our experiments highlight the benchmark's challenges and emphasize the need for more reliable automated evaluation methods.
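The meta-evaluation idea described for SciArena-Eval, scoring a model judge by how often its pairwise preference matches the human vote, can be sketched as follows. The data layout (one `"A"` or `"B"` label per comparison) and the function name `judge_accuracy` are assumptions for illustration; the benchmark's actual format may differ.

```python
def judge_accuracy(human_votes, judge_votes):
    """Fraction of pairwise comparisons where the judge agrees with the human vote.

    Both arguments are equally long sequences of labels such as "A" or "B",
    one entry per (question, response A, response B) comparison.
    """
    if len(human_votes) != len(judge_votes):
        raise ValueError("vote lists must have the same length")
    if not human_votes:
        raise ValueError("at least one comparison is required")
    agreements = sum(1 for h, j in zip(human_votes, judge_votes) if h == j)
    return agreements / len(human_votes)


# Hypothetical votes for four comparisons.
human = ["A", "B", "B", "A"]
judge = ["A", "B", "A", "A"]
print(judge_accuracy(human, judge))  # 3 of 4 comparisons agree
```

With two options per comparison, a judge that guesses at random would land near 0.5, so that is the natural baseline to compare against.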

cs.CV [Back]

[43] Moment Sampling in Video LLMs for Long-Form Video QA

Mustafa Chasmai,Gauri Jagatap,Gouthaman KV,Grant Van Horn,Subhransu Maji,Andrea Fanelli

Main category: cs.CV

TL;DR: This paper proposes a new frame sampling technique called 'moment sampling' that improves long-form video question answering by selecting the most relevant frames based on the question context.

Details Motivation: Existing frame sub-sampling techniques often miss crucial frames or include redundant information, limiting the performance of Video LLMs on long-form video content. Method: A lightweight text-to-video moment retrieval model is used to prioritize and select the most relevant frames for answering questions in long videos. Result: Experiments on four long-form VideoQA datasets using four state-of-the-art Video LLMs show improved performance with the proposed moment sampling method. Conclusion: The proposed moment sampling approach effectively enhances long-form VideoQA performance by selecting relevant frames based on the question context, outperforming traditional sub-sampling methods. Abstract: Recent advancements in video large language models (Video LLMs) have significantly advanced the field of video question answering (VideoQA). While existing methods perform well on short videos, they often struggle with long-range reasoning in longer videos. To scale Video LLMs for longer video content, frame sub-sampling (selecting frames at regular intervals) is commonly used. However, this approach is suboptimal, often leading to the loss of crucial frames or the inclusion of redundant information from multiple similar frames. Missing key frames impairs the model's ability to answer questions accurately, while redundant frames lead the model to focus on irrelevant video segments and increase computational resource consumption. In this paper, we investigate the use of a general-purpose text-to-video moment retrieval model to guide the frame sampling process. We propose "moment sampling", a novel, model-agnostic approach that enables the model to select the most relevant frames according to the context of the question. Specifically, we employ a lightweight moment retrieval model to prioritize frame selection. By focusing on the frames most pertinent to the given question, our method enhances long-form VideoQA performance in Video LLMs. 
Through extensive experiments on four long-form VideoQA datasets, using four state-of-the-art Video LLMs, we demonstrate the effectiveness of the proposed approach.
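The contrast between regular-interval sub-sampling and question-guided selection can be illustrated with a small sketch. It assumes a moment retrieval model has already produced one relevance score per frame for the question; the top-k rule and the function names below are illustrative assumptions, not the paper's exact procedure.

```python
def uniform_sample(num_frames, k):
    """Baseline: pick k frame indices at regular intervals (bin centres)."""
    if num_frames <= 0 or k <= 0:
        return []
    k = min(k, num_frames)
    # Integer arithmetic keeps the result deterministic.
    return [(2 * i + 1) * num_frames // (2 * k) for i in range(k)]


def moment_sample(relevance, k):
    """Pick the k frames with the highest question relevance, in temporal order.

    `relevance` holds one score per frame, assumed to come from a
    text-to-video moment retrieval model (hypothetical input here).
    Ties are broken in favour of earlier frames.
    """
    if k <= 0:
        return []
    k = min(k, len(relevance))
    ranked = sorted(range(len(relevance)), key=lambda i: (-relevance[i], i))
    return sorted(ranked[:k])


# Hypothetical scores: the answer-bearing moment spans frames 3 to 5.
scores = [0.1, 0.0, 0.2, 0.9, 0.8, 0.7, 0.1, 0.0, 0.0, 0.1]
print(uniform_sample(len(scores), 3))  # [1, 5, 8]
print(moment_sample(scores, 3))        # [3, 4, 5]
```

In this toy case the uniform baseline catches only one frame of the relevant moment, while the score-guided selection spends the whole frame budget on it.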

[44] Catastrophic Forgetting Mitigation via Discrepancy-Weighted Experience Replay

Xinrun Xu,Jianwen Yang,Qiuhong Zhang,Zhanbiao Lian,Zhiming Ding,Shan Jiang

Main category: cs.CV

TL;DR: The paper proposes ER-EMU, an edge model update algorithm based on adaptive experience replay, to address the problem of catastrophic forgetting when adapting edge models in cloud-edge collaborative object detection for traffic monitoring.

Details Motivation: Continually adapting edge models in cloud-edge collaborative object detection for traffic monitoring suffers from catastrophic forgetting. Existing approaches struggle to effectively prioritize and leverage historical data for optimal knowledge retention and adaptation. Method: This paper proposes ER-EMU, an edge model update algorithm based on adaptive experience replay. It uses FIFO principle and DDM-ES algorithm which employs MK-MMD to quantify dissimilarity between target domains. Result: Experiments on the Bellevue traffic video dataset demonstrate that ER-EMU consistently improves the performance of several state-of-the-art cloud-edge collaborative object detection frameworks. Conclusion: ER-EMU ensures training diversity and facilitates the retention of knowledge from a wider range of past experiences, while also preventing overfitting to the new domain. Abstract: Continually adapting edge models in cloud-edge collaborative object detection for traffic monitoring suffers from catastrophic forgetting, where models lose previously learned knowledge when adapting to new data distributions. This is especially problematic in dynamic traffic environments characterised by periodic variations (e.g., day/night, peak hours), where past knowledge remains valuable. Existing approaches like experience replay and visual prompts offer some mitigation, but struggle to effectively prioritize and leverage historical data for optimal knowledge retention and adaptation. Specifically, simply storing and replaying all historical data can be inefficient, while treating all historical experiences as equally important overlooks their varying relevance to the current domain. This paper proposes ER-EMU, an edge model update algorithm based on adaptive experience replay, to address these limitations. 
ER-EMU utilizes a limited-size experience buffer managed using a First-In-First-Out (FIFO) principle, and a novel Domain Distance Metric-based Experience Selection (DDM-ES) algorithm. DDM-ES employs the multi-kernel maximum mean discrepancy (MK-MMD) to quantify the dissimilarity between target domains, prioritizing the selection of historical data that is most dissimilar to the current target domain. This ensures training diversity and facilitates the retention of knowledge from a wider range of past experiences, while also preventing overfitting to the new domain. The experience buffer is also updated using a simple random sampling strategy to maintain a balanced representation of previous domains. Experiments on the Bellevue traffic video dataset, involving repeated day/night cycles, demonstrate that ER-EMU consistently improves the performance of several state-of-the-art cloud-edge collaborative object detection frameworks.
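The selection step described above can be sketched with a FIFO buffer and a multi-kernel MMD estimate. This is a simplified illustration under stated assumptions: domains are represented by small lists of feature vectors, the kernel is an average of Gaussian kernels with hand-picked bandwidths, and the class and method names are hypothetical. The random-sampling buffer update mentioned in the abstract is not modelled.

```python
import math
from collections import deque


def _multi_kernel(a, b, bandwidths):
    """Average of Gaussian kernels with several bandwidths (assumed kernel family)."""
    sq_dist = sum((x - y) ** 2 for x, y in zip(a, b))
    return sum(math.exp(-sq_dist / (2 * s * s)) for s in bandwidths) / len(bandwidths)


def mk_mmd2(xs, ys, bandwidths=(0.5, 1.0, 2.0)):
    """Biased estimate of squared MK-MMD between two sets of feature vectors."""
    def mean_kernel(p, q):
        total = sum(_multi_kernel(a, b, bandwidths) for a in p for b in q)
        return total / (len(p) * len(q))
    return mean_kernel(xs, xs) + mean_kernel(ys, ys) - 2 * mean_kernel(xs, ys)


class ExperienceBuffer:
    """Fixed-size FIFO buffer of past domains, each stored as (name, features)."""

    def __init__(self, capacity):
        self._items = deque(maxlen=capacity)  # oldest entry is evicted first

    def add(self, name, features):
        self._items.append((name, features))

    def select_most_dissimilar(self, current_features, n):
        """Names of the n stored domains farthest from the current domain."""
        ranked = sorted(
            self._items,
            key=lambda item: mk_mmd2(item[1], current_features),
            reverse=True,
        )
        return [name for name, _ in ranked[:n]]


# Hypothetical 2-D features for three past domains.
buffer = ExperienceBuffer(capacity=2)
buffer.add("day_1", [[0.0, 0.0], [0.2, 0.1]])
buffer.add("night_1", [[5.0, 5.0], [5.1, 4.9]])
buffer.add("day_2", [[0.1, 0.1], [0.0, 0.2]])  # evicts day_1 (FIFO)

current_day = [[0.1, 0.0], [0.0, 0.1]]
print(buffer.select_most_dissimilar(current_day, 1))  # ['night_1']
```

Replaying the night-time domain while adapting to daytime data is what keeps the earlier knowledge in the training mix, which matches the stated goal of training diversity.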

[45] MR-CLIP: Efficient Metadata-Guided Learning of MRI Contrast Representations

Mehmet Yigit Avci,Pedro Borges,Paul Wright,Mehmet Yigitsoy,Sebastien Ourselin,Jorge Cardoso

Main category: cs.CV

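TL;DR: This paper proposes MR-CLIP, a multimodal contrastive learning framework that learns contrast-aware representations by aligning MRI images with their DICOM metadata, addressing the difficulties in medical image interpretation, retrieval, and integration into clinical workflows caused by the lack of reliable, standardized metadata.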

Details Motivation: Broad labels used to simplify contrast identification (such as T1-weighted or T2-weighted) are often missing in real-world datasets, and the available metadata is frequently incomplete, noisy, or inconsistent, so a new approach is needed to support image interpretation, retrieval, and integration into clinical workflows. Method: The authors propose MR-CLIP, a multimodal contrastive learning framework that aligns MR images with their DICOM metadata to learn contrast-aware representations without relying on manual labels. Result: MR-CLIP is shown to be effective in cross-modal retrieval and contrast classification, highlighting its scalability and potential for further clinical applications. Conclusion: MR-CLIP offers a way to address the difficulties in medical image interpretation, retrieval, and integration into clinical workflows that stem from the lack of reliable, standardized metadata. Abstract: Accurate interpretation of Magnetic Resonance Imaging scans in clinical systems is based on a precise understanding of image contrast. This contrast is primarily governed by acquisition parameters, such as echo time and repetition time, which are stored in the DICOM metadata. To simplify contrast identification, broad labels such as T1-weighted or T2-weighted are commonly used, but these offer only a coarse approximation of the underlying acquisition settings. In many real-world datasets, such labels are entirely missing, leaving raw acquisition parameters as the only indicators of contrast. Adding to this challenge, the available metadata is often incomplete, noisy, or inconsistent. The lack of reliable and standardized metadata complicates tasks such as image interpretation, retrieval, and integration into clinical workflows. Furthermore, robust contrast-aware representations are essential to enable more advanced clinical applications, such as achieving modality-invariant representations and data harmonization. To address these challenges, we propose MR-CLIP, a multimodal contrastive learning framework that aligns MR images with their DICOM metadata to learn contrast-aware representations, without relying on manual labels. Trained on a diverse clinical dataset that spans various scanners and protocols, MR-CLIP captures contrast variations across acquisitions and within scans, enabling anatomy-invariant representations. We demonstrate its effectiveness in cross-modal retrieval and contrast classification, highlighting its scalability and potential for further clinical applications.
The code and weights are publicly available at https://github.com/myigitavci/MR-CLIP.
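As an illustration of what aligning images with metadata through contrastive learning can look like, here is a plain-Python sketch of a symmetric CLIP-style (InfoNCE) loss over a batch of paired embeddings. The abstract does not spell out MR-CLIP's loss, so this objective, the temperature value, and the function names are assumptions rather than the released implementation.

```python
import math


def _normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def _mean_diagonal_nll(logits):
    """Mean cross-entropy where row i's correct column is i."""
    total = 0.0
    for i, row in enumerate(logits):
        peak = max(row)  # subtract the max for numerical stability
        log_sum_exp = peak + math.log(sum(math.exp(v - peak) for v in row))
        total += log_sum_exp - row[i]
    return total / len(logits)


def clip_style_loss(image_embs, meta_embs, temperature=0.07):
    """Symmetric contrastive loss; image_embs[i] is paired with meta_embs[i]."""
    if len(image_embs) != len(meta_embs) or not image_embs:
        raise ValueError("need a non-empty batch of paired embeddings")
    images = [_normalize(v) for v in image_embs]
    metas = [_normalize(v) for v in meta_embs]
    # Cosine similarity of every image with every metadata embedding.
    logits = [
        [sum(a * b for a, b in zip(img, meta)) / temperature for meta in metas]
        for img in images
    ]
    logits_t = [list(col) for col in zip(*logits)]
    # Average the image-to-metadata and metadata-to-image directions.
    return 0.5 * (_mean_diagonal_nll(logits) + _mean_diagonal_nll(logits_t))


# Hypothetical 2-D embeddings for two scans and their metadata.
images = [[1.0, 0.0], [0.0, 1.0]]
matched = [[1.0, 0.0], [0.0, 1.0]]
swapped = [[0.0, 1.0], [1.0, 0.0]]
print(clip_style_loss(images, matched) < clip_style_loss(images, swapped))  # True
```

The loss is low when each scan's embedding sits closest to its own metadata embedding and high when the pairing is scrambled, which is the signal that would drive the two encoders toward a shared, contrast-aware space.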

[46] HistoART: Histopathology Artifact Detection and Reporting Tool

Seyed Kahaki,Alexander R. Webber,Ghada Zamzmi,Adarsh Subbaswamy,Rucha Deshpande,Aldo Badano

Main category: cs.CV

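TL;DR: This study proposes and compares three approaches for detecting artifacts in whole slide images (WSIs); the foundation model-based approach (FMA) performs best, with a patch-wise AUROC of 0.995.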

Details Motivation: WSI is widely used to digitize tissue specimens, but it is vulnerable to artifacts introduced during slide preparation and scanning, which can compromise downstream image analysis. Method: Three WSI artifact detection approaches are compared: a foundation model-based approach (FMA), a deep learning approach (DLA) built on a ResNet50 backbone, and a knowledge-based approach (KBA) using handcrafted features. Result: The FMA achieved the highest patch-wise AUROC of 0.995, outperforming the ResNet50-based method (AUROC: 0.977) and the KBA (AUROC: 0.940). Conclusion: The FMA performs best at detecting six common artifact types in WSIs, followed by the DLA and the KBA. Abstract: In modern cancer diagnostics, Whole Slide Imaging (WSI) is widely used to digitize tissue specimens for detailed, high-resolution examination; however, other diagnostic approaches, such as liquid biopsy and molecular testing, are also utilized based on the cancer type and clinical context. While WSI has revolutionized digital histopathology by enabling automated, precise analysis, it remains vulnerable to artifacts introduced during slide preparation and scanning. These artifacts can compromise downstream image analysis. To address this challenge, we propose and compare three robust artifact detection approaches for WSIs: (1) a foundation model-based approach (FMA) using a fine-tuned Unified Neural Image (UNI) architecture, (2) a deep learning approach (DLA) built on a ResNet50 backbone, and (3) a knowledge-based approach (KBA) leveraging handcrafted features from texture, color, and frequency-based metrics. The methods target six common artifact types: tissue folds, out-of-focus regions, air bubbles, tissue damage, marker traces, and blood contamination. Evaluations were conducted on 50,000+ image patches from diverse scanners (Hamamatsu, Philips, Leica Aperio AT2) across multiple sites. The FMA achieved the highest patch-wise AUROC of 0.995 (95% CI [0.994, 0.995]), outperforming the ResNet50-based method (AUROC: 0.977, 95% CI [0.977, 0.978]) and the KBA (AUROC: 0.940, 95% CI [0.933, 0.946]). To translate detection into actionable insights, we developed a quality report scorecard that quantifies high-quality patches and visualizes artifact distributions.

[47] CaughtCheating: Is Your MLLM a Good Cheating Detective? Exploring the Boundary of Visual Perception and Reasoning

Ming Li,Chenguang Wang,Yijun Liang,Xiyao Wang,Yuhang Zhou,Xiyang Wu,Yuqing Zhang,Ruiyi Zhang,Tianyi Zhou

Main category: cs.CV

TL;DR: This paper introduces the CaughtCheating benchmark, where MLLMs like GPT-o3 struggle to detect subtle visual clues, revealing gaps in their detective-level perception and reasoning abilities.

Details Motivation: As MLLMs like GPT-o3 achieve near-human performance on existing benchmarks, there is a need for more challenging tasks to further push their capabilities, especially in expert-level reasoning like that of human detectives. Method: The authors conducted extensive experiments and analysis to evaluate the performance of GPT-o3 on the newly introduced CaughtCheating task, which involves detecting suspicious clues in images. Result: GPT-o3's performance drops to nearly zero on the CaughtCheating task, highlighting limitations in current MLLMs' ability to detect subtle visual cues and perform the complex reasoning required for such detective-like tasks. Conclusion: CaughtCheating provides challenging visual perception and reasoning tasks that can help advance MLLMs towards human-level detective capabilities. Abstract: Recent agentic Multi-Modal Large Language Models (MLLMs) such as GPT-o3 have achieved near-ceiling scores on various existing benchmarks, motivating a demand for more challenging test tasks. These MLLMs have been reported to excel in a few expert-level tasks for humans, e.g., GeoGuesser, reflecting their potential as a detective who can notice minuscule cues in an image and weave them into coherent, situational explanations, leading to a reliable answer. But can they match the performance of excellent human detectives? To answer this question, we investigate some hard scenarios that GPT-o3 can still handle, and find a common scenario where o3's performance drops to nearly zero, which we name CaughtCheating. It is inspired by the social media requests that ask others to detect suspicious clues from photos shared by the poster's partner. We conduct extensive experiments and analysis to understand why existing MLLMs lack sufficient capability to solve this kind of task. CaughtCheating provides a class of challenging visual perception and reasoning tasks with great value and practical usage. Success in these tasks paves the way for MLLMs to acquire human-level detective perception and reasoning capabilities.

[48] Evolutionary computing-based image segmentation method to detect defects and features in Additive Friction Stir Deposition Process

Akshansh Mishra,Eyob Mesele Sefene,Shivraman Thapliyal

Main category: cs.CV

TL;DR: This paper uses PSO and advanced visualization to detect defects and assess quality in AFSD manufacturing processes.

Details Motivation: To improve the detection of defects and material transitions in Additive Friction Stir Deposition (AFSD) processes for better quality assessment. Method: Particle Swarm Optimization (PSO) combined with gradient magnitude analysis and distance transforms for image segmentation, creating attention-weighted visualizations. Result: PSO identified optimal thresholds (156-173), enabling precise segmentation. Multi-channel visualization highlighted material interfaces and defects not visible through conventional imaging. Conclusion: The proposed method successfully identifies critical interface regions and defects in AFSD processes using evolutionary computing and multi-channel visualization. Abstract: This work proposes an evolutionary computing-based image segmentation approach for analyzing soundness in Additive Friction Stir Deposition (AFSD) processes. Particle Swarm Optimization (PSO) was employed to determine optimal segmentation thresholds for detecting defects and features in multilayer AFSD builds. The methodology integrates gradient magnitude analysis with distance transforms to create novel attention-weighted visualizations that highlight critical interface regions. Five AFSD samples processed under different conditions were analyzed using multiple visualization techniques i.e. self-attention maps, and multi-channel visualization. These complementary approaches reveal subtle material transition zones and potential defect regions which were not readily observable through conventional imaging. The PSO algorithm automatically identified optimal threshold values (ranging from 156-173) for each sample, enabling precise segmentation of material interfaces. The multi-channel visualization technique effectively combines boundary information (red channel), spatial relationships (green channel), and material density data (blue channel) into cohesive representations that quantify interface quality. 
The results demonstrate that attention-based analysis successfully identifies regions of incomplete bonding and inhomogeneities in AFSD joints, providing quantitative metrics for process optimization and quality assessment of additively manufactured components.
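
The threshold search described in this abstract can be illustrated with a small particle swarm over a single grey-level threshold. This is only a sketch under stated assumptions: the abstract does not give the fitness function, so an Otsu-style between-class variance stands in for it, and the names `between_class_variance` and `pso_threshold`, the swarm size, and the inertia and acceleration constants are all hypothetical choices, not the authors' settings.

```python
import random


def between_class_variance(hist, t):
    """Otsu-style score for splitting a grey-level histogram at threshold t.

    Assumed fitness function; the paper's actual objective is not stated.
    """
    total = sum(hist)
    w0 = sum(hist[:t])          # pixels below the threshold
    w1 = total - w0             # pixels at or above the threshold
    if w0 == 0 or w1 == 0:
        return 0.0
    mu0 = sum(i * hist[i] for i in range(t)) / w0
    mu1 = sum(i * hist[i] for i in range(t, len(hist))) / w1
    return (w0 / total) * (w1 / total) * (mu0 - mu1) ** 2


def pso_threshold(hist, n_particles=20, n_iters=50, seed=0):
    """Search for the threshold that maximises the fitness with a basic PSO."""
    rng = random.Random(seed)
    lo, hi = 1, len(hist) - 1

    def score(x):
        return between_class_variance(hist, int(round(x)))

    pos = [rng.uniform(lo, hi) for _ in range(n_particles)]
    vel = [0.0] * n_particles
    best_pos = pos[:]                       # per-particle best positions
    best_val = [score(x) for x in pos]
    g = max(range(n_particles), key=lambda i: best_val[i])
    g_pos, g_val = best_pos[g], best_val[g]  # swarm-wide best

    for _ in range(n_iters):
        for i in range(n_particles):
            r1, r2 = rng.random(), rng.random()
            # Inertia + pull towards personal best + pull towards global best.
            vel[i] = (0.7 * vel[i]
                      + 1.5 * r1 * (best_pos[i] - pos[i])
                      + 1.5 * r2 * (g_pos - pos[i]))
            pos[i] = min(hi, max(lo, pos[i] + vel[i]))
            val = score(pos[i])
            if val > best_val[i]:
                best_pos[i], best_val[i] = pos[i], val
                if val > g_val:
                    g_pos, g_val = pos[i], val
    return int(round(g_pos))
```

On a real image the histogram would come from the pixel intensities; the reported 156-173 range refers to the authors' samples and is not reproduced by this toy search.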

[49] AdaDeDup: Adaptive Hybrid Data Pruning for Efficient Large-Scale Object Detection Training

Feiyang Kang,Nadine Chang,Maying Shen,Marc T. Law,Rafid Mahmood,Ruoxi Jia,Jose M. Alvarez

Main category: cs.CV

TL;DR: Proposes AdaDeDup, a new hybrid framework for compressing the datasets used in large-scale machine learning model training.

Details Motivation: The computational burden and inherent redundancy of large-scale datasets challenge the training of modern machine learning models, while existing data pruning methods are either task-agnostic or introduce redundancy. Method: AdaDeDup combines density-based pruning with model feedback in a cluster-adaptive manner: it first partitions the data and applies an initial density-based pruning; it then uses a proxy model to evaluate the impact of pruning within each cluster and adjusts cluster-specific pruning thresholds according to the loss difference between kept and pruned samples. Result: Experiments on large-scale object detection benchmarks including Waymo, COCO, and nuScenes show that AdaDeDup significantly outperforms existing baselines and achieves near-original model performance while pruning 20% of the data. Conclusion: AdaDeDup effectively improves data efficiency in large-scale model training and is a data compression method with broad application prospects. Abstract: The computational burden and inherent redundancy of large-scale datasets challenge the training of contemporary machine learning models. Data pruning offers a solution by selecting smaller, informative subsets, yet existing methods struggle: density-based approaches can be task-agnostic, while model-based techniques may introduce redundancy or prove computationally prohibitive. We introduce Adaptive De-Duplication (AdaDeDup), a novel hybrid framework that synergistically integrates density-based pruning with model-informed feedback in a cluster-adaptive manner. AdaDeDup first partitions data and applies an initial density-based pruning. It then employs a proxy model to evaluate the impact of this initial pruning within each cluster by comparing losses on kept versus pruned samples. This task-aware signal adaptively adjusts cluster-specific pruning thresholds, enabling more aggressive pruning in redundant clusters while preserving critical data in informative ones. Extensive experiments on large-scale object detection benchmarks (Waymo, COCO, nuScenes) using standard models (BEVFormer, Faster R-CNN) demonstrate AdaDeDup's advantages. It significantly outperforms prominent baselines, substantially reduces performance degradation (e.g., over 54% versus random sampling on Waymo), and achieves near-original model performance while pruning 20% of data, highlighting its efficacy in enhancing data efficiency for large-scale model training. Code is open-sourced.
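
The cluster-adaptive step can be pictured with a toy update rule. This is a sketch under assumptions, not the published algorithm: the abstract only says that proxy-model losses on kept versus pruned samples adjust per-cluster thresholds, so the linear rule, the `step` factor, the clamping bounds, and the name `adapt_prune_ratios` are all hypothetical.

```python
from statistics import mean


def adapt_prune_ratios(clusters, base_ratio=0.2, step=0.5, lo=0.0, hi=0.9):
    """Return a pruning ratio per cluster from proxy-model losses.

    clusters maps a cluster name to {"kept_losses": [...], "pruned_losses": [...]}.
    Assumed rule: if the pruned samples have higher loss than the kept ones,
    they carried information, so prune less; if lower, prune more.
    """
    ratios = {}
    for name, c in clusters.items():
        kept = mean(c["kept_losses"])
        pruned = mean(c["pruned_losses"])
        rel_gap = (pruned - kept) / kept if kept else 0.0
        ratio = base_ratio * (1.0 - step * rel_gap)
        ratios[name] = min(hi, max(lo, ratio))  # keep the ratio in a sane range
    return ratios


example = {
    "redundant": {"kept_losses": [1.0, 1.0], "pruned_losses": [0.5, 0.5]},
    "neutral": {"kept_losses": [1.0, 1.0], "pruned_losses": [1.0, 1.0]},
    "informative": {"kept_losses": [1.0, 1.0], "pruned_losses": [2.0, 2.0]},
}
ratios = adapt_prune_ratios(example)
```

In this toy input the "redundant" cluster ends up with a higher ratio than the base and the "informative" cluster with a lower one, which mirrors the behaviour the abstract describes.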

[50] VSF-Med:A Vulnerability Scoring Framework for Medical Vision-Language Models

Binesh Sadanandan,Vahid Behzadan

Main category: cs.CV

TL;DR: This paper proposes VSF-Med, a new framework for assessing security risks in medical VLMs, demonstrating that leading models are vulnerable to various types of attacks.

Details Motivation: There is a lack of systematic security evaluations for VLMs in clinical settings despite their potential to improve medical imaging workflows. Method: VSF-Med combines text-prompt attack templates, imperceptible visual perturbations, and an eight-dimensional rubric evaluated by LLMs to produce a composite risk score. Result: The analysis reveals mean z-score shifts across state-of-the-art VLMs, with Llama-3.2-11B-Vision-Instruct and GPT-4o showing notable vulnerability increases in different attack vectors. Conclusion: The paper introduces VSF-Med, a framework for evaluating vulnerabilities in medical Vision Language Models (VLMs), highlighting significant variability in model susceptibility to attacks. Abstract: Vision Language Models (VLMs) hold great promise for streamlining labour-intensive medical imaging workflows, yet systematic security evaluations in clinical settings remain scarce. We introduce VSF-Med, an end-to-end vulnerability-scoring framework for medical VLMs that unites three novel components: (i) a rich library of sophisticated text-prompt attack templates targeting emerging threat vectors; (ii) imperceptible visual perturbations calibrated by structural similarity (SSIM) thresholds to preserve clinical realism; and (iii) an eight-dimensional rubric evaluated by two independent judge LLMs, whose raw scores are consolidated via z-score normalization to yield a 0-32 composite risk metric. Built entirely on publicly available datasets and accompanied by open-source code, VSF-Med synthesizes over 30,000 adversarial variants from 5,000 radiology images and enables reproducible benchmarking of any medical VLM with a single command. Our consolidated analysis reports mean z-score shifts of $0.90\sigma$ for persistence-of-attack-effects, $0.74\sigma$ for prompt-injection effectiveness, and $0.63\sigma$ for safety-bypass success across state-of-the-art VLMs. 
Notably, Llama-3.2-11B-Vision-Instruct exhibits a peak vulnerability increase of $1.29\sigma$ for persistence-of-attack-effects, while GPT-4o shows increases of $0.69\sigma$ for that same vector and $0.28\sigma$ for prompt-injection attacks.
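
The consolidation of two judges' raw scores by z-score normalization can be sketched as follows. This assumes the simplest reading, where each judge's scores are standardized across samples and then averaged; the helper names `zscores` and `consolidate` are hypothetical, and the mapping onto the 0-32 composite scale is not described in the abstract, so it is left out.

```python
from statistics import mean, pstdev


def zscores(xs):
    """Standardize one judge's raw scores to zero mean and unit variance."""
    mu, sd = mean(xs), pstdev(xs)
    return [0.0 if sd == 0 else (x - mu) / sd for x in xs]


def consolidate(judge_a, judge_b):
    """Average the two judges' z-scores sample by sample (assumed rule)."""
    return [(a + b) / 2 for a, b in zip(zscores(judge_a), zscores(judge_b))]


# Two judges that use different raw scales but agree on the ranking.
combined = consolidate([10, 20, 30], [1, 2, 3])
```

Standardizing first removes each judge's own scale and offset, so a lenient and a strict judge contribute equally to the combined score.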

[51] MANTA: Cross-Modal Semantic Alignment and Information-Theoretic Optimization for Long-form Multimodal Understanding

Ziqi Zhong,Daniel Tang

Main category: cs.CV

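TL;DR: MANTA is a multimodal learning framework that unifies visual and auditory inputs and significantly outperforms existing techniques on long video question answering.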

Details Motivation: Current multimodal learning methods usually process modalities independently, leading to inconsistencies in representation and reasoning. MANTA aims to solve this problem and enable seamless multimodal fusion. Method: MANTA introduces semantic alignment with information-theoretic optimization, adaptive temporal synchronization, hierarchical content representation, and context-aware information retrieval, and builds a rigorous mathematical framework to prove its optimality. Result: Experiments show that MANTA improves overall accuracy by up to 22.6%, with gains of 27.3% on videos longer than 30 minutes; it also improves temporal reasoning and cross-modal understanding by 23.8% and 25.1%, respectively. Conclusion: By unifying visual and auditory inputs into a structured textual space, MANTA effectively resolves inconsistencies in representation and reasoning in multimodal learning and performs strongly on long video question answering. Abstract: While multi-modal learning has advanced significantly, current approaches often treat modalities separately, creating inconsistencies in representation and reasoning. We introduce MANTA (Multi-modal Abstraction and Normalization via Textual Alignment), a theoretically-grounded framework that unifies visual and auditory inputs into a structured textual space for seamless processing with large language models. MANTA addresses four key challenges: (1) semantic alignment across modalities with information-theoretic optimization, (2) adaptive temporal synchronization for varying information densities, (3) hierarchical content representation for multi-scale understanding, and (4) context-aware retrieval of sparse information from long sequences. We formalize our approach within a rigorous mathematical framework, proving its optimality for context selection under token constraints. Extensive experiments on the challenging task of Long Video Question Answering show that MANTA improves state-of-the-art models by up to 22.6% in overall accuracy, with particularly significant gains (27.3%) on videos exceeding 30 minutes. Additionally, we demonstrate MANTA's superiority on temporal reasoning tasks (23.8% improvement) and cross-modal understanding (25.1% improvement). Our framework introduces novel density estimation techniques for redundancy minimization while preserving rare signals, establishing new foundations for unifying multimodal representations through structured text.

[52] An efficient plant disease detection using transfer learning approach

Bosubabu Sambana,Hillary Sunday Nnadi,Mohd Anas Wajid,Nwosu Ogochukwu Fidelia,Claudia Camacho-Zuñiga,Henry Dozie Ajuzie,Edeh Michael Onyema

Main category: cs.CV

TL;DR: This study proposes an automated plant disease detection system using YOLOv8, achieving high accuracy in identifying bacterial, fungal, and viral diseases.

Details Motivation: Early detection of plant diseases is crucial to mitigate their impact on crop productivity and quality, and automation through technology can enhance monitoring efficiency. Method: The study employs transfer learning with state-of-the-art object detection models YOLOv7 and YOLOv8, fine-tuned on a dataset of plant leaf images to detect diseases. Result: YOLOv8 demonstrated superior performance with metrics including 91.05 mAP, 89.40 F1-score, 91.22 Precision, and 87.66 Recall. Conclusion: The study concludes that the proposed system using YOLOv8 for plant disease detection offers a scalable and automated solution, enhancing crop yield and supporting sustainable agriculture. Abstract: Plant diseases pose significant challenges to farmers and the agricultural sector at large. However, early detection of plant diseases is crucial to mitigating their effects and preventing widespread damage, as outbreaks can severely impact the productivity and quality of crops. With advancements in technology, there are increasing opportunities for automating the monitoring and detection of disease outbreaks in plants. This study proposed a system designed to identify and monitor plant diseases using a transfer learning approach. Specifically, the study utilizes YOLOv7 and YOLOv8, two state-of-the-art models in the field of object detection. By fine-tuning these models on a dataset of plant leaf images, the system is able to accurately detect the presence of bacterial, fungal, and viral diseases such as Powdery Mildew, Angular Leaf Spot, Early blight and Tomato mosaic virus. The model's performance was evaluated using several metrics, including mean Average Precision (mAP), F1-score, Precision, and Recall, yielding values of 91.05, 89.40, 91.22, and 87.66, respectively. The result demonstrates the superior effectiveness and efficiency of YOLOv8 compared to other object detection methods, highlighting its potential for use in modern agricultural practices. The approach provides a scalable, automated solution for early detection of any plant disease, contributing to enhanced crop yield, reduced reliance on manual monitoring, and supporting sustainable agricultural practices.

[53] Diffusion-Based Image Augmentation for Semantic Segmentation in Outdoor Robotics

Peter Mortimer,Mirko Maehlisch

Main category: cs.CV

TL;DR: This paper proposes a new method for preparing autonomous vehicles for deployment in challenging environments using diffusion-based image augmentation and semantic segmentation models to improve performance in underrepresented visual scenes.

Details Motivation: The motivation stems from the fact that learning-based perception algorithms perform poorly in out-of-distribution and underrepresented environments. Outdoor robots, especially autonomous vehicles, face challenges due to rapid changes in visual scenes caused by factors like dynamic lighting, seasonality, and weather effects. Method: The method involves using diffusion-based image augmentation with the help of vision foundation models trained on large datasets. Open vocabulary semantic segmentation models are employed to filter out augmentation candidates containing hallucinations. Result: The result is a novel approach for diffusion-based image augmentation that allows control over the semantic distribution of ground surfaces in training data, enabling fine-tuning of models for specific deployment environments. Conclusion: The paper concludes that diffusion-based image augmentations can be used effectively to prepare autonomous vehicles for deployment in underrepresented environments, with potential applications extending beyond snow-filled environments to include sandy and volcanic terrains. Abstract: The performance of learning-based perception algorithms suffers when deployed in out-of-distribution and underrepresented environments. Outdoor robots are particularly susceptible to rapid changes in visual scene appearance due to dynamic lighting, seasonality and weather effects that lead to scenes underrepresented in the training data of the learning-based perception system. In this conceptual paper, we focus on preparing our autonomous vehicle for deployment in snow-filled environments. We propose a novel method for diffusion-based image augmentation to more closely represent the deployment environment in our training data. Diffusion-based image augmentations rely on the public availability of vision foundation models learned on internet-scale datasets. 
The diffusion-based image augmentations allow us to take control over the semantic distribution of the ground surfaces in the training data and to fine-tune our model for its deployment environment. We employ open vocabulary semantic segmentation models to filter out augmentation candidates that contain hallucinations. We believe that diffusion-based image augmentations can be extended to many other environments apart from snow surfaces, like sandy environments and volcanic terrains.

[54] FreeLong++: Training-Free Long Video Generation via Multi-band SpectralFusion

Yu Lu,Yi Yang

Main category: cs.CV

TL;DR: FreeLong++ is a training-free framework that improves existing video generation models on long video generation through multi-band frequency fusion.

Details Motivation: Existing short-video generation models suffer from degraded temporal consistency and visual quality when generating long videos, in particular distortion of high-frequency components. Method: FreeLong blends global low-frequency features with local high-frequency features, and FreeLong++ extends this into a multi-branch architecture that performs multi-band frequency fusion from low to high frequencies. Result: FreeLong++ outperforms previous methods on a range of tasks, including long video generation at 4x and 8x the native length, and supports coherent multi-prompt video generation as well as controllable video generation. Conclusion: Without any additional training, FreeLong++ substantially improves the temporal consistency and visual quality of video generation models on long videos. Abstract: Recent advances in video generation models have enabled high-quality short video generation from text prompts. However, extending these models to longer videos remains a significant challenge, primarily due to degraded temporal consistency and visual fidelity. Our preliminary observations show that naively applying short-video generation models to longer sequences leads to noticeable quality degradation. Further analysis identifies a systematic trend where high-frequency components become increasingly distorted as video length grows, an issue we term high-frequency distortion. To address this, we propose FreeLong, a training-free framework designed to balance the frequency distribution of long video features during the denoising process. FreeLong achieves this by blending global low-frequency features, which capture holistic semantics across the full video, with local high-frequency features extracted from short temporal windows to preserve fine details. Building on this, FreeLong++ extends FreeLong's dual-branch design into a multi-branch architecture with multiple attention branches, each operating at a distinct temporal scale. By arranging multiple window sizes from global to local, FreeLong++ enables multi-band frequency fusion from low to high frequencies, ensuring both semantic continuity and fine-grained motion dynamics across longer video sequences. Without any additional training, FreeLong++ can be plugged into existing video generation models (e.g. Wan2.1 and LTX-Video) to produce longer videos with substantially improved temporal consistency and visual fidelity. We demonstrate that our approach outperforms previous methods on longer video generation tasks (e.g. 4x and 8x of native length). It also supports coherent multi-prompt video generation with smooth scene transitions and enables controllable video generation using long depth or pose sequences.
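
The dual-branch blending idea (low frequencies from a global branch, high frequencies from a local branch) can be illustrated on a one-dimensional temporal signal. This is a toy sketch, not the FreeLong implementation: it assumes a hard cutoff in a plain discrete Fourier transform along time, and the function names `dft`, `idft`, and `blend_frequencies` as well as the `cutoff` parameter are hypothetical.

```python
import cmath


def dft(x):
    """Naive discrete Fourier transform of a real or complex sequence."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]


def idft(spectrum):
    """Inverse DFT, returning the real part of the reconstructed sequence."""
    n = len(spectrum)
    return [sum(spectrum[k] * cmath.exp(2j * cmath.pi * k * t / n)
                for k in range(n)).real / n
            for t in range(n)]


def blend_frequencies(global_feat, local_feat, cutoff):
    """Keep frequencies <= cutoff from global_feat and the rest from local_feat."""
    n = len(global_feat)
    g_spec, l_spec = dft(global_feat), dft(local_feat)
    mixed = []
    for k in range(n):
        freq = min(k, n - k)  # bins k and n-k carry the same temporal frequency
        mixed.append(g_spec[k] if freq <= cutoff else l_spec[k])
    return idft(mixed)


# Global branch: a constant (pure low frequency).
# Local branch: a frame-to-frame alternation (pure high frequency).
fused = blend_frequencies([1.0] * 8, [1.0, -1.0] * 4, cutoff=1)
```

A multi-band variant in the spirit of FreeLong++ would use several branches and assign each one its own frequency band instead of a single cutoff.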

[55] SelvaBox: A high-resolution dataset for tropical tree crown detection

Hugo Baudchon,Arthur Ouaknine,Martin Weiss,Mélisande Teng,Thomas R. Walla,Antoine Caron-Guay,Christopher Pal,Etienne Laliberté

Main category: cs.CV

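TL;DR: This paper introduces SelvaBox, a large-scale dataset for tropical tree crown detection that significantly improves detection accuracy and enables zero-shot detection.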

Details Motivation: Given the importance of tropical forest ecosystems and the pressures on them, advanced remote sensing methods are needed to detect individual tree crowns, but the scarcity of annotated datasets hinders model development. Method: Introduces the SelvaBox dataset and uses a multi-resolution pipeline for model training and evaluation. Result: SelvaBox contains more than 83,000 manually labeled crowns across three countries, an order of magnitude larger than all previous tropical forest datasets combined. Conclusion: SelvaBox is the largest open-access dataset for tropical tree crown detection, and joint training with other datasets improves detection performance. Abstract: Detecting individual tree crowns in tropical forests is essential to study these complex and crucial ecosystems impacted by human interventions and climate change. However, tropical crowns vary widely in size, structure, and pattern and are largely overlapping and intertwined, requiring advanced remote sensing methods applied to high-resolution imagery. Despite growing interest in tropical tree crown detection, annotated datasets remain scarce, hindering robust model development. We introduce SelvaBox, the largest open-access dataset for tropical tree crown detection in high-resolution drone imagery. It spans three countries and contains more than 83,000 manually labeled crowns - an order of magnitude larger than all previous tropical forest datasets combined. Extensive benchmarks on SelvaBox reveal two key findings: (1) higher-resolution inputs consistently boost detection accuracy; and (2) models trained exclusively on SelvaBox achieve competitive zero-shot detection performance on unseen tropical tree crown datasets, matching or exceeding competing methods. Furthermore, jointly training on SelvaBox and three other datasets at resolutions from 3 to 10 cm per pixel within a unified multi-resolution pipeline yields a detector ranking first or second across all evaluated datasets. Our dataset, code, and pre-trained weights are made public.

[56] Graph-Based Deep Learning for Component Segmentation of Maize Plants

J. I. Ruíz,A. Méndez,E. Rodríguez

Main category: cs.CV

TL;DR: This paper introduces a new deep learning architecture using GNNs and PCA to accurately identify individual plant components from 3D LiDAR data, outperforming current approaches.

Details Motivation: Traditional methods like 2D imaging, 3D reconstructions, and CNNs have drawbacks when processing 3D data and identifying individual plant components. This work aims to overcome these limitations using a new deep learning approach tailored for 3D Point Cloud data. Method: A novel Deep Learning architecture based on Graph Neural Networks (GNN) with feature enhancement using Principal Component Analysis (PCA), K-Nearest Neighbors (KNN) layer, Edge-Conv layers, and Graph Attention Networks (GAT). Result: The model achieves segmentation accuracy above 80% in the IoU average, surpassing other existing point cloud-based models. Conclusion: The proposed graph-based deep learning architecture effectively enhances segmentation accuracy for identifying individual plant components in LiDAR 3D Point Cloud datasets, outperforming existing models. Abstract: In precision agriculture, one of the most important tasks when exploring crop production is identifying individual plant components. There are several attempts to accomplish this task by the use of traditional 2D imaging, 3D reconstructions, and Convolutional Neural Networks (CNN). However, they have several drawbacks when processing 3D data and identifying individual plant components. Therefore, in this work, we propose a novel Deep Learning architecture to detect components of individual plants on Light Detection and Ranging (LiDAR) 3D Point Cloud (PC) data sets. This architecture is based on the concept of Graph Neural Networks (GNN), and feature enhancing with Principal Component Analysis (PCA). For this, each point is taken as a vertex and by the use of a K-Nearest Neighbors (KNN) layer, the edges are established, thus representing the 3D PC data set. Subsequently, Edge-Conv layers are used to further increase the features of each point. Finally, Graph Attention Networks (GAT) are applied to classify visible phenotypic components of the plant, such as the leaf, stem, and soil. 
This study demonstrates that our graph-based deep learning approach enhances segmentation accuracy for identifying individual plant components, achieving percentages above 80% in the IoU average, thus outperforming other existing models based on point clouds.
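
The graph construction step, where each point is a vertex and a K-Nearest Neighbors layer supplies the edges, can be sketched in a few lines. This is a brute-force illustration with a hypothetical `knn_edges` helper; the authors' layer, distance metric, and value of k are not given in the abstract, so Euclidean distance and directed edges are assumptions.

```python
import math


def knn_edges(points, k):
    """Return directed edges (i, j) linking each point i to its k nearest points j."""
    edges = []
    for i, p in enumerate(points):
        # Distance from point i to every other point, nearest first.
        dists = sorted((math.dist(p, q), j)
                       for j, q in enumerate(points) if j != i)
        edges.extend((i, j) for _, j in dists[:k])
    return edges


# Four 3D points forming two well-separated pairs.
cloud = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (5.0, 0.0, 0.0), (6.0, 0.0, 0.0)]
edges = knn_edges(cloud, k=1)
```

The brute-force search costs O(n^2) distance computations, so a real LiDAR point cloud would normally use a spatial index such as a k-d tree instead.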

[57] Computer Vision for Objects used in Group Work: Challenges and Opportunities

Changsoo Jung,Sheikh Mannan,Jack Fitzgerald,Nathaniel Blanchard

Main category: cs.CV

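TL;DR: This paper presents FiboSB, a new 6D pose video dataset, and shows the challenges and potential of 6D pose estimation in collaborative learning settings; fine-tuning YOLO11-x significantly improves detection accuracy.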

Details Motivation: Existing systems for collaborative tasks struggle to accurately capture real-world interactions between students and physical objects, a problem that 6D pose estimation could address. Method: Introduces the FiboSB dataset, evaluates four state-of-the-art 6D pose estimation methods, and fine-tunes YOLO11-x to improve object detection. Result: Tests on FiboSB reveal limitations of existing algorithms, in particular poorly performing object detection modules; after fine-tuning, YOLO11-x achieves an mAP_50 of 0.898. Conclusion: Fine-tuning YOLO11-x effectively improves object detection for 6D pose estimation in collaborative scenarios and lays the groundwork for using 6D pose estimation in complex collaborative settings. Abstract: Interactive and spatially aware technologies are transforming educational frameworks, particularly in K-12 settings where hands-on exploration fosters deeper conceptual understanding. However, during collaborative tasks, existing systems often lack the ability to accurately capture real-world interactions between students and physical objects. This issue could be addressed with automatic 6D pose estimation, i.e., estimation of an object's position and orientation in 3D space from RGB images or videos. For collaborative groups that interact with physical objects, 6D pose estimates allow AI systems to relate objects and entities. As part of this work, we introduce FiboSB, a novel and challenging 6D pose video dataset featuring groups of three participants solving an interactive task featuring small hand-held cubes and a weight scale. This setup poses unique challenges for 6D pose because groups are holistically recorded from a distance in order to capture all participants -- this, coupled with the small size of the cubes, makes 6D pose estimation inherently non-trivial. We evaluated four state-of-the-art 6D pose estimation methods on FiboSB, exposing the limitations of current algorithms on collaborative group work. An error analysis of these methods reveals that the 6D pose methods' object detection modules fail. We address this by fine-tuning YOLO11-x for FiboSB, achieving an overall mAP_50 of 0.898. The dataset, benchmark results, and analysis of YOLO11-x errors presented here lay the groundwork for leveraging the estimation of 6D poses in difficult collaborative contexts.

[58] VOCAL: Visual Odometry via ContrAstive Learning

Chi-Yao Huang,Zeel Bhatt,Yezhou Yang

Main category: cs.CV

TL;DR: VOCAL introduces a novel label-ranking approach for visual odometry, combining Bayesian inference and representation learning to improve feature interpretability and compatibility with multimodal data.

Details Motivation: Many learning-based visual odometry (VO) techniques rely on rigid geometric assumptions that limit interpretability and theoretical grounding in data-driven frameworks. This work aims to overcome these constraints. Method: VOCAL integrates Bayesian inference with a representation learning framework, organizing visual features to mirror camera states and compelling similar states to converge into coherent representations through a ranking mechanism. Result: Extensive evaluations on the KITTI dataset demonstrate improved interpretability and flexibility of VOCAL compared to traditional VO methods. Conclusion: VOCAL successfully addresses the limitations of existing VO techniques by enhancing interpretability and flexibility, pushing VO towards more general and explainable spatial intelligence. Abstract: Breakthroughs in visual odometry (VO) have fundamentally reshaped the landscape of robotics, enabling ultra-precise camera state estimation that is crucial for modern autonomous systems. Despite these advances, many learning-based VO techniques rely on rigid geometric assumptions, which often fall short in interpretability and lack a solid theoretical basis within fully data-driven frameworks. To overcome these limitations, we introduce VOCAL (Visual Odometry via ContrAstive Learning), a novel framework that reimagines VO as a label ranking challenge. By integrating Bayesian inference with a representation learning framework, VOCAL organizes visual features to mirror camera states. The ranking mechanism compels similar camera states to converge into consistent and spatially coherent representations within the latent space. This strategic alignment not only bolsters the interpretability of the learned features but also ensures compatibility with multimodal data sources. Extensive evaluations on the KITTI dataset highlight VOCAL's enhanced interpretability and flexibility, pushing VO toward more general and explainable spatial intelligence.

[59] Developing Lightweight DNN Models With Limited Data For Real-Time Sign Language Recognition

Nikita Nikitin,Eugene Fomin

Main category: cs.CV

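TL;DR: Proposes a real-time sign language recognition framework based on lightweight DNNs that classifies 343 signs on edge devices with low latency and high accuracy.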

Details Motivation: To address data scarcity, high computational cost, and frame-rate discrepancies between training and inference environments in sign language recognition. Method: Sign language-specific parameters (such as handshape, palm orientation, movement, and location) are encoded into vectorized inputs, and MediaPipe is used for landmark extraction, yielding highly separable data representations; the DNN architecture is optimized for sub-10MB deployment. Result: The model achieved 92% accuracy in isolated sign recognition and was integrated into the 'slait ai' web application, where it shows stable inference. Conclusion: The proposed framework enables efficient sign recognition under resource constraints and has practical value. Abstract: We present a novel framework for real-time sign language recognition using lightweight DNNs trained on limited data. Our system addresses key challenges in sign language recognition, including data scarcity, high computational costs, and discrepancies in frame rates between training and inference environments. By encoding sign language specific parameters, such as handshape, palm orientation, movement, and location into vectorized inputs, and leveraging MediaPipe for landmark extraction, we achieve highly separable input data representations. Our DNN architecture, optimized for sub 10MB deployment, enables accurate classification of 343 signs with less than 10ms latency on edge devices. The data annotation platform 'slait data' facilitates structured labeling and vector extraction. Our model achieved 92% accuracy in isolated sign recognition and has been integrated into the 'slait ai' web application, where it demonstrates stable inference.

[60] GazeTarget360: Towards Gaze Target Estimation in 360-Degree for Robot Perception

Zhuangzhuang Dai,Vincent Gbouna Zakka,Luis J. Manso,Chen Li

Main category: cs.CV

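TL;DR: This paper proposes GazeTarget360, a new system for 360-degree gaze target estimation from images of real-world scenes, addressing the limitations of existing methods in complex environments.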

Details Motivation: To enable downstream tasks such as attention estimation and movement anticipation, robots need to understand human gaze targets. Existing methods handle background information poorly and fail when subjects look away from the camera. Method: GazeTarget360 combines an eye-contact detector, a pre-trained vision encoder, and a multi-scale-fusion decoder to address 360-degree gaze target estimation. Result: Cross-validation results show that GazeTarget360 produces accurate and reliable gaze target predictions in unseen scenarios. Conclusion: GazeTarget360 is the first system able to predict gaze targets from images of realistic scenes, and it is efficient and deployable. Abstract: Enabling robots to understand human gaze target is a crucial step to allow capabilities in downstream tasks, for example, attention estimation and movement anticipation in real-world human-robot interactions. Prior works have addressed the in-frame target localization problem with data-driven approaches by carefully removing out-of-frame samples. Vision-based gaze estimation methods, such as OpenFace, do not effectively absorb background information in images and cannot predict gaze target in situations where subjects look away from the camera. In this work, we propose a system to address the problem of 360-degree gaze target estimation from an image in generalized visual scenes. The system, named GazeTarget360, integrates conditional inference engines of an eye-contact detector, a pre-trained vision encoder, and a multi-scale-fusion decoder. Cross validation results show that GazeTarget360 can produce accurate and reliable gaze target predictions in unseen scenarios. This makes a first-of-its-kind system to predict gaze targets from realistic camera footage which is highly efficient and deployable. Our source code is made publicly available at: https://github.com/zdai257/DisengageNet.

[61] VirtualFencer: Generating Fencing Bouts based on Strategies Extracted from In-the-Wild Videos

Zhiyin Lin,Purvi Goel,Joy Yun,C. Karen Liu,Joao Pedro Araujo

Main category: cs.CV

TL;DR: This study develops VirtualFencer, a system that uses data-driven modeling to extract and generate realistic fencing motions and strategies.

Details Motivation: Fencers' motions are diverse yet strategically logical and are shaped by the opponent's behavior, which motivates data-driven modeling to better understand and simulate fencing. Method: Data-driven modeling. Result: Demonstrates several capabilities of VirtualFencer, including self-play, fencing against a real fencer's motion from online video, and fencing interactively against a professional fencer. Conclusion: VirtualFencer is a system that extracts 3D fencing motion and strategy from in-the-wild video without supervision and uses that extracted knowledge to generate realistic fencing behavior. Abstract: Fencing is a sport where athletes engage in diverse yet strategically logical motions. While most motions fall into a few high-level actions (e.g. step, lunge, parry), the execution can vary widely: fast vs. slow, large vs. small, offensive vs. defensive. Moreover, a fencer's actions are informed by a strategy that often comes in response to the opponent's behavior. This combination of motion diversity with underlying two-player strategy motivates the application of data-driven modeling to fencing. We present VirtualFencer, a system capable of extracting 3D fencing motion and strategy from in-the-wild video without supervision, and then using that extracted knowledge to generate realistic fencing behavior. We demonstrate the versatile capabilities of our system by having it (i) fence against itself (self-play), (ii) fence against a real fencer's motion from online video, and (iii) fence interactively against a professional fencer.

[62] Room Scene Discovery and Grouping in Unstructured Vacation Rental Image Collections

Vignesh Ram Nithin Kappagantula,Shayan Hassantabar

Main category: cs.CV

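TL;DR: This paper proposes an efficient machine learning pipeline for discovering and grouping room scenes on vacation rental platforms and for identifying the bed types in each bedroom.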

Details Motivation: With the rapid growth of vacation rental platforms, uploaded property images lack structured categorization, making it hard for travelers to understand a property's spatial layout, especially when several rooms of the same type are present. An effective way to organize these images is therefore needed. Method: The authors propose a computationally efficient machine learning pipeline that combines a supervised room-type detection model, a supervised overlap detection model, and a clustering algorithm. A multimodal large language model (MLLM) additionally maps each bedroom image group to the corresponding bed-type information. Result: Evaluation shows that the individual models and the full pipeline perform well, significantly outperforming established approaches such as contrastive learning and clustering with pretrained embeddings. Conclusion: The pipeline performs well in real-time and data-scarce settings and effectively addresses image organization on vacation rental platforms. Abstract: The rapid growth of vacation rental (VR) platforms has led to an increasing volume of property images, often uploaded without structured categorization. This lack of organization poses significant challenges for travelers attempting to understand the spatial layout of a property, particularly when multiple rooms of the same type are present. To address this issue, we introduce an effective approach for solving the room scene discovery and grouping problem, as well as identifying bed types within each bedroom group. This grouping is valuable for travelers to comprehend the spatial organization, layout, and the sleeping configuration of the property. We propose a computationally efficient machine learning pipeline characterized by low latency and the ability to perform effectively with sample-efficient learning, making it well-suited for real-time and data-scarce environments. The pipeline integrates a supervised room-type detection model, a supervised overlap detection model to identify the overlap similarity between two images, and a clustering algorithm to group the images of the same space together using the similarity scores. Additionally, the pipeline maps each bedroom group to the corresponding bed types specified in the property's metadata, based on the visual content present in the group's images using a Multi-modal Large Language Model (MLLM). We evaluate the aforementioned models individually and also assess the pipeline in its entirety, observing strong performance that significantly outperforms established approaches such as contrastive learning and clustering with pretrained embeddings.
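
The grouping step, in which pairwise overlap scores are turned into groups of images showing the same space, can be sketched as follows. This is a minimal illustration only: the abstract does not name the clustering algorithm, so connected components over a score threshold (via union-find) is an assumption, and `group_images`, `overlap_scores`, and `threshold` are hypothetical names.

```python
from typing import Dict, List, Tuple


def group_images(
    num_images: int,
    overlap_scores: Dict[Tuple[int, int], float],
    threshold: float = 0.5,
) -> List[List[int]]:
    """Group images whose pairwise overlap score reaches `threshold`.

    Connected components (union-find) stand in for the unspecified
    clustering algorithm; this choice is an assumption.
    """
    parent = list(range(num_images))

    def find(i: int) -> int:
        # Follow parent links to the root, shortening the path as we go.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for (a, b), score in overlap_scores.items():
        if score >= threshold:
            parent[find(a)] = find(b)  # merge the two groups

    groups: Dict[int, List[int]] = {}
    for i in range(num_images):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())
```

Images 0, 1, and 2 end up in one group if the pairs (0, 1) and (1, 2) both score above the threshold, even when (0, 2) was never compared. That transitivity suits photos of one room taken from different corners.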

[63] Self-Supervised Multiview Xray Matching

Mohamad Dabboussi,Malo Huard,Yann Gousseau,Pietro Gori

Main category: cs.CV

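TL;DR: This work proposes a new self-supervised method that automatically generates correspondence matrices between X-ray views to improve multi-view fracture detection.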

Details Motivation: Current methods struggle to establish robust correspondences between different X-ray views, which are essential for precise clinical evaluation. The paper therefore proposes a new method that needs no manual annotation. Method: A transformer-based self-supervised approach uses digitally reconstructed radiographs (DRR) to automatically create many-to-many correspondence matrices between synthetic X-ray views. Result: Experiments on synthetic and real X-ray datasets show that incorporating correspondences improves multi-view fracture classification. Conclusion: Learning correspondences between synthetic X-ray views serves as a pretraining strategy that improves multi-view fracture detection on real data. Abstract: Accurate interpretation of multi-view radiographs is crucial for diagnosing fractures, muscular injuries, and other anomalies. While significant advances have been made in AI-based analysis of single images, current methods often struggle to establish robust correspondences between different X-ray views, an essential capability for precise clinical evaluations. In this work, we present a novel self-supervised pipeline that eliminates the need for manual annotation by automatically generating a many-to-many correspondence matrix between synthetic X-ray views. This is achieved using digitally reconstructed radiographs (DRR), which are automatically derived from unannotated CT volumes. Our approach incorporates a transformer-based training phase to accurately predict correspondences across two or more X-ray views. Furthermore, we demonstrate that learning correspondences among synthetic X-ray views can be leveraged as a pretraining strategy to enhance automatic multi-view fracture detection on real data. Extensive evaluations on both synthetic and real X-ray datasets show that incorporating correspondences improves performance in multi-view fracture classification.

[64] Reducing Variability of Multiple Instance Learning Methods for Digital Pathology

Ali Mammadov,Loïc Le Folgoc,Guillaume Hocquet,Pietro Gori

Main category: cs.CV

TL;DR: This paper proposes a Multi-Fidelity, Model Fusion strategy to reduce performance variability in Multiple Instance Learning (MIL) methods for Whole Slide Image (WSI) classification, enhancing reproducibility without sacrificing efficiency.

Details Motivation: MIL methods exhibit high variability in performance across different runs (up to 10-15 AUC points), primarily due to weight initialization, batch ordering, and learning rate differences. This variability hampers reliable comparison between methods and affects reproducibility. Method: The authors introduce a Multi-Fidelity, Model Fusion strategy that trains multiple models for a few epochs and averages the most stable and promising ones based on validation scores. The method is evaluated across two datasets, three initialization strategies, and five MIL methods through more than 2000 experiments. Result: The approach significantly reduces performance variability in MIL methods, simplifies hyperparameter tuning, and maintains computational efficiency. It is validated across diverse settings involving multiple datasets, initialization strategies, and MIL methods. Conclusion: The proposed Multi-Fidelity, Model Fusion strategy effectively reduces performance variability in MIL methods for WSI classification and enhances reproducibility while maintaining computational efficiency. Abstract: Digital pathology has revolutionized the field by enabling the digitization of tissue samples into whole slide images (WSIs). However, the high resolution and large size of WSIs present significant challenges when it comes to applying Deep Learning models. As a solution, WSIs are often divided into smaller patches with a global label (i.e., diagnostic) per slide, instead of a (too) costly pixel-wise annotation. By treating each slide as a bag of patches, Multiple Instance Learning (MIL) methods have emerged as a suitable solution for WSI classification. A major drawback of MIL methods is their high variability in performance across different runs, which can reach up to 10-15 AUC points on the test set, making it difficult to compare different MIL methods reliably. This variability mainly comes from three factors: i) weight initialization, ii) batch (shuffling) ordering, and iii) learning rate. To address this, we introduce a Multi-Fidelity, Model Fusion strategy for MIL methods. We first train multiple models for a few epochs and average the most stable and promising ones based on validation scores. This approach can be applied to any existing MIL model to reduce performance variability. It also simplifies hyperparameter tuning and improves reproducibility while maintaining computational efficiency. We extensively validate our approach on WSI classification tasks using 2 different datasets, 3 initialization strategies and 5 MIL methods, for a total of more than 2000 experiments.
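
The fusion step (train several models briefly, then average the most stable and promising ones) can be sketched as below. This is a hedged illustration, not the authors' implementation: models are represented as flat weight lists, and ranking by mean minus standard deviation of per-epoch validation scores is an assumed proxy for "stable and promising". The names `fuse_models`, `val_scores`, and `keep` are hypothetical.

```python
from statistics import mean, pstdev
from typing import List, Sequence


def fuse_models(
    weights: Sequence[Sequence[float]],
    val_scores: Sequence[Sequence[float]],
    keep: int = 2,
) -> List[float]:
    """Average the weights of the `keep` best short-trained models.

    Ranking by (mean - std) of per-epoch validation scores is an assumed
    criterion; the paper's actual selection rule may differ.
    """
    ranking = sorted(
        range(len(weights)),
        key=lambda i: mean(val_scores[i]) - pstdev(val_scores[i]),
        reverse=True,
    )
    chosen = ranking[:keep]
    dim = len(weights[0])
    # Element-wise average of the selected models' parameters.
    return [sum(weights[i][d] for i in chosen) / len(chosen) for d in range(dim)]
```

Penalising the standard deviation favours runs whose validation curve is steady over runs that peak once and then collapse, which is one plausible reading of "stable".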

[65] Beyond Low-Rank Tuning: Model Prior-Guided Rank Allocation for Effective Transfer in Low-Data and Large-Gap Regimes

Chuyan Zhang,Kefan Wang,Yun Gu

Main category: cs.CV

TL;DR: This paper introduces SR-LoRA, a new Low-Rank Adaptation method that uses stable rank as a prior for efficient and effective rank allocation across layers, especially useful for domains with large gaps.

Details Motivation: Existing LoRA methods have limited adaptability in cases with large domain gaps, requiring higher ranks and often relying on computationally expensive techniques like iterative pruning or rank searches. Method: The paper proposes SR-LoRA, which uses the stable rank of pre-trained weight matrices as a prior for layer-wise rank allocation, aiming to improve adaptability while maintaining efficiency. Result: Empirical evaluations show that SR-LoRA consistently outperforms recent adaptive LoRA variants on few-shot tasks with significant domain gaps, achieving better performance-efficiency trade-offs. Conclusion: SR-LoRA provides a more efficient and principled way to allocate ranks across layers, outperforming existing adaptive LoRA methods in handling domain gaps without added computational cost. Abstract: Low-Rank Adaptation (LoRA) has proven effective in reducing computational costs while maintaining performance comparable to fully fine-tuned foundation models across various tasks. However, its fixed low-rank structure restricts its adaptability in scenarios with substantial domain gaps, where higher ranks are often required to capture domain-specific complexities. Current adaptive LoRA methods attempt to overcome this limitation by dynamically expanding or selectively allocating ranks, but these approaches frequently depend on computationally intensive techniques such as iterative pruning, rank searches, or additional regularization. To address these challenges, we introduce Stable Rank-Guided Low-Rank Adaptation (SR-LoRA), a novel framework that utilizes the stable rank of pre-trained weight matrices as a natural prior for layer-wise rank allocation. By leveraging the stable rank, which reflects the intrinsic dimensionality of the weights, SR-LoRA enables a principled and efficient redistribution of ranks across layers, enhancing adaptability without incurring additional search costs. 
Empirical evaluations on few-shot tasks with significant domain gaps show that SR-LoRA consistently outperforms recent adaptive LoRA variants, achieving a superior trade-off between performance and efficiency. Our code is available at https://github.com/EndoluminalSurgicalVision-IMR/SR-LoRA.
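
The stable rank of a matrix W is the ratio of its squared Frobenius norm to its squared spectral norm. A minimal pure-Python sketch follows, using power iteration for the spectral norm. The proportional `allocate_ranks` rule is an illustrative assumption about how such a prior could set per-layer ranks; it is not taken from the SR-LoRA code, and both function names are hypothetical.

```python
from typing import List

Matrix = List[List[float]]


def stable_rank(w: Matrix, iters: int = 200) -> float:
    """Return ||W||_F^2 / ||W||_2^2, with ||W||_2^2 from power iteration."""
    rows, cols = len(w), len(w[0])
    frob_sq = sum(x * x for row in w for x in row)
    if frob_sq == 0.0:
        return 0.0
    # Arbitrary non-zero start vector, normalised to unit length.
    v = [1.0 / (j + 1) for j in range(cols)]
    scale = sum(x * x for x in v) ** 0.5
    v = [x / scale for x in v]
    top_eig = 0.0
    for _ in range(iters):
        wv = [sum(w[i][j] * v[j] for j in range(cols)) for i in range(rows)]
        wtwv = [sum(w[i][j] * wv[i] for i in range(rows)) for j in range(cols)]
        norm = sum(x * x for x in wtwv) ** 0.5
        if norm == 0.0:
            break
        top_eig = norm  # ||W^T W v|| approaches the largest eigenvalue of W^T W
        v = [x / norm for x in wtwv]
    return frob_sq / top_eig if top_eig > 0.0 else 0.0


def allocate_ranks(layers: List[Matrix], budget: int) -> List[int]:
    """Split a rank budget across layers in proportion to stable rank.

    Proportional allocation is an assumption for illustration; rounding
    means the result may not sum exactly to `budget`.
    """
    srs = [stable_rank(w) for w in layers]
    total = sum(srs)
    return [max(1, round(budget * s / total)) for s in srs]
```

Because the stable rank needs only norms of the frozen pre-trained weights, it can be computed once before fine-tuning, which is consistent with the abstract's claim of no additional search cost.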

[66] MammoTracker: Mask-Guided Lesion Tracking in Temporal Mammograms

Xuan Liu,Yinhao Ren,Marc D. Ryser,Lars J. Grimm,Joseph Y. Lo

Main category: cs.CV

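TL;DR: This work proposes MammoTracker, an automated framework for lesion tracking in temporal mammograms, and releases a new dataset of over 20000 lesion pairs, significantly improving lesion correspondence and progression analysis.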

Details Motivation: Accurately tracking lesions in temporal mammograms is essential for monitoring breast cancer progression and facilitating early diagnosis, but automated lesion correspondence remains a challenge for current computer-aided diagnosis (CAD) systems and limits their effectiveness. Method: The paper proposes a mask-guided lesion tracking framework that follows a coarse-to-fine strategy with three key modules (global search, local search, and score refinement) and introduces a new large-scale dataset for training and evaluation. Result: MammoTracker achieves 0.455 average overlap and 0.509 accuracy, surpassing baseline models by 8%. Conclusion: MammoTracker is a promising lesion tracking framework that can enhance CAD-based lesion progression analysis. Abstract: Accurate lesion tracking in temporal mammograms is essential for monitoring breast cancer progression and facilitating early diagnosis. However, automated lesion correspondence across exams remains a challenge in computer-aided diagnosis (CAD) systems, limiting their effectiveness. We propose MammoTracker, a mask-guided lesion tracking framework that automates lesion localization across consecutive exams. Our approach follows a coarse-to-fine strategy incorporating three key modules: global search, local search, and score refinement. To support large-scale training and evaluation, we introduce a new dataset with curated prior-exam annotations for 730 mass and calcification cases from the public EMBED mammogram dataset, yielding over 20000 lesion pairs, making it the largest known resource for temporal lesion tracking in mammograms. Experimental results demonstrate that MammoTracker achieves 0.455 average overlap and 0.509 accuracy, surpassing baseline models by 8%, highlighting its potential to enhance CAD-based lesion progression analysis. Our dataset will be available at https://gitlab.oit.duke.edu/railabs/LoGroup/mammotracker.

[67] Populate-A-Scene: Affordance-Aware Human Video Generation

Mengyi Shan,Zecheng He,Haoyu Ma,Felix Juefei-Xu,Peizhao Zhang,Tingbo Hou,Ching-Yao Chuang

Main category: cs.CV

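TL;DR: This work explores using text-to-video models for interactive world simulation: the model is fine-tuned to insert people into scenes and predict their behavior, revealing the affordance perception of pre-trained models without labeled data.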

Details Motivation: To explore the potential of text-to-video models for predicting human-environment interaction and to use video generation models as interactive world simulators. Method: Given a scene image and a prompt describing human actions, a text-to-video model is fine-tuned to insert a person into the scene; an in-depth study of cross-attention heatmaps then reveals the inherent affordance perception of the pre-trained video model. Result: The model infers where to insert a person and how they should behave from a single scene image, without explicit conditions such as bounding boxes or body poses. Conclusion: Video generation models can be repurposed as interactive world simulators by fine-tuning them to insert people while keeping behavior, appearance, and scene affordance coherent. Abstract: Can a video generation model be repurposed as an interactive world simulator? We explore the affordance perception potential of text-to-video models by teaching them to predict human-environment interaction. Given a scene image and a prompt describing human actions, we fine-tune the model to insert a person into the scene, while ensuring coherent behavior, appearance, harmonization, and scene affordance. Unlike prior work, we infer human affordance for video generation (i.e., where to insert a person and how they should behave) from a single scene image, without explicit conditions like bounding boxes or body poses. An in-depth study of cross-attention heatmaps demonstrates that we can uncover the inherent affordance perception of a pre-trained video model without labeled affordance datasets.

[68] Training for X-Ray Vision: Amodal Segmentation, Amodal Content Completion, and View-Invariant Object Representation from Multi-Camera Video

Alexander Moore,Amar Saini,Kylie Cancilla,Doug Poland,Carmen Carrano

Main category: cs.CV

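TL;DR: MOVi-MC-AC is a large-scale dataset supporting multi-view and amodal content prediction tasks, suited to object detection, tracking, and segmentation research.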

Details Motivation: No existing large-scale dataset supports both multi-camera views and amodal content completion. Method: Multi-camera video scenes are simulated, with consistent object IDs provided for detections and segmentations across frames and cameras. Result: MOVi-MC-AC is the largest amodal segmentation dataset and the first amodal content dataset to date, with labels for about 5.8 million object instances. Conclusion: MOVi-MC-AC provides a large-scale dataset for multi-view and amodal content prediction, an important resource for computer vision. Abstract: Amodal segmentation and amodal content completion require using object priors to estimate occluded masks and features of objects in complex scenes. Until now, no data has provided an additional dimension for object context: the possibility of multiple cameras sharing a view of a scene. We introduce MOVi-MC-AC: Multiple Object Video with Multi-Cameras and Amodal Content, the largest amodal segmentation and first amodal content dataset to date. Cluttered scenes of generic household objects are simulated in multi-camera video. MOVi-MC-AC contributes to the growing literature of object detection, tracking, and segmentation by including two new contributions to the deep learning for computer vision world. Multiple Camera (MC) settings where objects can be identified and tracked between various unique camera perspectives are rare in both synthetic and real-world video. We introduce a new complexity to synthetic video by providing consistent object ids for detections and segmentations between both frames and multiple cameras each with unique features and motion patterns on a single scene. Amodal Content (AC) is a reconstructive task in which models predict the appearance of target objects through occlusions. In the amodal segmentation literature, some datasets have been released with amodal detection, tracking, and segmentation labels. While other methods rely on slow cut-and-paste schemes to generate amodal content pseudo-labels, they do not account for natural occlusions present in the modal masks. MOVi-MC-AC provides labels for ~5.8 million object instances, setting a new maximum in the amodal dataset literature, along with being the first to provide ground-truth amodal content. The full dataset is available at https://huggingface.co/datasets/Amar-S/MOVi-MC-AC.

[69] CGEarthEye:A High-Resolution Remote Sensing Vision Foundation Model Based on the Jilin-1 Satellite Constellation

Zhiwei Yi,Xin Cheng,Jingyu Ma,Ruifei Zhu,Junwei Tian,Yuanxiu Zhou,Xinge Zhao,Hongzhe Li

Main category: cs.CV

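TL;DR: This work designs CGEarthEye, a remote sensing vision foundation model framework tailored to the Jilin-1 satellites, and builds JLSSD, a large-scale self-supervised learning dataset; pre-training with several contrastive strategies yields state-of-the-art performance on remote sensing tasks.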

Details Motivation: Limited acquisition channels for ultra-high-resolution optical remote sensing imagery have constrained high-resolution remote sensing vision foundation models, while the Jilin-1 constellation offers abundant sub-meter-level imagery that can address this. Method: The study proposes the CGEarthEye framework, comprising five backbones of different parameter scales totaling 2.1 billion parameters, and builds the JLSSD dataset through multi-level representation clustering and sampling strategies; pre-training combines seasonal contrast, augmentation-based contrast, and masked patch token contrastive strategies. Result: Across 10 benchmark datasets covering four typical remote sensing tasks, CGEarthEye consistently achieves state-of-the-art (SOTA) performance. Conclusion: CGEarthEye shows superior characteristics in feature visualization, model convergence, parameter efficiency, and practical mapping applications, and is expected to enable broader and more efficient use of Jilin-1 data in traditional Earth observation applications. Abstract: Deep learning methods have significantly advanced the development of intelligent interpretation in remote sensing (RS), with foundational model research based on large-scale pre-training paradigms rapidly reshaping various domains of Earth Observation (EO). However, compared to the open accessibility and high spatiotemporal coverage of medium-resolution data, the limited acquisition channels for ultra-high-resolution optical RS imagery have constrained the progress of high-resolution remote sensing vision foundation models (RSVFM). As the world's largest sub-meter-level commercial RS satellite constellation, the Jilin-1 constellation possesses abundant sub-meter-level image resources. This study proposes CGEarthEye, an RSVFM framework specifically designed for Jilin-1 satellite characteristics, comprising five backbones with different parameter scales, totaling 2.1 billion parameters. To enhance the representational capacity of the foundation model, we developed JLSSD, the first 15-million-scale multi-temporal self-supervised learning (SSL) dataset featuring global coverage with quarterly temporal sampling within a single year, constructed through multi-level representation clustering and sampling strategies. The framework integrates seasonal contrast, augmentation-based contrast, and masked patch token contrastive strategies for pre-training. Comprehensive evaluations across 10 benchmark datasets covering four typical RS tasks demonstrate that CGEarthEye consistently achieves state-of-the-art (SOTA) performance. Further analysis reveals CGEarthEye's superior characteristics in feature visualization, model convergence, parameter efficiency, and practical mapping applications. This study anticipates that the exceptional representation capabilities of CGEarthEye will facilitate broader and more efficient applications of Jilin-1 data in traditional EO applications.

[70] GDGS: 3D Gaussian Splatting Via Geometry-Guided Initialization And Dynamic Density Control

Xingjun Wang,Lianlei Shan

Main category: cs.CV

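TL;DR: This paper proposes an improved 3D Gaussian Splatting method that introduces geometry-guided initialization, surface-aligned optimization, and dynamic adaptive density control to improve the quality and efficiency of real-time rendering.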

Details Motivation: To address the initialization, optimization, and density control problems of 3D Gaussian Splatting and improve its performance and applicability in real-time rendering. Method: The method makes three key contributions: 1) a geometry-guided initialization that predicts Gaussian parameters; 2) a surface-aligned optimization strategy; 3) a dynamic adaptive density control mechanism. Result: The method achieves high-fidelity real-time rendering in complex scenes, with significant improvements in visual quality while maintaining real-time performance. Conclusion: By combining geometry-guided initialization, surface-aligned optimization, and dynamic adaptive density control, the improved 3D Gaussian Splatting method delivers high-quality real-time rendering in complex scenes, with results comparable or superior to state-of-the-art methods. Abstract: We propose a method to enhance 3D Gaussian Splatting (3DGS) (Kerbl et al., 2023), addressing challenges in initialization, optimization, and density control. Gaussian Splatting is an alternative for rendering realistic images while supporting real-time performance, and it has gained popularity due to its explicit 3D Gaussian representation. However, 3DGS heavily depends on accurate initialization and faces difficulties in optimizing unstructured Gaussian distributions into ordered surfaces, with few adaptive density control mechanisms proposed so far. Our first key contribution is a geometry-guided initialization to predict Gaussian parameters, ensuring precise placement and faster convergence. We then introduce a surface-aligned optimization strategy to refine Gaussian placement, improving geometric accuracy and aligning with the surface normals of the scene. Finally, we present a dynamic adaptive density control mechanism that adjusts Gaussian density based on regional complexity, for visual fidelity. These innovations enable our method to achieve high-fidelity real-time rendering and significant improvements in visual quality, even in complex scenes. Our method demonstrates comparable or superior results to state-of-the-art methods, rendering high-fidelity images in real time.

[71] An Improved U-Net Model for Offline handwriting signature denoising

Wanghui Xiao

Main category: cs.CV

TL;DR: This paper introduces a new signature handwriting denoising model based on an improved U-net structure, which effectively suppresses noise and enhances the clarity of signature images, offering better support for forensic signature analysis.

Details Motivation: The motivation of this study was to address the challenges in forensic science appraisals caused by interfering information mixed with handwriting samples, which affects the accuracy of signature identification. Method: The researchers used an improved U-net structure enhanced with discrete wavelet transform and PCA transform to develop a signature handwriting denoising model. Result: The proposed model demonstrated a significant improvement in denoising effect, enhancing the clarity and readability of signed images and providing more reliable technical support for signature analysis and recognition. Conclusion: The study successfully developed a signature handwriting denoising model using an improved U-net structure combined with discrete wavelet transform and PCA transform, offering better noise suppression than traditional methods. Abstract: Handwriting signatures, as an important means of identity recognition, are widely used in multiple fields such as financial transactions, commercial contracts and personal affairs due to their legal effect and uniqueness. In forensic science appraisals, the analysis of offline handwriting signatures requires the appraiser to provide a certain number of signature samples, which are usually derived from various historical contracts or archival materials. However, the provided handwriting samples are often mixed with a large amount of interfering information, which brings severe challenges to handwriting identification work. This study proposes a signature handwriting denoising model based on the improved U-net structure, aiming to enhance the robustness of the signature recognition system. By introducing discrete wavelet transform and PCA transform, the model's ability to suppress noise has been enhanced. 
The experimental results show that this model is significantly superior to traditional methods in denoising effect, can effectively improve the clarity and readability of the signature images, and provides more reliable technical support for signature analysis and recognition.

[72] Out-of-Distribution Detection with Adaptive Top-K Logits Integration

Hikaru Shijo,Yutaka Yoshihama,Kenichi Yadani,Norifumi Murata

Main category: cs.CV

TL;DR: This paper proposes ATLI, an improved method for OOD detection that adaptively integrates multiple top logits, achieving better performance than existing approaches like MaxLogit.

Details Motivation: Neural networks often produce overconfident predictions on out-of-distribution (OOD) samples, making OOD detection essential for enhancing machine learning safety. While MaxLogit is a powerful method for OOD detection, the authors found that incorporating other logits can further improve performance. Method: ATLI (Adaptive Top-k Logits Integration) adaptively identifies and combines the maximum logit with additional top-k logits for OOD detection. Result: Experiments using the ImageNet-1K benchmark showed that ATLI reduced the false positive rate (FPR95) by 6.73% compared to MaxLogit and by an additional 2.67% compared to other state-of-the-art methods. Conclusion: The proposed ATLI method improves OOD detection performance by adaptively combining the maximum logit with other effective top-k logits, outperforming both MaxLogit and state-of-the-art methods. Abstract: Neural networks often make overconfident predictions on out-of-distribution (OOD) samples. Detection of OOD data is therefore crucial to improve the safety of machine learning. The simplest and most powerful method for OOD detection is MaxLogit, which uses the model's maximum logit to provide an OOD score. We have discovered that, in addition to the maximum logit, some other logits are also useful for OOD detection. Based on this finding, we propose a new method called ATLI (Adaptive Top-k Logits Integration), which adaptively determines effective top-k logits that are specific to each model and combines the maximum logit with the other top-k logits. In this study we evaluate our proposed method using the ImageNet-1K benchmark. Extensive experiments showed that our proposed method reduces the false positive rate (FPR95) by 6.73% compared to the MaxLogit approach and by an additional 2.67% compared to other state-of-the-art methods.
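
To make the difference between MaxLogit and a top-k integration concrete, here is a minimal sketch. The abstract does not specify how ATLI selects or combines the extra logits, so the plain sum below and the caller-supplied `ranks` argument are assumptions for illustration; `top_k_integrated_score` is a hypothetical name, not the authors' function.

```python
from typing import Sequence


def max_logit_score(logits: Sequence[float]) -> float:
    """MaxLogit: the largest logit is the score (higher = more in-distribution)."""
    return max(logits)


def top_k_integrated_score(logits: Sequence[float], ranks: Sequence[int]) -> float:
    """Add logits at the given 1-based ranks to the maximum logit.

    `ranks` (rank 1 is the maximum itself) is supplied by the caller here;
    in ATLI the useful ranks are determined adaptively per model. The
    unweighted sum is an illustrative assumption.
    """
    ordered = sorted(logits, reverse=True)
    return ordered[0] + sum(ordered[r - 1] for r in ranks)
```

A sample would then be flagged as OOD when its score falls below a threshold chosen on in-distribution validation data, for instance the threshold that keeps 95% of in-distribution samples, which is the operating point behind FPR95.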

[73] PlantSegNeRF: A few-shot, cross-dataset method for plant 3D instance point cloud reconstruction via joint-channel NeRF with multi-view image instance matching

Xin Yang,Ruiming Du,Hanyang Huang,Jiayang Xie,Pengyao Xie,Leisen Fang,Ziyue Guo,Nanjun Jiang,Yu Jiang,Haiyan Cen

Main category: cs.CV

TL;DR: The study proposes PlantSegNeRF, a novel approach for generating high-precision instance point clouds from multi-view RGB image sequences across various plant species. It addresses limitations in resolution, accuracy, and generalizability in existing organ segmentation techniques.

Details Motivation: Existing techniques for organ segmentation of plant point clouds face limitations in resolution, segmentation accuracy, and generalizability across various plant species. The study aims to overcome these challenges by proposing PlantSegNeRF for high-precision instance point cloud generation across a wide range of plant species. Method: PlantSegNeRF performs 2D instance segmentation on multi-view images to generate instance masks for each organ with corresponding IDs. These IDs are then matched and refined using an instance matching module. An instance NeRF is developed to render an implicit scene containing color, density, semantic, and instance information, which is finally converted into high-precision plant instance point clouds based on volume density. Result: PlantSegNeRF outperformed commonly used methods in semantic segmentation of point clouds, showing average improvements of 16.1%, 18.3%, 17.8%, and 24.2% in precision, recall, F1-score, and IoU, respectively, compared to second-best results on structurally complex datasets. For instance segmentation tasks, it achieved average improvements of 11.7%, 38.2%, 32.2%, and 25.3% in mPrec, mRec, mCov, and mWCov, respectively. Conclusion: PlantSegNeRF extends organ-level plant phenotyping and provides a high-throughput method for generating high-quality 3D data, which can support the development of large-scale models in plant science. Abstract: Organ segmentation of plant point clouds is a prerequisite for the high-resolution and accurate extraction of organ-level phenotypic traits. Although the fast development of deep learning has boosted much research on segmentation of plant point clouds, the existing techniques for organ segmentation still face limitations in resolution, segmentation accuracy, and generalizability across various plant species. 
In this study, we proposed a novel approach called plant segmentation neural radiance fields (PlantSegNeRF), aiming to directly generate high-precision instance point clouds from multi-view RGB image sequences for a wide range of plant species. PlantSegNeRF performed 2D instance segmentation on the multi-view images to generate instance masks for each organ with a corresponding ID. The multi-view instance IDs corresponding to the same plant organ were then matched and refined using a specially designed instance matching module. The instance NeRF was developed to render an implicit scene, containing color, density, semantic and instance information. The implicit scene was ultimately converted into high-precision plant instance point clouds based on the volume density. The results proved that in semantic segmentation of point clouds, PlantSegNeRF outperformed the commonly used methods, demonstrating an average improvement of 16.1%, 18.3%, 17.8%, and 24.2% in precision, recall, F1-score, and IoU compared to the second-best results on structurally complex datasets. More importantly, PlantSegNeRF exhibited significant advantages in plant point cloud instance segmentation tasks. Across all plant datasets, it achieved average improvements of 11.7%, 38.2%, 32.2% and 25.3% in mPrec, mRec, mCov, mWCov, respectively. This study extends the organ-level plant phenotyping and provides a high-throughput way to supply high-quality 3D data for the development of large-scale models in plant science.

[74] Efficient Depth- and Spatially-Varying Image Simulation for Defocus Deblur

Xinge Yang,Chuong Nguyen,Wenbin Wang,Kaizhang Kang,Wolfgang Heidrich,Xiaoxing Li

Main category: cs.CV

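TL;DR: This paper proposes a synthetic dataset approach that needs no fine-tuning on real data, addressing the blur caused by the shallow depth of field of large-aperture cameras and improving model performance in the real world.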

Details Motivation: Modern large-aperture cameras have a shallow depth of field, which is especially problematic for fixed-focus cameras such as those in smart glasses. In addition, deep learning models trained on existing open-source datasets face domain gaps and perform poorly in practice. Method: The paper proposes an efficient and scalable dataset synthesis approach that simultaneously models depth-dependent defocus and spatially varying optical aberrations. Result: A network trained on low-resolution synthetic images generalizes effectively to high-resolution (12MP) real-world images across diverse scenes. Conclusion: Without fine-tuning on real-world data, the proposed dataset synthesis approach effectively improves how deep learning models generalize to real scenes. Abstract: Modern cameras with large apertures often suffer from a shallow depth of field, resulting in blurry images of objects outside the focal plane. This limitation is particularly problematic for fixed-focus cameras, such as those used in smart glasses, where adding autofocus mechanisms is challenging due to form factor and power constraints. Due to unmatched optical aberrations and defocus properties unique to each camera system, deep learning models trained on existing open-source datasets often face domain gaps and do not perform well in real-world settings. In this paper, we propose an efficient and scalable dataset synthesis approach that does not rely on fine-tuning with real-world data. Our method simultaneously models depth-dependent defocus and spatially varying optical aberrations, addressing both computational complexity and the scarcity of high-quality RGB-D datasets. Experimental results demonstrate that a network trained on our low-resolution synthetic images generalizes effectively to high-resolution (12MP) real-world images across diverse scenes.
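
For intuition about why defocus blur depends on depth, the textbook thin-lens circle-of-confusion formula gives the blur diameter on the sensor for an object away from the focus distance. The sketch below uses that standard model as an assumption; the paper's simulator also handles spatially varying aberrations and may model defocus differently. `circle_of_confusion` and its parameter names are hypothetical.

```python
def circle_of_confusion(
    focal_length: float,
    f_number: float,
    focus_dist: float,
    object_dist: float,
) -> float:
    """Blur-circle diameter under the thin-lens model (same length unit throughout).

    c = A * |S2 - S1| / S2 * f / (S1 - f), with aperture diameter A = f / N,
    focus distance S1, and object distance S2.
    """
    if focus_dist <= focal_length:
        raise ValueError("focus distance must exceed the focal length")
    aperture = focal_length / f_number
    defocus = abs(object_dist - focus_dist) / object_dist
    return aperture * defocus * focal_length / (focus_dist - focal_length)
```

With a 50 mm lens at f/2 focused at 1 m, an object at 2 m produces a blur circle of roughly 0.66 mm, while an object at exactly 1 m produces none. A larger aperture (smaller f-number) scales the blur up linearly, which is the shallow depth-of-field problem the abstract describes.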

[75] Customizable ROI-Based Deep Image Compression

Ian Jin,Fanxin Xia,Feng Ding,Xinfeng Zhang,Meiqin Liu,Yao Zhao,Weisi Lin,Lili Meng

Main category: cs.CV

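TL;DR: This paper proposes a customizable ROI-based deep image compression paradigm that lets users define the ROI through semantic text and flexibly manage the reconstruction quality trade-off between ROI and non-ROI regions.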

Details Motivation: As user needs diversify, traditional image compression with a predefined ROI cannot flexibly accommodate different users' ROI definitions or their preferred quality trade-offs between ROI and non-ROI regions. Method: The paper proposes a customizable ROI-based deep image compression paradigm comprising a Text-controlled Mask Acquisition (TMA) module, a Customizable Value Assign (CVA) mechanism, and a Latent Mask Attention (LMA) module. Result: Users can customize the ROI by entering semantic text and manage the ROI versus non-ROI reconstruction quality trade-off through a variable masking extent, which improves the flexibility and performance of the compression system. Conclusion: Experiments show that the paradigm is flexible and effective for ROI definition and mask acquisition and for managing the reconstruction quality trade-off between ROI and non-ROI. Abstract: Region of Interest (ROI)-based image compression optimizes bit allocation by prioritizing ROI for higher-quality reconstruction. However, as the users (including human clients and downstream machine tasks) become more diverse, ROI-based image compression needs to be customizable to support various preferences. For example, different users may define distinct ROI or require different quality trade-offs between ROI and non-ROI. Existing ROI-based image compression schemes predefine the ROI, making it unchangeable, and lack effective mechanisms to balance reconstruction quality between ROI and non-ROI. This work proposes a paradigm for customizable ROI-based deep image compression. First, we develop a Text-controlled Mask Acquisition (TMA) module, which allows users to easily customize their ROI for compression by just inputting the corresponding semantic text. It makes the encoder controlled by text. Second, we design a Customizable Value Assign (CVA) mechanism, which masks the non-ROI with a changeable extent decided by users instead of a constant one to manage the reconstruction quality trade-off between ROI and non-ROI. Finally, we present a Latent Mask Attention (LMA) module, where the latent spatial prior of the mask and the latent Rate-Distortion Optimization (RDO) prior of the image are extracted and fused in the latent space, and further used to optimize the latent representation of the source image. 
Experimental results demonstrate that our proposed customizable ROI-based deep image compression paradigm effectively addresses the needs of customization for ROI definition and mask acquisition as well as the reconstruction quality trade-off management between the ROI and non-ROI.
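
The idea behind the CVA mechanism, masking the non-ROI with a user-chosen extent rather than a fixed one, can be illustrated with a toy soft mask. This is only a sketch under stated assumptions: the abstract does not give the actual value-assignment rule, so mapping ROI pixels to 1.0 and non-ROI pixels to a user value in [0, 1] is an assumption, and `assign_mask_values` and `apply_mask` are hypothetical names.

```python
from typing import List, Sequence


def assign_mask_values(
    roi_mask: Sequence[Sequence[int]], non_roi_value: float
) -> List[List[float]]:
    """Turn a binary ROI mask into a soft mask.

    ROI pixels keep weight 1.0; non-ROI pixels get the user-chosen weight
    (0.0 suppresses them fully, 1.0 leaves them untouched).
    """
    if not 0.0 <= non_roi_value <= 1.0:
        raise ValueError("non_roi_value must lie in [0, 1]")
    return [[1.0 if m else non_roi_value for m in row] for row in roi_mask]


def apply_mask(
    image: Sequence[Sequence[float]], mask: Sequence[Sequence[float]]
) -> List[List[float]]:
    """Element-wise product of a single-channel image and a soft mask."""
    return [[p * w for p, w in zip(img_row, m_row)] for img_row, m_row in zip(image, mask)]
```

Sweeping `non_roi_value` from 0 to 1 moves from spending almost no fidelity on the background to treating the whole image uniformly, which is the kind of user-controlled trade-off the abstract describes.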

[76] MedDiff-FT: Data-Efficient Diffusion Model Fine-tuning with Structural Guidance for Controllable Medical Image Synthesis

Jianhao Xie,Ziang Zhang,Zhenyu Weng,Yuesheng Zhu,Guibo Luo

Main category: cs.CV

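TL;DR: This paper proposes MedDiff-FT, a controllable medical image generation method that fine-tunes a diffusion foundation model to produce medical images with structural dependency and domain specificity in a data-efficient manner.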

Details Motivation: Progress in deep learning for medical image segmentation is limited by the scarcity of high-quality training data. Diffusion models offer a potential solution by generating synthetic images, but their effectiveness in medical imaging remains constrained. Method: During inference, MedDiff-FT uses a dynamic adaptive guiding mask to ensure anatomically coherent synthesis, and a lightweight stochastic mask generator enhances diversity through hierarchical randomness injection. In addition, an automated quality assessment protocol filters suboptimal outputs using feature-space metrics, followed by mask corrosion to refine fidelity. Result: On five medical segmentation datasets, MedDiff-FT's synthetic image-mask pairs improve the segmentation performance of SOTA methods by an average of 1% in Dice score. Conclusion: The MedDiff-FT framework effectively balances generation quality, diversity, and computational efficiency, offering a practical solution for medical data augmentation. Abstract: Recent advancements in deep learning for medical image segmentation are often limited by the scarcity of high-quality training data. While diffusion models provide a potential solution by generating synthetic images, their effectiveness in medical imaging remains constrained due to their reliance on large-scale medical datasets and the need for higher image quality. To address these challenges, we present MedDiff-FT, a controllable medical image generation method that fine-tunes a diffusion foundation model to produce medical images with structural dependency and domain specificity in a data-efficient manner. During inference, a dynamic adaptive guiding mask enforces spatial constraints to ensure anatomically coherent synthesis, while a lightweight stochastic mask generator enhances diversity through hierarchical randomness injection. Additionally, an automated quality assessment protocol filters suboptimal outputs using feature-space metrics, followed by mask corrosion to refine fidelity. Evaluated on five medical segmentation datasets, MedDiff-FT's synthetic image-mask pairs improve SOTA methods' segmentation performance by an average of 1% in Dice score. The framework effectively balances generation quality, diversity, and computational efficiency, offering a practical solution for medical data augmentation. The code is available at https://github.com/JianhaoXie1/MedDiff-FT.
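
For readers unfamiliar with the Dice score used to report the 1% gain, the following minimal sketch computes it for two binary masks. It shows the standard definition, 2|A∩B| / (|A| + |B|), not the paper's evaluation code; treating two empty masks as a perfect match is a convention assumed here.

```python
def dice_score(pred, target):
    """Dice coefficient between two binary masks given as flat sequences of 0/1."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same number of pixels")
    intersection = sum(p * t for p, t in zip(pred, target))
    total = sum(pred) + sum(target)
    # Assumed convention: two empty masks count as a perfect match.
    if total == 0:
        return 1.0
    return 2.0 * intersection / total


pred = [1, 1, 0, 0]
target = [1, 0, 0, 0]
print(dice_score(pred, target))  # 2 * 1 / (2 + 1)
```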

[77] Learning Dense Feature Matching via Lifting Single 2D Image to 3D Space

Yingping Liang,Yutao Hu,Wenqi Shao,Ying Fu

Main category: cs.CV

TL;DR: This paper introduces Lift to Match (L2M), a new framework that enhances feature matching by incorporating 3D geometry knowledge and synthetic data generation, enabling better generalization across diverse scenarios.

Details Motivation: Existing feature matching methods are limited by their reliance on scarce and clean multi-view image collections and single-view 2D-trained encoders, which hampers their performance in diverse and challenging scenarios. Method: A two-stage framework that first learns a 3D-aware feature encoder using multi-view image synthesis and 3D feature Gaussian representation, followed by a feature decoder trained with a novel-view rendering strategy and large-scale synthetic data generation. Result: Extensive experiments show that the L2M framework achieves superior generalization on zero-shot evaluation benchmarks, proving its effectiveness for robust feature matching. Conclusion: The proposed L2M framework significantly improves the generalization of feature matching across diverse domains by leveraging 3D geometry knowledge and synthetic data generation. Abstract: Feature matching plays a fundamental role in many computer vision tasks, yet existing methods heavily rely on scarce and clean multi-view image collections, which constrains their generalization to diverse and challenging scenarios. Moreover, conventional feature encoders are typically trained on single-view 2D images, limiting their capacity to capture 3D-aware correspondences. In this paper, we propose a novel two-stage framework that lifts 2D images to 3D space, named Lift to Match (L2M), taking full advantage of large-scale and diverse single-view images. To be specific, in the first stage, we learn a 3D-aware feature encoder using a combination of multi-view image synthesis and 3D feature Gaussian representation, which injects 3D geometry knowledge into the encoder. In the second stage, a novel-view rendering strategy, combined with large-scale synthetic data generation from single-view images, is employed to learn a feature decoder for robust feature matching, thus achieving generalization across diverse domains. 
Extensive experiments demonstrate that our method achieves superior generalization across zero-shot evaluation benchmarks, highlighting the effectiveness of the proposed framework for robust feature matching.

[78] Few-shot Classification as Multi-instance Verification: Effective Backbone-agnostic Transfer across Domains

Xin Xu,Eibe Frank,Geoffrey Holmes

Main category: cs.CV

TL;DR: This paper introduces MIV-head, an efficient method for cross-domain few-shot learning that works with fixed backbones, achieving competitive performance without fine-tuning.

Details Motivation: To address the practical challenge of cross-domain few-shot learning when fine-tuning of backbones is not feasible. Method: The authors propose the MIV-head, which treats few-shot classification as multiple instance verification tasks, designed to work with frozen backbones. Result: The MIV-head achieves strong accuracy on test data while being computationally efficient, outperforming known classification head approaches and matching adapter methods with lower adaptation cost. Conclusion: The MIV-head method offers a competitive and cost-effective approach for cross-domain few-shot learning without fine-tuning the backbone. Abstract: We investigate cross-domain few-shot learning under the constraint that fine-tuning of backbones (i.e., feature extractors) is impossible or infeasible -- a scenario that is increasingly common in practical use cases. Handling the low-quality and static embeddings produced by frozen, "black-box" backbones leads to a problem representation of few-shot classification as a series of multiple instance verification (MIV) tasks. Inspired by this representation, we introduce a novel approach to few-shot domain adaptation, named the "MIV-head", akin to a classification head that is agnostic to any pretrained backbone and computationally efficient. The core components designed for the MIV-head, when trained on few-shot data from a target domain, collectively yield strong performance on test data from that domain. Importantly, it does so without fine-tuning the backbone, and within the "meta-testing" phase. 
Experimenting under various settings and on an extension of the Meta-dataset benchmark for cross-domain few-shot image classification, using representative off-the-shelf convolutional neural network and vision transformer backbones pretrained on ImageNet1K, we show that the MIV-head achieves highly competitive accuracy when compared to state-of-the-art "adapter" (or partially fine-tuning) methods applied to the same backbones, while incurring substantially lower adaptation cost. We also find well-known "classification head" approaches lag far behind in terms of accuracy. Ablation study empirically justifies the core components of our approach. We share our code at https://github.com/xxweka/MIV-head.

[79] DiGA3D: Coarse-to-Fine Diffusional Propagation of Geometry and Appearance for Versatile 3D Inpainting

Jingyi Pan,Dan Xu,Qiong Luo

Main category: cs.CV

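TL;DR: This paper proposes DiGA3D, a novel and versatile 3D inpainting pipeline that uses diffusion models to propagate consistent appearance and geometry, addressing the challenges of performing multiple 3D inpainting tasks within a unified framework.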

Details Motivation: To overcome three problems: single-reference inpainting methods lack robustness for views far from the reference view, independently inpainting multi-view images causes appearance inconsistency, and performance is limited when the inpainting regions undergo significant geometric changes. Method: DiGA3D develops a robust strategy for selecting multiple reference views, designs an Attention Feature Propagation (AFP) mechanism to maintain appearance consistency, and introduces a Texture-Geometry Score Distillation Sampling (TG-SDS) loss to improve the geometric consistency of 3D scenes. Result: Extensive experiments on multiple 3D inpainting tasks demonstrate the effectiveness of the method. Conclusion: DiGA3D is an effective 3D inpainting pipeline that uses diffusion models to propagate consistent appearance and geometry in a coarse-to-fine manner. Abstract: Developing a unified pipeline that enables users to remove, re-texture, or replace objects in a versatile manner is crucial for text-guided 3D inpainting. However, there are still challenges in performing multiple 3D inpainting tasks within a unified framework: 1) Single reference inpainting methods lack robustness when dealing with views that are far from the reference view; 2) Appearance inconsistency arises when independently inpainting multi-view images with 2D diffusion priors; 3) Geometry inconsistency limits performance when there are significant geometric changes in the inpainting regions. To tackle these challenges, we introduce DiGA3D, a novel and versatile 3D inpainting pipeline that leverages diffusion models to propagate consistent appearance and geometry in a coarse-to-fine manner. First, DiGA3D develops a robust strategy for selecting multiple reference views to reduce errors during propagation. Next, DiGA3D designs an Attention Feature Propagation (AFP) mechanism that propagates attention features from the selected reference views to other views via diffusion models to maintain appearance consistency. Furthermore, DiGA3D introduces a Texture-Geometry Score Distillation Sampling (TG-SDS) loss to further improve the geometric consistency of inpainted 3D scenes. Extensive experiments on multiple 3D inpainting tasks demonstrate the effectiveness of our method. The project page is available at https://rorisis.github.io/DiGA3D/.

[80] MFH: Marrying Frequency Domain with Handwritten Mathematical Expression Recognition

Huanxin Yang,Qiwen Wang

Main category: cs.CV

TL;DR: This paper proposes MFH, a method combining frequency domain analysis with handwritten mathematical expression recognition (HMER), which improves recognition accuracy on multiple test sets.

Details Motivation: Handwritten mathematical expression recognition (HMER) faces challenges due to complex formula structures and character layouts in sequence prediction. This study aims to address these issues by incorporating frequency domain analysis into HMER. Method: The study introduces a method that combines frequency domain analysis with HMER (MFH), leveraging the discrete cosine transform (DCT) to improve structural analysis for recognizing mathematical formulas. Result: Built on the CoMER baseline, MFH-CoMER achieves accuracy rates of 61.66%, 62.07%, and 63.72% on the CROHME 2014, 2016, and 2019 test sets, respectively, demonstrating its effectiveness. Conclusion: The proposed MFH method enhances the performance of HMER models by incorporating frequency domain analysis using DCT. Abstract: Handwritten mathematical expression recognition (HMER) suffers from complex formula structures and character layouts in sequence prediction. In this paper, we incorporate frequency domain analysis into HMER and propose a method that marries frequency domain with HMER (MFH), leveraging the discrete cosine transform (DCT). We emphasize the structural analysis assistance of frequency information for recognizing mathematical formulas. When implemented on various baseline models, our network exhibits a consistent performance enhancement, demonstrating the efficacy of frequency domain information. Experiments show that our MFH-CoMER achieves noteworthy accuracy rates of 61.66%/62.07%/63.72% on the CROHME 2014/2016/2019 test sets. The source code is available at https://github.com/Hryxyhe/MFH.
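
The discrete cosine transform that MFH builds on can be sketched in a few lines. The code below is a plain orthonormal DCT-II applied along rows and then columns. It only illustrates the transform itself; how MFH feeds DCT coefficients into the recognition network is not described in the abstract, and the function names are hypothetical.

```python
import math


def dct_1d(x):
    """Orthonormal DCT-II of a sequence of numbers."""
    n = len(x)
    out = []
    for k in range(n):
        s = sum(x[i] * math.cos(math.pi * (i + 0.5) * k / n) for i in range(n))
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        out.append(scale * s)
    return out


def dct_2d(image):
    """Separable 2D DCT: transform each row, then each column."""
    rows = [dct_1d(r) for r in image]
    cols = [dct_1d(list(c)) for c in zip(*rows)]
    # cols is indexed [horizontal frequency][vertical frequency]; transpose back.
    return [list(r) for r in zip(*cols)]


flat = [[1.0] * 4 for _ in range(4)]   # a featureless 4x4 patch
coeffs = dct_2d(flat)
print(round(coeffs[0][0], 6))          # all energy sits in the DC coefficient
```

A constant patch puts all of its energy into the single lowest-frequency coefficient, while strokes and layout boundaries show up in higher-frequency coefficients, which is the kind of structural cue the abstract refers to.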

[81] Latent Posterior-Mean Rectified Flow for Higher-Fidelity Perceptual Face Restoration

Xin Luo,Menglin Zhang,Yunwei Lan,Tianyu Zhang,Rui Li,Chang Liu,Dong Liu

Main category: cs.CV

TL;DR: Latent-PMRF improves face restoration by operating in the latent space of a custom VAE, achieving better perceptual alignment and faster convergence compared to PMRF.

Details Motivation: The paper aims to improve the Perception-Distortion tradeoff by addressing the limitations of PMRF's pixel-space modeling approach, which does not fully align with human perception. Method: Latent-PMRF reformulates PMRF in the latent space of a variational autoencoder (VAE), defining the source distribution on latent representations to align with human perception during optimization. Result: Extensive experiments demonstrate that Latent-PMRF achieves a better PD-tradeoff than existing methods and converges 5.79X faster than PMRF as measured by FID. Conclusion: Latent-PMRF outperforms existing methods in blind face restoration, providing better alignment with human perception and improved convergence efficiency. Abstract: The Perception-Distortion tradeoff (PD-tradeoff) theory suggests that face restoration algorithms must balance perceptual quality and fidelity. To achieve minimal distortion while maintaining perfect perceptual quality, Posterior-Mean Rectified Flow (PMRF) proposes a flow-based approach whose source distribution consists of minimum distortion estimations. Although PMRF is shown to be effective, its pixel-space modeling approach limits its ability to align with human perception, where human perception is defined as how humans distinguish between two image distributions. In this work, we propose Latent-PMRF, which reformulates PMRF in the latent space of a variational autoencoder (VAE), facilitating better alignment with human perception during optimization. By defining the source distribution on latent representations of minimum distortion estimation, we bound the minimum distortion by the VAE's reconstruction error. Moreover, we reveal that the design of the VAE is crucial, and our proposed VAE significantly outperforms existing VAEs in both reconstruction and restoration. 
Extensive experiments on blind face restoration demonstrate the superiority of Latent-PMRF, offering an improved PD-tradeoff compared to existing methods, along with remarkable convergence efficiency, achieving a 5.79X speedup over PMRF in terms of FID. Our code will be available as open-source.

[82] ATSTrack: Enhancing Visual-Language Tracking by Aligning Temporal and Spatial Scales

Yihao Zhen,Qiang Wang,Yu Qiao,Liangqiong Qu,Huijie Fan

Main category: cs.CV

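TL;DR: This paper proposes ATSTrack, a new visual-language tracking method that markedly strengthens the effect of feature modification by better aligning the temporal and spatial scales of different input components.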

Details Motivation: To address the misalignment between visual inputs and language descriptions caused by target movement, together with the hindrance posed by their inherent differences in temporal and spatial scale. Method: Each language description is decomposed into phrases with different attributes, whose features are modified in a fine-grained manner. A Visual-Language token carrying the modified linguistic information from the previous frame guides the model to extract more relevant visual features. Result: Experiments show that the proposed ATSTrack performs comparably to existing methods. Conclusion: By aligning the temporal and spatial scales of different input components, ATSTrack improves visual-language tracking; the code will be released. Abstract: A main challenge of Visual-Language Tracking (VLT) is the misalignment between visual inputs and language descriptions caused by target movement. Previous trackers have explored many effective feature modification methods to preserve more aligned features. However, an important yet unexplored factor ultimately hinders their capability, which is the inherent differences in the temporal and spatial scale of information between visual and language inputs. To address this issue, we propose a novel visual-language tracker that enhances the effect of feature modification by Aligning Temporal and Spatial scale of different input components, named ATSTrack. Specifically, we decompose each language description into phrases with different attributes based on their temporal and spatial correspondence with visual inputs, and modify their features in a fine-grained manner. Moreover, we introduce a Visual-Language token that comprises modified linguistic information from the previous frame to guide the model to extract visual features that are more relevant to language description, thereby reducing the impact caused by the differences in spatial scale. Experimental results show that our proposed ATSTrack achieves performance comparable to existing methods. Our code will be released.

[83] Unleashing the Potential of All Test Samples: Mean-Shift Guided Test-Time Adaptation

Jizhou Han,Chenhao Ding,SongLin Dong,Yuhang He,Xinyuan Gao,Yihong Gong

Main category: cs.CV

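TL;DR: This paper proposes MS-TTA, a training-free test-time adaptation method that enhances the feature representations of vision-language models, improving their performance on out-of-distribution and cross-dataset tasks.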

Details Motivation: Existing training-free test-time adaptation methods operate only within CLIP's original feature space and rely on high-confidence samples while overlooking the potential of low-confidence ones. Method: MS-TTA is a training-free test-time adaptation method that uses Mean-Shift to refine the feature representations of all test samples and caches the refined embeddings to make inference more stable. Result: MS-TTA outperforms existing training-free test-time adaptation methods on multiple out-of-distribution and cross-dataset benchmarks, achieving robust adaptation. Conclusion: By enhancing feature representations with a single-step k-nearest neighbors Mean-Shift, MS-TTA improves the performance of vision-language models on out-of-distribution and cross-dataset benchmarks. Abstract: Visual-language models (VLMs) like CLIP exhibit strong generalization but struggle with distribution shifts at test time. Existing training-free test-time adaptation (TTA) methods operate strictly within CLIP's original feature space, relying on high-confidence samples while overlooking the potential of low-confidence ones. We propose MS-TTA, a training-free approach that enhances feature representations beyond CLIP's space using a single-step k-nearest neighbors (kNN) Mean-Shift. By refining all test samples, MS-TTA improves feature compactness and class separability, leading to more stable adaptation. Additionally, a cache of refined embeddings further enhances inference by providing Mean Shift enhanced logits. Extensive evaluations on OOD and cross-dataset benchmarks demonstrate that MS-TTA consistently outperforms state-of-the-art training-free TTA methods, achieving robust adaptation without requiring additional training.
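
A single-step kNN Mean-Shift can be sketched as follows: every feature vector is replaced by the mean of its k nearest neighbours. This is a generic illustration, not the MS-TTA code. Euclidean distance, counting a point as its own nearest neighbour, and the absence of re-normalization or caching are all assumptions made here.

```python
def knn_mean_shift_step(features, k):
    """Replace each feature by the mean of its k nearest neighbours (one step).

    Assumptions for this sketch: squared Euclidean distance, and each point
    counts as its own nearest neighbour (distance 0).
    """
    n = len(features)
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and the number of features")
    shifted = []
    for f in features:
        order = sorted(
            range(n),
            key=lambda j: sum((a - b) ** 2 for a, b in zip(f, features[j])),
        )
        neighbours = order[:k]
        shifted.append(
            [sum(features[j][d] for j in neighbours) / k for d in range(len(f))]
        )
    return shifted


feats = [[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]]
print(knn_mean_shift_step(feats, k=2))
```

In this toy input the two points of each cluster collapse onto their shared mean, which is the compactness effect the abstract attributes to refining all test samples.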

[84] Bisecle: Binding and Separation in Continual Learning for Video Language Understanding

Yue Tan,Xiaoqian Hu,Hao Xue,Celso De Melo,Flora D. Salim

Main category: cs.CV

TL;DR: This paper proposes Bisecle, a novel continual learning framework for vision-language models inspired by hippocampal memory mechanisms, which improves video understanding by reducing catastrophic forgetting and enhancing generalization across tasks.

Details Motivation: Real-world videos are continuous data streams requiring continual adaptation, but fine-tuning large vision-language models (VLMs) on new tasks is computationally expensive. Existing continual learning frameworks face challenges like catastrophic forgetting and update conflicts in this context. Method: Bisecle uses a multi-directional supervision module to capture cross-modal relationships and employs a contrastive prompt learning scheme to isolate task-specific knowledge. This approach is inspired by hippocampal mechanisms like rapid binding and pattern separation. Result: Bisecle demonstrates robust and efficient continual learning performance on several VideoQA benchmarks, showing reduced forgetting and improved cross-task generalization compared to existing methods. Conclusion: The proposed Bisecle method effectively addresses the challenges of continual learning in large multimodal foundation models for video understanding tasks, mitigating forgetting and enhancing cross-task generalization. Abstract: Frontier vision-language models (VLMs) have made remarkable improvements in video understanding tasks. However, real-world videos typically exist as continuously evolving data streams (e.g., dynamic scenes captured by wearable glasses), necessitating models to continually adapt to shifting data distributions and novel scenarios. Considering the prohibitive computational costs of fine-tuning models on new tasks, usually, a small subset of parameters is updated while the bulk of the model remains frozen. This poses new challenges to existing continual learning frameworks in the context of large multimodal foundation models, i.e., catastrophic forgetting and update conflict. While the foundation models struggle with parameter-efficient continual learning, the hippocampus in the human brain has evolved highly efficient mechanisms for memory formation and consolidation. 
Inspired by the rapid Binding and pattern separation mechanisms in the hippocampus, in this work, we propose Bisecle for video-language continual learning, where a multi-directional supervision module is used to capture more cross-modal relationships and a contrastive prompt learning scheme is designed to isolate task-specific knowledge to facilitate efficient memory storage. Binding and separation processes further strengthen the ability of VLMs to retain complex experiences, enabling robust and efficient continual learning in video understanding tasks. We perform a thorough evaluation of the proposed Bisecle, demonstrating its ability to mitigate forgetting and enhance cross-task generalization on several VideoQA benchmarks.

[85] ARIG: Autoregressive Interactive Head Generation for Real-time Conversations

Ying Guo,Xi Liu,Cheng Zhen,Pengfei Yan,Xiaoming Wei

Main category: cs.CV

TL;DR: This paper introduces ARIG, a real-time, frame-wise framework for generating realistic interactive head movements in virtual agents by leveraging autoregressive modeling, diffusion-based motion prediction, and deep contextual understanding of both behavior and conversation states.

Details Motivation: Previous methods like clip-wise generation or explicit listener/speaker switching have limitations in future signal acquisition, contextual understanding, and smooth transitions, making real-time and realistic interaction difficult. This work aims to overcome these challenges. Method: The method involves an autoregressive (AR) based frame-wise framework called ARIG. It uses a diffusion procedure for motion prediction in continuous space, bidirectional-integrated learning for short-range behavior understanding, and context-aware modeling for long-range understanding. Voice activity signals and context features are used to understand conversational states. Result: Extensive experiments demonstrated the effectiveness of the ARIG model in achieving real-time performance and improved interaction realism compared to previous approaches. Conclusion: The proposed ARIG framework achieves real-time and realistic interactive head generation by employing an autoregressive, frame-wise approach with enhanced interactive behavior and conversational state understanding. Abstract: Face-to-face communication, as a common human activity, motivates the research on interactive head generation. A virtual agent can generate motion responses with both listening and speaking capabilities based on the audio or motion signals of the other user and itself. However, previous clip-wise generation paradigm or explicit listener/speaker generator-switching methods have limitations in future signal acquisition, contextual behavioral understanding, and switching smoothness, making it challenging to be real-time and realistic. In this paper, we propose an autoregressive (AR) based frame-wise framework called ARIG to realize the real-time generation with better interaction realism. To achieve real-time generation, we model motion prediction as a non-vector-quantized AR process. 
Unlike discrete codebook-index prediction, we represent motion distribution using diffusion procedure, achieving more accurate predictions in continuous space. To improve interaction realism, we emphasize interactive behavior understanding (IBU) and detailed conversational state understanding (CSU). In IBU, based on dual-track dual-modal signals, we summarize short-range behaviors through bidirectional-integrated learning and perform contextual understanding over long ranges. In CSU, we use voice activity signals and context features of IBU to understand the various states (interruption, feedback, pause, etc.) that exist in actual conversations. These serve as conditions for the final progressive motion prediction. Extensive experiments have verified the effectiveness of our model.

[86] ADAptation: Reconstruction-based Unsupervised Active Learning for Breast Ultrasound Diagnosis

Yaofei Duan,Yuhao Huang,Xin Yang,Luyi Han,Xinyu Xie,Zhiyuan Zhu,Ping He,Ka-Hou Chan,Ligang Cui,Sio-Kei Im,Dong Ni,Tao Tan

Main category: cs.CV

TL;DR: This paper proposes ADAptation, an unsupervised active learning framework using diffusion models and novel feature clustering and sample selection techniques, effectively addressing domain adaptation challenges in medical imaging.

Details Motivation: Deep learning models face performance drops due to distribution shifts between training and test domains; collecting target domain data is costly and time-consuming, while current active learning approaches struggle with domain variations. Method: The study introduces ADAptation, which combines diffusion models for domain translation with a hypersphere-constrained contrastive learning network and a dual-scoring mechanism to balance sample uncertainty and representativeness. Result: Extensive experiments on four breast ultrasound datasets across five deep classifiers show that ADAptation surpasses existing active learning methods in handling domain shifts. Conclusion: ADAptation, the proposed unsupervised active learning framework for domain adaptation, outperforms existing AL-based methods and demonstrates strong generalization in clinical domain adaptation. Abstract: Deep learning-based diagnostic models often suffer performance drops due to distribution shifts between training (source) and test (target) domains. Collecting and labeling sufficient target domain data for model retraining represents an optimal solution, yet is limited by time and scarce resources. Active learning (AL) offers an efficient approach to reduce annotation costs while maintaining performance, but struggles to handle the challenge posed by distribution variations across different datasets. In this study, we propose a novel unsupervised Active learning framework for Domain Adaptation, named ADAptation, which efficiently selects informative samples from multi-domain data pools under limited annotation budget. As a fundamental step, our method first utilizes the distribution homogenization capabilities of diffusion models to bridge cross-dataset gaps by translating target images into source-domain style. 
We then introduce two key innovations: (a) a hypersphere-constrained contrastive learning network for compact feature clustering, and (b) a dual-scoring mechanism that quantifies and balances sample uncertainty and representativeness. Extensive experiments on four breast ultrasound datasets (three public and one in-house/multi-center) across five common deep classifiers demonstrate that our method surpasses existing strong AL-based competitors, validating its effectiveness and generalization for clinical domain adaptation. The code is available at the anonymized link: https://github.com/miccai25-966/ADAptation.

[87] Just Noticeable Difference for Large Multimodal Models

Zijian Chen,Yuan Tian,Yuze Sun,Wei Sun,Zicheng Zhang,Weisi Lin,Guangtao Zhai,Wenjun Zhang

Main category: cs.CV

TL;DR: This paper introduces LMM-JND, a novel approach to understanding visual perception limitations in large multimodal models (LMMs). Through the creation of the VPA-JND dataset, the study exposes significant weaknesses in state-of-the-art models like GPT-4o and InternVL2.5, highlighting the importance of improving visual acuity for both performance and security reasons.

Details Motivation: Despite decades of research on Just Noticeable Difference (JND) in human vision, there has been limited systematic exploration of perceptual boundaries in large multimodal models (LMMs), particularly regarding their visual capabilities and perceptual defects. These gaps raise potential security issues and hinder response efficiency, motivating this study to address such shortcomings. Method: The study proposes a new concept, LMM-JND, along with a determination pipeline. It investigates behavior commonalities in human visual perception tasks across multiple LMM families and constructs a large-scale dataset, VPA-JND, which includes 21.5k reference images and over 489k stimuli across 12 distortion types. The performance of state-of-the-art LMMs like GPT-4o and InternVL2.5 series is evaluated on this dataset. Result: The study demonstrates significant visual blind spots in current LMMs. Using the newly developed VPA-JND dataset, it reveals areas where advanced models like GPT-4o and InternVL2.5 series struggle with basic visual comparison tasks, falling short of human-level performance. Additionally, the research identifies a correlation between vision and language backbone designs and their impact on visual acuity in LMMs. Conclusion: The research concludes that current LMMs have significant visual blind spots, and the proposed LMM-JND concept provides a unique perspective for studying these limitations. This insight is crucial for addressing security concerns and guiding future improvements in LMM visual acuity. Abstract: Just noticeable difference (JND), the minimum change that the human visual system (HVS) can perceive, has been studied for decades. 
Although recent work has extended this line of research into machine vision, there has been a scarcity of studies systematically exploring its perceptual boundaries across multiple tasks and stimulus types, particularly in the current era of rapidly advancing large multimodal models (LMMs), where studying the multifaceted capabilities of models has become a mainstream focus. Moreover, the perceptual defects of LMMs are not investigated thoroughly, resulting in potential security issues and suboptimal response efficiency. In this paper, we make an initial attempt and demonstrate that there exist significant visual blind spots in current LMMs. To systematically quantify this characteristic, we propose a new concept, LMM-JND, together with its determination pipeline. Targeting uncovering the behavior commonalities in HVS-aligned visual perception tasks, we delve into several LMM families and construct a large-scale dataset, named VPA-JND, which contains 21.5k reference images with over 489k stimuli across 12 distortion types, to facilitate LMM-JND studies. VPA-JND exposes areas where state-of-the-art LMMs, including GPT-4o and the InternVL2.5 series, struggle with basic comparison queries and fall significantly short of human-level visual performance. We further explore the effects of vision and language backbones and find a notable correlation between their design philosophy that may instruct the future refinement of LMMs for their visual acuity. Together, our research underscores the significance of LMM-JND as a unique perspective for studying LMMs, and predictable LMM-JND is crucial for security concerns. This work will be available at https://github.com/zijianchen98/LMM-JND.

[88] Visual Anagrams Reveal Hidden Differences in Holistic Shape Processing Across Vision Models

Fenil R. Doshi,Thomas Fel,Talia Konkle,George Alvarez

Main category: cs.CV

TL;DR: This paper introduces the Configural Shape Score (CSS) to measure how well vision models use global shape configurations, showing that top-performing transformers effectively integrate both local texture and global shape cues, suggesting that combining both is key to human-like vision systems.

Details Motivation: Contemporary vision models mainly rely on local texture cues, leading to brittle and non-compositional features. Previous work has treated shape and texture as opposing factors, neglecting the possibility of simultaneous reliance on both. This paper aims to develop a more accurate evaluation of shape representation by focusing on absolute configural competence. Method: The authors introduced the Configural Shape Score (CSS) to evaluate absolute configural competence in models by testing their ability to recognize objects with preserved local textures but altered global part arrangements. They tested 86 models, including convolutional networks, transformers, and hybrid models, and used mechanistic probes to study how high-CSS networks process information. Result: CSS revealed a broad spectrum of configural sensitivity across models, with self-supervised and language-aligned transformers (e.g., DINOv2, SigLIP2, EVA-CLIP) performing best. High-CSS models showed dependence on long-range interactions, and representational-similarity analyses indicated a transition from local to global coding at mid-depth. A BagNet control performed at chance, eliminating 'border-hacking' strategies. Additionally, CSS was shown to predict other shape-dependent evaluations. Conclusion: The paper concludes that future vision systems should integrate both shape and texture cues for robustness and generalizability, rather than forcing a choice between them. Abstract: Humans are able to recognize objects based on both local texture cues and the configuration of object parts, yet contemporary vision models primarily harvest local texture cues, yielding brittle, non-compositional features. 
Work on shape-vs-texture bias has pitted shape and texture representations in opposition, measuring shape relative to texture, ignoring the possibility that models (and humans) can simultaneously rely on both types of cues, and obscuring the absolute quality of both types of representation. We therefore recast shape evaluation as a matter of absolute configural competence, operationalized by the Configural Shape Score (CSS), which (i) measures the ability to recognize both images in Object-Anagram pairs that preserve local texture while permuting global part arrangement to depict different object categories. Across 86 convolutional, transformer, and hybrid models, CSS (ii) uncovers a broad spectrum of configural sensitivity with fully self-supervised and language-aligned transformers -- exemplified by DINOv2, SigLIP2 and EVA-CLIP -- occupying the top end of the CSS spectrum. Mechanistic probes reveal that (iii) high-CSS networks depend on long-range interactions: radius-controlled attention masks abolish performance showing a distinctive U-shaped integration profile, and representational-similarity analyses expose a mid-depth transition from local to global coding. A BagNet control remains at chance (iv), ruling out "border-hacking" strategies. Finally, (v) we show that configural shape score also predicts other shape-dependent evals. Overall, we propose that the path toward truly robust, generalizable, and human-like vision systems may not lie in forcing an artificial choice between shape and texture, but rather in architectural and learning frameworks that seamlessly integrate both local-texture and global configural shape.
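
As an illustration of the pair-level scoring described in the abstract, here is a minimal sketch. It assumes the score is the fraction of Object-Anagram pairs in which both images are classified correctly; the abstract only says CSS "measures the ability to recognize both images" in a pair, so the exact formula, the function name `configural_shape_score`, and the input layout are all assumptions.

```python
def configural_shape_score(pairs):
    """Fraction of pairs in which BOTH images are recognized correctly.

    `pairs` is an iterable of ((pred_a, label_a), (pred_b, label_b)).
    This layout is hypothetical, chosen only for illustration.
    """
    pairs = list(pairs)
    if not pairs:
        raise ValueError("need at least one pair")
    both_correct = sum(
        1 for (pred_a, label_a), (pred_b, label_b) in pairs
        if pred_a == label_a and pred_b == label_b
    )
    return both_correct / len(pairs)


# Two anagram pairs: the first is fully recognized, the second only half.
example_pairs = [
    (("wolf", "wolf"), ("elephant", "elephant")),
    (("rabbit", "rabbit"), ("rabbit", "kangaroo")),
]
score = configural_shape_score(example_pairs)
```

Requiring both images to be correct is what makes such a score sensitive to configuration: the two images in a pair share local texture, so a model relying on texture alone can get at most one of them right.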

[89] Laplace-Mamba: Laplace Frequency Prior-Guided Mamba-CNN Fusion Network for Image Dehazing

Yongzhen Wang,Liangliang Chen,Bingwen Hu,Heng Liu,Xiao-Ping Zhang,Mingqiang Wei

Main category: cs.CV

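TL;DR: This paper proposes Laplace-Mamba, a new image dehazing framework that combines a Laplace frequency prior with a hybrid Mamba-CNN architecture to improve restoration quality and efficiency.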

Details Motivation: SSM-based methods are limited in reconstructing local structures and are less effective on high-dimensional data, which makes it hard to recover fine image features. Method: A Laplace decomposition splits the image into low-frequency and high-frequency components, which SSMs and CNNs then process for global context and local structural detail, respectively. Result: Experiments show that the method outperforms prior work on multiple benchmark datasets in both restoration quality and efficiency. Conclusion: Laplace-Mamba effectively combines a Laplace frequency prior with a hybrid Mamba-CNN architecture, addressing the weaknesses of SSM-based methods in local structure reconstruction and high-dimensional data handling. Abstract: Recent progress in image restoration has underscored Spatial State Models (SSMs) as powerful tools for modeling long-range dependencies, owing to their appealing linear complexity and computational efficiency. However, SSM-based approaches exhibit limitations in reconstructing localized structures and tend to be less effective when handling high-dimensional data, frequently resulting in suboptimal recovery of fine image features. To tackle these challenges, we introduce Laplace-Mamba, a novel framework that integrates Laplace frequency prior with a hybrid Mamba-CNN architecture for efficient image dehazing. Leveraging the Laplace decomposition, the image is disentangled into low-frequency components capturing global texture and high-frequency components representing edges and fine details. This decomposition enables specialized processing via dual parallel pathways: the low-frequency branch employs SSMs for global context modeling, while the high-frequency branch utilizes CNNs to refine local structural details, effectively addressing diverse haze scenarios. Notably, the Laplace transformation facilitates information-preserving downsampling of low-frequency components in accordance with the Nyquist theory, thereby significantly improving computational efficiency. Extensive evaluations across multiple benchmarks demonstrate that our method outperforms state-of-the-art approaches in both restoration quality and efficiency. The source code and pretrained models are available at https://github.com/yz-wang/Laplace-Mamba.
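
The low/high-frequency split described in the abstract can be illustrated with a minimal sketch. It works on a 1-D signal and uses a simple box blur as the low-pass filter; both are assumptions made for brevity, and the helper names `box_blur` and `laplace_split` are hypothetical. The paper's actual decomposition and its 2-D implementation are not specified here.

```python
def box_blur(signal, radius=1):
    """Low-pass filter: average each sample with its neighbors (edges clipped)."""
    n = len(signal)
    blurred = []
    for i in range(n):
        window = signal[max(0, i - radius):min(n, i + radius + 1)]
        blurred.append(sum(window) / len(window))
    return blurred


def laplace_split(signal, radius=1):
    """Split a signal into a smooth low-frequency part and a residual.

    The residual (signal minus its blurred copy) holds edges and fine detail,
    so low + high reconstructs the input up to floating-point rounding.
    """
    low = box_blur(signal, radius)
    high = [s - l for s, l in zip(signal, low)]
    return low, high


step_edge = [0.0, 0.0, 0.0, 1.0, 1.0, 1.0]
low, high = laplace_split(step_edge)
reconstructed = [l + h for l, h in zip(low, high)]
```

In a two-branch design like the one the abstract describes, `low` would go to the global-context branch and `high` to the local-detail branch. Because `low` is smooth, it can be downsampled with little loss, which is where the efficiency gain comes from.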

[90] ExPaMoE: An Expandable Parallel Mixture of Experts for Continual Test-Time Adaptation

JianChao Zhao,Songlin Dong

Main category: cs.CV

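TL;DR: This paper proposes ExPaMoE, a novel CTTA framework built on an expandable parallel mixture-of-experts architecture. By decoupling domain knowledge and detecting distribution shifts in real time, it significantly improves performance in continual test-time adaptation.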

Details Motivation: Existing CTTA methods rely on model parameters shared across all domains, which makes them prone to feature entanglement and catastrophic forgetting under large or non-stationary domain shifts. A new approach is needed to improve adaptability and robustness. Method: The ExPaMoE framework is built on an expandable parallel mixture-of-experts architecture and uses a dual-branch expert design together with a Spectral-Aware Online Domain Discriminator (SODD) to separate domain knowledge and detect distribution changes in real time. Result: ExPaMoE shows superior performance on several standard benchmarks, including CIFAR-10C, CIFAR-100C, ImageNet-C, and Cityscapes-to-ACDC semantic segmentation, and demonstrates strong robustness, scalability, and resistance to forgetting on the newly proposed ImageNet++ benchmark. Conclusion: By decoupling domain-general and domain-specific knowledge and dynamically expanding its expert pool, ExPaMoE effectively addresses the feature entanglement and catastrophic forgetting that existing CTTA methods suffer under large or non-stationary domain shifts. Abstract: Continual Test-Time Adaptation (CTTA) aims to enable models to adapt on-the-fly to a stream of unlabeled data under evolving distribution shifts. However, existing CTTA methods typically rely on shared model parameters across all domains, making them vulnerable to feature entanglement and catastrophic forgetting in the presence of large or non-stationary domain shifts. To address this limitation, we propose ExPaMoE, a novel framework based on an Expandable Parallel Mixture-of-Experts architecture. ExPaMoE decouples domain-general and domain-specific knowledge via a dual-branch expert design with token-guided feature separation, and dynamically expands its expert pool based on a Spectral-Aware Online Domain Discriminator (SODD) that detects distribution changes in real-time using frequency-domain cues. Extensive experiments demonstrate the superiority of ExPaMoE across diverse CTTA scenarios. We evaluate our method on standard benchmarks including CIFAR-10C, CIFAR-100C, ImageNet-C, and Cityscapes-to-ACDC for semantic segmentation. Additionally, we introduce ImageNet++, a large-scale and realistic CTTA benchmark built from multiple ImageNet-derived datasets, to better reflect long-term adaptation under complex domain evolution. ExPaMoE consistently outperforms prior art, showing strong robustness, scalability, and resistance to forgetting.

[91] LLaVA-SP: Enhancing Visual Representation with Visual Spatial Tokens for MLLMs

Haoran Lou,Chunxiao Fan,Ziyan Liu,Yuexin Wu,Xinxiang Wang

Main category: cs.CV

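TL;DR: This paper proposes LLaVA-SP, which improves the visual representation in multimodal large language models and strengthens their detailed understanding ability.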

Details Motivation: CLIP-ViT falls short in modeling local relationships between adjacent image patches, which limits how well MLLMs understand fine details. Method: A new Projector mechanism and two model variants (LLaVA-SP-Cropping and LLaVA-SP-Pooling) use convolutional kernels to extract visual spatial tokens from ViT patch features. Result: With nearly identical inference latency, LLaVA-SP significantly outperforms LLaVA-1.5 on multiple multimodal benchmarks. Conclusion: By adding a small number of visual spatial tokens, LLaVA-SP improves the visual representation of MLLMs and performs well on multimodal tasks. Abstract: The architecture of multimodal large language models (MLLMs) commonly connects a vision encoder, often based on CLIP-ViT, to a large language model. While CLIP-ViT works well for capturing global image features, it struggles to model local relationships between adjacent patches, leading to weaker visual representation, which in turn affects the detailed understanding ability of MLLMs. To solve this, we propose LLaVA-SP, which only adds six spatial visual tokens to the original visual tokens to enhance the visual representation. Our approach offers three key advantages: 1) We propose a novel Projector, which uses convolutional kernels to derive visual spatial tokens from ViT patch features, simulating two visual spatial ordering approaches: "from central region to global" and "from abstract to specific". Then, a cross-attention mechanism is applied to fuse fine-grained visual information, enriching the overall visual representation. 2) We present two model variants: LLaVA-SP-Cropping, which focuses on detail features through progressive cropping, and LLaVA-SP-Pooling, which captures global semantics through adaptive pooling, enabling the model to handle diverse visual understanding tasks. 3) Extensive experiments show that LLaVA-SP, fine-tuned with LoRA, achieves significant performance improvements across various multimodal benchmarks, outperforming the state-of-the-art LLaVA-1.5 model in multiple tasks with nearly identical inference latency. The code and models are available at https://github.com/CnFaker/LLaVA-SP.

[92] SCING:Towards More Efficient and Robust Person Re-Identification through Selective Cross-modal Prompt Tuning

Yunfei Xie,Yuxuan Cheng,Juncheng Wu,Haoyu Zhang,Yuyin Zhou,Shoudong Han

Main category: cs.CV

TL;DR: This paper proposes SCING, a lightweight framework for enhancing cross-modal alignment in vision-language models used for person re-identification, demonstrating strong performance with reduced computational cost.

Details Motivation: Recent methods for adapting vision-language models like CLIP to person re-identification tasks often rely on complex adapter designs or modality-specific tuning while neglecting cross-modal interaction, resulting in high computational costs or suboptimal alignment. This work aims to address these limitations by improving cross-modal alignment and robustness against real-world perturbations. Method: Selective Cross-modal Prompt Tuning (SCING) introduces two innovations: Selective Visual Prompt Fusion (SVIP), which dynamically injects discriminative visual features into text prompts through a cross-modal gating mechanism, and Perturbation-Driven Consistency Alignment (PDCA), a dual-path training strategy that enforces invariant feature alignment under random image perturbations. Result: Extensive experiments on popular benchmarks such as Market1501, DukeMTMC-ReID, Occluded-Duke, Occluded-REID, and P-DukeMTMC demonstrate the impressive performance of the proposed method. The framework eliminates heavy adapters while maintaining efficient inference. Conclusion: The proposed SCING framework effectively enhances cross-modal alignment and robustness against real-world perturbations in vision-language pre-training models for person re-identification tasks, achieving an optimal trade-off between performance and computational overhead. Abstract: Recent advancements in adapting vision-language pre-training models like CLIP for person re-identification (ReID) tasks often rely on complex adapter design or modality-specific tuning while neglecting cross-modal interaction, leading to high computational costs or suboptimal alignment. To address these limitations, we propose a simple yet effective framework named Selective Cross-modal Prompt Tuning (SCING) that enhances cross-modal alignment and robustness against real-world perturbations. 
Our method introduces two key innovations. First, we propose Selective Visual Prompt Fusion (SVIP), a lightweight module that dynamically injects discriminative visual features into text prompts via a cross-modal gating mechanism. Second, we propose Perturbation-Driven Consistency Alignment (PDCA), a dual-path training strategy that enforces invariant feature alignment under random image perturbations by regularizing consistency between original and augmented cross-modal embeddings. Extensive experiments on several popular benchmarks, covering Market1501, DukeMTMC-ReID, Occluded-Duke, Occluded-REID, and P-DukeMTMC, demonstrate the impressive performance of the proposed method. Notably, our framework eliminates heavy adapters while maintaining efficient inference, achieving an optimal trade-off between performance and computational overhead. The code will be released upon acceptance.

[93] ONLY: One-Layer Intervention Sufficiently Mitigates Hallucinations in Large Vision-Language Models

Zifu Wan,Ce Zhang,Silong Yong,Martin Q. Ma,Simon Stepputtis,Louis-Philippe Morency,Deva Ramanan,Katia Sycara,Yaqi Xie

Main category: cs.CV

TL;DR: This paper introduces ONLY, a training-free and efficient decoding method to reduce hallucinations in LVLMs, enhancing performance with minimal computational overhead.

Details Motivation: LVLMs face the persistent challenge of hallucination, which hinders their reliable deployment in real-world applications. Existing contrastive decoding methods to address this issue are unsuitable for real-time use due to high query requirements. Method: ONLY is a training-free decoding approach that amplifies crucial textual information using a text-to-visual entropy ratio for each token, requiring only a single query and a one-layer intervention during decoding. Result: Experimental results show that ONLY consistently outperforms state-of-the-art methods across various benchmarks while being computationally efficient. Conclusion: The proposed ONLY method effectively mitigates hallucination in Large Vision-Language Models (LVLMs) while enabling efficient real-time deployment with minimal computational cost. Abstract: Recent Large Vision-Language Models (LVLMs) have introduced a new paradigm for understanding and reasoning about image input through textual responses. Although they have achieved remarkable performance across a range of multi-modal tasks, they face the persistent challenge of hallucination, which introduces practical weaknesses and raises concerns about their reliable deployment in real-world applications. Existing work has explored contrastive decoding approaches to mitigate this issue, where the output of the original LVLM is compared and contrasted with that of a perturbed version. However, these methods require two or more queries that slow down LVLM response generation, making them less suitable for real-time applications. To overcome this limitation, we propose ONLY, a training-free decoding approach that requires only a single query and a one-layer intervention during decoding, enabling efficient real-time deployment. Specifically, we enhance textual outputs by selectively amplifying crucial textual information using a text-to-visual entropy ratio for each token. 
Extensive experimental results demonstrate that our proposed ONLY consistently outperforms state-of-the-art methods across various benchmarks while requiring minimal implementation effort and computational cost. Code is available at https://github.com/zifuwan/ONLY.

[94] Topology-Constrained Learning for Efficient Laparoscopic Liver Landmark Detection

Ruize Cui,Jiaan Zhang,Jialun Pei,Kai Wang,Pheng-Ann Heng,Jing Qin

Main category: cs.CV

TL;DR: This study proposes TopoNet, a novel topology-constrained deep learning framework for laparoscopic liver landmark detection, achieving high accuracy and efficiency by integrating RGB and depth information while preserving global topology.

Details Motivation: Liver landmarks are crucial for minimizing surgical risk during laparoscopic liver surgery, but automatic detection is challenging due to tubular structural properties and dynamic intraoperative deformations. Method: The study introduces TopoNet, a topology-constrained learning framework with a snake-CNN dual-path encoder and a boundary-aware topology fusion (BTF) module. It also incorporates a topological constraint loss function combining center-line constraint loss and topological persistence loss. Result: Extensive experiments on L3D and P2ILF datasets demonstrate that TopoNet achieves outstanding accuracy and computational efficiency. Conclusion: TopoNet has the potential for clinical applications in laparoscopic liver surgery. Abstract: Liver landmarks provide crucial anatomical guidance to the surgeon during laparoscopic liver surgery to minimize surgical risk. However, the tubular structural properties of landmarks and dynamic intraoperative deformations pose significant challenges for automatic landmark detection. In this study, we introduce TopoNet, a novel topology-constrained learning framework for laparoscopic liver landmark detection. Our framework adopts a snake-CNN dual-path encoder to simultaneously capture detailed RGB texture information and depth-informed topological structures. Meanwhile, we propose a boundary-aware topology fusion (BTF) module, which adaptively merges RGB-D features to enhance edge perception while preserving global topology. Additionally, a topological constraint loss function is embedded, which contains a center-line constraint loss and a topological persistence loss to ensure homotopy equivalence between predictions and labels. Extensive experiments on L3D and P2ILF datasets demonstrate that TopoNet achieves outstanding accuracy at low computational complexity, highlighting the potential for clinical applications in laparoscopic liver surgery. 
Our code will be available at https://github.com/cuiruize/TopoNet.

[95] Box-QAymo: Box-Referring VQA Dataset for Autonomous Driving

Djamahl Etchegaray,Yuxia Fu,Zi Huang,Yadan Luo

Main category: cs.CV

TL;DR: This paper introduces Box-QAymo, a new benchmark for evaluating and fine-tuning vision-language models (VLMs) in autonomous driving contexts, highlighting their current limitations in real-world perception tasks.

Details Motivation: Current VLMs often operate under idealized assumptions and struggle to capture user intent in real-world autonomous driving scenarios. Existing datasets are limited in scope, focusing on full-scene descriptions or waypoint predictions, making it difficult to assess localized query understanding. This work aims to bridge this gap by introducing a more realistic and focused evaluation framework. Method: The researchers introduced Box-QAymo, a box-referring dataset and benchmark, which includes a hierarchical evaluation protocol with sanity checks, attribute prediction, motion understanding, and spatiotemporal reasoning tasks. They crowd-sourced object classes, visual attributes, and extracted object trajectories to create temporally grounded QA pairs. Rigorous quality control methods were employed, including negative sampling, temporal consistency checks, and difficulty-aware balancing. Result: The comprehensive evaluation using Box-QAymo revealed notable shortcomings in existing VLMs when queried about perception-based questions relevant to autonomous driving. The dataset demonstrated robustness and diversity due to rigorous quality control measures, and the proposed framework enables targeted finetuning and assessment of VLMs in spatial and temporal reasoning tasks. Conclusion: The study reveals significant limitations in current vision-language models (VLMs) in handling perception questions related to autonomous driving scenarios, emphasizing the gap between current capabilities and real-world requirements. It provides a foundation for improving robustness and interpretability in VLMs for autonomous driving systems. Abstract: Interpretable communication is essential for safe and trustworthy autonomous driving, yet current vision-language models (VLMs) often operate under idealized assumptions and struggle to capture user intent in real-world scenarios. 
Existing driving-oriented VQA datasets are limited to full-scene descriptions or waypoint prediction, preventing the assessment of whether VLMs can respond to localized user-driven queries. We introduce Box-QAymo, a box-referring dataset and benchmark designed to both evaluate and finetune VLMs on spatial and temporal reasoning over user-specified objects. Users express intent by drawing bounding boxes, offering a fast and intuitive interface for focused queries in complex scenes. Specifically, we propose a hierarchical evaluation protocol that begins with binary sanity-check questions to assess basic model capacities, and progresses to (1) attribute prediction for box-referred objects, (2) motion understanding of target instances, and (3) spatiotemporal motion reasoning over inter-object dynamics across frames. To support this, we crowd-source fine-grained object classes and visual attributes that reflect the complexity drivers encounter, and extract object trajectories to construct temporally grounded QA pairs. Rigorous quality control through negative sampling, temporal consistency checks, and difficulty-aware balancing guarantees dataset robustness and diversity. Our comprehensive evaluation reveals significant limitations in current VLMs when queried about perception questions, highlighting the gap in achieving real-world performance. This work provides a foundation for developing more robust and interpretable autonomous driving systems that can communicate effectively with users under real-world conditions. Project page and dataset are available at https://djamahl99.github.io/qaymo-pages/.

[96] Not All Attention Heads Are What You Need: Refining CLIP's Image Representation with Attention Ablation

Feng Lin,Marco Chen,Haokui Zhang,Xiaotian Yu,Guangming Lu,Rong Xiao

Main category: cs.CV

TL;DR: This paper proposes a method called AAT that ablates detrimental attention heads to improve the downstream task performance of CLIP-family models.

Details Motivation: The authors hypothesize that certain attention heads negatively affect the final representation and that ablating them can improve downstream task performance. Method: The proposed Attention Ablation Technique (AAT) suppresses the contribution of specific heads by manipulating attention weights. Result: Experiments show that AAT consistently improves downstream task performance across domains, raising recall by up to 11.1% on cross-modal retrieval tasks. Conclusion: AAT effectively improves the downstream task performance of CLIP-family models with virtually no increase in inference cost. Abstract: This paper studies the role of attention heads in CLIP's image encoder. While CLIP has exhibited robust performance across diverse applications, we hypothesize that certain attention heads negatively affect final representations and that ablating them can improve performance in downstream tasks. To capitalize on this insight, we propose a simple yet effective method, called Attention Ablation Technique (AAT), to suppress the contribution of specific heads by manipulating attention weights. By integrating two alternative strategies tailored for different application scenarios, AAT systematically identifies and ablates detrimental attention heads to enhance representation quality. Experiments demonstrate that AAT consistently improves downstream task performance across various domains, boosting recall rate by up to 11.1% on CLIP-family models for cross-modal retrieval. The results highlight the potential of AAT to effectively refine large-scale vision-language models with virtually no increase in inference cost.
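
To make "suppressing a head by manipulating attention weights" concrete, here is a minimal sketch of one possible manipulation: replacing the selected head's attention rows with uniform weights, so that the head no longer attends selectively. This is an assumption for illustration only. The abstract does not say which manipulation AAT uses or how detrimental heads are identified, and the function name `ablate_heads` is hypothetical.

```python
def ablate_heads(attn, heads_to_ablate):
    """Return a copy of `attn` with the selected heads made uniform.

    `attn[h][q][k]` holds softmax attention weights for head h, query q,
    key k. Each ablated head keeps rows that sum to 1 but loses all
    input-dependent selectivity.
    """
    result = []
    for head_index, head in enumerate(attn):
        if head_index in heads_to_ablate:
            result.append([[1.0 / len(row)] * len(row) for row in head])
        else:
            result.append([list(row) for row in head])
    return result


# Two heads, two queries, two keys; ablate head 0 only.
attention = [
    [[0.9, 0.1], [0.2, 0.8]],
    [[0.5, 0.5], [1.0, 0.0]],
]
ablated = ablate_heads(attention, {0})
```

Because an edit of this kind only changes values inside an existing attention computation, it adds essentially no inference cost, which matches the abstract's claim.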

[97] LOD-GS: Level-of-Detail-Sensitive 3D Gaussian Splatting for Detail Conserved Anti-Aliasing

Zhenya Yang,Bingchen Gong,Kai Chen,Qi Dou

Main category: cs.CV

TL;DR: LOD-GS is a novel filtering framework for 3D Gaussian Splatting that improves anti-aliasing by considering sampling rate and camera distance, achieving superior rendering results.

Details Motivation: Aliasing artifacts remain a challenge in 3D Gaussian Splatting despite its efficiency and quality. Current methods fail to consider the impact of sampling rate and camera distance, leading to under-filtering or over-smoothing. Method: LOD-GS introduces basis functions for each Gaussian to model appearance variations based on sampling rate, which is influenced by focal length and camera distance. The parameters are jointly optimized end-to-end. Result: Extensive experiments show that LOD-GS outperforms existing methods in rendering quality while eliminating aliasing. A new synthetic dataset with varying camera distances was also introduced. Conclusion: The proposed LOD-GS framework effectively addresses aliasing issues in 3D Gaussian Splatting by dynamically predicting optimal filtering strength, and it achieves state-of-the-art rendering quality. Abstract: Despite the advancements in quality and efficiency achieved by 3D Gaussian Splatting (3DGS) in 3D scene rendering, aliasing artifacts remain a persistent challenge. Existing approaches primarily rely on low-pass filtering to mitigate aliasing. However, these methods are not sensitive to the sampling rate, often resulting in under-filtering and over-smoothing renderings. To address this limitation, we propose LOD-GS, a Level-of-Detail-sensitive filtering framework for Gaussian Splatting, which dynamically predicts the optimal filtering strength for each 3D Gaussian primitive. Specifically, we introduce a set of basis functions to each Gaussian, which take the sampling rate as input to model appearance variations, enabling sampling-rate-sensitive filtering. These basis function parameters are jointly optimized with the 3D Gaussian in an end-to-end manner. The sampling rate is influenced by both focal length and camera distance. 
However, existing methods and datasets rely solely on down-sampling to simulate focal length changes for anti-aliasing evaluation, overlooking the impact of camera distance. To enable a more comprehensive assessment, we introduce a new synthetic dataset featuring objects rendered at varying camera distances. Extensive experiments on both public datasets and our newly collected dataset demonstrate that our method achieves SOTA rendering quality while effectively eliminating aliasing. The code and dataset have been open-sourced.

[98] Zero-shot Skeleton-based Action Recognition with Prototype-guided Feature Alignment

Kai Zhou,Shuhai Zhang,Zeng You,Jinwu Hu,Mingkui Tan,Fei Liu

Main category: cs.CV

TL;DR: PGFA introduces a novel approach for zero-shot skeleton-based action recognition, improving skeleton-text alignment and achieving significant performance gains over existing techniques.

Details Motivation: Zero-shot skeleton-based action recognition is challenging due to generalization issues from seen to unseen actions and limitations in previous methods, such as insufficient discrimination of skeleton features and alignment bias. Method: An end-to-end cross-modal contrastive training framework combined with a prototype-guided text feature alignment strategy was developed for better skeleton-text alignment and discrimination. Result: PGFA achieved absolute accuracy improvements of 22.96%, 12.53%, and 18.54% on the NTU-60, NTU-120, and PKU-MMD datasets, respectively, compared to the SMIE method. Conclusion: The proposed PGFA method significantly improves zero-shot skeleton-based action recognition performance compared to existing methods like SMIE. Abstract: Zero-shot skeleton-based action recognition aims to classify unseen skeleton-based human actions without prior exposure to such categories during training. This task is extremely challenging due to the difficulty in generalizing from known to unknown actions. Previous studies typically use two-stage training: pre-training skeleton encoders on seen action categories using cross-entropy loss and then aligning pre-extracted skeleton and text features, enabling knowledge transfer to unseen classes through skeleton-text alignment and language models' generalization. However, their efficacy is hindered by 1) insufficient discrimination for skeleton features, as the fixed skeleton encoder fails to capture necessary alignment information for effective skeleton-text alignment; 2) the neglect of alignment bias between skeleton and unseen text features during testing. To this end, we propose a prototype-guided feature alignment paradigm for zero-shot skeleton-based action recognition, termed PGFA. Specifically, we develop an end-to-end cross-modal contrastive training framework to improve skeleton-text alignment, ensuring sufficient discrimination for skeleton features. 
Additionally, we introduce a prototype-guided text feature alignment strategy to mitigate the adverse impact of the distribution discrepancy during testing. We provide a theoretical analysis to support our prototype-guided text feature alignment strategy and empirically evaluate our overall PGFA on three well-known datasets. Compared with the top competitor SMIE method, our PGFA achieves absolute accuracy improvements of 22.96%, 12.53%, and 18.54% on the NTU-60, NTU-120, and PKU-MMD datasets, respectively.

[99] Out-of-distribution detection in 3D applications: a review

Zizhao Li,Xueyang Kang,Joseph West,Kourosh Khoshelham

Main category: cs.CV

TL;DR: This paper explores out-of-distribution (OOD) detection techniques in AI, focusing on their importance in improving reliability and generalization, especially in 3D applications like autonomous driving.

Details Motivation: The paper addresses the critical need for AI systems to detect objects not prevalent in training data, which is vital for real-world applications such as autonomous driving where misclassification or ignoring unseen objects can have serious consequences. Method: The authors provide a comprehensive overview through use cases, benchmark datasets, evaluation metrics, and comparative analysis of methodologies, including model structures, uncertainty indicators, distributional distance taxonomies, and calibration techniques. Result: The study presents a detailed analysis of OOD detection methods and highlights emerging research opportunities, particularly in adversarially robust OOD detection, failure identification, and integration with 3D vision. Conclusion: This paper concludes that OOD detection plays a crucial role in developing reliable, safe, and robust AI systems, especially in real-world applications like 3D vision where generalization beyond training data is essential. Abstract: The ability to detect objects that are not prevalent in the training set is a critical capability in many 3D applications, including autonomous driving. Machine learning methods for object recognition often assume that all object categories encountered during inference belong to a closed set of classes present in the training data. This assumption limits generalization to the real world, as objects not seen during training may be misclassified or entirely ignored. As part of reliable AI, OOD detection identifies inputs that deviate significantly from the training distribution. This paper provides a comprehensive overview of OOD detection within the broader scope of trustworthy and uncertain AI. We begin with key use cases across diverse domains, introduce benchmark datasets spanning multiple modalities, and discuss evaluation metrics. 
Next, we present a comparative analysis of OOD detection methodologies, exploring model structures, uncertainty indicators, and distributional distance taxonomies, alongside uncertainty calibration techniques. Finally, we highlight promising research directions, including adversarially robust OOD detection and failure identification, particularly relevant to 3D applications. The paper offers both theoretical and practical insights into OOD detection, showcasing emerging research opportunities such as 3D vision integration. These insights help new researchers navigate the field more effectively, contributing to the development of reliable, safe, and robust AI systems.

[100] AI-Generated Video Detection via Perceptual Straightening

Christian Internò,Robert Geirhos,Markus Olhofer,Sunny Liu,Barbara Hammer,David Klindt

Main category: cs.CV

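TL;DR: This paper proposes ReStraV, which exploits geometric properties of neural representations to distinguish natural videos from AI-generated ones, with efficient computation and high detection accuracy.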

Details Motivation: Existing detectors for AI-generated video struggle with generalization and with capturing subtle temporal inconsistencies, so an efficient detection technique is urgently needed to prevent misuse. Method: A pre-trained self-supervised vision transformer (DINOv2) is used to quantify temporal curvature and stepwise distance in the model's representation domain, and a classifier is trained on these measures. Result: The proposed lightweight classifier reaches 97.17% accuracy and 98.63% AUROC on the VidProM benchmark, substantially outperforming existing image- and video-based detection methods. Conclusion: ReStraV offers an effective way to detect AI-generated videos. By analyzing the statistics of temporal curvature and stepwise distance in the neural representation domain, it identifies such videos efficiently and accurately. Abstract: The rapid advancement of generative AI enables highly realistic synthetic videos, posing significant challenges for content authentication and raising urgent concerns about misuse. Existing detection methods often struggle with generalization and capturing subtle temporal inconsistencies. We propose ReStraV (Representation Straightening Video), a novel approach to distinguish natural from AI-generated videos. Inspired by the "perceptual straightening" hypothesis -- which suggests real-world video trajectories become straighter in the neural representation domain -- we analyze deviations from this expected geometric property. Using a pre-trained self-supervised vision transformer (DINOv2), we quantify the temporal curvature and stepwise distance in the model's representation domain. We aggregate statistics of these measures for each video and train a classifier. Our analysis shows that AI-generated videos exhibit significantly different curvature and distance patterns compared to real videos. A lightweight classifier achieves state-of-the-art detection performance (e.g., 97.17% accuracy and 98.63% AUROC on the VidProM benchmark), substantially outperforming existing image- and video-based methods. ReStraV is computationally efficient, offering a low-cost and effective detection solution. This work provides new insights into using neural representation geometry for AI-generated video detection.
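
The two geometric measures named in the abstract can be sketched as follows. The sketch assumes each frame has already been mapped to an embedding vector (in the paper, by DINOv2; here, plain lists of floats) and takes curvature to be the angle between successive displacement vectors. That definition and the function names are assumptions, since the abstract does not give exact formulas.

```python
import math


def stepwise_distances(embeddings):
    """Euclidean distance between each pair of consecutive frame embeddings."""
    return [math.dist(a, b) for a, b in zip(embeddings, embeddings[1:])]


def temporal_curvatures(embeddings):
    """Angle (radians) between successive displacement vectors.

    A perfectly straight trajectory yields angles of 0. Steps with zero
    length are skipped because their direction is undefined.
    """
    steps = [[y - x for x, y in zip(a, b)]
             for a, b in zip(embeddings, embeddings[1:])]
    angles = []
    for u, v in zip(steps, steps[1:]):
        norm_u, norm_v = math.hypot(*u), math.hypot(*v)
        if norm_u == 0 or norm_v == 0:
            continue
        cosine = sum(x * y for x, y in zip(u, v)) / (norm_u * norm_v)
        # Clamp to guard against rounding slightly outside [-1, 1].
        angles.append(math.acos(max(-1.0, min(1.0, cosine))))
    return angles


straight = [[0.0, 0.0], [1.0, 0.0], [2.0, 0.0]]
bent = [[0.0, 0.0], [1.0, 0.0], [1.0, 1.0]]
```

Per-video statistics of these two sequences (for example, mean and variance) would then form the feature vector for a lightweight classifier. Which statistics the paper aggregates is not stated in the abstract.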

[101] Similarity Memory Prior is All You Need for Medical Image Segmentation

Tang Hao,Guo ZhiQing,Wang LieJun,Liu Chao

Main category: cs.CV

TL;DR: This paper proposes Sim-MPNet, a novel network for medical image segmentation that mimics the brain's ability to recognize complex shapes, achieving better results than current state-of-the-art techniques.

Details Motivation: Recent findings on 'grandmother cells' in macaque visual cortex inspired the idea that recognizing complex visual patterns can be applied to improve medical image segmentation by focusing on subtle texture changes. Method: The paper introduces a Similarity Memory Prior Network (Sim-MPNet) with two key components: the Dynamic Memory Weights-Loss Attention (DMW-LA), which uses similarity memory priors to enhance feature learning, and the Double-Similarity Global Internal Enhancement Module (DS-GIM), which leverages cosine similarity and Euclidean distance to explore internal feature differences. Result: Experiments on four public datasets show that Sim-MPNet achieves superior segmentation performance compared to existing advanced methods. Conclusion: Sim-MPNet outperforms state-of-the-art methods in medical image segmentation, demonstrating its effectiveness and potential for future research. Abstract: In recent years, it has been found that "grandmother cells" in the primary visual cortex (V1) of macaques can directly recognize visual input with complex shapes. This inspires us to examine the value of these cells in promoting the research of medical image segmentation. In this paper, we design a Similarity Memory Prior Network (Sim-MPNet) for medical image segmentation. Specifically, we propose a Dynamic Memory Weights-Loss Attention (DMW-LA), which matches and remembers the category features of specific lesions or organs in medical images through the similarity memory prior in the prototype memory bank, thus helping the network to learn subtle texture changes between categories. DMW-LA also dynamically updates the similarity memory prior in reverse through a Weight-Loss Dynamic (W-LD) update strategy, effectively helping the network directly extract category features.
In addition, we propose the Double-Similarity Global Internal Enhancement Module (DS-GIM) to deeply explore the internal differences in the feature distribution of input data through cosine similarity and Euclidean distance. Extensive experiments on four public datasets show that Sim-MPNet has better segmentation performance than other state-of-the-art methods. Our code is available at https://github.com/vpsg-research/Sim-MPNet.

[102] Context-Aware Academic Emotion Dataset and Benchmark

Luming Zhao,Jingwen Xuan,Jiamin Lou,Yonghui Yu,Wenwu Yang

Main category: cs.CV

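TL;DR: This paper introduces the RAER dataset and the CLIP-CAER method: the former is the first academic emotion dataset covering diverse natural learning scenarios, and the latter improves academic emotion recognition accuracy by integrating facial expressions with context information.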

Details Motivation: Academic emotion analysis is crucial for assessing students' engagement and cognitive states, but academic emotion recognition remains at an exploratory stage because publicly available datasets are scarce. Method: The paper proposes CLIP-CAER, which uses learnable text prompts within the vision-language model CLIP to effectively integrate facial expressions and context cues from videos. Result: Experiments show that CLIP-CAER clearly outperforms state-of-the-art video-based facial expression recognition methods and underscore the important role of context cues in academic emotion recognition. Conclusion: By combining facial expressions with context cues, CLIP-CAER significantly improves academic emotion recognition accuracy, and the RAER dataset provides a valuable resource for future research. Abstract: Academic emotion analysis plays a crucial role in evaluating students' engagement and cognitive states during the learning process. This paper addresses the challenge of automatically recognizing academic emotions through facial expressions in real-world learning environments. While significant progress has been made in facial expression recognition for basic emotions, academic emotion recognition remains underexplored, largely due to the scarcity of publicly available datasets. To bridge this gap, we introduce RAER, a novel dataset comprising approximately 2,700 video clips collected from around 140 students in diverse, natural learning contexts such as classrooms, libraries, laboratories, and dormitories, covering both classroom sessions and individual study. Each clip was annotated independently by approximately ten annotators using two distinct sets of academic emotion labels with varying granularity, enhancing annotation consistency and reliability. To our knowledge, RAER is the first dataset capturing diverse natural learning scenarios. Observing that annotators naturally consider context cues (such as whether a student is looking at a phone or reading a book) alongside facial expressions, we propose CLIP-CAER (CLIP-based Context-aware Academic Emotion Recognition). Our method utilizes learnable text prompts within the vision-language model CLIP to effectively integrate facial expression and context cues from videos.
Experimental results demonstrate that CLIP-CAER substantially outperforms state-of-the-art video-based facial expression recognition methods, which are primarily designed for basic emotions, emphasizing the crucial role of context in accurately recognizing academic emotions. Project page: https://zgsfer.github.io/CAER

[103] Overtake Detection in Trucks Using CAN Bus Signals: A Comparative Study of Machine Learning Methods

Fernando Alonso-Fernandez,Talha Hanif Butt,Prayag Tiwari

Main category: cs.CV

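TL;DR: Using CAN bus data from five Volvo Group trucks, this study evaluates three classification algorithms for overtake detection and proposes a score-level fusion strategy to improve ADAS decision-making.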

Details Motivation: Safe overtaking manoeuvres are vital for preventing accidents and keeping traffic flowing. Advanced Driver Assistance Systems (ADAS) need to predict such manoeuvres accurately to make timely and informed decisions. Method: The study performs overtake detection using CAN bus data from five in-service Volvo Group vehicles, evaluates three common classifiers, Artificial Neural Networks (ANN), Random Forest (RF), and Support Vector Machines (SVM), and analyses how different preprocessing configurations affect performance. A score-level fusion strategy is also applied to improve performance. Result: The results show that variability in traffic conditions strongly influences signal patterns, particularly in the no-overtake class, and degrades classification performance when the training data lack sufficient diversity. Because the data were collected under real-world conditions, class diversity cannot be guaranteed in advance. Training on multi-vehicle data, however, improves generalisation. The per-vehicle analysis further shows that classification accuracy, especially for overtakes, depends on the amount of training data per vehicle. With score-level fusion, the best per-vehicle performance is obtained in most cases. Conclusion: Training on multi-vehicle data improves generalisation and reduces condition-specific bias, and score-level fusion yields the best per-truck performance in most cases. The fusion approach ultimately achieves TNR=93% and TPR=86.5%. Abstract: Safe overtaking manoeuvres in trucks are vital for preventing accidents and ensuring efficient traffic flow. Accurate prediction of such manoeuvres is essential for Advanced Driver Assistance Systems (ADAS) to make timely and informed decisions. In this study, we focus on overtake detection using Controller Area Network (CAN) bus data collected from five in-service trucks provided by the Volvo Group. We evaluate three common classifiers for vehicle manoeuvre detection, Artificial Neural Networks (ANN), Random Forest (RF), and Support Vector Machines (SVM), and analyse how different preprocessing configurations affect performance. We find that variability in traffic conditions strongly influences the signal patterns, particularly in the no-overtake class, affecting classification performance if training data lacks adequate diversity. Since the data were collected under unconstrained, real-world conditions, class diversity cannot be guaranteed a priori. However, training with data from multiple vehicles improves generalisation and reduces condition-specific bias. Our per-truck analysis also reveals that classification accuracy, especially for overtakes, depends on the amount of training data per vehicle. To address this, we apply a score-level fusion strategy, which yields the best per-truck performance across most cases. Overall, fusion achieves TNR=93% (True Negative Rate) and TPR=86.5% (True Positive Rate).
This research has been part of the BIG FUN project, which explores how Artificial Intelligence can be applied to logged vehicle data to understand and predict driver behaviour, particularly in relation to Camera Monitor Systems (CMS), being introduced as digital replacements for traditional exterior mirrors.
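
As a rough illustration of score-level fusion and of the two reported metrics, the sketch below averages per-classifier scores and then computes TPR and TNR at a fixed threshold. The abstract does not state which fusion rule or threshold the authors use, so the (weighted) mean rule, the 0.5 threshold, the function names, and all score values are assumptions made for illustration only.

```python
def fuse_scores(score_lists, weights=None):
    """Fuse per-classifier scores by a weighted mean (assumed fusion rule).

    score_lists holds one list of scores per classifier, aligned by sample.
    """
    n = len(score_lists)
    weights = weights or [1.0 / n] * n  # default: plain average
    return [
        sum(w * s for w, s in zip(weights, sample))
        for sample in zip(*score_lists)
    ]


def tpr_tnr(labels, scores, threshold=0.5):
    """Return (true positive rate, true negative rate).

    Label 1 means overtake, 0 means no overtake; both classes must be present.
    """
    preds = [1 if s >= threshold else 0 for s in scores]
    tp = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 1)
    tn = sum(1 for y, p in zip(labels, preds) if y == 0 and p == 0)
    positives = sum(1 for y in labels if y == 1)
    negatives = len(labels) - positives
    return tp / positives, tn / negatives


# Hypothetical scores from three classifiers on four samples.
labels = [1, 1, 0, 0]
ann = [0.9, 0.4, 0.2, 0.6]
rf = [0.8, 0.7, 0.1, 0.3]
svm = [0.7, 0.6, 0.3, 0.5]

fused = fuse_scores([ann, rf, svm])
print("ANN alone:", tpr_tnr(labels, ann))
print("Fused:    ", tpr_tnr(labels, fused))
```

In this toy data the single classifier misjudges one sample in each class, while averaging lets the other two classifiers outvote it; real gains depend on how correlated the classifiers' errors are.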

[104] World4Drive: End-to-End Autonomous Driving via Intention-aware Physical Latent World Model

Yupeng Zheng,Pengxuan Yang,Zebin Xing,Qichao Zhang,Yuhang Zheng,Yinfeng Gao,Pengfei Li,Teng Zhang,Zhongpu Xia,Peng Jia,Dongbin Zhao

Main category: cs.CV

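TL;DR: World4Drive proposes an end-to-end autonomous driving framework built on vision foundation models that constructs a latent world model through self-supervised learning, enabling multi-modal planning without perception annotations and achieving strong results on multiple benchmarks.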

Details Motivation: Existing end-to-end autonomous driving methods typically rely on costly perception supervision to extract scene information, which calls for a world model that needs no perception annotations and supports multi-modal planning through self-supervised learning. Method: World4Drive uses vision foundation models to extract scene features and generate multi-modal planning trajectories, introduces a world model selector module to evaluate and select the best trajectory, and achieves perception annotation-free end-to-end planning through self-supervised alignment between actual future observations and predicted observations reconstructed from the latent space. Result: World4Drive achieves state-of-the-art performance on the nuScenes and NavSim benchmarks, with an 18.1% relative reduction in L2 error, a 46.7% lower collision rate, and 3.75x faster training convergence. Conclusion: World4Drive is an autonomous driving framework that achieves end-to-end planning without manual perception annotations, builds a latent world model through self-supervised learning, and performs strongly on multiple benchmarks. Abstract: End-to-end autonomous driving directly generates planning trajectories from raw sensor data, yet it typically relies on costly perception supervision to extract scene information. A critical research challenge arises: constructing an informative driving world model to enable perception annotation-free, end-to-end planning via self-supervised learning. In this paper, we present World4Drive, an end-to-end autonomous driving framework that employs vision foundation models to build latent world models for generating and evaluating multi-modal planning trajectories. Specifically, World4Drive first extracts scene features, including driving intention and world latent representations enriched with spatial-semantic priors provided by vision foundation models. It then generates multi-modal planning trajectories based on current scene features and driving intentions and predicts multiple intention-driven future states within the latent space. Finally, it introduces a world model selector module to evaluate and select the best trajectory. We achieve perception annotation-free, end-to-end planning through self-supervised alignment between actual future observations and predicted observations reconstructed from the latent space. World4Drive achieves state-of-the-art performance without manual perception annotations on both the open-loop nuScenes and closed-loop NavSim benchmarks, demonstrating an 18.1% relative reduction in L2 error, a 46.7% lower collision rate, and 3.75x faster training convergence. Code will be available at https://github.com/ucaszyp/World4Drive.

[105] De-Simplifying Pseudo Labels to Enhancing Domain Adaptive Object Detection

Zehua Fu,Chenguang Liu,Yuyu Chen,Jiaqi Zhou,Qingjie Liu,Yunhong Wang

Main category: cs.CV

TL;DR: This paper proposes DeSimPL to address simple-label bias in self-labeling methods for unsupervised domain adaptation, achieving better performance through pseudo label optimization.

Details Motivation: Self-labeling methods in UDA for object detection suffer from performance gaps compared to domain alignment methods due to simple-label bias. Method: DeSimPL uses an instance-level memory bank and adversarial samples to reduce simple-label bias, along with an adaptive weighted loss to minimize false positives. Result: Experiments on four benchmarks show that DeSimPL improves detector performance by reducing simple samples during training. Conclusion: DeSimPL effectively reduces the proportion of simple samples during training, significantly improving the performance of self-labeling detectors. Abstract: Despite its significant success, object detection in traffic and transportation scenarios requires time-consuming and laborious efforts in acquiring high-quality labeled data. Therefore, Unsupervised Domain Adaptation (UDA) for object detection has recently gained increasing research attention. UDA for object detection has been dominated by domain alignment methods, which achieve top performance. Recently, self-labeling methods have gained popularity due to their simplicity and efficiency. In this paper, we investigate the limitations that prevent self-labeling detectors from achieving commensurate performance with domain alignment methods. Specifically, we identify the high proportion of simple samples during training, i.e., the simple-label bias, as the central cause. We propose a novel approach called De-Simplifying Pseudo Labels (DeSimPL) to mitigate the issue. DeSimPL utilizes an instance-level memory bank to implement an innovative pseudo label updating strategy. Then, adversarial samples are introduced during training to enhance the proportion. Furthermore, we propose an adaptive weighted loss to avoid the model suffering from an abundance of false positive pseudo labels in the late training period. 
Experimental results demonstrate that DeSimPL effectively reduces the proportion of simple samples during training, leading to a significant performance improvement for self-labeling detectors. Extensive experiments conducted on four benchmarks validate our analysis and conclusions.

[106] UMDATrack: Unified Multi-Domain Adaptive Tracking Under Adverse Weather Conditions

Siyuan Yao,Rui Zhu,Ziqi Wang,Wenqi Ren,Yanyang Yan,Xiaochun Cao

Main category: cs.CV

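TL;DR: UMDATrack is a new visual object tracking method that maintains high performance under various adverse weather conditions without requiring large amounts of annotated data or redundant model updates.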

Details Motivation: Most existing methods focus on learning target representations from daytime data, but under adverse weather conditions (such as nighttime or foggy environments) the large domain shift causes significant performance degradation, so a solution that copes with various adverse weather conditions is needed. Method: A controllable scenario generator first synthesizes a small amount of unlabeled video (fewer than 2% of the frames in the source daytime datasets) under different text prompts; a simple yet effective domain-customized adapter (DCA) and a target-aware confidence alignment module (TCA) are then designed to improve the model's adaptability and localization consistency across weather conditions. Result: Experiments show that UMDATrack surpasses existing advanced visual trackers, performs well under multiple adverse weather conditions, and sets a new state of the art. Conclusion: Through a unified domain adaptation framework, UMDATrack achieves high-quality target state prediction under a range of adverse weather conditions, surpassing existing advanced visual trackers by a significant margin. Abstract: Visual object tracking has gained promising progress in past decades. Most of the existing approaches focus on learning target representation in well-conditioned daytime data, while for the unconstrained real-world scenarios with adverse weather conditions, e.g. nighttime or foggy environment, the tremendous domain shift leads to significant performance degradation. In this paper, we propose UMDATrack, which is capable of maintaining high-quality target state prediction under various adverse weather conditions within a unified domain adaptation framework. Specifically, we first use a controllable scenario generator to synthesize a small amount of unlabeled videos (fewer than 2% of the frames in source daytime datasets) in multiple weather conditions under the guidance of different text prompts. Afterwards, we design a simple yet effective domain-customized adapter (DCA), allowing the target objects' representation to rapidly adapt to various weather conditions without redundant model updating. Furthermore, to enhance the localization consistency between source and target domains, we propose a target-aware confidence alignment module (TCA) following optimal transport theorem. Extensive experiments demonstrate that UMDATrack can surpass existing advanced visual trackers and achieve new state-of-the-art performance by a significant margin. Our code is available at https://github.com/Z-Z188/UMDATrack.

[107] LoD-Loc v2: Aerial Visual Localization over Low Level-of-Detail City Models using Explicit Silhouette Alignment

Juelin Zhu,Shuaibang Peng,Long Wang,Hanlin Tan,Yu Liu,Maojun Zhang,Shen Yan

Main category: cs.CV

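TL;DR: This paper proposes LoD-Loc v2, an aerial visual localization method based on low level-of-detail city models that achieves efficient and accurate localization through a coarse-to-fine silhouette alignment strategy, supports low-LoD (LoD1) models for the first time, and outperforms existing methods.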

Details Motivation: Existing methods such as LoD-Loc rely on high-LoD (LoD3 or LoD2) models, whereas most available models, and those planned for wide deployment, are low-LoD (LoD1); localization methods suited to low-LoD models are therefore needed to unlock drones' potential for localization in urban environments worldwide. Method: LoD-Loc v2 first uses a building segmentation network to produce building silhouettes; in the coarse pose selection stage it constructs a pose cost volume and selects the initial pose by taking the maximum value; in the fine pose estimation stage it refines the pose with a particle filter that incorporates a multi-beam tracking approach. Result: LoD-Loc v2 improves localization accuracy with high-LoD models and enables aerial localization with low-LoD models for the first time; it significantly outperforms state-of-the-art baselines and even surpasses texture-model-based methods. Conclusion: Using a coarse-to-fine strategy with explicit silhouette alignment, LoD-Loc v2 achieves aerial visual localization over low level-of-detail city models while improving estimation accuracy and broadening the convergence range for prior errors. Abstract: We propose a novel method for aerial visual localization over low Level-of-Detail (LoD) city models. The previous wireframe-alignment-based method LoD-Loc has shown promising localization results leveraging LoD models. However, LoD-Loc mainly relies on high-LoD (LoD3 or LoD2) city models, but the majority of available models and those many countries plan to construct nationwide are low-LoD (LoD1). Consequently, enabling localization on low-LoD city models could unlock drones' potential for global urban localization. To address these issues, we introduce LoD-Loc v2, which employs a coarse-to-fine strategy using explicit silhouette alignment to achieve accurate localization over low-LoD city models in the air. Specifically, given a query image, LoD-Loc v2 first applies a building segmentation network to shape building silhouettes. Then, in the coarse pose selection stage, we construct a pose cost volume by uniformly sampling pose hypotheses around a prior pose to represent the pose probability distribution. Each cost of the volume measures the degree of alignment between the projected and predicted silhouettes. We select the pose with maximum value as the coarse pose. In the fine pose estimation stage, a particle filtering method incorporating a multi-beam tracking approach is used to efficiently explore the hypothesis space and obtain the final pose estimation. To further facilitate research in this field, we release two datasets with LoD1 city models covering 10.7 km², along with real RGB queries and ground-truth pose annotations.
Experimental results show that LoD-Loc v2 improves estimation accuracy with high-LoD models and enables localization with low-LoD models for the first time. Moreover, it outperforms state-of-the-art baselines by large margins, even surpassing texture-model-based methods, and broadens the convergence basin to accommodate larger prior errors.

[108] A Unified Transformer-Based Framework with Pretraining For Whole Body Grasping Motion Generation

Edward Effendy,Kuan-Wei Tseng,Rei Kawakami

Main category: cs.CV

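TL;DR: This paper proposes a novel transformer framework that achieves efficient whole-body grasping through three stages, addresses data scarcity, and performs well on the GRAB dataset.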

Details Motivation: Existing methods for whole-body grasping are limited by the lack of sufficient hand-object interaction data. The paper therefore aims to develop a more stable and realistic grasping framework. Method: The paper adopts a three-stage pipeline: grasp pose generation, temporal infilling, and LiftUp Transformer refinement. It also introduces a data-efficient pretraining strategy to address the shortage of hand-object interaction data. Result: Experiments on the GRAB dataset show that the method outperforms current state-of-the-art baselines in coherence, stability, and visual realism. Conclusion: The paper proposes a new transformer-based framework for pose generation and motion infilling in whole-body grasping and demonstrates strong performance. Abstract: Accepted at ICIP 2025. We present a novel transformer-based framework for whole-body grasping that addresses both pose generation and motion infilling, enabling realistic and stable object interactions. Our pipeline comprises three stages: Grasp Pose Generation for full-body grasp generation, Temporal Infilling for smooth motion continuity, and a LiftUp Transformer that refines downsampled joints back to high-resolution markers. To overcome the scarcity of hand-object interaction data, we introduce a data-efficient Generalized Pretraining stage on large, diverse motion datasets, yielding robust spatio-temporal representations transferable to grasping tasks. Experiments on the GRAB dataset show that our method outperforms state-of-the-art baselines in terms of coherence, stability, and visual realism. The modular design also supports easy adaptation to other human-motion applications.

[109] Cage-Based Deformation for Transferable and Undefendable Point Cloud Attack

Keke Tang,Ziyong Du,Weilong Peng,Xiaofei Wang,Peican Zhu,Ligang Liu,Zhihong Tian

Main category: cs.CV

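TL;DR: This paper proposes CageAttack, a cage-based deformation framework for adversarial attacks that improves attack effectiveness while keeping adversarial point clouds natural-looking.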

Details Motivation: Existing adversarial attack methods that preserve geometric plausibility limit transferability and undefendability, while existing unstructured deformation methods can introduce unnatural distortions, so a new approach is needed. Method: CageAttack constructs a cage around the target object and applies perturbations to the cage vertices, producing natural deformations of the point cloud. Result: Experiments show that CageAttack performs strongly on seven 3D deep neural network classifiers, achieving a better balance among transferability, undefendability, and plausibility. Conclusion: CageAttack generates adversarial point clouds with high transferability, undefendability, and plausibility, outperforming existing techniques. Abstract: Adversarial attacks on point clouds often impose strict geometric constraints to preserve plausibility; however, such constraints inherently limit transferability and undefendability. While deformation offers an alternative, existing unstructured approaches may introduce unnatural distortions, making adversarial point clouds conspicuous and undermining their plausibility. In this paper, we propose CageAttack, a cage-based deformation framework that produces natural adversarial point clouds. It first constructs a cage around the target object, providing a structured basis for smooth, natural-looking deformation. Perturbations are then applied to the cage vertices, which seamlessly propagate to the point cloud, ensuring that the resulting deformations remain intrinsic to the object and preserve plausibility. Extensive experiments on seven 3D deep neural network classifiers across three datasets show that CageAttack achieves a superior balance among transferability, undefendability, and plausibility, outperforming state-of-the-art methods. Codes will be made public upon acceptance.

[110] Rectifying Magnitude Neglect in Linear Attention

Qihang Fan,Huaibo Huang,Yuang Ai,Ran He

Main category: cs.CV

TL;DR: This paper identifies that Linear Attention neglects Query magnitude, causing suboptimal performance. The proposed solution, MALA, rectifies this issue and performs well across diverse applications.

Details Motivation: Linear Attention suffers from significant performance degradation compared to Softmax Attention due to its disregard for Query magnitude information. Method: An analysis of Linear Attention's limitations was conducted, leading to the development of Magnitude-Aware Linear Attention (MALA), which incorporates Query magnitude into its computation. Result: MALA achieves strong results across various tasks such as image classification, object detection, NLP, and more, closely resembling Softmax Attention's performance while maintaining efficiency. Conclusion: The proposed MALA effectively addresses the issue of Linear Attention disregarding Query magnitude information, achieving performance comparable to Softmax Attention across multiple tasks. Abstract: As the core operator of Transformers, Softmax Attention exhibits excellent global modeling capabilities. However, its quadratic complexity limits its applicability to vision tasks. In contrast, Linear Attention shares a similar formulation with Softmax Attention while achieving linear complexity, enabling efficient global information modeling. Nevertheless, Linear Attention suffers from a significant performance degradation compared to standard Softmax Attention. In this paper, we analyze the underlying causes of this issue based on the formulation of Linear Attention. We find that, unlike Softmax Attention, Linear Attention entirely disregards the magnitude information of the Query. This prevents the attention score distribution from dynamically adapting as the Query scales. As a result, despite its structural similarity to Softmax Attention, Linear Attention exhibits a significantly different attention score distribution. Based on this observation, we propose Magnitude-Aware Linear Attention (MALA), which modifies the computation of Linear Attention to fully incorporate the Query's magnitude. 
This adjustment allows MALA to generate an attention score distribution that closely resembles Softmax Attention while exhibiting a more well-balanced structure. We evaluate the effectiveness of MALA on multiple tasks, including image classification, object detection, instance segmentation, semantic segmentation, natural language processing, speech recognition, and image generation. Our MALA achieves strong results on all of these tasks. Code will be available at https://github.com/qhfan/MALA
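
The claim that Linear Attention disregards Query magnitude can be checked numerically with a small sketch. It assumes the common normalized form of linear attention with a ReLU feature map (a positively homogeneous kernel); the abstract does not say which kernel the paper analyzes, and this is not MALA itself. Under that assumption, rescaling the query by any positive factor cancels in the normalization, whereas softmax attention becomes sharper.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def relu(v):
    return [max(0.0, x) for x in v]


def linear_attention_scores(q, keys):
    """Normalized linear attention scores with a ReLU feature map (assumed).

    Assumes at least one key has positive similarity with the query.
    """
    sims = [dot(relu(q), relu(k)) for k in keys]
    total = sum(sims)
    return [s / total for s in sims]


def softmax_attention_scores(q, keys):
    """Standard softmax attention scores (scaling factor omitted)."""
    logits = [dot(q, k) for k in keys]
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]


q = [1.0, 2.0]
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
q_scaled = [4.0 * x for x in q]

# Linear attention: identical distributions for q and 4q.
print(linear_attention_scores(q, keys))
print(linear_attention_scores(q_scaled, keys))
# Softmax attention: the distribution sharpens as the query grows.
print(softmax_attention_scores(q, keys))
print(softmax_attention_scores(q_scaled, keys))
```

With a feature map that is not positively homogeneous, the cancellation would only be approximate, so this sketch illustrates the mechanism rather than every linear attention variant.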

[111] BEV-VAE: Multi-view Image Generation with Spatial Consistency for Autonomous Driving

Zeming Chen,Hang Zhao

Main category: cs.CV

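TL;DR: This paper proposes BEV-VAE, a method for consistent and controllable view synthesis in autonomous driving that first builds a compact BEV latent space and then generates scenes with a latent diffusion transformer; experiments demonstrate its effectiveness on multiple datasets.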

Details Motivation: Multi-view image generation in autonomous driving requires 3D scene understanding across camera views, yet most existing methods treat the problem as 2D image set generation and lack explicit 3D modeling. A structured representation is therefore crucial for scene generation, especially in autonomous driving applications. Method: The paper proposes BEV-VAE for consistent and controllable view synthesis. It first trains a multi-view image variational autoencoder to obtain a compact and unified BEV latent space, then generates the scene with a latent diffusion transformer. Result: Experiments show that BEV-VAE performs strongly in 3D-consistent reconstruction and generation. Conclusion: BEV-VAE supports arbitrary view generation given camera configurations (and optionally 3D layouts) and shows strong performance in 3D-consistent reconstruction and generation on the nuScenes and Argoverse 2 datasets. Abstract: Multi-view image generation in autonomous driving demands consistent 3D scene understanding across camera views. Most existing methods treat this problem as a 2D image set generation task, lacking explicit 3D modeling. However, we argue that a structured representation is crucial for scene generation, especially for autonomous driving applications. This paper proposes BEV-VAE for consistent and controllable view synthesis. BEV-VAE first trains a multi-view image variational autoencoder for a compact and unified BEV latent space and then generates the scene with a latent diffusion transformer. BEV-VAE supports arbitrary view generation given camera configurations, and optionally 3D layouts. Experiments on nuScenes and Argoverse 2 (AV2) show strong performance in both 3D consistent reconstruction and generation. The code is available at: https://github.com/Czm369/bev-vae.

[112] TopoStreamer: Temporal Lane Segment Topology Reasoning in Autonomous Driving

Yiming Yang,Yueru Luo,Bingkun He,Hongbin Lin,Suzhong Fu,Chao Yan,Kun Tang,Xinrui Yan,Chao Zheng,Shuguang Cui,Zhen Li

Main category: cs.CV

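TL;DR: This paper proposes TopoStreamer, a new end-to-end temporal perception model for lane segment topology reasoning, and demonstrates its superior performance on the relevant dataset.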

Details Motivation: Existing methods are limited in consistent positional embedding and temporal multi-attribute learning, which hinders accurate road network reconstruction. Method: The paper proposes TopoStreamer, an end-to-end temporal perception model with three key improvements: streaming attribute constraints, dynamic lane boundary positional encoding, and lane segment denoising. Result: On the OpenLane-V2 dataset, TopoStreamer achieves significant gains of +3.4% mAP in lane segment perception and +2.1% OLS in centerline perception. Conclusion: By introducing streaming attribute constraints, dynamic lane boundary positional encoding, and lane segment denoising, TopoStreamer effectively addresses the shortcomings of existing methods in road network reconstruction. Abstract: Lane segment topology reasoning constructs a comprehensive road network by capturing the topological relationships between lane segments and their semantic types. This enables end-to-end autonomous driving systems to perform road-dependent maneuvers such as turning and lane changing. However, the limitations in consistent positional embedding and temporal multiple attribute learning in existing methods hinder accurate roadnet reconstruction. To address these issues, we propose TopoStreamer, an end-to-end temporal perception model for lane segment topology reasoning. Specifically, TopoStreamer introduces three key improvements: streaming attribute constraints, dynamic lane boundary positional encoding, and lane segment denoising. The streaming attribute constraints enforce temporal consistency in both centerline and boundary coordinates, along with their classifications. Meanwhile, dynamic lane boundary positional encoding enhances the learning of up-to-date positional information within queries, while lane segment denoising helps capture diverse lane segment patterns, ultimately improving model performance. Additionally, we assess the accuracy of existing models using a lane boundary classification metric, which serves as a crucial measure for lane-changing scenarios in autonomous driving. On the OpenLane-V2 dataset, TopoStreamer demonstrates significant improvements over state-of-the-art methods, achieving substantial performance gains of +3.4% mAP in lane segment perception and +2.1% OLS in centerline perception tasks.

[113] UPRE: Zero-Shot Domain Adaptation for Object Detection via Unified Prompt and Representation Enhancement

Xiao Zhang,Fei Wei,Yong Wang,Wenda Zhao,Feiyi Li,Xiangxiang Chu

Main category: cs.CV

TL;DR: This paper proposes the UPRE framework for Zero-shot domain adaptation, addressing challenges through joint optimization of textual prompts and visual representations, demonstrating superior performance on benchmark datasets.

Details Motivation: The motivation stems from the challenges faced in Zero-shot domain adaptation due to the lack of images in the target domain, and limitations in previous approaches that primarily address domain distribution shifts while overlooking misalignment between the detection task and Vision-Language Models. Method: The paper proposes the unified prompt and representation enhancement (UPRE) framework, which includes a multi-view domain prompt combining linguistic domain priors with detection-specific knowledge, and a visual representation enhancement module producing domain style variations. Multi-level enhancement strategies are also introduced to align multi-modal representations and capture diverse visual representations. Result: Extensive experiments on nine benchmark datasets demonstrate the superior performance of the proposed framework in ZSDA detection scenarios. Conclusion: The paper concludes that the proposed UPRE framework effectively addresses the challenges in ZSDA detection scenarios by jointly optimizing textual prompts and visual representations. Abstract: Zero-shot domain adaptation (ZSDA) presents substantial challenges due to the lack of images in the target domain. Previous approaches leverage Vision-Language Models (VLMs) to tackle this challenge, exploiting their zero-shot learning capabilities. However, these methods primarily address domain distribution shifts and overlook the misalignment between the detection task and VLMs, which rely on manually crafted prompts. To overcome these limitations, we propose the unified prompt and representation enhancement (UPRE) framework, which jointly optimizes both textual prompts and visual representations. Specifically, our approach introduces a multi-view domain prompt that combines linguistic domain priors with detection-specific knowledge, and a visual representation enhancement module that produces domain style variations. 
Furthermore, we introduce multi-level enhancement strategies, including relative domain distance and positive-negative separation, which align multi-modal representations at the image level and capture diverse visual representations at the instance level, respectively. Extensive experiments conducted on nine benchmark datasets demonstrate the superior performance of our framework in ZSDA detection scenarios. Code is available at https://github.com/AMAP-ML/UPRE.

[114] Holmes: Towards Effective and Harmless Model Ownership Verification to Personalized Large Vision Models via Decoupling Common Features

Linghui Zhu,Yiming Li,Haiqin Weng,Yan Liu,Tianwei Zhang,Shu-Tao Xia,Zhi Wang

Main category: cs.CV

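TL;DR: This paper studies the risk of model stealing attacks on personalized large vision models and proposes a harmless model ownership verification method based on decoupled features to effectively identify stolen models.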

Details Motivation: Personalized pre-trained models achieve excellent performance through fine-tuning on private local data, but this also exposes them to model stealing attacks, and existing defenses have limited effectiveness on fine-tuned models. Method: The method has three stages: (1) build shadow models that retain common features while disrupting dataset-specific features; (2) use a meta-classifier to determine whether a suspicious model contains the victim model's dataset-specific features; (3) verify model ownership through a hypothesis test to improve robustness. Result: Experiments on benchmark datasets show that the proposed method effectively detects multiple types of model stealing attacks simultaneously. Conclusion: The paper reveals the limitations of existing defenses on fine-tuned models and proposes a new ownership verification method that performs well in protecting model intellectual property. Abstract: Large vision models achieve remarkable performance in various downstream tasks, primarily by personalizing pre-trained models through fine-tuning with private and valuable local data, which makes the personalized model a valuable intellectual property for its owner. Similar to the era of traditional DNNs, model stealing attacks also pose significant risks to these personalized models. However, in this paper, we reveal that most existing defense methods (developed for traditional DNNs), typically designed for models trained from scratch, either introduce additional security risks, are prone to misjudgment, or are even ineffective for fine-tuned models. To alleviate these problems, this paper proposes a harmless model ownership verification method for personalized models by decoupling similar common features. In general, our method consists of three main stages. In the first stage, we create shadow models that retain common features of the victim model while disrupting dataset-specific features. We represent the dataset-specific features of the victim model by the output differences between the shadow and victim models. After that, a meta-classifier is trained to identify stolen models by determining whether suspicious models contain the dataset-specific features of the victim. In the third stage, we conduct model ownership verification by hypothesis test to mitigate randomness and enhance robustness. Extensive experiments on benchmark datasets verify the effectiveness of the proposed method in detecting different types of model stealing simultaneously.

[115] Biorthogonal Tunable Wavelet Unit with Lifting Scheme in Convolutional Neural Network

An Le,Hung Nguyen,Sungbal Seo,You-Suk Bae,Truong Nguyen

Main category: cs.CV

TL;DR: This paper proposes a flexible biorthogonal tunable wavelet unit that improves CNN performance in image classification and anomaly detection.

Details Motivation: To provide greater flexibility in filter design while improving convolution, pooling, and downsampling operations in CNNs. Method: A novel biorthogonal tunable wavelet unit was constructed using a lifting scheme to relax orthogonality and equal filter length constraints. Result: Improved classification accuracy by 2.12% on CIFAR-10 and 9.73% on DTD with ResNet-18; demonstrated competitive performance in anomaly detection on the MVTec dataset. Conclusion: The proposed biorthogonal tunable wavelet unit effectively enhances CNN operations, showing significant improvements in image classification and anomaly detection. Abstract: This work introduces a novel biorthogonal tunable wavelet unit constructed using a lifting scheme that relaxes both the orthogonality and equal filter length constraints, providing greater flexibility in filter design. The proposed unit enhances convolution, pooling, and downsampling operations, leading to improved image classification and anomaly detection in convolutional neural networks (CNN). When integrated into an 18-layer residual neural network (ResNet-18), the approach improved classification accuracy on CIFAR-10 by 2.12% and on the Describable Textures Dataset (DTD) by 9.73%, demonstrating its effectiveness in capturing fine-grained details. Similar improvements were observed in ResNet-34. For anomaly detection in the hazelnut category of the MVTec Anomaly Detection dataset, the proposed method achieved competitive and well-balanced performance in both segmentation and detection tasks, outperforming existing approaches in terms of accuracy and robustness.
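
To show why a lifting scheme leaves the filter coefficients free to tune, here is a minimal one-level sketch of a generic predict/update lifting step on a 1-D signal. It is not the paper's wavelet unit: the scalar coefficients `p` and `u` and the function names are hypothetical stand-ins for whatever tunable filters the authors learn. The point is only that the inverse undoes each step exactly, so reconstruction holds for any coefficient values.

```python
def lifting_forward(signal, p=1.0, u=0.5):
    """One lifting level on an even-length signal: split, predict, update."""
    even, odd = signal[0::2], signal[1::2]
    # Predict: detail is what the even samples fail to predict about the odd ones.
    detail = [o - p * e for e, o in zip(even, odd)]
    # Update: adjust the even samples with the detail to form the approximation.
    approx = [e + u * d for e, d in zip(even, detail)]
    return approx, detail


def lifting_inverse(approx, detail, p=1.0, u=0.5):
    """Undo the update, then the predict step, then interleave the samples."""
    even = [s - u * d for s, d in zip(approx, detail)]
    odd = [d + p * e for e, d in zip(even, detail)]
    signal = []
    for e, o in zip(even, odd):
        signal.extend([e, o])
    return signal


x = [2.0, 4.0, 6.0, 8.0]

# p=1, u=0.5 reproduces the (unnormalized) Haar transform: pairwise means and differences.
approx, detail = lifting_forward(x)
print(approx, detail)
print(lifting_inverse(approx, detail))

# Other coefficient values still reconstruct the input (up to rounding).
a2, d2 = lifting_forward(x, p=0.7, u=0.3)
print(lifting_inverse(a2, d2, p=0.7, u=0.3))
```

The approximation output is half the length of the input, which is why such a step can stand in for pooling or downsampling inside a CNN.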

[116] Improving the Reasoning of Multi-Image Grounding in MLLMs via Reinforcement Learning

Bob Zhang,Haoran Li,Tao Zhang,Cilin Yan,Jiayin Cai,Xiaolong Jiang,Yanbin Hao

Main category: cs.CV

TL;DR: This paper proposes an RL-based post-training strategy to enhance the reasoning capabilities of MLLMs in multi-image grounding tasks, achieving notable improvements and generalization across multiple benchmarks.

Details Motivation: The motivation stems from the limitations of current MLLMs in handling real-world applications involving complex multi-image compositions and multimodal instructions, particularly in cross-image reasoning and generalization. Method: The method involves synthesizing high-quality CoT data for cold-start initialization, followed by SFT using LoRA. After identifying correct solutions during the cold-start phase, rejection sampling is used to curate high-quality RL data, and rule-based RL guides the model toward optimal reasoning paths. Result: The approach achieved significant improvements, including a +9.04% enhancement on MIG-Bench and +4.98% improvements on out-of-domain reasoning benchmarks over the SFT baseline. It also showed strong generalization with gains of +3.1% and +2.4% on BLINK and MMIU benchmarks, respectively. Conclusion: The study concludes that adopting an RL-based post-training strategy significantly improves the reasoning performance of MLLMs in multi-image grounding tasks, demonstrating strong generalization and effectiveness across various benchmarks. Abstract: Recently, Multimodal Large Language Models (MLLMs) excel at visual grounding in single-image scenarios with textual references. However, their performance degrades when handling real-world applications involving complex multi-image compositions and multimodal instructions, which reveals limitations in cross-image reasoning and generalization. To address these challenges, we adopt a Reinforcement Learning (RL) based post-training strategy to improve the reasoning performance of MLLMs in multi-image grounding tasks. Our approach begins with synthesizing high-quality chain-of-thought (CoT) data for cold-start initialization, followed by supervised fine-tuning (SFT) using low-rank adaptation (LoRA). The cold-start training stage enables the model to identify correct solutions. 
Subsequently, we perform rejection sampling using the merged SFT model to curate high-quality RL data and leverage rule-based RL to guide the model toward optimal reasoning paths. Extensive experimental results demonstrate the effectiveness of our approach, achieving +9.04% improvements on MIG-Bench and +4.98% improvements on several out-of-domain reasoning grounding benchmarks over the SFT baseline. Furthermore, our approach exhibits strong generalization in multi-image perception, with gains of +3.1% and +2.4% over the base model on subsets of the BLINK and MMIU benchmarks, respectively.

[117] Multi-Modal Graph Convolutional Network with Sinusoidal Encoding for Robust Human Action Segmentation

Hao Xing,Kai Zhe Boey,Yuankai Wu,Darius Burschka,Gordon Cheng

Main category: cs.CV

TL;DR: This paper proposes MMGCN, a novel framework for accurate temporal segmentation of human actions by combining multi-modal data and advanced techniques, showing significant performance improvements on benchmark datasets.

Details Motivation: Accurate temporal segmentation of human actions is crucial for robots in collaborative settings, but existing methods face over-segmentation issues due to noise in pose estimation and object detection. Method: MMGCN combines low-frame-rate visual data with high-frame-rate motion data using a sinusoidal encoding strategy, a temporal graph fusion module, and SmoothLabelMix for data augmentation to enhance segmentation accuracy. Result: Experiments on the Bimanual Actions Dataset show that the MMGCN outperforms state-of-the-art methods, achieving F1@10: 94.5% and F1@25: 92.8% in action segmentation accuracy. Conclusion: The proposed MMGCN framework effectively improves temporal segmentation of human actions by integrating multi-modal data and introducing innovative techniques like sinusoidal encoding, temporal graph fusion, and SmoothLabelMix. Abstract: Accurate temporal segmentation of human actions is critical for intelligent robots in collaborative settings, where a precise understanding of sub-activity labels and their temporal structure is essential. However, the inherent noise in both human pose estimation and object detection often leads to over-segmentation errors, disrupting the coherence of action sequences. To address this, we propose a Multi-Modal Graph Convolutional Network (MMGCN) that integrates low-frame-rate (e.g., 1 fps) visual data with high-frame-rate (e.g., 30 fps) motion data (skeleton and object detections) to mitigate fragmentation. Our framework introduces three key contributions. First, a sinusoidal encoding strategy that maps 3D skeleton coordinates into a continuous sin-cos space to enhance spatial representation robustness. 
Second, a temporal graph fusion module that aligns multi-modal inputs with differing resolutions via hierarchical feature aggregation. Third, inspired by the smooth transitions inherent to human actions, we design SmoothLabelMix, a data augmentation technique that mixes input sequences and labels to generate synthetic training examples with gradual action transitions, enhancing temporal consistency in predictions and reducing over-segmentation artifacts. Extensive experiments on the Bimanual Actions Dataset, a public benchmark for human-object interaction understanding, demonstrate that our approach outperforms state-of-the-art methods, especially in action segmentation accuracy, achieving F1@10: 94.5% and F1@25: 92.8%.
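The sinusoidal encoding mentioned in this abstract maps raw 3D skeleton coordinates into a bounded sin-cos space. The following is a minimal sketch of that general idea, assuming a geometric ladder of frequencies; the function name `sinusoidal_encode`, the parameter `num_freqs`, and the choice of frequencies are hypothetical and are not taken from the paper.

```python
import math


def sinusoidal_encode(joint_xyz, num_freqs=4):
    """Map one 3D joint position to sin/cos features at several frequencies."""
    features = []
    for coord in joint_xyz:
        for k in range(num_freqs):
            freq = 2.0 ** k  # assumed frequency ladder: 1, 2, 4, 8, ...
            features.append(math.sin(freq * coord))
            features.append(math.cos(freq * coord))
    return features


# Each joint yields 3 coordinates x num_freqs x 2 (sin and cos) features.
encoded = sinusoidal_encode((0.5, -1.2, 3.0))
```

Every feature lies in [-1, 1] regardless of the scale of the input coordinates, which is one plausible reason such an encoding could be less sensitive to noisy pose estimates.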

[118] Language-Unlocked ViT (LUViT): Empowering Self-Supervised Vision Transformers with LLMs

Selim Kuzucu,Muhammad Ferjad Naeem,Anna Kukleva,Federico Tombari,Bernt Schiele

Main category: cs.CV

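TL;DR: This paper proposes the LUViT framework, which effectively combines LLMs and ViTs through a synergistic pre-training approach and achieves significant performance gains on vision tasks.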

Details Motivation: To address the difficulty of fusing LLMs with ViTs caused by modality mismatch, as well as unstable fine-tuning. Method: LUViT combines a Vision Transformer (ViT) with a Large Language Model (LLM) block, pre-training the ViT with Masked Auto-Encoding (MAE) while jointly optimizing Low-Rank Adaptation (LoRA) layers in the LLM. Result: Experiments show that LUViT performs better on a range of vision tasks, offering a more effective and efficient way to harness LLM knowledge. Conclusion: LUViT introduces a new approach that bridges the modality gap between vision and language models through a synergistic pre-training strategy and significantly improves performance on multiple downstream vision tasks. Abstract: The integration of Large Language Model (LLM) blocks with Vision Transformers (ViTs) holds immense promise for vision-only tasks by leveraging the rich semantic knowledge and reasoning capabilities of LLMs. However, a fundamental challenge lies in the inherent modality mismatch between text-centric pretraining of LLMs and vision-centric training of ViTs. Direct fusion often fails to fully exploit the LLM's potential and suffers from unstable finetuning. As a result, LLM blocks are kept frozen while only the vision components are learned. As a remedy to these challenges, we introduce Language-Unlocked Vision Transformers (LUViT), a novel approach that bridges this modality mismatch through a synergistic pre-training strategy. LUViT co-adapts a ViT backbone and an LLM fusion block by (1) employing Masked Auto-Encoding (MAE) to pre-train the ViT for richer visual representations, and (2) concurrently training Low-Rank Adaptation (LoRA) layers within the LLM block using the MAE objective. This joint optimization guides the ViT to produce LLM-aligned features and the LLM to effectively interpret visual information. We demonstrate through extensive experiments that LUViT significantly improves performance on various downstream vision tasks, showcasing a more effective and efficient pathway to harness LLM knowledge for visual understanding.

[119] Towards Open-World Human Action Segmentation Using Graph Convolutional Networks

Hao Xing,Kai Zhe Boey,Gordon Cheng

Main category: cs.CV

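TL;DR: This paper proposes a new framework for open-world human-object interaction segmentation that improves the detection and segmentation of unseen actions through three components: EPGCN, Mixup training, and a temporal clustering loss.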

Details Motivation: Existing learning-based methods perform well on closed-world action segmentation but generalize poorly to open-world scenarios where novel actions appear, so a model is needed that detects and segments out-of-distribution actions without manual annotation. Method: 1) An Enhanced Pyramid Graph Convolutional Network (EPGCN) with a decoder module; 2) Mixup training to synthesize out-of-distribution data; 3) a new temporal clustering loss. Result: Experiments show relative gains of 16.9% in the open-set segmentation metric F1@50 and 34.6% in the out-of-distribution detection metric AUROC. Conclusion: The paper proposes a structured framework for open-world action segmentation and detection of unseen actions, achieving significant performance gains through three innovations and validating its effectiveness on two challenging datasets. Abstract: Human-object interaction segmentation is a fundamental task of daily activity understanding, which plays a crucial role in applications such as assistive robotics, healthcare, and autonomous systems. Although most existing learning-based methods excel in closed-world action segmentation, they struggle to generalize to open-world scenarios where novel actions emerge. Collecting exhaustive action categories for training is impractical due to the dynamic diversity of human activities, necessitating models that detect and segment out-of-distribution actions without manual annotation. To address this issue, we formally define the open-world action segmentation problem and propose a structured framework for detecting and segmenting unseen actions. Our framework introduces three key innovations: 1) an Enhanced Pyramid Graph Convolutional Network (EPGCN) with a novel decoder module for robust spatiotemporal feature upsampling; 2) Mixup-based training to synthesize out-of-distribution data, eliminating reliance on manual annotations; 3) a novel Temporal Clustering loss that groups in-distribution actions while distancing out-of-distribution samples. We evaluate our framework on two challenging human-object interaction recognition datasets: Bimanual Actions and 2 Hands and Object (H2O) datasets. Experimental results demonstrate significant improvements over state-of-the-art action segmentation models across multiple open-set evaluation metrics, achieving 16.9% and 34.6% relative gains in open-set segmentation (F1@50) and out-of-distribution detection performances (AUROC), respectively. 
Additionally, we conduct an in-depth ablation study to assess the impact of each proposed component, identifying the optimal framework configuration for open-world action segmentation.
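Mixup, which this abstract uses to synthesize out-of-distribution data, forms a convex combination of two training samples. The sketch below shows only that basic operation. Pairing samples from different action classes and drawing the mixing weight from a mid-range interval are assumptions made here for illustration, and the labels `"reach"` and `"pour"` are hypothetical; the abstract does not specify these details.

```python
import random


def mixup(features_a, features_b, lam):
    """Convex combination lam * a + (1 - lam) * b of two feature vectors."""
    return [lam * a + (1.0 - lam) * b for a, b in zip(features_a, features_b)]


def synthesize_ood(samples, rng, low=0.3, high=0.7):
    """Mix two samples with different labels into one pseudo-OOD feature vector.

    samples: list of (features, label) pairs. Returns (mixed_features, lam).
    """
    pairs = [(a, b) for i, a in enumerate(samples)
             for b in samples[i + 1:] if a[1] != b[1]]
    if not pairs:
        raise ValueError("need at least two samples with different labels")
    (feat_a, _), (feat_b, _) = rng.choice(pairs)
    lam = rng.uniform(low, high)  # assumed range that keeps the mix far from both sources
    return mixup(feat_a, feat_b, lam), lam


rng = random.Random(0)
known = [([0.0, 0.0], "reach"), ([1.0, 1.0], "pour")]
pseudo_ood, lam = synthesize_ood(known, rng)
```

The mixed vector belongs to neither source class, so it can stand in for an unseen action during training without any extra manual annotation.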

[120] OptiPrune: Boosting Prompt-Image Consistency with Attention-Guided Noise and Dynamic Token Selection

Ziji Lu

Main category: cs.CV

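TL;DR: This paper proposes OptiPrune, a unified framework that combines distribution-aware initial noise optimization with similarity-based token pruning to address the challenges of semantic alignment and computational efficiency in text-to-image diffusion models.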

Details Motivation: Text-to-image diffusion models face two main challenges in generating images with accurate semantic alignment: the substantial computational overhead of noise optimization, and the loss of semantic fidelity caused by aggressive token pruning. A method is therefore needed that addresses both on resource-constrained hardware. Method: 1) A distribution-aware noise optimization module guided by attention scores steers the initial latent noise toward semantically meaningful regions; 2) a hardware-efficient token pruning strategy selects representative base tokens via patch-wise similarity, injects randomness before pruning to improve generalization, and recovers pruned tokens via maximum similarity copying before attention operations. Result: Experiments show that OptiPrune achieves state-of-the-art text-image consistency while significantly reducing computational cost. Conclusion: OptiPrune is an effective method that improves the inference efficiency of text-to-image diffusion models while maintaining high semantic alignment quality. Abstract: Text-to-image diffusion models often struggle to achieve accurate semantic alignment between generated images and text prompts while maintaining efficiency for deployment on resource-constrained hardware. Existing approaches either incur substantial computational overhead through noise optimization or compromise semantic fidelity by aggressively pruning tokens. In this work, we propose OptiPrune, a unified framework that combines distribution-aware initial noise optimization with similarity-based token pruning to address both challenges simultaneously. Specifically, (1) we introduce a distribution-aware noise optimization module guided by attention scores to steer the initial latent noise toward semantically meaningful regions, mitigating issues such as subject neglect and feature entanglement; (2) we design a hardware-efficient token pruning strategy that selects representative base tokens via patch-wise similarity, injects randomness to enhance generalization, and recovers pruned tokens using maximum similarity copying before attention operations. Our method preserves the Gaussian prior during noise optimization and enables efficient inference without sacrificing alignment quality. Experiments on benchmark datasets, including Animal-Animal, demonstrate that OptiPrune achieves state-of-the-art prompt-image consistency with significantly reduced computational cost.
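The "maximum similarity copying" step described here can be pictured as filling every pruned position with a copy of whichever retained base token it resembles most. The sketch below illustrates only that recovery step under simple assumptions: cosine similarity as the measure, base-token indices supplied by the caller, and a hypothetical function name `recover_pruned`. How OptiPrune actually selects base tokens and injects randomness is not reproduced here.

```python
import math


def cosine(a, b):
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def recover_pruned(tokens, base_idx, base_outputs):
    """Rebuild a full-length sequence from the outputs of the base tokens only.

    tokens: original token vectors; base_idx: indices of the retained tokens;
    base_outputs: one processed vector per retained token, in the same order.
    Each position copies the output of its most similar base token.
    """
    recovered = []
    for tok in tokens:
        best = max(range(len(base_idx)),
                   key=lambda j: cosine(tok, tokens[base_idx[j]]))
        recovered.append(list(base_outputs[best]))
    return recovered


tokens = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
base_idx = [0, 2]                                  # tokens kept after pruning
base_outputs = [[2.0, 0.0], [0.0, 2.0]]            # stand-in for an expensive layer's output
full = recover_pruned(tokens, base_idx, base_outputs)
```

Only the two base tokens pass through the expensive computation, yet the sequence handed to the next operation has its original length again.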

[121] LD-RPS: Zero-Shot Unified Image Restoration via Latent Diffusion Recurrent Posterior Sampling

Huaqiu Li,Yong Wang,Tongwen Huang,Hailang Huang,Haoqian Wang,Xiangxiang Chu

Main category: cs.CV

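TL;DR: This paper proposes a dataset-free, unified image restoration method that introduces a multimodal understanding model and a recurrent posterior sampling strategy, achieving superior performance without task-specific designs or paired data.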

Details Motivation: Existing image restoration methods are either designed for specific tasks, which limits their generalization across degradation types, or depend on paired datasets for training, which leads to closed-set constraints. A more general, dataset-free method is therefore needed. Method: A multimodal understanding model provides semantic priors, a lightweight module aligns the degraded input with the generative preference of the diffusion model, and recurrent refinement under a task-blind condition implements posterior sampling. Result: Extensive experiments show that the method outperforms current state-of-the-art methods, validating its effectiveness and robustness. Conclusion: The paper proposes a new, dataset-free, unified image restoration method that uses a pretrained latent diffusion model with recurrent posterior sampling, addressing the limitations of existing methods in generalization and closed-set constraints. Abstract: Unified image restoration is a significantly challenging task in low-level vision. Existing methods either make tailored designs for specific tasks, limiting their generalizability across various types of degradation, or rely on training with paired datasets, thereby suffering from closed-set constraints. To address these issues, we propose a novel, dataset-free, and unified approach through recurrent posterior sampling utilizing a pretrained latent diffusion model. Our method incorporates the multimodal understanding model to provide semantic priors for the generative model under a task-blind condition. Furthermore, it utilizes a lightweight module to align the degraded input with the generated preference of the diffusion model, and employs recurrent refinement for posterior sampling. Extensive experiments demonstrate that our method outperforms state-of-the-art methods, validating its effectiveness and robustness. Our code and data will be available at https://github.com/AMAP-ML/LD-RPS.

[122] Real-Time Inverse Kinematics for Generating Multi-Constrained Movements of Virtual Human Characters

Hendric Voss,Stefan Kopp

Main category: cs.CV

TL;DR: This paper presents a real-time inverse kinematics solver using TensorFlow's differentiation and compilation features, effectively generating realistic human movement with high performance and success rates compared to existing algorithms.

Details Motivation: The motivation is the importance of generating accurate and realistic virtual human movements in real-time for various applications in computer graphics, interactive virtual environments, robotics, and biomechanics. Method: The method involves an inverse kinematics solver using automatic differentiation and just-in-time compilation of TensorFlow, treating forward and inverse kinematics as differentiable operations. Result: Results show that the IK solver achieves real-time performance on the SMPLX human skeleton model and outperforms iterative-based IK algorithms like Cyclic Coordinate Descent, FABRIK, and IPOPT in terms of convergence speed, computational efficiency, and success rates. Conclusion: The paper concludes that their novel real-time inverse kinematics solver is effective for generating realistic human-like movements, demonstrating rapid convergence, minimal computational overhead, and improved success rates compared to existing methods. Abstract: Generating accurate and realistic virtual human movements in real-time is of high importance for a variety of applications in computer graphics, interactive virtual environments, robotics, and biomechanics. This paper introduces a novel real-time inverse kinematics (IK) solver specifically designed for realistic human-like movement generation. Leveraging the automatic differentiation and just-in-time compilation of TensorFlow, the proposed solver efficiently handles complex articulated human skeletons with high degrees of freedom. By treating forward and inverse kinematics as differentiable operations, our method effectively addresses common challenges such as error accumulation and complicated joint limits in multi-constrained problems, which are critical for realistic human motion modeling. 
We demonstrate the solver's effectiveness on the SMPLX human skeleton model, evaluating its performance against widely used iterative-based IK algorithms, like Cyclic Coordinate Descent (CCD), FABRIK, and the nonlinear optimization algorithm IPOPT. Our experiments cover both simple end-effector tasks and sophisticated, multi-constrained problems with realistic joint limits. Results indicate that our IK solver achieves real-time performance, exhibiting rapid convergence, minimal computational overhead per iteration, and improved success rates compared to existing methods. The project code is available at https://github.com/hvoss-techfak/TF-JAX-IK
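Treating forward kinematics as a differentiable function lets inverse kinematics be solved by gradient descent on the end-effector error. The sketch below illustrates that idea on a planar joint chain in plain Python. It swaps TensorFlow's automatic differentiation for a hand-derived gradient, and the function names, learning rate, step count, and symmetric joint limit are assumptions for this example; it is not the paper's solver and does not use the SMPLX skeleton.

```python
import math


def forward_kinematics(angles, lengths):
    """End-effector (x, y) of a planar chain with relative joint angles."""
    x = y = total = 0.0
    for angle, length in zip(angles, lengths):
        total += angle
        x += length * math.cos(total)
        y += length * math.sin(total)
    return x, y


def solve_ik(target, lengths, angles, lr=0.1, steps=2000, limit=math.pi):
    """Gradient descent on 0.5 * ||fk(angles) - target||^2 with clamped joints."""
    angles = list(angles)
    n = len(angles)
    for _ in range(steps):
        x, y = forward_kinematics(angles, lengths)
        err_x, err_y = x - target[0], y - target[1]
        # Absolute orientation of each link (running sum of joint angles).
        cumulative, total = [], 0.0
        for a in angles:
            total += a
            cumulative.append(total)
        for j in range(n):
            # Joint j moves every link from j onward.
            dx = sum(-lengths[k] * math.sin(cumulative[k]) for k in range(j, n))
            dy = sum(lengths[k] * math.cos(cumulative[k]) for k in range(j, n))
            grad = err_x * dx + err_y * dy
            # Simple joint limit: clamp each angle to [-limit, limit].
            angles[j] = max(-limit, min(limit, angles[j] - lr * grad))
    return angles


solution = solve_ik(target=(1.0, 1.0), lengths=[1.0, 1.0], angles=[0.3, 0.3])
reached = forward_kinematics(solution, [1.0, 1.0])
```

With an automatic differentiation framework, the hand-written `dx` and `dy` terms would be produced by the library instead, which is what makes the approach practical for skeletons with many joints and several simultaneous constraints.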

[123] TRACE: Temporally Reliable Anatomically-Conditioned 3D CT Generation with Enhanced Efficiency

Minye Shao,Xingyu Miao,Haoran Duan,Zeyu Wang,Jingkun Chen,Yawen Huang,Xian Wu,Jingjing Deng,Yang Long,Yefeng Zheng

Main category: cs.CV

TL;DR: TRACE is a framework for generating 3D medical images by using a 2D multimodal-conditioned diffusion method, effectively balancing anatomical accuracy, temporal coherence, and computational efficiency.

Details Motivation: Current methods for 3D medical image generation face challenges including limited anatomical fidelity, restricted axial length, and high computational costs, making them impractical for regions with limited resources. Method: TRACE uses a 2D multimodal-conditioned diffusion approach with segmentation priors and radiology reports for anatomical alignment, and optical flow for temporal coherence. An overlapping-frame strategy is employed during inference to reconstruct 3D volumes. Result: TRACE successfully generates spatiotemporally and anatomically aligned 3D medical images while maintaining computational efficiency. Conclusion: TRACE provides an effective solution for 3D medical image generation, balancing anatomical fidelity, spatiotemporal consistency, and computational efficiency. Abstract: 3D medical image generation is essential for data augmentation and patient privacy, calling for reliable and efficient models suited for clinical practice. However, current methods suffer from limited anatomical fidelity, restricted axial length, and substantial computational cost, placing them beyond reach for regions with limited resources and infrastructure. We introduce TRACE, a framework that generates 3D medical images with spatiotemporal alignment using a 2D multimodal-conditioned diffusion approach. TRACE models sequential 2D slices as video frame pairs, combining segmentation priors and radiology reports for anatomical alignment, incorporating optical flow to sustain temporal coherence. During inference, an overlapping-frame strategy links frame pairs into a flexible length sequence, reconstructed into a spatiotemporally and anatomically aligned 3D volume. Experimental results demonstrate that TRACE effectively balances computational efficiency with preserving anatomical fidelity and spatiotemporal consistency. Code is available at: https://github.com/VinyehShaw/TRACE.

[124] CAVALRY-V: A Large-Scale Generator Framework for Adversarial Attacks on Video MLLMs

Jiaming Zhang,Rui Hu,Qing Guo,Wei Yang Bryan Lim

Main category: cs.CV

TL;DR: This paper proposes CAVALRY-V, an innovative framework for adversarial attacks on V-MLLMs, achieving superior performance by targeting cross-modal integration with a dual-objective loss function and efficient two-stage training.

Details Motivation: Despite the impressive capabilities of Video Multimodal Large Language Models (V-MLLMs), their vulnerability to adversarial attacks remains underexplored due to complex cross-modal reasoning mechanisms, temporal dependencies, and computational constraints. Method: The paper introduces CAVALRY-V, which uses a dual-objective semantic-visual loss function and a computationally efficient two-stage generator framework to target the interface between visual perception and language generation in V-MLLMs. Result: Empirical evaluations show that CAVALRY-V achieves a 22.8% average improvement over baseline attacks on video understanding benchmarks and provides a 34.4% average gain in image understanding, demonstrating flexibility through implicit temporal coherence modeling. Conclusion: CAVALRY-V demonstrates significant improvements over existing attack methods on both commercial and open-source models, showing its potential as a foundational approach for adversarial research across multimodal systems. Abstract: Video Multimodal Large Language Models (V-MLLMs) have shown impressive capabilities in temporal reasoning and cross-modal understanding, yet their vulnerability to adversarial attacks remains underexplored due to unique challenges: complex cross-modal reasoning mechanisms, temporal dependencies, and computational constraints. We present CAVALRY-V (Cross-modal Language-Vision Adversarial Yielding for Videos), a novel framework that directly targets the critical interface between visual perception and language generation in V-MLLMs. Our approach introduces two key innovations: (1) a dual-objective semantic-visual loss function that simultaneously disrupts the model's text generation logits and visual representations to undermine cross-modal integration, and (2) a computationally efficient two-stage generator framework that combines large-scale pre-training for cross-model transferability with specialized fine-tuning for spatiotemporal coherence. 
Empirical evaluation on comprehensive video understanding benchmarks demonstrates that CAVALRY-V significantly outperforms existing attack methods, achieving 22.8% average improvement over the best baseline attacks on both commercial systems (GPT-4.1, Gemini 2.0) and open-source models (QwenVL-2.5, InternVL-2.5, Llava-Video, Aria, MiniCPM-o-2.6). Our framework achieves flexibility through implicit temporal coherence modeling rather than explicit regularization, enabling significant performance improvements even on image understanding (34.4% average gain). This capability demonstrates CAVALRY-V's potential as a foundational approach for adversarial research across multimodal systems.

[125] Instant Particle Size Distribution Measurement Using CNNs Trained on Synthetic Data

Yasser El Jarida,Youssef Iraqi,Loubna Mekouar

Main category: cs.CV

TL;DR: 本论文提出了一種基於卷積神經網絡(CNN)的方法,利用Blender生成的合成粒子圖像進行自動、實時的粒徑分佈(PSD)估計。

Details Motivation: Accurate particle size distribution (PSD) measurement is critical in industries such as mining, pharmaceuticals, and fertilizer manufacturing, yet traditional methods such as sieve analysis and laser diffraction are manual, time-consuming, and limited by particle overlap. A more efficient, automated method is therefore needed. Method: Realistic synthetic particle images are generated with Blender's advanced rendering capabilities and used to train three CNN models (ResNet-50, InceptionV3, and EfficientNet-B0) to predict the key PSD parameters (d10, d50, d90). Result: The three models reach similar accuracy in predicting the PSD parameters, but EfficientNet-B0 has the best computational efficiency, making it suitable for real-time industrial use. Conclusion: The study shows that training CNNs on realistic synthetic data has significant potential for automated industrial PSD monitoring and offers a workable route to practical deployment. Abstract: Accurate particle size distribution (PSD) measurement is important in industries such as mining, pharmaceuticals, and fertilizer manufacturing, significantly influencing product quality and operational efficiency. Traditional PSD methods like sieve analysis and laser diffraction are manual, time-consuming, and limited by particle overlap. Recent developments in convolutional neural networks (CNNs) enable automated, real-time PSD estimation directly from particle images. In this work, we present a CNN-based methodology trained on realistic synthetic particle imagery generated using Blender's advanced rendering capabilities. Synthetic data sets using this method can replicate various industrial scenarios by systematically varying particle shapes, textures, lighting, and spatial arrangements that closely resemble the actual configurations. We evaluated three CNN-based architectures, ResNet-50, InceptionV3, and EfficientNet-B0, for predicting critical PSD parameters (d10, d50, d90). Results demonstrated comparable accuracy across models, with EfficientNet-B0 achieving the best computational efficiency suitable for real-time industrial deployment. This approach shows the effectiveness of realistic synthetic data for robust CNN training, which offers significant potential for automated industrial PSD monitoring. The code is released at: https://github.com/YasserElj/Synthetic-Granular-Gen
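The PSD parameters d10, d50, and d90 are the particle sizes below which 10%, 50%, and 90% of the distribution falls. The sketch below shows how such values could be computed from a list of known particle diameters, for instance to produce regression targets for synthetic images. It uses a count-based distribution with linear interpolation, which is an assumption: industrial PSD figures are often volume- or mass-weighted, and the paper's exact definition is not given in the abstract. The function names are hypothetical.

```python
def psd_percentile(sizes, q):
    """Size below which a fraction q of the particles falls (count-based)."""
    ordered = sorted(sizes)
    if not ordered:
        raise ValueError("sizes must not be empty")
    rank = q * (len(ordered) - 1)          # fractional index into the sorted list
    lo = int(rank)
    hi = min(lo + 1, len(ordered) - 1)
    frac = rank - lo
    return ordered[lo] + frac * (ordered[hi] - ordered[lo])


def psd_summary(sizes):
    """Return the d10, d50, and d90 values for a list of particle diameters."""
    return {name: psd_percentile(sizes, q)
            for name, q in (("d10", 0.10), ("d50", 0.50), ("d90", 0.90))}


diameters_mm = [float(v) for v in range(1, 12)]   # 1.0 mm to 11.0 mm
summary = psd_summary(diameters_mm)
```

A CNN trained as the abstract describes would predict these three numbers directly from an image, so no individual particle needs to be segmented or measured at inference time.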

[126] High-Frequency Semantics and Geometric Priors for End-to-End Detection Transformers in Challenging UAV Imagery

Hongxing Peng,Lide Chen,Hui Zhu,Yan Chen

Main category: cs.CV

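TL;DR: This paper proposes HEGS-DETR, a new Transformer-based UAV object detection framework whose network structure and module designs yield significant gains on small-object detection, dense distributions, and cluttered backgrounds, while running in real time with fewer parameters.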

Details Motivation: Current UAV-OD algorithms rely on hand-crafted components, generalize poorly, and misclassify dense objects, so they need to be adapted to the characteristics of aerial imagery. Method: The HEGS-DETR framework is proposed, comprising the HFESNet backbone, the ESOP strategy, the SQR module, and the GAPE module. Result: Experiments show that on the VisDrone dataset HEGS-DETR improves AP$_{50}$ by 5.1% and AP by 3.8% over the baseline, while maintaining real-time speed and reducing the parameter count by 4M. Conclusion: The HEGS-DETR framework performs well in UAV object detection, addresses the limitations of existing methods, and achieves higher performance and efficiency. Abstract: Unmanned Aerial Vehicle-based Object Detection (UAV-OD) faces substantial challenges, including small target sizes, high-density distributions, and cluttered backgrounds in UAV imagery. Current algorithms often depend on hand-crafted components like anchor boxes, which demand fine-tuning and exhibit limited generalization, and Non-Maximum Suppression (NMS), which is threshold-sensitive and prone to misclassifying dense objects. These generic architectures thus struggle to adapt to aerial imaging characteristics, resulting in performance limitations. Moreover, emerging end-to-end frameworks have yet to effectively mitigate these aerial-specific challenges. To address these issues, we propose HEGS-DETR, a comprehensively enhanced, real-time Detection Transformer framework tailored for UAVs. First, we introduce the High-Frequency Enhanced Semantics Network (HFESNet) as a novel backbone. HFESNet preserves critical high-frequency spatial details to extract robust semantic features, thereby improving discriminative capability for small and occluded targets in complex backgrounds. Second, our Efficient Small Object Pyramid (ESOP) strategy strategically fuses high-resolution feature maps with minimal computational overhead, significantly boosting small object detection. Finally, the proposed Selective Query Recollection (SQR) and Geometry-Aware Positional Encoding (GAPE) modules enhance the detector's decoder stability and localization accuracy, effectively optimizing bounding boxes and providing explicit spatial priors for dense scenes. 
Experiments on the VisDrone dataset demonstrate that HEGS-DETR achieves a 5.1% AP$_{50}$ and 3.8% AP increase over the baseline, while maintaining real-time speed and reducing parameter count by 4M.

[127] Do Echo Top Heights Improve Deep Learning Nowcasts?

Peter Pavlík,Marc Schleiss,Anna Bou Ezzeddine,Viera Rozinajová

Main category: cs.CV

TL;DR: This paper investigates using Echo Top Height (ETH) with deep learning for improved short-term rainfall prediction, finding benefits at low intensities but challenges at higher rain rates.

Details Motivation: Precipitation nowcasting is crucial for weather-sensitive sectors, and while deep learning models have shown promise, most ignore vertical information in 3D radar volumes. This study explores the use of ETH as an auxiliary variable to address this gap. Method: A single-pass 3D U-Net was implemented to process radar reflectivity and Echo Top Height (ETH) as separate input channels for deep learning-based precipitation nowcasting. Result: Models using ETH showed improvement in skill at low rain-rate thresholds but produced inconsistent results at higher intensities, systematically underestimating precipitation intensity. Conclusion: The study concludes that while ETH can improve nowcasting skill at low rain-rate thresholds, it leads to underestimation of precipitation intensity at higher levels and increases error variance. The work sets a foundation for assessing additional variables' impact on nowcasting. Abstract: Precipitation nowcasting -- the short-term prediction of rainfall using recent radar observations -- is critical for weather-sensitive sectors such as transportation, agriculture, and disaster mitigation. While recent deep learning models have shown promise in improving nowcasting skill, most approaches rely solely on 2D radar reflectivity fields, discarding valuable vertical information available in the full 3D radar volume. In this work, we explore the use of Echo Top Height (ETH), a 2D projection indicating the maximum altitude of radar reflectivity above a given threshold, as an auxiliary input variable for deep learning-based nowcasting. We examine the relationship between ETH and radar reflectivity, confirming its relevance for predicting rainfall intensity. We implement a single-pass 3D U-Net that processes both the radar reflectivity and ETH as separate input channels. 
While our models are able to leverage ETH to improve skill at low rain-rate thresholds, results are inconsistent at higher intensities and the models with ETH systematically underestimate precipitation intensity. Three case studies are used to illustrate how ETH can help in some cases, but also confuse the models and increase the error variance. Nonetheless, the study serves as a foundation for critically assessing the potential contribution of additional variables to nowcasting performance.
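Echo Top Height, as defined in this abstract, collapses a 3D radar volume into a 2D field holding the highest altitude at which reflectivity still meets a threshold. The sketch below computes that projection for a small nested-list volume. The function name, the 20 dBZ default threshold, and the use of 0.0 for columns with no qualifying echo are assumptions made for illustration, not values taken from the paper.

```python
def echo_top_height(volume, heights_km, threshold_dbz=20.0):
    """Project a 3D reflectivity volume to a 2D echo-top-height field.

    volume[z][y][x] holds reflectivity in dBZ for altitude heights_km[z].
    Columns with no value at or above the threshold are set to 0.0 (assumed).
    """
    ny, nx = len(volume[0]), len(volume[0][0])
    eth = [[0.0] * nx for _ in range(ny)]
    for z, level in enumerate(volume):
        for y in range(ny):
            for x in range(nx):
                if level[y][x] >= threshold_dbz:
                    eth[y][x] = max(eth[y][x], heights_km[z])
    return eth


# Three altitude levels over a 1 x 2 grid: the left column has echoes up to 2 km.
volume = [
    [[30.0, 10.0]],   # 1 km
    [[25.0, 5.0]],    # 2 km
    [[10.0, 0.0]],    # 3 km
]
eth_field = echo_top_height(volume, heights_km=[1.0, 2.0, 3.0])
```

The resulting 2D field has the same horizontal shape as a reflectivity map, so the two can be supplied to a network as separate input channels, as the abstract describes.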

[128] UAVD-Mamba: Deformable Token Fusion Vision Mamba for Multimodal UAV Detection

Wei Li,Jiaman Tang,Yang Li,Beihao Xia,Ligang Tan,Hongmao Qin

Main category: cs.CV

TL;DR: This paper proposes UAVD-Mamba, a new multimodal UAV object detection framework based on Mamba architectures, which improves geometric adaptability, feature complementarity, and multiscale object detection performance.

Details Motivation: UAV object detection faces significant challenges such as occlusions, small object sizes, and irregular shapes, necessitating a robust and efficient multimodal detection method. Method: The study introduces UAVD-Mamba, a novel framework that incorporates Deformable Token Mamba Blocks (DTMBs) for improved geometric adaptability and multimodal feature complementarity. It also includes multiscale feature representations and a Detection Neck for Mamba (DNM) inspired by the YOLO series. Result: Experimental results on the DroneVehicle dataset show that the proposed method outperforms the baseline OAFA method by 3.6% in the mAP metric. Conclusion: The proposed UAVD-Mamba framework demonstrates improved performance in multimodal UAV object detection, particularly in handling occlusions, small object sizes, and irregular shapes. Abstract: Unmanned Aerial Vehicle (UAV) object detection has been widely used in traffic management, agriculture, emergency rescue, etc. However, it faces significant challenges, including occlusions, small object sizes, and irregular shapes. These challenges highlight the necessity for a robust and efficient multimodal UAV object detection method. Mamba has demonstrated considerable potential in multimodal image fusion. Leveraging this, we propose UAVD-Mamba, a multimodal UAV object detection framework based on Mamba architectures. To improve geometric adaptability, we propose the Deformable Token Mamba Block (DTMB) to generate deformable tokens by incorporating adaptive patches from deformable convolutions alongside normal patches from normal convolutions, which serve as the inputs to the Mamba Block. To optimize the multimodal feature complementarity, we design two separate DTMBs for the RGB and infrared (IR) modalities, with the outputs from both DTMBs integrated into the Mamba Block for feature extraction and into the Fusion Mamba Block for feature fusion. 
Additionally, to improve multiscale object detection, especially for small objects, we stack four DTMBs at different scales to produce multiscale feature representations, which are then sent to the Detection Neck for Mamba (DNM). The DNM module, inspired by the YOLO series, includes modifications to the SPPF and C3K2 of YOLOv11 to better handle the multiscale features. In particular, we employ cross-enhanced spatial attention before the DTMB and cross-channel attention after the Fusion Mamba Block to extract more discriminative features. Experimental results on the DroneVehicle dataset show that our method outperforms the baseline OAFA method by 3.6% in the mAP metric. Codes will be released at https://github.com/GreatPlum-hnu/UAVD-Mamba.git.

[129] Robust Component Detection for Flexible Manufacturing: A Deep Learning Approach to Tray-Free Object Recognition under Variable Lighting

Fatemeh Sadat Daneshmand

Main category: cs.CV

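TL;DR: This paper proposes a Mask R-CNN-based computer vision system that enables industrial robots to detect and grasp pen components in arbitrary orientations without structured placement while maintaining robust performance under varying lighting conditions, improving manufacturing flexibility and reducing setup time.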

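Details Motivation: As flexible manufacturing systems develop under Industry 4.0, robots must handle objects in unstructured environments without rigid positioning constraints. This study aims to develop a computer vision system that improves manufacturing flexibility and efficiency. Method: The paper implements and evaluates a Mask R-CNN-based approach on a complete pen manufacturing line, addressing three key challenges: object detection without positional constraints, robustness to extreme lighting variations, and reliable performance with cost-effective cameras. Result: The system achieves 95% detection accuracy across different lighting conditions while eliminating the need for structured component placement, demonstrating a 30% reduction in setup time and a significant improvement in manufacturing flexibility. Extensive testing under four distinct lighting scenarios validates its applicability to real-world industrial deployment. Conclusion: The proposed Mask R-CNN-based computer vision system enables industrial robots to detect and grasp pen components in arbitrary orientations, removes the need for structured placement, and performs robustly under varying lighting conditions. Abstract: Flexible manufacturing systems in Industry 4.0 require robots capable of handling objects in unstructured environments without rigid positioning constraints. This paper presents a computer vision system that enables industrial robots to detect and grasp pen components in arbitrary orientations without requiring structured trays, while maintaining robust performance under varying lighting conditions. We implement and evaluate a Mask R-CNN-based approach on a complete pen manufacturing line at ZHAW, addressing three critical challenges: object detection without positional constraints, robustness to extreme lighting variations, and reliable performance with cost-effective cameras. Our system achieves 95% detection accuracy across diverse lighting conditions while eliminating the need for structured component placement, demonstrating a 30% reduction in setup time and significant improvement in manufacturing flexibility. The approach is validated through extensive testing under four distinct lighting scenarios, showing practical applicability for real-world industrial deployment.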

[130] SafeMap: Robust HD Map Construction from Incomplete Observations

Xiaoshuai Hao,Lingdong Kong,Rong Yin,Pengwei Wang,Jing Zhang,Yunfeng Diao,Shu Zhao

Main category: cs.CV

TL;DR: This paper introduces SafeMap, a novel framework for robust HD map construction that effectively handles incomplete multi-view camera data by integrating two modules: G-PVR for dynamic prioritization of informative regions and D-BEVC for correcting BEV representations, leading to enhanced performance and reliability.

Details Motivation: Existing methods struggle with incomplete multi-view camera data in constructing robust high-definition maps for autonomous driving, necessitating a more secure and accurate approach. Method: SafeMap integrates two key components: the Gaussian-based Perspective View Reconstruction (G-PVR) module and the Distillation-based Bird's-Eye-View (BEV) Correction (D-BEVC) module. G-PVR dynamically prioritizes informative regions based on available camera views, while D-BEVC corrects BEV representations using panoramic BEV features. Result: Experimental results show that SafeMap significantly outperforms previous methods in both complete and incomplete scenarios, highlighting its superior performance and reliability. Conclusion: SafeMap is a robust and reliable framework for HD map construction that significantly outperforms previous methods in handling incomplete multi-view camera data, offering a plug-and-play solution for enhanced robustness. Abstract: Robust high-definition (HD) map construction is vital for autonomous driving, yet existing methods often struggle with incomplete multi-view camera data. This paper presents SafeMap, a novel framework specifically designed to secure accuracy even when certain camera views are missing. SafeMap integrates two key components: the Gaussian-based Perspective View Reconstruction (G-PVR) module and the Distillation-based Bird's-Eye-View (BEV) Correction (D-BEVC) module. G-PVR leverages prior knowledge of view importance to dynamically prioritize the most informative regions based on the relationships among available camera views. Furthermore, D-BEVC utilizes panoramic BEV features to correct the BEV representations derived from incomplete observations. Together, these components facilitate the end-to-end map reconstruction and robust HD map generation. SafeMap is easy to implement and integrates seamlessly into existing systems, offering a plug-and-play solution for enhanced robustness. 
Experimental results demonstrate that SafeMap significantly outperforms previous methods in both complete and incomplete scenarios, highlighting its superior performance and reliability.

[131] Is Visual in-Context Learning for Compositional Medical Tasks within Reach?

Simon Reiß,Zdravko Marinov,Alexander Jaus,Constantin Seibold,M. Saquib Sarfraz,Erik Rodner,Rainer Stiefelhagen

Main category: cs.CV

TL;DR: This paper proposes a method for improving visual in-context learning by training models to handle complex, multi-step tasks using synthetic task sequences and optimized training strategies.

Details Motivation: The motivation is to enable a single model to adapt to multiple tasks during test time without re-training, particularly focusing on sequences of tasks rather than isolated ones, allowing flexible vision pipelines. Method: The authors used a synthetic compositional task generation engine built from segmentation datasets, explored masking-based training objectives, and analyzed the properties and limitations of visual in-context learning architectures. Result: The study provides insights into training visual in-context learners for compositional tasks, especially relevant for multi-modal medical applications, and identifies key challenges that need addressing. Conclusion: The paper concludes that visual in-context learning can be enhanced to handle multi-step tasks using a synthetic task generation engine and appropriate training objectives, offering flexibility at test time while highlighting challenges for future research. Abstract: In this paper, we explore the potential of visual in-context learning to enable a single model to handle multiple tasks and adapt to new tasks during test time without re-training. Unlike previous approaches, our focus is on training in-context learners to adapt to sequences of tasks, rather than individual tasks. Our goal is to solve complex tasks that involve multiple intermediate steps using a single model, allowing users to define entire vision pipelines flexibly at test time. To achieve this, we first examine the properties and limitations of visual in-context learning architectures, with a particular focus on the role of codebooks. We then introduce a novel method for training in-context learners using a synthetic compositional task generation engine. This engine bootstraps task sequences from arbitrary segmentation datasets, enabling the training of visual in-context learners for compositional tasks. 
Additionally, we investigate different masking-based training objectives to gather insights into how to train models better for solving complex, compositional tasks. Our exploration not only provides important insights especially for multi-modal medical task sequences but also highlights challenges that need to be addressed.

[132] GaussianVLM: Scene-centric 3D Vision-Language Models using Language-aligned Gaussian Splats for Embodied Reasoning and Beyond

Anna-Maria Halacheva,Jan-Nico Zaech,Xi Wang,Danda Pani Paudel,Luc Van Gool

Main category: cs.CV

TL;DR: This paper introduces a novel scene-centric 3D Vision-Language Model that reduces reliance on object detectors by embedding language into 3D Gaussian splats, achieving better performance and generalization.

Details Motivation: Current 3D Vision-Language Models rely heavily on object detectors, which introduce bottlenecks and limit flexibility. This work aims to overcome these issues with a more efficient and adaptable approach. Method: The method embeds linguistic features into Gaussian primitives for early modality alignment and uses a dual sparsifier to create compact, task-relevant tokens. Result: The model achieves a five-fold improvement in performance over prior 3D VLMs in out-of-domain scenarios, demonstrating strong generalization. Conclusion: The proposed scene-centric 3D VLM effectively improves performance in out-of-domain settings by leveraging photorealistic 3D representations and sparse task-aware tokens. Abstract: As multimodal language models advance, their application to 3D scene understanding is a fast-growing frontier, driving the development of 3D Vision-Language Models (VLMs). Current methods show strong dependence on object detectors, introducing processing bottlenecks and limitations in taxonomic flexibility. To address these limitations, we propose a scene-centric 3D VLM for 3D Gaussian splat scenes that employs language- and task-aware scene representations. Our approach directly embeds rich linguistic features into the 3D scene representation by associating language with each Gaussian primitive, achieving early modality alignment. To process the resulting dense representations, we introduce a dual sparsifier that distills them into compact, task-relevant tokens via task-guided and location-guided pathways, producing sparse, task-aware global and local scene tokens. Notably, we present the first Gaussian splatting-based VLM, leveraging photorealistic 3D representations derived from standard RGB images, demonstrating strong generalization: it improves on the performance of the prior 3D VLM fivefold in out-of-domain settings.

[133] Masks make discriminative models great again!

Tianshi Cao,Marie-Julie Rakotosaona,Ben Poole,Federico Tombari,Michael Niemeyer

Main category: cs.CV

TL;DR: Image2GS focuses on reconstructing visible parts of a 3D scene from a single image using masked training with visibility-aware Gaussian splats, achieving strong results without relying on hallucinating unseen content.

Details Motivation: The authors aim to address the challenge of reconstructing photorealistic 3D scenes from a single image by focusing on the image-to-3D lifting component and separating it from the hallucination of unseen content. Method: Image2GS decouples the image-to-3D lifting task from content completion and uses visibility masks derived from optimized 3D Gaussian splats to train discriminative models only on visible areas. Result: The masked training strategy significantly improves reconstruction quality in visible regions compared to baselines, and Image2GS remains competitive with state-of-the-art models trained on full target images when evaluated on complete scenes. Conclusion: Image2GS demonstrates that focusing on visible regions during training can lead to competitive performance compared to models trained on full images, highlighting the importance of specialized techniques for the image-to-3D lifting problem. Abstract: We present Image2GS, a novel approach that addresses the challenging problem of reconstructing photorealistic 3D scenes from a single image by focusing specifically on the image-to-3D lifting component of the reconstruction process. By decoupling the lifting problem (converting an image to a 3D model representing what is visible) from the completion problem (hallucinating content not present in the input), we create a more deterministic task suitable for discriminative models. Our method employs visibility masks derived from optimized 3D Gaussian splats to exclude areas not visible from the source view during training. This masked training strategy significantly improves reconstruction quality in visible regions compared to strong baselines. Notably, despite being trained only on masked regions, Image2GS remains competitive with state-of-the-art discriminative models trained on full target images when evaluated on complete scenes. 
Our findings highlight the fundamental struggle discriminative models face when fitting unseen regions and demonstrate the advantages of addressing image-to-3D lifting as a distinct problem with specialized techniques.
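
The masked training idea described above can be illustrated with a small sketch. It is not the authors' implementation: the per-pixel squared-error loss, the function name `masked_mse`, and the nested-list image layout are all assumptions chosen for illustration. The only point it shows is that pixels outside the visibility mask contribute nothing to the training signal.

```python
def masked_mse(pred, target, mask):
    """Mean squared error over pixels whose mask value is truthy.

    pred, target, and mask are equally sized 2D lists (hypothetical layout).
    Pixels where mask is 0 are treated as not visible from the source view
    and are ignored entirely.
    """
    squared_error = 0.0
    visible = 0
    for pred_row, target_row, mask_row in zip(pred, target, mask):
        for p, t, m in zip(pred_row, target_row, mask_row):
            if m:
                squared_error += (p - t) ** 2
                visible += 1
    # With no visible pixels there is nothing to supervise.
    return squared_error / visible if visible else 0.0


pred = [[1.0, 5.0], [2.0, 0.0]]
target = [[1.0, 0.0], [0.0, 0.0]]
mask = [[1, 0], [1, 1]]  # the top-right pixel is occluded in the source view
loss = masked_mse(pred, target, mask)
```

In this toy case the large error at the occluded pixel (5.0 vs 0.0) does not affect `loss`, which is how a discriminative model avoids being penalized for content it cannot observe.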

[134] MVP: Winning Solution to SMP Challenge 2025 Video Track

Liliang Ye,Yunyao Zhang,Yafeng Wu,Yi-Ping Phoebe Chen,Junqing Yu,Wei Yang,Zikai Song

Main category: cs.CV

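TL;DR: This paper proposes the Multimodal Video Predictor (MVP) for predicting the popularity of social media videos. It integrates deep video features from pretrained models with user metadata and contextual information, uses a gradient-boosted regression model to capture complex cross-modal patterns, and took first place in the Video Track of the SMP Challenge 2025.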

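Details Motivation: Accurately predicting the popularity of social media videos supports applications such as content recommendation, trend detection, and user engagement, but the task remains challenging because of the complexity of video content, user behavior, and platform environments. Method: The authors build MVP, a multimodal video prediction framework, in three steps: (1) systematic preprocessing, such as log-transformation and outlier removal; (2) extraction of deep video features from pretrained models, fused with user metadata and contextual information; (3) a gradient-boosted regression model that learns complex patterns across the multimodal features. Result: The method ranked first in the official evaluation of the SMP Challenge 2025 Video Track, demonstrating its effectiveness and reliability for multimodal video popularity prediction on social media platforms. Conclusion: The proposed MVP framework effectively integrates multimodal features and provides a strong, reliable solution for social media video popularity prediction. Abstract: Social media platforms serve as central hubs for content dissemination, opinion expression, and public engagement across diverse modalities. Accurately predicting the popularity of social media videos enables valuable applications in content recommendation, trend detection, and audience engagement. In this paper, we present Multimodal Video Predictor (MVP), our winning solution to the Video Track of the SMP Challenge 2025. MVP constructs expressive post representations by integrating deep video features extracted from pretrained models with user metadata and contextual information. The framework applies systematic preprocessing techniques, including log-transformations and outlier removal, to improve model robustness. A gradient-boosted regression model is trained to capture complex patterns across modalities. Our approach ranked first in the official evaluation of the Video Track, demonstrating its effectiveness and reliability for multimodal video popularity prediction on social platforms. The source code is available at https://anonymous.4open.science/r/SMPDVideo.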
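
The preprocessing step mentioned above (log-transformation and outlier removal) might look roughly like the following sketch. The abstract does not say which outlier rule or ordering the authors use, so the interquartile-range fence, the `k=1.5` threshold, the ordering (log first, then filtering), and the function names are all assumptions made for illustration.

```python
import math
import statistics


def log_transform(counts):
    """Compress heavy-tailed popularity counts with log(1 + x)."""
    return [math.log1p(c) for c in counts]


def remove_outliers(values, k=1.5):
    """Keep values inside the IQR fence [q1 - k*iqr, q3 + k*iqr].

    The IQR rule is an assumed stand-in; the paper's actual rule is not
    stated in the abstract.
    """
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if low <= v <= high]


# Hypothetical view counts: seven ordinary posts and one viral post.
views = [10, 12, 11, 13, 9, 10, 12, 1_000_000]
cleaned = remove_outliers(log_transform(views))
```

Here the viral post still lies far outside the fence after the log transform and is dropped, while the seven ordinary posts are kept. The cleaned targets would then be passed, together with the fused features, to a gradient-boosted regressor.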

[135] Surgical Neural Radiance Fields from One Image

Alberto Neri,Maximilan Fehrentz,Veronica Penza,Leonardo S. Mattos,Nazim Haouchine

Main category: cs.CV

TL;DR: This study introduces a method to train Neural Radiance Fields (NeRF) using just one intraoperative image and preoperative MRI data, making 3D reconstruction feasible in time-sensitive surgical scenarios.

Details Motivation: Neural Radiance Fields (NeRF) require extensive multi-view data, which is impractical in surgical settings due to time constraints. This work aims to enable efficient NeRF training using only a single intraoperative image and preoperative data. Method: Preoperative MRI data was used to determine camera viewpoints and images for training. Neural style transfer (WTC2 and STROTSS) was applied to align the appearance of intraoperative images with pre-constructed training sets, enabling rapid single-image NeRF training. Result: Evaluated on four clinical neurosurgical cases, the method showed strong agreement with NeRF models trained on real surgical images. High similarity metrics confirmed reconstruction fidelity, texture preservation, and structural accuracy. Conclusion: The approach makes single-image NeRF training feasible in surgical settings, effectively overcoming the limitations of traditional multi-view methods. Abstract: Purpose: Neural Radiance Fields (NeRF) offer exceptional capabilities for 3D reconstruction and view synthesis, yet their reliance on extensive multi-view data limits their application in surgical intraoperative settings where only limited data is available. In particular, collecting such extensive data intraoperatively is impractical due to time constraints. This work addresses this challenge by leveraging a single intraoperative image and preoperative data to train NeRF efficiently for surgical scenarios. Methods: We leverage preoperative MRI data to define the set of camera viewpoints and images needed for robust and unobstructed training. Intraoperatively, the appearance of the surgical image is transferred to the pre-constructed training set through neural style transfer, specifically combining WTC2 and STROTSS to prevent over-stylization. This process enables the creation of a dataset for instant and fast single-image NeRF training. Results: The method is evaluated with four clinical neurosurgical cases. 
Quantitative comparisons to NeRF models trained on real surgical microscope images demonstrate strong synthesis agreement, with similarity metrics indicating high reconstruction fidelity and stylistic alignment. When compared with ground truth, our method demonstrates high structural similarity, confirming good reconstruction quality and texture preservation. Conclusion: Our approach demonstrates the feasibility of single-image NeRF training in surgical settings, overcoming the limitations of traditional multi-view methods.

[136] RTMap: Real-Time Recursive Mapping with Change Detection and Localization

Yuheng Du,Sheng Yang,Lingxuan Wang,Zhenghua Hou,Chengying Cai,Zhitao Tan,Mingxia Chen,Shi-Sheng Huang,Qiang Li

Main category: cs.CV

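TL;DR: RTMap is a new online HD mapping method that uses a multi-traversal crowdsourced map as a self-evolving memory, addressing perceptual inaccuracies, occlusion, and multi-agent fusion, and performs well in both map quality and localization accuracy.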

Details Motivation: Existing online HD mapping methods remain limited by perceptual inaccuracies, occlusion in dense traffic, and an inability to fuse multi-agent observations. RTMap is proposed to improve on these methods. Method: On the onboard agent, RTMap addresses three core problems end to end: (1) uncertainty-aware modeling of HD map elements; (2) probability-aware localization with respect to the crowdsourced prior map; (3) real-time detection of possible road structural changes. Result: Experiments on several public autonomous driving datasets show solid prior-aided map quality and localization accuracy. RTMap robustly serves downstream prediction and planning modules while gradually and asynchronously improving the accuracy and freshness of the crowdsourced prior map. Conclusion: RTMap is an online HD mapping method that uses a multi-traversal crowdsourced map as a self-evolving memory, improving on existing single-traversal methods. Abstract: While recent online HD mapping methods relieve burdened offline pipelines and solve map freshness, they remain limited by perceptual inaccuracies, occlusion in dense traffic, and an inability to fuse multi-agent observations. We propose RTMap to enhance these single-traversal methods by persistently crowdsourcing a multi-traversal HD map as a self-evolutional memory. On onboard agents, RTMap simultaneously addresses three core challenges in an end-to-end fashion: (1) Uncertainty-aware positional modeling for HD map elements, (2) probabilistic-aware localization w.r.t. the crowdsourced prior-map, and (3) real-time detection for possible road structural changes. Experiments on several public autonomous driving datasets demonstrate our solid performance on both the prior-aided map quality and the localization accuracy, demonstrating our effectiveness of robustly serving downstream prediction and planning modules while gradually improving the accuracy and freshness of the crowdsourced prior-map asynchronously. Our source-code will be made publicly available at https://github.com/CN-ADLab/RTMap (Camera ready version incorporating reviewer suggestions will be updated soon).

[137] Evaluating Robustness of Monocular Depth Estimation with Procedural Scene Perturbations

Jack Nugent,Siyang Wu,Zeyu Ma,Beining Han,Meenal Parakh,Abhishek Joshi,Lingjie Mei,Alexander Raistrick,Xinyuan Li,Jia Deng

Main category: cs.CV

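TL;DR: This paper introduces PDE, a new benchmark that uses procedural generation to create 3D scenes for systematically evaluating the robustness of monocular depth estimation models under a variety of controlled perturbations.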

Details Motivation: Standard benchmarks do not fully assess the robustness of monocular depth estimation, so a new benchmark is needed to systematically test performance under different perturbations. Method: Procedural generation is used to create 3D scenes that test robustness to various controlled perturbations. Result: The analysis reveals which perturbations are challenging for state-of-the-art depth models. Conclusion: PDE is a new monocular depth estimation benchmark for systematic robustness evaluation, which the authors hope will inform future research. Abstract: Recent years have witnessed substantial progress on monocular depth estimation, particularly as measured by the success of large models on standard benchmarks. However, performance on standard benchmarks does not offer a complete assessment, because most evaluate accuracy but not robustness. In this work, we introduce PDE (Procedural Depth Evaluation), a new benchmark which enables systematic robustness evaluation. PDE uses procedural generation to create 3D scenes that test robustness to various controlled perturbations, including object, camera, material and lighting changes. Our analysis yields interesting findings on what perturbations are challenging for state-of-the-art depth models, which we hope will inform further research. Code and data are available at https://github.com/princeton-vl/proc-depth-eval.

[138] UniGlyph: Unified Segmentation-Conditioned Diffusion for Precise Visual Text Synthesis

Yuanrui Wang,Cong Han,YafeiLi,Zhipeng Jin,Xiawei Li,SiNan Du,Wen Tao,Yi Yang,shuanglong li,Chun Yuan,Liu Lin

Main category: cs.CV

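TL;DR: This paper proposes a new segmentation-guided framework that addresses key problems in visual text rendering for text-to-image generation, achieving better glyph and style preservation and leading results on multiple benchmarks.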

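Details Motivation: Current text-to-image generation still struggles to render visual text accurately, suffering from blurred glyphs, semantic drift, and limited style control. Existing methods rely on pre-rendered glyph images as conditional inputs, which struggle to retain original font styles and color cues and require complex multi-branch designs that increase model overhead and reduce flexibility. This work aims to provide a more efficient and flexible text rendering solution. Method: The approach has two key components: (1) a fine-tuned bilingual text segmentation model for precise text mask extraction; (2) a diffusion model augmented with an adaptive glyph conditioning module and a region-specific loss to improve the content and style fidelity of the generated text. The authors also propose two new benchmarks, GlyphMM-benchmark and MiniText-benchmark, for evaluating generation quality in complex typesetting and small-scale text regions. Result: Experiments show that the method significantly surpasses prior methods on the AnyText benchmark in both Chinese and English settings. It also performs strongly on the newly proposed GlyphMM-benchmark and MiniText-benchmark, indicating strong generalization and deployment readiness for complex typesetting and small-scale text generation. Conclusion: The study proposes a new segmentation-guided framework for visual text rendering in text-to-image generation. By using pixel-level visual text masks as unified conditional inputs together with an optimized diffusion model, it significantly outperforms existing methods on multiple benchmarks, particularly in small text rendering and complex layout preservation. Abstract: Text-to-image generation has greatly advanced content creation, yet accurately rendering visual text remains a key challenge due to blurred glyphs, semantic drift, and limited style control. Existing methods often rely on pre-rendered glyph images as conditions, but these struggle to retain original font styles and color cues, necessitating complex multi-branch designs that increase model overhead and reduce flexibility. To address these issues, we propose a segmentation-guided framework that uses pixel-level visual text masks -- rich in glyph shape, color, and spatial detail -- as unified conditional inputs. Our method introduces two core components: (1) a fine-tuned bilingual segmentation model for precise text mask extraction, and (2) a streamlined diffusion model augmented with adaptive glyph conditioning and a region-specific loss to preserve textual fidelity in both content and style. Our approach achieves state-of-the-art performance on the AnyText benchmark, significantly surpassing prior methods in both Chinese and English settings. To enable more rigorous evaluation, we also introduce two new benchmarks: GlyphMM-benchmark for testing layout and glyph consistency in complex typesetting, and MiniText-benchmark for assessing generation quality in small-scale text regions. 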
Experimental results show that our model outperforms existing methods by a large margin in both scenarios, particularly excelling at small text rendering and complex layout preservation, validating its strong generalization and deployment readiness.

[139] GLM-4.1V-Thinking: Towards Versatile Multimodal Reasoning with Scalable Reinforcement Learning

Wenyi Hong,Wenmeng Yu,Xiaotao Gu,Guo Wang,Guobing Gan,Haomiao Tang,Jiale Cheng,Ji Qi,Junhui Ji,Lihang Pan,Shuaiqi Duan,Weihan Wang,Yan Wang,Yean Cheng,Zehai He,Zhe Su,Zhen Yang,Ziyang Pan,Aohan Zeng,Baoxu Wang,Boyan Shi,Changyu Pang,Chenhui Zhang,Da Yin,Fan Yang,Guoqing Chen,Jiazheng Xu,Jiali Chen,Jing Chen,Jinhao Chen,Jinghao Lin,Jinjiang Wang,Junjie Chen,Leqi Lei,Leyi Pan,Mingzhi Zhang,Qinkai Zheng,Sheng Yang,Shi Zhong,Shiyu Huang,Shuyuan Zhao,Siyan Xue,Shangqin Tu,Shengbiao Meng,Tianshu Zhang,Tianwei Luo,Tianxiang Hao,Tianle Gong,Wenkai Li,Wei Jia,Xin Lyu,Xuancheng Huang,Yanling Wang,Yadong Xue,Yanfeng Wang,Yifan An,Yifan Du,Yiming Shi,Yiheng Huang,Yilin Niu,Yuan Wang,Yuanchang Yue,Yuchen Li,Yutao Zhang,Yuxuan Zhang,Zhanxiao Du,Zhenyu Hou,Zhao Xue,Zhengxiao Du,Zihan Wang,Peng Zhang,Debing Liu,Bin Xu,Juanzi Li,Minlie Huang,Yuxiao Dong,Jie Tang

Main category: cs.CV

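TL;DR: GLM-4.1V-Thinking is a capable vision-language model that achieves strong performance across many domains through a new training framework, and it has been open-sourced.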

Details Motivation: To develop a general-purpose multimodal reasoning model for complex tasks including STEM problem solving, video understanding, and document analysis. Method: A vision foundation model is built through large-scale pre-training, and Reinforcement Learning with Curriculum Sampling (RLCS) is then used to improve model performance. Result: Across 28 public benchmarks, the model outperforms Qwen2.5-VL-7B on nearly all tasks and matches or exceeds the significantly larger Qwen2.5-VL-72B on 18 benchmarks. Conclusion: GLM-4.1V-Thinking is an advanced vision-language model that performs well across a wide range of tasks, matching or outperforming larger models and, on some challenging tasks, closed-source models such as GPT-4o. Abstract: We present GLM-4.1V-Thinking, a vision-language model (VLM) designed to advance general-purpose multimodal reasoning. In this report, we share our key findings in the development of the reasoning-centric training framework. We first develop a capable vision foundation model with significant potential through large-scale pre-training, which arguably sets the upper bound for the final performance. Reinforcement Learning with Curriculum Sampling (RLCS) then unlocks the full potential of the model, leading to comprehensive capability enhancement across a diverse range of tasks, including STEM problem solving, video understanding, content recognition, coding, grounding, GUI-based agents, and long document understanding, among others. To facilitate research in this field, we open-source GLM-4.1V-9B-Thinking, which achieves state-of-the-art performance among models of comparable size. In a comprehensive evaluation across 28 public benchmarks, our model outperforms Qwen2.5-VL-7B on nearly all tasks and achieves comparable or even superior performance on 18 benchmarks relative to the significantly larger Qwen2.5-VL-72B. Notably, GLM-4.1V-9B-Thinking also demonstrates competitive or superior performance compared to closed-source models such as GPT-4o on challenging tasks including long document understanding and STEM reasoning, further underscoring its strong capabilities. Code, models and more information are released at https://github.com/THUDM/GLM-4.1V-Thinking.

[140] ShapeEmbed: a self-supervised learning framework for 2D contour quantification

Anna Foix Romero,Craig Russell,Alexander Krull,Virginie Uhlmann

Main category: cs.CV

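TL;DR: This paper proposes ShapeEmbed, a self-supervised representation learning framework that extracts shape descriptors from object contours in 2D images. The descriptors are invariant to translation, scaling, rotation, reflection, and point indexing, and they outperform existing methods in shape classification on natural and biological images.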

Details Motivation: A core challenge of shape quantification is ensuring that the extracted measurements remain invariant to transformations that preserve an object's intrinsic geometry, such as changes in size, orientation, and position in the image. Method: The paper introduces ShapeEmbed, a self-supervised representation learning framework that represents the contour of an object in a 2D image as a Euclidean distance matrix and encodes it into a shape descriptor invariant to translation, scaling, rotation, reflection, and point indexing. Result: ShapeEmbed outperforms existing autoencoder-based approaches in shape classification tasks on natural and biological images. Conclusion: ShapeEmbed has significant potential for biological imaging applications and overcomes the limitations of traditional shape descriptors and existing state-of-the-art autoencoder-based approaches. Abstract: The shape of objects is an important source of visual information in a wide range of applications. One of the core challenges of shape quantification is to ensure that the extracted measurements remain invariant to transformations that preserve an object's intrinsic geometry, such as changing its size, orientation, and position in the image. In this work, we introduce ShapeEmbed, a self-supervised representation learning framework designed to encode the contour of objects in 2D images, represented as a Euclidean distance matrix, into a shape descriptor that is invariant to translation, scaling, rotation, reflection, and point indexing. Our approach overcomes the limitations of traditional shape descriptors while improving upon existing state-of-the-art autoencoder-based approaches. We demonstrate that the descriptors learned by our framework outperform their competitors in shape classification tasks on natural and biological images. We envision our approach to be of particular relevance to biological imaging applications.
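
A short sketch can show why a Euclidean distance matrix is a convenient input representation here. Pairwise distances between contour points are unchanged by translation, rotation, and reflection, and dividing by the mean distance removes uniform scaling. The mean-based normalization and the helper names below are assumptions for illustration, not the authors' code, and invariance to point indexing is not handled by this sketch (in the paper that property is attributed to the learned descriptor).

```python
import math


def normalized_distance_matrix(contour):
    """Pairwise Euclidean distances between contour points, divided by their mean."""
    n = len(contour)
    dists = [[math.dist(p, q) for q in contour] for p in contour]
    mean = sum(map(sum, dists)) / (n * n)
    if mean == 0:  # degenerate contour: all points coincide
        return dists
    return [[d / mean for d in row] for row in dists]


def similarity_transform(contour, angle, scale, shift, reflect=False):
    """Apply an optional reflection, then rotation, uniform scaling, and translation."""
    c, s = math.cos(angle), math.sin(angle)
    out = []
    for x, y in contour:
        if reflect:
            x = -x
        out.append((scale * (c * x - s * y) + shift[0],
                    scale * (s * x + c * y) + shift[1]))
    return out


def max_abs_difference(a, b):
    """Largest element-wise absolute difference between two equally sized matrices."""
    return max(abs(x - y) for row_a, row_b in zip(a, b) for x, y in zip(row_a, row_b))


# Hypothetical five-point contour and a transformed copy of it.
contour = [(0.0, 0.0), (2.0, 0.0), (2.0, 1.0), (1.0, 1.5), (0.0, 1.0)]
moved = similarity_transform(contour, angle=0.7, scale=2.5, shift=(3.0, -4.0), reflect=True)
difference = max_abs_difference(normalized_distance_matrix(contour),
                                normalized_distance_matrix(moved))
```

Up to floating-point error, `difference` is zero, so an encoder fed this matrix never sees the pose or size of the object, only its intrinsic geometry.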

[141] DAM-VSR: Disentanglement of Appearance and Motion for Video Super-Resolution

Zhe Kong,Le Li,Yong Zhang,Feng Gao,Shaoshu Yang,Tao Wang,Kaihao Zhang,Zhuoliang Kang,Xiaoming Wei,Guanying Chen,Wenhan Luo

Main category: cs.CV

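TL;DR: This paper proposes DAM-VSR, a new framework for video super-resolution that disentangles appearance enhancement from motion control and achieves strong performance on multiple datasets.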

Details Motivation: Real-world video super-resolution (VSR) is challenging because of complex and unpredictable degradations. Although some recent methods use image diffusion models for VSR and show improved detail generation, they still struggle to produce temporally consistent frames. Method: DAM-VSR disentangles VSR into appearance enhancement and motion control. Appearance enhancement is achieved through reference image super-resolution, and motion control through video ControlNet. A motion-aligned bidirectional sampling strategy is also used. Result: DAM-VSR handles longer input videos and achieves state-of-the-art performance on both real-world and AIGC data. Conclusion: The DAM-VSR framework achieves state-of-the-art performance on real-world and AIGC data, demonstrating strong detail generation capabilities. Abstract: Real-world video super-resolution (VSR) presents significant challenges due to complex and unpredictable degradations. Although some recent methods utilize image diffusion models for VSR and have shown improved detail generation capabilities, they still struggle to produce temporally consistent frames. We attempt to use Stable Video Diffusion (SVD) combined with ControlNet to address this issue. However, due to the intrinsic image-animation characteristics of SVD, it is challenging to generate fine details using only low-quality videos. To tackle this problem, we propose DAM-VSR, an appearance and motion disentanglement framework for VSR. This framework disentangles VSR into appearance enhancement and motion control problems. Specifically, appearance enhancement is achieved through reference image super-resolution, while motion control is achieved through video ControlNet. This disentanglement fully leverages the generative prior of video diffusion models and the detail generation capabilities of image super-resolution models. Furthermore, equipped with the proposed motion-aligned bidirectional sampling strategy, DAM-VSR can conduct VSR on longer input videos. DAM-VSR achieves state-of-the-art performance on real-world data and AIGC data, demonstrating its powerful detail generation capabilities.