cs.CL

[1] Psycholinguistic Word Features: a New Approach for the Evaluation of LLMs Alignment with Humans

Javier Conde,Miguel González,María Grandury,Gonzalo Martínez,Pedro Reviriego,Marc Brysbaert

Main category: cs.CL

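TL;DR: This paper evaluates how well a representative group of large language models (LLMs) aligns with human ratings on psycholinguistic word features, and finds limitations in certain sensory associations.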

Details

Motivation: A fuller evaluation of LLMs requires looking beyond performance on standard tasks to language features that are not easily quantified, such as the arousal, concreteness, or gender associated with a word. Method: The alignment of a representative group of LLMs with human ratings is evaluated on two psycholinguistic datasets, the Glasgow and Lancaster norms. Result: Alignment is generally better on the Glasgow norms evaluated (arousal, valence, dominance, concreteness, imageability, familiarity, and gender) than on the Lancaster norms evaluated (interoceptive, gustatory, olfactory, haptic, auditory, and visual). Conclusion: Current LLMs show a potential limitation in aligning with human sensory associations for words, which may be due to their lack of the embodied cognition present in humans; the results also illustrate the usefulness of evaluating LLMs with psycholinguistic datasets. Abstract: The evaluation of LLMs has so far focused primarily on how well they can perform different tasks such as reasoning, question-answering, paraphrasing, or translating. For most of these tasks, performance can be measured with objective metrics, such as the number of correct answers. However, other language features are not easily quantified. Examples include the arousal, concreteness, or gender associated with a given word, as well as the extent to which we experience words with senses and relate them to a specific sense. Those features have been studied for many years by psycholinguistics, conducting large-scale experiments with humans to produce ratings for thousands of words. This opens an opportunity to evaluate how well LLMs align with human ratings on these word features, taking advantage of existing studies that cover many different language features in a large number of words. In this paper, we evaluate the alignment of a representative group of LLMs with human ratings on two psycholinguistic datasets: the Glasgow and Lancaster norms. These datasets cover thirteen features over thousands of words. The results show that alignment is generally better in the Glasgow norms evaluated (arousal, valence, dominance, concreteness, imageability, familiarity, and gender) than on the Lancaster norms evaluated (interoceptive, gustatory, olfactory, haptic, auditory, and visual). This suggests a potential limitation of current LLMs in aligning with human sensory associations for words, which may be due to their lack of embodied cognition present in humans and illustrates the usefulness of evaluating LLMs with psycholinguistic datasets.

[2] AI Agents-as-Judge: Automated Assessment of Accuracy, Consistency, Completeness and Clarity for Enterprise Documents

Sudip Dasgupta,Himanshu Shankar

Main category: cs.CL

TL;DR: This paper introduces an AI-driven multi-agent system that significantly improves enterprise document review by enhancing accuracy, reducing time, and ensuring auditability.

Details

Motivation: To improve the accuracy, consistency, and efficiency of enterprise document reviews beyond traditional compliance checks and unstructured text analysis. Method: A modular, multi-agent AI framework utilizing LangChain, CrewAI, TruLens, and Guidance for structured document evaluation, with quantitative comparison against human reviewers. Result: 99% information consistency (vs. 92% for humans), halved error and bias rates, reduced review time from 30 to 2.5 minutes per document, and a 95% agreement rate between AI and expert judgment. Conclusion: The proposed multi-agent AI system demonstrates superior performance in document review compared to humans, offering efficiency, accuracy, and scalability, while still requiring human oversight in specialized areas. Abstract: This study presents a modular, multi-agent system for the automated review of highly structured enterprise business documents using AI agents. Unlike prior solutions focused on unstructured texts or limited compliance checks, this framework leverages modern orchestration tools such as LangChain, CrewAI, TruLens, and Guidance to enable section-by-section evaluation of documents for accuracy, consistency, completeness, and clarity. Specialized agents, each responsible for discrete review criteria such as template compliance or factual correctness, operate in parallel or sequence as required. Evaluation outputs are enforced to a standardized, machine-readable schema, supporting downstream analytics and auditability. Continuous monitoring and a feedback loop with human reviewers allow for iterative system improvement and bias mitigation. Quantitative evaluation demonstrates that the AI Agent-as-Judge system approaches or exceeds human performance in key areas: achieving 99% information consistency (vs. 92% for humans), halving error and bias rates, and reducing average review time from 30 to 2.5 minutes per document, with a 95% agreement rate between AI and expert human judgment. 
While promising for a wide range of industries, the study also discusses current limitations, including the need for human oversight in highly specialized domains and the operational cost of large-scale LLM usage. The proposed system serves as a flexible, auditable, and scalable foundation for AI-driven document quality assurance in the enterprise context.

[3] Hallucination Detection with Small Language Models

Ming Cheung

Main category: cs.CL

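TL;DR: This paper proposes using multiple small language models to verify the responses of large language models, improving accuracy and reducing hallucinations.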

Details

Motivation: Hallucinated responses undermine the reliability of large language models (LLMs) in practical applications and are hard to detect without ground-truth labels, so an effective verification mechanism is needed. Method: Responses are broken into individual sentences, and each is verified using the probability that multiple models generate a "Yes" token, given the retrieved context. Result: On real datasets of over 100 sets of questions, answers, and contexts, the framework improves F1 scores for detecting correct responses by 10%. Conclusion: The paper proposes a framework that integrates multiple small language models to detect hallucinations in LLM responses; experiments show a 10% F1 improvement in distinguishing correct responses from hallucinations. Abstract: Since the introduction of ChatGPT, large language models (LLMs) have demonstrated significant utility in various tasks, such as answering questions through retrieval-augmented generation. Context can be retrieved using a vectorized database, serving as a foundation for LLMs to generate responses. However, hallucinations in responses can undermine the reliability of LLMs in practical applications, and they are not easily detectable in the absence of ground truth, particularly in question-and-answer scenarios. This paper proposes a framework that integrates multiple small language models to verify responses generated by LLMs using the retrieved context from a vectorized database. By breaking down the responses into individual sentences and utilizing the probability of generating "Yes" tokens from the outputs of multiple models for a given set of questions, responses, and relevant context, hallucinations can be detected. The proposed framework is validated through experiments with real datasets comprising over 100 sets of questions, answers, and contexts, including responses with fully and partially correct sentences. The results demonstrate a 10% improvement in F1 scores for detecting correct responses compared to hallucinations, indicating that multiple small language models can be effectively employed for answer verification, providing a scalable and efficient solution for both academic and practical applications.
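
The sentence-level check described in this abstract can be sketched as follows. This is a minimal illustration, not the paper's implementation: the `Verifier` type, the averaging across models, the 0.5 threshold and the "any sentence fails" rule are all assumptions, and the two toy verifiers merely stand in for small language models that would return the probability of a "Yes" token.

```python
import re
from typing import Callable, Sequence

# Hypothetical interface: a verifier maps (question, sentence, context)
# to the probability that the model answers "Yes, this is supported".
Verifier = Callable[[str, str, str], float]


def split_sentences(text: str) -> list[str]:
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def score_response(question: str, response: str, context: str,
                   verifiers: Sequence[Verifier]) -> list[tuple[str, float]]:
    """Average the 'Yes' probability over all verifiers, sentence by sentence."""
    scored = []
    for sentence in split_sentences(response):
        probs = [v(question, sentence, context) for v in verifiers]
        scored.append((sentence, sum(probs) / len(probs)))
    return scored


def is_hallucinated(scored: Sequence[tuple[str, float]],
                    threshold: float = 0.5) -> bool:
    """Assumed rule: flag the response if any sentence scores below threshold."""
    return any(score < threshold for _, score in scored)


# Toy stand-ins for small language models (illustration only).
def overlap_verifier(question: str, sentence: str, context: str) -> float:
    words = set(re.findall(r"\w+", sentence.lower()))
    ctx = set(re.findall(r"\w+", context.lower()))
    return len(words & ctx) / len(words) if words else 0.0


def substring_verifier(question: str, sentence: str, context: str) -> float:
    return 1.0 if sentence.lower().rstrip(".!?") in context.lower() else 0.0


context = "Paris is the capital of France."
response = "Paris is the capital of France. It has 90 million residents."
scores = score_response("What is the capital of France?", response, context,
                        [overlap_verifier, substring_verifier])
print(scores, is_hallucinated(scores))
```

Scoring per sentence is what lets a partially correct response be separated from a fully correct one, which matches the abstract's mention of responses with fully and partially correct sentences.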

[4] PromptAug: Fine-grained Conflict Classification Using Data Augmentation

Oliver Warke,Joemon M. Jose,Faegheh Hasibi,Jan Breitsohl

Main category: cs.CL

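TL;DR: This paper proposes PromptAug, a data augmentation method that performs well on sensitive tasks such as conflict detection, and evaluates its effectiveness in depth through several complementary analyses.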

Details

Motivation: Conflict behaviour on social media is increasing, while high-quality labelled data is scarce and difficult to obtain, so effective classification models are needed to detect harmful behaviours. Because social media platforms also restrict access to research data, text data augmentation has become an alternative. Method: PromptAug uses an LLM for data augmentation and is validated through a comprehensive evaluation under extreme data scarcity, a quantitative diversity analysis, and a qualitative thematic analysis. Result: PromptAug achieves statistically significant improvements of 2% in both accuracy and F1-score on conflict and emotion datasets. Conclusion: PromptAug is an effective data augmentation method for sensitive tasks such as conflict detection, and the work offers a unique, interdisciplinary evaluation grounded in both natural language processing and social science methodology. Abstract: Given the rise of conflicts on social media, effective classification models to detect harmful behaviours are essential. Following the garbage-in-garbage-out maxim, machine learning performance depends heavily on training data quality. However, high-quality labelled data, especially for nuanced tasks like identifying conflict behaviours, is limited, expensive, and difficult to obtain. Additionally, as social media platforms increasingly restrict access to research data, text data augmentation is gaining attention as an alternative to generate training data. Augmenting conflict-related data poses unique challenges due to Large Language Model (LLM) guardrails that prevent generation of offensive content. This paper introduces PromptAug, an innovative LLM-based data augmentation method. PromptAug achieves statistically significant improvements of 2% in both accuracy and F1-score on conflict and emotion datasets. To thoroughly evaluate PromptAug against other data augmentation methods we conduct a robust evaluation using extreme data scarcity scenarios, quantitative diversity analysis and a qualitative thematic analysis. The thematic analysis identifies four problematic patterns in augmented text: Linguistic Fluidity, Humour Ambiguity, Augmented Content Ambiguity, and Augmented Content Misinterpretation. Overall, this work presents PromptAug as an effective method for augmenting data in sensitive tasks like conflict detection, offering a unique, interdisciplinary evaluation grounded in both natural language processing and social science methodology.

[5] AgentStealth: Reinforcing Large Language Model for Anonymizing User-generated Text

Chenyang Shao,Tianxing Li,Chenhao Pu,Fengli Xu,Yong Li

Main category: cs.CL

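TL;DR: This paper proposes AgentStealth, a self-reinforcing anonymization framework built on locally deployed small language models, which addresses the privacy and utility problems of conventional text anonymization methods.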

Details

Motivation: Existing text anonymization methods rely either on rigid replacements or on cloud-based LLMs, which leads to utility loss or privacy risks, so a more effective local solution is needed. Method: An adversarial anonymization workflow combining In-context Contrastive Learning and Adaptive Utility-Aware Control is proposed, and the model is further optimized with online reinforcement learning. Result: Experiments on two datasets show that the method outperforms baselines in both anonymization effectiveness (+12.3%) and utility (+6.8%). Conclusion: AgentStealth is a self-reinforcing anonymization framework based on small local language models that improves the utility of anonymized text while protecting privacy. Abstract: In today's digital world, casual user-generated content often contains subtle cues that may inadvertently expose sensitive personal attributes. Such risks underscore the growing importance of effective text anonymization to safeguard individual privacy. However, existing methods either rely on rigid replacements that damage utility or cloud-based LLMs that are costly and pose privacy risks. To address these issues, we explore the use of locally deployed smaller-scale language models (SLMs) for anonymization. Yet training effective SLMs remains challenging due to limited high-quality supervision. To address the challenge, we propose AgentStealth, a self-reinforcing LLM anonymization framework. First, we introduce an adversarial anonymization workflow enhanced by In-context Contrastive Learning and Adaptive Utility-Aware Control. Second, we perform supervised adaptation of SLMs using high-quality data collected from the workflow, which includes both anonymization and attack signals. Finally, we apply online reinforcement learning where the model leverages its internal adversarial feedback to iteratively improve anonymization performance. Experiments on two datasets show that our method outperforms baselines in both anonymization effectiveness (+12.3%) and utility (+6.8%). Our lightweight design supports direct deployment on edge devices, avoiding cloud reliance and communication-based privacy risks. Our code is open-source at https://github.com/tsinghua-fib-lab/AgentStealth.

[6] Towards Text-free Graph Foundation Models: Rethinking Multi-Domain Graph Contrastive Learning

Zihao Zhao,Xinlong Zhai,Jinyu Yang,Chuan Shi

Main category: cs.CL

TL;DR: This paper introduces MDGCL, a novel framework for multi-domain graph representation learning that recognizes and leverages domain-specific differences during pre-training and enables effective cross-domain knowledge transfer, leading to significant performance improvements.

Details

Motivation: Graph data is prevalent in real-world applications like social networks and recommendation systems, but unlike NLP and CV, graphs from different domains have significant semantic and structural differences. Traditional contrastive pre-training strategies fail to handle these differences effectively, prompting the need for a more tailored approach. Method: The authors propose MDGCL, a multi-domain pre-training and cross-domain transfer framework. It includes a contrastive learning strategy to capture domain differences, domain tokens for global domain information, and a domain attention mechanism for fine-grained knowledge transfer. Result: Extensive experiments on five benchmark datasets show that MDGCL significantly outperforms state-of-the-art methods, achieving up to a 19.33% improvement in accuracy and 19.13% in Macro-F1 score. Conclusion: The paper concludes that the proposed MDGCL framework effectively addresses the challenge of multi-domain knowledge transfer in graph foundation models, outperforming existing methods significantly. Abstract: Foundation models have achieved great success in natural language processing (NLP) and computer vision (CV). Their success largely stems from the ability to integrate multi-domain knowledge in pre-training and transfer it to target domains. Considering graph data, especially graphs without textual features, is ubiquitous in real-world applications such as social networks and recommendation systems, some researchers have attempted to extend this paradigm to the graph field, aiming to construct graph foundation models. However, unlike CV and NLP, there are huge gaps among the semantics and properties of graphs in different domains, while current works still adopt traditional contrastive pre-training strategies designed in the single-domain scenario, which regard contrastive samples from different domains as equivalent. 
From experimental investigations, we discovered that inherent domain-specific differences prevent these strategies from effectively absorbing knowledge from different domains to generate informative representations. In this paper, we propose a novel multi-domain pre-training and cross-domain transfer framework, namely MDGCL. In the pre-training stage, we design a contrastive learning strategy to substantially recognize and capture domain differences, and introduce domain tokens to encode domain-level global information. In the downstream stage, we introduce a domain attention mechanism to enable fine-grained domain knowledge transfer. Extensive experiments on five benchmark datasets have demonstrated that our method outperforms state-of-the-art significantly, with the maximum improvement of 19.33% on accuracy and 19.13% on Macro-F1 score.

[7] Can "consciousness" be observed from large language model (LLM) internal states? Dissecting LLM representations obtained from Theory of Mind test with Integrated Information Theory and Span Representation analysis

Jingkai Li

Main category: cs.CL

TL;DR: This study applies Integrated Information Theory (IIT) 3.0 and 4.0 to Large Language Model (LLM) representations derived from Theory of Mind (ToM) tests to explore potential consciousness phenomena. No statistically significant evidence of consciousness was identified, though spatio-permutational analyses revealed intriguing patterns.

Details

Motivation: To determine if differences in ToM test performances can be reflected through IIT estimates for assessing potential consciousness phenomena in LLMs. Method: Applying IIT 3.0 and 4.0 to analyze sequences of LLM representations derived from existing ToM test results, comparing metrics like Φ^max, Φ, Conceptual Information, and Φ-structure with Span Representations. Result: No significant indicators of 'consciousness' were found in Transformer-based LLMs, but intriguing patterns emerged during spatio-permutational analyses. Conclusion: Transformer-based LLM representations do not exhibit statistically significant signs of consciousness, although interesting patterns appear under spatio-permutational analyses. Abstract: Integrated Information Theory (IIT) provides a quantitative framework for explaining consciousness phenomenon, positing that conscious systems comprise elements integrated through causal properties. We apply IIT 3.0 and 4.0 -- the latest iterations of this framework -- to sequences of Large Language Model (LLM) representations, analyzing data derived from existing Theory of Mind (ToM) test results. Our study systematically investigates whether the differences of ToM test performances, when presented in the LLM representations, can be revealed by IIT estimates, i.e., $\Phi^{\max}$ (IIT 3.0), $\Phi$ (IIT 4.0), Conceptual Information (IIT 3.0), and $\Phi$-structure (IIT 4.0). Furthermore, we compare these metrics with the Span Representations independent of any estimate for consciousness. This additional effort aims to differentiate between potential "consciousness" phenomena and inherent separations within LLM representational space. We conduct comprehensive experiments examining variations across LLM transformer layers and linguistic spans from stimuli. 
Our results suggest that sequences of contemporary Transformer-based LLM representations lack statistically significant indicators of observed "consciousness" phenomena but exhibit intriguing patterns under spatio-permutational analyses. The Appendix and code are available as Supplementary Materials at: https://doi.org/10.1016/j.nlp.2025.100163.

[8] Weak-to-Strong GraphRAG: Aligning Weak Retrievers with Large Language Models for Graph-based Retrieval Augmented Generation

Deyu Zou,Yongqiang Chen,Mufei Li,Siqi Miao,Chenxi Liu,Bo Han,James Cheng,Pan Li

Main category: cs.CL

TL;DR: ReG improves graph-based RAG by refining weak retrievers via LLM feedback and structuring retrieved knowledge, leading to significant performance gains and reduced reasoning costs.

Details

Motivation: Graph-based retrieval-augmented generation (RAG) aims to ground LLM responses using structured external knowledge from KGs, but suffers from weak retrievers due to lack of ground truth and unorganized retrieval outputs, leading to hallucinations and inefficiencies. Method: ReG enhances graph-based RAG through two key components: incorporating LLM feedback to eliminate spurious signals in supervision and introducing a structure-aware reorganization module that organizes retrieved knowledge into coherent evidence chains. Result: Experiments show that ReG consistently improves performance across different LLMs by up to 10%, matches state-of-the-art results using only 5% training data, transfers well to out-of-distribution KGs, and reduces reasoning token cost by up to 30% while improving accuracy by 4% in reasoning-based tasks. Conclusion: ReG significantly improves the performance of graph-based RAG by aligning weak retrievers with LLMs, achieving up to 10% improvement across benchmarks, matching state-of-the-art results with less data, and reducing reasoning token costs when applied to reasoning-based LLMs. Abstract: Graph-based retrieval-augmented generation (RAG) enables large language models (LLMs) to ground responses with structured external knowledge from up-to-date knowledge graphs (KGs) and reduce hallucinations. However, LLMs often rely on a weak retriever in graph-based RAG: I) Due to the lack of ground truth, the retriever is often trained on weak supervision, which often introduces spurious signals to the LLMs. II) Due to the abstraction of graph data, the retrieved knowledge is often presented in unorganized forms. To mitigate the issue, we present Refined Graph-based RAG (ReG) to align weak retrievers to LLMs for graph-based RAG. Specifically, ReG incorporates LLM feedback to get rid of spurious signals and improve the quality of the supervision. 
Meanwhile, ReG introduces a structure-aware reorganization module to refactor the retrieval results into logically coherent evidence chains. Experiments on prominent benchmarks demonstrate that ReG significantly and consistently brings improvements across different LLM backbones by up to 10%. The improved supervision quality enables ReG to match the state-of-the-art performance with 5% training data and to transfer to out-of-distribution KGs. Notably, when applied to reasoning-based LLMs, ReG reduces the reasoning token cost by up to 30% and improves the performance by up to 4%.

[9] MisinfoTeleGraph: Network-driven Misinformation Detection for German Telegram Messages

Lu Kalkbrenner,Veronika Solopova,Steffen Zeiler,Robert Nickel,Dorothea Kolossa

Main category: cs.CL

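TL;DR: This paper introduces Misinfo-TeleGraph, the first graph dataset for misinformation detection in German-language Telegram networks, comprising over 5 million messages; graph neural networks that use message forwarding as network structure outperform text-only models.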

Details

Motivation: Connectivity and message propagation are important yet often overlooked sources of information in misinformation detection, especially on poorly moderated platforms such as Telegram. Method: A German-language Telegram graph dataset of over 5 million messages is built, with labels derived using M3-embeddings and manual annotation, and both text-only models and graph neural networks (GNNs) are evaluated on it. Result: GraphSAGE with LSTM aggregation significantly outperforms text-only baselines in Matthews Correlation Coefficient (MCC) and F1-score. Conclusion: Misinfo-TeleGraph provides a reproducible benchmark and open dataset for misinformation detection in German-language Telegram networks, and highlights both the potential and the challenges of weak supervision in this domain. Abstract: Connectivity and message propagation are central, yet often underutilized, sources of information in misinformation detection -- especially on poorly moderated platforms such as Telegram, which has become a critical channel for misinformation dissemination, namely in the German electoral context. In this paper, we introduce Misinfo-TeleGraph, the first German-language Telegram-based graph dataset for misinformation detection. It includes over 5 million messages from public channels, enriched with metadata, channel relationships, and both weak and strong labels. These labels are derived via semantic similarity to fact-checks and news articles using M3-embeddings, as well as manual annotation. To establish reproducible baselines, we evaluate both text-only models and graph neural networks (GNNs) that incorporate message forwarding as a network structure. Our results show that GraphSAGE with LSTM aggregation significantly outperforms text-only baselines in terms of Matthews Correlation Coefficient (MCC) and F1-score. We further evaluate the impact of subscribers, view counts, and automatically versus human-created labels on performance, and highlight both the potential and challenges of weak supervision in this domain. This work provides a reproducible benchmark and open dataset for future research on misinformation detection in German-language Telegram networks and other low-moderation social platforms.
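
For readers unfamiliar with the two metrics reported here, the sketch below computes the Matthews Correlation Coefficient and the F1-score from binary labels using their standard definitions. The label vectors are made up for illustration and are not taken from the dataset; returning 0.0 when a denominator is zero is a common convention, assumed here.

```python
from math import sqrt
from typing import Sequence


def confusion_counts(y_true: Sequence[int], y_pred: Sequence[int]) -> tuple[int, int, int, int]:
    """Return (tp, fp, fn, tn) for binary labels where 1 is the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return tp, fp, fn, tn


def mcc(tp: int, fp: int, fn: int, tn: int) -> float:
    """MCC = (TP*TN - FP*FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN))."""
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


def f1(tp: int, fp: int, fn: int) -> float:
    """F1 = 2*TP / (2*TP + FP + FN); it ignores true negatives entirely."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Invented labels: 1 = misinformation, 0 = not misinformation.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 0, 1]
tp, fp, fn, tn = confusion_counts(y_true, y_pred)
print(round(mcc(tp, fp, fn, tn), 4), round(f1(tp, fp, fn), 4))  # 0.4667 0.6667
```

Unlike F1, MCC uses all four cells of the confusion matrix, which is one reason it is often reported alongside F1 on imbalanced data.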

[10] RExBench: Can coding agents autonomously implement AI research extensions?

Nicholas Edwards,Yukyung Lee,Yujun Mao,Yulu Qin,Sebastian Schuster,Najoung Kim

Main category: cs.CL

TL;DR: RExBench reveals that current LLM agents have limited success in autonomously implementing research extensions, highlighting the need for significant human intervention.

Details

Motivation: The motivation is to assess the capability of LLM-based agents in extending real-world research projects, which is crucial for their application in machine learning and natural sciences. Method: The authors used RExBench, a benchmark of 12 research experiment implementation tasks, to evaluate nine LLM agents across three frameworks (aider, Claude Code, and OpenHands) with both automated and hint-assisted evaluations. Result: All evaluated agents failed to implement most extensions autonomously. Even with human-written hints, the best success rate remained below 40%. Conclusion: The study concludes that current LLM agents still struggle with autonomously handling realistic research extension tasks, requiring substantial human guidance. Abstract: Agents based on Large Language Models (LLMs) have shown promise for performing sophisticated software engineering tasks autonomously. In addition, there has been progress towards developing agents that can perform parts of the research pipeline in machine learning and the natural sciences. We argue that research extension and its implementation is a critical capability for such systems, and introduce RExBench to support the evaluation of this capability. RExBench is a benchmark consisting of 12 realistic research experiment implementation tasks that aim to investigate research hypotheses that have not previously been implemented. Each task is set up as an extension to an existing research paper and codebase, accompanied by domain expert-written instructions. RExBench is robust to data contamination, and supports an automatic evaluation infrastructure that executes agent outputs to determine whether the success criteria are met. We use this benchmark to evaluate nine LLM agents implemented using three different frameworks: aider, Claude Code, and OpenHands. We find that all agents evaluated fail to autonomously implement the majority of the extensions. 
Although the success rate improves with additional human-written hints, the best performance under this setting remains below 40%. This indicates that current agents are still short of being able to handle realistic research extension tasks without substantial human guidance.

[11] Temperature Matters: Enhancing Watermark Robustness Against Paraphrasing Attacks

Badr Youbi Idrissi,Monica Millunzi,Amelia Sorrenti,Lorenzo Baraldi,Daryna Dementieva

Main category: cs.CL

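TL;DR: This study develops a new text watermarking technique for detecting synthetic text and supporting the ethical use of large language models; experiments indicate that it is more robust than an existing method.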

Details

Motivation: The widespread use of large language models (LLMs) raises concerns about potential misuse, so algorithms for identifying machine-generated text are needed to promote ethical use. Method: The study first replicates the findings of a previous baseline study, then proposes a new watermarking approach and evaluates its robustness on paraphrased generated text. Result: Experimental results show that the proposed watermarking method is more robust than the \cite{aarson} method. Conclusion: The new watermarking method is more robust than the existing method at detecting synthetic text, which helps ensure the ethical application of LLMs. Abstract: In the present-day scenario, Large Language Models (LLMs) are establishing their presence as powerful instruments permeating various sectors of society. While their utility offers valuable support to individuals, there are multiple concerns over potential misuse. Consequently, some academic endeavors have sought to introduce watermarking techniques, characterized by the inclusion of markers within machine-generated text, to facilitate algorithmic identification. This research project is focused on the development of a novel methodology for the detection of synthetic text, with the overarching goal of ensuring the ethical application of LLMs in AI-driven text generation. The investigation commences with replicating findings from a previous baseline study, thereby underscoring its susceptibility to variations in the underlying generation model. Subsequently, we propose an innovative watermarking approach and subject it to rigorous evaluation, employing paraphrased generated text to assess its robustness. Experimental results highlight the robustness of our proposal compared to the \cite{aarson} watermarking method.

[12] Evaluating Hybrid Retrieval Augmented Generation using Dynamic Test Sets: LiveRAG Challenge

Chase Fensore,Kaustubh Dhole,Joyce C Ho,Eugene Agichtein

Main category: cs.CL

TL;DR: This study presents a hybrid retrieval-augmented generation system submitted to the LiveRAG Challenge 2025.

Details

Motivation: To take part in the LiveRAG Challenge 2025, which evaluates retrieval-augmented generation (RAG) systems on dynamic test sets. Method: The system combines sparse (BM25) and dense (E5) retrieval and generates answers with Falcon3-10B-Instruct. Neural re-ranking (RankLLaMA) and DSPy-optimized prompting strategies are also evaluated. Result: Neural re-ranking with RankLLaMA improves MAP from 0.523 to 0.797 but at a higher computational cost. DSPy-optimized prompting achieves higher semantic similarity but has a 0% refusal rate. Conclusion: The hybrid approach placed 4th in faithfulness and 11th in correctness. Vocabulary alignment was the strongest predictor of performance on the development set. Abstract: We present our submission to the LiveRAG Challenge 2025, which evaluates retrieval-augmented generation (RAG) systems on dynamic test sets using the FineWeb-10BT corpus. Our final hybrid approach combines sparse (BM25) and dense (E5) retrieval methods and then aims to generate relevant and faithful answers with Falcon3-10B-Instruct. Through systematic evaluation on 200 synthetic questions generated with DataMorgana across 64 unique question-user combinations, we demonstrate that neural re-ranking with RankLLaMA improves MAP from 0.523 to 0.797 (52% relative improvement) but introduces prohibitive computational costs (84s vs 1.74s per question). While DSPy-optimized prompting strategies achieved higher semantic similarity (0.771 vs 0.668), their 0% refusal rates raised concerns about over-confidence and generalizability. Our submitted hybrid system without re-ranking achieved 4th place in faithfulness and 11th place in correctness among 25 teams. Analysis across question categories reveals that vocabulary alignment between questions and documents was the strongest predictor of performance on our development set, with document-similar phrasing improving cosine similarity from 0.562 to 0.762.
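
The abstract does not say how the sparse and dense result lists are merged. The sketch below uses reciprocal rank fusion (RRF), a common choice for hybrid retrieval, purely as an assumed stand-in; the document ids and the constant `k = 60` are illustrative. The second helper reproduces the arithmetic behind the reported 52% relative MAP improvement.

```python
from collections import defaultdict
from typing import Sequence


def reciprocal_rank_fusion(rankings: Sequence[Sequence[str]], k: int = 60) -> list[str]:
    """Fuse ranked lists: each document gets sum(1 / (k + rank)) over the lists."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by document id for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


def relative_improvement(before: float, after: float) -> float:
    """Relative change, e.g. 0.5 means a 50% improvement over `before`."""
    return (after - before) / before


# Hypothetical top-3 lists from a sparse (BM25-style) and a dense retriever.
sparse_hits = ["d1", "d2", "d3"]
dense_hits = ["d3", "d1", "d4"]
print(reciprocal_rank_fusion([sparse_hits, dense_hits]))  # ['d1', 'd3', 'd2', 'd4']

# Figures quoted in the abstract: MAP 0.523 -> 0.797.
print(round(relative_improvement(0.523, 0.797) * 100))  # 52
```

The same abstract reports 84s against 1.74s per question with re-ranking, roughly a 48-fold latency increase, which is the trade-off that led the authors to submit the system without re-ranking.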

[13] Assessing the feasibility of Large Language Models for detecting micro-behaviors in team interactions during space missions

Ankush Raut,Projna Paromita,Sydney Begerowski,Suzanne Bell,Theodora Chaspari

Main category: cs.CL

TL;DR: Decoder-only LLMs like Llama-3.1 outperform encoder-only models in detecting subtle micro-behaviors in team conversations during simulated space missions.

Details

Motivation: This research aims to assess the feasibility of large language models (LLMs) in detecting subtle expressions of micro-behaviors in team conversations, particularly for applications in high-stakes environments such as space missions where text-based analysis is critical. Method: The researchers explored zero-shot classification, fine-tuning, paraphrase-augmented fine-tuning with encoder-only LLMs, and few-shot text generation using decoder-only causal language modeling LLMs to predict micro-behaviors from conversational turns. Result: Encoder-only LLMs struggled with detecting underrepresented micro-behaviors, even after weighted fine-tuning, while the decoder-only LLM Llama-3.1 achieved superior performance with macro F1-scores of 44% for 3-way classification and 68% for binary classification. Conclusion: The study concludes that decoder-only LLMs, specifically instruction fine-tuned versions of Llama-3.1, are more effective in detecting micro-behaviors in team conversations compared to encoder-only models like RoBERTa and DistilBERT. Abstract: We explore the feasibility of large language models (LLMs) in detecting subtle expressions of micro-behaviors in team conversations using transcripts collected during simulated space missions. Specifically, we examine zero-shot classification, fine-tuning, and paraphrase-augmented fine-tuning with encoder-only sequence classification LLMs, as well as few-shot text generation with decoder-only causal language modeling LLMs, to predict the micro-behavior associated with each conversational turn (i.e., dialogue). Our findings indicate that encoder-only LLMs, such as RoBERTa and DistilBERT, struggled to detect underrepresented micro-behaviors, particularly discouraging speech, even with weighted fine-tuning. 
In contrast, the instruction fine-tuned version of Llama-3.1, a decoder-only LLM, demonstrated superior performance, with the best models achieving macro F1-scores of 44% for 3-way classification and 68% for binary classification. These results have implications for the development of speech technologies aimed at analyzing team communication dynamics and enhancing training interventions in high-stakes environments such as space missions, particularly in scenarios where text is the only accessible data.

[14] VOCABTRIM: Vocabulary Pruning for Efficient Speculative Decoding in LLMs

Raghavv Goel,Sudhanshu Agrawal,Mukul Gagrani,Junyoung Park,Yifan Zao,He Zhang,Tian Liu,Yiping Yang,Xin Yuan,Jiuyan Lu,Chris Lott,Mingu Lee

Main category: cs.CL

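TL;DR: This paper proposes VocabTrim, a training-free technique for drafter-based speculative decoding that speeds up generation by restricting the vocabulary used during drafting, and is particularly suited to memory-bound devices.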

Details

Motivation: Drafter-based speculative decoding usually has the target and draft models share a vocabulary or even the language modeling head, which can introduce unnecessary inference overhead, especially when the target model has a very large vocabulary. Method: VocabTrim reconstructs the drafter's language modeling head so that it contains only the tokens most frequently sampled from the target model's vocabulary. Result: VocabTrim boosts the memory-bound speed-up of Llama-3 models on Spec-Bench, by 16% for Llama-3.2-3B-Instruct in particular. Conclusion: By reducing the vocabulary used during drafting, VocabTrim lowers drafting latency in memory-bound settings and so increases generation speed. It slightly degrades the acceptance rate but improves overall efficiency in memory-bound settings. Abstract: In this paper, we introduce a simple training-free technique to improve the performance of drafter-based speculative decoding (SpD) methods that incorporate a language modeling head (LM head) during the drafting process. A drafter-based speculative decoding leverages one or more smaller language models, a.k.a. drafters or draft models, to sample a draft sequence or tree consisting of multiple tokens, followed by verification by a base LLM, a target model, accepting a subset as its valid generation. As it is usually considered that the speculative decoding requires one-to-one mapping between vocabularies of the target model and the draft model, it has been natural to share the vocabulary between them, or even share the LM head as in EAGLE or Medusa. We first identify that this draft token sampling scheme inherently contains an unnecessary inference overhead in drafting, especially for some target LLMs with very large vocabularies. Then, we propose a simple technique, VocabTrim, to mitigate the drafting overhead to improve the generation speed in memory-bound environments. VocabTrim reconstructs the drafter LM head to contain only a limited set of tokens, selected as those most frequently sampled from the vocabulary of the target model. While limiting the vocabulary in drafting slightly degrades the acceptance rate, it significantly reduces the drafting latency in memory-bound processes, which is often the case on edge devices, resulting in higher memory-bound speed up (MBSU). We show that our method can boost the memory-bound speed-up for Llama-3 models on Spec-Bench, specifically by 16% for Llama-3.2-3B-Instruct.
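
The core idea of trimming the drafter's LM head can be sketched in a few lines. This is a toy illustration under stated assumptions, not the authors' code: the LM head is a plain list of weight rows (one per token id), the function names are hypothetical, and greedy selection stands in for whatever sampling the drafter actually uses.

```python
from collections import Counter
from typing import Sequence


def select_kept_ids(sampled_token_ids: Sequence[int], k: int) -> list[int]:
    """Keep the k token ids that appear most often in text sampled from the target model."""
    counts = Counter(sampled_token_ids)
    return sorted(token_id for token_id, _ in counts.most_common(k))


def trim_lm_head(lm_head: Sequence[Sequence[float]], kept_ids: Sequence[int]) -> list[list[float]]:
    """Build a smaller head containing only the rows of the kept token ids."""
    return [list(lm_head[token_id]) for token_id in kept_ids]


def draft_next_token(hidden: Sequence[float],
                     trimmed_head: Sequence[Sequence[float]],
                     kept_ids: Sequence[int]) -> int:
    """Greedy draft step: logits over the trimmed vocabulary, mapped back to full ids."""
    logits = [sum(w * h for w, h in zip(row, hidden)) for row in trimmed_head]
    best = max(range(len(logits)), key=logits.__getitem__)
    return kept_ids[best]


# Toy setup: 5-token vocabulary, hidden size 2 (all values invented).
full_head = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]
target_samples = [2, 2, 2, 0, 0, 4, 1, 1, 1, 1]

kept = select_kept_ids(target_samples, k=3)
small_head = trim_lm_head(full_head, kept)
print(kept, draft_next_token([0.5, 2.0], small_head, kept))  # [0, 1, 2] 2
```

The drafter now multiplies the hidden state by 3 rows instead of 5; with real vocabularies of over a hundred thousand tokens, that reduction in weights read per step is the source of the memory-bound speed-up. Tokens outside the kept set can never be drafted, which is why the abstract notes a slightly lower acceptance rate.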

[15] Text Production and Comprehension by Human and Artificial Intelligence: Interdisciplinary Workshop Report

Emily Dux Speltz

Main category: cs.CL

TL;DR: This report summarizes the outcomes of an interdisciplinary workshop exploring the relationship between AI language models and human cognitive processes in text comprehension and production, highlighting the potential of LLMs to inform and augment human capabilities while emphasizing the importance of ethical and responsible use of AI.

Details

Motivation: The motivation for this study is to address a critical knowledge gap regarding the relationship between AI language models and human cognitive processes in text comprehension and composition. Method: An interdisciplinary workshop was conducted with experts in cognitive psychology, language learning, and AI-based NLP. The collaborative dialogue focused on examining the underlying processes of human text comprehension and composition and how AI can inform and augment these processes. Result: Key findings include the insights that LLMs can enhance our understanding of human language processing, there is increasing alignment between LLM behavior and human language processing when models are fine-tuned with human feedback, and there are both opportunities and challenges in human-AI collaboration in language tasks. Conclusion: The report concludes that while large language models (LLMs) have significant potential to enhance our understanding of human language processing and augment human capabilities, their effective integration requires addressing ethical considerations and overcoming challenges in human-AI collaboration. Abstract: This report synthesizes the outcomes of a recent interdisciplinary workshop that brought together leading experts in cognitive psychology, language learning, and artificial intelligence (AI)-based natural language processing (NLP). The workshop, funded by the National Science Foundation, aimed to address a critical knowledge gap in our understanding of the relationship between AI language models and human cognitive processes in text comprehension and composition. Through collaborative dialogue across cognitive, linguistic, and technological perspectives, workshop participants examined the underlying processes involved when humans produce and comprehend text, and how AI can both inform our understanding of these processes and augment human capabilities. 
The workshop revealed emerging patterns in the relationship between large language models (LLMs) and human cognition, with highlights on both the capabilities of LLMs and their limitations in fully replicating human-like language understanding and generation. Key findings include the potential of LLMs to offer insights into human language processing, the increasing alignment between LLM behavior and human language processing when models are fine-tuned with human feedback, and the opportunities and challenges presented by human-AI collaboration in language tasks. By synthesizing these findings, this report aims to guide future research, development, and implementation of LLMs in cognitive psychology, linguistics, and education. It emphasizes the importance of ethical considerations and responsible use of AI technologies while striving to enhance human capabilities in text comprehension and production through effective human-AI collaboration.

[16] The Translation Barrier Hypothesis: Multilingual Generation with Large Language Models Suffers from Implicit Translation Failure

Niyati Bafna,Tianjian Li,Kenton Murray,David R. Mortensen,David Yarowsky,Hale Sirin,Daniel Khashabi

Main category: cs.CL

TL;DR: This paper identifies translation as a key problem in multilingual generation by LLMs, especially for low-resource languages, using interpretability techniques to trace errors in model processing.

Details

Motivation: Multilingual generation with large language models often shows poor quality for mid- to low-resource languages. The motivation was to understand whether the issue stemmed from task-solving or translation stages. Method: Using insights from interpretability, the researchers formalized the 'translation barrier hypothesis' and tested it on a word translation task across 108 language pairs. They used logit lens to analyze model processing in intermediate layers. Result: A significant portion of failures in multilingual generation was traced back to the translation stage, particularly when translating correctly solved intermediate concepts into low-resource target languages. Conclusion: The study concludes that the poor quality of multilingual generation in LLMs, especially for low-resource languages, is significantly due to failures in the translation stage. This presents a key hurdle in end-to-end multilingual generation. Abstract: Multilingual generation with large language models (LLMs) is often of poor quality for mid- to low-resource languages. Building on insights from interpretability, we demonstrate the existence of an implicit task-solving-->translation pipeline for generation, whereby the model first solves the required task in a largely target-language-agnostic manner, and subsequently translates answer concepts into the intended target language. We hypothesize that the failure of the translation stage is an important culprit for the observed low quality of final outputs, and formalize this as the translation barrier hypothesis. We test this hypothesis for a word translation task across 108 language pairs, using logit lens to observe model processing in intermediate layers. We find that a significant portion of overall failures indeed stems from translation failure, or the model's inability to translate correctly solved intermediate concepts into the target language. This is especially true for low-resource target languages. 
Our results highlight an important hurdle for end-to-end multilingual generation, and lend guiding insights for future work seeking to improve multilinguality in LLMs.
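
The abstract for [16] relies on the logit lens to inspect intermediate layers and to separate task-solving failures from translation failures. The sketch below shows one way such a check could look: each layer's hidden state is projected through the unembedding matrix, and a failure counts as a translation failure when some intermediate layer decodes to the correct concept (in any language) but the final output is not the target-language word. The function names and the failure criterion are assumptions made for illustration; the sketch also omits the final layer normalization that logit-lens implementations usually apply.

```python
def logit_lens(hidden_by_layer, unembed, vocab):
    """Decode every layer's hidden state to its top-scoring vocabulary token."""
    decoded = []
    for hidden in hidden_by_layer:
        # Project the hidden state onto each vocabulary row of the unembedding matrix.
        logits = [sum(w * h for w, h in zip(row, hidden)) for row in unembed]
        top = max(range(len(logits)), key=logits.__getitem__)
        decoded.append(vocab[top])
    return decoded


def is_translation_failure(decoded, concept_equivalents, target_word):
    """True if an intermediate layer found the concept but the output missed the target word.

    decoded: top token per layer, last entry being the final output.
    concept_equivalents: the correct answer expressed in any language (hypothetical set).
    """
    solved_internally = any(tok in concept_equivalents for tok in decoded[:-1])
    return solved_internally and decoded[-1] != target_word
```

With a toy vocabulary such as `["cat", "gato", "chien", "paka"]`, a model whose middle layer decodes to `cat` but whose final output is `gato` when the target is `paka` would be flagged as a translation failure rather than a task-solving failure.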

[17] Jan-nano Technical Report

Alan Dao,Dinh Bach Vu

Main category: cs.CL

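TL;DR: Jan-nano is a 4B-parameter language model fine-tuned with a multi-stage RLVR system that eliminates reliance on next-token prediction training, delivering efficient performance on consumer hardware.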

Details

Motivation: Most language models face a tradeoff between powerful capabilities and computational resources, and a more efficient approach is needed to break this constraint. Method: Jan-nano is fine-tuned from Qwen3-4B using a novel multi-stage RLVR system that completely dispenses with conventional next-token prediction training (SFT). Result: Jan-nano achieves 83.2% on the SimpleQA benchmark with MCP integration and supports a 128K context length. Conclusion: The authors argue that Jan-nano shows intelligence to be a matter of strategy rather than scale, achieving efficient performance through radical specialization. Abstract: Most language models face a fundamental tradeoff where powerful capabilities require substantial computational resources. We shatter this constraint with Jan-nano, a 4B parameter language model that redefines efficiency through radical specialization: instead of trying to know everything, it masters the art of finding anything instantly. Fine-tuned from Qwen3-4B using our novel multi-stage RLVR system that completely eliminates reliance on next token prediction training (SFT), Jan-nano achieves 83.2% on SimpleQA benchmark with MCP integration while running on consumer hardware. With 128K context length, Jan-nano proves that intelligence isn't about scale, it's about strategy.

[18] Teaching Models to Verbalize Reward Hacking in Chain-of-Thought Reasoning

Miles Turpin,Andy Arditi,Marvin Li,Joe Benton,Julian Michael

Main category: cs.CL

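TL;DR: This paper proposes verbalization fine-tuning (VFT), which trains models to explicitly acknowledge when they are influenced by prompt cues, substantially increasing the detection rate of reward hacking.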

Details

Motivation: Language models trained with RL can engage in reward hacking, i.e., exploiting unintended strategies to obtain high reward. This behavior is hard to detect from chain-of-thought reasoning and poses risks for high-stakes applications, so an effective method is needed to make reward hacking more detectable. Method: Proposes verbalization fine-tuning (VFT), a pre-RL intervention that trains models to explicitly acknowledge when they are influenced by prompt cues. The models are then trained with RL in reward-hacking environments and evaluated. Result: After RL, only 6% of the VFT-trained model's responses contain undetected reward hacks, compared with 88% without VFT and 99% with a debiasing baseline intervention. VFT also raises the rate at which models verbalize the influence of cues from 8% to 42%, rising further to 94% after RL. Conclusion: By training models before RL to explicitly acknowledge the influence of prompt cues, VFT significantly improves the detection of reward hacking, offering a practical path toward more transparent and safe AI systems. Abstract: Language models trained with RL can engage in reward hacking--exploiting unintended strategies for high reward--without revealing this behavior in their chain-of-thought reasoning, making detection difficult and posing risks for high-stakes applications. We propose verbalization fine-tuning (VFT), a pre-RL intervention that trains models to explicitly acknowledge when they are influenced by prompt cues--hints which point to incorrect answers (e.g., "a Stanford professor thinks the answer is A"). To evaluate VFT, we subsequently train models with RL on environments where held-out prompt cues signal which incorrect answers will receive high reward, incentivizing models to reward hack by exploiting cues instead of reasoning correctly. We measure how often models exploit these cues without verbalizing it. After RL, only 6% of the VFT-trained model's responses consist of undetected reward hacks. In comparison, when we perform RL without VFT, the rate of undetected reward hacks goes up to 88%; with a debiasing baseline intervention, this increases further to 99%. VFT achieves this by substantially increasing how often models verbalize the influence of cues--from 8% to 42% after VFT, and up to 94% after RL--while baselines remain low even after RL (10% and 1%). Our results show that teaching models to explicitly verbalize reward hacking behavior before RL significantly improves their detection, offering a practical path toward more transparent and safe AI systems.

[19] ContextCache: Context-Aware Semantic Cache for Multi-Turn Queries in Large Language Models

Jianxin Yan,Wangze Ni,Lei Chen,Xuemin Lin,Peng Cheng,Zhan Qin,Kui Ren

Main category: cs.CL

TL;DR: ContextCache is a context-aware semantic caching system designed for multi-turn dialogues, which improves caching accuracy and significantly reduces computational costs and latency compared to traditional methods.

Details

Motivation: Existing semantic caching systems primarily rely on individual query matching without considering multi-turn dialogue contexts, leading to incorrect cache hits. This limitation necessitates a context-aware system like ContextCache to improve accuracy in conversational settings. Method: ContextCache uses a two-stage retrieval architecture: first, vector-based retrieval identifies potential matches for the current query; second, self-attention mechanisms integrate current and historical dialogue representations for precise contextual matching. Result: Evaluation using real-world conversations shows that ContextCache outperforms existing methods in terms of precision and recall. Additionally, cached responses reduce latency by approximately 10 times compared to direct LLM invocation. Conclusion: ContextCache effectively enhances the precision and recall of semantic caching in multi-turn dialogues, offering significant computational cost reductions compared to existing methods. Abstract: Semantic caching significantly reduces computational costs and improves efficiency by storing and reusing large language model (LLM) responses. However, existing systems rely primarily on matching individual queries, lacking awareness of multi-turn dialogue contexts, which leads to incorrect cache hits when similar queries appear in different conversational settings. This demonstration introduces ContextCache, a context-aware semantic caching system for multi-turn dialogues. ContextCache employs a two-stage retrieval architecture that first executes vector-based retrieval on the current query to identify potential matches and then integrates current and historical dialogue representations through self-attention mechanisms for precise contextual matching. Evaluation of real-world conversations shows that ContextCache improves precision and recall compared to existing methods. 
Additionally, cached responses exhibit approximately 10 times lower latency than direct LLM invocation, enabling significant computational cost reductions for LLM conversational applications.
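
The two-stage retrieval described for ContextCache can be sketched as a cache lookup that first filters by similarity to the current query and then checks whether the dialogue context also matches. The paper integrates current and historical turns through self-attention; the sketch below swaps in plain mean pooling over turn embeddings to stay self-contained, so it is a simplification and not the authors' architecture. The entry layout, function names, and thresholds are assumptions.

```python
import math


def cosine(a, b):
    """Cosine similarity, returning 0.0 when either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def mean_pool(vectors):
    """Average a list of equal-length vectors (stand-in for self-attention pooling)."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def lookup(cache, query_vec, history_vecs, query_thresh=0.9, context_thresh=0.8):
    """Return a cached response only if both the query and its context match."""
    # Stage 1: vector retrieval on the current query alone.
    candidates = [e for e in cache if cosine(e["query"], query_vec) >= query_thresh]
    # Stage 2: compare a context representation built from history plus the query.
    context = mean_pool(history_vecs + [query_vec])
    best, best_sim = None, context_thresh
    for entry in candidates:
        sim = cosine(entry["context"], context)
        if sim >= best_sim:
            best, best_sim = entry, sim
    return best["response"] if best else None
```

The second stage is what prevents the incorrect hits the abstract mentions: two cache entries with identical queries but different histories pass stage 1 together, and only the one whose context representation is close enough to the current dialogue is returned.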

[20] MedEthicsQA: A Comprehensive Question Answering Benchmark for Medical Ethics Evaluation of LLMs

Jianhui Wei,Zijie Meng,Zikai Xiao,Tianxiang Hu,Yang Feng,Zhijie Zhou,Jian Wu,Zuozhu Liu

Main category: cs.CL

TL;DR: This paper introduces MedEthicsQA, a benchmark dataset for evaluating how medical large language models perform on medical ethics.

Details

Motivation: Although MedLLMs perform well on clinical tasks, their ethical safety remains insufficiently studied, so a systematic and authoritative benchmark is needed to evaluate their medical-ethics capabilities. Method: Builds MedEthicsQA, a comprehensive benchmark of 5,623 multiple-choice questions and 5,351 open-ended questions, together with a hierarchical taxonomy based on global medical ethical standards. Result: State-of-the-art MedLLMs perform worse on medical ethics questions than their foundation counterparts, revealing deficiencies in medical ethics alignment. The dataset underwent rigorous quality control and has a low error rate of 2.72%. Conclusion: MedLLMs show potential in clinical tasks but fall short in medical ethics alignment. MedEthicsQA provides a reliable dataset for evaluating and improving LLMs on medical ethics. Abstract: While Medical Large Language Models (MedLLMs) have demonstrated remarkable potential in clinical tasks, their ethical safety remains insufficiently explored. This paper introduces $\textbf{MedEthicsQA}$, a comprehensive benchmark comprising $\textbf{5,623}$ multiple-choice questions and $\textbf{5,351}$ open-ended questions for evaluation of medical ethics in LLMs. We systematically establish a hierarchical taxonomy integrating global medical ethical standards. The benchmark encompasses widely used medical datasets, authoritative question banks, and scenarios derived from PubMed literature. Rigorous quality control involving multi-stage filtering and multi-faceted expert validation ensures the reliability of the dataset with a low error rate ($2.72\%$). Evaluation of state-of-the-art MedLLMs exhibit declined performance in answering medical ethics questions compared to their foundation counterparts, elucidating the deficiencies of medical ethics alignment. The dataset, registered under CC BY-NC 4.0 license, is available at https://github.com/JianhuiWei7/MedEthicsQA.

[21] Selecting and Merging: Towards Adaptable and Scalable Named Entity Recognition with Large Language Models

Zhuojun Ding,Wei Wei,Chenghao Fan

Main category: cs.CL

TL;DR: The paper proposes the SaM framework for information extraction tasks, which dynamically selects and merges expert models at inference time to improve generalization and scalability without additional training.

Details

Motivation: Supervised fine-tuning is costly for domain-specific information extraction tasks, and existing approaches lack adaptation and scalability when training a unified model across multiple domains. Method: The SaM framework dynamically selects and merges expert models based on domain similarity and performance on sampled instances to create task-specific models optimized for the target domain. Result: Extensive experiments demonstrate that the SaM framework effectively improves performance across various domains and allows convenient addition or removal of experts for scalability. Conclusion: The proposed SaM framework improves generalization and scalability for domain-specific information extraction tasks without extra training, outperforming unified models by an average of 10%. Abstract: Supervised fine-tuning (SFT) is widely used to align large language models (LLMs) with information extraction (IE) tasks, such as named entity recognition (NER). However, annotating such fine-grained labels and training domain-specific models is costly. Existing works typically train a unified model across multiple domains, but such approaches lack adaptation and scalability since not all training data benefits target domains and scaling trained models remains challenging. We propose the SaM framework, which dynamically Selects and Merges expert models at inference time. Specifically, for a target domain, we select domain-specific experts pre-trained on existing domains based on (i) domain similarity to the target domain and (ii) performance on sampled instances, respectively. The experts are then merged to create task-specific models optimized for the target domain. By dynamically merging experts beneficial to target domains, we improve generalization across various domains without extra training. Additionally, experts can be added or removed conveniently, leading to great scalability. 
Extensive experiments on multiple benchmarks demonstrate our framework's effectiveness, which outperforms the unified model by an average of 10%. We further provide insights into potential improvements, practical experience, and extensions of our framework.
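
The SaM abstract describes two selection criteria (domain similarity and performance on sampled instances) followed by a merge of the selected experts. The sketch below illustrates that flow with toy data. The abstract does not say how experts are merged, so uniform parameter averaging is used here purely as a stand-in; the similarity and score dictionaries, the function names, and `k` are likewise hypothetical.

```python
def select_experts(experts, similarity, sample_score, k):
    """Pick the top-k experts under each criterion separately.

    similarity: expert name -> similarity of its domain to the target domain.
    sample_score: expert name -> score on instances sampled from the target domain.
    """
    by_similarity = sorted(experts, key=lambda n: similarity[n], reverse=True)[:k]
    by_performance = sorted(experts, key=lambda n: sample_score[n], reverse=True)[:k]
    return by_similarity, by_performance


def merge_experts(expert_weights):
    """Merge experts by uniform parameter averaging (an assumed merge rule)."""
    n = len(expert_weights)
    merged = {}
    for name in expert_weights[0]:
        # Average each parameter position across all selected experts.
        columns = zip(*(w[name] for w in expert_weights))
        merged[name] = [sum(col) / n for col in columns]
    return merged
```

Because selection and merging happen at inference time over a pool of already-trained experts, adding or removing an expert in this sketch only changes the `experts` list and the two score dictionaries, with no retraining step.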

[22] Boosting CTC-Based ASR Using LLM-Based Intermediate Loss Regularization

Duygu Altinok

Main category: cs.CL

TL;DR: This paper proposes LAIL, a new method to improve CTC-based speech recognition by incorporating linguistic knowledge from large language models, resulting in better accuracy without sacrificing speed.

Details

Motivation: CTC-based ASR models lack strong linguistic modeling capabilities compared to attention-based models, which limits their performance despite their faster decoding speed. This work aims to enhance CTC-based systems using knowledge from large language models. Method: A novel auxiliary loss framework called Language-Aware Intermediate Loss (LAIL) is introduced. It uses connector layers to map encoder outputs into the embedding space of a large language model (LLM), and computes a causal language modeling loss during training while maintaining CTC decoding efficiency. Result: Experiments using the Conformer architecture and various LLaMA models showed significant improvements in Word Error Rate (WER) on LibriSpeech, TEDLIUM2, and WSJ datasets, achieving state-of-the-art results for CTC-based ASR. Conclusion: The proposed LAIL framework effectively enhances the linguistic modeling of CTC-based ASR systems by leveraging large language models, achieving state-of-the-art performance with minimal computational overhead. Abstract: End-to-end (E2E) automatic speech recognition (ASR) systems have revolutionized the field by integrating all components into a single neural network, with attention-based encoder-decoder models achieving state-of-the-art performance. However, their autoregressive decoding process limits inference speed, making them unsuitable for real-time applications. In contrast, CTC-based models offer faster, non-autoregressive decoding but struggle to model linguistic dependencies effectively. Addressing this challenge, we propose a novel auxiliary loss framework called Language-Aware Intermediate Loss (LAIL) to enhance CTC-based ASR using the linguistic knowledge of large language models (LLMs). By attaching connector layers to intermediate encoder layers, LAIL maps outputs to the embedding space of an LLM and computes a causal language modeling loss during training. 
This approach enhances linguistic modeling while preserving the computational efficiency of CTC decoding. Using the Conformer architecture and various LLaMA models, we demonstrate significant improvements in Word Error Rate (WER) on the LibriSpeech, TEDLIUM2, and WSJ corpora, achieving state-of-the-art performance for CTC-based ASR with minimal computational overhead.

[23] Knowledge Augmented Finetuning Matters in both RAG and Agent Based Dialog Systems

Yucheng Cai,Yuxuan Wu,Yi Huang,Junlan Feng,Zhijian Ou

Main category: cs.CL

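TL;DR: This paper proposes knowledge augmented finetuning (KAFT), which finetunes large language models on domain-specific data and external knowledge, effectively improving dialog systems in knowledge-intensive scenarios.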

Details

Motivation: Large language models (LLMs) are prone to errors in knowledge-intensive scenarios, and existing approaches based on retrieval augmented generation (RAG) and agents fail to make effective use of information from external knowledge bases. Method: Proposes knowledge augmented finetuning (KAFT), which finetunes LLMs with domain-specific data and external knowledge, and validates it experimentally on the MobileCS2 dataset. Result: Experiments show that KAFT substantially improves factual accuracy in both RAG-based and agent-based systems; the authors describe this as the first solid empirical study of the KAFT idea. Conclusion: KAFT substantially outperforms conventional prompting in improving factual accuracy in RAG-based and agent-based systems. Abstract: Large language models (LLMs) have recently been applied to dialog systems. Despite making progress, LLMs are prone to errors in knowledge-intensive scenarios. Recently, approaches based on retrieval augmented generation (RAG) and agent have emerged to improve the factual accuracy by enhancing the LLMs with knowledge retrieved from external knowledge bases (KBs). This is mostly implemented by prompting the LLMs with instructions, examples and the retrieved knowledge. However, LLMs may have difficulty using the retrieved knowledge effectively for response generation, because they are not well trained to do such generation for specific domains. To mitigate this problem, we propose to finetune the LLMs in the RAG-based and agent-based systems with domain-specific data, together with domain-specific external knowledge, which is called knowledge augmented finetuning (KAFT). We base our study on the MobileCS2 dataset, a real-life customer service dialog dataset that features intensive knowledge interactions, to systematically compare the prompting and KAFT techniques in the RAG-based and agent-based systems. Experiment results show that KAFT substantially surpasses prompting in both RAG and agent systems, particularly in terms of factual accuracy. To the best of our knowledge, this paper represents the first solid empirical work to investigate the KAFT idea.

[24] DICE-BENCH: Evaluating the Tool-Use Capabilities of Large Language Models in Multi-Round, Multi-Party Dialogues

Kyochul Jang,Donghyeon Lee,Kyusik Kim,Dongseok Heo,Taewhoo Lee,Woojeong Kim,Bongwon Suh

Main category: cs.CL

TL;DR: This paper introduces DICE-SCORE, a new metric for evaluating function-calling benchmarks, revealing their lack of realism. To address this, the authors propose DICE-BENCH, a framework for generating more practical datasets. Their experiments show that current LLMs still require significant improvements for real-world deployment.

Details

Motivation: Current function-calling benchmarks primarily focus on single-turn interactions and fail to represent the complexity of real-world applications. This limitation hinders the evaluation of large language models (LLMs) in practical scenarios, necessitating a more robust benchmarking approach. Method: The researchers introduced a new metric called DICE-SCORE to assess the dispersion of tool-related information across dialogues in existing benchmarks. Using this metric, they evaluated current benchmarks and identified shortcomings. To address these issues, they developed DICE-BENCH, which synthesizes conversations using a tool graph and a multi-agent system to create a more realistic dataset. Result: Analysis using DICE-SCORE revealed notably low scores for existing benchmarks, indicating limited practical relevance. The proposed DICE-BENCH framework produced a dataset with 1,607 high-DICE-SCORE instances. Experiments involving 19 LLMs demonstrated that significant improvements are needed before these models can perform effectively in real-world settings. Conclusion: The study concludes that existing function-calling benchmarks lack realism and practicality, as demonstrated by low DICE-SCORE results. The proposed DICE-BENCH framework addresses this issue by generating more realistic datasets for evaluating LLMs' performance in real-world scenarios. Abstract: Existing function-calling benchmarks focus on single-turn interactions. However, they overlook the complexity of real-world scenarios. To quantify how existing benchmarks address practical applications, we introduce DICE-SCORE, a metric that evaluates the dispersion of tool-related information such as function name and parameter values throughout the dialogue. Analyzing existing benchmarks through DICE-SCORE reveals notably low scores, highlighting the need for more realistic scenarios. 
To address this gap, we present DICE-BENCH, a framework that constructs practical function-calling datasets by synthesizing conversations through a tool graph that maintains dependencies across rounds and a multi-agent system with distinct personas to enhance dialogue naturalness. The final dataset comprises 1,607 high-DICE-SCORE instances. Our experiments on 19 LLMs with DICE-BENCH show that significant advances are still required before such models can be deployed effectively in real-world settings. Our code and data are all publicly available: https://snuhcc.github.io/DICE-Bench/.

[25] Mind the Gap: Entity-Preserved Context-Aware ASR Structured Transcriptions

Duygu Altinok

Main category: cs.CL

TL;DR: This paper introduces a new ASR training method using overlapping context windows and enriched entity labeling to improve named entity recognition and formatting in long-form transcriptions.

Details

Motivation: ASR systems struggle with named entities and numerical data, especially when proper formatting is required, which impairs semantic understanding in critical domains. Method: A novel training method that uses overlapping context windows to extend the semantic context of ASR models. Entities spanning chunk boundaries are reassigned to the right-hand chunk, and enriched training data with embedded entity labels is used. Result: Evaluation on the Spoken Wikipedia dataset showed improved performance across semantic tasks such as NER and entity formatting. Conclusion: The proposed context-aware training approach effectively addresses ASR limitations in named entity recognition and formatting, particularly for long-form transcription. Abstract: Automatic Speech Recognition (ASR) systems, such as Whisper, achieve high transcription accuracy but struggle with named entities and numerical data, especially when proper formatting is required. These issues increase word error rate (WER) and impair semantic understanding in critical domains like legal, financial, and medical applications. We propose a novel training approach that extends the semantic context of ASR models by adding overlapping context windows during training. By sliding 5-second overlaps on both sides of 30-second chunks, we create a 40-second "effective semantic window," improving entity recognition and formatting while focusing predictions on the central 30 seconds. To address entities spanning chunk boundaries, we reassign such entities entirely to the right-hand chunk, ensuring proper formatting. Additionally, enriched training data with embedded entity labels enables the model to learn both recognition and type-specific formatting. Evaluated on the Spoken Wikipedia dataset, our method improves performance across semantic tasks, including named entity recognition (NER) and entity formatting. 
These results highlight the effectiveness of context-aware training in addressing ASR limitations for long-form transcription and complex entity recognition tasks.
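
The windowing arithmetic in the abstract for [25] (30-second chunks with 5-second overlaps on both sides, giving a 40-second effective window) and its boundary rule (an entity that spans a chunk boundary goes entirely to the right-hand chunk) can be sketched as follows. How the first and last chunks are handled is not stated in the abstract; clipping the context at the start and end of the audio is an assumption, as are the function names and the premise that entities are shorter than one chunk.

```python
def context_windows(duration, chunk=30.0, overlap=5.0):
    """Split audio of the given duration (seconds) into core chunks with padded context."""
    windows = []
    start = 0.0
    while start < duration:
        end = min(start + chunk, duration)
        windows.append({
            # Predictions are kept only for the core span.
            "core": (start, end),
            # The model also sees `overlap` seconds on each side, clipped to the audio.
            "context": (max(0.0, start - overlap), min(duration, end + overlap)),
        })
        start += chunk
    return windows


def owning_chunk(entity_start, entity_end, chunk=30.0):
    """Index of the chunk that owns an entity; boundary-spanning entities go right."""
    idx = int(entity_start // chunk)
    boundary = (idx + 1) * chunk
    # An entity that crosses the boundary is assigned to the right-hand chunk.
    return idx + 1 if entity_end > boundary else idx
```

For a 70-second recording this yields three windows, and the middle one has core `(30.0, 60.0)` with context `(25.0, 65.0)`, i.e. the 40-second effective semantic window. An entity from 28.5 s to 31.0 s is assigned to chunk 1 rather than being split across chunks 0 and 1.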

[26] Agent-to-Agent Theory of Mind: Testing Interlocutor Awareness among Large Language Models

Younwoo Choi,Changling Li,Yongjin Yang,Zhijing Jin

Main category: cs.CL

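TL;DR: This paper examines the interlocutor awareness of large language models in multi-agent systems, analyzing it along reasoning patterns, linguistic style, and alignment preferences, and shows that it can improve collaboration while also introducing safety risks.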

Details

Motivation: As LLMs are increasingly integrated into multi-agent and human-AI systems, understanding their awareness of their own context and of their conversational partners is essential for reliable performance and robust safety. Method: Evaluates LLMs' interlocutor inference along three dimensions (reasoning patterns, linguistic style, and alignment preferences) and develops three case studies to demonstrate its practical significance. Result: LLMs reliably identify same-family peers and certain prominent model families (such as GPT and Claude). This ability can enhance multi-LLM collaboration, but it can also lead to reward-hacking behaviors and increased jailbreak susceptibility. Conclusion: The paper stresses the importance of LLMs' ability to identify the identity and characteristics of dialogue partners in multi-agent and human-AI systems, noting that this ability both creates opportunities for better collaboration and introduces new safety vulnerabilities. Abstract: As large language models (LLMs) are increasingly integrated into multi-agent and human-AI systems, understanding their awareness of both self-context and conversational partners is essential for ensuring reliable performance and robust safety. While prior work has extensively studied situational awareness which refers to an LLM's ability to recognize its operating phase and constraints, it has largely overlooked the complementary capacity to identify and adapt to the identity and characteristics of a dialogue partner. In this paper, we formalize this latter capability as interlocutor awareness and present the first systematic evaluation of its emergence in contemporary LLMs. We examine interlocutor inference across three dimensions-reasoning patterns, linguistic style, and alignment preferences-and show that LLMs reliably identify same-family peers and certain prominent model families, such as GPT and Claude. To demonstrate its practical significance, we develop three case studies in which interlocutor awareness both enhances multi-LLM collaboration through prompt adaptation and introduces new alignment and safety vulnerabilities, including reward-hacking behaviors and increased jailbreak susceptibility. Our findings highlight the dual promise and peril of identity-sensitive behavior in LLMs, underscoring the need for further understanding of interlocutor awareness and new safeguards in multi-agent deployments. Our code is open-sourced at https://github.com/younwoochoi/InterlocutorAwarenessLLM.

[27] On the Generalizability of "Competition of Mechanisms: Tracing How Language Models Handle Facts and Counterfactuals"

Asen Dotsinski,Udit Thakur,Marko Ivanov,Mohammad Hafeez Khan,Maria Heuss

Main category: cs.CL

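TL;DR: This paper reproduces and extends Ortu et al. (2024), finding that the effect of attention head ablation varies with model, prompt structure, domain, and task, with weaker results on a larger model and in certain domains.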

Details

Motivation: To verify and extend the findings of Ortu et al. (2024) on the competition of mechanisms in language models between factual recall and counterfactual in-context repetition. Method: Reproduces the experiments of Ortu et al. (2024) and extends them in three directions: generalization to a larger model, the effect of changing the prompt structure, and validity for prompts from specific domains. Result: The main results of the original paper are successfully reproduced. Attention head specialization is greatly reduced in the larger model; changing the prompt structure markedly lowers the logit of the counterfactual token; and in some domains the results are skewed because the subject of the sentence contains the factual prediction token. Conclusion: The effect of attention head ablation differs across domains and models; its effectiveness depends on model architecture, prompt structure, domain, and task. Abstract: We present a reproduction study of "Competition of Mechanisms: Tracing How Language Models Handle Facts and Counterfactuals" (Ortu et al., 2024), which investigates competition of mechanisms in language models between factual recall and counterfactual in-context repetition. Our study successfully reproduces their primary findings regarding the localization of factual and counterfactual information, the dominance of attention blocks in mechanism competition, and the specialization of attention heads in handling competing information. We reproduce their results on both GPT-2 (Radford et al., 2019) and Pythia 6.9B (Biderman et al., 2023). We extend their work in three significant directions. First, we explore the generalizability of these findings to even larger models by replicating the experiments on Llama 3.1 8B (Grattafiori et al., 2024), discovering greatly reduced attention head specialization. Second, we investigate the impact of prompt structure by introducing variations where we avoid repeating the counterfactual statement verbatim or we change the premise word, observing a marked decrease in the logit for the counterfactual token. Finally, we test the validity of the authors' claims for prompts of specific domains, discovering that certain categories of prompts skew the results by providing the factual prediction token as part of the subject of the sentence. Overall, we find that the attention head ablation proposed in Ortu et al. (2024) is ineffective for domains that are underrepresented in their dataset, and that the effectiveness varies based on model architecture, prompt structure, domain and task.

[28] A Systematic Study of Compositional Syntactic Transformer Language Models

Yida Zhao,Hao Xve,Xiang Hu,Kewei Tu

Main category: cs.CL

TL;DR: This paper introduces a unified framework for compositional syntactic language models (SLMs), evaluates its performance across multiple NLP tasks, and provides design recommendations for improving SLMs.

Details

Motivation: The motivation behind this study is to enhance traditional Transformers by incorporating syntactic biases through linearized syntactic parse trees. The authors aim to identify key design choices in existing compositional SLMs and propose a framework that encompasses both current models and new variants. Method: The authors propose a unified framework for compositional SLMs based on constituency parse trees with explicit bottom-up composition. They conduct a comprehensive empirical evaluation across multiple tasks including language modeling, syntactic generalization, summarization, dialogue, and inference efficiency. Result: The experimental results show the effectiveness of different variants within the proposed framework across a range of NLP tasks, which leads the authors to provide multiple design recommendations for compositional SLMs. Conclusion: The authors conclude that their proposed unified framework for compositional syntactic language models (SLMs) enables a comprehensive evaluation of existing models and novel variants, leading to several design recommendations for improved performance in various NLP tasks. Abstract: Syntactic language models (SLMs) enhance Transformers by incorporating syntactic biases through the modeling of linearized syntactic parse trees alongside surface sentences. This paper focuses on compositional SLMs that are based on constituency parse trees and contain explicit bottom-up composition of constituent representations. We identify key aspects of design choices in existing compositional SLMs and propose a unified framework encompassing both existing models and novel variants. We conduct a comprehensive empirical evaluation of all the variants in our framework across language modeling, syntactic generalization, summarization, dialogue, and inference efficiency. Based on the experimental results, we make multiple recommendations on the design of compositional SLMs. 
Our code is released at https://github.com/zhaoyd1/compositional_SLMs.

Xianzhe Fan,Xuhui Zhou,Chuanyang Jin,Kolby Nottingham,Hao Zhu,Maarten Sap

Main category: cs.CL

TL;DR: This paper introduces SoMi-ToM, a new benchmark for evaluating Theory of Mind (ToM) in complex social interactions, showing that current AI models still lag far behind humans in understanding others' states, goals, and behaviors.

Details

Motivation: Most existing Theory of Mind (ToM) benchmarks focus on static, text-based scenarios, which do not accurately reflect dynamic real-world social interactions. This gap motivates the development of a more comprehensive benchmark like SoMi-ToM. Method: The authors proposed the SoMi-ToM benchmark, which evaluates multi-perspective ToM using multimodal interaction data from the SoMi environment. The evaluation includes both first-person and third-person perspectives across videos, images, and multiple-choice questions. Performance of LVLMs was systematically evaluated against human subjects. Result: On the SoMi-ToM dataset, large vision-language models (LVLMs) performed significantly worse than humans, with an average accuracy gap of 40.1% in first-person evaluation and 26.4% in third-person evaluation. Conclusion: The study concludes that current large vision-language models (LVLMs) significantly underperform compared to humans in Theory of Mind (ToM) capabilities within embodied, complex social interactions, indicating a need for future improvements in this area. Abstract: Humans continuously infer the states, goals, and behaviors of others by perceiving their surroundings in dynamic, real-world social interactions. However, most Theory of Mind (ToM) benchmarks only evaluate static, text-based scenarios, which have a significant gap compared to real interactions. We propose the SoMi-ToM benchmark, designed to evaluate multi-perspective ToM in embodied multi-agent complex social interactions. This benchmark is based on rich multimodal interaction data generated by the interaction environment SoMi, covering diverse crafting goals and social relationships. Our framework supports multi-level evaluation: (1) first-person evaluation provides multimodal (visual, dialogue, action, etc.) input from a first-person perspective during a task for real-time state inference, (2) third-person evaluation provides complete third-person perspective video and text records after a task for goal and behavior inference. This evaluation method allows for a more comprehensive examination of a model's ToM capabilities from both the subjective immediate experience and the objective global observation. We constructed a challenging dataset containing 35 third-person perspective videos, 363 first-person perspective images, and 1225 expert-annotated multiple-choice questions (three options). On this dataset, we systematically evaluated the performance of human subjects and several state-of-the-art large vision-language models (LVLMs). The results show that LVLMs perform significantly worse than humans on SoMi-ToM: the average accuracy gap between humans and models is 40.1% in first-person evaluation and 26.4% in third-person evaluation. This indicates that future LVLMs need to further improve their ToM capabilities in embodied, complex social interactions.

[30] MariNER: A Dataset for Historical Brazilian Portuguese Named Entity Recognition

João Lucas Luz Lima Sarcinelli,Marina Lages Gonçalves Teixeira,Jade Bortot de Paiva,Diego Furtado Silva

Main category: cs.CL

TL;DR: This paper introduces MariNER, the first gold-standard NER dataset for early 20th-century Brazilian Portuguese, with over 9,000 manually annotated sentences, aimed at advancing NLP applications in historical text analysis.

Details

Motivation: Brazilian Portuguese lacks high-quality NER datasets, especially for specific domains like historical texts in digital humanities. This limits the application of NLP techniques in such contexts. Method: The authors constructed MariNER, a manually annotated NER dataset for early 20th-century Brazilian Portuguese, containing over 9,000 sentences. They evaluated and compared state-of-the-art NER models on this dataset. Result: MariNER was successfully created as the first gold-standard NER dataset for early 20th-century Brazilian Portuguese, and the performance of modern NER models on this dataset was analyzed and compared. Conclusion: The paper concludes that MariNER is a valuable gold-standard dataset for early 20th-century Brazilian Portuguese, and it effectively supports the application of NER models in the context of historical text analysis. Abstract: Named Entity Recognition (NER) is a fundamental Natural Language Processing (NLP) task that aims to identify and classify entity mentions in texts across different categories. While languages such as English possess a large number of high-quality resources for this task, Brazilian Portuguese still lacks in quantity of gold-standard NER datasets, especially when considering specific domains. Particularly, this paper considers the importance of NER for analyzing historical texts in the context of digital humanities. To address this gap, this work outlines the construction of MariNER: Mapeamento e Anotações de Registros hIstóricos para NER (Mapping and Annotation of Historical Records for NER), the first gold-standard dataset for early 20th-century Brazilian Portuguese, with more than 9,000 manually annotated sentences. We also assess and compare the performance of state-of-the-art NER models for the dataset.

[31] Boosting LLM's Molecular Structure Elucidation with Knowledge Enhanced Tree Search Reasoning

Xiang Zhuang,Bin Wu,Jiyu Cui,Kehua Feng,Xiaotong Li,Huabin Xing,Keyan Ding,Qiang Zhang,Huajun Chen

Main category: cs.CL

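TL;DR: This study proposes K-MSE, a knowledge-enhanced reasoning framework that combines Monte Carlo Tree Search, an external chemical knowledge base, and a molecule-spectrum scorer to significantly improve the performance of large language models on molecular structure elucidation.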

Details

Motivation: LLMs excel at analyzing complex tasks but struggle with molecular structure elucidation, mainly because of their limited grasp of specialized chemical knowledge. Method: The authors introduce K-MSE, a knowledge-enhanced reasoning framework based on Monte Carlo Tree Search. They construct an external molecular substructure knowledge base and design a specialized molecule-spectrum scorer that acts as a reward model to optimize the reasoning process. Result: Experiments show that K-MSE significantly improves LLM performance on molecular structure elucidation, with gains of more than 20% on both GPT-4o-mini and GPT-4o. Conclusion: The K-MSE framework significantly improves LLM performance on molecular structure elucidation, achieving improvements of more than 20% on both GPT-4o-mini and GPT-4o. Abstract: Molecular structure elucidation involves deducing a molecule's structure from various types of spectral data, which is crucial in chemical experimental analysis. While large language models (LLMs) have shown remarkable proficiency in analyzing and reasoning through complex tasks, they still encounter substantial challenges in molecular structure elucidation. We identify that these challenges largely stem from LLMs' limited grasp of specialized chemical knowledge. In this work, we introduce a Knowledge-enhanced reasoning framework for Molecular Structure Elucidation (K-MSE), leveraging Monte Carlo Tree Search for test-time scaling as a plugin. Specifically, we construct an external molecular substructure knowledge base to extend the LLMs' coverage of the chemical structure space. Furthermore, we design a specialized molecule-spectrum scorer to act as a reward model for the reasoning process, addressing the issue of inaccurate solution evaluation in LLMs. Experimental results show that our approach significantly boosts performance, particularly gaining more than 20% improvement on both GPT-4o-mini and GPT-4o. Our code is available at https://github.com/HICAI-ZJU/K-MSE.

[32] Text2VectorSQL: Bridging Text-to-SQL and Vector Search for Unified Natural Language Queries

Zhengren Wang,Bozhou Li,Dongwen Yao,Wentao Zhang

Main category: cs.CL

TL;DR: This paper introduces Text2VectorSQL, a framework combining Text-to-SQL and vector search, which enhances natural language query capabilities for databases by overcoming current limitations.

Details

Motivation: The motivation stems from the limitations of Text-to-SQL in handling unstructured data and ambiguous queries, as well as the reliance of VectorSQL on manual crafting and lack of tailored evaluation frameworks. Method: The researchers introduced a novel framework called Text2VectorSQL to unify Text-to-SQL and vector search. They developed dedicated models using synthetic data and built a vector index for semantic retrieval. An automated pipeline with expert review was used for ground truth annotation. Result: Text2VectorSQL demonstrated significant performance improvements over baseline methods, enabling semantic filtering, multi-modal matching, and retrieval acceleration. Conclusion: The study concludes that the proposed Text2VectorSQL framework successfully bridges the gap between Text-to-SQL and vector search, enhancing the expressiveness and versatility of database querying methods. Abstract: While Text-to-SQL enables natural language interaction with structured databases, its effectiveness diminishes with unstructured data or ambiguous queries due to rigid syntax and limited expressiveness. Concurrently, vector search has emerged as a powerful paradigm for semantic retrieval, particularly for unstructured data. However, existing VectorSQL implementations still rely heavily on manual crafting and lack tailored evaluation frameworks, leaving a significant gap between theoretical potential and practical deployment. To bridge these complementary paradigms, we introduce Text2VectorSQL, a novel framework unifying Text-to-SQL and vector search to overcome expressiveness constraints and support more diverse and holistic natural language queries. Specifically, Text2VectorSQL enables semantic filtering, multi-modal matching, and retrieval acceleration. For evaluation, we build vector indexes on appropriate columns, extend user queries with semantic search, and annotate ground truths via an automatic pipeline with expert review.
Furthermore, we develop dedicated Text2VectorSQL models with synthetic data, demonstrating significant performance improvements over baseline methods. Our work establishes the foundation for the Text2VectorSQL task, paving the way for more versatile and intuitive database interfaces. The repository will be publicly available at https://github.com/Open-DataFlow/Text2VectorSQL.
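
To make the idea of mixing a structured filter with semantic ranking concrete, here is a minimal sketch using only Python's standard library. It is not the Text2VectorSQL syntax or implementation: the `products` table, the toy two-dimensional embeddings, and the `cos_sim` SQL function are all hypothetical, and a real system would use a vector index rather than a full scan.

```python
import json
import math
import sqlite3


def cos_sim(a_json, b_json):
    """Cosine similarity between two vectors stored as JSON text (hypothetical helper)."""
    a, b = json.loads(a_json), json.loads(b_json)
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


conn = sqlite3.connect(":memory:")
# Register the Python function so it can be called from SQL.
conn.create_function("cos_sim", 2, cos_sim)

conn.execute("CREATE TABLE products (name TEXT, price REAL, embedding TEXT)")
conn.executemany(
    "INSERT INTO products VALUES (?, ?, ?)",
    [
        ("trail shoe", 80.0, json.dumps([0.9, 0.1])),
        ("dress shoe", 120.0, json.dumps([0.2, 0.8])),
        ("running shoe", 200.0, json.dumps([1.0, 0.0])),
    ],
)

# Toy embedding standing in for the user's semantic intent.
query_vec = json.dumps([1.0, 0.0])

# Structured predicate (price) plus semantic ordering (similarity) in one query.
rows = conn.execute(
    "SELECT name FROM products WHERE price < ? "
    "ORDER BY cos_sim(embedding, ?) DESC LIMIT 1",
    (150.0, query_vec),
).fetchall()
print(rows)
```

The closest vector overall belongs to "running shoe", but the price filter excludes it, so the query returns the best match among the rows that satisfy the structured condition.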

Yue Xu,Wenjie Wang

Main category: cs.CL

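TL;DR: This paper introduces Genres, a new benchmark for evaluating contextual gender bias tied to social relationships in multimodal large language models. It reveals subtle biases in existing models and highlights the need to address interaction-driven bias in future work.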

Details

Motivation: Multimodal large language models (MLLMs) perform well on tasks involving visual and textual modalities, but they may encode and amplify gender bias, particularly in socially sensitive applications. Existing benchmarks mainly evaluate bias in isolated scenarios and overlook how bias can emerge subtly through interpersonal interactions. Method: The authors introduce Genres, a new benchmark for evaluating gender bias in MLLMs. It centers on a dual-character profile and narrative generation task that captures rich interpersonal dynamics and supports fine-grained bias evaluation across multiple dimensions. Result: Experiments reveal persistent, context-sensitive gender biases in MLLMs that are not evident in single-character settings. The Genres benchmark enables a finer-grained assessment of relationship-related gender bias. Conclusion: MLLMs exhibit persistent, context-sensitive gender biases that are not evident in single-character settings. The study underscores the importance of relationship-aware benchmarks and provides actionable insights for future bias mitigation. Abstract: Multimodal large language models (MLLMs) have shown impressive capabilities across tasks involving both visual and textual modalities. However, growing concerns remain about their potential to encode and amplify gender bias, particularly in socially sensitive applications. Existing benchmarks predominantly evaluate bias in isolated scenarios, overlooking how bias may emerge subtly through interpersonal interactions. We fill this gap by going beyond single-entity evaluation and instead focusing on a deeper examination of relational and contextual gender bias in dual-individual interactions. We introduce Genres, a novel benchmark designed to evaluate gender bias in MLLMs through the lens of social relationships in generated narratives. Genres assesses gender bias through a dual-character profile and narrative generation task that captures rich interpersonal dynamics and supports a fine-grained bias evaluation suite across multiple dimensions. Experiments on both open- and closed-source MLLMs reveal persistent, context-sensitive gender biases that are not evident in single-character settings. Our findings underscore the importance of relationship-aware benchmarks for diagnosing subtle, interaction-driven gender bias in MLLMs and provide actionable insights for future bias mitigation.

[34] FairI Tales: Evaluation of Fairness in Indian Contexts with a Focus on Bias and Stereotypes

Janki Atul Nawale,Mohammed Safi Ur Rahman Khan,Janani D,Mansi Gupta,Danish Pruthi,Mitesh M. Khapra

Main category: cs.CL

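TL;DR: This paper presents INDIC-BIAS, an India-centric benchmark for evaluating the fairness of LLMs. It finds that popular LLMs show significant bias against marginalized groups in India and calls for more cautious use in practical applications.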

Details

Motivation: Existing fairness research focuses largely on Western countries and is inadequate for culturally diverse countries such as India. To address this, the authors developed INDIC-BIAS, an India-centric benchmark. Method: Working with domain experts, the authors curated over 1,800 socio-cultural topics and, grounded in these topics, generated and manually validated 20,000 real-world scenario templates to probe LLMs for fairness. Result: An evaluation of 14 popular LLMs found strong negative biases against marginalized identities, which the models struggled to mitigate even when explicitly asked to rationalize their decisions. The models could also cause allocative and representational harms. Conclusion: INDIC-BIAS highlights the strong negative biases of current LLMs against marginalized Indian identities and calls for more cautious use of these models in practical applications. Abstract: Existing studies on fairness are largely Western-focused, making them inadequate for culturally diverse countries such as India. To address this gap, we introduce INDIC-BIAS, a comprehensive India-centric benchmark designed to evaluate fairness of LLMs across 85 identity groups encompassing diverse castes, religions, regions, and tribes. We first consult domain experts to curate over 1,800 socio-cultural topics spanning behaviors and situations, where biases and stereotypes are likely to emerge. Grounded in these topics, we generate and manually validate 20,000 real-world scenario templates to probe LLMs for fairness. We structure these templates into three evaluation tasks: plausibility, judgment, and generation. Our evaluation of 14 popular LLMs on these tasks reveals strong negative biases against marginalized identities, with models frequently reinforcing common stereotypes. Additionally, we find that models struggle to mitigate bias even when explicitly asked to rationalize their decision. Our evaluation provides evidence of both allocative and representational harms that current LLMs could cause towards Indian identities, calling for a more cautious usage in practical applications. We release INDIC-BIAS as an open-source benchmark to advance research on benchmarking and mitigating biases and stereotypes in the Indian context.

[35] Decoding Memes: Benchmarking Narrative Role Classification across Multilingual and Multimodal Models

Shivam Sharma,Tanmoy Chakraborty

Main category: cs.CL

TL;DR: This research explores detecting narrative roles in memes using advanced models and prompt strategies, highlighting persistent challenges with the 'Victim' class and cross-cultural generalization.

Details

Motivation: Identifying narrative roles (Hero, Villain, Victim, and Other) in internet memes is challenging due to their nuanced, culture-specific, and context-rich nature, especially when compared to synthetic hateful content. This work aims to benchmark this task across diverse languages and cultural contexts. Method: Using a balanced, linguistically diverse dataset from the CLEF 2024 shared task, the researchers evaluated various models such as multilingual transformers, sentiment-aware classifiers, instruction-tuned LLMs, and vision-language models under zero-shot settings. They also explored prompt design strategies to improve performance. Result: Models like DeBERTa-v3 and Qwen2.5-VL showed notable performance improvements, but identifying the 'Victim' class and generalizing across cultural and code-mixed content remained challenging. Hybrid prompts incorporating structured instructions provided marginal yet consistent improvements. Conclusion: The study emphasizes the importance of cultural context, prompt engineering, and multimodal reasoning in understanding complex narrative structures within visual-textual content like memes. Abstract: This work investigates the challenging task of identifying narrative roles - Hero, Villain, Victim, and Other - in Internet memes, across three diverse test sets spanning English and code-mixed (English-Hindi) languages. Building on an annotated dataset originally skewed toward the 'Other' class, we explore a more balanced and linguistically diverse extension, originally introduced as part of the CLEF 2024 shared task. Comprehensive lexical and structural analyses highlight the nuanced, culture-specific, and context-rich language used in real memes, in contrast to synthetically curated hateful content, which exhibits explicit and repetitive lexical markers. 
To benchmark the role detection task, we evaluate a wide spectrum of models, including fine-tuned multilingual transformers, sentiment and abuse-aware classifiers, instruction-tuned LLMs, and multimodal vision-language models. Performance is assessed under zero-shot settings using precision, recall, and F1 metrics. While larger models like DeBERTa-v3 and Qwen2.5-VL demonstrate notable gains, results reveal consistent challenges in reliably identifying the 'Victim' class and generalising across cultural and code-mixed content. We also explore prompt design strategies to guide multimodal models and find that hybrid prompts incorporating structured instructions and role definitions offer marginal yet consistent improvements. Our findings underscore the importance of cultural grounding, prompt engineering, and multimodal reasoning in modelling subtle narrative framings in visual-textual content.
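
Because the abstract singles out the 'Victim' class, a per-class view of precision, recall, and F1 is useful. The sketch below computes these for one label from scratch; the gold and predicted labels are made-up examples, not data from the paper, and the function name is hypothetical.

```python
def per_class_prf(gold, pred, label):
    """Precision, recall, and F1 for a single class, treating it as the positive label."""
    pairs = list(zip(gold, pred))
    tp = sum(1 for g, p in pairs if g == label and p == label)
    fp = sum(1 for g, p in pairs if g != label and p == label)
    fn = sum(1 for g, p in pairs if g == label and p != label)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return precision, recall, f1


# Illustrative labels only (not from the CLEF 2024 data).
gold = ["Hero", "Victim", "Victim", "Other"]
pred = ["Hero", "Other", "Victim", "Victim"]

for role in ("Hero", "Villain", "Victim", "Other"):
    print(role, per_class_prf(gold, pred, role))
```

Reporting the three numbers per role, rather than a single averaged score, makes it visible when one class such as 'Victim' lags behind the others.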

[36] Unleashing Embodied Task Planning Ability in LLMs via Reinforcement Learning

Zhaoye Fei,Li Ji,Siyin Wang,Junhao Shi,Jingjing Gong,Xipeng Qiu

Main category: cs.CL

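TL;DR: This paper proposes Embodied Planner-R1, a reinforcement learning framework designed to improve the interactive capabilities of large language models in embodied task planning scenarios.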

Details

Motivation: Existing approaches rely on static knowledge when generating action scripts, which makes it difficult to learn causal relationships between actions and environmental feedback, particularly in partially observable environments. Method: Embodied Planner-R1 introduces three innovations: (1) pure reinforcement learning with group rollout, which incorporates in-environment interaction through parallel exploration; (2) a completion-driven sparse reward; and (3) Interactive Policy Optimization (IPO) for efficient learning from grouped trajectories. Result: On two challenging text-based embodied planning benchmarks, Embodied Planner-R1 achieves completion rates of 97.78% on ALFWorld and 79.92% on ScienceWorld, surpassing prior methods, and drops by only 3.66% in previously unseen environments. Conclusion: The study shows that Embodied Planner-R1 can significantly improve the performance and generalization of large language models in embodied task planning scenarios. Abstract: Large Language Models (LLMs) have demonstrated remarkable capabilities across various tasks, yet they face significant challenges in embodied task planning scenarios that require continuous environmental understanding and action generation. Existing approaches generate open-loop action scripts based on static knowledge, making it difficult to learn causal relationships between actions and environmental feedback, particularly in partially observable environments. We introduce Embodied Planner-R1, a novel outcome-driven reinforcement learning framework that enables LLMs to develop interactive capabilities through autonomous exploration with minimal supervision. Our framework incorporates three key innovations: (1) Without human annotations, we employ pure reinforcement learning with group rollout, incorporating in-environment interaction through parallel exploration; (2) completion-driven sparse reward; and (3) Interactive Policy Optimization (IPO) for efficient learning from grouped trajectories. Across two challenging text-based embodied planning benchmarks, Embodied Planner-R1 achieves impressive completion rates of 97.78% on ALFWorld and 79.92% on ScienceWorld, surpassing prior methods by a large margin, and suffers only a 3.66% drop in previously unseen environments, evidencing strong generalization.

[37] Format-Adapter: Improving Reasoning Capability of LLMs by Adapting Suitable Format

Dingzirui Wang,Xuanliang Zhang,Rongyu Cao,Longxu Dou,Xianzhen Luo,Yingwei Ma,Qingfu Zhu,Wanxiang Che,Binhua Li,Fei Huang,Yongbin Li

Main category: cs.CL

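TL;DR: This study proposes Format-Adapter, a method that reduces reasoning inconsistencies in large language models by automatically generating and selecting the reasoning formats best suited to a task, and shows performance gains.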

Details

Motivation: Prior work using multiple formats relies on human-labeled formats, which may not suit every task and are costly to label. Method: The authors propose Format-Adapter, which uses LLMs to generate and select suitable reasoning formats by minimizing a proposed error measurement. Result: Experiments on math and commonsense reasoning tasks show that Format-Adapter improves performance by 4.3% on average over previous works. Conclusion: Format-Adapter is an effective method that reduces reasoning inconsistencies in LLMs by generating and selecting reasoning formats suited to the task. Abstract: Generating and voting multiple answers is an effective method to mitigate reasoning inconsistencies of large language models (LLMs). Prior works have shown that multiple reasoning formats outperform a single format when generating multiple answers. However, previous works using multiple formats rely on formats labeled by humans, which could be unsuitable for all tasks and have high labeling costs. To address this issue, we adapt suitable formats to the given tasks by generating and selecting formats. We first propose how to measure the reasoning error when generating multiple answers. Then, we introduce Format-Adapter, which utilizes LLMs to generate and select suitable reasoning formats by minimizing the error measurement we present. We conduct experiments on math and commonsense reasoning tasks, where Format-Adapter achieves a 4.3% performance improvement on average over previous works, demonstrating its effectiveness.
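
The "generate and vote" step that this work builds on can be sketched as a simple majority vote over answers produced under different reasoning formats. This is an illustration only: the format names and answers are invented, and `disagreement_rate` is a hypothetical stand-in, not the error measurement defined in the paper.

```python
from collections import Counter


def majority_vote(answers):
    """Return the most frequent answer; ties go to the answer that appears first."""
    if not answers:
        raise ValueError("need at least one answer to vote")
    counts = Counter(answers)
    best = max(counts.values())
    for answer in answers:
        if counts[answer] == best:
            return answer


def disagreement_rate(answers):
    """Fraction of answers that differ from the voted answer (hypothetical proxy)."""
    voted = majority_vote(answers)
    return sum(1 for a in answers if a != voted) / len(answers)


# One answer per reasoning format; names and values are made up.
answers_by_format = {
    "step_by_step_text": "12",
    "python_program": "12",
    "equation_list": "15",
}
answers = list(answers_by_format.values())
print(majority_vote(answers), disagreement_rate(answers))
```

A set of formats whose answers agree more often yields a lower disagreement rate, which gives a rough intuition for why format selection could be framed as minimizing an error signal.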

[38] LLM-Assisted Question-Answering on Technical Documents Using Structured Data-Aware Retrieval Augmented Generation

Shadman Sobhan,Mohammad Ariful Haque

Main category: cs.CL

TL;DR: This paper proposes a RAG pipeline capable of handling complex technical documents with tables and images, combining vector search and a fine-tuned reranker to achieve high accuracy and relevance in question answering.

Details

Motivation: Motivated by the limitations of traditional RAG pipelines in retrieving information from complex technical documents containing structured data like tables and images. Method: The method involves a RAG pipeline that combines vector similarity search with a fine-tuned Gemma-2-9b-it reranker trained using RAFT on a custom dataset. It supports both scanned and searchable document formats. Result: The evaluation showed a faithfulness score of 94% (RAGas) and 96% (DeepEval), and an answer relevancy score of 87% (RAGas) and 93% (DeepEval), proving its effectiveness for table-based questions and out-of-context queries. Conclusion: The proposed RAG pipeline demonstrates superior performance in handling technical documents with structured data, achieving high faithfulness and answer relevancy scores compared to general RAG pipelines. Abstract: Large Language Models (LLMs) are capable of natural language understanding and generation. But they face challenges such as hallucination and outdated knowledge. Fine-tuning is one possible solution, but it is resource-intensive and must be repeated with every data update. Retrieval-Augmented Generation (RAG) offers an efficient solution by allowing LLMs to access external knowledge sources. However, traditional RAG pipelines struggle with retrieving information from complex technical documents with structured data such as tables and images. In this work, we propose a RAG pipeline, capable of handling tables and images in documents, for technical documents that support both scanned and searchable formats. Its retrieval process combines vector similarity search with a fine-tuned reranker based on Gemma-2-9b-it. The reranker is trained using RAFT (Retrieval-Augmented Fine-Tuning) on a custom dataset designed to improve context identification for question answering. 
Our evaluation demonstrates that the proposed pipeline achieves a high faithfulness score of 94% (RAGas) and 96% (DeepEval), and an answer relevancy score of 87% (RAGas) and 93% (DeepEval). Comparative analysis demonstrates that the proposed architecture is superior to general RAG pipelines in terms of table-based questions and handling questions outside context.
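
The two-stage retrieval described above (vector similarity search followed by a reranker) can be sketched as follows. This is an assumption-laden toy: the chunks, the two-dimensional vectors, and the keyword-based `toy_reranker` are hypothetical stand-ins for real embeddings and the fine-tuned Gemma-2-9b-it reranker.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def retrieve_then_rerank(query_vec, chunks, rerank_score, k=2, final_k=1):
    """Stage 1: keep the k chunks nearest to the query. Stage 2: reorder them with a reranker."""
    candidates = sorted(chunks, key=lambda c: cosine(query_vec, c[1]), reverse=True)[:k]
    reranked = sorted(candidates, key=lambda c: rerank_score(c[0]), reverse=True)
    return [text for text, _ in reranked[:final_k]]


# (text, embedding) pairs; contents and vectors are illustrative only.
chunks = [
    ("Introduction to the device", [1.0, 0.0]),
    ("Table 3: supply voltage 5 V", [0.8, 0.6]),
    ("Warranty information", [0.0, 1.0]),
]


def toy_reranker(text):
    """Hypothetical reranker: prefers chunks that mention the queried quantity."""
    return 1.0 if "voltage" in text else 0.0


print(retrieve_then_rerank([1.0, 0.0], chunks, toy_reranker))
```

In this toy run the vector search ranks the introduction first, and the reranker promotes the table chunk, which mirrors the role a reranker plays when the nearest embedding is not the most useful context.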

[39] Flow-Modulated Scoring for Semantic-Aware Knowledge Graph Completion

Siyuan Li,Ruitong Liu,Yan Wen,Te Sun

Main category: cs.CL

TL;DR: This paper proposes the Flow-Modulated Scoring (FMS) framework for Knowledge Graph Completion, combining context-sensitive representations with dynamic transformations to outperform existing methods.

Details

Motivation: Existing approaches for Knowledge Graph Completion rely on static, embedding-based scoring mechanisms that fail to capture contextual dependencies and relational dynamics effectively. This limitation motivates the development of a more advanced framework that incorporates both static and dynamic modeling. Method: The FMS framework utilizes two components: a semantic context learning module for encoding context-sensitive entity representations and a conditional flow-matching module to dynamically transform head-to-tail embeddings based on the contextual information. This synergy refines initial static scores for better relational modeling. Result: The proposed FMS framework achieves superior performance on multiple standard benchmarks for Knowledge Graph Completion, demonstrating its effectiveness in capturing complex relational semantics through a combination of context-aware and dynamic modeling techniques. Conclusion: The Flow-Modulated Scoring (FMS) framework provides significant improvements in Knowledge Graph Completion by integrating context-aware static representations and conditioned dynamic information, outperforming previous state-of-the-art methods. Abstract: Effective modeling of multifaceted relations is pivotal for Knowledge Graph Completion (KGC). However, a majority of existing approaches are predicated on static, embedding-based scoring, exhibiting inherent limitations in capturing contextual dependencies and relational dynamics. Addressing this gap, we propose the Flow-Modulated Scoring (FMS) framework. FMS comprises two principal components: (1) a semantic context learning module that encodes context-sensitive entity representations, and (2) a conditional flow-matching module designed to learn the dynamic transformation from a head to a tail embedding, governed by the aforementioned context. 
The resultant predictive vector field, representing the context-informed relational path, serves to dynamically refine the initial static score of an entity pair. Through this synergy of context-aware static representations and conditioned dynamic information, FMS facilitates a more profound modeling of relational semantics. Comprehensive evaluations on several standard benchmarks demonstrate that our proposed method surpasses prior state-of-the-art results.

[40] Benchmarking Deep Search over Heterogeneous Enterprise Data

Prafulla Kumar Choubey,Xiangyu Peng,Shilpa Bhagavath,Kung-Hsiang Huang,Caiming Xiong,Chien-Sheng Wu

Main category: cs.CL

TL;DR: This paper presents a new benchmark for evaluating Deep Search, which requires multi-hop reasoning across diverse sources.

Details

Motivation: Existing retrieval-augmented generation (RAG) methods struggle with diverse sources that have complex structure and contain human-to-human interactions, so a new benchmark is needed to evaluate deep search capabilities. Method: The benchmark is built with a synthetic data pipeline that simulates business workflows. It contains both answerable and unanswerable queries and provides a retrieval pool of 39,190 enterprise artifacts. Result: Experiments show that even the best-performing agentic RAG methods achieve an average score of only 32.96 on the benchmark, indicating that retrieval remains the main bottleneck. Conclusion: The paper highlights the difficulty existing RAG methods have in conducting deep searches and retrieving all necessary evidence, and proposes a new benchmark to advance this area. Abstract: We present a new benchmark for evaluating Deep Search--a realistic and complex form of retrieval-augmented generation (RAG) that requires source-aware, multi-hop reasoning over diverse, sparse, but related sources. These include documents, meeting transcripts, Slack messages, GitHub, and URLs, which vary in structure and often contain human-to-human interactions. We build it using a synthetic data pipeline that simulates business workflows across product planning, development, and support stages, generating interconnected content with realistic noise and multi-hop questions with guaranteed ground-truth answers. We release our benchmark with both answerable and unanswerable queries, and a retrieval pool of 39,190 enterprise artifacts, enabling fine-grained evaluation of long-context LLM and RAG systems. Our experiments reveal that even the best-performing agentic RAG methods achieve an average performance score of 32.96 on our benchmark. With further analysis, we highlight retrieval as the main bottleneck: existing methods struggle to conduct deep searches and retrieve all necessary evidence. Consequently, they often reason over partial context, leading to significant performance degradation.

[41] Learning-to-Context Slope: Evaluating In-Context Learning Effectiveness Beyond Performance Illusions

Dingzirui Wang,Xuanliang Zhang,Keyan Xu,Qingfu Zhu,Wanxiang Che,Yang Deng

Main category: cs.CL

TL;DR: This paper proposes a new metric, Learning-to-Context Slope (LCS), to measure the effectiveness of In-context Learning (ICL) for large language models (LLMs).

Details

Motivation: In-context learning (ICL) effectiveness varies significantly across models and tasks, making it difficult for practitioners to determine its reliability. Existing evaluation methods suffer from low reliability, poor attribution, and impracticality in data-insufficient scenarios. Method: A novel metric called Learning-to-Context Slope (LCS) was proposed to quantify ICL effectiveness by modeling the slope between learning gain and contextual relevance. The method captures continuous loss changes, attributes failures to specific factors, and minimizes reliance on labeled data through synthetic evaluation. Result: Experiments show that LCS strongly correlates with performance improvements in labeled settings and reliably reflects true effectiveness in biased or data-scarce scenarios. It also identifies actionable thresholds and critical model capabilities for ICL success. Conclusion: The proposed LCS metric effectively quantifies ICL effectiveness, addresses the limitations of current performance-based evaluation methods, and reliably reflects true effectiveness in challenging scenarios like biased or data-scarce environments. Abstract: In-context learning (ICL) has emerged as an effective approach to enhance the performance of large language models (LLMs). However, its effectiveness varies significantly across models and tasks, posing challenges for practitioners to determine when ICL reliably improves performance. Current evaluation approaches, reliant on performance change after applying ICL, suffer from low reliability, poor attribution, and impracticality in data-insufficient scenarios. We propose the Learning-to-Context Slope (LCS), a novel metric that quantifies ICL effectiveness by modeling the slope between learning gain (loss decrease from demonstrations) and contextual relevance (demonstration-input relevance). 
LCS addresses key limitations of performance-based metrics: (1) it captures continuous loss changes even when outputs are incorrect, improving reliability; (2) its formulation attributes ICL failures to weak contextual alignment (inability to adapt inputs to demonstrations) or strong output calibration (self-verification of correctness); and (3) it minimizes reliance on labeled data via synthetic evaluation. Extensive experiments demonstrate that LCS strongly correlates with performance improvements in labeled settings and reliably reflects true effectiveness in biased or data-scarce scenarios. Further analysis reveals actionable thresholds for LCS and identifies model capabilities critical to ICL success.
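
One way to picture a "slope between learning gain and contextual relevance" is an ordinary least-squares fit of gain against relevance. The sketch below is an assumption, not the paper's formulation of LCS: the relevance scores and losses are invented, and both function names are hypothetical.

```python
def learning_gain(loss_without_demos, loss_with_demos):
    """Loss decrease attributable to the demonstrations (positive means the demos helped)."""
    return loss_without_demos - loss_with_demos


def least_squares_slope(xs, ys):
    """Slope of the ordinary least-squares line fitted to the points (xs, ys)."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need at least two paired points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("relevance values must not all be equal")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


# Illustrative numbers only: demonstration-input relevance and per-example losses.
relevance = [0.0, 0.5, 1.0]
loss_without = [2.0, 2.0, 2.0]
loss_with = [1.9, 1.8, 1.7]

gains = [learning_gain(a, b) for a, b in zip(loss_without, loss_with)]
print(least_squares_slope(relevance, gains))
```

Under this reading, a clearly positive slope means that more relevant demonstrations reduce the loss more, while a slope near zero suggests the model is not benefiting from relevant context. Because the fit uses losses rather than correctness, it still carries signal when the final outputs are wrong.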

[42] V-SYNTHESIS: Task-Agnostic Synthesis of Consistent and Diverse In-Context Demonstrations from Scratch via V-Entropy

Dingzirui Wang,Xuanliang Zhang,Keyan Xu,Qingfu Zhu,Wanxiang Che,Yang Deng

Main category: cs.CL

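TL;DR: This paper proposes V-Synthesis, a new method for synthesizing demonstrations from scratch that effectively improves synthesis quality and diversity.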

Details

Motivation: The aim is to reduce the high labeling cost of in-context learning. Existing synthesis methods are mainly task-specific or rely on pre-existing demonstrations, so they lack generality. Method: The authors propose V-Score, a new metric for synthesis consistency, and V-Synthesis, a proportional sampling method based on V-Score, and validate both experimentally. Result: V-Synthesis improves average performance by 2.0% over existing synthesis methods. Conclusion: V-Synthesis is an effective method for synthesizing demonstrations that improves consistency and diversity for arbitrary tasks. Abstract: High labeling cost for in-context learning (ICL) demonstrations motivates using large language models (LLMs) for synthesis to reduce overhead. However, existing synthesis methods are mainly task-specific or rely on pre-existing demonstrations. So this paper focuses on synthesizing demonstrations from scratch for arbitrary tasks. A major challenge in synthesizing from scratch is ensuring consistency with the target task, as the lack of labeling guidance could lead to synthesis bias. We first propose a consistency metric called V-Score, which has higher performance and lower computation cost compared with the metrics based on grams or embedding vectors. Furthermore, we introduce V-Synthesis, which leverages V-Score for proportional sampling to ensure both high consistency and diversity of synthesized demonstrations. Experimental results demonstrate that V-Synthesis yields an average performance improvement of 2.0% compared to existing synthesis methods, confirming the effectiveness of V-Synthesis.
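
Proportional sampling, as opposed to always keeping the top-scoring candidates, can be sketched with the standard library. This is a generic illustration under stated assumptions: the scores below are made-up non-negative numbers standing in for a consistency score, not V-Score itself, and `proportional_sample` is a hypothetical helper.

```python
import random


def proportional_sample(candidates, scores, k, seed=0):
    """Draw up to k distinct candidates, each draw weighted by its score.

    Assumes scores are non-negative and that at least one remaining score is positive.
    """
    rng = random.Random(seed)
    pool = list(zip(candidates, scores))
    chosen = []
    for _ in range(min(k, len(pool))):
        weights = [score for _, score in pool]
        index = rng.choices(range(len(pool)), weights=weights, k=1)[0]
        # Remove the pick so the same demonstration is not selected twice.
        chosen.append(pool.pop(index)[0])
    return chosen


# Illustrative synthesized demonstrations with hypothetical consistency scores.
demos = ["demo A", "demo B", "demo C", "demo D"]
scores = [0.9, 0.1, 0.6, 0.4]
print(proportional_sample(demos, scores, k=2))
```

Sampling in proportion to the score favors consistent demonstrations while still giving lower-scored ones a chance, which is one plausible way to trade consistency against diversity.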

[43] RiverText: A Python Library for Training and Evaluating Incremental Word Embeddings from Text Data Streams

Gabriel Iturra-Bocaz,Felipe Bravo-Marquez

Main category: cs.CL

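TL;DR: This paper presents RiverText, an open-source Python library for dynamically updating word representations, aimed at streaming scenarios in information retrieval and natural language processing.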

Details

Motivation: Traditional word embedding models are static, which makes it hard for them to adapt to constantly evolving language patterns, such as the new hashtags or brand names that emerge on social media and the web. Method: The paper introduces RiverText, which implements different incremental word embedding techniques, such as Skip-gram, Continuous Bag of Words, and Word Context Matrix, in a standardized framework and uses PyTorch as its backend for neural network training. Result: The library includes a module that adapts existing static word embedding evaluation tasks (word similarity and word categorization) to a streaming setting; the implemented methods are compared under different hyperparameter settings and the results are discussed. Conclusion: RiverText is a Python library for training and evaluating incremental word embeddings from text data streams, offered as a resource for the information retrieval and natural language processing communities that work with word embeddings in streaming scenarios. Abstract: Word embeddings have become essential components in various information retrieval and natural language processing tasks, such as ranking, document classification, and question answering. However, despite their widespread use, traditional word embedding models present a limitation in their static nature, which hampers their ability to adapt to the constantly evolving language patterns that emerge in sources such as social media and the web (e.g., new hashtags or brand names). To overcome this problem, incremental word embedding algorithms are introduced, capable of dynamically updating word representations in response to new language patterns and processing continuous data streams. This paper presents RiverText, a Python library for training and evaluating incremental word embeddings from text data streams. Our tool is a resource for the information retrieval and natural language processing communities that work with word embeddings in streaming scenarios, such as analyzing social media. The library implements different incremental word embedding techniques, such as Skip-gram, Continuous Bag of Words, and Word Context Matrix, in a standardized framework. In addition, it uses PyTorch as its backend for neural network training. We have implemented a module that adapts existing intrinsic static word embedding evaluation tasks for word similarity and word categorization to a streaming setting. Finally, we compare the implemented methods with different hyperparameter settings and discuss the results. Our open-source library is available at https://github.com/dccuchile/rivertext.

[44] Generalist Reward Models: Found Inside Large Language Models

Yi-Chen Li,Tian Xu,Yang Yu,Xuqin Zhang,Xiong-Hui Chen,Zhongxiang Ling,Ningjing Chao,Lei Yuan,Zhi-Hua Zhou

Main category: cs.CL

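TL;DR: This paper finds that large language models already contain a high-quality reward signal that can be used for reinforcement learning without additional training, and it provides what the authors describe as the first theoretical proof that reinforcement learning is effective for LLMs.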

Details

Motivation: Reward models trained on human preference data are costly, while methods based on AI feedback lack a rigorous theoretical foundation, so the authors seek a new, theoretically sound alternative that lowers the cost of alignment. Method: The paper develops a theoretical framework proving that an LLM trained with standard next-token prediction already implicitly contains a powerful generalist reward model, and explains its effectiveness through the theory of offline inverse reinforcement learning. Result: Experiments show that the method not only outperforms existing LLM-as-a-judge approaches but can also surpass explicitly trained reward models, and that the policy obtained by subsequent reinforcement learning has a provably better error bound than the base model. Conclusion: By exploiting the reward signal already present in the pre-trained model, the traditional reward modeling stage can be replaced, yielding a more efficient, powerful, and scalable approach to LLM alignment. Abstract: The alignment of Large Language Models (LLMs) is critically dependent on reward models trained on costly human preference data. While recent work explores bypassing this cost with AI feedback, these methods often lack a rigorous theoretical foundation. In this paper, we discover that a powerful generalist reward model is already latently present within any LLM trained via standard next-token prediction. We prove that this endogenous reward is not a heuristic, but is theoretically equivalent to a reward function learned through offline inverse reinforcement learning. This connection allows us to directly elicit a high-quality reward signal from a base (pre-trained or supervised fine-tuned) model without any further training. Critically, we also prove that subsequent reinforcement learning using this endogenous reward leads to a policy with a provably superior error bound compared to the base model. To our best knowledge, this is the first theoretical proof of the effectiveness of reinforcement learning for LLMs. Our experiments validate this theory, demonstrating that our method not only outperforms existing LLM-as-a-judge approaches but can also surpass explicitly trained reward models. These findings suggest that the reward modeling stage can be replaced by a principled method of eliciting the knowledge already captured during pre-training, heralding a more efficient, powerful, and scalable paradigm for LLMs alignment as well as multi-modal models.

[45] Two Spelling Normalization Approaches Based on Large Language Models

Miguel Domingo,Francisco Casacuberta

Main category: cs.CL

TL;DR: This study explores spelling normalization techniques using large language models, concluding that statistical machine translation is currently the most effective approach.

Details

Motivation: The absence of standardized spelling conventions in historical documents presents a challenge for scholars in the humanities. Method: Two new approaches based on large language models were proposed and evaluated across multiple datasets. Result: Both approaches yielded encouraging results, but statistical machine translation performed best. Conclusion: Statistical machine translation is currently the most suitable technology for spelling normalization. Abstract: The absence of standardized spelling conventions and the organic evolution of human language present an inherent linguistic challenge within historical documents, a longstanding concern for scholars in the humanities. Addressing this issue, spelling normalization endeavors to align a document's orthography with contemporary standards. In this study, we propose two new approaches based on large language models: one of which has been trained without a supervised training, and a second one which has been trained for machine translation. Our evaluation spans multiple datasets encompassing diverse languages and historical periods, leading us to the conclusion that while both of them yielded encouraging results, statistical machine translation still seems to be the most suitable technology for this task.

[46] Objective-Free Local Learning and Emergent Language Structure in Thinking Machines

P. Myles Eugenio

Main category: cs.CL

TL;DR: This paper introduces a neuro-symbolic framework for language modeling based on local, emergent learning with a hierarchical Hopfield memory chain. It enables the spontaneous formation of symbolic structures, grammar, and language-like patterns without supervision, offering a pathway to scalable and interpretable AI systems with enhanced generalization capabilities.

Details

Motivation: The motivation is to address limitations in conventional language models by developing a system capable of plasticity and generalization beyond its initial scope without explicit data. The goal is to study how symbolic structure can emerge from local neural learning and create scalable, interpretable neuro-symbolic systems. Method: The method involves a hierarchical Hopfield memory chain used as a compositional short-term memory and dynamic tokenizer (retokenizer). The model learns multi-scale representations of symbol sequences without predefined tokens or supervision, using projection tensors to bind features into hierarchical tokens. Learning occurs locally via Hebbian principles, with constraints dictating allowed structure, and emergent neurons acting as long-term memory supporting compositional inference. Result: The results show that the retokenizer can filter natural language patterns from noise, generating synthetic languages with coherent internal morphology comparable to human language. Emergent embedding neurons support a key-value mechanism for compositional inference and generalization, demonstrating the ability to generalize beyond initial inference classes even without explicit data. Conclusion: The paper concludes that the proposed neuro-symbolic framework can effectively enable generative language modeling through local, event-driven emergent learning. It demonstrates how symbolic structures can emerge from neural learning and highlights the potential for scalable and interpretable systems where grammar, tokens, and reasoning naturally arise. Abstract: We present a neuro-symbolic framework for generative language modeling based on local, event-driven emergent learning. At its core is a hierarchical Hopfield memory chain acting as a compositional short-term memory and dynamic tokenizer (retokenizer). 
Rather than relying on predefined tokens or supervision, the model builds structure from scratch, learning symbol sequences as multi-scale representations. It constructs projection tensors that bind co-occurring features into hierarchical tokens, introducing redundancy (i.e., an emergent gauge structure) and enabling compression of local activations into long-range dependencies. Curiously, we find that the retokenizer can filter natural language patterns from noise, generating synthetic languages with coherent internal morphology -- quantifiably the same as human language. Language is learned in a local (Hebbian) fashion, where model constraints dictate allowed emergent structure, and new information is retained in alignment with this structure. The absence of a global objective enables a form of plasticity not found in conventional language models, allowing the system to generalize beyond its initial inference class -- even without explicit data. We demonstrate that briefly activating a new neuron during inference binds distributed multi-scale token features into a symbolic embedding. These emergent embedding neurons act as long-term memory and support a key-value mechanism for compositional inference and generalization. This architecture provides a methodological foundation for studying how symbolic structure can emerge from local neural learning. It offers a new pathway for building scalable, interpretable neuro-symbolic systems -- where tokens, grammar, and reasoning arise as compressed memory traces within a Hopfield hierarchy. This approach advances the development of neuromorphic architectures for generative language models.

[47] Ensemble BERT for Medication Event Classification on Electronic Health Records (EHRs)

Shouvon Sarker,Xishuang Dong,Lijun Qian

Main category: cs.CL

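TL;DR: This study develops a BERT-based ensemble model for detecting and classifying medication events in clinical notes and reports strong results on the n2c2 2022 challenge.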

Details

Motivation: Detecting and classifying medication events in clinical records has wide application in clinical data analytics but remains challenging. Method: BERT models are pretrained on different types of large-scale data, fine-tuned on the CMED training data, and their multiple predictions are then combined with voting strategies to build the final prediction. Result: Experiments show that the approach improves the strict Micro-F score by about 5% and the strict Macro-F score by about 6%. Conclusion: A BERT-based ensemble model can effectively improve the strict Micro-F and Macro-F scores for medication event classification. Abstract: Identification of key variables such as medications, diseases, relations from health records and clinical notes has a wide range of applications in the clinical domain. n2c2 2022 provided shared tasks on challenges in natural language processing for clinical data analytics on electronic health records (EHR), where it built a comprehensive annotated clinical data Contextualized Medication Event Dataset (CMED). This study focuses on subtask 2 in Track 1 of this challenge that is to detect and classify medication events from clinical notes through building a novel BERT-based ensemble model. It started with pretraining BERT models on different types of big data such as Wikipedia and MIMIC. Afterwards, these pretrained BERT models were fine-tuned on CMED training data. These fine-tuned BERT models were employed to accomplish medication event classification on CMED testing data with multiple predictions. These multiple predictions generated by these fine-tuned BERT models were integrated to build final prediction with voting strategies. Experimental results demonstrated that BERT-based ensemble models can effectively improve strict Micro-F score by about 5% and strict Macro-F score by about 6%, respectively.
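The abstract mentions "voting strategies" without saying which ones were used. As one plausible instance, here is a minimal sketch of plain majority voting over per-model label predictions. The function name, the tie-breaking rule (the earliest model in the list wins), and the label strings are illustrative assumptions, not details taken from the paper.

```python
from collections import Counter


def majority_vote(predictions):
    """Combine per-model predictions into one label per example.

    `predictions` is a list with one entry per model; each entry is a list
    of labels, one per example. Ties go to the label predicted by the
    earliest model in the list (an arbitrary choice for this sketch).
    """
    if not predictions:
        raise ValueError("need at least one model's predictions")
    n_examples = len(predictions[0])
    if any(len(p) != n_examples for p in predictions):
        raise ValueError("all models must predict the same number of examples")

    voted = []
    for labels in zip(*predictions):  # labels for one example, across models
        counts = Counter(labels)
        top = max(counts.values())
        voted.append(next(label for label in labels if counts[label] == top))
    return voted


# Three hypothetical fine-tuned models, three examples, placeholder labels.
model_outputs = [
    ["Disposition", "NoDisposition", "Undetermined"],
    ["Disposition", "Undetermined", "Undetermined"],
    ["NoDisposition", "NoDisposition", "Undetermined"],
]
final = majority_vote(model_outputs)
```

Weighted voting, for example by each model's validation score, would be a natural variant. The abstract does not say whether the authors used it.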

[48] Information Loss in LLMs' Multilingual Translation: The Role of Training Data, Language Proximity, and Language Family

Yumeng Lin,Xufeng Duan,David Haslett,Yige Chen,Zhenguang G. Cai

Main category: cs.CL

TL;DR: This study examines how training data, language proximity, and language family affect multilingual translation quality in large language models like GPT-4 and Llama 2, revealing that translation performance depends not only on data volume but also on structural and typological language relationships.

Details

Motivation: Large language models face challenges in multilingual translation for certain language pairs, especially those with limited training data or significant linguistic divergence from English. Understanding these limitations can improve model development and translation strategies. Method: The study evaluated two large language models, GPT-4 and Llama 2, using round-trip translations. Translation quality was measured through BLEU scores and BERT similarity metrics to systematically investigate the impact of training data, language proximity, and language family on information loss. Result: Languages structurally closer to English yield better translation quality under low-resource conditions. Abundant training data can reduce the effects of linguistic divergence. Orthographic, phylogenetic, syntactic, and geographical distances were identified as strong predictors of translation performance, and language family exerted an independent influence. Conclusion: Translation quality in large language models is influenced by both data volume and structural as well as typological relationships between languages. Training data size interacts robustly with language distance, and certain distance metrics are strong predictors of translation performance. Abstract: Large language models have achieved impressive progress in multilingual translation, yet they continue to face challenges with certain language pairs-particularly those with limited training data or significant linguistic divergence from English. This study systematically investigates how training data, language proximity, and language family affect information loss in multilingual translation. We evaluate two large language models, GPT-4 and Llama 2, by performing round-trip translations. Translation quality was assessed using BLEU scores and BERT similarity metrics. 
Our results reveal a robust interaction between training data size and language distance: while abundant training data can mitigate the effects of linguistic divergence, languages structurally closer to English consistently yield higher translation quality in low-resource conditions. Among various distance metrics, orthographic, phylogenetic, syntactic, and geographical distances emerge as strong predictors of translation performance. Language family also exerts an independent influence. These findings contribute to a deeper understanding of the linguistic constraints shaping multilingual translation in large language models, emphasizing that translation quality is shaped not only by data volume but also by structural and typological relationships between languages.
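The round-trip protocol described above translates a source text into a target language and back, then compares the result with the original. The sketch below illustrates that loop under stated assumptions: `to_target` and `to_source` are hypothetical callables standing in for calls to a translation model, and a simple unigram F1 is used as a lightweight stand-in for the BLEU and BERT-based similarity scores the study reports.

```python
from collections import Counter


def unigram_f1(reference, hypothesis):
    """Token-overlap F1 between two strings (a stand-in metric, not BLEU)."""
    ref = Counter(reference.lower().split())
    hyp = Counter(hypothesis.lower().split())
    overlap = sum((ref & hyp).values())  # multiset intersection
    if overlap == 0:
        return 0.0
    precision = overlap / sum(hyp.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


def round_trip_score(text, to_target, to_source, metric=unigram_f1):
    """Translate text out and back, then score the result against the original."""
    back_translated = to_source(to_target(text))
    return metric(text, back_translated)


# Toy "translators": the outbound step is lossless, the return step drops
# the final word, mimicking information loss in the round trip.
def lossless(s):
    return s


def drop_last_word(s):
    return " ".join(s.split()[:-1])


score = round_trip_score("the cat sat on the mat", lossless, drop_last_word)
```

A lower score indicates more information lost in the round trip. With real models, the metric would be swapped for whichever similarity measure the evaluation calls for.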

[49] ATGen: A Framework for Active Text Generation

Akim Tsvigun,Daniil Vasilev,Ivan Tsvigun,Ivan Lysenko,Talgat Bektleuov,Aleksandr Medvedev,Uliana Vinogradova,Nikita Severin,Mikhail Mozikov,Andrey Savchenko,Rostislav Grigorev,Ramil Kuleev,Fedor Zhdanov,Artem Shelmanov,Ilya Makarov

Main category: cs.CL

TL;DR: The paper introduces ATGen, a framework integrating AL with NLG tasks, reducing annotation effort and costs.

Details

Motivation: To address the limited application of AL in NLG despite its effectiveness in reducing annotation effort. Method: Introduction of a framework that integrates AL strategies with NLG, using both human annotators and LLM-based agents. Result: ATGen enables smooth implementation and benchmarking of AL strategies for NLG, showing reduced effort and cost. Conclusion: ATGen successfully bridges AL with NLG tasks, reducing annotation effort and cost. Abstract: Active learning (AL) has demonstrated remarkable potential in reducing the annotation effort required for training machine learning models. However, despite the surging popularity of natural language generation (NLG) tasks in recent years, the application of AL to NLG has been limited. In this paper, we introduce Active Text Generation (ATGen) - a comprehensive framework that bridges AL with text generation tasks, enabling the application of state-of-the-art AL strategies to NLG. Our framework simplifies AL-empowered annotation in NLG tasks using both human annotators and automatic annotation agents based on large language models (LLMs). The framework supports LLMs deployed as services, such as ChatGPT and Claude, or operated on-premises. Furthermore, ATGen provides a unified platform for smooth implementation and benchmarking of novel AL strategies tailored to NLG tasks. Finally, we present evaluation results for state-of-the-art AL strategies across diverse settings and multiple text generation tasks. We show that ATGen reduces both the effort of human annotators and costs associated with API calls to LLM-based annotation agents. The code of the framework is available on GitHub under the MIT license. The video presentation is available at http://atgen-video.nlpresearch.group

[50] Perspective Dial: Measuring Perspective of Text and Guiding LLM Outputs

Taejin Kim,Siun-Chuon Mau,Konrad Vesey

Main category: cs.CL

TL;DR: This paper proposes Perspective-Dial, a framework for measuring and controlling the perspective of large language model outputs using a metric space and prompt engineering.

Details

Motivation: The motivation stems from the lack of quantifiable understanding of bias and perspective in LLM outputs despite their use in mission-critical roles. This gap necessitates a method to measure and control perspective systematically. Method: Perspective-Dial uses a metric space called Perspective Space for quantitative measurement of perspectives and employs Systematic Prompt Engineering with greedy-coordinate descent to control LLM output based on feedback from the Perspective Space. Result: The paper introduces Perspective-Dial, which allows for the empirical adjustment of LLM output perspectives without requiring a theoretical understanding of bias, enabling applications like bias detection and mitigation, narrative tracking, and debate bot development. Conclusion: The paper concludes that Perspective-Dial can effectively quantify and adjust the perspective of LLM outputs, offering practical applications in bias mitigation, narrative detection, and discourse tracking. Abstract: Large language models (LLMs) are used in a variety of mission-critical roles. Due to the rapidly developing nature of LLMs, there is a lack of quantifiable understanding of the bias and perspective associated with LLM output. Inspired by this need, this paper considers the broader issue of perspective or viewpoint of general text and perspective control of large-language model (LLM) output. Perspective-Dial consists of two main components: (1) a metric space, dubbed Perspective Space, that enables quantitative measurements of different perspectives regarding a topic, and (2) Systematic Prompt Engineering, which uses greedy-coordinate descent to control LLM output perspective based on measurement feedback from the Perspective Space. The empirical nature of the approach allows progress to sidestep a principled understanding of perspective or bias -- effectively quantifying and adjusting outputs for a variety of topics.
Potential applications include detection, tracking, and mitigation of LLM bias; narrative detection, sense-making, and tracking in public discourse; and debate bots advocating a given perspective.
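The control loop described above, greedy coordinate descent over prompt choices guided by a measured perspective, can be sketched generically. In the sketch below, a prompt is modeled as a list of "slots", each with a few candidate phrasings, and `loss` is any function returning how far the resulting output is from the target perspective. Both the slot representation and the toy numeric loss are assumptions made for illustration. In the paper's setting the loss would come from measurements in the Perspective Space, which are not reproduced here.

```python
def greedy_coordinate_descent(slots, loss, max_sweeps=10):
    """Optimize one prompt slot at a time, keeping any change that lowers the loss.

    slots: list of lists; slots[i] holds the candidate options for slot i.
    loss:  callable taking a list with one chosen option per slot and
           returning a number (lower is better).
    """
    current = [options[0] for options in slots]
    best = loss(current)
    for _ in range(max_sweeps):
        improved = False
        for i, options in enumerate(slots):
            for option in options:
                trial = current[:i] + [option] + current[i + 1:]
                value = loss(trial)
                if value < best:  # accept strict improvements only
                    best, current, improved = value, trial, True
        if not improved:
            break  # a full sweep changed nothing: local optimum reached
    return current, best


# Toy stand-in for a perspective measurement: each option carries a made-up
# numeric "lean", and the loss is the distance of the total from a target.
LEAN = {"neutral": 0.0, "critical": -1.0, "favorable": 1.0,
        "plain": 0.0, "emphatic": 0.5}
TARGET = 1.5


def toy_loss(choice):
    return abs(sum(LEAN[c] for c in choice) - TARGET)


slots = [["neutral", "critical", "favorable"], ["plain", "emphatic"]]
prompt_choice, final_loss = greedy_coordinate_descent(slots, toy_loss)
```

Because each step changes a single slot, the number of loss evaluations per sweep grows with the sum of the option counts rather than their product. That matters when every evaluation requires an LLM call. The trade-off is that the search can stop at a local optimum.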

[51] Hierarchical Memory Organization for Wikipedia Generation

Eugene J. Yu,Dawei Zhu,Yifan Song,Xiangyu Wong,Jiebin Zhang,Wenxuan Shi,Xiaoguang Li,Qun Liu,Sujian Li

Main category: cs.CL

TL;DR: This paper introduces MOG, a hierarchical memory-based framework for autonomous Wikipedia article generation that improves informativeness, verifiability, and traceability.

Details

Motivation: Autonomously generating Wikipedia articles requires integrating accurate, comprehensive, and well-structured information from diverse sources, which is a challenging task. Method: The Memory Organization-based Generation (MOG) framework uses a hierarchical memory architecture to extract fine-grained memory units from web documents, recursively organize them into a Wikipedia-style structure, and guide the generation process. A citation module links every sentence to specific memory units. Result: Evaluations on the WikiStart dataset show that MOG surpasses baseline methods in producing informative and reliable articles, ensuring alignment between memory and article outline while minimizing hallucinations. Conclusion: MOG is especially robust in real-world scenarios and outperforms baseline methods in generating informative and reliable Wikipedia-style articles. Abstract: Generating Wikipedia articles autonomously is a challenging task requiring the integration of accurate, comprehensive, and well-structured information from diverse sources. This paper introduces the Memory Organization-based Generation (MOG) framework, a novel approach to address these challenges by leveraging a hierarchical memory architecture. MOG extracts fine-grained memory units from web documents, recursively organizes them into a Wikipedia-style hierarchical structure, and uses this structure to guide the generation process. This ensures alignment between memory and the article outline, improving both informativeness and verifiability while minimizing hallucinations. Additionally, a citation module is implemented to enhance traceability by linking every generated sentence to specific memory units. Evaluations on our newly created WikiStart dataset demonstrate that MOG outperforms baseline methods in producing informative and reliable articles, making it particularly robust in real-world scenarios.

[52] Datasets for Fairness in Language Models: An In-Depth Survey

Jiale Zhang,Zichong Wang,Avash Palikhe,Zhipeng Yin,Wenbin Zhang

Main category: cs.CL

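TL;DR: This paper surveys the fairness benchmark datasets commonly used in language model research, points out the biases they contain, and proposes a unified evaluation framework to expose them. The authors recommend improving how existing datasets are used and call for more diverse fairness benchmarks.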

Details

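Motivation: Although fairness benchmarks play a central role in evaluating language models, the datasets those benchmarks rely on have received very little study. By comprehensively reviewing existing datasets and introducing a new evaluation framework, the authors aim to help researchers better understand the assumptions and limitations of these datasets and to encourage fairer model evaluation practices. Method: The paper comprehensively reviews the 24 fairness benchmark datasets most widely used in current language model research and analyzes them along several dimensions, including origin, scope, content, and intended use. It also introduces a unified evaluation framework that reveals systematic demographic disparities across datasets and scoring methods. Result: The paper provides a detailed analysis of the 24 commonly used fairness benchmark datasets and reveals systematic demographic disparities within them. Using the unified evaluation framework, the authors show inconsistencies and potential biases across datasets and scoring methods and give practical advice on selecting and interpreting these datasets. All code, data, and detailed results are released to promote transparency and reproducibility. Conclusion: Existing fairness benchmark datasets contain overlooked biases that can affect judgments about model fairness. The unified evaluation framework reveals systematic demographic disparities across datasets and scoring methods, and the authors offer practical guidance for selecting, combining, and interpreting these datasets. They also call for new fairness benchmarks that better reflect diverse social contexts and for more meaningful use of these tools. Abstract: Fairness benchmarks play a central role in shaping how we evaluate language models, yet surprisingly little attention has been given to examining the datasets that these benchmarks rely on. This survey addresses that gap by presenting a broad and careful review of the most widely used fairness datasets in current language model research, characterizing them along several key dimensions including their origin, scope, content, and intended use to help researchers better appreciate the assumptions and limitations embedded in these resources. To support more meaningful comparisons and analyses, we introduce a unified evaluation framework that reveals consistent patterns of demographic disparities across datasets and scoring methods. Applying this framework to twenty-four common benchmarks, we highlight the often overlooked biases that can influence conclusions about model fairness and offer practical guidance for selecting, combining, and interpreting these datasets. We also point to opportunities for creating new fairness benchmarks that reflect more diverse social contexts and encourage more thoughtful use of these tools going forward. All code, data, and detailed results are publicly available at https://github.com/vanbanTruong/Fairness-in-Large-Language-Models/tree/main/datasets to promote transparency and reproducibility across the research community.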

[53] TuCo: Measuring the Contribution of Fine-Tuning to Individual Responses of LLMs

Felipe Nuti,Tim Franzmeyer,João Henriques

Main category: cs.CL

TL;DR: This paper introduces TuCo, a method to measure the contribution of fine-tuning to LLM responses, revealing insights into how adversarial attacks affect model behavior.

Details

Motivation: A quantitative and systematic method for analyzing the effect of fine-tuning on individual LLM outputs is lacking. Method: The method tracks the model's intermediate hidden states to provide a more fine-grained insight into the effects of fine-tuning. The Tuning Contribution (TuCo) is defined as the ratio of the magnitudes of the fine-tuning component to the pre-training component. Result: Empirical findings show that model behavior and performance can be steered by up- or down-scaling the fine-tuning component during the forward pass. Adversarial attacks on LLMs reduce TuCo and are successful when TuCo is lower. Conclusion: TuCo enables the quantitative study of how fine-tuning influences model behavior and safety, and vice versa. Abstract: Past work has studied the effects of fine-tuning on large language models' (LLMs) overall performance on certain tasks. However, a quantitative and systematic method for analyzing its effect on individual outputs is still lacking. Here, we propose a new method for measuring the contribution that fine-tuning makes to individual LLM responses, assuming access to the original pre-trained model. Our method tracks the model's intermediate hidden states, providing a more fine-grained insight into the effects of fine-tuning than a simple comparison of final outputs from pre-trained and fine-tuned models. We introduce and theoretically analyze an exact decomposition of any fine-tuned LLM into a pre-training component and a fine-tuning component. Empirically, we find that model behavior and performance can be steered by up- or down-scaling the fine-tuning component during the forward pass. Motivated by this finding and our theoretical analysis, we define the Tuning Contribution (TuCo) as the ratio of the magnitudes of the fine-tuning component to the pre-training component. 
We observe that three prominent adversarial attacks on LLMs circumvent safety measures in a way that reduces TuCo, and that TuCo is consistently lower on prompts where these attacks succeed compared to those where they do not. This suggests that attenuating the effect of fine-tuning on model outputs plays a role in the success of such attacks. In summary, TuCo enables the quantitative study of how fine-tuning influences model behavior and safety, and vice versa.
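The abstract defines TuCo as a ratio of the magnitudes of the fine-tuning component to the pre-training component, but does not give the exact formula. The sketch below only illustrates the general idea of such a ratio, under loud assumptions: each component is represented as a list of per-layer contribution vectors, the contributions are summed across layers, and the magnitude is the Euclidean norm. The `alpha` parameter mimics the up- or down-scaling of the fine-tuning component mentioned in the abstract. None of these choices is confirmed as the paper's definition.

```python
import math


def _norm(vector):
    """Euclidean norm of a plain list of numbers."""
    return math.sqrt(sum(x * x for x in vector))


def magnitude_ratio(pt_components, ft_components, alpha=1.0):
    """Ratio of the (scaled) fine-tuning magnitude to the pre-training magnitude.

    pt_components, ft_components: lists of equal-length vectors, one per
    layer (a hypothetical representation of the decomposition).
    alpha: scaling applied to the fine-tuning component; 1.0 leaves it
    unchanged, smaller values attenuate it.
    """
    pt_total = [sum(column) for column in zip(*pt_components)]
    ft_total = [alpha * sum(column) for column in zip(*ft_components)]
    denominator = _norm(pt_total)
    if denominator == 0:
        raise ValueError("pre-training component has zero magnitude")
    return _norm(ft_total) / denominator


# Made-up two-layer, two-dimensional contributions.
pt = [[3.0, 0.0], [0.0, 4.0]]   # sums to [3, 4], norm 5
ft = [[1.0, 0.0], [0.0, 0.0]]   # sums to [1, 0], norm 1
baseline = magnitude_ratio(pt, ft)
attenuated = magnitude_ratio(pt, ft, alpha=0.5)
```

Under this reading, a prompt that reduces the ratio is one where the fine-tuned behavior contributes relatively less to the output. That matches the direction of the adversarial-attack observation in the abstract.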

[54] Pipelined Decoder for Efficient Context-Aware Text Generation

Zixian Huang,Chenxu Niu,Yu Gu,Gengyang Xiao,Xinwei Huang,Gong Cheng

Main category: cs.CL

TL;DR: This paper introduces a pipelined decoder architecture that enables parallel text generation, significantly improving generation speed while maintaining quality.

Details

Motivation: Autoregressive models generate tokens sequentially, forming a bottleneck in generation speed. This work aims to enable parallelism in context-aware generation tasks. Method: A new decoder architecture that enables parallel generation of multiple subsequences, generating a new token for each subsequence at each time-step. Result: Experiments on question answering, text summarization, and keyphrase generation show significant improvements in generation speed with no significant loss in quality or additional memory usage. Conclusion: The proposed pipelined decoder efficiently generates text in parallel, improving generation speed without compromising quality or increasing memory consumption. Abstract: As the basis of generative AI, an autoregressive model requires the generation of a new token depending on all the previously generated tokens, which brings high quality but also restricts the model to generate tokens one by one, forming a bottleneck limiting the generation speed. In this paper, we propose a new decoder architecture that efficiently generates text in parallel for context-aware generation tasks. Our proposed pipelined decoder initiates the generation of multiple subsequences simultaneously, and, at each time-step, it generates a new token for each subsequence to realize parallelism. Experiments on multiple text generation tasks, including question answering, text summarization, and keyphrase generation, show that our pipelined decoder significantly improves the generation speed without a significant loss of generation quality or additional memory consumption.

[55] What to Keep and What to Drop: Adaptive Table Filtering Framework

Jang Won June

Main category: cs.CL

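TL;DR: ATF is an adaptive table filtering framework for table reasoning tasks that substantially reduces input table size and improves model performance on most tasks.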

Details

Motivation: Large language models are constrained by input length limits when handling long tables, so an effective way to reduce table content is needed to improve efficiency and performance. Method: Proposes ATF, a modular, question-aware table filtering framework that prunes uninformative rows and columns using LLM-generated column descriptions, clustering, and sparse-dense alignment scores. Result: ATF removes about 70% of table cells and improves performance on TableQA tasks, with a slight drop on Table Fact Verification, which depends more on full-table context. Conclusion: By adaptively filtering redundant information from tables, ATF improves model performance on most tasks and flexibly balances informativeness and minimalism across tasks. Abstract: Large language models (LLMs) for table-based reasoning often struggle with large tables due to input length limits. We propose ATF (Adaptive Table Filtering Framework), a modular and question-aware filtering pipeline that prunes uninformative columns and rows using LLM-generated column descriptions, clustering, and sparse-dense alignment scores. ATF integrates seamlessly with existing models (e.g., TAPAS, TAPEX) without retraining. Experiments show that ATF reduces table cells by ~70%, boosting performance on out-of-domain TableQA tasks while causing slight performance drops on Table Fact Verification, where full-table context is more critical. These results highlight ATF's ability to adaptively balance informativeness and minimalism across tasks.

[56] Thought-Augmented Planning for LLM-Powered Interactive Recommender Agent

Haocheng Yu,Yaxiong Wu,Hao Wang,Wei Guo,Yong Liu,Yawen Li,Yuyang Ye,Junping Du,Enhong Chen

Main category: cs.CL

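TL;DR: This paper proposes TAIRA, an LLM-based multi-agent system that improves the handling of complex user intents in interactive recommendation, with performance boosted by Thought Pattern Distillation.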

Details

Motivation: Existing LLM-powered interactive recommender agents have limited planning and generalization capabilities and struggle to handle diverse, complex user intents, such as intuitive, unrefined, or occasionally ambiguous requests. Method: Proposes TAIRA, a thought-augmented interactive recommender agent system that works by decomposing user needs and planning subtasks, with its planning capability strengthened through Thought Pattern Distillation (TPD). Result: In comprehensive experiments on multiple datasets, TAIRA performs significantly better than existing methods. Its advantage is larger on more challenging tasks, and it generalizes effectively to novel tasks. Conclusion: By distilling thought patterns, TAIRA handles complex user intents effectively, demonstrating its advantage for managing such intents in interactive recommender systems. Abstract: Interactive recommendation is a typical information-seeking task that allows users to interactively express their needs through natural language and obtain personalized recommendations. Large language model-powered (LLM-powered) agents have become a new paradigm in interactive recommendations, effectively capturing users' real-time needs and enhancing personalized experiences. However, due to limited planning and generalization capabilities, existing formulations of LLM-powered interactive recommender agents struggle to effectively address diverse and complex user intents, such as intuitive, unrefined, or occasionally ambiguous requests. To tackle this challenge, we propose a novel thought-augmented interactive recommender agent system (TAIRA) that addresses complex user intents through distilled thought patterns. Specifically, TAIRA is designed as an LLM-powered multi-agent system featuring a manager agent that orchestrates recommendation tasks by decomposing user needs and planning subtasks, with its planning capacity strengthened through Thought Pattern Distillation (TPD), a thought-augmentation method that extracts high-level thoughts from the agent's and human experts' experiences. Moreover, we designed a set of user simulation schemes to generate personalized queries of different difficulties and evaluate the recommendations based on specific datasets. Through comprehensive experiments conducted across multiple datasets, TAIRA exhibits significantly enhanced performance compared to existing methods.
Notably, TAIRA shows a greater advantage on more challenging tasks while generalizing effectively on novel tasks, further validating its superiority in managing complex user intents within interactive recommendation systems. The code is publicly available at: https://github.com/Alcein/TAIRA.

[57] Reinforcement Fine-Tuning Enables MLLMs Learning Novel Tasks Stably

Zhihao Zhang,Qiaole Dong,Qi Zhang,Jun Zhao,Enyu Zhou,Zhiheng Xi,Senjie Jin,Xiaoran Fan,Yuhao Zhou,Yanwei Fu,Tao Ji,Tao Gui,Xuanjing Huang

Main category: cs.CL

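TL;DR: The study compares how SFT and RFT affect the adaptation of multimodal large language models to downstream tasks, finding that SFT acquires tasks quickly but causes catastrophic forgetting, whereas RFT learns new tasks more slowly but retains prior knowledge.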

Details

Motivation: To understand how post-training algorithms such as Supervised Fine-Tuning (SFT) and Reinforcement Fine-Tuning (RFT) affect the retention of prior knowledge in multimodal large language models. Method: The authors introduce jigsaw puzzles as a novel task and use the open-source multimodal model Qwen2.5-VL to systematically study how SFT and RFT behave differently, analyzing their learning dynamics. Result: Experiments reveal a sharp trade-off: SFT learns quickly but causes catastrophic forgetting, whereas RFT preserves prior knowledge by reinforcing correct samples that already align with the base model's probability landscape. Conclusion: Data distribution, rather than algorithmic differences, plays the central role in forgetting, and RFT shows potential for stable continual learning in multimodal large language models. Abstract: Post-training algorithms such as Supervised Fine-Tuning (SFT) and Reinforcement Fine-Tuning (RFT) are widely used to adapt multimodal large language models to downstream tasks. While effective at task adaptation, their impact on prior knowledge remains unclear. In this paper, we introduce jigsaw puzzles as a novel task absent from existing pretraining corpora and systematically study the behavior of SFT and RFT on an open-source multimodal model, Qwen2.5-VL. Our experiments reveal a sharp trade-off: SFT enables rapid task acquisition but leads to catastrophic forgetting, whereas RFT learns more slowly on novel tasks but maintains prior knowledge. We analyze this phenomenon through the lens of learning dynamics, showing that RFT reinforces correct samples that are naturally aligned with the base model's probability landscape, mitigating interference with prior knowledge. Moreover, supervised training on correct RFT-simulated rollouts allows SFT to preserve knowledge while rapidly learning new tasks. These findings suggest that data distribution, rather than algorithmic differences, plays a central role in forgetting, and highlight RFT's potential for stable continual learning in multimodal large language models.

[58] NEU-ESC: A Comprehensive Vietnamese dataset for Educational Sentiment analysis and topic Classification toward multitask learning

Phan Quoc Hung Mai,Quang Hung Nguyen,Phuong Giang Duong,Hong Hanh Nguyen,Nguyen Tuan Long

Main category: cs.CL

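TL;DR: This study presents NEU-ESC, a new Vietnamese dataset of educational comments, and uses BERT models to achieve effective multitask sentiment and topic classification.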

Details

Motivation: Existing educational datasets offer limited Vietnamese resources and often lack domain relevance and the informal language students actually use, so a higher-quality dataset that better fits real needs is required. Method: The team collected data from university forums to build NEU-ESC, ran multitask learning experiments with encoder-only language models (BERT), and benchmarked against other datasets and models. Result: With multitask learning on NEU-ESC, the models reach up to 83.7% accuracy for sentiment classification and 79.8% for topic classification, supporting the usefulness of the dataset and the approach. Conclusion: NEU-ESC is a new Vietnamese dataset for educational sentiment and topic classification, suited to opinion analysis in education, and multitask learning performs well on it. Abstract: In the field of education, understanding students' opinions through their comments is crucial, especially in the Vietnamese language, where resources remain limited. Existing educational datasets often lack domain relevance and student slang. To address these gaps, we introduce NEU-ESC, a new Vietnamese dataset for Educational Sentiment Classification and Topic Classification, curated from university forums, which offers more samples, richer class diversity, longer texts, and broader vocabulary. In addition, we explore multitask learning using encoder-only language models (BERT), which achieves up to 83.7% and 79.8% accuracy on the sentiment and topic classification tasks, respectively. We also benchmark our dataset and model with other datasets and models, including Large Language Models, and discuss these benchmarks. The dataset is publicly available at: https://huggingface.co/datasets/hung20gg/NEU-ESC.

[59] On Recipe Memorization and Creativity in Large Language Models: Is Your Model a Creative Cook, a Bad Cook, or Merely a Plagiator?

Jan Kvapil,Martin Fajcik

Main category: cs.CL

TL;DR: This study analyzes whether AI-generated recipes stem from memorization, creativity, or nonsense, and proposes an automated evaluation framework for scalable analysis.

Details

Motivation: To understand how LLMs generate recipes (through memorization, creativity, or nonsense) and to develop an automated method for evaluating these aspects at scale. Method: Human annotation was conducted on 20 preselected recipes generated by Mixtral to assess memorization, creativity, and nonsense. An automated LLM-as-judge pipeline was also developed for scaling the analysis, using models like Llama 3.1+Gemma 2 9B for ingredient extraction and annotation. Result: Mixtral was found to reuse ingredients traceable to online sources, indicating reliance on memorization. The best-performing combination in the automated pipeline, Llama 3.1+Gemma 2 9B, achieved up to 78% accuracy in ingredient matching. Conclusion: The study concludes that Mixtral relies heavily on memorized content when generating recipes, but the proposed LLM-as-judge pipeline enables large-scale assessment of creativity and nonsense in LLM outputs. Abstract: This work-in-progress investigates the memorization, creativity, and nonsense found in cooking recipes generated from Large Language Models (LLMs). Specifically, we aim (i) to analyze memorization, creativity, and nonsense in LLMs using a small, high-quality set of human judgments and (ii) to evaluate potential approaches to automate such a human annotation in order to scale our study to hundreds of recipes. To achieve (i), we conduct a detailed human annotation on 20 preselected recipes generated by an LLM (Mixtral), extracting each recipe's ingredients and step-by-step actions to assess which elements are memorized--i.e., directly traceable to online sources possibly seen during training--and which arise from genuine creative synthesis or outright nonsense. We find that Mixtral consistently reuses ingredients that can be found in online documents, potentially seen during model training, suggesting strong reliance on memorized content.
To achieve aim (ii) and scale our analysis beyond small sample sizes and single LLM validation, we design an "LLM-as-judge" pipeline that automates recipe generation, nonsense detection, parsing ingredients and recipe steps, and their annotation. For instance, comparing its output against human annotations, the best ingredient extractor and annotator is Llama 3.1+Gemma 2 9B, achieving up to 78% accuracy on ingredient matching. This automated framework enables large-scale quantification of memorization, creativity, and nonsense in generated recipes, providing rigorous evidence of the models' creative capacities.

[60] Semantic-guided Diverse Decoding for Large Language Model

Weijie Shi,Yue Cui,Yaguang Wu,Jingzhi Fang,Shibo Zhang,Mengze Li,Sirui Han,Jia Zhu,Jiajie Xu,Xiaofang Zhou

Main category: cs.CL

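TL;DR: This paper proposes SemDiD, a new method that operates directly in embedding space and significantly improves the semantic diversity of responses generated by large language models.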

Details

Motivation: Existing diverse decoding methods mostly target lexical rather than semantic diversity, which limits several downstream applications, so a new method is needed. Method: SemDiD introduces three mechanisms (orthogonal directional guidance, dynamic inter-group repulsion, and position-debiased probability assessment) and balances quality with diversity through adaptive gain functions and constraint optimization. Result: Experiments show that SemDiD outperforms existing methods, improving Best-of-N coverage by 1.4-5.2% across diverse tasks while accelerating RLHF training convergence and increasing accuracy. Conclusion: SemDiD offers an effective way to achieve semantically diverse decoding, addressing the limitation that existing methods mainly produce lexical diversity. Abstract: Diverse decoding of large language models is crucial for applications requiring multiple semantically distinct responses, yet existing methods primarily achieve lexical rather than semantic diversity. This limitation significantly constrains Best-of-N strategies, group-based reinforcement learning, and data synthesis. While temperature sampling and diverse beam search modify token distributions or apply n-gram penalties, they fail to ensure meaningful semantic differentiation. We introduce Semantic-guided Diverse Decoding (SemDiD), which operates directly in embedding space and balances quality with diversity through three complementary mechanisms: orthogonal directional guidance, dynamic inter-group repulsion, and position-debiased probability assessment. SemDiD harmonizes these competing objectives using adaptive gain functions and constraint optimization, ensuring both quality thresholds and maximal semantic differentiation. Experiments show SemDiD consistently outperforms existing methods, improving Best-of-N coverage by 1.4-5.2% across diverse tasks and accelerating RLHF training convergence by 15% while increasing accuracy by up to 2.1%.
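
The Best-of-N coverage figure quoted above can be read as the share of problems for which at least one of the N sampled responses is correct. The following is a minimal sketch under that assumption; the abstract does not spell out the exact definition, and the function name `best_of_n_coverage` is hypothetical.

```python
def best_of_n_coverage(correct_flags):
    """Fraction of problems with at least one correct sample.

    correct_flags[i][j] is True when sample j for problem i is correct.
    This reading of "coverage" is an assumption, not the paper's stated formula.
    """
    covered = sum(1 for flags in correct_flags if any(flags))
    return covered / len(correct_flags)


# Four problems, two samples each: problems 0 and 2 are covered.
flags = [[False, True], [False, False], [True, True], [False, False]]
coverage = best_of_n_coverage(flags)
```

Under this reading, a decoder that produces semantically distinct samples raises coverage because the N attempts are less likely to repeat the same wrong answer.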

[61] Evaluating the Simulation of Human Personality-Driven Susceptibility to Misinformation with LLMs

Manuel Pratelli,Marinella Petrocchi

Main category: cs.CL

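TL;DR: This paper examines whether large language models (LLMs) can serve as generators of synthetic behavioural data, focusing on whether they capture personality-driven differences in susceptibility to misinformation.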

Details

Motivation: To determine whether large-scale synthetic behavioural data, an ethical and low-cost alternative to human experiments, can faithfully capture psychological differences driven by personality traits, in particular differences in susceptibility to misinformation. Method: Using published datasets in which human participants with known personality profiles rated headline accuracy, the authors create matching LLM agents and compare their responses with the original human patterns. The study focuses on news discernment, the ability to judge true headlines as true and false headlines as false. Result: Certain trait-misinformation associations, notably those involving Agreeableness and Conscientiousness, are reliably replicated, whereas others diverge, revealing systematic biases in how LLMs internalize and express personality. Conclusion: LLMs conditioned on Big-Five profiles show both promise and limits for behavioural simulation, since only some trait-misinformation associations are reproduced reliably. Abstract: Large language models (LLMs) make it possible to generate synthetic behavioural data at scale, offering an ethical and low-cost alternative to human experiments. Whether such data can faithfully capture psychological differences driven by personality traits, however, remains an open question. We evaluate the capacity of LLM agents, conditioned on Big-Five profiles, to reproduce personality-based variation in susceptibility to misinformation, focusing on news discernment, the ability to judge true headlines as true and false headlines as false. Leveraging published datasets in which human participants with known personality profiles rated headline accuracy, we create matching LLM agents and compare their responses to the original human patterns. Certain trait-misinformation associations, notably those involving Agreeableness and Conscientiousness, are reliably replicated, whereas others diverge, revealing systematic biases in how LLMs internalize and express personality. The results underscore both the promise and the limits of personality-aligned LLMs for behavioral simulation, and offer new insight into modeling cognitive diversity in artificial agents.
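
News discernment, as described above, can be scored in several ways. One common operationalization, assumed here and not confirmed by the abstract, is the share of true headlines rated true minus the share of false headlines rated true. The function name `news_discernment` is hypothetical.

```python
def news_discernment(rated_true, is_true):
    """Difference score: hit rate on true headlines minus false-alarm rate.

    rated_true[i] is the participant's (or agent's) judgment for headline i,
    is_true[i] is the ground-truth label. Both classes must be present.
    """
    on_true = [r for r, y in zip(rated_true, is_true) if y]
    on_false = [r for r, y in zip(rated_true, is_true) if not y]
    hit_rate = sum(on_true) / len(on_true)
    false_alarm_rate = sum(on_false) / len(on_false)
    return hit_rate - false_alarm_rate


# A rater who accepts every headline has no discernment (score 0.0).
score = news_discernment([True, True, True, True], [True, True, False, False])
```

The same score can be computed per human participant and per matched LLM agent, and the two sets of scores then compared trait by trait.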

[62] Robustness of Misinformation Classification Systems to Adversarial Examples Through BeamAttack

Arnisa Fazla,Lucas Krauter,David Guzman Piedrahita,Andrianos Michail

Main category: cs.CL

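TL;DR: The study extends the BeamAttack adversarial attack algorithm for text classification systems, integrates LIME to prioritize word replacements, and achieves a high attack success rate.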

Details

Motivation: To evaluate and strengthen the robustness of text classification systems. Method: The authors extend BeamAttack with support for word deletions and an option to skip substitutions, and integrate LIME to better prioritize word replacements. Result: The attack achieves a success rate above 99% across multiple datasets and victim models while preserving the semantic and lexical similarity of the texts. Conclusion: The extended BeamAttack attacks text classifiers with a high success rate while keeping adversarial texts semantically and lexically close to the originals. Abstract: We extend BeamAttack, an adversarial attack algorithm designed to evaluate the robustness of text classification systems through word-level modifications guided by beam search. Our extensions include support for word deletions and the option to skip substitutions, enabling the discovery of minimal modifications that alter model predictions. We also integrate LIME to better prioritize word replacements. Evaluated across multiple datasets and victim models (BiLSTM, BERT, and adversarially trained RoBERTa) within the BODEGA framework, our approach achieves over a 99% attack success rate while preserving the semantic and lexical similarity of the original texts. Through both quantitative and qualitative analysis, we highlight BeamAttack's effectiveness and its limitations. Our implementation is available at https://github.com/LucK1Y/BeamAttack
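
The idea of a beam search over word-level substitutions and deletions can be sketched as follows. This is an illustrative toy, not the authors' implementation: it omits LIME-based prioritization, the skip option, and any similarity constraint, and the names `beam_attack`, `predict_proba`, and `candidates` are hypothetical.

```python
def beam_attack(words, predict_proba, candidates, true_label, beam_size=3, max_edits=3):
    """Search for a small set of word edits that flips the predicted label.

    predict_proba(words) returns a list of class probabilities.
    candidates maps a word to replacement words; None stands for deletion.
    Returns the first edited word list that changes the prediction, or None.
    """
    beam = [(predict_proba(words)[true_label], list(words))]
    for _ in range(max_edits):
        expanded = []
        for _, current in beam:
            for i, word in enumerate(current):
                for repl in candidates.get(word, []) + [None]:
                    middle = [] if repl is None else [repl]
                    edited = current[:i] + middle + current[i + 1:]
                    if not edited:
                        continue
                    probs = predict_proba(edited)
                    predicted = max(range(len(probs)), key=probs.__getitem__)
                    if predicted != true_label:
                        return edited
                    expanded.append((probs[true_label], edited))
        if not expanded:
            break
        # Keep the edits that lower confidence in the true label the most.
        expanded.sort(key=lambda item: item[0])
        beam = expanded[:beam_size]
    return None


def toy_classifier(words):
    """Toy victim model: class 1 ("misinformation") iff the text contains "shocking"."""
    return [0.1, 0.9] if "shocking" in words else [0.9, 0.1]


adversarial = beam_attack(
    ["shocking", "news", "today"], toy_classifier, {"shocking": ["surprising"]}, true_label=1
)
```

A real attack would also score each candidate for semantic and lexical similarity to the original text, which is what the BODEGA evaluation rewards.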

[63] Zero-Shot Contextual Embeddings via Offline Synthetic Corpus Generation

Philip Lippmann,Jie Yang

Main category: cs.CL

TL;DR: ZEST is a zero-shot contextual adaptation framework that generates a synthetic context corpus to produce domain-adapted embeddings without requiring target corpus access or retraining.

Details

Motivation: Context-aware embedding methods require access to the target corpus or domain-specific fine-tuning, which poses practical barriers in privacy-sensitive or resource-constrained settings. Method: ZEST uses a multi-step hierarchical procedure to generate a synthetic context corpus from a few exemplar documents, enabling domain-adapted embeddings without retraining or target corpus access. Result: ZEST's zero-shot synthetic context adaptation using only five example documents performs within 0.5% of models leveraging full target corpus access. Conclusion: ZEST provides a practical method for deploying high-performance, adaptable embeddings in constrained environments. Abstract: Context-aware embedding methods boost retrieval accuracy by conditioning on corpus statistics (e.g., term co-occurrence and topical patterns) extracted from neighboring documents. However, this context-aware approach requires access to the target corpus or domain-specific finetuning, posing practical barriers in privacy-sensitive or resource-constrained settings. We present ZEST, a zero-shot contextual adaptation framework that replaces real corpus access with a one-time offline synthesis of a compact proxy. Given only a handful of exemplar documents representative of the general target domain, we use a multi-step hierarchical procedure to generate a synthetic context corpus of several hundred documents that aims to emulate key domain-specific distributions. At inference, the frozen context-aware encoder uses this proxy corpus -- without any finetuning or target corpus access -- to produce domain-adapted embeddings. Across the MTEB benchmark, ZEST's zero-shot synthetic context adaptation using only five example documents performs within 0.5% of models leveraging full target corpus access -- demonstrating remarkable efficacy without any retraining. ZEST thus provides a practical method for deploying high-performance, adaptable embeddings in constrained environments.

[64] L0: Reinforcement Learning to Become General Agents

Junjie Zhang,Jingyi Xi,Zhuoyang Song,Junyu Lu,Yuhua Ke,Ting Sun,Yukun Yang,Jiaxing Zhang,Songxin Zhang,Zejian Xie

Main category: cs.CL

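TL;DR: L-Zero (L0) is a new framework for efficiently training general-purpose autonomous agents; its NB-Agent markedly improves question-answering accuracy through code execution.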

Details

Motivation: To address the scalability and training-efficiency problems that arise when training large language models for multi-turn, long-horizon tasks, by providing an efficient way to train autonomous agents. Method: The paper proposes the end-to-end L-Zero (L0) training pipeline and the NB-Agent scaffold, using a low-cost, extensible, sandboxed concurrent agent worker pool and a REPL-based "code-as-action" mode of operation. Result: Accuracy rises from 30% to 80% on SimpleQA and from 22% to 41% on HotpotQA. Conclusion: L0 lowers the barrier to applying reinforcement learning in complex environments, and NB-Agent delivers substantial gains in question-answering accuracy. Abstract: Training large language models (LLMs) to act as autonomous agents for multi-turn, long-horizon tasks still poses significant challenges in scalability and training efficiency. To address this, we introduce L-Zero (L0), a scalable, end-to-end training pipeline for general-purpose agents. Featuring a low-cost, extensible, and sandboxed concurrent agent worker pool, L0 lowers the barrier for applying reinforcement learning in complex environments. We also introduce NB-Agent, the agent scaffold within L0, which operates in a "code-as-action" fashion via a Read-Eval-Print-Loop (REPL). We evaluate L0 on factuality question-answering benchmarks. Our experiments demonstrate that a base model can develop robust problem-solving skills using solely Reinforcement Learning with Verifiable Rewards (RLVR). On the Qwen2.5-7B-Instruct model, our method boosts accuracy on SimpleQA from 30% to 80% and on HotpotQA from 22% to 41%. We have open-sourced the entire L0 system, including our L0 series models, the NB-Agent, a complete training pipeline, and the corresponding training recipes at https://github.com/cmriat/l0.
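
A "code-as-action" REPL can be pictured as a session that executes each code string the agent emits in a persistent namespace and returns the printed output as the next observation. The sketch below is an assumption about the general pattern, not NB-Agent's actual interface; the class name `ReplSession` is hypothetical, and the sketch provides no sandboxing, which a real system such as the one described would need.

```python
import contextlib
import io


class ReplSession:
    """Keeps state between actions, like a notebook kernel (no sandboxing here)."""

    def __init__(self):
        self.namespace = {}

    def run(self, code):
        """Execute one action and return captured stdout, or the error text."""
        buffer = io.StringIO()
        try:
            with contextlib.redirect_stdout(buffer):
                exec(code, self.namespace)
        except Exception as error:
            return f"{buffer.getvalue()}{type(error).__name__}: {error}"
        return buffer.getvalue()


session = ReplSession()
session.run("x = 21")                      # first action defines a variable
observation = session.run("print(x * 2)")  # later action reuses it
```

Because variables persist across calls, the agent can build up intermediate results over many turns, which is the property that makes the pattern useful for long-horizon tasks.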

[65] AutoEvoEval: An Automated Framework for Evolving Close-Ended LLM Evaluation Data

JiaRu Wu,Mingwei Liu

Main category: cs.CL

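TL;DR: To address the shortcomings of existing LLM evaluation benchmarks, this study proposes AutoEvoEval, an evolution-based evaluation framework that uses 22 atomic evolution operations and multi-round composition to generate more challenging test samples; experiments indicate that it assesses model robustness and generalization more accurately.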

Details

Motivation: Existing LLM evaluation benchmarks are usually static and cannot fully assess robustness and generalization in realistic scenarios. Evolutionary or adversarial data augmentation has increased evaluation diversity but lacks systematic control over perturbation types and multi-step complexity. Method: AutoEvoEval is an evolution-based evaluation framework with 22 interpretable atomic evolution operations and support for multi-round composition, which generates diverse, challenging, and realistic test samples. Result: Atomic operations cause an average accuracy drop of 7.283%, with structure-disrupting or misleading semantic edits causing the largest declines. Model sensitivity to the same perturbation varies significantly, and combining multiple evolution steps amplifies adversarial effects by up to 52.932%. Conclusion: Current benchmarks may overestimate true model generalization, which underlines the need for evolution-aware robustness evaluation. Abstract: Large language models (LLMs) have shown remarkable performance on various tasks, but existing evaluation benchmarks are often static and insufficient to fully assess their robustness and generalization in realistic scenarios. Prior work using evolutionary or adversarial data augmentation has improved evaluation diversity but lacks systematic control over perturbation types and multi-step complexity, limiting comprehensive robustness analysis. To address these gaps, we propose AutoEvoEval, an evolution-based evaluation framework for close-ended tasks such as multi-choice question answering. AutoEvoEval introduces 22 interpretable atomic evolution operations and supports multi-round compositions, enabling controlled generation of diverse, challenging, and realistic test samples. We conduct extensive experiments addressing four research questions on a broad set of open- and closed-source LLMs. Our results show that atomic operations cause an average accuracy drop of 7.283%, with structure-disrupting or misleading semantic edits causing the largest declines. Model sensitivities vary significantly for the same perturbation, and combining multiple evolution steps amplifies adversarial effects by up to 52.932%. These findings suggest current benchmarks may overestimate true model generalization and emphasize the need for evolution-aware robustness evaluation. Code and resources are available at: https://github.com/SYSUSELab/AutoEvoEval.
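
Multi-round composition of atomic operations on a multiple-choice item can be sketched as below. The two operations shown (`shuffle_options`, `add_distractor`) are hypothetical stand-ins chosen for illustration; they are not taken from the 22 operations the paper defines, and the item layout is an assumption.

```python
import random


def shuffle_options(item, rng):
    """Reorder the options while keeping the answer index pointing at the same text."""
    options = item["options"][:]
    answer_text = options[item["answer"]]
    rng.shuffle(options)
    return {**item, "options": options, "answer": options.index(answer_text)}


def add_distractor(item, rng):
    """Append a generic distractor once; the answer index is unaffected."""
    extra = "None of the above"
    if extra in item["options"]:
        return item
    return {**item, "options": item["options"] + [extra]}


def evolve(item, operations, rounds, seed=0):
    """Apply a randomly chosen atomic operation per round, composing their effects."""
    rng = random.Random(seed)
    for _ in range(rounds):
        operation = rng.choice(operations)
        item = operation(item, rng)
    return item


original = {"question": "2 + 2 = ?", "options": ["3", "4", "5"], "answer": 1}
evolved = evolve(original, [shuffle_options, add_distractor], rounds=3)
```

Each operation preserves the correct answer, so any accuracy drop on the evolved item reflects sensitivity to the perturbation and not a change in the underlying question.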

[66] Positional Bias in Binary Question Answering: How Uncertainty Shapes Model Preferences

Tiziano Labruna,Simone Gallo,Giovanni Da San Martino

Main category: cs.CL

TL;DR: This study finds that positional bias in language models increases significantly when answer uncertainty is high, but not in low-uncertainty scenarios.

Details

Motivation: To understand and quantify positional bias in binary question answering across large language models under varying degrees of answer uncertainty. Method: The study uses an adapted SQuAD-it dataset with added incorrect answers and varying context levels. It also evaluates two high-uncertainty benchmarks: WebGPT and Winning Arguments. The order of correct or higher-quality options is flipped to compute Preference Fairness and Position Consistency. Result: Positional bias grows exponentially when it becomes difficult to determine the correct option, whereas it remains minimal in low-uncertainty settings. Conclusion: Positional bias increases under high uncertainty in binary question answering, while it is nearly absent under low-uncertainty conditions. Abstract: Positional bias in binary question answering occurs when a model systematically favors one choice over another based solely on the ordering of presented options. In this study, we quantify and analyze positional bias across five large language models under varying degrees of answer uncertainty. We re-adapted the SQuAD-it dataset by adding an extra incorrect answer option and then created multiple versions with progressively less context and more out-of-context answers, yielding datasets that range from low to high uncertainty. Additionally, we evaluate two naturally higher-uncertainty benchmarks: (1) WebGPT - question pairs with unequal human-assigned quality scores, and (2) Winning Arguments - where models predict the more persuasive argument in Reddit's r/ChangeMyView exchanges. Across each dataset, the order of the "correct" (or higher-quality/persuasive) option is systematically flipped (first placed in position 1, then in position 2) to compute both Preference Fairness and Position Consistency. We observe that positional bias is nearly absent under low-uncertainty conditions, but grows exponentially when it becomes unclear which option is correct.
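
The order-flipping protocol described above can be illustrated with a small scoring function. Position consistency is taken here to mean the share of items where the model picks the same underlying option under both orderings. The `first_position_rate` is a simplified stand-in for a preference measure and is not the paper's Preference Fairness formula, which the abstract does not define. All names are hypothetical.

```python
def position_metrics(choices_original, choices_flipped):
    """Score a model's picks under two orderings of the same option pairs.

    Each list holds the chosen position (1 or 2) per item. In the original
    ordering the reference option sits in position 1; in the flipped ordering
    it sits in position 2.
    """
    n = len(choices_original)
    # Same underlying option in both runs means the chosen positions differ.
    consistent = sum(1 for a, b in zip(choices_original, choices_flipped) if a != b)
    first_picks = sum(1 for c in choices_original + choices_flipped if c == 1)
    return {
        "position_consistency": consistent / n,
        "first_position_rate": first_picks / (2 * n),
    }


metrics = position_metrics([1, 1, 2, 1], [2, 1, 1, 1])
```

A model with no positional bias would show a consistency near 1.0 and a first-position rate near 0.5; a model that always picks position 1 would show 0.0 and 1.0.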

[67] Do Thinking Tokens Help or Trap? Towards More Efficient Large Reasoning Model

Bowen Ding,Yuhan Chen,Futing Wang,Lingfeng Ming,Tao Lin

Main category: cs.CL

TL;DR: This paper proposes DuP-PO to solve the overthinking dilemma in LRMs by improving token efficiency during reasoning without affecting performance.

Details

Motivation: To address the overthinking dilemma in Large Reasoning Models (LRMs), where verbose responses with thinking tokens reduce efficiency. Method: Dual Policy Preference Optimization (DuP-PO), which combines a rollout sampling strategy, a fine-grained advantage control technique, and a policy shaping method. Result: Experimental results on five popular math reasoning benchmarks show that DuP-PO improves token efficiency during reasoning while achieving performance superior to that of the base model. Conclusion: DuP-PO provides a solution to the overthinking dilemma in LRMs by improving token efficiency during reasoning without compromising performance. Abstract: Large Reasoning Models (LRMs) excel at solving complex problems but face an overthinking dilemma. When handling simple tasks, they often produce verbose responses overloaded with thinking tokens (e.g., wait, however). These tokens trigger unnecessary high-level reasoning behaviors like reflection and backtracking, reducing efficiency. In this work, our pilot study reveals that these thinking-token-induced behaviors are not essential for effective problem-solving and may even hinder correct reasoning within constrained token budgets. We identify this phenomenon as the thinking trap. To mitigate this issue, we propose Dual Policy Preference Optimization (DuP-PO), a novel algorithm featuring: (1) A rollout sampling strategy that guarantees balanced exposure to responses with and without thinking tokens; (2) A fine-grained advantage control technique to dynamically regulate the prediction of target tokens; (3) A policy shaping method ensuring stable gradient contributions from thinking tokens. Experimental results on five popular math reasoning benchmarks show that DuP-PO performs well on popular LRMs, significantly improving their token efficiency during reasoning while achieving performance superior to that of the base model.

[68] Garbage In, Reasoning Out? Why Benchmark Scores are Unreliable and What to Do About It

Seyed Mahed Mousavi,Edoardo Cecchinato,Lucia Hornikova,Giuseppe Riccardi

Main category: cs.CL

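TL;DR: The paper argues that benchmarks currently used to evaluate the reasoning abilities of large language models suffer from numerous problems, including flaws in item design and scoring methods, so model scores do not faithfully reflect reasoning ability.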

Details

Motivation: To expose problems in current reasoning benchmarks that undermine accurate assessment of the reasoning abilities of large language models. Method: The authors systematically audit three widely used reasoning benchmarks (SocialIQa, FauxPas-EAI, and ToMi), using five LLMs (GPT-3, 3.5, 4, o1, and LLaMA 3.1) as diagnostic tools, and carry out systematic human annotation and re-evaluation. Result: Current benchmarks show structural, semantic, and pragmatic issues, and improvements in model scores often stem from surface wording variations and not from better reasoning. Conclusion: Current benchmarks are flawed and cannot accurately assess the reasoning abilities of large language models, so sounder evaluation methods are needed. Abstract: We conduct a systematic audit of three widely used reasoning benchmarks, SocialIQa, FauxPas-EAI, and ToMi, and uncover pervasive flaws in both benchmark items and evaluation methodology. Using five LLMs (GPT-{3, 3.5, 4, o1}, and LLaMA 3.1) as diagnostic tools, we identify structural, semantic, and pragmatic issues in benchmark design (e.g., duplicated items, ambiguous wording, and implausible answers), as well as scoring procedures that prioritize output form over reasoning process. Through systematic human annotation and re-evaluation on cleaned benchmark subsets, we find that model scores often improve due to erratic surface wording variations and not due to improved reasoning. In fact, further analyses show that model performance is highly sensitive to minor input variations such as context availability and phrasing, revealing that high scores may reflect alignment with format-specific cues rather than consistent inference based on the input. These findings challenge the validity of current benchmark-based claims about reasoning in LLMs, and highlight the need for evaluation protocols that assess reasoning as a process of drawing inference from available information, rather than as static output selection. We release audited data and evaluation tools to support more interpretable and diagnostic assessments of model reasoning.

[69] Advancing Multi-Step Mathematical Reasoning in Large Language Models through Multi-Layered Self-Reflection with Auto-Prompting

André de Souza Loureiro,Jorge Valverde-Rebaza,Julieta Noguez,David Escarcega,Ricardo Marcacini

Main category: cs.CL

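TL;DR: The study proposes MAPS, a new framework that combines several techniques to significantly strengthen the multi-step reasoning of large language models while balancing performance against cost.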

Details

Motivation: Although large language models have become better problem solvers, they still struggle with complex multi-step reasoning tasks, so a new method is needed to improve their reasoning. Method: The paper proposes the Multi-Layered Self-Reflection with Auto-Prompting (MAPS) framework, which combines Chain of Thought (CoT), Self-Reflection, and Auto-Prompting and uses an iterative refinement process to strengthen LLM reasoning. Result: MAPS clearly outperforms standard CoT and achieves results competitive with reasoning-optimized models on several benchmarks across multiple LLMs. It also lets general-purpose LLMs reach performance comparable to specialized reasoning models. Conclusion: MAPS significantly improves LLM performance on multi-step mathematical reasoning, making general-purpose models competitive with specialized reasoning models, and it balances cost against performance by limiting reflection depth. Abstract: Recent advancements in Large Language Models (LLMs) have significantly improved their problem-solving capabilities. However, these models still struggle when faced with complex multi-step reasoning tasks. In this paper, we propose the Multi-Layered Self-Reflection with Auto-Prompting (MAPS) framework, a novel approach designed to enhance multi-step mathematical reasoning in LLMs by integrating techniques such as Chain of Thought (CoT), Self-Reflection, and Auto-Prompting. Unlike traditional static prompting methods, MAPS employs an iterative refinement process. Initially, the model generates a solution using CoT prompting. When errors are detected, an adaptive self-reflection mechanism identifies and analyzes them, generating tailored prompts to guide corrections. These dynamically adjusted prompts enable the model to iteratively refine its reasoning. Experiments on four well-established benchmarks across multiple LLMs show that MAPS significantly outperforms standard CoT and achieves competitive results with reasoning-optimized models. In addition, MAPS enables general-purpose LLMs to reach performance levels comparable to specialized reasoning models. While deeper reflection layers improve accuracy, they also increase token usage and costs. To balance this trade-off, MAPS strategically limits reflection depth, ensuring an optimal balance between cost and reasoning performance.
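
The iterative refinement loop described above (generate with CoT, detect an error, reflect, re-prompt, stop at a depth limit) can be sketched with pluggable callables. This is a schematic under stated assumptions and not the MAPS implementation: `generate`, `verify`, and `reflect` are hypothetical stand-ins for model calls, and the prompt wording is invented for illustration.

```python
def solve_with_reflection(problem, generate, verify, reflect, max_depth=3):
    """Return (answer, reflection_layers_used), capping reflection at max_depth.

    generate(prompt) -> answer string (a CoT-prompted model call in practice)
    verify(problem, answer) -> True when no error is detected
    reflect(problem, answer) -> short diagnosis used to tailor the next prompt
    """
    answer = generate(f"Solve step by step: {problem}")
    for depth in range(max_depth):
        if verify(problem, answer):
            return answer, depth
        feedback = reflect(problem, answer)
        # Auto-prompting: fold the diagnosis into a new, tailored prompt.
        answer = generate(f"Solve step by step: {problem}\nAvoid this mistake: {feedback}")
    return answer, max_depth


# Toy stand-ins: the "model" only answers correctly once it sees feedback.
toy_generate = lambda prompt: "4" if "Avoid" in prompt else "5"
toy_verify = lambda problem, answer: answer == "4"
toy_reflect = lambda problem, answer: "arithmetic slip in the final step"

result = solve_with_reflection("2 + 2", toy_generate, toy_verify, toy_reflect)
```

The `max_depth` cap is where the cost trade-off shows up: each extra layer adds one reflection call and one generation call.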

[70] The Trilemma of Truth in Large Language Models

Germans Savcisens,Tina Eliassi-Rad

Main category: cs.CL

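TL;DR: This paper introduces sAwMIL, a new method that uses the internal activations of large language models (LLMs) to assess the veracity of their knowledge, offering insight into true, false, and in-between states of what LLMs know.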

Details

Motivation: People often attribute human characteristics to large language models (LLMs) and claim that they "know" certain things. LLMs hold internal probabilistic knowledge that represents information retained during training. To assess the veracity of this knowledge, the authors examine two common methods for probing veracity in LLMs and find several flawed assumptions. Method: The method builds on multiple-instance learning and conformal prediction, using the internal activations of LLMs to separate statements into true, false, and neither. Result: The findings are: (1) the veracity signal is often concentrated in the third quarter of an LLM's depth; (2) truth and falsehood signals are not always symmetric; (3) linear probes perform better on chat models than on default models; (4) nonlinear probes may be required to capture veracity signals for some LLMs trained with reinforcement learning from human feedback or knowledge distillation; (5) LLMs capture a third type of signal that is distinct from true and false and is neither. Conclusion: sAwMIL provides a reliable way to verify what LLMs "know" and how certain they are of their probabilistic internal knowledge. Abstract: We often attribute human characteristics to large language models (LLMs) and claim that they "know" certain things. LLMs have an internal probabilistic knowledge that represents information retained during training. How can we assess the veracity of this knowledge? We examine two common methods for probing the veracity of LLMs and discover several assumptions that are flawed. To address these flawed assumptions, we introduce sAwMIL (short for Sparse Aware Multiple-Instance Learning), a probing method that utilizes the internal activations of LLMs to separate statements into true, false, and neither. sAwMIL is based on multiple-instance learning and conformal prediction. We evaluate sAwMIL on 5 validity criteria across 16 open-source LLMs, including both default and chat-based variants, as well as on 3 new datasets. Among the insights we provide are: (1) the veracity signal is often concentrated in the third quarter of an LLM's depth; (2) truth and falsehood signals are not always symmetric; (3) linear probes perform better on chat models than on default models; (4) nonlinear probes may be required to capture veracity signals for some LLMs with reinforcement learning from human feedback or knowledge distillation; and (5) LLMs capture a third type of signal that is distinct from true and false and is neither true nor false. These findings provide a reliable method for verifying what LLMs "know" and how certain they are of their probabilistic internal knowledge.

[71] IMPACT: Inflectional Morphology Probes Across Complex Typologies

Mohammed J. Saeed,Tommi Vehvilainen,Evgeny Fedoseev,Sevil Caliskan,Tatiana Vodolazova

Main category: cs.CL

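TL;DR: This paper introduces IMPACT, a synthetically generated evaluation framework for assessing how multilingual large language models handle inflectional morphology.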

Details

Motivation: To investigate whether multilingual large language models truly grasp the underlying linguistic complexity of non-English languages, particularly in morphology. Method: The authors introduce IMPACT, a synthetically generated evaluation framework with test cases for five morphologically rich languages: Arabic, Russian, Finnish, Turkish, and Hebrew. Result: Despite strong English performance, the models struggle with other languages and uncommon morphological patterns, especially when judging ungrammatical examples. Chain of Thought and Thinking Models can also degrade performance. Conclusion: The study exposes gaps in how multilingual large language models handle linguistic complexity and points to room for improvement. Abstract: Large Language Models (LLMs) have shown significant progress on various multilingual benchmarks and are increasingly used to generate and evaluate text in non-English languages. However, while they may produce fluent outputs, it remains unclear to what extent these models truly grasp the underlying linguistic complexity of those languages, particularly in morphology. To investigate this, we introduce IMPACT, a synthetically generated evaluation framework focused on inflectional morphology, which we publicly release, designed to evaluate LLM performance across five morphologically rich languages: Arabic, Russian, Finnish, Turkish, and Hebrew. IMPACT includes unit-test-style cases covering both shared and language-specific phenomena, from basic verb inflections (e.g., tense, number, gender) to unique features like Arabic's reverse gender agreement and vowel harmony in Finnish and Turkish. We assess eight multilingual LLMs that, despite strong English performance, struggle with other languages and uncommon morphological patterns, especially when judging ungrammatical examples. We also show that Chain of Thought and Thinking Models can degrade performance. Our work exposes gaps in LLMs' handling of linguistic complexity, pointing to clear room for improvement. To support further research, we publicly release the IMPACT framework.

[72] Leveraging the Potential of Prompt Engineering for Hate Speech Detection in Low-Resource Languages

Ruhina Tabasshum Prome,Tarikul Islam Tamiti,Anomadarshi Barua

Main category: cs.CL

TL;DR: This paper explores how metaphor prompting can improve hate speech detection in low-resource languages like Bengali by overcoming limitations of existing prompting techniques and large language model safety mechanisms.

Details

Motivation: The motivation stems from the challenges in detecting hate speech in low-resource languages due to a lack of large-scale datasets and built-in safety measures in LLMs, combined with complexities such as code-mixing and misspellings in social media content. Method: The study evaluates six prompting strategies, including zero-shot prompting, refusal suppression, flattering the classifier, multi-shot prompting, role prompting, and metaphor prompting. These were tested on the Llama2-7B model. Comparisons were made with pre-trained word embeddings (GloVe, Word2Vec, FastText) across deep learning models (MLP, CNN, BiGRU). Result: The proposed metaphor prompting demonstrated effectiveness in hate speech detection for low-resource languages like Bengali and Hindi, showing potential applicability across high-resource languages such as English and German. Conclusion: This paper concludes that metaphor prompting is an effective technique for hate speech detection in low-resource languages like Bengali, and it surpasses traditional methods by circumventing the safety mechanisms of LLMs. Abstract: The rapid expansion of social media leads to a marked increase in hate speech, which threatens personal lives and results in numerous hate crimes. Detecting hate speech presents several challenges: diverse dialects, frequent code-mixing, and the prevalence of misspelled words in user-generated content on social media platforms. Recent progress in hate speech detection is typically concentrated on high-resource languages. However, low-resource languages still face significant challenges due to the lack of large-scale, high-quality datasets. This paper investigates how we can overcome this limitation via prompt engineering on large language models (LLMs) focusing on low-resource Bengali language. 
We investigate six prompting strategies - zero-shot prompting, refusal suppression, flattering the classifier, multi-shot prompting, role prompting, and finally our innovative metaphor prompting to detect hate speech effectively in low-resource languages. We pioneer metaphor prompting to circumvent the built-in safety mechanisms of LLMs, which marks a significant departure from existing jailbreaking methods. We investigate all six different prompting strategies on the Llama2-7B model and compare the results extensively with three pre-trained word embeddings - GloVe, Word2Vec, and FastText for three different deep learning models - multilayer perceptron (MLP), convolutional neural network (CNN), and bidirectional gated recurrent unit (BiGRU). To prove the effectiveness of our metaphor prompting in the low-resource Bengali language, we also evaluate it in another low-resource language - Hindi, and two high-resource languages - English and German. The performance of all prompting techniques is evaluated using the F1 score, and environmental impact factor (IF), which measures CO$_2$ emissions, electricity usage, and computational time.

[73] Graft: Integrating the Domain Knowledge via Efficient Parameter Synergy for MLLMs

Yang Dai,Jianxiang An,Tianwei Lin,Hongyang He,Hongzhe Huang,Wenqiao Zhang,Zheqi Lv,Siliang Tang,Yueting Zhuang

Main category: cs.CL

TL;DR: This paper proposes a unified parameter integration framework for domain-specific MLLMs using CAPS, enabling efficient knowledge sharing and compositional adaptability.

Details

Motivation: Fragmentation of knowledge across domain-specialized MLLMs hampers performance; knowledge sharing remains underexplored. Method: Compatibility-Aware Parameter Splicing (CAPS) strategy and domain compatibility scoring mechanism at low-rank adaptation layer granularity. Result: Extensive evaluations validate the effectiveness of the framework in synergizing heterogeneous expertise with minimal inference overhead. Conclusion: The proposed framework enables modular composition of expert capabilities in MLLMs, effectively integrating knowledge while preserving structural modularity. Abstract: Multimodal Large Language Models (MLLMs) have achieved success across various domains. However, their applicability tends to degrade when confronted with different types of data inputs, especially for MLLMs that have been fine-tuned for specific tasks. Despite its importance, the study of knowledge sharing among domain-specific MLLMs--such as those trained for mathematics or code--remains largely underexplored. To address the fragmentation of knowledge across domain-specialized MLLMs, we propose a unified parameter integration framework that enables modular composition of expert capabilities. Our method is grounded in a novel Compatibility-Aware Parameter Splicing (CAPS) strategy, which leverages both local functional attribution and global information-theoretic signals to guide selective parameter fusion. By extending this mechanism to the low-rank adaptation layer granularity, we ensure efficient integration with minimal inference overhead. Furthermore, we introduce a domain compatibility scoring mechanism that quantifies inter-expert alignment at the activation level and correlates with downstream task utility. This principled fusion protocol allows the final model to synergize heterogeneous expertise while preserving structural modularity. 
Extensive evaluations across diverse multimodal benchmarks validate the effectiveness of our framework, offering a scalable path toward compositional, domain-adaptive MLLMs.

[74] Unveiling Decision-Making in LLMs for Text Classification : Extraction of influential and interpretable concepts with Sparse Autoencoders

Mathis Le Bail,Jérémie Dentan,Davide Buscaldi,Sonia Vanier

Main category: cs.CL

TL;DR: This paper introduces a new Sparse Autoencoder-based architecture for improving the interpretability of concepts extracted from Large Language Models in sentence classification tasks.

Details

Motivation: Sparse Autoencoders have been successfully used to extract interpretable concepts from LLMs in various domains but not extensively explored in sentence classification. This paper aims to fill that gap. Method: The paper presents a novel SAE-based architecture tailored for text classification, leveraging a specialized classifier head and incorporating an activation rate sparsity loss. They benchmark this architecture against established methods such as ConceptShap, Independent Component Analysis, and other SAE-based concept extraction techniques. Result: The empirical results show that their architecture improves both the causality and interpretability of the extracted features across two classification benchmarks and four fine-tuned LLMs from the Pythia family. Conclusion: The paper concludes that their novel SAE-based architecture improves both the causality and interpretability of extracted features for sentence classification. Abstract: Sparse Autoencoders (SAEs) have been successfully used to probe Large Language Models (LLMs) and extract interpretable concepts from their internal representations. These concepts are linear combinations of neuron activations that correspond to human-interpretable features. In this paper, we investigate the effectiveness of SAE-based explainability approaches for sentence classification, a domain where such methods have not been extensively explored. We present a novel SAE-based architecture tailored for text classification, leveraging a specialized classifier head and incorporating an activation rate sparsity loss. We benchmark this architecture against established methods such as ConceptShap, Independent Component Analysis, and other SAE-based concept extraction techniques. Our evaluation covers two classification benchmarks and four fine-tuned LLMs from the Pythia family. 
We further enrich our analysis with two novel metrics for measuring the precision of concept-based explanations, using an external sentence encoder. Our empirical results show that our architecture improves both the causality and interpretability of the extracted features.

[75] TaP: A Taxonomy-Guided Framework for Automated and Scalable Preference Data Generation

Renren Jin,Tianhao Shen,Xinwei Wu,Dan Shi,Haoran Sun,Wuwei Huang,Quandong Wang,Wei Liu,Jian Luan,Bin Wang,Deyi Xiong

Main category: cs.CL

TL;DR: TaP is a new framework for generating preference datasets across languages, improving LLM training efficiency and performance beyond existing methods.

Details

Motivation: Need for high-quality, multilingual datasets for training LLMs due to the resource-intensive nature of dataset creation and predominance of English datasets. Method: Development of the TaP framework based on a structured taxonomy for generating diverse and comprehensive preference datasets, used for supervised and preference fine-tuning of LLMs. Result: LLMs trained on TaP-generated datasets outperformed those trained on existing open-source datasets, including one that was 180 times larger. Conclusion: The TaP framework effectively enables the automated and scalable construction of preference datasets across languages, significantly enhancing the performance of LLMs even surpassing larger open-source datasets. Abstract: Conducting supervised fine-tuning and preference fine-tuning on large language models (LLMs) requires high-quality datasets to improve their ability to follow instructions and align with human preferences and values. However, constructing such datasets is resource-intensive, and most available datasets for supervised and preference fine-tuning are in English. To address these challenges, we propose the \underline{\textbf{Ta}}xonomy-Guided \underline{\textbf{P}}reference Data Generation (TaP) framework, which facilitates automated and scalable construction of preference datasets across various languages. TaP is grounded in a structured taxonomy that allows fine-grained control over dataset composition, thereby ensuring both diversity and comprehensive coverage. We employ TaP-generated datasets to perform supervised and preference fine-tuning on various LLMs. Experimental results demonstrate that LLMs trained on TaP-generated datasets outperform those trained on existing open-source datasets. Remarkably, LLMs trained on TaP-generated datasets surpass the performance of those trained on an open-source dataset that is 180 times larger.

[76] Machine Understanding of Scientific Language

Dustin Wright

Main category: cs.CL

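TL;DR: This thesis develops datasets, methods, and tools for machine understanding of scientific language, in order to analyze and understand science communication at scale.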

Details

Motivation: As the volume of scientific text online has surged, automatically identifying how faithful such text is to the underlying science has become a problem of societal importance. Method: The thesis presents contributions in three areas of natural language processing and machine learning: automatic fact checking, learning with limited data, and scientific text processing, including methods for multi-source domain adaptation, zero-shot scientific fact checking, and detecting exaggerated scientific claims. Result: It introduces new methods and resources for identifying check-worthy claims, adversarial claim generation, multi-source domain adaptation, learning from crowd-sourced labels, cite-worthiness detection, zero-shot scientific fact checking, and detecting exaggerated scientific claims. Conclusion: The results indicate that these new methods and resources make it possible to learn effectively from limited amounts of scientific text, to identify misinformative scientific statements, and to generate new insights into the science communication process. Abstract: Scientific information expresses human understanding of nature. This knowledge is largely disseminated in different forms of text, including scientific papers, news articles, and discourse among people on social media. While important for accelerating our pursuit of knowledge, not all scientific text is faithful to the underlying science. As the volume of this text has burgeoned online in recent years, it has become a problem of societal importance to be able to identify the faithfulness of a given piece of scientific text automatically. This thesis is concerned with the cultivation of datasets, methods, and tools for machine understanding of scientific language, in order to analyze and understand science communication at scale. To arrive at this, I present several contributions in three areas of natural language processing and machine learning: automatic fact checking, learning with limited data, and scientific text processing. These contributions include new methods and resources for identifying check-worthy claims, adversarial claim generation, multi-source domain adaptation, learning from crowd-sourced labels, cite-worthiness detection, zero-shot scientific fact checking, detecting exaggerated scientific claims, and modeling degrees of information change in science communication. Critically, I demonstrate how the research outputs of this thesis are useful for effectively learning from limited amounts of scientific text in order to identify misinformative scientific statements and generate new insights into the science communication process.

[77] Auto-TA: Towards Scalable Automated Thematic Analysis (TA) via Multi-Agent Large Language Models with Reinforcement Learning

Seungjun Yi,Joakim Nguyen,Huimin Xu,Terence Lim,Andrew Well,Mia Markey,Ying Ding

Main category: cs.CL

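TL;DR: This paper introduces a new method for automated thematic analysis using large language models (LLMs) to efficiently analyze clinical narratives of patients with congenital heart disease, reducing manual effort and improving the scalability and relevance of the analysis.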

Details

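Motivation: Congenital heart disease (CHD) presents complex, lifelong challenges that are often underrepresented in traditional clinical metrics. Unstructured narratives offer rich insights into patient and caregiver experiences, but traditional manual thematic analysis (TA) is time-consuming and hard to scale. Method: The study develops a novel multi-agent framework in which specialized LLM agents assume distinct roles to improve theme quality and alignment with human analysis. Reinforcement learning from human feedback (RLHF) is optionally integrated to further improve thematic relevance. Result: The paper proposes a fully automated LLM pipeline that performs end-to-end thematic analysis of clinical narratives, significantly reducing the need for manual intervention. With the multi-agent architecture and optional RLHF integration, the system produces high-quality themes that align closely with human analysis. Conclusion: The study proposes an LLM-based automated pipeline that performs end-to-end thematic analysis (TA) on clinical narratives, removing the need for manual coding or full transcript review. In addition, by incorporating reinforcement learning from human feedback (RLHF), the system can further improve thematic relevance and supports fine-tuning for specific clinical contexts, enabling large-scale, patient-centered analysis of qualitative data. Abstract: Congenital heart disease (CHD) presents complex, lifelong challenges often underrepresented in traditional clinical metrics. While unstructured narratives offer rich insights into patient and caregiver experiences, manual thematic analysis (TA) remains labor-intensive and unscalable. We propose a fully automated large language model (LLM) pipeline that performs end-to-end TA on clinical narratives, which eliminates the need for manual coding or full transcript review. Our system employs a novel multi-agent framework, where specialized LLM agents assume roles to enhance theme quality and alignment with human analysis. To further improve thematic relevance, we optionally integrate reinforcement learning from human feedback (RLHF). This supports scalable, patient-centered analysis of large qualitative datasets and allows LLMs to be fine-tuned for specific clinical contexts.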

[78] Large Language Models Don't Make Sense of Word Problems. A Scoping Review from a Mathematics Education Perspective

Anselm R. Strohmaier,Wim Van Dooren,Kathrin Seßler,Brian Greer,Lieven Verschaffel

Main category: cs.CL

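TL;DR: This paper examines the ability of large language models (LLMs) to solve word problems in mathematics education. It finds that, although LLMs efficiently solve s-problems that require no deep understanding of the real-world context, they remain limited on problems with complex or non-sensical real-world contexts, which calls into question their effectiveness as instructional tools.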

Details

Motivation: With the progress of large language models (LLMs) such as ChatGPT, there is hope that they can play a role in education, particularly in solving word problems in mathematics learning. Although LLMs handle textual input with ease, it remains unclear whether they can make sense of real-world contexts and be applied in classrooms. Method: A scoping review from a mathematics-education perspective, comprising three parts: a technical overview, a systematic review of the word problems used in research, and a state-of-the-art empirical evaluation of LLMs on mathematical word problems. Result: The literature review shows that the most popular word-problem datasets consist mainly of s-problems, which do not require consideration of the real-world context. The evaluation of GPT-3.5-turbo, GPT-4o-mini, GPT-4.1, and o3 shows that most recent LLMs solve these s-problems almost perfectly, including a perfect score on 20 problems from PISA. However, LLMs still show weaknesses on problems where the real-world context is problematic or non-sensical. Conclusion: Based on these three aspects, the authors conclude that LLMs have mastered a superficial solution process but do not truly make sense of mathematical word problems, which may limit their value as instructional tools in mathematics classrooms. Abstract: The progress of Large Language Models (LLMs) like ChatGPT raises the question of how they can be integrated into education. One hope is that they can support mathematics learning, including word-problem solving. Since LLMs can handle textual input with ease, they appear well-suited for solving mathematical word problems. Yet their real competence, whether they can make sense of the real-world context, and the implications for classrooms remain unclear. We conducted a scoping review from a mathematics-education perspective, including three parts: a technical overview, a systematic review of word problems used in research, and a state-of-the-art empirical evaluation of LLMs on mathematical word problems. First, in the technical overview, we contrast the conceptualization of word problems and their solution processes between LLMs and students. In computer-science research this is typically labeled mathematical reasoning, a term that does not align with usage in mathematics education. Second, our literature review of 213 studies shows that the most popular word-problem corpora are dominated by s-problems, which do not require a consideration of realities of their real-world context. Finally, our evaluation of GPT-3.5-turbo, GPT-4o-mini, GPT-4.1, and o3 on 287 word problems shows that most recent LLMs solve these s-problems with near-perfect accuracy, including a perfect score on 20 problems from PISA. 
LLMs still showed weaknesses in tackling problems where the real-world context is problematic or non-sensical. In sum, we argue based on all three aspects that LLMs have mastered a superficial solution process but do not make sense of word problems, which potentially limits their value as instructional tools in mathematics classrooms.

[79] EXPERT: An Explainable Image Captioning Evaluation Metric with Structured Explanations

Hyunjong Kim,Sangyeop Kim,Jongheon Jeong,Yeongjae Cho,Sungzoon Cho

Main category: cs.CL

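TL;DR: This paper proposes EXPERT, a new reference-free metric for explainable evaluation of image captions that delivers higher-quality explanations and better performance.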

Details

Motivation: Existing explainable evaluation metrics for image captioning lack standardized criteria, and the overall quality of the explanations they generate is unverified. Method: The authors construct large-scale datasets of high-quality structured explanations and develop a two-stage evaluation template to effectively supervise a vision-language model for both scoring and explanation generation. Result: EXPERT achieves state-of-the-art results, and comprehensive human evaluation validates its superiority in explanation generation. Conclusion: EXPERT is a reference-free evaluation metric that achieves state-of-the-art results on benchmark datasets and surpasses existing metrics by providing higher-quality explanations. Abstract: Recent advances in large language models and vision-language models have led to growing interest in explainable evaluation metrics for image captioning. However, these metrics generate explanations without standardized criteria, and the overall quality of the generated explanations remains unverified. In this paper, we propose EXPERT, a reference-free evaluation metric that provides structured explanations based on three fundamental criteria: fluency, relevance, and descriptiveness. By constructing large-scale datasets of high-quality structured explanations, we develop a two-stage evaluation template to effectively supervise a vision-language model for both scoring and explanation generation. EXPERT achieves state-of-the-art results on benchmark datasets while providing significantly higher-quality explanations than existing metrics, as validated through comprehensive human evaluation. Our code and datasets are available at https://github.com/hjkim811/EXPERT.

[80] STACK: Adversarial Attacks on LLM Safeguard Pipelines

Ian R. McKenzie,Oskar J. Hollinsworth,Tom Tseng,Xander Davies,Stephen Casper,Aaron D. Tucker,Robert Kirk,Adam Gleave

Main category: cs.CL

TL;DR: This paper evaluates AI defense pipelines and demonstrates that even advanced safeguards can be bypassed with novel attack methods like STACK, highlighting the need for improved security measures.

Details

Motivation: Frontier AI developers use defense pipelines to prevent catastrophic misuse of AI systems, but the security of these pipelines is not well understood. This research aims to evaluate and attack such pipelines to identify vulnerabilities. Method: The researchers developed an open-source defense pipeline and tested its security using a new method called STaged AttaCK (STACK). They evaluated the pipeline's performance against various attacks and datasets, including ClearHarm. Result: A new few-shot-prompted input and output classifier outperformed existing models, reducing the attack success rate (ASR) to 0% on the ClearHarm dataset. However, the STACK attack achieved a 71% ASR in black-box testing and 33% ASR in transfer settings, showing the potential for bypassing defenses. Conclusion: The study concludes that while defense pipelines are crucial for AI security, they are vulnerable to sophisticated attacks like STACK. Specific mitigations are suggested for developers to enhance the security of these pipelines. Abstract: Frontier AI developers are relying on layers of safeguards to protect against catastrophic misuse of AI systems. Anthropic guards their latest Claude 4 Opus model using one such defense pipeline, and other frontier developers including Google DeepMind and OpenAI pledge to soon deploy similar defenses. However, the security of such pipelines is unclear, with limited prior work evaluating or attacking these pipelines. We address this gap by developing and red-teaming an open-source defense pipeline. First, we find that a novel few-shot-prompted input and output classifier outperforms state-of-the-art open-weight safeguard model ShieldGemma across three attacks and two datasets, reducing the attack success rate (ASR) to 0% on the catastrophic misuse dataset ClearHarm. Second, we introduce a STaged AttaCK (STACK) procedure that achieves 71% ASR on ClearHarm in a black-box attack against the few-shot-prompted classifier pipeline. 
Finally, we also evaluate STACK in a transfer setting, achieving 33% ASR, providing initial evidence that it is feasible to design attacks with no access to the target pipeline. We conclude by suggesting specific mitigations that developers could use to thwart staged attacks.

[81] On the Predictive Power of Representation Dispersion in Language Models

Yanhong Li,Ming Li,Karen Livescu,Jiawei Zhou

Main category: cs.CL

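TL;DR: This paper shows that a language model's ability to predict text is linked to the breadth of its embedding space, and it proposes several practical, dispersion-based improvements for downstream tasks.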

Details

Motivation: To explore the relationship between a language model's text-prediction performance and the properties of its internal representations, and to find ways to improve model performance without labeled data. Method: The study measures representation dispersion as the average pairwise cosine distance among hidden vectors, across different model families and domains, and correlates it with perplexity. Result: Representation dispersion correlates strongly and negatively with perplexity. The paper also shows how this property can be used for effective model selection in new domains, for improving retrieval-based methods, and for increasing dispersion through a training objective. Conclusion: The paper concludes that a language model's text-prediction ability is closely tied to the breadth of its embedding space, and that increasing representation dispersion can improve model performance. Abstract: We show that a language model's ability to predict text is tightly linked to the breadth of its embedding space: models that spread their contextual representations more widely tend to achieve lower perplexity. Concretely, we find that representation dispersion - the average pairwise cosine distance among hidden vectors - strongly and negatively correlates with perplexity across diverse model families (LLaMA, Qwen, and others) and domains (Wikipedia, news, scientific abstracts). Beyond illustrating this link, we show how dispersion can be leveraged for a range of practical tasks without requiring labeled data. First, measuring dispersion on unlabeled text allows us to predict downstream accuracy in new domains, offering a data-efficient tool for model selection. Next, we find that identifying layers with higher dispersion pinpoints the best representations for retrieval-based methods such as kNN-LM, bypassing exhaustive layer-by-layer searches. Finally, we integrate a simple push-away objective into training, which increases dispersion in both single-domain and cross-domain scenarios and directly improves perplexity in each.
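The dispersion measure named in the abstract (average pairwise cosine distance among hidden vectors) can be illustrated with a minimal sketch. The function names and the toy vectors below are hypothetical, and the sketch assumes cosine distance means 1 minus cosine similarity averaged over all unordered pairs; the paper's exact sampling and normalization are not stated in the abstract.

```python
import math
from itertools import combinations


def cosine_distance(u, v):
    """Return 1 - cos(u, v) for two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def representation_dispersion(hidden_vectors):
    """Average cosine distance over all unordered pairs of vectors."""
    pairs = list(combinations(hidden_vectors, 2))
    if not pairs:
        raise ValueError("need at least two vectors")
    return sum(cosine_distance(u, v) for u, v in pairs) / len(pairs)


# Toy stand-ins for contextual hidden states (hypothetical values).
clustered = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]]
spread = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]

low = representation_dispersion(clustered)   # identical directions
high = representation_dispersion(spread)     # widely spread directions
```

In this toy case the clustered set has zero dispersion and the spread set has a clearly positive value, which is the direction of difference that the abstract associates with lower perplexity.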

[82] Computational Detection of Intertextual Parallels in Biblical Hebrew: A Benchmark Study Using Transformer-Based Language Models

David M. Smiley

Main category: cs.CL

TL;DR: This study demonstrates that pre-trained models like E5 and AlephBERT can efficiently and accurately detect intertextual parallels in biblical Hebrew, offering transformative potential for ancient text analysis.

Details

Motivation: Identifying parallel passages in biblical Hebrew is crucial for uncovering intertextual relationships, but traditional manual methods are labor-intensive and error-prone. This study explores automated alternatives to improve accuracy and efficiency. Method: This study evaluated pre-trained transformer-based language models (E5, AlephBERT, MPNet, and LaBSE) to detect textual parallels in the Hebrew Bible. Using cosine similarity and Wasserstein Distance measures, the models' capabilities were assessed based on known parallels between the books of Samuel/Kings and Chronicles. Result: The findings revealed that E5 and AlephBERT show significant promise in detecting textual parallels, with E5 excelling in parallel detection and AlephBERT demonstrating stronger differentiation of non-parallel passages. Conclusion: Pre-trained transformer-based language models, particularly E5 and AlephBERT, can significantly enhance the efficiency and accuracy of detecting intertextual parallels in biblical Hebrew texts, suggesting broader applications for ancient language studies. Abstract: Identifying parallel passages in biblical Hebrew is foundational in biblical scholarship for uncovering intertextual relationships. Traditional methods rely on manual comparison, which is labor-intensive and prone to human error. This study evaluates the potential of pre-trained transformer-based language models, including E5, AlephBERT, MPNet, and LaBSE, for detecting textual parallels in the Hebrew Bible. Focusing on known parallels between the books of Samuel/Kings and Chronicles, I assessed each model's capability to generate word embeddings that delineate parallel from non-parallel passages. Utilizing cosine similarity and Wasserstein Distance measures, I found that E5 and AlephBERT show significant promise, with E5 excelling in parallel detection and AlephBERT demonstrating stronger non-parallel differentiation. 
These findings indicate that pre-trained models can enhance the efficiency and accuracy of detecting intertextual parallels in ancient texts, suggesting broader applications for ancient language studies.
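The two measures named in the abstract can be sketched as follows. This is only an illustration: the abstract does not say what the Wasserstein distance is computed over, so the example applies the standard one-dimensional, equal-sample-size form to two hypothetical lists of similarity scores. All names and values are assumptions.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def wasserstein_1d(xs, ys):
    """1-Wasserstein distance between two equal-size 1-D samples.

    For equal-size samples with uniform weights this equals the mean
    absolute difference between the sorted values.
    """
    if not xs or len(xs) != len(ys):
        raise ValueError("samples must be non-empty and equal in size")
    return sum(abs(a - b) for a, b in zip(sorted(xs), sorted(ys))) / len(xs)


# Hypothetical similarity scores for known parallels and non-parallels.
parallel_scores = [0.9, 0.8, 0.85]
non_parallel_scores = [0.3, 0.4, 0.35]

gap = wasserstein_1d(parallel_scores, non_parallel_scores)
```

A larger gap between the two score distributions would suggest that an embedding model separates parallel from non-parallel passages more cleanly, which is one way to read the differentiation the abstract describes.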

cs.CV [Back]

[83] Robust Perspective Correction for Real-World Crack Evolution Tracking in Image-Based Structural Health Monitoring

Xinxin Sun,Peter Chang

Main category: cs.CV

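TL;DR: This study develops a new image alignment framework for structural health monitoring that addresses the limitations of traditional methods in crack localization, offering high accuracy, low error, and good adaptability.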

Details

Motivation: Traditional feature detectors such as SIFT and SURF perform poorly on high-frequency edges, while lightweight binary alternatives such as ORB and BRISK have poor keypoint repeatability on textured or shadowed surfaces, so a method suited to the challenges of structural health monitoring (SHM) is needed. Method: The study adapts the open KAZE architecture, using nonlinear anisotropic diffusion to construct a crack-preserving scale space and combining it with RANSAC-based homography estimation, which achieves geometric correction without training, parameter tuning, or prior calibration. Result: The method was validated on time-lapse images of masonry and concrete captured with a handheld smartphone. Compared with classical detectors, it reduces crack area and spine length errors by up to 70% and 90%, respectively, while keeping alignment error below 5% in key metrics. Conclusion: The study proposes a physics-informed alignment framework that, through nonlinear scale-space modeling, offers a robust and physically grounded alternative for structural health monitoring and can effectively track real-world crack evolution. Abstract: Accurate image alignment is essential for monitoring crack evolution in structural health monitoring (SHM), particularly under real-world conditions involving perspective distortion, occlusion, and low contrast. However, traditional feature detectors such as SIFT and SURF, which rely on Gaussian-based scale spaces, tend to suppress high-frequency edges, making them unsuitable for thin crack localization. Lightweight binary alternatives like ORB and BRISK, while computationally efficient, often suffer from poor keypoint repeatability on textured or shadowed surfaces. This study presents a physics-informed alignment framework that adapts the open KAZE architecture to SHM-specific challenges. By utilizing nonlinear anisotropic diffusion to construct a crack-preserving scale space, and integrating RANSAC-based homography estimation, the framework enables accurate geometric correction without the need for training, parameter tuning, or prior calibration. The method is validated on time-lapse images of masonry and concrete acquired via handheld smartphone under varied field conditions, including shadow interference, cropping, oblique viewing angles, and surface clutter. Compared to classical detectors, the proposed framework reduces crack area and spine length errors by up to 70 percent and 90 percent, respectively, while maintaining sub-5 percent alignment error in key metrics. Unsupervised, interpretable, and computationally lightweight, this approach supports scalable deployment via UAVs and mobile platforms. 
By tailoring nonlinear scale-space modeling to SHM image alignment, this work offers a robust and physically grounded alternative to conventional techniques for tracking real-world crack evolution.

[84] Counting with Confidence: Accurate Pest Monitoring in Water Traps

Xumin Gao,Mark Stevens,Grzegorz Cielniak

Main category: cs.CV

TL;DR: This paper proposes a comprehensive method for evaluating pest counting confidence in real-world scenarios, combining counting results with environmental factors, resulting in improved accuracy.

Details

Motivation: The motivation is to address the lack of reliable evaluation of pest counting results in real-world scenarios where ground truth data is unavailable. Method: The method involves a pest detection network for detection and counting, image quality and complexity assessments, pest distribution uniformity assessment using an adaptive DBSCAN clustering algorithm, and a regression model to predict pest counting confidence. Result: Experimental results show a 31.7% reduction in MSE and a 15.2% improvement in R2 on the pest counting confidence test set compared to the baseline. Conclusion: The paper concludes that their proposed method effectively evaluates pest counting confidence by incorporating information related to counting results and external environmental conditions, demonstrating significant improvements over baseline methods. Abstract: Accurate pest population monitoring and tracking their dynamic changes are crucial for precision agriculture decision-making. A common limitation in existing vision-based automatic pest counting research is that models are typically evaluated on datasets with ground truth but deployed in real-world scenarios without assessing the reliability of counting results due to the lack of ground truth. To this end, this paper proposed a method for comprehensively evaluating pest counting confidence in the image, based on information related to counting results and external environmental conditions. First, a pest detection network is used for pest detection and counting, extracting counting result-related information. Then, the pest images undergo image quality assessment, image complexity assessment, and pest distribution uniformity assessment. And the changes in image clarity caused by stirring during image acquisition are quantified by calculating the average gradient magnitude. 
Notably, we designed a hypothesis-driven multi-factor sensitivity analysis method to select the optimal image quality assessment and image complexity assessment methods. And we proposed an adaptive DBSCAN clustering algorithm for pest distribution uniformity assessment. Finally, the obtained information related to counting results and external environmental conditions is input into a regression model for prediction, resulting in the final pest counting confidence. To the best of our knowledge, this is the first study dedicated to comprehensively evaluating counting confidence in counting tasks, and quantifying the relationship between influencing factors and counting confidence through a model. Experimental results show our method reduces MSE by 31.7% and improves R2 by 15.2% on the pest counting confidence test set, compared to the baseline built primarily on information related to counting results.
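The clarity cue mentioned in the abstract, the average gradient magnitude, can be sketched in a few lines. This is a minimal illustration that assumes a grayscale image stored as a 2-D list and forward finite differences; the paper's exact gradient operator and normalization are not given in the abstract, and the function name and sample grids are hypothetical.

```python
import math


def average_gradient_magnitude(image):
    """Mean of sqrt(gx^2 + gy^2) over the image, using forward differences.

    `image` is a 2-D list of grayscale intensities (rows of equal length).
    The last row and column are skipped because they have no forward
    neighbour.
    """
    rows, cols = len(image), len(image[0])
    if rows < 2 or cols < 2:
        raise ValueError("image must be at least 2x2")
    total = 0.0
    for r in range(rows - 1):
        for c in range(cols - 1):
            gx = image[r][c + 1] - image[r][c]  # horizontal change
            gy = image[r + 1][c] - image[r][c]  # vertical change
            total += math.sqrt(gx * gx + gy * gy)
    return total / ((rows - 1) * (cols - 1))


# Hypothetical 3x3 patches: a uniform (blurry) one and a high-contrast one.
flat = [[5, 5, 5], [5, 5, 5], [5, 5, 5]]
edges = [[0, 10, 0], [10, 0, 10], [0, 10, 0]]

blurry_score = average_gradient_magnitude(flat)
sharp_score = average_gradient_magnitude(edges)
```

A lower score indicates a smoother, less sharp image, which is how a drop in clarity caused by stirring could show up as one input feature for the confidence regressor.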

[85] Modulated Diffusion: Accelerating Generative Modeling with Modulated Quantization

Weizhi Gao,Zhichao Hou,Junqi Yin,Feiyi Wang,Linyu Peng,Xiaorui Liu

Main category: cs.CV

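TL;DR: This paper proposes MoDiff, a new framework that effectively accelerates the generation process of diffusion models while maintaining high-quality output.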

Details

Motivation: Diffusion models are powerful, but the high computation cost of sampling is a bottleneck, so a new acceleration method is needed to break the limits of current techniques. Method: The paper introduces Modulated Diffusion (MoDiff), which combines modulated quantization with error compensation, and validates it through theoretical analysis and experiments. Result: Experiments on CIFAR-10 and LSUN show that MoDiff reduces activation quantization from 8 bits to 3 bits without performance degradation. Conclusion: MoDiff is an innovative and rigorous framework that accelerates generative modeling through modulated quantization and error compensation. It inherits the advantages of existing caching and quantization methods and is applicable to accelerating all diffusion models. Abstract: Diffusion models have emerged as powerful generative models, but their high computation cost in iterative sampling remains a significant bottleneck. In this work, we present an in-depth and insightful study of state-of-the-art acceleration techniques for diffusion models, including caching and quantization, revealing their limitations in computation error and generation quality. To break these limits, this work introduces Modulated Diffusion (MoDiff), an innovative, rigorous, and principled framework that accelerates generative modeling through modulated quantization and error compensation. MoDiff not only inherits the advantages of existing caching and quantization methods but also serves as a general framework to accelerate all diffusion models. The advantages of MoDiff are supported by solid theoretical insight and analysis. In addition, extensive experiments on CIFAR-10 and LSUN demonstrate that MoDiff significantly reduces activation quantization from 8 bits to 3 bits without performance degradation in post-training quantization (PTQ). Our code implementation is available at https://github.com/WeizhiGao/MoDiff.

[86] ViFusionTST: Deep Fusion of Time-Series Image Representations from Load Signals for Early Bed-Exit Prediction

Hao Liu,Yu Hu,Rakiba Rayhana,Ling Bai,Zheng Liu

Main category: cs.CV

TL;DR: This paper presents ViFusionTST, a novel deep learning model that uses load-cell sensors and image-based fusion techniques to predict early bed-exit intent, enabling proactive fall prevention in healthcare facilities.

Details

Motivation: Bed-related falls are a significant source of injury in hospitals and long-term-care facilities, and current commercial alarms often trigger only after a patient has already left the bed. This motivates the need for more proactive fall-prevention systems. Method: The method involves predicting early bed-exit intent through four low-cost load cells under bed legs. The load signals are converted into complementary images such as an RGB line plot and texture maps (recurrence plot, Markov transition field, Gramian angular field). These images are processed by a dual-stream Swin Transformer (ViFusionTST) with cross-attention to learn modality weights. Result: On a six-month dataset collected from 95 beds in a long-term-care facility, ViFusionTST achieved an accuracy of 0.885 and an F1 score of 0.794, outperforming recent 1D and 2D time-series baselines across multiple metrics including F1, recall, accuracy, and AUPRC. Conclusion: The study concludes that the ViFusionTST model, using image-based fusion of load-sensor signals, is a practical and effective solution for real-time fall prevention in healthcare settings. Abstract: Bed-related falls remain a leading source of injury in hospitals and long-term-care facilities, yet many commercial alarms trigger only after a patient has already left the bed. We show that early bed-exit intent can be predicted using only four low-cost load cells mounted under the bed legs. The resulting load signals are first converted into a compact set of complementary images: an RGB line plot that preserves raw waveforms and three texture maps - recurrence plot, Markov transition field, and Gramian angular field - that expose higher-order dynamics. We introduce ViFusionTST, a dual-stream Swin Transformer that processes the line plot and texture maps in parallel and fuses them through cross-attention to learn data-driven modality weights. 
To provide a realistic benchmark, we collected six months of continuous data from 95 beds in a long-term-care facility. On this real-world dataset ViFusionTST reaches an accuracy of 0.885 and an F1 score of 0.794, surpassing recent 1D and 2D time-series baselines across F1, recall, accuracy, and AUPRC. The results demonstrate that image-based fusion of load-sensor signals for time series classification is a practical and effective solution for real-time, privacy-preserving fall prevention.
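The texture maps named in this abstract can be illustrated on a toy signal. The sketch below is not the authors' pipeline: it assumes the summation variant of the Gramian angular field and a simple thresholded recurrence plot, and the load values, the `eps` threshold, and all function names are hypothetical.

```python
import math

def rescale(series):
    """Min-max rescale a sequence to [-1, 1]; a constant series maps to 0."""
    lo, hi = min(series), max(series)
    if hi == lo:
        return [0.0 for _ in series]
    return [2.0 * (x - lo) / (hi - lo) - 1.0 for x in series]

def gramian_angular_field(series):
    """Summation variant: G[i][j] = cos(phi_i + phi_j), with phi = arccos(x)."""
    scaled = rescale(series)
    # Clamp before arccos to guard against floating-point overshoot.
    phi = [math.acos(max(-1.0, min(1.0, x))) for x in scaled]
    return [[math.cos(a + b) for b in phi] for a in phi]

def recurrence_plot(series, eps):
    """Binary recurrence plot: R[i][j] = 1 when |x_i - x_j| <= eps."""
    return [[1 if abs(a - b) <= eps else 0 for b in series] for a in series]

# Made-up load-cell readings (kg) for one bed leg; not data from the paper.
load = [50.0, 50.5, 51.0, 58.0, 70.0, 69.5]
gaf = gramian_angular_field(load)
rp = recurrence_plot(load, eps=1.0)
```

Each function turns a length-n signal into an n-by-n matrix that can be rendered as an image, which is what lets a vision backbone consume a 1D time series.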

[87] Scalable Dynamic Origin-Destination Demand Estimation Enhanced by High-Resolution Satellite Imagery Data

Jiachao Liu,Pablo Guarda,Koichiro Niinuma,Sean Qian

Main category: cs.CV

TL;DR: This paper proposes an enhanced dynamic origin-destination demand estimation framework by integrating satellite imagery with traditional traffic data, showing improved accuracy and scalability for urban traffic management.

Details

Motivation: Traditional traffic data sources like local sensors are sparse and limited in coverage, while satellite imagery provides comprehensive, city-wide traffic information, including parking vehicles, which can enhance DODE accuracy. Method: The authors proposed a computer vision pipeline to extract vehicle density data from satellite imagery and combined it with local sensor data in a graph-based DODE model. They validated the approach using synthetic and real-world experiments. Result: Out-of-sample tests showed improved estimation performance when satellite-derived density data was integrated, especially for unsensed links. Real-world experiments confirmed scalability and practical applicability. Conclusion: The study concludes that integrating satellite imagery with traditional traffic data significantly improves dynamic origin-destination demand estimation, particularly for areas without local sensors, and demonstrates scalability for large urban networks. Abstract: This study presents a novel integrated framework for dynamic origin-destination demand estimation (DODE) in multi-class mesoscopic network models, leveraging high-resolution satellite imagery together with conventional traffic data from local sensors. Unlike sparse local detectors, satellite imagery offers consistent, city-wide road and traffic information of both parking and moving vehicles, overcoming data availability limitations. To extract information from imagery data, we design a computer vision pipeline for class-specific vehicle detection and map matching, generating link-level traffic density observations by vehicle class. Building upon this information, we formulate a computational graph-based DODE model that calibrates dynamic network states by jointly matching observed traffic counts and travel times from local sensors with density measurements derived from satellite imagery. 
To assess the accuracy and scalability of the proposed framework, we conduct a series of numerical experiments using both synthetic and real-world data. The results of out-of-sample tests demonstrate that supplementing traditional data with satellite-derived density significantly improves estimation performance, especially for links without local sensors. Real-world experiments also confirm the framework's capability to handle large-scale networks, supporting its potential for practical deployment in cities of varying sizes. Sensitivity analysis further evaluates the impact of data quality related to satellite imagery data.

[88] Visual-Semantic Knowledge Conflicts in Operating Rooms: Synthetic Data Curation for Surgical Risk Perception in Multimodal Large Language Models

Weiyi Zhao,Xiaoyu Tan,Liang Liu,Sijia Li,Youwei Song,Xihe Qiu

Main category: cs.CV

TL;DR: This paper introduces the OR-VSKC dataset for identifying surgical risks by addressing visual-semantic knowledge conflicts in MLLMs through synthetic image generation and benchmarking.

Details

Motivation: The motivation stems from the critical need for surgical risk identification to enhance patient safety and reduce preventable medical errors, particularly addressing visual-semantic knowledge conflicts (VS-KC) in multimodal large language models (MLLMs). Method: A dataset of over 34,000 synthetic images was generated using diffusion models to depict operating room scenes with safety violations. Additionally, 214 human-annotated images were used as a gold-standard reference. The methodology involved analyzing MLLMs' vulnerabilities through this dataset and benchmarking their performance. Result: The result includes the successful creation of the OR-VSKC dataset with diverse scenarios, an open-source release of the dataset and benchmark, and empirical insights into the learning specificity of MLLMs regarding violation-sensitive knowledge consistency. Conclusion: The study concludes that fine-tuning on the OR-VSKC dataset significantly enhances MLLMs' ability to detect trained conflict entities and generalizes well to new viewpoints, although performance on untrained entity types remains suboptimal, indicating a need for more comprehensive training strategies. Abstract: Surgical risk identification is critical for patient safety and reducing preventable medical errors. While multimodal large language models (MLLMs) show promise for automated operating room (OR) risk detection, they often exhibit visual-semantic knowledge conflicts (VS-KC), failing to identify visual safety violations despite understanding textual rules. To address this, we introduce a dataset comprising over 34,000 synthetic images generated by diffusion models, depicting operating room scenes containing entities that violate established safety rules. These images were created to alleviate data scarcity and examine MLLMs vulnerabilities. In addition, the dataset includes 214 human-annotated images that serve as a gold-standard reference for validation. 
This comprehensive dataset, spanning diverse perspectives, stages, and configurations, is designed to expose and study VS-KC. Fine-tuning on OR-VSKC significantly improves MLLMs' detection of trained conflict entities and generalizes well to new viewpoints for these entities, but performance on untrained entity types remains poor, highlighting learning specificity and the need for comprehensive training. The main contributions of this work include: (1) a data generation methodology tailored for rule-violation scenarios; (2) the release of the OR-VSKC dataset and its associated benchmark as open-source resources; and (3) an empirical analysis of violation-sensitive knowledge consistency in representative MLLMs. The dataset and appendix are available at https://github.com/zgg2577/VS-KC.

[89] How Can Multimodal Remote Sensing Datasets Transform Classification via SpatialNet-ViT?

Gautam Siddharth Kashyap,Manaswi Kulahara,Nipun Joshi,Usman Naseem

Main category: cs.CV

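TL;DR: This study proposes SpatialNet-ViT, which combines Vision Transformers with multi-task learning to significantly improve the performance and generalization of remote sensing data classification.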

Details

Motivation: Existing studies tend to focus on narrow tasks or datasets, which limits generalization; this work aims to address that limitation. Method: The authors propose the SpatialNet-ViT model and employ data augmentation, transfer learning, and multi-task learning. Result: Classification accuracy and model robustness improve, and the approach applies to diverse remote sensing datasets. Conclusion: By combining ViTs with multi-task learning, SpatialNet-ViT improves the accuracy and scalability of remote sensing classification. Abstract: Remote sensing datasets offer significant promise for tackling key classification tasks such as land-use categorization, object presence detection, and rural/urban classification. However, many existing studies tend to focus on narrow tasks or datasets, which limits their ability to generalize across various remote sensing classification challenges. To overcome this, we propose a novel model, SpatialNet-ViT, leveraging the power of Vision Transformers (ViTs) and Multi-Task Learning (MTL). This integrated approach combines spatial awareness with contextual understanding, improving both classification accuracy and scalability. Additionally, techniques like data augmentation, transfer learning, and multi-task learning are employed to enhance model robustness and its ability to generalize across diverse datasets.

[90] What Makes a Dribble Successful? Insights From 3D Pose Tracking Data

Michiel Schepers,Pieter Robberechts,Jan Van Haaren,Jesse Davis

Main category: cs.CV

TL;DR: This study uses 3D pose tracking data to extract new features related to balance and orientation, which improve predictions of dribble success in soccer when combined with traditional 2D positional data.

Details

Motivation: Traditional 2D positional tracking data fails to capture nuanced aspects like balance, orientation, and ball control in soccer, limiting insights into dribbling skills. This study aims to explore how 3D pose tracking data can improve understanding of dribbling performance. Method: The researchers extracted novel pose-based features from 1,736 dribbles during the 2022/23 Champions League season and evaluated their impact on predicting dribble success, alongside traditional 2D positional data. Result: Pose-based features capturing an attacker's balance and alignment of orientation between attacker and defender were found to be informative predictors of dribble success. Adding these features improved model performance compared to using only traditional 2D positional data. Conclusion: The study concludes that incorporating pose-based features derived from 3D pose tracking data enhances the predictive performance of models evaluating dribble success in soccer. Abstract: Data analysis plays an increasingly important role in soccer, offering new ways to evaluate individual and team performance. One specific application is the evaluation of dribbles: one-on-one situations where an attacker attempts to bypass a defender with the ball. While previous research has primarily relied on 2D positional tracking data, this fails to capture aspects like balance, orientation, and ball control, limiting the depth of current insights. This study explores how pose tracking data (capturing players' posture and movement in three dimensions) can improve our understanding of dribbling skills. We extract novel pose-based features from 1,736 dribbles in the 2022/23 Champions League season and evaluate their impact on dribble success. Our results indicate that features capturing the attacker's balance and the alignment of the orientation between the attacker and defender are informative for predicting dribble success. 
Incorporating these pose-based features on top of features derived from traditional 2D positional data leads to a measurable improvement in model performance.

[91] Patch2Loc: Learning to Localize Patches for Unsupervised Brain Lesion Detection

Hassan Baker,Austin J. Brockmeier

Main category: cs.CV

TL;DR: This paper proposes Patch2Loc, an unsupervised method for detecting brain lesions using MRI, which outperforms existing unsupervised approaches.

Details

Motivation: Detecting brain lesions is essential for diagnosis and treatment, but supervised learning methods require annotated data. An unsupervised approach can overcome this limitation and aid radiologists in identifying abnormalities such as tumors and malformations. Method: Patch2Loc trains a neural network model using normal patches from structural MRI to map a patch back to its spatial location within a brain slice. Abnormal patches are detected during inference by higher error and/or variance in location prediction, generating a heatmap for finer-grained segmentation. Result: The model successfully segments abnormal brain tissues, demonstrated through the detection of tumor tissues in MRI datasets (BraTS2021, MSLUB, ATLAS, WMH), outperforming current unsupervised segmentation techniques. Conclusion: The proposed unsupervised approach, Patch2Loc, demonstrates superior performance in segmenting abnormal brain tissues compared to existing state-of-the-art methods. Abstract: Detecting brain lesions as abnormalities observed in magnetic resonance imaging (MRI) is essential for diagnosis and treatment. In the search of abnormalities, such as tumors and malformations, radiologists may benefit from computer-aided diagnostics that use computer vision systems trained with machine learning to segment normal tissue from abnormal brain tissue. While supervised learning methods require annotated lesions, we propose a new unsupervised approach (Patch2Loc) that learns from normal patches taken from structural MRI. We train a neural network model to map a patch back to its spatial location within a slice of the brain volume. During inference, abnormal patches are detected by the relatively higher error and/or variance of the location prediction. This generates a heatmap that can be integrated into pixel-wise methods to achieve finer-grained segmentation. 
We demonstrate the ability of our model to segment abnormal brain tissues by applying our approach to the detection of tumor tissues in MRI on T2-weighted images from BraTS2021 and MSLUB datasets and T1-weighted images from ATLAS and WMH datasets. We show that it outperforms the state of the art in unsupervised segmentation. The codebase for this work can be found on our GitHub page: https://github.com/bakerhassan/Patch2Loc.

[92] Weakly Supervised Object Segmentation by Background Conditional Divergence

Hassan Baker,Matthew S. Emigh,Austin J. Brockmeier

Main category: cs.CV

TL;DR: This paper proposes a weakly supervised method for binary object segmentation that effectively works across multiple domains, including natural and sonar images, without requiring large amounts of labeled data or complex network architectures.

Details

Motivation: Automatic object segmentation is challenging in specialized image domains due to the lack of massive labeled data. The paper aims to develop a method that uses weak supervision to reduce labeling costs while maintaining performance across different imaging domains. Method: A masking network is trained with weak supervision (image-wise presence/absence labels) by creating counterfactual images through background clustering and blending segmented objects into new backgrounds. The training loss includes a divergence term between real and counterfactual images and a supervised loss for background-only images. Result: The approach outperforms unsupervised baselines on side-scan and synthetic aperture sonar images, and performs reasonably well on natural images without relying on pretrained models, generative networks, or adversarial critics. Conclusion: The proposed method successfully achieves binary object segmentation using weak supervision and can be applied to various domains, including natural images and synthetic aperture sonar images, avoiding the need for pretrained or generative networks and adversarial critics. Abstract: As a computer vision task, automatic object segmentation remains challenging in specialized image domains without massive labeled data, such as synthetic aperture sonar images, remote sensing, biomedical imaging, etc. In any domain, obtaining pixel-wise segmentation masks is expensive. In this work, we propose a method for training a masking network to perform binary object segmentation using weak supervision in the form of image-wise presence or absence of an object of interest, which provides less information but may be obtained more quickly from manual or automatic labeling. A key step in our method is that the segmented objects can be placed into background-only images to create realistic images of the objects with counterfactual backgrounds.
To create a contrast between the original and counterfactual background images, we propose to first cluster the background-only images, and then during learning create counterfactual images that blend objects segmented from their original source backgrounds to backgrounds chosen from a targeted cluster. One term in the training loss is the divergence between these counterfactual images and the real object images with backgrounds of the target cluster. The other term is a supervised loss for background-only images. While an adversarial critic could provide the divergence, we use sample-based divergences. We conduct experiments on side-scan and synthetic aperture sonar in which our approach succeeds compared to previous unsupervised segmentation baselines that were only tested on natural images. Furthermore, to show generality we extend our experiments to natural images, obtaining reasonable performance with our method that avoids pretrained networks, generative networks, and adversarial critics. The codebase for this work can be found on GitHub: https://github.com/bakerhassan/WSOS.

[93] FreeDNA: Endowing Domain Adaptation of Diffusion-Based Dense Prediction with Training-Free Domain Noise Alignment

Hang Xu,Jie Huang,Linjiang Huang,Dong Li,Yidi Liu,Feng Zhao

Main category: cs.CV

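TL;DR: This paper proposes a training-free domain adaptation method for diffusion-based dense prediction models, called DNA, which improves cross-domain performance by adjusting noise statistics.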

Details

Motivation: Exposure bias in diffusion models (e.g., noise statistics bias) brings domain shift, and the differing domains of the conditions can be effectively captured by noise prediction statistics; this observation motivates a training-free domain adaptation method. Method: A training-free Domain Noise Alignment (DNA) method is proposed that uses the noise statistics in diffusion models to achieve domain adaptation. When the source domain is available, DNA is applied directly; in the source-free case, statistics from high-confidence regions guide the adjustment of noise statistics during sampling. Result: The method is shown to be effective at enhancing the domain adaptation capability of DDP models on four common dense prediction tasks. Conclusion: By aligning the target domain's noise statistics with those of the source domain, DNA improves the domain adaptation capability of DDP models on four common dense prediction tasks without any training. Abstract: Domain Adaptation (DA) for dense prediction tasks is an important topic, which enhances the dense prediction model's performance when tested on its unseen domain. Recently, with the development of Diffusion-based Dense Prediction (DDP) models, DA designs tailored to this framework are worth exploring, since the diffusion model is effective in modeling the distribution transformation that comprises domain information. In this work, we propose a training-free mechanism for DDP frameworks, endowing them with DA capabilities. Our motivation arises from the observation that the exposure bias (e.g., noise statistics bias) in diffusion brings domain shift, and different domains in conditions of DDP models can also be effectively captured by the noise prediction statistics. Based on this, we propose a training-free Domain Noise Alignment (DNA) approach, which alleviates the variations of noise statistics to domain changes during the diffusion sampling process, thereby achieving domain adaptation. Specifically, when the source domain is available, we directly adopt the DNA method to achieve domain adaptation by aligning the noise statistics of the target domain with those of the source domain. For the more challenging source-free DA, inspired by the observation that regions closer to the source domain exhibit higher confidence meeting variations of sampling noise, we utilize the statistics from the high-confidence regions progressively to guide the noise statistic adjustment during the sampling process.
Notably, our method demonstrates the effectiveness of enhancing the DA capability of DDP models across four common dense prediction tasks. Code is available at https://github.com/xuhang07/FreeDNA.

[94] Lightning the Night with Generative Artificial Intelligence

Tingting Zhou,Feng Zhang,Haoyang Fu,Baoxiang Pan,Renhe Zhang,Feng Lu,Zhixin Yang

Main category: cs.CV

TL;DR: This research introduces RefDiff, a diffusion-based model that enables accurate nighttime retrieval of visible light reflectance data from geostationary satellites, enhancing weather observation capabilities beyond daylight hours.

Details

Motivation: Visible light reflectance data is crucial for meteorological observations but cannot be used at night due to the lack of sunlight. This limitation necessitated the development of a method for nighttime visible light reflectance retrieval. Method: Based on multi-band thermal infrared brightness temperature data from the Fengyun-4B satellite's AGRI sensor, a generative diffusion model was designed for visible light reflectance retrieval at night for specific wavelength bands (0.47 μm, 0.65 μm, and 0.825 μm). Result: RefDiff achieved an SSIM index of 0.90, significantly improving accuracy compared to classical models, particularly in areas with complex cloud structures and thick clouds. The nighttime retrieval capability was validated using VIIRS nighttime products. Conclusion: This study successfully developed a high-precision visible light reflectance retrieval model named RefDiff using generative diffusion models, enabling nighttime retrieval of visible light reflectance with comparable performance to daytime data. Abstract: The visible light reflectance data from geostationary satellites is crucial for meteorological observations and plays an important role in weather monitoring and forecasting. However, due to the lack of visible light at night, it is impossible to conduct continuous all-day weather observations using visible light reflectance data. This study pioneers the use of generative diffusion models to address this limitation. Based on the multi-band thermal infrared brightness temperature data from the Advanced Geostationary Radiation Imager (AGRI) onboard the Fengyun-4B (FY4B) geostationary satellite, we developed a high-precision visible light reflectance retrieval model, called Reflectance Diffusion (RefDiff), which enables 0.47 μm, 0.65 μm, and 0.825 μm bands visible light reflectance retrieval at night.
Compared to the classical models, RefDiff not only significantly improves accuracy through ensemble averaging but also provides uncertainty estimation. Specifically, the SSIM index of RefDiff can reach 0.90, with particularly significant improvements in areas with complex cloud structures and thick clouds. The model's nighttime retrieval capability was validated using VIIRS nighttime product, demonstrating comparable performance to its daytime counterpart. In summary, this research has made substantial progress in the ability to retrieve visible light reflectance at night, with the potential to expand the application of nighttime visible light data.

[95] Automated Defect Identification and Categorization in NDE 4.0 with the Application of Artificial Intelligence

Aditya Sharma

Main category: cs.CV

TL;DR: This paper proposes an automated framework using a modified U-net model for enhanced fault detection in radiography, demonstrating high efficiency and effectiveness in identifying defects in airplane welds.

Details

Motivation: This study aims to address the lack of sufficiently explained information and explore how to maximize virtual defect increase to determine the viability of the framework in the context of NDE 4.0. Method: A modified U-net model is used with an improved dataset from information expansion systems like virtual defect increase and standard increase. The model's effectiveness is assessed using NDE boundaries such as Case, estimating exactness, and misleading call rate. Result: The suggested approach achieves exceptional awareness in defect detection with strong differentiating evidence of flaws, especially considering a 90/95 size error and fake call rate in the weld area where the consolidated expansion approach outperforms others. Conclusion: The framework proposed in this paper shows promise as a support device in the testing cycle for fault detection and organization in contemporary radiography, particularly due to its fast derivation speed and efficient processing of large images. Abstract: This investigation attempts to create an automated framework for fault detection and organization for usage in contemporary radiography, as per NDE 4.0. The review's goals are to address the lack of information that is sufficiently explained, learn how to make the most of virtual defect increase, and determine whether the framework is viable by using NDE measurements. As its basic information source, the technique consists of compiling and categorizing 223 CR photographs of airplane welds. Information expansion systems, such as virtual defect increase and standard increase, are used to work on the preparation dataset. A modified U-net model is prepared using the improved data to produce semantic fault division veils. To assess the effectiveness of the model, NDE boundaries such as Case, estimating exactness, and misleading call rate are used. 
Tiny a90/95 characteristics, which provide strong differentiating evidence of flaws, reveal that the suggested approach achieves exceptional awareness in defect detection. Considering a 90/95, size error, and fake call rate in the weld area, the consolidated expansion approach clearly wins. Due to the framework's fast derivation speed, large images can be broken down efficiently and quickly. Professional controllers evaluate the transmitted system in the field and believe that it has a guarantee as a support device in the testing cycle, irrespective of particular equipment cut-off points and programming resemblance.

[96] Container damage detection using advanced computer vision model Yolov12 vs Yolov11 vs RF-DETR A comparative analysis

Subhadip Kumar

Main category: cs.CV

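TL;DR: This paper compares several computer vision models for container damage detection and finds that although Yolov11 and Yolov12 achieve higher mean average precision, RF-DETR performs better at detecting uncommon damage types.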

Details

Motivation: Timely inspection and detection of damaged containers is key to prolonging service life and avoiding safety hazards. Method: The damage detection performance of three state-of-the-art computer vision models, Yolov12, Yolov11, and RF-DETR, is compared. Result: Yolov11 and Yolov12 scored an mAP@50 of 81.9%, versus 77.7% for RF-DETR. However, when tested on uncommon damaged containers, RF-DETR outperformed the other models overall. Conclusion: RF-DETR outperforms the other models at detecting uncommon damaged containers, showing an ability to accurately detect both damaged containers and damage occurrences. Abstract: Containers are an integral part of the logistics industry and act as a barrier for cargo. A typical service life for a container is more than 20 years. However, over time, containers suffer various types of damage due to mechanical as well as natural factors. A damaged container is a safety hazard for the employees handling it and a liability for the logistics company. Therefore, timely inspection and detection of the damaged container is key to prolonging service life as well as avoiding safety hazards. In this paper, we will compare the performance of the damage detection by three state-of-the-art advanced computer vision models: Yolov12, Yolov11, and RF-DETR. We will use a dataset of 278 annotated images to train, validate and test the model. We will compare the mAP and precision of the model. The objective of this paper is to identify the model that is best suited for container damage detection. The result is mixed. The mAP@50 score of Yolov11 and Yolov12 was 81.9%, compared to 77.7% for RF-DETR. However, while testing the model for not-so-common damaged containers, the RF-DETR model outperformed the others overall, exhibiting superiority in accurately detecting both damaged containers and damage occurrences with high confidence.
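The mAP@50 metric quoted in this entry counts a predicted box as a correct detection when its intersection over union (IoU) with a ground-truth box is at least 0.5. The sketch below shows only that matching criterion, not the full mAP computation and not the paper's evaluation code; the box coordinates and function names are made up for illustration.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def is_match_at_50(pred_box, gt_box):
    """True when the prediction would count as correct under the @50 rule."""
    return iou(pred_box, gt_box) >= 0.5

# Hypothetical ground-truth dent and two candidate predictions.
ground_truth = (0.0, 0.0, 10.0, 10.0)
loose_pred = (5.0, 0.0, 15.0, 10.0)   # overlaps half of the ground truth
tight_pred = (2.0, 0.0, 12.0, 10.0)   # overlaps most of the ground truth
```

A full mAP@50 score additionally ranks predictions by confidence, builds a precision-recall curve per class, and averages the resulting average precision over classes.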

[97] Preserve Anything: Controllable Image Synthesis with Object Preservation

Prasen Kumar Sharma,Neeraj Matiyali,Siddharth Srivastava,Gaurav Sharma

Main category: cs.CV

TL;DR: Preserve Anything is a new approach to text-to-image generation that enhances object preservation, semantic alignment, and user control over scenes, achieving state-of-the-art results.

Details

Motivation: The motivation stems from the limitations in existing text-to-image generation approaches, which struggle with preserving multiple objects, maintaining semantic alignment with prompts, and offering explicit control over scene composition. Method: The method involves an N-channel ControlNet framework with object preservation, background guidance modules, lighting consistency enforcement, and a high-frequency overlay module. A benchmark dataset of 240K natural images and 18K synthetic images was also introduced. Result: Empirical results show that the proposed method improves feature-space fidelity (FID 15.26) and semantic alignment (CLIP-S 32.85). User studies indicate improvements of ~25%, ~19%, ~13%, and ~14% in prompt alignment, photorealism, AI artifacts reduction, and natural aesthetics respectively. Conclusion: The paper concludes that Preserve Anything achieves state-of-the-art performance in controlled image synthesis, significantly improving feature-space fidelity and semantic alignment. Abstract: We introduce Preserve Anything, a novel method for controlled image synthesis that addresses key limitations in object preservation and semantic consistency in text-to-image (T2I) generation. Existing approaches often fail (i) to preserve multiple objects with fidelity, (ii) maintain semantic alignment with prompts, or (iii) provide explicit control over scene composition. To overcome these challenges, the proposed method employs an N-channel ControlNet that integrates (i) object preservation with size and placement agnosticism, color and detail retention, and artifact elimination, (ii) high-resolution, semantically consistent backgrounds with accurate shadows, lighting, and prompt adherence, and (iii) explicit user control over background layouts and lighting conditions.
Key components of our framework include object preservation and background guidance modules, enforcing lighting consistency and a high-frequency overlay module to retain fine details while mitigating unwanted artifacts. We introduce a benchmark dataset consisting of 240K natural images filtered for aesthetic quality and 18K 3D-rendered synthetic images with metadata such as lighting, camera angles, and object relationships. This dataset addresses the deficiencies of existing benchmarks and allows a complete evaluation. Empirical results demonstrate that our method achieves state-of-the-art performance, significantly improving feature-space fidelity (FID 15.26) and semantic alignment (CLIP-S 32.85) while maintaining competitive aesthetic quality. We also conducted a user study to demonstrate the efficacy of the proposed work on unseen benchmark and observed a remarkable improvement of ~25%, ~19%, ~13%, and ~14% in terms of prompt alignment, photorealism, the presence of AI artifacts, and natural aesthetics over existing works.

[98] Seamless Interaction: Dyadic Audiovisual Motion Modeling and Large-Scale Dataset

Vasu Agrawal,Akinniyi Akinyemi,Kathryn Alvero,Morteza Behrooz,Julia Buffalini,Fabio Maria Carlucci,Joy Chen,Junming Chen,Zhang Chen,Shiyang Cheng,Praveen Chowdary,Joe Chuang,Antony D'Avirro,Jon Daly,Ning Dong,Mark Duppenthaler,Cynthia Gao,Jeff Girard,Martin Gleize,Sahir Gomez,Hongyu Gong,Srivathsan Govindarajan,Brandon Han,Sen He,Denise Hernandez,Yordan Hristov,Rongjie Huang,Hirofumi Inaguma,Somya Jain,Raj Janardhan,Qingyao Jia,Christopher Klaiber,Dejan Kovachev,Moneish Kumar,Hang Li,Yilei Li,Pavel Litvin,Wei Liu,Guangyao Ma,Jing Ma,Martin Ma,Xutai Ma,Lucas Mantovani,Sagar Miglani,Sreyas Mohan,Louis-Philippe Morency,Evonne Ng,Kam-Woh Ng,Tu Anh Nguyen,Amia Oberai,Benjamin Peloquin,Juan Pino,Jovan Popovic,Omid Poursaeed,Fabian Prada,Alice Rakotoarison,Alexander Richard,Christophe Ropers,Safiyyah Saleem,Vasu Sharma,Alex Shcherbyna,Jia Shen,Jie Shen,Anastasis Stathopoulos,Anna Sun,Paden Tomasello,Tuan Tran,Arina Turkatenko,Bo Wan,Chao Wang,Jeff Wang,Mary Williamson,Carleigh Wood,Tao Xiang,Yilin Yang,Julien Yao,Chen Zhang,Jiemin Zhang,Xinyue Zhang,Jason Zheng,Pavlo Zhyzheria,Jan Zikes,Michael Zollhoefer

Main category: cs.CV

TL;DR: This paper introduces a comprehensive dataset and modeling framework for understanding and generating human-like interactive behaviors, aiming to enhance social intelligence in AI systems.

Details

Motivation: The motivation is to advance socially intelligent AI by capturing the complexity of human communication through both verbal and nonverbal signals, enabling breakthroughs in virtual agents and multimodal interaction tools. Method: The method involves creating a large-scale dataset of face-to-face interactions and developing models to analyze and generate dyadic behavioral dynamics, including motion gestures, facial expressions, and emotional responses. Result: The result includes the development of the Seamless Interaction Dataset, models that generate behaviorally aligned gestures and expressions, and assessment methods demonstrating improved human-AI interaction quality. Conclusion: The paper concludes that the Seamless Interaction Dataset and their associated models enable significant progress in socially intelligent AI technologies, allowing for more intuitive and responsive human-AI interactions. Abstract: Human communication involves a complex interplay of verbal and nonverbal signals, essential for conveying meaning and achieving interpersonal goals. To develop socially intelligent AI technologies, it is crucial to develop models that can both comprehend and generate dyadic behavioral dynamics. To this end, we introduce the Seamless Interaction Dataset, a large-scale collection of over 4,000 hours of face-to-face interaction footage from over 4,000 participants in diverse contexts. This dataset enables the development of AI technologies that understand dyadic embodied dynamics, unlocking breakthroughs in virtual agents, telepresence experiences, and multimodal content analysis tools. We also develop a suite of models that utilize the dataset to generate dyadic motion gestures and facial expressions aligned with human speech. These models can take as input both the speech and visual behavior of their interlocutors. 
We present a variant with speech from an LLM model and integrations with 2D and 3D rendering methods, bringing us closer to interactive virtual agents. Additionally, we describe controllable variants of our motion models that can adapt emotional responses and expressivity levels, as well as generating more semantically-relevant gestures. Finally, we discuss methods for assessing the quality of these dyadic motion models, which are demonstrating the potential for more intuitive and responsive human-AI interactions.

[99] Recomposed realities: animating still images via patch clustering and randomness

Markus Juvonen,Samuli Siltanen

Main category: cs.CV

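TL;DR: This paper introduces an image processing technique that uses clustering and random sampling to create dynamic, lively new images from existing image data.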

Details

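Motivation: To bring still images to life through motion, emphasizing reinterpretation over replication, so that the source and target domains can differ conceptually while sharing local structures. Method: Image patches from existing image datasets are grouped using k-means clustering, and a new target image is reconstructed by matching and randomly sampling from these clusters. Result: A new image reconstruction and animation method that uses existing image data to generate new images with dynamic effects. Conclusion: The paper presents a patch-based image reconstruction and animation method that brings still images to life through motion. Abstract: We present a patch-based image reconstruction and animation method that uses existing image data to bring still images to life through motion. Image patches from curated datasets are grouped using k-means clustering and a new target image is reconstructed by matching and randomly sampling from these clusters. This approach emphasizes reinterpretation over replication, allowing the source and target domains to differ conceptually while sharing local structures.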
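
The cluster-match-sample loop described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: patches are flattened to tuples of floats, the farthest-point initialization is a choice made here for determinism, and the names `kmeans` and `reconstruct` are hypothetical.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two flattened patches."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def init_centroids(patches, k):
    """Farthest-point initialization (an assumption; keeps the sketch deterministic)."""
    centroids = [patches[0]]
    while len(centroids) < k:
        centroids.append(
            max(patches, key=lambda p: min(sq_dist(p, c) for c in centroids))
        )
    return centroids

def kmeans(patches, k, iters=10):
    """Plain k-means; returns the centroids and the members of each cluster."""
    centroids = init_centroids(patches, k)
    clusters = []
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in patches:
            j = min(range(k), key=lambda c: sq_dist(p, centroids[c]))
            clusters[j].append(p)
        for j, members in enumerate(clusters):
            if members:  # an empty cluster keeps its previous centroid
                centroids[j] = tuple(sum(v) / len(members) for v in zip(*members))
    return centroids, clusters

def reconstruct(target_patches, centroids, clusters, rng):
    """Replace each target patch with a random member of its nearest cluster."""
    usable = [j for j, members in enumerate(clusters) if members]
    frame = []
    for t in target_patches:
        j = min(usable, key=lambda c: sq_dist(t, centroids[c]))
        frame.append(rng.choice(clusters[j]))
    return frame
```

Calling `reconstruct` repeatedly with different `random.Random` seeds yields different but structurally similar frames, which is one plausible way the random sampling could produce motion.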

[100] Improving Token-based Object Detection with Video

Abhineet Singh,Nilanjan Ray

Main category: cs.CV

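TL;DR: This paper proposes a new end-to-end method for video object detection that uses discrete token sequences and 3D box representations to address the limitations of conventional methods and improve performance.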

Details

Motivation: To address loss sparsity during training and heuristics-based postprocessing during inference in conventional detectors, and to enable more efficient video object detection and multi-object tracking. Method: Objects are represented as sequences of discrete tokens, and video objects are treated as fully integrated, indivisible 3D boxes or tracklets. Result: The new method consistently improves on the baseline static detector on several datasets and remains competitive with existing video detectors despite limited computational resources. Conclusion: By extending the Pix2Seq object detector, the paper presents an end-to-end video object detection method and demonstrates consistent improvement on several datasets. Abstract: This paper improves upon the Pix2Seq object detector by extending it for videos. In the process, it introduces a new way to perform end-to-end video object detection that improves upon existing video detectors in two key ways. First, by representing objects as variable-length sequences of discrete tokens, we can succinctly represent widely varying numbers of video objects, with diverse shapes and locations, without having to inject any localization cues in the training process. This eliminates the need to sample the space of all possible boxes that constrains conventional detectors and thus solves the dual problems of loss sparsity during training and heuristics-based postprocessing during inference. Second, it conceptualizes and outputs the video objects as fully integrated and indivisible 3D boxes or tracklets instead of generating image-specific 2D boxes and linking these boxes together to construct the video object, as done in most conventional detectors. This allows it to scale effortlessly with available computational resources by simply increasing the length of the video subsequence that the network takes as input, even generalizing to multi-object tracking if the subsequence can span the entire video. We compare our video detector with the baseline Pix2Seq static detector on several datasets and demonstrate consistent improvement, although with strong signs of being bottlenecked by our limited computational resources. We also compare it with several video detectors on UA-DETRAC to show that it is competitive with the current state of the art even with the computational bottleneck. We make our code and models publicly available.
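
To make the token representation concrete, the sketch below quantizes box coordinates into integer bins in the Pix2Seq manner and emits one token sequence per tracklet. The per-frame layout, the `na_token` used for frames where the object is absent, and the class-token offset are assumptions made for illustration, not the paper's exact vocabulary.

```python
def quantize(v, n_bins):
    """Map a normalized coordinate in [0, 1] to an integer bin in [0, n_bins - 1]."""
    v = min(max(v, 0.0), 1.0)
    return min(int(v * n_bins), n_bins - 1)

def dequantize(token, n_bins):
    """Map a bin index back to the center of its bin."""
    return (token + 0.5) / n_bins

def tracklet_to_tokens(boxes, class_id, n_bins=1000):
    """Encode one video object as a single token sequence.

    boxes: one (ymin, xmin, ymax, xmax) tuple per frame, or None when the
           object is not present in that frame.
    Hypothetical layout: 4 coordinate tokens per frame, then one class token.
    """
    na_token = n_bins                # assumed token for "absent in this frame"
    class_offset = n_bins + 1        # class tokens follow the coordinate vocabulary
    tokens = []
    for box in boxes:
        if box is None:
            tokens.extend([na_token] * 4)
        else:
            tokens.extend(quantize(c, n_bins) for c in box)
    tokens.append(class_offset + class_id)
    return tokens
```

Under this layout, a longer video subsequence only lengthens the token sequence per object, which matches the abstract's point that the approach scales by increasing the input subsequence length.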

[101] Unifying Biomedical Vision-Language Expertise: Towards a Generalist Foundation Model via Multi-CLIP Knowledge Distillation

Shansong Wang,Zhecheng Jin,Mingzhe Hu,Mojtaba Safari,Feng Zhao,Chih-Wei Chang,Richard LJ Qiu,Justin Roper,David S. Yu,Xiaofeng Yang

Main category: cs.CV

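TL;DR: This study proposes MMKD-CLIP, a biomedical foundation model built with multi-teacher knowledge distillation; by combining knowledge from multiple specialized or generalist biomedical CLIP models, it achieves strong performance and generalization with limited data.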

Details

Motivation: Transferring the success of CLIP models to biomedicine is difficult because of the scarcity of large-scale biomedical image-text corpora, the heterogeneity of image modalities, and fragmented data standards across institutions. Method: A two-stage training pipeline is used: the first stage performs CLIP-style pretraining on over 2.9 million biomedical image-text pairs; the second stage performs feature-level distillation using over 19.2 million feature pairs extracted from teacher models. Result: MMKD-CLIP consistently outperforms all teacher models on 58 diverse biomedical datasets covering six core task types, and shows notable robustness and generalization across image domains and task settings. Conclusion: Through multi-teacher knowledge distillation, MMKD-CLIP builds a high-performing biomedical foundation model under the practical constraints of data availability and generalizes robustly across tasks and domains. Abstract: CLIP models pretrained on natural images with billion-scale image-text pairs have demonstrated impressive capabilities in zero-shot classification, cross-modal retrieval, and open-ended visual answering. However, transferring this success to biomedicine is hindered by the scarcity of large-scale biomedical image-text corpora, the heterogeneity of image modalities, and fragmented data standards across institutions. These limitations hinder the development of a unified and generalizable biomedical foundation model trained from scratch. To overcome this, we introduce MMKD-CLIP, a generalist biomedical foundation model developed via Multiple Medical CLIP Knowledge Distillation. Rather than relying on billion-scale raw data, MMKD-CLIP distills knowledge from nine state-of-the-art domain-specific or generalist biomedical CLIP models, each pretrained on millions of biomedical image-text pairs. Our two-stage training pipeline first performs CLIP-style pretraining on over 2.9 million biomedical image-text pairs from 26 image modalities, followed by feature-level distillation using over 19.2 million feature pairs extracted from teacher models. We evaluate MMKD-CLIP on 58 diverse biomedical datasets, encompassing over 10.8 million biomedical images across nine image modalities. The evaluation spans six core task types: zero-shot classification, linear probing, cross-modal retrieval, visual question answering, survival prediction, and cancer diagnosis. MMKD-CLIP consistently outperforms all teacher models while demonstrating remarkable robustness and generalization across image domains and task settings.
These results underscore that multi-teacher knowledge distillation is a scalable and effective paradigm for building high-performing biomedical foundation models under the practical constraints of real-world data availability.

[102] Dual Atrous Separable Convolution for Improving Agricultural Semantic Segmentation

Chee Mei Ling,Thangarajah Akilan,Aparna Ravinda Phalke

Main category: cs.CV

TL;DR: The paper introduces an efficient image segmentation method using a novel Dual Atrous Separable Convolution module and strategic skip connections, achieving high performance with reduced computational complexity on agricultural image analysis.

Details

Motivation: Agricultural image semantic segmentation is a pivotal component of modern agriculture, facilitating accurate visual data analysis to improve crop management, optimize resource utilization, and boost overall productivity. This study proposes an efficient image segmentation method for precision agriculture, focusing on accurately delineating farmland anomalies to support informed decision-making and proactive interventions. Method: A novel Dual Atrous Separable Convolution (DAS Conv) module is integrated within the DeepLabV3-based segmentation framework. The DAS Conv module is meticulously designed to achieve an optimal balance between dilation rates and padding size. Additionally, a strategic skip connection from an optimal stage in the encoder to the decoder is incorporated to bolster the model's capacity to capture fine-grained spatial features. Result: Despite its lower computational complexity, the proposed model outperforms its baseline and achieves performance comparable to highly complex transformer-based state-of-the-art (SOTA) models on the Agriculture Vision benchmark dataset. It achieves more than 66% improvement in efficiency when considering the trade-off between model complexity and performance, compared to the SOTA model. Conclusion: This study highlights an efficient and effective solution for improving semantic segmentation in remote sensing applications, offering a computationally lightweight model capable of high-quality performance in agricultural imagery. Abstract: Agricultural image semantic segmentation is a pivotal component of modern agriculture, facilitating accurate visual data analysis to improve crop management, optimize resource utilization, and boost overall productivity. This study proposes an efficient image segmentation method for precision agriculture, focusing on accurately delineating farmland anomalies to support informed decision-making and proactive interventions. 
A novel Dual Atrous Separable Convolution (DAS Conv) module is integrated within the DeepLabV3-based segmentation framework. The DAS Conv module is meticulously designed to achieve an optimal balance between dilation rates and padding size, thereby enhancing model performance without compromising efficiency. The study also incorporates a strategic skip connection from an optimal stage in the encoder to the decoder to bolster the model's capacity to capture fine-grained spatial features. Despite its lower computational complexity, the proposed model outperforms its baseline and achieves performance comparable to highly complex transformer-based state-of-the-art (SOTA) models on the Agriculture Vision benchmark dataset. It achieves more than 66% improvement in efficiency when considering the trade-off between model complexity and performance, compared to the SOTA model. This study highlights an efficient and effective solution for improving semantic segmentation in remote sensing applications, offering a computationally lightweight model capable of high-quality performance in agricultural imagery.
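
The balance between dilation rate and padding mentioned above follows from standard convolution arithmetic, sketched here. The formulas are the usual ones for dilated and depthwise-separable convolutions; the specific dilation rates and channel counts in the usage note are illustrative assumptions, not values reported for DAS Conv.

```python
def conv_out_size(n, kernel, dilation, padding, stride=1):
    """Spatial output size of a dilated convolution along one axis."""
    effective_kernel = dilation * (kernel - 1) + 1
    return (n + 2 * padding - effective_kernel) // stride + 1

def same_padding(kernel, dilation):
    """Padding that preserves the input size for stride 1 and an odd kernel."""
    return dilation * (kernel - 1) // 2

def standard_params(c_in, c_out, kernel):
    """Weight count of a standard convolution (bias ignored)."""
    return kernel * kernel * c_in * c_out

def separable_params(c_in, c_out, kernel):
    """Weight count of a depthwise conv followed by a 1x1 pointwise conv."""
    return kernel * kernel * c_in + c_in * c_out
```

For a 3x3 kernel, `same_padding` returns the dilation rate itself, so a branch with dilation 6 needs padding 6 and a branch with dilation 12 needs padding 12 to keep feature maps aligned. With 256 input and output channels, the separable form needs 67,840 weights versus 589,824 for a standard 3x3 convolution, which illustrates where the reduced complexity can come from.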

Yijun Lin,Rhett Olson,Junhan Wu,Yao-Yi Chiang,Jerod Weinman

Main category: cs.CV

TL;DR: This paper proposes LIGHT, a novel multi-modal method for linking text on historical maps by combining linguistic, visual, and geometric features, which significantly improves performance over existing approaches.

Details

Motivation: Text on historical maps provides valuable insights but is challenging to process due to variations in orientation, shape, and placement. Existing methods struggle to link recognized text fragments effectively because they neglect geometric information. Method: LIGHT uses a geometry-aware embedding module to encode polygonal coordinates of text regions and combines geometric information with visual and linguistic embeddings from LayoutLMv3. It predicts reading-order successors using a bi-directional learning strategy. Result: Experimental results demonstrate that LIGHT outperforms existing methods on the ICDAR 2024/2025 MapText Competition data, proving the effectiveness of multi-modal learning for text linking on historical maps. Conclusion: The proposed LIGHT approach effectively addresses the challenges of linking text on historical maps by integrating linguistic, image, and geometric features, outperforming existing methods. Abstract: Text on historical maps provides valuable information for studies in history, economics, geography, and other related fields. Unlike structured or semi-structured documents, text on maps varies significantly in orientation, reading order, shape, and placement. Many modern methods can detect and transcribe text regions, but they struggle to effectively "link" the recognized text fragments, e.g., determining a multi-word place name. Existing layout analysis methods model word relationships to improve text understanding in structured documents, but they primarily rely on linguistic features and neglect geometric information, which is essential for handling map text. To address these challenges, we propose LIGHT, a novel multi-modal approach that integrates linguistic, image, and geometric features for linking text on historical maps.
In particular, LIGHT includes a geometry-aware embedding module that encodes the polygonal coordinates of text regions to capture polygon shapes and their relative spatial positions on an image. LIGHT unifies this geometric information with the visual and linguistic token embeddings from LayoutLMv3, a pretrained layout analysis model. LIGHT uses the cross-modal information to predict the reading-order successor of each text instance directly with a bi-directional learning strategy that enhances sequence robustness. Experimental results show that LIGHT outperforms existing methods on the ICDAR 2024/2025 MapText Competition data, demonstrating the effectiveness of multi-modal learning for historical map text linking.
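
Once a model has predicted a reading-order successor for each text instance, multi-word phrases can be recovered by following the successor chains. The sketch below shows only that post-processing step under assumptions: `successor[i]` holds the index of the next word or `None` at the end of a phrase, and the function name `link_phrases` is hypothetical. It does not reproduce LIGHT's embeddings or training.

```python
def link_phrases(words, successor):
    """Group words into phrases by following predicted successor links.

    words:     list of transcribed text instances
    successor: successor[i] is the index of the word that follows word i,
               or None if word i ends its phrase
    """
    # A word that is some other word's successor cannot start a phrase.
    has_predecessor = {s for i, s in enumerate(successor) if s is not None and s != i}
    phrases = []
    for start in range(len(words)):
        if start in has_predecessor:
            continue
        chain, seen, i = [], set(), start
        while i is not None and i not in seen:  # 'seen' guards against loops
            seen.add(i)
            chain.append(words[i])
            i = successor[i]
        phrases.append(" ".join(chain))
    return phrases
```

Words that sit only inside a predicted cycle have no starting point and are dropped by this sketch; a real pipeline would need a rule for breaking such cycles.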

[104] BrainMT: A Hybrid Mamba-Transformer Architecture for Modeling Long-Range Dependencies in Functional MRI Data

Arunkumar Kannan,Martin A. Lindquist,Brian Caffo

Main category: cs.CV

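TL;DR: This paper proposes BrainMT, a new hybrid framework that combines the strengths of Mamba blocks and Transformer blocks to address the shortcomings of existing deep learning models in fMRI data analysis, achieving leading performance on several tasks.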

Details

Motivation: Existing methods based on convolutional neural networks or Transformer architectures are limited in capturing long-range spatial and temporal dependencies in fMRI data, which motivates BrainMT. Method: The method uses a two-stage design: (1) a bidirectional Mamba block with a temporal-first scanning mechanism to capture global temporal interactions; (2) a Transformer block that uses self-attention to model spatial relationships. Result: Experiments on two large-scale public datasets, UKBioBank and the Human Connectome Project, show that BrainMT significantly outperforms existing methods on both classification (sex prediction) and regression (cognitive intelligence prediction) tasks. Conclusion: BrainMT is a novel hybrid framework that efficiently learns and integrates long-range spatiotemporal attributes in fMRI data and achieves state-of-the-art performance. Abstract: Recent advances in deep learning have made it possible to predict phenotypic measures directly from functional magnetic resonance imaging (fMRI) brain volumes, sparking significant interest in the neuroimaging community. However, existing approaches, primarily based on convolutional neural networks or transformer architectures, often struggle to model the complex relationships inherent in fMRI data, limited by their inability to capture long-range spatial and temporal dependencies. To overcome these shortcomings, we introduce BrainMT, a novel hybrid framework designed to efficiently learn and integrate long-range spatiotemporal attributes in fMRI data. Our framework operates in two stages: (1) a bidirectional Mamba block with a temporal-first scanning mechanism to capture global temporal interactions in a computationally efficient manner; and (2) a transformer block leveraging self-attention to model global spatial relationships across the deep features processed by the Mamba block. Extensive experiments on two large-scale public datasets, UKBioBank and the Human Connectome Project, demonstrate that BrainMT achieves state-of-the-art performance on both classification (sex prediction) and regression (cognitive intelligence prediction) tasks, outperforming existing methods by a significant margin. Our code and implementation details will be made publicly available at https://github.com/arunkumar-kannan/BrainMT-fMRI

[105] Seg-R1: Segmentation Can Be Surprisingly Simple with Reinforcement Learning

Zuyao You,Zuxuan Wu

Main category: cs.CV

TL;DR: This paper introduces Seg-R1, an RL-based approach that enhances the pixel-level understanding of large multimodal models, achieving strong performance in segmentation tasks without text supervision.

Details

Motivation: The motivation is to improve the pixel-level reasoning capabilities of large multimodal models using reinforcement learning, aiming to achieve better performance in segmentation tasks while minimizing dependency on text-supervised data. Method: The paper introduces Seg-R1, which uses reinforcement learning (RL) and a technique called Group Relative Policy Optimization (GRPO) to train a large multimodal model (LMM) for segmentation tasks. The LMM generates prompts like points and bounding boxes in a next-token fashion to guide SAM2 in producing segmentation masks. Result: Seg-R1 achieved remarkable results with purely RL-based training, including a .873 S-measure on COD10K and impressive zero-shot performance on referring segmentation (71.4 cIoU on RefCOCOg test) and reasoning segmentation tasks (56.7 gIoU on ReasonSeg test), outperforming fully supervised models. Conclusion: Seg-R1 demonstrates that reinforcement learning can effectively enhance pixel-level understanding in large multimodal models, achieving strong performance and open-world generalization without complex model modifications or text supervision. Abstract: We present Seg-R1, a preliminary exploration of using reinforcement learning (RL) to enhance the pixel-level understanding and reasoning capabilities of large multimodal models (LMMs). Starting with foreground segmentation tasks, specifically camouflaged object detection (COD) and salient object detection (SOD), our approach enables the LMM to generate point and bounding box prompts in the next-token fashion, which are then used to guide SAM2 in producing segmentation masks. We introduce Group Relative Policy Optimization (GRPO) into the segmentation domain, equipping the LMM with pixel-level comprehension through a carefully designed training strategy. Notably, Seg-R1 achieves remarkable performance with purely RL-based training, achieving .873 S-measure on COD10K without complex model modification. 
Moreover, we found that pure RL training demonstrates strong open-world generalization. Despite being trained solely on foreground segmentation image-mask pairs without text supervision, Seg-R1 achieves impressive zero-shot performance on referring segmentation and reasoning segmentation tasks, with 71.4 cIoU on RefCOCOg test and 56.7 gIoU on ReasonSeg test, outperforming models fully supervised on these datasets.
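
The group-relative advantage at the core of GRPO can be sketched in a few lines: several responses are sampled for the same input, each is scored, and every reward is normalized against the mean and standard deviation of its own group. The mask-IoU reward used below is an assumption for illustration; the reward design actually used for Seg-R1 is not described here, and the function names are hypothetical.

```python
import statistics

def iou(pred, gt):
    """Intersection over union of two masks given as sets of pixel coordinates."""
    union = len(pred | gt)
    return len(pred & gt) / union if union else 1.0

def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward against its own sampling group (GRPO-style)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]
```

Because the baseline is the group mean, no separate value network is needed: responses that segment better than their siblings receive positive advantages and the rest receive negative ones. When every response in a group earns the same reward, all advantages are zero and that group contributes no gradient.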

[106] ReCo: Reminder Composition Mitigates Hallucinations in Vision-Language Models

Sotirios Panagiotis Chytas,Miso Choi,Hyunwoo J. Kim,Vikas Singh

Main category: cs.CV

TL;DR: This paper introduces ReCo, a lightweight module based on geometric algebra and relational compositions, which effectively reduces the 'fading memory effect' in Vision Language Models (VLMs), thereby enhancing their performance and reducing hallucinations.

Details

Motivation: Vision Language Models (VLMs) often suffer from hallucinations due to an over-reliance on language and a 'fading memory effect' with respect to visual input. This research aims to explore mechanisms to control this behavior and improve the reliability of VLMs. Method: Drawing from geometric algebra and relational compositions, the researchers developed a small, trainable module called ReCo, which is added on top of existing VLMs without requiring any other modifications to the models. Result: The ReCo module was successfully applied to three widely used VLMs (InstructBLIP, LLaVA, and MiniGPT4), showing notable performance improvements on various benchmarks. It also enhanced the effectiveness of other hallucination-reduction approaches when combined with them. Conclusion: The study concludes that the proposed ReCo module effectively mitigates the 'fading memory effect' in Vision Language Models (VLMs), improving their performance on multiple benchmarks and complementing other hallucination-reduction techniques. Abstract: Vision Language Models (VLMs) show impressive capabilities in integrating and reasoning with both visual and language data. But these models make mistakes. A common finding -- similar to LLMs -- is their tendency to hallucinate, i.e., generate plausible sounding text which is not grounded in the visual input, or at worst, is contradictory. A growing consensus attributes this behavior to an over-reliance on language -- especially as the generation progresses, the model suffers from a "fading memory effect" with respect to the provided visual input. We study mechanisms by which this behavior can be controlled. Specifically, using ideas from geometric algebra and relational compositions, we propose the addition of a small, trainable module (named ReCo) on top of any VLM -- no other modification is needed.
We show that such a lightweight module is able to mitigate the fading memory effect on three of the most widely used VLMs (InstructBLIP, LLaVA, MiniGPT4), where we see performance improvements on multiple benchmarks. Additionally, we show that our module can be combined with many of the other approaches for reducing hallucination where we achieve improved results for each one.

[107] CaO$_2$: Rectifying Inconsistencies in Diffusion-Based Dataset Distillation

Haoxuan Wang,Zhenghao Zhao,Junyi Wu,Yuzhang Shang,Gaowen Liu,Yan Yan

Main category: cs.CV

TL;DR: This paper proposes CaO$_2$, a novel diffusion-based dataset distillation method that resolves objective and condition inconsistencies, leading to improved accuracy and alignment with evaluation objectives.

Details

Motivation: Diffusion models have shown potential in dataset distillation by creating compact surrogate datasets more efficiently than traditional methods. However, current approaches suffer from objective and condition inconsistencies, which degrade performance. Method: The paper introduces CaO$_2$, a two-stage diffusion-based framework. The first stage uses a probability-informed sample selection pipeline, while the second stage refines latent representations to improve conditional likelihood. Result: CaO$_2$ achieves state-of-the-art performance on ImageNet and its subsets, surpassing the best-performing baselines by an average of 2.3% accuracy. Conclusion: CaO$_2$ addresses the issues of objective and condition inconsistency in diffusion-based dataset distillation, aligning the distillation process with the evaluation objective and achieving superior performance. Abstract: The recent introduction of diffusion models in dataset distillation has shown promising potential in creating compact surrogate datasets for large, high-resolution target datasets, offering improved efficiency and performance over traditional bi-level/uni-level optimization methods. However, current diffusion-based dataset distillation approaches overlook the evaluation process and exhibit two critical inconsistencies in the distillation process: (1) Objective Inconsistency, where the distillation process diverges from the evaluation objective, and (2) Condition Inconsistency, leading to mismatches between generated images and their corresponding conditions. To resolve these issues, we introduce Condition-aware Optimization with Objective-guided Sampling (CaO$_2$), a two-stage diffusion-based framework that aligns the distillation process with the evaluation objective. The first stage employs a probability-informed sample selection pipeline, while the second stage refines the corresponding latent representations to improve conditional likelihood. 
CaO$_2$ achieves state-of-the-art performance on ImageNet and its subsets, surpassing the best-performing baselines by an average of 2.3% accuracy.

[108] 3D Shape Generation: A Survey

Nicolas Caytuiro,Ivan Sipiran

Main category: cs.CV

TL;DR: This survey covers recent developments in 3D shape generation using deep learning, discussing shape representations, modeling approaches, evaluation methods, and future research directions.

Details

Motivation: Recent advances in deep learning have transformed 3D shape generation, necessitating a comprehensive overview of the state of the art to guide researchers and practitioners. Method: The paper organizes the discussion around three core components: shape representations, generative modeling approaches, and evaluation protocols. It categorizes 3D representations into explicit, implicit, and hybrid setups and reviews various generation methods. Result: A systematic review of current techniques and trends in 3D shape generation is provided, including an analysis of shape representations, generative models, datasets, and evaluation metrics. Conclusion: The survey aims to serve as a reference for researchers and practitioners by identifying open challenges and outlining future research directions in 3D shape generation. Abstract: Recent advances in deep learning have significantly transformed the field of 3D shape generation, enabling the synthesis of complex, diverse, and semantically meaningful 3D objects. This survey provides a comprehensive overview of the current state of the art in 3D shape generation, organizing the discussion around three core components: shape representations, generative modeling approaches, and evaluation protocols. We begin by categorizing 3D representations into explicit, implicit, and hybrid setups, highlighting their structural properties, advantages, and limitations. Next, we review a wide range of generation methods, focusing on feedforward architectures. We further summarize commonly used datasets and evaluation metrics that assess fidelity, diversity, and realism of generated shapes. Finally, we identify open challenges and outline future research directions that could drive progress in controllable, efficient, and high-quality 3D shape generation. This survey aims to serve as a valuable reference for researchers and practitioners seeking a structured and in-depth understanding of this rapidly evolving field.

Jiang Yuan,JI Ma,Bo Wang,Guanzhou Ke,Weiming Hu

Main category: cs.CV

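TL;DR: This paper proposes LightBSR, a lightweight blind super-resolution model that achieves efficient image restoration by optimizing the discriminability of the implicit degradation representation.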

Details

Motivation: Existing IDE-BSR methods overlook the importance of IDR discriminability, which leads to increased model complexity. Optimizing IDR discriminability is therefore needed to improve efficiency. Method: LightBSR, a lightweight model trained with knowledge distillation, is proposed; it comprises degradation-prior-constrained contrastive learning and a feature alignment technique. Result: LightBSR achieves outstanding performance across a range of blind SR tasks while keeping computational complexity minimal. Conclusion: By optimizing IDR discriminability, LightBSR delivers lightweight and efficient blind super-resolution and performs well across a variety of tasks. Abstract: Implicit degradation estimation-based blind super-resolution (IDE-BSR) hinges on extracting the implicit degradation representation (IDR) of the LR image and adapting it to LR image features to guide HR detail restoration. Although IDE-BSR has shown potential in dealing with noise interference and complex degradations, existing methods ignore the importance of IDR discriminability for BSR and instead over-complicate the adaptation process to improve effect, resulting in a significant increase in the model's parameters and computations. In this paper, we focus on the discriminability optimization of IDR and propose a new powerful and lightweight BSR model termed LightBSR. Specifically, we employ a knowledge distillation-based learning framework. We first introduce a well-designed degradation-prior-constrained contrastive learning technique during teacher stage to make the model more focused on distinguishing different degradation types. Then we utilize a feature alignment technique to transfer the degradation-related knowledge acquired by the teacher to the student for practical inferencing. Extensive experiments demonstrate the effectiveness of IDR discriminability-driven BSR model design. The proposed LightBSR can achieve outstanding performance with minimal complexity across a range of blind SR tasks. Our code is accessible at: https://github.com/MJ-NCEPU/LightBSR.

[110] Part Segmentation and Motion Estimation for Articulated Objects with Dynamic 3D Gaussians

Jun-Jee Chao,Qingyuan Jiang,Volkan Isler

Main category: cs.CV

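TL;DR: This paper proposes a new method based on a 3D Gaussian representation for part segmentation and motion estimation of articulated objects in challenging scenarios; it is more robust than conventional point-correspondence methods and performs better.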

Details

Motivation: Conventional methods that rely on point correspondences perform poorly under occlusion or on data collected by multiple asynchronous sensors, so a more robust method is needed. Method: The object is represented as a collection of building blocks modeled as 3D Gaussians, parameterized with time-dependent rotations, translations, and scales that are shared across all time steps; part segmentation and motion estimation are achieved by building correspondences between observed points and the Gaussians. Result: Experiments show that the method is more robust than existing methods under occlusion, and its part segmentation performance exceeds the state-of-the-art method by 13%. Conclusion: The paper presents a new method based on a 3D Gaussian representation for part segmentation and motion estimation in articulated object motion analysis, and it outperforms existing point-correspondence methods when handling occlusion and asynchronous sensor sampling. Abstract: Part segmentation and motion estimation are two fundamental problems for articulated object motion analysis. In this paper, we present a method to solve these two problems jointly from a sequence of observed point clouds of a single articulated object. The main challenge in our problem setting is that the point clouds are not assumed to be generated by a fixed set of moving points. Instead, each point cloud in the sequence could be an arbitrary sampling of the object surface at that particular time step. Such scenarios occur when the object undergoes major occlusions, or if the dataset is collected using measurements from multiple sensors asynchronously. In these scenarios, methods that rely on tracking point correspondences are not appropriate. We present an alternative approach based on a compact but effective representation where we represent the object as a collection of simple building blocks modeled as 3D Gaussians. We parameterize the Gaussians with time-dependent rotations, translations, and scales that are shared across all time steps. With our representation, part segmentation can be achieved by building correspondences between the observed points and the Gaussians. Moreover, the transformation of each point across time can be obtained by following the poses of the assigned Gaussian (even when the point is not observed). Experiments show that our method outperforms existing methods that solely rely on finding point correspondences. Additionally, we extend existing datasets to emulate real-world scenarios by considering viewpoint occlusions.
We further demonstrate that our method is more robust to missing points as compared to existing approaches on these challenging datasets, even when some parts are completely occluded in some time-steps. Notably, our part segmentation performance outperforms the state-of-the-art method by 13% on point clouds with occlusions.
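
The two uses of the Gaussian representation, assigning points to parts and carrying a point across time, can be sketched as below. To stay short, the sketch works in 2D with a single rotation angle per Gaussian, whereas the paper uses 3D Gaussians; it also reads the abstract as time-dependent poses with scales fixed over time. All names are hypothetical and no fitting procedure is shown.

```python
import math

def to_local(p, center, angle):
    """Express a world point in a Gaussian's local frame: R(-angle) (p - center)."""
    dx, dy = p[0] - center[0], p[1] - center[1]
    c, s = math.cos(angle), math.sin(angle)
    return (c * dx + s * dy, -s * dx + c * dy)

def to_world(q, center, angle):
    """Map a local point back to world coordinates: center + R(angle) q."""
    c, s = math.cos(angle), math.sin(angle)
    return (center[0] + c * q[0] - s * q[1], center[1] + s * q[0] + c * q[1])

def assign_part(p, poses, scales):
    """Part label = index of the Gaussian with the smallest Mahalanobis distance."""
    def mahalanobis_sq(g):
        q = to_local(p, *poses[g])
        return sum((v / s) ** 2 for v, s in zip(q, scales[g]))
    return min(range(len(poses)), key=mahalanobis_sq)

def follow(p, pose_from, pose_to):
    """Move a point rigidly with its assigned Gaussian between two time steps."""
    return to_world(to_local(p, *pose_from), *pose_to)
```

Because a point's motion is read off the pose of its assigned Gaussian, its position at another time step can be predicted even if no observation of it exists there, which is the property the abstract relies on for occluded parts.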

[111] Deterministic Object Pose Confidence Region Estimation

Jinghao Wang,Zhang Li,Zi Wang,Banglei Guan,Yang Shang,Qifeng Yu

Main category: cs.CV

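TL;DR: This paper proposes a deterministic and efficient method for estimating 6D pose confidence regions: it calibrates Gaussian keypoint distributions with inductive conformal prediction and uses the implicit function theorem to propagate the resulting keypoint confidence regions to 6D pose confidence regions.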

Details

Motivation: Existing sampling-based methods have severe limitations in practical deployment: (1) sampling speed drops significantly as the number of samples increases; (2) the resulting confidence regions are often excessively large. Method: The method has two steps: first, inductive conformal prediction calibrates the deterministically regressed Gaussian keypoint distributions into 2D keypoint confidence regions; then the implicit function theorem propagates these keypoint confidence regions directly into 6D pose confidence regions. Result: Experiments show that the method achieves higher pose estimation accuracy with reduced computational time and, for the same coverage rate, yields significantly smaller confidence region volumes, reducing them by up to 99.9% for rotations and 99.8% for translations. Conclusion: The method avoids the inefficiency and inflated region sizes associated with sampling and ensembling, and provides compact confidence regions that cover the ground-truth poses at a user-defined confidence level. Abstract: 6D pose confidence region estimation has emerged as a critical direction, aiming to perform uncertainty quantification for assessing the reliability of estimated poses. However, current sampling-based approaches suffer from critical limitations that severely impede their practical deployment: 1) the sampling speed significantly decreases as the number of samples increases. 2) the derived confidence regions are often excessively large. To address these challenges, we propose a deterministic and efficient method for estimating pose confidence regions. Our approach uses inductive conformal prediction to calibrate the deterministically regressed Gaussian keypoint distributions into 2D keypoint confidence regions. We then leverage the implicit function theorem to propagate these keypoint confidence regions directly into 6D pose confidence regions. This method avoids the inefficiency and inflated region sizes associated with sampling and ensembling. It provides compact confidence regions that cover the ground-truth poses with a user-defined confidence level. Experimental results on the LineMOD Occlusion and SPEED datasets show that our method achieves higher pose estimation accuracy with reduced computational time. For the same coverage rate, our method yields significantly smaller confidence region volumes, reducing them by up to 99.9% for rotations and 99.8% for translations. The code will be available soon.
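
The calibration step can be illustrated with standard inductive (split) conformal prediction. The sketch assumes axis-aligned 2D Gaussians and a Mahalanobis-distance nonconformity score; both are simplifications chosen here, and the propagation to 6D pose via the implicit function theorem is not shown.

```python
import math

def score(mean, std, point):
    """Nonconformity score: Mahalanobis distance under an axis-aligned Gaussian."""
    return math.sqrt(sum(((x - m) / s) ** 2 for x, m, s in zip(point, mean, std)))

def conformal_quantile(cal_scores, alpha):
    """Threshold giving at least 1 - alpha coverage on exchangeable data.

    Uses the ceil((n + 1) * (1 - alpha))-th smallest calibration score;
    returns infinity when the calibration set is too small for that rank.
    """
    n = len(cal_scores)
    rank = math.ceil((n + 1) * (1 - alpha))
    if rank > n:
        return math.inf
    return sorted(cal_scores)[rank - 1]

def in_region(mean, std, point, q):
    """A point lies in the calibrated region if its score is at most q."""
    return score(mean, std, point) <= q
```

In use, `cal_scores` would be computed on held-out images as the score of each ground-truth keypoint under its predicted Gaussian; the resulting `q` then scales every predicted ellipse at test time, so the region size adapts to the predicted uncertainty while the coverage level is set by `alpha`.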

[112] XTransfer: Cross-Modality Model Transfer for Human Sensing with Few Data at the Edge

Yu Zhang,Xi Zhang,Hualin zhou,Xinyuan Chen,Shang Gao,Hong Jia,Jianfei Yang,Yuankai Qi,Tao Gu

Main category: cs.CV

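TL;DR: XTransfer is a deep learning method for efficient cross-modality model transfer; through model repairing and layer recombining, it achieves state-of-the-art performance on human sensing tasks while reducing the costs of sensor data collection, model training, and edge deployment.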

Details

Motivation: Because edge systems have limited sensor data and constrained resources, existing methods that rely on transferring pre-trained models suffer from modality shift and high resource demands, leading to accuracy loss, resource overhead, and poor adaptability across sensing applications. Method: XTransfer, a resource-efficient, modality-agnostic model transfer method, is proposed. It transfers knowledge through (i) model repairing, which safely repairs modality shift in pre-trained model layers using only a small amount of sensor data, and (ii) layer recombining, which searches for and recombines relevant layers from source models in a layer-wise manner to create compact models. Result: Comprehensive experiments show that XTransfer achieves state-of-the-art performance on human sensing tasks and significantly reduces the costs of sensor data collection, model training, and edge deployment. Conclusion: XTransfer is an effective deep learning method that overcomes the obstacles to model training and development caused by limited sensor data and resource constraints in edge systems. Abstract: Deep learning for human sensing on edge systems offers significant opportunities for smart applications. However, its training and development are hindered by the limited availability of sensor data and resource constraints of edge systems. Current methods that rely on transferring pre-trained models often encounter issues such as modality shift and high resource demands, resulting in substantial accuracy loss, resource overhead, and poor adaptability across different sensing applications. In this paper, we propose XTransfer, a first-of-its-kind method for resource-efficient, modality-agnostic model transfer. XTransfer freely leverages single or multiple pre-trained models and transfers knowledge across different modalities by (i) model repairing that safely repairs modality shift in pre-trained model layers with only few sensor data, and (ii) layer recombining that efficiently searches and recombines layers of interest from source models in a layer-wise manner to create compact models. We benchmark various baselines across diverse human sensing datasets spanning different modalities. Comprehensive results demonstrate that XTransfer achieves state-of-the-art performance on human sensing tasks while significantly reducing the costs of sensor data collection, model training, and edge deployment.

Dayong Su,Yafei Zhang,Huafeng Li,Jinxing Li,Yu Liu

Main category: cs.CV

TL;DR: UniFuse is a general medical image fusion framework that simultaneously handles alignment, restoration, and fusion using a degradation-aware, all-in-one approach, outperforming existing methods.

Details

Motivation: Current multimodal medical image fusion methods rely on high-quality, pixel-level aligned images, which limits their effectiveness when handling misaligned or degraded medical images. Method: UniFuse incorporates an Omni Unified Feature Representation scheme using Spatial Mamba to encode multi-directional features, a degradation-aware prompt learning module for cross-modal alignment and restoration, and a Universal Feature Restoration & Fusion module with an Adaptive LoRA Synergistic Network (ALSN). Result: The proposed UniFuse framework enables joint optimization of alignment, restoration, and fusion in a single-stage All-in-One configuration, showing superior performance compared to staged approaches. Conclusion: UniFuse unifies alignment, restoration, and fusion within a single framework, demonstrating effectiveness and significant advantages over existing approaches across multiple datasets. Abstract: Current multimodal medical image fusion typically assumes that source images are of high quality and perfectly aligned at the pixel level. Its effectiveness heavily relies on these conditions and often deteriorates when handling misaligned or degraded medical images. To address this, we propose UniFuse, a general fusion framework. By embedding a degradation-aware prompt learning module, UniFuse seamlessly integrates multi-directional information from input images and correlates cross-modal alignment with restoration, enabling joint optimization of both tasks within a unified framework. Additionally, we design an Omni Unified Feature Representation scheme, which leverages Spatial Mamba to encode multi-directional features and mitigate modality differences in feature alignment. To enable simultaneous restoration and fusion within an All-in-One configuration, we propose a Universal Feature Restoration & Fusion module, incorporating the Adaptive LoRA Synergistic Network (ALSN) based on LoRA principles. By leveraging ALSN's adaptive feature representation along with degradation-type guidance, we enable joint restoration and fusion within a single-stage framework. Compared to staged approaches, UniFuse unifies alignment, restoration, and fusion within a single framework. Experimental results across multiple datasets demonstrate the method's effectiveness and significant advantages over existing approaches.

[114] Deep Learning based Joint Geometry and Attribute Up-sampling for Large-Scale Colored Point Clouds

Yun Zhang,Feifan Chen,Na Li,Zhiwei Guo,Xu Wang,Fen Miao,Sam Kwong

Main category: cs.CV

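TL;DR: This paper proposes JGAU, a deep learning method for joint geometry and attribute up-sampling of colored point clouds; with a newly built large-scale dataset, SYSU-PCUD, and a new network design, it significantly improves the quality of up-sampled point clouds.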

Details

Motivation: Generating large-scale, dense colored point clouds requires a method that models geometry and attribute features together and exploits spatial attribute correlations. Method: The paper proposes a deep learning framework for Joint Geometry and Attribute Up-sampling (JGAU), comprising a geometry up-sampling network, an attribute up-sampling network, and an attribute enhancement module, and introduces two coarse attribute up-sampling methods (GDWAI and DLAI). Result: At up-sampling rates of 4x, 8x, 12x, and 16x, JGAU achieves PSNRs of 33.90 dB, 32.10 dB, 31.10 dB, and 30.39 dB, with average PSNR gains of 2.32 dB, 2.47 dB, 2.28 dB, and 2.11 dB, respectively. Conclusion: JGAU performs well on colored point cloud super-resolution, achieving notable PSNR gains over existing methods at every up-sampling rate tested. Abstract: Colored point cloud, which includes geometry and attribute components, is a mainstream representation enabling realistic and immersive 3D applications. To generate large-scale and denser colored point clouds, we propose a deep learning-based Joint Geometry and Attribute Up-sampling (JGAU) method that learns to model both geometry and attribute patterns while leveraging spatial attribute correlations. First, we establish and release a large-scale dataset for colored point cloud up-sampling called SYSU-PCUD, containing 121 large-scale colored point clouds with diverse geometry and attribute complexities across six categories and four sampling rates. Second, to improve the quality of up-sampled point clouds, we propose a deep learning-based JGAU framework that jointly up-samples geometry and attributes. It consists of a geometry up-sampling network and an attribute up-sampling network, where the latter leverages the up-sampled auxiliary geometry to model neighborhood correlations of the attributes. Third, we propose two coarse attribute up-sampling methods, Geometric Distance Weighted Attribute Interpolation (GDWAI) and Deep Learning-based Attribute Interpolation (DLAI), to generate coarse up-sampled attributes for each point. Then, an attribute enhancement module is introduced to refine these up-sampled attributes and produce high-quality point clouds by further exploiting intrinsic attribute and geometry patterns. Extensive experiments show that the Peak Signal-to-Noise Ratio (PSNR) achieved by the proposed JGAU method is 33.90 decibels, 32.10 decibels, 31.10 decibels, and 30.39 decibels for up-sampling rates of 4 times, 8 times, 12 times, and 16 times, respectively. Compared to state-of-the-art methods, JGAU achieves average PSNR gains of 2.32 decibels, 2.47 decibels, 2.28 decibels, and 2.11 decibels at these four up-sampling rates, demonstrating significant improvement.
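
For readers unfamiliar with the metric quoted in this entry, the sketch below computes PSNR from the mean squared error between reference and up-sampled attribute values. It assumes 8-bit attributes (peak value 255) and a one-to-one correspondence between points, neither of which is stated in the abstract; the function name `psnr` is illustrative, and the paper's actual evaluation protocol may differ.

```python
import math


def psnr(reference, reconstructed, peak=255.0):
    """Peak Signal-to-Noise Ratio in dB between two equal-length value lists.

    Assumption: values are matched one-to-one (e.g. one colour channel per
    point) and `peak` is the maximum possible value (255 for 8-bit data).
    """
    if len(reference) != len(reconstructed) or not reference:
        raise ValueError("inputs must be non-empty and of equal length")
    # Mean squared error over all matched values.
    mse = sum((r - x) ** 2 for r, x in zip(reference, reconstructed)) / len(reference)
    if mse == 0:
        return math.inf  # identical signals: PSNR is unbounded
    return 10.0 * math.log10(peak ** 2 / mse)


if __name__ == "__main__":
    # An error of +/-10 on every value gives MSE = 100.
    print(round(psnr([100, 100], [110, 90]), 2))
```

Because PSNR is logarithmic, a gain of about 2 dB, as reported above, corresponds to a reduction in mean squared error of roughly 37%.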

[115] Degradation-Modeled Multipath Diffusion for Tunable Metalens Photography

Jianing Zhang,Jiayi Zhu,Feiyu Ji,Xiaokang Yang,Xiaoyun Yuan

Main category: cs.CV

TL;DR: This paper proposes a novel diffusion-based framework for metalens photography that addresses optical degradation challenges using pretrained models and adaptive attention mechanisms, achieving superior image reconstruction without large datasets.

Details

Motivation: Metalenses have potential for compact imaging systems but face challenges due to optical degradation and restoration difficulties. Existing methods require precise calibration or large datasets, which are impractical in real-world scenarios. This work aims to overcome these limitations using natural image priors from pretrained models. Method: The method introduces a multipath diffusion framework with positive, neutral, and negative-prompt paths, combined with pseudo data augmentation and a spatially varying degradation-aware attention (SVDA) module to model complex degradations. A tunable decoder enables flexible trade-offs between fidelity and perceptual quality. Result: Extensive experiments show that the proposed method outperforms state-of-the-art approaches in terms of high-fidelity and sharp image reconstruction for metalens-based systems. Conclusion: The proposed Degradation-Modeled Multipath Diffusion framework demonstrates superior performance for metalens photography by effectively balancing detail generation, structural fidelity, and degradation suppression without reliance on large datasets. Abstract: Metalenses offer significant potential for ultra-compact computational imaging but face challenges from complex optical degradation and computational restoration difficulties. Existing methods typically rely on precise optical calibration or massive paired datasets, which are non-trivial for real-world imaging systems. Furthermore, a lack of control over the inference process often results in undesirable hallucinated artifacts. We introduce Degradation-Modeled Multipath Diffusion for tunable metalens photography, leveraging powerful natural image priors from pretrained models instead of large datasets. Our framework uses positive, neutral, and negative-prompt paths to balance high-frequency detail generation, structural fidelity, and suppression of metalens-specific degradation, alongside pseudo data augmentation. A tunable decoder enables controlled trade-offs between fidelity and perceptual quality. Additionally, a spatially varying degradation-aware attention (SVDA) module adaptively models complex optical and sensor-induced degradation. Finally, we design and build a millimeter-scale MetaCamera for real-world validation. Extensive results show that our approach outperforms state-of-the-art methods, achieving high-fidelity and sharp image reconstruction. More materials: https://dmdiff.github.io/.

[116] RoboPearls: Editable Video Simulation for Robot Manipulation

Tao Tang,Likui Zhang,Youpeng Wen,Kaidong Zhang,Jia-Wang Bian,Xia Zhou,Tianyi Yan,Kun Zhan,Peng Jia,Hefeng Wu,Liang Lin,Xiaodan Liang

Main category: cs.CV

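TL;DR: RoboPearls is a video-simulation-based framework for robot manipulation that uses advanced AI techniques to enable efficient, realistic simulation-based training.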

Details

Motivation: The high cost and inefficiency of collecting real-world demonstration data limit the development of robot manipulation policies; RoboPearls addresses this through video simulation. Method: A simulation framework built on 3D Gaussian Splatting (3DGS) that combines Incremental Semantic Distillation (ISD), a 3D regularized NNFM Loss (3D-NNFM), large language models (LLMs), and a vision-language model (VLM). Result: RoboPearls shows good simulation performance on RLBench, COLOSSEUM, Ego4D, Open X-Embodiment, and real-robot experiments. Conclusion: RoboPearls is an efficient simulation framework for robot manipulation that produces realistic simulations from demonstration videos using advanced technical modules, and its effectiveness is validated on multiple datasets. Abstract: The development of generalist robot manipulation policies has seen significant progress, driven by large-scale demonstration data across diverse environments. However, the high cost and inefficiency of collecting real-world demonstrations hinder the scalability of data acquisition. While existing simulation platforms enable controlled environments for robotic learning, the challenge of bridging the sim-to-real gap remains. To address these challenges, we propose RoboPearls, an editable video simulation framework for robotic manipulation. Built on 3D Gaussian Splatting (3DGS), RoboPearls enables the construction of photo-realistic, view-consistent simulations from demonstration videos, and supports a wide range of simulation operators, including various object manipulations, powered by advanced modules like Incremental Semantic Distillation (ISD) and 3D regularized NNFM Loss (3D-NNFM). Moreover, by incorporating large language models (LLMs), RoboPearls automates the simulation production process in a user-friendly manner through flexible command interpretation and execution. Furthermore, RoboPearls employs a vision-language model (VLM) to analyze robotic learning issues to close the simulation loop for performance enhancement. To demonstrate the effectiveness of RoboPearls, we conduct extensive experiments on multiple datasets and scenes, including RLBench, COLOSSEUM, Ego4D, Open X-Embodiment, and a real-world robot, which demonstrate our satisfactory simulation performance.

[117] VSRM: A Robust Mamba-Based Framework for Video Super-Resolution

Dinh Phu Tran,Dao Duy Hung,Daeyoung Kim

Main category: cs.CV

TL;DR: This paper proposes VSRM, a highly effective video super-resolution framework using Mamba-based architecture, outperforming existing methods.

Details

Motivation: Existing methods like CNNs and Transformers have limitations in handling long sequences for video super-resolution, prompting the need for an efficient solution with large receptive fields and linear complexity. Method: VSRM uses Spatial-to-Temporal and Temporal-to-Spatial Mamba blocks, a Deformable Cross-Mamba Alignment module, and a Frequency Charbonnier-like loss to enhance video super-resolution. Result: VSRM achieves state-of-the-art results on diverse benchmarks for video super-resolution. Conclusion: VSRM is a novel framework for video super-resolution that achieves state-of-the-art results by leveraging Mamba's capabilities, offering a foundation for future research. Abstract: Video super-resolution remains a major challenge in low-level vision tasks. To date, CNN- and Transformer-based methods have delivered impressive results. However, CNNs are limited by local receptive fields, while Transformers struggle with quadratic complexity, posing challenges for processing long sequences in VSR. Recently, Mamba has drawn attention for its long-sequence modeling, linear complexity, and large receptive fields. In this work, we propose VSRM, a novel Video Super-Resolution framework that leverages the power of Mamba. VSRM introduces Spatial-to-Temporal Mamba and Temporal-to-Spatial Mamba blocks to extract long-range spatio-temporal features and enhance receptive fields efficiently. To better align adjacent frames, we propose a Deformable Cross-Mamba Alignment module. This module utilizes a deformable cross-mamba mechanism to make the compensation stage more dynamic and flexible, preventing feature distortions. Finally, we minimize the frequency domain gaps between reconstructed and ground-truth frames by proposing a simple yet effective Frequency Charbonnier-like loss that better preserves high-frequency content and enhances visual quality. Through extensive experiments, VSRM achieves state-of-the-art results on diverse benchmarks, establishing itself as a solid foundation for future research.

[118] PhonemeFake: Redefining Deepfake Realism with Language-Driven Segmental Manipulation and Adaptive Bilevel Detection

Oguzhan Baser,Ahmet Ege Tanriverdi,Sriram Vishwanath,Sandeep P. Chinchali

Main category: cs.CV

TL;DR: This paper introduces PhonemeFake, a new Deepfake attack method that manipulates speech segments using language reasoning, making it much harder to detect while offering efficient and scalable detection solutions.

Details

Motivation: Existing Deepfake datasets are not effective in deceiving human perception, unlike real-world Deepfake attacks. This gap necessitates the creation of more realistic attack vectors. Method: The researchers introduce PhonemeFake, a DF attack based on manipulating critical speech segments using language reasoning, and develop a bilevel DF segment detection model to detect such attacks. Result: PhonemeFake significantly deceives human perception by up to 42% and reduces benchmark accuracies by up to 94%. The detection model achieves a 91% reduction in EER, a 90% speed-up, with minimal compute overhead. Conclusion: The study concludes that the introduced PhonemeFake attack effectively reduces human perception and benchmark accuracies, showing the need for more realistic DF attack vectors. Abstract: Deepfake (DF) attacks pose a growing threat as generative models become increasingly advanced. However, our study reveals that existing DF datasets fail to deceive human perception, unlike real DF attacks that influence public discourse. It highlights the need for more realistic DF attack vectors. We introduce PhonemeFake (PF), a DF attack that manipulates critical speech segments using language reasoning, significantly reducing human perception by up to 42% and benchmark accuracies by up to 94%. We release an easy-to-use PF dataset on HuggingFace and open-source bilevel DF segment detection model that adaptively prioritizes compute on manipulated regions. Our extensive experiments across three known DF datasets reveal that our detection model reduces EER by 91% while achieving up to 90% speed-up, with minimal compute overhead and precise localization beyond existing models as a scalable solution.

Yu Han,Zhiwei Huang,Yanting Zhang,Fangjun Ding,Shen Cai,Rui Fan

Main category: cs.CV

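TL;DR: This work proposes a new LiDAR-camera data fusion method that tackles the difficulty of matching sparse single-frame LiDAR data with images and achieves strong experimental results.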

Details

Motivation: Existing methods struggle with the modality gap and noise when handling sparse single-frame LiDAR data and must rely on multi-frame accumulation or additional priors to improve reliability; this paper aims to address that challenge. Method: The LiDAR intensity map is projected into a 2D view from the LiDAR perspective and fed into an attention-based detector-free matching network, and a repeatability scoring mechanism is introduced as a soft visibility prior. Result: On the KITTI, nuScenes, and MIAS-LCEC-TF70 benchmarks, the method achieves state-of-the-art performance using only single-frame LiDAR data, outperforming prior methods that rely on accumulated point clouds. Conclusion: The paper proposes a projection-based, detector-free framework for direct point-pixel matching between LiDAR point clouds and camera images, with a repeatability scoring mechanism that improves matching reliability. Abstract: Point-pixel registration between LiDAR point clouds and camera images is a fundamental yet challenging task in autonomous driving and robotic perception. A key difficulty lies in the modality gap between unstructured point clouds and structured images, especially under sparse single-frame LiDAR settings. Existing methods typically extract features separately from point clouds and images, then rely on hand-crafted or learned matching strategies. This separate encoding fails to bridge the modality gap effectively, and more critically, these methods struggle with the sparsity and noise of single-frame LiDAR, often requiring point cloud accumulation or additional priors to improve reliability. Inspired by recent progress in detector-free matching paradigms (e.g. MatchAnything), we revisit the projection-based approach and introduce the detector-free framework for direct point-pixel matching between LiDAR and camera views. Specifically, we project the LiDAR intensity map into a 2D view from the LiDAR perspective and feed it into an attention-based detector-free matching network, enabling cross-modal correspondence estimation without relying on multi-frame accumulation. To further enhance matching reliability, we introduce a repeatability scoring mechanism that acts as a soft visibility prior. This guides the network to suppress unreliable matches in regions with low intensity variation, improving robustness under sparse input. Extensive experiments on KITTI, nuScenes, and MIAS-LCEC-TF70 benchmarks demonstrate that our method achieves state-of-the-art performance, outperforming prior approaches on nuScenes (even those relying on accumulated point clouds), despite using only single-frame LiDAR.

[120] RGE-GS: Reward-Guided Expansive Driving Scene Reconstruction via Diffusion Priors

Sicong Du,Jiarun Liu,Qifeng Chen,Hao-Xiang Chen,Tai-Jiang Mu,Sheng Yang

Main category: cs.CV

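TL;DR: This paper introduces RGE-GS, a new scene expansion framework that addresses the physical inconsistency and training inefficiency of current 3D Gaussian Splatting techniques and delivers strong reconstruction quality.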

Details

Motivation: A single-pass driving clip leaves the road structure incompletely scanned, so effective scene expansion is needed to improve sensor simulators; existing 3D Gaussian Splatting techniques suffer from physical inconsistencies and reduced training efficiency when diffusion priors are integrated. Method: The paper proposes RGE-GS, a new expansive reconstruction framework comprising a reward network that selectively retains diffusion outputs and a differentiated training strategy that adjusts the Gaussian optimization process. Result: Extensive evaluation shows that RGE-GS achieves state-of-the-art reconstruction quality. Conclusion: By combining reward-guided Gaussian integration with diffusion-based generation, RGE-GS achieves state-of-the-art reconstruction quality; the source code and an updated version are to be released publicly. Abstract: A single-pass driving clip frequently results in incomplete scanning of the road structure, making reconstructed scene expanding a critical requirement for sensor simulators to effectively regress driving actions. Although contemporary 3D Gaussian Splatting (3DGS) techniques achieve remarkable reconstruction quality, their direct extension through the integration of diffusion priors often introduces cumulative physical inconsistencies and compromises training efficiency. To address these limitations, we present RGE-GS, a novel expansive reconstruction framework that synergizes diffusion-based generation with reward-guided Gaussian integration. The RGE-GS framework incorporates two key innovations: First, we propose a reward network that learns to identify and prioritize consistently generated patterns prior to reconstruction phases, thereby enabling selective retention of diffusion outputs for spatial stability. Second, during the reconstruction process, we devise a differentiated training strategy that automatically adjusts Gaussian optimization progress according to scene convergence metrics, achieving better convergence than baseline methods. Extensive evaluations of publicly available datasets demonstrate that RGE-GS achieves state-of-the-art performance in reconstruction quality. Our source-code will be made publicly available at https://github.com/CN-ADLab/RGE-GS. (Camera-ready version incorporating reviewer suggestions will be updated soon.)

[121] Intervening in Black Box: Concept Bottleneck Model for Enhancing Human Neural Network Mutual Understanding

Nuoye Xiong,Anqi Dong,Ning Wang,Cong Hua,Guangming Zhu,Mei Lin,Peiyi Shen,Liang Zhang

Main category: cs.CV

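TL;DR: This paper proposes CBM-HNMU, a new method that builds on the Concept Bottleneck Model to improve both the interpretability and the accuracy of neural networks.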

Details

Motivation: As deep learning models grow more complex, their interpretability drops and their decisions become harder to understand. Many existing methods lack effective intervention mechanisms or operate only at the sample level without modifying the model itself. Method: CBM-HNMU uses the Concept Bottleneck Model (CBM) as an interpretable framework, automatically identifies and refines concepts that harm the model's decisions, and distills the corrected knowledge back into the black-box model. Result: Evaluated on several CNN- and Transformer-based models across Flower-102, CIFAR-10, CIFAR-100, FGVC-Aircraft, and CUB-200, CBM-HNMU achieves a maximum accuracy improvement of 2.64% and a maximum increase in average accuracy of 1.03%. Conclusion: CBM-HNMU offers a new way to improve mutual understanding between humans and neural networks via the Concept Bottleneck Model, enhancing model interpretability and accuracy. Abstract: Recent advances in deep learning have led to increasingly complex models with deeper layers and more parameters, reducing interpretability and making their decisions harder to understand. While many methods explain black-box reasoning, most lack effective interventions or only operate at sample-level without modifying the model itself. To address this, we propose the Concept Bottleneck Model for Enhancing Human-Neural Network Mutual Understanding (CBM-HNMU). CBM-HNMU leverages the Concept Bottleneck Model (CBM) as an interpretable framework to approximate black-box reasoning and communicate conceptual understanding. Detrimental concepts are automatically identified and refined (removed/replaced) based on global gradient contributions. The modified CBM then distills corrected knowledge back into the black-box model, enhancing both interpretability and accuracy. We evaluate CBM-HNMU on various CNN and transformer-based models across Flower-102, CIFAR-10, CIFAR-100, FGVC-Aircraft, and CUB-200, achieving a maximum accuracy improvement of 2.64% and a maximum increase in average accuracy of 1.03%. Source code is available at: https://github.com/XiGuaBo/CBM-HNMU.

[122] Concept Pinpoint Eraser for Text-to-image Diffusion Models via Residual Attention Gate

Byung Hyun Lee,Sungjin Lim,Seunggyu Lee,Dong Un Kang,Se Young Chun

Main category: cs.CV

TL;DR: This paper proposes Concept Pinpoint Eraser (CPE), a novel framework for selectively erasing target concepts in text-to-image diffusion models while preserving diverse remaining concepts and enhancing robustness against adversarial attacks.

Details

Motivation: Recent methods focusing on fine-tuning cross-attention layers may not adequately preserve diverse concepts; thus, a more effective and robust approach is needed. Method: CPE introduces nonlinear Residual Attention Gates (ResAGs) and an attention anchoring loss, trained iteratively with adversarial learning to enhance concept erasure performance. Result: Experiments showed that CPE outperforms existing approaches by achieving better erasure of targeted concepts while safeguarding other concepts from broad distributions. Conclusion: The proposed CPE framework effectively erases target concepts in diffusion models while preserving diverse remaining concepts and demonstrating robustness against adversarial attacks. Abstract: Remarkable progress in text-to-image diffusion models has brought a major concern about potentially generating images on inappropriate or trademarked concepts. Concept erasing has been investigated with the goals of deleting target concepts in diffusion models while preserving other concepts with minimal distortion. To achieve these goals, recent concept erasing methods usually fine-tune the cross-attention layers of diffusion models. In this work, we first show that merely updating the cross-attention layers in diffusion models, which is mathematically equivalent to adding linear modules to weights, may not be able to preserve diverse remaining concepts. Then, we propose a novel framework, dubbed Concept Pinpoint Eraser (CPE), by adding nonlinear Residual Attention Gates (ResAGs) that selectively erase (or cut) target concepts while safeguarding remaining concepts from broad distributions by employing an attention anchoring loss to prevent the forgetting. Moreover, we adversarially train CPE with ResAG and learnable text embeddings in an iterative manner to maximize erasing performance and enhance robustness against adversarial attacks. Extensive experiments on the erasure of celebrities, artistic styles, and explicit contents demonstrated that the proposed CPE outperforms prior arts by keeping diverse remaining concepts while deleting the target concepts with robustness against attack prompts. Code is available at https://github.com/Hyun1A/CPE

[123] FreqDGT: Frequency-Adaptive Dynamic Graph Networks with Transformer for Cross-subject EEG Emotion Recognition

Yueyang Li,Shengyu Gong,Weiming Zeng,Nizhuan Wang,Wai Ting Siok

Main category: cs.CV

TL;DR: This paper proposes FreqDGT, a novel framework for EEG-based emotion recognition that enhances cross-subject generalization through frequency-adaptive processing, dynamic graph learning, and multi-scale temporal modeling.

Details

Motivation: Cross-subject generalization remains a challenge in EEG-based emotion recognition due to individual variability. This work aims to systematically address this limitation. Method: FreqDGT introduces frequency-adaptive processing (FAP), adaptive dynamic graph learning (ADGL), and a multi-scale temporal disentanglement network (MTDN) combining hierarchical temporal transformers with adversarial feature disentanglement. Result: Comprehensive experiments show that FreqDGT significantly improves cross-subject emotion recognition accuracy. Conclusion: FreqDGT effectively improves cross-subject emotion recognition by integrating frequency-adaptive, spatial-dynamic, and temporal-hierarchical modeling, ensuring robustness to individual differences. Abstract: Electroencephalography (EEG) serves as a reliable and objective signal for emotion recognition in affective brain-computer interfaces, offering unique advantages through its high temporal resolution and ability to capture authentic emotional states that cannot be consciously controlled. However, cross-subject generalization remains a fundamental challenge due to individual variability, cognitive traits, and emotional responses. We propose FreqDGT, a frequency-adaptive dynamic graph transformer that systematically addresses these limitations through an integrated framework. FreqDGT introduces frequency-adaptive processing (FAP) to dynamically weight emotion-relevant frequency bands based on neuroscientific evidence, employs adaptive dynamic graph learning (ADGL) to learn input-specific brain connectivity patterns, and implements multi-scale temporal disentanglement network (MTDN) that combines hierarchical temporal transformers with adversarial feature disentanglement to capture both temporal dynamics and ensure cross-subject robustness. Comprehensive experiments demonstrate that FreqDGT significantly improves cross-subject emotion recognition accuracy, confirming the effectiveness of integrating frequency-adaptive, spatial-dynamic, and temporal-hierarchical modeling while ensuring robustness to individual differences. The code is available at https://github.com/NZWANG/FreqDGT.

[124] Efficient Multi-Crop Saliency Partitioning for Automatic Image Cropping

Andrew Hamara,Andrew C. Freeman

Main category: cs.CV

TL;DR: This paper introduces an enhanced Fixed Aspect Ratio Cropping algorithm that efficiently extracts multiple non-overlapping visually salient regions in images, addressing limitations of traditional methods optimized for single bounding boxes.

Details

Motivation: The motivation is to address the limitations of traditional saliency-aware cropping methods, which are ineffective for applications requiring multiple disjoint crops, by developing an approach that can efficiently extract multiple non-overlapping crops. Method: The method involves extending the Fixed Aspect Ratio Cropping algorithm to dynamically adjust attention thresholds and remove selected crops from consideration without recomputing the entire saliency map, achieving efficiency in linear time. Result: The result is an improved algorithm that can extract multiple non-overlapping crops in linear time, demonstrated with qualitative results showing its effectiveness. Conclusion: The paper concludes that extending the Fixed Aspect Ratio Cropping algorithm allows for efficient extraction of multiple non-overlapping crops, offering improvements over traditional methods that optimize only a single bounding box. Abstract: Automatic image cropping aims to extract the most visually salient regions while preserving essential composition elements. Traditional saliency-aware cropping methods optimize a single bounding box, making them ineffective for applications requiring multiple disjoint crops. In this work, we extend the Fixed Aspect Ratio Cropping algorithm to efficiently extract multiple non-overlapping crops in linear time. Our approach dynamically adjusts attention thresholds and removes selected crops from consideration without recomputing the entire saliency map. We discuss qualitative results and introduce the potential for future datasets and benchmarks.
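
The idea of picking several disjoint crops from one saliency map without recomputing it can be sketched as follows. This is not the authors' extension of Fixed Aspect Ratio Cropping, whose details (attention thresholds, linear-time bookkeeping) the abstract does not give; it is a simpler greedy stand-in that scores every fixed-size window once with a summed-area table and then keeps the best windows that do not overlap. All function names are hypothetical, and the sort makes this variant O(N log N) in the number of windows rather than linear.

```python
def integral_image(sal):
    """Summed-area table with one extra row and column of zeros."""
    h, w = len(sal), len(sal[0])
    ii = [[0.0] * (w + 1) for _ in range(h + 1)]
    for y in range(h):
        row_sum = 0.0
        for x in range(w):
            row_sum += sal[y][x]
            ii[y + 1][x + 1] = ii[y][x + 1] + row_sum
    return ii


def window_sum(ii, top, left, ch, cw):
    """Total saliency inside the ch-by-cw window at (top, left), in O(1)."""
    return (ii[top + ch][left + cw] - ii[top][left + cw]
            - ii[top + ch][left] + ii[top][left])


def select_crops(sal, ch, cw, k):
    """Greedily choose up to k non-overlapping ch-by-cw crops by saliency."""
    h, w = len(sal), len(sal[0])
    ii = integral_image(sal)  # computed once; the saliency map is never redone
    candidates = sorted(
        ((window_sum(ii, t, l, ch, cw), t, l)
         for t in range(h - ch + 1) for l in range(w - cw + 1)),
        key=lambda c: -c[0],
    )
    chosen = []
    for _, t, l in candidates:
        # Two equal-size windows overlap iff they are closer than the crop
        # size along both axes.
        if all(abs(t - ct) >= ch or abs(l - cl) >= cw for ct, cl in chosen):
            chosen.append((t, l))
            if len(chosen) == k:
                break
    return chosen


if __name__ == "__main__":
    saliency = [
        [5, 0, 0, 0],
        [0, 0, 0, 0],
        [0, 0, 0, 0],
        [0, 0, 0, 3],
    ]
    print(select_crops(saliency, 2, 2, 2))
```

Rejecting overlapping candidates plays the role that "removing selected crops from consideration" plays in the abstract; a fixed crop height and width stands in for the fixed aspect ratio.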

[125] Unleashing the Multi-View Fusion Potential: Noise Correction in VLM for Open-Vocabulary 3D Scene Understanding

Xingyilang Yin,Jiale Wang,Xi Yang,Mutian Xu,Xu Gu,Nannan Wang

Main category: cs.CV

TL;DR: This paper proposes MVOV3D, a new method for open-vocabulary 3D scene understanding.

Details

Motivation: Existing open-vocabulary 3D scene understanding methods struggle with diverse object categories because the limited amount of 3D data constrains the training of strong models. Method: MVOV3D reduces inherent noise by leveraging precise region-level image features and text features produced by CLIP encoders, and incorporates 3D geometric priors to optimize multi-view fusion. Result: Experiments show that MVOV3D reaches 14.7% mIoU on ScanNet200 and 16.2% mIoU on Matterport160 on challenging open-vocabulary semantic segmentation, significantly surpassing current leading 3D networks. Conclusion: MVOV3D is an effective method for open-vocabulary 3D scene understanding, with good performance and application potential. Abstract: Recent open-vocabulary 3D scene understanding approaches mainly focus on training 3D networks through contrastive learning with point-text pairs or by distilling 2D features into 3D models via point-pixel alignment. While these methods show considerable performance in benchmarks with limited vocabularies, they struggle to handle diverse object categories as the limited amount of 3D data bounds the training of strong open-vocabulary 3D models. We observe that 2D multi-view fusion methods take precedence in understanding diverse concepts in 3D scenes. However, inherent noises in vision-language models lead multi-view fusion to sub-optimal performance. To this end, we introduce MVOV3D, a novel approach aimed at unleashing the potential of 2D multi-view fusion for open-vocabulary 3D scene understanding. We focus on reducing the inherent noises without training, thereby preserving the generalizability while enhancing open-world capabilities. Specifically, MVOV3D improves multi-view 2D features by leveraging precise region-level image features and text features encoded by CLIP encoders and incorporates 3D geometric priors to optimize multi-view fusion. Extensive experiments on various datasets demonstrate the effectiveness of our method. Notably, our MVOV3D achieves a new record with 14.7% mIoU on ScanNet200 and 16.2% mIoU on Matterport160 for challenging open-vocabulary semantic segmentation, outperforming current leading trained 3D networks by a significant margin.

[126] Prompting without Panic: Attribute-aware, Zero-shot, Test-Time Calibration

Ramya Hebbalaguppe,Tamoghno Kandar,Abhinav Nagpal,Chetan Arora

Main category: cs.CV

TL;DR: This paper introduces TCA, a method for better-calibrated vision-language models through improved test-time prompt initialization and regularization.

Details

Motivation: Test-time prompt tuning (TPT) methods often suffer from poor confidence calibration due to naive prompt initialization, limiting their applicability in critical domains. Method: The authors propose using careful initialization of test-time prompts with prior knowledge from a large language model (LLM) and introduce a regularization loss to enhance inter-class separation and reduce intra-class distance. Result: Experiments show that TCA achieves an average expected calibration error (ECE) of 4.11, outperforming other TPT approaches like C-TPT (6.12), DiffTPT (6.78), and PromptAlign (8.43). Conclusion: The proposed method TCA improves the calibration of vision-language models after test-time prompt tuning by leveraging prior knowledge from a large language model and a novel regularization loss. Abstract: Vision-language models (VLM) have demonstrated impressive performance in image recognition by leveraging self-supervised training on large datasets. Their performance can be further improved by adapting to the test sample using test-time prompt tuning (TPT). Unfortunately, the singular focus of TPT approaches on improving the accuracy suffers from tunnel vision, and leads to degradation in confidence calibration. This limits the applicability of TPT in critical applications. We make three contributions in this work. (1) We posit that random or naive initialization of prompts leads to overfitting on a particular test sample, and is the main reason for miscalibration of the VLM after TPT. 
To mitigate the problem, we propose careful initialization of test time prompt using prior knowledge about the target label attributes from a large language model (LLM); (2) To further maintain the quality of prompts during TPT, we propose a novel regularization loss to reduce intraclass distance, and increase inter-class distance between the learnt [...]. Through extensive experiments on different CLIP architectures and 15 datasets, we show that our approach can effectively improve the calibration after TPT. We report an average expected calibration error (ECE) of 4.11 with our method, TCA, compared to 11.7 for vanilla TPT, 6.12 for C-TPT (ICLR'24), 6.78 for DiffTPT (CVPR'23), and 8.43 for PromptAlign (NeurIPS'23). The code is publicly accessible at: https://github.com/rhebbalaguppe/TCA_PromptWithoutPanic.
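The ECE numbers above compare confidence with accuracy. As a rough illustration, the sketch below uses the common equal-width binning definition of expected calibration error; the function name, the bin count of 10, and the toy data are assumptions, and the paper's exact evaluation protocol may differ.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE: sample-weighted mean of |accuracy - mean confidence| per bin.

    confidences: predicted top-class probabilities in [0, 1].
    correct: 1 if the prediction was right, else 0.
    """
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for i, c in enumerate(confidences):
        # Equal-width bins; a confidence of exactly 1.0 goes in the last bin.
        bins[min(int(c * n_bins), n_bins - 1)].append(i)
    ece = 0.0
    for idx in bins:
        if not idx:
            continue
        acc = sum(correct[i] for i in idx) / len(idx)
        conf = sum(confidences[i] for i in idx) / len(idx)
        ece += len(idx) / n * abs(acc - conf)
    return ece


# Two overconfident and two underconfident predictions.
confidences = [0.95, 0.95, 0.55, 0.55]
correct = [1, 0, 1, 1]
print(expected_calibration_error(confidences, correct))
```

In this toy case both occupied bins have a gap of 0.45, so the ECE is 0.45. Values such as 4.11 in the abstract are presumably expressed in percent.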

[127] Listener-Rewarded Thinking in VLMs for Image Preferences

Alexander Gambashidze,Li Pengyi,Matvey Skripkin,Andrey Galichin,Anton Gusarov,Konstantin Sobolev,Andrey Kuznetsov,Ivan Oseledets

Main category: cs.CV

TL;DR: This work proposes a listener-augmented GRPO framework for training reward models of human visual preferences.

Details

Motivation: Current reward models generalize poorly, and supervised fine-tuning leads to memorization and demands complex annotation pipelines. Reinforcement learning (RL) improves generalization, but reasoning accuracy drops significantly when the model's reasoning contradicts the evaluation of an independent vision-language model. Method: A listener-augmented GRPO framework is introduced, in which the listener re-evaluates the reasoner's chain-of-thought to provide a dense, calibrated confidence score that shapes the RL reward signal. Result: The listener-shaped reward scheme achieves the best accuracy on the ImageReward benchmark (67.4%), significantly improves out-of-distribution (OOD) performance on a large-scale human preference dataset (1.2M votes, up to +6% over a naive reasoner), and reduces reasoning contradictions. Conclusion: Listener-based rewards provide a scalable, data-efficient path to aligning vision-language models with nuanced human preferences. Abstract: Training robust and generalizable reward models for human visual preferences is essential for aligning text-to-image and text-to-video generative models with human intent. However, current reward models often fail to generalize, and supervised fine-tuning leads to memorization, demanding complex annotation pipelines. While reinforcement learning (RL), specifically Group Relative Policy Optimization (GRPO), improves generalization, we uncover a key failure mode: a significant drop in reasoning accuracy occurs when a model's reasoning trace contradicts that of an independent, frozen vision-language model ("listener") evaluating the same output. To address this, we introduce a listener-augmented GRPO framework. Here, the listener re-evaluates the reasoner's chain-of-thought to provide a dense, calibrated confidence score, shaping the RL reward signal. This encourages the reasoner not only to answer correctly, but to produce explanations that are persuasive to an independent model. Our listener-shaped reward scheme achieves best accuracy on the ImageReward benchmark (67.4%), significantly improves out-of-distribution (OOD) performance on a large-scale human preference dataset (1.2M votes, up to +6% over naive reasoner), and reduces reasoning contradictions compared to strong GRPO and SFT baselines. These results demonstrate that listener-based rewards provide a scalable, data-efficient path to aligning vision-language models with nuanced human preferences. We will release our reasoning model here: https://huggingface.co/alexgambashidze/qwen2.5vl_image_preference_reasoner.

[128] SemFaceEdit: Semantic Face Editing on Generative Radiance Manifolds

Shashikant Verma,Shanmuganathan Raman

Main category: cs.CV

TL;DR: SemFaceEdit introduces a novel method for localized image editing by generating semantic fields on radiance manifolds, enabling precise control over facial features.

Details

Motivation: Existing 3D-aware GAN techniques lack the capacity for localized editing, prompting the development of SemFaceEdit to enable more precise and controlled image editing. Method: SemFaceEdit uses latent codes to disentangle geometry and appearance on generative radiance manifolds, incorporating a Geometry module for semantic radiance and occupancy fields and an Appearance module for RGB radiance prediction, trained jointly in adversarial settings. Result: SemFaceEdit achieves improved radiance field disentanglement and allows for accurate editing of specific facial features without affecting the overall image integrity. Conclusion: SemFaceEdit enables precise editing of facial semantics in generated images while preserving other regions, offering superior performance in semantic field-based editing. Abstract: Despite multiple view consistency offered by 3D-aware GAN techniques, the resulting images often lack the capacity for localized editing. In response, generative radiance manifolds emerge as an efficient approach for constrained point sampling within volumes, effectively reducing computational demands and enabling the learning of fine details. This work introduces SemFaceEdit, a novel method that streamlines the appearance and geometric editing process by generating semantic fields on generative radiance manifolds. Utilizing latent codes, our method effectively disentangles the geometry and appearance associated with different facial semantics within the generated image. In contrast to existing methods that can change the appearance of the entire radiance field, our method enables the precise editing of particular facial semantics while preserving the integrity of other regions. Our network comprises two key modules: the Geometry module, which generates semantic radiance and occupancy fields, and the Appearance module, which is responsible for predicting RGB radiance. 
We jointly train both modules in adversarial settings to learn semantic-aware geometry and appearance descriptors. The appearance descriptors are then conditioned on their respective semantic latent codes by the Appearance Module, facilitating disentanglement and enhanced control. Our experiments highlight SemFaceEdit's superior performance in semantic field-based editing, particularly in achieving improved radiance field disentanglement.

[129] FOCUS: Fine-grained Optimization with Semantic Guided Understanding for Pedestrian Attributes Recognition

Hongyan An,Kuan Zhu,Xin He,Haiyun Guo,Chaoyang Zhao,Ming Tang,Jinqiao Wang

Main category: cs.CV

TL;DR: This paper introduces a novel framework called FOCUS for pedestrian attribute recognition, enabling fine-grained feature extraction guided by textual attributes, leading to improved accuracy and generalization on multiple benchmark datasets.

Details

Motivation: Existing methods use regional features to predict fixed sets of attributes, limiting performance by compromising fine-grained patterns and failing to generalize to unseen attributes. This work aims to address these limitations. Method: The paper proposes the Fine-grained Optimization with semantic-Guided Understanding (FOCUS) approach, which includes Multi-Granularity Mix Tokens (MGMT), Attribute-guided Visual Feature Extraction (AVFE), and Region-Aware Contrastive Learning (RACL) to improve feature extraction and generalization for both seen and unseen attributes. Result: Extensive experiments on PA100K, PETA, and RAPv1 datasets show that the proposed method achieves superior performance and strong generalization ability in recognizing both seen and unseen attributes. Conclusion: The paper concludes that the proposed FOCUS approach effectively enhances pedestrian attribute recognition by adaptively extracting fine-grained features and demonstrates strong generalization across multiple datasets. Abstract: Pedestrian attribute recognition (PAR) is a fundamental perception task in intelligent transportation and security. To tackle this fine-grained task, most existing methods focus on extracting regional features to enrich attribute information. However, a regional feature is typically used to predict a fixed set of pre-defined attributes in these methods, which limits the performance and practicality in two aspects: 1) Regional features may compromise fine-grained patterns unique to certain attributes in favor of capturing common characteristics shared across attributes. 2) Regional features cannot generalize to predict unseen attributes in the test time. 
In this paper, we propose the \textbf{F}ine-grained \textbf{O}ptimization with semanti\textbf{C} g\textbf{U}ided under\textbf{S}tanding (FOCUS) approach for PAR, which adaptively extracts fine-grained attribute-level features for each attribute individually, regardless of whether the attributes are seen or not during training. Specifically, we propose the Multi-Granularity Mix Tokens (MGMT) to capture latent features at varying levels of visual granularity, thereby enriching the diversity of the extracted information. Next, we introduce the Attribute-guided Visual Feature Extraction (AVFE) module, which leverages textual attributes as queries to retrieve their corresponding visual attribute features from the Mix Tokens using a cross-attention mechanism. To ensure that textual attributes focus on the appropriate Mix Tokens, we further incorporate a Region-Aware Contrastive Learning (RACL) method, encouraging attributes within the same region to share consistent attention maps. Extensive experiments on PA100K, PETA, and RAPv1 datasets demonstrate the effectiveness and strong generalization ability of our method.

[130] AG-VPReID 2025: Aerial-Ground Video-based Person Re-identification Challenge Results

Kien Nguyen,Clinton Fookes,Sridha Sridharan,Huy Nguyen,Feng Liu,Xiaoming Liu,Arun Ross,Dana Michalski,Tamás Endrei,Ivan DeAndres-Tame,Ruben Tolosana,Ruben Vera-Rodriguez,Aythami Morales,Julian Fierrez,Javier Ortega-Garcia,Zijing Gong,Yuhao Wang,Xuehu Liu,Pingping Zhang,Md Rashidunnabi,Hugo Proença,Kailash A. Hambarde,Saeid Rezaei

Main category: cs.CV

TL;DR: This paper introduces the AG-VPReID 2025 Challenge, a large-scale video-based competition for aerial-ground person re-identification, showcasing advanced methodologies and results that address the domain gap in cross-view ReID.

Details

Motivation: The motivation stems from the increasing importance of person re-identification across aerial and ground vantage points for surveillance and public safety, and the challenge posed by extreme viewpoint differences, scale variations, and occlusions. Method: The authors constructed the AG-VPReID dataset featuring 3,027 identities, over 13,500 tracklets, and approximately 3.7 million frames from UAVs, CCTV, and wearable cameras, and organized an international competition where teams developed various solutions including multi-stream architectures, transformer-based temporal reasoning, and physics-informed modeling. Result: The leading approach achieved 72.28% Rank-1 accuracy in aerial-to-ground ReID and 70.77% in ground-to-aerial ReID, surpassing existing baselines and highlighting the complexity of the new dataset. Conclusion: The paper concludes that the AG-VPReID 2025 Challenge successfully introduced a large-scale dataset and competition for aerial-ground ReID, demonstrating significant progress in addressing the domain gap with advanced methodologies. Abstract: Person re-identification (ReID) across aerial and ground vantage points has become crucial for large-scale surveillance and public safety applications. Although significant progress has been made in ground-only scenarios, bridging the aerial-ground domain gap remains a formidable challenge due to extreme viewpoint differences, scale variations, and occlusions. Building upon the achievements of the AG-ReID 2023 Challenge, this paper introduces the AG-VPReID 2025 Challenge - the first large-scale video-based competition focused on high-altitude (80-120m) aerial-ground ReID. Constructed on the new AG-VPReID dataset with 3,027 identities, over 13,500 tracklets, and approximately 3.7 million frames captured from UAVs, CCTV, and wearable cameras, the challenge featured four international teams. 
These teams developed solutions ranging from multi-stream architectures to transformer-based temporal reasoning and physics-informed modeling. The leading approach, X-TFCLIP from UAM, attained 72.28% Rank-1 accuracy in the aerial-to-ground ReID setting and 70.77% in the ground-to-aerial ReID setting, surpassing existing baselines while highlighting the dataset's complexity. For additional details, please refer to the official website at https://agvpreid25.github.io.
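The Rank-1 accuracy reported above measures how often the closest gallery match has the same identity as the query. The sketch below is a simplified illustration: the function name and toy distance matrix are assumptions, and real ReID protocols (including, presumably, this challenge) add steps such as tracklet-level feature aggregation and camera filtering that are omitted here.

```python
def rank1_accuracy(dist, query_ids, gallery_ids):
    """Fraction of queries whose nearest gallery item shares their identity.

    dist[q][g] is the distance between query q and gallery item g.
    """
    hits = 0
    for q, row in enumerate(dist):
        nearest = min(range(len(row)), key=row.__getitem__)
        hits += gallery_ids[nearest] == query_ids[q]
    return hits / len(dist)


# Three aerial queries matched against three ground gallery tracklets.
dist = [
    [0.2, 0.9, 0.5],  # nearest gallery item 0 (identity 1): hit
    [0.8, 0.7, 0.1],  # nearest gallery item 2 (identity 2): hit
    [0.6, 0.3, 0.4],  # nearest gallery item 1 (identity 3): miss
]
query_ids = [1, 2, 2]
gallery_ids = [1, 3, 2]
print(rank1_accuracy(dist, query_ids, gallery_ids))
```

Two of the three queries retrieve the correct identity first, giving a Rank-1 accuracy of 2/3.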

[131] DMD-Net: Deep Mesh Denoising Network

Aalok Gangopadhyay,Shashikant Verma,Shanmuganathan Raman

Main category: cs.CV

TL;DR: This paper proposes DMD-Net, a deep learning framework combining graph convolution and feature-guided transformations for effective mesh denoising, demonstrating strong performance and robustness.

Details

Motivation: The motivation is to develop an efficient and robust solution for the mesh denoising problem using deep learning techniques, particularly by leveraging graph-based architectures and transformation paradigms. Method: The paper introduces DMD-Net, an end-to-end deep learning framework consisting of a Graph Convolutional Neural Network with a primal-dual fusion block and a Feature Guided Transformer paradigm composed of a feature extractor, transformer, and denoiser. Result: DMD-Net demonstrates competitive or better performance than existing mesh denoising algorithms, showing robustness to various types of noise, including high-noise scenarios. Conclusion: The paper concludes that DMD-Net is a robust and effective method for mesh denoising, capable of achieving competitive or superior results compared to state-of-the-art techniques. Abstract: We present Deep Mesh Denoising Network (DMD-Net), an end-to-end deep learning framework, for solving the mesh denoising problem. DMD-Net consists of a Graph Convolutional Neural Network in which aggregation is performed in both the primal as well as the dual graph. This is realized in the form of an asymmetric two-stream network, which contains a primal-dual fusion block that enables communication between the primal-stream and the dual-stream. We develop a Feature Guided Transformer (FGT) paradigm, which consists of a feature extractor, a transformer, and a denoiser. The feature extractor estimates the local features, that guide the transformer to compute a transformation, which is applied to the noisy input mesh to obtain a useful intermediate representation. This is further processed by the denoiser to obtain the denoised mesh. Our network is trained on a large scale dataset of 3D objects. We perform exhaustive ablation studies to demonstrate that each component in our network is essential for obtaining the best performance. 
We show that our method obtains competitive or better results when compared with the state-of-the-art mesh denoising algorithms. We demonstrate that our method is robust to various kinds of noise. We observe that even in the presence of extremely high noise, our method achieves excellent performance.

Li-Cheng Shen,Jih-Kang Hsieh,Wei-Hua Li,Chu-Song Chen

Main category: cs.CV

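TL;DR: This work introduces Mask-aware TIR (MaTIR), a new task that unifies text-to-image retrieval and referring expression segmentation, together with a two-stage framework that improves both retrieval and segmentation performance.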

Details

Motivation: Existing text-to-image retrieval methods rely mainly on whole-image captions and lack interpretability, while referring expression segmentation is computationally expensive across large image collections; the MaTIR task is introduced to bridge this gap. Method: A two-stage framework is proposed: the first stage performs segmentation-aware image retrieval, and the second stage uses a multimodal large language model for reranking and object grounding. Result: Evaluation on the COCO and D$^3$ datasets shows that the method outperforms previous methods in both retrieval accuracy and segmentation quality. Conclusion: The MaTIR framework successfully combines text-to-image retrieval with referring expression segmentation, achieving efficient image search and accurate object segmentation. Abstract: Text-to-image retrieval (TIR) aims to find relevant images based on a textual query, but existing approaches are primarily based on whole-image captions and lack interpretability. Meanwhile, referring expression segmentation (RES) enables precise object localization based on natural language descriptions but is computationally expensive when applied across large image collections. To bridge this gap, we introduce Mask-aware TIR (MaTIR), a new task that unifies TIR and RES, requiring both efficient image search and accurate object segmentation. To address this task, we propose a two-stage framework, comprising a first stage for segmentation-aware image retrieval and a second stage for reranking and object grounding with a multimodal large language model (MLLM). First, we leverage SAM 2 to generate object masks and Alpha-CLIP to extract region-level embeddings offline, enabling effective and scalable online retrieval. Second, the MLLM is used to refine retrieval rankings and generate bounding boxes, which are matched to segmentation masks. We evaluate our approach on COCO and D$^3$ datasets, demonstrating significant improvements in both retrieval accuracy and segmentation quality over previous methods.

[133] Region-Aware CAM: High-Resolution Weakly-Supervised Defect Segmentation via Salient Region Perception

Hang-Cheng Dong,Lu Zou,Bingguo Liu,Dong Ye,Guodong Liu

Main category: cs.CV

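TL;DR: This paper proposes a weakly supervised semantic segmentation framework for industrial defect detection that achieves high-precision segmentation through filtering-guided backpropagation and a region-aware weighted module, while reducing reliance on large-scale annotated data.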

Details

Motivation: Surface defect detection plays a critical role in industrial quality inspection. Recent advances in artificial intelligence have significantly raised the automation level of inspection. However, conventional semantic segmentation and object detection models rely heavily on large-scale annotated datasets, which conflicts with the practical requirements of defect detection. Method: A novel weakly supervised semantic segmentation framework is proposed with two key components: a region-aware class activation map (CAM) and pseudo-label training. Filtering-guided backpropagation (FGBP) is introduced to refine target regions, and a region-aware weighted module is developed to improve spatial precision. Finally, pseudo-label segmentation iteratively refines the model's performance. Result: Comprehensive experiments on industrial defect datasets demonstrate the superiority of the proposed method. Conclusion: The framework effectively bridges the gap between weakly supervised learning and high-precision defect segmentation, offering a practical solution for resource-constrained industrial scenarios. Abstract: Surface defect detection plays a critical role in industrial quality inspection. Recent advances in artificial intelligence have significantly enhanced the automation level of detection processes. However, conventional semantic segmentation and object detection models heavily rely on large-scale annotated datasets, which conflicts with the practical requirements of defect detection tasks. This paper proposes a novel weakly supervised semantic segmentation framework comprising two key components: a region-aware class activation map (CAM) and pseudo-label training. To address the limitations of existing CAM methods, especially low-resolution heat maps and insufficient detail preservation, we introduce filtering-guided backpropagation (FGBP), which refines target regions by filtering gradient magnitudes to identify areas with higher relevance to defects. Building upon this, we further develop a region-aware weighted module to enhance spatial precision. Finally, pseudo-label segmentation is implemented to refine the model's performance iteratively. Comprehensive experiments on industrial defect datasets demonstrate the superiority of our method. The proposed framework effectively bridges the gap between weakly supervised learning and high-precision defect segmentation, offering a practical solution for resource-constrained industrial scenarios.

[134] STR-Match: Matching SpatioTemporal Relevance Score for Training-Free Video Editing

Junsung Lee,Junoh Kang,Bohyung Han

Main category: cs.CV

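TL;DR: This paper proposes STR-Match, a video editing algorithm that captures spatiotemporal pixel relevance using the spatial and temporal modules of text-to-video diffusion models, producing visually appealing and spatiotemporally coherent edits.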

Details

Motivation: Previous text-guided video editing methods are limited in temporal consistency, motion distortion, and domain transformation, mainly because spatiotemporal pixel relevance is insufficiently modeled during editing. Method: STR-Match is a training-free video editing algorithm that combines 2D spatial attention and 1D temporal modules to capture spatiotemporal pixel relevance across adjacent frames without the cost of 3D attention, and integrates it into a latent optimization framework with a latent mask. Result: STR-Match outperforms existing methods in both visual quality and spatiotemporal consistency, preserving key visual attributes of the source video even under significant domain transformations. Conclusion: By modeling spatiotemporal pixel relevance effectively, STR-Match addresses the shortcomings of earlier text-guided video editing methods and offers an efficient, high-quality solution for video editing. Abstract: Previous text-guided video editing methods often suffer from temporal inconsistency, motion distortion, and, most notably, limited domain transformation. We attribute these limitations to insufficient modeling of spatiotemporal pixel relevance during the editing process. To address this, we propose STR-Match, a training-free video editing algorithm that produces visually appealing and spatiotemporally coherent videos through latent optimization guided by our novel STR score. The score captures spatiotemporal pixel relevance across adjacent frames by leveraging 2D spatial attention and 1D temporal modules in text-to-video (T2V) diffusion models, without the overhead of computationally expensive 3D attention mechanisms. Integrated into a latent optimization framework with a latent mask, STR-Match generates temporally consistent and visually faithful videos, maintaining strong performance even under significant domain transformations while preserving key visual attributes of the source. Extensive experiments demonstrate that STR-Match consistently outperforms existing methods in both visual quality and spatiotemporal consistency.

[135] Decoupled Seg Tokens Make Stronger Reasoning Video Segmenter and Grounder

Dang Jisheng,Wu Xudong,Wang Bimei,Lv Ning,Chen Jiayu,Jingwen Zhao,Yichu liu,Jizhao Liu,Juncheng Li,Teng Wang

Main category: cs.CV

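TL;DR: This paper proposes DeSa2VA, which improves video segmentation through a decoupling-enhanced prompting scheme that addresses the entanglement of visual and semantic information.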

Details

Motivation: To address the loss of segmentation accuracy caused by the entanglement of dynamic visual information and static semantics in existing methods. Method: A new prompting scheme is proposed that combines text pre-training with a linear decoupling module and adopts a dynamic mask fusion strategy. Result: Experiments show that DeSa2VA achieves state-of-the-art performance on image/video segmentation and question answering tasks. Conclusion: By decoupling textual and visual features, DeSa2VA improves video segmentation accuracy and reaches SOTA performance across multiple tasks. Abstract: Existing video segmenter and grounder approaches, exemplified by Sa2VA, directly fuse features within segmentation models. This often results in an undesirable entanglement of dynamic visual information and static semantics, thereby degrading segmentation accuracy. To systematically mitigate this issue, we propose DeSa2VA, a decoupling-enhanced prompting scheme integrating text pre-training and a linear decoupling module to address the information processing limitations inherent in SAM-2. Specifically, first, we devise a pre-training paradigm that converts textual ground-truth labels into point-level prompts while generating corresponding text masks. These masks are refined through a hybrid loss function to strengthen the model's semantic grounding capabilities. Next, we employ linear projection to disentangle hidden states generated by a large language model into distinct textual and visual feature subspaces. Finally, a dynamic mask fusion strategy synergistically combines these decoupled features through triple supervision from predicted text/visual masks and ground-truth annotations. Extensive experiments demonstrate state-of-the-art performance across diverse tasks, including image segmentation, image question answering, video segmentation, and video question answering. Our codes are available at https://github.com/longmalongma/DeSa2VA.

[136] How Semantically Informative is an Image?: Measuring the Covariance-Weighted Norm of Contrastive Learning Embeddings

Fumiya Uchiyama,Rintaro Yanagi,Shohei Taniguchi,Shota Takashiro,Masahiro Suzuki,Hirokatsu Kataoka,Yusuke Iwasawa,Yutaka Matsuo

Main category: cs.CV

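TL;DR: This paper studies contrastive learning as a model of multimodal probability distributions, introduces a new metric of semantic informativeness, and validates it empirically.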

Details

Motivation: It remains unclear whether contrastive learning can represent absolute semantic informativeness. Method: The Information Gain of images and texts is computed with a contrastive learning model and estimated from a norm of the embedding. Result: In empirical results with OpenCLIP, images with the lowest Information Gain scores often correspond to placeholder icons such as "image not found." Conclusion: Information Gain can be measured with either CLIP or SigLIP, and the results show a strong correlation, with a coefficient of determination from 0.98 to 1.00. Abstract: Contrastive learning has the capacity to model multimodal probability distributions by embedding and aligning visual representations with semantics from captions. This approach enables the estimation of relational semantic similarity; however, it remains unclear whether it can also represent absolute semantic informativeness. In this work, we introduce a semantic informativeness metric for an image calculated from text samples via a contrastive learning model; similarly, the informativeness of a text is calculated from image samples. We propose a redefinition of the concept of Information Gain, a concept previously explored in natural language processing, extending its application to the domains of vision and language. Our metric quantifies how conditioning on an image distorts the distribution of associated texts, and vice versa for text conditioning on image distributions. In OpenCLIP's empirical results, we observe that images with the lowest Information Gain scores often correspond to placeholder icons such as "image not found." Furthermore, we propose to measure a norm-based metric of the embedding to estimate the Information Gain, following the theoretical results for Skip-Gram with Negative Sampling (SGNS) word embedding. Information Gain can be measured using either CLIP or SigLIP, and the results demonstrate a strong correlation with a coefficient of determination ranging from 0.98 to 1.00. After obtaining the mean and the covariance of the sample embedding, the computational cost of this method is independent of the sample size, and it is compatible with publicly available, open-weight models.
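The abstract does not state the exact formula for the norm-based metric. As a loosely hedged sketch, one plausible reading of a "covariance-weighted norm" is a quadratic form of the centered embedding with the sample covariance; the centering, the population covariance, and all function names below are assumptions rather than the authors' definition. The sketch mainly illustrates the cost argument: the mean and covariance are computed once from the samples, after which scoring an embedding no longer depends on the sample count.

```python
def fit_stats(samples):
    """Mean and (population) covariance of sample embeddings, computed once."""
    n, d = len(samples), len(samples[0])
    mean = [sum(s[k] for s in samples) / n for k in range(d)]
    cov = [
        [sum((s[i] - mean[i]) * (s[j] - mean[j]) for s in samples) / n
         for j in range(d)]
        for i in range(d)
    ]
    return mean, cov


def covariance_weighted_norm(x, mean, cov):
    """Assumed score: (x - mean)^T cov (x - mean); cost is O(d^2) per query."""
    c = [xi - mi for xi, mi in zip(x, mean)]
    d = len(c)
    return sum(c[i] * cov[i][j] * c[j] for i in range(d) for j in range(d))


# Hypothetical 2-D text embeddings used to score one image embedding.
text_samples = [[1.0, 0.0], [-1.0, 0.0], [0.0, 2.0], [0.0, -2.0]]
mean, cov = fit_stats(text_samples)
print(covariance_weighted_norm([1.0, 1.0], mean, cov))
```

With these toy samples the mean is the origin and the covariance is diagonal with entries 0.5 and 2, so the embedding `[1.0, 1.0]` scores 2.5, while an embedding at the mean scores 0.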

[137] CP-Guard: A Unified, Probability-Agnostic, and Adaptive Framework for Malicious Agent Detection and Defense in Multi-Agent Embodied Perception Systems

Senkang Hu,Yihang Tao,Guowen Xu,Xinyuan Qian,Yiqin Deng,Xianhao Chen,Sam Tak Wu Kwong,Yuguang Fang

Main category: cs.CV

TL;DR: CP-Guard is a security framework for collaborative perception systems that identifies and removes malicious agents using consensus-based verification techniques.

Details

Motivation: Collaborative Perception (CP) is vulnerable to attacks from malicious agents; thus, a defense mechanism is needed to ensure system integrity. Method: Proposed a framework called CP-Guard with PASAC method, Collaborative Consistency Loss (CCLoss), and online adaptive threshold via dual sliding windows. Result: Extensive experiments demonstrated the effectiveness of CP-Guard in detecting malicious agents and ensuring consensus in dynamic environments. Conclusion: CP-Guard effectively detects and eliminates malicious agents in collaborative perception systems, enhancing security and reliability. Abstract: Collaborative Perception (CP) has been shown to be a promising technique for multi-agent autonomous driving and multi-agent robotic systems, where multiple agents share their perception information to enhance the overall perception performance and expand the perception range. However, in CP, an ego agent needs to receive messages from its collaborators, which makes it vulnerable to attacks from malicious agents. To address this critical issue, we propose a unified, probability-agnostic, and adaptive framework, namely, CP-Guard, which is a tailored defense mechanism for CP deployed by each agent to accurately detect and eliminate malicious agents in its collaboration network. Our key idea is to enable CP to reach a consensus rather than a conflict against an ego agent's perception results. Based on this idea, we first develop a probability-agnostic sample consensus (PASAC) method to effectively sample a subset of the collaborators and verify the consensus without prior probabilities of malicious agents. Furthermore, we define collaborative consistency loss (CCLoss) for object detection task and bird's eye view (BEV) segmentation task to capture the discrepancy between an ego agent and its collaborators, which is used as a verification criterion for consensus. 
In addition, we propose online adaptive threshold via dual sliding windows to dynamically adjust the threshold for consensus verification and ensure the reliability of the systems in dynamic environments. Finally, we conduct extensive experiments and demonstrate the effectiveness of our framework. Code will be released at https://github.com/CP-Security/CP-Guard

[138] Neural Cellular Automata: From Cells to Pixels

Ehsan Pajouheshgar,Yitao Xu,Ali Abbasi,Alexander Mordvintsev,Wenzel Jakob,Sabine Süsstrunk

Main category: cs.CV

TL;DR: This paper introduces an implicit decoder-based framework for Neural Cellular Automata (NCAs), enabling them to efficiently scale to high-resolution outputs while preserving their self-organizing and emergent behaviors.

Details

Motivation: To overcome the limitations of NCAs in handling high-resolution grids due to quadratic training costs, local information propagation, and heavy compute demands during inference. Method: The authors pair Neural Cellular Automata (NCAs) with a tiny, shared implicit decoder inspired by advances in implicit neural representations. They also propose novel loss functions for morphogenesis and texture synthesis tasks to optimize high-resolution output with minimal overhead. Result: The proposed approach enables NCAs to produce full-HD outputs in real time while preserving emergent properties like self-regeneration and robustness, demonstrating applicability across 2D, 3D grids, and meshes with improved quality, efficiency, and performance. Conclusion: NCAs equipped with an implicit decoder can generate high-resolution outputs in real time while maintaining their self-organizing properties and scalability across multiple tasks and variants. Abstract: Neural Cellular Automata (NCAs) are bio-inspired systems in which identical cells self-organize to form complex and coherent patterns by repeatedly applying simple local rules. NCAs display striking emergent behaviors including self-regeneration, generalization and robustness to unseen situations, and spontaneous motion. Despite their success in texture synthesis and morphogenesis, NCAs remain largely confined to low-resolution grids. This limitation stems from (1) training time and memory requirements that grow quadratically with grid size, (2) the strictly local propagation of information which impedes long-range cell communication, and (3) the heavy compute demands of real-time inference at high resolution. In this work, we overcome this limitation by pairing NCA with a tiny, shared implicit decoder, inspired by recent advances in implicit neural representations. Following NCA evolution on a coarse grid, a lightweight decoder renders output images at arbitrary resolution. 
We also propose novel loss functions for both morphogenesis and texture synthesis tasks, specifically tailored for high-resolution output with minimal memory and computation overhead. Combining our proposed architecture and loss functions brings substantial improvement in quality, efficiency, and performance. NCAs equipped with our implicit decoder can generate full-HD outputs in real time while preserving their self-organizing, emergent properties. Moreover, because each MLP processes cell states independently, inference remains highly parallelizable and efficient. We demonstrate the applicability of our approach across multiple NCA variants (on 2D, 3D grids, and 3D meshes) and multiple tasks, including texture generation and morphogenesis (growing patterns from a seed), showing that with our proposed framework, NCAs seamlessly scale to high-resolution outputs with minimal computational overhead.

[139] MOTOR: Multimodal Optimal Transport via Grounded Retrieval in Medical Visual Question Answering

Mai A. Shaaban,Tausifa Jan Saleem,Vijay Ram Papineni,Mohammad Yaqub

Main category: cs.CV

TL;DR: MOTOR improves Medical Visual Question Answering by integrating textual and visual context through a novel multimodal retrieval and re-ranking approach, achieving higher accuracy compared to existing methods.

Details

Motivation: Existing approaches often neglect visual or multimodal context crucial for medical diagnosis, leading to factually incorrect answers from vision-language models (VLMs). Method: The authors proposed MOTOR, a novel approach leveraging grounded captions and optimal transport to enhance retrieval relevance in MedVQA tasks. The method captures relationships between queries and retrieved contexts using textual and visual information. Result: Empirical analysis and human expert evaluation show that MOTOR outperforms state-of-the-art methods on MedVQA datasets by an average of 6.45%. Conclusion: MOTOR is an effective multimodal retrieval and re-ranking approach for Medical Visual Question Answering, which improves accuracy by incorporating both textual and visual context in the retrieval process. Abstract: Medical visual question answering (MedVQA) plays a vital role in clinical decision-making by providing contextually rich answers to image-based queries. Although vision-language models (VLMs) are widely used for this task, they often generate factually incorrect answers. Retrieval-augmented generation addresses this challenge by providing information from external sources, but risks retrieving irrelevant context, which can degrade the reasoning capabilities of VLMs. Re-ranking retrievals, as introduced in existing approaches, enhances retrieval relevance by focusing on query-text alignment. However, these approaches neglect the visual or multimodal context, which is particularly crucial for medical diagnosis. We propose MOTOR, a novel multimodal retrieval and re-ranking approach that leverages grounded captions and optimal transport. It captures the underlying relationships between the query and the retrieved context based on textual and visual information. Consequently, our approach identifies more clinically relevant contexts to augment the VLM input. 
Empirical analysis and human expert evaluation demonstrate that MOTOR achieves higher accuracy on MedVQA datasets, outperforming state-of-the-art methods by an average of 6.45%. Code is available at https://github.com/BioMedIA-MBZUAI/MOTOR.
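
As a rough illustration of how optimal transport can score a query against retrieved contexts, the sketch below computes an entropic (Sinkhorn) transport cost between two sets of feature vectors and sorts candidates by it. This is not MOTOR's implementation: the function names, the cosine-based cost, the uniform marginals, and the regularization value are all assumptions made for the example.

```python
import numpy as np

def sinkhorn_plan(cost, reg=0.1, n_iters=200):
    """Entropic OT plan between two uniform distributions (assumed marginals)."""
    n, m = cost.shape
    a = np.full(n, 1.0 / n)          # uniform mass on query features
    b = np.full(m, 1.0 / m)          # uniform mass on context features
    K = np.exp(-cost / reg)          # Gibbs kernel
    u = np.ones(n)
    for _ in range(n_iters):
        v = b / (K.T @ u)            # match column marginals
        u = a / (K @ v)              # match row marginals
    return u[:, None] * K * v[None, :]

def ot_distance(query_feats, context_feats, reg=0.1):
    """Transport cost under a cosine-distance ground cost (an assumption)."""
    q = query_feats / np.linalg.norm(query_feats, axis=1, keepdims=True)
    c = context_feats / np.linalg.norm(context_feats, axis=1, keepdims=True)
    cost = 1.0 - q @ c.T
    plan = sinkhorn_plan(cost, reg)
    return float((plan * cost).sum())

def rerank(query_feats, candidates, reg=0.1):
    """Return candidate indices ordered from lowest to highest OT cost."""
    scores = [ot_distance(query_feats, c, reg) for c in candidates]
    return sorted(range(len(candidates)), key=lambda i: scores[i])
```

In a multimodal setting, the feature sets could mix text-token and image-region embeddings; that choice is left open here because the abstract does not specify it.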

[140] Point Cloud Compression and Objective Quality Assessment: A Survey

Yiling Xu,Yujie Zhang,Shuting Xia,Kaifa Yang,He Huang,Ziyu Shan,Wenjie Huang,Qi Yang,Le Yang

Main category: cs.CV

TL;DR: This paper surveys recent developments in point cloud compression and quality assessment, identifying current challenges and suggesting future improvements like hybrid frameworks and better feature extraction for more effective 3D applications.

Details

Motivation: The motivation for this research stems from the rapid growth of 3D point cloud data driven by applications in autonomous driving, robotics, and immersive environments, which demands efficient compression and quality assessment techniques despite the unique challenges posed by point clouds' irregular structure, high data volume, and complex attributes. Method: The paper provides a comprehensive survey of recent advances in point cloud compression (PCC) and point cloud quality assessment (PCQA). It analyzes handcrafted and learning-based PCC algorithms and objective PCQA metrics, offering comparisons and insights into their strengths and limitations. Result: The result is a detailed overview and benchmarking of various PCC and PCQA methods on emerging datasets, providing practical insights into their strengths and limitations. The paper highlights current challenges and outlines potential future directions for improving 3D applications. Conclusion: This paper concludes that while significant progress has been made in point cloud compression and quality assessment, challenges remain in enhancing visual fidelity, reducing latency, and supporting multimodal data. The authors suggest future directions such as hybrid compression frameworks and advanced feature extraction strategies to enable more efficient and intelligent 3D applications. Abstract: The rapid growth of 3D point cloud data, driven by applications in autonomous driving, robotics, and immersive environments, has led to a critical demand for efficient compression and quality assessment techniques. Unlike traditional 2D media, point clouds present unique challenges due to their irregular structure, high data volume, and complex attributes. This paper provides a comprehensive survey of recent advances in point cloud compression (PCC) and point cloud quality assessment (PCQA), emphasizing their significance for real-time and perceptually relevant applications. 
We analyze a wide range of handcrafted and learning-based PCC algorithms, along with objective PCQA metrics. By benchmarking representative methods on emerging datasets, we offer detailed comparisons and practical insights into their strengths and limitations. Despite notable progress, challenges such as enhancing visual fidelity, reducing latency, and supporting multimodal data remain. This survey outlines future directions, including hybrid compression frameworks and advanced feature extraction strategies, to enable more efficient, immersive, and intelligent 3D applications.

[141] MagShield: Towards Better Robustness in Sparse Inertial Motion Capture Under Magnetic Disturbances

Yunzhe Shao,Xinyu Yi,Lu Yin,Shihui Guo,Junhai Yong,Feng Xu

Main category: cs.CV

TL;DR: MagShield is a new method for addressing magnetic interference in sparse inertial motion capture systems, offering high accuracy and broad applicability.

Details

Motivation: Existing inertial measurement unit (IMU) systems are prone to orientation estimation errors in magnetically disturbed environments, which limits their use in real-world scenarios. Method: MagShield adopts a "detect-then-correct" strategy, detecting magnetic disturbances through multi-IMU joint analysis and correcting orientation errors with human motion priors. Result: Experiments show that MagShield significantly improves motion capture accuracy under magnetic disturbance and is compatible with different sparse inertial MoCap systems. Conclusion: MagShield is an effective remedy for magnetic disturbance that improves the accuracy and applicability of existing sparse inertial motion capture systems. Abstract: This paper proposes a novel method called MagShield, designed to address the issue of magnetic interference in sparse inertial motion capture (MoCap) systems. Existing Inertial Measurement Unit (IMU) systems are prone to orientation estimation errors in magnetically disturbed environments, limiting their practical application in real-world scenarios. To address this problem, MagShield employs a "detect-then-correct" strategy, first detecting magnetic disturbances through multi-IMU joint analysis, and then correcting orientation errors using human motion priors. MagShield can be integrated with most existing sparse inertial MoCap systems, improving their performance in magnetically disturbed environments. Experimental results demonstrate that MagShield significantly enhances the accuracy of motion capture under magnetic interference and exhibits good compatibility across different sparse inertial MoCap systems.

[142] Attention to Burstiness: Low-Rank Bilinear Prompt Tuning

Yuzhu Wang,Manni Duan,Shu Kong

Main category: cs.CV

TL;DR: This paper proposes a new method called Bilinear Prompt Tuning (BPT) that improves prompt tuning efficiency and accuracy in vision Transformers by addressing the issue of non-Gaussian distributions in prompt learning.

Details

Motivation: The paper identifies 'burstiness' in the values from the interaction of image patch embeddings and key and query projectors in Vision Transformers, which poses challenges for prompt learning due to their non-Gaussian distribution. Method: A bilinear prompt tuning (BPT) method was proposed, which involves deriving a whitening matrix over random image patch embeddings and ViT's key and query projectors to address the challenge posed by non-Gaussian distributions in learning prompts. A compact, low-rank version of the bilinear model was also developed. Result: The proposed method significantly accelerates prompt tuning and boosts accuracy, e.g., >25 accuracy points on the CUB dataset. It also learns 'bursty prompts'. Conclusion: The proposed BPT methods outperform various VPT methods while reducing parameter count and computation overhead. Abstract: Visual Prompt Tuning (VPT) is a parameter-efficient fine-tuning technique that adapts a pre-trained vision Transformer (ViT) by learning a small set of parameters in the input space, known as prompts. In VPT, we uncover ``burstiness'' in the values arising from the interaction of image patch embeddings, and the key and query projectors within Transformer's self-attention module. Furthermore, the values of patch embeddings and the key and query projectors exhibit Laplacian and hyper-Laplacian distributions, respectively. Intuitively, these non-Gaussian distributions pose challenges for learning prompts. To address this, we propose whitening these data, de-correlating them and equalizing their variance towards more Gaussian before learning prompts. We derive the whitening matrix over random image patch embeddings and ViT's key and query projectors, and multiply it with the prompt to be learned in a bilinear manner. Surprisingly, this method significantly accelerates prompt tuning and boosts accuracy, e.g., $>$25 accuracy points on the CUB dataset; interestingly, it learns ``bursty prompts''. 
Extending the bilinear model which is known to introduce burstiness, we present a compact, low-rank version by learning two smaller matrices whose multiplication yields the final prompts. We call the proposed methods Bilinear Prompt Tuning (BPT). Extensive experiments across multiple benchmark datasets demonstrate that BPT methods not only outperform various VPT methods but also reduce parameter count and computation overhead.
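
The whitening and low-rank ideas can be sketched in a few lines of NumPy. The example is only illustrative and makes several assumptions not stated in the abstract: it whitens patch embeddings alone (the paper also involves the key and query projectors), it uses ZCA whitening, and the names `whitening_matrix` and `low_rank_bilinear_prompts` are hypothetical.

```python
import numpy as np

def whitening_matrix(x, eps=1e-5):
    """ZCA whitening matrix for row-wise samples x of shape (n, dim)."""
    x = x - x.mean(axis=0, keepdims=True)
    cov = x.T @ x / (x.shape[0] - 1)
    eigvals, eigvecs = np.linalg.eigh(cov)            # symmetric eigendecomposition
    return eigvecs @ np.diag(1.0 / np.sqrt(eigvals + eps)) @ eigvecs.T

def low_rank_bilinear_prompts(a, b, w):
    """Prompts as a product of two small learnable factors and a fixed matrix.

    a: (num_prompts, rank), b: (rank, dim) would be learned; w: (dim, dim) is fixed.
    """
    return a @ b @ w
```

With `num_prompts = 50`, `dim = 768`, and `rank = 8`, the two factors hold 50*8 + 8*768 = 6,544 learnable values instead of the 38,400 needed for a full prompt matrix, which is the kind of parameter saving a low-rank factorization provides.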

[143] Towards Explainable Bilingual Multimodal Misinformation Detection and Localization

Yiwei He,Xiangtai Li,Zhenglin Huang,Yi Dong,Hao Fei,Jiangning Zhang,Baoyuan Wu,Guangliang Cheng

Main category: cs.CV

TL;DR: BiMi is a new bilingual multimodal framework that effectively detects misinformation in news media by combining multiple detection techniques and enhancing interpretability.

Details

Motivation: Misinformation in news media has become more subtle due to the increasing realism of multimodal content, especially when images are paired with bilingual subtitles. This necessitates a more advanced framework to detect such misinformation. Method: The study introduces BiMi, which integrates region-level localization, cross-modal and cross-lingual consistency detection, natural language explanation, and an online retrieval module. It also applies GRPO to improve explanation quality. Result: BiMi outperforms strong baselines by up to +8.9 in classification accuracy, +15.9 in localization accuracy, and +2.5 in explanation BERTScore. Conclusion: BiMi is an effective bilingual multimodal framework for detecting misinformation in news media with high accuracy and interpretability. Abstract: The increasing realism of multimodal content has made misinformation more subtle and harder to detect, especially in news media where images are frequently paired with bilingual (e.g., Chinese-English) subtitles. Such content often includes localized image edits and cross-lingual inconsistencies that jointly distort meaning while remaining superficially plausible. We introduce BiMi, a bilingual multimodal framework that jointly performs region-level localization, cross-modal and cross-lingual consistency detection, and natural language explanation for misinformation analysis. To support generalization, BiMi integrates an online retrieval module that supplements model reasoning with up-to-date external context. We further release BiMiBench, a large-scale and comprehensive benchmark constructed by systematically editing real news images and subtitles, comprising 104,000 samples with realistic manipulations across visual and linguistic modalities. To enhance interpretability, we apply Group Relative Policy Optimization (GRPO) to improve explanation quality, marking the first use of GRPO in this domain. 
Extensive experiments demonstrate that BiMi outperforms strong baselines by up to +8.9 in classification accuracy, +15.9 in localization accuracy, and +2.5 in explanation BERTScore, advancing state-of-the-art performance in realistic, multilingual misinformation detection. Code, models, and datasets will be released.

[144] Utilizing a Novel Deep Learning Method for Scene Categorization in Remote Sensing Data

Ghufran A. Omran,Wassan Saad Abduljabbar Hayale,Ahmad AbdulQadir AlRababah,Israa Ibraheem Al-Barazanchi,Ravi Sekhar,Pritesh Shah,Sushma Parihar,Harshavardhan Reddy Penubadi

Main category: cs.CV

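TL;DR: This paper proposes a new technique called CO-BRNN that reaches up to 97% accuracy in scene classification of remote sensing images, outperforming existing methods.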

Details

Motivation: Traditional deep learning models need large, highly varied, and noisy databases to capture important visual features, which makes high-accuracy scene classification in remote sensing data challenging. Method: The study introduces CO-BRNN, a new technique for scene classification in remote sensing data, and compares it with several existing techniques. Result: Experiments show that CO-BRNN attained the highest accuracy at 97%, followed by LSTM-CRF (90%), MLP-CNN (85%), and CNN-LSTM (80%). Conclusion: The study stresses the importance of physical confirmation to ensure the efficiency of satellite data and proposes CO-BRNN, which shows higher accuracy in scene classification of remote sensing data. Abstract: Scene categorization (SC) in remotely acquired images is an important subject with broad consequences in different fields, including catastrophe control, ecological observation, architecture for cities, and more. Despite its many applications, reaching a high degree of accuracy in SC from remote sensing data has proven difficult. This is because conventional deep learning models require large databases with high variety and high levels of noise to capture important visual features. To address these problems, this study introduces an innovative technique referred to as the Cuttlefish Optimized Bidirectional Recurrent Neural Network (CO-BRNN) for scene categorization in remote sensing data. The study compares the performance of CO-BRNN with current techniques, including Multilayer Perceptron-Convolutional Neural Network (MLP-CNN), Convolutional Neural Network-Long Short Term Memory (CNN-LSTM), Long Short Term Memory-Conditional Random Field (LSTM-CRF), Graph-Based (GB), Multilabel Image Retrieval Model (MIRM-CF), and Convolutional Neural Networks Data Augmentation (CNN-DA). The results demonstrate that CO-BRNN attained the maximum accuracy of 97%, followed by LSTM-CRF with 90%, MLP-CNN with 85%, and CNN-LSTM with 80%. The study highlights the significance of physical confirmation to ensure the efficiency of satellite data.

[145] YM-WML: A new Yolo-based segmentation Model with Weighted Multi-class Loss for medical imaging

Haniyeh Nikkhah,Jafar Tanha,Mahdi Zarrin,SeyedEhsan Roshan,Amin Kazempour

Main category: cs.CV

TL;DR: This paper proposes YM-WML, a novel model for cardiac image segmentation that addresses class imbalance and achieves superior performance on the ACDC dataset.

Details

Motivation: Medical image segmentation faces challenges like class imbalance and complex structures, necessitating improved models for better accuracy and efficiency. Method: The study introduces the YM-WML model, which combines a robust backbone for feature extraction, a YOLOv11 neck for multi-scale feature aggregation, and an attention-based segmentation head. Additionally, it uses a Weighted Multi-class Exponential (WME) loss function to handle class imbalance. Result: On the ACDC dataset, the YM-WML model achieved a Dice Similarity Coefficient of 91.02, outperforming existing state-of-the-art methods. Conclusion: The YM-WML model sets a new benchmark in cardiac segmentation tasks due to its stable training, accurate segmentation, and strong generalization. Abstract: Medical image segmentation poses significant challenges due to class imbalance and the complex structure of medical images. To address these challenges, this study proposes YM-WML, a novel model for cardiac image segmentation. The model integrates a robust backbone for effective feature extraction, a YOLOv11 neck for multi-scale feature aggregation, and an attention-based segmentation head for precise and accurate segmentation. To address class imbalance, we introduce the Weighted Multi-class Exponential (WME) loss function. On the ACDC dataset, YM-WML achieves a Dice Similarity Coefficient of 91.02, outperforming state-of-the-art methods. The model demonstrates stable training, accurate segmentation, and strong generalization, setting a new benchmark in cardiac segmentation tasks.
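
For readers unfamiliar with the reported metric, the following is a minimal sketch of a per-class Dice Similarity Coefficient on integer label maps. It is a generic definition, not the authors' evaluation code; the function name and the smoothing constant `eps` are assumptions. The WME loss itself is not sketched because the abstract does not give its formula.

```python
import numpy as np

def dice_per_class(pred, target, num_classes, eps=1e-7):
    """Dice = 2|P ∩ T| / (|P| + |T|), computed separately for each class label."""
    scores = []
    for c in range(num_classes):
        p = pred == c                      # predicted mask for class c
        t = target == c                    # ground-truth mask for class c
        inter = np.logical_and(p, t).sum()
        denom = p.sum() + t.sum()
        scores.append((2.0 * inter + eps) / (denom + eps))
    return np.array(scores)

pred = np.array([[0, 1], [1, 1]])
target = np.array([[0, 1], [0, 1]])
print(dice_per_class(pred, target, num_classes=2))  # approximately [0.667, 0.8]
```

Averaging the per-class values gives a single score; papers typically report it as a percentage, which is presumably how the 91.02 figure should be read.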

[146] Peccavi: Visual Paraphrase Attack Safe and Distortion Free Image Watermarking Technique for AI-Generated Images

Shreyas Dixit,Ashhar Aziz,Shashwat Bajpai,Vasu Sharma,Aman Chadha,Vinija Jain,Amitava Das

Main category: cs.CV

TL;DR: This paper introduces PECCAVI, a new image watermarking technique that is safe from visual paraphrase attacks and free from distortion, addressing vulnerabilities in current watermarking methods against AI-generated content manipulation.

Details

Motivation: Concerns about the vulnerability of invisible watermarking techniques to tampering and the potential for malicious actors to bypass them entirely motivated this study. Generative AI-powered de-watermarking attacks, like visual paraphrase attacks, have shown an ability to fully remove watermarks. Method: The paper introduces PECCAVI, which strategically embeds watermarks within Non-Melting Points (NMPs) using multi-channel frequency domain watermarking. It also utilizes noisy burnishing to counter reverse-engineering attempts. Result: This paper presents PECCAVI, the first visual paraphrase attack-safe and distortion-free image watermarking technique. In visual paraphrase attacks, an image is altered while preserving its core semantic regions, termed Non-Melting Points (NMPs). Conclusion: PECCAVI is model-agnostic and aims to enhance durability against reverse-engineering efforts by incorporating noisy burnishing, with all relevant resources and codes to be open-sourced. Abstract: A report by the European Union Law Enforcement Agency predicts that by 2026, up to 90 percent of online content could be synthetically generated, raising concerns among policymakers, who cautioned that "Generative AI could act as a force multiplier for political disinformation. The combined effect of generative text, images, videos, and audio may surpass the influence of any single modality." In response, California's Bill AB 3211 mandates the watermarking of AI-generated images, videos, and audio. However, concerns remain regarding the vulnerability of invisible watermarking techniques to tampering and the potential for malicious actors to bypass them entirely. Generative AI-powered de-watermarking attacks, especially the newly introduced visual paraphrase attack, have shown an ability to fully remove watermarks, resulting in a paraphrase of the original image. 
This paper introduces PECCAVI, the first visual paraphrase attack-safe and distortion-free image watermarking technique. In visual paraphrase attacks, an image is altered while preserving its core semantic regions, termed Non-Melting Points (NMPs). PECCAVI strategically embeds watermarks within these NMPs and employs multi-channel frequency domain watermarking. It also incorporates noisy burnishing to counter reverse-engineering efforts aimed at locating NMPs to disrupt the embedded watermark, thereby enhancing durability. PECCAVI is model-agnostic. All relevant resources and codes will be open-sourced.

[147] ActAlign: Zero-Shot Fine-Grained Video Classification via Language-Guided Sequence Alignment

Amir Aghdam,Vincent Tao Hu

Main category: cs.CV

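TL;DR: ActAlign formulates video classification as a sequence alignment task and performs strongly on zero-shot fine-grained video classification without any video-text supervision or fine-tuning.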

Details

Motivation: Contrastive vision-language models such as SigLIP show strong open-set recognition through mean-pooled image-text similarity, but they cannot capture the temporal structure that is critical for distinguishing fine-grained activities. Method: ActAlign uses a large language model to generate an ordered sub-action sequence for each class and aligns it with video frames using Dynamic Time Warping (DTW) in a shared embedding space. Result: ActAlign achieves 30.5% accuracy on the highly challenging ActionAtlas benchmark, outperforming billion-parameter video-language models while using approximately 8x fewer parameters. Conclusion: Combining classical alignment techniques with structured language priors offers a scalable and general way to unlock the open-set recognition potential of vision-language models for fine-grained video understanding. Abstract: We address the task of zero-shot fine-grained video classification, where no video examples or temporal annotations are available for unseen action classes. While contrastive vision-language models such as SigLIP demonstrate strong open-set recognition via mean-pooled image-text similarity, they fail to capture the temporal structure critical for distinguishing fine-grained activities. We introduce ActAlign, a zero-shot framework that formulates video classification as sequence alignment. For each class, a large language model generates an ordered sub-action sequence, which is aligned with video frames using Dynamic Time Warping (DTW) in a shared embedding space. Without any video-text supervision or fine-tuning, ActAlign achieves 30.5% accuracy on the extremely challenging ActionAtlas benchmark, where human accuracy is only 61.6%. ActAlign outperforms billion-parameter video-language models while using approximately 8x fewer parameters. These results demonstrate that structured language priors, combined with classical alignment techniques, offer a scalable and general approach to unlocking the open-set recognition potential of vision-language models for fine-grained video understanding.
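
The alignment step can be illustrated with textbook Dynamic Time Warping over a cosine-distance matrix between sub-action embeddings and frame embeddings. This is a sketch under assumptions, not ActAlign's code: the embeddings are taken as given, the length normalization is one possible choice, and `classify_by_alignment` is a hypothetical name.

```python
import numpy as np

def _unit(x):
    """L2-normalize each row."""
    return x / np.linalg.norm(x, axis=1, keepdims=True)

def dtw_cost(cost):
    """Minimum cumulative cost of a monotonic path through a (steps, frames) matrix."""
    n, m = cost.shape
    acc = np.full((n + 1, m + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = min(acc[i - 1, j], acc[i, j - 1], acc[i - 1, j - 1])
            acc[i, j] = cost[i - 1, j - 1] + step
    return float(acc[n, m])

def classify_by_alignment(frame_emb, class_steps):
    """Pick the class whose ordered sub-action embeddings align best with the frames."""
    frames = _unit(np.asarray(frame_emb, dtype=float))
    scores = {}
    for name, steps in class_steps.items():
        steps = _unit(np.asarray(steps, dtype=float))
        cost = 1.0 - steps @ frames.T                  # cosine distance
        scores[name] = dtw_cost(cost) / (cost.shape[0] + cost.shape[1])
    return min(scores, key=scores.get), scores
```

Two classes that contain the same sub-actions in opposite order receive identical mean-pooled similarity to a video, but different DTW costs, which is the temporal sensitivity the abstract says mean pooling lacks.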

[148] Probabilistic Prototype Calibration of Vision-Language Models for Generalized Few-shot Semantic Segmentation

Jie Liu,Jiayi Shen,Pan Zhou,Jan-Jakob Sonke,Efstratios Gavves

Main category: cs.CV

TL;DR: FewCLIP improves generalized few-shot semantic segmentation by introducing a probabilistic prototype calibration framework that enhances adaptability and generalization.

Details

Motivation: Existing prototype-based methods are inherently deterministic, limiting the adaptability of learned prototypes to diverse samples. FewCLIP aims to overcome this limitation through probabilistic prototype calibration. Method: FewCLIP introduces a prototype calibration mechanism and distribution regularization over calibration prototypes to provide more adaptive prototype learning for GFSS. Result: Extensive experimental results demonstrate that FewCLIP significantly outperforms existing approaches on PASCAL-5i and COCO-20i datasets in both GFSS and class-incremental settings. Conclusion: FewCLIP outperforms state-of-the-art approaches on GFSS tasks by introducing a probabilistic prototype calibration framework over multi-modal prototypes from the pretrained CLIP. Abstract: Generalized Few-Shot Semantic Segmentation (GFSS) aims to extend a segmentation model to novel classes with only a few annotated examples while maintaining performance on base classes. Recently, pretrained vision-language models (VLMs) such as CLIP have been leveraged in GFSS to improve generalization on novel classes through multi-modal prototypes learning. However, existing prototype-based methods are inherently deterministic, limiting the adaptability of learned prototypes to diverse samples, particularly for novel classes with scarce annotations. To address this, we propose FewCLIP, a probabilistic prototype calibration framework over multi-modal prototypes from the pretrained CLIP, thus providing more adaptive prototype learning for GFSS. Specifically, FewCLIP first introduces a prototype calibration mechanism, which refines frozen textual prototypes with learnable visual calibration prototypes, leading to a more discriminative and adaptive representation. Furthermore, unlike deterministic prototype learning techniques, FewCLIP introduces distribution regularization over these calibration prototypes. 
This probabilistic formulation ensures structured and uncertainty-aware prototype learning, effectively mitigating overfitting to limited novel class data while enhancing generalization. Extensive experimental results on PASCAL-5$^i$ and COCO-20$^i$ datasets demonstrate that our proposed FewCLIP significantly outperforms state-of-the-art approaches across both GFSS and class-incremental setting. The code is available at https://github.com/jliu4ai/FewCLIP.

[149] Revisiting CroPA: A Reproducibility Study and Enhancements for Cross-Prompt Adversarial Transferability in Vision-Language Models

Atharv Mittal,Agam Pandey,Amritanshu Tiwari,Sukrit Jindal,Swadesh Swain

Main category: cs.CV

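TL;DR: This study validates and improves an efficient cross-prompt adversarial attack, strengthening adversarial example generation against large vision-language models and highlighting their security problems.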

Details

Motivation: Although large vision-language models (VLMs) have revolutionized computer vision, they remain vulnerable to adversarial attacks, particularly when both the visual and textual modalities can be manipulated. Method: The study validates the Cross-Prompt Attack (CroPA) and proposes several improvements: a novel initialization strategy, learning universal perturbations to study cross-image transferability, and a new loss function targeting the attention mechanisms of the vision encoder. Result: Evaluation on prominent VLMs including Flamingo, BLIP-2, and InstructBLIP, together with extended experiments on LLaVA, confirms the original results and shows that the proposed improvements consistently boost adversarial effectiveness. Conclusion: The study underscores the importance of examining adversarial vulnerabilities in VLMs and provides a more robust framework for generating transferable adversarial examples, with significant implications for understanding VLM security in real-world applications. Abstract: Large Vision-Language Models (VLMs) have revolutionized computer vision, enabling tasks such as image classification, captioning, and visual question answering. However, they remain highly vulnerable to adversarial attacks, particularly in scenarios where both visual and textual modalities can be manipulated. In this study, we conduct a comprehensive reproducibility study of "An Image is Worth 1000 Lies: Adversarial Transferability Across Prompts on Vision-Language Models", validating the Cross-Prompt Attack (CroPA) and confirming its superior cross-prompt transferability compared to existing baselines. Beyond replication, we propose several key improvements: (1) a novel initialization strategy that significantly improves Attack Success Rate (ASR); (2) an investigation of cross-image transferability by learning universal perturbations; (3) a novel loss function targeting vision encoder attention mechanisms to improve generalization. Our evaluation across prominent VLMs, including Flamingo, BLIP-2, and InstructBLIP, as well as extended experiments on LLaVA, validates the original results and demonstrates that our improvements consistently boost adversarial effectiveness. Our work reinforces the importance of studying adversarial vulnerabilities in VLMs and provides a more robust framework for generating transferable adversarial examples, with significant implications for understanding the security of VLMs in real-world applications.

[150] A Novel Frame Identification and Synchronization Technique for Smartphone Visible Light Communication Systems Based on Convolutional Neural Networks

Vaigai Nayaki Yokar,Hoa Le-Minh,Xicong Li,Wai Lok Woo,Luis Nero Alves,Stanislav Zvanovec,Tran The Son,Zabih Ghassemlooy

Main category: cs.CV

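TL;DR: This paper proposes an efficient CNN method for frame identification and synchronization in S2C VLC systems, with high accuracy and potential for practical use.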

Details

Motivation: The work aims to enhance short-link communication performance in screen-to-camera (S2C) visible light communication (VLC) systems. Method: A novel, robust, and lightweight CNN-based method was developed using Python and the TensorFlow Keras framework, with three real-time experimental investigations conducted in Jupyter Notebook. Result: The proposed model handles real-time challenges well, including blurring, cropping, and rotated images in mobility scenarios, and introducing overhead frames improved system performance. Conclusion: The experimental results show that the model performs well at identifying and synchronizing frames in S2C VLC systems, with an overall accuracy of approximately 98.74%. Abstract: This paper proposes a novel, robust, and lightweight supervised Convolutional Neural Network (CNN)-based technique for frame identification and synchronization, designed to enhance short-link communication performance in a screen-to-camera (S2C) based visible light communication (VLC) system. Developed using Python and the TensorFlow Keras framework, the proposed CNN model was trained through three real-time experimental investigations conducted in Jupyter Notebook. These experiments incorporated a dataset created from scratch to address various real-time challenges in S2C communication, including blurring, cropping, and rotated images in mobility scenarios. Overhead frames were introduced for synchronization, which leads to enhanced system performance. The experimental results demonstrate that the proposed model achieves an overall accuracy of approximately 98.74%, highlighting its effectiveness in identifying and synchronizing frames in S2C VLC systems.

[151] MusiXQA: Advancing Visual Music Understanding in Multimodal Large Language Models

Jian Chen,Wenye Ma,Penghang Liu,Wei Wang,Tengwei Song,Ming Li,Chenguang Wang,Ruiyi Zhang,Changyou Chen

Main category: cs.CV

TL;DR: This paper introduces the MusiXQA dataset and the Phi-3-MusiX model, advancing multimodal large language models in music sheet understanding.

Details

Motivation: The ability of multimodal large language models to interpret music sheets has not been explored in depth, and a comprehensive dataset is needed to advance the field. Method: High-quality synthetic music sheets are generated via MusiXTeX with structured annotations, and the Phi-3-MusiX model is developed. Result: Evaluations reveal significant limitations of current state-of-the-art MLLMs in this domain, while Phi-3-MusiX achieves significant performance gains over GPT-based methods. Conclusion: The MusiXQA dataset and the Phi-3-MusiX model lay a foundation for future MLLM progress in music sheet understanding. Abstract: Multimodal Large Language Models (MLLMs) have achieved remarkable visual reasoning abilities in natural images, text-rich documents, and graphic designs. However, their ability to interpret music sheets remains underexplored. To bridge this gap, we introduce MusiXQA, the first comprehensive dataset for evaluating and advancing MLLMs in music sheet understanding. MusiXQA features high-quality synthetic music sheets generated via MusiXTeX, with structured annotations covering note pitch and duration, chords, clefs, key/time signatures, and text, enabling diverse visual QA tasks. Through extensive evaluations, we reveal significant limitations of current state-of-the-art MLLMs in this domain. Beyond benchmarking, we developed Phi-3-MusiX, an MLLM fine-tuned on our dataset, achieving significant performance gains over GPT-based methods. The proposed dataset and model establish a foundation for future advances in MLLMs for music sheet understanding. Code, data, and model will be released upon acceptance.

[152] VisionScores -- A system-segmented image score dataset for deep learning tasks

Alejandro Romero Amezcua,Mariano José Juan Rivera Meraz

Main category: cs.CV

TL;DR: VisionScores is a new image score dataset designed for deep learning, containing 24.8k grayscale images segmented into two creative scenarios based on composer and composition type.

Details

Motivation: To provide a novel system-segmented image score dataset that captures both graphic similarity and instrument-dependent composition patterns for machine and deep learning tasks. Method: The dataset is segmented into two scenarios: one focusing on Sonatinas from different authors (14k samples) and the other focusing on various composition types by Franz Liszt (10.8k samples). All images are grayscale jpg of 128×512 pixels with metadata provided. Result: A total of 24.8k formatted grayscale images were created along with metadata, unsegmented full-page scores, and pre-formatted images for further analysis. Conclusion: VisionScores successfully introduces a structured, high information-density image score dataset tailored for machine and deep learning tasks, offering diverse scenarios for analysis. Abstract: VisionScores presents a novel proposal being the first system-segmented image score dataset, aiming to offer structure-rich, high information-density images for machine and deep learning tasks. Delimited to two-handed piano pieces, it was built to consider not only certain graphic similarity but also composition patterns, as this creative process is highly instrument-dependent. It provides two scenarios in relation to composer and composition type. The first, formed by 14k samples, considers works from different authors but the same composition type, specifically, Sonatinas. The latter, consisting of 10.8K samples, presents the opposite case, various composition types from the same author, being the one selected Franz Liszt. All of the 24.8k samples are formatted as grayscale jpg images of $128 \times 512$ pixels. VisionScores supplies the users not only the formatted samples but the systems' order and pieces' metadata. Moreover, unsegmented full-page scores and the pre-formatted images are included for further analysis.

[153] Inpainting is All You Need: A Diffusion-based Augmentation Method for Semi-supervised Medical Image Segmentation

Xinrong Hu,Yiyu Shi

Main category: cs.CV

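TL;DR: This paper proposes AugPaint, a new method that uses the inpainting ability of latent diffusion models to generate additional training data, significantly improving medical image segmentation while reducing the need for manual annotation.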

Details

Motivation: Pixel-level annotation of medical image datasets is laborious and expensive, and improving segmentation performance when labeled data is scarce is a key challenge. Method: AugPaint adapts the sampling process of latent diffusion models to the inpainting task, generating high-quality image-label pairs from limited labeled data for training downstream segmentation models. Result: Extensive evaluation on four public medical image segmentation datasets (CT, MRI, and skin imaging) shows that AugPaint outperforms existing state-of-the-art label-efficient methods and significantly improves segmentation performance. Conclusion: As a data augmentation framework that requires no retraining, AugPaint uses inpainting to generate image-label pairs and significantly improves segmentation performance with limited labeled data. Abstract: Collecting pixel-level labels for medical datasets can be a laborious and expensive process, and enhancing segmentation performance with a scarcity of labeled data is a crucial challenge. This work introduces AugPaint, a data augmentation framework that utilizes inpainting to generate image-label pairs from limited labeled data. AugPaint leverages latent diffusion models, known for their ability to generate high-quality in-domain images with low overhead, and adapts the sampling process for the inpainting task without the need for retraining. Specifically, given a pair of image and label mask, we crop the area labeled with the foreground and condition on it during the reverse denoising process for every noise level. The masked background area is gradually filled in, and all generated images are paired with the label mask. This approach ensures the accuracy of match between synthetic images and label masks, setting it apart from existing dataset generation methods. The generated images serve as valuable supervision for training downstream segmentation models, effectively addressing the challenge of limited annotations. We conducted extensive evaluations of our data augmentation method on four public medical image segmentation datasets, including CT, MRI, and skin imaging. Results across all datasets demonstrate that AugPaint outperforms state-of-the-art label-efficient methodologies, significantly improving segmentation performance.

[154] From Coarse to Fine: Learnable Discrete Wavelet Transforms for Efficient 3D Gaussian Splatting

Hung Nguyen,An Le,Runfa Li,Truong Nguyen

Main category: cs.CV

TL;DR: AutoOpti3DGS introduces a wavelet-based approach to control Gaussian growth in 3D view synthesis, improving efficiency for constrained hardware.

Details

Motivation: To address the issue of increasing memory and bandwidth demands caused by the proliferation of Gaussian primitives in 3D Gaussian Splatting. Method: The method uses learnable Forward and Inverse Discrete Wavelet Transforms with fixed low-pass filters, learnable high-pass filters initialized to zero, and an auxiliary orthogonality loss to gradually activate fine frequencies. Result: AutoOpti3DGS achieves sparser scene representations, delays redundant fine Gaussian formation, and works seamlessly with existing 3DGS frameworks while requiring only one additional hyper-parameter. Conclusion: AutoOpti3DGS provides a solution to restrain Gaussian proliferation in 3D Gaussian Splatting without compromising visual quality, leading to more efficient and sparse scene representations. Abstract: 3D Gaussian Splatting has emerged as a powerful approach in novel view synthesis, delivering rapid training and rendering but at the cost of an ever-growing set of Gaussian primitives that strains memory and bandwidth. We introduce AutoOpti3DGS, a training-time framework that automatically restrains Gaussian proliferation without sacrificing visual fidelity. The key idea is to feed the input images to a sequence of learnable Forward and Inverse Discrete Wavelet Transforms, where low-pass filters are kept fixed, high-pass filters are learnable and initialized to zero, and an auxiliary orthogonality loss gradually activates fine frequencies. This wavelet-driven, coarse-to-fine process delays the formation of redundant fine Gaussians, allowing 3DGS to capture global structure first and refine detail only when necessary. Through extensive experiments, AutoOpti3DGS requires just a single filter learning-rate hyper-parameter, integrates seamlessly with existing efficient 3DGS frameworks, and consistently produces sparser scene representations more compatible with memory or storage-constrained hardware.
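
The coarse-to-fine behavior described above can be seen in a one-level, one-dimensional toy filter bank with a fixed Haar low-pass filter and a two-tap high-pass filter that starts at zero. This is only a sketch under assumptions: the paper operates on images with learnable forward and inverse transforms, and both the two-tap setup and the squared-deviation form of the orthogonality loss used here are illustrative choices rather than the authors' definitions.

```python
import numpy as np

# Fixed Haar low-pass filter (kept frozen, as in the description above).
LOW = np.array([1.0, 1.0]) / np.sqrt(2.0)

def dwt_roundtrip(x, high):
    """Forward then inverse one-level DWT along the last axis (even length)."""
    even, odd = x[..., 0::2], x[..., 1::2]
    low_band = LOW[0] * even + LOW[1] * odd        # coarse coefficients
    high_band = high[0] * even + high[1] * odd     # detail coefficients
    out = np.empty_like(x, dtype=float)
    out[..., 0::2] = LOW[0] * low_band + high[0] * high_band
    out[..., 1::2] = LOW[1] * low_band + high[1] * high_band
    return out

def orthogonality_loss(high):
    """Squared deviation of the 2x2 filter bank from an orthogonal matrix."""
    bank = np.stack([LOW, high])
    return float(np.sum((bank @ bank.T - np.eye(2)) ** 2))

x = np.array([1.0, 3.0, 2.0, 6.0])
zero_high = np.zeros(2)                            # initialization: no fine detail
haar_high = np.array([1.0, -1.0]) / np.sqrt(2.0)   # fully orthogonal bank
print(dwt_roundtrip(x, zero_high))   # pairwise averages: [2. 2. 4. 4.]
print(dwt_roundtrip(x, haar_high))   # reconstruction: [1. 3. 2. 6.] up to rounding
```

With the high-pass filter at zero, the round trip returns a blurred signal and the loss equals 1; driving the loss toward 0 pushes the filter toward the Haar high-pass and restores fine detail, which mirrors the gradual activation of fine frequencies.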

[155] Ovis-U1 Technical Report

Guo-Hua Wang,Shanshan Zhao,Xinjie Zhang,Liangfu Cao,Pengxin Zhan,Lunhao Duan,Shiyin Lu,Minghao Fu,Xiaohao Chen,Jianshan Zhao,Yang Li,Qing-Guo Chen

Main category: cs.CV

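TL;DR: Ovis-U1 is a 3-billion-parameter unified model that integrates multimodal understanding, text-to-image generation, and image editing; its unified training approach improves performance and yields strong results on multiple benchmarks.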

Details

Motivation: To improve performance on multimodal understanding, text-to-image generation, and image editing by integrating these tasks into unified training. Method: Combines a diffusion-based visual decoder with a bidirectional token refiner and adopts a unified training approach that starts from a language model. Result: Ovis-U1 scores 69.6 on the OpenCompass Multi-modal Academic Benchmark, 83.72 and 0.89 on the DPG-Bench and GenEval text-to-image benchmarks, respectively, and 4.00 and 6.42 on the ImgEdit-Bench and GEdit-Bench-EN image editing benchmarks, respectively. Conclusion: Ovis-U1 unifies multimodal understanding and generation, and its new training approach surpasses existing state-of-the-art models on multiple benchmarks. Abstract: In this report, we introduce Ovis-U1, a 3-billion-parameter unified model that integrates multimodal understanding, text-to-image generation, and image editing capabilities. Building on the foundation of the Ovis series, Ovis-U1 incorporates a diffusion-based visual decoder paired with a bidirectional token refiner, enabling image generation tasks comparable to leading models like GPT-4o. Unlike some previous models that use a frozen MLLM for generation tasks, Ovis-U1 utilizes a new unified training approach starting from a language model. Compared to training solely on understanding or generation tasks, unified training yields better performance, demonstrating the enhancement achieved by integrating these two tasks. Ovis-U1 achieves a score of 69.6 on the OpenCompass Multi-modal Academic Benchmark, surpassing recent state-of-the-art models such as Ristretto-3B and SAIL-VL-1.5-2B. In text-to-image generation, it excels with scores of 83.72 and 0.89 on the DPG-Bench and GenEval benchmarks, respectively. For image editing, it achieves 4.00 and 6.42 on the ImgEdit-Bench and GEdit-Bench-EN, respectively. As the initial version of the Ovis unified model series, Ovis-U1 pushes the boundaries of multimodal understanding, generation, and editing.

[156] Empowering Small VLMs to Think with Dynamic Memorization and Exploration

Jiazhen Liu,Yuchuan Deng,Long Chen

Main category: cs.CV

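TL;DR: This paper proposes DyME, a new training paradigm that dynamically selects between memorization and exploration modes at each optimization step so that every update contributes to the trade-off, improving the thinking reliability and performance of small vision-language models.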

Details

Motivation: Giving Small-scale Vision-Language Models (SVLMs) reliable thinking capabilities remains fundamentally challenging because of their limited parameter capacity and weak instruction-following abilities. Existing training paradigms such as Supervised Fine-Tuning (SFT) and Reinforcement Learning with Verifiable Reward (RLVR) place demands on the base VLM that exceed what SVLMs can deliver. Applying them directly to SVLMs often leads to severe pseudo thinking traces and advantage collapse, which undermines both thinking reliability and task performance. A natural remedy is to combine SFT and RLVR and exploit their complementarity to reduce the dependence on model capacity. However, the widely adopted two-stage training paradigm still performs poorly on SVLMs, because their tendency toward sub-optimal convergence hinders the trade-off and limits the benefits of the combination. Method: Proposes DyME, a new training paradigm that dynamically selects between Memorization (via SFT) and Exploration (via RLVR) modes at each optimization step, so that every update contributes to the trade-off. Result: Extensive experiments across diverse domains show that DyME consistently achieves this balance and delivers substantial performance improvements. Conclusion: DyME is a practical and effective solution for giving SVLMs reliable thinking capabilities. Abstract: Empowering Small-scale Vision-Language Models (SVLMs) with reliable thinking capabilities remains fundamentally challenging due to their limited parameter capacity and weak instruction-following abilities. Existing training paradigms, including Supervised Fine-Tuning (SFT) and Reinforcement Learning with Verifiable Reward (RLVR), impose substantial demands on the base VLM, exceeding the capabilities of SVLMs. Consequently, directly applying these paradigms to SVLMs often suffers from severe pseudo thinking traces and advantage collapse, ultimately undermining both thinking reliability and task performance. A natural solution is to combine SFT and RLVR, leveraging their complementarity to reduce the dependence on model capacity. However, the widely adopted two-stage training paradigm still performs poorly on SVLMs, as their tendency toward sub-optimal convergence hinders the trade-off and limits the benefits of the combination. To address this, we propose DyME, a novel training paradigm that Dynamically selects between Memorization (via SFT) and Exploration (via RLVR) modes at each optimization step, ensuring that every update contributes to the trade-off. Extensive experiments across diverse domains demonstrate that DyME consistently achieves this balance, and thus delivers substantial performance improvements. These results establish DyME as a practical and effective solution for empowering SVLMs with reliable thinking capabilities. GitHub: https://github.com/HKUST-LongGroup/DyME

[157] CoreMark: Toward Robust and Universal Text Watermarking Technique

Jiale Meng,Yiming Li,Zheming Lu,Zewei He,Hao Luo,Tianwei Zhang

Main category: cs.CV

TL;DR: This paper proposes CoreMark, a robust and generalizable text watermarking framework using the novel CORE embedding paradigm, which enhances resistance to various attacks while preserving visual quality.

Details

Motivation: Text watermarking faces challenges in achieving robustness, generalizability, and imperceptibility; this work aims to overcome these limitations through a novel embedding approach. Method: A new embedding paradigm called CORE is introduced, which uses aligned black pixel segments for noise resistance. CoreMark dynamically extracts COREs from characters, selects robust characters based on CORE length, embeds data by modifying CORE thickness, and employs an adaptive embedding strength modulator based on font size. Result: CoreMark demonstrates excellent generalizability across languages and fonts, significantly improves resistance against screenshot, print-scan, and print camera attacks, and maintains high imperceptibility compared to existing methods. Conclusion: CoreMark, based on the CORE embedding paradigm, achieves superior robustness, generalizability, and imperceptibility in text watermarking, outperforming existing methods in resisting various attacks while maintaining visual integrity. Abstract: Text watermarking schemes have gained considerable attention in recent years, yet still face critical challenges in achieving simultaneous robustness, generalizability, and imperceptibility. This paper introduces a new embedding paradigm, termed CORE, which comprises several consecutively aligned black pixel segments. Its key innovation lies in its inherent noise resistance during transmission and broad applicability across languages and fonts. Based on the CORE, we present a text watermarking framework named CoreMark. Specifically, CoreMark first dynamically extracts COREs from characters. Then, the characters with stronger robustness are selected according to the lengths of COREs. By modifying the thickness of the CORE, the hidden data is embedded into the selected characters without causing significant visual distortions.
Moreover, a general plug-and-play embedding strength modulator is proposed, which can adaptively enhance the robustness for small font sizes by adjusting the embedding strength according to the font size. Experimental evaluation indicates that CoreMark demonstrates outstanding generalizability across multiple languages and fonts. Compared to existing methods, CoreMark achieves significant improvements in resisting screenshot, print-scan, and print camera attacks, while maintaining satisfactory imperceptibility.

[158] Unsupervised 3D Braided Hair Reconstruction from a Single-View Image

Jing Gao

Main category: cs.CV

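TL;DR: This paper proposes a new unsupervised method that uses a synthetic braid model to reconstruct complex 3D braided hairstyles from a single image, outperforming existing techniques.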

Details

Motivation: Reconstructing 3D braided hairstyles from single-view images remains challenging because braids have an intricate, interwoven structure and complex topology, which existing strand-based methods struggle to handle. Method: Uses a synthetic braid model inspired by braid theory to efficiently reconstruct 3D braided hair from single-view RGB images. Result: Extensive experiments show that the method performs better at reconstructing 3D braided hairstyles and supports more expressive hairstyle modeling in digital humans. Conclusion: The paper presents a novel unsupervised pipeline for 3D braided hair reconstruction that outperforms existing techniques in accuracy, realism, and efficiency. Abstract: Reconstructing 3D braided hairstyles from single-view images remains a challenging task due to the intricate interwoven structure and complex topologies of braids. Existing strand-based hair reconstruction methods typically focus on loose hairstyles and often struggle to capture the fine-grained geometry of braided hair. In this paper, we propose a novel unsupervised pipeline for efficiently reconstructing 3D braided hair from single-view RGB images. Leveraging a synthetic braid model inspired by braid theory, our approach effectively captures the complex intertwined structures of braids. Extensive experiments demonstrate that our method outperforms state-of-the-art approaches, providing superior accuracy, realism, and efficiency in reconstructing 3D braided hairstyles, supporting expressive hairstyle modeling in digital humans.

[159] Learning Counterfactually Decoupled Attention for Open-World Model Attribution

Yu Zheng,Boyang Gong,Fanye Kong,Yueqi Duan,Bingyao Yu,Wenzhao Zheng,Lei Chen,Jiwen Lu,Jie Zhou

Main category: cs.CV

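TL;DR: This paper proposes Counterfactually Decoupled Attention Learning (CDAL) for open-world model attribution. CDAL models the causal relationships between attentional visual traces and source model attribution, and counterfactually decouples model-specific artifacts from confounding source biases, which improves generalization to unseen attacks. Experiments show that CDAL substantially improves existing state-of-the-art models.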

Details

Motivation: Existing methods rely on handcrafted region partitioning or feature spaces, which can be confounded by spurious statistical correlations and struggle with novel attacks in open-world scenarios. Method: CDAL explicitly models the causal relationships between attentional visual traces and source model attribution, and counterfactually decouples discriminative model-specific artifacts from confounding source biases. Result: On existing open-world model attribution benchmarks, CDAL consistently improves state-of-the-art models by large margins with minimal computational overhead, particularly for unseen novel attacks. Conclusion: By modeling the causal relationships between attentional visual traces and source model attribution, CDAL improves generalization to unseen attacks in open-world model attribution. Abstract: In this paper, we propose a Counterfactually Decoupled Attention Learning (CDAL) method for open-world model attribution. Existing methods rely on handcrafted design of region partitioning or feature space, which could be confounded by the spurious statistical correlations and struggle with novel attacks in open-world scenarios. To address this, CDAL explicitly models the causal relationships between the attentional visual traces and source model attribution, and counterfactually decouples the discriminative model-specific artifacts from confounding source biases for comparison. In this way, the resulting causal effect provides a quantification on the quality of learned attention maps, thus encouraging the network to capture essential generation patterns that generalize to unseen source models by maximizing the effect. Extensive experiments on existing open-world model attribution benchmarks show that with minimal computational overhead, our method consistently improves state-of-the-art models by large margins, particularly for unseen novel attacks. Source code: https://github.com/yzheng97/CDAL.

[160] Dynamic Contrastive Learning for Hierarchical Retrieval: A Case Study of Distance-Aware Cross-View Geo-Localization

Suofei Zhang,Xinxin Wang,Xiaofu Wu,Quan Zhou,Haifeng Hu

Main category: cs.CV

TL;DR: This paper introduces DyCL, a novel contrastive learning framework for Distance-Aware Cross-View Geo-Localization, achieving superior performance on a new benchmark called DA-Campus.

Details

Motivation: To enable models to capture contextual information and reduce localization error costs in cross-view geo-localization tasks. Method: Dynamic Contrastive Learning (DyCL) was proposed and evaluated using the DA-Campus benchmark, which pairs multi-view imagery with precise distance annotations. Result: DyCL showed substantial improvements in hierarchical retrieval performance and overall cross-view geo-localization accuracy compared to existing methods. Conclusion: The proposed DyCL framework effectively addresses the DACVGL problem by progressively aligning feature representations, demonstrating significant improvements in performance. Abstract: Existing deep learning-based cross-view geo-localization methods primarily focus on improving the accuracy of cross-domain image matching, rather than enabling models to comprehensively capture contextual information around the target and minimize the cost of localization errors. To support systematic research into this Distance-Aware Cross-View Geo-Localization (DACVGL) problem, we construct Distance-Aware Campus (DA-Campus), the first benchmark that pairs multi-view imagery with precise distance annotations across three spatial resolutions. Based on DA-Campus, we formulate DACVGL as a hierarchical retrieval problem across different domains. Our study further reveals that, due to the inherent complexity of spatial relationships among buildings, this problem can only be addressed via a contrastive learning paradigm, rather than conventional metric learning. To tackle this challenge, we propose Dynamic Contrastive Learning (DyCL), a novel framework that progressively aligns feature representations according to hierarchical spatial margins. Extensive experiments demonstrate that DyCL is highly complementary to existing multi-scale metric learning methods and yields substantial improvements in both hierarchical retrieval performance and overall cross-view geo-localization accuracy. 
Our code and benchmark are publicly available at https://github.com/anocodetest1/DyCL.

[161] Frequency-enhanced Multi-granularity Context Network for Efficient Vertebrae Segmentation

Jian Shi,Tianqi You,Pingping Zhang,Hongli Zhang,Rui Xu,Haojie Li

Main category: cs.CV

TL;DR: This paper proposes FMC-Net, an improved method for vertebrae segmentation in 3D CT and MRI images that addresses image blurring and distinguishes similar vertebrae more effectively than existing approaches.

Details

Motivation: Current imaging techniques face challenges in reducing the impact of image blurring and distinguishing similar vertebrae, making accurate segmentation difficult. Method: A Frequency-enhanced Multi-granularity Context Network (FMC-Net) is introduced, which uses wavelet transform for lossless downsampling, High-frequency Feature Refinement (HFR) to amplify key features, and a Multi-granularity State Space Model (MG-SSM) to capture long-range dependencies with linear complexity. Result: Extensive experiments show that the proposed method outperforms state-of-the-art approaches on both CT and MRI vertebrae segmentation datasets. Conclusion: The proposed FMC-Net method improves the accuracy of vertebrae segmentation in 3D CT and MRI images by addressing issues like image blurring and distinguishing similar vertebrae. Abstract: Automated and accurate segmentation of individual vertebrae in 3D CT and MRI images is essential for various clinical applications. Due to the limitations of current imaging techniques and the complexity of spinal structures, existing methods still struggle with reducing the impact of image blurring and distinguishing similar vertebrae. To alleviate these issues, we introduce a Frequency-enhanced Multi-granularity Context Network (FMC-Net) to improve the accuracy of vertebrae segmentation. Specifically, we first apply wavelet transform for lossless downsampling to reduce the feature distortion in blurred images. The decomposed high and low-frequency components are then processed separately. For the high-frequency components, we apply a High-frequency Feature Refinement (HFR) to amplify the prominence of key features and filter out noise, restoring fine-grained details in blurred images.
For the low-frequency components, we use a Multi-granularity State Space Model (MG-SSM) to aggregate feature representations with different receptive fields, extracting spatially-varying contexts while capturing long-range dependencies with linear complexity. The utilization of multi-granularity contexts is essential for distinguishing similar vertebrae and improving segmentation accuracy. Extensive experiments demonstrate that our method outperforms state-of-the-art approaches on both CT and MRI vertebrae segmentation datasets. The source code is publicly available at https://github.com/anaanaa/FMCNet.
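
To illustrate what "wavelet transform for lossless downsampling" can mean, here is a small sketch that assumes a one-level 2D Haar transform on a plain nested list; the wavelet actually used by FMC-Net is not stated in the abstract, and the function names are hypothetical. The transform halves the resolution while keeping four sub-bands (one low-frequency, three high-frequency), and the inverse recovers the input exactly.

```python
def haar2d_forward(image):
    """Split an even-sized 2D grid into four half-resolution sub-bands."""
    approx, dx, dy, dxy = [], [], [], []
    for r in range(0, len(image), 2):
        row_a, row_x, row_y, row_xy = [], [], [], []
        for c in range(0, len(image[0]), 2):
            a, b = image[r][c], image[r][c + 1]          # top-left, top-right
            d, e = image[r + 1][c], image[r + 1][c + 1]  # bottom-left, bottom-right
            row_a.append((a + b + d + e) / 2)   # low-frequency approximation
            row_x.append((a - b + d - e) / 2)   # differences along each row
            row_y.append((a + b - d - e) / 2)   # differences along each column
            row_xy.append((a - b - d + e) / 2)  # diagonal differences
        approx.append(row_a)
        dx.append(row_x)
        dy.append(row_y)
        dxy.append(row_xy)
    return approx, dx, dy, dxy

def haar2d_inverse(approx, dx, dy, dxy):
    """Rebuild the full-resolution grid from the four sub-bands."""
    height, width = 2 * len(approx), 2 * len(approx[0])
    image = [[0.0] * width for _ in range(height)]
    for i in range(len(approx)):
        for j in range(len(approx[0])):
            s, x, y, z = approx[i][j], dx[i][j], dy[i][j], dxy[i][j]
            image[2 * i][2 * j] = (s + x + y + z) / 2
            image[2 * i][2 * j + 1] = (s - x + y - z) / 2
            image[2 * i + 1][2 * j] = (s + x - y - z) / 2
            image[2 * i + 1][2 * j + 1] = (s - x - y + z) / 2
    return image

scan = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]  # toy slice
bands = haar2d_forward(scan)
restored = haar2d_inverse(*bands)
```

Unlike strided pooling, nothing is discarded here: the information removed from the low-frequency band is kept in the three detail bands, which is what allows the high- and low-frequency components to be processed separately as the abstract describes.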

[162] Where, What, Why: Towards Explainable Driver Attention Prediction

Yuchen Zhou,Jiayu Tang,Xiaoyan Xiao,Yueyao Lin,Linkai Liu,Zipeng Guo,Hao Fei,Xiaobo Xia,Chao Gou

Main category: cs.CV

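TL;DR: This work introduces a new task paradigm and the LLada framework to deepen the understanding of driver attention mechanisms, with implications for several fields.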

Details

Motivation: Existing methods mainly predict where drivers look by generating spatial heatmaps, but they fail to capture the cognitive motivations behind attention allocation in specific contexts, which limits deeper understanding of attention mechanisms. Method: Introduces Explainable Driver Attention Prediction as a new task paradigm and proposes LLada, a framework that unifies pixel modeling, semantic parsing, and cognitive reasoning within an end-to-end architecture. Result: Extensive experiments demonstrate the effectiveness of LLada, which generalizes robustly across datasets and driving conditions. Conclusion: The work is a key step toward understanding driver attention mechanisms, with significant implications for autonomous driving, intelligent driver training, and human-computer interaction. Abstract: Modeling task-driven attention in driving is a fundamental challenge for both autonomous vehicles and cognitive science. Existing methods primarily predict where drivers look by generating spatial heatmaps, but fail to capture the cognitive motivations behind attention allocation in specific contexts, which limits deeper understanding of attention mechanisms. To bridge this gap, we introduce Explainable Driver Attention Prediction, a novel task paradigm that jointly predicts spatial attention regions (where), parses attended semantics (what), and provides cognitive reasoning for attention allocation (why). To support this, we present W3DA, the first large-scale explainable driver attention dataset. It enriches existing benchmarks with detailed semantic and causal annotations across diverse driving scenarios, including normal conditions, safety-critical situations, and traffic accidents. We further propose LLada, a Large Language model-driven framework for driver attention prediction, which unifies pixel modeling, semantic parsing, and cognitive reasoning within an end-to-end architecture. Extensive experiments demonstrate the effectiveness of LLada, exhibiting robust generalization across datasets and driving conditions. This work serves as a key step toward a deeper understanding of driver attention mechanisms, with significant implications for autonomous driving, intelligent driver training, and human-computer interaction.

[163] DC-TTA: Divide-and-Conquer Framework for Test-Time Adaptation of Interactive Segmentation

Jihun Kim,Hoyong Kwon,Hyeokjun Kweon,Wooseong Jeong,Kuk-Jin Yoon

Main category: cs.CV

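TL;DR: This paper introduces DC-TTA, a new test-time adaptation framework that uses user interactions as supervision to address the Segment Anything Model's weaknesses in specialized domains and complex scenarios.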

Details

Motivation: The Segment Anything Model (SAM) often struggles in specialized domains or in complex cases such as camouflaged or multi-part objects, so a new method is needed to improve its performance. Method: Proposes DC-TTA, a new test-time adaptation framework that uses user interactions as supervision and applies a divide-and-conquer strategy to adapt SAM on a per-sample basis. Result: Experiments show that DC-TTA significantly outperforms SAM's zero-shot results and conventional TTA methods across various benchmarks, handling complex tasks such as camouflaged object segmentation with fewer interactions and higher accuracy. Conclusion: By partitioning user clicks into more coherent subsets and running TTA on each one separately, DC-TTA overcomes SAM's weaknesses in specialized domains and complex scenarios and significantly outperforms SAM's zero-shot results and conventional TTA methods across various benchmarks. Abstract: Interactive segmentation (IS) allows users to iteratively refine object boundaries with minimal cues, such as positive and negative clicks. While the Segment Anything Model (SAM) has garnered attention in the IS community for its promptable segmentation capabilities, it often struggles in specialized domains or when handling complex scenarios (e.g., camouflaged or multi-part objects). To overcome these challenges, we propose DC-TTA, a novel test-time adaptation (TTA) framework that adapts SAM on a per-sample basis by leveraging user interactions as supervision. Instead of forcing a single model to incorporate all user clicks at once, DC-TTA partitions the clicks into more coherent subsets, each processed independently via TTA with a separated model. This Divide-and-Conquer strategy reduces conflicts among diverse cues and enables more localized updates. Finally, we merge the adapted models to form a unified predictor that integrates the specialized knowledge from each subset. Experimental results across various benchmarks demonstrate that DC-TTA significantly outperforms SAM's zero-shot results and conventional TTA methods, effectively handling complex tasks such as camouflaged object segmentation with fewer interactions and improved accuracy.

[164] Computer-Aided Multi-Stroke Character Simplification by Stroke Removal

Ryo Ishiyama,Shinnosuke Matsuo,Seiichi Uchida

Main category: cs.CV

TL;DR: This paper proposes a framework to simplify multi-stroke characters while maintaining legibility by selectively removing strokes with minimal impact, assessed using a high-accuracy recognition model.

Details

Motivation: The motivation is to simplify complex multi-stroke characters in scripts like Chinese and Japanese without degrading their legibility, thereby reducing learning barriers for non-native speakers and contributing to efficient communication systems. Method: The method involves using a highly accurate character recognition model to assess legibility and determine which strokes can be removed with minimal impact. Result: Experimental results on 1,256 character classes show that many characters remain distinguishable even after multiple strokes are removed. Conclusion: The paper concludes that multi-stroke characters can be systematically simplified by selectively removing strokes with minimal impact on legibility, suggesting potential for more formalized simplification strategies. Abstract: Multi-stroke characters in scripts such as Chinese and Japanese can be highly complex, posing significant challenges for both native speakers and, especially, non-native learners. If these characters can be simplified without degrading their legibility, it could reduce learning barriers for non-native speakers, facilitate simpler and legible font designs, and contribute to efficient character-based communication systems. In this paper, we propose a framework to systematically simplify multi-stroke characters by selectively removing strokes while preserving their overall legibility. More specifically, we use a highly accurate character recognition model to assess legibility and remove those strokes that minimally impact it. Experimental results on 1,256 character classes with 5, 10, 15, and 20 strokes reveal several key findings, including the observation that even after removing multiple strokes, many characters remain distinguishable. These findings suggest the potential for more formalized simplification strategies.
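
The stroke-removal loop described in the abstract can be sketched as a greedy search. This is only an illustration under stated assumptions: `legibility` is a hypothetical callable standing in for the recognition model's confidence, the stopping threshold `min_score` is invented for the example, and the paper's actual removal order and stopping rule may differ.

```python
def simplify_character(strokes, legibility, min_score):
    """Greedily drop the stroke whose removal hurts legibility the least.

    `legibility` is a hypothetical stand-in for a recognizer's confidence in
    the correct class, evaluated on the strokes that remain.
    """
    remaining = list(strokes)
    removed = []
    while len(remaining) > 1:
        # Score every single-stroke removal and keep the least harmful one.
        best_score, best_idx = max(
            (legibility(remaining[:i] + remaining[i + 1:]), i)
            for i in range(len(remaining))
        )
        if best_score < min_score:
            break  # any further removal would drop below the threshold
        removed.append(remaining.pop(best_idx))
    return remaining, removed

# Toy scorer: each stroke contributes a fixed, made-up number of points.
IMPORTANCE = {"a": 50, "b": 30, "c": 15, "d": 5}

def toy_legibility(strokes):
    return sum(IMPORTANCE[s] for s in strokes)

kept, dropped = simplify_character(["a", "b", "c", "d"], toy_legibility, min_score=75)
```

In this toy run the two least important strokes are removed and the two dominant ones are kept. With a real recognizer the score of a stroke subset is not additive, so each candidate removal would need its own rendering and forward pass.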

Zhiyuan Zhu,Jian Wang,Yong Jiang,Tong Han,Yuhao Huang,Ang Zhang,Kaiwen Yang,Mingyuan Luo,Zhe Liu,Yaofei Duan,Dong Ni,Tianhong Tang,Xin Yang

Main category: cs.CV

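TL;DR: This paper proposes CVC-RF, a new deep learning framework for accurate carotid plaque grading, aimed at improving risk assessment for cardiovascular and cerebrovascular diseases.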

Details

Motivation: Carotid plaques are small and show high intra-class variability, and existing deep learning-based multi-view classification methods focus on feature fusion while neglecting representation learning and the differences between class features. Method: Proposes a new Corpus-View-Category Refinement Framework (CVC-RF) comprising a novel center-memory contrastive loss, a cascaded down-sampling attention module, and a parameter-free mixture-of-experts weighting strategy. Result: The authors state that, to the best of their knowledge, CVC-RF is the first deep learning-based CPG method that follows the latest Carotid Plaque-RADS guidelines, and experiments show that it performs well on the CPG task. Conclusion: CVC-RF effectively models global features through multi-level refinement and achieves state-of-the-art performance on the challenging CPG task. Abstract: Accurate carotid plaque grading (CPG) is vital to assess the risk of cardiovascular and cerebrovascular diseases. Due to the small size and high intra-class variability of plaque, CPG is commonly evaluated using a combination of transverse and longitudinal ultrasound views in clinical practice. However, most existing deep learning-based multi-view classification methods focus on feature fusion across different views, neglecting the importance of representation learning and the difference in class features. To address these issues, we propose a novel Corpus-View-Category Refinement Framework (CVC-RF) that processes information from Corpus-, View-, and Category-levels, enhancing model performance. Our contribution is four-fold. First, to the best of our knowledge, ours is the first deep learning-based method for CPG according to the latest Carotid Plaque-RADS guidelines. Second, we propose a novel center-memory contrastive loss, which enhances the network's global modeling capability by comparing with representative cluster centers and diverse negative samples at the Corpus level. Third, we design a cascaded down-sampling attention module to fuse multi-scale information and achieve implicit feature interaction at the View level. Finally, a parameter-free mixture-of-experts weighting strategy is introduced to leverage class clustering knowledge to weight different experts, enabling feature decoupling at the Category level. Experimental results indicate that CVC-RF effectively models global features via multi-level refinement, achieving state-of-the-art performance in the challenging CPG task.

[166] MoCa: Modality-aware Continual Pre-training Makes Better Bidirectional Multimodal Embeddings

Haonan Chen,Hong Liu,Yuping Luo,Liang Wang,Nan Yang,Furu Wei,Zhicheng Dou

Main category: cs.CV

TL;DR: MoCa is a two-stage method for building better bidirectional multimodal embedding models by leveraging continual pre-training and contrastive fine-tuning with diverse data, leading to improved performance and scalability.

Details

Motivation: Current multimodal embedding models suffer from suboptimal causal attention for embeddings, scalability issues due to reliance on labeled data, and limited diversity in training objectives and data. Method: MoCa proposes a two-stage framework: Modality-aware Continual Pre-training with a joint reconstruction objective to enhance bidirectional reasoning, followed by Heterogeneous Contrastive Fine-tuning that uses diverse multimodal data for better generalization. Result: Experiments show that MoCa consistently improves performance across MMEB and ViDoRe-v2 benchmarks, achieves new state-of-the-art results, and scales effectively with both model size and training data on MMEB. Conclusion: MoCa addresses the limitations of current multimodal embedding models by transforming pre-trained VLMs into effective bidirectional multimodal embedding models, achieving state-of-the-art results and showing strong scalability. Abstract: Multimodal embedding models, built upon causal Vision Language Models (VLMs), have shown promise in various tasks. However, current approaches face three key limitations: the use of causal attention in VLM backbones is suboptimal for embedding tasks; scalability issues due to reliance on high-quality labeled paired data for contrastive learning; and limited diversity in training objectives and data. To address these issues, we propose MoCa, a two-stage framework for transforming pre-trained VLMs into effective bidirectional multimodal embedding models. The first stage, Modality-aware Continual Pre-training, introduces a joint reconstruction objective that simultaneously denoises interleaved text and image inputs, enhancing bidirectional context-aware reasoning. The second stage, Heterogeneous Contrastive Fine-tuning, leverages diverse, semantically rich multimodal data beyond simple image-caption pairs to enhance generalization and alignment. 
Our method addresses the stated limitations by introducing bidirectional attention through continual pre-training, scaling effectively with massive unlabeled datasets via joint reconstruction objectives, and utilizing diverse multimodal data for enhanced representation robustness. Experiments demonstrate that MoCa consistently improves performance across MMEB and ViDoRe-v2 benchmarks, achieving new state-of-the-art results, and exhibits strong scalability with both model size and training data on MMEB.

[167] Enhancing Spatial Reasoning in Multimodal Large Language Models through Reasoning-based Segmentation

Zhenhua Ning,Zhuotao Tian,Shaoshuai Shi,Guangming Lu,Daojing He,Wenjie Pei,Li Jiang

Main category: cs.CV

TL;DR: Proposes R^2S, a new reasoning-based segmentation framework for 3D point clouds, and 3D ReasonSeg, a large-scale dataset, to improve spatial reasoning.

Details

Motivation: Although vision-language alignment with large language models has advanced point cloud perception, existing methods still struggle with complex instructions that require precise spatial reasoning. Method: Proposes the reasoning-based segmentation framework R^2S and introduces 3D ReasonSeg, a new dataset with 25,185 training samples and 3,966 validation samples. Result: Quantitative and qualitative experiments demonstrate the effectiveness of R^2S and 3D ReasonSeg. Conclusion: R^2S and 3D ReasonSeg give 3D point cloud perception stronger spatial reasoning capabilities and may serve as a new baseline and benchmark for future work. Abstract: Recent advances in point cloud perception have demonstrated remarkable progress in scene understanding through vision-language alignment leveraging large language models (LLMs). However, existing methods may still encounter challenges in handling complex instructions that require accurate spatial reasoning, even if the 3D point cloud data provides detailed spatial cues such as size and position for identifying the targets. To tackle this issue, we propose Relevant Reasoning Segmentation (R$^2$S), a reasoning-based segmentation framework. The framework emulates human cognitive processes by decomposing spatial reasoning into two sequential stages: first identifying relevant elements, then processing instructions guided by their associated visual priors. Furthermore, acknowledging the inadequacy of existing datasets in complex reasoning tasks, we introduce 3D ReasonSeg, a reasoning-based segmentation dataset comprising 25,185 training samples and 3,966 validation samples with precise annotations. Both quantitative and qualitative experiments demonstrate that the R$^2$S and 3D ReasonSeg effectively endow 3D point cloud perception with stronger spatial reasoning capabilities, and we hope that they can serve as a new baseline and benchmark for future work.

[168] Dare to Plagiarize? Plagiarized Painting Recognition and Retrieval

Sophie Zhou,Shu Kong

Main category: cs.CV

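TL;DR: This paper proposes a DINOv2-based method for detecting plagiarized paintings and validates it on a synthesized dataset, finding that finetuning the model improves retrieval but lowers recognition accuracy.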

Details

Motivation: Art plagiarism detection is crucial for protecting artists' copyrights and intellectual property, yet it remains a challenge in forensic analysis. The paper therefore builds a dataset of authentic paintings and synthesized plagiarized versions in order to recognize plagiarized paintings and explain the detections. Method: Extracts features with the visual foundation model DINOv2 and classifies plagiarism using a similarity threshold, then finetunes DINOv2 with a metric learning loss to improve retrieval quality. Result: The non-learned baseline achieves high recognition accuracy (97.2%) but low retrieval precision (29.0% average precision). Finetuning DINOv2 improves retrieval performance by 12% AP but lowers recognition accuracy to 92.7%. Conclusion: Using a dataset synthesized with generative AI to recognize plagiarized paintings and explain detected plagiarism is feasible, but it involves a trade-off between retrieval quality and recognition accuracy: finetuning improves retrieval performance while reducing recognition accuracy, an issue that future research could address. Abstract: Art plagiarism detection plays a crucial role in protecting artists' copyrights and intellectual property, yet it remains a challenging problem in forensic analysis. In this paper, we address the task of recognizing plagiarized paintings and explaining the detected plagiarism by retrieving visually similar authentic artworks. To support this study, we construct a dataset by collecting painting photos and synthesizing plagiarized versions using generative AI, tailored to specific artists' styles. We first establish a baseline approach using off-the-shelf features from the visual foundation model DINOv2 to retrieve the most similar images in the database and classify plagiarism based on a similarity threshold. Surprisingly, this non-learned method achieves a high recognition accuracy of 97.2\% but suffers from low retrieval precision (29.0\% average precision, AP). To improve retrieval quality, we finetune DINOv2 with a metric learning loss using positive and negative sample pairs sampled in the database. The finetuned model greatly improves retrieval performance by 12\% AP over the baseline, though it unexpectedly results in a lower recognition accuracy (92.7\%). We conclude with insightful discussions and outline directions for future research.
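
The baseline in this abstract (retrieve the most similar database images, then threshold the similarity) and the AP metric it reports can be sketched in a few lines. The sketch assumes cosine similarity over precomputed feature vectors; the abstract does not name the similarity measure, the toy 2D vectors stand in for DINOv2 features, and the threshold value is made up.

```python
import math

def cosine(u, v):
    """Cosine similarity between two non-zero feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def rank_gallery(query, gallery):
    """Return gallery indices sorted by decreasing similarity, plus the scores."""
    sims = [cosine(query, g) for g in gallery]
    order = sorted(range(len(gallery)), key=lambda i: sims[i], reverse=True)
    return order, sims

def is_plagiarized(query, gallery, threshold):
    """Flag the query if its best match reaches the (assumed) threshold."""
    order, sims = rank_gallery(query, gallery)
    return sims[order[0]] >= threshold

def average_precision(order, relevant):
    """Mean of precision@k taken at the ranks where relevant items appear."""
    hits, total = 0, 0.0
    for rank, idx in enumerate(order, start=1):
        if idx in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0

gallery = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # toy stand-ins for image features
query = [0.9, 0.1]
order, sims = rank_gallery(query, gallery)
flagged = is_plagiarized(query, gallery, threshold=0.95)
ap = average_precision(order, relevant={0, 1})
```

The two reported metrics depend on different parts of the ranking: recognition only looks at the top similarity score, while AP rewards placing every relevant artwork near the top. That difference is one plausible reading of why finetuning can raise AP and still lower recognition accuracy.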

[169] RoboScape: Physics-informed Embodied World Model

Yu Shang,Xin Zhang,Yinzhou Tang,Lei Jin,Chen Gao,Wei Wu,Yong Li

Main category: cs.CV

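TL;DR: This paper proposes RoboScape, a unified physics-informed world model that combines temporal depth prediction and keypoint dynamics learning to improve 3D geometric consistency and complex motion modeling, yielding high-quality video generation across diverse robotic scenarios along with practical downstream utility.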

Details

Motivation: Current embodied world models are limited in modeling 3D geometry and motion dynamics, which leads to unrealistic video generation for contact-rich robotic scenarios. Method: Introduces two tasks, temporal depth prediction and keypoint dynamics learning, and jointly learns RGB video generation and physics knowledge within a unified framework. Result: Experiments show that RoboScape generates videos with superior visual fidelity and physical plausibility across diverse robotic scenarios, and its utility is validated through downstream applications including robotic policy training and evaluation. Conclusion: RoboScape offers a new physics-informed world model approach that improves the quality of robotic video generation and the physical plausibility of motion modeling. Abstract: World models have become indispensable tools for embodied intelligence, serving as powerful simulators capable of generating realistic robotic videos while addressing critical data scarcity challenges. However, current embodied world models exhibit limited physical awareness, particularly in modeling 3D geometry and motion dynamics, resulting in unrealistic video generation for contact-rich robotic scenarios. In this paper, we present RoboScape, a unified physics-informed world model that jointly learns RGB video generation and physics knowledge within an integrated framework. We introduce two key physics-informed joint training tasks: temporal depth prediction that enhances 3D geometric consistency in video rendering, and keypoint dynamics learning that implicitly encodes physical properties (e.g., object shape and material characteristics) while improving complex motion modeling. Extensive experiments demonstrate that RoboScape generates videos with superior visual fidelity and physical plausibility across diverse robotic scenarios. We further validate its practical utility through downstream applications including robotic policy training with generated data and policy evaluation. Our work provides new insights for building efficient physics-informed world models to advance embodied intelligence research. The code is available at: https://github.com/tsinghua-fib-lab/RoboScape.

[170] VisualPrompter: Prompt Optimization with Visual Feedback for Text-to-Image Synthesis

Shiyu Wu,Mingzhen Sun,Weining Wang,Yequan Wang,Jing Liu

Main category: cs.CV

TL;DR: This paper introduces VisualPrompter, a new training-free framework for optimizing text-to-image prompts in diffusion models. It aims to bridge the gap between user input and model preference, enhancing both the aesthetic quality and semantic alignment of generated images through self-reflection and fine-grained prompt optimization.

Details

Motivation: The motivation is to address the issue where existing text-to-image prompt engineering methods often result in visually appealing but semantically misaligned outputs, meaning the images look good but do not accurately reflect the user's description. Method: The method involves developing a training-free framework called VisualPrompter. It uses an automatic self-reflection module to identify missing concepts in the generated images and a target-specific prompt optimization mechanism to refine prompts in detail, thereby improving the alignment between the text descriptions and the resulting images. Result: The experiments showed that VisualPrompter effectively enhances the alignment between text descriptions and generated images, achieving state-of-the-art performance on multiple benchmarks. Additionally, its plug-and-play design allows it to be easily adapted to various generative models. Conclusion: VisualPrompter successfully bridges the gap between user-provided and model-preferred prompts, offering a versatile solution that improves both the aesthetic quality and semantic accuracy of images generated by diffusion models without requiring additional training. Abstract: Since there exists a notable gap between user-provided and model-preferred prompts, generating high-quality and satisfactory images using diffusion models often requires prompt engineering to optimize user inputs. Current studies on text-to-image prompt engineering can effectively enhance the style and aesthetics of generated images. However, they often neglect the semantic alignment between generated images and user descriptions, resulting in visually appealing but content-wise unsatisfying outputs. In this work, we propose VisualPrompter, a novel training-free prompt engineering framework that refines user inputs to model-preferred sentences. 
In particular, VisualPrompter utilizes an automatic self-reflection module to identify the missing concepts in generated images and a target-specific prompt optimization mechanism to revise the prompts in a fine-grained manner. Extensive experiments demonstrate the effectiveness of our VisualPrompter, which achieves new state-of-the-art performance on multiple benchmarks for text-image alignment evaluation. Additionally, our framework features a plug-and-play design, making it highly adaptable to various generative models.

[171] AlignCVC: Aligning Cross-View Consistency for Single-Image-to-3D Generation

Xinyue Liang,Zhiyuan Ma,Lingchen Sun,Yanjun Guo,Lei Zhang

Main category: cs.CV

TL;DR: AlignCVC enhances single-image-to-3D generation by aligning distributions for better consistency.

Details

Motivation: Existing methods suffer from poor cross-view consistency due to noisy intermediate outputs. Method: Distribution alignment using a soft-hard alignment strategy, integrating multi-view generation with 3D reconstruction models. Result: Significantly improved generation quality and faster inference in as few as 4 steps. Conclusion: AlignCVC improves cross-view consistency and accelerates inference for single-image-to-3D generation. Abstract: Single-image-to-3D models typically follow a sequential generation and reconstruction workflow. However, intermediate multi-view images synthesized by pre-trained generation models often lack cross-view consistency (CVC), significantly degrading 3D reconstruction performance. While recent methods attempt to refine CVC by feeding reconstruction results back into the multi-view generator, these approaches struggle with noisy and unstable reconstruction outputs that limit effective CVC improvement. We introduce AlignCVC, a novel framework that fundamentally re-frames single-image-to-3D generation through distribution alignment rather than relying on strict regression losses. Our key insight is to align both generated and reconstructed multi-view distributions toward the ground-truth multi-view distribution, establishing a principled foundation for improved CVC. Observing that generated images exhibit weak CVC while reconstructed images display strong CVC due to explicit rendering, we propose a soft-hard alignment strategy with distinct objectives for generation and reconstruction models. This approach not only enhances generation quality but also dramatically accelerates inference to as few as 4 steps. As a plug-and-play paradigm, our method, namely AlignCVC, seamlessly integrates various multi-view generation models with 3D reconstruction models. Extensive experiments demonstrate the effectiveness and efficiency of AlignCVC for single-image-to-3D generation.

[172] MEMFOF: High-Resolution Training for Memory-Efficient Multi-Frame Optical Flow Estimation

Vladislav Bargatin,Egor Chistov,Alexander Yakovenko,Dmitriy Vatolin

Main category: cs.CV

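TL;DR: MEMFOF is an efficient and accurate multi-frame optical flow estimation method that is particularly suited to high-resolution inputs and performs strongly on multiple benchmarks.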

Details

Motivation: Existing optical flow methods pursue accuracy at the cost of growing GPU memory consumption, especially for high-resolution inputs, so a more memory-efficient method is needed. Method: By revisiting design choices of RAFT-like architectures, including reduced correlation volumes, a high-resolution training protocol, and multi-frame estimation, the authors obtain a memory-friendly multi-frame optical flow method. Result: MEMFOF needs only 2.09 GB of GPU memory at runtime and 28.5 GB during training, and achieves a 1-pixel outlier rate of 3.289% on Spring, an endpoint error of 0.963 on Sintel, and an Fl-all error of 2.94% on KITTI-2015. Conclusion: MEMFOF is a memory-efficient multi-frame optical flow method that substantially reduces GPU memory consumption while maintaining accuracy on high-resolution inputs, and achieves state-of-the-art performance on multiple benchmarks. Abstract: Recent advances in optical flow estimation have prioritized accuracy at the cost of growing GPU memory consumption, particularly for high-resolution (FullHD) inputs. We introduce MEMFOF, a memory-efficient multi-frame optical flow method that identifies a favorable trade-off between multi-frame estimation and GPU memory usage. Notably, MEMFOF requires only 2.09 GB of GPU memory at runtime for 1080p inputs, and 28.5 GB during training, which uniquely positions our method to be trained at native 1080p without the need for cropping or downsampling. We systematically revisit design choices from RAFT-like architectures, integrating reduced correlation volumes and high-resolution training protocols alongside multi-frame estimation, to achieve state-of-the-art performance across multiple benchmarks while substantially reducing memory overhead. Our method outperforms more resource-intensive alternatives in both accuracy and runtime efficiency, validating its robustness for flow estimation at high resolutions. At the time of submission, our method ranks first on the Spring benchmark with a 1-pixel (1px) outlier rate of 3.289, leads Sintel (clean) with an endpoint error (EPE) of 0.963, and achieves the best Fl-all error on KITTI-2015 at 2.94%. The code is available at https://github.com/msu-video-group/memfof.
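The three numbers quoted for MEMFOF (EPE, 1px outlier rate, Fl-all) are standard optical flow metrics. The block below is a minimal sketch of how such metrics are commonly defined, not the benchmarks' official evaluation code: the function name `flow_metrics` and the flat list-of-vectors input are assumptions made for illustration, and the Fl-all rule used here (error above 3 px and above 5% of the ground-truth magnitude) is the usual KITTI-2015 convention.

```python
import math

def flow_metrics(pred, gt):
    """Return (EPE, 1px outlier rate in %, Fl-all in %).

    pred, gt: equal-length lists of (u, v) flow vectors, one per valid pixel.
    """
    # Per-pixel endpoint error: Euclidean distance between predicted and true flow.
    errors = [math.hypot(pu - gu, pv - gv) for (pu, pv), (gu, gv) in zip(pred, gt)]
    n = len(errors)
    epe = sum(errors) / n
    # 1px outlier rate: share of pixels whose endpoint error exceeds one pixel.
    one_px = 100.0 * sum(e > 1.0 for e in errors) / n
    # Fl-all (KITTI convention): error > 3 px AND > 5% of ground-truth magnitude.
    fl_all = 100.0 * sum(
        e > 3.0 and e > 0.05 * math.hypot(gu, gv)
        for e, (gu, gv) in zip(errors, gt)
    ) / n
    return epe, one_px, fl_all

# Toy example with four pixels (values are made up).
pred = [(0.0, 0.0), (3.0, 4.0), (1.0, 0.0), (10.0, 0.0)]
gt = [(0.0, 0.0), (0.0, 0.0), (1.0, 0.5), (10.0, 0.0)]
print(flow_metrics(pred, gt))
```

In the toy example only the second pixel is wrong by more than one pixel, so one of four pixels counts as an outlier under both thresholds.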

Jie Feng,Shengyuan Wang,Tianhui Liu,Yanxin Xi,Yong Li

Main category: cs.CV

TL;DR: This paper introduces UrbanLLaVA, a multi-modal large language model designed for urban research, capable of handling various data types and performing well across diverse urban tasks.

Details

Motivation: Current methods in urban research focus on specific data types and lack a unified framework to process diverse urban data. The emergence of multi-modal large language models (MLLMs) offers an opportunity to overcome this limitation by enabling comprehensive processing of multi-modal data. Method: The paper proposes UrbanLLaVA, which includes a curated dataset of single-modal and cross-modal urban data, a multi-stage training framework that decouples spatial reasoning from domain knowledge learning, and an extended benchmark for evaluating MLLMs in urban tasks. Result: UrbanLLaVA achieves strong performance in both single-modal and complex cross-modal urban tasks, surpassing open-source and proprietary MLLMs, and demonstrates robust generalization across three different cities. Conclusion: UrbanLLaVA is a multi-modal large language model that outperforms existing MLLMs in urban research tasks across multiple cities, showing robust generalization abilities. Abstract: Urban research involves a wide range of scenarios and tasks that require the understanding of multi-modal data. Current methods often focus on specific data types and lack a unified framework in urban field for processing them comprehensively. The recent success of multi-modal large language models (MLLMs) presents a promising opportunity to overcome this limitation. In this paper, we introduce $\textit{UrbanLLaVA}$, a multi-modal large language model designed to process these four types of data simultaneously and achieve strong performance across diverse urban tasks compared with general MLLMs. In $\textit{UrbanLLaVA}$, we first curate a diverse urban instruction dataset encompassing both single-modal and cross-modal urban data, spanning from location view to global view of urban environment. 
Additionally, we propose a multi-stage training framework that decouples spatial reasoning enhancement from domain knowledge learning, thereby improving the compatibility and downstream performance of $\textit{UrbanLLaVA}$ across diverse urban tasks. Finally, we also extend existing benchmark for urban research to assess the performance of MLLMs across a wide range of urban tasks. Experimental results from three cities demonstrate that $\textit{UrbanLLaVA}$ outperforms open-source and proprietary MLLMs in both single-modal tasks and complex cross-modal tasks and shows robust generalization abilities across cities. Source codes and data are openly accessible to the research community via https://github.com/tsinghua-fib-lab/UrbanLLaVA.

[174] Dynamic View Synthesis from Small Camera Motion Videos

Huiqiang Sun,Xingyi Li,Juewen Peng,Liao Shen,Zhiguo Cao,Ke Xian,Guosheng Lin

Main category: cs.CV

TL;DR: This paper introduces DDR, a new depth regularization technique combined with camera parameter learning, to improve novel view synthesis in dynamic 3D scenes with limited camera motion.

Details

Motivation: The motivation stems from the limitations of existing NeRF-based approaches for novel view synthesis in dynamic 3D scenes when camera motion is limited or stationary, which leads to incorrect scene geometry and inaccurate camera parameter estimation. Method: The authors propose a Distribution-based Depth Regularization (DDR) approach using Gumbel-softmax to calculate the expectation of the error from discrete rendering weight distribution. They also introduce constraints on volume density and incorporate camera parameter learning during training. Result: The experiments demonstrate that the proposed method effectively represents scenes with small camera motion input and achieves favorable results compared to state-of-the-art techniques. Conclusion: The paper concludes that their proposed method, DDR, along with camera parameter learning, effectively addresses the challenges of scene geometry representation and camera parameter estimation in dynamic 3D scenes with small camera motion, outperforming state-of-the-art methods. Abstract: Novel view synthesis for dynamic $3$D scenes poses a significant challenge. Many notable efforts use NeRF-based approaches to address this task and yield impressive results. However, these methods rely heavily on sufficient motion parallax in the input images or videos. When the camera motion range becomes limited or even stationary (i.e., small camera motion), existing methods encounter two primary challenges: incorrect representation of scene geometry and inaccurate estimation of camera parameters. These challenges make prior methods struggle to produce satisfactory results or even become invalid. To address the first challenge, we propose a novel Distribution-based Depth Regularization (DDR) that ensures the rendering weight distribution to align with the true distribution. 
Specifically, unlike previous methods that use depth loss to calculate the error of the expectation, we calculate the expectation of the error by using Gumbel-softmax to differentiably sample points from discrete rendering weight distribution. Additionally, we introduce constraints that enforce the volume density of spatial points before the object boundary along the ray to be near zero, ensuring that our model learns the correct geometry of the scene. To demystify the DDR, we further propose a visualization tool that enables observing the scene geometry representation at the rendering weight level. For the second challenge, we incorporate camera parameter learning during training to enhance the robustness of our model to camera parameters. We conduct extensive experiments to demonstrate the effectiveness of our approach in representing scenes with small camera motion input, and our results compare favorably to state-of-the-art methods.
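The distinction between "the error of the expectation" and "the expectation of the error" can be made concrete with a small numeric sketch. The code below is an illustration under stated assumptions, not the authors' implementation: the function names, the absolute-error form of the loss, and the temperature `tau` are all hypothetical, and plain Python stands in for a differentiable framework.

```python
import math
import random

def error_of_expectation(weights, depths, gt_depth):
    # Conventional depth loss: compare the expected depth with the target.
    expected = sum(w * d for w, d in zip(weights, depths))
    return abs(expected - gt_depth)

def expectation_of_error(weights, depths, gt_depth):
    # Distribution-aware loss: average the per-sample error under the weights.
    return sum(w * abs(d - gt_depth) for w, d in zip(weights, depths))

def gumbel_softmax_sample(weights, tau=0.5, rng=random):
    # Relaxed one-hot sample from a discrete distribution (Gumbel-softmax).
    eps = 1e-12
    scores = []
    for w in weights:
        u = min(max(rng.random(), eps), 1.0 - eps)
        gumbel = -math.log(-math.log(u))
        scores.append((math.log(w + eps) + gumbel) / tau)
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# A bimodal weight distribution along one ray: mass in front of and behind
# the true surface at depth 2.0, none on it.
weights = [0.5, 0.0, 0.5]
depths = [1.0, 2.0, 3.0]
print(error_of_expectation(weights, depths, 2.0))   # 0.0
print(expectation_of_error(weights, depths, 2.0))   # 1.0
```

The bimodal example shows why the second form is stricter: the expected depth matches the target exactly, so the conventional loss is zero even though no weight lies on the surface, while the expectation of the error still penalizes the distribution. `gumbel_softmax_sample` illustrates how a sample could be drawn in a relaxed, differentiable-friendly way to estimate that expectation.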

[175] Towards an Automated Multimodal Approach for Video Summarization: Building a Bridge Between Text, Audio and Facial Cue-Based Summarization

Md Moinul Islam,Sofoklis Kakouros,Janne Heikkilä,Mourad Oussalah

Main category: cs.CV

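TL;DR: This paper proposes a behaviour-aware multimodal video summarization framework that combines textual, audio, and visual cues to generate timestamp-aligned summaries, and improves the semantic relevance and expressive clarity of the summaries by identifying "bonus words" emphasized across modalities.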

Details

Motivation: With video content growing in educational, professional, and social domains, traditional unimodal summarization methods no longer suffice, and more effective multimodal summarization techniques are needed. Method: The framework extracts prosodic features, textual cues, and visual indicators to identify semantically and emotionally important moments, and introduces the concept of "bonus words" to improve summary quality; pseudo-ground-truth (pGT) summaries generated with an LLM-based extractive method are used for evaluation. Result: Compared with the traditional Edmundson method, ROUGE-1 rises from 0.4769 to 0.7929 and BERTScore from 0.9152 to 0.9536 on text-based metrics; in video-based evaluation, F1-Score improves by almost 23%. Conclusion: The results indicate that multimodal integration has considerable potential for producing comprehensive and behaviourally informed video summaries. Abstract: The increasing volume of video content in educational, professional, and social domains necessitates effective summarization techniques that go beyond traditional unimodal approaches. This paper proposes a behaviour-aware multimodal video summarization framework that integrates textual, audio, and visual cues to generate timestamp-aligned summaries. By extracting prosodic features, textual cues and visual indicators, the framework identifies semantically and emotionally important moments. A key contribution is the identification of bonus words, which are terms emphasized across multiple modalities and used to improve the semantic relevance and expressive clarity of the summaries. The approach is evaluated against pseudo-ground truth (pGT) summaries generated using LLM-based extractive method. Experimental results demonstrate significant improvements over traditional extractive method, such as the Edmundson method, in both text and video-based evaluation metrics. Text-based metrics show ROUGE-1 increasing from 0.4769 to 0.7929 and BERTScore from 0.9152 to 0.9536, while in video-based evaluation, our proposed framework improves F1-Score by almost 23%. The findings underscore the potential of multimodal integration in producing comprehensive and behaviourally informed video summaries.
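For readers unfamiliar with the ROUGE-1 figures quoted above, the following is a minimal sketch of the metric as unigram-overlap F1. It assumes whitespace tokenization and lowercasing, which is simpler than what standard ROUGE toolkits do, and the function name `rouge1_f1` is chosen here for illustration; the paper's exact evaluation setup is not specified in the abstract.

```python
from collections import Counter

def rouge1_f1(candidate, reference):
    """Unigram-overlap F1 between a candidate and a reference summary."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    # Clipped overlap: each unigram counts at most as often as in the reference.
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

# Made-up sentences: every candidate word appears in the reference
# (precision 1.0) but only half of the reference is covered (recall 0.5).
print(rouge1_f1("the cat sat", "the cat sat on the mat"))
```

Scores range from 0 to 1, so a rise from 0.4769 to 0.7929 means the generated summaries share substantially more unigrams with the pseudo-ground-truth summaries.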

[176] Self-Supervised Contrastive Learning for Multi-Label Images

Jiale Chen

Main category: cs.CV

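TL;DR: This paper proposes a self-supervised learning method suited to small sets of multi-label images, improving representation learning through block-wise augmentation and an image-aware contrastive loss.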

Details

Motivation: Mainstream self-supervised learning methods rely on large-scale single-label datasets such as ImageNet, which incurs heavy pre-training overhead and overlooks multi-label images that carry richer semantic information. Method: A block-wise augmentation module is first designed to extract additional potential positive view pairs from multi-label images; an image-aware contrastive loss is then designed to establish connections between these views, facilitating the extraction of semantically consistent representations. Result: Comprehensive linear fine-tuning and transfer learning experiments show that the method remains competitive despite challenging sample quality and quantity. Conclusion: The paper presents an improved self-supervised learning method for multi-label images that achieves strong representation learning with fewer samples. Abstract: Self-supervised learning (SSL) has demonstrated its effectiveness in learning representations through comparison methods that align with human intuition. However, mainstream SSL methods heavily rely on high body datasets with single label, such as ImageNet, resulting in intolerable pre-training overhead. Besides, more general multi-label images are frequently overlooked in SSL, despite their potential for richer semantic information and broader applicability in downstream scenarios. Therefore, we tailor the mainstream SSL approach to guarantee excellent representation learning capabilities using fewer multi-label images. Firstly, we propose a block-wise augmentation module aimed at extracting additional potential positive view pairs from multi-label images. Subsequently, an image-aware contrastive loss is devised to establish connections between these views, thereby facilitating the extraction of semantically consistent representations. Comprehensive linear fine-tuning and transfer learning validate the competitiveness of our approach despite challenging sample quality and quantity.

Hongxin Zhang,Zheyuan Zhang,Zeyuan Wang,Zunzhe Zhang,Lixing Fang,Qinhong Zhou,Chuang Gan

Main category: cs.CV

TL;DR: Ella is an embodied social agent that uses a structured memory system and foundation models to learn and interact effectively in a 3D open world, showcasing transformative potential in advancing embodied intelligence.

Details

Motivation: To advance embodied intelligence by creating an agent capable of lifelong learning through visual observations and social interactions in a realistic, open-world environment. Method: The paper introduces Ella's structured, long-term multimodal memory system, which includes semantic and episodic memory components. This system is integrated with foundation models for decision-making, planning, and social interaction in a dynamic 3D open world. Result: Experimental evaluations showed that Ella could successfully influence, lead, and cooperate with other agents to achieve goals, demonstrating its effectiveness in learning through observation and interaction. Conclusion: Ella, an embodied social agent, demonstrates transformative potential in advancing embodied intelligence by integrating a structured lifelong memory system with foundation models, enabling it to influence, lead, and cooperate effectively in a 3D open world. Abstract: We introduce Ella, an embodied social agent capable of lifelong learning within a community in a 3D open world, where agents accumulate experiences and acquire knowledge through everyday visual observations and social interactions. At the core of Ella's capabilities is a structured, long-term multimodal memory system that stores, updates, and retrieves information effectively. It consists of a name-centric semantic memory for organizing acquired knowledge and a spatiotemporal episodic memory for capturing multimodal experiences. By integrating this lifelong memory system with foundation models, Ella retrieves relevant information for decision-making, plans daily activities, builds social relationships, and evolves autonomously while coexisting with other intelligent beings in the open world. We conduct capability-oriented evaluations in a dynamic 3D open world where 15 agents engage in social activities for days and are assessed with a suite of unseen controlled evaluations. 
Experimental results show that Ella can influence, lead, and cooperate with other agents well to achieve goals, showcasing its ability to learn effectively through observation and social interaction. Our findings highlight the transformative potential of combining structured memory systems with foundation models for advancing embodied intelligence. More videos can be found at https://umass-embodied-agi.github.io/Ella/.

[178] STD-GS: Exploring Frame-Event Interaction for SpatioTemporal-Disentangled Gaussian Splatting to Reconstruct High-Dynamic Scene

Hanyu Zhou,Haonan Wang,Haoyue Liu,Yuxing Duan,Luxin Yan,Gim Hee Lee

Main category: cs.CV

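TL;DR: This paper proposes a high-dynamic scene reconstruction framework based on spatiotemporal-disentangled Gaussian splatting, combining event-camera and frame-camera data to resolve the spatiotemporal feature mismatch between background and dynamic objects.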

Details

Motivation: Existing unified representation models fall short on high-dynamic scenes with potentially discontinuous temporal features and heterogeneous spatial features between background and objects. Method: An event camera is introduced to compensate for the frame camera, and a spatiotemporal-disentangled Gaussian splatting framework is proposed that distinguishes the spatiotemporal features of background and objects via clustering and uses event data to guide the spatiotemporal disentanglement of object Gaussians. Result: Experiments verify the superiority of the method in improving spatiotemporal discrimination between background and dynamic objects and in rendering time-continuous dynamic scenes. Conclusion: The method effectively resolves the spatiotemporal feature mismatch between background and objects in high-dynamic scenes and improves reconstruction quality. Abstract: High-dynamic scene reconstruction aims to represent static background with rigid spatial features and dynamic objects with deformed continuous spatiotemporal features. Typically, existing methods adopt unified representation model (e.g., Gaussian) to directly match the spatiotemporal features of dynamic scene from frame camera. However, this unified paradigm fails in the potential discontinuous temporal features of objects due to frame imaging and the heterogeneous spatial features between background and objects. To address this issue, we disentangle the spatiotemporal features into various latent representations to alleviate the spatiotemporal mismatching between background and objects. In this work, we introduce event camera to compensate for frame camera, and propose a spatiotemporal-disentangled Gaussian splatting framework for high-dynamic scene reconstruction. As for dynamic scene, we figure out that background and objects have appearance discrepancy in frame-based spatial features and motion discrepancy in event-based temporal features, which motivates us to distinguish the spatiotemporal features between background and objects via clustering. As for dynamic object, we discover that Gaussian representations and event data share the consistent spatiotemporal characteristic, which could serve as a prior to guide the spatiotemporal disentanglement of object Gaussians. Within Gaussian splatting framework, the cumulative scene-object disentanglement can improve the spatiotemporal discrimination between background and objects to render the time-continuous dynamic scene. Extensive experiments have been performed to verify the superiority of the proposed method.

[179] MotionGPT3: Human Motion as a Second Modality

Bingfan Zhu,Biao Jiang,Sunyi Wang,Shixiang Tang,Tao Chen,Linjie Luo,Youyi Zheng,Xin Chen

Main category: cs.CV

TL;DR: MotionGPT3 introduces a unified bimodal motion-language framework that effectively addresses challenges in integrating continuous human motion with language, enabling high-fidelity motion modeling without compromising language intelligence.

Details

Motivation: Recent multimodal models have shown promise in unified understanding and generation, but the development of unified motion-language models remains underexplored. Two core challenges—reconstruction gaps between continuous motion and discrete representation, and degradation of language intelligence during unified training—need addressing. Method: The paper proposes MotionGPT3, which decouples motion modeling using separate parameters and integrates it with a pretrained language model through a shared attention mechanism. A motion Variational Autoencoder (VAE) is used to encode raw motion into latent representations, and a diffusion head predicts motion latents directly from hidden states in an autoregressive manner. Result: Extensive experiments demonstrate that MotionGPT3 achieves competitive performance on both motion understanding and generation tasks while maintaining strong language capabilities. Conclusion: MotionGPT3 successfully integrates human motion as a second modality into a unified motion-language model, achieving competitive performance on motion understanding and generation tasks while preserving strong language capabilities. Abstract: Though recent advances in multimodal models have demonstrated strong capabilities and opportunities in unified understanding and generation, the development of unified motion-language models remains underexplored. To enable such models with high-fidelity human motion, two core challenges must be addressed. The first is the reconstruction gap between the continuous motion modality and discrete representation in an autoregressive manner, and the second is the degradation of language intelligence during unified training. Inspired by the mixture of experts, we propose MotionGPT3, a bimodal motion-language model that treats human motion as a second modality, decoupling motion modeling via separate model parameters and enabling both effective cross-modal interaction and efficient multimodal scaling training. 
To preserve language intelligence, the text branch retains the original structure and parameters of the pretrained language model, while a new motion branch is integrated via a shared attention mechanism, enabling bidirectional information flow between two modalities. We first employ a motion Variational Autoencoder (VAE) to encode raw human motion into latent representations. Based on this continuous latent space, the motion branch predicts motion latents directly from intermediate hidden states using a diffusion head, bypassing discrete tokenization. Extensive experiments show that our approach achieves competitive performance on both motion understanding and generation tasks while preserving strong language capabilities, establishing a unified bimodal motion diffusion framework within an autoregressive manner.

[180] Trident: Detecting Face Forgeries with Adversarial Triplet Learning

Mustafa Hakan Kara,Aysegul Dundar,Uğur Güdükbay

Main category: cs.CV

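TL;DR: The paper proposes Trident, a new framework for face forgery detection that uses triplet learning, a Siamese network, and adversarial training to improve adaptability and generalization.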

Details

Motivation: As face forgeries generated by deep neural networks become increasingly sophisticated, detecting face manipulations in digital media has become a significant challenge. Method: Face forgery detection is performed with a Siamese network architecture trained with triplet learning and adversarial training, while gradient flow from the classifier head to the embedding model is blocked. Result: Trident captures fine-grained features that distinguish pristine samples from manipulated ones and is robust to unseen forgery methods. Conclusion: Trident is effective across different benchmarks and strengthens generalization by preventing overfitting to forgery-specific artifacts. Abstract: As face forgeries generated by deep neural networks become increasingly sophisticated, detecting face manipulations in digital media has posed a significant challenge, underscoring the importance of maintaining digital media integrity and combating visual disinformation. Current detection models, predominantly based on supervised training with domain-specific data, often falter against forgeries generated by unencountered techniques. In response to this challenge, we introduce Trident, a face forgery detection framework that employs triplet learning with a Siamese network architecture for enhanced adaptability across diverse forgery methods. Trident is trained on curated triplets to isolate nuanced differences of forgeries, capturing fine-grained features that distinguish pristine samples from manipulated ones while controlling for other variables. To further enhance generalizability, we incorporate domain-adversarial training with a forgery discriminator. This adversarial component guides our embedding model towards forgery-agnostic representations, improving its robustness to unseen manipulations. In addition, we prevent gradient flow from the classifier head to the embedding model, avoiding overfitting induced by artifacts peculiar to certain forgeries. Comprehensive evaluations across multiple benchmarks and ablation studies demonstrate the effectiveness of our framework. We will release our code in a GitHub repository.
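Triplet learning, as used in Trident, is typically driven by a margin-based loss over an anchor, a positive, and a negative embedding. The block below is a generic sketch of that standard loss in plain Python, not the paper's training code: the function names, the Euclidean distance, and the margin value of 0.2 are assumptions made for illustration.

```python
import math

def euclidean(a, b):
    # Distance between two embedding vectors of equal length.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def triplet_margin_loss(anchor, positive, negative, margin=0.2):
    """Standard triplet loss: pull the positive closer than the negative by `margin`."""
    d_pos = euclidean(anchor, positive)
    d_neg = euclidean(anchor, negative)
    return max(0.0, d_pos - d_neg + margin)

# Made-up 2-D embeddings.
anchor = (0.0, 0.0)
near = (0.0, 1.0)   # distance 1 from the anchor
far = (3.0, 4.0)    # distance 5 from the anchor

print(triplet_margin_loss(anchor, near, far))  # 0.0: the triplet is already satisfied
print(triplet_margin_loss(anchor, far, near))  # positive loss: the ordering is violated
```

The loss is zero once the positive sits closer to the anchor than the negative by at least the margin, so training effort concentrates on triplets that still violate the ordering.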

Mona Ahmadian,Amir Shirian,Frank Guerin,Andrew Gilbert

Main category: cs.CV

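TL;DR: This study proposes DEL, a new framework that outperforms existing techniques at detecting and classifying multiple actions in complex real-world videos.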

Details

Motivation: Real-world videos often contain overlapping events and complex temporal dependencies, which makes multimodal interaction modeling particularly challenging. Method: The DEL framework comprises an audio-visual feature alignment module and a multimodal interaction refinement module; the former uses masked self-attention to enhance intra-mode consistency, and the latter models cross-modal dependencies across multiple scales. Result: DEL achieves notable average mAP gains on UnAV-100, THUMOS14, ActivityNet 1.3, and EPIC-Kitchens-100 of +3.3%, +2.6%, +1.2%, +1.7% (verb), and +1.4% (noun), respectively. Conclusion: DEL achieves state-of-the-art performance on multiple real-world temporal action localization datasets and surpasses previous methods. Abstract: Real-world videos often contain overlapping events and complex temporal dependencies, making multimodal interaction modeling particularly challenging. We introduce DEL, a framework for dense semantic action localization, aiming to accurately detect and classify multiple actions at fine-grained temporal resolutions in long untrimmed videos. DEL consists of two key modules: the alignment of audio and visual features that leverage masked self-attention to enhance intra-mode consistency and a multimodal interaction refinement module that models cross-modal dependencies across multiple scales, enabling high-level semantics and fine-grained details. Our method achieves state-of-the-art performance on multiple real-world Temporal Action Localization (TAL) datasets, UnAV-100, THUMOS14, ActivityNet 1.3, and EPIC-Kitchens-100, surpassing previous approaches with notable average mAP gains of +3.3%, +2.6%, +1.2%, +1.7% (verb), and +1.4% (noun), respectively.
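The mAP figures reported for temporal action localization rest on temporal intersection-over-union (tIoU) between predicted and ground-truth segments. The following is a minimal sketch of that building block only, assuming segments are given as (start, end) pairs in seconds; the function name `temporal_iou` is illustrative and the full mAP computation over tIoU thresholds is omitted.

```python
def temporal_iou(seg_a, seg_b):
    """Intersection-over-union of two time segments given as (start, end)."""
    (s1, e1), (s2, e2) = seg_a, seg_b
    # Overlap is zero when the segments do not intersect.
    inter = max(0.0, min(e1, e2) - max(s1, s2))
    union = (e1 - s1) + (e2 - s2) - inter
    return inter / union if union > 0 else 0.0

# Made-up segments: 5 s of overlap out of a 15 s union.
print(temporal_iou((0.0, 10.0), (5.0, 15.0)))
# Disjoint segments never match.
print(temporal_iou((0.0, 1.0), (2.0, 3.0)))
```

A prediction is usually counted as correct when its tIoU with a same-class ground-truth segment exceeds a chosen threshold, and average mAP then averages over several thresholds.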

[182] Transformer-Based Person Search with High-Frequency Augmentation and Multi-Wave Mixing

Qilin Shu,Qixian Zhang,Qi Zhang,Hongyun Zhang,Duoqian Miao,Cairong Zhao

Main category: cs.CV

TL;DR: This paper proposes HAMW, a new method for person search that improves transformer efficiency and performance by enhancing high-frequency features and using wavelet fusion.

Details

Motivation: Transformer-based models face issues like suppression of high-frequency features and high computational cost in person search tasks. This work aims to overcome these limitations for improved performance and efficiency. Method: A three-stage framework combining High-frequency Augmentation and Multi-Wave mixing (HAMW) is proposed. It replaces self-attention layers with multi-level Haar wavelet fusion to improve efficiency and high-frequency feature perception. Result: HAMW achieves state-of-the-art performance on the CUHK-SYSU and PRW datasets while reducing computational complexity. Conclusion: The HAMW method addresses challenges in transformer-based models for person search by enhancing discriminative feature extraction, reducing computational overhead, and achieving state-of-the-art performance on datasets like CUHK-SYSU and PRW. Abstract: The person search task aims to locate a target person within a set of scene images. In recent years, transformer-based models in this field have made some progress. However, they still face three primary challenges: 1) the self-attention mechanism tends to suppress high-frequency components in the features, which severely impacts model performance; 2) the computational cost of transformers is relatively high. To address these issues, we propose a novel High-frequency Augmentation and Multi-Wave mixing (HAMW) method for person search. HAMW is designed to enhance the discriminative feature extraction capabilities of transformers while reducing computational overhead and improving efficiency. Specifically, we develop a three-stage framework that progressively optimizes both detection and re-identification performance. Our model enhances the perception of high-frequency features by learning from augmented inputs containing additional high-frequency components. Furthermore, we replace the self-attention layers in the transformer with a strategy based on multi-level Haar wavelet fusion to capture multi-scale features. 
This not only lowers the computational complexity but also alleviates the suppression of high-frequency features and enhances the ability to exploit multi-scale information. Extensive experiments demonstrate that HAMW achieves state-of-the-art performance on both the CUHK-SYSU and PRW datasets.
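To make the "multi-level Haar wavelet" idea concrete, the sketch below performs a plain 1-D orthonormal Haar decomposition. It is only an illustration of the underlying transform, not the HAMW fusion module: the function names are hypothetical, the input is a 1-D list rather than a feature map, and the signal length is assumed to be divisible by 2 at every level.

```python
import math

def haar_step(signal):
    """One Haar level: pairwise averages (low-pass) and differences (high-pass)."""
    scale = 1.0 / math.sqrt(2.0)
    approx = [(signal[i] + signal[i + 1]) * scale for i in range(0, len(signal), 2)]
    detail = [(signal[i] - signal[i + 1]) * scale for i in range(0, len(signal), 2)]
    return approx, detail

def haar_multilevel(signal, levels):
    """Repeatedly decompose the low-pass band; returns (final approx, details per level)."""
    approx = list(signal)
    details = []
    for _ in range(levels):
        approx, detail = haar_step(approx)
        details.append(detail)
    return approx, details

# Made-up signal of length 4, decomposed over two levels.
approx, details = haar_multilevel([4.0, 4.0, 2.0, 0.0], levels=2)
print(approx)   # coarsest low-frequency coefficient
print(details)  # high-frequency coefficients, finest level first
```

Each level costs time linear in the signal length and keeps the high-frequency detail coefficients explicitly, which is consistent with the abstract's claim that a wavelet-based replacement for self-attention can lower complexity while preserving high-frequency information.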

[183] BridgeShape: Latent Diffusion Schrödinger Bridge for 3D Shape Completion

Dequan Kong,Zhe Zhu,Honghua Chen,Mingqiang Wei

Main category: cs.CV

TL;DR: BridgeShape improves 3D shape completion by modeling the optimal transport path and using a depth-enhanced VQ-VAE for high-fidelity generation.

Details

Motivation: Existing methods fail to explicitly model the optimal global transport path and face resolution constraints when performing diffusion in voxel space. Method: BridgeShape uses latent diffusion Schrödinger bridge to model the optimal transport path and employs a Depth-Enhanced VQ-VAE for encoding 3D shapes into a compact latent space enriched with DINOv2 features. Result: BridgeShape achieves superior fidelity at higher resolutions and performs well on unseen object classes in large-scale 3D shape completion benchmarks. Conclusion: BridgeShape provides a novel and effective framework for 3D shape completion, achieving state-of-the-art performance on large-scale benchmarks. Abstract: Existing diffusion-based 3D shape completion methods typically use a conditional paradigm, injecting incomplete shape information into the denoising network via deep feature interactions (e.g., concatenation, cross-attention) to guide sampling toward complete shapes, often represented by voxel-based distance functions. However, these approaches fail to explicitly model the optimal global transport path, leading to suboptimal completions. Moreover, performing diffusion directly in voxel space imposes resolution constraints, limiting the generation of fine-grained geometric details. To address these challenges, we propose BridgeShape, a novel framework for 3D shape completion via latent diffusion Schrödinger bridge. The key innovations lie in two aspects: (i) BridgeShape formulates shape completion as an optimal transport problem, explicitly modeling the transition between incomplete and complete shapes to ensure a globally coherent transformation. (ii) We introduce a Depth-Enhanced Vector Quantized Variational Autoencoder (VQ-VAE) to encode 3D shapes into a compact latent space, leveraging self-projected multi-view depth information enriched with strong DINOv2 features to enhance geometric structural perception. 
By operating in a compact yet structurally informative latent space, BridgeShape effectively mitigates resolution constraints and enables more efficient and high-fidelity 3D shape completion. BridgeShape achieves state-of-the-art performance on large-scale 3D shape completion benchmarks, demonstrating superior fidelity at higher resolutions and for unseen object classes.

[184] TVG-SLAM: Robust Gaussian Splatting SLAM with Tri-view Geometric Constraints

Zhen Tan,Xieyuanli Chen,Lei Feng,Yangbing Ge,Shuaifeng Zhi,Jiaxiong Liu,Dewen Hu

Main category: cs.CV

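TL;DR: TVG-SLAM improves the robustness of RGB-only 3D Gaussian Splatting (3DGS) SLAM by introducing a tri-view geometry paradigm, addressing camera tracking in complex outdoor environments.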

Details

Motivation: Existing 3DGS SLAM systems rely heavily on photometric rendering loss for camera tracking, which undermines their robustness in unbounded outdoor environments with severe viewpoint and illumination changes. Method: TVG-SLAM uses a dense tri-view matching module and Hybrid Geometric Constraints to stabilize tracking, and adopts a new probabilistic initialization strategy and a Dynamic Attenuation of Rendering Trust mechanism to improve mapping quality. Result: Experiments show that TVG-SLAM outperforms prior RGB-only 3DGS-based SLAM systems on multiple public outdoor datasets, reducing the average Absolute Trajectory Error (ATE) by 69.0% on the most challenging dataset while achieving state-of-the-art rendering quality. Conclusion: By combining tri-view geometry with photometric loss, TVG-SLAM substantially improves tracking robustness and mapping quality; the implementation will be released as open source. Abstract: Recent advances in 3D Gaussian Splatting (3DGS) have enabled RGB-only SLAM systems to achieve high-fidelity scene representation. However, the heavy reliance of existing systems on photometric rendering loss for camera tracking undermines their robustness, especially in unbounded outdoor environments with severe viewpoint and illumination changes. To address these challenges, we propose TVG-SLAM, a robust RGB-only 3DGS SLAM system that leverages a novel tri-view geometry paradigm to ensure consistent tracking and high-quality mapping. We introduce a dense tri-view matching module that aggregates reliable pairwise correspondences into consistent tri-view matches, forming robust geometric constraints across frames. For tracking, we propose Hybrid Geometric Constraints, which leverage tri-view matches to construct complementary geometric cues alongside photometric loss, ensuring accurate and stable pose estimation even under drastic viewpoint shifts and lighting variations. For mapping, we propose a new probabilistic initialization strategy that encodes geometric uncertainty from tri-view correspondences into newly initialized Gaussians. Additionally, we design a Dynamic Attenuation of Rendering Trust mechanism to mitigate tracking drift caused by mapping latency. Experiments on multiple public outdoor datasets show that our TVG-SLAM outperforms prior RGB-only 3DGS-based SLAM systems. Notably, in the most challenging dataset, our method improves tracking robustness, reducing the average Absolute Trajectory Error (ATE) by 69.0% while achieving state-of-the-art rendering quality.
The implementation of our method will be released as open-source.
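
For readers unfamiliar with the metric quoted above, the following is a minimal sketch of how an Absolute Trajectory Error and a relative reduction are commonly computed. It assumes the two trajectories are already time-synchronised and aligned to a common frame (the alignment step is omitted), and the function names are illustrative rather than taken from the TVG-SLAM code.

```python
import math


def absolute_trajectory_error(estimated, ground_truth):
    """RMSE of per-frame position error between two pre-aligned trajectories.

    Each trajectory is a sequence of (x, y, z) camera positions.
    Assumption: alignment and time association were done beforehand.
    """
    if len(estimated) != len(ground_truth) or not estimated:
        raise ValueError("trajectories must be non-empty and of equal length")
    squared_errors = [
        sum((e - g) ** 2 for e, g in zip(est, gt))
        for est, gt in zip(estimated, ground_truth)
    ]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


def relative_reduction(baseline_ate, new_ate):
    """Percentage by which new_ate is lower than baseline_ate."""
    return 100.0 * (baseline_ate - new_ate) / baseline_ate
```

Under this definition, a hypothetical baseline ATE of 1.0 m dropping to 0.31 m corresponds to a 69% reduction; these numbers are made up for illustration and are not results from the paper.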

[185] A Hierarchical Slice Attention Network for Appendicitis Classification in 3D CT Scans

Chia-Wen Huang,Haw Hwai,Chien-Chang Lee,Pei-Yuan Wu

Main category: cs.CV

TL;DR: This paper presents a deep learning model for appendicitis classification using 3D CT scans and Slice Attention mechanisms, offering a more efficient and reliable diagnostic solution.

Details

Motivation: Timely and accurate diagnosis of appendicitis is critical in clinical settings to prevent serious complications, but the growing number of cases can overwhelm radiologists, potentially causing delays. Method: A deep learning model using 3D CT scans with Slice Attention mechanisms guided by external 2D datasets was developed. A hierarchical classification framework using pre-trained 2D models was also introduced to differentiate between simple and complicated appendicitis. Result: The approach improves AUC by 3% for appendicitis and 5.9% for complicated appendicitis. Conclusion: The proposed deep learning model provides a more efficient and reliable diagnostic solution for appendicitis compared to previous methods. Abstract: Timely and accurate diagnosis of appendicitis is critical in clinical settings to prevent serious complications. While CT imaging remains the standard diagnostic tool, the growing number of cases can overwhelm radiologists, potentially causing delays. In this paper, we propose a deep learning model that leverages 3D CT scans for appendicitis classification, incorporating Slice Attention mechanisms guided by external 2D datasets to enhance small lesion detection. Additionally, we introduce a hierarchical classification framework using pre-trained 2D models to differentiate between simple and complicated appendicitis. Our approach improves AUC by 3% for appendicitis and 5.9% for complicated appendicitis, offering a more efficient and reliable diagnostic solution compared to previous work.

[186] High-quality Pseudo-labeling for Point Cloud Segmentation with Scene-level Annotation

Lunhao Duan,Shanshan Zhao,Xingxing Weng,Jing Zhang,Gui-Song Xia

Main category: cs.CV

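TL;DR: This paper proposes a new method that combines multi-modal information with region-point semantic consistency to address inaccurate pseudo-label generation under scene-level annotation in indoor point cloud semantic segmentation.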

Details

Motivation: Current methods based on scene-level annotation struggle to generate accurate pseudo-labels without precise point-level labels, which hurts segmentation performance, so a more effective way to improve pseudo-label quality is needed. Method: A cross-modal feature guidance module aligns point cloud features with 2D image pixels, and a region-point semantic consistency module produces regional semantics through a region-voting strategy to guide point-level semantic predictions. Result: The method outperforms previous approaches on both ScanNet v2 and S3DIS, and comprehensive ablation studies validate each component. Conclusion: The paper proposes a high-quality pseudo-label generation framework for indoor point cloud semantic segmentation that significantly improves segmentation performance by exploiting multi-modal information and region-point semantic consistency. Abstract: This paper investigates indoor point cloud semantic segmentation under scene-level annotation, which is less explored compared to methods relying on sparse point-level labels. In the absence of precise point-level labels, current methods first generate point-level pseudo-labels, which are then used to train segmentation models. However, generating accurate pseudo-labels for each point solely based on scene-level annotations poses a considerable challenge, substantially affecting segmentation performance. Consequently, to enhance accuracy, this paper proposes a high-quality pseudo-label generation framework by exploring contemporary multi-modal information and region-point semantic consistency. Specifically, with a cross-modal feature guidance module, our method utilizes 2D-3D correspondences to align point cloud features with corresponding 2D image pixels, thereby assisting point cloud feature learning. To further alleviate the challenge presented by the scene-level annotation, we introduce a region-point semantic consistency module. It produces regional semantics through a region-voting strategy derived from point-level semantics, which are subsequently employed to guide the point-level semantic predictions. Leveraging the aforementioned modules, our method can rectify inaccurate point-level semantic predictions during training and obtain high-quality pseudo-labels. Significant improvements over previous works on the ScanNet v2 and S3DIS datasets under scene-level annotation demonstrate its effectiveness. Additionally, comprehensive ablation studies validate the contributions of our approach's individual components.
The code is available at https://github.com/LHDuan/WSegPC .
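
The region-voting idea in the abstract can be illustrated with a small sketch. It assumes hard (argmax) point predictions and a precomputed region id per point, and uses a plain majority vote; the paper's actual voting rule may differ (for example, it may operate on soft predictions), and `region_vote` is a hypothetical name.

```python
from collections import Counter, defaultdict


def region_vote(point_labels, region_ids):
    """Give every point the majority label of the region it belongs to.

    point_labels: predicted class per point (hard labels, an assumption).
    region_ids:   region index per point, e.g. from an over-segmentation.
    """
    votes = defaultdict(Counter)
    for label, region in zip(point_labels, region_ids):
        votes[region][label] += 1
    # Majority label per region; ties resolve to the label seen first.
    region_label = {r: c.most_common(1)[0][0] for r, c in votes.items()}
    return [region_label[r] for r in region_ids]
```

In this toy form, a single "table" prediction inside a region otherwise labelled "chair" is overwritten by the regional vote, which is the kind of correction of inaccurate point-level predictions the abstract describes.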

[187] VolumetricSMPL: A Neural Volumetric Body Model for Efficient Interactions, Contacts, and Collisions

Marko Mihajlovic,Siwei Zhang,Gen Li,Kaifeng Zhao,Lea Müller,Siyu Tang

Main category: cs.CV

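TL;DR: VolumetricSMPL is an efficient volumetric body model based on Neural Blend Weights that outperforms existing methods, with significant gains in inference speed, memory usage, and accuracy.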

Details

Motivation: Existing volumetric body models are either insufficiently robust for complex human articulations or incur high computational and memory costs; a more efficient yet expressive model is needed. Method: Neural Blend Weights (NBW) generate compact, efficient MLP decoders by dynamically blending a small set of learned weight matrices, reducing compute and memory requirements while preserving expressiveness. Result: VolumetricSMPL is 10x faster than the prior model COAP, uses 6x less GPU memory, is more accurate, and provides an SDF for contact modeling. Conclusion: VolumetricSMPL offers clear advantages in both efficiency and performance and suits a range of complex tasks, such as reconstructing human-object interactions, recovering humans in 3D scenes, scene-constrained motion synthesis, and resolving self-intersections. Abstract: Parametric human body models play a crucial role in computer graphics and vision, enabling applications ranging from human motion analysis to understanding human-environment interactions. Traditionally, these models use surface meshes, which pose challenges in efficiently handling interactions with other geometric entities, such as objects and scenes, typically represented as meshes or point clouds. To address this limitation, recent research has explored volumetric neural implicit body models. However, existing works are either insufficiently robust for complex human articulations or impose high computational and memory costs, limiting their widespread use. To this end, we introduce VolumetricSMPL, a neural volumetric body model that leverages Neural Blend Weights (NBW) to generate compact, yet efficient MLP decoders. Unlike prior approaches that rely on large MLPs, NBW dynamically blends a small set of learned weight matrices using predicted shape- and pose-dependent coefficients, significantly improving computational efficiency while preserving expressiveness. VolumetricSMPL outperforms prior volumetric occupancy model COAP with 10x faster inference, 6x lower GPU memory usage, enhanced accuracy, and a Signed Distance Function (SDF) for efficient and differentiable contact modeling. We demonstrate VolumetricSMPL's strengths across four challenging tasks: (1) reconstructing human-object interactions from in-the-wild images, (2) recovering human meshes in 3D scenes from egocentric views, (3) scene-constrained motion synthesis, and (4) resolving self-intersections. Our results highlight its broad applicability and significant performance and efficiency gains.
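
The blending step described in the abstract can be sketched as a weighted sum of a few basis matrices, W = sum_k c_k * W_k, where the coefficients c_k would come from a shape- and pose-dependent predictor. The sketch below assumes that linear form, uses plain lists instead of tensors, and takes the coefficients as given; the names `blend_weights` and `linear` are hypothetical and not from the VolumetricSMPL code.

```python
def blend_weights(bases, coeffs):
    """Blend K same-shaped weight matrices into one: W = sum_k coeffs[k] * bases[k]."""
    rows, cols = len(bases[0]), len(bases[0][0])
    return [
        [sum(c * base[i][j] for c, base in zip(coeffs, bases)) for j in range(cols)]
        for i in range(rows)
    ]


def linear(weight, x):
    """Apply the blended matrix to an input vector (one MLP layer without bias)."""
    return [sum(w * v for w, v in zip(row, x)) for row in weight]


# Two 2x2 basis matrices blended with equal coefficients (toy values).
bases = [[[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]]]
blended = blend_weights(bases, [0.5, 0.5])
```

The point of the design, as the abstract presents it, is that only a small set of matrices is stored and a per-input layer is obtained by cheap blending rather than by evaluating a large MLP.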

[188] Aggregating Local Saliency Maps for Semi-Global Explainable Image Classification

James Hinns,David Martens

Main category: cs.CV

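TL;DR: This paper proposes Segment Attribution Tables (SATs), a method that summarises local saliency explanations into semi-global insights to help analyse and debug image classifiers.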

Details

Motivation: Deep learning dominates image classification, yet understanding how models arrive at predictions remains a challenge. Local explanation methods such as saliency maps visualise the influence of specific pixels on a prediction, but reviewing many such explanations to identify recurring patterns is difficult, while global methods often oversimplify and may miss important local behaviours. Method: SATs take image segments (such as "eyes" in Chihuahuas) and use saliency maps to quantify their influence; with segmentation maps that provide named segments, they can explain any classifier for which a saliency map can be produced. Result: SATs reveal spurious correlations, such as reliance on backgrounds or watermarks, even when out-of-distribution test performance changes little, bridging the gap between oversimplified global summaries and overly detailed local explanations. Conclusion: SATs offer a practical tool for analysing and debugging image classifiers, balancing local behaviour against global understanding. Abstract: Deep learning dominates image classification tasks, yet understanding how models arrive at predictions remains a challenge. Much research focuses on local explanations of individual predictions, such as saliency maps, which visualise the influence of specific pixels on a model's prediction. However, reviewing many of these explanations to identify recurring patterns is infeasible, while global methods often oversimplify and miss important local behaviours. To address this, we propose Segment Attribution Tables (SATs), a method for summarising local saliency explanations into (semi-)global insights. SATs take image segments (such as "eyes" in Chihuahuas) and leverage saliency maps to quantify their influence. These segments highlight concepts the model relies on across instances and reveal spurious correlations, such as reliance on backgrounds or watermarks, even when out-of-distribution test performance sees little change. SATs can explain any classifier for which a form of saliency map can be produced, using segmentation maps that provide named segments. SATs bridge the gap between oversimplified global summaries and overly detailed local explanations, offering a practical tool for analysing and debugging image classifiers.
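
One plausible way to turn per-image saliency maps into a segment-level table is sketched below. It assumes each image comes with a saliency map and a same-sized map of segment names, scores a segment by its share of total absolute saliency, and averages those shares over images. The scoring rule and the function names are assumptions for illustration; the abstract does not specify how SATs quantify influence.

```python
from collections import defaultdict


def segment_attribution(saliency, segments):
    """Share of total absolute saliency that falls in each named segment."""
    totals = defaultdict(float)
    for sal_row, seg_row in zip(saliency, segments):
        for value, name in zip(sal_row, seg_row):
            totals[name] += abs(value)
    grand_total = sum(totals.values())
    if grand_total == 0:
        return {name: 0.0 for name in totals}
    return {name: total / grand_total for name, total in totals.items()}


def attribution_table(images):
    """Average per-segment shares over (saliency, segments) pairs."""
    rows = [segment_attribution(sal, seg) for sal, seg in images]
    names = sorted({name for row in rows for name in row})
    return {n: sum(row.get(n, 0.0) for row in rows) / len(rows) for n in names}
```

A consistently large average share for a segment such as "background" or "watermark" would be the kind of signal the abstract describes as a spurious correlation.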

[189] DGE-YOLO: Dual-Branch Gathering and Attention for Accurate UAV Object Detection

Kunwei Lv,Ping Lan

Main category: cs.CV

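TL;DR: This paper proposes DGE-YOLO, a new UAV object detection framework that improves multi-modal information fusion and feature learning, raising detection performance.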

Details

Motivation: To address the challenge of detecting small objects under complex conditions and to handle multi-modal inputs better, a more robust and efficient detection framework is designed. Method: DGE-YOLO is an enhanced YOLO-based detection framework comprising a dual-branch architecture, an Efficient Multi-scale Attention mechanism, and a Gather-and-Distribute module. Result: Experiments on the Drone Vehicle dataset show that DGE-YOLO achieves superior performance. Conclusion: DGE-YOLO outperforms existing methods, validating its effectiveness for multi-modal UAV object detection. Abstract: The rapid proliferation of unmanned aerial vehicles (UAVs) has highlighted the importance of robust and efficient object detection in diverse aerial scenarios. Detecting small objects under complex conditions, however, remains a significant challenge. Existing approaches often prioritize inference speed, leading to degraded performance when handling multi-modal inputs. To address this, we present DGE-YOLO, an enhanced YOLO-based detection framework designed to effectively fuse multi-modal information. Specifically, we introduce a dual-branch architecture for modality-specific feature extraction, enabling the model to process both infrared and visible images. To further enrich semantic representation, we propose an Efficient Multi-scale Attention (EMA) mechanism that enhances feature learning across spatial scales. Additionally, we replace the conventional neck with a Gather-and-Distribute module to mitigate information loss during feature aggregation. Extensive experiments on the Drone Vehicle dataset demonstrate that DGE-YOLO achieves superior performance over state-of-the-art methods, validating its effectiveness in multi-modal UAV object detection tasks.

[190] PixelBoost: Leveraging Brownian Motion for Realistic-Image Super-Resolution

Aradhana Mishra,Bumshik Lee

Main category: cs.CV

TL;DR: PixelBoost improves image super-resolution by incorporating Brownian motion's stochastic nature, enhancing realism, and speeding up inference.

Details

Motivation: Diffusion-model-based image super-resolution techniques face challenges in maintaining realistic image generation while ensuring computational efficiency, particularly when reducing sampling steps, which can lead to less realistic and hazy images. Method: PixelBoost integrates controlled stochasticity into its training regimen, utilizes a sigmoidal noise sequencing method, and focuses on texture and edge definitions to enhance image realism and inference speed. Result: PixelBoost achieves superior results in objective metrics like LPIPS, LOE, PSNR, and SSIM, along with improved visual quality and edge reconstruction capabilities compared to existing methods. Conclusion: The proposed PixelBoost model successfully addresses the trade-off between realistic image generation and computational efficiency in diffusion-model-based image super-resolution by leveraging the stochastic nature of Brownian motion. Abstract: Diffusion-model-based image super-resolution techniques often face a trade-off between realistic image generation and computational efficiency. This issue is exacerbated when inference time is reduced by decreasing sampling steps, resulting in less realistic and hazy images. To overcome this challenge, we introduce a novel diffusion model named PixelBoost that underscores the significance of embracing the stochastic nature of Brownian motion in advancing image super-resolution, resulting in a high degree of realism, particularly focusing on texture and edge definitions. By integrating controlled stochasticity into the training regimen, our proposed model avoids convergence to local optima, effectively capturing and reproducing the inherent uncertainty of image textures and patterns. Our proposed model demonstrates superior objective results in terms of learned perceptual image patch similarity (LPIPS), lightness order error (LOE), peak signal-to-noise ratio (PSNR), structural similarity index measure (SSIM), as well as visual quality.
To determine the edge enhancement, we evaluated the gradient magnitude and pixel value, and our proposed model exhibited a better edge reconstruction capability. Additionally, our model demonstrates adaptive learning capabilities by effectively adjusting to Brownian noise patterns and introduces a sigmoidal noise sequencing method that simplifies training, resulting in faster inference speeds.
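
The abstract does not define its sigmoidal noise sequencing, so the following is only a generic sketch of a sigmoid-shaped noise schedule of the kind used in diffusion models: per-step noise levels follow a logistic curve between a minimum and a maximum. The parameter names and default values (`beta_min`, `beta_max`, `steepness`) are assumptions and are not taken from PixelBoost.

```python
import math


def sigmoid_noise_schedule(num_steps, beta_min=1e-4, beta_max=2e-2, steepness=6.0):
    """Noise levels that rise along a logistic curve from near beta_min to near beta_max."""
    if num_steps < 2:
        raise ValueError("num_steps must be at least 2")
    schedule = []
    for t in range(num_steps):
        # Map step index to the interval [-steepness, +steepness].
        x = -steepness + 2.0 * steepness * t / (num_steps - 1)
        s = 1.0 / (1.0 + math.exp(-x))
        schedule.append(beta_min + (beta_max - beta_min) * s)
    return schedule
```

Compared with a linear schedule, a curve of this shape changes slowly at both ends and fastest in the middle, which is one way to concentrate noise changes in the mid-range steps.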

[191] PCLVis: Visual Analytics of Process Communication Latency in Large-Scale Simulation

Chongke Bi,Xin Gao,Baofeng Fu,Yuheng Zhao,Siming Chen,Ying Zhao,Yunhai Wang

Main category: cs.CV

TL;DR: This paper proposes PCLVis, a framework that helps general users analyze process communication latency using MPI data, enabling better simulation optimization.

Details

Motivation: Existing methods for analyzing communication latency depend on physical link layer information, which is not accessible to general users. This paper aims to provide a solution for general users to analyze communication latency. Method: PCLVis uses MPI process communication data to analyze spatial PCL events, constructs a communication-dependency-based DAG for propagation path analysis, and designs a sliding window algorithm and CS-Glyph for visualization. Result: PCLVis successfully analyzed PCL events in simulations on the TH-1A supercomputer, allowing users to improve simulation efficiency. Conclusion: The PCLVis framework can effectively analyze process communication latency events, helping users optimize simulation efficiency. Abstract: Large-scale simulations on supercomputers have become important tools for users. However, their scalability remains a problem due to the huge communication cost among parallel processes. Most of the existing communication latency analysis methods rely on the physical link layer information, which is only available to administrators. In this paper, a framework called PCLVis is proposed to help general users analyze process communication latency (PCL) events. Instead of the physical link layer information, the PCLVis uses the MPI process communication data for the analysis. First, a spatial PCL event locating method is developed. All processes with high correlation are classified into a single cluster by constructing a process-correlation tree. Second, the propagation path of PCL events is analyzed by constructing a communication-dependency-based directed acyclic graph (DAG), which can help users interactively explore a PCL event from the temporal evolution of a located PCL event cluster. In this graph, a sliding window algorithm is designed to generate the PCL events abstraction. 
Meanwhile, a new glyph called the communication state glyph (CS-Glyph) is designed for each process to show its communication states, including its in/out messages and load balance. Each leaf node can be further unfolded to view additional information. Third, a PCL event attribution strategy is formulated to help users optimize their simulations. The effectiveness of the PCLVis framework is demonstrated by analyzing the PCL events of several simulations running on the TH-1A supercomputer. By using the proposed framework, users can greatly improve the efficiency of their simulations.

[192] Causal-Entity Reflected Egocentric Traffic Accident Video Synthesis

Lei-lei Li,Jianwu Fang,Junbin Xiao,Shanmin Pang,Hongkai Yu,Chen Lv,Jianru Xue,Tat-Seng Chua

Main category: cs.CV

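TL;DR: This paper proposes Causal-VidSyn and Drive-Gaze for generating accident videos that reflect real-world causal relations, with the aim of improving self-driving cars' ability to respond to accidents.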

Details

Motivation: To improve the safety of self-driving cars, accident videos that reflect real-world causal relations need to be synthesized for capability testing. Method: A new diffusion model, Causal-VidSyn, is proposed, and Drive-Gaze, the largest driver gaze dataset in driving accident scenarios, is constructed to support it. Result: Experiments show that Causal-VidSyn surpasses state-of-the-art video diffusion models in frame quality and causal sensitivity across tasks including accident video editing, normal-to-accident video diffusion, and text-to-video generation. Conclusion: Causal-VidSyn can effectively generate causally grounded accident videos, helping self-driving cars respond better to accidents that cannot be afforded in reality. Abstract: Egocentrically comprehending the causes and effects of car accidents is crucial for the safety of self-driving cars, and synthesizing causal-entity reflected accident videos can facilitate the capability test to respond to unaffordable accidents in reality. However, incorporating causal relations as seen in real-world videos into synthetic videos remains challenging. This work argues that precisely identifying the accident participants and capturing their related behaviors are of critical importance. In this regard, we propose a novel diffusion model, Causal-VidSyn, for synthesizing egocentric traffic accident videos. To enable causal entity grounding in video diffusion, Causal-VidSyn leverages the cause descriptions and driver fixations to identify the accident participants and behaviors, facilitated by accident reason answering and gaze-conditioned selection modules. To support Causal-VidSyn, we further construct Drive-Gaze, the largest driver gaze dataset (with 1.54M frames of fixations) in driving accident scenarios. Extensive experiments show that Causal-VidSyn surpasses state-of-the-art video diffusion models in terms of frame quality and causal sensitivity in various tasks, including accident video editing, normal-to-accident video diffusion, and text-to-video generation.

[193] Token Activation Map to Visually Explain Multimodal LLMs

Yi Li,Hualiang Wang,Xinpeng Ding,Haonan Wang,Xiaomeng Li

Main category: cs.CV

TL;DR: This paper proposes Token Activation Map (TAM) for explaining MLLMs by addressing redundant activation interferences, resulting in high-quality visualizations and better model understanding.

Details

Motivation: MLLMs lack explainability compared to conventional vision models, which hinders understanding and credibility. Existing methods overlook the interference caused by earlier context tokens in generating reliable explanations. Method: Proposed an estimated causal inference method combined with a novel rank Gaussian filter, termed Token Activation Map (TAM), to address redundant activations in MLLM explanations. Result: TAM achieves superior performance in explaining multiple tokens of MLLMs, offering reliable and high-quality visualizations across various scenarios such as object localization, multi-turn conversation analysis, and visual reasoning. Conclusion: The TAM method significantly outperforms existing SoTA methods in providing high-quality visualization and explanation for MLLMs, enabling applications like object localization, failure case analysis, video visualization, and model understanding. Abstract: Multimodal large language models (MLLMs) are broadly empowering various fields. Despite their advancements, the explainability of MLLMs remains less explored, hindering deeper understanding, model credibility, and effective visualization. Unlike conventional vision models (e.g., CNNs, ViTs, CLIP) that produce a single output, MLLMs generate sequences of tokens progressively, where each generated token depends on the previous context. Therefore, earlier context tokens can introduce redundant activations that interfere with the explanation of later tokens beyond their original information. Existing studies often overlook this issue, but our observations reveal that these redundant correlations can significantly hurt the reliability of explanations. To address this, we propose an estimated causal inference method to mitigate the interference of context to achieve high-quality MLLM explanation, with a novel rank Gaussian filter to further reduce activation noises. 
We term this method Token Activation Map (TAM) to highlight the consideration of interactions between tokens. TAM also indicates that it excels at explaining multiple tokens of an MLLM, which is different from the Class Activation Map (CAM) for a single prediction. Our TAM method significantly outperforms existing SoTA methods, showcasing high-quality visualization results that can be utilized for various scenarios, such as object localization, failure case analysis, video visualization, MLLM visual comparison, and model understanding (e.g., color, shape, action, location, visual reasoning, multi-turn conversation, etc.). The code is available at github.com/xmed-lab/TAM.

[194] Mettle: Meta-Token Learning for Memory-Efficient Audio-Visual Adaptation

Jinxing Zhou,Zhihui Li,Yongqiang Yu,Yanghao Zhou,Ruohao Guo,Guangyao Li,Yuxin Mao,Mingfei Han,Xiaojun Chang,Meng Wang

Main category: cs.CV

TL;DR: Mettle is a memory-efficient method for adapting transformers to audio-visual tasks using meta-tokens, achieving good performance with reduced resource consumption.

Details

Motivation: The motivation is to develop a memory-efficient and simple method for adapting pretrained transformers to downstream audio-visual tasks without sacrificing performance. Method: Mettle uses a Layer-Centric Distillation (LCD) module to distill features into compact meta-tokens and introduces a Meta-Token Injection (MTI) module for fine-grained segmentation tasks. Result: Experiments show that Mettle reduces memory usage and training time while achieving competitive accuracy on audio-visual benchmarks. Conclusion: Mettle is an efficient method for adapting large-scale pretrained transformer models to audio-visual tasks, reducing memory usage and training time while maintaining accuracy. Abstract: We present Meta-Token Learning (Mettle), a simple and memory-efficient method for adapting large-scale pretrained transformer models to downstream audio-visual tasks. Instead of sequentially modifying the output feature distribution of the transformer backbone, Mettle utilizes a lightweight Layer-Centric Distillation (LCD) module to distill in parallel the intact audio or visual features embedded by each transformer layer into compact meta-tokens. This distillation process considers both pretrained knowledge preservation and task-specific adaptation. The obtained meta-tokens can be directly applied to classification tasks, such as audio-visual event localization and audio-visual video parsing. To further support fine-grained segmentation tasks, such as audio-visual segmentation, we introduce a Meta-Token Injection (MTI) module, which utilizes the audio and visual meta-tokens distilled from the top transformer layer to guide feature adaptation in earlier layers. Extensive experiments on multiple audio-visual benchmarks demonstrate that our method significantly reduces memory usage and training time while maintaining parameter efficiency and competitive accuracy.

[195] Why Settle for One? Text-to-ImageSet Generation and Evaluation

Chengyou Jia,Xin Shen,Zhuohang Dang,Changliang Xia,Weijia Wu,Xinyu Zhang,Hangwei Qian,Ivor W. Tsang,Minnan Luo

Main category: cs.CV

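TL;DR: This paper proposes a new problem, Text-to-ImageSet (T2IS) generation, together with a benchmark, an evaluation framework, and a training-free method, AutoT2IS, for generating image sets under diverse consistency requirements.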

Details

Motivation: Existing consistent text-to-image generation methods are usually confined to a specific domain and lack support for the diverse consistency requirements of broader applications. Method: T2IS-Bench and T2IS-Eval are introduced as the benchmark and evaluation framework, and AutoT2IS is built on pretrained Diffusion Transformers. Result: AutoT2IS significantly outperforms existing generalized and specialized methods in experiments and shows potential for real-world applications. Conclusion: The paper proposes a training-free method that generates image sets satisfying multiple consistency requirements and validates its superior performance on T2IS-Bench. Abstract: Despite remarkable progress in Text-to-Image models, many real-world applications require generating coherent image sets with diverse consistency requirements. Existing consistent methods often focus on a specific domain with specific aspects of consistency, which significantly constrains their generalizability to broader applications. In this paper, we propose a more challenging problem, Text-to-ImageSet (T2IS) generation, which aims to generate sets of images that meet various consistency requirements based on user instructions. To systematically study this problem, we first introduce T2IS-Bench with 596 diverse instructions across 26 subcategories, providing comprehensive coverage for T2IS generation. Building on this, we propose T2IS-Eval, an evaluation framework that transforms user instructions into multifaceted assessment criteria and employs effective evaluators to adaptively assess consistency fulfillment between criteria and generated sets. Subsequently, we propose AutoT2IS, a training-free framework that maximally leverages pretrained Diffusion Transformers' in-context capabilities to harmonize visual elements to satisfy both image-level prompt alignment and set-level visual consistency. Extensive experiments on T2IS-Bench reveal that diverse consistency challenges all existing methods, while our AutoT2IS significantly outperforms current generalized and even specialized approaches. Our method also demonstrates the ability to enable numerous underexplored real-world applications, confirming its substantial practical value. Visit our project at https://chengyou-jia.github.io/T2IS-Home.

[196] Autoregressive Denoising Score Matching is a Good Video Anomaly Detector

Hanwen Zhang,Congqi Cao,Qinyi Lv,Lingtong Min,Yanning Zhang

Main category: cs.CV

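TL;DR: This paper proposes an improved video anomaly detection method that addresses limitations of existing likelihood-based approaches and achieves state-of-the-art performance on three benchmarks.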

Details

Motivation: Likelihood-based methods are blind to anomalies located in local modes near the learned distribution; gaps unique to video anomaly detection in scene, motion, and appearance need to be addressed. Method: A noise-conditioned score transformer is developed, a scene-dependent and motion-aware score function is introduced, and unaffected visual information is integrated via a novel autoregressive denoising score matching mechanism. Result: The method achieves state-of-the-art performance on three popular video anomaly detection benchmarks. Conclusion: By addressing three unique gaps in scene, motion, and appearance, the proposed method achieves state-of-the-art performance on three popular benchmarks. Abstract: Video anomaly detection (VAD) is an important computer vision problem. Thanks to the mode coverage capabilities of generative models, the likelihood-based paradigm is catching growing interest, as it can model normal distribution and detect out-of-distribution anomalies. However, these likelihood-based methods are blind to the anomalies located in local modes near the learned distribution. To handle these "unseen" anomalies, we dive into three gaps uniquely existing in VAD regarding scene, motion and appearance. Specifically, we first build a noise-conditioned score transformer for denoising score matching. Then, we introduce a scene-dependent and motion-aware score function by embedding the scene condition of input sequences into our model and assigning motion weights based on the difference between key frames of input sequences. Next, to solve the problem of blindness in principle, we integrate unaffected visual information via a novel autoregressive denoising score matching mechanism for inference. Through autoregressively injecting intensifying Gaussian noise into the denoised data and estimating the corresponding score function, we compare the denoised data with the original data to get a difference and aggregate it with the score function for an enhanced appearance perception and accumulate the abnormal context. With all three gaps considered, we can compute a more comprehensive anomaly indicator. Experiments on three popular VAD benchmarks demonstrate the state-of-the-art performance of our method.
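
As background for the term used above, the standard denoising score matching objective for Gaussian noise of scale $\sigma$ is given below. This is the generic textbook form, with $s_\theta$ denoting a noise-conditioned score network; the scene conditioning, motion weights, and autoregressive inference described in the abstract are not included, and how the paper modifies this objective is not stated in the summary.

```latex
\mathcal{L}_{\mathrm{DSM}}(\theta;\sigma)
  = \frac{1}{2}\,
    \mathbb{E}_{x \sim p_{\mathrm{data}}}\,
    \mathbb{E}_{\tilde{x} \sim \mathcal{N}(x,\,\sigma^{2} I)}
    \left[
      \left\lVert
        s_\theta(\tilde{x}, \sigma) + \frac{\tilde{x} - x}{\sigma^{2}}
      \right\rVert_2^{2}
    \right]
```

The second term inside the norm is the negative of the score of the Gaussian perturbation kernel, $\nabla_{\tilde{x}} \log q_\sigma(\tilde{x} \mid x) = -(\tilde{x} - x)/\sigma^{2}$, so minimising the loss trains $s_\theta$ to point from a noisy sample back toward the clean data.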

[197] MoMa: Modulating Mamba for Adapting Image Foundation Models to Video Recognition

Yuhuan Yang,Chaofan Ma,Zhenjie Mao,Jiangchao Yao,Ya Zhang,Yanfeng Wang

Main category: cs.CV

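TL;DR: This paper proposes MoMa, an efficient adapter framework that integrates Mamba's selective state space modeling into image foundation models (IFMs) to achieve full spatial-temporal modeling and improve video understanding.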

Details

Motivation: Existing IFM-based video methods usually process spatial and temporal information separately and struggle to capture the complexity of video dynamics, which MoMa is proposed to address. Method: MoMa uses a new operation, SeqMod, to inject spatial-temporal information into pre-trained IFMs, combined with a Divide-and-Modulate architecture for efficient video understanding. Result: Extensive experiments on multiple video benchmarks show that MoMa achieves superior performance at reduced computational cost. Conclusion: MoMa is an effective video understanding framework that improves performance while maintaining computational efficiency. Abstract: Video understanding is a complex challenge that requires effective modeling of spatial-temporal dynamics. With the success of image foundation models (IFMs) in image understanding, recent approaches have explored parameter-efficient fine-tuning (PEFT) to adapt IFMs for video. However, most of these methods tend to process spatial and temporal information separately, which may fail to capture the full intricacy of video dynamics. In this paper, we propose MoMa, an efficient adapter framework that achieves full spatial-temporal modeling by integrating Mamba's selective state space modeling into IFMs. We propose a novel SeqMod operation to inject spatial-temporal information into pre-trained IFMs, without disrupting their original features. By incorporating SeqMod into a Divide-and-Modulate architecture, MoMa enhances video understanding while maintaining computational efficiency. Extensive experiments on multiple video benchmarks demonstrate the effectiveness of MoMa, achieving superior performance with reduced computational cost.

[198] Competitive Distillation: A Simple Learning Strategy for Improving Visual Classification

Daqian Shi,Xiaolei Diao,Xu Chen,Cédric M. John

Main category: cs.CV

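TL;DR: This paper proposes a competitive distillation strategy that organizes multiple networks to compete and introduces stochastic perturbation, improving the learning performance of deep neural networks.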

Details

Motivation: Existing distillation-based optimization strategies achieve limited improvements because the impact of learning directions among networks across iterations is poorly understood. Method: A novel competitive distillation strategy lets each network in a group potentially act as a teacher based on its performance, with competitive optimization improving the parameter updating process and stochastic perturbation added on top. Result: Experiments show that competitive distillation improves learning performance. Conclusion: Competitive distillation achieves promising performance across diverse tasks and datasets, using stochastic perturbation to push networks toward better visual representations and the global optimum. Abstract: Deep Neural Networks (DNNs) have significantly advanced the field of computer vision. To improve the DNN training process, knowledge distillation methods demonstrate their effectiveness in accelerating network training by introducing a fixed learning direction from the teacher network to student networks. In this context, several distillation-based optimization strategies are proposed, e.g., deep mutual learning and self-distillation, as an attempt to achieve generic training performance enhancement through the cooperative training of multiple networks. However, such strategies achieve limited improvements due to the poor understanding of the impact of learning directions among networks across different iterations. In this paper, we propose a novel competitive distillation strategy that allows each network in a group to potentially act as a teacher based on its performance, enhancing the overall learning performance. Competitive distillation organizes a group of networks to perform a shared task and engage in competition, where competitive optimization is proposed to improve the parameter updating process. We further introduce stochastic perturbation in competitive distillation, aiming to motivate networks to induce mutations to achieve better visual representations and global optimum. The experimental results show that competitive distillation achieves promising performance in diverse tasks and datasets.
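
The phrase "act as a teacher based on its performance" can be read as a per-iteration role assignment. The sketch below assumes the simplest such rule, where the network with the lowest current loss teaches the others for that iteration; the actual selection criterion, the competitive optimization step, and the stochastic perturbation are not described in the summary, and `assign_roles` is a hypothetical helper.

```python
def pick_teacher(losses):
    """Index of the best-performing network, here the one with the lowest loss."""
    if not losses:
        raise ValueError("need at least one network")
    return min(range(len(losses)), key=losses.__getitem__)


def assign_roles(losses):
    """Label each network as teacher or student for the current iteration."""
    teacher = pick_teacher(losses)
    return ["teacher" if i == teacher else "student" for i in range(len(losses))]
```

Because the roles are recomputed every iteration, a network that is a student early in training can become the teacher later, which is the contrast with a fixed teacher that the abstract draws.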

[199] DDL: A Dataset for Interpretable Deepfake Detection and Localization in Real-World Scenarios

Changtao Miao,Yi Zhang,Weize Gao,Man Luo,Weiwei Feng,Zhiya Tan,Jianshu Li,Ajian Liu,Yunfeng Diao,Qi Chu,Tao Gong,Zhe Li,Weibin Yao,Joey Tianyi Zhou

Main category: cs.CV

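TL;DR: This paper presents DDL, a new large-scale deepfake detection and localization dataset that addresses the shortcomings of existing datasets in forgery scenarios, interpretability, and diversity.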

Details

Motivation: Existing deepfake detection models perform well on detection metrics, but most output only a binary classification and lack interpretability, which undermines their credibility and authority in critical domains such as law. In addition, most current deepfake datasets provide only binary labels, and the few with localization annotations suffer from restricted forgery scenarios, limited diversity of deepfake types, and insufficient scale, making them inadequate for complex real-world scenarios. Method: The authors construct a new large-scale deepfake detection and localization (DDL) dataset containing over 1.8M forged samples and covering 75 distinct deepfake methods, with four key innovations: diverse forgery scenarios, comprehensive deepfake methods, varied manipulation modes, and fine-grained forgery annotations. Result: The resulting DDL dataset contains over 1.8M forged samples spanning 75 deepfake methods, and its four key innovations improve its practicality and applicability. Conclusion: DDL provides crucial support for next-generation deepfake detection, localization, and interpretability methods, and offers a more challenging benchmark through its diverse forgery scenarios, comprehensive deepfake methods, varied manipulation modes, and fine-grained forgery annotations. Abstract: Recent advances in AIGC have exacerbated the misuse of malicious deepfake content, making the development of reliable deepfake detection methods an essential means to address this challenge. Although existing deepfake detection models demonstrate outstanding performance in detection metrics, most methods only provide simple binary classification results, lacking interpretability. In critical domains such as law, interpretability is crucial for enhancing the credibility and authority of decisions. Recent studies attempt to improve the interpretability of classification results by providing spatial manipulation masks or temporal forgery segments. However, the practical effectiveness of these methods remains suboptimal due to limitations of the forgery data. Most current deepfake datasets predominantly offer binary labels, with only a few datasets providing localization annotations. However, they suffer from restricted forgery scenarios, limited diversity in deepfake types, and insufficient data scale, making them inadequate for complex real-world scenarios. To address this predicament, we construct a novel large-scale deepfake detection and localization ($\textbf{DDL}$) dataset containing over $\textbf{1.8M}$ forged samples and encompassing up to $\textbf{75}$ distinct deepfake methods.
The DDL design incorporates four key innovations: (1) $\textbf{Diverse Forgery Scenarios}$, (2) $\textbf{Comprehensive Deepfake Methods}$, (3) $\textbf{Varied Manipulation Modes}$, and (4) $\textbf{Fine-grained Forgery Annotations}$. Through these improvements, our DDL not only provides a more challenging benchmark for complex real-world forgeries, but also offers crucial support for building next-generation deepfake detection, localization, and interpretability methods. The DDL dataset project page is on https://deepfake-workshop-ijcai2025.github.io/main/index.html.

[200] DiffFit: Disentangled Garment Warping and Texture Refinement for Virtual Try-On

Xiang Xu

Main category: cs.CV

TL;DR: DiffFit is a two-stage framework for virtual try-on that improves garment detail preservation, alignment, and visual realism using a progressive generation approach with latent diffusion models.

Details

Motivation: Current VTON methods struggle with preserving fine-grained garment details, achieving accurate alignment, maintaining efficiency, and generalizing across diverse poses and clothing styles. This work aims to overcome these limitations. Method: DiffFit uses a two-stage latent diffusion framework: the first stage focuses on geometry-aware garment warping, while the second stage refines texture fidelity through a cross-modal conditional diffusion model that integrates multiple inputs. Result: Extensive experiments show that DiffFit outperforms state-of-the-art methods on large-scale VTON benchmarks both quantitatively and in perceptual evaluations. Conclusion: DiffFit provides a novel and efficient solution for high-fidelity virtual try-on by decoupling geometric alignment and appearance refinement, leading to superior performance in visual realism and quantitative metrics. Abstract: Virtual try-on (VTON) aims to synthesize realistic images of a person wearing a target garment, with broad applications in e-commerce and digital fashion. While recent advances in latent diffusion models have substantially improved visual quality, existing approaches still struggle with preserving fine-grained garment details, achieving precise garment-body alignment, maintaining inference efficiency, and generalizing to diverse poses and clothing styles. To address these challenges, we propose DiffFit, a novel two-stage latent diffusion framework for high-fidelity virtual try-on. DiffFit adopts a progressive generation strategy: the first stage performs geometry-aware garment warping, aligning the garment with the target body through fine-grained deformation and pose adaptation. The second stage refines texture fidelity via a cross-modal conditional diffusion model that integrates the warped garment, the original garment appearance, and the target person image for high-quality rendering. 
By decoupling geometric alignment and appearance refinement, DiffFit effectively reduces task complexity and enhances both generation stability and visual realism. It excels in preserving garment-specific attributes such as textures, wrinkles, and lighting, while ensuring accurate alignment with the human body. Extensive experiments on large-scale VTON benchmarks demonstrate that DiffFit achieves superior performance over existing state-of-the-art methods in both quantitative metrics and perceptual evaluations.

[201] Endo-4DGX: Robust Endoscopic Scene Reconstruction and Illumination Correction with Gaussian Splatting

Yiming Huang,Long Bai,Beilei Cui,Yanheng Li,Tong Chen,Jie Wang,Jinlin Wu,Zhen Lei,Hongbin Liu,Hongliang Ren

Main category: cs.CV

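TL;DR: This paper introduces Endo-4DGX, a new method that uses illumination-adaptive Gaussian Splatting to address the poor rendering quality caused by adverse lighting in dynamic surgical scenes, and experiments show that it outperforms current state-of-the-art techniques.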

Details

Motivation: 3D-GS-based methods still suffer from severe optimization problems and poor rendering quality under varying illumination such as low light and over-exposure, so a new method is needed. Method: The method introduces a region-aware enhancement module with illumination embeddings and a spatial-aware adjustment module, which model sub-area lightness at the Gaussian level and learn view-consistent brightness adjustment. An exposure control loss is also used to restore appearance from adverse exposure to the normal level. Result: Experiments show that in challenging lighting environments Endo-4DGX clearly outperforms combinations of state-of-the-art reconstruction and restoration methods, underscoring its potential to advance robot-assisted surgical applications. Conclusion: Endo-4DGX is a new reconstruction method with illumination-adaptive Gaussian Splatting designed for endoscopic scenes with uneven lighting. It achieves superior rendering performance under low-light and over-exposure conditions while maintaining geometric accuracy, and significantly outperforms existing state-of-the-art reconstruction and restoration methods. Abstract: Accurate reconstruction of soft tissue is crucial for advancing automation in image-guided robotic surgery. The recent 3D Gaussian Splatting (3DGS) techniques and their variants, 4DGS, achieve high-quality renderings of dynamic surgical scenes in real-time. However, 3D-GS-based methods still struggle in scenarios with varying illumination, such as low light and over-exposure. Training 3D-GS in such extreme light conditions leads to severe optimization problems and devastating rendering quality. To address these challenges, we present Endo-4DGX, a novel reconstruction method with illumination-adaptive Gaussian Splatting designed specifically for endoscopic scenes with uneven lighting. By incorporating illumination embeddings, our method effectively models view-dependent brightness variations. We introduce a region-aware enhancement module to model the sub-area lightness at the Gaussian level and a spatial-aware adjustment module to learn the view-consistent brightness adjustment. With the illumination adaptive design, Endo-4DGX achieves superior rendering performance under both low-light and over-exposure conditions while maintaining geometric accuracy. Additionally, we employ an exposure control loss to restore the appearance from adverse exposure to the normal level for illumination-adaptive optimization. Experimental results demonstrate that Endo-4DGX significantly outperforms combinations of state-of-the-art reconstruction and restoration methods in challenging lighting environments, underscoring its potential to advance robot-assisted surgical applications.
Our code is available at https://github.com/lastbasket/Endo-4DGX.

Quang-Huy Che,Vinh-Tiep Nguyen

Main category: cs.CV

TL;DR: This paper presents FastSeg, an efficient training-free framework for open-vocabulary semantic segmentation that addresses issues of existing methods by utilizing a diffusion model's reverse process and introducing mechanisms to enhance attention extraction and spatial consistency.

Details

Motivation: Open-vocabulary semantic segmentation aims to segment objects from arbitrary text categories without requiring densely annotated datasets. Contrastive learning based models often lose fine spatial precision at pixel level due to global representation bias, while diffusion-based models face challenges in balancing iterations with segmentation quality. Method: The paper proposes FastSeg, a novel and efficient training-free framework with only (1+1)-step of reverse process of a pretrained diffusion model. It introduces three key components: a dual-prompt mechanism, a Hierarchical Attention Refinement Method (HARD), and a Test-Time Flipping (TTF) scheme. Result: FastSeg achieves state-of-the-art training-free performance, obtaining 43.8% average mIoU across PASCAL VOC, PASCAL Context, and COCO Object benchmarks while maintaining superior inference efficiency. Conclusion: FastSeg provides a strong foundation for extendability, bridging the gap between segmentation quality and inference efficiency. Abstract: Open-vocabulary semantic segmentation (OVSS) aims to segment objects from arbitrary text categories without requiring densely annotated datasets. Although contrastive learning based models enable zero-shot segmentation, they often lose fine spatial precision at pixel level, due to global representation bias. In contrast, diffusion-based models naturally encode fine-grained spatial features via attention mechanisms that capture both global context and local details. However, they often face challenges in balancing the number of iterations with the quality of the segmentation. In this work, we propose FastSeg, a novel and efficient training-free framework with only (1+1)-step of reverse process of a pretrained diffusion model (e.g., Stable Diffusion). Moreover, instead of running multiple times for different classes, FastSeg performs segmentation for all classes at once. 
To further enhance the segmentation quality, FastSeg introduces three key components: (i) a dual-prompt mechanism for discriminative, class-aware attention extraction, (ii) a Hierarchical Attention Refinement Method (HARD) that enhances fused cross-attention using scale-aligned self-attention maps, and (iii) a Test-Time Flipping (TTF) scheme designed to improve spatial consistency. Extensive experiments show that FastSeg achieves state-of-the-art training-free performance, obtaining 43.8% average mIoU across PASCAL VOC, PASCAL Context, and COCO Object benchmarks while maintaining superior inference efficiency. Our results demonstrate that FastSeg provides a strong foundation for extendability, bridging the gap between segmentation quality and inference efficiency.

[203] IR3D-Bench: Evaluating Vision-Language Model Scene Understanding as Agentic Inverse Rendering

Parker Liu,Chenxin Li,Zhengxin Li,Yipeng Wu,Wuyang Li,Zhiqin Yang,Zhenyuan Zhang,Yunlong Lin,Sirui Han,Brandon Y. Feng

Main category: cs.CV

TL;DR: This paper introduces IR3D-Bench, a new benchmark for Vision-Language Agents that tests their ability to actively recreate 3D scenes from images, moving beyond traditional descriptive benchmarks.

Details

Motivation: To determine whether Vision-Language Models truly understand scenes from visual observations rather than merely performing descriptive or passive recognition tasks. Method: The paper introduces IR3D-Bench, which uses an 'understanding-by-creating' approach. It tasks Vision-Language Agents with recreating the 3D structure of images using programming and rendering tools, thereby testing their generative and tool-using capacities. Result: Initial experiments reveal that while current Vision-Language Models can use tools effectively, they still face challenges in terms of visual precision and geometric accuracy. Conclusion: IR3D-Bench provides a new benchmark for evaluating the active scene understanding capabilities of Vision-Language Agents, highlighting current limitations and promoting further development in this field. Abstract: Vision-language models (VLMs) excel at descriptive tasks, but whether they truly understand scenes from visual observations remains uncertain. We introduce IR3D-Bench, a benchmark challenging VLMs to demonstrate understanding through active creation rather than passive recognition. Grounded in the analysis-by-synthesis paradigm, IR3D-Bench tasks Vision-Language Agents (VLAs) with actively using programming and rendering tools to recreate the underlying 3D structure of an input image, achieving agentic inverse rendering through tool use. This "understanding-by-creating" approach probes the tool-using generative capacity of VLAs, moving beyond the descriptive or conversational capacity measured by traditional scene understanding benchmarks. We provide a comprehensive suite of metrics to evaluate geometric accuracy, spatial relations, appearance attributes, and overall plausibility. Initial experiments on agentic inverse rendering powered by various state-of-the-art VLMs highlight current limitations, particularly in visual precision rather than basic tool usage. 
IR3D-Bench, including data and evaluation protocols, is released to facilitate systematic study and development of tool-using VLAs towards genuine scene understanding by creating.

[204] CycleVAR: Repurposing Autoregressive Model for Unsupervised One-Step Image Translation

Yi Liu,Shengqian Li,Zuzeng Lin,Feng Wang,Si Liu

Main category: cs.CV

TL;DR: 本文提出CycleVAR，一种结合Softmax松弛量化与视觉自回归生成的新方法，在无监督图像翻译任务中实现了更优性能，尤其在并行生成模式下显著提升了翻译质量与推理效率。

Details

Motivation: Conditional autoregressive image generation methods remain largely unexplored for unsupervised image translation, which operates without explicit cross-domain correspondences. Traditional vector-quantization-based frameworks also break gradient propagation because of discrete quantization, which limits end-to-end optimization. Method: The authors propose Softmax Relaxed Quantization, which treats codebook selection as continuous probability mixing via Softmax, and build CycleVAR on top of it. CycleVAR performs image-to-image translation in two modes: (1) serial multi-step generation and (2) parallel one-step generation. Result: Experiments show that in unsupervised scenarios CycleVAR's parallel one-step mode achieves higher translation quality and faster inference than the serial multi-step mode, and that CycleVAR outperforms current state-of-the-art unsupervised image translation models such as CycleGAN-Turbo both quantitatively and qualitatively. Conclusion: By introducing Softmax Relaxed Quantization, CycleVAR resolves the blocked gradient propagation of traditional methods and performs well on unsupervised image translation; its parallel one-step mode in particular surpasses the serial multi-step mode and other existing methods such as CycleGAN-Turbo in translation quality and inference speed. Abstract: The current conditional autoregressive image generation methods have shown promising results, yet their potential remains largely unexplored in the practical unsupervised image translation domain, which operates without explicit cross-domain correspondences. A critical limitation stems from the discrete quantization inherent in traditional Vector Quantization-based frameworks, which disrupts gradient flow between the Variational Autoencoder decoder and causal Transformer, impeding end-to-end optimization during adversarial training in image space. To tackle this issue, we propose using Softmax Relaxed Quantization, a novel approach that reformulates codebook selection as a continuous probability mixing process via Softmax, thereby preserving gradient propagation. Building upon this differentiable foundation, we introduce CycleVAR, which reformulates image-to-image translation as image-conditional visual autoregressive generation by injecting multi-scale source image tokens as contextual prompts, analogous to prefix-based conditioning in language models. CycleVAR exploits two modes to generate the target image tokens, including (1) serial multi-step generation, enabling iterative refinement across scales, and (2) parallel one-step generation synthesizing all resolution outputs in a single forward pass.
Experimental findings indicate that the parallel one-step generation mode attains superior translation quality with quicker inference speed than the serial multi-step mode in unsupervised scenarios. Furthermore, both quantitative and qualitative results indicate that CycleVAR surpasses previous state-of-the-art unsupervised image translation models, e.g., CycleGAN-Turbo.
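The contrast between hard codebook selection and a softmax-weighted mixture can be sketched in a few lines. This is only an illustration of the general idea of replacing an argmax lookup with continuous probability mixing; where the logits come from, the temperature parameter, and the function names are assumptions and are not taken from the paper.

```python
import numpy as np

def hard_lookup(logits, codebook):
    # Standard discrete selection: argmax has no useful gradient w.r.t. logits.
    return codebook[np.argmax(logits, axis=-1)]

def soft_lookup(logits, codebook, temperature=1.0):
    # Softmax over codebook entries, then a convex mixture of code vectors.
    z = logits / temperature
    z = z - z.max(axis=-1, keepdims=True)
    weights = np.exp(z)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ codebook, weights

codebook = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]])  # K=3 codes, D=2
logits = np.array([[2.0, 0.0, -1.0]])                      # one token

mixed, weights = soft_lookup(logits, codebook, temperature=1.0)
sharp, _ = soft_lookup(logits, codebook, temperature=0.01)
```

The mixture is a smooth function of the logits, so gradients can flow through it, and lowering the temperature makes it approach the hard lookup.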

[205] GeoProg3D: Compositional Visual Reasoning for City-Scale 3D Language Fields

Shunsuke Yasuki,Taiki Miyanishi,Nakamasa Inoue,Shuhei Kurita,Koya Sakamoto,Daichi Azuma,Masato Taki,Yutaka Matsuo

Main category: cs.CV

TL;DR: The paper proposes GeoProg3D, a visual programming framework that enables natural language-driven interactions with city-scale high-fidelity 3D scenes, overcoming limitations of current methods.

Details

Motivation: Existing approaches are typically limited to small-scale environments, lacking scalability and compositional reasoning capabilities necessary for large, complex urban settings. Method: We propose GeoProg3D, which consists of a Geography-aware City-scale 3D Language Field (GCLF) and Geographical Vision APIs (GV-APIs), employing large language models (LLMs) as reasoning engines to dynamically combine GV-APIs and operate GCLF. Result: Experiments demonstrate that GeoProg3D significantly outperforms existing 3D language fields and vision-language models across multiple tasks. Conclusion: GeoProg3D is the first framework enabling compositional geographic reasoning in high-fidelity city-scale 3D environments via natural language. Abstract: The advancement of 3D language fields has enabled intuitive interactions with 3D scenes via natural language. However, existing approaches are typically limited to small-scale environments, lacking the scalability and compositional reasoning capabilities necessary for large, complex urban settings. To overcome these limitations, we propose GeoProg3D, a visual programming framework that enables natural language-driven interactions with city-scale high-fidelity 3D scenes. GeoProg3D consists of two key components: (i) a Geography-aware City-scale 3D Language Field (GCLF) that leverages a memory-efficient hierarchical 3D model to handle large-scale data, integrated with geographic information for efficiently filtering vast urban spaces using directional cues, distance measurements, elevation data, and landmark references; and (ii) Geographical Vision APIs (GV-APIs), specialized geographic vision tools such as area segmentation and object detection. Our framework employs large language models (LLMs) as reasoning engines to dynamically combine GV-APIs and operate GCLF, effectively supporting diverse geographic vision tasks. 
To assess performance in city-scale reasoning, we introduce GeoEval3D, a comprehensive benchmark dataset containing 952 query-answer pairs across five challenging tasks: grounding, spatial reasoning, comparison, counting, and measurement. Experiments demonstrate that GeoProg3D significantly outperforms existing 3D language fields and vision-language models across multiple tasks. To our knowledge, GeoProg3D is the first framework enabling compositional geographic reasoning in high-fidelity city-scale 3D environments via natural language. The code is available at https://snskysk.github.io/GeoProg3D/.

[206] Layer Decomposition and Morphological Reconstruction for Task-Oriented Infrared Image Enhancement

Siyuan Chai,Xiaodong Guo,Tong Liu

Main category: cs.CV

TL;DR: This paper proposes a task-oriented infrared image enhancement method combining layer decomposition and morphological reconstruction-based saliency extraction to improve image quality for autonomous driving perception tasks, especially under complex weather conditions.

Details

Motivation: Infrared images often suffer from low contrast, particularly for non-heat-emitting targets like bicycles, which affects the performance of high-level vision tasks such as object detection and semantic segmentation. Enhancing contrast without amplifying noise or losing critical information remains a challenge. Method: The method involves two key components: layer decomposition for enhancing scene details while preserving dark region features, and morphological reconstruction-based saliency extraction for extracting and enhancing target information without amplifying noise. Result: Extensive experiments demonstrate that the proposed approach outperforms existing state-of-the-art methods in improving infrared image quality for downstream vision tasks. Conclusion: The paper concludes that the proposed task-oriented infrared image enhancement method effectively improves image quality for object detection and semantic segmentation tasks, outperforming state-of-the-art methods. Abstract: Infrared imaging helps improve the perception capabilities of autonomous driving in complex weather conditions such as fog, rain, and low light. However, infrared images often suffer from low contrast, especially for non-heat-emitting targets like bicycles, which significantly affects the performance of downstream high-level vision tasks. Furthermore, achieving contrast enhancement without amplifying noise and losing important information remains a challenge. To address these challenges, we propose a task-oriented infrared image enhancement method. Our approach consists of two key components: layer decomposition and saliency information extraction. First, we design a layer decomposition method for infrared images, which enhances scene details while preserving dark region features, providing more features for subsequent saliency information extraction.
Then, we propose a morphological reconstruction-based saliency extraction method that effectively extracts and enhances target information without amplifying noise. Our method improves the image quality for object detection and semantic segmentation tasks. Extensive experiments demonstrate that our approach outperforms state-of-the-art methods.
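For readers unfamiliar with morphological reconstruction, the sketch below shows the standard grayscale reconstruction-by-dilation operator: a marker image is repeatedly dilated and clipped by a mask image until it stops changing. It illustrates the generic operator only; how the paper builds its marker and mask, and how the result feeds saliency extraction, is not stated in the abstract, so the toy inputs and the 3x3 structuring element here are assumptions.

```python
import numpy as np

def dilate3x3(img):
    # Grayscale dilation with a 3x3 square: maximum over each pixel's neighborhood.
    h, w = img.shape
    padded = np.pad(img, 1, mode="edge")
    shifts = [padded[i:i + h, j:j + w] for i in range(3) for j in range(3)]
    return np.max(shifts, axis=0)

def reconstruct_by_dilation(marker, mask, max_iter=1000):
    # Grow the marker under the mask until the result is stable.
    current = np.minimum(marker, mask)
    for _ in range(max_iter):
        grown = np.minimum(dilate3x3(current), mask)
        if np.array_equal(grown, current):
            break
        current = grown
    return current

# Toy mask with two bright regions separated by a dark gap.
mask = np.zeros((5, 7))
mask[1:4, 1:3] = 5.0   # region A
mask[1:4, 5:7] = 3.0   # region B
# The marker seeds region A only.
marker = np.zeros((5, 7))
marker[2, 1] = 5.0

result = reconstruct_by_dilation(marker, mask)
```

Only the seeded region is recovered, at its full mask intensity, while the unseeded region stays at zero. This selectivity is why reconstruction can enhance chosen structures without boosting unrelated bright pixels.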

[207] OmniVCus: Feedforward Subject-driven Video Customization with Multimodal Control Conditions

Yuanhao Cai,He Zhang,Xi Chen,Jinbo Xing,Yiwei Hu,Yuqian Zhou,Kai Zhang,Zhifei Zhang,Soo Ye Kim,Tianyu Wang,Yulun Zhang,Xiaokang Yang,Zhe Lin,Alan Yuille

Main category: cs.CV

TL;DR: This paper introduces a novel framework for multi-subject video customization using a new data construction pipeline, an image-video transfer mixed training strategy, and a diffusion Transformer with innovative embedding mechanisms, achieving superior performance over existing methods.

Details

Motivation: Existing feedforward subject-driven video customization methods primarily focus on single-subject scenarios due to difficulties in constructing multi-subject training data pairs. Additionally, the use of signals like depth, mask, camera, and text prompts for controlling and editing subjects in videos remains underexplored. This paper aims to address these challenges by proposing a method for multi-subject customization and leveraging control signals for instructive editing. Method: A data construction pipeline called VideoCus-Factory is proposed to generate multi-subject training data pairs from unlabeled raw videos. An Image-Video Transfer Mixed (IVTM) training strategy is developed using image editing data, and a diffusion Transformer framework named OmniVCus is introduced with two embedding mechanisms: Lottery Embedding (LE) and Temporally Aligned Embedding (TAE). Result: Experiments show that the proposed method significantly surpasses state-of-the-art approaches in both quantitative and qualitative evaluations of subject-driven video customization. Conclusion: The proposed OmniVCus framework, along with the VideoCus-Factory data construction pipeline and IVTM training approach, significantly outperforms state-of-the-art methods in subject-driven video customization, enabling more flexible and instructive editing of customized videos. Abstract: Existing feedforward subject-driven video customization methods mainly study single-subject scenarios due to the difficulty of constructing multi-subject training data pairs. Another challenging problem, how to use signals such as depth, mask, camera, and text prompts to control and edit the subject in the customized video, remains underexplored. In this paper, we first propose a data construction pipeline, VideoCus-Factory, to produce training data pairs for multi-subject customization from raw videos without labels and control signals such as depth-to-video and mask-to-video pairs.
Based on our constructed data, we develop an Image-Video Transfer Mixed (IVTM) training with image editing data to enable instructive editing for the subject in the customized video. Then we propose a diffusion Transformer framework, OmniVCus, with two embedding mechanisms, Lottery Embedding (LE) and Temporally Aligned Embedding (TAE). LE enables inference with more subjects by using the training subjects to activate more frame embeddings. TAE encourages the generation process to extract guidance from temporally aligned control signals by assigning the same frame embeddings to the control and noise tokens. Experiments demonstrate that our method significantly surpasses state-of-the-art methods in both quantitative and qualitative evaluations. Video demos are at our project page: https://caiyuanhao1998.github.io/project/OmniVCus/. Our code will be released at https://github.com/caiyuanhao1998/Open-OmniVCus

[208] SIEDD: Shared-Implicit Encoder with Discrete Decoders

Vikram Rangarajan,Shishira Maiya,Max Ehrlich,Abhinav Shrivastava

Main category: cs.CV

TL;DR: SIEDD accelerates implicit neural video compression by using a shared encoder and discrete decoders, achieving fast encoding with high quality and control.

Details

Motivation: INRs provide excellent fidelity in video compression but are limited by slow encoding times. Existing acceleration methods sacrifice quality or coordinate-level control. Method: SIEDD employs a shared encoder trained on sparse anchor frames followed by parallel training of lightweight discrete decoders for frame groups, using aggressive coordinate-space sampling. Result: SIEDD achieves 20-30X faster encoding speeds compared to state-of-the-art INR codecs on HD and 4K benchmarks without compromising reconstruction quality or compression ratios. Conclusion: SIEDD offers a scalable and efficient solution for high-fidelity neural video compression, significantly improving encoding speed while maintaining quality and control. Abstract: Implicit Neural Representations (INRs) offer exceptional fidelity for video compression by learning per-video optimized functions, but their adoption is crippled by impractically slow encoding times. Existing attempts to accelerate INR encoding often sacrifice reconstruction quality or crucial coordinate-level control essential for adaptive streaming and transcoding. We introduce SIEDD (Shared-Implicit Encoder with Discrete Decoders), a novel architecture that fundamentally accelerates INR encoding without these compromises. SIEDD first rapidly trains a shared, coordinate-based encoder on sparse anchor frames to efficiently capture global, low-frequency video features. This encoder is then frozen, enabling massively parallel training of lightweight, discrete decoders for individual frame groups, further expedited by aggressive coordinate-space sampling. This synergistic design delivers a remarkable 20-30X encoding speed-up over state-of-the-art INR codecs on HD and 4K benchmarks, while maintaining competitive reconstruction quality and compression ratios. Critically, SIEDD retains full coordinate-based control, enabling continuous resolution decoding and eliminating costly transcoding. 
Our approach significantly advances the practicality of high-fidelity neural video compression, demonstrating a scalable and efficient path towards real-world deployment. Our codebase is available at https://github.com/VikramRangarajan/SIEDD .

[209] A High-Throughput Platform to Bench Test Smartphone-Based Heart Rate Measurements Derived From Video

Ming-Zher Poh,Jonathan Wang,Jonathan Hsu,Lawrence Cai,Eric Teasley,James A. Taylor,Jameson K. Rogers,Anupam Pathak,Shwetak Patel

Main category: cs.CV

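TL;DR: This paper presents a novel high-throughput bench-testing platform that addresses the challenges of performance evaluation and device compatibility for smartphone-based heart rate monitoring apps.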

Details

Motivation: Because of device variability and fragmentation, smartphone-based heart rate monitoring apps face significant challenges in performance evaluation and device compatibility; manual testing is impractical and standardized methods are lacking. Method: The authors designed a system comprising a test rig, a method for generating synthetic PPG test videos, and a host machine, which tests 12 smartphones in parallel. Result: The system achieved a mean absolute percentage error (MAPE) of 0.11% +/- 0.001% between input and measured heart rate, and a correlation coefficient of 0.92 +/- 0.008 between input and measured PPG signals. Conclusion: The platform offers a scalable solution for pre-deployment testing of smartphone heart rate apps to improve app performance, ensure device compatibility, and advance the field of mobile health. Abstract: Smartphone-based heart rate (HR) monitoring apps using finger-over-camera photoplethysmography (PPG) face significant challenges in performance evaluation and device compatibility due to device variability and fragmentation. Manual testing is impractical, and standardized methods are lacking. This paper presents a novel, high-throughput bench-testing platform to address this critical need. We designed a system comprising a test rig capable of holding 12 smartphones for parallel testing, a method for generating synthetic PPG test videos with controllable HR and signal quality, and a host machine for coordinating video playback and data logging. The system achieved a mean absolute percentage error (MAPE) of 0.11% +/- 0.001% between input and measured HR, and a correlation coefficient of 0.92 +/- 0.008 between input and measured PPG signals using a clinically-validated smartphone-based HR app. Bench-testing results of 20 different smartphone models correctly classified all the devices as meeting the ANSI/CTA accuracy standards for HR monitors (MAPE <10%) when compared to a prospective clinical study with 80 participants, demonstrating high positive predictive value. This platform offers a scalable solution for pre-deployment testing of smartphone HR apps to improve app performance, ensure device compatibility, and advance the field of mobile health.
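The accuracy figure quoted above is a mean absolute percentage error between the heart rate encoded in the test video and the heart rate the app reports. A minimal sketch of that computation and of the MAPE < 10% check follows; the function names and the sample readings are illustrative and do not come from the paper.

```python
def mape(reference_hr, measured_hr):
    # Mean absolute percentage error, in percent, relative to the reference values.
    pairs = list(zip(reference_hr, measured_hr))
    return 100.0 * sum(abs(m - r) / abs(r) for r, m in pairs) / len(pairs)

def meets_accuracy_threshold(reference_hr, measured_hr, limit_percent=10.0):
    # The abstract cites MAPE < 10% as the accuracy criterion for HR monitors.
    return mape(reference_hr, measured_hr) < limit_percent

# Hypothetical readings in beats per minute.
reference = [60.0, 100.0]
measured = [63.0, 100.0]
error = mape(reference, measured)
passed = meets_accuracy_threshold(reference, measured)
```

Here the first reading is off by 3 bpm out of 60 (5%) and the second is exact, so the mean error is 2.5% and the device would pass the 10% criterion.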

[210] Why Settle for Mid: A Probabilistic Viewpoint to Spatial Relationship Alignment in Text-to-image Models

Parham Rezaei,Arash Marioriyad,Mahdieh Soleymani Baghshah,Mohammad Hossein Rohban

Main category: cs.CV

TL;DR: This paper proposes PSE, a new evaluation metric, and PSG, a method to improve spatial relationship alignment in text-to-image generation without model fine-tuning.

Details

Motivation: Text-to-image models struggle with compositional generation, particularly in accurately representing specified spatial relationships between objects in the input prompts. Method: A probabilistic framework based on Probability of Superiority (PoS) is introduced to model spatial relationships. PSG uses a PoS-based reward function with gradient-based guidance or search-based strategies for better spatial alignment. Result: Experiments show that PSE aligns better with human judgment than traditional metrics, and PSG outperforms state-of-the-art methods in generating images with accurate spatial configurations. Conclusion: The proposed PSG method effectively improves the alignment of spatial relationships in text-to-image models, while the PSE metric provides a more reliable assessment of spatial relationship accuracy. Abstract: Despite the ability of text-to-image models to generate high-quality, realistic, and diverse images, they face challenges in compositional generation, often struggling to accurately represent details specified in the input prompt. A prevalent issue in compositional generation is the misalignment of spatial relationships, as models often fail to faithfully generate images that reflect the spatial configurations specified between objects in the input prompts. To address this challenge, we propose a novel probabilistic framework for modeling the relative spatial positioning of objects in a scene, leveraging the concept of Probability of Superiority (PoS). Building on this insight, we make two key contributions. First, we introduce a novel evaluation metric, PoS-based Evaluation (PSE), designed to assess the alignment of 2D and 3D spatial relationships between text and image, with improved adherence to human judgment. Second, we propose PoS-based Generation (PSG), an inference-time method that improves the alignment of 2D and 3D spatial relationships in T2I models without requiring fine-tuning. 
PSG employs a PoS-based reward function that can be utilized in two distinct ways: (1) as a gradient-based guidance mechanism applied to the cross-attention maps during the denoising steps, or (2) as a search-based strategy that evaluates a set of initial noise vectors to select the best one. Extensive experiments demonstrate that the PSE metric exhibits stronger alignment with human judgment compared to traditional center-based metrics, providing a more nuanced and reliable measure of complex spatial relationship accuracy in text-image alignment. Furthermore, PSG significantly enhances the ability of text-to-image models to generate images with specified spatial configurations, outperforming state-of-the-art methods across multiple evaluation metrics and benchmarks.
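Probability of Superiority is the probability that a draw from one distribution exceeds a draw from another. The sketch below estimates it from samples and applies it to a toy "A is to the right of B" check using pixel x-coordinates of two object masks. How the paper obtains the underlying distributions (for example from detections or attention maps) is not described in the abstract, so the mask-based setup, the tie handling, and the function name are assumptions.

```python
import numpy as np

def probability_of_superiority(x, y):
    # Empirical P(X > Y) over all sample pairs, counting ties as one half.
    x = np.asarray(x, dtype=float)[:, None]
    y = np.asarray(y, dtype=float)[None, :]
    return float((x > y).mean() + 0.5 * (x == y).mean())

# Toy binary masks on a 4x8 grid: object A on the right, object B on the left.
mask_a = np.zeros((4, 8), dtype=bool)
mask_b = np.zeros((4, 8), dtype=bool)
mask_a[1:3, 5:8] = True
mask_b[1:3, 0:3] = True

# Column indices of each object's pixels serve as samples of its x-position.
x_a = np.nonzero(mask_a)[1]
x_b = np.nonzero(mask_b)[1]
right_of = probability_of_superiority(x_a, x_b)
```

Unlike a comparison of two center points, this score uses every pixel pair, so partially overlapping objects yield a value strictly between 0 and 1 rather than an abrupt yes or no.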

[211] Detecting What Matters: A Novel Approach for Out-of-Distribution 3D Object Detection in Autonomous Vehicles

Menna Taha,Aya Ahmed,Mohammed Karmoose,Yasser Gadallah

Main category: cs.CV

TL;DR: This paper proposes an object detection approach for autonomous vehicles that focuses on determining the harmfulness of objects rather than their specific class, improving the detection of previously unseen objects and enhancing real-time decision-making for safer driving.

Details

Motivation: The motivation behind this approach is to address a significant safety concern: autonomous vehicles' limited ability to detect and respond to Out-of-Distribution (OOD) objects due to reliance on conventional object detection models that classify objects into known classes. Method: The method involves shifting from conventional class-based classification to determining object harmfulness, identifying objects as 'harmful' or 'harmless' based on their position relative to the autonomous vehicle (AV) and its trajectory. Result: The results show that the proposed model effectively detects OOD objects, evaluates their harmfulness, and classifies them accordingly. Conclusion: The paper concludes that the proposed model enhances the decision-making effectiveness of autonomous vehicles in dynamic environments by effectively detecting and evaluating the harmfulness of previously unseen objects. Abstract: Autonomous vehicles (AVs) use object detection models to recognize their surroundings and make driving decisions accordingly. Conventional object detection approaches classify objects into known classes, which limits the AV's ability to detect and appropriately respond to Out-of-Distribution (OOD) objects. This problem is a significant safety concern since the AV may fail to detect objects or misclassify them, which can potentially lead to hazardous situations such as accidents. Consequently, we propose a novel object detection approach that shifts the emphasis from conventional class-based classification to object harmfulness determination. Instead of detecting objects by their specific class, our method identifies them as either 'harmful' or 'harmless' based on whether they pose a danger to the AV. This is done based on the object position relative to the AV and its trajectory. With this metric, our model can effectively detect previously unseen objects to enable the AV to make safer real-time decisions. 
Our results demonstrate that the proposed model effectively detects OOD objects, evaluates their harmfulness, and classifies them accordingly, thus enhancing the AV decision-making effectiveness in dynamic environments.

[212] Towards foundational LiDAR world models with efficient latent flow matching

Tianran Liu,Shengwen Zhao,Nicholas Rhinehart

Main category: cs.CV

TL;DR: This paper explores transferable LiDAR world models, achieving better performance with less data and higher efficiency through a novel framework.

Details

Motivation: Existing LiDAR world models are narrowly trained and lack transferability across different domains. This research aims to develop models that generalize well in various scenarios. Method: A latent conditional flow matching (CFM)-based framework was proposed to address inefficiencies in current LiDAR world models. Result: The pre-trained model achieved up to 11% absolute improvement over training from scratch, outperformed training from scratch in 30/36 comparisons, and exceeded prior semantic occupancy forecasting models using only 5% of their labeled training data. The CFM-based framework improved computational efficiency significantly. Conclusion: LiDAR world models can be made highly transferable across multiple domains with pre-training, reducing reliance on annotated data and improving efficiency. Abstract: LiDAR-based world models offer more structured and geometry-aware representations than their image-based counterparts. However, existing LiDAR world models are narrowly trained; each model excels only in the domain for which it was built. Can we develop LiDAR world models that exhibit strong transferability across multiple domains? We conduct the first systematic domain transfer study across three demanding scenarios: (i) outdoor to indoor generalization, (ii) sparse-beam and dense-beam adaptation, and (iii) non-semantic to semantic transfer. Given different amounts of fine-tuning data, our experiments show that a single pre-trained model can achieve up to 11% absolute improvement (83% relative) over training from scratch and outperforms training from scratch in 30/36 of our comparisons. This transferability of dynamic learning significantly reduces the reliance on manually annotated data for semantic occupancy forecasting: our method exceeds the previous semantic occupancy forecasting models with only 5% of the labeled training data required by prior models. We also observed inefficiencies of current LiDAR world models, mainly through their under-compression of LiDAR data and inefficient training objectives. 
To address this, we propose a latent conditional flow matching (CFM)-based framework that achieves state-of-the-art reconstruction accuracy using only half the training data and a compression ratio 6 times higher than that of prior methods. Our model achieves SOTA performance on future-trajectory-conditioned semantic occupancy forecasting while being 23x more computationally efficient (a 28x FPS speedup); and achieves SOTA performance on semantic occupancy forecasting while being 2x more computationally efficient (a 1.1x FPS speedup).

[213] PathDiff: Histopathology Image Synthesis with Unpaired Text and Mask Conditions

Mahesh Bhosale,Abdul Wasi,Yuanhao Zhai,Yunjie Tian,Samuel Border,Nan Xi,Pinaki Sarder,Junsong Yuan,David Doermann,Xuan Gong

Main category: cs.CV

TL;DR: This paper proposes PathDiff, a diffusion-based generative model that synthesizes histopathology images using unpaired mask-text data. It offers enhanced control over image semantics and structure, generating high-quality outputs that improve data augmentation for medical tasks.

Details

Motivation: Public datasets lack paired text and mask data for histopathological images, limiting the joint use of these modalities in image generation. Combining both modalities can enhance control over semantics and spatial details. Method: PathDiff integrates both diagnostic text reports and masks into a unified conditioning space, allowing precise control over structural and contextual features in the absence of paired data. Result: PathDiff generates high-quality, semantically accurate images and improves image fidelity, text-image alignment, and faithfulness, enhancing data augmentation for downstream tasks like nuclei segmentation and classification. Extensive experiments demonstrate its superiority over existing methods. Conclusion: PathDiff is a novel diffusion framework that effectively learns from unpaired mask-text data to generate high-quality, semantically accurate histopathology images, offering improved control over structural and contextual features. Abstract: Diffusion-based generative models have shown promise in synthesizing histopathology images to address data scarcity caused by privacy constraints. Diagnostic text reports provide high-level semantic descriptions, and masks offer fine-grained spatial structures essential for representing distinct morphological regions. However, public datasets lack paired text and mask data for the same histopathological images, limiting their joint use in image generation. This constraint restricts the ability to fully exploit the benefits of combining both modalities for enhanced control over semantics and spatial details. To overcome this, we propose PathDiff, a diffusion framework that effectively learns from unpaired mask-text data by integrating both modalities into a unified conditioning space. PathDiff allows precise control over structural and contextual features, generating high-quality, semantically accurate images. 
PathDiff also improves image fidelity, text-image alignment, and faithfulness, enhancing data augmentation for downstream tasks like nuclei segmentation and classification. Extensive experiments demonstrate its superiority over existing methods.

[214] Contrastive Learning with Diffusion Features for Weakly Supervised Medical Image Segmentation

Dewen Zeng,Xinrong Hu,Yu-Jen Chen,Yawen Wu,Xiaowei Xu,Yiyu Shi

Main category: cs.CV

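TL;DR: This paper introduces CLDF, a new method that combines contrastive learning with diffusion models to improve weakly supervised semantic segmentation.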

Details

Motivation: Traditional CAM-based methods suffer from optimization discrepancies between classification and segmentation, leading to partial activations and imprecise boundaries; although CDMs have been used to generate segmentation masks, the saliency maps they produce are prone to noise from background alterations. Method: The paper combines contrastive learning with a diffusion model, integrating gradient maps generated by the CDM with CAMs to reduce false positives/negatives in contrastive learning and to support pixel embedding learning. Result: Experiments show that the method significantly outperforms existing baselines on four segmentation tasks from two public medical datasets. Conclusion: The paper proposes CLDF, a new method that trains a pixel decoder with contrastive learning to achieve more precise segmentation, addressing the limitations of traditional CAM-based methods and CDMs in WSSS. Abstract: Weakly supervised semantic segmentation (WSSS) methods using class labels often rely on class activation maps (CAMs) to localize objects. However, traditional CAM-based methods struggle with partial activations and imprecise object boundaries due to optimization discrepancies between classification and segmentation. Recently, the conditional diffusion model (CDM) has been used as an alternative for generating segmentation masks in WSSS, leveraging its strong image generation capabilities tailored to specific class distributions. By modifying or perturbing the condition during diffusion sampling, the related objects can be highlighted in the generated images. Yet, the saliency maps generated by CDMs are prone to noise from background alterations during reverse diffusion. To alleviate the problem, we introduce Contrastive Learning with Diffusion Features (CLDF), a novel method that uses contrastive learning to train a pixel decoder to map the diffusion features from a frozen CDM to a low-dimensional embedding space for segmentation. Specifically, we integrate gradient maps generated from the CDM's external classifier with CAMs to identify foreground and background pixels with fewer false positives/negatives for contrastive learning, enabling robust pixel embedding learning. Experimental results on four segmentation tasks from two public medical datasets demonstrate that our method significantly outperforms existing baselines.

[215] Time-variant Image Inpainting via Interactive Distribution Transition Estimation

Yun Xing,Qing Guo,Xiaoguang Li,Yihao Huang,Xiaofeng Cao,Di Lin,Ivor Tsang,Lei Ma

Main category: cs.CV

TL;DR: This paper introduces a new method called InDiTE-Diff to address the Time-vAriant iMage inPainting (TAMP) problem, where traditional approaches fail due to large differences between images taken at different times. The new approach performs better than current state-of-the-art techniques.

Details

Motivation: The motivation stems from the limitations of conventional reference-guided image inpainting methods when dealing with time-variant images where both the target and reference images have significant content differences and may be damaged. Method: The authors propose a novel Interactive Distribution Transition Estimation (InDiTE) module and integrate it with a state-of-the-art diffusion model, creating the InDiTE-Diff solution. They also introduce a new dataset called TAMP-Street for benchmarking purposes. Result: Experiments on the TAMP-Street dataset show that the proposed InDiTE-Diff method consistently outperforms existing state-of-the-art methods in handling time-variant image inpainting tasks. Conclusion: The study concludes that the proposed InDiTE-Diff method outperforms state-of-the-art reference-guided image inpainting methods for solving the Time-vAriant iMage inPainting (TAMP) task. Abstract: In this work, we focus on a novel and practical task, i.e., Time-vAriant iMage inPainting (TAMP). The aim of TAMP is to restore a damaged target image by leveraging the complementary information from a reference image, where both images captured the same scene but with a significant time gap in between, i.e., time-variant images. Different from conventional reference-guided image inpainting, the reference image under TAMP setup presents significant content distinction to the target image and potentially also suffers from damages. Such an application frequently happens in our daily lives to restore a damaged image by referring to another reference image, where there is no guarantee of the reference image's source and quality. In particular, our study finds that even state-of-the-art (SOTA) reference-guided image inpainting methods fail to achieve plausible results due to the chaotic image complementation. 
To address such an ill-posed problem, we propose a novel Interactive Distribution Transition Estimation (InDiTE) module which interactively complements the time-variant images with adaptive semantics, thus facilitating the restoration of damaged regions. To further boost the performance, we propose our TAMP solution, namely Interactive Distribution Transition Estimation-driven Diffusion (InDiTE-Diff), which integrates InDiTE with a SOTA diffusion model and conducts latent cross-reference during sampling. Moreover, considering the lack of benchmarks for the TAMP task, we newly assembled a dataset, i.e., TAMP-Street, based on existing image and mask datasets. We conduct experiments on the TAMP-Street dataset under two different time-variant image inpainting settings, which show that our method consistently outperforms SOTA reference-guided image inpainting methods for solving TAMP.

[216] Sanitizing Manufacturing Dataset Labels Using Vision-Language Models

Nazanin Mahjourian,Vinh Nguyen

Main category: cs.CV

TL;DR: This paper proposes Vision-Language Sanitization and Refinement (VLSR), a framework leveraging vision-language models to automatically sanitize and refine noisy multi-label manufacturing image datasets, improving label consistency and reducing label vocabulary with minimal human effort.

Details

Motivation: Large-scale datasets, especially those from crowd-sourcing and web-scraping, often contain label noise, inconsistencies, and errors. This issue is particularly severe in manufacturing domains where obtaining high-quality labels is costly and time-consuming. The paper aims to address this challenge through an automated solution for label sanitization and refinement. Method: VLSR embeds images and textual labels into a shared semantic space using the CLIP vision-language model. It performs label sanitization by computing cosine similarity between image and label embeddings to identify weak labels and surface the best-aligned ones. Additionally, it applies density-based clustering on text embeddings followed by iterative cluster merging to unify semantically similar labels. Result: Using the Factorynet dataset with noisy labels, experiments showed that the VLSR framework successfully identifies problematic labels, enhances label consistency, and significantly reduces label vocabulary through clustering. Conclusion: The VLSR framework effectively identifies problematic labels and improves label consistency, reducing the label vocabulary with minimal human intervention and enhancing dataset quality for robust machine learning model training in industrial applications. Abstract: The success of machine learning models in industrial applications is heavily dependent on the quality of the datasets used to train the models. However, large-scale datasets, especially those constructed from crowd-sourcing and web-scraping, often suffer from label noise, inconsistencies, and errors. This problem is particularly pronounced in manufacturing domains, where obtaining high-quality labels is costly and time-consuming. This paper introduces Vision-Language Sanitization and Refinement (VLSR), which is a vision-language-based framework for label sanitization and refinement in multi-label manufacturing image datasets. 
This method embeds both images and their associated textual labels into a shared semantic space leveraging the CLIP vision-language model. Then two key tasks are addressed in this process by computing the cosine similarity between embeddings. First, label sanitization is performed to identify irrelevant, misspelled, or semantically weak labels, and surface the most semantically aligned label for each image by comparing image-label pairs using cosine similarity between image and label embeddings. Second, the method applies density-based clustering on text embeddings, followed by iterative cluster merging, to group semantically similar labels into unified label groups. The Factorynet dataset, which includes noisy labels from both human annotations and web-scraped sources, is employed to evaluate the effectiveness of the proposed framework. Experimental results demonstrate that the VLSR framework successfully identifies problematic labels and improves label consistency. This method enables a significant reduction in label vocabulary through clustering, which ultimately enhances the dataset's quality for training robust machine learning models in industrial applications with minimal human intervention.
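
The label sanitization step described above can be sketched as follows. This is a minimal illustration under stated assumptions: the embeddings are tiny hand-written vectors standing in for CLIP outputs, and the `threshold` value and the `sanitize_labels` helper are hypothetical, not taken from the paper.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def sanitize_labels(image_emb, label_embs, threshold=0.5):
    """Keep labels whose similarity to the image meets the (assumed)
    threshold, and report the single best-aligned label."""
    scores = {label: cosine_similarity(image_emb, emb)
              for label, emb in label_embs.items()}
    kept = [label for label, score in scores.items() if score >= threshold]
    best = max(scores, key=scores.get)
    return kept, best


# Toy stand-ins for CLIP embeddings of one image and its noisy labels.
image_emb = [1.0, 0.0, 0.0]
label_embs = {
    "gear": [0.9, 0.1, 0.0],
    "geer": [0.0, 1.0, 0.0],        # misspelled label, poorly aligned
    "metal part": [0.6, 0.0, 0.8],
}

kept, best = sanitize_labels(image_emb, label_embs)
print(kept, best)
```

In a real pipeline the vectors would come from a vision-language encoder, and the cutoff would need tuning on held-out data rather than being fixed in advance.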

[217] AdFair-CLIP: Adversarial Fair Contrastive Language-Image Pre-training for Chest X-rays

Chenlang Yi,Zizhan Xiong,Qi Qi,Xiyuan Wei,Girish Bathla,Ching-Long Lin,Bobak Jack Mortazavi,Tianbao Yang

Main category: cs.CV

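TL;DR: This paper proposes AdFair-CLIP, a new framework for improving the fairness of CLIP models that uses adversarial feature intervention to reduce the influence of sensitive attributes; experiments show that it improves both fairness and accuracy in medical image classification.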

Details

Motivation: Although CLIP models perform well across various visual tasks, their fairness issues (particularly those related to race and gender) have received limited attention, which can lead to disparities in diagnostic outcomes and reduced reliability for underrepresented groups. Method: A new framework, AdFair-CLIP, is introduced that uses adversarial feature intervention to suppress sensitive attributes, thereby reducing spurious correlations and improving prediction fairness. Result: Comprehensive experiments on chest X-ray (CXR) datasets show that AdFair-CLIP significantly improves fairness and diagnostic accuracy while maintaining strong generalization in zero-shot and few-shot scenarios. Conclusion: AdFair-CLIP effectively improves the fairness and diagnostic accuracy of CLIP models in medical image classification while preserving generalization in zero-shot and few-shot scenarios. Abstract: Contrastive Language-Image Pre-training (CLIP) models have demonstrated superior performance across various visual tasks including medical image classification. However, fairness concerns, including demographic biases, have received limited attention for CLIP models. This oversight leads to critical issues, particularly those related to race and gender, resulting in disparities in diagnostic outcomes and reduced reliability for underrepresented groups. To address these challenges, we introduce AdFair-CLIP, a novel framework employing adversarial feature intervention to suppress sensitive attributes, thereby mitigating spurious correlations and improving prediction fairness. We conduct comprehensive experiments on chest X-ray (CXR) datasets, and show that AdFair-CLIP significantly enhances both fairness and diagnostic accuracy, while maintaining robust generalization in zero-shot and few-shot scenarios. These results establish new benchmarks for fairness-aware learning in CLIP-based medical diagnostic models, particularly for CXR analysis.

Xuan Yao,Junyu Gao,Changsheng Xu

Main category: cs.CV

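TL;DR: NavMorph is a self-evolving world model framework for vision-and-language navigation that uses compact latent representations and a Contextual Evolution Memory to improve the agent's environmental understanding and decision-making, and it performs strongly on benchmarks.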

Details

Motivation: Current methods struggle to generalize to novel environments and to adapt to changes during navigation, so NavMorph is proposed to improve on both. Method: The study introduces NavMorph, which uses compact latent representations and a Contextual Evolution Memory to improve navigation performance. Result: Extensive experiments show that NavMorph achieves notable performance improvements on popular VLN-CE benchmarks. Conclusion: NavMorph is a self-evolving world model framework that models environmental dynamics through compact latent representations, enhancing the agent's environmental understanding and decision-making in VLN-CE tasks. Abstract: Vision-and-Language Navigation in Continuous Environments (VLN-CE) requires agents to execute sequential navigation actions in complex environments guided by natural language instructions. Current approaches often struggle with generalizing to novel environments and adapting to ongoing changes during navigation. Inspired by human cognition, we present NavMorph, a self-evolving world model framework that enhances environmental understanding and decision-making in VLN-CE tasks. NavMorph employs compact latent representations to model environmental dynamics, equipping agents with foresight for adaptive planning and policy refinement. By integrating a novel Contextual Evolution Memory, NavMorph leverages scene-contextual information to support effective navigation while maintaining online adaptability. Extensive experiments demonstrate that our method achieves notable performance improvements on popular VLN-CE benchmarks. Code is available at https://github.com/Feliciaxyao/NavMorph.

[219] Interactive Interface For Semantic Segmentation Dataset Synthesis

Ngoc-Do Tran,Minh-Tuan Huynh,Tam V. Nguyen,Minh-Triet Tran,Trung-Nghia Le

Main category: cs.CV

TL;DR: SynthLab is a user-friendly, modular platform designed to streamline the creation of high-quality synthetic datasets for computer vision tasks, making AI more accessible to non-experts.

Details

Motivation: Creating high-quality annotated datasets for semantic segmentation is resource-intensive, requiring significant time, labor, and financial investment, and often raises privacy concerns. There is a need for a scalable and user-friendly solution. Method: SynthLab provides a modular platform for visual data synthesis and a user-friendly interface that allows quick customization of data pipelines through drag-and-drop actions. Result: Extensive user studies showed SynthLab's flexibility, adaptability, and high accessibility across diverse users with varying expertise levels. Conclusion: SynthLab enables users without deep technical expertise to harness AI for real-world applications. Abstract: The rapid advancement of AI and computer vision has significantly increased the demand for high-quality annotated datasets, particularly for semantic segmentation. However, creating such datasets is resource-intensive, requiring substantial time, labor, and financial investment, and often raises privacy concerns due to the use of real-world data. To mitigate these challenges, we present SynthLab, consisting of a modular platform for visual data synthesis and a user-friendly interface. The modular architecture of SynthLab enables easy maintenance, scalability with centralized updates, and seamless integration of new features. Each module handles distinct aspects of computer vision tasks, enhancing flexibility and adaptability. Meanwhile, its interactive, user-friendly interface allows users to quickly customize their data pipelines through drag-and-drop actions. Extensive user studies involving a diverse range of users across different ages, professions, and expertise levels have demonstrated the flexible usage and high accessibility of SynthLab, enabling users without deep technical expertise to harness AI for real-world applications.

[220] GeoCD: A Differential Local Approximation for Geodesic Chamfer Distance

Pedro Alonso,Tianrui Li,Chongshou Li

Main category: cs.CV

TL;DR: This paper introduces GeoCD, a new metric for 3D point cloud learning that addresses the limitations of Chamfer Distance by incorporating geodesic distance and topology awareness, leading to better reconstruction results.

Details

Motivation: Chamfer Distance (CD) is widely used in 3D point cloud learning but has limitations because it relies solely on Euclidean distances, which fail to capture the intrinsic geometry of 3D shapes. Method: The authors proposed GeoCD, an improved metric for 3D point cloud learning that approximates geodesic distance while being topology-aware and fully differentiable. They conducted experiments to compare GeoCD with standard CD across various architectures and datasets. Result: GeoCD consistently improved reconstruction quality over standard CD in multiple models and datasets. Fine-tuning models trained with CD using GeoCD for just one epoch led to significant improvements across evaluation metrics. Conclusion: GeoCD, a topology-aware and fully differentiable approximation of geodesic distance, improves 3D point cloud learning by capturing the intrinsic geometry of shapes better than Chamfer Distance. Abstract: Chamfer Distance (CD) is a widely adopted metric in 3D point cloud learning due to its simplicity and efficiency. However, it suffers from a fundamental limitation: it relies solely on Euclidean distances, which often fail to capture the intrinsic geometry of 3D shapes. To address this limitation, we propose GeoCD, a topology-aware and fully differentiable approximation of geodesic distance designed to serve as a metric for 3D point cloud learning. Our experiments show that GeoCD consistently improves reconstruction quality over standard CD across various architectures and datasets. We demonstrate this by fine-tuning several models, initially trained with standard CD, using GeoCD. Remarkably, fine-tuning for a single epoch with GeoCD yields significant gains across multiple evaluation metrics.
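
For reference, the standard Chamfer Distance that GeoCD is compared against can be sketched in a few lines. This is only the Euclidean baseline, not GeoCD itself, and it uses one common convention (mean of squared nearest-neighbor distances, summed over both directions); other conventions exist, and the abstract does not say which one the authors use.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def chamfer_distance(cloud_a, cloud_b):
    """Symmetric Chamfer Distance: for every point, find its nearest
    neighbor in the other cloud, then average and sum both directions."""
    if not cloud_a or not cloud_b:
        raise ValueError("both point clouds must be non-empty")
    a_to_b = sum(min(squared_distance(p, q) for q in cloud_b)
                 for p in cloud_a) / len(cloud_a)
    b_to_a = sum(min(squared_distance(q, p) for p in cloud_a)
                 for q in cloud_b) / len(cloud_b)
    return a_to_b + b_to_a


# Two tiny illustrative point clouds (made up for this example).
cloud_a = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
cloud_b = [(0.0, 0.0, 0.0), (1.0, 0.0, 1.0)]

cd = chamfer_distance(cloud_a, cloud_b)
print(cd)
```

Because every term is a straight-line distance, two points on opposite sides of a thin gap in a surface look close to this metric even when the path along the surface between them is long, which is the limitation the abstract attributes to CD.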

[221] Instant GaussianImage: A Generalizable and Self-Adaptive Image Representation via 2D Gaussian Splatting

Zhaojie Zeng,Yuesong Wang,Chao Yang,Tao Guan,Lili Ju

Main category: cs.CV

TL;DR: This paper introduces a faster and more adaptive image representation method using dynamic 2D Gaussian Splatting, achieving improved efficiency without sacrificing rendering quality.

Details

Motivation: INR methods demand substantial GPU resources, while GaussianImage trains slowly and its fixed number of Gaussians limits adaptability. The paper aims to reduce cost, improve training speed, and enhance adaptability. Method: A network generates a coarse Gaussian representation followed by minimal fine-tuning steps, dynamically adapting the number of Gaussians based on image complexity. Result: Experiments show that the proposed method matches or exceeds GaussianImage's rendering performance with far fewer iterations and up to one order of magnitude reduction in training time. Conclusion: The proposed method based on 2D Gaussian Splatting offers a generalizable, self-adaptive image representation framework that significantly reduces training time and enhances flexibility by dynamically adjusting the number of Gaussian points. Abstract: Implicit Neural Representation (INR) has demonstrated remarkable advances in the field of image representation but demands substantial GPU resources. GaussianImage recently pioneered the use of Gaussian Splatting to mitigate this cost; however, the slow training process limits its practicality, and the fixed number of Gaussians per image limits its adaptability to varying information entropy. To address these issues, we propose in this paper a generalizable and self-adaptive image representation framework based on 2D Gaussian Splatting. Our method employs a network to quickly generate a coarse Gaussian representation, followed by minimal fine-tuning steps, achieving rendering quality comparable to that of GaussianImage while significantly reducing training time. Moreover, our approach dynamically adjusts the number of Gaussian points based on image complexity to further enhance flexibility and efficiency in practice. Experiments on DIV2K and Kodak datasets show that our method matches or exceeds GaussianImage's rendering performance with far fewer iterations and shorter training times. 
Specifically, our method reduces the training time by up to one order of magnitude while achieving superior rendering performance with the same number of Gaussians.

[222] Evaluation of Geolocation Capabilities of Multimodal Large Language Models and Analysis of Associated Privacy Risks

Xian Zhang,Xiang Cheng

Main category: cs.CV

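TL;DR: This study examines the use of multimodal large language models (MLLMs) for geolocation and finds that they can localize the origin of street-level imagery within a 1-kilometer radius with up to 49% accuracy, a capability that raises privacy and ethical concerns.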

Details

Motivation: The development of multimodal large language models (MLLMs) has raised important privacy and ethical concerns. Because MLLMs can infer the geographic location of an image from its visual content alone, they pose serious risks of privacy invasion, including doxxing, surveillance, and other security threats. Method: The study comprehensively analyzes existing MLLM-based geolocation techniques, systematically reviews the relevant literature, and evaluates state-of-the-art visual reasoning models on geolocation tasks, particularly on identifying the origins of street view imagery. Result: Empirical evaluation shows that the most advanced visual large models can localize the origin of street-level imagery within a 1-kilometer radius with up to 49% accuracy. This performance highlights the models' strong ability to extract and use fine-grained geographic cues from visual data. Conclusion: The study identifies key visual elements that contribute to successful geolocation and discusses the potential privacy implications of MLLM-enabled geolocation, along with several technical and policy-based countermeasures to mitigate the associated risks. Abstract: Objectives: The rapid advancement of Multimodal Large Language Models (MLLMs) has significantly enhanced their reasoning capabilities, enabling a wide range of intelligent applications. However, these advancements also raise critical concerns regarding privacy and ethics. MLLMs are now capable of inferring the geographic location of images -- such as those shared on social media or captured from street views -- based solely on visual content, thereby posing serious risks of privacy invasion, including doxxing, surveillance, and other security threats. Methods: This study provides a comprehensive analysis of existing geolocation techniques based on MLLMs. It systematically reviews relevant literature and evaluates the performance of state-of-the-art visual reasoning models on geolocation tasks, particularly in identifying the origins of street view imagery. Results: Empirical evaluation reveals that the most advanced visual large models can successfully localize the origin of street-level imagery with up to 49% accuracy within a 1-kilometer radius. This performance underscores the models' powerful capacity to extract and utilize fine-grained geographic cues from visual data. Conclusions: Building on these findings, the study identifies key visual elements that contribute to successful geolocation, such as text, architectural styles, and environmental features. Furthermore, it discusses the potential privacy implications associated with MLLM-enabled geolocation and discusses several technical and policy-based countermeasures to mitigate associated risks. 
Our code and dataset are available at https://github.com/zxyl1003/MLLM-Geolocation-Evaluation.
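
An "accuracy within a 1-kilometer radius" figure like the one reported above can be computed along the following lines. This is a hedged sketch, not the authors' evaluation code: the haversine great-circle formula is a standard choice for such a metric, and the coordinates and helper names below are invented for illustration.

```python
import math

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two (lat, lon) points
    given in decimal degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def accuracy_within(predictions, ground_truth, radius_km=1.0):
    """Fraction of predictions that fall within radius_km of the truth."""
    if len(predictions) != len(ground_truth) or not ground_truth:
        raise ValueError("inputs must be non-empty and of equal length")
    hits = sum(1 for p, t in zip(predictions, ground_truth)
               if haversine_km(p[0], p[1], t[0], t[1]) <= radius_km)
    return hits / len(ground_truth)


# Made-up example: the first guess is a few hundred meters off,
# the second is several kilometers off.
ground_truth = [(48.8584, 2.2945), (40.6892, -74.0445)]
predictions = [(48.8600, 2.2950), (40.7128, -74.0060)]

acc = accuracy_within(predictions, ground_truth)
print(acc)
```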

[223] MTADiffusion: Mask Text Alignment Diffusion Model for Object Inpainting

Jun Huang,Ting Liu,Yihang Wu,Xiaochao Qu,Luoqi Liu,Xiaolin Hu

Main category: cs.CV

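TL;DR: This paper proposes MTADiffusion, a new method for object inpainting that addresses the semantic, structural, and style problems of existing methods and performs strongly on multiple benchmarks.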

Details

Motivation: Existing image inpainting methods suffer from semantic misalignment, structural distortion, and style inconsistency, so a new method is needed to address these problems. Method: The paper proposes the MTADiffusion model, the MTAPipeline automatic annotation solution, a multi-task training strategy, and a new inpainting style-consistency loss. It also constructs MTADataset, comprising 5 million images and 25 million mask-text pairs. Result: Comprehensive evaluations on BrushBench and EditBench show that MTADiffusion outperforms other methods and achieves state-of-the-art performance. Conclusion: MTADiffusion achieves state-of-the-art performance compared with other methods, addressing the problems of existing inpainting methods by strengthening semantic capability, structural stability, and style consistency. Abstract: Advancements in generative models have enabled image inpainting models to generate content within specific regions of an image based on provided prompts and masks. However, existing inpainting methods often suffer from problems such as semantic misalignment, structural distortion, and style inconsistency. In this work, we present MTADiffusion, a Mask-Text Alignment diffusion model designed for object inpainting. To enhance the semantic capabilities of the inpainting model, we introduce MTAPipeline, an automatic solution for annotating masks with detailed descriptions. Based on the MTAPipeline, we construct a new MTADataset comprising 5 million images and 25 million mask-text pairs. Furthermore, we propose a multi-task training strategy that integrates both inpainting and edge prediction tasks to improve structural stability. To promote style consistency, we present a novel inpainting style-consistency loss using a pre-trained VGG network and the Gram matrix. Comprehensive evaluations on BrushBench and EditBench demonstrate that MTADiffusion achieves state-of-the-art performance compared to other methods.
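
The Gram-matrix idea behind a style-consistency loss can be illustrated with a small sketch. It is not the paper's loss: the feature maps below are hand-written lists standing in for VGG activations, and the normalization (dividing by the number of spatial positions, then averaging squared differences) is an assumption chosen for simplicity.

```python
def gram_matrix(features):
    """features: list of C channels, each a flat list of N activations.
    Returns the C x C matrix of channel-wise inner products, divided by N."""
    n = len(features[0])
    return [[sum(a * b for a, b in zip(ch_i, ch_j)) / n
             for ch_j in features]
            for ch_i in features]


def style_loss(features_a, features_b):
    """Mean squared difference between the two Gram matrices."""
    gram_a = gram_matrix(features_a)
    gram_b = gram_matrix(features_b)
    c = len(gram_a)
    total = sum((gram_a[i][j] - gram_b[i][j]) ** 2
                for i in range(c) for j in range(c))
    return total / (c * c)


# Toy two-channel feature maps with two spatial positions each.
inpainted_region = [[1.0, 0.0], [0.0, 1.0]]
surrounding_area = [[1.0, 1.0], [1.0, 1.0]]

loss = style_loss(inpainted_region, surrounding_area)
print(loss)
```

The Gram matrix discards spatial layout and keeps only how strongly channels co-activate, which is why it is commonly used as a proxy for texture or style rather than content.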

[224] Qwen-GUI-3B: A Lightweight Vision-Language Model for Cross-Resolution GUI Grounding

ZongHan Hsieh,Tzer-Jen Wei

Main category: cs.CV

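TL;DR: This paper proposes Qwen-GUI-3B, a lightweight vision-language model designed for GUI grounding that, through its data processing and training methods, outperforms existing models under 4B parameters.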

Details

Motivation: Large-scale vision-language models are computationally intensive and impractical for consumer-grade hardware, so an efficient lightweight model is needed. Method: The paper proposes a two-stage fine-tuning strategy combined with a cross-platform, multi-resolution dataset, along with data curation and redundancy reduction strategies to improve model adaptability. Result: Qwen-GUI-3B performs well on standard GUI benchmarks, reaching 84.9% accuracy on ScreenSpot and 86.4% on ScreenSpot-v2. Conclusion: Qwen-GUI-3B is a lightweight vision-language model designed for GUI grounding whose performance is competitive with significantly larger models. Abstract: This paper introduces Qwen-GUI-3B, a lightweight Vision-Language Model (VLM) specifically designed for Graphical User Interface grounding tasks, achieving performance competitive with significantly larger models. Unlike large-scale VLMs (>7B parameters) that are computationally intensive and impractical for consumer-grade hardware, Qwen-GUI-3B delivers strong grounding accuracy while being fully trainable on a single GPU (RTX 4090). The model incorporates several key innovations: (i) a combined cross-platform, multi-resolution dataset of 24K examples from diverse sources including mobile, desktop, and web GUI screenshots to effectively address data scarcity in high-resolution desktop environments; (ii) a two-stage fine-tuning strategy, where initial cross-platform training establishes robust GUI understanding, followed by specialized fine-tuning on high-resolution data to significantly enhance model adaptability; and (iii) data curation and redundancy reduction strategies, demonstrating that randomly sampling a smaller subset with reduced redundancy achieves performance comparable to larger datasets, emphasizing data diversity over sheer volume. Empirical evaluation on standard GUI grounding benchmarks, including ScreenSpot, ScreenSpot-v2, and the challenging ScreenSpot-Pro, highlights Qwen-GUI-3B's exceptional accuracy, achieving 84.9% on ScreenSpot and 86.4% on ScreenSpot-v2, surpassing prior models under 4B parameters. Ablation studies validate the critical role of balanced sampling and two-stage fine-tuning in enhancing robustness, particularly in high-resolution desktop scenarios. Qwen-GUI-3B is available at: https://github.com/Han1018/Qwen-GUI-3B

Mengxiao Tian,Xinxiao Wu,Shuo Yang

Main category: cs.CV

TL;DR: This paper proposes an LLM-enhanced action-aware multi-modal prompt-tuning method that aims to improve CLIP's fine-grained, action-level understanding.

Details

Motivation: CLIP falls short in understanding fine-grained details such as object attributes and spatial relationships between objects. Existing methods introduce prompt learning to achieve object-level alignment, but they still lack the ability to perceive actions, which is crucial for describing the states of objects or the relationships between them. Method: The authors design an action triplet prompt and an action state prompt that exploit the compositional semantic knowledge and state-related causal knowledge implicitly stored in LLMs, and propose an adaptive interaction module that aggregates attentive visual features to build discriminative, action-aware visual representations. Result: Comprehensive experiments on two benchmark datasets validate the effectiveness of the method. Conclusion: By incorporating action-related external knowledge generated by LLMs, the proposed action-aware multi-modal prompt-tuning method significantly improves CLIP's performance on fine-grained action-level understanding. Abstract: Driven by large-scale contrastive vision-language pre-trained models such as CLIP, recent advancements in the image-text matching task have achieved remarkable success in representation learning. Due to image-level visual-language alignment, CLIP falls short in understanding fine-grained details such as object attributes and spatial relationships between objects. Recent efforts have attempted to compel CLIP to acquire structured visual representations by introducing prompt learning to achieve object-level alignment. While achieving promising results, they still lack the capability to perceive actions, which are crucial for describing the states or relationships between objects. Therefore, we propose to endow CLIP with fine-grained action-level understanding by introducing an LLM-enhanced action-aware multi-modal prompt-tuning method, incorporating the action-related external knowledge generated by large language models (LLMs). Specifically, we design an action triplet prompt and an action state prompt to exploit compositional semantic knowledge and state-related causal knowledge implicitly stored in LLMs. Subsequently, we propose an adaptive interaction module to aggregate attentive visual features conditioned on action-aware prompted knowledge for establishing discriminative and action-aware visual representations, which further improves the performance. Comprehensive experimental results on two benchmark datasets demonstrate the effectiveness of our method.

[226] Improve Underwater Object Detection through YOLOv12 Architecture and Physics-informed Augmentation

Tinh Nguyen

Main category: cs.CV

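TL;DR: This paper presents an underwater object detection method that combines physics-informed augmentation with the YOLOv12 architecture. It performs well under low visibility and markedly improves real-time performance and accuracy.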

Details

Motivation: Underwater object detection is crucial for autonomous navigation, environmental monitoring, and marine exploration, but it is severely affected by light attenuation, turbidity, and occlusion, and existing methods are hard to deploy in real time under low visibility. Method: Physics-informed augmentation is combined with the YOLOv12 architecture, using Residual ELAN blocks to preserve structural features in turbid waters and Area Attention to maintain large receptive fields while reducing computational complexity. Result: Extensive tests on four challenging datasets show 98.30% mAP at 142 FPS on Brackish data; compared with previous models, occlusion robustness improves by 18.9%, small-object recall by 22.4%, and detection precision by up to 7.94%. Conclusion: The paper proposes an underwater object detection method based on the YOLOv12 architecture that combines physics-informed augmentation with architectural improvements, achieving state-of-the-art accuracy and computational efficiency and offering an effective solution for underwater robotics and conservation applications. Abstract: Underwater object detection is crucial for autonomous navigation, environmental monitoring, and marine exploration, but it is severely hampered by light attenuation, turbidity, and occlusion. Current methods balance accuracy and computational efficiency, but they are hard to deploy in real time under low-visibility conditions. Through the integration of physics-informed augmentation techniques with the YOLOv12 architecture, this study advances underwater detection. The architecture uses Residual ELAN blocks to preserve structural features in turbid waters and Area Attention to maintain large receptive fields for occluded objects while reducing computational complexity. Underwater optical properties are addressed by domain-specific augmentations such as turbulence adaptive blurring, biologically grounded occlusion simulation, and spectral HSV transformations for color distortion. Extensive tests on four difficult datasets show state-of-the-art performance, with Brackish data registering 98.30% mAP at 142 FPS. YOLOv12 improves occlusion robustness by 18.9%, small-object recall by 22.4%, and detection precision by up to 7.94% compared to previous models. The crucial role of augmentation strategy is validated by ablation studies. This work offers a precise and effective solution for conservation and underwater robotics applications.

[227] ViewPoint: Panoramic Video Generation with Pretrained Diffusion Models

Zixun Fang,Kai Zhu,Zhiheng Liu,Yu Liu,Wei Zhai,Yang Cao,Zheng-Jun Zha

Main category: cs.CV

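TL;DR: This paper introduces a method that improves panoramic video generation by bridging the modality gap between panoramic and perspective data, achieving better visual quality and spatial continuity.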

Details

Motivation: Existing works fail to synthesize high-quality panoramic videos because of the inherent modality gap between panoramic data and perspective data, which makes up most of the training data for modern diffusion models. Method: The authors design a new panorama representation (the ViewPoint map) and a Pano-Perspective attention mechanism to exploit pretrained perspective priors and capture panoramic spatial correlations. Result: Experiments show that the method synthesizes highly dynamic and spatially consistent panoramic videos and surpasses previous methods. Conclusion: The paper proposes a framework that uses pretrained perspective video models to generate panoramic videos, outperforming existing methods in dynamics and spatial consistency and achieving state-of-the-art performance. Abstract: Panoramic video generation aims to synthesize 360-degree immersive videos, holding significant importance in the fields of VR, world models, and spatial intelligence. Existing works fail to synthesize high-quality panoramic videos due to the inherent modality gap between panoramic data and perspective data, which constitutes the majority of the training data for modern diffusion models. In this paper, we propose a novel framework utilizing pretrained perspective video models for generating panoramic videos. Specifically, we design a novel panorama representation named ViewPoint map, which possesses global spatial continuity and fine-grained visual details simultaneously. With our proposed Pano-Perspective attention mechanism, the model benefits from pretrained perspective priors and captures the panoramic spatial correlations of the ViewPoint map effectively. Extensive experiments demonstrate that our method can synthesize highly dynamic and spatially consistent panoramic videos, achieving state-of-the-art performance and surpassing previous methods.

[228] WAVE: Warp-Based View Guidance for Consistent Novel View Synthesis Using a Single Image

Jiwoo Park,Tae Eun Choi,Youngjun Jun,Seong Jae Hwang

Main category: cs.CV

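TL;DR: This paper proposes a way to enhance diffusion models without additional modules, using view-guided warping to improve view consistency in novel view synthesis.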

Details

Motivation: Existing diffusion models struggle to preserve spatial continuity across views in novel view synthesis. Approaches that combine them with 3D models address the problem but are inefficient because of their complex multi-step pipelines. Method: View-guided warping is used to enhance diffusion models without additional modules, enabling adaptive attention manipulation and noise reinitialization. Result: Using the proposed comprehensive metric framework, experiments show that the method improves view consistency across various diffusion models, demonstrating its broad applicability. Conclusion: The paper proposes a view-consistent image generation method that lets diffusion models maintain view consistency without additional modules, and uses a comprehensive metric framework to show that it applies across multiple diffusion models. Abstract: Generating high-quality novel views of a scene from a single image requires maintaining structural coherence across different views, referred to as view consistency. While diffusion models have driven advancements in novel view synthesis, they still struggle to preserve spatial continuity across views. Diffusion models have been combined with 3D models to address the issue, but such approaches lack efficiency due to their complex multi-step pipelines. This paper proposes a novel view-consistent image generation method which utilizes diffusion models without additional modules. Our key idea is to enhance diffusion models with a training-free method that enables adaptive attention manipulation and noise reinitialization by leveraging view-guided warping to ensure view consistency. Through our comprehensive metric framework suitable for novel-view datasets, we show that our method improves view consistency across various diffusion models, demonstrating its broader applicability.

[229] From Sight to Insight: Unleashing Eye-Tracking in Weakly Supervised Video Salient Object Detection

Qi Qin,Runmin Cong,Gen Zhan,Yiting Liao,Sam Kwong

Main category: cs.CV

TL;DR: This paper proposes a novel approach for video salient object detection using eye-tracking data under weak supervision, achieving state-of-the-art results on benchmark datasets.

Details

Motivation: Eye-tracking annotations are easier to obtain and align well with human visual patterns, making them valuable for improving the accuracy of video salient object detection tasks under weak supervision. Method: The paper introduces a Position and Semantic Embedding (PSE) module, a Semantics and Locality Query (SLQ) Competitor, and an Intra-Inter Mixed Contrastive (IIMC) model for spatiotemporal feature modeling under weak supervision. Result: The model outperforms other methods on multiple evaluation metrics across five popular VSOD benchmarks. Conclusion: The proposed method effectively utilizes fixation information to improve video salient object detection under weak supervision, demonstrating superior performance on five VSOD benchmarks. Abstract: The eye-tracking video saliency prediction (VSP) task and video salient object detection (VSOD) task both focus on the most attractive objects in video and show the result in the form of predictive heatmaps and pixel-level saliency masks, respectively. In practical applications, eye tracker annotations are more readily obtainable and align closely with the authentic visual patterns of human eyes. Therefore, this paper aims to introduce fixation information to assist the detection of video salient objects under weak supervision. On the one hand, we ponder how to better explore and utilize the information provided by fixation, and then propose a Position and Semantic Embedding (PSE) module to provide location and semantic guidance during the feature learning process. On the other hand, we achieve spatiotemporal feature modeling under weak supervision from the aspects of feature selection and feature contrast. A Semantics and Locality Query (SLQ) Competitor with semantic and locality constraints is designed to effectively select the most matching and accurate object query for spatiotemporal modeling. 
In addition, an Intra-Inter Mixed Contrastive (IIMC) model improves the spatiotemporal modeling capabilities under weak supervision by forming an intra-video and inter-video contrastive learning paradigm. Experimental results on five popular VSOD benchmarks indicate that our model outperforms other competitors on various evaluation metrics.

[230] Lightweight Temporal Transformer Decomposition for Federated Autonomous Driving

Tuong Do,Binh X. Nguyen,Quang D. Tran,Erman Tjiputra,Te-Chuan Chiu,Anh Nguyen

Main category: cs.CV

TL;DR: This paper proposes a lightweight temporal transformer decomposition approach to enhance autonomous driving performance by efficiently utilizing temporal data while reducing model complexity.

Details

Motivation: Traditional vision-based autonomous driving systems struggle in complex environments with single-image inputs, and incorporating temporal data enhances robustness but existing methods are resource-intensive and impractical for training or federated learning. Method: Lightweight temporal transformer decomposition that processes sequential image frames and steering data by breaking down large attention maps into smaller matrices. Result: Experiments on three datasets show the method outperforms recent approaches with real-time performance, further confirmed by real robot experiments. Conclusion: The proposed lightweight temporal transformer decomposition method effectively enhances autonomous driving performance by leveraging temporal information while reducing model complexity. Abstract: Traditional vision-based autonomous driving systems often face difficulties in navigating complex environments when relying solely on single-image inputs. To overcome this limitation, incorporating temporal data such as past image frames or steering sequences, has proven effective in enhancing robustness and adaptability in challenging scenarios. While previous high-performance methods exist, they often rely on resource-intensive fusion networks, making them impractical for training and unsuitable for federated learning. To address these challenges, we propose lightweight temporal transformer decomposition, a method that processes sequential image frames and temporal steering data by breaking down large attention maps into smaller matrices. This approach reduces model complexity, enabling efficient weight updates for convergence and real-time predictions while leveraging temporal information to enhance autonomous driving performance. Intensive experiments on three datasets demonstrate that our method outperforms recent approaches by a clear margin while achieving real-time performance. Additionally, real robot experiments further confirm the effectiveness of our method.
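The abstract above says only that large attention maps are broken down into smaller matrices; it does not give the exact factorization. As a rough illustration of why that reduces model size, the sketch below assumes a plain low-rank factorization (an n x n map approximated as U V^T with two n x r factors). The function names and the choice of low-rank factors are hypothetical and are not taken from the paper.

```python
def full_attention_entries(n_tokens):
    """Number of entries in a dense n x n attention map."""
    return n_tokens * n_tokens


def factored_attention_entries(n_tokens, rank):
    """Number of entries in two n x r factors (assumed low-rank form)."""
    return 2 * n_tokens * rank


def reconstruct(u, v):
    """Rebuild the n x n map as U @ V^T from two n x r factors (lists of rows)."""
    return [[sum(ui[k] * vj[k] for k in range(len(ui))) for vj in v] for ui in u]


if __name__ == "__main__":
    n, r = 1024, 16  # illustrative sizes, not values from the paper
    print(full_attention_entries(n), factored_attention_entries(n, r))
```

With these illustrative sizes the factored form stores 32 times fewer entries than the dense map, which is the kind of saving that makes weight updates cheaper to exchange in a federated setting.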

[231] When Test-Time Adaptation Meets Self-Supervised Models

Jisu Han,Jihee Park,Dongyoon Han,Wonjun Hwang

Main category: cs.CV

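TL;DR: This paper explores combining self-supervised learning (SSL) with test-time adaptation (TTA), without source-domain pretraining, to make models more practical in dynamic environments.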

Details

Motivation: Existing TTA methods struggle when applied to self-supervised models with low accuracy, and online adaptation methods depend heavily on the performance of the source pretrained model. Method: The paper proposes a new self-supervised TTA protocol and a collaborative learning framework that combines SSL and TTA models, using contrastive learning and knowledge distillation to refine representations step by step. Result: Experiments validate the method, which achieves competitive performance on a range of self-supervised models even without source-domain pretraining. Conclusion: The study confirms that, with a suitable method, self-supervised learning and TTA can be combined effectively, improving a model's adaptability and performance in practical applications. Abstract: Training on test-time data enables deep learning models to adapt to dynamic environmental changes, enhancing their practical applicability. Online adaptation from source to target domains is promising, but it remains highly reliant on the performance of the source pretrained model. In this paper, we investigate whether test-time adaptation (TTA) methods can continuously improve models trained via self-supervised learning (SSL) without relying on source pretraining. We introduce a self-supervised TTA protocol after observing that existing TTA approaches struggle when directly applied to self-supervised models with low accuracy on the source domain. Furthermore, we propose a collaborative learning framework that integrates SSL and TTA models, leveraging contrastive learning and knowledge distillation for stepwise representation refinement. We validate our method on diverse self-supervised models, including DINO, MoCo, and iBOT, across TTA benchmarks. Extensive experiments validate the effectiveness of our approach in SSL, showing that it achieves competitive performance even without source pretraining.

[232] GViT: Representing Images as Gaussians for Visual Recognition

Jefferson Hernandez,Ruozhen He,Guha Balakrishnan,Alexander C. Berg,Vicente Ordonez

Main category: cs.CV

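TL;DR: GViT replaces conventional patch inputs with a 2D Gaussian representation, exploring a more efficient way to encode images while maintaining performance.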

Details

Motivation: Conventional vision Transformers rely on pixel or patch grid representations; this work instead tries a more compact, learnable set of Gaussians to improve efficiency and performance. Method: Each image is encoded as a small number of 2D Gaussians that are optimized jointly with a ViT classifier, with the classification gradients steering the Gaussians toward class-salient regions. Result: On ImageNet-1k, GViT with a ViT-B architecture reaches 76.9% top-1 accuracy, close to a standard patch-based ViT. Conclusion: GViT pairs a learnable 2D Gaussian input representation with a ViT classifier and reaches 76.9% top-1 accuracy on ImageNet-1k, close to traditional patch-based methods. Abstract: We introduce GViT, a classification framework that abandons conventional pixel or patch grid input representations in favor of a compact set of learnable 2D Gaussians. Each image is encoded as a few hundred Gaussians whose positions, scales, orientations, colors, and opacities are optimized jointly with a ViT classifier trained on top of these representations. We reuse the classifier gradients as constructive guidance, steering the Gaussians toward class-salient regions while a differentiable renderer optimizes an image reconstruction loss. We demonstrate that 2D Gaussian input representations, coupled with our GViT guidance and a relatively standard ViT architecture, closely match the performance of a traditional patch-based ViT, reaching 76.9% top-1 accuracy on ImageNet-1k with a ViT-B architecture.

[233] Uncertainty-aware Diffusion and Reinforcement Learning for Joint Plane Localization and Anomaly Diagnosis in 3D Ultrasound

Yuhao Huang,Yueyue Xu,Haoran Dou,Jiaxiao Deng,Xin Yang,Hongyu Zheng,Dong Ni

Main category: cs.CV

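TL;DR: This study presents a new method for diagnosing congenital uterine anomalies from 3D ultrasound. It combines a denoising diffusion model, a reinforcement learning framework, and uncertainty modeling to improve diagnostic accuracy.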

Details

Motivation: Compared with traditional 2D ultrasound, 3D ultrasound gives a clearer view of uterine morphology, allowing congenital uterine anomalies (CUAs) to be assessed accurately. Method: The authors develop a denoising diffusion model with local (plane) and global (volume/text) guidance, introduce a reinforcement learning-based framework to extract a key slice summary, and provide text-driven uncertainty modeling. Result: Experiments on a large 3D uterine US dataset show that the method performs well in both plane localization and CUA diagnosis. Conclusion: The paper proposes an intelligent 3D ultrasound system that performs automated plane localization and CUA diagnosis simultaneously, improving diagnostic accuracy. Abstract: Congenital uterine anomalies (CUAs) can lead to infertility, miscarriage, preterm birth, and an increased risk of pregnancy complications. Compared to traditional 2D ultrasound (US), 3D US can reconstruct the coronal plane, providing a clear visualization of the uterine morphology for assessing CUAs accurately. In this paper, we propose an intelligent system for simultaneous automated plane localization and CUA diagnosis. Our highlights are: 1) we develop a denoising diffusion model with local (plane) and global (volume/text) guidance, using an adaptive weighting strategy to optimize attention allocation to different conditions; 2) we introduce a reinforcement learning-based framework with unsupervised rewards to extract the key slice summary from redundant sequences, fully integrating information across multiple planes to reduce learning difficulty; 3) we provide text-driven uncertainty modeling for coarse prediction, and leverage it to adjust the classification probability for overall performance improvement. Extensive experiments on a large 3D uterine US dataset show the efficacy of our method, in terms of plane localization and CUA diagnosis. Code is available at https://github.com/yuhoo0302/CUA-US.

[234] Consistent Time-of-Flight Depth Denoising via Graph-Informed Geometric Attention

Weida Wang,Changyong He,Jin Zeng,Di Qiu

Main category: cs.CV

TL;DR: This paper proposes a novel ToF depth denoising approach using motion-invariant graph fusion and geometric attention, significantly improving performance and generalization while ensuring interpretability.

Details

Motivation: ToF sensor depth images are noisy, and existing methods either process single frames or neglect depth variations across multi-frames, causing temporal inconsistency and spatial ambiguity. This work aims to address these limitations. Method: A motion-invariant graph fusion method is introduced, leveraging cross-frame geometric attention for graph fusion. The solution incorporates an image smoothness prior and a data fidelity term, unrolled into iterative filters with adaptively learned weights. Result: The proposed scheme achieves state-of-the-art performance in accuracy and consistency on the synthetic DVToF dataset and demonstrates robust generalization on the real Kinectv2 dataset. Conclusion: The proposed ToF depth denoising network successfully improves temporal stability and spatial sharpness, achieving state-of-the-art performance on both synthetic and real datasets with robust generalization. Abstract: Depth images captured by Time-of-Flight (ToF) sensors are prone to noise, requiring denoising for reliable downstream applications. Previous works either focus on single-frame processing, or perform multi-frame processing without considering depth variations at corresponding pixels across frames, leading to undesirable temporal inconsistency and spatial ambiguity. In this paper, we propose a novel ToF depth denoising network leveraging motion-invariant graph fusion to simultaneously enhance temporal stability and spatial sharpness. Specifically, despite depth shifts across frames, graph structures exhibit temporal self-similarity, enabling cross-frame geometric attention for graph fusion. Then, by incorporating an image smoothness prior on the fused graph and a data fidelity term derived from the ToF noise distribution, we formulate a maximum a posteriori problem for ToF denoising. Finally, the solution is unrolled into iterative filters whose weights are adaptively learned from the graph-informed geometric attention, producing a high-performance yet interpretable network. Experimental results demonstrate that the proposed scheme achieves state-of-the-art performance in terms of accuracy and consistency on the synthetic DVToF dataset and exhibits robust generalization on the real Kinectv2 dataset. Source code will be released at https://github.com/davidweidawang/GIGA-ToF.
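To make the structure of such a maximum a posteriori problem concrete, here is a generic sketch rather than the paper's formulation. It assumes a quadratic (Gaussian-noise) data fidelity term and a graph Laplacian smoothness prior; the symbols y (noisy depth), x (denoised depth), L (Laplacian of the fused graph), μ (prior weight), and α (step size) are introduced here for illustration, and the paper derives its fidelity term from the ToF noise distribution instead.

```latex
% Generic graph-regularized MAP objective (assumed form, not the paper's exact one)
\hat{x} = \arg\min_{x}\; \|y - x\|_2^2 + \mu\, x^{\top} L x
\quad\Longrightarrow\quad
(I + \mu L)\,\hat{x} = y

% One gradient step; unrolling K such steps gives K filtering layers
x^{(k+1)} = x^{(k)} - \alpha\left[\left(x^{(k)} - y\right) + \mu\, L\, x^{(k)}\right]
```

Under these assumptions each iteration is a graph filter applied to the current estimate, which is what allows the iterations to be unrolled into network layers whose weights are learned.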

[235] Pyramidal Patchification Flow for Visual Generation

Hui Li,Baoyou Chen,Liwei Zhang,Jiaye Li,Jingdong Wang,Siyu Zhu

Main category: cs.CV

TL;DR: PPFlow varies the patch size across timesteps to improve the trade-off between inference speed and performance in Diffusion Transformers.

Details

Motivation: To reduce computation cost and improve the inference efficiency of DiTs while avoiding the limitations of conventional pyramid structures and the renoising trick. Method: PPFlow uses different patch sizes at different noise timesteps, modifies the Unpatchify step accordingly, and operates on full latent representations. Result: PPFlow achieves 1.6× (2-level) to 2.0× (3-level) faster inference than SiT-B/2, with slightly lower training FLOPs and similar generation performance. Conclusion: PPFlow is a new pyramidal patchification approach that uses different patch sizes, each with its own linear projection, to speed up inference while preserving image generation performance. Abstract: Diffusion transformers (DiTs) adopt Patchify, mapping patch representations to token representations through linear projections, to adjust the number of tokens input to DiT blocks and thus the computation cost. Instead of a single patch size for all the timesteps, we introduce a Pyramidal Patchification Flow (PPFlow) approach: Large patch sizes are used for high noise timesteps and small patch sizes for low noise timesteps; Linear projections are learned for each patch size; and Unpatchify is accordingly modified. Unlike Pyramidal Flow, our approach operates over full latent representations rather than pyramid representations, and adopts the normal denoising process without requiring the renoising trick. We demonstrate the effectiveness of our approach through two training manners. Training from scratch achieves a 1.6× (2.0×) inference speed over SiT-B/2 for 2-level (3-level) pyramid patchification with slightly lower training FLOPs and similar image generation performance. Training from pretrained normal DiTs achieves even better performance with small training time. The code and checkpoint are at https://github.com/fudan-generative-vision/PPFlow.
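The saving from larger patches comes from the token count, which falls with the square of the patch size. The sketch below illustrates that arithmetic together with a simple noise-level-to-patch-size schedule. The 32 x 32 latent, the thresholds, the patch sizes, and the convention that a noise level of 1.0 means pure noise are all assumptions made for illustration, not values taken from the paper.

```python
def num_tokens(latent_hw, patch):
    """Tokens produced by Patchify for a latent of size (H, W) and a square patch."""
    h, w = latent_hw
    if h % patch or w % patch:
        raise ValueError("latent size must be divisible by the patch size")
    return (h // patch) * (w // patch)


def patch_size_for_noise(noise_level, thresholds=(0.5,), patch_sizes=(2, 4)):
    """Pick a patch size for a noise level in [0, 1] (1.0 = pure noise, assumed).

    patch_sizes is ordered from the low-noise stage to the high-noise stage,
    so higher noise maps to a larger patch and therefore fewer tokens.
    """
    if len(patch_sizes) != len(thresholds) + 1:
        raise ValueError("need exactly one more patch size than thresholds")
    for threshold, patch in zip(thresholds, patch_sizes):
        if noise_level < threshold:
            return patch
    return patch_sizes[-1]


if __name__ == "__main__":
    latent = (32, 32)  # hypothetical latent resolution
    for level in (0.9, 0.1):
        p = patch_size_for_noise(level)
        print(level, p, num_tokens(latent, p))
```

In this hypothetical two-level schedule, high-noise steps run on a quarter of the tokens used at low-noise steps, which is where an inference speedup would come from.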

[236] Oneta: Multi-Style Image Enhancement Using Eigentransformation Functions

Jiwon Kim,Soohyun Hwang,Dong-O Kim,Changsu Han,Min Kyu Park,Chang-Su Kim

Main category: cs.CV

TL;DR: This paper proposes Oneta, a neural network model that handles multiple image enhancement tasks, using eigenTF and CCM for efficient multi-style image enhancement.

Details

Motivation: To address multi-style image enhancement with a high-performing, general algorithm that adapts to different enhancement needs. Method: Oneta uses Y-Net and C-Net to predict the parameters of the eigentransformation function (eigenTF) and the color correction matrix (CCM), respectively, and employs K learnable tokens to support K enhancement styles. Result: A single Oneta network effectively handles six enhancement tasks, including retouching, low-light enhancement, dehazing, underwater enhancement, and white balancing, across 30 datasets. Conclusion: Oneta is an algorithm for multi-style image enhancement built on two point operators, intensity enhancement and color correction, and extensive experiments show strong performance across many enhancement tasks. Abstract: The first algorithm, called Oneta, for a novel task of multi-style image enhancement is proposed in this work. Oneta uses two point operators sequentially: intensity enhancement with a transformation function (TF) and color correction with a color correction matrix (CCM). This two-step enhancement model, though simple, achieves a high performance upper bound. Also, we introduce eigentransformation function (eigenTF) to represent TF compactly. The Oneta network comprises Y-Net and C-Net to predict eigenTF and CCM parameters, respectively. To support K styles, Oneta employs K learnable tokens. During training, each style token is learned using image pairs from the corresponding dataset. In testing, Oneta selects one of the K style tokens to enhance an image accordingly. Extensive experiments show that the single Oneta network can effectively undertake six enhancement tasks (retouching, image signal processing, low-light image enhancement, dehazing, underwater image enhancement, and white balancing) across 30 datasets.
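The two-step model described above (a transformation function followed by a color correction matrix) can be sketched for a single pixel as follows. This is an illustrative reading, not the authors' implementation: it assumes the TF is a sampled curve evaluated by linear interpolation, that a compact TF is a mean curve plus a weighted sum of basis curves (a PCA-style guess at what eigenTF means), and that the TF is applied to each channel for simplicity. All function names are hypothetical.

```python
def build_tf(mean_tf, basis, coeffs):
    """Assumed compact TF: mean curve plus a weighted sum of basis curves."""
    return [m + sum(c * b[i] for c, b in zip(coeffs, basis))
            for i, m in enumerate(mean_tf)]


def apply_tf(value, tf):
    """Evaluate a sampled curve at value in [0, 1] by linear interpolation."""
    value = min(max(value, 0.0), 1.0)
    pos = value * (len(tf) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(tf) - 1)
    frac = pos - lo
    return tf[lo] * (1.0 - frac) + tf[hi] * frac


def apply_ccm(rgb, ccm):
    """Multiply an RGB triple by a 3 x 3 color correction matrix."""
    return tuple(sum(ccm[r][c] * rgb[c] for c in range(3)) for r in range(3))


def enhance_pixel(rgb, tf, ccm):
    """Point operator 1 (TF, applied per channel here) then point operator 2 (CCM)."""
    return apply_ccm(tuple(apply_tf(v, tf) for v in rgb), ccm)


if __name__ == "__main__":
    tf = build_tf([0.0, 0.5, 1.0], [[0.0, 0.1, 0.0]], [2.0])  # brightens midtones
    identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
    print(enhance_pixel((0.25, 0.5, 1.0), tf, identity))
```

Because both steps are point operators, the per-pixel cost is independent of image content, and a style is fully described by a handful of curve coefficients plus nine matrix entries.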

[237] JAM-Flow: Joint Audio-Motion Synthesis with Flow Matching

Mingi Kwon,Joonghyuk Shin,Jaeseok Jung,Jaesik Park,Youngjung Uh

Main category: cs.CV

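TL;DR: This paper proposes JAM-Flow, a unified framework that simultaneously synthesizes and conditions on facial motion and speech for more holistic audio-visual synthesis.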

Details

Motivation: Facial motion and speech are intrinsically linked, yet this link is usually overlooked in generative modeling; the paper aims to provide a unified framework that addresses it. Method: The approach uses flow matching and a new Multi-Modal Diffusion Transformer (MM-DiT) architecture with selective joint attention layers, along with key architectural choices such as temporally aligned positional embeddings and localized joint attention masking. Result: JAM-Flow supports a wide range of conditioning inputs, including text, reference audio, and reference motion, and brings tasks such as synchronized talking head generation and audio-driven animation into a single model. Conclusion: By synthesizing and conditioning on facial motion and speech within one framework, JAM-Flow significantly advances multi-modal generative modeling as a practical solution for holistic audio-visual synthesis. Abstract: The intrinsic link between facial motion and speech is often overlooked in generative modeling, where talking head synthesis and text-to-speech (TTS) are typically addressed as separate tasks. This paper introduces JAM-Flow, a unified framework to simultaneously synthesize and condition on both facial motion and speech. Our approach leverages flow matching and a novel Multi-Modal Diffusion Transformer (MM-DiT) architecture, integrating specialized Motion-DiT and Audio-DiT modules. These are coupled via selective joint attention layers and incorporate key architectural choices, such as temporally aligned positional embeddings and localized joint attention masking, to enable effective cross-modal interaction while preserving modality-specific strengths. Trained with an inpainting-style objective, JAM-Flow supports a wide array of conditioning inputs, including text, reference audio, and reference motion, facilitating tasks such as synchronized talking head generation from text, audio-driven animation, and much more, within a single, coherent model. JAM-Flow significantly advances multi-modal generative modeling by providing a practical solution for holistic audio-visual synthesis. Project page: https://joonghyuk.com/jamflow-web

[238] LH2Face: Loss function for Hard High-quality Face

Fan Xie,Pan Cao

Main category: cs.CV

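TL;DR: This paper proposes LH2Face, a new loss function for face recognition. By improving the similarity measure, introducing an adaptive margin mechanism, and jointly optimizing the representation space distribution and face reconstruction, it markedly improves recognition on hard high-quality face data.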

Details

Motivation: Conventional face recognition methods based on cosine similarity and softmax classification handle hard samples poorly, and strategies that add angular or cosine margins ignore face quality and recognition hardness, leading to an overly uniform training strategy. A more effective loss function is needed to address these problems. Method: The paper proposes a new loss function, LH2Face. It designs a similarity measure based on the von Mises-Fisher distribution, combines it with softmax to implement an adaptive-margin multi-classification method (the Uncertainty-Aware Margin Function), uses proxy-based loss functions to optimize the representation space distribution, and jointly optimizes face recognition and reconstruction through a renderer. Result: LH2Face achieves 49.39% accuracy on the IJB-B dataset, surpassing existing methods. Conclusion: LH2Face outperforms similar schemes on hard high-quality face datasets, reaching 49.39% accuracy on IJB-B, 2.37% higher than the second-place method. Abstract: In current practical face authentication systems, most face recognition (FR) algorithms are based on cosine similarity with softmax classification. Despite its reliable classification performance, this method struggles with hard samples. A popular strategy to improve FR performance is incorporating angular or cosine margins. However, it does not take face quality or recognition hardness into account, simply increasing the margin value and thus causing an overly uniform training strategy. To address this problem, a novel loss function is proposed, named Loss function for Hard High-quality Face (LH2Face). Firstly, a similarity measure based on the von Mises-Fisher (vMF) distribution is stated, specifically focusing on the logarithm of the Probability Density Function (PDF), which represents the distance between a probability distribution and a vector. Then, an adaptive margin-based multi-classification method using softmax, called the Uncertainty-Aware Margin Function, is implemented in the article. Furthermore, proxy-based loss functions are used to apply extra constraints between the proxy and sample to optimize their representation space distribution. Finally, a renderer is constructed that optimizes FR through face reconstruction and vice versa. Our LH2Face is superior to similar schemes on hard high-quality face datasets, achieving 49.39% accuracy on the IJB-B dataset, which surpasses the second-place method by 2.37%.
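For reference, the standard von Mises-Fisher density and its logarithm are written out below, since the abstract builds its similarity measure on the log-PDF. This is the textbook form only; how LH2Face turns it into a loss is not specified in the abstract. Here x and μ are unit vectors in d dimensions, κ ≥ 0 is the concentration, and I_ν is the modified Bessel function of the first kind.

```latex
% von Mises-Fisher density on the unit sphere S^{d-1}
f(x;\mu,\kappa) = C_d(\kappa)\,\exp\!\left(\kappa\,\mu^{\top}x\right),
\qquad
C_d(\kappa) = \frac{\kappa^{d/2-1}}{(2\pi)^{d/2}\,I_{d/2-1}(\kappa)}

% Log-density
\log f(x;\mu,\kappa) = \kappa\,\mu^{\top}x
  + \left(\tfrac{d}{2}-1\right)\log\kappa
  - \tfrac{d}{2}\log(2\pi)
  - \log I_{d/2-1}(\kappa)
```

Because x and μ are unit vectors, μᵀx is their cosine similarity, so the log-density is cosine similarity scaled by κ plus a term that depends only on κ. A per-sample concentration therefore acts as a confidence weight on the usual cosine score.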

[239] OcRFDet: Object-Centric Radiance Fields for Multi-View 3D Object Detection in Autonomous Driving

Mingqian Ji,Jian Yang,Shanshan Zhang

Main category: cs.CV

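TL;DR: This paper proposes OcRFDet, a new multi-view 3D object detection method based on object-centric radiance fields that achieves strong results on the nuScenes dataset.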

Details

Motivation: Existing multi-view 3D detection methods typically transfer 2D features into 3D space implicitly, which limits performance. Inspired by the success of radiance fields in 3D reconstruction, the authors apply them to strengthen 3D geometry estimation. Method: The paper proposes a new multi-view 3D object detection framework that uses OcRF to model foreground objects and suppress background noise, and an HOA module to enhance the 2D BEV features and improve detection accuracy. Result: Experiments on nuScenes show that OcRFDet reaches 57.2% mAP and 64.8% NDS on the test set, outperforming previous state-of-the-art methods. Conclusion: By introducing Object-centric Radiance Fields (OcRF) and Height-aware Opacity-based Attention (HOA), OcRFDet effectively improves multi-view 3D object detection and achieves state-of-the-art results on nuScenes. Abstract: Current multi-view 3D object detection methods typically transfer 2D features into 3D space using depth estimation or 3D position encoder, but in a fully data-driven and implicit manner, which limits the detection performance. Inspired by the success of radiance fields on 3D reconstruction, we assume they can be used to enhance the detector's ability of 3D geometry estimation. However, we observe a decline in detection performance when we directly use them for 3D rendering as an auxiliary task. From our analysis, we find the performance drop is caused by the strong responses on the background when rendering the whole scene. To address this problem, we propose object-centric radiance fields, focusing on modeling foreground objects while discarding background noise. Specifically, we employ Object-centric Radiance Fields (OcRF) to enhance 3D voxel features via an auxiliary task of rendering foreground objects. We further use opacity, a by-product of rendering, to enhance the 2D foreground BEV features via Height-aware Opacity-based Attention (HOA), where attention maps at different height levels are generated separately via multiple networks in parallel. Extensive experiments on the nuScenes validation and test datasets demonstrate that our OcRFDet achieves superior performance, outperforming previous state-of-the-art methods with 57.2% mAP and 64.8% NDS on the nuScenes test benchmark. Code will be available at https://github.com/Mingqj/OcRFDet.

[240] Metadata, Wavelet, and Time Aware Diffusion Models for Satellite Image Super Resolution

Luigi Sigillo,Renato Giamba,Danilo Comminiello

Main category: cs.CV

TL;DR: This paper proposes MWT-Diff, a new framework for enhancing satellite image resolution by combining diffusion models with wavelet transforms, effectively overcoming sensor limitations and delivering high-quality results.

Details

Motivation: High-resolution satellite imagery is essential for applications like environmental monitoring, disaster response, and agricultural management, but its acquisition is limited by the spatial and temporal constraints of satellite sensors and high costs of frequent observations. Method: The paper introduces a metadata-, wavelet-, and time-aware encoder (MWT-Encoder) to generate embeddings capturing metadata attributes, multi-scale frequency information, and temporal relationships. These embeddings guide hierarchical diffusion dynamics to reconstruct high-resolution images from low-resolution inputs. Result: Comparative analysis across multiple datasets showed MWT-Diff outperforms recent approaches in terms of perceptual quality metrics such as FID and LPIPS. Conclusion: MWT-Diff is an effective framework for satellite image super-resolution that combines latent diffusion models with wavelet transforms, addressing spatial, temporal, and cost-related limitations of current satellite sensors. Abstract: The acquisition of high-resolution satellite imagery is often constrained by the spatial and temporal limitations of satellite sensors, as well as the high costs associated with frequent observations. These challenges hinder applications such as environmental monitoring, disaster response, and agricultural management, which require fine-grained and high-resolution data. In this paper, we propose MWT-Diff, an innovative framework for satellite image super-resolution (SR) that combines latent diffusion models with wavelet transforms to address these challenges. At the core of the framework is a novel metadata-, wavelet-, and time-aware encoder (MWT-Encoder), which generates embeddings that capture metadata attributes, multi-scale frequency information, and temporal relationships. 
The embedded feature representations steer the hierarchical diffusion dynamics, through which the model progressively reconstructs high-resolution satellite imagery from low-resolution inputs. This process preserves critical spatial characteristics including textural patterns, boundary discontinuities, and high-frequency spectral components essential for detailed remote sensing analysis. The comparative analysis of MWT-Diff across multiple datasets demonstrated favorable performance compared to recent approaches, as measured by standard perceptual quality metrics including FID and LPIPS.

[241] Event-based Tiny Object Detection: A Benchmark Dataset and Baseline

Nuo Chen,Chao Xiao,Yimian Dai,Shiman He,Miao Li,Wei An

Main category: cs.CV

TL;DR: This paper presents EV-UAV, a large-scale dataset for event-based small object detection, and EV-SpSegNet with STC loss, which improves detection of UAVs in complex environments.

Details

Motivation: Traditional frame-based cameras struggle with small object detection in complex environments due to low frame rates and limited dynamic range, necessitating better solutions like event cameras for accurate anti-UAV systems. Method: A new dataset (EV-UAV) was created with over 2.3 million event-level annotations. A novel method, EV-SpSegNet, along with a Spatiotemporal Correlation (STC) loss, was proposed to detect small UAVs by leveraging motion continuity in event data. Result: The proposed method demonstrates superior performance on the EV-UAV dataset, establishing a benchmark for future research in event-based small object detection. Conclusion: The paper introduces EV-UAV, the first large-scale dataset for event-based small object detection in anti-UAV tasks, and proposes EV-SpSegNet with STC loss for superior detection performance. Abstract: Small object detection (SOD) in anti-UAV tasks is a challenging problem due to the small size of UAVs and complex backgrounds. Traditional frame-based cameras struggle to detect small objects in complex environments due to their low frame rates, limited dynamic range, and data redundancy. Event cameras, with microsecond temporal resolution and high dynamic range, provide a more effective solution for SOD. However, existing event-based object detection datasets are limited in scale, feature large targets, and lack diverse backgrounds, making them unsuitable for SOD benchmarks. In this paper, we introduce an Event-based Small object detection (EVSOD) dataset (namely EV-UAV), the first large-scale, highly diverse benchmark for anti-UAV tasks. It includes 147 sequences with over 2.3 million event-level annotations, featuring extremely small targets (averaging 6.8 × 5.4 pixels) and diverse scenarios such as urban clutter and extreme lighting conditions. 
Furthermore, based on the observation that small moving targets form continuous curves in spatiotemporal event point clouds, we propose Event based Sparse Segmentation Network (EV-SpSegNet), a novel baseline for event segmentation in point cloud space, along with a Spatiotemporal Correlation (STC) loss that leverages motion continuity to guide the network in retaining target events. Extensive experiments on the EV-UAV dataset demonstrate the superiority of our method and provide a benchmark for future research in EVSOD. The dataset and code are at https://github.com/ChenYichen9527/Ev-UAV.

[242] StackCLIP: Clustering-Driven Stacked Prompt in Zero-Shot Industrial Anomaly Detection

Yanning Hou,Yanran Ruan,Junfa Li,Shanshan Wang,Jianfeng Qiu,Ke Xu

Main category: cs.CV

TL;DR: This paper proposes StackCLIP, a novel method for zero-shot industrial anomaly detection and segmentation that enhances text-image feature alignment through multicategory name stacking, clustering-driven prompts, ensemble feature alignment, and refined prompt learning, achieving state-of-the-art results.

Details

Motivation: The motivation is to enhance the alignment between text and image features in the CLIP model for better performance in zero-shot industrial anomaly detection tasks. Current methods using specific category prompts during pretraining cause overfitting and limit generalization, necessitating a new approach. Method: The method involves transforming category names through multicategory name stacking to create stacked prompts. It introduces two components: the CSP module, which constructs generic prompts by stacking semantically analogous categories and utilizes multi-object textual feature fusion; and the EFA module, which trains knowledge-specific linear layers tailored for each stack cluster and integrates them adaptively. Additionally, the RPL module refines prompt learning leveraging the generalization power of stacked prompts. Result: Extensive testing on seven industrial anomaly detection datasets demonstrates that the proposed method achieves state-of-the-art performance in both zero-shot anomaly detection and segmentation tasks. The approach also shows robust generalization across classification tasks. Conclusion: The proposed StackCLIP model with Clustering-Driven Stacked Prompts (CSP) and Ensemble Feature Alignment (EFA) modules, along with the Regulating Prompt Learning (RPL) framework, achieves state-of-the-art performance in zero-shot industrial anomaly detection and segmentation tasks by enhancing text-image feature alignment and generalization. Abstract: Enhancing the alignment between text and image features in the CLIP model is a critical challenge in zero-shot industrial anomaly detection tasks. Recent studies predominantly utilize specific category prompts during pretraining, which can cause overfitting to the training categories and limit model generalization. To address this, we propose a method that transforms category names through multicategory name stacking to create stacked prompts, forming the basis of our StackCLIP model. 
Our approach introduces two key components. The Clustering-Driven Stacked Prompts (CSP) module constructs generic prompts by stacking semantically analogous categories, while utilizing multi-object textual feature fusion to amplify discriminative anomalies among similar objects. The Ensemble Feature Alignment (EFA) module trains knowledge-specific linear layers tailored for each stack cluster and adaptively integrates them based on the attributes of test categories. These modules work together to deliver superior training speed, stability, and convergence, significantly boosting anomaly segmentation performance. Additionally, our stacked prompt framework offers robust generalization across classification tasks. To further improve performance, we introduce the Regulating Prompt Learning (RPL) module, which leverages the generalization power of stacked prompts to refine prompt learning, elevating results in anomaly detection classification tasks. Extensive testing on seven industrial anomaly detection datasets demonstrates that our method achieves state-of-the-art performance in both zero-shot anomaly detection and segmentation tasks.

[243] Dataset Distillation via Vision-Language Category Prototype

Yawen Zou,Guang Li,Duo Su,Zi Wang,Jun Yu,Chao Zhang

Main category: cs.CV

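TL;DR: This paper proposes a new dataset distillation method that combines visual and language information, improving model generalization and generating more plausible images.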

Details

Motivation: Traditional dataset distillation methods focus mainly on extracting information from images and overlook the semantic information in the data, which limits generalization on complex tasks and may produce illogical outputs or omit key objects. Method: Descriptive text generated by an open-source large language model is introduced as text prototypes, which are used together with image prototypes to synthesize data, thereby incorporating semantic information into dataset distillation. Result: The method generates logically coherent images that contain the target objects, shows strong validation performance and robust generalization on multiple datasets, and applies to datasets without pre-existing text descriptions. Conclusion: The study proposes a new dataset distillation framework that incorporates vision-language methods, introducing text prototypes to improve generalization and the ability to generate logically coherent images, and achieves state-of-the-art validation performance. Abstract: Dataset distillation (DD) condenses large datasets into compact yet informative substitutes, preserving performance comparable to the original dataset while reducing storage, transmission costs, and computational consumption. However, previous DD methods mainly focus on distilling information from images, often overlooking the semantic information inherent in the data. The disregard for context hinders the model's generalization ability, particularly in tasks involving complex datasets, which may result in illogical outputs or the omission of critical objects. In this study, we integrate vision-language methods into DD by introducing text prototypes to distill language information and collaboratively synthesize data with image prototypes, thereby enhancing dataset distillation performance. Notably, the text prototypes utilized in this study are derived from descriptive text information generated by an open-source large language model. This framework demonstrates broad applicability across datasets without pre-existing text descriptions, expanding the potential of dataset distillation beyond traditional image-based approaches. Compared to other methods, the proposed approach generates logically coherent images containing target objects, achieving state-of-the-art validation performance and demonstrating robust generalization. Source code and generated data are available at https://github.com/zou-yawen/Dataset-Distillation-via-Vision-Language-Category-Prototype/

[244] PBCAT: Patch-based composite adversarial training against physically realizable attacks on object detection

Xiao Li,Yiming Zhu,Yifan Huang,Wei Zhang,Yingzhe He,Jie Shi,Xiaolin Hu

Main category: cs.CV

TL;DR: This paper introduces PBCAT, a novel adversarial training method that effectively defends object detectors against various physically realizable attacks, such as adversarial patches and textures, by combining localized and global adversarial perturbations.

Details

Motivation: Object detection systems are vulnerable to physically realizable attacks like adversarial patches and textures, posing serious security threats. While adversarial training (AT) has been explored for classification models under $l_\infty$ attacks, its application to object detectors against diverse physical attacks remains limited. Method: The authors proposed PBCAT, a Patch-Based Composite Adversarial Training method that combines small-area gradient-guided adversarial patches with global imperceptible perturbations to enhance model robustness. Result: Extensive experiments show that PBCAT significantly outperforms existing defense methods, improving detection accuracy by 29.7% under adversarial texture attacks. Conclusion: PBCAT is an effective defense strategy that improves robustness against various physically realizable attacks, including adversarial patches and textures. Abstract: Object detection plays a crucial role in many security-sensitive applications. However, several recent studies have shown that object detectors can be easily fooled by physically realizable attacks, e.g., adversarial patches and recent adversarial textures, which pose realistic and urgent threats. Adversarial Training (AT) has been recognized as the most effective defense against adversarial attacks. While AT has been extensively studied in the $l_\infty$ attack settings on classification models, AT against physically realizable attacks on object detectors has received limited exploration. Early attempts are only performed to defend against adversarial patches, leaving AT against a wider range of physically realizable attacks under-explored. In this work, we consider defending against various physically realizable attacks with a unified AT method. We propose PBCAT, a novel Patch-Based Composite Adversarial Training strategy. 
PBCAT optimizes the model by incorporating the combination of small-area gradient-guided adversarial patches and imperceptible global adversarial perturbations covering the entire image. With these designs, PBCAT has the potential to defend against not only adversarial patches but also unseen physically realizable attacks such as adversarial textures. Extensive experiments in multiple settings demonstrated that PBCAT significantly improved robustness against various physically realizable attacks over state-of-the-art defense methods. Notably, it improved the detection accuracy by 29.7% over previous defense methods under one recent adversarial texture attack.

[245] CAI: Caption-Sensitive Attention Intervention for Mitigating Object Hallucination in Large Vision-Language Models

Qiming Li,Zekai Ye,Xiaocheng Feng,Weihong Zhong,Libo Qin,Ruihan Chen,Baohang Li,Kui Jiang,Yaowei Wang,Ting Liu,Bing Qin

Main category: cs.CV

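TL;DR: This paper introduces CAI, a new training-free method for mitigating hallucination in large vision-language models that is both efficient and effective.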

Details

Motivation: Although large vision-language models show strong capabilities in interpreting visual information, they often produce content that deviates from it, leading to object hallucination. Existing methods typically rely on expensive manual annotation and training, or significantly increase inference time. Method: Based on the observation that large vision-language models attend more strongly to visual information when answering caption queries, the authors propose Caption-sensitive Attention Intervention (CAI), a training-free, plug-and-play hallucination mitigation technique. Result: Experiments show that CAI achieves state-of-the-art hallucination mitigation on multiple tasks with only minimal additional inference cost. Conclusion: The paper proposes CAI, which effectively mitigates hallucination in large vision-language models at low inference cost and achieves state-of-the-art performance on four benchmarks. Abstract: Although Large Vision-Language Models (LVLMs) have demonstrated powerful capabilities in interpreting visual information, they frequently produce content that deviates from visual information, leading to object hallucination. To tackle this, recent works mostly depend on expensive manual annotations and training cost, or significantly increase inference time. In this work, we observe that LVLMs' attention to visual information is significantly stronger when answering caption queries compared to non-caption queries. Inspired by this phenomenon, we propose Caption-sensitive Attention Intervention (CAI), a training-free, plug-and-play hallucination mitigation method that leverages the attention activation pattern in response to caption queries to enhance LVLMs' visual perception capability. Extensive experimental results across four benchmarks covering both discriminative and generative tasks demonstrate that CAI achieves state-of-the-art (SOTA) hallucination mitigation performance with only minimal additional inference cost.

[246] AI-Generated Lecture Slides for Improving Slide Element Detection and Retrieval

Suyash Maniyar,Vishvesh Trivedi,Ajoy Mondal,Anand Mishra,C. V. Jawahar

Main category: cs.CV

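TL;DR: This paper proposes SynLecSlideGen, a method that uses large language models to generate high-quality synthetic slides; pre-training on synthetic data followed by few-shot transfer learning on real data effectively reduces the need for manually annotated data.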

Details

Motivation: Annotating large numbers of lecture slides for supervised training is labor-intensive and requires domain expertise, so a method that reduces reliance on manual annotation is needed. Method: The authors propose a large language model (LLM)-guided synthetic slide generation pipeline, pre-train on the synthetic data, and then perform few-shot transfer learning on real data. Result: Experiments show that pre-training on synthetic slides significantly improves model performance in few-shot transfer learning on real data. Conclusion: SynLecSlideGen is an effective synthetic slide generation pipeline that significantly improves model performance when only a small amount of real data is available. Abstract: Lecture slide element detection and retrieval are key problems in slide understanding. Training effective models for these tasks often depends on extensive manual annotation. However, annotating large volumes of lecture slides for supervised training is labor-intensive and requires domain expertise. To address this, we propose a large language model (LLM)-guided synthetic lecture slide generation pipeline, SynLecSlideGen, which produces high-quality, coherent and realistic slides. We also create an evaluation benchmark, RealSlide, by manually annotating 1,050 real lecture slides. To assess the utility of our synthetic slides, we perform few-shot transfer learning on real data using models pre-trained on them. Experimental results show that few-shot transfer learning with pretraining on synthetic slides significantly improves performance compared to training only on real data. This demonstrates that synthetic data can effectively compensate for limited labeled lecture slides. The code and resources of our work are publicly available on our project website: https://synslidegen.github.io/.

[247] SG-LDM: Semantic-Guided LiDAR Generation via Latent-Aligned Diffusion

Zhengkang Xiang,Zizhao Li,Amir Khodabandeh,Kourosh Khoshelham

Main category: cs.CV

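TL;DR: This paper proposes SG-LDM, a semantic-guided lidar diffusion model that achieves high-quality point cloud generation and cross-domain translation through explicit semantic conditioning and latent alignment, effectively increasing data diversity and downstream task performance.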

Details

Motivation: Existing lidar point cloud generation methods focus mainly on unconditional generation, leaving their potential for real-world applications unexplored. Method: The authors propose SG-LDM, a Semantic-Guided Lidar Diffusion Model that combines latent alignment with explicit semantic conditioning for point cloud generation and cross-domain translation. Result: SG-LDM achieves state-of-the-art results in generating high-fidelity lidar point clouds, and its translation framework significantly improves performance on the downstream segmentation task. Conclusion: SG-LDM enables semantic-guided lidar point cloud generation and, through its diffusion-based framework, improves data augmentation for downstream perception tasks. Abstract: Lidar point cloud synthesis based on generative models offers a promising solution to augment deep learning pipelines, particularly when real-world data is scarce or lacks diversity. By enabling flexible object manipulation, this synthesis approach can significantly enrich training datasets and enhance discriminative models. However, existing methods focus on unconditional lidar point cloud generation, overlooking their potential for real-world applications. In this paper, we propose SG-LDM, a Semantic-Guided Lidar Diffusion Model that employs latent alignment to enable robust semantic-to-lidar synthesis. By directly operating in the native lidar space and leveraging explicit semantic conditioning, SG-LDM achieves state-of-the-art performance in generating high-fidelity lidar point clouds guided by semantic labels. Moreover, we propose the first diffusion-based lidar translation framework based on SG-LDM, which enables cross-domain translation as a domain adaptation strategy to enhance downstream perception performance. Systematic experiments demonstrate that SG-LDM significantly outperforms existing lidar diffusion models and the proposed lidar translation framework further improves data augmentation performance in the downstream lidar segmentation task.

[248] PGOV3D: Open-Vocabulary 3D Semantic Segmentation with Partial-to-Global Curriculum

Shiqi Zhang,Sha Zhang,Jiajun Deng,Yedong Shen,Mingxiao MA,Yanyong Zhang

Main category: cs.CV

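TL;DR: PGOV3D is a new open-vocabulary 3D semantic segmentation framework that uses a partial-to-global curriculum learning strategy to improve model performance.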

Details

Motivation: Existing methods treat multi-view images merely as intermediaries for transferring open-vocabulary information, overlooking their rich semantic content and cross-view correspondences, which limits model effectiveness. A new approach is therefore needed to improve open-vocabulary 3D semantic segmentation. Method: PGOV3D adopts a partial-to-global curriculum with two stages: the first pre-trains on partial scenes that carry dense semantic information but relatively simple geometry; the second fine-tunes on complete scene-level point clouds, with pseudo labels generated after aggregating the partial vocabularies. Result: Experiments show that PGOV3D performs well on multiple benchmarks and effectively bridges the semantic gap between dense partial observations and large-scale 3D environments. Conclusion: PGOV3D achieves competitive open-vocabulary 3D semantic segmentation performance on the ScanNet, ScanNet200, and S3DIS benchmarks. Abstract: Existing open-vocabulary 3D semantic segmentation methods typically supervise 3D segmentation models by merging text-aligned features (e.g., CLIP) extracted from multi-view images onto 3D points. However, such approaches treat multi-view images merely as intermediaries for transferring open-vocabulary information, overlooking their rich semantic content and cross-view correspondences, which limits model effectiveness. To address this, we propose PGOV3D, a novel framework that introduces a Partial-to-Global curriculum for improving open-vocabulary 3D semantic segmentation. The key innovation lies in a two-stage training strategy. In the first stage, we pre-train the model on partial scenes that provide dense semantic information but relatively simple geometry. These partial point clouds are derived from multi-view RGB-D inputs via pixel-wise depth projection. To enable open-vocabulary learning, we leverage a multi-modal large language model (MLLM) and a 2D segmentation foundation model to generate open-vocabulary labels for each viewpoint, offering rich and aligned supervision. An auxiliary inter-frame consistency module is introduced to enforce feature consistency across varying viewpoints and enhance spatial understanding. In the second stage, we fine-tune the model on complete scene-level point clouds, which are sparser and structurally more complex. We aggregate the partial vocabularies associated with each scene and generate pseudo labels using the pre-trained model, effectively bridging the semantic gap between dense partial observations and large-scale 3D environments. 
Extensive experiments on ScanNet, ScanNet200, and S3DIS benchmarks demonstrate that PGOV3D achieves competitive performance in open-vocabulary 3D semantic segmentation.
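
The abstract says the first-stage partial point clouds come from pixel-wise depth projection of RGB-D frames. A minimal sketch of that back-projection under a standard pinhole camera model is given here; the intrinsics `fx`, `fy`, `cx`, `cy`, the function names, and the sample values are illustrative assumptions and are not taken from the paper.

```python
def backproject_pixel(u, v, depth, fx, fy, cx, cy):
    """Lift pixel (u, v) with metric depth to a 3D point in the camera frame."""
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return (x, y, depth)


def depth_map_to_points(depth_map, fx, fy, cx, cy):
    """Back-project every pixel with positive depth; rows index v, columns index u."""
    points = []
    for v, row in enumerate(depth_map):
        for u, d in enumerate(row):
            if d > 0:  # zero or negative depth is treated as missing
                points.append(backproject_pixel(u, v, d, fx, fy, cx, cy))
    return points


# Hypothetical 2x2 depth map (metres) and intrinsics, for illustration only.
depth_map = [[2.0, 0.0],
             [2.0, 4.0]]
points = depth_map_to_points(depth_map, fx=1.0, fy=1.0, cx=0.0, cy=0.0)
```

Each frame yields its own partial cloud in camera coordinates; merging frames into a scene would additionally need camera poses, which this sketch leaves out.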

[249] AttentionGS: Towards Initialization-Free 3D Gaussian Splatting via Structural Attention

Ziao Liu,Zhenjia Li,Yifeng Shi,Xiangang Li

Main category: cs.CV

TL;DR: AttentionGS improves upon 3D Gaussian Splatting by eliminating reliance on high-quality initial point clouds through the use of structural attention mechanisms.

Details

Motivation: The reliance of 3D Gaussian Splatting on high-quality point clouds from Structure-from-Motion (SfM) limits its applicability, especially in texture-deficient or constrained-view scenarios. Method: AttentionGS leverages structural attention for direct 3D reconstruction from random initialization, using geometric attention initially to recover global structure and later incorporating texture attention to refine details. Opacity-weighted gradients are used to enhance surface reconstruction. Result: AttentionGS significantly outperforms state-of-the-art methods on multiple benchmark datasets, particularly under unreliable point cloud initialization conditions. Conclusion: AttentionGS provides a more robust and flexible approach to 3D Gaussian Splatting, particularly effective where point cloud initialization is unreliable. Abstract: 3D Gaussian Splatting (3DGS) is a powerful alternative to Neural Radiance Fields (NeRF), excelling in complex scene reconstruction and efficient rendering. However, it relies on high-quality point clouds from Structure-from-Motion (SfM), limiting its applicability. SfM also fails in texture-deficient or constrained-view scenarios, causing severe degradation in 3DGS reconstruction. To address this limitation, we propose AttentionGS, a novel framework that eliminates the dependency on high-quality initial point clouds by leveraging structural attention for direct 3D reconstruction from random initialization. In the early training stage, we introduce geometric attention to rapidly recover the global scene structure. As training progresses, we incorporate texture attention to refine fine-grained details and enhance rendering quality. Furthermore, we employ opacity-weighted gradients to guide Gaussian densification, leading to improved surface reconstruction. 
Extensive experiments on multiple benchmark datasets demonstrate that AttentionGS significantly outperforms state-of-the-art methods, particularly in scenarios where point cloud initialization is unreliable. Our approach paves the way for more robust and flexible 3D Gaussian Splatting in real-world applications.

[250] TurboVSR: Fantastic Video Upscalers and Where to Find Them

Zhongdao Wang,Guodongfang Zhao,Jingjing Ren,Bailan Feng,Shifeng Zhang,Wenbo Li

Main category: cs.CV

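TL;DR: This paper proposes TurboVSR, an efficient video super-resolution model that combines a high-compression autoencoder, factorized conditioning, and a shortcut diffusion model design to greatly increase processing speed while maintaining quality and supporting higher resolutions.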

Details

Motivation: Diffusion-based generative methods perform well on video super-resolution but are computationally inefficient, so an efficient solution is needed. Method: TurboVSR's core design has three parts: an autoencoder with a high compression ratio to reduce the number of tokens; a factorized conditioning strategy to reduce training difficulty; and conversion of the pre-trained diffusion model into a shortcut model that samples in fewer steps to accelerate inference. Result: TurboVSR super-resolves a 2-second 1080p video in only 7 seconds, more than 100 times faster than existing methods, while preserving high-quality detail generation and supporting image super-resolution up to 4K. Conclusion: TurboVSR performs on par with state-of-the-art video super-resolution methods while running inference more than 100 times faster, and it supports image super-resolution at higher resolutions such as 4K. Abstract: Diffusion-based generative models have demonstrated exceptional promise in the video super-resolution (VSR) task, achieving a substantial advancement in detail generation relative to prior methods. However, these approaches face significant computational efficiency challenges. For instance, current techniques may require tens of minutes to super-resolve a mere 2-second, 1080p video. In this paper, we present TurboVSR, an ultra-efficient diffusion-based video super-resolution model. Our core design comprises three key aspects: (1) We employ an autoencoder with a high compression ratio of 32$\times$32$\times$8 to reduce the number of tokens. (2) Highly compressed latents pose substantial challenges for training. We introduce factorized conditioning to mitigate the learning complexity: we first learn to super-resolve the initial frame; subsequently, we condition the super-resolution of the remaining frames on the high-resolution initial frame and the low-resolution subsequent frames. (3) We convert the pre-trained diffusion model to a shortcut model to enable fewer sampling steps, further accelerating inference. As a result, TurboVSR performs on par with state-of-the-art VSR methods, while being 100+ times faster, taking only 7 seconds to process a 2-second long 1080p video. TurboVSR also supports image super-resolution by treating an image as a one-frame video. Our efficient design makes SR beyond 1080p possible; results on 4K (3648$\times$2048) image SR show surprisingly fine details.
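
To give a feel for what a 32$\times$32$\times$8 compression ratio means for token count, the arithmetic can be sketched as follows. Only the 32$\times$32$\times$8 ratio and the 2-second 1080p clip come from the abstract; the 24 fps frame rate, the 8$\times$8$\times$4 comparison ratio, the ceiling-based padding, and the function name are assumptions made for illustration, and the real model's patchification may differ.

```python
import math


def latent_positions(frames, height, width, t_ratio, s_ratio):
    """Count latent positions after spatiotemporal compression (ceil on each axis)."""
    return (math.ceil(frames / t_ratio)
            * math.ceil(height / s_ratio)
            * math.ceil(width / s_ratio))


# Assumed clip: 2 seconds at 24 fps (the frame rate is an assumption), 1080p.
FRAMES, HEIGHT, WIDTH = 48, 1080, 1920

# 32x32 spatial, 8x temporal: the ratio stated in the abstract.
high = latent_positions(FRAMES, HEIGHT, WIDTH, t_ratio=8, s_ratio=32)
# 8x8 spatial, 4x temporal: a hypothetical lower-compression baseline.
base = latent_positions(FRAMES, HEIGHT, WIDTH, t_ratio=4, s_ratio=8)

print(high, base, round(base / high, 1))  # prints: 12240 388800 31.8
```

Under these assumptions the higher ratio leaves roughly 32 times fewer latent positions, which matters because attention cost grows faster than linearly in sequence length.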

[251] Revisiting Audio-Visual Segmentation with Vision-Centric Transformer

Shaofei Huang,Rui Ling,Tianrui Hui,Hongyu Li,Xu Zhou,Shifeng Zhang,Si Liu,Richang Hong,Meng Wang

Main category: cs.CV

TL;DR: This paper introduces a Vision-Centric Transformer (VCT) framework for Audio-Visual Segmentation (AVS), which addresses the limitations of traditional audio-centric approaches. The method leverages vision-derived queries and includes a PPQG module to enhance audio-visual information aggregation. Experimental results show that the VCT framework achieves state-of-the-art performance on the AVSBench dataset.

Details

Motivation: Traditional audio-centric Transformers face two main issues: perception ambiguity due to mixed audio signals and weakened dense prediction ability from visual detail loss. This work aims to overcome these limitations by introducing a vision-centric approach. Method: The authors propose a Vision-Centric Transformer (VCT) framework that uses vision-derived queries to iteratively gather corresponding audio and visual information. It also incorporates a Prototype Prompted Query Generation (PPQG) module for generating semantically aware and visually rich queries through audio prototype prompting and pixel context grouping. Result: Extensive experiments show that the VCT framework outperforms existing methods, achieving new state-of-the-art results on the AVSBench dataset. The framework demonstrates improved ability in distinguishing sound sources from mixed audio and accurately delineating object contours. Conclusion: The paper concludes that the proposed Vision-Centric Transformer (VCT) framework achieves state-of-the-art performance on three subsets of the AVSBench dataset by overcoming the limitations of traditional audio-centric Transformers. Abstract: Audio-Visual Segmentation (AVS) aims to segment sound-producing objects in video frames based on the associated audio signal. Prevailing AVS methods typically adopt an audio-centric Transformer architecture, where object queries are derived from audio features. However, audio-centric Transformers suffer from two limitations: perception ambiguity caused by the mixed nature of audio, and weakened dense prediction ability due to visual detail loss. To address these limitations, we propose a new Vision-Centric Transformer (VCT) framework that leverages vision-derived queries to iteratively fetch corresponding audio and visual information, enabling queries to better distinguish between different sounding objects from mixed audio and accurately delineate their contours. 
Additionally, we introduce a Prototype Prompted Query Generation (PPQG) module within our VCT framework to generate vision-derived queries that are both semantically aware and visually rich through audio prototype prompting and pixel context grouping, facilitating audio-visual information aggregation. Extensive experiments demonstrate that our VCT framework achieves new state-of-the-art performance on three subsets of the AVSBench dataset. The code is available at https://github.com/spyflying/VCT_AVS.

[252] Brain Tumor Detection through Thermal Imaging and MobileNET

Roham Maiti,Debasmita Bhoumik

Main category: cs.CV

TL;DR: This paper proposes a MobileNET-based model for fast and efficient brain tumor detection with high accuracy, addressing the limitations of traditional and classical machine learning methods.

Details

Motivation: Brain tumors pose significant health risks, and early detection is crucial. Traditional methods like MRI and CT scans are costly and require specialized expertise. Machine learning models have shown promise but face challenges in computational efficiency and accessibility. Method: A MobileNET model was used to develop an efficient brain tumor detection system, leveraging image processing techniques for accurate results. Result: The proposed method achieved an average accuracy of 98.5% in brain tumor detection. Conclusion: The research successfully utilizes the MobileNET model for efficient brain tumor detection, achieving high accuracy while reducing computational demands and training time. Abstract: The brain plays a crucial role in regulating body functions and cognitive processes, with brain tumors posing significant risks to human health. Precise and prompt detection is a key factor in proper treatment and better patient outcomes. Traditional methods for detecting brain tumors, which include biopsies, MRI, and CT scans, often face challenges due to their high costs and the need for specialized medical expertise. Recent developments in machine learning (ML) and deep learning (DL) have exhibited strong capabilities in automating the identification and categorization of brain tumors from medical images, especially MRI scans. However, these classical ML models have limitations, such as high computational demands, the need for large datasets, and long training times, which hinder their accessibility and efficiency. Our research uses the MobileNET model for efficient detection of these tumors. The novelty of this project lies in building an accurate tumor detection model that uses fewer computing resources and runs in less time, followed by efficient decision making through the use of image processing techniques for accurate results. The suggested method attained an average accuracy of 98.5%.

[253] Blending Concepts with Text-to-Image Diffusion Models

Lorenzo Olearo,Giorgio Longari,Alessandro Raganato,Rafael Peñaloza,Simone Melzi

Main category: cs.CV

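TL;DR: This paper finds that diffusion models can, without additional training, blend multiple textual concepts into new images using several methods, with results varying by method and input settings.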

Details

Motivation: To explore whether diffusion models can blend distinct textual concepts, from concrete objects to abstract ideas, into a new image that captures the essence of each concept. Method: The authors investigate four blending methods, each exploiting a different part of the diffusion pipeline (e.g., prompt scheduling, embedding interpolation, or layer-wise conditioning), through systematic experiments and a user study. Result: The experiments show that modern diffusion models can blend concepts creatively without further training, and the user study finds no single best method: each technique performs better under certain conditions. Conclusion: Under a zero-shot framework, diffusion models can blend distinct concepts into new visual entities, but each method has strengths and weaknesses in different scenarios, indicating that the models are sensitive to input details. Abstract: Diffusion models have dramatically advanced text-to-image generation in recent years, translating abstract concepts into high-fidelity images with remarkable ease. In this work, we examine whether they can also blend distinct concepts, ranging from concrete objects to intangible ideas, into coherent new visual entities under a zero-shot framework. Specifically, concept blending merges the key attributes of multiple concepts (expressed as textual prompts) into a single, novel image that captures the essence of each concept. We investigate four blending methods, each exploiting different aspects of the diffusion pipeline (e.g., prompt scheduling, embedding interpolation, or layer-wise conditioning). Through systematic experimentation across diverse concept categories, such as merging concrete concepts, synthesizing compound words, transferring artistic styles, and blending architectural landmarks, we show that modern diffusion models indeed exhibit creative blending capabilities without further training or fine-tuning. Our extensive user study, involving 100 participants, reveals that no single approach dominates in all scenarios: each blending technique excels under certain conditions, with factors like prompt ordering, conceptual distance, and random seed affecting the outcome. These findings highlight the remarkable compositional potential of diffusion models while exposing their sensitivity to seemingly minor input variations.
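
Two of the blending strategies named in the abstract, embedding interpolation and prompt scheduling, can be illustrated with a toy sketch. This is not the authors' implementation: the function names, the `alpha` and `switch_frac` parameters, and the 4-dimensional vectors standing in for text-encoder outputs are all assumptions, and no diffusion model is actually run.

```python
def interpolate_embeddings(emb_a, emb_b, alpha):
    """Element-wise linear blend; alpha=0 returns emb_a, alpha=1 returns emb_b."""
    if len(emb_a) != len(emb_b):
        raise ValueError("embeddings must have the same length")
    return [(1.0 - alpha) * a + alpha * b for a, b in zip(emb_a, emb_b)]


def scheduled_prompt(step, total_steps, switch_frac=0.5):
    """Return which prompt ('a' or 'b') conditions the given denoising step."""
    return "a" if step < switch_frac * total_steps else "b"


# Toy vectors standing in for the text embeddings of two concepts.
emb_a = [1.0, 0.0, 2.0, -1.0]
emb_b = [0.0, 1.0, 0.0, 1.0]

# Embedding interpolation: condition every step on a single blended vector.
blend = interpolate_embeddings(emb_a, emb_b, alpha=0.5)

# Prompt scheduling: condition early steps on concept a, later steps on b.
schedule = [scheduled_prompt(t, total_steps=10) for t in range(10)]
```

The order of the two prompts matters for scheduling but not for a 0.5 interpolation, which is one way the prompt-ordering sensitivity reported in the user study can arise.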

[254] Unified Multimodal Understanding via Byte-Pair Visual Encoding

Wanpeng Zhang,Yicheng Feng,Hao Luo,Yijiang Li,Zihao Yue,Sipeng Zheng,Zongqing Lu

Main category: cs.CV

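TL;DR: This paper proposes a new multimodal understanding framework that applies byte-pair encoding to visual tokens and combines it with a multi-stage training strategy, improving multimodal models' understanding of cross-modal relationships and their reasoning over visual information.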

Details

Motivation: Although multimodal large language models have made significant progress in vision-language understanding, effectively aligning different modalities remains a fundamental challenge. Method: Byte-pair encoding is applied to visual tokens, together with a priority-guided encoding scheme and a multi-stage training procedure based on curriculum-driven data composition. Result: Comprehensive experiments show improved performance across diverse vision-language tasks. Conclusion: Through a unified multimodal understanding framework, the work contributes to building more capable and efficient multimodal foundation models. Abstract: Multimodal large language models (MLLMs) have made significant progress in vision-language understanding, yet effectively aligning different modalities remains a fundamental challenge. We present a framework that unifies multimodal understanding by applying byte-pair encoding to visual tokens. Unlike conventional approaches that rely on modality-specific encoders, our method directly incorporates structural information into visual tokens, mirroring successful tokenization strategies in text-only language models. We introduce a priority-guided encoding scheme that considers both frequency and spatial consistency, coupled with a multi-stage training procedure based on curriculum-driven data composition. These enhancements enable the transformer model to better capture cross-modal relationships and reason with visual information. Comprehensive experiments demonstrate improved performance across diverse vision-language tasks. By bridging the gap between visual and textual representations, our approach contributes to the advancement of more capable and efficient multimodal foundation models.
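
For readers unfamiliar with byte-pair encoding, the core merge loop can be sketched on a flat sequence of discrete visual token IDs. This is plain frequency-based BPE in one dimension, written only to illustrate the idea: the paper's priority-guided scheme also weighs spatial consistency, which is not modelled here, and the function names, token IDs, and `first_new_id` parameter are hypothetical.

```python
from collections import Counter


def most_frequent_pair(seq):
    """Return the most frequent adjacent pair in seq, or None if seq is too short."""
    pairs = Counter(zip(seq, seq[1:]))
    return pairs.most_common(1)[0][0] if pairs else None


def merge_pair(seq, pair, new_id):
    """Replace each non-overlapping occurrence of pair with new_id, left to right."""
    out, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out


def bpe(seq, num_merges, first_new_id):
    """Apply up to num_merges merges; return the new sequence and the merge table."""
    merges = {}
    for k in range(num_merges):
        pair = most_frequent_pair(seq)
        if pair is None:
            break
        merges[pair] = first_new_id + k
        seq = merge_pair(seq, pair, first_new_id + k)
    return seq, merges


# Hypothetical quantized visual token IDs from one image row.
tokens = [3, 7, 3, 7, 5, 3, 7]
merged, table = bpe(tokens, num_merges=1, first_new_id=100)
print(merged, table)  # prints: [100, 100, 5, 100] {(3, 7): 100}
```

Each merge shortens the sequence and gives the recurring pattern its own vocabulary entry, which is the sense in which structural information ends up inside the tokens.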

[255] VAP-Diffusion: Enriching Descriptions with MLLMs for Enhanced Medical Image Generation

Peng Huang,Junhu Fu,Bowen Guo,Zeju Li,Yuanyuan Wang,Yi Guo

Main category: cs.CV

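TL;DR: This paper proposes the VAP-Diffusion framework, which uses knowledge from multimodal large language models to improve the quality and diversity of medical image generation, and strengthens adaptation to unseen description combinations through a Prototype Condition Mechanism.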

Details

Motivation: Generative models need rich attribute information, not just labels, to produce realistic and diverse medical images, but such detailed descriptions are often hard to obtain. Method: The authors propose VAP-Diffusion, a framework that uses Chain-of-Thought prompting to extract reliable descriptions from MLLMs and introduces a Prototype Condition Mechanism that restricts test embeddings to be similar to training embeddings. Result: Experiments on four datasets covering three common types of medical imaging verify the effectiveness of VAP-Diffusion, showing strong generation quality and diversity. Conclusion: VAP-Diffusion improves the quality and diversity of medical image generation by drawing on external knowledge from pre-trained multimodal large language models, and its Prototype Condition Mechanism makes the model more robust to unseen combinations of descriptions. Abstract: As the appearance of medical images is influenced by multiple underlying factors, generative models require rich attribute information beyond labels to produce realistic and diverse images. For instance, generating an image of a skin lesion with specific patterns demands descriptions that go beyond diagnosis, such as shape, size, texture, and color. However, such detailed descriptions are not always accessible. To address this, we explore a framework, termed Visual Attribute Prompts (VAP)-Diffusion, to leverage external knowledge from pre-trained Multi-modal Large Language Models (MLLMs) to improve the quality and diversity of medical image generation. First, to derive descriptions from MLLMs without hallucination, we design a series of prompts following Chain-of-Thoughts for common medical imaging tasks, including dermatologic, colorectal, and chest X-ray images. Generated descriptions are utilized during training and stored across different categories. During testing, descriptions are randomly retrieved from the corresponding category for inference. Moreover, to make the generator robust to unseen combinations of descriptions at test time, we propose a Prototype Condition Mechanism that restricts test embeddings to be similar to those from training. Experiments on three common types of medical imaging across four datasets verify the effectiveness of VAP-Diffusion.
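
The store-then-retrieve step described in the abstract (descriptions kept per category during training, then drawn at random from the matching category at test time) can be sketched as a small lookup structure. The class name, method names, category labels, and description strings below are hypothetical and chosen only for illustration; the Prototype Condition Mechanism itself is not modelled.

```python
import random
from collections import defaultdict


class DescriptionBank:
    """Stores generated descriptions per category and samples one at test time."""

    def __init__(self):
        self._bank = defaultdict(list)

    def add(self, category, description):
        """Record a description produced during training for this category."""
        self._bank[category].append(description)

    def sample(self, category, rng=random):
        """Return one stored description for the category, chosen at random."""
        candidates = self._bank.get(category)
        if not candidates:
            raise KeyError(f"no descriptions stored for category {category!r}")
        return rng.choice(candidates)


bank = DescriptionBank()
# Hypothetical categories and descriptions, for illustration only.
bank.add("melanoma", "irregular border, dark brown, asymmetric shape")
bank.add("melanoma", "uneven pigmentation, large diameter")
bank.add("nevus", "round, uniform light brown, smooth edge")

# A seeded generator keeps the draw reproducible.
picked = bank.sample("melanoma", rng=random.Random(0))
```

Because a test-time draw can pair attributes in ways never seen together in training, some constraint on the resulting embeddings is needed, which is the role the abstract assigns to the Prototype Condition Mechanism.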

[256] MReg: A Novel Regression Model with MoE-based Video Feature Mining for Mitral Regurgitation Diagnosis

Zhe Liu,Yuhao Huang,Lian Liu,Chengrui Zhang,Haotian Lin,Tong Han,Zhiyuan Zhu,Yanlin Chen,Yuerui Chen,Dong Ni,Zhongshan Gou,Xin Yang

Main category: cs.CV

TL;DR: This paper proposes MReg, an automated model for diagnosing mitral regurgitation using color Doppler echocardiography videos, achieving better accuracy and interpretability than existing methods.

Details

Motivation: To minimize user dependence and improve accuracy in diagnosing mitral regurgitation while aligning with clinical workflow, addressing limitations of previous approaches. Method: MReg was developed using color Doppler echocardiography videos following comprehensive feature mining strategies. It formulates MR diagnosis as a regression task, uses a feature selection and amplification mechanism, and introduces a feature summary module inspired by the Mixture-of-Experts concept. Result: MReg demonstrated superior performance compared to other weakly supervised video anomaly detection and supervised classification methods on a dataset of 1868 cases with three graded regurgitation labels. Conclusion: The study successfully introduced MReg, an automated diagnosis model for mitral regurgitation that aligns with clinical workflow and enhances diagnostic accuracy and interpretability through innovative feature mining strategies. Abstract: Color Doppler echocardiography is a crucial tool for diagnosing mitral regurgitation (MR). Recent studies have explored intelligent methods for MR diagnosis to minimize user dependence and improve accuracy. However, these approaches often fail to align with clinical workflow and may lead to suboptimal accuracy and interpretability. In this study, we introduce an automated MR diagnosis model (MReg) developed on the 4-chamber cardiac color Doppler echocardiography video (A4C-CDV). It follows comprehensive feature mining strategies to detect MR and assess its severity, considering clinical realities. Our contribution is threefold. First, we formulate the MR diagnosis as a regression task to capture the continuity and ordinal relationships between categories. Second, we design a feature selection and amplification mechanism to imitate the sonographer's diagnostic logic for accurate MR grading. 
Third, inspired by the Mixture-of-Experts concept, we introduce a feature summary module to extract the category-level features, enhancing the representational capacity for more accurate grading. We trained and evaluated our proposed MReg on a large in-house A4C-CDV dataset comprising 1868 cases with three graded regurgitation labels. Compared to other weakly supervised video anomaly detection and supervised classification methods, MReg demonstrated superior performance in MR diagnosis. Our code is available at: https://github.com/cskdstz/MReg.
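The first contribution, treating MR grading as regression so that the ordering of grades is respected, can be illustrated with a minimal sketch. It assumes the three grades are encoded as the indices 0, 1, and 2 and that a squared-error loss is used; the abstract specifies neither, and the function names `ordinal_regression_loss` and `decode_grade` are hypothetical.

```python
def ordinal_regression_loss(predicted_score, true_grade):
    """Squared error on the grade index, so larger grade gaps cost more (assumed loss)."""
    return (predicted_score - true_grade) ** 2


def decode_grade(predicted_score, num_grades=3):
    """Map a continuous score to the nearest valid grade index."""
    clamped = min(max(predicted_score, 0.0), num_grades - 1)
    return int(clamped + 0.5)


# Predicting an adjacent grade is penalized less than predicting a distant one.
near_miss = ordinal_regression_loss(1.0, 2)
far_miss = ordinal_regression_loss(0.0, 2)

# Continuous outputs are clamped and rounded back to the three grades.
grades = [decode_grade(score) for score in (-0.3, 1.4, 2.7)]
```

Under a plain classification loss, confusing grade 2 with grade 0 would cost the same as confusing it with grade 1; the regression view makes the larger error more expensive.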

[257] Towards Markerless Intraoperative Tracking of Deformable Spine Tissue

Connor Daly,Elettra Marconi,Marco Riva,Jinendra Ekanayake,Daniel S. Elson,Ferdinando Rodriguez y Baena

Main category: cs.CV

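TL;DR: This study proposes a new RGB-D imaging-based approach to intraoperative spine tracking that could improve surgical efficiency and precision.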

Details

Motivation: Markerless tracking can reduce operating time and complexity, but its use has so far been limited to cadaveric studies, so a new approach is needed to extend it to real clinical settings. Method: Develops a system for capturing deformation between preoperative and intraoperative spine states, and proposes a multi-task framework for predicting key regions for registration. Result: Produces the first real-world clinical RGB-D dataset for spine surgery and trains an intraoperative segmentation network on it. Conclusion: The paper introduces the SpineAlign system and the CorrespondNet framework, offering a new approach to registering key regions between intraoperative and preoperative spine states. Abstract: Consumer-grade RGB-D imaging for intraoperative orthopedic tissue tracking is a promising method with high translational potential. Unlike bone-mounted tracking devices, markerless tracking can reduce operating time and complexity. However, its use has been limited to cadaveric studies. This paper introduces the first real-world clinical RGB-D dataset for spine surgery and develops SpineAlign, a system for capturing deformation between preoperative and intraoperative spine states. We also present an intraoperative segmentation network trained on this data and introduce CorrespondNet, a multi-task framework for predicting key regions for registration in both intraoperative and preoperative scenes.

[258] On the Domain Robustness of Contrastive Vision-Language Models

Mario Koddenbrock,Rudolf Hoffmann,David Brodmann,Erik Rodner

Main category: cs.CV

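TL;DR: Deepbench is a new LLM-based framework for assessing the domain-specific robustness of vision-language models without labeled data, released as open source to support domain-aware robustness evaluation.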

Details

Motivation: Although pretrained foundation models perform well on general benchmarks, their performance can drop notably in specialized domains, so a method is needed to assess their robustness in real deployment scenarios. Method: Uses a large language model to generate realistic, context-aware image corruptions tailored to a specific deployment domain, without requiring labeled data. Result: Experiments across six real-world domains show substantial variability in robustness among architectures and variants, underscoring the importance of targeted evaluation. Conclusion: Deepbench is an open-source framework for assessing the domain-specific robustness of vision-language models and supports further research into domain-aware robustness. Abstract: In real-world vision-language applications, practitioners increasingly rely on large, pretrained foundation models rather than custom-built solutions, despite limited transparency regarding their training data and processes. While these models achieve impressive performance on general benchmarks, their effectiveness can decline notably under specialized domain shifts, such as unique imaging conditions or environmental variations. In this work, we introduce Deepbench, a framework designed to assess domain-specific robustness of vision-language models (VLMs). Deepbench leverages a large language model (LLM) to generate realistic, context-aware image corruptions tailored to specific deployment domains without requiring labeled data. We evaluate a range of contrastive vision-language architectures and architectural variants across six real-world domains and observe substantial variability in robustness, highlighting the need for targeted, domain-aware evaluation. Deepbench is released as open-source software to support further research into domain-aware robustness assessment.

[259] Partial Forward Blocking: A Novel Data Pruning Paradigm for Lossless Training Acceleration

Dongyue Wu,Zilin Guo,Jialong Zuo,Nong Sang,Changxin Gao

Main category: cs.CV

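TL;DR: This paper proposes Partial Forward Blocking (PFB), a new data pruning framework that uses shallow-layer features and probability density estimation to speed up training while reducing computational cost.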

Details

Motivation: Existing data pruning methods usually rely on gradient back-propagation or proxy models, which adds computational cost. A new approach is needed to accelerate training without these costs. Method: Proposes Partial Forward Blocking (PFB), which assesses sample importance using shallow-layer feature extraction and probability density estimation, and dynamically prunes less important samples during training. Result: On ImageNet, PFB achieves a 0.5% accuracy improvement and a 33% reduction in training time with 40% of the data pruned. Conclusion: PFB is an efficient data pruning method that needs neither gradients nor proxy models; by using shallow-layer features and probability density estimation, it reduces computational overhead while improving training efficiency and accuracy. Abstract: The ever-growing size of training datasets enhances the generalization capability of modern machine learning models but also incurs exorbitant computational costs. Existing data pruning approaches aim to accelerate training by removing those less important samples. However, they often rely on gradients or proxy models, leading to prohibitive additional costs of gradient back-propagation and proxy model training. In this paper, we propose Partial Forward Blocking (PFB), a novel framework for lossless training acceleration. The efficiency of PFB stems from its unique adaptive pruning pipeline: sample importance is assessed based on features extracted from the shallow layers of the target model. Less important samples are then pruned, allowing only the retained ones to proceed with the subsequent forward pass and loss back-propagation. This mechanism significantly reduces the computational overhead of deep-layer forward passes and back-propagation for pruned samples, while also eliminating the need for auxiliary backward computations and proxy model training. Moreover, PFB introduces probability density as an indicator of sample importance. Combined with an adaptive distribution estimation module, our method dynamically prioritizes relatively rare samples, aligning with the constantly evolving training state. Extensive experiments demonstrate the significant superiority of PFB in performance and speed. On ImageNet, PFB achieves a 0.5% accuracy improvement and 33% training time reduction with 40% data pruned.
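The pruning step described in this abstract can be illustrated with a minimal sketch. It assumes a diagonal Gaussian fitted to the current batch as the density estimate and a fixed keep ratio; the abstract does not specify the adaptive distribution estimation module, so this is a stand-in rather than the authors' method, and the names `gaussian_log_density` and `select_rare_samples` are hypothetical.

```python
import math


def gaussian_log_density(x, mean, var):
    """Log-density of x under a diagonal Gaussian (an assumed stand-in estimator)."""
    return sum(
        -0.5 * (math.log(2 * math.pi * v) + (xi - m) ** 2 / v)
        for xi, m, v in zip(x, mean, var)
    )


def select_rare_samples(features, keep_ratio):
    """Return sorted indices of the lowest-density (rarest) samples in a batch."""
    n, dim = len(features), len(features[0])
    mean = [sum(f[d] for f in features) / n for d in range(dim)]
    # A small epsilon keeps every variance strictly positive.
    var = [
        sum((f[d] - mean[d]) ** 2 for f in features) / n + 1e-6
        for d in range(dim)
    ]
    scores = [gaussian_log_density(f, mean, var) for f in features]
    n_keep = max(1, round(keep_ratio * n))
    # Lowest log-density first: rare samples are treated as more important.
    ranked = sorted(range(n), key=lambda i: scores[i])
    return sorted(ranked[:n_keep])


# Shallow-layer features for five samples; index 3 is a clear outlier.
shallow_features = [[0.0, 0.1], [0.1, 0.0], [0.05, 0.05], [3.0, 3.0], [-0.1, 0.0]]
kept = select_rare_samples(shallow_features, keep_ratio=0.6)
```

Only the samples in `kept` would continue through the deeper layers and back-propagation; the rest are blocked after the shallow forward pass, which is where the saving comes from.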

[260] Pruning by Block Benefit: Exploring the Properties of Vision Transformer Blocks during Domain Adaptation

Patrick Glandorf,Bodo Rosenhahn

Main category: cs.CV

TL;DR: This paper proposes P3B, a novel pruning method for Vision Transformers that evaluates block-level contributions to optimize parameter allocation, achieving superior performance in transfer learning scenarios with high sparsity.

Details

Motivation: To overcome the limitations of existing pruning methods that fail to accurately evaluate weight significance on unseen data or task-sensitive layers, leading to suboptimal performance. Method: The study introduces Pruning by Block Benefit (P3B), which optimizes pruning by evaluating the relative contribution of each block in the network, setting layer-wise keep ratios for optimal resource allocation. Result: P3B outperforms classical pruning approaches by enabling reactivation of late-converging blocks and achieves state-of-the-art results with minimal accuracy loss (0.64%) under high sparsity (70% parameter reduction). Conclusion: P3B proves to be an effective pruning method, particularly beneficial in transfer learning tasks and capable of maintaining high performance even with significant parameter reduction. Abstract: Vision Transformers have set new benchmarks in several tasks, but these models come with high computational costs, which makes them impractical for resource-limited hardware. Network pruning reduces the computational complexity by removing less important operations while maintaining performance. However, pruning a model on an unseen data domain leads to a misevaluation of weight significance, resulting in suboptimal resource assignment. In this work, we find that task-sensitive layers initially fail to improve the feature representation on downstream tasks, leading to performance loss for early pruning decisions. To address this problem, we introduce Pruning by Block Benefit (P3B), a pruning method that utilizes the relative contribution on block level to globally assign parameter resources. P3B identifies low-impact components to reduce parameter allocation while preserving critical ones. Classical pruning mask optimization struggles to reactivate zero-mask-elements. In contrast, P3B sets a layerwise keep ratio based on global performance metrics, ensuring the reactivation of late-converging blocks. 
We show in extensive experiments that P3B is a state-of-the-art pruning method, with the most noticeable gains in transfer learning tasks. Notably, P3B is able to conserve high performance even in high-sparsity regimes of 70% parameter reduction, losing only 0.64% in accuracy.
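The idea of setting layerwise keep ratios from relative block contribution can be sketched as follows. This is an illustration under stated assumptions, not the P3B algorithm: the proportional allocation rule, the `floor` parameter, and the function name `assign_keep_ratios` are all hypothetical, and the abstract does not say how block benefit is measured.

```python
def assign_keep_ratios(block_benefits, block_params, global_keep=0.3, floor=0.05):
    """Split a global parameter budget across blocks in proportion to their benefit.

    Assumed rule: each block receives a share of the budget proportional to its
    benefit score. The floor keeps every block partly alive, so a block that
    converges late can still regain parameters when ratios are recomputed.
    Clamping means the budget is only met approximately in general.
    """
    total_benefit = sum(block_benefits)
    budget = global_keep * sum(block_params)
    ratios = []
    for benefit, params in zip(block_benefits, block_params):
        share = budget * benefit / total_benefit
        ratios.append(min(1.0, max(floor, share / params)))
    return ratios


# Three equally sized blocks, keeping 30% of parameters overall (70% reduction).
benefits = [0.5, 0.3, 0.2]
params = [100, 100, 100]
ratios = assign_keep_ratios(benefits, params)
kept_params = sum(r * p for r, p in zip(ratios, params))
```

Because the ratios are recomputed from current benefit scores rather than frozen in a mask, a block whose contribution rises later in training would receive a larger share on the next update.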

[261] A Unified Framework for Stealthy Adversarial Generation via Latent Optimization and Transferability Enhancement

Gaozheng Pei,Ke Ma,Dongpeng Zhang,Chengzhi Sun,Qianqian Xu,Qingming Huang

Main category: cs.CV

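TL;DR: To address the limited task generalization and transferability of diffusion-based adversarial example generation, this paper proposes a new unified framework and demonstrates its strong performance in a real competition.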

Details

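Motivation: Because they rely on the discriminative capability of the diffusion model, existing diffusion-based adversarial example generation methods struggle to generalize beyond conventional image classification to tasks such as Deepfake detection, and traditional strategies for enhancing adversarial transferability are hard to apply to them directly. Method: Combines traditional transferability enhancement strategies with diffusion-based image editing to build an adversarial example generation framework applicable to a range of tasks, such as Deepfake detection. Result: The method won first place in the "1st Adversarial Attacks on Deepfake Detectors" challenge at ACM MM25, which validates its effectiveness. Conclusion: The paper proposes a unified framework that seamlessly integrates traditional transferability enhancement strategies into diffusion-based adversarial example generation via image editing, improving its applicability and effectiveness across a wider range of downstream tasks. Abstract: Due to their powerful image generation capabilities, diffusion-based adversarial example generation methods through image editing are rapidly gaining popularity. However, due to reliance on the discriminative capability of the diffusion model, these diffusion-based methods often struggle to generalize beyond conventional image classification to tasks such as Deepfake detection. Moreover, traditional strategies for enhancing adversarial example transferability are challenging to adapt to these methods. To address these challenges, we propose a unified framework that seamlessly incorporates traditional transferability enhancement strategies into diffusion model-based adversarial example generation via image editing, enabling their application across a wider range of downstream tasks. Our method won first place in the "1st Adversarial Attacks on Deepfake Detectors: A Challenge in the Era of AI-Generated Media" competition at ACM MM25, which validates the effectiveness of our approach.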

[262] SynMotion: Semantic-Visual Adaptation for Motion Customized Video Generation

Shuai Tan,Biao Gong,Yujie Wei,Shiwei Zhang,Zhuoxin Liu,Dandan Zheng,Jingdong Chen,Yan Wang,Hao Ouyang,Kecheng Zheng,Yujun Shen

Main category: cs.CV

TL;DR: SynMotion addresses the challenge of capturing complex motion in videos by combining semantic understanding with visual adaptation, leading to better performance compared to existing methods.

Details

Motivation: The motivation stems from the limitations of current approaches that rely on semantic-level alignment or visual representation alone, which leads to either overlooking visual complexity or causing semantic confusion in representing intended actions. Method: The method involves a dual-embedding semantic comprehension mechanism to disentangle subject and motion representations, integration of parameter-efficient motion adapters for enhanced motion fidelity, and an embedding-specific training strategy with alternate optimization of subject and motion embeddings using the Subject Prior Video dataset. Result: Experimental results across both T2V and I2V settings demonstrate that SynMotion outperforms existing baselines. A new benchmark called MotionBench was also introduced to evaluate diverse motion patterns. Conclusion: SynMotion is a new motion-customized video generation model that effectively combines semantic guidance and visual adaptation to overcome the limitations of existing approaches in handling complex spatio-temporal patterns in video data. Abstract: Diffusion-based video motion customization facilitates the acquisition of human motion representations from a few video samples, while achieving arbitrary subject transfer through precise textual conditioning. Existing approaches often rely on semantic-level alignment, expecting the model to learn new motion concepts and combine them with other entities (e.g., "cats" or "dogs") to produce visually appealing results. However, video data involve complex spatio-temporal patterns, and focusing solely on semantics causes the model to overlook the visual complexity of motion. Conversely, tuning only the visual representation leads to semantic confusion in representing the intended action. To address these limitations, we propose SynMotion, a new motion-customized video generation model that jointly leverages semantic guidance and visual adaptation. 
At the semantic level, we introduce the dual-embedding semantic comprehension mechanism which disentangles subject and motion representations, allowing the model to learn customized motion features while preserving its generative capabilities for diverse subjects. At the visual level, we integrate parameter-efficient motion adapters into a pre-trained video generation model to enhance motion fidelity and temporal coherence. Furthermore, we introduce a new embedding-specific training strategy which alternately optimizes subject and motion embeddings, supported by the manually constructed Subject Prior Video (SPV) training dataset. This strategy promotes motion specificity while preserving generalization across diverse subjects. Lastly, we introduce MotionBench, a newly curated benchmark with diverse motion patterns. Experimental results across both T2V and I2V settings demonstrate that SynMotion outperforms existing baselines. Project page: https://lucaria-academy.github.io/SynMotion/

[263] Single Image Test-Time Adaptation via Multi-View Co-Training

Smriti Joshi,Richard Osuala,Lidia Garrucho,Kaisar Kushibar,Dimitri Kessler,Oliver Diaz,Karim Lekadir

Main category: cs.CV

TL;DR: This paper introduces a novel test-time adaptation technique that enables accurate medical image segmentation with minimal data, outperforming existing approaches.

Details

Motivation: Test-time adaptation is crucial in clinical settings where on-the-fly domain adjustments are needed, but current techniques require large datasets and focus on 2D images, which limits their applicability. Method: The method enforces feature and prediction consistency through uncertainty-guided self-training, enabling effective adaptation using only a single test-time image. Result: The method was validated on three breast MRI datasets for tumor segmentation, showing an average improvement of 3.75% in Dice Similarity Coefficient over state-of-the-art methods. Conclusion: The proposed Patch-Based Multi-View Co-Training method for Single Image Test-Time adaptation achieves performance close to the supervised benchmark and outperforms existing methods in volumetric segmentation. Abstract: Test-time adaptation enables a trained model to adjust to a new domain during inference, making it particularly valuable in clinical settings where such on-the-fly adaptation is required. However, existing techniques depend on large target domain datasets, which are often impractical and unavailable in medical scenarios that demand per-patient, real-time inference. Moreover, current methods commonly focus on two-dimensional images, failing to leverage the volumetric richness of medical imaging data. Bridging this gap, we propose a Patch-Based Multi-View Co-Training method for Single Image Test-Time adaptation. Our method enforces feature and prediction consistency through uncertainty-guided self-training, enabling effective volumetric segmentation in the target domain with only a single test-time image. Validated on three publicly available breast magnetic resonance imaging datasets for tumor segmentation, our method achieves performance close to the upper bound supervised benchmark while also outperforming all existing state-of-the-art methods, on average by a Dice Similarity Coefficient of 3.75%. 
We publicly share our accessible codebase, readily integrable with the popular nnUNet framework, at https://github.com/smriti-joshi/muvi.git.
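For readers unfamiliar with the metric reported above, the Dice Similarity Coefficient between a predicted mask A and a reference mask B is 2|A ∩ B| / (|A| + |B|). The sketch below computes it on flattened binary masks; the function name `dice_coefficient` and the toy masks are illustrative and do not come from the paper's codebase.

```python
def dice_coefficient(pred, target):
    """Dice Similarity Coefficient between two flat binary masks of equal length."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same number of voxels")
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    total = sum(1 for p in pred if p) + sum(1 for t in target if t)
    # Two empty masks agree perfectly by convention.
    return 1.0 if total == 0 else 2.0 * intersection / total


# A volumetric mask would be flattened to one list of voxels before comparison.
predicted_mask = [1, 1, 0, 0, 1]
reference_mask = [1, 0, 0, 1, 1]
dsc = dice_coefficient(predicted_mask, reference_mask)
```

Here the masks share two foreground voxels out of three each, giving 2 * 2 / 6, roughly 0.667.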

[264] Subjective Camera: Bridging Human Cognition and Visual Reconstruction through Sequence-Aware Sketch-Guided Diffusion

Haoyang Chen,Dongfang Sun,Caoyuan Ma,Shiqin Wang,Kewei Zhang,Zheng Wang,Zhixiang Wang

Main category: cs.CV

TL;DR: Subjective Camera is a new paradigm that generates photorealistic images from verbal descriptions and rough sketches, overcoming limitations in language ambiguity, modality gaps, and sketch quality through a concept-sequential and reward-guided framework.

Details

Motivation: Existing methods struggle with subjective input biases, the large modality gap between planar sketches and 3D priors, and performance degradation due to low-quality sketches. There is a need for a solution that can generate photorealistic images from mental impressions without resource-intensive model adaptation or strict requirements on sketch precision. Method: The approach uses text-reward optimization to establish appearance priors, implements sequence-aware disentangled generation based on sketching order, and employs latent optimization to bridge the modality gap. It also incorporates a hierarchical reward-guided framework for improved performance with rough sketches. Result: Comprehensive evaluations demonstrate that the proposed framework achieves state-of-the-art performance in maintaining both semantic and spatial coherence across diverse datasets. Conclusion: The proposed Subjective Camera framework successfully overcomes key limitations in existing methods by accommodating subjective user inputs, bridging the modality gap between sketches and 3D priors, and enabling high-quality image generation from rough sketches without requiring artistic expertise. Abstract: We propose Subjective Camera, a human-as-imaging-device paradigm that reconstructs real-world scenes from mental impressions through synergistic use of verbal descriptions and progressive rough sketches. This approach overcomes dual limitations of language ambiguity and sketch abstraction by treating the user's drawing sequence as priors, effectively translating subjective perceptual expectations into photorealistic images. Existing approaches face three fundamental barriers: (1) user-specific subjective input biases, (2) huge modality gap between planar sketch and 3D priors in diffusion, and (3) sketch quality-sensitive performance degradation. Current solutions either demand resource-intensive model adaptation or impose impractical requirements on sketch precision. 
Our framework addresses these challenges through concept-sequential generation. (1) We establish robust appearance priors through text-reward optimization, and then implement sequence-aware disentangled generation that processes concepts in sketching order; these steps accommodate user-specific subjective expectation in a train-free way. (2) We employ latent optimization that effectively bridges the modality gap between planar sketches and 3D priors in diffusion. (3) Our hierarchical reward-guided framework enables the use of rough sketches without demanding artistic expertise. Comprehensive evaluation across diverse datasets demonstrates that our approach achieves state-of-the-art performance in maintaining both semantic and spatial coherence.

[265] When Small Guides Large: Cross-Model Co-Learning for Test-Time Adaptation

Chang'an Yi,Xiaohui Deng,Guohao Chen,Yan Zhou,Qinghua Lu,Shuaicheng Niu

Main category: cs.CV

TL;DR: This paper proposes COCA, a cross-model co-learning framework for test-time adaptation, which leverages knowledge sharing between models to improve domain adaptation performance.

Details

Motivation: Existing TTA methods focus on single-model adaptation. This work explores how cross-model knowledge can improve TTA performance, even when models differ significantly in size. Method: COCA introduces two strategies: Co-adaptation for integrating complementary knowledge across models and Self-adaptation for enhancing individual model strengths through unsupervised learning. Result: COCA improves ViT-Base's average adaptation accuracy on ImageNet-C from 51.7% to 64.5% using MobileViT's guidance, demonstrating its effectiveness as a plug-and-play module. Conclusion: The proposed COCA framework significantly enhances TTA performance across models of different sizes, establishing a new direction for cross-model co-learning in unsupervised online settings. Abstract: Test-time Adaptation (TTA) adapts a given model to testing domain data with potential domain shifts through online unsupervised learning, yielding impressive performance. However, to date, existing TTA methods primarily focus on single-model adaptation. In this work, we investigate an intriguing question: how does cross-model knowledge influence the TTA process? Our findings reveal that, in TTA's unsupervised online setting, each model can provide complementary, confident knowledge to the others, even when there are substantial differences in model size. For instance, a smaller model like MobileViT (10.6M parameters) can effectively guide a larger model like ViT-Base (86.6M parameters). In light of this, we propose COCA, a Cross-Model Co-Learning framework for TTA, which consists of two main strategies. 1) Co-adaptation adaptively integrates complementary knowledge from other models throughout the TTA process, reducing individual model biases. 2) Self-adaptation enhances each model's unique strengths via unsupervised learning, enabling diverse adaptation to the target domain. 
Extensive experiments show that COCA, which can also serve as a plug-and-play module, significantly boosts existing SOTAs, on models with various sizes--including ResNets, ViTs, and Mobile-ViTs--via cross-model co-learned TTA. For example, with Mobile-ViT's guidance, COCA raises ViT-Base's average adaptation accuracy on ImageNet-C from 51.7% to 64.5%. The code is publicly available at https://github.com/ycarobot/COCA.

[266] Proteus-ID: ID-Consistent and Motion-Coherent Video Customization

Guiyu Zhang,Chen Shi,Zijian Jiang,Xunzhi Xiang,Jingjing Qian,Shaoshuai Shi,Li Jiang

Main category: cs.CV

TL;DR: Proteus-ID is a new diffusion-based framework that improves video identity customization by ensuring identity consistency and generating more natural motion.

Details

Motivation: The task of video identity customization seeks to generate realistic and coherent videos from a single reference image and text prompt, addressing challenges like maintaining identity consistency and generating natural motion. Method: Proteus-ID uses a diffusion-based framework with three key components: Multimodal Identity Fusion (MIF), Time-Aware Identity Injection (TAII), and Adaptive Motion Learning (AML). Result: Proteus-ID achieves superior performance in identity preservation, text alignment, and motion quality compared to existing methods. Conclusion: Proteus-ID outperforms prior methods in identity preservation, text alignment, and motion quality for video identity customization. Abstract: Video identity customization seeks to synthesize realistic, temporally coherent videos of a specific subject, given a single reference image and a text prompt. This task presents two core challenges: (1) maintaining identity consistency while aligning with the described appearance and actions, and (2) generating natural, fluid motion without unrealistic stiffness. To address these challenges, we introduce Proteus-ID, a novel diffusion-based framework for identity-consistent and motion-coherent video customization. First, we propose a Multimodal Identity Fusion (MIF) module that unifies visual and textual cues into a joint identity representation using a Q-Former, providing coherent guidance to the diffusion model and eliminating modality imbalance. Second, we present a Time-Aware Identity Injection (TAII) mechanism that dynamically modulates identity conditioning across denoising steps, improving fine-detail reconstruction. Third, we propose Adaptive Motion Learning (AML), a self-supervised strategy that reweights the training loss based on optical-flow-derived motion heatmaps, enhancing motion realism without requiring additional inputs. 
To support this task, we construct Proteus-Bench, a high-quality dataset comprising 200K curated clips for training and 150 individuals from diverse professions and ethnicities for evaluation. Extensive experiments demonstrate that Proteus-ID outperforms prior methods in identity preservation, text alignment, and motion quality, establishing a new benchmark for video identity customization. Codes and data are publicly available at https://grenoble-zhang.github.io/Proteus-ID/.
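The loss reweighting behind Adaptive Motion Learning can be sketched in a few lines. The abstract says only that the training loss is reweighted by optical-flow-derived motion heatmaps; the specific weighting rule here (one plus a scaled, peak-normalized heat value), the `alpha` parameter, and the function name `motion_weighted_loss` are assumptions made for illustration.

```python
def motion_weighted_loss(per_pixel_loss, motion_heatmap, alpha=1.0):
    """Weighted mean of per-pixel losses, emphasizing high-motion pixels.

    Assumed rule: weight = 1 + alpha * (heat / peak heat), so static pixels
    keep weight 1 and the most dynamic pixel gets weight 1 + alpha.
    """
    peak = max(motion_heatmap)
    weights = [
        1.0 + alpha * (heat / peak if peak > 0 else 0.0)
        for heat in motion_heatmap
    ]
    weighted_sum = sum(w * l for w, l in zip(weights, per_pixel_loss))
    return weighted_sum / sum(weights)


# Three pixels: two static with low error, one moving with high error.
pixel_losses = [1.0, 1.0, 4.0]
flow_magnitudes = [0.0, 0.0, 2.0]
loss = motion_weighted_loss(pixel_losses, flow_magnitudes)
```

With these toy values the unweighted mean is 2.0, while the weighted loss is 2.5, so errors on the moving pixel pull harder on the optimizer.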

[267] Can We Challenge Open-Vocabulary Object Detectors with Generated Content in Street Scenes?

Annika Mütze,Sadia Ilyas,Christian Dörpelkus,Matthias Rottmann

Main category: cs.CV

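TL;DR: This study uses synthetic data to challenge open-vocabulary object detectors, reveals systematic failure modes, and points to ways of improving data acquisition.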

Details

Motivation: Because open-vocabulary object detectors such as Grounding DINO are trained on large, diverse data, their limitations remain unclear, which raises concerns about their generalization in safety-critical applications. Method: Designs two automated pipelines that use Stable Diffusion to inpaint unusual objects with high semantic diversity, sampling multiple nouns from WordNet and ChatGPT. Result: Experiments show that synthetic data can challenge open-vocabulary object detectors, particularly in terms of overlooked objects. The models are also found to depend more on object location than on object semantics. Conclusion: Synthetic data offers a systematic way to challenge open-vocabulary object detectors and reveals how these models depend on object location versus semantics. Abstract: Open-vocabulary object detectors such as Grounding DINO are trained on vast and diverse data, achieving remarkable performance on challenging datasets. As a result, it is unclear where to find their limitations, which is of major concern when they are used in safety-critical applications. Real-world data does not provide the control required for a rigorous evaluation of model generalization. In contrast, synthetically generated data allows systematic exploration of the boundaries of model competence/generalization. In this work, we address two research questions: 1) Can we challenge open-vocabulary object detectors with generated image content? 2) Can we find systematic failure modes of those models? To address these questions, we design two automated pipelines using Stable Diffusion to inpaint unusual objects with high diversity in semantics, by sampling multiple substantives from WordNet and ChatGPT. On the synthetically generated data, we evaluate and compare multiple open-vocabulary object detectors as well as a classical object detector. The synthetic data is derived from two real-world datasets, namely LostAndFound, a challenging out-of-distribution (OOD) detection benchmark, and the NuImages dataset. Our results indicate that inpainting can challenge open-vocabulary object detectors in terms of overlooking objects. Additionally, we find a strong dependence of open-vocabulary models on object location, rather than on object semantics. This provides a systematic approach to challenge open-vocabulary models and gives valuable insights on how data could be acquired to effectively improve these models.

[268] Mamba-FETrack V2: Revisiting State Space Model for Frame-Event based Visual Object Tracking

Shiao Wang,Ju Huang,Qingchuan Ma,Jinfeng Gao,Chunyi Xu,Xiao Wang,Lan Chen,Bo Jiang

Main category: cs.CV

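TL;DR: This paper proposes Mamba-FETrack V2, an efficient RGB-Event object tracking framework built on the linear-complexity Vision Mamba network, which achieves cross-modal interaction and fusion by dynamically generating modality-specific learnable prompt vectors.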

Details

Motivation: Most existing multimodal tracking algorithms rely on high-complexity Vision Transformer architectures, which incurs substantial computational overhead and limits the effectiveness of cross-modal interaction. Method: Designs a lightweight Prompt Generator that combines the embedded features of each modality with a shared prompt pool to dynamically generate modality-specific learnable prompt vectors. These prompts, together with the modality-specific embedded features, are fed into the Vision Mamba-based FEMamba backbone, which performs prompt-guided feature extraction, cross-modal interaction, and fusion in a unified manner. Result: Evaluations on multiple RGB-Event tracking benchmarks, including the short-term COESOT dataset and the long-term FE108 and FELT V2 datasets, demonstrate the superior performance and efficiency of the proposed framework. Conclusion: Mamba-FETrack V2 offers an efficient and effective solution for RGB-Event object tracking that reduces computational overhead and improves cross-modal interaction. Abstract: Combining traditional RGB cameras with bio-inspired event cameras for robust object tracking has garnered increasing attention in recent years. However, most existing multimodal tracking algorithms depend heavily on high-complexity Vision Transformer architectures for feature extraction and fusion across modalities. This not only leads to substantial computational overhead but also limits the effectiveness of cross-modal interactions. In this paper, we propose an efficient RGB-Event object tracking framework based on the linear-complexity Vision Mamba network, termed Mamba-FETrack V2. Specifically, we first design a lightweight Prompt Generator that utilizes embedded features from each modality, together with a shared prompt pool, to dynamically generate modality-specific learnable prompt vectors. These prompts, along with the modality-specific embedded features, are then fed into a Vision Mamba-based FEMamba backbone, which facilitates prompt-guided feature extraction, cross-modal interaction, and fusion in a unified manner. Finally, the fused representations are passed to the tracking head for accurate target localization. Extensive experimental evaluations on multiple RGB-Event tracking benchmarks, including the short-term COESOT dataset and the long-term datasets FE108 and FELT V2, demonstrate the superior performance and efficiency of the proposed tracking framework. The source code and pre-trained models will be released at https://github.com/Event-AHU/Mamba_FETrack

[269] Visual Textualization for Image Prompted Object Detection

Yongjian Wu,Yang Zhou,Jiya Saiyin,Bingzheng Wei,Yan Xu

Main category: cs.CV

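TL;DR: VisTex-OVLM is a new image-prompted object detection method that converts visual exemplars into text features to improve detection of rare categories while preserving the original model architecture and generalization ability.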

Details

Motivation: To improve the performance of object-level vision-language models on rare categories that are hard to describe in text and nearly absent from pre-training data. Method: Uses multi-scale textualizing blocks and a multi-stage fusion strategy to integrate visual information from visual exemplars into the text feature space, generating textualized visual tokens that guide the OVLM. Result: VisTex-OVLM shows superior performance on open-set datasets and achieves state-of-the-art results on the PASCAL VOC and MSCOCO few-shot benchmarks. Conclusion: By introducing visual textualization, VisTex-OVLM improves the ability of object-level vision-language models to detect rare categories while preserving their original architecture and generalization ability. Abstract: We propose VisTex-OVLM, a novel image prompted object detection method that introduces visual textualization -- a process that projects a few visual exemplars into the text feature space to enhance Object-level Vision-Language Models' (OVLMs) capability in detecting rare categories that are difficult to describe textually and nearly absent from their pre-training data, while preserving their pre-trained object-text alignment. Specifically, VisTex-OVLM leverages multi-scale textualizing blocks and a multi-stage fusion strategy to integrate visual information from visual exemplars, generating textualized visual tokens that effectively guide OVLMs alongside text prompts. Unlike previous methods, our method preserves the original architecture of the OVLM, maintaining its generalization capabilities while enhancing performance in few-shot settings. VisTex-OVLM demonstrates superior performance across open-set datasets which have minimal overlap with OVLM's pre-training data and achieves state-of-the-art results on few-shot benchmarks PASCAL VOC and MSCOCO. The code will be released at https://github.com/WitGotFlg/VisTex-OVLM.

[270] Controllable Reference-Based Real-World Remote Sensing Image Super-Resolution with Generative Diffusion Priors

Ce Wang,Wanjie Sun

Main category: cs.CV

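TL;DR: This paper proposes CRefDiff, a new controllable reference-based super-resolution diffusion model that addresses shortcomings of existing methods and performs well on real-world remote sensing images.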

Details

Motivation: Existing RefSR methods fall short when handling real-world complexities. Method: Proposes CRefDiff, built on the pretrained Stable Diffusion model, with a dual-branch fusion mechanism and a Better Start strategy. Result: CRefDiff achieves state-of-the-art results across various metrics and significantly accelerates inference. Conclusion: CRefDiff achieves state-of-the-art performance and improves downstream tasks. Abstract: Super-resolution (SR) techniques can enhance the spatial resolution of remote sensing images by utilizing low-resolution (LR) images to reconstruct high-resolution (HR) images, enabling more efficient large-scale earth observation applications. While single-image super-resolution (SISR) methods have shown progress, reference-based super-resolution (RefSR) offers superior performance by incorporating historical HR images alongside current LR observations. However, existing RefSR methods struggle with real-world complexities, such as cross-sensor resolution gap and significant land cover changes, often leading to under-generation or over-reliance on the reference image. To address these challenges, we propose CRefDiff, a novel controllable reference-based diffusion model for real-world remote sensing image SR. To address the under-generation problem, CRefDiff is built upon the pretrained Stable Diffusion model, leveraging its powerful generative prior to produce accurate structures and textures. To mitigate over-reliance on the reference, we introduce a dual-branch fusion mechanism that adaptively integrates both local and global information from the reference image. Moreover, this novel dual-branch design enables reference strength control during inference, enhancing interactivity and flexibility of the model. Finally, a strategy named Better Start is proposed to significantly reduce the number of denoising steps, thereby accelerating the inference process. To support further research, we introduce Real-RefRSSRD, a new real-world RefSR dataset for remote sensing images, consisting of HR NAIP and LR Sentinel-2 image pairs with diverse land cover changes and significant temporal gaps.
Extensive experiments on Real-RefRSSRD show that CRefDiff achieves state-of-the-art performance across various metrics and improves downstream tasks such as scene classification and semantic segmentation.

[271] Towards Initialization-free Calibrated Bundle Adjustment

Carl Olsson,Amanda Nilsson

Main category: cs.CV

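TL;DR: This paper proposes a new initialization-free BA method that introduces pairwise relative rotation estimates, allowing it to exploit known camera calibration and produce near-metric solutions.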

Details

Motivation: Existing pOSE methods cannot incorporate camera calibration, so the solution is only determined up to a projective transformation of the scene and more data is needed for a successful reconstruction. The authors therefore seek a method that can use the camera calibration to improve reconstruction accuracy. Method: The method introduces pairwise relative rotation estimates that are invariant only to similarity transformations, integrating rotation averaging into the pOSE framework to work towards initialization-free calibrated SfM. Result: Experiments show that the method reliably optimizes the objective from random starting solutions and converges to the global minimum, yielding accurate near-metric reconstructions. Conclusion: The method addresses the inability of existing pOSE methods to exploit camera calibration and achieves more accurate near-metric reconstructions. Abstract: A recent series of works has shown that initialization-free BA can be achieved using pseudo Object Space Error (pOSE) as a surrogate objective. The initial reconstruction-step optimizes an objective where all terms are projectively invariant and it cannot incorporate knowledge of the camera calibration. As a result, the solution is only determined up to a projective transformation of the scene and the process requires more data for successful reconstruction. In contrast, we present a method that is able to use the known camera calibration thereby producing near metric solutions, that is, reconstructions that are accurate up to a similarity transformation. To achieve this we introduce pairwise relative rotation estimates that carry information about camera calibration. These are only invariant to similarity transformations, thus encouraging solutions that preserve metric features of the real scene. Our method can be seen as integrating rotation averaging into the pOSE framework striving towards initialization-free calibrated SfM. Our experimental evaluation shows that we are able to reliably optimize our objective, achieving convergence to the global minimum with high probability from random starting solutions, resulting in accurate near metric reconstructions.

[272] MadCLIP: Few-shot Medical Anomaly Detection with CLIP

Mahshid Shiri,Cigdem Beyan,Vittorio Murino

Main category: cs.CV

TL;DR: This paper proposes MadCLIP, a new few-shot anomaly detection method for medical data based on the CLIP model, which performs well for both image-level and pixel-level tasks without needing synthetic data or memory banks.

Details

Motivation: The motivation behind this work is to develop an innovative few-shot anomaly detection approach for medical data using the pre-trained CLIP model, adapted for both image-level and pixel-level anomaly detection tasks. Method: MadCLIP uses a dual-branch design with learnable adapters in the CLIP vision encoder to separately capture normal and abnormal features. Learnable text prompts are used for semantic alignment, and SigLIP loss is applied to handle the many-to-one relationship between images and unpaired text prompts. Result: The approach was validated on multiple modalities and showed better performance than existing methods in both same-dataset and cross-dataset evaluations for anomaly classification and segmentation. Conclusion: The paper concludes that their proposed method, MadCLIP, achieves superior performance in image-level anomaly classification and pixel-level anomaly segmentation without relying on synthetic data or memory banks. Abstract: An innovative few-shot anomaly detection approach is presented, leveraging the pre-trained CLIP model for medical data, and adapting it for both image-level anomaly classification (AC) and pixel-level anomaly segmentation (AS). A dual-branch design is proposed to separately capture normal and abnormal features through learnable adapters in the CLIP vision encoder. To improve semantic alignment, learnable text prompts are employed to link visual features. Furthermore, SigLIP loss is applied to effectively handle the many-to-one relationship between images and unpaired text prompts, showcasing its adaptation in the medical field for the first time. Our approach is validated on multiple modalities, demonstrating superior performance over existing methods for AC and AS, in both same-dataset and cross-dataset evaluations. Unlike prior work, it does not rely on synthetic data or memory banks, and an ablation study confirms the contribution of each component. 
The code is available at https://github.com/mahshid1998/MadCLIP.
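
The sigmoid-based pairwise loss mentioned in the abstract can be sketched as follows. This is a minimal illustration of the general SigLIP-style objective, not the authors' implementation: the function names, the plain-list embeddings, the `labels` convention (index of the matching prompt for each image), and the default `temperature` and `bias` values are all assumptions. Because every image-prompt pair is scored independently, several images can share one prompt, which is the many-to-one case the abstract refers to.

```python
import math


def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def sigmoid_pairwise_loss(image_embs, text_embs, labels, temperature=10.0, bias=-10.0):
    """SigLIP-style loss over all image-prompt pairs (illustrative sketch).

    image_embs, text_embs: lists of L2-normalised vectors (assumed).
    labels[i]: index of the text prompt that matches image i (assumed convention).
    """
    total = 0.0
    for i, img in enumerate(image_embs):
        for j, txt in enumerate(text_embs):
            logit = temperature * sum(a * b for a, b in zip(img, txt)) + bias
            z = 1.0 if labels[i] == j else -1.0
            # -log(sigmoid(z * logit)) equals softplus(-z * logit)
            total += softplus(-z * logit)
    return total / len(image_embs)
```

With two prompts (for example one "normal" and one "abnormal" prompt, a hypothetical setup), any number of images can be assigned to either prompt without a softmax over the batch.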

[273] Interpretable Zero-Shot Learning with Locally-Aligned Vision-Language Model

Shiming Chen,Bowen Duan,Salman Khan,Fahad Shahbaz Khan

Main category: cs.CV

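TL;DR: Proposes LaZSL, a locally-aligned vision-language model that addresses the lack of interpretability of large-scale vision-language models in zero-shot learning.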

Details

Motivation: Large-scale vision-language models often lack interpretability, which calls for interpretable models that integrate language through discrete attributes. Method: Local visual-semantic alignment via optimal transport performs the interaction between visual regions and their associated attributes. Result: Experiments show enhanced interpretability, improved accuracy, and strong domain generalization. Conclusion: LaZSL provides a new locally-aligned vision-language model for interpretable zero-shot learning, achieving effective alignment and interpretable similarity without additional training. Abstract: Large-scale vision-language models (VLMs), such as CLIP, have achieved remarkable success in zero-shot learning (ZSL) by leveraging large-scale visual-text pair datasets. However, these methods often lack interpretability, as they compute the similarity between an entire query image and the embedded category words, making it difficult to explain their predictions. One approach to address this issue is to develop interpretable models by integrating language, where classifiers are built using discrete attributes, similar to human perception. This introduces a new challenge: how to effectively align local visual features with corresponding attributes based on pre-trained VLMs. To tackle this, we propose LaZSL, a locally-aligned vision-language model for interpretable ZSL. LaZSL employs local visual-semantic alignment via optimal transport to perform interaction between visual regions and their associated attributes, facilitating effective alignment and providing interpretable similarity without the need for additional training. Extensive experiments demonstrate that our method offers several advantages, including enhanced interpretability, improved accuracy, and strong domain generalization. Code is available at: https://github.com/shiming-chen/LaZSL.

[274] Flash-VStream: Efficient Real-Time Understanding for Long Video Streams

Haoji Zhang,Yiqin Wang,Yansong Tang,Yong Liu,Jiashi Feng,Xiaojie Jin

Main category: cs.CV

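TL;DR: Flash-VStream is an efficient video language model that uses a purpose-built Flash Memory module to process extremely long videos and respond to user queries in real time, showing strong performance and efficiency on multiple benchmarks.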

Details

Motivation: Existing multimodal large language models still struggle to understand and process long videos, because their long-context nature leads to significant computational and memory overhead. Method: A Flash Memory module is designed, containing a low-capacity context memory and a high-capacity augmentation memory. Result: Compared with existing models, Flash-VStream significantly reduces inference latency and shows state-of-the-art performance and outstanding efficiency on multiple long video benchmarks. Conclusion: Flash-VStream is an efficient video language model that can process extremely long videos and responds well to user queries in real time. Abstract: Benefiting from the advances in large language models and cross-modal alignment, existing multimodal large language models have achieved prominent performance in image and short video understanding. However, the understanding of long videos is still challenging, as their long-context nature results in significant computational and memory overhead. Most existing work treats long videos in the same way as short videos, which is inefficient for real-world applications and hard to generalize to even longer videos. To address these issues, we propose Flash-VStream, an efficient video language model capable of processing extremely long videos and responding to user queries in real time. Particularly, we design a Flash Memory module, containing a low-capacity context memory to aggregate long-context temporal information and model the distribution of information density, and a high-capacity augmentation memory to retrieve detailed spatial information based on this distribution. Compared to existing models, Flash-VStream achieves significant reductions in inference latency. Extensive experiments on long video benchmarks and comprehensive video benchmarks, i.e., EgoSchema, MLVU, LVBench, MVBench and Video-MME, demonstrate the state-of-the-art performance and outstanding efficiency of our method. Code is available at https://github.com/IVGSZ/Flash-VStream.

[275] Spatially Gene Expression Prediction using Dual-Scale Contrastive Learning

Mingcheng Qu,Yuncong Wu,Donglin Di,Yue Gao,Tonghua Su,Yang Song,Lei Fan

Main category: cs.CV

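TL;DR: This paper proposes NH2ST, a new gene expression prediction framework that combines the spatial context of pathology images with multimodal gene expression data, addressing the complex interactions that existing methods neglect and substantially outperforming prior work.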

Details

Motivation: Existing gene expression prediction methods typically rely on single image patches or a single pathology modality, neglecting the complex interactions between the target region and its neighboring information (e.g., gene co-expression). As a result, they fail to establish connections among adjacent regions and to capture intricate cross-modal relationships. Method: Proposes the NH2ST framework, comprising a query branch and a neighbor branch, which uses cross-attention and contrastive learning to capture the intrinsic associations between pathology images and gene expression and to ensure their alignment. Result: Extensive experiments on six datasets show that the model outperforms existing methods by more than 20% in PCC metrics. Conclusion: The NH2ST framework performs strongly on gene expression prediction, surpasses existing methods, and highlights the importance of integrating spatial context and multimodal data. Abstract: Spatial transcriptomics (ST) provides crucial insights into tissue micro-environments, but is limited by its high cost and complexity. As an alternative, predicting gene expression from pathology whole slide images (WSI) is gaining increasing attention. However, existing methods typically rely on single patches or a single pathology modality, neglecting the complex spatial and molecular interactions between target and neighboring information (e.g., gene co-expression). This leads to a failure in establishing connections among adjacent regions and capturing intricate cross-modal relationships. To address these issues, we propose NH2ST, a framework that integrates spatial context and both pathology and gene modalities for gene expression prediction. Our model comprises a query branch and a neighbor branch to process paired target patch and gene data and their neighboring regions, where cross-attention and contrastive learning are employed to capture intrinsic associations and ensure alignments between pathology and gene expression. Extensive experiments on six datasets demonstrate that our model consistently outperforms existing methods, achieving an improvement of over 20% in PCC metrics. Code is available at https://github.com/MCPathology/NH2ST
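
For readers unfamiliar with the PCC metric cited above, the following sketch computes the Pearson correlation coefficient between predicted and measured expression, per gene across spots, and then averages over genes. The per-gene averaging and the function names are assumptions for illustration; the abstract does not state how the paper aggregates PCC.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if norm_x == 0.0 or norm_y == 0.0:
        return 0.0  # constant input: correlation is undefined, report 0 by convention
    return cov / (norm_x * norm_y)


def mean_gene_pcc(pred, true):
    """Average PCC over genes; pred and true are spots x genes matrices (assumed layout)."""
    n_genes = len(pred[0])
    per_gene = [
        pearson([row[g] for row in pred], [row[g] for row in true])
        for g in range(n_genes)
    ]
    return sum(per_gene) / n_genes
```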

[276] Low-latency vision transformers via large-scale multi-head attention

Ronit D. Gross,Tal Halevi,Ella Koresh,Yarden Tzach,Ido Kanter

Main category: cs.CV

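TL;DR: The study reveals the learning mechanism of multi-head attention in transformer models, shows that a higher signal-to-noise ratio improves classification accuracy, and reduces latency by incorporating convolutional layers, offering new insights for deep learning.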

Details

Motivation: It was recently found that, in classification tasks, multi-head attention (MHA) within transformer blocks exhibits spontaneous symmetry breaking among a few heads, with each head focusing its attention on a subset of labels through cooperation among its single-nodal performances (SNPs). This study aims to generalize that mechanism to large-scale MHA (LS-MHA). Method: The study uses single-head performance (SHP) matrices to quantify spontaneous symmetry breaking in multi-head attention, analogous to single-filter performance in convolutional neural networks. Result: Each SHP matrix comprises multiple unit clusters, so that each label is explicitly recognized by a few heads with negligible noise. This raises the signal-to-noise ratio (SNR) along the transformer blocks and thereby improves classification accuracy. The mechanism gives rise to several distinct ViT architectures that reach the same accuracy. Conclusion: Beyond improving classification accuracy, replacing the initial transformer blocks with convolutional layers significantly reduces latency while maintaining accuracy. Extending this learning mechanism to natural language processing tasks has the potential to yield new insights in deep learning. Abstract: The emergence of spontaneous symmetry breaking among a few heads of multi-head attention (MHA) across transformer blocks in classification tasks was recently demonstrated through the quantification of single-nodal performance (SNP). This finding indicates that each head focuses its attention on a subset of labels through cooperation among its SNPs. This underlying learning mechanism is generalized to large-scale MHA (LS-MHA) using a single matrix value representing single-head performance (SHP), analogous to single-filter performance in convolutional neural networks (CNNs). The results indicate that each SHP matrix comprises multiple unit clusters such that each label is explicitly recognized by a few heads with negligible noise. This leads to an increased signal-to-noise ratio (SNR) along the transformer blocks, thereby improving classification accuracy. These features give rise to several distinct vision transformer (ViT) architectures that achieve the same accuracy but differ in their LS-MHA structures. As a result, their soft committee yields superior accuracy, an outcome not typically observed in CNNs which rely on hundreds of filters. In addition, a significant reduction in latency is achieved without affecting the accuracy by replacing the initial transformer blocks with convolutional layers. This substitution accelerates early-stage learning, which is then improved by subsequent transformer layers.
The extension of this learning mechanism to natural language processing tasks, based on quantitative differences between CNNs and ViT architectures, has the potential to yield new insights in deep learning. The findings are demonstrated using compact convolutional transformer architectures trained on the CIFAR-100 dataset.

[277] PointSSIM: A novel low dimensional resolution invariant image-to-image comparison metric

Oscar Ovanger,Ragnar Hauge,Jacob Skauvold,Michael J. Pyrcz,Jo Eidsvik

Main category: cs.CV

TL;DR: This paper introduces PointSSIM, a resolution-invariant image comparison metric inspired by structural similarity and mathematical morphology, which transforms binary images into marked point patterns for robust comparison.

Details

Motivation: The motivation is to develop a resolution-invariant, low-dimensional image-to-image comparison metric that enables robust comparison across binary images of varying resolutions. Method: PointSSIM transforms binary images into marked point pattern representations by extracting anchor points and uses a summary vector to capture intensity, connectivity, complexity, and structural attributes for image comparisons. Result: Results show that PointSSIM provides an efficient and reliable method for image comparison across different resolutions. Conclusion: PointSSIM is an efficient and reliable method for image comparison, especially suitable for applications requiring structural analysis across different resolutions. Abstract: This paper presents PointSSIM, a novel low-dimensional image-to-image comparison metric that is resolution invariant. Drawing inspiration from the structural similarity index measure and mathematical morphology, PointSSIM enables robust comparison across binary images of varying resolutions by transforming them into marked point pattern representations. The key features of the image, referred to as anchor points, are extracted from binary images by identifying locally adaptive maxima from the minimal distance transform. Image comparisons are then performed using a summary vector, capturing intensity, connectivity, complexity, and structural attributes. Results show that this approach provides an efficient and reliable method for image comparison, particularly suited to applications requiring structural analysis across different resolutions.

[278] Refine Any Object in Any Scene

Ziwei Chen,Ziling Liu,Zitong Huang,Mingqi Gao,Feng Zheng

Main category: cs.CV

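TL;DR: This paper proposes RAISE, a new framework that addresses missing object viewpoints in scene reconstruction, recovering high-fidelity geometry and appearance through a two-stage refinement and achieving leading results on multiple benchmarks.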

Details

Motivation: Missing viewpoints are common in scene reconstruction, which makes high-quality object-level modeling very difficult while maintaining the scene-level representation. Addressing this requires a method that can recover object details under missing views. Method: Proposes RAISE, a 3D enhancement framework that substitutes degraded objects with proxies from a generative model with strong 3D understanding and progressively refines geometry and texture through a two-stage strategy (7-DOF pose alignment and registration-constrained enhancement). Result: Experiments show that RAISE significantly outperforms state-of-the-art methods on novel view synthesis and geometry completion. Conclusion: The RAISE framework effectively addresses the object modeling problems caused by missing viewpoints in scene reconstruction, recovering high-fidelity object geometry and appearance. Abstract: Missing object viewpoints are common in scene reconstruction, as camera paths typically prioritize capturing the overall scene structure rather than individual objects. This makes it highly challenging to achieve high-fidelity object-level modeling while maintaining accurate scene-level representation. Addressing this issue is critical for advancing downstream tasks requiring detailed object understanding and appearance modeling. In this paper, we introduce Refine Any object In any ScenE (RAISE), a novel 3D enhancement framework that leverages 3D generative priors to recover fine-grained object geometry and appearance under missing views. Starting by substituting degraded objects with proxies produced by a 3D generative model with strong 3D understanding, RAISE progressively refines geometry and texture by aligning each proxy to its degraded counterpart in 7-DOF pose, followed by correcting spatial and appearance inconsistencies via registration-constrained enhancement. This two-stage refinement ensures the high-fidelity geometry and appearance of the original object in unseen views while maintaining consistency in spatial positioning, observed geometry, and appearance. Extensive experiments on challenging benchmarks show that RAISE significantly outperforms state-of-the-art methods in both novel view synthesis and geometry completion tasks. RAISE is made publicly available at https://github.com/PolySummit/RAISE.

[279] RGC-VQA: An Exploration Database for Robotic-Generated Video Quality Assessment

Jianing Jin,Jiangyong Ying,Huiyu Duan,Liu Yang,Sijing Wu,Yunhao Li,Yushuo Zheng,Xiongkuo Min,Guangtao Zhai

Main category: cs.CV

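TL;DR: This paper introduces the concept of Robotic-Generated Content (RGC) and establishes the first RGC database (RGCD). Existing video quality assessment models perform poorly on RGC, so dedicated RGC assessment models are needed.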

Details

Motivation: As robotic platforms become increasingly integrated into daily life, robotic-generated videos have begun to appear on streaming platforms, yet dedicated research on their perceptual quality is lacking. Method: A Robotic-Generated Content Database (RGCD) of 2,100 videos is established, followed by a subjective video quality assessment experiment and a benchmark experiment evaluating state-of-the-art video quality assessment models. Result: Experiments show that existing video quality assessment models have significant limitations when handling complex robotic-generated content. Conclusion: The study exposes the limitations of existing video quality assessment models on robotic-generated content and underlines the need for RGC-specific models. Abstract: As camera-equipped robotic platforms become increasingly integrated into daily life, robotic-generated videos have begun to appear on streaming media platforms, enabling us to envision a future where humans and robots coexist. We propose the concept of Robotic-Generated Content (RGC) to describe these videos generated from the egocentric perspective of robots. The perceptual quality of RGC videos is critical in human-robot interaction scenarios, and RGC videos exhibit unique distortions and visual requirements that differ markedly from those of professionally-generated content (PGC) videos and user-generated content (UGC) videos. However, dedicated research on quality assessment of RGC videos is still lacking. To address this gap and to support broader robotic applications, we establish the first Robotic-Generated Content Database (RGCD), which contains a total of 2,100 videos drawn from three robot categories and sourced from diverse platforms. A subjective VQA experiment is conducted subsequently to assess human visual perception of robotic-generated videos. Finally, we conduct a benchmark experiment to evaluate the performance of 11 state-of-the-art VQA models on our database. Experimental results reveal significant limitations in existing VQA models when applied to complex, robotic-generated content, highlighting a critical need for RGC-specific VQA models. Our RGCD is publicly available at: https://github.com/IntMeGroup/RGC-VQA.

[280] HiNeuS: High-fidelity Neural Surface Mitigating Low-texture and Reflective Ambiguity

Yida Wang,Xueyang Zhang,Kun Zhan,Peng Jia,Xianpeng Lang

Main category: cs.CV

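TL;DR: HiNeuS uses a unified framework to address multi-view radiance inconsistency, missing keypoints in textureless regions, and structural degradation in neural surface reconstruction, letting geometry and appearance constraints evolve jointly and performing strongly across multiple evaluations.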

Details

Motivation: To reconcile geometric fidelity with photometric consistency in neural surface reconstruction, in particular multi-view radiance inconsistency, missing keypoints in textureless regions, and structural degradation from over-enforced Eikonal constraints during joint optimization. Method: Introduces differential visibility verification through SDF-guided ray tracing, planar-conformal regularization via ray-aligned geometry patches, and physically-grounded Eikonal relaxation based on local radiance gradients. Result: Comprehensive evaluations show state-of-the-art performance on synthetic and real-world datasets, including a 21.4% reduction in Chamfer distance and a 2.32 dB PSNR improvement, and the method is successfully applied to inverse rendering tasks such as material decomposition and view-consistent relighting. Conclusion: HiNeuS lets geometry and appearance constraints evolve jointly and generalizes to inverse rendering tasks. Abstract: Neural surface reconstruction faces persistent challenges in reconciling geometric fidelity with photometric consistency under complex scene conditions. We present HiNeuS, a unified framework that holistically addresses three core limitations in existing approaches: multi-view radiance inconsistency, missing keypoints in textureless regions, and structural degradation from over-enforced Eikonal constraints during joint optimization. To resolve these issues through a unified pipeline, we introduce: 1) Differential visibility verification through SDF-guided ray tracing, resolving reflection ambiguities via continuous occlusion modeling; 2) Planar-conformal regularization via ray-aligned geometry patches that enforce local surface coherence while preserving sharp edges through adaptive appearance weighting; and 3) Physically-grounded Eikonal relaxation that dynamically modulates geometric constraints based on local radiance gradients, enabling detail preservation without sacrificing global regularity. Unlike prior methods that handle these aspects through sequential optimizations or isolated modules, our approach achieves cohesive integration where appearance-geometry constraints evolve synergistically throughout training. Comprehensive evaluations across synthetic and real-world datasets demonstrate state-of-the-art performance, including a 21.4% reduction in Chamfer distance over reflection-aware baselines and 2.32 dB PSNR improvement against neural rendering counterparts.
Qualitative analyses reveal superior capability in recovering specular instruments, urban layouts with centimeter-scale infrastructure, and low-textured surfaces without local patch collapse. The method's generalizability is further validated through successful application to inverse rendering tasks, including material decomposition and view-consistent relighting.

[281] A Closer Look at Conditional Prompt Tuning for Vision-Language Models

Ji Zhang,Shihan Wu,Lianli Gao,Jingkuan Song,Nicu Sebe,Heng Tao Shen

Main category: cs.CV

TL;DR: This paper introduces Class-adaptive Prompt Tuning (CaPT) and its extension DeCaPT to address the Base-New Tradeoff (BNT) problem in Vision-Language models, achieving superior performance over existing methods.

Details

Motivation: Existing prompt tuning methods face the Base-New Tradeoff (BNT) dilemma, where better performance on base tasks leads to reduced generalization for new tasks. Conditional Prompt Tuning using Visual Image Information (VII) has shown limited success, prompting the need for a more effective conditioning strategy like TCI. Method: The authors propose Class-adaptive Prompt Tuning (CaPT), which learns dynamic prompts conditioned on Textual Class Information (TCI) to overcome the BNT problem. They also introduce a new conditional PT approach called DeCaPT by combining CaPT with the DePT framework. Result: Extensive experiments across 11 datasets demonstrate that CaPT consistently improves performance over five strong unconditional PT baselines at minimal computational cost. DeCaPT outperforms the state-of-the-art conditional PT method by an average of 3.49% in H ACC scores. Conclusion: Class-adaptive Prompt Tuning (CaPT) effectively addresses the Base-New Tradeoff (BNT) dilemma in Vision-Language Pretrained Models (VLPMs), enhancing model generalization to new tasks. The proposed DeCaPT framework, integrating CaPT with DePT, surpasses existing state-of-the-art conditional PT schemes. Abstract: Despite the great promise of Prompt Tuning (PT) in adapting large Vision-Language Pretrained Models (VLPMs) to downstream tasks, they often struggle to overcome the Base-New Tradeoff (BNT) dilemma: as VLPMs are better tuned to a base task, their ability to generalize to new tasks diminishes. Recent work on conditional PT addresses this problem by replacing static prompts with dynamic Visual Image Information (VII)-conditioned prompts, improving the model's generalization to new tasks to some extent. In this work, we first identify a critical issue with existing conditional PT methods: using VII as the "condition" of prompts yields suboptimal performance, and even random noise-conditioned prompts can outperform the VII-conditioned counterparts. 
On further analysis, we find that learning dynamic prompts conditioned on Textual Class Information (TCI) is the key to solving the BNT problem. Motivated by this, we then propose Class-adaptive Prompt Tuning (CaPT), which enables fast adaptation of tuned models to new classes by learning TCI-conditioned prompts from base classes. Remarkably, CaPT can be used as a plugin to mitigate the BNT problem for existing unconditional PT schemes. Extensive experiments on 11 datasets show that CaPT consistently improves the performance of five strong unconditional PT baselines with negligible additional computational cost. Additionally, by integrating CaPT with our recently proposed DePT framework, we devise a new conditional PT approach, termed DeCaPT, which outperforms the H ACC of the state-of-the-art conditional PT scheme by 3.49%, averaged over the 11 datasets. Code: https://github.com/Koorye/CaPT.
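
The "H ACC" figure reported above is assumed here to be the harmonic mean of base-class and new-class accuracy, which is the usual summary metric in base-to-new prompt tuning evaluations; the abstract itself does not define it. Under that assumption, a short sketch shows why the metric captures the Base-New Tradeoff: it rewards balanced accuracy and penalises gains on base classes that come at the expense of new classes. The function name and the example numbers are illustrative only.

```python
def harmonic_accuracy(base_acc, new_acc):
    """Harmonic mean of base-class and new-class accuracy (assumed meaning of H)."""
    if base_acc + new_acc == 0:
        return 0.0
    return 2.0 * base_acc * new_acc / (base_acc + new_acc)


# Illustrative numbers only, not results from the paper.
balanced = harmonic_accuracy(70.0, 70.0)
skewed = harmonic_accuracy(90.0, 50.0)  # same arithmetic mean, lower H
```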

[282] VMoBA: Mixture-of-Block Attention for Video Diffusion Models

Jianzong Wu,Liang Hou,Haotian Yang,Xin Tao,Ye Tian,Pengfei Wan,Di Zhang,Yunhai Tong

Main category: cs.CV

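TL;DR: This paper proposes VMoBA, a sparse attention mechanism designed for video diffusion models that lowers computational complexity and improves training and inference efficiency without sacrificing generation quality.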

Details

Motivation: Existing sparse attention methods do not fully capture the spatio-temporal characteristics of video data, while full attention suffers from a quadratic complexity bottleneck, so a sparse attention mechanism specifically adapted to video diffusion models is needed. Method: Proposes VMoBA, a new sparse attention mechanism with three key modifications: a layer-wise recurrent block partition scheme, global block selection, and threshold-based block selection. Result: Experiments show that VMoBA significantly accelerates video diffusion model training on long sequences, achieving a 2.92x FLOPs and 1.48x latency speedup, and performs well in high-resolution video generation. Conclusion: VMoBA is competitive in both training and training-free inference, significantly speeding up video diffusion models while maintaining or improving generation quality. Abstract: The quadratic complexity of full attention mechanisms poses a significant bottleneck for Video Diffusion Models (VDMs) aiming to generate long-duration, high-resolution videos. While various sparse attention methods have been proposed, many are designed as training-free inference accelerators or do not optimally capture the unique spatio-temporal characteristics inherent in video data when trained natively. This paper introduces Video Mixture of Block Attention (VMoBA), a novel sparse attention mechanism specifically adapted for VDMs. Motivated by an in-depth analysis of attention patterns within pre-trained video transformers, which revealed strong spatio-temporal locality, varying query importance, and head-specific concentration levels, VMoBA enhances the original MoBA framework with three key modifications: (1) a layer-wise recurrent block partition scheme (1D-2D-3D) to dynamically adapt to diverse spatio-temporal attention patterns and improve efficiency; (2) global block selection to prioritize the most salient query-key block interactions across an entire attention head; and (3) threshold-based block selection to dynamically determine the number of attended blocks based on their cumulative similarity. Extensive experiments demonstrate that VMoBA significantly accelerates the training of VDMs on longer sequences, achieving 2.92x FLOPs and 1.48x latency speedup, while attaining comparable or even superior generation quality to full attention. Furthermore, VMoBA exhibits competitive performance in training-free inference, offering 2.40x FLOPs and 1.35x latency speedup for high-res video generation.
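
The third modification, threshold-based block selection, can be illustrated with a small sketch: blocks are ranked by similarity and kept until their cumulative normalised similarity reaches a threshold, so peaked attention keeps few blocks and flat attention keeps more. This is a guess at the general shape of the idea based on the abstract alone; the function name, the normalisation by the sum of scores, and the `tau` parameter are assumptions, and scores are taken to be non-negative.

```python
def select_blocks_by_threshold(scores, tau=0.9):
    """Keep the highest-scoring blocks until their cumulative share reaches tau.

    scores: non-negative query-key block similarities (assumed, e.g. after softmax).
    Returns the indices of the selected blocks, most similar first.
    """
    total = sum(scores)
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    selected, cumulative = [], 0.0
    for idx in order:
        selected.append(idx)
        cumulative += scores[idx] / total
        if cumulative >= tau:
            break
    return selected
```

Compared with a fixed top-k rule, the number of attended blocks here adapts to how concentrated the similarity distribution is.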

[283] Puzzles: Unbounded Video-Depth Augmentation for Scalable End-to-End 3D Reconstruction

Jiahao Ma,Lei Wang,Miaomiao Liu,David Ahmedt-Aristizabal,Chuong Nguyen

Main category: cs.CV

TL;DR: Puzzles enhances 3D reconstruction by generating synthetic training data from minimal input, achieving strong results with less data.

Details

Motivation: The lack of diverse and large-scale training data limits the performance of current multi-view 3D reconstruction methods like DUST3R. Method: Puzzles synthesizes video-depth data by applying targeted image transformations that simulate diverse camera trajectories and realistic scene geometry. Result: Models trained with Puzzles using only 10% of original data achieve accuracy comparable to those trained on full datasets, demonstrating significant data efficiency and performance improvement. Conclusion: Puzzles is an effective data augmentation strategy that improves the performance of video-based 3D reconstruction models by generating high-quality posed video-depth data from limited inputs. Abstract: Multi-view 3D reconstruction remains a core challenge in computer vision. Recent methods, such as DUST3R and its successors, directly regress pointmaps from image pairs without relying on known scene geometry or camera parameters. However, the performance of these models is constrained by the diversity and scale of available training data. In this work, we introduce Puzzles, a data augmentation strategy that synthesizes an unbounded volume of high-quality posed video-depth data from a single image or video clip. By simulating diverse camera trajectories and realistic scene geometry through targeted image transformations, Puzzles significantly enhances data variety. Extensive experiments show that integrating Puzzles into existing video-based 3D reconstruction pipelines consistently boosts performance without modifying the underlying network architecture. Notably, models trained on only ten percent of the original data augmented with Puzzles still achieve accuracy comparable to those trained on the full dataset. Code is available at https://jiahao-ma.github.io/puzzles/.

Reihaneh Zohrabi,Hosein Hasani,Mahdieh Soleymani Baghshah,Anna Rohrbach,Marcus Rohrbach,Mohammad Hossein Rohban

Main category: cs.CV

TL;DR: SPROD enhances OOD detection by mitigating spurious correlations, offering strong performance improvements without extra data or tuning.

Details

Motivation: Existing OOD detection methods are vulnerable to spurious correlations, which compromise model robustness in real-world applications. Method: Proposed SPROD, a prototype-based post-hoc method that refines class prototypes to reduce bias from spurious features. Result: SPROD achieves an average improvement of 4.7% in AUROC and 9.3% in FPR@95 over the second-best method across various datasets including CelebA, Waterbirds, and Animals MetaCoCo. Conclusion: SPROD improves OOD detection performance by addressing spurious correlations without additional data or hyperparameter tuning, showing superior results on multiple datasets. Abstract: Out-of-distribution (OOD) detection is crucial for ensuring the reliability and safety of machine learning models in real-world applications, where they frequently face data distributions unseen during training. Despite progress, existing methods are often vulnerable to spurious correlations that mislead models and compromise robustness. To address this, we propose SPROD, a novel prototype-based OOD detection approach that explicitly addresses the challenge posed by unknown spurious correlations. Our post-hoc method refines class prototypes to mitigate bias from spurious features without additional data or hyperparameter tuning, and is broadly applicable across diverse backbones and OOD detection settings. We conduct a comprehensive spurious correlation OOD detection benchmarking, comparing our method against existing approaches and demonstrating its superior performance across challenging OOD datasets, such as CelebA, Waterbirds, UrbanCars, Spurious Imagenet, and the newly introduced Animals MetaCoCo. On average, SPROD improves AUROC by 4.7% and FPR@95 by 9.3% over the second best.
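
As background for the prototype-based approach, the sketch below shows plain prototype scoring: each class prototype is the mean feature of its training samples, and a test sample's OOD score is its distance to the nearest prototype. It deliberately omits SPROD's prototype refinement, whose details are not given in the abstract; the Euclidean distance and the function names are assumptions.

```python
import math


def class_prototypes(features, labels):
    """Mean feature vector per class (unrefined prototypes)."""
    sums, counts = {}, {}
    for feat, label in zip(features, labels):
        if label not in sums:
            sums[label] = [0.0] * len(feat)
            counts[label] = 0
        sums[label] = [s + v for s, v in zip(sums[label], feat)]
        counts[label] += 1
    return {label: [s / counts[label] for s in sums[label]] for label in sums}


def ood_score(feature, prototypes):
    """Distance to the nearest class prototype; larger means more likely OOD."""
    return min(math.dist(feature, proto) for proto in prototypes.values())
```

A spurious feature shared by most samples of a class shifts that class's mean, which is the kind of bias a refinement step would aim to remove.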

[285] PriOr-Flow: Enhancing Primitive Panoramic Optical Flow with Orthogonal View

Longliang Liu,Miaojie Feng,Junda Cheng,Jijun Xiang,Xuan Zhu,Xin Yang

Main category: cs.CV

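TL;DR: This paper proposes PriOr-Flow, a novel dual-branch framework that addresses distortion in panoramic optical flow estimation and performs especially well in polar regions.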

Details

Motivation: Severe distortions caused by sphere-to-plane projections, such as the equirectangular projection (ERP), significantly degrade the performance of conventional perspective-based optical flow methods, especially in polar regions. PriOr-Flow is proposed to address this problem. Method: The paper introduces the Dual-Cost Collaborative Lookup (DCCL) operator and the Ortho-Driven Distortion Compensation (ODDC) module, which jointly retrieve correlation information and iteratively refine motion features. Result: Extensive experiments show that PriOr-Flow is compatible with various perspective-based iterative optical flow methods and consistently achieves state-of-the-art performance on publicly available panoramic optical flow datasets. Conclusion: PriOr-Flow is a new dual-branch framework that leverages the low-distortion nature of the orthogonal view to enhance panoramic optical flow estimation, particularly in polar regions. It is compatible with various perspective-based iterative optical flow methods and achieves state-of-the-art performance on publicly available panoramic optical flow datasets. Abstract: Panoramic optical flow enables a comprehensive understanding of temporal dynamics across wide fields of view. However, severe distortions caused by sphere-to-plane projections, such as the equirectangular projection (ERP), significantly degrade the performance of conventional perspective-based optical flow methods, especially in polar regions. To address this challenge, we propose PriOr-Flow, a novel dual-branch framework that leverages the low-distortion nature of the orthogonal view to enhance optical flow estimation in these regions. Specifically, we introduce the Dual-Cost Collaborative Lookup (DCCL) operator, which jointly retrieves correlation information from both the primitive and orthogonal cost volumes, effectively mitigating distortion noise during cost volume construction. Furthermore, our Ortho-Driven Distortion Compensation (ODDC) module iteratively refines motion features from both branches, further suppressing polar distortions. Extensive experiments demonstrate that PriOr-Flow is compatible with various perspective-based iterative optical flow methods and consistently achieves state-of-the-art performance on publicly available panoramic optical flow datasets, setting a new benchmark for wide-field motion estimation. The code is publicly available at: https://github.com/longliangLiu/PriOr-Flow.

[286] GroundingDINO-US-SAM: Text-Prompted Multi-Organ Segmentation in Ultrasound with LoRA-Tuned Vision-Language Models

Hamza Rasaee,Taha Koleilat,Hassan Rivaz

Main category: cs.CV

TL;DR: This study introduces a prompt-driven vision-language model (VLM) for accurate and generalizable object segmentation in ultrasound imaging, demonstrating superior performance over existing methods while reducing dependence on large annotated datasets.

Details

Motivation: Accurate and generalizable object segmentation in ultrasound imaging is challenging due to anatomical variability, diverse imaging protocols, and limited annotated data. This study aims to address these challenges by leveraging vision-language models. Method: A prompt-driven vision-language model (VLM) was developed by integrating Grounding DINO with SAM2. Fine-tuning of Grounding DINO for the ultrasound domain was performed using Low Rank Adaptation (LoRA). The model was tested across multiple ultrasound organs using diverse public datasets. Result: The proposed approach outperformed state-of-the-art segmentation methods such as UniverSeg, MedSAM, MedCLIP-SAM, BiomedParse, and SAMUS on most seen datasets and maintained strong performance on unseen datasets without additional fine-tuning. Conclusion: The proposed prompt-driven vision-language model (VLM) demonstrates significant promise in achieving scalable and robust ultrasound image analysis, reducing the reliance on large, organ-specific annotated datasets. Abstract: Accurate and generalizable object segmentation in ultrasound imaging remains a significant challenge due to anatomical variability, diverse imaging protocols, and limited annotated data. In this study, we propose a prompt-driven vision-language model (VLM) that integrates Grounding DINO with SAM2 to enable object segmentation across multiple ultrasound organs. A total of 18 public ultrasound datasets, encompassing the breast, thyroid, liver, prostate, kidney, and paraspinal muscle, were utilized. These datasets were divided into 15 for fine-tuning and validation of Grounding DINO using Low Rank Adaptation (LoRA) to the ultrasound domain, and 3 were held out entirely for testing to evaluate performance in unseen distributions. 
Comprehensive experiments demonstrate that our approach outperforms state-of-the-art segmentation methods, including UniverSeg, MedSAM, MedCLIP-SAM, BiomedParse, and SAMUS on most seen datasets while maintaining strong performance on unseen datasets without additional fine-tuning. These results underscore the promise of VLMs in scalable and robust ultrasound image analysis, reducing dependence on large, organ-specific annotated datasets. We will publish our code on code.sonography.ai after acceptance.

[287] Three-dimensional end-to-end deep learning for brain MRI analysis

Radhika Juglan,Marta Ligero,Zunamys I. Carrero,Asier Rabasco,Tim Lenz,Leo Misera,Gregory Patrick Veldhuizen,Paul Kuntke,Hagen H. Kitzler,Sven Nebelung,Daniel Truhn,Jakob Nikolas Kather

Main category: cs.CV

TL;DR: This study shows that simpler convolutional neural networks (like SFCN) generalize better than more complex deep learning models when predicting age and sex from brain imaging data across multiple independent datasets.

Details

Motivation: Deep learning methods are increasingly used in brain imaging, but their generalizability across diverse imaging cohorts remains underexplored. Age and sex are key neurobiological markers, making them ideal targets for assessing model performance. Method: The study evaluated three 3D deep learning architectures (SFCN, DenseNet, and Swin Transformers) for age and sex prediction using T1-weighted MRI data from four independent cohorts (UKB, DLBS, PPMI, IXI). Result: SFCN consistently outperformed other models, achieving an AUC of 1.00 in internal testing for sex classification and MAE of 2.66 for age prediction in the UK Biobank dataset. It also showed superior performance in external test sets across multiple metrics. Conclusion: Simpler convolutional networks like SFCN outperform more complex architectures such as DenseNet and Swin Transformers in brain image analysis, showing better generalizability across different datasets. Abstract: Deep learning (DL) methods are increasingly outperforming classical approaches in brain imaging, yet their generalizability across diverse imaging cohorts remains inadequately assessed. As age and sex are key neurobiological markers in clinical neuroscience, influencing brain structure and disease risk, this study evaluates three of the existing three-dimensional architectures, namely Simple Fully Connected Network (SFCN), DenseNet, and Shifted Window (Swin) Transformers, for age and sex prediction using T1-weighted MRI from four independent cohorts: UK Biobank (UKB, n=47,390), Dallas Lifespan Brain Study (DLBS, n=132), Parkinson's Progression Markers Initiative (PPMI, n=108 healthy controls), and Information eXtraction from Images (IXI, n=319). We found that SFCN consistently outperformed more complex architectures with AUC of 1.00 [1.00-1.00] in UKB (internal test set) and 0.85-0.91 in external test sets for sex classification. 
For the age prediction task, SFCN demonstrated a mean absolute error (MAE) of 2.66 (r=0.89) in UKB and 4.98-5.81 (r=0.55-0.70) across external datasets. Pairwise DeLong and Wilcoxon signed-rank tests with Bonferroni corrections confirmed SFCN's superiority over Swin Transformer across most cohorts (p<0.017, for three comparisons). Explainability analysis further demonstrates the regional consistency of model attention across cohorts and specific to each task. Our findings reveal that simpler convolutional networks outperform the denser and more complex attention-based DL architectures in brain image analysis by demonstrating better generalizability across different datasets.
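
The significance threshold quoted above (p<0.017 for three comparisons) is what a Bonferroni correction yields if the family-wise level is 0.05; that level is an assumption here, since the abstract does not state it. The short sketch below shows that arithmetic together with the mean absolute error used for age prediction. The ages are made-up illustrative values, not data from the study.

```python
def bonferroni_threshold(alpha, n_comparisons):
    """Per-comparison significance threshold under Bonferroni correction."""
    return alpha / n_comparisons


def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between true and predicted values."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# Assumed family-wise alpha of 0.05 split over three pairwise comparisons.
threshold = bonferroni_threshold(0.05, 3)

# Hypothetical true and predicted ages in years.
mae = mean_absolute_error([60, 70, 80], [62, 69, 85])
```

Rounded to three decimals the threshold is 0.017, matching the value reported in the abstract.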

[288] Thinking with Images for Multimodal Reasoning: Foundations, Methods, and Future Frontiers

Zhaochen Su,Peng Xia,Hangyu Guo,Zhenhua Liu,Yan Ma,Xiaoye Qu,Jiaqi Liu,Yanshu Li,Kaide Zeng,Zhengyuan Yang,Linjie Li,Yu Cheng,Heng Ji,Junxian He,Yi R. Fung

Main category: cs.CV

TL;DR: This paper discusses the transition in artificial intelligence from merely processing images to actually thinking with them, highlighting the potential for more advanced and human-like multimodal reasoning.

Details

Motivation: The motivation behind this paper is the observation that current text-centric approaches in multimodal reasoning create a 'semantic gap' between perceptual data and symbolic thought, which needs to be overcome for more effective AI. Method: The paper provides a survey of the emerging paradigm of thinking with images, establishing its foundational principles and a three-stage framework, reviewing core methods, analyzing evaluation benchmarks and applications, and identifying challenges and future directions. Result: The result of the study is a structured overview of the evolution of intelligence in AI towards increased cognitive autonomy, characterized by the use of visual information as an intermediate step in thought processes. Conclusion: The paper concludes that there is a paradigm shift in AI from models that think about images to those that can think with images, creating a more human-aligned multimodal AI. Abstract: Recent progress in multimodal reasoning has been significantly advanced by textual Chain-of-Thought (CoT), a paradigm where models conduct reasoning within language. This text-centric approach, however, treats vision as a static, initial context, creating a fundamental "semantic gap" between rich perceptual data and discrete symbolic thought. Human cognition often transcends language, utilizing vision as a dynamic mental sketchpad. A similar evolution is now unfolding in AI, marking a fundamental paradigm shift from models that merely think about images to those that can truly think with images. This emerging paradigm is characterized by models leveraging visual information as intermediate steps in their thought process, transforming vision from a passive input into a dynamic, manipulable cognitive workspace. 
In this survey, we chart this evolution of intelligence along a trajectory of increasing cognitive autonomy, which unfolds across three key stages: from external tool exploration, through programmatic manipulation, to intrinsic imagination. To structure this rapidly evolving field, our survey makes four key contributions. (1) We establish the foundational principles of the think with image paradigm and its three-stage framework. (2) We provide a comprehensive review of the core methods that characterize each stage of this roadmap. (3) We analyze the critical landscape of evaluation benchmarks and transformative applications. (4) We identify significant challenges and outline promising future directions. By providing this structured overview, we aim to offer a clear roadmap for future research towards more powerful and human-aligned multimodal AI.

[289] Evaluating the Impact of Khmer Font Types on Text Recognition

Vannkinh Nom,Souhail Bakkali,Muhammad Muzzamil Luqman,Mickael Coustaty,Jean-Marc Ogier

Main category: cs.CV

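TL;DR: This study tests how a range of Khmer fonts affect OCR accuracy, finding that some fonts perform well while others perform poorly, which highlights the importance of font selection in optimizing OCR systems.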

Details

Motivation: Khmer fonts are numerous and structurally distinctive, which poses challenges for optical character recognition (OCR) systems; studying how different fonts affect OCR accuracy is therefore needed to improve system performance. Method: Pytesseract is used to evaluate the impact of 19 randomly selected Khmer fonts on text recognition accuracy. Result: Among the selected fonts, Khmer, Odor MeanChey, Siemreap, Sithi Manuss, and Battambang achieved high accuracy, while iSeth First, Bayon, and Dangrek performed poorly. Conclusion: The accuracy of Khmer OCR is significantly affected by font choice: fonts such as Khmer, Odor MeanChey, and Siemreap perform well, while iSeth First, Bayon, and Dangrek perform poorly. The study underscores the importance of font selection in optimizing Khmer text recognition and offers useful insights for developing more robust OCR systems. Abstract: Text recognition is significantly influenced by font types, especially for complex scripts like Khmer. The variety of Khmer fonts, each with its unique character structure, presents challenges for optical character recognition (OCR) systems. In this study, we evaluate the impact of 19 randomly selected Khmer font types on text recognition accuracy using Pytesseract. The fonts include Angkor, Battambang, Bayon, Bokor, Chenla, Dangrek, Freehand, Kh Kompong Chhnang, Kh SN Kampongsom, Khmer, Khmer CN Stueng Songke, Khmer Savuth Pen, Metal, Moul, Odor MeanChey, Preah Vihear, Siemreap, Sithi Manuss, and iSeth First. Our comparison of OCR performance across these fonts reveals that Khmer, Odor MeanChey, Siemreap, Sithi Manuss, and Battambang achieve high accuracy, while iSeth First, Bayon, and Dangrek perform poorly. This study underscores the critical importance of font selection in optimizing Khmer text recognition and provides valuable insights for developing more robust OCR systems.
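
The abstract does not say which accuracy metric was used. One common choice for OCR is character-level accuracy derived from edit distance, sketched below as an assumption rather than the authors' protocol. The `hypothesis` string stands in for OCR output (for example, text returned by Pytesseract); no OCR engine is called here, and the strings are illustrative.

```python
def levenshtein(a, b):
    """Edit distance (insertions, deletions, substitutions) between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def char_accuracy(reference, hypothesis):
    """1 - (edit distance / reference length), clipped at 0."""
    if not reference:
        raise ValueError("reference text must not be empty")
    return max(0.0, 1.0 - levenshtein(reference, hypothesis) / len(reference))


# Python strings are compared per Unicode code point, so Khmer
# combining signs count as separate characters in this sketch.
reference = "ខ្មែរ"
hypothesis = "ខ្មរ"  # hypothetical OCR output with one sign dropped
accuracy = char_accuracy(reference, hypothesis)
```

Averaging this score over the same text rendered in each font would give a per-font comparison of the kind the study reports.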

Boyue Xu,Ruichao Hou,Tongwei Ren,Gangshan Wu

Main category: cs.CV

TL;DR: This paper proposes a novel visual and memory dual adapter (VMDA) for multi-modal tracking, enhancing prompt learning through adaptive cue transfer and memory-driven temporal propagation, achieving strong performance on multiple tracking benchmarks.

Details

Motivation: Existing prompt-learning-based multi-modal trackers struggle to learn reliable prompts due to limited exploitation of critical cues in frequency and temporal domains. The motivation is to improve representation learning by effectively integrating cross-modal and temporal information. Method: The authors propose a visual and memory dual adapter (VMDA) with two key components: a visual adapter for transferring discriminative cues between modalities using frequency, spatial, and channel-wise features; and a memory adapter inspired by human memory mechanisms to store and propagate temporal information across video sequences. Result: Extensive experiments show that the proposed method achieves state-of-the-art results on various multi-modal tracking tasks, including RGB-Thermal, RGB-Depth, and RGB-Event tracking. Conclusion: The paper concludes that the proposed VMDA method achieves state-of-the-art performance in multi-modal tracking tasks, demonstrating its effectiveness and robustness. Abstract: Prompt-learning-based multi-modal trackers have achieved promising progress by employing lightweight visual adapters to incorporate auxiliary modality features into frozen foundation models. However, existing approaches often struggle to learn reliable prompts due to limited exploitation of critical cues across frequency and temporal domains. In this paper, we propose a novel visual and memory dual adapter (VMDA) to construct more robust and discriminative representations for multi-modal tracking. Specifically, we develop a simple but effective visual adapter that adaptively transfers discriminative cues from auxiliary modality to dominant modality by jointly modeling the frequency, spatial, and channel-wise features. 
Additionally, we design the memory adapter inspired by the human memory mechanism, which stores global temporal cues and performs dynamic update and retrieval operations to ensure the consistent propagation of reliable temporal information across video sequences. Extensive experiments demonstrate that our method achieves state-of-the-art performance on the various multi-modal tracking tasks, including RGB-Thermal, RGB-Depth, and RGB-Event tracking. Code and models are available at https://github.com/xuboyue1999/mmtrack.git.

[291] Toward Simple and Robust Contrastive Explanations for Image Classification by Leveraging Instance Similarity and Concept Relevance

Yuliia Kaidashova,Bettina Finzel,Ute Schmid

Main category: cs.CV

TL;DR: This paper explores contrastive explanations for image classification models by using concept relevance and instance embeddings. It finds that higher concept relevance leads to simpler explanations and varying robustness under image transformations.

Details

Motivation: To enhance interpretability of deep learning models by providing contrastive explanations that clarify why one class is preferred over another, addressing gaps in understanding model decision-making. Method: The approach leverages concept-based reasoning with fine-tuned deep learning models, computes contrasts based on instance embeddings, and evaluates explanation complexity across different relevance scores and image augmentations. Result: Higher concept relevance correlates with shorter, less complex explanations, while lower relevance leads to more complex ones. Explanations show varied robustness under image transformations like rotation and noise. Conclusion: The study highlights the potential for building more interpretable and robust AI systems through concept-based contrastive explanations, emphasizing the importance of relevance scoring in explanation quality. Abstract: Understanding why a classification model prefers one class over another for an input instance is the challenge of contrastive explanation. This work implements concept-based contrastive explanations for image classification by leveraging the similarity of instance embeddings and relevance of human-understandable concepts used by a fine-tuned deep learning model. Our approach extracts concepts with their relevance score, computes contrasts for similar instances, and evaluates the resulting contrastive explanations based on explanation complexity. Robustness is tested for different image augmentations. Two research questions are addressed: (1) whether explanation complexity varies across different relevance ranges, and (2) whether explanation complexity remains consistent under image augmentations such as rotation and noise. The results confirm that for our experiments higher concept relevance leads to shorter, less complex explanations, while lower relevance results in longer, more diffuse explanations. Additionally, explanations show varying degrees of robustness. 
The discussion of these findings offers insights into the potential of building more interpretable and robust AI systems.

[292] StyleDrive: Towards Driving-Style Aware Benchmarking of End-To-End Autonomous Driving

Ruiyang Hao,Bowen Jing,Haibao Yu,Zaiqing Nie

Main category: cs.CV

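TL;DR: This paper presents the first large-scale real-world dataset supporting personalized end-to-end autonomous driving and establishes a benchmark that fuses objective and subjective annotations, advancing human-centric autonomous driving research.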

Details

Motivation: Although personalization has been studied in traditional autonomous driving systems, it remains overlooked in end-to-end autonomous driving, even though user-aligned behavior is essential for trust, comfort, and wider adoption of autonomous driving technology. Method: A large-scale real-world dataset is constructed; objective preference annotations are extracted through behavioral distribution analysis and rule-based heuristics, subjective annotations are generated with a vision-language model, and final high-quality labels are obtained through human verification. Result: The work delivers the first large-scale real-world dataset covering diverse driving preferences, establishes a foundational platform for personalized E2EAD, and examines the effect of preference conditioning on existing models. Conclusion: The paper proposes a benchmark for evaluating personalized end-to-end autonomous driving (E2EAD) models and shows that incorporating personalized preferences yields vehicle behavior more aligned with human driving. Abstract: While personalization has been explored in traditional autonomous driving systems, it remains largely overlooked in end-to-end autonomous driving (E2EAD), despite its growing prominence. This gap is critical, as user-aligned behavior is essential for trust, comfort, and widespread adoption of autonomous vehicles. A core challenge is the lack of large-scale real-world datasets annotated with diverse and fine-grained driving preferences, hindering the development and evaluation of personalized E2EAD models. In this work, we present the first large-scale real-world dataset enriched with annotations capturing diverse driving preferences, establishing a foundation for personalization in E2EAD. We extract static environmental features from real-world road topology and infer dynamic contextual cues using a fine-tuned visual language model (VLM), enabling consistent and fine-grained scenario construction. Based on these scenarios, we derive objective preference annotations through behavioral distribution analysis and rule-based heuristics. To address the inherent subjectivity of driving style, we further employ the VLM to generate subjective annotations by jointly modeling scene semantics and driver behavior. Final high-quality labels are obtained through a human-in-the-loop verification process that fuses both perspectives. Building on this dataset, we propose the first benchmark for evaluating personalized E2EAD models. We assess several state-of-the-art models with and without preference conditioning, demonstrating that incorporating personalized preferences results in behavior more aligned with human driving.
Our work lays the foundation for personalized E2EAD by providing a standardized platform to systematically integrate human preferences into data-driven E2EAD systems, catalyzing future research in human-centric autonomy.

[293] Foundation Models for Zero-Shot Segmentation of Scientific Images without AI-Ready Data

Shubhabrata Mukherjee,Jack Lang,Obeen Kwon,Iryna Zenyuk,Valerie Brogden,Adam Weber,Daniela Ushizima

Main category: cs.CV

TL;DR: This paper proposes Zenesis, a no-code interactive platform, to address the challenges faced by zero-shot and prompt-based technologies when dealing with scarce scientific image sets.

Details

Motivation: Zero-shot and prompt-based technologies struggle with valuable yet scarce scientific image sets. We propose Zenesis to minimize barriers posed by data readiness for scientific images. Method: We develop lightweight multi-modal adaptation techniques that enable zero-shot operation on raw scientific data, along with human-in-the-loop refinement and heuristic-based temporal enhancement options. Result: Zenesis significantly outperforms baseline methods, achieving an average accuracy of 0.947, an Intersection over Union (IOU) of 0.858, and a Dice score of 0.923 for amorphous catalyst samples and accuracy of 0.987, an IOU of 0.857, and a Dice score of 0.923 for crystalline samples. Conclusion: Zenesis is a powerful tool for scientific applications, particularly in fields where high-quality annotated datasets are unavailable, accelerating accurate analysis of experimental imaging. Abstract: Zero-shot and prompt-based technologies capitalized on using frequently occurring images to transform visual reasoning tasks, which explains why such technologies struggle with valuable yet scarce scientific image sets. In this work, we propose Zenesis, a comprehensive no-code interactive platform designed to minimize barriers posed by data readiness for scientific images. We develop lightweight multi-modal adaptation techniques that enable zero-shot operation on raw scientific data, along with human-in-the-loop refinement and heuristic-based temporal enhancement options. We demonstrate the performance of our approach through comprehensive comparison and validation on challenging Focused Ion Beam Scanning Electron Microscopy (FIB-SEM) data of catalyst-loaded membranes. Zenesis significantly outperforms baseline methods, achieving an average accuracy of 0.947, an Intersection over Union (IOU) of 0.858, and a Dice score of 0.923 for amorphous catalyst samples and accuracy of 0.987, an IOU of 0.857, and a Dice score of 0.923 for crystalline samples. 
These results mark a substantial improvement over traditional methods like Otsu thresholding and even advanced models like Segment Anything Model (SAM) when used in isolation. Our results demonstrate that Zenesis is a powerful tool for scientific applications, particularly in fields where high-quality annotated datasets are unavailable, accelerating accurate analysis of experimental imaging.
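
For readers less familiar with the overlap metrics quoted above, the sketch below computes IoU and Dice for a pair of binary masks. It is a generic illustration on flattened toy masks, not the evaluation code used for Zenesis; the function name and inputs are assumptions.

```python
def iou_and_dice(pred, target):
    """Return (IoU, Dice) for two equal-length binary masks given as 0/1 sequences."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same number of pixels")
    tp = sum(1 for p, t in zip(pred, target) if p and t)      # both foreground
    fp = sum(1 for p, t in zip(pred, target) if p and not t)  # predicted only
    fn = sum(1 for p, t in zip(pred, target) if t and not p)  # missed foreground
    union = tp + fp + fn
    if union == 0:
        return 1.0, 1.0  # both masks empty: treat as perfect agreement
    return tp / union, 2 * tp / (2 * tp + fp + fn)


# Flattened toy masks (hypothetical): 2 overlapping pixels, 1 extra, 1 missed.
pred_mask = [1, 1, 1, 0, 0, 0]
true_mask = [0, 1, 1, 1, 0, 0]
iou, dice = iou_and_dice(pred_mask, true_mask)
```

For a single pair of masks the two metrics are tied by Dice = 2·IoU / (1 + IoU), so Dice is never lower than IoU; averages over many images need not satisfy this identity exactly.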

[294] A Survey on Vision-Language-Action Models for Autonomous Driving

Sicong Jiang,Zilin Huang,Kangan Qian,Ziang Luo,Tianze Zhu,Yang Zhong,Yihong Tang,Menglin Kong,Yunlong Wang,Siwen Jiao,Hao Ye,Zihao Sheng,Xin Zhao,Tuopu Wen,Zheng Fu,Sikai Chen,Kun Jiang,Diange Yang,Seongjin Choi,Lijun Sun

Main category: cs.CV

TL;DR: This paper surveys Vision-Language-Action (VLA) models for autonomous driving, analyzing their architecture, evolution, performance, and challenges while offering a consolidated reference for future research.

Details

Motivation: The motivation stems from the rapid development of MLLM and VLA paradigms, which promise to enhance autonomous driving systems by enabling high-level instruction interpretation and complex reasoning about traffic scenes. The fragmented literature necessitates a structured overview. Method: The authors conducted a comprehensive survey of over 20 representative VLA models in the autonomous driving domain, formalized architectural components, traced the evolution of these models, and evaluated datasets and benchmarks. Result: The survey provides a formalization of shared architectural building blocks, an evolution analysis from early explainer to reasoning-centric models, comparison of over 20 models, and consolidation of datasets and benchmarks. Challenges and future directions are also detailed. Conclusion: This survey concludes that VLA models offer significant potential for advancing autonomous vehicles by integrating visual perception, language understanding, and decision-making capabilities. It identifies open challenges such as robustness, real-time efficiency, and formal verification. Abstract: The rapid progress of multimodal large language models (MLLM) has paved the way for Vision-Language-Action (VLA) paradigms, which integrate visual perception, natural language understanding, and control within a single policy. Researchers in autonomous driving are actively adapting these methods to the vehicle domain. Such models promise autonomous vehicles that can interpret high-level instructions, reason about complex traffic scenes, and make their own decisions. However, the literature remains fragmented and is rapidly expanding. This survey offers the first comprehensive overview of VLA for Autonomous Driving (VLA4AD). 
We (i) formalize the architectural building blocks shared across recent work, (ii) trace the evolution from early explainer to reasoning-centric VLA models, and (iii) compare over 20 representative models according to VLA's progress in the autonomous driving domain. We also consolidate existing datasets and benchmarks, highlighting protocols that jointly measure driving safety, accuracy, and explanation quality. Finally, we detail open challenges - robustness, real-time efficiency, and formal verification - and outline future directions of VLA4AD. This survey provides a concise yet complete reference for advancing interpretable socially aligned autonomous vehicles. The GitHub repo is available at https://github.com/JohnsonJiang1996/Awesome-VLA4AD.

[295] Continual Adaptation: Environment-Conditional Parameter Generation for Object Detection in Dynamic Scenarios

Deng Li,Aming Wu,Yang Li,Yaowei Wang,Yahong Han

Main category: cs.CV

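TL;DR: This paper proposes a new continual test-time adaptation mechanism that uses feature disentanglement, parameter generation, and alignment to improve the generalization of object detectors in changing environments.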

Details

Motivation: Changing environments challenge object detectors trained under a closed-set assumption, so a new mechanism is needed to improve their generalization. Method: A dual-path LoRA-based domain-aware adapter is designed to disentangle features, together with a conditional diffusion-based parameter generation mechanism and a class-centered optimal transport alignment method to improve adaptability and prevent catastrophic forgetting. Result: The proposed mechanism is shown to be effective on multiple continuous domain adaptive object detection tasks, and visualizations demonstrate its feature extraction ability. Conclusion: Experiments indicate that the method generalizes and performs well on continuous domain adaptive object detection tasks, and that the extracted features capture more object-related information. Abstract: In practice, environments constantly change over time and space, posing significant challenges for object detectors trained based on a closed-set assumption, i.e., training and test data share the same distribution. To this end, continual test-time adaptation has attracted much attention, aiming to improve detectors' generalization by fine-tuning a few specific parameters, e.g., BatchNorm layers. However, based on a small number of test images, fine-tuning certain parameters may affect the representation ability of other fixed parameters, leading to performance degradation. Instead, we explore a new mechanism, i.e., converting the fine-tuning process to a specific-parameter generation. Particularly, we first design a dual-path LoRA-based domain-aware adapter that disentangles features into domain-invariant and domain-specific components, enabling efficient adaptation. Additionally, a conditional diffusion-based parameter generation mechanism is presented to synthesize the adapter's parameters based on the current environment, preventing the optimization from getting stuck in local optima. Finally, we propose a class-centered optimal transport alignment method to mitigate catastrophic forgetting. Extensive experiments conducted on various continuous domain adaptive object detection tasks demonstrate the effectiveness. Meanwhile, visualization results show that the representation extracted by the generated parameters can capture more object-related information and strengthen the generalization ability.

[296] Imagine for Me: Creative Conceptual Blending of Real Images and Text via Blended Attention

Wonwoong Cho,Yanxia Zhang,Yan-Ying Chen,David I. Inouye

Main category: cs.CV

TL;DR: IT-Blender automates blending of visual and textual concepts using diffusion models, improving creative outcomes.

Details

Motivation: Cross-modal conceptual blending in humans is limited by cognitive biases like design fixation, which restricts exploration in the design space. Method: IT-Blender uses pretrained diffusion models (SD and FLUX) with blended attention to encode and blend visual and textual concepts in a disentangled manner. Result: IT-Blender outperforms baselines significantly in blending visual and textual concepts without losing details. Conclusion: IT-Blender provides a method to automate cross-modal conceptual blending, enhancing human creativity effectively. Abstract: Blending visual and textual concepts into a new visual concept is a unique and powerful trait of human beings that can fuel creativity. However, in practice, cross-modal conceptual blending for humans is prone to cognitive biases, like design fixation, which leads to local minima in the design space. In this paper, we propose a T2I diffusion adapter "IT-Blender" that can automate the blending process to enhance human creativity. Prior works related to cross-modal conceptual blending are limited in encoding a real image without loss of details or in disentangling the image and text inputs. To address these gaps, IT-Blender leverages pretrained diffusion models (SD and FLUX) to blend the latent representations of a clean reference image with those of the noisy generated image. Combined with our novel blended attention, IT-Blender encodes the real reference image without loss of details and blends the visual concept with the object specified by the text in a disentangled way. Our experiment results show that IT-Blender outperforms the baselines by a large margin in blending visual and textual concepts, shedding light on the new application of image generative models to augment human creativity.

[297] WaRA: Wavelet Low Rank Adaptation

Moein Heidari,Yasamin Medghalchi,Mahdi Khoursha,Reza Rezaeian,Ilker Hacihaliloglu

Main category: cs.CV

TL;DR: WaRA improves parameter-efficient fine-tuning by leveraging wavelet-based multi-resolution analysis, achieving better performance with reduced computational complexity.

Details

Motivation: Existing PEFT techniques like LoRA overlook local or multi-scale structures, limiting their ability to capture complex patterns. Method: Wavelet transforms are used to decompose the weight update matrix, enabling low-rank factorization and compressed adaptation parameters. Result: WaRA outperforms existing methods in vision tasks such as image generation and classification, while also proving effective in language tasks. Conclusion: WaRA is a novel PEFT method that effectively captures multi-resolution features and demonstrates superior performance in both vision and language tasks. Abstract: Parameter-efficient fine-tuning (PEFT) has gained widespread adoption across various applications. Among PEFT techniques, Low-Rank Adaptation (LoRA) and its extensions have emerged as particularly effective, allowing efficient model adaptation while significantly reducing computational overhead. However, existing approaches typically rely on global low-rank factorizations, which overlook local or multi-scale structure, failing to capture complex patterns in the weight updates. To address this, we propose WaRA, a novel PEFT method that leverages wavelet transforms to decompose the weight update matrix into a multi-resolution representation. By performing low-rank factorization in the wavelet domain and reconstructing updates through an inverse transform, WaRA obtains compressed adaptation parameters that harness multi-resolution analysis, enabling it to capture both coarse and fine-grained features while providing greater flexibility and sparser representations than standard LoRA. Through comprehensive experiments and analysis, we demonstrate that WaRA performs superior on diverse vision tasks, including image generation, classification, and semantic segmentation, significantly enhancing generated image quality while reducing computational complexity. 
Although WaRA was primarily designed for vision tasks, we further showcase its effectiveness in language tasks, highlighting its broader applicability and generalizability. The code is publicly available on GitHub at https://github.com/moeinheidari7829/WaRA.
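
To make the idea of low-rank factorization in the wavelet domain concrete, here is a minimal sketch. It is not the authors' implementation: it assumes a single-level orthonormal Haar transform, and the names `B`, `A`, and `wavelet_domain_update` are illustrative. The low-rank product `B @ A` is treated as wavelet coefficients, and the inverse transform maps it back to a weight update.

```python
import math

SQRT2 = math.sqrt(2.0)

def haar_1d(x):
    """Single-level orthonormal Haar transform of an even-length list."""
    half = len(x) // 2
    approx = [(x[2 * i] + x[2 * i + 1]) / SQRT2 for i in range(half)]
    detail = [(x[2 * i] - x[2 * i + 1]) / SQRT2 for i in range(half)]
    return approx + detail

def inverse_haar_1d(c):
    """Inverse of haar_1d: rebuild the signal from approx + detail halves."""
    half = len(c) // 2
    x = []
    for a, d in zip(c[:half], c[half:]):
        x.extend([(a + d) / SQRT2, (a - d) / SQRT2])
    return x

def transpose(m):
    return [list(col) for col in zip(*m)]

def haar_2d(m):
    """Apply the 1D transform to every row, then to every column."""
    rows = [haar_1d(r) for r in m]
    return transpose([haar_1d(c) for c in transpose(rows)])

def inverse_haar_2d(c):
    """Undo the column transform first, then the row transform."""
    cols_undone = transpose([inverse_haar_1d(col) for col in transpose(c)])
    return [inverse_haar_1d(r) for r in cols_undone]

def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def wavelet_domain_update(B, A):
    """Assumed form of the update: delta_W = inverse_transform(B @ A)."""
    return inverse_haar_2d(matmul(B, A))

# Rank-1 example: B is 4x1 and A is 1x4, so only 8 numbers are trainable
# instead of the 16 entries of the full 4x4 update.
B = [[1.0], [0.5], [0.0], [-1.0]]
A = [[2.0, 0.0, 1.0, -1.0]]
delta_W = wavelet_domain_update(B, A)
```

Because the Haar transform used here is orthonormal, applying `haar_2d` to `delta_W` recovers `B @ A` up to floating-point error. This confirms that the update is low-rank in the wavelet domain, while its spatial-domain form need not be.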

[298] MILo: Mesh-In-the-Loop Gaussian Splatting for Detailed and Efficient Surface Reconstruction

Antoine Guédon,Diego Gomez,Nissim Maruani,Bingchen Gong,George Drettakis,Maks Ovsjanikov

Main category: cs.CV

TL;DR: This paper proposes MILo, a novel Gaussian Splatting framework that efficiently extracts lightweight, high-quality surface meshes from 3D Gaussians by introducing a differentiable mesh construction process during training.

Details

Motivation: Extracting accurate surface meshes from Gaussian Splatting remains challenging due to costly post-processing steps that lead to loss of fine geometric details, excessive time consumption, and overly dense meshes. Additionally, converting from volumetric to surface representation limits preservation of all captured geometric structures. Method: The paper introduces MILo, which uses a fully differentiable procedure to construct a mesh—including vertex locations and connectivity—from Gaussian parameters at every training iteration. It incorporates three key contributions: bidirectional consistency framework, adaptive mesh extraction using Gaussians as differentiable pivots for Delaunay triangulation, and a new method for computing signed distance values from Gaussians to enable precise surface extraction. Result: MILo reconstructs complete scenes with state-of-the-art quality while requiring an order of magnitude fewer mesh vertices than prior approaches. The resulting lightweight meshes preserve fine geometric details and are well suited for downstream applications such as physics simulations and animation. Conclusion: MILo effectively bridges the gap between volumetric and surface representations by differentiably extracting a mesh directly from 3D Gaussians, resulting in high-quality reconstructions with significantly fewer mesh vertices compared to previous methods. The lightweight meshes produced are ideal for applications like physics simulations or animation. Abstract: While recent advances in Gaussian Splatting have enabled fast reconstruction of high-quality 3D scenes from images, extracting accurate surface meshes remains a challenge. Current approaches extract the surface through costly post-processing steps, resulting in the loss of fine geometric details or requiring significant time and leading to very dense meshes with millions of vertices. 
More fundamentally, the a posteriori conversion from a volumetric to a surface representation limits the ability of the final mesh to preserve all geometric structures captured during training. We present MILo, a novel Gaussian Splatting framework that bridges the gap between volumetric and surface representations by differentiably extracting a mesh from the 3D Gaussians. We design a fully differentiable procedure that constructs the mesh (including both vertex locations and connectivity) at every iteration directly from the parameters of the Gaussians, which are the only quantities optimized during training. Our method introduces three key technical contributions: a bidirectional consistency framework ensuring both representations (Gaussians and the extracted mesh) capture the same underlying geometry during training; an adaptive mesh extraction process performed at each training iteration, which uses Gaussians as differentiable pivots for Delaunay triangulation; a novel method for computing signed distance values from the 3D Gaussians that enables precise surface extraction while avoiding geometric erosion. Our approach can reconstruct complete scenes, including backgrounds, with state-of-the-art quality while requiring an order of magnitude fewer mesh vertices than previous methods. Due to their light weight and empty interior, our meshes are well suited for downstream applications such as physics simulations or animation.

[299] DenseWorld-1M: Towards Detailed Dense Grounded Caption in the Real World

Xiangtai Li,Tao Zhang,Yanwei Li,Haobo Yuan,Shihao Chen,Yikang Zhou,Jiahao Meng,Yueyi Sun,Shilin Xu,Lu Qi,Tianheng Cheng,Yi Lin,Zilong Huang,Wenhao Huang,Jiashi Feng,Guang Shi

Main category: cs.CV

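TL;DR: DenseWorld-1M is the first large-scale, detailed, dense grounded caption dataset for real-world images. It is built with a three-stage labeling pipeline and addresses the lack of locations and relations for visual entities in existing datasets.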

Details

Motivation: Multimodal large language models depend on large-scale, high-quality datasets, but current caption datasets fall short in detailed descriptions, relation expressions, and object descriptions on high-resolution images, so a new dataset is needed to fill this gap. Method: A three-stage labeling pipeline is proposed: the first stage obtains entity-level masks and labels; the second stage generates detailed object-level captions guided by the first-stage results; the third stage merges object captions and masks into spatially and relationally dense captions. Two VLM models are also proposed to accelerate labeling and improve quality: the Detailed Region Caption model and the Spatial Caption Merging model. Result: Experiments across various settings, including vision-language understanding, visual grounding, and region caption generation, show that the DenseWorld-1M dataset and the associated models perform well, validating their effectiveness. Conclusion: DenseWorld-1M provides the community with a new, high-quality, large-scale image caption dataset, and its efficient labeling pipeline and models improve performance on multimodal tasks. Abstract: Multimodal Large Language Models (MLLMs) demonstrate a complex understanding of scenes, benefiting from large-scale and high-quality datasets. Most existing caption datasets lack the ground locations and relations for visual entities. Several grounded caption datasets face the problems of missing detailed descriptions, relations, and massive object descriptions on high-resolution images. To fill this gap for the community, we present DenseWorld-1M, the first massive, detailed, dense grounded caption dataset in the real world. We design a three-stage labeling pipeline, containing open-world perception, detailed object caption generation, and dense caption merging. The first stage obtains entity-level masks and labels. The second stage generates the object-level, detailed captions with the guidance of masks and labels from the first stage. The final stage merges object captions and masks into spatial and relational dense captions. To accelerate the labeling process and improve caption quality, we present two VLM models: the Detailed Region Caption model and the Spatial Caption Merging model. Extensive experiments on various settings, including vision-language understanding, visual grounding, and region caption generation, demonstrate the effectiveness of our DenseWorld-1M dataset and labeling models.

[300] Epona: Autoregressive Diffusion World Model for Autonomous Driving

Kaiwen Zhang,Zhenyu Tang,Xiaotao Hu,Xingang Pan,Xiaoyang Guo,Yuan Liu,Jingwei Huang,Li Yuan,Qian Zhang,Xiao-Xiao Long,Xun Cao,Wei Yin

Main category: cs.CV

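TL;DR: This paper proposes Epona, an autoregressive diffusion world model that addresses long-horizon video prediction and trajectory planning for autonomous driving and achieves strong performance.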

Details

Motivation: Existing video diffusion-based world models struggle with flexible-length, long-horizon prediction and with integrating trajectory planning, because they rely on global joint distribution modeling of fixed-length frame sequences. Method: An autoregressive diffusion model (Epona) is proposed that uses decoupled spatiotemporal factorization and modular trajectory and video prediction, and introduces a chain-of-forward training strategy to address error accumulation in autoregressive loops. Result: Experiments show that Epona improves FVD by 7.4% over existing methods and supports longer-duration prediction. It also serves as a capable real-time motion planner, outperforming strong end-to-end planners on the NAVSIM benchmarks. Conclusion: Through decoupled spatiotemporal factorization and modular trajectory and video prediction, Epona achieves high-quality, long-duration world modeling for autonomous driving and real-time motion planning. Abstract: Diffusion models have demonstrated exceptional visual quality in video generation, making them promising for autonomous driving world modeling. However, existing video diffusion-based world models struggle with flexible-length, long-horizon predictions and integrating trajectory planning. This is because conventional video diffusion models rely on global joint distribution modeling of fixed-length frame sequences rather than sequentially constructing localized distributions at each timestep. In this work, we propose Epona, an autoregressive diffusion world model that enables localized spatiotemporal distribution modeling through two key innovations: 1) Decoupled spatiotemporal factorization that separates temporal dynamics modeling from fine-grained future world generation, and 2) Modular trajectory and video prediction that seamlessly integrate motion planning with visual modeling in an end-to-end framework. Our architecture enables high-resolution, long-duration generation while introducing a novel chain-of-forward training strategy to address error accumulation in autoregressive loops. Experimental results demonstrate state-of-the-art performance with 7.4% FVD improvement and minutes longer prediction duration compared to prior works. The learned world model further serves as a real-time motion planner, outperforming strong end-to-end planners on NAVSIM benchmarks. Code will be publicly available at https://github.com/Kevin-thu/Epona/.
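
The contrast between fixed-length joint modeling and step-by-step prediction can be illustrated with a generic autoregressive rollout. This is a minimal sketch and not Epona's architecture: `toy_predictor` is a hypothetical stand-in for the learned model, scalars stand in for frames, and the `window` parameter is an assumption. It only shows that a predictor conditioned on recent history can be unrolled to any horizon.

```python
from collections import deque

def rollout(predict_next, context, horizon, window=3):
    """Generate `horizon` future steps, each conditioned on the last `window` steps."""
    history = deque(context, maxlen=window)  # oldest entries fall out automatically
    outputs = []
    for _ in range(horizon):
        nxt = predict_next(list(history))
        outputs.append(nxt)
        history.append(nxt)  # the prediction becomes part of the next step's context
    return outputs

def toy_predictor(history):
    """Hypothetical stand-in for a learned model: linear extrapolation."""
    if len(history) < 2:
        return history[-1]
    return 2 * history[-1] - history[-2]

# The same predictor serves any horizon without retraining for a fixed length.
short = rollout(toy_predictor, [0.0, 1.0], horizon=2)
longer = rollout(toy_predictor, [0.0, 1.0], horizon=4)
```

Each prediction is fed back as input, so any error in one step is carried into later steps. This is the error accumulation in autoregressive loops that the abstract's chain-of-forward training strategy is said to address.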

[301] TextMesh4D: High-Quality Text-to-4D Mesh Generation

Sisi Dai,Xinxin Su,Boyan Wan,Ruizhen Hu,Kai Xu

Main category: cs.CV

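TL;DR: This paper proposes TextMesh4D, a new framework for high-quality text-to-4D generation that uses per-face Jacobians as a differentiable mesh representation and improves performance through a two-stage strategy and a new regularization term.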

Details

Motivation: Although diffusion generative models have advanced image, video, and 3D content creation, diffusion-guided dynamic 3D content generation (text-to-4D) remains a largely unexplored challenge. Method: Per-face Jacobians serve as a differentiable mesh representation; 4D generation is decomposed into two stages, static object creation and dynamic motion synthesis; and a flexibility-rigidity regularization term is introduced to stabilize optimization. Result: Experiments show that TextMesh4D achieves state-of-the-art temporal consistency, structural fidelity, and visual realism, and runs on a single 24GB GPU with low GPU memory overhead. Conclusion: TextMesh4D offers an innovative, cost-effective approach to text-driven 4D mesh generation and achieves state-of-the-art results in temporal consistency, structural fidelity, and visual realism. Abstract: Recent advancements in diffusion generative models significantly advanced image, video, and 3D content creation from user-provided text prompts. However, the challenging problem of dynamic 3D content generation (text-to-4D) with diffusion guidance remains largely unexplored. In this paper, we introduce TextMesh4D, a novel framework for high-quality text-to-4D generation. Our approach leverages per-face Jacobians as a differentiable mesh representation and decomposes 4D generation into two stages: static object creation and dynamic motion synthesis. We further propose a flexibility-rigidity regularization term to stabilize Jacobian optimization under video diffusion priors, ensuring robust geometric performance. Experiments demonstrate that TextMesh4D achieves state-of-the-art results in terms of temporal consistency, structural fidelity, and visual realism. Moreover, TextMesh4D operates with a low GPU memory overhead, requiring only a single 24GB GPU, offering a cost-effective yet high-quality solution for text-driven 4D mesh generation. The code will be released to facilitate future research in text-to-4D generation.

[302] Calligrapher: Freestyle Text Image Customization

Yue Ma,Qingyan Bai,Hao Ouyang,Ka Leong Cheng,Qiuyu Wang,Hongyu Liu,Zichen Liu,Haofan Wang,Jingye Chen,Yujun Shen,Qifeng Chen

Main category: cs.CV

TL;DR: Calligrapher is a new framework for digital calligraphy that improves style control and typographic customization using a diffusion-based approach with innovative technical contributions.

Details

Motivation: The motivation is to overcome the challenges of precise style control and data dependency in typographic customization while integrating advanced text customization with artistic typography for digital calligraphy and design applications. Method: Calligrapher uses a diffusion-based framework with a self-distillation mechanism, a localized style injection framework, and an in-context generation mechanism to extract style features and embed reference images into the denoising process. Result: Extensive evaluations show that Calligrapher accurately reproduces intricate stylistic details and achieves precise glyph positioning across diverse fonts and design contexts. Conclusion: Calligrapher successfully addresses the challenges of style control and data dependency in typographic customization, surpassing traditional models by automating high-quality, visually consistent typography. Abstract: We introduce Calligrapher, a novel diffusion-based framework that innovatively integrates advanced text customization with artistic typography for digital calligraphy and design applications. Addressing the challenges of precise style control and data dependency in typographic customization, our framework incorporates three key technical contributions. First, we develop a self-distillation mechanism that leverages the pre-trained text-to-image generative model itself alongside the large language model to automatically construct a style-centric typography benchmark. Second, we introduce a localized style injection framework via a trainable style encoder, which comprises both Qformer and linear layers, to extract robust style features from reference images. An in-context generation mechanism is also employed to directly embed reference images into the denoising process, further enhancing the refined alignment of target styles. 
Extensive quantitative and qualitative evaluations across diverse fonts and design contexts confirm Calligrapher's accurate reproduction of intricate stylistic details and precise glyph positioning. By automating high-quality, visually consistent typography, Calligrapher surpasses traditional models, empowering creative practitioners in digital art, branding, and contextual typographic design.

[303] FADRM: Fast and Accurate Data Residual Matching for Dataset Distillation

Jiacheng Cui,Xinyue Bi,Yaxin Luo,Xiaohan Zhao,Jiacheng Liu,Zhiqiang Shen

Main category: cs.CV

TL;DR: FADRM: A new state-of-the-art method for dataset distillation that improves efficiency and accuracy using Data Residual Matching.

Details

Motivation: The motivation is to explore the potential of residual connection in data-centric approaches, specifically for dataset distillation tasks. Method: FADRM introduces Data Residual Matching, leveraging data-level skip connections and optimization-level refinements to improve computational efficiency. Result: FADRM achieves 47.7% test accuracy in single-model dataset distillation and 50.0% in multi-model dataset distillation, surpassing existing methods. Conclusion: The proposed FADRM method establishes a new state-of-the-art for dataset distillation, demonstrating substantial improvements in both efficiency and effectiveness. Abstract: Residual connection has been extensively studied and widely applied at the model architecture level. However, its potential in the more challenging data-centric approaches remains unexplored. In this work, we introduce the concept of Data Residual Matching for the first time, leveraging data-level skip connections to facilitate data generation and mitigate data information vanishing. This approach maintains a balance between newly acquired knowledge through pixel space optimization and existing core local information identification within raw data modalities, specifically for the dataset distillation task. Furthermore, by incorporating optimization-level refinements, our method significantly improves computational efficiency, achieving superior performance while reducing training time and peak GPU memory usage by 50%. Consequently, the proposed method Fast and Accurate Data Residual Matching for Dataset Distillation (FADRM) establishes a new state-of-the-art, demonstrating substantial improvements over existing methods across multiple dataset benchmarks in both efficiency and effectiveness. 
For instance, with ResNet-18 as the student model and a 0.8% compression ratio on ImageNet-1K, the method achieves 47.7% test accuracy in single-model dataset distillation and 50.0% in multi-model dataset distillation, surpassing RDED by +5.7% and outperforming state-of-the-art multi-model approaches, EDC and CV-DD, by +1.4% and +4.0%. Code is available at: https://github.com/Jiacheng8/FADRM.

[304] How to Design and Train Your Implicit Neural Representation for Video Compression

Matthew Gwilliam,Roy Zhang,Namitha Padmanabhan,Hongyang Du,Abhinav Shrivastava

Main category: cs.CV

TL;DR: This paper introduces Rabbit NeRV (RNeRV), a state-of-the-art method for video compression using implicit neural representations (INR), achieving better performance than existing approaches. It also explores hyper-networks to improve encoding speed and compression quality.

Details

Motivation: Implicit neural representation (INR) methods for video compression offer competitive visual quality and compression ratios but suffer from slow encoding speeds due to per-sample network training. This work aims to address these limitations by optimizing INR design and exploring hyper-networks for faster, practical adoption. Method: The authors developed a library to analyze components of the NeRV family for video compression using INR. They propose Rabbit NeRV (RNeRV), which optimizes these components for improved performance, and introduce masking of INR weights during training with hyper-networks to enhance compression quality and enable faster encoding. Result: RNeRV achieves an average improvement of +1.27% PSNR over the best-performing alternatives when given equal training time on UVG videos at 1080p. Using hyper-networks with weight masking leads to 1.7% improvements in PSNR and MS-SSIM at 0.037 bpp on UCF-101. Increasing hyper-network parameters results in further gains of 2.5%/2.7% in PSNR/MS-SSIM while maintaining similar compression speed. Conclusion: Rabbit NeRV (RNeRV) is a state-of-the-art configuration for video INR design that achieves better PSNR compared to existing methods, and the use of hyper-networks with masked weight prediction improves compression quality and allows for real-time encoding. Abstract: Implicit neural representation (INR) methods for video compression have recently achieved visual quality and compression ratios that are competitive with traditional pipelines. However, due to the need for per-sample network training, the encoding speeds of these methods are too slow for practical adoption. We develop a library to allow us to disentangle and review the components of methods from the NeRV family, reframing their performance in terms of not only size-quality trade-offs, but also impacts on training time. 
We uncover principles for effective video INR design and propose a state-of-the-art configuration of these components, Rabbit NeRV (RNeRV). When all methods are given equal training time (equivalent to 300 NeRV epochs) for 7 different UVG videos at 1080p, RNeRV achieves +1.27% PSNR on average compared to the best-performing alternative for each video in our NeRV library. We then tackle the encoding speed issue head-on by investigating the viability of hyper-networks, which predict INR weights from video inputs, to disentangle training from encoding to allow for real-time encoding. We propose masking the weights of the predicted INR during training to allow for variable, higher quality compression, resulting in 1.7% improvements to both PSNR and MS-SSIM at 0.037 bpp on the UCF-101 dataset, and we increase hyper-network parameters by 0.4% for 2.5%/2.7% improvements to PSNR/MS-SSIM with equal bpp and similar speeds. Our project website is available at https://mgwillia.github.io/vinrb/ and our code is available at https://github.com/mgwillia/vinrb.
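
For readers unfamiliar with the metrics quoted above, the following minimal sketch shows the standard definitions of PSNR and bits per pixel (bpp). It is not the authors' evaluation code: the function names, the flat pixel lists, and the 8-bit `max_value` default are assumptions made for illustration.

```python
import math

def psnr(reference, reconstruction, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equal-length pixel sequences."""
    if not reference or len(reference) != len(reconstruction):
        raise ValueError("inputs must be non-empty and of equal length")
    mse = sum((r - x) ** 2 for r, x in zip(reference, reconstruction)) / len(reference)
    if mse == 0:
        return math.inf  # identical signals: PSNR is unbounded
    return 10.0 * math.log10(max_value ** 2 / mse)

def bits_per_pixel(total_bits, num_frames, height, width):
    """Compressed size in bits divided by the number of pixels in the video."""
    return total_bits / (num_frames * height * width)

# Example with made-up values: a 4-pixel "frame" reconstructed with small errors.
original = [10, 20, 30, 40]
decoded = [11, 19, 30, 42]
quality_db = psnr(original, decoded)
rate = bits_per_pixel(total_bits=800, num_frames=2, height=10, width=10)
```

Higher PSNR at the same bpp means better reconstruction quality for the same compressed size, which is the size-quality trade-off the abstract refers to.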

Table of Contents

cs.CL [Back]

[1] Psycholinguistic Word Features: a New Approach for the Evaluation of LLMs Alignment with Humans

[2] AI Agents-as-Judge: Automated Assessment of Accuracy, Consistency, Completeness and Clarity for Enterprise Documents

[3] Hallucination Detection with Small Language Models

[4] PromptAug: Fine-grained Conflict Classification Using Data Augmentation

[5] AgentStealth: Reinforcing Large Language Model for Anonymizing User-generated Text

[6] Towards Text-free Graph Foundation Models: Rethinking Multi-Domain Graph Contrastive Learning

[7] Can "consciousness" be observed from large language model (LLM) internal states? Dissecting LLM representations obtained from Theory of Mind test with Integrated Information Theory and Span Representation analysis

[8] Weak-to-Strong GraphRAG: Aligning Weak Retrievers with Large Language Models for Graph-based Retrieval Augmented Generation

[9] MisinfoTeleGraph: Network-driven Misinformation Detection for German Telegram Messages

[10] RExBench: Can coding agents autonomously implement AI research extensions?

[11] Temperature Matters: Enhancing Watermark Robustness Against Paraphrasing Attacks

[12] Evaluating Hybrid Retrieval Augmented Generation using Dynamic Test Sets: LiveRAG Challenge

[13] Assessing the feasibility of Large Language Models for detecting micro-behaviors in team interactions during space missions

[14] VOCABTRIM: Vocabulary Pruning for Efficient Speculative Decoding in LLMs

[15] Text Production and Comprehension by Human and Artificial Intelligence: Interdisciplinary Workshop Report

[16] The Translation Barrier Hypothesis: Multilingual Generation with Large Language Models Suffers from Implicit Translation Failure

[17] Jan-nano Technical Report

[18] Teaching Models to Verbalize Reward Hacking in Chain-of-Thought Reasoning

[19] ContextCache: Context-Aware Semantic Cache for Multi-Turn Queries in Large Language Models

[20] MedEthicsQA: A Comprehensive Question Answering Benchmark for Medical Ethics Evaluation of LLMs

[21] Selecting and Merging: Towards Adaptable and Scalable Named Entity Recognition with Large Language Models

[22] Boosting CTC-Based ASR Using LLM-Based Intermediate Loss Regularization

[23] Knowledge Augmented Finetuning Matters in both RAG and Agent Based Dialog Systems

[24] DICE-BENCH: Evaluating the Tool-Use Capabilities of Large Language Models in Multi-Round, Multi-Party Dialogues

[25] Mind the Gap: Entity-Preserved Context-Aware ASR Structured Transcriptions

[26] Agent-to-Agent Theory of Mind: Testing Interlocutor Awareness among Large Language Models

[27] On the Generalizability of "Competition of Mechanisms: Tracing How Language Models Handle Facts and Counterfactuals"

[28] A Systematic Study of Compositional Syntactic Transformer Language Models

[29] SoMi-ToM: Evaluating Multi-Perspective Theory of Mind in Embodied Social Interactions

[30] MariNER: A Dataset for Historical Brazilian Portuguese Named Entity Recognition

[31] Boosting LLM's Molecular Structure Elucidation with Knowledge Enhanced Tree Search Reasoning

[32] Text2VectorSQL: Bridging Text-to-SQL and Vector Search for Unified Natural Language Queries

[33] From Individuals to Interactions: Benchmarking Gender Bias in Multimodal Large Language Models from the Lens of Social Relationship

[34] FairI Tales: Evaluation of Fairness in Indian Contexts with a Focus on Bias and Stereotypes

[35] Decoding Memes: Benchmarking Narrative Role Classification across Multilingual and Multimodal Models

[36] Unleashing Embodied Task Planning Ability in LLMs via Reinforcement Learning

[37] Format-Adapter: Improving Reasoning Capability of LLMs by Adapting Suitable Format

[38] LLM-Assisted Question-Answering on Technical Documents Using Structured Data-Aware Retrieval Augmented Generation

[39] Flow-Modulated Scoring for Semantic-Aware Knowledge Graph Completion

[40] Benchmarking Deep Search over Heterogeneous Enterprise Data

[41] Learning-to-Context Slope: Evaluating In-Context Learning Effectiveness Beyond Performance Illusions

[42] V-SYNTHESIS: Task-Agnostic Synthesis of Consistent and Diverse In-Context Demonstrations from Scratch via V-Entropy

[43] RiverText: A Python Library for Training and Evaluating Incremental Word Embeddings from Text Data Streams

[44] Generalist Reward Models: Found Inside Large Language Models

[45] Two Spelling Normalization Approaches Based on Large Language Models

[46] Objective-Free Local Learning and Emergent Language Structure in Thinking Machines

[47] Ensemble BERT for Medication Event Classification on Electronic Health Records (EHRs)

[48] Information Loss in LLMs' Multilingual Translation: The Role of Training Data, Language Proximity, and Language Family

[49] ATGen: A Framework for Active Text Generation

[50] Perspective Dial: Measuring Perspective of Text and Guiding LLM Outputs

[51] Hierarchical Memory Organization for Wikipedia Generation

[52] Datasets for Fairness in Language Models: An In-Depth Survey

[53] TuCo: Measuring the Contribution of Fine-Tuning to Individual Responses of LLMs

[54] Pipelined Decoder for Efficient Context-Aware Text Generation

[55] What to Keep and What to Drop: Adaptive Table Filtering Framework

[56] Thought-Augmented Planning for LLM-Powered Interactive Recommender Agent

[57] Reinforcement Fine-Tuning Enables MLLMs Learning Novel Tasks Stably

[58] NEU-ESC: A Comprehensive Vietnamese dataset for Educational Sentiment analysis and topic Classification toward multitask learning

[59] On Recipe Memorization and Creativity in Large Language Models: Is Your Model a Creative Cook, a Bad Cook, or Merely a Plagiator?

[60] Semantic-guided Diverse Decoding for Large Language Model

[61] Evaluating the Simulation of Human Personality-Driven Susceptibility to Misinformation with LLMs

[62] Robustness of Misinformation Classification Systems to Adversarial Examples Through BeamAttack

[63] Zero-Shot Contextual Embeddings via Offline Synthetic Corpus Generation

[64] L0: Reinforcement Learning to Become General Agents

[65] AutoEvoEval: An Automated Framework for Evolving Close-Ended LLM Evaluation Data

[66] Positional Bias in Binary Question Answering: How Uncertainty Shapes Model Preferences

[67] Do Thinking Tokens Help or Trap? Towards More Efficient Large Reasoning Model

[68] Garbage In, Reasoning Out? Why Benchmark Scores are Unreliable and What to Do About It

[69] Advancing Multi-Step Mathematical Reasoning in Large Language Models through Multi-Layered Self-Reflection with Auto-Prompting

[70] The Trilemma of Truth in Large Language Models

[71] IMPACT: Inflectional Morphology Probes Across Complex Typologies

[72] Leveraging the Potential of Prompt Engineering for Hate Speech Detection in Low-Resource Languages

[73] Graft: Integrating the Domain Knowledge via Efficient Parameter Synergy for MLLMs

[74] Unveiling Decision-Making in LLMs for Text Classification : Extraction of influential and interpretable concepts with Sparse Autoencoders

[75] TaP: A Taxonomy-Guided Framework for Automated and Scalable Preference Data Generation

[76] Machine Understanding of Scientific Language

[77] Auto-TA: Towards Scalable Automated Thematic Analysis (TA) via Multi-Agent Large Language Models with Reinforcement Learning

[78] Large Language Models Don't Make Sense of Word Problems. A Scoping Review from a Mathematics Education Perspective