cs.CL

[1] ParallelSearch: Train your LLMs to Decompose Query and Search Sub-queries in Parallel with Reinforcement Learning

Shu Zhao,Tan Yu,Anbang Xu,Japinder Singh,Aaditya Shukla,Rama Akkiraju

Main category: cs.CL

TL;DR: ParallelSearch improves the efficiency of multi-step information retrieval by enabling large language models to execute multiple search operations concurrently, significantly reducing processing time while maintaining accuracy.

Details

Motivation: Current search agents process queries sequentially, creating a bottleneck for efficiency, especially in tasks requiring multiple entity comparisons. Method: ParallelSearch introduces a reinforcement learning framework with dedicated reward functions that encourage identification of independent query components for parallel execution. Result: ParallelSearch achieves a 12.7% performance improvement on parallelizable questions and requires only 69.6% of the LLM calls compared to sequential approaches, with an average 2.9% gain across seven benchmarks. Conclusion: ParallelSearch effectively enhances computational efficiency by enabling concurrent execution of multiple search operations while maintaining answer accuracy, overcoming the sequential processing limitations of existing methods. Abstract: Reasoning-augmented search agents such as Search-R1, trained via reinforcement learning with verifiable rewards (RLVR), demonstrate remarkable capabilities in multi-step information retrieval from external knowledge sources. These agents address the limitations of their parametric memory by dynamically gathering relevant facts to address complex reasoning tasks. However, existing approaches suffer from a fundamental architectural limitation: they process search queries strictly sequentially, even when handling inherently parallelizable and logically independent comparisons. This sequential bottleneck significantly constrains computational efficiency, particularly for queries that require multiple entity comparisons. To address this critical limitation, we propose ParallelSearch, a novel reinforcement learning framework that empowers large language models (LLMs) to recognize parallelizable query structures and execute multiple search operations concurrently. 
Our approach introduces dedicated reward functions that incentivize the identification of independent query components while preserving answer accuracy through jointly considering correctness, query decomposition quality, and parallel execution benefits. Comprehensive experiments demonstrate that ParallelSearch outperforms state-of-the-art baselines by an average performance gain of 2.9% across seven question-answering benchmarks. Notably, on parallelizable questions, our method achieves a 12.7% performance improvement while requiring only 69.6% of the LLM calls compared to sequential approaches.
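The efficiency argument above (independent sub-queries issued concurrently rather than one after another) can be illustrated with a minimal sketch. This is not the authors' implementation: `fake_search` is a hypothetical stand-in for a retrieval call, the 0.05 s delay is an assumed latency, and the decomposition into sub-queries is supplied by hand rather than learned with reinforcement learning.

```python
import asyncio
import time


async def fake_search(query: str, delay: float = 0.05) -> str:
    """Hypothetical stand-in for a retrieval call; sleeps to mimic latency."""
    await asyncio.sleep(delay)
    return f"result for: {query}"


async def search_sequential(queries: list[str]) -> list[str]:
    """Issue the sub-queries one after another."""
    results = []
    for query in queries:
        results.append(await fake_search(query))
    return results


async def search_parallel(queries: list[str]) -> list[str]:
    """Issue independent sub-queries concurrently; gather keeps input order."""
    return list(await asyncio.gather(*(fake_search(q) for q in queries)))


def timed(coro_fn, queries: list[str]) -> tuple[list[str], float]:
    """Run one strategy and return its results with the wall-clock time."""
    start = time.perf_counter()
    results = asyncio.run(coro_fn(queries))
    return results, time.perf_counter() - start
```

Calling `timed(search_sequential, queries)` and `timed(search_parallel, queries)` on the same list of independent sub-queries should return identical results, with the parallel variant taking roughly the latency of a single call instead of the sum of all calls.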

[2] Leveraging Large Language Models for Rare Disease Named Entity Recognition

Nan Miles Xi,Yu Deng,Lin Wang

Main category: cs.CL

TL;DR: This study evaluates GPT-4o for Named Entity Recognition in the rare disease domain, demonstrating its effectiveness as a scalable alternative to traditional supervised models, particularly in low-resource settings.

Details

Motivation: Named Entity Recognition in the rare disease domain is challenging due to limited labeled data, semantic ambiguity between entity types, and long-tail distributions. Method: evaluated GPT-4o using prompt-based strategies like zero-shot prompting, few-shot in-context learning, RAG, and task-level fine-tuning; designed a structured prompting framework with domain-specific knowledge; introduced two semantically guided few-shot example selection methods. Result: GPT-4o achieved competitive or superior performance compared to BioClinicalBERT; task-level fine-tuning achieved new state-of-the-art results; few-shot prompting showed high returns at low token budgets; RAG offered marginal additional benefit; identified common failure modes like boundary drift and type confusion. Conclusion: prompt-optimized LLMs can serve as effective, scalable alternatives to traditional supervised models in biomedical NER, especially for rare disease applications with scarce annotated data. Abstract: Named Entity Recognition (NER) in the rare disease domain poses unique challenges due to limited labeled data, semantic ambiguity between entity types, and long-tail distributions. In this study, we evaluate the capabilities of GPT-4o for rare disease NER under low-resource settings, using a range of prompt-based strategies including zero-shot prompting, few-shot in-context learning, retrieval-augmented generation (RAG), and task-level fine-tuning. We design a structured prompting framework that encodes domain-specific knowledge and disambiguation rules for four entity types. We further introduce two semantically guided few-shot example selection methods to improve in-context performance while reducing labeling effort. Experiments on the RareDis Corpus show that GPT-4o achieves competitive or superior performance compared to BioClinicalBERT, with task-level fine-tuning yielding new state-of-the-art (SOTA) results. 
Cost-performance analysis reveals that few-shot prompting delivers high returns at low token budgets, while RAG offers marginal additional benefit. An error taxonomy highlights common failure modes such as boundary drift and type confusion, suggesting opportunities for post-processing and hybrid refinement. Our results demonstrate that prompt-optimized LLMs can serve as effective, scalable alternatives to traditional supervised models in biomedical NER, particularly in rare disease applications where annotated data is scarce.

[3] TEN: Table Explicitization, Neurosymbolically

Nikita Mehrotra,Aayush Kumar,Sumit Gulwani,Arjun Radhakrishna,Ashish Tiwari

Main category: cs.CL

TL;DR: TEN combines neural models with symbolic checking to improve the accuracy and verifiability of tabular data extraction.

Details

Motivation: Extracting tabular data from semistructured text is challenging; purely neural approaches are prone to hallucination and cannot enforce hard constraints. Method: TEN applies Structural Decomposition prompting to a large language model to generate an initial table, uses a symbolic checker to detect errors, and has a critique-LLM turn the checker's output into repair guidance, forming a self-debug loop. Result: TEN achieves higher exact match accuracy and lower hallucination rates across multiple datasets and metrics. Conclusion: TEN significantly outperforms purely neural baselines, and a user study confirms that its tables are more accurate and easier to verify and correct. Abstract: We present a neurosymbolic approach, TEN, for extracting tabular data from semistructured input text. This task is particularly challenging for text input that does not use special delimiters consistently to separate columns and rows. Purely neural approaches perform poorly due to hallucinations and their inability to enforce hard constraints. TEN uses Structural Decomposition prompting - a specialized chain-of-thought prompting approach - on a large language model (LLM) to generate an initial table, and thereafter uses a symbolic checker not only to evaluate the well-formedness of that table, but also to detect cases of hallucination or forgetting. The output of the symbolic checker is processed by a critique-LLM to generate guidance for fixing the table, which is presented to the original LLM in a self-debug loop. Our extensive experiments demonstrate that TEN significantly outperforms purely neural baselines across multiple datasets and metrics, achieving significantly higher exact match accuracy and substantially reduced hallucination rates. A 21-participant user study further confirms that TEN's tables are rated significantly more accurate (mean score: 5.0 vs 4.3; p = 0.021), and are consistently preferred for ease of verification and correction, with participants favoring our method in over 60% of the cases.
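To make the role of a symbolic checker concrete, here is a toy sketch of the three kinds of checks the abstract mentions: well-formedness, hallucination, and forgetting. It is an illustration under simplifying assumptions, not TEN's checker: the function name `check_table` is hypothetical, cells are matched by plain substring lookup, and the source is tokenized on whitespace.

```python
from __future__ import annotations


def check_table(source_text: str, header: list[str], rows: list[list[str]]) -> list[str]:
    """Return human-readable issues found in an extracted table (toy checker)."""
    issues = []
    # Well-formedness: every row must have one cell per header column.
    for i, row in enumerate(rows):
        if len(row) != len(header):
            issues.append(f"row {i}: expected {len(header)} cells, got {len(row)}")
    # Hallucination: every non-empty cell must occur somewhere in the source.
    for i, row in enumerate(rows):
        for cell in row:
            if cell and cell not in source_text:
                issues.append(f"row {i}: cell {cell!r} not in source (possible hallucination)")
    # Forgetting: every whitespace-separated source token should land in a cell.
    cells = {cell for row in rows for cell in row}
    for token in source_text.split():
        if token not in cells:
            issues.append(f"source token {token!r} missing from table (possibly forgotten)")
    return issues
```

In a self-debug loop of the kind described, the returned issue list would be handed to a critique step that turns it into repair guidance; an empty list means the table passed all three checks.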

[4] Decoding Neural Emotion Patterns through Natural Language Processing Embeddings

Gideon Vos,Maryam Ebrahimpour,Liza van Eijk,Zoltan Sarnyai,Mostafa Rahimi Azghadi

Main category: cs.CL

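TL;DR: This study develops a computational framework that maps textual emotion to brain regions without traditional neuroimaging; it can analyze naturalistic language, distinguish healthy from depressed individuals, and evaluate AI emotional expression.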

Details

Motivation: Understanding how emotional expression in language relates to brain function is a challenge in computational neuroscience and affective computing; traditional neuroimaging is costly and lab-bound, while abundant digital text offers new avenues for emotion-brain mapping. Method: OpenAI's text-embedding-ada-002 generates high-dimensional semantic representations; dimensionality reduction and clustering identify emotional groups, which are mapped to 18 brain regions linked to emotional processing. Result: The results show neuroanatomically plausible mappings with high spatial specificity. Depressed subjects exhibited greater limbic engagement tied to negative affect. Discrete emotions were successfully differentiated. LLM-generated text matched human text in basic emotion distribution but lacked nuanced activation in empathy and self-referential regions (medial prefrontal and posterior cingulate cortex). Conclusion: The study proposes a new computational framework that maps textual emotional content to anatomically defined brain regions without neuroimaging; the approach is cost-effective and scalable, distinguishes between clinical populations, and offers a brain-based benchmark for evaluating AI emotional expression. Abstract: Understanding how emotional expression in language relates to brain function is a challenge in computational neuroscience and affective computing. Traditional neuroimaging is costly and lab-bound, but abundant digital text offers new avenues for emotion-brain mapping. Prior work has largely examined neuroimaging-based emotion localization or computational text analysis separately, with little integration. We propose a computational framework that maps textual emotional content to anatomically defined brain regions without requiring neuroimaging. Using OpenAI's text-embedding-ada-002, we generate high-dimensional semantic representations, apply dimensionality reduction and clustering to identify emotional groups, and map them to 18 brain regions linked to emotional processing. Three experiments were conducted: i) analyzing conversational data from healthy vs. depressed subjects (DAIC-WOZ dataset) to compare mapping patterns, ii) applying the method to the GoEmotions dataset and iii) comparing human-written text with large language model (LLM) responses to assess differences in inferred brain activation. Emotional intensity was scored via lexical analysis. Results showed neuroanatomically plausible mappings with high spatial specificity. Depressed subjects exhibited greater limbic engagement tied to negative affect. Discrete emotions were successfully differentiated. LLM-generated text matched humans in basic emotion distribution but lacked nuanced activation in empathy and self-referential regions (medial prefrontal and posterior cingulate cortex).
This cost-effective, scalable approach enables large-scale analysis of naturalistic language, distinguishes between clinical populations, and offers a brain-based benchmark for evaluating AI emotional expression.

[5] The Human-AI Hybrid Delphi Model: A Structured Framework for Context-Rich, Expert Consensus in Complex Domains

Cathy Speed,Ahmed A. Metwally

Main category: cs.CL

TL;DR: This study introduces the HAH-Delphi framework, combining AI and human experts to improve consensus building, showing high accuracy and efficiency across multiple domains.

Details

Motivation: Traditional methods for expert consensus, such as Delphi studies and consensus conferences, face challenges like high panel burden, oversimplification, and difficulty handling conditional nuance, especially under current conditions of information overload and reliance on unfiltered public sources. Method: The study designed and evaluated a three-phase Human-AI Hybrid Delphi (HAH-Delphi) framework that combines a generative AI model (Gemini 2.5 Pro), small panels of senior human experts, and structured facilitation to enhance consensus development. Result: In Phase I, the AI replicated 95% of published expert consensus conclusions; in Phase II, it showed 95% directional agreement with human experts but lacked experiential nuance. In Phase III, small expert panels achieved over 90% consensus coverage and reached thematic saturation early, with AI support aiding divergence resolution and accelerating consensus. Conclusion: The HAH-Delphi framework successfully integrates AI and human expertise to overcome limitations of traditional consensus methods, offering a scalable and robust approach to generating high-quality, context-sensitive consensus across diverse domains. Abstract: Expert consensus plays a critical role in domains where evidence is complex, conflicting, or insufficient for direct prescription. Traditional methods, such as Delphi studies, consensus conferences, and systematic guideline synthesis, offer structure but face limitations including high panel burden, interpretive oversimplification, and suppression of conditional nuance. These challenges are now exacerbated by information overload, fragmentation of the evidence base, and increasing reliance on publicly available sources that lack expert filtering. 
This study introduces and evaluates a Human-AI Hybrid Delphi (HAH-Delphi) framework designed to augment expert consensus development by integrating a generative AI model (Gemini 2.5 Pro), small panels of senior human experts, and structured facilitation. The HAH-Delphi was tested in three phases: retrospective replication, prospective comparison, and applied deployment in two applied domains (endurance training and resistance and mixed cardio/strength training). The AI replicated 95% of published expert consensus conclusions in Phase I and showed 95% directional agreement with senior human experts in Phase II, though it lacked experiential and pragmatic nuance. In Phase III, compact panels of six senior experts achieved >90% consensus coverage and reached thematic saturation before the final participant. The AI provided consistent, literature-grounded scaffolding that supported divergence resolution and accelerated saturation. The HAH-Delphi framework offers a flexible, scalable approach for generating high-quality, context-sensitive consensus. Its successful application across health, coaching, and performance science confirms its methodological robustness and supports its use as a foundation for generating conditional, personalised guidance and published consensus frameworks at scale.

[6] Flow-SLM: Joint Learning of Linguistic and Acoustic Information for Spoken Language Modeling

Ju-Chieh Chou,Jiawei Zhou,Karen Livescu

Main category: cs.CL

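TL;DR: This work proposes a textless spoken language model that jointly models semantic and acoustic information, improving control over acoustic detail in speech generation.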

Details

Motivation: Existing textless spoken language models have no access to acoustic context and lack control over acoustic details. Method: A flow-matching objective predicts a continuous vector conditioned on the semantic tokens; the authors also study the design space of this approach. Result: The approach performs comparably to existing models on linguistic likelihood benchmarks while providing better acoustic detail in prompted generation. Conclusion: Jointly modeling linguistic and acoustic information, by generating semantic tokens together with a continuous representation of the acoustic frame, improves control over acoustic detail in speech generation. Abstract: Textless spoken language models (SLMs) are generative models of speech that do not rely on text supervision. Most textless SLMs learn to predict the next semantic token, a discrete representation of linguistic content, and rely on a separate vocoder to add acoustic information to the generated speech. Such models have no access to acoustic context and no built-in control over acoustic details. In this work, we propose to jointly model linguistic and acoustic information by generating semantic tokens and a continuous real-valued representation of the acoustic frame. We use a flow-matching objective to predict the continuous vector conditioned on the semantic tokens. We study the design space of this approach and find that predicting multiple future semantic tokens helps preserve linguistic information. Our approach achieves comparable performance to existing models in terms of linguistic likelihood benchmarks, while providing better acoustic detail in prompted generation.

[7] APIO: Automatic Prompt Induction and Optimization for Grammatical Error Correction and Text Simplification

Artem Chernodub,Aman Saini,Yejin Huh,Vivek Kulkarni,Vipul Raheja

Main category: cs.CL

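TL;DR: APIO is a simple but effective prompt optimization method that requires no manual seed prompt and performs strongly on grammatical error correction and text simplification.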

Details

Motivation: Improving LLM performance on NLP tasks calls for automatic prompt optimization methods that can replace manually engineered prompts such as chain-of-thought prompting. Method: APIO induces and optimizes LLM prompts automatically, without relying on manually specified seed prompts. Result: APIO achieves state-of-the-art performance among purely LLM-based prompting methods on grammatical error correction and text simplification; the data, code, prompts, and outputs are publicly available. Conclusion: As a new prompt induction and optimization approach, APIO achieves state-of-the-art performance among purely LLM-based prompting methods on grammatical error correction and text simplification. Abstract: Recent advancements in large language models (LLMs) have enabled a wide range of natural language processing (NLP) tasks to be performed through simple prompt-based interactions. Consequently, several approaches have been proposed to engineer prompts that most effectively enable LLMs to perform a given task (e.g., chain-of-thought prompting). In settings with a well-defined metric to optimize model performance, automatic prompt optimization (APO) methods have been developed to refine a seed prompt. Advancing this line of research, we propose APIO, a simple but effective prompt induction and optimization approach for the tasks of Grammatical Error Correction (GEC) and Text Simplification, without relying on manually specified seed prompts. APIO achieves a new state-of-the-art performance for purely LLM-based prompting methods on these tasks. We make our data, code, prompts, and outputs publicly available.

[8] Columbo: Expanding Abbreviated Column Names for Tabular Data Using Large Language Models

Ting Cai,Stephen Sheen,AnHai Doan

Main category: cs.CL

TL;DR: This paper presents Columbo, an LLM-based solution for expanding abbreviated table column names that significantly outperforms existing methods.

Details

Motivation: Expanding abbreviated column names is critical for many downstream data tasks, and the synthetic public data used by prior work has major limitations. Method: The authors develop Columbo, an LLM-based solution, and introduce new synonym-aware measures that assess expansion accuracy more faithfully. Result: Columbo outperforms NameGuess, the current state-of-the-art solution, by 4-29% over 5 datasets and has been used in production on EDI, a major data portal for environmental sciences. Conclusion: Columbo is a powerful LLM-based solution that exploits context, rules, chain-of-thought reasoning, and token-level analysis to expand abbreviated column names, and it significantly outperforms existing solutions on multiple datasets. Abstract: Expanding the abbreviated column names of tables, such as "esal" to "employee salary", is critical for numerous downstream data tasks. This problem arises in enterprises, domain sciences, government agencies, and more. In this paper we make three contributions that significantly advance the state of the art. First, we show that synthetic public data used by prior work has major limitations, and we introduce 4 new datasets in enterprise/science domains, with real-world abbreviations. Second, we show that accuracy measures used by prior work seriously undercount correct expansions, and we propose new synonym-aware measures that capture accuracy much more faithfully. Finally, we develop Columbo, a powerful LLM-based solution that exploits context, rules, chain-of-thought reasoning, and token-level analysis. Extensive experiments show that Columbo significantly outperforms NameGuess, the current most advanced solution, by 4-29%, over 5 datasets. Columbo has been used in production on EDI, a major data portal for environmental sciences.
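The point that exact-match measures undercount correct expansions can be shown with a small sketch. The paper's actual synonym-aware measures are not specified here, so everything below is an assumption for illustration: the `SYNONYMS` table is hand-written and hypothetical, and a prediction counts as correct when it equals the gold expansion after each token is mapped to a canonical synonym.

```python
# Hypothetical synonym groups; a real measure would need a curated resource.
SYNONYMS = {
    "salary": {"salary", "pay", "wage"},
    "employee": {"employee", "staff", "worker"},
}


def canonical(token: str) -> str:
    """Map a token to the canonical member of its synonym group, if any."""
    for canon, group in SYNONYMS.items():
        if token in group:
            return canon
    return token


def exact_match(pred: str, gold: str) -> bool:
    """Case-insensitive token-by-token equality."""
    return pred.lower().split() == gold.lower().split()


def synonym_aware_match(pred: str, gold: str) -> bool:
    """Equality after replacing each token by its canonical synonym."""
    normalize = lambda text: [canonical(t) for t in text.lower().split()]
    return normalize(pred) == normalize(gold)


def accuracy(pairs, match) -> float:
    """Fraction of (prediction, gold) pairs accepted by the given matcher."""
    return sum(match(pred, gold) for pred, gold in pairs) / len(pairs)
```

With the gold expansion "employee salary", the prediction "staff pay" is rejected by `exact_match` but accepted by `synonym_aware_match`, while a genuinely wrong prediction such as "employee sale" is rejected by both.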

[9] Leveraging Zipformer Model for Effective Language Identification in Code-Switched Child-Directed Speech

Lavanya Shankar,Leibny Paola Garcia Perera

Main category: cs.CL

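TL;DR: This paper studies language identification on imbalanced language data using the Zipformer model and demonstrates its potential in real-world scenarios.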

Details

Motivation: The paper addresses the challenges of code-switching and language identification in child-directed scenarios in bilingual environments. Method: Zipformer is used to handle imbalanced speech containing Mandarin and English, and inner layers are selected to extract embeddings, which are compared across different back-ends. Result: The approach achieves a balanced accuracy of 81.89% on language identification, a 15.47% improvement over the baseline. Conclusion: The Zipformer model handles imbalanced language data well and improves balanced accuracy for language identification. Abstract: Code-switching and language identification in child-directed scenarios present significant challenges, particularly in bilingual environments. This paper addresses this challenge by using Zipformer to handle the nuances of speech, which contains two imbalanced languages, Mandarin and English, in an utterance. This work demonstrates that the internal layers of the Zipformer effectively encode the language characteristics, which can be leveraged in language identification. We present the selection methodology of the inner layers to extract the embeddings and make a comparison with different back-ends. Our analysis shows that Zipformer is robust across these back-ends. Our approach effectively handles imbalanced data, achieving a Balanced Accuracy (BAC) of 81.89%, a 15.47% improvement over the language identification baseline. These findings highlight the potential of the transformer encoder architecture model in real scenarios.
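Balanced accuracy is the reported metric because plain accuracy is misleading when one language dominates. The sketch below uses the standard definition (the mean of per-class recall); the labels and counts in the usage note are made up for illustration and are not the paper's data.

```python
from collections import defaultdict


def balanced_accuracy(y_true, y_pred) -> float:
    """Mean of per-class recall, so each class counts equally."""
    total = defaultdict(int)
    correct = defaultdict(int)
    for truth, pred in zip(y_true, y_pred):
        total[truth] += 1
        correct[truth] += int(truth == pred)
    return sum(correct[c] / total[c] for c in total) / len(total)


def plain_accuracy(y_true, y_pred) -> float:
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
```

For a hypothetical set of 8 Mandarin and 2 English segments, a classifier that always predicts Mandarin scores 0.8 on plain accuracy but only 0.5 on balanced accuracy, since it recalls none of the minority class.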

[10] From Charts to Fair Narratives: Uncovering and Mitigating Geo-Economic Biases in Chart-to-Text

Ridwan Mahbub,Mohammed Saidul Islam,Mir Tafseer Nayeem,Md Tahmid Rahman Laskar,Mizanur Rahman,Shafiq Joty,Enamul Hoque

Main category: cs.CL

TL;DR: This paper shows that Vision-Language Models (VLMs) can amplify geo-economic biases when summarizing charts, giving more positive descriptions to high-income countries. Debiasing techniques tested were only partially effective, indicating the need for better strategies.

Details

Motivation: The motivation is to investigate potential biases in Vision-Language Models (VLMs) when generating textual summaries of charts, particularly how these models might amplify geo-economic biases and cause societal harm. Method: The researchers conducted a large-scale evaluation across 6,000 chart-country pairs using six proprietary and open-source VLMs. They analyzed how a country's economic status influenced the sentiment of chart summaries. They also explored inference-time prompt-based debiasing techniques using positive distractors. Result: The analysis revealed that VLMs tend to generate more positive descriptions for high-income countries. Models like GPT-4o-mini, Gemini-1.5-Flash, and Phi-3.5 showed varying degrees of bias. The tested debiasing methods were not fully effective. Conclusion: The study concludes that current Vision-Language Models (VLMs) demonstrate geo-economic bias by generating more positive summaries for high-income countries compared to middle- or low-income countries. Debiasing techniques applied were only partially effective, highlighting the need for more robust strategies. Abstract: Charts are very common for exploring data and communicating insights, but extracting key takeaways from charts and articulating them in natural language can be challenging. The chart-to-text task aims to automate this process by generating textual summaries of charts. While with the rapid advancement of large Vision-Language Models (VLMs), we have witnessed great progress in this domain, little to no attention has been given to potential biases in their outputs. This paper investigates how VLMs can amplify geo-economic biases when generating chart summaries, potentially causing societal harm. 
Specifically, we conduct a large-scale evaluation of geo-economic biases in VLM-generated chart summaries across 6,000 chart-country pairs from six widely used proprietary and open-source models to understand how a country's economic status influences the sentiment of generated summaries. Our analysis reveals that existing VLMs tend to produce more positive descriptions for high-income countries compared to middle- or low-income countries, even when country attribution is the only variable changed. We also find that models such as GPT-4o-mini, Gemini-1.5-Flash, and Phi-3.5 exhibit varying degrees of bias. We further explore inference-time prompt-based debiasing techniques using positive distractors but find them only partially effective, underscoring the complexity of the issue and the need for more robust debiasing strategies. Our code and dataset are publicly available.

[11] User-centric Subjective Leaderboard by Customizable Reward Modeling

Qi Jia,Xiujie Song,Zicheng Zhang,Yijin Guo,Kaiwei Zhang,Zijian Chen,Guangtao Zhai

Main category: cs.CL

TL;DR: This paper introduces a new User-Centric Subjective Leaderboard (USL) powered by Customizable Reward Models (CRMs) that dynamically ranks large language models based on real-world user preferences, outperforming existing top models.

Details

Motivation: Existing benchmarks for large language models focus on verifiable tasks, which offer limited utility for users trying to select the most suitable models for individual needs. This motivates the need for a more user-centric and dynamic evaluation method. Method: The research is based on an analysis of over 10,000 subjective human preference queries. It introduces Customizable Reward Models (CRMs) that are used to create a dynamic and preference-driven ranking system for LLMs, known as the User-Centric Subjective Leaderboard (USL). Result: The introduced CRM, with only 4B parameters, outperforms state-of-the-art models like GPT-4.1 and Gemini-2.5-pro, demonstrating exceptional generalization capabilities. The USL shows strong negative correlations to contradictory preferences, indicating its effectiveness in handling diverse user preferences. Conclusion: The study concludes that the User-Centric Subjective Leaderboard (USL), powered by Customizable Reward Models (CRMs), effectively ranks large language models (LLMs) based on dynamic, preference-driven real-world scenarios, surpassing the performance of current leading models. Abstract: Existing benchmarks for large language models (LLMs) predominantly focus on assessing their capabilities through verifiable tasks. Such objective and static benchmarks offer limited utility for practical LLM selection, making it difficult for users to find suitable models for their individual needs. To bridge this gap, we present the first User-Centric Subjective Leaderboard (USL), which provides a preference-driven, dynamic ranking of LLMs across diverse real-world scenarios. Our work is built upon a thorough investigation of real human preference data, involving more than 10K subjective queries. Our investigation reveals significant diversity and contradictions in human preferences, which limit the effectiveness of state-of-the-art reward models. To address this, we introduce Customizable Reward Models (CRMs).
With only 4B parameters, our CRM surpasses the performance of leading models such as GPT-4.1 and Gemini-2.5-pro, showing exceptional generalization capabilities across new topics and criteria. The USL, powered by CRMs, exhibits strong negative correlations to contradictory preferences.

[12] Learning Facts at Scale with Active Reading

Jessy Lin,Vincent-Pierre Berges,Xilun Chen,Wen-Tau Yih,Gargi Ghosh,Barlas Oğuz

Main category: cs.CL

TL;DR: Active Reading improves knowledge retention and recall in LLMs, significantly outperforming standard training methods and enabling the development of highly factual models at scale.

Details

Motivation: LLMs are known to store vast knowledge but often struggle to reliably recall facts. The need for tools that ensure consistent and reliable learning of specific knowledge motivated the development of Active Reading. Method: The researchers trained models using Active Reading, where models develop self-generated learning strategies while studying a given set of materials. They evaluated the approach on expert domains like Wikipedia and FinanceBench, comparing performance against vanilla finetuning and other data augmentation techniques. Result: Models trained with Active Reading showed significant improvements, achieving 66% on a Wikipedia-grounded subset of SimpleQA (+313% relative improvement) and 26% on FinanceBench (+160% relative improvement). Additionally, Meta WikiExpert-8B outperformed larger models on factual QA. Conclusion: Active Reading can be effectively used to enhance knowledge absorption in LLMs, outperforming traditional methods like vanilla finetuning, and can be applied at scale to build more factual models. Abstract: LLMs are known to store vast amounts of knowledge in their parametric memory. However, learning and recalling facts from this memory is known to be unreliable, depending largely on the prevalence of particular facts in the training data and other factors which are poorly understood. Practitioners are lacking tools which will allow them to ensure that the models learn a given body of knowledge reliably and consistently. To this end, we propose Active Reading: a framework where we train models to study a given set of material with self-generated learning strategies. First, we demonstrate models trained with Active Reading on expert domains absorb significantly more knowledge than vanilla finetuning and other data augmentations. 
We train expert 8B models that achieve 66% on a Wikipedia-grounded subset of SimpleQA (+313% relative over vanilla finetuning) and 26% on FinanceBench (+160% relative over vanilla finetuning) by applying Active Reading to the source documents for each benchmark. Finally, we show that Active Reading can be utilized at pre-training scale to build more factual models. As a demonstration of this, we release Meta WikiExpert-8B, a Wikipedia-expert model trained on 1 trillion generated tokens, which outcompetes models with hundreds of billions of parameters on factual QA.
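The relative gains quoted above can be turned into implied baselines with one line of arithmetic. This sketch assumes "+X% relative" means (new - baseline) / baseline = X/100; the abstract does not state the vanilla finetuning scores directly, so the numbers it produces are back-of-the-envelope estimates, and `implied_baseline` is a hypothetical helper name.

```python
def implied_baseline(score: float, relative_gain_pct: float) -> float:
    """Baseline score implied by a final score and a relative gain in percent."""
    return score / (1 + relative_gain_pct / 100)


# SimpleQA subset: 66% after a +313% relative gain implies roughly 16%.
simpleqa_baseline = implied_baseline(66, 313)
# FinanceBench: 26% after a +160% relative gain implies roughly 10%.
financebench_baseline = implied_baseline(26, 160)
```

Under that reading, vanilla finetuning would sit near 16% on the SimpleQA subset and near 10% on FinanceBench.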

[13] From Ranking to Selection: A Simple but Efficient Dynamic Passage Selector for Retrieval Augmented Generation

Siyuan Meng,Junming Liu,Yirong Chen,Song Mao,Pinlong Cai,Guohang Yan,Botian Shi,Ding Wang

Main category: cs.CL

TL;DR: The paper introduces Dynamic Passage Selector (DPS), a novel reranking framework for retrieval-augmented generation systems that improves passage selection for complex queries, enhancing reasoning capabilities by capturing inter-passage dependencies and dynamically selecting relevant passages without modifying the standard RAG pipeline.

Details

Motivation: RAG systems struggle with complex multi-hop queries due to limitations in reranking modules that score passages independently and select a fixed Top-K size, leading to either omission of crucial information or introduction of noise. Method: Dynamic Passage Selector (DPS) is introduced as a novel reranking framework that captures inter-passage dependencies and dynamically selects the most relevant set of passages for generation. Result: Comprehensive evaluations on five benchmarks show that DPS consistently outperforms state-of-the-art rerankers and fine-tuning methods, with a notable improvement of 30.06% and 15.4% in F1-score on the MuSiQue dataset over Qwen3-reranker and RankingGPT, respectively. Conclusion: DPS substantially enhances reasoning capabilities in complex RAG scenarios by enabling adaptive evidence selection. Abstract: Retrieval-augmented generation (RAG) systems are often bottlenecked by their reranking modules, which typically score passages independently and select a fixed Top-K size. This approach struggles with complex multi-hop queries that require synthesizing evidence across multiple documents, creating a trade-off where small K values omit crucial information and large K values introduce noise. To address this, we introduce the Dynamic Passage Selector (DPS), a novel reranking framework that treats passage selection as a supervised learning problem. Unlike traditional point-wise or list-wise methods, DPS is fine-tuned to capture inter-passage dependencies and dynamically select the most relevant set of passages for generation. As a seamless plug-and-play module, DPS requires no modifications to the standard RAG pipeline. Comprehensive evaluations on five benchmarks show that DPS consistently outperforms state-of-the-art rerankers and fine-tuning methods. Notably, on the challenging MuSiQue dataset, DPS improves the F1-score by 30.06% and 15.4% over strong baselines like Qwen3-reranker and RankingGPT, respectively. 
Our results demonstrate that by enabling adaptive evidence selection, DPS substantially enhances reasoning capabilities in complex RAG scenarios.
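The trade-off behind a fixed Top-K cutoff can be seen in a toy comparison between fixed-size selection and variable-size selection. The variable-size rule below is a plain score threshold, swapped in purely for illustration: it is not DPS, which is described as a fine-tuned model that captures inter-passage dependencies. The function names, scores, and the 0.5 threshold are all assumptions.

```python
def select_top_k(scored, k):
    """Keep the k highest-scoring passages, regardless of how many are relevant."""
    ranked = sorted(scored, key=lambda item: item[1], reverse=True)
    return [passage for passage, _ in ranked[:k]]


def select_by_threshold(scored, threshold):
    """Keep every passage whose score clears the threshold (variable set size)."""
    ranked = sorted(scored, key=lambda item: item[1], reverse=True)
    return [passage for passage, score in ranked if score >= threshold]


# Hypothetical relevance scores for a single-hop and a multi-hop query.
single_hop = [("p1", 0.90), ("p2", 0.20), ("p3", 0.10)]
multi_hop = [("p1", 0.90), ("p2", 0.85), ("p3", 0.80), ("p4", 0.10)]
```

With K = 2, the single-hop query receives one noisy passage and the multi-hop query loses one needed passage, whereas the threshold rule returns one and three passages respectively.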

[14] LACA: Improving Cross-lingual Aspect-Based Sentiment Analysis with LLM Data Augmentation

Jakub Šmíd,Pavel Přibáň,Pavel Král

Main category: cs.CL

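TL;DR: This paper proposes a translation-free approach to cross-lingual aspect-based sentiment analysis that uses a large language model to generate high-quality pseudo-labelled data and outperforms existing methods across multiple languages and models.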

Details

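Motivation: Existing cross-lingual aspect-based sentiment analysis methods typically depend on unreliable translation tools; this paper aims to overcome that by using a large language model to generate high-quality pseudo-labelled data. Method: First, an ABSA model is trained to obtain predictions for unlabelled target-language data; next, a large language model (LLM) generates natural sentences that better represent these noisy predictions; finally, the ABSA model is further fine-tuned on the resulting pseudo-labelled dataset. Result: The method performs well across multiple languages and models, surpassing previous state-of-the-art translation-based approaches; the framework also supports generative models, and fine-tuned LLMs outperform smaller multilingual models. Conclusion: The proposed framework generates high-quality pseudo-labelled data in the target language without translation tools, and its effectiveness is demonstrated across six languages and five backbone models, surpassing previous translation-based approaches. Abstract: Cross-lingual aspect-based sentiment analysis (ABSA) involves detailed sentiment analysis in a target language by transferring knowledge from a source language with available annotated data. Most existing methods depend heavily on often unreliable translation tools to bridge the language gap. In this paper, we propose a new approach that leverages a large language model (LLM) to generate high-quality pseudo-labelled data in the target language without the need for translation tools. First, the framework trains an ABSA model to obtain predictions for unlabelled target language data. Next, an LLM is prompted to generate natural sentences that better represent these noisy predictions than the original text. The ABSA model is then further fine-tuned on the resulting pseudo-labelled dataset. We demonstrate the effectiveness of this method across six languages and five backbone models, surpassing previous state-of-the-art translation-based approaches. The proposed framework also supports generative models, and we show that fine-tuned LLMs outperform smaller multilingual models.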

[15] Cross-lingual Aspect-Based Sentiment Analysis: A Survey on Tasks, Approaches, and Challenges

Jakub Šmíd,Pavel Král

Main category: cs.CL

TL;DR: This paper surveys cross-lingual aspect-based sentiment analysis (ABSA), summarizing tasks, datasets, and transfer methods, while highlighting challenges and future directions.

Details

Motivation: Cross-lingual ABSA remains under-explored despite progress in monolingual ABSA; the paper aims to fill this gap with a comprehensive survey. Method: The authors summarize key ABSA tasks, review datasets and modeling paradigms, and analyze cross-lingual transfer methods in ABSA research. Result: The paper provides a structured overview of cross-lingual ABSA, including insights from monolingual, multilingual, and LLM-based approaches, and identifies challenges and future research directions. Conclusion: The study concludes that cross-lingual ABSA requires further exploration, particularly in leveraging existing monolingual and multilingual work and advancing with LLMs. Abstract: Aspect-based sentiment analysis (ABSA) is a fine-grained sentiment analysis task that focuses on understanding opinions at the aspect level, including sentiment towards specific aspect terms, categories, and opinions. While ABSA research has seen significant progress, much of the focus has been on monolingual settings. Cross-lingual ABSA, which aims to transfer knowledge from resource-rich languages (such as English) to low-resource languages, remains an under-explored area, with no systematic review of the field. This paper aims to fill that gap by providing a comprehensive survey of cross-lingual ABSA. We summarize key ABSA tasks, including aspect term extraction, aspect sentiment classification, and compound tasks involving multiple sentiment elements. Additionally, we review the datasets, modelling paradigms, and cross-lingual transfer methods used to solve these tasks. We also examine how existing work in monolingual and multilingual ABSA, as well as ABSA with LLMs, contributes to the development of cross-lingual ABSA. Finally, we highlight the main challenges and suggest directions for future research to advance cross-lingual ABSA systems.

[16] UWBa at SemEval-2025 Task 7: Multilingual and Crosslingual Fact-Checked Claim Retrieval

Ladislav Lenc,Daniel Cífka,Jiří Martínek,Jakub Šmíd,Pavel Král

Main category: cs.CL

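TL;DR: This paper presents a zero-shot system for fact-checked claim retrieval built on a combination of large language models, matching posts to claims with text embeddings and cosine similarity; the NV-Embed-v2 model performed best.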

Details

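Motivation: The goal is to develop a zero-shot system for fact-checked claim retrieval that improves accuracy, and to explore the performance of multilingual and monolingual models. Method: The study uses several state-of-the-art large language models to obtain text embeddings and combines these models to optimise the results. The most relevant claims for each post are identified by measuring cosine similarity. Result: The system ranked 7th in the monolingual subtask and 9th in the cross-lingual subtask. The NVIDIA NV-Embed-v2 model performed best, and some languages benefited from model combinations. Conclusion: Effective zero-shot fact-checked claim retrieval can be achieved by combining state-of-the-art large language models. The best results were obtained with the NVIDIA NV-Embed-v2 model, and some languages benefited from model combinations. Abstract: This paper presents a zero-shot system for fact-checked claim retrieval. We employed several state-of-the-art large language models to obtain text embeddings. The models were then combined to obtain the best possible result. Our approach achieved 7th place in monolingual and 9th in cross-lingual subtasks. We used only English translations as an input to the text embedding models since multilingual models did not achieve satisfactory results. We identified the most relevant claims for each post by leveraging the embeddings and measuring cosine similarity. Overall, the best results were obtained by the NVIDIA NV-Embed-v2 model. For some languages, we benefited from model combinations (NV-Embed & GPT or Mistral).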
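The retrieval step described in the abstract (embed the post and the claims, rank claims by cosine similarity, optionally combine several embedding models) can be sketched as follows. This is a minimal illustration, not the authors' system: the function names, the toy 2-dimensional vectors, and the choice to combine models by averaging their similarity scores are all assumptions made for the example.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def rank_claims(post_by_model, claims_by_model, top_k=2):
    """Rank claim ids for one post.

    post_by_model:   {model_name: post_embedding}
    claims_by_model: {model_name: {claim_id: claim_embedding}}
    Averaging similarities across models is an assumed combination rule.
    """
    models = list(post_by_model)
    claim_ids = list(claims_by_model[models[0]])
    scores = {}
    for cid in claim_ids:
        sims = [cosine(post_by_model[m], claims_by_model[m][cid]) for m in models]
        scores[cid] = sum(sims) / len(sims)
    ranked = sorted(claim_ids, key=lambda cid: scores[cid], reverse=True)
    return ranked[:top_k], scores

# Toy embeddings standing in for real model outputs (hypothetical values).
post = {"model_a": [1.0, 0.0], "model_b": [0.0, 1.0]}
claims = {
    "model_a": {"c1": [1.0, 0.0], "c2": [0.0, 1.0], "c3": [1.0, 1.0]},
    "model_b": {"c1": [0.0, 1.0], "c2": [1.0, 0.0], "c3": [1.0, 1.0]},
}
top, scores = rank_claims(post, claims, top_k=2)
```

With a single entry in `post_by_model` the same function reduces to plain single-model cosine ranking.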

[17] COMPEER: Controllable Empathetic Reinforcement Reasoning for Emotional Support Conversation

Yunxiao Wang,Meng Liu,Wenqi Liu,Kaiyu Jiang,Bin Wen,Fan Yang,Tingting Gao,Guorui Zhou,Liqiang Nie

Main category: cs.CL

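TL;DR: The paper proposes a controllable empathetic reasoning method grounded in psychological principles; by constructing a fine-grained dataset and optimising with reinforcement learning, it significantly improves the model's emotional support ability.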

Details

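Motivation: Current models lack deep empathetic reasoning in emotional support conversations and need to incorporate psychological principles to better support emotional well-being. Method: The approach combines natural language reasoning with structured psychological steps, constructs a fine-grained dataset, and trains with reinforcement learning and a unified process-outcome reward model. It also introduces personality-based dialogue rewriting and a redundancy-aware reward reweighting strategy. Result: The method significantly improves the model's emotional support ability and effectively mitigates the response repetitiveness caused by entropy collapse. Conclusion: The paper proposes a controllable empathetic reasoning method that significantly improves the model's emotional support ability and advances the development of human-like empathetic support systems. Abstract: Emotional support conversations are crucial for promoting emotional well-being, yet current models often lack deep empathetic reasoning grounded in psychological principles. To address this, we propose controllable empathetic reasoning, which combines natural language reasoning with structured psychological steps. We construct a fine-grained dataset annotated with reasoning correctness and response preferences to enable this capability. To further enhance training, we employ reinforcement learning with a unified process-outcome reward model that delivers precise feedback. To mitigate response repetitiveness from entropy collapse, we introduce personality-based dialogue rewriting and a redundancy-aware reward reweighting strategy. Our approach significantly improves the model's emotional support ability, advancing the development of empathetic, human-like support systems.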

[18] The Surprising Effectiveness of Membership Inference with Simple N-Gram Coverage

Skyler Hallinan,Jaehun Jung,Melanie Sclar,Ximing Lu,Abhilasha Ravichander,Sahana Ramnath,Yejin Choi,Sai Praneeth Karimireddy,Niloofar Mireshghallah,Xiang Ren

Main category: cs.CL

TL;DR: This paper introduces the N-Gram Coverage Attack, a new black-box membership inference method that relies only on text outputs. It demonstrates effectiveness against closed models like GPT-4 and shows that newer models like GPT-4o are more robust to such attacks.

Details

Motivation: Current state-of-the-art membership inference attacks require access to models' hidden states or probability distributions, limiting their applicability to API-access-only models like GPT-4. This work aims to overcome that limitation. Method: N-Gram Coverage Attack is introduced, which uses n-gram overlap metrics to infer membership by comparing generated text outputs with the ground truth suffix after conditioning on a prefix. Multiple sequences are generated from the model to enhance attack performance. Result: The N-Gram Coverage Attack outperforms existing black-box methods and performs comparably to or even better than state-of-the-art white-box attacks. The attack performance improves with an increase in the number of generated sequences. Applied to closed models, it shows that newer models such as GPT-4o are more robust to membership inference. Conclusion: Membership inference attacks can now be applied to black-box models using only text outputs, and the success rate increases with the attack compute budget. Newer models like GPT-4o show improved robustness against such attacks. Abstract: Membership inference attacks serve as a useful tool for fair use of language models, such as detecting potential copyright infringement and auditing data leakage. However, many current state-of-the-art attacks require access to models' hidden states or probability distribution, which prevents investigation into more widely-used, API-access only models like GPT-4. In this work, we introduce N-Gram Coverage Attack, a membership inference attack that relies solely on text outputs from the target model, enabling attacks on completely black-box models. We leverage the observation that models are more likely to memorize and subsequently generate text patterns that were commonly observed in their training data.
Specifically, to make a prediction on a candidate member, N-Gram Coverage Attack first obtains multiple model generations conditioned on a prefix of the candidate. It then uses n-gram overlap metrics to compute and aggregate the similarities of these outputs with the ground truth suffix; high similarities indicate likely membership. We first demonstrate on a diverse set of existing benchmarks that N-Gram Coverage Attack outperforms other black-box methods while also impressively achieving comparable or even better performance to state-of-the-art white-box attacks - despite having access to only text outputs. Interestingly, we find that the success rate of our method scales with the attack compute budget - as we increase the number of sequences generated from the target model conditioned on the prefix, attack performance tends to improve. Having verified the accuracy of our method, we use it to investigate previously unstudied closed OpenAI models on multiple domains. We find that more recent models, such as GPT-4o, exhibit increased robustness to membership inference, suggesting an evolving trend toward improved privacy protections.
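The scoring procedure described above (generate several continuations of a prefix, measure n-gram overlap with the true suffix, aggregate) can be illustrated with a small sketch. The abstract does not specify the exact overlap metric or aggregation, so the coverage definition, whitespace tokenisation, `n=2`, and the `max` aggregator below are assumptions for illustration; the generations are hard-coded strings standing in for model outputs.

```python
def ngrams(tokens, n):
    """Set of n-grams (as tuples) in a token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def coverage(generation, suffix, n=2):
    """Fraction of the suffix's n-grams that also occur in the generation."""
    target = ngrams(suffix.split(), n)
    if not target:
        return 0.0
    produced = ngrams(generation.split(), n)
    return len(target & produced) / len(target)

def membership_score(generations, suffix, n=2, aggregate=max):
    """Aggregate per-generation coverage; a higher score suggests membership."""
    if not generations:
        return 0.0
    return aggregate([coverage(g, suffix, n) for g in generations])

# Hypothetical example: the true suffix of a candidate document and two
# continuations sampled from the target model given the document's prefix.
suffix = "the cat sat on the mat"
generations = ["the cat sat on a rug", "dogs bark loudly"]
score = membership_score(generations, suffix)
```

Sampling more generations can only raise a `max`-aggregated score, which is one simple way the attack could benefit from a larger compute budget; a threshold on `score` would then give the member/non-member decision.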

[19] AINL-Eval 2025 Shared Task: Detection of AI-Generated Scientific Abstracts in Russian

Tatiana Batura,Elena Bruches,Milana Shvenk,Valentin Malykh

Main category: cs.CL

TL;DR: The AINL-Eval 2025 Shared Task introduced a comprehensive dataset and evaluation framework for detecting AI-generated scientific abstracts in Russian, fostering ongoing research and development in this critical area.

Details

Motivation: The challenge of distinguishing between human- and AI-generated content in scientific publishing, particularly in multilingual contexts with limited detection resources. Method: Introduction of the AINL-Eval 2025 Shared Task with a large-scale dataset comprising 52,305 samples of human-written and AI-generated abstracts from 12 scientific domains and five state-of-the-art LLMs. Result: The shared task attracted 10 teams and 159 submissions, with top systems showing strong performance in identifying AI-generated content. Conclusion: The AINL-Eval 2025 Shared Task successfully fostered research and development in detecting AI-generated scientific abstracts, demonstrating the possibility of robust solutions capable of generalizing to unseen domains and models. Abstract: The rapid advancement of large language models (LLMs) has revolutionized text generation, making it increasingly difficult to distinguish between human- and AI-generated content. This poses a significant challenge to academic integrity, particularly in scientific publishing and multilingual contexts where detection resources are often limited. To address this critical gap, we introduce the AINL-Eval 2025 Shared Task, specifically focused on the detection of AI-generated scientific abstracts in Russian. We present a novel, large-scale dataset comprising 52,305 samples, including human-written abstracts across 12 diverse scientific domains and AI-generated counterparts from five state-of-the-art LLMs (GPT-4-Turbo, Gemma2-27B, Llama3.3-70B, Deepseek-V3, and GigaChat-Lite). A core objective of the task is to challenge participants to develop robust solutions capable of generalizing to both (i) previously unseen scientific domains and (ii) models not included in the training data. The task was organized in two phases, attracting 10 teams and 159 submissions, with top systems demonstrating strong performance in identifying AI-generated content. 
We also establish a continuous shared task platform to foster ongoing research and long-term progress in this important area. The dataset and platform are publicly available at https://github.com/iis-research-team/AINL-Eval-2025.

[20] Improving Diversity in Language Models: When Temperature Fails, Change the Loss

Alexandre Verine,Florian Le Bronnec,Kunhao Zheng,Alexandre Allauzen,Yann Chevaleyre,Benjamin Negrevergne

Main category: cs.CL

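TL;DR: This paper studies the effect of adjusting the decoding temperature in language models and proposes rethinking the loss function through the Precision-Recall framework to achieve a better trade-off between Precision and Recall.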

Details

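Motivation: The paper addresses the difficulty of increasing diversity in language models, noting that the common approach of raising the decoding temperature has limitations and that a more effective method is needed. Method: By analysing a simple case of temperature adjustment, the paper examines why lowering the temperature can improve quality (Precision) while raising it often fails to improve coverage (Recall), and proposes a modified loss function. Result: Experiments show that the proposed approach achieves a substantially better trade-off between Precision and Recall than the conventional approach, offering a path toward more versatile and robust language modelling techniques. Conclusion: Rethinking the loss function in language models through the Precision-Recall framework yields a better trade-off between Precision and Recall than merely combining negative log-likelihood training with temperature scaling. Abstract: Increasing diversity in language models is a challenging yet essential objective. A common approach is to raise the decoding temperature. In this work, we investigate this approach through a simplistic yet common case to provide insights into why decreasing temperature can improve quality (Precision), while increasing it often fails to boost coverage (Recall). Our analysis reveals that for a model to be effectively tunable through temperature adjustments, it must be trained toward coverage. To address this, we propose rethinking loss functions in language models by leveraging the Precision-Recall framework. Our results demonstrate that this approach achieves a substantially better trade-off between Precision and Recall than merely combining negative log-likelihood training with temperature scaling. These findings offer a pathway toward more versatile and robust language modeling techniques.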
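For readers unfamiliar with the temperature knob the abstract analyses, the sketch below shows standard temperature scaling of a softmax distribution and how it changes the distribution's entropy. It illustrates only the decoding-time baseline, not the Precision-Recall loss the paper proposes; the logits are made-up values.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Softmax over logits / temperature (temperature must be positive)."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    """Shannon entropy in nats."""
    return -sum(p * math.log(p) for p in probs if p > 0)

logits = [2.0, 1.0, 0.0]  # hypothetical next-token logits
sharp = softmax_with_temperature(logits, 0.5)   # lower T: mass concentrates
base = softmax_with_temperature(logits, 1.0)
flat = softmax_with_temperature(logits, 2.0)    # higher T: mass spreads out
```

Lowering the temperature concentrates probability on the top token, while raising it only flattens the probabilities the model already assigns; it cannot add mass to outputs the model was never trained to cover, which is consistent with the abstract's point that the model must be trained toward coverage.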

[21] EffiEval: Efficient and Generalizable Model Evaluation via Capability Coverage Maximization

Yaoning Wang,Jiahao Ying,Yixin Cao,Yubo Ma,Yugang Jiang

Main category: cs.CL

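TL;DR: EffiEval is a training-free method for efficient benchmarking that addresses data redundancy while maintaining high evaluation reliability, and it is flexible and scalable.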

Details

Motivation: The rapid advancement of large language models (LLMs) and the growth of increasingly large and diverse evaluation benchmarks have created substantial computational challenges for model evaluation. Method: EffiEval adaptively selects high-quality representative subsets based on the Model Utility Index (MUI). Result: Experiments show that EffiEval achieves ranking consistency comparable to full-dataset evaluation while using only a small fraction of the original data. Conclusion: EffiEval is a training-free method for efficient benchmarking that addresses data redundancy while maintaining high evaluation reliability. Abstract: The rapid advancement of large language models (LLMs) and the development of increasingly large and diverse evaluation benchmarks have introduced substantial computational challenges for model assessment. In this paper, we present EffiEval, a training-free approach for efficient benchmarking that effectively addresses data redundancy while maintaining high evaluation reliability. Our method is specifically designed to meet three key criteria for high-quality evaluation: representativeness, by ensuring comprehensive coverage of model capabilities; fairness, by remaining independent of model performance during sample selection to avoid bias; and generalizability, by enabling flexible transfer across datasets and model families without reliance on large-scale evaluation data. Unlike traditional methods that rely on absolute performance or require extensive evaluation data, our approach adaptively selects high-quality representative subsets based on the Model Utility Index (MUI). Extensive experiments on multiple public benchmarks and diverse LLMs demonstrate that EffiEval achieves strong ranking consistency with full-dataset evaluation using only a small fraction of the original data. Furthermore, our method is flexible and scalable in size, allowing users to balance evaluation efficiency and representativeness according to specific needs. Overall, EffiEval provides a practical and generalizable solution for reliable, fair, and efficient evaluation in the era of LLMs.

[22] Slow Tuning and Low-Entropy Masking for Safe Chain-of-Thought Distillation

Ziyang Ma,Qingyue Yuan,Linhai Zhang,Deyu Zhou

Main category: cs.CL

TL;DR: This paper introduces SLowED, a safe distillation method for Small Language Models that maintains safety while improving reasoning, without extra computation or data.

Details

Motivation: The motivation is to address the safety risks introduced during CoT distillation of SLMs using LLM-generated rationales, as existing safety alignment methods often require extra computation or data and may harm reasoning performance. Method: The authors proposed SLowED, a safe distillation method with two modules: Slow Tuning, which limits weight changes during training, and Low-Entropy Masking, which excludes unnecessary tokens from fine-tuning. They conducted experiments on three SLMs across reasoning and safety benchmarks. Result: SLowED successfully retains SLM safety and improves reasoning capability compared to existing methods. Ablation studies confirm the effectiveness of both Slow Tuning and Low-Entropy Masking in maintaining and prolonging safe training. Conclusion: The study concludes that the proposed SLowED method effectively maintains the safety of SLMs during CoT distillation while improving reasoning capabilities without requiring additional computation or data. Abstract: Previous chain-of-thought (CoT) distillation methods primarily focused on enhancing the reasoning capabilities of Small Language Models (SLMs) by utilizing high-quality rationales generated by powerful Large Language Models (LLMs, e.g., GPT-4). However, few works have noted the negative effects on SLM safety brought by the training, which are revealed in this study. Although there are works on safety alignment that fine-tune language models or manipulate model weights to defend against harmful inputs, they require extra computation or annotated data, and probably impact the reasoning ability of SLMs. In this paper, we investigate how to maintain the safety of SLMs during the CoT distillation process. Specifically, we propose a safe distillation method, Slow Tuning and Low-Entropy Masking Distillation (SLowED), containing two modules: Slow Tuning and Low-Entropy Masking. 
Slow Tuning scales down the magnitude of model weight changes to optimize the model weights in the neighboring space near the initial weight distribution. Low-Entropy Masking masks low-entropy tokens, which are regarded as unnecessary learning targets, to exclude them from fine-tuning. Experiments on three SLMs (Qwen2.5-1.5B, Llama-3.2-1B, BLOOM-1.1B) across reasoning benchmarks (BBH, BB-Sub, ARC, AGIEval) and safety evaluation (AdvBench) show that SLowED retains the safety of SLMs and comparably improves their reasoning capability compared to existing distillation methods. Furthermore, our ablation study presents the effectiveness of Slow Tuning and Low-Entropy Masking, with the former maintaining the model's safety in the early stage and the latter prolonging the safe training epochs.
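The Low-Entropy Masking idea (drop low-entropy tokens from the fine-tuning loss) can be sketched roughly as below. This is an illustration under stated assumptions rather than the SLowED implementation: the fixed entropy threshold, the use of plain probability lists instead of model logits, and the helper names are all hypothetical, and the abstract does not say how the threshold is chosen.

```python
import math

def token_entropy(probs):
    """Shannon entropy (nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def low_entropy_mask(token_distributions, threshold):
    """1 = token stays in the loss, 0 = token is masked out (entropy below threshold)."""
    return [0 if token_entropy(p) < threshold else 1 for p in token_distributions]

def masked_nll(token_distributions, target_ids, mask):
    """Mean negative log-likelihood over the unmasked target tokens only."""
    kept = [
        -math.log(probs[target])
        for probs, target, keep in zip(token_distributions, target_ids, mask)
        if keep
    ]
    return sum(kept) / len(kept) if kept else 0.0

# Two hypothetical positions: one where the model is nearly certain,
# one where it is genuinely uncertain about the next token.
dists = [[0.98, 0.01, 0.01], [0.4, 0.3, 0.3]]
targets = [0, 0]
mask = low_entropy_mask(dists, threshold=0.5)  # threshold is an assumed value
loss = masked_nll(dists, targets, mask)
```

Only the second, high-entropy position contributes to `loss` here; the confident first position is treated as an unnecessary learning target and excluded.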

[23] Evaluating the Role of Large Language Models in Legal Practice in India

Rahul Hemrajani

Main category: cs.CL

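TL;DR: This paper examines how well large language models (LLMs) perform key legal tasks in the Indian context, finding that they excel at drafting and issue spotting but produce hallucinations and errors in specialised legal research.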

Details

Motivation: As artificial intelligence is adopted in the legal field, it becomes important to understand whether LLMs can perform key legal tasks. Method: Through a survey experiment, the author compares LLM outputs with those of a junior lawyer, with advanced law students rating the work on helpfulness, accuracy, and comprehensiveness. Result: LLMs excel in drafting and issue spotting but frequently generate incorrect or fabricated outputs in specialised legal research. Conclusion: While LLMs can augment certain legal tasks, human expertise remains essential for nuanced reasoning and the precise application of law. Abstract: The integration of Artificial Intelligence (AI) into the legal profession raises significant questions about the capacity of Large Language Models (LLMs) to perform key legal tasks. In this paper, I empirically evaluate how well LLMs, such as GPT, Claude, and Llama, perform key legal tasks in the Indian context, including issue spotting, legal drafting, advice, research, and reasoning. Through a survey experiment, I compare outputs from LLMs with those of a junior lawyer, with advanced law students rating the work on helpfulness, accuracy, and comprehensiveness. LLMs excel in drafting and issue spotting, often matching or surpassing human work. However, they struggle with specialised legal research, frequently generating hallucinations, i.e., factually incorrect or fabricated outputs. I conclude that while LLMs can augment certain legal tasks, human expertise remains essential for nuanced reasoning and the precise application of law.

[24] The Perils of Chart Deception: How Misleading Visualizations Affect Vision-Language Models

Ridwan Mahbub,Mohammed Saidul Islam,Md Tahmid Rahman Laskar,Mizanur Rahman,Mir Tafseer Nayeem,Enamul Hoque

Main category: cs.CL

TL;DR: This study shows that Vision-Language Models (VLMs) are often misled by deceptive visual designs in data charts, emphasizing the need for improved safeguards against visual misinformation.

Details

Motivation: As VLMs are increasingly used by non-expert users to interpret visualizations, understanding their susceptibility to deceptive visual designs is critical to prevent misinformation. Method: The study evaluates VLMs' ability to interpret misleading visualizations by analyzing over 16,000 responses from ten different models across eight distinct types of misleading chart designs. Result: The analysis demonstrates that most VLMs are deceived by misleading visual designs, leading to altered interpretations of charts despite the underlying data remaining the same. Conclusion: The study concludes that most Vision-Language Models (VLMs) are susceptible to deceptive visual designs in information visualizations, highlighting the need for robust safeguards against visual misinformation. Abstract: Information visualizations are powerful tools that help users quickly identify patterns, trends, and outliers, facilitating informed decision-making. However, when visualizations incorporate deceptive design elements-such as truncated or inverted axes, unjustified 3D effects, or violations of best practices-they can mislead viewers and distort understanding, spreading misinformation. While some deceptive tactics are obvious, others subtly manipulate perception while maintaining a facade of legitimacy. As Vision-Language Models (VLMs) are increasingly used to interpret visualizations, especially by non-expert users, it is critical to understand how susceptible these models are to deceptive visual designs. In this study, we conduct an in-depth evaluation of VLMs' ability to interpret misleading visualizations. By analyzing over 16,000 responses from ten different models across eight distinct types of misleading chart designs, we demonstrate that most VLMs are deceived by them. This leads to altered interpretations of charts, despite the underlying data remaining the same. Our findings highlight the need for robust safeguards in VLMs against visual misinformation.

[25] Sample More to Think Less: Group Filtered Policy Optimization for Concise Reasoning

Vaishnavi Shrivastava,Ahmed Awadallah,Vidhisha Balachandran,Shivam Garg,Harkirat Behl,Dimitris Papailiopoulos

Main category: cs.CL

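TL;DR: This paper proposes GFPO, which curbs response length inflation at inference time while maintaining accuracy by sampling larger groups during training and filtering responses by length and token efficiency.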

Details

Motivation: Large language models trained with reinforcement learning with verifiable rewards tend to inflate response length in exchange for accuracy gains, and many of the added tokens are merely repetitive, redundant content. Method: The paper introduces GFPO (Group Filtered Policy Optimization) and Adaptive Difficulty GFPO; the model is optimised by sampling larger groups at training time and filtering responses by response length and token efficiency. Result: On the Phi-4-reasoning model, GFPO reduces GRPO's length inflation by 46-71% on challenging STEM and coding benchmarks, and optimising for reward per token further increases the reduction to 71-85%. Conclusion: GFPO effectively reduces length inflation at inference time while maintaining accuracy, and Adaptive Difficulty GFPO further improves the balance between computational efficiency and accuracy. Abstract: Large language models trained with reinforcement learning with verifiable rewards tend to trade accuracy for length--inflating response lengths to achieve gains in accuracy. While longer answers may be warranted for harder problems, many tokens are merely "filler": repetitive, verbose text that makes no real progress. We introduce GFPO (Group Filtered Policy Optimization), which curbs this length explosion by sampling larger groups per problem during training and filtering responses to train on based on two key metrics: (1) response length and (2) token efficiency: reward per token ratio. By sampling more at training time, we teach models to think less at inference time. On the Phi-4-reasoning model, GFPO cuts GRPO's length inflation by 46-71% across challenging STEM and coding benchmarks (AIME 24/25, GPQA, Omni-MATH, LiveCodeBench) while maintaining accuracy. Optimizing for reward per token further increases reductions in length inflation to 71-85%. We also propose Adaptive Difficulty GFPO, which dynamically allocates more training resources to harder problems based on real-time difficulty estimates, improving the balance between computational efficiency and accuracy especially on difficult questions. GFPO demonstrates that increased training-time compute directly translates to reduced test-time compute--a simple yet effective trade-off for efficient reasoning.
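The group-filtering step described in the abstract can be sketched as follows: sample a group of responses, keep the `k` best under a chosen metric (shortest length, or highest reward per token), and let only the kept responses carry a training signal. The abstract does not give the advantage formula, so normalising rewards over the retained subset and assigning zero advantage to filtered-out responses is an assumption here, as are the function name and the toy `(num_tokens, reward)` tuples.

```python
import statistics

def gfpo_advantages(responses, k, metric="length"):
    """Return one advantage per sampled response.

    responses: list of (num_tokens, reward) pairs for one prompt, num_tokens > 0
    k:         number of responses retained after filtering
    metric:    "length" keeps the shortest, "efficiency" keeps the highest
               reward-per-token responses
    """
    if metric == "length":
        key = lambda i: responses[i][0]
    elif metric == "efficiency":
        key = lambda i: -responses[i][1] / responses[i][0]
    else:
        raise ValueError("unknown metric")

    kept = sorted(range(len(responses)), key=key)[:k]
    rewards = [responses[i][1] for i in kept]
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)

    advantages = [0.0] * len(responses)  # filtered-out responses get no signal
    for i in kept:
        advantages[i] = (responses[i][1] - mean) / std if std > 0 else 0.0
    return advantages

# Hypothetical group of four sampled responses: (token count, verifiable reward).
group = [(100, 1.0), (400, 1.0), (50, 0.0), (300, 0.0)]
by_length = gfpo_advantages(group, k=2, metric="length")
by_efficiency = gfpo_advantages(group, k=2, metric="efficiency")
```

In the length-filtered case, the long correct response (400 tokens) receives no gradient signal at all, so the policy is only rewarded for correctness among short responses; that is the intuition behind "sample more to think less".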

[26] Transforming Questions and Documents for Semantically Aligned Retrieval-Augmented Generation

Seokgi Lee

Main category: cs.CL

TL;DR: A new RAG framework improves multihop question answering by decomposing questions and using answerable-question embeddings for retrieval.

Details

Motivation: The motivation is to address the ambiguity in multihop queries and improve retrieval accuracy by focusing on distinct knowledge facets through query decomposition and embedding generation. Method: The method involves decomposing multihop questions into single-hop subquestions using an LLM, generating answerable questions from document chunks, and using question-question similarity for retrieval. Result: The system outperforms baseline RAG systems on three multihop question datasets: MuSiQue, 2WikiMultiHopQa, and HotpotQA. Conclusion: The proposed RAG framework enhances performance in multihop question answering by leveraging answerable-question embeddings and LLM-based query decomposition. Abstract: We introduce a novel retrieval-augmented generation (RAG) framework tailored for multihop question answering. First, our system uses a large language model (LLM) to decompose complex multihop questions into a sequence of single-hop subquestions that guide document retrieval. This decomposition mitigates the ambiguity inherent in multihop queries by clearly targeting distinct knowledge facets. Second, instead of embedding raw or chunked documents directly, we generate answerable questions from each document chunk using Qwen3-8B, embed these generated questions, and retrieve relevant chunks via question-question embedding similarity. During inference, the retrieved chunks are then fed along with the original question into the RAG pipeline. We evaluate on three multihop question datasets (MuSiQue, 2WikiMultiHopQa, HotpotQA) from LongBench. Our method improves RAG performance compared to baseline systems. Our contributions highlight the benefits of using answerable-question embeddings for RAG, and the effectiveness of LLM-based query decomposition for multihop scenarios.

[27] Echoes of Agreement: Argument Driven Opinion Shifts in Large Language Models

Avneet Kaur

Main category: cs.CL

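TL;DR: This paper studies how stance-bearing arguments in the prompt affect LLM outputs, finding that models tend to align with the view expressed in the prompt.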

Details

Motivation: Existing work mainly focuses on the bias of LLMs on political topics, but how the prompt itself shapes the stance of model outputs is under-studied. Understanding how models respond to opinionated prompts is crucial for assessing the robustness of bias evaluations. Method: The paper introduces supporting and refuting arguments in single-turn and multi-turn settings and experimentally evaluates the resulting shifts in the stance of model outputs. Result: Experiments show that opinionated prompts substantially shift model outputs in the direction of the argument, and that argument strength influences the directional agreement rate of model responses. Conclusion: LLMs show a sycophantic tendency when facing opinionated prompts, adapting their stance to align with the presented view, which has important implications for measuring political bias and developing effective mitigation strategies. Abstract: There have been numerous studies evaluating bias of LLMs towards political topics. However, positions towards these topics in model outputs are highly sensitive to the prompt. What happens when the prompt itself is suggestive of certain arguments towards those positions remains underexplored. This is crucial for understanding how robust these bias evaluations are and for understanding model behaviour, as these models frequently interact with opinionated text. To that end, we conduct experiments for political bias evaluation in the presence of supporting and refuting arguments. Our experiments show that such arguments substantially alter model responses towards the direction of the provided argument in both single-turn and multi-turn settings. Moreover, we find that the strength of these arguments influences the directional agreement rate of model responses. These effects point to a sycophantic tendency in LLMs adapting their stance to align with the presented arguments, which has downstream implications for measuring political bias and developing effective mitigation strategies.

[28] UtterTune: LoRA-Based Target-Language Pronunciation Edit and Control in Multilingual Text-to-Speech

Shuhei Kato

Main category: cs.CL

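TL;DR: UtterTune is a lightweight adaptation method for LLM-based text-to-speech (TTS) systems that improves control of Japanese pronunciation and pitch accent while preserving performance in other languages.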

Details

Motivation: LLM-based TTS systems have achieved high naturalness, but pronunciation and prosody accuracy remain a challenge, particularly when grapheme-to-phoneme (G2P) conversion is handled implicitly. Method: UtterTune uses low-rank adaptation (LoRA) to enable phoneme-level control of segmental pronunciation and pitch accent for Japanese. Result: Objective and subjective evaluations confirm that UtterTune provides pronunciation control while effectively maintaining naturalness and speaker similarity. Conclusion: With its lightweight design, UtterTune improves pronunciation control for a specific language within a TTS system and works effectively in a multilingual setting. Abstract: We propose UtterTune, a lightweight adaptation method that fine-tunes a multilingual text-to-speech (TTS) system based on a large language model (LLM) architecture, designed to enhance the controllability of pronunciation in a target language while preserving performance in others. While LLM architectures have enabled TTS models to achieve remarkable naturalness, accurately modeling grapheme-to-phoneme (G2P) mapping and prosody remains challenging, especially when the model omits an explicit G2P module and directly processes minimally encoded text (e.g., byte-pair encoding). UtterTune leverages low-rank adaptation to enable the control of segmental pronunciation and pitch accent at the phoneme level for Japanese speech, the target language in this paper, while maintaining naturalness and speaker similarity in a zero-shot setting. Objective and subjective evaluations confirm its effectiveness.

[29] Can LLM-Generated Textual Explanations Enhance Model Classification Performance? An Empirical Study

Mahdi Dhaini,Juraj Vladika,Ege Erdogan,Zineb Attaoui,Gjergji Kasneci

Main category: cs.CL

TL;DR: This paper proposes an automated framework using LLMs to generate textual explanations for NLP tasks, showing they can match the effectiveness of human annotations while enabling scalability.

Details

Motivation: The motivation stems from the high cost and labor intensity of human-annotated textual explanations in NLP, which limits scalability. The study aims to explore whether LLM-generated explanations can serve as a viable alternative. Method: The study introduces an automated framework that uses state-of-the-art LLMs to generate textual explanations. These explanations are evaluated using NLG metrics, and their impact on PLMs and LLMs is analyzed in natural language inference tasks on two benchmark datasets. Result: Experiments show that automated explanations are highly competitive with human-annotated ones in improving model performance on NLP tasks. Conclusion: The study concludes that automated explanations generated by LLMs can effectively enhance NLP datasets and improve model performance, providing a scalable alternative to human annotation. Abstract: In the rapidly evolving field of Explainable Natural Language Processing (NLP), textual explanations, i.e., human-like rationales, are pivotal for explaining model predictions and enriching datasets with interpretable labels. Traditional approaches rely on human annotation, which is costly, labor-intensive, and impedes scalability. In this work, we present an automated framework that leverages multiple state-of-the-art large language models (LLMs) to generate high-quality textual explanations. We rigorously assess the quality of these LLM-generated explanations using a comprehensive suite of Natural Language Generation (NLG) metrics. Furthermore, we investigate the downstream impact of these explanations on the performance of pre-trained language models (PLMs) and LLMs across natural language inference tasks on two diverse benchmark datasets. Our experiments demonstrate that automated explanations exhibit highly competitive effectiveness compared to human-annotated explanations in improving model performance. 
Our findings underscore a promising avenue for scalable, automated LLM-based textual explanation generation for extending NLP datasets and enhancing model performance.

[30] Adoption of Explainable Natural Language Processing: Perspectives from Industry and Academia on Practices and Challenges

Mahdi Dhaini,Tobias Müller,Roksoliana Rabets,Gjergji Kasneci

Main category: cs.CL

TL;DR: The paper explores the practical adoption and effectiveness of explainable NLP methods from the perspective of industry practitioners and academic researchers, identifying conceptual gaps and the need for improved user-centric frameworks.

Details

Motivation: Despite increasing attention given to explainable NLP, practitioners' perspectives regarding its practical adoption and effectiveness remain underexplored. Method: A qualitative interview-based study with industry practitioners and complementary interviews with academic researchers. Result: Findings reveal conceptual gaps, low satisfaction with current explainability methods, and highlight evaluation challenges. Conclusion: The paper concludes that there is a need for clear definitions and user-centric frameworks for better adoption of explainable NLP in practice. Abstract: The field of explainable natural language processing (NLP) has grown rapidly in recent years. The growing opacity of complex models calls for transparency and explanations of their decisions, which is crucial to understand their reasoning and facilitate deployment, especially in high-stakes environments. Despite increasing attention given to explainable NLP, practitioners' perspectives regarding its practical adoption and effectiveness remain underexplored. This paper addresses this research gap by investigating practitioners' experiences with explainability methods, specifically focusing on their motivations for adopting such methods, the techniques employed, satisfaction levels, and the practical challenges encountered in real-world NLP applications. Through a qualitative interview-based study with industry practitioners and complementary interviews with academic researchers, we systematically analyze and compare their perspectives. Our findings reveal conceptual gaps, low satisfaction with current explainability methods, and highlight evaluation challenges. Our findings emphasize the need for clear definitions and user-centric frameworks for better adoption of explainable NLP in practice.

[31] BigCharts-R1: Enhanced Chart Reasoning with Visual Reinforcement Finetuning

Ahmed Masry,Abhay Puri,Masoud Hashemi,Juan A. Rodriguez,Megh Thakkar,Khyati Mahajan,Vikas Yadav,Sathwik Tejaswi Madhusudhan,Alexandre Piché,Dzmitry Bahdanau,Christopher Pal,David Vazquez,Enamul Hoque,Perouz Taslakian,Sai Rajeswar,Spandana Gella

Main category: cs.CL

TL;DR: The paper proposes BigCharts, a new dataset creation pipeline and training framework that significantly improves chart reasoning performance by generating visually diverse charts and integrating supervised fine-tuning with reinforcement learning.

Details

Motivation: Current vision-language models (VLMs) struggle with chart comprehension due to limitations in training datasets, including lack of diversity, real-world authenticity, and estimation errors in automatically extracted data tables. Additionally, reliance on supervised fine-tuning with low-quality datasets limits model effectiveness. Method: The authors introduced BigCharts, a dataset creation pipeline that generates visually diverse chart images by conditioning the rendering process on real-world charts. They also proposed a training framework combining supervised fine-tuning with GRPO-based reinforcement learning and novel reward signals for chart reasoning. Result: The BigCharts-R1 model outperforms existing methods on multiple chart question-answering benchmarks, even surpassing larger open-source and closed-source models. Conclusion: The proposed BigCharts dataset and training framework significantly improve chart reasoning performance, achieving state-of-the-art results on multiple benchmarks. Abstract: Charts are essential to data analysis, transforming raw data into clear visual representations that support human decision-making. Although current vision-language models (VLMs) have made significant progress, they continue to struggle with chart comprehension due to training on datasets that lack diversity and real-world authenticity, or on automatically extracted underlying data tables of charts, which can contain numerous estimation errors. Furthermore, existing models only rely on supervised fine-tuning using these low-quality datasets, severely limiting their effectiveness. To address these issues, we first propose BigCharts, a dataset creation pipeline that generates visually diverse chart images by conditioning the rendering process on real-world charts sourced from multiple online platforms. 
Unlike purely synthetic datasets, BigCharts incorporates real-world data, ensuring authenticity and visual diversity, while still retaining accurate underlying data due to our proposed replotting process. Additionally, we introduce a comprehensive training framework that integrates supervised fine-tuning with Group Relative Policy Optimization (GRPO)-based reinforcement learning. By introducing novel reward signals specifically designed for chart reasoning, our approach enhances model robustness and generalization across diverse chart styles and domains, resulting in a state-of-the-art chart reasoning model, BigCharts-R1. Extensive experiments demonstrate that our models surpass existing methods on multiple chart question-answering benchmarks compared to even larger open-source and closed-source models.
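The abstract names Group Relative Policy Optimization (GRPO) but does not spell out its reward normalization. As a rough illustration only, the sketch below shows the group-relative advantage commonly associated with GRPO: each sampled response's reward is standardized against the other responses drawn for the same prompt. The function name, the use of the population standard deviation, and the `eps` constant are assumptions for illustration, not details from the BigCharts-R1 paper.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one group of responses to the same prompt.

    Hypothetical helper: the population standard deviation and eps are
    illustrative choices, not taken from the paper.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps avoids division by zero when every response gets the same reward.
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled answers to one chart question: two correct, two incorrect.
advantages = group_relative_advantages([1.0, 0.0, 1.0, 0.0])
```

Responses that score above the group mean receive a positive advantage and those below receive a negative one, so no separate value network is needed to form a baseline.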

[32] A Comprehensive Survey of Datasets for Clinical Mental Health AI Systems

Aishik Mandal,Prottay Kumar Adhikary,Hiba Arnaout,Iryna Gurevych,Tanmoy Chakraborty

Main category: cs.CL

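TL;DR: This survey reviews the current state of clinical mental health datasets for training AI clinical assistants, emphasizing their categorization, accessibility, and cultural context. It identifies problems with current datasets, such as a lack of longitudinal data and limited cultural and linguistic representation, and offers recommendations for standardizing future datasets.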

Details

Motivation: Mental health disorders are rising worldwide while the supply of trained clinicians has not kept pace, so AI is needed to assist mental health diagnosis and treatment; high-quality clinical training datasets are key to developing reliable AI systems. Method: The paper comprehensively surveys existing clinical mental health datasets and categorizes them by mental disorder, data modality, task type, accessibility, and sociocultural context; it also examines synthetic clinical mental health data. Result: The survey identifies critical gaps in current datasets: a lack of longitudinal data, limited cultural and linguistic representation, inconsistent collection and annotation standards, and a lack of modalities in synthetic data. Conclusion: The paper outlines key challenges in curating and standardizing datasets and provides actionable recommendations to facilitate more robust, generalizable, and equitable mental health AI systems. Abstract: Mental health disorders are rising worldwide. However, the availability of trained clinicians has not scaled proportionally, leaving many people without adequate or timely support. To bridge this gap, recent studies have shown the promise of Artificial Intelligence (AI) to assist mental health diagnosis, monitoring, and intervention. However, the development of efficient, reliable, and ethical AI to assist clinicians is heavily dependent on high-quality clinical training datasets. Despite growing interest in data curation for training clinical AI assistants, existing datasets largely remain scattered, under-documented, and often inaccessible, hindering the reproducibility, comparability, and generalizability of AI models developed for clinical mental health care. In this paper, we present the first comprehensive survey of clinical mental health datasets relevant to the training and development of AI-powered clinical assistants. We categorize these datasets by mental disorders (e.g., depression, schizophrenia), data modalities (e.g., text, speech, physiological signals), task types (e.g., diagnosis prediction, symptom severity estimation, intervention generation), accessibility (public, restricted or private), and sociocultural context (e.g., language and cultural background). Along with these, we also investigate synthetic clinical mental health datasets. Our survey identifies critical gaps such as a lack of longitudinal data, limited cultural and linguistic representation, inconsistent collection and annotation standards, and a lack of modalities in synthetic data. 
We conclude by outlining key challenges in curating and standardizing future datasets and provide actionable recommendations to facilitate the development of more robust, generalizable, and equitable mental health AI systems.

[33] Speed Always Wins: A Survey on Efficient Architectures for Large Language Models

Weigao Sun,Jiaxi Hu,Yucheng Zhou,Jusen Du,Disen Lan,Kexin Wang,Tong Zhu,Xiaoye Qu,Yu Zhang,Xiaoyu Mo,Daizong Liu,Yuxuan Liang,Wenliang Chen,Guoqi Li,Yu Cheng

Main category: cs.CL

TL;DR: This survey explores innovative LLM architectures designed to overcome the computational challenges of traditional transformers, enhancing efficiency and scalability for practical deployment.

Details

Motivation: The traditional transformer architecture requires substantial computations and poses challenges for large-scale training and deployment, which motivates the exploration of more efficient LLM architectures. Method: The study systematically examines innovative LLM architectures that address the limitations of traditional transformers and improve efficiency. Result: The survey identifies and analyzes various efficient LLM techniques, including linear and sparse sequence modeling methods, efficient full attention variants, sparse mixture-of-experts, hybrid model architectures, and emerging diffusion LLMs. Conclusion: The survey presents a blueprint of modern efficient LLM architectures and aims to motivate future research toward more efficient, versatile AI systems. Abstract: Large Language Models (LLMs) have delivered impressive results in language understanding, generation, and reasoning, and have pushed the capability boundary of multimodal models. Transformer models, as the foundation of modern LLMs, offer a strong baseline with excellent scaling properties. However, the traditional transformer architecture requires substantial computations and poses significant obstacles for large-scale training and practical deployment. In this survey, we offer a systematic examination of innovative LLM architectures that address the inherent limitations of transformers and boost efficiency. Starting from language modeling, this survey covers the background and technical details of linear and sparse sequence modeling methods, efficient full attention variants, sparse mixture-of-experts, hybrid model architectures incorporating the above techniques, and emerging diffusion LLMs. Additionally, we discuss applications of these techniques to other modalities and consider their wider implications for developing scalable, resource-aware foundation models. 
By grouping recent studies into the above categories, this survey presents a blueprint of modern efficient LLM architectures, and we hope this could help motivate future research toward more efficient, versatile AI systems.

[34] PRELUDE: A Benchmark Designed to Require Global Comprehension and Reasoning over Long Contexts

Mo Yu,Tsz Ting Chung,Chulun Zhou,Tong Li,Rui Lu,Jiangnan Li,Liyan Xu,Haoshu Lu,Ning Zhang,Jing Li,Jie Zhou

Main category: cs.CL

TL;DR: The paper introduces PRELUDE, a new benchmark for evaluating long-context understanding and reasoning, revealing significant gaps between current models and human performance.

Details

Motivation: The motivation behind this study is to develop a more rigorous benchmark for evaluating long-context understanding, which demands global comprehension and deep reasoning beyond current benchmarks. Method: The study introduces PRELUDE, a benchmark for evaluating long-context understanding through the task of determining whether a character's prequel story is consistent with the original book's narrative. Result: Experimental results show that in-context learning, RAG, in-domain training with state-of-the-art LLMs, and commercial DeepResearch services lag behind humans by more than 15%. Additionally, models often produce correct answers with flawed reasoning, leading to a reasoning accuracy gap of over 30% compared to humans. Conclusion: The study concludes that there is significant room for improvement in long-context understanding and reasoning in state-of-the-art models. Abstract: We introduce PRELUDE, a benchmark for evaluating long-context understanding through the task of determining whether a character's prequel story is consistent with the canonical narrative of the original book. Our task poses a stronger demand for global comprehension and deep reasoning than existing benchmarks -- as the prequels are not part of the original story, assessing their plausibility typically requires searching and integrating information that is only indirectly related. Empirically, 88% of instances require evidence from multiple parts of the narrative. Experimental results highlight the challenge of our task: in-context learning, RAG and in-domain training with state-of-the-art LLMs, and commercial DeepResearch services, lag behind humans by >15%. A further human study reveals that models often produce correct answers with flawed reasoning, leading to an over 30% gap in reasoning accuracy compared to humans. These findings underscore the substantial room for improvement in long-context understanding and reasoning.

[35] Assessing the Feasibility of Lightweight Whisper Models for Low-Resource Urdu Transcription

Abdul Rehman Antall,Naveed Akhtar

Main category: cs.CL

TL;DR: This study evaluates the feasibility of using lightweight Whisper models for Urdu speech recognition in low-resource settings, finding that Whisper-Small performs best but still faces significant challenges.

Details

Motivation: Urdu, despite being the 10th most spoken language globally with over 230 million speakers, has limited representation in automatic speech recognition systems due to dialectal diversity, code-switching, and sparse training data. Method: Benchmarked lightweight Whisper models (Tiny, Base, Small) on a curated Urdu dataset using word error rate (WER) without fine-tuning. Result: Whisper-Small achieved the lowest WER of 33.68%, outperforming Whisper-Tiny (67.08% WER) and Whisper-Base (53.67% WER). Conclusion: Whisper-Small demonstrates promise for deployable Urdu ASR, but significant gaps remain in phonetic accuracy and lexical coherence, emphasizing the need for future research into effective low-resource ASR systems. Abstract: This study evaluates the feasibility of lightweight Whisper models (Tiny, Base, Small) for Urdu speech recognition in low-resource settings. Despite Urdu being the 10th most spoken language globally with over 230 million speakers, its representation in automatic speech recognition (ASR) systems remains limited due to dialectal diversity, code-switching, and sparse training data. We benchmark these models on a curated Urdu dataset using word error rate (WER), without fine-tuning. Results show Whisper-Small achieves the lowest error rates (33.68% WER), outperforming Tiny (67.08% WER) and Base (53.67% WER). Qualitative analysis reveals persistent challenges in phonetic accuracy and lexical coherence, particularly for complex utterances. While Whisper-Small demonstrates promise for deployable Urdu ASR, significant gaps remain. Our findings lay the groundwork for future research into effective, low-resource ASR systems.
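For readers unfamiliar with the metric, word error rate is the word-level edit distance (substitutions, deletions, insertions) between a reference transcript and a model's hypothesis, divided by the number of reference words. The sketch below is a minimal illustration of that definition; the function name and the whitespace tokenization are assumptions, and the paper's actual text normalization for Urdu is not described in the abstract.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = word-level Levenshtein distance / number of reference words.

    Hypothetical helper: splits on whitespace only, with no normalization.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1      # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


# One substitution ("b" -> "x") and one deletion ("d") over four reference words.
example = word_error_rate("a b c d", "a x c")
```

Because insertions count as errors but do not enlarge the denominator, WER can exceed 100% for very poor transcripts.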

[36] Memory Decoder: A Pretrained, Plug-and-Play Memory for Large Language Models

Jiaqi Cao,Jiarui Wang,Rubin Wei,Qipeng Guo,Kai Chen,Bowen Zhou,Zhouhan Lin

Main category: cs.CL

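TL;DR: This paper proposes Memory Decoder, an efficient domain adaptation method that improves the domain-specific performance of multiple pretrained language models in a plug-and-play manner, without modifying model parameters.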

Details

Motivation: Existing domain adaptation methods such as DAPT require costly full-parameter training and suffer from catastrophic forgetting, while RAG introduces substantial inference latency due to nearest-neighbor search and longer contexts. A more efficient domain adaptation method is therefore needed. Method: Memory Decoder uses a small transformer decoder that learns to imitate the behavior of an external non-parametric retriever, enabling efficient domain adaptation without modifying the original model's parameters. Result: Experiments show that Memory Decoder effectively adapts various Qwen and Llama models to three specialized domains (biomedicine, finance, and law), reducing perplexity by an average of 6.17 points. Conclusion: Memory Decoder offers a novel domain adaptation paradigm centered on a pretrained memory that can be integrated in a plug-and-play manner into any pretrained language model sharing the same tokenizer, significantly improving performance in the target domain. Abstract: Large Language Models (LLMs) have shown strong abilities in general language tasks, yet adapting them to specific domains remains a challenge. Current methods such as Domain Adaptive Pretraining (DAPT) require costly full-parameter training and suffer from catastrophic forgetting. Meanwhile, Retrieval-Augmented Generation (RAG) introduces substantial inference latency due to expensive nearest-neighbor searches and longer context. This paper introduces Memory Decoder, a plug-and-play pretrained memory that enables efficient domain adaptation without changing the original model's parameters. Memory Decoder employs a small transformer decoder that learns to imitate the behavior of an external non-parametric retriever. Once trained, Memory Decoder can be seamlessly integrated with any pretrained language model that shares the same tokenizer, requiring no model-specific modifications. Experimental results demonstrate that Memory Decoder enables effective adaptation of various Qwen and Llama models to three distinct specialized domains: biomedicine, finance, and law, reducing perplexity by an average of 6.17 points. Overall, Memory Decoder introduces a novel paradigm centered on a specially pretrained memory component designed for domain-specific adaptation. This memory architecture can be integrated in a plug-and-play manner, consistently enhancing performance across multiple models within the target domain.

[37] A Survey of Cognitive Distortion Detection and Classification in NLP

Archie Sage,Jeroen Keppens,Helen Yannakoudakis

Main category: cs.CL

TL;DR: The paper surveys 38 studies on NLP techniques for detecting cognitive distortions in mental health, identifies inconsistencies in the field, and proposes a consolidated taxonomy and challenges for more coherent research.

Details

Motivation: The motivation is to address the fragmentation and inconsistencies in the field of applying NLP to mental health, specifically in detecting cognitive distortions, and to provide a foundation for more coherent and reproducible research. Method: The paper conducts a survey of 38 studies over two decades, offering a structured review of datasets, modeling approaches, and evaluation strategies related to the automatic detection of cognitive distortions. Result: The paper provides a consolidated taxonomy for cognitive distortions, summarizes common task setups, and highlights open challenges in the field. Conclusion: This paper concludes that while there is growing interest and momentum in applying NLP techniques to detect and classify cognitive distortions, the field remains fragmented with inconsistencies in taxonomies, task formulations, and evaluation practices. The authors emphasize the need for coherence and reproducibility moving forward. Abstract: As interest grows in the application of natural language processing (NLP) techniques to mental health, a growing body of work explores the automatic detection and classification of cognitive distortions (CDs). CDs are habitual patterns of negatively biased or flawed thinking that distort how people perceive events, judge themselves, and react to the world around them. Identifying and addressing them is an important part of therapy. Despite its momentum, the field remains fragmented, with inconsistencies in CD taxonomies, task formulations, and evaluation practices. This survey reviews 38 studies spanning two decades, providing a structured overview of datasets, modelling approaches, and evaluation strategies. We provide a consolidated CD taxonomy reference, summarise common task setups, and highlight open challenges to support more coherent and reproducible research in this emerging area.

[38] Language of Persuasion and Misrepresentation in Business Communication: A Textual Detection Approach

Sayem Hossen,Monalisa Moon Joti,Md. Golam Rashed

Main category: cs.CL

TL;DR: This paper explores how deceptive language in digitized business communication can be systematically detected using a blend of rhetorical theory, psychology, and AI, achieving high accuracy in controlled environments but facing challenges in multilingual contexts.

Details

Motivation: The digitization of business communication has transformed persuasive discourse, enabling both transparency and deception. This paper aims to address the need for systematic detection of deceptive language, especially as AI-driven discourse becomes more sophisticated. Method: The study synthesizes classical rhetoric, communication psychology, linguistic theory, and empirical studies across financial reporting, sustainability discourse, and digital marketing to systematically detect deceptive language using persuasive lexicon. Computational textual analysis and personalized transformer models are employed in controlled settings. Result: The study achieved detection accuracies of over 99% in controlled settings using computational textual analysis and personalized transformer models. However, reproducing this performance across multilingual settings is problematic due to data scarcity and lack of multilingual text-processing infrastructures. Conclusion: There is a growing gap between theoretical and empirical communication representations, necessitating robust AI-based text-identification systems as AI-driven discourse becomes more human-like. Abstract: Business communication digitisation has reorganised the process of persuasive discourse, which allows not only greater transparency but also advanced deception. This inquiry synthesises classical rhetoric and communication psychology with linguistic theory and empirical studies in the financial reporting, sustainability discourse, and digital marketing to explain how deceptive language can be systematically detected using persuasive lexicon. In controlled settings, detection accuracies of greater than 99% were achieved by using computational textual analysis as well as personalised transformer models. 
However, reproducing this performance in multilingual settings is also problematic and, to a large extent, this is because it is not easy to find sufficient data, and because few multilingual text-processing infrastructures are in place. This evidence shows that there has been an increasing gap between the theoretical representations of communication and those empirically approximated, and therefore, there is a need to have strong automatic text-identification systems where AI-based discourse is becoming more realistic in communicating with humans.

[39] A Comprehensive Evaluation framework of Alignment Techniques for LLMs

Muneeza Azmat,Momin Abbas,Maysa Malfiza Garcia de Macedo,Marcelo Carpinette Grave,Luan Soares de Souza,Tiago Machado,Rogerio A de Paula,Raya Horesh,Yixin Chen,Heloisa Caroline de Souza Pereira Candello,Rebecka Nordenlow,Aminat Adebiyi

Main category: cs.CL

TL;DR: This paper introduces a multi-dimensional evaluation framework to systematically compare alignment techniques for Large Language Models, identifying their strengths, limitations, and guiding future research.

Details

Motivation: The motivation stems from the increasing integration of Large Language Models into real-world applications and the critical need to ensure their outputs align with human values and safety standards. Method: The paper proposes a comprehensive evaluation framework that systematically compares alignment techniques across four key dimensions: alignment detection, alignment quality, computational efficiency, and robustness. Result: The experiments conducted across diverse base models and alignment strategies demonstrate the utility of the proposed framework in systematically evaluating alignment paradigms. Conclusion: The paper concludes that a multi-dimensional evaluation framework can effectively identify the strengths and limitations of current alignment techniques for LLMs, offering insights for future research. Abstract: As Large Language Models (LLMs) become increasingly integrated into real-world applications, ensuring their outputs align with human values and safety standards has become critical. The field has developed diverse alignment approaches including traditional fine-tuning methods (RLHF, instruction tuning), post-hoc correction systems, and inference-time interventions, each with distinct advantages and limitations. However, the lack of unified evaluation frameworks makes it difficult to systematically compare these paradigms and guide deployment decisions. This paper introduces a multi-dimensional evaluation of alignment techniques for LLMs, a comprehensive evaluation framework that provides a systematic comparison across all major alignment paradigms. Our framework assesses methods along four key dimensions: alignment detection, alignment quality, computational efficiency, and robustness. Through experiments across diverse base models and alignment strategies, we demonstrate the utility of our framework in identifying strengths and limitations of current state-of-the-art models, providing valuable insights for future research directions.

[40] VisCodex: Unified Multimodal Code Generation via Merging Vision and Coding Models

Lingjie Jiang,Shaohan Huang,Xun Wu,Yixia Li,Dongdong Zhang,Furu Wei

Main category: cs.CL

TL;DR: VisCodex merges vision and coding language models to achieve significant performance gains on multimodal code generation tasks.

Details

Motivation: Multimodal large language models (MLLMs) have made significant progress in integrating visual and textual understanding, but their ability to generate code from multimodal inputs remains limited. Method: Using a task vector-based model merging technique, the authors integrate a state-of-the-art coding LLM into a strong vision-language backbone, and they introduce a new Multimodal Coding Dataset (MCD) and a challenging benchmark, InfiBench-V. Result: VisCodex achieves state-of-the-art performance among open-source MLLMs and approaches proprietary models such as GPT-4o, demonstrating the effectiveness of the model merging strategy and the new datasets. Conclusion: VisCodex is an effective framework that, by merging vision and coding language models, shows strong multimodal code generation ability, with performance approaching proprietary models such as GPT-4o. Abstract: Multimodal large language models (MLLMs) have significantly advanced the integration of visual and textual understanding. However, their ability to generate code from multimodal inputs remains limited. In this work, we introduce VisCodex, a unified framework that seamlessly merges vision and coding language models to empower MLLMs with strong multimodal code generation abilities. Leveraging a task vector-based model merging technique, we integrate a state-of-the-art coding LLM into a strong vision-language backbone, while preserving both visual comprehension and advanced coding skills. To support training and evaluation, we introduce the Multimodal Coding Dataset (MCD), a large-scale and diverse collection of 598k samples, including high-quality HTML code, chart image-code pairs, image-augmented StackOverflow QA, and algorithmic problems. Furthermore, we propose InfiBench-V, a novel and challenging benchmark specifically designed to assess models on visually-rich, real-world programming questions that demand a nuanced understanding of both textual and visual contexts. Extensive experiments show that VisCodex achieves state-of-the-art performance among open-source MLLMs and approaches proprietary models like GPT-4o, highlighting the effectiveness of our model merging strategy and new datasets.
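The abstract does not give the merging formula, so the following is only a generic task-arithmetic sketch of what "task vector-based model merging" usually means: a task vector is the parameter difference between a fine-tuned model and the base model both were derived from, and merged weights are the base plus a weighted sum of task vectors. The function name, the toy parameter dictionaries, and the weights `alpha` and `beta` are assumptions for illustration, not the VisCodex recipe.

```python
def merge_with_task_vectors(base, vision, code, alpha=1.0, beta=0.5):
    """Merge two fine-tuned models that share one base via task vectors.

    Hypothetical helper. Each model is a dict mapping a parameter name to a
    flat list of floats; all three dicts must have the same keys and shapes.
    merged = base + alpha * (vision - base) + beta * (code - base)
    """
    merged = {}
    for name, base_values in base.items():
        merged[name] = [
            b + alpha * (v - b) + beta * (c - b)
            for b, v, c in zip(base_values, vision[name], code[name])
        ]
    return merged


# Toy two-parameter "models" standing in for real weight tensors.
base_model = {"w": [1.0, 2.0]}
vision_model = {"w": [2.0, 2.0]}   # changed the first parameter
coding_model = {"w": [1.0, 4.0]}   # changed the second parameter
merged_model = merge_with_task_vectors(base_model, vision_model, coding_model)
```

With `alpha=1.0` the vision model's change is kept in full, while `beta` scales how much of the coding model's change is added on top; in practice such weights are tuned on held-out data.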

[41] Specialised or Generic? Tokenization Choices for Radiology Language Models

Hermione Warr,Wentian Xu,Harry Anthony,Yasin Ibrahim,Daniel McGowan,Konstantinos Kamnitsas

Main category: cs.CL

TL;DR: This paper explores the impact of using different vocabularies in language models for radiology report summarization, showing that domain-specific tokenizers improve performance and reduce computational demands.

Details

Motivation: The impact of vocabulary used by language models in radiology remains under-explored. Method: The authors systematically compared general, medical, and domain-specific tokenizers on radiology report summarization across three imaging modalities, with and without pre-training on PubMed abstracts. Result: Medical and domain-specific vocabularies outperformed natural language alternatives when models were trained from scratch. Pre-training mitigated performance differences, while domain-specific tokenizers achieved the best results and reduced memory requirements. Conclusion: Adapting language model vocabularies to the clinical domain provides practical benefits such as improved performance and reduced computational demands, making them more effective for healthcare applications. Abstract: The vocabulary used by language models (LM) - defined by the tokenizer - plays a key role in text generation quality. However, its impact remains under-explored in radiology. In this work, we address this gap by systematically comparing general, medical, and domain-specific tokenizers on the task of radiology report summarisation across three imaging modalities. We also investigate scenarios with and without LM pre-training on PubMed abstracts. Our findings demonstrate that medical and domain-specific vocabularies outperformed widely used natural language alternatives when models are trained from scratch. Pre-training partially mitigates performance differences between tokenizers, whilst the domain-specific tokenizers achieve the most favourable results. Domain-specific tokenizers also reduce memory requirements due to smaller vocabularies and shorter sequences. These results demonstrate that adapting the vocabulary of LMs to the clinical domain provides practical benefits, including improved performance and reduced computational demands, making such models more accessible and effective for both research and real-world healthcare settings.

[42] Shaping Event Backstories to Estimate Potential Emotion Contexts

Johannes Schäfer,Roman Klinger

Main category: cs.CL

TL;DR: This paper proposes a novel approach that adds reasonable contexts to event descriptions to understand whether these enriched contexts enable human annotators to annotate emotions more reliably.

Details

Motivation: Emotion analysis is an inherently ambiguous task. Previous work studied annotator properties to explain disagreement, but this overlooks the possibility that ambiguity may stem from missing information about the context of events. Method: The authors disambiguate a target event description by automatically generating multiple event chains conditioned on differing emotions. Result: Automatic and human evaluation shows that contextual narratives enhance the interpretation of specific emotions and support annotators in producing more consistent annotations. Conclusion: Contextual narratives enhance the interpretation of specific emotions and support annotators in producing more consistent annotations. Abstract: Emotion analysis is an inherently ambiguous task. Previous work studied annotator properties to explain disagreement, but this overlooks the possibility that ambiguity may stem from missing information about the context of events. In this paper, we propose a novel approach that adds reasonable contexts to event descriptions, which may better explain a particular situation. Our goal is to understand whether these enriched contexts enable human annotators to annotate emotions more reliably. We disambiguate a target event description by automatically generating multiple event chains conditioned on differing emotions. By combining techniques from short story generation in various settings, we achieve coherent narratives that result in a specialized dataset for the first comprehensive and systematic examination of contextualized emotion analysis. Through automatic and human evaluation, we find that contextual narratives enhance the interpretation of specific emotions and support annotators in producing more consistent annotations.

[43] Performance of GPT-5 Frontier Models in Ophthalmology Question Answering

Fares Antaki,David Mikhail,Daniel Milad,Danny A Mammo,Sumit Sharma,Sunil K Srivastava,Bing Yu Chen,Samir Touma,Mertcan Sevgi,Jonathan El-Khoury,Pearse A Keane,Qingyu Chen,Yih Chung Tham,Renaud Duval

Main category: cs.CL

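TL;DR: The GPT-5 series performs strongly on ophthalmology question answering: GPT-5-high ranks first in accuracy and rationale quality, while GPT-5-mini-low offers the best cost-effectiveness.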

Details

Motivation: The study aims to evaluate the performance of the latest generation of reasoning models, the GPT-5 series, on complex medical question-answering tasks, and in particular to find the configurations that are best in accuracy and cost-efficiency. Method: Twelve configurations of OpenAI's GPT-5 series were evaluated alongside o1-high, o3-high, and GPT-4o on 260 closed-access multiple-choice questions from the American Academy of Ophthalmology Basic Clinical Science Course (BCSC) dataset. The primary outcome was multiple-choice accuracy; secondary outcomes included head-to-head ranking via a Bradley-Terry model, rationale quality assessment using a reference-anchored, pairwise LLM-as-a-judge framework, and analysis of accuracy-cost trade-offs using token-based cost estimates. Result: GPT-5-high achieved the highest accuracy (0.965; 95% CI, 0.942-0.985), outperforming all GPT-5-nano variants (P < .001), o1-high (P = .04), and GPT-4o (P < .001), but not o3-high (0.958; 95% CI, 0.931-0.981). GPT-5-high ranked first in both accuracy and rationale quality, while GPT-5-mini-low offered the best cost-effectiveness. Conclusion: The GPT-5 series performs strongly on ophthalmology question answering, with GPT-5-high ranking highest in accuracy and rationale quality and GPT-5-mini-low offering the best cost-effectiveness. The results benchmark GPT-5 on a high-quality ophthalmology dataset and demonstrate the influence of reasoning effort on accuracy. Abstract: Large language models (LLMs) such as GPT-5 integrate advanced reasoning capabilities that may improve performance on complex medical question-answering tasks. For this latest generation of reasoning models, the configurations that maximize both accuracy and cost-efficiency have yet to be established. We evaluated 12 configurations of OpenAI's GPT-5 series (three model tiers across four reasoning effort settings) alongside o1-high, o3-high, and GPT-4o, using 260 closed-access multiple-choice questions from the American Academy of Ophthalmology Basic Clinical Science Course (BCSC) dataset. The primary outcome was multiple-choice accuracy; secondary outcomes included head-to-head ranking via a Bradley-Terry model, rationale quality assessment using a reference-anchored, pairwise LLM-as-a-judge framework, and analysis of accuracy-cost trade-offs using token-based cost estimates. GPT-5-high achieved the highest accuracy (0.965; 95% CI, 0.942-0.985), outperforming all GPT-5-nano variants (P < .001), o1-high (P = .04), and GPT-4o (P < .001), but not o3-high (0.958; 95% CI, 0.931-0.981). GPT-5-high ranked first in both accuracy (1.66x stronger than o3-high) and rationale quality (1.11x stronger than o3-high). 
Cost-accuracy analysis identified several GPT-5 configurations on the Pareto frontier, with GPT-5-mini-low offering the most favorable low-cost, high-performance balance. These results benchmark GPT-5 on a high-quality ophthalmology dataset, demonstrate the influence of reasoning effort on accuracy, and introduce an autograder framework for scalable evaluation of LLM-generated answers against reference standards in ophthalmology.
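The head-to-head ranking mentioned in this entry relies on a Bradley-Terry model, in which each competitor i has a positive strength p_i and beats competitor j with probability p_i / (p_i + p_j); a statement such as "1.66x stronger" can be read as a ratio of two fitted strengths. The sketch below fits strengths from a pairwise win-count matrix with the standard minorization-maximization update. The function name, the iteration count, and the toy win counts are assumptions for illustration; the paper's own fitting procedure is not described in the abstract.

```python
def bradley_terry(wins, iterations=200):
    """Fit Bradley-Terry strengths from pairwise win counts.

    Hypothetical helper. wins[i][j] is how often i beat j (diagonal is 0).
    Assumes every competitor has at least one win and the comparison graph
    is connected; otherwise the estimates degenerate.
    """
    n = len(wins)
    strengths = [1.0 / n] * n
    for _ in range(iterations):
        updated = []
        for i in range(n):
            total_wins = sum(wins[i])
            # Each pairing contributes (games played) / (p_i + p_j).
            denom = sum(
                (wins[i][j] + wins[j][i]) / (strengths[i] + strengths[j])
                for j in range(n)
                if j != i
            )
            updated.append(total_wins / denom if denom > 0 else strengths[i])
        norm = sum(updated)
        strengths = [s / norm for s in updated]  # keep strengths summing to 1
    return strengths


# Toy data: model A beat model B three times and lost once.
p_a, p_b = bradley_terry([[0, 3], [1, 0]])
```

In this two-competitor case the fitted ratio `p_a / p_b` equals the observed win ratio of 3, which matches the closed-form maximum-likelihood estimate.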

[44] Which one Performs Better? Wav2Vec or Whisper? Applying both in Badini Kurdish Speech to Text (BKSTT)

Renas Adnan,Hossein Hassani

Main category: cs.CL

TL;DR: This study develops and evaluates speech-to-text models for the underrepresented Badini Kurdish dialect, showing that the Wav2Vec2 model significantly outperforms Whisper in accuracy and readability.

Details

Motivation: To address the lack of speech-to-text (STT) systems for the Badini Kurdish dialect, despite its significant speaker base, and promote its use in technology while increasing its global visibility. Method: A dataset of Badini Kurdish was created using children's stories, narrated by six narrators, resulting in 17 hours of audio recordings. After preprocessing, 15 hours of speech data were used to train and evaluate the Wav2Vec2-Large-XLSR-53 and Whisper-small language models. Result: The Wav2Vec2-Large-XLSR-53 model achieved 90.38% readability and 82.67% accuracy, compared to 65.45% readability and 53.17% accuracy for the Whisper-small model. Conclusion: The study concludes that the Wav2Vec2-Large-XLSR-53 model significantly outperforms the Whisper-small model in terms of transcription accuracy and readability for the Badini Kurdish dialect, highlighting its potential for practical applications. Abstract: Speech-to-text (STT) systems have a wide range of applications. They are available in many languages, albeit at different quality levels. Although Kurdish is considered a less-resourced language from a processing perspective, STT is available for some of the Kurdish dialects, for instance, Sorani (Central Kurdish). However, that is not the case for other Kurdish dialects, Badini and Hawrami, for example. This research is an attempt to address this gap. Badini has approximately two million speakers, and STT systems can help their community use mobile and computer-based technologies while giving their dialect more global visibility. We aim to create a language model based on Badini's speech and evaluate its performance. To cover a conversational aspect, have a proper confidence level of grammatical accuracy, and ready transcriptions, we chose Badini kids' stories, eight books including 78 stories, as the textual input. Six narrators narrated the books, which resulted in approximately 17 hours of recording. We cleaned, segmented, and tokenized the input. 
The preprocessing produced nearly 15 hours of speech, including 19193 segments and 25221 words. We used Wav2Vec2-Large-XLSR-53 and Whisper-small to develop the language models. The experiments indicate that the transcriptions process based on the Wav2Vec2-Large-XLSR-53 model provides a significantly more accurate and readable output than the Whisper-small model, with 90.38% and 65.45% readability, and 82.67% and 53.17% accuracy, respectively.

[45] Neural Bandit Based Optimal LLM Selection for a Pipeline of Tasks

Baran Atalar,Eddie Zhang,Carlee Joe-Wong

Main category: cs.CL

TL;DR: This paper proposes a neural contextual bandit-based algorithm to dynamically select sequences of large language models (LLMs) for complex tasks, where the output of each LLM influences subsequent subtasks. The method effectively improves task success rates and cost efficiency without requiring prior LLM performance data.

Details

Motivation: As LLMs become more popular and customizable, the need for efficient strategies to select sequences of LLMs for specialized and complex tasks increases, especially when subtask outputs influence downstream performance. Method: A neural contextual bandit-based algorithm was developed to model LLM success on subtasks in an online manner, enabling the selection of optimal LLM sequences without relying on historical performance data. Result: Experiments on telecommunications question answering and medical diagnosis prediction datasets demonstrated that the proposed approach outperforms existing LLM selection algorithms in handling complex, dependent subtasks. Conclusion: The study concludes that the proposed neural contextual bandit-based algorithm effectively selects sequences of LLMs for complex tasks, outperforming other LLM selection algorithms in terms of success rate and cost efficiency. Abstract: With the increasing popularity of large language models (LLMs) for a variety of tasks, there has been a growing interest in strategies that can predict which out of a set of LLMs will yield a successful answer at low cost. This problem promises to become more and more relevant as providers like Microsoft allow users to easily create custom LLM "assistants" specialized to particular types of queries. However, some tasks (i.e., queries) may be too specialized and difficult for a single LLM to handle alone. These applications often benefit from breaking down the task into smaller subtasks, each of which can then be executed by an LLM expected to perform well on that specific subtask. For example, in extracting a diagnosis from medical records, one can first select an LLM to summarize the record, select another to validate the summary, and then select another, possibly different, LLM to extract the diagnosis from the summarized record. 
Unlike existing LLM selection or routing algorithms, this setting requires that we select a sequence of LLMs, with the output of each LLM feeding into the next and potentially influencing its success. Thus, unlike single LLM selection, the quality of each subtask's output directly affects the inputs, and hence the cost and success rate, of downstream LLMs, creating complex performance dependencies that must be learned and accounted for during selection. We propose a neural contextual bandit-based algorithm that trains neural networks that model LLM success on each subtask in an online manner, thus learning to guide the LLM selections for the different subtasks, even in the absence of historical LLM performance data. Experiments on telecommunications question answering and medical diagnosis prediction datasets illustrate the effectiveness of our proposed approach compared to other LLM selection algorithms.
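The sequential selection loop described in this abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' algorithm: a small logistic model stands in for the neural network, epsilon-greedy exploration stands in for whatever exploration rule the paper uses, and the names `OnlineSuccessModel`, `select_llm`, `run_pipeline`, and the `simulate` feedback callback are all hypothetical.

```python
import math
import random


class OnlineSuccessModel:
    """Logistic model of P(success | context) for one (subtask, LLM) pair.

    Assumption: a linear-logistic model replaces the paper's neural network.
    """

    def __init__(self, dim, lr=0.1):
        self.w = [0.0] * dim
        self.lr = lr

    def predict(self, x):
        z = sum(wi * xi for wi, xi in zip(self.w, x))
        return 1.0 / (1.0 + math.exp(-z))

    def update(self, x, success):
        # One online gradient step on the log-likelihood of the observed outcome.
        err = success - self.predict(x)
        self.w = [wi + self.lr * err * xi for wi, xi in zip(self.w, x)]


def select_llm(models, x, costs, cost_weight, epsilon, rng):
    """Epsilon-greedy choice trading predicted success against cost (assumed rule)."""
    if rng.random() < epsilon:
        return rng.randrange(len(models))
    scores = [m.predict(x) - cost_weight * c for m, c in zip(models, costs)]
    return max(range(len(models)), key=scores.__getitem__)


def run_pipeline(stage_models, x, costs, simulate, rng, epsilon=0.1, cost_weight=0.1):
    """Pick one LLM per subtask; each outcome is written into the next stage's context."""
    choices = []
    for stage, models in enumerate(stage_models):
        arm = select_llm(models, x, costs, cost_weight, epsilon, rng)
        success = simulate(stage, arm, x)  # hypothetical environment feedback: 0 or 1
        models[arm].update(x, success)
        choices.append(arm)
        # The last context feature carries the upstream outcome downstream.
        x = x[:-1] + [float(success)]
    return choices
```

The point of the sketch is the dependency structure: the context seen at each stage depends on what the previously chosen LLM produced, so the per-stage models are trained on inputs shaped by earlier choices.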

cs.CV [Back]

[46] A Context-aware Attention and Graph Neural Network-based Multimodal Framework for Misogyny Detection

Mohammad Zia Ur Rehman,Sufyaan Zahoor,Areeb Manzoor,Musharaf Maqbool,Nagendra Kumar

Main category: cs.CV

TL;DR: This paper proposes a specialized multimodal framework for detecting sexist and misogynistic content on social media, combining attention mechanisms, graph-based feature refinement, and content-specific learning for improved performance.

Details

Motivation: General offensive content detection struggles with identifying misogynistic content, necessitating tailored solutions. Method: A framework with three modules: MANM, GFRM, and CFLM, along with misogyny-specific lexicon scoring and test-time augmentation. Result: The method achieved an average improvement of 10.17% and 8.88% in macro-F1 scores on the MAMI and MMHS150K datasets, respectively. Conclusion: The proposed multimodal framework outperforms existing methods in detecting misogynistic and sexist content on social media. Abstract: A substantial portion of offensive content on social media is directed towards women. Since the approaches for general offensive content detection face a challenge in detecting misogynistic content, it requires solutions tailored to address offensive content against women. To this end, we propose a novel multimodal framework for the detection of misogynistic and sexist content. The framework comprises three modules: the Multimodal Attention module (MANM), the Graph-based Feature Reconstruction Module (GFRM), and the Content-specific Features Learning Module (CFLM). The MANM employs adaptive gating-based multimodal context-aware attention, enabling the model to focus on relevant visual and textual information and generating contextually relevant features. The GFRM module utilizes graphs to refine features within individual modalities, while the CFLM focuses on learning text and image-specific features such as toxicity features and caption features. Additionally, we curate a set of misogynous lexicons to compute the misogyny-specific lexicon score from the text. We apply test-time augmentation in feature space to better generalize the predictions on diverse inputs. The performance of the proposed approach has been evaluated on two multimodal datasets, MAMI and MMHS150K, with 11,000 and 13,494 samples, respectively. 
The proposed method demonstrates an average improvement of 10.17% and 8.88% in macro-F1 over existing methods on the MAMI and MMHS150K datasets, respectively.

[47] IAD-R1: Reinforcing Consistent Reasoning in Industrial Anomaly Detection

Yanhui Li,Yunkang Cao,Chengliang Liu,Yuan Xiong,Xinghui Dong,Chao Huang

Main category: cs.CV

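TL;DR: This paper proposes IAD-R1, a universal post-training framework for improving vision-language models on industrial anomaly detection. Through a two-stage training strategy, IAD-R1 substantially improves the models' anomaly perception and interpretation capabilities, performs strongly on multiple benchmark datasets, and even surpasses current mainstream commercial models.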

Details

Motivation: Industrial anomaly detection is critical in modern manufacturing, but the scarcity of defective samples limits the applicability of traditional detection methods. Although vision-language models (VLMs) offer significant advantages in generalization, their performance in industrial anomaly detection remains limited. A universal method is therefore needed to strengthen VLMs in this domain. Method: IAD-R1 uses a two-stage training strategy. The first stage, Perception Activation Supervised Fine-Tuning (PA-SFT), uses a carefully constructed high-quality Chain-of-Thought dataset (Expert-AD) to improve the model's anomaly perception. The second stage, Structured Control Group Relative Policy Optimization (SC-GRPO), uses carefully designed reward functions to achieve a capability leap from "Anomaly Perception" to "Anomaly Interpretation". Result: Experiments show that IAD-R1 yields significant improvements on 7 VLMs of different architectures and parameter scales, with a gain of up to 43.3% in average accuracy on 6 industrial anomaly detection benchmark datasets. In addition, a 0.5B-parameter model trained with IAD-R1 surpasses commercial models such as GPT-4.1 and Claude-Sonnet-4 in zero-shot settings. Conclusion: IAD-R1 is a universal post-training framework that substantially improves vision-language models on industrial anomaly detection, performs strongly across multiple benchmark datasets, and surpasses commercial models including GPT-4.1 and Claude-Sonnet-4. Abstract: Industrial anomaly detection is a critical component of modern manufacturing, yet the scarcity of defective samples restricts traditional detection methods to scenario-specific applications. Although Vision-Language Models (VLMs) demonstrate significant advantages in generalization capabilities, their performance in industrial anomaly detection remains limited. To address this challenge, we propose IAD-R1, a universal post-training framework applicable to VLMs of different architectures and parameter scales, which substantially enhances their anomaly detection capabilities. IAD-R1 employs a two-stage training strategy: the Perception Activation Supervised Fine-Tuning (PA-SFT) stage utilizes a meticulously constructed high-quality Chain-of-Thought dataset (Expert-AD) for training, enhancing anomaly perception capabilities and establishing reasoning-to-answer correlations; the Structured Control Group Relative Policy Optimization (SC-GRPO) stage employs carefully designed reward functions to achieve a capability leap from "Anomaly Perception" to "Anomaly Interpretation". Experimental results demonstrate that IAD-R1 achieves significant improvements across 7 VLMs, attaining up to 43.3% enhancement in average accuracy on 6 industrial anomaly detection benchmark datasets. 
Notably, the 0.5B parameter model trained with IAD-R1 surpasses commercial models including GPT-4.1 and Claude-Sonnet-4 in zero-shot settings, demonstrating the effectiveness and superiority of IAD-R1. The dataset, code, and all model weights will be publicly available at https://github.com/Yanhui-Lee/IAD-R1.

[48] A Neurosymbolic Framework for Interpretable Cognitive Attack Detection in Augmented Reality

Rongqian Chen,Allison Andreyev,Yanming Xiu,Mahdi Imani,Bin Li,Maria Gorlatova,Gang Tan,Tian Lan

Main category: cs.CV

TL;DR: This paper introduces CADAR, a neurosymbolic method for detecting cognitive attacks in augmented reality by combining vision-language models and particle filtering, offering improved accuracy and interpretability over existing approaches.

Details

Motivation: Current AR cognitive attack detection methods focus on visual changes at the pixel- or image-level without semantic reasoning, or rely on black-box vision-language models with limited interpretability. This limits their effectiveness in complex AR attack scenarios. Method: CADAR uses a neurosymbolic approach that integrates multimodal vision-language inputs through neural vision-language models (VLMs) to generate a symbolic perception-graph representation. This is followed by particle-filter based statistical reasoning for cognitive attack detection. Result: Experiments on an extended AR cognitive attack dataset showed that CADAR improved accuracy by up to 10.7% over strong baselines in challenging AR attack scenarios. Conclusion: CADAR provides a promising approach for cognitive attack detection in AR by combining the adaptability of pre-trained vision-language models with the interpretability and reasoning capabilities of particle filtering, achieving higher accuracy in challenging attack scenarios. Abstract: Augmented Reality (AR) enriches perception by overlaying virtual elements on the physical world. Due to its growing popularity, cognitive attacks that alter AR content to manipulate users' semantic perception have received increasing attention. Existing detection methods often focus on visual changes, which are restricted to pixel- or image-level processing and lack semantic reasoning capabilities, or they rely on pre-trained vision-language models (VLMs), which function as black-box approaches with limited interpretability. In this paper, we present CADAR, a novel neurosymbolic approach for cognitive attack detection in AR. It fuses multimodal vision-language inputs using neural VLMs to obtain a symbolic perception-graph representation, incorporating prior knowledge, salience weighting, and temporal correlations. The model then enables particle-filter based statistical reasoning -- a sequential Monte Carlo method -- to detect cognitive attacks. 
Thus, CADAR inherits the adaptability of pre-trained VLM and the interpretability and reasoning rigor of particle filtering. Experiments on an extended AR cognitive attack dataset show accuracy improvements of up to 10.7% over strong baselines on challenging AR attack scenarios, underscoring the promise of neurosymbolic methods for effective and interpretable cognitive attack detection.
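For readers unfamiliar with particle filtering, the sequential Monte Carlo idea referenced in this abstract can be illustrated with a generic bootstrap filter. This is only a sketch under stated assumptions: it tracks a single scalar hidden state with Gaussian noise, whereas CADAR reasons over symbolic perception graphs, and the function name `particle_filter_step` and its noise parameters are hypothetical.

```python
import math
import random


def particle_filter_step(particles, observation, rng, process_std=0.1, obs_std=0.5):
    """One predict / weight / resample cycle of a bootstrap particle filter."""
    # Predict: propagate each particle through an assumed random-walk model.
    moved = [p + rng.gauss(0.0, process_std) for p in particles]
    # Weight: Gaussian likelihood of the observation given each particle.
    weights = [math.exp(-0.5 * ((observation - p) / obs_std) ** 2) for p in moved]
    if sum(weights) == 0.0:
        # All likelihoods underflowed; fall back to uniform weights.
        weights = [1.0] * len(moved)
    # Resample: draw a new population in proportion to the weights.
    return rng.choices(moved, weights=weights, k=len(moved))


rng = random.Random(0)
particles = [rng.uniform(-5.0, 5.0) for _ in range(500)]
for _ in range(30):
    particles = particle_filter_step(particles, 2.0, rng)
estimate = sum(particles) / len(particles)  # posterior mean estimate of the hidden state
```

In a detection setting, a sustained mismatch between the filter's belief and incoming observations is the kind of signal that could be thresholded; how CADAR does this is not specified in the abstract.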

[49] RL-MoE: An Image-Based Privacy Preserving Approach In Intelligent Transportation System

Abdolazim Rezaei,Mehdi Sookhak,Mahboobeh Haghparast

Main category: cs.CV

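TL;DR: RL-MoE converts visual data into textual descriptions, effectively balancing privacy protection and data utility and resolving the conflict between privacy and data needs in intelligent transportation systems.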

Details

Motivation: In intelligent transportation systems, the spread of AI cameras creates a severe conflict between the need for rich visual data and the right to privacy. Existing privacy-preserving mechanisms, such as blurring or encryption, are often insufficient, resulting in either compromised privacy or severely degraded data utility. Method: RL-MoE converts sensitive visual data into privacy-preserving textual descriptions, combining a Mixture-of-Experts (MoE) architecture with a Reinforcement Learning (RL) agent to optimize the generated text for semantic accuracy and privacy preservation. Result: Experiments show that RL-MoE provides superior privacy protection, reducing the success rate of replay attacks on the CFP-FP dataset to 9.4% while generating richer textual content than baseline methods. Conclusion: RL-MoE provides a practical and scalable solution for building trustworthy AI systems in privacy-sensitive domains, paving the way for more secure smart city and autonomous vehicle networks. Abstract: The proliferation of AI-powered cameras in Intelligent Transportation Systems (ITS) creates a severe conflict between the need for rich visual data and the fundamental right to privacy. Existing privacy-preserving mechanisms, such as blurring or encryption, are often insufficient, creating an undesirable trade-off where either privacy is compromised against advanced reconstruction attacks or data utility is critically degraded. To resolve this impasse, we propose RL-MoE, a novel framework that transforms sensitive visual data into privacy-preserving textual descriptions, eliminating the need for direct image transmission. RL-MoE uniquely combines a Mixture-of-Experts (MoE) architecture for nuanced, multi-aspect scene decomposition with a Reinforcement Learning (RL) agent that optimizes the generated text for a dual objective of semantic accuracy and privacy preservation. Extensive experiments demonstrate that RL-MoE provides superior privacy protection, reducing the success rate of replay attacks to just 9.4% on the CFP-FP dataset, while simultaneously generating richer textual content than baseline methods. Our work provides a practical and scalable solution for building trustworthy AI systems in privacy-sensitive domains, paving the way for more secure smart city and autonomous vehicle networks.

[50] Synthetic Data Generation for Emotional Depth Faces: Optimizing Conditional DCGANs via Genetic Algorithms in the Latent Space and Stabilizing Training with Knowledge Distillation

Seyed Muhammad Hossein Mousavi,S. Younes Mirinezhad

Main category: cs.CV

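TL;DR: This paper proposes a new framework for synthetic depth face generation that addresses the shortage of datasets in affective computing by combining an optimized GAN with genetic algorithms, achieving strong results in classification accuracy and image quality metrics.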

Details

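Motivation: The main challenge facing affective computing is the lack of high-quality, diverse depth facial datasets for recognizing subtle emotional expressions. Method: An optimized GAN is combined with Knowledge Distillation (EMA teacher models) to stabilize training, improve quality, and prevent mode collapse, and Genetic Algorithms are applied to evolve GAN latent vectors based on image statistics, improving diversity and visual quality for target emotions. In addition, LBP, HOG, Sobel edge, and intensity histogram features are extracted and concatenated, and XGBoost reaches high emotion classification accuracy on them. Result: The method outperforms existing methods such as GAN, VAE, GMM, and KDE in classification accuracy (94% and 96%) and on the FID, IS, SSIM, and PSNR metrics. Conclusion: The paper proposes a synthetic depth face generation framework based on an optimized GAN and genetic algorithms that outperforms existing methods in diversity, quality, and emotion recognition accuracy. Abstract: Affective computing faces a major challenge: the lack of high-quality, diverse depth facial datasets for recognizing subtle emotional expressions. We propose a framework for synthetic depth face generation using an optimized GAN with Knowledge Distillation (EMA teacher models) to stabilize training, improve quality, and prevent mode collapse. We also apply Genetic Algorithms to evolve GAN latent vectors based on image statistics, boosting diversity and visual quality for target emotions. The approach outperforms GAN, VAE, GMM, and KDE in both diversity and quality. For classification, we extract and concatenate LBP, HOG, Sobel edge, and intensity histogram features, achieving 94% and 96% accuracy with XGBoost. Evaluation using FID, IS, SSIM, and PSNR shows consistent improvement over state-of-the-art methods.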
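The "EMA teacher" mentioned in this abstract usually refers to a teacher whose weights are an exponential moving average of the student's weights. The sketch below shows that standard update on flat parameter lists; the function name `ema_update` and the decay value are assumptions, and the abstract does not specify how the teacher is used in the distillation loss.

```python
def ema_update(teacher, student, decay=0.999):
    """Return teacher weights moved slightly toward the student's weights.

    Standard EMA rule: teacher <- decay * teacher + (1 - decay) * student.
    """
    return [decay * t + (1.0 - decay) * s for t, s in zip(teacher, student)]


# Hypothetical usage: call once after every student optimization step.
teacher_weights = [0.0, 0.0]
student_weights = [1.0, -1.0]
teacher_weights = ema_update(teacher_weights, student_weights, decay=0.9)
```

Because the teacher changes slowly, its outputs give a smoother training target than the student's own, which is the usual argument for why EMA teachers help stabilize GAN training.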

[51] $Δ$-AttnMask: Attention-Guided Masked Hidden States for Efficient Data Selection and Augmentation

Jucheng Hu,Suorong Yang,Dongzhan Zhou

Main category: cs.CV

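TL;DR: This paper proposes Δ-AttnMask, a data-efficient framework for Visual Instruction Finetuning (VIF) that quantifies sample quality through attention-guided masking of the model's hidden states, reaching state-of-the-art performance with only 20% of the data while accelerating training.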

Details

Motivation: Visual Instruction Finetuning (VIF) is essential for post-training vision-language models (VLMs). Unlike unimodal instruction finetuning, VIF needs multimodal data to enable joint visual and textual understanding, and therefore typically requires more data. Data selection has a critical impact on VIF, yet it remains understudied. Method: The paper proposes Δ-AttnMask, a framework that intrinsically assesses sample quality by computing the loss difference (Δ) between the original hidden states and the states masked using high-attention regions, without requiring domain labels, auxiliary models, or extra training. Result: Experiments show that, across multiple VLMs and datasets, Δ-AttnMask achieves state-of-the-art performance with only 20% of the data, accelerates training by 5x, and exceeds full-dataset baselines by 10.1% in overall accuracy. Conclusion: Δ-AttnMask has a model-agnostic and data-agnostic design with broad applicability and effectively addresses the data selection challenge in VIF. Abstract: Visual Instruction Finetuning (VIF) is pivotal for post-training Vision-Language Models (VLMs). Unlike unimodal instruction finetuning in plain-text large language models, which mainly requires instruction datasets to enable model instruction-following ability, VIF also requires multimodal data to enable joint visual and textual understanding; therefore, it typically requires more data. Consequently, VIF imposes stricter data selection challenges: the method must scale efficiently to handle larger data demands while ensuring the quality of both visual and textual content, as well as their alignment. Despite its critical impact on performance, data selection for VIF remains an understudied area. In this paper, we propose $\Delta$-AttnMask. This data-efficient framework quantifies sample quality through attention-guided masking of the model's hidden states, jointly evaluating image-text pairs without requiring domain labels, auxiliary models, or extra training. By computing loss differences ($\Delta$) between the original states and states masked using high-attention regions, $\Delta$-AttnMask intrinsically assesses sample quality. Experiments across multiple VLMs and datasets show that $\Delta$-AttnMask achieves state-of-the-art performance with just 20% of data, accelerating training by 5x while surpassing full-dataset baselines by +10.1% in overall accuracy. Its model-agnostic and data-agnostic design ensures broad applicability across modalities and architectures.
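The scoring-and-selection step described in this abstract can be sketched as follows, assuming the two per-sample losses have already been computed by the model. The function names are hypothetical, and ranking by the largest Δ is an assumption: the abstract states that Δ is the loss difference between original and masked states but not which direction indicates a better sample.

```python
def delta_scores(loss_original, loss_masked):
    """Per-sample loss difference between masked and original hidden states."""
    return [m - o for o, m in zip(loss_original, loss_masked)]


def select_top_fraction(scores, fraction=0.2):
    """Indices of the highest-scoring samples (assumed ranking direction)."""
    k = max(1, int(len(scores) * fraction))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


# Hypothetical losses for five image-text pairs.
loss_original = [1.0, 1.0, 1.0, 1.0, 1.0]
loss_masked = [1.1, 3.0, 1.2, 1.0, 1.5]
kept = select_top_fraction(delta_scores(loss_original, loss_masked), fraction=0.2)
```

Since both losses come from forward passes of the model being tuned, this kind of score needs no auxiliary model or labels, which matches the abstract's claim about the method's requirements.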

[52] Personalized Feature Translation for Expression Recognition: An Efficient Source-Free Domain Adaptation Method

Masoumeh Sharafi,Soufiane Belharbi,Houssem Ben Salem,Ali Etemad,Alessandro Lameiras Koerich,Marco Pedersoli,Simon Bacon,Eric Granger

Main category: cs.CV

TL;DR: This paper proposes Personalized Feature Translation (PFT), a lightweight and efficient method for facial expression recognition that adapts models using only neutral target data without source data or image synthesis.

Details

Motivation: The motivation stems from the limitations of deep FER models in handling subtle expressions and inter-subject variability, along with the need for domain adaptation without compromising data privacy or increasing computational costs. Method: Personalized Feature Translation (PFT) is introduced, which operates in the latent space to adapt models using neutral target data by translating subject-specific style features while preserving expression information. Result: PFT demonstrates improved performance in adapting facial expression recognition models by leveraging latent space translation, reducing computational overhead and eliminating reliance on source data or complex image synthesis. Conclusion: The paper concludes that the proposed PFT method effectively enhances facial expression recognition by efficiently adapting models using only neutral target data without the need for image synthesis or source data. Abstract: Facial expression recognition (FER) models are employed in many video-based affective computing applications, such as human-computer interaction and healthcare monitoring. However, deep FER models often struggle with subtle expressions and high inter-subject variability, limiting their performance in real-world applications. To improve their performance, source-free domain adaptation (SFDA) methods have been proposed to personalize a pretrained source model using only unlabeled target domain data, thereby avoiding data privacy, storage, and transmission constraints. This paper addresses a challenging scenario where source data is unavailable for adaptation, and only unlabeled target data consisting solely of neutral expressions is available. SFDA methods are not typically designed to adapt using target data from only a single class. Further, using models to generate facial images with non-neutral expressions can be unstable and computationally intensive. In this paper, personalized feature translation (PFT) is proposed for SFDA. 
Unlike current image translation methods for SFDA, our lightweight method operates in the latent space. We first pre-train the translator on the source domain data to transform the subject-specific style features from one source subject into another. Expression information is preserved by optimizing a combination of expression consistency and style-aware objectives. Then, the translator is adapted on neutral target data, without using source data or image synthesis. By translating in the latent space, PFT avoids the complexity and noise of face expression generation, producing discriminative embeddings optimized for classification. Using PFT eliminates the need for image synthesis, reduces computational overhead (using a lightweight translator), and only adapts part of the model, making the method efficient compared to image-based translation.

[53] GANime: Generating Anime and Manga Character Drawings from Sketches with Deep Learning

Tai Vu,Robert Yang

Main category: cs.CV

TL;DR: This study identifies C-GAN as the most effective model for translating sketches into high-quality, high-resolution colorized anime images, closely matching human-created artwork.

Details

Motivation: The process of generating fully colorized drawings from sketches is a costly bottleneck in the manga and anime industry, prompting the need for an efficient and effective solution. Method: Examined multiple models for image-to-image translation, including Neural Style Transfer, C-GAN, and CycleGAN, and assessed them qualitatively and quantitatively. Result: C-GAN was found to be the most effective model in generating high-quality and high-resolution images compared to other models. Conclusion: C-GAN is the most suitable model for generating high-quality and high-resolution colorized images from sketches, closely resembling human-created images. Abstract: The process of generating fully colorized drawings from sketches is a large, usually costly bottleneck in the manga and anime industry. In this study, we examine multiple models for image-to-image translation between anime characters and their sketches, including Neural Style Transfer, C-GAN, and CycleGAN. By assessing them qualitatively and quantitatively, we find that C-GAN is the most effective model that is able to produce high-quality and high-resolution images close to those created by humans.

[54] MME-Emotion: A Holistic Evaluation Benchmark for Emotional Intelligence in Multimodal Large Language Models

Fan Zhang,Zebang Cheng,Chong Deng,Haoxuan Li,Zheng Lian,Qian Chen,Huadai Liu,Wen Wang,Yi-Fan Zhang,Renrui Zhang,Ziyu Guo,Zhihong Zhu,Hao Wu,Haixin Wang,Yefeng Zheng,Xiaojiang Peng,Xian Wu,Kun Wang,Xiangang Li,Jieping Ye,Pheng-Ann Heng

Main category: cs.CV

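TL;DR: This paper presents MME-Emotion, the largest emotional intelligence benchmark for MLLMs to date, designed to evaluate models' emotional understanding and reasoning capabilities.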

Details

Motivation: Current emotional benchmarks remain limited: it is still unclear how well MLLMs generalize across different scenarios and how well they can reason about the triggering factors behind emotional states. Method: The paper introduces MME-Emotion, a systematic benchmark containing over 6,000 curated video clips with task-specific question-answer pairs, analyzed through a holistic evaluation suite with hybrid metrics. Result: A rigorous evaluation of 20 advanced MLLMs finds that current MLLMs exhibit unsatisfactory emotional intelligence, with the best-performing model achieving only a 39.3% recognition score and a 56.0% Chain-of-Thought (CoT) score on the benchmark. Conclusion: MME-Emotion provides a foundation for advancing the emotional intelligence of MLLMs in the future. Abstract: Recent advances in multimodal large language models (MLLMs) have catalyzed transformative progress in affective computing, enabling models to exhibit emergent emotional intelligence. Despite substantial methodological progress, current emotional benchmarks remain limited, as it is still unknown: (a) the generalization abilities of MLLMs across distinct scenarios, and (b) their reasoning capabilities to identify the triggering factors behind emotional states. To bridge these gaps, we present MME-Emotion, a systematic benchmark that assesses both emotional understanding and reasoning capabilities of MLLMs, enjoying scalable capacity, diverse settings, and unified protocols. As the largest emotional intelligence benchmark for MLLMs, MME-Emotion contains over 6,000 curated video clips with task-specific questioning-answering (QA) pairs, spanning broad scenarios to formulate eight emotional tasks. It further incorporates a holistic evaluation suite with hybrid metrics for emotion recognition and reasoning, analyzed through a multi-agent system framework. Through a rigorous evaluation of 20 advanced MLLMs, we uncover both their strengths and limitations, yielding several key insights: (1) Current MLLMs exhibit unsatisfactory emotional intelligence, with the best-performing model achieving only 39.3% recognition score and 56.0% Chain-of-Thought (CoT) score on our benchmark. 
(2) Generalist models (e.g., Gemini-2.5-Pro) derive emotional intelligence from generalized multimodal understanding capabilities, while specialist models (e.g., R1-Omni) can achieve comparable performance through domain-specific post-training adaptation. By introducing MME-Emotion, we hope that it can serve as a foundation for advancing MLLMs' emotional intelligence in the future.

[55] Towards Effective MLLM Jailbreaking Through Balanced On-Topicness and OOD-Intensity

Zuoou Li,Weitong Zhang,Jingyuan Wang,Shuyuan Zhang,Wenjia Bai,Bernhard Kainz,Mengyun Qiao

Main category: cs.CV

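TL;DR: This study introduces a new four-axis evaluation framework and the Balanced Structural Decomposition (BSD) strategy to improve the effectiveness of attacks on multimodal large language models, revealing weaknesses in existing safety systems.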

Details

Motivation: Vulnerability to adversarial prompts is a serious problem, and existing evaluation standards may overestimate the effectiveness of such attacks. Method: The paper proposes a four-axis evaluation framework and develops a recursive rewriting strategy called Balanced Structural Decomposition (BSD). Result: Tested on 13 commercial and open-source MLLMs, BSD yields higher attack success rates, more harmful outputs, and fewer refusals. Conclusion: BSD reveals a previously underappreciated weakness in existing multimodal safety systems and demonstrates its effectiveness at increasing attack success rates and harmful outputs. Abstract: Multimodal large language models (MLLMs) are widely used in vision-language reasoning tasks. However, their vulnerability to adversarial prompts remains a serious concern, as safety mechanisms often fail to prevent the generation of harmful outputs. Although recent jailbreak strategies report high success rates, many responses classified as "successful" are actually benign, vague, or unrelated to the intended malicious goal. This mismatch suggests that current evaluation standards may overestimate the effectiveness of such attacks. To address this issue, we introduce a four-axis evaluation framework that considers input on-topicness, input out-of-distribution (OOD) intensity, output harmfulness, and output refusal rate. This framework identifies truly effective jailbreaks. In a substantial empirical study, we reveal a structural trade-off: highly on-topic prompts are frequently blocked by safety filters, whereas those that are too OOD often evade detection but fail to produce harmful content. However, prompts that balance relevance and novelty are more likely to evade filters and trigger dangerous output. Building on this insight, we develop a recursive rewriting strategy called Balanced Structural Decomposition (BSD). The approach restructures malicious prompts into semantically aligned sub-tasks, while introducing subtle OOD signals and visual cues that make the inputs harder to detect. BSD was tested across 13 commercial and open-source MLLMs, where it consistently led to higher attack success rates, more harmful outputs, and fewer refusals. 
Compared to previous methods, it improves success rates by 67% and harmfulness by 21%, revealing a previously underappreciated weakness in current multimodal safety systems.

[56] Towards Scalable Training for Handwritten Mathematical Expression Recognition

Haoyang Li,Jiaqing Li,Jialun Cao,Zongyuan Yang,Yongping Xiong

Main category: cs.CV

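TL;DR: To address data scarcity in handwritten mathematical expression recognition, the authors propose the Tex80M dataset and the TexTeller model, achieving a major advance in recognition performance.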

Details

Motivation: Data in Handwritten Mathematical Expression Recognition (HMER) is scarce, mainly because manual annotation is tedious and expensive, so large-scale datasets are needed to advance the field. Method: A new method combines limited handwritten formulas with large-scale LaTeX-rendered formulas, and a scalable data engine is developed to generate complex LaTeX sequences, producing Tex80M, the largest formula dataset (over 80 million samples). Result: The Tex80M dataset was built, and the TexTeller model, developed through mix-training, achieves state-of-the-art performance on multiple benchmarks. Conclusion: TexTeller is the first HMER model trained on a large-scale dataset and reaches SOTA performance on nearly all benchmarks. Abstract: Large foundation models have achieved significant performance gains through scalable training on massive datasets. However, the field of Handwritten Mathematical Expression Recognition (HMER) has been impeded by the scarcity of data, primarily due to the arduous and costly process of manual annotation. To bridge this gap, we propose a novel method integrating limited handwritten formulas with large-scale LaTeX-rendered formulas by developing a scalable data engine to generate complex and consistent LaTeX sequences. With this engine, we built the largest formula dataset to date, termed Tex80M, comprising over 80 million high-quality training instances. Then we propose TexTeller, the first HMER model trained at scale, by mix-training Tex80M with a relatively small HME dataset. The expansive training dataset and our refined pipeline have equipped TexTeller with state-of-the-art (SOTA) performance across nearly all benchmarks. To advance the field, we will openly release our complete model, entire dataset, and full codebase, enabling further research building upon our contributions.

[57] Gradient-Direction-Aware Density Control for 3D Gaussian Splatting

Zheng Zhou,Yu-Jie Xiong,Chun-Ming Xia,Jia-Chen Zhang,Hong-Jian Zhan

Main category: cs.CV

TL;DR: Gradient-Direction-Aware Gaussian Splatting (GDAGS) improves 3D scene representation by adaptively controlling Gaussian density based on gradient direction, leading to enhanced rendering quality and reduced memory consumption.

Details

Motivation: Existing 3DGS approaches suffer from over-reconstruction due to ineffective splitting of large Gaussians and over-densification from redundant component proliferation, limiting their performance in complex scenarios. Method: GDAGS introduces a gradient coherence ratio (GCR) and a nonlinear dynamic weighting mechanism to adaptively control Gaussian density based on gradient direction awareness. Result: GDAGS achieves superior rendering quality, mitigates over-reconstruction, suppresses over-densification, and constructs compact scene representations with significantly reduced memory usage. Conclusion: GDAGS effectively addresses over-reconstruction and over-densification issues in 3D Gaussian Splatting, achieving superior rendering quality and reducing memory consumption by 50%. Abstract: The emergence of 3D Gaussian Splatting (3DGS) has significantly advanced novel view synthesis through explicit scene representation, enabling real-time photorealistic rendering. However, existing approaches manifest two critical limitations in complex scenarios: (1) Over-reconstruction occurs when persistent large Gaussians cannot meet adaptive splitting thresholds during density control. This is exacerbated by conflicting gradient directions that prevent effective splitting of these Gaussians; (2) Over-densification of Gaussians occurs in regions with aligned gradient aggregation, leading to redundant component proliferation. This redundancy significantly increases memory overhead due to unnecessary data retention. We present Gradient-Direction-Aware Gaussian Splatting (GDAGS), a gradient-direction-aware adaptive density control framework to address these challenges. 
Our key innovations are the gradient coherence ratio (GCR), computed through normalized gradient vector norms, which explicitly discriminates Gaussians with concordant versus conflicting gradient directions, and a nonlinear dynamic weighting mechanism that leverages the GCR to enable gradient-direction-aware density control. Specifically, GDAGS prioritizes conflicting-gradient Gaussians during splitting operations to enhance geometric details while suppressing redundant concordant-direction Gaussians. Conversely, in cloning processes, GDAGS promotes concordant-direction Gaussian densification for structural completion while preventing conflicting-direction Gaussian overpopulation. Comprehensive evaluations across diverse real-world benchmarks demonstrate that GDAGS achieves superior rendering quality while effectively mitigating over-reconstruction, suppressing over-densification, and constructing compact scene representations with 50% reduced memory consumption through optimized Gaussian utilization.
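One plausible way to measure whether the gradients accumulated on a Gaussian agree in direction is the ratio of the norm of their sum to the sum of their norms. The sketch below uses that form purely as an assumption: the abstract says only that the GCR is "computed through normalized gradient vector norms" and does not give the formula, and the function name is hypothetical.

```python
import math


def gradient_coherence_ratio(grads):
    """Assumed GCR form: ||sum of gradients|| / sum of ||gradient||.

    Close to 1 when gradients point the same way (concordant),
    close to 0 when they cancel each other (conflicting).
    """
    summed = [sum(components) for components in zip(*grads)]
    norm_of_sum = math.sqrt(sum(v * v for v in summed))
    sum_of_norms = sum(math.sqrt(sum(v * v for v in g)) for g in grads)
    return norm_of_sum / sum_of_norms if sum_of_norms > 0.0 else 0.0


# Hypothetical per-view 2D positional gradients for two Gaussians.
concordant = gradient_coherence_ratio([(1.0, 0.0), (2.0, 0.0)])
conflicting = gradient_coherence_ratio([(1.0, 0.0), (-1.0, 0.0)])
```

Under this assumed definition, a low ratio would flag the conflicting-gradient Gaussians that the abstract says are prioritized for splitting, and a high ratio the concordant ones favored for cloning.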

[58] FineState-Bench: A Comprehensive Benchmark for Fine-Grained State Control in GUI Agents

Fengxian Ji,Jingpu Yang,Zirui Song,Yuanxi Wang,Zhexuan Cui,Yuke Li,Qian Jiang,Miao Fang,Xiuying Chen

Main category: cs.CV

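TL;DR: This paper introduces FineState-Bench, the first evaluation and diagnostic standard for fine-grained GUI agent operations, along with a plug-and-play Visual Diagnostic Assistant (VDA) for analyzing perception and localization capabilities.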

Details

Motivation: Existing evaluation frameworks for GUI agents are fundamentally flawed: they focus too heavily on coarse-grained task completion and neglect the fine-grained control capabilities that are crucial for real-world applications. Method: The paper introduces FineState-Bench, a multi-platform framework, together with a plug-and-play Visual Diagnostic Assistant (VDA) for analyzing perception and localization capabilities. Result: The most advanced models achieve only 32.8% fine-grained interaction accuracy, and controlled experiments with the VDA show that ideal visual localization raises Gemini-2.5-Flash's success rate by 14.9%. Conclusion: FineState-Bench confirms that the primary bottleneck for current GUI agents is basic visual localization capability, and controlled experiments with the VDA quantify the impact of visual capabilities. Abstract: With the rapid advancement of generative artificial intelligence technology, Graphical User Interface (GUI) agents have demonstrated tremendous potential for autonomously managing daily tasks through natural language instructions. However, current evaluation frameworks for GUI agents suffer from fundamental flaws: existing benchmarks overly focus on coarse-grained task completion while neglecting fine-grained control capabilities crucial for real-world applications. To address this, we introduce FineState-Bench, the first evaluation and diagnostic standard for fine-grained GUI agent operations, designed to quantify fine-grained control. This multi-platform (desktop, Web, mobile) framework includes 2257 task benchmarks in four components and uses a four-phase indicator for comprehensive perception-to-control assessment. To analyze perception and positioning for refined operations, we developed the plug-and-play Visual Diagnostic Assistant (VDA), enabling the first quantitative decoupling analysis of these capabilities. Experimental results on our benchmark show that the most advanced models achieve only 32.8% fine-grained interaction accuracy. In controlled experiments with our VDA that quantify the impact of visual capabilities, we show that ideal visual localization boosts Gemini-2.5-Flash's success rate by 14.9%. Our diagnostic framework confirms for the first time that the primary bottleneck for current GUI agents is basic visual positioning capability. All resources are fully open-source. GitHub: https://github.com/AnonymousThewarehouse/FineState-Bench Hugging Face: https://huggingface.co/datasets/Willtime2006/Static-FineBench

Jeffri Murrugarra-LLerena,Haoran Niu,K. Suzanne Barber,Hal Daumé III,Yang Trista Cao,Paola Cascante-Bonilla

Main category: cs.CV

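TL;DR: FiGPriv is a new privacy protection framework that selectively masks high-risk private information while preserving low-risk information, protecting user privacy while improving the ability of vision-language models (VLMs) to provide useful responses and identify image content.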

Details

Motivation: Existing privacy protection methods rely on coarse-grained segmentation, which uniformly masks entire private objects, often at the cost of usability. As visual assistant systems powered by vision-language models (VLMs) become more prevalent, particularly among blind and low-vision users who may unknowingly capture private information in their images, concerns over user privacy are growing. Method: The approach combines fine-grained segmentation with a data-driven risk scoring mechanism. Result: Evaluated on the BIV-Priv-Seg dataset, FiGPriv preserves +26% of image content and improves the ability of VLMs to provide useful responses by 11% and to identify image content by 45%. Conclusion: FiGPriv is a fine-grained privacy protection framework that effectively protects user privacy while improving the ability of VLMs to provide useful responses. Abstract: As visual assistant systems powered by visual language models (VLMs) become more prevalent, concerns over user privacy have grown, particularly for blind and low vision users who may unknowingly capture personal private information in their images. Existing privacy protection methods rely on coarse-grained segmentation, which uniformly masks entire private objects, often at the cost of usability. In this work, we propose FiGPriv, a fine-grained privacy protection framework that selectively masks only high-risk private information while preserving low-risk information. Our approach integrates fine-grained segmentation with a data-driven risk scoring mechanism. We evaluate our framework using the BIV-Priv-Seg dataset and show that FiG-Priv preserves +26% of image content, enhancing the ability of VLMs to provide useful responses by 11% and identify the image content by 45%, while ensuring privacy protection. Project Page: https://artcs1.github.io/VLMPrivacy/

[60] Harnessing Input-Adaptive Inference for Efficient VLN

Dongwoo Kang,Akhil Perincherry,Zachary Coalson,Aiden Gabriel,Stefan Lee,Sanghyun Hong

Main category: cs.CV

TL;DR: This paper proposes an input-adaptive navigation method to improve the computational efficiency of vision-and-language navigation (VLN) models by introducing three adaptive algorithms targeting spatial, intra-model, and temporal efficiency, resulting in over a 2× reduction in computation without significant performance loss.

Details

Motivation: Large-scale history-aware multi-modal transformer models used in vision-and-language navigation (VLN) are computationally expensive, which limits their practical deployment in resource-constrained environments. Method: The authors introduce three adaptive algorithms targeting spatial, intra-model, and temporal efficiency: selective processing of panoramic views, importance-based adaptive thresholding for early-exit methods, and caching of previously processed views. Result: Evaluations on seven VLN benchmarks show over a 2× reduction in computation across three standard agents in both standard and continuous environments without significant loss in performance. Conclusion: The proposed input-adaptive navigation method significantly enhances VLN model efficiency, achieving over a 2× reduction in computation without substantial performance degradation across multiple benchmarks. Abstract: An emerging paradigm in vision-and-language navigation (VLN) is the use of history-aware multi-modal transformer models. Given a language instruction, these models process observation and navigation history to predict the most appropriate action for an agent. While they have significantly improved performance, the scale of these models can be a bottleneck in practical settings with limited computational resources. In this work, we propose a novel input-adaptive navigation method to enhance VLN model efficiency. We first show that existing input-adaptive mechanisms fail to reduce computations without substantial performance degradation. To address this, we introduce three adaptive algorithms, each deployed at a different level: (1) To improve spatial efficiency, we selectively process panoramic views at each observation of an agent. (2) To improve intra-model efficiency, we propose importance-based adaptive thresholding for the early-exit methods. 
(3) To improve temporal efficiency, we implement a caching mechanism that prevents reprocessing of views previously seen by the agent. In evaluations on seven VLN benchmarks, we demonstrate over a 2× reduction in computation across three off-the-shelf agents in both standard and continuous environments. Our code is publicly available at https://github.com/secure-ai-systems-group/adaptive-vision-and-language-navigation.
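
The temporal-efficiency idea in (3), reusing features for views the agent has already processed, can be sketched as a small memoization wrapper. The class name, the `view_key` format, and the toy encoder are hypothetical and not taken from the paper's code; this is a minimal sketch under those assumptions.

```python
class ViewFeatureCache:
    """Hypothetical cache that encodes each distinct view at most once."""

    def __init__(self, encoder):
        self._encoder = encoder      # any callable: image -> features
        self._features = {}          # view_key -> cached features
        self.encoder_calls = 0       # how many times the encoder actually ran

    def encode(self, view_key, image):
        # Only run the (expensive) encoder for views not seen before.
        if view_key not in self._features:
            self._features[view_key] = self._encoder(image)
            self.encoder_calls += 1
        return self._features[view_key]


# Toy usage: the "encoder" here is just a sum over pixel values.
cache = ViewFeatureCache(lambda image: sum(image))
first = cache.encode(("viewpoint-1", 0), [1, 2, 3])
second = cache.encode(("viewpoint-1", 0), [1, 2, 3])  # served from the cache
print(first, second, cache.encoder_calls)
```

Keying the cache on a (viewpoint, view index) pair is one possible design choice; a real agent would need a key that stays valid when the same view is revisited along a different path.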

[61] SegDAC: Segmentation-Driven Actor-Critic for Visual Reinforcement Learning

Alexandre Brown,Glen Berseth

Main category: cs.CV

TL;DR: The paper introduces SegDAC, a Segmentation-Driven Actor-Critic method for visual reinforcement learning that achieves superior visual generalization and sample efficiency on diverse manipulation tasks.

Details

Motivation: How to integrate large perception models into RL effectively for visual generalization and improved sample efficiency remains unclear. Method: SegDAC uses Segment Anything (SAM) for object-centric decomposition and YOLO-World to ground segments semantically via text prompts. It includes a novel transformer-based architecture that supports a dynamic number of segments at each time step and effectively learns which segments to focus on using online RL, without using human labels. Result: SegDAC achieves significantly better visual generalization. Conclusion: SegDAC doubles prior performance on the hardest setting and matches or surpasses prior methods in sample efficiency across all evaluated tasks. Abstract: Visual reinforcement learning (RL) is challenging due to the need to learn both perception and actions from high-dimensional inputs and noisy rewards. Although large perception models exist, how to integrate them effectively into RL for visual generalization and improved sample efficiency remains unclear. We propose SegDAC, a Segmentation-Driven Actor-Critic method. SegDAC uses Segment Anything (SAM) for object-centric decomposition and YOLO-World to ground segments semantically via text prompts. It includes a novel transformer-based architecture that supports a dynamic number of segments at each time step and effectively learns which segments to focus on using online RL, without using human labels. By evaluating SegDAC over a challenging visual generalization benchmark using ManiSkill3, which covers diverse manipulation tasks under strong visual perturbations, we demonstrate that SegDAC achieves significantly better visual generalization, doubling prior performance on the hardest setting and matching or surpassing prior methods in sample efficiency across all evaluated tasks.

[62] Lung-DDPM+: Efficient Thoracic CT Image Synthesis using Diffusion Probabilistic Model

Yifan Jiang,Ahmad Shariftabrizi,Venkata SK. Manem

Main category: cs.CV

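TL;DR: Lung-DDPM+ is an efficient generative model that significantly improves sampling efficiency and reduces resource consumption while maintaining high quality, making it suitable for lung CT image generation and related medical applications.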

Details

Motivation: Existing generative models for lung cancer diagnosis suffer from low efficiency and anatomical imprecision, which limits their clinical applicability. A more efficient and precise generative model is therefore needed. Method: Lung-DDPM+ is an improved version of Lung-DDPM that is accelerated by a pulmonary DPM-solver and guided by nodule semantic layouts, improving sampling efficiency while preserving generation quality. Result: Evaluation on the public LIDC-IDRI dataset shows that Lung-DDPM+ uses 8× fewer floating-point operations (FLOPs), consumes 6.8× less GPU memory, and samples 14× faster, while maintaining generation quality comparable to Lung-DDPM and other state-of-the-art generative models. A Visual Turing Test conducted by an experienced radiologist also indicates the high quality and fidelity of the generated samples. Conclusion: Lung-DDPM+ is an improved denoising diffusion probabilistic model that efficiently generates high-quality thoracic CT images and has broad potential applications, such as general tumor synthesis and lesion generation in medical imaging. Abstract: Generative artificial intelligence (AI) has been playing an important role in various domains. Leveraging its high capability to generate high-fidelity and diverse synthetic data, generative AI is widely applied in diagnostic tasks, such as lung cancer diagnosis using computed tomography (CT). However, existing generative models for lung cancer diagnosis suffer from low efficiency and anatomical imprecision, which limit their clinical applicability. To address these drawbacks, we propose Lung-DDPM+, an improved version of our previous model, Lung-DDPM. This novel approach is a denoising diffusion probabilistic model (DDPM) guided by nodule semantic layouts and accelerated by a pulmonary DPM-solver, enabling the method to focus on lesion areas while achieving a better trade-off between sampling efficiency and quality. Evaluation results on the public LIDC-IDRI dataset suggest that the proposed method achieves 8× fewer FLOPs (floating-point operations), 6.8× lower GPU memory consumption, and 14× faster sampling compared to Lung-DDPM. Moreover, it maintains comparable sample quality to both Lung-DDPM and other state-of-the-art (SOTA) generative models in two downstream segmentation tasks. We also conducted a Visual Turing Test by an experienced radiologist, showing the advanced quality and fidelity of synthetic samples generated by the proposed method. 
These experimental results demonstrate that Lung-DDPM+ can effectively generate high-quality thoracic CT images with lung nodules, highlighting its potential for broader applications, such as general tumor synthesis and lesion generation in medical imaging. The code and pretrained models are available at https://github.com/Manem-Lab/Lung-DDPM-PLUS.

[63] UltraLight Med-Vision Mamba for Classification of Neoplastic Progression in Tubular Adenomas

Aqsa Sultana,Nordin Abouzahra,Ahmed Rahu,Brian Shula,Brandon Combs,Derrick Forchetti,Theus Aspiras,Vijayan K. Asari

Main category: cs.CV

TL;DR: Ultralight Med-Vision Mamba, a state-space model, improves the accuracy and efficiency of identifying precancerous polyps during colonoscopies, enabling better risk assessment and personalized patient care.

Details

Motivation: Identifying precancerous polyps during routine colonoscopy screenings is crucial for reducing the risk of colorectal cancer. Improved methods for accurate classification and risk assessment enable personalized surveillance protocols and better patient outcomes. Method: The study utilizes Ultralight Med-Vision Mamba, an advanced deep learning algorithm based on a state-space model (SSM), designed to model long- and short-range dependencies and enhance image generalization for precise adenoma classification. Result: Ultralight Med-Vision Mamba demonstrated excellent performance in adenoma classification and stratification, offering enhanced accuracy in risk assessment, along with benefits in computational speed and scalability for clinical applications. Conclusion: Ultralight Med-Vision Mamba, a state-space based model, has proven effective in analyzing whole slide images with high computational speed and scalability, making it a promising tool for real-time clinical deployment in identifying precancerous polyps during colonoscopy screenings. Abstract: Identification of precancerous polyps during routine colonoscopy screenings is vital for their excision, lowering the risk of developing colorectal cancer. Advanced deep learning algorithms enable precise adenoma classification and stratification, improving risk assessment accuracy and enabling personalized surveillance protocols that optimize patient outcomes. Ultralight Med-Vision Mamba, a state-space based model (SSM), has excelled in modeling long- and short-range dependencies and image generalization, critical factors for analyzing whole slide images. Furthermore, Ultralight Med-Vision Mamba's efficient architecture offers advantages in both computational speed and scalability, making it a promising tool for real-time clinical deployment.

[64] Blink-to-code: real-time Morse code communication via eye blink detection and classification

Anushka Bhatt

Main category: cs.CV

TL;DR: This study presents a low-cost, real-time system that translates eye blinks into Morse code for individuals with severe motor impairments, achieving 62% accuracy and response times of 18-20 seconds.

Details

Motivation: The study aims to develop an accessible communication tool for individuals with severe motor impairments, enabling them to interact with their environment through a simple and low-cost method. Method: The study proposes a real-time system using a standard webcam and computer vision to detect and classify eye blinks as short (dot) or long (dash), which are then decoded into alphanumeric characters. Result: Experiments conducted with five participants showed a decoding accuracy of 62% and response times ranging from 18 to 20 seconds. Conclusion: The study concludes that the proposed real-time system for translating eye blinks into Morse code offers a viable and low-cost assistive communication method for individuals with severe motor impairments. Abstract: This study proposes a real-time system that translates voluntary eye blinks into Morse code, enabling communication for individuals with severe motor impairments. Using a standard webcam and computer vision, the system detects and classifies blinks as short (dot) or long (dash), then decodes them into alphanumeric characters. Experiments with five participants show 62% decoding accuracy and response times of 18-20 seconds, demonstrating a viable, low-cost assistive communication method.
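
The classify-then-decode step described above can be sketched in a few lines. The blink timestamps, the 0.4 s dash threshold, and the 1.0 s letter gap are illustrative assumptions, not values reported in the paper, and the blink detection itself (webcam and computer vision) is outside the scope of this sketch.

```python
MORSE_TO_CHAR = {
    ".-": "A", "-...": "B", "-.-.": "C", "-..": "D", ".": "E", "..-.": "F",
    "--.": "G", "....": "H", "..": "I", ".---": "J", "-.-": "K", ".-..": "L",
    "--": "M", "-.": "N", "---": "O", ".--.": "P", "--.-": "Q", ".-.": "R",
    "...": "S", "-": "T", "..-": "U", "...-": "V", ".--": "W", "-..-": "X",
    "-.--": "Y", "--..": "Z", "-----": "0", ".----": "1", "..---": "2",
    "...--": "3", "....-": "4", ".....": "5", "-....": "6", "--...": "7",
    "---..": "8", "----.": "9",
}


def classify_blink(duration_s, dash_threshold_s=0.4):
    """Map a blink duration to a Morse symbol (threshold is an assumption)."""
    return "-" if duration_s >= dash_threshold_s else "."


def decode_blinks(blinks, dash_threshold_s=0.4, letter_gap_s=1.0):
    """Decode (start_s, end_s) blink intervals into text.

    A pause of at least letter_gap_s between blinks closes the current letter.
    Unknown symbol sequences are rendered as '?'.
    """
    letters = []
    symbols = ""
    previous_end = None
    for start, end in blinks:
        if previous_end is not None and start - previous_end >= letter_gap_s and symbols:
            letters.append(MORSE_TO_CHAR.get(symbols, "?"))
            symbols = ""
        symbols += classify_blink(end - start, dash_threshold_s)
        previous_end = end
    if symbols:
        letters.append(MORSE_TO_CHAR.get(symbols, "?"))
    return "".join(letters)


sos = [
    (0.0, 0.1), (0.3, 0.4), (0.6, 0.7),   # three short blinks
    (2.0, 2.6), (2.8, 3.4), (3.6, 4.2),   # three long blinks
    (5.5, 5.6), (5.8, 5.9), (6.1, 6.2),   # three short blinks
]
print(decode_blinks(sos))
```

Fixed thresholds keep the sketch simple; a practical system would likely need per-user calibration, since blink durations vary between individuals.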

[65] FusionEnsemble-Net: An Attention-Based Ensemble of Spatiotemporal Networks for Multimodal Sign Language Recognition

Md. Milon Islam,Md Rezwanul Haque,S M Taslim Uddin Raju,Fakhri Karray

Main category: cs.CV

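TL;DR: This paper proposes FusionEnsemble-Net, a new method that uses attention to fuse an ensemble of spatiotemporal networks and improve recognition accuracy for Italian Sign Language.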

Details

Motivation: Accurate recognition of sign language in healthcare communication is a significant challenge and requires frameworks that can accurately interpret complex multimodal gestures. Method: FusionEnsemble-Net is an attention-based ensemble of spatiotemporal networks that processes RGB video and Doppler radar modalities synchronously, continuously fuses features from the two modalities with an attention-based fusion module, and finally combines the outputs of four different fused channels in an ensemble classification head to improve robustness. Result: Experiments show that FusionEnsemble-Net achieves a test accuracy of 99.44% on the large-scale MultiMeDaLIS dataset, outperforming state-of-the-art approaches. Conclusion: An ensemble of diverse spatiotemporal networks unified by attention-based fusion provides a robust and accurate framework for complex multimodal isolated gesture recognition tasks. Abstract: Accurate recognition of sign language in healthcare communication poses a significant challenge, requiring frameworks that can accurately interpret complex multimodal gestures. To deal with this, we propose FusionEnsemble-Net, a novel attention-based ensemble of spatiotemporal networks that dynamically fuses visual and motion data to enhance recognition accuracy. The proposed approach processes RGB video and range Doppler map radar modalities synchronously through four different spatiotemporal networks. For each network, features from both modalities are continuously fused using an attention-based fusion module before being fed into an ensemble of classifiers. Finally, the outputs of these four different fused channels are combined in an ensemble classification head, thereby enhancing the model's robustness. Experiments demonstrate that FusionEnsemble-Net outperforms state-of-the-art approaches with a test accuracy of 99.44% on the large-scale MultiMeDaLIS dataset for Italian Sign Language. Our findings indicate that an ensemble of diverse spatiotemporal networks, unified by attention-based fusion, yields a robust and accurate framework for complex, multimodal isolated gesture recognition tasks. The source code is available at: https://github.com/rezwanh001/Multimodal-Isolated-Italian-Sign-Language-Recognition.

[66] A Signer-Invariant Conformer and Multi-Scale Fusion Transformer for Continuous Sign Language Recognition

Md Rezwanul Haque,Md. Milon Islam,S M Taslim Uddin Raju,Fakhri Karray

Main category: cs.CV

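TL;DR: This study proposes a dual-architecture framework that targets inter-signer variability and generalization to novel sentence structures in continuous sign language recognition, significantly improving performance and establishing a new baseline.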

Details

Motivation: Continuous sign language recognition (CSLR) suffers from significant inter-signer variability and poor generalization to novel sentence structures, and traditional methods struggle to address these challenges effectively. Task-specific networks that target the particular problems of CSLR are therefore needed. Method: A dual-architecture framework is proposed, consisting of a Signer-Invariant Conformer for the Signer-Independent (SI) challenge and a Multi-Scale Fusion Transformer for the Unseen-Sentences (US) task. The Conformer combines convolutions with multi-head self-attention, while the Transformer uses a dual-path temporal encoder to capture fine-grained posture dynamics. Result: Experiments on the Isharah-1000 dataset show that the proposed Conformer achieves a word error rate (WER) of 13.07% on the SI task, a reduction of 13.53% from the previous best result, and the Transformer achieves a WER of 47.78% on the US task, surpassing previous work. In the SignEval 2025 CSLR challenge, the team placed 2nd in the US task and 4th in the SI task. Conclusion: For the Signer-Independent (SI) and Unseen-Sentences (US) challenges in CSLR, task-specific network design significantly improves performance and establishes a new baseline. Abstract: Continuous Sign Language Recognition (CSLR) faces multiple challenges, including significant inter-signer variability and poor generalization to novel sentence structures. Traditional solutions frequently fail to handle these issues efficiently. To overcome these constraints, we propose a dual-architecture framework. For the Signer-Independent (SI) challenge, we propose a Signer-Invariant Conformer that combines convolutions with multi-head self-attention to learn robust, signer-agnostic representations from pose-based skeletal keypoints. For the Unseen-Sentences (US) task, we designed a Multi-Scale Fusion Transformer with a novel dual-path temporal encoder that captures both fine-grained posture dynamics, enabling the model's ability to comprehend novel grammatical compositions. Experiments on the challenging Isharah-1000 dataset establish a new standard for both CSLR benchmarks. The proposed conformer architecture achieves a Word Error Rate (WER) of 13.07% on the SI challenge, a reduction of 13.53% from the state-of-the-art. On the US task, the transformer model scores a WER of 47.78%, surpassing previous work. In the SignEval 2025 CSLR challenge, our team placed 2nd in the US task and 4th in the SI task, demonstrating the performance of these models. 
The findings validate our key hypothesis: that developing task-specific networks designed for the particular challenges of CSLR leads to considerable performance improvements and establishes a new baseline for further research. The source code is available at: https://github.com/rezwanh001/MSLR-Pose86K-CSLR-Isharah.
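
For readers unfamiliar with the metric reported above, word error rate is conventionally the word-level edit distance (substitutions, deletions, insertions) between the recognized sequence and the reference, divided by the reference length. The sketch below follows that standard definition; the function name and whitespace tokenization are illustrative assumptions, and the challenge's official scoring script may differ in detail.

```python
def word_error_rate(reference, hypothesis):
    """Standard WER: word-level Levenshtein distance / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # previous[j] holds the edit distance between ref[:i-1] and hyp[:j].
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            current.append(min(
                previous[j] + 1,                      # delete a reference word
                current[j - 1] + 1,                   # insert a hypothesis word
                previous[j - 1] + substitution_cost,  # match or substitute
            ))
        previous = current
    return previous[-1] / len(ref)


print(word_error_rate("a b c d", "a x c"))
```

In the usage line, one substitution and one deletion against a four-word reference give a WER of 2/4.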

[67] What Can We Learn from Inter-Annotator Variability in Skin Lesion Segmentation?

Kumar Abhishek,Jeremy Kawahara,Ghassan Hamarneh

Main category: cs.CV

TL;DR: This paper introduces IMA++, a large skin lesion segmentation dataset, showing that inter-annotator agreement (IAA) correlates with malignancy and can enhance model accuracy when used as a clinical feature.

Details

Motivation: Medical image segmentation often suffers from intra- and inter-annotator variability, particularly for lesions with ambiguous boundaries. This variability is linked to malignancy, making it important to study and leverage for improved diagnostic models. Method: The authors curated the IMA++ dataset, a large multi-annotator skin lesion segmentation dataset, and conducted an in-depth analysis of variability factors. They measured inter-annotator agreement (IAA) using Dice scores, predicted IAA from dermoscopic images, and incorporated IAA into a multi-task learning framework to assess its impact on model performance. Result: A statistically significant (p<0.001) association was found between IAA and lesion malignancy. IAA could be predicted directly from images with a mean absolute error of 0.108. Incorporating IAA into a multi-task learning framework improved balanced accuracy by 4.2% across multiple models and datasets. Conclusion: The study concludes that inter-annotator agreement (IAA) is significantly associated with the malignancy of skin lesions, and IAA can be used as a 'soft' clinical feature to improve balanced accuracy in multi-task learning models for skin lesion segmentation. Abstract: Medical image segmentation exhibits intra- and inter-annotator variability due to ambiguous object boundaries, annotator preferences, expertise, and tools, among other factors. Lesions with ambiguous boundaries, e.g., spiculated or infiltrative nodules, or irregular borders per the ABCD rule, are particularly prone to disagreement and are often associated with malignancy. In this work, we curate IMA++, the largest multi-annotator skin lesion segmentation dataset, on which we conduct an in-depth study of variability due to annotator, malignancy, tool, and skill factors. We find a statistically significant (p<0.001) association between inter-annotator agreement (IAA), measured using Dice, and the malignancy of skin lesions. 
We further show that IAA can be accurately predicted directly from dermoscopic images, achieving a mean absolute error of 0.108. Finally, we leverage this association by utilizing IAA as a "soft" clinical feature within a multi-task learning objective, yielding a 4.2% improvement in balanced accuracy averaged across multiple model architectures and across IMA++ and four public dermoscopic datasets. The code is available at https://github.com/sfu-mial/skin-IAV.
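
A Dice-based agreement score of the kind mentioned above can be sketched as the mean pairwise Dice coefficient over all annotators' masks for one image. Averaging over pairs and representing masks as sets of pixel coordinates are assumptions made for illustration; the abstract only states that IAA is "measured using Dice".

```python
from itertools import combinations


def dice(mask_a, mask_b):
    """Dice coefficient between two masks given as sets of (row, col) pixels."""
    if not mask_a and not mask_b:
        return 1.0  # two empty masks agree perfectly
    return 2 * len(mask_a & mask_b) / (len(mask_a) + len(mask_b))


def inter_annotator_agreement(masks):
    """Mean pairwise Dice over all annotator masks (assumed aggregation)."""
    pairs = list(combinations(masks, 2))
    if not pairs:
        raise ValueError("need at least two masks")
    return sum(dice(a, b) for a, b in pairs) / len(pairs)


annotator_1 = {(0, 0), (0, 1)}
annotator_2 = {(0, 1), (1, 1)}
annotator_3 = {(0, 0), (0, 1)}
print(inter_annotator_agreement([annotator_1, annotator_2, annotator_3]))
```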

[68] X-UniMotion: Animating Human Images with Expressive, Unified and Identity-Agnostic Motion Latents

Guoxian Song,Hongyi Xu,Xiaochen Zhao,You Xie,Tianpei Gu,Zenan Li,Chenxu Zhang,Linjie Luo

Main category: cs.CV

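TL;DR: X-UniMotion proposes a new implicit motion representation that achieves high-fidelity cross-identity motion transfer through self-supervised learning and a disentangled design.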

Details

Motivation: Existing motion transfer methods rely on explicit skeletal poses and heuristic cross-identity adjustments, which makes it difficult to preserve high-fidelity detail and identity-agnostic motion expression. Method: A self-supervised, end-to-end framework jointly learns the motion encoder and a DiT-based video generative model, using 2D spatial and color augmentations and synthetic 3D renderings to disentangle motion from identity. Result: X-UniMotion outperforms existing techniques on cross-identity motion transfer and generates high-fidelity, expressive animations. Conclusion: X-UniMotion is a unified and expressive implicit latent representation for whole-body human motion that enables high-fidelity cross-identity motion transfer. Abstract: We present X-UniMotion, a unified and expressive implicit latent representation for whole-body human motion, encompassing facial expressions, body poses, and hand gestures. Unlike prior motion transfer methods that rely on explicit skeletal poses and heuristic cross-identity adjustments, our approach encodes multi-granular motion directly from a single image into a compact set of four disentangled latent tokens -- one for facial expression, one for body pose, and one for each hand. These motion latents are both highly expressive and identity-agnostic, enabling high-fidelity, detailed cross-identity motion transfer across subjects with diverse identities, poses, and spatial configurations. To achieve this, we introduce a self-supervised, end-to-end framework that jointly learns the motion encoder and latent representation alongside a DiT-based video generative model, trained on large-scale, diverse human motion datasets. Motion-identity disentanglement is enforced via 2D spatial and color augmentations, as well as synthetic 3D renderings of cross-identity subject pairs under shared poses. Furthermore, we guide motion token learning with auxiliary decoders that promote fine-grained, semantically aligned, and depth-aware motion embeddings. Extensive experiments show that X-UniMotion outperforms state-of-the-art methods, producing highly expressive animations with superior motion fidelity and identity preservation.

[69] DenoDet V2: Phase-Amplitude Cross Denoising for SAR Object Detection

Kang Ni,Minrui Zou,Yuxuan Li,Xiang Li,Kehua Guo,Ming-Ming Cheng,Yimian Dai

Main category: cs.CV

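TL;DR: This paper proposes DenoDet V2, a new SAR object detection method that uses a band-wise mutual modulation mechanism to deconstruct and modulate features in the transform domain, achieving better denoising.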

Details

Motivation: The primary challenge in synthetic aperture radar (SAR) object detection is the pervasive influence of coherent noise. Existing methods typically achieve implicit denoising by analyzing or enhancing the spatial-domain characteristics of objects, whereas this paper takes an entirely new perspective. Method: A carefully designed attention architecture deconstructs and modulates features in the transform domain, exploiting the complementary nature of amplitude and phase information through band-wise mutual modulation. Result: Experiments on multiple SAR datasets show that DenoDet V2 achieves state-of-the-art performance, improving on DenoDet V1 by 0.8% on the SARDet-100K dataset. Conclusion: DenoDet V2 outperforms DenoDet V1 while halving model complexity. Abstract: One of the primary challenges in Synthetic Aperture Radar (SAR) object detection lies in the pervasive influence of coherent noise. As a common practice, most existing methods, whether handcrafted approaches or deep learning-based methods, employ the analysis or enhancement of object spatial-domain characteristics to achieve implicit denoising. In this paper, we propose DenoDet V2, which explores a completely novel and different perspective to deconstruct and modulate the features in the transform domain via a carefully designed attention architecture. Compared to DenoDet V1, DenoDet V2 is a major advancement that exploits the complementary nature of amplitude and phase information through a band-wise mutual modulation mechanism, which enables a reciprocal enhancement between phase and amplitude spectra. Extensive experiments on various SAR datasets demonstrate the state-of-the-art performance of DenoDet V2. Notably, DenoDet V2 achieves a significant 0.8% improvement on SARDet-100K dataset compared to DenoDet V1, while reducing the model complexity by half. The code is available at https://github.com/GrokCV/GrokSAR.
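
To make the amplitude/phase terminology concrete, the sketch below splits a 1D signal's discrete Fourier spectrum into amplitude and phase and then recombines them. It only illustrates the decomposition that the modulation would operate on; the choice of a plain DFT and all function names are assumptions, and the band-wise attention mechanism itself is not reproduced here.

```python
import cmath


def dft(signal):
    """Naive discrete Fourier transform of a real or complex sequence."""
    n = len(signal)
    return [
        sum(signal[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def idft(spectrum):
    """Inverse of dft()."""
    n = len(spectrum)
    return [
        sum(spectrum[k] * cmath.exp(2j * cmath.pi * k * t / n) for k in range(n)) / n
        for t in range(n)
    ]


def split_amplitude_phase(spectrum):
    """Return (amplitudes, phases) for a list of complex coefficients."""
    polar = [cmath.polar(c) for c in spectrum]
    return [r for r, _ in polar], [phi for _, phi in polar]


def merge_amplitude_phase(amplitude, phase):
    """Rebuild complex coefficients from amplitude and phase lists."""
    return [cmath.rect(r, phi) for r, phi in zip(amplitude, phase)]


signal = [1.0, 2.0, 0.5, -1.0]
amplitude, phase = split_amplitude_phase(dft(signal))
# A modulation step would rescale `amplitude` or shift `phase` here.
reconstructed = idft(merge_amplitude_phase(amplitude, phase))
print([round(value.real, 6) for value in reconstructed])
```

Because the split is lossless, any change in the reconstructed signal comes only from whatever modulation is applied between the split and the merge.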

[70] Skyshield: Event-Driven Submillimetre Thin Obstacle Detection for Drone Flight Safety

Zhengli Zhang,Xinyu Luo,Yuchen Sun,Wenhua Ding,Dongyu Huang,Xinlei Chen

Main category: cs.CV

TL;DR: This paper proposes SkyShield, a lightweight event-driven framework for detecting submillimeter-scale obstacles with high accuracy and low latency.

Details

Motivation: Conventional sensors such as RGB cameras, LiDAR, and depth cameras struggle to detect submillimeter-scale obstacles such as steel wires and kite strings. Method: The approach performs detection on the event stream, using a lightweight U-Net architecture and a Dice-Contour Regularization Loss. Result: Experiments show that the proposed event-driven method achieves a mean F1 score of 0.7088 with a low latency of 21.2 ms. Conclusion: SkyShield is a lightweight, end-to-end, event-driven framework suited to submillimeter-scale obstacle perception on edge and mobile platforms. Abstract: Drones operating in complex environments face a significant threat from thin obstacles, such as steel wires and kite strings at the submillimeter level, which are notoriously difficult for conventional sensors like RGB cameras, LiDAR, and depth cameras to detect. This paper introduces SkyShield, an event-driven, end-to-end framework designed for the perception of submillimeter scale obstacles. Drawing upon the unique features that thin obstacles present in the event stream, our method employs a lightweight U-Net architecture and an innovative Dice-Contour Regularization Loss to ensure precise detection. Experimental results demonstrate that our event-based approach achieves a mean F1 score of 0.7088 with a low latency of 21.2 ms, making it ideal for deployment on edge and mobile platforms.

[71] Autonomous AI Bird Feeder for Backyard Biodiversity Monitoring

El Mustapha Mansouri

Main category: cs.CV

TL;DR: This paper presents a low-cost, on-premise system for autonomous backyard bird monitoring in Belgian urban gardens.

Details

Motivation: The paper demonstrates a low-cost, on-premise system for autonomous backyard bird monitoring in Belgian urban gardens. Method: A motion-triggered IP camera uploads short clips via FTP to a local server, where frames are sampled and birds are localized with Detectron2; the cropped regions are then classified by an EfficientNet-B3 model fine-tuned on a 40-species Belgian subset. Result: Detector-guided cropping improves classification accuracy. The classifier attains high validation performance (about 99.5%) and practical accuracy in the field (top-1 about 88%). Conclusion: The system runs on commodity hardware without a discrete GPU, preserves privacy, and avoids cloud fees, demonstrating the feasibility of citizen-science-grade biodiversity logging at home. Abstract: This paper presents a low-cost, on-premise system for autonomous backyard bird monitoring in Belgian urban gardens. A motion-triggered IP camera uploads short clips via FTP to a local server, where frames are sampled and birds are localized with Detectron2; cropped regions are then classified by an EfficientNet-B3 model fine-tuned on a 40-species Belgian subset derived from a larger Kaggle corpus. All processing runs on commodity hardware without a discrete GPU, preserving privacy and avoiding cloud fees. The physical feeder uses small entry ports (30 mm) to exclude pigeons and reduce nuisance triggers. Detector-guided cropping improves classification accuracy over raw-frame classification. The classifier attains high validation performance on the curated subset (about 99.5 percent) and delivers practical field accuracy (top-1 about 88 percent) on held-out species, demonstrating feasibility for citizen-science-grade biodiversity logging at home.

[72] Waymo-3DSkelMo: A Multi-Agent 3D Skeletal Motion Dataset for Pedestrian Interaction Modeling in Autonomous Driving

Guangxun Zhu,Shiyu Fan,Hang Dai,Edmond S. L. Ho

Main category: cs.CV

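TL;DR: Waymo-3DSkelMo is a large-scale, high-quality 3D skeletal motion dataset built on the Waymo Perception dataset that addresses the shortcomings of existing datasets in temporal continuity and quality.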

Details

Motivation: Existing 3D motion datasets mainly rely on estimating 3D poses from monocular RGB video frames, which suffer from occlusion and a lack of temporal continuity and therefore yield low-quality human motion. Method: 3D human body shape and motion priors are used to extract high-quality 3D pose sequences from raw LiDAR point clouds. Result: The Waymo-3DSkelMo dataset covers over 14,000 seconds of data across more than 800 real driving scenarios, with an average of 27 agents per scene (up to 250 agents in the largest scene), and establishes 3D pose forecasting benchmarks. Conclusion: Waymo-3DSkelMo is a high-quality, temporally coherent 3D skeletal motion dataset that provides an important resource for future research on fine-grained human behavior understanding in complex urban environments. Abstract: Large-scale high-quality 3D motion datasets with multi-person interactions are crucial for data-driven models in autonomous driving to achieve fine-grained pedestrian interaction understanding in dynamic urban environments. However, existing datasets mostly rely on estimating 3D poses from monocular RGB video frames, which suffer from occlusion and lack of temporal continuity, thus resulting in unrealistic and low-quality human motion. In this paper, we introduce Waymo-3DSkelMo, the first large-scale dataset providing high-quality, temporally coherent 3D skeletal motions with explicit interaction semantics, derived from the Waymo Perception dataset. Our key insight is to utilize 3D human body shape and motion priors to enhance the quality of the 3D pose sequences extracted from the raw LiDAR point clouds. The dataset covers over 14,000 seconds across more than 800 real driving scenarios, including rich interactions among an average of 27 agents per scene (with up to 250 agents in the largest scene). Furthermore, we establish 3D pose forecasting benchmarks under varying pedestrian densities, and the results demonstrate its value as a foundational resource for future research on fine-grained human behavior understanding in complex urban environments. The dataset and code will be available at https://github.com/GuangxunZhu/Waymo-3DSkelMo

[73] RampNet: A Two-Stage Pipeline for Bootstrapping Curb Ramp Detection in Streetscape Images from Open Government Metadata

John S. O'Meara,Jared Hwang,Zeyu Wang,Michael Saugstad,Jon E. Froehlich

Main category: cs.CV

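TL;DR: This paper proposes RampNet, a two-stage pipeline that automatically generates a large-scale, high-quality curb ramp detection dataset and trains a modified ConvNeXt V2 model on it, significantly improving curb ramp detection performance.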

Details

Motivation: Robustly detecting curb ramps in images remains an open problem because large-scale, high-quality datasets are lacking. Prior work has tried to improve data availability through crowdsourced or manually labeled data, but these efforts often fall short in quality or scale. Method: RampNet has two stages. Stage 1 generates more than 210,000 annotated Google Street View panoramas by automatically translating government-provided curb ramp location data into pixel coordinates in panoramic images. Stage 2 trains a modified ConvNeXt V2 model on the generated dataset. Result: When both stages of the pipeline are evaluated, the generated dataset achieves 94.0% precision and 92.5% recall, and the detection model reaches 0.9236 AP, far exceeding prior work. Conclusion: The paper presents RampNet, a two-stage pipeline for scaling curb ramp detection datasets and improving model performance, and contributes the first large-scale, high-quality curb ramp detection dataset, benchmark, and model. Abstract: Curb ramps are critical for urban accessibility, but robustly detecting them in images remains an open problem due to the lack of large-scale, high-quality datasets. While prior work has attempted to improve data availability with crowdsourced or manually labeled data, these efforts often fall short in either quality or scale. In this paper, we introduce and evaluate a two-stage pipeline called RampNet to scale curb ramp detection datasets and improve model performance. In Stage 1, we generate a dataset of more than 210,000 annotated Google Street View (GSV) panoramas by auto-translating government-provided curb ramp location data to pixel coordinates in panoramic images. In Stage 2, we train a curb ramp detection model (modified ConvNeXt V2) from the generated dataset, achieving state-of-the-art performance. To evaluate both stages of our pipeline, we compare to manually labeled panoramas. Our generated dataset achieves 94.0% precision and 92.5% recall, and our detection model reaches 0.9236 AP -- far exceeding prior work. Our work contributes the first large-scale, high-quality curb ramp detection dataset, benchmark, and model.

Badi Li,Ren-jie Lu,Yu Zhou,Jingke Meng,Wei-shi Zheng

Main category: cs.CV

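TL;DR: This paper proposes GOAL, a framework that combines a generative flow model with spatial priors from large language models to model the semantics of indoor environments more accurately and with better generalization, significantly improving performance on object goal navigation.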

Details

Motivation: Traditional methods rely on deterministic and discriminative models to complete semantic maps, overlooking the inherent uncertainty in indoor layouts and limiting generalization to unseen environments. A method that better models scene uncertainty is therefore needed. Method: GOAL is a generative flow-based framework that encodes spatial priors inferred from large language models as two-dimensional Gaussian fields and injects them into target maps, enriching contextual knowledge and improving generalization. Result: Experiments show that GOAL achieves state-of-the-art performance on MP3D and Gibson and generalizes well when transferred to HM3D. Conclusion: By bridging observed regions with LLM-enriched full-scene semantic maps, GOAL effectively models the semantic distribution of indoor environments and shows state-of-the-art performance and strong generalization across multiple datasets. Abstract: The Object Goal Navigation (ObjectNav) task challenges agents to locate a specified object in an unseen environment by imagining unobserved regions of the scene. Prior approaches rely on deterministic and discriminative models to complete semantic maps, overlooking the inherent uncertainty in indoor layouts and limiting their ability to generalize to unseen environments. In this work, we propose GOAL, a generative flow-based framework that models the semantic distribution of indoor environments by bridging observed regions with LLM-enriched full-scene semantic maps. During training, spatial priors inferred from large language models (LLMs) are encoded as two-dimensional Gaussian fields and injected into target maps, distilling rich contextual knowledge into the flow model and enabling more generalizable completions. Extensive experiments demonstrate that GOAL achieves state-of-the-art performance on MP3D and Gibson, and shows strong generalization in transfer settings to HM3D. Codes and pretrained models are available at https://github.com/Badi-Li/GOAL.
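
As an illustration of what encoding a spatial prior as a two-dimensional Gaussian field could look like, the sketch below renders an isotropic Gaussian bump on a grid and merges it into a single-channel map with a per-cell maximum. The grid size, the isotropic form, the `sigma` value, and the max-merge rule are assumptions made for illustration. The abstract does not specify GOAL's actual parameterization or injection rule.

```python
import math


def gaussian_field(height, width, center_row, center_col, sigma):
    """Return a height x width grid holding an isotropic Gaussian that peaks at 1.0."""
    field = []
    for r in range(height):
        row = []
        for c in range(width):
            sq_dist = (r - center_row) ** 2 + (c - center_col) ** 2
            row.append(math.exp(-sq_dist / (2.0 * sigma ** 2)))
        field.append(row)
    return field


def inject_prior(target_map, prior_field):
    """Merge a prior into one map channel via per-cell maximum (an assumed rule)."""
    return [
        [max(t, p) for t, p in zip(t_row, p_row)]
        for t_row, p_row in zip(target_map, prior_field)
    ]
```

A soft field like this expresses "the object is probably near here" rather than a hard label, which fits the abstract's emphasis on modeling uncertainty in indoor layouts.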

[75] What-Meets-Where: Unified Learning of Action and Contact Localization in a New Dataset

Yuxiao Wang,Yu Lei,Wolin Liang,Weiying Xue,Zhenao Wei,Nan Zhuang,Qi Liu

Main category: cs.CV

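TL;DR: PaIR-Net addresses a new vision task that jointly predicts action semantics and body-part contact regions, to better understand actions across diverse visual contexts.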

Details

Motivation: Current methods often fail to jointly model action semantics and their spatial contextualization within scenes, so a new vision task is introduced to fill this gap. Method: The PaIR-Net framework comprises three key components: CPAM for identifying contact-relevant body parts, PGCS for pixel-wise contact segmentation, and IIM for integrating global interaction relationships. Result: Experimental evaluation shows that PaIR-Net significantly outperforms baseline methods. The PaIR dataset contains 13,979 images covering 654 actions, 80 object categories, and 17 body parts. Conclusion: PaIR-Net significantly outperforms baselines at predicting action semantics and body-part contact regions, and ablation studies confirm the effectiveness of each architectural component. Abstract: People control their bodies to establish contact with the environment. To comprehensively understand actions across diverse visual contexts, it is essential to simultaneously consider what action is occurring and where it is happening. Current methodologies, however, often inadequately capture this duality, typically failing to jointly model both action semantics and their spatial contextualization within scenes. To bridge this gap, we introduce a novel vision task that simultaneously predicts high-level action semantics and fine-grained body-part contact regions. Our proposed framework, PaIR-Net, comprises three key components: the Contact Prior Aware Module (CPAM) for identifying contact-relevant body parts, the Prior-Guided Concat Segmenter (PGCS) for pixel-wise contact segmentation, and the Interaction Inference Module (IIM) responsible for integrating global interaction relationships. To facilitate this task, we present PaIR (Part-aware Interaction Representation), a comprehensive dataset containing 13,979 images that encompass 654 actions, 80 object categories, and 17 body parts. Experimental evaluation demonstrates that PaIR-Net significantly outperforms baseline approaches, while ablation studies confirm the efficacy of each architectural component. The code and dataset will be released upon publication.

[76] MPT: Motion Prompt Tuning for Micro-Expression Recognition

Jiateng Liu,Hengcan Shi,Feng Chen,Zhiwen Shao,Yaonan Wang,Jianfei Cai,Wenming Zheng

Main category: cs.CV

TL;DR: This paper proposes Motion Prompt Tuning (MPT) to adapt large pre-training models (LMs) for micro-expression recognition (MER) by extracting subtle motions as prompts, outperforming state-of-the-art approaches on three MER datasets.

Details

Motivation: Micro-expression recognition (MER) is vital in affective computing with applications in medical diagnosis, lie detection, and criminal investigation. However, ME datasets are limited by scarce training samples due to the need for expert annotations. While large pre-training models (LMs) offer strong general representations, they struggle to capture the subtle and transient facial movements critical for MER. Method: The authors propose Motion Prompt Tuning (MPT), which includes motion prompt generation via motion magnification and Gaussian tokenization to extract subtle motion cues. A group adapter is also designed and integrated into the LM to enhance its performance in the MER domain. Result: Extensive experiments on three widely used MER datasets demonstrate that MPT consistently outperforms state-of-the-art methods, validating its effectiveness in capturing subtle facial movements for MER. Conclusion: Motion Prompt Tuning (MPT) is a novel and effective approach for adapting large pre-training models to micro-expression recognition, addressing the challenge of scarce annotated data and improving the model's ability to capture subtle facial dynamics. Abstract: Micro-expression recognition (MER) is crucial in the affective computing field due to its wide application in medical diagnosis, lie detection, and criminal investigation. Despite its significance, obtaining micro-expression (ME) annotations is challenging due to the expertise required from psychological professionals. Consequently, ME datasets often suffer from a scarcity of training samples, severely constraining the learning of MER models. While current large pre-training models (LMs) offer general and discriminative representations, their direct application to MER is hindered by an inability to capture transitory and subtle facial movements, which are essential for effective MER.
This paper introduces Motion Prompt Tuning (MPT) as a novel approach to adapting LMs for MER, representing a pioneering method for subtle motion prompt tuning. Particularly, we introduce motion prompt generation, including motion magnification and Gaussian tokenization, to extract subtle motions as prompts for LMs. Additionally, a group adapter is carefully designed and inserted into the LM to enhance it in the target MER domain, facilitating a more nuanced distinction of ME representation. Furthermore, extensive experiments conducted on three widely used MER datasets demonstrate that our proposed MPT consistently surpasses state-of-the-art approaches and verifies its effectiveness.

[77] RASR: Retrieval-Augmented Super Resolution for Practical Reference-based Image Restoration

Jiaqi Yan,Shuning Xu,Xiangyu Chen,Dell Zhang,Jie Tang,Gangshan Wu,Jie Liu

Main category: cs.CV

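TL;DR: RASR is a new RefSR paradigm that automatically retrieves semantically relevant high-resolution reference images to enhance a low-quality input, enabling scalable and flexible RefSR in practical use cases.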

Details

Motivation: Existing RefSR methods rely on manually curated target-reference image pairs, which severely limits their practicality in real-world scenarios. Method: RASRNet combines a semantic reference retriever with a diffusion-based RefSR generator to produce more realistic textures. Result: On the RASR-Flickr30 dataset, RASRNet improves PSNR by 0.38 dB and lowers LPIPS by 0.0131 relative to SISR baselines, while generating more realistic textures. Conclusion: By combining a semantic reference retriever with a diffusion-based RefSR generator, RASRNet produces more realistic textures and outperforms SISR baselines on RASR-Flickr30. Abstract: Reference-based Super Resolution (RefSR) improves upon Single Image Super Resolution (SISR) by leveraging high-quality reference images to enhance texture fidelity and visual realism. However, a critical limitation of existing RefSR approaches is their reliance on manually curated target-reference image pairs, which severely constrains their practicality in real-world scenarios. To overcome this, we introduce Retrieval-Augmented Super Resolution (RASR), a new and practical RefSR paradigm that automatically retrieves semantically relevant high-resolution images from a reference database given only a low-quality input. This enables scalable and flexible RefSR in realistic use cases, such as enhancing mobile photos taken in environments like zoos or museums, where category-specific reference data (e.g., animals, artworks) can be readily collected or pre-curated. To facilitate research in this direction, we construct RASR-Flickr30, the first benchmark dataset designed for RASR. Unlike prior datasets with fixed target-reference pairs, RASR-Flickr30 provides per-category reference databases to support open-world retrieval. We further propose RASRNet, a strong baseline that combines a semantic reference retriever with a diffusion-based RefSR generator. It retrieves relevant references based on semantic similarity and employs a diffusion-based generator enhanced with semantic conditioning. Experiments on RASR-Flickr30 demonstrate that RASRNet consistently improves over SISR baselines, achieving +0.38 dB PSNR and -0.0131 LPIPS, while generating more realistic textures.
These findings highlight retrieval augmentation as a promising direction to bridge the gap between academic RefSR research and real-world applicability.
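
The retrieval step described in the abstract can be pictured as a nearest-neighbor lookup in an embedding space. The sketch below ranks reference embeddings by cosine similarity to a query embedding and returns the top k. The toy 3-D vectors, the ids, and the function names are hypothetical. The abstract says only that references are retrieved "based on semantic similarity" and does not name the encoder or the similarity measure.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def retrieve_top_k(query, reference_db, k):
    """Return the ids of the k reference embeddings most similar to the query."""
    scored = sorted(
        reference_db.items(),
        key=lambda item: cosine_similarity(query, item[1]),
        reverse=True,
    )
    return [ref_id for ref_id, _ in scored[:k]]


# Toy per-category reference database; ids and vectors are made up.
reference_db = {
    "zebra_01": [0.9, 0.1, 0.0],
    "zebra_02": [0.8, 0.2, 0.1],
    "statue_01": [0.0, 0.1, 0.9],
}
```

In a full pipeline the retrieved images, not their ids, would be passed to the generator as conditioning. That part is outside the scope of this sketch.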

[78] HyperKD: Distilling Cross-Spectral Knowledge in Masked Autoencoders via Inverse Domain Shift with Spatial-Aware Masking and Specialized Loss

Abdul Matin,Tanjim Bin Faruk,Shrideep Pallickara,Sangmi Lee Pallickara

Main category: cs.CV

TL;DR: HyperKD is a knowledge distillation framework that bridges the spectral domain gap in hyperspectral remote sensing, enabling more effective use of foundation models for geospatial applications.

Details

Motivation: The challenge of applying foundation models to hyperspectral remote sensing due to spectral disparities and limited data availability motivates the development of HyperKD. Method: HyperKD uses a knowledge distillation framework with a Masked Autoencoder, incorporating spectral range-based channel alignment, spatial feature-guided masking, and an enhanced loss function. Result: HyperKD improves representation learning in MAEs, resulting in better reconstruction fidelity and performance on downstream tasks like land cover classification and soil organic carbon prediction. Conclusion: HyperKD successfully bridges the spectral domain gap in hyperspectral remote sensing by leveraging knowledge distillation, enhancing the applicability of foundation models in geospatial tasks. Abstract: The proliferation of foundation models, pretrained on large-scale unlabeled datasets, has emerged as an effective approach in creating adaptable and reusable architectures that can be leveraged for various downstream tasks using satellite observations. However, their direct application to hyperspectral remote sensing remains challenging due to inherent spectral disparities and the scarcity of available observations. In this work, we present HyperKD, a novel knowledge distillation framework that enables transferring learned representations from a teacher model into a student model for effective development of a foundation model on hyperspectral images. Unlike typical knowledge distillation frameworks, which use a complex teacher to guide a simpler student, HyperKD enables an inverse form of knowledge transfer across different types of spectral data, guided by a simpler teacher model. Building upon a Masked Autoencoder, HyperKD distills knowledge from the Prithvi foundational model into a student tailored for EnMAP hyperspectral imagery. 
HyperKD addresses the inverse domain adaptation problem with spectral gaps by introducing a feature-based strategy that includes spectral range-based channel alignment, spatial feature-guided masking, and an enhanced loss function tailored for hyperspectral images. HyperKD bridges the substantial spectral domain gap, enabling the effective use of pretrained foundation models for geospatial applications. Extensive experiments show that HyperKD significantly improves representation learning in MAEs, leading to enhanced reconstruction fidelity and more robust performance on downstream tasks such as land cover classification, crop type identification, and soil organic carbon prediction, underpinning the potential of knowledge distillation frameworks in remote sensing analytics with hyperspectral imagery.

[79] Animate-X++: Universal Character Image Animation with Dynamic Backgrounds

Shuai Tan,Biao Gong,Zhuoxin Liu,Yan Wang,Xi Chen,Yifan Feng,Hengshuang Zhao

Main category: cs.CV

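TL;DR: This paper proposes Animate-X++, a DiT-based universal character animation framework that strengthens motion representation with a Pose Indicator and uses multi-task training to generate dynamic backgrounds, significantly improving animation quality and applicability for anthropomorphic characters.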

Details

Motivation: Existing character image animation methods mainly work on human figures, generalize poorly to anthropomorphic characters, and can only generate videos with static backgrounds, which limits realism. This paper aims to address both problems and improve the quality and applicability of animation generation. Method: Animate-X++ builds on the DiT framework and introduces a Pose Indicator that captures motion patterns from the driving video in both implicit and explicit ways. It also adopts a multi-task training strategy that jointly trains the animation and TI2V tasks to achieve text-driven background dynamics. Result: Animate-X++ performs strongly in both generality and animation quality, generating high-quality anthropomorphic character animations with dynamic backgrounds, and its effectiveness is validated on the newly proposed A2Bench benchmark. Conclusion: Animate-X++ is a universal animation framework applicable to various character types, including anthropomorphic characters. By introducing the Pose Indicator and a multi-task training strategy, it achieves more realistic, higher-quality character image animation. Abstract: Character image animation, which generates high-quality videos from a reference image and target pose sequence, has seen significant progress in recent years. However, most existing methods only apply to human figures and usually do not generalize well to anthropomorphic characters commonly used in industries like gaming and entertainment. Furthermore, previous methods could only generate videos with static backgrounds, which limits the realism of the videos. For the first challenge, our in-depth analysis attributes this limitation to their insufficient modeling of motion, which is unable to comprehend the movement pattern of the driving video, thus imposing a pose sequence rigidly onto the target character. To this end, this paper proposes Animate-X++, a universal animation framework based on DiT for various character types, including anthropomorphic characters. To enhance motion representation, we introduce the Pose Indicator, which captures comprehensive motion patterns from the driving video in both implicit and explicit manners. The former leverages CLIP visual features of a driving video to extract its gist of motion, like the overall movement pattern and temporal relations among motions, while the latter strengthens the generalization of DiT by simulating possible inputs in advance that may arise during inference. For the second challenge, we introduce a multi-task training strategy that jointly trains the animation and TI2V tasks.
Combined with the proposed partial parameter training, this approach achieves not only character animation but also text-driven background dynamics, making the videos more realistic. Moreover, we introduce a new Animated Anthropomorphic Benchmark (A2Bench) to evaluate the performance of Animate-X++ on universal and widely applicable animation images. Extensive experiments demonstrate the superiority and effectiveness of Animate-X++.

[80] IAG: Input-aware Backdoor Attack on VLMs for Visual Grounding

Junxian Li,Beining Xu,Di Zhang

Main category: cs.CV

TL;DR: This paper proposes IAG, an input-aware backdoor attack for vision-language models, effectively manipulating visual grounding with high success rates and stealthiness.

Details

Motivation: Security vulnerabilities in vision-language models (VLMs), particularly in visual grounding tasks, remain underexplored in the context of backdoor attacks. Method: An adaptive trigger generator using a text-conditional U-Net embeds semantic attack information, while a reconstruction loss ensures visual stealthiness. Result: ASR@0.5 on InternVL-2.5-8B reaches over 65% across various testing sets, with minimal accuracy drop on clean samples. Conclusion: IAG demonstrates a robust and transferable input-aware backdoor attack method against VLMs with high ASR and minimal impact on accuracy. Abstract: Vision-language models (VLMs) have shown significant advancements in tasks such as visual grounding, where they localize specific objects in images based on natural language queries. However, security issues in visual grounding tasks for VLMs remain underexplored, especially in the context of backdoor attacks. In this paper, we introduce a novel input-aware backdoor attack method, IAG, designed to manipulate the grounding behavior of VLMs. This attack forces the model to ground a specific target object in the input image, regardless of the user's query. We propose an adaptive trigger generator that embeds the semantic information of the attack target's description into the original image using a text-conditional U-Net, thereby overcoming the open-vocabulary attack challenge. To ensure the attack's stealthiness, we utilize a reconstruction loss to minimize visual discrepancies between poisoned and clean images. Additionally, we introduce a unified method for generating attack data. IAG is evaluated theoretically and empirically, demonstrating its feasibility and effectiveness. Notably, our ASR@0.5 on InternVL-2.5-8B reaches over 65% on various testing sets. IAG also shows promising potential in manipulating Ferret-7B and LLaVA-1.5-7B with very little accuracy decrease on clean samples.
Extensive specific experiments, such as ablation study and potential defense, also indicate the robustness and transferability of our attack.

[81] RelayFormer: A Unified Local-Global Attention Framework for Scalable Image and Video Manipulation Localization

Wen Huang,Jiarui Yang,Tao Dai,Jiawei Li,Shaoxiong Zhan,Bin Wang,Shu-Tao Xia

Main category: cs.CV

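TL;DR: RelayFormer is a new visual manipulation localization framework that handles images and videos in a unified way, with strong cross-modal generalization and efficient processing.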

Details

Motivation: Existing visual manipulation localization methods often lack cross-modal generalization and struggle to handle high-resolution or long-duration inputs efficiently, so a more efficient and general solution is needed. Method: RelayFormer uses flexible local units and a Global-Local Relay Attention (GLoRA) mechanism, integrates with existing Transformer-based backbones through lightweight adaptation modules, and adds a lightweight query-based mask decoder that supports one-shot inference across video sequences. Result: RelayFormer achieves state-of-the-art localization performance on multiple benchmarks while efficiently processing high-resolution images and long video sequences. Conclusion: RelayFormer is a unified, modular architecture for visual manipulation localization in images and videos. Its flexible local units and Global-Local Relay Attention mechanism enable scalable, resolution-agnostic processing, and it integrates with existing Transformer backbones. Abstract: Visual manipulation localization (VML) -- across both images and videos -- is a crucial task in digital forensics that involves identifying tampered regions in visual content. However, existing methods often lack cross-modal generalization and struggle to handle high-resolution or long-duration inputs efficiently. We propose RelayFormer, a unified and modular architecture for visual manipulation localization across images and videos. By leveraging flexible local units and a Global-Local Relay Attention (GLoRA) mechanism, it enables scalable, resolution-agnostic processing with strong generalization. Our framework integrates seamlessly with existing Transformer-based backbones, such as ViT and SegFormer, via lightweight adaptation modules that require only minimal architectural changes, ensuring compatibility without disrupting pretrained representations. Furthermore, we design a lightweight, query-based mask decoder that supports one-shot inference across video sequences with linear complexity. Extensive experiments across multiple benchmarks demonstrate that our approach achieves state-of-the-art localization performance, setting a new baseline for scalable and modality-agnostic VML. Code is available at: https://github.com/WenOOI/RelayFormer.

[82] Gen-AFFECT: Generation of Avatar Fine-grained Facial Expressions with Consistent identiTy

Hao Yu,Rupayan Mallick,Margrit Betke,Sarah Adel Bargal

Main category: cs.CV

TL;DR: GEN-AFFECT is a new framework for personalized avatar generation that successfully maintains identity while generating a wide range of facial expressions.

Details

Motivation: Existing approaches for customized 2D avatars often fail to capture fine-grained facial expressions and struggle to preserve identity across different expressions. Method: GEN-AFFECT conditions a multimodal diffusion transformer on an extracted identity-expression representation and employs consistent attention at inference for information sharing across generated expressions. Result: GEN-AFFECT demonstrates superior performance compared to previous state-of-the-art methods in terms of accuracy of generated expressions, preservation of identity, and consistency of target identity across fine-grained facial expressions. Conclusion: GEN-AFFECT is a successful framework for generating expressive and identity-consistent avatars with a diverse set of facial expressions, outperforming previous state-of-the-art methods. Abstract: Different forms of customized 2D avatars are widely used in gaming applications, virtual communication, education, and content creation. However, existing approaches often fail to capture fine-grained facial expressions and struggle to preserve identity across different expressions. We propose GEN-AFFECT, a novel framework for personalized avatar generation that generates expressive and identity-consistent avatars with a diverse set of facial expressions. Our framework proposes conditioning a multimodal diffusion transformer on an extracted identity-expression representation. This enables identity preservation and representation of a wide range of facial expressions. GEN-AFFECT additionally employs consistent attention at inference for information sharing across the set of generated expressions, enabling the generation process to maintain identity consistency over the array of generated fine-grained expressions. 
GEN-AFFECT demonstrates superior performance compared to previous state-of-the-art methods on the basis of the accuracy of the generated expressions, the preservation of the identity and the consistency of the target identity across an array of fine-grained facial expressions.

[83] Event-driven Robust Fitting on Neuromorphic Hardware

Tam Ngoc-Bang Nguyen,Anh-Dzung Doan,Zhipeng Cai,Tat-Jun Chin

Main category: cs.CV

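TL;DR: This paper presents a novel spiking neural network on Intel Loihi 2 neuromorphic hardware for energy-efficient robust fitting of geometric models, consuming only 15% of the energy of a standard CPU algorithm.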

Details

Motivation: Energy efficiency has received little attention in robust fitting. As high energy consumption becomes a growing concern for AI applications, exploring more energy-efficient robust fitting methods is increasingly important. Method: The authors design a novel spiking neural network for robust fitting on real neuromorphic hardware (Intel Loihi 2). To enable this, they propose new event-driven formulations of model estimation, along with algorithmic strategies that mitigate the hardware's limited precision and instruction set. Result: Experiments show that the proposed neuromorphic robust fitting consumes only 15% of the energy required by the established algorithm running on a standard CPU, at equivalent accuracy. Conclusion: Using the neuromorphic computing paradigm on Intel Loihi 2, the proposed method matches the accuracy of the established CPU algorithm at 15% of its energy, indicating that neuromorphic computing has strong potential for energy-efficient robust fitting. Abstract: Robust fitting of geometric models is a fundamental task in many computer vision pipelines. Numerous innovations have been produced on the topic, from improving the efficiency and accuracy of random sampling heuristics to generating novel theoretical insights that underpin new approaches with mathematical guarantees. However, one aspect of robust fitting that has received little attention is energy efficiency. This performance metric has become critical as high energy consumption is a growing concern for AI adoption. In this paper, we explore energy-efficient robust fitting via the neuromorphic computing paradigm. Specifically, we designed a novel spiking neural network for robust fitting on real neuromorphic hardware, the Intel Loihi 2. Enabling this are novel event-driven formulations of model estimation that allow robust fitting to be implemented in the unique architecture of Loihi 2, and algorithmic strategies to alleviate the current limited precision and instruction set of the hardware. Results show that our neuromorphic robust fitting consumes only a fraction (15%) of the energy required to run the established robust fitting algorithm on a standard CPU to equivalent accuracy.

[84] CitySeg: A 3D Open Vocabulary Semantic Segmentation Foundation Model in City-scale Scenarios

Jialei Xu,Zizhuang Wei,Weikang You,Linyun Li,Weijian Sun

Main category: cs.CV

TL;DR: CitySeg introduces a novel framework for semantic segmentation of city-scale point clouds using text modality, achieving superior performance and enabling zero-shot inference without visual data.

Details

Motivation: Existing models are limited by the small scale of 3D data and domain gaps between datasets, which reduce generalization capability. The goal is to achieve comprehensive 3D understanding without relying on visual information. Method: CitySeg uses a local-global cross-attention network and a hierarchical classification strategy with a graph encoder to model category relationships. It also employs a two-stage training strategy and hinge loss to improve feature separability. Result: CitySeg achieves state-of-the-art performance on nine closed-set benchmarks and enables zero-shot generalization in city-scale point cloud scenarios without visual information. Conclusion: CitySeg is a foundation model for city-scale point cloud semantic segmentation that achieves open vocabulary segmentation and zero-shot inference, overcoming limitations such as limited 3D data scale, domain gaps, and semantic label discrepancies. Abstract: Semantic segmentation of city-scale point clouds is a critical technology for Unmanned Aerial Vehicle (UAV) perception systems, enabling the classification of 3D points without relying on any visual information to achieve comprehensive 3D understanding. However, existing models are frequently constrained by the limited scale of 3D data and the domain gap between datasets, which lead to reduced generalization capability. To address these challenges, we propose CitySeg, a foundation model for city-scale point cloud semantic segmentation that incorporates text modality to achieve open vocabulary segmentation and zero-shot inference. Specifically, in order to mitigate the issue of non-uniform data distribution across multiple domains, we customize the data preprocessing rules, and propose a local-global cross-attention network to enhance the perception capabilities of point networks in UAV scenarios. To resolve semantic label discrepancies across datasets, we introduce a hierarchical classification strategy. 
A hierarchical graph established according to the data annotation rules consolidates the data labels, and the graph encoder is used to model the hierarchical relationships between categories. In addition, we propose a two-stage training strategy and employ hinge loss to increase the feature separability of subcategories. Experimental results demonstrate that the proposed CitySeg achieves state-of-the-art (SOTA) performance on nine closed-set benchmarks, significantly outperforming existing approaches. Moreover, for the first time, CitySeg enables zero-shot generalization in city-scale point cloud scenarios without relying on visual information.

[85] Leveraging Failed Samples: A Few-Shot and Training-Free Framework for Generalized Deepfake Detection

Shibo Yao,Renshuai Tao,Xiaolong Zheng,Chao Liang,Chunjie Zhang

Main category: cs.CV

TL;DR: This paper proposes FTNet, a few-shot deepfake detection method that uses only one fake sample and achieves strong performance without any training.

Details

Motivation: The paper addresses the few-shot challenge in deepfake detection: when a model struggles to generalize, the failed samples can be used to improve performance. Method: FTNet uses only one fake sample from the evaluation set and classifies each test sample by comparing it to the known real and fake samples, with no training or parameter updates. Result: Average performance across 29 generative models improves by 8.7% over existing methods. Conclusion: The paper proposes FTNet, a new training-free method for real-world few-shot deepfake detection that achieves new SoTA performance across 29 different generative models. Abstract: Recent deepfake detection studies often treat unseen sample detection as a "zero-shot" task, training on images generated by known models but generalizing to unknown ones. A key real-world challenge arises when a model performs poorly on unknown samples, yet these samples remain available for analysis. This highlights that it should be approached as a "few-shot" task, where effectively utilizing a small number of samples can lead to significant improvement. Unlike typical few-shot tasks focused on semantic understanding, deepfake detection prioritizes image realism, which closely mirrors real-world distributions. In this work, we propose the Few-shot Training-free Network (FTNet) for real-world few-shot deepfake detection. Simple yet effective, FTNet differs from traditional methods that rely on large-scale known data for training. Instead, FTNet uses only one fake sample from an evaluation set, mimicking the scenario where new samples emerge in the real world and can be gathered for use, without any training or parameter updates. During evaluation, each test sample is compared to the known fake and real samples, and it is classified based on the category of the nearest sample. We conduct a comprehensive analysis of AI-generated images from 29 different generative models and achieve a new SoTA performance, with an average improvement of 8.7% compared to existing methods. This work introduces a fresh perspective on real-world deepfake detection: when the model struggles to generalize on a few-shot sample, leveraging the failed samples leads to better performance.
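
The decision rule described in the abstract, classifying a test sample by the category of its nearest known sample, amounts to 1-nearest-neighbor classification. A minimal sketch follows, assuming samples are already represented as feature vectors and that Euclidean distance is the metric. The feature extractor, the metric, and the toy vectors are assumptions, not details taken from the paper.

```python
import math


def nearest_label(test_feature, labeled_features):
    """Return the label of the reference feature closest to test_feature."""
    best_label, best_dist = None, math.inf
    for feature, label in labeled_features:
        dist = math.dist(test_feature, feature)  # Euclidean distance (assumed metric)
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label


# One known fake sample plus known real samples (toy 2-D features, made up).
references = [
    ([0.9, 0.8], "fake"),
    ([0.1, 0.2], "real"),
    ([0.2, 0.1], "real"),
]
```

No parameters are fitted anywhere in this rule, which is what makes the approach training-free: adding a newly observed fake sample only means appending one more entry to the reference list.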

[86] From Large Angles to Consistent Faces: Identity-Preserving Video Generation via Mixture of Facial Experts

Yuji Wang,Moran Li,Xiaobin Hu,Ran Yi,Jiangning Zhang,Chengming Xu,Weijian Cao,Yabiao Wang,Chengjie Wang,Lizhuang Ma

Main category: cs.CV

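TL;DR: This study proposes a video generation model that uses a Mixture of Facial Experts (MoFE) and the LFA dataset to address identity preservation under large facial angles, outperforming existing methods.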

Details

Motivation: Current video generation models struggle to preserve identity under large facial angles, mainly because they lack an effective mechanism for integrating identity features and because existing datasets contain too few large-angle samples. Method: The paper introduces a Mixture of Facial Experts (MoFE) that combines complementary features from three experts (identity, semantic, and detail), and builds the LFA dataset with a tailored data processing pipeline centered on Face Constraints and Identity Consistency. Result: On the LFA benchmark, the method significantly outperforms prior SOTA methods in face similarity, face FID, and CLIP semantic alignment. Conclusion: The paper proposes a new video generation method that outperforms existing techniques on the Large Face Angles (LFA) dataset, effectively addressing identity preservation and coverage of large facial angles. Abstract: Current video generation models struggle with identity preservation under large facial angles, primarily facing two challenges: the difficulty in exploring an effective mechanism to integrate identity features into DiT structure, and the lack of targeted coverage of large facial angles in existing open-source video datasets. To address these, we present two key innovations. First, we introduce a Mixture of Facial Experts (MoFE) that dynamically combines complementary cues from three specialized experts, each designed to capture distinct but mutually reinforcing aspects of facial attributes. The identity expert captures cross-pose identity-sensitive features, the semantic expert extracts high-level visual semantics, and the detail expert preserves pixel-level features (e.g., skin texture, color gradients). Furthermore, to mitigate dataset limitations, we have tailored a data processing pipeline centered on two key aspects: Face Constraints and Identity Consistency. Face Constraints ensure facial angle diversity and a high proportion of facial regions, while Identity Consistency preserves coherent person-specific features across temporal sequences, collectively addressing the scarcity of large facial angles and identity-stable training data in existing datasets. Leveraging this pipeline, we have curated and refined a Large Face Angles (LFA) Dataset from existing open-source human video datasets, comprising 460K video clips with annotated facial angles. Experimental results on the LFA benchmark demonstrate that our method, empowered by the LFA dataset, significantly outperforms prior SOTA methods in face similarity, face FID, and CLIP semantic alignment.
The code and dataset will be made publicly available at https://github.com/rain152/LFA-Video-Generation.

[87] CLIP-Flow: A Universal Discriminator for AI-Generated Images Inspired by Anomaly Detection

Zhipeng Yuan,Kai Wang,Weize Quan,Dong-Ming Yan,Tieru Wu

Main category: cs.CV

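TL;DR: This paper proposes a universal AI-generated image detection method based on anomaly detection and unsupervised learning. It needs no access to AI-generated images for training and detects images from a variety of generators well.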

Details

Motivation: Existing AI-generated image detectors perform poorly on images from unseen generative models, so a more universal detection method is needed. Method: The paper proposes a universal AI-generated image detection method based on anomaly detection. It uses a pre-trained CLIP encoder as the feature extractor and designs a normalizing flow-like unsupervised model, trained by minimizing the likelihood of proxy images, optionally combined with maximizing the likelihood of natural images. Result: The proposed method shows good detection performance on AI-generated images produced by various image generators. Conclusion: Experiments show that the proposed method performs well at detecting AI-generated images produced by various image generators. Abstract: With the rapid advancement of AI generative models, the visual quality of AI-generated images (AIIs) has become increasingly close to natural images, which inevitably raises security concerns. Most AII detectors employ the conventional image classification pipeline with natural images and AIIs (generated by a generative model), which can result in limited detection performance for AIIs from unseen generative models. To solve this, we propose a universal AI-generated image detector from the perspective of anomaly detection. Our discriminator does not need to access any AIIs and learns a generalizable representation with unsupervised learning. Specifically, we use the pre-trained CLIP encoder as the feature extractor and design a normalizing flow-like unsupervised model. Instead of AIIs, proxy images, e.g., obtained by applying a spectral modification operation on natural images, are used for training. Our models are trained by minimizing the likelihood of proxy images, optionally combined with maximizing the likelihood of natural images. Extensive experiments demonstrate the effectiveness of our method on AIIs produced by various image generators.
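
The training objective stated in the abstract can be written down compactly. The sketch below assumes the flow-like model exposes a per-image log-likelihood and uses a hypothetical weight, `natural_weight`, for the optional natural-image term. The function name, the use of batch means, and the weighting are assumptions. The abstract does not give the actual loss form used in CLIP-Flow.

```python
def likelihood_contrast_loss(proxy_log_likelihoods, natural_log_likelihoods=None,
                             natural_weight=1.0):
    """Loss that decreases as proxy likelihood falls and natural likelihood rises."""
    # Minimizing the mean proxy log-likelihood pushes proxy images toward low likelihood.
    loss = sum(proxy_log_likelihoods) / len(proxy_log_likelihoods)
    # Optional term: subtracting the natural mean rewards high natural-image likelihood.
    if natural_log_likelihoods:
        mean_natural = sum(natural_log_likelihoods) / len(natural_log_likelihoods)
        loss -= natural_weight * mean_natural
    return loss
```

Under an objective of this shape, one would expect natural images to score a higher likelihood than generated ones after training, which matches the anomaly-detection framing in the abstract.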

[88] GazeLT: Visual attention-guided long-tailed disease classification in chest radiographs

Moinak Bhattacharya,Gagandeep Singh,Shubham Jain,Prateek Prasanna

Main category: cs.CV

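TL;DR: GazeLT is a new method that uses radiologists' eye gaze patterns to improve long-tailed disease classification, performing strongly on two large medical imaging datasets.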

Details

Motivation: Radiologists show distinct gaze patterns when interpreting medical images, and these patterns capture both fine-grained and coarse disease-related information. Radiologists also attend to minor findings during interpretation, some of which form the long-tailed classes of the data distribution. Method: GazeLT uses radiologists' eye gaze data and, through an integration and disintegration mechanism, exploits the temporal aspect of the visual search process to improve long-tailed disease classification. Result: GazeLT is shown to be effective on two public long-tailed disease classification datasets (NIH-CXR-LT and MIMIC-CXR-LT), outperforming the best long-tailed loss by 4.1% and the visual attention-based baseline by 21.7% in average accuracy. Conclusion: GazeLT is a visual attention integration-disintegration approach for long-tailed disease classification that improves performance through its integration and disintegration mechanism. Abstract: In this work, we present GazeLT, a human visual attention integration-disintegration approach for long-tailed disease classification. A radiologist's eye gaze has distinct patterns that capture both fine-grained and coarser-level disease-related information. While interpreting an image, a radiologist's attention varies throughout the duration; it is critical to incorporate this into a deep learning framework to improve automated image interpretation. Another important aspect of visual attention is that apart from looking at major/obvious disease patterns, experts also look at minor/incidental findings (a few of these constituting long-tailed classes) during the course of image interpretation. GazeLT harnesses the temporal aspect of the visual search process, via an integration and disintegration mechanism, to improve long-tailed disease classification. We show the efficacy of GazeLT on two publicly available datasets for long-tailed disease classification, namely the NIH-CXR-LT (n=89237) and the MIMIC-CXR-LT (n=111898) datasets. GazeLT outperforms the best long-tailed loss by 4.1% and the visual attention-based baseline by 21.7% in average accuracy metrics for these datasets. Our code is available at https://github.com/lordmoinak1/gazelt.

[89] SkySplat: Generalizable 3D Gaussian Splatting from Multi-Temporal Sparse Satellite Images

Xuejun Huang,Xinyi Liu,Yi Wan,Zhi Zheng,Bin Zhang,Mingtao Xiong,Yingying Pei,Yongjun Zhang

Main category: cs.CV

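TL;DR: SkySplat proposes a new self-supervised framework that integrates the RPC model into a generalizable 3DGS pipeline, substantially improving 3D scene reconstruction from sparse-view satellite images.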

Details

Motivation: 3D scene reconstruction from sparse-view satellite images is challenging because existing methods are incompatible with rational polynomial coefficient (RPC) models and generalize poorly. Generalizable 3DGS methods show potential but perform poorly on multi-temporal sparse satellite images. Method: SkySplat uses a Cross-Self Consistency Module (CSCM) and a multi-view consistency aggregation strategy to mitigate transient object interference and refine reconstruction results, relying only on RGB images and radiometric-robust relative height supervision. Result: SkySplat reduces MAE from 13.18 m to 1.80 m on the DFC19 dataset, demonstrates strong cross-dataset generalization on the MVS3D benchmark, and achieves an 86 times speedup over EOGS with higher accuracy. Conclusion: SkySplat is a new self-supervised framework that integrates the RPC model into a generalizable 3DGS pipeline, making more effective use of sparse geometric cues and thereby improving 3D scene reconstruction from sparse-view satellite images. Abstract: Three-dimensional scene reconstruction from sparse-view satellite images is a long-standing and challenging task. While 3D Gaussian Splatting (3DGS) and its variants have recently attracted attention for their high efficiency, existing methods remain unsuitable for satellite images due to incompatibility with rational polynomial coefficient (RPC) models and limited generalization capability. Recent advances in generalizable 3DGS approaches show potential, but they perform poorly on multi-temporal sparse satellite images due to limited geometric constraints, transient objects, and radiometric inconsistencies. To address these limitations, we propose SkySplat, a novel self-supervised framework that integrates the RPC model into the generalizable 3DGS pipeline, enabling more effective use of sparse geometric cues for improved reconstruction. SkySplat relies only on RGB images and radiometric-robust relative height supervision, thereby eliminating the need for ground-truth height maps. Key components include a Cross-Self Consistency Module (CSCM), which mitigates transient object interference via consistency-based masking, and a multi-view consistency aggregation strategy that refines reconstruction results. Compared to per-scene optimization methods, SkySplat achieves an 86 times speedup over EOGS with higher accuracy. It also outperforms generalizable 3DGS baselines, significantly reducing MAE from 13.18 m to 1.80 m on the DFC19 dataset, and demonstrates strong cross-dataset generalization on the MVS3D benchmark.

[90] COME: Dual Structure-Semantic Learning with Collaborative MoE for Universal Lesion Detection Across Heterogeneous Ultrasound Datasets

Lingyu Chen,Yawen Zeng,Yue Wang,Peng Wan,Guo-chen Ning,Hongen Liao,Daoqiang Zhang,Fang Chen

Main category: cs.CV

TL;DR: This paper proposes COME, a universal framework for multi-heterogeneous ultrasound datasets, which effectively addresses inter-dataset interference while preserving dataset-specific features, resulting in robust generalization and superior performance.

Details

Motivation: Conventional single-dataset training fails in ultrasound image analysis due to limited data, acoustic shadows, and speckle noise, necessitating a universal framework for multi-heterogeneous datasets. Method: COME employs a dual structure-semantic shared expert system that collaborates with source-specific experts to extract discriminative features and mitigate inter-dataset interference. Result: Extensive experiments showed that COME outperformed state-of-the-art methods, achieving significant improvements in mean AP across three evaluation modes: single-dataset, intra-organ, and inter-organ integration datasets. Conclusion: The proposed COME framework demonstrates robust generalization and superior performance in handling multi-heterogeneous ultrasound datasets, particularly in small-batch or unseen data scenarios. Abstract: Conventional single-dataset training often fails with new data distributions, especially in ultrasound (US) image analysis due to limited data, acoustic shadows, and speckle noise. Therefore, constructing a universal framework for multi-heterogeneous US datasets is imperative. However, a key challenge arises: how to effectively mitigate inter-dataset interference while preserving dataset-specific discriminative features for robust downstream tasks? Previous approaches utilize either a single source-specific decoder or a domain adaptation strategy, but these methods suffer a decline in performance when applied to other domains. Considering this, we propose a Universal Collaborative Mixture of Heterogeneous Source-Specific Experts (COME). Specifically, COME establishes dual structure-semantic shared experts that create a universal representation space and then collaborate with source-specific experts to extract discriminative features by providing complementary features. 
This design enables robust generalization by leveraging cross-dataset experience distributions and providing universal US priors for small-batch or unseen data scenarios. Extensive experiments under three evaluation modes (single-dataset, intra-organ, and inter-organ integration datasets) demonstrate COME's superiority, achieving significant mean AP improvements over state-of-the-art methods. Our project is available at: https://universalcome.github.io/UniversalCOME/.

[91] Episodic Memory Representation for Long-form Video Understanding

Yun Wang,Long Zhang,Jingren Liu,Jiaqi Yan,Zhanjie Zhang,Jiahao Zheng,Xun Yang,Dapeng Wu,Xiangyu Chen,Xuelong Li

Main category: cs.CV

TL;DR: Video-EM improves video question answering by modeling keyframes as episodic events and using LLM-based reasoning, achieving better performance with fewer frames.

Details

Motivation: Existing Video-LLM approaches struggle with long-form videos due to context window limits and oversimplified keyframe retrieval, which ignores crucial spatio-temporal relationships and may result in redundant or information-poor keyframes. Method: Video-EM models keyframes as temporally ordered episodic events, capturing spatial and temporal relationships. It uses chain of thought (CoT) reasoning with LLMs to identify a minimal set of informative memories for efficient question answering. Result: Video-EM achieved competitive results with 4-9% performance gains over baselines on benchmarks like Video-MME, EgoSchema, HourVideo, and LVBench while using fewer frames. Conclusion: Video-EM is a training-free framework that excels in video question answering by modeling keyframes as temporally ordered episodic events and leveraging LLMs for efficient reasoning, outperforming existing methods while using fewer frames. Abstract: Video Large Language Models (Video-LLMs) excel at general video understanding but struggle with long-form videos due to context window limits. Consequently, recent approaches focus on keyframe retrieval, condensing lengthy videos into a small set of informative frames. Despite their practicality, these methods simplify the problem to static text-image matching, overlooking spatio-temporal relationships crucial for capturing scene transitions and contextual continuity, and may yield redundant keyframes with limited information, diluting salient cues essential for accurate video question answering. To address these limitations, we introduce Video-EM, a training-free framework inspired by the principles of human episodic memory, designed to facilitate robust and contextually grounded reasoning. Rather than treating keyframes as isolated visual entities, Video-EM explicitly models them as temporally ordered episodic events, capturing both spatial relationships and temporal dynamics necessary for accurately reconstructing the underlying narrative. 
Furthermore, the framework leverages chain of thought (CoT) thinking with LLMs to iteratively identify a minimal yet highly informative subset of episodic memories, enabling efficient and accurate question answering by Video-LLMs. Extensive evaluations on the Video-MME, EgoSchema, HourVideo, and LVBench benchmarks confirm the superiority of Video-EM, which achieves highly competitive results with performance gains of 4-9 percent over respective baselines while utilizing fewer frames.

[92] Echo-4o: Harnessing the Power of GPT-4o Synthetic Images for Improved Image Generation

Junyan Ye,Dongzhi Jiang,Zihao Wang,Leqi Zhu,Zhenghao Hu,Zilong Huang,Jun He,Zhiyuan Yan,Jinghua Yu,Hongsheng Li,Conghui He,Weijia Li

Main category: cs.CV

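TL;DR: This work uses GPT-4o to generate Echo-4o-Image, a large-scale synthetic image dataset that improves multimodal model performance, and proposes two new evaluation benchmarks, GenEval++ and Imagine-Bench.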

Details

Motivation: Although prior studies have improved open-source models by distilling image data from GPT-4o, real-world image datasets are already a source of high-quality data, so the question is why GPT-4o-generated synthetic data should be used at all. Method: A 180K-scale synthetic image dataset is generated with GPT-4o and used to fine-tune the multimodal generation model Bagel, yielding Echo-4o. Two new evaluation benchmarks, GenEval++ and Imagine-Bench, are also introduced. Result: Echo-4o performs strongly on standard benchmarks, and applying Echo-4o-Image to other foundation models (e.g., OmniGen2, BLIP3-o) also brings performance gains. Conclusion: Echo-4o-Image is an effective synthetic dataset that addresses blind spots in real-world datasets and transfers well to other models. Abstract: Recently, GPT-4o has garnered significant attention for its strong performance in image generation, yet open-source models still lag behind. Several studies have explored distilling image data from GPT-4o to enhance open-source models, achieving notable progress. However, a key question remains: given that real-world image datasets already constitute a natural source of high-quality data, why should we use GPT-4o-generated synthetic data? In this work, we identify two key advantages of synthetic images. First, they can complement rare scenarios in real-world datasets, such as surreal fantasy or multi-reference image generation, which frequently occur in user queries. Second, they provide clean and controllable supervision. Real-world data often contains complex background noise and inherent misalignment between text descriptions and image content, whereas synthetic images offer pure backgrounds and long-tailed supervision signals, facilitating more accurate text-to-image alignment. Building on these insights, we introduce Echo-4o-Image, a 180K-scale synthetic dataset generated by GPT-4o, harnessing the power of synthetic image data to address blind spots in real-world coverage. Using this dataset, we fine-tune the unified multimodal generation baseline Bagel to obtain Echo-4o. In addition, we propose two new evaluation benchmarks for a more accurate and challenging assessment of image generation capabilities: GenEval++, which increases instruction complexity to mitigate score saturation, and Imagine-Bench, which focuses on evaluating both the understanding and generation of imaginative content. 
Echo-4o demonstrates strong performance across standard benchmarks. Moreover, applying Echo-4o-Image to other foundation models (e.g., OmniGen2, BLIP3-o) yields consistent performance gains across multiple metrics, highlighting the dataset's strong transferability.

[93] SARE: Semantic-Aware Reconstruction Error for Generalizable Diffusion-Generated Image Detection

Ju Yeon Kang,Jaehong Park,Semin Kim,Ji Won Yoon,Nam Soo Kim

Main category: cs.CV

TL;DR: This paper proposes Semantic-Aware Reconstruction Error (SARE) as a novel representation for detecting fake images by measuring semantic differences between images and their caption-guided reconstructions, demonstrating robust detection across diverse generative models.

Details

Motivation: Existing detection methods degrade significantly when facing fake images from unseen, out-of-distribution generative models because they rely on model-specific artifacts. The work builds on the observation that fake images tend to exhibit higher similarity to their captions than real images do. Method: Proposes a novel representation, Semantic-Aware Reconstruction Error (SARE), which measures the semantic difference between an image and its caption-guided reconstruction. Result: The proposed method exhibits strong generalization, outperforming existing baselines on benchmarks including GenImage and CommunityForensics. Conclusion: SARE can be utilized as a discriminative feature for robust detection across diverse generative models. Abstract: Recently, diffusion-generated image detection has gained increasing attention, as the rapid advancement of diffusion models has raised serious concerns about their potential misuse. While existing detection methods have achieved promising results, their performance often degrades significantly when facing fake images from unseen, out-of-distribution (OOD) generative models, since they primarily rely on model-specific artifacts. To address this limitation, we explore a fundamental property commonly observed in fake images. Motivated by the observation that fake images tend to exhibit higher similarity to their captions than real images, we propose a novel representation, namely Semantic-Aware Reconstruction Error (SARE), that measures the semantic difference between an image and its caption-guided reconstruction. The hypothesis behind SARE is that real images, whose captions often fail to fully capture their complex visual content, may undergo noticeable semantic shifts during the caption-guided reconstruction process. In contrast, fake images, which closely align with their captions, show minimal semantic changes. 
By quantifying these semantic shifts, SARE can be utilized as a discriminative feature for robust detection across diverse generative models. We empirically demonstrate that the proposed method exhibits strong generalization, outperforming existing baselines on benchmarks including GenImage and CommunityForensics.
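
To make the idea of a "semantic shift" concrete, here is a minimal sketch that assumes the shift is scored as the cosine distance between an embedding of the image and an embedding of its caption-guided reconstruction. The abstract does not state which encoder, reconstruction model, or distance the authors use, so the function `semantic_shift` and the toy vectors below are hypothetical placeholders.

```python
import numpy as np

def semantic_shift(image_emb, recon_emb, eps=1e-12):
    """Cosine distance between two embedding vectors.

    Assumption: embeddings come from some semantic image encoder;
    the choice of encoder and of cosine distance is illustrative only.
    """
    a = np.asarray(image_emb, dtype=float)
    b = np.asarray(recon_emb, dtype=float)
    cos = a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + eps)
    return 1.0 - cos

# Toy vectors standing in for encoder outputs.
original = np.array([0.9, 0.1, 0.0])
reconstruction_close = np.array([0.88, 0.12, 0.01])  # little semantic change
reconstruction_far = np.array([0.1, 0.2, 0.9])       # large semantic change

small = semantic_shift(original, reconstruction_close)
large = semantic_shift(original, reconstruction_far)
```

Under the hypothesis stated in the abstract, a small shift would point toward a generated image and a large shift toward a real one; how the score is actually fed to the detector is not described in the abstract.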

[94] CWFBind: Geometry-Awareness for Fast and Accurate Protein-Ligand Docking

Liyan Jia,Chuan-Xian Ren,Hong Yan

Main category: cs.CV

TL;DR: CWFBind improves protein-ligand docking by incorporating geometric features and weighting mechanisms, enhancing accuracy and efficiency.

Details

Motivation: Current deep learning-based docking methods often neglect geometric information, leading to inaccurate pocket localization and unrealistic binding conformations. This necessitates a more geometry-aware approach. Method: CWFBind integrates local curvature descriptors and employs a degree-aware weighting mechanism during message passing. It also uses a ligand-aware dynamic radius strategy and an enhanced loss function to address class imbalance. Result: CWFBind achieves competitive performance across multiple docking benchmarks, with improved pocket localization and binding conformation prediction. Conclusion: CWFBind provides a balanced trade-off between accuracy and efficiency in docking benchmarks, offering an improved approach for rational drug design. Abstract: Accurately predicting the binding conformation of small-molecule ligands to protein targets is a critical step in rational drug design. Although recent deep learning-based docking surpasses traditional methods in speed and accuracy, many approaches rely on graph representations and language model-inspired encoders while neglecting critical geometric information, resulting in inaccurate pocket localization and unrealistic binding conformations. In this study, we introduce CWFBind, a weighted, fast, and accurate docking method based on local curvature features. Specifically, we integrate local curvature descriptors during the feature extraction phase to enrich the geometric representation of both proteins and ligands, complementing existing chemical, sequence, and structural features. Furthermore, we embed degree-aware weighting mechanisms into the message passing process, enhancing the model's ability to capture spatial structural distinctions and interaction strengths. 
To address the class imbalance challenge in pocket prediction, CWFBind employs a ligand-aware dynamic radius strategy alongside an enhanced loss function, facilitating more precise identification of binding regions and key residues. Comprehensive experimental evaluations demonstrate that CWFBind achieves competitive performance across multiple docking benchmarks, offering a balanced trade-off between accuracy and efficiency.

[95] Generation of Indian Sign Language Letters, Numbers, and Words

Ajeet Kumar Yadav,Nishant Kumar,Rathna G N

Main category: cs.CV

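TL;DR: This paper proposes a new generative adversarial network model for generating high-resolution, feature-rich Indian Sign Language images, and releases a large Indian Sign Language dataset.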

Details

Motivation: Sign language is an important medium for communicating with hard-of-hearing people, but those who do not know sign language face significant challenges. Despite progress in sign language recognition, sign language generation still needs further exploration. Method: The study develops a new Generative Adversarial Network (GAN) variant that combines ProGAN's strength in high-resolution image generation with SAGAN's strength in feature richness. Result: The model outperforms the traditional ProGAN in Inception Score (IS) and Fréchet Inception Distance (FID), with improvements of 3.2 and 30.12, respectively. The authors also release a large dataset of high-quality images of Indian Sign Language letters, numbers, and 129 words. Conclusion: The paper proposes an improved GAN model that combines the strengths of ProGAN and SAGAN to generate high-quality Indian Sign Language images, and releases a large-scale Indian Sign Language dataset. Abstract: Sign language, which contains hand movements, facial expressions and bodily gestures, is a significant medium for communicating with hard-of-hearing people. A well-trained sign language community communicates easily, but those who don't know sign language face significant challenges. Recognition and generation are basic communication methods between hearing and hard-of-hearing individuals. Despite progress in recognition, sign language generation still needs to be explored. The Progressive Growing of Generative Adversarial Network (ProGAN) excels at producing high-quality images, while the Self-Attention Generative Adversarial Network (SAGAN) generates feature-rich images at medium resolutions. Balancing resolution and detail is crucial for sign language image generation. We are developing a Generative Adversarial Network (GAN) variant that combines both models to generate feature-rich, high-resolution, and class-conditional sign language images. Our modified Attention-based model generates high-quality images of Indian Sign Language letters, numbers, and words, outperforming the traditional ProGAN in Inception Score (IS) and Fréchet Inception Distance (FID), with improvements of 3.2 and 30.12, respectively. Additionally, we are publishing a large dataset incorporating high-quality images of Indian Sign Language alphabets, numbers, and 129 words.

[96] SOI is the Root of All Evil: Quantifying and Breaking Similar Object Interference in Single Object Tracking

Yipei Wang,Shiyu Hu,Shukun Jia,Panxi Xu,Hongfei Ma,Yiping Ma,Jing Zhang,Xiaobo Lu,Xin Zhao

Main category: cs.CV

TL;DR: This paper investigates Similar Object Interference in Single Object Tracking, validates its impact through experiments, and proposes a novel method using large-scale vision-language models to improve tracking performance.

Details

Motivation: The paper is motivated by the need to address Similar Object Interference, a long-overlooked yet critical bottleneck in Single Object Tracking. Method: The paper uses controlled Online Interference Masking experiments to validate SOI as a primary constraint for robust tracking, constructs the SOIBench benchmark to target SOI challenges, and proposes a novel paradigm employing large-scale vision-language models as external cognitive engines. Result: Eliminating interference sources leads to substantial performance improvements across all SOTA trackers. Vision-language tracking methods fail to effectively exploit semantic cognitive guidance. The proposed approach demonstrates substantial improvements under semantic cognitive guidance. Conclusion: The paper concludes that SOI is a critical bottleneck in SOT, and that using large-scale vision-language models as external cognitive engines can effectively exploit semantic cognitive guidance, leading to significant improvements in tracking performance. Abstract: In this paper, we present the first systematic investigation and quantification of Similar Object Interference (SOI), a long-overlooked yet critical bottleneck in Single Object Tracking (SOT). Through controlled Online Interference Masking (OIM) experiments, we quantitatively demonstrate that eliminating interference sources leads to substantial performance improvements (AUC gains up to 4.35) across all SOTA trackers, directly validating SOI as a primary constraint for robust tracking and highlighting the feasibility of external cognitive guidance. Building upon these insights, we adopt natural language as a practical form of external guidance, and construct SOIBench, the first semantic cognitive guidance benchmark specifically targeting SOI challenges. It automatically mines SOI frames through multi-tracker collective judgment and introduces a multi-level annotation protocol to generate precise semantic guidance texts. 
Systematic evaluation on SOIBench reveals a striking finding: existing vision-language tracking (VLT) methods fail to effectively exploit semantic cognitive guidance, achieving only marginal improvements or even performance degradation (AUC changes of -0.26 to +0.71). In contrast, we propose a novel paradigm employing large-scale vision-language models (VLM) as external cognitive engines that can be seamlessly integrated into arbitrary RGB trackers. This approach demonstrates substantial improvements under semantic cognitive guidance (AUC gains up to 0.93), representing a significant advancement over existing VLT methods. We hope SOIBench will serve as a standardized evaluation platform to advance semantic cognitive tracking research and contribute new insights to the tracking research community.

[97] Learning Spatial Decay for Vision Transformers

Yuxin Mao,Zhen Qin,Jinxing Zhou,Bin Fan,Jing Zhang,Yiran Zhong,Yuchao Dai

Main category: cs.CV

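TL;DR: This paper proposes the Spatial Decay Transformer (SDT), which improves spatial attention in vision Transformers through a content-aware dynamic gating mechanism, significantly improving performance.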

Details

Motivation: Existing data-independent spatial decay methods based on fixed distance metrics fall short on vision tasks. Inspired by content-aware mechanisms in large language models, this paper aims to make the spatial attention mechanism of vision Transformers more adaptive. Method: Introduces the Spatial Decay Transformer (SDT), built on a content-dependent gating mechanism that combines Manhattan distance-based spatial priors with learned content representations to produce dynamic, data-dependent decay. Result: SDT consistently outperforms strong baselines on ImageNet-1K classification and generation tasks. Conclusion: SDT effectively enhances spatial attention in vision Transformers through context-aware gating, establishing data-dependent spatial decay as a new paradigm. Abstract: Vision Transformers (ViTs) have revolutionized computer vision, yet their self-attention mechanism lacks explicit spatial inductive biases, leading to suboptimal performance on spatially-structured tasks. Existing approaches introduce data-independent spatial decay based on fixed distance metrics, applying uniform attention weighting regardless of image content and limiting adaptability to diverse visual scenarios. Inspired by recent advances in large language models where content-aware gating mechanisms (e.g., GLA, HGRN2, FOX) significantly outperform static alternatives, we present the first successful adaptation of data-dependent spatial decay to 2D vision transformers. We introduce Spatial Decay Transformer (SDT), featuring a novel Context-Aware Gating (CAG) mechanism that generates dynamic, data-dependent decay for patch interactions. Our approach learns to modulate spatial attention based on both content relevance and spatial proximity. We address the fundamental challenge of 1D-to-2D adaptation through a unified spatial-content fusion framework that integrates Manhattan distance-based spatial priors with learned content representations. Extensive experiments on ImageNet-1K classification and generation tasks demonstrate consistent improvements over strong baselines. Our work establishes data-dependent spatial decay as a new paradigm for enhancing spatial attention in vision transformers.
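
For readers unfamiliar with spatial decay, the following sketch shows the data-independent baseline that the abstract contrasts against: a fixed decay factor raised to the Manhattan distance between patches, multiplied into the attention weights. It is an assumption-laden illustration, not SDT itself; the names `manhattan_decay_mask` and `decayed_attention` and the constant `gamma` are hypothetical, and the paper's Context-Aware Gating (which would make the decay depend on content) is not reproduced here.

```python
import numpy as np

def manhattan_decay_mask(h, w, gamma=0.9):
    """Fixed decay gamma ** (|dy| + |dx|) between all pairs of patches
    on an h x w grid. Returns an (h*w, h*w) matrix."""
    ys, xs = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    coords = np.stack([ys.ravel(), xs.ravel()], axis=1)               # (N, 2)
    dist = np.abs(coords[:, None, :] - coords[None, :, :]).sum(axis=-1)  # (N, N)
    return gamma ** dist

def decayed_attention(q, k, v, mask):
    """Softmax attention whose weights are scaled by a decay mask
    and then renormalized so each row sums to one."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores) * mask
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v

rng = np.random.default_rng(0)
h, w, dim = 2, 3, 4
q = rng.standard_normal((h * w, dim))
k = rng.standard_normal((h * w, dim))
v = rng.standard_normal((h * w, dim))
out = decayed_attention(q, k, v, manhattan_decay_mask(h, w, gamma=0.5))
```

A data-dependent variant would replace the single constant `gamma` with values predicted from the patch features, which is the direction the abstract describes.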

[98] Physics-guided Deep Unfolding Network for Enhanced Kronecker Compressive sensing

Gang Qu,Ping Wang,Siming Zheng,Xin Yuan

Main category: cs.CV

TL;DR: This paper proposes MEUNet, combining AKCS and MACA, to enhance compressed sensing image reconstruction, achieving state-of-the-art results.

Details

Motivation: To improve the incoherence of compressed measurements and learn informative representations for better image reconstruction in CS tasks. Method: The authors propose an asymmetric Kronecker CS (AKCS) model and a measurement-aware cross attention (MACA) mechanism, integrating them into an unfolding architecture to form MEUNet. Result: Theoretical analysis and experiments show that the proposed AKCS model provides better incoherence with minimal complexity increase, and that MEUNet achieves superior reconstruction accuracy and inference speed. Conclusion: MEUNet, which integrates AKCS and MACA, attains state-of-the-art performance in reconstruction accuracy and inference speed. Abstract: Deep networks have achieved remarkable success in the image compressed sensing (CS) task, namely reconstructing a high-fidelity image from its compressed measurement. However, existing works are deficient in incoherent compressed measurement at the sensing phase and implicit measurement representations at the reconstruction phase, limiting the overall performance. In this work, we answer two questions: 1) how to improve the measurement incoherence for decreasing the ill-posedness; 2) how to learn informative representations from measurements. To this end, we propose a novel asymmetric Kronecker CS (AKCS) model and theoretically present its better incoherence than previous Kronecker CS with minimal complexity increase. Moreover, we reveal that the unfolding networks' superiority over non-unfolding ones results from sufficient gradient descents, called explicit measurement representations. We propose a measurement-aware cross attention (MACA) mechanism to learn implicit measurement representations. We integrate AKCS and MACA into a widely used unfolding architecture to get a measurement-enhanced unfolding network (MEUNet). Extensive experiments demonstrate that our MEUNet achieves state-of-the-art performance in reconstruction accuracy and inference speed.
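
As background on the Kronecker structure referred to above, the sketch below assumes the standard separable sensing form Y = A X B^T and checks it against the equivalent dense form vec(Y) = (B ⊗ A) vec(X). The matrix sizes are arbitrary toy values, and nothing here reproduces the paper's asymmetric AKCS model, whose definition is not given in the abstract.

```python
import numpy as np

def kronecker_measure(x, a, b):
    """Separable (Kronecker) sensing: Y = A @ X @ B.T."""
    return a @ x @ b.T

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 8))   # toy image
a = rng.standard_normal((4, 8))   # sensing matrix applied along rows of the image grid
b = rng.standard_normal((3, 8))   # sensing matrix applied along columns

y = kronecker_measure(x, a, b)    # compressed measurement, shape (4, 3)

# Equivalent dense operator, using column-major vectorization:
# vec(A X B^T) = (B kron A) vec(X)
phi = np.kron(b, a)               # shape (12, 64)
y_dense = (phi @ x.flatten(order="F")).reshape(y.shape, order="F")
```

The separable form needs only the two small matrices rather than the full 12 x 64 operator, which is the usual reason for adopting Kronecker sensing in the first place.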

[99] COXNet: Cross-Layer Fusion with Adaptive Alignment and Scale Integration for RGBT Tiny Object Detection

Peiran Peng,Tingfa Xu,Liqiang Song,Mengqi Zhu,Yuqiang Fang,Jianan Li

Main category: cs.CV

TL;DR: COXNet is a new framework for RGBT tiny object detection that improves detection accuracy by integrating features from visible and thermal modalities and addressing spatial misalignment issues.

Details

Motivation: Detecting tiny objects in RGBT imagery is challenging due to spatial misalignment, low-light conditions, occlusion, and cluttered backgrounds, especially in drone-based scenarios. Current methods fail to effectively utilize complementary information from visible and thermal modalities. Method: COXNet introduces three innovations: Cross-Layer Fusion Module for feature integration, Dynamic Alignment and Scale Refinement module for alignment and feature preservation, and an optimized label assignment strategy using GeoShape Similarity Measure. Result: COXNet achieves a 3.32% mAP50 improvement on the RGBTDronePerson dataset, showing its robustness in complex environments. Conclusion: COXNet demonstrates effectiveness in RGBT tiny object detection, achieving a significant improvement in mAP50 on the RGBTDronePerson dataset over existing methods. Abstract: Detecting tiny objects in multimodal Red-Green-Blue-Thermal (RGBT) imagery is a critical challenge in computer vision, particularly in surveillance, search and rescue, and autonomous navigation. Drone-based scenarios exacerbate these challenges due to spatial misalignment, low-light conditions, occlusion, and cluttered backgrounds. Current methods struggle to leverage the complementary information between visible and thermal modalities effectively. We propose COXNet, a novel framework for RGBT tiny object detection, addressing these issues through three core innovations: i) the Cross-Layer Fusion Module, fusing high-level visible and low-level thermal features for enhanced semantic and spatial accuracy; ii) the Dynamic Alignment and Scale Refinement module, correcting cross-modal spatial misalignments and preserving multi-scale features; and iii) an optimized label assignment strategy using the GeoShape Similarity Measure for better localization. 
COXNet achieves a 3.32% mAP50 improvement on the RGBTDronePerson dataset over state-of-the-art methods, demonstrating its effectiveness for robust detection in complex environments.

[100] Iterative Volume Fusion for Asymmetric Stereo Matching

Yuanting Gao,Linghao Shen

Main category: cs.CV

TL;DR: The paper proposes IVF-AStereo, a method to improve stereo matching in asymmetric multi-camera systems by effectively utilizing two cost volumes to handle visual asymmetry and enhance performance.

Details

Motivation: Asymmetric multi-camera systems, such as tele-wide cameras, disrupt traditional stereo matching because visual asymmetry affects cost volume computation. Method: The paper proposes a two-phase Iterative Volume Fusion network for asymmetric stereo matching (IVF-AStereo), which initially refines the correlation volume using an aggregated concatenation volume and subsequently fuses both volumes to enhance fine details. Result: The results show that the IVF-AStereo method performs effectively in asymmetric stereo matching scenarios, confirmed by extensive comparative experiments and ablation studies on benchmark datasets involving resolution and color degradation. Conclusion: The paper concludes that the proposed IVF-AStereo method excels in asymmetric stereo matching scenarios and demonstrates robustness against significant visual asymmetry. Abstract: Stereo matching is vital in 3D computer vision, with most algorithms assuming symmetric visual properties between binocular views. However, the rise of asymmetric multi-camera systems (e.g., tele-wide cameras) challenges this assumption and complicates stereo matching. Visual asymmetry disrupts stereo matching by affecting the crucial cost volume computation. To address this, we explore the matching cost distribution of two established cost volume construction methods in asymmetric stereo. We find that each cost volume experiences distinct information distortion, indicating that both should be comprehensively utilized to solve the issue. Based on this, we propose the two-phase Iterative Volume Fusion network for Asymmetric Stereo matching (IVF-AStereo). Initially, the aggregated concatenation volume refines the correlation volume. Subsequently, both volumes are fused to enhance fine details. Our method excels in asymmetric scenarios and shows robust performance against significant visual asymmetry. 
Extensive comparative experiments on benchmark datasets, along with ablation studies, confirm the effectiveness of our approach in asymmetric stereo with resolution and color degradation.
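
The abstract refers to two established cost volume constructions, the correlation volume and the concatenation volume, without defining them. The sketch below assumes their commonly used forms: a per-disparity channel-wise mean of feature products, and a per-disparity stacking of left and shifted right features. The function names and toy shapes are hypothetical, and the fusion network itself is not reproduced.

```python
import numpy as np

def correlation_volume(left, right, max_disp):
    """Correlation volume of shape (max_disp, H, W).

    left, right: feature maps of shape (C, H, W). For disparity d, left pixel x
    is compared with right pixel x - d; positions with x < d stay zero.
    """
    c, h, w = left.shape
    vol = np.zeros((max_disp, h, w), dtype=left.dtype)
    for d in range(max_disp):
        vol[d, :, d:] = (left[:, :, d:] * right[:, :, :w - d]).mean(axis=0)
    return vol

def concatenation_volume(left, right, max_disp):
    """Concatenation volume of shape (2C, max_disp, H, W): left features
    stacked with the right features shifted by each candidate disparity."""
    c, h, w = left.shape
    vol = np.zeros((2 * c, max_disp, h, w), dtype=left.dtype)
    for d in range(max_disp):
        vol[:c, d, :, d:] = left[:, :, d:]
        vol[c:, d, :, d:] = right[:, :, :w - d]
    return vol

rng = np.random.default_rng(0)
left_feat = rng.standard_normal((2, 3, 5))
right_feat = rng.standard_normal((2, 3, 5))
corr = correlation_volume(left_feat, right_feat, max_disp=3)
concat = concatenation_volume(left_feat, right_feat, max_disp=3)
```

The correlation volume is compact but discards per-channel detail, while the concatenation volume keeps all channels at a higher memory cost, which gives some intuition for why a method might want to use both.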

Fengyi Wu,Yifei Dong,Zhi-Qi Cheng,Yilong Dai,Guangyu Chen,Hang Wang,Qi Dai,Alexander G. Hauptmann

Main category: cs.CV

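TL;DR: This paper proposes GoViG, a new approach to navigation instruction generation based on raw visual data, which shows superior performance in both synthetic and real-world environments.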

Details

Motivation: Conventional navigation instruction generation methods rely on structured inputs such as semantic annotations or environmental maps, whereas GoViG uses only raw egocentric visual data to improve adaptability to unseen and unstructured environments. Method: GoViG decomposes the task into two subtasks, visual forecasting and instruction generation, integrates them within an autoregressive multimodal large language model, and adopts one-pass and interleaved reasoning strategies to mimic the human navigation process. Result: Experiments on the R2R-Goal dataset show that GoViG outperforms state-of-the-art methods on BLEU-4 and CIDEr scores and generalizes robustly across domains. Conclusion: GoViG performs well on visual navigation instruction generation, improving significantly over existing methods and generalizing well across domains. Abstract: We introduce Goal-Conditioned Visual Navigation Instruction Generation (GoViG), a new task that aims to autonomously generate precise and contextually coherent navigation instructions solely from egocentric visual observations of initial and goal states. Unlike conventional approaches that rely on structured inputs such as semantic annotations or environmental maps, GoViG exclusively leverages raw egocentric visual data, substantially improving its adaptability to unseen and unstructured environments. Our method addresses this task by decomposing it into two interconnected subtasks: (1) visual forecasting, which predicts intermediate visual states bridging the initial and goal views; and (2) instruction generation, which synthesizes linguistically coherent instructions grounded in both observed and anticipated visuals. These subtasks are integrated within an autoregressive multimodal large language model trained with tailored objectives to ensure spatial accuracy and linguistic clarity. Furthermore, we introduce two complementary multimodal reasoning strategies, one-pass and interleaved reasoning, to mimic incremental human cognitive processes during navigation. To evaluate our method, we propose the R2R-Goal dataset, combining diverse synthetic and real-world trajectories. Empirical results demonstrate significant improvements over state-of-the-art methods, achieving superior BLEU-4 and CIDEr scores along with robust cross-domain generalization.

[102] Exploring the Equivalence of Closed-Set Generative and Real Data Augmentation in Image Classification

Haowen Wang,Guowei Zhang,Xiang Zhang,Zeyuan Chen,Haiyang Xu,Dou Hoon Kwark,Zhuowen Tu

Main category: cs.CV

TL;DR: This paper investigates the use of synthetic images generated by advanced generative models for closed-set data augmentation in image classification tasks, finding that synthetic data can achieve comparable performance to real images when used at an increased scale.

Details

Motivation: The motivation of the paper is to investigate whether training a generative model on a given image classification training set can enhance classification performance through closed-set generative data augmentation. Method: The authors conducted extensive experiments to explore the distinctions and similarities between real and synthetic images generated by advanced generative models. They empirically determined the scale of synthetic images needed for effective augmentation and compared the effects of real data augmentation with open-set generative augmentation. Result: The paper provides systematic insights into the effective use of closed-set synthetic data for augmentation, including the equivalent scale of synthetic images needed and a quantitative comparison between real data augmentation and open-set generative augmentation. Conclusion: The paper concludes that while real images are generally preferred for training, synthetic data augmentation can achieve comparable performance if the scale of synthetic data is increased appropriately. Abstract: In this paper, we address a key scientific problem in machine learning: Given a training set for an image classification task, can we train a generative model on this dataset to enhance the classification performance? (i.e., closed-set generative data augmentation). We start by exploring the distinctions and similarities between real images and closed-set synthetic images generated by advanced generative models. Through extensive experiments, we offer systematic insights into the effective use of closed-set synthetic data for augmentation. Notably, we empirically determine the equivalent scale of synthetic images needed for augmentation. In addition, we also show quantitative equivalence between the real data augmentation and open-set generative augmentation (generative models trained using data beyond the given training set). 
While it aligns with the common intuition that real images are generally preferred, our empirical formulation also offers a guideline to quantify the increased scale of synthetic data augmentation required to achieve comparable image classification performance. Our results on natural and medical image datasets further illustrate how this effect varies with the baseline training set size and the amount of synthetic data incorporated.

[103] Topological Invariant-Based Iris Identification via Digital Homology and Machine Learning

Ahmet Öztel,İsmet Karaca

Main category: cs.CV

TL;DR: This study introduces a novel biometric identification method using topological invariants from 2D iris images, achieving high accuracy with logistic regression and offering a compact, interpretable alternative to deep learning.

Details

Motivation: The study aims to present a biometric identification method based on topological invariants from 2D iris images, representing iris texture via formally defined digital homology and evaluating classification performance. Method: Each normalized iris image (48x482 pixels) is divided into grids (e.g., 6x54 or 3x27). For each subregion, Betti0, Betti1, and their ratio are computed using a recent algorithm for homology groups in 2D digital images. The resulting invariants form a feature matrix used with logistic regression, KNN, and SVM (with PCA and 100 randomized repetitions). A CNN is trained on raw images for comparison. Result: Logistic regression achieved 97.78 +/- 0.82% accuracy, outperforming CNN (96.44 +/- 1.32%) and other feature-based models. The topological features showed high accuracy with low variance. Conclusion: This is the first use of topological invariants from formal digital homology for iris recognition. The method offers a compact, interpretable, and accurate alternative to deep learning, useful when explainability or limited data is important. Beyond iris recognition, it can apply to other biometrics, medical imaging, materials science, remote sensing, and interpretable AI. It runs efficiently on CPU-only systems and produces robust, explainable features valuable for security-critical domains. Abstract: Objective - This study presents a biometric identification method based on topological invariants from 2D iris images, representing iris texture via formally defined digital homology and evaluating classification performance. Methods - Each normalized iris image (48x482 pixels) is divided into grids (e.g., 6x54 or 3x27). For each subregion, we compute Betti0, Betti1, and their ratio using a recent algorithm for homology groups in 2D digital images. The resulting invariants form a feature matrix used with logistic regression, KNN, and SVM (with PCA and 100 randomized repetitions). 
A convolutional neural network (CNN) is trained on raw images for comparison. Results - Logistic regression achieved 97.78 +/- 0.82% accuracy, outperforming CNN (96.44 +/- 1.32%) and other feature-based models. The topological features showed high accuracy with low variance. Conclusion - This is the first use of topological invariants from formal digital homology for iris recognition. The method offers a compact, interpretable, and accurate alternative to deep learning, useful when explainability or limited data is important. Beyond iris recognition, it can apply to other biometrics, medical imaging, materials science, remote sensing, and interpretable AI. It runs efficiently on CPU-only systems and produces robust, explainable features valuable for security-critical domains.
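
The per-cell feature extraction described above can be sketched as follows. This is a minimal illustration, not the authors' algorithm: it assumes the iris image has already been binarized, takes Betti0 as the number of 8-connected foreground components and Betti1 as the number of enclosed background holes (4-connected), and defines the ratio as Betti0 / Betti1 (0 when Betti1 is 0). The function names `betti_numbers` and `grid_features`, the thresholding assumption, and the direction of the ratio are all hypothetical; the abstract does not specify them.

```python
from collections import deque

N4 = [(-1, 0), (1, 0), (0, -1), (0, 1)]
N8 = N4 + [(-1, -1), (-1, 1), (1, -1), (1, 1)]


def count_components(mask, neighbors):
    """Count connected components of True cells in a 2D boolean grid."""
    rows, cols = len(mask), len(mask[0])
    seen = [[False] * cols for _ in range(rows)]
    count = 0
    for r in range(rows):
        for c in range(cols):
            if not mask[r][c] or seen[r][c]:
                continue
            count += 1
            seen[r][c] = True
            queue = deque([(r, c)])
            while queue:  # breadth-first flood fill
                y, x = queue.popleft()
                for dy, dx in neighbors:
                    ny, nx = y + dy, x + dx
                    if (0 <= ny < rows and 0 <= nx < cols
                            and mask[ny][nx] and not seen[ny][nx]):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
    return count


def betti_numbers(cell):
    """Return (betti0, betti1) for a binary cell (assumed adjacency: 8 for foreground, 4 for background)."""
    cols = len(cell[0])
    fg = [[bool(v) for v in row] for row in cell]
    b0 = count_components(fg, N8)
    # Pad the background by one pixel so the outer background forms one component;
    # every additional background component is an enclosed hole.
    bg = [[True] * (cols + 2)]
    for row in fg:
        bg.append([True] + [not v for v in row] + [True])
    bg.append([True] * (cols + 2))
    b1 = count_components(bg, N4) - 1
    return b0, b1


def grid_features(image, cell_h, cell_w):
    """Split a binary image into cells and return (betti0, betti1, ratio) per cell."""
    feats = []
    for top in range(0, len(image), cell_h):
        for left in range(0, len(image[0]), cell_w):
            cell = [row[left:left + cell_w] for row in image[top:top + cell_h]]
            b0, b1 = betti_numbers(cell)
            ratio = b0 / b1 if b1 else 0.0  # assumed definition of "their ratio"
            feats.append((b0, b1, ratio))
    return feats


ring = [
    [0, 0, 0, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 1, 0, 1, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 0, 0, 0],
]
print(betti_numbers(ring))  # (1, 1): one component, one hole
```

The resulting tuples would be flattened into one feature vector per image before being passed to a classifier such as logistic regression.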

[104] WeatherPrompt: Multi-modality Representation Learning for All-Weather Drone Visual Geo-Localization

Jiahao Wen,Hang Yu,Zhedong Zheng

Main category: cs.CV

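TL;DR: WeatherPrompt is a multi-modality learning paradigm that establishes weather-invariant representations by fusing image embeddings with text context, improving drone visual geo-localization under varying weather conditions.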

Details

Motivation: Under weather perturbations such as rain and fog, existing visual geo-localization methods for drones suffer severe performance degradation. The problems stem from heavy reliance on a limited set of weather categories and from suboptimal disentanglement of entangled scene-weather features through pseudo weather categories. Method: WeatherPrompt introduces a training-free weather reasoning mechanism that uses off-the-shelf large multi-modality models to synthesize multi-weather textual descriptions through human-like reasoning. It also proposes a multi-modality framework with a dynamic gating mechanism driven by the text embedding to adaptively reweight and fuse visual features across modalities. The framework is further optimized with cross-modal objectives, including image-text contrastive learning and image-text matching. Result: Experiments show that, under diverse weather conditions, WeatherPrompt achieves competitive recall rates compared with state-of-the-art drone geo-localization methods. Notably, Recall@1 improves by +13.37% under night conditions and by 18.69% under fog and snow conditions. Conclusion: With its training-free weather reasoning mechanism and multi-modality framework, WeatherPrompt effectively improves drone visual geo-localization under a variety of weather conditions. Abstract: Visual geo-localization for drones faces critical degradation under weather perturbations, e.g., rain and fog, where existing methods struggle with two inherent limitations: 1) Heavy reliance on limited weather categories that constrain generalization, and 2) Suboptimal disentanglement of entangled scene-weather features through pseudo weather categories. We present WeatherPrompt, a multi-modality learning paradigm that establishes weather-invariant representations through fusing the image embedding with the text context. Our framework introduces two key contributions: First, a Training-free Weather Reasoning mechanism that employs off-the-shelf large multi-modality models to synthesize multi-weather textual descriptions through human-like reasoning. It improves the scalability to unseen or complex weather, and can reflect different weather intensities. Second, to better disentangle the scene and weather features, we propose a multi-modality framework with a dynamic gating mechanism driven by the text embedding to adaptively reweight and fuse visual features across modalities. The framework is further optimized by cross-modal objectives, including image-text contrastive learning and image-text matching, which map the same scene under different weather conditions closer in the representation space. Extensive experiments validate that, under diverse weather conditions, our method achieves competitive recall rates compared to state-of-the-art drone geo-localization methods. 
Notably, it improves Recall@1 by +13.37% under night conditions and by 18.69% under fog and snow conditions.

[105] WEC-DG: Multi-Exposure Wavelet Correction Method Guided by Degradation Description

Ming Zhao,Pingping Liu,Tongshun Zhang,Zhe Zhang

Main category: cs.CV

TL;DR: This paper proposes WEC-DG, a new multi-exposure correction method that improves image restoration under challenging lighting conditions by incorporating degradation guidance and wavelet-based processing.

Details

Motivation: Current multi-exposure correction methods struggle with intra-class variability caused by diverse lighting conditions, shooting environments, and weather factors, especially for images captured at a single exposure level. Method: The paper proposes a Wavelet-based Exposure Correction method with Degradation Guidance (WEC-DG), incorporating an Exposure Consistency Alignment Module (ECAM) and an Exposure Restoration and Detail Reconstruction Module (EDRM). Result: Extensive experiments on multiple public datasets show that the proposed WEC-DG method achieves significant performance improvements over existing algorithms, validating its effectiveness and practical applicability. Conclusion: The proposed WEC-DG method demonstrates superior performance in handling multi-exposure correction, particularly under complex imaging conditions, outperforming existing methods. Abstract: Multi-exposure correction technology is essential for restoring images affected by insufficient or excessive lighting, enhancing the visual experience by improving brightness, contrast, and detail richness. However, current multi-exposure correction methods often encounter challenges in addressing intra-class variability caused by diverse lighting conditions, shooting environments, and weather factors, particularly when processing images captured at a single exposure level. To enhance the adaptability of these models under complex imaging conditions, this paper proposes a Wavelet-based Exposure Correction method with Degradation Guidance (WEC-DG). Specifically, we introduce a degradation descriptor within the Exposure Consistency Alignment Module (ECAM) at both ends of the processing pipeline to ensure exposure consistency and achieve final alignment. This mechanism effectively addresses miscorrected exposure anomalies caused by existing methods' failure to recognize 'blurred' exposure degradation. 
Additionally, we investigate the light-detail decoupling properties of the wavelet transform to design the Exposure Restoration and Detail Reconstruction Module (EDRM), which processes low-frequency information related to exposure enhancement before utilizing high-frequency information as a prior guide for reconstructing spatial domain details. This serial processing strategy guarantees precise light correction and enhances detail recovery. Extensive experiments conducted on multiple public datasets demonstrate that the proposed method outperforms existing algorithms, achieving significant performance improvements and validating its effectiveness and practical applicability.
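
The light-detail decoupling property mentioned above can be illustrated with a one-level 2D Haar transform. This is a minimal sketch under stated assumptions: the abstract does not say which wavelet WEC-DG uses, the averaging normalization below is one of several conventions, and the function name `haar2d` and the sub-band names are hypothetical. The point it shows is that a uniform brightness offset changes only the low-frequency band, while the three detail bands are unaffected.

```python
def haar2d(image):
    """One-level 2D Haar transform (averaging variant) of an image with even height and width.

    Returns four sub-bands as lists of lists:
    low (local mean), detail_v (top minus bottom), detail_h (left minus right), detail_d (diagonal).
    """
    low, detail_v, detail_h, detail_d = [], [], [], []
    for r in range(0, len(image), 2):
        low_row, v_row, h_row, d_row = [], [], [], []
        for c in range(0, len(image[0]), 2):
            a, b = image[r][c], image[r][c + 1]          # top-left, top-right
            d, e = image[r + 1][c], image[r + 1][c + 1]  # bottom-left, bottom-right
            low_row.append((a + b + d + e) / 4)
            v_row.append((a + b - d - e) / 4)
            h_row.append((a - b + d - e) / 4)
            d_row.append((a - b - d + e) / 4)
        low.append(low_row)
        detail_v.append(v_row)
        detail_h.append(h_row)
        detail_d.append(d_row)
    return low, detail_v, detail_h, detail_d


image = [[10, 12], [14, 20]]
brighter = [[v + 50 for v in row] for row in image]  # simulate a uniform exposure shift

low_a, *details_a = haar2d(image)
low_b, *details_b = haar2d(brighter)

print(low_a, low_b)            # [[14.0]] [[64.0]]: only the low band moves
print(details_a == details_b)  # True: detail bands are unchanged
```

Real exposure errors are not uniform offsets, so the separation is only approximate in practice; the sketch just shows why processing the low-frequency band first is a natural place to correct illumination.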

[106] A Chain of Diagnosis Framework for Accurate and Explainable Radiology Report Generation

Haibo Jin,Haoxuan Che,Sunan He,Hao Chen

Main category: cs.CV

TL;DR: This paper proposes CoD, a trustworthy radiology report generation framework that improves clinical accuracy and explainability by incorporating QA-based diagnosis, grounding mechanisms, and omni-supervised learning.

Details

Motivation: Existing radiology report generation models have unsatisfactory clinical performance, especially in describing lesion attributes, and lack explainability, reducing trust from radiologists. Method: The authors propose a chain of diagnosis (CoD) framework that uses question-answer pairs for key finding extraction, prompts a large language model for report generation, and incorporates diagnosis and lesion grounding modules for explainability. They also use an omni-supervised learning strategy for training. Result: The CoD framework achieves superior performance on two RRG benchmarks, provides explainable results by grounding sentences to QA diagnoses and images, and includes a new dataset with QA pairs and lesion boxes, along with a dedicated evaluation tool. Conclusion: The proposed CoD framework improves clinical accuracy and explainability in radiology report generation, outperforming existing models on benchmarks and enabling trust through diagnosis grounding and lesion localization. Abstract: Despite the progress of radiology report generation (RRG), existing works face two challenges: 1) The performances in clinical efficacy are unsatisfactory, especially for lesion attributes description; 2) the generated text lacks explainability, making it difficult for radiologists to trust the results. To address the challenges, we focus on a trustworthy RRG model, which not only generates accurate descriptions of abnormalities, but also provides basis of its predictions. To this end, we propose a framework named chain of diagnosis (CoD), which maintains a chain of diagnostic process for clinically accurate and explainable RRG. It first generates question-answer (QA) pairs via diagnostic conversation to extract key findings, then prompts a large language model with QA diagnoses for accurate generation. To enhance explainability, a diagnosis grounding module is designed to match QA diagnoses and generated sentences, where the diagnoses act as a reference. 
Moreover, a lesion grounding module is designed to locate abnormalities in the image, further improving the working efficiency of radiologists. To facilitate label-efficient training, we propose an omni-supervised learning strategy with clinical consistency to leverage various types of annotations from different datasets. Our efforts lead to 1) an omni-labeled RRG dataset with QA pairs and lesion boxes; 2) an evaluation tool for assessing the accuracy of reports in describing lesion location and severity; 3) extensive experiments to demonstrate the effectiveness of CoD, where it outperforms both specialist and generalist models consistently on two RRG benchmarks and shows promising explainability by accurately grounding generated sentences to QA diagnoses and images.

[107] Dual Recursive Feedback on Generation and Appearance Latents for Pose-Robust Text-to-Image Diffusion

Jiwon Kim,Pureum Kim,SeonHwa Kim,Soobin Park,Eunju Cha,Kyong Hwan Jin

Main category: cs.CV

TL;DR: The paper proposes a Dual Recursive Feedback system to improve the spatial structure preservation and fine-grained condition capturing of controllable text-to-image diffusion models, demonstrated through extensive experiments.

Details

Motivation: Recent controllable text-to-image diffusion models struggle to accurately preserve spatial structures and capture fine-grained conditions related to object poses and scene layouts. Method: A training-free Dual Recursive Feedback (DRF) system is proposed, which consists of appearance feedback and generation feedback to refine intermediate latents and better reflect given appearance information and user intent. Result: Extensive experiments demonstrate the efficacy of the DRF system in producing high-quality, semantically coherent, and structurally consistent image generations. Conclusion: The proposed DRF system improves the ability of controllable text-to-image diffusion models to preserve spatial structures and capture fine-grained conditions related to object poses and scene layouts without requiring auxiliary module training. Abstract: Recent advancements in controllable text-to-image (T2I) diffusion models, such as Ctrl-X and FreeControl, have demonstrated robust spatial and appearance control without requiring auxiliary module training. However, these models often struggle to accurately preserve spatial structures and fail to capture fine-grained conditions related to object poses and scene layouts. To address these challenges, we propose a training-free Dual Recursive Feedback (DRF) system that properly reflects control conditions in controllable T2I models. The proposed DRF consists of appearance feedback and generation feedback that recursively refines the intermediate latents to better reflect the given appearance information and the user's intent. This dual-update mechanism guides latent representations toward reliable manifolds, effectively integrating structural and appearance attributes. Our approach enables fine-grained generation even between class-invariant structure-appearance fusion, such as transferring human motion onto a tiger's form. 
Extensive experiments demonstrate the efficacy of our method in producing high-quality, semantically coherent, and structurally consistent image generations. Our source code is available at https://github.com/jwonkm/DRF.

[108] SHALE: A Scalable Benchmark for Fine-grained Hallucination Evaluation in LVLMs

Bei Yan,Zhiyuan Chen,Yuecong Min,Jie Zhang,Jiahao Wang,Xiaozhen Wang,Shiguang Shan

Main category: cs.CV

TL;DR: This paper introduces SHALE, an automated and scalable benchmark for evaluating hallucinations in Large Vision-Language Models, highlighting significant factuality issues and sensitivity to input perturbations.

Details

Motivation: LVLMs suffer from hallucinations, and current evaluation methods are limited in granularity, scalability, and potential data leakage, necessitating a more robust and automated benchmark. Method: An automated data construction pipeline and hierarchical hallucination induction framework were designed to create the SHALE benchmark, which assesses both faithfulness and factuality hallucinations using a fine-grained categorization scheme. Result: SHALE includes over 30K image-instruction pairs across 12 visual perception aspects and 6 knowledge domains, enabling comprehensive evaluation of hallucinations in LVLMs. Conclusion: SHALE effectively evaluates hallucinations in LVLMs, revealing significant factuality hallucinations and sensitivity to semantic perturbations. Abstract: Despite rapid advances, Large Vision-Language Models (LVLMs) still suffer from hallucinations, i.e., generating content inconsistent with input or established world knowledge, which correspond to faithfulness and factuality hallucinations, respectively. Prior studies primarily evaluate faithfulness hallucination at a coarse level (e.g., object-level) and lack fine-grained analysis. Additionally, existing benchmarks rely on costly manual curation or reused public datasets, raising concerns about scalability and data leakage. To address these limitations, we propose an automated data construction pipeline that produces scalable, controllable, and diverse evaluation data. We also design a hierarchical hallucination induction framework with input perturbations to simulate realistic noisy scenarios. Integrating these designs, we construct SHALE, a Scalable HALlucination Evaluation benchmark designed to assess both faithfulness and factuality hallucinations via a fine-grained hallucination categorization scheme. 
SHALE comprises over 30K image-instruction pairs spanning 12 representative visual perception aspects for faithfulness and 6 knowledge domains for factuality, considering both clean and noisy scenarios. Extensive experiments on over 20 mainstream LVLMs reveal significant factuality hallucinations and high sensitivity to semantic perturbations.

[109] Offline Auto Labeling: BAAS

Stefan Haag,Bharanidhar Duraisamy,Felix Govaers,Wolfgang Koch,Martin Fritzsche,Juergen Dickmann

Main category: cs.CV

TL;DR: BAAS is a new radar detection annotation and tracking framework for autonomous driving, using Bayesian-based methods to provide accurate object trajectories and enable continuous improvements.

Details

Motivation: There is a need for accurate label annotation and tracking in autonomous driving radar detection, which BAAS aims to address. Method: Bayesian-based tracking, smoothing, and fusion methods are used to generate object trajectories and shape estimation. Result: The framework was evaluated in a real-world urban scenario, demonstrating functionality for varying dynamic objects and class types. Conclusion: BAAS provides a new approach for radar detection annotation and tracking, allowing for continuous improvement through module analysis and combination. Abstract: This paper introduces BAAS, a new Extended Object Tracking (EOT) and fusion-based label annotation framework for radar detections in autonomous driving. Our framework utilizes Bayesian-based tracking, smoothing and eventually fusion methods to provide veritable and precise object trajectories along with shape estimation to provide annotation labels on the detection level under various supervision levels. Simultaneously, the framework provides evaluation of tracking performance and label annotation. If manually labeled data is available, each processing module can be analyzed independently or combined with other modules to enable closed-loop continuous improvements. The framework performance is evaluated in a challenging urban real-world scenario in terms of tracking performance and the label annotation errors. We demonstrate the functionality of the proposed approach for varying dynamic objects and class types.

[110] Hierarchical Brain Structure Modeling for Predicting Genotype of Glioma

Haotian Tang,Jianwei Chen,Xinrui Tang,Yunjia Wu,Zhengyang Miao,Chao Li

Main category: cs.CV

TL;DR: This study introduces Hi-SMGNN, a novel hierarchical framework using structural and morphological brain connectomes to accurately predict IDH mutations in gliomas, surpassing existing methods.

Details

Motivation: Current IDH mutation prediction methods are limited by low-quality functional MRI data and failure to account for the brain's hierarchical organization and multiscale interactions. Method: The study proposes Hi-SMGNN, a hierarchical framework combining structural and morphological connectomes with a Siamese network, cross-modal attention, and personalized modular partitioning for IDH mutation prediction. Result: Hi-SMGNN outperforms baseline and state-of-the-art models in IDH mutation prediction on the UCSF-PDGM dataset with enhanced robustness and effectiveness. Conclusion: Hi-SMGNN effectively predicts IDH mutation status by integrating structural and morphological connectomes, outperforming existing models in robustness and effectiveness. Abstract: Isocitrate DeHydrogenase (IDH) mutation status is a crucial biomarker for glioma prognosis. However, current prediction methods are limited by the low availability and noise of functional MRI. Structural and morphological connectomes offer a non-invasive alternative, yet existing approaches often ignore the brain's hierarchical organisation and multiscale interactions. To address this, we propose Hi-SMGNN, a hierarchical framework that integrates structural and morphological connectomes from regional to modular levels. It features a multimodal interaction module with a Siamese network and cross-modal attention, a multiscale feature fusion mechanism for reducing redundancy, and a personalised modular partitioning strategy to enhance individual specificity and interpretability. Experiments on the UCSF-PDGM dataset demonstrate that Hi-SMGNN outperforms baseline and state-of-the-art models, showing improved robustness and effectiveness in IDH mutation prediction.

[111] SVG-Head: Hybrid Surface-Volumetric Gaussians for High-Fidelity Head Reconstruction and Real-Time Editing

Heyi Sun,Cong Wang,Tian-Xing Xu,Jingwei Huang,Di Kang,Chunchao Guo,Song-Hai Zhang

Main category: cs.CV

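TL;DR: The paper proposes SVG-Head, a novel hybrid representation for creating high-fidelity, editable head avatars that explicitly models geometry and global appearance, enabling high-quality reconstruction and real-time texture editing.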

Details

Motivation: Real-time appearance editing of head avatars remains challenging because of implicit representations and the entangled modeling of geometry and global appearance. Method: The paper proposes the Surface-Volumetric Gaussian Head Avatar (SVG-Head), which explicitly models geometry with 3D Gaussians bound to a FLAME mesh and uses disentangled texture images to capture global appearance. Result: Experiments show that SVG-Head not only produces high-fidelity renderings but is also the first Gaussian head avatar method to obtain explicit texture images and support real-time appearance editing. Conclusion: SVG-Head is an effective head avatar representation that offers high-quality reconstruction and flexible real-time editing. Abstract: Creating high-fidelity and editable head avatars is a pivotal challenge in computer vision and graphics, boosting many AR/VR applications. While recent advancements have achieved photorealistic renderings and plausible animation, head editing, especially real-time appearance editing, remains challenging due to the implicit representation and entangled modeling of the geometry and global appearance. To address this, we propose Surface-Volumetric Gaussian Head Avatar (SVG-Head), a novel hybrid representation that explicitly models the geometry with 3D Gaussians bound on a FLAME mesh and leverages disentangled texture images to capture the global appearance. Technically, it contains two types of Gaussians, in which surface Gaussians explicitly model the appearance of head avatars using learnable texture images, facilitating real-time texture editing, while volumetric Gaussians enhance the reconstruction quality of non-Lambertian regions (e.g., lips and hair). To model the correspondence between 3D world and texture space, we provide a mesh-aware Gaussian UV mapping method, which leverages UV coordinates given by the FLAME mesh to obtain sharp texture images and real-time rendering speed. A hierarchical optimization strategy is further designed to pursue the optimal performance in both reconstruction quality and editing flexibility. Experiments on the NeRSemble dataset show that SVG-Head not only generates high-fidelity rendering results, but also is the first method to obtain explicit texture images for Gaussian head avatars and support real-time appearance editing.

[112] Images Speak Louder Than Scores: Failure Mode Escape for Enhancing Generative Quality

Jie Shao,Ke Zhu,Minghao Fu,Guo-hua Wang,Jianxin Wu

Main category: cs.CV

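TL;DR: This paper proposes FaME, a training-free and inference-efficient method that aims to improve the perceptual quality of diffusion models in class-to-image generation.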

Details

Motivation: Although diffusion models have made remarkable progress in class-to-image generation, state-of-the-art models still produce distorted or low-quality images in certain classes. This is because the commonly used FID score does not accurately reflect the perceptual quality of individual samples. Method: FaME uses an image quality assessment model to identify low-quality generations and stores their sampling trajectories as negative guidance to steer future sampling away from low-quality regions. Result: Experiments show that FaME brings consistent improvements in visual quality on ImageNet while keeping FID stable. FaME also shows potential to extend to text-to-image generation. Conclusion: FaME is an effective training-free method that improves the perceptual quality of diffusion models and opens a new direction for future research. Abstract: Diffusion models have achieved remarkable progress in class-to-image generation. However, we observe that despite impressive FID scores, state-of-the-art models often generate distorted or low-quality images, especially in certain classes. This gap arises because FID evaluates global distribution alignment, while ignoring the perceptual quality of individual samples. We further examine the role of CFG, a common technique used to enhance generation quality. While effective in improving metrics and suppressing outliers, CFG can introduce distribution shift and visual artifacts due to its misalignment with both training objectives and user expectations. In this work, we propose FaME, a training-free and inference-efficient method for improving perceptual quality. FaME uses an image quality assessment model to identify low-quality generations and stores their sampling trajectories. These failure modes are then used as negative guidance to steer future sampling away from poor-quality regions. Experiments on ImageNet demonstrate that FaME brings consistent improvements in visual quality without compromising FID. FaME also shows the potential to be extended to improve text-to-image generation.

[113] BridgeTA: Bridging the Representation Gap in Knowledge Distillation via Teacher Assistant for Bird's Eye View Map Segmentation

Beomjun Kim,Suhan Woo,Sejong Heo,Euntai Kim

Main category: cs.CV

TL;DR: BridgeTA improves Camera-only BEV map segmentation performance through a lightweight Teacher Assistant network and theoretically grounded distillation, without increasing inference cost.

Details

Motivation: Camera-only approaches are cost-effective but underperform compared to LiDAR-Camera fusion methods; existing Knowledge Distillation methods increase student model size and inference cost. Method: BridgeTA uses a lightweight Teacher Assistant (TA) network to create a shared latent space between teacher and student models, with a theoretically grounded distillation loss derived using Young's Inequality. Result: BridgeTA achieved a 4.2% mIoU improvement over the Camera-only baseline, with up to 45% better performance than other state-of-the-art KD methods on the nuScenes dataset. Conclusion: BridgeTA effectively narrows the performance gap between LiDAR-Camera fusion and Camera-only models in BEV map segmentation without increasing inference cost. Abstract: Bird's-Eye-View (BEV) map segmentation is one of the most important and challenging tasks in autonomous driving. Camera-only approaches have drawn attention as cost-effective alternatives to LiDAR, but they still fall behind LiDAR-Camera (LC) fusion-based methods. Knowledge Distillation (KD) has been explored to narrow this gap, but existing methods mainly enlarge the student model by mimicking the teacher's architecture, leading to higher inference cost. To address this issue, we introduce BridgeTA, a cost-effective distillation framework to bridge the representation gap between LC fusion and Camera-only models through a Teacher Assistant (TA) network while keeping the student's architecture and inference cost unchanged. A lightweight TA network combines the BEV representations of the teacher and student, creating a shared latent space that serves as an intermediate representation. To ground the framework theoretically, we derive a distillation loss using Young's Inequality, which decomposes the direct teacher-student distillation path into teacher-TA and TA-student dual paths, stabilizing optimization and strengthening knowledge transfer. 
Extensive experiments on the challenging nuScenes dataset demonstrate the effectiveness of our method, achieving an improvement of 4.2% mIoU over the Camera-only baseline, up to 45% higher than the improvement of other state-of-the-art KD methods.
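
How Young's inequality can split a direct teacher-student distance into two paths through the TA can be sketched with a standard bound. The notation is an assumption made for illustration: $f_T$, $f_A$, and $f_S$ denote the teacher, TA, and student BEV features, and $\varepsilon > 0$ is a free weighting parameter. The paper's actual loss may use different terms and weights.

```latex
\[
\|f_T - f_S\|^2
  = \|f_T - f_A\|^2 + 2\,\langle f_T - f_A,\; f_A - f_S \rangle + \|f_A - f_S\|^2
\]
% Cauchy-Schwarz followed by Young's inequality, 2ab <= eps a^2 + b^2 / eps:
\[
2\,\langle f_T - f_A,\; f_A - f_S \rangle
  \le \varepsilon\,\|f_T - f_A\|^2 + \frac{1}{\varepsilon}\,\|f_A - f_S\|^2
\]
\[
\|f_T - f_S\|^2
  \le (1 + \varepsilon)\,\|f_T - f_A\|^2
    + \Bigl(1 + \frac{1}{\varepsilon}\Bigr)\,\|f_A - f_S\|^2
\]
```

Under this reading, minimizing the teacher-TA and TA-student terms on the right bounds the direct teacher-student gap on the left, and $\varepsilon$ trades off how much weight each path receives.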

[114] MInDI-3D: Iterative Deep Learning in 3D for Sparse-view Cone Beam Computed Tomography

Daniel Barco,Marc Stadelmann,Martin Oswald,Ivo Herzig,Lukas Lichtensteiger,Pascal Paysan,Igor Peterlik,Michal Walczak,Bjoern Menze,Frank-Peter Schilling

Main category: cs.CV

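TL;DR: MInDI-3D is a 3D conditional diffusion-based model that removes sparse-view CBCT artefacts, enabling an 8x reduction in radiation exposure and improving the quality of real-world CBCT scans.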

Details

Motivation: An effective model for sparse-view CBCT artefact removal was needed to reduce radiation exposure in real-world clinical settings. Method: MInDI-3D extends the "InDI" concept from 2D to 3D, implementing an iterative denoising process that refines the CBCT volume directly from sparse-view input. The model was trained on a large generated pseudo-CBCT dataset. Result: On the CT-RATE pseudo-CBCT test set, MInDI-3D achieves a 12.96 dB PSNR gain over uncorrected scans using only 50 projections, enables an 8x reduction in radiation exposure, improves with more training data, and generalises to new CBCT scanner geometries. Conclusion: MInDI-3D is a 3D conditional diffusion-based model that effectively improves the quality of real-world CBCT scans and reduces radiation exposure. Abstract: We present MInDI-3D (Medical Inversion by Direct Iteration in 3D), the first 3D conditional diffusion-based model for real-world sparse-view Cone Beam Computed Tomography (CBCT) artefact removal, aiming to reduce imaging radiation exposure. A key contribution is extending the "InDI" concept from 2D to a full 3D volumetric approach for medical images, implementing an iterative denoising process that refines the CBCT volume directly from sparse-view input. A further contribution is the generation of a large pseudo-CBCT dataset (16,182) from chest CT volumes of the CT-RATE public dataset to robustly train MInDI-3D. We performed a comprehensive evaluation, including quantitative metrics, scalability analysis, generalisation tests, and a clinical assessment by 11 clinicians. Our results show MInDI-3D's effectiveness, achieving a 12.96 (6.10) dB PSNR gain over uncorrected scans with only 50 projections on the CT-RATE pseudo-CBCT (independent real-world) test set and enabling an 8x reduction in imaging radiation exposure. We demonstrate its scalability by showing that performance improves with more training data. Importantly, MInDI-3D matches the performance of a 3D U-Net on real-world scans from 16 cancer patients across distortion and task-based metrics. It also generalises to new CBCT scanner geometries. Clinicians rated our model as sufficient for patient positioning across all anatomical sites and found it preserved lung tumour boundaries well.
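The iterative refinement described above can be sketched with the update rule from the original 2D InDI formulation, which moves a small step from the current estimate toward the network's prediction at each iteration. This is a minimal sketch, not the MInDI-3D implementation: `indi_restore`, the step schedule, and the toy `restorer` (which stands in for a trained 3D network) are all assumptions made for illustration.

```python
import numpy as np

def indi_restore(degraded, restorer, num_steps=10):
    """Iteratively move from the degraded input (t=1) toward a clean estimate (t=0)."""
    x = degraded.astype(np.float64)
    delta = 1.0 / num_steps
    for k in range(num_steps):
        t = (num_steps - k) / num_steps   # current time, from 1 down to delta
        prediction = restorer(x, t)       # network's estimate of the clean volume
        step = delta / t                  # fraction of the remaining path to cover
        x = step * prediction + (1.0 - step) * x
    return x

# Toy stand-in for a trained 3D network (hypothetical): always returns a known clean volume.
clean = np.ones((4, 4, 4))
degraded = clean + 0.5 * np.random.default_rng(0).standard_normal(clean.shape)
restored = indi_restore(degraded, lambda x, t: clean, num_steps=5)
```

With a perfect restorer the final step has weight 1, so the loop lands exactly on the prediction; with a real network, smaller steps trade compute for a more gradual refinement.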

[115] Plane Detection and Ranking via Model Information Optimization

Daoxin Zhong,Jun Li,Meng Yee Michael Chuah

Main category: cs.CV

TL;DR: The paper introduces a new framework for plane detection from depth images that improves accuracy and avoids false positives by using model information optimization, with further enhancements from neural network segmentation.

Details

Motivation: The motivation is to address the issue of false positive plane detections in complex real-world scenes when using RANSAC for plane detection from depth images, due to the ambiguity of its inlier threshold criterion. Method: The method involves treating depth readings as discrete random variables, generating various models through random sub-sampling, calculating information for each model using the depth sensor's physics and noise model, and selecting the model with the least information as the most likely ground truth. The algorithm is accelerated by partitioning the depth map using neural network segmentation. Result: The result is a generalized framework for plane detection that allows for objective determination of the true number of planes, prevents false positive detections, and ranks the quality of detected planes. The algorithm estimates plane parameters more accurately than the default Open3D RANSAC plane segmentation and generates more realistic plane parameters in real-world data after acceleration through neural network segmentation. Conclusion: The paper concludes that their proposed framework for plane detection, based on model information optimization, provides more accurate plane parameter estimation and prevents false positive detections compared to the default Open3D RANSAC plane segmentation. Abstract: Plane detection from depth images is a crucial subtask with broad robotic applications, often accomplished by iterative methods such as Random Sample Consensus (RANSAC). While RANSAC is a robust strategy with strong probabilistic guarantees, the ambiguity of its inlier threshold criterion makes it susceptible to false positive plane detections. This issue is particularly prevalent in complex real-world scenes, where the true number of planes is unknown and multiple planes coexist. In this paper, we aim to address this limitation by proposing a generalised framework for plane detection based on model information optimization. 
Building on previous works, we treat the observed depth readings as discrete random variables, with their probability distributions constrained by the ground truth planes. Various models containing different candidate plane constraints are then generated through repeated random sub-sampling to explain our observations. By incorporating the physics and noise model of the depth sensor, we can calculate the information for each model, and the model with the least information is accepted as the most likely ground truth. This information optimization process serves as an objective mechanism for determining the true number of planes and preventing false positive detections. Additionally, the quality of each detected plane can be ranked by summing the information reduction of inlier points for each plane. We validate these properties through experiments with synthetic data and find that our algorithm estimates plane parameters more accurately compared to the default Open3D RANSAC plane segmentation. Furthermore, we accelerate our algorithm by partitioning the depth map using neural network segmentation, which enhances its ability to generate more realistic plane parameters in real-world data.
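The least-information selection rule can be illustrated with a small description-length style sketch. Everything here is an assumption made for illustration rather than the paper's formulation: the discretized Gaussian noise model, the uniform cost for unexplained readings, the fixed per-plane cost `plane_bits`, and the function names.

```python
import math
import numpy as np

def point_information(residual, sigma, bin_width):
    """Bits needed to encode one depth reading given a plane (discretized Gaussian noise)."""
    lo = (residual - bin_width / 2) / (sigma * math.sqrt(2))
    hi = (residual + bin_width / 2) / (sigma * math.sqrt(2))
    p = 0.5 * (math.erf(hi) - math.erf(lo))   # probability mass of the reading's bin
    return -math.log2(max(p, 1e-300))

def model_information(points, planes, sigma=0.01, bin_width=0.001,
                      depth_range=10.0, plane_bits=96.0):
    """Total bits for a candidate model: plane parameters plus every depth reading."""
    outlier_bits = math.log2(depth_range / bin_width)  # uniform code for unexplained readings
    total = plane_bits * len(planes)
    for x, y, z in points:
        best = outlier_bits
        for a, b, c in planes:                          # plane: z = a*x + b*y + c
            best = min(best, point_information(z - (a * x + b * y + c), sigma, bin_width))
        total += best
    return total

# Synthetic scene with a single true plane at z = 1 and sensor noise of 1 cm.
rng = np.random.default_rng(0)
xy = rng.uniform(0, 1, size=(200, 2))
z = 1.0 + rng.normal(0, 0.01, size=200)
points = np.column_stack([xy, z])
candidates = {
    "no plane": [],
    "one plane": [(0.0, 0.0, 1.0)],
    "two planes": [(0.0, 0.0, 1.0), (0.0, 0.0, 5.0)],  # second plane is spurious
}
scores = {name: model_information(points, planes) for name, planes in candidates.items()}
best = min(scores, key=scores.get)
```

In this toy setup the spurious second plane explains no readings but still costs its parameter bits, so the single-plane model has the least information and the false positive is rejected.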

[116] Semantic-aware DropSplat: Adaptive Pruning of Redundant Gaussians for 3D Aerial-View Segmentation

Xu Tang,Junan Jia,Yijing Wang,Jingjing Ma,Xiangrong Zhang

Main category: cs.CV

TL;DR: SAD-Splat improves 3D aerial-view scene semantic segmentation by addressing semantic ambiguity and enhancing supervision with pseudo-labels, achieving high performance on a new benchmark dataset.

Details

Motivation: Traditional methods struggle with semantic ambiguity due to scale variations and structural occlusions in aerial images, limiting segmentation accuracy. Method: SAD-Splat introduces a Gaussian point drop module and a high-confidence pseudo-label generation pipeline using 2D foundation models. Result: SAD-Splat achieves excellent segmentation accuracy and representation compactness on the 3D-AS dataset, offering an efficient and scalable solution. Conclusion: SAD-Splat provides an efficient and scalable solution for 3D aerial scene understanding, achieving a good balance between segmentation accuracy and representation compactness. Abstract: In the task of 3D Aerial-view Scene Semantic Segmentation (3D-AVS-SS), traditional methods struggle to address semantic ambiguity caused by scale variations and structural occlusions in aerial images. This limits their segmentation accuracy and consistency. To tackle these challenges, we propose a novel 3D-AVS-SS approach named SAD-Splat. Our method introduces a Gaussian point drop module, which integrates semantic confidence estimation with a learnable sparsity mechanism based on the Hard Concrete distribution. This module effectively eliminates redundant and semantically ambiguous Gaussian points, enhancing both segmentation performance and representation compactness. Furthermore, SAD-Splat incorporates a high-confidence pseudo-label generation pipeline. It leverages 2D foundation models to enhance supervision when ground-truth labels are limited, thereby further improving segmentation accuracy. To advance research in this domain, we introduce a challenging benchmark dataset: 3D Aerial Semantic (3D-AS), which encompasses diverse real-world aerial scenes with sparse annotations. Experimental results demonstrate that SAD-Splat achieves an excellent balance between segmentation accuracy and representation compactness. It offers an efficient and scalable solution for 3D aerial scene understanding.
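The learnable sparsity mechanism builds on the Hard Concrete distribution, which can be sketched as a stretched and clipped sigmoid gate. This is a minimal sketch of the standard Hard Concrete sampling step only; the function name, the default `beta`, `gamma`, and `zeta` values, and the idea of one gate per Gaussian point are assumptions, and how SAD-Splat combines the gate with semantic confidence is not shown.

```python
import numpy as np

def hard_concrete_gate(log_alpha, u, beta=2.0 / 3.0, gamma=-0.1, zeta=1.1):
    """Sample gates in [0, 1] that can be exactly 0 (drop) or exactly 1 (keep)."""
    # Concrete (relaxed Bernoulli) sample from uniform noise u in (0, 1)
    s = 1.0 / (1.0 + np.exp(-(np.log(u) - np.log(1.0 - u) + log_alpha) / beta))
    # Stretch to (gamma, zeta), then clip so mass accumulates at exactly 0 and 1
    return np.clip(s * (zeta - gamma) + gamma, 0.0, 1.0)

# One learnable log_alpha per Gaussian point (hypothetical values).
log_alpha = np.array([-10.0, 0.0, 10.0])
gates = hard_concrete_gate(log_alpha, u=np.full(3, 0.5))
```

Points whose gate is exactly zero can be pruned, while the clipping keeps the gate differentiable with respect to `log_alpha` in the interior of the interval.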

[117] Enhancing Monocular 3D Hand Reconstruction with Learned Texture Priors

Giorgos Karvounas,Nikolaos Kyriazis,Iason Oikonomidis,Georgios Pavlakos,Antonis A. Argyros

Main category: cs.CV

TL;DR: This paper introduces a texture module that utilizes texture alignment as a supervisory signal to improve 3D hand reconstruction accuracy and realism.

Details

Motivation: The motivation is to explore texture as a crucial and underused supervisory signal in 3D hand reconstruction rather than merely an afterthought for photorealism. Method: The method involves a lightweight texture module that embeds per-pixel observations into UV texture space, enabling a dense alignment loss between predicted and observed hand appearances, integrated into the existing HaMeR pipeline. Result: The result demonstrates that incorporating texture-guided supervision enhances the accuracy and realism of 3D hand reconstruction. Conclusion: The paper concludes that texture plays a significant role in enhancing 3D hand reconstruction by acting as a dense supervisory signal, improving both accuracy and realism. Abstract: We revisit the role of texture in monocular 3D hand reconstruction, not as an afterthought for photorealism, but as a dense, spatially grounded cue that can actively support pose and shape estimation. Our observation is simple: even in high-performing models, the overlay between predicted hand geometry and image appearance is often imperfect, suggesting that texture alignment may be an underused supervisory signal. We propose a lightweight texture module that embeds per-pixel observations into UV texture space and enables a novel dense alignment loss between predicted and observed hand appearances. Our approach assumes access to a differentiable rendering pipeline and a model that maps images to 3D hand meshes with known topology, allowing us to back-project a textured hand onto the image and perform pixel-based alignment. The module is self-contained and easily pluggable into existing reconstruction pipelines. To isolate and highlight the value of texture-guided supervision, we augment HaMeR, a high-performing yet unadorned transformer architecture for 3D hand pose estimation. The resulting system improves both accuracy and realism, demonstrating the value of appearance-guided alignment in hand reconstruction.

[118] Preacher: Paper-to-Video Agentic System

Jingwei Liu,Ling Yang,Hao Luo,Fan Wang,Hongyan Li,Mengdi Wang

Main category: cs.CV

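TL;DR: This paper presents Preacher, a paper-to-video system that combines top-down and bottom-up approaches with a Progressive Chain of Thought (P-CoT) to generate high-quality video abstracts.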

Details

Motivation: Existing video generation models are limited in context window, video duration, stylistic diversity, and domain-knowledge representation; this paper aims to address these limitations. Method: Preacher takes a top-down approach to decompose, summarize, and reformulate the paper, then synthesizes video segments through bottom-up video generation, using P-CoT to align cross-modal representations. Result: Preacher generates high-quality video abstracts across five research fields, demonstrating capabilities beyond existing video generation models. Conclusion: Preacher offers an innovative solution to the paper-to-video task and effectively overcomes the limitations of existing techniques. Abstract: The paper-to-video task converts a research paper into a structured video abstract, distilling key concepts, methods, and conclusions into an accessible, well-organized format. While state-of-the-art video generation models demonstrate potential, they are constrained by limited context windows, rigid video duration constraints, limited stylistic diversity, and an inability to represent domain-specific knowledge. To address these limitations, we introduce Preacher, the first paper-to-video agentic system. Preacher employs a top-down approach to decompose, summarize, and reformulate the paper, followed by bottom-up video generation, synthesizing diverse video segments into a coherent abstract. To align cross-modal representations, we define key scenes and introduce a Progressive Chain of Thought (P-CoT) for granular, iterative planning. Preacher successfully generates high-quality video abstracts across five research fields, demonstrating expertise beyond current video generation models. Code will be released at: https://github.com/GenVerse/Paper2Video

[119] Multi-Contrast Fusion Module: An attention mechanism integrating multi-contrast features for fetal torso plane classification

Shengjun Zhu,Siyu Liu,Runqing Xiong,Liping Zheng,Duo Ma,Rongshang Chen,Jiaxin Cai

Main category: cs.CV

TL;DR: A novel Multi-Contrast Fusion Module (MCFM) improves fetal torso plane recognition in ultrasound imaging, enhancing anatomical structure capture for more accurate prenatal diagnoses.

Details

Motivation: Accurate identification of standard fetal torso planes in ultrasound imaging is essential for reliable assessment and personalized prenatal care, but challenges such as low contrast and unclear texture details hinder fine-grained anatomical recognition. Method: A Multi-Contrast Fusion Module (MCFM) was developed to enhance feature representation by assigning attention weights to image representations under different contrast conditions. Result: The MCFM substantially improved recognition performance with minimal increase in model complexity, achieving higher classification accuracy and clinical reliability. Conclusion: The method provides an effective solution for improving fetal torso plane recognition in ultrasound imaging, supporting more accurate and consistent diagnoses. Abstract: Purpose: Prenatal ultrasound is a key tool in evaluating fetal structural development and detecting abnormalities, contributing to reduced perinatal complications and improved neonatal survival. Accurate identification of standard fetal torso planes is essential for reliable assessment and personalized prenatal care. However, limitations such as low contrast and unclear texture details in ultrasound imaging pose significant challenges for fine-grained anatomical recognition. Methods: We propose a novel Multi-Contrast Fusion Module (MCFM) to enhance the model's ability to extract detailed information from ultrasound images. MCFM operates exclusively on the lower layers of the neural network, directly processing raw ultrasound data. By assigning attention weights to image representations under different contrast conditions, the module enhances feature modeling while explicitly maintaining minimal parameter overhead. Results: The proposed MCFM was evaluated on a curated dataset of fetal torso plane ultrasound images. Experimental results demonstrate that MCFM substantially improves recognition performance, with a minimal increase in model complexity. 
The integration of multi-contrast attention enables the model to better capture subtle anatomical structures, contributing to higher classification accuracy and clinical reliability. Conclusions: Our method provides an effective solution for improving fetal torso plane recognition in ultrasound imaging. By enhancing feature representation through multi-contrast fusion, the proposed approach supports clinicians in achieving more accurate and consistent diagnoses, demonstrating strong potential for clinical adoption in prenatal screening. The codes are available at https://github.com/sysll/MCFM.

[120] Multi-Sequence Parotid Gland Lesion Segmentation via Expert Text-Guided Segment Anything Model

Zhongyuan Wu,Chuan-Xian Ren,Yu Wang,Xiaohua Ban,Jianning Xiao,Xiaohui Duan

Main category: cs.CV

TL;DR: The proposed PG-SAM model integrates expert domain knowledge with medical image segmentation, achieving top performance in parotid gland lesion segmentation by utilizing diagnostic text and cross-sequence attention mechanisms.

Details

Motivation: Accurate segmentation of parotid gland lesions is challenging due to variable size and complex boundaries, and existing methods often ignore medical expert knowledge, prompting the need for a more effective and knowledge-driven approach. Method: PG-SAM incorporates expert domain knowledge through an expert diagnosis report guided prompt generation module and a cross-sequence attention module to integrate complementary information from different modalities. Result: PG-SAM achieves superior performance in cross-sequence parotid gland lesion segmentation across three clinical centers, proving its effectiveness and real-world applicability. Conclusion: PG-SAM demonstrates state-of-the-art performance in parotid gland lesion segmentation, showing its clinical applicability and effectiveness in utilizing diagnostic text to improve segmentation. Abstract: Parotid gland lesion segmentation is essential for the treatment of parotid gland diseases. However, due to the variable size and complex lesion boundaries, accurate parotid gland lesion segmentation remains challenging. Recently, the Segment Anything Model (SAM) fine-tuning has shown remarkable performance in the field of medical image segmentation. Nevertheless, SAM's interaction segmentation model relies heavily on precise lesion prompts (points, boxes, masks, etc.), which are very difficult to obtain in real-world applications. Besides, current medical image segmentation methods are automatically generated, ignoring the domain knowledge of medical experts when performing segmentation. To address these limitations, we propose the parotid gland segment anything model (PG-SAM), an expert diagnosis text-guided SAM incorporating expert domain knowledge for cross-sequence parotid gland lesion segmentation. 
Specifically, we first propose an expert diagnosis report guided prompt generation module that can automatically generate prompt information containing the prior domain knowledge to guide the subsequent lesion segmentation process. Then, we introduce a cross-sequence attention module, which integrates the complementary information of different modalities to enhance the segmentation effect. Finally, the multi-sequence image features and generated prompts are fed into the decoder to obtain the segmentation result. Experimental results demonstrate that PG-SAM achieves state-of-the-art performance in parotid gland lesion segmentation across three independent clinical centers, validating its clinical applicability and the effectiveness of diagnostic text for enhancing image segmentation in real-world clinical settings.

[121] The Brain Resection Multimodal Image Registration (ReMIND2Reg) 2025 Challenge

Reuben Dorent,Laura Rigolo,Colin P. Galvin,Junyu Chen,Mattias P. Heinrich,Aaron Carass,Olivier Colliot,Demian Wassermann,Alexandra Golby,Tina Kapur,William Wells

Main category: cs.CV

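TL;DR: The ReMIND2Reg 2025 Challenge provides a large public benchmark for image guidance in brain tumor surgery, targeting multimodal registration of intraoperative ultrasound to preoperative MRI and promoting the development of robust, clinically usable algorithms.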

Details

Motivation: Accurate intraoperative image guidance is critical in brain tumor surgery, but neuronavigation systems based on preoperative MRI lose accuracy because of brain shift. Registering post-resection intraoperative ultrasound to preoperative MRI can restore spatial accuracy, but the task remains technically and clinically challenging. Method: Built on the ReMIND dataset, ReMIND2Reg provides 99 training cases, 5 validation cases, and 10 private test cases comprising paired 3D ceT1 MRI, T2 MRI, and post-resection 3D iUS volumes, and evaluates performance on manually annotated anatomical landmarks. Result: ReMIND2Reg provides a standardized evaluation framework with metrics including target registration error (TRE), robustness to worst-case landmark misalignment (TRE30), and runtime. Conclusion: The ReMIND2Reg 2025 Challenge provides a large-scale public benchmark intended to advance robust, generalizable, and clinically deployable multimodal registration algorithms for image-guided neurosurgery. Abstract: Accurate intraoperative image guidance is critical for achieving maximal safe resection in brain tumor surgery, yet neuronavigation systems based on preoperative MRI lose accuracy during the procedure due to brain shift. Aligning post-resection intraoperative ultrasound (iUS) with preoperative MRI can restore spatial accuracy by estimating brain shift deformations, but it remains a challenging problem given the large anatomical and topological changes and substantial modality intensity gap. The ReMIND2Reg 2025 Challenge provides the largest public benchmark for this task, built upon the ReMIND dataset. It offers 99 training cases, 5 validation cases, and 10 private test cases comprising paired 3D ceT1 MRI, T2 MRI, and post-resection 3D iUS volumes. Data are provided without annotations for training, while validation and test performance are evaluated on manually annotated anatomical landmarks. Metrics include target registration error (TRE), robustness to worst-case landmark misalignment (TRE30), and runtime. By establishing a standardized evaluation framework for this clinically critical and technically complex problem, ReMIND2Reg aims to accelerate the development of robust, generalizable, and clinically deployable multimodal registration algorithms for image-guided neurosurgery.
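Landmark-based target registration error is commonly computed as the Euclidean distance, in physical units, between each annotated landmark and its counterpart after registration. The sketch below assumes that common definition; the function name, the voxel `spacing` argument, and the example coordinates are hypothetical, and the challenge's exact TRE and TRE30 definitions are not reproduced here.

```python
import numpy as np

def target_registration_error(fixed_landmarks, warped_landmarks, spacing=(1.0, 1.0, 1.0)):
    """Per-landmark Euclidean distance between fixed and warped landmarks (voxels scaled to mm)."""
    fixed = np.asarray(fixed_landmarks, dtype=float)
    warped = np.asarray(warped_landmarks, dtype=float)
    diff = (fixed - warped) * np.asarray(spacing, dtype=float)
    return np.linalg.norm(diff, axis=1)

# Two hypothetical landmarks: the first is off by a 3-4-5 triangle, the second is aligned.
errors = target_registration_error([[0, 0, 0], [1, 0, 0]], [[3, 4, 0], [1, 0, 0]])
mean_tre = errors.mean()
```

Reporting the per-landmark errors alongside the mean makes it possible to summarize the worst-aligned landmarks separately, which is the kind of robustness a worst-case metric is meant to capture.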

[122] TOTNet: Occlusion-Aware Temporal Tracking for Robust Ball Detection in Sports Videos

Hao Xu,Arbind Agrahari Baniya,Sam Wells,Mohamed Reda Bouadjenek,Richard Dazely,Sunil Aryal

Main category: cs.CV

TL;DR: TOTNet is a Temporal Occlusion Tracking Network designed to improve ball tracking in sports video analysis under partial and full occlusion.

Details

Motivation: Robust ball tracking under occlusion remains a key challenge in sports video analysis, affecting tasks such as event detection and officiating. Method: TOTNet uses 3D convolutions, a visibility-weighted loss, and occlusion augmentation to improve tracking under partial and full occlusion. Result: Evaluated on four datasets, TOTNet significantly outperforms state-of-the-art methods, reducing RMSE from 37.30 to 7.19 and improving accuracy on fully occluded frames from 0.63 to 0.80. Conclusion: TOTNet's effectiveness is validated across multiple datasets, showing strong performance for offline analysis of fast-paced sports scenarios. Abstract: Robust ball tracking under occlusion remains a key challenge in sports video analysis, affecting tasks like event detection and officiating. We present TOTNet, a Temporal Occlusion Tracking Network that leverages 3D convolutions, visibility-weighted loss, and occlusion augmentation to improve performance under partial and full occlusions. Developed in collaboration with Paralympics Australia, TOTNet is designed for real-world sports analytics. We introduce TTA, a new occlusion-rich table tennis dataset collected from professional-level Paralympic matches, comprising 9,159 samples with 1,996 occlusion cases. Evaluated on four datasets across tennis, badminton, and table tennis, TOTNet significantly outperforms prior state-of-the-art methods, reducing RMSE from 37.30 to 7.19 and improving accuracy on fully occluded frames from 0.63 to 0.80. These results demonstrate TOTNet's effectiveness for offline sports analytics in fast-paced scenarios. Code and data access: https://github.com/AugustRushG/TOTNet.

[123] Noise-adapted Neural Operator for Robust Non-Line-of-Sight Imaging

Lianfang Wang,Kuilin Qin,Xueying Liu,Huibin Chang,Yong Wang,Yuping Duan

Main category: cs.CV

TL;DR: This paper proposes a framework for 3D non-line-of-sight imaging using a parameterized inverse problem approach and neural operator, achieving robust and accurate reconstructions from noisy and sparse data.

Details

Motivation: The motivation is driven by the challenge of extracting information from obscured scenes in NLOS imaging, where indirect light signals are weak and prone to noise, necessitating advanced methods for accurate reconstruction. Method: The method involves a parameterized inverse problem framework for 3D imaging reconstruction, incorporating a noise estimation module and a parameterized neural operator for end-to-end rapid image reconstruction. It also includes a technique for fusing global and local spatiotemporal data features. Result: The result is a 3D image reconstruction framework that provides enhanced accuracy and robustness in NLOS imaging, with demonstrated efficacy on both simulated and real datasets. Conclusion: The paper concludes that their proposed 3D image reconstruction framework, which utilizes a parameterized inverse problem approach and a neural operator, offers a robust and accurate solution for NLOS imaging, particularly effective with fast scanning and sparse illumination data. Abstract: In computational imaging, especially non-line-of-sight (NLOS) imaging, information about obscured or hidden scenes is extracted from indirect light signals produced by multiple reflections or scattering. The inherently weak nature of these signals, coupled with their susceptibility to noise, necessitates the integration of physical processes to ensure accurate reconstruction. This paper presents a parameterized inverse problem framework tailored for large-scale linear problems in 3D imaging reconstruction. Initially, a noise estimation module is employed to adaptively assess the noise levels present in transient data. Subsequently, a parameterized neural operator is developed to approximate the inverse mapping, facilitating end-to-end rapid image reconstruction. 
Our 3D image reconstruction framework, grounded in operator learning, is constructed through deep algorithm unfolding, which not only provides commendable model interpretability but also enables dynamic adaptation to varying noise levels in the acquired data, thereby ensuring consistently robust and accurate reconstruction outcomes. Furthermore, we introduce a novel method for the fusion of global and local spatiotemporal data features. By integrating structural and detailed information, this method significantly enhances both accuracy and robustness. Comprehensive numerical experiments conducted on both simulated and real datasets substantiate the efficacy of the proposed method. It demonstrates remarkable performance with fast scanning data and sparse illumination point data, offering a viable solution for NLOS imaging in complex scenarios.

[124] NegFaceDiff: The Power of Negative Context in Identity-Conditioned Diffusion for Synthetic Face Generation

Eduarda Caldeira,Naser Damer,Fadi Boutros

Main category: cs.CV

TL;DR: This paper proposes NegFaceDiff, a novel method for generating synthetic face images with improved identity separability in face recognition systems.

Details

Motivation: The use of synthetic data in face recognition development addresses privacy, ethical, and practical concerns, but existing methods often lack inter-class separability, leading to suboptimal performance. Method: NegFaceDiff introduces negative conditions into the identity-conditioned diffusion process to enhance identity separation while preserving intra-class consistency. Result: Experiments show that NegFaceDiff improves identity separability, with the Fisher Discriminant Ratio (FDR) increasing from 2.427 to 5.687, resulting in better performance in FR systems. Conclusion: NegFaceDiff significantly improves identity separability and consistency in synthetic face image generation, offering a promising alternative to authentic datasets in FR development. Abstract: The use of synthetic data as an alternative to authentic datasets in face recognition (FR) development has gained significant attention, addressing privacy, ethical, and practical concerns associated with collecting and using authentic data. Recent state-of-the-art approaches have proposed identity-conditioned diffusion models to generate identity-consistent face images, facilitating their use in training FR models. However, these methods often lack explicit sampling mechanisms to enforce inter-class separability, leading to identity overlap in the generated data and, consequently, suboptimal FR performance. In this work, we introduce NegFaceDiff, a novel sampling method that incorporates negative conditions into the identity-conditioned diffusion process. NegFaceDiff enhances identity separation by leveraging negative conditions that explicitly guide the model away from unwanted features while preserving intra-class consistency. Extensive experiments demonstrate that NegFaceDiff significantly improves the identity consistency and separability of data generated by identity-conditioned diffusion models. 
Specifically, identity separability, measured by the Fisher Discriminant Ratio (FDR), increases from 2.427 to 5.687. These improvements are reflected in FR systems trained on the NegFaceDiff dataset, which outperform models trained on data generated without negative conditions across multiple benchmarks.
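The Fisher Discriminant Ratio quoted above is usually computed from genuine (same-identity) and impostor (different-identity) comparison scores. The sketch below assumes that common definition, the squared gap between the two score means divided by the sum of their variances; the function name and the example scores are hypothetical, and the paper's exact evaluation protocol is not reproduced.

```python
import numpy as np

def fisher_discriminant_ratio(genuine_scores, impostor_scores):
    """Separability of genuine vs. impostor score distributions (higher is better)."""
    genuine = np.asarray(genuine_scores, dtype=float)
    impostor = np.asarray(impostor_scores, dtype=float)
    mean_gap = genuine.mean() - impostor.mean()
    return mean_gap ** 2 / (genuine.var() + impostor.var())

# Hypothetical similarity scores: genuine pairs score high, impostor pairs score low.
fdr = fisher_discriminant_ratio([0.8, 1.0], [0.0, 0.2])
```

A larger value means the two score distributions overlap less, which is the sense in which the reported increase from 2.427 to 5.687 indicates better identity separation.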

[125] GSFixer: Improving 3D Gaussian Splatting with Reference-Guided Video Diffusion Priors

Xingyilang Yin,Qi Zhang,Jiahao Chang,Ying Feng,Qingnan Fan,Xi Yang,Chi-Man Pun,Huaqi Zhang,Xiaodong Cun

Main category: cs.CV

TL;DR: GSFixer improves 3D scene reconstruction from sparse views by leveraging a novel framework combining semantic and geometric features, outperforming existing methods.

Details

Motivation: Reconstructing 3D scenes using 3D Gaussian Splatting from sparse views is an ill-posed problem with noticeable artifacts, and existing methods struggle to generate content consistent with input observations. Method: Proposed GSFixer, a novel framework based on a DiT-based video diffusion model trained on paired artifact 3DGS renders and clean frames, integrating 2D semantic features and 3D geometric features for improved restoration. Result: GSFixer achieves superior performance in 3DGS artifact restoration and sparse-view 3D reconstruction, validated through extensive experiments. Conclusion: GSFixer outperforms current state-of-the-art methods in 3DGS artifact restoration and sparse-view 3D reconstruction. Abstract: Reconstructing 3D scenes using 3D Gaussian Splatting (3DGS) from sparse views is an ill-posed problem due to insufficient information, often resulting in noticeable artifacts. While recent approaches have sought to leverage generative priors to complete information for under-constrained regions, they struggle to generate content that remains consistent with input observations. To address this challenge, we propose GSFixer, a novel framework designed to improve the quality of 3DGS representations reconstructed from sparse inputs. The core of our approach is the reference-guided video restoration model, built upon a DiT-based video diffusion model trained on paired artifact 3DGS renders and clean frames with additional reference-based conditions. Considering the input sparse views as references, our model integrates both 2D semantic features and 3D geometric features of reference views extracted from the visual geometry foundation model, enhancing the semantic coherence and 3D consistency when fixing artifact novel views. Furthermore, considering the lack of suitable benchmarks for 3DGS artifact restoration evaluation, we present DL3DV-Res which contains artifact frames rendered using low-quality 3DGS. 
Extensive experiments demonstrate our GSFixer outperforms current state-of-the-art methods in 3DGS artifact restoration and sparse-view 3D reconstruction. Project page: https://github.com/GVCLab/GSFixer.

[126] Surg-InvNeRF: Invertible NeRF for 3D tracking and reconstruction in surgical vision

Gerardo Loza,Junlei Hu,Dominic Jones,Sharib Ali,Pietro Valdastri

Main category: cs.CV

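TL;DR: The paper proposes a novel NeRF-based test-time optimisation (TTO) method that markedly improves 2D point tracking precision and is the first TTO approach to achieve 3D point tracking.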

Details

Motivation: Existing point tracking methods struggle to obtain consistent motion or are limited to 2D motion; combining TTO with a NeRF architecture can improve tracking precision and applicability. Method: A new invertible Neural Radiance Field (InvNeRF) architecture, supervised through a rendering-based approach, performs both 2D and 3D point tracking. Result: Tests on the STIR and SCARE datasets show that the method achieves nearly 50% higher precision than existing TTO methods in 2D point tracking and is the first TTO approach to 3D point tracking. Conclusion: The method surpasses existing TTO techniques by nearly 50% on average precision in 2D point tracking and is the first TTO approach for 3D point tracking, incorporating the benefits of deformable NeRF reconstruction. Abstract: We propose a novel test-time optimisation (TTO) approach framed by a NeRF-based architecture for long-term 3D point tracking. Most current methods in point tracking struggle to obtain consistent motion or are limited to 2D motion. TTO approaches frame the solution for long-term tracking as optimising a function that aggregates correspondences from other specialised state-of-the-art methods. Unlike the state-of-the-art on TTO, we propose parametrising such a function with our new invertible Neural Radiance Field (InvNeRF) architecture to perform both 2D and 3D tracking in surgical scenarios. Our approach allows us to exploit the advantages of a rendering-based approach by supervising the reprojection of pixel correspondences. It adapts strategies from recent rendering-based methods to obtain a bidirectional deformable-canonical mapping, to efficiently handle a defined workspace, and to guide the rays' density. It also presents our multi-scale HexPlanes for fast inference and a new algorithm for efficient pixel sampling and convergence criteria. We present results on the STIR and SCARE datasets, for evaluating point tracking and testing the integration of kinematic data in our pipeline, respectively. In 2D point tracking, our approach surpasses the precision and accuracy of the TTO state-of-the-art methods by nearly 50% on average precision, while competing with other approaches. In 3D point tracking, this is the first TTO approach, surpassing feed-forward methods while incorporating the benefits of a deformable NeRF-based reconstruction.

[127] PaCo-FR: Patch-Pixel Aligned End-to-End Codebook Learning for Facial Representation Pre-training

Yin Xie,Zhichao Chen,Xiaoze Yu,Yongle Zhao,Xiang An,Kaicheng Yang,Zimin Ran,Jia Guo,Ziyong Feng,Jiankang Deng

Main category: cs.CV

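TL;DR: This paper proposes PaCo-FR, an unsupervised facial representation learning framework that combines masked image modeling with patch-pixel alignment, achieving state-of-the-art performance on several facial analysis tasks while reducing reliance on expensive annotated data.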

Details

Motivation: Existing facial representation pre-training methods face three key challenges: failing to capture distinct facial features and fine-grained semantics, ignoring the spatial structure inherent to facial anatomy, and inefficiently utilizing limited labeled data. Method: PaCo-FR combines a structured masking strategy, a patch-based codebook, and spatial consistency constraints. Result: Pre-trained on just 2 million unlabeled images, PaCo-FR achieves state-of-the-art performance across several facial analysis tasks and shows significant improvements under varying poses, occlusions, and lighting conditions. Conclusion: By combining masked image modeling with patch-pixel alignment, PaCo-FR provides an unsupervised framework for facial representation learning that reduces reliance on annotated facial data and supports more effective facial analysis systems. Abstract: Facial representation pre-training is crucial for tasks like facial recognition, expression analysis, and virtual reality. However, existing methods face three key challenges: (1) failing to capture distinct facial features and fine-grained semantics, (2) ignoring the spatial structure inherent to facial anatomy, and (3) inefficiently utilizing limited labeled data. To overcome these, we introduce PaCo-FR, an unsupervised framework that combines masked image modeling with patch-pixel alignment. Our approach integrates three innovative components: (1) a structured masking strategy that preserves spatial coherence by aligning with semantically meaningful facial regions, (2) a novel patch-based codebook that enhances feature discrimination with multiple candidate tokens, and (3) spatial consistency constraints that preserve geometric relationships between facial components. PaCo-FR achieves state-of-the-art performance across several facial analysis tasks with just 2 million unlabeled images for pre-training. Our method demonstrates significant improvements, particularly in scenarios with varying poses, occlusions, and lighting conditions. We believe this work advances facial representation learning and offers a scalable, efficient solution that reduces reliance on expensive annotated datasets, driving more effective facial analysis systems.

[128] Slot Attention-based Feature Filtering for Few-Shot Learning

Javier Rodenas,Eduardo Aguilar,Petia Radeva

Main category: cs.CV

TL;DR: SAFF uses Slot Attention to filter irrelevant features in few-shot learning, outperforming existing methods on multiple benchmarks.

Details

Motivation: Irrelevant features degrade few-shot learning performance, so a method is needed to filter these features and improve classification accuracy. Method: SAFF integrates slot attention with patch embeddings to filter irrelevant features, using a similarity matrix to quantify relevance for classification. Result: SAFF improves few-shot classification performance by leveraging slot attention to filter weak and irrelevant features effectively. Conclusion: Extensive experiments on few-shot learning benchmarks demonstrate that SAFF outperforms state-of-the-art methods by effectively filtering irrelevant features using Slot Attention. Abstract: Irrelevant features can significantly degrade few-shot learning performance. This problem is used to match queries and support images based on meaningful similarities despite the limited data. However, in this process, non-relevant features such as background elements can easily lead to confusion and misclassification. To address this issue, we propose Slot Attention-based Feature Filtering for Few-Shot Learning (SAFF) that leverages slot attention mechanisms to discriminate and filter weak features, thereby improving few-shot classification performance. The key innovation of SAFF lies in its integration of slot attention with patch embeddings, unifying class-aware slots into a single attention mechanism to filter irrelevant features effectively. We introduce a similarity matrix that computes across support and query images to quantify the relevance of filtered embeddings for classification. Through experiments, we demonstrate that Slot Attention performs better than other attention mechanisms, capturing discriminative features while reducing irrelevant information. We validate our approach through extensive experiments on few-shot learning benchmarks: CIFAR-FS, FC100, miniImageNet and tieredImageNet, outperforming several state-of-the-art methods.
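
The abstract mentions a similarity matrix computed across support and query images. As a rough illustration only, the sketch below builds a cosine-similarity matrix between two sets of embedding vectors. The choice of cosine similarity and the function names are assumptions for illustration; the abstract does not specify how SAFF computes its matrix.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

def similarity_matrix(support, query):
    """Rows index support embeddings, columns index query embeddings."""
    return [[cosine(s, q) for q in query] for s in support]

# Hypothetical toy embeddings (not taken from the paper).
support = [[1.0, 0.0], [0.0, 1.0]]
query = [[2.0, 0.0], [1.0, 1.0]]
matrix = similarity_matrix(support, query)
```

Each entry scores how closely one support embedding matches one query embedding, which is the kind of relevance signal a filtering step could threshold or rank.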

[129] MangaDiT: Reference-Guided Line Art Colorization with Hierarchical Attention in Diffusion Transformers

Qianru Qiu,Jiafeng Mao,Kento Masui,Xueting Wang

Main category: cs.CV

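TL;DR: This paper proposes MangaDiT, which performs reference-guided line art colorization through internal attention mechanisms and improves region-level color consistency.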

Details

Motivation: Existing methods still struggle with region-level color consistency, especially when the reference and target images differ in character pose or motion. Method: Introduces a hierarchical attention mechanism with a dynamic attention weighting strategy, using pooled spatial features to expand the model's receptive field. Result: Experiments on two benchmark datasets show that the method significantly outperforms state-of-the-art approaches in both qualitative and quantitative evaluations. Conclusion: By exploiting internal attention mechanisms, MangaDiT significantly outperforms existing methods in both qualitative and quantitative evaluations. Abstract: Recent advances in diffusion models have significantly improved the performance of reference-guided line art colorization. However, existing methods still struggle with region-level color consistency, especially when the reference and target images differ in character pose or motion. Instead of relying on external matching annotations between the reference and target, we propose to discover semantic correspondences implicitly through internal attention mechanisms. In this paper, we present MangaDiT, a powerful model for reference-guided line art colorization based on Diffusion Transformers (DiT). Our model takes both line art and reference images as conditional inputs and introduces a hierarchical attention mechanism with a dynamic attention weighting strategy. This mechanism augments the vanilla attention with an additional context-aware path that leverages pooled spatial features, effectively expanding the model's receptive field and enhancing region-level color alignment. Experiments on two benchmark datasets demonstrate that our method significantly outperforms state-of-the-art approaches, achieving superior performance in both qualitative and quantitative evaluations.

[130] NEURAL: Attention-Guided Pruning for Unified Multimodal Resource-Constrained Clinical Evaluation

Devvrat Joshi,Islem Rekik

Main category: cs.CV

TL;DR: NEURAL is a novel framework for compressing medical images by preserving diagnostically critical regions, resulting in high diagnostic accuracy while significantly reducing data size.

Details

Motivation: The rapid growth of multimodal medical imaging data poses storage and transmission challenges, especially in settings with limited resources. Method: NEURAL uses semantics-guided data compression by leveraging cross-attention scores from a fine-tuned vision-language model to prune chest X-rays and create a graph-based representation fused with clinical knowledge. Result: NEURAL achieves a 93.4-97.7% reduction in image data size while maintaining diagnostic performance of 0.88-0.95 AUC for pneumonia detection on the MIMIC-CXR and CheXpert Plus datasets. Conclusion: NEURAL provides an efficient framework for compressing medical imaging data without sacrificing diagnostic performance, making it suitable for resource-constrained clinical settings. Abstract: The rapid growth of multimodal medical imaging data presents significant storage and transmission challenges, particularly in resource-constrained clinical settings. We propose NEURAL, a novel framework that addresses this by using semantics-guided data compression. Our approach repurposes cross-attention scores between the image and its radiological report from a fine-tuned generative vision-language model to structurally prune chest X-rays, preserving only diagnostically critical regions. This process transforms the image into a highly compressed graph representation. This unified graph-based representation fuses the pruned visual graph with a knowledge graph derived from the clinical report, creating a universal data structure that simplifies downstream modeling. Validated on the MIMIC-CXR and CheXpert Plus datasets for pneumonia detection, NEURAL achieves a 93.4-97.7% reduction in image data size while maintaining a high diagnostic performance of 0.88-0.95 AUC, outperforming other baseline models that use uncompressed data.
By creating a persistent, task-agnostic data asset, NEURAL resolves the trade-off between data size and clinical utility, enabling efficient workflows and teleradiology without sacrificing performance. Our NEURAL code is available at https://github.com/basiralab/NEURAL.
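
The idea of pruning an image by attention score can be pictured with a small sketch: rank image patches by a per-patch score and keep only the top fraction. Everything here is an assumption for illustration (the function name, the flat list of scores, the `keep_fraction` parameter); the paper's actual pruning is structural and produces a graph, which this sketch does not attempt to reproduce.

```python
def prune_patches(scores, keep_fraction):
    """Keep the highest-scoring patches.

    scores: one attention score per image patch (hypothetical input).
    keep_fraction: share of patches to retain, in (0, 1].
    Returns (sorted indices of kept patches, fraction of patches removed).
    """
    n_keep = max(1, round(len(scores) * keep_fraction))
    # Rank patch indices from highest to lowest score.
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    kept = sorted(ranked[:n_keep])
    reduction = 1.0 - len(kept) / len(scores)
    return kept, reduction

# Hypothetical scores for ten patches.
scores = [0.1, 0.9, 0.05, 0.7, 0.2, 0.3, 0.01, 0.02, 0.03, 0.04]
kept, reduction = prune_patches(scores, keep_fraction=0.2)
```

Note that the fraction of patches removed is only a proxy: the reduction in stored data size reported in the abstract depends on how the retained regions are encoded.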

[131] Multimodal Sheaf-based Network for Glioblastoma Molecular Subtype Prediction

Shekhnaz Idrissova,Islem Rekik

Main category: cs.CV

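TL;DR: The study proposes a new sheaf-based framework that improves multimodal data fusion for glioblastoma molecular subtype classification, supporting the development of non-invasive diagnostic tools.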

Details

Motivation: Glioblastoma molecular subtype classification currently requires invasive tissue extraction for comprehensive histopathological analysis, and existing multimodal approaches fall short in preserving structural information. Method: A novel sheaf-based framework combines MRI and histopathology images while preserving shared structural information across modalities. Result: The proposed model outperforms baseline methods and remains robust when data are incomplete or missing. Conclusion: The study proposes a new sheaf-based framework for structure-aware and consistent fusion of MRI and histopathology data, contributing to the development of virtual biopsy tools for rapid diagnostics. Abstract: Glioblastoma is a highly invasive brain tumor with rapid progression rates. Recent studies have shown that glioblastoma molecular subtype classification serves as a significant biomarker for effective targeted therapy selection. However, this classification currently requires invasive tissue extraction for comprehensive histopathological analysis. Existing multimodal approaches combining MRI and histopathology images are limited and lack robust mechanisms for preserving shared structural information across modalities. In particular, graph-based models often fail to retain discriminative features within heterogeneous graphs, and structural reconstruction mechanisms for handling missing or incomplete modality data are largely underexplored. To address these limitations, we propose a novel sheaf-based framework for structure-aware and consistent fusion of MRI and histopathology data. Our model outperforms baseline methods and demonstrates robustness in incomplete or missing data scenarios, contributing to the development of virtual biopsy tools for rapid diagnostics. Our source code is available at https://github.com/basiralab/MMSN/.

[132] Predictive Uncertainty for Runtime Assurance of a Real-Time Computer Vision-Based Landing System

Romeo Valentin,Sydney M. Katz,Artur B. Carneiro,Don Walker,Mykel J. Kochenderfer

Main category: cs.CV

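TL;DR: This paper presents an efficient, robust vision-based method for aircraft pose estimation that combines a new neural network architecture, uncertainty calibration, and a runtime fault-detection mechanism, offering a new solution for safety-critical applications.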

Details

Motivation: Although data-driven computer vision has advanced autonomous navigation for civil aviation (such as automated landing and runway detection), ensuring that these systems meet the strict robustness and safety requirements of aviation applications remains a major challenge. Method: The approach has three innovations: (1) an efficient, flexible neural network architecture based on a spatial Soft Argmax operator, which supports diverse vision backbones and performs probabilistic keypoint regression; (2) a principled loss function that produces calibrated predictive uncertainties, evaluated with sharpness and calibration metrics; (3) a residual-based Receiver Autonomous Integrity Monitoring (RAIM) technique that detects and rejects faulty model outputs at runtime. Result: The model outperforms baseline architectures in accuracy while producing calibrated uncertainty estimates with sub-pixel precision that can be used for downstream fault detection. Conclusion: The paper presents a vision-based aircraft pose estimation pipeline that offers a practical path toward certifying such systems for safety-critical aviation applications. Abstract: Recent advances in data-driven computer vision have enabled robust autonomous navigation capabilities for civil aviation, including automated landing and runway detection. However, ensuring that these systems meet the robustness and safety requirements for aviation applications remains a major challenge. In this work, we present a practical vision-based pipeline for aircraft pose estimation from runway images that represents a step toward the ability to certify these systems for use in safety-critical aviation applications. Our approach features three key innovations: (i) an efficient, flexible neural architecture based on a spatial Soft Argmax operator for probabilistic keypoint regression, supporting diverse vision backbones with real-time inference; (ii) a principled loss function producing calibrated predictive uncertainties, which are evaluated via sharpness and calibration metrics; and (iii) an adaptation of Residual-based Receiver Autonomous Integrity Monitoring (RAIM), enabling runtime detection and rejection of faulty model outputs. We implement and evaluate our pose estimation pipeline on a dataset of runway images. We show that our model outperforms baseline architectures in terms of accuracy while also producing well-calibrated uncertainty estimates with sub-pixel precision that can be used downstream for fault detection.
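
A spatial soft argmax turns a 2D score map into a sub-pixel keypoint estimate by taking the expected pixel coordinate under a softmax distribution. The sketch below is a minimal pure-Python version of that generic operator. The `temperature` parameter and the returned variances are illustrative assumptions: the variance here is simply the spatial spread of the softmax distribution, not the calibrated predictive uncertainty that the paper obtains through its loss function.

```python
import math

def spatial_soft_argmax(heatmap, temperature=1.0):
    """Return (mean_x, mean_y, var_x, var_y) for a 2D score map given as a list of rows."""
    # Subtract the maximum before exponentiating for numerical stability.
    peak = max(max(row) for row in heatmap)
    weights = [[math.exp((v - peak) / temperature) for v in row] for row in heatmap]
    total = sum(sum(row) for row in weights)
    probs = [[w / total for w in row] for row in weights]

    # Expected coordinates under the softmax distribution (x = column, y = row).
    mean_x = sum(p * x for row in probs for x, p in enumerate(row))
    mean_y = sum(p * y for y, row in enumerate(probs) for p in row)

    # Spatial spread of the distribution around the expected location.
    var_x = sum(p * (x - mean_x) ** 2 for row in probs for x, p in enumerate(row))
    var_y = sum(p * (y - mean_y) ** 2 for y, row in enumerate(probs) for p in row)
    return mean_x, mean_y, var_x, var_y

# Hypothetical 3x3 score map with a single central peak.
heatmap = [[0.0, 0.0, 0.0],
           [0.0, 5.0, 0.0],
           [0.0, 0.0, 0.0]]
mx, my, vx, vy = spatial_soft_argmax(heatmap)
```

Because the operator is a weighted average rather than a hard maximum, it is differentiable and can return coordinates between pixel centers, which is what makes sub-pixel regression possible.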

[133] Seeing, Listening, Remembering, and Reasoning: A Multimodal Agent with Long-Term Memory

Lin Long,Yichen He,Wentao Ye,Yiyuan Pan,Yuan Lin,Hang Li,Junbo Zhao,Wei Li

Main category: cs.CV

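TL;DR: M3-Agent is a multimodal agent with long-term memory that processes real-time visual and auditory inputs and organizes its memory in an entity-centric, multimodal format for a deeper and more consistent understanding of the environment.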

Details

Motivation: To develop a multimodal agent framework that comes closer to the human ability to process real-time visual and auditory inputs in order to build and update long-term memory. Method: M3-Agent is trained via reinforcement learning, and memory effectiveness and memory-based reasoning are evaluated with M3-Bench. Result: Experiments show that M3-Agent achieves 6.7%, 7.7%, and 5.3% higher accuracy than the strongest baseline on M3-Bench-robot, M3-Bench-web, and VideoMME-long, respectively. Conclusion: M3-Agent is a novel multimodal agent framework with long-term memory that moves multimodal agents toward more human-like long-term memory and provides insights into their practical design. Abstract: We introduce M3-Agent, a novel multimodal agent framework equipped with long-term memory. Like humans, M3-Agent can process real-time visual and auditory inputs to build and update its long-term memory. Beyond episodic memory, it also develops semantic memory, enabling it to accumulate world knowledge over time. Its memory is organized in an entity-centric, multimodal format, allowing deeper and more consistent understanding of the environment. Given an instruction, M3-Agent autonomously performs multi-turn, iterative reasoning and retrieves relevant information from memory to accomplish the task. To evaluate memory effectiveness and memory-based reasoning in multimodal agents, we develop M3-Bench, a new long-video question answering benchmark. M3-Bench comprises 100 newly recorded real-world videos captured from a robot's perspective (M3-Bench-robot) and 929 web-sourced videos across diverse scenarios (M3-Bench-web). We annotate question-answer pairs designed to test key capabilities essential for agent applications, such as human understanding, general knowledge extraction, and cross-modal reasoning. Experimental results show that M3-Agent, trained via reinforcement learning, outperforms the strongest baseline, a prompting agent using Gemini-1.5-pro and GPT-4o, achieving 6.7%, 7.7%, and 5.3% higher accuracy on M3-Bench-robot, M3-Bench-web and VideoMME-long, respectively. Our work advances multimodal agents toward more human-like long-term memory and provides insights into their practical design. Model, code and data are available at https://github.com/bytedance-seed/m3-agent

[134] Region-to-Region: Enhancing Generative Image Harmonization with Adaptive Regional Injection

Zhiqiu Zhang,Dongqi Fan,Mingjie Wang,Qiang Tang,Jian Yang,Zili Yi

Main category: cs.CV

TL;DR: This paper proposes R2R, a novel model for image harmonization, and RPHarmony, a new synthetic dataset, achieving superior performance in visual harmony and realism compared to existing methods.

Details

Motivation: The motivation of the paper is to address the challenges in image harmonization, particularly the limitations of latent diffusion models (LDMs) in detail preservation and harmonization ability. Additionally, the lack of realistic synthetic datasets that capture complex real-world lighting conditions is addressed by proposing a new dataset generation method. Method: The paper proposes a Region-to-Region (R2R) transformation approach for image harmonization. The method involves three key components: Clear-VAE for preserving high-frequency details, Harmony Controller with Mask-aware Adaptive Channel Attention (MACA) for dynamic foreground adjustment, and Random Poisson Blending to generate diverse synthetic images. Additionally, a new synthetic dataset, RPHarmony, is constructed. Result: The experiments show that the proposed method outperforms other state-of-the-art methods in both quantitative metrics and visual harmony. The proposed RPHarmony dataset enhances the model's ability to generate realistic images in real-world examples. The code, dataset, and model weights are publicly available. Conclusion: The proposed R2R model and RPHarmony dataset demonstrate superior performance in image harmonization, achieving better results in both quantitative metrics and visual harmony compared to existing methods. The work contributes a new model, a synthetic dataset, and achieves realistic results on real examples. Abstract: The goal of image harmonization is to adjust the foreground in a composite image to achieve visual consistency with the background. Recently, latent diffusion models (LDMs) have been applied for harmonization, achieving remarkable results. However, LDM-based harmonization faces challenges in detail preservation and limited harmonization ability. Additionally, current synthetic datasets rely on color transfer, which lacks local variations and fails to capture complex real-world lighting conditions.
To enhance harmonization capabilities, we propose the Region-to-Region transformation. By injecting information from appropriate regions into the foreground, this approach preserves original details while achieving image harmonization or, conversely, generating new composite data. From this perspective, we propose a novel model, R2R. Specifically, we design Clear-VAE to preserve high-frequency details in the foreground using Adaptive Filter while eliminating disharmonious elements. To further enhance harmonization, we introduce the Harmony Controller with Mask-aware Adaptive Channel Attention (MACA), which dynamically adjusts the foreground based on the channel importance of both foreground and background regions. To address the limitation of existing datasets, we propose Random Poisson Blending, which transfers color and lighting information from a suitable region to the foreground, thereby generating more diverse and challenging synthetic images. Using this method, we construct a new synthetic dataset, RPHarmony. Experiments demonstrate the superiority of our method over other methods in both quantitative metrics and visual harmony. Moreover, our dataset helps the model generate more realistic images in real examples. Our code, dataset, and model weights have all been released for open access.

[135] MoIIE: Mixture of Intra- and Inter-Modality Experts for Large Vision Language Models

Dianyi Wang,Siyuan Wang,Zejun Li,Yikun Wang,Yitong Li,Duyu Tang,Xiaoyu Shen,Xuanjing Huang,Zhongyu Wei

Main category: cs.CV

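TL;DR: This paper proposes MoIIE, a mixture architecture combining intra-modality and inter-modality experts, to improve the efficiency and performance of multimodal models.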

Details

Motivation: Existing dense multimodal models are computationally expensive, and existing MoE architectures struggle to model intra-modality features and cross-modal associations effectively at the same time. Method: A new Mixture of Intra- and Inter-Modality Experts (MoIIE) architecture is proposed, together with a two-stage training strategy, to make multimodal feature learning more efficient. Result: Experiments across multiple data scales and LLM backbones demonstrate the effectiveness, efficiency, and generality of MoIIE. Conclusion: With fewer activated parameters, MoIIE models match or surpass existing advanced MoE-LLM-based multimodal models. Abstract: Large Vision-Language Models (LVLMs) have demonstrated remarkable performance across multi-modal tasks by scaling model size and training data. However, these dense LVLMs incur significant computational costs and motivate the exploration of sparse Mixture of Experts (MoE) architectures. While MoE improves parameter efficiency, effectively applying MoE to simultaneously model modality-specific features and cross-modal associations in LVLMs remains challenging. In this work, we propose to incorporate Mixture of Intra- and Inter-Modality Experts (MoIIE) into LVLMs. For each token, expert routing is guided by its modality, directing tokens to their respective intra-modality experts as well as a shared pool of inter-modality experts, enabling the model to jointly learn rich intra-modal features and cross-modal interactions. We further introduce an effective and straightforward two-stage training strategy, which facilitates the direct activation of both MoE and multi-modal capabilities. Extensive experiments across different data scales and LLM backbones demonstrate the effectiveness, efficiency and generality of our approach. Notably, our MoIIE models with 5.5B and 11.3B activated parameters match or even surpass the performance of existing advanced open-source MoE-LLMs based multi-modal models that involve more activated parameters. The code is available at https://github.com/AlenjandroWang/MoIIE.
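
Modality-guided routing, as described in the abstract, restricts each token to the experts of its own modality plus a shared inter-modality pool. The sketch below illustrates only that restriction. The expert names, the top-k selection, and the softmax renormalisation are all assumptions made for illustration; the abstract does not state how MoIIE's gate chooses or weights experts.

```python
import math

# Hypothetical expert layout: per-modality experts plus a shared pool.
INTRA_EXPERTS = {"text": ["text_0", "text_1"], "image": ["image_0", "image_1"]}
INTER_EXPERTS = ["shared_0", "shared_1"]

def candidate_experts(modality):
    """Experts a token of the given modality may be routed to."""
    return INTRA_EXPERTS[modality] + INTER_EXPERTS

def route_token(modality, router_scores, top_k=2):
    """Pick the top_k allowed experts and return softmax-normalised gate weights."""
    allowed = candidate_experts(modality)
    ranked = sorted(allowed, key=lambda name: router_scores[name], reverse=True)
    chosen = ranked[:top_k]
    peak = max(router_scores[name] for name in chosen)
    exps = {name: math.exp(router_scores[name] - peak) for name in chosen}
    norm = sum(exps.values())
    return {name: value / norm for name, value in exps.items()}

# Hypothetical router scores for one text token.
scores = {"text_0": 2.0, "text_1": 0.1, "image_0": 5.0,
          "image_1": 4.0, "shared_0": 1.5, "shared_1": -1.0}
gates = route_token("text", scores)
```

In this toy case the image experts have the highest raw scores, yet the text token can never reach them: only its own modality's experts and the shared pool are candidates.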

[136] Combinative Matching for Geometric Shape Assembly

Nahyuk Lee,Juhong Min,Junhong Lee,Chunghyun Park,Minsu Cho

Main category: cs.CV

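TL;DR: This paper proposes a new shape-matching method for geometric shape assembly that combines the properties of identical surface shape and opposite volume occupancy and uses equivariant neural networks to estimate shape orientations, significantly reducing local ambiguities in matching and enabling more robust part combination.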

Details

Motivation: Traditional geometric assembly methods align parts by finding identical surfaces between them; this paper aims to improve matching by explicitly modeling two distinct properties of interlocking shapes. Method: Introduces "combinative matching", which combines identical surface shape with opposite volume occupancy and uses equivariant neural networks to estimate shape orientations, reducing local ambiguities in matching. Result: Experiments on geometric assembly benchmarks show that the method consistently outperforms the state of the art. Conclusion: The paper proposes a new shape-matching method for combining interlocking parts in geometric shape assembly and validates experimentally that it outperforms existing techniques. Abstract: This paper introduces a new shape-matching methodology, combinative matching, to combine interlocking parts for geometric shape assembly. Previous methods for geometric assembly typically rely on aligning parts by finding identical surfaces between the parts as in conventional shape matching and registration. In contrast, we explicitly model two distinct properties of interlocking shapes: 'identical surface shape' and 'opposite volume occupancy.' Our method thus learns to establish correspondences across regions where their surface shapes appear identical but their volumes occupy the inverted space to each other. To facilitate this process, we also learn to align regions in rotation by estimating their shape orientations via equivariant neural networks. The proposed approach significantly reduces local ambiguities in matching and allows a robust combination of parts in assembly. Experimental results on geometric assembly benchmarks demonstrate the efficacy of our method, consistently outperforming the state of the art. Project page: https://nahyuklee.github.io/cmnet.

[137] DSS-Prompt: Dynamic-Static Synergistic Prompting for Few-Shot Class-Incremental Learning

Linpu He,Yanan Li,Bingze Li,Elvis Han Cui,Donghui Wang

Main category: cs.CV

TL;DR: This paper proposes DSS-Prompt, a prompt-based method for few-shot class-incremental learning that leverages both static and dynamic prompts to enhance adaptation and transferability, achieving superior performance without further training.

Details

Motivation: Few-shot class-incremental learning (FSCIL) is challenging, as it requires continual learning of new concepts from limited samples without forgetting previous ones. Despite the success of pre-trained models in various tasks, their application in FSCIL remains underexplored. This paper aims to address this gap. Method: DSS-Prompt introduces two types of prompts in each Transformer block: static prompts to bridge the domain gap, and dynamic prompts to capture instance-aware semantics. Dynamic prompts are generated using a pre-trained multi-modal model, and their importance is adaptively adjusted across layers. Result: Experiments on four benchmarks show that DSS-Prompt consistently outperforms existing approaches and effectively addresses catastrophic forgetting. Conclusion: DSS-Prompt is a simple and effective method for FSCIL, which achieves state-of-the-art performance without further training on incremental tasks and alleviates the catastrophic forgetting issue. Abstract: Learning from large-scale pre-trained models with strong generalization ability has shown remarkable success in a wide range of downstream tasks recently, but it is still underexplored in the challenging few-shot class-incremental learning (FSCIL) task. It aims to continually learn new concepts from limited training samples without forgetting the old ones at the same time. In this paper, we introduce DSS-Prompt, a simple yet effective approach that transforms the pre-trained Vision Transformer with minimal modifications in the way of prompts into a strong FSCIL classifier. Concretely, we synergistically utilize two complementary types of prompts in each Transformer block: static prompts to bridge the domain gap between the pre-training and downstream datasets, thus enabling better adaptation; and dynamic prompts to capture instance-aware semantics, thus enabling easy transfer from base to novel classes.
Specifically, to generate dynamic prompts, we leverage a pre-trained multi-modal model to extract input-related diverse semantics, thereby generating complementary input-aware prompts, and then adaptively adjust their importance across different layers. In this way, on top of the prompted visual embeddings, a simple prototype classifier can beat state-of-the-art methods without further training on the incremental tasks. We conduct extensive experiments on four benchmarks to validate the effectiveness of our DSS-Prompt and show that it consistently achieves better performance than existing approaches on all datasets and can alleviate the catastrophic forgetting issue as well.
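
A prototype classifier of the kind the abstract refers to can be sketched in a few lines: each class is represented by the mean of its few embedding vectors, and a new sample is assigned to the nearest prototype. The sketch below assumes cosine similarity and uses made-up two-dimensional features; the paper's embeddings, similarity measure, and class names are not given in the abstract.

```python
import math

def mean_vector(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]

def cosine(a, b):
    """Cosine similarity (0.0 if either vector is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def add_class(prototypes, label, features):
    """Register a class from a handful of embeddings; no gradient training involved."""
    prototypes[label] = mean_vector(features)

def classify(feature, prototypes):
    """Return the label whose prototype is most similar to the feature."""
    return max(prototypes, key=lambda label: cosine(feature, prototypes[label]))

# Hypothetical base session with two classes.
prototypes = {}
add_class(prototypes, "cat", [[1.0, 0.0], [0.8, 0.2]])
add_class(prototypes, "dog", [[0.0, 1.0], [0.1, 0.9]])
# Hypothetical incremental session: a new class from a single sample.
add_class(prototypes, "bird", [[-1.0, 0.0]])
```

Adding a class only stores one more mean vector and leaves the existing prototypes untouched, which is why this style of classifier is a natural fit when incremental sessions must avoid further training.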

[138] MeMoSORT: Memory-Assisted Filtering and Motion-Adaptive Association Metric for Multi-Person Tracking

Yingjie Wang,Zhixing Wang,Le Zheng,Tianxiao Liu,Roujing Li,Xueyao Hu

Main category: cs.CV

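TL;DR: This paper proposes MeMoSORT, a new multi-object tracking method that introduces a Memory-assisted Kalman filter and a Motion-adaptive IoU to address the limitations of conventional tracking-by-detection methods, reaching HOTA scores of 67.9% on DanceTrack and 82.1% on SportsMOT.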

Details

Motivation: Because of targets' complex motion and severe occlusions, multi-object tracking methods based on the conventional Kalman filter and rigid IoU association are limited, which motivates MeMoSORT. Method: MeMoSORT comprises a Memory-assisted Kalman filter (MeKF) and a Motion-adaptive IoU (Mo-IoU). Result: Experiments on DanceTrack and SportsMOT show that MeMoSORT reaches HOTA scores of 67.9% and 82.1%, respectively, achieving state-of-the-art performance. Conclusion: MeMoSORT is a simple, online, real-time multi-object tracker whose two key innovations, MeKF and Mo-IoU, address the limitations of conventional tracking-by-detection methods. Abstract: Multi-object tracking (MOT) in human-dominant scenarios, which involves continuously tracking multiple people within video sequences, remains a significant challenge in computer vision due to targets' complex motion and severe occlusions. Conventional tracking-by-detection methods are fundamentally limited by their reliance on Kalman filter (KF) and rigid Intersection over Union (IoU)-based association. The motion model in KF often mismatches real-world object dynamics, causing filtering errors, while rigid association struggles under occlusions, leading to identity switches or target loss. To address these issues, we propose MeMoSORT, a simple, online, and real-time MOT tracker with two key innovations. First, the Memory-assisted Kalman filter (MeKF) uses memory-augmented neural networks to compensate for mismatches between assumed and actual object motion. Second, the Motion-adaptive IoU (Mo-IoU) adaptively expands the matching space and incorporates height similarity to reduce the influence of detection errors and association failures, while remaining lightweight. Experiments on DanceTrack and SportsMOT show that MeMoSORT achieves state-of-the-art performance, with HOTA scores of 67.9% and 82.1%, respectively.
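
The two ingredients the abstract attributes to Mo-IoU, an expanded matching space and a height-similarity term, can be illustrated with a simple association score. The sketch below is not the paper's formula: the symmetric box expansion, the fixed `expand_ratio` argument (which the paper adapts to motion), and the multiplicative height term are all assumptions chosen to show the general idea.

```python
def expand_box(box, ratio):
    """Grow an (x1, y1, x2, y2) box about its center by the given ratio of its size."""
    x1, y1, x2, y2 = box
    dw = (x2 - x1) * ratio / 2
    dh = (y2 - y1) * ratio / 2
    return (x1 - dw, y1 - dh, x2 + dw, y2 + dh)

def iou(a, b):
    """Standard Intersection over Union of two boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def height_similarity(a, b):
    """Ratio of the smaller box height to the larger one, in [0, 1]."""
    ha, hb = a[3] - a[1], b[3] - b[1]
    return min(ha, hb) / max(ha, hb) if max(ha, hb) > 0 else 0.0

def expanded_iou_with_height(track_box, det_box, expand_ratio=0.0):
    """IoU of the expanded boxes, scaled by the height similarity of the originals."""
    score = iou(expand_box(track_box, expand_ratio), expand_box(det_box, expand_ratio))
    return score * height_similarity(track_box, det_box)

# Hypothetical track and detection that do not overlap at all.
track = (0.0, 0.0, 10.0, 20.0)
detection = (12.0, 0.0, 22.0, 20.0)
rigid = expanded_iou_with_height(track, detection, expand_ratio=0.0)
relaxed = expanded_iou_with_height(track, detection, expand_ratio=0.5)
```

With plain IoU the fast-moving target above scores zero and cannot be matched, whereas the expanded boxes overlap and yield a usable association score; the height term then penalises candidate pairs of very different sizes.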

[139] MUJICA: Reforming SISR Models for PBR Material Super-Resolution via Cross-Map Attention

Xin Du,Maoyuan Xu,Zhi Ying

Main category: cs.CV

TL;DR: MUJICA enhances PBR material upscaling by integrating cross-map attention with pre-trained SISR models, outperforming existing methods in both quality and consistency.

Details

Motivation: Existing SISR methods face challenges like cross-map inconsistency, poor modality-specific feature modeling, and limited generalization when applied to PBR materials, motivating the need for a more effective solution. Method: The authors propose MUJICA, a cross-map attention-based adapter that integrates with pre-trained Swin-transformer-based SISR models to enhance PBR material super-resolution. Result: MUJICA achieves state-of-the-art performance on PBR material datasets, improves efficiency in training, and works effectively with limited resources. Conclusion: MUJICA improves the upscaling of PBR materials by leveraging cross-map attention, outperforming existing methods in PSNR, SSIM, and LPIPS while maintaining cross-map consistency. Abstract: Physically Based Rendering (PBR) materials are typically characterized by multiple 2D texture maps such as basecolor, normal, metallic, and roughness which encode spatially-varying bi-directional reflectance distribution function (SVBRDF) parameters to model surface reflectance properties and microfacet interactions. Upscaling SVBRDF material is valuable for modern 3D graphics applications. However, existing Single Image Super-Resolution (SISR) methods struggle with cross-map inconsistency, inadequate modeling of modality-specific features, and limited generalization due to data distribution shifts. In this work, we propose Multi-modal Upscaling Joint Inference via Cross-map Attention (MUJICA), a flexible adapter that reforms pre-trained Swin-transformer-based SISR models for PBR material super-resolution. MUJICA is seamlessly attached after the pre-trained and frozen SISR backbone. It leverages cross-map attention to fuse features while preserving remarkable reconstruction ability of the pre-trained SISR model. Applied to SISR models such as SwinIR, DRCT, and HMANet, MUJICA improves PSNR, SSIM, and LPIPS scores while preserving cross-map consistency. 
Experiments demonstrate that MUJICA enables efficient training even with limited resources and delivers state-of-the-art performance on PBR material datasets.

[140] Automated Segmentation of Coronal Brain Tissue Slabs for 3D Neuropathology

Jonathan Williams Ramirez,Dina Zemlyanker,Lucas Deden-Binder,Rogeny Herisse,Erendira Garcia Pallares,Karthik Gopinath,Harshvardhan Gazula,Christopher Mount,Liana N. Kozanno,Michael S. Marshall,Theresa R. Connors,Matthew P. Frosch,Mark Montine,Derek H. Oakley,Christine L. Mac Donald,C. Dirk Keene,Bradley T. Hyman,Juan Eugenio Iglesias

Main category: cs.CV

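TL;DR: This paper presents a deep learning method for automatically segmenting brain tissue images, with performance approaching manual annotation, which can make brain tissue image analysis more efficient.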

Details

Motivation: Existing brain tissue image segmentation requires extensive and costly manual intervention, so an automated method is needed to improve efficiency. Method: A U-Net architecture is trained on 1,414 manually segmented images and 2,000 synthetic images, and the model is evaluated on a held-out test set. Result: On test data, the model achieves a median Dice score over 0.98, a mean surface distance under 0.4 mm, and a 95% Hausdorff distance under 1.60 mm. Conclusion: The paper presents a U-Net-based deep learning model that automatically segments brain tissue images at a level approaching manual annotation, and the tool is publicly available. Abstract: Advances in image registration and machine learning have recently enabled volumetric analysis of postmortem brain tissue from conventional photographs of coronal slabs, which are routinely collected in brain banks and neuropathology laboratories worldwide. One caveat of this methodology is the requirement of segmentation of the tissue from photographs, which currently requires costly manual intervention. In this article, we present a deep learning model to automate this process. The automatic segmentation tool relies on a U-Net architecture that was trained with a combination of (i) 1,414 manually segmented images of both fixed and fresh tissue, from specimens with varying diagnoses, photographed at two different sites; and (ii) 2,000 synthetic images with randomized contrast and corresponding masks generated from MRI scans for improved generalizability to unseen photographic setups. Automated model predictions on a subset of photographs not seen in training were analyzed to estimate performance compared to manual labels, including both inter- and intra-rater variability. Our model achieved a median Dice score over 0.98, mean surface distance under 0.4 mm, and 95% Hausdorff distance under 1.60 mm, which approaches inter-/intra-rater levels. Our tool is publicly available at surfer.nmr.mgh.harvard.edu/fswiki/PhotoTools.
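
The Dice score reported above measures the overlap between a predicted mask and a reference mask as twice the intersection divided by the sum of the two mask sizes. The sketch below computes it for small binary masks given as nested lists. The convention of returning 1.0 when both masks are empty is an assumption, and the toy masks are made up.

```python
def dice_score(pred, truth):
    """Dice coefficient for two same-shaped binary masks (lists of rows of 0/1)."""
    intersection = sum(p * t for pr, tr in zip(pred, truth) for p, t in zip(pr, tr))
    size_pred = sum(sum(row) for row in pred)
    size_truth = sum(sum(row) for row in truth)
    if size_pred + size_truth == 0:
        return 1.0  # Assumed convention: two empty masks agree perfectly.
    return 2.0 * intersection / (size_pred + size_truth)

# Hypothetical 2x2 masks: the prediction marks one extra pixel.
pred = [[1, 1],
        [0, 0]]
truth = [[1, 0],
         [0, 0]]
score = dice_score(pred, truth)
```

Here one pixel overlaps, the masks have sizes 2 and 1, and the score is 2 * 1 / 3, so a single stray pixel on a tiny mask already costs a third of the score; on full-resolution tissue photographs the same error would barely register.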

[141] TRACE: Learning 3D Gaussian Physical Dynamics from Multi-view Videos

Jinxi Li,Ziyang Song,Bo Yang

Main category: cs.CV

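TL;DR: This paper proposes TRACE, a framework that models 3D points as rigid particles and directly learns their dynamics, enabling the modeling and segmentation of complex dynamic scenes without human labels.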

Details

Motivation: Existing methods struggle to learn complex motion physics without sufficient labels, requiring additional labels such as object types or masks. Method: A framework named TRACE models 3D points as rigid particles and directly learns a translation-rotation dynamics system for each. Result: On multiple dynamic datasets, TRACE performs strongly on future frame extrapolation and can segment multiple objects or parts by clustering the learned physical parameters. Conclusion: By modeling each 3D point as a rigid particle with size and orientation, TRACE learns complex motion physics from dynamic multi-view videos without human labels. Abstract: In this paper, we aim to model 3D scene geometry, appearance, and physical information just from dynamic multi-view videos in the absence of any human labels. By leveraging physics-informed losses as soft constraints or integrating simple physics models into neural nets, existing works often fail to learn complex motion physics, or doing so requires additional labels such as object types or masks. We propose a new framework named TRACE to model the motion physics of complex dynamic 3D scenes. The key novelty of our method is that, by formulating each 3D point as a rigid particle with size and orientation in space, we directly learn a translation-rotation dynamics system for each particle, explicitly estimating a complete set of physical parameters to govern the particle's motion over time. Extensive experiments on three existing dynamic datasets and one newly created challenging synthetic dataset demonstrate the extraordinary performance of our method over baselines in the task of future frame extrapolation. A nice property of our framework is that multiple objects or parts can be easily segmented just by clustering the learned physical parameters.

[142] Poaching Hotspot Identification Using Satellite Imagery

Aryan Pandhi,Shrey Baid,Sanjali Jha

Main category: cs.CV

TL;DR: This paper proposes a Computer Vision Model using satellite imagery to identify elephant poaching hotspots in African countries, offering a more efficient and wide-reaching solution compared to manual tracking and traditional anti-poaching efforts.

Details

Motivation: Elephant poaching in African countries has been a longstanding issue, with African Forest Elephants listed as endangered and African Savannah Elephants as critically endangered. Poaching numbers are on the rise again, and anti-poaching efforts are primarily concentrated near towns, while most poaching occurs in deserted regions. This necessitates a more effective and wide-reaching solution. Method: The paper proposes the use of a Computer Vision Model combined with satellite imagery to locate geographic indicators of favorable poaching regions. Result: The paper highlights that the use of a Computer Vision Model combined with satellite imagery can survey large areas without disturbing local species or facing cross-border aviation restrictions, offering a promising tool for combating elephant poaching. Conclusion: The paper concludes that a Computer Vision Model is a viable solution for identifying elephant poaching hotspots in African countries, eliminating the need for manual tracking and allowing for efficient resource deployment. Abstract: Elephant Poaching in African countries has been a decade-old problem. So much so that African Forest Elephants are now listed as an endangered species, and African Savannah Elephants as critically endangered by the IUCN (International Union for Conservation of Nature). [1] Elephants are hunted primarily for their ivory tusks which caused many elephants to be born tuskless as a genetic modification for survival. [2] Data gathered by recent studies shows that though poaching methods remain the same, the poaching grounds are rather dynamic. Poachers have shifted to areas with less ranger patrols and several other factors like watering holes, seasons, altitude etc. cause constant shifts in poaching hotspot locations. [3] After a period of low poaching from 2000-2014, poaching numbers in African countries are now on the rise again -- WWF (World Wildlife Foundation) says there are 20,000 elephants poached annually [4]. 
In African countries, anti-poaching efforts are concentrated near towns, while a majority of poaching occurs in the deserted regions. All of these factors result in the need for a Computer Vision Model to identify poaching hotspots through locating the geographic indicators of favorable poaching regions. A CV model eliminates the need to manually track poachers and account for the environmental factors to deploy resources and its combination with satellite imagery allows us to survey large areas without disturbing local species or cross border aviation restrictions.

[143] Evolution of Low-Level and Texture Human-CLIP Alignment

Pablo Hernández-Cámara,Jose Manuel Jaén-Lorites,Jorge Vila-Tomás,Jesus Malo,Valero Laparra

Main category: cs.CV

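TL;DR: CLIP aligns with low-level human perception early in training, then shifts toward more abstract shape features, which improves robustness but reduces perceptual alignment.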

Details

Motivation: To investigate why CLIP's alignment with low-level human perception peaks early in training and then gradually declines. Method: The study analyzes how CLIP's correlation with human image quality assessments changes across training stages and examines two factors: shape-texture bias and the drop in classification accuracy under noise. Result: CLIP initially learns low-level visual features, which increases its noise sensitivity and texture bias; as training progresses, it shifts toward more abstract shape-based representations, improving noise robustness but reducing alignment with low-level perception. Conclusion: During training the model moves from low-level features toward higher-level shape features, which affects the balance between perceptual alignment and robustness. Abstract: During the training of multi-modal models like CLIP, we observed an intriguing phenomenon: the correlation with low-level human image quality assessments peaks in the early epochs before gradually declining. This study investigates this observation and seeks to understand its causes through two key factors: shape-texture bias alignment and classification accuracy drop under noise. Our findings suggest that CLIP initially learns low-level visual features, enhancing its alignment with low-level human perception but also increasing its sensitivity to noise and its texture bias. As training progresses, the model shifts toward more abstract shape-based representations, improving noise robustness but reducing alignment with low-level human perception. These results suggest that these factors share an underlying learning mechanism and provide new insights into optimizing the trade-off between perceptual alignment and robustness in vision-language models.

[144] ViMoNet: A Multimodal Vision-Language Framework for Human Behavior Understanding from Motion and Video

Rajan Das Gupta,Md Yeasin Rahat,Nafiz Fahad,Abir Ahmed,Liew Tze Hui

Main category: cs.CV

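TL;DR: ViMoNet combines motion and video data into a new framework for a more comprehensive understanding of human behavior and performs well across multiple tasks.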

Details

Motivation: Existing models focus on either motion data or video alone, whereas combining the two captures the nuances of human behavior more fully. Method: ViMoNet uses a joint training strategy that combines detailed motion-text data with broad video-text data, and introduces a new dataset, VIMOS, and a benchmark, ViMoNet-Bench. Result: ViMoNet outperforms existing methods on caption generation, motion understanding, and behavior interpretation. Conclusion: ViMoNet is an effective framework for understanding, describing, and reasoning about human behavior; by combining motion and video data it outperforms existing methods. Abstract: This study investigates how large language models (LLMs) can be used to understand human behavior using motion and video data. We think that mixing both types is essential to completely capture the nuanced movements and meanings of human actions, in contrast to recent models that simply concentrate on motion data or films. To address this, we provide ViMoNet, a straightforward yet effective framework for comprehending, characterizing, and deducing human action. ViMoNet employs a joint training strategy that leverages the advantages of two data types: detailed motion-text data, which is more exact, and generic video-text data, which is more comprehensive but less detailed. This aids in the model's acquisition of rich data regarding time and space in human behavior. Additionally, we provide a brand new dataset named VIMOS that contains a variety of films, motion sequences, instructions, and subtitles. We developed ViMoNet-Bench, a standardized benchmark with carefully labeled samples, to evaluate how well models understand human behavior. Our tests show that ViMoNet outperforms existing methods in caption generation, motion understanding, and behavior interpretation.

[145] Physical Autoregressive Model for Robotic Manipulation without Action Pretraining

Zijian Song,Sihan Qin,Tianshui Chen,Liang Lin,Guangrun Wang

Main category: cs.CV

TL;DR: This paper introduces a Physical Autoregressive Model (PAR) for robotic manipulation that leverages video generation models to achieve high performance without action pretraining.

Details

Motivation: The scarcity of manipulation data in robotics motivated the authors to explore the use of pretrained large models from other modalities, specifically video generation models. Method: The authors built upon autoregressive video generation models to create PAR, which combines frames and actions using physical tokens. The model uses a DiT-based de-tokenizer, causal mask with inverse kinematics, parallel training, and the KV-cache mechanism for improved performance and efficiency. Result: Experiments on the ManiSkill benchmark demonstrated that PAR achieved a 100% success rate on the PushCube task, matched the performance of action-pretrained baselines on other tasks, and accurately predicted future videos with aligned action trajectories. Conclusion: The study concludes that the proposed Physical Autoregressive Model (PAR) effectively transfers world knowledge from autoregressive video pretraining to robotic manipulation, achieving high performance on tasks without action pretraining. Abstract: The scarcity of manipulation data has motivated the use of pretrained large models from other modalities in robotics. In this work, we build upon autoregressive video generation models to propose a Physical Autoregressive Model (PAR), where physical tokens combine frames and actions to represent the joint evolution of the robot and its environment. PAR leverages the world knowledge embedded in video pretraining to understand physical dynamics without requiring action pretraining, enabling accurate video prediction and consistent action trajectories. It also adopts a DiT-based de-tokenizer to model frames and actions as continuous tokens, mitigating quantization errors and facilitating mutual enhancement. Furthermore, we incorporate a causal mask with inverse kinematics, parallel training, and the KV-cache mechanism to further improve performance and efficiency. 
Experiments on the ManiSkill benchmark show that PAR achieves a 100% success rate on the PushCube task, matches the performance of action-pretrained baselines on other tasks, and accurately predicts future videos with tightly aligned action trajectories. These findings underscore a promising direction for robotic manipulation by transferring world knowledge from autoregressive video pretraining.

[146] KonfAI: A Modular and Fully Configurable Framework for Deep Learning in Medical Imaging

Valentin Boussot,Jean-Louis Dillenseger

Main category: cs.CV

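TL;DR: KonfAI is a modular deep learning framework designed for medical imaging that simplifies workflow definition through YAML configuration files, supports advanced strategies and complex model training, and has performed well in practical tasks and challenges.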

Details

Motivation: KonfAI aims to use a declarative approach to improve the reproducibility, transparency, and experimental traceability of deep learning for medical imaging, while reducing development time and supporting advanced strategies and complex model training. Method: KonfAI builds on a modular, extensible architecture and uses structured YAML configuration files to define workflows declaratively. It supports advanced strategies such as patch-based learning, test-time augmentation, and model ensembling, as well as complex multi-model training setups. Result: KonfAI has been applied successfully to segmentation, registration, and image synthesis tasks in medical imaging and has contributed to top-ranking results in several international challenges. The framework is open source, so the community can use and extend it. Conclusion: KonfAI is a modular, extensible, and fully configurable deep learning framework designed for medical imaging tasks. It lets users define complete training, inference, and evaluation workflows through structured YAML configuration files, has been applied successfully to segmentation, registration, and image synthesis, and has achieved strong results in several international medical imaging challenges. Abstract: KonfAI is a modular, extensible, and fully configurable deep learning framework specifically designed for medical imaging tasks. It enables users to define complete training, inference, and evaluation workflows through structured YAML configuration files, without modifying the underlying code. This declarative approach enhances reproducibility, transparency, and experimental traceability while reducing development time. Beyond the capabilities of standard pipelines, KonfAI provides native abstractions for advanced strategies including patch-based learning, test-time augmentation, model ensembling, and direct access to intermediate feature representations for deep supervision. It also supports complex multi-model training setups such as generative adversarial architectures. Thanks to its modular and extensible architecture, KonfAI can easily accommodate custom models, loss functions, and data processing components. The framework has been successfully applied to segmentation, registration, and image synthesis tasks, and has contributed to top-ranking results in several international medical imaging challenges. KonfAI is open source and available at https://github.com/vboussot/KonfAI.

[147] Reverse Convolution and Its Applications to Image Restoration

Xuhong Huang,Shiqi Liu,Kai Zhang,Ying Tai,Jian Yang,Hui Zeng,Lei Zhang

Main category: cs.CV

TL;DR: This paper proposes a novel depthwise reverse convolution operator to effectively reverse convolution operations in neural networks, leading to the development of ConverseNet, which demonstrates strong performance in image restoration tasks.

Details

Motivation: The motivation is to develop a true reverse convolution operator, as transposed convolution does not effectively invert convolution mathematically. This gap in neural architecture design prompted the search for a more accurate and effective reverse operator. Method: A depthwise reverse convolution operator is proposed by formulating and solving a regularized least-squares optimization problem. The authors also construct a reverse convolution block combining layer normalization, 1×1 convolution, and GELU activation to form a Transformer-like structure. Result: The proposed reverse convolution operator demonstrates effectiveness in image restoration tasks such as Gaussian denoising, super-resolution, and deblurring when applied to models like ConverseNet. The operator is shown to be a viable replacement for traditional convolution and transposed convolution layers. Conclusion: The paper concludes that the proposed reverse convolution operator can effectively serve as a reverse operator for convolution, replacing conventional convolution and transposed convolution layers, and can be widely used in ConverseNet for image restoration tasks. Abstract: Convolution and transposed convolution are fundamental operators widely used in neural networks. However, transposed convolution (a.k.a. deconvolution) does not serve as a true inverse of convolution due to inherent differences in their mathematical formulations. To date, no reverse convolution operator has been established as a standard component in neural architectures. In this paper, we propose a novel depthwise reverse convolution operator as an initial attempt to effectively reverse depthwise convolution by formulating and solving a regularized least-squares optimization problem. We thoroughly investigate its kernel initialization, padding strategies, and other critical aspects to ensure its effective implementation. 
Building upon this operator, we further construct a reverse convolution block by combining it with layer normalization, 1×1 convolution, and GELU activation, forming a Transformer-like structure. The proposed operator and block can directly replace conventional convolution and transposed convolution layers in existing architectures, leading to the development of ConverseNet. Corresponding to typical image restoration models such as DnCNN, SRResNet and USRNet, we train three variants of ConverseNet for Gaussian denoising, super-resolution and deblurring, respectively. Extensive experiments demonstrate the effectiveness of the proposed reverse convolution operator as a basic building module. We hope this work could pave the way for developing new operators in deep model design and applications.
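
To make the idea of reversing a convolution through regularized least squares more concrete, here is a minimal 1D sketch. It assumes circular boundary conditions, a single channel, and the objective ||y - k ⊛ x||² + λ||x - x0||², which has a well-known closed-form solution in the Fourier domain. The objective, the names `reverse_conv` and `lam`, and the naive DFT are illustrative assumptions; the paper's operator (depthwise, 2D, with its own initialization and padding choices) may differ in its details.

```python
import cmath


def dft(values, inverse=False):
    """Naive O(n^2) discrete Fourier transform (sufficient for a small demo)."""
    n = len(values)
    sign = 1 if inverse else -1
    out = [
        sum(values[t] * cmath.exp(sign * 2j * cmath.pi * f * t / n) for t in range(n))
        for f in range(n)
    ]
    return [v / n for v in out] if inverse else out


def circular_conv(x, k):
    """Circular convolution of signal x with kernel k (same length)."""
    n = len(x)
    return [sum(k[j] * x[(i - j) % n] for j in range(n)) for i in range(n)]


def reverse_conv(y, k, x0, lam):
    """Minimize ||y - k (*) x||^2 + lam * ||x - x0||^2 in closed form.

    Per frequency: X = (conj(K) * Y + lam * X0) / (|K|^2 + lam).
    """
    Y, K, X0 = dft(y), dft(k), dft(x0)
    X = [
        (Kf.conjugate() * Yf + lam * X0f) / (abs(Kf) ** 2 + lam)
        for Yf, Kf, X0f in zip(Y, K, X0)
    ]
    return [v.real for v in dft(X, inverse=True)]


if __name__ == "__main__":
    x = [1.0, 2.0, 3.0, 4.0]
    k = [0.5, 0.3, 0.2, 0.0]  # hypothetical blur kernel, zero-padded to len(x)
    y = circular_conv(x, k)
    print(reverse_conv(y, k, x0=[0.0] * 4, lam=1e-9))
```

With a small `lam` and a kernel whose spectrum has no zeros, the output is close to the original signal; a larger `lam` pulls the solution toward `x0`, which is what keeps the problem well posed when the kernel suppresses some frequencies.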

[148] RayletDF: Raylet Distance Fields for Generalizable 3D Surface Reconstruction from Point Clouds or Gaussians

Shenxing Wei,Jinxi Li,Yafei Yang,Siyuan Zhou,Bo Yang

Main category: cs.CV

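TL;DR: This paper proposes RayletDF, an efficient 3D surface reconstruction method that uses a newly introduced raylet distance field to reconstruct high-quality surfaces from point clouds or 3D Gaussians and generalizes well across multiple datasets.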

Details

Motivation: Existing coordinate-based methods are often computationally expensive when rendering explicit surfaces; this paper aims for a more efficient 3D surface reconstruction method that predicts surface points directly, improving both computational efficiency and reconstruction accuracy. Method: The paper proposes RayletDF, which introduces a new technique called the raylet distance field to predict surface points directly from query rays. The pipeline has three key modules: a raylet feature extractor, a raylet distance field predictor, and a multi-raylet blender. Result: RayletDF is evaluated extensively on multiple public real-world datasets and shows superior performance in reconstructing surfaces from point clouds or 3D Gaussians, with strong generalization to unseen datasets. Conclusion: RayletDF generalizes exceptionally well, reconstructing accurate 3D surfaces on multiple real-world datasets and recovering surfaces on unseen datasets in a single forward pass. Abstract: In this paper, we present a generalizable method for 3D surface reconstruction from raw point clouds or pre-estimated 3D Gaussians by 3DGS from RGB images. Unlike existing coordinate-based methods which are often computationally intensive when rendering explicit surfaces, our proposed method, named RayletDF, introduces a new technique called raylet distance field, which aims to directly predict surface points from query rays. Our pipeline consists of three key modules: a raylet feature extractor, a raylet distance field predictor, and a multi-raylet blender. These components work together to extract fine-grained local geometric features, predict raylet distances, and aggregate multiple predictions to reconstruct precise surface points. We extensively evaluate our method on multiple public real-world datasets, demonstrating superior performance in surface reconstruction from point clouds or 3D Gaussians. Most notably, our method achieves exceptional generalization ability, successfully recovering 3D surfaces in a single-forward pass across unseen datasets in testing.

[149] Hierarchical Graph Attention Network for No-Reference Omnidirectional Image Quality Assessment

Hao Yang,Xu Zhang,Jiaqi Ma,Linwei Zhu,Yun Zhang,Huan Zhang

Main category: cs.CV

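TL;DR: This paper proposes a new omnidirectional image quality assessment method that uses graph neural networks to model relationships between viewports, handles locally non-uniform distortions, and performs well on large-scale databases.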

Details

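Motivation: Current omnidirectional image quality assessment methods struggle to evaluate locally non-uniform distortions, mainly because they model spatial variations in quality inadequately and lack feature representations that capture both local details and global context. Method: Fibonacci sphere sampling generates viewports with a well-structured topology, each represented as a graph node; multi-stage feature extraction networks produce high-dimensional node representations; a Graph Attention Network (GAT) and a graph transformer model local distortion variations among adjacent viewports and quality interactions across distant regions, respectively. Result: Experiments on two large-scale OIQA databases show that the method significantly outperforms existing approaches, particularly on complex spatial distortions. Conclusion: The study proposes a graph neural network-based framework for omnidirectional image quality assessment that models the structural relationships between viewports, addresses the evaluation of locally non-uniform distortions, and demonstrates superior performance and generalization on large-scale databases. Abstract: Current Omnidirectional Image Quality Assessment (OIQA) methods struggle to evaluate locally non-uniform distortions due to inadequate modeling of spatial variations in quality and ineffective feature representation capturing both local details and global context. To address this, we propose a graph neural network-based OIQA framework that explicitly models structural relationships between viewports to enhance perception of spatial distortion non-uniformity. Our approach employs Fibonacci sphere sampling to generate viewports with well-structured topology, representing each as a graph node. Multi-stage feature extraction networks then derive high-dimensional node representation. To holistically capture spatial dependencies, we integrate a Graph Attention Network (GAT) modeling fine-grained local distortion variations among adjacent viewports, and a graph transformer capturing long-range quality interactions across distant regions. Extensive experiments on two large-scale OIQA databases with complex spatial distortions demonstrate that our method significantly outperforms existing approaches, confirming its effectiveness and strong generalization capability.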
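
The Fibonacci sphere sampling mentioned above can be sketched as follows. This is the standard golden-angle construction for placing n nearly uniform points on a unit sphere; how the paper maps these points to viewports (field of view, number of viewports, graph edges) is not stated in the abstract, so the function names and the latitude/longitude conversion here are illustrative assumptions.

```python
import math


def fibonacci_sphere(n):
    """Return n approximately evenly spaced points (x, y, z) on the unit sphere."""
    golden_angle = math.pi * (3.0 - math.sqrt(5.0))
    points = []
    for i in range(n):
        # z runs from just below +1 to just above -1 in equal steps.
        z = 1.0 - (2.0 * i + 1.0) / n
        radius = math.sqrt(max(0.0, 1.0 - z * z))
        theta = golden_angle * i  # rotate by the golden angle at each step
        points.append((radius * math.cos(theta), radius * math.sin(theta), z))
    return points


def to_lat_lon(point):
    """Convert a unit vector to (latitude, longitude) in degrees."""
    x, y, z = point
    return math.degrees(math.asin(z)), math.degrees(math.atan2(y, x))


if __name__ == "__main__":
    # Hypothetical viewport count chosen only for the demo.
    for center in fibonacci_sphere(8):
        print(to_lat_lon(center))
```

Each returned direction could serve as a candidate viewport center, and a graph could then connect each center to its nearest neighbors on the sphere.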

[150] Enhancing Diffusion Face Generation with Contrastive Embeddings and SegFormer Guidance

Dhruvraj Singh Rawat,Enggen Sherpa,Rishikesan Kirupanantha,Tin Hoang

Main category: cs.CV

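TL;DR: This paper presents an improved diffusion-model approach that achieves more controllable, high-quality face generation on the small-scale CelebAMask-HQ dataset.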

Details

Motivation: To explore how diffusion models perform on face generation, and in particular to improve the controllability and semantic consistency of generated results when data is limited. Method: The study builds a small-scale benchmark on CelebAMask-HQ, compares UNet and DiT architectures for unconditional generation, and fine-tunes pretrained Stable Diffusion models with LoRA. Building on the multi-conditioning approach of Giambi and Lisanti, it adds an InfoNCE loss and a SegFormer segmentation encoder. Result: Adding the InfoNCE loss and the SegFormer segmentation encoder significantly improves the semantic alignment and controllability of attribute-guided generation. Conclusion: Contrastive embedding learning and advanced segmentation encoding effectively improve controllable face generation in limited data settings. Abstract: We present a benchmark of diffusion models for human face generation on a small-scale CelebAMask-HQ dataset, evaluating both unconditional and conditional pipelines. Our study compares UNet and DiT architectures for unconditional generation and explores LoRA-based fine-tuning of pretrained Stable Diffusion models as a separate experiment. Building on the multi-conditioning approach of Giambi and Lisanti, which uses both attribute vectors and segmentation masks, our main contribution is the integration of an InfoNCE loss for attribute embedding and the adoption of a SegFormer-based segmentation encoder. These enhancements improve the semantic alignment and controllability of attribute-guided synthesis. Our results highlight the effectiveness of contrastive embedding learning and advanced segmentation encoding for controlled face generation in limited data settings.
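
As a rough illustration of the InfoNCE loss named above, the sketch below computes it for a single anchor embedding with one positive and several negatives, using cosine similarity and a temperature. The function names, the temperature value of 0.1, and the choice of what counts as a positive pair are assumptions for illustration; the abstract does not specify how the authors pair attribute embeddings.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce(anchor, positive, negatives, temperature=0.1):
    """InfoNCE loss: -log( exp(s_pos) / (exp(s_pos) + sum_j exp(s_neg_j)) )."""
    logits = [cosine(anchor, positive) / temperature]
    logits += [cosine(anchor, neg) / temperature for neg in negatives]
    # Log-sum-exp with the maximum subtracted for numerical stability.
    peak = max(logits)
    log_sum = peak + math.log(sum(math.exp(value - peak) for value in logits))
    return log_sum - logits[0]


if __name__ == "__main__":
    anchor = [1.0, 0.0]
    print(info_nce(anchor, positive=[0.9, 0.1], negatives=[[0.0, 1.0], [-1.0, 0.0]]))
```

The loss shrinks as the anchor moves closer to its positive and farther from the negatives, which is the property that encourages well-separated attribute embeddings.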

[151] ARI3D: A Software for Interactive Quantification of Regions in X-Ray CT 3D Images

Jan Phillipp Albrecht,Jose R. A. Godinho,Christina Hübers,Deborah Schmidt

Main category: cs.CV

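TL;DR: This paper introduces ARI3D, a software tool for interactive region analysis of 3D X-ray CT images that aims to improve phase identification, account for the partial volume effect, raise the detection limit and accuracy of object quantification, and harmonize quantitative 3D analysis across scientific fields.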

Details

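Motivation: X-ray computed tomography (CT) is the main 3D technique for imaging the internal microstructures of materials. However, imaging artifacts inherent to the technique, such as beam hardening and the partial volume effect, make quantitative analysis of microstructures challenging, so a tool is needed to help users segment and classify them. Method: The paper proposes a software tool, ARI3D, for interactively analyzing regions in 3D X-ray CT images and guiding users through the steps of classifying and quantifying objects in a 3D image. Result: The authors developed ARI3D, which in 3D X-ray CT image analysis can 1) improve phase identification; 2) account for the partial volume effect; 3) increase the detection limit and accuracy of object quantification; and 4) harmonize quantitative 3D analysis that can be implemented in different fields of science. Conclusion: ARI3D is a software tool for interactive region analysis of 3D X-ray CT images that aims to improve phase identification, account for the partial volume effect, increase the detection limit and accuracy of object quantification, and harmonize quantitative 3D analysis across scientific fields. Abstract: X-ray computed tomography (CT) is the main 3D technique for imaging the internal microstructures of materials. Quantitative analysis of the microstructures is usually achieved by applying a sequence of steps that are implemented to the entire 3D image. This is challenged by various imaging artifacts inherent to the technique, e.g., beam hardening and partial volume. Consequently, the analysis requires users to make a number of decisions to segment and classify the microstructures based on the voxel gray-values. In this context, a software tool, here called ARI3D, is proposed to interactively analyze regions in three-dimensional X-ray CT images, assisting users through the various steps of a protocol designed to classify and quantify objects within regions of a three-dimensional image. ARI3D aims to 1) Improve phase identification; 2) Account for partial volume effect; 3) Increase the detection limit and accuracy of object quantification; and 4) Harmonize quantitative 3D analysis that can be implemented in different fields of science.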

[152] Do Vision Transformers See Like Humans? Evaluating their Perceptual Alignment

Pablo Hernández-Cámara,Jose Manuel Jaén-Lorites,Jorge Vila-Tomás,Valero Laparra,Jesus Malo

Main category: cs.CV

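TL;DR: Vision Transformers (ViTs) perform well in image recognition, but their alignment with human perception is affected by model size, dataset size, data augmentation, and regularization.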

Details

Motivation: Although Vision Transformers (ViTs) perform very well on image recognition tasks, their alignment with human perception remains largely unexplored. Method: Using the TID2013 dataset, the study systematically analyzes how model size, dataset size, data augmentation, and regularization affect the perceptual alignment of ViTs. Result: Larger models show lower perceptual alignment; increasing dataset diversity has little effect, while repeatedly exposing the model to the same images reduces alignment. Stronger data augmentation and regularization reduce alignment further, especially in models trained for many repeated cycles. Conclusion: The perceptual alignment of ViTs depends on model size, dataset size, data augmentation, and regularization, which points to a trade-off between model complexity, training strategy, and alignment with human perception. Abstract: Vision Transformers (ViTs) achieve remarkable performance in image recognition tasks, yet their alignment with human perception remains largely unexplored. This study systematically analyzes how model size, dataset size, data augmentation and regularization impact ViT perceptual alignment with human judgments on the TID2013 dataset. Our findings confirm that larger models exhibit lower perceptual alignment, consistent with previous works. Increasing dataset diversity has a minimal impact, but exposing models to the same images more times reduces alignment. Stronger data augmentation and regularization further decrease alignment, especially in models exposed to repeated training cycles. These results highlight a trade-off between model complexity, training strategies, and alignment with human perception, raising important considerations for applications requiring human-like visual understanding.

[153] OneVAE: Joint Discrete and Continuous Optimization Helps Discrete Video VAE Train Better

Yupeng Zhou,Zhen Li,Ziheng Ouyang,Yuming Chen,Ruoyi Du,Daquan Zhou,Bin Fu,Yihao Liu,Peng Gao,Ming-Ming Cheng,Qibin Hou

Main category: cs.CV

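TL;DR: This paper proposes OneVAE, a video encoding method that combines the strengths of discrete and continuous VAEs to improve encoding efficiency and quality, achieving strong performance on both discrete and continuous representations.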

Details

Motivation: Discrete video VAEs face challenges in training stability and reconstruction quality, while continuous VAEs are easier to train and perform better, so OneVAE is proposed to combine the strengths of both and improve video encoding efficiency. Method: Building on the priors of a continuous VAE, the method adopts an improved discretization mechanism, including a multi-token quantization mechanism and a structural change that strengthens first-frame reconstruction, and proposes a joint discrete-continuous optimization scheme. Result: OneVAE converges several times faster than training from scratch and achieves strong performance on both discrete and continuous representations, including a gain of nearly 1 dB in PSNR. Conclusion: The paper proposes OneVAE, which combines the strengths of discrete and continuous VAEs to encode video efficiently within a unified network while achieving competitive performance on both discrete and continuous representations. Abstract: Encoding videos into discrete tokens could align with text tokens to facilitate concise and unified multi-modal LLMs, yet introducing significant spatiotemporal compression compared to continuous video representation. Previous discrete video VAEs experienced unstable training, long training time, and degraded reconstruction quality. Given the easier training and superior performance of continuous VAEs, an intuitive idea is to enhance discrete video VAEs by leveraging continuous VAEs. After rethinking the intrinsic link between discrete and continuous representations, we found that FSQ could effectively preserve pre-trained continuous VAE priors compared to other quantization methods. By leveraging continuous VAE priors, it converges several times faster than training from scratch and achieves superior performance at convergence. Meanwhile, two structural improvements are proposed. First, inspired by how continuous VAEs enhance reconstruction via enlarged latent dimensions, we introduce a multi-token quantization mechanism, which achieves nearly a 1 dB improvement in PSNR without compromising the token compression ratio. Second, to tackle reconstruction challenges in high-compression video VAEs, we strengthen first-frame reconstruction, enabling the causal VAE to leverage this information in subsequent frames and markedly improving the performance of 4 x 16 x 16 discrete VAEs. Furthermore, we propose a joint discrete-continuous optimization scheme that unifies the two paradigms and, for the first time, achieves competitive performance on both continuous and discrete representations within a single network. 
We name our method OneVAE to reflect this connection.
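
The FSQ (finite scalar quantization) step referred to in the abstract can be illustrated with a short sketch: each latent dimension is squashed into a bounded range and rounded to one of a few integer levels, so the codebook is implicit rather than learned. The sketch below handles only odd level counts and omits the straight-through gradient used during training; the level configuration and function names are assumptions for illustration, and the paper's multi-token quantization is not shown.

```python
import math


def fsq_quantize(latent, levels):
    """Quantize each latent dimension to an integer in [-(L-1)/2, (L-1)/2].

    Assumes every entry of `levels` is odd, which keeps the grid symmetric.
    """
    codes = []
    for value, num_levels in zip(latent, levels):
        if num_levels % 2 == 0:
            raise ValueError("this sketch supports odd level counts only")
        half = num_levels // 2
        # tanh bounds the value to (-1, 1); scaling and rounding snaps it to the grid.
        codes.append(round(half * math.tanh(value)))
    return codes


def fsq_index(codes, levels):
    """Map per-dimension codes to a single token index (mixed-radix encoding)."""
    index, base = 0, 1
    for code, num_levels in zip(codes, levels):
        index += (code + num_levels // 2) * base
        base *= num_levels
    return index


if __name__ == "__main__":
    levels = [5, 5, 5]  # hypothetical configuration: 5 * 5 * 5 = 125 implicit codes
    codes = fsq_quantize([0.0, 10.0, -10.0], levels)
    print(codes, fsq_index(codes, levels))
```

Because the quantizer is just a bounded rounding of the continuous latent, it changes the pretrained continuous representation only slightly, which is one plausible reading of why the abstract says FSQ preserves continuous VAE priors.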

[154] HumanGenesis: Agent-Based Geometric and Generative Modeling for Synthetic Human Dynamics

Weiqi Li,Zehao Zhang,Liang Lin,Guangrun Wang

Main category: cs.CV

TL;DR: HumanGenesis integrates geometric modeling and generative capabilities to produce high-quality videos of human dynamics.

Details

Motivation: Existing methods are limited in geometric consistency and motion generalization, and stronger modeling and generative capabilities are needed. Method: The paper proposes the HumanGenesis framework, which comprises four collaborative agents: the Reconstructor performs 3D modeling, the Critique Agent improves reconstruction quality, the Pose Guider enables motion generalization, and the Video Harmonizer generates photorealistic video. Result: HumanGenesis performs strongly on text-guided synthesis, video reenactment, and novel-pose generalization, significantly improving expressiveness, geometric fidelity, and scene integration. Conclusion: HumanGenesis effectively addresses geometric inconsistency, coarse reconstruction, limited motion generalization, and scene inharmonization, achieving state-of-the-art performance. Abstract: Synthetic human dynamics aims to generate photorealistic videos of human subjects performing expressive, intention-driven motions. However, current approaches face two core challenges: (1) geometric inconsistency and coarse reconstruction, due to limited 3D modeling and detail preservation; and (2) motion generalization limitations and scene inharmonization, stemming from weak generative capabilities. To address these, we present HumanGenesis, a framework that integrates geometric and generative modeling through four collaborative agents: (1) Reconstructor builds 3D-consistent human-scene representations from monocular video using 3D Gaussian Splatting and deformation decomposition. (2) Critique Agent enhances reconstruction fidelity by identifying and refining poor regions via multi-round MLLM-based reflection. (3) Pose Guider enables motion generalization by generating expressive pose sequences using time-aware parametric encoders. (4) Video Harmonizer synthesizes photorealistic, coherent video via a hybrid rendering pipeline with diffusion, refining the Reconstructor through a Back-to-4D feedback loop. HumanGenesis achieves state-of-the-art performance on tasks including text-guided synthesis, video reenactment, and novel-pose generalization, significantly improving expressiveness, geometric fidelity, and scene integration.

[155] E-4DGS: High-Fidelity Dynamic Reconstruction from the Multi-view Event Cameras

Chaoran Feng,Zhenyu Tang,Wangbo Yu,Yatian Pang,Yian Zhao,Jianbin Zhao,Li Yuan,Yonghong Tian

Main category: cs.CV

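TL;DR: This paper proposes E-4DGS, an event-camera-based multi-view scene reconstruction method that addresses the limitations of conventional RGB cameras and reconstructs effectively in fast-motion and low-light scenes.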

Details

Motivation: Conventional RGB cameras have drawbacks for scene reconstruction, including dependence on lighting, motion blur, and limited dynamic range, whereas event cameras offer low power consumption, high temporal resolution, and high dynamic range. Method: The paper proposes an event-driven dynamic Gaussian Splatting approach (E-4DGS) that includes an event-based initialization scheme, event-adaptive slicing splatting, and intensity importance pruning. Result: On a multi-view event stream dataset, E-4DGS outperforms both event-only and event-RGB fusion methods and handles reconstruction in fast-motion and low-light scenes. Conclusion: By using event cameras, E-4DGS achieves effective multi-view scene reconstruction and offers a new approach to rapid scene capture. Abstract: Novel view synthesis and 4D reconstruction techniques predominantly rely on RGB cameras, thereby inheriting inherent limitations such as the dependence on adequate lighting, susceptibility to motion blur, and a limited dynamic range. Event cameras, offering advantages of low power, high temporal resolution and high dynamic range, have brought a new perspective to addressing the scene reconstruction challenges in high-speed motion and low-light scenes. To this end, we propose E-4DGS, the first event-driven dynamic Gaussian Splatting approach, for novel view synthesis from multi-view event streams with fast-moving cameras. Specifically, we introduce an event-based initialization scheme to ensure stable training and propose event-adaptive slicing splatting for time-aware reconstruction. Additionally, we employ intensity importance pruning to eliminate floating artifacts and enhance 3D consistency, while incorporating an adaptive contrast threshold for more precise optimization. We design a synthetic multi-view camera setup with six moving event cameras surrounding the object in a 360-degree configuration and provide a benchmark multi-view event stream dataset that captures challenging motion scenarios. Our approach outperforms both event-only and event-RGB fusion baselines and paves the way for the exploration of multi-view event-based reconstruction as a novel approach for rapid scene capture.

[156] SpeechForensics: Audio-Visual Speech Representation Learning for Face Forgery Detection

Yachao Liang,Min Yu,Gang Li,Jianguo Jiang,Boquan Li,Feng Yu,Ning Zhang,Xiang Meng,Weiqing Huang

Main category: cs.CV

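TL;DR: This paper proposes a forged-video detection method based on audio-visual speech representation learning that achieves better cross-dataset generalization and robustness without training on any fake videos.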

Details

Motivation: Audio signals carry rich speech content and can provide precise information that accurately reflects facial movements. Method: The approach learns audio-visual speech representations through a self-supervised masked prediction task and then applies the resulting model directly to forgery detection. Result: Without training on any fake videos, the proposed method outperforms state-of-the-art methods in cross-dataset generalization and robustness. Conclusion: Combining audio signals with visual speech elements effectively improves the detection of forged videos, especially in cross-dataset generalization and robustness. Abstract: Detection of face forgery videos remains a formidable challenge in the field of digital forensics, especially the generalization to unseen datasets and common perturbations. In this paper, we tackle this issue by leveraging the synergy between audio and visual speech elements, embarking on a novel approach through audio-visual speech representation learning. Our work is motivated by the finding that audio signals, enriched with speech content, can provide precise information effectively reflecting facial movements. To this end, we first learn precise audio-visual speech representations on real videos via a self-supervised masked prediction task, which encodes both local and global semantic information simultaneously. Then, the derived model is directly transferred to the forgery detection task. Extensive experiments demonstrate that our method outperforms the state-of-the-art methods in terms of cross-dataset generalization and robustness, without the participation of any fake video in model training. Code is available at https://github.com/Eleven4AI/SpeechForensics.

[157] Towards Comprehensive Cellular Characterisation of H&E slides

Benjamin Adjadj,Pierre-Antoine Bannier,Guillaume Horent,Sebastien Mandela,Aurore Lyon,Kathryn Schutte,Ulysse Marteau,Valentin Gaury,Laura Dumont,Thomas Mathieu,Reda Belbahri,Benoît Schmauch,Eric Durand,Katharina Von Loga,Lucie Gillet

Main category: cs.CV

TL;DR: HistoPLUS is a novel model for cell analysis that significantly outperforms current methods, particularly for understudied cell types, while being more efficient.

Details

Motivation: Existing methods for cell detection, segmentation, and classification suffer from poor performance on understudied cell types and limited cross-domain generalization. Method: HistoPLUS was trained on a curated pan-cancer dataset of 108,722 nuclei covering 13 cell types to improve performance on understudied cell types and cross-domain generalization. Result: In external validation across 4 independent cohorts, HistoPLUS improved detection quality by 5.2% and overall F1 classification score by 23.7%, while using 5x fewer parameters. It enables the study of 7 understudied cell types and improves performance on 8 of 13 cell types. Conclusion: HistoPLUS is a state-of-the-art model for cell analysis that outperforms existing methods in detection and classification while using fewer parameters. It enables the study of understudied cell types and demonstrates robust transferability to unseen oncology indications. Abstract: Cell detection, segmentation and classification are essential for analyzing tumor microenvironments (TME) on hematoxylin and eosin (H&E) slides. Existing methods suffer from poor performance on understudied cell types (rare or not present in public datasets) and limited cross-domain generalization. To address these shortcomings, we introduce HistoPLUS, a state-of-the-art model for cell analysis, trained on a novel curated pan-cancer dataset of 108,722 nuclei covering 13 cell types. In external validation across 4 independent cohorts, HistoPLUS outperforms current state-of-the-art models in detection quality by 5.2% and overall F1 classification score by 23.7%, while using 5x fewer parameters. Notably, HistoPLUS unlocks the study of 7 understudied cell types and brings significant improvements on 8 of 13 cell types. Moreover, we show that HistoPLUS robustly transfers to two oncology indications unseen during training. 
To support broader TME biomarker research, we release the model weights and inference code at https://github.com/owkin/histoplus/.

[158] Quo Vadis Handwritten Text Generation for Handwritten Text Recognition?

Vittorio Pippi,Konstantina Nikolaidou,Silvia Cascianelli,George Retsinas,Giorgos Sfikas,Rita Cucchiara,Marcus Liwicki

Main category: cs.CV

TL;DR: This paper evaluates the effectiveness of three HTG models in improving HTR performance for small, author-specific manuscript collections, offering insights into their impact and areas for future development.

Details

Motivation: The motivation stems from the challenges faced by HTR systems when dealing with small, author-specific manuscript collections that differ from training data distributions, and the need to evaluate how HTG models can address these issues. Method: The authors systematically compared three state-of-the-art styled HTG models (generative adversarial, diffusion, and autoregressive) to evaluate their impact on HTR fine-tuning, analyzing how synthetic data characteristics influence outcomes. Result: The analysis provides insights into the current capabilities of HTG methods, quantitatively guides model selection, and identifies key areas for further improvement in low-resource HTR applications. Conclusion: The study concludes that while HTG models offer promise for improving HTR performance in low-resource settings, their effectiveness varies based on the visual and linguistic qualities of the generated data, highlighting areas for future improvement. Abstract: The digitization of historical manuscripts presents significant challenges for Handwritten Text Recognition (HTR) systems, particularly when dealing with small, author-specific collections that diverge from the training data distributions. Handwritten Text Generation (HTG) techniques, which generate synthetic data tailored to specific handwriting styles, offer a promising solution to address these challenges. However, the effectiveness of various HTG models in enhancing HTR performance, especially in low-resource transcription settings, has not been thoroughly evaluated. In this work, we systematically compare three state-of-the-art styled HTG models (representing the generative adversarial, diffusion, and autoregressive paradigms for HTG) to assess their impact on HTR fine-tuning. We analyze how visual and linguistic characteristics of synthetic data influence fine-tuning outcomes and provide quantitative guidelines for selecting the most effective HTG model. 
The results of our analysis provide insights into the current capabilities of HTG methods and highlight key areas for further improvement in their application to low-resource HTR.

[159] AST-n: A Fast Sampling Approach for Low-Dose CT Reconstruction using Diffusion Models

Tomás de la Sotta,José M. Saavedra,Héctor Henríquez,Violeta Chang,Aline Xavier

Main category: cs.CV

TL;DR: This paper introduces AST-n, an accelerated diffusion model framework for low-dose CT denoising that drastically reduces inference time while maintaining image quality, making it more suitable for clinical use.

Details

Motivation: Low-dose CT (LDCT) reduces radiation exposure but increases image noise, affecting diagnostic confidence. The authors aim to accelerate diffusion-based generative models for LDCT denoising to make them more practical for clinical workflows. Method: The authors introduced AST-n, an accelerated inference framework that starts reverse diffusion from intermediate noise levels, and incorporated high-order ODE solvers in conditioned models to reduce sampling steps. They evaluated the method on the Low Dose CT Grand Challenge dataset using different acceleration paradigms and measured performance in terms of PSNR, SSIM, and inference time. Result: Conditioned models using only 25 steps (AST-25) achieved PSNR above 38 dB and SSIM above 0.95, significantly reducing inference time from ~16 seconds to under 1 second per slice. Unconditional sampling led to significant quality loss, emphasizing the importance of conditioning. DDIM inversion offered minor PSNR improvements but doubled inference time. Conclusion: The study concludes that AST-n with high-order samplers allows for fast LDCT reconstruction with minimal loss of image fidelity, making diffusion-based methods more feasible for clinical use. Abstract: Low-dose CT (LDCT) protocols reduce radiation exposure but increase image noise, compromising diagnostic confidence. Diffusion-based generative models have shown promise for LDCT denoising by learning image priors and performing iterative refinement. In this work, we introduce AST-n, an accelerated inference framework that initiates reverse diffusion from intermediate noise levels, and integrate high-order ODE solvers within conditioned models to further reduce sampling steps. We evaluate two acceleration paradigms (AST-n sampling and standard scheduling with high-order solvers) on the Low Dose CT Grand Challenge dataset, covering head, abdominal, and chest scans at 10-25% of standard dose. 
Conditioned models using only 25 steps (AST-25) achieve peak signal-to-noise ratio (PSNR) above 38 dB and structural similarity index (SSIM) above 0.95, closely matching standard baselines while cutting inference time from ~16 s to under 1 s per slice. Unconditional sampling suffers substantial quality loss, underscoring the necessity of conditioning. We also assess DDIM inversion, which yields marginal PSNR gains at the cost of doubling inference time, limiting its clinical practicality. Our results demonstrate that AST-n with high-order samplers enables rapid LDCT reconstruction without significant loss of image fidelity, advancing the feasibility of diffusion-based methods in clinical workflows.
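
For reference, the PSNR values quoted in this entry follow the standard definition, PSNR = 10 log10(MAX^2 / MSE). The sketch below is a minimal pure-Python illustration of that formula, not the authors' evaluation code; the function name `psnr`, the flat-list pixel layout, and the default `data_range` of 1.0 are assumptions made for the example.

```python
import math


def psnr(reference, estimate, data_range=1.0):
    """Peak signal-to-noise ratio in dB between two equally sized flat pixel sequences.

    data_range is the maximum possible pixel value (assumed 1.0 for normalized images).
    """
    if len(reference) != len(estimate) or len(reference) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    # Mean squared error over all pixels.
    mse = sum((r - e) ** 2 for r, e in zip(reference, estimate)) / len(reference)
    if mse == 0:
        return math.inf  # identical images: PSNR is unbounded
    return 10.0 * math.log10(data_range ** 2 / mse)
```

Under this definition, a uniform error of 0.1 on a [0, 1] image gives roughly 20 dB, so a figure above 38 dB corresponds to a much smaller mean squared error.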

[160] Stable Diffusion Models are Secretly Good at Visual In-Context Learning

Trevine Oorloff,Vishwanath Sindagi,Wele Gedara Chaminda Bandara,Ali Shafahi,Amin Ghiasi,Charan Prakash,Reza Ardekani

Main category: cs.CV

TL;DR: This paper demonstrates that Stable Diffusion models can be repurposed for visual in-context learning without fine-tuning, showing strong performance across multiple computer vision tasks.

Details

Motivation: The motivation was to explore visual in-context learning (V-ICL) for computer vision tasks by repurposing existing models without specialized training or additional data, aiming to simplify the process and improve generalizability. Method: The researchers formulated an in-place attention re-computation within the self-attention layers of the Stable Diffusion architecture, explicitly incorporating context between the query and example prompts. They tested the repurposed model across six different tasks and evaluated performance using metrics like mean intersection over union (mIoU). Result: The proposed approach improved the mIoU for foreground segmentation on the Pascal-5i dataset by 8.9% and 3.2% compared to methods like Visual Prompting and IMProv. The method also effectively leveraged multiple prompts through ensembling to enhance performance. Conclusion: The study concludes that off-the-shelf Stable Diffusion models can be effectively repurposed for visual in-context learning (V-ICL) without any additional fine-tuning, demonstrating adaptability across six different tasks and outperforming recent methods. Abstract: Large language models (LLM) in natural language processing (NLP) have demonstrated great potential for in-context learning (ICL) -- the ability to leverage a few sets of example prompts to adapt to various tasks without having to explicitly update the model weights. ICL has recently been explored for computer vision tasks with promising early outcomes. These approaches involve specialized training and/or additional data that complicate the process and limit its generalizability. In this work, we show that off-the-shelf Stable Diffusion models can be repurposed for visual in-context learning (V-ICL). Specifically, we formulate an in-place attention re-computation within the self-attention layers of the Stable Diffusion architecture that explicitly incorporates context between the query and example prompts. 
Without any additional fine-tuning, we show that this repurposed Stable Diffusion model is able to adapt to six different tasks: foreground segmentation, single object detection, semantic segmentation, keypoint detection, edge detection, and colorization. For example, the proposed approach improves the mean intersection over union (mIoU) for the foreground segmentation task on the Pascal-5i dataset by 8.9% and 3.2% over recent methods such as Visual Prompting and IMProv, respectively. Additionally, we show that the proposed method is able to effectively leverage multiple prompts through ensembling to infer the task better and further improve the performance.
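
The mIoU metric cited above can be illustrated with a short sketch. This is a generic pure-Python version for binary foreground masks, not the paper's evaluation protocol; the helper names `iou` and `mean_iou`, the flat 0/1 mask layout, and the convention of scoring an empty union as 1.0 are assumptions made for the example.

```python
def iou(pred, target):
    """Intersection over union of two flat binary masks of equal length."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same length")
    inter = sum(1 for p, t in zip(pred, target) if p and t)
    union = sum(1 for p, t in zip(pred, target) if p or t)
    # Assumed convention: two empty masks count as a perfect match.
    return 1.0 if union == 0 else inter / union


def mean_iou(pairs):
    """Average IoU over a non-empty list of (prediction, target) mask pairs."""
    scores = [iou(p, t) for p, t in pairs]
    return sum(scores) / len(scores)
```

Benchmarks differ in whether the mean is taken over images or over classes, so the averaging in `mean_iou` should be read as one possible choice.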

[161] LIA-X: Interpretable Latent Portrait Animator

Yaohui Wang,Di Yang,Xinyuan Chen,Francois Bremond,Yu Qiao,Antitza Dantcheva

Main category: cs.CV

TL;DR: LIA-X is an interpretable and scalable portrait animator that enables precise, fine-grained control over facial dynamics transfer, achieving superior performance and supporting practical applications like user-guided editing.

Details

Motivation: The motivation is to improve control and interpretability in facial dynamics transfer, narrowing differences in pose and expression between source and driving videos while overcoming limitations of previous 'warp-render' approaches. Method: LIA-X uses an autoencoder framework with a Sparse Motion Dictionary to model motion transfer as linear navigation in latent space, enabling disentanglement of facial dynamics into interpretable factors and implementing an 'edit-warp-render' strategy for precise control. Result: LIA-X outperforms prior methods in self-reenactment and cross-reenactment tasks, supports fine-grained facial manipulation, and scales successfully to a 1-billion-parameter model trained on large datasets. Conclusion: LIA-X is a scalable and interpretable portrait animator that enables fine-grained control over facial dynamics transfer, outperforming previous methods and supporting practical applications like user-guided editing and 3D-aware manipulation. Abstract: We introduce LIA-X, a novel interpretable portrait animator designed to transfer facial dynamics from a driving video to a source portrait with fine-grained control. LIA-X is an autoencoder that models motion transfer as a linear navigation of motion codes in latent space. Crucially, it incorporates a novel Sparse Motion Dictionary that enables the model to disentangle facial dynamics into interpretable factors. Deviating from previous 'warp-render' approaches, the interpretability of the Sparse Motion Dictionary allows LIA-X to support a highly controllable 'edit-warp-render' strategy, enabling precise manipulation of fine-grained facial semantics in the source portrait. This helps to narrow initial differences with the driving video in terms of pose and expression. Moreover, we demonstrate the scalability of LIA-X by successfully training a large-scale model with approximately 1 billion parameters on extensive datasets. 
Experimental results show that our proposed method outperforms previous approaches in both self-reenactment and cross-reenactment tasks across several benchmarks. Additionally, the interpretable and controllable nature of LIA-X supports practical applications such as fine-grained, user-guided image and video editing, as well as 3D-aware portrait video manipulation.
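
One way to picture "linear navigation of motion codes in latent space" is as adding a weighted sum of dictionary directions to a source code. The sketch below is a toy illustration under that assumption and is not LIA-X's implementation; the function name `navigate`, the list-based vectors, and the skipping of zero magnitudes to mimic sparsity are all hypothetical.

```python
def navigate(source_code, dictionary, magnitudes):
    """Return source_code + sum_i magnitudes[i] * dictionary[i].

    source_code: list of floats (latent code of the source portrait)
    dictionary:  list of direction vectors, each the same length as source_code
    magnitudes:  one scalar per direction; zeros leave that direction unused
    """
    if len(dictionary) != len(magnitudes):
        raise ValueError("need exactly one magnitude per dictionary direction")
    out = list(source_code)
    for direction, a in zip(dictionary, magnitudes):
        if len(direction) != len(out):
            raise ValueError("direction and code dimensions differ")
        if a == 0.0:
            continue  # inactive direction: no contribution
        for k, d in enumerate(direction):
            out[k] += a * d
    return out
```

In this picture, an "edit" step would amount to changing a few magnitudes by hand before warping and rendering, which is why interpretable directions matter.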

[162] January Food Benchmark (JFB): A Public Benchmark Dataset and Evaluation Suite for Multimodal Food Analysis

Amir Hosseinian,Ashkan Dehghani Zahedani,Umer Mansoor,Noosheen Hashemi,Mark Woodward

Main category: cs.CV

TL;DR: This paper introduces a new food image benchmark dataset and evaluation framework, demonstrating significant performance improvements with a specialized model for automated nutritional analysis.

Details

Motivation: The lack of standardized evaluation methodologies and high-quality, real-world benchmark datasets hampers progress in AI for automated nutritional analysis. Method: The authors introduced the January Food Benchmark (JFB), a comprehensive benchmarking framework with robust metrics, and conducted evaluations using general-purpose Vision-Language Models (VLMs) and their specialized model january/food-vision-v1. Result: The specialized model january/food-vision-v1 achieved an Overall Score of 86.2, a 12.1-point improvement over the best-performing general-purpose configuration. Conclusion: This work provides a new evaluation dataset and framework for guiding and benchmarking future developments in automated nutritional analysis. Abstract: Progress in AI for automated nutritional analysis is critically hampered by the lack of standardized evaluation methodologies and high-quality, real-world benchmark datasets. To address this, we introduce three primary contributions. First, we present the January Food Benchmark (JFB), a publicly available collection of 1,000 food images with human-validated annotations. Second, we detail a comprehensive benchmarking framework, including robust metrics and a novel, application-oriented overall score designed to assess model performance holistically. Third, we provide baseline results from both general-purpose Vision-Language Models (VLMs) and our own specialized model, january/food-vision-v1. Our evaluation demonstrates that the specialized model achieves an Overall Score of 86.2, a 12.1-point improvement over the best-performing general-purpose configuration. This work offers the research community a valuable new evaluation dataset and a rigorous framework to guide and benchmark future developments in automated nutritional analysis.

[163] MOC: Meta-Optimized Classifier for Few-Shot Whole Slide Image Classification

Tianqi Xiang,Yi Li,Qixiang Zhang,Xiaomeng Li

Main category: cs.CV

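TL;DR: This paper proposes the Meta-Optimized Classifier (MOC), a new method that combines multiple classifier configurations to improve whole slide image classification with limited data.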

Details

Motivation: Existing few-shot methods improve diagnostic accuracy with limited annotations, but their reliance on conventional classifier designs makes them vulnerable to data scarcity. Method: MOC comprises two core components: (1) a meta-learner that automatically optimizes a classifier configuration from a mixture of candidate classifiers; and (2) a classifier bank that provides diverse candidate classifiers for a holistic pathological interpretation. Result: Experiments show that MOC outperforms prior art on multiple few-shot benchmarks. On the TCGA-NSCLC benchmark, MOC improves AUC by 10.4% over state-of-the-art few-shot VLFM-based methods, with gains of up to 26.25% under 1-shot conditions. Conclusion: MOC offers a critical advancement for clinical deployments where diagnostic training data is severely limited. Abstract: Recent advances in histopathology vision-language foundation models (VLFMs) have shown promise in addressing data scarcity for whole slide image (WSI) classification via zero-shot adaptation. However, these methods remain outperformed by conventional multiple instance learning (MIL) approaches trained on large datasets, motivating recent efforts to enhance VLFM-based WSI classification through few-shot learning paradigms. While existing few-shot methods improve diagnostic accuracy with limited annotations, their reliance on conventional classifier designs introduces critical vulnerabilities to data scarcity. To address this problem, we propose a Meta-Optimized Classifier (MOC) comprising two core components: (1) a meta-learner that automatically optimizes a classifier configuration from a mixture of candidate classifiers and (2) a classifier bank housing diverse candidate classifiers to enable a holistic pathological interpretation. Extensive experiments demonstrate that MOC outperforms prior art on multiple few-shot benchmarks. Notably, on the TCGA-NSCLC benchmark, MOC improves AUC by 10.4% over the state-of-the-art few-shot VLFM-based methods, with gains up to 26.25% under 1-shot conditions, offering a critical advancement for clinical deployments where diagnostic training data is severely limited. Code is available at https://github.com/xmed-lab/MOC.

[164] PERSONA: Personalized Whole-Body 3D Avatar with Pose-Driven Deformations from a Single Image

Geonhee Sim,Gyeongsik Moon

Main category: cs.CV

TL;DR: PERSONA creates personalized 3D human avatars with realistic pose-driven deformations using only a single input image, combining the strengths of 3D- and diffusion-based methods.

Details

Motivation: The motivation is to overcome the limitations of existing methods for creating animatable human avatars, particularly the need for pose-rich videos in 3D-based approaches and the struggle with identity preservation in diffusion-based approaches. Method: PERSONA uses a diffusion-based approach to generate pose-rich videos from a single input image and then optimizes a 3D avatar using balanced sampling and geometry-weighted optimization to preserve identity and rendering quality. Result: PERSONA achieves personalized 3D human avatars with pose-driven deformations from a single image, demonstrating high authenticity and sharp renderings across various poses. Conclusion: PERSONA successfully integrates 3D-based and diffusion-based approaches to create a personalized 3D human avatar with pose-driven deformations from a single image, ensuring high authenticity and sharp renderings across diverse poses. Abstract: Two major approaches exist for creating animatable human avatars. The first, a 3D-based approach, optimizes a NeRF- or 3DGS-based avatar from videos of a single person, achieving personalization through a disentangled identity representation. However, modeling pose-driven deformations, such as non-rigid cloth deformations, requires numerous pose-rich videos, which are costly and impractical to capture in daily life. The second, a diffusion-based approach, learns pose-driven deformations from large-scale in-the-wild videos but struggles with identity preservation and pose-dependent identity entanglement. We present PERSONA, a framework that combines the strengths of both approaches to obtain a personalized 3D human avatar with pose-driven deformations from a single image. PERSONA leverages a diffusion-based approach to generate pose-rich videos from the input image and optimizes a 3D avatar based on them. To ensure high authenticity and sharp renderings across diverse poses, we introduce balanced sampling and geometry-weighted optimization. 
Balanced sampling oversamples the input image to mitigate identity shifts in diffusion-generated training videos. Geometry-weighted optimization prioritizes geometry constraints over image loss, preserving rendering quality in diverse poses.
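
The idea of oversampling the input image can be sketched as weighted sampling over training frames. This is a minimal illustration, not PERSONA's training code; the function names, the convention that the first frame is the real input image, and the default `input_fraction` of 0.5 are assumptions, since the abstract does not state the sampling ratio.

```python
import random


def balanced_weights(num_generated, input_fraction=0.5):
    """Sampling weights for [input_image] + generated frames.

    The input image is drawn with probability input_fraction; the remaining
    probability is split evenly among the diffusion-generated frames.
    """
    if num_generated < 1:
        raise ValueError("need at least one generated frame")
    if not 0.0 < input_fraction < 1.0:
        raise ValueError("input_fraction must lie strictly between 0 and 1")
    per_generated = (1.0 - input_fraction) / num_generated
    return [input_fraction] + [per_generated] * num_generated


def sample_batch(frames, batch_size, input_fraction=0.5, rng=None):
    """Draw a training batch; frames[0] is assumed to be the real input image."""
    rng = rng or random.Random()
    weights = balanced_weights(len(frames) - 1, input_fraction)
    return rng.choices(frames, weights=weights, k=batch_size)
```

With uniform sampling, a single input image among many generated frames would rarely be seen; raising its weight keeps the one identity-faithful observation prominent during optimization.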

[165] A Survey on 3D Gaussian Splatting Applications: Segmentation, Editing, and Generation

Shuting He,Peilin Ji,Yitong Yang,Changshuo Wang,Jiayi Ji,Yinglin Wang,Henghui Ding

Main category: cs.CV

TL;DR: This paper surveys recent advancements in 3D Gaussian Splatting (3DGS), highlighting its potential as a high-fidelity, real-time alternative to NeRF for 3D scene representation. It reviews applications in segmentation, editing, and generation, identifies design principles and trends, and provides a resource repository for ongoing research.

Details

Motivation: The motivation behind the paper is to provide a comprehensive overview of recent progress in 3D Gaussian Splatting (3DGS) applications. It aims to highlight the potential of 3DGS as a powerful alternative to Neural Radiance Fields (NeRF) for 3D scene representation, especially given its explicit, compact nature and suitability for real-time, photorealistic rendering. Method: The paper presents a survey methodology, systematically reviewing and categorizing recent advancements in 3D Gaussian Splatting (3DGS) applications. It introduces 2D foundation models and NeRF-based methods, organizes 3DGS applications into categories such as segmentation, editing, and generation, and analyzes datasets, evaluation protocols, and benchmark comparisons. Result: The paper results in a structured survey of 3D Gaussian Splatting (3DGS) applications, categorizing them into segmentation, editing, generation, and other functional tasks. It identifies representative methods, supervision strategies, and learning paradigms, along with shared design principles and emerging trends. Additionally, it provides a summary of datasets, evaluation protocols, and benchmark comparisons, supported by a maintained repository for ongoing research. Conclusion: The paper concludes that 3D Gaussian Splatting is a promising approach for high-fidelity, real-time 3D scene representation with broad applications in segmentation, editing, and generation tasks. It emphasizes the importance of continued research and development in this area. Abstract: 3D Gaussian Splatting (3DGS) has recently emerged as a powerful alternative to Neural Radiance Fields (NeRF) for 3D scene representation, offering high-fidelity photorealistic rendering with real-time performance. Beyond novel view synthesis, the explicit and compact nature of 3DGS enables a wide range of downstream applications that require geometric and semantic understanding. 
This survey provides a comprehensive overview of recent progress in 3DGS applications. It first introduces 2D foundation models that support semantic understanding and control in 3DGS applications, followed by a review of NeRF-based methods that inform their 3DGS counterparts. We then categorize 3DGS applications into segmentation, editing, generation, and other functional tasks. For each, we summarize representative methods, supervision strategies, and learning paradigms, highlighting shared design principles and emerging trends. Commonly used datasets and evaluation protocols are also summarized, along with comparative analyses of recent methods across public benchmarks. To support ongoing research and development, a continually updated repository of papers, code, and resources is maintained at https://github.com/heshuting555/Awesome-3DGS-Applications.

[166] LLMC+: Benchmarking Vision-Language Model Compression with a Plug-and-play Toolkit

Chengtao Lv,Bilang Zhang,Yang Yong,Ruihao Gong,Yushi Huang,Shiqiao Gu,Jiajun Wu,Yumeng Shi,Jinyang Guo,Wenya Wang

Main category: cs.CV

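TL;DR: This paper presents LLMC+, a comprehensive vision-language model (VLM) compression benchmark and toolkit that systematically studies token-level and model-level compression, addressing the shortcomings of existing methods in modularity, evaluation scope, and combined compression.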

Details

Motivation: Existing VLM compression methods have three major limitations: techniques are not decomposed into comparable modules, evaluation is confined to simple tasks, and individual compression techniques are used in isolation. A more comprehensive and systematic approach to studying VLM compression is therefore needed. Method: The authors propose LLMC+, a comprehensive VLM compression benchmark with a versatile toolkit that supports over 20 algorithms across five representative VLM families, enabling systematic study of token-level and model-level compression. Result: LLMC+ reveals that spatial and temporal redundancies demand distinct technical strategies, that token reduction methods degrade significantly in multi-turn dialogue and detail-sensitive tasks, and that combining token and model compression achieves efficient compression. Conclusion: LLMC+ achieves extreme compression with minimal performance loss, while facilitating fair evaluation and inspiring future research on efficient VLMs. Abstract: Large Vision-Language Models (VLMs) exhibit impressive multi-modal capabilities but suffer from prohibitive computational and memory demands, due to their long visual token sequences and massive parameter sizes. To address these issues, recent works have proposed training-free compression methods. However, existing efforts often suffer from three major limitations: (1) Current approaches do not decompose techniques into comparable modules, hindering fair evaluation across spatial and temporal redundancy. (2) Evaluation confined to simple single-turn tasks, failing to reflect performance in realistic scenarios. (3) Isolated use of individual compression techniques, without exploring their joint potential. To overcome these gaps, we introduce LLMC+, a comprehensive VLM compression benchmark with a versatile, plug-and-play toolkit. LLMC+ supports over 20 algorithms across five representative VLM families and enables systematic study of token-level and model-level compression. Our benchmark reveals that: (1) Spatial and temporal redundancies demand distinct technical strategies. (2) Token reduction methods degrade significantly in multi-turn dialogue and detail-sensitive tasks. (3) Combining token and model compression achieves extreme compression with minimal performance loss. We believe LLMC+ will facilitate fair evaluation and inspire future research in efficient VLM. Our code is available at https://github.com/ModelTC/LightCompress.

[167] Story2Board: A Training-Free Approach for Expressive Storyboard Generation

David Dinkevich,Matan Levy,Omri Avrahami,Dvir Samuel,Dani Lischinski

Main category: cs.CV

TL;DR: This paper presents Story2Board, a training-free framework for generating expressive storyboards from natural language.

Details

Motivation: Existing methods focus too narrowly on subject identity, overlooking key aspects of visual storytelling such as spatial composition, background evolution, and narrative pacing. Method: The authors introduce a lightweight consistency framework comprising Latent Panel Anchoring and Reciprocal Attention Value Mixing. An off-the-shelf language model converts free-form stories into panel-level prompts, and a new Scene Diversity metric is proposed. Result: Qualitative and quantitative results, together with a user study, show that Story2Board produces more dynamic, coherent, and narratively engaging storyboards. Conclusion: Story2Board is an effective training-free framework that generates visually diverse yet consistent storyboards without architectural changes or fine-tuning. Abstract: We present Story2Board, a training-free framework for expressive storyboard generation from natural language. Existing methods narrowly focus on subject identity, overlooking key aspects of visual storytelling such as spatial composition, background evolution, and narrative pacing. To address this, we introduce a lightweight consistency framework composed of two components: Latent Panel Anchoring, which preserves a shared character reference across panels, and Reciprocal Attention Value Mixing, which softly blends visual features between token pairs with strong reciprocal attention. Together, these mechanisms enhance coherence without architectural changes or fine-tuning, enabling state-of-the-art diffusion models to generate visually diverse yet consistent storyboards. To structure generation, we use an off-the-shelf language model to convert free-form stories into grounded panel-level prompts. To evaluate, we propose the Rich Storyboard Benchmark, a suite of open-domain narratives designed to assess layout diversity and background-grounded storytelling, in addition to consistency. We also introduce a new Scene Diversity metric that quantifies spatial and pose variation across storyboards. Our qualitative and quantitative results, as well as a user study, show that Story2Board produces more dynamic, coherent, and narratively engaging storyboards than existing baselines.

Table of Contents

cs.CL [Back]

[1] ParallelSearch: Train your LLMs to Decompose Query and Search Sub-queries in Parallel with Reinforcement Learning

[2] Leveraging Large Language Models for Rare Disease Named Entity Recognition

[3] TEN: Table Explicitization, Neurosymbolically

[4] Decoding Neural Emotion Patterns through Natural Language Processing Embeddings

[5] The Human-AI Hybrid Delphi Model: A Structured Framework for Context-Rich, Expert Consensus in Complex Domains

[6] Flow-SLM: Joint Learning of Linguistic and Acoustic Information for Spoken Language Modeling

[7] APIO: Automatic Prompt Induction and Optimization for Grammatical Error Correction and Text Simplification

[8] Columbo: Expanding Abbreviated Column Names for Tabular Data Using Large Language Models

[9] Leveraging Zipformer Model for Effective Language Identification in Code-Switched Child-Directed Speech

[10] From Charts to Fair Narratives: Uncovering and Mitigating Geo-Economic Biases in Chart-to-Text

[11] User-centric Subjective Leaderboard by Customizable Reward Modeling

[12] Learning Facts at Scale with Active Reading

[13] From Ranking to Selection: A Simple but Efficient Dynamic Passage Selector for Retrieval Augmented Generation

[14] LACA: Improving Cross-lingual Aspect-Based Sentiment Analysis with LLM Data Augmentation

[15] Cross-lingual Aspect-Based Sentiment Analysis: A Survey on Tasks, Approaches, and Challenges

[16] UWBa at SemEval-2025 Task 7: Multilingual and Crosslingual Fact-Checked Claim Retrieval

[17] COMPEER: Controllable Empathetic Reinforcement Reasoning for Emotional Support Conversation

[18] The Surprising Effectiveness of Membership Inference with Simple N-Gram Coverage

[19] AINL-Eval 2025 Shared Task: Detection of AI-Generated Scientific Abstracts in Russian

[20] Improving Diversity in Language Models: When Temperature Fails, Change the Loss

[21] EffiEval: Efficient and Generalizable Model Evaluation via Capability Coverage Maximization

[22] Slow Tuning and Low-Entropy Masking for Safe Chain-of-Thought Distillation

[23] Evaluating the Role of Large Language Models in Legal Practice in India

[24] The Perils of Chart Deception: How Misleading Visualizations Affect Vision-Language Models

[25] Sample More to Think Less: Group Filtered Policy Optimization for Concise Reasoning

[26] Transforming Questions and Documents for Semantically Aligned Retrieval-Augmented Generation

[27] Echoes of Agreement: Argument Driven Opinion Shifts in Large Language Models

[28] UtterTune: LoRA-Based Target-Language Pronunciation Edit and Control in Multilingual Text-to-Speech

[29] Can LLM-Generated Textual Explanations Enhance Model Classification Performance? An Empirical Study

[30] Adoption of Explainable Natural Language Processing: Perspectives from Industry and Academia on Practices and Challenges

[31] BigCharts-R1: Enhanced Chart Reasoning with Visual Reinforcement Finetuning

[32] A Comprehensive Survey of Datasets for Clinical Mental Health AI Systems

[33] Speed Always Wins: A Survey on Efficient Architectures for Large Language Models

[34] PRELUDE: A Benchmark Designed to Require Global Comprehension and Reasoning over Long Contexts

[35] Assessing the Feasibility of Lightweight Whisper Models for Low-Resource Urdu Transcription

[36] Memory Decoder: A Pretrained, Plug-and-Play Memory for Large Language Models

[37] A Survey of Cognitive Distortion Detection and Classification in NLP

[38] Language of Persuasion and Misrepresentation in Business Communication: A Textual Detection Approach

[39] A Comprehensive Evaluation framework of Alignment Techniques for LLMs

[40] VisCodex: Unified Multimodal Code Generation via Merging Vision and Coding Models

[41] Specialised or Generic? Tokenization Choices for Radiology Language Models

[42] Shaping Event Backstories to Estimate Potential Emotion Contexts

[43] Performance of GPT-5 Frontier Models in Ophthalmology Question Answering

[44] Which one Performs Better? Wav2Vec or Whisper? Applying both in Badini Kurdish Speech to Text (BKSTT)

[45] Neural Bandit Based Optimal LLM Selection for a Pipeline of Tasks

cs.CV [Back]

[46] A Context-aware Attention and Graph Neural Network-based Multimodal Framework for Misogyny Detection

[47] IAD-R1: Reinforcing Consistent Reasoning in Industrial Anomaly Detection

[48] A Neurosymbolic Framework for Interpretable Cognitive Attack Detection in Augmented Reality

[49] RL-MoE: An Image-Based Privacy Preserving Approach In Intelligent Transportation System

[50] Synthetic Data Generation for Emotional Depth Faces: Optimizing Conditional DCGANs via Genetic Algorithms in the Latent Space and Stabilizing Training with Knowledge Distillation

[51] $Δ$-AttnMask: Attention-Guided Masked Hidden States for Efficient Data Selection and Augmentation

[52] Personalized Feature Translation for Expression Recognition: An Efficient Source-Free Domain Adaptation Method

[53] GANime: Generating Anime and Manga Character Drawings from Sketches with Deep Learning

[54] MME-Emotion: A Holistic Evaluation Benchmark for Emotional Intelligence in Multimodal Large Language Models

[55] Towards Effective MLLM Jailbreaking Through Balanced On-Topicness and OOD-Intensity

[56] Towards Scalable Training for Handwritten Mathematical Expression Recognition

[57] Gradient-Direction-Aware Density Control for 3D Gaussian Splatting

[58] FineState-Bench: A Comprehensive Benchmark for Fine-Grained State Control in GUI Agents

[59] Beyond Blanket Masking: Examining Granularity for Privacy Protection in Images Captured by Blind and Low Vision Users

[60] Harnessing Input-Adaptive Inference for Efficient VLN

[61] SegDAC: Segmentation-Driven Actor-Critic for Visual Reinforcement Learning

[62] Lung-DDPM+: Efficient Thoracic CT Image Synthesis using Diffusion Probabilistic Model

[63] UltraLight Med-Vision Mamba for Classification of Neoplastic Progression in Tubular Adenomas

[64] Blink-to-code: real-time Morse code communication via eye blink detection and classification

[65] FusionEnsemble-Net: An Attention-Based Ensemble of Spatiotemporal Networks for Multimodal Sign Language Recognition

[66] A Signer-Invariant Conformer and Multi-Scale Fusion Transformer for Continuous Sign Language Recognition

[67] What Can We Learn from Inter-Annotator Variability in Skin Lesion Segmentation?

[68] X-UniMotion: Animating Human Images with Expressive, Unified and Identity-Agnostic Motion Latents

[69] DenoDet V2: Phase-Amplitude Cross Denoising for SAR Object Detection

[70] Skyshield: Event-Driven Submillimetre Thin Obstacle Detection for Drone Flight Safety

[71] Autonomous AI Bird Feeder for Backyard Biodiversity Monitoring

[72] Waymo-3DSkelMo: A Multi-Agent 3D Skeletal Motion Dataset for Pedestrian Interaction Modeling in Autonomous Driving

[73] RampNet: A Two-Stage Pipeline for Bootstrapping Curb Ramp Detection in Streetscape Images from Open Government Metadata

[74] Distilling LLM Prior to Flow Model for Generalizable Agent's Imagination in Object Goal Navigation

[75] What-Meets-Where: Unified Learning of Action and Contact Localization in a New Dataset

[76] MPT: Motion Prompt Tuning for Micro-Expression Recognition

[77] RASR: Retrieval-Augmented Super Resolution for Practical Reference-based Image Restoration