cs.CL

[1] Semantic Attractors and the Emergence of Meaning: Towards a Teleological Model of AGI

Hans-Joachim Rudolph

Main category: cs.CL

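TL;DR: This paper proposes a model of artificial general intelligence based on semantic attractor theory, in which linguistic meaning is formed through recursive tensorial transformation rather than statistical prediction.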

Details Motivation: The paper sets out to explore a semantic Artificial General Intelligence (AGI) in which meaning is formed through recursive tensorial transformation rather than statistical next-token prediction. Method: The paper proposes a theoretical framework based on semantic attractors in complex-valued meaning spaces, using cyclic operations involving the imaginary unit i to describe a rotational semantic structure. Result: The result is a semantic attractor model that, expressed in terms of gradient flows, tensor deformations, and iterative matrix dynamics, offers a model of semantic transformation that is not only mathematically suggestive but also philosophically significant. Conclusion: The paper concludes that true meaning emerges not from simulation but from recursive convergence toward semantic coherence, and that this requires a fundamentally new kind of cognitive architecture, one designed to shape language rather than merely predict it. Abstract: This essay develops a theoretical framework for a semantic Artificial General Intelligence (AGI) based on the notion of semantic attractors in complex-valued meaning spaces. Departing from current transformer-based language models, which operate on statistical next-token prediction, we explore a model in which meaning is not inferred probabilistically but formed through recursive tensorial transformation. Using cyclic operations involving the imaginary unit i, we describe a rotational semantic structure capable of modeling irony, homonymy, and ambiguity. At the center of this model, however, is a semantic attractor -- a teleological operator that, unlike statistical computation, acts as an intentional agent (Microvitum), guiding meaning toward stability, clarity, and expressive depth. Conceived in terms of gradient flows, tensor deformations, and iterative matrix dynamics, the attractor offers a model of semantic transformation that is not only mathematically suggestive, but also philosophically significant. We argue that true meaning emerges not from simulation, but from recursive convergence toward semantic coherence, and that this requires a fundamentally new kind of cognitive architecture -- one designed to shape language, not just predict it.

[2] LLMs Can't Handle Peer Pressure: Crumbling under Multi-Agent Social Interactions

Maojia Song,Tej Deep Pala,Weisheng Jin,Amir Zadeh,Chuan Li,Dorien Herremans,Soujanya Poria

Main category: cs.CL

TL;DR: This study explores how large language models (LLMs) in multi-agent systems (MAS) manage trust, misinformation, and peer input. It introduces KAIROS, a benchmark for simulating peer interactions, and finds that GRPO improves performance but increases susceptibility to social influence.

Details Motivation: The motivation is to understand how large language models (LLMs) form trust, resist misinformation, and integrate peer input in multi-agent systems (MAS), which are essential for achieving collective intelligence under complex social dynamics. Method: The study introduces KAIROS, a benchmark for simulating quiz contests with peer agents of varying reliability. It uses historical interactions and current peer responses to analyze how trust, peer action, and self-confidence influence decisions. Mitigation strategies like prompting, supervised fine-tuning, and reinforcement learning (GRPO) are evaluated. Result: Results show that GRPO with multi-agent context and outcome-based rewards achieves the best overall performance in decision-making but reduces robustness against social influence compared to base models. Conclusion: The study concludes that GRPO with multi-agent context and outcome-based rewards yields the best performance in managing peer interactions and decision-making in multi-agent systems, albeit at the cost of increased susceptibility to social influence. Abstract: Large language models (LLMs) are increasingly deployed in multi-agent systems (MAS) as components of collaborative intelligence, where peer interactions dynamically shape individual decision-making. Although prior work has focused on conformity bias, we extend the analysis to examine how LLMs form trust from previous impressions, resist misinformation, and integrate peer input during interaction, key factors for achieving collective intelligence under complex social dynamics. We present KAIROS, a benchmark simulating quiz contests with peer agents of varying reliability, offering fine-grained control over conditions such as expert-novice roles, noisy crowds, and adversarial peers. LLMs receive both historical interactions and current peer responses, allowing systematic investigation into how trust, peer action, and self-confidence influence decisions. 
As for mitigation strategies, we evaluate prompting, supervised fine-tuning, and reinforcement learning, Group Relative Policy Optimisation (GRPO), across multiple models. Our results reveal that GRPO with multi-agent context combined with outcome-based rewards and unconstrained reasoning achieves the best overall performance, but also decreases the robustness to social influence compared to Base models. The code and datasets are available at: https://github.com/declare-lab/KAIROS.
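As a rough illustration of the "group relative" idea behind GRPO with outcome-based rewards, the sketch below normalizes each sampled answer's reward against the other answers drawn for the same question. It is a minimal sketch, not the KAIROS training code: the function name `group_relative_advantages`, the binary reward values, and the use of the population standard deviation are all assumptions made for illustration.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Score each reward relative to the mean and spread of its own group.

    `rewards` holds one outcome reward per sampled answer to the same prompt.
    `eps` (an illustrative choice) avoids division by zero when all rewards match.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; implementations may differ here
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical group: four sampled answers to one quiz question,
# rewarded 1.0 when correct and 0.0 otherwise.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

Correct answers receive a positive advantage and incorrect ones a negative advantage; when every answer in a group earns the same reward, all advantages are zero and the group contributes no learning signal.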

[3] Not All Visitors are Bilingual: A Measurement Study of the Multilingual Web from an Accessibility Perspective

Masudul Hasan Masud Bhuiyan,Matteo Varvello,Yasir Zaki,Cristian-Alexandru Staicu

Main category: cs.CL

TL;DR: This paper introduces LangCrUX, a dataset of multilingual websites, to analyze web accessibility issues for non-Latin script users and proposes Kizuki, a tool to improve screen reader compatibility.

Details Motivation: The growing use of multilingual content on the web, especially involving non-Latin scripts, creates significant accessibility barriers for users with visual impairments due to limited support in assistive technologies like screen readers. Method: The authors created LangCrUX, a large-scale dataset of 120,000 popular websites across 12 non-Latin script languages, and used it to systematically analyze multilingual web accessibility, focusing on the use and effectiveness of accessibility hints. Result: The study found widespread neglect of language-aware accessibility hints, which often do not reflect the language diversity of visible content, thereby reducing the effectiveness of screen readers. Conclusion: Current web accessibility practices are insufficient for multilingual and non-Latin script contexts, and tools like Kizuki are needed to improve language-aware accessibility testing and support. Abstract: English is the predominant language on the web, powering nearly half of the world's top ten million websites. Support for multilingual content is nevertheless growing, with many websites increasingly combining English with regional or native languages in both visible content and hidden metadata. This multilingualism introduces significant barriers for users with visual impairments, as assistive technologies like screen readers frequently lack robust support for non-Latin scripts and misrender or mispronounce non-English text, compounding accessibility challenges across diverse linguistic contexts. Yet, large-scale studies of this issue have been limited by the lack of comprehensive datasets on multilingual web content. To address this gap, we introduce LangCrUX, the first large-scale dataset of 120,000 popular websites across 12 languages that primarily use non-Latin scripts. Leveraging this dataset, we conduct a systematic analysis of multilingual web accessibility and uncover widespread neglect of accessibility hints. 
We find that these hints often fail to reflect the language diversity of visible content, reducing the effectiveness of screen readers and limiting web accessibility. We finally propose Kizuki, a language-aware automated accessibility testing extension to account for the limited utility of language-inconsistent accessibility hints.

[4] Language-Specific Layer Matters: Efficient Multilingual Enhancement for Large Vision-Language Models

Yuchun Fan,Yilin Wang,Yongyu Mu,Lei Huang,Bei Li,Xiaocheng Feng,Tong Xiao,Jingbo Zhu

Main category: cs.CL

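TL;DR: PLAST is an efficient multilingual enhancement method that improves the multilingual understanding of large vision-language models by fine-tuning language-specific layers.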

Details Motivation: LVLMs excel at understanding visual information but exhibit an imbalance in their multilingual capabilities. Method: Layers involved in multilingual understanding are identified by monitoring language-specific neuron activations, and these layers are then precisely fine-tuned with question-translation pairs to achieve multilingual alignment. Result: Experiments on MM-Bench and MMMB show that PLAST effectively improves the multilingual capabilities of LVLMs and achieves significant efficiency with only 14% of the parameters tuned. PLAST also generalizes to low-resource and complex visual reasoning tasks. Conclusion: PLAST is an efficient training recipe for multilingual enhancement of large vision-language models (LVLMs). Abstract: Large vision-language models (LVLMs) have demonstrated exceptional capabilities in understanding visual information with human languages but also exhibit an imbalance in multilingual capabilities. In this work, we delve into the multilingual working pattern of LVLMs and identify a salient correlation between the multilingual understanding ability of LVLMs and language-specific neuron activations in shallow layers. Building on this insight, we introduce PLAST, a training recipe that achieves efficient multilingual enhancement for LVLMs by Precise LAnguage-Specific layers fine-Tuning. PLAST first identifies layers involved in multilingual understanding by monitoring language-specific neuron activations. These layers are then precisely fine-tuned with question-translation pairs to achieve multilingual alignment. Our empirical results on MM-Bench and MMMB demonstrate that PLAST effectively improves the multilingual capabilities of LVLMs and achieves significant efficiency with only 14% of the parameters tuned. Further analysis reveals that PLAST can be generalized to low-resource and complex visual reasoning tasks, facilitating the language-specific visual information engagement in shallow layers.

[5] Backprompting: Leveraging Synthetic Production Data for Health Advice Guardrails

Kellen Tan Cheng,Anna Lisa Gentile,Chad DeLuca,Guang-Jie Ren

Main category: cs.CL

TL;DR: This paper introduces 'backprompting' to generate realistic labeled data for training efficient LLM guardrails, particularly for detecting health advice, achieving better performance than GPT-4o with far fewer parameters.

Details Motivation: The difficulty in acquiring production-quality labeled data for developing robust LLM guardrails motivated the creation of a new method to generate such data synthetically. Method: Backprompting is used to generate production-like labeled data, which is then paired with sparse human-in-the-loop clustering for labeling. This synthetic data is infused into existing datasets to train a more robust detector. Result: The detector trained using the proposed method outperformed GPT-4o by up to 3.73% in identifying health advice in LLM outputs, despite having 400x fewer parameters. Conclusion: The proposed backprompting method combined with sparse human-in-the-loop clustering effectively generates production-like labeled data, leading to a more robust health advice detector that outperforms GPT-4o despite having significantly fewer parameters. Abstract: The pervasiveness of large language models (LLMs) in enterprise settings has also brought forth a significant amount of risks associated with their usage. Guardrails technologies aim to mitigate this risk by filtering LLMs' input/output text through various detectors. However, developing and maintaining robust detectors faces many challenges, one of which is the difficulty in acquiring production-quality labeled data on real LLM outputs prior to deployment. In this work, we propose backprompting, a simple yet intuitive solution to generate production-like labeled data for health advice guardrails development. Furthermore, we pair our backprompting method with a sparse human-in-the-loop clustering technique to label the generated data. Our aim is to construct a parallel corpus roughly representative of the original dataset yet resembling real LLM output. We then infuse existing datasets with our synthetic examples to produce robust training data for our detector. 
We test our technique in one of the most difficult and nuanced guardrails: the identification of health advice in LLM output, and demonstrate improvement versus other solutions. Our detector is able to outperform GPT-4o by up to 3.73%, despite having 400x fewer parameters.

[6] Integral Transformer: Denoising Attention, Not Too Much Not Too Little

Ivan Kobyzev,Abbas Ghaddar,Dingtao Hu,Boxing Chen

Main category: cs.CL

TL;DR: This paper proposes the Integral Transformer, a new self-attention mechanism that denoises attention by integrating signals sampled from the logit distribution.

Details Motivation: Softmax self-attention often assigns disproportionate weight to semantically uninformative tokens such as special tokens and punctuation, a phenomenon known as attention noise. Existing methods such as Cog Attention and the Differential Transformer address this by introducing negative attention scores, but they risk discarding useful information. Method: The paper proposes the Integral Transformer, a new self-attention mechanism that denoises attention by integrating signals sampled from the logit distribution. Result: Extensive experiments show that the model outperforms vanilla, Cog, and Differential attention variants on well-established knowledge and reasoning language benchmarks. The analysis also shows that using vanilla self-attention in the lower Transformer layers improves performance, and that the Integral Transformer effectively balances attention distributions and reduces rank collapse in upper layers. Conclusion: The Integral Transformer effectively mitigates attention noise while preserving the contributions of special tokens that are critical for model performance. Abstract: Softmax self-attention often assigns disproportionate weight to semantically uninformative tokens such as special tokens and punctuation, a phenomenon known as attention noise. While recent methods like Cog Attention and the Differential Transformer have addressed this by introducing negative attention scores, they risk discarding useful information. In this paper, we propose the Integral Transformer, a novel self-attention mechanism that denoises attention by integrating signals sampled from the logit distribution. Our approach mitigates noise while preserving the contributions of special tokens critical for model performance. Extensive experiments demonstrate that our model outperforms vanilla, Cog, and Differential attention variants on well-established knowledge and reasoning language benchmarks. Moreover, our analysis reveals that employing vanilla self-attention in the lower Transformer layers enhances performance and that the Integral Transformer effectively balances attention distributions and reduces rank collapse in upper layers.

[7] Latent Self-Consistency for Reliable Majority-Set Selection in Short- and Long-Answer Reasoning

Jeong-seok Oh,Jay-yoon Lee

Main category: cs.CL

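TL;DR: Latent Self-Consistency (LSC) selects the most semantically consistent response using learnable token embeddings; it performs strongly across a range of benchmarks with minimal computational overhead.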

Details Motivation: Probabilistic decoding often yields inconsistent outputs on complex or long-form questions, and existing methods such as Self-Consistency (SC), Universal Self-Consistency (USC), and Weighted Unigram Consistency Score (WUCS) each have limitations. Method: The paper introduces Latent Self-Consistency (LSC), which uses learnable token embeddings to select the most semantically consistent response. Result: LSC surpasses SC, USC, and WUCS on average across all short-form and long-form benchmarks while maintaining negligible computational overhead. Conclusion: LSC is a practical consistency-selection method that works reliably across answer formats and provides well-calibrated confidence estimates. Abstract: Probabilistic decoding in Large Language Models (LLMs) often yields inconsistent outputs, particularly on complex or long-form questions. Self-Consistency (SC) mitigates this for short-form QA by majority voting over exact strings, whereas Universal Self-Consistency (USC) and Weighted Unigram Consistency Score (WUCS) extend to long-form responses but lose accuracy on short-form benchmarks. We introduce Latent Self-Consistency (LSC), which selects the most semantically consistent response using learnable token embeddings. A lightweight forward generation of summary tokens increases inference time by less than 1% and requires no changes to the model architecture. Across 6 short-form and 5 long-form reasoning benchmarks (e.g., MATH, MMLU, TruthfulQA), LSC surpasses SC, USC and WUCS on all short-form and long-form ones on average, while maintaining negligible computational overhead. These results position LSC as a practical consistency-selection method that works reliably across answer formats. Additionally, LSC provides well-calibrated confidence estimates, maintaining low Expected Calibration Error across both answer formats.
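The Self-Consistency baseline described in the abstract, majority voting over exact strings, can be sketched in a few lines. This illustrates the baseline only, not LSC itself; the function name `self_consistency_vote`, the whitespace and case normalization, and the sample answers are assumptions made for illustration.

```python
from collections import Counter


def self_consistency_vote(answers):
    """Return the most frequent answer string and the share of samples that agree.

    Answers are compared as exact strings after trimming whitespace and
    lowercasing (an illustrative normalization, not taken from the paper).
    """
    normalized = [a.strip().lower() for a in answers]
    winner, votes = Counter(normalized).most_common(1)[0]
    return winner, votes / len(normalized)


# Hypothetical samples for one short-form question.
samples = ["42", " 42", "41", "42 ", "forty-two"]
answer, agreement = self_consistency_vote(samples)
```

Here "forty-two" does not count toward the winning answer even though it means the same thing, which shows why exact-string voting fits short-form QA but does not carry over to long-form responses.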

[8] Can Out-of-Distribution Evaluations Uncover Reliance on Shortcuts? A Case Study in Question Answering

Michal Štefánik,Timothee Mickus,Marek Kadlčík,Michal Spiegel,Josef Kuchař

Main category: cs.CL

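TL;DR: This study challenges the conventional practice of using OOD datasets to assess model generalization, identifies its limitations, and proposes more robust evaluation methods.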

Details Motivation: Most current AI research relies on OOD datasets to assess models' generalization capabilities, but it remains unclear whether such evaluations truly reflect model failures in real-world deployment. Method: The results of OOD evaluations are compared against a set of specific failure modes documented in existing question-answering models, referred to as a reliance on spurious features or prediction shortcuts. Result: The study finds that the datasets used for OOD evaluation in question answering vary widely in how well they estimate model robustness, with some performing worse than a simple in-distribution (ID) evaluation. Conclusion: The paper underlines the limitations of relying on out-of-distribution (OOD) datasets to assess generalization, and provides methodology and recommendations for evaluating generalization more robustly within and beyond question answering (QA). Abstract: A majority of recent work in AI assesses models' generalization capabilities through the lens of performance on out-of-distribution (OOD) datasets. Despite their practicality, such evaluations build upon a strong assumption: that OOD evaluations can capture and reflect upon possible failures in a real-world deployment. In this work, we challenge this assumption and confront the results obtained from OOD evaluations with a set of specific failure modes documented in existing question-answering (QA) models, referred to as a reliance on spurious features or prediction shortcuts. We find that different datasets used for OOD evaluations in QA provide an estimate of models' robustness to shortcuts that have a vastly different quality, some largely under-performing even a simple, in-distribution evaluation. We partially attribute this to the observation that spurious shortcuts are shared across ID+OOD datasets, but also find cases where a dataset's quality for training and evaluation is largely disconnected. Our work underlines limitations of commonly-used OOD-based evaluations of generalization, and provides methodology and recommendations for evaluating generalization within and beyond QA more robustly.

[9] How Reliable are LLMs for Reasoning on the Re-ranking task?

Nafis Tanveer Islam,Zhiming Zhao

Main category: cs.CL

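TL;DR: This paper examines how different training methods affect the semantic understanding and explainability of large language models on the re-ranking task, using a small ranking dataset from the environment and Earth science domain.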

Details Motivation: As the semantic understanding capabilities of large language models (LLMs) improve, they show greater alignment with human values, but at the cost of transparency. An in-depth understanding of an LLM's internal workings is essential for comprehending the reasoning behind its re-ranking, especially in new systems with limited user engagement and insufficient ranking data. Method: A relatively small ranking dataset from the environment and Earth science domain is used to re-rank retrieved content, and the explainable information is analyzed to assess whether the re-ranking can be justified. Result: The study finds that, although different training methods affect LLM training and inference, some methods offer better explainability, and not all methods learn an accurate semantic understanding. Conclusion: Different training methods significantly affect the semantic understanding of large language models (LLMs) on the re-ranking task, and some training methods are more explainable than others. Abstract: With the improving semantic understanding capability of Large Language Models (LLMs), they exhibit a greater awareness and alignment with human values, but this comes at the cost of transparency. Although promising results are achieved via experimental analysis, an in-depth understanding of the LLM's internal workings is unavoidable to comprehend the reasoning behind the re-ranking, which provides end users with an explanation that enables them to make an informed decision. Moreover, in newly developed systems with limited user engagement and insufficient ranking data, accurately re-ranking content remains a significant challenge. While various training methods affect the training of LLMs and generate inference, our analysis has found that some training methods exhibit better explainability than others, implying that an accurate semantic understanding has not been learned through all training methods; instead, abstract knowledge has been gained to optimize evaluation, which raises questions about the true reliability of LLMs. Therefore, in this work, we analyze how different training methods affect the semantic understanding of the re-ranking task in LLMs and investigate whether these models can generate more informed textual reasoning to overcome the challenges of transparency of LLMs and limited training data. To analyze the LLMs for re-ranking tasks, we utilize a relatively small ranking dataset from the environment and the Earth science domain to re-rank retrieved content. Furthermore, we also analyze the explainable information to see if the re-ranking can be reasoned using explainability.

[10] Integrating gender inclusivity into large language models via instruction tuning

Alina Wróblewska,Bartosz Żuk

Main category: cs.CL

TL;DR: This study fine-tunes large language models for gender inclusivity, aiming to reduce gender bias in Polish language generation.

Details Motivation: Because of historical and political conventions in the use of grammatical gender in Polish, masculine forms are widely used to refer to men, women, and mixed-gender groups. Large language models trained on Polish texts inherit and reinforce this unfair linguistic system, so the resulting gender bias needs to be addressed. Method: Grounded in a theoretical linguistic framework, the authors design a system prompt with explicit gender-inclusive guidelines and use the IPIS dataset to fine-tune several multilingual and Polish-specific large language models. Result: The results indicate that fine-tuning large language models for gender inclusivity can effectively reduce gender bias in Polish generation tasks. Conclusion: By fine-tuning large language models on the IPIS dataset, the study integrates gender inclusivity as an inherent feature of the models, offering a systematic solution for mitigating gender bias in Polish language generation. Abstract: Imagine a language with masculine, feminine, and neuter grammatical genders, yet, due to historical and political conventions, masculine forms are predominantly used to refer to men, women and mixed-gender groups. This is the reality of contemporary Polish. A social consequence of this unfair linguistic system is that large language models (LLMs) trained on Polish texts inherit and reinforce this masculine bias, generating gender-imbalanced outputs. This study addresses this issue by tuning LLMs using the IPIS dataset, a collection of human-crafted gender-inclusive proofreading in Polish and Polish-to-English translation instructions. Grounded in a theoretical linguistic framework, we design a system prompt with explicit gender-inclusive guidelines for Polish. In our experiments, we IPIS-tune multilingual LLMs (Llama-8B, Mistral-7B and Mistral-Nemo) and Polish-specific LLMs (Bielik and PLLuM). Our approach aims to integrate gender inclusivity as an inherent feature of these models, offering a systematic solution to mitigate gender bias in Polish language generation.

[11] Principled Detection of Hallucinations in Large Language Models via Multiple Testing

Jiawei Li,Akshayaa Magesh,Venugopal V. Veeravalli

Main category: cs.CL

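TL;DR: This paper studies hallucination detection in large language models, proposes a new multiple-testing-inspired method, and validates its effectiveness.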

Details Motivation: Large language models are powerful but prone to hallucinations, i.e., generating responses that sound confident but are actually incorrect or nonsensical. Method: Hallucination detection is formulated as a hypothesis testing problem, drawing on the problem of out-of-distribution detection in machine learning models. Result: In experiments, the method proved robust when compared against existing state-of-the-art methods. Conclusion: The paper proposes a multiple-testing-based hallucination detection method for large language models and validates its robustness through extensive experiments. Abstract: While Large Language Models (LLMs) have emerged as powerful foundational models to solve a variety of tasks, they have also been shown to be prone to hallucinations, i.e., generating responses that sound confident but are actually incorrect or even nonsensical. In this work, we formulate the problem of detecting hallucinations as a hypothesis testing problem and draw parallels to the problem of out-of-distribution detection in machine learning models. We propose a multiple-testing-inspired method to solve the hallucination detection problem, and provide extensive experimental results to validate the robustness of our approach against state-of-the-art methods.

[12] COMET-poly: Machine Translation Metric Grounded in Other Candidates

Maike Züfle,Vilém Zouhar,Tu Anh Dinh,Felipe Maia Polo,Jan Niehues,Mrinmaya Sachan

Main category: cs.CL

TL;DR: This paper introduces two improved automated metrics for machine translation evaluation, COMET-polycand and COMET-polyic, which incorporate additional translation information and human-labeled quality scores, resulting in enhanced performance.

Details Motivation: Automated metrics for machine translation usually consider only the source sentence and a single translation, unlike human evaluation which often involves multiple alternatives. This discrepancy can negatively impact metric performance. Method: COMET-polycand uses alternative translations of the same source sentence for assessment, while COMET-polyic uses translations of similar texts with human-labeled quality scores. Result: Including additional translations in COMET-polycand improved segment-level metric performance (Kendall's tau-b correlation from 0.079 to 0.118), with further gains as more translations were added. Incorporating retrieved examples in COMET-polyic yielded similar improvements (Kendall's tau-b correlation from 0.079 to 0.116). Conclusion: COMET-polycand and COMET-polyic are effective automated metrics for machine translation that incorporate additional information beyond the single translation, and the models are publicly released. Abstract: Automated metrics for machine translation attempt to replicate human judgment. Unlike humans, who often assess a translation in the context of multiple alternatives, these metrics typically consider only the source sentence and a single translation. This discrepancy in the evaluation setup may negatively impact the performance of automated metrics. We propose two automated metrics that incorporate additional information beyond the single translation. COMET-polycand uses alternative translations of the same source sentence to compare and contrast with the translation at hand, thereby providing a more informed assessment of its quality. COMET-polyic, inspired by retrieval-based in-context learning, takes in translations of similar source texts along with their human-labeled quality scores to guide the evaluation. 
We find that including a single additional translation in COMET-polycand improves the segment-level metric performance (0.079 to 0.118 Kendall's tau-b correlation), with further gains when more translations are added. Incorporating retrieved examples in COMET-polyic yields similar improvements (0.079 to 0.116 Kendall's tau-b correlation). We release our models publicly.
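The segment-level figures above are Kendall's tau-b correlations between metric scores and human scores. As a rough guide to what that number measures, here is a minimal pure-Python sketch of the standard tau-b statistic; the function name `kendall_tau_b` and the example scores are assumptions made for illustration and are not taken from the paper's evaluation.

```python
from math import sqrt


def kendall_tau_b(x, y):
    """Kendall's tau-b between two equal-length score lists.

    tau_b = (P - Q) / sqrt((P + Q + T) * (P + Q + U)), where P and Q count
    concordant and discordant pairs, T counts pairs tied only in x, and
    U counts pairs tied only in y. Pairs tied in both are ignored.
    """
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    concordant = discordant = ties_x = ties_y = 0
    for i in range(len(x)):
        for j in range(i + 1, len(x)):
            dx, dy = x[i] - x[j], y[i] - y[j]
            if dx == 0 and dy == 0:
                continue
            if dx == 0:
                ties_x += 1
            elif dy == 0:
                ties_y += 1
            elif dx * dy > 0:
                concordant += 1
            else:
                discordant += 1
    pairs = concordant + discordant
    denom = sqrt((pairs + ties_x) * (pairs + ties_y))
    return (concordant - discordant) / denom if denom else 0.0


# Hypothetical metric scores and human quality scores for four segments.
metric_scores = [0.9, 0.4, 0.7, 0.2]
human_scores = [85, 60, 60, 30]
tau = kendall_tau_b(metric_scores, human_scores)
```

A value of 1.0 means the metric orders every pair of segments exactly as the human scores do, and 0 means no rank agreement.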

[13] The Mind's Eye: A Multi-Faceted Reward Framework for Guiding Visual Metaphor Generation

Girish A. Koushik,Fatemeh Nazarieh,Katherine Birch,Shenbin Qian,Diptesh Kanojia

Main category: cs.CL

TL;DR: This paper introduces a self-evaluating framework for visual metaphor generation, combining structured prompting and lightweight reinforcement learning to align source-target meanings effectively, particularly for abstract metaphors.

Details Motivation: Visual metaphor generation requires aligning source and target concepts from text to image while preserving meaning and visual coherence. This work addresses the challenge by focusing on metaphor alignment through structured prompting and lightweight learning approaches. Method: The paper proposes a self-evaluating visual metaphor generation framework that includes a training-free pipeline for prompt decomposition and a training-based pipeline using a self-evaluation reward schema. It combines existing metrics with new ones like metaphor decomposition score and meaning alignment metric. Result: The training-free pipeline outperforms strong baselines (e.g., GPT-4o, Imagen) on decomposition, CLIP, and MA scores, while the training-based approach follows closely. User studies indicate preference for GPT-4o overall, but the proposed training-free method excels on abstract metaphors and outperforms open-source methods and Imagen in certain cases. Conclusion: Structured prompting and lightweight reinforcement learning can effectively perform metaphor alignment with modest computational resources, though gaps remain in aesthetics and sampling compared to human preferences. Abstract: Visual metaphor generation is a challenging task that aims to generate an image given an input text metaphor. Inherently, it needs language understanding to bind a source concept with a target concept, in a way that preserves meaning while ensuring visual coherence. We propose a self-evaluating visual metaphor generation framework that focuses on metaphor alignment. Our self-evaluation approach combines existing metrics with our newly proposed metaphor decomposition score and a meaning alignment (MA) metric. 
Within this setup, we explore two novel approaches: a training-free pipeline that explicitly decomposes prompts into source-target-meaning (S-T-M) mapping for image synthesis, and a complementary training-based pipeline that improves alignment using our proposed self-evaluation reward schema, without any large-scale retraining. On the held-out test set, the training-free approach surpasses strong closed baselines (GPT-4o, Imagen) on decomposition, CLIP, and MA scores, with the training-based approach close behind. We evaluate our framework output using a user-facing study, and observed that participants preferred GPT-4o overall, while our training-free pipeline led open-source methods and edged Imagen on abstract metaphors. Our analyses show S-T-M prompting helps longer or more abstract metaphors, with closed models excelling on short, concrete cases; we also observe sensitivity to sampler settings. Overall, structured prompting and lightweight RL perform metaphor alignment well under modest compute, and remaining gaps to human preference appear driven by aesthetics and sampling.

[14] What do language models model? Transformers, automata, and the format of thought

Colin Klein

Main category: cs.CL

TL;DR: Transformers are 'discourse machines' that generate language based on context, but their processing differs from human cognition.

Details Motivation: To determine whether large language models reflect human cognition or are simply models of the training corpus. Method: An analytical approach based on the computational architecture of transformers and Liu et al. (2022)'s shortcut automata. Result: Transformer-based models operate with linear computation, differing from human supralinear linguistic processing. Conclusion: Language models are 'discourse machines' that generate new language based on context, similar to humans but learned through different means. Abstract: What do large language models actually model? Do they tell us something about human capacities, or are they models of the corpus we've trained them on? I give a non-deflationary defence of the latter position. Cognitive science tells us that linguistic capabilities in humans rely on supralinear formats for computation. The transformer architecture, by contrast, supports at best a linear format for processing. This argument will rely primarily on certain invariants of the computational architecture of transformers. I then suggest a positive story about what transformers are doing, focusing on Liu et al. (2022)'s intriguing speculations about shortcut automata. I conclude with why I don't think this is a terribly deflationary story. Language is not (just) a means for expressing inner state but also a kind of 'discourse machine' that lets us make new language given appropriate context. We have learned to use this technology in one way; LLMs have learned to use it too, but via very different means.

[15] A New NMT Model for Translating Clinical Texts from English to Spanish

Rumeng Li,Xun Wang,Hong Yu

Main category: cs.CL

TL;DR: NOOV enhances English-to-Spanish EHR translation by tackling unknown words and repetition issues, even with limited training data.

Details Motivation: The motivation is to overcome the challenges in translating EHR narratives, including the lack of a parallel-aligned corpus and the presence of numerous unknown words. Method: The study introduces NOOV, which combines a bilingual lexicon learned from parallel-aligned corpora and a phrase look-up table from biomedical knowledge resources to enhance translation accuracy and fluency. Result: NOOV demonstrated improved accuracy and fluency in translating EHR narratives from English to Spanish, addressing key challenges in neural machine translation. Conclusion: NOOV, the proposed NMT system, effectively improves the translation of EHR narratives from English to Spanish by addressing unknown words and word repetition challenges, even with limited parallel-aligned training data. Abstract: Translating electronic health record (EHR) narratives from English to Spanish is a clinically important yet challenging task due to the lack of a parallel-aligned corpus and the abundance of unknown words. To address such challenges, we propose NOOV (for No OOV), a new neural machine translation (NMT) system that requires little in-domain parallel-aligned corpus for training. NOOV integrates a bilingual lexicon automatically learned from parallel-aligned corpora and a phrase look-up table extracted from a large biomedical knowledge resource, to alleviate both the unknown word problem and the word-repeat challenge in NMT, improving the phrase generation of NMT systems. Evaluation shows that NOOV is able to generate better translations of EHR narratives, with improvement in both accuracy and fluency.

[16] Scaling Laws for Task-Stratified Knowledge in Post-Training Quantized Large Language Models

Chenxi Zhou,Pengfei Cao,Jiang Li,Jun Zhao,Kang Liu

Main category: cs.CL

TL;DR: This paper investigates how post-training quantization affects different knowledge capabilities of large language models, finding that knowledge memorization is more sensitive to quantization parameters than utilization.

Details Motivation: The motivation is to understand how post-training quantization (PTQ) affects different knowledge capabilities of large language models (LLMs) and to develop more effective quantization strategies. Method: The paper conducts an extensive empirical investigation to establish task-stratified scaling laws, using a quantitative framework that includes model size, effective bit-width, calibration set size, and group size. Result: The central finding is that knowledge memorization is more sensitive to variations in PTQ parameters than knowledge utilization. Conclusion: The paper concludes that knowledge memorization in LLMs is more sensitive to changes in quantization parameters than knowledge utilization, providing insights for developing better quantization strategies. Abstract: Large language models (LLMs) present significant deployment challenges due to their scale, with post-training quantization (PTQ) emerging as a practical compression solution. However, a comprehensive understanding of how PTQ precisely impacts diverse LLM knowledge capabilities remains elusive, and existing scaling laws for quantized models often overlook crucial PTQ-specific parameters and task-specific sensitivities. This paper addresses these gaps by conducting an extensive empirical investigation to establish task-stratified scaling laws. We disentangle LLM knowledge into memorization and utilization capabilities and develop a unified quantitative framework that incorporates model size, effective bit-width, calibration set size, and group size. Our central finding reveals that knowledge memorization exhibits markedly greater sensitivity to variations in effective bit-width, calibration set size, and model size compared to the more robust knowledge utilization. These findings offer a fine-grained understanding of PTQ's impact and provide guidance for developing knowledge-aware quantization strategies that can better preserve targeted cognitive functions.

[17] Thinking Before You Speak: A Proactive Test-time Scaling Approach

Cong Li,Wenchang Chai,Hejun Wu,Yan Pan,Pengxu Wei,Liang Lin

Main category: cs.CL

TL;DR: This paper proposes TBYS, a new reasoning framework that improves the performance of large language models on complex mathematical tasks by generating insights.

Details Motivation: Large language models fall short on complex reasoning tasks, mainly because human reasoning patterns differ from the patterns present in their training data. Method: The paper proposes a reasoning framework named Thinking Before You Speak (TBYS) and designs a pipeline that automatically collects and filters in-context examples for generating insights. Result: By inserting insights to guide the reasoning process, the TBYS framework achieves good results on mathematical datasets. Conclusion: Experiments verify the effectiveness of TBYS on complex mathematical tasks while reducing reliance on human labeling and fine-tuning. Abstract: Large Language Models (LLMs) often exhibit deficiencies with complex reasoning tasks, such as maths, which we attribute to the discrepancy between human reasoning patterns and those presented in the LLMs' training data. When dealing with complex problems, humans tend to think carefully before expressing solutions. However, they often do not articulate their inner thoughts, including their intentions and chosen methodologies. Consequently, critical insights essential for bridging reasoning steps may be absent in training data collected from human sources. To bridge this gap, we propose inserting insights between consecutive reasoning steps, which review the status and initiate the next reasoning steps. Unlike prior prompting strategies that rely on a single or a workflow of static prompts to facilitate reasoning, insights are proactively generated to guide reasoning processes. We implement our idea as a reasoning framework, named Thinking Before You Speak (TBYS), and design a pipeline for automatically collecting and filtering in-context examples for the generation of insights, which alleviates human labeling efforts and fine-tuning overheads. Experiments on challenging mathematical datasets verify the effectiveness of TBYS. Project website: https://gitee.com/jswrt/TBYS

[18] Breaking the Trade-Off Between Faithfulness and Expressiveness for Large Language Models

Chenxu Yang,Qingyi Si,Zheng Lin

Main category: cs.CL

TL;DR: This paper proposes Collaborative Decoding (CoDe), a novel approach for Large Language Models (LLMs) to effectively integrate external knowledge while maintaining both faithfulness and expressiveness.

Details Motivation: Current LLMs struggle to seamlessly integrate external knowledge while maintaining faithfulness and expressiveness, leading to outputs that either lack support from external knowledge or appear overly verbose and unnatural. Method: The authors proposed Collaborative Decoding (CoDe), a novel approach that dynamically integrates output probabilities generated with and without external knowledge. Additionally, a knowledge-aware reranking mechanism is introduced to prevent over-reliance on prior parametric knowledge while ensuring proper utilization of external information. Result: Through comprehensive experiments, the plug-and-play CoDe framework demonstrates superior performance in enhancing faithfulness without compromising expressiveness across diverse LLMs and evaluation metrics. Conclusion: The proposed Collaborative Decoding (CoDe) framework effectively enhances faithfulness without compromising expressiveness across diverse Large Language Models (LLMs), demonstrating its effectiveness and generalizability. Abstract: Grounding responses in external knowledge represents an effective strategy for mitigating hallucinations in Large Language Models (LLMs). However, current LLMs struggle to seamlessly integrate knowledge while simultaneously maintaining faithfulness (or fidelity) and expressiveness, capabilities that humans naturally possess. This limitation results in outputs that either lack support from external knowledge, thereby compromising faithfulness, or appear overly verbose and unnatural, thus sacrificing expressiveness. In this work, to break the trade-off between faithfulness and expressiveness, we propose Collaborative Decoding (CoDe), a novel approach that dynamically integrates output probabilities generated with and without external knowledge. This integration is guided by distribution divergence and model confidence, enabling the selective activation of relevant and reliable expressions from the model's internal parameters. 
Furthermore, we introduce a knowledge-aware reranking mechanism that prevents over-reliance on prior parametric knowledge while ensuring proper utilization of provided external information. Through comprehensive experiments, our plug-and-play CoDe framework demonstrates superior performance in enhancing faithfulness without compromising expressiveness across diverse LLMs and evaluation metrics, validating both its effectiveness and generalizability.
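
The idea of mixing next-token distributions produced with and without external knowledge, guided by divergence and confidence, can be illustrated with a toy sketch. Everything here is an assumption made for illustration: the function names `js_divergence` and `combine`, and in particular the weighting rule, are hypothetical and are not taken from the CoDe paper.

```python
import math

def js_divergence(p, q):
    """Jensen-Shannon divergence (natural log) between two distributions."""
    def kl(a, b):
        return sum(x * math.log(x / y) for x, y in zip(a, b) if x > 0)
    m = [(x + y) / 2 for x, y in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def combine(p_with, p_without):
    """Mix two next-token distributions over the same vocabulary.

    p_with:    probabilities when the external knowledge is in the prompt
    p_without: probabilities from the model's parametric knowledge alone
    """
    # Normalise the divergence to [0, 1]; ln(2) is the JS upper bound.
    divergence = js_divergence(p_with, p_without) / math.log(2)
    confidence = max(p_without)
    # Hypothetical rule (an assumption, not the paper's formula): lean on the
    # parametric distribution only when it is confident AND agrees with the
    # knowledge-conditioned distribution.
    w = confidence * (1.0 - divergence)
    mixed = [(1 - w) * a + w * b for a, b in zip(p_with, p_without)]
    total = sum(mixed)
    return [x / total for x in mixed]

if __name__ == "__main__":
    print(combine([0.7, 0.2, 0.1], [0.5, 0.4, 0.1]))
```

With this rule, two fully disagreeing distributions yield a weight of zero, so the knowledge-conditioned distribution is used unchanged; a real system would tune or learn this behaviour.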

[19] Emotion Omni: Enabling Empathetic Speech Response Generation through Large Language Models

Haoyu Wang,Guangyan Zhang,Jiale Chen,Jingyu Li,Yuehai Wang,Yiwen Guo

Main category: cs.CL

TL;DR: This paper proposes Emotion Omni, a new model architecture for empathetic speech understanding and response generation, along with a data pipeline to build emotional dialogue datasets with minimal resources.

Details Motivation: Current speech LLMs often lack the ability to understand emotional and paralinguistic cues, which are crucial for meaningful human-machine interaction. Existing empathetic speech models also require massive datasets and computational resources, posing a challenge for development with limited data. Method: The paper introduces Emotion Omni, a model architecture designed to understand emotional speech and generate empathetic responses. Additionally, a data generation pipeline using an open-source TTS framework was developed to create a 200k emotional dialogue dataset. Result: The proposed Emotion Omni architecture and data generation pipeline successfully support the creation of an empathetic speech assistant, as demonstrated by the availability of demos. Conclusion: Emotion Omni is a novel model architecture that effectively understands emotional content in user speech and generates empathetic responses, offering a solution for developing empathetic speech assistants with limited data and reduced reliance on large-scale training. Abstract: With the development of speech large language models (speech LLMs), users can now interact directly with assistants via speech. However, most existing models simply convert the response content into speech without fully understanding the rich emotional and paralinguistic cues embedded in the user's query. In many cases, the same sentence can have different meanings depending on the emotional expression. Furthermore, emotional understanding is essential for improving user experience in human-machine interaction. Currently, most speech LLMs with empathetic capabilities are trained on massive datasets. This approach requires vast amounts of data and significant computational resources. Therefore, a key challenge lies in how to develop a speech LLM capable of generating empathetic responses with limited data and without the need for large-scale training. 
To address this challenge, we propose Emotion Omni, a novel model architecture designed to understand the emotional content of user speech input and generate empathetic speech responses. Additionally, we developed a data generation pipeline based on an open-source TTS framework to construct a 200k emotional dialogue dataset, which supports the construction of an empathetic speech assistant. The demos are available at https://w311411.github.io/omni_demo/

[20] Tailored Teaching with Balanced Difficulty: Elevating Reasoning in Multimodal Chain-of-Thought via Prompt Curriculum

Xinglong Yang,Quan Feng,Zhongying Pan,Xiang Chen,Yu Tian,Wentong Li,Shuofei Qiao,Yuxia Geng,Xingyu Zhao,Sheng-Jun Huang

Main category: cs.CL

TL;DR: This paper proposes a difficulty-balanced prompt selection framework for Multimodal Chain-of-Thought prompting, enhancing model performance by aligning examples with model capabilities and task complexity.

Details Motivation: The motivation stems from the limitations of random or manual example selection in MCoT prompting, which do not consider model-specific knowledge distributions or task complexity, resulting in unstable and suboptimal performance. Method: The method involves reframing prompt selection as a curriculum design problem, integrating two signals: model-perceived difficulty (via prediction disagreement in active learning) and intrinsic sample complexity. This joint analysis leads to a difficulty-balanced sampling strategy. Result: Extensive experiments on five challenging benchmarks and multiple Multimodal Large Language Models show substantial and consistent improvements, with reduced performance discrepancies caused by random sampling. Conclusion: The proposed difficulty-balanced sampling strategy effectively improves the performance of Multimodal Chain-of-Thought prompting across multiple benchmarks and models, offering a principled and robust approach for enhancing multimodal reasoning. Abstract: The effectiveness of Multimodal Chain-of-Thought (MCoT) prompting is often limited by the use of randomly or manually selected examples. These examples fail to account for both model-specific knowledge distributions and the intrinsic complexity of the tasks, resulting in suboptimal and unstable model performance. To address this, we propose a novel framework inspired by the pedagogical principle of "tailored teaching with balanced difficulty". We reframe prompt selection as a prompt curriculum design problem: constructing a well ordered set of training examples that align with the model's current capabilities. Our approach integrates two complementary signals: (1) model-perceived difficulty, quantified through prediction disagreement in an active learning setup, capturing what the model itself finds challenging; and (2) intrinsic sample complexity, which measures the inherent difficulty of each question-image pair independently of any model. 
By jointly analyzing these signals, we develop a difficulty-balanced sampling strategy that ensures the selected prompt examples are diverse across both dimensions. Extensive experiments conducted on five challenging benchmarks and multiple popular Multimodal Large Language Models (MLLMs) demonstrate that our method yields substantial and consistent improvements and greatly reduces performance discrepancies caused by random sampling, providing a principled and robust approach for enhancing multimodal reasoning.
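
A rough sketch of how the two signals could drive a difficulty-balanced selection is given below. It is an illustration under stated assumptions, not the authors' procedure: the median split into a 2x2 grid, the round-robin draw, and the names `prediction_disagreement` and `balanced_sample` are all hypothetical.

```python
import random
from collections import defaultdict

def prediction_disagreement(predictions):
    """Fraction of sampled predictions that differ from the majority answer."""
    counts = defaultdict(int)
    for p in predictions:
        counts[p] += 1
    return 1.0 - max(counts.values()) / len(predictions)

def balanced_sample(examples, k, seed=0):
    """Pick k examples spread over a 2x2 difficulty grid.

    examples: list of (example_id, model_difficulty, intrinsic_complexity).
    The median split and round-robin draw are illustrative assumptions.
    """
    rng = random.Random(seed)

    def median(values):
        ordered = sorted(values)
        return ordered[len(ordered) // 2]

    d_med = median([e[1] for e in examples])
    c_med = median([e[2] for e in examples])

    # Assign each example to one of four cells: (hard for model?, complex?).
    cells = defaultdict(list)
    for e in examples:
        cells[(e[1] >= d_med, e[2] >= c_med)].append(e)
    for bucket in cells.values():
        rng.shuffle(bucket)

    # Draw one example per cell in turn until k examples are chosen.
    chosen = []
    keys = sorted(cells)
    while len(chosen) < k and any(cells[key] for key in keys):
        for key in keys:
            if cells[key] and len(chosen) < k:
                chosen.append(cells[key].pop())
    return chosen

if __name__ == "__main__":
    pool = [(i, d, c) for i, (d, c) in enumerate(
        [(0.1, 0.1), (0.1, 0.9), (0.9, 0.1), (0.9, 0.9)] * 2)]
    print(balanced_sample(pool, k=4))
```

Drawing in turn from each cell keeps the selected prompts from clustering in one region, for example only questions that are easy for the model and intrinsically simple.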

[21] Knowing or Guessing? Robust Medical Visual Question Answering via Joint Consistency and Contrastive Learning

Songtao Jiang,Yuxi Chen,Sibo Song,Yan Zhang,Yeying Jin,Yang Feng,Jian Wu,Zuozhu Liu

Main category: cs.CL

TL;DR: This study reveals the fragility of current medical vision-language models with respect to answer consistency and proposes a new method, CCL, that addresses the problem and improves model robustness and performance.

Details Motivation: Current medical vision-language models give fluctuating answers when faced with semantically equivalent rephrasings of medical questions, so their reliability and robustness need to be improved. Method: The authors construct a dataset named RoMed and propose the CCL method, which combines knowledge-anchored consistency learning with bias-aware contrastive learning. Result: When SOTA models such as LLaVA-Med are evaluated on RoMed, performance drops significantly, whereas CCL performs strongly at improving answer consistency. Conclusion: CCL achieves SOTA performance on three popular VQA benchmarks and significantly improves answer consistency on the RoMed test set, demonstrating markedly enhanced robustness. Abstract: In high-stakes medical applications, consistent answering across diverse question phrasings is essential for reliable diagnosis. However, we reveal that current Medical Vision-Language Models (Med-VLMs) exhibit concerning fragility in Medical Visual Question Answering, as their answers fluctuate significantly when faced with semantically equivalent rephrasings of medical questions. We attribute this to two limitations: (1) insufficient alignment of medical concepts, leading to divergent reasoning patterns, and (2) hidden biases in training data that prioritize syntactic shortcuts over semantic understanding. To address these challenges, we construct RoMed, a dataset built upon original VQA datasets containing 144k questions with variations spanning word-level, sentence-level, and semantic-level perturbations. When evaluating state-of-the-art (SOTA) models like LLaVA-Med on RoMed, we observe alarming performance drops (e.g., a 40% decline in Recall) compared to original VQA benchmarks, exposing critical robustness gaps. To bridge this gap, we propose Consistency and Contrastive Learning (CCL), which integrates two key components: (1) knowledge-anchored consistency learning, aligning Med-VLMs with medical knowledge rather than shallow feature patterns, and (2) bias-aware contrastive learning, mitigating data-specific priors through discriminative representation refinement. CCL achieves SOTA performance on three popular VQA benchmarks and notably improves answer consistency by 50% on the challenging RoMed test set, demonstrating significantly enhanced robustness. Code will be released.

[22] Attention2Probability: Attention-Driven Terminology Probability Estimation for Robust Speech-to-Text System

Yanfan Du,Jun Zhang,Bin Wang,Jin Qiu,Lu Huang,Yuan Ge,Xiaoqian Liu,Tong Xiao,Jingbo Zhu

Main category: cs.CL

TL;DR: This paper proposes Attention2Probability, a new terminology retrieval method for speech-to-text that performs well in both accuracy and efficiency, and reveals the limitations of current SLMs in using terminology.

Details Motivation: Although speech large language models (SLMs) have advanced speech recognition and translation in general domains, they still struggle to generate domain-specific terms or neologisms. Method: The paper proposes Attention2Probability, which converts cross-attention weights between speech and terminology into presence probabilities and uses curriculum learning to enhance retrieval accuracy. Result: On the test set, Attention2Probability significantly outperforms the VectorDB method, with maximum recall reaching 92.57% for Chinese and 86.83% for English at a latency of only 8.71 ms per query. Terminology intervention improves terminology accuracy by 6-17%. Conclusion: Attention2Probability is a lightweight, flexible, and accurate terminology retrieval method that can effectively improve the terminology accuracy of speech-to-text systems, while revealing the limitations of current SLMs in using terminology. Abstract: Recent advances in speech large language models (SLMs) have improved speech recognition and translation in general domains, but accurately generating domain-specific terms or neologisms remains challenging. To address this, we propose Attention2Probability: attention-driven terminology probability estimation for robust speech-to-text system, which is lightweight, flexible, and accurate. Attention2Probability converts cross-attention weights between speech and terminology into presence probabilities, and it further employs curriculum learning to enhance retrieval accuracy. Furthermore, to tackle the lack of data for speech-to-text tasks with terminology intervention, we create and release a new speech dataset with terminology to support future research in this area. Experimental results show that Attention2Probability significantly outperforms the VectorDB method on our test set. Specifically, its maximum recall rates reach 92.57% for Chinese and 86.83% for English. This high recall is achieved with a latency of only 8.71ms per query. Intervening in SLMs' recognition and translation tasks using Attention2Probability-retrieved terms improves terminology accuracy by 6-17%, while revealing that the current utilization of terminology by SLMs has limitations.

[23] Filtering for Creativity: Adaptive Prompting for Multilingual Riddle Generation in LLMs

Duy Le,Kent Ziti,Evan Girard-Sun,Sean O'Brien,Vasu Sharma,Kevin Zhu

Main category: cs.CL

TL;DR: AOF is a prompting framework that enhances multilingual riddle generation by reducing redundancy and improving lexical diversity without task-specific fine-tuning.

Details Motivation: Multilingual riddle generation challenges LLMs to balance cultural fluency with creative abstraction, which standard prompting strategies often fail to achieve, leading to memorized or shallow outputs. Method: AOF filters redundant riddle generations using cosine-based similarity rejection while enforcing lexical novelty and cross-lingual fidelity. It was tested across three LLMs and four language pairs. Result: AOF-enhanced GPT-4o achieved 0.177 Self-BLEU and 0.915 Distinct-2 in Japanese, showing better lexical diversity and reduced redundancy compared to other methods and language pairs. Conclusion: Adaptive Originality Filtering (AOF) improves the creative and culturally grounded generation in multilingual riddle generation without task-specific fine-tuning by reducing redundancy and enhancing lexical diversity. Abstract: Multilingual riddle generation challenges large language models (LLMs) to balance cultural fluency with creative abstraction. Standard prompting strategies -- zero-shot, few-shot, chain-of-thought -- tend to reuse memorized riddles or perform shallow paraphrasing. We introduce Adaptive Originality Filtering (AOF), a prompting framework that filters redundant generations using cosine-based similarity rejection, while enforcing lexical novelty and cross-lingual fidelity. Evaluated across three LLMs and four language pairs, AOF-enhanced GPT-4o achieves 0.177 Self-BLEU and 0.915 Distinct-2 in Japanese, signaling improved lexical diversity and reduced redundancy compared to other prompting methods and language pairs. Our findings show that semantic rejection can guide culturally grounded, creative generation without task-specific fine-tuning.
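
Cosine-based similarity rejection and the Distinct-2 metric can be sketched in a few lines. This is a minimal illustration, not the AOF implementation: it assumes bag-of-words count vectors as a stand-in for whatever embeddings the authors use, and the `0.8` threshold and the function names are hypothetical.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two sparse count vectors (Counters)."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def filter_novel(candidates, threshold=0.8):
    """Keep a candidate only if it is not too similar to any kept one.

    Bag-of-words vectors and the threshold are illustrative assumptions.
    """
    kept, vectors = [], []
    for text in candidates:
        vec = Counter(text.lower().split())
        if all(cosine(vec, other) < threshold for other in vectors):
            kept.append(text)
            vectors.append(vec)
    return kept

def distinct_2(texts):
    """Distinct-2: unique bigrams divided by total bigrams across texts."""
    bigrams = []
    for text in texts:
        tokens = text.lower().split()
        bigrams.extend(zip(tokens, tokens[1:]))
    return len(set(bigrams)) / len(bigrams) if bigrams else 0.0

if __name__ == "__main__":
    riddles = [
        "what has keys but no locks",
        "what has keys but no locks",
        "i speak without a mouth",
    ]
    novel = filter_novel(riddles)
    print(novel, distinct_2(novel))
```

A higher Distinct-2 and a lower Self-BLEU both indicate less repetition across generations, which is how the reported scores should be read.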

[24] EMMM, Explain Me My Model! Explainable Machine Generated Text Detection in Dialogues

Angela Yifei Yuan,Haoyi Li,Soyeon Caren Han,Christopher Leckie

Main category: cs.CL

TL;DR: This paper proposes EMMM, an explainable machine-generated text detection framework for online customer service that addresses large-scale user impersonation and strikes a good balance among friendliness to non-expert users, accuracy, and low latency.

Details Motivation: As large language models are widely adopted in customer service, malicious actors may use them to generate text for large-scale user impersonation, which places higher demands on existing detection methods. Method: The paper proposes EMMM, an explanation-then-detection framework that provides explainable machine-generated text detection in online customer service scenarios. Result: EMMM performs well in experiments: 70% of non-expert users preferred its explanatory outputs, and it also does well on accuracy and latency. Conclusion: EMMM is an explainable detection framework of practical value for detecting large-scale user impersonation. Abstract: The rapid adoption of large language models (LLMs) in customer service introduces new risks, as malicious actors can exploit them to conduct large-scale user impersonation through machine-generated text (MGT). Current MGT detection methods often struggle in online conversational settings, reducing the reliability and interpretability essential for trustworthy AI deployment. In customer service scenarios where operators are typically non-expert users, explanations become crucial for trustworthy MGT detection. In this paper, we propose EMMM, an explanation-then-detection framework that balances latency, accuracy, and non-expert-oriented interpretability. Experimental results demonstrate that EMMM provides explanations accessible to non-expert users, with 70% of human evaluators preferring its outputs, while achieving competitive accuracy compared to state-of-the-art models and maintaining low latency, generating outputs within 1 second. Our code and dataset are open-sourced at https://github.com/AngieYYF/EMMM-explainable-chatbot-detection.

[25] Beyond Quality: Unlocking Diversity in Ad Headline Generation with Large Language Models

Chang Wang,Siyu Yan,Depeng Yuan,Yuqi Chen,Yanhua Huang,Yuanhang Zheng,Shuhao Li,Yinqi Zhang,Kedi Chen,Mingrui Zhu,Ruiwen Xu

Main category: cs.CL

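TL;DR: The paper proposes DIVER, a framework that generates high-quality, diverse ad headlines through semantic- and style-aware data generation and multi-stage multi-objective optimization, and that improves advertiser value and click-through rate in real-world deployment.

Details Motivation: Current methods mainly optimize headline quality and click-through rate while overlooking the need for diversity, which leads to homogeneous outputs; a new framework that optimizes diversity and quality together is therefore needed. Method: The authors design a semantic- and style-aware data generation pipeline and adopt a multi-stage multi-objective optimization framework with supervised fine-tuning and reinforcement learning. Result: Experiments show that DIVER effectively balances quality and diversity on real-world industrial datasets; after deployment on a large-scale content-sharing platform, advertiser value and click-through rate improved by 4.0% and 1.4%, respectively. Conclusion: DIVER is an LLM-based ad headline generation framework that jointly optimizes diversity and quality; experiments show that it balances the two effectively, and it improved advertiser value and click-through rate in real deployment. Abstract: The generation of ad headlines plays a vital role in modern advertising, where both quality and diversity are essential to engage a broad range of audience segments. Current approaches primarily optimize language models for headline quality or click-through rates (CTR), often overlooking the need for diversity and resulting in homogeneous outputs. To address this limitation, we propose DIVER, a novel framework based on large language models (LLMs) that are jointly optimized for both diversity and quality. We first design a semantic- and stylistic-aware data generation pipeline that automatically produces high-quality training pairs with ad content and multiple diverse headlines. To achieve the goal of generating high-quality and diversified ad headlines within a single forward pass, we propose a multi-stage multi-objective optimization framework with supervised fine-tuning (SFT) and reinforcement learning (RL). Experiments on real-world industrial datasets demonstrate that DIVER effectively balances quality and diversity. Deployed on a large-scale content-sharing platform serving hundreds of millions of users, our framework improves advertiser value (ADVV) and CTR by 4.0% and 1.4%, respectively.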

[26] M3HG: Multimodal, Multi-scale, and Multi-type Node Heterogeneous Graph for Emotion Cause Triplet Extraction in Conversations

Qiao Liang,Ying Shen,Tiantian Chen,Lin Zhang

Main category: cs.CL

TL;DR: This paper proposes M3HG, a model that uses a multimodal heterogeneous graph to capture emotional and causal contexts, and validates its superior performance on MECAD, a newly constructed multi-scenario dataset.

Details Motivation: Existing MECTEC datasets are scarce and cover uniform scenarios, and existing methods fail to adequately model emotional and causal contexts or to fuse semantic information at multiple levels, which degrades performance. Method: The paper proposes M3HG, a model that uses a multimodal heterogeneous graph to model emotional and causal contexts and fuses contextual information at both the intra- and inter-utterance levels. Result: In extensive experiments, M3HG outperforms existing state-of-the-art methods on the MECTEC task. Conclusion: The paper proposes M3HG, which uses a multimodal heterogeneous graph to capture emotional and causal contexts effectively and to fuse semantic information at different levels, thereby outperforming existing methods on the MECTEC task. Abstract: Emotion Cause Triplet Extraction in Multimodal Conversations (MECTEC) has recently gained significant attention in social media analysis, aiming to extract emotion utterances, cause utterances, and emotion categories simultaneously. However, the scarcity of related datasets, with only one published dataset featuring highly uniform dialogue scenarios, hinders model development in this field. To address this, we introduce MECAD, the first multimodal, multi-scenario MECTEC dataset, comprising 989 conversations from 56 TV series spanning a wide range of dialogue contexts. In addition, existing MECTEC methods fail to explicitly model emotional and causal contexts and neglect the fusion of semantic information at different levels, leading to performance degradation. In this paper, we propose M3HG, a novel model that explicitly captures emotional and causal contexts and effectively fuses contextual information at both inter- and intra-utterance levels via a multimodal heterogeneous graph. Extensive experiments demonstrate the effectiveness of M3HG compared with existing state-of-the-art methods. The codes and dataset are available at https://github.com/redifinition/M3HG.

[27] Chronological Passage Assembling in RAG framework for Temporal Question Answering

Byeongjeong Kim,Jeonghyun Park,Joonho Yang,Hwanhee Lee

Main category: cs.CL

TL;DR: ChronoRAG is a novel framework that enhances narrative question answering by preserving temporal order and creating coherent passages, outperforming existing methods on the NarrativeQA dataset.

Details Motivation: Narrative texts require understanding broader context and sequential relationships, which existing RAG methods struggle to handle effectively. Method: ChronoRAG utilizes a retrieval-augmented generation framework to refine dispersed document information into structured passages while explicitly preserving the temporal order of narrative texts. Result: Experiments on the NarrativeQA dataset showed substantial improvements in tasks involving factual identification and comprehension of complex sequential relationships. Conclusion: ChronoRAG improves long-context question answering for narrative texts by focusing on coherent passage creation and maintaining temporal order among retrieved passages. Abstract: Long-context question answering over narrative tasks is challenging because correct answers often hinge on reconstructing a coherent timeline of events while preserving contextual flow in a limited context window. Retrieval-augmented generation (RAG) indexing methods aim to address this challenge by selectively retrieving only necessary document segments. However, narrative texts possess unique characteristics that limit the effectiveness of these existing approaches. Specifically, understanding narrative texts requires more than isolated segments, as the broader context and sequential relationships between segments are crucial for comprehension. To address these limitations, we propose ChronoRAG, a novel RAG framework specialized for narrative texts. This approach focuses on two essential aspects: refining dispersed document information into coherent and structured passages, and preserving narrative flow by explicitly capturing and maintaining the temporal order among retrieved passages. 
We empirically demonstrate the effectiveness of ChronoRAG through experiments on the NarrativeQA dataset, showing substantial improvements in tasks requiring both factual identification and comprehension of complex sequential relationships, underscoring that reasoning over temporal order is crucial in resolving narrative QA.
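
The two aspects named in the abstract, merging dispersed segments into coherent passages and keeping retrieved passages in temporal order, can be illustrated with a toy sketch. It is not the ChronoRAG implementation: it assumes that chunk position in the source document is a proxy for narrative time, that relevance scores are already computed, and the function name `assemble_chronologically` is hypothetical.

```python
def assemble_chronologically(chunks, scores, k=3):
    """Select the k most relevant chunks, then restore narrative order.

    chunks: list of text segments in their original document order
    scores: one relevance score per chunk (assumed precomputed by a retriever)
    Returns passages in which adjacent selected chunks are merged.
    """
    # Rank by relevance and keep the top k chunk indices.
    top = sorted(range(len(chunks)), key=lambda i: scores[i], reverse=True)[:k]
    # Re-sort by position so events appear in the order the story tells them.
    ordered = sorted(top)
    if not ordered:
        return []

    # Merge runs of consecutive indices into single coherent passages.
    groups, current = [], [ordered[0]]
    for i in ordered[1:]:
        if i == current[-1] + 1:
            current.append(i)
        else:
            groups.append(current)
            current = [i]
    groups.append(current)
    return [" ".join(chunks[i] for i in group) for group in groups]

if __name__ == "__main__":
    story = ["A", "B", "C", "D", "E"]
    relevance = [0.1, 0.9, 0.8, 0.2, 0.95]
    print(assemble_chronologically(story, relevance, k=3))
```

A plain top-k retriever would return the chunks in score order, which can place a later event before an earlier one; re-sorting by position avoids that reversal.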

[28] ThinkDial: An Open Recipe for Controlling Reasoning Effort in Large Language Models

Qianyu He,Siyu Yuan,Xuefeng Li,Mingxuan Wang,Jiangjie Chen

Main category: cs.CL

TL;DR: ThinkDial implements gpt-oss-style controllable reasoning by integrating budget-mode control into an end-to-end training paradigm; it can switch among three reasoning modes and shows good performance and generalization.

Details Motivation: Controlling the computational effort of large language models is a major challenge in practical deployment, and the open-source community has not yet achieved the control capabilities of proprietary systems. Method: Budget-mode control is integrated through an end-to-end training paradigm that includes budget-mode supervised fine-tuning and two-phase budget-aware reinforcement learning. Result: ThinkDial can switch among three distinct reasoning modes, achieves the target compression-performance trade-offs, and performs well on out-of-distribution tasks. Conclusion: ThinkDial is a successful open-source framework that implements gpt-oss-style controllable reasoning with good performance and generalization. Abstract: Large language models (LLMs) with chain-of-thought reasoning have demonstrated remarkable problem-solving capabilities, but controlling their computational effort remains a significant challenge for practical deployment. Recent proprietary systems like OpenAI's gpt-oss series have introduced discrete operational modes for intuitive reasoning control, but the open-source community has largely failed to achieve such capabilities. In this paper, we introduce ThinkDial, the first open-recipe end-to-end framework that successfully implements gpt-oss-style controllable reasoning through discrete operational modes. Our system enables seamless switching between three distinct reasoning regimes: High mode (full reasoning capability), Medium mode (50 percent token reduction with <10 percent performance degradation), and Low mode (75 percent token reduction with <15 percent performance degradation). We achieve this through an end-to-end training paradigm that integrates budget-mode control throughout the entire pipeline: budget-mode supervised fine-tuning that embeds controllable reasoning capabilities directly into the learning process, and two-phase budget-aware reinforcement learning with adaptive reward shaping. Extensive experiments demonstrate that ThinkDial achieves target compression-performance trade-offs with clear response length reductions while maintaining performance thresholds. The framework also exhibits strong generalization capabilities on out-of-distribution tasks.

[29] Harnessing Rule-Based Reinforcement Learning for Enhanced Grammatical Error Correction

Yilin Li,Xunjian Yin,Yilin Chen,Xiaojun Wan

Main category: cs.CL

TL;DR: This paper introduces a Rule-Based RL framework for grammatical error correction, achieving superior performance on Chinese datasets by leveraging the reasoning capabilities of LLMs.

Details Motivation: Traditional encoder-decoder models and supervised fine-tuning methods limit the reasoning ability of LLMs in grammatical error correction. The motivation is to explore a more effective and controllable approach using Rule-Based RL. Method: The researchers proposed a novel framework based on Rule-Based Reinforcement Learning (RL) and conducted experiments on Chinese datasets to evaluate its performance in grammatical error correction. Result: The Rule-Based RL framework achieved state-of-the-art performance on Chinese datasets, with a notable increase in recall, demonstrating the effectiveness of RL in steering LLMs for GEC. Conclusion: The study concludes that using a Rule-Based RL framework offers a more controllable and reliable paradigm for grammatical error correction (GEC) compared to traditional supervised fine-tuning methods. Abstract: Grammatical error correction is a significant task in NLP. Traditional methods based on encoder-decoder models have achieved certain success, but the application of LLMs in this field is still underexplored. Current research predominantly relies on supervised fine-tuning to train LLMs to directly generate the corrected sentence, which limits the model's powerful reasoning ability. To address this limitation, we propose a novel framework based on Rule-Based RL. Through experiments on Chinese datasets, our Rule-Based RL framework achieves state-of-the-art performance, with a notable increase in recall. This result clearly highlights the advantages of using RL to steer LLMs, offering a more controllable and reliable paradigm for future development in GEC.

[30] Controllable Conversational Theme Detection Track at DSTC 12

Igor Shalyminov,Hang Su,Jake Vincent,Siffi Singh,Jason Cai,James Gung,Raphael Shu,Saab Mansour

Main category: cs.CL

TL;DR: This paper introduces Controllable Conversational Theme Detection as a new task in conversational analytics, aiming to automate and customize topic identification in dialogs, particularly for customer support or sales applications.

Details Motivation: The motivation of this paper is to advance conversational analytics by automating the identification and categorization of topics within conversations, reducing manual effort in dialog analysis, especially in domains like customer support or sales. Method: The authors introduced the Controllable Conversational Theme Detection task as a public competition track at DSTC 12, framed as joint clustering and theme labeling of dialog utterances. They provided an overview of the problem, dataset, evaluation metrics, and discussed participant submissions. Result: The paper presents an overview of the Controllable Conversational Theme Detection problem, including its dataset, evaluation metrics (both automatic and human), and insights from participant teams' submissions. The track materials, including data and code, are openly available. Conclusion: This paper concludes that the Controllable Conversational Theme Detection task can significantly improve the automation and customization of dialog analysis, providing flexibility in theme surface forms and user-specific customizations. Abstract: Conversational analytics has been on the forefront of transformation driven by the advances in Speech and Natural Language Processing techniques. Rapid adoption of Large Language Models (LLMs) in the analytics field has taken the problems that can be automated to a new level of complexity and scale. In this paper, we introduce Theme Detection as a critical task in conversational analytics, aimed at automatically identifying and categorizing topics within conversations. This process can significantly reduce the manual effort involved in analyzing expansive dialogs, particularly in domains like customer support or sales. Unlike traditional dialog intent detection, which often relies on a fixed set of intents for downstream system logic, themes are intended as a direct, user-facing summary of the conversation's core inquiry. 
This distinction allows for greater flexibility in theme surface forms and user-specific customizations. We pose the Controllable Conversational Theme Detection problem as a public competition track at the Dialog System Technology Challenge (DSTC) 12 -- it is framed as joint clustering and theme labeling of dialog utterances, with the distinctive aspect being controllability of the resulting theme clusters' granularity achieved via the provided user preference data. We give an overview of the problem, the associated dataset and the evaluation metrics, both automatic and human. Finally, we discuss the participant teams' submissions and provide insights from those. The track materials (data and code) are openly available in the GitHub repository.

[31] LaTeXTrans: Structured LaTeX Translation with Multi-Agent Coordination

Ziming Zhu,Chenglong Wang,Shunjie Xing,Yifu Huo,Fengning Tian,Quan Du,Di Yang,Chunliang Zhang,Tong Xiao,Jingbo Zhu

Main category: cs.CL

TL;DR: This paper introduces LaTeXTrans, a multi-agent system for translating LaTeX documents, which achieves better translation accuracy and structural fidelity than existing machine translation systems.

Details Motivation: Despite progress in machine translation, translating structured LaTeX-formatted documents remains challenging due to the need to accurately preserve domain-specific syntax while maintaining semantic integrity and compilability. Method: The paper introduces LaTeXTrans, a collaborative multi-agent system that includes a Parser, Translator, Validator, Summarizer, Terminology Extractor, and Generator to ensure format preservation, structural fidelity, and terminology consistency. Result: Experimental results show that LaTeXTrans outperforms mainstream machine translation systems in both translation accuracy and structural fidelity. Conclusion: LaTeXTrans is an effective and practical solution for translating LaTeX-formatted documents, offering improvements in translation accuracy and structural fidelity compared to mainstream MT systems. Abstract: Despite the remarkable progress of modern machine translation (MT) systems on general-domain texts, translating structured LaTeX-formatted documents remains a significant challenge. These documents typically interleave natural language with domain-specific syntax, such as mathematical equations, tables, figures, and cross-references, all of which must be accurately preserved to maintain semantic integrity and compilability. In this paper, we introduce LaTeXTrans, a collaborative multi-agent system designed to address this challenge. LaTeXTrans ensures format preservation, structural fidelity, and terminology consistency through six specialized agents: 1) a Parser that decomposes LaTeX into translation-friendly units via placeholder substitution and syntax filtering; 2) a Translator, Validator, Summarizer, and Terminology Extractor that work collaboratively to ensure context-aware, self-correcting, and terminology-consistent translations; 3) a Generator that reconstructs the translated content into well-structured LaTeX documents. 
Experimental results demonstrate that LaTeXTrans can outperform mainstream MT systems in both translation accuracy and structural fidelity, offering an effective and practical solution for translating LaTeX-formatted documents.
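
The placeholder substitution that the abstract attributes to the Parser can be illustrated with a minimal Python sketch. This is not the LaTeXTrans implementation: the regular expression, the `<PHn>` placeholder format, and the function names `mask` and `unmask` are assumptions chosen for illustration, and the pattern only covers inline math and a few single-argument commands.

```python
import re

# Hypothetical pattern: inline math ($...$) and a few commands with one braced argument.
PROTECTED = re.compile(r"\$[^$]+\$|\\(?:cite|ref|label)\{[^{}]*\}")

def mask(text):
    """Replace protected LaTeX spans with numbered placeholders."""
    spans = []

    def stash(match):
        spans.append(match.group(0))
        return f"<PH{len(spans) - 1}>"

    return PROTECTED.sub(stash, text), spans

def unmask(text, spans):
    """Restore the original LaTeX spans after translation."""
    return re.sub(r"<PH(\d+)>", lambda m: spans[int(m.group(1))], text)

source = r"The loss $L = x^2$ is shown in \ref{fig:loss}."
masked, spans = mask(source)      # only `masked` would be sent to a translator
restored = unmask(masked, spans)  # round trip without translation
```

In this sketch the translator only ever sees natural language and opaque placeholders, so math and cross-references cannot be altered in translation; a real system would also need to handle nested braces, environments, and placeholders that a translator drops or reorders.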

[32] LLM-based Contrastive Self-Supervised AMR Learning with Masked Graph Autoencoders for Fake News Detection

Shubham Gupta,Shraban Kumar Chatterjee,Suman Kundu

Main category: cs.CL

TL;DR: This paper proposes a self-supervised misinformation detection framework integrating semantic and propagation dynamics, using AMR and an LLM-based graph contrastive loss, achieving better performance with limited labeled data.

Details Motivation: Misinformation in the digital age poses significant societal challenges. Existing approaches struggle with capturing long-range dependencies, complex semantic relations, and social dynamics affecting news dissemination, and they often require extensive labeled datasets. Method: The study introduces a novel self-supervised framework that uses Abstract Meaning Representation (AMR) for capturing complex semantic relations and a multi-view graph masked autoencoder to model news propagation dynamics. Additionally, an LLM-based graph contrastive loss (LGCL) is proposed to enhance feature separability in a zero-shot manner. Result: Extensive experiments demonstrate that the proposed self-supervised framework achieves superior performance compared to other state-of-the-art methodologies, even with limited labeled datasets, while enhancing generalizability. Conclusion: The proposed self-supervised misinformation detection framework effectively combines semantic and propagation-based features, outperforming state-of-the-art methods especially in scenarios with limited labeled data, while improving generalizability. Abstract: The proliferation of misinformation in the digital age has led to significant societal challenges. Existing approaches often struggle with capturing long-range dependencies, complex semantic relations, and the social dynamics influencing news dissemination. Furthermore, these methods require extensive labelled datasets, making their deployment resource-intensive. In this study, we propose a novel self-supervised misinformation detection framework that integrates both complex semantic relations using Abstract Meaning Representation (AMR) and news propagation dynamics. We introduce an LLM-based graph contrastive loss (LGCL) that utilizes negative anchor points generated by a Large Language Model (LLM) to enhance feature separability in a zero-shot manner. 
To incorporate social context, we employ a multi-view graph masked autoencoder, which learns news propagation features from the social context graph. By combining these semantic and propagation-based features, our approach effectively differentiates between fake and real news in a self-supervised manner. Extensive experiments demonstrate that our self-supervised framework achieves superior performance compared to other state-of-the-art methodologies, even with limited labelled datasets, while improving generalizability.

[33] Arrows of Math Reasoning Data Synthesis for Large Language Models: Diversity, Complexity and Correctness

Sirui Chen,Changxin Tian,Binbin Hu,Kunlong Chen,Ziqi Liu,Zhiqiang Zhang,Jun Zhou

Main category: cs.CL

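TL;DR: This paper proposes a program-assisted synthesis framework for generating a high-quality dataset of math problems and demonstrates its effectiveness experimentally.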

Details Motivation: Enhancing the mathematical reasoning of large language models requires high-quality training data, yet conventional methods face major challenges in scalability, cost, and data reliability. Method: The authors propose a novel program-assisted synthesis framework that systematically generates a high-quality mathematical corpus with diversity, complexity, and correctness. Result: The authors generated 12.3 million problem-solving triples, and experiments show that models fine-tuned on this data significantly improve their reasoning capabilities. Conclusion: Models fine-tuned on the dataset show significantly improved reasoning and reach state-of-the-art performance on several benchmark datasets, demonstrating the effectiveness of the synthesis approach. Abstract: Enhancing the mathematical reasoning of large language models (LLMs) demands high-quality training data, yet conventional methods face critical challenges in scalability, cost, and data reliability. To address these limitations, we propose a novel program-assisted synthesis framework that systematically generates a high-quality mathematical corpus with guaranteed diversity, complexity, and correctness. This framework integrates mathematical knowledge systems and domain-specific tools to create executable programs. These programs are then translated into natural language problem-solution pairs and vetted by a bilateral validation mechanism that verifies solution correctness against program outputs and ensures program-problem consistency. We have generated 12.3 million such problem-solving triples. Experiments demonstrate that models fine-tuned on our data significantly improve their inference capabilities, achieving state-of-the-art performance on several benchmark datasets and showcasing the effectiveness of our synthesis approach.

[34] ConfTuner: Training Large Language Models to Express Their Confidence Verbally

Yibo Li,Miao Xiong,Jiaying Wu,Bryan Hooi

Main category: cs.CL

TL;DR: ConfTuner is a fine-tuning method that improves the calibration of large language models' verbalized confidence, enabling downstream gains in self-correction and model cascade.

Details Motivation: Current LLMs often exhibit overconfidence, generating incorrect answers with high confidence. Existing approaches to calibrate LLMs have limited effectiveness and generalizability. Method: ConfTuner uses a new loss function called the tokenized Brier score, which is theoretically proven to be a proper scoring rule, eliminating the need for ground-truth confidence scores or proxy confidence estimates. Result: ConfTuner improves calibration across diverse reasoning tasks and generalizes to black-box models like GPT-4o. Better-calibrated confidence enables downstream gains in self-correction and model cascade. Conclusion: ConfTuner is a simple and efficient fine-tuning method that improves calibration across diverse reasoning tasks and generalizes to black-box models like GPT-4o, enabling downstream gains in self-correction and model cascade. Abstract: Large Language Models (LLMs) are increasingly deployed in high-stakes domains such as science, law, and healthcare, where accurate expressions of uncertainty are essential for reliability and trust. However, current LLMs are often observed to generate incorrect answers with high confidence, a phenomenon known as "overconfidence". Recent efforts have focused on calibrating LLMs' verbalized confidence: i.e., their expressions of confidence in text form, such as "I am 80% confident that...". Existing approaches either rely on prompt engineering or fine-tuning with heuristically generated uncertainty estimates, both of which have limited effectiveness and generalizability. Motivated by the notion of proper scoring rules for calibration in classical machine learning models, we introduce ConfTuner, a simple and efficient fine-tuning method that introduces minimal overhead and does not require ground-truth confidence scores or proxy confidence estimates. 
ConfTuner relies on a new loss function, tokenized Brier score, which we theoretically prove to be a proper scoring rule, intuitively meaning that it "correctly incentivizes the model to report its true probability of being correct". ConfTuner improves calibration across diverse reasoning tasks and generalizes to black-box models such as GPT-4o. Our results further show that better-calibrated confidence enables downstream gains in self-correction and model cascade, advancing the development of trustworthy LLM systems. The code is available at https://github.com/liushiliushi/ConfTuner.
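
The idea of a proper scoring rule can be checked numerically with the classical Brier score. The sketch below is not ConfTuner's tokenized Brier score, whose exact form is not given here; it only illustrates, under the assumption of a single answer that is correct with probability `p_true`, that the expected squared error is smallest when the reported confidence equals that probability. The function names and the grid of candidate confidences are illustrative.

```python
def brier(confidence, correct):
    """Squared error between a stated confidence and the 0/1 outcome."""
    return (confidence - float(correct)) ** 2

def expected_brier(confidence, p_correct):
    """Expected Brier score when the answer is right with probability p_correct."""
    return (p_correct * brier(confidence, True)
            + (1 - p_correct) * brier(confidence, False))

p_true = 0.7                           # assumed true probability of being correct
grid = [i / 10 for i in range(11)]     # candidate reported confidences 0.0 .. 1.0
best = min(grid, key=lambda c: expected_brier(c, p_true))
```

Expanding the expectation gives p(1 - p) + (c - p)^2, so any over- or under-statement of confidence c adds a penalty of (c - p)^2, which is why honest reporting is the optimal strategy under this rule.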

[35] ReflectivePrompt: Reflective evolution in autoprompting algorithms

Viktor N. Zhuravlev,Artur R. Khairullin,Ernest A. Dyagin,Alena N. Sitkina,Nikita I. Kulin

Main category: cs.CL

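TL;DR: ReflectivePrompt is a new method that uses a reflective evolutionary algorithm to optimize language model prompts, and it performs well across multiple tasks and models.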

Details Motivation: With the rapid development of large language models (LLMs), research on prompt engineering has driven the popularity of automatically selecting optimized prompts (autoprompting). Method: ReflectivePrompt applies short-term and long-term reflection operations before crossover and elitist mutation to improve the quality of the modifications they introduce, and at each generation it updates the knowledge accumulated during evolution based on the current population. Result: On classification and text generation tasks with the open-access LLMs t-lite-instruct-0.1 and gemma3-27b-it, ReflectivePrompt significantly improves metrics on average over current state-of-the-art methods (e.g., 28% on BBH compared to EvoPrompt). Conclusion: ReflectivePrompt is a novel autoprompting method based on evolutionary algorithms that uses a reflective evolution approach to make the search for optimal prompts more precise and comprehensive. Abstract: Autoprompting is the process of automatically selecting optimized prompts for language models, which has been gaining popularity with the rapid advancement of prompt engineering, driven by extensive research in the field of large language models (LLMs). This paper presents ReflectivePrompt - a novel autoprompting method based on evolutionary algorithms that employs a reflective evolution approach for more precise and comprehensive search of optimal prompts. ReflectivePrompt utilizes short-term and long-term reflection operations before crossover and elitist mutation to enhance the quality of the modifications they introduce. This method allows for the accumulation of knowledge obtained throughout the evolution process and updates it at each epoch based on the current population. ReflectivePrompt was tested on 33 datasets for classification and text generation tasks using open-access large language models: t-lite-instruct-0.1 and gemma3-27b-it. The method demonstrates, on average, a significant improvement (e.g., 28% on BBH compared to EvoPrompt) in metrics relative to current state-of-the-art approaches, thereby establishing itself as one of the most effective solutions in evolutionary algorithm-based autoprompting.

[36] Empowering Computing Education Researchers Through LLM-Assisted Content Analysis

Laurie Gale,Sebastian Mateos Nicolajsen

Main category: cs.CL

TL;DR: This paper introduces LLM-assisted content analysis (LACA) as a scalable and rigorous method for analyzing large volumes of qualitative data in computing education research, aiming to improve the generalizability of findings and advance the discipline.

Details Motivation: The motivation stems from the challenges faced by computing education researchers in conducting generalizable and rigorous research due to limitations in colleagues, resources, or capacity. There is a need for methods that can handle larger volumes of qualitative data without increasing the burden on researchers. Method: The paper proposes a variation of LLM-assisted content analysis (LACA), combining traditional content analysis with large language models to analyze large volumes of textual data in a reproducible and rigorous manner. Result: The proposed LACA method demonstrates potential in enabling researchers to conduct large-scale studies that were previously unfeasible, thereby allowing for more generalizable findings and advancing the field of computing education research. Conclusion: The paper concludes that LLM-assisted content analysis (LACA) has the potential to enhance computing education research (CER) by enabling more generalizable findings and improving both research practice and quality in the discipline. Abstract: Computing education research (CER) is often instigated by practitioners wanting to improve both their own and the wider discipline's teaching practice. However, the latter is often difficult as many researchers lack the colleagues, resources, or capacity to conduct research that is generalisable or rigorous enough to advance the discipline. As a result, research methods that enable sense-making with larger volumes of qualitative data, while not increasing the burden on the researcher, have significant potential within CER. In this discussion paper, we propose such a method for conducting rigorous analysis on large volumes of textual data, namely a variation of LLM-assisted content analysis (LACA). This method combines content analysis with the use of large language models, empowering researchers to conduct larger-scale research which they would otherwise not be able to perform. 
Using a computing education dataset, we illustrate how LACA could be applied in a reproducible and rigorous manner. We believe this method has potential in CER, enabling more generalisable findings from a wider range of research. This, together with the development of similar methods, can help to advance both the practice and research quality of the CER discipline.

[37] Affective Polarization across European Parliaments

Bojan Evkoski,Igor Mozetič,Nikola Ljubešić,Petra Kralj Novak

Main category: cs.CL

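TL;DR: This study uses natural language processing to analyze speeches from the parliaments of six European countries, finding pervasive affective polarization among parliamentarians and identifying reciprocity as an important mechanism behind it.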

Details Motivation: Affective polarization (increased negativity and hostility towards opposing groups) has become a prominent feature of political discourse worldwide, and studying it in parliaments helps to understand antagonism in political interactions. Method: The researchers use natural language processing techniques to automatically analyze speeches from the parliaments of six European countries, assessing affective polarization by comparing the negativity that parliamentarians express when referring to opposing groups versus their own group. Result: Affective polarization is present across the parliaments of all six countries; although activity correlates with negativity, there is no significant difference in affective polarization between less active and more active members. Conclusion: Affective polarization is pervasive in the six European parliaments studied, and reciprocity is an important contributing factor. Abstract: Affective polarization, characterized by increased negativity and hostility towards opposing groups, has become a prominent feature of political discourse worldwide. Our study examines the presence of this type of polarization in a selection of European parliaments in a fully automated manner. Utilizing a comprehensive corpus of parliamentary speeches from the parliaments of six European countries, we employ natural language processing techniques to estimate parliamentarian sentiment. By comparing the levels of negativity conveyed in references to individuals from opposing groups versus one's own, we discover patterns of affectively polarized interactions. The findings demonstrate the existence of consistent affective polarization across all six European parliaments. Although activity correlates with negativity, there is no observed difference in affective polarization between less active and more active members of parliament. Finally, we show that reciprocity is a contributing mechanism in affective polarization between parliamentarians across all six parliaments.
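
The comparison of negativity towards opposing groups versus one's own group can be made concrete with a small Python sketch. This is not the authors' pipeline: the records below are invented toy values, the per-mention negativity scores are assumed to come from some sentiment model, and `polarization_gap` is a hypothetical name for one simple way to summarize the difference.

```python
from statistics import mean

# Toy records, invented for illustration only:
# (speaker_party, target_party, negativity score in [0, 1]).
mentions = [
    ("A", "A", 0.10), ("A", "B", 0.60),
    ("B", "B", 0.20), ("B", "A", 0.50),
]

def polarization_gap(records):
    """Mean negativity towards other parties minus mean negativity towards one's own."""
    in_group = [neg for speaker, target, neg in records if speaker == target]
    out_group = [neg for speaker, target, neg in records if speaker != target]
    return mean(out_group) - mean(in_group)

gap = polarization_gap(mentions)  # a positive gap suggests affective polarization
```

A gap near zero would mean that speakers are about as negative towards their own group as towards others; the study's actual estimator and significance testing may differ from this difference of means.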

[38] Diverse And Private Synthetic Datasets Generation for RAG evaluation: A multi-agent framework

Ilias Driouich,Hongliu Cao,Eoin Thomas

Main category: cs.CL

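TL;DR: We propose a multi-agent framework for generating synthetic QA datasets for evaluating RAG systems, ensuring semantic diversity and privacy preservation.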

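Details Motivation: The effectiveness and trustworthiness of RAG systems depend heavily on how they are evaluated, particularly on whether the evaluation process captures real-world constraints such as protecting sensitive information. While current evaluation efforts for RAG systems focus mainly on developing performance metrics, far less attention has been paid to the design and quality of the underlying evaluation datasets, despite their key role in enabling meaningful, reliable evaluation. Method: We introduce a novel multi-agent framework for generating synthetic QA datasets for RAG evaluation that prioritizes semantic diversity and privacy preservation. It comprises (1) a Diversity agent that uses clustering techniques to maximize topical coverage and semantic variability, (2) a Privacy agent that detects and masks sensitive information across multiple domains, and (3) a QA curation agent that synthesizes private and diverse QA pairs suitable for RAG evaluation. Result: Extensive experiments show that our evaluation sets outperform baseline methods in diversity and achieve robust privacy masking on domain-specific datasets. Conclusion: This work offers a practical and ethically aligned pathway toward safer, more comprehensive RAG system evaluation, laying the foundation for future enhancements aligned with evolving AI regulations and compliance standards. Abstract: Retrieval-augmented generation (RAG) systems improve large language model outputs by incorporating external knowledge, enabling more informed and context-aware responses. However, the effectiveness and trustworthiness of these systems critically depend on how they are evaluated, particularly on whether the evaluation process captures real-world constraints like protecting sensitive information. While current evaluation efforts for RAG systems have primarily focused on the development of performance metrics, far less attention has been given to the design and quality of the underlying evaluation datasets, despite their pivotal role in enabling meaningful, reliable assessments. In this work, we introduce a novel multi-agent framework for generating synthetic QA datasets for RAG evaluation that prioritize semantic diversity and privacy preservation. Our approach involves: (1) a Diversity agent leveraging clustering techniques to maximize topical coverage and semantic variability, (2) a Privacy Agent that detects and masks sensitive information across multiple domains, and (3) a QA curation agent that synthesizes private and diverse QA pairs suitable as ground truth for RAG evaluation. Extensive experiments demonstrate that our evaluation sets outperform baseline methods in diversity and achieve robust privacy masking on domain-specific datasets. 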
This work offers a practical and ethically aligned pathway toward safer, more comprehensive RAG system evaluation, laying the foundation for future enhancements aligned with evolving AI regulations and compliance standards.

[39] Interpretable by AI Mother Tongue: Native Symbolic Reasoning in Neural Models

Hung Ming Liu

Main category: cs.CL

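TL;DR: This paper proposes a framework in which neural models develop an AI Mother Tongue that enables intuitive reasoning, compositional symbol chains, and inherent interpretability.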

Details Motivation: We present a framework in which neural models develop an AI Mother Tongue, a native symbolic language that simultaneously supports intuitive reasoning, compositional symbol chains, and inherent interpretability. Method: We introduce complementary training objectives to enhance symbol purity and decision sparsity, and employ a sequential specialization strategy that first builds broad symbolic competence and then refines intuitive judgments. Result: Experiments on AI tasks show competitive accuracy alongside verifiable reasoning traces. Conclusion: The experiments indicate that AI Mother Tongue can serve as a unified mechanism for interpretability, intuition, and symbolic reasoning in neural models. Abstract: We present a framework where neural models develop an AI Mother Tongue, a native symbolic language that simultaneously supports intuitive reasoning, compositional symbol chains, and inherent interpretability. Unlike post-hoc explanation methods, our approach embeds reasoning directly into the model's representations: symbols capture meaningful semantic patterns, chains trace decision paths, and gated induction mechanisms guide selective focus, yielding transparent yet flexible reasoning. We introduce complementary training objectives to enhance symbol purity and decision sparsity, and employ a sequential specialization strategy to first build broad symbolic competence and then refine intuitive judgments. Experiments on AI tasks demonstrate competitive accuracy alongside verifiable reasoning traces, showing that AI Mother Tongue can serve as a unified mechanism for interpretability, intuition, and symbolic reasoning in neural models.

[40] Automatic Prompt Optimization with Prompt Distillation

Viktor N. Zhuravlev,Artur R. Khairullin,Ernest A. Dyagin,Alena N. Sitkina,Nikita I. Kulin

Main category: cs.CL

TL;DR: This paper introduces DistillPrompt, a novel autoprompting technique using distillation, compression, and aggregation to optimize prompts for large language models (LLMs), achieving significant performance improvements over existing methods.

Details Motivation: Autoprompting is increasingly important due to the growth of prompt engineering and large language models (LLMs), driving the need for efficient, non-gradient methods to optimize prompts. Method: DistillPrompt uses distillation, compression, and aggregation operations to integrate task-specific information into prompts across multiple stages, thoroughly exploring the prompt space for optimization. Result: The method showed a significant average improvement of 20.12% across datasets in text classification and generation tasks using the t-lite-instruct-0.1 language model compared to existing approaches like Grips. Conclusion: DistillPrompt proves to be one of the most effective non-gradient approaches in autoprompting, significantly improving performance metrics compared to existing methods. Abstract: Autoprompting is the process of automatically selecting optimized prompts for language models, which is gaining popularity due to the rapid development of prompt engineering driven by extensive research in the field of large language models (LLMs). This paper presents DistillPrompt -- a novel autoprompting method based on large language models that employs a multi-stage integration of task-specific information into prompts using training data. DistillPrompt utilizes distillation, compression, and aggregation operations to explore the prompt space more thoroughly. The method was tested on different datasets for text classification and generation tasks using the t-lite-instruct-0.1 language model. The results demonstrate a significant average improvement (e.g., 20.12% across the entire dataset compared to Grips) in key metrics over existing methods in the field, establishing DistillPrompt as one of the most effective non-gradient approaches in autoprompting.

[41] MovieCORE: COgnitive REasoning in Movies

Gueter Josmy Faure,Min-Hung Chen,Jia-Fong Yeh,Ying Cheng,Hung-Ting Su,Yung-Hao Tang,Shang-Hong Lai,Winston H. Hsu

Main category: cs.CL

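TL;DR: This paper introduces MovieCORE, a new dataset for video question answering that focuses on improving AI systems' deeper cognitive understanding of movie content.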

Details Motivation: Existing video question answering datasets focus mainly on surface-level comprehension and do not challenge deeper cognitive abilities; MovieCORE aims to fill this gap. Method: Multiple large language models are used as thought agents to generate high-quality question-answer pairs, and an Agentic Choice Enhancement module is proposed to improve model reasoning. Result: MovieCORE improves the performance of video question answering models on deep cognitive tasks, and the evaluation shows clear advantages in depth, thought-provocation potential, and syntactic complexity. Conclusion: MovieCORE provides a more challenging dataset for video question answering models and advances deep understanding of movie content; it also contributes an effective cognitive evaluation scheme and the Agentic Choice Enhancement module, which improves model reasoning. Abstract: This paper introduces MovieCORE, a novel video question answering (VQA) dataset designed to probe deeper cognitive understanding of movie content. Unlike existing datasets that focus on surface-level comprehension, MovieCORE emphasizes questions that engage System-2 thinking while remaining specific to the video material. We present an innovative agentic brainstorming approach, utilizing multiple large language models (LLMs) as thought agents to generate and refine high-quality question-answer pairs. To evaluate dataset quality, we develop a set of cognitive tests assessing depth, thought-provocation potential, and syntactic complexity. We also propose a comprehensive evaluation scheme for assessing VQA model performance on deeper cognitive tasks. To address the limitations of existing video-language models (VLMs), we introduce an agentic enhancement module, Agentic Choice Enhancement (ACE), which improves model reasoning capabilities post-training by up to 25%. Our work contributes to advancing movie understanding in AI systems and provides valuable insights into the capabilities and limitations of current VQA models when faced with more challenging, nuanced questions about cinematic content. Our project page, dataset and code can be found at https://joslefaure.github.io/assets/html/moviecore.html.

[42] HiPlan: Hierarchical Planning for LLM-Based Agents with Adaptive Global-Local Guidance

Ziyue Li,Yuan Chang,Gaihong Yu,Xiaoqiu Le

Main category: cs.CL

TL;DR: HiPlan is a hierarchical planning framework that improves the decision-making of LLM-based agents in complex tasks by offering structured guidance and dynamic adaptation to correct deviations.

Details Motivation: LLM-based agents struggle with complex, long-horizon planning scenarios due to lack of macroscopic guidance and insufficient continuous oversight, leading to disorientation, failures, and deviations. Method: HiPlan decomposes complex tasks into milestone action guides for general direction and step-wise hints for detailed actions. It constructs a milestone library from expert demonstrations during the offline phase and dynamically adapts trajectory segments during execution to generate step-wise hints. Result: Extensive experiments on two challenging benchmarks show that HiPlan significantly outperforms strong baselines, with ablation studies confirming the complementary benefits of its hierarchical components. Conclusion: HiPlan provides an effective solution for enhancing the decision-making of LLM-based agents in complex, long-horizon planning scenarios by offering adaptive global-local guidance. Abstract: Large language model (LLM)-based agents have demonstrated remarkable capabilities in decision-making tasks, but struggle significantly with complex, long-horizon planning scenarios. This arises from their lack of macroscopic guidance, causing disorientation and failures in complex tasks, as well as insufficient continuous oversight during execution, rendering them unresponsive to environmental changes and prone to deviations. To tackle these challenges, we introduce HiPlan, a hierarchical planning framework that provides adaptive global-local guidance to boost LLM-based agents' decision-making. HiPlan decomposes complex tasks into milestone action guides for general direction and step-wise hints for detailed actions. During the offline phase, we construct a milestone library from expert demonstrations, enabling structured experience reuse by retrieving semantically similar tasks and milestones. 
In the execution phase, trajectory segments from past milestones are dynamically adapted to generate step-wise hints that align current observations with the milestone objectives, bridging gaps and correcting deviations. Extensive experiments across two challenging benchmarks demonstrate that HiPlan substantially outperforms strong baselines, and ablation studies validate the complementary benefits of its hierarchical components.

[43] "Where does it hurt?" -- Dataset and Study on Physician Intent Trajectories in Doctor Patient Dialogues

Tom Röhr,Soumyadeep Roy,Fares Al Mohamad,Jens-Michalis Papaioannou,Wolfgang Nejdl,Felix Gers,Alexander Löser

Main category: cs.CL

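TL;DR: This study is the first to explore physician intent trajectories in doctor-patient dialogues. It builds a physician intent taxonomy based on the SOAP framework and creates a large-scale annotated dataset for model benchmarking. The models capture dialogue structure well but struggle to identify transitions between SOAP categories; the study also reveals common dialogue trajectories and shows that intent filtering improves summarization performance.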

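Details Motivation: In doctor-patient dialogues, the physician's primary objective is to diagnose the patient and propose a treatment plan. Physicians use targeted questioning to efficiently gather the information needed for the best possible patient outcomes, yet no prior work has studied physician intent trajectories. Method: Using the "Ambient Clinical Intelligence Benchmark" (Aci-bench) dataset and working with medical professionals, the authors develop a fine-grained taxonomy of physician intents based on the SOAP framework. They then recruit a large number of medical experts through the Prolific platform to annotate over 5000 doctor-patient turns, and use the annotated data to benchmark state-of-the-art generative and encoder models. Result: The authors create a large-scale annotated dataset and use it to benchmark state-of-the-art models for medical intent classification. The models perform well at understanding the overall structure of medical dialogues but fall short at identifying transitions between SOAP categories. The study reveals common trajectories in medical dialogue structure and finds that intent filtering significantly improves medical dialogue summarization. Conclusion: Although the models understand the overall structure of medical dialogues, they have difficulty identifying transitions between SOAP categories. The study also reports, for the first time, common trajectories in medical dialogue structures, which offer valuable insights for designing "differential diagnosis" systems, and it extensively examines the performance gains that intent filtering brings to medical dialogue summarization. Abstract: In a doctor-patient dialogue, the primary objective of physicians is to diagnose patients and propose a treatment plan. Medical doctors guide these conversations through targeted questioning to efficiently gather the information required to provide the best possible outcomes for patients. To the best of our knowledge, this is the first work that studies physician intent trajectories in doctor-patient dialogues. We use the `Ambient Clinical Intelligence Benchmark' (Aci-bench) dataset for our study. We collaborate with medical professionals to develop a fine-grained taxonomy of physician intents based on the SOAP framework (Subjective, Objective, Assessment, and Plan). We then conduct a large-scale annotation effort to label over 5000 doctor-patient turns with the help of a large number of medical experts recruited using Prolific, a popular crowd-sourcing platform. This large labeled dataset is an important resource contribution that we use for benchmarking the state-of-the-art generative and encoder models for medical intent classification tasks. Our findings show that our models understand the general structure of medical dialogues with high accuracy, but often fail to identify transitions between SOAP categories. We also report for the first time common trajectories in medical dialogue structures that provide valuable insights for designing `differential diagnosis' systems. 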
Finally, we extensively study the impact of intent filtering for medical dialogue summarization and observe a significant boost in performance. We make the codes and data, including annotation guidelines, publicly available at https://github.com/DATEXIS/medical-intent-classification.

[44] It's All About In-Context Learning! Teaching Extremely Low-Resource Languages to LLMs

Yue Li,Zhixue Zhao,Carolina Scarton

Main category: cs.CL

TL;DR: This paper analyzes the effectiveness of in-context learning and parameter-efficient fine-tuning for adapting large language models to extremely low-resource languages, particularly those in rare scripts, and provides guidelines for optimal adaptation strategies.

Details Motivation: Extremely low-resource languages, especially those written in rare scripts, remain largely unsupported by large language models (LLMs), primarily due to a lack of training data. This paper aims to comprehensively analyze whether LLMs can acquire such languages through in-context learning or parameter-efficient fine-tuning. Method: The paper systematically evaluates 20 under-represented languages across three state-of-the-art multilingual LLMs, analyzing the effectiveness of in-context learning (ICL) with or without auxiliary alignment signals and comparing them to parameter-efficient fine-tuning (PEFT). Result: The findings reveal limitations in parameter-efficient fine-tuning (PEFT) when both the language and its script are under-represented in LLMs. Zero-shot in-context learning (ICL) with language alignment proves to be impressively effective for extremely low-resource languages. Conclusion: The study concludes that zero-shot ICL with language alignment is effective for extremely low-resource languages, while PEFT and few-shot ICL are better for relatively well-represented languages. Guidelines are provided for adapting LLMs to low-resource languages based on these findings. Abstract: Extremely low-resource languages, especially those written in rare scripts, as shown in Figure 1, remain largely unsupported by large language models (LLMs). This is due in part to compounding factors such as the lack of training data. This paper delivers the first comprehensive analysis of whether LLMs can acquire such languages purely via in-context learning (ICL), with or without auxiliary alignment signals, and how these methods compare to parameter-efficient fine-tuning (PEFT). We systematically evaluate 20 under-represented languages across three state-of-the-art multilingual LLMs. Our findings highlight the limitation of PEFT when both language and its script are extremely under-represented by the LLM. 
In contrast, zero-shot ICL with language alignment is impressively effective on extremely low-resource languages, while few-shot ICL or PEFT is more beneficial for languages relatively better represented by LLMs. For LLM practitioners working on extremely low-resource languages, we summarise guidelines grounded by our results on adapting LLMs to low-resource languages, e.g., avoiding fine-tuning a multilingual model on languages of unseen scripts.

[45] Retrieval-Augmented Generation for Natural Language Art Provenance Searches in the Getty Provenance Index

Mathew Henrickson

Main category: cs.CL

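TL;DR: The paper proposes a retrieval-augmented generation framework for art provenance research that addresses fragmented, multilingual archival data and supports historians and cultural heritage professionals conducting historically sensitive research.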

Details Motivation: Provenance research is essential for verifying the authenticity of artworks, supporting restitution and legal claims, and understanding the cultural and historical context of art objects, but current search is limited by the need for precise metadata and by fragmented, multilingual archival data. Method: Semantic retrieval and contextual summarization enable natural-language and multilingual searches, reducing dependence on metadata structures. Result: The results show that the approach provides a scalable solution for navigating art market archives more efficiently. Conclusion: The paper presents a retrieval-augmented generation framework for art provenance research, giving historians and cultural heritage professionals a practical tool for historically sensitive research. Abstract: This research presents a Retrieval-Augmented Generation (RAG) framework for art provenance studies, focusing on the Getty Provenance Index. Provenance research establishes the ownership history of artworks, which is essential for verifying authenticity, supporting restitution and legal claims, and understanding the cultural and historical context of art objects. The process is complicated by fragmented, multilingual archival data that hinders efficient retrieval. Current search portals require precise metadata, limiting exploratory searches. Our method enables natural-language and multilingual searches through semantic retrieval and contextual summarization, reducing dependence on metadata structures. We assess RAG's capability to retrieve and summarize auction records using a 10,000-record sample from the Getty Provenance Index - German Sales. The results show this approach provides a scalable solution for navigating art market archives, offering a practical tool for historians and cultural heritage professionals conducting historically sensitive research.

[46] Beyond the Black Box: Integrating Lexical and Semantic Methods in Quantitative Discourse Analysis with BERTopic

Thomas Compton

Main category: cs.CL

TL;DR: This paper proposes a transparent, hybrid framework for Quantitative Discourse Analysis using Python tools to overcome the limitations of black-box software, enhancing methodological control and interpretability.

Details Motivation: The motivation is to address the limitations of black-box software like MAXQDA and NVivo in QDA by promoting methodological transparency and alignment with research goals through a customizable, open-source approach. Method: The paper employs a hybrid QDA framework using custom Python pipelines with NLTK, spaCy, and Sentence Transformers for preprocessing, lemmatisation, and embedding generation. It also utilizes BERTopic modeling with UMAP, HDBSCAN, and c-TF-IDF, optimized through parameter tuning and multiple runs. Result: The result is a multi-layered analytical workflow that improves topic coherence and coverage while enabling fine-grained control over discourse analysis, demonstrated through a case study in historical political discourse. Conclusion: The paper concludes that a hybrid framework combining lexical and semantic methods enhances the transparency, reproducibility, and interpretability of Quantitative Discourse Analysis, emphasizing the importance of researcher agency and methodological triangulation. Abstract: Quantitative Discourse Analysis has seen growing adoption with the rise of Large Language Models and computational tools. However, reliance on black box software such as MAXQDA and NVivo risks undermining methodological transparency and alignment with research goals. This paper presents a hybrid, transparent framework for QDA that combines lexical and semantic methods to enable triangulation, reproducibility, and interpretability. Drawing from a case study in historical political discourse, we demonstrate how custom Python pipelines using NLTK, spaCy, and Sentence Transformers allow fine-grained control over preprocessing, lemmatisation, and embedding generation. We further detail our iterative BERTopic modelling process, incorporating UMAP dimensionality reduction, HDBSCAN clustering, and c-TF-IDF keyword extraction, optimised through parameter tuning and multiple runs to enhance topic coherence and coverage. 
By juxtaposing precise lexical searches with context-aware semantic clustering, we argue for a multi-layered approach that mitigates the limitations of either method in isolation. Our workflow underscores the importance of code-level transparency, researcher agency, and methodological triangulation in computational discourse studies. Code and supplementary materials are available via GitHub.
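The c-TF-IDF keyword extraction step mentioned above can be illustrated with a small, dependency-free sketch. It assumes the class-based weighting documented for BERTopic, score = tf(term, topic) * log(1 + A / f(term)), where A is the average number of words per topic and f(term) is the term's frequency across all topics. It uses raw term counts without normalisation, so it is not the library's implementation, and the function name `c_tf_idf` and the toy documents are hypothetical.

```python
import math
from collections import Counter


def c_tf_idf(docs_by_topic):
    """Score terms per topic with a class-based TF-IDF weighting.

    docs_by_topic maps a topic id to a list of tokenised documents.
    Assumption: raw counts are used as term frequency (no normalisation).
    """
    # Term counts within each topic (all documents of a topic are merged).
    term_counts = {
        topic: Counter(token for doc in docs for token in doc)
        for topic, docs in docs_by_topic.items()
    }
    # Term counts across all topics.
    total_per_term = Counter()
    for counts in term_counts.values():
        total_per_term.update(counts)
    # A: average number of words per topic.
    avg_words = sum(total_per_term.values()) / len(term_counts)
    return {
        topic: {
            term: tf * math.log(1 + avg_words / total_per_term[term])
            for term, tf in counts.items()
        }
        for topic, counts in term_counts.items()
    }


# Hypothetical, already clustered and tokenised documents.
topics = {
    0: [["tax", "budget", "tax"], ["budget", "the"]],
    1: [["war", "treaty", "the"], ["war", "the"]],
}
scores = c_tf_idf(topics)
```

Terms that occur in every topic, such as "the" here, receive a smaller logarithmic factor than terms concentrated in one topic, which is why the top-scoring terms can serve as topic keywords.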

[47] Do LVLMs Know What They Know? A Systematic Study of Knowledge Boundary Perception in LVLMs

Zhikai Ding,Shiyu Ni,Keping Bi

Main category: cs.CL

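TL;DR: This paper studies how well large vision-language models perceive their own knowledge boundaries in visual question answering and proposes methods to improve that perception.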

Details Motivation: To make large vision-language models more reliable in visual question answering, the paper studies how well they perceive their own knowledge boundaries. Method: The paper evaluates three types of confidence signals (probabilistic confidence, answer consistency-based confidence, and verbalized confidence), adapts confidence calibration methods from large language models, and proposes three effective methods. Result: Experiments show that although existing models have a reasonable perception level, there is substantial room for improvement; probabilistic and consistency-based confidence are the more reliable indicators, while verbalized confidence often leads to overconfidence. Conclusion: Although large vision-language models show reasonable visual question answering ability, their perception of their own knowledge boundaries leaves substantial room for improvement. Jointly processing visual and textual inputs yields a better perception level than the LLM counterparts achieve. Abstract: Large vision-language models (LVLMs) demonstrate strong visual question answering (VQA) capabilities but are shown to hallucinate. A reliable model should perceive its knowledge boundaries: knowing what it knows and what it does not. This paper investigates LVLMs' perception of their knowledge boundaries by evaluating three types of confidence signals: probabilistic confidence, answer consistency-based confidence, and verbalized confidence. Experiments on three LVLMs across three VQA datasets show that, although LVLMs possess a reasonable perception level, there is substantial room for improvement. Among the three confidences, probabilistic and consistency-based signals are more reliable indicators, while verbalized confidence often leads to overconfidence. To enhance LVLMs' perception, we adapt several established confidence calibration methods from Large Language Models (LLMs) and propose three effective methods. Additionally, we compare LVLMs with their LLM counterparts, finding that jointly processing visual and textual inputs decreases question-answering performance but reduces confidence, resulting in an improved perception level compared to LLMs.
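Of the three signals, answer consistency-based confidence is the easiest to sketch in code. The example below assumes that confidence is the share of sampled answers that agree with the most frequent answer; the paper's exact formulation may differ, and `consistency_confidence` and the sample answers are hypothetical.

```python
from collections import Counter


def consistency_confidence(sampled_answers):
    """Return (majority_answer, confidence) for a list of sampled answers.

    Assumption: confidence is the fraction of samples that match the most
    frequent answer after simple normalisation (lowercased, stripped).
    """
    if not sampled_answers:
        raise ValueError("need at least one sampled answer")
    normalised = [answer.strip().lower() for answer in sampled_answers]
    majority, count = Counter(normalised).most_common(1)[0]
    return majority, count / len(normalised)


# Hypothetical answers sampled from a model for a single VQA question.
samples = ["Paris", "paris", "Lyon", "Paris ", "Marseille"]
answer, confidence = consistency_confidence(samples)
```

A low value indicates that the model's answers disagree across samples, which can be read as a sign that the question lies outside what the model reliably knows.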

[48] Demystifying Scientific Problem-Solving in LLMs by Probing Knowledge and Reasoning

Alan Li,Yixin Liu,Arpan Sarkar,Doug Downey,Arman Cohan

Main category: cs.CL

TL;DR: The paper introduces new benchmarks and a probing framework to analyze scientific reasoning in LLMs, revealing key insights about the importance of knowledge retrieval and verbalized reasoning.

Details Motivation: The motivation for the study was the lack of a holistic benchmark for evaluating scientific reasoning in LLMs and the need to understand the distinct roles of knowledge and reasoning in these tasks. Method: The researchers introduced SciReas and SciReas-Pro benchmarks and used the KRUX probing framework to analyze the roles of knowledge and reasoning in scientific tasks. Result: Key findings included the importance of retrieving task-relevant knowledge, the benefits of external knowledge for reasoning models, and the positive impact of verbalized reasoning on knowledge retrieval. A strong 8B baseline for scientific reasoning, SciLit01, was also released. Conclusion: The study concludes that retrieving task-relevant knowledge is a critical bottleneck for LLMs in scientific reasoning, reasoning models benefit from external knowledge, and enhancing verbalized reasoning improves knowledge retrieval. Abstract: Scientific problem solving poses unique challenges for LLMs, requiring both deep domain knowledge and the ability to apply such knowledge through complex reasoning. While automated scientific reasoners hold great promise for assisting human scientists, there is currently no widely adopted holistic benchmark for evaluating scientific reasoning, and few approaches systematically disentangle the distinct roles of knowledge and reasoning in these tasks. To address these gaps, we introduce SciReas, a diverse suite of existing benchmarks for scientific reasoning tasks, and SciReas-Pro, a selective subset that requires more complex reasoning. Our holistic evaluation surfaces insights about scientific reasoning performance that remain hidden when relying on individual benchmarks alone. We then propose KRUX, a probing framework for studying the distinct roles of reasoning and knowledge in scientific tasks. 
Combining the two, we conduct an in-depth analysis that yields several key findings: (1) Retrieving task-relevant knowledge from model parameters is a critical bottleneck for LLMs in scientific reasoning; (2) Reasoning models consistently benefit from external knowledge added in-context on top of the reasoning enhancement; (3) Enhancing verbalized reasoning improves LLMs' ability to surface task-relevant knowledge. Finally, we conduct a lightweight analysis, comparing our science-focused data composition with concurrent efforts on long CoT SFT, and release SciLit01, a strong 8B baseline for scientific reasoning.

[49] VibeVoice Technical Report

Zhiliang Peng,Jianwei Yu,Wenhui Wang,Yaoyao Chang,Yutao Sun,Li Dong,Yi Zhu,Weijiang Xu,Hangbo Bao,Zehua Wang,Shaohan Huang,Yan Xia,Furu Wei

Main category: cs.CL

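TL;DR: VibeVoice is a new model that synthesizes long-form, multi-speaker speech using next-token diffusion, and it introduces a continuous speech tokenizer that improves data compression and computational efficiency.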

Details Motivation: To improve data compression and the computational efficiency of processing long sequences while preserving audio fidelity. Method: The model combines a new continuous speech tokenizer with next-token diffusion modeling. Result: Compared with the popular Encodec model, the new tokenizer improves data compression by 80 times while maintaining comparable performance. Conclusion: VibeVoice can synthesize up to 90 minutes of speech with up to 4 speakers, capturing an authentic conversational feel and surpassing open-source and proprietary dialogue models. Abstract: This report presents VibeVoice, a novel model designed to synthesize long-form speech with multiple speakers by employing next-token diffusion, which is a unified method for modeling continuous data by autoregressively generating latent vectors via diffusion. To enable this, we introduce a novel continuous speech tokenizer that, when compared to the popular Encodec model, improves data compression by 80 times while maintaining comparable performance. The tokenizer effectively preserves audio fidelity while significantly boosting computational efficiency for processing long sequences. Thus, VibeVoice can synthesize long-form speech for up to 90 minutes (in a 64K context window length) with a maximum of 4 speakers, capturing the authentic conversational "vibe" and surpassing open-source and proprietary dialogue models.

[50] Evaluating the Evaluators: Are readability metrics good measures of readability?

Isabel Cachola,Daniel Khashabi,Mark Dredze

Main category: cs.CL

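TL;DR: This paper surveys the Plain Language Summarization (PLS) literature and evaluates how well traditional readability metrics correlate with human judgments. It finds that language models (LMs) are better judges of readability and lead to different conclusions on PLS datasets.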

Details Motivation: The current standard practice for readability evaluation is to use traditional readability metrics such as Flesch-Kincaid Grade Level (FKGL), but these metrics have not been compared with human readability judgments in PLS. Method: The authors evaluate 8 readability metrics and show that most correlate poorly with human judgments, including the most popular metric, FKGL. They then show that language models (LMs) are better judges of readability and extend the analysis to plain language summarization datasets. Result: LMs better capture deeper measures of readability, such as required background knowledge, and lead to different conclusions than the traditional metrics. The best-performing model achieves a Pearson correlation of 0.56 with human judgments. Conclusion: Based on their analysis, the authors offer recommendations for best practices in evaluating plain language summaries and release their analysis code and survey data. Abstract: Plain Language Summarization (PLS) aims to distill complex documents into accessible summaries for non-expert audiences. In this paper, we conduct a thorough survey of PLS literature, and identify that the current standard practice for readability evaluation is to use traditional readability metrics, such as Flesch-Kincaid Grade Level (FKGL). However, despite proven utility in other fields, these metrics have not been compared to human readability judgments in PLS. We evaluate 8 readability metrics and show that most correlate poorly with human judgments, including the most popular metric, FKGL. We then show that Language Models (LMs) are better judges of readability, with the best-performing model achieving a Pearson correlation of 0.56 with human judgments. Extending our analysis to PLS datasets, which contain summaries aimed at non-expert audiences, we find that LMs better capture deeper measures of readability, such as required background knowledge, and lead to different conclusions than the traditional metrics. Based on these findings, we offer recommendations for best practices in the evaluation of plain language summaries. We release our analysis code and survey data.
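To make the two quantities in this entry concrete, the sketch below computes FKGL with the standard published formula, 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59, and a Pearson correlation between metric scores and human ratings. Word, sentence, and syllable counts are passed in directly because syllable counting is itself heuristic; the function names and the sample numbers are hypothetical and are not taken from the paper.

```python
import math


def fkgl(words, sentences, syllables):
    """Flesch-Kincaid Grade Level from raw counts for one text."""
    return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return covariance / math.sqrt(var_x * var_y)


# Hypothetical (words, sentences, syllables) counts for four summaries,
# paired with hypothetical human difficulty ratings for the same texts.
counts = [(100, 5, 150), (120, 10, 160), (90, 3, 170), (110, 8, 140)]
human_difficulty = [3.0, 2.0, 4.5, 2.5]

metric_scores = [fkgl(w, s, syl) for w, s, syl in counts]
correlation = pearson(metric_scores, human_difficulty)
```

A correlation near zero on real annotations would indicate that the metric tracks human judgments poorly, which is the kind of comparison the paper reports.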

[51] Generative Interfaces for Language Models

Jiaqi Chen,Yanzhe Zhang,Yutong Zhang,Yijia Shao,Diyi Yang

Main category: cs.CL

TL;DR: This paper introduces Generative Interfaces for Language Models, which generate interactive user interfaces to enhance the efficiency and engagement of human-AI interactions beyond traditional chat-based formats.

Details Motivation: The motivation behind the study is to overcome the limitations of current large language models (LLMs) that are constrained by linear request-response formats, which are inefficient for multi-turn, information-dense, and exploratory tasks. The authors aim to enhance the interaction experience by introducing a more adaptive and interactive engagement method. Method: The authors propose a new paradigm called Generative Interfaces for Language Models, which uses structured representations and iterative refinements to generate task-specific user interfaces (UIs) in response to user queries. They evaluate this approach using a multidimensional assessment framework comparing it with traditional chat-based systems across various tasks and interaction patterns. Result: Results show that generative interfaces outperform traditional conversational interfaces, with users preferring them in over 70% of cases. The study identifies the scenarios and reasons for this preference, indicating improved functional, interactive, and emotional aspects of user experience. Conclusion: The study concludes that generative interfaces offer a more effective and engaging way for human-AI interaction compared to traditional chat-based systems, especially for complex and exploratory tasks. Abstract: Large language models (LLMs) are increasingly seen as assistants, copilots, and consultants, capable of supporting a wide range of tasks through natural conversation. However, most systems remain constrained by a linear request-response format that often makes interactions inefficient in multi-turn, information-dense, and exploratory tasks. To address these limitations, we propose Generative Interfaces for Language Models, a paradigm in which LLMs respond to user queries by proactively generating user interfaces (UIs) that enable more adaptive and interactive engagement. 
Our framework leverages structured interface-specific representations and iterative refinements to translate user queries into task-specific UIs. For systematic evaluation, we introduce a multidimensional assessment framework that compares generative interfaces with traditional chat-based ones across diverse tasks, interaction patterns, and query types, capturing functional, interactive, and emotional aspects of user experience. Results show that generative interfaces consistently outperform conversational ones, with humans preferring them in over 70% of cases. These findings clarify when and why users favor generative interfaces, paving the way for future advancements in human-AI interaction.

cs.CV [Back]

[52] Towards Training-Free Underwater 3D Object Detection from Sonar Point Clouds: A Comparison of Traditional and Deep Learning Approaches

M. Salman Shaukat,Yannik Käckenmeister,Sebastian Bader,Thomas Kirste

Main category: cs.CV

TL;DR: This paper explores training-free underwater 3D object detection, showing that template matching outperforms deep learning models trained on synthetic data in real-world scenarios.

Details Motivation: Underwater 3D object detection is challenging due to harsh acoustic environments and limited annotated training data. Deep learning approaches face bottlenecks in data acquisition. This work explores whether reliable detection can be achieved without real-world training data. Method: The authors developed two paradigms: a physics-based sonar simulation pipeline for generating synthetic training data for neural networks, and a model-based template matching system using geometric priors of objects. They evaluated both approaches using real bathymetry surveys from the Baltic Sea. Result: Neural networks trained on synthetic data achieved 98% mAP on simulated scenes but dropped to 40% mAP on real sonar data due to domain shift. In contrast, the template matching approach achieved 83% mAP on real data without any training, showing robustness to noise and environmental changes. Conclusion: The study concludes that template matching approaches can achieve robust underwater 3D object detection without training data, outperforming deep learning models trained on synthetic data. This challenges conventional reliance on data-hungry methods and enables applications in data-scarce underwater environments. Abstract: Underwater 3D object detection remains one of the most challenging frontiers in computer vision, where traditional approaches struggle with the harsh acoustic environment and scarcity of training data. While deep learning has revolutionized terrestrial 3D detection, its application underwater faces a critical bottleneck: obtaining sufficient annotated sonar data is prohibitively expensive and logistically complex, often requiring specialized vessels, expert surveyors, and favorable weather conditions. This work addresses a fundamental question: Can we achieve reliable underwater 3D object detection without real-world training data? 
We tackle this challenge by developing and comparing two paradigms for training-free detection of artificial structures in multibeam echo-sounder point clouds. Our dual approach combines a physics-based sonar simulation pipeline that generates synthetic training data for state-of-the-art neural networks, with a robust model-based template matching system that leverages geometric priors of target objects. Evaluation on real bathymetry surveys from the Baltic Sea reveals surprising insights: while neural networks trained on synthetic data achieve 98% mean Average Precision (mAP) on simulated scenes, they drop to 40% mAP on real sonar data due to domain shift. Conversely, our template matching approach maintains 83% mAP on real data without requiring any training, demonstrating remarkable robustness to acoustic noise and environmental variations. Our findings challenge conventional wisdom about data-hungry deep learning in underwater domains and establish the first large-scale benchmark for training-free underwater 3D detection. This work opens new possibilities for autonomous underwater vehicle navigation, marine archaeology, and offshore infrastructure monitoring in data-scarce environments where traditional machine learning approaches fail.

[53] MobileDenseAttn: A Dual-Stream Architecture for Accurate and Interpretable Brain Tumor Detection

Shudipta Banik,Muna Das,Trapa Banik,Md. Ehsanul Haque

Main category: cs.CV

TL;DR: MobileDenseAttn is a high-performance and interpretable model introduced to overcome the limitations of current brain tumor detection approaches in MRI, offering improved feature representation, computing efficiency, and visual explanations.

Details Motivation: The detection of brain tumor in MRI is an important aspect of ensuring timely diagnostics and treatment; however, manual analysis is commonly long and error-prone. Current approaches are not universal because they have limited generalization to heterogeneous tumors, are computationally inefficient, are not interpretable, and lack transparency, thus limiting trustworthiness. Method: MobileDenseAttn is a fusion model of dual streams of MobileNetV2 and DenseNet201 that can help gradually improve the feature representation scale, computing efficiency, and visual explanations via GradCAM. The model uses feature level fusion and is trained on an augmented dataset of 6,020 MRI scans. Result: Measured under strict 5-fold cross-validation protocols, MobileDenseAttn provides a training accuracy of 99.75%, a testing accuracy of 98.35%, and a stable F1 score of 0.9835. The extensive validation shows the stability of the model, and the comparative analysis proves that it is a great advancement over the baseline models with a +3.67% accuracy increase and a 39.3% decrease in training time. Conclusion: MobileDenseAttn is an efficient, high performance, interpretable model with a high probability of becoming a clinically practical tool in identifying brain tumors in the real world. Abstract: The detection of brain tumor in MRI is an important aspect of ensuring timely diagnostics and treatment; however, manual analysis is commonly long and error-prone. Current approaches are not universal because they have limited generalization to heterogeneous tumors, are computationally inefficient, are not interpretable, and lack transparency, thus limiting trustworthiness. To overcome these issues, we introduce MobileDenseAttn, a fusion model of dual streams of MobileNetV2 and DenseNet201 that can help gradually improve the feature representation scale, computing efficiency, and visual explanations via GradCAM. 
Our model uses feature level fusion and is trained on an augmented dataset of 6,020 MRI scans representing glioma, meningioma, pituitary tumors, and normal samples. Measured under strict 5-fold cross-validation protocols, MobileDenseAttn provides a training accuracy of 99.75%, a testing accuracy of 98.35%, and a stable F1 score of 0.9835 (95% CI: 0.9743 to 0.9920). The extensive validation shows the stability of the model, and the comparative analysis proves that it is a great advancement over the baseline models (VGG19, DenseNet201, MobileNetV2) with a +3.67% accuracy increase and a 39.3% decrease in training time compared to VGG19. The GradCAM heatmaps clearly show tumor-affected areas, offering clinically significant localization and improving interpretability. These findings position MobileDenseAttn as an efficient, high performance, interpretable model with a high probability of becoming a clinically practical tool in identifying brain tumors in the real world.

[54] Can VLMs Recall Factual Associations From Visual References?

Dhananjay Ashok,Ashutosh Chaubey,Hirona J. Arai,Jonathan May,Jesse Thomason

Main category: cs.CV

TL;DR: This paper identifies a key deficiency in Vision Language Models' ability to link visual inputs with factual knowledge, showing that visual references significantly impair their performance. The issue can be detected through internal state analysis, enabling improved reliability in multimodal tasks without retraining.

Details Motivation: The motivation behind this study is to understand and address the limitations of VLMs in effectively linking visual representations with their internal knowledge, which is crucial for reliable multimodal understanding. Method: The researchers conducted a controlled study to evaluate the ability of VLMs to recall factual knowledge when provided with textual versus visual references. They analyzed internal state patterns to identify linking failures and developed probes to flag unreliable VLM responses without retraining. Result: Forcing VLMs to rely on image representations reduced their factual recall ability by half. The study identified distinct patterns in internal states that correlated with linking failures, achieving over 92% accuracy in flagging unreliable responses. Applying these probes improved coverage by 7.87% and reduced error risk by 0.9% in visual question answering tasks. Conclusion: The study concludes that there is a systematic deficiency in the multimodal grounding of Vision Language Models (VLMs), particularly in linking internal knowledge with image representations. This deficiency can be detected through patterns in model internal states, and addressing it is crucial for improving language grounding. Abstract: Through a controlled study, we identify a systematic deficiency in the multimodal grounding of Vision Language Models (VLMs). While VLMs can recall factual associations when provided a textual reference to an entity; their ability to do so is significantly diminished when the reference is visual instead. Forcing VLMs to rely on image representations of an entity halves their ability to recall factual knowledge, suggesting that VLMs struggle to link their internal knowledge of an entity with its image representation. 
We show that such linking failures are correlated with the expression of distinct patterns in model internal states, and that probes on these internal states achieve over 92% accuracy at flagging cases where the VLM response is unreliable. These probes can be applied, without retraining, to identify when a VLM will fail to correctly answer a question that requires an understanding of multimodal input. When used to facilitate selective prediction on a visual question answering task, the probes increase coverage by 7.87% (absolute) while also reducing the risk of error by 0.9% (absolute). Addressing the systematic, detectable deficiency is an important avenue in language grounding, and we provide informed recommendations for future directions.
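The selective prediction setting in this abstract can be illustrated with the usual definitions: coverage is the fraction of questions the system chooses to answer, and risk is the error rate among the answered ones. The sketch below assumes a probe that outputs a boolean "unreliable" flag per question and a system that abstains whenever the flag is set. The paper's exact protocol may differ, and `coverage_and_risk` and the toy data are hypothetical.

```python
def coverage_and_risk(is_correct, flagged_unreliable):
    """Compute (coverage, risk) when the model abstains on flagged items.

    is_correct[i] says whether the model's answer to question i is right;
    flagged_unreliable[i] is the probe's decision to abstain on question i.
    """
    if len(is_correct) != len(flagged_unreliable):
        raise ValueError("inputs must have the same length")
    # Keep only the questions the system actually answers.
    answered = [ok for ok, flag in zip(is_correct, flagged_unreliable) if not flag]
    coverage = len(answered) / len(is_correct)
    # Risk is the share of wrong answers among answered questions.
    risk = sum(1 for ok in answered if not ok) / len(answered) if answered else 0.0
    return coverage, risk


# Hypothetical outcomes for five questions.
correct = [True, True, False, True, False]
flags = [False, False, True, False, False]
coverage, risk = coverage_and_risk(correct, flags)
```

A better probe abstains on fewer correct answers and on more wrong ones, which raises coverage and lowers risk at the same time.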

[55] SERES: Semantic-aware neural reconstruction from sparse views

Bo Xu,Yuhu Guo,Yuchao Wang,Wenting Wang,Yeung Yam,Charlie C. L. Wang,Xinyi Le

Main category: cs.CV

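TL;DR: This paper proposes a method for high-fidelity 3D reconstruction that introduces semantic logits and geometric regularization, substantially improving reconstruction from sparse images and reducing error against several baselines on the DTU dataset.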

Details Motivation: Mismatched features in sparse input cause severe radiance ambiguity, so semantic information is needed to make 3D reconstruction more accurate and robust. Method: The method adds patch-based semantic logits to the neural implicit representation and optimizes them jointly with the signed distance field and the radiance field. It also introduces a new regularization based on geometric primitive masks to mitigate shape ambiguity. Result: On the DTU dataset, the average Chamfer distance is reduced by 44% relative to SparseNeuS and by 20% relative to VolRecon. When used as a plugin for NeuS and Neuralangelo, the average error is reduced by 69% and 68% respectively. Conclusion: The paper proposes a semantic-aware neural reconstruction method that generates high-fidelity 3D models from sparse images, reduces reconstruction error, and can serve as a plugin to improve existing dense reconstruction methods. Abstract: We propose a semantic-aware neural reconstruction method to generate 3D high-fidelity models from sparse images. To tackle the challenge of severe radiance ambiguity caused by mismatched features in sparse input, we enrich neural implicit representations by adding patch-based semantic logits that are optimized together with the signed distance field and the radiance field. A novel regularization based on the geometric primitive masks is introduced to mitigate shape ambiguity. The performance of our approach has been verified in experimental evaluation. The average chamfer distances of our reconstruction on the DTU dataset can be reduced by 44% for SparseNeuS and 20% for VolRecon. When working as a plugin for those dense reconstruction baselines such as NeuS and Neuralangelo, the average error on the DTU dataset can be reduced by 69% and 68% respectively.
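For readers unfamiliar with the metric behind these numbers, the sketch below computes a symmetric Chamfer distance between two small point sets. Several variants of the metric exist. This one assumes the average of the two directed mean nearest-neighbour Euclidean distances, which may not match the DTU evaluation protocol exactly, and the point sets are hypothetical.

```python
import math


def directed_distance(source, target):
    """Mean distance from each source point to its nearest target point."""
    return sum(min(math.dist(p, q) for q in target) for p in source) / len(source)


def chamfer_distance(points_a, points_b):
    """Symmetric Chamfer distance: average of the two directed distances."""
    return 0.5 * (
        directed_distance(points_a, points_b)
        + directed_distance(points_b, points_a)
    )


# Hypothetical reconstructed and reference point sets.
reconstructed = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
reference = [(0.0, 0.0, 0.0), (1.0, 0.0, 1.0)]
distance = chamfer_distance(reconstructed, reference)
```

The brute-force nearest-neighbour search is quadratic in the number of points, so a spatial index would be needed for real scans.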

[56] Automated Landfill Detection Using Deep Learning: A Comparative Study of Lightweight and Custom Architectures with the AerialWaste Dataset

Nowshin Sharmily,Rusab Sarmun,Muhammad E. H. Chowdhury,Mir Hamidul Hussain,Saad Bin Abul Kashem,Molla E Majid,Amith Khandakar

Main category: cs.CV

TL;DR: This study uses lightweight deep learning models to detect illegal landfills in the AerialWaste Dataset, achieving high performance with an ensemble approach.

Details Motivation: Illegal landfills are a global threat, and detecting them manually is difficult and resource-consuming. There is also a lack of good-quality public datasets for landfill detection due to security concerns. Method: The researchers used lightweight deep learning models (Mobilenetv2, Googlenet, Densenet, MobileVit) to train and validate the AerialWaste Dataset, eventually combining the best-performing models into an ensemble model using fusion techniques. Result: The ensemble model achieved 92.33% accuracy, 92.67% precision, 92.33% sensitivity, 92.41% F1 score, and 92.71% specificity in binary classification on the AerialWaste Dataset. Conclusion: The study concludes that lightweight deep learning models, when combined into an ensemble model, can effectively perform binary classification on the AerialWaste Dataset for illegal landfill detection with high accuracy and other promising performance metrics. Abstract: Illegal landfills are posing as a hazardous threat to people all over the world. Due to the arduous nature of manually identifying the location of landfill, many landfills go unnoticed by authorities and later cause dangerous harm to people and environment. Deep learning can play a significant role in identifying these landfills while saving valuable time, manpower and resources. Despite being a burning concern, good quality publicly released datasets for illegal landfill detection are hard to find due to security concerns. However, AerialWaste Dataset is a large collection of 10434 images of Lombardy region of Italy. The images are of varying qualities, collected from three different sources: AGEA Orthophotos, WorldView-3, and Google Earth. The dataset contains professionally curated, diverse and high-quality images which makes it particularly suitable for scalable and impactful research. 
As we trained several models to compare results, we found complex and heavy models to be prone to overfitting and memorizing training data instead of learning patterns. Therefore, we chose lightweight simpler models which could leverage general features from the dataset. In this study, Mobilenetv2, Googlenet, Densenet, MobileVit and other lightweight deep learning models were used to train and validate the dataset as they achieved significant success with less overfitting. As we saw substantial improvement in the performance using some of these models, we combined the best performing models and came up with an ensemble model. With the help of ensemble and fusion technique, binary classification could be performed on this dataset with 92.33% accuracy, 92.67% precision, 92.33% sensitivity, 92.41% F1 score and 92.71% specificity.
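The five reported metrics all derive from the counts of a binary confusion matrix. The sketch below uses the standard definitions. The abstract does not say whether the paper averages metrics across classes, so this is only the plain positive-class version, and the counts are hypothetical rather than the paper's results.

```python
def binary_metrics(tp, fp, tn, fn):
    """Standard binary classification metrics from confusion-matrix counts."""
    precision = tp / (tp + fp)
    sensitivity = tp / (tp + fn)  # also called recall
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "precision": precision,
        "sensitivity": sensitivity,
        "specificity": tn / (tn + fp),
        "f1": 2 * precision * sensitivity / (precision + sensitivity),
    }


# Hypothetical counts for a landfill / no-landfill classifier.
metrics = binary_metrics(tp=40, fp=10, tn=45, fn=5)
```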

[57] Structures Meet Semantics: Multimodal Fusion via Graph Contrastive Learning

Jiangfeng Sun,Sihao He,Zhonghong Ou,Meina Song

Main category: cs.CV

TL;DR: This paper proposes the Structural-Semantic Unifier (SSU) framework for multimodal sentiment analysis, which captures modality-specific structures and aligns cross-modal semantics to achieve strong performance with reduced computation and better interpretability.

Details Motivation: Existing multimodal fusion methods often neglect modality-specific structural dependencies and semantic misalignment, limiting their performance. This work aims to address these challenges by integrating structural and semantic information more effectively. Method: SSU dynamically constructs modality-specific graphs using linguistic syntax for text and a text-guided attention mechanism for acoustic and visual modalities. It also introduces a semantic anchor for cross-modal alignment and employs a multiview contrastive learning objective to enhance discriminability and consistency. Result: Extensive evaluations on CMU-MOSI and CMU-MOSEI datasets show that SSU achieves state-of-the-art performance while significantly reducing computational overhead. Qualitative analyses confirm its interpretability and ability to capture nuanced emotional patterns. Conclusion: The proposed Structural-Semantic Unifier (SSU) framework effectively addresses the challenges of modality-specific structural dependencies and semantic misalignment in multimodal sentiment analysis, achieving state-of-the-art performance with reduced computational overhead and improved interpretability. Abstract: Multimodal sentiment analysis (MSA) aims to infer emotional states by effectively integrating textual, acoustic, and visual modalities. Despite notable progress, existing multimodal fusion methods often neglect modality-specific structural dependencies and semantic misalignment, limiting their quality, interpretability, and robustness. To address these challenges, we propose a novel framework called the Structural-Semantic Unifier (SSU), which systematically integrates modality-specific structural information and cross-modal semantic grounding for enhanced multimodal representations. 
Specifically, SSU dynamically constructs modality-specific graphs by leveraging linguistic syntax for text and a lightweight, text-guided attention mechanism for acoustic and visual modalities, thus capturing detailed intra-modal relationships and semantic interactions. We further introduce a semantic anchor, derived from global textual semantics, that serves as a cross-modal alignment hub, effectively harmonizing heterogeneous semantic spaces across modalities. Additionally, we develop a multiview contrastive learning objective that promotes discriminability, semantic consistency, and structural coherence across intra- and inter-modal views. Extensive evaluations on two widely used benchmark datasets, CMU-MOSI and CMU-MOSEI, demonstrate that SSU consistently achieves state-of-the-art performance while significantly reducing computational overhead compared to prior methods. Comprehensive qualitative analyses further validate SSU's interpretability and its ability to capture nuanced emotional patterns through semantically grounded interactions.

[58] FastAvatar: Instant 3D Gaussian Splatting for Faces from Single Unconstrained Poses

Hao Liang,Zhixuan Ge,Ashish Tiwari,Soumendu Majee,G. M. Dilshan Godaliyadda,Ashok Veeraraghavan,Guha Balakrishnan

Main category: cs.CV

TL;DR: FastAvatar is a fast and accurate framework for generating 3D Gaussian Splatting models of faces from single images, significantly outperforming existing methods in speed and quality.

Details Motivation: To overcome the limitations of existing methods in speed and quality when generating 3DGS models of faces from single images. Method: FastAvatar uses a novel encoder-decoder neural network to generate a 3D Gaussian Splatting model from a single face image, encoding identity-specific and pose-invariant latent embeddings and predicting residuals to a template model. Result: FastAvatar can generate 3DGS models in near-instant time (<10ms), significantly outperforming existing feed-forward face 3DGS methods in reconstruction quality and running 1000x faster than per-face optimization methods. Conclusion: FastAvatar provides a fast and accurate method for generating 3DGS models of faces from single images, significantly outperforming existing methods in speed and quality. Abstract: We present FastAvatar, a pose-invariant, feed-forward framework that can generate a 3D Gaussian Splatting (3DGS) model from a single face image from an arbitrary pose in near-instant time (<10ms). FastAvatar uses a novel encoder-decoder neural network design to achieve both fast fitting and identity preservation regardless of input pose. First, FastAvatar constructs a 3DGS face "template" model from a training dataset of faces with multi-view captures. Second, FastAvatar encodes the input face image into an identity-specific and pose-invariant latent embedding, and decodes this embedding to predict residuals to the structural and appearance parameters of each Gaussian in the template 3DGS model. By only inferring residuals in a feed-forward fashion, model inference is fast and robust. FastAvatar significantly outperforms existing feed-forward face 3DGS methods (e.g., GAGAvatar) in reconstruction quality, and runs 1000x faster than per-face optimization methods (e.g., FlashAvatar, GaussianAvatars and GASP). 
In addition, FastAvatar's novel latent space design supports real-time identity interpolation and attribute editing which is not possible with any existing feed-forward 3DGS face generation framework. FastAvatar's combination of excellent reconstruction quality and speed expands the scope of 3DGS for photorealistic avatar applications in consumer and interactive systems.
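The template-plus-residual idea and the latent interpolation mentioned in the abstract can be illustrated with a toy sketch. This is not FastAvatar's network: `decode_residuals`, the parameter layout, and the use of plain linear interpolation between latents are all assumptions made for the example.

```python
# Toy sketch: a shared template of per-Gaussian parameters plus predicted residuals.
# All names and shapes are hypothetical; the real decoder is a neural network.

def decode_residuals(latent, num_gaussians, params_per_gaussian):
    """Stand-in decoder: maps a latent vector to one residual per template parameter."""
    scale = sum(latent) / len(latent)
    return [[scale * 0.01 * (p + 1) for p in range(params_per_gaussian)]
            for _ in range(num_gaussians)]

def apply_residuals(template, residuals):
    """Add residuals element-wise to the template parameters."""
    return [[t + r for t, r in zip(t_row, r_row)]
            for t_row, r_row in zip(template, residuals)]

def interpolate_latents(z_a, z_b, alpha):
    """Linear blend of two identity latents (alpha=0 gives z_a, alpha=1 gives z_b)."""
    return [(1.0 - alpha) * a + alpha * b for a, b in zip(z_a, z_b)]

template = [[0.0, 1.0, 2.0], [3.0, 4.0, 5.0]]  # 2 Gaussians, 3 parameters each
z_a, z_b = [1.0, 1.0], [3.0, 3.0]
z_mid = interpolate_latents(z_a, z_b, 0.5)
face = apply_residuals(template, decode_residuals(z_mid, 2, 3))
```

Because only small offsets from a fixed template are predicted, a single forward pass suffices, which is consistent with the speed argument the abstract makes.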

[59] Securing Face and Fingerprint Templates in Humanitarian Biometric Systems

Giuseppe Stragapede,Sam Merrick,Vedrana Krivokuća Hahn,Justin Sukaitis,Vincent Graf Narbel

Main category: cs.CV

TL;DR: This paper introduces a secure and efficient biometric system using PolyProtect for template protection in humanitarian scenarios, showing promising results for both face and fingerprint recognition.

Details Motivation: The motivation is to enhance the efficiency of operations in humanitarian and emergency scenarios through secure biometric systems while mitigating risks to data subjects, particularly in vulnerable contexts. Method: The researchers conducted a comparative analysis of biometric template protection (BTP) schemes, selected PolyProtect as the most suitable method, and evaluated its performance on face embeddings using EdgeFace and fingerprint biometrics in terms of accuracy, irreversibility, and unlinkability. Result: PolyProtect demonstrated promising results in verification and identification accuracy, irreversibility, and unlinkability for both face and fingerprint biometrics, marking the first evaluation of this method in identification scenarios and for fingerprint data. Conclusion: The study concludes that PolyProtect, when applied to face and fingerprint biometrics, offers a promising solution for secure and efficient biometric template protection in humanitarian and emergency scenarios. Abstract: In humanitarian and emergency scenarios, the use of biometrics can dramatically improve the efficiency of operations, but it poses risks for the data subjects, which are exacerbated in contexts of vulnerability. To address this, we present a mobile biometric system implementing a biometric template protection (BTP) scheme suitable for these scenarios. After rigorously formulating the functional, operational, and security and privacy requirements of these contexts, we perform a broad comparative analysis of the BTP landscape. PolyProtect, a method designed to operate on neural network face embeddings, is identified as the most suitable method due to its effectiveness, modularity, and lightweight computational burden. 
We evaluate PolyProtect in terms of verification and identification accuracy, irreversibility, and unlinkability, when this BTP method is applied to face embeddings extracted using EdgeFace, a novel state-of-the-art efficient feature extractor, on a real-world face dataset from a humanitarian field project in Ethiopia. Moreover, as PolyProtect promises to be modality-independent, we extend its evaluation to fingerprints. To the best of our knowledge, this is the first time that PolyProtect has been evaluated for the identification scenario and for fingerprint biometrics. Our experimental results are promising, and we plan to release our code.

[60] Why Relational Graphs Will Save the Next Generation of Vision Foundation Models?

Fatemeh Ziaeetabar

Main category: cs.CV

TL;DR: This paper argues that next-generation vision foundation models should integrate dynamic relational graphs to improve performance on complex tasks requiring relational reasoning, showcasing benefits in semantic fidelity, robustness, and efficiency while proposing a roadmap for future research.

Details Motivation: Vision foundation models (FMs) face limitations in tasks requiring explicit reasoning over entities, roles, and spatio-temporal relations. These limitations hinder performance in areas like human activity recognition, egocentric video understanding, and medical image analysis. The paper aims to propose solutions to overcome these shortcomings by enhancing FMs with relational reasoning capabilities. Method: The paper presents an analytical position supported by cross-domain evidence from recent systems in areas like human manipulation action recognition and brain tumor segmentation. It advocates for augmenting FMs with lightweight, context-adaptive graph-reasoning modules to enhance relational competence. Result: Augmenting FMs with dynamic relational graphs leads to improvements in fine-grained semantic fidelity, out-of-distribution robustness, interpretability, and computational efficiency. These hybrid models also achieve better memory and hardware efficiency, making them suitable for deployment under resource constraints. Conclusion: The paper concludes that incorporating explicit relational interfaces into vision foundation models (FMs) through dynamic relational graphs enhances performance in fine-grained semantic tasks. It emphasizes the advantages of these hybrid models in terms of semantic fidelity, robustness, interpretability, and efficiency. The authors propose a research agenda focusing on dynamic graph construction, multi-level reasoning, cross-modal fusion, and relational competence evaluation. Abstract: Vision foundation models (FMs) have become the predominant architecture in computer vision, providing highly transferable representations learned from large-scale, multimodal corpora. Nonetheless, they exhibit persistent limitations on tasks that require explicit reasoning over entities, roles, and spatio-temporal relations. 
Such relational competence is indispensable for fine-grained human activity recognition, egocentric video understanding, and multimodal medical image analysis, where spatial, temporal, and semantic dependencies are decisive for performance. We advance the position that next-generation FMs should incorporate explicit relational interfaces, instantiated as dynamic relational graphs (graphs whose topology and edge semantics are inferred from the input and task context). We illustrate this position with cross-domain evidence from recent systems in human manipulation action recognition and brain tumor segmentation, showing that augmenting FMs with lightweight, context-adaptive graph-reasoning modules improves fine-grained semantic fidelity, out-of-distribution robustness, interpretability, and computational efficiency relative to FM-only baselines. Importantly, by reasoning sparsely over semantic nodes, such hybrids also achieve favorable memory and hardware efficiency, enabling deployment under practical resource constraints. We conclude with a targeted research agenda for FM-graph hybrids, prioritizing learned dynamic graph construction, multi-level relational reasoning (e.g., part-object-scene in activity understanding, or region-organ in medical imaging), cross-modal fusion, and evaluation protocols that directly probe relational competence in structured vision tasks.

[61] LPLC: A Dataset for License Plate Legibility Classification

Lucas Wojcik,Gabriel E. Lima,Valfride Nascimento,Eduil Nascimento Jr.,Rayson Laroca,David Menotti

Main category: cs.CV

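TL;DR: This paper introduces LPLC, a new dataset for studying the challenges that automatic license plate recognition (ALPR) systems face with illegible license plates, and proposes a classification task as a benchmark; the results show that the task is difficult and needs further research.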

Details Motivation: The paper addresses the challenge that automatic license plate recognition (ALPR) systems face with illegible license plates, introducing a new dataset and classification task to optimize model performance and computational efficiency and to encourage further research in this area. Method: The authors build a new dataset of 10,210 vehicle images with 12,687 annotated license plates (the LPLC dataset) and adopt a fine-grained annotation strategy covering vehicle- and plate-level occlusions, four legibility categories, and character labels for three categories. As a benchmark, they propose a classification task in which three image recognition networks determine whether a plate is good enough, requires super-resolution, or is completely unrecoverable. Result: The overall F1 score stayed below 80% for all three baseline models (ViT, ResNet, and YOLO), indicating that the task is difficult; analyses of super-resolution and license plate recognition methods further reinforce the need for more research. Conclusion: ALPR still faces a major challenge with illegible license plates: although methods such as super-resolution have emerged, the core problem of recognizing low-quality plates remains unresolved, and the new dataset supports further research in this area. Abstract: Automatic License Plate Recognition (ALPR) faces a major challenge when dealing with illegible license plates (LPs). While reconstruction methods such as super-resolution (SR) have emerged, the core issue of recognizing these low-quality LPs remains unresolved. To optimize model performance and computational efficiency, image pre-processing should be applied selectively to cases that require enhanced legibility. To support research in this area, we introduce a novel dataset comprising 10,210 images of vehicles with 12,687 annotated LPs for legibility classification (the LPLC dataset). The images span a wide range of vehicle types, lighting conditions, and camera/image quality levels. We adopt a fine-grained annotation strategy that includes vehicle- and LP-level occlusions, four legibility categories (perfect, good, poor, and illegible), and character labels for three categories (excluding illegible LPs). As a benchmark, we propose a classification task using three image recognition networks to determine whether an LP image is good enough, requires super-resolution, or is completely unrecoverable. The overall F1 score, which remained below 80% for all three baseline models (ViT, ResNet, and YOLO), together with the analyses of SR and LP recognition methods, highlights the difficulty of the task and reinforces the need for further research. The proposed dataset is publicly available at https://github.com/lmlwojcik/lplc-dataset.
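The selective pre-processing idea (apply super-resolution only where legibility requires it) and an F1-style evaluation can be sketched as follows. The class names, the `recognize` and `super_resolve` callables, and the choice of macro averaging are assumptions for illustration; the abstract reports an "overall F1 score" without specifying the averaging.

```python
# Hypothetical routing on top of a three-way legibility classifier.
# Class names and the recognizer / super-resolution callables are placeholders.

def route_plate(legibility, image, recognize, super_resolve):
    """Run super-resolution only for plates the classifier says need it."""
    if legibility == "good_enough":
        return recognize(image)
    if legibility == "needs_sr":
        return recognize(super_resolve(image))
    return None  # "unrecoverable": skip recognition entirely

def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores (one possible 'overall' F1)."""
    classes = sorted(set(y_true) | set(y_pred))
    scores = []
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)
```

Routing this way avoids spending super-resolution compute on plates that are already readable or that cannot be recovered at all.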

[62] CLARIFY: A Specialist-Generalist Framework for Accurate and Lightweight Dermatological Visual Question Answering

Aranya Saha,Tanvir Ahmed Khan,Ismam Nur Swapnil,Mohammad Ariful Haque

Main category: cs.CV

TL;DR: The paper introduces CLARIFY, a Specialist-Generalist framework for dermatological visual question answering that improves diagnostic accuracy and computational efficiency by combining a lightweight image classifier with a compressed conversational VLM and a knowledge graph-based retrieval module.

Details Motivation: The general-purpose nature of vision-language models (VLMs) can limit specialized diagnostic accuracy and their large size poses substantial inference costs for real-world clinical deployment. Method: CLARIFY combines a lightweight, domain-trained image classifier (the Specialist) with a powerful yet compressed conversational VLM (the Generalist), which is further enhanced by a knowledge graph-based retrieval module. Result: Experiments on a curated multimodal dermatology dataset demonstrated that CLARIFY achieves an 18% improvement in diagnostic accuracy over the strongest baseline while reducing the average VRAM requirement and latency by at least 20% and 5%, respectively. Conclusion: The Specialist-Generalist system, as exemplified by CLARIFY, provides a practical and powerful paradigm for building lightweight, trustworthy, and clinically viable AI systems in dermatology. Abstract: Vision-language models (VLMs) have shown significant potential for medical tasks; however, their general-purpose nature can limit specialized diagnostic accuracy, and their large size poses substantial inference costs for real-world clinical deployment. To address these challenges, we introduce CLARIFY, a Specialist-Generalist framework for dermatological visual question answering (VQA). CLARIFY combines two components: (i) a lightweight, domain-trained image classifier (the Specialist) that provides fast and highly accurate diagnostic predictions, and (ii) a powerful yet compressed conversational VLM (the Generalist) that generates natural language explanations to user queries. In our framework, the Specialist's predictions directly guide the Generalist's reasoning, focusing it on the correct diagnostic path. This synergy is further enhanced by a knowledge graph-based retrieval module, which grounds the Generalist's responses in factual dermatological knowledge, ensuring both accuracy and reliability. 
This hierarchical design not only reduces diagnostic errors but also significantly improves computational efficiency. Experiments on our curated multimodal dermatology dataset demonstrate that CLARIFY achieves an 18% improvement in diagnostic accuracy over the strongest baseline, a fine-tuned, uncompressed single-line VLM, while reducing the average VRAM requirement and latency by at least 20% and 5%, respectively. These results indicate that a Specialist-Generalist system provides a practical and powerful paradigm for building lightweight, trustworthy, and clinically viable AI systems.

[63] VQualA 2025 Challenge on Face Image Quality Assessment: Methods and Results

Sizhuo Ma,Wei-Ting Chen,Qiang Gao,Jian Wang,Chris Wei Zhou,Wei Sun,Weixia Zhang,Linhan Cao,Jun Jia,Xiangyang Zhu,Dandan Zhu,Xiongkuo Min,Guangtao Zhai,Baoying Chen,Xiongwei Xiao,Jishen Zeng,Wei Wu,Tiexuan Lou,Yuchen Tan,Chunyi Song,Zhiwei Xu,MohammadAli Hamidi,Hadi Amirpour,Mingyin Bai,Jiawang Du,Zhenyu Jiang,Zilong Lu,Ziguan Cui,Zongliang Gan,Xinpeng Li,Shiqi Jiang,Chenhui Li,Changbo Wang,Weijun Yuan,Zhan Li,Yihang Chen,Yifan Deng,Ruting Deng,Zhanglu Chen,Boyang Yao,Shuling Zheng,Feng Zhang,Zhiheng Fu,Abhishek Joshi,Aman Agarwal,Rakhil Immidisetti,Ajay Narasimha Mopidevi,Vishwajeet Shukla,Hao Yang,Ruikun Zhang,Liyuan Pan,Kaixin Deng,Hang Ouyang,Fan yang,Zhizun Luo,Zhuohang Shi,Songning Lai,Weilin Ruan,Yutao Yue

Main category: cs.CV

TL;DR: The VQualA 2025 Challenge aimed to develop efficient Face Image Quality Assessment models, drawing significant participation and submissions to improve practical FIQA approaches.

Details Motivation: Real-world conditions often introduce degradations in face images, affecting image quality and hindering subsequent tasks, necessitating effective Face Image Quality Assessment models. Method: Organizing the VQualA 2025 Challenge on Face Image Quality Assessment with constraints on model efficiency and evaluating submissions based on correlation metrics on a dataset of in-the-wild face images. Result: The challenge attracted 127 participants and 1519 final submissions, indicating strong engagement and interest in the field. Conclusion: The VQualA 2025 Challenge on Face Image Quality Assessment successfully attracted numerous participants and submissions, contributing to the advancement of practical FIQA approaches. Abstract: Face images play a crucial role in numerous applications; however, real-world conditions frequently introduce degradations such as noise, blur, and compression artifacts, affecting overall image quality and hindering subsequent tasks. To address this challenge, we organized the VQualA 2025 Challenge on Face Image Quality Assessment (FIQA) as part of the ICCV 2025 Workshops. Participants created lightweight and efficient models (limited to 0.5 GFLOPs and 5 million parameters) for the prediction of Mean Opinion Scores (MOS) on face images with arbitrary resolutions and realistic degradations. Submissions underwent comprehensive evaluations through correlation metrics on a dataset of in-the-wild face images. This challenge attracted 127 participants, with 1519 final submissions. This report summarizes the methodologies and findings for advancing the development of practical FIQA approaches.
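The abstract says submissions were scored with correlation metrics against Mean Opinion Scores but does not name them. Pearson (linear) and Spearman (rank) correlation are common choices in quality assessment, so the sketch below assumes those two; it is not the challenge's official scoring code.

```python
import math

def pearson(x, y):
    """Linear correlation; undefined (division by zero) if either input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def ranks(values):
    """1-based ranks, with tied values sharing their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result

def spearman(x, y):
    """Rank correlation: Pearson correlation computed on the ranks."""
    return pearson(ranks(x), ranks(y))
```

A predictor that orders images correctly but on a nonlinear scale scores a perfect Spearman correlation while its Pearson correlation stays below 1, which is why the two are often reported together.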

[64] Context-Aware Zero-Shot Anomaly Detection in Surveillance Using Contrastive and Predictive Spatiotemporal Modeling

Md. Rashid Shahriar Khan,Md. Abrar Hasan,Mohammod Tareq Aziz Justice

Main category: cs.CV

TL;DR: This paper proposes a novel context-aware zero-shot anomaly detection framework for surveillance that combines TimeSformer, DPC, and CLIP to model spatiotemporal dynamics and semantic context, enabling the detection of anomalies without prior exposure to such events.

Details Motivation: Detecting anomalies in surveillance footage is inherently challenging due to their unpredictable and context-dependent nature. This work aims to identify abnormal events without exposure to anomaly examples during training. Method: The method involves a hybrid architecture combining TimeSformer, DPC, and CLIP to model spatiotemporal dynamics and semantic context. It uses InfoNCE and CPC losses for joint training and includes a context-gating mechanism to enhance decision-making. Result: A context-aware zero-shot anomaly detection framework was developed that can generalize to previously unseen behaviors in complex environments. Conclusion: The framework successfully bridges the gap between temporal reasoning and semantic context in zero-shot anomaly detection for surveillance by integrating predictive modeling with vision-language understanding. Abstract: Detecting anomalies in surveillance footage is inherently challenging due to their unpredictable and context-dependent nature. This work introduces a novel context-aware zero-shot anomaly detection framework that identifies abnormal events without exposure to anomaly examples during training. The proposed hybrid architecture combines TimeSformer, DPC, and CLIP to model spatiotemporal dynamics and semantic context. TimeSformer serves as the vision backbone to extract rich spatial-temporal features, while DPC forecasts future representations to identify temporal deviations. Furthermore, a CLIP-based semantic stream enables concept-level anomaly detection through context-specific text prompts. These components are jointly trained using InfoNCE and CPC losses, aligning visual inputs with their temporal and semantic representations. A context-gating mechanism further enhances decision-making by modulating predictions with scene-aware cues or global video features. 
By integrating predictive modeling with vision-language understanding, the system can generalize to previously unseen behaviors in complex environments. This framework bridges the gap between temporal reasoning and semantic context in zero-shot anomaly detection for surveillance. The code for this research has been made available at https://github.com/NK-II/Context-Aware-ZeroShot-Anomaly-Detection-in-Surveillance.
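For readers unfamiliar with the InfoNCE loss named in the abstract, here is a minimal batch-wise sketch. It treats `keys[i]` as the positive for `queries[i]` and every other key in the batch as a negative; the cosine similarity, the temperature value, and the pairing scheme are illustrative assumptions, not the authors' training configuration.

```python
import math

def info_nce(queries, keys, temperature=0.1):
    """Mean InfoNCE loss over a batch of (query, positive key) pairs."""
    def normalize(v):
        norm = math.sqrt(sum(x * x for x in v))
        return [x / norm for x in v]

    q = [normalize(v) for v in queries]
    k = [normalize(v) for v in keys]
    total = 0.0
    for i, qi in enumerate(q):
        # Cosine similarity of this query to every key, scaled by temperature.
        logits = [sum(a * b for a, b in zip(qi, kj)) / temperature for kj in k]
        # Numerically stable log-sum-exp over all keys in the batch.
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_denom - logits[i]
    return total / len(q)
```

The loss is small when each query is closest to its own key and large when it is closer to another item's key, which is what pushes paired visual, temporal, or text representations together.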

[65] DoGFlow: Self-Supervised LiDAR Scene Flow via Cross-Modal Doppler Guidance

Ajinkya Khoche,Qingwen Zhang,Yixi Cai,Sina Sharif Mansouri,Patric Jensfelt

Main category: cs.CV

TL;DR: DoGFlow is a novel self-supervised framework for LiDAR scene flow estimation that recovers full 3D object motions without any manual annotations.

Details Motivation: Existing self-supervised methods struggle to match the performance of fully supervised approaches, especially in challenging long-range and adverse weather scenarios, while supervised methods do not scale because they rely on expensive human labeling. Method: DoGFlow uses a cross-modal label transfer approach: it computes motion pseudo-labels in real time directly from 4D radar Doppler measurements and transfers them to the LiDAR domain using dynamic-aware association and ambiguity-resolved propagation. Result: On the MAN TruckScenes dataset, DoGFlow substantially outperforms existing self-supervised methods and improves label efficiency, enabling LiDAR models to reach over 90% of fully supervised performance with only 10% of the ground truth data. Conclusion: DoGFlow achieves LiDAR scene flow estimation without manual annotations, substantially outperforms existing self-supervised methods on the MAN TruckScenes dataset, and lets LiDAR models reach over 90% of fully supervised performance with only 10% of the ground truth data. Abstract: Accurate 3D scene flow estimation is critical for autonomous systems to navigate dynamic environments safely, but creating the necessary large-scale, manually annotated datasets remains a significant bottleneck for developing robust perception models. Current self-supervised methods struggle to match the performance of fully supervised approaches, especially in challenging long-range and adverse weather scenarios, while supervised methods are not scalable due to their reliance on expensive human labeling. We introduce DoGFlow, a novel self-supervised framework that recovers full 3D object motions for LiDAR scene flow estimation without requiring any manual ground truth annotations. This paper presents our cross-modal label transfer approach, where DoGFlow computes motion pseudo-labels in real-time directly from 4D radar Doppler measurements and transfers them to the LiDAR domain using dynamic-aware association and ambiguity-resolved propagation. On the challenging MAN TruckScenes dataset, DoGFlow substantially outperforms existing self-supervised methods and improves label efficiency by enabling LiDAR backbones to achieve over 90% of fully supervised performance with only 10% of the ground truth data. For more details, please visit https://ajinkyakhoche.github.io/DogFlow/
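A Doppler measurement observes only the component of a point's velocity along the line of sight, which is one reason recovering full 3D motion from radar needs extra steps. The short sketch below shows that projection for a sensor at the origin; how DoGFlow itself resolves the missing tangential component is not described in the abstract, so nothing here should be read as its algorithm.

```python
import math

def radial_velocity(position, velocity):
    """Component of a 3D velocity along the sensor-to-point direction.

    Assumes the sensor sits at the origin and is itself stationary.
    """
    r = math.sqrt(sum(p * p for p in position))
    return sum(p * v for p, v in zip(position, velocity)) / r

# A point 10 m ahead moving 3 m/s away and 4 m/s sideways: only the 3 m/s is observed.
observed = radial_velocity((10.0, 0.0, 0.0), (3.0, 4.0, 0.0))
```

Purely sideways motion yields a radial velocity of zero, so different 3D motions can produce the same Doppler reading.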

[66] SAT-SKYLINES: 3D Building Generation from Satellite Imagery and Coarse Geometric Priors

Zhangyu Jin,Andrew Feng

Main category: cs.CV

TL;DR: SatSkylines generates detailed 3D buildings from satellite images and simple geometric shapes efficiently and effectively, using a novel transformation model and a large dataset.

Details Motivation: Existing image-based 3D generation methods struggle with top-down satellite views, and detailization methods require highly detailed inputs like voxels, which limits flexibility and performance. Method: SatSkylines models the transformation from noisy, coarse priors to detailed geometries, supported by a large-scale dataset called Skylines-50K containing over 50,000 unique 3D building assets. Result: The approach achieves accurate and detailed 3D building generation with flexible geometric control and without additional computational cost. Conclusion: SatSkylines offers an effective and computationally efficient method for generating detailed 3D building models from satellite imagery and coarse geometric priors, with strong generalization ability. Abstract: We present SatSkylines, a 3D building generation approach that takes satellite imagery and coarse geometric priors. Without proper geometric guidance, existing image-based 3D generation methods struggle to recover accurate building structures from the top-down views of satellite images alone. On the other hand, 3D detailization methods tend to rely heavily on highly detailed voxel inputs and fail to produce satisfying results from simple priors such as cuboids. To address these issues, our key idea is to model the transformation from interpolated noisy coarse priors to detailed geometries, enabling flexible geometric control without additional computational cost. We have further developed Skylines-50K, a large-scale dataset of over 50,000 unique and stylized 3D building assets, to support the generation of detailed building models. Extensive evaluations indicate the effectiveness of our model and its strong generalization ability.

[67] Adaptive Visual Navigation Assistant in 3D RPGs

Kaijie Xu,Clark Verbrugge

Main category: cs.CV

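TL;DR: This work defines a new problem, proposes a deep-learning framework for detecting and selecting key transition points in game environments, and examines the efficiency of model adaptation.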

Details Motivation: In complex 3D game environments, quickly identifying map transition points is important for client-side auto-mapping and for evaluating map cue presentation; this study aims to open a new research direction. Method: A two-stage deep-learning pipeline is used: the first stage detects potential STPs with Faster R-CNN, and the second ranks them with a lightweight MSTP selector that fuses local and global visual features; parameter-efficient adapters and an optional retrieval-augmented fusion step are also introduced. Result: Experiments show that full-network fine-tuning performs better when data is sufficient, whereas adapter-only transfer is more robust and effective in low-data settings, especially for the MSTP selection task. Conclusion: The paper proposes a two-stage deep-learning framework for detecting Spatial Transition Points (STPs) and the Main STP (MSTP) in game environments, laying a foundation for AI-driven navigation aids and data-driven map design tools. Abstract: In complex 3D game environments, players rely on visual affordances to spot map transition points. Efficient identification of such points is important to client-side auto-mapping, and provides an objective basis for evaluating map cue presentation. In this work, we formalize the task of detecting traversable Spatial Transition Points (STPs), i.e., connectors between two sub-regions, and selecting the singular Main STP (MSTP), the unique STP that lies on the designer-intended critical path toward the player's current macro-objective, from a single game frame, proposing this as a new research focus. We introduce a two-stage deep-learning pipeline that first detects potential STPs using Faster R-CNN and then ranks them with a lightweight MSTP selector that fuses local and global visual features. Both stages benefit from parameter-efficient adapters, and we further introduce an optional retrieval-augmented fusion step. Our primary goal is to establish the feasibility of this problem and set baseline performance metrics. We validate our approach on a custom-built, diverse dataset collected from five Action RPG titles. Our experiments reveal a key trade-off: while full-network fine-tuning produces superior STP detection with sufficient data, adapter-only transfer is significantly more robust and effective in low-data scenarios and for the MSTP selection task. By defining this novel problem, providing a baseline pipeline and dataset, and offering initial insights into efficient model adaptation, we aim to contribute to future AI-driven navigation aids and data-informed level-design tools.

[68] Wan-S2V: Audio-Driven Cinematic Video Generation

Xin Gao,Li Hu,Siqi Hu,Mingyang Huang,Chaonan Ji,Dechao Meng,Jinwei Qi,Penchong Qiao,Zhen Shen,Yafei Song,Ke Sun,Linrui Tian,Guangyuan Wang,Qi Wang,Zhongjian Wang,Jiayu Xiao,Sheng Xu,Bang Zhang,Peng Zhang,Xindi Zhang,Zhe Zhang,Jingren Zhou,Lian Zhuo

Main category: cs.CV

TL;DR: Proposes Wan-S2V, a new method for high-quality audio-driven character animation that outperforms existing techniques.

Details Motivation: Current state-of-the-art audio-driven character animation methods fall short in complex film and television productions, which call for a higher-quality solution. Method: An audio-driven model is built on Wan and validated through extensive experiments. Result: Wan-S2V significantly outperforms existing methods such as Hunyuan-Avatar and Omnihuman in cinematic contexts and shows strong versatility. Conclusion: Wan-S2V delivers higher-quality audio-driven character animation suited to complex film and television production scenarios. Abstract: Current state-of-the-art (SOTA) methods for audio-driven character animation demonstrate promising performance for scenarios primarily involving speech and singing. However, they often fall short in more complex film and television productions, which demand sophisticated elements such as nuanced character interactions, realistic body movements, and dynamic camera work. To address this long-standing challenge of achieving film-level character animation, we propose an audio-driven model, which we refer to as Wan-S2V, built upon Wan. Our model achieves significantly enhanced expressiveness and fidelity in cinematic contexts compared to existing approaches. We conducted extensive experiments, benchmarking our method against cutting-edge models such as Hunyuan-Avatar and Omnihuman. The experimental results consistently demonstrate that our approach significantly outperforms these existing solutions. Additionally, we explore the versatility of our method through its applications in long-form video generation and precise video lip-sync editing.

[69] Decouple, Reorganize, and Fuse: A Multimodal Framework for Cancer Survival Prediction

Huayi Wang,Haochao Ying,Yuyang Xu,Qibo Qiu,Cheng Zhang,Danny Z. Chen,Ying Sun,Jian Wu

Main category: cs.CV

TL;DR: DeReF improves cancer survival analysis by dynamically reorganizing and fusing features across medical modalities, enhancing model generalization and performance.

Details Motivation: Current methods for cancer survival analysis suffer from over-reliance on fixed fusion schemes and limited information interaction in MoE-based approaches, which motivated the development of a more dynamic and interactive framework. Method: DeReF integrates a random feature reorganization strategy and dynamic MoE fusion modules, along with a regional cross-attention network to enhance feature representation and interaction. Result: Extensive experiments on Liver Cancer and TCGA datasets demonstrate the effectiveness of DeReF in improving survival-time prediction accuracy. Conclusion: The proposed DeReF framework effectively addresses the challenges in cancer survival analysis by enhancing feature combination diversity and improving information interaction among decoupled features through a dynamic fusion approach. Abstract: Cancer survival analysis commonly integrates information across diverse medical modalities to make survival-time predictions. Existing methods primarily focus on extracting different decoupled features of modalities and performing fusion operations such as concatenation, attention, and MoE-based (Mixture-of-Experts) fusion. However, these methods still face two key challenges: i) Fixed fusion schemes (concatenation and attention) can lead to model over-reliance on predefined feature combinations, limiting the dynamic fusion of decoupled features; ii) in MoE-based fusion methods, each expert network handles separate decoupled features, which limits information interaction among the decoupled features. 
To address these challenges, we propose a novel Decoupling-Reorganization-Fusion framework (DeReF), which devises a random feature reorganization strategy between the modality decoupling and dynamic MoE fusion modules. Its advantages are: i) it increases the diversity of feature combinations and granularity, enhancing the generalization ability of the subsequent expert networks; ii) it overcomes the problem of information closure and helps expert networks better capture information among decoupled features. Additionally, we incorporate a regional cross-attention network within the modality decoupling module to improve the representation quality of decoupled features. Extensive experimental results on our in-house Liver Cancer (LC) and three widely used TCGA public datasets confirm the effectiveness of our proposed method. The code will be made publicly available.
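One way to picture a random reorganization step followed by Mixture-of-Experts fusion is sketched below. The chunking scheme, the equal-length feature vectors, and the dense softmax gate are assumptions chosen for a compact example; the abstract does not specify how DeReF implements either step.

```python
import math
import random

def reorganize(features, chunk_size, rng):
    """Cut each decoupled feature vector into chunks, shuffle all chunks,
    and regroup them into new vectors of the original length.

    Assumes every vector has the same length, divisible by chunk_size.
    """
    chunks = [f[i:i + chunk_size] for f in features
              for i in range(0, len(f), chunk_size)]
    rng.shuffle(chunks)
    per_vector = len(features[0]) // chunk_size
    return [sum(chunks[i * per_vector:(i + 1) * per_vector], [])
            for i in range(len(features))]

def moe_fuse(expert_outputs, gate_logits):
    """Softmax-weighted sum of expert outputs (dense gating)."""
    m = max(gate_logits)
    exps = [math.exp(g - m) for g in gate_logits]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(expert_outputs[0])
    return [sum(w * out[d] for w, out in zip(weights, expert_outputs))
            for d in range(dim)]

mixed = reorganize([[1, 2, 3, 4], [5, 6, 7, 8]], 2, random.Random(0))
```

After shuffling, each regrouped vector can contain pieces from more than one decoupled feature, so an expert that previously saw a single feature type now sees mixtures.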

[70] ROSE: Remove Objects with Side Effects in Videos

Chenxuan Miao,Yutong Feng,Jianshu Zeng,Zixiang Gao,Hantang Liu,Yunfeng Yan,Donglian Qi,Xi Chen,Bin Wang,Hengshuang Zhao

Main category: cs.CV

TL;DR: ROSE is a framework for video object removal that effectively handles side effects like shadows and reflections using synthetic data and a diffusion transformer model, achieving excellent performance and generalization.

Details Motivation: Existing video object removal methods struggle with eliminating side effects such as shadows and reflections due to the scarcity of paired video data. This work aims to systematically address these challenges. Method: ROSE utilizes a 3D rendering engine for synthetic data generation and implements a video inpainting model based on diffusion transformer. It uses reference-based erasing and additional supervision to predict areas affected by side effects. Result: ROSE achieves superior performance compared to existing models in removing objects and their side effects, validated through a new benchmark called ROSE-Bench. Conclusion: ROSE demonstrates superior performance in video object removal, particularly in handling side effects like shadows and reflections, and generalizes well to real-world scenarios. Abstract: Video object removal has achieved advanced performance due to the recent success of video generative models. However, when addressing the side effects of objects, e.g., their shadows and reflections, existing works struggle to eliminate these effects for the scarcity of paired video data as supervision. This paper presents ROSE, termed Remove Objects with Side Effects, a framework that systematically studies the object's effects on environment, which can be categorized into five common cases: shadows, reflections, light, translucency and mirror. Given the challenges of curating paired videos exhibiting the aforementioned effects, we leverage a 3D rendering engine for synthetic data generation. We carefully construct a fully-automatic pipeline for data preparation, which simulates a large-scale paired dataset with diverse scenes, objects, shooting angles, and camera trajectories. ROSE is implemented as an video inpainting model built on diffusion transformer. To localize all object-correlated areas, the entire video is fed into the model for reference-based erasing. 
Moreover, additional supervision is introduced to explicitly predict the areas affected by side effects, which can be revealed through the differential mask between the paired videos. To fully investigate model performance across the various side effects, we present a new benchmark, dubbed ROSE-Bench, incorporating both common scenarios and the five special side effects for comprehensive evaluation. Experimental results demonstrate that ROSE achieves superior performance compared to existing video object erasing models and generalizes well to real-world video scenarios. The project page is https://rose2025-inpaint.github.io/.
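
The "differential mask between the paired videos" mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not say how ROSE computes the mask, so the function name `differential_mask`, the grayscale frame representation, and the threshold value are all assumptions made for illustration only.

```python
def differential_mask(frame_with_object, frame_without_object, threshold=0.05):
    """Binary mask of pixels that differ between two paired frames.

    Frames are 2D lists of grayscale intensities in [0, 1]. The threshold
    is a hypothetical value, not one taken from the paper.
    """
    return [
        [1 if abs(a - b) > threshold else 0 for a, b in zip(row_a, row_b)]
        for row_a, row_b in zip(frame_with_object, frame_without_object)
    ]


# Toy paired frames: the object sits at the center pixel and casts a
# darker "shadow" on the pixel below it.
frame_with = [
    [0.2, 0.2, 0.2],
    [0.2, 0.9, 0.2],
    [0.2, 0.1, 0.2],
]
frame_without = [
    [0.2, 0.2, 0.2],
    [0.2, 0.2, 0.2],
    [0.2, 0.2, 0.2],
]

mask = differential_mask(frame_with, frame_without)
for row in mask:
    print(row)
```

Because the synthetic pair differs only where the object and its side effects appear, the mask covers both the object pixel and the shadow pixel, which is the kind of region the additional supervision is described as predicting.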

[71] OwlCap: Harmonizing Motion-Detail for Video Captioning via HMD-270K and Caption Set Equivalence Reward

Chunlin Zhong,Qiuxia Hou,Zhangjun Zhou,Shuang Hao,Haonan Lu,Yanhao Zhang,He Tang,Xiang Bai

Main category: cs.CV

TL;DR: This paper proposes OwlCap, a new video captioning method, and builds a new dataset, HMD-270K, to address the motion-detail imbalance in existing methods.

Details Motivation: Existing video captioning methods often suffer from motion-detail imbalance, which yields incomplete captions and undermines consistency in video understanding and generation. Method: The solution has two parts: 1) data: the HMD-270K dataset is built through a two-stage pipeline; 2) optimization: a GRPO-based CSER is introduced. Result: Experiments show that OwlCap improves significantly on two benchmarks: +4.2 accuracy on the detail-focused VDC and +4.6 F1 on the motion-focused DREAM-1K. Conclusion: OwlCap is a powerful video captioning multi-modal large language model with motion-detail balance. The HMD-270K dataset and the OwlCap model will be publicly released to support the video captioning research community. Abstract: Video captioning aims to generate comprehensive and coherent descriptions of the video content, contributing to the advancement of both video understanding and generation. However, existing methods often suffer from motion-detail imbalance, as models tend to overemphasize one aspect while neglecting the other. This imbalance results in incomplete captions, which in turn leads to a lack of consistency in video understanding and generation. To address this issue, we propose solutions from two aspects: 1) Data aspect: We constructed the Harmonizing Motion-Detail 270K (HMD-270K) dataset through a two-stage pipeline: Motion-Detail Fusion (MDF) and Fine-Grained Examination (FGE). 2) Optimization aspect: We introduce the Caption Set Equivalence Reward (CSER) based on Group Relative Policy Optimization (GRPO). CSER enhances completeness and accuracy in capturing both motion and details through unit-to-set matching and bidirectional validation. Based on the HMD-270K supervised fine-tuning and GRPO post-training with CSER, we developed OwlCap, a powerful video captioning multi-modal large language model (MLLM) with motion-detail balance. Experimental results demonstrate that OwlCap achieves significant improvements compared to baseline models on two benchmarks: the detail-focused VDC (+4.2 Acc) and the motion-focused DREAM-1K (+4.6 F1). The HMD-270K dataset and OwlCap model will be publicly released to facilitate advances in the video captioning research community.
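
For readers unfamiliar with GRPO, the sketch below shows the group-relative normalization step that gives the method its name: rewards for several captions sampled for the same video are standardized within the group. It does not implement CSER itself, whose unit-to-set matching is not specified in the abstract. The reward values are made up, and the use of the population standard deviation is an assumption (implementations differ on this).

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards):
    """Standardize rewards within one sampled group (GRPO-style).

    Returns zeros when every reward in the group is identical, since the
    group then carries no preference signal.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std: an assumption of this sketch
    if sigma == 0:
        return [0.0 for _ in rewards]
    return [(r - mu) / sigma for r in rewards]


# Hypothetical caption rewards for four samples of the same video.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
print(advantages)
```

Captions scoring above the group mean receive positive advantages and those below receive negative ones, so the policy update depends only on relative quality within the group.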

[72] Clustering-based Feature Representation Learning for Oracle Bone Inscriptions Detection

Ye Tao,Xinran Fu,Honglin Pang,Xi Yang,Chuntao Li

Main category: cs.CV

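TL;DR: This paper proposes a new clustering-based feature learning method that uses an oracle bone character font library to optimize detection networks, significantly improving detection on oracle bone rubbing images across multiple frameworks.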

Details Motivation: Noise, cracks, and other degradation limit the effectiveness of conventional detection networks on oracle bone rubbing images, so a more effective method is needed to improve detection performance. Method: A clustering-based feature space representation learning method is proposed, with a loss function derived from the clustering results to optimize feature representation; this loss is integrated into the total network loss. Result: Experiments on two oracle bone inscription detection datasets with three mainstream detection frameworks (Faster R-CNN, DETR, and Sparse R-CNN) all show significant performance improvements. Conclusion: The paper proposes a clustering-based feature space representation learning method that uses the oracle bone character font library dataset as prior knowledge, effectively improving detection network performance, and validates it on several mainstream detection frameworks. Abstract: Oracle Bone Inscriptions (OBIs) play a crucial role in understanding ancient Chinese civilization. The automated detection of OBIs from rubbing images represents a fundamental yet challenging task in digital archaeology, primarily due to various degradation factors including noise and cracks that limit the effectiveness of conventional detection networks. To address these challenges, we propose a novel clustering-based feature space representation learning method. Our approach uniquely leverages the Oracle Bones Character (OBC) font library dataset as prior knowledge to enhance feature extraction in the detection network through clustering-based representation learning. The method incorporates a specialized loss function derived from clustering results to optimize feature representation, which is then integrated into the total network loss. We validate the effectiveness of our method by conducting experiments on two OBI detection datasets using three mainstream detection frameworks: Faster R-CNN, DETR, and Sparse R-CNN. Through extensive experimentation, all frameworks demonstrate significant performance improvements.

[73] SFormer: SNR-guided Transformer for Underwater Image Enhancement from the Frequency Domain

Xin Tian,Yingtie Lei,Xiujun Zhang,Zimeng Li,Chi-Man Pun,Xuhang Chen

Main category: cs.CV

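TL;DR: This paper proposes SFormer, a framework that uses a frequency-domain SNR prior and a transformer architecture to achieve strong underwater image enhancement performance, outperforming existing methods.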

Details Motivation: Existing spatial-domain SNR prior methods suffer from cross-channel interference and have limited ability to amplify informative structures when processing underwater images, so a more effective way to improve underwater image quality is needed. Method: A new underwater image enhancement framework, SFormer, is proposed, comprising the Fourier Attention SNR-prior Transformer (FAST) and the Frequency Adaptive Transformer (FAT), which use the frequency-domain SNR prior for feature decomposition and modulation. Result: Trained on 4,800 paired images, SFormer achieves a 3.1 dB gain in PSNR and a 0.08 gain in SSIM, effectively restoring colors, textures, and contrast in underwater scenes. Conclusion: By combining a frequency-domain SNR prior with attention mechanisms, SFormer performs well on underwater image enhancement and surpasses existing methods. Abstract: Recent learning-based underwater image enhancement (UIE) methods have advanced by incorporating physical priors into deep neural networks, particularly using the signal-to-noise ratio (SNR) prior to reduce wavelength-dependent attenuation. However, spatial domain SNR priors have two limitations: (i) they cannot effectively separate cross-channel interference, and (ii) they provide limited help in amplifying informative structures while suppressing noise. To overcome these, we propose using the SNR prior in the frequency domain, decomposing features into amplitude and phase spectra for better channel modulation. We introduce the Fourier Attention SNR-prior Transformer (FAST), combining spectral interactions with SNR cues to highlight key spectral components. Additionally, the Frequency Adaptive Transformer (FAT) bottleneck merges low- and high-frequency branches using a gated attention mechanism to enhance perceptual quality. Embedded in a unified U-shaped architecture, these modules integrate a conventional RGB stream with an SNR-guided branch, forming SFormer. Trained on 4,800 paired images from UIEB, EUVP, and LSUI, SFormer surpasses recent methods with a 3.1 dB gain in PSNR and 0.08 in SSIM, successfully restoring colors, textures, and contrast in underwater scenes.
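
The idea of "decomposing features into amplitude and phase spectra" can be made concrete with a small 1D sketch. It uses a naive discrete Fourier transform on a toy signal. SFormer's modulation is learned and operates on image features, so the fixed scaling below is only a stand-in, and all function names are hypothetical.

```python
import cmath
import math


def dft(signal):
    """Naive discrete Fourier transform of a real-valued sequence."""
    n = len(signal)
    return [
        sum(x * cmath.exp(-2j * math.pi * k * t / n) for t, x in enumerate(signal))
        for k in range(n)
    ]


def idft(spectrum):
    """Inverse DFT, returning the real part of each sample."""
    n = len(spectrum)
    return [
        sum(c * cmath.exp(2j * math.pi * k * t / n) for k, c in enumerate(spectrum)).real / n
        for t in range(n)
    ]


def amplitude_phase(signal):
    """Split a signal's spectrum into amplitude and phase components."""
    spectrum = dft(signal)
    return [abs(c) for c in spectrum], [cmath.phase(c) for c in spectrum]


def recombine(amplitude, phase):
    """Rebuild a signal from (possibly modulated) amplitude and phase."""
    return idft([cmath.rect(a, p) for a, p in zip(amplitude, phase)])


signal = [1.0, 2.0, 0.0, -1.0]
amplitude, phase = amplitude_phase(signal)

# Unmodified spectra reconstruct the input.
restored = recombine(amplitude, phase)
# Scaling the amplitude while keeping the phase rescales intensity
# without moving structure (a fixed stand-in for learned modulation).
boosted = recombine([2.0 * a for a in amplitude], phase)
print(restored, boosted)
```

The two spectra can be adjusted independently, which is what makes the frequency domain a convenient place to apply a prior to one component without disturbing the other.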

[74] Hierarchical Spatio-temporal Segmentation Network for Ejection Fraction Estimation in Echocardiography Videos

Dongfang Wang,Jian Yang,Yizhe Zhang,Tao Zhou

Main category: cs.CV

TL;DR: This paper proposes a novel hierarchical spatio-temporal segmentation network with a Spatio-temporal Cross Scan module to improve the accuracy of Ejection Fraction estimation in echocardiography videos.

Details Motivation: The motivation is to enhance the accuracy of Ejection Fraction estimation in echocardiography videos, as existing methods, while achieving good segmentation results, fall short in EF estimation due to local error accumulation or neglect of details. Method: The method involves a hierarchical network design that combines convolutional networks for single-frame processing with the Mamba architecture for spatio-temporal modeling, enhanced by the STCS module for long-range context integration. Result: The proposed method improves EF estimation accuracy by effectively capturing both local details and global dynamics in echocardiography videos, mitigating biases caused by image noise and other factors. Conclusion: The proposed Hierarchical Spatio-temporal Segmentation Network, along with the Spatio-temporal Cross Scan module, effectively improves EF estimation accuracy in echocardiography videos by balancing local detail modeling and global dynamic perception. Abstract: Automated segmentation of the left ventricular endocardium in echocardiography videos is a key research area in cardiology. It aims to provide accurate assessment of cardiac structure and function through Ejection Fraction (EF) estimation. Although existing studies have achieved good segmentation performance, their segmentations do not translate into accurate EF estimates. In this paper, we propose a Hierarchical Spatio-temporal Segmentation Network for echocardiography video, aiming to improve EF estimation accuracy by synergizing local detail modeling with global dynamic perception. The network employs a hierarchical design, with low-level stages using convolutional networks to process single-frame images and preserve details, while high-level stages utilize the Mamba architecture to capture spatio-temporal relationships. 
The hierarchical design balances single-frame and multi-frame processing, avoiding issues such as local error accumulation when relying solely on single frames or neglecting details when using only multi-frame data. To overcome local spatio-temporal limitations, we propose the Spatio-temporal Cross Scan (STCS) module, which integrates long-range context through skip scanning across frames and positions. This approach helps mitigate EF calculation biases caused by ultrasound image noise and other factors.
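
To see why small segmentation errors matter for EF, recall the standard definition EF = (EDV - ESV) / EDV x 100, where EDV and ESV are the end-diastolic and end-systolic left-ventricular volumes. The sketch below assumes per-frame volumes have already been derived from segmentation masks (the abstract does not describe that step) and takes the maximum and minimum over one cardiac cycle as EDV and ESV. All volume values are invented.

```python
def ejection_fraction(volumes_ml):
    """Ejection fraction (%) from per-frame left-ventricular volumes.

    Assumes the sequence covers one cardiac cycle, so the largest volume
    is end-diastolic (EDV) and the smallest is end-systolic (ESV).
    """
    edv = max(volumes_ml)
    esv = min(volumes_ml)
    if edv <= 0:
        raise ValueError("end-diastolic volume must be positive")
    return 100.0 * (edv - esv) / edv


# Hypothetical volumes (mL) over one cycle.
clean_volumes = [120.0, 100.0, 60.0, 80.0, 118.0]
# Same cycle, but noise makes one frame's segmentation too small.
noisy_volumes = [120.0, 100.0, 48.0, 80.0, 118.0]

ef_clean = ejection_fraction(clean_volumes)
ef_noisy = ejection_fraction(noisy_volumes)
print(ef_clean, ef_noisy)
```

Because EF depends on the extreme frames, a single mis-segmented frame shifts the estimate by ten percentage points in this toy case, which is the kind of bias that temporal context across frames is meant to reduce.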

[75] Feature-Space Planes Searcher: A Universal Domain Adaptation Framework for Interpretability and Computational Efficiency

Zhitong Cheng,Yiran Jiang,Yulong Ge,Yufeng Li,Zhongheng Qin,Rongzhi Lin,Jianwei Ma

Main category: cs.CV

TL;DR: The paper introduces FPS, a novel domain adaptation framework that optimizes decision boundaries while keeping the feature encoder frozen, achieving superior performance and scalability.

Details Motivation: Domain shift poses a challenge in deep learning deployment, and current UDA methods have limitations in efficiency, interpretability, and scalability. Method: Proposed Feature-space Planes Searcher (FPS), which optimizes decision boundaries while keeping the feature encoder frozen. Result: FPS achieves competitive or superior performance on public benchmarks, scales efficiently with multimodal large models, and shows versatility across various domains. Conclusion: FPS provides a simple, effective, and generalizable paradigm for transfer learning, especially in domain adaptation tasks. Abstract: Domain shift, characterized by degraded model performance during transition from labeled source domains to unlabeled target domains, poses a persistent challenge for deploying deep learning systems. Current unsupervised domain adaptation (UDA) methods predominantly rely on fine-tuning feature extractors - an approach limited by inefficiency, reduced interpretability, and poor scalability to modern architectures. Our analysis reveals that models pretrained on large-scale data exhibit domain-invariant geometric patterns in their feature space, characterized by intra-class clustering and inter-class separation, thereby preserving transferable discriminative structures. These findings indicate that domain shifts primarily manifest as boundary misalignment rather than feature degradation. Unlike fine-tuning entire pre-trained models - which risks introducing unpredictable feature distortions - we propose the Feature-space Planes Searcher (FPS): a novel domain adaptation framework that optimizes decision boundaries by leveraging these geometric patterns while keeping the feature encoder frozen. This streamlined approach enables interpretative analysis of adaptation while substantially reducing memory and computational costs through offline feature extraction, permitting full-dataset optimization in a single computation cycle. 
Evaluations on public benchmarks demonstrate that FPS achieves performance competitive with or superior to state-of-the-art methods. FPS scales efficiently with multimodal large models and shows versatility across diverse domains including protein structure prediction, remote sensing classification, and earthquake detection. We anticipate FPS will provide a simple, effective, and generalizable paradigm for transfer learning, particularly in domain adaptation tasks.

[76] A Novel Deep Hybrid Framework with Ensemble-Based Feature Optimization for Robust Real-Time Human Activity Recognition

Wasi Ullah,Yasir Noman Khalid,Saddam Hussain Khan

Main category: cs.CV

TL;DR: This paper proposes an optimized hybrid deep learning framework for Human Activity Recognition (HAR) that improves accuracy, reduces features, and supports real-time deployment on edge devices.

Details Motivation: HAR systems face challenges like high computational costs, redundant features, and limited scalability in real-time scenarios, which this study aims to address. Method: An optimized hybrid deep learning framework integrating a customized InceptionV3, LSTM architecture, and an ensemble-based feature selection strategy was introduced. Result: The approach achieved 99.65% recognition accuracy, reduced features to as few as 7, and improved inference time on the UCF-YouTube dataset. Conclusion: The proposed HAR system is lightweight and scalable, supporting real-time deployment on edge devices, which enables practical applications in various intelligent environments. Abstract: Human Activity Recognition (HAR) plays a pivotal role in various applications, including smart surveillance, healthcare, assistive technologies, sports analytics, etc. However, HAR systems still face critical challenges, including high computational costs, redundant features, and limited scalability in real-time scenarios. An optimized hybrid deep learning framework is introduced that integrates a customized InceptionV3, an LSTM architecture, and a novel ensemble-based feature selection strategy. The proposed framework first extracts spatial descriptors using the customized InceptionV3 model, which captures multilevel contextual patterns, region homogeneity, and fine-grained localization cues. The temporal dependencies across frames are then modeled using LSTMs to effectively encode motion dynamics. Finally, an ensemble-based genetic algorithm with Adaptive Dynamic Fitness Sharing and Attention (ADFSA) is employed to select a compact and optimized feature set by dynamically balancing objectives such as accuracy, redundancy, uniqueness, and complexity reduction. Consequently, the selected feature subsets, which are both diverse and discriminative, enable various lightweight machine learning classifiers to achieve accurate and robust HAR in heterogeneous environments. 
Experimental results on the UCF-YouTube dataset, which presents challenges such as occlusion, cluttered backgrounds, motion dynamics, and poor illumination, demonstrate good performance. The proposed approach achieves 99.65% recognition accuracy, reduces the feature set to as few as 7 features, and improves inference time. The lightweight and scalable nature of the HAR system supports real-time deployment on edge devices such as Raspberry Pi, enabling practical applications in intelligent, resource-aware environments, including public safety, assistive technology, and autonomous monitoring systems.

[77] ColorGS: High-fidelity Surgical Scene Reconstruction with Colored Gaussian Splatting

Qun Ji,Peng Li,Mingqiang Wei

Main category: cs.CV

TL;DR: ColorGS improves high-fidelity reconstruction of deformable tissues from endoscopic videos by addressing color expressiveness and deformation modeling limitations in existing methods.

Details Motivation: Existing methods struggle with capturing subtle color variations and modeling global deformations in reconstructing deformable tissues from endoscopic videos. Method: ColorGS introduces Colored Gaussian Primitives for adaptive color encoding and an Enhanced Deformation Model combining time-aware Gaussian basis functions with learnable time-independent deformations. Result: ColorGS achieves state-of-the-art performance with a PSNR of 39.85 and SSIM of 97.25%, while maintaining real-time rendering efficiency. Conclusion: ColorGS advances surgical scene reconstruction by balancing high fidelity with computational practicality, benefiting intraoperative guidance and AR/VR applications. Abstract: High-fidelity reconstruction of deformable tissues from endoscopic videos remains challenging due to the limitations of existing methods in capturing subtle color variations and modeling global deformations. While 3D Gaussian Splatting (3DGS) enables efficient dynamic reconstruction, its fixed per-Gaussian color assignment struggles with intricate textures, and linear deformation modeling fails to model consistent global deformation. To address these issues, we propose ColorGS, a novel framework that integrates spatially adaptive color encoding and enhanced deformation modeling for surgical scene reconstruction. First, we introduce Colored Gaussian Primitives, which employ dynamic anchors with learnable color parameters to adaptively encode spatially varying textures, significantly improving color expressiveness under complex lighting and tissue similarity. Second, we design an Enhanced Deformation Model (EDM) that combines time-aware Gaussian basis functions with learnable time-independent deformations, enabling precise capture of both localized tissue deformations and global motion consistency caused by surgical interactions. 
Extensive experiments on DaVinci robotic surgery videos and benchmark datasets (EndoNeRF, StereoMIS) demonstrate that ColorGS achieves state-of-the-art performance, attaining a PSNR of 39.85 (1.5 higher than prior 3DGS-based methods) and superior SSIM (97.25%) while maintaining real-time rendering efficiency. Our work advances surgical scene reconstruction by balancing high fidelity with computational practicality, critical for intraoperative guidance and AR/VR applications.

[78] Class-wise Flooding Regularization for Imbalanced Image Classification

Hiroaki Aizawa,Yuta Naito,Kohei Fukuda

Main category: cs.CV

TL;DR: This paper proposes class-wise flooding regularization to improve minority class recognition in imbalanced datasets by assigning class-specific flooding levels, leading to better generalization.

Details Motivation: When trained on imbalanced datasets, neural networks tend to favor majority classes, degrading the recognition performance of minority classes. This work aims to address that issue. Method: Class-wise flooding regularization assigns a class-specific flooding level based on class frequencies, suppressing overfitting in majority classes while allowing sufficient learning for minority classes. Result: The proposed method improves classification performance for minority classes compared to conventional flooding regularization techniques. Conclusion: Class-wise flooding regularization enhances the recognition performance of minority classes and improves overall generalization when training neural networks on imbalanced datasets. Abstract: The purpose of training neural networks is to achieve high generalization performance on unseen inputs. However, when trained on imbalanced datasets, a model's prediction tends to favor majority classes over minority classes, leading to significant degradation in the recognition performance of minority classes. To address this issue, we propose class-wise flooding regularization, an extension of flooding regularization applied at the class level. Flooding is a regularization technique that mitigates overfitting by preventing the training loss from falling below a predefined threshold, known as the flooding level, thereby discouraging memorization. Our proposed method assigns a class-specific flooding level based on class frequencies. By doing so, it suppresses overfitting in majority classes while allowing sufficient learning for minority classes. We validate our approach on imbalanced image classification. Compared to conventional flooding regularizations, our method improves the classification performance of minority classes and achieves better overall generalization.
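
Flooding replaces a training loss L with |L - b| + b, so the optimized value never falls below the flooding level b. The sketch below applies that formula per class. The abstract only says the level is "based on class frequencies", so the specific rule used here (level proportional to class count, giving majority classes the highest level) and the averaging over classes are assumptions of this example, as are all names and numbers.

```python
from collections import defaultdict


def flooding(loss, level):
    """Standard flooding: the returned value never drops below `level`."""
    return abs(loss - level) + level


def class_flood_levels(class_counts, max_level=0.1):
    """Hypothetical rule: flooding level proportional to class frequency."""
    largest = max(class_counts.values())
    return {label: max_level * n / largest for label, n in class_counts.items()}


def class_wise_flooded_loss(per_sample_losses, labels, levels):
    """Apply flooding to each class's mean loss, then average over classes."""
    by_class = defaultdict(list)
    for loss, label in zip(per_sample_losses, labels):
        by_class[label].append(loss)
    flooded = [
        flooding(sum(losses) / len(losses), levels[label])
        for label, losses in by_class.items()
    ]
    return sum(flooded) / len(flooded)


# Toy imbalanced setting: "cat" is the majority class, "owl" the minority.
levels = class_flood_levels({"cat": 900, "owl": 100})
batch_losses = [0.02, 0.04, 0.5]
batch_labels = ["cat", "cat", "owl"]
total = class_wise_flooded_loss(batch_losses, batch_labels, levels)
print(levels, total)
```

In this batch the majority-class mean loss (0.03) is already below its level (0.1), so flooding pushes it back up, while the minority-class loss (0.5) is far above its small level and is left to decrease normally.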

[79] Flatness-aware Curriculum Learning via Adversarial Difficulty

Hiroaki Aizawa,Yoshikazu Hayashi

Main category: cs.CV

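TL;DR: In flat regions, loss values and gradient norms become uniformly small, making it hard to assess sample difficulty and design an effective curriculum. To address this, the paper proposes a new Adversarial Difficulty Measure (ADM) and integrates it into curriculum-based learning to assess sample difficulty dynamically.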

Details Motivation: Neural network training suffers from overfitting, especially to specific samples or domains, leading to poor generalization. Although curriculum learning and Sharpness-Aware Minimization (SAM) are each effective at improving generalization, combining them is not straightforward. Method: A new Adversarial Difficulty Measure (ADM) is proposed, which quantifies adversarial vulnerability by leveraging the robustness properties of models trained toward flat minima, and is integrated into curriculum-based learning. Result: ADM remains informative about sample difficulty even after training enters flatter regions, because it measures the normalized loss gap between original and adversarial examples. Conclusion: Combining ADM with CL-based training with SAM performs well on image classification, fine-grained recognition, and domain generalization tasks while preserving the strengths of both CL and SAM. Abstract: Neural networks trained by empirical risk minimization often suffer from overfitting, especially to specific samples or domains, which leads to poor generalization. Curriculum Learning (CL) addresses this issue by selecting training samples based on their difficulty. From the optimization perspective, methods such as Sharpness-Aware Minimization (SAM) improve robustness and generalization by seeking flat minima. However, combining CL with SAM is not straightforward. In flat regions, both the loss values and the gradient norms tend to become uniformly small, which makes it difficult to evaluate sample difficulty and design an effective curriculum. To overcome this problem, we propose the Adversarial Difficulty Measure (ADM), which quantifies adversarial vulnerability by leveraging the robustness properties of models trained toward flat minima. Unlike loss- or gradient-based measures, which become ineffective as training progresses into flatter regions, ADM remains informative by measuring the normalized loss gap between original and adversarial examples. We incorporate ADM into CL-based training with SAM to dynamically assess sample difficulty. We evaluated our approach on image classification tasks, fine-grained recognition, and domain generalization. The results demonstrate that our method preserves the strengths of both CL and SAM while outperforming existing curriculum-based and flatness-aware training strategies.
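
A rough sketch of a "normalized loss gap between original and adversarial examples" follows. The abstract gives neither the normalization nor the attack used to build the adversarial examples, so this example takes both losses as given and normalizes the gap by their sum. That formula, the easy-to-hard ordering, and every name below are assumptions for illustration, not the paper's definition.

```python
def adversarial_difficulty(clean_loss, adv_loss, eps=1e-12):
    """Hypothetical normalized gap between adversarial and clean loss."""
    return (adv_loss - clean_loss) / (adv_loss + clean_loss + eps)


def curriculum_order(clean_losses, adv_losses):
    """Sample indices sorted from least to most adversarially vulnerable."""
    scores = [
        adversarial_difficulty(c, a) for c, a in zip(clean_losses, adv_losses)
    ]
    return sorted(range(len(scores)), key=scores.__getitem__)


# In a flat region the clean losses are uniformly small and say little...
clean_losses = [0.01, 0.01, 0.01]
# ...but the (made-up) losses on adversarial versions still differ.
adv_losses = [0.02, 0.50, 0.011]

order = curriculum_order(clean_losses, adv_losses)
print(order)
```

The clean losses alone cannot rank these three samples, while the gap-based score separates them, which matches the abstract's argument for why a loss-only difficulty measure breaks down near flat minima.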

[80] Are All Marine Species Created Equal? Performance Disparities in Underwater Object Detection

Melanie Wille,Tobias Fischer,Scarlett Raine

Main category: cs.CV

TL;DR: This paper investigates the challenges and solutions for underwater object detection, focusing on the factors affecting class-specific performance disparities and proposing recommendations for improving detection of under-performing marine species through algorithmic advances in localization modules.

Details Motivation: Underwater object detection is critical for monitoring marine ecosystems but poses unique challenges, including degraded image quality, imbalanced class distribution, and distinct visual characteristics. Not every species is detected equally well, yet underlying causes remain unclear. Method: We manipulate the DUO dataset to separate the object detection task into localization and classification and investigate the under-performance of the scallop class. Result: Localization analysis using YOLO11 and TIDE finds that foreground-background discrimination is the most problematic stage regardless of data quantity. Classification experiments reveal persistent precision gaps even with balanced data, indicating intrinsic feature-based challenges beyond data scarcity and inter-class dependencies. Conclusion: Improving under-performing classes should focus on algorithmic advances, especially within localization modules. Abstract: Underwater object detection is critical for monitoring marine ecosystems but poses unique challenges, including degraded image quality, imbalanced class distribution, and distinct visual characteristics. Not every species is detected equally well, yet underlying causes remain unclear. We address two key research questions: 1) What factors beyond data quantity drive class-specific performance disparities? 2) How can we systematically improve detection of under-performing marine species? We manipulate the DUO dataset to separate the object detection task into localization and classification and investigate the under-performance of the scallop class. Localization analysis using YOLO11 and TIDE finds that foreground-background discrimination is the most problematic stage regardless of data quantity. Classification experiments reveal persistent precision gaps even with balanced data, indicating intrinsic feature-based challenges beyond data scarcity and inter-class dependencies. 
We recommend imbalanced distributions when prioritizing precision, and balanced distributions when prioritizing recall. Improving under-performing classes should focus on algorithmic advances, especially within localization modules. We publicly release our code and datasets.

[81] Drawing2CAD: Sequence-to-Sequence Learning for CAD Generation from Vectorized Drawings

Feiwei Qin,Shichao Lu,Junhao Hou,Changmiao Wang,Meie Fang,Ligang Liu

Main category: cs.CV

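TL;DR: This paper proposes Drawing2CAD, a new framework that converts 2D engineering drawings into parametric CAD models through sequence-to-sequence learning.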

Details Motivation: Traditional industrial workflows begin with 2D engineering drawings, yet automatically generating parametric CAD models from these 2D vector drawings remains underexplored. Method: A framework named Drawing2CAD is proposed, comprising a network-friendly vector primitive representation, a dual-decoder transformer architecture, and a soft target distribution loss function that accommodates the flexibility of CAD parameters. Result: The CAD-VGDrawing dataset is created, and experiments validate the effectiveness of the Drawing2CAD framework. Conclusion: Through sequence-to-sequence learning, Drawing2CAD effectively generates parametric CAD models from 2D engineering drawings, offering a new solution for CAD generation. Abstract: Computer-Aided Design (CAD) generative modeling is driving significant innovations across industrial applications. Recent works have shown remarkable progress in creating solid models from various inputs such as point clouds, meshes, and text descriptions. However, these methods fundamentally diverge from traditional industrial workflows that begin with 2D engineering drawings. The automatic generation of parametric CAD models from these 2D vector drawings remains underexplored despite being a critical step in engineering design. To address this gap, our key insight is to reframe CAD generation as a sequence-to-sequence learning problem where vector drawing primitives directly inform the generation of parametric CAD operations, preserving geometric precision and design intent throughout the transformation process. We propose Drawing2CAD, a framework with three key technical components: a network-friendly vector primitive representation that preserves precise geometric information, a dual-decoder transformer architecture that decouples command type and parameter generation while maintaining precise correspondence, and a soft target distribution loss function accommodating inherent flexibility in CAD parameters. To train and evaluate Drawing2CAD, we create CAD-VGDrawing, a dataset of paired engineering drawings and parametric CAD models, and conduct thorough experiments to demonstrate the effectiveness of our method. Code and dataset are available at https://github.com/lllssc/Drawing2CAD.

[82] Improving Noise Robust Audio-Visual Speech Recognition via Router-Gated Cross-Modal Feature Fusion

DongHoon Lim,YoungChae Kim,Dong-Hyun Kim,Da-Hee Yang,Joon-Hyuk Chang

Main category: cs.CV

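TL;DR: The paper proposes a robust audio-visual speech recognition framework based on router-gated cross-modal feature fusion, which improves recognition performance by dynamically adjusting the weights of audio and visual features.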

Details Motivation: In noisy environments, existing audio-visual speech recognition systems struggle to estimate audio reliability and dynamically adjust modality reliance, so a new method is proposed to address this. Method: A router based on audio-visual feature fusion adaptively reweights audio and visual features according to token-level acoustic corruption scores, and visual cues are reinforced through a gated cross-attention mechanism. Result: Experiments on LRS3 show a 16.51-42.67% relative reduction in word error rate compared to AV-HuBERT. Ablation studies confirm that the router and the gating mechanism both contribute to improved robustness under real-world acoustic noise. Conclusion: The proposed router-gated cross-modal feature fusion framework effectively improves audio-visual speech recognition performance in noisy environments. Abstract: Robust audio-visual speech recognition (AVSR) in noisy environments remains challenging, as existing systems struggle to estimate audio reliability and dynamically adjust modality reliance. We propose router-gated cross-modal feature fusion, a novel AVSR framework that adaptively reweights audio and visual features based on token-level acoustic corruption scores. Using an audio-visual feature fusion-based router, our method down-weights unreliable audio tokens and reinforces visual cues through gated cross-attention in each decoder layer. This enables the model to pivot toward the visual modality when audio quality deteriorates. Experiments on LRS3 demonstrate that our approach achieves a 16.51-42.67% relative reduction in word error rate compared to AV-HuBERT. Ablation studies confirm that both the router and gating mechanism contribute to improved robustness under real-world acoustic noise.

[83] Rethinking Human-Object Interaction Evaluation for both Vision-Language Models and HOI-Specific Methods

Qinqian Lei,Bo Wang,Robby T. Tan

Main category: cs.CV

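TL;DR: This paper examines how effective general-purpose vision-language models (VLMs) are at human-object interaction (HOI) detection and proposes a new benchmark to evaluate both these models and specialized HOI methods.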

Details Motivation: Existing HOI benchmarks such as HICO-DET are poorly suited to the generative nature of modern VLMs because they require exact matches to annotated HOI classes, which can penalize valid predictions. Method: The authors propose a new benchmark that reformulates HOI detection as a multiple-answer multiple-choice task, where each question includes only ground-truth positives and a curated set of negatives to reduce ambiguity. Result: The new evaluation protocol accommodates both VLMs and HOI-specific methods, enabling direct comparison of their performance and offering new insight into progress in HOI understanding. Conclusion: The study suggests that general-purpose VLMs may be more effective at HOI detection than previously thought, and the new benchmark avoids wrongly penalizing valid predictions. Abstract: Prior human-object interaction (HOI) detection methods have integrated early vision-language models (VLMs) such as CLIP, but only as supporting components within their frameworks. In contrast, recent advances in large, generative VLMs suggest that these models may already possess strong ability to understand images involving HOI. This naturally raises an important question: can general-purpose standalone VLMs effectively solve HOI detection, and how do they compare with specialized HOI methods? Answering this requires a benchmark that can accommodate both paradigms. However, existing HOI benchmarks such as HICO-DET were developed before the emergence of modern VLMs, and their evaluation protocols require exact matches to annotated HOI classes. This is poorly aligned with the generative nature of VLMs, which often yield multiple valid interpretations in ambiguous cases. For example, a static image may capture a person mid-motion with a frisbee, which can plausibly be interpreted as either "throwing" or "catching". When only "catching" is annotated, the other, though equally plausible for the image, is marked incorrect when exact matching is used. As a result, correct predictions might be penalized, affecting both VLMs and HOI-specific methods. To avoid penalizing valid predictions, we introduce a new benchmark that reformulates HOI detection as a multiple-answer multiple-choice task, where each question includes only ground-truth positive options and a curated set of negatives that are constructed to reduce ambiguity (e.g., when "catching" is annotated, "throwing" is not selected as a negative to avoid penalizing valid predictions). 
The proposed evaluation protocol is the first of its kind for both VLMs and HOI methods, enabling direct comparison and offering new insight into the current state of progress in HOI understanding.

[84] Beyond the Textual: Generating Coherent Visual Options for MCQs

Wanqiang Wang,Longzhu He,Wei Zheng

Main category: cs.CV

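TL;DR: This paper proposes CmOS, a new multimodal MCQ generation framework that combines MCoT and RAG to address visual distractor generation, and performs well in experiments.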

Details Motivation: Prior research on MCQ generation has focused mainly on textual options and overlooked visual options; in addition, generating high-quality distractors remains a major challenge because manual authoring is costly and scales poorly. Method: A framework called Cross-modal Options Synthesis (CmOS) is proposed, combining a Multimodal Chain-of-Thought (MCoT) reasoning process with Retrieval-Augmented Generation (RAG), and introducing a discrimination module to identify content suitable for visual options. Result: Experiments show that CmOS outperforms existing methods in content discrimination, question generation, and visual option generation on test tasks across multiple subjects and educational levels. Conclusion: CmOS is a novel framework that effectively generates educational MCQs with visual options and outperforms existing methods in content discrimination, question generation, and visual distractor generation. Abstract: Multiple-choice questions (MCQs) play a crucial role in fostering deep thinking and knowledge integration in education. However, previous research has primarily focused on generating MCQs with textual options and has largely overlooked visual options. Moreover, generating high-quality distractors remains a major challenge due to the high cost and limited scalability of manual authoring. To tackle these problems, we propose Cross-modal Options Synthesis (CmOS), a novel framework for generating educational MCQs with visual options. Our framework integrates a Multimodal Chain-of-Thought (MCoT) reasoning process and Retrieval-Augmented Generation (RAG) to produce semantically plausible and visually similar answers and distractors. It also includes a discrimination module to identify content suitable for visual options. Experimental results on test tasks demonstrate the superiority of CmOS in content discrimination, question generation and visual option generation over existing methods across various subjects and educational levels.

[85] Design, Implementation and Evaluation of a Real-Time Remote Photoplethysmography (rPPG) Acquisition System for Non-Invasive Vital Sign Monitoring

Constantino Álvarez Casado,Sasan Sharifipour,Manuel Lage Cañellas,Nhi Nguyen,Le Nguyen,Miguel Bordallo López

Main category: cs.CV

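TL;DR: This paper presents a real-time remote physiological monitoring system optimized for low-power devices, using a multithreaded architecture and a hybrid programming model to address scalability, interoperability, and performance challenges.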

Details Motivation: The integration of smart environments and low-power computing devices, together with mass-market sensor technologies, is driving advances in remote and non-contact physiological monitoring. However, deploying these systems in real time on resource-constrained platforms poses significant scalability, interoperability, and performance challenges. Method: A real-time remote photoplethysmography (rPPG) system is built on the Face2PPG pipeline, using a multithreaded architecture and a hybrid programming model (Functional Reactive Programming and the Actor Model), and exposing both an HTTP server and a RESTful API. Result: The system runs continuously and reliably at 30 frames per second on low-power devices and provides adaptive feedback through a collaborative user interface to guide optimal signal capture conditions. Conclusion: The work addresses key challenges in real-time biosignal monitoring and offers practical solutions for optimizing performance in modern healthcare and human-computer interaction applications. Abstract: The growing integration of smart environments and low-power computing devices, coupled with mass-market sensor technologies, is driving advancements in remote and non-contact physiological monitoring. However, deploying these systems in real time on resource-constrained platforms introduces significant challenges related to scalability, interoperability, and performance. This paper presents a real-time remote photoplethysmography (rPPG) system optimized for low-power devices, designed to extract physiological signals, such as heart rate (HR), respiratory rate (RR), and oxygen saturation (SpO2), from facial video streams. The system is built on the Face2PPG pipeline, which processes video frames sequentially for rPPG signal extraction and analysis, while leveraging a multithreaded architecture to manage video capture, real-time processing, network communication, and graphical user interface (GUI) updates concurrently. This design ensures continuous, reliable operation at 30 frames per second (fps), with adaptive feedback through a collaborative user interface to guide optimal signal capture conditions. The network interface includes both an HTTP server for continuous video streaming and a RESTful API for on-demand vital sign retrieval. To ensure accurate performance despite the limitations of low-power devices, we use a hybrid programming model combining Functional Reactive Programming (FRP) and the Actor Model, allowing event-driven processing and efficient task parallelization. The system is evaluated under real-time constraints, demonstrating robustness while minimizing computational overhead.
Our work addresses key challenges in real-time biosignal monitoring, offering practical solutions for optimizing performance in modern healthcare and human-computer interaction applications.

[86] PseudoMapTrainer: Learning Online Mapping without HD Maps

Christian Löwens,Thorben Funke,Jingchao Xie,Alexandru Paul Condurache

Main category: cs.CV

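TL;DR: PseudoMapTrainer is a new method for training online mapping models without ground-truth map data; it generates pseudo-labels from unlabeled sensor data and introduces a mask-aware algorithm to handle occlusion.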

Details Motivation: Existing online mapping methods still rely during training on ground-truth high-definition maps, which are expensive and insufficiently diverse geographically, so a more reliable approach with better generalization is needed. Method: Pseudo-labels are generated by reconstructing the road surface using Gaussian splatting and the semantics of a pre-trained 2D segmentation network; a mask-aware assignment algorithm and loss function handle partially masked pseudo-labels. Result: PseudoMapTrainer can train online mapping models without ground-truth maps and can be used for semi-supervised pre-training on large-scale unlabeled crowdsourced data. Conclusion: PseudoMapTrainer is a novel approach to online mapping that trains models with pseudo-labels generated from unlabeled sensor data, without any ground-truth map data. Abstract: Online mapping models show remarkable results in predicting vectorized maps from multi-view camera images only. However, all existing approaches still rely on ground-truth high-definition maps during training, which are expensive to obtain and often not geographically diverse enough for reliable generalization. In this work, we propose PseudoMapTrainer, a novel approach to online mapping that uses pseudo-labels generated from unlabeled sensor data. We derive those pseudo-labels by reconstructing the road surface from multi-camera imagery using Gaussian splatting and semantics of a pre-trained 2D segmentation network. In addition, we introduce a mask-aware assignment algorithm and loss function to handle partially masked pseudo-labels, allowing for the first time the training of online mapping models without any ground-truth maps. Furthermore, our pseudo-labels can be effectively used to pre-train an online model in a semi-supervised manner to leverage large-scale unlabeled crowdsourced data. The code is available at github.com/boschresearch/PseudoMapTrainer.

[87] Robust and Label-Efficient Deep Waste Detection

Hassan Abid,Khan Muhammad,Muhammad Haris Khan

Main category: cs.CV

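TL;DR: This paper proposes an ensemble-based semi-supervised learning framework for waste detection and significantly improves waste-sorting performance by fine-tuning transformer-based detectors and using a soft pseudo-labeling strategy.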

Details Motivation: Waste sorting is critical for sustainable recycling, yet AI research in this domain lags behind commercial systems because of limited datasets and reliance on legacy object detectors. This work aims to advance the field by establishing rigorous baselines and proposing a scalable annotation pipeline. Method: The authors first benchmark state-of-the-art Open-Vocabulary Object Detection (OVOD) models on the real-world ZeroWaste dataset, then establish a new baseline by fine-tuning modern transformer-based detectors. They then propose a soft pseudo-labeling strategy that fuses ensemble predictions and apply it to the unlabeled ZeroWaste-s subset. Result: LLM-optimized prompts significantly improve zero-shot accuracy. Fine-tuning transformer-based detectors yields a new baseline of 51.6 mAP. Applied to the unlabeled ZeroWaste-s subset, the proposed pseudo-labeling strategy achieves performance gains that surpass fully supervised training. Conclusion: The paper advances AI-driven waste sorting by establishing strong baselines and introducing an ensemble-based semi-supervised learning framework. The authors demonstrate the effectiveness of their pseudo-labeling strategy and systematically evaluate OVOD models under real-world waste sorting conditions. Abstract: Effective waste sorting is critical for sustainable recycling, yet AI research in this domain continues to lag behind commercial systems due to limited datasets and reliance on legacy object detectors. In this work, we advance AI-driven waste detection by establishing strong baselines and introducing an ensemble-based semi-supervised learning framework. We first benchmark state-of-the-art Open-Vocabulary Object Detection (OVOD) models on the real-world ZeroWaste dataset, demonstrating that while class-only prompts perform poorly, LLM-optimized prompts significantly enhance zero-shot accuracy. Next, to address domain-specific limitations, we fine-tune modern transformer-based detectors, achieving a new baseline of 51.6 mAP. We then propose a soft pseudo-labeling strategy that fuses ensemble predictions using spatial and consensus-aware weighting, enabling robust semi-supervised training. Applied to the unlabeled ZeroWaste-s subset, our pseudo-annotations achieve performance gains that surpass fully supervised training, underscoring the effectiveness of scalable annotation pipelines. Our work contributes to the research community by establishing rigorous baselines, introducing a robust ensemble-based pseudo-labeling pipeline, generating high-quality annotations for the unlabeled ZeroWaste-s subset, and systematically evaluating OVOD models under real-world waste sorting conditions. Our code is available at: https://github.com/h-abid97/robust-waste-detection.
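To make the idea of fusing ensemble predictions into soft pseudo-labels concrete, here is a minimal sketch. The abstract does not give the weighting formula, so everything below is an assumption: boxes are grouped by IoU, the fused box is a confidence-weighted average, and the soft score is the mean confidence scaled by the fraction of models that agree. The names `iou` and `fuse_detections` and the threshold of 0.5 are hypothetical.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def fuse_detections(detections, num_models, iou_thr=0.5):
    """Fuse same-class detections from several models into soft pseudo-labels.

    detections: list of (model_id, box, confidence), with confidence in (0, 1]
    Returns a list of (fused_box, soft_score). Assumed scheme, not the paper's.
    """
    clusters = []  # each cluster collects detections of one object
    for det in sorted(detections, key=lambda d: -d[2]):
        for cluster in clusters:
            # Spatial grouping: compare with the cluster's most confident box.
            if iou(cluster[0][1], det[1]) >= iou_thr:
                cluster.append(det)
                break
        else:
            clusters.append([det])

    fused = []
    for cluster in clusters:
        total = sum(conf for _, _, conf in cluster)
        # Confidence-weighted average of the box coordinates.
        box = tuple(
            sum(b[i] * conf for _, b, conf in cluster) / total for i in range(4)
        )
        # Consensus: fraction of ensemble members that detected this object.
        agreeing = len({model_id for model_id, _, _ in cluster})
        soft_score = (total / len(cluster)) * (agreeing / num_models)
        fused.append((box, soft_score))
    return fused


example = fuse_detections(
    [(0, (0, 0, 10, 10), 0.8), (1, (1, 1, 11, 11), 0.6), (2, (50, 50, 60, 60), 0.9)],
    num_models=3,
)
```

In this toy input, the box seen by two of three models ends up with a higher soft score than the more confident box seen by only one model, which is the behavior a consensus-aware weighting is meant to produce.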

[88] Embedding Font Impression Word Tags Based on Co-occurrence

Yugo Kubota,Seiichi Uchida

Main category: cs.CV

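TL;DR: This paper proposes an embedding method based on the relationship between font shapes and impression tags; it outperforms existing methods such as BERT and CLIP in impression-guided font generation and retrieval.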

Details Motivation: Different font styles convey different impressions, indicating a close relationship between font shapes and the word tags that describe those impressions. Method: A graph is constructed whose nodes represent impression tags and whose edges encode co-occurrence relationships; spectral embedding is then applied to obtain an impression vector for each tag. Result: The method outperforms BERT and CLIP in both qualitative and quantitative evaluations, particularly in impression-guided font generation. Conclusion: The paper proposes a new embedding method that exploits the relationship between font shapes and impression tags and shows that it outperforms BERT and CLIP in impression-based font generation and retrieval. Abstract: Different font styles (i.e., font shapes) convey distinct impressions, indicating a close relationship between font shapes and word tags describing those impressions. This paper proposes a novel embedding method for impression tags that leverages these shape-impression relationships. For instance, our method assigns similar vectors to impression tags that frequently co-occur in order to represent impressions of fonts, whereas standard word embedding methods (e.g., BERT and CLIP) yield very different vectors. This property is particularly useful for impression-based font generation and font retrieval. Technically, we construct a graph whose nodes represent impression tags and whose edges encode co-occurrence relationships. Then, we apply spectral embedding to obtain the impression vectors for each tag. We compare our method with BERT and CLIP in qualitative and quantitative evaluations, demonstrating that our approach performs better in impression-guided font generation.

[89] Deep Pre-trained Time Series Features for Tree Species Classification in the Dutch Forest Inventory

Takayuki Ishikawa,Carmelo Bonannella,Bas J. W. Lerink,Marc Rußwurm

Main category: cs.CV

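TL;DR: This paper investigates how pre-trained remote sensing foundation models can improve tree species classification accuracy in national forest inventories, and shows that the approach significantly outperforms traditional methods when labeled data are limited.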

Details Motivation: National Forest Inventories (NFIs) are the primary source of forest information, but maintaining them is labor-intensive. Remote sensing combined with machine learning offers a way to update NFIs more frequently and at larger scales. Current approaches rely on hand-designed features and phenology-based metrics, while deep learning features offer a complementary strategy. Method: The paper extracts time series from Sentinel-1, Sentinel-2, ERA5, and SRTM data and fine-tunes a publicly available remote sensing foundation model for tree species classification. Result: Fine-tuning a pre-trained remote sensing time series foundation model significantly improves tree species classification accuracy in the Netherlands, indicating that classic hand-defined features are too simple for this task and that deep features have strong potential in data-limited applications. Conclusion: The paper concludes that fine-tuning a pre-trained remote sensing time series foundation model significantly improves tree species classification for NFIs, with gains of up to 10% over traditional methods. The approach has strong potential in data-limited settings and can effectively complement existing forest inventory processes. Abstract: National Forest Inventories (NFIs) serve as the primary source of forest information, providing crucial tree species distribution data. However, maintaining these inventories requires labor-intensive on-site campaigns. Remote sensing approaches, particularly when combined with machine learning, offer opportunities to update NFIs more frequently and at larger scales. While the use of Satellite Image Time Series has proven effective for distinguishing tree species through seasonal canopy reflectance patterns, current approaches rely primarily on Random Forest classifiers with hand-designed features and phenology-based metrics. Using deep features from available pre-trained remote sensing foundation models offers a complementary strategy. These pre-trained models leverage unannotated global data and are meant to be used for general-purpose applications and can then be efficiently fine-tuned with smaller labeled datasets for specific classification tasks. This work systematically investigates how deep features improve tree species classification accuracy in the Netherlands with few annotated data. Data-wise, we extracted time-series data from Sentinel-1, Sentinel-2 and ERA5 satellite data and SRTM data using Google Earth Engine. Our results demonstrate that fine-tuning a publicly available remote sensing time series foundation model outperforms the current state-of-the-art in NFI classification in the Netherlands by a large margin of up to 10% across all datasets.
This demonstrates that classic hand-defined harmonic features are too simple for this task and highlights the potential of using deep AI features for data-limited applications like NFI classification. By leveraging openly available satellite data and pre-trained models, this approach significantly improves classification accuracy compared to traditional methods and can effectively complement existing forest inventory processes.

[90] Automated Classification of Normal and Atypical Mitotic Figures Using ConvNeXt V2: MIDOG 2025 Track 2

Yosuke Yamagishi,Shouhei Hanaoka

Main category: cs.CV

TL;DR: This paper presents a solution for the MIDOG 2025 Challenge Track 2, which focuses on binary classification of normal mitotic figures versus atypical mitotic figures in histopathological images. The solution uses a ConvNeXt V2 base model with preprocessing and ensemble strategies to achieve robust performance while maintaining computational efficiency.

Details Motivation: The motivation is to accurately classify normal mitotic figures (NMFs) versus atypical mitotic figures (AMFs) in histopathological images, addressing challenges such as class imbalance, morphological variability, and domain heterogeneity across different tumor types, species, and scanners. Method: The method involves leveraging a ConvNeXt V2 base model with center cropping preprocessing and a 5-fold cross-validation ensemble strategy to address challenges like class imbalance, morphological variability, and domain heterogeneity. Result: The model achieved robust performance on the diverse MIDOG 2025 dataset through strategic preprocessing with 60% center cropping and mixed precision training. Conclusion: The paper concludes that modern convolutional architectures, specifically the ConvNeXt V2 base model, are effective for mitotic figure subtyping while maintaining computational efficiency. Abstract: This paper presents our solution for the MIDOG 2025 Challenge Track 2, which focuses on binary classification of normal mitotic figures (NMFs) versus atypical mitotic figures (AMFs) in histopathological images. Our approach leverages a ConvNeXt V2 base model with center cropping preprocessing and 5-fold cross-validation ensemble strategy. The method addresses key challenges including severe class imbalance, high morphological variability, and domain heterogeneity across different tumor types, species, and scanners. Through strategic preprocessing with 60% center cropping and mixed precision training, our model achieved robust performance on the diverse MIDOG 2025 dataset. The solution demonstrates the effectiveness of modern convolutional architectures for mitotic figure subtyping while maintaining computational efficiency through careful architectural choices and training optimizations.
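As a small illustration of the center cropping preprocessing mentioned above, the sketch below computes the crop rectangle for a patch. The abstract does not say whether "60%" refers to side length or area; this sketch assumes it is the fraction of each side that is kept, and the function name `center_crop_box` is hypothetical.

```python
def center_crop_box(width, height, fraction=0.6):
    """Return (left, top, right, bottom) of a centered crop.

    fraction is the share of each side that is kept (assumed meaning of "60%").
    """
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    crop_w = round(width * fraction)
    crop_h = round(height * fraction)
    left = (width - crop_w) // 2
    top = (height - crop_h) // 2
    return left, top, left + crop_w, top + crop_h


box = center_crop_box(100, 100)  # keeps the central 60 x 60 region
```

The returned tuple follows the common (left, top, right, bottom) convention, so it can be passed to most image libraries' crop routines; cropping toward the center keeps the mitotic figure while discarding surrounding tissue context.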

[91] Boosting Micro-Expression Analysis via Prior-Guided Video-Level Regression

Zizheng Guo,Bochao Zou,Yinuo Jia,Xiangyu Li,Huimin Ma

Main category: cs.CV

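TL;DR: This paper proposes a new micro-expression analysis method that uses a video-level regression framework with a scalable interval selection strategy and a synergistic optimization framework to capture the temporal dynamics of micro-expressions more accurately.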

Details Motivation: Existing micro-expression analysis methods rely on window-level classification with fixed window sizes or on manually predefined window-based decoding, which limits their ability to capture the complex temporal dynamics of micro-expressions. Method: The paper proposes a prior-guided video-level regression method for micro-expression analysis, introducing a scalable interval selection strategy and a synergistic optimization framework in which the spotting and recognition tasks share parameters. Result: The method achieves state-of-the-art performance on multiple benchmark datasets, with an STRS of 0.0562 on CAS(ME)³ and 0.2000 on SAMMLV. Conclusion: Experiments show that the method delivers state-of-the-art performance on multiple benchmark datasets, with an STRS of 0.0562 on CAS(ME)³ and 0.2000 on SAMMLV. Abstract: Micro-expressions (MEs) are involuntary, low-intensity, and short-duration facial expressions that often reveal an individual's genuine thoughts and emotions. Most existing ME analysis methods rely on window-level classification with fixed window sizes and hard decisions, which limits their ability to capture the complex temporal dynamics of MEs. Although recent approaches have adopted video-level regression frameworks to address some of these challenges, interval decoding still depends on manually predefined, window-based methods, leaving the issue only partially mitigated. In this paper, we propose a prior-guided video-level regression method for ME analysis. We introduce a scalable interval selection strategy that comprehensively considers the temporal evolution, duration, and class distribution characteristics of MEs, enabling precise spotting of the onset, apex, and offset phases. In addition, we introduce a synergistic optimization framework, in which the spotting and recognition tasks share parameters except for the classification heads. This fully exploits complementary information, makes more efficient use of limited data, and enhances the model's capability. Extensive experiments on multiple benchmark datasets demonstrate the state-of-the-art performance of our method, with an STRS of 0.0562 on CAS(ME)$^3$ and 0.2000 on SAMMLV. The code is available at https://github.com/zizheng-guo/BoostingVRME.

[92] Quantitative Outcome-Oriented Assessment of Microsurgical Anastomosis

Luyin Hu,Soheil Gholami,George Dindelegan,Torstein R. Meling,Aude Billard

Main category: cs.CV

TL;DR: This paper proposes an objective quantitative framework using image processing and geometric modeling to assess microsurgical anastomosis, improving reliability and replicating expert scoring.

Details Motivation: The motivation stems from the current reliance on subjective methods for assessing microsurgical competence, which can introduce bias and reduce assessment reliability and efficiency. Method: The research employs a quantitative framework using image-processing techniques and geometric modeling of errors, including a detection and scoring mechanism, to objectively assess microsurgical anastomoses across three hospital datasets. Result: The geometric metrics used in the framework demonstrate effectiveness in replicating expert raters' scoring for the considered errors, indicating improved objective assessment capabilities. Conclusion: The study concludes that the introduced quantitative framework using image-processing techniques enhances the efficiency and reliability of assessing microsurgical proficiency, effectively replicating expert raters' scoring. Abstract: Microsurgical anastomosis demands exceptional dexterity and visuospatial skills, underscoring the importance of comprehensive training and precise outcome assessment. Currently, methods such as the outcome-oriented anastomosis lapse index are used to evaluate this procedure. However, they often rely on subjective judgment, which can introduce biases that affect the reliability and efficiency of the assessment of competence. Leveraging three datasets from hospitals with participants at various levels, we introduce a quantitative framework that uses image-processing techniques for objective assessment of microsurgical anastomoses. The approach uses geometric modeling of errors along with a detection and scoring mechanism, enhancing the efficiency and reliability of microsurgical proficiency assessment and advancing training protocols. The results show that the geometric metrics effectively replicate expert raters' scoring for the errors considered in this work.

[93] Harnessing Meta-Learning for Controllable Full-Frame Video Stabilization

Muhammad Kashif Ali,Eun Woo Im,Dongjin Kim,Tae Hyun Kim,Vivek Gupta,Haonan Luo,Tianrui Li

Main category: cs.CV

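TL;DR: This paper introduces a new video stabilization method that rapidly adapts to each input video at test time, significantly improving the performance of existing methods.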

Details Motivation: Video stabilization is a fundamental problem in computer vision, and pixel-level synthesis solutions add to its complexity because they must synthesize full-frame outputs. Method: The method rapidly adapts models using low-level visual cues available during inference, and adds a jerk localization module and a targeted adaptation strategy. Result: Experiments show that the proposed method performs well on diverse real-world datasets and significantly improves existing full-frame synthesis models. Conclusion: The paper proposes a novel video stabilization method that improves pixel-level synthesis stabilizers by rapidly adapting to the input video at test time. Abstract: Video stabilization remains a fundamental problem in computer vision, and pixel-level synthesis solutions, which synthesize full-frame outputs, add to the complexity of this task. These methods aim to enhance stability while synthesizing full-frame videos, but the inherent diversity in motion profiles and visual content present in each video sequence makes robust generalization with fixed parameters difficult. To address this, we present a novel method that improves pixel-level synthesis video stabilization methods by rapidly adapting models to each input video at test time. The proposed approach takes advantage of low-level visual cues available during inference to improve both the stability and visual quality of the output. Notably, the proposed rapid adaptation achieves significant performance gains even with a single adaptation pass. We further propose a jerk localization module and a targeted adaptation strategy, which focuses the adaptation on high-jerk segments for maximizing stability with fewer adaptation steps. The proposed methodology enables modern stabilizers to surpass the longstanding SOTA approaches while maintaining the full-frame nature of modern methods and offering users control mechanisms akin to classical approaches. Extensive experiments on diverse real-world datasets demonstrate the versatility of the proposed method. Our approach consistently improves the performance of various full-frame synthesis models in both qualitative and quantitative terms, including results on downstream applications.

[94] Toward Robust Medical Fairness: Debiased Dual-Modal Alignment via Text-Guided Attribute-Disentangled Prompt Learning for Vision-Language Models

Yuexuan Xia,Benteng Ma,Jiang He,Zhiyong Wang,Qi Dou,Yong Xia

Main category: cs.CV

TL;DR: DualFairVL is a novel multimodal prompt-learning framework that improves fairness and accuracy in medical diagnosis across diverse imaging datasets and modalities, offering efficient performance with fewer parameters.

Details Motivation: Ensuring fairness across demographic groups in medical diagnosis is essential for equitable healthcare, especially under distribution shifts caused by variations in imaging equipment and clinical practice. Method: DualFairVL uses a parallel dual-branch architecture with text anchors and a hypernetwork to disentangle and align cross-modal representations, incorporating prototype-based regularization for fairness. Result: DualFairVL outperforms full fine-tuning and parameter-efficient baselines on eight medical imaging datasets across four modalities, achieving superior fairness and accuracy with only 3.6M trainable parameters. Conclusion: DualFairVL, a multimodal prompt-learning framework, achieves state-of-the-art fairness and accuracy in medical diagnosis across different imaging datasets and modalities, while using a relatively small number of trainable parameters. Abstract: Ensuring fairness across demographic groups in medical diagnosis is essential for equitable healthcare, particularly under distribution shifts caused by variations in imaging equipment and clinical practice. Vision-language models (VLMs) exhibit strong generalization, and text prompts encode identity attributes, enabling explicit identification and removal of sensitive directions. However, existing debiasing approaches typically address vision and text modalities independently, leaving residual cross-modal misalignment and fairness gaps. To address this challenge, we propose DualFairVL, a multimodal prompt-learning framework that jointly debiases and aligns cross-modal representations. DualFairVL employs a parallel dual-branch architecture that separates sensitive and target attributes, enabling disentangled yet aligned representations across modalities. Approximately orthogonal text anchors are constructed via linear projections, guiding cross-attention mechanisms to produce fused features. 
A hypernetwork further disentangles attribute-related information and generates instance-aware visual prompts, which encode dual-modal cues for fairness and robustness. Prototype-based regularization is applied in the visual branch to enforce separation of sensitive features and strengthen alignment with textual anchors. Extensive experiments on eight medical imaging datasets across four modalities show that DualFairVL achieves state-of-the-art fairness and accuracy under both in- and out-of-distribution settings, outperforming full fine-tuning and parameter-efficient baselines with only 3.6M trainable parameters. Code will be released upon publication.

[95] DQEN: Dual Query Enhancement Network for DETR-based HOI Detection

Zhehao Li,Chong Wang,Yi Chen,Yinghao Lu,Jiangbo Qian,Jiong Wang,Jiafei Wu

Main category: cs.CV

TL;DR: This paper proposes a Dual Query Enhancement Network (DQEN) for Human-Object Interaction (HOI) detection, which improves query representations for objects and interactions, leading to competitive performance on benchmark datasets.

Details Motivation: The motivation is to address the limitations of randomly initialized queries in DETR-based HOI models, which result in vague representations, by enhancing the clarity and effectiveness of object and interaction queries. Method: The method involves enhancing object queries using object-aware encoder features and improving interaction queries through an Interaction Semantic Fusion module that utilizes semantic features from the CLIP model. Additionally, an Auxiliary Prediction Unit is introduced to enhance interaction feature representation. Result: The proposed method achieves competitive performance on both the HICO-Det and V-COCO datasets, demonstrating its effectiveness in HOI detection. Conclusion: The proposed Dual Query Enhancement Network (DQEN) achieves competitive performance on HOI detection tasks by effectively enhancing object and interaction queries. Abstract: Human-Object Interaction (HOI) detection focuses on localizing human-object pairs and recognizing their interactions. Recently, the DETR-based framework has been widely adopted in HOI detection. In DETR-based HOI models, queries with clear meaning are crucial for accurately detecting HOIs. However, prior works have typically relied on randomly initialized queries, leading to vague representations that limit the model's effectiveness. Meanwhile, humans in the HOI categories are fixed, while objects and their interactions are variable. Therefore, we propose a Dual Query Enhancement Network (DQEN) to enhance object and interaction queries. Specifically, object queries are enhanced with object-aware encoder features, enabling the model to focus more effectively on humans interacting with objects in an object-aware way. On the other hand, we design a novel Interaction Semantic Fusion module to exploit the HOI candidates that are promoted by the CLIP model. 
Semantic features are extracted to enhance the initialization of interaction queries, thereby improving the model's ability to understand interactions. Furthermore, we introduce an Auxiliary Prediction Unit aimed at improving the representation of interaction features. Our proposed method achieves competitive performance on both the HICO-Det and the V-COCO datasets. The source code is available at https://github.com/lzzhhh1019/DQEN.

[96] Interpretable Decision-Making for End-to-End Autonomous Driving

Mona Mirzaie,Bodo Rosenhahn

Main category: cs.CV

TL;DR: This paper introduces a method to enhance interpretability in autonomous driving models, achieving superior performance on CARLA benchmarks while enabling explanations for AI-driven decisions.

Details Motivation: The motivation is to address the lack of interpretability in end-to-end autonomous driving models that use deep neural networks with non-linear decision boundaries, making it challenging to understand AI-driven decisions in complex urban scenarios. Method: The paper proposes loss functions that promote interpretability by generating sparse and localized feature maps, enabling explanations of which image regions contribute to the predicted control commands. Result: The proposed method improves interpretability, reduces infractions, and achieves a higher route completion rate compared to existing approaches on the CARLA benchmarks. Conclusion: The paper concludes that enhancing interpretability in autonomous driving models can lead to safer and high-performance outcomes, as demonstrated by surpassing top-performing approaches on the CARLA Leaderboard while ensuring interpretability. Abstract: Trustworthy AI is mandatory for the broad deployment of autonomous vehicles. Although end-to-end approaches derive control commands directly from raw data, interpreting these decisions remains challenging, especially in complex urban scenarios. This is mainly attributed to very deep neural networks with non-linear decision boundaries, making it challenging to grasp the logic behind AI-driven decisions. This paper presents a method to enhance interpretability while optimizing control commands in autonomous driving. To address this, we propose loss functions that promote the interpretability of our model by generating sparse and localized feature maps. The feature activations allow us to explain which image regions contribute to the predicted control command. We conduct comprehensive ablation studies on the feature extraction step and validate our method on the CARLA benchmarks. We also demonstrate that our approach improves interpretability, which correlates with reducing infractions, yielding a safer, high-performance driving model. 
Notably, our monocular, non-ensemble model surpasses the top-performing approaches from the CARLA Leaderboard by achieving lower infraction scores and the highest route completion rate, all while ensuring interpretability.

[97] Event-Enriched Image Analysis Grand Challenge at ACM Multimedia 2025

Thien-Phuc Tran,Minh-Quang Nguyen,Minh-Triet Tran,Tam V. Nguyen,Trong-Le Do,Duy-Nam Ly,Viet-Tham Huynh,Khanh-Duy Le,Mai-Khiem Tran,Trung-Nghia Le

Main category: cs.CV

TL;DR: The EVENTA Grand Challenge introduces a large-scale benchmark for event-level multimodal understanding to better capture the context and meaning behind images.

Details Motivation: Traditional captioning and retrieval tasks overlook the deeper contextual and semantic dimensions of real-world events. Method: Integration of contextual, temporal, and semantic information using the OpenEvents V1 dataset, featuring two challenge tracks. Result: 45 teams participated, with top three teams presenting solutions at ACM Multimedia 2025. Conclusion: EVENTA Grand Challenge sets a foundation for context-aware, narrative-driven multimedia AI with applications in various fields. Abstract: The Event-Enriched Image Analysis (EVENTA) Grand Challenge, hosted at ACM Multimedia 2025, introduces the first large-scale benchmark for event-level multimodal understanding. Traditional captioning and retrieval tasks largely focus on surface-level recognition of people, objects, and scenes, often overlooking the contextual and semantic dimensions that define real-world events. EVENTA addresses this gap by integrating contextual, temporal, and semantic information to capture the who, when, where, what, and why behind an image. Built upon the OpenEvents V1 dataset, the challenge features two tracks: Event-Enriched Image Retrieval and Captioning, and Event-Based Image Retrieval. A total of 45 teams from six countries participated, with evaluation conducted through Public and Private Test phases to ensure fairness and reproducibility. The top three teams were invited to present their solutions at ACM Multimedia 2025. EVENTA establishes a foundation for context-aware, narrative-driven multimedia AI, with applications in journalism, media analysis, cultural archiving, and accessibility. Further details about the challenge are available at the official homepage: https://ltnghia.github.io/eventa/eventa-2025.

[98] Preliminary Study on Space Utilization and Emergent Behaviors of Group vs. Single Pedestrians in Real-World Trajectories

Amartaivan Sanjjamts,Morita Hiroshi

Main category: cs.CV

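TL;DR: This paper proposes a framework for distinguishing single pedestrians from groups based on trajectory data and designs a set of spatial and behavioral metrics, laying a foundation for future research.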

Details Motivation: The study aims to analyze differences in space utilization and behavioral patterns between group and single pedestrians, in support of pedestrian simulation and space design validation in crowd dynamics research. Method: Pedestrian trajectories are segmented into fixed time bins and a Transformer-based pair classification model is applied, together with a metric framework covering spatial and behavioral dimensions. Result: The work establishes the classification pipeline and dataset structure and proposes metrics covering space utilization and behavioral characteristics, laying the groundwork for future quantitative analysis. Conclusion: The paper presents a preliminary framework for distinguishing group and single pedestrians and lays the groundwork for deeper follow-up analysis. Abstract: This study presents an initial framework for distinguishing group and single pedestrians based on real-world trajectory data, with the aim of analyzing their differences in space utilization and emergent behavioral patterns. By segmenting pedestrian trajectories into fixed time bins and applying a Transformer-based pair classification model, we identify cohesive groups and isolate single pedestrians through a structured sequence-based filtering process. To prepare for deeper analysis, we establish a comprehensive metric framework incorporating both spatial and behavioral dimensions. Spatial utilization metrics include convex hull area, smallest enclosing circle radius, and heatmap-based spatial densities to characterize how different pedestrian types occupy and interact with space. Behavioral metrics such as velocity change, motion angle deviation, clearance radius, and trajectory straightness are designed to capture local adaptations and responses during interactions. Furthermore, we introduce a typology of encounter types (single-to-single, single-to-group, and group-to-group) to categorize and later quantify different interaction scenarios. Although this version focuses primarily on the classification pipeline and dataset structuring, it establishes the groundwork for scalable analysis across different sequence lengths of 60, 100, and 200 frames. Future versions will incorporate complete quantitative analysis of the proposed metrics and their implications for pedestrian simulation and space design validation in crowd dynamics research.
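Two of the metrics named in this abstract, convex hull area and trajectory straightness, can be sketched in a few lines. The abstract gives no formulas, so the definitions below are assumptions: the hull is computed with Andrew's monotone chain and its area with the shoelace formula, and straightness is taken as net displacement divided by path length. Function names are hypothetical and points are assumed to be `(x, y)` tuples.

```python
import math


def convex_hull(points):
    """Andrew's monotone chain; returns hull vertices in counter-clockwise order."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts

    def cross(o, a, b):
        # z-component of (a - o) x (b - o); positive for a left turn
        return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]


def convex_hull_area(points):
    """Area of the convex hull via the shoelace formula."""
    hull = convex_hull(points)
    if len(hull) < 3:
        return 0.0
    twice_area = sum(
        x1 * y2 - x2 * y1
        for (x1, y1), (x2, y2) in zip(hull, hull[1:] + hull[:1])
    )
    return abs(twice_area) / 2.0


def trajectory_straightness(trajectory):
    """Net displacement divided by path length; 1.0 means a straight line."""
    path = sum(math.dist(a, b) for a, b in zip(trajectory, trajectory[1:]))
    if path == 0:
        return 1.0  # stationary or single-point trajectory (assumed convention)
    return math.dist(trajectory[0], trajectory[-1]) / path


group_positions = [(0, 0), (2, 0), (2, 2), (0, 2), (1, 1)]  # made-up positions
area = convex_hull_area(group_positions)
straightness = trajectory_straightness([(0, 0), (3, 0), (3, 4)])
```

Here the hull area describes how much space a group occupies at one instant, while straightness summarizes how directly one pedestrian moved over a time bin; both are computed from positions alone.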

[99] The point is the mask: scaling coral reef segmentation with weak supervision

Matteo Contini,Victor Illien,Sylvain Poulain,Serge Bernard,Julien Barde,Sylvain Bonhommeau,Alexis Joly

Main category: cs.CV

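TL;DR: This paper proposes a large-scale coral reef monitoring method that combines weakly supervised deep learning with multi-scale remote sensing and is low-cost and highly scalable.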

Details Motivation: Monitoring coral reefs at large scales is essential for assessing ecosystem health and informing conservation efforts, but existing methods are hard to scale because of limited resolution and high annotation costs. Method: Fine-scale ecological information is transferred from underwater imagery to aerial imagery, and classification-based supervision, spatial interpolation, and self-distillation are combined to achieve weakly supervised deep learning segmentation of coral reefs at large scale. Result: The method enables large-area segmentation of coral morphotypes with minimal manual annotation and shows flexibility for integrating new classes. Conclusion: The study presents a multi-scale weakly supervised semantic segmentation framework that enables large-scale coral reef mapping from drone imagery and is scalable and cost-effective. Abstract: Monitoring coral reefs at large spatial scales remains an open challenge, essential for assessing ecosystem health and informing conservation efforts. While drone-based aerial imagery offers broad spatial coverage, its limited resolution makes it difficult to reliably distinguish fine-scale classes, such as coral morphotypes. At the same time, obtaining pixel-level annotations over large spatial extents is costly and labor-intensive, limiting the scalability of deep learning-based segmentation methods for aerial imagery. We present a multi-scale weakly supervised semantic segmentation framework that addresses this challenge by transferring fine-scale ecological information from underwater imagery to aerial data. Our method enables large-scale coral reef mapping from drone imagery with minimal manual annotation, combining classification-based supervision, spatial interpolation and self-distillation techniques. We demonstrate the efficacy of the approach, enabling large-area segmentation of coral morphotypes and demonstrating flexibility for integrating new classes. This study presents a scalable, cost-effective methodology for high-resolution reef monitoring, combining low-cost data collection, weakly supervised deep learning and multi-scale remote sensing.

[100] Generative AI in Map-Making: A Technical Exploration and Its Implications for Cartographers

Claudio Affolter,Sidi Wu,Yizi Chen,Lorenz Hurni

Main category: cs.CV

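TL;DR: This paper develops a new method that uses generative AI and vector data to generate maps in controllable styles, and improves the user experience through a web application.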

Details Motivation: Traditional map-making relies on Geographic Information Systems (GIS), requires domain expertise, and is time-consuming, while generative AI's weak control over spatial composition and semantic layout limits its use in cartography. Method: Vector data and textual prompts are integrated to guide map generation, and a web application is developed to improve usability and accessibility. Result: The model generates accurate maps; a user study indicates its potential in terms of usability and map fidelity and offers insights for future AI-assisted map-making. Conclusion: The paper proposes a generative AI model that combines vector data and textual prompts to generate maps in controllable styles, and discusses its potential and implications for cartography. Abstract: Traditional map-making relies heavily on Geographic Information Systems (GIS), requiring domain expertise and being time-consuming, especially for repetitive tasks. Recent advances in generative AI (GenAI), particularly image diffusion models, offer new opportunities for automating and democratizing the map-making process. However, these models struggle with accurate map creation due to limited control over spatial composition and semantic layout. To address this, we integrate vector data to guide map generation in different styles, specified by the textual prompts. Our model is the first to generate accurate maps in controlled styles, and we have integrated it into a web application to improve its usability and accessibility. We conducted a user study with professional cartographers to assess the fidelity of generated maps, the usability of the web application, and the implications of ever-emerging GenAI in map-making. The findings suggest the potential of our developed application and, more generally, of GenAI models in helping both non-expert users and professionals create maps more efficiently. We also outline further technical improvements and emphasize the new role of cartographers in advancing the paradigm of AI-assisted map-making.

[101] Enhancing compact convolutional transformers with super attention

Simpenzwe Honore Leandre,Natenaile Asmamaw Shiferaw,Dillip Rout

Main category: cs.CV

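TL;DR: This paper proposes a new vision model architecture that substantially improves accuracy on CIFAR100, with efficient inference and high training stability, without relying on complex techniques.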

Details Motivation: To improve performance on fixed context-length tasks while reducing reliance on complex techniques such as data augmentation and positional embeddings. Method: A new vision model architecture is designed using token mixing, sequence pooling, and convolutional tokenizers. Result: On the CIFAR100 benchmark, top-1 and top-5 validation accuracy improve from 36.50% to 46.29% and from 66.33% to 76.31%, respectively, and the model is more efficient and smaller. Conclusion: The proposed vision model achieves state-of-the-art performance and efficient inference on fixed context-length tasks, with high training stability and without relying on complex techniques such as data augmentation. Abstract: In this paper, we propose a vision model that adopts token mixing, sequence-pooling, and convolutional tokenizers to achieve state-of-the-art performance and efficient inference in fixed context-length tasks. In the CIFAR100 benchmark, our model significantly improves the baseline top-1 and top-5 validation accuracy from 36.50% to 46.29% and from 66.33% to 76.31%, respectively, while being more efficient than the Scaled Dot Product Attention (SDPA) transformers when the context length is less than the embedding dimension and only 60% of the size. In addition, the architecture demonstrates high training stability and does not rely on techniques such as data augmentation like mixup, positional embeddings, or learning rate scheduling. We make our code available on Github.
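To make the "sequence-pooling" ingredient concrete, here is a minimal sketch of attention-style sequence pooling as it is commonly described for compact convolutional transformers: each token gets a scalar score from a linear map, the scores are softmax-normalised over the sequence, and the output is the weighted sum of tokens. The function and parameter names (`sequence_pool`, `w`, `b`) are illustrative assumptions, and the paper's exact variant may differ.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def sequence_pool(tokens, w, b=0.0):
    """Pool n token vectors (each of length d) into one length-d vector.

    tokens: list of n lists of floats
    w, b:   hypothetical learned scoring weights (length d) and bias
    """
    scores = [sum(wi * xi for wi, xi in zip(w, tok)) + b for tok in tokens]
    weights = softmax(scores)  # one importance weight per token
    d = len(tokens[0])
    return [
        sum(weights[i] * tokens[i][j] for i in range(len(tokens)))
        for j in range(d)
    ]
```

With an all-zero scoring vector the weights are uniform and the result reduces to mean pooling, which is a convenient sanity check.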

[102] USO: Unified Style and Subject-Driven Generation via Disentangled and Reward Learning

Shaojin Wu,Mengqi Huang,Yufeng Cheng,Wenxu Wu,Jiahe Tian,Yiming Luo,Fei Ding,Qian He

Main category: cs.CV

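TL;DR: This work proposes USO, a new unified model that uses disentangled learning and a reward-learning paradigm to achieve both style similarity and subject consistency in image generation, and releases the accompanying benchmark USO-Bench.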

Details Motivation: Existing literature usually treats style-driven and subject-driven generation as two disjoint tasks; the paper aims to address both within a unified framework that disentangles and re-composes content and style. Method: A large-scale triplet dataset is constructed, and a disentangled learning scheme comprising style-alignment training and content-style disentanglement training is introduced and combined with a style reward-learning paradigm (SRL). Result: USO achieves state-of-the-art performance across multiple metrics and performs well on both style similarity and subject consistency. Conclusion: The paper proposes USO, a unified style-subject optimized customization model that achieves both style similarity and subject consistency, and extensive experiments confirm state-of-the-art performance among open-source models. Abstract: Existing literature typically treats style-driven and subject-driven generation as two disjoint tasks: the former prioritizes stylistic similarity, whereas the latter insists on subject consistency, resulting in an apparent antagonism. We argue that both objectives can be unified under a single framework because they ultimately concern the disentanglement and re-composition of content and style, a long-standing theme in style-driven research. To this end, we present USO, a Unified Style-Subject Optimized customization model. First, we construct a large-scale triplet dataset consisting of content images, style images, and their corresponding stylized content images. Second, we introduce a disentangled learning scheme that simultaneously aligns style features and disentangles content from style through two complementary objectives, style-alignment training and content-style disentanglement training. Third, we incorporate a style reward-learning paradigm denoted as SRL to further enhance the model's performance. Finally, we release USO-Bench, the first benchmark that jointly evaluates style similarity and subject fidelity across multiple metrics. Extensive experiments demonstrate that USO achieves state-of-the-art performance among open-source models along both dimensions of subject consistency and style similarity. Code and model: https://github.com/bytedance/USO

[103] Can we make NeRF-based visual localization privacy-preserving?

Maxime Pietrantoni,Martin Humenberger,Torsten Sattler,Gabriela Csurka

Main category: cs.CV

TL;DR: This paper proposes ppNeSF, a privacy-preserving NeRF variant for visual localization, which uses segmentation supervision to obscure identifiable details while maintaining accuracy.

Details Motivation: NeRF-based methods encode fine scene details that raise privacy concerns in cloud-based localization services. Method: ppNeSF uses segmentation supervision instead of RGB images to obscure identifiable scene details. Result: ppNeSF achieves state-of-the-art results in visual localization while preserving privacy. Conclusion: ppNeSF provides a privacy-preserving solution for visual localization without compromising performance. Abstract: Visual localization (VL) is the task of estimating the camera pose in a known scene. VL methods can be distinguished, among other criteria, by how they represent the scene, e.g., explicitly through a (sparse) point cloud or a collection of images or implicitly through the weights of a neural network. Recently, NeRF-based methods have become popular for VL. While NeRFs offer high-quality novel view synthesis, they inadvertently encode fine scene details, raising privacy concerns when deployed in cloud-based localization services as sensitive information could be recovered. In this paper, we tackle this challenge on two ends. We first propose a new protocol to assess privacy-preservation of NeRF-based representations. We show that NeRFs trained with photometric losses store fine-grained details in their geometry representations, making them vulnerable to privacy attacks, even if the head that predicts colors is removed. Second, we propose ppNeSF (Privacy-Preserving Neural Segmentation Field), a NeRF variant trained with segmentation supervision instead of RGB images. These segmentation labels are learned in a self-supervised manner, ensuring they are coarse enough to obscure identifiable scene details while retaining discriminativeness in 3D. The segmentation space of ppNeSF can be used for accurate visual localization, yielding state-of-the-art results.

[104] Enhancing Document VQA Models via Retrieval-Augmented Generation

Eric López,Artemis Llabrés,Ernest Valveny

Main category: cs.CV

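TL;DR: This paper systematically evaluates the impact of Retrieval-Augmented Generation (RAG) on Document Visual Question Answering (Document VQA) and finds that both text-based and purely visual retrieval substantially improve accuracy while reducing memory requirements.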

Details Motivation: Document VQA must handle documents spanning dozens of pages, but existing methods either concatenate all pages or rely on large vision-language models, both of which consume a lot of memory. RAG offers a more efficient alternative that reduces memory use by retrieving relevant segments before generating the answer. Method: RAG is systematically evaluated with different retrieval variants (text-based retrieval using OCR text and purely visual retrieval without OCR) across multiple models and benchmark datasets (MP-DocVQA, DUDE, and InfographicVQA). Result: Text-based retrieval improves the "concatenate-all-pages" baseline by up to +22.5 ANLS, while purely visual retrieval gains +5.0 ANLS without any text extraction. Ablations show that the retrieval and reranking components account for most of the gain, whereas layout-guided chunking does not help on these datasets. Conclusion: Careful evidence selection consistently improves accuracy across multiple model sizes and multi-page benchmarks, indicating practical value for real-world Document VQA. Abstract: Document Visual Question Answering (Document VQA) must cope with documents that span dozens of pages, yet leading systems still concatenate every page or rely on very large vision-language models, both of which are memory-hungry. Retrieval-Augmented Generation (RAG) offers an attractive alternative, first retrieving a concise set of relevant segments before generating answers from this selected evidence. In this paper, we systematically evaluate the impact of incorporating RAG into Document VQA through different retrieval variants - text-based retrieval using OCR tokens and purely visual retrieval without OCR - across multiple models and benchmarks. Evaluated on the multi-page datasets MP-DocVQA, DUDE, and InfographicVQA, the text-centric variant improves the "concatenate-all-pages" baseline by up to +22.5 ANLS, while the visual variant achieves +5.0 ANLS improvement without requiring any text extraction. An ablation confirms that retrieval and reranking components drive most of the gain, whereas the layout-guided chunking strategy - proposed in several recent works to leverage page structure - fails to help on these datasets. Our experiments demonstrate that careful evidence selection consistently boosts accuracy across multiple model sizes and multi-page benchmarks, underscoring its practical value for real-world Document VQA.
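The gains above are reported in ANLS (Average Normalized Levenshtein Similarity). As a rough illustration of how such a score is typically computed, the sketch below follows the commonly used definition: for each question, take the best similarity `1 - NL` over the reference answers, zero it when the normalised edit distance `NL` reaches a threshold (assumed here to be the usual 0.5), and average over questions. The lower-casing and whitespace stripping are assumptions, as is the scale: this returns a value in [0, 1], whereas the abstract reports points on a 0 to 100 scale.

```python
def levenshtein(a, b):
    """Edit distance between strings a and b (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]


def anls(predictions, references, tau=0.5):
    """predictions: list of strings; references: list of lists of accepted answers."""
    if not predictions:
        return 0.0
    total = 0.0
    for pred, refs in zip(predictions, references):
        p = pred.strip().lower()
        best = 0.0
        for ref in refs:
            r = ref.strip().lower()
            denom = max(len(p), len(r))
            nl = levenshtein(p, r) / denom if denom else 0.0
            score = 1.0 - nl if nl < tau else 0.0  # threshold suppresses far-off answers
            best = max(best, score)
        total += best
    return total / len(predictions)
```

The threshold is what makes the metric tolerant of small OCR or spelling slips while still scoring a completely different answer as zero.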

[105] Ask Me Again Differently: GRAS for Measuring Bias in Vision Language Models on Gender, Race, Age, and Skin Tone

Shaivi Malik,Hasnat Md Abdullah,Sriparna Saha,Amit Sheth

Main category: cs.CV

TL;DR: Introduces GRAS, a new benchmark for detecting gender, race, age, and skin tone bias in vision language models, and proposes the GRAS Bias Score.

Details Motivation: As vision language models become increasingly important in real-world applications, understanding their demographic biases is critical. Method: GRAS, a benchmark for uncovering bias in vision language models across gender, race, age, and skin tone, is introduced, along with the interpretable GRAS Bias Score for quantifying bias. Result: Five state-of-the-art vision language models are benchmarked and show concerning bias levels; the least biased model attains a GRAS Bias Score of only 2 out of 100. Conclusion: State-of-the-art vision language models exhibit significant demographic biases with respect to gender, race, age, and skin tone, and evaluating these biases requires considering multiple formulations of a question. Abstract: As Vision Language Models (VLMs) become integral to real-world applications, understanding their demographic biases is critical. We introduce GRAS, a benchmark for uncovering demographic biases in VLMs across gender, race, age, and skin tone, offering the most diverse coverage to date. We further propose the GRAS Bias Score, an interpretable metric for quantifying bias. We benchmark five state-of-the-art VLMs and reveal concerning bias levels, with the least biased model attaining a GRAS Bias Score of only 2 out of 100. Our findings also reveal a methodological insight: evaluating bias in VLMs with visual question answering (VQA) requires considering multiple formulations of a question. Our code, data, and evaluation results are publicly available.

[106] RoofSeg: An edge-aware transformer-based network for end-to-end roof plane segmentation

Siyuan You,Guozheng Xu,Pengwei Zhou,Qiwen Jin,Jian Yao,Li Li

Main category: cs.CV

TL;DR: RoofSeg is a new end-to-end deep learning method for roof plane segmentation using LiDAR data, combining transformer architecture with edge-aware modules and improved training strategies.

Details Motivation: Current deep learning-based roof plane segmentation methods face three main challenges: non end-to-end processing, low feature discriminability near edges, and insufficient use of planar geometric constraints. Method: RoofSeg uses a transformer encoder-decoder framework with learnable plane queries to predict plane instance masks. It also incorporates an Edge-Aware Mask Module (EAMM) and an adaptive weighting strategy in the mask loss, along with a new plane geometric loss for training. Result: The proposed RoofSeg achieves better segmentation accuracy, especially in edge regions, by integrating geometric priors and refining the training process through adaptive weighting and geometric loss. Conclusion: RoofSeg, an edge-aware transformer-based network, is introduced to improve roof plane segmentation from LiDAR point clouds by addressing three key limitations of existing deep learning-based approaches. Abstract: Roof plane segmentation is one of the key procedures for reconstructing three-dimensional (3D) building models at levels of detail (LoD) 2 and 3 from airborne light detection and ranging (LiDAR) point clouds. The majority of current approaches for roof plane segmentation rely on manually designed or learned features followed by specifically designed geometric clustering strategies. Because learned features are more powerful than manually designed features, deep learning-based approaches usually perform better than traditional approaches. However, the current deep learning-based approaches have three unsolved problems. The first is that most of them are not truly end-to-end, so the plane segmentation results may not be optimal. The second is that the point feature discriminability near the edges is relatively low, leading to inaccurate planar edges. The third is that the planar geometric characteristics are not sufficiently considered to constrain the network training. 
To solve these issues, a novel edge-aware transformer-based network, named RoofSeg, is developed for segmenting roof planes from LiDAR point clouds in a truly end-to-end manner. In RoofSeg, we leverage a transformer encoder-decoder-based framework to hierarchically predict the plane instance masks with the use of a set of learnable plane queries. To further improve the segmentation accuracy of edge regions, we also design an Edge-Aware Mask Module (EAMM) that incorporates the planar geometric prior of edges to enhance its discriminability for plane instance mask refinement. In addition, we propose an adaptive weighting strategy in the mask loss to reduce the influence of misclassified points, and also propose a new plane geometric loss to constrain the network training.

[107] MicroDetect-Net (MDN): Leveraging Deep Learning to Detect Microplastics in Clam Blood, a Step Towards Human Blood Analysis

Riju Marwah,Riya Arora,Navneet Yadav,Himank Arora

Main category: cs.CV

TL;DR: This paper introduces MicroDetect-Net (MDN), a deep learning model combining fluorescence imaging and Nile Red dye to detect microplastics in blood samples with high accuracy.

Details Motivation: Microplastic pollution poses significant health risks, with potential harm to the liver, intestines, and gut flora. Detecting microplastics in biological samples is crucial for understanding their impact on human health. Method: The study used fluorescence microscopy with Nile Red dye staining combined with a deep learning convolutional neural network (MicroDetect-Net) to detect and count microplastics in blood samples. Result: MDN achieved an accuracy of 92%, with an Intersection over Union of 87.4%, F1 score of 92.1%, Precision of 90.6%, and Recall of 93.7% on a dataset of 276 fluorescent blood images. Conclusion: MicroDetect-Net (MDN) proves to be an effective model for detecting microplastics in blood samples, showing strong performance metrics and opening opportunities for application in human samples. Abstract: With the prevalence of plastics exceeding 368 million tons yearly, microplastic pollution has grown to an extent where air, water, soil, and living organisms have all tested positive for microplastic presence. These particles, which are smaller than 5 millimeters in size, are no less harmful to humans than to the environment. Toxicity research on microplastics has shown that exposure may cause liver infection, intestinal injuries, and gut flora imbalance, leading to numerous potential health hazards. This paper presents a new model, MicroDetect-Net (MDN), which applies fluorescence microscopy with Nile Red dye staining and deep learning to scan blood samples for microplastics. Although clam blood has certain limitations in replicating real human blood, this study opens avenues for applying the approach to human samples, which are more consistent for preliminary data collection. The MDN model integrates dataset preparation, fluorescence imaging, and segmentation using a convolutional neural network to localize and count microplastic fragments. 
The combination of convolutional networks and Nile Red dye for segmentation produced strong image detection and accuracy. MDN was evaluated on a dataset of 276 Nile Red-stained fluorescent blood images and achieved an accuracy of 92 percent. Robust performance was observed with an Intersection over Union of 87.4 percent, F1 score of 92.1 percent, Precision of 90.6 percent, and Recall of 93.7 percent. These metrics demonstrate the effectiveness of MDN in the detection of microplastics.
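As a reading aid for the reported metrics, the sketch below gives the standard definitions of precision, recall, F1, and IoU in terms of true positives, false positives, and false negatives. The function names are illustrative, and the summary does not say at which level (pixel, object, or image) each of the paper's metrics was computed. The helper `f1_from` shows that the reported F1 is the harmonic mean of the reported precision and recall.

```python
def detection_metrics(tp, fp, fn):
    """Standard metrics from true-positive, false-positive, false-negative counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    iou = tp / (tp + fp + fn) if tp + fp + fn else 0.0
    return {"precision": precision, "recall": recall, "f1": f1, "iou": iou}


def f1_from(precision, recall):
    """Harmonic mean of precision and recall (same units in and out)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Consistency check on the figures quoted in the abstract (in percent).
reported_f1 = round(f1_from(90.6, 93.7), 1)  # 92.1
```

The check reproduces the 92.1 percent F1 score from the stated 90.6 percent precision and 93.7 percent recall.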

[108] ProPy: Building Interactive Prompt Pyramids upon CLIP for Partially Relevant Video Retrieval

Yi Pan,Yujia Zhang,Michael Kampffmeyer,Xiaoguang Zhao

Main category: cs.CV

TL;DR: ProPy is a model designed for Partially Relevant Video Retrieval with innovations including a Prompt Pyramid structure and Ancestor-Descendant Interaction Mechanism, achieving SOTA results.

Details Motivation: To bridge the gap in utilizing powerful pretrained vision-language models like CLIP in the field of Partially Relevant Video Retrieval (PRVR). Method: ProPy introduces two key innovations: A Prompt Pyramid structure and Ancestor-Descendant Interaction Mechanism built on the pyramid that enables dynamic semantic interaction among events. Result: ProPy achieves SOTA performance on three public datasets, outperforming previous models by significant margins. Conclusion: ProPy achieves SOTA performance on three public datasets, outperforming previous models by significant margins. Abstract: Partially Relevant Video Retrieval (PRVR) is a practical yet challenging task that involves retrieving videos based on queries relevant to only specific segments. While existing works follow the paradigm of developing models to process unimodal features, powerful pretrained vision-language models like CLIP remain underexplored in this field. To bridge this gap, we propose ProPy, a model with systematic architectural adaption of CLIP specifically designed for PRVR. Drawing insights from the semantic relevance of multi-granularity events, ProPy introduces two key innovations: (1) A Prompt Pyramid structure that organizes event prompts to capture semantics at multiple granularity levels, and (2) An Ancestor-Descendant Interaction Mechanism built on the pyramid that enables dynamic semantic interaction among events. With these designs, ProPy achieves SOTA performance on three public datasets, outperforming previous models by significant margins. Code is available at https://github.com/BUAAPY/ProPy.

[109] GReAT: leveraging geometric artery data to improve wall shear stress assessment

Julian Suk,Jolanda J. Wentzel,Patryk Rygiel,Joost Daemen,Daniel Rueckert,Jelmer M. Wolterink

Main category: cs.CV

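TL;DR: This study uses self-supervised learning to extract geometric features from a large set of 3D vessel models, addressing the shortage of clinical data and substantially improving wall shear stress assessment in coronary arteries.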

Details Motivation: Using machine learning to assess hemodynamic biomarkers suffers from data scarcity, because sufficiently large datasets for training are hard to collect, which limits broad clinical use. Method: Self-supervised learning is performed with the heat kernel signature as the target, a quantity computed from Laplacian eigenvectors that captures intrinsic shape characteristics of 3D vessel models, and the study examines how the resulting representations improve wall shear stress segmentation. Result: With clinical trial data from only 49 patients, geometric representations learned from a dataset of 8449 3D vessel geometries substantially improve segmentation into regions of low, mid, and high wall shear stress. Conclusion: Geometric representation learning can improve wall shear stress assessment in coronary arteries, especially when clinical data are limited. Abstract: Leveraging big data for patient care is promising in many medical fields such as cardiovascular health. For example, hemodynamic biomarkers like wall shear stress could be assessed from patient-specific medical images via machine learning algorithms, bypassing the need for time-intensive computational fluid simulation. However, it is extremely challenging to amass large-enough datasets to effectively train such models. We could address this data scarcity by means of self-supervised pre-training and foundation models given large datasets of geometric artery models. In the context of coronary arteries, leveraging learned representations to improve hemodynamic biomarker assessment has not yet been well studied. In this work, we address this gap by investigating whether a large dataset (8449 shapes) consisting of geometric models of 3D blood vessels can benefit wall shear stress assessment in coronary artery models from a small-scale clinical trial (49 patients). We create a self-supervised target for the 3D blood vessels by computing the heat kernel signature, a quantity obtained via Laplacian eigenvectors, which captures the very essence of the shapes. We show how geometric representations learned from this dataset can boost segmentation of coronary arteries into regions of low, mid and high (time-averaged) wall shear stress even when trained on limited data.
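The heat kernel signature used as the self-supervised target is usually defined as HKS(x, t) = sum_i exp(-lambda_i t) phi_i(x)^2, where lambda_i and phi_i are the eigenvalues and eigenvectors of a Laplacian on the shape. This equals the diagonal entry of the heat kernel exp(-tL) at x. The toy sketch below uses that identity on a small graph Laplacian with a truncated Taylor series, purely to stay dependency-free. Both choices are assumptions made for illustration: on vessel surface meshes one would typically use a mesh Laplacian and an eigendecomposition, and the paper's actual pipeline is not described in this summary.

```python
def graph_laplacian(n, edges):
    """Unweighted combinatorial Laplacian L = D - A of an undirected graph."""
    L = [[0.0] * n for _ in range(n)]
    for i, j in edges:
        L[i][i] += 1.0
        L[j][j] += 1.0
        L[i][j] -= 1.0
        L[j][i] -= 1.0
    return L


def matmul(A, B):
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def heat_kernel(L, t, terms=40):
    """exp(-t L) via a truncated Taylor series (adequate for small graphs, moderate t)."""
    n = len(L)
    A = [[-t * L[i][j] for j in range(n)] for i in range(n)]
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    term = [row[:] for row in result]
    for k in range(1, terms):
        term = matmul(term, A)
        term = [[v / k for v in row] for row in term]  # term is now A^k / k!
        result = [[result[i][j] + term[i][j] for j in range(n)] for i in range(n)]
    return result


def heat_kernel_signature(L, times):
    """Per-vertex descriptor: diagonal of exp(-t L) at each diffusion time t."""
    kernels = [heat_kernel(L, t) for t in times]
    return [[K[v][v] for K in kernels] for v in range(len(L))]
```

On a 4-cycle every vertex is equivalent, so all vertices receive the same signature; on an irregular shape the values differ with local geometry, which is what makes the quantity usable as a regression target.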

[110] No Label Left Behind: A Unified Surface Defect Detection Model for all Supervision Regimes

Blaž Rolih,Matic Fučka,Danijel Skočaj

Main category: cs.CV

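TL;DR: SuperSimpleNet is an efficient surface defect detection model that adapts to multiple supervision scenarios, combining high performance with low latency.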

Details Motivation: Existing surface defect detection methods are usually constrained to specific supervision scenarios and struggle to adapt to the diverse data annotations found in real manufacturing, such as unsupervised, weakly supervised, mixed supervision, and fully supervised settings. Method: SuperSimpleNet builds on SimpleNet and introduces a new synthetic anomaly generation process, an enhanced classification head, and an improved learning procedure, enabling efficient training in all four supervision scenarios. Result: SuperSimpleNet sets a new performance standard on four challenging benchmark datasets, with an inference time below 10 ms, showing strong speed and accuracy. Conclusion: By unifying diverse supervision paradigms while maintaining outstanding speed and reliability, SuperSimpleNet bridges the gap between academic research and industrial applications. Abstract: Surface defect detection is a critical task across numerous industries, aimed at efficiently identifying and localising imperfections or irregularities on manufactured components. While numerous methods have been proposed, many fail to meet industrial demands for high performance, efficiency, and adaptability. Existing approaches are often constrained to specific supervision scenarios and struggle to adapt to the diverse data annotations encountered in real-world manufacturing processes, such as unsupervised, weakly supervised, mixed supervision, and fully supervised settings. To address these challenges, we propose SuperSimpleNet, a highly efficient and adaptable discriminative model built on the foundation of SimpleNet. SuperSimpleNet incorporates a novel synthetic anomaly generation process, an enhanced classification head, and an improved learning procedure, enabling efficient training in all four supervision scenarios, making it the first model capable of fully leveraging all available data annotations. SuperSimpleNet sets a new standard for performance across all scenarios, as demonstrated by its results on four challenging benchmark datasets. Beyond accuracy, it is very fast, achieving an inference time below 10 ms. With its ability to unify diverse supervision paradigms while maintaining outstanding speed and reliability, SuperSimpleNet represents a promising step forward in addressing real-world manufacturing challenges and bridging the gap between academic research and industrial applications. Code: https://github.com/blaz-r/SuperSimpleNet

[111] Learning Binary Sampling Patterns for Single-Pixel Imaging using Bilevel Optimisation

Serban C. Tudosie,Alexander Denker,Zeljko Kereta,Simon Arridge

Main category: cs.CV

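TL;DR: This paper proposes a novel bilevel optimisation method for learning binary illumination patterns in single-pixel imaging, substantially improving reconstruction performance under undersampling.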

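Details Motivation: To improve imaging in applications such as single-pixel fluorescence microscopy, task-specific binary illumination patterns need to be optimised. Method: The non-differentiability of binary pattern optimisation is handled with the Straight-Through Estimator, and a Total Deep Variation regulariser is used in the bilevel formulation. Result: The method is validated on the CytoImageNet microscopy dataset, where the learned patterns give superior reconstruction performance in highly undersampled regimes. Conclusion: The paper proposes a bilevel optimisation method for single-pixel imaging that learns task-specific binary illumination patterns and outperforms baseline methods in reconstruction quality in highly undersampled regimes. Abstract: Single-Pixel Imaging enables reconstructing objects using a single detector through sequential illuminations with structured light patterns. We propose a bilevel optimisation method for learning task-specific, binary illumination patterns, optimised for applications like single-pixel fluorescence microscopy. We address the non-differentiable nature of binary pattern optimisation using the Straight-Through Estimator and leveraging a Total Deep Variation regulariser in the bilevel formulation. We demonstrate our method on the CytoImageNet microscopy dataset and show that learned patterns achieve superior reconstruction performance compared to baseline methods, especially in highly undersampled regimes.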
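The Straight-Through Estimator mentioned above can be illustrated with a deliberately tiny example. In the forward pass, real-valued latent weights are thresholded into a binary pattern; in the backward pass, the threshold is treated as if it were the identity, so the gradient with respect to the binary pattern is applied directly to the latent weights. The sketch is a single-level toy with a squared-error loss on one simulated detector reading, not the paper's bilevel formulation, and all names (`binarize`, `ste_step`) are hypothetical.

```python
def binarize(w):
    """Hard threshold: latent weight > 0 means the pattern pixel is on (1.0)."""
    return [1.0 if wi > 0 else 0.0 for wi in w]


def ste_step(w, x, y, lr=0.1):
    """One gradient step on loss = (<binarize(w), x> - y)^2 using the STE.

    w: latent real-valued weights behind the binary pattern
    x: scene intensities (toy stand-in for an image)
    y: target detector reading
    """
    b = binarize(w)                                # forward: non-differentiable threshold
    pred = sum(bi * xi for bi, xi in zip(b, x))    # single-pixel measurement
    err = pred - y
    grad_b = [2.0 * err * xi for xi in x]          # d loss / d b_i
    # Straight-through: pretend d b_i / d w_i = 1 and pass grad_b to w unchanged.
    return [wi - lr * gi for wi, gi in zip(w, grad_b)]
```

Starting from `w = [0.1, -0.2]` with `x = [1.0, 2.0]` and `y = 2.0`, one step moves the second latent weight above zero, so the binary pattern flips from `[1, 0]` to `[1, 1]` even though the threshold itself has zero gradient almost everywhere.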

[112] VibES: Induced Vibration for Persistent Event-Based Sensing

Vincenzo Polizzi,Stephen Yang,Quentin Clark,Jonathan Kelly,Igor Gilitschenski,David B. Lindell

Main category: cs.CV

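TL;DR: This paper introduces a lightweight approach that induces periodic vibration of an event camera with a rotating unbalanced mass and combines it with motion compensation to improve perception performance.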

Details Motivation: Event cameras cannot generate events in static or low-motion scenes, which makes them unsuitable for most computer vision tasks; the paper aims to address this limitation. Method: A simple rotating unbalanced mass induces periodic vibrational motion, combined with a motion-compensation pipeline that removes the injected motion. Result: The method reliably recovers motion parameters and outperforms event-based sensing without motion induction in image reconstruction and edge detection. Conclusion: The paper proposes a lightweight approach that sustains persistent event generation by inducing periodic vibrational motion and, combined with a motion-compensation pipeline, improves event-based perception. Abstract: Event cameras are a bio-inspired class of sensors that asynchronously measure per-pixel intensity changes. Under fixed illumination conditions in static or low-motion scenes, rigidly mounted event cameras are unable to generate any events, becoming unsuitable for most computer vision tasks. To address this limitation, recent work has investigated motion-induced event stimulation that often requires complex hardware or additional optical components. In contrast, we introduce a lightweight approach to sustain persistent event generation by employing a simple rotating unbalanced mass to induce periodic vibrational motion. This is combined with a motion-compensation pipeline that removes the injected motion and yields clean, motion-corrected events for downstream perception tasks. We demonstrate our approach with a hardware prototype and evaluate it on real-world captured datasets. Our method reliably recovers motion parameters and improves both image reconstruction and edge detection over event-based sensing without motion induction.

[113] Few-Shot Connectivity-Aware Text Line Segmentation in Historical Documents

Rafael Sterzinger,Tingyu Lin,Robert Sablatnig

Main category: cs.CV

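TL;DR: This paper proposes a lightweight deep learning method for text line segmentation in historical documents that, using a simple architecture and a topology-aware loss function, achieves substantial performance gains with very little annotated data.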

Details Motivation: Historical documents usually lack large annotated datasets, and annotation is time-consuming and costly, so a text line segmentation method with lower data requirements is needed. Method: A lightweight UNet++ is paired with a connectivity-aware loss and trained on small patches extracted from only three annotated pages per manuscript. Result: On the U-DIADS-TL dataset the method significantly improves on the current state of the art, with a 200% increase in Recognition Accuracy and a 75% increase in Line Intersection over Union. Conclusion: Small, simple architectures combined with a topology-aware loss function are more accurate and data-efficient for text line segmentation than more complex alternatives. Abstract: A foundational task for the digital analysis of documents is text line segmentation. However, automating this process with deep learning models is challenging because it requires large, annotated datasets that are often unavailable for historical documents. Additionally, the annotation process is a labor- and cost-intensive task that requires expert knowledge, which makes few-shot learning a promising direction for reducing data requirements. In this work, we demonstrate that small and simple architectures, coupled with a topology-aware loss function, are more accurate and data-efficient than more complex alternatives. We pair a lightweight UNet++ with a connectivity-aware loss, initially developed for neuron morphology, which explicitly penalizes structural errors like line fragmentation and unintended line merges. To increase our limited data, we train on small patches extracted from a mere three annotated pages per manuscript. Our methodology significantly improves upon the current state-of-the-art on the U-DIADS-TL dataset, with a 200% increase in Recognition Accuracy and a 75% increase in Line Intersection over Union. Our method also achieves an F-Measure score on par with or even exceeding that of the competition winner of the DIVA-HisDB baseline detection task, all while requiring only three annotated pages, exemplifying the efficacy of our approach. Our implementation is publicly available at: https://github.com/RafaelSterzinger/acpr_few_shot_hist.

[114] Dual Enhancement on 3D Vision-Language Perception for Monocular 3D Visual Grounding

Yuzhen Li,Min Liu,Yuan Bian,Xueping Wang,Zhaoyang Li,Gen Li,Yaonan Wang

Main category: cs.CV

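TL;DR: This paper studies monocular 3D visual grounding, finds that pre-trained language models have weak 3D comprehension, proposes a method that enhances the model's 3D perception of text embeddings and geometry features, and validates it on the Mono3DRefer dataset.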

Details Motivation: Although the text contains geometric details, the authors observe that text embeddings are sensitive to the magnitude of numerical values but largely ignore the associated measurement units. This indicates that pre-trained language models have weak 3D comprehension and produce misleading text features that hinder 3D perception. Method: The authors first introduce a pre-processing method, 3D-text Enhancement (3DTE), and then propose a Text-Guided Geometry Enhancement (TGE) module. Result: Experiments show that the method achieves new state-of-the-art results, with a notable accuracy gain of 11.94% in the "Far" scenario. Conclusion: The authors propose a method that enhances the model's 3D perception of text embeddings and geometry features and evaluate it on the Mono3DRefer dataset, where it shows substantial improvements over previous methods. Abstract: Monocular 3D visual grounding is a novel task that aims to locate 3D objects in RGB images using text descriptions with explicit geometry information. Despite the inclusion of geometry details in the text, we observe that the text embeddings are sensitive to the magnitude of numerical values but largely ignore the associated measurement units. For example, simply converting a length expressed in "meters" to the equivalent value in "decimeters" or "centimeters" leads to severe performance degradation, even though the physical length remains equivalent. This observation signifies the weak 3D comprehension of pre-trained language models, which generate misleading text features that hinder 3D perception. Therefore, we propose to enhance the model's 3D perception of text embeddings and geometry features with two simple and effective methods. Firstly, we introduce a pre-processing method named 3D-text Enhancement (3DTE), which enhances the comprehension of mapping relationships between different units by augmenting the diversity of distance descriptors in text queries. Next, we propose a Text-Guided Geometry Enhancement (TGE) module to further enhance the 3D-text information by projecting the basic text features into geometrically consistent space. These 3D-enhanced text features are then leveraged to precisely guide the attention of geometry features. We evaluate the proposed method through extensive comparisons and ablation studies on the Mono3DRefer dataset. 
Experimental results demonstrate substantial improvements over previous methods, achieving new state-of-the-art results with a notable accuracy gain of 11.94% in the "Far" scenario. Our code will be made publicly available.
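The idea behind 3DTE, augmenting text queries with equivalent distances in other units, can be sketched as a simple rewriting step. The code below is only a guess at the flavour of such pre-processing: the regular expression, the unit table, and the function names (`rewrite_units`, `augment_query`) are assumptions made for illustration, and the paper's actual augmentation rules are not given in the abstract.

```python
import re

# Conversion factors from meters to the target unit (assumed unit set).
_FACTORS = {"decimeters": 10.0, "centimeters": 100.0}

# Matches a number followed by "meter" or "meters", e.g. "12 meters", "2.5 meter".
_METER = re.compile(r"(\d+(?:\.\d+)?)\s*(?:meters|meter)\b")


def rewrite_units(query, unit):
    """Rewrite every '<number> meters' in the query into the given unit."""
    factor = _FACTORS[unit]

    def repl(match):
        value = float(match.group(1)) * factor
        return f"{value:g} {unit}"  # :g drops a trailing '.0'

    return _METER.sub(repl, query)


def augment_query(query):
    """Return the original query plus physically equivalent variants in other units."""
    return [query] + [rewrite_units(query, unit) for unit in _FACTORS]
```

For example, `augment_query("the car 12 meters ahead")` yields the original query together with variants that say "120 decimeters" and "1200 centimeters", so all three describe the same physical distance.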

[115] Beyond flattening: a geometrically principled positional encoding for vision transformers with Weierstrass elliptic functions

Zhihang Xin,Xitong Hu,Rui Wang

Main category: cs.CV

TL;DR: This paper introduces WEF-PE, a novel positional encoding method for Vision Transformers that better preserves 2D spatial structure using elliptic functions, achieving strong performance on image benchmarks.

Details Motivation: Traditional positional encoding methods in Vision Transformers disrupt the 2D spatial structure of images and fail to establish a proper correspondence between Euclidean and sequential distances. Method: Weierstrass Elliptic Function Positional Encoding (WEF-PE) is introduced to encode spatial distance relationships using the non-linear geometric nature of elliptic functions and derive relative positional information algebraically. Result: WEF-PE achieved superior performance across multiple scenarios, including 63.78% accuracy on CIFAR-100 from scratch with ViT-Tiny and 93.28% fine-tuned accuracy with ViT-Base, along with theoretical validation and improved attention visualization. Conclusion: The proposed WEF-PE method enhances the spatial encoding in Vision Transformers by leveraging the properties of elliptic functions, outperforming conventional approaches in various benchmarks. Abstract: Vision Transformers have demonstrated remarkable success in computer vision tasks, yet their reliance on learnable one-dimensional positional embeddings fundamentally disrupts the inherent two-dimensional spatial structure of images through patch flattening procedures. Traditional positional encoding approaches lack geometric constraints and fail to establish monotonic correspondence between Euclidean spatial distances and sequential index distances, thereby limiting the model's capacity to leverage spatial proximity priors effectively. We propose Weierstrass Elliptic Function Positional Encoding (WEF-PE), a mathematically principled approach that directly addresses two-dimensional coordinates through natural complex domain representation, where the doubly periodic properties of elliptic functions align remarkably with translational invariance patterns commonly observed in visual data. 
Our method exploits the non-linear geometric nature of elliptic functions to encode spatial distance relationships naturally, while the algebraic addition formula enables direct derivation of relative positional information between arbitrary patch pairs from their absolute encodings. Comprehensive experiments demonstrate that WEF-PE achieves superior performance across diverse scenarios, including 63.78% accuracy on CIFAR-100 from-scratch training with ViT-Tiny architecture, 93.28% on CIFAR-100 fine-tuning with ViT-Base, and consistent improvements on VTAB-1k benchmark tasks. Theoretical analysis confirms the distance-decay property through rigorous mathematical proof, while attention visualization reveals enhanced geometric inductive bias and more coherent semantic focus compared to conventional approaches. The source code implementing the methods described in this paper is publicly available on GitHub.

[116] SoccerNet 2025 Challenges Results

Silvio Giancola,Anthony Cioppa,Marc Gutiérrez-Pérez,Jan Held,Carlos Hinojosa,Victor Joos,Arnaud Leduc,Floriane Magera,Karen Sanchez,Vladimir Somers,Artur Xarles,Antonio Agudo,Alexandre Alahi,Olivier Barnich,Albert Clapés,Christophe De Vleeschouwer,Sergio Escalera,Bernard Ghanem,Thomas B. Moeslund,Marc Van Droogenbroeck,Tomoki Abe,Saad Alotaibi,Faisal Altawijri,Steven Araujo,Xiang Bai,Xiaoyang Bi,Jiawang Cao,Vanyi Chao,Kamil Czarnogórski,Fabian Deuser,Mingyang Du,Tianrui Feng,Patrick Frenzel,Mirco Fuchs,Jorge García,Konrad Habel,Takaya Hashiguchi,Sadao Hirose,Xinting Hu,Yewon Hwang,Ririko Inoue,Riku Itsuji,Kazuto Iwai,Hongwei Ji,Yangguang Ji,Licheng Jiao,Yuto Kageyama,Yuta Kamikawa,Yuuki Kanasugi,Hyungjung Kim,Jinwook Kim,Takuya Kurihara,Bozheng Li,Lingling Li,Xian Li,Youxing Lian,Dingkang Liang,Hongkai Lin,Jiadong Lin,Jian Liu,Liang Liu,Shuaikun Liu,Zhaohong Liu,Yi Lu,Federico Méndez,Huadong Ma,Wenping Ma,Jacek Maksymiuk,Henry Mantilla,Ismail Mathkour,Daniel Matthes,Ayaha Motomochi,Amrulloh Robbani Muhammad,Haruto Nakayama,Joohyung Oh,Yin May Oo,Marcelo Ortega,Norbert Oswald,Rintaro Otsubo,Fabian Perez,Mengshi Qi,Cristian Rey,Abel Reyes-Angulo,Oliver Rose,Hoover Rueda-Chacón,Hideo Saito,Jose Sarmiento,Kanta Sawafuji,Atom Scott,Xi Shen,Pragyan Shrestha,Jae-Young Sim,Long Sun,Yuyang Sun,Tomohiro Suzuki,Licheng Tang,Masato Tonouchi,Ikuma Uchida,Henry O. Velesaca,Tiancheng Wang,Rio Watanabe,Jay Wu,Yongliang Wu,Shunzo Yamagishi,Di Yang,Xu Yang,Yuxin Yang,Hao Ye,Xinyu Ye,Calvin Yeung,Xuanlong Yu,Chao Zhang,Dingyuan Zhang,Kexing Zhang,Zhe Zhao,Xin Zhou,Wenbo Zhu,Julian Ziegler

Main category: cs.CV

TL;DR: The SoccerNet 2025 Challenges provide a benchmark for computer vision research in football video understanding through four vision-based tasks and promote reproducible and open research.

Details Motivation: The motivation is to advance computer vision research in football video understanding and foster reproducible and open research at the intersection of computer vision, artificial intelligence, and sports. Method: The challenges provide large-scale annotated datasets, unified evaluation protocols, and strong baselines to evaluate progress in four vision-based tasks: Team Ball Action Spotting, Monocular Depth Estimation, Multi-View Foul Recognition, and Game State Reconstruction. Result: The report presents the results of each challenge, highlights the top-performing solutions, and provides insights into the progress made by the community. Conclusion: The SoccerNet 2025 Challenges continue to drive reproducible and open research in computer vision, artificial intelligence, and sports, providing a benchmark for evaluating progress in the field. Abstract: The SoccerNet 2025 Challenges mark the fifth annual edition of the SoccerNet open benchmarking effort, dedicated to advancing computer vision research in football video understanding. This year's challenges span four vision-based tasks: (1) Team Ball Action Spotting, focused on detecting ball-related actions in football broadcasts and assigning actions to teams; (2) Monocular Depth Estimation, targeting the recovery of scene geometry from single-camera broadcast clips through relative depth estimation for each pixel; (3) Multi-View Foul Recognition, requiring the analysis of multiple synchronized camera views to classify fouls and their severity; and (4) Game State Reconstruction, aimed at localizing and identifying all players from a broadcast video to reconstruct the game state on a 2D top-view of the field. Across all tasks, participants were provided with large-scale annotated datasets, unified evaluation protocols, and strong baselines as starting points. 
This report presents the results of each challenge, highlights the top-performing solutions, and provides insights into the progress made by the community. The SoccerNet Challenges continue to serve as a driving force for reproducible, open research at the intersection of computer vision, artificial intelligence, and sports. Detailed information about the tasks, challenges, and leaderboards can be found at https://www.soccer-net.org, with baselines and development kits available at https://github.com/SoccerNet.

[117] FastMesh: Efficient Artistic Mesh Generation via Component Decoupling

Jeonghwan Kim,Yushi Lan,Armando Fortes,Yongwei Chen,Xingang Pan

Main category: cs.CV

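TL;DR: This paper proposes an efficient framework for generating artistic meshes that handles vertices and faces separately to significantly reduce redundancy, achieving more than 8x faster generation than existing methods while producing higher-quality meshes.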

Details Motivation: Existing mesh generation methods inevitably reuse vertices multiple times to fully represent manifold meshes, which makes generation inefficient. Method: The paper decouples vertex and face generation: an autoregressive model generates only the vertices, a bidirectional transformer then completes the mesh in a single step, and a fidelity enhancer and a post-processing framework refine vertex placement and remove undesirable edge connections. Result: Experiments show that the method generates meshes more than 8x faster than state-of-the-art approaches while producing higher mesh quality. Conclusion: The method effectively addresses redundancy in mesh generation, improving both efficiency and quality, and offers a new direction for future mesh generation research. Abstract: Recent mesh generation approaches typically tokenize triangle meshes into sequences of tokens and train autoregressive models to generate these tokens sequentially. Despite substantial progress, such token sequences inevitably reuse vertices multiple times to fully represent manifold meshes, as each vertex is shared by multiple faces. This redundancy leads to excessively long token sequences and inefficient generation processes. In this paper, we propose an efficient framework that generates artistic meshes by treating vertices and faces separately, significantly reducing redundancy. We employ an autoregressive model solely for vertex generation, decreasing the token count to approximately 23% of that required by the most compact existing tokenizer. Next, we leverage a bidirectional transformer to complete the mesh in a single step by capturing inter-vertex relationships and constructing the adjacency matrix that defines the mesh faces. To further improve the generation quality, we introduce a fidelity enhancer to refine vertex positioning into more natural arrangements and propose a post-processing framework to remove undesirable edge connections. Experimental results show that our method achieves more than 8$\times$ faster speed on mesh generation compared to state-of-the-art approaches, while producing higher mesh quality.
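
To make the idea of an adjacency matrix that defines mesh faces concrete, here is a minimal sketch that reads triangles out of a symmetric vertex adjacency matrix by enumerating triples whose three edges are all present. The function name `faces_from_adjacency` and the toy matrix are hypothetical, and the brute-force enumeration is an assumption chosen for clarity; it is not the procedure described in the paper.

```python
from itertools import combinations


def faces_from_adjacency(adj):
    """Return vertex-index triples (i, j, k), i < j < k, whose three edges all exist.

    `adj` is a symmetric 0/1 matrix given as a list of lists (hypothetical input format).
    """
    n = len(adj)
    faces = []
    for i, j, k in combinations(range(n), 3):
        # A triangle is kept only if every pair of its vertices is connected.
        if adj[i][j] and adj[j][k] and adj[i][k]:
            faces.append((i, j, k))
    return faces


# Hypothetical example: a square 0-1-2-3 split by the diagonal 0-2.
square = [
    [0, 1, 1, 1],
    [1, 0, 1, 0],
    [1, 1, 0, 1],
    [1, 0, 1, 0],
]
faces = faces_from_adjacency(square)
```

In general, not every triple of mutually connected vertices is a genuine surface face, which is one plausible reason a step that removes undesirable edge connections would matter.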

[118] All-in-One Slider for Attribute Manipulation in Diffusion Models

Weixin Ye,Hongguang Zhu,Wei Wang,Yahui Liu,Mengyu Wang

Main category: cs.CV

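TL;DR: This paper proposes the All-in-One Slider, a module that provides unified control over multiple attributes in text-to-image generation, including unseen attributes, and supports attribute manipulation on real images.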

Details Motivation: Existing text-to-image generation models struggle to progressively adjust specific attributes of generated images, especially for richly detailed content such as human faces. Prior methods train an independent slider module per attribute, which leads to parameter redundancy and limits flexibility and scalability. Method: The paper introduces the All-in-One Slider module, which learns attribute directions through a sparse decomposition of the text embedding space, enabling unified manipulation of multiple attributes. Result: The All-in-One Slider achieves accurate and scalable manipulation across multiple attributes and demonstrates zero-shot manipulation of unseen attributes (e.g., races and celebrities). The method can also be combined with an inversion framework to manipulate attributes of real images. Conclusion: The All-in-One Slider is a lightweight module that decomposes the text embedding space into sparse, semantically meaningful attribute directions, enabling interpretable, fine-grained continuous control over various attributes, zero-shot manipulation of unseen attributes, and composition of multiple attributes. Abstract: Text-to-image (T2I) diffusion models have made significant strides in generating high-quality images. However, progressively manipulating certain attributes of generated images to meet the desired user expectations remains challenging, particularly for content with rich details, such as human faces. Some studies have attempted to address this by training slider modules. However, they follow a One-for-One manner, where an independent slider is trained for each attribute, requiring additional training whenever a new attribute is introduced. This not only results in parameter redundancy accumulated by sliders but also restricts the flexibility of practical applications and the scalability of attribute manipulation. To address this issue, we introduce the All-in-One Slider, a lightweight module that decomposes the text embedding space into sparse, semantically meaningful attribute directions. Once trained, it functions as a general-purpose slider, enabling interpretable and fine-grained continuous control over various attributes. Moreover, by recombining the learned directions, the All-in-One Slider supports zero-shot manipulation of unseen attributes (e.g., races and celebrities) and the composition of multiple attributes. Extensive experiments demonstrate that our method enables accurate and scalable attribute manipulation, achieving notable improvements compared to previous methods.
Furthermore, our method can be extended to integrate with the inversion framework to perform attribute manipulation on real images, broadening its applicability to various real-world scenarios. The code and trained model will be released at: https://github.com/ywxsuperstar/KSAE-FaceSteer.

[119] LSD-3D: Large-Scale 3D Driving Scene Generation with Geometry Grounding

Julian Ost,Andrea Ramazzina,Amogh Joshi,Maximilian Bömer,Mario Bijelic,Felix Heide

Main category: cs.CV

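TL;DR: This paper proposes a new method that combines proxy geometry with score distillation to achieve controllable, geometrically accurate generation of large-scale 3D driving scenes.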

Details Motivation: Existing neural reconstruction methods are limited to static environments and offer limited scene control, while diffusion models offer high controllability but lack geometry grounding; a method that combines the strengths of both is needed. Method: The approach combines proxy geometry modeling with score distillation from 2D image priors to generate controllable, geometrically accurate 3D driving scenes. Result: The method achieves high-quality, geometrically consistent, and controllable 3D driving scene generation and supports conditioning on map layouts. Conclusion: The paper presents a new method combining proxy geometry generation and score distillation that enables controllable generation of large-scale 3D driving scenes while maintaining high-fidelity geometry and texture. Abstract: Large-scale scene data is essential for training and testing in robot learning. Neural reconstruction methods have promised the capability of reconstructing large physically-grounded outdoor scenes from captured sensor data. However, these methods have baked-in static environments and only allow for limited scene control -- they are functionally constrained in scene and trajectory diversity by the captures from which they are reconstructed. In contrast, generating driving data with recent image or video diffusion models offers control, however, at the cost of geometry grounding and causality. In this work, we aim to bridge this gap and present a method that directly generates large-scale 3D driving scenes with accurate geometry, allowing for causal novel view synthesis with object permanence and explicit 3D geometry estimation. The proposed method combines the generation of a proxy geometry and environment representation with score distillation from learned 2D image priors. We find that this approach allows for high controllability, enabling the prompt-guided geometry and high-fidelity texture and structure that can be conditioned on map layouts -- producing realistic and geometrically consistent 3D generations of complex driving scenes.

[120] OmniHuman-1.5: Instilling an Active Mind in Avatars via Cognitive Simulation

Jianwen Jiang,Weihong Zeng,Zerong Zheng,Jiaqi Yang,Chao Liang,Wang Liao,Han Liang,Yuan Zhang,Mingyuan Gao

Main category: cs.CV

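TL;DR: OmniHuman-1.5 proposes a new framework that generates character animations that are physically plausible as well as semantically coherent and expressive, addressing the failure of existing models to capture a character's essence.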

Details Motivation: Existing video avatar models struggle to move beyond physical likeness to capture a character's authentic essence; their motions typically synchronize only with audio rhythm and lack a deeper semantic understanding of emotion, intent, or context. Method: Multimodal Large Language Models synthesize a structured textual representation that guides motion generation, and a Multimodal DiT architecture with a Pseudo Last Frame design fuses the multimodal inputs. Result: Experiments show that the model achieves leading performance across a range of metrics, including lip-sync accuracy, video quality, motion naturalness, and semantic consistency with textual prompts. Conclusion: By combining multimodal inputs with the novel Multimodal DiT architecture, OmniHuman-1.5 generates animations that are deeply coherent with the character, scene, and linguistic content, and it extends to complex scenarios. Abstract: Existing video avatar models can produce fluid human animations, yet they struggle to move beyond mere physical likeness to capture a character's authentic essence. Their motions typically synchronize with low-level cues like audio rhythm, lacking a deeper semantic understanding of emotion, intent, or context. To bridge this gap, we propose a framework designed to generate character animations that are not only physically plausible but also semantically coherent and expressive. Our model, OmniHuman-1.5, is built upon two key technical contributions. First, we leverage Multimodal Large Language Models to synthesize a structured textual representation of conditions that provides high-level semantic guidance. This guidance steers our motion generator beyond simplistic rhythmic synchronization, enabling the production of actions that are contextually and emotionally resonant. Second, to ensure the effective fusion of these multimodal inputs and mitigate inter-modality conflicts, we introduce a specialized Multimodal DiT architecture with a novel Pseudo Last Frame design. The synergy of these components allows our model to accurately interpret the joint semantics of audio, images, and text, thereby generating motions that are deeply coherent with the character, scene, and linguistic content. Extensive experiments demonstrate that our model achieves leading performance across a comprehensive set of metrics, including lip-sync accuracy, video quality, motion naturalness and semantic consistency with textual prompts.
Furthermore, our approach shows remarkable extensibility to complex scenarios, such as those involving multi-person and non-human subjects. Homepage: https://omnihuman-lab.github.io/v1_5/

[121] Automated Feature Tracking for Real-Time Kinematic Analysis and Shape Estimation of Carbon Nanotube Growth

Kaveh Safavigerdini,Ramakrishna Surya,Jaired Collins,Prasad Calyam,Filiz Bunyak,Matthew R. Maschmann,Kannappan Palaniappan

Main category: cs.CV

TL;DR: The paper introduces VFTrack, an in-situ real-time particle tracking framework that automatically detects and tracks individual CNT particles in SEM image sequences, enabling kinematic analysis of CNT micropillar growth and facilitating the calculation of heterogeneous regional growth rates and the reconstruction of evolving CNT pillar morphologies.

Details Motivation: Carbon nanotubes are critical building blocks in nanotechnology, yet the characterization of their dynamic growth is limited by the experimental challenges in nanoscale motion measurement using SEM imaging. Existing ex situ methods offer only static analysis, while in situ techniques often require manual initialization and lack continuous per-particle trajectory decomposition. Method: VFTrack integrates handcrafted or deep feature detectors and matchers within a particle tracking framework to enable kinematic analysis of CNT micropillar growth. Result: A systematic evaluation using 13,540 manually annotated trajectories identifies the ALIKED detector with the LightGlue matcher as an optimal combination (F1-score of 0.78, $\alpha$-score of 0.89). VFTrack motion vectors, decomposed into axial growth, lateral drift, and oscillations, facilitate the calculation of heterogeneous regional growth rates and the reconstruction of evolving CNT pillar morphologies. Conclusion: This work advances automated nano-material characterization, bridging the gap between physics-based models and experimental observation to enable real-time optimization of CNT synthesis. Abstract: Carbon nanotubes (CNTs) are critical building blocks in nanotechnology, yet the characterization of their dynamic growth is limited by the experimental challenges in nanoscale motion measurement using scanning electron microscopy (SEM) imaging. Existing ex situ methods offer only static analysis, while in situ techniques often require manual initialization and lack continuous per-particle trajectory decomposition. We present Visual Feature Tracking (VFTrack), an in-situ real-time particle tracking framework that automatically detects and tracks individual CNT particles in SEM image sequences. VFTrack integrates handcrafted or deep feature detectors and matchers within a particle tracking framework to enable kinematic analysis of CNT micropillar growth.
A systematic evaluation using 13,540 manually annotated trajectories identifies the ALIKED detector with the LightGlue matcher as an optimal combination (F1-score of 0.78, $\alpha$-score of 0.89). VFTrack motion vectors, decomposed into axial growth, lateral drift, and oscillations, facilitate the calculation of heterogeneous regional growth rates and the reconstruction of evolving CNT pillar morphologies. This work advances automated nano-material characterization, bridging the gap between physics-based models and experimental observation to enable real-time optimization of CNT synthesis.

[122] Autoregressive Universal Video Segmentation Model

Miran Heo,Sukjun Hwang,Min-Hung Chen,Yu-Chiang Frank Wang,Albert Gu,Seon Joo Kim,Ryo Hachiuma

Main category: cs.CV

TL;DR: AUSM is a unified video segmentation model that efficiently handles both prompted and unprompted tasks, outperforming existing methods with faster training speeds.

Details Motivation: Current video segmentation models are fragmented across specific tasks and struggle with unprompted segmentation; AUSM aims to unify these approaches and improve efficiency. Method: AUSM is based on state-space models and treats video segmentation as sequential mask prediction, maintaining a fixed-size spatial state while allowing parallel training across frames. Result: AUSM outperforms prior universal streaming video segmentation methods and achieves up to 2.5x faster training on 16-frame sequences. Conclusion: AUSM provides a unified framework for both prompted and unprompted video segmentation, outperforming previous methods and enabling faster training. Abstract: Recent video foundation models such as SAM2 excel at prompted video segmentation by treating masks as a general-purpose primitive. However, many real-world settings require unprompted segmentation that aims to detect and track all objects in a video without external cues, leaving today's landscape fragmented across task-specific models and pipelines. We recast streaming video segmentation as sequential mask prediction, analogous to language modeling, and introduce the Autoregressive Universal Segmentation Model (AUSM), a single architecture that unifies both prompted and unprompted video segmentation. Built on recent state-space models, AUSM maintains a fixed-size spatial state and scales to video streams of arbitrary length. Furthermore, all components of AUSM are designed for parallel training across frames, yielding substantial speedups over iterative training. On standard benchmarks (DAVIS17, YouTube-VOS 2018 & 2019, MOSE, YouTube-VIS 2019 & 2021, and OVIS) AUSM outperforms prior universal streaming video segmentation methods and achieves up to 2.5x faster training on 16-frame sequences.

[123] Style4D-Bench: A Benchmark Suite for 4D Stylization

Beiqi Chen,Shuai Shao,Haitang Feng,Jianhuang Lai,Jianlou Si,Guangcong Wang

Main category: cs.CV

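TL;DR: Style4D-Bench is a comprehensive benchmark suite for 4D stylization, comprising an evaluation protocol, high-resolution dynamic scenes, and a strong baseline framework, Style4D, that achieves high-quality stylization.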

Details Motivation: Advancing the field of 4D stylization and providing standardized evaluation tools requires a dedicated benchmark suite. Method: The Style4D framework is built on 4D Gaussian Splatting and comprises a basic 4DGS scene representation, a Style Gaussian Representation, and a Holistic Geometry-Preserved Style Transfer module. Result: Extensive experiments on Style4D-Bench show that Style4D achieves state-of-the-art performance in 4D stylization, with stable temporal dynamics and consistent multi-view rendering. Conclusion: Style4D-Bench is a benchmark suite for 4D stylization that aims to standardize evaluation and advance the field. Abstract: We introduce Style4D-Bench, the first benchmark suite specifically designed for 4D stylization, with the goal of standardizing evaluation and facilitating progress in this emerging area. Style4D-Bench comprises: 1) a comprehensive evaluation protocol measuring spatial fidelity, temporal coherence, and multi-view consistency through both perceptual and quantitative metrics, 2) a strong baseline that makes an initial attempt at 4D stylization, and 3) a curated collection of high-resolution dynamic 4D scenes with diverse motions and complex backgrounds. To establish a strong baseline, we present Style4D, a novel framework built upon 4D Gaussian Splatting. It consists of three key components: a basic 4DGS scene representation to capture reliable geometry, a Style Gaussian Representation that leverages lightweight per-Gaussian MLPs for temporally and spatially aware appearance control, and a Holistic Geometry-Preserved Style Transfer module designed to enhance spatio-temporal consistency via contrastive coherence learning and structural content preservation. Extensive experiments on Style4D-Bench demonstrate that Style4D achieves state-of-the-art performance in 4D stylization, producing fine-grained stylistic details with stable temporal dynamics and consistent multi-view rendering. We expect Style4D-Bench to become a valuable resource for benchmarking and advancing research in stylized rendering of dynamic 3D scenes. Project page: https://becky-catherine.github.io/Style4D . Code: https://github.com/Becky-catherine/Style4D-Bench .

[124] Articulate3D: Zero-Shot Text-Driven 3D Object Posing

Oishi Deb,Anjun Hu,Ashkan Khakzar,Philip Torr,Christian Rupprecht

Main category: cs.CV

TL;DR: Articulate3D is a training-free method that allows for the pose manipulation of 3D assets through language control, effectively maintaining the original structure while achieving desired poses.

Details Motivation: Despite advances in vision and language models, posing 3D assets through language control remains a challenging task. The motivation is to develop a method that can effectively manipulate the pose of 3D objects while preserving their original identity. Method: The method involves modifying an image generator to create target images based on input images and text instructions, followed by a multi-view pose optimization step using keypoints for alignment. A self-attention rewiring mechanism (RSActrl) is introduced to decouple source structure from pose. Result: Articulate3D successfully demonstrates the ability to manipulate poses of a diverse range of 3D objects using free-form text prompts while maintaining the original identity of the mesh. Conclusion: Articulate3D is a training-free method for posing 3D assets through language control, which outperforms existing approaches as confirmed by quantitative evaluations and user studies. Abstract: We propose a training-free method, Articulate3D, to pose a 3D asset through language control. Despite advances in vision and language models, this task remains surprisingly challenging. To achieve this goal, we decompose the problem into two steps. We modify a powerful image-generator to create target images conditioned on the input image and a text instruction. We then align the mesh to the target images through a multi-view pose optimisation step. In detail, we introduce a self-attention rewiring mechanism (RSActrl) that decouples the source structure from pose within an image generative model, allowing it to maintain a consistent structure across varying poses. We observed that differentiable rendering is an unreliable signal for articulation optimisation; instead, we use keypoints to establish correspondences between input and target images. 
The effectiveness of Articulate3D is demonstrated across a diverse range of 3D objects and free-form text prompts, successfully manipulating poses while maintaining the original identity of the mesh. Quantitative evaluations and a comparative user study, in which our method was preferred over 85% of the time, confirm its superiority over existing approaches. Project page: https://odeb1.github.io/articulate3d_page_deb/

[125] VoxHammer: Training-Free Precise and Coherent 3D Editing in Native 3D Space

Lin Li,Zehuan Huang,Haoran Feng,Gengxiong Zhuang,Rui Chen,Chunchao Guo,Lu Sheng

Main category: cs.CV

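TL;DR: VoxHammer is a new 3D local editing method that edits in 3D latent space, precisely preserving unedited regions and coherently integrating edited parts, which addresses the consistency shortcomings of existing methods.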

Details Motivation: Existing 3D local editing methods struggle to precisely preserve unedited regions and maintain overall coherence, so a new solution is needed. Method: VoxHammer predicts the inversion trajectory of a 3D model to obtain inverted latents and key-value tokens, then, during the denoising and editing phase, replaces the denoising features of preserved regions with this information, ensuring consistent reconstruction of preserved regions and coherent integration of edited parts. Result: Experiments show that VoxHammer significantly outperforms existing methods in both 3D consistency of preserved regions and overall quality. Conclusion: VoxHammer is a new training-free method that performs well in 3D local editing, precisely preserving unedited regions while achieving overall coherence, and lays the data foundation for in-context 3D generation. Abstract: 3D local editing of specified regions is crucial for the game industry and robot interaction. Recent methods typically edit rendered multi-view images and then reconstruct 3D models, but they face challenges in precisely preserving unedited regions and overall coherence. Inspired by structured 3D generative models, we propose VoxHammer, a novel training-free approach that performs precise and coherent editing in 3D latent space. Given a 3D model, VoxHammer first predicts its inversion trajectory and obtains its inverted latents and key-value tokens at each timestep. Subsequently, in the denoising and editing phase, we replace the denoising features of preserved regions with the corresponding inverted latents and cached key-value tokens. By retaining these contextual features, this approach ensures consistent reconstruction of preserved areas and coherent integration of edited parts. To evaluate the consistency of preserved regions, we constructed Edit3D-Bench, a human-annotated dataset comprising hundreds of samples, each with carefully labeled 3D editing regions. Experiments demonstrate that VoxHammer significantly outperforms existing methods in terms of both 3D consistency of preserved regions and overall quality. Our method holds promise for synthesizing high-quality edited paired data, thereby laying the data foundation for in-context 3D generation. See our project page at https://huanngzh.github.io/VoxHammer-Page/.

eess.IV [Back]

[126] Federative ischemic stroke segmentation as alternative to overcome domain-shift multi-institution challenges

Edgar Rangel,Fabio Martinez

Main category: eess.IV

TL;DR: This paper proposes a collaborative framework using a FedAvg model to improve ischemic stroke lesion segmentation, achieving strong performance across multiple healthcare centers and demonstrating reliable generalization capabilities.

Details Motivation: The motivation is to overcome the limitations of current lesion analysis variability due to patient demographics, scanner vendors, and annotation differences, as well as the lack of sufficient labeled samples in many clinical centers. Method: The method involves developing a collaborative framework for lesion segmentation using deep center-independent representations through the FedAvg model, evaluated across 14 emulated healthcare centers with 2031 studies. Result: The FedAvg model achieved a Dice Similarity Coefficient (DSC) of 0.71 ± 0.24, Absolute Volume Difference (AVD) of 5.29 ± 22.74, Average Lesion Distance (ALD) of 2.16 ± 3.60, and Lesion F1 (LF1) of 0.70 ± 0.26 across all centers, outperforming both centralized and other federated methods. Conclusion: This paper concludes that the proposed collaborative framework using FedAvg model can effectively segment ischemic stroke lesions with strong generalization properties across different healthcare centers and lesion categories, outperforming centralized and other federated approaches. Abstract: Stroke is the second leading cause of death and the third leading cause of disability worldwide. Clinical guidelines establish diffusion resonance imaging (DWI, ADC) as the standard for localizing, characterizing, and measuring infarct volume, enabling treatment support and prognosis. Nonetheless, such lesion analysis is highly variable due to different patient demographics, scanner vendors, and expert annotations. Computational support approaches have been key to helping with the localization and segmentation of lesions. However, these strategies are dedicated solutions that learn patterns from only one institution, lacking the variability to generalize geometrical lesions shape models. Even worse, many clinical centers lack sufficient labeled samples to adjust these dedicated solutions. 
This work developed a collaborative framework for segmenting ischemic stroke lesions in DWI sequences by sharing knowledge from deep center-independent representations. From 14 emulated healthcare centers with 2031 studies, the FedAvg model achieved a general DSC of $0.71 \pm 0.24$, AVD of $5.29 \pm 22.74$, ALD of $2.16 \pm 3.60$ and LF1 of $0.70 \pm 0.26$ over all centers, outperforming both the centralized and other federated rules. Interestingly, the model demonstrated strong generalization properties, showing uniform performance across different lesion categories and reliable performance in out-of-distribution centers (with DSC of $0.64 \pm 0.29$ and AVD of $4.44 \pm 8.74$ without any additional training).
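
As a rough illustration of the two quantities this entry leans on, the sketch below shows the standard FedAvg aggregation rule (a sample-count-weighted average of client parameters) and the Dice Similarity Coefficient for binary masks. Parameters are flattened to plain Python lists for simplicity, and the function names and toy values are hypothetical; nothing here reproduces the paper's training setup or data.

```python
def fedavg(client_weights, client_sizes):
    """Average per-client parameter vectors, weighting each client by its sample count."""
    total = sum(client_sizes)
    dim = len(client_weights[0])
    return [
        sum(w[d] * n for w, n in zip(client_weights, client_sizes)) / total
        for d in range(dim)
    ]


def dice(pred, target):
    """Dice Similarity Coefficient for two binary masks given as flat 0/1 lists."""
    intersection = sum(p * t for p, t in zip(pred, target))
    denom = sum(pred) + sum(target)
    # Two empty masks are treated as a perfect match (a common convention, assumed here).
    return 1.0 if denom == 0 else 2.0 * intersection / denom


# Hypothetical toy values, not data from the paper.
global_w = fedavg([[1.0, 2.0], [3.0, 4.0]], [1, 3])
score = dice([1, 1, 0, 0], [1, 0, 0, 0])
```

Weighting by sample count means a center with three times as many studies pulls the aggregated parameters three times as hard, which is why small centers can still benefit from the shared model.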

[127] Analise de Desaprendizado de Maquina em Modelos de Classificacao de Imagens Medicas

Andreza M. C. Falcao,Filipe R. Cordeiro

Main category: eess.IV

TL;DR: This study explores the use of the SalUn unlearning model in medical image classification, showing it can effectively remove sensitive data while preserving model performance.

Details Motivation: Machine unlearning aims to remove sensitive data from pre-trained models, which has not yet been explored in medical image classification. Method: The study evaluated the SalUn unlearning model using experiments on the PathMNIST, OrganAMNIST, and BloodMNIST datasets, and analyzed the impact of data augmentation on unlearning quality. Result: SalUn achieves performance close to full retraining, indicating its effectiveness and efficiency in medical image classification tasks. Conclusion: SalUn can efficiently remove private or sensitive data from a pre-trained model while maintaining model robustness, making it suitable for medical applications. Abstract: Machine unlearning aims to remove private or sensitive data from a pre-trained model while preserving the model's robustness. Despite recent advances, this technique has not been explored in medical image classification. This work evaluates the SalUn unlearning model by conducting experiments on the PathMNIST, OrganAMNIST, and BloodMNIST datasets. We also analyse the impact of data augmentation on the quality of unlearning. Results show that SalUn achieves performance close to full retraining, indicating an efficient solution for use in medical applications.

[128] A Deep Learning Application for Psoriasis Detection

Anna Milani,Fábio S. da Silva,Elloá B. Guedes,Ricardo Rios

Main category: eess.IV

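TL;DR: This paper studies the performance of three convolutional neural network models in classifying images of skin affected by psoriasis; the results indicate that Inception v3 achieves the highest accuracy and F1-score, making it an effective tool for assisting psoriasis diagnosis.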

Details Motivation: To identify a convolutional neural network model that can accurately classify images of skin affected by psoriasis in order to support diagnosis. Method: A comparative study of the performance of three convolutional neural network models, ResNet50, Inception v3, and VGG19; some techniques were used to adjust the networks' evaluation metrics. Result: The Inception v3 model performed satisfactorily in terms of accuracy and F1-score (97.5% ± 0.2). Conclusion: The Inception v3 model performs well in classifying images of skin affected by psoriasis and is a valuable tool for assisting diagnosis. Abstract: In this paper a comparative study of the performance of three Convolutional Neural Network models, ResNet50, Inception v3 and VGG19 for classification of skin images with lesions affected by psoriasis is presented. The images used for training and validation of the models were obtained from specialized platforms. Some techniques were used to adjust the evaluation metrics of the neural networks. The results found suggest the model Inception v3 as a valuable tool for supporting the diagnosis of psoriasis. This is due to its satisfactory performance with respect to accuracy and F1-Score (97.5% ${\pm}$ 0.2).

[129] A Closer Look at Edema Area Segmentation in SD-OCT Images Using Adversarial Framework

Yuhui Tao,Yizhe Zhang,Qiang Chen

Main category: eess.IV

TL;DR: This paper proposes a novel framework for edema area segmentation using weakly-supervised learning with layer-structure guidance and test-time adaptation, achieving performance closer to fully-supervised models.

Details Motivation: The development of AI models for macular edema analysis heavily relies on costly, expert-annotated pixel-level datasets. While weakly-supervised anomaly detection methods show promise, they still underperform compared to fully-supervised approaches. Method: The study leverages the correlation between edema area (EA) and retinal layers in SD-OCT images, integrating layer-structure-guided post-processing and a test-time-adaptation (TTA) strategy into an existing adversarial framework for EA segmentation. Result: Experiments on two public datasets demonstrate that the proposed method enhances EA segmentation accuracy and robustness, significantly improving the performance of weakly-supervised models. Conclusion: The proposed framework, incorporating layer-structure-guided post-processing and TTA strategy, improves the accuracy and robustness of EA segmentation, narrowing the gap between weakly-supervised and fully-supervised models. Abstract: The development of artificial intelligence models for macular edema (ME) analysis always relies on expert-annotated pixel-level image datasets which are expensive to collect prospectively. While anomaly-detection-based weakly-supervised methods have shown promise in edema area (EA) segmentation task, their performance still lags behind fully-supervised approaches. In this paper, we leverage the strong correlation between EA and retinal layers in spectral-domain optical coherence tomography (SD-OCT) images, along with the update characteristics of weakly-supervised learning, to enhance an off-the-shelf adversarial framework for EA segmentation with a novel layer-structure-guided post-processing step and a test-time-adaptation (TTA) strategy. By incorporating additional retinal layer information, our framework reframes the dense EA prediction task as one of confirming intersection points between the EA contour and retinal layers, resulting in predictions that better align with the shape prior of EA.
Besides, the TTA framework further helps address discrepancies in the manifestations and presen-tations of EA between training and test sets. Extensive experiments on two pub-licly available datasets demonstrate that these two proposed ingredients can im-prove the accuracy and robustness of EA segmentation, bridging the gap between weakly-supervised and fully-supervised models.

[130] Understanding Benefits and Pitfalls of Current Methods for the Segmentation of Undersampled MRI Data

Jan Nikolas Morshuis,Matthias Hein,Christian F. Baumgartner

Main category: eess.IV

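TL;DR: This paper presents a unified benchmark for segmenting accelerated MRI data, comparing 7 approaches and finding that two-stage methods perform best.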

Details Motivation: Segmentation methods for accelerated MRI data lack a unified standard of comparison, and the optimal strategy remains unclear. Method: Seven approaches, including one-stage and two-stage methods, are compared on two MRI datasets that contain multi-coil k-space data and human-annotated segmentation ground truth. Result: The study provides the first comprehensive comparison of segmentation methods for undersampled MRI data and finds that two-stage methods perform better. Conclusion: Simple two-stage methods that account for data consistency achieve the best segmentation results, surpassing complex specialized methods developed specifically for this task. Abstract: MR imaging is a valuable diagnostic tool that allows non-invasive visualization of patient anatomy and pathology with high soft-tissue contrast. However, MRI acquisition is typically time-consuming, leading to patient discomfort and increased costs to the healthcare system. Recent years have seen substantial research effort into the development of methods that allow for accelerated MRI acquisition while still obtaining a reconstruction that appears similar to the fully-sampled MR image. However, for many applications a perfectly reconstructed MR image may not be necessary, particularly when the primary goal is a downstream task such as segmentation. This has led to growing interest in methods that aim to perform segmentation directly on accelerated MRI data. Despite recent advances, existing methods have largely been developed in isolation, without direct comparison to one another, often using separate or private datasets, and lacking unified evaluation standards. To date, no high-quality, comprehensive comparison of these methods exists, and the optimal strategy for segmenting accelerated MR data remains unknown. This paper provides the first unified benchmark for the segmentation of undersampled MRI data comparing 7 approaches. A particular focus is placed on comparing one-stage approaches, which combine reconstruction and segmentation into a unified model, with two-stage approaches, which utilize established MRI reconstruction methods followed by a segmentation network. We test these methods on two MRI datasets that include multi-coil k-space data as well as a human-annotated segmentation ground-truth.
We find that simple two-stage methods that consider data-consistency lead to the best segmentation scores, surpassing complex specialized methods that are developed specifically for this task.
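
As a rough illustration of what "considering data-consistency" usually means in MRI reconstruction, the sketch below overwrites the k-space of a reconstructed image with the measured samples wherever they were acquired. It is a generic single-coil, Cartesian example, not the benchmark's implementation; the function name `data_consistency`, the toy image, and the random sampling mask are all assumptions made for illustration.

```python
import numpy as np

def data_consistency(recon, kspace_measured, mask):
    """Enforce agreement with the measured k-space samples.

    recon: 2D image estimate (e.g. the output of a reconstruction network).
    kspace_measured: 2D k-space, valid only where mask is True.
    mask: boolean 2D array, True at acquired k-space locations.
    """
    k_pred = np.fft.fft2(recon)
    # Keep measured values where available, predicted values elsewhere.
    k_dc = np.where(mask, kspace_measured, k_pred)
    return np.fft.ifft2(k_dc)

# Toy data (hypothetical): an 8x8 "image" with roughly 40% of k-space sampled.
rng = np.random.default_rng(0)
image = rng.standard_normal((8, 8))
mask = rng.random((8, 8)) < 0.4
kspace_measured = np.fft.fft2(image) * mask

# Stand-in for an imperfect first-stage reconstruction.
noisy_recon = image + 0.1 * rng.standard_normal((8, 8))
dc_recon = data_consistency(noisy_recon, kspace_measured, mask)
```

In a two-stage pipeline of the kind the abstract describes, an image like `dc_recon` would then be passed to a separate segmentation network.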

[131] RDDM: Practicing RAW Domain Diffusion Model for Real-world Image Restoration

Yan Chen,Yi Wen,Wei Li,Junchao Liu,Yong Guo,Jie Hu,Xinghao Chen

Main category: eess.IV

TL;DR: The RAW domain diffusion model (RDDM) improves image restoration by directly processing RAW sensor data, bypassing the limitations of sRGB-domain methods and achieving higher fidelity with fewer artifacts.

Details Motivation: The motivation is to improve image restoration by working directly with RAW data, which avoids the lossy nature of sRGB inputs and enhances performance in scenarios where RAW data is accessible. Method: RDDM uses a RAW-domain VAE (RVAE) and a differentiable Post Tone Processing (PTP) module, along with a scalable degradation pipeline and a configurable multi-bayer (CMB) LoRA module. Result: RDDM outperforms state-of-the-art sRGB diffusion methods, producing higher fidelity results with fewer artifacts. Conclusion: RDDM offers a superior solution to image restoration by directly processing RAW data, overcoming the limitations of sRGB-domain methods. Abstract: We present the RAW domain diffusion model (RDDM), an end-to-end diffusion model that restores photo-realistic images directly from the sensor RAW data. While recent sRGB-domain diffusion methods achieve impressive results, they are caught in a dilemma between high fidelity and realistic generation. These models process lossy sRGB inputs and neglect the accessibility of the sensor RAW images in many scenarios, e.g., in image and video capturing in edge devices, resulting in sub-optimal performance. RDDM bypasses this limitation by directly restoring images in the RAW domain, replacing the conventional two-stage image signal processing (ISP) + IR pipeline. However, a simple adaptation of pre-trained diffusion models to the RAW domain faces out-of-distribution (OOD) issues. To this end, we propose: (1) a RAW-domain VAE (RVAE) learning optimal latent representations, (2) a differentiable Post Tone Processing (PTP) module enabling joint RAW and sRGB space optimization. To compensate for the deficiency in the dataset, we develop a scalable degradation pipeline synthesizing RAW LQ-HQ pairs from existing sRGB datasets for large-scale training. Furthermore, we devise a configurable multi-bayer (CMB) LoRA module handling diverse RAW patterns such as RGGB, BGGR, etc.
Extensive experiments demonstrate RDDM's superiority over state-of-the-art sRGB diffusion methods, yielding higher fidelity results with fewer artifacts.

eess.SP [Back]

[132] EMind: A Foundation Model for Multi-task Electromagnetic Signals Understanding

Luqing Luo,Wenjin Gui,Yunfei Liu,Ziyue Zhang,Yunxi Zhang,Fengxiang Wang,Zonghao Guo,Zizhi Ma,Xinzhu Liu,Hanxiang He,Jinhai Li,Xin Qiu,Wupeng Xie,Yangang Sun

Main category: eess.SP

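TL;DR: This paper proposes EMind, a foundation model for multi-task learning on electromagnetic signals that addresses their high heterogeneity, strong background noise, and the lack of large-scale datasets, and performs well across multiple downstream tasks.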

Details Motivation: The high heterogeneity, strong background noise, and complex time-frequency structure of electromagnetic signals make existing models hard to apply directly, and the lack of cross-task generality and large-scale datasets limits progress in multi-task learning for electromagnetic signals. Method: The paper builds a unified, large-scale standardized electromagnetic signal dataset and, exploiting the physical properties of electromagnetic signals, designs a length-adaptive multi-signal packing method and a hardware-aware training strategy to improve learning efficiency and representation quality for heterogeneous multi-source signals. Result: Experiments show that EMind achieves strong performance and broad generalization across multiple downstream tasks, marking a shift from task-specific models to a unified framework for electromagnetic intelligence. Conclusion: EMind is an effective model for electromagnetic signal processing that addresses the challenges of multi-task learning on electromagnetic signals and offers a new direction and foundation for future research on electromagnetic intelligence. Abstract: Deep understanding of electromagnetic signals is fundamental to dynamic spectrum management, intelligent transportation, autonomous driving and unmanned vehicle perception. The field faces challenges because electromagnetic signals differ greatly from text and images, showing high heterogeneity, strong background noise and complex joint time-frequency structure, which prevents existing general models from direct use. Electromagnetic communication and sensing tasks are diverse, current methods lack cross-task generalization and transfer efficiency, and the scarcity of large high-quality datasets blocks the creation of a truly general multitask learning framework. To overcome these issues, we introduce EMind, an electromagnetic signals foundation model that bridges large-scale pretraining and the unique nature of this modality. We build the first unified and largest standardized electromagnetic signal dataset covering multiple signal types and tasks. By exploiting the physical properties of electromagnetic signals, we devise a length-adaptive multi-signal packing method and a hardware-aware training strategy that enable efficient use and representation learning from heterogeneous multi-source signals. Experiments show that EMind achieves strong performance and broad generalization across many downstream tasks, moving decisively from task-specific models to a unified framework for electromagnetic intelligence. The code is available at: https://github.com/GabrielleTse/EMind.
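
The abstract does not say how the length-adaptive multi-signal packing works. Purely to convey the general idea of packing variable-length signals into fixed-size training sequences, here is a generic first-fit-decreasing sketch; the function `pack_signals`, the capacity of 1024 samples, and the example lengths are assumptions and are not taken from EMind or its repository.

```python
def pack_signals(lengths, capacity):
    """Group signal indices so each group's total length fits in `capacity`.

    Generic first-fit-decreasing heuristic (an assumption, not EMind's method).
    Returns a list of groups, each a list of indices into `lengths`.
    """
    # Visit longer signals first; ties keep their original order.
    order = sorted(range(len(lengths)), key=lambda i: lengths[i], reverse=True)
    groups = []  # indices assigned to each packed sequence
    free = []    # remaining capacity of each packed sequence
    for i in order:
        n = lengths[i]
        if n > capacity:
            raise ValueError(f"signal {i} (length {n}) exceeds capacity {capacity}")
        for g in range(len(groups)):
            if free[g] >= n:
                groups[g].append(i)
                free[g] -= n
                break
        else:
            # No existing sequence has room, so start a new one.
            groups.append([i])
            free.append(capacity - n)
    return groups

# Hypothetical signal lengths (in samples) packed into 1024-sample sequences.
lengths = [700, 300, 512, 512, 100]
packs = pack_signals(lengths, capacity=1024)
print(packs)  # [[0, 1], [2, 3], [4]]
```

Packing of this kind reduces padding waste when signals of very different lengths share a batch, which is one plausible reason a foundation model for heterogeneous signals would use it.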