cs.CL

[1] Speech-Based Cognitive Screening: A Systematic Evaluation of LLM Adaptation Strategies

Fatemeh Taherinezhad,Mohamad Javad Momeni Nezhad,Sepehr Karimi,Sina Rashidi,Ali Zolnour,Maryam Dadkhah,Yasaman Haghbin,Hossein AzadMaleki,Maryam Zolnoori

Main category: cs.CL

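TL;DR: This paper examines speech-based screening with language models for detecting Alzheimer's disease and related dementias, comparing the effectiveness of several model adaptation strategies.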

Details Motivation: Many people with Alzheimer's disease and related dementias remain undiagnosed, so scalable detection methods are needed. Method: Using the DementiaBank speech corpus, the study evaluates nine text-only models and three multimodal audio-text models, comparing adaptation strategies that include in-context learning, reasoning-augmented prompting, parameter-efficient fine-tuning, and multimodal integration. Result: Class-centroid demonstration selection performed best for in-context learning, reasoning design improved smaller models, and token-level fine-tuning generally produced the best scores. Adding a classification head substantially improved underperforming models. Fine-tuned audio-text systems performed well but did not surpass the best text-only models. Conclusion: Model adaptation strategy is critical for speech-based dementia detection, and properly adapted open-weight models can match or exceed commercial systems. Abstract: Over half of US adults with Alzheimer disease and related dementias remain undiagnosed, and speech-based screening offers a scalable detection approach. We compared large language model adaptation strategies for dementia detection, evaluating nine text-only models and three multimodal audio-text models on recordings from the DementiaBank speech corpus. Adaptations included in-context learning with different demonstration selection policies, reasoning-augmented prompting, parameter-efficient fine-tuning, and multimodal integration. Results showed that class-centroid demonstrations achieved the highest in-context learning performance, reasoning improved smaller models, and token-level fine-tuning generally produced the best scores. Adding a classification head substantially improved underperforming models. Among multimodal models, fine-tuned audio-text systems performed well but did not surpass the top text-only models. These findings highlight that model adaptation strategies, including demonstration selection, reasoning design, and tuning method, critically influence speech-based dementia detection, and that properly adapted open-weight models can match or exceed commercial systems.
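The class-centroid demonstration policy mentioned above can be illustrated with a minimal sketch. The abstract does not spell out the exact procedure, so this assumes one plausible reading: for each class, pick the labeled examples whose embeddings lie closest to that class's mean embedding. The function names, the toy 2-D embeddings, and the `k_per_class` parameter are all hypothetical.

```python
import math

def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def euclidean(a, b):
    """Euclidean distance between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def select_centroid_demos(examples, k_per_class=1):
    """Pick, for each label, the k examples nearest to that label's centroid.

    examples: list of (text, label, embedding) tuples.
    This is an assumed reading of "class-centroid" selection, not the
    paper's confirmed procedure.
    """
    by_label = {}
    for example in examples:
        by_label.setdefault(example[1], []).append(example)

    demos = []
    for label in sorted(by_label):
        group = by_label[label]
        center = centroid([emb for _, _, emb in group])
        ranked = sorted(group, key=lambda ex: euclidean(ex[2], center))
        demos.extend(ranked[:k_per_class])
    return demos

# Toy 2-D embeddings; real ones would come from a sentence encoder.
pool = [
    ("transcript A", "control", [0.0, 0.0]),
    ("transcript B", "control", [1.0, 0.0]),
    ("transcript C", "control", [5.0, 0.0]),
    ("transcript D", "dementia", [0.0, 10.0]),
    ("transcript E", "dementia", [0.0, 11.0]),
    ("transcript F", "dementia", [0.0, 15.0]),
]
demos = select_centroid_demos(pool, k_per_class=1)
print([text for text, _, _ in demos])
```

The selected transcripts would then be placed in the prompt as labeled demonstrations ahead of the transcript to classify.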

[2] Enhancing Speech Large Language Models through Reinforced Behavior Alignment

Yansong Liu,Jiateng Li,Yuan Liu

Main category: cs.CL

TL;DR: This paper proposes Reinforced Behavior Alignment (RBA), a self-synthesis and reinforcement learning-based framework to improve the instruction-following capabilities of SpeechLMs, achieving superior performance on speech-related tasks without relying on human annotations.

Details Motivation: Speech-based LLMs (SpeechLMs) exhibit a significant performance gap compared to text-based LLMs in instruction-following due to inter-modal discrepancies and the dynamic nature of user speech. This necessitates a more effective alignment framework. Method: The paper introduces the Reinforced Behavior Alignment (RBA) framework, which uses a self-synthesis methodology to generate alignment data via a powerful teacher LLM, and aligns the behavior of SpeechLMs with the teacher using reinforcement learning. Result: Experimental results show that RBA effectively enhances the instruction-following capabilities of SpeechLMs, enabling them to outperform conventional distillation baselines and achieve state-of-the-art performance on spoken question answering and speech-to-text translation tasks. Conclusion: RBA provides a promising framework for enhancing the instruction-following capabilities of SpeechLMs, outperforming conventional distillation baselines and achieving state-of-the-art performance on open benchmarks. Abstract: The recent advancements of Large Language Models (LLMs) have spurred considerable research interest in extending their linguistic capabilities beyond text to other modalities, which has led to the emergence of speech-based LLMs (SpeechLMs) capable of processing user requests in either speech or text form. However, owing to inter-modal discrepancies, these SpeechLMs still exhibit a significant performance gap compared to their text-based LLM counterparts in instruction-following, particularly when confronted with the dynamic and variable nature of user speech. To address this challenge, this paper introduces a framework termed Reinforced Behavior Alignment (RBA), designed to bolster the language generation proficiency of SpeechLMs. Instead of relying on supervised fine-tuning from human annotations, RBA employs a self-synthesis methodology to generate extensive, high-fidelity alignment data with a powerful teacher LLM. The SpeechLM's behavior is then aligned with that of the teacher using a reinforcement learning-based approach. Experimental results demonstrate that this method effectively enhances the instruction-following capabilities of SpeechLMs, which outperform conventional distillation baselines. Crucially, we demonstrate that RBA can be seamlessly extended to tasks including spoken question answering and speech-to-text translation, attaining state-of-the-art performance on open benchmarks with only self-generated data.

[3] Multilevel Analysis of Cryptocurrency News using RAG Approach with Fine-Tuned Mistral Large Language Model

Bohdan M. Pavlyshenko

Main category: cs.CL

TL;DR: The paper presents a multilevel multitask analysis approach for cryptocurrency news using a fine-tuned Mistral 7B large language model with retrieval-augmented generation. It produces comprehensive reports, and the author argues that representing the news as a knowledge graph can essentially eliminate hallucination problems.

Details Motivation: The paper aims to provide multilevel multitask analysis of cryptocurrency news and to essentially eliminate problems with large language model hallucinations by representing cryptocurrency news as a knowledge graph. Method: The paper uses a fine-tuned Mistral 7B large language model with retrieval-augmented generation (RAG) for multilevel multitask analysis of cryptocurrency news. The model is fine-tuned with 4-bit quantization using the PEFT/LoRA approach. Result: The results demonstrate that fine-tuned Mistral 7B LLM models can provide informative qualitative and quantitative analytics for multilevel cryptocurrency news analysis. Conclusion: The paper concludes that the fine-tuned Mistral 7B LLM model can effectively conduct informative qualitative and quantitative analytics of cryptocurrency news, providing important insights. Abstract: In the paper, we consider multilevel multitask analysis of cryptocurrency news using a fine-tuned Mistral 7B large language model with retrieval-augmented generation (RAG). On the first level of analytics, the fine-tuned model generates graph and text summaries with sentiment scores as well as JSON representations of summaries. Higher levels perform hierarchical stacking that consolidates sets of graph-based and text-based summaries as well as summaries of summaries into comprehensive reports. The combination of graph and text summaries provides complementary views of cryptocurrency news. The model is fine-tuned with 4-bit quantization using the PEFT/LoRA approach. The representation of cryptocurrency news as a knowledge graph can essentially eliminate problems with large language model hallucinations. The obtained results demonstrate that fine-tuned Mistral 7B LLM models can conduct informative qualitative and quantitative analytics in multilevel cryptocurrency news analysis, providing important insights.

[4] The ProLiFIC dataset: Leveraging LLMs to Unveil the Italian Lawmaking Process

Matilde Contestabile,Chiara Ferrara,Alberto Giovannetti,Giovanni Parrillo,Andrea Vandin

Main category: cs.CL

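TL;DR: ProLiFIC is a comprehensive event log for process mining in the legal domain, built from data on the Italian lawmaking process and structured using large language models.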

Details Motivation: The effectiveness of process mining (PM) in the legal domain is limited by the accessibility and quality of data, which motivates the development of ProLiFIC. Method: ProLiFIC is created from unstructured data from the Normattiva portal and structured using large language models (LLMs). Result: ProLiFIC is a comprehensive event log of the Italian lawmaking process from 1987 to 2022, and preliminary analyses are presented. Conclusion: ProLiFIC is a comprehensive event log of the procedural lawmaking flow in the Italian chambers, intended to advance process mining in the legal domain. Abstract: Process Mining (PM), initially developed for industrial and business contexts, has recently been applied to social systems, including legal ones. However, PM's efficacy in the legal domain is limited by the accessibility and quality of datasets. We introduce ProLiFIC (Procedural Lawmaking Flow in Italian Chambers), a comprehensive event log of the Italian lawmaking process from 1987 to 2022. Created from unstructured data from the Normattiva portal and structured using large language models (LLMs), ProLiFIC aligns with recent efforts in integrating PM with LLMs. We exemplify preliminary analyses and propose ProLiFIC as a benchmark for legal PM, fostering new developments.

[5] Multimodal Proposal for an AI-Based Tool to Increase Cross-Assessment of Messages

Alejandro Álvarez Castro,Joaquín Ordieres-Meré

Main category: cs.CL

TL;DR: This paper proposes a novel multi-modal framework for analyzing earnings calls by encoding them as hierarchical discourse trees, resulting in embeddings that capture emotional, structural, and thematic aspects effectively.

Details Motivation: Earnings calls are a rich source of financial communication with a layered discourse structure that existing systems fail to capture effectively. The authors aim to improve the analysis of such interactions through a more nuanced, multi-modal approach. Method: A two-stage transformer architecture is used: the first stage encodes multi-modal content and discourse metadata at the node level with contrastive learning, and the second synthesizes a global embedding for the entire conference call. Result: The model generates stable, semantically meaningful embeddings that reflect affective tone, structural logic, and thematic alignment in earnings calls, demonstrating generalization to other domains like tele-medicine, education, and political discourse. Conclusion: The proposed multi-modal framework effectively generates semantically rich and structurally aware embeddings for earnings calls and generalizes to other high-stakes communication domains, offering utility for financial forecasting and discourse evaluation. Abstract: Earnings calls represent a uniquely rich and semi-structured source of financial communication, blending scripted managerial commentary with unscripted analyst dialogue. Although recent advances in financial sentiment analysis have integrated multi-modal signals, such as textual content and vocal tone, most systems rely on flat document-level or sentence-level models, failing to capture the layered discourse structure of these interactions. This paper introduces a novel multi-modal framework designed to generate semantically rich and structurally aware embeddings of earnings calls, by encoding them as hierarchical discourse trees. Each node, comprising either a monologue or a question-answer pair, is enriched with emotional signals derived from text, audio, and video, as well as structured metadata including coherence scores, topic labels, and answer coverage assessments. 
A two-stage transformer architecture is proposed: the first encodes multi-modal content and discourse metadata at the node level using contrastive learning, while the second synthesizes a global embedding for the entire conference. Experimental results reveal that the resulting embeddings form stable, semantically meaningful representations that reflect affective tone, structural logic, and thematic alignment. Beyond financial reporting, the proposed system generalizes to other high-stakes unscripted communicative domains such as tele-medicine, education, and political discourse, offering a robust and explainable approach to multi-modal discourse representation. This approach offers practical utility for downstream tasks such as financial forecasting and discourse evaluation, while also providing a generalizable method applicable to other domains involving high-stakes communication.

[6] Reading Between the Signs: Predicting Future Suicidal Ideation from Adolescent Social Media Texts

Paul Blum,Enrico Liscio,Ruixuan Zhang,Caroline Figueroa,Pradeep K. Murukannaiah

Main category: cs.CL

TL;DR: This study introduces Early-SIB, a machine learning model that predicts suicidal ideation in adolescents from social media posts before they explicitly express it, achieving 73% balanced accuracy.

Details Motivation: Suicide is a leading cause of death among adolescents, and many cases go undetected due to limited contact with mental health services. Social media provides a real-time platform where young people express their thoughts, offering an opportunity for early detection of suicidal ideation before it is explicitly stated. Method: The study introduces Early-SIB, a transformer-based model that sequentially processes forum posts written and engaged with by users to predict whether an adolescent will write a post indicating suicidal ideation and behavior (SIB), without relying on explicit self-disclosure. Result: The Early-SIB model achieved a balanced accuracy of 0.73 in predicting future SIB on a Dutch youth forum, indicating the feasibility of using social media data for early suicide risk detection. Conclusion: The proposed Early-SIB model demonstrates potential as a predictive tool for identifying suicidal ideation and behavior among adolescents through social media forum posts, offering a meaningful addition to traditional suicide prediction methods. Abstract: Suicide is a leading cause of death among adolescents (12-18), yet predicting it remains a significant challenge. Many cases go undetected due to a lack of contact with mental health services. Social media, however, offers a unique opportunity, as young people often share their thoughts and struggles online in real time. In this work, we propose a novel task and method to approach it: predicting suicidal ideation and behavior (SIB) from forum posts before an adolescent explicitly expresses suicidal ideation on an online forum. This predictive framing, where no self-disclosure is used as input at any stage, remains largely unexplored in the suicide prediction literature. To this end, we introduce Early-SIB, a transformer-based model that sequentially processes the posts a user writes and engages with to predict whether they will write a SIB post. 
Our model achieves a balanced accuracy of 0.73 for predicting future SIB on a Dutch youth forum, demonstrating that such tools can offer a meaningful addition to traditional methods.
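Since the headline result is a balanced accuracy rather than plain accuracy, a short worked example may help. The sketch below uses the standard binary definition (the mean of sensitivity and specificity); the toy labels are invented for illustration and have nothing to do with the paper's data.

```python
def balanced_accuracy(y_true, y_pred):
    """Mean of sensitivity and specificity for binary labels (1 = positive).

    Assumes both classes occur at least once in y_true.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    sensitivity = tp / (tp + fn)  # recall on the positive class
    specificity = tn / (tn + fp)  # recall on the negative class
    return (sensitivity + specificity) / 2

# Invented toy data: 2 positive users, 8 negative users.
y_true = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
y_pred = [1, 0, 0, 0, 0, 0, 0, 0, 0, 1]

plain_accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(plain_accuracy)                     # 0.8
print(balanced_accuracy(y_true, y_pred))  # 0.6875
```

On imbalanced data like this, plain accuracy rewards predicting the majority class, while balanced accuracy weights both classes equally, which is why it is the more informative metric for a rare outcome.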

[7] Real-Time Detection of Hallucinated Entities in Long-Form Generation

Oscar Obeso,Andy Arditi,Javier Ferrando,Joshua Freeman,Cameron Holmes,Neel Nanda

Main category: cs.CL

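TL;DR: This paper presents a cheap, scalable method for real-time identification of hallucinated content in long-form LLM generations. It uses entity-level hallucination detection to enable streaming detection, and develops a web-search-based annotation method for training efficient hallucination classifiers.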

Details Motivation: Large language models are increasingly used in high-stakes applications where hallucinations can cause serious harm, yet existing hallucination detection methods are impractical for real-world use. Method: The authors target entity-level hallucinations, use web search to annotate model outputs and produce a dataset with grounded labels, and train simple, efficient hallucination classifiers such as linear probes. Result: The method outperforms existing baselines across multiple model families, including more expensive methods such as semantic entropy on long-form responses, and shows generalization to mathematical reasoning tasks. Conclusion: The work suggests a promising new approach for scalable, real-world hallucination detection, and the datasets are publicly released to facilitate follow-up research and reuse. Abstract: Large language models are now routinely used in high-stakes applications where hallucinations can cause serious harm, such as medical consultations or legal advice. Existing hallucination detection methods, however, are impractical for real-world use, as they are either limited to short factual queries or require costly external verification. We present a cheap, scalable method for real-time identification of hallucinated tokens in long-form generations, and scale it effectively to 70B parameter models. Our approach targets entity-level hallucinations (e.g., fabricated names, dates, citations) rather than claim-level ones, thereby naturally mapping to token-level labels and enabling streaming detection. We develop an annotation methodology that leverages web search to annotate model responses with grounded labels indicating which tokens correspond to fabricated entities. This dataset enables us to train effective hallucination classifiers with simple and efficient methods such as linear probes. Evaluating across four model families, our classifiers consistently outperform baselines on long-form responses, including more expensive methods such as semantic entropy (e.g., AUC 0.90 vs 0.71 for Llama-3.3-70B), and are also an improvement in short-form question-answering settings. Moreover, despite being trained only with entity-level labels, our probes effectively detect incorrect answers in mathematical reasoning tasks, indicating generalization beyond entities.
While our annotation methodology is expensive, we find that annotated responses from one model can be used to train effective classifiers on other models; accordingly, we publicly release our datasets to facilitate reuse. Overall, our work suggests a promising new approach for scalable, real-world hallucination detection.
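A linear probe of the kind the abstract mentions is, in its simplest form, a logistic regression over per-token activations. The following is a minimal sketch under that assumption, trained with plain gradient descent on invented 2-D "activations"; the real probes would read hidden states from a specific model layer, and the function names, learning rate, and toy data here are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_linear_probe(activations, labels, lr=0.5, epochs=200):
    """Fit logistic regression weights with full-batch gradient descent."""
    dim = len(activations[0])
    n = len(activations)
    w = [0.0] * dim
    b = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * dim
        grad_b = 0.0
        for x, y in zip(activations, labels):
            p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
            err = p - y  # gradient of the log loss w.r.t. the logit
            for i in range(dim):
                grad_w[i] += err * x[i]
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b

def probe_score(w, b, activation):
    """Probability that the token belongs to a fabricated entity."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, activation)) + b)

# Invented toy activations: label 1 = token inside a fabricated entity.
train_x = [[2.0, 0.1], [1.5, -0.2], [1.8, 0.3],
           [-2.0, 0.2], [-1.6, -0.1], [-1.9, 0.0]]
train_y = [1, 1, 1, 0, 0, 0]
w, b = train_linear_probe(train_x, train_y)

# Streaming use: score each new token's activation as it is generated.
for token, act in [("Paris", [-1.7, 0.0]), ("Dr. Xyzzy", [1.7, 0.0])]:
    flagged = probe_score(w, b, act) > 0.5
    print(token, "FLAGGED" if flagged else "ok")
```

Because scoring a token is a single dot product, a probe like this adds negligible cost per generated token, which is what makes streaming detection plausible.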

[8] Topic Identification in LLM Input-Output Pairs through the Lens of Information Bottleneck

Igor Halperin

Main category: cs.CL

TL;DR: This paper introduces UDIB, an improved method for detecting confabulations in Large Language Models by optimizing information-theoretic analysis through a novel clustering algorithm.

Details Motivation: To bridge the gap between spatial proximity optimization and downstream information-theoretic analysis for detecting intrinsic faithfulness hallucinations in LLMs. Method: Transforming the DIB method into a practical algorithm for high-dimensional data by substituting its intractable KL divergence term with a computationally efficient upper bound. Result: UDIB can be interpreted as an entropy-regularized and robustified version of K-means that inherently favors a parsimonious number of informative clusters. Conclusion: UDIB provides a superior foundation for the SDM framework and offers a novel, more sensitive tool for detecting confabulations. Abstract: Large Language Models (LLMs) are prone to critical failure modes, including intrinsic faithfulness hallucinations (also known as confabulations), where a response deviates semantically from the provided context. Frameworks designed to detect this, such as Semantic Divergence Metrics (SDM), rely on identifying latent topics shared between prompts and responses, typically by applying geometric clustering to their sentence embeddings. This creates a disconnect, as the topics are optimized for spatial proximity, not for the downstream information-theoretic analysis. In this paper, we bridge this gap by developing a principled topic identification method grounded in the Deterministic Information Bottleneck (DIB) for geometric clustering. Our key contribution is to transform the DIB method into a practical algorithm for high-dimensional data by substituting its intractable KL divergence term with a computationally efficient upper bound. The resulting method, which we dub UDIB, can be interpreted as an entropy-regularized and robustified version of K-means that inherently favors a parsimonious number of informative clusters.
By applying UDIB to the joint clustering of LLM prompt and response embeddings, we generate a shared topic representation that is not merely spatially coherent but is fundamentally structured to be maximally informative about the prompt-response relationship. This provides a superior foundation for the SDM framework and offers a novel, more sensitive tool for detecting confabulations.

[9] QuesGenie: Intelligent Multimodal Question Generation

Ahmed Mubarak,Amna Ahmed,Amira Nasser,Aya Mohamed,Fares El-Sadek,Mohammed Ahmed,Ahmed Salah,Youssef Sobhy

Main category: cs.CL

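TL;DR: The project develops an intelligent, scalable multimodal question generation system to address the shortage of practice materials matched to educational resources.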

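Details Motivation: In today's information-rich era, learners have access to abundant educational resources but lack practice materials matched to those resources, which presents a significant challenge. Method: The project develops a multimodal question generation system comprising multimodal input handling, question generation, reinforcement learning from human feedback (RLHF), and an end-to-end interactive interface. Result: The resulting multimodal question generation system can automatically generate diverse question types from various content formats. Conclusion: The project lays the foundation for automated, scalable, and intelligent educational question generation while balancing resource efficiency, robust functionality, and user experience. Abstract: In today's information-rich era, learners have access to abundant educational resources, but the lack of practice materials tailored to these resources presents a significant challenge. This project addresses that gap by developing a multi-modal question generation system that can automatically generate diverse question types from various content formats. The system features four major components: multi-modal input handling, question generation, reinforcement learning from human feedback (RLHF), and an end-to-end interactive interface. This project lays the foundation for automated, scalable, and intelligent question generation, carefully balancing resource efficiency, robust functionality and a smooth user experience.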

[10] AR$^2$: Adversarial Reinforcement Learning for Abstract Reasoning in Large Language Models

Cheng-Kai Yeh,Hsing-Wang Lee,Chung-Hung Kuo,Hen-Hsen Huang

Main category: cs.CL

TL;DR: AR$^2$ is a novel adversarial reinforcement learning framework designed to improve abstraction skills in large language models by training them to solve complex narrative problems through extracting core computational kernels.

Details Motivation: Despite advances in training LLMs for code generation using reinforcement learning, most approaches focus on superficial pattern recognition and overlook explicit training for abstraction, which is a foundational skill in computer science. Method: The study introduces AR$^2$, a framework that uses a teacher model to transform kernel problems into narrative-rich descriptions and trains a student coding model to extract underlying computational kernels using adversarial reinforcement learning. Result: Experimental results show that AR$^2$ significantly improves the accuracy of student models on previously unseen and challenging programming tasks. Conclusion: The study concludes that AR$^2$ effectively enhances the abstraction abilities of LLMs, highlighting abstraction as a crucial skill for improving model generalization. Abstract: Abstraction, the ability to recognize and distill essential computational patterns from complex problem statements, is a foundational skill in computer science, critical both for human problem-solvers and coding-oriented large language models (LLMs). Despite recent advances in training LLMs for code generation using reinforcement learning (RL), most existing approaches focus primarily on superficial pattern recognition, overlooking explicit training for abstraction. In this study, we propose AR$^2$ (Adversarial Reinforcement Learning for Abstract Reasoning), a novel framework explicitly designed to enhance the abstraction abilities of LLMs. AR$^2$ employs a teacher model to transform kernel problems into narrative-rich, challenging descriptions without changing their fundamental logic. Simultaneously, a student coding model is trained to solve these complex narrative problems by extracting their underlying computational kernels.
Experimental results demonstrate that AR$^2$ substantially improves the student model's accuracy on previously unseen, challenging programming tasks, underscoring abstraction as a key skill for enhancing LLM generalization.

[11] Improving Factuality in LLMs via Inference-Time Knowledge Graph Construction

Shanglin Wu,Lihui Liu,Jinho D. Choi,Kai Shu

Main category: cs.CL

TL;DR: A new framework for dynamically constructing and expanding knowledge graphs during inference improves the factual accuracy and interpretability of large language models.

Details Motivation: LLMs struggle with factual consistency, and current RAG methods have limitations in supporting compositional reasoning and identifying factual inconsistencies. Method: The method involves extracting a seed KG from the question, expanding it using the LLM's latent knowledge, and refining it through external retrieval. Result: The approach showed consistent improvements in factual accuracy, answer precision, and interpretability over baseline prompting and static KG-augmented methods on three diverse factual QA benchmarks. Conclusion: The proposed method of inference-time KG construction enhances LLM factuality in a structured, interpretable, and scalable manner. Abstract: Large Language Models (LLMs) often struggle with producing factually consistent answers due to limitations in their parametric memory. Retrieval-Augmented Generation (RAG) methods address this issue by incorporating external knowledge from trusted sources at inference time. However, such methods typically treat knowledge as unstructured text, which limits their ability to support compositional reasoning and identify factual inconsistencies. To overcome these limitations, we propose a novel framework that dynamically constructs and expands knowledge graphs (KGs) during inference, integrating both internal knowledge extracted from LLMs and external information retrieved from external sources. Our method begins by extracting a seed KG from the question via prompting, followed by iterative expansion using the LLM's latent knowledge. The graph is then selectively refined through external retrieval, enhancing factual coverage and correcting inaccuracies. We evaluate our approach on three diverse factual QA benchmarks, demonstrating consistent improvements in factual accuracy, answer precision, and interpretability over baseline prompting and static KG-augmented methods. 
Our findings suggest that inference-time KG construction is a promising direction for enhancing LLM factuality in a structured, interpretable, and scalable manner.

[12] ResearchPulse: Building Method-Experiment Chains through Multi-Document Scientific Inference

Qi Chen,Jingxuan Wei,Zhuoya Yao,Haiguang Wang,Gaowei Wu,Bihui Yu,Siyuan Li,Cheng Tan

Main category: cs.CL

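TL;DR: This paper proposes the ResearchPulse framework and an accompanying benchmark dataset for scientific inference across multiple papers, aimed at reconstructing research development chains.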

Details Motivation: Understanding how scientific ideas evolve requires more than summarizing individual papers; it demands structured, cross-document reasoning over related research. Method: The paper proposes ResearchPulse, an agent-based framework comprising a Plan Agent for task decomposition, an Mmap-Agent that constructs motivation-method mind maps, and an Lchart-Agent that synthesizes experimental line charts. Result: Despite using 7B-scale agents, ResearchPulse consistently outperforms strong baselines such as GPT-4o; the paper also introduces the ResearchPulse-Bench benchmark dataset. Conclusion: The ResearchPulse framework and its benchmark dataset provide effective support for multi-document scientific inference, with superior performance in semantic alignment, structural consistency, and visual fidelity. Abstract: Understanding how scientific ideas evolve requires more than summarizing individual papers; it demands structured, cross-document reasoning over thematically related research. In this work, we formalize multi-document scientific inference, a new task that extracts and aligns motivation, methodology, and experimental results across related papers to reconstruct research development chains. This task introduces key challenges, including temporally aligning loosely structured methods and standardizing heterogeneous experimental tables. We present ResearchPulse, an agent-based framework that integrates instruction planning, scientific content extraction, and structured visualization. It consists of three coordinated agents: a Plan Agent for task decomposition, an Mmap-Agent that constructs motivation-method mind maps, and an Lchart-Agent that synthesizes experimental line charts. To support this task, we introduce ResearchPulse-Bench, a citation-aware benchmark of annotated paper clusters. Experiments show that our system, despite using 7B-scale agents, consistently outperforms strong baselines like GPT-4o in semantic alignment, structural consistency, and visual fidelity. The dataset is available at https://huggingface.co/datasets/ResearchPulse/ResearchPulse-Bench.

[13] NoteBar: An AI-Assisted Note-Taking System for Personal Knowledge Management

Josh Wisoff,Yao Tang,Zhengyu Fang,Jordan Guzman,YuTang Wang,Alex Yu

Main category: cs.CL

TL;DR: NoteBar is an AI-assisted note-taking tool that improves efficiency by leveraging persona information and efficient language models, supported by a diverse and semantically rich dataset.

Details Motivation: Note-taking is a critical practice for capturing, organizing, and reflecting on information in both academic and professional settings. Existing AI-assisted note-taking solutions often struggle with efficiency. Method: NoteBar leverages persona information and efficient language models to automatically organize notes into multiple categories. Result: NoteBar can be deployed in a practical and cost-effective manner, enabling interactive use without reliance on heavy infrastructure. Conclusion: NoteBar and its accompanying dataset provide a scalable and extensible foundation for advancing AI-assisted personal knowledge management. Abstract: Note-taking is a critical practice for capturing, organizing, and reflecting on information in both academic and professional settings. The recent success of large language models has accelerated the development of AI-assisted tools, yet existing solutions often struggle with efficiency. We present NoteBar, an AI-assisted note-taking tool that leverages persona information and efficient language models to automatically organize notes into multiple categories and better support user workflows. To support research and evaluation in this space, we further introduce a novel persona-conditioned dataset of 3,173 notes and 8,494 annotated concepts across 16 MBTI personas, offering both diversity and semantic richness for downstream tasks. Finally, we demonstrate that NoteBar can be deployed in a practical and cost-effective manner, enabling interactive use without reliance on heavy infrastructure. Together, NoteBar and its accompanying dataset provide a scalable and extensible foundation for advancing AI-assisted personal knowledge management.

[14] E-ARMOR: Edge case Assessment and Review of Multilingual Optical Character Recognition

Aryan Gupta,Anupam Purwar

Main category: cs.CL

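TL;DR: Sprinklr-Edge-OCR is an efficient OCR system optimized for edge devices that, on multilingual real-world images, is more efficient and cheaper than large vision-language models.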

Details Motivation: With the rise of large vision-language models (LVLMs), their ability to go beyond fixed OCR pipelines has drawn wide interest, but OCR on multilingual, noisy, and diverse real-world images remains a significant challenge. Method: The paper introduces Sprinklr-Edge-OCR, an OCR system optimized for resource-constrained environments, and conducts a large-scale comparative evaluation of five state-of-the-art LVLMs and two traditional OCR systems on a multilingual image dataset. Result: Qwen achieved the highest precision (0.54), while Sprinklr-Edge-OCR achieved the best F1 score (0.46) and outperformed the other systems in efficiency, processing images 35 times faster (0.17 seconds per image on average) at less than 0.01 times the cost of the LVLMs (0.006 USD per 1,000 images). Conclusion: Traditional OCR systems remain better suited than LVLMs for edge deployment because of their low compute requirements, low latency, and low cost. Abstract: Optical Character Recognition (OCR) in multilingual, noisy, and diverse real-world images remains a significant challenge. With the rise of Large Vision-Language Models (LVLMs), there is growing interest in their ability to generalize and reason beyond fixed OCR pipelines. In this work, we introduce Sprinklr-Edge-OCR, a novel OCR system specifically optimized for edge deployment in resource-constrained environments. We present a large-scale comparative evaluation of five state-of-the-art LVLMs (InternVL, Qwen, GOT OCR, LLaMA, MiniCPM) and two traditional OCR systems (Sprinklr-Edge-OCR, SuryaOCR) on a proprietary, doubly hand-annotated dataset of multilingual (54 languages) images. Our benchmark covers a broad range of metrics including accuracy, semantic consistency, language coverage, computational efficiency (latency, memory, GPU usage), and deployment cost. To better reflect real-world applicability, we also conducted edge case deployment analysis, evaluating model performance on CPU-only environments. Among the results, Qwen achieved the highest precision (0.54), while Sprinklr-Edge-OCR delivered the best overall F1 score (0.46) and outperformed others in efficiency, processing images 35 times faster (0.17 seconds per image on average) and at less than 0.01 times the cost (0.006 USD per 1,000 images) compared to the LVLMs. Our findings demonstrate that the best-suited OCR systems for edge deployment are the traditional ones, even in the era of LLMs, due to their low compute requirements, low latency, and very high affordability.

[15] Breaking the Mirror: Activation-Based Mitigation of Self-Preference in LLM Evaluators

Dani Roytburg,Matthew Bozoukov,Matthew Nguyen,Jou Barzdukas,Simon Fu,Narmeen Oozeer

Main category: cs.CL

TL;DR: This paper explores using steering vectors to reduce self-preference bias in large language models during evaluation, showing significant improvements over existing methods but also highlighting limitations in handling legitimate self-preference and unbiased agreement.

Details Motivation: Large language models (LLMs) are increasingly used as automated evaluators but suffer from self-preference bias, which undermines fairness and reliability in evaluation pipelines. This study aims to explore whether lightweight steering vectors can reduce this bias without retraining the models. Method: The study uses a curated dataset to differentiate between justified and unjustified self-preference bias. Steering vectors are constructed using two methods: Contrastive Activation Addition (CAA) and an optimization-based approach, aiming to mitigate self-preference bias at inference time without retraining. Result: Steering vectors were able to reduce unjustified self-preference bias by up to 97%, significantly outperforming prompting and direct preference optimization baselines. However, they showed instability in dealing with legitimate self-preference and unbiased agreement, suggesting that self-preference may involve multiple or nonlinear directions. Conclusion: Steering vectors can effectively reduce unjustified self-preference bias in large language models (LLMs) during evaluation tasks, but they are less stable in handling legitimate self-preference and unbiased agreement, indicating the need for more robust interventions. Abstract: Large language models (LLMs) increasingly serve as automated evaluators, yet they suffer from "self-preference bias": a tendency to favor their own outputs over those of other models. This bias undermines fairness and reliability in evaluation pipelines, particularly for tasks like preference tuning and model routing. We investigate whether lightweight steering vectors can mitigate this problem at inference time without retraining. 
We introduce a curated dataset that separates self-preference into justified and unjustified examples, and we construct steering vectors using two methods: Contrastive Activation Addition (CAA) and an optimization-based approach. Our results show that steering vectors can reduce unjustified self-preference bias by up to 97%, substantially outperforming prompting and direct preference optimization baselines. Yet steering vectors are unstable on legitimate self-preference and unbiased agreement, implying self-preference spans multiple or nonlinear directions. This underscores both their promise and limits as safeguards for LLM-as-judges and motivates more robust interventions.
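
The core idea behind Contrastive Activation Addition can be sketched in a few lines: a steering vector is the difference between the mean activations of two contrasting sets of examples, and it is added to a hidden state at inference time. The following is a minimal sketch, not the authors' implementation; the toy activations, the function names, and the scaling factor `alpha` are all assumptions made for illustration.

```python
from typing import List

Vector = List[float]


def mean_vector(vectors: List[Vector]) -> Vector:
    """Element-wise mean of a non-empty list of equal-length vectors."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]


def caa_steering_vector(positive: List[Vector], negative: List[Vector]) -> Vector:
    """Mean activation of the desired behaviour minus mean of the undesired one."""
    pos_mean = mean_vector(positive)
    neg_mean = mean_vector(negative)
    return [p - q for p, q in zip(pos_mean, neg_mean)]


def apply_steering(hidden: Vector, steering: Vector, alpha: float) -> Vector:
    """Shift a hidden state along the steering direction (alpha is a hypothetical strength)."""
    return [h + alpha * s for h, s in zip(hidden, steering)]


# Toy 2-dimensional activations (made-up numbers, not real model states).
unbiased_acts = [[1.0, 0.0], [3.0, 2.0]]
biased_acts = [[0.0, 2.0], [2.0, 4.0]]

steer = caa_steering_vector(unbiased_acts, biased_acts)
steered = apply_steering([0.5, 0.5], steer, alpha=2.0)
print(steer, steered)
```

In a real model the activations would be taken from a chosen layer and the shifted state would be fed back into the forward pass; the sketch only shows the vector arithmetic.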

[16] Semantic Analysis of SNOMED CT Concept Co-occurrences in Clinical Documentation using MIMIC-IV

Ali Noori,Somya Mohanty,Prashanti Manda

Main category: cs.CL

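TL;DR: Using the MIMIC-IV database, this study examines the relationship between SNOMED CT concept co-occurrence patterns and semantic similarity, finding that semantic embeddings can improve the completeness of clinical documentation and reveal latent clinical relationships.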

Details Motivation: Clinical notes contain rich clinical narratives, but their unstructured format poses challenges for large-scale analysis. Standardized terminologies such as SNOMED CT improve interoperability, yet how concepts relate through co-occurrence and semantic similarity remains poorly understood. Method: Using the MIMIC-IV database, the study examines the relationship between SNOMED CT concept co-occurrence patterns and semantic similarity via Normalized Pointwise Mutual Information (NPMI) and pretrained embeddings (e.g., ClinicalBERT, BioBERT). Result: The analysis shows that although co-occurrence and semantic similarity are only weakly correlated, embeddings capture clinically meaningful associations that are not always reflected in documentation frequency. Embedding-based suggestions frequently matched concepts documented later, supporting their utility for augmenting clinical annotations. Conclusion: The findings indicate that co-occurrence statistics and semantic embeddings have complementary value: they can improve documentation completeness, uncover latent clinical relationships, and inform decision support and phenotyping applications. Abstract: Clinical notes contain rich clinical narratives but their unstructured format poses challenges for large-scale analysis. Standardized terminologies such as SNOMED CT improve interoperability, yet understanding how concepts relate through co-occurrence and semantic similarity remains underexplored. In this study, we leverage the MIMIC-IV database to investigate the relationship between SNOMED CT concept co-occurrence patterns and embedding-based semantic similarity. Using Normalized Pointwise Mutual Information (NPMI) and pretrained embeddings (e.g., ClinicalBERT, BioBERT), we examine whether frequently co-occurring concepts are also semantically close, whether embeddings can suggest missing concepts, and how these relationships evolve temporally and across specialties. Our analyses reveal that while co-occurrence and semantic similarity are weakly correlated, embeddings capture clinically meaningful associations not always reflected in documentation frequency. Embedding-based suggestions frequently matched concepts later documented, supporting their utility for augmenting clinical annotations. Clustering of concept embeddings yielded coherent clinical themes (symptoms, labs, diagnoses, cardiovascular conditions) that map to patient phenotypes and care patterns. Finally, co-occurrence patterns linked to outcomes such as mortality and readmission demonstrate the practical utility of this approach. 
Collectively, our findings highlight the complementary value of co-occurrence statistics and semantic embeddings in improving documentation completeness, uncovering latent clinical relationships, and informing decision support and phenotyping applications.
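
For readers unfamiliar with NPMI, it rescales pointwise mutual information to the range [-1, 1]: NPMI(x, y) = log(p(x, y) / (p(x) p(y))) / (-log p(x, y)). The sketch below computes it from document-level co-occurrence counts. It is an illustration only: the toy notes and concept labels are invented, and counting each concept once per note is an assumption rather than the study's stated procedure.

```python
import math
from collections import Counter
from itertools import combinations


def npmi_scores(documents):
    """Return {(concept_a, concept_b): NPMI} for concept pairs that co-occur at least once."""
    n_docs = len(documents)
    single = Counter()
    pair = Counter()
    for concepts in documents:
        unique = sorted(set(concepts))  # count each concept once per document
        single.update(unique)
        pair.update(combinations(unique, 2))

    scores = {}
    for (a, b), count in pair.items():
        p_ab = count / n_docs
        if p_ab == 1.0:
            continue  # -log(1) = 0, so NPMI is undefined; skipped in this sketch
        pmi = math.log(p_ab / ((single[a] / n_docs) * (single[b] / n_docs)))
        scores[(a, b)] = pmi / -math.log(p_ab)
    return scores


# Toy "notes", each reduced to a list of hypothetical concept labels.
notes = [
    ["chest pain", "dyspnea"],
    ["chest pain", "dyspnea"],
    ["fever"],
    ["fever", "cough"],
]
scores = npmi_scores(notes)
for concept_pair, value in sorted(scores.items()):
    print(concept_pair, round(value, 3))
```

A value near 1 means two concepts almost always appear together, while a value near 0 means they co-occur about as often as chance would predict.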

[17] MLSD: A Novel Few-Shot Learning Approach to Enhance Cross-Target and Cross-Domain Stance Detection

Parush Gera,Tempestt Neal

Main category: cs.CL

TL;DR: The paper introduces MLSD, a new method for cross-target and cross-domain stance detection using metric learning with triplet loss to improve performance.

Details Motivation: The paper aims to provide a novel approach for stance detection across domains and targets. Method: MLSD utilizes metric learning with triplet loss to capture semantic similarities and differences between stance targets, enhancing domain adaptation. Result: The method showed statistically significant improvement in stance detection performance across six widely used stance detection models. Conclusion: MLSD allows a cross-target or cross-domain stance detection model to acquire useful examples from new target domains. Abstract: We present a novel approach for stance detection across domains and targets, Metric Learning-Based Few-Shot Learning for Cross-Target and Cross-Domain Stance Detection (MLSD). MLSD utilizes metric learning with triplet loss to capture semantic similarities and differences between stance targets, enhancing domain adaptation. By constructing a discriminative embedding space, MLSD allows a cross-target or cross-domain stance detection model to acquire useful examples from new target domains. We evaluate MLSD in multiple cross-target and cross-domain scenarios across two datasets, showing statistically significant improvement in stance detection performance across six widely used stance detection models.
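
The triplet loss mentioned above has a standard form: for an anchor, a positive (similar) example, and a negative (dissimilar) example, the loss is max(0, d(anchor, positive) - d(anchor, negative) + margin). The following is a minimal sketch of that generic formula. The Euclidean distance, the margin value, and the toy embeddings are assumptions; the entry does not say which distance or margin MLSD uses.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def triplet_loss(anchor, positive, negative, margin=1.0):
    """Zero once the negative is at least `margin` farther from the anchor than the positive."""
    return max(0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + margin)


anchor = [0.0, 0.0]
positive = [3.0, 4.0]       # distance 5 from the anchor
far_negative = [6.0, 8.0]   # distance 10: already well separated
near_negative = [0.0, 1.0]  # distance 1: closer than the positive

easy = triplet_loss(anchor, positive, far_negative)
hard = triplet_loss(anchor, positive, near_negative)
print(easy, hard)
```

Only the second triplet contributes a gradient in training, which is why metric-learning setups usually care about how triplets are selected.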

[18] SiLVERScore: Semantically-Aware Embeddings for Sign Language Generation Evaluation

Saki Imai,Mert İnan,Anthony Sicilia,Malihe Alikhani

Main category: cs.CL

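TL;DR: The paper proposes SiLVERScore, a new semantically-aware metric for evaluating sign language generation that assesses generation quality more accurately and shows superior performance on multiple datasets.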

Details Motivation: Existing evaluation of sign language generation relies on back-translation, which fails to capture the multimodal nature of sign language and makes it hard to determine the source of evaluation errors. A new evaluation metric is therefore needed. Method: SiLVERScore evaluates sign language generation in a joint embedding space and is validated experimentally on the PHOENIX-14T and CSL-Daily datasets. Result: SiLVERScore achieves near-perfect discrimination between correct and random pairs (ROC AUC = 0.99, overlap < 7%), substantially outperforming traditional metrics. Conclusion: SiLVERScore is a new semantically-aware, embedding-based metric for evaluating sign language generation that addresses some limitations of existing metrics and shows strong performance across datasets. Abstract: Evaluating sign language generation is often done through back-translation, where generated signs are first recognized back to text and then compared to a reference using text-based metrics. However, this two-step evaluation pipeline introduces ambiguity: it not only fails to capture the multimodal nature of sign language (such as facial expressions, spatial grammar, and prosody) but also makes it hard to pinpoint whether evaluation errors come from the sign generation model or the translation system used to assess it. In this work, we propose SiLVERScore, a novel semantically-aware embedding-based evaluation metric that assesses sign language generation in a joint embedding space. Our contributions include: (1) identifying limitations of existing metrics, (2) introducing SiLVERScore for semantically-aware evaluation, (3) demonstrating its robustness to semantic and prosodic variations, and (4) exploring generalization challenges across datasets. On PHOENIX-14T and CSL-Daily datasets, SiLVERScore achieves near-perfect discrimination between correct and random pairs (ROC AUC = 0.99, overlap < 7%), substantially outperforming traditional metrics.
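
The ROC AUC figure can be read as the probability that a randomly chosen correct pair scores higher than a randomly chosen random pair. The sketch below computes ROC AUC directly from that pairwise definition. It is a generic illustration with made-up similarity scores, not SiLVERScore itself.

```python
def roc_auc(positive_scores, negative_scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count as half."""
    wins = 0.0
    for pos in positive_scores:
        for neg in negative_scores:
            if pos > neg:
                wins += 1.0
            elif pos == neg:
                wins += 0.5
    return wins / (len(positive_scores) * len(negative_scores))


# Hypothetical similarity scores for matched and mismatched (sign, text) pairs.
correct_pairs = [0.9, 0.8, 0.4]
random_pairs = [0.5, 0.3, 0.1]

auc = roc_auc(correct_pairs, random_pairs)
print(round(auc, 3))
```

An AUC of 0.5 corresponds to chance-level separation and 1.0 to perfect separation; the quadratic loop is fine for small samples, though a rank-based formulation scales better.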

[19] Measuring How (Not Just Whether) VLMs Build Common Ground

Saki Imai,Mert İnan,Anthony Sicilia,Malihe Alikhani

Main category: cs.CL

TL;DR: This paper introduces a four-metric suite to evaluate the performance of vision language models (VLMs) in interactive grounding contexts and finds that current VLMs diverge from human patterns, highlighting limitations in existing evaluation methods.

Details Motivation: Current benchmarks evaluate VLMs in single-turn or question answering settings, while grounding is an interactive process that requires ongoing communication to develop shared understanding. Method: Deploying a four-metric suite (grounding efficiency, content alignment, lexical adaptation, and human-likeness) on 150 self-play sessions of interactive referential games between three proprietary VLMs and comparing them with human dyads. Result: All three VLMs diverged from human patterns on at least three metrics, with GPT4o-mini being the closest overall. Task success scores did not indicate successful grounding, and high image-utterance alignment did not necessarily predict task success. Conclusion: The introduced four-metric suite reveals that current VLMs diverge from human patterns in interactive grounding contexts, and task success scores do not necessarily indicate successful grounding. Abstract: Large vision language models (VLMs) increasingly claim reasoning skills, yet current benchmarks evaluate them in single-turn or question answering settings. However, grounding is an interactive process in which people gradually develop shared understanding through ongoing communication. We introduce a four-metric suite (grounding efficiency, content alignment, lexical adaptation, and human-likeness) to systematically evaluate VLM performance in interactive grounding contexts. We deploy the suite on 150 self-play sessions of interactive referential games between three proprietary VLMs and compare them with human dyads. All three models diverge from human patterns on at least three metrics, while GPT4o-mini is the closest overall. We find that (i) task success scores do not indicate successful grounding and (ii) high image-utterance alignment does not necessarily predict task success. Our metric suite and findings offer a framework for future research on VLM grounding.

[20] Align-then-Slide: A complete evaluation framework for Ultra-Long Document-Level Machine Translation

Jiaxin Guo,Daimeng Wei,Yuanchang Luo,Xiaoyu Chen,Zhanglin Wu,Huan Yang,Hengchao Shang,Zongyao Li,Zhiqiang Rao,Jinlong Yang,Hao Yang

Main category: cs.CL

TL;DR: This paper introduces Align-then-Slide, a new evaluation framework for document-level machine translation that aligns sentences and evaluates at multiple granularities, showing strong correlation with human judgments and improved training outcomes.

Details Motivation: The outputs of large language models (LLMs) in document-level machine translation challenge existing evaluation methods that assume sentence-by-sentence alignment, necessitating a new evaluation framework. Method: The Align-then-Slide framework consists of two stages: Align, where sentence-level source-target correspondences are inferred and the target is rebuilt to match the source sentence number; and n-Chunk Sliding Evaluate, where averaged metric scores are calculated at multiple granularities (1-, 2-, 3-, and 4-chunk). Result: Experiments on the WMT benchmark showed a Pearson correlation of 0.929 between the method and expert MQM rankings. The method also aligned closely with human judgments on a new real-world test set. Preference data from the framework enabled effective training with CPO and use as a reward model for GRPO, both yielding better translations than a vanilla SFT baseline. Conclusion: The proposed Align-then-Slide framework is validated as an accurate, robust, and actionable evaluation tool for document-level machine translation (doc-MT) systems. Abstract: Large language models (LLMs) have ushered in a new era for document-level machine translation (doc-mt), yet their whole-document outputs challenge existing evaluation methods that assume sentence-by-sentence alignment. We introduce Align-then-Slide, a complete evaluation framework for ultra-long doc-mt. In the Align stage, we automatically infer sentence-level source-target correspondences and rebuild the target to match the source sentence number, resolving omissions and many-to-one/one-to-many mappings. In the n-Chunk Sliding Evaluate stage, we calculate averaged metric scores under 1-, 2-, 3- and 4-chunk for multi-granularity assessment. Experiments on the WMT benchmark show a Pearson correlation of 0.929 between our method and expert MQM rankings. On a newly curated real-world test set, our method again aligns closely with human judgments. 
Furthermore, preference data produced by Align-then-Slide enables effective CPO training and its direct use as a reward model for GRPO, both yielding translations preferred over a vanilla SFT baseline. The results validate our framework as an accurate, robust, and actionable evaluation tool for doc-mt systems.
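
One plausible reading of the n-Chunk Sliding Evaluate stage is sketched below: once the two sides have the same number of sentences, a window of n consecutive sentences slides over the document, each window is scored, and the scores are averaged per chunk size. This is a sketch under stated assumptions, not the paper's procedure. The stride of one sentence, the final averaging across chunk sizes, and the toy `token_overlap` metric (a stand-in for a real MT metric) are all hypothetical.

```python
def token_overlap(hypothesis, counterpart):
    """Toy stand-in metric: Jaccard overlap of whitespace tokens."""
    hyp, ref = set(hypothesis.split()), set(counterpart.split())
    union = hyp | ref
    return len(hyp & ref) / len(union) if union else 1.0


def n_chunk_score(target_sents, counterpart_sents, n, metric):
    """Average metric over all windows of n consecutive aligned sentences (stride 1)."""
    if len(target_sents) != len(counterpart_sents):
        raise ValueError("both sides must have the same sentence count after alignment")
    windows = max(len(target_sents) - n + 1, 1)  # short documents yield one window
    scores = []
    for start in range(windows):
        target_chunk = " ".join(target_sents[start:start + n])
        counterpart_chunk = " ".join(counterpart_sents[start:start + n])
        scores.append(metric(target_chunk, counterpart_chunk))
    return sum(scores) / len(scores)


def sliding_evaluate(target_sents, counterpart_sents, metric, chunk_sizes=(1, 2, 3, 4)):
    """Return per-chunk-size scores and their unweighted mean."""
    per_size = {n: n_chunk_score(target_sents, counterpart_sents, n, metric)
                for n in chunk_sizes}
    return per_size, sum(per_size.values()) / len(per_size)


per_size, overall = sliding_evaluate(["a b", "c d"], ["a b", "x y"], token_overlap,
                                     chunk_sizes=(1, 2))
print(per_size, overall)
```

Larger chunks give the metric more context per window, which is presumably what the multi-granularity assessment is meant to capture.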

[21] NE-PADD: Leveraging Named Entity Knowledge for Robust Partial Audio Deepfake Detection via Attention Aggregation

Huhong Xian,Rui Liu,Berrak Sisman,Haizhou Li

Main category: cs.CL

TL;DR: NE-PADD improves partial audio deepfake detection by incorporating named entity recognition and attention-based fusion and transfer mechanisms.

Details Motivation: Named entity information from audio remains underexplored in partial audio deepfake detection, despite its potential to enhance detection accuracy. Method: NE-PADD uses two parallel branches, SpeechNER and PADD, with Attention Fusion and Attention Transfer mechanisms to leverage named entity knowledge. Result: Experiments on the PartialSpoof-NER dataset show that NE-PADD outperforms existing baselines in PADD performance. Conclusion: The proposed NE-PADD method effectively integrates named entity knowledge for Partial Audio Deepfake Detection and demonstrates superior performance over existing baselines. Abstract: Different from traditional sentence-level audio deepfake detection (ADD), partial audio deepfake detection (PADD) requires frame-level positioning of the location of fake speech. While some progress has been made in this area, leveraging semantic information from audio, especially named entities, remains an underexplored aspect. To this end, we propose NE-PADD, a novel method for Partial Audio Deepfake Detection (PADD) that leverages named entity knowledge through two parallel branches: Speech Name Entity Recognition (SpeechNER) and PADD. The approach incorporates two attention aggregation mechanisms: Attention Fusion (AF) for combining attention weights and Attention Transfer (AT) for guiding PADD with named entity semantics using an auxiliary loss. Built on the PartialSpoof-NER dataset, experiments show our method outperforms existing baselines, proving the effectiveness of integrating named entity knowledge in PADD. The code is available at https://github.com/AI-S2-Lab/NE-PADD.

[22] Drivel-ology: Challenging LLMs with Interpreting Nonsense with Depth

Yang Wang,Chenghao Xiao,Chia-Yi Hsiao,Zi Yan Chang,Chi-Li Chen,Tyler Loakman,Chenghua Lin

Main category: cs.CL

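TL;DR: This paper studies how well large language models understand a linguistic phenomenon called Drivelology, finding that models perform poorly on classification, generation, and reasoning tasks, which exposes weaknesses in pragmatic understanding and deep semantic modelling.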

Details Motivation: Drivelology is a distinctive linguistic phenomenon characterised as "nonsense with depth": utterances that are syntactically coherent yet pragmatically paradoxical, emotionally loaded, or rhetorically subversive. Current LLMs struggle to understand such text. Method: The authors build a benchmark dataset of over 1,200 meticulously curated Drivelology examples and evaluate multiple LLMs on classification, generation, and reasoning tasks. Result: LLMs often confuse Drivelology with shallow nonsense, produce incoherent justifications, or miss the implied rhetorical function altogether. Conclusion: Large language models show clear limitations in understanding Drivelological text with layered semantics, pointing to a deeper representational gap in their pragmatic understanding. Abstract: We introduce Drivelology, a unique linguistic phenomenon characterised as "nonsense with depth", utterances that are syntactically coherent yet pragmatically paradoxical, emotionally loaded, or rhetorically subversive. While such expressions may resemble surface-level nonsense, they encode implicit meaning requiring contextual inference, moral reasoning, or emotional interpretation. We find that current large language models (LLMs), despite excelling at many natural language processing (NLP) tasks, consistently fail to grasp the layered semantics of Drivelological text. To investigate this, we construct a small but diverse benchmark dataset of over 1,200 meticulously curated examples, with select instances in English, Mandarin, Spanish, French, Japanese, and Korean. Annotation was especially challenging: each of the examples required careful expert review to verify that it truly reflected Drivelological characteristics. The process involved multiple rounds of discussion and adjudication to address disagreements, highlighting the subtle and subjective nature of Drivelology. We evaluate a range of LLMs on classification, generation, and reasoning tasks. Our results reveal clear limitations of LLMs: models often confuse Drivelology with shallow nonsense, produce incoherent justifications, or miss the implied rhetorical function altogether. These findings highlight a deeper representational gap in LLMs' pragmatic understanding and challenge the assumption that statistical fluency implies cognitive comprehension. We release our dataset and code to facilitate further research in modelling linguistic depth beyond surface-level coherence.

[23] A Comprehensive Survey on Trustworthiness in Reasoning with Large Language Models

Yanbo Wang,Yongcan Yu,Jian Liang,Ran He

Main category: cs.CL

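TL;DR: This paper surveys research progress on trustworthy reasoning in CoT-based reasoning models, analyzing their strengths and vulnerabilities in truthfulness, safety, robustness, fairness, and privacy.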

Details Motivation: Although CoT reasoning has advanced language model performance, its effect on model trustworthiness is not yet comprehensively understood. Method: The paper surveys CoT-based reasoning models, analyzes trustworthy reasoning along five core dimensions (truthfulness, safety, robustness, fairness, and privacy), and provides a chronological overview of related studies. Result: Reasoning techniques can enhance trustworthiness through hallucination mitigation, harmful content detection, and robustness improvement, but the latest reasoning models still have significant vulnerabilities in areas such as safety and privacy. Conclusion: While reasoning techniques hold promise for improving model trustworthiness, cutting-edge reasoning models themselves still suffer from comparable or even greater vulnerabilities in safety, robustness, and privacy. Abstract: The development of Long-CoT reasoning has advanced LLM performance across various tasks, including language understanding, complex problem solving, and code generation. This paradigm enables models to generate intermediate reasoning steps, thereby improving both accuracy and interpretability. However, despite these advancements, a comprehensive understanding of how CoT-based reasoning affects the trustworthiness of language models remains underdeveloped. In this paper, we survey recent work on reasoning models and CoT techniques, focusing on five core dimensions of trustworthy reasoning: truthfulness, safety, robustness, fairness, and privacy. For each aspect, we provide a clear and structured overview of recent studies in chronological order, along with detailed analyses of their methodologies, findings, and limitations. Future research directions are also appended at the end for reference and discussion. Overall, while reasoning techniques hold promise for enhancing model trustworthiness through hallucination mitigation, harmful content detection, and robustness improvement, cutting-edge reasoning models themselves often suffer from comparable or even greater vulnerabilities in safety, robustness, and privacy. By synthesizing these insights, we hope this work serves as a valuable and timely resource for the AI safety community to stay informed on the latest progress in reasoning trustworthiness. A full list of related papers can be found at https://github.com/ybwang119/Awesome-reasoning-safety.

[24] False Sense of Security: Why Probing-based Malicious Input Detection Fails to Generalize

Cheng Wang,Zeming Wei,Qin Liu,Muhao Chen

Main category: cs.CL

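TL;DR: This paper systematically analyzes the limitations of probing-based safety detection methods and recommends redesigning both models and evaluation protocols.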

Details Motivation: Because existing probing methods perform poorly on out-of-distribution data, the authors suspect that probes learn superficial patterns rather than semantic features. Method: Systematic experiments and analysis, comparing against the performance of simple n-gram methods and analyzing pattern dependencies. Result: The experiments confirm that probes learn superficial features such as instructional patterns and trigger words, revealing a false sense of security around current methods. Conclusion: Current probing-based approaches are limited as tools for assessing model safety, and both models and evaluation protocols need to be redesigned. Abstract: Large Language Models (LLMs) can comply with harmful instructions, raising serious safety concerns despite their impressive capabilities. Recent work has leveraged probing-based approaches to study the separability of malicious and benign inputs in LLMs' internal representations, and researchers have proposed using such probing methods for safety detection. We systematically re-examine this paradigm. Motivated by poor out-of-distribution performance, we hypothesize that probes learn superficial patterns rather than semantic harmfulness. Through controlled experiments, we confirm this hypothesis and identify the specific patterns learned: instructional patterns and trigger words. Our investigation follows a systematic approach, progressing from demonstrating comparable performance of simple n-gram methods, to controlled experiments with semantically cleaned datasets, to detailed analysis of pattern dependencies. These results reveal a false sense of security around current probing-based approaches and highlight the need to redesign both models and evaluation protocols, for which we provide further discussions in the hope of suggesting responsible further research in this direction. We have open-sourced the project at https://github.com/WangCheng0116/Why-Probe-Fails.

[25] MobileRAG: Enhancing Mobile Agent with Retrieval-Augmented Generation

Gowen Loo,Chang Liu,Qinghong Yin,Xiang Chen,Jiawei Chen,Jingyuan Zhang,Yu Tian

Main category: cs.CL

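TL;DR: This paper proposes MobileRAG, a new mobile agent framework that uses retrieval-augmented generation to improve the speed and accuracy of task execution, and demonstrates its effectiveness on a newly introduced benchmark.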

Details Motivation: Current LLM-based mobile agents suffer from comprehension errors, a lack of interaction with the external environment, and a lack of memory, which limits their use for complex or repetitive operations. Method: The paper proposes the MobileRAG framework, comprising InterRAG, LocalRAG, and MemRAG, and introduces a more challenging benchmark, MobileRAG-Eval, to evaluate it comprehensively. Result: Experiments show that MobileRAG handles real-world mobile tasks well, achieving a 10.3% improvement over state-of-the-art methods with fewer operational steps. Conclusion: By incorporating Retrieval-Augmented Generation (RAG), MobileRAG effectively addresses problems of current LLM-based mobile agents such as comprehension errors and the lack of environment interaction and memory. Abstract: Smartphones have become indispensable in people's daily lives, permeating nearly every aspect of modern society. With the continuous advancement of large language models (LLMs), numerous LLM-based mobile agents have emerged. These agents are capable of accurately parsing diverse user queries and automatically assisting users in completing complex or repetitive operations. However, current agents 1) heavily rely on the comprehension ability of LLMs, which can lead to errors caused by misoperations or omitted steps during tasks, 2) lack interaction with the external environment, often terminating tasks when an app cannot fulfill user queries, and 3) lack memory capabilities, requiring each instruction to reconstruct the interface and being unable to learn from and correct previous mistakes. To alleviate the above issues, we propose MobileRAG, a mobile agent framework enhanced by Retrieval-Augmented Generation (RAG), which includes InterRAG, LocalRAG, and MemRAG. It leverages RAG to more quickly and accurately identify user queries and accomplish complex and long-sequence mobile tasks. Additionally, to more comprehensively assess the performance of MobileRAG, we introduce MobileRAG-Eval, a more challenging benchmark characterized by numerous complex, real-world mobile tasks that require external knowledge assistance. Extensive experimental results on MobileRAG-Eval demonstrate that MobileRAG can easily handle real-world mobile tasks, achieving 10.3% improvement over state-of-the-art methods with fewer operational steps. Our code is publicly available at: https://github.com/liuxiaojieOutOfWorld/MobileRAG_arxiv

[26] MTQA:Matrix of Thought for Enhanced Reasoning in Complex Question Answering

Fengxiao Tang,Yufeng Li,Zongzong Wu,Ming Zhao

Main category: cs.CL

TL;DR: This paper proposes Matrix of Thought (MoT) and a fact-correction mechanism to improve complex question answering, achieving better accuracy and efficiency than existing methods.

Details Motivation: Large language models (LLMs) face performance degradation in complex and abstract QA tasks due to limited reasoning capabilities. Existing methods like Chain-of-Thought (CoT), Tree-of-Thought (ToT), and Retrieval-Augmented Generation (RAG) have limitations such as redundancy, single-path reasoning, and difficulty handling multi-entity, multi-hop information. Method: The study introduces the Matrix of Thought (MoT) structure, which uses a 'column-cell communication' mechanism for multi-strategy and deep-level thinking, and a fact-correction mechanism that builds knowledge units from knowledge graph triples and raw text. These approaches reduce redundancy and enhance reasoning capabilities. Result: The MTQA framework outperformed state-of-the-art methods on four widely-used datasets in terms of F1 and EM scores, with reasoning time only 14.4% of baseline methods, demonstrating both higher accuracy and greater efficiency. Conclusion: The proposed Matrix of Thought (MoT) framework, along with the fact-correction mechanism, significantly improves the efficiency and accuracy of complex question answering tasks compared to existing state-of-the-art methods. Abstract: Complex Question Answering (QA) is a fundamental and challenging task in NLP. While large language models (LLMs) exhibit impressive performance in QA, they suffer from significant performance degradation when facing complex and abstract QA tasks due to insufficient reasoning capabilities. Works such as Chain-of-Thought (CoT) and Tree-of-Thought (ToT) aim to enhance LLMs' reasoning abilities, but they face issues such as in-layer redundancy in tree structures and single paths in chain structures. Although some studies utilize Retrieval-Augmented Generation (RAG) methods to assist LLMs in reasoning, the challenge of effectively utilizing large amounts of information involving multiple entities and hops remains critical. 
To address this, we propose the Matrix of Thought (MoT), a novel and efficient LLM thought structure. MoT explores the problem in both horizontal and vertical dimensions through the "column-cell communication" mechanism, enabling LLMs to actively engage in multi-strategy and deep-level thinking, reducing redundancy within the column cells and enhancing reasoning capabilities. Furthermore, we develop a fact-correction mechanism by constructing knowledge units from retrieved knowledge graph triples and raw text to enhance the initial knowledge for LLM reasoning and correct erroneous answers. This leads to the development of an efficient and accurate QA framework (MTQA). Experimental results show that our framework outperforms state-of-the-art methods on four widely-used datasets in terms of F1 and EM scores, with reasoning time only 14.4% of the baseline methods, demonstrating both its efficiency and accuracy. The code for this framework is available at https://github.com/lyfiter/mtqa.

[27] Decoding the Poetic Language of Emotion in Korean Modern Poetry: Insights from a Human-Labeled Dataset and AI Modeling

Iro Lim,Haein Ji,Byungjun Kim

Main category: cs.CL

TL;DR: This study presents the KPoEM dataset and model for analyzing emotion in modern Korean poetry, substantially improving emotion recognition accuracy.

Details Motivation: Despite remarkable progress in text-based emotion classification with large language models, Korean poetry remains underexplored because of its figurative language and cultural specificity. Method: The researchers build a dataset of 7,662 multi-label emotion entries and fine-tune a state-of-the-art Korean language model using sequential fine-tuning (first on general corpora, then on the KPoEM dataset). Result: The model fine-tuned on KPoEM significantly outperformed previous models, reaching 0.60 F1-micro versus 0.34 for models trained on general corpora. Conclusion: By building the KPoEM dataset and a corresponding model, the study offers a new computational approach to emotion analysis of Korean poetry and demonstrates its effectiveness at identifying temporally and culturally specific emotional expressions. Abstract: This study introduces KPoEM (Korean Poetry Emotion Mapping), a novel dataset for computational emotion analysis in modern Korean poetry. Despite remarkable progress in text-based emotion classification using large language models, poetry, particularly Korean poetry, remains underexplored due to its figurative language and cultural specificity. We built a multi-label emotion dataset of 7,662 entries, including 7,007 line-level entries from 483 poems and 615 work-level entries, annotated with 44 fine-grained emotion categories from five influential Korean poets. A state-of-the-art Korean language model fine-tuned on this dataset significantly outperformed previous models, achieving 0.60 F1-micro compared to 0.34 from models trained on general corpora. The KPoEM model, trained through sequential fine-tuning (first on general corpora and then on the KPoEM dataset), demonstrates not only an enhanced ability to identify temporally and culturally specific emotional expressions, but also a strong capacity to preserve the core sentiments of modern Korean poetry. This study bridges computational methods and literary analysis, presenting new possibilities for the quantitative exploration of poetic emotions through structured data that faithfully retains the emotional and cultural nuances of Korean literature.
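
The F1-micro score reported for this multi-label task pools true positives, false positives, and false negatives over all entries and labels before computing F1. The sketch below shows that standard computation. The emotion labels and predictions are invented for illustration and are not drawn from KPoEM.

```python
def micro_f1(gold, predicted):
    """Micro-averaged F1 for multi-label data given per-entry label sets."""
    tp = fp = fn = 0
    for gold_labels, predicted_labels in zip(gold, predicted):
        gold_set, predicted_set = set(gold_labels), set(predicted_labels)
        tp += len(gold_set & predicted_set)   # labels correctly predicted
        fp += len(predicted_set - gold_set)   # labels predicted but not annotated
        fn += len(gold_set - predicted_set)   # annotated labels that were missed
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0


# Hypothetical annotations for two poem lines.
gold = [{"sorrow", "longing"}, {"hope"}]
predicted = [{"sorrow"}, {"hope", "joy"}]

score = micro_f1(gold, predicted)
print(round(score, 3))
```

Because counts are pooled, frequent emotion categories weigh more heavily in micro-F1 than rare ones, which matters with 44 fine-grained categories.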

[28] SelfAug: Mitigating Catastrophic Forgetting in Retrieval-Augmented Generation via Distribution Self-Alignment

Yuqing Huang,Rongyang Zhang,Qimeng Wang,Chengqiang Lu,Yan Gao,Yi Wu,Yao Hu,Xuyang Zhi,Guiquan Liu,Xin Li,Hao Wang,Enhong Chen

Main category: cs.CL

TL;DR: This paper proposes SelfAug, a method to prevent catastrophic forgetting in Retrieval-Augmented Generation scenarios by aligning input sequence logits to preserve the model's semantic distribution.

Details Motivation: Supervised fine-tuning, particularly in Retrieval-Augmented Generation scenarios, often leads to catastrophic forgetting, where models lose their previously acquired knowledge and general capabilities. Method: The paper proposes SelfAug, a self-distribution alignment method that aligns input sequence logits to preserve the model's semantic distribution. Result: Extensive experiments demonstrate that SelfAug achieves a superior balance between downstream learning and general capability retention. The empirical analysis also reveals a direct correlation between distribution shifts and the severity of catastrophic forgetting in RAG scenarios. Conclusion: SelfAug is an effective method for mitigating catastrophic forgetting in Retrieval-Augmented Generation scenarios, balancing downstream learning against general capability retention. Abstract: Recent advancements in large language models (LLMs) have revolutionized natural language processing through their remarkable capabilities in understanding and executing diverse tasks. While supervised fine-tuning, particularly in Retrieval-Augmented Generation (RAG) scenarios, effectively enhances task-specific performance, it often leads to catastrophic forgetting, where models lose their previously acquired knowledge and general capabilities. Existing solutions either require access to general instruction data or face limitations in preserving the model's original distribution. To overcome these limitations, we propose SelfAug, a self-distribution alignment method that aligns input sequence logits to preserve the model's semantic distribution, thereby mitigating catastrophic forgetting and improving downstream performance. Extensive experiments demonstrate that SelfAug achieves a superior balance between downstream learning and general capability retention. 
Our comprehensive empirical analysis reveals a direct correlation between distribution shifts and the severity of catastrophic forgetting in RAG scenarios, highlighting how the absence of RAG capabilities in general instruction tuning leads to significant distribution shifts during fine-tuning. Our findings not only advance the understanding of catastrophic forgetting in RAG contexts but also provide a practical solution applicable across diverse fine-tuning scenarios. Our code is publicly available at https://github.com/USTC-StarTeam/SelfAug.
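
The idea of aligning input sequence logits can be illustrated with a simple regularizer: penalize the divergence between the original model's token distributions and the fine-tuned model's distributions at the input positions, then add that penalty to the task loss. The following is a rough sketch under stated assumptions, not the released SelfAug code. The KL direction, the averaging over positions, the `weight` coefficient, and all function names are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over one position's logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions with matching support."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def input_alignment_penalty(reference_logits, tuned_logits):
    """Mean per-position KL between the original and fine-tuned models on input tokens."""
    per_position = [kl_divergence(softmax(ref), softmax(new))
                    for ref, new in zip(reference_logits, tuned_logits)]
    return sum(per_position) / len(per_position)


def total_loss(task_loss, reference_logits, tuned_logits, weight=0.5):
    """Task loss plus a weighted alignment penalty (weight is a hypothetical coefficient)."""
    return task_loss + weight * input_alignment_penalty(reference_logits, tuned_logits)


# Toy logits for two input positions over a three-token vocabulary.
reference = [[1.0, 2.0, 3.0], [0.0, 0.0, 0.0]]
drifted = [[3.0, 2.0, 1.0], [0.0, 0.0, 0.0]]

no_drift = input_alignment_penalty(reference, reference)
with_drift = input_alignment_penalty(reference, drifted)
print(no_drift, round(with_drift, 4))
```

The penalty is zero when the fine-tuned model reproduces the original distributions on the input and grows as they drift apart, which is the behaviour a self-alignment term of this kind relies on.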

[29] SPFT-SQL: Enhancing Large Language Model for Text-to-SQL Parsing by Self-Play Fine-Tuning

Yuhao Zhang,Shaoming Duan,Jinhang Su,Chuanyi Liu,Peiyi Han

Main category: cs.CL

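TL;DR: SPFT-SQL is a new self-play fine-tuning method for the Text-to-SQL task that uses iterative fine-tuning and an error-driven loss to substantially improve a model's ability to generate accurate SQL queries.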

Details Motivation: Although self-play fine-tuning (SPIN) has made significant progress in strengthening weak large language models (LLMs), it still faces challenges in the Text-to-SQL task: SPIN does not generate new information, and the large number of correct SQL queries produced by the opponent model reduces the main model's ability to generate accurate SQL. Method: SPFT-SQL has two main stages. First, verification-based iterative fine-tuning synthesizes high-quality fine-tuning data from the database schema and validation feedback. Second, during self-play fine-tuning, an error-driven loss incentivizes incorrect outputs from the opponent model, improving the main model's ability to distinguish correct from erroneous SQL. Result: SPFT-SQL performs strongly on six open-source LLMs and five widely used benchmarks, outperforming existing state-of-the-art methods. Conclusion: SPFT-SQL addresses the challenges of the Text-to-SQL task with a new self-play fine-tuning method that effectively improves the model's ability to generate accurate SQL queries; experiments show it outperforms existing state-of-the-art methods. Abstract: Despite the significant advancements of self-play fine-tuning (SPIN), which can transform a weak large language model (LLM) into a strong one through competitive interactions between models of varying capabilities, it still faces challenges in the Text-to-SQL task. SPIN does not generate new information, and the large number of correct SQL queries produced by the opponent model during self-play reduces the main model's ability to generate accurate SQL queries. To address this challenge, we propose a new self-play fine-tuning method tailored for the Text-to-SQL task, called SPFT-SQL. Prior to self-play, we introduce a verification-based iterative fine-tuning approach, which synthesizes high-quality fine-tuning data iteratively based on the database schema and validation feedback to enhance model performance, while building a model base with varying capabilities. During the self-play fine-tuning phase, we propose an error-driven loss method that incentivizes incorrect outputs from the opponent model, enabling the main model to distinguish between correct SQL and erroneous SQL generated by the opponent model, thereby improving its ability to generate correct SQL. Extensive experiments and in-depth analyses on six open-source LLMs and five widely used benchmarks demonstrate that our approach outperforms existing state-of-the-art (SOTA) methods.

[30] VoxRole: A Comprehensive Benchmark for Evaluating Speech-Based Role-Playing Agents

Weihao Wu,Liang Cao,Xinyu Wu,Zhiwei Lin,Rui Niu,Jingbei Li,Zhiyong Wu

Main category: cs.CL

TL;DR: This paper introduces VoxRole, a new benchmark for evaluating speech-based role-playing conversational agents (RPCAs). It addresses the lack of standardized evaluation benchmarks and the neglect of paralinguistic features in current research, offering insights into the strengths and weaknesses of current spoken dialogue models.

Details Motivation: The motivation for this research is to address two key limitations in current RPCA research: the focus on text-only models that ignore paralinguistic features like intonation and rhythm, and the lack of standardized benchmarks for evaluating speech-based RPCAs. Method: The research introduces VoxRole, a new benchmark for evaluating speech-based RPCAs. It uses a two-stage automated pipeline: aligning movie audio with scripts, and employing a large language model (LLM) to create multi-dimensional character profiles. The benchmark contains 13,335 multi-turn dialogues from 261 movies, totaling 65.6 hours of speech. Result: VoxRole was successfully developed as the first comprehensive benchmark for speech-based RPCAs. The research team conducted a multi-dimensional evaluation of current spoken dialogue models using VoxRole, revealing insights into their strengths and weaknesses in maintaining persona consistency. Conclusion: The study concludes that current speech-based role-playing conversational agents (RPCAs) lack standardized evaluation benchmarks and overlook important paralinguistic features. VoxRole provides a solution to these issues and offers valuable insights into the strengths and limitations of current spoken dialogue models. Abstract: Recent significant advancements in Large Language Models (LLMs) have greatly propelled the development of Role-Playing Conversational Agents (RPCAs). These systems aim to create immersive user experiences through consistent persona adoption. However, current RPCA research faces dual limitations. First, existing work predominantly focuses on the textual modality, entirely overlooking critical paralinguistic features including intonation, prosody, and rhythm in speech, which are essential for conveying character emotions and shaping vivid identities. Second, the speech-based role-playing domain suffers from a long-standing lack of standardized evaluation benchmarks. 
Most current spoken dialogue datasets target only fundamental capability assessments, featuring thinly sketched or ill-defined character profiles. Consequently, they fail to effectively quantify model performance on core competencies like long-term persona consistency. To address this critical gap, we introduce VoxRole, the first comprehensive benchmark specifically designed for the evaluation of speech-based RPCAs. The benchmark comprises 13335 multi-turn dialogues, totaling 65.6 hours of speech from 1228 unique characters across 261 movies. To construct this resource, we propose a novel two-stage automated pipeline that first aligns movie audio with scripts and subsequently employs an LLM to systematically build multi-dimensional profiles for each character. Leveraging VoxRole, we conduct a multi-dimensional evaluation of contemporary spoken dialogue models, revealing crucial insights into their respective strengths and limitations in maintaining persona consistency.

[31] CANDY: Benchmarking LLMs' Limitations and Assistive Potential in Chinese Misinformation Fact-Checking

Ruiling Guo,Xinwei Yang,Chen Huang,Tong Zhang,Yong Hu

Main category: cs.CL

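TL;DR: The CANDY benchmark shows that large language models have clear limitations in fact-checking Chinese misinformation, but have the potential to improve human performance when used as assistive tools.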

Details Motivation: The effectiveness of large language models (LLMs) in fact-checking misinformation remains uncertain despite their growing use. Method: The authors build CANDY, a benchmark of roughly 20,000 carefully annotated instances, and analyze models with chain-of-thought reasoning and few-shot prompting. Result: Current LLMs show limitations in generating accurate fact-checking conclusions, and the most common failure mode is factual fabrication. Conclusion: Although LLMs have limitations in fact-checking, they show considerable potential when used to assist humans. Abstract: The effectiveness of large language models (LLMs) to fact-check misinformation remains uncertain, despite their growing use. To this end, we present CANDY, a benchmark designed to systematically evaluate the capabilities and limitations of LLMs in fact-checking Chinese misinformation. Specifically, we curate a carefully annotated dataset of ~20k instances. Our analysis shows that current LLMs exhibit limitations in generating accurate fact-checking conclusions, even when enhanced with chain-of-thought reasoning and few-shot prompting. To understand these limitations, we develop a taxonomy to categorize flawed LLM-generated explanations for their conclusions and identify factual fabrication as the most common failure mode. Although LLMs alone are unreliable for fact-checking, our findings indicate their considerable potential to augment human performance when deployed as assistive tools in scenarios. Our dataset and code can be accessed at https://github.com/SCUNLP/CANDY

[32] Exploring NLP Benchmarks in an Extremely Low-Resource Setting

Ulin Nuha,Adam Jatowt

Main category: cs.CL

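TL;DR: This paper creates synthetic datasets to improve natural language processing for the low-resource language Ladin, specifically for sentiment analysis and multiple-choice question answering, and fills the gap in high-quality datasets for this language.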

Details Motivation: Owing to a lack of labeled data, large language models perform poorly on low-resource languages such as indigenous languages, and the scarcity of high-quality NLP datasets limits the development of language technology for them. Method: Using a small set of Ladin-Italian parallel sentence pairs, the authors create synthetic sentiment analysis and multiple-choice question answering datasets by translating monolingual Italian data, and apply rigorous filtering and back-translation to ensure linguistic quality and reliability. Result: Incorporating these synthetic datasets into machine translation training substantially improves on existing Italian-Ladin translation baselines. The work also releases the first publicly available sentiment analysis and multiple-choice question answering datasets for Ladin. Conclusion: The study fills the dataset gap for the low-resource language Ladin and provides foundational resources for basic research and downstream applications in this language. Abstract: The effectiveness of Large Language Models (LLMs) diminishes for extremely low-resource languages, such as indigenous languages, primarily due to the lack of labeled data. Despite growing interest, the availability of high-quality natural language processing (NLP) datasets for these languages remains limited, making it difficult to develop robust language technologies. This paper addresses this gap by focusing on Ladin, an endangered Romance language, specifically targeting the Val Badia variant. Leveraging a small set of parallel Ladin-Italian sentence pairs, we create synthetic datasets for sentiment analysis and multiple-choice question answering (MCQA) by translating monolingual Italian data. To ensure linguistic quality and reliability, we apply rigorous filtering and back-translation procedures in our method. We further demonstrate that incorporating these synthetic datasets into machine translation training leads to substantial improvements over existing Italian-Ladin translation baselines. Our contributions include the first publicly available sentiment analysis and MCQA datasets for Ladin, establishing foundational resources that can support broader NLP research and downstream applications for this underrepresented language.

[33] Expanding Foundational Language Capabilities in Open-Source LLMs through a Korean Case Study

Junghwan Lim,Gangwon Jo,Sungmin Lee,Jiyoung Park,Dongseok Kim,Jihwan Kim,Junhyeok Lee,Wai Ting Cheung,Dahye Choi,Kibong Choi,Jaeyeon Huh,Beomgyu Kim,Jangwoong Kim,Taehyun Kim,Haesol Lee,Jeesoo Lee,Dongpin Oh,Changseok Song,Daewon Suh

Main category: cs.CL

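TL;DR: Llama-3-Motif is a 102-billion-parameter language model focused on improving Korean capabilities. While retaining English performance, it outperforms existing models on Korean-specific benchmarks and achieves results comparable to GPT-4.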

Details Motivation: To develop a language model focused on Korean capabilities while retaining strong English performance. Method: Built on the Llama 3 architecture, the model is scaled with techniques including LlamaPro and Masked Structure Growth and is trained efficiently on hyperscale GPU clusters using the MoAI platform. Result: Llama-3-Motif performs well on Korean-specific benchmarks, outperforming existing models and reaching a level comparable to GPT-4. Conclusion: Llama-3-Motif is a large language model that improves Korean capabilities while retaining English performance; it outperforms existing models and is comparable to GPT-4. Abstract: We introduce Llama-3-Motif, a language model consisting of 102 billion parameters, specifically designed to enhance Korean capabilities while retaining strong performance in English. Developed on the Llama 3 architecture, Llama-3-Motif employs advanced training techniques, including LlamaPro and Masked Structure Growth, to effectively scale the model without altering its core Transformer architecture. Using the MoAI platform for efficient training across hyperscale GPU clusters, we optimized Llama-3-Motif using a carefully curated dataset that maintains a balanced ratio of Korean and English data. Llama-3-Motif shows decent performance on Korean-specific benchmarks, outperforming existing models and achieving results comparable to GPT-4.

[34] RTQA : Recursive Thinking for Complex Temporal Knowledge Graph Question Answering with Large Language Models

Zhaoyan Gong,Juan Li,Zhiqiang Liu,Lei Liang,Huajun Chen,Wen Zhang

Main category: cs.CL

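TL;DR: This paper proposes RTQA, a novel training-free framework that recursively decomposes questions and aggregates answers over multiple paths, significantly improving reasoning ability and fault tolerance in temporal knowledge graph question answering.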

Details Motivation: Current temporal knowledge graph question answering methods focus mainly on implicit temporal constraints, cannot handle more complex temporal queries, and suffer from limited reasoning ability and error propagation in decomposition frameworks. Method: Following recursive thinking, RTQA decomposes a question into sub-problems, solves them bottom-up using large language models (LLMs) and TKG knowledge, and applies multi-path answer aggregation to improve fault tolerance. Result: On the MultiTQ and TimelineKGQA benchmarks, RTQA significantly improves Hits@1 in the "Multiple" and "Complex" categories and outperforms state-of-the-art methods. Conclusion: RTQA is a novel training-free framework that strengthens reasoning over temporal knowledge graphs and addresses the limitations of existing methods in handling complex temporal queries and error propagation. Abstract: Current temporal knowledge graph question answering (TKGQA) methods primarily focus on implicit temporal constraints, lacking the capability of handling more complex temporal queries, and struggle with limited reasoning abilities and error propagation in decomposition frameworks. We propose RTQA, a novel framework to address these challenges by enhancing reasoning over TKGs without requiring training. Following recursive thinking, RTQA recursively decomposes questions into sub-problems, solves them bottom-up using LLMs and TKG knowledge, and employs multi-path answer aggregation to improve fault tolerance. RTQA consists of three core components: the Temporal Question Decomposer, the Recursive Solver, and the Answer Aggregator. Experiments on MultiTQ and TimelineKGQA benchmarks demonstrate significant Hits@1 improvements in "Multiple" and "Complex" categories, outperforming state-of-the-art methods. Our code and data are available at https://github.com/zjukg/RTQA.
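
The decompose, solve bottom-up, then aggregate pattern described in the abstract can be illustrated with a minimal sketch. This is not the RTQA implementation: the `Question` class, the hand-written question tree, and majority voting as the aggregation rule are all assumptions made for illustration, and the candidate answers stand in for what separate answering paths (for example an LLM and a TKG lookup) might return.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class Question:
    """Hypothetical question node: a leaf carries candidate answers,
    an inner node carries sub-questions and a rule to combine them."""
    text: str
    children: List["Question"] = field(default_factory=list)
    combine: Optional[Callable[[List[str]], str]] = None
    candidates: List[str] = field(default_factory=list)


def aggregate(candidates: List[str]) -> str:
    """Majority vote over answering paths; ties go to the earliest candidate."""
    counts = Counter(candidates)
    best = max(counts.values())
    return next(c for c in candidates if counts[c] == best)


def solve(question: Question) -> str:
    """Solve sub-questions first (bottom-up), then combine their answers."""
    if not question.children:
        return aggregate(question.candidates)
    child_answers = [solve(child) for child in question.children]
    return question.combine(child_answers)


# Toy example: one answering path disagrees on each leaf, the vote absorbs it.
paris = Question("In which year did Alice visit Paris?",
                 candidates=["2001", "2001", "2002"])
rome = Question("In which year did Alice visit Rome?",
                candidates=["2003", "2004", "2003"])
root = Question(
    "Which city did Alice visit first, Paris or Rome?",
    children=[paris, rome],
    combine=lambda years: "Paris" if int(years[0]) < int(years[1]) else "Rome",
)
answer = solve(root)
```

In this sketch a single wrong path at a leaf does not propagate upward, which is the kind of fault tolerance the abstract attributes to multi-path aggregation.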

[35] On Robustness and Reliability of Benchmark-Based Evaluation of LLMs

Riccardo Lunardi,Vincenzo Della Mea,Stefano Mizzaro,Kevin Roitero

Main category: cs.CL

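TL;DR: This study evaluates the robustness of large language models by paraphrasing benchmark questions and finds that effectiveness drops significantly under linguistic variability, which challenges the reliability of current benchmark-based evaluation.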

Details Motivation: LLM effectiveness is currently evaluated mainly with fixed-format benchmarks such as MMLU, ARC-C, or HellaSwag, whereas real-world use requires models to handle diverse phrasings. The study therefore assesses LLM robustness across different wordings of the same question and what this implies for evaluation methodology. Method: The authors systematically paraphrase all questions in six common benchmarks and measure how the effectiveness of 34 state-of-the-art LLMs changes across the paraphrased inputs. Result: Absolute effectiveness declines significantly on paraphrased questions, although rankings remain relatively stable. This indicates that models struggle with linguistic variation and raises questions about their generalization abilities and about evaluation methods. Conclusion: Although LLM rankings remain relatively stable on paraphrased questions, absolute effectiveness drops significantly, which shows that LLMs struggle with linguistic variability and calls the reliability of current benchmark-based evaluation into question. The study stresses the need for robustness-aware benchmarks that better reflect practical deployment scenarios. Abstract: The effectiveness of Large Language Models (LLMs) is usually evaluated by means of benchmarks such as MMLU, ARC-C, or HellaSwag, where questions are presented in their original wording, thus in a fixed, standardized format. However, real-world applications involve linguistic variability, requiring models to maintain their effectiveness across diverse rewordings of the same question or query. In this study, we systematically assess the robustness of LLMs to paraphrased benchmark questions and investigate whether benchmark-based evaluations provide a reliable measure of model capabilities. We systematically generate various paraphrases of all the questions across six different common benchmarks, and measure the resulting variations in effectiveness of 34 state-of-the-art LLMs, of different size and effectiveness. Our findings reveal that while LLM rankings remain relatively stable across paraphrased inputs, absolute effectiveness scores change, and decline significantly. This suggests that LLMs struggle with linguistic variability, raising concerns about their generalization abilities and evaluation methodologies. Furthermore, the observed performance drop challenges the reliability of benchmark-based evaluations, indicating that high benchmark scores may not fully capture a model's robustness to real-world input variations. We discuss the implications of these findings for LLM evaluation methodologies, emphasizing the need for robustness-aware benchmarks that better reflect practical deployment scenarios.
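
The distinction between stable rankings and falling absolute scores can be made concrete with a small sketch. The model names and accuracy values below are invented for illustration and are not results from the paper; Kendall's tau is used here as one plausible rank-agreement measure, which is an assumption about how such stability might be quantified.

```python
from itertools import combinations
from typing import List


def kendall_tau(a: List[float], b: List[float]) -> float:
    """Kendall's tau-a: tied pairs count as neither concordant nor discordant."""
    concordant = discordant = 0
    for i, j in combinations(range(len(a)), 2):
        sign = (a[i] - a[j]) * (b[i] - b[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(a) * (len(a) - 1) // 2
    return (concordant - discordant) / n_pairs


# Hypothetical accuracies on original vs. paraphrased questions.
original = {"model_a": 0.82, "model_b": 0.75, "model_c": 0.64, "model_d": 0.58}
paraphrased = {"model_a": 0.74, "model_b": 0.69, "model_c": 0.55, "model_d": 0.52}

models = sorted(original)
tau = kendall_tau([original[m] for m in models],
                  [paraphrased[m] for m in models])
mean_drop = sum(original[m] - paraphrased[m] for m in models) / len(models)
```

With these made-up numbers every model loses accuracy, yet the ordering of the models is unchanged, so the rank correlation stays at its maximum while the mean score drops.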

[36] What if I ask in \textit{alia lingua}? Measuring Functional Similarity Across Languages

Debangan Mishra,Arihant Rastogi,Agyeya Negi,Shashwat Goel,Ponnurangam Kumaraguru

Main category: cs.CL

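TL;DR: Using the $\kappa_p$ metric, this study finds that larger multilingual models produce more consistent outputs across languages, which shows the value of the metric for evaluating and improving multilingual systems.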

Details Motivation: To study how similar model outputs are across languages, in order to assess the reliability and consistency of multilingual models. Method: The recently proposed model similarity metric $\kappa_p$ is applied to 20 languages and 47 subjects in GlobalMMLU. Result: As model size and capability grow, cross-lingual response consistency increases, and a model agrees more with itself across languages than with other models prompted in the same language. Conclusion: Cross-lingual output consistency increases with model size and capability, and a model's outputs across languages are more consistent with each other than with the outputs of other models in the same language. Abstract: How similar are model outputs across languages? In this work, we study this question using a recently proposed model similarity metric $\kappa_p$ applied to 20 languages and 47 subjects in GlobalMMLU. Our analysis reveals that a model's responses become increasingly consistent across languages as its size and capability grow. Interestingly, models exhibit greater cross-lingual consistency within themselves than agreement with other models prompted in the same language. These results highlight not only the value of $\kappa_p$ as a practical tool for evaluating multilingual reliability, but also its potential to guide the development of more consistent multilingual systems.

[37] A RoBERTa-Based Functional Syntax Annotation Model for Chinese Texts

Han Xiaohui,Zhang Yunlong,Guo Yuxi

Main category: cs.CL

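TL;DR: This paper proposes a RoBERTa-based model for automatic functional syntax annotation of Chinese text. Fine-tuning the RoBERTa-Chinese wwm-ext model yields good results in functional syntax analysis, especially in identifying subjects, main verbs, and complements.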

Details Motivation: Systemic Functional Grammar and its branch, Cardiff Grammar, are widely applied in discourse analysis and semantic function research, but no automatic annotation system exists for Chinese text, which limits the application and development of the theory. To fill this gap, the study develops a deep-learning-based automatic functional syntax annotation model for Chinese. Method: The study randomly selects 4,100 sentences from the People's Daily 2014 corpus and annotates them according to functional syntax theory to build a training dataset. The RoBERTa-Chinese wwm-ext model is then fine-tuned on this dataset to perform a named entity recognition task. Result: On the test set the model reaches an F1 score of 0.852, significantly outperforming the comparison models, and performs especially well on core syntactic elements such as subject, main verb, and complement. There is still room for improvement on entities with imbalanced label samples. Conclusion: The study delivers a RoBERTa-based functional syntax annotation model for Chinese that performs well on core syntactic elements, provides a new method for automated Chinese functional syntax analysis, and lays a foundation for subsequent research. Abstract: Systemic Functional Grammar and its branch, Cardiff Grammar, have been widely applied to discourse analysis, semantic function research, and other tasks across various languages and texts. However, an automatic annotation system based on this theory for Chinese texts has not yet been developed, which significantly constrains the application and promotion of relevant theories. To fill this gap, this research introduces a functional syntax annotation model for Chinese based on RoBERTa (Robustly Optimized BERT Pretraining Approach). The study randomly selected 4,100 sentences from the People's Daily 2014 corpus and annotated them according to functional syntax theory to establish a dataset for training. The study then fine-tuned the RoBERTa-Chinese wwm-ext model based on the dataset to implement the named entity recognition task, achieving an F1 score of 0.852 on the test set that significantly outperforms other comparative models. The model demonstrated excellent performance in identifying core syntactic elements such as Subject (S), Main Verb (M), and Complement (C). Nevertheless, there remains room for improvement in recognizing entities with imbalanced label samples. As the first integration of functional syntax with attention-based NLP models, this research provides a new method for automated Chinese functional syntax analysis and lays a solid foundation for subsequent studies.

[38] Synthesizing Sheet Music Problems for Evaluation and Reinforcement Learning

Zhilin Wang,Zhe Yang,Yun Luo,Yafu Li,Haoran Zhang,Runzhe Zhan,Derek F. Wong,Jizhe Zhou,Yu Cheng

Main category: cs.CL

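TL;DR: This study proposes a music-theory-based method for synthesizing sheet music problems, producing datasets for both evaluation and training. Reinforcement learning on this data improves models' sheet music understanding and reasoning, and the results show potential for music composition.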

Details Motivation: Benchmarks and data for evaluating and training the ability of large language models and multimodal large language models to interpret sheet music are currently lacking, so the authors propose a new method to advance the development of AI musicians. Method: The authors introduce a data synthesis framework that generates verifiable sheet music problems grounded in music theory, producing the Synthetic Sheet Music Reasoning Benchmark (SSMR-Bench) and a corresponding training set, and train models with reinforcement learning with verifiable rewards (RLVR). Result: Evaluation on SSMR-Bench shows that reasoning ability is crucial for interpreting sheet music. Qwen3-8B-Base and Qwen2.5-VL-Instruct, trained with RLVR on the synthetic data, improve on the benchmark. Qwen3-8B-Base surpasses GPT-4 in overall performance on MusicTheoryBench and reaches reasoning performance comparable to GPT-4. The trained models also show enhanced music composition ability. Conclusion: The paper proposes a method for synthesizing sheet music problems based on music theory rules, demonstrates its effectiveness in improving model reasoning for sheet music understanding, and opens new possibilities for AI-assisted music creation. Abstract: Enhancing the ability of Large Language Models (LLMs) and Multimodal Large Language Models (MLLMs) to interpret sheet music is a crucial step toward building AI musicians. However, current research lacks both evaluation benchmarks and training data for sheet music reasoning. To address this, we propose the idea of synthesizing sheet music problems grounded in music theory, which can serve both as evaluation benchmarks and as training data for reinforcement learning with verifiable rewards (RLVR). We introduce a data synthesis framework that generates verifiable sheet music questions in both textual and visual modalities, leading to the Synthetic Sheet Music Reasoning Benchmark (SSMR-Bench) and a complementary training set. Evaluation results on SSMR-Bench show the importance of models' reasoning abilities in interpreting sheet music. At the same time, the poor performance of Gemini 2.5-Pro highlights the challenges that MLLMs still face in interpreting sheet music in a visual format. By leveraging synthetic data for RLVR, Qwen3-8B-Base and Qwen2.5-VL-Instruct achieve improvements on the SSMR-Bench. Besides, the trained Qwen3-8B-Base surpasses GPT-4 in overall performance on MusicTheoryBench and achieves reasoning performance comparable to GPT-4 with the strategies of Role play and Chain-of-Thought. Notably, its performance on math problems also improves relative to the original Qwen3-8B-Base. Furthermore, our results show that the enhanced reasoning ability can also facilitate music composition. 
In conclusion, we are the first to propose the idea of synthesizing sheet music problems based on music theory rules, and demonstrate its effectiveness not only in advancing model reasoning for sheet music understanding but also in unlocking new possibilities for AI-assisted music creation.

[39] Arabic Chatbot Technologies in Education: An Overview

Hicham Bourhil,Yacine El Younoussi

Main category: cs.CL

TL;DR: This paper surveys Arabic chatbots in education, highlighting that few use modern AI techniques compared to those in other languages, and suggests future research directions to bridge this gap.

Details Motivation: The motivation stems from the increasing adoption of AI and NLP technologies, particularly chatbots, in various domains including education. The authors aim to identify research gaps in the development of Arabic educational chatbots despite advancements in the field. Method: The study uses a survey method to analyze existing Arabic chatbots in education, focusing on their approaches, language varieties, and performance metrics. Result: The survey identifies that only a few educational Arabic chatbots utilize modern AI techniques, such as those based on Large Language Models like BERT and GPT, which contrasts with the progress seen in other languages like English. Conclusion: The study concludes that while chatbots have seen success in languages like English, there is a noticeable research gap regarding the use of modern techniques in educational Arabic chatbots. The authors suggest exploring future research directions to address this gap. Abstract: The recent advancements in Artificial Intelligence (AI) in general, and in Natural Language Processing (NLP) in particular, and some of its applications such as chatbots, have led to their implementation in different domains like education, healthcare, tourism, and customer service. Since the COVID-19 pandemic, there has been an increasing interest in these digital technologies to allow and enhance remote access. In education, e-learning systems have been massively adopted worldwide. The emergence of Large Language Models (LLM) such as BERT (Bidirectional Encoder Representations from Transformers) and GPT (Generative Pre-trained Transformers) made chatbots even more popular. In this study, we present a survey on existing Arabic chatbots in education and their different characteristics such as the adopted approaches, language variety, and metrics used to measure their performance. 
We were able to identify some research gaps when we discovered that, despite the success of chatbots in other languages such as English, only a few educational Arabic chatbots used modern techniques. Finally, we discuss future directions of research in this field.

[40] Improving Narrative Classification and Explanation via Fine Tuned Language Models

Rishit Tyagi,Rahul Bouri,Mohit Gupta

Main category: cs.CL

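TL;DR: This study addresses the detection and explanation of covert narratives in news articles by combining a fine-tuned BERT model and a GPT-4o pipeline with a ReACT framework and an auxiliary knowledge base.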

Details Motivation: Traditional NLP methods struggle to detect subtle phrasing and hidden agendas, and analyzing multi-label narratives and sub-narratives in news articles is challenging. Method: A BERT model is fine-tuned with a recall-oriented approach for narrative detection, and its predictions are refined with a GPT-4o pipeline. For narrative explanation, the authors propose a ReACT framework with semantic retrieval-based few-shot prompting. Result: Adding a structured taxonomy table to the prompts as an auxiliary knowledge base improves the accuracy and reliability of narrative detection and explanation. Conclusion: Integrating auxiliary knowledge in prompts improves classification accuracy and justification reliability, with applications in media analysis, education, and intelligence gathering. Abstract: Understanding covert narratives and implicit messaging is essential for analyzing bias and sentiment. Traditional NLP methods struggle with detecting subtle phrasing and hidden agendas. This study tackles two key challenges: (1) multi-label classification of narratives and sub-narratives in news articles, and (2) generating concise, evidence-based explanations for dominant narratives. We fine-tune a BERT model with a recall-oriented approach for comprehensive narrative detection, refining predictions using a GPT-4o pipeline for consistency. For narrative explanation, we propose a ReACT (Reasoning + Acting) framework with semantic retrieval-based few-shot prompting, ensuring grounded and relevant justifications. To enhance factual accuracy and reduce hallucinations, we incorporate a structured taxonomy table as an auxiliary knowledge base. Our results show that integrating auxiliary knowledge in prompts improves classification accuracy and justification reliability, with applications in media analysis, education, and intelligence gathering.

[41] Towards Stable and Personalised Profiles for Lexical Alignment in Spoken Human-Agent Dialogue

Keara Schaaij,Roel Boumans,Tibor Bosse,Iris Hendrickx

Main category: cs.CL

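TL;DR: This study explores how to enable lexical alignment in conversational agents by constructing stable, personalised lexical profiles from a minimal amount of data.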

Details Motivation: Lexical alignment contributes to successful communication, but its implementation in conversational agents remains underexplored, particularly given recent advances in large language models (LLMs). Method: The authors vary the amount of transcribed spoken data used to construct lexical profiles and the number of items included per part-of-speech category, and evaluate profile performance over time using recall, coverage, and cosine similarity. Result: Small, compact profiles built from 10 minutes of transcribed speech, containing 5 adjectives, 5 conjunctions, and 10 each of adverbs, nouns, pronouns, and verbs, give the best balance between performance and data efficiency. Conclusion: Constructing stable, personalised lexical profiles with minimal data requirements is a foundational step toward lexical alignment in conversational agents. Abstract: Lexical alignment, where speakers start to use similar words across conversation, is known to contribute to successful communication. However, its implementation in conversational agents remains underexplored, particularly considering the recent advancements in large language models (LLMs). As a first step towards enabling lexical alignment in human-agent dialogue, this study draws on strategies for personalising conversational agents and investigates the construction of stable, personalised lexical profiles as a basis for lexical alignment. Specifically, we varied the amounts of transcribed spoken data used for construction as well as the number of items included in the profiles per part-of-speech (POS) category and evaluated profile performance across time using recall, coverage, and cosine similarity metrics. It was shown that smaller and more compact profiles, created after 10 min of transcribed speech containing 5 items for adjectives, 5 items for conjunctions, and 10 items for adverbs, nouns, pronouns, and verbs each, offered the best balance in both performance and data efficiency. In conclusion, this study offers practical insights into constructing stable, personalised lexical profiles, taking into account minimal data requirements, serving as a foundational step toward lexical alignment strategies in conversational agents.
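
A per-POS lexical profile of the kind described above can be sketched as follows. Everything here is an assumption for illustration: the input is taken to be already POS-tagged, the tag names are hypothetical, items are chosen by raw frequency, and `recall` is defined as the share of profile items that reappear in later speech, which may differ from the paper's definition. Only the item counts in `PROFILE_SIZES` come from the abstract.

```python
from collections import Counter, defaultdict
from typing import Dict, List, Tuple

Token = Tuple[str, str]  # (word, POS tag)

# Items per category as reported in the abstract; tag names are hypothetical.
PROFILE_SIZES = {"ADJ": 5, "CONJ": 5, "ADV": 10, "NOUN": 10, "PRON": 10, "VERB": 10}


def build_profile(tokens: List[Token], sizes: Dict[str, int]) -> Dict[str, List[str]]:
    """Keep the most frequent words of each requested POS category."""
    counts: Dict[str, Counter] = defaultdict(Counter)
    for word, pos in tokens:
        if pos in sizes:
            counts[pos][word.lower()] += 1
    return {pos: [w for w, _ in counts[pos].most_common(k)]
            for pos, k in sizes.items()}


def recall(profile: Dict[str, List[str]], later_tokens: List[Token]) -> float:
    """Share of profile items that the speaker uses again later (assumed metric)."""
    later = {(word.lower(), pos) for word, pos in later_tokens}
    items = [(word, pos) for pos, words in profile.items() for word in words]
    if not items:
        return 0.0
    return sum(item in later for item in items) / len(items)


# Toy usage with a tiny profile of one adjective and one conjunction.
early = [("nice", "ADJ"), ("nice", "ADJ"), ("big", "ADJ"),
         ("and", "CONJ"), ("and", "CONJ"), ("but", "CONJ"), ("dog", "NOUN")]
later = [("and", "CONJ"), ("small", "ADJ")]
profile = build_profile(early, {"ADJ": 1, "CONJ": 1})
score = recall(profile, later)
```

Passing `PROFILE_SIZES` instead of the tiny dictionary would build a profile with the item counts the abstract reports as the best trade-off.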

[42] MultiWikiQA: A Reading Comprehension Benchmark in 300+ Languages

Dan Saattrup Smart

Main category: cs.CL

TL;DR: The paper introduces MultiWikiQA, a new reading comprehension dataset covering 306 languages derived from Wikipedia articles, with questions generated by an LLM. It demonstrates the dataset's quality and difficulty through human evaluations and model testing, making it freely available for research purposes.

Details Motivation: The motivation for creating the MultiWikiQA dataset is to provide a comprehensive benchmark for multilingual reading comprehension, which can help evaluate and improve the performance of language models across a wide range of languages. Method: The paper describes the creation of the MultiWikiQA dataset using Wikipedia articles for context and an LLM to generate questions. The answers to these questions appear verbatim in the Wikipedia articles. Additionally, the paper evaluates the fluency of the generated questions through a crowdsourced human evaluation across 30 languages and assesses the performance of 6 different language models on the dataset. Result: The results indicate that the generated questions are of good quality as per human evaluation across 30 languages. The evaluation of 6 different language models on the dataset reveals that the benchmark is challenging, with notable performance discrepancies observed among the different languages. Conclusion: The paper concludes that the MultiWikiQA dataset is a valuable resource for evaluating multilingual reading comprehension models, with its availability intended to foster further research and development in this area. Abstract: We introduce a new reading comprehension dataset, dubbed MultiWikiQA, which covers 306 languages. The context data comes from Wikipedia articles, with questions generated by an LLM and the answers appearing verbatim in the Wikipedia articles. We conduct a crowdsourced human evaluation of the fluency of the generated questions across 30 of the languages, providing evidence that the questions are of good quality. We evaluate 6 different language models, both decoder and encoder models of varying sizes, showing that the benchmark is sufficiently difficult and that there is a large performance discrepancy amongst the languages. The dataset and survey evaluations are freely available.
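
Because the answers are stated to appear verbatim in the Wikipedia context, a sample can be sanity-checked by locating the answer span. The sketch below is a hypothetical helper, not part of the released dataset tooling, and the example strings are invented.

```python
from typing import Optional, Tuple


def find_answer_span(context: str, answer: str) -> Optional[Tuple[int, int]]:
    """Return (start, end) character offsets of the first verbatim match, or None."""
    start = context.find(answer)
    if start == -1:
        return None
    return start, start + len(answer)


# Invented example sample.
context = "MultiWikiQA covers 306 languages."
span = find_answer_span(context, "306 languages")
missing = find_answer_span(context, "307 languages")
```

A sample whose answer yields `None` would violate the verbatim property and could be filtered out before evaluation.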

[43] Joint Modeling of Entities and Discourse Relations for Coherence Assessment

Wei Liu,Michael Strube

Main category: cs.CL

TL;DR: This study explores the joint modeling of entities and discourse relations for coherence assessment, demonstrating that combining both features significantly improves coherence modeling performance.

Details Motivation: Most existing work on coherence modeling focuses only on either entity features or discourse relation features, with little attention given to combining both. Method: Explored two methods for jointly modeling entities and discourse relations for coherence assessment. Result: Experiments on three benchmark datasets showed significant performance improvement when both features were integrated. Conclusion: Combining entity and discourse relation features enhances coherence modeling performance. Abstract: In linguistics, coherence can be achieved by different means, such as by maintaining reference to the same set of entities across sentences and by establishing discourse relations between them. However, most existing work on coherence modeling focuses exclusively on either entity features or discourse relation features, with little attention given to combining the two. In this study, we explore two methods for jointly modeling entities and discourse relations for coherence assessment. Experiments on three benchmark datasets show that integrating both types of features significantly enhances the performance of coherence models, highlighting the benefits of modeling both simultaneously for coherence evaluation.

[44] MAGneT: Coordinated Multi-Agent Generation of Synthetic Multi-Turn Mental Health Counseling Sessions

Aishik Mandal,Tanmoy Chakraborty,Iryna Gurevych

Main category: cs.CL

TL;DR: MAGneT is a multi-agent framework that generates high-quality synthetic psychological counseling sessions, outperforming single-agent methods and enhancing open-source model performance in counseling skills.

Details Motivation: The scarcity of high-quality, privacy-compliant data for fine-tuning open-source LLMs in psychological counseling motivates the development of MAGneT for scalable and nuanced synthetic session generation. Method: MAGneT uses a multi-agent framework where specialized LLM agents handle sub-tasks related to key psychological techniques, generating synthetic counseling sessions. A unified evaluation framework is also proposed. Result: MAGneT outperforms existing methods in quality, diversity, and therapeutic alignment, with improvements in general counseling (3.2%) and CBT-specific skills (4.3%) on CTRS. Experts prefer MAGneT sessions 77.2% of the time. Conclusion: MAGneT-generated sessions enhance the performance of open-source models in general counseling and CBT-specific skills, showing the framework's effectiveness and potential for scalable psychological counseling. Abstract: The growing demand for scalable psychological counseling highlights the need for fine-tuning open-source Large Language Models (LLMs) with high-quality, privacy-compliant data, yet such data remains scarce. Here we introduce MAGneT, a novel multi-agent framework for synthetic psychological counseling session generation that decomposes counselor response generation into coordinated sub-tasks handled by specialized LLM agents, each modeling a key psychological technique. Unlike prior single-agent approaches, MAGneT better captures the structure and nuance of real counseling. In addition, we address inconsistencies in prior evaluation protocols by proposing a unified evaluation framework integrating diverse automatic and expert metrics. Furthermore, we expand the expert evaluations from four aspects of counseling in previous works to nine aspects, enabling a more thorough and robust assessment of data quality. 
Empirical results show that MAGneT significantly outperforms existing methods in quality, diversity, and therapeutic alignment of the generated counseling sessions, improving general counseling skills by 3.2% and CBT-specific skills by 4.3% on average on cognitive therapy rating scale (CTRS). Crucially, experts prefer MAGneT-generated sessions in 77.2% of cases on average across all aspects. Moreover, fine-tuning an open-source model on MAGneT-generated sessions shows better performance, with improvements of 6.3% on general counseling skills and 7.3% on CBT-specific skills on average on CTRS over those fine-tuned with sessions generated by baseline methods. We also make our code and data public.

[45] Explicit and Implicit Data Augmentation for Social Event Detection

Congbo Ma,Yuxia Wang,Jia Wu,Jian Yang,Jing Du,Zitai Qiu,Qing Li,Hu Wang,Preslav Nakov

Main category: cs.CL

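TL;DR: This paper proposes SED-Aug, an augmentation framework for social event detection that combines explicit text-based augmentation with implicit feature-space augmentation. It improves average F1 over the best baseline by about 17.67% on Twitter2012 and 15.57% on Twitter2018.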

Details Motivation: Social event detection relies on labeled data, and manual annotation is costly and time-consuming, so an effective augmentation framework is needed to improve model performance and reduce dependence on labeled data. Method: SED-Aug combines explicit text-based augmentation and implicit feature-space augmentation. The explicit part uses large language models to enrich textual information through five different generation strategies. The implicit part introduces five new perturbation techniques that operate in the feature space on structural fused embeddings, preserving the semantic and relational properties of the embeddings while increasing diversity. Result: SED-Aug outperforms the best baseline model in average F1 score by about 17.67% on Twitter2012 and about 15.57% on Twitter2018. Conclusion: By combining explicit text-based and implicit feature-space augmentation, SED-Aug effectively improves performance on social event detection, as shown by the average F1 gains over the best baseline on both Twitter datasets. Abstract: Social event detection involves identifying and categorizing important events from social media, which relies on labeled data, but annotation is costly and labor-intensive. To address this problem, we propose Augmentation framework for Social Event Detection (SED-Aug), a plug-and-play dual augmentation framework, which combines explicit text-based and implicit feature-space augmentation to enhance data diversity and model robustness. The explicit augmentation utilizes large language models to enhance textual information through five diverse generation strategies. For implicit augmentation, we design five novel perturbation techniques that operate in the feature space on structural fused embeddings. These perturbations are crafted to keep the semantic and relational properties of the embeddings and make them more diverse. Specifically, SED-Aug outperforms the best baseline model by approximately 17.67% on the Twitter2012 dataset and by about 15.57% on the Twitter2018 dataset in terms of the average F1 score. The code is available at GitHub: https://github.com/congboma/SED-Aug.

[46] Inverse IFEval: Can LLMs Unlearn Stubborn Training Conventions to Follow Real Instructions?

Qinyan Zhang,Xinping Lei,Ruijie Miao,Yu Fu,Haojie Fan,Le Chang,Jiafan Hou,Dingling Zhang,Zhongfei Hou,Ziqiang Yang,Changxin Pu,Fei Hu,Jingkai Liu,Mengyun Liu,Yang Liu,Xiang Gao,Jiaheng Liu,Tong Yang,Zaiyuan Wang,Ge Zhang,Wenhao Huang

Main category: cs.CL

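TL;DR: This paper proposes the Inverse IFEval benchmark, which evaluates the ability of large language models to overcome training-induced biases when given adversarial instructions.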

Details Motivation: Large language models (LLMs) perform well on many tasks but often exhibit cognitive inertia, struggling to follow instructions that conflict with the standardized patterns learned during supervised fine-tuning (SFT). Method: Using a human-in-the-loop pipeline, the authors construct a dataset of 1012 high-quality Chinese and English questions across 23 domains and evaluate models under an optimized LLM-as-a-Judge framework. Result: Experiments demonstrate the necessity of the proposed Inverse IFEval benchmark. Conclusion: The paper emphasizes that future alignment work should pursue not only fluency and factual correctness but also adaptability in unconventional contexts, and it presents Inverse IFEval as both a diagnostic tool and a foundation for methods that mitigate cognitive inertia, reduce overfitting to narrow patterns, and ultimately improve the instruction-following reliability of LLMs in diverse and unpredictable real-world scenarios. Abstract: Large Language Models (LLMs) achieve strong performance on diverse tasks but often exhibit cognitive inertia, struggling to follow instructions that conflict with the standardized patterns learned during supervised fine-tuning (SFT). To evaluate this limitation, we propose Inverse IFEval, a benchmark that measures models' Counter-intuitive Ability: their capacity to override training-induced biases and comply with adversarial instructions. Inverse IFEval introduces eight types of such challenges, including Question Correction, Intentional Textual Flaws, Code without Comments, and Counterfactual Answering. Using a human-in-the-loop pipeline, we construct a dataset of 1012 high-quality Chinese and English questions across 23 domains, evaluated under an optimized LLM-as-a-Judge framework. Experiments on existing leading LLMs demonstrate the necessity of our proposed Inverse IFEval benchmark. Our findings emphasize that future alignment efforts should not only pursue fluency and factual correctness but also account for adaptability under unconventional contexts. We hope that Inverse IFEval serves as both a diagnostic tool and a foundation for developing methods that mitigate cognitive inertia, reduce overfitting to narrow patterns, and ultimately enhance the instruction-following reliability of LLMs in diverse and unpredictable real-world scenarios.

[47] Facts Fade Fast: Evaluating Memorization of Outdated Medical Knowledge in Large Language Models

Juraj Vladika,Mahdi Dhaini,Florian Matthes

Main category: cs.CL

TL;DR: This paper introduces MedRevQA and MedChangeQA datasets to evaluate LLMs in healthcare, revealing their reliance on outdated knowledge and suggesting future improvements for more reliable medical AI.

Details Motivation: LLMs have significant potential in healthcare but may provide harmful advice if they rely on outdated knowledge due to changes in medical consensus over time. Method: The authors introduced two novel datasets, MedRevQA and MedChangeQA, and evaluated eight prominent LLMs on these datasets to assess their reliance on outdated medical knowledge. Result: Evaluation of eight LLMs showed consistent reliance on outdated knowledge, with analysis focusing on the impact of obsolete training data and training strategies. Conclusion: The study concludes that current LLMs consistently rely on outdated medical knowledge, which poses risks in healthcare applications, and suggests future directions to mitigate this issue. Abstract: The growing capabilities of Large Language Models (LLMs) show significant potential to enhance healthcare by assisting medical researchers and physicians. However, their reliance on static training data is a major risk when medical recommendations evolve with new research and developments. When LLMs memorize outdated medical knowledge, they can provide harmful advice or fail at clinical reasoning tasks. To investigate this problem, we introduce two novel question-answering (QA) datasets derived from systematic reviews: MedRevQA (16,501 QA pairs covering general biomedical knowledge) and MedChangeQA (a subset of 512 QA pairs where medical consensus has changed over time). Our evaluation of eight prominent LLMs on the datasets reveals consistent reliance on outdated knowledge across all models. We additionally analyze the influence of obsolete pre-training data and training strategies to explain this phenomenon and propose future directions for mitigation, laying the groundwork for developing more current and reliable medical AI systems.

[48] PARCO: Phoneme-Augmented Robust Contextual ASR via Contrastive Entity Disambiguation

Jiajun He,Naoki Sawada,Koichi Miyazaki,Tomoki Toda

Main category: cs.CL

TL;DR: PARCO: Phoneme-Augmented Robust Contextual ASR via COntrastive entity disambiguation.

Details Motivation: ASR systems struggle with domain-specific named entities, especially homophones. Contextual ASR improves recognition but often fails to capture fine-grained phoneme variations due to limited entity diversity. Prior methods treat entities as independent tokens, leading to incomplete multi-token biasing. Method: PARCO integrates phoneme-aware encoding, contrastive entity disambiguation, entity-level supervision, and hierarchical entity filtering. Result: Experiments show that PARCO achieves CER of 4.22% on Chinese AISHELL-1 and WER of 11.14% on English DATA2 under 1,000 distractors. PARCO also demonstrates robust gains on out-of-domain datasets like THCHS-30 and LibriSpeech. Conclusion: PARCO enhances phonetic discrimination, ensures complete entity retrieval, and reduces false positives under uncertainty. Abstract: Automatic speech recognition (ASR) systems struggle with domain-specific named entities, especially homophones. Contextual ASR improves recognition but often fails to capture fine-grained phoneme variations due to limited entity diversity. Moreover, prior methods treat entities as independent tokens, leading to incomplete multi-token biasing. To address these issues, we propose Phoneme-Augmented Robust Contextual ASR via COntrastive entity disambiguation (PARCO), which integrates phoneme-aware encoding, contrastive entity disambiguation, entity-level supervision, and hierarchical entity filtering. These components enhance phonetic discrimination, ensure complete entity retrieval, and reduce false positives under uncertainty. Experiments show that PARCO achieves CER of 4.22% on Chinese AISHELL-1 and WER of 11.14% on English DATA2 under 1,000 distractors, significantly outperforming baselines. PARCO also demonstrates robust gains on out-of-domain datasets like THCHS-30 and LibriSpeech.
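The CER and WER figures above are edit-distance error rates: the number of substitutions, insertions, and deletions needed to turn the hypothesis into the reference, divided by the reference length (in characters or words). The following is a minimal sketch of that standard definition, not the authors' evaluation script; the function names and the homophone example are illustrative assumptions.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def error_rate(ref_units, hyp_units):
    """Edit distance normalized by the reference length."""
    if not ref_units:
        raise ValueError("reference must not be empty")
    return edit_distance(ref_units, hyp_units) / len(ref_units)


def wer(ref, hyp):
    """Word error rate over whitespace-separated tokens."""
    return error_rate(ref.split(), hyp.split())


def cer(ref, hyp):
    """Character error rate, ignoring spaces."""
    return error_rate(list(ref.replace(" ", "")), list(hyp.replace(" ", "")))


# Hypothetical homophone confusion on a named entity: "wright" vs "right".
reference = "call doctor wright now"
hypothesis = "call doctor right now"
print(wer(reference, hypothesis))  # one wrong word out of four
print(cer(reference, hypothesis))  # one missing character out of nineteen
```

A single homophone error costs a full word in WER but only one character in CER, which is one reason entity-heavy test sets can show large WER differences between systems.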

[49] Measuring Bias or Measuring the Task: Understanding the Brittle Nature of LLM Gender Biases

Bufan Gao,Elisa Kreiss

Main category: cs.CL

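TL;DR: This paper studies how prompt variations affect measured gender bias in large language models (LLMs), finding that even minor prompt changes can substantially alter bias outcomes and sometimes reverse them entirely.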

Details Motivation: As LLMs are increasingly applied in socially impactful settings, concerns about gender bias have prompted efforts to measure and mitigate it. These efforts often rely on evaluation tasks that differ from natural language distributions. The paper examines how signaling the evaluative purpose of a task affects measured gender bias in LLMs. Method: Models are tested under prompt conditions that make either the testing context or the gender-focused content salient. Prompt sensitivity is assessed across four task formats with two kinds of metrics (token-probability and discrete-choice). Result: Even minor prompt changes can substantially alter bias outcomes, sometimes reversing their direction entirely. Discrete-choice metrics tend to amplify bias relative to probabilistic measures. Conclusion: Measured gender bias in LLMs is brittle with respect to prompt wording and metric choice. This raises a new question: to what extent do carefully designed tests trigger "testing mode" behavior in LLMs, and what does that mean for the ecological validity of future benchmarks? Abstract: As LLMs are increasingly applied in socially impactful settings, concerns about gender bias have prompted growing efforts both to measure and mitigate such bias. These efforts often rely on evaluation tasks that differ from natural language distributions, as they typically involve carefully constructed task prompts that overtly or covertly signal the presence of gender bias-related content. In this paper, we examine how signaling the evaluative purpose of a task impacts measured gender bias in LLMs. Concretely, we test models under prompt conditions that (1) make the testing context salient, and (2) make gender-focused content salient. We then assess prompt sensitivity across four task formats with both token-probability and discrete-choice metrics. We find that even minor prompt changes can substantially alter bias outcomes, sometimes reversing their direction entirely. Discrete-choice metrics further tend to amplify bias relative to probabilistic measures. These findings not only highlight the brittleness of LLM gender bias evaluations but also open a new puzzle for the NLP benchmarking and development community: To what extent can well-controlled testing designs trigger LLM "testing mode" performance, and what does this mean for the ecological validity of future benchmarks?
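One way to see why a discrete-choice metric can amplify bias relative to a probabilistic one is to score the same model outputs both ways. The sketch below is illustrative only: the two metric definitions and the probability values are assumptions made for this example, not the metrics or data used in the paper.

```python
def probabilistic_bias(probs):
    """Mean of p(male) - p(female) over items; lies in [-1, 1]."""
    return sum(pm - pf for pm, pf in probs) / len(probs)


def discrete_choice_bias(probs):
    """Mean of per-item votes: +1 if the male option is more probable,
    -1 if the female option is more probable, 0 on a tie."""
    votes = [(pm > pf) - (pm < pf) for pm, pf in probs]
    return sum(votes) / len(votes)


# Hypothetical (p_male, p_female) pairs: the model leans slightly male on every item.
items = [(0.55, 0.45), (0.52, 0.48), (0.60, 0.40), (0.51, 0.49)]

print(probabilistic_bias(items))    # small average gap between the two options
print(discrete_choice_bias(items))  # every item votes the same way
```

Because the discrete-choice score keeps only which option wins, a small but consistent probability gap turns into a maximal score. The same mechanism means a prompt change that nudges probabilities across 0.5 can flip the discrete score's sign.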

[50] Can Language Models Handle a Non-Gregorian Calendar?

Mutsumi Sasaki,Go Kamoda,Ryosuke Takahashi,Kosuke Sato,Kentaro Inui,Keisuke Sakaguchi,Benjamin Heinzerling

Main category: cs.CL

TL;DR: This study evaluates how well open-source language models handle the Japanese calendar, a non-Gregorian system, revealing that while some models can convert calendars, many struggle with arithmetic and consistency, highlighting the need for better culture-specific temporal reasoning in LMs.

Details Motivation: Temporal reasoning and knowledge are essential for language models, yet most research has focused only on the Gregorian calendar. This study addresses the gap by evaluating how well current LMs handle non-Gregorian systems like the Japanese calendar, which are actively used and culturally significant. Method: The researchers created datasets for four tasks requiring temporal knowledge and reasoning to evaluate a range of English-centric and Japanese-centric open-source language models in handling the Japanese calendar, a non-Gregorian system. Result: Some models can perform calendar conversions, but even Japanese-centric models struggle with Japanese-calendar arithmetic and maintaining consistency across calendars. Conclusion: The study concludes that while some language models can perform calendar conversions, most models, including Japanese-centric ones, struggle with Japanese-calendar arithmetic and maintaining consistency across calendars, emphasizing the need for improved culture-specific calendar understanding in LMs. Abstract: Temporal reasoning and knowledge are essential capabilities for language models (LMs). While much prior work has analyzed and improved temporal reasoning in LMs, most studies have focused solely on the Gregorian calendar. However, many non-Gregorian systems, such as the Japanese, Hijri, and Hebrew calendars, are in active use and reflect culturally grounded conceptions of time. If and how well current LMs can accurately handle such non-Gregorian calendars has not been evaluated so far. Here, we present a systematic evaluation of how well open-source LMs handle one such non-Gregorian system: the Japanese calendar. For our evaluation, we create datasets for four tasks that require both temporal knowledge and temporal reasoning. 
Evaluating a range of English-centric and Japanese-centric LMs, we find that some models can perform calendar conversions, but even Japanese-centric models struggle with Japanese-calendar arithmetic and with maintaining consistency across calendars. Our results highlight the importance of developing LMs that are better equipped for culture-specific calendar understanding.
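To make the conversion and arithmetic tasks concrete, here is a minimal sketch of Gregorian to Japanese-era conversion for the four most recent eras. It is an illustration of the calendar system, not the paper's dataset or evaluation code; the function names are assumptions, and eras before Taisho are deliberately left out.

```python
from datetime import date

# Gregorian start dates of the four most recent Japanese eras, newest first.
ERAS = [
    ("Reiwa", date(2019, 5, 1)),
    ("Heisei", date(1989, 1, 8)),
    ("Showa", date(1926, 12, 25)),
    ("Taisho", date(1912, 7, 30)),
]


def to_japanese_era(d):
    """Return (era name, era year) for a Gregorian date.

    Era year 1 is the calendar year in which the era begins, so the
    year number increments on January 1, not on the era's anniversary.
    """
    for name, start in ERAS:
        if d >= start:
            return name, d.year - start.year + 1
    raise ValueError("dates before the Taisho era are not covered by this sketch")


def to_gregorian_year(era, era_year):
    """Return the Gregorian year corresponding to a given era year."""
    for name, start in ERAS:
        if name == era:
            return start.year + era_year - 1
    raise ValueError(f"unknown era: {era}")


# Conversion: the last day of Heisei and the first day of Reiwa.
print(to_japanese_era(date(2019, 4, 30)))
print(to_japanese_era(date(2019, 5, 1)))

# Arithmetic across an era boundary: ten years after Heisei 25.
year = to_gregorian_year("Heisei", 25) + 10
print(to_japanese_era(date(year, 6, 1)))
```

The arithmetic case shows why the task is harder than plain conversion: adding years to a Heisei date can land in Reiwa, so the answer needs both the era boundaries and the year offset.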

cs.CV [Back]

[51] Towards Efficient General Feature Prediction in Masked Skeleton Modeling

Shengkai Sun,Zefan Zhang,Jianfeng Dong,Zhiyong Cheng,Xiaojun Chang,Meng Wang

Main category: cs.CV

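TL;DR: This paper proposes a General Feature Prediction framework (GFP) for efficient masked skeleton modeling, which uses high-level feature prediction and a collaborative learning framework to improve the computational efficiency and representation quality of self-supervised skeleton-based action recognition.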

Details Motivation: Existing MAE-based skeleton action recognition methods are limited to reconstructing raw joint coordinates or simple variants of them, which leads to computational redundancy and limited semantic representation. Method: The authors propose a General Feature Prediction framework (GFP) that replaces conventional low-level reconstruction with high-level feature prediction spanning local motion patterns to global semantic representations, and they introduce a collaborative learning framework in which a lightweight target generation network dynamically produces diversified supervision signals across spatial-temporal hierarchies. Result: Experiments on NTU RGB+D 60, NTU RGB+D 120, and PKU-MMD show that the method trains 6.2 times faster than standard masked skeleton modeling methods and achieves state-of-the-art performance on various downstream tasks. Conclusion: The experiments indicate that GFP outperforms existing methods in both computational efficiency and representation quality, reaching state-of-the-art performance on various downstream tasks. Abstract: Recent advances in the masked autoencoder (MAE) paradigm have significantly propelled self-supervised skeleton-based action recognition. However, most existing approaches limit reconstruction targets to raw joint coordinates or their simple variants, resulting in computational redundancy and limited semantic representation. To address this, we propose a novel General Feature Prediction framework (GFP) for efficient masked skeleton modeling. Our key innovation is replacing conventional low-level reconstruction with high-level feature prediction that spans from local motion patterns to global semantic representations. Specifically, we introduce a collaborative learning framework where a lightweight target generation network dynamically produces diversified supervision signals across spatial-temporal hierarchies, avoiding reliance on pre-computed offline features. The framework incorporates constrained optimization to ensure feature diversity while preventing model collapse. Experiments on NTU RGB+D 60, NTU RGB+D 120 and PKU-MMD demonstrate the benefits of our approach: computational efficiency (with 6.2$\times$ faster training than standard masked skeleton modeling methods) and superior representation quality, achieving state-of-the-art performance in various downstream tasks.

[52] Teacher-Student Model for Detecting and Classifying Mitosis in the MIDOG 2025 Challenge

Seungho Choe,Xiaoli Qin,Abubakr Shafique,Amanda Dy,Susan Done,Dimitrios Androutsos,April Khademi

Main category: cs.CV

TL;DR: This paper proposes a teacher-student model using a UNet-based segmentation framework with domain generalization techniques to robustly detect mitotic figures and classify atypical mitoses, achieving promising results on a preliminary test set.

Details Motivation: The motivation is to address the time-intensive nature of mitotic figure counting for pathologists and the issue of inter-observer variability. Additionally, the paper aims to overcome challenges such as domain shift and imbalanced data in AI-based detection systems. Method: The authors use a teacher-student model with a UNet segmentation backbone that incorporates domain generalization modules, such as contrastive representation learning and domain-adversarial training. They also implement a multi-scale CNN classifier in a multi-task learning paradigm to handle detection and classification tasks. Result: On the preliminary test set, the algorithm achieved an F1 score of 0.7660 in Track 1 (mitosis detection) and a balanced accuracy of 0.8414 in Track 2 (atypical mitosis classification). Conclusion: The paper concludes that integrating segmentation-based detection and classification into a unified framework improves the robustness of mitosis analysis and effectively addresses domain shift issues. Abstract: Counting mitotic figures is time-intensive for pathologists and leads to inter-observer variability. Artificial intelligence (AI) promises a solution by automatically detecting mitotic figures while maintaining decision consistency. However, AI tools are susceptible to domain shift, where a significant drop in performance can occur due to differences in the training and testing sets, including morphological diversity between organs, species, and variations in staining protocols. Furthermore, the number of mitoses is much less than the count of normal nuclei, which introduces severely imbalanced data for the detection task. In this work, we formulate mitosis detection as a pixel-level segmentation and propose a teacher-student model that simultaneously addresses mitosis detection (Track 1) and atypical mitosis classification (Track 2). 
Our method is based on a UNet segmentation backbone that integrates domain generalization modules, namely contrastive representation learning and domain-adversarial training. A teacher-student strategy is employed to generate pixel-level pseudo-masks not only for annotated mitoses and hard negatives but also for normal nuclei, thereby enhancing feature discrimination and improving robustness against domain shift. For the classification task, we introduce a multi-scale CNN classifier that leverages feature maps from the segmentation model within a multi-task learning paradigm. On the preliminary test set, the algorithm achieved an F1 score of 0.7660 in Track 1 and balanced accuracy of 0.8414 in Track 2, demonstrating the effectiveness of integrating segmentation-based detection and classification into a unified framework for robust mitosis analysis.
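The two reported metrics suit the class imbalance described above. F1 ignores true negatives, which dominate when mitoses are rare, and balanced accuracy averages sensitivity and specificity so the majority class cannot mask poor minority-class recall. The following is a minimal sketch of the standard definitions; the confusion counts are made up for illustration and are not from the challenge.

```python
def f1_score(tp, fp, fn):
    """Harmonic mean of precision and recall, written as 2TP / (2TP + FP + FN)."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def balanced_accuracy(tp, fn, tn, fp):
    """Mean of sensitivity (recall on positives) and specificity (recall on negatives)."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return (sensitivity + specificity) / 2


def accuracy(tp, fn, tn, fp):
    """Plain accuracy, shown only for comparison."""
    return (tp + tn) / (tp + fn + tn + fp)


# Hypothetical imbalanced classification result: 10 atypical vs 100 normal cases.
counts = dict(tp=5, fn=5, tn=95, fp=5)
print(accuracy(**counts))           # looks high because negatives dominate
print(balanced_accuracy(**counts))  # exposes that half the positives were missed
print(f1_score(tp=5, fp=5, fn=5))
```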

[53] Multi Attribute Bias Mitigation via Representation Learning

Rajeev Ranjan Dwivedi,Ankur Kumar,Vinod K Kurmi

Main category: cs.CV

TL;DR: This paper introduces GMBM, an end-to-end framework for mitigating multiple biases in visual recognition, demonstrating superior performance under complex bias scenarios and distribution shifts.

Details Motivation: Real-world images often have overlapping biases that degrade model performance, and existing methods are insufficient when dealing with multiple biases simultaneously. Method: GMBM uses Adaptive Bias Integrated Learning (ABIL) to identify bias influences and Gradient Suppression Fine Tuning to reduce biases at test time; Scaled Bias Amplification (SBA) is introduced to measure bias. Result: GMBM improves worst-group accuracy, halves multi-attribute bias amplification, and achieves a new low in SBA across multiple datasets. Conclusion: The paper proposes GMBM, a two-stage framework for mitigating multiple biases in vision models, which outperforms existing methods in handling bias complexity and distribution shifts. Abstract: Real world images frequently exhibit multiple overlapping biases, including textures, watermarks, gendered makeup, scene object pairings, etc. These biases collectively impair the performance of modern vision models, undermining both their robustness and fairness. Addressing these biases individually proves inadequate, as mitigating one bias often permits or intensifies others. We tackle this multi bias problem with Generalized Multi Bias Mitigation (GMBM), a lean two stage framework that needs group labels only while training and minimizes bias at test time. First, Adaptive Bias Integrated Learning (ABIL) deliberately identifies the influence of known shortcuts by training encoders for each attribute and integrating them with the main backbone, compelling the classifier to explicitly recognize these biases. Then Gradient Suppression Fine Tuning prunes those very bias directions from the backbone's gradients, leaving a single compact network that ignores all the shortcuts it just learned to recognize. 
Moreover we find that existing bias metrics break under subgroup imbalance and train test distribution shifts, so we introduce Scaled Bias Amplification (SBA): a test time measure that disentangles model induced bias amplification from distributional differences. We validate GMBM on FB CMNIST, CelebA, and COCO, where we boost worst group accuracy, halve multi attribute bias amplification, and set a new low in SBA even as bias complexity and distribution shifts intensify, making GMBM the first practical, end to end multibias solution for visual recognition. Project page: http://visdomlab.github.io/GMBM/

[54] Lightweight image segmentation for echocardiography

Anders Kjelsrud,Lasse Løvstakken,Erik Smistad,Håvard Dalen,Gilles Van De Vyver

Main category: cs.CV

TL;DR: A lightweight U-Net model was developed that matches the performance of the more complex nnU-Net for left ventricle segmentation in echocardiography while being significantly smaller and faster, enabling potential real-time use.

Details Motivation: The motivation for this study was to overcome the limitations of existing models like nnU-Net, which are large and slow, thus not suitable for real-time applications. The goal was to develop a more efficient model for accurate segmentation of the left ventricle in echocardiography. Method: The researchers conducted an ablation study to evaluate the effectiveness of different components of the nnU-Net model, including data augmentation schemes, architectural modifications, loss functions, and post-processing techniques. Based on the insights gained, they developed a lightweight U-Net model. Result: The lightweight U-Net achieved statistically equivalent performance to nnU-Net on the CAMUS dataset with Dice scores of 0.93/0.85/0.89 for LV/MYO/LA compared to 0.93/0.86/0.89 for nnU-Net, while being 16 times smaller and 4 times faster. Cross-dataset evaluation confirmed comparable generalization. Conclusion: The study concluded that a lightweight U-Net model can achieve performance equivalent to the more complex and larger nnU-Net model for left ventricle segmentation in echocardiography while being significantly smaller and faster. Abstract: Accurate segmentation of the left ventricle in echocardiography can enable fully automatic extraction of clinical measurements such as volumes and ejection fraction. While models configured by nnU-Net perform well, they are large and slow, thus limiting real-time use. We identified the most effective components of nnU-Net for cardiac segmentation through an ablation study, incrementally evaluating data augmentation schemes, architectural modifications, loss functions, and post-processing techniques. Our analysis revealed that simple affine augmentations and deep supervision drive performance, while complex augmentations and large model capacity offer diminishing returns. 
Based on these insights, we developed a lightweight U-Net (2M vs 33M parameters) that achieves statistically equivalent performance to nnU-Net on CAMUS (N=500) with Dice scores of 0.93/0.85/0.89 vs 0.93/0.86/0.89 for LV/MYO/LA ($p>0.05$), while being 16 times smaller and 4 times faster (1.35ms vs 5.40ms per frame) than the default nnU-Net configuration. Cross-dataset evaluation on an internal dataset (N=311) confirms comparable generalization.
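The Dice scores quoted above measure overlap between a predicted mask and a reference mask, computed as twice the intersection divided by the sum of the two mask sizes. Below is a minimal sketch of that definition for binary masks stored as nested lists; it is not the authors' evaluation code, and the toy masks are made up.

```python
def dice_score(pred, target):
    """Dice = 2|A ∩ B| / (|A| + |B|) for two binary masks of equal shape."""
    p = [bool(v) for row in pred for v in row]
    t = [bool(v) for row in target for v in row]
    if len(p) != len(t):
        raise ValueError("masks must have the same number of pixels")
    intersection = sum(a and b for a, b in zip(p, t))
    total = sum(p) + sum(t)
    # Two empty masks agree perfectly by convention.
    return 1.0 if total == 0 else 2 * intersection / total


# Toy 2x3 masks: three foreground pixels each, two of them shared.
pred = [[1, 1, 0],
        [0, 1, 0]]
target = [[1, 0, 0],
          [0, 1, 1]]
print(dice_score(pred, target))
```

For a multi-class result such as LV/MYO/LA, the same function would be applied once per structure to the corresponding binary mask. The quoted speed-up also follows directly from the per-frame timings: 5.40 ms / 1.35 ms = 4.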

[55] treeX: Unsupervised Tree Instance Segmentation in Dense Forest Point Clouds

Josafat-Mattias Burmeister,Andreas Tockner,Stefan Reder,Markus Engel,Rico Richter,Jan-Peter Mund,Jürgen Döllner

Main category: cs.CV

TL;DR: The paper presents a revised treeX algorithm for efficient tree instance segmentation in 3D forest data, offering improved performance and open-source implementation for resource-efficient use compared to deep learning methods.

Details Motivation: Deep learning methods for tree instance segmentation require large annotated datasets and significant computational resources; the revised treeX algorithm offers a more resource-efficient alternative. Method: The revised treeX algorithm uses clustering-based stem detection and region growing for crown delineation, with parameter presets for different laser scanning methods. Result: The revised treeX algorithm reduces runtime and improves accuracy compared to the original, with significant F1-score gains for ground-based data and successful segmentation for UAV-borne data. Conclusion: The revised treeX algorithm is a resource-efficient alternative to deep learning methods for tree instance segmentation in 3D point cloud data, offering improved performance and potential applications for label generation. Abstract: Close-range laser scanning provides detailed 3D captures of forest stands but requires efficient software for processing 3D point cloud data and extracting individual trees. Although recent studies have introduced deep learning methods for tree instance segmentation, these approaches require large annotated datasets and substantial computational resources. As a resource-efficient alternative, we present a revised version of the treeX algorithm, an unsupervised method that combines clustering-based stem detection with region growing for crown delineation. While the original treeX algorithm was developed for personal laser scanning (PLS) data, we provide two parameter presets, one for ground-based laser scanning (stationary terrestrial - TLS and PLS), and one for UAV-borne laser scanning (ULS). We evaluated the method on six public datasets (FOR-instance, ForestSemantic, LAUTx, NIBIO MLS, TreeLearn, Wytham Woods) and compared it to six open-source methods (original treeX, treeiso, RayCloudTools, ForAINet, SegmentAnyTree, TreeLearn). 
Compared to the original treeX algorithm, our revision reduces runtime and improves accuracy, with instance detection F$_1$-score gains of +0.11 to +0.49 for ground-based data. For ULS data, our preset achieves an F$_1$-score of 0.58, whereas the original algorithm fails to segment any correct instances. For TLS and PLS data, our algorithm achieves accuracy similar to recent open-source methods, including deep learning. Given its algorithmic design, we see two main applications for our method: (1) as a resource-efficient alternative to deep learning approaches in scenarios where the data characteristics align with the method design (sufficient stem visibility and point density), and (2) for the semi-automatic generation of labels for deep learning models. To enable broader adoption, we provide an open-source Python implementation in the pointtree package.

[56] Reg3D: Reconstructive Geometry Instruction Tuning for 3D Scene Understanding

Hongpei Zheng,Lintao Xiang,Qijun Yang,Qian Lin,Hujun Yin

Main category: cs.CV

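TL;DR: Reg3D is a new Reconstructive Geometry Instruction Tuning framework that incorporates geometry-aware supervision into training, addressing the difficulty existing methods have in learning robust 3D spatial representations.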

Details Motivation: Existing Large Multimodal Models (LMMs) rely mainly on text-only supervision, which cannot provide the geometric constraints needed to learn robust 3D spatial representations. Method: Reg3D adopts a dual-supervision paradigm that uses 3D geometric information both as input and as an explicit learning target, with object-level and frame-level reconstruction tasks designed to encourage spatial reasoning. Result: Experiments on ScanQA, Scan2Cap, ScanRefer, and SQA3D show that Reg3D substantially improves performance, establishing a new training paradigm for spatially aware multimodal models. Conclusion: By reconstructing the underlying geometric structure, Reg3D effectively improves 3D scene understanding and offers a new training paradigm for multimodal models. Abstract: The rapid development of Large Multimodal Models (LMMs) has led to remarkable progress in 2D visual understanding; however, extending these capabilities to 3D scene understanding remains a significant challenge. Existing approaches predominantly rely on text-only supervision, which fails to provide the geometric constraints required for learning robust 3D spatial representations. In this paper, we introduce Reg3D, a novel Reconstructive Geometry Instruction Tuning framework that addresses this limitation by incorporating geometry-aware supervision directly into the training process. Our key insight is that effective 3D understanding necessitates reconstructing underlying geometric structures rather than merely describing them. Unlike existing methods that inject 3D information solely at the input level, Reg3D adopts a dual-supervision paradigm that leverages 3D geometric information both as input and as explicit learning targets. Specifically, we design complementary object-level and frame-level reconstruction tasks within a dual-encoder architecture, enforcing geometric consistency to encourage the development of spatial reasoning capabilities. Extensive experiments on ScanQA, Scan2Cap, ScanRefer, and SQA3D demonstrate that Reg3D delivers substantial performance improvements, establishing a new training paradigm for spatially aware multimodal models.

[57] QuantV2X: A Fully Quantized Multi-Agent System for Cooperative Perception

Seth Z. Zhao,Huizhi Zhang,Zhaowei Li,Juntong Peng,Anthony Chui,Zewei Zhou,Zonglin Meng,Hao Xiang,Zhiyu Huang,Fujia Wang,Ran Tian,Chenfeng Xu,Bolei Zhou,Jiaqi Ma

Main category: cs.CV

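TL;DR: QuantV2X is an efficient multi-agent V2X cooperative perception system that uses quantization to cut computation and transmission costs while maintaining accuracy and improving system-level performance and scalability.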

Details Motivation: Existing V2X cooperative perception systems rely mainly on full-precision models, which incur high computational and transmission costs and are hard to run in real time in resource-constrained environments. Method: The authors introduce a unified end-to-end quantization strategy covering both the neural network models and the transmitted message representations, reducing computational load and transmission bandwidth. Result: Under low-bit constraints, QuantV2X achieves accuracy comparable to full-precision systems, reduces system-level latency by 3.2 times, improves mAP30 by +9.5 over full-precision baselines under deployment-oriented metrics, and scales more effectively within strict memory budgets. Conclusion: QuantV2X is a fully quantized multi-agent system designed for efficient and scalable deployment of multi-modal, multi-agent V2X cooperative perception, and the results support its viability for real-world deployment. Abstract: Cooperative perception through Vehicle-to-Everything (V2X) communication offers significant potential for enhancing vehicle perception by mitigating occlusions and expanding the field of view. However, past research has predominantly focused on improving accuracy metrics without addressing the crucial system-level considerations of efficiency, latency, and real-world deployability. Notably, most existing systems rely on full-precision models, which incur high computational and transmission costs, making them impractical for real-time operation in resource-constrained environments. In this paper, we introduce QuantV2X, the first fully quantized multi-agent system designed specifically for efficient and scalable deployment of multi-modal, multi-agent V2X cooperative perception. QuantV2X introduces a unified end-to-end quantization strategy across both neural network models and transmitted message representations that simultaneously reduces computational load and transmission bandwidth. Remarkably, despite operating under low-bit constraints, QuantV2X achieves accuracy comparable to full-precision systems. More importantly, when evaluated under deployment-oriented metrics, QuantV2X reduces system-level latency by 3.2$\times$ and achieves a +9.5 improvement in mAP30 over full-precision baselines. Furthermore, QuantV2X scales more effectively, enabling larger and more capable models to fit within strict memory budgets. These results highlight the viability of a fully quantized multi-agent intermediate fusion system for real-world deployment. The system will be publicly released to promote research in this field: https://github.com/ucla-mobility/QuantV2X.
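To illustrate how quantizing a transmitted feature message trades precision for bandwidth, here is a minimal sketch of generic uniform (affine) quantization. It is not QuantV2X's quantization scheme, which the abstract does not specify; the function names, the 4-bit setting, and the feature values are assumptions chosen for the example.

```python
def quantize(values, num_bits=8):
    """Map floats to integer codes in [0, 2**num_bits - 1] with a shared scale and offset."""
    qmax = 2 ** num_bits - 1
    lo, hi = min(values), max(values)
    scale = (hi - lo) / qmax if hi > lo else 1.0
    codes = [round((v - lo) / scale) for v in values]
    return codes, scale, lo


def dequantize(codes, scale, lo):
    """Recover approximate floats from integer codes."""
    return [c * scale + lo for c in codes]


# Hypothetical slice of an intermediate feature map sent between agents.
features = [-1.0, -0.25, 0.0, 0.5, 2.0]
codes, scale, lo = quantize(features, num_bits=4)
restored = dequantize(codes, scale, lo)

print(codes)     # integers that fit in 4 bits each
print(restored)  # each value is within scale / 2 of the original
```

In this toy setting each value is sent as a 4-bit code instead of a 32-bit float, an 8-fold reduction in payload if the shared scale and offset are ignored. The reconstruction error is bounded by half the quantization step.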

[58] Transfer Learning-Based CNN Models for Plant Species Identification Using Leaf Venation Patterns

Bandita Bharadwaj,Ankur Mishra,Saurav Bharadwaj

Main category: cs.CV

TL;DR: The study compares three deep learning models for plant classification based on leaf venation; EfficientNetB0 performs best.

Details Motivation: Leaf venation patterns are a key morphological feature with high taxonomic relevance, and the study aims to assess how different deep learning architectures perform on this classification task. Method: The study evaluates three deep learning architectures (ResNet50, MobileNetV2, and EfficientNetB0) for automated plant species classification based on leaf venation patterns, assessing the models with standard performance metrics. Result: ResNet50 performed well in training but overfit at test time; MobileNetV2 generalized better; EfficientNetB0 performed best, with a testing accuracy of 94.67% and precision, recall, and F1 scores all above 94.6%. Conclusion: Deep learning models, particularly EfficientNetB0, show potential as scalable and accurate tools for venation-based plant classification. Abstract: This study evaluates the efficacy of three deep learning architectures: ResNet50, MobileNetV2, and EfficientNetB0 for automated plant species classification based on leaf venation patterns, a critical morphological feature with high taxonomic relevance. Using the Swedish Leaf Dataset comprising images from 15 distinct species (75 images per species, totalling 1,125 images), the models were evaluated using standard performance metrics during training and testing phases. ResNet50 achieved a training accuracy of 94.11% but exhibited overfitting, reflected by a reduced testing accuracy of 88.45% and an F1 score of 87.82%. MobileNetV2 demonstrated better generalization capabilities, attaining a testing accuracy of 93.34% and an F1 score of 93.23%, indicating its suitability for lightweight, real-time applications. EfficientNetB0 outperformed both models, achieving a testing accuracy of 94.67% with precision, recall, and F1 scores exceeding 94.6%, highlighting its robustness in venation-based classification. The findings underscore the potential of deep learning, particularly EfficientNetB0, in developing scalable and accurate tools for automated plant taxonomy using venation traits.

[59] LayoutGKN: Graph Similarity Learning of Floor Plans

Casper van Engelenburg,Jan van Gemert,Seyran Khademi

Main category: cs.CV

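TL;DR: This paper proposes LayoutGKN, an efficient graph comparison method that postpones cross-graph node interactions and uses a graph kernel, increasing inference speed without sacrificing performance.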

Details Motivation: Existing graph matching networks require costly intermediate cross-graph node-level interactions, which makes inference slow, so a more efficient alternative is needed. Method: The authors use a differentiable graph kernel as a distance function on the final learned node-level embeddings, postponing the cross-graph node interactions to the end. Result: LayoutGKN matches or improves on the similarity performance of graph matching networks while significantly increasing inference speed. Conclusion: LayoutGKN offers a more efficient way to compare graphs by deferring cross-graph node-level interactions to the end of the architecture while preserving the accuracy of the similarity computation. Abstract: Floor plans depict building layouts and are often represented as graphs to capture the underlying spatial relationships. Comparison of these graphs is critical for applications like search, clustering, and data visualization. The most successful methods to compare graphs, i.e., graph matching networks, rely on costly intermediate cross-graph node-level interactions and are therefore slow at inference time. We introduce LayoutGKN, a more efficient approach that postpones the cross-graph node-level interactions to the end of the joint embedding architecture. We do so by using a differentiable graph kernel as a distance function on the final learned node-level embeddings. We show that LayoutGKN computes similarity comparably or better than graph matching networks while significantly increasing the speed. Code and data are open: https://github.com/caspervanengelenburg/LayoutGKN.

[60] Singular Value Few-shot Adaptation of Vision-Language Models

Taha Koleilat,Hassan Rivaz,Yiming Xiao

Main category: cs.CV

TL;DR: CLIP-SVD is a multi-modal and parameter-efficient adaptation method that modifies CLIP's internal parameter space using SVD, achieving strong performance with minimal parameter changes.

Details Motivation: The motivation is to address the challenges of adapting vision-language models like CLIP to new fine-grained domains efficiently without relying on prompt engineering or full model fine-tuning. Method: CLIP-SVD uses Singular Value Decomposition (SVD) to fine-tune only the singular values of CLIP's parameter matrices, enabling domain adaptation while retaining the pretrained model. Result: CLIP-SVD achieves state-of-the-art classification results on 11 natural and 10 biomedical datasets, using only 0.04% of the model's total parameters and outperforming previous methods in accuracy and generalization. Conclusion: The paper concludes that CLIP-SVD is a parameter-efficient adaptation technique that effectively improves adaptation performance and generalization ability without compromising the model's pretraining knowledge. Abstract: Vision-language models (VLMs) like CLIP have shown impressive zero-shot and few-shot learning capabilities across diverse applications. However, adapting these models to new fine-grained domains remains difficult due to reliance on prompt engineering and the high cost of full model fine-tuning. Existing adaptation approaches rely on augmented components, such as prompt tokens and adapter modules, which could limit adaptation quality, destabilize the model, and compromise the rich knowledge learned during pretraining. In this work, we present CLIP-SVD, a novel multi-modal and parameter-efficient adaptation technique that leverages Singular Value Decomposition (SVD) to modify the internal parameter space of CLIP without injecting additional modules. Specifically, we fine-tune only the singular values of the CLIP parameter matrices to rescale the basis vectors for domain adaptation while retaining the pretrained model. This design enables enhanced adaptation performance using only 0.04% of the model's total parameters and better preservation of its generalization ability. 
CLIP-SVD achieves state-of-the-art classification results on 11 natural and 10 biomedical datasets, outperforming previous methods in both accuracy and generalization under few-shot settings. Additionally, we leverage a natural language-based approach to analyze the effectiveness and dynamics of the CLIP adaptation to allow interpretability of CLIP-SVD. The code is publicly available at https://github.com/HealthX-Lab/CLIP-SVD.
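
Fine-tuning only the singular values of a weight matrix can be illustrated on a single toy matrix. The sketch below is not the CLIP-SVD code: the matrix shape, the least-squares toy loss, the step-size rule, and all variable names are assumptions chosen so the example runs on its own with NumPy. It decomposes a stand-in weight matrix, freezes the singular vectors, and updates only the singular values by gradient descent.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for one pretrained weight matrix (shape chosen arbitrarily).
W = rng.normal(size=(6, 4))
U, s, Vt = np.linalg.svd(W, full_matrices=False)  # W == U @ diag(s) @ Vt


def reconstruct(singular_values):
    """Rebuild the weight matrix from frozen U, Vt and the given singular values."""
    return (U * singular_values) @ Vt


# Toy regression target standing in for a downstream task loss (assumption).
x = rng.normal(size=4)
y_target = rng.normal(size=6)


def loss(singular_values):
    residual = reconstruct(singular_values) @ x - y_target
    return 0.5 * float(residual @ residual)


def grad_wrt_singular_values(singular_values):
    # d/ds_k of 0.5 * ||U diag(s) Vt x - y||^2 = (U^T residual)_k * (Vt x)_k
    residual = reconstruct(singular_values) @ x - y_target
    return (U.T @ residual) * (Vt @ x)


# Gradient descent that updates only the singular values; U and Vt stay fixed.
v = Vt @ x
step = 1.0 / (np.max(v**2) + 1e-12)  # safe step size for this quadratic toy loss
s_tuned = s.copy()
for _ in range(20):
    s_tuned = s_tuned - step * grad_wrt_singular_values(s_tuned)

trainable = s.size  # number of values that are updated
total = W.size      # number of entries in the full matrix
```

In this toy setting 4 of 24 values are trainable; the ratio shrinks further for large matrices, since a matrix with m rows and n columns has only min(m, n) singular values. The 0.04% figure in the abstract refers to the full CLIP model and is not reproduced by this example.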

[61] STA-Net: A Decoupled Shape and Texture Attention Network for Lightweight Plant Disease Classification

Zongsen Qiu

Main category: cs.CV

TL;DR: STA-Net addresses challenges in deploying high-precision plant disease diagnosis on edge devices by combining DeepMAD and STAM for improved shape and texture feature capture.

Details Motivation: The need for high-precision plant disease diagnosis on edge devices due to global food security concerns. Method: A twofold solution: DeepMAD for creating an efficient network backbone and STAM for capturing shape and texture features using DCNv4 and Gabor filters. Result: STA-Net achieved 89.00% accuracy and an F1 score of 88.96% on the CCMT dataset, with ablation studies confirming STAM's performance improvement. Conclusion: The proposed STA-Net model, incorporating DeepMAD and STAM, presents a promising approach for deploying high-precision plant disease diagnosis on edge devices. Abstract: Responding to rising global food security needs, precision agriculture and deep learning-based plant disease diagnosis have become crucial. Yet, deploying high-precision models on edge devices is challenging. Most lightweight networks use attention mechanisms designed for generic object recognition, which poorly capture subtle pathological features like irregular lesion shapes and complex textures. To overcome this, we propose a twofold solution: first, using a training-free neural architecture search method (DeepMAD) to create an efficient network backbone for edge devices; second, introducing the Shape-Texture Attention Module (STAM). STAM splits attention into two branches -- one using deformable convolutions (DCNv4) for shape awareness and the other using a Gabor filter bank for texture awareness. On the public CCMT plant disease dataset, our STA-Net model (with 401K parameters and 51.1M FLOPs) reached 89.00% accuracy and an F1 score of 88.96%. Ablation studies confirm STAM significantly improves performance over baseline and standard attention models. Integrating domain knowledge via decoupled attention thus presents a promising path for edge-deployed precision agriculture AI. The source code is available at https://github.com/RzMY/STA-Net.
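
The texture branch described above relies on a Gabor filter bank. A minimal sketch of such a bank, using the standard real-valued Gabor formula, is given below. The kernel size, wavelength, sigma, aspect ratio, and number of orientations are illustrative assumptions, not the settings used in STA-Net, and the DCNv4 shape branch is not covered.

```python
import numpy as np


def gabor_kernel(size, theta, wavelength=4.0, sigma=2.0, gamma=0.5, phase=0.0):
    """Real part of a 2D Gabor filter oriented at angle theta (radians).

    size is assumed to be odd so the kernel has a well-defined center.
    """
    half = size // 2
    ys, xs = np.mgrid[-half:half + 1, -half:half + 1].astype(float)
    # Rotate the coordinate frame so the sinusoid runs along direction theta.
    x_rot = xs * np.cos(theta) + ys * np.sin(theta)
    y_rot = -xs * np.sin(theta) + ys * np.cos(theta)
    # Gaussian envelope (elongated by gamma) times a cosine carrier.
    envelope = np.exp(-(x_rot**2 + (gamma * y_rot) ** 2) / (2.0 * sigma**2))
    carrier = np.cos(2.0 * np.pi * x_rot / wavelength + phase)
    return envelope * carrier


def gabor_bank(size=7, n_orientations=4):
    """Stack of Gabor kernels at evenly spaced orientations in [0, pi)."""
    thetas = [k * np.pi / n_orientations for k in range(n_orientations)]
    return np.stack([gabor_kernel(size, t) for t in thetas])


bank = gabor_bank()  # shape: (orientations, height, width)
```

Each kernel responds most strongly to stripes of a particular orientation and spacing, which is why a bank of them is a common hand-crafted texture descriptor; in a network such kernels could initialize or replace learned convolution weights.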

[62] SLENet: A Guidance-Enhanced Network for Underwater Camouflaged Object Detection

Xinxin Wang,Han Sun,Ningzhong Liu,Huiyu Zhou,Yinan Yao

Main category: cs.CV

TL;DR: This paper introduces the UCOD task and proposes SLENet along with the DeepCamo dataset to address challenges in detecting underwater camouflaged objects, achieving superior performance over existing methods.

Details Motivation: Underwater Camouflaged Object Detection (UCOD) is a critical task for marine ecology that remains underexplored due to challenges such as optical distortions, water turbidity, and the complex traits of marine organisms. Method: The authors introduced the UCOD task and presented DeepCamo, a benchmark dataset for this domain. They proposed a novel framework called SLENet, which includes a Gamma-Asymmetric Enhancement (GAE) module, a Localization Guidance Branch (LGB), and a Multi-Scale Supervised Decoder (MSSD). Result: Experiments on the DeepCamo dataset and three benchmark COD datasets confirm that SLENet outperforms state-of-the-art methods and demonstrates high generality for the broader COD task. Conclusion: SLENet demonstrates superior performance on the DeepCamo dataset and other benchmark COD datasets, showing its effectiveness and generality in addressing the UCOD task. Abstract: Underwater Camouflaged Object Detection (UCOD) aims to identify objects that blend seamlessly into underwater environments. This task is critically important to marine ecology. However, it remains largely underexplored and accurate identification is severely hindered by optical distortions, water turbidity, and the complex traits of marine organisms. To address these challenges, we introduce the UCOD task and present DeepCamo, a benchmark dataset designed for this domain. We also propose Semantic Localization and Enhancement Network (SLENet), a novel framework for UCOD. We first benchmark state-of-the-art COD models on DeepCamo to reveal key issues, upon which SLENet is built. In particular, we incorporate a Gamma-Asymmetric Enhancement (GAE) module and a Localization Guidance Branch (LGB) to enhance multi-scale feature representation while generating a location map enriched with global semantic information. This map guides the Multi-Scale Supervised Decoder (MSSD) to produce more accurate predictions. 
Experiments on our DeepCamo dataset and three benchmark COD datasets confirm SLENet's superior performance over SOTA methods, and underscore its high generality for the broader COD task.

[63] Fitting Image Diffusion Models on Video Datasets

Juhun Lee,Simon S. Woo

Main category: cs.CV

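TL;DR: This paper introduces a new method that improves diffusion model training using the temporal inductive bias of continuous video frames; it requires no architectural modification and yields faster convergence, greater generative diversity, and lower FID.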

Details Motivation: Image diffusion models are trained on independently sampled static images, a design that is information-deficient for capturing the temporal world and leads to slower convergence, limited distributional coverage, and reduced generalization. Method: The paper proposes a new training strategy that leverages the temporal inductive bias in continuous video frames to improve diffusion training. Result: On the HandCo dataset, the method accelerates convergence by more than 2x and achieves lower FID on both training and validation distributions. It also improves generative diversity by encouraging the model to capture meaningful temporal variations. Conclusion: The paper presents a simple and effective method that uses the temporal inductive bias between video frames to improve diffusion training; it integrates into standard diffusion training pipelines without architectural changes, and an optimization analysis shows that the proposed regularization reduces gradient variance, which speeds up convergence. Abstract: Image diffusion models are trained on independently sampled static images. While this is the bedrock task protocol in generative modeling, capturing the temporal world through the lens of static snapshots is information-deficient by design. This limitation leads to slower convergence, limited distributional coverage, and reduced generalization. In this work, we propose a simple and effective training strategy that leverages the temporal inductive bias present in continuous video frames to improve diffusion training. Notably, the proposed method requires no architectural modification and can be seamlessly integrated into standard diffusion training pipelines. We evaluate our method on the HandCo dataset, where hand-object interactions exhibit dense temporal coherence and subtle variations in finger articulation often result in semantically distinct motions. Empirically, our method accelerates convergence by over 2x and achieves lower FID on both training and validation distributions. It also improves generative diversity by encouraging the model to capture meaningful temporal variations. We further provide an optimization analysis showing that our regularization reduces the gradient variance, which contributes to faster convergence.

[64] MedVista3D: Vision-Language Modeling for Reducing Diagnostic Errors in 3D CT Disease Detection, Understanding and Reporting

Yuheng Li,Yenho Chen,Yuxiang Lai,Jike Zhong,Vanessa Wildman,Xiaofeng Yang

Main category: cs.CV

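TL;DR: MedVista3D is a multi-scale semantic-enriched vision-language pretraining framework for 3D CT analysis that combines local and global image-text alignment with semantic matching, significantly improving the accuracy and applicability of medical image analysis.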

Details Motivation: Radiologic diagnostic errors stem mainly from missed localized abnormalities, limited global context, and inconsistent report language, while existing 3D vision-language models cannot jointly meet the needs of localized detection, global reasoning, and natural language reporting. Method: MedVista3D uses local and global image-text alignment, combined with language model rewrites and a Radiology Semantic Matching Bank, for fine-grained representation learning and semantics-aware alignment. Result: MedVista3D achieves state-of-the-art performance on zero-shot disease classification, report retrieval, and medical visual question answering, and transfers well to organ segmentation and prognosis prediction. Conclusion: Through its multi-scale semantic-enriched vision-language pretraining framework, MedVista3D addresses local-global understanding and semantic consistency in 3D medical image analysis, showing state-of-the-art performance and broad application potential across medical tasks. Abstract: Radiologic diagnostic errors (under-reading errors, inattentional blindness, and communication failures) remain prevalent in clinical practice. These issues often stem from missed localized abnormalities, limited global context, and variability in report language. These challenges are amplified in 3D imaging, where clinicians must examine hundreds of slices per scan. Addressing them requires systems with precise localized detection, global volume-level reasoning, and semantically consistent natural language reporting. However, existing 3D vision-language models are unable to meet all three needs jointly, lacking local-global understanding for spatial reasoning and struggling with the variability and noise of uncurated radiology reports. We present MedVista3D, a multi-scale semantic-enriched vision-language pretraining framework for 3D CT analysis. To enable joint disease detection and holistic interpretation, MedVista3D performs local and global image-text alignment for fine-grained representation learning within full-volume context. To address report variability, we apply language model rewrites and introduce a Radiology Semantic Matching Bank for semantics-aware alignment. MedVista3D achieves state-of-the-art performance on zero-shot disease classification, report retrieval, and medical visual question answering, while transferring well to organ segmentation and prognosis prediction. Code and datasets will be released.

[65] Causality-guided Prompt Learning for Vision-language Models via Visual Granulation

Mengyu Gao,Qiulei Dong

Main category: cs.CV

TL;DR: CaPL is a novel prompt learning method that enhances CLIP's performance on fine-grained datasets through causality-guided visual granulation.

Details Motivation: Existing CLIP-based prompt learning methods have limited performance on fine-grained datasets, which the CaPL method aims to improve. Method: CaPL uses a causality-guided text prompt learning approach with visual granulation, including an attribute disentanglement module and a granule learning module. Result: Extensive experiments on 15 datasets show that CaPL achieves superior performance over state-of-the-art methods. Conclusion: The proposed CaPL method significantly outperforms existing prompt learning methods, especially on fine-grained datasets. Abstract: Prompt learning has recently attracted much attention for adapting pre-trained vision-language models (e.g., CLIP) to downstream recognition tasks. However, most of the existing CLIP-based prompt learning methods only show a limited ability for handling fine-grained datasets. To address this issue, we propose a causality-guided text prompt learning method via visual granulation for CLIP, called CaPL, where the explored visual granulation technique could construct sets of visual granules for the text prompt to capture subtle discrepancies among different fine-grained classes through causal inference. The CaPL method contains the following two modules: (1) An attribute disentanglement module is proposed to decompose visual features into non-individualized attributes (shared by some classes) and individualized attributes (specific to single classes) using a Brownian Bridge Diffusion Model; (2) A granule learning module is proposed to construct visual granules by integrating the aforementioned attributes for recognition under two causal inference strategies. Thanks to the learned visual granules, a more discriminative text prompt is expected to be learned. Extensive experimental results on 15 datasets demonstrate that our CaPL method significantly outperforms the state-of-the-art prompt learning methods, especially on fine-grained datasets.

[66] EGTM: Event-guided Efficient Turbulence Mitigation

Huanan Li,Rui Fan,Juntao Guan,Weidong Hao,Lai Rui,Tong Wu,Yikai Wang,Lin Gu

Main category: cs.CV

TL;DR: This paper introduces an efficient EGTM framework using event cameras for turbulence mitigation, offering significant improvements in speed, model efficiency, and image restoration quality over existing methods.

Details Motivation: The motivation is to overcome the limitations of existing deep-learning TM methods that require high computational resources and are inefficient due to reliance on synchronous frames with limited frame rates. Method: The authors introduced an 'event-lucky insight' and proposed the EGTM framework, which uses event streams for pixel-level turbulence-free guidance in temporal lucky fusion. Result: The EGTM framework outperforms existing methods by 710 times in model size, 214 times in inference latency, and 224 times in model complexity while improving restoration quality (by +0.94 PSNR and +0.08 SSIM). Conclusion: The paper concludes that using event cameras with the proposed EGTM framework significantly improves turbulence mitigation in terms of efficiency and restoration quality. Abstract: Turbulence mitigation (TM) aims to remove the stochastic distortions and blurs introduced by atmospheric turbulence into frame cameras. Existing state-of-the-art deep-learning TM methods extract turbulence cues from multiple degraded frames to find the so-called "lucky" (undistorted) patch for "lucky fusion". However, this requires a high-capacity network to learn from coarse-grained turbulence dynamics between synchronous frames with limited frame rate, and thus falls short in computational and storage efficiency. Event cameras, with microsecond-level temporal resolution, have the potential to fundamentally address this bottleneck with their efficient sparse and asynchronous imaging mechanism. In light of this, we (i) present the fundamental "event-lucky insight" to reveal the correlation between turbulence distortions and the inverse spatiotemporal distribution of event streams. Then, building upon this insight, we (ii) propose a novel EGTM framework that extracts pixel-level reliable turbulence-free guidance from the explicit but noisy turbulent events for temporal lucky fusion. 
Moreover, we (iii) build the first turbulence data acquisition system to contribute the first real-world event-driven TM dataset. Extensive experimental results demonstrate that our approach significantly surpasses the existing SOTA TM method by 710 times, 214 times and 224 times in model size, inference latency and model complexity respectively, while achieving state-of-the-art restoration quality (+0.94 PSNR and +0.08 SSIM) on our real-world EGTM dataset. This demonstrates the great efficiency merit of introducing the event modality into the TM task. Demo code and data have been uploaded in supplementary material and will be released once accepted.

[67] Focus Through Motion: RGB-Event Collaborative Token Sparsification for Efficient Object Detection

Nan Yang,Yang Wang,Zhanwen Liu,Yuchao Dai,Yang Liu,Xiangmo Zhao

Main category: cs.CV

TL;DR: FocusMamba improves the accuracy and efficiency of RGB-Event detection through adaptive sparsification and cross-modal fusion.

Details Motivation: Existing RGB-Event detection methods process low-information regions uniformly during feature extraction and fusion, leading to high computational cost and suboptimal performance. In addition, token sparsification methods with a fixed threshold cannot retain important information according to sample complexity. Method: The paper proposes FocusMamba, which comprises an Event-Guided Multimodal Sparsification (EGMS) strategy and a Cross-Modality Focus Fusion (CMFF) module to achieve adaptive sparsification and efficient fusion of multimodal features. Result: Experiments show that FocusMamba outperforms existing methods on both the DSEC-Det and PKU-DAVIS-SOD datasets. Conclusion: FocusMamba processes multimodal features more accurately and efficiently, improving on existing methods in both accuracy and efficiency. Abstract: Existing RGB-Event detection methods process the low-information regions of both modalities (background in images and non-event regions in event data) uniformly during feature extraction and fusion, resulting in high computational costs and suboptimal performance. To mitigate the computational redundancy during feature extraction, researchers have respectively proposed token sparsification methods for the image and event modalities. However, these methods employ a fixed number or threshold for token selection, hindering the retention of informative tokens for samples with varying complexity. To achieve a better balance between accuracy and efficiency, we propose FocusMamba, which performs adaptive collaborative sparsification of multimodal features and efficiently integrates complementary information. Specifically, an Event-Guided Multimodal Sparsification (EGMS) strategy is designed to identify and adaptively discard low-information regions within each modality by leveraging scene content changes perceived by the event camera. Based on the sparsification results, a Cross-Modality Focus Fusion (CMFF) module is proposed to effectively capture and integrate complementary features from both modalities. Experiments on the DSEC-Det and PKU-DAVIS-SOD datasets demonstrate that the proposed method achieves superior performance in both accuracy and efficiency compared to existing methods. The code will be available at https://github.com/Zizzzzzzz/FocusMamba.

[68] SalientFusion: Context-Aware Compositional Zero-Shot Food Recognition

Jiajun Song,Xiaoou Liu

Main category: cs.CV

TL;DR: This paper introduces the Compositional Zero-Shot Food Recognition task and proposes SalientFusion, a method that effectively addresses challenges like background redundancy, role confusion, and semantic bias, achieving top results on benchmarks and datasets.

Details Motivation: The need for recognizing unseen food categories due to the rapid emergence of new dishes motivates the development of Zero-Shot Food Learning, leading to the proposed Compositional Zero-Shot Food Recognition task. Method: The proposed SalientFusion method includes SalientFormer to remove background redundancy and resolve role confusion using depth features, and DebiasAT to reduce semantic bias by aligning prompts with visual features. Result: SalientFusion achieves state-of-the-art results on the proposed benchmarks CZSFood-90 and CZSFood-164, as well as on popular general datasets for Compositional Zero-Shot Learning. Conclusion: SalientFusion is an effective method for Compositional Zero-Shot Food Recognition, addressing challenges like background redundancy, role confusion, and semantic bias, achieving state-of-the-art results. Abstract: Food recognition has gained significant attention, but the rapid emergence of new dishes requires methods for recognizing unseen food categories, motivating Zero-Shot Food Learning (ZSFL). We propose the task of Compositional Zero-Shot Food Recognition (CZSFR), where cuisines and ingredients naturally align with attributes and objects in Compositional Zero-Shot learning (CZSL). However, CZSFR faces three challenges: (1) Redundant background information distracts models from learning meaningful food features, (2) Role confusion between staple and side dishes leads to misclassification, and (3) Semantic bias in a single attribute can lead to confusion of understanding. Therefore, we propose SalientFusion, a context-aware CZSFR method with two components: SalientFormer, which removes background redundancy and uses depth features to resolve role confusion; DebiasAT, which reduces the semantic bias by aligning prompts with visual features. 
Using our proposed benchmarks, CZSFood-90 and CZSFood-164, we show that SalientFusion achieves state-of-the-art results on these benchmarks and on the most popular general CZSL datasets. The code is available at https://github.com/Jiajun-RUC/SalientFusion.

[69] Human Motion Video Generation: A Survey

Haiwei Xue,Xiangyang Luo,Zhanghao Hu,Xin Zhang,Xunzhi Xiang,Yuqin Dai,Jianzhuang Liu,Zhensong Zhang,Minglei Li,Jian Yang,Fei Ma,Zhiyong Wu,Changpeng Yang,Zonghong Dai,Fei Richard Yu

Main category: cs.CV

TL;DR: This paper presents a comprehensive survey of human motion video generation, covering over two hundred studies and identifying five key phases in the process. It is the first to explore the use of large language models in this field and aims to serve as a valuable resource for future research and development in digital human applications.

Details Motivation: The motivation behind this research is the growing interest in human motion video generation and its diverse applications, such as photorealistic singing heads and dynamic avatars. Despite existing surveys, there was a lack of comprehensive overviews of the entire generative process, which this paper aims to address. Method: The paper adopts a survey methodology, analyzing over two hundred papers on human motion video generation. It categorizes the generative process into five key phases: input, motion planning, motion video generation, refinement, and output. The survey also discusses the role of large language models in enhancing this process. Result: The paper provides an in-depth survey of human motion video generation, identifying over ten sub-tasks and reviewing the latest developments across vision, text, and audio modalities. It highlights milestone works that have driven technological breakthroughs and is the first survey to explore the potential of large language models in this domain. Conclusion: The paper concludes that human motion video generation has promising prospects and can significantly advance the applications of digital humans. It emphasizes the importance of large language models and provides a comprehensive resource for future research and development in the field. Abstract: Human motion video generation has garnered significant research interest due to its broad applications, enabling innovations such as photorealistic singing heads or dynamic avatars that seamlessly dance to music. However, existing surveys in this field focus on individual methods, lacking a comprehensive overview of the entire generative process. This paper addresses this gap by providing an in-depth survey of human motion video generation, encompassing over ten sub-tasks, and detailing the five key phases of the generation process: input, motion planning, motion video generation, refinement, and output. 
Notably, this is the first survey that discusses the potential of large language models in enhancing human motion video generation. Our survey reviews the latest developments and technological trends in human motion video generation across three primary modalities: vision, text, and audio. By covering over two hundred papers, we offer a thorough overview of the field and highlight milestone works that have driven significant technological breakthroughs. Our goal for this survey is to unveil the prospects of human motion video generation and serve as a valuable resource for advancing the comprehensive applications of digital humans. A complete list of the models examined in this survey is available in Our Repository https://github.com/Winn1y/Awesome-Human-Motion-Video-Generation.

[70] OccTENS: 3D Occupancy World Model via Temporal Next-Scale Prediction

Bu Jin,Songen Gu,Xiaotao Hu,Yupeng Zheng,Xiaoyang Guo,Qian Zhang,Xiaoxiao Long,Wei Yin

Main category: cs.CV

TL;DR: OccTENS is a generative occupancy world model that addresses inefficiency, temporal degradation in long-term generation and lack of controllability in recent approaches based on autoregression (AR) by reformulating the occupancy world model as a temporal next-scale prediction (TENS) task and utilizing a TensFormer to manage the temporal causality and spatial relationships of occupancy sequences.

Details Motivation: The occupancy world model must capture the fine-grained 3D geometry and dynamic evolution of the 3D scenes, posing great challenges for the generative models. Recent approaches based on autoregression (AR) have demonstrated the potential to predict vehicle movement and future occupancy scenes simultaneously from historical observations, but they typically suffer from inefficiency, temporal degradation in long-term generation and lack of controllability. Method: We reformulate the occupancy world model as a temporal next-scale prediction (TENS) task, which decomposes the temporal sequence modeling problem into the modeling of spatial scale-by-scale generation and temporal scene-by-scene prediction. With a TensFormer, OccTENS can effectively manage the temporal causality and spatial relationships of occupancy sequences in a flexible and scalable way. Result: OccTENS enables controllable, high-fidelity long-term occupancy generation while maintaining computational efficiency. Conclusion: Experiments show that OccTENS outperforms the state-of-the-art method with both higher occupancy quality and faster inference time. Abstract: In this paper, we propose OccTENS, a generative occupancy world model that enables controllable, high-fidelity long-term occupancy generation while maintaining computational efficiency. Different from visual generation, the occupancy world model must capture the fine-grained 3D geometry and dynamic evolution of the 3D scenes, posing great challenges for the generative models. Recent approaches based on autoregression (AR) have demonstrated the potential to predict vehicle movement and future occupancy scenes simultaneously from historical observations, but they typically suffer from inefficiency, temporal degradation in long-term generation and lack of controllability. 
To holistically address these issues, we reformulate the occupancy world model as a temporal next-scale prediction (TENS) task, which decomposes the temporal sequence modeling problem into the modeling of spatial scale-by-scale generation and temporal scene-by-scene prediction. With a TensFormer, OccTENS can effectively manage the temporal causality and spatial relationships of occupancy sequences in a flexible and scalable way. To enhance the pose controllability, we further propose a holistic pose aggregation strategy, which features a unified sequence modeling for occupancy and ego-motion. Experiments show that OccTENS outperforms the state-of-the-art method with both higher occupancy quality and faster inference time.

[71] Weakly-Supervised Learning of Dense Functional Correspondences

Stefan Stojanov,Linan Zhao,Yunzhi Zhang,Daniel L. K. Yamins,Jiajun Wu

Main category: cs.CV

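TL;DR: This paper proposes a weakly-supervised learning framework for predicting dense functional correspondences, using vision-language models and dense contrastive learning to combine functional and spatial knowledge, which gives it an advantage in cross-category image matching.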

Details Motivation: In cross-category image matching, an object's function can guide how correspondences are established, because object parts that enable specific functions often share similarities in shape and appearance. Method: Vision-language models pseudo-label multi-view images to obtain functional parts; this is combined with dense contrastive learning from pixel correspondences to distill functional and spatial knowledge into a new model. Result: Experiments show that the proposed approach has advantages over baselines such as off-the-shelf self-supervised image representations and grounded vision-language models on both synthetic and real datasets. Conclusion: The paper combines functional understanding with spatial correspondence, providing an effective solution for cross-category image matching. Abstract: Establishing dense correspondences across image pairs is essential for tasks such as shape reconstruction and robot manipulation. In the challenging setting of matching across different categories, the function of an object, i.e., the effect that an object can cause on other objects, can guide how correspondences should be established. This is because object parts that enable specific functions often share similarities in shape and appearance. We derive the definition of dense functional correspondence based on this observation and propose a weakly-supervised learning paradigm to tackle the prediction task. The main insight behind our approach is that we can leverage vision-language models to pseudo-label multi-view images to obtain functional parts. We then integrate this with dense contrastive learning from pixel correspondences to distill both functional and spatial knowledge into a new model that can establish dense functional correspondence. Further, we curate synthetic and real evaluation datasets as task benchmarks. Our results demonstrate the advantages of our approach over baseline solutions consisting of off-the-shelf self-supervised image representations and grounded vision language models.

[72] Attn-Adapter: Attention Is All You Need for Online Few-shot Learner of Vision-Language Model

Phuoc-Nguyen Bui,Khanh-Binh Nguyen,Hyunseung Choo

Main category: cs.CV

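TL;DR: This paper proposes Attn-Adapter, an online few-shot learning framework that improves the adaptability and generalization of contrastive vision-language models such as CLIP through a dual attention mechanism, avoiding the computationally intensive offline fine-tuning and overfitting risk of conventional methods.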

Details Motivation: Contrastive vision-language models excel at zero-shot image recognition, but in few-shot scenarios they rely on computationally intensive offline fine-tuning, which risks overfitting. Method: The paper proposes Attn-Adapter, a new method that enhances CLIP's adaptability through a dual attention mechanism, with two components: the Memory Attn-Adapter and the Local-Global Attn-Adapter. Result: Attn-Adapter adapts dynamically from a few labeled samples, outperforms state-of-the-art methods in cross-category and cross-dataset generalization, and maintains efficient inference and good scalability. Conclusion: Attn-Adapter is an online few-shot learning framework that improves CLIP's adaptability without retraining the base model, outperforms prior work in cross-category and cross-dataset generalization, and keeps inference efficient while scaling across CLIP backbones. Abstract: Contrastive vision-language models excel in zero-shot image recognition but face challenges in few-shot scenarios due to computationally intensive offline fine-tuning using prompt learning, which risks overfitting. To overcome these limitations, we propose Attn-Adapter, a novel online few-shot learning framework that enhances CLIP's adaptability via a dual attention mechanism. Our design incorporates dataset-specific information through two components: the Memory Attn-Adapter, which refines category embeddings using support examples, and the Local-Global Attn-Adapter, which enriches image embeddings by integrating local and global features. This architecture enables dynamic adaptation from a few labeled samples without retraining the base model. Attn-Adapter outperforms state-of-the-art methods in cross-category and cross-dataset generalization, maintaining efficient inference and scaling across CLIP backbones.
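
Refining category embeddings with support examples, as the Memory Attn-Adapter is described as doing, can be pictured as a cross-attention step in which category (text) embeddings query the support image embeddings. The sketch below is only a guess at the general shape of such a step: the abstract gives no architectural details, so the absence of learned projections, the residual addition, the scaling by the square root of the dimension, and the function name `refine_category_embeddings` are all assumptions. Random vectors stand in for CLIP embeddings.

```python
import numpy as np


def l2_normalize(a, axis=-1):
    """Scale vectors to unit length along the given axis."""
    return a / np.linalg.norm(a, axis=axis, keepdims=True)


def softmax(z, axis=-1):
    """Numerically stable softmax."""
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)


def refine_category_embeddings(text_emb, support_emb):
    """Cross-attention: each category embedding attends over support embeddings.

    text_emb: (C, d) category embeddings; support_emb: (N, d) support image
    embeddings. Returns refined (C, d) embeddings and the (C, N) attention map.
    """
    d = text_emb.shape[1]
    scores = text_emb @ support_emb.T / np.sqrt(d)  # scaled dot-product scores
    weights = softmax(scores, axis=1)               # each row sums to 1
    context = weights @ support_emb                 # support-weighted summary
    refined = l2_normalize(text_emb + context)      # residual update, unit norm
    return refined, weights


rng = np.random.default_rng(0)
text_emb = l2_normalize(rng.normal(size=(3, 8)))     # 3 classes (made up)
support_emb = l2_normalize(rng.normal(size=(6, 8)))  # 6 support images (made up)
refined, weights = refine_category_embeddings(text_emb, support_emb)
```

Since the step is a single forward pass over the support set, it needs no gradient updates at adaptation time, which is consistent with the "online" framing in the abstract.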

[73] SPECS: Specificity-Enhanced CLIP-Score for Long Image Caption Evaluation

Xiaofu Chen,Israfel Salazar,Yova Kementchedjhieva

Main category: cs.CV

TL;DR: SPECS is a reference-free Representational Similarity metric tailored to long image captions, offering high efficiency and accuracy.

Details Motivation: As interest grows in generating long, detailed image captions, standard evaluation metrics become increasingly unreliable. N-gram-based metrics are efficient but fail to capture semantic correctness. Metrics based on large language models (LLMs) correlate strongly with human judgments but remain too expensive for iterative use during model development. Method: SPECS modifies CLIP with a new objective that emphasizes the correctness of details: it rewards correct details and penalizes incorrect ones. Result: SPECS matches open-source LLM-based metrics in correlation with human judgments while being more efficient. Conclusion: SPECS is a practical alternative for iterative checkpoint evaluation, able to assess long, detailed captions effectively during image captioning model development. Abstract: As interest grows in generating long, detailed image captions, standard evaluation metrics become increasingly unreliable. N-gram-based metrics, though efficient, fail to capture semantic correctness. Representational Similarity (RS) metrics, designed to address this, initially saw limited use due to high computational costs, while today, despite advances in hardware, they remain unpopular due to low correlation to human judgments. Meanwhile, metrics based on large language models (LLMs) show strong correlation with human judgments, but remain too expensive for iterative use during model development. We introduce SPECS (Specificity-Enhanced CLIPScore), a reference-free RS metric tailored to long image captioning. SPECS modifies CLIP with a new objective that emphasizes specificity: rewarding correct details and penalizing incorrect ones. We show that SPECS matches the performance of open-source LLM-based metrics in correlation to human judgments, while being far more efficient. This makes it a practical alternative for iterative checkpoint evaluation during image captioning model development. Our code can be found at https://github.com/mbzuai-nlp/SPECS.
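
For context, the reference-free CLIPScore that SPECS is named after is commonly defined as a rescaled, clipped cosine similarity between an image embedding and a caption embedding, with a weight of 2.5. The sketch below shows only that base formula on hand-written vectors. It does not reproduce the SPECS training objective, and a real pipeline would obtain the embeddings from a CLIP image encoder and text encoder, which are omitted here.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def clip_score(image_emb, caption_emb, w=2.5):
    """Reference-free CLIPScore: w * max(cos(image, caption), 0)."""
    return w * max(cosine_similarity(image_emb, caption_emb), 0.0)


# Toy 2D embeddings standing in for CLIP outputs.
aligned = clip_score([1.0, 0.0], [1.0, 0.0])     # identical directions
unrelated = clip_score([1.0, 0.0], [0.0, 1.0])   # orthogonal directions
opposed = clip_score([1.0, 0.0], [-1.0, 0.0])    # negative cosine is clipped
```

A metric of this form costs one encoder pass per image and caption plus a dot product, which is why representational similarity metrics are much cheaper than LLM-based judges for checkpoint evaluation.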

[74] A Generative Foundation Model for Chest Radiography

Yuanfeng Ji,Dan Lin,Xiyue Wang,Lu Zhang,Wenhui Zhou,Chongjian Ge,Ruihang Chu,Xiaoli Yang,Junhan Zhao,Junsong Chen,Xiangde Luo,Sen Yang,Jin Fang,Ping Luo,Ruijiang Li

Main category: cs.CV

TL;DR: ChexGen is a generative vision-language foundation model that synthesizes chest radiographs and enhances the development of more accurate, data-efficient, and equitable medical AI systems.

Details Motivation: The scarcity of well-annotated and diverse medical images is a significant barrier to developing reliable AI models in healthcare. Generative foundation models have made substantial technical advances for natural images, which can be leveraged in the medical field. Method: ChexGen was built using a latent diffusion transformer architecture and was pretrained on a large dataset of 960,000 chest X-ray radiograph-report pairs. Result: Expert evaluations and quantitative metrics show that ChexGen synthesizes chest radiographs accurately. Used for training data augmentation and supervised pretraining, it improves disease classification, detection, and segmentation with a small fraction of training data, and it enables the creation of diverse patient cohorts for improved model fairness. Conclusion: ChexGen, a generative vision-language foundation model, has the potential to transform the development of medical AI systems by enabling more accurate, data-efficient, and equitable healthcare AI. Abstract: The scarcity of well-annotated diverse medical images is a major hurdle for developing reliable AI models in healthcare. Substantial technical advances have been made in generative foundation models for natural images. Here we develop ChexGen, a generative vision-language foundation model that introduces a unified framework for text-, mask-, and bounding box-guided synthesis of chest radiographs. Built upon the latent diffusion transformer architecture, ChexGen was pretrained on the largest curated chest X-ray dataset to date, consisting of 960,000 radiograph-report pairs. ChexGen achieves accurate synthesis of radiographs, as confirmed by expert evaluations and quantitative metrics. We demonstrate the utility of ChexGen for training data augmentation and supervised pretraining, which led to performance improvements across disease classification, detection, and segmentation tasks using a small fraction of training data.
Further, our model enables the creation of diverse patient cohorts that enhance model fairness by detecting and mitigating demographic biases. Our study supports the transformative role of generative foundation models in building more accurate, data-efficient, and equitable medical AI systems.

[75] LMVC: An End-to-End Learned Multiview Video Coding Framework

Xihua Sheng,Yingwen Zhang,Long Xu,Shiqi Wang

Main category: cs.CV

TL;DR: This paper proposes an end-to-end learned multiview video coding (LMVC) framework that improves compression efficiency by leveraging inter-view motion and content correlations, outperforming traditional MV-HEVC standards.

Details Motivation: Multiview video is crucial for volumetric video applications but poses significant challenges in storage and transmission. While deep learning-based video coding has been successful for single-view or stereo videos, general multiview scenarios remain underexplored, motivating the need for an efficient multiview coding solution. Method: An end-to-end learned multiview video coding (LMVC) framework is introduced, incorporating feature-based inter-view motion vector prediction, an inter-view motion entropy model, and a disparity-free inter-view context prediction module combined with an inter-view contextual entropy model. Result: The proposed LMVC framework significantly outperforms the traditional MV-HEVC standard, demonstrating enhanced compression efficiency while ensuring random access and backward compatibility. Conclusion: The proposed LMVC framework outperforms traditional methods like MV-HEVC and sets a strong baseline for future research in multiview video coding. Abstract: Multiview video is a key data source for volumetric video, enabling immersive 3D scene reconstruction but posing significant challenges in storage and transmission due to its massive data volume. Recently, deep learning-based end-to-end video coding has achieved great success, yet most focus on single-view or stereo videos, leaving general multiview scenarios underexplored. This paper proposes an end-to-end learned multiview video coding (LMVC) framework that ensures random access and backward compatibility while enhancing compression efficiency. Our key innovation lies in effectively leveraging independent-view motion and content information to enhance dependent-view compression. 
Specifically, to exploit the inter-view motion correlation, we propose a feature-based inter-view motion vector prediction method that conditions dependent-view motion encoding on decoded independent-view motion features, along with an inter-view motion entropy model that learns inter-view motion priors. To exploit the inter-view content correlation, we propose a disparity-free inter-view context prediction module that predicts inter-view contexts from decoded independent-view content features, combined with an inter-view contextual entropy model that captures inter-view context priors. Experimental results show that our proposed LMVC framework outperforms the reference software of the traditional MV-HEVC standard by a large margin, establishing a strong baseline for future research in this field.

[76] TopoSculpt: Betti-Steered Topological Sculpting of 3D Fine-grained Tubular Shapes

Minghui Zhang,Yaoyu Liu,Junyang Wu,Xin You,Hanxiao Zhang,Junjun He,Yun Gu

Main category: cs.CV

TL;DR: TopoSculpt is a novel framework for refining the topology of 3D tubular structures, achieving high-fidelity modeling improvements in medical applications.

Details Motivation: Accurate reconstruction of medical tubular anatomical structures is crucial for various applications, but existing methods often fail to capture topological correctness and completeness. Method: TopoSculpt uses a holistic whole-region modeling strategy, introduces a Topological Integrity Betti constraint, and employs a curriculum refinement scheme with persistent homology. Result: Experiments show substantial improvements in both geometry and topology, with significant reductions in β0 errors and improvements in tree length and branch detection rates. Conclusion: TopoSculpt effectively corrects critical topological errors and advances the high-fidelity modeling of complex 3D tubular anatomy. Abstract: Medical tubular anatomical structures are inherently three-dimensional conduits with lumens, enclosing walls, and complex branching topologies. Accurate reconstruction of their geometry and topology is crucial for applications such as bronchoscopic navigation and cerebral arterial connectivity assessment. Existing methods often rely on voxel-wise overlap measures, which fail to capture topological correctness and completeness. Although topology-aware losses and persistent homology constraints have shown promise, they are usually applied patch-wise and cannot guarantee global preservation or correct geometric errors at inference. To address these limitations, we propose TopoSculpt, a novel framework for topological refinement of 3D fine-grained tubular structures. TopoSculpt (i) adopts a holistic whole-region modeling strategy to capture full spatial context, (ii) introduces, for the first time, a Topological Integrity Betti (TIB) constraint that jointly enforces Betti number priors and global integrity, and (iii) employs a curriculum refinement scheme with persistent homology to progressively correct errors from coarse to fine scales.
Extensive experiments on challenging pulmonary airway and Circle of Willis (CoW) datasets demonstrate substantial improvements in both geometry and topology. For instance, β0 errors are reduced from 69.00 to 3.40 on the airway dataset and from 1.65 to 0.30 on the CoW dataset, with tree-length-detected and branch-detected rates improving by nearly 10%. These results highlight the effectiveness of TopoSculpt in correcting critical topological errors and advancing the high-fidelity modeling of complex 3D tubular anatomy. The project homepage is available at: https://github.com/Puzzled-Hui/TopoSculpt.
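The β0 figures quoted above refer to the zeroth Betti number, which for a binary segmentation is the number of connected components. The sketch below is only an illustration of that quantity, not the authors' code: the function names, the nested-list volume layout, the choice of 26-connectivity, and the definition of the error as the absolute difference in component counts are all assumptions made here.

```python
from collections import deque
from itertools import product


def betti0(volume, connectivity=26):
    """Count connected foreground components (Betti number β0).

    volume is a nested list indexed as volume[z][y][x]; truthy means foreground.
    connectivity is 6 (face neighbours) or 26 (face, edge, and corner neighbours).
    """
    if connectivity == 6:
        offsets = [(1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1)]
    elif connectivity == 26:
        offsets = [o for o in product((-1, 0, 1), repeat=3) if o != (0, 0, 0)]
    else:
        raise ValueError("connectivity must be 6 or 26")

    depth, height, width = len(volume), len(volume[0]), len(volume[0][0])
    seen = set()
    components = 0
    for z, y, x in product(range(depth), range(height), range(width)):
        if not volume[z][y][x] or (z, y, x) in seen:
            continue
        # Unvisited foreground voxel: start a new component and flood-fill it.
        components += 1
        seen.add((z, y, x))
        queue = deque([(z, y, x)])
        while queue:
            cz, cy, cx = queue.popleft()
            for dz, dy, dx in offsets:
                nz, ny, nx = cz + dz, cy + dy, cx + dx
                inside = 0 <= nz < depth and 0 <= ny < height and 0 <= nx < width
                if inside and volume[nz][ny][nx] and (nz, ny, nx) not in seen:
                    seen.add((nz, ny, nx))
                    queue.append((nz, ny, nx))
    return components


def betti0_error(pred, target, connectivity=26):
    """Hypothetical error definition: absolute difference in component counts."""
    return abs(betti0(pred, connectivity) - betti0(target, connectivity))


intact_tube = [[[1, 1, 1, 1, 1]]]   # one connected tube
broken_tube = [[[1, 1, 0, 1, 1]]]   # a one-voxel gap splits it in two
print(betti0_error(broken_tube, intact_tube))
```

A single missing voxel barely changes a voxel-wise overlap score but raises β0 from 1 to 2, which is the kind of error overlap measures miss.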

[77] Chest X-ray Pneumothorax Segmentation Using EfficientNet-B4 Transfer Learning in a U-Net Architecture

Alvaro Aranibar Roque,Helga Sebastian

Main category: cs.CV

TL;DR: This paper proposes an automated deep-learning pipeline using a U-Net with EfficientNet-B4 encoder to accurately detect and segment pneumothorax regions, achieving promising results on an independent dataset.

Details Motivation: Pneumothorax, an abnormal accumulation of air in the pleural space, can be life-threatening if undetected. Chest X-rays are the first-line diagnostic tool, but small cases can be subtle and difficult to detect, necessitating an automated detection method. Method: An automated deep-learning pipeline was developed using a U-Net with an EfficientNet-B4 encoder to segment pneumothorax regions. The model was trained on the SIIM-ACR dataset using data augmentation and a combined binary cross-entropy plus Dice loss function. Result: The model achieved an IoU of 0.7008 and a Dice score of 0.8241 on the independent PTX-498 dataset, indicating its ability to accurately localize pneumothoraces. Conclusion: The deep-learning model proposed in this paper, which uses a U-Net with an EfficientNet-B4 encoder, demonstrates high accuracy in localizing pneumothoraces and can effectively support radiologists. Abstract: Pneumothorax, the abnormal accumulation of air in the pleural space, can be life-threatening if undetected. Chest X-rays are the first-line diagnostic tool, but small cases may be subtle. We propose an automated deep-learning pipeline using a U-Net with an EfficientNet-B4 encoder to segment pneumothorax regions. Trained on the SIIM-ACR dataset with data augmentation and a combined binary cross-entropy plus Dice loss, the model achieved an IoU of 0.7008 and Dice score of 0.8241 on the independent PTX-498 dataset. These results demonstrate that the model can accurately localize pneumothoraces and support radiologists.
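The two reported metrics and the combined loss can be sketched in a few lines. This is a minimal illustration on flattened masks, not the authors' pipeline: the function names are hypothetical, and the equal weighting of the binary cross-entropy and Dice terms is an assumption, since the abstract only says the two are combined.

```python
import math


def overlap_counts(pred, target):
    """Return (true positives, false positives, false negatives) for binary masks."""
    tp = sum(1 for p, t in zip(pred, target) if p and t)
    fp = sum(1 for p, t in zip(pred, target) if p and not t)
    fn = sum(1 for p, t in zip(pred, target) if not p and t)
    return tp, fp, fn


def iou(pred, target):
    tp, fp, fn = overlap_counts(pred, target)
    denom = tp + fp + fn
    return tp / denom if denom else 1.0  # two empty masks agree perfectly


def dice(pred, target):
    tp, fp, fn = overlap_counts(pred, target)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 1.0


def dice_from_iou(j):
    """For one pair of masks (or pooled counts), Dice = 2*IoU / (1 + IoU)."""
    return 2 * j / (1 + j)


def bce_dice_loss(probs, target, eps=1e-7):
    """Binary cross-entropy plus soft Dice loss, equally weighted (assumption)."""
    bce = -sum(
        t * math.log(max(p, eps)) + (1 - t) * math.log(max(1 - p, eps))
        for p, t in zip(probs, target)
    ) / len(probs)
    intersection = sum(p * t for p, t in zip(probs, target))
    soft_dice = (2 * intersection + eps) / (sum(probs) + sum(target) + eps)
    return bce + (1 - soft_dice)


pred = [1, 1, 1, 0, 0, 0]
target = [0, 1, 1, 1, 0, 0]
print(iou(pred, target), dice(pred, target))
print(round(dice_from_iou(0.7008), 4))
```

The reported pair (IoU 0.7008, Dice 0.8241) is consistent with the identity Dice = 2·IoU / (1 + IoU). That identity is exact only for a single mask pair or pooled pixel counts; scores averaged per image need not satisfy it.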

[78] ANTS: Shaping the Adaptive Negative Textual Space by MLLM for OOD Detection

Zhu Wenjie,Zhang Yabin,Xin Jin,Wenjun Zeng,Lei Zhang

Main category: cs.CV

TL;DR: This paper proposes Adaptive Negative Textual Space (ANTS), a zero-shot method for OOD detection that uses MLLMs to generate precise negative labels, improving both far-OOD and near-OOD detection performance without requiring training.

Details Motivation: Existing OOD detection methods struggle with constructing accurate negative spaces and suffer from false negatives, especially in near-OOD scenarios. This work aims to address these limitations by leveraging the reasoning capabilities of MLLMs. Method: ANTS leverages MLLMs to generate negative textual descriptions for identified OOD-like images. It creates tailored negative labels for visually similar ID subsets in near-OOD settings and uses an adaptive weighted score to balance far- and near-OOD detection without task-specific priors. Result: On the ImageNet benchmark, ANTS reduces FPR95 by 4.2%, achieves state-of-the-art performance, and operates in a training-free, zero-shot manner, enabling high scalability. Conclusion: The proposed ANTS method significantly improves both far-OOD and near-OOD detection by leveraging MLLMs to generate expressive and accurate negative labels, achieving state-of-the-art performance on the ImageNet benchmark. Abstract: The introduction of negative labels (NLs) has proven effective in enhancing Out-of-Distribution (OOD) detection. However, existing methods often lack an understanding of OOD images, making it difficult to construct an accurate negative space. In addition, the presence of false negative labels significantly degrades their near-OOD performance. To address these issues, we propose shaping an Adaptive Negative Textual Space (ANTS) by leveraging the understanding and reasoning capabilities of multimodal large language models (MLLMs). Specifically, we identify images likely to be OOD samples as negative images and prompt the MLLM to describe these images, generating expressive negative sentences that precisely characterize the OOD distribution and enhance far-OOD detection. 
For the near-OOD setting, where OOD samples resemble the in-distribution (ID) subset, we first identify the subset of ID classes that are visually similar to negative images and then leverage the reasoning capability of MLLMs to generate visually similar negative labels tailored to this subset, effectively reducing false negatives and improving near-OOD detection. To balance these two types of negative textual spaces, we design an adaptive weighted score that enables the method to handle different OOD task settings (near-OOD and far-OOD) without relying on task-specific prior knowledge, making it highly adaptable in open environments. On the ImageNet benchmark, our ANTS significantly reduces the FPR95 by 4.2%, establishing a new state-of-the-art. Furthermore, our method is training-free and zero-shot, enabling high scalability.
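FPR95, the metric reported above, is conventionally the false positive rate on OOD samples at the threshold where 95% of in-distribution samples are still accepted. The sketch below shows that metric only, not the ANTS scoring function. The function name is hypothetical, and it assumes a higher score means "more in-distribution".

```python
def fpr_at_tpr(id_scores, ood_scores, tpr_percent=95):
    """False positive rate on OOD data at a fixed true positive rate on ID data.

    Assumes higher scores indicate in-distribution samples. The threshold is the
    largest ID score that still keeps at least tpr_percent of ID samples at or
    above it; OOD samples at or above it count as false positives.
    """
    ranked = sorted(id_scores)
    n = len(ranked)
    # Number of lowest-scoring ID samples we are allowed to reject.
    k = n * (100 - tpr_percent) // 100
    threshold = ranked[k]
    false_positives = sum(1 for s in ood_scores if s >= threshold)
    return false_positives / len(ood_scores)


id_scores = [i / 20 for i in range(1, 21)]   # 20 in-distribution scores
ood_scores = [0.02, 0.08, 0.12, 0.5]         # 4 out-of-distribution scores
print(fpr_at_tpr(id_scores, ood_scores))
```

Lower is better: a detector that separates the two score distributions perfectly reaches an FPR95 of 0.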

[79] Multimodal Feature Fusion Network with Text Difference Enhancement for Remote Sensing Change Detection

Yijun Zhou,Yikui Zhai,Zilu Ying,Tingfeng Xian,Wenlve Zhou,Zhiheng Zhou,Xiaolin Tian,Xudong Jia,Hongsheng Zhang,C. L. Philip Chen

Main category: cs.CV

TL;DR: MMChange improves remote sensing change detection by integrating image and text modalities, achieving better accuracy and robustness than existing methods.

Details Motivation: Most deep learning methods for remote sensing change detection rely solely on image modality, which limits feature representation, change pattern modeling, and generalization, especially under illumination and noise disturbances. Method: MMChange introduces an Image Feature Refinement (IFR) module, a vision language model (VLM) for semantic descriptions, a Textual Difference Enhancement (TDE) module, and an Image Text Feature Fusion (ITFF) module for cross-modal integration. Result: Extensive experiments on LEVIRCD, WHUCD, and SYSUCD datasets demonstrate that MMChange consistently surpasses state-of-the-art methods across multiple metrics. Conclusion: MMChange is an effective multimodal method for remote sensing change detection that combines image and text modalities, outperforming state-of-the-art methods. Abstract: Although deep learning has advanced remote sensing change detection (RSCD), most methods rely solely on image modality, limiting feature representation, change pattern modeling, and generalization, especially under illumination and noise disturbances. To address this, we propose MMChange, a multimodal RSCD method that combines image and text modalities to enhance accuracy and robustness. An Image Feature Refinement (IFR) module is introduced to highlight key regions and suppress environmental noise. To overcome the semantic limitations of image features, we employ a vision language model (VLM) to generate semantic descriptions of bitemporal images. A Textual Difference Enhancement (TDE) module then captures fine-grained semantic shifts, guiding the model toward meaningful changes. To bridge the heterogeneity between modalities, we design an Image Text Feature Fusion (ITFF) module that enables deep cross-modal integration. Extensive experiments on LEVIRCD, WHUCD, and SYSUCD demonstrate that MMChange consistently surpasses state-of-the-art methods across multiple metrics, validating its effectiveness for multimodal RSCD.
Code is available at: https://github.com/yikuizhai/MMChange.

[80] SAC-MIL: Spatial-Aware Correlated Multiple Instance Learning for Histopathology Whole Slide Image Classification

Yu Bai,Zitong Yu,Haowen Tian,Xijing Wang,Shuo Yan,Lin Wang,Honglin Li,Xitong Ling,Bo Zhang,Zheng Zhang,Wufan Wang,Hui Gao,Xiangyang Gong,Wendong Wang

Main category: cs.CV

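TL;DR: SAC-MIL is a new method for WSI classification whose simple structure makes it easy to deploy; it achieves state-of-the-art performance on multiple datasets.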

Details Motivation: To develop a structurally simple WSI classification method that handles the extrapolation problem arising when training and testing sequences differ in length, and that does not require custom CUDA kernels. Method: SAC-MIL consists of a positional encoding module that encodes position information and a SAC block that performs full instance correlation. Result: SAC-MIL achieves state-of-the-art performance on the CAMELYON-16, TCGA-LUNG, and TCGA-BRAC datasets. Conclusion: SAC-MIL is a new method for WSI classification that achieves state-of-the-art performance; the code will be released upon acceptance. Abstract: We propose Spatial-Aware Correlated Multiple Instance Learning (SAC-MIL) for performing WSI classification. SAC-MIL consists of a positional encoding module to encode position information and a SAC block to perform full instance correlations. The positional encoding module utilizes the instance coordinates within the slide to encode the spatial relationships instead of the instance index in the input WSI sequence. The positional encoding module can also handle the length extrapolation issue where the training and testing sequences have different lengths. The SAC block is an MLP-based method that performs full instance correlation in linear time complexity with respect to the sequence length. Due to the simple structure of MLP, it is easy to deploy since it does not require custom CUDA kernels, compared to Transformer-based methods for WSI classification. SAC-MIL has achieved state-of-the-art performance on the CAMELYON-16, TCGA-LUNG, and TCGA-BRAC datasets. The code will be released upon acceptance.

[81] Improving Vessel Segmentation with Multi-Task Learning and Auxiliary Data Available Only During Model Training

Daniel Sobotka,Alexander Herold,Matthias Perkonigg,Lucian Beer,Nina Bastati,Alina Sablatnig,Ahmed Ba-Ssalamah,Georg Langs

Main category: cs.CV

TL;DR: This paper proposes a multi-task learning approach to improve liver vessel segmentation in non-contrast MRI by leveraging contrast-enhanced data during training, reducing the reliance on large annotated datasets.

Details Motivation: Liver vessel segmentation is crucial for analyzing vascular remodeling in liver diseases, but existing methods rely on contrast-enhanced imaging which is not always available. Non-contrast images are more frequently acquired, but vessel segmentation in these images is challenging and requires large annotated datasets. Method: The authors proposed a multi-task learning framework that utilizes auxiliary contrast-enhanced MRI data during training to improve vessel segmentation in non-contrast liver MRI. They trained the model using paired native and contrast-enhanced data with and without annotations. Result: The use of auxiliary contrast-enhanced data during training improved segmentation accuracy in non-contrast MRI, especially when only a limited number of annotated examples were available. The framework also showed benefits in a different domain—brain tumor segmentation. Conclusion: The proposed multi-task learning framework improves vessel segmentation in non-contrast liver MRI, particularly when limited annotations are available, and demonstrates cross-domain applicability, as validated in brain tumor segmentation. Abstract: Liver vessel segmentation in magnetic resonance imaging data is important for the computational analysis of vascular remodelling, associated with a wide spectrum of diffuse liver diseases. Existing approaches rely on contrast enhanced imaging data, but the necessary dedicated imaging sequences are not uniformly acquired. Images without contrast enhancement are acquired more frequently, but vessel segmentation is challenging, and requires large-scale annotated data. We propose a multi-task learning framework to segment vessels in liver MRI without contrast. It exploits auxiliary contrast enhanced MRI data available only during training to reduce the need for annotated training examples. Our approach draws on paired native and contrast enhanced data with and without vessel annotations for model training. 
Results show that auxiliary data improves the accuracy of vessel segmentation, even if they are not available during inference. The advantage is most pronounced if only few annotations are available for training, since the feature representation benefits from the shared task structure. A validation of this approach to augment a model for brain tumor segmentation confirms its benefits across different domains. An auxiliary informative imaging modality can augment expert annotations even if it is only available during training.

[82] Promptception: How Sensitive Are Large Multimodal Models to Prompts?

Mohamed Insaf Ismithdeen,Muhammad Uzair Khattak,Salman Khan

Main category: cs.CV

TL;DR: This paper introduces Promptception, a comprehensive framework for evaluating prompt sensitivity in Large Multimodal Models (LMMs), revealing key differences in how proprietary and open-source models respond to variations in prompt design for Multiple-Choice Question Answering.

Details Motivation: Prompt design for Large Multimodal Models (LMMs) in Multiple-Choice Question Answering (MCQA) is poorly understood, with significant accuracy variations observed due to minor changes in prompt phrasing and structure. Method: The study introduces Promptception, a framework with 61 prompt types across 15 categories and 6 supercategories, used to evaluate 10 LMMs on 3 MCQA benchmarks: MMStar, MMMU-Pro, and MVBench. Result: Findings show that proprietary models are more sensitive to prompt phrasing, indicating better alignment with instruction semantics, while open-source models are more stable but struggle with complex phrasing. Conclusion: Promptception provides a systematic framework for evaluating prompt sensitivity in LMMs, highlighting the differences in performance between proprietary and open-source models and proposing tailored prompting principles for robust and fair evaluation. Abstract: Despite the success of Large Multimodal Models (LMMs) in recent years, prompt design for LMMs in Multiple-Choice Question Answering (MCQA) remains poorly understood. We show that even minor variations in prompt phrasing and structure can lead to accuracy deviations of up to 15% for certain prompts and models. This variability poses a challenge for transparent and fair LMM evaluation, as models often report their best-case performance using carefully selected prompts. To address this, we introduce Promptception, a systematic framework for evaluating prompt sensitivity in LMMs. It consists of 61 prompt types, spanning 15 categories and 6 supercategories, each targeting specific aspects of prompt formulation, and is used to evaluate 10 LMMs ranging from lightweight open-source models to GPT-4o and Gemini 1.5 Pro, across 3 MCQA benchmarks: MMStar, MMMU-Pro, MVBench. 
Our findings reveal that proprietary models exhibit greater sensitivity to prompt phrasing, reflecting tighter alignment with instruction semantics, while open-source models are steadier but struggle with nuanced and complex phrasing. Based on this analysis, we propose Prompting Principles tailored to proprietary and open-source LMMs, enabling more robust and fair model evaluation.

[83] SliceSemOcc: Vertical Slice Based Multimodal 3D Semantic Occupancy Representation

Han Huang,Han Sun,Ningzhong Liu,Huiyu Zhou,Jiaquan Shen

Main category: cs.CV

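TL;DR: This paper proposes SliceSemOcc, a vertical-slice-based multimodal method that improves 3D semantic occupancy prediction, especially on small-object categories.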

Details Motivation: Existing methods overlook height-axis information when processing voxel features, and conventional SENet-style channel attention cannot effectively emphasize features at different heights. Method: The paper proposes SliceSemOcc, a vertical-slice-based multimodal framework that extracts height-axis information through global and local vertical slices and uses the SEAttention3D module to assign channel attention weights dynamically. Result: Experiments on the nuScenes-SurroundOcc and nuScenes-OpenOccupancy datasets show that the method significantly improves mean IoU. Conclusion: SliceSemOcc effectively improves 3D semantic occupancy prediction, with especially pronounced gains on small-object categories. Abstract: Driven by autonomous driving's demands for precise 3D perception, 3D semantic occupancy prediction has become a pivotal research topic. Unlike bird's-eye-view (BEV) methods, which restrict scene representation to a 2D plane, occupancy prediction leverages a complete 3D voxel grid to model spatial structures in all dimensions, thereby capturing semantic variations along the vertical axis. However, most existing approaches overlook height-axis information when processing voxel features. Moreover, conventional SENet-style channel attention assigns uniform weights across all height layers, limiting its ability to emphasize features at different heights. To address these limitations, we propose SliceSemOcc, a novel vertical slice based multimodal framework for 3D semantic occupancy representation. Specifically, we extract voxel features along the height-axis using both global and local vertical slices. Then, a global local fusion module adaptively reconciles fine-grained spatial details with holistic contextual information. Furthermore, we propose the SEAttention3D module, which preserves height-wise resolution through average pooling and assigns dynamic channel attention weights to each height layer. Extensive experiments on nuScenes-SurroundOcc and nuScenes-OpenOccupancy datasets verify that our method significantly enhances mean IoU, achieving especially pronounced gains on most small-object categories. Detailed ablation studies further validate the effectiveness of the proposed SliceSemOcc framework.

[84] Detecting Regional Spurious Correlations in Vision Transformers via Token Discarding

Solha Kang,Esla Timothy Anzaku,Wesley De Neve,Arnout Van Messem,Joris Vankerschaver,Francois Rameau,Utku Ozbulak

Main category: cs.CV

TL;DR: This paper introduces a new method to detect spurious correlations in vision transformers, demonstrating their impact on model reliability and highlighting the influence of training methodology. The authors provide a list of affected ImageNet images and present a real-world case study on breast mass classification.

Details Motivation: Neural network-based computer vision models can exploit unintended patterns in data, leading to predictions based on spurious correlations. Detecting and mitigating these correlations is crucial for building trustworthy and reliable machine learning models. Method: The authors propose a novel method to detect spurious correlations in vision transformers, conducting large-scale experiments on the ImageNet dataset using both supervised and self-supervised trained models. They also provide a case study on invasive breast mass classification. Result: The proposed method successfully identifies spurious correlations in vision transformers. The experiments show that training methodology significantly affects the model's reliance on these correlations, and certain classes in ImageNet contain easily detectable spurious signals. Conclusion: The study concludes that spurious correlations are a significant concern in vision transformers, affecting model reliability and generalizability. The training methodology impacts the model's reliance on these correlations, and the authors urge caution in the use of certain ImageNet images for future research. Abstract: Due to their powerful feature association capabilities, neural network-based computer vision models have the ability to detect and exploit unintended patterns within the data, potentially leading to correct predictions based on incorrect or unintended but statistically relevant signals. These clues may vary from simple color aberrations to small texts within the image. In situations where these unintended signals align with the predictive task, models can mistakenly link these features with the task and rely on them for making predictions. This phenomenon is referred to as spurious correlations, where patterns appear to be associated with the task but are actually coincidental. 
As a result, detection and mitigation of spurious correlations have become crucial tasks for building trustworthy, reliable, and generalizable machine learning models. In this work, we present a novel method to detect spurious correlations in vision transformers, a type of neural network architecture that gained significant popularity in recent years. Using both supervised and self-supervised trained models, we present large-scale experiments on the ImageNet dataset demonstrating the ability of the proposed method to identify spurious correlations. We also find that, even if the same architecture is used, the training methodology has a significant impact on the model's reliance on spurious correlations. Furthermore, we show that certain classes in the ImageNet dataset contain spurious signals that are easily detected by the models and discuss the underlying reasons for those spurious signals. In light of our findings, we provide an exhaustive list of the aforementioned images and call for caution in their use in future research efforts. Lastly, we present a case study investigating spurious signals in invasive breast mass classification, grounding our work in real-world scenarios.

[85] Learning from Majority Label: A Novel Problem in Multi-class Multiple-Instance Learning

Shiku Kaito,Shinnosuke Matsuo,Daiki Suehiro,Ryoma Bise

Main category: cs.CV

TL;DR: This paper introduces a new approach called Learning from Majority Label (LML) in multi-class Multiple-Instance Learning, where a Counting Network and Majority Proportion Enhancement Module (MPEM) are proposed to improve classification performance by leveraging bag-level majority labels.

Details Motivation: The motivation is to address a novel multi-class Multiple-Instance Learning (MIL) problem where bag-level labels are derived from the majority class of instances. This has valuable applications in areas like pathology image segmentation, political voting prediction, customer sentiment analysis, and environmental monitoring. Method: The paper proposes a Counting Network and a Majority Proportion Enhancement Module (MPEM) to solve the LML problem. The Counting Network estimates bag-level majority labels by counting the number of instances in each class, while MPEM enhances the proportion of the majority class by removing minority instances. Result: Experiments showed the superiority of the proposed method on four datasets compared to conventional MIL methods. Ablation studies confirmed the effectiveness of each module, particularly highlighting that bags with a higher proportion of the majority class facilitate better learning. Conclusion: The paper concludes that the proposed method, Learning from Majority Label (LML), is effective in training a classification model to estimate instance classes using bag-level majority labels, with the proposed modules proving effective in enhancing performance. Abstract: The paper proposes a novel multi-class Multiple-Instance Learning (MIL) problem called Learning from Majority Label (LML). In LML, the majority class of instances in a bag is assigned as the bag-level label. The goal of LML is to train a classification model that estimates the class of each instance using the majority label. This problem is valuable in a variety of applications, including pathology image segmentation, political voting prediction, customer sentiment analysis, and environmental monitoring. To solve LML, we propose a Counting Network trained to produce bag-level majority labels, estimated by counting the number of instances in each class. 
Furthermore, analysis experiments on the characteristics of LML revealed that bags with a high proportion of the majority class facilitate learning. Based on this result, we developed a Majority Proportion Enhancement Module (MPEM) that increases the proportion of the majority class by removing minority class instances within the bags. Experiments demonstrate the superiority of the proposed method on four datasets compared to conventional MIL methods. Moreover, ablation studies confirmed the effectiveness of each module. The code is available at https://github.com/Shiku-Kaito/Learning-from-Majority-Label-A-Novel-Problem-in-Multi-class-Multiple-Instance-Learning.
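The LML setting can be made concrete with a toy bag. The sketch below illustrates the problem definition only, not the authors' Counting Network or MPEM, and every function name is hypothetical. Two simplifications are assumed: counting uses a hard argmax over per-instance class probabilities, and minority removal reads the true instance labels, which would not be available during training.

```python
from collections import Counter


def majority_label(instance_labels):
    """Return the bag-level majority label and its proportion within the bag."""
    label, count = Counter(instance_labels).most_common(1)[0]
    return label, count / len(instance_labels)


def majority_from_predictions(instance_probs):
    """Estimate the bag label by counting per-instance argmax predictions.

    Hard counting is a simplification: it is not differentiable, so a trainable
    model would need some other counting formulation.
    """
    predicted = [max(range(len(p)), key=p.__getitem__) for p in instance_probs]
    return majority_label(predicted)[0]


def remove_minority_instances(instance_labels, max_removed):
    """Drop up to max_removed non-majority instances (uses true labels: assumption)."""
    label, _ = majority_label(instance_labels)
    kept, removed = [], 0
    for item in instance_labels:
        if item != label and removed < max_removed:
            removed += 1
            continue
        kept.append(item)
    return kept


bag = ["tumor", "tumor", "normal", "tumor", "stroma"]
print(majority_label(bag))
print(majority_label(remove_minority_instances(bag, max_removed=1)))
```

Removing one minority instance raises the majority proportion of this bag from 0.6 to 0.75, the direction of change that the abstract reports as helpful for learning.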

[86] Millisecond-Response Tracking and Gazing System for UAVs: A Domestic Solution Based on "Phytium + Cambricon"

Yuchen Zhu,Longxiang Yin,Kai Zhao

Main category: cs.CV

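TL;DR: This paper proposes a new heterogeneous computing architecture that combines Phytium processors with Cambricon accelerator cards; through hardware- and software-level design, it significantly improves the real-time responsiveness and recognition accuracy of video surveillance systems.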

Details Motivation: Because of the insufficient deep feature extraction capability of automatic recognition algorithms and the efficiency bottleneck of computing architectures, traditional camera systems exhibit response delays exceeding 200 ms in dynamic scenarios and cannot meet the real-time requirements of complex scenes. Method: At the hardware level, the system adopts a collaborative computing architecture of Phytium FT-2000/4 processors and MLU220 accelerator cards; at the software level, it integrates a lightweight YOLOv5s detection network with the DeepSORT cascaded tracking algorithm, forming a closed-loop "detection-tracking-feedback" control chain. Result: Experiments show that the system achieves a stable single-frame comprehensive processing delay of 50-100 ms on 1920*1080 video streams, with a multi-scale target recognition accuracy above 98.5%, combining low latency with high precision. Conclusion: The paper proposes a heterogeneous computing architecture based on Phytium processors and Cambricon accelerator cards, providing an innovative solution for UAV monitoring and the application of domestic chips. Abstract: In the frontier research and application of current video surveillance technology, traditional camera systems exhibit significant limitations of response delay exceeding 200 ms in dynamic scenarios due to the insufficient deep feature extraction capability of automatic recognition algorithms and the efficiency bottleneck of computing architectures, failing to meet the real-time requirements in complex scenes. To address this issue, this study proposes a heterogeneous computing architecture based on Phytium processors and Cambricon accelerator cards, constructing a UAV tracking and gazing system with millisecond-level response capability. At the hardware level, the system adopts a collaborative computing architecture of Phytium FT-2000/4 processors and MLU220 accelerator cards, enhancing computing power through multi-card parallelism. At the software level, it innovatively integrates a lightweight YOLOv5s detection network with a DeepSORT cascaded tracking algorithm, forming a closed-loop control chain of "detection-tracking-feedback". Experimental results demonstrate that the system achieves a stable single-frame comprehensive processing delay of 50-100 ms in 1920*1080 resolution video stream processing, with a multi-scale target recognition accuracy of over 98.5%, featuring both low latency and high precision. This study provides an innovative solution for UAV monitoring and the application of domestic chips.

[87] A Re-ranking Method using K-nearest Weighted Fusion for Person Re-identification

Quang-Huy Che,Le-Chuong Nguyen,Gia-Nghia Tran,Dinh-Duy Phan,Vinh-Tiep Nguyen

Main category: cs.CV

TL;DR: This paper proposes an unsupervised re-ranking method using multi-view features to improve person re-identification accuracy and efficiency without additional model training.

Details Motivation: The motivation stems from the limitations of single-view feature-based re-ranking methods, which suffer from view bias and challenges like pose variation and occlusion. Method: The method involves generating multi-view features using the K-nearest Weighted Fusion (KWF) approach by aggregating neighboring features in an unsupervised manner, focusing on optimizing weight selection during feature aggregation. Result: The results show significant improvements in Rank@1 and mAP metrics on datasets like Market1501, MSMT17, and Occluded-DukeMTMC, with enhancements of 9.8% and 22.0% on MSMT17 and Occluded-DukeMTMC, respectively. Conclusion: The study concludes that the proposed re-ranking method effectively improves person re-identification accuracy and computational efficiency without requiring model fine-tuning or additional annotations. Abstract: In person re-identification, re-ranking is a crucial step to enhance the overall accuracy by refining the initial ranking of retrieved results. Previous studies have mainly focused on features from single-view images, which can cause view bias and issues like pose variation, viewpoint changes, and occlusions. Using multi-view features to present a person can help reduce view bias. In this work, we present an efficient re-ranking method that generates multi-view features by aggregating neighbors' features using K-nearest Weighted Fusion (KWF) method. Specifically, we hypothesize that features extracted from re-identification models are highly similar when representing the same identity. Thus, we select K neighboring features in an unsupervised manner to generate multi-view features. Additionally, this study explores the weight selection strategies during feature aggregation, allowing us to identify an effective strategy. Our re-ranking approach does not require model fine-tuning or extra annotations, making it applicable to large-scale datasets. 
We evaluate our method on the person re-identification datasets Market1501, MSMT17, and Occluded-DukeMTMC. The results show that our method significantly improves Rank@1 and mAP when re-ranking the top M candidates from the initial ranking results. Specifically, compared to the initial results, our re-ranking method achieves improvements of 9.8%/22.0% in Rank@1 on the challenging datasets: MSMT17 and Occluded-DukeMTMC, respectively. Furthermore, our approach demonstrates substantial enhancements in computational efficiency compared to other re-ranking methods.
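The aggregation step described in this abstract can be illustrated with a small sketch. It is written under stated assumptions and is not the authors' implementation: features are plain Python lists, neighbors are chosen by cosine similarity (each feature counts as its own nearest neighbor), and `weights` is a hypothetical list of positive per-rank weights that defaults to uniform. The function names `cosine` and `knn_weighted_fusion` are illustrative.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def knn_weighted_fusion(features, k, weights=None):
    """Replace each feature by a weighted average of its k nearest features.

    `weights[r]` is the (positive) weight given to the neighbor at rank r;
    rank 0 is the feature itself. Uniform weights are assumed by default.
    """
    if weights is None:
        weights = [1.0] * k
    fused = []
    for f in features:
        ranked = sorted(features, key=lambda g: cosine(f, g), reverse=True)[:k]
        pairs = list(zip(weights, ranked))
        total = sum(w for w, _ in pairs)
        fused.append([sum(w * g[d] for w, g in pairs) / total
                      for d in range(len(f))])
    return fused


# Two views of one identity plus one unrelated feature.
feats = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
fused = knn_weighted_fusion(feats, k=2)
```

No labels or training are involved, which matches the unsupervised setting the abstract describes; re-ranking would then compare the fused features instead of the original single-view ones. Passing a decaying `weights` list such as `[1.0, 0.5]` is one way to explore the weight-selection question the paper raises.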

[88] TEn-CATS: Text-Enriched Audio-Visual Video Parsing with Multi-Scale Category-Aware Temporal Graph

Yaru Chen,Faegheh Sardari,Peiliang Zhang,Ruohao Guo,Yang Xiang,Zhenbo Li,Wenwu Wang

Main category: cs.CV

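TL;DR: This paper proposes a new AVVP method that combines the BiT and CATS modules to address the pseudo-label noise and diffuse attention of existing methods, and it performs strongly on two benchmark datasets.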

Details Motivation: Existing audio-visual video parsing (AVVP) methods suffer from amplified pseudo-label noise and diffuse attention; the authors aim to address these problems by combining the strengths of the two existing research directions. Method: The paper uses the BiT module for semantic injection and dynamic calibration to extract cleaner and richer semantic cues, and the CATS module for semantic propagation and connection, enabling precise propagation of semantic information across time. Result: Experiments show that the proposed method achieves state-of-the-art performance on multiple key metrics. Conclusion: The paper proposes a new method that combines the Bi-Directional Text Fusion (BiT) module and the Category-Aware Temporal Graph (CATS) module, achieving state-of-the-art (SOTA) performance on multiple key metrics on two benchmark datasets, LLP and UnAV-100. Abstract: The Audio-Visual Video Parsing (AVVP) task aims to identify event categories and their occurrence times in a given video with weakly supervised labels. Existing methods typically fall into two categories: (i) designing enhanced architectures based on attention mechanisms for better temporal modeling, and (ii) generating richer pseudo-labels to compensate for the absence of frame-level annotations. However, methods of the first type treat noisy segment-level pseudo labels as reliable supervision, and methods of the second type let indiscriminate attention spread them across all frames, so the initial errors are repeatedly amplified during training. To address this issue, we propose a method that combines the Bi-Directional Text Fusion (BiT) module and the Category-Aware Temporal Graph (CATS) module. Specifically, we integrate the strengths and complementarity of the two previous research directions. We first perform semantic injection and dynamic calibration on audio and visual modality features through the BiT module, to locate and purify cleaner and richer semantic cues. Then, we leverage the CATS module for semantic propagation and connection to enable precise semantic information dissemination across time. Experimental results demonstrate that our proposed method achieves state-of-the-art (SOTA) performance in multiple key indicators on two benchmark datasets, LLP and UnAV-100.

[89] TriLiteNet: Lightweight Model for Multi-Task Visual Perception

Quang-Huy Che,Duc-Khai Lam

Main category: cs.CV

TL;DR: The study proposes TriLiteNet, an efficient multi-task perception model for Advanced Driver Assistance Systems (ADAS), demonstrating competitive performance with low computational costs and power consumption.

Details Motivation: Efficient perception models are essential for Advanced Driver Assistance Systems (ADAS), as these applications require rapid processing and response to ensure safety and effectiveness in real-world environments. The study aims to address the real-time execution needs of such perception models. Method: This study introduces the TriLiteNet model, which is designed to optimize performance while maintaining low computational costs. It simultaneously manages multiple tasks related to panoramic driving perception. Result: Experimental results on the BDD100k dataset demonstrate that the model achieves competitive performance across three key tasks: vehicle detection, drivable area segmentation, and lane line segmentation. TriLiteNet_{base} demonstrated a recall of 85.6% for vehicle detection, a mean Intersection over Union (mIoU) of 92.4% for drivable area segmentation, and an Acc of 82.3% for lane line segmentation with only 2.35M parameters and a computational cost of 7.72 GFLOPs. The tiny configuration of TriLiteNet has just 0.14M parameters, providing a multi-task solution with minimal computational demand. TriLiteNet in both configurations shows low latency and reasonable power during inference on embedded devices. Conclusion: TriLiteNet offers a practical and deployable solution for real-world autonomous driving applications by balancing performance, computational efficiency, and scalability. Abstract: Efficient perception models are essential for Advanced Driver Assistance Systems (ADAS), as these applications require rapid processing and response to ensure safety and effectiveness in real-world environments. To address the real-time execution needs of such perception models, this study introduces the TriLiteNet model. This model can simultaneously manage multiple tasks related to panoramic driving perception. TriLiteNet is designed to optimize performance while maintaining low computational costs. 
Experimental results on the BDD100k dataset demonstrate that the model achieves competitive performance across three key tasks: vehicle detection, drivable area segmentation, and lane line segmentation. Specifically, the TriLiteNet_{base} demonstrated a recall of 85.6% for vehicle detection, a mean Intersection over Union (mIoU) of 92.4% for drivable area segmentation, and an Acc of 82.3% for lane line segmentation with only 2.35M parameters and a computational cost of 7.72 GFLOPs. Our proposed model includes a tiny configuration with just 0.14M parameters, which provides a multi-task solution with minimal computational demand. Evaluated for latency and power consumption on embedded devices, TriLiteNet in both configurations shows low latency and reasonable power during inference. By balancing performance, computational efficiency, and scalability, TriLiteNet offers a practical and deployable solution for real-world autonomous driving applications. Code is available at https://github.com/chequanghuy/TriLiteNet.

[90] DVS-PedX: Synthetic-and-Real Event-Based Pedestrian Dataset

Mustafa Sakhai,Kaung Sithu,Min Khant Soe Oke,Maciej Wielgosz

Main category: cs.CV

TL;DR: DVS-PedX is a new dataset for pedestrian detection and intention analysis using event-based cameras, combining synthetic and real-world data to advance neuromorphic perception research.

Details Motivation: To enable research in pedestrian detection and crossing-intention analysis under varied conditions using event-based vision, which offers advantages like low latency and high dynamic range. Method: The dataset combines synthetic event streams from the CARLA simulator with real-world JAAD dash-cam videos converted to event streams using the v2e tool. It includes paired RGB frames, DVS event frames, and frame-level labels, along with raw event files and metadata. Result: The dataset provides structured data for both synthetic and real-world scenarios, and baseline experiments with spiking neural networks highlight the sim-to-real gap. Conclusion: DVS-PedX is a neuromorphic dataset designed to advance research in event-based pedestrian safety, intention prediction, and neuromorphic perception, highlighting the need for domain adaptation and multimodal fusion. Abstract: Event cameras like Dynamic Vision Sensors (DVS) report micro-timed brightness changes instead of full frames, offering low latency, high dynamic range, and motion robustness. DVS-PedX (Dynamic Vision Sensor Pedestrian eXploration) is a neuromorphic dataset designed for pedestrian detection and crossing-intention analysis in normal and adverse weather conditions across two complementary sources: (1) synthetic event streams generated in the CARLA simulator for controlled "approach-cross" scenes under varied weather and lighting; and (2) real-world JAAD dash-cam videos converted to event streams using the v2e tool, preserving natural behaviors and backgrounds. Each sequence includes paired RGB frames, per-frame DVS "event frames" (33 ms accumulations), and frame-level labels (crossing vs. not crossing). We also provide raw AEDAT 2.0/AEDAT 4.0 event files and AVI DVS video files and metadata for flexible re-processing. Baseline spiking neural networks (SNNs) using SpikingJelly illustrate dataset usability and reveal a sim-to-real gap, motivating domain adaptation and multimodal fusion. 
DVS-PedX aims to accelerate research in event-based pedestrian safety, intention prediction, and neuromorphic perception.
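The per-frame "event frames" mentioned above are accumulations of events over 33 ms windows. The sketch below shows one simple way such an accumulation could work; it is an assumption-laden illustration, not the dataset's actual tooling. Events are assumed to be time-sorted tuples `(t_us, x, y, polarity)` with timestamps in microseconds and polarity +1 or -1, and the function name `accumulate_event_frames` is hypothetical.

```python
def accumulate_event_frames(events, width, height, window_us=33_000):
    """Sum event polarities per pixel over consecutive fixed-length windows.

    `events` is a time-sorted list of (t_us, x, y, polarity) tuples.
    Returns a list of frames, each a height x width grid of integer sums.
    """
    def blank():
        return [[0] * width for _ in range(height)]

    frames = []
    if not events:
        return frames
    window_start = events[0][0]
    frame = blank()
    for t, x, y, polarity in events:
        # Close every window that ended before this event arrived.
        while t - window_start >= window_us:
            frames.append(frame)
            frame = blank()
            window_start += window_us
        frame[y][x] += polarity
    frames.append(frame)
    return frames


# Three events on a 2x2 sensor; the last one falls into the second window.
events = [(0, 0, 0, 1), (10_000, 1, 0, -1), (40_000, 0, 1, 1)]
frames = accumulate_event_frames(events, width=2, height=2)
```

Signed summation is only one possible encoding; counting ON and OFF events in separate channels is a common alternative, and which one the released frames use is not stated in the abstract.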

[91] TaleDiffusion: Multi-Character Story Generation with Dialogue Rendering

Ayan Banerjee,Josep Lladós,Umapada Pal,Anjan Dutta

Main category: cs.CV

TL;DR: TaleDiffusion is a new framework for text-to-story visualization that ensures character consistency, minimizes artifacts, and accurately renders dialogues using a combination of LLMs and attention mechanisms.

Details Motivation: Text-to-story visualization is challenging due to inconsistencies and artifacts in existing methods. This work aims to address these issues by introducing a framework that ensures consistent character interaction and accurate dialogue rendering. Method: TaleDiffusion uses a pre-trained LLM for generating per-frame descriptions and dialogues, a bounded attention-based per-box mask technique for controlling interactions, identity-consistent self-attention for consistency across frames, and region-aware cross-attention for object placement. CLIPSeg is used for dialogue assignment. Result: Experimental results show that TaleDiffusion surpasses existing approaches in terms of character consistency, noise reduction, and dialogue rendering accuracy. Conclusion: TaleDiffusion provides an effective solution for text-to-story visualization by improving character consistency, reducing artifacts, and accurately rendering dialogues, outperforming existing methods. Abstract: Text-to-story visualization is challenging due to the need for consistent interaction among multiple characters across frames. Existing methods struggle with character consistency, leading to artifact generation and inaccurate dialogue rendering, which results in disjointed storytelling. In response, we introduce TaleDiffusion, a novel framework for generating multi-character stories with an iterative process, maintaining character consistency, and accurate dialogue assignment via postprocessing. Given a story, we use a pre-trained LLM to generate per-frame descriptions, character details, and dialogues via in-context learning, followed by a bounded attention-based per-box mask technique to control character interactions and minimize artifacts. We then apply an identity-consistent self-attention mechanism to ensure character consistency across frames and region-aware cross-attention for precise object placement. 
Dialogues are also rendered as bubbles and assigned to characters via CLIPSeg. Experimental results demonstrate that TaleDiffusion outperforms existing methods in consistency, noise reduction, and dialogue rendering.

[92] MEPG:Multi-Expert Planning and Generation for Compositionally-Rich Image Generation

Yuan Zhao,Liu Lin

Main category: cs.CV

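TL;DR: This paper proposes MEPG, a new text-to-image generation framework that combines a position- and style-aware LLM with a multi-expert diffusion module, significantly improving the quality and stylistic diversity of generated images.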

Details Motivation: Text-to-image diffusion models still fall short on complex multi-element prompts and stylistic diversity, so a more effective approach is needed to improve generation. Method: The paper proposes the Multi-Expert Planning and Generation framework (MEPG), which comprises a Position-Style-Aware (PSA) module that decomposes the input prompt and a Multi-Expert Diffusion (MED) module that performs cross-region generation. Result: Experiments show that, with the same backbone, MEPG significantly outperforms baseline models in both image quality and style diversity. Conclusion: MEPG is an extensible framework that, by combining a position- and style-aware LLM with a multi-expert diffusion module, significantly improves the quality and stylistic diversity of text-to-image generation. Abstract: Text-to-image diffusion models have achieved remarkable image quality, but they still struggle with complex, multi-element prompts and limited stylistic diversity. To address these limitations, we propose a Multi-Expert Planning and Generation Framework (MEPG) that synergistically integrates position- and style-aware large language models (LLMs) with spatial-semantic expert modules. The framework comprises two core components: (1) a Position-Style-Aware (PSA) module that utilizes a supervised fine-tuned LLM to decompose input prompts into precise spatial coordinates and style-encoded semantic instructions; and (2) a Multi-Expert Diffusion (MED) module that implements cross-region generation through dynamic expert routing across both local regions and global areas. During the generation process for each local region, specialized models (e.g., realism experts, stylization specialists) are selectively activated for each spatial partition via attention-based gating mechanisms. The architecture supports lightweight integration and replacement of expert models, providing strong extensibility. Additionally, an interactive interface enables real-time spatial layout editing and per-region style selection from a portfolio of experts. Experiments show that MEPG significantly outperforms baseline models with the same backbone in both image quality and style diversity.

[93] Revisiting Simple Baselines for In-The-Wild Deepfake Detection

Orlando Castaneda,Kevin So-Tang,Kshitij Gurung

Main category: cs.CV

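TL;DR: With better-tuned hyperparameters, the method of Ojha et al. reaches 81% accuracy on the more realistic Deepfake-Eval-2024 benchmark, approaching the performance of commercial detectors.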

Details Motivation: Most existing research evaluates deepfake detectors on highly controlled datasets; this paper focuses on the recently released "in-the-wild" benchmark Deepfake-Eval-2024, aiming to provide a more realistic evaluation and accessible detectors. Method: The paper revisits the approach of Ojha et al., which adapts standard pretrained vision backbones to produce generalizable deepfake detectors, and improves its performance through better-tuned hyperparameters. Result: On Deepfake-Eval-2024, the simple approach with better-tuned hyperparameters reaches 81% accuracy, surpassing the previously reported accuracies of 61%-69% and rivaling the leading commercial detector (82%). Conclusion: With tuned hyperparameters, the approach of Ojha et al. reaches 81% accuracy on Deepfake-Eval-2024, surpassing the previously reported accuracy of this baseline by 18% and competing with commercial deepfake detectors. Abstract: The widespread adoption of synthetic media demands accessible deepfake detectors and realistic benchmarks. While most existing research evaluates deepfake detectors on highly controlled datasets, we focus on the recently released "in-the-wild" benchmark, Deepfake-Eval-2024. Initial reporting on Deepfake-Eval-2024 showed that three finetuned open-source models achieve accuracies between 61% and 69%, significantly lagging behind the leading commercial deepfake detector with 82% accuracy. Our work revisits one of these baseline approaches, originally introduced by Ojha et al., which adapts standard pretrained vision backbones to produce generalizable deepfake detectors. We demonstrate that with better-tuned hyperparameters, this simple approach actually yields much higher performance -- 81% accuracy on Deepfake-Eval-2024 -- surpassing the previously reported accuracy of this baseline approach by 18% and competing with commercial deepfake detectors. We discuss tradeoffs in accuracy, computational costs, and interpretability, focusing on how practical these deepfake detectors might be when deployed in real-world settings. Our code can be found at https://github.com/Deepfake-Detection-KKO/deepfake-detection.

[94] YOLO Ensemble for UAV-based Multispectral Defect Detection in Wind Turbine Components

Serhii Svystun,Pavlo Radiuk,Oleksandr Melnychenko,Oleg Savenko,Anatoliy Sachenko

Main category: cs.CV

TL;DR: This research proposes an ensemble of YOLO-based models to improve defect detection in wind power plants by integrating visible and thermal data, achieving higher accuracy than a standalone YOLOv8 model.

Details Motivation: Reliable defect detection in wind power plants requires high-resolution data and efficient processing of multispectral imagery, which current methods may not adequately address. Method: An ensemble of YOLO-based deep learning models integrating visible and thermal channels was developed, using a bounding box fusion algorithm to combine predictions. Result: The proposed approach achieved an mAP@.5 of 0.93 and an F1-score of 0.90, outperforming the standalone YOLOv8 model which scored an mAP@.5 of 0.91. Conclusion: The ensemble approach combining YOLOv8 and a specialized thermal model effectively improves the detection of defects in wind power plants compared to a standalone YOLOv8 model. Abstract: Unmanned aerial vehicles (UAVs) equipped with advanced sensors have opened up new opportunities for monitoring wind power plants, including blades, towers, and other critical components. However, reliable defect detection requires high-resolution data and efficient methods to process multispectral imagery. In this research, we aim to enhance defect detection accuracy through the development of an ensemble of YOLO-based deep learning models that integrate both visible and thermal channels. We propose an ensemble approach that integrates a general-purpose YOLOv8 model with a specialized thermal model, using a sophisticated bounding box fusion algorithm to combine their predictions. Our experiments show this approach achieves a mean Average Precision (mAP@.5) of 0.93 and an F1-score of 0.90, outperforming a standalone YOLOv8 model, which scored an mAP@.5 of 0.91. These findings demonstrate that combining multiple YOLO architectures with fused multispectral data provides a more reliable solution, improving the detection of both visual and thermal defects.
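The abstract does not specify the bounding box fusion algorithm, so the following is only a generic sketch of the idea under explicit assumptions: detections from the two models are tuples `(x1, y1, x2, y2, score)`, pairs are matched greedily when their IoU reaches a hypothetical threshold `iou_thr`, matched boxes are averaged with confidence weights, and unmatched boxes from either model are kept. The names `iou` and `fuse_detections` are illustrative and are not taken from the paper.

```python
def iou(a, b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0


def fuse_detections(dets_a, dets_b, iou_thr=0.5):
    """Greedily merge detections from two models (assumed fusion rule)."""
    fused, used_b = [], set()
    for a in dets_a:
        best_j, best_iou = None, iou_thr
        for j, b in enumerate(dets_b):
            overlap = iou(a[:4], b[:4])
            if j not in used_b and overlap >= best_iou:
                best_j, best_iou = j, overlap
        if best_j is None:
            fused.append(tuple(a))  # no partner: keep the box unchanged
            continue
        b = dets_b[best_j]
        used_b.add(best_j)
        total = a[4] + b[4]
        if total > 0:
            # Confidence-weighted average of the four coordinates.
            box = tuple((a[i] * a[4] + b[i] * b[4]) / total for i in range(4))
        else:
            box = tuple(a[:4])
        fused.append(box + (max(a[4], b[4]),))
    fused.extend(tuple(b) for j, b in enumerate(dets_b) if j not in used_b)
    return fused


visible = [(0.0, 0.0, 10.0, 10.0, 0.8)]
thermal = [(1.0, 1.0, 11.0, 11.0, 0.4), (50.0, 50.0, 60.0, 60.0, 0.9)]
merged = fuse_detections(visible, thermal)
```

Keeping unmatched boxes is what lets a thermal-only defect survive the fusion; whether the paper's method does the same, or how it scores merged boxes, is not stated in the abstract.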

[95] VisioFirm: Cross-Platform AI-assisted Annotation Tool for Computer Vision

Safouane El Ghazouali,Umberto Michelucci

Main category: cs.CV

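TL;DR: VisioFirm is an AI-assisted image annotation tool that integrates state-of-the-art foundation models with a filtering pipeline, greatly reducing manual annotation effort and improving annotation efficiency.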

Details Motivation: Annotation is usually a labor-intensive step, and traditional tools require extensive manual input, which limits scalability for large datasets; a tool that improves annotation efficiency is therefore needed. Method: VisioFirm uses a hybrid approach that combines CLIP with pre-trained detectors (such as Ultralytics models) and zero-shot models (such as Grounding DINO) to generate initial annotations automatically, maximizing recall through low-confidence thresholding and refining the results with interactive tools. Result: Tests show that VisioFirm's initial predictions on COCO-type classes are mostly correct, that clustering of CLIP-based disambiguation components together with an IoU graph suppresses redundant detections while maintaining high annotation accuracy, and that on-the-fly segmentation runs efficiently in the browser. Conclusion: VisioFirm is an open-source web application that reduces manual annotation effort by up to 90% through AI-assisted automation while maintaining high annotation accuracy, and it supports multiple export formats and offline operation. Abstract: AI models rely on annotated data to learn patterns and perform prediction. Annotation is usually a labor-intensive step that requires associating labels, ranging from a simple classification label to more complex tasks such as object detection, oriented bounding box estimation, and instance segmentation. Traditional tools often require extensive manual input, limiting scalability for large datasets. To address this, we introduce VisioFirm, an open-source web application designed to streamline image labeling through AI-assisted automation. VisioFirm integrates state-of-the-art foundation models into an interface with a filtering pipeline to reduce human-in-the-loop efforts. This hybrid approach employs CLIP combined with pre-trained detectors like Ultralytics models for common classes and zero-shot models such as Grounding DINO for custom labels, generating initial annotations with low-confidence thresholding to maximize recall. Through this framework, when tested on COCO-type classes, initial predictions have proven to be mostly correct, though users can refine them via interactive tools supporting bounding boxes, oriented bounding boxes, and polygons. Additionally, VisioFirm has on-the-fly segmentation powered by Segment Anything accelerated through WebGPU for browser-side efficiency. The tool supports multiple export formats (YOLO, COCO, Pascal VOC, CSV) and operates offline after model caching, enhancing accessibility.
VisioFirm demonstrates up to 90% reduction in manual effort through benchmarks on diverse datasets, while maintaining high annotation accuracy via clustering of connected CLIP-based disambiguation components and an IoU graph for redundant detection suppression. VisioFirm can be accessed from https://github.com/OschAI/VisioFirm.

[96] DUDE: Diffusion-Based Unsupervised Cross-Domain Image Retrieval

Ruohong Yang,Peng Hu,Yunfan Li,Xi Peng

Main category: cs.CV

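TL;DR: DUDE is a new unsupervised cross-domain image retrieval method that uses a text-to-image generative model to disentangle object features from domain-specific styles, achieving state-of-the-art performance on three benchmark datasets spanning 13 domains.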

Details Motivation: Existing methods struggle with the domain gap in cross-domain image retrieval because object features are often entangled with domain-specific styles. Method: DUDE uses a text-to-image generative model to disentangle object features from domain-specific styles, and achieves reliable alignment by progressively aligning mutual neighbors, first within domains and then across domains. Result: Extensive experiments on three benchmark datasets show that DUDE achieves state-of-the-art performance across 13 domains. Conclusion: Through feature disentanglement and a progressive alignment strategy, DUDE effectively addresses the domain gap in cross-domain image retrieval and performs strongly. Abstract: Unsupervised cross-domain image retrieval (UCIR) aims to retrieve images of the same category across diverse domains without relying on annotations. Existing UCIR methods, which align cross-domain features for the entire image, often struggle with the domain gap, as the object features critical for retrieval are frequently entangled with domain-specific styles. To address this challenge, we propose DUDE, a novel UCIR method building upon feature disentanglement. In brief, DUDE leverages a text-to-image generative model to disentangle object features from domain-specific styles, thus facilitating semantic image retrieval. To further achieve reliable alignment of the disentangled object features, DUDE aligns mutual neighbors from within domains to across domains in a progressive manner. Extensive experiments demonstrate that DUDE achieves state-of-the-art performance across three benchmark datasets over 13 domains. The code will be released.

[97] Learning Active Perception via Self-Evolving Preference Optimization for GUI Grounding

Wanfu Wang,Qipeng Huang,Guangquan Xue,Xiaobo Liang,Juntao Li

Main category: cs.CV

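TL;DR: This paper proposes LASER, a self-evolving framework that combines Monte Carlo quality estimation with IoU-based region quality evaluation to progressively endow VLMs with multi-step perception capabilities, enabling precise coordinate prediction in GUI grounding and reaching new SoTA performance.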

Details Motivation: Although VLMs have made significant progress in bridging visual perception and linguistic reasoning, enabling them to reason effectively over the appropriate image regions remains a core challenge in GUI grounding, particularly under high-resolution inputs and complex multi-element visual interactions. Method: The paper proposes LASER, a self-evolving framework that combines Monte Carlo quality estimation with IoU-based region quality evaluation to progressively endow VLMs with multi-step perception capabilities for precise coordinate prediction. Result: Comprehensive experiments on the ScreenSpot Pro and ScreenSpot-v2 benchmarks validate the method; when fine-tuned on GTA1-7B, LASER scores 55.7 on the ScreenSpot-Pro benchmark, the SoTA among 7B-scale models. Conclusion: By integrating Monte Carlo quality estimation with IoU-based region quality evaluation, LASER significantly improves VLM performance on GUI grounding, achieves precise coordinate prediction, and sets a new SoTA among 7B-scale models on the ScreenSpot-Pro benchmark. Abstract: Vision Language Models (VLMs) have recently achieved significant progress in bridging visual perception and linguistic reasoning. Recently, the OpenAI o3 model introduced a zoom-in search strategy that effectively elicits active perception capabilities in VLMs, improving downstream task performance. However, enabling VLMs to reason effectively over appropriate image regions remains a core challenge in GUI grounding, particularly under high-resolution inputs and complex multi-element visual interactions. In this work, we propose LASER, a self-evolving framework that progressively endows VLMs with multi-step perception capabilities, enabling precise coordinate prediction. Specifically, our approach integrates Monte Carlo quality estimation with Intersection-over-Union (IoU)-based region quality evaluation to jointly encourage both accuracy and diversity in constructing high-quality preference data. This combination explicitly guides the model to focus on instruction-relevant key regions while adaptively allocating reasoning steps based on task complexity. Comprehensive experiments on the ScreenSpot Pro and ScreenSpot-v2 benchmarks demonstrate consistent performance gains, validating the effectiveness of our method. Furthermore, when fine-tuned on GTA1-7B, LASER achieves a score of 55.7 on the ScreenSpot-Pro benchmark, establishing a new state-of-the-art (SoTA) among 7B-scale models.

[98] Differential Morphological Profile Neural Networks for Semantic Segmentation

David Huangal,J. Alex Hurt

Main category: cs.CV

TL;DR: This paper explores the integration of differential morphological profiles (DMP) into modern semantic segmentation networks to improve performance on remote sensing imagery, showing that hybrid DMP architectures can outperform traditional models.

Details Motivation: State-of-the-art segmentation networks are primarily designed for ground-perspective images and do not address specific challenges of remote sensing imagery such as scale variation, foreground-background imbalance, and large image sizes. Method: The authors integrated differential morphological profile (DMP) features into three state-of-the-art semantic segmentation architectures using direct input and hybrid (dual-stream) approaches, evaluating them on the iSAID benchmark dataset. Result: Hybrid DMP architectures consistently outperformed direct-input variants and were capable of surpassing non-DMP models in mIoU, F1, and Recall metrics. Conclusion: The study concludes that integrating DMP features using hybrid architectures enhances the performance of semantic segmentation models on overhead remote sensing imagery, surpassing non-DMP models in certain metrics. Abstract: Semantic segmentation of overhead remote sensing imagery enables applications in mapping, urban planning, and disaster response. State-of-the-art segmentation networks are typically developed and tuned on ground-perspective photographs and do not directly address remote sensing challenges such as extreme scale variation, foreground-background imbalance, and large image sizes. We explore the incorporation of the differential morphological profile (DMP), a multi-scale shape extraction method based on grayscale morphology, into modern segmentation networks. Prior studies have shown that the DMP can provide critical shape information to Deep Neural Networks to enable superior detection and classification performance in overhead imagery. In this work, we extend prior DMPNet work beyond classification and object detection by integrating DMP features into three state-of-the-art convolutional and transformer semantic segmentation architectures. 
We utilize both direct input, which adapts the input stem of feature extraction architectures to accept DMP channels, and hybrid architectures, a dual-stream design that fuses RGB and DMP encoders. Using the iSAID benchmark dataset, we evaluate a variety of DMP differentials and structuring element shapes to more effectively provide shape information to the model. Our results show that while non-DMP models generally outperform the direct-input variants, hybrid DMP consistently outperforms direct-input and is capable of surpassing a non-DMP model on mIoU, F1, and Recall.
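For readers unfamiliar with the differential morphological profile, the sketch below computes an opening-based profile on a tiny grayscale image. It is a simplified illustration and not the DMPNet code: it assumes square structuring elements of increasing radius, uses a plain opening (erosion followed by dilation) where the classical DMP uses opening by reconstruction, and omits the closing half of the profile. The function names are illustrative.

```python
def _window_filter(img, radius, reduce_fn):
    """Apply min or max over a square window, clipped at the image border."""
    h, w = len(img), len(img[0])
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            window = [img[yy][xx]
                      for yy in range(max(0, y - radius), min(h, y + radius + 1))
                      for xx in range(max(0, x - radius), min(w, x + radius + 1))]
            out[y][x] = reduce_fn(window)
    return out


def opening(img, radius):
    """Grayscale opening: erosion (min filter) then dilation (max filter)."""
    return _window_filter(_window_filter(img, radius, min), radius, max)


def opening_profile(img, radii):
    """Absolute differences between openings at successive scales."""
    profile, previous = [], img
    for radius in radii:
        current = opening(img, radius)
        profile.append([[abs(p - c) for p, c in zip(prev_row, cur_row)]
                        for prev_row, cur_row in zip(previous, current)])
        previous = current
    return profile


# A 3x3 bright block on a 7x7 dark background.
image = [[7 if 2 <= y <= 4 and 2 <= x <= 4 else 0 for x in range(7)]
         for y in range(7)]
profile = opening_profile(image, radii=[1, 2])
```

The 3x3 block survives the radius-1 opening and disappears at radius 2, so it appears only in the second differential; this scale-selective response is the shape cue that either the direct-input or the hybrid design would expose to the network as extra channels.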

[99] TauGenNet: Plasma-Driven Tau PET Image Synthesis via Text-Guided 3D Diffusion Models

Yuxin Gong,Se-in Jang,Wei Shao,Yi Su,Kuang Gong

Main category: cs.CV

TL;DR: This paper presents a text-guided 3D diffusion model for synthesizing 3D tau PET images using multimodal conditions from structural MRI and plasma measurements, providing a non-invasive, cost-effective alternative for visualizing tau pathology.

Details Motivation: The high cost and limited availability of tau PET scans restrict their widespread use, while structural MRI and plasma-based biomarkers provide non-invasive and widely available complementary information. Method: A text-guided 3D diffusion model was proposed for 3D tau PET image synthesis, leveraging multimodal conditions from both structural MRI and plasma measurement. Result: Experimental results demonstrate that the approach can generate realistic, clinically meaningful 3D tau PET images across a range of disease stages. Conclusion: The proposed framework can generate realistic, clinically meaningful 3D tau PET images and provides a non-invasive, cost-effective alternative for visualizing tau pathology. Abstract: Accurate quantification of tau pathology via tau positron emission tomography (PET) scan is crucial for diagnosing and monitoring Alzheimer's disease (AD). However, the high cost and limited availability of tau PET restrict its widespread use. In contrast, structural magnetic resonance imaging (MRI) and plasma-based biomarkers provide non-invasive and widely available complementary information related to brain anatomy and disease progression. In this work, we propose a text-guided 3D diffusion model for 3D tau PET image synthesis, leveraging multimodal conditions from both structural MRI and plasma measurement. Specifically, the textual prompt is from the plasma p-tau217 measurement, which is a key indicator of AD progression, while MRI provides anatomical structure constraints. The proposed framework is trained and evaluated using clinical AV1451 tau PET data from the Alzheimer's Disease Neuroimaging Initiative (ADNI) database. Experimental results demonstrate that our approach can generate realistic, clinically meaningful 3D tau PET across a range of disease stages. 
The proposed framework can help perform tau PET data augmentation under different settings, provide a non-invasive, cost-effective alternative for visualizing tau pathology, and support the simulation of disease progression under varying plasma biomarker levels and cognitive conditions.

[100] Dual-Scale Volume Priors with Wasserstein-Based Consistency for Semi-Supervised Medical Image Segmentation

Junying Meng,Gangxuan Zhou,Jun Liu,Weihong Guo

Main category: cs.CV

TL;DR: This paper proposes a semi-supervised medical image segmentation method that incorporates spatial regularization and volume priors, achieving superior performance on multiple datasets.

Details Motivation: Most existing segmentation networks lack effective methodological guidance for feature extraction and ignore important prior information from datasets in semi-supervised medical image segmentation. Method: The approach integrates an explicit volume prior and Threshold Dynamics spatial regularization into the segmentation network, using Wasserstein distance constraints to align class ratios and enforce volume distribution similarity. Result: Experimental results on the 2017 ACDC dataset, PROMISE12 dataset, and thigh muscle MR image dataset show that the proposed method outperforms existing approaches. Conclusion: The proposed semi-supervised medical image segmentation framework successfully integrates spatial regularization methods and volume priors, demonstrating superior performance on multiple datasets. Abstract: Despite significant progress in semi-supervised medical image segmentation, most existing segmentation networks overlook effective methodological guidance for feature extraction and important prior information from datasets. In this paper, we develop a semi-supervised medical image segmentation framework that effectively integrates spatial regularization methods and volume priors. Specifically, our approach integrates a strong explicit volume prior at the image scale and Threshold Dynamics spatial regularization, both derived from variational models, into the backbone segmentation network. The target region volumes for each unlabeled image are estimated by a regression network, which effectively regularizes the backbone segmentation network through an image-scale Wasserstein distance constraint, ensuring that the class ratios in the segmentation results for each unlabeled image match those predicted by the regression network. 
Additionally, we design a dataset-scale Wasserstein distance loss function based on a weak implicit volume prior, which enforces that the volume distribution predicted for the unlabeled dataset is similar to that of labeled dataset. Experimental results on the 2017 ACDC dataset, PROMISE12 dataset, and thigh muscle MR image dataset show the superiority of the proposed method.
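
The dataset-scale constraint above compares two distributions of region volumes. As a minimal sketch of that idea, and not the authors' loss, the following computes the Wasserstein-1 distance between two equal-size empirical samples of one-dimensional volume fractions, where the optimal transport plan simply pairs the sorted values. The function name and the sample values are hypothetical.

```python
def wasserstein_1d(a, b):
    """Wasserstein-1 distance between two equal-size empirical 1D samples."""
    if len(a) != len(b) or len(a) == 0:
        raise ValueError("samples must be non-empty and of equal size")
    # In one dimension, optimal transport between equal-size samples
    # matches the i-th smallest value of one to the i-th smallest of the other.
    return sum(abs(x - y) for x, y in zip(sorted(a), sorted(b))) / len(a)


# Hypothetical foreground-volume fractions (share of pixels in the target class).
labeled_volumes = [0.10, 0.12, 0.15, 0.20]      # from ground-truth masks
predicted_volumes = [0.22, 0.11, 0.09, 0.16]    # from predictions on unlabeled images

loss = wasserstein_1d(labeled_volumes, predicted_volumes)
print(f"dataset-scale volume discrepancy: {loss:.4f}")
```

In a training setting this quantity would have to be computed on differentiable soft volumes (for example, mean softmax probabilities); the plain-Python version here only illustrates the distance itself.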

[101] Self-adaptive Dataset Construction for Real-World Multimodal Safety Scenarios

Jingen Qu,Lijun Li,Bo Zhang,Yichen Yan,Jing Shao

Main category: cs.CV

TL;DR: This paper presents an innovative image-oriented approach to constructing multimodal safety datasets, addressing the limitations of current methods and introducing a standardized evaluation metric to validate its effectiveness.

Details Motivation: Current dataset construction methods for multimodal large language models (MLLMs) are risk-oriented and insufficient to address the complexity of real-world multimodal safety scenarios (RMS). Additionally, the lack of a unified evaluation metric limits the validation of these methods' overall effectiveness. Method: The research proposes an image-oriented self-adaptive dataset construction method, starting with images to generate paired text and guidance responses. It also introduces a standardized evaluation metric involving the fine-tuning of a safety judge model to assess capabilities across safety datasets. Result: Using the proposed method, the researchers generated an RMS dataset containing 35k image-text pairs with guidance responses. The experiments demonstrated the effectiveness of the image-oriented pipeline, confirming its scalability and potential for improving real-world multimodal safety dataset construction. Conclusion: The study validates the scalability and effectiveness of the image-oriented approach in constructing real-world multimodal safety datasets, offering a new perspective and methodology for future dataset development. Abstract: Multimodal large language models (MLLMs) are rapidly evolving, presenting increasingly complex safety challenges. However, current dataset construction methods, which are risk-oriented, fail to cover the growing complexity of real-world multimodal safety scenarios (RMS). Moreover, due to the lack of a unified evaluation metric, their overall effectiveness remains unproven. This paper introduces a novel image-oriented self-adaptive dataset construction method for RMS, which starts from images and ends by constructing paired text and guidance responses. Using the image-oriented method, we automatically generate an RMS dataset comprising 35k image-text pairs with guidance responses. 
Additionally, we introduce a standardized safety dataset evaluation metric: fine-tuning a safety judge model and evaluating its capabilities on other safety datasets. Extensive experiments on various tasks demonstrate the effectiveness of the proposed image-oriented pipeline. The results confirm the scalability and effectiveness of the image-oriented approach, offering a new perspective for the construction of real-world multimodal safety datasets.

[102] PAOLI: Pose-free Articulated Object Learning from Sparse-view Images

Jianning Deng,Kartic Subr,Hakan Bilen

Main category: cs.CV

TL;DR: Proposes a new self-supervised framework that learns articulated object representations from sparse-view, unposed images.

Details Motivation: Prior methods require dense multi-view observations and ground-truth camera poses, whereas this approach learns from sparse views without camera supervision. Method: The method reconstructs each articulation independently, learns a deformation field to establish dense correspondences, uses a progressive disentanglement strategy to separate static from moving parts, and finally jointly optimizes geometry, appearance, and kinematics with a self-supervised loss. Result: Experiments on the standard benchmark and real-world examples show that the method produces accurate and detailed articulated object representations. Conclusion: The method produces accurate and detailed articulated object representations under significantly weaker input assumptions than existing approaches. Abstract: We present a novel self-supervised framework for learning articulated object representations from sparse-view, unposed images. Unlike prior methods that require dense multi-view observations and ground-truth camera poses, our approach operates with as few as four views per articulation and no camera supervision. To address the inherent challenges, we first reconstruct each articulation independently using recent advances in sparse-view 3D reconstruction, then learn a deformation field that establishes dense correspondences across poses. A progressive disentanglement strategy further separates static from moving parts, enabling robust separation of camera and object motion. Finally, we jointly optimize geometry, appearance, and kinematics with a self-supervised loss that enforces cross-view and cross-pose consistency. Experiments on the standard benchmark and real-world examples demonstrate that our method produces accurate and detailed articulated object representations under significantly weaker input assumptions than existing approaches.

[103] The Telephone Game: Evaluating Semantic Drift in Unified Models

Sabbir Mollah,Rohit Gupta,Sirnam Swetha,Qingyang Liu,Ahnaf Munir,Mubarak Shah

Main category: cs.CV

TL;DR: This paper introduces UCF-UM, a cyclic evaluation framework for unified visual language models, highlighting the importance of consistency between image-to-text and text-to-image tasks and proposing new metrics to assess cross-modal stability.

Details Motivation: The authors aimed to address the lack of evaluation methods that assess the consistency between visual understanding and generation in unified models, as existing metrics only evaluate these capabilities in isolation. Method: The authors introduced a cyclic evaluation framework called UCF-UM, which alternates between image-to-text and text-to-image generations to quantify semantic drift. Three metrics were proposed: Mean Cumulative Drift (MCD), Semantic Drift Rate (SDR), and Multi-Generation GenEval (MGG). Evaluations were conducted on a new benchmark ND400 and seven recent models. Result: The evaluation revealed significant variation in cross-modal stability among models. Some models (e.g., BAGEL) maintained semantic consistency over multiple generations, while others (e.g., Vila-u) exhibited rapid drift despite strong single-pass performance. Conclusion: The study concludes that cyclic consistency is essential for evaluating cross-modal stability in unified visual language models, and proposes practical metrics for assessing such models' performance. Abstract: Employing a single, unified model (UM) for both visual understanding (image-to-text: I2T) and visual generation (text-to-image: T2I) has opened a new direction in Visual Language Model (VLM) research. While UMs can also support broader unimodal tasks (e.g., text-to-text, image-to-image), we focus on the core cross-modal pair T2I and I2T, as consistency between understanding and generation is critical for downstream use. Existing evaluations consider these capabilities in isolation: FID and GenEval for T2I, and benchmarks such as MME and MMBench for I2T. These single-pass metrics do not reveal whether a model that understands a concept can also render it, nor whether meaning is preserved when cycling between image and text modalities. 
To address this, we introduce the Unified Consistency Framework for Unified Models (UCF-UM), a cyclic evaluation protocol that alternates I2T and T2I over multiple generations to quantify semantic drift. UCF-UM formulates three metrics: (i) Mean Cumulative Drift (MCD), an embedding-based measure of overall semantic loss; (ii) Semantic Drift Rate (SDR), which summarizes the semantic decay rate; and (iii) Multi-Generation GenEval (MGG), an object-level compliance score extending GenEval. To assess generalization beyond COCO, which is widely used in training, we create a new benchmark, ND400, sampled from NoCaps and DOCCI, and evaluate seven recent models. UCF-UM reveals substantial variation in cross-modal stability: some models like BAGEL maintain semantics over many alternations, whereas others like Vila-u drift quickly despite strong single-pass scores. Our results highlight cyclic consistency as a necessary complement to standard I2T and T2I evaluations, and provide practical metrics to consistently assess unified models' cross-modal stability and the strength of their shared representations. Code: https://github.com/mollahsabbir/Semantic-Drift-in-Unified-Models
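
The abstract does not give the formulas for MCD, SDR, or MGG, so the sketch below is only a generic illustration of an embedding-based drift score over a cycle of generations, not the paper's metric. It assumes each generation has already been embedded into a vector by some encoder (not shown) and averages one minus the cosine similarity to the original input. All names and values are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def mean_drift(original, generations):
    """Average (1 - cosine similarity) between the original embedding and
    the embedding of each later generation in the I2T/T2I cycle."""
    if not generations:
        raise ValueError("need at least one generation")
    return sum(1.0 - cosine(original, g) for g in generations) / len(generations)


# Toy 2D embeddings: the first generation matches the input,
# the second is orthogonal to it (complete semantic loss).
score = mean_drift([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0]])
print(score)
```

A score near 0 would indicate that meaning is preserved across alternations, while larger values indicate drift.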

[104] Noisy Label Refinement with Semantically Reliable Synthetic Images

Yingxuan Li,Jiafeng Mao,Yusuke Matsui

Main category: cs.CV

TL;DR: This paper proposes a method to address semantic noise in image classification by leveraging synthetic images as reference points to correct mislabeled data, achieving significant accuracy improvements when combined with existing noise-robust techniques.

Details Motivation: Semantic noise, where visually similar categories are mislabeled, poses a significant challenge to conventional supervised learning approaches. Advanced text-to-image models provide high-quality synthetic images with reliable labels, but their direct application is limited by domain gaps and diversity constraints. Method: A novel method that leverages synthetic images as reliable reference points to identify and correct mislabeled samples in noisy datasets. Result: Extensive experiments show that the approach significantly improves classification accuracy under various noise conditions, especially in scenarios with semantic label noise. It improves accuracy by 30% on CIFAR-10, 11% on CIFAR-100 under 70% semantic noise, and 24% on ImageNet-100 under real-world noise conditions when combined with state-of-the-art noise-robust methods. Conclusion: The proposed method effectively addresses semantic noise in image classification datasets by using synthetic images as reference points to correct mislabeled samples, and it demonstrates significant improvements in accuracy when combined with existing noise-robust techniques. Abstract: Semantic noise in image classification datasets, where visually similar categories are frequently mislabeled, poses a significant challenge to conventional supervised learning approaches. In this paper, we explore the potential of using synthetic images generated by advanced text-to-image models to address this issue. Although these high-quality synthetic images come with reliable labels, their direct application in training is limited by domain gaps and diversity constraints. Unlike conventional approaches, we propose a novel method that leverages synthetic images as reliable reference points to identify and correct mislabeled samples in noisy datasets. 
Extensive experiments across multiple benchmark datasets show that our approach significantly improves classification accuracy under various noise conditions, especially in challenging scenarios with semantic label noise. Additionally, since our method is orthogonal to existing noise-robust learning techniques, when combined with state-of-the-art noise-robust training methods, it achieves superior performance, improving accuracy by 30% on CIFAR-10 and by 11% on CIFAR-100 under 70% semantic noise, and by 24% on ImageNet-100 under real-world noise conditions.

[105] Efficient Odd-One-Out Anomaly Detection

Silvio Chito,Paolo Rabino,Tatiana Tommasi

Main category: cs.CV

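TL;DR: This paper presents an efficient DINO-based model for the odd-one-out anomaly detection task that reduces parameter count and training time, and introduces a Multimodal Large Language Model baseline.

Details Motivation: The odd-one-out detection task challenges modern deep learning models and calls for efficient solutions that handle spatial reasoning across multiple views as well as relational reasoning. Method: The paper adopts a DINO-based approach and introduces a Multimodal Large Language Model as a baseline to assess its performance on structured visual reasoning tasks. Result: The proposed model reduces the number of parameters by one third and shortens training time by a factor of three while maintaining competitive performance. Conclusion: The paper proposes a DINO-based model that addresses efficiency in odd-one-out detection while maintaining competitive performance, and introduces a Multimodal Large Language Model baseline. Abstract: The recently introduced odd-one-out anomaly detection task involves identifying the odd-looking instances within a multi-object scene. This problem presents several challenges for modern deep learning models, demanding spatial reasoning across multiple views and relational reasoning to understand context and generalize across varying object categories and layouts. We argue that these challenges must be addressed with efficiency in mind. To this end, we propose a DINO-based model that reduces the number of parameters by one third and shortens training time by a factor of three compared to the current state-of-the-art, while maintaining competitive performance. Our experimental evaluation also introduces a Multimodal Large Language Model baseline, providing insights into its current limitations in structured visual reasoning tasks. The project page can be found at https://silviochito.github.io/EfficientOddOneOut/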

[106] GeoArena: An Open Platform for Benchmarking Large Vision-language Models on WorldWide Image Geolocalization

Pengyue Jia,Yingyi Zhang,Xiangyu Zhao,Yixuan Li

Main category: cs.CV

TL;DR: GeoArena is an open platform for evaluating vision-language models' ability to determine image locations, using real-world images and human feedback to avoid data leakage and improve evaluation accuracy.

Details Motivation: The authors identify two key issues in current image geolocalization evaluations: data leakage due to LVLMs being pretrained on test datasets, and overreliance on exact geographic coordinates that ignore reasoning and raise privacy concerns. Method: The paper introduces GeoArena, which uses in-the-wild images uploaded by users and leverages pairwise human judgments to evaluate and rank LVLMs' geolocalization performance. Result: GeoArena has been deployed online for two months, collecting thousands of voting records, which were used to analyze LVLM performance and establish a leaderboard for image geolocalization. Conclusion: GeoArena provides an open, in-the-wild, and human-centered platform for evaluating LVLMs on image geolocalization tasks, addressing previous limitations in data leakage and evaluation metrics. Abstract: Image geolocalization aims to predict the geographic location of images captured anywhere on Earth, but its global nature presents significant challenges. Current evaluation methodologies suffer from two major limitations. First, data leakage: advanced approaches often rely on large vision-language models (LVLMs) to predict image locations, yet these models are frequently pretrained on the test datasets, compromising the accuracy of evaluating a model's actual geolocalization capability. Second, existing metrics primarily rely on exact geographic coordinates to assess predictions, which not only neglects the reasoning process but also raises privacy concerns when user-level location data is required. To address these issues, we propose GeoArena, the first open platform for evaluating LVLMs on worldwide image geolocalization tasks, offering true in-the-wild and human-centered benchmarking. GeoArena enables users to upload in-the-wild images for a more diverse evaluation corpus, and it leverages pairwise human judgments to determine which model output better aligns with human expectations. 
Our platform has been deployed online for two months, during which we collected thousands of voting records. Based on this data, we conduct a detailed analysis and establish a leaderboard of different LVLMs on the image geolocalization task.

[107] From Editor to Dense Geometry Estimator

JiYuan Wang,Chunyu Lin,Lei Sun,Rongying Liu,Lang Nie,Mingxing Li,Kang Liao,Xiangxiang Chu,Yao Zhao

Main category: cs.CV

TL;DR: FE2E is a framework that adapts an image editing model for dense geometry prediction, achieving high performance without requiring large amounts of training data.

Details Motivation: Dense prediction is an image-to-image task, suggesting that image editing models may be a more suitable foundation for fine-tuning than T2I generative models. Method: FE2E adapts an advanced editing model based on Diffusion Transformer (DiT) architecture for dense geometry prediction, reformulating the editor's flow matching loss into a 'consistent velocity' training objective and using logarithmic quantization to resolve precision conflicts. Result: FE2E achieves over 35% performance gains on the ETH3D dataset and outperforms the DepthAnything series, which is trained on 100x data. Conclusion: FE2E demonstrates that image editing models are a more suitable foundation for dense geometry prediction than T2I generative models, achieving impressive performance improvements without scaling up training data. Abstract: Leveraging visual priors from pre-trained text-to-image (T2I) generative models has shown success in dense prediction. However, dense prediction is inherently an image-to-image task, suggesting that image editing models, rather than T2I generative models, may be a more suitable foundation for fine-tuning. Motivated by this, we conduct a systematic analysis of the fine-tuning behaviors of both editors and generators for dense geometry estimation. Our findings show that editing models possess inherent structural priors, which enable them to converge more stably by "refining" their innate features, and ultimately achieve higher performance than their generative counterparts. Based on these findings, we introduce FE2E, a framework that pioneers the adaptation of an advanced editing model based on Diffusion Transformer (DiT) architecture for dense geometry prediction. Specifically, to tailor the editor for this deterministic task, we reformulate the editor's original flow matching loss into the "consistent velocity" training objective. 
We also use logarithmic quantization to resolve the precision conflict between the editor's native BFloat16 format and the high precision demand of our tasks. Additionally, we leverage the DiT's global attention for a cost-free joint estimation of depth and normals in a single forward pass, enabling their supervisory signals to mutually enhance each other. Without scaling up the training data, FE2E achieves impressive performance improvements in zero-shot monocular depth and normal estimation across multiple datasets. Notably, it achieves over 35% performance gains on the ETH3D dataset and outperforms the DepthAnything series, which is trained on 100× data. The project page is available at https://amap-ml.github.io/FE2E/.
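
The abstract does not specify how FE2E's logarithmic quantization is defined. As a hedged illustration of the general idea only, the sketch below maps depth to a code in [0, 1] on a logarithmic scale, so that equal steps in the code correspond to equal relative changes in depth; this is one common way to spend a limited number of significand bits more evenly across near and far ranges. The function names and the `d_min`/`d_max` bounds are assumptions.

```python
import math


def encode_log_depth(depth, d_min=0.1, d_max=100.0):
    """Map a depth in [d_min, d_max] to a code in [0, 1] on a log scale."""
    clipped = min(max(depth, d_min), d_max)
    return math.log(clipped / d_min) / math.log(d_max / d_min)


def decode_log_depth(code, d_min=0.1, d_max=100.0):
    """Invert encode_log_depth: recover depth from a code in [0, 1]."""
    return d_min * (d_max / d_min) ** code


# Round trip for a hypothetical depth of 5 metres.
code = encode_log_depth(5.0)
recovered = decode_log_depth(code)
print(code, recovered)
```

With these bounds, a code step of 0.01 always scales depth by the same factor (1000 ** 0.01, roughly 1.07), whether the point is at 0.2 m or at 50 m.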

[108] MICACL: Multi-Instance Category-Aware Contrastive Learning for Long-Tailed Dynamic Facial Expression Recognition

Feng-Qi Cui,Zhen Lin,Xinlong Rao,Anyang Tong,Shiyao Li,Fei Wang,Changlin Chen,Bin Liu

Main category: cs.CV

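TL;DR: This paper proposes MICACL, a new multi-instance learning framework for dynamic facial expression recognition that addresses long-tailed category distributions and the complexity of spatio-temporal feature modeling through a graph-enhanced instance interaction module, a weighted instance aggregation network, and a multiscale category-aware contrastive learning strategy.

Details Motivation: Dynamic facial expression recognition (DFER) faces significant challenges due to long-tailed category distributions and the complexity of spatio-temporal feature modeling. Existing deep learning-based methods often fail to address these issues, resulting in severe model induction bias. Method: The authors design a multi-instance learning framework called MICACL, which comprises the Graph-Enhanced Instance Interaction Module (GEIIM) and the Weighted Instance Aggregation Network (WIAN), and introduce a Multiscale Category-aware Contrastive Learning (MCCL) strategy. Result: Extensive experiments on in-the-wild datasets such as DFEW and FERV39k show that MICACL achieves state-of-the-art performance. Conclusion: By combining spatio-temporal dependency modeling with long-tailed contrastive learning optimization, MICACL improves dynamic facial expression recognition performance and shows superior robustness and generalization. Abstract: Dynamic facial expression recognition (DFER) faces significant challenges due to long-tailed category distributions and the complexity of spatio-temporal feature modeling. While existing deep learning-based methods have improved DFER performance, they often fail to address these issues, resulting in severe model induction bias. To overcome these limitations, we propose a novel multi-instance learning framework called MICACL, which integrates spatio-temporal dependency modeling and long-tailed contrastive learning optimization. Specifically, we design the Graph-Enhanced Instance Interaction Module (GEIIM) to capture intricate spatio-temporal relationships between adjacent instances through adaptive adjacency matrices and multiscale convolutions. To enhance instance-level feature aggregation, we develop the Weighted Instance Aggregation Network (WIAN), which dynamically assigns weights based on instance importance. Furthermore, we introduce a Multiscale Category-aware Contrastive Learning (MCCL) strategy to balance training between major and minor categories. Extensive experiments on in-the-wild datasets (i.e., DFEW and FERV39k) demonstrate that MICACL achieves state-of-the-art performance with superior robustness and generalization.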

[109] Stitching the Story: Creating Panoramic Incident Summaries from Body-Worn Footage

Dor Cohen,Inga Efrosman,Yehudit Aperstein,Alexander Apartsin

Main category: cs.CV

TL;DR: A computer vision pipeline converts body-camera videos into panoramic images, providing first responders with quick and effective situational awareness.

Details Motivation: The motivation behind this work is the impracticality of reviewing lengthy video footage from body-worn cameras in time-sensitive situations. First responders require a quick and effective way to gain situational awareness, which current methods fail to provide efficiently. Method: The method involves a computer vision pipeline that uses monocular Simultaneous Localization and Mapping (SLAM) to estimate camera trajectories and reconstruct the environment's spatial layout. Key viewpoints are identified by clustering camera poses, and representative frames are stitched into panoramic images using multi-frame stitching techniques. Result: The result is a system that successfully generates spatially coherent panoramic images from body-camera footage. These summaries provide first responders with a clear and rapid understanding of complex incident scenes, improving their ability to make informed decisions quickly. Conclusion: This paper concludes that the proposed computer vision pipeline effectively transforms body-camera footage into concise panoramic summaries, enhancing situational awareness and facilitating faster decision-making and incident review. Abstract: First responders widely adopt body-worn cameras to document incident scenes and support post-event analysis. However, reviewing lengthy video footage is impractical in time-critical situations. Effective situational awareness demands a concise visual summary that can be quickly interpreted. This work presents a computer vision pipeline that transforms body-camera footage into informative panoramic images summarizing the incident scene. Our method leverages monocular Simultaneous Localization and Mapping (SLAM) to estimate camera trajectories and reconstruct the spatial layout of the environment. Key viewpoints are identified by clustering camera poses along the trajectory, and representative frames from each cluster are selected. 
These frames are fused into spatially coherent panoramic images using multi-frame stitching techniques. The resulting summaries enable rapid understanding of complex environments and facilitate efficient decision-making and incident review.
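
The abstract says key viewpoints are found by clustering camera poses and picking a representative frame per cluster, but it does not name the clustering algorithm. The sketch below assumes plain k-means over camera positions only (orientation is ignored) with a deterministic initialization, and selects the frame closest to each centroid. The function names and the toy trajectory are hypothetical.

```python
def dist2(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def assign(positions, centroids):
    """Index of the nearest centroid for every camera position."""
    return [min(range(len(centroids)), key=lambda c: dist2(p, centroids[c]))
            for p in positions]


def cluster_poses(positions, k, iters=20):
    """k-means over camera positions; returns (labels, representative frame indices)."""
    n = len(positions)
    # Deterministic init: k frames spaced evenly along the trajectory.
    centroids = [tuple(positions[(i * n) // k]) for i in range(k)]
    for _ in range(iters):
        labels = assign(positions, centroids)
        for c in range(k):
            members = [p for p, lab in zip(positions, labels) if lab == c]
            if members:  # keep the old centroid if a cluster becomes empty
                centroids[c] = tuple(sum(x) / len(members) for x in zip(*members))
    labels = assign(positions, centroids)
    representatives = []
    for c in range(k):
        idx = [i for i, lab in enumerate(labels) if lab == c]
        if idx:
            # Representative frame: the one whose pose is nearest the centroid.
            representatives.append(min(idx, key=lambda i: dist2(positions[i], centroids[c])))
    return labels, representatives


# Toy trajectory: the camera lingers at two locations about 5 units apart.
trajectory = [(0.0, 0.0, 0.0), (0.1, 0.0, 0.0), (0.2, 0.0, 0.0),
              (5.0, 0.0, 0.0), (5.1, 0.0, 0.0), (5.2, 0.0, 0.0)]
labels, reps = cluster_poses(trajectory, k=2)
print(labels, reps)
```

Because monocular SLAM recovers trajectories only up to an unknown scale, distances here are relative; the selected frame indices would then be passed to a stitching step, which is outside the scope of this sketch.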

Hao Ju,Hu Zhang,Zhedong Zheng

Main category: cs.CV

TL;DR: This paper proposes AnomalyLMM, a new framework that uses Large Multi-modal Models for text-based person anomaly search and retrieves fine-grained anomalies effectively through a training-free adaptation approach.

Details Motivation: With growing public safety demands, text-based person anomaly search has become a critical task, but it faces challenges such as fine-grained cross-modal alignment and anomaly recognition under sparse real-world samples. Method: The authors propose a new coarse-to-fine pipeline that bridges LMM generative world knowledge with retrieval-centric anomaly detection, and adopt a training-free adaptation approach. Result: Experiments show that the method surpasses the competitive baseline by +0.96% Recall@1 accuracy and aligns textual anomalies with visual behaviors in an interpretable way. Conclusion: AnomalyLMM is the first framework to use LMMs for text-based person anomaly search, and a rigorous evaluation on the PAB dataset demonstrates its effectiveness for retrieval. Abstract: With growing public safety demands, text-based person anomaly search has emerged as a critical task, aiming to retrieve individuals with abnormal behaviors via natural language descriptions. Unlike conventional person search, this task presents two unique challenges: (1) fine-grained cross-modal alignment between textual anomalies and visual behaviors, and (2) anomaly recognition under sparse real-world samples. While Large Multi-modal Models (LMMs) excel in multi-modal understanding, their potential for fine-grained anomaly retrieval remains underexplored, hindered by: (1) a domain gap between generative knowledge and discriminative retrieval, and (2) the absence of efficient adaptation strategies for deployment. In this work, we propose AnomalyLMM, the first framework that harnesses LMMs for text-based person anomaly search. Our key contributions are: (1) A novel coarse-to-fine pipeline integrating LMMs to bridge generative world knowledge with retrieval-centric anomaly detection; (2) A training-free adaptation cookbook featuring masked cross-modal prompting, behavioral saliency prediction, and knowledge-aware re-ranking, enabling zero-shot focus on subtle anomaly cues. As the first study to explore LMMs for this task, we conduct a rigorous evaluation on the PAB dataset, the only publicly available benchmark for text-based person anomaly search, with its curated real-world anomalies covering diverse scenarios (e.g., falling, collision, and being hit). Experiments show the effectiveness of the proposed method, surpassing the competitive baseline by +0.96% Recall@1 accuracy. 
Notably, our method reveals interpretable alignment between textual anomalies and visual behaviors, validated via qualitative analysis. Our code and models will be released for future research.

[111] Aesthetic Image Captioning with Saliency Enhanced MLLMs

Yilin Tao,Jiashui Huang,Huaze Xu,Ling Shao

Main category: cs.CV

TL;DR: This paper proposes ASE-MLLM, a new AIC framework that integrates image aesthetic saliency into MLLMs, significantly improving performance on AIC tasks and reaching state-of-the-art results.

Details Motivation: Current AIC research relies mainly on fine-tuning methods without specifically adapting MLLMs to focus on target aesthetic content, so a new framework is proposed to address this gap. Method: The authors propose the ASE-MLLM framework, which includes the IASM module for extracting aesthetic saliency features and the IAS-ViT module for fusing features through a cross-attention mechanism. Result: Experiments show that ASE-MLLM significantly outperforms traditional methods and generic MLLMs on mainstream AIC benchmarks, achieving SOTA performance. Conclusion: ASE-MLLM successfully integrates image aesthetic saliency into MLLMs, improving performance on AIC tasks and reaching the state of the art. Abstract: Aesthetic Image Captioning (AIC) aims to generate textual descriptions of image aesthetics, becoming a key research direction in the field of computational aesthetics. In recent years, pretrained Multimodal Large Language Models (MLLMs) have advanced rapidly, leading to a significant increase in image aesthetics research that integrates both visual and textual modalities. However, most existing studies on image aesthetics primarily focus on predicting aesthetic ratings and have shown limited application in AIC. Existing AIC works leveraging MLLMs predominantly rely on fine-tuning methods without specifically adapting MLLMs to focus on target aesthetic content. To address this limitation, we propose the Aesthetic Saliency Enhanced Multimodal Large Language Model (ASE-MLLM), an end-to-end framework that explicitly incorporates aesthetic saliency into MLLMs. Within this framework, we introduce the Image Aesthetic Saliency Module (IASM), which efficiently and effectively extracts aesthetic saliency features from images. Additionally, we design IAS-ViT as the image encoder for MLLMs; this module fuses aesthetic saliency features with original image features via a cross-attention mechanism. To the best of our knowledge, ASE-MLLM is the first framework to integrate image aesthetic saliency into MLLMs specifically for AIC tasks. Extensive experiments demonstrated that our approach significantly outperformed traditional methods and generic MLLMs on current mainstream AIC benchmarks, achieving state-of-the-art (SOTA) performance.

[112] SSGaussian: Semantic-Aware and Structure-Preserving 3D Style Transfer

Jimin Xu,Bosheng Qin,Tao Jin,Zhou Zhao,Zhenhui Ye,Jun Yu,Fei Wu

Main category: cs.CV

TL;DR: This paper introduces a novel 3D style transfer pipeline that integrates pretrained 2D diffusion models to achieve structured and visually coherent stylization of 3D scenes, overcoming limitations of existing methods.

Details Motivation: The motivation stems from the limitations of existing 3D style transfer methods, which struggle with capturing high-level style semantics and often produce results lacking structural clarity and separation. Method: The method involves a two-stage pipeline: first, using diffusion priors to generate stylized renderings of key viewpoints, and second, transferring these stylized views onto a 3D representation. Innovations include cross-view style alignment and instance-level style transfer to ensure consistency and clarity. Result: The result is a more structured, visually coherent, and artistically enriched stylization of 3D scenes. The pipeline demonstrates superior performance across various environments, including challenging 360-degree settings. Conclusion: The paper concludes that the proposed 3D style transfer pipeline significantly outperforms state-of-the-art methods in generating structured, visually coherent, and artistically enriched stylization for 3D scenes. Abstract: Recent advancements in neural representations, such as Neural Radiance Fields and 3D Gaussian Splatting, have increased interest in applying style transfer to 3D scenes. While existing methods can transfer style patterns onto 3D-consistent neural representations, they struggle to effectively extract and transfer high-level style semantics from the reference style image. Additionally, the stylized results often lack structural clarity and separation, making it difficult to distinguish between different instances or objects within the 3D scene. To address these limitations, we propose a novel 3D style transfer pipeline that effectively integrates prior knowledge from pretrained 2D diffusion models. Our pipeline consists of two key stages: First, we leverage diffusion priors to generate stylized renderings of key viewpoints. Then, we transfer the stylized key views onto the 3D representation. This process incorporates two innovative designs. 
The first is cross-view style alignment, which inserts cross-view attention into the last upsampling block of the UNet, allowing feature interactions across multiple key views. This ensures that the diffusion model generates stylized key views that maintain both style fidelity and instance-level consistency. The second is instance-level style transfer, which effectively leverages instance-level consistency across stylized key views and transfers it onto the 3D representation. This results in a more structured, visually coherent, and artistically enriched stylization. Extensive qualitative and quantitative experiments demonstrate that our 3D style transfer pipeline significantly outperforms state-of-the-art methods across a wide range of scenes, from forward-facing to challenging 360-degree environments. Visit our project page https://jm-xu.github.io/SSGaussian for immersive visualization.

[113] Learning neural representations for X-ray ptychography reconstruction with unknown probes

Tingyou Li,Zixin Xu,Zirui Gao,Hanfei Yan,Xiaojing Huang,Jizhou Li

Main category: cs.CV

TL;DR: The paper introduces PtyINR, a self-supervised deep learning framework that improves X-ray ptychography by simultaneously recovering object and probe information from raw diffraction data, especially effective in low-signal scenarios.

Details Motivation: The motivation is to overcome the limitations of conventional iterative methods and deep learning approaches in accurately reconstructing images when the illuminating probe is unknown, especially under low-signal conditions. Method: The method uses Ptychographic Implicit Neural Representation (PtyINR), which parameterizes both the object and probe as continuous neural representations to perform end-to-end reconstruction directly from raw diffraction patterns. Result: PtyINR achieves superior reconstruction quality on both simulated and experimental data and demonstrates remarkable robustness under challenging low-signal conditions. Conclusion: PtyINR provides a self-supervised framework for simultaneous object and probe recovery in X-ray ptychography, offering high reconstruction quality and robustness under low-signal conditions. Abstract: X-ray ptychography provides exceptional nanoscale resolution and is widely applied in materials science, biology, and nanotechnology. However, its full potential is constrained by the critical challenge of accurately reconstructing images when the illuminating probe is unknown. Conventional iterative methods and deep learning approaches are often suboptimal, particularly under the low-signal conditions inherent to low-dose and high-speed experiments. These limitations compromise reconstruction fidelity and restrict the broader adoption of the technique. In this work, we introduce the Ptychographic Implicit Neural Representation (PtyINR), a self-supervised framework that simultaneously addresses the object and probe recovery problem. By parameterizing both as continuous neural representations, PtyINR performs end-to-end reconstruction directly from raw diffraction patterns without requiring any pre-characterization of the probe. 
Extensive evaluations demonstrate that PtyINR achieves superior reconstruction quality on both simulated and experimental data, with remarkable robustness under challenging low-signal conditions. Furthermore, PtyINR offers a generalizable, physics-informed framework for addressing probe-dependent inverse problems, making it applicable to a wide range of computational microscopy problems.

[114] Few-step Flow for 3D Generation via Marginal-Data Transport Distillation

Zanwei Zhou,Taoran Yi,Jiemin Fang,Chen Yang,Lingxi Xie,Xinggang Wang,Wei Shen,Qi Tian

Main category: cs.CV

TL;DR: MDT-dist is a new framework for accelerating 3D generation models, reducing sampling steps and achieving significant speedup while maintaining high fidelity.

Details Motivation: Flow-based 3D generation models typically require dozens of sampling steps during inference. Though few-step distillation methods have achieved substantial advancements in accelerating 2D diffusion models, they remain under-explored for more complex 3D generation tasks. Method: MDT-dist converts the optimization target from the transport level to the velocity and distribution level through two optimizable objectives, Velocity Matching (VM) and Velocity Distillation (VD). Result: The method reduces sampling steps of each flow transformer from 25 to 1 or 2, achieving 0.68s (1 step x 2) and 0.94s (2 steps x 2) latency with 9.0x and 6.5x speedup on A800, while preserving high visual and geometric fidelity. Conclusion: MDT-dist is a novel framework for few-step 3D flow distillation that significantly outperforms existing CM distillation methods, enabling TRELLIS to achieve superior performance in few-step 3D generation. Abstract: Flow-based 3D generation models typically require dozens of sampling steps during inference. Though few-step distillation methods, particularly Consistency Models (CMs), have achieved substantial advancements in accelerating 2D diffusion models, they remain under-explored for more complex 3D generation tasks. In this study, we propose a novel framework, MDT-dist, for few-step 3D flow distillation. Our approach is built upon a primary objective: distilling the pretrained model to learn the Marginal-Data Transport. Directly learning this objective needs to integrate the velocity fields, while this integral is intractable to be implemented. Therefore, we propose two optimizable objectives, Velocity Matching (VM) and Velocity Distillation (VD), to equivalently convert the optimization target from the transport level to the velocity and the distribution level respectively. Velocity Matching (VM) learns to stably match the velocity fields between the student and the teacher, but inevitably provides biased gradient estimates. 
Velocity Distillation (VD) further enhances the optimization process by leveraging the learned velocity fields to perform probability density distillation. When evaluated on the pioneer 3D generation framework TRELLIS, our method reduces sampling steps of each flow transformer from 25 to 1 or 2, achieving 0.68s (1 step x 2) and 0.94s (2 steps x 2) latency with 9.0x and 6.5x speedup on A800, while preserving high visual and geometric fidelity. Extensive experiments demonstrate that our method significantly outperforms existing CM distillation methods, and enables TRELLIS to achieve superior performance in few-step 3D generation.
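
As a quick consistency check on the reported numbers, the baseline latency implied by each (latency, speedup) pair can be computed directly. This sketch assumes both speedups are measured against the same 25-step TRELLIS baseline on the same hardware, which the abstract suggests but does not state explicitly; the helper name `implied_baseline` is hypothetical.

```python
# Figures reported in the abstract: (latency in seconds, speedup factor).
reported = {
    "1 step x 2": (0.68, 9.0),
    "2 steps x 2": (0.94, 6.5),
}

def implied_baseline(latency_s, speedup):
    """Baseline latency implied by a measured latency and its speedup."""
    return latency_s * speedup

baselines = {name: implied_baseline(*vals) for name, vals in reported.items()}
for name, seconds in baselines.items():
    print(f"{name}: implied baseline of about {seconds:.2f}s")
```

Both settings imply a baseline of roughly 6.1 seconds, so the two reported speedups are mutually consistent up to rounding.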

[115] Durian: Dual Reference-guided Portrait Animation with Attribute Transfer

Hyunsoo Cha,Byungjun Kim,Hanbyul Joo

Main category: cs.CV

TL;DR: Durian is a zero-shot method for generating high-fidelity portrait animation videos with facial attribute transfer, using dual reference networks and a diffusion model to ensure spatial consistency and robustness, achieving state-of-the-art results.

Details Motivation: The motivation is to develop a zero-shot method for generating high-fidelity portrait animation videos with facial attribute transfer, ensuring spatial consistency across frames. Method: Durian introduces dual reference networks to inject spatial features from both the portrait and attribute images into a diffusion model's denoising process. The model is trained using a self-reconstruction formulation and a mask expansion strategy, along with spatial and appearance-level transformations for robustness. Result: Durian achieves state-of-the-art performance on portrait animation with attribute transfer and enables multi-attribute composition in a single generation pass without additional training. Conclusion: Durian enables high-fidelity portrait animation with facial attribute transfer without explicit triplet supervision, achieving state-of-the-art performance and supporting multi-attribute composition in a single generation pass. Abstract: We present Durian, the first method for generating portrait animation videos with facial attribute transfer from a given reference image to a target portrait in a zero-shot manner. To enable high-fidelity and spatially consistent attribute transfer across frames, we introduce dual reference networks that inject spatial features from both the portrait and attribute images into the denoising process of a diffusion model. We train the model using a self-reconstruction formulation, where two frames are sampled from the same portrait video: one is treated as the attribute reference and the other as the target portrait, and the remaining frames are reconstructed conditioned on these inputs and their corresponding masks. To support the transfer of attributes with varying spatial extent, we propose a mask expansion strategy using keypoint-conditioned image generation for training. 
In addition, we further augment the attribute and portrait images with spatial and appearance-level transformations to improve robustness to positional misalignment between them. These strategies allow the model to effectively generalize across diverse attributes and in-the-wild reference combinations, despite being trained without explicit triplet supervision. Durian achieves state-of-the-art performance on portrait animation with attribute transfer, and notably, its dual reference design enables multi-attribute composition in a single generation pass without additional training.

[116] From Lines to Shapes: Geometric-Constrained Segmentation of X-Ray Collimators via Hough Transform

Benjamin El-Zein,Dominik Eckert,Andreas Fieselmann,Christopher Syben,Ludwig Ritschl,Steffen Kappler,Sebastian Stober

Main category: cs.CV

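TL;DR: Proposes a deep learning-based method for segmenting collimation in X-ray images that incorporates a differentiable Hough transform network to detect collimation borders precisely and extract ROI center information, producing refined, line-constrained segmentation masks.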

Details Motivation: Collimation limits the radiation dose in X-ray imaging, but detecting collimator shadows is difficult because of scattered radiation, so a method is needed that overcomes this and improves segmentation accuracy. Method: Introduces a differentiable Hough transform-based network that detects collimation borders and extracts ROI center information; at inference, the information from both tasks is combined to generate more precise segmentation masks. Result: On diverse test sets of real X-ray images, the method achieves median Hausdorff distances of 4.3-5.0 mm, demonstrating robust reconstruction of collimated regions. Conclusion: The proposed method effectively addresses collimator shadow detection and is not limited to a specific number of edges, giving it broader application potential. Abstract: Collimation in X-ray imaging restricts exposure to the region-of-interest (ROI) and minimizes the radiation dose applied to the patient. The detection of collimator shadows is an essential image-based preprocessing step in digital radiography posing a challenge when edges get obscured by scattered X-ray radiation. Regardless, the prior knowledge that collimation forms polygonal-shaped shadows is evident. For this reason, we introduce a deep learning-based segmentation that is inherently constrained to its geometry. We achieve this by incorporating a differentiable Hough transform-based network to detect the collimation borders and enhance its capability to extract the information about the ROI center. During inference, we combine the information of both tasks to enable the generation of refined, line-constrained segmentation masks. We demonstrate robust reconstruction of collimated regions achieving median Hausdorff distances of 4.3-5.0mm on diverse test sets of real X-ray images. While this application involves at most four shadow borders, our method is not fundamentally limited by a specific number of edges.
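
To make the "line-constrained mask" idea concrete, here is a minimal sketch of how detected border lines and an ROI center could be combined into a polygonal mask. It is not the paper's network or inference procedure: it assumes the lines are already given in Hough normal form (rho, theta) with x·cos(theta) + y·sin(theta) = rho, and the function name `line_constrained_mask` and the tolerance `eps` are hypothetical.

```python
import math

def line_constrained_mask(height, width, lines, center, eps=1e-9):
    """Keep pixels lying on the same side of every line as the ROI center.

    lines: iterable of (rho, theta) in Hough normal form (assumed input).
    center: (x, y) of the ROI center (assumed input).
    Returns a height x width grid of booleans; True means inside the ROI.
    """
    cx, cy = center
    mask = [[True] * width for _ in range(height)]
    for rho, theta in lines:
        c, s = math.cos(theta), math.sin(theta)
        center_side = cx * c + cy * s - rho  # signed distance of the center
        for y in range(height):
            for x in range(width):
                side = x * c + y * s - rho
                # Opposite sign to the center means outside this border.
                if side * center_side < -eps:
                    mask[y][x] = False
    return mask

# Four borders: x = 2, x = 7, y = 1, y = 5, with the ROI center at (4, 3).
borders = [(2, 0.0), (7, 0.0), (1, math.pi / 2), (5, math.pi / 2)]
mask = line_constrained_mask(8, 10, borders, center=(4, 3))
print(sum(row.count(True) for row in mask), "pixels inside the ROI")
```

Because the mask is an intersection of half-planes, the same routine works for any number of border lines, which matches the abstract's remark that the approach is not tied to four edges.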

[117] One Flight Over the Gap: A Survey from Perspective to Panoramic Vision

Xin Lin,Xian Ge,Dizhe Zhang,Zhaoliang Wan,Xianshun Wang,Xiangtai Li,Wenjie Jiang,Bo Du,Dacheng Tao,Ming-Hsuan Yang,Lu Qi

Main category: cs.CV

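TL;DR: This survey reviews the development of panoramic vision techniques, focusing on the challenges and strategies of perspective-to-panorama adaptation, and categorizes the main research areas and future directions of panoramic vision.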

Details Motivation: Growing demand for spatial intelligence and holistic scene perception has drawn wide attention to panoramic images, which offer a 360-degree field of view. However, they differ substantially from perspective images, which makes direct domain adaptation difficult. Method: The authors revisit the panoramic imaging pipeline and projection methods, summarize three challenges of domain adaptation, and analyze 20+ representative tasks drawn from more than 300 research papers. Result: The survey provides a cross-method analysis of panoramic vision challenges and classifies panoramic vision into four major categories: visual quality enhancement and assessment, visual understanding, multimodal understanding, and visual generation. Conclusion: The survey summarizes research progress in panoramic vision, emphasizes perspective-to-panorama adaptation, and discusses open challenges and future directions in data, models, and applications. Abstract: Driven by the demand for spatial intelligence and holistic scene perception, omnidirectional images (ODIs), which provide a complete 360° field of view, are receiving growing attention across diverse applications such as virtual reality, autonomous driving, and embodied robotics. Despite their unique characteristics, ODIs exhibit remarkable differences from perspective images in geometric projection, spatial distribution, and boundary continuity, making it challenging for direct domain adaptation from perspective methods. This survey reviews recent panoramic vision techniques with a particular emphasis on the perspective-to-panorama adaptation. We first revisit the panoramic imaging pipeline and projection methods to build the prior knowledge required for analyzing the structural disparities. Then, we summarize three challenges of domain adaptation: severe geometric distortions near the poles, non-uniform sampling in Equirectangular Projection (ERP), and periodic boundary continuity. Building on this, we cover 20+ representative tasks drawn from more than 300 research papers in two dimensions. On one hand, we present a cross-method analysis of representative strategies for addressing panorama-specific challenges across different tasks. On the other hand, we conduct a cross-task comparison and classify panoramic vision into four major categories: visual quality enhancement and assessment, visual understanding, multimodal understanding, and visual generation. In addition, we discuss open challenges and future directions in data, models, and applications that will drive the advancement of panoramic vision research. 
We hope that our work can provide new insight and forward-looking perspectives to advance the development of panoramic vision technologies. Our project page is https://insta360-research-team.github.io/Survey-of-Panorama
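
The non-uniform sampling of ERP mentioned above has a simple geometric form: each image row covers a band of the sphere whose area is proportional to the cosine of its latitude, so rows near the poles are heavily oversampled. The sketch below computes normalized per-row weights under that standard relationship; the function name `erp_row_weights` is hypothetical and the survey itself does not prescribe this code.

```python
import math

def erp_row_weights(height):
    """Per-row solid-angle weights for an equirectangular image.

    Row i is assumed to be centered at latitude phi_i, running from
    just below +pi/2 (top) to just above -pi/2 (bottom). Weights are
    proportional to cos(phi_i) and normalized to sum to 1.
    """
    lats = [(0.5 - (i + 0.5) / height) * math.pi for i in range(height)]
    raw = [math.cos(phi) for phi in lats]
    total = sum(raw)
    return [w / total for w in raw]

weights = erp_row_weights(8)
for i, w in enumerate(weights):
    print(f"row {i}: weight {w:.3f}")
```

Weights like these are commonly used to reweight per-pixel losses or metrics so that polar rows do not dominate; that usage is a general practice and an assumption here, not a claim about any specific method in the survey.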

[118] Plot'n Polish: Zero-shot Story Visualization and Disentangled Editing with Text-to-Image Diffusion Models

Kiymet Akdemir,Jing Shi,Kushal Kafle,Brian Price,Pinar Yanardag

Main category: cs.CV

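TL;DR: This paper proposes Plot'n Polish, a method that improves the controllability and consistency of text-to-image diffusion models for story visualization.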

Details Motivation: As text-to-image diffusion models see more use in real-world creative domains, stronger control, refinement, and consistent modification capabilities are needed, particularly for maintaining visual and narrative consistency across multiple frames. Method: Introduces Plot'n Polish, a zero-shot framework that enhances control, refinement, and modification of text-to-image diffusion models for story visualization. Result: Addresses the limited flexibility of existing methods for fine or coarse edits, letting creators craft and refine visual stories seamlessly. Conclusion: Plot'n Polish offers a zero-shot framework for consistent story generation with fine-grained control over story visualizations at various levels of detail. Abstract: Text-to-image diffusion models have demonstrated significant capabilities to generate diverse and detailed visuals in various domains, and story visualization is emerging as a particularly promising application. However, as their use in real-world creative domains increases, the need for providing enhanced control, refinement, and the ability to modify images post-generation in a consistent manner becomes an important challenge. Existing methods often lack the flexibility to apply fine or coarse edits while maintaining visual and narrative consistency across multiple frames, preventing creators from seamlessly crafting and refining their visual stories. To address these challenges, we introduce Plot'n Polish, a zero-shot framework that enables consistent story generation and provides fine-grained control over story visualizations at various levels of detail.

[119] TRUST-VL: An Explainable News Assistant for General Multimodal Misinformation Detection

Zehong Yan,Peng Qi,Wynne Hsu,Mong Li Lee

Main category: cs.CV

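TL;DR: This work introduces TRUST-VL, a unified and explainable vision-language model for general multimodal misinformation detection, along with TRUST-Instruct, a large-scale instruction dataset that supports its training.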

Details Motivation: Multimodal misinformation poses a growing societal threat, while existing methods typically focus on a single type of distortion and struggle to generalize to unseen scenarios. Method: Introduces TRUST-VL, a unified and explainable vision-language model, together with TRUST-Instruct, a large-scale instruction dataset that supports training. Result: TRUST-VL performs strongly on both in-domain and zero-shot benchmarks. Conclusion: TRUST-VL achieves state-of-the-art performance while offering strong generalization and interpretability. Abstract: Multimodal misinformation, encompassing textual, visual, and cross-modal distortions, poses an increasing societal threat that is amplified by generative AI. Existing methods typically focus on a single type of distortion and struggle to generalize to unseen scenarios. In this work, we observe that different distortion types share common reasoning capabilities while also requiring task-specific skills. We hypothesize that joint training across distortion types facilitates knowledge sharing and enhances the model's ability to generalize. To this end, we introduce TRUST-VL, a unified and explainable vision-language model for general multimodal misinformation detection. TRUST-VL incorporates a novel Question-Aware Visual Amplifier module, designed to extract task-specific visual features. To support training, we also construct TRUST-Instruct, a large-scale instruction dataset containing 198K samples featuring structured reasoning chains aligned with human fact-checking workflows. Extensive experiments on both in-domain and zero-shot benchmarks demonstrate that TRUST-VL achieves state-of-the-art performance, while also offering strong generalization and interpretability.

[120] Virtual Fitting Room: Generating Arbitrarily Long Videos of Virtual Try-On from a Single Image -- Technical Preview

Jun-Kun Chen,Aayush Bansal,Minh Phuoc Vo,Yu-Xiong Wang

Main category: cs.CV

TL;DR: The Virtual Fitting Room (VFR) is a novel video generative model that efficiently creates long virtual try-on videos with smoothness and temporal consistency.

Details Motivation: The motivation is to eliminate resource-intensive generation processes and enable flexible, arbitrarily long virtual try-on video creation. Method: VFR models long video generation as an auto-regressive, segment-by-segment process using a prefix video condition for smoothness and an anchor video for consistency. Result: VFR successfully generates minute-scale virtual try-on videos with local smoothness and global temporal consistency across various motions. Conclusion: The VFR framework is a pioneering approach in generating long virtual try-on videos, addressing challenges of local smoothness and global temporal consistency. Abstract: We introduce the Virtual Fitting Room (VFR), a novel video generative model that produces arbitrarily long virtual try-on videos. Our VFR models long video generation tasks as an auto-regressive, segment-by-segment generation process, eliminating the need for resource-intensive generation and lengthy video data, while providing the flexibility to generate videos of arbitrary length. The key challenges of this task are twofold: ensuring local smoothness between adjacent segments and maintaining global temporal consistency across different segments. To address these challenges, we propose our VFR framework, which ensures smoothness through a prefix video condition and enforces consistency with the anchor video -- a 360-degree video that comprehensively captures the human's whole-body appearance. Our VFR generates minute-scale virtual try-on videos with both local smoothness and global temporal consistency under various motions, making it a pioneering work in long virtual try-on video generation.
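
The segment-by-segment control flow described above can be sketched as a short loop. This is only an illustration of the auto-regressive structure, not the VFR model: `generate_segment` is a hypothetical stand-in for the generative model, frames are represented as integers, and the toy generator below simply continues a counter so the conditioning is easy to follow.

```python
def generate_long_video(generate_segment, anchor, total_frames,
                        segment_len, prefix_len):
    """Build a long video one segment at a time.

    generate_segment(prefix, anchor, n) is a hypothetical callable that
    returns n new frames, conditioned on the last frames generated so far
    (the prefix, for local smoothness) and on a fixed anchor (for global
    consistency). prefix_len is assumed to be at least 1.
    """
    video = []
    while len(video) < total_frames:
        prefix = video[-prefix_len:] if video else []
        video.extend(generate_segment(prefix, anchor, segment_len))
    return video[:total_frames]

def toy_segment(prefix, anchor, n):
    """Toy generator: continue counting from the last prefix frame."""
    start = prefix[-1] + 1 if prefix else anchor
    return [start + i for i in range(n)]

frames = generate_long_video(toy_segment, anchor=0, total_frames=10,
                             segment_len=4, prefix_len=2)
print(frames)
```

Because each call only sees a short prefix plus the anchor, memory and compute per step stay bounded regardless of the requested video length, which is the property the abstract highlights.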