cs.CL

[1] Speech-Based Cognitive Screening: A Systematic Evaluation of LLM Adaptation Strategies

Fatemeh Taherinezhad,Mohamad Javad Momeni Nezhad,Sepehr Karimi,Sina Rashidi,Ali Zolnour,Maryam Dadkhah,Yasaman Haghbin,Hossein AzadMaleki,Maryam Zolnoori

Main category: cs.CL

TL;DR: This study explores various adaptation strategies for large language models in detecting dementia from speech, showing that appropriate model adaptations can significantly improve performance and that open-weight models can match or surpass commercial systems.

Details Motivation: Over half of US adults with Alzheimer's disease and related dementias remain undiagnosed, and there is a need for scalable detection methods. Speech-based screening is a promising approach, prompting the investigation into optimizing models for dementia detection. Method: The researchers compared nine text-only models and three multimodal audio-text models using the DementiaBank speech corpus. They evaluated different adaptation strategies such as in-context learning with various demonstration selection policies, reasoning-augmented prompting, parameter-efficient fine-tuning, and multimodal integration. Result: Class-centroid demonstrations yielded the best in-context learning results. Reasoning techniques enhanced the performance of smaller models, and token-level fine-tuning generally achieved the highest scores. The addition of a classification head improved underperforming models. While fine-tuned audio-text multimodal models performed well, they did not outperform the top text-only models. Conclusion: The study concluded that model adaptation strategies significantly affect the performance of speech-based dementia detection systems, and appropriately adapted open-weight models can match or exceed commercial systems. Abstract: Over half of US adults with Alzheimer disease and related dementias remain undiagnosed, and speech-based screening offers a scalable detection approach. We compared large language model adaptation strategies for dementia detection, evaluating nine text-only models and three multimodal audio-text models on recordings from the DementiaBank speech corpus. Adaptations included in-context learning with different demonstration selection policies, reasoning-augmented prompting, parameter-efficient fine-tuning, and multimodal integration. 
Results showed that class-centroid demonstrations achieved the highest in-context learning performance, reasoning improved smaller models, and token-level fine-tuning generally produced the best scores. Adding a classification head substantially improved underperforming models. Among multimodal models, fine-tuned audio-text systems performed well but did not surpass the top text-only models. These findings highlight that model adaptation strategies, including demonstration selection, reasoning design, and tuning method, critically influence speech-based dementia detection, and that properly adapted open-weight models can match or exceed commercial systems.
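
Illustration (not from the paper): one plausible reading of "class-centroid demonstrations" is to pick, for each class, the training example whose embedding lies closest to that class's mean embedding. The sketch below assumes this reading. The function names, the toy 2-D embeddings, and the labels are all hypothetical, and the paper's actual selection policy may differ.

```python
import math


def centroid(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def select_centroid_demos(examples, per_class=1):
    """For each label, return the texts closest to that label's centroid.

    `examples` is a list of (text, label, embedding) tuples. The embeddings
    are assumed to come from some sentence encoder (hypothetical here).
    """
    by_label = {}
    for text, label, emb in examples:
        by_label.setdefault(label, []).append((text, emb))

    demos = {}
    for label, items in by_label.items():
        c = centroid([emb for _, emb in items])
        # Rank the class members by Euclidean distance to the class centroid.
        ranked = sorted(items, key=lambda item: math.dist(item[1], c))
        demos[label] = [text for text, _ in ranked[:per_class]]
    return demos


# Toy 2-D embeddings standing in for transcript embeddings.
toy = [
    ("transcript A", "dementia", [0.0, 0.0]),
    ("transcript B", "dementia", [1.0, 1.0]),
    ("transcript C", "dementia", [0.4, 0.6]),
    ("transcript D", "control", [5.0, 5.0]),
    ("transcript E", "control", [6.0, 6.0]),
    ("transcript F", "control", [5.6, 5.4]),
]
demos = select_centroid_demos(toy)
```

The selected texts would then be placed in the prompt as in-context demonstrations, one or more per class.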

[2] Enhancing Speech Large Language Models through Reinforced Behavior Alignment

Yansong Liu,Jiateng Li,Yuan Liu

Main category: cs.CL

TL;DR: This paper introduces a framework called Reinforced Behavior Alignment (RBA) to enhance the language generation capabilities of speech-based Large Language Models (SpeechLMs) through self-synthesis and reinforcement learning, achieving superior performance over conventional methods.

Details Motivation: Speech-based LLMs (SpeechLMs) still lag behind text-based LLMs in instruction-following due to inter-modal discrepancies and the dynamic nature of speech. This work aims to bridge this performance gap. Method: The paper proposes Reinforced Behavior Alignment (RBA), which uses a powerful teacher LLM to self-synthesize large-scale, high-fidelity alignment data. The SpeechLM is then aligned with the teacher model using a reinforcement learning approach, avoiding reliance on supervised fine-tuning with human annotations. Result: Experimental results show that RBA effectively improves the instruction-following capabilities of SpeechLMs, outperforming conventional distillation baselines. It also generalizes well to tasks like spoken question answering and speech-to-text translation. Conclusion: RBA successfully enhances the performance of SpeechLMs using self-synthesis and reinforcement learning, achieving state-of-the-art results on open benchmarks without requiring human-annotated data. Abstract: Recent advances in Large Language Models (LLMs) have spurred considerable research interest in extending their linguistic capabilities beyond text to other modalities, leading to the emergence of speech-based LLMs (SpeechLMs) capable of processing user requests in either speech or text form. However, owing to inter-modal discrepancies, these SpeechLMs still exhibit a significant performance gap compared to their text-based LLM counterparts in instruction-following, particularly when confronted with the dynamic and variable nature of user speech. To address this challenge, this paper introduces a framework termed Reinforced Behavior Alignment (RBA), designed to bolster the language generation proficiency of SpeechLMs. Instead of relying on supervised fine-tuning from human annotations, RBA employs a self-synthesis methodology to generate extensive, high-fidelity alignment data with a powerful teacher LLM. 
The SpeechLM's behavior is then aligned with that of the teacher using a reinforcement learning-based approach. Experimental results demonstrate that this method effectively enhances the instruction-following capabilities of SpeechLMs, which outperform conventional distillation baselines. Crucially, we demonstrate that RBA can be seamlessly extended to tasks including spoken question answering and speech-to-text translation, attaining state-of-the-art performance on open benchmarks with only self-generated data.

[3] Multilevel Analysis of Cryptocurrency News using RAG Approach with Fine-Tuned Mistral Large Language Model

Bohdan M. Pavlyshenko

Main category: cs.CL

TL;DR: This paper presents a method for multilevel analysis of cryptocurrency news using a fine-tuned Mistral 7B large language model with retrieval-augmented generation (RAG).

Details Motivation: The motivation is to overcome the limitations of large language models, such as hallucinations, by representing cryptocurrency news as a knowledge graph, and to provide comprehensive reports through hierarchical stacking of graph-based and text-based summaries. Method: The paper uses a fine-tuned Mistral 7B large language model with retrieval-augmented generation (RAG) for multilevel multitask analysis of cryptocurrency news. The model is fine-tuned with 4-bit quantization using the PEFT/LoRA approach. Result: The results show that fine-tuned Mistral 7B models can effectively perform informative qualitative and quantitative analytics of cryptocurrency news. Conclusion: The paper concludes that fine-tuned Mistral 7B models can effectively conduct multilevel analysis of cryptocurrency news, providing both qualitative and quantitative insights. Abstract: In the paper, we consider multilevel multitask analysis of cryptocurrency news using a fine-tuned Mistral 7B large language model with retrieval-augmented generation (RAG). On the first level of analytics, the fine-tuned model generates graph and text summaries with sentiment scores as well as JSON representations of summaries. Higher levels perform hierarchical stacking that consolidates sets of graph-based and text-based summaries as well as summaries of summaries into comprehensive reports. The combination of graph and text summaries provides complementary views of cryptocurrency news. The model is fine-tuned with 4-bit quantization using the PEFT/LoRA approach. The representation of cryptocurrency news as a knowledge graph can essentially eliminate problems with large language model hallucinations. The obtained results demonstrate that fine-tuned Mistral 7B models can conduct informative qualitative and quantitative analytics in multilevel cryptocurrency news analysis, providing important insights.

[4] The ProLiFIC dataset: Leveraging LLMs to Unveil the Italian Lawmaking Process

Matilde Contestabile,Chiara Ferrara,Alberto Giovannetti,Giovanni Parrillo,Andrea Vandin

Main category: cs.CL

TL;DR: This paper introduces ProLiFIC, a comprehensive event log of the Italian lawmaking process, structured using LLMs to enhance legal Process Mining.

Details Motivation: Process Mining's efficacy in the legal domain is limited by dataset accessibility and quality. This work aims to overcome these limitations. Method: ProLiFIC was created using unstructured data from the Normattiva portal and structured with large language models (LLMs), aligning with recent integrations of PM and LLMs. Result: A comprehensive event log of the Italian lawmaking process from 1987 to 2022 was successfully developed and preliminary analyses were exemplified. Conclusion: ProLiFIC is introduced as a benchmark for legal PM, fostering new developments in the analysis of the Italian lawmaking process. Abstract: Process Mining (PM), initially developed for industrial and business contexts, has recently been applied to social systems, including legal ones. However, PM's efficacy in the legal domain is limited by the accessibility and quality of datasets. We introduce ProLiFIC (Procedural Lawmaking Flow in Italian Chambers), a comprehensive event log of the Italian lawmaking process from 1987 to 2022. Created from unstructured data from the Normattiva portal and structured using large language models (LLMs), ProLiFIC aligns with recent efforts in integrating PM with LLMs. We exemplify preliminary analyses and propose ProLiFIC as a benchmark for legal PM, fostering new developments.

[5] Multimodal Proposal for an AI-Based Tool to Increase Cross-Assessment of Messages

Alejandro Álvarez Castro,Joaquín Ordieres-Meré

Main category: cs.CL

TL;DR: This paper proposes a framework for analyzing earnings calls based on multimodal signals and hierarchical discourse structure, offering strong semantic expressiveness and potential for cross-domain application.

Details Motivation: Existing financial sentiment analysis systems mostly rely on document-level or sentence-level models and fail to capture the layered discourse structure of the interactions in earnings calls, so a finer-grained multimodal analysis method is needed. Method: The paper uses a two-stage transformer architecture: the first stage encodes multimodal content and discourse metadata at the node level using contrastive learning, and the second stage produces a global embedding for the entire call. Result: Experiments show that the proposed framework produces stable, semantically meaningful embeddings that reflect affective tone, structural logic, and thematic alignment, with practical value for downstream tasks such as financial forecasting and discourse evaluation. Conclusion: The paper presents a new multimodal framework that generates semantically rich, structure-aware embeddings of earnings calls and demonstrates the method's usefulness for financial reporting and other high-stakes unscripted communication domains. Abstract: Earnings calls represent a uniquely rich and semi-structured source of financial communication, blending scripted managerial commentary with unscripted analyst dialogue. Although recent advances in financial sentiment analysis have integrated multi-modal signals, such as textual content and vocal tone, most systems rely on flat document-level or sentence-level models, failing to capture the layered discourse structure of these interactions. This paper introduces a novel multi-modal framework designed to generate semantically rich and structurally aware embeddings of earnings calls, by encoding them as hierarchical discourse trees. Each node, comprising either a monologue or a question-answer pair, is enriched with emotional signals derived from text, audio, and video, as well as structured metadata including coherence scores, topic labels, and answer coverage assessments. A two-stage transformer architecture is proposed: the first encodes multi-modal content and discourse metadata at the node level using contrastive learning, while the second synthesizes a global embedding for the entire conference. Experimental results reveal that the resulting embeddings form stable, semantically meaningful representations that reflect affective tone, structural logic, and thematic alignment. Beyond financial reporting, the proposed system generalizes to other high-stakes unscripted communicative domains such as tele-medicine, education, and political discourse, offering a robust and explainable approach to multi-modal discourse representation. 
This approach offers practical utility for downstream tasks such as financial forecasting and discourse evaluation, while also providing a generalizable method applicable to other domains involving high-stakes communication.

[6] Reading Between the Signs: Predicting Future Suicidal Ideation from Adolescent Social Media Texts

Paul Blum,Enrico Liscio,Ruixuan Zhang,Caroline Figueroa,Pradeep K. Murukannaiah

Main category: cs.CL

TL;DR: This study proposes Early-SIB, a transformer-based model that predicts suicidal ideation among adolescents by analyzing their social media activity before they explicitly express such thoughts.

Details Motivation: Suicide is a leading cause of death among adolescents, and many cases go undetected due to limited contact with mental health services. Social media offers an opportunity to detect early signs of suicidal ideation as young people often share their struggles online in real time. Method: The study introduces Early-SIB, a transformer-based model that processes forum posts written and engaged with by users to predict whether they will express suicidal ideation in the future, without relying on explicit self-disclosure. Result: The Early-SIB model achieved a balanced accuracy of 0.73 in predicting future suicidal ideation on a Dutch youth forum. Conclusion: The proposed Early-SIB model demonstrates that predictive tools can meaningfully complement traditional methods in identifying suicidal ideation and behavior among adolescents through social media analysis. Abstract: Suicide is a leading cause of death among adolescents (12-18), yet predicting it remains a significant challenge. Many cases go undetected due to a lack of contact with mental health services. Social media, however, offers a unique opportunity, as young people often share their thoughts and struggles online in real time. In this work, we propose a novel task and method to approach it: predicting suicidal ideation and behavior (SIB) from forum posts before an adolescent explicitly expresses suicidal ideation on an online forum. This predictive framing, where no self-disclosure is used as input at any stage, remains largely unexplored in the suicide prediction literature. To this end, we introduce Early-SIB, a transformer-based model that sequentially processes the posts a user writes and engages with to predict whether they will write a SIB post. Our model achieves a balanced accuracy of 0.73 for predicting future SIB on a Dutch youth forum, demonstrating that such tools can offer a meaningful addition to traditional methods.
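
Illustration (not from the paper): balanced accuracy is the mean of sensitivity and specificity, which makes it more informative than plain accuracy when the positive class is rare. The toy labels below are invented for the example and are unrelated to the paper's data.

```python
def balanced_accuracy(y_true, y_pred):
    """Mean of sensitivity (recall on 1s) and specificity (recall on 0s)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("both classes must be present in y_true")
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return (sensitivity + specificity) / 2


# Invented labels: 2 positives among 10 users.
y_true = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
y_pred = [1, 0, 0, 0, 0, 0, 0, 0, 1, 1]

score = balanced_accuracy(y_true, y_pred)
plain_accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
```

On this toy data the plain accuracy is 0.7 while the balanced accuracy is 0.625, because only half of the positives are recovered.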

[7] Real-Time Detection of Hallucinated Entities in Long-Form Generation

Oscar Obeso,Andy Arditi,Javier Ferrando,Joshua Freeman,Cameron Holmes,Neel Nanda

Main category: cs.CL

TL;DR: The paper presents a cheap, scalable method for real-time hallucination detection that works on 70B-parameter models and enables streaming detection through entity-level annotation.

Details Motivation: Existing hallucination detection methods are impractical for real-world use because they are either limited to short factual queries or require costly external verification. Method: The authors develop a web-search-based annotation methodology and train effective hallucination classifiers, such as linear probes, to detect entity-level hallucinations. Result: The classifiers consistently outperform baselines on long-form responses, including an AUC of 0.90 on Llama-3.3-70B versus 0.71 for semantic entropy. Conclusion: Although the annotation methodology is expensive, annotated responses from one model can be used to train effective classifiers for other models, enabling scalable real-world hallucination detection. Abstract: Large language models are now routinely used in high-stakes applications where hallucinations can cause serious harm, such as medical consultations or legal advice. Existing hallucination detection methods, however, are impractical for real-world use, as they are either limited to short factual queries or require costly external verification. We present a cheap, scalable method for real-time identification of hallucinated tokens in long-form generations, and scale it effectively to 70B parameter models. Our approach targets entity-level hallucinations -- e.g., fabricated names, dates, citations -- rather than claim-level, thereby naturally mapping to token-level labels and enabling streaming detection. We develop an annotation methodology that leverages web search to annotate model responses with grounded labels indicating which tokens correspond to fabricated entities. This dataset enables us to train effective hallucination classifiers with simple and efficient methods such as linear probes. Evaluating across four model families, our classifiers consistently outperform baselines on long-form responses, including more expensive methods such as semantic entropy (e.g., AUC 0.90 vs 0.71 for Llama-3.3-70B), and are also an improvement in short-form question-answering settings. Moreover, despite being trained only with entity-level labels, our probes effectively detect incorrect answers in mathematical reasoning tasks, indicating generalization beyond entities. While our annotation methodology is expensive, we find that annotated responses from one model can be used to train effective classifiers on other models; accordingly, we publicly release our datasets to facilitate reuse. 
Overall, our work suggests a promising new approach for scalable, real-world hallucination detection.
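
Illustration (not from the paper): a linear probe scores each token's hidden state with a logistic function of a learned weight vector, which is what makes token-by-token (streaming) flagging cheap. The weights, bias, threshold, and 2-D "hidden states" below are made up. In practice the probe would be trained on annotated activations from the model being monitored.

```python
import math


def probe_score(hidden_state, weights, bias):
    """Logistic linear probe: sigmoid(w . h + b)."""
    logit = sum(w * h for w, h in zip(weights, hidden_state)) + bias
    return 1.0 / (1.0 + math.exp(-logit))


def flag_tokens(tokens, hidden_states, weights, bias, threshold=0.5):
    """Yield (token, score, flagged) one token at a time, as in streaming."""
    for token, h in zip(tokens, hidden_states):
        score = probe_score(h, weights, bias)
        yield token, score, score >= threshold


# Hypothetical probe parameters and toy hidden states.
weights = [2.0, -1.0]
bias = -1.0
tokens = ["Paris", "in", "1887"]
hidden_states = [[0.1, 0.5], [0.0, 0.0], [2.0, 0.5]]

flagged = [tok for tok, _, hit in flag_tokens(tokens, hidden_states, weights, bias) if hit]
```

Because each score depends only on the current token's hidden state, the check adds a single dot product per generated token.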

[8] Topic Identification in LLM Input-Output Pairs through the Lens of Information Bottleneck

Igor Halperin

Main category: cs.CL

TL;DR: The paper introduces UDIB, a novel method for detecting hallucinations in Large Language Models by creating a shared topic representation optimized for information-theoretic analysis rather than spatial proximity.

Details Motivation: The motivation is to bridge the gap between spatial proximity optimization in existing frameworks like SDM and the need for information-theoretic analysis for detecting hallucinations in Large Language Models. Method: The paper develops a topic identification method based on the Deterministic Information Bottleneck (DIB) by substituting its intractable KL divergence term with a computationally efficient upper bound, resulting in the UDIB algorithm. Result: The resulting UDIB method can be interpreted as an entropy-regularized and robustified version of K-means that inherently favors a parsimonious number of informative clusters, providing a more informative shared topic representation for the prompt-response relationship. Conclusion: The paper concludes that the UDIB method provides a superior foundation for the SDM framework and offers a novel, more sensitive tool for detecting confabulations in LLMs. Abstract: Large Language Models (LLMs) are prone to critical failure modes, including intrinsic faithfulness hallucinations (also known as confabulations), where a response deviates semantically from the provided context. Frameworks designed to detect this, such as Semantic Divergence Metrics (SDM), rely on identifying latent topics shared between prompts and responses, typically by applying geometric clustering to their sentence embeddings. This creates a disconnect, as the topics are optimized for spatial proximity, not for the downstream information-theoretic analysis. In this paper, we bridge this gap by developing a principled topic identification method grounded in the Deterministic Information Bottleneck (DIB) for geometric clustering. Our key contribution is to transform the DIB method into a practical algorithm for high-dimensional data by substituting its intractable KL divergence term with a computationally efficient upper bound. 
The resulting method, which we dub UDIB, can be interpreted as an entropy-regularized and robustified version of K-means that inherently favors a parsimonious number of informative clusters. By applying UDIB to the joint clustering of LLM prompt and response embeddings, we generate a shared topic representation that is not merely spatially coherent but is fundamentally structured to be maximally informative about the prompt-response relationship. This provides a superior foundation for the SDM framework and offers a novel, more sensitive tool for detecting confabulations.

[9] QuesGenie: Intelligent Multimodal Question Generation

Ahmed Mubarak,Amna Ahmed,Amira Nasser,Aya Mohamed,Fares El-Sadek,Mohammed Ahmed,Ahmed Salah,Youssef Sobhy

Main category: cs.CL

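TL;DR: This paper presents an automated, scalable multimodal question generation system that addresses the lack of practice materials tailored to educational resources.

Details Motivation: In today's information-rich era, learners have access to abundant educational resources, but the lack of practice materials tailored to those resources is a significant challenge. This work aims to close that gap by developing a multimodal question generation system that automatically generates diverse question types from various content formats. Method: The authors built a system comprising multimodal input handling, question generation, reinforcement learning from human feedback (RLHF), and an end-to-end interactive interface. Result: The result is a multimodal question generation system that automatically generates diverse question types from various content formats. Conclusion: The project lays the foundation for automated, scalable, and intelligent question generation while carefully balancing resource efficiency, robust functionality, and a smooth user experience. Abstract: In today's information-rich era, learners have access to abundant educational resources, but the lack of practice materials tailored to these resources presents a significant challenge. This project addresses that gap by developing a multi-modal question generation system that can automatically generate diverse question types from various content formats. The system features four major components: multi-modal input handling, question generation, reinforcement learning from human feedback (RLHF), and an end-to-end interactive interface. This project lays the foundation for automated, scalable, and intelligent question generation, carefully balancing resource efficiency, robust functionality and a smooth user experience.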

[10] AR$^2$: Adversarial Reinforcement Learning for Abstract Reasoning in Large Language Models

Cheng-Kai Yeh,Hsing-Wang Lee,Chung-Hung Kuo,Hen-Hsen Huang

Main category: cs.CL

TL;DR: AR$^2$ is a new adversarial reinforcement learning framework that aims to improve the abstract reasoning abilities of large language models and thereby their generalization to new programming tasks.

Details Motivation: Despite progress in LLM code generation, existing approaches focus mainly on superficial pattern recognition and lack explicit training for abstraction. Method: AR$^2$ uses a teacher model to transform kernel problems into narrative-rich, challenging descriptions and trains a student model to extract the underlying computational logic from them. Result: Experiments show that AR$^2$ substantially improves the student model's accuracy on previously unseen, challenging programming tasks. Conclusion: AR$^2$ effectively improves the student model's accuracy on new programming tasks, underscoring the importance of abstraction for LLM generalization. Abstract: Abstraction--the ability to recognize and distill essential computational patterns from complex problem statements--is a foundational skill in computer science, critical both for human problem-solvers and coding-oriented large language models (LLMs). Despite recent advances in training LLMs for code generation using reinforcement learning (RL), most existing approaches focus primarily on superficial pattern recognition, overlooking explicit training for abstraction. In this study, we propose AR$^2$ (Adversarial Reinforcement Learning for Abstract Reasoning), a novel framework explicitly designed to enhance the abstraction abilities of LLMs. AR$^2$ employs a teacher model to transform kernel problems into narrative-rich, challenging descriptions without changing their fundamental logic. Simultaneously, a student coding model is trained to solve these complex narrative problems by extracting their underlying computational kernels. Experimental results demonstrate that AR$^2$ substantially improves the student model's accuracy on previously unseen, challenging programming tasks, underscoring abstraction as a key skill for enhancing LLM generalization.

[11] Improving Factuality in LLMs via Inference-Time Knowledge Graph Construction

Shanglin Wu,Lihui Liu,Jinho D. Choi,Kai Shu

Main category: cs.CL

TL;DR: This paper proposes a new framework that dynamically constructs and expands knowledge graphs (KGs) at inference time to improve the factual consistency of large language models (LLMs).

Details Motivation: Limitations in their parametric memory make it hard for LLMs to produce factually consistent answers. Conventional retrieval-augmented generation (RAG) methods treat knowledge as unstructured text, which limits their ability to support compositional reasoning and identify factual inconsistencies. Method: A seed KG is extracted from the question via prompting and then iteratively expanded using the LLM's latent knowledge. The graph is selectively refined through external retrieval to improve factual coverage and correct errors. Result: Evaluation on three diverse factual QA benchmarks shows that the method outperforms baseline prompting and static KG-augmented methods in factual accuracy, answer precision, and interpretability. Conclusion: Inference-time KG construction is a promising direction for improving LLM factuality in a structured, interpretable, and scalable way. Abstract: Large Language Models (LLMs) often struggle with producing factually consistent answers due to limitations in their parametric memory. Retrieval-Augmented Generation (RAG) methods address this issue by incorporating external knowledge from trusted sources at inference time. However, such methods typically treat knowledge as unstructured text, which limits their ability to support compositional reasoning and identify factual inconsistencies. To overcome these limitations, we propose a novel framework that dynamically constructs and expands knowledge graphs (KGs) during inference, integrating both internal knowledge extracted from LLMs and external information retrieved from external sources. Our method begins by extracting a seed KG from the question via prompting, followed by iterative expansion using the LLM's latent knowledge. The graph is then selectively refined through external retrieval, enhancing factual coverage and correcting inaccuracies. We evaluate our approach on three diverse factual QA benchmarks, demonstrating consistent improvements in factual accuracy, answer precision, and interpretability over baseline prompting and static KG-augmented methods. Our findings suggest that inference-time KG construction is a promising direction for enhancing LLM factuality in a structured, interpretable, and scalable manner.

[12] ResearchPulse: Building Method-Experiment Chains through Multi-Document Scientific Inference

Qi Chen,Jingxuan Wei,Zhuoya Yao,Haiguang Wang,Gaowei Wu,Bihui Yu,Siyuan Li,Cheng Tan

Main category: cs.CL

TL;DR: ResearchPulse introduces a multi-document inference framework to reconstruct research development chains using agent-based task coordination.

Details Motivation: Understanding scientific evolution requires structured reasoning across related research papers. Method: ResearchPulse uses an agent-based approach with three coordinated agents: Plan Agent, Mmap-Agent, and Lchart-Agent. Result: ResearchPulse-Bench benchmark demonstrates the framework's superior performance in semantic alignment, structural consistency, and visual fidelity. Conclusion: The proposed ResearchPulse framework outperforms strong baselines in multi-document scientific inference tasks. Abstract: Understanding how scientific ideas evolve requires more than summarizing individual papers; it demands structured, cross-document reasoning over thematically related research. In this work, we formalize multi-document scientific inference, a new task that extracts and aligns motivation, methodology, and experimental results across related papers to reconstruct research development chains. This task introduces key challenges, including temporally aligning loosely structured methods and standardizing heterogeneous experimental tables. We present ResearchPulse, an agent-based framework that integrates instruction planning, scientific content extraction, and structured visualization. It consists of three coordinated agents: a Plan Agent for task decomposition, a Mmap-Agent that constructs motivation-method mind maps, and a Lchart-Agent that synthesizes experimental line charts. To support this task, we introduce ResearchPulse-Bench, a citation-aware benchmark of annotated paper clusters. Experiments show that our system, despite using 7B-scale agents, consistently outperforms strong baselines like GPT-4o in semantic alignment, structural consistency, and visual fidelity. The dataset is available at https://huggingface.co/datasets/ResearchPulse/ResearchPulse-Bench.

[13] NoteBar: An AI-Assisted Note-Taking System for Personal Knowledge Management

Josh Wisoff,Yao Tang,Zhengyu Fang,Jordan Guzman,YuTang Wang,Alex Yu

Main category: cs.CL

TL;DR: NoteBar is an AI-assisted note-taking tool that improves efficiency by leveraging persona information and efficient language models, supported by a novel dataset.

Details Motivation: Note-taking is crucial for information management, but existing AI-assisted tools face efficiency challenges. The emergence of large language models offers an opportunity for improvement. Method: We present NoteBar, an AI-assisted note-taking tool that leverages persona information and efficient language models to automatically organize notes into multiple categories and better support user workflows. We also introduce a novel persona-conditioned dataset. Result: NoteBar can be deployed practically and cost-effectively, enabling interactive use without reliance on heavy infrastructure. The introduced dataset offers diversity and semantic richness for downstream tasks. Conclusion: NoteBar and its accompanying dataset provide a scalable and extensible foundation for advancing AI-assisted personal knowledge management. Abstract: Note-taking is a critical practice for capturing, organizing, and reflecting on information in both academic and professional settings. The recent success of large language models has accelerated the development of AI-assisted tools, yet existing solutions often struggle with efficiency. We present NoteBar, an AI-assisted note-taking tool that leverages persona information and efficient language models to automatically organize notes into multiple categories and better support user workflows. To support research and evaluation in this space, we further introduce a novel persona-conditioned dataset of 3,173 notes and 8,494 annotated concepts across 16 MBTI personas, offering both diversity and semantic richness for downstream tasks. Finally, we demonstrate that NoteBar can be deployed in a practical and cost-effective manner, enabling interactive use without reliance on heavy infrastructure. Together, NoteBar and its accompanying dataset provide a scalable and extensible foundation for advancing AI-assisted personal knowledge management.

[14] E-ARMOR: Edge case Assessment and Review of Multilingual Optical Character Recognition

Aryan Gupta,Anupam Purwar

Main category: cs.CL

TL;DR: This paper introduces Sprinklr-Edge-OCR, an efficient OCR system optimized for edge devices, and compares it with large vision-language models; the results show that traditional OCR retains a significant advantage in resource-constrained environments.

Details Motivation: Optical character recognition (OCR) in multilingual, noisy, and diverse real-world images remains a major challenge for OCR systems. With the rise of large vision-language models, there is growing interest in their ability to generalize and reason beyond fixed OCR pipelines. Method: The authors introduce Sprinklr-Edge-OCR and run a comparative evaluation of five state-of-the-art large vision-language models and two traditional OCR systems on a large-scale dataset covering 54 languages. Result: Qwen achieved the highest precision (0.54), while Sprinklr-Edge-OCR achieved the best F1 score (0.46) and the best efficiency, averaging 0.17 seconds per image at a cost of only 0.006 USD per 1,000 images. Conclusion: Traditional OCR systems still outperform large vision-language models for edge deployment because of their low compute requirements, low latency, and low cost. Abstract: Optical Character Recognition (OCR) in multilingual, noisy, and diverse real-world images remains a significant challenge. With the rise of Large Vision-Language Models (LVLMs), there is growing interest in their ability to generalize and reason beyond fixed OCR pipelines. In this work, we introduce Sprinklr-Edge-OCR, a novel OCR system specifically optimized for edge deployment in resource-constrained environments. We present a large-scale comparative evaluation of five state-of-the-art LVLMs (InternVL, Qwen, GOT OCR, LLaMA, MiniCPM) and two traditional OCR systems (Sprinklr-Edge-OCR, SuryaOCR) on a proprietary, doubly hand-annotated dataset of multilingual (54 languages) images. Our benchmark covers a broad range of metrics including accuracy, semantic consistency, language coverage, computational efficiency (latency, memory, GPU usage), and deployment cost. To better reflect real-world applicability, we also conducted edge case deployment analysis, evaluating model performance on CPU-only environments. Among the results, Qwen achieved the highest precision (0.54), while Sprinklr-Edge-OCR delivered the best overall F1 score (0.46) and outperformed others in efficiency, processing images 35 times faster (0.17 seconds per image on average) and at less than 0.01 of the cost (0.006 USD per 1,000 images) compared to the LVLMs. Our findings demonstrate that the best OCR systems for edge deployment are the traditional ones even in the era of LLMs due to their low compute requirements, low latency, and very high affordability.

[15] Breaking the Mirror: Activation-Based Mitigation of Self-Preference in LLM Evaluators

Dani Roytburg,Matthew Bozoukov,Matthew Nguyen,Jou Barzdukas,Simon Fu,Narmeen Oozeer

Main category: cs.CL

TL;DR: This paper explores using steering vectors to reduce self-preference bias in large language models (LLMs), showing promising results in mitigating unjustified bias while highlighting limitations in handling legitimate self-preference and unbiased agreement.

Details Motivation: Large language models (LLMs) are increasingly used as automated evaluators but suffer from 'self-preference bias,' where they favor their own outputs over others. This bias undermines fairness and reliability in evaluation pipelines, especially in tasks like preference tuning and model routing. The study investigates whether lightweight steering vectors can mitigate this issue. Method: The authors used a curated dataset to distinguish between justified and unjustified self-preference bias. They constructed steering vectors using two methods: Contrastive Activation Addition (CAA) and an optimization-based approach, aiming to mitigate self-preference bias at inference time without retraining. Result: The results show that steering vectors can reduce unjustified self-preference bias by up to 97%, outperforming prompting and direct preference optimization baselines. However, they are unstable when dealing with legitimate self-preference and unbiased agreement, suggesting that self-preference spans multiple or nonlinear directions. Conclusion: Steering vectors have potential in reducing unjustified self-preference bias in LLMs but show instability in handling legitimate self-preference and unbiased agreement, highlighting the need for more robust interventions. Abstract: Large language models (LLMs) increasingly serve as automated evaluators, yet they suffer from "self-preference bias": a tendency to favor their own outputs over those of other models. This bias undermines fairness and reliability in evaluation pipelines, particularly for tasks like preference tuning and model routing. We investigate whether lightweight steering vectors can mitigate this problem at inference time without retraining. 
We introduce a curated dataset that separates self-preference into justified and unjustified examples, and we construct steering vectors using two methods: Contrastive Activation Addition (CAA) and an optimization-based approach. Our results show that steering vectors can reduce unjustified self-preference bias by up to 97%, substantially outperforming prompting and direct preference optimization baselines. Yet steering vectors are unstable on legitimate self-preference and unbiased agreement, implying self-preference spans multiple or nonlinear directions. This underscores both their promise and limits as safeguards for LLM-as-judges and motivates more robust interventions.
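As a rough illustration of the CAA idea mentioned in the abstract, the sketch below builds a steering vector as the difference between mean activations of two contrastive prompt sets and adds a scaled copy to a hidden state. The function names, the toy two-dimensional activations, and the choice of which set counts as "positive" are assumptions made for illustration; the paper's layer choice, scaling, and optimization-based variant are not reproduced here.

```python
from statistics import fmean

def caa_steering_vector(pos_acts, neg_acts):
    """Difference of per-dimension mean activations between two contrastive sets.

    pos_acts / neg_acts are lists of equal-length activation vectors
    (hypothetical stand-ins for residual-stream activations at one layer).
    """
    dim = len(pos_acts[0])
    return [
        fmean(a[i] for a in pos_acts) - fmean(a[i] for a in neg_acts)
        for i in range(dim)
    ]

def steer(hidden, vector, alpha=1.0):
    """Add the scaled steering vector to a hidden state at inference time."""
    return [h + alpha * v for h, v in zip(hidden, vector)]

if __name__ == "__main__":
    pos = [[1.0, 2.0], [3.0, 4.0]]   # toy activations for one behavior
    neg = [[0.0, 1.0], [2.0, 1.0]]   # toy activations for the contrasting behavior
    v = caa_steering_vector(pos, neg)
    print(v)                                  # [1.0, 2.0]
    print(steer([0.5, 0.5], v, alpha=-1.0))   # [-0.5, -1.5]
```

A negative `alpha` pushes the hidden state away from the "positive" direction, which is one plausible way to suppress a behavior; the sign convention depends entirely on how the contrastive sets are defined.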

[16] Semantic Analysis of SNOMED CT Concept Co-occurrences in Clinical Documentation using MIMIC-IV

Ali Noori,Somya Mohanty,Prashanti Manda

Main category: cs.CL

TL;DR: This study explores the relationship between SNOMED CT concept co-occurrence and semantic similarity using the MIMIC-IV database and pre-trained embeddings. It finds that while co-occurrence and semantic similarity are weakly correlated, embeddings capture meaningful clinical associations, support augmenting clinical annotations, and demonstrate practical utility in linking co-occurrence patterns to patient outcomes.

Details Motivation: The motivation behind this study is to better understand how clinical concepts relate through co-occurrence and semantic similarity, despite the challenges posed by the unstructured format of clinical notes and the underexplored nature of concept relationships. Method: The study uses the MIMIC-IV database and leverages techniques such as Normalized Pointwise Mutual Information (NPMI) along with pre-trained embeddings like ClinicalBERT and BioBERT. It explores the relationship between SNOMED CT concept co-occurrence patterns and embedding-based semantic similarity. Result: The analyses reveal a weak correlation between co-occurrence and semantic similarity, but embeddings capture meaningful clinical associations not always reflected in documentation frequency. Embedding-based suggestions matched concepts later documented, showing utility for augmenting annotations. Clustering of concept embeddings mapped to coherent clinical themes and patient phenotypes. Co-occurrence patterns were also linked to outcomes like mortality and readmission. Conclusion: This paper concludes that co-occurrence statistics and semantic embeddings have complementary value in improving clinical documentation completeness, uncovering latent clinical relationships, and informing decision support and phenotyping applications. Abstract: Clinical notes contain rich clinical narratives but their unstructured format poses challenges for large-scale analysis. Standardized terminologies such as SNOMED CT improve interoperability, yet understanding how concepts relate through co-occurrence and semantic similarity remains underexplored. In this study, we leverage the MIMIC-IV database to investigate the relationship between SNOMED CT concept co-occurrence patterns and embedding-based semantic similarity. 
Using Normalized Pointwise Mutual Information (NPMI) and pretrained embeddings (e.g., ClinicalBERT, BioBERT), we examine whether frequently co-occurring concepts are also semantically close, whether embeddings can suggest missing concepts, and how these relationships evolve temporally and across specialties. Our analyses reveal that while co-occurrence and semantic similarity are weakly correlated, embeddings capture clinically meaningful associations not always reflected in documentation frequency. Embedding-based suggestions frequently matched concepts later documented, supporting their utility for augmenting clinical annotations. Clustering of concept embeddings yielded coherent clinical themes (symptoms, labs, diagnoses, cardiovascular conditions) that map to patient phenotypes and care patterns. Finally, co-occurrence patterns linked to outcomes such as mortality and readmission demonstrate the practical utility of this approach. Collectively, our findings highlight the complementary value of co-occurrence statistics and semantic embeddings in improving documentation completeness, uncovering latent clinical relationships, and informing decision support and phenotyping applications.
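For readers unfamiliar with NPMI, the following minimal sketch computes it over concept sets, one set per clinical note. It uses the standard definition, NPMI(x, y) = PMI(x, y) / (-log p(x, y)), with probabilities estimated as note-level frequencies. The function name and the toy concept labels are hypothetical, and this is not the authors' pipeline.

```python
import math
from collections import Counter
from itertools import combinations

def npmi_table(notes):
    """Return {(x, y): NPMI} for every concept pair that co-occurs in a note.

    `notes` is a list of iterables of concept identifiers. Pairs that
    co-occur in every note are skipped because NPMI is undefined there
    (the denominator -log p(x, y) is zero).
    """
    n = len(notes)
    single, pair = Counter(), Counter()
    for concepts in notes:
        uniq = sorted(set(concepts))          # count each concept once per note
        single.update(uniq)
        pair.update(combinations(uniq, 2))    # keys are alphabetically ordered pairs
    scores = {}
    for (x, y), c_xy in pair.items():
        p_xy = c_xy / n
        if p_xy == 1.0:
            continue
        pmi = math.log(p_xy / ((single[x] / n) * (single[y] / n)))
        scores[(x, y)] = pmi / -math.log(p_xy)
    return scores

if __name__ == "__main__":
    notes = [["fever", "cough"], ["fever", "cough"],
             ["fracture"], ["fracture", "fever"]]
    for pair_key, score in sorted(npmi_table(notes).items()):
        print(pair_key, round(score, 3))
```

Scores lie in [-1, 1]: values near 1 mean two concepts almost always appear together, values near 0 mean independence, and negative values mean they co-occur less often than chance would predict.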

[17] MLSD: A Novel Few-Shot Learning Approach to Enhance Cross-Target and Cross-Domain Stance Detection

Parush Gera,Tempestt Neal

Main category: cs.CL

TL;DR: The paper introduces MLSD, a new method for stance detection across domains and targets, which improves performance by using metric learning to better adapt to new domains.

Details Motivation: The motivation is to improve stance detection across different domains and targets. Method: MLSD utilizes metric learning with triplet loss to capture semantic similarities and differences between stance targets, enhancing domain adaptation. Result: The authors evaluated MLSD in multiple cross-target and cross-domain scenarios across two datasets, showing statistically significant improvement in stance detection performance across six widely used stance detection models. Conclusion: MLSD allows a cross-target or cross-domain stance detection model to acquire useful examples from new target domains. Abstract: We present a novel approach for stance detection across domains and targets, Metric Learning-Based Few-Shot Learning for Cross-Target and Cross-Domain Stance Detection (MLSD). MLSD utilizes metric learning with triplet loss to capture semantic similarities and differences between stance targets, enhancing domain adaptation. By constructing a discriminative embedding space, MLSD allows a cross-target or cross-domain stance detection model to acquire useful examples from new target domains. We evaluate MLSD in multiple cross-target and cross-domain scenarios across two datasets, showing statistically significant improvement in stance detection performance across six widely used stance detection models.
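The triplet loss named in the abstract can be sketched in its standard form, max(0, d(a, p) - d(a, n) + margin), where a is an anchor embedding, p a positive (similar) example, and n a negative one. The Euclidean distance, the margin value, and the toy embeddings below are assumptions chosen for illustration; the paper's encoder and triplet sampling strategy are not shown.

```python
import math

def triplet_loss(anchor, positive, negative, margin=1.0):
    """Standard triplet margin loss on plain Python vectors.

    The loss is zero once the negative is at least `margin` farther from
    the anchor than the positive is.
    """
    d_pos = math.dist(anchor, positive)   # Euclidean distance anchor-positive
    d_neg = math.dist(anchor, negative)   # Euclidean distance anchor-negative
    return max(0.0, d_pos - d_neg + margin)

if __name__ == "__main__":
    anchor = [0.0, 0.0]
    positive = [3.0, 4.0]                                    # distance 5 from anchor
    print(triplet_loss(anchor, positive, [6.0, 8.0]))        # 0.0 (negative already far)
    print(triplet_loss(anchor, positive, [0.0, 3.0]))        # 3.0 (negative too close)
```

Minimizing this loss over many triplets pulls same-class examples together and pushes different-class examples apart, which is what makes nearest-neighbor retrieval of useful examples in the embedding space meaningful.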

[18] SiLVERScore: Semantically-Aware Embeddings for Sign Language Generation Evaluation

Saki Imai,Mert İnan,Anthony Sicilia,Malihe Alikhani

Main category: cs.CL

TL;DR: This paper introduces SiLVERScore, a novel semantically-aware evaluation metric for sign language generation that outperforms traditional methods by operating in a joint embedding space and better capturing the multimodal nature of sign language.

Details Motivation: Existing evaluation methods for sign language generation, like back-translation, fail to capture the multimodal nature of sign language and make it difficult to identify the source of errors, prompting the need for a more robust and semantically-aware evaluation approach. Method: SiLVERScore is a semantically-aware embedding-based evaluation metric that operates in a joint embedding space, addressing the limitations of the two-step back-translation pipeline. Result: SiLVERScore achieves an ROC AUC of 0.99 with less than 7% overlap between correct and random pairs, demonstrating robustness to semantic and prosodic variations and outperforming traditional metrics. Conclusion: SiLVERScore substantially outperforms traditional metrics in evaluating sign language generation, showing near-perfect discrimination between correct and random pairs on PHOENIX-14T and CSL-Daily datasets. Abstract: Evaluating sign language generation is often done through back-translation, where generated signs are first recognized back to text and then compared to a reference using text-based metrics. However, this two-step evaluation pipeline introduces ambiguity: it not only fails to capture the multimodal nature of sign language (such as facial expressions, spatial grammar, and prosody) but also makes it hard to pinpoint whether evaluation errors come from the sign generation model or the translation system used to assess it. In this work, we propose SiLVERScore, a novel semantically-aware embedding-based evaluation metric that assesses sign language generation in a joint embedding space. Our contributions include: (1) identifying limitations of existing metrics, (2) introducing SiLVERScore for semantically-aware evaluation, (3) demonstrating its robustness to semantic and prosodic variations, and (4) exploring generalization challenges across datasets. 
On PHOENIX-14T and CSL-Daily datasets, SiLVERScore achieves near-perfect discrimination between correct and random pairs (ROC AUC = 0.99, overlap < 7%), substantially outperforming traditional metrics.

[19] Measuring How (Not Just Whether) VLMs Build Common Ground

Saki Imai,Mert İnan,Anthony Sicilia,Malihe Alikhani

Main category: cs.CL

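TL;DR: This study evaluates large vision language models (VLMs) on interactive grounding tasks, finds that current models diverge from human behavior, and proposes a new evaluation framework.

Details Motivation: Existing benchmarks mainly evaluate VLMs in single-turn or question answering settings, whereas real grounding is an interactive process in which shared understanding is built gradually through ongoing communication. Method: The authors propose an evaluation suite of four metrics and test three proprietary VLMs on 150 self-play sessions of interactive referential games, then compare them with human dyads. Result: All three models diverge from human patterns on at least three metrics; task success scores do not indicate successful grounding, and high image-utterance alignment does not necessarily predict task success. Conclusion: Current VLMs diverge from human behavior in interactive grounding tasks. GPT4o-mini is the closest to human patterns overall, but task success is not a reliable indicator of successful grounding. Abstract: Large vision language models (VLMs) increasingly claim reasoning skills, yet current benchmarks evaluate them in single-turn or question answering settings. However, grounding is an interactive process in which people gradually develop shared understanding through ongoing communication. We introduce a four-metric suite (grounding efficiency, content alignment, lexical adaptation, and human-likeness) to systematically evaluate VLM performance in interactive grounding contexts. We deploy the suite on 150 self-play sessions of interactive referential games between three proprietary VLMs and compare them with human dyads. All three models diverge from human patterns on at least three metrics, while GPT4o-mini is the closest overall. We find that (i) task success scores do not indicate successful grounding and (ii) high image-utterance alignment does not necessarily predict task success. Our metric suite and findings offer a framework for future research on VLM grounding.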

[20] Align-then-Slide: A complete evaluation framework for Ultra-Long Document-Level Machine Translation

Jiaxin Guo,Daimeng Wei,Yuanchang Luo,Xiaoyu Chen,Zhanglin Wu,Huan Yang,Hengchao Shang,Zongyao Li,Zhiqiang Rao,Jinlong Yang,Hao Yang

Main category: cs.CL

TL;DR: The paper proposes Align-then-Slide, a complete evaluation framework for ultra-long document-level machine translation that aligns sentence-level correspondences and evaluates using multi-granularity chunks, demonstrating high correlation with human judgments and effectiveness in training improved translation models.

Details Motivation: The motivation is to address the challenge posed by large language models' whole-document outputs, which are not compatible with existing evaluation methods that assume sentence-by-sentence alignment in document-level machine translation. Method: The Align-then-Slide framework consists of two stages: Align, where sentence-level source-target correspondences are inferred and the target is rebuilt to match the source sentence number; and n-Chunk Sliding Evaluate, where averaged metric scores are calculated under 1-, 2-, 3-, and 4-chunks for multi-granularity assessment. Result: Experiments on the WMT benchmark showed a Pearson correlation of 0.929 between the proposed method and expert MQM rankings. On a newly curated real-world test set, the method closely aligned with human judgments. Additionally, the preference data generated by Align-then-Slide enabled effective CPO training and served as a reward model for GRPO, resulting in translations preferred over a vanilla SFT baseline. Conclusion: The proposed Align-then-Slide framework is validated as an accurate, robust, and actionable evaluation tool for document-level machine translation systems. Abstract: Large language models (LLMs) have ushered in a new era for document-level machine translation (doc-mt), yet their whole-document outputs challenge existing evaluation methods that assume sentence-by-sentence alignment. We introduce Align-then-Slide, a complete evaluation framework for ultra-long doc-mt. In the Align stage, we automatically infer sentence-level source-target correspondences and rebuild the target to match the source sentence number, resolving omissions and many-to-one/one-to-many mappings. In the n-Chunk Sliding Evaluate stage, we calculate averaged metric scores under 1-, 2-, 3- and 4-chunk for multi-granularity assessment. Experiments on the WMT benchmark show a Pearson correlation of 0.929 between our method and expert MQM rankings. 
On a newly curated real-world test set, our method again aligns closely with human judgments. Furthermore, preference data produced by Align-then-Slide enables effective CPO training and its direct use as a reward model for GRPO, both yielding translations preferred over a vanilla SFT baseline. The results validate our framework as an accurate, robust, and actionable evaluation tool for doc-mt systems.
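One way to read the n-Chunk Sliding Evaluate stage is sketched below: once the two sides have the same number of sentences, slide a window of n consecutive sentences over both, score each pair of chunks, average per n, and then average over n = 1 to 4. The stride of one sentence, the final averaging, and the toy word-overlap metric are all assumptions made for illustration; the paper's actual metric and aggregation may differ.

```python
def sliding_chunks(sentences, n):
    """Concatenate every window of n consecutive sentences (stride 1).

    If the document has fewer than n sentences, the whole document is
    returned as a single chunk (an assumption for short inputs).
    """
    if len(sentences) < n:
        return [" ".join(sentences)]
    return [" ".join(sentences[i:i + n]) for i in range(len(sentences) - n + 1)]

def toy_overlap(a, b):
    """Jaccard word overlap; a placeholder for a real MT quality metric."""
    wa, wb = set(a.split()), set(b.split())
    return len(wa & wb) / len(wa | wb) if wa | wb else 1.0

def n_chunk_score(left, right, metric=toy_overlap, sizes=(1, 2, 3, 4)):
    """Average `metric` over aligned chunks for each size, then over sizes."""
    if len(left) != len(right):
        raise ValueError("inputs must be aligned to the same sentence count")
    per_size = {}
    for n in sizes:
        pairs = list(zip(sliding_chunks(left, n), sliding_chunks(right, n)))
        per_size[n] = sum(metric(a, b) for a, b in pairs) / len(pairs)
    return per_size, sum(per_size.values()) / len(per_size)

if __name__ == "__main__":
    hyp = ["the cat sat", "on the mat", "it slept", "all day"]
    ref = ["the cat sat", "on a mat", "it slept", "all day long"]
    per_size, overall = n_chunk_score(hyp, ref)
    print(per_size, overall)
```

Larger chunk sizes let the metric see cross-sentence context, so an error that looks severe at the sentence level (for example, content moved to a neighboring sentence) is penalized less at coarser granularity.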

[21] NE-PADD: Leveraging Named Entity Knowledge for Robust Partial Audio Deepfake Detection via Attention Aggregation

Huhong Xian,Rui Liu,Berrak Sisman,Haizhou Li

Main category: cs.CL

TL;DR: This paper proposes NE-PADD, which incorporates named entity knowledge and achieves strong performance on the partial audio deepfake detection (PADD) task.

Details Motivation: Partial audio deepfake detection (PADD) requires frame-level localization of fake speech, yet the use of semantic information in audio, especially named entities, remains underexplored. Method: NE-PADD leverages named entity knowledge through two parallel branches, SpeechNER and PADD, and introduces Attention Fusion (AF) and Attention Transfer (AT) mechanisms. Result: Experiments show that NE-PADD performs strongly on the PADD task, supporting the effectiveness of named entity knowledge for this task. Conclusion: Experiments on the PartialSpoof-NER dataset show that NE-PADD outperforms existing baselines; the code is open-sourced. Abstract: Different from traditional sentence-level audio deepfake detection (ADD), partial audio deepfake detection (PADD) requires frame-level positioning of the location of fake speech. While some progress has been made in this area, leveraging semantic information from audio, especially named entities, remains an underexplored aspect. To this end, we propose NE-PADD, a novel method for Partial Audio Deepfake Detection (PADD) that leverages named entity knowledge through two parallel branches: Speech Named Entity Recognition (SpeechNER) and PADD. The approach incorporates two attention aggregation mechanisms: Attention Fusion (AF) for combining attention weights and Attention Transfer (AT) for guiding PADD with named entity semantics using an auxiliary loss. Built on the PartialSpoof-NER dataset, experiments show our method outperforms existing baselines, proving the effectiveness of integrating named entity knowledge in PADD. The code is available at https://github.com/AI-S2-Lab/NE-PADD.

[22] Drivel-ology: Challenging LLMs with Interpreting Nonsense with Depth

Yang Wang,Chenghao Xiao,Chia-Yi Hsiao,Zi Yan Chang,Chi-Li Chen,Tyler Loakman,Chenghua Lin

Main category: cs.CL

TL;DR: This study introduces Drivelology, a linguistic phenomenon, shows that current language models struggle to understand its complex semantics, and builds a dataset for studying it.

Details Motivation: Drivelology is a distinctive linguistic phenomenon characterized as "nonsense with depth". Although current large language models excel at many natural language processing tasks, they struggle to grasp the implicit meaning of Drivelological text. Method: The authors build a small but diverse benchmark dataset of over 1,200 carefully curated examples for evaluating classification, generation, and reasoning tasks. Result: Experiments show that models often confuse Drivelology with shallow nonsense, produce incoherent justifications, or miss the implied rhetorical function altogether. Conclusion: Current large language models show clear limitations in understanding the layered semantics of Drivelological text, which points to a deeper representational gap in their pragmatic understanding. Abstract: We introduce Drivelology, a unique linguistic phenomenon characterised as "nonsense with depth", utterances that are syntactically coherent yet pragmatically paradoxical, emotionally loaded, or rhetorically subversive. While such expressions may resemble surface-level nonsense, they encode implicit meaning requiring contextual inference, moral reasoning, or emotional interpretation. We find that current large language models (LLMs), despite excelling at many natural language processing (NLP) tasks, consistently fail to grasp the layered semantics of Drivelological text. To investigate this, we construct a small but diverse benchmark dataset of over 1,200 meticulously curated examples, with select instances in English, Mandarin, Spanish, French, Japanese, and Korean. Annotation was especially challenging: each of the examples required careful expert review to verify that it truly reflected Drivelological characteristics. The process involved multiple rounds of discussion and adjudication to address disagreements, highlighting the subtle and subjective nature of Drivelology. We evaluate a range of LLMs on classification, generation, and reasoning tasks. Our results reveal clear limitations of LLMs: models often confuse Drivelology with shallow nonsense, produce incoherent justifications, or miss the implied rhetorical function altogether. These findings highlight a deeper representational gap in LLMs' pragmatic understanding and challenge the assumption that statistical fluency implies cognitive comprehension. We release our dataset and code to facilitate further research in modelling linguistic depth beyond surface-level coherence.

[23] A Comprehensive Survey on Trustworthiness in Reasoning with Large Language Models

Yanbo Wang,Yongcan Yu,Jian Liang,Ran He

Main category: cs.CL

TL;DR: This paper surveys how Long-CoT reasoning techniques, while improving language model performance, affect trustworthiness (truthfulness, safety, robustness, fairness, and privacy). It finds that although reasoning techniques have the potential to improve trustworthiness, current models still have significant safety, robustness, and privacy vulnerabilities.

Details Motivation: Although Long-CoT reasoning has advanced model performance, its overall effect on model trustworthiness remains unclear and calls for systematic analysis. Method: The survey is organized around five core dimensions of trustworthy reasoning (truthfulness, safety, robustness, fairness, privacy). For each, it reviews related studies in chronological order, analyzes their methodologies, findings, and limitations, and proposes future research directions. Result: Reasoning techniques show promise for mitigating hallucination, detecting harmful content, and improving robustness, but current reasoning models still face substantial challenges in safety, robustness, and privacy. Conclusion: The survey synthesizes existing research to give the AI safety community a systematic reference on recent progress and open challenges in the trustworthiness of reasoning models. Abstract: The development of Long-CoT reasoning has advanced LLM performance across various tasks, including language understanding, complex problem solving, and code generation. This paradigm enables models to generate intermediate reasoning steps, thereby improving both accuracy and interpretability. However, despite these advancements, a comprehensive understanding of how CoT-based reasoning affects the trustworthiness of language models remains underdeveloped. In this paper, we survey recent work on reasoning models and CoT techniques, focusing on five core dimensions of trustworthy reasoning: truthfulness, safety, robustness, fairness, and privacy. For each aspect, we provide a clear and structured overview of recent studies in chronological order, along with detailed analyses of their methodologies, findings, and limitations. Future research directions are also appended at the end for reference and discussion. Overall, while reasoning techniques hold promise for enhancing model trustworthiness through hallucination mitigation, harmful content detection, and robustness improvement, cutting-edge reasoning models themselves often suffer from comparable or even greater vulnerabilities in safety, robustness, and privacy. By synthesizing these insights, we hope this work serves as a valuable and timely resource for the AI safety community to stay informed on the latest progress in reasoning trustworthiness. A full list of related papers can be found at https://github.com/ybwang119/Awesome-reasoning-safety.

[24] False Sense of Security: Why Probing-based Malicious Input Detection Fails to Generalize

Cheng Wang,Zeming Wei,Qin Liu,Muhao Chen

Main category: cs.CL

TL;DR: This paper systematically examines the limitations of current probing-based safety detection methods for Large Language Models (LLMs), revealing that they rely on superficial patterns rather than semantic understanding. The authors recommend a redesign of models and evaluation protocols and provide open-source resources for further research.

Details Motivation: The motivation for this study stems from the poor out-of-distribution performance of current probing-based methods, which suggests that they detect superficial patterns rather than the semantic harmfulness of inputs. Method: The investigation follows a systematic approach, starting with demonstrating the performance of simple n-gram methods, conducting controlled experiments with semantically cleaned datasets, and analyzing pattern dependencies. Result: Through controlled experiments, the researchers confirmed that probes learn superficial patterns such as instructional patterns and trigger words, rather than semantic harmfulness, leading to a false sense of security in safety detection. Conclusion: The study concludes that current probing-based safety detection methods for Large Language Models (LLMs) provide a false sense of security, as they rely on superficial patterns rather than semantic understanding. This highlights the need to redesign both models and evaluation protocols. Abstract: Large Language Models (LLMs) can comply with harmful instructions, raising serious safety concerns despite their impressive capabilities. Recent work has leveraged probing-based approaches to study the separability of malicious and benign inputs in LLMs' internal representations, and researchers have proposed using such probing methods for safety detection. We systematically re-examine this paradigm. Motivated by poor out-of-distribution performance, we hypothesize that probes learn superficial patterns rather than semantic harmfulness. Through controlled experiments, we confirm this hypothesis and identify the specific patterns learned: instructional patterns and trigger words. Our investigation follows a systematic approach, progressing from demonstrating comparable performance of simple n-gram methods, to controlled experiments with semantically cleaned datasets, to detailed analysis of pattern dependencies. 
These results reveal a false sense of security around current probing-based approaches and highlight the need to redesign both models and evaluation protocols, for which we provide further discussions in the hope of suggesting responsible further research in this direction. We have open-sourced the project at https://github.com/WangCheng0116/Why-Probe-Fails.

[25] MobileRAG: Enhancing Mobile Agent with Retrieval-Augmented Generation

Gowen Loo,Chang Liu,Qinghong Yin,Xiang Chen,Jiawei Chen,Jingyuan Zhang,Yu Tian

Main category: cs.CL

TL;DR: This paper proposes MobileRAG, a framework that uses RAG to address several key problems of current mobile agents in task handling, and validates its performance gains on a new benchmark.

Details Motivation: Current LLM-based mobile agents suffer from misunderstanding queries, omitting steps, lacking interaction with the external environment, and lacking memory, and need improvement. Method: The authors propose MobileRAG, a framework comprising InterRAG, LocalRAG, and MemRAG that uses Retrieval-Augmented Generation (RAG) to improve the efficiency and accuracy of task handling. Result: On the MobileRAG-Eval benchmark, MobileRAG improves on state-of-the-art methods by 10.3% while using fewer operational steps. Conclusion: By integrating RAG, MobileRAG substantially improves mobile agents' ability to handle complex and long-sequence tasks, as validated on the MobileRAG-Eval benchmark. Abstract: Smartphones have become indispensable in people's daily lives, permeating nearly every aspect of modern society. With the continuous advancement of large language models (LLMs), numerous LLM-based mobile agents have emerged. These agents are capable of accurately parsing diverse user queries and automatically assisting users in completing complex or repetitive operations. However, current agents 1) heavily rely on the comprehension ability of LLMs, which can lead to errors caused by misoperations or omitted steps during tasks, 2) lack interaction with the external environment, often terminating tasks when an app cannot fulfill user queries, and 3) lack memory capabilities, requiring each instruction to reconstruct the interface and being unable to learn from and correct previous mistakes. To alleviate the above issues, we propose MobileRAG, a mobile agent framework enhanced by Retrieval-Augmented Generation (RAG), which includes InterRAG, LocalRAG, and MemRAG. It leverages RAG to more quickly and accurately identify user queries and accomplish complex and long-sequence mobile tasks. Additionally, to more comprehensively assess the performance of MobileRAG, we introduce MobileRAG-Eval, a more challenging benchmark characterized by numerous complex, real-world mobile tasks that require external knowledge assistance. Extensive experimental results on MobileRAG-Eval demonstrate that MobileRAG can easily handle real-world mobile tasks, achieving 10.3% improvement over state-of-the-art methods with fewer operational steps. Our code is publicly available at: https://github.com/liuxiaojieOutOfWorld/MobileRAG_arxiv

[26] MTQA: Matrix of Thought for Enhanced Reasoning in Complex Question Answering

Fengxiao Tang,Yufeng Li,Zongzong Wu,Ming Zhao

Main category: cs.CL

TL;DR: This paper proposes Matrix of Thought (MoT) and a fact-correction mechanism to improve the reasoning capabilities of large language models (LLMs) for complex QA tasks, resulting in an efficient and accurate QA framework (MTQA).

Details Motivation: To overcome the limitations of existing methods like Chain-of-Thought (CoT), Tree-of-Thought (ToT), and Retrieval-Augmented Generation (RAG), which face challenges such as redundancy, single-path constraints, and difficulty in handling multi-hop reasoning with multiple entities. Method: The study introduces the Matrix of Thought (MoT) structure, which enables horizontal and vertical exploration using a 'column-cell communication' mechanism. It also incorporates a fact-correction mechanism using knowledge units derived from knowledge graph triples and raw text. Result: The proposed MTQA framework outperforms state-of-the-art methods on four widely-used datasets in terms of F1 and EM scores, while requiring only 14.4% of the reasoning time of baseline methods. Conclusion: The proposed Matrix of Thought (MoT) framework, combined with a fact-correction mechanism, significantly enhances reasoning capabilities of LLMs, leading to superior performance in terms of efficiency and accuracy on complex QA tasks. Abstract: Complex Question Answering (QA) is a fundamental and challenging task in NLP. While large language models (LLMs) exhibit impressive performance in QA, they suffer from significant performance degradation when facing complex and abstract QA tasks due to insufficient reasoning capabilities. Works such as Chain-of-Thought (CoT) and Tree-of-Thought (ToT) aim to enhance LLMs' reasoning abilities, but they face issues such as in-layer redundancy in tree structures and single paths in chain structures. Although some studies utilize Retrieval-Augmented Generation (RAG) methods to assist LLMs in reasoning, the challenge of effectively utilizing large amounts of information involving multiple entities and hops remains critical. To address this, we propose the Matrix of Thought (MoT), a novel and efficient LLM thought structure. 
MoT explores the problem in both horizontal and vertical dimensions through the "column-cell communication" mechanism, enabling LLMs to actively engage in multi-strategy and deep-level thinking, reducing redundancy within the column cells and enhancing reasoning capabilities. Furthermore, we develop a fact-correction mechanism by constructing knowledge units from retrieved knowledge graph triples and raw text to enhance the initial knowledge for LLM reasoning and correct erroneous answers. This leads to the development of an efficient and accurate QA framework (MTQA). Experimental results show that our framework outperforms state-of-the-art methods on four widely-used datasets in terms of F1 and EM scores, with reasoning time only 14.4% of the baseline methods, demonstrating both its efficiency and accuracy. The code for this framework is available at https://github.com/lyfiter/mtqa.

[27] Decoding the Poetic Language of Emotion in Korean Modern Poetry: Insights from a Human-Labeled Dataset and AI Modeling

Iro Lim,Haein Ji,Byungjun Kim

Main category: cs.CL

TL;DR: This study creates the KPoEM dataset and model to improve computational emotion analysis of modern Korean poetry, and reports a substantial performance gain.

Details Motivation: Although text-based emotion classification with large language models has made remarkable progress, Korean poetry remains underexplored in computational emotion analysis because of its figurative language and cultural specificity. Method: The authors build a multi-label emotion dataset of 7,662 entries, including 7,007 line-level entries and 615 work-level entries, and fine-tune a state-of-the-art Korean language model on it. Result: The fine-tuned model significantly outperforms previous models on the KPoEM dataset, reaching 0.60 F1-micro compared to 0.34 for models trained on general corpora. Conclusion: By building the KPoEM dataset and a corresponding model, the study bridges computational methods and literary analysis and opens new possibilities for the quantitative exploration of emotion in modern Korean poetry. Abstract: This study introduces KPoEM (Korean Poetry Emotion Mapping), a novel dataset for computational emotion analysis in modern Korean poetry. Despite remarkable progress in text-based emotion classification using large language models, poetry (particularly Korean poetry) remains underexplored due to its figurative language and cultural specificity. We built a multi-label emotion dataset of 7,662 entries, including 7,007 line-level entries from 483 poems and 615 work-level entries, annotated with 44 fine-grained emotion categories from five influential Korean poets. A state-of-the-art Korean language model fine-tuned on this dataset significantly outperformed previous models, achieving 0.60 F1-micro compared to 0.34 from models trained on general corpora. The KPoEM model, trained through sequential fine-tuning (first on general corpora and then on the KPoEM dataset), demonstrates not only an enhanced ability to identify temporally and culturally specific emotional expressions, but also a strong capacity to preserve the core sentiments of modern Korean poetry. This study bridges computational methods and literary analysis, presenting new possibilities for the quantitative exploration of poetic emotions through structured data that faithfully retains the emotional and cultural nuances of Korean literature.
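The F1-micro figure reported above pools true positives, false positives, and false negatives across all labels and examples before computing F1, so frequent emotion labels weigh more than rare ones. A minimal sketch for multi-label predictions follows; the function name and the toy emotion labels are illustrative and are not taken from the KPoEM label set.

```python
def micro_f1(gold, pred):
    """Micro-averaged F1 for multi-label data.

    `gold` and `pred` are parallel lists of label sets, one per example.
    Counts are pooled over all examples and labels before computing F1.
    """
    tp = fp = fn = 0
    for g, p in zip(gold, pred):
        tp += len(g & p)   # labels predicted and present in gold
        fp += len(p - g)   # labels predicted but absent from gold
        fn += len(g - p)   # gold labels the model missed
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

if __name__ == "__main__":
    gold = [{"joy", "sadness"}, {"anger"}]
    pred = [{"joy"}, {"anger", "fear"}]
    # Pooled counts: tp=2, fp=1, fn=1, so F1 = 4 / 6
    print(round(micro_f1(gold, pred), 4))  # 0.6667
```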

[28] SelfAug: Mitigating Catastrophic Forgetting in Retrieval-Augmented Generation via Distribution Self-Alignment

Yuqing Huang,Rongyang Zhang,Qimeng Wang,Chengqiang Lu,Yan Gao,Yi Wu,Yao Hu,Xuyang Zhi,Guiquan Liu,Xin Li,Hao Wang,Enhong Chen

Main category: cs.CL

TL;DR: SelfAug is introduced as a solution to catastrophic forgetting in large language models during fine-tuning, particularly in RAG scenarios, by aligning input sequence logits to maintain the model's semantic distribution.

Details Motivation: Supervised fine-tuning in RAG scenarios often leads to catastrophic forgetting, where models lose previously acquired knowledge. Existing solutions have limitations in preserving the model's original distribution. Method: SelfAug, a self-distribution alignment method that aligns input sequence logits to preserve the model's semantic distribution, is proposed. Result: Extensive experiments show that SelfAug achieves a superior balance between downstream learning and general capability retention. The study reveals a direct correlation between distribution shifts and the severity of catastrophic forgetting in RAG scenarios. Conclusion: SelfAug effectively balances downstream learning and the retention of general capabilities, offering a practical solution to catastrophic forgetting in various fine-tuning scenarios. Abstract: Recent advancements in large language models (LLMs) have revolutionized natural language processing through their remarkable capabilities in understanding and executing diverse tasks. While supervised fine-tuning, particularly in Retrieval-Augmented Generation (RAG) scenarios, effectively enhances task-specific performance, it often leads to catastrophic forgetting, where models lose their previously acquired knowledge and general capabilities. Existing solutions either require access to general instruction data or face limitations in preserving the model's original distribution. To overcome these limitations, we propose SelfAug, a self-distribution alignment method that aligns input sequence logits to preserve the model's semantic distribution, thereby mitigating catastrophic forgetting and improving downstream performance. Extensive experiments demonstrate that SelfAug achieves a superior balance between downstream learning and general capability retention. 
Our comprehensive empirical analysis reveals a direct correlation between distribution shifts and the severity of catastrophic forgetting in RAG scenarios, highlighting how the absence of RAG capabilities in general instruction tuning leads to significant distribution shifts during fine-tuning. Our findings not only advance the understanding of catastrophic forgetting in RAG contexts but also provide a practical solution applicable across diverse fine-tuning scenarios. Our code is publicly available at https://github.com/USTC-StarTeam/SelfAug.

[29] SPFT-SQL: Enhancing Large Language Model for Text-to-SQL Parsing by Self-Play Fine-Tuning

Yuhao Zhang,Shaoming Duan,Jinhang Su,Chuanyi Liu,Peiyi Han

Main category: cs.CL

TL;DR: This paper proposes SPFT-SQL, a new self-play fine-tuning method for the Text-to-SQL task that uses verification-based iterative fine-tuning and an error-driven loss to outperform existing state-of-the-art methods.

Details Motivation: SPIN faces challenges in the Text-to-SQL task: it does not generate new information, and the large number of correct SQL queries produced by the opponent model reduces the main model's ability to generate accurate SQL queries. Method: SPFT-SQL introduces a verification-based iterative fine-tuning approach before self-play and applies an error-driven loss during the self-play fine-tuning phase. Result: Extensive experiments and in-depth analyses on six open-source LLMs and five widely used benchmarks show that SPFT-SQL outperforms existing state-of-the-art methods. Conclusion: SPFT-SQL addresses the challenges of the Text-to-SQL task through verification-based iterative fine-tuning and an error-driven loss, outperforming existing state-of-the-art methods. Abstract: Despite the significant advancements of self-play fine-tuning (SPIN), which can transform a weak large language model (LLM) into a strong one through competitive interactions between models of varying capabilities, it still faces challenges in the Text-to-SQL task. SPIN does not generate new information, and the large number of correct SQL queries produced by the opponent model during self-play reduces the main model's ability to generate accurate SQL queries. To address this challenge, we propose a new self-play fine-tuning method tailored for the Text-to-SQL task, called SPFT-SQL. Prior to self-play, we introduce a verification-based iterative fine-tuning approach, which synthesizes high-quality fine-tuning data iteratively based on the database schema and validation feedback to enhance model performance, while building a model base with varying capabilities. During the self-play fine-tuning phase, we propose an error-driven loss method that incentivizes incorrect outputs from the opponent model, enabling the main model to distinguish between correct SQL and erroneous SQL generated by the opponent model, thereby improving its ability to generate correct SQL. Extensive experiments and in-depth analyses on six open-source LLMs and five widely used benchmarks demonstrate that our approach outperforms existing state-of-the-art (SOTA) methods.

[30] VoxRole: A Comprehensive Benchmark for Evaluating Speech-Based Role-Playing Agents

Weihao Wu,Liang Cao,Xinyu Wu,Zhiwei Lin,Rui Niu,Jingbei Li,Zhiyong Wu

Main category: cs.CL

TL;DR: This paper introduces VoxRole, the first benchmark for speech-based Role-Playing Conversational Agents, highlighting the need to evaluate persona consistency through 65.6 hours of movie dialogue data.

Details Motivation: Current RPCA research focuses on text-based modalities, ignoring important paralinguistic speech features, and lacks standardized benchmarks for assessing speech-based role-playing capabilities like long-term persona consistency. Method: Construction of the VoxRole benchmark using a two-stage automated pipeline involving audio-script alignment and LLM-based character profile creation, followed by multi-dimensional evaluation of spoken dialogue models. Result: VoxRole contains 13,335 multi-turn dialogues (65.6 hours of speech) from 1,228 characters across 261 movies, enabling the first systematic evaluation of speech-based RPCAs. Conclusion: VoxRole provides a comprehensive benchmark for evaluating speech-based RPCAs, revealing key insights into the strengths and weaknesses of current models in maintaining persona consistency. Abstract: Recent significant advancements in Large Language Models (LLMs) have greatly propelled the development of Role-Playing Conversational Agents (RPCAs). These systems aim to create immersive user experiences through consistent persona adoption. However, current RPCA research faces dual limitations. First, existing work predominantly focuses on the textual modality, entirely overlooking critical paralinguistic features including intonation, prosody, and rhythm in speech, which are essential for conveying character emotions and shaping vivid identities. Second, the speech-based role-playing domain suffers from a long-standing lack of standardized evaluation benchmarks. Most current spoken dialogue datasets target only fundamental capability assessments, featuring thinly sketched or ill-defined character profiles. Consequently, they fail to effectively quantify model performance on core competencies like long-term persona consistency. To address this critical gap, we introduce VoxRole, the first comprehensive benchmark specifically designed for the evaluation of speech-based RPCAs. 
The benchmark comprises 13335 multi-turn dialogues, totaling 65.6 hours of speech from 1228 unique characters across 261 movies. To construct this resource, we propose a novel two-stage automated pipeline that first aligns movie audio with scripts and subsequently employs an LLM to systematically build multi-dimensional profiles for each character. Leveraging VoxRole, we conduct a multi-dimensional evaluation of contemporary spoken dialogue models, revealing crucial insights into their respective strengths and limitations in maintaining persona consistency.

[31] CANDY: Benchmarking LLMs' Limitations and Assistive Potential in Chinese Misinformation Fact-Checking

Ruiling Guo,Xinwei Yang,Chen Huang,Tong Zhang,Yong Hu

Main category: cs.CL

TL;DR: The CANDY benchmark reveals the limitations of large language models in fact-checking, while showing their potential to enhance human performance as assistive tools.

Details Motivation: Despite the widespread use of large language models, their effectiveness in fact-checking misinformation remains uncertain. Method: The authors developed CANDY, a benchmark with about 20,000 annotated instances, to systematically evaluate the capabilities and limitations of LLMs in fact-checking. Result: Current LLMs show limitations in generating accurate fact-checking conclusions, even with chain-of-thought reasoning and few-shot prompting; the most common failure mode is factual fabrication. Conclusion: Although LLMs have limitations in fact-checking, they show considerable potential when used to assist humans. Abstract: The effectiveness of large language models (LLMs) to fact-check misinformation remains uncertain, despite their growing use. To this end, we present CANDY, a benchmark designed to systematically evaluate the capabilities and limitations of LLMs in fact-checking Chinese misinformation. Specifically, we curate a carefully annotated dataset of ~20k instances. Our analysis shows that current LLMs exhibit limitations in generating accurate fact-checking conclusions, even when enhanced with chain-of-thought reasoning and few-shot prompting. To understand these limitations, we develop a taxonomy to categorize flawed LLM-generated explanations for their conclusions and identify factual fabrication as the most common failure mode. Although LLMs alone are unreliable for fact-checking, our findings indicate their considerable potential to augment human performance when deployed as assistive tools in scenarios. Our dataset and code can be accessed at https://github.com/SCUNLP/CANDY

[32] Exploring NLP Benchmarks in an Extremely Low-Resource Setting

Ulin Nuha,Adam Jatowt

Main category: cs.CL

TL;DR: This paper creates synthetic datasets to improve natural language processing for the low-resource language Ladin, targeting sentiment analysis and multiple-choice question answering.

Details Motivation: Because labeled data is scarce, large language models are less effective for low-resource languages such as indigenous languages. The lack of high-quality NLP datasets makes it difficult to develop robust language technologies. Method: Using a small set of parallel Ladin-Italian sentence pairs, the authors create synthetic datasets for sentiment analysis and multiple-choice question answering by translating monolingual Italian data, and apply rigorous filtering and back-translation to ensure linguistic quality and reliability. Result: Incorporating these synthetic datasets into machine translation training substantially improves on existing Italian-Ladin translation baselines. Conclusion: The paper contributes the first publicly available sentiment analysis and MCQA datasets for Ladin, providing foundational resources for broader NLP research and downstream applications for this underrepresented language. Abstract: The effectiveness of Large Language Models (LLMs) diminishes for extremely low-resource languages, such as indigenous languages, primarily due to the lack of labeled data. Despite growing interest, the availability of high-quality natural language processing (NLP) datasets for these languages remains limited, making it difficult to develop robust language technologies. This paper addresses this gap by focusing on Ladin, an endangered Romance language, specifically targeting the Val Badia variant. Leveraging a small set of parallel Ladin-Italian sentence pairs, we create synthetic datasets for sentiment analysis and multiple-choice question answering (MCQA) by translating monolingual Italian data. To ensure linguistic quality and reliability, we apply rigorous filtering and back-translation procedures in our method. We further demonstrate that incorporating these synthetic datasets into machine translation training leads to substantial improvements over existing Italian-Ladin translation baselines. Our contributions include the first publicly available sentiment analysis and MCQA datasets for Ladin, establishing foundational resources that can support broader NLP research and downstream applications for this underrepresented language.

[33] Expanding Foundational Language Capabilities in Open-Source LLMs through a Korean Case Study

Junghwan Lim,Gangwon Jo,Sungmin Lee,Jiyoung Park,Dongseok Kim,Jihwan Kim,Junhyeok Lee,Wai Ting Cheung,Dahye Choi,Kibong Choi,Jaeyeon Huh,Beomgyu Kim,Jangwoong Kim,Taehyun Kim,Haesol Lee,Jeesoo Lee,Dongpin Oh,Changseok Song,Daewon Suh

Main category: cs.CL

TL;DR: Llama-3-Motif is a 102-billion-parameter language model focused on Korean and English that uses advanced training techniques to achieve strong performance, particularly on Korean tasks.

Details Motivation: To develop a strong multilingual language model that improves Korean capabilities while retaining English performance. Method: Built on the Llama 3 architecture, the model is scaled with techniques such as LlamaPro and Masked Structure Growth and trained efficiently on hyperscale GPU clusters using the MoAI platform. Result: Llama-3-Motif performs well on Korean-specific benchmarks, outperforming existing models and reaching results comparable to GPT-4. Conclusion: Llama-3-Motif is a large language model focused on improving Korean capabilities while retaining English performance, with results comparable to GPT-4. Abstract: We introduce Llama-3-Motif, a language model consisting of 102 billion parameters, specifically designed to enhance Korean capabilities while retaining strong performance in English. Developed on the Llama 3 architecture, Llama-3-Motif employs advanced training techniques, including LlamaPro and Masked Structure Growth, to effectively scale the model without altering its core Transformer architecture. Using the MoAI platform for efficient training across hyperscale GPU clusters, we optimized Llama-3-Motif using a carefully curated dataset that maintains a balanced ratio of Korean and English data. Llama-3-Motif shows decent performance on Korean-specific benchmarks, outperforming existing models and achieving results comparable to GPT-4.

[34] RTQA: Recursive Thinking for Complex Temporal Knowledge Graph Question Answering with Large Language Models

Zhaoyan Gong,Juan Li,Zhiqiang Liu,Lei Liang,Huajun Chen,Wen Zhang

Main category: cs.CL

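TL;DR: RTQA is a new training-free framework that improves temporal knowledge graph question answering through recursive question decomposition, bottom-up solving, and multi-path answer aggregation.

Details Motivation: Current temporal knowledge graph question answering methods mainly focus on implicit temporal constraints, struggle with more complex temporal queries, and suffer from limited reasoning ability and error propagation in decomposition frameworks. Method: RTQA follows recursive thinking: it recursively decomposes a question into sub-problems, solves them bottom-up using LLMs and temporal knowledge graph knowledge, and improves fault tolerance through multi-path answer aggregation. The framework has three core components: the Temporal Question Decomposer, the Recursive Solver, and the Answer Aggregator. Result: On the MultiTQ and TimelineKGQA benchmarks, RTQA significantly improves Hits@1 in the "Multiple" and "Complex" categories, outperforming state-of-the-art methods. Conclusion: RTQA offers an effective new approach to the challenges of temporal knowledge graph question answering, with stronger reasoning ability and fault tolerance. Abstract: Current temporal knowledge graph question answering (TKGQA) methods primarily focus on implicit temporal constraints, lacking the capability of handling more complex temporal queries, and struggle with limited reasoning abilities and error propagation in decomposition frameworks. We propose RTQA, a novel framework to address these challenges by enhancing reasoning over TKGs without requiring training. Following recursive thinking, RTQA recursively decomposes questions into sub-problems, solves them bottom-up using LLMs and TKG knowledge, and employs multi-path answer aggregation to improve fault tolerance. RTQA consists of three core components: the Temporal Question Decomposer, the Recursive Solver, and the Answer Aggregator. Experiments on MultiTQ and TimelineKGQA benchmarks demonstrate significant Hits@1 improvements in "Multiple" and "Complex" categories, outperforming state-of-the-art methods. Our code and data are available at https://github.com/zjukg/RTQA.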
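
The recursive, bottom-up control flow described for RTQA can be pictured with a small toy. This is a sketch under assumptions, not the released RTQA code: `QuestionNode`, `solve`, and `aggregate` are hypothetical names, the leaf answerer and the combiner stand in for LLM and TKG calls, and majority voting is only one plausible way to aggregate answers from several paths.

```python
from collections import Counter
from dataclasses import dataclass, field

@dataclass
class QuestionNode:
    """A question plus the sub-questions it was decomposed into (hypothetical)."""
    question: str
    children: list = field(default_factory=list)

def solve(node, answer_leaf, combine):
    """Solve bottom-up: answer the children first, then the parent from their answers.

    answer_leaf(question) stands in for a direct LLM or TKG lookup.
    combine(question, child_answers) stands in for answering with sub-answers as context.
    """
    if not node.children:
        return answer_leaf(node.question)
    child_answers = [solve(child, answer_leaf, combine) for child in node.children]
    return combine(node.question, child_answers)

def aggregate(candidates):
    """Majority vote over candidate answers from different paths; None means no answer."""
    votes = Counter(c for c in candidates if c is not None)
    return votes.most_common(1)[0][0] if votes else None
```

In use, one would collect candidates for the same question from several routes (for example a direct answer, a graph lookup, and the answer built from sub-questions) and pass them to `aggregate`, so that a single failing route does not decide the final answer.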

[35] On Robustness and Reliability of Benchmark-Based Evaluation of LLMs

Riccardo Lunardi,Vincenzo Della Mea,Stefano Mizzaro,Kevin Roitero

Main category: cs.CL

TL;DR: The study finds that although LLMs perform well on standard benchmarks, their effectiveness drops significantly when questions are paraphrased, which raises concerns about their robustness in real-world use.

Details Motivation: To assess how effective LLMs remain under the linguistic variability of real-world applications, and whether existing benchmarks reliably measure model capabilities. Method: The study systematically generates various paraphrases of all questions in six common benchmarks and measures how the effectiveness of 34 state-of-the-art LLMs changes across them. Result: Absolute effectiveness declines significantly on paraphrased questions, indicating limited robustness to linguistic variation; model rankings remain relatively stable, but the reliability of the benchmarks is called into question. Conclusion: Although LLM rankings remain relatively stable across paraphrased inputs, absolute effectiveness declines significantly, indicating a limited ability to handle linguistic variation. This raises concerns about the reliability of current benchmark-based evaluation and underlines the need for more robust evaluation benchmarks. Abstract: The effectiveness of Large Language Models (LLMs) is usually evaluated by means of benchmarks such as MMLU, ARC-C, or HellaSwag, where questions are presented in their original wording, thus in a fixed, standardized format. However, real-world applications involve linguistic variability, requiring models to maintain their effectiveness across diverse rewordings of the same question or query. In this study, we systematically assess the robustness of LLMs to paraphrased benchmark questions and investigate whether benchmark-based evaluations provide a reliable measure of model capabilities. We systematically generate various paraphrases of all the questions across six different common benchmarks, and measure the resulting variations in effectiveness of 34 state-of-the-art LLMs, of different size and effectiveness. Our findings reveal that while LLM rankings remain relatively stable across paraphrased inputs, absolute effectiveness scores change and decline significantly. This suggests that LLMs struggle with linguistic variability, raising concerns about their generalization abilities and evaluation methodologies. Furthermore, the observed performance drop challenges the reliability of benchmark-based evaluations, indicating that high benchmark scores may not fully capture a model's robustness to real-world input variations. We discuss the implications of these findings for LLM evaluation methodologies, emphasizing the need for robustness-aware benchmarks that better reflect practical deployment scenarios.
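
The two observations in this entry, stable rankings but lower absolute scores, correspond to two different measurements. The sketch below shows one way to compute them, assuming per-model accuracies on the original and paraphrased questions are already available; Kendall's tau is an assumed choice of rank statistic (the abstract does not name one), and the scores in the usage lines are made up for illustration.

```python
from itertools import combinations

def kendall_tau(x, y):
    """Kendall tau-a: (concordant - discordant) / number of pairs.

    Tied pairs count as neither concordant nor discordant.
    """
    concordant = discordant = 0
    for i, j in combinations(range(len(x)), 2):
        sign = (x[i] - x[j]) * (y[i] - y[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(x) * (len(x) - 1) / 2
    return (concordant - discordant) / n_pairs

def mean_drop(original, paraphrased):
    """Average absolute score lost when moving from original to paraphrased questions."""
    return sum(o - p for o, p in zip(original, paraphrased)) / len(original)

# Illustrative (invented) accuracies for three models.
original_scores = [0.80, 0.70, 0.60]
paraphrased_scores = [0.70, 0.62, 0.50]
rank_stability = kendall_tau(original_scores, paraphrased_scores)
average_loss = mean_drop(original_scores, paraphrased_scores)
```

A tau near 1 together with a clearly positive mean drop is the pattern the abstract reports: the ordering of models survives paraphrasing while every model's score falls.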

[36] What if I ask in \textit{alia lingua}? Measuring Functional Similarity Across Languages

Debangan Mishra,Arihant Rastogi,Agyeya Negi,Shashwat Goel,Ponnurangam Kumaraguru

Main category: cs.CL

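TL;DR: This paper applies the model similarity metric $\kappa_p$ to 20 languages and 47 subjects in GlobalMMLU and finds that a model's responses become more consistent across languages as its size and capability grow.

Details Motivation: To measure how similar model outputs are across languages. Method: The authors apply a recently proposed model similarity metric $\kappa_p$ to 20 languages and 47 subjects in GlobalMMLU. Result: A model's responses become increasingly consistent across languages as its size and capability grow, and models show greater cross-lingual consistency within themselves than agreement with other models prompted in the same language. Conclusion: $\kappa_p$ is a practical tool for evaluating multilingual reliability and may guide the development of more consistent multilingual systems. Abstract: How similar are model outputs across languages? In this work, we study this question using a recently proposed model similarity metric $\kappa_p$ applied to 20 languages and 47 subjects in GlobalMMLU. Our analysis reveals that a model's responses become increasingly consistent across languages as its size and capability grow. Interestingly, models exhibit greater cross-lingual consistency within themselves than agreement with other models prompted in the same language. These results highlight not only the value of $\kappa_p$ as a practical tool for evaluating multilingual reliability, but also its potential to guide the development of more consistent multilingual systems.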

[37] A RoBERTa-Based Functional Syntax Annotation Model for Chinese Texts

Han Xiaohui,Zhang Yunlong,Guo Yuxi

Main category: cs.CL

TL;DR: This study proposes the first RoBERTa-based model for automatic functional syntax annotation of Chinese, advancing the application of Systemic Functional Grammar in Chinese NLP.

Details Motivation: To fill the gap left by the absence of an automatic annotation system for Chinese based on Systemic Functional Grammar, and to promote the application and development of the theory. Method: A Chinese functional syntax annotation model is built on RoBERTa and trained and optimized on 4,100 sentences randomly selected from the People's Daily corpus. Result: The model achieves an F1 score of 0.852 on the test set, significantly outperforming the comparison models, and performs especially well on core syntactic elements such as Subject (S), Main Verb (M), and Complement (C); recognition of entities with imbalanced label samples still has room for improvement. Conclusion: The study successfully combines functional syntax with attention-based NLP models, providing a new method for automated Chinese functional syntax analysis and a solid foundation for subsequent research. Abstract: Systemic Functional Grammar and its branch, Cardiff Grammar, have been widely applied to discourse analysis, semantic function research, and other tasks across various languages and texts. However, an automatic annotation system based on this theory for Chinese texts has not yet been developed, which significantly constrains the application and promotion of relevant theories. To fill this gap, this research introduces a functional syntax annotation model for Chinese based on RoBERTa (Robustly Optimized BERT Pretraining Approach). The study randomly selected 4,100 sentences from the People's Daily 2014 corpus and annotated them according to functional syntax theory to establish a dataset for training. The study then fine-tuned the RoBERTa-Chinese wwm-ext model based on the dataset to implement the named entity recognition task, achieving an F1 score of 0.852 on the test set that significantly outperforms other comparative models. The model demonstrated excellent performance in identifying core syntactic elements such as Subject (S), Main Verb (M), and Complement (C). Nevertheless, there remains room for improvement in recognizing entities with imbalanced label samples. As the first integration of functional syntax with attention-based NLP models, this research provides a new method for automated Chinese functional syntax analysis and lays a solid foundation for subsequent studies.

[38] Synthesizing Sheet Music Problems for Evaluation and Reinforcement Learning

Zhilin Wang,Zhe Yang,Yun Luo,Yafu Li,Haoran Zhang,Runzhe Zhan,Derek F. Wong,Jizhe Zhou,Yu Cheng

Main category: cs.CL

TL;DR: This paper proposes a music-theory-based method for synthesizing sheet music problems, used to generate evaluation benchmarks and training data that improve AI models' understanding of sheet music and advance AI-assisted music creation.

Details Motivation: Current research lacks evaluation benchmarks and training data for sheet music interpretation, which limits the music understanding abilities of large language models and multimodal large language models. Method: A data synthesis framework generates verifiable sheet music problems in textual and visual modalities, and the synthetic data is used for reinforcement learning. Result: Training on the synthetic data yields improvements on SSMR-Bench for Qwen3-8B-Base and Qwen2.5-VL-Instruct, and also strengthens the models' music composition abilities. Conclusion: The paper proposes a new method for synthesizing sheet music problems from music theory rules and shows its value for improving sheet music understanding and enabling AI-assisted music creation. Abstract: Enhancing the ability of Large Language Models (LLMs) and Multimodal Large Language Models (MLLMs) to interpret sheet music is a crucial step toward building AI musicians. However, current research lacks both evaluation benchmarks and training data for sheet music reasoning. To address this, we propose the idea of synthesizing sheet music problems grounded in music theory, which can serve both as evaluation benchmarks and as training data for reinforcement learning with verifiable rewards (RLVR). We introduce a data synthesis framework that generates verifiable sheet music questions in both textual and visual modalities, leading to the Synthetic Sheet Music Reasoning Benchmark (SSMR-Bench) and a complementary training set. Evaluation results on SSMR-Bench show the importance of models' reasoning abilities in interpreting sheet music. At the same time, the poor performance of Gemini 2.5-Pro highlights the challenges that MLLMs still face in interpreting sheet music in a visual format. By leveraging synthetic data for RLVR, Qwen3-8B-Base and Qwen2.5-VL-Instruct achieve improvements on the SSMR-Bench. Besides, the trained Qwen3-8B-Base surpasses GPT-4 in overall performance on MusicTheoryBench and achieves reasoning performance comparable to GPT-4 with the strategies of Role play and Chain-of-Thought. Notably, its performance on math problems also improves relative to the original Qwen3-8B-Base. Furthermore, our results show that the enhanced reasoning ability can also facilitate music composition. 
In conclusion, we are the first to propose the idea of synthesizing sheet music problems based on music theory rules, and demonstrate its effectiveness not only in advancing model reasoning for sheet music understanding but also in unlocking new possibilities for AI-assisted music creation.

[39] Arabic Chatbot Technologies in Education: An Overview

Hicham Bourhil,Yacine El Younoussi

Main category: cs.CL

TL;DR: This paper surveys Arabic chatbots in education, highlighting their characteristics and identifying research gaps.

Details Motivation: The motivation is to examine the use of chatbots in education, especially in Arabic, and identify research gaps. Method: The method involves a survey of existing Arabic chatbots in education and their characteristics. Result: The result shows that few educational Arabic chatbots use modern techniques, despite the success of chatbots in other languages. Conclusion: The conclusion discusses future research directions for Arabic chatbots in education. Abstract: The recent advancements in Artificial Intelligence (AI) in general, and in Natural Language Processing (NLP) in particular, and some of its applications such as chatbots, have led to their implementation in different domains like education, healthcare, tourism, and customer service. Since the COVID-19 pandemic, there has been an increasing interest in these digital technologies to allow and enhance remote access. In education, e-learning systems have been massively adopted worldwide. The emergence of Large Language Models (LLM) such as BERT (Bidirectional Encoder Representations from Transformers) and GPT (Generative Pre-trained Transformers) made chatbots even more popular. In this study, we present a survey on existing Arabic chatbots in education and their different characteristics such as the adopted approaches, language variety, and metrics used to measure their performance. We were able to identify some research gaps when we discovered that, despite the success of chatbots in other languages such as English, only a few educational Arabic chatbots used modern techniques. Finally, we discuss future directions of research in this field.

[40] Improving Narrative Classification and Explanation via Fine Tuned Language Models

Rishit Tyagi,Rahul Bouri,Mohit Gupta

Main category: cs.CL

TL;DR: This study improves narrative classification and explanation in news articles by fine-tuning a BERT model and introducing a structured knowledge base.

Details Motivation: Traditional NLP methods struggle to detect subtle phrasing and hidden agendas, so covert narratives and implicit messaging need to be studied in order to analyze bias and sentiment. Method: The authors fine-tune a BERT model and refine its predictions with a GPT-4o pipeline, and use a ReACT framework with semantic retrieval-based few-shot prompting for narrative explanation. Result: Integrating auxiliary knowledge into prompts improves classification accuracy and justification reliability, with applications in media analysis, education, and intelligence gathering. Conclusion: The study arrives at an effective approach to narrative classification and explanation in news articles: combining BERT and GPT-4o models with a structured taxonomy table as an auxiliary knowledge base improves classification accuracy and justification reliability. Abstract: Understanding covert narratives and implicit messaging is essential for analyzing bias and sentiment. Traditional NLP methods struggle with detecting subtle phrasing and hidden agendas. This study tackles two key challenges: (1) multi-label classification of narratives and sub-narratives in news articles, and (2) generating concise, evidence-based explanations for dominant narratives. We fine-tune a BERT model with a recall-oriented approach for comprehensive narrative detection, refining predictions using a GPT-4o pipeline for consistency. For narrative explanation, we propose a ReACT (Reasoning + Acting) framework with semantic retrieval-based few-shot prompting, ensuring grounded and relevant justifications. To enhance factual accuracy and reduce hallucinations, we incorporate a structured taxonomy table as an auxiliary knowledge base. Our results show that integrating auxiliary knowledge in prompts improves classification accuracy and justification reliability, with applications in media analysis, education, and intelligence gathering.

[41] Towards Stable and Personalised Profiles for Lexical Alignment in Spoken Human-Agent Dialogue

Keara Schaaij,Roel Boumans,Tibor Bosse,Iris Hendrickx

Main category: cs.CL

TL;DR: This study explores how personalisation strategies can be used to build efficient lexical profiles for lexical alignment in human-agent dialogue, finding that small, compact profiles offer the best balance of performance and data efficiency.

Details Motivation: Lexical alignment plays an important role in successful communication, but its implementation in conversational agents remains underexplored, particularly in light of recent advances in large language models (LLMs). Method: Drawing on strategies for personalising conversational agents, the study investigates the construction of personalised lexical profiles and evaluates profile performance over time using recall, coverage, and cosine similarity. Result: Small, compact profiles built from 10 minutes of transcribed speech, with 5 items each for adjectives and conjunctions and 10 items each for adverbs, nouns, pronouns, and verbs, offered the best balance between performance and data efficiency. Conclusion: The study offers practical insights for lexical alignment strategies in conversational agents; building stable, personalised lexical profiles with minimal data requirements is a foundational step toward lexical alignment. Abstract: Lexical alignment, where speakers start to use similar words across conversation, is known to contribute to successful communication. However, its implementation in conversational agents remains underexplored, particularly considering the recent advancements in large language models (LLMs). As a first step towards enabling lexical alignment in human-agent dialogue, this study draws on strategies for personalising conversational agents and investigates the construction of stable, personalised lexical profiles as a basis for lexical alignment. Specifically, we varied the amounts of transcribed spoken data used for construction as well as the number of items included in the profiles per part-of-speech (POS) category and evaluated profile performance across time using recall, coverage, and cosine similarity metrics. It was shown that smaller and more compact profiles, created after 10 min of transcribed speech containing 5 items for adjectives, 5 items for conjunctions, and 10 items for adverbs, nouns, pronouns, and verbs each, offered the best balance in both performance and data efficiency. In conclusion, this study offers practical insights into constructing stable, personalised lexical profiles, taking into account minimal data requirements, serving as a foundational step toward lexical alignment strategies in conversational agents.
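
To make the profile construction in this entry concrete, the sketch below builds a per-POS profile of a speaker's most frequent words and scores it against later speech. It is a minimal sketch under assumptions: the input is already POS-tagged, the tag names in `PROFILE_SIZES` are hypothetical labels (only the per-category sizes come from the abstract), and the `recall` and `coverage` definitions are plausible readings rather than the paper's exact formulas.

```python
from collections import Counter, defaultdict

# Items per POS category as reported in the abstract; tag names are assumed.
PROFILE_SIZES = {"ADJ": 5, "CONJ": 5, "ADV": 10, "NOUN": 10, "PRON": 10, "VERB": 10}

def build_profile(tagged_tokens, sizes=PROFILE_SIZES):
    """Keep the k most frequent lowercased words for each POS category."""
    counts = defaultdict(Counter)
    for word, pos in tagged_tokens:
        if pos in sizes:
            counts[pos][word.lower()] += 1
    return {pos: [w for w, _ in counts[pos].most_common(k)] for pos, k in sizes.items()}

def recall(profile, later_tokens):
    """Share of profile items that the speaker uses again in later speech."""
    later = {(w.lower(), pos) for w, pos in later_tokens}
    items = [(w, pos) for pos, words in profile.items() for w in words]
    if not items:
        return 0.0
    return sum(item in later for item in items) / len(items)

def coverage(profile, later_tokens):
    """Share of later tokens (in profiled POS categories) that the profile contains."""
    relevant = [(w.lower(), pos) for w, pos in later_tokens if pos in profile]
    if not relevant:
        return 0.0
    return sum(w in profile[pos] for w, pos in relevant) / len(relevant)
```

Evaluating the same profile against successive later windows of speech gives a simple picture of stability over time, which is the property the study is after.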

[42] MultiWikiQA: A Reading Comprehension Benchmark in 300+ Languages

Dan Saattrup Smart

Main category: cs.CL

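TL;DR: MultiWikiQA is a new reading comprehension dataset covering 306 languages, with contexts drawn from Wikipedia articles, questions generated by an LLM, and answers that appear verbatim in the articles.

Details Motivation: To advance multilingual reading comprehension and provide a high-quality dataset with broad language coverage. Method: Contexts are built from Wikipedia articles, questions are generated with an LLM, and a crowdsourced human evaluation is carried out to check question quality. Result: Six different language models were evaluated; the results show that the benchmark is sufficiently difficult and that performance differs markedly across languages. Conclusion: MultiWikiQA is a challenging multilingual reading comprehension dataset with good-quality questions that is freely available. Abstract: We introduce a new reading comprehension dataset, dubbed MultiWikiQA, which covers 306 languages. The context data comes from Wikipedia articles, with questions generated by an LLM and the answers appearing verbatim in the Wikipedia articles. We conduct a crowdsourced human evaluation of the fluency of the generated questions across 30 of the languages, providing evidence that the questions are of good quality. We evaluate 6 different language models, both decoder and encoder models of varying sizes, showing that the benchmark is sufficiently difficult and that there is a large performance discrepancy amongst the languages. The dataset and survey evaluations are freely available.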
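
Because the answers in this dataset appear verbatim in the context, a generated question-answer pair can be checked mechanically. The sketch below shows such a check as it might look in a data pipeline; it is an illustration under assumptions, not the dataset's actual build code, and the field names `context`, `answer`, and `answer_start` are hypothetical.

```python
def locate_answer(context, answer):
    """Return (start, end) character offsets of the answer in the context, or None."""
    if not answer:
        return None
    start = context.find(answer)
    if start == -1:
        return None
    return start, start + len(answer)

def filter_verbatim(samples):
    """Keep only samples whose answer is an exact span of the context.

    Each kept sample gains an 'answer_start' offset, as extractive QA formats often use.
    """
    kept = []
    for sample in samples:
        span = locate_answer(sample["context"], sample["answer"])
        if span is not None:
            kept.append({**sample, "answer_start": span[0]})
    return kept
```

A check like this only enforces the verbatim-span property; question fluency still needs the kind of human evaluation the abstract describes.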

[43] Joint Modeling of Entities and Discourse Relations for Coherence Assessment

Wei Liu,Michael Strube

Main category: cs.CL

TL;DR: This study demonstrates that combining entity and discourse relation features improves coherence assessment in linguistic analysis.

Details Motivation: Most existing work on coherence modeling focuses on either entity features or discourse relation features, with little attention given to combining the two. Method: Explored two methods for jointly modeling entities and discourse relations and tested their effectiveness on three benchmark datasets. Result: Experiments showed that integrating both entity and discourse relation features significantly enhances the performance of coherence models. Conclusion: Modeling both entities and discourse relations simultaneously improves coherence evaluation. Abstract: In linguistics, coherence can be achieved by different means, such as by maintaining reference to the same set of entities across sentences and by establishing discourse relations between them. However, most existing work on coherence modeling focuses exclusively on either entity features or discourse relation features, with little attention given to combining the two. In this study, we explore two methods for jointly modeling entities and discourse relations for coherence assessment. Experiments on three benchmark datasets show that integrating both types of features significantly enhances the performance of coherence models, highlighting the benefits of modeling both simultaneously for coherence evaluation.

[44] MAGneT: Coordinated Multi-Agent Generation of Synthetic Multi-Turn Mental Health Counseling Sessions

Aishik Mandal,Tanmoy Chakraborty,Iryna Gurevych

Main category: cs.CL

TL;DR: MAGneT is a novel framework for generating synthetic psychological counseling sessions using a multi-agent system, outperforming previous methods and enhancing the fine-tuning of open-source LLMs.

Details Motivation: The scarcity of high-quality, privacy-compliant data for fine-tuning open-source LLMs in psychological counseling prompted the development of a novel framework for generating synthetic counseling sessions. Method: MAGneT employs a multi-agent framework where specialized LLM agents handle sub-tasks related to key psychological techniques. It also introduces a unified evaluation framework with expanded expert assessments. Result: MAGneT outperformed existing methods, showing improvements of 3.2% in general counseling skills and 4.3% in CBT-specific skills on average on CTRS. Experts preferred its generated sessions 77.2% of the time. Fine-tuning using MAGneT sessions improved performance by 6.3% for general skills and 7.3% for CBT-specific skills. Conclusion: MAGneT offers a more effective and thorough approach to generating synthetic psychological counseling sessions, leading to better performance in both general and CBT-specific counseling skills when used for fine-tuning open-source models. Abstract: The growing demand for scalable psychological counseling highlights the need for fine-tuning open-source Large Language Models (LLMs) with high-quality, privacy-compliant data, yet such data remains scarce. Here we introduce MAGneT, a novel multi-agent framework for synthetic psychological counseling session generation that decomposes counselor response generation into coordinated sub-tasks handled by specialized LLM agents, each modeling a key psychological technique. Unlike prior single-agent approaches, MAGneT better captures the structure and nuance of real counseling. In addition, we address inconsistencies in prior evaluation protocols by proposing a unified evaluation framework integrating diverse automatic and expert metrics. Furthermore, we expand the expert evaluations from four aspects of counseling in previous works to nine aspects, enabling a more thorough and robust assessment of data quality. 
Empirical results show that MAGneT significantly outperforms existing methods in quality, diversity, and therapeutic alignment of the generated counseling sessions, improving general counseling skills by 3.2% and CBT-specific skills by 4.3% on average on cognitive therapy rating scale (CTRS). Crucially, experts prefer MAGneT-generated sessions in 77.2% of cases on average across all aspects. Moreover, fine-tuning an open-source model on MAGneT-generated sessions shows better performance, with improvements of 6.3% on general counseling skills and 7.3% on CBT-specific skills on average on CTRS over those fine-tuned with sessions generated by baseline methods. We also make our code and data public.

[45] Explicit and Implicit Data Augmentation for Social Event Detection

Congbo Ma,Yuxia Wang,Jia Wu,Jian Yang,Jing Du,Zitai Qiu,Qing Li,Hu Wang,Preslav Nakov

Main category: cs.CL

TL;DR: This paper proposes SED-Aug, a plug-and-play dual augmentation framework for social event detection that combines explicit text-based augmentation with implicit feature-space augmentation to increase data diversity and model robustness, and validates its performance on multiple datasets.

Details Motivation: Social event detection relies on labeled data, but annotation is costly and time-consuming. A method is therefore needed that reduces dependence on labeled data while improving model robustness and performance. Method: The SED-Aug framework combines explicit text augmentation (using large language models with five generation strategies) and implicit feature-space augmentation (five perturbation techniques designed to preserve semantic and relational properties). Result: SED-Aug outperforms the best baseline model by approximately 17.67% on the Twitter2012 dataset and by about 15.57% on the Twitter2018 dataset in average F1 score. Conclusion: SED-Aug's dual augmentation strategy effectively improves social event detection and suggests new directions for event detection in few-shot or unsupervised settings. Abstract: Social event detection involves identifying and categorizing important events from social media, which relies on labeled data, but annotation is costly and labor-intensive. To address this problem, we propose Augmentation framework for Social Event Detection (SED-Aug), a plug-and-play dual augmentation framework, which combines explicit text-based and implicit feature-space augmentation to enhance data diversity and model robustness. The explicit augmentation utilizes large language models to enhance textual information through five diverse generation strategies. For implicit augmentation, we design five novel perturbation techniques that operate in the feature space on structural fused embeddings. These perturbations are crafted to keep the semantic and relational properties of the embeddings and make them more diverse. Specifically, SED-Aug outperforms the best baseline model by approximately 17.67% on the Twitter2012 dataset and by about 15.57% on the Twitter2018 dataset in terms of the average F1 score. The code is available at GitHub: https://github.com/congboma/SED-Aug.

[46] Inverse IFEval: Can LLMs Unlearn Stubborn Training Conventions to Follow Real Instructions?

Qinyan Zhang,Xinping Lei,Ruijie Miao,Yu Fu,Haojie Fan,Le Chang,Jiafan Hou,Dingling Zhang,Zhongfei Hou,Ziqiang Yang,Changxin Pu,Fei Hu,Jingkai Liu,Mengyun Liu,Yang Liu,Xiang Gao,Jiaheng Liu,Tong Yang,Zaiyuan Wang,Ge Zhang,Wenhao Huang

Main category: cs.CL

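TL;DR: This paper proposes the Inverse IFEval benchmark to evaluate how well large language models adapt to instructions that conflict with their training patterns, and argues that future alignment work should improve instruction-following reliability in diverse and unpredictable real-world scenarios.

Details Motivation: Large language models (LLMs) perform well on many tasks but often exhibit cognitive inertia, struggling to follow instructions that conflict with the standardized patterns learned during supervised fine-tuning (SFT). To evaluate this limitation, the authors propose the Inverse IFEval benchmark, which measures a model's counter-intuitive ability, that is, its capacity to override training-induced biases and comply with adversarial instructions. Method: Using a human-in-the-loop pipeline, the authors construct a dataset of 1012 high-quality Chinese and English questions across 23 domains and evaluate it under an optimized LLM-as-a-Judge framework. Result: The experiments demonstrate the necessity of the proposed Inverse IFEval benchmark and emphasize that future alignment work should account for adaptability under unconventional contexts. Conclusion: Future alignment efforts should pursue not only fluency and factual correctness but also adaptability under unconventional contexts. As both a diagnostic tool and a foundation for developing methods, Inverse IFEval can help mitigate cognitive inertia, reduce overfitting to narrow patterns, and ultimately improve the instruction-following reliability of LLMs in diverse and unpredictable real-world scenarios. Abstract: Large Language Models (LLMs) achieve strong performance on diverse tasks but often exhibit cognitive inertia, struggling to follow instructions that conflict with the standardized patterns learned during supervised fine-tuning (SFT). To evaluate this limitation, we propose Inverse IFEval, a benchmark that measures models' Counter-intuitive Ability: their capacity to override training-induced biases and comply with adversarial instructions. Inverse IFEval introduces eight types of such challenges, including Question Correction, Intentional Textual Flaws, Code without Comments, and Counterfactual Answering. Using a human-in-the-loop pipeline, we construct a dataset of 1012 high-quality Chinese and English questions across 23 domains, evaluated under an optimized LLM-as-a-Judge framework. Experiments on existing leading LLMs demonstrate the necessity of our proposed Inverse IFEval benchmark. Our findings emphasize that future alignment efforts should not only pursue fluency and factual correctness but also account for adaptability under unconventional contexts. We hope that Inverse IFEval serves as both a diagnostic tool and a foundation for developing methods that mitigate cognitive inertia, reduce overfitting to narrow patterns, and ultimately enhance the instruction-following reliability of LLMs in diverse and unpredictable real-world scenarios.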

[47] Facts Fade Fast: Evaluating Memorization of Outdated Medical Knowledge in Large Language Models

Juraj Vladika,Mahdi Dhaini,Florian Matthes

Main category: cs.CL

TL;DR: This paper identifies that Large Language Models (LLMs) often rely on outdated medical knowledge, posing risks in healthcare applications, and proposes new datasets and strategies to address this issue and improve the reliability of medical AI.

Details Motivation: LLMs have significant potential to enhance healthcare but pose risks when providing outdated medical advice due to reliance on static training data. This study investigates this issue to improve the reliability of medical AI systems. Method: The authors introduced two novel QA datasets, MedRevQA and MedChangeQA, to evaluate eight prominent LLMs. They analyzed the influence of obsolete pre-training data and training strategies on the models' performance. Result: Evaluation of eight LLMs revealed consistent reliance on outdated medical knowledge, especially on the MedChangeQA dataset, indicating a failure in clinical reasoning tasks and potential harm in real-world applications. Conclusion: The study concludes that current LLMs consistently rely on outdated medical knowledge, highlighting the need for updated training data and strategies to develop more reliable medical AI systems. Abstract: The growing capabilities of Large Language Models (LLMs) show significant potential to enhance healthcare by assisting medical researchers and physicians. However, their reliance on static training data is a major risk when medical recommendations evolve with new research and developments. When LLMs memorize outdated medical knowledge, they can provide harmful advice or fail at clinical reasoning tasks. To investigate this problem, we introduce two novel question-answering (QA) datasets derived from systematic reviews: MedRevQA (16,501 QA pairs covering general biomedical knowledge) and MedChangeQA (a subset of 512 QA pairs where medical consensus has changed over time). Our evaluation of eight prominent LLMs on the datasets reveals consistent reliance on outdated knowledge across all models. We additionally analyze the influence of obsolete pre-training data and training strategies to explain this phenomenon and propose future directions for mitigation, laying the groundwork for developing more current and reliable medical AI systems.

[48] PARCO: Phoneme-Augmented Robust Contextual ASR via Contrastive Entity Disambiguation

Jiajun He,Naoki Sawada,Koichi Miyazaki,Tomoki Toda

Main category: cs.CL

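TL;DR: PARCO is an improved contextual speech recognition method that introduces phoneme-aware encoding and contrastive learning to improve recognition of domain-specific named entities.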

Details Motivation: ASR systems struggle with domain-specific named entities, especially homophones. Existing methods fall short in capturing fine-grained phoneme variations and in multi-token biasing. Method: PARCO combines phoneme-aware encoding, contrastive entity disambiguation, entity-level supervision, and hierarchical entity filtering to improve recognition accuracy and robustness. Result: PARCO achieves a CER of 4.22% on Chinese AISHELL-1 and a WER of 11.14% on English DATA2 under 1,000 distractors, and shows robust gains on out-of-domain datasets such as THCHS-30 and LibriSpeech. Conclusion: PARCO effectively improves ASR performance on domain-specific named entities, particularly for homophones and in settings with limited entity diversity. Abstract: Automatic speech recognition (ASR) systems struggle with domain-specific named entities, especially homophones. Contextual ASR improves recognition but often fails to capture fine-grained phoneme variations due to limited entity diversity. Moreover, prior methods treat entities as independent tokens, leading to incomplete multi-token biasing. To address these issues, we propose Phoneme-Augmented Robust Contextual ASR via COntrastive entity disambiguation (PARCO), which integrates phoneme-aware encoding, contrastive entity disambiguation, entity-level supervision, and hierarchical entity filtering. These components enhance phonetic discrimination, ensure complete entity retrieval, and reduce false positives under uncertainty. Experiments show that PARCO achieves CER of 4.22% on Chinese AISHELL-1 and WER of 11.14% on English DATA2 under 1,000 distractors, significantly outperforming baselines. PARCO also demonstrates robust gains on out-of-domain datasets like THCHS-30 and LibriSpeech.

[49] Measuring Bias or Measuring the Task: Understanding the Brittle Nature of LLM Gender Biases

Bufan Gao,Elisa Kreiss

Main category: cs.CL

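TL;DR: The study finds that prompt design strongly affects measured gender bias in large language models: even minor changes can substantially alter results, and discrete-choice metrics tend to amplify bias.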

Details Motivation: As large language models are increasingly applied in socially impactful settings, gender bias has drawn wide attention. Current bias evaluation tasks often rely on prompt designs that differ from natural language distributions, which may affect the reliability and validity of the results. The study therefore examines how prompt design affects measured gender bias. Method: The study evaluates gender bias by testing models under different prompt conditions, including conditions that make the testing context salient and conditions that make gender-focused content salient. It uses four task formats and analyzes prompt sensitivity with both token-probability and discrete-choice metrics. Result: Even minor prompt changes can substantially alter measured gender bias, sometimes reversing its direction entirely. Discrete-choice metrics also tend to amplify bias relative to probabilistic measures. These results indicate that LLM gender bias evaluations are highly prompt-sensitive. Conclusion: The study highlights how brittle gender bias evaluations of LLMs are with respect to prompt design. Even minor prompt changes can substantially alter bias outcomes or reverse their direction, and discrete-choice metrics tend to amplify bias relative to probabilistic measures. These findings pose a new challenge for the NLP benchmarking and development community: how to ensure the ecological validity of test designs. Abstract: As LLMs are increasingly applied in socially impactful settings, concerns about gender bias have prompted growing efforts both to measure and mitigate such bias. These efforts often rely on evaluation tasks that differ from natural language distributions, as they typically involve carefully constructed task prompts that overtly or covertly signal the presence of gender bias-related content. In this paper, we examine how signaling the evaluative purpose of a task impacts measured gender bias in LLMs. Concretely, we test models under prompt conditions that (1) make the testing context salient, and (2) make gender-focused content salient. We then assess prompt sensitivity across four task formats with both token-probability and discrete-choice metrics. We find that even minor prompt changes can substantially alter bias outcomes, sometimes reversing their direction entirely. Discrete-choice metrics further tend to amplify bias relative to probabilistic measures. These findings do not only highlight the brittleness of LLM gender bias evaluations but open a new puzzle for the NLP benchmarking and development community: To what extent can well-controlled testing designs trigger LLM "testing mode" performance, and what does this mean for the ecological validity of future benchmarks?
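To make the contrast between the two metric families concrete, the sketch below compares a probability-based gap with a discrete-choice gap on the same items. The function names, the signed-gap definitions, and the probabilities are illustrative assumptions, not the paper's actual metrics or data; the point is only that collapsing each item to its top choice can turn small, consistent probability gaps into a maximal score.

```python
# Hypothetical sketch: each item is (P(male option), P(female option)).
# The metric definitions below are assumptions for illustration only.

def probabilistic_bias(items):
    """Mean signed gap between the two option probabilities."""
    return sum(p_m - p_f for p_m, p_f in items) / len(items)


def discrete_choice_bias(items):
    """Mean signed gap after collapsing each item to its top choice."""
    score = 0
    for p_m, p_f in items:
        if p_m > p_f:
            score += 1
        elif p_f > p_m:
            score -= 1
    return score / len(items)


# Made-up probabilities: every item leans slightly toward the first option.
items = [(0.55, 0.45), (0.52, 0.48), (0.60, 0.40), (0.51, 0.49)]

prob_gap = probabilistic_bias(items)    # small average gap
disc_gap = discrete_choice_bias(items)  # every item picks the same option

print(f"probabilistic gap:   {prob_gap:.2f}")
print(f"discrete-choice gap: {disc_gap:.2f}")
```

In this toy setting the probabilistic gap stays small while the discrete-choice gap saturates, which is one plausible mechanism behind the amplification the abstract reports.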

[50] Can Language Models Handle a Non-Gregorian Calendar?

Mutsumi Sasaki,Go Kamoda,Ryosuke Takahashi,Kosuke Sato,Kentaro Inui,Keisuke Sakaguchi,Benjamin Heinzerling

Main category: cs.CL

TL;DR: This paper evaluates how well language models handle the Japanese calendar, finding that while some can perform calendar conversions, even Japanese-centric models struggle with calendar arithmetic and cross-calendar consistency.

Details Motivation: Temporal reasoning and knowledge are essential for language models, but most studies have focused only on the Gregorian calendar. The authors aim to evaluate how well current LMs handle non-Gregorian calendars, such as the Japanese calendar, which are actively used and reflect culturally grounded conceptions of time. Method: The researchers created datasets for four tasks requiring temporal knowledge and reasoning to evaluate a range of English-centric and Japanese-centric language models. Result: Some models can perform calendar conversions, but even Japanese-centric models struggle with Japanese-calendar arithmetic and maintaining consistency across different calendars. Conclusion: The study concludes that while some language models can perform calendar conversions, even Japanese-centric models struggle with Japanese-calendar arithmetic and maintaining consistency across calendars, highlighting the importance of developing culture-specific calendar understanding in LMs. Abstract: Temporal reasoning and knowledge are essential capabilities for language models (LMs). While much prior work has analyzed and improved temporal reasoning in LMs, most studies have focused solely on the Gregorian calendar. However, many non-Gregorian systems, such as the Japanese, Hijri, and Hebrew calendars, are in active use and reflect culturally grounded conceptions of time. If and how well current LMs can accurately handle such non-Gregorian calendars has not been evaluated so far. Here, we present a systematic evaluation of how well open-source LMs handle one such non-Gregorian system: the Japanese calendar. For our evaluation, we create datasets for four tasks that require both temporal knowledge and temporal reasoning. 
Evaluating a range of English-centric and Japanese-centric LMs, we find that some models can perform calendar conversions, but even Japanese-centric models struggle with Japanese-calendar arithmetic and with maintaining consistency across calendars. Our results highlight the importance of developing LMs that are better equipped for culture-specific calendar understanding.
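As a rough illustration of what calendar conversion and Japanese-calendar arithmetic involve, the sketch below maps Gregorian dates to era names and era years for the four most recent eras. The helper names are hypothetical, the paper's four tasks and datasets are not reproduced here, and earlier eras are deliberately left out.

```python
from datetime import date

# Start dates of the four most recent Japanese eras (newest first).
ERAS = [
    ("Reiwa", date(2019, 5, 1)),
    ("Heisei", date(1989, 1, 8)),
    ("Showa", date(1926, 12, 25)),
    ("Taisho", date(1912, 7, 30)),
]


def to_japanese_era(d):
    """Return (era name, era year) for a Gregorian date."""
    for name, start in ERAS:
        if d >= start:
            # The first (possibly partial) calendar year of an era is year 1.
            return name, d.year - start.year + 1
    raise ValueError("date precedes the eras covered by this sketch")


def to_gregorian_year(era, era_year):
    """Return the Gregorian year for a given era name and era year."""
    for name, start in ERAS:
        if name == era:
            return start.year + era_year - 1
    raise ValueError(f"unknown era: {era}")


# Conversion: the last day of Heisei and the first day of Reiwa.
print(to_japanese_era(date(2019, 4, 30)))
print(to_japanese_era(date(2019, 5, 1)))

# Arithmetic across an era boundary: five years after Heisei 30.
target_year = to_gregorian_year("Heisei", 30) + 5
print(to_japanese_era(date(target_year, 6, 1)))
```

Arithmetic that crosses an era boundary, as in the last example, requires converting through the Gregorian year (or an equivalent era table), which is the kind of step the abstract reports models struggling with.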

cs.CV [Back]

[51] Towards Efficient General Feature Prediction in Masked Skeleton Modeling

Shengkai Sun,Zefan Zhang,Jianfeng Dong,Zhiyong Cheng,Xiaojun Chang,Meng Wang

Main category: cs.CV

TL;DR: This paper proposes a novel General Feature Prediction (GFP) framework for skeleton-based action recognition, improving computational efficiency and representation quality while achieving state-of-the-art performance.

Details Motivation: The motivation is to overcome the limitations of existing masked autoencoder approaches that focus on low-level reconstruction, leading to computational redundancy and limited semantic representation. Method: The paper introduces a collaborative learning framework where a lightweight target generation network dynamically produces diversified supervision signals across spatial-temporal hierarchies, incorporating constrained optimization to ensure feature diversity and prevent model collapse. Result: The experiments show that the proposed method achieves 6.2× faster training speed and state-of-the-art performance on datasets like NTU RGB+D 60, NTU RGB+D 120, and PKU-MMD. Conclusion: The paper concludes that the proposed General Feature Prediction (GFP) framework significantly improves computational efficiency and representation quality in self-supervised skeleton-based action recognition, achieving state-of-the-art performance. Abstract: Recent advances in the masked autoencoder (MAE) paradigm have significantly propelled self-supervised skeleton-based action recognition. However, most existing approaches limit reconstruction targets to raw joint coordinates or their simple variants, resulting in computational redundancy and limited semantic representation. To address this, we propose a novel General Feature Prediction framework (GFP) for efficient mask skeleton modeling. Our key innovation is replacing conventional low-level reconstruction with high-level feature prediction that spans from local motion patterns to global semantic representations. Specifically, we introduce a collaborative learning framework where a lightweight target generation network dynamically produces diversified supervision signals across spatial-temporal hierarchies, avoiding reliance on pre-computed offline features. The framework incorporates constrained optimization to ensure feature diversity while preventing model collapse. 
Experiments on NTU RGB+D 60, NTU RGB+D 120 and PKU-MMD demonstrate the benefits of our approach: computational efficiency (with 6.2× faster training than standard masked skeleton modeling methods) and superior representation quality, achieving state-of-the-art performance in various downstream tasks.

[52] Teacher-Student Model for Detecting and Classifying Mitosis in the MIDOG 2025 Challenge

Seungho Choe,Xiaoli Qin,Abubakr Shafique,Amanda Dy,Susan Done,Dimitrios Androutsos,April Khademi

Main category: cs.CV

TL;DR: This paper proposes a robust AI framework for mitosis detection and classification, combining segmentation-based methods with domain generalization to address challenges like domain shift and data imbalance.

Details Motivation: The motivation is to overcome the challenges of time-intensive manual mitotic figure counting, inter-observer variability, domain shift in AI tools, and data imbalance between mitotic and normal nuclei. Method: The paper proposes a teacher-student model based on a UNet segmentation backbone, incorporating domain generalization modules such as contrastive representation learning and domain-adversarial training. It also introduces a multi-scale CNN classifier within a multi-task learning paradigm for classification. Result: On the preliminary test set, the algorithm achieved an F1 score of 0.7660 in mitosis detection (Track 1) and balanced accuracy of 0.8414 in atypical mitosis classification (Track 2). Conclusion: The study concludes that integrating segmentation-based detection and classification into a unified framework enhances robustness and effectiveness in mitosis analysis, addressing challenges like domain shift and data imbalance. Abstract: Counting mitotic figures is time-intensive for pathologists and leads to inter-observer variability. Artificial intelligence (AI) promises a solution by automatically detecting mitotic figures while maintaining decision consistency. However, AI tools are susceptible to domain shift, where a significant drop in performance can occur due to differences in the training and testing sets, including morphological diversity between organs, species, and variations in staining protocols. Furthermore, the number of mitoses is much less than the count of normal nuclei, which introduces severely imbalanced data for the detection task. In this work, we formulate mitosis detection as a pixel-level segmentation and propose a teacher-student model that simultaneously addresses mitosis detection (Track 1) and atypical mitosis classification (Track 2). 
Our method is based on a UNet segmentation backbone that integrates domain generalization modules, namely contrastive representation learning and domain-adversarial training. A teacher-student strategy is employed to generate pixel-level pseudo-masks not only for annotated mitoses and hard negatives but also for normal nuclei, thereby enhancing feature discrimination and improving robustness against domain shift. For the classification task, we introduce a multi-scale CNN classifier that leverages feature maps from the segmentation model within a multi-task learning paradigm. On the preliminary test set, the algorithm achieved an F1 score of 0.7660 in Track 1 and balanced accuracy of 0.8414 in Track 2, demonstrating the effectiveness of integrating segmentation-based detection and classification into a unified framework for robust mitosis analysis.
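For readers unfamiliar with the two reported metrics, the sketch below computes an F1 score and balanced accuracy from confusion-matrix counts. These are the standard textbook definitions; the counts are made up and do not come from the MIDOG 2025 evaluation, whose exact scoring protocol is not described in the abstract.

```python
def f1_score(tp, fp, fn):
    """Harmonic mean of precision and recall from detection counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def balanced_accuracy(tp, fp, tn, fn):
    """Mean of sensitivity and specificity for a binary classifier."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return (sensitivity + specificity) / 2


# Made-up, heavily imbalanced example: 10 positives among 1,000 samples.
tp, fn, tn, fp = 5, 5, 980, 10

plain_accuracy = (tp + tn) / (tp + fn + tn + fp)
print(f"accuracy:          {plain_accuracy:.3f}")
print(f"balanced accuracy: {balanced_accuracy(tp, fp, tn, fn):.3f}")
print(f"F1:                {f1_score(tp, fp, fn):.3f}")
```

With rare positives, plain accuracy stays high even when half the positives are missed, which is why F1 and balanced accuracy are the more informative choices for imbalanced data of the kind the abstract describes.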

[53] Multi Attribute Bias Mitigation via Representation Learning

Rajeev Ranjan Dwivedi,Ankur Kumar,Vinod K Kurmi

Main category: cs.CV

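TL;DR: GMBM is a new method for the multi-bias problem in visual recognition that improves model robustness and fairness through bias-integrated learning and gradient suppression.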

Details Motivation: Real-world images contain multiple overlapping biases that harm model performance and fairness, and mitigating biases one at a time has limited effect. Method: GMBM is a two-stage framework comprising Adaptive Bias Integrated Learning (ABIL) and Gradient Suppression Fine Tuning, with bias evaluated using the Scaled Bias Amplification (SBA) metric. Result: On multiple datasets, GMBM significantly improves worst-group accuracy and reduces multi-attribute bias amplification. Conclusion: GMBM offers an effective multi-bias mitigation method that suits complex visual recognition tasks and is highly practical. Abstract: Real world images frequently exhibit multiple overlapping biases, including textures, watermarks, gendered makeup, scene object pairings, etc. These biases collectively impair the performance of modern vision models, undermining both their robustness and fairness. Addressing these biases individually proves inadequate, as mitigating one bias often permits or intensifies others. We tackle this multi bias problem with Generalized Multi Bias Mitigation (GMBM), a lean two stage framework that needs group labels only while training and minimizes bias at test time. First, Adaptive Bias Integrated Learning (ABIL) deliberately identifies the influence of known shortcuts by training encoders for each attribute and integrating them with the main backbone, compelling the classifier to explicitly recognize these biases. Then Gradient Suppression Fine Tuning prunes those very bias directions from the backbone's gradients, leaving a single compact network that ignores all the shortcuts it just learned to recognize. Moreover we find that existing bias metrics break under subgroup imbalance and train test distribution shifts, so we introduce Scaled Bias Amplification (SBA): a test time measure that disentangles model induced bias amplification from distributional differences. We validate GMBM on FB CMNIST, CelebA, and COCO, where we boost worst group accuracy, halve multi attribute bias amplification, and set a new low in SBA even as bias complexity and distribution shifts intensify, making GMBM the first practical, end to end multibias solution for visual recognition. Project page: http://visdomlab.github.io/GMBM/

[54] Lightweight image segmentation for echocardiography

Anders Kjelsrud,Lasse Løvstakken,Erik Smistad,Håvard Dalen,Gilles Van De Vyver

Main category: cs.CV

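TL;DR: By analyzing the key components of nnU-Net for cardiac segmentation, the study develops a lighter and faster U-Net that matches the original model's performance.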

Details Motivation: Models configured by nnU-Net perform well in cardiac segmentation, but they are large and slow at inference, which limits real-time use. Method: An ablation study incrementally evaluated data augmentation schemes, architectural modifications, loss functions, and post-processing techniques to identify the most effective components of nnU-Net. Result: The resulting lightweight U-Net has 2M parameters (vs 33M), achieves Dice scores on CAMUS that are statistically equivalent to nnU-Net (LV/MYO/LA: 0.93/0.85/0.89 vs 0.93/0.86/0.89), runs 4 times faster (1.35ms vs 5.40ms per frame), and shows comparable generalization on an internal dataset. Conclusion: Simplifying nnU-Net yields a lighter U-Net with comparable performance and substantially smaller model size and inference time. Abstract: Accurate segmentation of the left ventricle in echocardiography can enable fully automatic extraction of clinical measurements such as volumes and ejection fraction. While models configured by nnU-Net perform well, they are large and slow, thus limiting real-time use. We identified the most effective components of nnU-Net for cardiac segmentation through an ablation study, incrementally evaluating data augmentation schemes, architectural modifications, loss functions, and post-processing techniques. Our analysis revealed that simple affine augmentations and deep supervision drive performance, while complex augmentations and large model capacity offer diminishing returns. Based on these insights, we developed a lightweight U-Net (2M vs 33M parameters) that achieves statistically equivalent performance to nnU-Net on CAMUS (N=500) with Dice scores of 0.93/0.85/0.89 vs 0.93/0.86/0.89 for LV/MYO/LA (p > 0.05), while being 16 times smaller and 4 times faster (1.35ms vs 5.40ms per frame) than the default nnU-Net configuration. Cross-dataset evaluation on an internal dataset (N=311) confirms comparable generalization.
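The Dice scores quoted above measure overlap between a predicted mask and a reference mask. A minimal sketch of the standard definition follows, using flattened binary masks as plain Python lists; the masks are made up, and the paper's actual evaluation code and per-structure averaging are not shown in the abstract.

```python
def dice_score(pred, target):
    """Dice coefficient 2|A ∩ B| / (|A| + |B|) for flattened binary masks."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same number of pixels")
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    total = sum(1 for p in pred if p) + sum(1 for t in target if t)
    # Two empty masks agree perfectly by convention.
    return 1.0 if total == 0 else 2 * intersection / total


# Made-up 1-D "masks": two of three foreground pixels overlap.
pred = [1, 1, 1, 0, 0, 0]
target = [0, 1, 1, 1, 0, 0]
print(f"Dice: {dice_score(pred, target):.3f}")
```

A score of 1.0 means identical masks and 0.0 means no overlap, so a difference of 0.85 vs 0.86 on the myocardium corresponds to a very small change in overlap.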

[55] treeX: Unsupervised Tree Instance Segmentation in Dense Forest Point Clouds

Josafat-Mattias Burmeister,Andreas Tockner,Stefan Reder,Markus Engel,Rico Richter,Jan-Peter Mund,Jürgen Döllner

Main category: cs.CV

TL;DR: The paper presents a revised treeX algorithm that efficiently processes 3D point cloud data for forest stand analysis, offering improved performance and resource efficiency compared to deep learning methods.

Details Motivation: The need for efficient software to process 3D point cloud data from close-range laser scanning in forest stands, especially as existing deep learning methods require large annotated datasets and substantial computational resources. Method: The revised treeX algorithm uses clustering-based stem detection combined with region growing for crown delineation, with two parameter presets for different laser scanning data types (ground-based and UAV-borne). Result: The revised treeX algorithm reduces runtime and improves accuracy compared to the original, with F1-score gains for ground-based data, and successfully segments UAV-borne data (achieving an F1-score of 0.58) where the original failed. The algorithm performs similarly to recent open-source deep learning methods on TLS and PLS data. Conclusion: The revised treeX algorithm is a resource-efficient alternative to deep learning methods for 3D point cloud data processing in forest stands, suitable for scenarios with sufficient stem visibility and point density, and can also aid in generating labels for deep learning models. Abstract: Close-range laser scanning provides detailed 3D captures of forest stands but requires efficient software for processing 3D point cloud data and extracting individual trees. Although recent studies have introduced deep learning methods for tree instance segmentation, these approaches require large annotated datasets and substantial computational resources. As a resource-efficient alternative, we present a revised version of the treeX algorithm, an unsupervised method that combines clustering-based stem detection with region growing for crown delineation. While the original treeX algorithm was developed for personal laser scanning (PLS) data, we provide two parameter presets, one for ground-based laser scanning (stationary terrestrial - TLS and PLS), and one for UAV-borne laser scanning (ULS). 
We evaluated the method on six public datasets (FOR-instance, ForestSemantic, LAUTx, NIBIO MLS, TreeLearn, Wytham Woods) and compared it to six open-source methods (original treeX, treeiso, RayCloudTools, ForAINet, SegmentAnyTree, TreeLearn). Compared to the original treeX algorithm, our revision reduces runtime and improves accuracy, with instance detection F1-score gains of +0.11 to +0.49 for ground-based data. For ULS data, our preset achieves an F1-score of 0.58, whereas the original algorithm fails to segment any correct instances. For TLS and PLS data, our algorithm achieves accuracy similar to recent open-source methods, including deep learning. Given its algorithmic design, we see two main applications for our method: (1) as a resource-efficient alternative to deep learning approaches in scenarios where the data characteristics align with the method design (sufficient stem visibility and point density), and (2) for the semi-automatic generation of labels for deep learning models. To enable broader adoption, we provide an open-source Python implementation in the pointtree package.

[56] Reg3D: Reconstructive Geometry Instruction Tuning for 3D Scene Understanding

Hongpei Zheng,Lintao Xiang,Qijun Yang,Qian Lin,Hujun Yin

Main category: cs.CV

TL;DR: Reg3D introduces a new training framework for 3D scene understanding using geometry-aware supervision, outperforming existing methods across multiple tasks like ScanQA and SQA3D.

Details Motivation: Existing methods for 3D scene understanding rely on text-only supervision, which lacks the geometric constraints necessary for robust 3D spatial representation learning. Method: Reg3D introduces a dual-supervision paradigm that incorporates 3D geometric information both as input and explicit learning targets, using a dual-encoder architecture with object-level and frame-level reconstruction tasks to enforce geometric consistency. Result: Reg3D achieves substantial performance improvements on ScanQA, Scan2Cap, ScanRefer, and SQA3D, demonstrating the effectiveness of geometry-aware supervision in learning spatial reasoning capabilities. Conclusion: Reg3D establishes a new training paradigm for spatially aware multimodal models, addressing the limitations of existing approaches by integrating reconstructive geometry instruction tuning. Abstract: The rapid development of Large Multimodal Models (LMMs) has led to remarkable progress in 2D visual understanding; however, extending these capabilities to 3D scene understanding remains a significant challenge. Existing approaches predominantly rely on text-only supervision, which fails to provide the geometric constraints required for learning robust 3D spatial representations. In this paper, we introduce Reg3D, a novel Reconstructive Geometry Instruction Tuning framework that addresses this limitation by incorporating geometry-aware supervision directly into the training process. Our key insight is that effective 3D understanding necessitates reconstructing underlying geometric structures rather than merely describing them. Unlike existing methods that inject 3D information solely at the input level, Reg3D adopts a dual-supervision paradigm that leverages 3D geometric information both as input and as explicit learning targets. 
Specifically, we design complementary object-level and frame-level reconstruction tasks within a dual-encoder architecture, enforcing geometric consistency to encourage the development of spatial reasoning capabilities. Extensive experiments on ScanQA, Scan2Cap, ScanRefer, and SQA3D demonstrate that Reg3D delivers substantial performance improvements, establishing a new training paradigm for spatially aware multimodal models.

[57] QuantV2X: A Fully Quantized Multi-Agent System for Cooperative Perception

Seth Z. Zhao,Huizhi Zhang,Zhaowei Li,Juntong Peng,Anthony Chui,Zewei Zhou,Zonglin Meng,Hao Xiang,Zhiyu Huang,Fujia Wang,Ran Tian,Chenfeng Xu,Bolei Zhou,Jiaqi Ma

Main category: cs.CV

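TL;DR: This paper proposes QuantV2X, the first fully quantized multi-agent V2X perception system, combining efficient deployment with high performance.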

Details Motivation: Existing systems rely on full-precision models, which incur high computational and transmission costs and are impractical for real-time operation in resource-constrained environments. An efficient, scalable multi-agent system is therefore needed. Method: A unified end-to-end quantization strategy covers both neural network models and transmitted message representations, reducing computational load and transmission bandwidth at the same time. Result: Under low-bit constraints, QuantV2X achieves accuracy comparable to full-precision systems, reduces system-level latency by 3.2×, improves mAP30 by +9.5, and fits within stricter memory budgets. Conclusion: QuantV2X is a fully quantized multi-agent system that enables efficient and scalable deployment of multi-modal, multi-agent V2X cooperative perception, significantly reducing system-level latency and improving mAP30. Abstract: Cooperative perception through Vehicle-to-Everything (V2X) communication offers significant potential for enhancing vehicle perception by mitigating occlusions and expanding the field of view. However, past research has predominantly focused on improving accuracy metrics without addressing the crucial system-level considerations of efficiency, latency, and real-world deployability. Noticeably, most existing systems rely on full-precision models, which incur high computational and transmission costs, making them impractical for real-time operation in resource-constrained environments. In this paper, we introduce QuantV2X, the first fully quantized multi-agent system designed specifically for efficient and scalable deployment of multi-modal, multi-agent V2X cooperative perception. QuantV2X introduces a unified end-to-end quantization strategy across both neural network models and transmitted message representations that simultaneously reduces computational load and transmission bandwidth. Remarkably, despite operating under low-bit constraints, QuantV2X achieves accuracy comparable to full-precision systems. More importantly, when evaluated under deployment-oriented metrics, QuantV2X reduces system-level latency by 3.2× and achieves a +9.5 improvement in mAP30 over full-precision baselines. Furthermore, QuantV2X scales more effectively, enabling larger and more capable models to fit within strict memory budgets. These results highlight the viability of a fully quantized multi-agent intermediate fusion system for real-world deployment.
The system will be publicly released to promote research in this field: https://github.com/ucla-mobility/QuantV2X.
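To give a feel for what low-bit quantization of a transmitted feature vector involves, the sketch below implements a generic uniform (min-max) quantizer. This is a textbook scheme chosen for illustration; the abstract does not describe QuantV2X's actual quantization method, and the function names and values here are hypothetical.

```python
def quantize(values, num_bits=8):
    """Map floats to integer codes in [0, 2**num_bits - 1] with a min-max scale."""
    qmax = 2 ** num_bits - 1
    lo, hi = min(values), max(values)
    scale = (hi - lo) / qmax if hi > lo else 1.0
    # Clamp to guard against floating-point overshoot at the range edges.
    codes = [min(qmax, max(0, round((v - lo) / scale))) for v in values]
    return codes, scale, lo


def dequantize(codes, scale, lo):
    """Recover approximate floats from integer codes."""
    return [c * scale + lo for c in codes]


# Made-up feature values, sent as 4-bit codes instead of 32-bit floats.
features = [-1.0, -0.25, 0.0, 0.5, 1.0]
codes, scale, lo = quantize(features, num_bits=4)
restored = dequantize(codes, scale, lo)

max_error = max(abs(v - r) for v, r in zip(features, restored))
print(codes)
print(f"max reconstruction error: {max_error:.4f} (step size {scale:.4f})")
```

Each value now costs 4 bits plus a shared scale and offset, and the reconstruction error is bounded by half a quantization step; this is the basic bandwidth-versus-accuracy trade-off that a fully quantized pipeline has to manage.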

[58] Transfer Learning-Based CNN Models for Plant Species Identification Using Leaf Venation Patterns

Bandita Bharadwaj,Ankur Mishra,Saurav Bharadwaj

Main category: cs.CV

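TL;DR: This study evaluates three deep learning architectures (ResNet50, MobileNetV2, and EfficientNetB0) for plant species classification based on leaf venation patterns. Using the Swedish Leaf Dataset of 15 plant species, it finds that EfficientNetB0 achieves the best testing accuracy and F1 score, indicating strong potential for venation-based plant taxonomy.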

Details Motivation: The study aims to evaluate how well different deep learning architectures perform automated plant species classification based on leaf venation patterns, a key morphological feature with high taxonomic relevance. Method: Using the Swedish Leaf Dataset, which contains images of 15 species (75 images per species, 1,125 in total), the study evaluates ResNet50, MobileNetV2, and EfficientNetB0 with standard performance metrics. Result: ResNet50 reached a training accuracy of 94.11% but overfit, with testing accuracy dropping to 88.45% and an F1 score of 87.82%. MobileNetV2 generalized better, with a testing accuracy of 93.34% and an F1 score of 93.23%. EfficientNetB0 performed best, with a testing accuracy of 94.67% and precision, recall, and F1 scores all exceeding 94.6%. Conclusion: Deep learning models, particularly EfficientNetB0, show great potential for venation-based plant taxonomy and could enable scalable, accurate automated tools. Abstract: This study evaluates the efficacy of three deep learning architectures: ResNet50, MobileNetV2, and EfficientNetB0 for automated plant species classification based on leaf venation patterns, a critical morphological feature with high taxonomic relevance. Using the Swedish Leaf Dataset comprising images from 15 distinct species (75 images per species, totalling 1,125 images), the models were evaluated using standard performance metrics during training and testing phases. ResNet50 achieved a training accuracy of 94.11% but exhibited overfitting, reflected by a reduced testing accuracy of 88.45% and an F1 score of 87.82%. MobileNetV2 demonstrated better generalization capabilities, attaining a testing accuracy of 93.34% and an F1 score of 93.23%, indicating its suitability for lightweight, real-time applications. EfficientNetB0 outperformed both models, achieving a testing accuracy of 94.67% with precision, recall, and F1 scores exceeding 94.6%, highlighting its robustness in venation-based classification. The findings underscore the potential of deep learning, particularly EfficientNetB0, in developing scalable and accurate tools for automated plant taxonomy using venation traits.

[59] LayoutGKN: Graph Similarity Learning of Floor Plans

Casper van Engelenburg,Jan van Gemert,Seyran Khademi

Main category: cs.CV

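TL;DR: This paper proposes LayoutGKN, an efficient graph comparison method that postpones cross-graph node-level interactions to the end, significantly speeding up inference while maintaining comparable or better performance.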

Details Motivation: Existing graph matching networks rely on costly intermediate cross-graph node-level interactions, which makes inference slow, so a more efficient solution is needed. Method: A differentiable graph kernel serves as the distance function on the final learned node-level embeddings, postponing cross-graph node-level interactions to the end of the joint embedding architecture. Result: LayoutGKN computes similarity comparably to or better than graph matching networks while being significantly faster. Conclusion: LayoutGKN significantly speeds up inference while matching or exceeding the performance of graph matching networks. Abstract: Floor plans depict building layouts and are often represented as graphs to capture the underlying spatial relationships. Comparison of these graphs is critical for applications like search, clustering, and data visualization. The most successful methods to compare graphs, i.e., graph matching networks, rely on costly intermediate cross-graph node-level interactions, therefore being slow in inference time. We introduce LayoutGKN, a more efficient approach that postpones the cross-graph node-level interactions to the end of the joint embedding architecture. We do so by using a differentiable graph kernel as a distance function on the final learned node-level embeddings. We show that LayoutGKN computes similarity comparably or better than graph matching networks while significantly increasing the speed. Code and data are open: https://github.com/caspervanengelenburg/LayoutGKN

[60] Singular Value Few-shot Adaptation of Vision-Language Models

Taha Koleilat,Hassan Rivaz,Yiming Xiao

Main category: cs.CV

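TL;DR: This paper proposes CLIP-SVD, a new multi-modal, parameter-efficient adaptation technique that uses SVD to modify the singular values of CLIP's parameter matrices, achieving better adaptation performance while preserving the model's generalization ability.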

Details Motivation: When adapting to new fine-grained domains, existing methods rely on added modules, which limits adaptation quality and can destabilize the model. Method: Singular Value Decomposition (SVD) is used to modify CLIP's internal parameter space by fine-tuning only the singular values of its parameter matrices. Result: CLIP-SVD achieves state-of-the-art classification results on 11 natural and 10 biomedical datasets while using only 0.04% of the model's total parameters. Conclusion: CLIP-SVD is an effective parameter-efficient adaptation method that improves adaptation performance while preserving the model's generalization ability. Abstract: Vision-language models (VLMs) like CLIP have shown impressive zero-shot and few-shot learning capabilities across diverse applications. However, adapting these models to new fine-grained domains remains difficult due to reliance on prompt engineering and the high cost of full model fine-tuning. Existing adaptation approaches rely on augmented components, such as prompt tokens and adapter modules, which could limit adaptation quality, destabilize the model, and compromise the rich knowledge learned during pretraining. In this work, we present CLIP-SVD, a novel multi-modal and parameter-efficient adaptation technique that leverages Singular Value Decomposition (SVD) to modify the internal parameter space of CLIP without injecting additional modules. Specifically, we fine-tune only the singular values of the CLIP parameter matrices to rescale the basis vectors for domain adaptation while retaining the pretrained model. This design enables enhanced adaptation performance using only 0.04% of the model's total parameters and better preservation of its generalization ability. CLIP-SVD achieves state-of-the-art classification results on 11 natural and 10 biomedical datasets, outperforming previous methods in both accuracy and generalization under few-shot settings. Additionally, we leverage a natural language-based approach to analyze the effectiveness and dynamics of the CLIP adaptation to allow interpretability of CLIP-SVD. The code is publicly available at https://github.com/HealthX-Lab/CLIP-SVD.
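The idea of fine-tuning only singular values can be written out as follows. This is a sketch of the general technique under one assumed parameterization (an additive update to the singular values, with the symbols chosen here for illustration); the abstract states only that the singular values are fine-tuned while the pretrained model is retained, so the paper's exact formulation may differ.

```latex
% Pretrained weight matrix W \in \mathbb{R}^{m \times n}, with r = \min(m, n):
W = U \,\mathrm{diag}(\sigma)\, V^{\top},
\qquad U \in \mathbb{R}^{m \times r},\;
\sigma \in \mathbb{R}^{r},\;
V \in \mathbb{R}^{n \times r}.

% Adaptation keeps the bases U and V frozen and learns only \Delta\sigma:
W' = U \,\mathrm{diag}(\sigma + \Delta\sigma)\, V^{\top}.

% Trainable parameters per matrix drop from mn to r, e.g. for m = n = 512:
\frac{r}{mn} = \frac{512}{512 \cdot 512} \approx 0.2\%.
```

The per-matrix ratio above is only an example of the scaling; the 0.04% figure in the abstract refers to the whole model, which also contains parameters that are not part of any decomposed matrix.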

[61] STA-Net: A Decoupled Shape and Texture Attention Network for Lightweight Plant Disease Classification

Zongsen Qiu

Main category: cs.CV

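TL;DR: This paper proposes STA-Net, an efficient plant disease diagnosis model for edge devices. Using DeepMAD and the STAM module, it significantly improves model accuracy and efficiency.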

Details Motivation: Rising global food security needs have made precision agriculture and deep learning-based plant disease diagnosis crucial. However, deploying high-accuracy models on edge devices is challenging, and existing lightweight networks capture disease features poorly. Method: The paper uses DeepMAD, a training-free neural architecture search method, together with the Shape-Texture Attention Module (STAM). STAM decouples attention into two branches, using deformable convolutions to extract shape features and Gabor filters to extract texture features. Result: On the public CCMT plant disease dataset, the proposed STA-Net model (401K parameters, 51.1M FLOPs) reaches 89.00% accuracy and an F1 score of 88.96%, significantly outperforming the baseline and standard attention models. Conclusion: The paper presents a deep learning-based plant disease diagnosis method suited to edge devices. Introducing the STAM module and DeepMAD improves accuracy and efficiency, and the method's effectiveness for plant disease diagnosis is validated. Abstract: Responding to rising global food security needs, precision agriculture and deep learning-based plant disease diagnosis have become crucial. Yet, deploying high-precision models on edge devices is challenging. Most lightweight networks use attention mechanisms designed for generic object recognition, which poorly capture subtle pathological features like irregular lesion shapes and complex textures. To overcome this, we propose a twofold solution: first, using a training-free neural architecture search method (DeepMAD) to create an efficient network backbone for edge devices; second, introducing the Shape-Texture Attention Module (STAM). STAM splits attention into two branches: one using deformable convolutions (DCNv4) for shape awareness and the other using a Gabor filter bank for texture awareness. On the public CCMT plant disease dataset, our STA-Net model (with 401K parameters and 51.1M FLOPs) reached 89.00% accuracy and an F1 score of 88.96%. Ablation studies confirm STAM significantly improves performance over baseline and standard attention models. Integrating domain knowledge via decoupled attention thus presents a promising path for edge-deployed precision agriculture AI. The source code is available at https://github.com/RzMY/STA-Net.
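
To make the "Gabor filter bank" in the texture branch concrete, here is a minimal pure-Python sketch of the standard real-valued Gabor kernel at several orientations. It is not the STA-Net implementation: the function names, kernel size, and the parameter defaults (`sigma`, `lambd`, `gamma`, `psi`) are assumptions chosen only for illustration.

```python
import math


def gabor_kernel(size, theta, sigma=2.0, lambd=4.0, gamma=0.5, psi=0.0):
    """Return a size x size real Gabor kernel as a list of rows.

    theta: orientation in radians; sigma: Gaussian envelope width;
    lambd: wavelength of the cosine carrier; gamma: spatial aspect ratio;
    psi: phase offset. All defaults are illustrative assumptions.
    """
    if size % 2 == 0:
        raise ValueError("size must be odd so the kernel has a center pixel")
    half = size // 2
    kernel = []
    for y in range(-half, half + 1):
        row = []
        for x in range(-half, half + 1):
            # Rotate coordinates into the filter's orientation.
            x_r = x * math.cos(theta) + y * math.sin(theta)
            y_r = -x * math.sin(theta) + y * math.cos(theta)
            envelope = math.exp(-(x_r ** 2 + (gamma * y_r) ** 2) / (2 * sigma ** 2))
            carrier = math.cos(2 * math.pi * x_r / lambd + psi)
            row.append(envelope * carrier)
        kernel.append(row)
    return kernel


def gabor_bank(size=7, n_orientations=4):
    """Build kernels at evenly spaced orientations in [0, pi)."""
    return [
        gabor_kernel(size, k * math.pi / n_orientations)
        for k in range(n_orientations)
    ]


bank = gabor_bank(size=7, n_orientations=4)
print(len(bank), "kernels of size", len(bank[0]), "x", len(bank[0][0]))
```

Convolving a feature map with each kernel in such a bank gives orientation-selective texture responses; how STAM combines those responses into attention weights is specified in the paper, not here.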

[62] SLENet: A Guidance-Enhanced Network for Underwater Camouflaged Object Detection

Xinxin Wang,Han Sun,Ningzhong Liu,Huiyu Zhou,Yinan Yao

Main category: cs.CV

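TL;DR: This paper proposes SLENet, a new method for underwater camouflaged object detection comprising a GAE module, an LGB branch, and an MSSD decoder; experiments show superior performance on multiple datasets.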

Details Motivation: Underwater camouflaged object detection (UCOD) is critically important to marine ecology, but the task remains underexplored because of optical distortions, water turbidity, and the complex traits of marine organisms. Method: The SLENet framework is proposed, comprising a Gamma-Asymmetric Enhancement (GAE) module, a Localization Guidance Branch (LGB), and a Multi-Scale Supervised Decoder (MSSD). Result: Experiments show that SLENet outperforms existing state-of-the-art methods on the DeepCamo dataset and three benchmark COD datasets. Conclusion: SLENet performs well on both the UCOD task and the broader COD task, outperforms SOTA methods, and generalizes well. Abstract: Underwater Camouflaged Object Detection (UCOD) aims to identify objects that blend seamlessly into underwater environments. This task is critically important to marine ecology. However, it remains largely underexplored and accurate identification is severely hindered by optical distortions, water turbidity, and the complex traits of marine organisms. To address these challenges, we introduce the UCOD task and present DeepCamo, a benchmark dataset designed for this domain. We also propose Semantic Localization and Enhancement Network (SLENet), a novel framework for UCOD. We first benchmark state-of-the-art COD models on DeepCamo to reveal key issues, upon which SLENet is built. In particular, we incorporate a Gamma-Asymmetric Enhancement (GAE) module and a Localization Guidance Branch (LGB) to enhance multi-scale feature representation while generating a location map enriched with global semantic information. This map guides the Multi-Scale Supervised Decoder (MSSD) to produce more accurate predictions. Experiments on our DeepCamo dataset and three benchmark COD datasets confirm SLENet's superior performance over SOTA methods, and underscore its high generality for the broader COD task.

[63] Fitting Image Diffusion Models on Video Datasets

Juhun Lee,Simon S. Woo

Main category: cs.CV

TL;DR: This paper proposes a training strategy for diffusion models that uses temporal information from video frames to improve convergence speed and generative performance without architectural changes.

Details Motivation: Training diffusion models on static images limits convergence, distributional coverage, and generalization due to the lack of temporal information. Method: A training strategy incorporating temporal inductive bias from continuous video frames is introduced, which does not require modifications to the model architecture and can be integrated into standard diffusion pipelines. Result: The method achieves over 2x faster convergence, lower FID scores on training and validation data, and improved generative diversity by capturing meaningful temporal variations. Conclusion: The proposed method improves the convergence and generative diversity of diffusion models by leveraging temporal inductive bias from video frames without requiring architectural changes. Abstract: Image diffusion models are trained on independently sampled static images. While this is the bedrock task protocol in generative modeling, capturing the temporal world through the lens of static snapshots is information-deficient by design. This limitation leads to slower convergence, limited distributional coverage, and reduced generalization. In this work, we propose a simple and effective training strategy that leverages the temporal inductive bias present in continuous video frames to improve diffusion training. Notably, the proposed method requires no architectural modification and can be seamlessly integrated into standard diffusion training pipelines. We evaluate our method on the HandCo dataset, where hand-object interactions exhibit dense temporal coherence and subtle variations in finger articulation often result in semantically distinct motions. Empirically, our method converges more than 2x faster and achieves lower FID on both training and validation distributions. It also improves generative diversity by encouraging the model to capture meaningful temporal variations. We further provide an optimization analysis showing that our regularization reduces the gradient variance, which contributes to faster convergence.

[64] MedVista3D: Vision-Language Modeling for Reducing Diagnostic Errors in 3D CT Disease Detection, Understanding and Reporting

Yuheng Li,Yenho Chen,Yuxiang Lai,Jike Zhong,Vanessa Wildman,Xiaofeng Yang

Main category: cs.CV

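TL;DR: MedVista3D is a multi-scale semantic-enriched vision-language pretraining framework for 3D CT analysis. By combining local and global image-text alignment with a semantics-aware alignment strategy, it significantly improves the accuracy and interpretability of medical image analysis.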

Details Motivation: Radiologic diagnostic errors (such as under-reading errors, inattentional blindness, and communication failures) are prevalent in clinical practice, especially in 3D imaging, where clinicians must examine hundreds of slices per scan, which places higher demands on supporting systems. Method: MedVista3D uses a multi-scale semantic-enriched vision-language pretraining framework for 3D CT analysis, performing local and global image-text alignment for fine-grained representation learning, and introduces language model rewrites and a Radiology Semantic Matching Bank to address report variability. Result: MedVista3D achieves state-of-the-art performance on zero-shot disease classification, report retrieval, and medical visual question answering, and transfers well to organ segmentation and prognosis prediction. Conclusion: MedVista3D delivers state-of-the-art zero-shot disease classification, report retrieval, and medical visual question answering, while also transferring well to organ segmentation and prognosis prediction. Abstract: Radiologic diagnostic errors (under-reading errors, inattentional blindness, and communication failures) remain prevalent in clinical practice. These issues often stem from missed localized abnormalities, limited global context, and variability in report language. These challenges are amplified in 3D imaging, where clinicians must examine hundreds of slices per scan. Addressing them requires systems with precise localized detection, global volume-level reasoning, and semantically consistent natural language reporting. However, existing 3D vision-language models are unable to meet all three needs jointly, lacking local-global understanding for spatial reasoning and struggling with the variability and noise of uncurated radiology reports. We present MedVista3D, a multi-scale semantic-enriched vision-language pretraining framework for 3D CT analysis. To enable joint disease detection and holistic interpretation, MedVista3D performs local and global image-text alignment for fine-grained representation learning within full-volume context. To address report variability, we apply language model rewrites and introduce a Radiology Semantic Matching Bank for semantics-aware alignment. MedVista3D achieves state-of-the-art performance on zero-shot disease classification, report retrieval, and medical visual question answering, while transferring well to organ segmentation and prognosis prediction. Code and datasets will be released.

[65] Causality-guided Prompt Learning for Vision-language Models via Visual Granulation

Mengyu Gao,Qiulei Dong

Main category: cs.CV

TL;DR: This paper proposes CaPL, a new causality-guided text prompt learning method that improves CLIP's performance on fine-grained recognition tasks.

Details Motivation: Existing CLIP-based prompt learning methods have limited ability to handle fine-grained datasets, which motivates CaPL. Method: A causality-guided text prompt learning method via visual granulation (CaPL) is proposed, comprising an attribute disentanglement module and a granule learning module. Result: Through visual granulation, CaPL captures subtle discrepancies among fine-grained classes, improving recognition performance. Conclusion: Experimental results show that CaPL significantly outperforms existing prompt learning methods on 15 datasets, especially on fine-grained ones. Abstract: Prompt learning has recently attracted much attention for adapting pre-trained vision-language models (e.g., CLIP) to downstream recognition tasks. However, most of the existing CLIP-based prompt learning methods only show a limited ability for handling fine-grained datasets. To address this issue, we propose a causality-guided text prompt learning method via visual granulation for CLIP, called CaPL, where the explored visual granulation technique could construct sets of visual granules for the text prompt to capture subtle discrepancies among different fine-grained classes through causal inference. The CaPL method contains the following two modules: (1) An attribute disentanglement module is proposed to decompose visual features into non-individualized attributes (shared by some classes) and individualized attributes (specific to single classes) using a Brownian Bridge Diffusion Model; (2) A granule learning module is proposed to construct visual granules by integrating the aforementioned attributes for recognition under two causal inference strategies. Thanks to the learned visual granules, a more discriminative text prompt is expected to be learned. Extensive experimental results on 15 datasets demonstrate that our CaPL method significantly outperforms the state-of-the-art prompt learning methods, especially on fine-grained datasets.

[66] EGTM: Event-guided Efficient Turbulence Mitigation

Huanan Li,Rui Fan,Juntao Guan,Weidong Hao,Lai Rui,Tong Wu,Yikai Wang,Lin Gu

Main category: cs.CV

TL;DR: This paper proposes EGTM, a novel turbulence mitigation framework using event cameras, which achieves state-of-the-art performance in both efficiency and restoration quality.

Details Motivation: The motivation is to overcome the limitations of existing deep-learning turbulence mitigation methods, which suffer from high computational and storage costs due to reliance on synchronous frame cameras with limited frame rates. Method: The paper introduces the "event-lucky insight" to correlate turbulence distortions with event stream distribution, proposes the EGTM framework for pixel-level turbulence-free guidance extraction, and constructs the first real-world event-driven TM dataset. Result: The EGTM framework outperforms existing methods by 710 times in model size, 214 times in inference latency, and 224 times in model complexity, while achieving superior restoration quality with +0.94 PSNR and +0.08 SSIM on the newly created real-world dataset. Conclusion: The paper concludes that the proposed EGTM framework significantly improves turbulence mitigation in terms of efficiency and restoration quality by leveraging event cameras' high temporal resolution and sparse imaging mechanism. Abstract: Turbulence mitigation (TM) aims to remove the stochastic distortions and blurs introduced by atmospheric turbulence into frame cameras. Existing state-of-the-art deep-learning TM methods extract turbulence cues from multiple degraded frames to find the so-called "lucky", undistorted patch for "lucky fusion". However, this requires a high-capacity network to learn from coarse-grained turbulence dynamics between synchronous frames with limited frame rate, and thus falls short in computational and storage efficiency. Event cameras, with microsecond-level temporal resolution, have the potential to fundamentally address this bottleneck with an efficient sparse and asynchronous imaging mechanism. In light of this, we (i) present the fundamental "event-lucky insight" to reveal the correlation between turbulence distortions and the inverse spatiotemporal distribution of event streams. Then, building upon this insight, we (ii) propose a novel EGTM framework that extracts pixel-level reliable turbulence-free guidance from the explicit but noisy turbulent events for temporal lucky fusion. Moreover, we (iii) build the first turbulence data acquisition system to contribute the first real-world event-driven TM dataset. Extensive experimental results demonstrate that our approach significantly surpasses the existing SOTA TM method by 710 times, 214 times, and 224 times in model size, inference latency, and model complexity, respectively, while achieving state-of-the-art restoration quality (+0.94 PSNR and +0.08 SSIM) on our real-world EGTM dataset. This demonstrates the great efficiency merit of introducing the event modality into the TM task. Demo code and data have been uploaded in supplementary material and will be released once accepted.

[67] Focus Through Motion: RGB-Event Collaborative Token Sparsification for Efficient Object Detection

Nan Yang,Yang Wang,Zhanwen Liu,Yuchao Dai,Yang Liu,Xiangmo Zhao

Main category: cs.CV

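TL;DR: This paper proposes FocusMamba, which balances accuracy and efficiency in RGB-Event detection through adaptive sparsification and a cross-modality fusion strategy.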

Details Motivation: Existing RGB-Event detection methods process low-information regions redundantly, leading to high computational cost and suboptimal performance; an adaptive sparsification strategy is needed to improve efficiency and accuracy. Method: An Event-Guided Multimodal Sparsification (EGMS) strategy adaptively discards low-information regions, combined with a Cross-Modality Focus Fusion (CMFF) module that integrates complementary multimodal information. Result: Validated on the DSEC-Det and PKU-DAVIS-SOD datasets, the method outperforms existing methods in both accuracy and computational efficiency. Conclusion: Through adaptive multimodal feature sparsification and an efficient cross-modality fusion strategy, FocusMamba balances accuracy and efficiency in RGB-Event detection. Abstract: Existing RGB-Event detection methods process the low-information regions of both modalities (background in images and non-event regions in event data) uniformly during feature extraction and fusion, resulting in high computational costs and suboptimal performance. To mitigate the computational redundancy during feature extraction, researchers have respectively proposed token sparsification methods for the image and event modalities. However, these methods employ a fixed number or threshold for token selection, hindering the retention of informative tokens for samples with varying complexity. To achieve a better balance between accuracy and efficiency, we propose FocusMamba, which performs adaptive collaborative sparsification of multimodal features and efficiently integrates complementary information. Specifically, an Event-Guided Multimodal Sparsification (EGMS) strategy is designed to identify and adaptively discard low-information regions within each modality by leveraging scene content changes perceived by the event camera. Based on the sparsification results, a Cross-Modality Focus Fusion (CMFF) module is proposed to effectively capture and integrate complementary features from both modalities. Experiments on the DSEC-Det and PKU-DAVIS-SOD datasets demonstrate that the proposed method achieves superior performance in both accuracy and efficiency compared to existing methods. The code will be available at https://github.com/Zizzzzzzz/FocusMamba.

[68] SalientFusion: Context-Aware Compositional Zero-Shot Food Recognition

Jiajun Song,Xiaoou Liu

Main category: cs.CV

TL;DR: SalientFusion addresses challenges in food recognition by removing background redundancy, resolving role confusion, and reducing semantic bias, achieving state-of-the-art results in Compositional Zero-Shot Food Learning.

Details Motivation: The need for recognizing unseen food categories motivates Zero-Shot Food Learning, and the unique challenges of food recognition necessitate a context-aware approach like SalientFusion. Method: SalientFusion incorporates SalientFormer to remove background redundancy and resolve role confusion using depth features, and DebiasAT to reduce semantic bias by aligning prompts with visual features. Result: SalientFusion achieves state-of-the-art results on the proposed benchmarks CZSFood-90 and CZSFood-164, as well as on popular general datasets for Compositional Zero-Shot Learning. Conclusion: SalientFusion is an effective method for Compositional Zero-Shot Food Recognition that addresses challenges such as background redundancy, role confusion, and semantic bias, achieving state-of-the-art results on proposed and general datasets. Abstract: Food recognition has gained significant attention, but the rapid emergence of new dishes requires methods for recognizing unseen food categories, motivating Zero-Shot Food Learning (ZSFL). We propose the task of Compositional Zero-Shot Food Recognition (CZSFR), where cuisines and ingredients naturally align with attributes and objects in Compositional Zero-Shot learning (CZSL). However, CZSFR faces three challenges: (1) Redundant background information distracts models from learning meaningful food features, (2) Role confusion between staple and side dishes leads to misclassification, and (3) Semantic bias in a single attribute can lead to confusion of understanding. Therefore, we propose SalientFusion, a context-aware CZSFR method with two components: SalientFormer, which removes background redundancy and uses depth features to resolve role confusion; DebiasAT, which reduces the semantic bias by aligning prompts with visual features. 
Using our proposed benchmarks, CZSFood-90 and CZSFood-164, we show that SalientFusion achieves state-of-the-art results on these benchmarks and on the most popular general CZSL datasets. The code is available at https://github.com/Jiajun-RUC/SalientFusion.

[69] Human Motion Video Generation: A Survey

Haiwei Xue,Xiangyang Luo,Zhanghao Hu,Xin Zhang,Xunzhi Xiang,Yuqin Dai,Jianzhuang Liu,Zhensong Zhang,Minglei Li,Jian Yang,Fei Ma,Zhiyong Wu,Changpeng Yang,Zonghong Dai,Fei Richard Yu

Main category: cs.CV

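TL;DR: This paper is a comprehensive survey of human motion video generation, covering more than ten sub-tasks and detailing the five key phases of the generation process.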

Details Motivation: Human motion video generation is developing rapidly, but existing surveys focus on individual methods and lack a comprehensive overview of the entire generative process; this paper aims to provide such a survey and help advance the comprehensive applications of digital humans. Method: The paper surveys over two hundred papers, covering the three primary modalities of human motion video generation (vision, text, and audio), and discusses the potential of large language models for enhancing human motion video generation. Result: The paper reviews the latest developments and technological trends in the field, highlights milestone works that drove technological breakthroughs, and provides a list of models as a valuable resource for advancing digital human applications. Conclusion: The paper aims to fill this gap with a comprehensive survey covering more than ten sub-tasks and detailing the five key phases of the generation process: input, motion planning, motion video generation, refinement, and output. Abstract: Human motion video generation has garnered significant research interest due to its broad applications, enabling innovations such as photorealistic singing heads or dynamic avatars that seamlessly dance to music. However, existing surveys in this field focus on individual methods, lacking a comprehensive overview of the entire generative process. This paper addresses this gap by providing an in-depth survey of human motion video generation, encompassing over ten sub-tasks, and detailing the five key phases of the generation process: input, motion planning, motion video generation, refinement, and output. Notably, this is the first survey that discusses the potential of large language models in enhancing human motion video generation. Our survey reviews the latest developments and technological trends in human motion video generation across three primary modalities: vision, text, and audio. By covering over two hundred papers, we offer a thorough overview of the field and highlight milestone works that have driven significant technological breakthroughs. Our goal for this survey is to unveil the prospects of human motion video generation and serve as a valuable resource for advancing the comprehensive applications of digital humans. A complete list of the models examined in this survey is available in our repository: https://github.com/Winn1y/Awesome-Human-Motion-Video-Generation.

[70] OccTENS: 3D Occupancy World Model via Temporal Next-Scale Prediction

Bu Jin,Songen Gu,Xiaotao Hu,Yupeng Zheng,Xiaoyang Guo,Qian Zhang,Xiaoxiao Long,Wei Yin

Main category: cs.CV

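TL;DR: This paper proposes OccTENS, a generative occupancy world model that reformulates occupancy world modeling as a temporal next-scale prediction task, enabling controllable, high-fidelity long-term occupancy generation while maintaining computational efficiency.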

Details Motivation: An occupancy world model must capture the fine-grained 3D geometry and dynamic evolution of 3D scenes, which poses great challenges for generative models. Recent autoregressive (AR) approaches typically suffer from inefficiency, temporal degradation in long-term generation, and a lack of controllability. Method: OccTENS reformulates the occupancy world model as a temporal next-scale prediction (TENS) task and uses a TensFormer to manage the temporal causality and spatial relationships of occupancy sequences. A holistic pose aggregation strategy is also proposed to enhance pose controllability. Result: Experiments show that OccTENS outperforms existing methods in both occupancy quality and inference speed. Conclusion: OccTENS is a generative occupancy world model that enables controllable, high-fidelity long-term occupancy generation while maintaining computational efficiency. Abstract: In this paper, we propose OccTENS, a generative occupancy world model that enables controllable, high-fidelity long-term occupancy generation while maintaining computational efficiency. Different from visual generation, the occupancy world model must capture the fine-grained 3D geometry and dynamic evolution of the 3D scenes, posing great challenges for the generative models. Recent approaches based on autoregression (AR) have demonstrated the potential to predict vehicle movement and future occupancy scenes simultaneously from historical observations, but they typically suffer from inefficiency, temporal degradation in long-term generation, and lack of controllability. To holistically address these issues, we reformulate the occupancy world model as a temporal next-scale prediction (TENS) task, which decomposes the temporal sequence modeling problem into the modeling of spatial scale-by-scale generation and temporal scene-by-scene prediction. With a TensFormer, OccTENS can effectively manage the temporal causality and spatial relationships of occupancy sequences in a flexible and scalable way. To enhance the pose controllability, we further propose a holistic pose aggregation strategy, which features a unified sequence modeling for occupancy and ego-motion. Experiments show that OccTENS outperforms the state-of-the-art method with both higher occupancy quality and faster inference time.

[71] Weakly-Supervised Learning of Dense Functional Correspondences

Stefan Stojanov,Linan Zhao,Yunzhi Zhang,Daniel L. K. Yamins,Jiajun Wu

Main category: cs.CV

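TL;DR: This paper proposes a method for establishing dense functional correspondences that uses vision-language models to pseudo-label multi-view images to obtain functional parts, and combines this with dense contrastive learning from pixel correspondences to distill functional and spatial knowledge.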

Details Motivation: In the challenging setting of matching across categories, an object's function can guide how correspondences should be established, because object parts that enable specific functions often share similarities in shape and appearance. Method: A definition of dense functional correspondence is derived from this observation, and a weakly-supervised learning paradigm is designed for the prediction task, integrating pseudo-labels from vision-language models with dense contrastive learning. Result: The results demonstrate the advantages of the proposed approach over baseline solutions on synthetic and real evaluation datasets. Conclusion: The method effectively establishes dense functional correspondences and shows superior performance on the task benchmarks. Abstract: Establishing dense correspondences across image pairs is essential for tasks such as shape reconstruction and robot manipulation. In the challenging setting of matching across different categories, the function of an object, i.e., the effect that an object can cause on other objects, can guide how correspondences should be established. This is because object parts that enable specific functions often share similarities in shape and appearance. We derive the definition of dense functional correspondence based on this observation and propose a weakly-supervised learning paradigm to tackle the prediction task. The main insight behind our approach is that we can leverage vision-language models to pseudo-label multi-view images to obtain functional parts. We then integrate this with dense contrastive learning from pixel correspondences to distill both functional and spatial knowledge into a new model that can establish dense functional correspondence. Further, we curate synthetic and real evaluation datasets as task benchmarks. Our results demonstrate the advantages of our approach over baseline solutions consisting of off-the-shelf self-supervised image representations and grounded vision language models.

[72] Attn-Adapter: Attention Is All You Need for Online Few-shot Learner of Vision-Language Model

Phuoc-Nguyen Bui,Khanh-Binh Nguyen,Hyunseung Choo

Main category: cs.CV

TL;DR: Attn-Adapter is a novel online few-shot learning framework designed to enhance CLIP's adaptability through a dual attention mechanism, addressing the limitations of computationally intensive offline fine-tuning and reducing the risk of overfitting.

Details Motivation: Contrastive vision-language models excel in zero-shot image recognition but face challenges in few-shot scenarios due to computationally intensive offline fine-tuning using prompt learning, which risks overfitting. The motivation is to overcome these limitations with a more effective few-shot learning framework. Method: The proposed method, Attn-Adapter, enhances CLIP's adaptability through a dual attention mechanism consisting of two components: the Memory Attn-Adapter, which refines category embeddings using support examples, and the Local-Global Attn-Adapter, which enriches image embeddings by integrating local and global features. Result: The Attn-Adapter framework enables dynamic adaptation from a few labeled samples without retraining the base model, leading to improved performance in cross-category and cross-dataset generalization. Conclusion: Attn-Adapter outperforms state-of-the-art methods in cross-category and cross-dataset generalization, maintaining efficient inference and scaling across CLIP backbones. Abstract: Contrastive vision-language models excel in zero-shot image recognition but face challenges in few-shot scenarios due to computationally intensive offline fine-tuning using prompt learning, which risks overfitting. To overcome these limitations, we propose Attn-Adapter, a novel online few-shot learning framework that enhances CLIP's adaptability via a dual attention mechanism. Our design incorporates dataset-specific information through two components: the Memory Attn-Adapter, which refines category embeddings using support examples, and the Local-Global Attn-Adapter, which enriches image embeddings by integrating local and global features. This architecture enables dynamic adaptation from a few labeled samples without retraining the base model. Attn-Adapter outperforms state-of-the-art methods in cross-category and cross-dataset generalization, maintaining efficient inference and scaling across CLIP backbones.

[73] SPECS: Specificity-Enhanced CLIP-Score for Long Image Caption Evaluation

Xiaofu Chen,Israfel Salazar,Yova Kementchedjhieva

Main category: cs.CV

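TL;DR: This paper proposes SPECS, a new reference-free RS metric for long image captioning that matches LLM-based metrics in correlation with human judgments while being more efficient.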

Details Motivation: As interest grows in generating long, detailed image captions, standard evaluation metrics become increasingly unreliable. N-gram-based metrics are efficient but fail to capture semantic correctness. Metrics based on large language models correlate strongly with human judgments but remain too expensive for iterative use during model development. Method: SPECS, a reference-free RS metric, is created by modifying CLIP with a new objective that emphasizes specificity (rewarding correct details and penalizing incorrect ones). Result: SPECS matches open-source LLM-based metrics in correlation with human judgments while being far more efficient. Conclusion: SPECS is a practical alternative for iterative checkpoint evaluation during image captioning model development. Abstract: As interest grows in generating long, detailed image captions, standard evaluation metrics become increasingly unreliable. N-gram-based metrics, though efficient, fail to capture semantic correctness. Representational Similarity (RS) metrics, designed to address this, initially saw limited use due to high computational costs, while today, despite advances in hardware, they remain unpopular due to low correlation with human judgments. Meanwhile, metrics based on large language models (LLMs) show strong correlation with human judgments, but remain too expensive for iterative use during model development. We introduce SPECS (Specificity-Enhanced CLIPScore), a reference-free RS metric tailored to long image captioning. SPECS modifies CLIP with a new objective that emphasizes specificity: rewarding correct details and penalizing incorrect ones. We show that SPECS matches the performance of open-source LLM-based metrics in correlation with human judgments, while being far more efficient. This makes it a practical alternative for iterative checkpoint evaluation during image captioning model development. Our code can be found at https://github.com/mbzuai-nlp/SPECS.
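
For readers unfamiliar with reference-free CLIP-based scoring, the sketch below shows the original CLIPScore formula that SPECS builds on: a rescaled, clipped cosine similarity between an image embedding and a caption embedding. The embeddings here are toy vectors, the function names are hypothetical, and the sketch does not reproduce the specificity-oriented training objective that distinguishes SPECS.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def clip_style_score(image_emb, caption_emb, w=2.5):
    """CLIPScore-style metric: w * max(cos(image, caption), 0).

    w = 2.5 is the rescaling constant used by the original CLIPScore;
    whether SPECS keeps the same constant is not stated in the abstract.
    """
    return w * max(cosine(image_emb, caption_emb), 0.0)


# Toy embeddings standing in for CLIP image/text encoder outputs.
image_emb = [0.6, 0.8, 0.0]
good_caption = [0.6, 0.8, 0.0]   # perfectly aligned with the image
bad_caption = [0.0, 0.0, 1.0]    # orthogonal to the image

print(clip_style_score(image_emb, good_caption))
print(clip_style_score(image_emb, bad_caption))
```

Because scoring needs only one forward pass per encoder and a dot product, a metric of this family is cheap enough to run at every checkpoint, which is the efficiency argument the abstract makes against LLM-based judges.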

[74] A Generative Foundation Model for Chest Radiography

Yuanfeng Ji,Dan Lin,Xiyue Wang,Lu Zhang,Wenhui Zhou,Chongjian Ge,Ruihang Chu,Xiaoli Yang,Junhan Zhao,Junsong Chen,Xiangde Luo,Sen Yang,Jin Fang,Ping Luo,Ruijiang Li

Main category: cs.CV

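TL;DR: ChexGen is a generative vision-language foundation model for synthesizing chest X-ray images; its accuracy is validated through expert evaluations and quantitative metrics, and it improves the performance and fairness of medical AI systems.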

Details Motivation: Well-annotated medical image data are scarce, which hinders the development of reliable AI models. Method: Built on a latent diffusion transformer architecture, the model is pretrained on 960,000 chest X-ray radiograph-report pairs. Result: ChexGen improves performance on disease classification, detection, and segmentation tasks using a small fraction of training data, and can create diverse patient cohorts to enhance model fairness. Conclusion: ChexGen supports building more accurate, data-efficient, and equitable medical AI systems. Abstract: The scarcity of well-annotated diverse medical images is a major hurdle for developing reliable AI models in healthcare. Substantial technical advances have been made in generative foundation models for natural images. Here we develop "ChexGen", a generative vision-language foundation model that introduces a unified framework for text-, mask-, and bounding box-guided synthesis of chest radiographs. Built upon the latent diffusion transformer architecture, ChexGen was pretrained on the largest curated chest X-ray dataset to date, consisting of 960,000 radiograph-report pairs. ChexGen achieves accurate synthesis of radiographs through expert evaluations and quantitative metrics. We demonstrate the utility of ChexGen for training data augmentation and supervised pretraining, which led to performance improvements across disease classification, detection, and segmentation tasks using a small fraction of training data. Further, our model enables the creation of diverse patient cohorts that enhance model fairness by detecting and mitigating demographic biases. Our study supports the transformative role of generative foundation models in building more accurate, data-efficient, and equitable medical AI systems.

[75] LMVC: An End-to-End Learned Multiview Video Coding Framework

Xihua Sheng,Yingwen Zhang,Long Xu,Shiqi Wang

Main category: cs.CV

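TL;DR: This paper proposes LMVC, a deep learning-based multiview video coding framework that exploits inter-view motion and content correlations to significantly improve compression efficiency, outperforming the traditional MV-HEVC standard and establishing a strong baseline for future research.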

Details Motivation: Multiview video poses major challenges for storage and transmission, and existing deep learning-based video coding methods focus mainly on single-view or stereo video, leaving general multiview scenarios underexplored. Method: An end-to-end learned multiview video coding (LMVC) framework is proposed that uses independent-view motion and content information to enhance dependent-view compression; it includes a feature-based inter-view motion vector prediction method, an inter-view motion entropy model, a disparity-free inter-view context prediction module, and an inter-view contextual entropy model. Result: Experiments show that LMVC substantially outperforms the reference software of the traditional MV-HEVC standard in compression efficiency. Conclusion: The study provides an efficient deep learning solution for multiview video coding and lays a foundation for future research. Abstract: Multiview video is a key data source for volumetric video, enabling immersive 3D scene reconstruction but posing significant challenges in storage and transmission due to its massive data volume. Recently, deep learning-based end-to-end video coding has achieved great success, yet most methods focus on single-view or stereo videos, leaving general multiview scenarios underexplored. This paper proposes an end-to-end learned multiview video coding (LMVC) framework that ensures random access and backward compatibility while enhancing compression efficiency. Our key innovation lies in effectively leveraging independent-view motion and content information to enhance dependent-view compression. Specifically, to exploit the inter-view motion correlation, we propose a feature-based inter-view motion vector prediction method that conditions dependent-view motion encoding on decoded independent-view motion features, along with an inter-view motion entropy model that learns inter-view motion priors. To exploit the inter-view content correlation, we propose a disparity-free inter-view context prediction module that predicts inter-view contexts from decoded independent-view content features, combined with an inter-view contextual entropy model that captures inter-view context priors. Experimental results show that our proposed LMVC framework outperforms the reference software of the traditional MV-HEVC standard by a large margin, establishing a strong baseline for future research in this field.

[76] TopoSculpt: Betti-Steered Topological Sculpting of 3D Fine-grained Tubular Shapes

Minghui Zhang,Yaoyu Liu,Junyang Wu,Xin You,Hanxiao Zhang,Junjun He,Yun Gu

Main category: cs.CV

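TL;DR: This paper proposes TopoSculpt, which uses holistic modeling and topological constraints to significantly improve the geometric and topological reconstruction accuracy of 3D tubular structures.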

Details Motivation: Existing methods fall short on topological correctness and completeness, and cannot preserve topology globally or correct geometric errors. Method: The TopoSculpt framework is proposed, comprising a holistic whole-region modeling strategy, a Topological Integrity Betti (TIB) constraint, and a curriculum refinement scheme based on persistent homology. Result: On pulmonary airway and Circle of Willis datasets, topological errors are substantially reduced, and tree-length-detected and branch-detected rates improve by nearly 10%. Conclusion: TopoSculpt effectively corrects critical topological errors in complex 3D tubular structures and improves geometric and topological modeling accuracy. Abstract: Medical tubular anatomical structures are inherently three-dimensional conduits with lumens, enclosing walls, and complex branching topologies. Accurate reconstruction of their geometry and topology is crucial for applications such as bronchoscopic navigation and cerebral arterial connectivity assessment. Existing methods often rely on voxel-wise overlap measures, which fail to capture topological correctness and completeness. Although topology-aware losses and persistent homology constraints have shown promise, they are usually applied patch-wise and cannot guarantee global preservation or correct geometric errors at inference. To address these limitations, we propose TopoSculpt, a novel framework for topological refinement of 3D fine-grained tubular structures. TopoSculpt (i) adopts a holistic whole-region modeling strategy to capture full spatial context, (ii) introduces, for the first time, a Topological Integrity Betti (TIB) constraint that jointly enforces Betti number priors and global integrity, and (iii) employs a curriculum refinement scheme with persistent homology to progressively correct errors from coarse to fine scales. Extensive experiments on challenging pulmonary airway and Circle of Willis datasets demonstrate substantial improvements in both geometry and topology. For instance, $\beta_{0}$ errors are reduced from 69.00 to 3.40 on the airway dataset and from 1.65 to 0.30 on the CoW dataset, with tree-length-detected and branch-detected rates improving by nearly 10%. These results highlight the effectiveness of TopoSculpt in correcting critical topological errors and advancing the high-fidelity modeling of complex 3D tubular anatomy. The project homepage is available at: https://github.com/Puzzled-Hui/TopoSculpt.
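
The $\beta_{0}$ figures quoted above count connected components: an ideal airway tree is a single component, so every spurious fragment or break raises $\beta_{0}$. The sketch below counts components in a 3D binary volume with a breadth-first flood fill. The 6-connectivity choice and the definition of the error as an absolute difference are assumptions made for illustration; the abstract does not state which conventions TopoSculpt uses.

```python
from collections import deque


def betti0(volume):
    """Count 6-connected foreground components in a binary volume [z][y][x]."""
    depth, height, width = len(volume), len(volume[0]), len(volume[0][0])
    neighbors = [(1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1)]
    seen = set()
    components = 0
    for z in range(depth):
        for y in range(height):
            for x in range(width):
                if not volume[z][y][x] or (z, y, x) in seen:
                    continue
                # New unvisited foreground voxel: flood-fill its component.
                components += 1
                seen.add((z, y, x))
                queue = deque([(z, y, x)])
                while queue:
                    cz, cy, cx = queue.popleft()
                    for dz, dy, dx in neighbors:
                        nz, ny, nx = cz + dz, cy + dy, cx + dx
                        inside = 0 <= nz < depth and 0 <= ny < height and 0 <= nx < width
                        if inside and volume[nz][ny][nx] and (nz, ny, nx) not in seen:
                            seen.add((nz, ny, nx))
                            queue.append((nz, ny, nx))
    return components


def betti0_error(pred, target):
    """Assumed error definition: absolute difference in component counts."""
    return abs(betti0(pred) - betti0(target))


# A straight 1x1x5 "tube" versus a prediction with a one-voxel break.
target = [[[1, 1, 1, 1, 1]]]
pred = [[[1, 1, 0, 1, 1]]]
print(betti0(target), betti0(pred), betti0_error(pred, target))
```

A voxel-overlap score would barely register the single missing voxel in this example, whereas the component count doubles, which is the gap between overlap measures and topological correctness that the abstract points to.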

[77] Chest X-ray Pneumothorax Segmentation Using EfficientNet-B4 Transfer Learning in a U-Net Architecture

Alvaro Aranibar Roque,Helga Sebastian

Main category: cs.CV

TL;DR: Proposes an automated deep-learning method for detecting pneumothorax.

Details Motivation: Pneumothorax can be fatal if undetected; chest X-rays are the first-line diagnostic tool, but small cases may be subtle. Method: An automated deep-learning pipeline that uses a U-Net with an EfficientNet-B4 encoder to segment pneumothorax regions. Result: Trained on the SIIM-ACR dataset and tested on the PTX-498 dataset, the model achieves an IoU of 0.7008 and a Dice score of 0.8241. Conclusion: The model accurately localizes pneumothorax regions and can assist radiologists. Abstract: Pneumothorax, the abnormal accumulation of air in the pleural space, can be life-threatening if undetected. Chest X-rays are the first-line diagnostic tool, but small cases may be subtle. We propose an automated deep-learning pipeline using a U-Net with an EfficientNet-B4 encoder to segment pneumothorax regions. Trained on the SIIM-ACR dataset with data augmentation and a combined binary cross-entropy plus Dice loss, the model achieved an IoU of 0.7008 and Dice score of 0.8241 on the independent PTX-498 dataset. These results demonstrate that the model can accurately localize pneumothoraces and support radiologists.
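
Both reported metrics are overlap ratios between a predicted mask and a ground-truth mask. The following is a minimal sketch in plain Python, assuming each mask has been flattened to a sequence of 0/1 integers. The function name `dice_and_iou` and the convention that two empty masks score 1.0 are illustrative assumptions, not taken from the paper's code.

```python
def dice_and_iou(pred, target):
    """Return (dice, iou) for two equal-length binary masks given as flat 0/1 sequences."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same number of pixels")
    # Pixels that are foreground in both masks.
    intersection = sum(p & t for p, t in zip(pred, target))
    pred_area = sum(pred)
    target_area = sum(target)
    union = pred_area + target_area - intersection
    if union == 0:
        # Assumption: two empty masks count as a perfect match.
        return 1.0, 1.0
    dice = 2 * intersection / (pred_area + target_area)
    iou = intersection / union
    return dice, iou


if __name__ == "__main__":
    print(dice_and_iou([1, 1, 0, 0], [1, 0, 1, 0]))
```

For a single mask pair the two metrics are linked by Dice = 2·IoU / (1 + IoU). The reported values are consistent with that identity (2 × 0.7008 / 1.7008 ≈ 0.8241), although metrics averaged over a dataset need not satisfy it exactly.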

[78] ANTS: Shaping the Adaptive Negative Textual Space by MLLM for OOD Detection

Zhu Wenjie,Zhang Yabin,Xin Jin,Wenjun Zeng,Lei Zhang

Main category: cs.CV

TL;DR: ANTS improves out-of-distribution detection by creating adaptive negative textual spaces using MLLMs, achieving state-of-the-art results on ImageNet.

Details Motivation: Existing OOD detection methods struggle with constructing accurate negative spaces and suffer from false negative labels, especially in near-OOD scenarios. Method: Adaptive Negative Textual Space (ANTS) using multimodal large language models (MLLMs) to generate expressive negative sentences and tailored negative labels for near-OOD and far-OOD settings. Result: On the ImageNet benchmark, ANTS reduced FPR95 by 4.2% and achieved superior OOD detection performance without training. Conclusion: ANTS is a new state-of-the-art method for OOD detection that effectively reduces false negatives and improves near-OOD and far-OOD detection. Abstract: The introduction of negative labels (NLs) has proven effective in enhancing Out-of-Distribution (OOD) detection. However, existing methods often lack an understanding of OOD images, making it difficult to construct an accurate negative space. In addition, the presence of false negative labels significantly degrades their near-OOD performance. To address these issues, we propose shaping an Adaptive Negative Textual Space (ANTS) by leveraging the understanding and reasoning capabilities of multimodal large language models (MLLMs). Specifically, we identify images likely to be OOD samples as negative images and prompt the MLLM to describe these images, generating expressive negative sentences that precisely characterize the OOD distribution and enhance far-OOD detection. For the near-OOD setting, where OOD samples resemble the in-distribution (ID) subset, we first identify the subset of ID classes that are visually similar to negative images and then leverage the reasoning capability of MLLMs to generate visually similar negative labels tailored to this subset, effectively reducing false negatives and improving near-OOD detection. 
To balance these two types of negative textual spaces, we design an adaptive weighted score that enables the method to handle different OOD task settings (near-OOD and far-OOD) without relying on task-specific prior knowledge, making it highly adaptable in open environments. On the ImageNet benchmark, our ANTS significantly reduces the FPR95 by 4.2%, establishing a new state-of-the-art. Furthermore, our method is training-free and zero-shot, enabling high scalability.
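
FPR95 is the false positive rate on OOD samples at the threshold where 95% of in-distribution samples are still accepted. The sketch below shows one common way to compute it in plain Python, assuming higher scores mean "more in-distribution". The function name `fpr_at_tpr` and the integer `tpr_percent` parameter are illustrative, and this is not the paper's evaluation code.

```python
def fpr_at_tpr(id_scores, ood_scores, tpr_percent=95):
    """Fraction of OOD samples accepted when at least tpr_percent% of ID samples are accepted.

    Assumes a sample is accepted as in-distribution when its score >= threshold.
    """
    if not id_scores or not ood_scores:
        raise ValueError("both score lists must be non-empty")
    n = len(id_scores)
    # Number of ID samples that must stay above the threshold (ceiling division).
    keep = (tpr_percent * n + 99) // 100
    # The threshold is the lowest score among the `keep` highest ID scores.
    threshold = sorted(id_scores)[n - keep]
    accepted_ood = sum(1 for s in ood_scores if s >= threshold)
    return accepted_ood / len(ood_scores)


if __name__ == "__main__":
    id_scores = [i / 20 for i in range(1, 21)]
    ood_scores = [0.01, 0.05, 0.5, 0.9]
    print(fpr_at_tpr(id_scores, ood_scores))
```

Lower is better: a detector that separates the two score distributions perfectly has an FPR95 of 0.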

[79] Multimodal Feature Fusion Network with Text Difference Enhancement for Remote Sensing Change Detection

Yijun Zhou,Yikui Zhai,Zilu Ying,Tingfeng Xian,Wenlve Zhou,Zhiheng Zhou,Xiaolin Tian,Xudong Jia,Hongsheng Zhang,C. L. Philip Chen

Main category: cs.CV

TL;DR: MMChange is a new remote sensing change detection method combining image and text modalities, achieving better accuracy and robustness than existing approaches.

Details Motivation: Most deep learning methods for remote sensing change detection rely solely on image modality, which limits their ability to represent features, model change patterns, and generalize under disturbances like illumination and noise. Method: MMChange incorporates an Image Feature Refinement (IFR) module, a vision language model (VLM), a Textual Difference Enhancement (TDE) module, and an Image Text Feature Fusion (ITFF) module to enhance feature representation and cross-modal integration. Result: Experiments on LEVIRCD, WHUCD, and SYSUCD datasets show that MMChange outperforms state-of-the-art methods across multiple metrics. Conclusion: MMChange is an effective multimodal method for remote sensing change detection that improves accuracy and robustness by combining image and text modalities. Abstract: Although deep learning has advanced remote sensing change detection (RSCD), most methods rely solely on image modality, limiting feature representation, change pattern modeling, and generalization especially under illumination and noise disturbances. To address this, we propose MMChange, a multimodal RSCD method that combines image and text modalities to enhance accuracy and robustness. An Image Feature Refinement (IFR) module is introduced to highlight key regions and suppress environmental noise. To overcome the semantic limitations of image features, we employ a vision language model (VLM) to generate semantic descriptions of bitemporal images. A Textual Difference Enhancement (TDE) module then captures fine grained semantic shifts, guiding the model toward meaningful changes. To bridge the heterogeneity between modalities, we design an Image Text Feature Fusion (ITFF) module that enables deep cross modal integration. Extensive experiments on LEVIRCD, WHUCD, and SYSUCD demonstrate that MMChange consistently surpasses state of the art methods across multiple metrics, validating its effectiveness for multimodal RSCD. 
Code is available at: https://github.com/yikuizhai/MMChange.

[80] SAC-MIL: Spatial-Aware Correlated Multiple Instance Learning for Histopathology Whole Slide Image Classification

Yu Bai,Zitong Yu,Haowen Tian,Xijing Wang,Shuo Yan,Lin Wang,Honglin Li,Xitong Ling,Bo Zhang,Zheng Zhang,Wufan Wang,Hui Gao,Xiangyang Gong,Wendong Wang

Main category: cs.CV

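TL;DR: This paper proposes SAC-MIL, a new method for WSI classification comprising a positional encoding module and a SAC block; it achieves state-of-the-art performance and is easy to deploy.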

Details Motivation: To address WSI classification while handling the length extrapolation problem, where training and testing sequences differ in length, and to reduce the need for custom CUDA kernels. Method: SAC-MIL comprises a positional encoding module that encodes position information and a SAC block that performs full instance correlation. Result: SAC-MIL achieves state-of-the-art performance on the CAMELYON-16, TCGA-LUNG, and TCGA-BRAC datasets. Conclusion: SAC-MIL is a simple, easy-to-deploy method for WSI classification with state-of-the-art performance. Abstract: We propose Spatial-Aware Correlated Multiple Instance Learning (SAC-MIL) for performing WSI classification. SAC-MIL consists of a positional encoding module to encode position information and a SAC block to perform full instance correlations. The positional encoding module utilizes the instance coordinates within the slide to encode the spatial relationships instead of the instance index in the input WSI sequence. The positional encoding module can also handle the length extrapolation issue where the training and testing sequences have different lengths. The SAC block is an MLP-based method that performs full instance correlation in linear time complexity with respect to the sequence length. Due to the simple structure of MLP, it is easy to deploy since it does not require custom CUDA kernels, compared to Transformer-based methods for WSI classification. SAC-MIL has achieved state-of-the-art performance on the CAMELYON-16, TCGA-LUNG, and TCGA-BRAC datasets. The code will be released upon acceptance.

[81] Improving Vessel Segmentation with Multi-Task Learning and Auxiliary Data Available Only During Model Training

Daniel Sobotka,Alexander Herold,Matthias Perkonigg,Lucian Beer,Nina Bastati,Alina Sablatnig,Ahmed Ba-Ssalamah,Georg Langs

Main category: cs.CV

TL;DR: This paper proposes a multi-task learning approach to segment liver vessels in non-contrast MRI by leveraging contrast-enhanced MRI data during training, reducing the need for large annotated datasets and improving segmentation accuracy.

Details Motivation: Liver vessel segmentation in MRI is crucial for analyzing vascular remodeling in diffuse liver diseases, but existing methods depend on contrast-enhanced imaging, which is not always available. Non-contrast images are more frequently acquired but pose challenges for segmentation due to the lack of annotated data. Method: A multi-task learning framework was developed that utilizes paired native and contrast-enhanced MRI data during training to improve vessel segmentation performance in the absence of contrast-enhanced images during inference. Result: The method achieves improved vessel segmentation accuracy, especially when only a limited number of annotated examples are available. The benefit of using auxiliary contrast-enhanced data is most significant under such conditions, as it enhances feature representation through shared task structure. Conclusion: The proposed multi-task learning framework effectively improves liver vessel segmentation in non-contrast MRI data by leveraging auxiliary contrast-enhanced MRI data during training, reducing the reliance on large-scale annotated datasets. Abstract: Liver vessel segmentation in magnetic resonance imaging data is important for the computational analysis of vascular remodelling, associated with a wide spectrum of diffuse liver diseases. Existing approaches rely on contrast enhanced imaging data, but the necessary dedicated imaging sequences are not uniformly acquired. Images without contrast enhancement are acquired more frequently, but vessel segmentation is challenging, and requires large-scale annotated data. We propose a multi-task learning framework to segment vessels in liver MRI without contrast. It exploits auxiliary contrast enhanced MRI data available only during training to reduce the need for annotated training examples. Our approach draws on paired native and contrast enhanced data with and without vessel annotations for model training. 
Results show that auxiliary data improves the accuracy of vessel segmentation, even if they are not available during inference. The advantage is most pronounced if only few annotations are available for training, since the feature representation benefits from the shared task structure. A validation of this approach to augment a model for brain tumor segmentation confirms its benefits across different domains. An auxiliary informative imaging modality can augment expert annotations even if it is only available during training.

[82] Promptception: How Sensitive Are Large Multimodal Models to Prompts?

Mohamed Insaf Ismithdeen,Muhammad Uzair Khattak,Salman Khan

Main category: cs.CV

TL;DR: Promptception introduces a systematic framework for evaluating prompt sensitivity in LMMs, identifying differences between proprietary and open-source models and proposing principles for more robust and fair evaluation.

Details Motivation: Prompt design for LMMs in MCQA is poorly understood, with minor variations in prompts causing significant accuracy deviations, challenging transparent and fair evaluation. Method: Introduction of Promptception, a framework with 61 prompt types across 15 categories and 6 supercategories, used to evaluate 10 LMMs on 3 MCQA benchmarks. Result: Proprietary models show greater sensitivity to prompt phrasing, indicating tighter alignment with instruction semantics, while open-source models are steadier but struggle with complex phrasing. Conclusion: Promptception provides a systematic framework for evaluating prompt sensitivity in LMMs, highlighting the trade-offs between proprietary and open-source models and proposing tailored prompting principles for more robust and fair model evaluation. Abstract: Despite the success of Large Multimodal Models (LMMs) in recent years, prompt design for LMMs in Multiple-Choice Question Answering (MCQA) remains poorly understood. We show that even minor variations in prompt phrasing and structure can lead to accuracy deviations of up to 15% for certain prompts and models. This variability poses a challenge for transparent and fair LMM evaluation, as models often report their best-case performance using carefully selected prompts. To address this, we introduce Promptception, a systematic framework for evaluating prompt sensitivity in LMMs. It consists of 61 prompt types, spanning 15 categories and 6 supercategories, each targeting specific aspects of prompt formulation, and is used to evaluate 10 LMMs ranging from lightweight open-source models to GPT-4o and Gemini 1.5 Pro, across 3 MCQA benchmarks: MMStar, MMMU-Pro, MVBench. Our findings reveal that proprietary models exhibit greater sensitivity to prompt phrasing, reflecting tighter alignment with instruction semantics, while open-source models are steadier but struggle with nuanced and complex phrasing. 
Based on this analysis, we propose Prompting Principles tailored to proprietary and open-source LMMs, enabling more robust and fair model evaluation.

[83] SliceSemOcc: Vertical Slice Based Multimodal 3D Semantic Occupancy Representation

Han Huang,Han Sun,Ningzhong Liu,Huiyu Zhou,Jiaquan Shen

Main category: cs.CV

TL;DR: SliceSemOcc improves 3D semantic occupancy prediction by capturing height-axis information and dynamically assigning attention weights, leading to better performance on small objects.

Details Motivation: Current BEV and voxel-based methods often overlook height-axis information and use uniform channel attention, limiting their ability to capture vertical semantic variations. Method: SliceSemOcc uses a vertical slice-based multimodal framework with a global-local fusion module and SEAttention3D to capture and reconcile fine-grained spatial details with holistic context. Result: Extensive experiments on nuScenes-SurroundOcc and nuScenes-OpenOccupancy datasets show significant improvements in mean IoU, especially for small-object categories. Conclusion: The SliceSemOcc framework significantly enhances 3D semantic occupancy prediction performance, especially for small-object categories, by effectively leveraging height-axis information. Abstract: Driven by autonomous driving's demands for precise 3D perception, 3D semantic occupancy prediction has become a pivotal research topic. Unlike bird's-eye-view (BEV) methods, which restrict scene representation to a 2D plane, occupancy prediction leverages a complete 3D voxel grid to model spatial structures in all dimensions, thereby capturing semantic variations along the vertical axis. However, most existing approaches overlook height-axis information when processing voxel features. And conventional SENet-style channel attention assigns uniform weight across all height layers, limiting their ability to emphasize features at different heights. To address these limitations, we propose SliceSemOcc, a novel vertical slice based multimodal framework for 3D semantic occupancy representation. Specifically, we extract voxel features along the height-axis using both global and local vertical slices. Then, a global local fusion module adaptively reconciles fine-grained spatial details with holistic contextual information. Furthermore, we propose the SEAttention3D module, which preserves height-wise resolution through average pooling and assigns dynamic channel attention weights to each height layer. 
Extensive experiments on nuScenes-SurroundOcc and nuScenes-OpenOccupancy datasets verify that our method significantly enhances mean IoU, achieving especially pronounced gains on most small-object categories. Detailed ablation studies further validate the effectiveness of the proposed SliceSemOcc framework.

[84] Detecting Regional Spurious Correlations in Vision Transformers via Token Discarding

Solha Kang,Esla Timothy Anzaku,Wesley De Neve,Arnout Van Messem,Joris Vankerschaver,Francois Rameau,Utku Ozbulak

Main category: cs.CV

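TL;DR: This paper studies a method for detecting spurious correlations in vision transformers, validates it through large-scale experiments, and discusses how the training method affects a model's reliance on spurious correlations.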

Details Motivation: Neural network models may exploit unintended patterns in the data to make predictions; such spurious correlations can make models unreliable and hurt generalization. Detecting and mitigating them is essential for building trustworthy machine learning models. Method: Proposes a new method for detecting spurious correlations in vision transformers and runs experiments on the ImageNet dataset with both supervised and self-supervised trained models. Result: Experiments show that the proposed method effectively identifies spurious correlations in vision transformers, that the training method significantly affects a model's reliance on them, and that certain ImageNet classes contain spurious signals. Conclusion: The work underscores the importance of detecting and mitigating spurious correlations, provides a list of the affected images with a call for caution in future research, and demonstrates practical value through a breast cancer classification case study. Abstract: Due to their powerful feature association capabilities, neural network-based computer vision models have the ability to detect and exploit unintended patterns within the data, potentially leading to correct predictions based on incorrect or unintended but statistically relevant signals. These clues may vary from simple color aberrations to small texts within the image. In situations where these unintended signals align with the predictive task, models can mistakenly link these features with the task and rely on them for making predictions. This phenomenon is referred to as spurious correlations, where patterns appear to be associated with the task but are actually coincidental. As a result, detection and mitigation of spurious correlations have become crucial tasks for building trustworthy, reliable, and generalizable machine learning models. In this work, we present a novel method to detect spurious correlations in vision transformers, a type of neural network architecture that gained significant popularity in recent years. Using both supervised and self-supervised trained models, we present large-scale experiments on the ImageNet dataset demonstrating the ability of the proposed method to identify spurious correlations. We also find that, even if the same architecture is used, the training methodology has a significant impact on the model's reliance on spurious correlations. Furthermore, we show that certain classes in the ImageNet dataset contain spurious signals that are easily detected by the models and discuss the underlying reasons for those spurious signals. 
In light of our findings, we provide an exhaustive list of the aforementioned images and call for caution in their use in future research efforts. Lastly, we present a case study investigating spurious signals in invasive breast mass classification, grounding our work in real-world scenarios.

[85] Learning from Majority Label: A Novel Problem in Multi-class Multiple-Instance Learning

Shiku Kaito,Shinnosuke Matsuo,Daiki Suehiro,Ryoma Bise

Main category: cs.CV

TL;DR: This paper introduces Learning from Majority Label (LML), a new multi-class Multiple-Instance Learning (MIL) approach, which uses a Counting Network and a Majority Proportion Enhancement Module (MPEM) to improve classification performance by leveraging bag-level majority labels.

Details Motivation: The motivation is to address the limitations of traditional Multiple-Instance Learning (MIL) methods by introducing a novel problem (LML) that uses majority labels for bag-level classification, which has applications in areas like pathology image segmentation, political voting prediction, customer sentiment analysis, and environmental monitoring. Method: The paper proposes a novel approach called Learning from Majority Label (LML), where a Counting Network is trained to generate bag-level majority labels by counting instances in each class. Additionally, a Majority Proportion Enhancement Module (MPEM) is developed to enhance learning by removing minority class instances in bags. Result: Experiments show that the proposed LML method outperforms conventional MIL approaches on four datasets. Ablation studies confirm the effectiveness of both the Counting Network and the Majority Proportion Enhancement Module (MPEM). Conclusion: The paper concludes that the proposed method, Learning from Majority Label (LML), effectively trains a classification model to estimate instance classes using bag-level majority labels, with the proposed Counting Network and Majority Proportion Enhancement Module (MPEM) outperforming conventional Multiple-Instance Learning (MIL) methods. Abstract: The paper proposes a novel multi-class Multiple-Instance Learning (MIL) problem called Learning from Majority Label (LML). In LML, the majority class of instances in a bag is assigned as the bag-level label. The goal of LML is to train a classification model that estimates the class of each instance using the majority label. This problem is valuable in a variety of applications, including pathology image segmentation, political voting prediction, customer sentiment analysis, and environmental monitoring. To solve LML, we propose a Counting Network trained to produce bag-level majority labels, estimated by counting the number of instances in each class. 
Furthermore, analysis experiments on the characteristics of LML revealed that bags with a high proportion of the majority class facilitate learning. Based on this result, we developed a Majority Proportion Enhancement Module (MPEM) that increases the proportion of the majority class by removing minority class instances within the bags. Experiments demonstrate the superiority of the proposed method on four datasets compared to conventional MIL methods. Moreover, ablation studies confirmed the effectiveness of each module. The code is available at https://github.com/Shiku-Kaito/Learning-from-Majority-Label-A-Novel-Problem-in-Multi-class-Multiple-Instance-Learning.
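
To make the LML label definition concrete, the sketch below derives a bag-level majority label and the majority proportion from instance labels in plain Python. It illustrates only how the supervision signal is defined, not the Counting Network or MPEM themselves; the function name `majority_label` is hypothetical, and ties are assumed not to occur.

```python
from collections import Counter


def majority_label(instance_labels):
    """Return (label, proportion) of the most frequent class in a bag.

    Assumes a non-empty bag with a unique majority class.
    """
    if not instance_labels:
        raise ValueError("a bag must contain at least one instance")
    counts = Counter(instance_labels)
    label, count = counts.most_common(1)[0]
    return label, count / len(instance_labels)


if __name__ == "__main__":
    bag = ["tumor", "stroma", "tumor", "tumor", "normal"]
    print(majority_label(bag))
```

In the LML setting only the returned label would be visible during training. The proportion is the quantity that, according to the abstract, makes a bag easier to learn from when it is high.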

[86] Millisecond-Response Tracking and Gazing System for UAVs: A Domestic Solution Based on "Phytium + Cambricon"

Yuchen Zhu,Longxiang Yin,Kai Zhao

Main category: cs.CV

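TL;DR: Using a heterogeneous computing architecture built on Phytium processors and Cambricon accelerator cards, this study implements an efficient UAV tracking system with millisecond-level response, addressing the response delay of traditional video surveillance systems in dynamic scenes.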

Details Motivation: In current video surveillance research and applications, traditional camera systems suffer response delays exceeding 200 ms in dynamic scenes, caused by the insufficient deep feature extraction capability of automatic recognition algorithms and the efficiency bottleneck of computing architectures; they cannot meet real-time requirements in complex scenes. This study aims to solve that problem. Method: At the hardware level, the system adopts a collaborative computing architecture of Phytium FT-2000/4 processors and MLU220 accelerator cards, with multi-card parallelism to increase computing power. At the software level, it combines a lightweight YOLOv5s detection network with the DeepSORT cascaded tracking algorithm to form a closed-loop "detection-tracking-feedback" control chain. Result: Experiments show that, when processing 1920*1080 video streams, the system achieves a stable single-frame comprehensive processing delay of 50-100 ms and a multi-scale target recognition accuracy above 98.5%, combining low latency with high precision. Conclusion: By building a heterogeneous computing architecture on Phytium processors and Cambricon accelerator cards, the study delivers a UAV tracking and gazing system with millisecond-level response and offers an innovative solution for applying domestic chips to UAV monitoring. Abstract: In the frontier research and application of current video surveillance technology, traditional camera systems exhibit significant limitations of response delay exceeding 200 ms in dynamic scenarios due to the insufficient deep feature extraction capability of automatic recognition algorithms and the efficiency bottleneck of computing architectures, failing to meet the real-time requirements in complex scenes. To address this issue, this study proposes a heterogeneous computing architecture based on Phytium processors and Cambricon accelerator cards, constructing a UAV tracking and gazing system with millisecond-level response capability. At the hardware level, the system adopts a collaborative computing architecture of Phytium FT-2000/4 processors and MLU220 accelerator cards, enhancing computing power through multi-card parallelism. At the software level, it innovatively integrates a lightweight YOLOv5s detection network with a DeepSORT cascaded tracking algorithm, forming a closed-loop control chain of "detection-tracking-feedback". Experimental results demonstrate that the system achieves a stable single-frame comprehensive processing delay of 50-100 ms in 1920*1080 resolution video stream processing, with a multi-scale target recognition accuracy of over 98.5%, featuring both low latency and high precision. This study provides an innovative solution for UAV monitoring and the application of domestic chips.

[87] A Re-ranking Method using K-nearest Weighted Fusion for Person Re-identification

Quang-Huy Che,Le-Chuong Nguyen,Gia-Nghia Tran,Dinh-Duy Phan,Vinh-Tiep Nguyen

Main category: cs.CV

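TL;DR: This paper proposes an efficient re-ranking method for person re-identification that requires no model fine-tuning or extra annotations; it generates multi-view features via K-nearest weighted fusion and significantly improves recognition accuracy and computational efficiency.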

Details Motivation: Previous person re-identification research has focused mainly on single-view image features, which can cause view bias and issues such as pose variation, viewpoint changes, and occlusions. Method: Proposes a K-nearest Weighted Fusion (KWF) method that generates multi-view features by aggregating neighbors' features, and explores weight selection strategies for the aggregation. Result: Evaluation on the Market1501, MSMT17, and Occluded-DukeMTMC datasets shows that the method significantly improves Rank@1 and mAP when re-ranking the top M candidates of the initial ranking, and that it is more computationally efficient than other re-ranking methods. Conclusion: The paper presents an efficient re-ranking method that generates multi-view features with K-nearest weighted fusion, effectively reducing view bias and improving the accuracy and computational efficiency of person re-identification. Abstract: In person re-identification, re-ranking is a crucial step to enhance the overall accuracy by refining the initial ranking of retrieved results. Previous studies have mainly focused on features from single-view images, which can cause view bias and issues like pose variation, viewpoint changes, and occlusions. Using multi-view features to represent a person can help reduce view bias. In this work, we present an efficient re-ranking method that generates multi-view features by aggregating neighbors' features using K-nearest Weighted Fusion (KWF) method. Specifically, we hypothesize that features extracted from re-identification models are highly similar when representing the same identity. Thus, we select K neighboring features in an unsupervised manner to generate multi-view features. Additionally, this study explores the weight selection strategies during feature aggregation, allowing us to identify an effective strategy. Our re-ranking approach does not require model fine-tuning or extra annotations, making it applicable to large-scale datasets. We evaluate our method on the person re-identification datasets Market1501, MSMT17, and Occluded-DukeMTMC. The results show that our method significantly improves Rank@1 and mAP when re-ranking the top M candidates from the initial ranking results. Specifically, compared to the initial results, our re-ranking method achieves improvements of 9.8%/22.0% in Rank@1 on the challenging datasets: MSMT17 and Occluded-DukeMTMC, respectively. Furthermore, our approach demonstrates substantial enhancements in computational efficiency compared to other re-ranking methods.
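
The core idea, replacing each feature with a weighted combination of its K most similar features, can be sketched as follows in plain Python. This is an illustration under stated assumptions rather than the authors' implementation: cosine similarity, uniform default weights, counting a feature as its own nearest neighbor, and the name `knn_weighted_fusion` are all choices made here, since the abstract does not specify them.

```python
import math


def normalize(v):
    """Scale a vector to unit length (assumes a non-zero vector)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def dot(a, b):
    """Dot product; equals cosine similarity for unit-length inputs."""
    return sum(x * y for x, y in zip(a, b))


def knn_weighted_fusion(features, k, weights=None):
    """Fuse each feature with its k most similar features (itself included)."""
    feats = [normalize(f) for f in features]
    if weights is None:
        # Assumption: uniform weights; the paper explores other strategies.
        weights = [1.0 / k] * k
    fused = []
    for f in feats:
        # Indices of the k features most similar to f, best first.
        order = sorted(range(len(feats)), key=lambda j: dot(f, feats[j]), reverse=True)[:k]
        agg = [sum(w * feats[j][d] for w, j in zip(weights, order)) for d in range(len(f))]
        fused.append(normalize(agg))
    return fused


if __name__ == "__main__":
    gallery = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
    for vec in knn_weighted_fusion(gallery, k=2):
        print([round(x, 3) for x in vec])
```

Re-ranking would then recompute query-to-gallery distances on the fused features. Because no parameters are learned, the step needs no fine-tuning or labels, which matches the property the abstract emphasizes.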

[88] TEn-CATS: Text-Enriched Audio-Visual Video Parsing with Multi-Scale Category-Aware Temporal Graph

Yaru Chen,Faegheh Sardari,Peiliang Zhang,Ruohao Guo,Yang Xiang,Zhenbo Li,Wenwu Wang

Main category: cs.CV

TL;DR: This paper proposes a novel method for Audio-Visual Video Parsing (AVVP) that combines BiT and CATS modules to improve event detection accuracy by addressing error amplification issues in existing approaches, achieving SOTA results on two datasets.

Details Motivation: Existing AVVP methods either rely on noisy pseudo-labels or indiscriminate attention mechanisms, leading to error amplification during training. This work aims to address these limitations by developing a more robust and accurate approach for event detection in videos. Method: The method combines two modules: the BiT module for semantic injection and dynamic calibration of audio-visual features, and the CATS module for semantic propagation and temporal connection. This approach leverages the complementarity of previous research directions to improve semantic cue extraction and dissemination over time. Result: The proposed method achieves state-of-the-art (SOTA) performance across multiple key indicators on the LLP and UnAV-100 benchmark datasets. Conclusion: The proposed method combining Bi-Directional Text Fusion (BiT) module and Category-Aware Temporal Graph (CATS) module effectively addresses the limitations of existing AVVP methods by integrating semantic injection, dynamic calibration, and semantic propagation, achieving state-of-the-art performance on two benchmark datasets. Abstract: Audio-Visual Video Parsing (AVVP) task aims to identify event categories and their occurrence times in a given video with weakly supervised labels. Existing methods typically fall into two categories: (i) designing enhanced architectures based on attention mechanism for better temporal modeling, and (ii) generating richer pseudo-labels to compensate for the absence of frame-level annotations. However, the first type methods treat noisy segment-level pseudo labels as reliable supervision and the second type methods let indiscriminate attention spread them across all frames, the initial errors are repeatedly amplified during training. To address this issue, we propose a method that combines the Bi-Directional Text Fusion (BiT) module and Category-Aware Temporal Graph (CATS) module. 
Specifically, we integrate the strengths and complementarity of the two previous research directions. We first perform semantic injection and dynamic calibration on audio and visual modality features through the BiT module, to locate and purify cleaner and richer semantic cues. Then, we leverage the CATS module for semantic propagation and connection to enable precise semantic information dissemination across time. Experimental results demonstrate that our proposed method achieves state-of-the-art (SOTA) performance in multiple key indicators on two benchmark datasets, LLP and UnAV-100.

[89] TriLiteNet: Lightweight Model for Multi-Task Visual Perception

Quang-Huy Che,Duc-Khai Lam

Main category: cs.CV

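TL;DR: TriLiteNet is an efficient multi-task perception model for autonomous driving that achieves strong performance with low computational resources, making it suitable for real-time applications.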

Details Motivation: Advanced Driver Assistance Systems (ADAS) require rapid processing and response to improve real-time performance and safety, which calls for efficient perception models. Method: Proposes the TriLiteNet model, whose optimized design handles multiple panoramic driving perception tasks at once: vehicle detection, drivable area segmentation, and lane line segmentation. Result: Experiments on the BDD100k dataset show that TriLiteNet_{base} performs well on the three key tasks, with a recall of 85.6%, an mIoU of 92.4%, and an accuracy of 82.3%, at a computational cost of only 7.72 GFLOPs. Conclusion: TriLiteNet performs well on real-time autonomous driving perception tasks while keeping computational cost low, offering a practical and deployable solution for real-world applications. Abstract: Efficient perception models are essential for Advanced Driver Assistance Systems (ADAS), as these applications require rapid processing and response to ensure safety and effectiveness in real-world environments. To address the real-time execution needs of such perception models, this study introduces the TriLiteNet model. This model can simultaneously manage multiple tasks related to panoramic driving perception. TriLiteNet is designed to optimize performance while maintaining low computational costs. Experimental results on the BDD100k dataset demonstrate that the model achieves competitive performance across three key tasks: vehicle detection, drivable area segmentation, and lane line segmentation. Specifically, the TriLiteNet_{base} demonstrated a recall of 85.6% for vehicle detection, a mean Intersection over Union (mIoU) of 92.4% for drivable area segmentation, and an Acc of 82.3% for lane line segmentation with only 2.35M parameters and a computational cost of 7.72 GFLOPs. Our proposed model includes a tiny configuration with just 0.14M parameters, which provides a multi-task solution with minimal computational demand. Evaluated for latency and power consumption on embedded devices, TriLiteNet in both configurations shows low latency and reasonable power during inference. By balancing performance, computational efficiency, and scalability, TriLiteNet offers a practical and deployable solution for real-world autonomous driving applications. Code is available at https://github.com/chequanghuy/TriLiteNet.

[90] DVS-PedX: Synthetic-and-Real Event-Based Pedestrian Dataset

Mustafa Sakhai,Kaung Sithu,Min Khant Soe Oke,Maciej Wielgosz

Main category: cs.CV

TL;DR: DVS-PedX is a new neuromorphic dataset for pedestrian detection and intention analysis, combining synthetic and real-world event streams to support research in event-based pedestrian safety and neuromorphic perception.

Details Motivation: Event cameras, such as Dynamic Vision Sensors (DVS), provide advantages like low latency, high dynamic range, and motion robustness. However, there is a need for comprehensive datasets to study pedestrian detection and crossing-intention analysis in varying conditions. Method: The dataset combines synthetic event streams generated in the CARLA simulator with real-world JAAD dash-cam videos converted to event streams using the v2e tool. It includes paired RGB frames, DVS event frames, and frame-level labels, along with raw event files and metadata. Result: DVS-PedX offers data from controlled 'approach-cross' scenes under varied weather and lighting, as well as natural behaviors from real-world videos. Baseline spiking neural networks demonstrate dataset usability and reveal a sim-to-real gap. Conclusion: DVS-PedX is a neuromorphic dataset aiming to advance research in event-based pedestrian safety, intention prediction, and neuromorphic perception, highlighting the need for domain adaptation and multimodal fusion. Abstract: Event cameras like Dynamic Vision Sensors (DVS) report micro-timed brightness changes instead of full frames, offering low latency, high dynamic range, and motion robustness. DVS-PedX (Dynamic Vision Sensor Pedestrian eXploration) is a neuromorphic dataset designed for pedestrian detection and crossing-intention analysis in normal and adverse weather conditions across two complementary sources: (1) synthetic event streams generated in the CARLA simulator for controlled "approach-cross" scenes under varied weather and lighting; and (2) real-world JAAD dash-cam videos converted to event streams using the v2e tool, preserving natural behaviors and backgrounds. Each sequence includes paired RGB frames, per-frame DVS "event frames" (33 ms accumulations), and frame-level labels (crossing vs. not crossing). We also provide raw AEDAT 2.0/AEDAT 4.0 event files and AVI DVS video files and metadata for flexible re-processing. 
Baseline spiking neural networks (SNNs) using SpikingJelly illustrate dataset usability and reveal a sim-to-real gap, motivating domain adaptation and multimodal fusion. DVS-PedX aims to accelerate research in event-based pedestrian safety, intention prediction, and neuromorphic perception.
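The per-frame "event frames" (33 ms accumulations) described in the abstract can be illustrated with a minimal sketch. The event layout `(t_us, x, y, polarity)`, the signed-count representation, and the function name `accumulate_event_frames` are assumptions made for illustration; they are not taken from the dataset's tooling.

```python
def accumulate_event_frames(events, width, height, window_us=33_000):
    """Bin events into signed count frames, one frame per time window.

    events: iterable of (t_us, x, y, polarity) tuples (hypothetical layout).
    Windows that contain no events are skipped in the returned list.
    """
    frames = {}
    for t_us, x, y, polarity in events:
        idx = t_us // window_us  # index of the 33 ms window this event falls in
        if idx not in frames:
            frames[idx] = [[0] * width for _ in range(height)]
        # ON events add one count, OFF events subtract one
        frames[idx][y][x] += 1 if polarity else -1
    return [frames[i] for i in sorted(frames)]


events = [(0, 0, 0, 1), (1_000, 1, 0, 0), (40_000, 0, 1, 1)]
frames = accumulate_event_frames(events, width=2, height=2)
```

Here the first two events fall in the first 33 ms window and the third in the second, so two frames are produced.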

[91] TaleDiffusion: Multi-Character Story Generation with Dialogue Rendering

Ayan Banerjee,Josep Lladós,Umapada Pal,Anjan Dutta

Main category: cs.CV

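TL;DR: TaleDiffusion improves the quality of text-to-story visualization and the accuracy of dialogue rendering through iterative generation and character-consistency control.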

Details Motivation: Existing text-to-story visualization methods struggle with character consistency, which leads to artifacts and inaccurate dialogue rendering and makes the resulting stories disjointed. Method: A pre-trained LLM generates per-frame descriptions, character details, and dialogues; an attention-based masking technique controls character interactions; identity-consistent self-attention and region-aware cross-attention are applied; and dialogues are rendered as bubbles via CLIPSeg. Result: Experiments show that TaleDiffusion outperforms existing methods in consistency, noise reduction, and dialogue rendering. Conclusion: TaleDiffusion addresses the challenges of text-to-story visualization through an iterative process, character-consistency preservation, and dialogue assignment in postprocessing. Abstract: Text-to-story visualization is challenging due to the need for consistent interaction among multiple characters across frames. Existing methods struggle with character consistency, leading to artifact generation and inaccurate dialogue rendering, which results in disjointed storytelling. In response, we introduce TaleDiffusion, a novel framework for generating multi-character stories with an iterative process, maintaining character consistency, and accurate dialogue assignment via postprocessing. Given a story, we use a pre-trained LLM to generate per-frame descriptions, character details, and dialogues via in-context learning, followed by a bounded attention-based per-box mask technique to control character interactions and minimize artifacts. We then apply an identity-consistent self-attention mechanism to ensure character consistency across frames and region-aware cross-attention for precise object placement. Dialogues are also rendered as bubbles and assigned to characters via CLIPSeg. Experimental results demonstrate that TaleDiffusion outperforms existing methods in consistency, noise reduction, and dialogue rendering.

[92] MEPG:Multi-Expert Planning and Generation for Compositionally-Rich Image Generation

Yuan Zhao,Liu Lin

Main category: cs.CV

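TL;DR: This paper proposes MEPG, a text-to-image generation framework that combines large language models with expert diffusion modules and achieves significant gains in complex prompt understanding and style diversity.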

Details Motivation: Text-to-image diffusion models are limited in handling complex multi-element prompts and in producing diverse styles, so a new approach is needed to improve their expressiveness and flexibility. Method: The framework has two core components, a Position-Style-Aware (PSA) module and a Multi-Expert Diffusion (MED) module. The PSA module uses a supervised fine-tuned LLM to decompose the input prompt into spatial coordinates and style-encoded semantic instructions; the MED module generates images across local regions and global areas through dynamic routing, with an attention-based gating mechanism that selectively activates specific models. Result: Experiments show that MEPG significantly outperforms baseline models with the same backbone in image quality and style diversity. Conclusion: By combining position- and style-aware LLMs with spatial-semantic expert modules, MEPG effectively improves text-to-image diffusion models on complex multi-element prompts and style diversity. Abstract: Text-to-image diffusion models have achieved remarkable image quality, but they still struggle with complex, multi-element prompts and limited stylistic diversity. To address these limitations, we propose a Multi-Expert Planning and Generation Framework (MEPG) that synergistically integrates position- and style-aware large language models (LLMs) with spatial-semantic expert modules. The framework comprises two core components: (1) a Position-Style-Aware (PSA) module that utilizes a supervised fine-tuned LLM to decompose input prompts into precise spatial coordinates and style-encoded semantic instructions; and (2) a Multi-Expert Diffusion (MED) module that implements cross-region generation through dynamic expert routing across both local regions and global areas. During the generation process for each local region, specialized models (e.g., realism experts, stylization specialists) are selectively activated for each spatial partition via attention-based gating mechanisms. The architecture supports lightweight integration and replacement of expert models, providing strong extensibility. Additionally, an interactive interface enables real-time spatial layout editing and per-region style selection from a portfolio of experts. Experiments show that MEPG significantly outperforms baseline models with the same backbone in both image quality and style diversity.

[93] Revisiting Simple Baselines for In-The-Wild Deepfake Detection

Orlando Castaneda,Kevin So-Tang,Kshitij Gurung

Main category: cs.CV

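TL;DR: This paper shows that, with better-tuned hyperparameters, the baseline method of Ojha et al. reaches 81% accuracy on the in-the-wild benchmark Deepfake-Eval-2024, comparable to commercial deepfake detectors.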

Details Motivation: Existing deepfake detectors are mostly evaluated on highly controlled datasets; this paper focuses on the in-the-wild benchmark Deepfake-Eval-2024, aiming to improve the practicality and performance of detectors. Method: The authors revisit the baseline method proposed by Ojha et al. and tune its hyperparameters to improve performance. Result: With tuned hyperparameters, the method of Ojha et al. reaches 81% accuracy, 18% above the accuracy previously reported for this baseline, and is competitive with leading commercial detectors. Conclusion: With better-tuned hyperparameters, the baseline method of Ojha et al. reaches 81% accuracy on Deepfake-Eval-2024, exceeding its previously reported accuracy by 18% and competing with commercial deepfake detectors. Abstract: The widespread adoption of synthetic media demands accessible deepfake detectors and realistic benchmarks. While most existing research evaluates deepfake detectors on highly controlled datasets, we focus on the recently released "in-the-wild" benchmark, Deepfake-Eval-2024. Initial reporting on Deepfake-Eval-2024 showed that three finetuned open-source models achieve accuracies between 61% and 69%, significantly lagging behind the leading commercial deepfake detector with 82% accuracy. Our work revisits one of these baseline approaches, originally introduced by Ojha et al., which adapts standard pretrained vision backbones to produce generalizable deepfake detectors. We demonstrate that with better-tuned hyperparameters, this simple approach actually yields much higher performance -- 81% accuracy on Deepfake-Eval-2024 -- surpassing the previously reported accuracy of this baseline approach by 18% and competing with commercial deepfake detectors. We discuss tradeoffs in accuracy, computational costs, and interpretability, focusing on how practical these deepfake detectors might be when deployed in real-world settings. Our code can be found at https://github.com/Deepfake-Detection-KKO/deepfake-detection.

[94] YOLO Ensemble for UAV-based Multispectral Defect Detection in Wind Turbine Components

Serhii Svystun,Pavlo Radiuk,Oleksandr Melnychenko,Oleg Savenko,Anatoliy Sachenko

Main category: cs.CV

TL;DR: This research proposes an ensemble of YOLO-based deep learning models that integrate visible and thermal channels to improve defect detection in wind power plants, achieving higher accuracy than a standalone YOLOv8 model.

Details Motivation: The motivation behind the study is to enhance defect detection accuracy in wind power plants, particularly for monitoring critical components like blades and towers, which require high-resolution data and efficient processing methods. Method: The study developed an ensemble of YOLO-based deep learning models that integrate both visible and thermal channels using a sophisticated bounding box fusion algorithm to combine their predictions. Result: The proposed approach achieved a mean Average Precision (mAP@.5) of 0.93 and an F1-score of 0.90, outperforming a standalone YOLOv8 model, which scored an mAP@.5 of 0.91. Conclusion: The study concludes that combining multiple YOLO architectures with fused multispectral data provides a more reliable solution for improving the detection of both visual and thermal defects in wind power plants. Abstract: Unmanned aerial vehicles (UAVs) equipped with advanced sensors have opened up new opportunities for monitoring wind power plants, including blades, towers, and other critical components. However, reliable defect detection requires high-resolution data and efficient methods to process multispectral imagery. In this research, we aim to enhance defect detection accuracy through the development of an ensemble of YOLO-based deep learning models that integrate both visible and thermal channels. We propose an ensemble approach that integrates a general-purpose YOLOv8 model with a specialized thermal model, using a sophisticated bounding box fusion algorithm to combine their predictions. Our experiments show this approach achieves a mean Average Precision (mAP@.5) of 0.93 and an F1-score of 0.90, outperforming a standalone YOLOv8 model, which scored an mAP@.5 of 0.91. These findings demonstrate that combining multiple YOLO architectures with fused multispectral data provides a more reliable solution, improving the detection of both visual and thermal defects.
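The abstract does not specify the bounding box fusion algorithm, so the following is only a generic sketch of how predictions from a visible-light model and a thermal model might be combined: overlapping boxes are merged with a confidence-weighted average, and unmatched boxes are kept. The function names, the `(box, score)` layout, and the 0.5 IoU threshold are assumptions, not the authors' method.

```python
def iou(a, b):
    """Intersection-over-Union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def fuse_detections(rgb_dets, thermal_dets, iou_thr=0.5):
    """Merge two detection lists; each detection is (box, score).

    Hypothetical fusion rule: a visible-light box is paired with the
    best-overlapping unused thermal box (IoU >= iou_thr) and the pair is
    replaced by a confidence-weighted average box.
    """
    fused, used = [], set()
    for box_a, score_a in rgb_dets:
        best_j, best_iou = None, iou_thr
        for j, (box_b, _) in enumerate(thermal_dets):
            if j in used:
                continue
            overlap = iou(box_a, box_b)
            if overlap >= best_iou:
                best_j, best_iou = j, overlap
        if best_j is None:
            fused.append((box_a, score_a))  # no thermal counterpart
            continue
        used.add(best_j)
        box_b, score_b = thermal_dets[best_j]
        total = score_a + score_b
        w_a = score_a / total if total > 0 else 0.5  # equal weights if both scores are 0
        box = tuple(w_a * p + (1.0 - w_a) * q for p, q in zip(box_a, box_b))
        fused.append((box, max(score_a, score_b)))
    # keep thermal-only detections, e.g. defects invisible in the RGB channel
    fused.extend(d for j, d in enumerate(thermal_dets) if j not in used)
    return fused


rgb = [((0, 0, 10, 10), 0.8)]
thermal = [((0, 0, 10, 10), 0.2), ((50, 50, 60, 60), 0.9)]
fused = fuse_detections(rgb, thermal)
```

Keeping unmatched boxes from both lists reflects the stated goal of detecting both visual and thermal defects; a stricter design could instead require agreement between the two models.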

[95] VisioFirm: Cross-Platform AI-assisted Annotation Tool for Computer Vision

Safouane El Ghazouali,Umberto Michelucci

Main category: cs.CV

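TL;DR: VisioFirm is an open-source image annotation tool built on AI-assisted automation that substantially reduces manual effort and makes labeling large datasets more efficient.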

Details Motivation: Traditional image annotation tools usually require extensive manual input, which limits scalability to large datasets. VisioFirm aims to reduce human intervention and improve labeling efficiency through AI-assisted automation. Method: VisioFirm combines CLIP with pre-trained detectors (such as Ultralytics models) and zero-shot models (such as Grounding DINO) to generate initial annotations, using a low confidence threshold to maximize recall. It also uses an IoU graph to suppress redundant detections and CLIP-based clustering for disambiguation. Result: On COCO-type classes, VisioFirm's initial predictions were mostly correct, and interactive tools let users refine them. Benchmarks show up to a 90% reduction in manual effort across multiple datasets while maintaining high accuracy. Conclusion: VisioFirm is an open-source web application that streamlines image labeling through AI-assisted automation, cutting manual effort by up to 90% across multiple datasets while maintaining high annotation accuracy. Abstract: AI models rely on annotated data to learn patterns and make predictions. Annotation is usually a labor-intensive step that requires associating labels ranging from a simple classification label to more complex tasks such as object detection, oriented bounding box estimation, and instance segmentation. Traditional tools often require extensive manual input, limiting scalability for large datasets. To address this, we introduce VisioFirm, an open-source web application designed to streamline image labeling through AI-assisted automation. VisioFirm integrates state-of-the-art foundation models into an interface with a filtering pipeline to reduce human-in-the-loop efforts. This hybrid approach employs CLIP combined with pre-trained detectors like Ultralytics models for common classes and zero-shot models such as Grounding DINO for custom labels, generating initial annotations with low-confidence thresholding to maximize recall. Through this framework, when tested on COCO-type classes, initial predictions proved to be mostly correct, though users can refine them via interactive tools supporting bounding boxes, oriented bounding boxes, and polygons. Additionally, VisioFirm has on-the-fly segmentation powered by Segment Anything accelerated through WebGPU for browser-side efficiency. The tool supports multiple export formats (YOLO, COCO, Pascal VOC, CSV) and operates offline after model caching, enhancing accessibility.
VisioFirm demonstrates up to 90% reduction in manual effort through benchmarks on diverse datasets, while maintaining high annotation accuracy via CLIP-based disambiguation of clustered connected components and an IoU graph for redundant detection suppression. VisioFirm can be accessed from https://github.com/OschAI/VisioFirm.
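The abstract mentions an IoU graph for suppressing redundant detections without giving details. One plausible reading, sketched below under that assumption, is to link boxes whose IoU exceeds a threshold, take connected components of the resulting graph, and keep the highest-scoring box per component. The function names and the 0.5 threshold are hypothetical and do not come from the VisioFirm code base.

```python
def box_iou(a, b):
    """Intersection-over-Union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def suppress_by_iou_graph(dets, iou_thr=0.5):
    """Keep one (box, score) detection per connected component of the IoU graph."""
    n = len(dets)
    parent = list(range(n))  # union-find forest over detection indices

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    # add an edge between every pair of boxes that overlap enough
    for i in range(n):
        for j in range(i + 1, n):
            if box_iou(dets[i][0], dets[j][0]) >= iou_thr:
                parent[find(i)] = find(j)

    best = {}
    for i, det in enumerate(dets):
        root = find(i)
        if root not in best or det[1] > best[root][1]:
            best[root] = det
    return sorted(best.values(), key=lambda d: -d[1])


dets = [((0, 0, 10, 10), 0.6), ((1, 1, 11, 11), 0.9), ((100, 100, 110, 110), 0.3)]
kept = suppress_by_iou_graph(dets)
```

Unlike greedy non-maximum suppression, the component view also merges chains of boxes that overlap only transitively, which suits a low-confidence, high-recall first pass.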

[96] DUDE: Diffusion-Based Unsupervised Cross-Domain Image Retrieval

Ruohong Yang,Peng Hu,Yunfan Li,Xi Peng

Main category: cs.CV

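TL;DR: DUDE is a new unsupervised cross-domain image retrieval method that tackles the domain gap by disentangling object features from domain-specific styles.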

Details Motivation: Existing UCIR methods often struggle with the domain gap because the key object features are frequently entangled with domain-specific styles. Method: DUDE uses a text-to-image generative model to disentangle object features from domain-specific styles, and further achieves reliable alignment by aligning mutual neighbors within and across domains. Result: Experiments show that DUDE achieves state-of-the-art performance on three benchmark datasets covering 13 domains. Conclusion: DUDE is a new UCIR method that retrieves images across domains without annotations and achieves state-of-the-art performance on three benchmark datasets. Abstract: Unsupervised cross-domain image retrieval (UCIR) aims to retrieve images of the same category across diverse domains without relying on annotations. Existing UCIR methods, which align cross-domain features for the entire image, often struggle with the domain gap, as the object features critical for retrieval are frequently entangled with domain-specific styles. To address this challenge, we propose DUDE, a novel UCIR method building upon feature disentanglement. In brief, DUDE leverages a text-to-image generative model to disentangle object features from domain-specific styles, thus facilitating semantic image retrieval. To further achieve reliable alignment of the disentangled object features, DUDE aligns mutual neighbors from within domains to across domains in a progressive manner. Extensive experiments demonstrate that DUDE achieves state-of-the-art performance across three benchmark datasets over 13 domains. The code will be released.
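The mutual-neighbor idea in the abstract can be illustrated with a small sketch. It shows only the basic mutual nearest-neighbor test between two feature sets under cosine similarity; the progressive within-domain to cross-domain schedule and the actual features used by DUDE are not reproduced, and the function names are assumptions.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def mutual_nearest_neighbors(feats_a, feats_b):
    """Return (i, j) pairs where feats_b[j] is the nearest neighbor of
    feats_a[i] and feats_a[i] is also the nearest neighbor of feats_b[j]."""
    if not feats_a or not feats_b:
        return []
    nn_ab = [max(range(len(feats_b)), key=lambda j: cosine(a, feats_b[j]))
             for a in feats_a]
    nn_ba = [max(range(len(feats_a)), key=lambda i: cosine(b, feats_a[i]))
             for b in feats_b]
    return [(i, j) for i, j in enumerate(nn_ab) if nn_ba[j] == i]


domain_a = [(1.0, 0.0), (0.0, 1.0), (0.9, 0.1)]
domain_b = [(1.0, 0.05), (0.0, 1.0)]
pairs = mutual_nearest_neighbors(domain_a, domain_b)
```

The third vector in `domain_a` picks the first vector in `domain_b` as its neighbor, but the choice is not reciprocated, so it is left unpaired. Requiring reciprocity is what makes such pairs a more reliable alignment signal than one-directional nearest neighbors.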

[97] Learning Active Perception via Self-Evolving Preference Optimization for GUI Grounding

Wanfu Wang,Qipeng Huang,Guangquan Xue,Xiaobo Liang,Juntao Li

Main category: cs.CV

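TL;DR: LASER is a self-evolving framework that endows vision-language models with multi-step perception for GUI grounding, enabling precise coordinate prediction and setting a new performance benchmark.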

Details Motivation: Although VLMs have made progress in bridging visual perception and linguistic reasoning, getting a model to reason effectively over the appropriate image regions remains a core challenge, particularly under high-resolution inputs and complex multi-element visual interactions. Method: LASER combines Monte Carlo quality estimation with Intersection-over-Union (IoU)-based region quality evaluation to encourage both accuracy and diversity when constructing high-quality preference data, and adaptively allocates reasoning steps according to task complexity. Result: Comprehensive experiments on the ScreenSpot Pro and ScreenSpot-v2 benchmarks show consistent performance gains, and LASER sets a new state of the art among 7B-scale models. Conclusion: Through a self-evolving mechanism, LASER progressively endows VLMs with multi-step perception, achieving precise coordinate prediction and significant gains on GUI grounding, with a score of 55.7 on the ScreenSpot-Pro benchmark. Abstract: Vision Language Models (VLMs) have recently achieved significant progress in bridging visual perception and linguistic reasoning. Recently, the OpenAI o3 model introduced a zoom-in search strategy that effectively elicits active perception capabilities in VLMs, improving downstream task performance. However, enabling VLMs to reason effectively over appropriate image regions remains a core challenge in GUI grounding, particularly under high-resolution inputs and complex multi-element visual interactions. In this work, we propose LASER, a self-evolving framework that progressively endows VLMs with multi-step perception capabilities, enabling precise coordinate prediction. Specifically, our approach integrates Monte Carlo quality estimation with Intersection-over-Union (IoU)-based region quality evaluation to jointly encourage both accuracy and diversity in constructing high-quality preference data. This combination explicitly guides the model to focus on instruction-relevant key regions while adaptively allocating reasoning steps based on task complexity. Comprehensive experiments on the ScreenSpot Pro and ScreenSpot-v2 benchmarks demonstrate consistent performance gains, validating the effectiveness of our method. Furthermore, when fine-tuned on GTA1-7B, LASER achieves a score of 55.7 on the ScreenSpot-Pro benchmark, establishing a new state-of-the-art (SoTA) among 7B-scale models.

[98] Differential Morphological Profile Neural Networks for Semantic Segmentation

David Huangal,J. Alex Hurt

Main category: cs.CV

TL;DR: The paper explores the use of DMP features in modern segmentation networks for improved semantic segmentation of overhead remote sensing imagery.

Details Motivation: State-of-the-art segmentation networks are typically developed for ground-perspective photographs and do not directly address remote sensing challenges such as extreme scale variation, foreground-background imbalance, and large image sizes. Method: The authors integrated DMP features into three state-of-the-art convolutional and transformer semantic segmentation architectures using both direct input and hybrid architectures. Result: Hybrid DMP consistently outperforms direct-input and is capable of surpassing a non-DMP model on mIoU, F1, and Recall. Conclusion: The integration of DMP features into modern segmentation networks improves semantic segmentation performance on overhead remote sensing imagery, especially when using hybrid architectures. Abstract: Semantic segmentation of overhead remote sensing imagery enables applications in mapping, urban planning, and disaster response. State-of-the-art segmentation networks are typically developed and tuned on ground-perspective photographs and do not directly address remote sensing challenges such as extreme scale variation, foreground-background imbalance, and large image sizes. We explore the incorporation of the differential morphological profile (DMP), a multi-scale shape extraction method based on grayscale morphology, into modern segmentation networks. Prior studies have shown that the DMP can provide critical shape information to Deep Neural Networks to enable superior detection and classification performance in overhead imagery. In this work, we extend prior DMPNet work beyond classification and object detection by integrating DMP features into three state-of-the-art convolutional and transformer semantic segmentation architectures. We utilize both direct input, which adapts the input stem of feature extraction architectures to accept DMP channels, and hybrid architectures, a dual-stream design that fuses RGB and DMP encoders. 
Using the iSAID benchmark dataset, we evaluate a variety of DMP differentials and structuring element shapes to more effectively provide shape information to the model. Our results show that while non-DMP models generally outperform the direct-input variants, hybrid DMP consistently outperforms direct-input and is capable of surpassing a non-DMP model on mIoU, F1, and Recall.
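To make the idea of a differential morphological profile concrete, the sketch below computes differences between grayscale openings at successive structuring element sizes. It is a simplification under stated assumptions: it uses plain openings with square structuring elements rather than opening by reconstruction, covers only the opening half of the profile (a full DMP also includes a closing series), and runs in pure Python on nested lists. None of the names come from the DMPNet code.

```python
def _window_filter(img, radius, reducer):
    """Apply min or max over a (2*radius+1)^2 window, clipped at the borders."""
    h, w = len(img), len(img[0])
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            window = [img[yy][xx]
                      for yy in range(max(0, y - radius), min(h, y + radius + 1))
                      for xx in range(max(0, x - radius), min(w, x + radius + 1))]
            out[y][x] = reducer(window)
    return out


def opening(img, radius):
    """Grayscale opening: erosion (min filter) followed by dilation (max filter)."""
    return _window_filter(_window_filter(img, radius, min), radius, max)


def opening_profile_differences(img, radii):
    """Absolute differences between openings at successive radii.

    The first level is the original image, so the first difference map
    highlights bright structures removed by the smallest structuring element.
    """
    levels = [img] + [opening(img, r) for r in radii]
    return [[[abs(a - b) for a, b in zip(row_prev, row_cur)]
             for row_prev, row_cur in zip(prev, cur)]
            for prev, cur in zip(levels, levels[1:])]


# 7x7 image containing one bright 3x3 block
image = [[5 if 2 <= y <= 4 and 2 <= x <= 4 else 0 for x in range(7)] for y in range(7)]
profile = opening_profile_differences(image, radii=[1, 2])
```

The 3x3 block survives the opening with radius 1 and disappears at radius 2, so it shows up only in the second difference map. This is how the profile encodes object scale as extra input channels.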

[99] TauGenNet: Plasma-Driven Tau PET Image Synthesis via Text-Guided 3D Diffusion Models

Yuxin Gong,Se-in Jang,Wei Shao,Yi Su,Kuang Gong

Main category: cs.CV

TL;DR: This paper proposes a 3D diffusion model guided by text and multimodal conditions to synthesize tau PET images for Alzheimer's disease, combining structural MRI and plasma biomarker data to provide a cost-effective and non-invasive approach for visualizing tau pathology and simulating disease progression.

Details Motivation: Tau PET scans are crucial for diagnosing and monitoring Alzheimer's disease but are limited by high cost and low availability. Structural MRI and plasma biomarkers offer non-invasive and accessible alternatives for providing complementary information about brain anatomy and disease progression. Method: A text-guided 3D diffusion model was developed to synthesize 3D tau PET images by leveraging multimodal inputs, including textual prompts derived from plasma p-tau217 measurements and anatomical constraints from structural MRI. Result: The experimental results showed that the proposed framework can generate realistic and clinically meaningful 3D tau PET images across various stages of the disease. Conclusion: The proposed text-guided 3D diffusion model can generate realistic 3D tau PET images using multimodal conditions from MRI and plasma measurements, offering a non-invasive and cost-effective alternative for visualizing tau pathology and simulating disease progression. Abstract: Accurate quantification of tau pathology via tau positron emission tomography (PET) scan is crucial for diagnosing and monitoring Alzheimer's disease (AD). However, the high cost and limited availability of tau PET restrict its widespread use. In contrast, structural magnetic resonance imaging (MRI) and plasma-based biomarkers provide non-invasive and widely available complementary information related to brain anatomy and disease progression. In this work, we propose a text-guided 3D diffusion model for 3D tau PET image synthesis, leveraging multimodal conditions from both structural MRI and plasma measurement. Specifically, the textual prompt is from the plasma p-tau217 measurement, which is a key indicator of AD progression, while MRI provides anatomical structure constraints. The proposed framework is trained and evaluated using clinical AV1451 tau PET data from the Alzheimer's Disease Neuroimaging Initiative (ADNI) database. 
Experimental results demonstrate that our approach can generate realistic, clinically meaningful 3D tau PET across a range of disease stages. The proposed framework can help perform tau PET data augmentation under different settings, provide a non-invasive, cost-effective alternative for visualizing tau pathology, and support the simulation of disease progression under varying plasma biomarker levels and cognitive conditions.

[100] Dual-Scale Volume Priors with Wasserstein-Based Consistency for Semi-Supervised Medical Image Segmentation

Junying Meng,Gangxuan Zhou,Jun Liu,Weihong Guo

Main category: cs.CV

TL;DR: The paper proposes a new semi-supervised medical image segmentation approach that effectively integrates spatial regularization and volume priors, showing improved performance on multiple datasets.

Details Motivation: The motivation is to address the limitations of existing semi-supervised segmentation networks that overlook effective feature extraction guidance and prior information from datasets, thereby improving segmentation accuracy. Method: The method involves integrating a strong explicit volume prior and Threshold Dynamics spatial regularization into a backbone segmentation network. A regression network estimates target region volumes for unlabeled images, and a dataset-scale Wasserstein distance loss function is designed to enforce similarity between labeled and unlabeled datasets. Result: Experimental results on the 2017 ACDC dataset, PROMISE12 dataset, and thigh muscle MR image dataset demonstrate the superiority of the proposed method in semi-supervised medical image segmentation. Conclusion: The paper concludes that the proposed semi-supervised medical image segmentation framework, which integrates spatial regularization methods and volume priors, outperforms existing methods on multiple datasets. Abstract: Despite significant progress in semi-supervised medical image segmentation, most existing segmentation networks overlook effective methodological guidance for feature extraction and important prior information from datasets. In this paper, we develop a semi-supervised medical image segmentation framework that effectively integrates spatial regularization methods and volume priors. Specifically, our approach integrates a strong explicit volume prior at the image scale and Threshold Dynamics spatial regularization, both derived from variational models, into the backbone segmentation network. The target region volumes for each unlabeled image are estimated by a regression network, which effectively regularizes the backbone segmentation network through an image-scale Wasserstein distance constraint, ensuring that the class ratios in the segmentation results for each unlabeled image match those predicted by the regression network.
Additionally, we design a dataset-scale Wasserstein distance loss function based on a weak implicit volume prior, which enforces that the volume distribution predicted for the unlabeled dataset is similar to that of labeled dataset. Experimental results on the 2017 ACDC dataset, PROMISE12 dataset, and thigh muscle MR image dataset show the superiority of the proposed method.
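The dataset-scale constraint compares the distribution of predicted volumes on unlabeled data with that of the labeled data. As a rough illustration, the sketch below computes the one-dimensional Wasserstein-1 distance between two empirical samples by integrating the absolute difference of their cumulative distribution functions. It is plain, non-differentiable Python; the paper's actual loss formulation is not given in the abstract, and the function name and sample values are assumptions.

```python
def wasserstein_1d(xs, ys):
    """Wasserstein-1 distance between two empirical 1D distributions.

    Computed as the integral of |F_x(t) - F_y(t)| over t, where F_x and F_y
    are the empirical CDFs. Sample sizes may differ.
    """
    if not xs or not ys:
        raise ValueError("both samples must be non-empty")
    xs, ys = sorted(xs), sorted(ys)
    points = sorted(set(xs) | set(ys))
    total, i, j = 0.0, 0, 0
    for left, right in zip(points, points[1:]):
        # advance counters so i/len(xs) and j/len(ys) are the CDF values on [left, right)
        while i < len(xs) and xs[i] <= left:
            i += 1
        while j < len(ys) and ys[j] <= left:
            j += 1
        total += abs(i / len(xs) - j / len(ys)) * (right - left)
    return total


# hypothetical foreground volume fractions
labeled_volumes = [0.10, 0.20, 0.30]
predicted_volumes = [0.20, 0.30, 0.40]
gap = wasserstein_1d(labeled_volumes, predicted_volumes)
```

In this toy example every predicted volume is shifted by 0.1 relative to the labeled set, and the distance comes out at about 0.1, the size of the shift.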

[101] Self-adaptive Dataset Construction for Real-World Multimodal Safety Scenarios

Jingen Qu,Lijun Li,Bo Zhang,Yichen Yan,Jing Shao

Main category: cs.CV

TL;DR: This paper proposes an image-oriented self-adaptive dataset construction method for real-world multimodal safety scenarios and a standardized evaluation metric, showing promising results in scalability and effectiveness.

Details Motivation: The motivation is driven by the increasing complexity of safety challenges in multimodal large language models (MLLMs) and the limitations of current risk-oriented dataset construction methods in covering real-world scenarios. Method: The paper introduces a novel image-oriented self-adaptive dataset construction method for real-world multimodal safety scenarios (RMS) and a standardized safety dataset evaluation metric by fine-tuning a safety judge model. Result: Using the image-oriented method, the authors automatically generated an RMS dataset comprising 35k image-text pairs with guidance responses and demonstrated the effectiveness of their approach through extensive experiments on various tasks. Conclusion: The paper concludes that the proposed image-oriented pipeline is scalable and effective, offering a new perspective for the construction of real-world multimodal safety datasets. Abstract: Multimodal large language models (MLLMs) are rapidly evolving, presenting increasingly complex safety challenges. However, current dataset construction methods, which are risk-oriented, fail to cover the growing complexity of real-world multimodal safety scenarios (RMS). And due to the lack of a unified evaluation metric, their overall effectiveness remains unproven. This paper introduces a novel image-oriented self-adaptive dataset construction method for RMS, which starts with images and ends by constructing paired text and guidance responses. Using the image-oriented method, we automatically generate an RMS dataset comprising 35k image-text pairs with guidance responses. Additionally, we introduce a standardized safety dataset evaluation metric: fine-tuning a safety judge model and evaluating its capabilities on other safety datasets. Extensive experiments on various tasks demonstrate the effectiveness of the proposed image-oriented pipeline.
The results confirm the scalability and effectiveness of the image-oriented approach, offering a new perspective for the construction of real-world multimodal safety datasets.

[102] PAOLI: Pose-free Articulated Object Learning from Sparse-view Images

Jianning Deng,Kartic Subr,Hakan Bilen

Main category: cs.CV

TL;DR: The paper proposes a new self-supervised method that reconstructs accurate, detailed articulated object models from only a few unposed views.

Details Motivation: Existing methods usually require dense multi-view observations and ground-truth camera poses, whereas this method aims to learn articulated object representations from sparse views without camera supervision. Method: Each articulation is first reconstructed independently with sparse-view 3D reconstruction; a deformation field is then learned to establish dense correspondences across poses; a progressive disentanglement strategy separates static from moving parts; and geometry, appearance, and kinematics are jointly optimized with a self-supervised loss. Result: Experiments show that the method produces high-quality articulated object representations on the standard benchmark and on real-world examples. Conclusion: The framework produces accurate and detailed articulated object representations under much weaker input assumptions than existing methods. Abstract: We present a novel self-supervised framework for learning articulated object representations from sparse-view, unposed images. Unlike prior methods that require dense multi-view observations and ground-truth camera poses, our approach operates with as few as four views per articulation and no camera supervision. To address the inherent challenges, we first reconstruct each articulation independently using recent advances in sparse-view 3D reconstruction, then learn a deformation field that establishes dense correspondences across poses. A progressive disentanglement strategy further separates static from moving parts, enabling robust separation of camera and object motion. Finally, we jointly optimize geometry, appearance, and kinematics with a self-supervised loss that enforces cross-view and cross-pose consistency. Experiments on the standard benchmark and real-world examples demonstrate that our method produces accurate and detailed articulated object representations under significantly weaker input assumptions than existing approaches.

[103] The Telephone Game: Evaluating Semantic Drift in Unified Models

Sabbir Mollah,Rohit Gupta,Sirnam Swetha,Qingyang Liu,Ahnaf Munir,Mubarak Shah

Main category: cs.CV

TL;DR: This paper introduces a cyclic evaluation framework for unified visual language models, highlighting the importance of assessing cross-modal consistency over single-pass metrics.

Details Motivation: The motivation is to address the lack of evaluation methods that consider the consistency between visual understanding and generation in unified models, particularly when cycling between modalities. Method: The authors introduce the Unified Consistency Framework (UCF-UM), which uses a cyclic evaluation protocol alternating between image-to-text and text-to-image generations. They define three metrics: Mean Cumulative Drift (MCD), Semantic Drift Rate (SDR), and Multi-Generation GenEval (MGG). Result: The framework reveals significant differences in cross-modal stability among models, with some maintaining semantics over multiple alternations while others experience rapid drift despite strong single-pass performance. Conclusion: The study concludes that cyclic consistency is essential for evaluating cross-modal stability in unified visual language models, and proposes practical metrics for assessing these models' performance. Abstract: Employing a single, unified model (UM) for both visual understanding (image-to-text: I2T) and visual generation (text-to-image: T2I) has opened a new direction in Visual Language Model (VLM) research. While UMs can also support broader unimodal tasks (e.g., text-to-text, image-to-image), we focus on the core cross-modal pair T2I and I2T, as consistency between understanding and generation is critical for downstream use. Existing evaluations consider these capabilities in isolation: FID and GenEval for T2I, and benchmarks such as MME, MMBench for I2T. These single-pass metrics do not reveal whether a model that understands a concept can also render it, nor whether meaning is preserved when cycling between image and text modalities. To address this, we introduce the Unified Consistency Framework for Unified Models (UCF-UM), a cyclic evaluation protocol that alternates I2T and T2I over multiple generations to quantify semantic drift.
UCF formulates 3 metrics: (i) Mean Cumulative Drift (MCD), an embedding-based measure of overall semantic loss; (ii) Semantic Drift Rate (SDR), that summarizes semantic decay rate; and (iii) Multi-Generation GenEval (MGG), an object-level compliance score extending GenEval. To assess generalization beyond COCO, which is widely used in training, we create a new benchmark ND400, sampled from NoCaps and DOCCI, and evaluate on seven recent models. UCF-UM reveals substantial variation in cross-modal stability: some models like BAGEL maintain semantics over many alternations, whereas others like Vila-u drift quickly despite strong single-pass scores. Our results highlight cyclic consistency as a necessary complement to standard I2T and T2I evaluations, and provide practical metrics to consistently assess unified models' cross-modal stability and strength of their shared representations. Code: https://github.com/mollahsabbir/Semantic-Drift-in-Unified-Models
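The cyclic protocol can be pictured with a toy loop that alternates text-to-image and image-to-text and tracks how similar each new caption stays to the starting text. This is not the paper's MCD, SDR, or MGG definition, which the abstract does not spell out; the callables `text_to_image`, `image_to_text`, and `embed` are hypothetical stand-ins for a unified model and an embedding model.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def cyclic_drift(start_text, text_to_image, image_to_text, embed, generations=3):
    """Alternate T2I and I2T, recording each caption's similarity to the start text.

    Returns the per-generation similarities and their mean.
    """
    reference = embed(start_text)
    text, sims = start_text, []
    for _ in range(generations):
        image = text_to_image(text)   # T2I step
        text = image_to_text(image)   # I2T step
        sims.append(cosine(reference, embed(text)))
    mean_sim = sum(sims) / len(sims) if sims else 0.0
    return sims, mean_sim


# Toy stand-ins: the "image" is a word list and each caption loses its last word.
VOCAB = ["a", "red", "cube", "on", "table"]
toy_t2i = lambda text: text.split()
toy_i2t = lambda image: " ".join(image[:-1])
toy_embed = lambda text: [text.split().count(word) for word in VOCAB]

sims, mean_sim = cyclic_drift("a red cube on table", toy_t2i, toy_i2t, toy_embed)
```

With the toy model, similarity to the original prompt falls at every generation. A unified model with strong single-pass scores can show the same pattern once its own outputs are fed back in.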

[104] Noisy Label Refinement with Semantically Reliable Synthetic Images

Yingxuan Li,Jiafeng Mao,Yusuke Matsui

Main category: cs.CV

TL;DR: This paper proposes a novel method to address semantic noise in image classification datasets by using synthetic images as reliable reference points to identify and correct mislabeled data, achieving significant improvements in classification accuracy.

Details Motivation: Semantic noise in image classification datasets, caused by mislabeled visually similar categories, poses a significant challenge to conventional supervised learning approaches. This work aims to explore how synthetic images with reliable labels can be used to address this issue. Method: The method uses synthetic images generated by text-to-image models as reliable reference points to detect and correct mislabeled samples in noisy datasets. It focuses on overcoming domain gaps and diversity constraints associated with synthetic images. Result: Extensive experiments show that the approach significantly improves classification accuracy under various noise conditions. It achieves a 30% improvement on CIFAR-10, 11% on CIFAR-100 under 70% semantic noise, and 24% on ImageNet-100 under real-world noise conditions when combined with existing noise-robust training methods. Conclusion: The proposed method effectively addresses semantic noise in image classification datasets by leveraging synthetic images as reliable reference points to identify and correct mislabeled samples. It demonstrates significant improvements in classification accuracy, especially in scenarios with semantic label noise. Abstract: Semantic noise in image classification datasets, where visually similar categories are frequently mislabeled, poses a significant challenge to conventional supervised learning approaches. In this paper, we explore the potential of using synthetic images generated by advanced text-to-image models to address this issue. Although these high-quality synthetic images come with reliable labels, their direct application in training is limited by domain gaps and diversity constraints. Unlike conventional approaches, we propose a novel method that leverages synthetic images as reliable reference points to identify and correct mislabeled samples in noisy datasets. 
Extensive experiments across multiple benchmark datasets show that our approach significantly improves classification accuracy under various noise conditions, especially in challenging scenarios with semantic label noise. Additionally, since our method is orthogonal to existing noise-robust learning techniques, when combined with state-of-the-art noise-robust training methods, it achieves superior performance, improving accuracy by 30% on CIFAR-10 and by 11% on CIFAR-100 under 70% semantic noise, and by 24% on ImageNet-100 under real-world noise conditions.

[105] Efficient Odd-One-Out Anomaly Detection

Silvio Chito,Paolo Rabino,Tatiana Tommasi

Main category: cs.CV

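TL;DR: This paper introduces a DINO-based model for the odd-one-out anomaly detection task, along with a Multimodal Large Language Model baseline that exposes the current limitations of such models in structured visual reasoning.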

Details Motivation: The odd-one-out anomaly detection task challenges modern deep learning models, demanding spatial reasoning across multiple views as well as relational reasoning. These challenges must be addressed efficiently. Method: The authors propose a DINO-based model to tackle the challenges of odd-one-out anomaly detection and introduce a Multimodal Large Language Model as a baseline. Result: The proposed DINO-based model reduces the number of parameters by one third and shortens training time by a factor of three. Conclusion: The paper proposes a DINO-based model for odd-one-out anomaly detection that maintains competitive performance while reducing parameters by one third and shortening training time by a factor of three, and it introduces a Multimodal Large Language Model baseline. Abstract: The recently introduced odd-one-out anomaly detection task involves identifying the odd-looking instances within a multi-object scene. This problem presents several challenges for modern deep learning models, demanding spatial reasoning across multiple views and relational reasoning to understand context and generalize across varying object categories and layouts. We argue that these challenges must be addressed with efficiency in mind. To this end, we propose a DINO-based model that reduces the number of parameters by one third and shortens training time by a factor of three compared to the current state-of-the-art, while maintaining competitive performance. Our experimental evaluation also introduces a Multimodal Large Language Model baseline, providing insights into its current limitations in structured visual reasoning tasks. The project page can be found at https://silviochito.github.io/EfficientOddOneOut/

[106] GeoArena: An Open Platform for Benchmarking Large Vision-language Models on WorldWide Image Geolocalization

Pengyue Jia,Yingyi Zhang,Xiangyu Zhao,Yixuan Li

Main category: cs.CV

TL;DR: GeoArena offers a novel, open, and human-centered platform for evaluating vision-language models in image geolocalization tasks, addressing data leakage and privacy issues in traditional methods.

Details Motivation: The motivation is to overcome the limitations of current image geolocalization evaluation methods, such as data leakage from pretraining on test datasets and reliance on exact geographic coordinates that neglect reasoning and pose privacy risks. Method: The authors designed GeoArena, which uses in-the-wild images and pairwise human judgments to evaluate LVLMs, addressing data leakage and privacy concerns from traditional methods. Result: GeoArena was deployed online for two months, collecting thousands of voting records, which were used to establish a leaderboard of LVLMs for the image geolocalization task. Conclusion: The paper concludes that GeoArena is an effective and innovative platform for evaluating LVLMs in image geolocalization tasks, providing a more accurate, privacy-preserving, and human-centered approach. Abstract: Image geolocalization aims to predict the geographic location of images captured anywhere on Earth, but its global nature presents significant challenges. Current evaluation methodologies suffer from two major limitations. First, data leakage: advanced approaches often rely on large vision-language models (LVLMs) to predict image locations, yet these models are frequently pretrained on the test datasets, compromising the accuracy of evaluating a model's actual geolocalization capability. Second, existing metrics primarily rely on exact geographic coordinates to assess predictions, which not only neglects the reasoning process but also raises privacy concerns when user-level location data is required. To address these issues, we propose GeoArena, a first open platform for evaluating LVLMs on worldwide image geolocalization tasks, offering true in-the-wild and human-centered benchmarking. GeoArena enables users to upload in-the-wild images for a more diverse evaluation corpus, and it leverages pairwise human judgments to determine which model output better aligns with human expectations. 
Our platform has been deployed online for two months, during which we collected thousands of voting records. Based on this data, we conduct a detailed analysis and establish a leaderboard of different LVLMs on the image geolocalization task.
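The abstract does not say how pairwise votes are aggregated into a leaderboard. One common choice for arena-style platforms is an Elo rating, sketched below purely as an illustration; the function name, the `k` factor, the starting rating, and the model names are all assumptions.

```python
def elo_leaderboard(votes, k=32.0, base=1000.0):
    """Turn (winner, loser) votes into a ranking with sequential Elo updates.

    k and base are conventional Elo defaults, not values from GeoArena.
    """
    ratings = {}
    for winner, loser in votes:
        rw = ratings.setdefault(winner, base)
        rl = ratings.setdefault(loser, base)
        # Probability the winner was expected to win, given current ratings.
        expected_w = 1.0 / (1.0 + 10 ** ((rl - rw) / 400.0))
        delta = k * (1.0 - expected_w)
        ratings[winner] = rw + delta
        ratings[loser] = rl - delta
    return sorted(ratings.items(), key=lambda kv: kv[1], reverse=True)

# Hypothetical votes: each tuple records which model's answer a user preferred.
votes = [("model_a", "model_b"), ("model_a", "model_c"), ("model_b", "model_c")]
board = elo_leaderboard(votes)
for name, rating in board:
    print(f"{name}: {rating:.1f}")
```

Sequential Elo depends on vote order; a Bradley-Terry fit over all votes at once is an order-independent alternative.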

[107] From Editor to Dense Geometry Estimator

JiYuan Wang,Chunyu Lin,Lei Sun,Rongying Liu,Lang Nie,Mingxing Li,Kang Liao,Xiangxiang Chu,Yao Zhao

Main category: cs.CV

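TL;DR: This paper proposes FE2E, a framework for dense geometry estimation built on an image editing model; by redesigning the training objective and the quantization scheme, it achieves strong zero-shot predictions.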

Details Motivation: Dense prediction is inherently an image-to-image task, so image editing models may be better suited to fine-tuning than text-to-image generative models. This work systematically analyzes how editing models and generative models differ in fine-tuning behavior and explores the potential of editing models for dense geometry estimation. Method: The FE2E framework reformulates the editing model's flow matching loss into a "consistent velocity" training objective, uses logarithmic quantization to resolve the precision conflict, and exploits the DiT's global attention to jointly estimate depth and normals. Result: FE2E performs strongly on zero-shot monocular depth and normal estimation across multiple datasets, with gains of over 35% on ETH3D, and it outperforms the DepthAnything series, which is trained on 100× the data. Conclusion: Editing models offer better performance and stability than generative models on dense geometry estimation. By adopting a Diffusion Transformer-based editing model, FE2E improves zero-shot monocular depth and normal estimation. Abstract: Leveraging visual priors from pre-trained text-to-image (T2I) generative models has shown success in dense prediction. However, dense prediction is inherently an image-to-image task, suggesting that image editing models, rather than T2I generative models, may be a more suitable foundation for fine-tuning. Motivated by this, we conduct a systematic analysis of the fine-tuning behaviors of both editors and generators for dense geometry estimation. Our findings show that editing models possess inherent structural priors, which enable them to converge more stably by "refining" their innate features, and ultimately achieve higher performance than their generative counterparts. Based on these findings, we introduce FE2E, a framework that pioneeringly adapts an advanced editing model based on Diffusion Transformer (DiT) architecture for dense geometry prediction. Specifically, to tailor the editor for this deterministic task, we reformulate the editor's original flow matching loss into the "consistent velocity" training objective. We also use logarithmic quantization to resolve the precision conflict between the editor's native BFloat16 format and the high precision demand of our tasks. Additionally, we leverage the DiT's global attention for a cost-free joint estimation of depth and normals in a single forward pass, enabling their supervisory signals to mutually enhance each other. Without scaling up the training data, FE2E achieves impressive performance improvements in zero-shot monocular depth and normal estimation across multiple datasets. 
Notably, it achieves over 35% performance gains on the ETH3D dataset and outperforms the DepthAnything series, which is trained on 100× data. The project page can be accessed at https://amap-ml.github.io/FE2E/.
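To see why logarithmic quantization helps when precision is limited, the sketch below compares uniform and log-spaced quantization of a depth value. It is only an illustration of the general technique: the abstract does not describe FE2E's actual scheme, and the depth range, the number of levels (a stand-in for limited precision), and the function names are assumptions.

```python
import math

def quantize_uniform(depth, d_min, d_max, levels):
    """Snap depth to the nearest of `levels` evenly spaced values."""
    step = (d_max - d_min) / (levels - 1)
    return d_min + round((depth - d_min) / step) * step

def quantize_log(depth, d_min, d_max, levels):
    """Snap log(depth) to the nearest of `levels` evenly spaced log values."""
    lo, hi = math.log(d_min), math.log(d_max)
    step = (hi - lo) / (levels - 1)
    return math.exp(lo + round((math.log(depth) - lo) / step) * step)

def rel_error(true, approx):
    return abs(true - approx) / true

# Hypothetical scene range of 0.1 m to 80 m with 256 representable levels.
d_min, d_max, levels = 0.1, 80.0, 256
near = 0.25  # a close object, where uniform steps are far too coarse
err_uniform = rel_error(near, quantize_uniform(near, d_min, d_max, levels))
err_log = rel_error(near, quantize_log(near, d_min, d_max, levels))
print(f"uniform: {err_uniform:.3f}, log: {err_log:.4f}")
```

Log spacing keeps the relative error roughly constant over the whole range, whereas uniform spacing spends most of its levels on distant depths.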

[108] MICACL: Multi-Instance Category-Aware Contrastive Learning for Long-Tailed Dynamic Facial Expression Recognition

Feng-Qi Cui,Zhen Lin,Xinlong Rao,Anyang Tong,Shiyao Li,Fei Wang,Changlin Chen,Bin Liu

Main category: cs.CV

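TL;DR: This paper proposes MICACL, a new multi-instance learning framework that addresses long-tailed category distributions and spatio-temporal feature modeling in dynamic facial expression recognition. By introducing a graph-enhanced instance interaction module, a weighted instance aggregation network, and a multiscale category-aware contrastive learning strategy, it improves recognition accuracy and robustness.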

Details Motivation: Dynamic facial expression recognition (DFER) faces major challenges due to long-tailed category distributions and the complexity of spatio-temporal feature modeling. Existing deep learning methods often fail to address these issues, resulting in severe model induction bias. Method: A new multi-instance learning framework, MICACL, is proposed. It comprises the Graph-Enhanced Instance Interaction Module (GEIIM) for capturing complex spatio-temporal relationships, the Weighted Instance Aggregation Network (WIAN) for dynamically assigning instance importance weights, and a Multiscale Category-aware Contrastive Learning (MCCL) strategy for balancing training between major and minor categories. Result: Extensive experiments on in-the-wild datasets (DFEW and FERV39k) show that MICACL achieves state-of-the-art performance with superior robustness and generalization. Conclusion: By combining spatio-temporal dependency modeling with long-tailed contrastive learning optimization, MICACL effectively addresses model induction bias in DFER and improves recognition accuracy and robustness. Abstract: Dynamic facial expression recognition (DFER) faces significant challenges due to long-tailed category distributions and the complexity of spatio-temporal feature modeling. While existing deep learning-based methods have improved DFER performance, they often fail to address these issues, resulting in severe model induction bias. To overcome these limitations, we propose a novel multi-instance learning framework called MICACL, which integrates spatio-temporal dependency modeling and long-tailed contrastive learning optimization. Specifically, we design the Graph-Enhanced Instance Interaction Module (GEIIM) to capture intricate spatio-temporal relationships between adjacent instances through adaptive adjacency matrices and multiscale convolutions. To enhance instance-level feature aggregation, we develop the Weighted Instance Aggregation Network (WIAN), which dynamically assigns weights based on instance importance. Furthermore, we introduce a Multiscale Category-aware Contrastive Learning (MCCL) strategy to balance training between major and minor categories. Extensive experiments on in-the-wild datasets (i.e., DFEW and FERV39k) demonstrate that MICACL achieves state-of-the-art performance with superior robustness and generalization.

[109] Stitching the Story: Creating Panoramic Incident Summaries from Body-Worn Footage

Dor Cohen,Inga Efrosman,Yehudit Aperstein,Alexander Apartsin

Main category: cs.CV

TL;DR: This paper presents a computer vision pipeline that transforms body-camera footage into informative panoramic images.

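Details Motivation: First responders widely adopt body-worn cameras to document incident scenes and support post-event analysis. However, reviewing lengthy footage is impractical in time-critical situations. Effective situational awareness requires concise visual summaries that can be interpreted quickly. Method: The paper uses monocular Simultaneous Localization and Mapping (SLAM) to estimate the camera trajectory and reconstruct the spatial layout of the environment. Key viewpoints are identified by clustering camera poses along the trajectory, and representative frames are selected from each cluster. These frames are fused into spatially coherent panoramic images using multi-frame stitching. Result: The resulting summaries enable rapid understanding of complex environments. Conclusion: The paper proposes a computer vision pipeline that converts body-worn camera footage into panoramic images, enabling rapid understanding of the incident scene and supporting efficient decision-making and incident review. Abstract: First responders widely adopt body-worn cameras to document incident scenes and support post-event analysis. However, reviewing lengthy video footage is impractical in time-critical situations. Effective situational awareness demands a concise visual summary that can be quickly interpreted. This work presents a computer vision pipeline that transforms body-camera footage into informative panoramic images summarizing the incident scene. Our method leverages monocular Simultaneous Localization and Mapping (SLAM) to estimate camera trajectories and reconstruct the spatial layout of the environment. Key viewpoints are identified by clustering camera poses along the trajectory, and representative frames from each cluster are selected. These frames are fused into spatially coherent panoramic images using multi-frame stitching techniques. The resulting summaries enable rapid understanding of complex environments and facilitate efficient decision-making and incident review.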
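The step of clustering camera poses and choosing one representative frame per cluster could look roughly like the sketch below. It assumes camera positions have already been estimated (for example by SLAM) and uses a simple sequential distance threshold rather than whatever clustering the authors use; `summarize_trajectory` and `radius` are hypothetical names.

```python
import math

def _centre(positions, members):
    """Mean 3-D position of the frames listed in `members`."""
    return [sum(positions[i][a] for i in members) / len(members) for a in range(3)]

def summarize_trajectory(positions, radius=2.0):
    """Group consecutive camera positions and pick one frame index per group.

    positions: list of (x, y, z) camera centres, one per frame
    radius: hypothetical threshold; a new cluster starts when a frame lies
            farther than this from the running centroid of the current one
    """
    if not positions:
        return []
    clusters = [[0]]
    for idx in range(1, len(positions)):
        if math.dist(positions[idx], _centre(positions, clusters[-1])) > radius:
            clusters.append([idx])
        else:
            clusters[-1].append(idx)
    # Representative frame: the member closest to its cluster centroid.
    keyframes = []
    for members in clusters:
        centre = _centre(positions, members)
        keyframes.append(min(members, key=lambda i: math.dist(positions[i], centre)))
    return keyframes

# Toy trajectory: the wearer lingers in three places.
track = [(0, 0, 0), (0.5, 0, 0), (1.0, 0, 0),
         (5, 0, 0), (5.5, 0, 0), (6, 0, 0), (10, 3, 0)]
print(summarize_trajectory(track))  # [1, 4, 6]
```

The selected frames would then be passed to a stitching stage; a fuller version would also cluster on camera orientation, since two frames taken from the same spot can look in opposite directions.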

Hao Ju,Hu Zhang,Zhedong Zheng

Main category: cs.CV

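TL;DR: This paper proposes AnomalyLMM, a new framework based on large multi-modal models for text-based person anomaly search, which outperforms existing methods on the PAB dataset.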

Details Motivation: With growing public safety demands, text-based person anomaly search has become an important task, but it faces the challenges of fine-grained cross-modal alignment and sparse real-world samples. Although LMMs excel at multi-modal understanding, their potential for fine-grained anomaly retrieval remains underexplored. Method: The paper proposes a novel coarse-to-fine pipeline that integrates LMMs, together with a training-free adaptation strategy (masked cross-modal prompting, behavioral saliency prediction, and knowledge-aware re-ranking) that enables zero-shot focus on subtle anomaly cues. Result: Experiments show that the method surpasses the competitive baseline by +0.96% Recall@1 on the PAB dataset and reveals interpretable alignment between textual anomalies and visual behaviors. Conclusion: The paper proposes AnomalyLMM, the first framework to use Large Multi-modal Models (LMMs) for text-based person anomaly search, and demonstrates its effectiveness through a rigorous evaluation on the PAB dataset. Abstract: With growing public safety demands, text-based person anomaly search has emerged as a critical task, aiming to retrieve individuals with abnormal behaviors via natural language descriptions. Unlike conventional person search, this task presents two unique challenges: (1) fine-grained cross-modal alignment between textual anomalies and visual behaviors, and (2) anomaly recognition under sparse real-world samples. While Large Multi-modal Models (LMMs) excel in multi-modal understanding, their potential for fine-grained anomaly retrieval remains underexplored, hindered by: (1) a domain gap between generative knowledge and discriminative retrieval, and (2) the absence of efficient adaptation strategies for deployment. In this work, we propose AnomalyLMM, the first framework that harnesses LMMs for text-based person anomaly search. Our key contributions are: (1) A novel coarse-to-fine pipeline integrating LMMs to bridge generative world knowledge with retrieval-centric anomaly detection; (2) A training-free adaptation cookbook featuring masked cross-modal prompting, behavioral saliency prediction, and knowledge-aware re-ranking, enabling zero-shot focus on subtle anomaly cues. As the first study to explore LMMs for this task, we conduct a rigorous evaluation on the PAB dataset, the only publicly available benchmark for text-based person anomaly search, with its curated real-world anomalies covering diverse scenarios (e.g., falling, collision, and being hit). Experiments show the effectiveness of the proposed method, surpassing the competitive baseline by +0.96% Recall@1 accuracy. 
Notably, our method reveals interpretable alignment between textual anomalies and visual behaviors, validated via qualitative analysis. Our code and models will be released for future research.

[111] Aesthetic Image Captioning with Saliency Enhanced MLLMs

Yilin Tao,Jiashui Huang,Huaze Xu,Ling Shao

Main category: cs.CV

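TL;DR: This paper proposes ASE-MLLM, the first framework to integrate image aesthetic saliency into MLLMs for the AIC task, achieving state-of-the-art results.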

Details Motivation: Existing AIC methods that use MLLMs are not specifically adapted to focus on target aesthetic content, so a more effective approach is needed. Method: The ASE-MLLM framework is proposed; its IASM module extracts aesthetic saliency features, which are fused with the original image features via a cross-attention mechanism. Result: Experiments show that ASE-MLLM significantly outperforms traditional methods and generic MLLMs on mainstream AIC benchmarks, reaching state-of-the-art performance. Conclusion: ASE-MLLM achieves state-of-the-art performance on the AIC task, demonstrating the effectiveness of integrating image aesthetic saliency into MLLMs. Abstract: Aesthetic Image Captioning (AIC) aims to generate textual descriptions of image aesthetics, becoming a key research direction in the field of computational aesthetics. In recent years, pretrained Multimodal Large Language Models (MLLMs) have advanced rapidly, leading to a significant increase in image aesthetics research that integrates both visual and textual modalities. However, most existing studies on image aesthetics primarily focus on predicting aesthetic ratings and have shown limited application in AIC. Existing AIC works leveraging MLLMs predominantly rely on fine-tuning methods without specifically adapting MLLMs to focus on target aesthetic content. To address this limitation, we propose the Aesthetic Saliency Enhanced Multimodal Large Language Model (ASE-MLLM), an end-to-end framework that explicitly incorporates aesthetic saliency into MLLMs. Within this framework, we introduce the Image Aesthetic Saliency Module (IASM), which efficiently and effectively extracts aesthetic saliency features from images. Additionally, we design IAS-ViT as the image encoder for MLLMs; this module fuses aesthetic saliency features with original image features via a cross-attention mechanism. To the best of our knowledge, ASE-MLLM is the first framework to integrate image aesthetic saliency into MLLMs specifically for AIC tasks. Extensive experiments demonstrated that our approach significantly outperformed traditional methods and generic MLLMs on current mainstream AIC benchmarks, achieving state-of-the-art (SOTA) performance.

[112] SSGaussian: Semantic-Aware and Structure-Preserving 3D Style Transfer

Jimin Xu,Bosheng Qin,Tao Jin,Zhou Zhao,Zhenhui Ye,Jun Yu,Fei Wu

Main category: cs.CV

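TL;DR: This paper proposes a new 3D style transfer method that leverages pretrained 2D diffusion models; cross-view style alignment and instance-level style transfer improve stylization quality and structural clarity.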

Details Motivation: Existing 3D scene style transfer methods struggle to extract and transfer high-level style semantics, and their results lack structural clarity and separation, making it hard to distinguish different instances or objects in the 3D scene. Method: The method has two key stages: it first uses diffusion priors to generate stylized renderings of key viewpoints, then transfers these stylized key views onto the 3D representation. Its innovative designs are cross-view style alignment and instance-level style transfer. Result: Extensive qualitative and quantitative experiments show that the 3D style transfer pipeline significantly outperforms state-of-the-art methods across a wide range of scenes, from forward-facing to challenging 360-degree environments. Conclusion: The paper proposes a new 3D style transfer pipeline that incorporates prior knowledge from pretrained 2D diffusion models, addressing the shortcomings of existing methods in extracting and transferring high-level style semantics and in structural clarity. Abstract: Recent advancements in neural representations, such as Neural Radiance Fields and 3D Gaussian Splatting, have increased interest in applying style transfer to 3D scenes. While existing methods can transfer style patterns onto 3D-consistent neural representations, they struggle to effectively extract and transfer high-level style semantics from the reference style image. Additionally, the stylized results often lack structural clarity and separation, making it difficult to distinguish between different instances or objects within the 3D scene. To address these limitations, we propose a novel 3D style transfer pipeline that effectively integrates prior knowledge from pretrained 2D diffusion models. Our pipeline consists of two key stages: First, we leverage diffusion priors to generate stylized renderings of key viewpoints. Then, we transfer the stylized key views onto the 3D representation. This process incorporates two innovative designs. The first is cross-view style alignment, which inserts cross-view attention into the last upsampling block of the UNet, allowing feature interactions across multiple key views. This ensures that the diffusion model generates stylized key views that maintain both style fidelity and instance-level consistency. The second is instance-level style transfer, which effectively leverages instance-level consistency across stylized key views and transfers it onto the 3D representation. This results in a more structured, visually coherent, and artistically enriched stylization. 
Extensive qualitative and quantitative experiments demonstrate that our 3D style transfer pipeline significantly outperforms state-of-the-art methods across a wide range of scenes, from forward-facing to challenging 360-degree environments. Visit our project page https://jm-xu.github.io/SSGaussian for immersive visualization.

[113] Learning neural representations for X-ray ptychography reconstruction with unknown probes

Tingyou Li,Zixin Xu,Zirui Gao,Hanfei Yan,Xiaojing Huang,Jizhou Li

Main category: cs.CV

TL;DR: This paper introduces PtyINR, a self-supervised neural framework for X-ray ptychography that improves image reconstruction under low-signal conditions without prior probe characterization.

Details Motivation: X-ray ptychography's potential is limited by challenges in accurately reconstructing images when the illuminating probe is unknown, especially under low-signal conditions. Method: PtyINR uses continuous neural representations to parameterize both object and probe, enabling end-to-end reconstruction from raw diffraction patterns. Result: PtyINR achieves superior reconstruction quality on simulated and experimental data, showing robustness under low-signal conditions compared to conventional methods. Conclusion: PtyINR offers a self-supervised framework for simultaneously recovering object and probe without requiring pre-characterization of the probe, making it applicable to various computational microscopy problems. Abstract: X-ray ptychography provides exceptional nanoscale resolution and is widely applied in materials science, biology, and nanotechnology. However, its full potential is constrained by the critical challenge of accurately reconstructing images when the illuminating probe is unknown. Conventional iterative methods and deep learning approaches are often suboptimal, particularly under the low-signal conditions inherent to low-dose and high-speed experiments. These limitations compromise reconstruction fidelity and restrict the broader adoption of the technique. In this work, we introduce the Ptychographic Implicit Neural Representation (PtyINR), a self-supervised framework that simultaneously addresses the object and probe recovery problem. By parameterizing both as continuous neural representations, PtyINR performs end-to-end reconstruction directly from raw diffraction patterns without requiring any pre-characterization of the probe. Extensive evaluations demonstrate that PtyINR achieves superior reconstruction quality on both simulated and experimental data, with remarkable robustness under challenging low-signal conditions. 
Furthermore, PtyINR offers a generalizable, physics-informed framework for addressing probe-dependent inverse problems, making it applicable to a wide range of computational microscopy problems.

[114] Few-step Flow for 3D Generation via Marginal-Data Transport Distillation

Zanwei Zhou,Taoran Yi,Jiemin Fang,Chen Yang,Lingxi Xie,Xinggang Wang,Wei Shen,Qi Tian

Main category: cs.CV

TL;DR: This paper proposes MDT-dist, a framework for accelerating 3D generation by reducing sampling steps through Velocity Matching and Distillation.

Details Motivation: Existing distillation methods like Consistency Models are less effective for complex 3D generation tasks. Method: Two objectives, Velocity Matching (VM) and Velocity Distillation (VD), are proposed to optimize the model. Result: Sampling steps reduced from 25 to 1 or 2, achieving significant speedup without compromising quality. Conclusion: MDT-dist provides a more efficient method for 3D generation, significantly reducing sampling steps while maintaining high fidelity. Abstract: Flow-based 3D generation models typically require dozens of sampling steps during inference. Though few-step distillation methods, particularly Consistency Models (CMs), have achieved substantial advancements in accelerating 2D diffusion models, they remain under-explored for more complex 3D generation tasks. In this study, we propose a novel framework, MDT-dist, for few-step 3D flow distillation. Our approach is built upon a primary objective: distilling the pretrained model to learn the Marginal-Data Transport. Directly learning this objective needs to integrate the velocity fields, while this integral is intractable to be implemented. Therefore, we propose two optimizable objectives, Velocity Matching (VM) and Velocity Distillation (VD), to equivalently convert the optimization target from the transport level to the velocity and the distribution level respectively. Velocity Matching (VM) learns to stably match the velocity fields between the student and the teacher, but inevitably provides biased gradient estimates. Velocity Distillation (VD) further enhances the optimization process by leveraging the learned velocity fields to perform probability density distillation. When evaluated on the pioneer 3D generation framework TRELLIS, our method reduces sampling steps of each flow transformer from 25 to 1 or 2, achieving 0.68s (1 step x 2) and 0.94s (2 steps x 2) latency with 9.0x and 6.5x speedup on A800, while preserving high visual and geometric fidelity. 
Extensive experiments demonstrate that our method significantly outperforms existing CM distillation methods, and enables TRELLIS to achieve superior performance in few-step 3D generation.
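As a quick consistency check on the reported numbers, the latencies and speedups in the abstract can be combined to recover the implied baseline latency. The sketch assumes speedup is defined as baseline latency divided by distilled latency, which the abstract does not state explicitly; `implied_baseline` is a hypothetical helper.

```python
def implied_baseline(latency_s, speedup):
    """Baseline latency implied by a distilled latency and its speedup factor."""
    return latency_s * speedup

one_step = implied_baseline(0.68, 9.0)  # 1 step x 2 transformers
two_step = implied_baseline(0.94, 6.5)  # 2 steps x 2 transformers
print(f"{one_step:.2f} s vs {two_step:.2f} s")
```

Both settings imply a baseline of roughly 6.1 s on the A800, so the two reported speedups agree with each other under this definition.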

[115] Durian: Dual Reference-guided Portrait Animation with Attribute Transfer

Hyunsoo Cha,Byungjun Kim,Hanbyul Joo

Main category: cs.CV

TL;DR: Durian is the first zero-shot method for generating portrait animation videos with facial attribute transfer from a given reference image.

Details Motivation: The goal is to generate high-quality, spatially consistent portrait animation videos with facial attribute transfer. Method: Dual reference networks are introduced to inject spatial features from both the portrait and attribute images into the denoising process of a diffusion model. Result: Durian achieves state-of-the-art performance on portrait animation with attribute transfer. Conclusion: Durian enables multi-attribute composition in a single generation pass without additional training. Abstract: We present Durian, the first method for generating portrait animation videos with facial attribute transfer from a given reference image to a target portrait in a zero-shot manner. To enable high-fidelity and spatially consistent attribute transfer across frames, we introduce dual reference networks that inject spatial features from both the portrait and attribute images into the denoising process of a diffusion model. We train the model using a self-reconstruction formulation, where two frames are sampled from the same portrait video: one is treated as the attribute reference and the other as the target portrait, and the remaining frames are reconstructed conditioned on these inputs and their corresponding masks. To support the transfer of attributes with varying spatial extent, we propose a mask expansion strategy using keypoint-conditioned image generation for training. In addition, we further augment the attribute and portrait images with spatial and appearance-level transformations to improve robustness to positional misalignment between them. These strategies allow the model to effectively generalize across diverse attributes and in-the-wild reference combinations, despite being trained without explicit triplet supervision. Durian achieves state-of-the-art performance on portrait animation with attribute transfer, and notably, its dual reference design enables multi-attribute composition in a single generation pass without additional training.

[116] From Lines to Shapes: Geometric-Constrained Segmentation of X-Ray Collimators via Hough Transform

Benjamin El-Zein,Dominik Eckert,Andreas Fieselmann,Christopher Syben,Ludwig Ritschl,Steffen Kappler,Sebastian Stober

Main category: cs.CV

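TL;DR: This paper proposes a deep learning-based method for segmenting collimator shadows in X-ray images. It incorporates a differentiable Hough transform network to detect collimation borders and combines information from different tasks to generate precise line-constrained segmentation masks, achieving robust reconstruction of collimated regions.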

Details Motivation: Collimation in X-ray imaging restricts exposure to the region of interest and reduces the radiation dose to the patient. Detecting collimator shadows is an essential preprocessing step in digital radiography, but it becomes challenging when scattered X-ray radiation obscures the edges. Method: The paper introduces a deep learning-based segmentation method that incorporates a differentiable Hough transform network to detect collimation borders and extends it to extract information about the center of the region of interest. During inference, the information from both tasks is combined to generate refined, line-constrained segmentation masks. Result: The method achieves robust reconstruction of collimated regions, with median Hausdorff distances of 4.3-5.0 mm on diverse test sets of real X-ray images. Conclusion: The proposed method performs well on collimator shadow segmentation in X-ray images and is not limited to a specific number of edges. Abstract: Collimation in X-ray imaging restricts exposure to the region-of-interest (ROI) and minimizes the radiation dose applied to the patient. The detection of collimator shadows is an essential image-based preprocessing step in digital radiography posing a challenge when edges get obscured by scattered X-ray radiation. Regardless, the prior knowledge that collimation forms polygonal-shaped shadows is evident. For this reason, we introduce a deep learning-based segmentation that is inherently constrained to its geometry. We achieve this by incorporating a differentiable Hough transform-based network to detect the collimation borders and enhance its capability to extract the information about the ROI center. During inference, we combine the information of both tasks to enable the generation of refined, line-constrained segmentation masks. We demonstrate robust reconstruction of collimated regions achieving median Hausdorff distances of 4.3-5.0 mm on diverse test sets of real X-ray images. While this application involves at most four shadow borders, our method is not fundamentally limited by a specific number of edges.
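For readers unfamiliar with the Hough transform that the network builds on, the classical (non-differentiable) version can be sketched in a few lines. This is the textbook voting scheme, not the paper's network: each edge point votes for every line rho = x*cos(theta) + y*sin(theta) passing through it, and the most-voted (theta, rho) bin is taken as the detected border. The function name and the resolution parameters are assumptions.

```python
import math

def hough_strongest_line(points, n_theta=180, rho_res=1.0):
    """Return (theta_degrees, rho) of the line that collects the most votes.

    points: iterable of (x, y) edge pixel coordinates
    n_theta: number of angle bins over [0, 180) degrees
    rho_res: width of a rho bin, in pixels
    """
    votes = {}
    for x, y in points:
        for t in range(n_theta):
            theta = math.pi * t / n_theta
            rho = x * math.cos(theta) + y * math.sin(theta)
            key = (t, round(rho / rho_res))
            votes[key] = votes.get(key, 0) + 1
    (t, rho_bin), _ = max(votes.items(), key=lambda kv: kv[1])
    return 180.0 * t / n_theta, rho_bin * rho_res

# Toy example: a vertical collimator border at x = 20 plus a few stray points.
edge = [(20, y) for y in range(0, 50, 5)]
noise = [(3, 7), (41, 12), (33, 48)]
theta_deg, rho = hough_strongest_line(edge + noise)
print(theta_deg, rho)  # 0.0 20.0
```

Because every detection is a straight line by construction, intersecting a handful of such lines yields a polygonal region, which is the geometric constraint the abstract relies on.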

[117] One Flight Over the Gap: A Survey from Perspective to Panoramic Vision

Xin Lin,Xian Ge,Dizhe Zhang,Zhaoliang Wan,Xianshun Wang,Xiangtai Li,Wenjie Jiang,Bo Du,Dacheng Tao,Ming-Hsuan Yang,Lu Qi

Main category: cs.CV

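TL;DR: This survey reviews panoramic vision techniques, with an emphasis on perspective-to-panorama adaptation; it analyzes the challenges, strategies, and task categories, and discusses future research directions.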

Details Motivation: Driven by growing demand for spatial intelligence and holistic scene perception, omnidirectional images that provide a complete 360-degree field of view are attracting increasing attention across many applications, but their marked differences from perspective images make direct domain adaptation difficult. Method: The paper revisits the panoramic imaging pipeline and projection methods, summarizes three challenges of domain adaptation, analyzes more than 20 representative tasks, conducts a cross-task comparison, and classifies panoramic vision into major categories. Result: The paper provides an in-depth analysis of the panoramic vision field, covering its challenges, methods, and categories, and discusses future research directions and open problems. Conclusion: The paper summarizes recent progress in panoramic vision, particularly perspective-to-panorama adaptation, and aims to offer new insights and forward-looking perspectives for the development of panoramic vision technologies. Abstract: Driven by the demand for spatial intelligence and holistic scene perception, omnidirectional images (ODIs), which provide a complete 360° field of view, are receiving growing attention across diverse applications such as virtual reality, autonomous driving, and embodied robotics. Despite their unique characteristics, ODIs exhibit remarkable differences from perspective images in geometric projection, spatial distribution, and boundary continuity, making it challenging for direct domain adaptation from perspective methods. This survey reviews recent panoramic vision techniques with a particular emphasis on the perspective-to-panorama adaptation. We first revisit the panoramic imaging pipeline and projection methods to build the prior knowledge required for analyzing the structural disparities. Then, we summarize three challenges of domain adaptation: severe geometric distortions near the poles, non-uniform sampling in Equirectangular Projection (ERP), and periodic boundary continuity. Building on this, we cover 20+ representative tasks drawn from more than 300 research papers in two dimensions. On one hand, we present a cross-method analysis of representative strategies for addressing panorama-specific challenges across different tasks. On the other hand, we conduct a cross-task comparison and classify panoramic vision into four major categories: visual quality enhancement and assessment, visual understanding, multimodal understanding, and visual generation. In addition, we discuss open challenges and future directions in data, models, and applications that will drive the advancement of panoramic vision research. 
We hope that our work can provide new insight and forward-looking perspectives to advance the development of panoramic vision technologies. Our project page is https://insta360-research-team.github.io/Survey-of-Panorama
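The non-uniform sampling of Equirectangular Projection mentioned above can be made concrete: every image row covers the same number of pixels, but the area it represents on the sphere shrinks with the cosine of its latitude. The sketch below computes per-row weights on that basis, similar in spirit to the weighting used by spherically weighted quality metrics. It is an illustration rather than something taken from the survey, and `erp_row_weights` is a hypothetical name.

```python
import math

def erp_row_weights(height):
    """Per-row sphere-area weights for an ERP image, normalised to sum to 1."""
    weights = []
    for row in range(height):
        # Latitude of the row centre: +pi/2 at the top, -pi/2 at the bottom.
        latitude = (0.5 - (row + 0.5) / height) * math.pi
        weights.append(math.cos(latitude))
    total = sum(weights)
    return [w / total for w in weights]

w = erp_row_weights(8)
print([round(x, 3) for x in w])
```

Rows near the poles receive small weights and rows near the equator large ones, so a loss or metric averaged with these weights does not over-count the heavily oversampled polar regions.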

[118] Plot'n Polish: Zero-shot Story Visualization and Disentangled Editing with Text-to-Image Diffusion Models

Kiymet Akdemir,Jing Shi,Kushal Kafle,Brian Price,Pinar Yanardag

Main category: cs.CV

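TL;DR: This paper proposes Plot'n Polish, a zero-shot framework that addresses consistency and control in story visualization with text-to-image diffusion models, enabling flexible multi-frame editing and refinement.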

Details Motivation: As text-to-image diffusion models see growing use in real-world creative domains, providing enhanced control, refinement, and consistent post-generation modification has become an important challenge. Existing methods lack the flexibility to maintain visual and narrative consistency across multiple frames, which limits creators' ability to craft and refine visual stories seamlessly. Method: The paper introduces Plot'n Polish, a zero-shot framework that addresses consistency and control in story visualization with text-to-image diffusion models. Result: Plot'n Polish supports both fine and coarse edits without sacrificing consistency, giving creators a more flexible way to create and refine visual stories. Conclusion: Plot'n Polish provides a zero-shot framework for consistent story generation with fine-grained control over story visualizations at various levels of detail. Abstract: Text-to-image diffusion models have demonstrated significant capabilities to generate diverse and detailed visuals in various domains, and story visualization is emerging as a particularly promising application. However, as their use in real-world creative domains increases, the need for providing enhanced control, refinement, and the ability to modify images post-generation in a consistent manner becomes an important challenge. Existing methods often lack the flexibility to apply fine or coarse edits while maintaining visual and narrative consistency across multiple frames, preventing creators from seamlessly crafting and refining their visual stories. To address these challenges, we introduce Plot'n Polish, a zero-shot framework that enables consistent story generation and provides fine-grained control over story visualizations at various levels of detail.

[119] TRUST-VL: An Explainable News Assistant for General Multimodal Misinformation Detection

Zehong Yan,Peng Qi,Wynne Hsu,Mong Li Lee

Main category: cs.CV

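TL;DR: TRUST-VL is a unified, explainable vision-language model for detecting many types of multimodal misinformation, and it performs strongly across multiple benchmarks.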

Details Motivation: Existing methods typically focus on a single type of distortion and struggle to generalize to unseen scenarios. Different distortion types share common reasoning capabilities while also requiring task-specific skills. Method: The paper introduces TRUST-VL, a unified and explainable vision-language model, and develops a Question-Aware Visual Amplifier module to extract task-specific visual features. Result: TRUST-VL achieves state-of-the-art performance on both in-domain and zero-shot benchmarks, while offering strong generalization and interpretability. Conclusion: Through joint training across distortion types, TRUST-VL achieves state-of-the-art performance and shows good generalization and interpretability in multimodal misinformation detection. Abstract: Multimodal misinformation, encompassing textual, visual, and cross-modal distortions, poses an increasing societal threat that is amplified by generative AI. Existing methods typically focus on a single type of distortion and struggle to generalize to unseen scenarios. In this work, we observe that different distortion types share common reasoning capabilities while also requiring task-specific skills. We hypothesize that joint training across distortion types facilitates knowledge sharing and enhances the model's ability to generalize. To this end, we introduce TRUST-VL, a unified and explainable vision-language model for general multimodal misinformation detection. TRUST-VL incorporates a novel Question-Aware Visual Amplifier module, designed to extract task-specific visual features. To support training, we also construct TRUST-Instruct, a large-scale instruction dataset containing 198K samples featuring structured reasoning chains aligned with human fact-checking workflows. Extensive experiments on both in-domain and zero-shot benchmarks demonstrate that TRUST-VL achieves state-of-the-art performance, while also offering strong generalization and interpretability.

[120] Virtual Fitting Room: Generating Arbitrarily Long Videos of Virtual Try-On from a Single Image -- Technical Preview

Jun-Kun Chen,Aayush Bansal,Minh Phuoc Vo,Yu-Xiong Wang

Main category: cs.CV

TL;DR: The Virtual Fitting Room (VFR) is a novel video generative model that efficiently generates long virtual try-on videos with smoothness and temporal consistency.

Details Motivation: The motivation is to overcome resource-intensive generation and lengthy data requirements while enabling arbitrarily long video creation for virtual try-on. Method: VFR uses an auto-regressive, segment-by-segment generation process, employing a prefix video condition for smoothness and an anchor video for consistency. Result: VFR generates minute-scale virtual try-on videos that maintain both local smoothness and global temporal consistency across various motions. Conclusion: VFR successfully generates long virtual try-on videos with local smoothness and global temporal consistency, marking it as a pioneering work in this field. Abstract: We introduce the Virtual Fitting Room (VFR), a novel video generative model that produces arbitrarily long virtual try-on videos. Our VFR models long video generation tasks as an auto-regressive, segment-by-segment generation process, eliminating the need for resource-intensive generation and lengthy video data, while providing the flexibility to generate videos of arbitrary length. The key challenges of this task are twofold: ensuring local smoothness between adjacent segments and maintaining global temporal consistency across different segments. To address these challenges, we propose our VFR framework, which ensures smoothness through a prefix video condition and enforces consistency with the anchor video -- a 360-degree video that comprehensively captures the human's wholebody appearance. Our VFR generates minute-scale virtual try-on videos with both local smoothness and global temporal consistency under various motions, making it a pioneering work in long virtual try-on video generation.