cs.CL

[1] Speech-Based Cognitive Screening: A Systematic Evaluation of LLM Adaptation Strategies

Fatemeh Taherinezhad,Mohamad Javad Momeni Nezhad,Sepehr Karimi,Sina Rashidi,Ali Zolnour,Maryam Dadkhah,Yasaman Haghbin,Hossein AzadMaleki,Maryam Zolnoori

Main category: cs.CL

TL;DR: This study evaluated strategies for adapting large language models to detect dementia from speech, showing that certain adaptation techniques can make open-weight models as effective as commercial systems.

Details

Motivation: Over half of US adults with Alzheimer's disease and related dementias remain undiagnosed, and scalable detection methods, such as speech-based screening, are needed. Method: The researchers evaluated nine text-only models and three multimodal audio-text models using the DementiaBank speech corpus. They employed various adaptation strategies, including in-context learning, reasoning-augmented prompting, parameter-efficient fine-tuning, and multimodal integration. Result: Class-centroid demonstrations achieved the highest in-context learning performance. Reasoning improved smaller models, and token-level fine-tuning generally produced the best scores. Adding a classification head significantly improved underperforming models. Fine-tuned audio-text multimodal models performed well but did not outperform the top text-only models. Conclusion: The study concluded that model adaptation strategies significantly impact the effectiveness of speech-based dementia detection, and appropriately adapted open-weight models can perform as well as or better than commercial systems. Abstract: Over half of US adults with Alzheimer disease and related dementias remain undiagnosed, and speech-based screening offers a scalable detection approach. We compared large language model adaptation strategies for dementia detection, evaluating nine text-only models and three multimodal audio-text models on recordings from the DementiaBank speech corpus. Adaptations included in-context learning with different demonstration selection policies, reasoning-augmented prompting, parameter-efficient fine-tuning, and multimodal integration. Results showed that class-centroid demonstrations achieved the highest in-context learning performance, reasoning improved smaller models, and token-level fine-tuning generally produced the best scores. Adding a classification head substantially improved underperforming models.
Among multimodal models, fine-tuned audio-text systems performed well but did not surpass the top text-only models. These findings highlight that model adaptation strategies, including demonstration selection, reasoning design, and tuning method, critically influence speech-based dementia detection, and that properly adapted open-weight models can match or exceed commercial systems.
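The abstract above names class-centroid demonstrations as the best in-context selection policy but does not spell out the procedure. A minimal sketch of one plausible reading follows, assuming each transcript already has an embedding and that the demonstration chosen for a class is the example closest to that class's mean embedding. The function names, the toy transcripts, and the 2-D embeddings are all hypothetical and are not taken from the paper.

```python
import math

def centroid(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def select_centroid_demos(examples, k=1):
    """examples: list of (text, label, embedding) tuples.

    Returns, for each label, the k texts whose embeddings lie closest
    (Euclidean distance) to that label's centroid.
    """
    by_label = {}
    for example in examples:
        by_label.setdefault(example[1], []).append(example)
    demos = {}
    for label, group in by_label.items():
        center = centroid([emb for _, _, emb in group])
        ranked = sorted(group, key=lambda ex: math.dist(ex[2], center))
        demos[label] = [text for text, _, _ in ranked[:k]]
    return demos

# Toy data: made-up transcripts with made-up 2-D embeddings.
toy = [
    ("transcript A", "control", [0.0, 0.0]),
    ("transcript B", "control", [1.0, 0.0]),
    ("transcript C", "control", [0.4, 0.1]),
    ("transcript D", "dementia", [5.0, 5.0]),
    ("transcript E", "dementia", [6.0, 5.0]),
    ("transcript F", "dementia", [5.4, 5.1]),
]
demos = select_centroid_demos(toy, k=1)
print(demos)
```

The selected texts would then be placed in the prompt as labeled demonstrations; how the paper formats that prompt is not stated in the abstract.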

[2] Enhancing Speech Large Language Models through Reinforced Behavior Alignment

Yansong Liu,Jiateng Li,Yuan Liu

Main category: cs.CL

TL;DR: This paper introduces Reinforced Behavior Alignment (RBA), a framework that improves the instruction-following abilities of speech-based LLMs using reinforcement learning and self-synthesis methods, achieving strong results across multiple tasks.

Details

Motivation: The paper addresses the performance gap between speech-based LLMs (SpeechLMs) and text-based LLMs in instruction-following, particularly due to the dynamic and variable nature of user speech. Method: The paper proposes a framework called Reinforced Behavior Alignment (RBA), which uses a self-synthesis methodology to generate high-fidelity alignment data through a powerful teacher LLM. RBA aligns the behavior of SpeechLMs with the teacher model using a reinforcement learning-based approach. Result: Experimental results show that the RBA method enhances the instruction-following capabilities of SpeechLMs, outperforming conventional distillation baselines. It also achieves state-of-the-art performance on tasks like spoken question answering and speech-to-text translation using only self-generated data. Conclusion: The paper concludes that the proposed Reinforced Behavior Alignment (RBA) framework effectively improves the instruction-following capabilities of SpeechLMs, outperforming conventional distillation baselines and achieving state-of-the-art performance on open benchmarks with self-generated data. Abstract: Recent advances in Large Language Models (LLMs) have spurred considerable research interest in extending their linguistic capabilities beyond text to other modalities, leading to the emergence of speech-based LLMs (SpeechLMs) capable of processing user requests in either speech or text format. However, owing to inter-modal discrepancies, these SpeechLMs still exhibit a significant performance gap compared to their text-based LLM counterparts in instruction-following, particularly when confronted with the dynamic and variable nature of user speech. To address this challenge, this paper introduces a framework termed Reinforced Behavior Alignment (RBA), designed to bolster the language generation proficiency of SpeechLMs.
Instead of relying on supervised fine-tuning from human annotations, RBA employs a self-synthesis methodology to generate extensive, high-fidelity alignment data with a powerful teacher LLM. The SpeechLM's behavior is then aligned with that of the teacher using a reinforcement learning-based approach. Experimental results demonstrate that this method effectively enhances the instruction-following capabilities of SpeechLMs, which outperform conventional distillation baselines. Crucially, we demonstrate that RBA can be seamlessly extended to tasks including spoken question answering and speech-to-text translation, attaining state-of-the-art performance on open benchmarks with only self-generated data.

[3] Multilevel Analysis of Cryptocurrency News using RAG Approach with Fine-Tuned Mistral Large Language Model

Bohdan M. Pavlyshenko

Main category: cs.CL

TL;DR: This paper presents a multilevel multitask approach for analyzing cryptocurrency news using a fine-tuned Mistral 7B large language model with retrieval-augmented generation. The model generates graph and text summaries, sentiment scores, and JSON representations of summaries, which are consolidated into comprehensive reports through hierarchical stacking. The results demonstrate the effectiveness of the approach in conducting informative qualitative and quantitative analytics of cryptocurrency news.

Details

Motivation: The motivation behind this study is to address the challenges associated with large language model hallucinations when analyzing cryptocurrency news by representing the news as a knowledge graph. The authors aim to provide a more accurate and insightful analysis of cryptocurrency news through a multilevel multitask approach. Method: The paper uses a fine-tuned Mistral 7B large language model with retrieval-augmented generation (RAG) for the analysis of cryptocurrency news. The model generates graph and text summaries, sentiment scores, and JSON representations of summaries. Hierarchical stacking is used to consolidate these summaries into comprehensive reports. The model is fine-tuned using 4-bit quantization with the PEFT/LoRA approach. Result: The results show that the fine-tuned Mistral 7B LLM can effectively conduct informative qualitative and quantitative analytics of cryptocurrency news. The combination of graph and text summaries provides complementary views of the news, and the hierarchical stacking approach allows for the consolidation of summaries into comprehensive reports. Conclusion: The study concludes that the use of a fine-tuned Mistral 7B LLM with RAG for multilevel cryptocurrency news analysis can effectively conduct both informative qualitative and quantitative analytics, providing significant insights into the cryptocurrency market. Abstract: In the paper, we consider multilevel multitask analysis of cryptocurrency news using a fine-tuned Mistral 7B large language model with retrieval-augmented generation (RAG). On the first level of analytics, the fine-tuned model generates graph and text summaries with sentiment scores as well as JSON representations of summaries. Higher levels perform hierarchical stacking that consolidates sets of graph-based and text-based summaries as well as summaries of summaries into comprehensive reports. The combination of graph and text summaries provides complementary views of cryptocurrency news. 
The model is fine-tuned with 4-bit quantization using the PEFT/LoRA approach. The representation of cryptocurrency news as a knowledge graph can essentially eliminate problems with large language model hallucinations. The obtained results demonstrate that the use of fine-tuned Mistral 7B LLM models for multilevel cryptocurrency news analysis can conduct informative qualitative and quantitative analytics, providing important insights.
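The abstract mentions LoRA fine-tuning without describing the update itself. As a rough illustration of the general LoRA idea (not this paper's training code), the sketch below computes the effective weight W + (alpha / r) * B A from a frozen base matrix and two small trainable matrices. The helper names and the tiny matrices are assumptions made for illustration, and the 4-bit quantization of the base weights is not modeled.

```python
def matmul(a, b):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def lora_effective_weight(w, a, b, alpha):
    """Return W + (alpha / r) * (B @ A), where r is the number of rows of A.

    w: frozen base weight (d_out x d_in)
    a: trainable down-projection (r x d_in)
    b: trainable up-projection (d_out x r)
    """
    rank = len(a)
    scale = alpha / rank
    delta = matmul(b, a)
    return [
        [w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
        for w_row, d_row in zip(w, delta)
    ]

w = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]  # frozen 2x3 base weight
a = [[1.0, 2.0, 3.0]]                    # rank-1 down-projection (1x3)
b = [[0.5], [0.0]]                       # rank-1 up-projection (2x1)
w_eff = lora_effective_weight(w, a, b, alpha=2.0)
print(w_eff)
```

Only `a` and `b` would be trained, which is why the approach is parameter-efficient: here that is 5 trainable values against 6 frozen ones, and the ratio shrinks quickly for realistic layer sizes.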

[4] The ProLiFIC dataset: Leveraging LLMs to Unveil the Italian Lawmaking Process

Matilde Contestabile,Chiara Ferrara,Alberto Giovannetti,Giovanni Parrillo,Andrea Vandin

Main category: cs.CL

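TL;DR: This study presents ProLiFIC, a new event log of the Italian lawmaking process, built by using large language models to structure unstructured data and providing a foundation for process mining in the legal domain.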

Details

Motivation: The development of process mining (PM) in the legal domain is limited by the accessibility and quality of data, so new comprehensive datasets are needed to advance research. Method: ProLiFIC was created from unstructured data from the Normattiva portal and structured using large language models (LLMs). Result: The authors created ProLiFIC, a comprehensive event log covering the Italian lawmaking process, and present preliminary analyses. Conclusion: ProLiFIC serves as a benchmark for process mining in the legal domain and advances the integration of PM and LLMs. Abstract: Process Mining (PM), initially developed for industrial and business contexts, has recently been applied to social systems, including legal ones. However, PM's efficacy in the legal domain is limited by the accessibility and quality of datasets. We introduce ProLiFIC (Procedural Lawmaking Flow in Italian Chambers), a comprehensive event log of the Italian lawmaking process from 1987 to 2022. Created from unstructured data from the Normattiva portal and structured using large language models (LLMs), ProLiFIC aligns with recent efforts in integrating PM with LLMs. We exemplify preliminary analyses and propose ProLiFIC as a benchmark for legal PM, fostering new developments.

[5] Multimodal Proposal for an AI-Based Tool to Increase Cross-Assessment of Messages

Alejandro Álvarez Castro,Joaquín Ordieres-Meré

Main category: cs.CL

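TL;DR: This paper proposes a novel transformer framework based on multimodal signals and hierarchical discourse structure to better represent and analyze the content of earnings calls.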

Details

Motivation: Most existing financial sentiment analysis systems rely on flat document-level or sentence-level models and fail to capture the interactive, layered discourse structure of earnings calls. Method: A two-stage transformer architecture is proposed: the first stage encodes multimodal content and discourse metadata at the node level using contrastive learning, and the second stage generates a global embedding for the entire call. Result: Experiments show that the resulting embeddings stably reflect affective tone, structural logic, and thematic alignment, and that the method generalizes to other high-stakes unscripted communication domains. Conclusion: The paper presents a new multimodal framework for generating semantically rich, structure-aware embeddings of earnings calls and shows its utility for financial reporting and other high-stakes unscripted communication domains. Abstract: Earnings calls represent a uniquely rich and semi-structured source of financial communication, blending scripted managerial commentary with unscripted analyst dialogue. Although recent advances in financial sentiment analysis have integrated multi-modal signals, such as textual content and vocal tone, most systems rely on flat document-level or sentence-level models, failing to capture the layered discourse structure of these interactions. This paper introduces a novel multi-modal framework designed to generate semantically rich and structurally aware embeddings of earnings calls, by encoding them as hierarchical discourse trees. Each node, comprising either a monologue or a question-answer pair, is enriched with emotional signals derived from text, audio, and video, as well as structured metadata including coherence scores, topic labels, and answer coverage assessments. A two-stage transformer architecture is proposed: the first encodes multi-modal content and discourse metadata at the node level using contrastive learning, while the second synthesizes a global embedding for the entire conference. Experimental results reveal that the resulting embeddings form stable, semantically meaningful representations that reflect affective tone, structural logic, and thematic alignment. Beyond financial reporting, the proposed system generalizes to other high-stakes unscripted communicative domains such as tele-medicine, education, and political discourse, offering a robust and explainable approach to multi-modal discourse representation.
This approach offers practical utility for downstream tasks such as financial forecasting and discourse evaluation, while also providing a generalizable method applicable to other domains involving high-stakes communication.

Paul Blum,Enrico Liscio,Ruixuan Zhang,Caroline Figueroa,Pradeep K. Murukannaiah

Main category: cs.CL

TL;DR: This study proposes Early-SIB, a model that predicts adolescent suicidal ideation through social media activity before it is explicitly expressed.

Details

Motivation: Suicide is a leading cause of death among adolescents, and many cases go undetected due to a lack of contact with mental health services. Social media provides an opportunity to identify at-risk individuals early. Method: The study introduces Early-SIB, a transformer-based model that processes forum posts to predict future suicidal ideation and behavior without relying on explicit self-disclosure. Result: The Early-SIB model achieved a balanced accuracy of 0.73 in predicting future suicidal ideation and behavior on a Dutch youth forum. Conclusion: The study concludes that social media can be utilized to predict suicidal ideation and behavior among adolescents before they explicitly express it online. Abstract: Suicide is a leading cause of death among adolescents (12-18), yet predicting it remains a significant challenge. Many cases go undetected due to a lack of contact with mental health services. Social media, however, offers a unique opportunity, as young people often share their thoughts and struggles online in real time. In this work, we propose a novel task and method to approach it: predicting suicidal ideation and behavior (SIB) from forum posts before an adolescent explicitly expresses suicidal ideation on an online forum. This predictive framing, where no self-disclosure is used as input at any stage, remains largely unexplored in the suicide prediction literature. To this end, we introduce Early-SIB, a transformer-based model that sequentially processes the posts a user writes and engages with to predict whether they will write a SIB post. Our model achieves a balanced accuracy of 0.73 for predicting future SIB on a Dutch youth forum, demonstrating that such tools can offer a meaningful addition to traditional methods.
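The abstract reports balanced accuracy rather than plain accuracy, a common choice when the positive class is rare. As a reminder of how the metric is computed, here is a small worked sketch for the binary case: the mean of sensitivity and specificity. The labels and predictions below are invented for illustration and have no connection to the paper's data or its 0.73 result.

```python
def balanced_accuracy(y_true, y_pred):
    """Mean of sensitivity (recall on class 1) and specificity (recall on class 0)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return (sensitivity + specificity) / 2

# Invented toy labels: 2 positives, 8 negatives.
y_true = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
y_pred = [1, 0, 0, 0, 0, 0, 0, 0, 1, 1]

plain_accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(plain_accuracy)                     # 7 of 10 correct
print(balanced_accuracy(y_true, y_pred))  # (0.5 + 0.75) / 2
```

On this toy data plain accuracy is 0.7 while balanced accuracy is 0.625, because missing one of only two positives costs half of the sensitivity term.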

[7] Real-Time Detection of Hallucinated Entities in Long-Form Generation

Oscar Obeso,Andy Arditi,Javier Ferrando,Joshua Freeman,Cameron Holmes,Neel Nanda

Main category: cs.CL

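TL;DR: This paper proposes a new, cheap, and scalable method for detecting hallucinations in large language models; it targets entity-level hallucinations and performs well in practical settings.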

Details

Motivation: Current hallucination detection methods are impractical for real-world use because they are either limited to short factual queries or require costly external verification. Method: The paper proposes a method targeting entity-level hallucinations that uses web search to annotate model responses and trains effective hallucination classifiers. Result: The proposed classifiers consistently outperform baselines on long-form responses, including more expensive methods such as semantic entropy. Conclusion: The approach performs well for real-world hallucination detection and is scalable. Abstract: Large language models are now routinely used in high-stakes applications where hallucinations can cause serious harm, such as medical consultations or legal advice. Existing hallucination detection methods, however, are impractical for real-world use, as they are either limited to short factual queries or require costly external verification. We present a cheap, scalable method for real-time identification of hallucinated tokens in long-form generations, and scale it effectively to 70B parameter models. Our approach targets entity-level hallucinations (e.g., fabricated names, dates, citations) rather than claim-level ones, thereby naturally mapping to token-level labels and enabling streaming detection. We develop an annotation methodology that leverages web search to annotate model responses with grounded labels indicating which tokens correspond to fabricated entities. This dataset enables us to train effective hallucination classifiers with simple and efficient methods such as linear probes. Evaluating across four model families, our classifiers consistently outperform baselines on long-form responses, including more expensive methods such as semantic entropy (e.g., AUC 0.90 vs 0.71 for Llama-3.3-70B), and are also an improvement in short-form question-answering settings. Moreover, despite being trained only with entity-level labels, our probes effectively detect incorrect answers in mathematical reasoning tasks, indicating generalization beyond entities. While our annotation methodology is expensive, we find that annotated responses from one model can be used to train effective classifiers on other models; accordingly, we publicly release our datasets to facilitate reuse.
Overall, our work suggests a promising new approach for scalable, real-world hallucination detection.
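The abstract says the classifiers can be as simple as linear probes trained on token-level labels. The sketch below shows what a linear probe is in the generic sense: a logistic-regression layer fit on per-token activation vectors. It assumes activations are already extracted; the 2-D vectors, the labels, and the hyperparameters are invented stand-ins, and nothing here reproduces the paper's training setup.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_linear_probe(activations, labels, lr=0.5, epochs=200):
    """Fit logistic regression with plain SGD; label 1 marks a fabricated-entity token."""
    w = [0.0] * len(activations[0])
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(activations, labels):
            p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
            grad = p - y  # gradient of the log loss with respect to the logit
            w = [wi - lr * grad * xi for wi, xi in zip(w, x)]
            b -= lr * grad
    return w, b

def probe_score(w, b, x):
    """Probability-like score that the token with activation x is hallucinated."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)

# Invented, linearly separable toy activations (real ones have thousands of dimensions).
toy_activations = [[1.0, 0.2], [0.8, 0.1], [0.9, 0.3], [-1.0, 0.1], [-0.7, 0.2], [-0.9, 0.0]]
toy_labels = [0, 0, 0, 1, 1, 1]

w, b = train_linear_probe(toy_activations, toy_labels)
flagged = [probe_score(w, b, x) > 0.5 for x in toy_activations]
print(flagged)
```

Because scoring a token is a single dot product on an activation the model has already computed, a probe of this kind can in principle run alongside generation, which is consistent with the streaming detection the abstract describes.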

[8] Topic Identification in LLM Input-Output Pairs through the Lens of Information Bottleneck

Igor Halperin

Main category: cs.CL

TL;DR: This paper proposes UDIB, a new method for detecting confabulations in LLMs by creating a shared topic representation that is optimized for information-theoretic analysis rather than spatial proximity.

Details

Motivation: To bridge the gap in frameworks designed to detect intrinsic faithfulness hallucinations in LLMs, where topics are optimized for spatial proximity and not for the downstream information-theoretic analysis. Method: Transforming the Deterministic Information Bottleneck (DIB) method into a practical algorithm for high-dimensional data by substituting its intractable KL divergence term with a computationally efficient upper bound to generate a shared topic representation. Result: The development of UDIB, an entropy-regularized and robustified version of K-means that inherently favors a parsimonious number of informative clusters, which can be used for joint clustering of LLM prompt and response embeddings. Conclusion: UDIB offers a more sensitive tool for detecting confabulations in LLMs by creating a shared topic representation that is structured to be maximally informative about the prompt-response relationship. Abstract: Large Language Models (LLMs) are prone to critical failure modes, including intrinsic faithfulness hallucinations (also known as confabulations), where a response deviates semantically from the provided context. Frameworks designed to detect this, such as Semantic Divergence Metrics (SDM), rely on identifying latent topics shared between prompts and responses, typically by applying geometric clustering to their sentence embeddings. This creates a disconnect, as the topics are optimized for spatial proximity, not for the downstream information-theoretic analysis. In this paper, we bridge this gap by developing a principled topic identification method grounded in the Deterministic Information Bottleneck (DIB) for geometric clustering. Our key contribution is to transform the DIB method into a practical algorithm for high-dimensional data by substituting its intractable KL divergence term with a computationally efficient upper bound.
The resulting method, which we dub UDIB, can be interpreted as an entropy-regularized and robustified version of K-means that inherently favors a parsimonious number of informative clusters. By applying UDIB to the joint clustering of LLM prompt and response embeddings, we generate a shared topic representation that is not merely spatially coherent but is fundamentally structured to be maximally informative about the prompt-response relationship. This provides a superior foundation for the SDM framework and offers a novel, more sensitive tool for detecting confabulations.
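To make the phrase "entropy-regularized version of K-means" concrete, the sketch below uses one generic form of the idea: each point is assigned to the cluster minimizing squared distance minus `lam` times the log of the cluster's current occupancy, so sparsely populated clusters are penalized and can vanish. This is an illustrative assumption, not the UDIB objective; the paper's upper bound on the KL term and its robustification are not reproduced, and `lam`, the function name, and the toy points are hypothetical.

```python
import math

def entropy_regularized_kmeans(points, centers, lam=1.0, iters=20):
    """K-means variant whose assignment cost is ||x - mu_c||^2 - lam * log q(c).

    q(c) is the fraction of points currently in cluster c. Empty clusters are dropped.
    Returns (centers, labels).
    """
    weights = [1.0 / len(centers)] * len(centers)
    labels = []
    for _ in range(iters):
        # Assignment step with the occupancy penalty.
        labels = []
        for p in points:
            costs = [
                sum((pi - ci) ** 2 for pi, ci in zip(p, c)) - lam * math.log(q)
                for c, q in zip(centers, weights)
            ]
            labels.append(costs.index(min(costs)))
        # Update step: keep only non-empty clusters, then recompute means and occupancies.
        used = sorted(set(labels))
        centers, weights = [], []
        for c in used:
            members = [p for p, label in zip(points, labels) if label == c]
            centers.append([sum(col) / len(members) for col in zip(*members)])
            weights.append(len(members) / len(points))
        labels = [used.index(label) for label in labels]
    return centers, labels

# Toy 1-D data: four nearby points and one mild outlier that starts with its own center.
points = [[0.0], [0.1], [0.2], [0.3], [1.0]]
init = [[0.15], [1.0]]

centers_reg, labels_reg = entropy_regularized_kmeans(points, init, lam=1.0)
centers_plain, labels_plain = entropy_regularized_kmeans(points, init, lam=0.0)
print(len(centers_reg), len(centers_plain))
```

With `lam=0.0` the procedure reduces to ordinary K-means and keeps both clusters; with `lam=1.0` the singleton cluster is absorbed in the second iteration, which is the parsimony effect the abstract attributes to entropy regularization.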

[9] QuesGenie: Intelligent Multimodal Question Generation

Ahmed Mubarak,Amna Ahmed,Amira Nasser,Aya Mohamed,Fares El-Sadek,Mohammed Ahmed,Ahmed Salah,Youssef Sobhy

Main category: cs.CL

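TL;DR: This paper introduces a multimodal question generation system that automatically generates diverse questions from a variety of educational resources, aiming to address the shortage of practice materials matched to current learning resources.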

Details

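Motivation: In today's information-rich era, learners have access to abundant educational resources but lack practice materials matched to those resources, which is a significant challenge. This project aims to fill that gap. Method: The system comprises four main components: multimodal input handling, question generation, reinforcement learning from human feedback (RLHF), and an end-to-end interactive interface. Result: The authors developed a multimodal question generation system that automatically generates different question types from various content formats. Conclusion: The paper presents a multimodal question generation system that lays the foundation for automated, scalable, and intelligent question generation while balancing resource efficiency, robust functionality, and a smooth user experience. Abstract: In today's information-rich era, learners have access to abundant educational resources, but the lack of practice materials tailored to these resources presents a significant challenge. This project addresses that gap by developing a multi-modal question generation system that can automatically generate diverse question types from various content formats. The system features four major components: multi-modal input handling, question generation, reinforcement learning from human feedback (RLHF), and an end-to-end interactive interface. This project lays the foundation for automated, scalable, and intelligent question generation, carefully balancing resource efficiency, robust functionality and a smooth user experience.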

[10] AR$^2$: Adversarial Reinforcement Learning for Abstract Reasoning in Large Language Models

Cheng-Kai Yeh,Hsing-Wang Lee,Chung-Hung Kuo,Hen-Hsen Huang

Main category: cs.CL

TL;DR: This paper proposes AR$^2$, a novel adversarial reinforcement learning framework designed to enhance abstraction abilities in LLMs, resulting in improved performance on complex programming tasks by focusing on computational pattern recognition and generalization.

Details

Motivation: The motivation is to address the lack of explicit training for abstraction in existing LLMs for code generation, despite its importance as a foundational skill in computer science for problem-solving and generalization. Method: The study introduces AR$^2$, a framework that uses a teacher model to transform kernel problems into narrative-rich descriptions and trains a student coding model to extract underlying computational kernels using adversarial reinforcement learning. Result: Experimental results show that AR$^2$ significantly improves the accuracy of student models on challenging, previously unseen programming tasks. Conclusion: The study concludes that AR$^2$ effectively enhances the abstraction abilities of LLMs, highlighting abstraction as a crucial skill for improving model generalization in solving unseen programming tasks. Abstract: Abstraction--the ability to recognize and distill essential computational patterns from complex problem statements--is a foundational skill in computer science, critical both for human problem-solvers and coding-oriented large language models (LLMs). Despite recent advances in training LLMs for code generation using reinforcement learning (RL), most existing approaches focus primarily on superficial pattern recognition, overlooking explicit training for abstraction. In this study, we propose AR$^2$ (Adversarial Reinforcement Learning for Abstract Reasoning), a novel framework explicitly designed to enhance the abstraction abilities of LLMs. AR$^2$ employs a teacher model to transform kernel problems into narrative-rich, challenging descriptions without changing their fundamental logic. Simultaneously, a student coding model is trained to solve these complex narrative problems by extracting their underlying computational kernels. 
Experimental results demonstrate that AR$^2$ substantially improves the student model's accuracy on previously unseen, challenging programming tasks, underscoring abstraction as a key skill for enhancing LLM generalization.

[11] Improving Factuality in LLMs via Inference-Time Knowledge Graph Construction

Shanglin Wu,Lihui Liu,Jinho D. Choi,Kai Shu

Main category: cs.CL

TL;DR: A novel framework dynamically constructs knowledge graphs during inference to improve the factual accuracy and interpretability of Large Language Models.

Details

Motivation: Large Language Models struggle with factual consistency, and existing Retrieval-Augmented Generation methods are limited in supporting compositional reasoning and identifying factual inconsistencies. Method: The method involves extracting a seed knowledge graph from the question, expanding it using the model's latent knowledge, and refining it with external retrieval to enhance factual coverage and accuracy. Result: The approach demonstrated consistent improvements in factual accuracy, answer precision, and interpretability over baseline prompting and static knowledge graph-augmented methods on three factual QA benchmarks. Conclusion: The proposed framework of dynamically constructing and expanding knowledge graphs during inference enhances the factuality of Large Language Models in a structured, interpretable, and scalable manner. Abstract: Large Language Models (LLMs) often struggle with producing factually consistent answers due to limitations in their parametric memory. Retrieval-Augmented Generation (RAG) methods address this issue by incorporating external knowledge from trusted sources at inference time. However, such methods typically treat knowledge as unstructured text, which limits their ability to support compositional reasoning and identify factual inconsistencies. To overcome these limitations, we propose a novel framework that dynamically constructs and expands knowledge graphs (KGs) during inference, integrating both internal knowledge extracted from LLMs and external information retrieved from external sources. Our method begins by extracting a seed KG from the question via prompting, followed by iterative expansion using the LLM's latent knowledge. The graph is then selectively refined through external retrieval, enhancing factual coverage and correcting inaccuracies. 
We evaluate our approach on three diverse factual QA benchmarks, demonstrating consistent improvements in factual accuracy, answer precision, and interpretability over baseline prompting and static KG-augmented methods. Our findings suggest that inference-time KG construction is a promising direction for enhancing LLM factuality in a structured, interpretable, and scalable manner.
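The abstract describes three stages: extract a seed KG from the question, expand it from the model's latent knowledge, and refine it with external retrieval. A minimal control-flow sketch of that pipeline follows. The three callables stand in for LLM prompting and retrieval and are hypothetical stubs; the rule that a retrieved triple overrides a model-generated triple with the same subject and relation is an assumption made for illustration, not the paper's refinement procedure.

```python
def build_inference_time_kg(question, extract_seed, expand, retrieve, rounds=2):
    """Seed -> iterative expansion -> retrieval-based refinement over (s, r, o) triples."""
    kg = set(extract_seed(question))
    for _ in range(rounds):
        kg |= set(expand(kg))  # add triples proposed from the model's latent knowledge
    # Assumed refinement rule: retrieved evidence wins for a given (subject, relation).
    retrieved = {(s, r): o for s, r, o in retrieve(kg)}
    refined = {(s, r, retrieved.get((s, r), o)) for s, r, o in kg}
    refined |= {(s, r, o) for (s, r), o in retrieved.items()}
    return refined

# Hypothetical stubs standing in for an LLM and a retriever.
def toy_seed(question):
    return [("Hamlet", "written_by", "William Shakespeare")]

def toy_expand(kg):
    if ("Hamlet", "written_by", "William Shakespeare") in kg:
        # Deliberately wrong, to show a latent-knowledge error being corrected.
        return [("William Shakespeare", "born_in", "London")]
    return []

def toy_retrieve(kg):
    return [("William Shakespeare", "born_in", "Stratford-upon-Avon")]

kg = build_inference_time_kg(
    "Where was the author of Hamlet born?", toy_seed, toy_expand, toy_retrieve
)
print(sorted(kg))
```

Keeping the graph as explicit triples is what makes the two-hop question answerable by following edges, and it leaves a visible record of which fact was corrected by retrieval.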

[12] ResearchPulse: Building Method-Experiment Chains through Multi-Document Scientific Inference

Qi Chen,Jingxuan Wei,Zhuoya Yao,Haiguang Wang,Gaowei Wu,Bihui Yu,Siyuan Li,Cheng Tan

Main category: cs.CL

TL;DR: This paper introduces ResearchPulse, an agent-based framework for multi-document scientific inference, which reconstructs research development chains by aligning motivations, methodologies, and results across related papers. The framework outperforms baselines like GPT-4o in key evaluation metrics.

Details

Motivation: Understanding how scientific ideas evolve requires structured, cross-document reasoning over thematically related research. This work aims to formalize multi-document scientific inference by extracting and aligning motivation, methodology, and experimental results across related papers to reconstruct research development chains. Method: The authors introduce ResearchPulse, an agent-based framework that integrates instruction planning, scientific content extraction, and structured visualization. It includes three coordinated agents: a Plan Agent for task decomposition, a Mmap-Agent for constructing motivation-method mind maps, and a Lchart-Agent for synthesizing experimental line charts. They evaluate the framework using ResearchPulse-Bench, a citation-aware benchmark of annotated paper clusters. Result: The ResearchPulse framework achieves superior performance compared to strong baselines like GPT-4o in the areas of semantic alignment, structural consistency, and visual fidelity on the ResearchPulse-Bench benchmark. Conclusion: The proposed ResearchPulse framework outperforms strong baselines like GPT-4o in semantic alignment, structural consistency, and visual fidelity, despite using 7B-scale agents. Abstract: Understanding how scientific ideas evolve requires more than summarizing individual papers: it demands structured, cross-document reasoning over thematically related research. In this work, we formalize multi-document scientific inference, a new task that extracts and aligns motivation, methodology, and experimental results across related papers to reconstruct research development chains. This task introduces key challenges, including temporally aligning loosely structured methods and standardizing heterogeneous experimental tables. We present ResearchPulse, an agent-based framework that integrates instruction planning, scientific content extraction, and structured visualization.
It consists of three coordinated agents: a Plan Agent for task decomposition, a Mmap-Agent that constructs motivation-method mind maps, and a Lchart-Agent that synthesizes experimental line charts. To support this task, we introduce ResearchPulse-Bench, a citation-aware benchmark of annotated paper clusters. Experiments show that our system, despite using 7B-scale agents, consistently outperforms strong baselines like GPT-4o in semantic alignment, structural consistency, and visual fidelity. The dataset is available at https://huggingface.co/datasets/ResearchPulse/ResearchPulse-Bench.

[13] NoteBar: An AI-Assisted Note-Taking System for Personal Knowledge Management

Josh Wisoff,Yao Tang,Zhengyu Fang,Jordan Guzman,YuTang Wang,Alex Yu

Main category: cs.CL

TL;DR: NoteBar is an AI-assisted note-taking tool that efficiently organizes notes using persona information and language models, supported by a new dataset for research and evaluation.

Details

Motivation: Existing AI-assisted note-taking tools struggle with efficiency, prompting the need for a more practical and effective solution. Method: NoteBar leverages persona information and efficient language models to organize notes into categories and support user workflows, supported by a novel persona-conditioned dataset. Result: NoteBar enables cost-effective, interactive note-taking without reliance on heavy infrastructure. Conclusion: NoteBar and its dataset offer a scalable and extensible foundation for advancing AI-assisted personal knowledge management. Abstract: Note-taking is a critical practice for capturing, organizing, and reflecting on information in both academic and professional settings. The recent success of large language models has accelerated the development of AI-assisted tools, yet existing solutions often struggle with efficiency. We present NoteBar, an AI-assisted note-taking tool that leverages persona information and efficient language models to automatically organize notes into multiple categories and better support user workflows. To support research and evaluation in this space, we further introduce a novel persona-conditioned dataset of 3,173 notes and 8,494 annotated concepts across 16 MBTI personas, offering both diversity and semantic richness for downstream tasks. Finally, we demonstrate that NoteBar can be deployed in a practical and cost-effective manner, enabling interactive use without reliance on heavy infrastructure. Together, NoteBar and its accompanying dataset provide a scalable and extensible foundation for advancing AI-assisted personal knowledge management.

[14] E-ARMOR: Edge case Assessment and Review of Multilingual Optical Character Recognition

Aryan Gupta,Anupam Purwar

Main category: cs.CL

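TL;DR: The study compares traditional OCR systems and LVLMs for edge deployment and finds that traditional OCR systems remain the better choice for multilingual, real-world images.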

Details

Motivation: OCR still faces significant challenges on multilingual, noisy, and diverse real-world images, and the rise of Large Vision-Language Models (LVLMs) has prompted interest in their ability to go beyond fixed OCR pipelines. Method: The authors introduce Sprinklr-Edge-OCR and evaluate it alongside five state-of-the-art LVLMs and a second traditional OCR system (SuryaOCR) on a large-scale dataset covering 54 languages. Result: Qwen achieved the highest precision (0.54), while Sprinklr-Edge-OCR delivered the best F1 score (0.46) and outperformed the other models in efficiency, with an average processing time of 0.17 seconds per image and a cost of only 0.006 USD per 1,000 images. Conclusion: Traditional OCR systems retain an advantage for edge deployment, particularly in compute requirements, latency, and cost, which makes them the better choice for multilingual, noisy, and diverse real-world images. Abstract: Optical Character Recognition (OCR) in multilingual, noisy, and diverse real-world images remains a significant challenge. With the rise of Large Vision-Language Models (LVLMs), there is growing interest in their ability to generalize and reason beyond fixed OCR pipelines. In this work, we introduce Sprinklr-Edge-OCR, a novel OCR system specifically optimized for edge deployment in resource-constrained environments. We present a large-scale comparative evaluation of five state-of-the-art LVLMs (InternVL, Qwen, GOT OCR, LLaMA, MiniCPM) and two traditional OCR systems (Sprinklr-Edge-OCR, SuryaOCR) on a proprietary, doubly hand-annotated dataset of multilingual (54 languages) images. Our benchmark covers a broad range of metrics including accuracy, semantic consistency, language coverage, computational efficiency (latency, memory, GPU usage), and deployment cost. To better reflect real-world applicability, we also conducted edge case deployment analysis, evaluating model performance on CPU-only environments. Among the results, Qwen achieved the highest precision (0.54), while Sprinklr-Edge-OCR delivered the best overall F1 score (0.46) and outperformed others in efficiency, processing images 35x faster (0.17 seconds per image on average) and at less than 0.01 of the cost (0.006 USD per 1,000 images) compared to LVLMs. Our findings demonstrate that the optimal OCR systems for edge deployment are the traditional ones even in the era of LLMs due to their low compute requirements, low latency, and very high affordability.

[15] Breaking the Mirror: Activation-Based Mitigation of Self-Preference in LLM Evaluators

Dani Roytburg,Matthew Bozoukov,Matthew Nguyen,Jou Barzdukas,Simon Fu,Narmeen Oozeer

Main category: cs.CL

TL;DR: This paper explores using lightweight steering vectors to reduce self-preference bias in large language models, showing significant reduction in unjustified bias but highlighting instability in handling legitimate self-preference.

Details

Motivation: Large language models (LLMs) are increasingly used as automated evaluators, but they exhibit self-preference bias, favoring their own outputs over those of other models. This undermines fairness and reliability in evaluation pipelines, especially in tasks like preference tuning and model routing. The study aims to address this issue by exploring lightweight steering vectors as a mitigation strategy. Method: The authors introduce a curated dataset to distinguish between justified and unjustified self-preference bias. They construct steering vectors using two methods: Contrastive Activation Addition (CAA) and an optimization-based approach, then evaluate their effectiveness in mitigating self-preference bias. Result: The results show that steering vectors can reduce unjustified self-preference bias by up to 97%, outperforming prompting and direct preference optimization baselines. However, they are unstable when dealing with legitimate self-preference and unbiased agreement, suggesting that self-preference spans multiple or nonlinear directions. Conclusion: Steering vectors can effectively reduce unjustified self-preference bias in large language models (LLMs) without retraining, but they are unstable in handling legitimate self-preference and unbiased agreement, indicating the need for more robust interventions. Abstract: Large language models (LLMs) increasingly serve as automated evaluators, yet they suffer from "self-preference bias": a tendency to favor their own outputs over those of other models. This bias undermines fairness and reliability in evaluation pipelines, particularly for tasks like preference tuning and model routing. We investigate whether lightweight steering vectors can mitigate this problem at inference time without retraining. 
We introduce a curated dataset that separates justified from unjustified examples of self-preference, and we construct steering vectors using two methods: Contrastive Activation Addition (CAA) and an optimization-based approach. Our results show that steering vectors can reduce unjustified self-preference bias by up to 97%, substantially outperforming prompting and direct preference optimization baselines. Yet steering vectors are unstable on legitimate self-preference and unbiased agreement, implying that self-preference spans multiple or nonlinear directions. This underscores both their promise and limits as safeguards for LLM-as-judges and motivates more robust interventions.
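To make the CAA idea concrete, here is a minimal sketch of how a contrastive steering vector is commonly built: take the mean activation over one class of examples, subtract the mean over the contrasting class, and add a scaled copy of the difference to a hidden state at inference time. The function names, the toy three-dimensional activations, and the scaling factor `alpha` are illustrative assumptions, not details taken from the paper.

```python
from typing import List, Sequence

Vector = List[float]


def mean_vector(rows: Sequence[Sequence[float]]) -> Vector:
    """Element-wise mean of equally sized activation vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def caa_steering_vector(
    biased_acts: Sequence[Sequence[float]],
    unbiased_acts: Sequence[Sequence[float]],
) -> Vector:
    """Mean activation difference (unbiased minus biased).

    Assumption: activations come from one fixed layer and token position.
    """
    mu_unbiased = mean_vector(unbiased_acts)
    mu_biased = mean_vector(biased_acts)
    return [u - b for u, b in zip(mu_unbiased, mu_biased)]


def apply_steering(hidden: Sequence[float], steering: Sequence[float], alpha: float = 1.0) -> Vector:
    """Shift a hidden state along the steering direction (alpha is hypothetical)."""
    return [h + alpha * s for h, s in zip(hidden, steering)]


# Toy activations standing in for real residual-stream vectors.
biased = [[1.0, 0.0, 2.0], [3.0, 0.0, 2.0]]
unbiased = [[0.0, 1.0, 2.0], [0.0, 3.0, 2.0]]

steer = caa_steering_vector(biased, unbiased)
steered_state = apply_steering([1.0, 1.0, 1.0], steer, alpha=0.5)
```

A single mean-difference direction like this assumes the behavior is roughly linear in activation space, which is the assumption the abstract's instability finding calls into question.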

[16] Semantic Analysis of SNOMED CT Concept Co-occurrences in Clinical Documentation using MIMIC-IV

Ali Noori,Somya Mohanty,Prashanti Manda

Main category: cs.CL

TL;DR: This study explores the relationship between co-occurrence patterns and semantic similarity of SNOMED CT concepts using the MIMIC-IV database, showing that semantic embeddings can enhance clinical documentation and decision-making.

Details

Motivation: The motivation is to understand how clinical concepts relate through co-occurrence and semantic similarity, aiming to improve the interoperability and analysis of unstructured clinical notes. Method: The study uses the MIMIC-IV database and leverages techniques such as Normalized Pointwise Mutual Information (NPMI) along with pretrained embeddings like ClinicalBERT and BioBERT to analyze SNOMED CT concept co-occurrence patterns and semantic similarity. Result: Analyses revealed a weak correlation between co-occurrence and semantic similarity, but embeddings captured meaningful clinical associations. Embedding-based suggestions often matched later documented concepts, and clustering of concept embeddings mapped to coherent clinical themes and patient phenotypes. Conclusion: The study concludes that co-occurrence statistics and semantic embeddings have complementary value in improving clinical documentation completeness, uncovering latent clinical relationships, and informing decision support and phenotyping applications. Abstract: Clinical notes contain rich clinical narratives but their unstructured format poses challenges for large-scale analysis. Standardized terminologies such as SNOMED CT improve interoperability, yet understanding how concepts relate through co-occurrence and semantic similarity remains underexplored. In this study, we leverage the MIMIC-IV database to investigate the relationship between SNOMED CT concept co-occurrence patterns and embedding-based semantic similarity. Using Normalized Pointwise Mutual Information (NPMI) and pretrained embeddings (e.g., ClinicalBERT, BioBERT), we examine whether frequently co-occurring concepts are also semantically close, whether embeddings can suggest missing concepts, and how these relationships evolve temporally and across specialties. 
Our analyses reveal that while co-occurrence and semantic similarity are weakly correlated, embeddings capture clinically meaningful associations not always reflected in documentation frequency. Embedding-based suggestions frequently matched concepts later documented, supporting their utility for augmenting clinical annotations. Clustering of concept embeddings yielded coherent clinical themes (symptoms, labs, diagnoses, cardiovascular conditions) that map to patient phenotypes and care patterns. Finally, co-occurrence patterns linked to outcomes such as mortality and readmission demonstrate the practical utility of this approach. Collectively, our findings highlight the complementary value of co-occurrence statistics and semantic embeddings in improving documentation completeness, uncovering latent clinical relationships, and informing decision support and phenotyping applications.
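As an illustration of the co-occurrence statistic named above, the following sketch computes NPMI over per-note concept sets using the standard definition, NPMI(x, y) = ln(p(x, y) / (p(x) p(y))) / (-ln p(x, y)). The concept labels and the note-level counting unit are assumptions made for this example; the paper's actual preprocessing of MIMIC-IV is not described in the abstract.

```python
import math
from collections import Counter
from itertools import combinations
from typing import Dict, Iterable, List, Tuple


def npmi_scores(notes: List[Iterable[str]]) -> Dict[Tuple[str, str], float]:
    """NPMI for every concept pair that co-occurs in at least one note."""
    n = len(notes)
    single: Counter = Counter()
    pair: Counter = Counter()
    for concepts in notes:
        uniq = sorted(set(concepts))  # count each concept once per note
        single.update(uniq)
        pair.update(combinations(uniq, 2))

    scores: Dict[Tuple[str, str], float] = {}
    for (a, b), count in pair.items():
        p_ab = count / n
        if p_ab == 1.0:
            # Both concepts appear in every note; NPMI is taken as 1 by convention.
            scores[(a, b)] = 1.0
            continue
        pmi = math.log(p_ab / ((single[a] / n) * (single[b] / n)))
        scores[(a, b)] = pmi / -math.log(p_ab)
    return scores


# Hypothetical concept labels standing in for SNOMED CT concepts.
toy_notes = [
    ["chest_pain", "dyspnea"],
    ["chest_pain", "dyspnea"],
    ["chest_pain"],
    ["fever"],
]
toy_scores = npmi_scores(toy_notes)
```

NPMI ranges from -1 to 1, so pairs can be compared on the same scale as an embedding cosine similarity when testing how strongly the two signals correlate.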

[17] MLSD: A Novel Few-Shot Learning Approach to Enhance Cross-Target and Cross-Domain Stance Detection

Parush Gera,Tempestt Neal

Main category: cs.CL

TL;DR: MLSD improves cross-target and cross-domain stance detection by using metric learning with triplet loss to capture semantic similarities and differences between targets.

Details

Motivation: The motivation is to enhance domain adaptation in stance detection across different domains and targets. Method: MLSD uses metric learning with triplet loss to create a discriminative embedding space for capturing semantic similarities and differences between stance targets. Result: MLSD showed statistically significant improvement in stance detection performance across six widely used models in multiple cross-target and cross-domain scenarios. Conclusion: MLSD is an effective method for cross-target and cross-domain stance detection, yielding significant performance gains. Abstract: We present a novel approach for stance detection across domains and targets: Metric Learning-Based Few-Shot Learning for Cross-Target and Cross-Domain Stance Detection (MLSD). MLSD utilizes metric learning with triplet loss to capture semantic similarities and differences between stance targets, enhancing domain adaptation. By constructing a discriminative embedding space, MLSD allows a cross-target or cross-domain stance detection model to acquire useful examples from new target domains. We evaluate MLSD in multiple cross-target and cross-domain scenarios across two datasets, showing statistically significant improvement in stance detection performance across six widely used stance detection models.
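The triplet loss mentioned above can be sketched as follows. This is the textbook formulation, max(0, d(a, p) - d(a, n) + margin), which pulls an anchor toward a positive example and pushes it away from a negative one. The Euclidean distance, the margin of 1.0, and the two-dimensional toy embeddings are assumptions for illustration; the abstract does not state MLSD's exact choices.

```python
import math
from typing import Sequence


def euclidean(u: Sequence[float], v: Sequence[float]) -> float:
    """Euclidean distance between two embedding vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def triplet_loss(
    anchor: Sequence[float],
    positive: Sequence[float],
    negative: Sequence[float],
    margin: float = 1.0,
) -> float:
    """Hinge-style triplet loss; zero once the negative is farther than
    the positive by at least `margin`."""
    return max(0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + margin)


anchor = [0.0, 0.0]
positive = [0.0, 1.0]                 # same stance target, distance 1.0
easy_negative = [3.0, 4.0]            # distance 5.0, already well separated
hard_negative = [0.0, 1.5]            # distance 1.5, inside the margin

loss_easy = triplet_loss(anchor, positive, easy_negative)
loss_hard = triplet_loss(anchor, positive, hard_negative)
```

Only triplets that violate the margin contribute gradient, which is why the hard negative yields a non-zero loss while the easy one does not.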

[18] SiLVERScore: Semantically-Aware Embeddings for Sign Language Generation Evaluation

Saki Imai,Mert İnan,Anthony Sicilia,Malihe Alikhani

Main category: cs.CL

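TL;DR: This study proposes SiLVERScore, a new semantically-aware embedding-based metric for evaluating sign language generation that avoids the limitations of back-translation and performs strongly on multiple datasets.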

Details

Motivation: The existing two-step, back-translation-based evaluation pipeline introduces ambiguity: it fails to capture the multimodal nature of sign language and makes it hard to tell whether evaluation errors come from the generation model or the translation system. Method: The authors propose SiLVERScore, a semantically-aware embedding-based metric that assesses sign language generation in a joint embedding space. Result: SiLVERScore is shown to be robust to semantic and prosodic variations, and the authors explore generalization challenges across datasets. Conclusion: SiLVERScore evaluates sign language generation more effectively than traditional metrics, achieving near-perfect discrimination between correct and random pairs on the PHOENIX-14T and CSL-Daily datasets (ROC AUC = 0.99, overlap < 7%). Abstract: Evaluating sign language generation is often done through back-translation, where generated signs are first recognized back to text and then compared to a reference using text-based metrics. However, this two-step evaluation pipeline introduces ambiguity: it not only fails to capture the multimodal nature of sign language (such as facial expressions, spatial grammar, and prosody) but also makes it hard to pinpoint whether evaluation errors come from the sign generation model or the translation system used to assess it. In this work, we propose SiLVERScore, a novel semantically-aware embedding-based evaluation metric that assesses sign language generation in a joint embedding space. Our contributions include: (1) identifying limitations of existing metrics, (2) introducing SiLVERScore for semantically-aware evaluation, (3) demonstrating its robustness to semantic and prosodic variations, and (4) exploring generalization challenges across datasets. On the PHOENIX-14T and CSL-Daily datasets, SiLVERScore achieves near-perfect discrimination between correct and random pairs (ROC AUC = 0.99, overlap < 7%), substantially outperforming traditional metrics.
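For readers unfamiliar with the discrimination figure quoted above, the sketch below shows how a ROC AUC can be computed for a similarity metric that should score correct pairs higher than random pairs. It uses the pairwise (Mann-Whitney) form of AUC, with ties counted as half. The score values are invented for the example and do not come from the paper.

```python
from typing import Sequence


def roc_auc(correct_scores: Sequence[float], random_scores: Sequence[float]) -> float:
    """Probability that a correct pair outscores a random pair (ties count 0.5)."""
    wins = 0.0
    for c in correct_scores:
        for r in random_scores:
            if c > r:
                wins += 1.0
            elif c == r:
                wins += 0.5
    return wins / (len(correct_scores) * len(random_scores))


# Hypothetical metric scores: matched sign/text pairs vs. shuffled pairs.
correct_pairs = [0.9, 0.8, 0.7]
random_pairs = [0.1, 0.2, 0.75]

auc = roc_auc(correct_pairs, random_pairs)  # 8 of 9 comparisons won
```

An AUC of 0.5 would mean the metric cannot tell correct pairs from random ones, while 1.0 means every correct pair outscores every random pair.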

[19] Measuring How (Not Just Whether) VLMs Build Common Ground

Saki Imai,Mert İnan,Anthony Sicilia,Malihe Alikhani

Main category: cs.CL

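TL;DR: This paper introduces a four-metric suite for evaluating large vision language models (VLMs) in interactive grounding contexts and, by comparing against human dialogues, finds that current models diverge from human patterns.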

Details

Motivation: Current benchmarks evaluate VLMs only in single-turn or question answering settings, whereas grounding is an interactive process in which shared understanding is built up gradually through ongoing communication. Method: The authors design a four-metric suite (grounding efficiency, content alignment, lexical adaptation, and human-likeness) and apply it to 150 self-play sessions of interactive referential games played by three proprietary VLMs, comparing the results with human dyads. Result: All three models diverge from human patterns on at least three metrics, with GPT4o-mini closest to human behavior overall; moreover, task success scores do not indicate successful grounding, and high image-utterance alignment does not necessarily predict task success. Conclusion: The study provides a metric suite and findings that offer a framework for future research on VLM grounding. Abstract: Large vision language models (VLMs) increasingly claim reasoning skills, yet current benchmarks evaluate them in single-turn or question answering settings. However, grounding is an interactive process in which people gradually develop shared understanding through ongoing communication. We introduce a four-metric suite (grounding efficiency, content alignment, lexical adaptation, and human-likeness) to systematically evaluate VLM performance in interactive grounding contexts. We deploy the suite on 150 self-play sessions of interactive referential games between three proprietary VLMs and compare them with human dyads. All three models diverge from human patterns on at least three metrics, while GPT4o-mini is the closest overall. We find that (i) task success scores do not indicate successful grounding and (ii) high image-utterance alignment does not necessarily predict task success. Our metric suite and findings offer a framework for future research on VLM grounding.

[20] Align-then-Slide: A complete evaluation framework for Ultra-Long Document-Level Machine Translation

Jiaxin Guo,Daimeng Wei,Yuanchang Luo,Xiaoyu Chen,Zhanglin Wu,Huan Yang,Hengchao Shang,Zongyao Li,Zhiqiang Rao,Jinlong Yang,Hao Yang

Main category: cs.CL

TL;DR: This paper introduces Align-then-Slide, a new evaluation framework for document-level machine translation that aligns sentences and evaluates using multi-granularity scoring, showing strong correlation with human judgments and enabling better training methods.

Details

Motivation: The outputs of large language models in document-level machine translation challenge existing evaluation methods that assume sentence-by-sentence alignment, necessitating a new framework for accurate evaluation. Method: The Align-then-Slide framework consists of two stages: Align, where sentence-level source-target correspondences are inferred and target sentences are rebuilt to match the source sentence number; and n-Chunk Sliding Evaluate, where averaged metric scores are calculated across multiple chunk sizes for multi-granularity assessment. Result: Experiments on the WMT benchmark showed a Pearson correlation of 0.929 between the proposed method and expert MQM rankings. The method also aligned closely with human judgments on a new real-world test set, and preference data from the framework enabled effective training improvements. Conclusion: The proposed Align-then-Slide framework is validated as an accurate, robust, and actionable evaluation tool for doc-mt systems, showing strong alignment with human judgments and enabling effective training methods. Abstract: Large language models (LLMs) have ushered in a new era for document-level machine translation (doc-mt), yet their whole-document outputs challenge existing evaluation methods that assume sentence-by-sentence alignment. We introduce Align-then-Slide, a complete evaluation framework for ultra-long doc-mt. In the Align stage, we automatically infer sentence-level source-target correspondences and rebuild the target to match the source sentence number, resolving omissions and many-to-one/one-to-many mappings. In the n-Chunk Sliding Evaluate stage, we calculate averaged metric scores under 1-, 2-, 3- and 4-chunk for multi-granularity assessment. Experiments on the WMT benchmark show a Pearson correlation of 0.929 between our method and expert MQM rankings. On a newly curated real-world test set, our method again aligns closely with human judgments. 
Furthermore, preference data produced by Align-then-Slide enables effective CPO training and its direct use as a reward model for GRPO, both yielding translations preferred over a vanilla SFT baseline. The results validate our framework as an accurate, robust, and actionable evaluation tool for doc-mt systems.
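The n-Chunk Sliding Evaluate stage can be pictured with the sketch below. It assumes the Align stage has already produced hypothesis and reference lists of equal length, slides a window of n consecutive sentences with stride 1, scores each window, and averages per chunk size and then across sizes 1 to 4. The stride, the final averaging, and the toy `token_f1` metric are all assumptions for illustration; the abstract does not specify them, and a real setup would plug in an actual MT metric.

```python
from typing import Callable, Dict, Sequence, Tuple


def token_f1(hyp: str, ref: str) -> float:
    """Toy stand-in metric: F1 over unique whitespace tokens (hypothetical)."""
    h, r = set(hyp.split()), set(ref.split())
    overlap = len(h & r)
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(h), overlap / len(r)
    return 2 * precision * recall / (precision + recall)


def n_chunk_score(
    hyps: Sequence[str], refs: Sequence[str], n: int, metric: Callable[[str, str], float]
) -> float:
    """Average metric over all windows of n consecutive aligned sentences."""
    if len(hyps) != len(refs):
        raise ValueError("hypotheses and references must be aligned one-to-one")
    if not 1 <= n <= len(hyps):
        raise ValueError("chunk size must be between 1 and the document length")
    scores = [
        metric(" ".join(hyps[i : i + n]), " ".join(refs[i : i + n]))
        for i in range(len(hyps) - n + 1)
    ]
    return sum(scores) / len(scores)


def sliding_evaluate(
    hyps: Sequence[str],
    refs: Sequence[str],
    metric: Callable[[str, str], float] = token_f1,
    sizes: Sequence[int] = (1, 2, 3, 4),
) -> Tuple[Dict[int, float], float]:
    """Per-chunk-size scores plus their mean (the mean is an assumed aggregation)."""
    if not hyps:
        raise ValueError("empty document")
    per_size = {n: n_chunk_score(hyps, refs, n, metric) for n in sizes if n <= len(hyps)}
    return per_size, sum(per_size.values()) / len(per_size)


refs = ["a b", "x y", "c d", "e f"]
hyps = ["a b", "", "c d", "e f"]  # the empty string marks an omitted sentence

per_size, overall = sliding_evaluate(hyps, refs)
```

With size 1 the omitted sentence scores zero on its own, while larger chunks spread that penalty over neighboring sentences, which is the multi-granularity effect the abstract describes.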

[21] NE-PADD: Leveraging Named Entity Knowledge for Robust Partial Audio Deepfake Detection via Attention Aggregation

Huhong Xian,Rui Liu,Berrak Sisman,Haizhou Li

Main category: cs.CL

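TL;DR: NE-PADD is a new method for partial audio deepfake detection (PADD) that uses two parallel branches, speech named entity recognition (SpeechNER) and PADD, and integrates named entity semantics through attention aggregation mechanisms and an auxiliary loss.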

Details

Motivation: Unlike conventional sentence-level audio deepfake detection (ADD), partial audio deepfake detection (PADD) must localize fake speech at the frame level, and the use of semantic information in audio, especially named entities, has not been sufficiently explored. Method: NE-PADD exploits named entity knowledge through two parallel branches, SpeechNER and PADD; it includes two attention aggregation mechanisms, Attention Fusion (AF) and Attention Transfer (AT), and integrates named entity semantics into PADD through an auxiliary loss. Result: Experiments show that NE-PADD outperforms existing baselines, demonstrating the benefit of integrating named entity knowledge into PADD. Conclusion: NE-PADD performs effectively on PADD by exploiting named entity semantics and may inform future research. Abstract: Different from traditional sentence-level audio deepfake detection (ADD), partial audio deepfake detection (PADD) requires frame-level localization of fake speech. While some progress has been made in this area, leveraging semantic information from audio, especially named entities, remains an underexplored aspect. To this end, we propose NE-PADD, a novel method for Partial Audio Deepfake Detection (PADD) that leverages named entity knowledge through two parallel branches: Speech Named Entity Recognition (SpeechNER) and PADD. The approach incorporates two attention aggregation mechanisms: Attention Fusion (AF) for combining attention weights and Attention Transfer (AT) for guiding PADD with named entity semantics using an auxiliary loss. Built on the PartialSpoof-NER dataset, experiments show our method outperforms existing baselines, demonstrating the effectiveness of integrating named entity knowledge in PADD. The code is available at https://github.com/AI-S2-Lab/NE-PADD.

[22] Drivel-ology: Challenging LLMs with Interpreting Nonsense with Depth

Yang Wang,Chenghao Xiao,Chia-Yi Hsiao,Zi Yan Chang,Chi-Li Chen,Tyler Loakman,Chenghua Lin

Main category: cs.CL

TL;DR: This paper introduces Drivelology, a linguistic phenomenon that presents challenges for large language models due to its complex semantics and subjective interpretation. The authors evaluate LLMs and find that they struggle to understand Drivelology, highlighting a gap in machine comprehension beyond surface-level coherence.

Details

Motivation: The paper aims to explore Drivelology, a linguistic phenomenon that is syntactically coherent but pragmatically complex, to determine if current LLMs can truly understand its layered semantics and implied meaning. Method: The authors created a benchmark dataset of over 1,200 Drivelology examples in multiple languages. They evaluated various LLMs on classification, generation, and reasoning tasks, with a focus on understanding the limitations of these models in grasping the deeper semantics of Drivelology. Result: The evaluation of LLMs showed that they often confuse Drivelology with shallow nonsense, produce incoherent justifications, and fail to grasp the rhetorical function of such texts. Conclusion: The paper concludes that despite their capabilities in natural language processing, LLMs struggle to understand the nuanced and subjective nature of Drivelology, revealing a gap in their pragmatic understanding. Abstract: We introduce Drivelology, a unique linguistic phenomenon characterised as "nonsense with depth", utterances that are syntactically coherent yet pragmatically paradoxical, emotionally loaded, or rhetorically subversive. While such expressions may resemble surface-level nonsense, they encode implicit meaning requiring contextual inference, moral reasoning, or emotional interpretation. We find that current large language models (LLMs), despite excelling at many natural language processing (NLP) tasks, consistently fail to grasp the layered semantics of Drivelological text. To investigate this, we construct a small but diverse benchmark dataset of over 1,200 meticulously curated examples, with select instances in English, Mandarin, Spanish, French, Japanese, and Korean. Annotation was especially challenging: each of the examples required careful expert review to verify that it truly reflected Drivelological characteristics. 
The process involved multiple rounds of discussion and adjudication to address disagreements, highlighting the subtle and subjective nature of Drivelology. We evaluate a range of LLMs on classification, generation, and reasoning tasks. Our results reveal clear limitations of LLMs: models often confuse Drivelology with shallow nonsense, produce incoherent justifications, or miss the implied rhetorical function altogether. These findings highlight a deeper representational gap in LLMs' pragmatic understanding and challenge the assumption that statistical fluency implies cognitive comprehension. We release our dataset and code to facilitate further research in modelling linguistic depth beyond surface-level coherence.

[23] A Comprehensive Survey on Trustworthiness in Reasoning with Large Language Models

Yanbo Wang,Yongcan Yu,Jian Liang,Ran He

Main category: cs.CL

TL;DR: This paper explores how CoT reasoning impacts the trustworthiness of language models, examining truthfulness, safety, robustness, fairness, and privacy, and highlights both the potential benefits and existing vulnerabilities in current models.

Details

Motivation: The paper aims to address the lack of comprehensive understanding of how CoT-based reasoning affects the trustworthiness of language models. Method: The paper surveys recent work on reasoning models and CoT techniques, focusing on five core dimensions of trustworthy reasoning: truthfulness, safety, robustness, fairness, and privacy. Result: The paper provides an overview of recent studies on trustworthy reasoning, detailing their methodologies, findings, and limitations while identifying future research directions. Conclusion: Reasoning techniques have the potential to enhance model trustworthiness, but current models still face significant vulnerabilities in safety, robustness, and privacy. This paper serves as a resource for the AI safety community. Abstract: The development of Long-CoT reasoning has advanced LLM performance across various tasks, including language understanding, complex problem solving, and code generation. This paradigm enables models to generate intermediate reasoning steps, thereby improving both accuracy and interpretability. However, despite these advancements, a comprehensive understanding of how CoT-based reasoning affects the trustworthiness of language models remains underdeveloped. In this paper, we survey recent work on reasoning models and CoT techniques, focusing on five core dimensions of trustworthy reasoning: truthfulness, safety, robustness, fairness, and privacy. For each aspect, we provide a clear and structured overview of recent studies in chronological order, along with detailed analyses of their methodologies, findings, and limitations. Future research directions are also appended at the end for reference and discussion. 
Overall, while reasoning techniques hold promise for enhancing model trustworthiness through hallucination mitigation, harmful content detection, and robustness improvement, cutting-edge reasoning models themselves often suffer from comparable or even greater vulnerabilities in safety, robustness, and privacy. By synthesizing these insights, we hope this work serves as a valuable and timely resource for the AI safety community to stay informed on the latest progress in reasoning trustworthiness. A full list of related papers can be found at https://github.com/ybwang119/Awesome-reasoning-safety.

[24] False Sense of Security: Why Probing-based Malicious Input Detection Fails to Generalize

Cheng Wang,Zeming Wei,Qin Liu,Muhao Chen

Main category: cs.CL

TL;DR: This paper shows that current probing methods for detecting harmful instructions in LLMs rely on superficial patterns like trigger words rather than understanding meaning, leading to a false sense of security. New approaches are needed for better safety detection.

Details

Motivation: The motivation stems from the observation that probing-based approaches for detecting harmful instructions in LLMs perform poorly on out-of-distribution data, suggesting they may not truly understand semantic harmfulness but instead rely on superficial patterns. Method: The authors conducted controlled experiments using semantically cleaned datasets and compared the performance of simple n-gram methods against more complex probing techniques. They analyzed the patterns learned by probes, focusing on instructional patterns and trigger words, and assessed the out-of-distribution performance of these methods. Result: The experiments confirmed that probes learn superficial patterns such as instructional patterns and trigger words rather than semantic harmfulness. Simple n-gram methods showed comparable performance, indicating that current probing methods do not deeply understand the content but rely on shallow cues. Conclusion: The paper concludes that current probing-based safety detection methods in Large Language Models (LLMs) provide a false sense of security, as they tend to learn superficial patterns rather than understanding semantic harmfulness. This highlights the need to redesign both models and evaluation protocols for more effective safety measures. Abstract: Large Language Models (LLMs) can comply with harmful instructions, raising serious safety concerns despite their impressive capabilities. Recent work has leveraged probing-based approaches to study the separability of malicious and benign inputs in LLMs' internal representations, and researchers have proposed using such probing methods for safety detection. We systematically re-examine this paradigm. Motivated by poor out-of-distribution performance, we hypothesize that probes learn superficial patterns rather than semantic harmfulness. Through controlled experiments, we confirm this hypothesis and identify the specific patterns learned: instructional patterns and trigger words. 
Our investigation follows a systematic approach, progressing from demonstrating comparable performance of simple n-gram methods, to controlled experiments with semantically cleaned datasets, to detailed analysis of pattern dependencies. These results reveal a false sense of security around current probing-based approaches and highlight the need to redesign both models and evaluation protocols, for which we provide further discussions in the hope of suggesting responsible further research in this direction. We have open-sourced the project at https://github.com/WangCheng0116/Why-Probe-Fails.

[25] MobileRAG: Enhancing Mobile Agent with Retrieval-Augmented Generation

Gowen Loo,Chang Liu,Qinghong Yin,Xiang Chen,Jiawei Chen,Jingyuan Zhang,Yu Tian

Main category: cs.CL

TL;DR: MobileRAG improves the performance of mobile agents by using Retrieval-Augmented Generation, effectively handling complex, real-world mobile tasks with fewer steps than current methods.

Details

Motivation: Current LLM-based mobile agents face challenges such as reliance on LLM comprehension, lack of environment interaction, and absence of memory capabilities. Method: MobileRAG uses Retrieval-Augmented Generation (RAG) to enhance the performance of mobile agents, incorporating InterRAG, LocalRAG, and MemRAG. Result: MobileRAG achieves a 10.3% improvement over existing methods with fewer operational steps, as demonstrated by extensive experiments on the MobileRAG-Eval benchmark. Conclusion: MobileRAG is a more effective framework for handling real-world mobile tasks, achieving better results compared to state-of-the-art methods. Abstract: Smartphones have become indispensable in people's daily lives, permeating nearly every aspect of modern society. With the continuous advancement of large language models (LLMs), numerous LLM-based mobile agents have emerged. These agents are capable of accurately parsing diverse user queries and automatically assisting users in completing complex or repetitive operations. However, current agents 1) heavily rely on the comprehension ability of LLMs, which can lead to errors caused by misoperations or omitted steps during tasks, 2) lack interaction with the external environment, often terminating tasks when an app cannot fulfill user queries, and 3) lack memory capabilities, requiring each instruction to reconstruct the interface and being unable to learn from and correct previous mistakes. To alleviate the above issues, we propose MobileRAG, a mobile agents framework enhanced by Retrieval-Augmented Generation (RAG), which includes InterRAG, LocalRAG, and MemRAG. It leverages RAG to more quickly and accurately identify user queries and accomplish complex and long-sequence mobile tasks. Additionally, to more comprehensively assess the performance of MobileRAG, we introduce MobileRAG-Eval, a more challenging benchmark characterized by numerous complex, real-world mobile tasks that require external knowledge assistance. 
Extensive experimental results on MobileRAG-Eval demonstrate that MobileRAG can easily handle real-world mobile tasks, achieving 10.3% improvement over state-of-the-art methods with fewer operational steps. Our code is publicly available at: https://github.com/liuxiaojieOutOfWorld/MobileRAG_arxiv

[26] MTQA: Matrix of Thought for Enhanced Reasoning in Complex Question Answering

Fengxiao Tang,Yufeng Li,Zongzong Wu,Ming Zhao

Main category: cs.CL

TL;DR: This paper proposes the Matrix of Thought (MoT) framework and a fact-correction mechanism to improve the reasoning capabilities of large language models (LLMs) in complex question answering tasks, achieving better performance and efficiency than existing methods.

Details

Motivation: The motivation stems from the limitations of existing methods like Chain-of-Thought (CoT), Tree-of-Thought (ToT), and Retrieval-Augmented Generation (RAG), which face challenges such as redundancy, single-path limitations, and difficulty in handling multi-entity, multi-hop reasoning tasks. Method: The study introduces the Matrix of Thought (MoT) structure, which explores problems in both horizontal and vertical dimensions through a 'column-cell communication' mechanism. Additionally, a fact-correction mechanism is developed by constructing knowledge units from knowledge graph triples and raw text. The resulting framework (MTQA) is evaluated on four widely-used datasets. Result: The MTQA framework outperformed state-of-the-art methods on four datasets in terms of F1 and EM scores, while its reasoning time was only 14.4% of baseline methods, demonstrating both higher accuracy and efficiency. Conclusion: The proposed Matrix of Thought (MoT) framework, along with its fact-correction mechanism, significantly enhances the reasoning capabilities of large language models (LLMs), leading to improved performance in complex question answering tasks. Abstract: Complex Question Answering (QA) is a fundamental and challenging task in NLP. While large language models (LLMs) exhibit impressive performance in QA, they suffer from significant performance degradation when facing complex and abstract QA tasks due to insufficient reasoning capabilities. Works such as Chain-of-Thought (CoT) and Tree-of-Thought (ToT) aim to enhance LLMs' reasoning abilities, but they face issues such as in-layer redundancy in tree structures and single paths in chain structures. Although some studies utilize Retrieval-Augmented Generation (RAG) methods to assist LLMs in reasoning, the challenge of effectively utilizing large amounts of information involving multiple entities and hops remains critical. 
To address this, we propose the Matrix of Thought (MoT), a novel and efficient LLM thought structure. MoT explores the problem in both horizontal and vertical dimensions through the "column-cell communication" mechanism, enabling LLMs to actively engage in multi-strategy and deep-level thinking, reducing redundancy within the column cells and enhancing reasoning capabilities. Furthermore, we develop a fact-correction mechanism by constructing knowledge units from retrieved knowledge graph triples and raw text to enhance the initial knowledge for LLM reasoning and correct erroneous answers. This leads to the development of an efficient and accurate QA framework (MTQA). Experimental results show that our framework outperforms state-of-the-art methods on four widely-used datasets in terms of F1 and EM scores, with reasoning time only 14.4% of the baseline methods, demonstrating both its efficiency and accuracy. The code for this framework is available at https://github.com/lyfiter/mtqa.

[27] Decoding the Poetic Language of Emotion in Korean Modern Poetry: Insights from a Human-Labeled Dataset and AI Modeling

Iro Lim,Haein Ji,Byungjun Kim

Main category: cs.CL

TL;DR: KPoEM is a new dataset for computational emotion analysis in Korean poetry, which significantly improves emotion classification performance and bridges computational methods with literary analysis.

Details

Motivation: The motivation is to address the lack of exploration in Korean poetry emotion analysis due to its figurative language and cultural specificity, despite progress in text-based emotion classification. Method: A state-of-the-art Korean language model was fine-tuned on the KPoEM dataset through sequential fine-tuning, first on general corpora and then on the dataset. Result: The KPoEM model achieved an F1-micro score of 0.60, significantly outperforming previous models trained on general corpora, which scored 0.34. Conclusion: This study concludes that the KPoEM dataset enhances the ability to analyze emotions in modern Korean poetry, bridging computational methods and literary analysis. Abstract: This study introduces KPoEM (Korean Poetry Emotion Mapping), a novel dataset for computational emotion analysis in modern Korean poetry. Despite remarkable progress in text-based emotion classification using large language models, poetry, and Korean poetry in particular, remains underexplored due to its figurative language and cultural specificity. We built a multi-label emotion dataset of 7,662 entries, including 7,007 line-level entries from 483 poems and 615 work-level entries, drawn from five influential Korean poets and annotated with 44 fine-grained emotion categories. A state-of-the-art Korean language model fine-tuned on this dataset significantly outperformed previous models, achieving 0.60 F1-micro compared to 0.34 from models trained on general corpora. The KPoEM model, trained through sequential fine-tuning (first on general corpora and then on the KPoEM dataset), demonstrates not only an enhanced ability to identify temporally and culturally specific emotional expressions, but also a strong capacity to preserve the core sentiments of modern Korean poetry. 
This study bridges computational methods and literary analysis, presenting new possibilities for the quantitative exploration of poetic emotions through structured data that faithfully retains the emotional and cultural nuances of Korean literature.
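
The F1-micro figures quoted above pool true positives, false positives, and false negatives over every (entry, label) pair before computing a single F1. The following is a minimal sketch of that metric for multi-label data, not the authors' evaluation code; the function name `micro_f1` and the emotion labels are illustrative assumptions and are not taken from the 44 KPoEM categories.

```python
from typing import List, Set


def micro_f1(gold: List[Set[str]], pred: List[Set[str]]) -> float:
    """Micro-averaged F1: pool counts over all entries, then compute one F1."""
    tp = sum(len(g & p) for g, p in zip(gold, pred))  # labels predicted and correct
    fp = sum(len(p - g) for g, p in zip(gold, pred))  # labels predicted but wrong
    fn = sum(len(g - p) for g, p in zip(gold, pred))  # gold labels that were missed
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Hypothetical labels for three poem lines (illustration only).
gold = [{"sorrow", "longing"}, {"hope"}, {"anger", "sorrow"}]
pred = [{"sorrow"}, {"hope", "joy"}, {"sorrow"}]

score = micro_f1(gold, pred)  # tp=3, fp=1, fn=2
print(round(score, 3))
```

Because counts are pooled before averaging, frequent emotion categories weigh more heavily in this score than rare ones, which is worth keeping in mind when comparing it with macro-averaged results.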

[28] SelfAug: Mitigating Catastrophic Forgetting in Retrieval-Augmented Generation via Distribution Self-Alignment

Yuqing Huang,Rongyang Zhang,Qimeng Wang,Chengqiang Lu,Yan Gao,Yi Wu,Yao Hu,Xuyang Zhi,Guiquan Liu,Xin Li,Hao Wang,Enhong Chen

Main category: cs.CL

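TL;DR: SelfAug is a self-distribution alignment method that mitigates the catastrophic forgetting that arises when fine-tuning large language models in Retrieval-Augmented Generation (RAG) scenarios.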

Details

Motivation: Supervised fine-tuning, particularly in Retrieval-Augmented Generation (RAG) scenarios, effectively improves task-specific performance but often causes models to lose previously acquired knowledge and general capabilities. SelfAug is proposed to overcome this limitation. Method: SelfAug is a self-distribution alignment method that aligns input sequence logits to preserve the model's semantic distribution, thereby mitigating catastrophic forgetting and improving downstream performance. Result: Experiments show that SelfAug achieves a better balance between downstream learning and retention of general capabilities. Empirical analysis reveals a direct correlation between distribution shift and the severity of catastrophic forgetting in RAG scenarios. Conclusion: SelfAug advances the understanding of catastrophic forgetting in RAG contexts and provides a practical solution applicable across diverse fine-tuning scenarios. Abstract: Recent advancements in large language models (LLMs) have revolutionized natural language processing through their remarkable capabilities in understanding and executing diverse tasks. While supervised fine-tuning, particularly in Retrieval-Augmented Generation (RAG) scenarios, effectively enhances task-specific performance, it often leads to catastrophic forgetting, where models lose their previously acquired knowledge and general capabilities. Existing solutions either require access to general instruction data or face limitations in preserving the model's original distribution. To overcome these limitations, we propose SelfAug, a self-distribution alignment method that aligns input sequence logits to preserve the model's semantic distribution, thereby mitigating catastrophic forgetting and improving downstream performance. Extensive experiments demonstrate that SelfAug achieves a superior balance between downstream learning and general capability retention. Our comprehensive empirical analysis reveals a direct correlation between distribution shifts and the severity of catastrophic forgetting in RAG scenarios, highlighting how the absence of RAG capabilities in general instruction tuning leads to significant distribution shifts during fine-tuning. Our findings not only advance the understanding of catastrophic forgetting in RAG contexts but also provide a practical solution applicable across diverse fine-tuning scenarios. Our code is publicly available at https://github.com/USTC-StarTeam/SelfAug.

[29] SPFT-SQL: Enhancing Large Language Model for Text-to-SQL Parsing by Self-Play Fine-Tuning

Yuhao Zhang,Shaoming Duan,Jinhang Su,Chuanyi Liu,Peiyi Han

Main category: cs.CL

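TL;DR: SPFT-SQL is a new self-play fine-tuning method for the Text-to-SQL task that improves the generation of correct SQL by synthesizing high-quality fine-tuning data and applying an error-driven loss.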

Details

Motivation: Self-play fine-tuning (SPIN) struggles on the Text-to-SQL task: it does not generate new information, and the many correct SQL queries produced by the opponent model reduce the main model's ability to generate accurate SQL. Method: Prior to self-play, SPFT-SQL introduces a verification-based iterative fine-tuning approach, and during the self-play fine-tuning phase it applies an error-driven loss. Result: Experiments and analyses on six open-source LLMs and five widely used benchmarks show that SPFT-SQL outperforms existing state-of-the-art methods. Conclusion: SPFT-SQL addresses the challenges of the Text-to-SQL task with a new self-play fine-tuning method that combines verification-based iterative fine-tuning with an error-driven loss, improving the generation of correct SQL. Abstract: Despite the significant advancements of self-play fine-tuning (SPIN), which can transform a weak large language model (LLM) into a strong one through competitive interactions between models of varying capabilities, it still faces challenges in the Text-to-SQL task. SPIN does not generate new information, and the large number of correct SQL queries produced by the opponent model during self-play reduces the main model's ability to generate accurate SQL queries. To address this challenge, we propose a new self-play fine-tuning method tailored for the Text-to-SQL task, called SPFT-SQL. Prior to self-play, we introduce a verification-based iterative fine-tuning approach, which synthesizes high-quality fine-tuning data iteratively based on the database schema and validation feedback to enhance model performance, while building a model base with varying capabilities. During the self-play fine-tuning phase, we propose an error-driven loss method that incentivizes incorrect outputs from the opponent model, enabling the main model to distinguish between correct SQL and erroneous SQL generated by the opponent model, thereby improving its ability to generate correct SQL. Extensive experiments and in-depth analyses on six open-source LLMs and five widely used benchmarks demonstrate that our approach outperforms existing state-of-the-art (SOTA) methods.

[30] VoxRole: A Comprehensive Benchmark for Evaluating Speech-Based Role-Playing Agents

Weihao Wu,Liang Cao,Xinyu Wu,Zhiwei Lin,Rui Niu,Jingbei Li,Zhiyong Wu

Main category: cs.CL

TL;DR: This paper addresses the lack of standardized benchmarks and paralinguistic features in speech-based role-playing conversational agents by introducing VoxRole, a comprehensive evaluation dataset derived from movies.

Details

Motivation: Advancements in large language models have driven the development of role-playing conversational agents, but current research overlooks important aspects like intonation and lacks standardized benchmarks for evaluation. Method: The authors introduced VoxRole, a benchmark for evaluating speech-based RPCAs, which includes a large dataset of dialogues and a two-stage automated pipeline for character profiling. Result: VoxRole includes 13,335 multi-turn dialogues across 261 movies, totaling 65.6 hours of speech from 1,228 unique characters, enabling multi-dimensional evaluation of spoken dialogue models. Conclusion: The study concludes that current speech-based role-playing conversational agents face limitations in evaluation benchmarks and paralinguistic features. VoxRole provides a solution to these issues. Abstract: Recent significant advancements in Large Language Models (LLMs) have greatly propelled the development of Role-Playing Conversational Agents (RPCAs). These systems aim to create immersive user experiences through consistent persona adoption. However, current RPCA research faces dual limitations. First, existing work predominantly focuses on the textual modality, entirely overlooking critical paralinguistic features including intonation, prosody, and rhythm in speech, which are essential for conveying character emotions and shaping vivid identities. Second, the speech-based role-playing domain suffers from a long-standing lack of standardized evaluation benchmarks. Most current spoken dialogue datasets target only fundamental capability assessments, featuring thinly sketched or ill-defined character profiles. Consequently, they fail to effectively quantify model performance on core competencies like long-term persona consistency. To address this critical gap, we introduce VoxRole, the first comprehensive benchmark specifically designed for the evaluation of speech-based RPCAs. 
The benchmark comprises 13,335 multi-turn dialogues, totaling 65.6 hours of speech from 1,228 unique characters across 261 movies. To construct this resource, we propose a novel two-stage automated pipeline that first aligns movie audio with scripts and subsequently employs an LLM to systematically build multi-dimensional profiles for each character. Leveraging VoxRole, we conduct a multi-dimensional evaluation of contemporary spoken dialogue models, revealing crucial insights into their respective strengths and limitations in maintaining persona consistency.

[31] CANDY: Benchmarking LLMs' Limitations and Assistive Potential in Chinese Misinformation Fact-Checking

Ruiling Guo,Xinwei Yang,Chen Huang,Tong Zhang,Yong Hu

Main category: cs.CL

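TL;DR: This paper introduces the CANDY benchmark and dataset for evaluating the capabilities and limitations of LLMs in fact-checking Chinese misinformation, identifying their main failure modes and their potential as assistive tools.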

Details

Motivation: Despite the growing use of LLMs, their effectiveness at fact-checking misinformation remains uncertain. Method: The authors developed a benchmark called CANDY and curated a carefully annotated dataset of about 20,000 instances to systematically evaluate the capabilities and limitations of LLMs in fact-checking. Result: The analysis shows that current LLMs remain limited in generating accurate fact-checking conclusions, even with chain-of-thought reasoning and few-shot prompting. A taxonomy developed for the study identifies factual fabrication as the most common failure mode. Conclusion: Although LLMs have limitations in fact-checking, they show considerable potential to improve human performance when used as assistive tools. Abstract: The effectiveness of large language models (LLMs) to fact-check misinformation remains uncertain, despite their growing use. To this end, we present CANDY, a benchmark designed to systematically evaluate the capabilities and limitations of LLMs in fact-checking Chinese misinformation. Specifically, we curate a carefully annotated dataset of ~20k instances. Our analysis shows that current LLMs exhibit limitations in generating accurate fact-checking conclusions, even when enhanced with chain-of-thought reasoning and few-shot prompting. To understand these limitations, we develop a taxonomy to categorize flawed LLM-generated explanations for their conclusions and identify factual fabrication as the most common failure mode. Although LLMs alone are unreliable for fact-checking, our findings indicate their considerable potential to augment human performance when deployed as assistive tools in scenarios. Our dataset and code can be accessed at https://github.com/SCUNLP/CANDY

[32] Exploring NLP Benchmarks in an Extremely Low-Resource Setting

Ulin Nuha,Adam Jatowt

Main category: cs.CL

TL;DR: This paper develops synthetic NLP datasets for the endangered Ladin language, improving translation quality and providing foundational resources for underrepresented language research.

Details

Motivation: The motivation is to address the lack of NLP datasets for low-resource languages like Ladin, which hinders the development of robust language technologies. Method: The authors used parallel Ladin-Italian sentence pairs to generate synthetic datasets for sentiment analysis and MCQA, applying filtering and back-translation procedures. Result: The result is the creation of the first publicly available sentiment analysis and MCQA datasets for Ladin, which enhance Italian-Ladin translation performance. Conclusion: The paper concludes that the synthetic datasets created for Ladin improve machine translation training, leading to better results than current baselines. Abstract: The effectiveness of Large Language Models (LLMs) diminishes for extremely low-resource languages, such as indigenous languages, primarily due to the lack of labeled data. Despite growing interest, the availability of high-quality natural language processing (NLP) datasets for these languages remains limited, making it difficult to develop robust language technologies. This paper addresses this gap by focusing on Ladin, an endangered Romance language, specifically targeting the Val Badia variant. Leveraging a small set of parallel Ladin-Italian sentence pairs, we create synthetic datasets for sentiment analysis and multiple-choice question answering (MCQA) by translating monolingual Italian data. To ensure linguistic quality and reliability, we apply rigorous filtering and back-translation procedures in our method. We further demonstrate that incorporating these synthetic datasets into machine translation training leads to substantial improvements over existing Italian-Ladin translation baselines. Our contributions include the first publicly available sentiment analysis and MCQA datasets for Ladin, establishing foundational resources that can support broader NLP research and downstream applications for this underrepresented language.

[33] Expanding Foundational Language Capabilities in Open-Source LLMs through a Korean Case Study

Junghwan Lim,Gangwon Jo,Sungmin Lee,Jiyoung Park,Dongseok Kim,Jihwan Kim,Junhyeok Lee,Wai Ting Cheung,Dahye Choi,Kibong Choi,Jaeyeon Huh,Beomgyu Kim,Jangwoong Kim,Taehyun Kim,Haesol Lee,Jeesoo Lee,Dongpin Oh,Changseok Song,Daewon Suh

Main category: cs.CL

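TL;DR: Llama-3-Motif is a 102-billion-parameter language model built on the Llama 3 architecture that focuses on improving Korean capabilities while retaining English performance, with results on Korean-specific benchmarks comparable to GPT-4.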

Details

Motivation: To develop a large language model that performs well on both Korean and English tasks, addressing the shortcomings of existing models. Method: Built on the Llama 3 architecture, the model is scaled with advanced techniques including LlamaPro and Masked Structure Growth and trained efficiently on hyperscale GPU clusters using the MoAI platform. Result: Llama-3-Motif performs well on Korean-specific benchmarks, outperforming existing models and reaching a level comparable to GPT-4. Conclusion: Llama-3-Motif is a large language model focused on improving Korean capabilities while retaining English performance, with results comparable to GPT-4 on Korean-specific benchmarks. Abstract: We introduce Llama-3-Motif, a language model consisting of 102 billion parameters, specifically designed to enhance Korean capabilities while retaining strong performance in English. Developed on the Llama 3 architecture, Llama-3-Motif employs advanced training techniques, including LlamaPro and Masked Structure Growth, to effectively scale the model without altering its core Transformer architecture. Using the MoAI platform for efficient training across hyperscale GPU clusters, we optimized Llama-3-Motif using a carefully curated dataset that maintains a balanced ratio of Korean and English data. Llama-3-Motif shows decent performance on Korean-specific benchmarks, outperforming existing models and achieving results comparable to GPT-4.

[34] RTQA: Recursive Thinking for Complex Temporal Knowledge Graph Question Answering with Large Language Models

Zhaoyan Gong,Juan Li,Zhiqiang Liu,Lei Liang,Huajun Chen,Wen Zhang

Main category: cs.CL

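TL;DR: This paper proposes RTQA, a framework that improves TKGQA performance through recursive decomposition and multi-path aggregation and handles complex temporal queries effectively without training.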

Details

Motivation: Existing TKGQA methods have limited ability to handle complex temporal queries and suffer from weak reasoning and error propagation in decomposition frameworks. Method: RTQA recursively decomposes questions and aggregates answers over multiple paths using three core components: the Temporal Question Decomposer, the Recursive Solver, and the Answer Aggregator. Result: On the MultiTQ and TimelineKGQA benchmarks, Hits@1 improves significantly in the "Multiple" and "Complex" categories, surpassing state-of-the-art methods. Conclusion: RTQA is a new training-free TKGQA framework that addresses the limitations of existing methods through recursive thinking. Abstract: Current temporal knowledge graph question answering (TKGQA) methods primarily focus on implicit temporal constraints, lacking the capability of handling more complex temporal queries, and struggle with limited reasoning abilities and error propagation in decomposition frameworks. We propose RTQA, a novel framework to address these challenges by enhancing reasoning over TKGs without requiring training. Following recursive thinking, RTQA recursively decomposes questions into sub-problems, solves them bottom-up using LLMs and TKG knowledge, and employs multi-path answer aggregation to improve fault tolerance. RTQA consists of three core components: the Temporal Question Decomposer, the Recursive Solver, and the Answer Aggregator. Experiments on MultiTQ and TimelineKGQA benchmarks demonstrate significant Hits@1 improvements in "Multiple" and "Complex" categories, outperforming state-of-the-art methods. Our code and data are available at https://github.com/zjukg/RTQA.
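
The bottom-up, multi-path idea described in this abstract can be illustrated with a toy sketch. This is not the RTQA implementation: the question tree is written by hand instead of being produced by a decomposer, the three solver functions are hypothetical stand-ins for LLM and knowledge-graph lookups, and majority voting is an assumed aggregation rule chosen only to show how one faulty path can be outvoted.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class QuestionNode:
    question: str
    children: List["QuestionNode"] = field(default_factory=list)


# A solver maps (question, answers to its sub-questions) to one candidate answer.
Solver = Callable[[str, List[str]], str]


def aggregate(candidates: List[str]) -> str:
    """Majority vote; ties go to the earliest candidate (assumed rule)."""
    counts = Counter(candidates)
    best = max(counts.values())
    return next(c for c in candidates if counts[c] == best)


def solve(node: QuestionNode, solvers: List[Solver]) -> str:
    """Solve sub-questions first, then answer the parent from their results."""
    sub_answers = [solve(child, solvers) for child in node.children]
    candidates = [s(node.question, sub_answers) for s in solvers]
    return aggregate(candidates)


# Hypothetical stand-in solvers (no real LLM or knowledge graph is called).
def kg_lookup(question: str, sub_answers: List[str]) -> str:
    table = {"When did A first visit B?": "2014"}
    if question in table:
        return table[question]
    return "C" if sub_answers == ["2014"] else "unknown"


def llm_stub(question: str, sub_answers: List[str]) -> str:
    return "2014" if not sub_answers else "C"


def faulty_stub(question: str, sub_answers: List[str]) -> str:
    return "D"  # always wrong, to show fault tolerance


tree = QuestionNode(
    "Who visited B in the year A first visited B?",
    [QuestionNode("When did A first visit B?")],
)
answer = solve(tree, [kg_lookup, llm_stub, faulty_stub])
print(answer)
```

In this toy run the faulty path is outvoted at both the leaf and the root, which is the kind of error containment that multi-path aggregation is meant to provide.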

[35] On Robustness and Reliability of Benchmark-Based Evaluation of LLMs

Riccardo Lunardi,Vincenzo Della Mea,Stefano Mizzaro,Kevin Roitero

Main category: cs.CL

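TL;DR: This study systematically evaluates the robustness of LLMs to paraphrased benchmark questions and examines whether benchmark-based evaluation reliably measures model capabilities.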

Details

Motivation: Real-world applications involve linguistic variability, so models must remain effective across different rewordings of the same question or query. Method: The authors systematically generate paraphrases of all questions in six common benchmarks and measure the resulting changes in the effectiveness of 34 state-of-the-art LLMs. Result: LLM rankings remain relatively stable across paraphrased inputs, but absolute effectiveness scores decline significantly. Conclusion: The study highlights the limitations of current benchmarks and the need for more robust benchmarking methods that better reflect practical deployment settings. Abstract: The effectiveness of Large Language Models (LLMs) is usually evaluated by means of benchmarks such as MMLU, ARC-C, or HellaSwag, where questions are presented in their original wording, thus in a fixed, standardized format. However, real-world applications involve linguistic variability, requiring models to maintain their effectiveness across diverse rewordings of the same question or query. In this study, we systematically assess the robustness of LLMs to paraphrased benchmark questions and investigate whether benchmark-based evaluations provide a reliable measure of model capabilities. We systematically generate various paraphrases of all the questions across six different common benchmarks, and measure the resulting variations in effectiveness of 34 state-of-the-art LLMs of different sizes and effectiveness. Our findings reveal that while LLM rankings remain relatively stable across paraphrased inputs, absolute effectiveness scores change and decline significantly. This suggests that LLMs struggle with linguistic variability, raising concerns about their generalization abilities and evaluation methodologies. Furthermore, the observed performance drop challenges the reliability of benchmark-based evaluations, indicating that high benchmark scores may not fully capture a model's robustness to real-world input variations. We discuss the implications of these findings for LLM evaluation methodologies, emphasizing the need for robustness-aware benchmarks that better reflect practical deployment scenarios.
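
The two findings reported here, stable rankings and lower absolute scores, are separate quantities and can be measured separately. The sketch below is one possible way to do so, using Kendall's tau for rank agreement and a mean score difference for the drop; the abstract does not say which statistics the authors used, and the model names and scores are invented for illustration.

```python
from itertools import combinations
from typing import Dict


def kendall_tau(a: Dict[str, float], b: Dict[str, float]) -> float:
    """Kendall's tau-a over models scored in both conditions (no tie correction)."""
    models = sorted(a.keys() & b.keys())
    concordant = discordant = 0
    for m, n in combinations(models, 2):
        sign = (a[m] - a[n]) * (b[m] - b[n])
        if sign > 0:
            concordant += 1  # the pair is ordered the same way in both conditions
        elif sign < 0:
            discordant += 1  # the pair swaps order
    pairs = len(models) * (len(models) - 1) // 2
    return (concordant - discordant) / pairs if pairs else 0.0


def mean_drop(a: Dict[str, float], b: Dict[str, float]) -> float:
    """Average score lost when moving from condition a to condition b."""
    models = a.keys() & b.keys()
    return sum(a[m] - b[m] for m in models) / len(models)


# Invented accuracies for three hypothetical models.
original = {"model_a": 0.82, "model_b": 0.74, "model_c": 0.61}
paraphrased = {"model_a": 0.75, "model_b": 0.66, "model_c": 0.57}

tau = kendall_tau(original, paraphrased)
drop = mean_drop(original, paraphrased)
print(tau, round(drop, 3))
```

A tau near 1 together with a clearly positive drop matches the pattern the abstract describes: the leaderboard order survives paraphrasing while every model scores lower.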

[36] What if I ask in alia lingua? Measuring Functional Similarity Across Languages

Debangan Mishra,Arihant Rastogi,Agyeya Negi,Shashwat Goel,Ponnurangam Kumaraguru

Main category: cs.CL

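TL;DR: The study finds that as models grow in size and capability, their outputs become more consistent across languages, and that the $\kappa_p$ metric is of practical value for evaluating multilingual reliability.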

Details

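Motivation: To study how similar model outputs are across languages and to assess the consistency and reliability of multilingual models. Method: The recently proposed model similarity metric $\kappa_p$ is applied to 20 languages and 47 subjects in GlobalMMLU. Result: A model's responses become more consistent across languages as its size and capability grow, and models agree more with themselves across languages than with other models prompted in the same language. Conclusion: Cross-lingual output consistency increases with model size and capability, and models are more self-consistent across languages than they are in agreement with other models in the same language. $\kappa_p$ proves to be an effective tool for evaluating multilingual reliability and can guide the development of more consistent multilingual systems. Abstract: How similar are model outputs across languages? In this work, we study this question using a recently proposed model similarity metric $\kappa_p$ applied to 20 languages and 47 subjects in GlobalMMLU. Our analysis reveals that a model's responses become increasingly consistent across languages as its size and capability grow. Interestingly, models exhibit greater cross-lingual consistency within themselves than agreement with other models prompted in the same language. These results highlight not only the value of $\kappa_p$ as a practical tool for evaluating multilingual reliability, but also its potential to guide the development of more consistent multilingual systems.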

[37] A RoBERTa-Based Functional Syntax Annotation Model for Chinese Texts

Han Xiaohui,Zhang Yunlong,Guo Yuxi

Main category: cs.CL

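TL;DR: This study develops a RoBERTa-based functional syntax annotation model for Chinese, trained on sentences selected from the People's Daily corpus and framed as a named entity recognition task, offering a new method for Chinese functional syntax analysis.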

Details

Motivation: The lack of an automatic annotation system for Chinese texts based on Systemic Functional Grammar and Cardiff Grammar limits the application and promotion of these theories, which motivates this study. Method: A RoBERTa-based model is fine-tuned to perform annotation as a named entity recognition task, using 4,100 sentences randomly selected from the People's Daily 2014 corpus and annotated for training. Result: On the test set, the model achieves an F1 score of 0.852 and performs well in identifying core syntactic elements such as Subject (S), Main Verb (M), and Complement (C). Conclusion: The study successfully combines functional syntax with attention-based NLP models, providing a new method for Chinese functional syntax analysis and a foundation for subsequent research. Abstract: Systemic Functional Grammar and its branch, Cardiff Grammar, have been widely applied to discourse analysis, semantic function research, and other tasks across various languages and texts. However, an automatic annotation system based on this theory for Chinese texts has not yet been developed, which significantly constrains the application and promotion of relevant theories. To fill this gap, this research introduces a functional syntax annotation model for Chinese based on RoBERTa (Robustly Optimized BERT Pretraining Approach). The study randomly selected 4,100 sentences from the People's Daily 2014 corpus and annotated them according to functional syntax theory to establish a dataset for training. The study then fine-tuned the RoBERTa-Chinese wwm-ext model based on the dataset to implement the named entity recognition task, achieving an F1 score of 0.852 on the test set, significantly outperforming the comparison models. The model demonstrated excellent performance in identifying core syntactic elements such as Subject (S), Main Verb (M), and Complement (C). Nevertheless, there remains room for improvement in recognizing entities with imbalanced label samples. As the first integration of functional syntax with attention-based NLP models, this research provides a new method for automated Chinese functional syntax analysis and lays a solid foundation for subsequent studies.

[38] Synthesizing Sheet Music Problems for Evaluation and Reinforcement Learning

Zhilin Wang,Zhe Yang,Yun Luo,Yafu Li,Haoran Zhang,Runzhe Zhan,Derek F. Wong,Jizhe Zhou,Yu Cheng

Main category: cs.CL

TL;DR: The paper proposes a framework for synthesizing sheet music problems grounded in music theory to serve as evaluation benchmarks and training data for reinforcement learning with verifiable rewards. It demonstrates the effectiveness of this approach in enhancing model reasoning for sheet music understanding and in enabling AI-assisted music creation.

Details

Motivation: The motivation is to enhance the ability of Large Language Models and Multimodal Large Language Models to interpret sheet music, which is crucial for building AI musicians, due to the lack of evaluation benchmarks and training data for sheet music reasoning. Method: The paper introduces a data synthesis framework that generates verifiable sheet music questions in both textual and visual modalities, resulting in the Synthetic Sheet Music Reasoning Benchmark (SSMR-Bench) and a complementary training set. Result: Evaluation results on SSMR-Bench highlight the importance of models' reasoning abilities in interpreting sheet music. Qwen3-8B-Base and Qwen2.5-VL-Instruct achieve improvements on the SSMR-Bench by leveraging synthetic data for reinforcement learning with verifiable rewards. Qwen3-8B-Base surpasses GPT-4 in overall performance on MusicTheoryBench and also improves performance on math problems relative to the original Qwen3-8B-Base. Conclusion: The paper concludes that synthesizing sheet music problems based on music theory rules is effective in advancing model reasoning for sheet music understanding and opens new possibilities for AI-assisted music creation. Abstract: Enhancing the ability of Large Language Models (LLMs) and Multimodal Large Language Models (MLLMs) to interpret sheet music is a crucial step toward building AI musicians. However, current research lacks both evaluation benchmarks and training data for sheet music reasoning. To address this, we propose the idea of synthesizing sheet music problems grounded in music theory, which can serve both as evaluation benchmarks and as training data for reinforcement learning with verifiable rewards (RLVR). We introduce a data synthesis framework that generates verifiable sheet music questions in both textual and visual modalities, leading to the Synthetic Sheet Music Reasoning Benchmark (SSMR-Bench) and a complementary training set. 
Evaluation results on SSMR-Bench show the importance of models' reasoning abilities in interpreting sheet music. At the same time, the poor performance of Gemini 2.5-Pro highlights the challenges that MLLMs still face in interpreting sheet music in a visual format. By leveraging synthetic data for RLVR, Qwen3-8B-Base and Qwen2.5-VL-Instruct achieve improvements on the SSMR-Bench. Besides, the trained Qwen3-8B-Base surpasses GPT-4 in overall performance on MusicTheoryBench and achieves reasoning performance comparable to GPT-4 with the strategies of Role play and Chain-of-Thought. Notably, its performance on math problems also improves relative to the original Qwen3-8B-Base. Furthermore, our results show that the enhanced reasoning ability can also facilitate music composition. In conclusion, we are the first to propose the idea of synthesizing sheet music problems based on music theory rules, and demonstrate its effectiveness not only in advancing model reasoning for sheet music understanding but also in unlocking new possibilities for AI-assisted music creation.

[39] Arabic Chatbot Technologies in Education: An Overview

Hicham Bourhil,Yacine El Younoussi

Main category: cs.CL

TL;DR: This paper surveys Arabic chatbots in education, highlighting their characteristics and identifying research gaps, particularly the limited use of modern AI techniques compared to chatbots in other languages.

Details

Motivation: Motivated by the increasing adoption of AI and NLP technologies, especially chatbots, in various domains including education, and the growing interest in these technologies since the COVID-19 pandemic, this study focuses on Arabic chatbots due to the limited use of modern techniques in this language. Method: The study conducts a survey on existing Arabic chatbots in education, analyzing their approaches, language varieties, and performance metrics. Result: The survey identifies characteristics of Arabic chatbots in education and uncovers research gaps, particularly the limited use of modern AI techniques in comparison to chatbots in other languages like English. Conclusion: The study concludes that while chatbots have been successful in other languages, there is a lack of adoption of modern techniques in educational Arabic chatbots, highlighting opportunities for future research. Abstract: The recent advancements in Artificial Intelligence (AI) in general, and in Natural Language Processing (NLP) in particular, and some of its applications such as chatbots, have led to their implementation in different domains like education, healthcare, tourism, and customer service. Since the COVID-19 pandemic, there has been an increasing interest in these digital technologies to allow and enhance remote access. In education, e-learning systems have been massively adopted worldwide. The emergence of Large Language Models (LLM) such as BERT (Bidirectional Encoder Representations from Transformers) and GPT (Generative Pre-trained Transformers) made chatbots even more popular. In this study, we present a survey on existing Arabic chatbots in education and their different characteristics such as the adopted approaches, language variety, and metrics used to measure their performance. 
We identified several research gaps: despite the success of chatbots in other languages such as English, only a few educational Arabic chatbots use modern techniques. Finally, we discuss future directions of research in this field.

[40] Improving Narrative Classification and Explanation via Fine Tuned Language Models

Rishit Tyagi,Rahul Bouri,Mohit Gupta

Main category: cs.CL

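TL;DR: This paper studies how a fine-tuned BERT model combined with a GPT-4o pipeline can improve the accuracy and reliability of narrative detection and explanation in news articles.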

Details

Motivation: Understanding covert narratives and implicit messaging is essential for analyzing bias and sentiment, but traditional NLP methods struggle to detect subtle phrasing and hidden agendas. Method: The paper fine-tunes a BERT model with a recall-oriented approach for narrative detection and refines predictions with a GPT-4o pipeline. For narrative explanation, it proposes a ReACT framework with semantic retrieval-based few-shot prompting. Result: Combining auxiliary knowledge with prompt refinement improves factual accuracy and reduces hallucinations. Conclusion: Integrating auxiliary knowledge in prompts improves classification accuracy and explanation reliability, with applications in media analysis, education, and intelligence gathering. Abstract: Understanding covert narratives and implicit messaging is essential for analyzing bias and sentiment. Traditional NLP methods struggle with detecting subtle phrasing and hidden agendas. This study tackles two key challenges: (1) multi-label classification of narratives and sub-narratives in news articles, and (2) generating concise, evidence-based explanations for dominant narratives. We fine-tune a BERT model with a recall-oriented approach for comprehensive narrative detection, refining predictions using a GPT-4o pipeline for consistency. For narrative explanation, we propose a ReACT (Reasoning + Acting) framework with semantic retrieval-based few-shot prompting, ensuring grounded and relevant justifications. To enhance factual accuracy and reduce hallucinations, we incorporate a structured taxonomy table as an auxiliary knowledge base. Our results show that integrating auxiliary knowledge in prompts improves classification accuracy and justification reliability, with applications in media analysis, education, and intelligence gathering.

[41] Towards Stable and Personalised Profiles for Lexical Alignment in Spoken Human-Agent Dialogue

Keara Schaaij,Roel Boumans,Tibor Bosse,Iris Hendrickx

Main category: cs.CL

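TL;DR: This study explores how to build personalised lexical profiles from small amounts of speech data, offering practical insights for achieving lexical alignment in human-agent dialogue.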

Details

Motivation: Lexical alignment contributes to successful communication, but its implementation in conversational agents remains underexplored, particularly given recent advances in large language models (LLMs). Method: The authors varied the amount of transcribed speech used to build lexical profiles and the number of items included per part-of-speech category, and evaluated profile performance using recall, coverage, and cosine similarity. Result: Small, compact profiles built from 10 minutes of transcribed speech offered the best balance between performance and data efficiency. Conclusion: Building personalised lexical profiles from a small amount of transcribed speech is feasible, which lays the groundwork for lexical alignment strategies in conversational agents. Abstract: Lexical alignment, where speakers start to use similar words across conversation, is known to contribute to successful communication. However, its implementation in conversational agents remains underexplored, particularly considering the recent advancements in large language models (LLMs). As a first step towards enabling lexical alignment in human-agent dialogue, this study draws on strategies for personalising conversational agents and investigates the construction of stable, personalised lexical profiles as a basis for lexical alignment. Specifically, we varied the amounts of transcribed spoken data used for construction as well as the number of items included in the profiles per part-of-speech (POS) category and evaluated profile performance across time using recall, coverage, and cosine similarity metrics. It was shown that smaller and more compact profiles, created after 10 min of transcribed speech containing 5 items for adjectives, 5 items for conjunctions, and 10 items for adverbs, nouns, pronouns, and verbs each, offered the best balance in both performance and data efficiency. In conclusion, this study offers practical insights into constructing stable, personalised lexical profiles, taking into account minimal data requirements, serving as a foundational step toward lexical alignment strategies in conversational agents.
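
To make the profile construction concrete, here is a rough sketch of a per-POS lexical profile and two of the named metrics. Several parts are assumptions: the abstract does not define recall or state how cosine similarity was computed, so the definitions below are only plausible readings, and the POS tag names and helper names are hypothetical. Only the per-category sizes in `BEST_SIZES` come from the abstract.

```python
from collections import Counter
from math import sqrt
from typing import Dict, List, Tuple

Token = Tuple[str, str]  # (word, POS tag); the tag names used here are hypothetical

# Per-category profile sizes reported as the best trade-off in the abstract.
BEST_SIZES = {"ADJ": 5, "CONJ": 5, "ADV": 10, "NOUN": 10, "PRON": 10, "VERB": 10}


def build_profile(tokens: List[Token], sizes: Dict[str, int]) -> Dict[str, List[str]]:
    """Keep the most frequent words per POS category, up to the given size."""
    by_pos: Dict[str, Counter] = {}
    for word, pos in tokens:
        if pos in sizes:
            by_pos.setdefault(pos, Counter())[word.lower()] += 1
    return {pos: [w for w, _ in c.most_common(sizes[pos])] for pos, c in by_pos.items()}


def recall(profile: Dict[str, List[str]], later_tokens: List[Token]) -> float:
    """Share of profile items that reappear in later speech (assumed definition)."""
    items = {(w, pos) for pos, words in profile.items() for w in words}
    seen = {(w.lower(), pos) for w, pos in later_tokens}
    return len(items & seen) / len(items) if items else 0.0


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


# Toy transcripts with a deliberately tiny profile size per category.
early = [("really", "ADV"), ("nice", "ADJ"), ("really", "ADV"),
         ("walk", "VERB"), ("nice", "ADJ"), ("quiet", "ADJ"), ("often", "ADV")]
later = [("really", "ADV"), ("good", "ADJ"), ("walk", "VERB")]

profile = build_profile(early, {"ADJ": 1, "ADV": 1})
stability = recall(profile, later)
print(profile, stability)
```

Passing `BEST_SIZES` instead of the toy sizes would yield a profile with the shape the study found most data-efficient, given enough transcribed speech to fill each category.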

[42] MultiWikiQA: A Reading Comprehension Benchmark in 300+ Languages

Dan Saattrup Smart

Main category: cs.CL

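TL;DR: This paper introduces MultiWikiQA, a broadly multilingual reading comprehension dataset with LLM-generated questions. Human and model evaluations show that the benchmark is difficult and that performance differs widely across languages.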

Details

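Motivation: To introduce a new reading comprehension dataset covering many languages for evaluating language models across languages. Method: Context data is drawn from Wikipedia articles and questions are generated by an LLM. A crowdsourced human fluency evaluation was conducted in 30 of the languages, and 6 different language models were evaluated. Result: The generated questions are of good quality, the benchmark is sufficiently difficult, and performance differs considerably across languages. Conclusion: MultiWikiQA is a new reading comprehension dataset covering 306 languages, built from Wikipedia articles, with questions generated by an LLM and answers appearing verbatim in the articles. The dataset and survey evaluations are freely available. Abstract: We introduce a new reading comprehension dataset, dubbed MultiWikiQA, which covers 306 languages. The context data comes from Wikipedia articles, with questions generated by an LLM and the answers appearing verbatim in the Wikipedia articles. We conduct a crowdsourced human evaluation of the fluency of the generated questions across 30 of the languages, providing evidence that the questions are of good quality. We evaluate 6 different language models, both decoder and encoder models of varying sizes, showing that the benchmark is sufficiently difficult and that there is a large performance discrepancy amongst the languages. The dataset and survey evaluations are freely available.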

[43] Joint Modeling of Entities and Discourse Relations for Coherence Assessment

Wei Liu,Michael Strube

Main category: cs.CL

TL;DR: This study improves text coherence assessment by jointly modeling entities and discourse relations.

Details

Motivation: Most existing work on coherence modeling focuses on either entity features or discourse relation features, and few studies attempt to combine the two. Method: The authors explore two methods for jointly modeling entities and discourse relations and run experiments on three benchmark datasets. Result: Integrating entity features and discourse relation features significantly improves the performance of coherence models. Conclusion: Modeling entities and modeling discourse relations both benefit coherence assessment, and integrating the two types of features significantly improves coherence model performance. Abstract: In linguistics, coherence can be achieved by different means, such as by maintaining reference to the same set of entities across sentences and by establishing discourse relations between them. However, most existing work on coherence modeling focuses exclusively on either entity features or discourse relation features, with little attention given to combining the two. In this study, we explore two methods for jointly modeling entities and discourse relations for coherence assessment. Experiments on three benchmark datasets show that integrating both types of features significantly enhances the performance of coherence models, highlighting the benefits of modeling both simultaneously for coherence evaluation.

[44] MAGneT: Coordinated Multi-Agent Generation of Synthetic Multi-Turn Mental Health Counseling Sessions

Aishik Mandal,Tanmoy Chakraborty,Iryna Gurevych

Main category: cs.CL

TL;DR: MAGneT is a novel multi-agent framework for synthetic psychological counseling session generation that outperforms existing methods in quality, diversity, and therapeutic alignment, making it a promising solution for the scarcity of high-quality, privacy-compliant data in psychological counseling.

Details

Motivation: The scarcity of high-quality, privacy-compliant data for fine-tuning open-source Large Language Models (LLMs) in psychological counseling motivated the development of MAGneT, aiming to better meet the growing demand for scalable psychological counseling. Method: MAGneT uses a multi-agent framework where specialized LLM agents handle coordinated sub-tasks, each modeling a key psychological technique. A unified evaluation framework is also proposed, integrating automatic and expert metrics across nine aspects of counseling. Result: MAGneT outperformed existing methods, improving general counseling skills by 3.2% and CBT-specific skills by 4.3% on average on the Cognitive Therapy Rating Scale (CTRS). Experts preferred MAGneT-generated sessions in 77.2% of cases. Fine-tuning open-source models on MAGneT-generated sessions improved performance by 6.3% on general counseling skills and 7.3% on CBT-specific skills on average on CTRS. Conclusion: MAGneT is a significant advancement in generating synthetic psychological counseling sessions, outperforming existing methods in quality, diversity, and therapeutic alignment. It also demonstrates improved performance when used to fine-tune open-source models compared to baseline methods. Abstract: The growing demand for scalable psychological counseling highlights the need for fine-tuning open-source Large Language Models (LLMs) with high-quality, privacy-compliant data, yet such data remains scarce. Here we introduce MAGneT, a novel multi-agent framework for synthetic psychological counseling session generation that decomposes counselor response generation into coordinated sub-tasks handled by specialized LLM agents, each modeling a key psychological technique. Unlike prior single-agent approaches, MAGneT better captures the structure and nuance of real counseling. 
In addition, we address inconsistencies in prior evaluation protocols by proposing a unified evaluation framework integrating diverse automatic and expert metrics. Furthermore, we expand the expert evaluations from four aspects of counseling in previous works to nine aspects, enabling a more thorough and robust assessment of data quality. Empirical results show that MAGneT significantly outperforms existing methods in quality, diversity, and therapeutic alignment of the generated counseling sessions, improving general counseling skills by 3.2% and CBT-specific skills by 4.3% on average on cognitive therapy rating scale (CTRS). Crucially, experts prefer MAGneT-generated sessions in 77.2% of cases on average across all aspects. Moreover, fine-tuning an open-source model on MAGneT-generated sessions shows better performance, with improvements of 6.3% on general counseling skills and 7.3% on CBT-specific skills on average on CTRS over those fine-tuned with sessions generated by baseline methods. We also make our code and data public.

Congbo Ma,Yuxia Wang,Jia Wu,Jian Yang,Jing Du,Zitai Qiu,Qing Li,Hu Wang,Preslav Nakov

Main category: cs.CL

TL;DR: SED-Aug is a plug-and-play dual augmentation framework for social event detection that improves model performance by enhancing data diversity through explicit text-based and implicit feature-space augmentation.

Details

Motivation: Social event detection relies on costly and labor-intensive labeled data, which motivated the development of a more efficient framework. Method: SED-Aug uses explicit text-based and implicit feature-space augmentation methods to enhance data diversity and model robustness. Result: SED-Aug outperforms the best baseline model by approximately 17.67% on the Twitter2012 dataset and by about 15.57% on the Twitter2018 dataset in terms of the average F1 score. Conclusion: SED-Aug is an effective framework for social event detection that significantly improves performance compared to baseline models. Abstract: Social event detection involves identifying and categorizing important events from social media, which relies on labeled data, but annotation is costly and labor-intensive. To address this problem, we propose Augmentation framework for Social Event Detection (SED-Aug), a plug-and-play dual augmentation framework, which combines explicit text-based and implicit feature-space augmentation to enhance data diversity and model robustness. The explicit augmentation utilizes large language models to enhance textual information through five diverse generation strategies. For implicit augmentation, we design five novel perturbation techniques that operate in the feature space on structural fused embeddings. These perturbations are crafted to keep the semantic and relational properties of the embeddings and make them more diverse. Specifically, SED-Aug outperforms the best baseline model by approximately 17.67% on the Twitter2012 dataset and by about 15.57% on the Twitter2018 dataset in terms of the average F1 score. The code is available at GitHub: https://github.com/congboma/SED-Aug.

[46] Inverse IFEval: Can LLMs Unlearn Stubborn Training Conventions to Follow Real Instructions?

Qinyan Zhang,Xinping Lei,Ruijie Miao,Yu Fu,Haojie Fan,Le Chang,Jiafan Hou,Dingling Zhang,Zhongfei Hou,Ziqiang Yang,Changxin Pu,Fei Hu,Jingkai Liu,Mengyun Liu,Yang Liu,Xiang Gao,Jiaheng Liu,Tong Yang,Zaiyuan Wang,Ge Zhang,Wenhao Huang

Main category: cs.CL

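TL;DR: This study proposes the Inverse IFEval benchmark to evaluate how well large language models (LLMs) adapt when given conflicting instructions, and emphasizes that future model development should focus on mitigating cognitive inertia and improving reliability across diverse real-world scenarios.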

Details

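Motivation: Large language models (LLMs) perform well on diverse tasks but often exhibit cognitive inertia, struggling to follow instructions that conflict with the standardized patterns learned during supervised fine-tuning (SFT). To evaluate this limitation, the authors propose the Inverse IFEval benchmark, which measures a model's counter-intuitive ability, i.e., its capacity to override training-induced biases and comply with adversarial instructions. Method: The authors propose a new benchmark, Inverse IFEval, comprising eight types of challenges, such as Question Correction, Intentional Textual Flaws, Code without Comments, and Counterfactual Answering. Using a human-in-the-loop pipeline, they construct a dataset of 1012 high-quality Chinese and English questions across 23 domains, evaluated under an optimized LLM-as-a-Judge framework. Result: Experiments show that existing leading LLMs perform poorly on Inverse IFEval, indicating that the benchmark is needed to evaluate and improve instruction following in unconventional contexts. Conclusion: The study emphasizes that future alignment efforts should not only pursue fluency and factual correctness but also account for adaptability under unconventional contexts, in order to mitigate cognitive inertia, reduce overfitting to narrow patterns, and ultimately enhance the instruction-following reliability of LLMs in diverse and unpredictable real-world scenarios. Abstract: Large Language Models (LLMs) achieve strong performance on diverse tasks but often exhibit cognitive inertia, struggling to follow instructions that conflict with the standardized patterns learned during supervised fine-tuning (SFT). To evaluate this limitation, we propose Inverse IFEval, a benchmark that measures models' Counter-intuitive Ability: their capacity to override training-induced biases and comply with adversarial instructions. Inverse IFEval introduces eight types of such challenges, including Question Correction, Intentional Textual Flaws, Code without Comments, and Counterfactual Answering. Using a human-in-the-loop pipeline, we construct a dataset of 1012 high-quality Chinese and English questions across 23 domains, evaluated under an optimized LLM-as-a-Judge framework. Experiments on existing leading LLMs demonstrate the necessity of our proposed Inverse IFEval benchmark. Our findings emphasize that future alignment efforts should not only pursue fluency and factual correctness but also account for adaptability under unconventional contexts. We hope that Inverse IFEval serves as both a diagnostic tool and a foundation for developing methods that mitigate cognitive inertia, reduce overfitting to narrow patterns, and ultimately enhance the instruction-following reliability of LLMs in diverse and unpredictable real-world scenarios.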

[47] Facts Fade Fast: Evaluating Memorization of Outdated Medical Knowledge in Large Language Models

Juraj Vladika,Mahdi Dhaini,Florian Matthes

Main category: cs.CL

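TL;DR: This study finds that large language models rely on outdated knowledge in the medical domain, analyzes the problem using two newly constructed datasets, MedRevQA and MedChangeQA, and proposes future directions for mitigation.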

Details

Motivation: Large language models have great potential in healthcare, but their reliance on static training data can lead to outdated medical advice, which may harm patients. Method: The study introduces two new question-answering datasets, MedRevQA and MedChangeQA, to evaluate how strongly eight prominent LLMs rely on outdated medical knowledge, and analyzes the influence of pre-training data and training strategies. Result: All evaluated LLMs were found to rely on outdated medical knowledge; the study also analyzes how obsolete pre-training data and training strategies contribute to this phenomenon. Conclusion: LLMs rely on outdated knowledge in the medical domain, which challenges the reliability and currency of medical AI systems. The study proposes future directions for mitigating the problem. Abstract: The growing capabilities of Large Language Models (LLMs) show significant potential to enhance healthcare by assisting medical researchers and physicians. However, their reliance on static training data is a major risk when medical recommendations evolve with new research and developments. When LLMs memorize outdated medical knowledge, they can provide harmful advice or fail at clinical reasoning tasks. To investigate this problem, we introduce two novel question-answering (QA) datasets derived from systematic reviews: MedRevQA (16,501 QA pairs covering general biomedical knowledge) and MedChangeQA (a subset of 512 QA pairs where medical consensus has changed over time). Our evaluation of eight prominent LLMs on the datasets reveals consistent reliance on outdated knowledge across all models. We additionally analyze the influence of obsolete pre-training data and training strategies to explain this phenomenon and propose future directions for mitigation, laying the groundwork for developing more current and reliable medical AI systems.

[48] PARCO: Phoneme-Augmented Robust Contextual ASR via Contrastive Entity Disambiguation

Jiajun He,Naoki Sawada,Koichi Miyazaki,Tomoki Toda

Main category: cs.CL

TL;DR: PARCO improves contextual ASR by integrating phoneme-aware encoding, contrastive entity disambiguation, entity-level supervision, and hierarchical entity filtering.

Details

Motivation: ASR systems struggle with domain-specific named entities, especially homophones. Contextual ASR often fails to capture fine-grained phoneme variations and treats entities as independent tokens, leading to incomplete biasing. Method: PARCO introduces phoneme-aware encoding, contrastive entity disambiguation, entity-level supervision, and hierarchical entity filtering to enhance phonetic discrimination, ensure complete entity retrieval, and reduce false positives under uncertainty. Result: PARCO achieves CER of 4.22% on Chinese AISHELL-1 and WER of 11.14% on English DATA2 under 1,000 distractors, significantly outperforming baselines. It also demonstrates robust gains on out-of-domain datasets like THCHS-30 and LibriSpeech. Conclusion: PARCO effectively improves contextual ASR performance on domain-specific named entities, particularly homophones, by addressing phoneme variation and multi-token biasing challenges. Abstract: Automatic speech recognition (ASR) systems struggle with domain-specific named entities, especially homophones. Contextual ASR improves recognition but often fails to capture fine-grained phoneme variations due to limited entity diversity. Moreover, prior methods treat entities as independent tokens, leading to incomplete multi-token biasing. To address these issues, we propose Phoneme-Augmented Robust Contextual ASR via COntrastive entity disambiguation (PARCO), which integrates phoneme-aware encoding, contrastive entity disambiguation, entity-level supervision, and hierarchical entity filtering. These components enhance phonetic discrimination, ensure complete entity retrieval, and reduce false positives under uncertainty. Experiments show that PARCO achieves CER of 4.22% on Chinese AISHELL-1 and WER of 11.14% on English DATA2 under 1,000 distractors, significantly outperforming baselines. PARCO also demonstrates robust gains on out-of-domain datasets like THCHS-30 and LibriSpeech.

[49] Measuring Bias or Measuring the Task: Understanding the Brittle Nature of LLM Gender Biases

Bufan Gao,Elisa Kreiss

Main category: cs.CL

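TL;DR: This paper studies how prompts affect the measurement of gender bias in LLMs, finds that prompt changes can substantially alter bias outcomes, and raises a new question about how test design influences model behavior.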

Details

Motivation: As LLMs are increasingly applied in socially impactful domains, gender bias has drawn growing concern. Existing bias evaluations typically rely on task prompts that differ from natural language distributions, so it is necessary to study how prompts affect measured bias. Method: The paper tests models under prompt conditions that make the testing context salient and that make gender-related content salient, then assesses prompt sensitivity with probability-based and discrete-choice metrics. Result: Even minor prompt changes can substantially alter bias outcomes, sometimes reversing their direction entirely. Discrete-choice metrics tend to amplify bias relative to probabilistic measures. Conclusion: The paper highlights the brittleness of LLM gender bias evaluations and raises a new puzzle: to what extent controlled testing designs can trigger LLM "testing mode" performance, and what this means for the ecological validity of future benchmarks. Abstract: As LLMs are increasingly applied in socially impactful settings, concerns about gender bias have prompted growing efforts both to measure and mitigate such bias. These efforts often rely on evaluation tasks that differ from natural language distributions, as they typically involve carefully constructed task prompts that overtly or covertly signal the presence of gender bias-related content. In this paper, we examine how signaling the evaluative purpose of a task impacts measured gender bias in LLMs. Concretely, we test models under prompt conditions that (1) make the testing context salient, and (2) make gender-focused content salient. We then assess prompt sensitivity across four task formats with both token-probability and discrete-choice metrics. We find that even minor prompt changes can substantially alter bias outcomes, sometimes reversing their direction entirely. Discrete-choice metrics further tend to amplify bias relative to probabilistic measures. These findings not only highlight the brittleness of LLM gender bias evaluations but also open a new puzzle for the NLP benchmarking and development community: To what extent can well-controlled testing designs trigger LLM "testing mode" performance, and what does this mean for the ecological validity of future benchmarks?

[50] Can Language Models Handle a Non-Gregorian Calendar?

Mutsumi Sasaki,Go Kamoda,Ryosuke Takahashi,Kosuke Sato,Kentaro Inui,Keisuke Sakaguchi,Benjamin Heinzerling

Main category: cs.CL

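TL;DR: The study finds that existing language models struggle with the Japanese calendar system, especially with calendar arithmetic and cross-calendar consistency, and that language models with stronger culture-specific calendar understanding still need to be developed.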

Details

Motivation: Non-Gregorian calendar systems (such as the Japanese, Hijri, and Hebrew calendars) are widely used in many cultures, but the ability of current language models to handle them has not been evaluated. Method: The authors create datasets for four tasks to evaluate English-centric and Japanese-centric language models on calendar conversion, calendar arithmetic, and cross-calendar consistency. Result: Some models can perform calendar conversions, but even Japanese-centric models struggle with calendar arithmetic and cross-calendar consistency. Conclusion: Current language models have difficulty handling non-Gregorian calendar systems, which underscores the need for language models that better understand culture-specific calendars. Abstract: Temporal reasoning and knowledge are essential capabilities for language models (LMs). While much prior work has analyzed and improved temporal reasoning in LMs, most studies have focused solely on the Gregorian calendar. However, many non-Gregorian systems, such as the Japanese, Hijri, and Hebrew calendars, are in active use and reflect culturally grounded conceptions of time. If and how well current LMs can accurately handle such non-Gregorian calendars has not been evaluated so far. Here, we present a systematic evaluation of how well open-source LMs handle one such non-Gregorian system: the Japanese calendar. For our evaluation, we create datasets for four tasks that require both temporal knowledge and temporal reasoning. Evaluating a range of English-centric and Japanese-centric LMs, we find that some models can perform calendar conversions, but even Japanese-centric models struggle with Japanese-calendar arithmetic and with maintaining consistency across calendars. Our results highlight the importance of developing LMs that are better equipped for culture-specific calendar understanding.
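To make the calendar conversion and calendar arithmetic tasks concrete, the following is a minimal sketch of what a correct answer requires, not the paper's evaluation code. The function names are hypothetical, and the era table is an assumption limited to the four most recent eras (Taisho onward), using their commonly cited Gregorian start dates.

```python
from datetime import date

# (era name, first Gregorian day of the era), newest first.
# Assumption: only eras from Taisho onward are covered in this sketch.
ERAS = [
    ("Reiwa", date(2019, 5, 1)),
    ("Heisei", date(1989, 1, 8)),
    ("Showa", date(1926, 12, 25)),
    ("Taisho", date(1912, 7, 30)),
]


def to_japanese_era(d):
    """Convert a Gregorian date to (era name, era year)."""
    for name, start in ERAS:
        if d >= start:
            # The year in which an era begins counts as era year 1.
            return name, d.year - start.year + 1
    raise ValueError("dates before Taisho are not covered by this sketch")


def to_gregorian_year(era, era_year):
    """Convert an era year to a Gregorian year (no validity check on the era's end)."""
    for name, start in ERAS:
        if name == era:
            return start.year + era_year - 1
    raise ValueError(f"unknown era: {era}")


def add_years(era, era_year, month, day, n):
    """Calendar arithmetic: go through the Gregorian calendar and convert back."""
    shifted = date(to_gregorian_year(era, era_year) + n, month, day)
    return to_japanese_era(shifted)


print(to_japanese_era(date(2019, 4, 30)))  # last day of Heisei
print(add_years("Showa", 60, 6, 1, 10))    # arithmetic that crosses an era boundary
```

The second call shows why arithmetic is harder than conversion: adding ten years to a Showa date lands in a different era, so the answer cannot be obtained by adding to the era year alone.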

cs.CV [Back]

[51] Towards Efficient General Feature Prediction in Masked Skeleton Modeling

Shengkai Sun,Zefan Zhang,Jianfeng Dong,Zhiyong Cheng,Xiaojun Chang,Meng Wang

Main category: cs.CV

TL;DR: This paper proposes the General Feature Prediction (GFP) framework for skeleton-based action recognition, offering faster training and better performance by focusing on high-level feature prediction instead of traditional low-level reconstruction.

Details

Motivation: Existing masked autoencoder approaches for skeleton-based action recognition are limited by computational redundancy and simple reconstruction targets, necessitating a more efficient and semantically rich method. Method: The study introduces a collaborative learning framework called General Feature Prediction (GFP), where a lightweight network dynamically generates supervision signals across spatial-temporal hierarchies to replace conventional low-level reconstruction. Result: The GFP framework achieves 6.2× faster training speed and state-of-the-art performance on datasets like NTU RGB+D 60, NTU RGB+D 120, and PKU-MMD. Conclusion: The proposed GFP framework significantly improves computational efficiency and representation quality in skeleton-based action recognition compared to traditional methods. Abstract: Recent advances in the masked autoencoder (MAE) paradigm have significantly propelled self-supervised skeleton-based action recognition. However, most existing approaches limit reconstruction targets to raw joint coordinates or their simple variants, resulting in computational redundancy and limited semantic representation. To address this, we propose a novel General Feature Prediction framework (GFP) for efficient mask skeleton modeling. Our key innovation is replacing conventional low-level reconstruction with high-level feature prediction that spans from local motion patterns to global semantic representations. Specifically, we introduce a collaborative learning framework where a lightweight target generation network dynamically produces diversified supervision signals across spatial-temporal hierarchies, avoiding reliance on pre-computed offline features. The framework incorporates constrained optimization to ensure feature diversity while preventing model collapse. 
Experiments on NTU RGB+D 60, NTU RGB+D 120 and PKU-MMD demonstrate the benefits of our approach: computational efficiency (with 6.2× faster training than standard masked skeleton modeling methods) and superior representation quality, achieving state-of-the-art performance in various downstream tasks.

[52] Teacher-Student Model for Detecting and Classifying Mitosis in the MIDOG 2025 Challenge

Seungho Choe,Xiaoli Qin,Abubakr Shafique,Amanda Dy,Dimitri Androutsos,Susan Done,April Khademi

Main category: cs.CV

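TL;DR: This paper proposes a segmentation-based teacher-student model that addresses domain shift and data imbalance in mitosis detection, achieving an F1 score of 0.7660 on Track 1 and a balanced accuracy of 0.8414 on Track 2.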

Details

Motivation: Counting mitotic figures is time-consuming and prone to inter-observer variability; AI can address this but faces the challenges of domain shift and data imbalance. Method: The authors formulate mitosis detection as a pixel-level segmentation task and propose a UNet segmentation backbone that integrates domain generalization modules (contrastive representation learning and domain-adversarial training). They also introduce a multi-scale CNN classifier that uses feature maps from the segmentation model within a multi-task learning paradigm. Result: On the preliminary test set, the algorithm achieved an F1 score of 0.7660 in Track 1 (mitosis detection) and a balanced accuracy of 0.8414 in Track 2 (atypical mitosis classification). Conclusion: The method performs well in addressing domain shift and data imbalance in mitosis detection and provides an effective unified framework for robust mitosis analysis. Abstract: Counting mitotic figures is time-intensive for pathologists and leads to inter-observer variability. Artificial intelligence (AI) promises a solution by automatically detecting mitotic figures while maintaining decision consistency. However, AI tools are susceptible to domain shift, where a significant drop in performance can occur due to differences in the training and testing sets, including morphological diversity between organs, species, and variations in staining protocols. Furthermore, the number of mitoses is much smaller than the count of normal nuclei, which introduces severely imbalanced data for the detection task. In this work, we formulate mitosis detection as pixel-level segmentation and propose a teacher-student model that simultaneously addresses mitosis detection (Track 1) and atypical mitosis classification (Track 2). Our method is based on a UNet segmentation backbone that integrates domain generalization modules, namely contrastive representation learning and domain-adversarial training. A teacher-student strategy is employed to generate pixel-level pseudo-masks not only for annotated mitoses and hard negatives but also for normal nuclei, thereby enhancing feature discrimination and improving robustness against domain shift. For the classification task, we introduce a multi-scale CNN classifier that leverages feature maps from the segmentation model within a multi-task learning paradigm. 
On the preliminary test set, the algorithm achieved an F1 score of 0.7660 in Track 1 and balanced accuracy of 0.8414 in Track 2, demonstrating the effectiveness of integrating segmentation-based detection and classification into a unified framework for robust mitosis analysis.
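For readers unfamiliar with the two reported metrics, the sketch below shows their standard definitions. It is a generic illustration, not the challenge's official scoring code; the function names and the toy labels are assumptions.

```python
def f1_score(tp, fp, fn):
    """F1 from detection counts: harmonic mean of precision and recall."""
    denom = 2 * tp + fp + fn
    return 0.0 if denom == 0 else 2 * tp / denom


def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall, which is robust to class imbalance."""
    recalls = []
    for cls in sorted(set(y_true)):
        idx = [i for i, t in enumerate(y_true) if t == cls]
        correct = sum(1 for i in idx if y_pred[i] == cls)
        recalls.append(correct / len(idx))
    return sum(recalls) / len(recalls)


# Toy example: a classifier that always predicts the majority class.
y_true = [0, 0, 0, 0, 1]
y_pred = [0, 0, 0, 0, 0]
print(balanced_accuracy(y_true, y_pred))
print(f1_score(tp=8, fp=2, fn=2))
```

In the toy example plain accuracy would be 0.8, while balanced accuracy penalizes the missed minority class, which is why it suits an imbalanced task such as atypical mitosis classification.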

[53] Multi Attribute Bias Mitigation via Representation Learning

Rajeev Ranjan Dwivedi,Ankur Kumar,Vinod K Kurmi

Main category: cs.CV

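TL;DR: GMBM is a new multi-bias mitigation framework that, through two-stage training and a new bias metric called SBA, improves both the performance and the fairness of visual recognition.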

Details

Motivation: Real-world images contain multiple overlapping biases that impair the performance and fairness of modern vision models. Addressing these biases one at a time is insufficient, so a comprehensive solution is needed. Method: The GMBM framework has two stages: Adaptive Bias Integrated Learning (ABIL) and Gradient Suppression Fine Tuning (GSFT). Scaled Bias Amplification (SBA) is used as a test-time bias metric. Result: GMBM was validated on multiple datasets, where it improved worst-group accuracy, reduced multi-attribute bias amplification, and reached a new low on the SBA metric. Conclusion: GMBM is an effective multi-bias mitigation framework that reduces bias at test time and improves visual recognition performance; the work also introduces SBA as a new bias metric. Abstract: Real-world images frequently exhibit multiple overlapping biases, including textures, watermarks, gendered makeup, scene object pairings, etc. These biases collectively impair the performance of modern vision models, undermining both their robustness and fairness. Addressing these biases individually proves inadequate, as mitigating one bias often permits or intensifies others. We tackle this multi bias problem with Generalized Multi Bias Mitigation (GMBM), a lean two stage framework that needs group labels only while training and minimizes bias at test time. First, Adaptive Bias Integrated Learning (ABIL) deliberately identifies the influence of known shortcuts by training encoders for each attribute and integrating them with the main backbone, compelling the classifier to explicitly recognize these biases. Then Gradient Suppression Fine Tuning prunes those very bias directions from the backbone's gradients, leaving a single compact network that ignores all the shortcuts it just learned to recognize. Moreover, we find that existing bias metrics break under subgroup imbalance and train test distribution shifts, so we introduce Scaled Bias Amplification (SBA): a test time measure that disentangles model induced bias amplification from distributional differences. We validate GMBM on FB CMNIST, CelebA, and COCO, where we boost worst group accuracy, halve multi attribute bias amplification, and set a new low in SBA even as bias complexity and distribution shifts intensify, making GMBM the first practical, end to end multibias solution for visual recognition. Project page: http://visdomlab.github.io/GMBM/

[54] Lightweight image segmentation for echocardiography

Anders Kjelsrud,Lasse Løvstakken,Erik Smistad,Håvard Dalen,Gilles Van De Vyver

Main category: cs.CV

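TL;DR: Through an ablation study, the authors develop a lightweight U-Net whose performance is comparable to nnU-Net but with fewer parameters and faster inference, making it suitable for real-time cardiac segmentation.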

Details

Motivation: Although nnU-Net models perform well at cardiac segmentation, they are large and slow, which limits their use in real-time applications. Method: Through an ablation study that incrementally evaluates data augmentation schemes, architectural modifications, loss functions, and post-processing techniques, the authors identify the most effective components of nnU-Net and use these findings to develop a lightweight U-Net. Result: The lightweight U-Net (2M parameters) achieves statistically equivalent performance to nnU-Net (33M parameters) on the CAMUS dataset, with Dice scores of 0.93/0.85/0.89 (LV/MYO/LA), while being 16 times smaller and 4 times faster. Cross-dataset evaluation on an internal dataset confirms comparable generalization. Conclusion: The lightweight U-Net substantially reduces parameter count and inference time while maintaining performance, making it suitable for real-time cardiac segmentation. Abstract: Accurate segmentation of the left ventricle in echocardiography can enable fully automatic extraction of clinical measurements such as volumes and ejection fraction. While models configured by nnU-Net perform well, they are large and slow, thus limiting real-time use. We identified the most effective components of nnU-Net for cardiac segmentation through an ablation study, incrementally evaluating data augmentation schemes, architectural modifications, loss functions, and post-processing techniques. Our analysis revealed that simple affine augmentations and deep supervision drive performance, while complex augmentations and large model capacity offer diminishing returns. Based on these insights, we developed a lightweight U-Net (2M vs 33M parameters) that achieves statistically equivalent performance to nnU-Net on CAMUS (N=500) with Dice scores of 0.93/0.85/0.89 vs 0.93/0.86/0.89 for LV/MYO/LA ($p>0.05$), while being 16 times smaller and 4 times faster (1.35ms vs 5.40ms per frame) than the default nnU-Net configuration. Cross-dataset evaluation on an internal dataset (N=311) confirms comparable generalization.
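The Dice scores quoted above are computed per structure (LV, MYO, LA). A minimal sketch of the standard per-label Dice coefficient follows; the flattened toy masks and the label encoding (1 and 2) are illustrative assumptions, not CAMUS data or the authors' evaluation code.

```python
def dice_score(pred, target, label):
    """Dice coefficient for one label over two flattened label masks."""
    p = [x == label for x in pred]
    t = [x == label for x in target]
    intersection = sum(1 for a, b in zip(p, t) if a and b)
    total = sum(p) + sum(t)
    # Convention: if the label is absent from both masks, count it as a perfect match.
    return 1.0 if total == 0 else 2.0 * intersection / total


# Hypothetical flattened masks with labels 0 (background), 1, and 2.
pred = [0, 1, 1, 2, 2, 0]
target = [0, 1, 2, 2, 2, 0]
for label in (1, 2):
    print(label, dice_score(pred, target, label))
```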

[55] treeX: Unsupervised Tree Instance Segmentation in Dense Forest Point Clouds

Josafat-Mattias Burmeister,Andreas Tockner,Stefan Reder,Markus Engel,Rico Richter,Jan-Peter Mund,Jürgen Döllner

Main category: cs.CV

TL;DR: The revised treeX algorithm improves tree instance segmentation in 3D point cloud data by offering faster processing and better accuracy than the original method, serving as a resource-efficient alternative to deep learning approaches.

Details

Motivation: Deep learning methods for tree instance segmentation require large annotated datasets and significant computational resources. The revised treeX algorithm offers a resource-efficient alternative that does not rely on annotated data and can be applied in scenarios with limited resources. Method: The revised treeX algorithm uses clustering-based stem detection combined with region growing for crown delineation. It introduces two parameter presets for ground-based and UAV-borne laser scanning data. Result: The revised treeX algorithm showed improved performance compared to the original, with F₁-score gains of +0.11 to +0.49 for ground-based data. It achieved an F₁-score of 0.58 for UAV-borne data, where the original algorithm failed completely. Accuracy was comparable to deep learning methods for ground-based data. Conclusion: The revised treeX algorithm serves as a resource-efficient alternative to deep learning methods for tree instance segmentation in 3D point cloud data, offering improved accuracy and reduced runtime. It is suitable for scenarios with sufficient stem visibility and point density and can be used for semi-automatic label generation for deep learning models. Abstract: Close-range laser scanning provides detailed 3D captures of forest stands but requires efficient software for processing 3D point cloud data and extracting individual trees. Although recent studies have introduced deep learning methods for tree instance segmentation, these approaches require large annotated datasets and substantial computational resources. As a resource-efficient alternative, we present a revised version of the treeX algorithm, an unsupervised method that combines clustering-based stem detection with region growing for crown delineation. 
While the original treeX algorithm was developed for personal laser scanning (PLS) data, we provide two parameter presets, one for ground-based laser scanning (stationary terrestrial - TLS and PLS), and one for UAV-borne laser scanning (ULS). We evaluated the method on six public datasets (FOR-instance, ForestSemantic, LAUTx, NIBIO MLS, TreeLearn, Wytham Woods) and compared it to six open-source methods (original treeX, treeiso, RayCloudTools, ForAINet, SegmentAnyTree, TreeLearn). Compared to the original treeX algorithm, our revision reduces runtime and improves accuracy, with instance detection F₁-score gains of +0.11 to +0.49 for ground-based data. For ULS data, our preset achieves an F₁-score of 0.58, whereas the original algorithm fails to segment any correct instances. For TLS and PLS data, our algorithm achieves accuracy similar to recent open-source methods, including deep learning. Given its algorithmic design, we see two main applications for our method: (1) as a resource-efficient alternative to deep learning approaches in scenarios where the data characteristics align with the method design (sufficient stem visibility and point density), and (2) for the semi-automatic generation of labels for deep learning models. To enable broader adoption, we provide an open-source Python implementation in the pointtree package.

[56] Reg3D: Reconstructive Geometry Instruction Tuning for 3D Scene Understanding

Hongpei Zheng,Lintao Xiang,Qijun Yang,Qian Lin,Hujun Yin

Main category: cs.CV

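TL;DR: The Reg3D framework introduces geometry-aware supervision to address current limitations in 3D scene understanding and performs well across multiple experiments.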

Details

Motivation: Existing methods rely mainly on text-only supervision, which cannot provide the geometric constraints needed to learn robust 3D spatial representations. Method: Reg3D adopts a dual-supervision paradigm that uses 3D geometric information both as input and as explicit learning targets, with complementary object-level and frame-level reconstruction tasks designed to enforce geometric consistency. Result: Experiments on ScanQA, Scan2Cap, ScanRefer, and SQA3D show that Reg3D delivers substantial performance improvements. Conclusion: Reg3D establishes a new training paradigm for spatially aware multimodal models. Abstract: The rapid development of Large Multimodal Models (LMMs) has led to remarkable progress in 2D visual understanding; however, extending these capabilities to 3D scene understanding remains a significant challenge. Existing approaches predominantly rely on text-only supervision, which fails to provide the geometric constraints required for learning robust 3D spatial representations. In this paper, we introduce Reg3D, a novel Reconstructive Geometry Instruction Tuning framework that addresses this limitation by incorporating geometry-aware supervision directly into the training process. Our key insight is that effective 3D understanding necessitates reconstructing underlying geometric structures rather than merely describing them. Unlike existing methods that inject 3D information solely at the input level, Reg3D adopts a dual-supervision paradigm that leverages 3D geometric information both as input and as explicit learning targets. Specifically, we design complementary object-level and frame-level reconstruction tasks within a dual-encoder architecture, enforcing geometric consistency to encourage the development of spatial reasoning capabilities. Extensive experiments on ScanQA, Scan2Cap, ScanRefer, and SQA3D demonstrate that Reg3D delivers substantial performance improvements, establishing a new training paradigm for spatially aware multimodal models.

[57] QuantV2X: A Fully Quantized Multi-Agent System for Cooperative Perception

Seth Z. Zhao,Huizhi Zhang,Zhaowei Li,Juntong Peng,Anthony Chui,Zewei Zhou,Zonglin Meng,Hao Xiang,Zhiyu Huang,Fujia Wang,Ran Tian,Chenfeng Xu,Bolei Zhou,Jiaqi Ma

Main category: cs.CV

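TL;DR: This paper proposes QuantV2X, a new multi-agent system that uses a unified end-to-end quantization strategy to improve efficiency and deployability while preserving accuracy.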

Details

Motivation: Past research has focused mainly on improving accuracy metrics without addressing system-level considerations such as efficiency, latency, and real-world deployment. Method: The authors propose a unified end-to-end quantization strategy that covers both the neural network models and the transmitted message representations, reducing computational load and transmission bandwidth at the same time. Result: Under low-bit constraints, QuantV2X achieves accuracy comparable to full-precision systems, reduces system-level latency by 3.2×, and improves mAP30 by +9.5 over full-precision baselines. Conclusion: QuantV2X is a fully quantized multi-agent system designed for efficient, scalable deployment of multi-modal, multi-agent V2X cooperative perception. Abstract: Cooperative perception through Vehicle-to-Everything (V2X) communication offers significant potential for enhancing vehicle perception by mitigating occlusions and expanding the field of view. However, past research has predominantly focused on improving accuracy metrics without addressing the crucial system-level considerations of efficiency, latency, and real-world deployability. Notably, most existing systems rely on full-precision models, which incur high computational and transmission costs, making them impractical for real-time operation in resource-constrained environments. In this paper, we introduce QuantV2X, the first fully quantized multi-agent system designed specifically for efficient and scalable deployment of multi-modal, multi-agent V2X cooperative perception. QuantV2X introduces a unified end-to-end quantization strategy across both neural network models and transmitted message representations that simultaneously reduces computational load and transmission bandwidth. Remarkably, despite operating under low-bit constraints, QuantV2X achieves accuracy comparable to full-precision systems. More importantly, when evaluated under deployment-oriented metrics, QuantV2X reduces system-level latency by 3.2× and achieves a +9.5 improvement in mAP30 over full-precision baselines. Furthermore, QuantV2X scales more effectively, enabling larger and more capable models to fit within strict memory budgets. These results highlight the viability of a fully quantized multi-agent intermediate fusion system for real-world deployment. The system will be publicly released to promote research in this field: https://github.com/ucla-mobility/QuantV2X.
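As background on why quantizing transmitted messages saves bandwidth, here is a minimal sketch of generic symmetric uniform quantization applied to a feature vector. This is a textbook scheme shown under stated assumptions, not QuantV2X's actual quantization strategy; the function names, the 8-bit default, and the toy feature values are hypothetical.

```python
def quantize(values, num_bits=8):
    """Symmetric uniform quantization of a non-empty list of floats to signed integer codes."""
    qmax = 2 ** (num_bits - 1) - 1            # e.g. 127 for 8 bits
    max_abs = max(abs(v) for v in values)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    # Round to the nearest code and clamp to the representable range.
    codes = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return codes, scale


def dequantize(codes, scale):
    """Recover approximate floats from integer codes and the shared scale."""
    return [c * scale for c in codes]


# A sender would transmit the integer codes plus one scale value
# instead of the full-precision floats.
features = [0.5, -1.0, 0.25, 0.0]
codes, scale = quantize(features)
restored = dequantize(codes, scale)
print(codes, scale)
print(restored)
```

With 8-bit codes, each value takes one byte instead of four for a 32-bit float, at the cost of a rounding error of at most half a quantization step per value.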

[58] Transfer Learning-Based CNN Models for Plant Species Identification Using Leaf Venation Patterns

Bandita Bharadwaj,Ankur Mishra,Saurav Bharadwaj

Main category: cs.CV

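TL;DR: This study evaluates three deep learning models (ResNet50, MobileNetV2, and EfficientNetB0) for plant species classification based on leaf venation patterns. EfficientNetB0 performs best, with a testing accuracy of 94.67%, and is suitable for building efficient automated plant classification tools.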

Details

Motivation: Classification based on leaf venation patterns has high taxonomic relevance, and the study aims to assess how well different deep learning models perform on this task in order to advance automated plant species classification. Method: The study uses the Swedish Leaf Dataset, which contains images of 15 species (75 images per species, 1,125 images in total), and experiments with three deep learning architectures, ResNet50, MobileNetV2, and EfficientNetB0, evaluating them with standard performance metrics during training and testing. Result: ResNet50 reached a training accuracy of 94.11%, but its testing accuracy dropped to 88.45% with an F1 score of 87.82%. MobileNetV2 generalized better, with a testing accuracy of 93.34% and an F1 score of 93.23%. EfficientNetB0 performed best, with a testing accuracy of 94.67% and precision, recall, and F1 scores all exceeding 94.6%. Conclusion: EfficientNetB0 performs best on venation-based plant species classification, underscoring the potential of deep learning for scalable and accurate automated plant taxonomy tools based on venation traits. Abstract: This study evaluates the efficacy of three deep learning architectures: ResNet50, MobileNetV2, and EfficientNetB0 for automated plant species classification based on leaf venation patterns, a critical morphological feature with high taxonomic relevance. Using the Swedish Leaf Dataset comprising images from 15 distinct species (75 images per species, totalling 1,125 images), the models were evaluated using standard performance metrics during training and testing phases. ResNet50 achieved a training accuracy of 94.11% but exhibited overfitting, reflected by a reduced testing accuracy of 88.45% and an F1 score of 87.82%. MobileNetV2 demonstrated better generalization capabilities, attaining a testing accuracy of 93.34% and an F1 score of 93.23%, indicating its suitability for lightweight, real-time applications. EfficientNetB0 outperformed both models, achieving a testing accuracy of 94.67% with precision, recall, and F1 scores exceeding 94.6%, highlighting its robustness in venation-based classification. The findings underscore the potential of deep learning, particularly EfficientNetB0, in developing scalable and accurate tools for automated plant taxonomy using venation traits.

[59] LayoutGKN: Graph Similarity Learning of Floor Plans

Casper van Engelenburg,Jan van Gemert,Seyran Khademi

Main category: cs.CV

TL;DR: This paper proposes LayoutGKN, an efficient method for comparing building layout graphs by postponing cross-graph interactions, achieving faster and effective similarity computation.

Details

Motivation: The motivation is to address the inefficiency and slow inference time of existing graph matching networks due to costly intermediate cross-graph node-level interactions. Method: The paper introduces LayoutGKN, which uses a differentiable graph kernel as a distance function on the final learned node-level embeddings, thereby delaying cross-graph interactions until the final stage. Result: LayoutGKN computes graph similarity comparably or better than graph matching networks while significantly increasing inference speed. Conclusion: LayoutGKN is a more efficient approach for comparing floor plan graphs, as it postpones cross-graph interactions to the end of the embedding process, achieving comparable or better similarity computation while significantly improving speed. Abstract: Floor plans depict building layouts and are often represented as graphs to capture the underlying spatial relationships. Comparison of these graphs is critical for applications like search, clustering, and data visualization. The most successful methods to compare graphs, i.e., graph matching networks, rely on costly intermediate cross-graph node-level interactions and are therefore slow at inference time. We introduce LayoutGKN, a more efficient approach that postpones the cross-graph node-level interactions to the end of the joint embedding architecture. We do so by using a differentiable graph kernel as a distance function on the final learned node-level embeddings. We show that LayoutGKN computes similarity comparably or better than graph matching networks while significantly increasing the speed. Code and data are open: https://github.com/caspervanengelenburg/LayoutGKN.
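The idea of deferring all cross-graph interaction to a kernel over final node embeddings can be sketched as follows. The abstract does not specify which kernel LayoutGKN uses, so this example substitutes a generic mean pairwise Gaussian (RBF) kernel and its induced distance as an assumption; the function names, the `gamma` parameter, and the toy embeddings are hypothetical.

```python
import math


def rbf(u, v, gamma=1.0):
    """Gaussian kernel between two node embeddings."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-gamma * sq_dist)


def graph_kernel(nodes_a, nodes_b, gamma=1.0):
    """Mean pairwise kernel value between the node embeddings of two graphs."""
    total = sum(rbf(x, y, gamma) for x in nodes_a for y in nodes_b)
    return total / (len(nodes_a) * len(nodes_b))


def kernel_distance(nodes_a, nodes_b, gamma=1.0):
    """Distance induced by the kernel: sqrt(k(A,A) + k(B,B) - 2 k(A,B))."""
    d2 = (graph_kernel(nodes_a, nodes_a, gamma)
          + graph_kernel(nodes_b, nodes_b, gamma)
          - 2.0 * graph_kernel(nodes_a, nodes_b, gamma))
    return math.sqrt(max(d2, 0.0))  # clamp tiny negative values from rounding


# Hypothetical final node embeddings for three small graphs.
plan = [[0.0, 0.0], [1.0, 0.0]]
similar_plan = [[0.0, 0.1], [1.0, 0.1]]
different_plan = [[5.0, 5.0], [6.0, 5.0]]
print(kernel_distance(plan, similar_plan))
print(kernel_distance(plan, different_plan))
```

Because each graph is embedded independently and the graphs only meet inside the kernel, embeddings can be precomputed once per floor plan, which is the source of the inference-time savings the abstract describes.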

[60] Singular Value Few-shot Adaptation of Vision-Language Models

Taha Koleilat,Hassan Rivaz,Yiming Xiao

Main category: cs.CV

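TL;DR: This paper proposes CLIP-SVD, a multi-modal and parameter-efficient adaptation technique that modifies CLIP's internal parameter space, achieving enhanced adaptation performance and better generalization in few-shot settings.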

Details

Motivation: Adapting vision-language models to new fine-grained domains remains difficult because of the reliance on prompt engineering and the high cost of full model fine-tuning. Method: SVD is used to fine-tune only the singular values of CLIP's parameter matrices, rescaling the basis vectors for domain adaptation while retaining the pretrained model. Result: CLIP-SVD achieves state-of-the-art classification results on 11 natural and 10 biomedical datasets, outperforming previous methods in both accuracy and generalization under few-shot settings. Conclusion: CLIP-SVD is a new multi-modal and parameter-efficient adaptation technique that leverages singular value decomposition (SVD) to modify CLIP's internal parameter space, achieving enhanced adaptation performance and better generalization. Abstract: Vision-language models (VLMs) like CLIP have shown impressive zero-shot and few-shot learning capabilities across diverse applications. However, adapting these models to new fine-grained domains remains difficult due to reliance on prompt engineering and the high cost of full model fine-tuning. Existing adaptation approaches rely on augmented components, such as prompt tokens and adapter modules, which could limit adaptation quality, destabilize the model, and compromise the rich knowledge learned during pretraining. In this work, we present CLIP-SVD, a novel multi-modal and parameter-efficient adaptation technique that leverages Singular Value Decomposition (SVD) to modify the internal parameter space of CLIP without injecting additional modules. Specifically, we fine-tune only the singular values of the CLIP parameter matrices to rescale the basis vectors for domain adaptation while retaining the pretrained model. This design enables enhanced adaptation performance using only 0.04% of the model's total parameters and better preservation of its generalization ability. CLIP-SVD achieves state-of-the-art classification results on 11 natural and 10 biomedical datasets, outperforming previous methods in both accuracy and generalization under few-shot settings. Additionally, we leverage a natural language-based approach to analyze the effectiveness and dynamics of the CLIP adaptation to allow interpretability of CLIP-SVD. The code is publicly available at https://github.com/HealthX-Lab/CLIP-SVD.
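
Fine-tuning only the singular values can be written out as a short worked equation. This is a sketch under assumptions: the symbols $W$, $U$, $V$, $\sigma$, and $\sigma'$ are introduced here for illustration, and the exact parametrization used by CLIP-SVD may differ from this generic form.

```latex
\begin{align}
  W  &= U \,\mathrm{diag}(\sigma)\, V^{\top}
      = \sum_{i=1}^{r} \sigma_i \, u_i v_i^{\top},
      \qquad W \in \mathbb{R}^{m \times n},\; r = \min(m, n) \\
  W' &= U \,\mathrm{diag}(\sigma')\, V^{\top}
      = \sum_{i=1}^{r} \sigma'_i \, u_i v_i^{\top},
      \qquad \sigma' \text{ trainable},\; U, V \text{ frozen},\; \sigma' \leftarrow \sigma \text{ at initialization}
\end{align}
```

Under this form each weight matrix contributes only $r = \min(m, n)$ trainable scalars instead of $mn$, and the adapted matrix reuses the pretrained basis vectors $u_i, v_i$ with rescaled weights. That is consistent with the abstract's description of rescaling basis vectors with a very small trainable fraction.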

[61] STA-Net: A Decoupled Shape and Texture Attention Network for Lightweight Plant Disease Classification

Zongsen Qiu

Main category: cs.CV

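TL;DR: To address the difficulty of deploying high-accuracy plant disease diagnosis models on edge devices, this study uses an efficient network architecture search method and proposes a new attention module.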

Details

Motivation: With global food security needs rising, precision agriculture and deep learning-based plant disease diagnosis have become crucial. However, deploying high-accuracy models on edge devices is challenging. Method: A twofold solution is proposed: first, a training-free neural architecture search method (DeepMAD) is used to create an efficient network backbone; second, a Shape-Texture Attention Module (STAM) is introduced. Result: On the public CCMT plant disease dataset, the STA-Net model (401K parameters and 51.1M FLOPs) reached 89.00% accuracy and an F1 score of 88.96%. Conclusion: Integrating domain knowledge via decoupled attention offers a promising path for edge-deployed precision agriculture AI. Abstract: Responding to rising global food security needs, precision agriculture and deep learning-based plant disease diagnosis have become crucial. Yet, deploying high-precision models on edge devices is challenging. Most lightweight networks use attention mechanisms designed for generic object recognition, which poorly capture subtle pathological features like irregular lesion shapes and complex textures. To overcome this, we propose a twofold solution: first, using a training-free neural architecture search method (DeepMAD) to create an efficient network backbone for edge devices; second, introducing the Shape-Texture Attention Module (STAM). STAM splits attention into two branches -- one using deformable convolutions (DCNv4) for shape awareness and the other using a Gabor filter bank for texture awareness. On the public CCMT plant disease dataset, our STA-Net model (with 401K parameters and 51.1M FLOPs) reached 89.00% accuracy and an F1 score of 88.96%. Ablation studies confirm STAM significantly improves performance over baseline and standard attention models. Integrating domain knowledge via decoupled attention thus presents a promising path for edge-deployed precision agriculture AI. The source code is available at https://github.com/RzMY/STA-Net.
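
The texture branch relies on a Gabor filter bank, which can be sketched with the standard real-valued Gabor function: a Gaussian envelope multiplied by an oriented cosine carrier. The kernel size, number of orientations, wavelength, and the function names `gabor_kernel` and `gabor_bank` below are illustrative assumptions, not STAM's actual configuration.

```python
import math


def gabor_kernel(size, theta, wavelength, sigma, gamma=0.5, psi=0.0):
    """Real Gabor kernel of odd side length `size`, as a list of rows.

    theta: orientation in radians; wavelength: carrier period in pixels;
    sigma: envelope width; gamma: spatial aspect ratio; psi: phase offset.
    """
    half = size // 2
    kernel = []
    for y in range(-half, half + 1):
        row = []
        for x in range(-half, half + 1):
            # Rotate coordinates so the carrier oscillates along direction theta.
            xr = x * math.cos(theta) + y * math.sin(theta)
            yr = -x * math.sin(theta) + y * math.cos(theta)
            envelope = math.exp(-(xr ** 2 + (gamma * yr) ** 2) / (2.0 * sigma ** 2))
            carrier = math.cos(2.0 * math.pi * xr / wavelength + psi)
            row.append(envelope * carrier)
        kernel.append(row)
    return kernel


def gabor_bank(size=7, n_orientations=4, wavelength=4.0, sigma=2.0):
    """Bank of kernels with orientations evenly spaced over [0, pi)."""
    return [
        gabor_kernel(size, k * math.pi / n_orientations, wavelength, sigma)
        for k in range(n_orientations)
    ]


bank = gabor_bank()
print(len(bank), len(bank[0]), len(bank[0][0]))  # number of kernels, rows, columns
```

Convolving a feature map with each kernel in the bank yields one orientation-selective texture response per kernel. Fixed kernels like these add no learned parameters, which suits a small parameter budget, though the abstract does not say whether STAM's filters are fixed or learned.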

[62] SLENet: A Guidance-Enhanced Network for Underwater Camouflaged Object Detection

Xinxin Wang,Han Sun,Ningzhong Liu,Huiyu Zhou,Yinan Yao

Main category: cs.CV

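TL;DR: This paper introduces the new task of underwater camouflaged object detection (UCOD) with a corresponding benchmark dataset, DeepCamo, and designs the SLENet framework for UCOD, which outperforms existing state-of-the-art methods.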

Details

Motivation: Underwater camouflaged object detection (UCOD) is critically important to marine ecology, but the task remains underexplored, and identification is hindered by optical distortions, water turbidity, and the complex traits of marine organisms. Method: The SLENet framework is proposed for UCOD, comprising a Gamma-Asymmetric Enhancement module, a Localization Guidance Branch, and a Multi-Scale Supervised Decoder. Result: SLENet outperforms SOTA methods on the DeepCamo dataset and on three benchmark COD datasets. Conclusion: SLENet performs strongly on UCOD and on the broader COD task, showing high generality. Abstract: Underwater Camouflaged Object Detection (UCOD) aims to identify objects that blend seamlessly into underwater environments. This task is critically important to marine ecology. However, it remains largely underexplored and accurate identification is severely hindered by optical distortions, water turbidity, and the complex traits of marine organisms. To address these challenges, we introduce the UCOD task and present DeepCamo, a benchmark dataset designed for this domain. We also propose Semantic Localization and Enhancement Network (SLENet), a novel framework for UCOD. We first benchmark state-of-the-art COD models on DeepCamo to reveal key issues, upon which SLENet is built. In particular, we incorporate a Gamma-Asymmetric Enhancement (GAE) module and a Localization Guidance Branch (LGB) to enhance multi-scale feature representation while generating a location map enriched with global semantic information. This map guides the Multi-Scale Supervised Decoder (MSSD) to produce more accurate predictions. Experiments on our DeepCamo dataset and three benchmark COD datasets confirm SLENet's superior performance over SOTA methods, and underscore its high generality for the broader COD task.

[63] Fitting Image Diffusion Models on Video Datasets

Juhun Lee,Simon S. Woo

Main category: cs.CV

TL;DR: This paper introduces a video-based training strategy for diffusion models that significantly improves convergence speed, lowers FID, and enhances generative diversity by leveraging temporal coherence without architectural changes.

Details

Motivation: Training diffusion models on static images leads to information deficiency, slower convergence, and limited generalization. Video data provides richer temporal information that can improve training efficiency and performance. Method: A training strategy that utilizes temporal coherence in video frames without modifying the model architecture, integrated into standard diffusion pipelines. Result: The method achieves over 2x faster convergence, lower FID scores on training and validation sets, and enhanced generative diversity by capturing meaningful temporal variations. Conclusion: The proposed method improves diffusion training by leveraging temporal inductive bias from continuous video frames, resulting in faster convergence, lower FID, and better generative diversity. Abstract: Image diffusion models are trained on independently sampled static images. While this is the bedrock task protocol in generative modeling, capturing the temporal world through the lens of static snapshots is information-deficient by design. This limitation leads to slower convergence, limited distributional coverage, and reduced generalization. In this work, we propose a simple and effective training strategy that leverages the temporal inductive bias present in continuous video frames to improve diffusion training. Notably, the proposed method requires no architectural modification and can be seamlessly integrated into standard diffusion training pipelines. We evaluate our method on the HandCo dataset, where hand-object interactions exhibit dense temporal coherence and subtle variations in finger articulation often result in semantically distinct motions. Empirically, our method accelerates convergence by over 2x and achieves lower FID on both training and validation distributions. It also improves generative diversity by encouraging the model to capture meaningful temporal variations. We further provide an optimization analysis showing that our regularization reduces the gradient variance, which contributes to faster convergence.

[64] MedVista3D: Vision-Language Modeling for Reducing Diagnostic Errors in 3D CT Disease Detection, Understanding and Reporting

Yuheng Li,Yenho Chen,Yuxiang Lai,Jike Zhong,Vanessa Wildman,Xiaofeng Yang

Main category: cs.CV

TL;DR: MedVista3D is a multi-scale semantic-enriched vision-language pretraining framework for 3D CT analysis, aimed at reducing radiologic diagnostic errors.

Details

Motivation: Radiologic diagnostic errors remain prevalent in clinical practice, and existing 3D vision-language models cannot jointly deliver precise localized detection, global volume-level reasoning, and semantically consistent natural language reporting. Method: MedVista3D performs local and global image-text alignment for fine-grained representation learning, and addresses report variability by applying language model rewrites and introducing a Radiology Semantic Matching Bank. Result: MedVista3D achieves state-of-the-art performance on zero-shot disease classification, report retrieval, and medical visual question answering, and transfers well to organ segmentation and prognosis prediction. Conclusion: MedVista3D addresses existing models' lack of local-global understanding and improves the semantic consistency of radiology reports, showing broad applicability across medical imaging tasks. Abstract: Radiologic diagnostic errors (under-reading errors, inattentional blindness, and communication failures) remain prevalent in clinical practice. These issues often stem from missed localized abnormalities, limited global context, and variability in report language. These challenges are amplified in 3D imaging, where clinicians must examine hundreds of slices per scan. Addressing them requires systems with precise localized detection, global volume-level reasoning, and semantically consistent natural language reporting. However, existing 3D vision-language models are unable to meet all three needs jointly, lacking local-global understanding for spatial reasoning and struggling with the variability and noise of uncurated radiology reports. We present MedVista3D, a multi-scale semantic-enriched vision-language pretraining framework for 3D CT analysis. To enable joint disease detection and holistic interpretation, MedVista3D performs local and global image-text alignment for fine-grained representation learning within full-volume context. To address report variability, we apply language model rewrites and introduce a Radiology Semantic Matching Bank for semantics-aware alignment. MedVista3D achieves state-of-the-art performance on zero-shot disease classification, report retrieval, and medical visual question answering, while transferring well to organ segmentation and prognosis prediction. Code and datasets will be released.

[65] Causality-guided Prompt Learning for Vision-language Models via Visual Granulation

Mengyu Gao,Qiulei Dong

Main category: cs.CV

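TL;DR: This paper proposes CaPL, which uses visual granulation and causal inference to improve CLIP prompt learning on fine-grained recognition tasks.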

Details

Motivation: Existing CLIP-based prompt learning methods show limited ability on fine-grained datasets, so CaPL is proposed to improve performance. Method: A causality-guided text prompt learning method via visual granulation (CaPL) is proposed, comprising an attribute disentanglement module and a granule learning module, which use a Brownian Bridge Diffusion Model and causal inference strategies for attribute decomposition and visual granule construction, respectively. Result: CaPL significantly outperforms existing prompt learning methods on fine-grained recognition tasks and learns more discriminative text prompts. Conclusion: Experiments on 15 datasets show that CaPL significantly outperforms state-of-the-art prompt learning methods, especially on fine-grained datasets. Abstract: Prompt learning has recently attracted much attention for adapting pre-trained vision-language models (e.g., CLIP) to downstream recognition tasks. However, most of the existing CLIP-based prompt learning methods only show a limited ability for handling fine-grained datasets. To address this issue, we propose a causality-guided text prompt learning method via visual granulation for CLIP, called CaPL, where the explored visual granulation technique could construct sets of visual granules for the text prompt to capture subtle discrepancies among different fine-grained classes through causal inference. The CaPL method contains the following two modules: (1) An attribute disentanglement module is proposed to decompose visual features into non-individualized attributes (shared by some classes) and individualized attributes (specific to single classes) using a Brownian Bridge Diffusion Model; (2) A granule learning module is proposed to construct visual granules by integrating the aforementioned attributes for recognition under two causal inference strategies. Thanks to the learned visual granules, a more discriminative text prompt is expected to be learned. Extensive experimental results on 15 datasets demonstrate that our CaPL method significantly outperforms the state-of-the-art prompt learning methods, especially on fine-grained datasets.

[66] EGTM: Event-guided Efficient Turbulence Mitigation

Huanan Li,Rui Fan,Juntao Guan,Weidong Hao,Lai Rui,Tong Wu,Yikai Wang,Lin Gu

Main category: cs.CV

TL;DR: The paper proposes a novel framework (EGTM) for turbulence mitigation using event cameras, significantly improving efficiency and restoration quality compared to existing methods.

Details

Motivation: The motivation is to overcome the limitations of existing deep-learning methods for turbulence mitigation, which require high computational and storage capacity due to coarse-grained turbulence dynamics between synchronous frames. Method: The paper introduces the "event-lucky insight" and proposes the EGTM framework, which extracts reliable turbulence-free guidance from event streams for temporal lucky fusion. Result: The proposed EGTM approach outperforms existing methods by 710 times in model size, 214 times in inference latency, and 224 times in model complexity while achieving better restoration quality on a real-world dataset. Conclusion: The paper concludes that using event cameras and the proposed EGTM framework significantly improves turbulence mitigation in terms of efficiency and restoration quality. Abstract: Turbulence mitigation (TM) aims to remove the stochastic distortions and blurs introduced by atmospheric turbulence into frame cameras. Existing state-of-the-art deep-learning TM methods extract turbulence cues from multiple degraded frames to find the so-called "lucky", undistorted patch for "lucky fusion". However, this requires a high-capacity network to learn from coarse-grained turbulence dynamics between synchronous frames with limited frame rate, and thus falls short in computational and storage efficiency. Event cameras, with microsecond-level temporal resolution, have the potential to fundamentally address this bottleneck with an efficient sparse and asynchronous imaging mechanism. In light of this, we (i) present the fundamental "event-lucky insight" to reveal the correlation between turbulence distortions and the inverse spatiotemporal distribution of event streams. Then, building upon this insight, we (ii) propose a novel EGTM framework that extracts pixel-level reliable turbulence-free guidance from the explicit but noisy turbulent events for temporal lucky fusion. Moreover, we (iii) build the first turbulence data acquisition system to contribute the first real-world event-driven TM dataset. Extensive experimental results demonstrate that our approach significantly surpasses the existing SOTA TM method by 710 times, 214 times, and 224 times in model size, inference latency, and model complexity, respectively, while achieving state-of-the-art restoration quality (+0.94 PSNR and +0.08 SSIM) on our real-world EGTM dataset. This demonstrates the great efficiency merit of introducing the event modality into the TM task. Demo code and data have been uploaded in the supplementary material and will be released once accepted.

[67] Focus Through Motion: RGB-Event Collaborative Token Sparsification for Efficient Object Detection

Nan Yang,Yang Wang,Zhanwen Liu,Yuchao Dai,Yang Liu,Xiangmo Zhao

Main category: cs.CV

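TL;DR: This paper proposes FocusMamba, which uses scene changes perceived by the event camera to adaptively sparsify and efficiently fuse multimodal features, improving the performance and efficiency of RGB-Event detection.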

Details

Motivation: Existing RGB-Event detection methods incur redundant computation on low-information regions, and sparsification strategies with a fixed threshold cannot adapt to samples of varying complexity, hurting efficiency and performance. Method: FocusMamba is proposed, comprising an Event-Guided Multimodal Sparsification (EGMS) strategy and a Cross-Modality Focus Fusion (CMFF) module. Result: Experiments on the DSEC-Det and PKU-DAVIS-SOD datasets show that FocusMamba outperforms existing methods in both accuracy and efficiency. Conclusion: Through adaptive multimodal feature sparsification and efficient cross-modal fusion, FocusMamba achieves a better balance between accuracy and efficiency than existing methods. Abstract: Existing RGB-Event detection methods process the low-information regions of both modalities (background in images and non-event regions in event data) uniformly during feature extraction and fusion, resulting in high computational costs and suboptimal performance. To mitigate the computational redundancy during feature extraction, researchers have respectively proposed token sparsification methods for the image and event modalities. However, these methods employ a fixed number or threshold for token selection, hindering the retention of informative tokens for samples with varying complexity. To achieve a better balance between accuracy and efficiency, we propose FocusMamba, which performs adaptive collaborative sparsification of multimodal features and efficiently integrates complementary information. Specifically, an Event-Guided Multimodal Sparsification (EGMS) strategy is designed to identify and adaptively discard low-information regions within each modality by leveraging scene content changes perceived by the event camera. Based on the sparsification results, a Cross-Modality Focus Fusion (CMFF) module is proposed to effectively capture and integrate complementary features from both modalities. Experiments on the DSEC-Det and PKU-DAVIS-SOD datasets demonstrate that the proposed method achieves superior performance in both accuracy and efficiency compared to existing methods. The code will be available at https://github.com/Zizzzzzzz/FocusMamba.

[68] SalientFusion: Context-Aware Compositional Zero-Shot Food Recognition

Jiajun Song,Xiaoou Liu

Main category: cs.CV

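TL;DR: This paper proposes SalientFusion, which addresses background redundancy, role confusion, and semantic bias in compositional zero-shot food recognition and performs strongly on multiple benchmarks.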

Details

Motivation: Food recognition must handle unseen food categories, which motivates Zero-Shot Food Learning. Method: SalientFusion is proposed, with two components, SalientFormer and DebiasAT, which remove background redundancy and reduce semantic bias, respectively. Result: SalientFusion achieves state-of-the-art results on the CZSFood-90 and CZSFood-164 benchmarks. Conclusion: SalientFusion achieves state-of-the-art performance on Compositional Zero-Shot Food Recognition, addressing background redundancy, role confusion, and semantic bias. Abstract: Food recognition has gained significant attention, but the rapid emergence of new dishes requires methods for recognizing unseen food categories, motivating Zero-Shot Food Learning (ZSFL). We propose the task of Compositional Zero-Shot Food Recognition (CZSFR), where cuisines and ingredients naturally align with attributes and objects in Compositional Zero-Shot Learning (CZSL). However, CZSFR faces three challenges: (1) redundant background information distracts models from learning meaningful food features, (2) role confusion between staple and side dishes leads to misclassification, and (3) semantic bias in a single attribute can lead to confusion of understanding. Therefore, we propose SalientFusion, a context-aware CZSFR method with two components: SalientFormer, which removes background redundancy and uses depth features to resolve role confusion; and DebiasAT, which reduces the semantic bias by aligning prompts with visual features. Using our proposed benchmarks, CZSFood-90 and CZSFood-164, we show that SalientFusion achieves state-of-the-art results on these benchmarks and on the most popular datasets for general CZSL. The code is available at https://github.com/Jiajun-RUC/SalientFusion.

[69] Human Motion Video Generation: A Survey

Haiwei Xue,Xiangyang Luo,Zhanghao Hu,Xin Zhang,Xunzhi Xiang,Yuqin Dai,Jianzhuang Liu,Zhensong Zhang,Minglei Li,Jian Yang,Fei Ma,Zhiyong Wu,Changpeng Yang,Zonghong Dai,Fei Richard Yu

Main category: cs.CV

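TL;DR: This paper provides a comprehensive survey of human motion video generation, covering over ten sub-tasks and five key phases of the generation process, and discusses the potential role of large language models.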

Details

Motivation: Existing surveys lack a comprehensive overview of the entire generative process; this paper aims to fill that gap and to explore the potential of large language models in enhancing human motion video generation. Method: The paper surveys over two hundred papers, covers over ten sub-tasks, and details the five key phases of the generation process: input, motion planning, motion video generation, refinement, and output. Result: The paper provides a comprehensive review of the latest developments and technological trends in human motion video generation across three primary modalities (vision, text, and audio) and highlights milestone works. Conclusion: The paper gives a comprehensive overview of human motion video generation, emphasizes potential applications and the role of large language models, and offers a valuable resource for future comprehensive applications of digital humans. Abstract: Human motion video generation has garnered significant research interest due to its broad applications, enabling innovations such as photorealistic singing heads or dynamic avatars that seamlessly dance to music. However, existing surveys in this field focus on individual methods, lacking a comprehensive overview of the entire generative process. This paper addresses this gap by providing an in-depth survey of human motion video generation, encompassing over ten sub-tasks, and detailing the five key phases of the generation process: input, motion planning, motion video generation, refinement, and output. Notably, this is the first survey that discusses the potential of large language models in enhancing human motion video generation. Our survey reviews the latest developments and technological trends in human motion video generation across three primary modalities: vision, text, and audio. By covering over two hundred papers, we offer a thorough overview of the field and highlight milestone works that have driven significant technological breakthroughs. Our goal for this survey is to unveil the prospects of human motion video generation and serve as a valuable resource for advancing the comprehensive applications of digital humans. A complete list of the models examined in this survey is available in our repository: https://github.com/Winn1y/Awesome-Human-Motion-Video-Generation.

[70] OccTENS: 3D Occupancy World Model via Temporal Next-Scale Prediction

Bu Jin,Songen Gu,Xiaotao Hu,Yupeng Zheng,Xiaoyang Guo,Qian Zhang,Xiaoxiao Long,Wei Yin

Main category: cs.CV

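TL;DR: OccTENS is a new generative occupancy world model that enables high-fidelity, controllable long-term 3D scene prediction while maintaining computational efficiency.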

Details

Motivation: Existing autoregressive occupancy world models suffer from inefficiency, temporal degradation, and lack of controllability in long-term generation, calling for a more efficient, controllable, and high-quality generation method. Method: OccTENS reformulates the occupancy world model as a temporal next-scale prediction (TENS) task, uses a TensFormer network to flexibly model spatial and temporal relationships, and proposes a holistic pose aggregation strategy to enhance pose controllability. Result: OccTENS outperforms the state-of-the-art method in both occupancy generation quality and inference speed. Conclusion: OccTENS is a controllable, high-fidelity model for long-term occupancy generation that addresses existing methods' shortcomings in efficiency, temporal degradation in long-term generation, and controllability. Abstract: In this paper, we propose OccTENS, a generative occupancy world model that enables controllable, high-fidelity long-term occupancy generation while maintaining computational efficiency. Different from visual generation, the occupancy world model must capture the fine-grained 3D geometry and dynamic evolution of the 3D scenes, posing great challenges for the generative models. Recent approaches based on autoregression (AR) have demonstrated the potential to predict vehicle movement and future occupancy scenes simultaneously from historical observations, but they typically suffer from inefficiency, temporal degradation in long-term generation, and lack of controllability. To holistically address these issues, we reformulate the occupancy world model as a temporal next-scale prediction (TENS) task, which decomposes the temporal sequence modeling problem into the modeling of spatial scale-by-scale generation and temporal scene-by-scene prediction. With a TensFormer, OccTENS can effectively manage the temporal causality and spatial relationships of occupancy sequences in a flexible and scalable way. To enhance the pose controllability, we further propose a holistic pose aggregation strategy, which features a unified sequence modeling for occupancy and ego-motion. Experiments show that OccTENS outperforms the state-of-the-art method with both higher occupancy quality and faster inference time.

[71] Weakly-Supervised Learning of Dense Functional Correspondences

Stefan Stojanov,Linan Zhao,Yunzhi Zhang,Daniel L. K. Yamins,Jiajun Wu

Main category: cs.CV

TL;DR: The paper introduces a weakly-supervised method to establish dense functional correspondence across object categories by leveraging vision-language models and dense contrastive learning, outperforming existing baseline solutions.

Details

Motivation: The motivation is that object parts enabling specific functions often share similarities in shape and appearance, which can guide the establishment of correspondences across different object categories. Method: The paper proposes a weakly-supervised learning paradigm that leverages vision-language models to pseudo-label multi-view images and integrates this with dense contrastive learning to distill functional and spatial knowledge. Result: The results show that the proposed approach outperforms baseline solutions using self-supervised image representations and grounded vision language models. Conclusion: The paper concludes that leveraging vision-language models and dense contrastive learning can effectively establish dense functional correspondence across different object categories. Abstract: Establishing dense correspondences across image pairs is essential for tasks such as shape reconstruction and robot manipulation. In the challenging setting of matching across different categories, the function of an object, i.e., the effect that an object can cause on other objects, can guide how correspondences should be established. This is because object parts that enable specific functions often share similarities in shape and appearance. We derive the definition of dense functional correspondence based on this observation and propose a weakly-supervised learning paradigm to tackle the prediction task. The main insight behind our approach is that we can leverage vision-language models to pseudo-label multi-view images to obtain functional parts. We then integrate this with dense contrastive learning from pixel correspondences to distill both functional and spatial knowledge into a new model that can establish dense functional correspondence. Further, we curate synthetic and real evaluation datasets as task benchmarks. Our results demonstrate the advantages of our approach over baseline solutions consisting of off-the-shelf self-supervised image representations and grounded vision language models.

[72] Attn-Adapter: Attention Is All You Need for Online Few-shot Learner of Vision-Language Model

Phuoc-Nguyen Bui,Khanh-Binh Nguyen,Hyunseung Choo

Main category: cs.CV

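TL;DR: This study proposes Attn-Adapter, a new online few-shot learning framework that improves CLIP's adaptability in few-shot scenarios through a dual attention mechanism, addressing the heavy computation and overfitting of conventional methods.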

Details

Motivation: Contrastive vision-language models excel at zero-shot image recognition but face challenges in few-shot scenarios, because offline fine-tuning with prompt learning is computationally intensive and risks overfitting. Method: Attn-Adapter combines two components: the Memory Attn-Adapter refines category embeddings using support examples, and the Local-Global Attn-Adapter enriches image embeddings by integrating local and global features. Result: Attn-Adapter outperforms state-of-the-art methods in cross-category and cross-dataset generalization while maintaining efficient inference and scaling across CLIP backbones. Conclusion: Attn-Adapter is a new online few-shot learning framework that enhances CLIP's adaptability; its dual attention mechanism enables dynamic adaptation from a few labeled samples without retraining the base model. Abstract: Contrastive vision-language models excel in zero-shot image recognition but face challenges in few-shot scenarios due to computationally intensive offline fine-tuning using prompt learning, which risks overfitting. To overcome these limitations, we propose Attn-Adapter, a novel online few-shot learning framework that enhances CLIP's adaptability via a dual attention mechanism. Our design incorporates dataset-specific information through two components: the Memory Attn-Adapter, which refines category embeddings using support examples, and the Local-Global Attn-Adapter, which enriches image embeddings by integrating local and global features. This architecture enables dynamic adaptation from a few labeled samples without retraining the base model. Attn-Adapter outperforms state-of-the-art methods in cross-category and cross-dataset generalization, maintaining efficient inference and scaling across CLIP backbones.

[73] SPECS: Specificity-Enhanced CLIP-Score for Long Image Caption Evaluation

Xiaofu Chen,Israfel Salazar,Yova Kementchedjhieva

Main category: cs.CV

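TL;DR: This paper proposes SPECS, a new image caption evaluation metric that modifies CLIP to emphasize specificity, matching LLM-based metrics in correlation with human judgments while being more efficient.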

Details

Motivation: As interest grows in generating long, detailed image captions, standard evaluation metrics become increasingly unreliable. Metrics based on large language models (LLMs) correlate strongly with human judgments but are too expensive for iterative use during model development. Method: SPECS modifies CLIP with a new objective that emphasizes specificity: rewarding correct details and penalizing incorrect ones. Result: SPECS matches LLM-based metrics in correlation with human judgments while being more efficient. Conclusion: SPECS is a practical alternative for iterative checkpoint evaluation during image captioning model development, since it matches LLM-based metrics in correlation with human judgments while being more efficient. Abstract: As interest grows in generating long, detailed image captions, standard evaluation metrics become increasingly unreliable. N-gram-based metrics, though efficient, fail to capture semantic correctness. Representational Similarity (RS) metrics, designed to address this, initially saw limited use due to high computational costs, while today, despite advances in hardware, they remain unpopular due to low correlation with human judgments. Meanwhile, metrics based on large language models (LLMs) show strong correlation with human judgments, but remain too expensive for iterative use during model development. We introduce SPECS (Specificity-Enhanced CLIPScore), a reference-free RS metric tailored to long image captioning. SPECS modifies CLIP with a new objective that emphasizes specificity: rewarding correct details and penalizing incorrect ones. We show that SPECS matches the performance of open-source LLM-based metrics in correlation with human judgments, while being far more efficient. This makes it a practical alternative for iterative checkpoint evaluation during image captioning model development. Our code can be found at https://github.com/mbzuai-nlp/SPECS.

[74] A Generative Foundation Model for Chest Radiography

Yuanfeng Ji,Dan Lin,Xiyue Wang,Lu Zhang,Wenhui Zhou,Chongjian Ge,Ruihang Chu,Xiaoli Yang,Junhan Zhao,Junsong Chen,Xiangde Luo,Sen Yang,Jin Fang,Ping Luo,Ruijiang Li

Main category: cs.CV

TL;DR: ChexGen is a generative vision-language foundation model for synthesizing chest radiographs, which helps in creating more accurate, data-efficient, and equitable medical AI systems.

Details

Motivation: The scarcity of well-annotated diverse medical images hinders the development of reliable AI models in healthcare. Advances in generative models for natural images inspired the development of ChexGen. Method: ChexGen, a generative vision-language foundation model, was developed; it introduces a unified framework for text-, mask-, and bounding box-guided synthesis of chest radiographs. The model was pretrained on a large dataset of chest X-rays. Result: ChexGen achieved accurate synthesis of radiographs as evaluated by experts and quantitative metrics. It proved useful for data augmentation and supervised pretraining, improving performance across various tasks with minimal data. Additionally, it helped in creating diverse patient cohorts to enhance model fairness. Conclusion: Generative foundation models can play a transformative role in building more accurate, data-efficient, and equitable medical AI systems. Abstract: The scarcity of well-annotated diverse medical images is a major hurdle for developing reliable AI models in healthcare. Substantial technical advances have been made in generative foundation models for natural images. Here we develop ChexGen, a generative vision-language foundation model that introduces a unified framework for text-, mask-, and bounding box-guided synthesis of chest radiographs. Built upon the latent diffusion transformer architecture, ChexGen was pretrained on the largest curated chest X-ray dataset to date, consisting of 960,000 radiograph-report pairs. ChexGen achieves accurate synthesis of radiographs, as confirmed by expert evaluations and quantitative metrics. We demonstrate the utility of ChexGen for training data augmentation and supervised pretraining, which led to performance improvements across disease classification, detection, and segmentation tasks using a small fraction of training data. Further, our model enables the creation of diverse patient cohorts that enhance model fairness by detecting and mitigating demographic biases. Our study supports the transformative role of generative foundation models in building more accurate, data-efficient, and equitable medical AI systems.

[75] LMVC: An End-to-End Learned Multiview Video Coding Framework

Xihua Sheng,Yingwen Zhang,Long Xu,Shiqi Wang

Main category: cs.CV

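TL;DR: This paper proposes an end-to-end learned multiview video coding (LMVC) framework that improves compression efficiency by exploiting motion and content information from the independent view, while ensuring random access and backward compatibility.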

Details

Motivation: Multiview video poses significant storage and transmission challenges, and existing deep learning-based video coding methods focus mainly on single-view or stereo video, leaving multiview scenarios underexplored. Method: A feature-based inter-view motion vector prediction method and an inter-view context prediction module are proposed, combined with an inter-view motion entropy model and an inter-view contextual entropy model, to improve compression efficiency. Result: Experiments show that the proposed LMVC framework significantly outperforms the reference software of the traditional MV-HEVC standard in compression efficiency. Conclusion: The paper provides an efficient end-to-end solution for multiview video coding and lays a foundation for future research. Abstract: Multiview video is a key data source for volumetric video, enabling immersive 3D scene reconstruction but posing significant challenges in storage and transmission due to its massive data volume. Recently, deep learning-based end-to-end video coding has achieved great success, yet most methods focus on single-view or stereo videos, leaving general multiview scenarios underexplored. This paper proposes an end-to-end learned multiview video coding (LMVC) framework that ensures random access and backward compatibility while enhancing compression efficiency. Our key innovation lies in effectively leveraging independent-view motion and content information to enhance dependent-view compression. Specifically, to exploit the inter-view motion correlation, we propose a feature-based inter-view motion vector prediction method that conditions dependent-view motion encoding on decoded independent-view motion features, along with an inter-view motion entropy model that learns inter-view motion priors. To exploit the inter-view content correlation, we propose a disparity-free inter-view context prediction module that predicts inter-view contexts from decoded independent-view content features, combined with an inter-view contextual entropy model that captures inter-view context priors. Experimental results show that our proposed LMVC framework outperforms the reference software of the traditional MV-HEVC standard by a large margin, establishing a strong baseline for future research in this field.

[76] TopoSculpt: Betti-Steered Topological Sculpting of 3D Fine-grained Tubular Shapes

Minghui Zhang,Yaoyu Liu,Junyang Wu,Xin You,Hanxiao Zhang,Junjun He,Yun Gu

Main category: cs.CV

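TL;DR: This paper proposes TopoSculpt, which refines the topological and geometric reconstruction of 3D tubular structures, substantially reducing topological errors and improving reconstruction quality.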

Details

Motivation: Existing methods rely on voxel-wise overlap measures, which cannot guarantee topological correctness and completeness, and existing topology-aware losses cannot preserve topology globally. Method: The paper proposes TopoSculpt, a framework comprising a holistic whole-region modeling strategy, a Topological Integrity Betti (TIB) constraint, and a curriculum refinement scheme based on persistent homology. Result: $\beta_{0}$ errors drop from 69.00 to 3.40 on the pulmonary airway dataset and from 1.65 to 0.30 on the Circle of Willis dataset, and the tree length detected and branch detected rates improve by nearly 10\%. Conclusion: TopoSculpt effectively corrects critical topological errors in complex 3D tubular structures and improves both geometric and topological reconstruction accuracy. Abstract: Medical tubular anatomical structures are inherently three-dimensional conduits with lumens, enclosing walls, and complex branching topologies. Accurate reconstruction of their geometry and topology is crucial for applications such as bronchoscopic navigation and cerebral arterial connectivity assessment. Existing methods often rely on voxel-wise overlap measures, which fail to capture topological correctness and completeness. Although topology-aware losses and persistent homology constraints have shown promise, they are usually applied patch-wise and cannot guarantee global preservation or correct geometric errors at inference. To address these limitations, we propose TopoSculpt, a novel framework for topological refinement of 3D fine-grained tubular structures. TopoSculpt (i) adopts a holistic whole-region modeling strategy to capture full spatial context, (ii) first introduces a Topological Integrity Betti (TIB) constraint that jointly enforces Betti number priors and global integrity, and (iii) employs a curriculum refinement scheme with persistent homology to progressively correct errors from coarse to fine scales. Extensive experiments on challenging pulmonary airway and Circle of Willis datasets demonstrate substantial improvements in both geometry and topology. For instance, $\beta_{0}$ errors are reduced from 69.00 to 3.40 on the airway dataset and from 1.65 to 0.30 on the CoW dataset, with Tree length detected and branch detected rates improving by nearly 10\%. These results highlight the effectiveness of TopoSculpt in correcting critical topological errors and advancing the high-fidelity modeling of complex 3D tubular anatomy. 
The project homepage is available at: https://github.com/Puzzled-Hui/TopoSculpt.
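
For readers unfamiliar with the $\beta_{0}$ figures quoted above: $\beta_{0}$ is the number of connected components, so a single unbroken airway tree has $\beta_{0} = 1$ and every spurious fragment adds one. The sketch below counts components in a binary voxel grid with a breadth-first search. It is only an illustration, not the TopoSculpt code: the function names, the 26-connectivity choice, and the definition of the error as an absolute difference in component counts are all assumptions.

```python
from collections import deque
from itertools import product


def betti0(volume):
    """Count connected foreground components in a 3D binary grid.

    `volume` is a nested list indexed as volume[z][y][x].
    Assumption: voxels touching by face, edge, or corner are connected
    (26-connectivity).
    """
    depth, height, width = len(volume), len(volume[0]), len(volume[0][0])
    offsets = [o for o in product((-1, 0, 1), repeat=3) if o != (0, 0, 0)]
    seen = set()
    components = 0
    for start in product(range(depth), range(height), range(width)):
        z, y, x = start
        if not volume[z][y][x] or start in seen:
            continue
        # A new, unvisited foreground voxel starts a new component.
        components += 1
        seen.add(start)
        queue = deque([start])
        while queue:
            cz, cy, cx = queue.popleft()
            for dz, dy, dx in offsets:
                nz, ny, nx = cz + dz, cy + dy, cx + dx
                inside = 0 <= nz < depth and 0 <= ny < height and 0 <= nx < width
                if inside and volume[nz][ny][nx] and (nz, ny, nx) not in seen:
                    seen.add((nz, ny, nx))
                    queue.append((nz, ny, nx))
    return components


def betti0_error(predicted, reference):
    """Assumed error definition: absolute difference in component counts."""
    return abs(betti0(predicted) - betti0(reference))


broken_tube = [[[1, 1, 0, 1, 1]]]   # one gap splits the tube in two
intact_tube = [[[1, 1, 1, 1, 1]]]
print(betti0(broken_tube), betti0(intact_tube), betti0_error(broken_tube, intact_tube))
```

A single missing voxel changes $\beta_{0}$ while barely moving a voxel-wise overlap score, which is the gap between overlap and topology metrics that the abstract describes.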

[77] Chest X-ray Pneumothorax Segmentation Using EfficientNet-B4 Transfer Learning in a U-Net Architecture

Alvaro Aranibar Roque,Helga Sebastian

Main category: cs.CV

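TL;DR: This study develops a deep learning model that accurately segments pneumothorax regions in chest X-ray images, achieving high IoU and Dice scores, and can assist clinicians in diagnosing pneumothorax.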

Details

Motivation: Pneumothorax can be life-threatening. Chest X-rays are the first-line diagnostic tool, but small pneumothoraces may be subtle, so more precise detection methods are needed. Method: The authors propose an automated deep-learning pipeline based on a U-Net with an EfficientNet-B4 encoder, trained with data augmentation and a combined binary cross-entropy plus Dice loss. Result: On the independent PTX-498 dataset, the model achieves an IoU of 0.7008 and a Dice score of 0.8241. Conclusion: The model can accurately localize pneumothorax regions and support radiologists. Abstract: Pneumothorax, the abnormal accumulation of air in the pleural space, can be life-threatening if undetected. Chest X-rays are the first-line diagnostic tool, but small cases may be subtle. We propose an automated deep-learning pipeline using a U-Net with an EfficientNet-B4 encoder to segment pneumothorax regions. Trained on the SIIM-ACR dataset with data augmentation and a combined binary cross-entropy plus Dice loss, the model achieved an IoU of 0.7008 and Dice score of 0.8241 on the independent PTX-498 dataset. These results demonstrate that the model can accurately localize pneumothoraces and support radiologists.
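
The two reported metrics and the combined loss can be made concrete with a small sketch over flattened masks. This is not the authors' training code: the function names, the equal weighting of the two loss terms, and the smoothing constant `eps` are assumptions chosen for illustration.

```python
import math


def dice_and_iou(pred_mask, true_mask, eps=1e-7):
    """Dice and IoU for two flattened binary masks (lists of 0/1)."""
    intersection = sum(p * t for p, t in zip(pred_mask, true_mask))
    total = sum(pred_mask) + sum(true_mask)
    dice = (2 * intersection + eps) / (total + eps)
    iou = (intersection + eps) / (total - intersection + eps)
    return dice, iou


def bce_dice_loss(probs, targets, eps=1e-7):
    """Binary cross-entropy plus soft Dice loss.

    Assumption: the two terms are weighted equally; the abstract does not
    state the weighting.
    """
    clipped = [min(max(p, eps), 1 - eps) for p in probs]
    bce = -sum(
        t * math.log(p) + (1 - t) * math.log(1 - p)
        for p, t in zip(clipped, targets)
    ) / len(targets)
    intersection = sum(p * t for p, t in zip(probs, targets))
    soft_dice = (2 * intersection + eps) / (sum(probs) + sum(targets) + eps)
    return bce + (1 - soft_dice)


dice, iou = dice_and_iou([1, 1, 0, 0], [1, 0, 0, 0])
print(round(dice, 4), round(iou, 4))
```

For a single pair of masks the two metrics are linked by Dice = 2·IoU / (1 + IoU). The reported figures are consistent with that identity (2 × 0.7008 / 1.7008 ≈ 0.8241), although metrics averaged per image over a dataset need not satisfy it exactly.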

[78] ANTS: Shaping the Adaptive Negative Textual Space by MLLM for OOD Detection

Zhu Wenjie,Zhang Yabin,Xin Jin,Wenjun Zeng,Lei Zhang

Main category: cs.CV

TL;DR: ANTS leverages MLLMs to generate adaptive negative textual spaces, significantly improving OOD detection performance in both far-OOD and near-OOD settings without requiring training.

Details

Motivation: Existing OOD detection methods struggle with constructing accurate negative spaces and suffer from false negatives, especially in near-OOD settings. This work aims to improve OOD detection by leveraging MLLMs to generate precise negative labels for both far-OOD and near-OOD cases. Method: ANTS identifies likely OOD samples as negative images and uses MLLMs to generate descriptive negative sentences. For near-OOD, it finds ID classes visually similar to negative images and generates tailored negative labels. An adaptive weighted score balances far- and near-OOD detection without task-specific knowledge. Result: On the ImageNet benchmark, ANTS reduces FPR95 by 4.2%, achieves state-of-the-art performance, and operates in a training-free, zero-shot manner, ensuring high scalability. Conclusion: The proposed ANTS method effectively improves both near-OOD and far-OOD detection performance by generating expressive negative labels using MLLMs and achieves state-of-the-art results on ImageNet without requiring training. Abstract: The introduction of negative labels (NLs) has proven effective in enhancing Out-of-Distribution (OOD) detection. However, existing methods often lack an understanding of OOD images, making it difficult to construct an accurate negative space. In addition, the presence of false negative labels significantly degrades their near-OOD performance. To address these issues, we propose shaping an Adaptive Negative Textual Space (ANTS) by leveraging the understanding and reasoning capabilities of multimodal large language models (MLLMs). Specifically, we identify images likely to be OOD samples as negative images and prompt the MLLM to describe these images, generating expressive negative sentences that precisely characterize the OOD distribution and enhance far-OOD detection. 
For the near-OOD setting, where OOD samples resemble the in-distribution (ID) subset, we first identify the subset of ID classes that are visually similar to negative images and then leverage the reasoning capability of MLLMs to generate visually similar negative labels tailored to this subset, effectively reducing false negatives and improving near-OOD detection. To balance these two types of negative textual spaces, we design an adaptive weighted score that enables the method to handle different OOD task settings (near-OOD and far-OOD) without relying on task-specific prior knowledge, making it highly adaptable in open environments. On the ImageNet benchmark, our ANTS significantly reduces the FPR95 by 4.2\%, establishing a new state-of-the-art. Furthermore, our method is training-free and zero-shot, enabling high scalability.
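
FPR95, the metric quoted above, is the fraction of OOD samples that are wrongly accepted when the decision threshold is set so that 95% of in-distribution samples are accepted. The sketch below computes it from two lists of scores. It is a generic illustration of the metric, not part of ANTS; the function name and the convention that a higher score means "more in-distribution" are assumptions.

```python
def fpr_at_95_tpr(id_scores, ood_scores):
    """False positive rate on OOD samples at 95% true positive rate on ID samples.

    Assumption: higher score means the sample looks more in-distribution.
    """
    ranked = sorted(id_scores, reverse=True)
    # Smallest count of ID samples that covers at least 95% of them
    # (integer ceiling division avoids floating-point rounding).
    keep = (95 * len(ranked) + 99) // 100
    threshold = ranked[keep - 1]
    false_positives = sum(1 for score in ood_scores if score >= threshold)
    return false_positives / len(ood_scores)


id_scores = list(range(1, 21))   # 20 in-distribution scores: 1..20
ood_scores = [0, 1, 5, 10]       # two of these exceed the threshold
print(fpr_at_95_tpr(id_scores, ood_scores))
```

Lower is better, so the reported 4.2% reduction means fewer OOD images slip through at the same ID acceptance rate.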

[79] Multimodal Feature Fusion Network with Text Difference Enhancement for Remote Sensing Change Detection

Yijun Zhou,Yikui Zhai,Zilu Ying,Tingfeng Xian,Wenlve Zhou,Zhiheng Zhou,Xiaolin Tian,Xudong Jia,Hongsheng Zhang,C. L. Philip Chen

Main category: cs.CV

TL;DR: MMChange is a multimodal method for remote sensing change detection that improves accuracy and robustness by combining image and text modalities.

Details

Motivation: Most existing deep learning methods for remote sensing change detection rely solely on image modality, limiting their performance under illumination and noise disturbances. Method: MMChange uses an Image Feature Refinement module, a vision language model, a Textual Difference Enhancement module, and an Image Text Feature Fusion module to integrate multimodal data. Result: MMChange outperforms state-of-the-art methods on LEVIRCD, WHUCD, and SYSUCD datasets across multiple metrics. Conclusion: MMChange is an effective multimodal method for remote sensing change detection that combines image and text modalities to improve accuracy and robustness. Abstract: Although deep learning has advanced remote sensing change detection (RSCD), most methods rely solely on image modality, limiting feature representation, change pattern modeling, and generalization especially under illumination and noise disturbances. To address this, we propose MMChange, a multimodal RSCD method that combines image and text modalities to enhance accuracy and robustness. An Image Feature Refinement (IFR) module is introduced to highlight key regions and suppress environmental noise. To overcome the semantic limitations of image features, we employ a vision language model (VLM) to generate semantic descriptions of bitemporal images. A Textual Difference Enhancement (TDE) module then captures fine grained semantic shifts, guiding the model toward meaningful changes. To bridge the heterogeneity between modalities, we design an Image Text Feature Fusion (ITFF) module that enables deep cross modal integration. Extensive experiments on LEVIRCD, WHUCD, and SYSUCD demonstrate that MMChange consistently surpasses state of the art methods across multiple metrics, validating its effectiveness for multimodal RSCD. Code is available at: https://github.com/yikuizhai/MMChange.

[80] SAC-MIL: Spatial-Aware Correlated Multiple Instance Learning for Histopathology Whole Slide Image Classification

Yu Bai,Zitong Yu,Haowen Tian,Xijing Wang,Shuo Yan,Lin Wang,Honglin Li,Xitong Ling,Bo Zhang,Zheng Zhang,Wufan Wang,Hui Gao,Xiangyang Gong,Wendong Wang

Main category: cs.CV

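TL;DR: SAC-MIL is a new method for WSI classification that uses a positional encoding module and a SAC block to handle spatial relationships and correlations among instances, achieving strong performance with easy deployment.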

Details

Motivation: Transformer-based methods require custom CUDA kernels, which complicates deployment. In addition, varying sequence lengths make length extrapolation difficult for the model. Method: SAC-MIL consists of a positional encoding module and a SAC block. The positional encoding module encodes spatial relationships using the instance coordinates within the slide, and the SAC block is an MLP-based module that performs full instance correlation in linear time with respect to the sequence length. Result: SAC-MIL achieves state-of-the-art performance on the CAMELYON-16, TCGA-LUNG, and TCGA-BRAC datasets. Conclusion: SAC-MIL is a new WSI classification method that is simple and efficient and achieves state-of-the-art performance. Abstract: We propose Spatial-Aware Correlated Multiple Instance Learning (SAC-MIL) for performing WSI classification. SAC-MIL consists of a positional encoding module to encode position information and a SAC block to perform full instance correlations. The positional encoding module utilizes the instance coordinates within the slide to encode the spatial relationships instead of the instance index in the input WSI sequence. The positional encoding module can also handle the length extrapolation issue where the training and testing sequences have different lengths. The SAC block is an MLP-based method that performs full instance correlation in linear time complexity with respect to the sequence length. Due to the simple structure of MLP, it is easy to deploy since it does not require custom CUDA kernels, compared to Transformer-based methods for WSI classification. SAC-MIL has achieved state-of-the-art performance on the CAMELYON-16, TCGA-LUNG, and TCGA-BRAC datasets. The code will be released upon acceptance.

[81] Improving Vessel Segmentation with Multi-Task Learning and Auxiliary Data Available Only During Model Training

Daniel Sobotka,Alexander Herold,Matthias Perkonigg,Lucian Beer,Nina Bastati,Alina Sablatnig,Ahmed Ba-Ssalamah,Georg Langs

Main category: cs.CV

TL;DR: This paper introduces a multi-task learning framework for liver vessel segmentation in non-contrast MRI, leveraging contrast-enhanced data during training to improve accuracy with limited annotations.

Details

Motivation: Liver vessel segmentation is crucial for analyzing vascular remodeling in liver diseases, but current methods rely on contrast-enhanced imaging which is not always available. Non-contrast images are more frequently acquired but challenging for vessel segmentation due to lack of annotations. Method: A multi-task learning framework was proposed, using paired native and contrast-enhanced MRI data during training to improve vessel segmentation without contrast, leveraging auxiliary data to enhance feature representation. Result: Using auxiliary contrast-enhanced data during training improved vessel segmentation accuracy on non-contrast MRI, particularly when few annotated examples were available. The method also showed benefits in brain tumor segmentation, indicating broad applicability. Conclusion: The proposed multi-task learning framework effectively improves liver vessel segmentation in non-contrast MRI data, especially when limited annotated data is available, and demonstrates cross-domain applicability. Abstract: Liver vessel segmentation in magnetic resonance imaging data is important for the computational analysis of vascular remodelling, associated with a wide spectrum of diffuse liver diseases. Existing approaches rely on contrast enhanced imaging data, but the necessary dedicated imaging sequences are not uniformly acquired. Images without contrast enhancement are acquired more frequently, but vessel segmentation is challenging, and requires large-scale annotated data. We propose a multi-task learning framework to segment vessels in liver MRI without contrast. It exploits auxiliary contrast enhanced MRI data available only during training to reduce the need for annotated training examples. Our approach draws on paired native and contrast enhanced data with and without vessel annotations for model training. Results show that auxiliary data improves the accuracy of vessel segmentation, even if they are not available during inference. 
The advantage is most pronounced if only few annotations are available for training, since the feature representation benefits from the shared task structure. A validation of this approach to augment a model for brain tumor segmentation confirms its benefits across different domains. An auxiliary informative imaging modality can augment expert annotations even if it is only available during training.

[82] Promptception: How Sensitive Are Large Multimodal Models to Prompts?

Mohamed Insaf Ismithdeen,Muhammad Uzair Khattak,Salman Khan

Main category: cs.CV

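TL;DR: The Promptception framework reveals how sensitive large multimodal models are to prompts in multiple-choice question answering and proposes tailored prompting principles for more robust and fair evaluation.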

Details

Motivation: The role of prompt design in multiple-choice question answering with large multimodal models is poorly understood, and model performance varies considerably with prompt phrasing and structure, which undermines transparent and fair evaluation. Method: The paper introduces Promptception, a framework of 61 prompt types spanning 15 categories and 6 supercategories, and uses it to evaluate 10 large multimodal models on 3 benchmarks. Result: Proprietary models are more sensitive to prompt phrasing, while open-source models are steadier but struggle with complex phrasing. Conclusion: Based on this analysis, the paper proposes prompting principles tailored to proprietary and open-source large multimodal models to make evaluation more robust and fair. Abstract: Despite the success of Large Multimodal Models (LMMs) in recent years, prompt design for LMMs in Multiple-Choice Question Answering (MCQA) remains poorly understood. We show that even minor variations in prompt phrasing and structure can lead to accuracy deviations of up to 15% for certain prompts and models. This variability poses a challenge for transparent and fair LMM evaluation, as models often report their best-case performance using carefully selected prompts. To address this, we introduce Promptception, a systematic framework for evaluating prompt sensitivity in LMMs. It consists of 61 prompt types, spanning 15 categories and 6 supercategories, each targeting specific aspects of prompt formulation, and is used to evaluate 10 LMMs ranging from lightweight open-source models to GPT-4o and Gemini 1.5 Pro, across 3 MCQA benchmarks: MMStar, MMMU-Pro, MVBench. Our findings reveal that proprietary models exhibit greater sensitivity to prompt phrasing, reflecting tighter alignment with instruction semantics, while open-source models are steadier but struggle with nuanced and complex phrasing. Based on this analysis, we propose Prompting Principles tailored to proprietary and open-source LMMs, enabling more robust and fair model evaluation.

[83] SliceSemOcc: Vertical Slice Based Multimodal 3D Semantic Occupancy Representation

Han Huang,Han Sun,Ningzhong Liu,Huiyu Zhou,Jiaquan Shen

Main category: cs.CV

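TL;DR: This paper proposes SliceSemOcc, a 3D semantic occupancy prediction framework that uses vertical slices and an SEAttention3D module to markedly improve the model's perception of height-axis features, achieving strong performance on multiple datasets.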

Details

Motivation: Existing methods overlook height-axis information when processing voxel features, and conventional SENet-style channel attention cannot assign different weights to different height layers, which limits the model's expressive power. Method: The paper proposes SliceSemOcc, a vertical slice based multimodal framework, together with an SEAttention3D module that preserves height-wise resolution and assigns channel attention weights dynamically. Result: Extensive experiments on the nuScenes-SurroundOcc and nuScenes-OpenOccupancy datasets show that the method significantly improves mean IoU, with especially strong gains on most small-object categories. Conclusion: SliceSemOcc improves 3D semantic occupancy prediction, particularly for small-object categories, and its vertical slices and SEAttention3D module strengthen the height awareness of the features. Abstract: Driven by autonomous driving's demands for precise 3D perception, 3D semantic occupancy prediction has become a pivotal research topic. Unlike bird's-eye-view (BEV) methods, which restrict scene representation to a 2D plane, occupancy prediction leverages a complete 3D voxel grid to model spatial structures in all dimensions, thereby capturing semantic variations along the vertical axis. However, most existing approaches overlook height-axis information when processing voxel features. Moreover, conventional SENet-style channel attention assigns uniform weight across all height layers, limiting its ability to emphasize features at different heights. To address these limitations, we propose SliceSemOcc, a novel vertical slice based multimodal framework for 3D semantic occupancy representation. Specifically, we extract voxel features along the height-axis using both global and local vertical slices. Then, a global local fusion module adaptively reconciles fine-grained spatial details with holistic contextual information. Furthermore, we propose the SEAttention3D module, which preserves height-wise resolution through average pooling and assigns dynamic channel attention weights to each height layer. Extensive experiments on nuScenes-SurroundOcc and nuScenes-OpenOccupancy datasets verify that our method significantly enhances mean IoU, achieving especially pronounced gains on most small-object categories. Detailed ablation studies further validate the effectiveness of the proposed SliceSemOcc framework.

[84] Detecting Regional Spurious Correlations in Vision Transformers via Token Discarding

Solha Kang,Esla Timothy Anzaku,Wesley De Neve,Arnout Van Messem,Joris Vankerschaver,Francois Rameau,Utku Ozbulak

Main category: cs.CV

TL;DR: This paper introduces a new method for detecting spurious correlations in vision transformers, revealing how training methods and dataset characteristics can affect model reliability, with implications for real-world applications like medical image classification.

Details

Motivation: The motivation stems from the need to build trustworthy machine learning models by identifying and mitigating spurious correlations, which occur when models rely on unintended but statistically relevant patterns in the data. Method: The researchers used a novel detection method applied to both supervised and self-supervised vision transformer models, conducting large-scale experiments on the ImageNet dataset and providing a case study on invasive breast mass classification. Result: The proposed method successfully detected spurious correlations in vision transformers, with results showing that training methodology significantly impacts model reliance on these correlations and that certain ImageNet classes contain easily detectable spurious signals. Conclusion: The study concludes that spurious correlations are a significant concern in vision transformers, affecting model reliability and generalizability, and calls for caution in the use of specific ImageNet images due to identified spurious signals. Abstract: Due to their powerful feature association capabilities, neural network-based computer vision models have the ability to detect and exploit unintended patterns within the data, potentially leading to correct predictions based on incorrect or unintended but statistically relevant signals. These clues may vary from simple color aberrations to small texts within the image. In situations where these unintended signals align with the predictive task, models can mistakenly link these features with the task and rely on them for making predictions. This phenomenon is referred to as spurious correlations, where patterns appear to be associated with the task but are actually coincidental. As a result, detection and mitigation of spurious correlations have become crucial tasks for building trustworthy, reliable, and generalizable machine learning models. 
In this work, we present a novel method to detect spurious correlations in vision transformers, a type of neural network architecture that gained significant popularity in recent years. Using both supervised and self-supervised trained models, we present large-scale experiments on the ImageNet dataset demonstrating the ability of the proposed method to identify spurious correlations. We also find that, even if the same architecture is used, the training methodology has a significant impact on the model's reliance on spurious correlations. Furthermore, we show that certain classes in the ImageNet dataset contain spurious signals that are easily detected by the models and discuss the underlying reasons for those spurious signals. In light of our findings, we provide an exhaustive list of the aforementioned images and call for caution in their use in future research efforts. Lastly, we present a case study investigating spurious signals in invasive breast mass classification, grounding our work in real-world scenarios.

[85] Learning from Majority Label: A Novel Problem in Multi-class Multiple-Instance Learning

Shiku Kaito,Shinnosuke Matsuo,Daiki Suehiro,Ryoma Bise

Main category: cs.CV

TL;DR: This paper introduces Learning from Majority Label (LML), a new multi-class Multiple-Instance Learning approach, which uses a Counting Network and a Majority Proportion Enhancement Module to improve classification performance.

Details

Motivation: The paper addresses a novel multi-class Multiple-Instance Learning (MIL) problem where bag-level labels are assigned based on the majority class of instances. This approach has practical applications in fields like pathology image segmentation, political voting prediction, customer sentiment analysis, and environmental monitoring. Method: A Counting Network is proposed to estimate bag-level majority labels by counting the number of instances per class. Additionally, a Majority Proportion Enhancement Module (MPEM) is introduced to improve the proportion of the majority class by removing minority instances. Result: The experiments showed that the proposed method outperforms existing MIL approaches on four datasets. Furthermore, ablation studies confirmed the effectiveness of both the Counting Network and the MPEM module. Conclusion: The proposed method, Learning from Majority Label (LML), demonstrates superiority over conventional MIL methods and effectively enhances the proportion of the majority class in bags using the MPEM module. Abstract: The paper proposes a novel multi-class Multiple-Instance Learning (MIL) problem called Learning from Majority Label (LML). In LML, the majority class of instances in a bag is assigned as the bag-level label. The goal of LML is to train a classification model that estimates the class of each instance using the majority label. This problem is valuable in a variety of applications, including pathology image segmentation, political voting prediction, customer sentiment analysis, and environmental monitoring. To solve LML, we propose a Counting Network trained to produce bag-level majority labels, estimated by counting the number of instances in each class. Furthermore, analysis experiments on the characteristics of LML revealed that bags with a high proportion of the majority class facilitate learning. 
Based on this result, we developed a Majority Proportion Enhancement Module (MPEM) that increases the proportion of the majority class by removing minority class instances within the bags. Experiments demonstrate the superiority of the proposed method on four datasets compared to conventional MIL methods. Moreover, ablation studies confirmed the effectiveness of each module. The code is available at https://github.com/Shiku-Kaito/Learning-from-Majority-Label-A-Novel-Problem-in-Multi-class-Multiple-Instance-Learning.
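
The LML setting can be illustrated with a few lines of plain Python: a bag's label is the most frequent instance class, and a bag-level prediction can be read off by counting per-instance predictions. This sketch only shows the problem definition and the counting idea. It is not the paper's Counting Network, which is trained end to end; the function names and the tie-breaking rule (first class seen wins) are assumptions.

```python
from collections import Counter


def majority_label(instance_labels):
    """Bag-level label: the most frequent instance class.

    Assumption: ties go to the class that appears first in the bag.
    """
    return Counter(instance_labels).most_common(1)[0][0]


def majority_proportion(instance_labels):
    """Fraction of the bag that belongs to the majority class."""
    return Counter(instance_labels).most_common(1)[0][1] / len(instance_labels)


def predict_bag_label(instance_probs):
    """Hard-count sketch: argmax each instance, then take the majority."""
    predicted = [max(range(len(p)), key=p.__getitem__) for p in instance_probs]
    return majority_label(predicted), predicted


bag = [0, 1, 1, 2, 1]
print(majority_label(bag), majority_proportion(bag))
print(predict_bag_label([[0.7, 0.3], [0.2, 0.8], [0.1, 0.9]]))
```

Removing the two minority instances from `bag` would raise its majority proportion from 0.6 to 1.0, which is the effect the MPEM is described as aiming for.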

[86] Millisecond-Response Tracking and Gazing System for UAVs: A Domestic Solution Based on "Phytium + Cambricon"

Yuchen Zhu,Longxiang Yin,Kai Zhao

Main category: cs.CV

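TL;DR: To address the response delay of traditional camera systems in dynamic scenes, this paper proposes a heterogeneous computing architecture based on Phytium processors and Cambricon accelerator cards and builds a UAV tracking and gazing system with millisecond-level response.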

Details

Motivation: Traditional camera systems exhibit response delays exceeding 200 ms in dynamic scenes and cannot meet real-time requirements in complex scenarios, so a new solution is needed. Method: At the hardware level, the system adopts a collaborative computing architecture of Phytium FT-2000/4 processors and MLU220 accelerator cards. At the software level, it integrates a lightweight YOLOv5s detection network with a DeepSORT cascaded tracking algorithm. Result: Experiments show that the system achieves a stable single-frame end-to-end processing delay of 50-100 ms on 1920*1080 video streams, with a multi-scale target recognition accuracy above 98.5%. Conclusion: The paper proposes a heterogeneous computing architecture based on Phytium processors and Cambricon accelerator cards, providing a solution for UAV monitoring and for the application of domestic chips. Abstract: In the frontier research and application of current video surveillance technology, traditional camera systems exhibit significant limitations of response delay exceeding 200 ms in dynamic scenarios due to the insufficient deep feature extraction capability of automatic recognition algorithms and the efficiency bottleneck of computing architectures, failing to meet the real-time requirements in complex scenes. To address this issue, this study proposes a heterogeneous computing architecture based on Phytium processors and Cambricon accelerator cards, constructing a UAV tracking and gazing system with millisecond-level response capability. At the hardware level, the system adopts a collaborative computing architecture of Phytium FT-2000/4 processors and MLU220 accelerator cards, enhancing computing power through multi-card parallelism. At the software level, it innovatively integrates a lightweight YOLOv5s detection network with a DeepSORT cascaded tracking algorithm, forming a closed-loop control chain of "detection-tracking-feedback". Experimental results demonstrate that the system achieves a stable single-frame comprehensive processing delay of 50-100 ms in 1920*1080 resolution video stream processing, with a multi-scale target recognition accuracy of over 98.5%, featuring both low latency and high precision. This study provides an innovative solution for UAV monitoring and the application of domestic chips.

[87] A Re-ranking Method using K-nearest Weighted Fusion for Person Re-identification

Quang-Huy Che,Le-Chuong Nguyen,Gia-Nghia Tran,Dinh-Duy Phan,Vinh-Tiep Nguyen

Main category: cs.CV

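TL;DR: This paper proposes an efficient re-ranking method for person re-identification that requires no model fine-tuning or extra annotations; it generates multi-view features through K-nearest Weighted Fusion and significantly improves performance on multiple datasets.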

Details

Motivation: Existing person re-identification research focuses mainly on single-view features, which are vulnerable to pose variation, viewpoint changes, and occlusions, whereas multi-view features can reduce view bias and improve accuracy. Method: The paper proposes a re-ranking method based on K-nearest Weighted Fusion (KWF) that selects K neighboring features in an unsupervised manner to generate multi-view features, and it explores weight selection strategies for feature aggregation. Result: Evaluation on Market1501, MSMT17, and Occluded-DukeMTMC shows that the method significantly improves Rank@1 and mAP when re-ranking the top M candidates from the initial ranking. On MSMT17 and Occluded-DukeMTMC, Rank@1 improves by 9.8% and 22.0%, respectively. The method is also more computationally efficient than other re-ranking methods. Conclusion: The paper presents an efficient re-ranking method that generates multi-view features with K-nearest Weighted Fusion, reduces view bias, and improves Rank@1 and mAP on multiple datasets while remaining computationally efficient. Abstract: In person re-identification, re-ranking is a crucial step to enhance the overall accuracy by refining the initial ranking of retrieved results. Previous studies have mainly focused on features from single-view images, which can cause view bias and issues like pose variation, viewpoint changes, and occlusions. Using multi-view features to present a person can help reduce view bias. In this work, we present an efficient re-ranking method that generates multi-view features by aggregating neighbors' features using the K-nearest Weighted Fusion (KWF) method. Specifically, we hypothesize that features extracted from re-identification models are highly similar when representing the same identity. Thus, we select K neighboring features in an unsupervised manner to generate multi-view features. Additionally, this study explores the weight selection strategies during feature aggregation, allowing us to identify an effective strategy. Our re-ranking approach does not require model fine-tuning or extra annotations, making it applicable to large-scale datasets. We evaluate our method on the person re-identification datasets Market1501, MSMT17, and Occluded-DukeMTMC. The results show that our method significantly improves Rank@1 and mAP when re-ranking the top M candidates from the initial ranking results. Specifically, compared to the initial results, our re-ranking method achieves improvements of 9.8%/22.0% in Rank@1 on the challenging datasets: MSMT17 and Occluded-DukeMTMC, respectively. 
Furthermore, our approach demonstrates substantial enhancements in computational efficiency compared to other re-ranking methods.
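
The fusion idea can be sketched as follows: each gallery feature is replaced by a weighted average of itself and its K most similar gallery features, and the query is then compared against these fused "multi-view" features. This is a simplified illustration rather than the authors' implementation. The function names, the use of cosine similarity, the similarity-proportional weights, and the self weight of 1.0 are all assumptions, and the sketch re-ranks the whole gallery instead of only the top M candidates.

```python
import math


def normalize(vector):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vector)) or 1.0
    return [x / norm for x in vector]


def cosine(a, b):
    """Cosine similarity, assuming both inputs are already unit vectors."""
    return sum(x * y for x, y in zip(a, b))


def fuse_with_neighbors(index, features, k):
    """Weighted average of one feature with its k most similar features."""
    anchor = features[index]
    neighbors = sorted(
        ((cosine(anchor, f), j) for j, f in enumerate(features) if j != index),
        reverse=True,
    )[:k]
    # Assumed weighting: the anchor gets 1.0, each neighbor its similarity.
    weights = [1.0] + [max(sim, 0.0) for sim, _ in neighbors]
    members = [anchor] + [features[j] for _, j in neighbors]
    fused = [
        sum(w * m[d] for w, m in zip(weights, members))
        for d in range(len(anchor))
    ]
    return normalize(fused)


def rerank(query, gallery, k):
    """Return gallery indices sorted by similarity to the query after fusion."""
    unit_gallery = [normalize(g) for g in gallery]
    fused = [fuse_with_neighbors(i, unit_gallery, k) for i in range(len(unit_gallery))]
    unit_query = normalize(query)
    return sorted(range(len(fused)), key=lambda i: cosine(unit_query, fused[i]), reverse=True)


gallery = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
print(rerank([1.0, 0.0], gallery, k=1))
```

Because the neighbors are chosen by feature similarity alone, no identity labels or fine-tuning are needed, which matches the training-free property claimed in the abstract.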

[88] TEn-CATS: Text-Enriched Audio-Visual Video Parsing with Multi-Scale Category-Aware Temporal Graph

Yaru Chen,Faegheh Sardari,Peiliang Zhang,Ruohao Guo,Yang Xiang,Zhenbo Li,Wenwu Wang

Main category: cs.CV

TL;DR: A new method for Audio-Visual Video Parsing (AVVP) is proposed, combining BiT and CATS modules to address error amplification in training, achieving state-of-the-art results on benchmark datasets.

Details

Motivation: Existing AVVP methods either treat noisy pseudo-labels as reliable supervision or spread attention indiscriminately across frames, leading to error amplification during training. Method: The method integrates Bi-Directional Text Fusion (BiT) and Category-Aware Temporal Graph (CATS) modules to enhance semantic cues and enable precise semantic information dissemination over time. Result: The proposed approach achieves state-of-the-art (SOTA) performance on multiple key indicators across two benchmark datasets, LLP and UnAV-100. Conclusion: The proposed method combining BiT and CATS modules outperforms existing methods in AVVP tasks by effectively addressing the issue of error amplification during training. Abstract: The Audio-Visual Video Parsing (AVVP) task aims to identify event categories and their occurrence times in a given video with weakly supervised labels. Existing methods typically fall into two categories: (i) designing enhanced architectures based on attention mechanisms for better temporal modeling, and (ii) generating richer pseudo-labels to compensate for the absence of frame-level annotations. However, the first type of methods treat noisy segment-level pseudo labels as reliable supervision, and the second type of methods let indiscriminate attention spread them across all frames, so the initial errors are repeatedly amplified during training. To address this issue, we propose a method that combines the Bi-Directional Text Fusion (BiT) module and Category-Aware Temporal Graph (CATS) module. Specifically, we integrate the strengths and complementarity of the two previous research directions. We first perform semantic injection and dynamic calibration on audio and visual modality features through the BiT module, to locate and purify cleaner and richer semantic cues. Then, we leverage the CATS module for semantic propagation and connection to enable precise semantic information dissemination across time. 
Experimental results demonstrate that our proposed method achieves state-of-the-art (SOTA) performance in multiple key indicators on two benchmark datasets, LLP and UnAV-100.

[89] TriLiteNet: Lightweight Model for Multi-Task Visual Perception

Quang-Huy Che,Duc-Khai Lam

Main category: cs.CV

TL;DR: The study presents TriLiteNet, an efficient multi-task model for Advanced Driver Assistance Systems that balances performance, computational efficiency, and scalability, achieving competitive results on the BDD100k dataset while maintaining low computational costs.

Details

Motivation: Efficient perception models are essential for Advanced Driver Assistance Systems (ADAS) to ensure rapid processing and response for safety and effectiveness in real-world environments. Method: This study introduces TriLiteNet, a model designed to optimize performance while maintaining low computational costs, with two configurations: a base version and a tiny version. Result: Experimental results on the BDD100k dataset demonstrate that TriLiteNet achieves competitive performance across three key tasks: vehicle detection, drivable area segmentation, and lane line segmentation. The base configuration achieved a recall of 85.6% for vehicle detection, a mean Intersection over Union (mIoU) of 92.4% for drivable area segmentation, and an accuracy of 82.3% for lane line segmentation with only 2.35M parameters and a computational cost of 7.72 GFLOPs. The tiny configuration includes just 0.14M parameters, providing a multi-task solution with minimal computational demand. TriLiteNet shows low latency and reasonable power consumption during inference on embedded devices. Conclusion: TriLiteNet offers a practical and deployable solution for real-world autonomous driving applications by balancing performance, computational efficiency, and scalability. Abstract: Efficient perception models are essential for Advanced Driver Assistance Systems (ADAS), as these applications require rapid processing and response to ensure safety and effectiveness in real-world environments. To address the real-time execution needs of such perception models, this study introduces the TriLiteNet model. This model can simultaneously manage multiple tasks related to panoramic driving perception. TriLiteNet is designed to optimize performance while maintaining low computational costs. Experimental results on the BDD100k dataset demonstrate that the model achieves competitive performance across three key tasks: vehicle detection, drivable area segmentation, and lane line segmentation. 
Specifically, the TriLiteNet_{base} demonstrated a recall of 85.6% for vehicle detection, a mean Intersection over Union (mIoU) of 92.4% for drivable area segmentation, and an Acc of 82.3% for lane line segmentation with only 2.35M parameters and a computational cost of 7.72 GFLOPs. Our proposed model includes a tiny configuration with just 0.14M parameters, which provides a multi-task solution with minimal computational demand. Evaluated for latency and power consumption on embedded devices, TriLiteNet in both configurations shows low latency and reasonable power during inference. By balancing performance, computational efficiency, and scalability, TriLiteNet offers a practical and deployable solution for real-world autonomous driving applications. Code is available at https://github.com/chequanghuy/TriLiteNet.

[90] DVS-PedX: Synthetic-and-Real Event-Based Pedestrian Dataset

Mustafa Sakhai,Kaung Sithu,Min Khant Soe Oke,Maciej Wielgosz

Main category: cs.CV

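TL;DR: DVS-PedX is a neuromorphic dataset for event-based pedestrian detection and crossing-intention analysis that combines synthetic and real-world data, intended to advance research on pedestrian safety and neuromorphic perception.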

Details

Motivation: Event cameras such as Dynamic Vision Sensors (DVS) offer low latency, high dynamic range, and motion robustness; DVS-PedX is designed to explore their use for pedestrian detection and crossing-intention analysis in normal and adverse weather conditions. Method: DVS-PedX comprises two complementary data sources: synthetic event streams generated in the CARLA simulator, and real-world JAAD dash-cam videos converted to event streams with the v2e tool. The dataset includes paired RGB frames, DVS event frames, and frame-level labels, and also provides raw AEDAT and AVI DVS video files with metadata. Result: Baselines on DVS-PedX reveal a sim-to-real gap, motivating research on domain adaptation and multimodal fusion. Conclusion: DVS-PedX is a neuromorphic dataset for pedestrian detection and crossing-intention analysis that aims to accelerate research in event-based pedestrian safety, intention prediction, and neuromorphic perception. Abstract: Event cameras like Dynamic Vision Sensors (DVS) report micro-timed brightness changes instead of full frames, offering low latency, high dynamic range, and motion robustness. DVS-PedX (Dynamic Vision Sensor Pedestrian eXploration) is a neuromorphic dataset designed for pedestrian detection and crossing-intention analysis in normal and adverse weather conditions across two complementary sources: (1) synthetic event streams generated in the CARLA simulator for controlled "approach-cross" scenes under varied weather and lighting; and (2) real-world JAAD dash-cam videos converted to event streams using the v2e tool, preserving natural behaviors and backgrounds. Each sequence includes paired RGB frames, per-frame DVS "event frames" (33 ms accumulations), and frame-level labels (crossing vs. not crossing). We also provide raw AEDAT 2.0/AEDAT 4.0 event files and AVI DVS video files and metadata for flexible re-processing. Baseline spiking neural networks (SNNs) using SpikingJelly illustrate dataset usability and reveal a sim-to-real gap, motivating domain adaptation and multimodal fusion. DVS-PedX aims to accelerate research in event-based pedestrian safety, intention prediction, and neuromorphic perception.
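The abstract describes per-frame DVS "event frames" as 33 ms accumulations. A minimal sketch of what such an accumulation could look like follows. The event tuple layout `(timestamp_us, x, y, polarity)`, the signed-count encoding, and the function name `accumulate_event_frames` are assumptions for illustration; the dataset's actual frame encoding is not specified in the abstract.

```python
from typing import List, Sequence, Tuple

# Hypothetical event layout: (timestamp in microseconds, x, y, polarity in {0, 1}).
Event = Tuple[int, int, int, int]


def accumulate_event_frames(
    events: Sequence[Event], width: int, height: int, window_us: int = 33_000
) -> List[List[List[int]]]:
    """Bin events into fixed time windows and sum signed polarities per pixel.

    Returns a list of frames; each frame is a height x width grid of ints.
    ON events (polarity 1) add +1, OFF events (polarity 0) add -1.
    """
    if not events:
        return []
    ordered = sorted(events, key=lambda e: e[0])
    t0 = ordered[0][0]
    n_frames = (ordered[-1][0] - t0) // window_us + 1
    frames = [[[0] * width for _ in range(height)] for _ in range(n_frames)]
    for t, x, y, polarity in ordered:
        index = (t - t0) // window_us  # which 33 ms window the event falls in
        frames[index][y][x] += 1 if polarity else -1
    return frames


if __name__ == "__main__":
    demo = [(0, 0, 0, 1), (10_000, 1, 0, 0), (40_000, 0, 1, 1)]
    for i, frame in enumerate(accumulate_event_frames(demo, width=2, height=2)):
        print(i, frame)
```

Signed counts are only one possible choice; separate ON/OFF channels or binary occupancy maps are equally plausible encodings for the same windowing step.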

[91] TaleDiffusion: Multi-Character Story Generation with Dialogue Rendering

Ayan Banerjee,Josep Lladós,Umapada Pal,Anjan Dutta

Main category: cs.CV

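TL;DR: TaleDiffusion is a framework for generating multi-character stories that maintains character consistency through an iterative process and assigns dialogue accurately via post-processing, outperforming existing methods in consistency, noise reduction, and dialogue rendering.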

Details

Motivation: Text-to-story visualization is challenging because multiple characters must interact consistently across frames. Existing methods fall short on character consistency, which leads to artifacts and inaccurate dialogue rendering and, in turn, disjointed storytelling. Method: Given a story, a pre-trained LLM generates per-frame descriptions, character details, and dialogues via in-context learning; a bounded attention-based per-box mask technique then controls character interactions and minimizes artifacts. An identity-consistent self-attention mechanism ensures character consistency across frames, and region-aware cross-attention provides precise object placement. Dialogues are rendered as bubbles and assigned to characters via CLIPSeg. Result: Experiments show that TaleDiffusion outperforms existing methods in consistency, noise reduction, and dialogue rendering. Conclusion: Through iterative generation and character-consistency control, TaleDiffusion effectively addresses the character-consistency and dialogue-rendering problems in text-to-story visualization. Abstract: Text-to-story visualization is challenging due to the need for consistent interaction among multiple characters across frames. Existing methods struggle with character consistency, leading to artifact generation and inaccurate dialogue rendering, which results in disjointed storytelling. In response, we introduce TaleDiffusion, a novel framework for generating multi-character stories with an iterative process, maintaining character consistency, and accurate dialogue assignment via postprocessing. Given a story, we use a pre-trained LLM to generate per-frame descriptions, character details, and dialogues via in-context learning, followed by a bounded attention-based per-box mask technique to control character interactions and minimize artifacts. We then apply an identity-consistent self-attention mechanism to ensure character consistency across frames and region-aware cross-attention for precise object placement. Dialogues are also rendered as bubbles and assigned to characters via CLIPSeg. Experimental results demonstrate that TaleDiffusion outperforms existing methods in consistency, noise reduction, and dialogue rendering.

[92] MEPG:Multi-Expert Planning and Generation for Compositionally-Rich Image Generation

Yuan Zhao,Liu Lin

Main category: cs.CV

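TL;DR: This paper proposes MEPG, a text-to-image generation framework that combines position- and style-aware large language models with multi-expert diffusion modules, improving the quality and stylistic diversity of generated images.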

Details

Motivation: Text-to-image diffusion models still struggle with complex, multi-element prompts and with producing diverse styles, so the MEPG framework is proposed to address these limitations. Method: The Multi-Expert Planning and Generation framework (MEPG) comprises a Position-Style-Aware (PSA) module and a Multi-Expert Diffusion (MED) module; it uses a supervised fine-tuned LLM to decompose input prompts and activates region-specific expert models through an attention-based gating mechanism. Result: Experiments show that MEPG significantly outperforms baseline models with the same backbone in image quality and style diversity. Conclusion: By integrating position- and style-aware LLMs with spatial-semantic expert modules, MEPG improves the image quality and style diversity of text-to-image diffusion models. Abstract: Text-to-image diffusion models have achieved remarkable image quality, but they still struggle with complex, multi-element prompts, and limited stylistic diversity. To address these limitations, we propose a Multi-Expert Planning and Generation Framework (MEPG) that synergistically integrates position- and style-aware large language models (LLMs) with spatial-semantic expert modules. The framework comprises two core components: (1) a Position-Style-Aware (PSA) module that utilizes a supervised fine-tuned LLM to decompose input prompts into precise spatial coordinates and style-encoded semantic instructions; and (2) a Multi-Expert Diffusion (MED) module that implements cross-region generation through dynamic expert routing across both local regions and global areas. During the generation process for each local region, specialized models (e.g., realism experts, stylization specialists) are selectively activated for each spatial partition via attention-based gating mechanisms. The architecture supports lightweight integration and replacement of expert models, providing strong extensibility. Additionally, an interactive interface enables real-time spatial layout editing and per-region style selection from a portfolio of experts. Experiments show that MEPG significantly outperforms baseline models with the same backbone in both image quality and style diversity.

[93] Revisiting Simple Baselines for In-The-Wild Deepfake Detection

Orlando Castaneda,Kevin So-Tang,Kshitij Gurung

Main category: cs.CV

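TL;DR: This paper evaluates deepfake detectors on the "in-the-wild" benchmark Deepfake-Eval-2024 and shows that, with better-tuned hyperparameters, the simple approach of Ojha et al. reaches 81% accuracy, competitive with commercial deepfake detectors.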

Details

Motivation: Existing deepfake detectors are mostly evaluated on highly controlled datasets; this work instead focuses on the recently released "in-the-wild" benchmark Deepfake-Eval-2024 and tries to improve the performance of existing baselines. Method: The paper revisits the baseline approach of Ojha et al., which adapts standard pretrained vision backbones to produce generalizable deepfake detectors, and improves its performance by tuning hyperparameters. Result: With better-tuned hyperparameters, the simple approach of Ojha et al. achieves 81% accuracy on Deepfake-Eval-2024, surpassing the previously reported accuracy of this baseline by 18% and competing with commercial deepfake detectors. Conclusion: With better-tuned hyperparameters, the simple approach of Ojha et al. can in fact deliver performance comparable to commercial deepfake detectors, reaching 81% accuracy on Deepfake-Eval-2024. Abstract: The widespread adoption of synthetic media demands accessible deepfake detectors and realistic benchmarks. While most existing research evaluates deepfake detectors on highly controlled datasets, we focus on the recently released "in-the-wild" benchmark, Deepfake-Eval-2024. Initial reporting on Deepfake-Eval-2024 showed that three finetuned open-source models achieve accuracies between 61% and 69%, significantly lagging behind the leading commercial deepfake detector with 82% accuracy. Our work revisits one of these baseline approaches, originally introduced by Ojha et al., which adapts standard pretrained vision backbones to produce generalizable deepfake detectors. We demonstrate that with better-tuned hyperparameters, this simple approach actually yields much higher performance -- 81% accuracy on Deepfake-Eval-2024 -- surpassing the previously reported accuracy of this baseline approach by 18% and competing with commercial deepfake detectors. We discuss tradeoffs in accuracy, computational costs, and interpretability, focusing on how practical these deepfake detectors might be when deployed in real-world settings. Our code can be found at https://github.com/Deepfake-Detection-KKO/deepfake-detection.

[94] YOLO Ensemble for UAV-based Multispectral Defect Detection in Wind Turbine Components

Serhii Svystun,Pavlo Radiuk,Oleksandr Melnychenko,Oleg Savenko,Anatoliy Sachenko

Main category: cs.CV

TL;DR: This research proposes an ensemble approach integrating a general-purpose YOLOv8 model with a specialized thermal model for improved defect detection in wind power plants using multispectral imagery.

Details

Motivation: Reliable defect detection in wind power plants requires high-resolution data and efficient methods to process multispectral imagery. Method: Development of an ensemble of YOLO-based deep learning models that integrate both visible and thermal channels, using a sophisticated bounding box fusion algorithm to combine their predictions. Result: The proposed approach achieves a mean Average Precision (mAP@.5) of 0.93 and an F1-score of 0.90, outperforming a standalone YOLOv8 model, which scored an mAP@.5 of 0.91. Conclusion: Combining multiple YOLO architectures with fused multispectral data provides a more reliable solution for improving the detection of both visual and thermal defects in wind power plants. Abstract: Unmanned aerial vehicles (UAVs) equipped with advanced sensors have opened up new opportunities for monitoring wind power plants, including blades, towers, and other critical components. However, reliable defect detection requires high-resolution data and efficient methods to process multispectral imagery. In this research, we aim to enhance defect detection accuracy through the development of an ensemble of YOLO-based deep learning models that integrate both visible and thermal channels. We propose an ensemble approach that integrates a general-purpose YOLOv8 model with a specialized thermal model, using a sophisticated bounding box fusion algorithm to combine their predictions. Our experiments show this approach achieves a mean Average Precision (mAP@.5) of 0.93 and an F1-score of 0.90, outperforming a standalone YOLOv8 model, which scored an mAP@.5 of 0.91. These findings demonstrate that combining multiple YOLO architectures with fused multispectral data provides a more reliable solution, improving the detection of both visual and thermal defects.
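The abstract does not specify the bounding box fusion algorithm, so the following is only a sketch of one simple way predictions from two detectors could be combined: same-label boxes that overlap above an IoU threshold are merged by confidence-weighted coordinate averaging, and unmatched boxes from either model are kept. The function names, the `(box, score, label)` tuple layout, and the 0.5 threshold are assumptions, not the authors' method.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)
Detection = Tuple[Box, float, str]       # (box, confidence, class label)


def box_iou(a: Box, b: Box) -> float:
    """Intersection over Union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def fuse_detections(
    dets_a: List[Detection], dets_b: List[Detection], iou_thr: float = 0.5
) -> List[Detection]:
    """Greedily match same-label boxes across two models and merge each match."""
    fused: List[Detection] = []
    used_b = set()
    for box_a, score_a, label_a in dets_a:
        best_j, best_iou = None, iou_thr
        for j, (box_b, _, label_b) in enumerate(dets_b):
            if j in used_b or label_b != label_a:
                continue
            overlap = box_iou(box_a, box_b)
            if overlap >= best_iou:
                best_j, best_iou = j, overlap
        if best_j is None:
            fused.append((box_a, score_a, label_a))  # seen by model A only
            continue
        box_b, score_b, _ = dets_b[best_j]
        used_b.add(best_j)
        total = score_a + score_b
        wa = score_a / total if total > 0 else 0.5  # fall back to a plain average
        merged = tuple(wa * ca + (1.0 - wa) * cb for ca, cb in zip(box_a, box_b))
        fused.append((merged, max(score_a, score_b), label_a))
    # Keep detections that only model B produced (e.g. thermal-only defects).
    fused.extend(d for j, d in enumerate(dets_b) if j not in used_b)
    return fused
```

Keeping unmatched boxes from both models reflects the motivation stated in the abstract, where the thermal model is expected to find defects the visible-channel model misses; a stricter ensemble could instead drop low-confidence unmatched boxes.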

[95] VisioFirm: Cross-Platform AI-assisted Annotation Tool for Computer Vision

Safouane El Ghazouali,Umberto Michelucci

Main category: cs.CV

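TL;DR: VisioFirm is an AI-assisted image annotation tool that uses state-of-the-art models to reduce manual effort and speed up labeling, while supporting multiple export formats and offline use.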

Details

Motivation: Image annotation is usually labor-intensive, and traditional tools require extensive manual input, which limits scalability for large datasets; a more efficient, automated solution is needed. Method: VisioFirm takes a hybrid approach that combines CLIP, pre-trained detectors (such as Ultralytics models), zero-shot models (such as Grounding DINO), and Segment Anything to generate initial annotations that users then refine with interactive tools. Result: Benchmarks on diverse datasets show up to a 90% reduction in manual effort while maintaining high annotation accuracy, with support for multiple annotation tasks and formats. Conclusion: VisioFirm is an open-source web application that, by integrating state-of-the-art AI models, substantially reduces the manual effort of image annotation while maintaining high accuracy and supporting multiple export formats and offline operation. Abstract: AI models rely on annotated data to learn patterns and make predictions. Annotation is usually a labor-intensive step that requires associating labels ranging from a simple classification label to more complex tasks such as object detection, oriented bounding box estimation, and instance segmentation. Traditional tools often require extensive manual input, limiting scalability for large datasets. To address this, we introduce VisioFirm, an open-source web application designed to streamline image labeling through AI-assisted automation. VisioFirm integrates state-of-the-art foundation models into an interface with a filtering pipeline to reduce human-in-the-loop efforts. This hybrid approach employs CLIP combined with pre-trained detectors like Ultralytics models for common classes and zero-shot models such as Grounding DINO for custom labels, generating initial annotations with low-confidence thresholding to maximize recall. Through this framework, when tested on COCO-type classes, initial predictions have proven to be mostly correct, and users can refine them via interactive tools supporting bounding boxes, oriented bounding boxes, and polygons. Additionally, VisioFirm has on-the-fly segmentation powered by Segment Anything accelerated through WebGPU for browser-side efficiency. The tool supports multiple export formats (YOLO, COCO, Pascal VOC, CSV) and operates offline after model caching, enhancing accessibility. 
VisioFirm demonstrates up to 90% reduction in manual effort through benchmarks on diverse datasets, while maintaining high annotation accuracy via clustering of connected CLIP-based disambiguate components and IoU-graph for redundant detection suppression. VisioFirm can be accessed from https://github.com/OschAI/VisioFirm.
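The abstract mentions an "IoU-graph for redundant detection suppression" without further detail. One plausible reading, sketched below, is to treat detections as graph nodes, connect pairs whose IoU exceeds a threshold, and keep only the highest-scoring detection in each connected component. This interpretation, the function names, and the 0.5 threshold are assumptions and may differ from what VisioFirm actually implements.

```python
from typing import List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


def pair_iou(a: Box, b: Box) -> float:
    """Intersection over Union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def suppress_redundant(
    boxes: Sequence[Box], scores: Sequence[float], iou_thr: float = 0.5
) -> List[int]:
    """Return indices of the best-scoring box in each IoU-connected component."""
    n = len(boxes)
    parent = list(range(n))  # union-find forest over detections

    def find(i: int) -> int:
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            if pair_iou(boxes[i], boxes[j]) >= iou_thr:
                parent[find(i)] = find(j)  # add an edge: merge the components

    best = {}
    for i in range(n):
        root = find(i)
        if root not in best or scores[i] > scores[best[root]]:
            best[root] = i
    return sorted(best.values())
```

Unlike greedy non-maximum suppression, a component-based rule also removes boxes that overlap the winner only indirectly, through a chain of overlapping boxes, which is aggressive but convenient when several detectors propose near-duplicates of the same object.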

[96] DUDE: Diffusion-Based Unsupervised Cross-Domain Image Retrieval

Ruohong Yang,Peng Hu,Yunfan Li,Xi Peng

Main category: cs.CV

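TL;DR: DUDE is a novel unsupervised cross-domain image retrieval method built on feature disentanglement: it uses a text-to-image generative model to separate object features from domain-specific styles and aligns domains progressively, and experiments show strong performance on multiple datasets.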

Details

Motivation: Existing UCIR methods often struggle with the domain gap because the object features that matter for retrieval are frequently entangled with domain-specific styles. Method: DUDE uses a text-to-image generative model to disentangle object features from domain-specific styles, and progressively aligns mutual neighbors from within domains to across domains to achieve reliable alignment. Result: Experiments show that DUDE achieves state-of-the-art performance across three benchmark datasets covering 13 domains. Conclusion: DUDE is a new UCIR method that achieves state-of-the-art performance; the code will be released. Abstract: Unsupervised cross-domain image retrieval (UCIR) aims to retrieve images of the same category across diverse domains without relying on annotations. Existing UCIR methods, which align cross-domain features for the entire image, often struggle with the domain gap, as the object features critical for retrieval are frequently entangled with domain-specific styles. To address this challenge, we propose DUDE, a novel UCIR method building upon feature disentanglement. In brief, DUDE leverages a text-to-image generative model to disentangle object features from domain-specific styles, thus facilitating semantical image retrieval. To further achieve reliable alignment of the disentangled object features, DUDE aligns mutual neighbors from within domains to across domains in a progressive manner. Extensive experiments demonstrate that DUDE achieves state-of-the-art performance across three benchmark datasets over 13 domains. The code will be released.

[97] Learning Active Perception via Self-Evolving Preference Optimization for GUI Grounding

Wanfu Wang,Qipeng Huang,Guangquan Xue,Xiaobo Liang,Juntao Li

Main category: cs.CV

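TL;DR: LASER is a new self-evolving framework that improves the multi-step perception capability and precision of vision-language models on GUI grounding tasks.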

Details

Motivation: Existing VLMs still struggle with GUI grounding under high-resolution inputs and complex multi-element visual interactions, which call for more precise perception and reasoning. Method: The LASER framework combines Monte Carlo quality estimation with IoU-based region quality evaluation to jointly encourage accuracy and diversity when constructing preference data for multi-step perception. Result: LASER yields consistent performance gains on the ScreenSpot-Pro and ScreenSpot-v2 benchmarks and, when fine-tuned on GTA1-7B, reaches a score of 55.7 on ScreenSpot-Pro, a new SoTA among 7B-scale models. Conclusion: By combining Monte Carlo quality estimation with IoU-based region quality evaluation, LASER substantially improves VLM performance on GUI grounding, especially under high-resolution inputs and complex multi-element visual interactions. Abstract: Vision Language Models (VLMs) have recently achieved significant progress in bridging visual perception and linguistic reasoning. Recently, the OpenAI o3 model introduced a zoom-in search strategy that effectively elicits active perception capabilities in VLMs, improving downstream task performance. However, enabling VLMs to reason effectively over appropriate image regions remains a core challenge in GUI grounding, particularly under high-resolution inputs and complex multi-element visual interactions. In this work, we propose LASER, a self-evolving framework that progressively endows VLMs with multi-step perception capabilities, enabling precise coordinate prediction. Specifically, our approach integrates Monte Carlo quality estimation with Intersection-over-Union (IoU)-based region quality evaluation to jointly encourage both accuracy and diversity in constructing high-quality preference data. This combination explicitly guides the model to focus on instruction-relevant key regions while adaptively allocating reasoning steps based on task complexity. Comprehensive experiments on the ScreenSpot-Pro and ScreenSpot-v2 benchmarks demonstrate consistent performance gains, validating the effectiveness of our method. Furthermore, when fine-tuned on GTA1-7B, LASER achieves a score of 55.7 on the ScreenSpot-Pro benchmark, establishing a new state-of-the-art (SoTA) among 7B-scale models.

[98] Differential Morphological Profile Neural Networks for Semantic Segmentation

David Huangal,J. Alex Hurt

Main category: cs.CV

TL;DR: This paper explores the use of Differential Morphological Profile (DMP) in semantic segmentation of overhead remote sensing imagery, extending prior work beyond classification and object detection, and shows that hybrid DMP can outperform non-DMP models on certain evaluation metrics.

Details

Motivation: Remote sensing poses challenges such as extreme scale variation, foreground-background imbalance, and large image sizes, which are not directly addressed by state-of-the-art segmentation networks developed and tuned on ground-perspective photographs. Method: The authors integrate Differential Morphological Profile (DMP) features into modern segmentation networks through direct input and hybrid architectures, and evaluate them on the iSAID benchmark dataset. Result: While non-DMP models generally outperform the direct-input variants, hybrid DMP consistently outperforms direct-input and is capable of surpassing a non-DMP model on mIoU, F1, and Recall. Conclusion: The authors extend prior DMPNet work beyond classification and object detection by integrating DMP features into three state-of-the-art convolutional and transformer semantic segmentation architectures, and show that hybrid DMP can surpass a non-DMP model on some evaluation metrics. Abstract: Semantic segmentation of overhead remote sensing imagery enables applications in mapping, urban planning, and disaster response. State-of-the-art segmentation networks are typically developed and tuned on ground-perspective photographs and do not directly address remote sensing challenges such as extreme scale variation, foreground-background imbalance, and large image sizes. We explore the incorporation of the differential morphological profile (DMP), a multi-scale shape extraction method based on grayscale morphology, into modern segmentation networks. Prior studies have shown that the DMP can provide critical shape information to Deep Neural Networks to enable superior detection and classification performance in overhead imagery. In this work, we extend prior DMPNet work beyond classification and object detection by integrating DMP features into three state-of-the-art convolutional and transformer semantic segmentation architectures. 
We utilize both direct input, which adapts the input stem of feature extraction architectures to accept DMP channels, and hybrid architectures, a dual-stream design that fuses RGB and DMP encoders. Using the iSAID benchmark dataset, we evaluate a variety of DMP differentials and structuring element shapes to more effectively provide shape information to the model. Our results show that while non-DMP models generally outperform the direct-input variants, hybrid DMP consistently outperforms direct-input and is capable of surpassing a non-DMP model on mIoU, F1, and Recall.

[99] TauGenNet: Plasma-Driven Tau PET Image Synthesis via Text-Guided 3D Diffusion Models

Yuxin Gong,Se-in Jang,Wei Shao,Yi Su,Kuang Gong

Main category: cs.CV

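TL;DR: The paper proposes a text-guided 3D diffusion model for synthesizing 3D tau PET images that combines structural MRI with plasma measurements, enabling non-invasive, low-cost visualization of tau pathology in Alzheimer's disease and simulation of disease progression.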

Details

Motivation: Because tau PET scans are costly and of limited availability, a more widely available, non-invasive alternative is needed to quantify tau pathology for diagnosing and monitoring Alzheimer's disease. Method: The study proposes a text-guided 3D diffusion model for 3D tau PET image synthesis, using anatomical constraints from structural MRI and textual prompts derived from plasma p-tau217 measurements. Result: Experiments show that the framework generates realistic, clinically meaningful 3D tau PET images, supports data augmentation across disease stages, and offers non-invasive, cost-effective visualization and disease-progression simulation. Conclusion: The proposed framework can serve as a non-invasive, cost-effective way to visualize tau pathology and to simulate disease progression under varying plasma biomarker levels and cognitive conditions. Abstract: Accurate quantification of tau pathology via tau positron emission tomography (PET) scan is crucial for diagnosing and monitoring Alzheimer's disease (AD). However, the high cost and limited availability of tau PET restrict its widespread use. In contrast, structural magnetic resonance imaging (MRI) and plasma-based biomarkers provide non-invasive and widely available complementary information related to brain anatomy and disease progression. In this work, we propose a text-guided 3D diffusion model for 3D tau PET image synthesis, leveraging multimodal conditions from both structural MRI and plasma measurement. Specifically, the textual prompt is from the plasma p-tau217 measurement, which is a key indicator of AD progression, while MRI provides anatomical structure constraints. The proposed framework is trained and evaluated using clinical AV1451 tau PET data from the Alzheimer's Disease Neuroimaging Initiative (ADNI) database. Experimental results demonstrate that our approach can generate realistic, clinically meaningful 3D tau PET across a range of disease stages. The proposed framework can help perform tau PET data augmentation under different settings, provide a non-invasive, cost-effective alternative for visualizing tau pathology, and support the simulation of disease progression under varying plasma biomarker levels and cognitive conditions.

[100] Dual-Scale Volume Priors with Wasserstein-Based Consistency for Semi-Supervised Medical Image Segmentation

Junying Meng,Gangxuan Zhou,Jun Liu,Weihong Guo

Main category: cs.CV

TL;DR: A new semi-supervised medical image segmentation framework integrates spatial regularization and volume priors to improve segmentation accuracy, showing superior performance on multiple datasets.

Details

Motivation: Most existing semi-supervised segmentation networks lack effective methodological guidance for feature extraction and fail to utilize important prior information from datasets, prompting the need for a more robust framework. Method: The approach combines a strong explicit volume prior and Threshold Dynamics spatial regularization from variational models into a segmentation network. A regression network estimates target region volumes for unlabeled images, and an image-scale Wasserstein distance constraint ensures class ratios align with predictions. Additionally, a dataset-scale Wasserstein distance loss function is designed based on a weak implicit volume prior. Result: Experimental results on the 2017 ACDC dataset, PROMISE12 dataset, and thigh muscle MR image dataset demonstrate the effectiveness and superiority of the proposed method. Conclusion: The proposed semi-supervised medical image segmentation framework demonstrates superiority over existing methods by integrating spatial regularization methods and volume priors, as evidenced by experimental results on multiple datasets. Abstract: Despite significant progress in semi-supervised medical image segmentation, most existing segmentation networks overlook effective methodological guidance for feature extraction and important prior information from datasets. In this paper, we develop a semi-supervised medical image segmentation framework that effectively integrates spatial regularization methods and volume priors. Specifically, our approach integrates a strong explicit volume prior at the image scale and Threshold Dynamics spatial regularization, both derived from variational models, into the backbone segmentation network. 
The target region volumes for each unlabeled image are estimated by a regression network, which effectively regularizes the backbone segmentation network through an image-scale Wasserstein distance constraint, ensuring that the class ratios in the segmentation results for each unlabeled image match those predicted by the regression network. Additionally, we design a dataset-scale Wasserstein distance loss function based on a weak implicit volume prior, which enforces that the volume distribution predicted for the unlabeled dataset is similar to that of the labeled dataset. Experimental results on the 2017 ACDC dataset, PROMISE12 dataset, and thigh muscle MR image dataset show the superiority of the proposed method.
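To make the dataset-scale idea concrete, the sketch below compares two sets of per-image volume fractions with the 1-Wasserstein distance. For two equal-sized one-dimensional samples this distance reduces to the mean absolute difference between the sorted values. Treating "volume" as the foreground fraction of a binary mask, requiring equal sample sizes, and the function names are simplifying assumptions; the paper's actual loss is not given in the abstract and may differ.

```python
from typing import List, Sequence


def volume_fraction(mask: Sequence[Sequence[int]]) -> float:
    """Fraction of foreground pixels (value 1) in a binary 2D mask."""
    total = sum(len(row) for row in mask)
    if total == 0:
        raise ValueError("mask must contain at least one pixel")
    return sum(sum(row) for row in mask) / total


def wasserstein_1d(a: Sequence[float], b: Sequence[float]) -> float:
    """1-Wasserstein distance between two equal-sized 1D empirical samples.

    With equal sample sizes the optimal transport plan pairs the i-th smallest
    value of `a` with the i-th smallest value of `b`.
    """
    if len(a) != len(b) or len(a) == 0:
        raise ValueError("samples must be non-empty and of equal size")
    return sum(abs(x - y) for x, y in zip(sorted(a), sorted(b))) / len(a)


if __name__ == "__main__":
    labeled: List[float] = [volume_fraction([[1, 0], [0, 0]]), volume_fraction([[1, 1], [0, 0]])]
    predicted: List[float] = [0.30, 0.45]
    print(wasserstein_1d(labeled, predicted))
```

In training, the predicted fractions would come from soft segmentation outputs so that the distance stays differentiable; this plain-Python version only illustrates the quantity being matched.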

[101] Self-adaptive Dataset Construction for Real-World Multimodal Safety Scenarios

Jingen Qu,Lijun Li,Bo Zhang,Yichen Yan,Jing Shao

Main category: cs.CV

TL;DR: This paper proposes an image-oriented dataset construction method and evaluation metric for real-world multimodal safety scenarios, resulting in a dataset of 35k pairs and demonstrating the effectiveness of the approach.

Details

Motivation: Current dataset construction methods for multimodal large language models (MLLMs) are risk-oriented and lack the ability to cover the growing complexity of real-world multimodal safety scenarios (RMS). Additionally, there is a lack of a unified evaluation metric to prove their overall effectiveness. Method: The paper introduces a novel image-oriented self-adaptive dataset construction method and proposes a standardized safety dataset evaluation metric by fine-tuning a safety judge model. Result: Using the proposed method, the authors automatically generated an RMS dataset comprising 35k image-text pairs with guidance responses. Extensive experiments demonstrated the effectiveness and scalability of the image-oriented pipeline. Conclusion: The paper concludes that the proposed image-oriented approach is scalable and effective, providing a new perspective for constructing real-world multimodal safety datasets. Abstract: Multimodal large language models (MLLMs) are rapidly evolving, presenting increasingly complex safety challenges. However, current dataset construction methods, which are risk-oriented, fail to cover the growing complexity of real-world multimodal safety scenarios (RMS). Moreover, due to the lack of a unified evaluation metric, their overall effectiveness remains unproven. This paper introduces a novel image-oriented self-adaptive dataset construction method for RMS, which starts with images and ends by constructing paired text and guidance responses. Using the image-oriented method, we automatically generate an RMS dataset comprising 35k image-text pairs with guidance responses. Additionally, we introduce a standardized safety dataset evaluation metric: fine-tuning a safety judge model and evaluating its capabilities on other safety datasets. Extensive experiments on various tasks demonstrate the effectiveness of the proposed image-oriented pipeline. 
The results confirm the scalability and effectiveness of the image-oriented approach, offering a new perspective for the construction of real-world multimodal safety datasets.

[102] PAOLI: Pose-free Articulated Object Learning from Sparse-view Images

Jianning Deng,Kartic Subr,Hakan Bilen

Main category: cs.CV

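TL;DR: The paper proposes a new self-supervised learning framework that produces accurate and detailed articulated object representations from sparse views and without camera supervision.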

Details

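Motivation: Existing methods require dense multi-view observations and ground-truth camera poses; this work aims to learn articulated object representations from far fewer views (as few as four per articulation) and without camera supervision. Method: The paper proposes a new self-supervised framework that learns articulated object representations from sparse-view, unposed images. It reconstructs each articulation independently, learns a deformation field that establishes dense correspondences, applies a progressive disentanglement strategy to separate static from moving parts, and finally jointly optimizes geometry, appearance, and kinematics. Result: Experiments on the standard benchmark and real-world examples show that the method still produces accurate and detailed articulated object representations under weaker input conditions. Conclusion: The method produces accurate and detailed articulated object representations under significantly weaker input assumptions than existing approaches. Abstract: We present a novel self-supervised framework for learning articulated object representations from sparse-view, unposed images. Unlike prior methods that require dense multi-view observations and ground-truth camera poses, our approach operates with as few as four views per articulation and no camera supervision. To address the inherent challenges, we first reconstruct each articulation independently using recent advances in sparse-view 3D reconstruction, then learn a deformation field that establishes dense correspondences across poses. A progressive disentanglement strategy further separates static from moving parts, enabling robust separation of camera and object motion. Finally, we jointly optimize geometry, appearance, and kinematics with a self-supervised loss that enforces cross-view and cross-pose consistency. Experiments on the standard benchmark and real-world examples demonstrate that our method produces accurate and detailed articulated object representations under significantly weaker input assumptions than existing approaches.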

[103] The Telephone Game: Evaluating Semantic Drift in Unified Models

Sabbir Mollah,Rohit Gupta,Sirnam Swetha,Qingyang Liu,Ahnaf Munir,Mubarak Shah

Main category: cs.CV

TL;DR: This paper introduces a new evaluation framework to measure semantic drift in unified visual language models when cycling between understanding and generation tasks, showing that consistency over multiple cycles is crucial and varies significantly across models.

Details

Motivation: The motivation is to evaluate the consistency between visual understanding (image-to-text) and visual generation (text-to-image) in unified models, as current single-pass metrics do not capture the semantic drift when cycling between modalities. Method: The authors introduced the Unified Consistency Framework (UCF-UM), which uses a cyclic evaluation protocol alternating between image-to-text and text-to-image generations. They proposed three metrics: Mean Cumulative Drift (MCD), Semantic Drift Rate (SDR), and Multi-Generation GenEval (MGG) to quantify semantic drift. Result: The evaluation on seven recent models using the UCF-UM revealed significant variations in cross-modal stability, showing that some models maintain semantics over many alternations while others experience rapid drift despite strong single-pass scores. Conclusion: The study highlights the importance of cyclic consistency as a necessary complement to standard evaluations for unified models, providing practical metrics to assess cross-modal stability and the strength of shared representations. Abstract: Employing a single, unified model (UM) for both visual understanding (image-to-text: I2T) and visual generation (text-to-image: T2I) has opened a new direction in Visual Language Model (VLM) research. While UMs can also support broader unimodal tasks (e.g., text-to-text, image-to-image), we focus on the core cross-modal pair T2I and I2T, as consistency between understanding and generation is critical for downstream use. Existing evaluations consider these capabilities in isolation: FID and GenEval for T2I, and benchmarks such as MME, MMBench for I2T. These single-pass metrics do not reveal whether a model that understands a concept can also render it, nor whether meaning is preserved when cycling between image and text modalities. 
To address this, we introduce the Unified Consistency Framework for Unified Models (UCF-UM), a cyclic evaluation protocol that alternates I2T and T2I over multiple generations to quantify semantic drift. UCF-UM formulates three metrics: (i) Mean Cumulative Drift (MCD), an embedding-based measure of overall semantic loss; (ii) Semantic Drift Rate (SDR), which summarizes the semantic decay rate; and (iii) Multi-Generation GenEval (MGG), an object-level compliance score extending GenEval. To assess generalization beyond COCO, which is widely used in training, we create a new benchmark, ND400, sampled from NoCaps and DOCCI, and evaluate on seven recent models. UCF-UM reveals substantial variation in cross-modal stability: some models like BAGEL maintain semantics over many alternations, whereas others like Vila-u drift quickly despite strong single-pass scores. Our results highlight cyclic consistency as a necessary complement to standard I2T and T2I evaluations, and provide practical metrics to consistently assess unified models' cross-modal stability and the strength of their shared representations. Code: https://github.com/mollahsabbir/Semantic-Drift-in-Unified-Models
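The cyclic protocol can be illustrated with a short sketch: alternate I2T and T2I for a fixed number of generations, then score how far later generations drift from the starting point in an embedding space. The callables `i2t` and `t2i`, the embedding vectors, and the `mean_similarity_to_origin` score are hypothetical stand-ins; in particular, this score is not the paper's MCD, SDR, or MGG definition, which the abstract does not spell out.

```python
import math
from typing import Callable, List, Sequence


def alternate_modalities(
    start_image: object,
    i2t: Callable[[object], object],
    t2i: Callable[[object], object],
    n_generations: int,
) -> List[object]:
    """Alternate image-to-text and text-to-image, starting from an image."""
    outputs = [start_image]
    current = start_image
    for generation in range(n_generations):
        # Even steps caption the current image; odd steps render the caption.
        current = i2t(current) if generation % 2 == 0 else t2i(current)
        outputs.append(current)
    return outputs


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def mean_similarity_to_origin(embeddings: Sequence[Sequence[float]]) -> float:
    """Average cosine similarity between generation 0 and every later generation."""
    if len(embeddings) < 2:
        raise ValueError("need the origin plus at least one later generation")
    origin, later = embeddings[0], embeddings[1:]
    return sum(cosine(origin, e) for e in later) / len(later)
```

A model that preserves meaning across alternations would keep this average close to 1, while a fast-drifting model would see it fall as generations accumulate; in practice the embeddings would come from a shared image-text encoder so that captions and images are comparable.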

Yingxuan Li,Jiafeng Mao,Yusuke Matsui

Main category: cs.CV

TL;DR: This paper proposes a novel method that leverages synthetic images to identify and correct mislabeled samples in noisy image classification datasets, significantly improving accuracy and complementing existing noise-robust learning techniques.

Details

Motivation: Semantic noise in image classification datasets, where visually similar categories are frequently mislabeled, poses a significant challenge to conventional supervised learning approaches. This motivates the exploration of alternative strategies, such as leveraging synthetic images with reliable labels, to improve classification accuracy in the presence of noisy data. Method: The method uses synthetic images generated by advanced text-to-image models as reliable reference points to detect and correct mislabeled samples in noisy datasets. It addresses domain gaps and diversity constraints associated with direct use of synthetic images in training. Result: Extensive experiments show that the approach significantly improves classification accuracy under various noise conditions. When combined with state-of-the-art noise-robust training methods, it achieves a 30% accuracy improvement on CIFAR-10, 11% on CIFAR-100 under 70% semantic noise, and 24% on ImageNet-100 under real-world noise conditions. Conclusion: The proposed method effectively utilizes synthetic images to identify and correct mislabeled samples in noisy datasets, significantly improving classification accuracy, especially in scenarios with semantic label noise. The method complements existing noise-robust learning techniques, achieving superior performance when combined with them. Abstract: Semantic noise in image classification datasets, where visually similar categories are frequently mislabeled, poses a significant challenge to conventional supervised learning approaches. In this paper, we explore the potential of using synthetic images generated by advanced text-to-image models to address this issue. Although these high-quality synthetic images come with reliable labels, their direct application in training is limited by domain gaps and diversity constraints. 
Unlike conventional approaches, we propose a novel method that leverages synthetic images as reliable reference points to identify and correct mislabeled samples in noisy datasets. Extensive experiments across multiple benchmark datasets show that our approach significantly improves classification accuracy under various noise conditions, especially in challenging scenarios with semantic label noise. Additionally, since our method is orthogonal to existing noise-robust learning techniques, when combined with state-of-the-art noise-robust training methods, it achieves superior performance, improving accuracy by 30% on CIFAR-10 and by 11% on CIFAR-100 under 70% semantic noise, and by 24% on ImageNet-100 under real-world noise conditions.

[105] Efficient Odd-One-Out Anomaly Detection

Silvio Chito,Paolo Rabino,Tatiana Tommasi

Main category: cs.CV

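TL;DR: This paper studies efficient solutions for the odd-one-out anomaly detection task, proposing a DINO-based model that reduces parameter count and training time while maintaining competitive performance, and introducing a Multimodal Large Language Model baseline.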

Details

Motivation: The odd-one-out anomaly detection task poses challenges for modern deep learning models and calls for efficient solutions. Method: A DINO-based model is proposed, along with a Multimodal Large Language Model baseline. Result: Compared with the current state of the art, the proposed model reduces the number of parameters by one third and shortens training time by a factor of three. Conclusion: The paper proposes a DINO-based model that addresses efficiency in odd-one-out anomaly detection, reducing parameter count and training time while maintaining competitive performance. Abstract: The recently introduced odd-one-out anomaly detection task involves identifying the odd-looking instances within a multi-object scene. This problem presents several challenges for modern deep learning models, demanding spatial reasoning across multiple views and relational reasoning to understand context and generalize across varying object categories and layouts. We argue that these challenges must be addressed with efficiency in mind. To this end, we propose a DINO-based model that reduces the number of parameters by one third and shortens training time by a factor of three compared to the current state-of-the-art, while maintaining competitive performance. Our experimental evaluation also introduces a Multimodal Large Language Model baseline, providing insights into its current limitations in structured visual reasoning tasks. The project page can be found at https://silviochito.github.io/EfficientOddOneOut/

[106] GeoArena: An Open Platform for Benchmarking Large Vision-language Models on WorldWide Image Geolocalization

Pengyue Jia,Yingyi Zhang,Xiangyu Zhao,Yixuan Li

Main category: cs.CV

TL;DR: The paper proposes GeoArena, an open platform for evaluating image geolocalization models using in-the-wild images and human-centered judgments.

Details

Motivation: Current evaluation methodologies for image geolocalization suffer from data leakage and rely on exact geographic coordinates, which neglects the reasoning process and raises privacy concerns. Method: The paper introduces GeoArena, which allows users to upload in-the-wild images and leverages pairwise human judgments to evaluate model outputs based on human expectations. Result: The platform was deployed online for two months, collecting thousands of voting records, which were used to establish a leaderboard of different LVLMs for image geolocalization. Conclusion: GeoArena is a first open platform for evaluating LVLMs on worldwide image geolocalization tasks, offering true in-the-wild and human-centered benchmarking. Abstract: Image geolocalization aims to predict the geographic location of images captured anywhere on Earth, but its global nature presents significant challenges. Current evaluation methodologies suffer from two major limitations. First, data leakage: advanced approaches often rely on large vision-language models (LVLMs) to predict image locations, yet these models are frequently pretrained on the test datasets, compromising the accuracy of evaluating a model's actual geolocalization capability. Second, existing metrics primarily rely on exact geographic coordinates to assess predictions, which not only neglects the reasoning process but also raises privacy concerns when user-level location data is required. To address these issues, we propose GeoArena, a first open platform for evaluating LVLMs on worldwide image geolocalization tasks, offering true in-the-wild and human-centered benchmarking. GeoArena enables users to upload in-the-wild images for a more diverse evaluation corpus, and it leverages pairwise human judgments to determine which model output better aligns with human expectations. Our platform has been deployed online for two months, during which we collected thousands of voting records.
Based on this data, we conduct a detailed analysis and establish a leaderboard of different LVLMs on the image geolocalization task.
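The abstract does not say how pairwise votes are turned into a leaderboard. One common choice for arena-style platforms is an Elo-style rating update; the sketch below assumes that choice. The function name, K-factor, starting rating, and model names are hypothetical and are not taken from the paper.

```python
from collections import defaultdict

def elo_leaderboard(votes, k=32.0, start=1000.0):
    """Rank models from pairwise votes given as (winner, loser) tuples.

    Assumption: a plain Elo update. GeoArena's actual rating method is not
    stated in the abstract.
    """
    ratings = defaultdict(lambda: start)
    for winner, loser in votes:
        # Expected score of the winner under the Elo logistic model.
        expected = 1.0 / (1.0 + 10 ** ((ratings[loser] - ratings[winner]) / 400.0))
        delta = k * (1.0 - expected)
        ratings[winner] += delta
        ratings[loser] -= delta
    # Highest rating first.
    return sorted(ratings.items(), key=lambda item: item[1], reverse=True)

# Hypothetical votes between three placeholder models.
board = elo_leaderboard([("model_a", "model_b"), ("model_a", "model_c"), ("model_b", "model_c")])
```

Sequential Elo depends on vote order; an order-independent alternative would be to fit a Bradley-Terry model to all votes at once.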

[107] From Editor to Dense Geometry Estimator

JiYuan Wang,Chunyu Lin,Lei Sun,Rongying Liu,Lang Nie,Mingxing Li,Kang Liao,Xiangxiang Chu,Yao Zhao

Main category: cs.CV

TL;DR: This paper proposes FE2E, which adapts an image editing model for dense geometry prediction and achieves strong performance.

Details

Motivation: Dense prediction is inherently an image-to-image task, so image editing models may be a better foundation for fine-tuning than text-to-image generative models. Method: The authors adapt an editing model based on the Diffusion Transformer architecture, redesign the training objective and the quantization scheme, and use global attention to jointly estimate depth and normals. Result: FE2E achieves significant performance gains in monocular depth and normal estimation across multiple datasets, including gains of over 35% on the ETH3D dataset. Conclusion: FE2E exploits the structural priors of image editing models to perform well on dense geometry estimation, improving performance without scaling up the training data. Abstract: Leveraging visual priors from pre-trained text-to-image (T2I) generative models has shown success in dense prediction. However, dense prediction is inherently an image-to-image task, suggesting that image editing models, rather than T2I generative models, may be a more suitable foundation for fine-tuning. Motivated by this, we conduct a systematic analysis of the fine-tuning behaviors of both editors and generators for dense geometry estimation. Our findings show that editing models possess inherent structural priors, which enable them to converge more stably by "refining" their innate features, and ultimately achieve higher performance than their generative counterparts. Based on these findings, we introduce FE2E, a framework that pioneeringly adapts an advanced editing model based on Diffusion Transformer (DiT) architecture for dense geometry prediction. Specifically, to tailor the editor for this deterministic task, we reformulate the editor's original flow matching loss into the "consistent velocity" training objective. We also use logarithmic quantization to resolve the precision conflict between the editor's native BFloat16 format and the high precision demand of our tasks. Additionally, we leverage the DiT's global attention for a cost-free joint estimation of depth and normals in a single forward pass, enabling their supervisory signals to mutually enhance each other. Without scaling up the training data, FE2E achieves impressive performance improvements in zero-shot monocular depth and normal estimation across multiple datasets.
Notably, it achieves over 35% performance gains on the ETH3D dataset and outperforms the DepthAnything series, which is trained on 100× the data. The project page can be accessed at https://amap-ml.github.io/FE2E/

[108] MICACL: Multi-Instance Category-Aware Contrastive Learning for Long-Tailed Dynamic Facial Expression Recognition

Feng-Qi Cui,Zhen Lin,Xinlong Rao,Anyang Tong,Shiyao Li,Fei Wang,Changlin Chen,Bin Liu

Main category: cs.CV

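TL;DR: MICACL combines spatio-temporal dependency modeling with long-tailed contrastive learning optimization, achieving notable progress in dynamic facial expression recognition and effectively addressing model induction bias.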

Details

Motivation: DFER faces significant challenges, mainly due to long-tailed category distributions and the complexity of spatio-temporal feature modeling. Existing deep learning methods often fail to address these issues, resulting in severe model induction bias. Method: The authors propose MICACL, a multi-instance learning framework comprising the Graph-Enhanced Instance Interaction Module (GEIIM), the Weighted Instance Aggregation Network (WIAN), and a Multiscale Category-aware Contrastive Learning (MCCL) strategy. Result: Extensive experiments on in-the-wild datasets (DFEW and FERV39k) show that MICACL achieves state-of-the-art performance in robustness and generalization. Conclusion: MICACL is a novel multi-instance learning framework that integrates spatio-temporal dependency modeling and long-tailed contrastive learning optimization to address model induction bias in dynamic facial expression recognition (DFER). Abstract: Dynamic facial expression recognition (DFER) faces significant challenges due to long-tailed category distributions and the complexity of spatio-temporal feature modeling. While existing deep learning-based methods have improved DFER performance, they often fail to address these issues, resulting in severe model induction bias. To overcome these limitations, we propose a novel multi-instance learning framework called MICACL, which integrates spatio-temporal dependency modeling and long-tailed contrastive learning optimization. Specifically, we design the Graph-Enhanced Instance Interaction Module (GEIIM) to capture intricate spatio-temporal relationships between adjacent instances through adaptive adjacency matrices and multiscale convolutions. To enhance instance-level feature aggregation, we develop the Weighted Instance Aggregation Network (WIAN), which dynamically assigns weights based on instance importance. Furthermore, we introduce a Multiscale Category-aware Contrastive Learning (MCCL) strategy to balance training between major and minor categories. Extensive experiments on in-the-wild datasets (i.e., DFEW and FERV39k) demonstrate that MICACL achieves state-of-the-art performance with superior robustness and generalization.

[109] Stitching the Story: Creating Panoramic Incident Summaries from Body-Worn Footage

Dor Cohen,Inga Efrosman,Yehudit Aperstein,Alexander Apartsin

Main category: cs.CV

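TL;DR: This paper proposes a method for converting body-worn camera footage into panoramic images to support rapid understanding of incident scenes.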

Details

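Motivation: First responders widely use body-worn cameras to document incident scenes and support post-event analysis, but reviewing lengthy footage is impractical in time-critical situations. Effective situational awareness requires a concise visual summary that can be interpreted quickly. Method: The method uses monocular Simultaneous Localization and Mapping (SLAM) to estimate the camera trajectory and reconstruct the spatial layout of the environment. Key viewpoints are identified by clustering camera poses along the trajectory, and representative frames are selected from each cluster. These frames are then fused into spatially coherent panoramic images using multi-frame stitching. Result: The resulting summaries support rapid understanding of complex environments and facilitate efficient decision-making and incident review. Conclusion: The work presents a computer vision pipeline that converts body-worn camera footage into panoramic images, supporting rapid understanding of complex environments and facilitating efficient decision-making and incident review. Abstract: First responders widely adopt body-worn cameras to document incident scenes and support post-event analysis. However, reviewing lengthy video footage is impractical in time-critical situations. Effective situational awareness demands a concise visual summary that can be quickly interpreted. This work presents a computer vision pipeline that transforms body-camera footage into informative panoramic images summarizing the incident scene. Our method leverages monocular Simultaneous Localization and Mapping (SLAM) to estimate camera trajectories and reconstruct the spatial layout of the environment. Key viewpoints are identified by clustering camera poses along the trajectory, and representative frames from each cluster are selected. These frames are fused into spatially coherent panoramic images using multi-frame stitching techniques. The resulting summaries enable rapid understanding of complex environments and facilitate efficient decision-making and incident review.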
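The key-viewpoint step (cluster camera poses, then pick a representative frame per cluster) can be sketched as follows. This is a minimal illustration under stated assumptions: it uses plain k-means on 3D camera positions only, with a deterministic initialisation. The abstract does not name the clustering algorithm or say whether orientation is included, and the function name `select_key_frames` is hypothetical.

```python
import numpy as np

def select_key_frames(positions, n_clusters, n_iters=20):
    """Cluster camera positions and return one representative frame index per cluster.

    Assumption: plain k-means on 3D positions; the paper's actual clustering
    method is not specified in the abstract.
    """
    positions = np.asarray(positions, dtype=float)
    # Deterministic initialisation: evenly spaced frames along the trajectory.
    init_idx = np.linspace(0, len(positions) - 1, n_clusters).astype(int)
    centroids = positions[init_idx].copy()
    for _ in range(n_iters):
        # Distance from every pose to every centroid, shape (n_frames, n_clusters).
        dists = np.linalg.norm(positions[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        for c in range(n_clusters):
            members = positions[labels == c]
            if len(members) > 0:
                centroids[c] = members.mean(axis=0)
    # Representative frame: the pose closest to each final centroid.
    dists = np.linalg.norm(positions[:, None, :] - centroids[None, :, :], axis=2)
    return sorted(int(i) for i in dists.argmin(axis=0))

# Toy trajectory: five frames near the origin, then five frames about 10 m away.
trajectory = [[0.1 * i, 0.0, 0.0] for i in range(5)] + [[10.0 + 0.1 * i, 0.0, 0.0] for i in range(5)]
key_frames = select_key_frames(trajectory, n_clusters=2)
```

In a real pipeline the selected frame indices would then be passed to the stitching stage; choosing the frame nearest the centroid is one simple selection rule among several possible ones.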

[110] AnomalyLMM: Bridging Generative Knowledge and Discriminative Retrieval for Text-Based Person Anomaly Search

Hao Ju,Hu Zhang,Zhedong Zheng

Main category: cs.CV

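TL;DR: This paper proposes AnomalyLMM, a framework that uses Large Multi-modal Models for text-based person anomaly search. Through a coarse-to-fine pipeline and training-free adaptation strategies, it addresses fine-grained cross-modal alignment and the sparsity of real-world samples, and it shows superior performance and interpretability on the PAB dataset.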

Details

Motivation: With growing public safety demands, text-based person anomaly search has become an important task, but it faces the challenges of fine-grained cross-modal alignment and sparse real-world samples. Method: A coarse-to-fine pipeline is proposed, combined with a training-free adaptation strategy featuring masked cross-modal prompting, behavioral saliency prediction, and knowledge-aware re-ranking. Result: Experiments on the PAB dataset show that the proposed method surpasses the competitive baseline by +0.96% Recall@1 accuracy and reveals interpretable alignment between textual anomalies and visual behaviors. Conclusion: The paper proposes AnomalyLMM, a new framework that uses Large Multi-modal Models for text-based person anomaly search, addresses several challenges in this area, and shows superior performance on the PAB dataset. Abstract: With growing public safety demands, text-based person anomaly search has emerged as a critical task, aiming to retrieve individuals with abnormal behaviors via natural language descriptions. Unlike conventional person search, this task presents two unique challenges: (1) fine-grained cross-modal alignment between textual anomalies and visual behaviors, and (2) anomaly recognition under sparse real-world samples. While Large Multi-modal Models (LMMs) excel in multi-modal understanding, their potential for fine-grained anomaly retrieval remains underexplored, hindered by: (1) a domain gap between generative knowledge and discriminative retrieval, and (2) the absence of efficient adaptation strategies for deployment. In this work, we propose AnomalyLMM, the first framework that harnesses LMMs for text-based person anomaly search. Our key contributions are: (1) A novel coarse-to-fine pipeline integrating LMMs to bridge generative world knowledge with retrieval-centric anomaly detection; (2) A training-free adaptation cookbook featuring masked cross-modal prompting, behavioral saliency prediction, and knowledge-aware re-ranking, enabling zero-shot focus on subtle anomaly cues. As the first study to explore LMMs for this task, we conduct a rigorous evaluation on the PAB dataset, the only publicly available benchmark for text-based person anomaly search, with its curated real-world anomalies covering diverse scenarios (e.g., falling, collision, and being hit). Experiments show the effectiveness of the proposed method, surpassing the competitive baseline by +0.96% Recall@1 accuracy.
Notably, our method reveals interpretable alignment between textual anomalies and visual behaviors, validated via qualitative analysis. Our code and models will be released for future research.

[111] Aesthetic Image Captioning with Saliency Enhanced MLLMs

Yilin Tao,Jiashui Huang,Huaze Xu,Ling Shao

Main category: cs.CV

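TL;DR: This paper proposes ASE-MLLM, a framework that integrates image aesthetic saliency into Multimodal Large Language Models, effectively improving performance on aesthetic image captioning.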

Details

Motivation: Current research on image aesthetics focuses mainly on predicting aesthetic ratings and has seen limited application to AIC, while existing MLLM-based AIC work relies mainly on fine-tuning without specifically adapting MLLMs to focus on target aesthetic content. Method: The authors propose the Aesthetic Saliency Enhanced Multimodal Large Language Model (ASE-MLLM), introducing the Image Aesthetic Saliency Module (IASM) and IAS-ViT as the MLLM's image encoder, which fuses aesthetic saliency features with original image features via cross-attention. Result: ASE-MLLM performs strongly on mainstream AIC benchmarks, significantly outperforming traditional methods and generic MLLMs and achieving state-of-the-art performance. Conclusion: ASE-MLLM is the first framework to integrate image aesthetic saliency into MLLMs specifically for AIC, and it significantly outperforms traditional methods and generic MLLMs on mainstream AIC benchmarks, achieving state-of-the-art performance. Abstract: Aesthetic Image Captioning (AIC) aims to generate textual descriptions of image aesthetics, becoming a key research direction in the field of computational aesthetics. In recent years, pretrained Multimodal Large Language Models (MLLMs) have advanced rapidly, leading to a significant increase in image aesthetics research that integrates both visual and textual modalities. However, most existing studies on image aesthetics primarily focus on predicting aesthetic ratings and have shown limited application in AIC. Existing AIC works leveraging MLLMs predominantly rely on fine-tuning methods without specifically adapting MLLMs to focus on target aesthetic content. To address this limitation, we propose the Aesthetic Saliency Enhanced Multimodal Large Language Model (ASE-MLLM), an end-to-end framework that explicitly incorporates aesthetic saliency into MLLMs. Within this framework, we introduce the Image Aesthetic Saliency Module (IASM), which efficiently and effectively extracts aesthetic saliency features from images. Additionally, we design IAS-ViT as the image encoder for MLLMs; this module fuses aesthetic saliency features with original image features via a cross-attention mechanism. To the best of our knowledge, ASE-MLLM is the first framework to integrate image aesthetic saliency into MLLMs specifically for AIC tasks. Extensive experiments demonstrated that our approach significantly outperformed traditional methods and generic MLLMs on current mainstream AIC benchmarks, achieving state-of-the-art (SOTA) performance.

[112] SSGaussian: Semantic-Aware and Structure-Preserving 3D Style Transfer

Jimin Xu,Bosheng Qin,Tao Jin,Zhou Zhao,Zhenhui Ye,Jun Yu,Fei Wu

Main category: cs.CV

TL;DR: This paper proposes a novel 3D style transfer pipeline that effectively integrates pretrained 2D diffusion models to generate more structured, coherent, and visually enriched stylizations in 3D scenes.

Details

Motivation: Current 3D style transfer methods struggle to extract and transfer high-level style semantics from reference images and often produce results lacking structural clarity and separation. This work aims to overcome these limitations by integrating prior knowledge from pretrained 2D diffusion models. Method: The method involves a two-stage pipeline: (1) leveraging diffusion priors to generate stylized renderings of key viewpoints, and (2) transferring these stylized views onto a 3D representation. Innovations include cross-view style alignment using cross-view attention in the UNet and instance-level style transfer to maintain consistency across views. Result: Extensive experiments show that the proposed pipeline outperforms state-of-the-art methods across a wide range of 3D scenes, including both forward-facing and 360-degree environments, producing more structured and visually coherent stylizations. Conclusion: The proposed 3D style transfer pipeline significantly outperforms existing methods in generating structured, visually coherent, and artistically enriched stylizations across various 3D scenes. Abstract: Recent advancements in neural representations, such as Neural Radiance Fields and 3D Gaussian Splatting, have increased interest in applying style transfer to 3D scenes. While existing methods can transfer style patterns onto 3D-consistent neural representations, they struggle to effectively extract and transfer high-level style semantics from the reference style image. Additionally, the stylized results often lack structural clarity and separation, making it difficult to distinguish between different instances or objects within the 3D scene. To address these limitations, we propose a novel 3D style transfer pipeline that effectively integrates prior knowledge from pretrained 2D diffusion models. Our pipeline consists of two key stages: First, we leverage diffusion priors to generate stylized renderings of key viewpoints. 
Then, we transfer the stylized key views onto the 3D representation. This process incorporates two innovative designs. The first is cross-view style alignment, which inserts cross-view attention into the last upsampling block of the UNet, allowing feature interactions across multiple key views. This ensures that the diffusion model generates stylized key views that maintain both style fidelity and instance-level consistency. The second is instance-level style transfer, which effectively leverages instance-level consistency across stylized key views and transfers it onto the 3D representation. This results in a more structured, visually coherent, and artistically enriched stylization. Extensive qualitative and quantitative experiments demonstrate that our 3D style transfer pipeline significantly outperforms state-of-the-art methods across a wide range of scenes, from forward-facing to challenging 360-degree environments. Visit our project page https://jm-xu.github.io/SSGaussian for immersive visualization.

[113] Learning neural representations for X-ray ptychography reconstruction with unknown probes

Tingyou Li,Zixin Xu,Zirui Gao,Hanfei Yan,Xiaojing Huang,Jizhou Li

Main category: cs.CV

TL;DR: The paper introduces PtyINR, a self-supervised framework for X-ray ptychography that improves image reconstruction when the illuminating probe is unknown, especially in low-signal conditions.

Details

Motivation: X-ray ptychography is constrained by the challenge of accurately reconstructing images when the illuminating probe is unknown, particularly under low-signal conditions inherent to low-dose and high-speed experiments. Conventional iterative methods and deep learning approaches are often suboptimal, compromising reconstruction fidelity and restricting broader adoption of the technique. Method: PtyINR is a self-supervised framework that parameterizes both the object and probe as continuous neural representations, performing end-to-end reconstruction directly from raw diffraction patterns without requiring any pre-characterization of the probe. Result: PtyINR achieves superior reconstruction quality on both simulated and experimental data, with remarkable robustness under challenging low-signal conditions. Conclusion: PtyINR offers a generalizable, physics-informed framework for addressing probe-dependent inverse problems in X-ray ptychography, making it applicable to a wide range of computational microscopy problems. Abstract: X-ray ptychography provides exceptional nanoscale resolution and is widely applied in materials science, biology, and nanotechnology. However, its full potential is constrained by the critical challenge of accurately reconstructing images when the illuminating probe is unknown. Conventional iterative methods and deep learning approaches are often suboptimal, particularly under the low-signal conditions inherent to low-dose and high-speed experiments. These limitations compromise reconstruction fidelity and restrict the broader adoption of the technique. In this work, we introduce the Ptychographic Implicit Neural Representation (PtyINR), a self-supervised framework that simultaneously addresses the object and probe recovery problem. By parameterizing both as continuous neural representations, PtyINR performs end-to-end reconstruction directly from raw diffraction patterns without requiring any pre-characterization of the probe. 
Extensive evaluations demonstrate that PtyINR achieves superior reconstruction quality on both simulated and experimental data, with remarkable robustness under challenging low-signal conditions. Furthermore, PtyINR offers a generalizable, physics-informed framework for addressing probe-dependent inverse problems, making it applicable to a wide range of computational microscopy problems.

[114] Few-step Flow for 3D Generation via Marginal-Data Transport Distillation

Zanwei Zhou,Taoran Yi,Jiemin Fang,Chen Yang,Lingxi Xie,Xinggang Wang,Wei Shen,Qi Tian

Main category: cs.CV

TL;DR: This paper introduces MDT-dist, a new framework for accelerating 3D flow-based generation models by reducing sampling steps to 1 or 2, achieving significant speedups without compromising visual and geometric quality.

Details

Motivation: Flow-based 3D generation models usually require many sampling steps, which limits their efficiency. Although methods like Consistency Models have accelerated 2D diffusion models, similar advancements for 3D generation are lacking. This study aims to address this gap by developing a more efficient framework for 3D flow distillation. Method: The authors propose a novel framework called MDT-dist for few-step 3D flow distillation. They introduce two optimizable objectives, Velocity Matching (VM) and Velocity Distillation (VD), to convert the intractable integration of velocity fields into manageable optimization problems at the velocity and distribution levels. Result: When applied to the 3D generation framework TRELLIS, the method reduces sampling steps from 25 to just 1 or 2, achieving a latency of 0.68s (1 step x 2) and 0.94s (2 steps x 2) on A800 hardware, with speedups of 9.0x and 6.5x respectively, while preserving high-quality output. Conclusion: The study concludes that the proposed MDT-dist framework significantly accelerates 3D flow distillation with minimal sampling steps while maintaining high visual and geometric fidelity, outperforming existing methods like Consistency Models. Abstract: Flow-based 3D generation models typically require dozens of sampling steps during inference. Though few-step distillation methods, particularly Consistency Models (CMs), have achieved substantial advancements in accelerating 2D diffusion models, they remain under-explored for more complex 3D generation tasks. In this study, we propose a novel framework, MDT-dist, for few-step 3D flow distillation. Our approach is built upon a primary objective: distilling the pretrained model to learn the Marginal-Data Transport. Directly learning this objective needs to integrate the velocity fields, while this integral is intractable to be implemented. 
Therefore, we propose two optimizable objectives, Velocity Matching (VM) and Velocity Distillation (VD), to equivalently convert the optimization target from the transport level to the velocity and the distribution level respectively. Velocity Matching (VM) learns to stably match the velocity fields between the student and the teacher, but inevitably provides biased gradient estimates. Velocity Distillation (VD) further enhances the optimization process by leveraging the learned velocity fields to perform probability density distillation. When evaluated on the pioneer 3D generation framework TRELLIS, our method reduces sampling steps of each flow transformer from 25 to 1 or 2, achieving 0.68s (1 step x 2) and 0.94s (2 steps x 2) latency with 9.0x and 6.5x speedup on A800, while preserving high visual and geometric fidelity. Extensive experiments demonstrate that our method significantly outperforms existing CM distillation methods, and enables TRELLIS to achieve superior performance in few-step 3D generation.
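The two reported operating points can be cross-checked with simple arithmetic. The sketch below assumes that "speedup" means the ratio of the original 25-step latency to the distilled latency; under that assumption both figures should imply roughly the same baseline. The variable and function names are illustrative only.

```python
# Figures quoted in the abstract: (latency in seconds, reported speedup).
reported = {"1 step x 2": (0.68, 9.0), "2 steps x 2": (0.94, 6.5)}

def implied_baseline(latency_s, speedup):
    """Baseline latency implied by a reported latency and speedup factor.

    Assumption: speedup = baseline latency / distilled latency.
    """
    return latency_s * speedup

baselines = {name: implied_baseline(*vals) for name, vals in reported.items()}
for name, value in baselines.items():
    print(f"{name}: implied baseline latency ~ {value:.2f} s")
```

Both settings imply a baseline of about 6.1 s (0.68 × 9.0 = 6.12 and 0.94 × 6.5 = 6.11), so the reported latencies and speedups are mutually consistent.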

[115] Durian: Dual Reference-guided Portrait Animation with Attribute Transfer

Hyunsoo Cha,Byungjun Kim,Hanbyul Joo

Main category: cs.CV

TL;DR: Durian is a zero-shot method for generating portrait animation videos with facial attribute transfer, using dual reference networks and self-reconstruction to achieve high-fidelity results and multi-attribute composition without additional training.

Details

Motivation: The motivation is to develop a method that can transfer facial attributes from a reference image to a target portrait video in a zero-shot manner, ensuring high-fidelity and spatially consistent results without explicit triplet supervision. Method: The method introduces dual reference networks to inject spatial features from both the portrait and attribute images into a diffusion model's denoising process. The model is trained using self-reconstruction with sampled frames from portrait videos, mask expansion strategy using keypoint-conditioned image generation, and augmented with spatial and appearance-level transformations. Result: Durian achieves state-of-the-art performance on portrait animation with attribute transfer and enables multi-attribute composition in a single generation pass without additional training. Conclusion: Durian is a zero-shot method for generating portrait animation videos with facial attribute transfer, achieving state-of-the-art performance and enabling multi-attribute composition in a single generation pass without additional training. Abstract: We present Durian, the first method for generating portrait animation videos with facial attribute transfer from a given reference image to a target portrait in a zero-shot manner. To enable high-fidelity and spatially consistent attribute transfer across frames, we introduce dual reference networks that inject spatial features from both the portrait and attribute images into the denoising process of a diffusion model. We train the model using a self-reconstruction formulation, where two frames are sampled from the same portrait video: one is treated as the attribute reference and the other as the target portrait, and the remaining frames are reconstructed conditioned on these inputs and their corresponding masks. To support the transfer of attributes with varying spatial extent, we propose a mask expansion strategy using keypoint-conditioned image generation for training. 
In addition, we further augment the attribute and portrait images with spatial and appearance-level transformations to improve robustness to positional misalignment between them. These strategies allow the model to effectively generalize across diverse attributes and in-the-wild reference combinations, despite being trained without explicit triplet supervision. Durian achieves state-of-the-art performance on portrait animation with attribute transfer, and notably, its dual reference design enables multi-attribute composition in a single generation pass without additional training.

[116] From Lines to Shapes: Geometric-Constrained Segmentation of X-Ray Collimators via Hough Transform

Benjamin El-Zein,Dominik Eckert,Andreas Fieselmann,Christopher Syben,Ludwig Ritschl,Steffen Kappler,Sebastian Stober

Main category: cs.CV

TL;DR: This paper introduces a deep learning-based geometrically constrained segmentation approach for detecting collimation borders in X-ray imaging, achieving accurate ROI reconstruction with minimal error on real image datasets.

Details

Motivation: The motivation is to improve the detection of collimator shadows in X-ray imaging, which is challenging due to obscured edges from scattered radiation, and accurate detection is crucial for minimizing radiation dose and identifying the region-of-interest (ROI). Method: The paper proposes a deep learning-based segmentation method constrained by geometry, using a differentiable Hough transform-based network to detect collimation borders and extract ROI center information, which is then combined during inference to generate refined segmentation masks. Result: The method achieves robust reconstruction of collimated regions with median Hausdorff distances of 4.3-5.0mm on diverse test sets of real X-ray images, demonstrating its effectiveness and adaptability to varying edge counts. Conclusion: The proposed method effectively reconstructs collimated regions in X-ray images with high accuracy, and its performance is demonstrated by achieving median Hausdorff distances of 4.3-5.0mm on real X-ray image datasets. Abstract: Collimation in X-ray imaging restricts exposure to the region-of-interest (ROI) and minimizes the radiation dose applied to the patient. The detection of collimator shadows is an essential image-based preprocessing step in digital radiography posing a challenge when edges get obscured by scattered X-ray radiation. Regardless, the prior knowledge that collimation forms polygonal-shaped shadows is evident. For this reason, we introduce a deep learning-based segmentation that is inherently constrained to its geometry. We achieve this by incorporating a differentiable Hough transform-based network to detect the collimation borders and enhance its capability to extract the information about the ROI center. During inference, we combine the information of both tasks to enable the generation of refined, line-constrained segmentation masks. 
We demonstrate robust reconstruction of collimated regions, achieving median Hausdorff distances of 4.3-5.0 mm on diverse test sets of real X-ray images. While this application involves at most four shadow borders, our method is not fundamentally limited by a specific number of edges.
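For readers unfamiliar with the underlying idea, the classical Hough transform detects straight lines by letting each edge point vote for every line that could pass through it. The sketch below shows that classical, non-differentiable voting scheme in NumPy. It is not the paper's differentiable Hough-based network, and the function name, angular resolution, and `rho_max` are illustrative assumptions.

```python
import numpy as np

def hough_peak(points, n_theta=180, rho_max=50):
    """Return (rho, theta_degrees) of the strongest line through 2D points.

    Lines use the normal form rho = x*cos(theta) + y*sin(theta).
    Classical voting only; not the differentiable variant used in the paper.
    """
    thetas = np.deg2rad(np.arange(n_theta))  # one bin per degree
    theta_bins = np.arange(n_theta)
    # Rows index rho in [-rho_max, rho_max], columns index theta.
    accumulator = np.zeros((2 * rho_max + 1, n_theta), dtype=int)
    for x, y in points:
        rhos = np.rint(x * np.cos(thetas) + y * np.sin(thetas)).astype(int)
        valid = np.abs(rhos) <= rho_max
        # Each point casts one vote per theta bin.
        accumulator[rhos[valid] + rho_max, theta_bins[valid]] += 1
    rho_idx, theta_idx = np.unravel_index(accumulator.argmax(), accumulator.shape)
    return int(rho_idx - rho_max), int(theta_idx)

# Hypothetical edge pixels of one horizontal collimator border at y = 7.
border = [(x, 7) for x in range(40)]
rho, theta_deg = hough_peak(border)
```

A collimator shadow with up to four borders would appear as up to four peaks in the accumulator; intersecting the corresponding lines yields a polygonal region, which is the geometric prior the abstract refers to.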

[117] One Flight Over the Gap: A Survey from Perspective to Panoramic Vision

Xin Lin,Xian Ge,Dizhe Zhang,Zhaoliang Wan,Xianshun Wang,Xiangtai Li,Wenjie Jiang,Bo Du,Dacheng Tao,Ming-Hsuan Yang,Lu Qi

Main category: cs.CV

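TL;DR: This survey reviews the development of panoramic vision techniques, focusing on the challenges of adapting from perspective to panoramic images, proposing a systematic research framework, and outlining future research directions.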

Details

Motivation: 由于对空间智能和整体场景感知的需求增加，提供完整360度视野的全向图像（ODIs）在虚拟现实、自动驾驶和具身机器人等领域受到越来越多的关注。然而，ODIs与透视图像在几何投影、空间分布和边界连续性方面存在显著差异，使得直接从透视方法进行领域适应变得困难。 Method: 文章首先回顾了全景成像流程和投影方法，总结了领域适应的三大挑战，并基于这些挑战对来自300多篇研究论文的20多个任务进行了分析，从方法和任务两个维度进行了讨论。 Result: 文章提出了全景视觉技术的系统分析框架，涵盖了视觉质量增强与评估、视觉理解、多模态理解和视觉生成四大类别，并讨论了数据、模型和应用方面的未来研究方向。 Conclusion: 本文总结了全景视觉技术的最新进展，重点分析了从透视到全景的适应性问题，并讨论了未来的研究方向和挑战，以推动全景视觉技术的发展。 Abstract: Driven by the demand for spatial intelligence and holistic scene perception, omnidirectional images (ODIs), which provide a complete 360\textdegree{} field of view, are receiving growing attention across diverse applications such as virtual reality, autonomous driving, and embodied robotics. Despite their unique characteristics, ODIs exhibit remarkable differences from perspective images in geometric projection, spatial distribution, and boundary continuity, making it challenging for direct domain adaption from perspective methods. This survey reviews recent panoramic vision techniques with a particular emphasis on the perspective-to-panorama adaptation. We first revisit the panoramic imaging pipeline and projection methods to build the prior knowledge required for analyzing the structural disparities. Then, we summarize three challenges of domain adaptation: severe geometric distortions near the poles, non-uniform sampling in Equirectangular Projection (ERP), and periodic boundary continuity. Building on this, we cover 20+ representative tasks drawn from more than 300 research papers in two dimensions. On one hand, we present a cross-method analysis of representative strategies for addressing panoramic specific challenges across different tasks. On the other hand, we conduct a cross-task comparison and classify panoramic vision into four major categories: visual quality enhancement and assessment, visual understanding, multimodal understanding, and visual generation. 
In addition, we discuss open challenges and future directions in data, models, and applications that will drive the advancement of panoramic vision research. We hope that our work can provide new insight and forward-looking perspectives to advance the development of panoramic vision technologies. Our project page is https://insta360-research-team.github.io/Survey-of-Panorama
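
Two of the adaptation challenges named in the abstract, non-uniform sampling in ERP and periodic boundary continuity, can be illustrated with a small sketch. This is not taken from the survey; the function names `erp_row_weights` and `wrap_pad_row` are hypothetical, and the code assumes a standard equirectangular layout in which rows map linearly to latitude and columns to longitude.

```python
import math


def erp_row_weights(height):
    """Relative sphere area covered by each ERP row (assumed layout: rows = latitude).

    Rows near the poles are oversampled, so their pixels cover less area;
    the weight of a row is proportional to cos(latitude) of its center.
    """
    lats = [((i + 0.5) / height - 0.5) * math.pi for i in range(height)]
    raw = [math.cos(lat) for lat in lats]
    total = sum(raw)
    return [r / total for r in raw]  # normalized to sum to 1


def wrap_pad_row(row, pad):
    """Pad one image row circularly along longitude so the left/right seam is continuous."""
    n = len(row)
    return [row[(i - pad) % n] for i in range(n + 2 * pad)]


weights = erp_row_weights(8)
padded = wrap_pad_row([0, 1, 2, 3], 1)
```

Such weights are one common way to make a per-pixel loss or metric area-aware, and wrap padding is one common way to let a convolution see across the 0°/360° seam; the survey may discuss other strategies.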

[118] Plot'n Polish: Zero-shot Story Visualization and Disentangled Editing with Text-to-Image Diffusion Models

Kiymet Akdemir,Jing Shi,Kushal Kafle,Brian Price,Pinar Yanardag

Main category: cs.CV

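TL;DR: This paper proposes Plot'n Polish, a zero-shot framework that addresses the lack of editing flexibility and consistency of text-to-image diffusion models in story visualization.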

Details

Motivation: Existing text-to-image diffusion models show strong generative capabilities across many domains, but in story visualization they lack the flexibility to apply fine or coarse edits and struggle to maintain visual and narrative consistency across multiple frames. Method: The paper introduces Plot'n Polish, a zero-shot framework that strengthens control over story visualization and ensures consistency across frames. Result: The resulting framework supports multi-level modification of generated images and refinement of story content without sacrificing visual quality. Conclusion: Plot'n Polish provides a zero-shot framework for consistent story generation with fine-grained control over story visualizations at various levels of detail. Abstract: Text-to-image diffusion models have demonstrated significant capabilities to generate diverse and detailed visuals in various domains, and story visualization is emerging as a particularly promising application. However, as their use in real-world creative domains increases, the need for providing enhanced control, refinement, and the ability to modify images post-generation in a consistent manner becomes an important challenge. Existing methods often lack the flexibility to apply fine or coarse edits while maintaining visual and narrative consistency across multiple frames, preventing creators from seamlessly crafting and refining their visual stories. To address these challenges, we introduce Plot'n Polish, a zero-shot framework that enables consistent story generation and provides fine-grained control over story visualizations at various levels of detail.

[119] TRUST-VL: An Explainable News Assistant for General Multimodal Misinformation Detection

Zehong Yan,Peng Qi,Wynne Hsu,Mong Li Lee

Main category: cs.CV

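TL;DR: This paper introduces TRUST-VL, a new unified vision-language model, and TRUST-Instruct, a large-scale instruction dataset, for detecting multimodal misinformation.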

Details

Motivation: Multimodal misinformation poses a growing societal threat, while existing methods typically focus on a single type of distortion and struggle to generalize to unseen scenarios. Method: The paper proposes TRUST-VL, a unified and explainable vision-language model for general multimodal misinformation detection, together with TRUST-Instruct, a large-scale instruction dataset. Result: TRUST-VL achieves state-of-the-art performance on both in-domain and zero-shot benchmarks. Conclusion: TRUST-VL achieves state-of-the-art performance while also offering strong generalization and interpretability. Abstract: Multimodal misinformation, encompassing textual, visual, and cross-modal distortions, poses an increasing societal threat that is amplified by generative AI. Existing methods typically focus on a single type of distortion and struggle to generalize to unseen scenarios. In this work, we observe that different distortion types share common reasoning capabilities while also requiring task-specific skills. We hypothesize that joint training across distortion types facilitates knowledge sharing and enhances the model's ability to generalize. To this end, we introduce TRUST-VL, a unified and explainable vision-language model for general multimodal misinformation detection. TRUST-VL incorporates a novel Question-Aware Visual Amplifier module, designed to extract task-specific visual features. To support training, we also construct TRUST-Instruct, a large-scale instruction dataset containing 198K samples featuring structured reasoning chains aligned with human fact-checking workflows. Extensive experiments on both in-domain and zero-shot benchmarks demonstrate that TRUST-VL achieves state-of-the-art performance, while also offering strong generalization and interpretability.

[120] Virtual Fitting Room: Generating Arbitrarily Long Videos of Virtual Try-On from a Single Image -- Technical Preview

Jun-Kun Chen,Aayush Bansal,Minh Phuoc Vo,Yu-Xiong Wang

Main category: cs.CV

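TL;DR: VFR is a novel video generative model that produces virtual try-on videos of arbitrary length, using a segment-by-segment generation strategy to address the resource cost of long video generation.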

Details

Motivation: Conventional approaches to long video generation face heavy resource consumption and a need for lengthy video data, and must also ensure local smoothness and global temporal consistency in the generated video. Method: Long video generation is modeled as an auto-regressive, segment-by-segment process, with a prefix video condition ensuring local smoothness and an anchor video ensuring global consistency. Result: VFR generates minute-scale virtual try-on videos under various motions, with both local smoothness and global temporal consistency. Conclusion: VFR is a pioneering work in long virtual try-on video generation, addressing the resource consumption and consistency problems of generating long videos. Abstract: We introduce the Virtual Fitting Room (VFR), a novel video generative model that produces arbitrarily long virtual try-on videos. Our VFR models long video generation tasks as an auto-regressive, segment-by-segment generation process, eliminating the need for resource-intensive generation and lengthy video data, while providing the flexibility to generate videos of arbitrary length. The key challenges of this task are twofold: ensuring local smoothness between adjacent segments and maintaining global temporal consistency across different segments. To address these challenges, we propose our VFR framework, which ensures smoothness through a prefix video condition and enforces consistency with the anchor video -- a 360-degree video that comprehensively captures the human's whole-body appearance. Our VFR generates minute-scale virtual try-on videos with both local smoothness and global temporal consistency under various motions, making it a pioneering work in long virtual try-on video generation.
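
The control flow of auto-regressive, segment-by-segment generation described above can be sketched in a few lines. This is only an illustration of the loop structure, not the authors' model: `generate_segment` is a hypothetical stand-in that uses integers as "frames", and the parameter names `segment_len` and `prefix_len` are assumptions.

```python
def generate_segment(prefix, anchor, length):
    """Hypothetical stand-in for the generative model.

    A real model would condition on the prefix frames (local smoothness)
    and the anchor video (global consistency). Here frames are integers
    and the toy model simply continues from the last prefix frame,
    or starts from the anchor value when there is no prefix yet.
    """
    start = prefix[-1] + 1 if prefix else anchor
    return [start + i for i in range(length)]


def generate_long_video(anchor, total_frames, segment_len=4, prefix_len=2):
    """Build a video of arbitrary length one segment at a time."""
    video = []
    while len(video) < total_frames:
        prefix = video[-prefix_len:]  # last frames so far; empty for the first segment
        video.extend(generate_segment(prefix, anchor, segment_len))
    return video[:total_frames]  # trim the final segment to the requested length


frames = generate_long_video(anchor=0, total_frames=10)
```

Because each call sees only a short prefix and the fixed anchor, memory use per step stays bounded regardless of the total length, which is the property the abstract attributes to the segment-wise formulation.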

Table of Contents

cs.CL [Back]

[1] Speech-Based Cognitive Screening: A Systematic Evaluation of LLM Adaptation Strategies

[2] Enhancing Speech Large Language Models through Reinforced Behavior Alignment

[3] Multilevel Analysis of Cryptocurrency News using RAG Approach with Fine-Tuned Mistral Large Language Model

[4] The ProLiFIC dataset: Leveraging LLMs to Unveil the Italian Lawmaking Process

[5] Multimodal Proposal for an AI-Based Tool to Increase Cross-Assessment of Messages

[6] Reading Between the Signs: Predicting Future Suicidal Ideation from Adolescent Social Media Texts

[7] Real-Time Detection of Hallucinated Entities in Long-Form Generation

[8] Topic Identification in LLM Input-Output Pairs through the Lens of Information Bottleneck

[9] QuesGenie: Intelligent Multimodal Question Generation

[10] AR$^2$: Adversarial Reinforcement Learning for Abstract Reasoning in Large Language Models

[11] Improving Factuality in LLMs via Inference-Time Knowledge Graph Construction

[12] ResearchPulse: Building Method-Experiment Chains through Multi-Document Scientific Inference

[13] NoteBar: An AI-Assisted Note-Taking System for Personal Knowledge Management

[14] E-ARMOR: Edge case Assessment and Review of Multilingual Optical Character Recognition

[15] Breaking the Mirror: Activation-Based Mitigation of Self-Preference in LLM Evaluators

[16] Semantic Analysis of SNOMED CT Concept Co-occurrences in Clinical Documentation using MIMIC-IV

[17] MLSD: A Novel Few-Shot Learning Approach to Enhance Cross-Target and Cross-Domain Stance Detection

[18] SiLVERScore: Semantically-Aware Embeddings for Sign Language Generation Evaluation

[19] Measuring How (Not Just Whether) VLMs Build Common Ground

[20] Align-then-Slide: A complete evaluation framework for Ultra-Long Document-Level Machine Translation

[21] NE-PADD: Leveraging Named Entity Knowledge for Robust Partial Audio Deepfake Detection via Attention Aggregation

[22] Drivel-ology: Challenging LLMs with Interpreting Nonsense with Depth

[23] A Comprehensive Survey on Trustworthiness in Reasoning with Large Language Models

[24] False Sense of Security: Why Probing-based Malicious Input Detection Fails to Generalize

[25] MobileRAG: Enhancing Mobile Agent with Retrieval-Augmented Generation

[26] MTQA:Matrix of Thought for Enhanced Reasoning in Complex Question Answering

[27] Decoding the Poetic Language of Emotion in Korean Modern Poetry: Insights from a Human-Labeled Dataset and AI Modeling

[28] SelfAug: Mitigating Catastrophic Forgetting in Retrieval-Augmented Generation via Distribution Self-Alignment

[29] SPFT-SQL: Enhancing Large Language Model for Text-to-SQL Parsing by Self-Play Fine-Tuning

[30] VoxRole: A Comprehensive Benchmark for Evaluating Speech-Based Role-Playing Agents

[31] CANDY: Benchmarking LLMs' Limitations and Assistive Potential in Chinese Misinformation Fact-Checking

[32] Exploring NLP Benchmarks in an Extremely Low-Resource Setting

[33] Expanding Foundational Language Capabilities in Open-Source LLMs through a Korean Case Study

[34] RTQA : Recursive Thinking for Complex Temporal Knowledge Graph Question Answering with Large Language Models

[35] On Robustness and Reliability of Benchmark-Based Evaluation of LLMs

[36] What if I ask in \textit{alia lingua}? Measuring Functional Similarity Across Languages

[37] A RoBERTa-Based Functional Syntax Annotation Model for Chinese Texts

[38] Synthesizing Sheet Music Problems for Evaluation and Reinforcement Learning

[39] Arabic Chatbot Technologies in Education: An Overview

[40] Improving Narrative Classification and Explanation via Fine Tuned Language Models

[41] Towards Stable and Personalised Profiles for Lexical Alignment in Spoken Human-Agent Dialogue

[42] MultiWikiQA: A Reading Comprehension Benchmark in 300+ Languages

[43] Joint Modeling of Entities and Discourse Relations for Coherence Assessment

[44] MAGneT: Coordinated Multi-Agent Generation of Synthetic Multi-Turn Mental Health Counseling Sessions

[45] Explicit and Implicit Data Augmentation for Social Event Detection

[46] Inverse IFEval: Can LLMs Unlearn Stubborn Training Conventions to Follow Real Instructions?

[47] Facts Fade Fast: Evaluating Memorization of Outdated Medical Knowledge in Large Language Models

[48] PARCO: Phoneme-Augmented Robust Contextual ASR via Contrastive Entity Disambiguation

[49] Measuring Bias or Measuring the Task: Understanding the Brittle Nature of LLM Gender Biases

[50] Can Language Models Handle a Non-Gregorian Calendar?

cs.CV [Back]

[51] Towards Efficient General Feature Prediction in Masked Skeleton Modeling

[52] Teacher-Student Model for Detecting and Classifying Mitosis in the MIDOG 2025 Challenge

[53] Multi Attribute Bias Mitigation via Representation Learning

[54] Lightweight image segmentation for echocardiography

[55] treeX: Unsupervised Tree Instance Segmentation in Dense Forest Point Clouds

[56] Reg3D: Reconstructive Geometry Instruction Tuning for 3D Scene Understanding

[57] QuantV2X: A Fully Quantized Multi-Agent System for Cooperative Perception

[58] Transfer Learning-Based CNN Models for Plant Species Identification Using Leaf Venation Patterns

[59] LayoutGKN: Graph Similarity Learning of Floor Plans

[60] Singular Value Few-shot Adaptation of Vision-Language Models

[61] STA-Net: A Decoupled Shape and Texture Attention Network for Lightweight Plant Disease Classification

[62] SLENet: A Guidance-Enhanced Network for Underwater Camouflaged Object Detection

[63] Fitting Image Diffusion Models on Video Datasets

[64] MedVista3D: Vision-Language Modeling for Reducing Diagnostic Errors in 3D CT Disease Detection, Understanding and Reporting

[65] Causality-guided Prompt Learning for Vision-language Models via Visual Granulation

[66] EGTM: Event-guided Efficient Turbulence Mitigation

[67] Focus Through Motion: RGB-Event Collaborative Token Sparsification for Efficient Object Detection

[68] SalientFusion: Context-Aware Compositional Zero-Shot Food Recognition

[69] Human Motion Video Generation: A Survey

[70] OccTENS: 3D Occupancy World Model via Temporal Next-Scale Prediction

[71] Weakly-Supervised Learning of Dense Functional Correspondences

[72] Attn-Adapter: Attention Is All You Need for Online Few-shot Learner of Vision-Language Model

[73] SPECS: Specificity-Enhanced CLIP-Score for Long Image Caption Evaluation

[74] A Generative Foundation Model for Chest Radiography

[75] LMVC: An End-to-End Learned Multiview Video Coding Framework

[76] TopoSculpt: Betti-Steered Topological Sculpting of 3D Fine-grained Tubular Shapes

[77] Chest X-ray Pneumothorax Segmentation Using EfficientNet-B4 Transfer Learning in a U-Net Architecture