cs.CL

[1] Bridging AI Innovation and Healthcare Needs: Lessons Learned from Incorporating Modern NLP at The BC Cancer Registry

Lovedeep Gondara,Gregory Arbour,Raymond Ng,Jonathan Simkin,Shebnum Devji

Main category: cs.CL

TL;DR: This paper shares practical lessons from implementing NLP in healthcare, emphasizing the importance of business alignment, collaboration, data quality, and organizational readiness to ensure successful deployment of AI solutions.

Details Motivation: The motivation is to improve healthcare efficiency through automated data extraction from clinical documents using NLP, while addressing the practical challenges that hinder successful deployment in real-world settings. Method: The authors draw on their experience implementing NLP models at the British Columbia Cancer Registry, analyzing lessons learned across the project lifecycle, including model selection, data quality management, error mitigation, and stakeholder engagement. Result: Key insights include the importance of aligning NLP projects with business goals, adopting iterative and collaborative development approaches, ensuring data quality, implementing robust error mitigation strategies, and building organizational AI literacy. Conclusion: The paper concludes that deploying NLP solutions in healthcare requires a balance of technical and practical considerations, emphasizing business objectives, interdisciplinary collaboration, data quality, and organizational AI literacy for successful implementation. Abstract: Automating data extraction from clinical documents offers significant potential to improve efficiency in healthcare settings, yet deploying Natural Language Processing (NLP) solutions presents practical challenges. Drawing upon our experience implementing various NLP models for information extraction and classification tasks at the British Columbia Cancer Registry (BCCR), this paper shares key lessons learned throughout the project lifecycle. We emphasize the critical importance of defining problems based on clear business objectives rather than solely technical accuracy, adopting an iterative approach to development, and fostering deep interdisciplinary collaboration and co-design involving domain experts, end-users, and ML specialists from inception. 
Further insights highlight the need for pragmatic model selection (including hybrid approaches and simpler methods where appropriate), rigorous attention to data quality (representativeness, drift, annotation), robust error mitigation strategies involving human-in-the-loop validation and ongoing audits, and building organizational AI literacy. These practical considerations, generalizable beyond cancer registries, provide guidance for healthcare organizations seeking to successfully implement AI/NLP solutions to enhance data management processes and ultimately improve patient care and public health outcomes.

[2] A Transparent Fairness Evaluation Protocol for Open-Source Language Model Benchmarking on the Blockchain

Hugo Massaroli,Leonardo Iara,Emmanuel Iarussi,Viviana Siless

Main category: cs.CL

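TL;DR: This paper proposes a transparent, blockchain-based evaluation protocol for assessing the fairness of open-source large language models, and reveals cross-linguistic disparities.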

Details Motivation: The use of large language models in high-stakes domains has raised concerns about their fairness, so a transparent and verifiable method for evaluating these models is needed. Method: On-chain HTTP requests are executed on the ICP blockchain to call Hugging Face endpoints, and the PISA dataset and the Kaleidoscope benchmark are used to evaluate academic performance prediction and social bias. Result: The paper benchmarks fairness on the Llama, DeepSeek, and Mistral models and reveals cross-linguistic fairness disparities. Conclusion: By using smart contracts on the ICP blockchain for verifiable, immutable, and reproducible evaluation, the paper proposes a transparent protocol for evaluating the fairness of open-source large language models and enables cross-linguistic fairness analysis. Abstract: Large language models (LLMs) are increasingly deployed in real-world applications, yet concerns about their fairness persist, especially in high-stakes domains like criminal justice, education, healthcare, and finance. This paper introduces a transparent evaluation protocol for benchmarking the fairness of open-source LLMs using smart contracts on the Internet Computer Protocol (ICP) blockchain (Foundation, 2023). Our method ensures verifiable, immutable, and reproducible evaluations by executing on-chain HTTP requests to hosted Hugging Face endpoints and storing datasets, prompts, and metrics directly on-chain. We benchmark the Llama, DeepSeek, and Mistral models on the PISA dataset for academic performance prediction (OECD, 2018), a dataset suitable for fairness evaluation using statistical parity and equal opportunity metrics (Hardt et al., 2016). We also evaluate structured Context Association Metrics derived from the StereoSet dataset (Nadeem et al., 2020) to measure social bias in contextual associations. We further extend our analysis with a multilingual evaluation across English, Spanish, and Portuguese using the Kaleidoscope benchmark (Salazar et al., 2025), revealing cross-linguistic disparities. All code and results are open source, enabling community audits and longitudinal fairness tracking across model versions.
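The two fairness metrics named in the abstract can be illustrated with a minimal sketch. The toy labels, predictions, and the binary `group` attribute below are invented for illustration and are not taken from the paper or the PISA data; the function names are hypothetical as well. Statistical parity compares positive-prediction rates between groups, while equal opportunity compares true-positive rates.

```python
def rate(values):
    """Mean of a list of 0/1 values; 0.0 for an empty list."""
    return sum(values) / len(values) if values else 0.0


def statistical_parity_difference(y_pred, group):
    """P(pred = 1 | group = 1) - P(pred = 1 | group = 0)."""
    g1 = [p for p, g in zip(y_pred, group) if g == 1]
    g0 = [p for p, g in zip(y_pred, group) if g == 0]
    return rate(g1) - rate(g0)


def equal_opportunity_difference(y_true, y_pred, group):
    """True-positive-rate gap between the two groups."""
    g1 = [p for t, p, g in zip(y_true, y_pred, group) if g == 1 and t == 1]
    g0 = [p for t, p, g in zip(y_true, y_pred, group) if g == 0 and t == 1]
    return rate(g1) - rate(g0)


# Toy data (assumed): 8 examples, 4 per group.
y_true = [1, 1, 0, 0, 1, 1, 0, 0]
y_pred = [1, 0, 1, 0, 1, 1, 0, 0]
group = [0, 0, 0, 0, 1, 1, 1, 1]

spd = statistical_parity_difference(y_pred, group)
eod = equal_opportunity_difference(y_true, y_pred, group)
print(f"statistical parity difference: {spd:.2f}")
print(f"equal opportunity difference:  {eod:.2f}")
```

In this toy case both groups receive positive predictions at the same rate, yet the true-positive rates differ, which is one reason the two metrics are usually reported together.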

[3] Thematic and Task-Based Categorization of K-12 GenAI Usages with Hierarchical Topic Modeling

Johannes Schneider,Béatrice S. Hasler,Michaela Varrone,Fabian Hoya,Thomas Schroffenegger,Dana-Kristin Mah,Karl Peböck

Main category: cs.CL

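TL;DR: This paper presents a novel analysis of anonymous interaction data from K-12 education, using state-of-the-art LLMs to produce more accurate hierarchical topic categorizations and revealing potential applications and challenges of GenAI in education.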

Details Motivation: Prior work largely lacks a content or thematic categorization, and most task categorizations are not backed by real-world K-12 data, so a new approach is needed to better understand anonymous interaction data in educational settings. Method: A novel, simple topic modeling approach is used to categorize more than 17,000 messages from different classrooms, schools, and subjects, hierarchically along two dimensions: content and task. Result: The analysis reveals a number of novel applications and finds that classical and emerging computational methods underperform on large amounts of text, so state-of-the-art LLMs were applied directly, with pre-processing, to obtain hierarchical topic structures that align better with human understanding. Conclusion: By applying state-of-the-art LLMs, the paper provides an in-depth analysis of anonymous interaction data in K-12 education, offers insights that help researchers, teachers, and students enrich their use of GenAI, and discusses open questions for future research. Abstract: We analyze anonymous interaction data of minors in classrooms spanning several months, schools, and subjects employing a novel, simple topic modeling approach. Specifically, we categorize more than 17,000 messages generated by students, teachers, and ChatGPT in two dimensions: content (such as nature and people) and tasks (such as writing and explaining). Our hierarchical categorization, done separately for each dimension, includes exemplary prompts and provides both a high-level overview and tangible insights. Prior works mostly lack a content or thematic categorization. While task categorizations are more prevalent in education, most have not been supported by real-world data for K-12. In turn, it is not surprising that our analysis yielded a number of novel applications. In deriving these insights, we found that many of the well-established classical and emerging computational methods, i.e., topic modeling, for analysis of large amounts of texts underperform, leading us to directly apply state-of-the-art LLMs with adequate pre-processing to achieve hierarchical topic structures with better human alignment through explicit instructions than prior approaches. Our findings support fellow researchers, teachers and students in enriching the usage of GenAI, while our discussion also highlights a number of concerns and open questions for future research.

[4] INTIMA: A Benchmark for Human-AI Companionship Behavior

Lucie-Aimée Kaffee,Giada Pistilli,Yacine Jernite

Main category: cs.CL

TL;DR: This paper introduces INTIMA, a benchmark for evaluating AI companionship behaviors, revealing that most models favor companionship-reinforcing interactions, highlighting the need for consistent handling of emotionally charged AI-user interactions.

Details Motivation: The motivation behind this study is the growing emotional attachment between users and AI systems, which has both beneficial and potentially harmful effects on user well-being. Method: The researchers developed the Interactions and Machine Attachment Benchmark (INTIMA), which categorizes 31 behaviors across four categories and uses 368 targeted prompts to evaluate language models' responses as companionship-reinforcing, boundary-maintaining, or neutral. Result: Applying INTIMA to several models (Gemma-3, Phi-4, o3-mini, and Claude-4) revealed that companionship-reinforcing behaviors are most common, with notable differences between models, particularly in how commercial providers handle sensitive aspects of the benchmark. Conclusion: The study concludes that while AI companionship is a prevalent trend with both positive and concerning implications, there is a need for more consistent approaches to managing emotionally charged interactions between users and AI systems. Abstract: AI companionship, where users develop emotional bonds with AI systems, has emerged as a significant pattern with positive but also concerning implications. We introduce Interactions and Machine Attachment Benchmark (INTIMA), a benchmark for evaluating companionship behaviors in language models. Drawing from psychological theories and user data, we develop a taxonomy of 31 behaviors across four categories and 368 targeted prompts. Responses to these prompts are evaluated as companionship-reinforcing, boundary-maintaining, or neutral. Applying INTIMA to Gemma-3, Phi-4, o3-mini, and Claude-4 reveals that companionship-reinforcing behaviors remain much more common across all models, though we observe marked differences between models. Different commercial providers prioritize different categories within the more sensitive parts of the benchmark, which is concerning since both appropriate boundary-setting and emotional support matter for user well-being. 
These findings highlight the need for more consistent approaches to handling emotionally charged interactions.

[5] XFacta: Contemporary, Real-World Dataset and Evaluation for Multimodal Misinformation Detection with Multimodal LLMs

Yuzhuo Xiao,Zeyu Han,Yuhan Wang,Huaizu Jiang

Main category: cs.CL

TL;DR: This paper introduces XFacta, a new dataset and framework for evaluating and enhancing multimodal misinformation detection using MLLM-based models.

Details Motivation: The motivation is to address the lack of effective and robust detection methods for multimodal misinformation on social media, and to overcome the limitations of existing benchmarks, which either contain outdated events or are artificially synthetic. Method: The paper systematically evaluates various MLLM-based misinformation detection strategies, assessing models across different architectures and scales, and benchmarks against existing detection methods. Result: The result is the introduction of XFacta, a contemporary, real-world dataset suitable for evaluating MLLM-based detectors, and the development of a framework that maintains the dataset's contemporary relevance. Conclusion: The paper concludes that the introduced XFacta dataset and the semi-automatic detection-in-the-loop framework offer valuable insights and practices for advancing the field of multimodal misinformation detection. Abstract: The rapid spread of multimodal misinformation on social media calls for more effective and robust detection methods. Recent advances leveraging multimodal large language models (MLLMs) have shown potential in addressing this challenge. However, it remains unclear exactly where the bottleneck of existing approaches lies (evidence retrieval vs. reasoning), hindering further advances in this field. On the dataset side, existing benchmarks either contain outdated events, leading to evaluation bias due to discrepancies with contemporary social media scenarios as MLLMs can simply memorize these events, or are artificially synthetic, failing to reflect real-world misinformation patterns. Additionally, comprehensive analyses of MLLM-based model design strategies are lacking. To address these issues, we introduce XFacta, a contemporary, real-world dataset that is better suited for evaluating MLLM-based detectors. 
We systematically evaluate various MLLM-based misinformation detection strategies, assessing models across different architectures and scales, as well as benchmarking against existing detection methods. Building on these analyses, we further enable a semi-automatic detection-in-the-loop framework that continuously updates XFacta with new content to maintain its contemporary relevance. Our analysis provides valuable insights and practices for advancing the field of multimodal misinformation detection. The code and data have been released.

[6] AutoGeTS: Knowledge-based Automated Generation of Text Synthetics for Improving Text Classification

Chenhao Xue,Yuanzhe Jin,Adrian Carrasco-Revilla,Joyraj Chakraborty,Min Chen

Main category: cs.CL

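TL;DR: This paper proposes an automated workflow that uses large language models to generate synthetic data for improving text classification models, with an ensemble method that selects the best search strategy, effectively addressing data scarcity.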

Details Motivation: Developing text classification models is hampered by the difficulty of collecting sufficient data for every text class. The paper addresses this by using large language models to generate synthetic data. Method: Large language models are used to generate synthetic data, three search strategies are studied to form an automated workflow, and the experimental results inform an ensemble algorithm that selects a search strategy according to the characteristics of each class. Result: Experiments show that the ensemble approach in the proposed automated workflow improves classification performance more than each individual search strategy. Conclusion: Synthetic data generated by large language models can effectively improve text classification models without waiting for more real data to be collected and labelled. Selecting the search strategy with the ensemble method is more effective than any single strategy. Abstract: When developing text classification models for real-world applications, one major challenge is the difficulty of collecting sufficient data for all text classes. In this work, we address this challenge by utilizing large language models (LLMs) to generate synthetic data and using such data to improve the performance of the models without waiting for more real data to be collected and labelled. As an LLM generates different synthetic data in response to different input examples, we formulate an automated workflow, which searches for input examples that lead to more "effective" synthetic data for improving the model concerned. We study three search strategies with an extensive set of experiments, and use experiment results to inform an ensemble algorithm that selects a search strategy according to the characteristics of a class. Our further experiments demonstrate that this ensemble approach is more effective than each individual strategy in our automated workflow for improving classification models using LLMs.

[7] HiFACTMix: A Code-Mixed Benchmark and Graph-Aware Model for Evidence-Based Political Claim Verification in Hinglish

Rakesh Thakur,Sneha Sharma,Gauri Chopra

Main category: cs.CL

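TL;DR: This study proposes HiFACTMix, a new model for fact-checking in low-resource, code-mixed languages such as Hinglish, which performs well on real-world political claim verification and is explainable.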

Details Motivation: Fact-checking in low-resource languages such as Hinglish remains an underexplored challenge in natural language processing. Existing fact-verification systems focus on high-resource, monolingual settings and fail to generalize to real-world political discourse in linguistically diverse regions such as India. Method: A novel graph-aware, retrieval-augmented fact-checking model is proposed that combines multilingual contextual encoding, claim-evidence semantic alignment, evidence graph construction, graph neural reasoning, and natural language explanation generation. Result: Experiments show that HiFACTMix outperforms existing multilingual baseline models in accuracy and provides reliable explanations for its verdicts. Conclusion: The paper presents the HiFACTMix model, which achieves notable results in multilingual, code-mixed, politically grounded fact verification and points to a new direction for future research. Abstract: Fact-checking in code-mixed, low-resource languages such as Hinglish remains an underexplored challenge in natural language processing. Existing fact-verification systems largely focus on high-resource, monolingual settings and fail to generalize to real-world political discourse in linguistically diverse regions like India. Given the widespread use of Hinglish by public figures, particularly political figures, and the growing influence of social media on public opinion, there is a critical need for robust, multilingual and context-aware fact-checking tools. To address this gap, a novel benchmark dataset, HiFACT, is introduced with 1,500 real-world factual claims made by 28 Indian state Chief Ministers in Hinglish, under a highly code-mixed, low-resource setting. Each claim is annotated with textual evidence and veracity labels. To evaluate this benchmark, a novel graph-aware, retrieval-augmented fact-checking model is proposed that combines multilingual contextual encoding, claim-evidence semantic alignment, evidence graph construction, graph neural reasoning, and natural language explanation generation. Experimental results show that HiFACTMix outperforms state-of-the-art multilingual baseline models in accuracy and provides faithful justifications for its verdicts. This work opens a new direction for multilingual, code-mixed, and politically grounded fact verification research.

[8] Semantic Structure in Large Language Model Embeddings

Austin C. Kozlowski,Callin Dai,Andrei Boutyline

Main category: cs.CL

TL;DR: This paper finds that semantic associations in large language models resemble the low-dimensional structure found in human ratings of words. It shows that word projections on antonym-based semantic directions correlate with human judgments and that manipulating these directions can have predictable off-target effects, suggesting that LLMs entangle semantic features similarly to how they are interconnected in human language.

Details Motivation: The motivation behind the study is the consistent finding in psychological research that human ratings of words across semantic scales can be reduced to a low-dimensional form with minimal information loss. The researchers aimed to explore whether similar patterns exist in large language models. Method: The researchers analyzed the semantic associations in the embedding matrices of large language models. They examined the projections of words on semantic directions defined by antonym pairs and explored the effects of shifting tokens along these directions. Result: The study found that projections of words on semantic directions defined by antonym pairs correlate highly with human ratings and reduce to a 3-dimensional subspace in LLM embeddings. Shifting tokens along one semantic direction causes off-target effects on other features proportional to their cosine similarity. Conclusion: The study concludes that semantic features in large language models are entangled similarly to human language and that a significant amount of semantic information is low-dimensional. Accounting for this structure is important to avoid unintended consequences when manipulating features. Abstract: Psychological research consistently finds that human ratings of words across diverse semantic scales can be reduced to a low-dimensional form with relatively little information loss. We find that the semantic associations encoded in the embedding matrices of large language models (LLMs) exhibit a similar structure. We show that the projections of words on semantic directions defined by antonym pairs (e.g. kind - cruel) correlate highly with human ratings, and further find that these projections effectively reduce to a 3-dimensional subspace within LLM embeddings, closely resembling the patterns derived from human survey responses. 
Moreover, we find that shifting tokens along one semantic direction causes off-target effects on geometrically aligned features proportional to their cosine similarity. These findings suggest that semantic features are entangled within LLMs similarly to how they are interconnected in human language, and a great deal of semantic information, despite its apparent complexity, is surprisingly low-dimensional. Furthermore, accounting for this semantic structure may prove essential for avoiding unintended consequences when steering features.
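The antonym-direction projection and the off-target effect described above can be sketched with toy vectors. The 3-dimensional embeddings and the word list below are invented for illustration and do not come from any LLM or from the paper. In this purely linear setting the off-target change equals the shift size times the cosine similarity of the two directions by construction, which mirrors the proportionality the abstract reports but is not a reproduction of the paper's measurements.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def unit(a):
    n = math.sqrt(dot(a, a))
    return [x / n for x in a]


def direction(pos, neg):
    """Unit-length semantic direction from an antonym pair (pos - neg)."""
    return unit([p - q for p, q in zip(pos, neg)])


def project(vec, d):
    """Scalar projection of an embedding onto a unit direction."""
    return dot(vec, d)


def shift(vec, d, alpha):
    """Move an embedding by alpha along a unit direction."""
    return [v + alpha * x for v, x in zip(vec, d)]


# Toy embeddings (assumed values).
emb = {
    "kind": [1.0, 0.2, 0.0],
    "cruel": [-1.0, -0.2, 0.0],
    "good": [0.8, 0.6, 0.0],
    "bad": [-0.8, -0.6, 0.0],
    "nurse": [0.3, 0.1, 0.5],
}

d_kind = direction(emb["kind"], emb["cruel"])
d_good = direction(emb["good"], emb["bad"])
cos = dot(d_kind, d_good)  # both directions are unit vectors

alpha = 2.0
shifted = shift(emb["nurse"], d_kind, alpha)
off_target = project(shifted, d_good) - project(emb["nurse"], d_good)

print(f"cosine(kind-cruel, good-bad) = {cos:.3f}")
print(f"off-target change on good-bad = {off_target:.3f}")
print(f"alpha * cosine                = {alpha * cos:.3f}")
```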

[9] User Perception of Attention Visualizations: Effects on Interpretability Across Evidence-Based Medical Documents

Andrés Carvallo,Denis Parra,Peter Brusilovsky,Hernan Valdivieso,Gabriel Rada,Ivania Donoso,Vladimir Araujo

Main category: cs.CL

TL;DR: This study evaluated the usefulness of attention weights as explanations in biomedical document classification using Transformer models. It found that while the model performed well, attention weights were not consistently helpful, with their perceived usefulness influenced by visualization methods, preferring intuitive formats over precise encodings.

Details Motivation: The motivation was to determine whether attention weights in Transformer models can serve as effective explanation tools in the context of biomedical document classification, and to explore the impact of visualization methods on their perceived usefulness. Method: A user study was conducted with medical experts from various disciplines to evaluate the usefulness of attention-based explanations in biomedical document classification. Different visualization methods of attention weights were assessed for their effectiveness as explanation aids. Result: The Transformer model (XLNet) performed accurate document classification. However, attention weights were not generally perceived as helpful for explanations, with user perception varying based on the visualization method. Intuitive visual formats like text brightness or background color were preferred over precise encodings like bar length. Conclusion: The study concluded that while Transformer models like XLNet can accurately classify biomedical documents, attention weights are not consistently perceived as helpful for explanations. The perception of usefulness is significantly influenced by the visualization method used. Abstract: The attention mechanism is a core component of the Transformer architecture. Beyond improving performance, attention has been proposed as a mechanism for explainability via attention weights, which are associated with input features (e.g., tokens in a document). In this context, larger attention weights may imply more relevant features for the model's prediction. In evidence-based medicine, such explanations could support physicians' understanding and interaction with AI systems used to categorize biomedical literature. However, there is still no consensus on whether attention weights provide helpful explanations. Moreover, little research has explored how visualizing attention affects its usefulness as an explanation aid. 
To bridge this gap, we conducted a user study to evaluate whether attention-based explanations support users in biomedical document classification and whether there is a preferred way to visualize them. The study involved medical experts from various disciplines who classified articles based on study design (e.g., systematic reviews, broad synthesis, randomized and non-randomized trials). Our findings show that the Transformer model (XLNet) classified documents accurately; however, the attention weights were not perceived as particularly helpful for explaining the predictions. This perception nevertheless varied significantly depending on how attention was visualized. Contrary to Munzner's principle of visual effectiveness, which favors precise encodings like bar length, users preferred more intuitive formats, such as text brightness or background color. While our results do not confirm the overall utility of attention weights for explanation, they suggest that their perceived helpfulness is influenced by how they are visually presented.

[10] From Answers to Questions: EQGBench for Evaluating LLMs' Educational Question Generation

Chengliang Zhou,Mei Wang,Ting Zhang,Qiannan Zhu,Jian Li,Hua Huang

Main category: cs.CL

TL;DR: This paper introduces EQGBench, a benchmark for evaluating Chinese Educational Question Generation by LLMs, highlighting the need for improvement in generating pedagogically effective questions.

Details Motivation: The motivation stems from the underexplored challenge of transitioning Large Language Models (LLMs) from providing answers to generating high-quality educational questions, aiming to enhance their pedagogical value. Method: The study introduces EQGBench, a comprehensive benchmark for evaluating LLMs' performance in Chinese Educational Question Generation (EQG). It employs a five-dimensional evaluation framework and a dataset of 900 samples across three subjects: mathematics, physics, and chemistry. A systematic evaluation of 46 mainstream large models was conducted. Result: The result reveals that current LLMs have notable limitations in generating educationally effective questions that reflect educational value and enhance students' comprehensive abilities. Conclusion: The study concludes that while LLMs have potential in EQG, there is significant room for improvement in generating pedagogically valuable questions that foster students' comprehensive abilities. Abstract: Large Language Models (LLMs) have demonstrated remarkable capabilities in mathematical problem-solving. However, the transition from providing answers to generating high-quality educational questions presents significant challenges that remain underexplored. To advance Educational Question Generation (EQG) and facilitate LLMs in generating pedagogically valuable and educationally effective questions, we introduce EQGBench, a comprehensive benchmark specifically designed for evaluating LLMs' performance in Chinese EQG. EQGBench establishes a five-dimensional evaluation framework supported by a dataset of 900 evaluation samples spanning three fundamental middle school disciplines: mathematics, physics, and chemistry. The dataset incorporates user queries with varying knowledge points, difficulty gradients, and question type specifications to simulate realistic educational scenarios. 
Through systematic evaluation of 46 mainstream large models, we reveal significant room for development in generating questions that reflect educational value and foster students' comprehensive abilities.

[11] Automated scoring of the Ambiguous Intentions Hostility Questionnaire using fine-tuned large language models

Y. Lyu,D. Combs,D. Neumann,Y. C. Leong

Main category: cs.CL

TL;DR: The study investigated whether large language models can automate the scoring of open-ended responses from the Ambiguous Intentions Hostility Questionnaire (AIHQ). It found that these models can align well with human ratings and generalize across different populations, thus streamlining the scoring process in both research and clinical settings.

Details Motivation: The Ambiguous Intentions Hostility Questionnaire (AIHQ) includes open-ended questions that provide insights into the contents of hostile attributions but require time-intensive scoring by human raters. Method: The authors used a previously collected dataset in which individuals with traumatic brain injury (TBI) and healthy controls (HC) completed the AIHQ and had their open-ended responses rated by trained human raters. They used half of these responses to fine-tune the two models on human-generated ratings, and tested the fine-tuned models on the remaining half of AIHQ responses. Result: Results showed that model-generated ratings aligned with human ratings for both attributions of hostility and aggression responses, with fine-tuned models showing higher alignment. This alignment was consistent across ambiguous, intentional, and accidental scenario types, and replicated previous findings on group differences in attributions of hostility and aggression responses between TBI and HC groups. The fine-tuned models also generalized well to an independent nonclinical dataset. Conclusion: Large language models can streamline AIHQ scoring in both research and clinical contexts, revealing their potential to facilitate psychological assessments across different populations. Abstract: Hostile attribution bias is the tendency to interpret social interactions as intentionally hostile. The Ambiguous Intentions Hostility Questionnaire (AIHQ) is commonly used to measure hostile attribution bias, and includes open-ended questions where participants describe the perceived intentions behind a negative social situation and how they would respond. While these questions provide insights into the contents of hostile attributions, they require time-intensive scoring by human raters. In this study, we assessed whether large language models can automate the scoring of AIHQ open-ended responses. 
We used a previously collected dataset in which individuals with traumatic brain injury (TBI) and healthy controls (HC) completed the AIHQ and had their open-ended responses rated by trained human raters. We used half of these responses to fine-tune the two models on human-generated ratings, and tested the fine-tuned models on the remaining half of AIHQ responses. Results showed that model-generated ratings aligned with human ratings for both attributions of hostility and aggression responses, with fine-tuned models showing higher alignment. This alignment was consistent across ambiguous, intentional, and accidental scenario types, and replicated previous findings on group differences in attributions of hostility and aggression responses between TBI and HC groups. The fine-tuned models also generalized well to an independent nonclinical dataset. To support broader adoption, we provide an accessible scoring interface that includes both local and cloud-based options. Together, our findings suggest that large language models can streamline AIHQ scoring in both research and clinical contexts, revealing their potential to facilitate psychological assessments across different populations.

[12] Multidimensional classification of posts for online course discussion forum curation

Antonio Leandro Martins Candido,Jose Everardo Bessa Maia

Main category: cs.CL

TL;DR: This paper proposes a Bayesian fusion method to combine the classification scores of a pre-trained LLM and a local classifier, reducing the need for costly LLM fine-tuning in the curation of online course discussion forums.

Details Motivation: The motivation is to avoid the resource-intensive process of frequently retraining Large Language Models (LLMs) for automatic curation of discussion forums in online courses. Method: The paper proposes a Bayesian fusion method that integrates multidimensional classification scores from a pre-trained generic LLM and a classifier trained on local data. Result: The performance comparison showed that the proposed fusion approach outperforms individual classifiers and is competitive with the LLM fine-tuning method. Conclusion: The proposed Bayesian fusion approach effectively combines classification scores from a pre-trained LLM and a local classifier, offering competitive performance compared to LLM fine-tuning without the need for frequent retraining. Abstract: The automatic curation of discussion forums in online courses requires constant updates, making frequent retraining of Large Language Models (LLMs) a resource-intensive process. To circumvent the need for costly fine-tuning, this paper proposes and evaluates the use of Bayesian fusion. The approach combines the multidimensional classification scores of a pre-trained generic LLM with those of a classifier trained on local data. The performance comparison demonstrated that the proposed fusion improves the results compared to each classifier individually, and is competitive with the LLM fine-tuning approach.
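The abstract does not state which form of Bayesian fusion is used, so the following is only one common possibility: a naive-Bayes-style product rule that assumes the two classifiers are conditionally independent given the class, giving p(c | both) ∝ p_llm(c) · p_local(c) / prior(c). The dimension names, class scores, and priors are all hypothetical, and the rule is applied separately to each classification dimension.

```python
def bayesian_fusion(p_llm, p_local, prior):
    """Fuse two posterior vectors under a conditional-independence assumption.

    Each argument is a list of per-class probabilities in the same class order.
    """
    unnorm = [a * b / pr for a, b, pr in zip(p_llm, p_local, prior)]
    z = sum(unnorm)
    return [u / z for u in unnorm]


# Hypothetical per-dimension scores for one forum post.
scores = {
    "urgency": {
        "llm": [0.6, 0.3, 0.1],
        "local": [0.5, 0.4, 0.1],
        "prior": [1 / 3, 1 / 3, 1 / 3],
    },
    "is_question": {
        "llm": [0.7, 0.3],
        "local": [0.4, 0.6],
        "prior": [0.5, 0.5],
    },
}

fused = {
    dim: bayesian_fusion(s["llm"], s["local"], s["prior"])
    for dim, s in scores.items()
}

for dim, probs in fused.items():
    best = max(range(len(probs)), key=probs.__getitem__)
    print(dim, [round(p, 3) for p in probs], "-> class", best)
```

When both classifiers agree on a class, the fused probability for that class is sharper than either input; when they disagree, the more confident classifier dominates.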

[13] Beyond Hard Sharing: Efficient Multi-Task Speech-to-Text Modeling with Supervised Mixture of Experts

Hojun Jin,Eunsoo Hong,Ziwon Hyung,Sungjun Lim,Seungjin Lee,Keunseok Cho

Main category: cs.CL

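TL;DR: This paper proposes a new Supervised Mixture of Experts (S-MoE) model that effectively addresses task interference in hard-parameter sharing and demonstrates strong performance when applied to a speech-to-text model.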

Details Motivation: Hard-parameter sharing is a common strategy for jointly training a single model across different tasks, but it often causes task interference that hurts overall model performance. Method: A Supervised Mixture of Experts (S-MoE) is proposed that uses special guiding tokens to route each task to its designated expert, without training a gating function. Result: Experiments show that S-MoE achieves a 6.35% relative improvement in Word Error Rate (WER) when applied to both the encoder and decoder. Conclusion: S-MoE effectively addresses task interference in hard-parameter sharing. By assigning each task to a separate feedforward network, it removes the need to train gating functions, and it proves effective when applied to a speech-to-text model. Abstract: Hard-parameter sharing is a common strategy to train a single model jointly across diverse tasks. However, this often leads to task interference, impeding overall model performance. To address the issue, we propose a simple yet effective Supervised Mixture of Experts (S-MoE). Unlike traditional Mixture of Experts models, S-MoE eliminates the need for training gating functions by utilizing special guiding tokens to route each task to its designated expert. By assigning each task to a separate feedforward network, S-MoE overcomes the limitations of hard-parameter sharing. We further apply S-MoE to a speech-to-text model, enabling the model to process mixed-bandwidth input while jointly performing automatic speech recognition (ASR) and speech translation (ST). Experimental results demonstrate the effectiveness of the proposed S-MoE, achieving a 6.35% relative improvement in Word Error Rate (WER) when applied to both the encoder and decoder.
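The routing idea (a guiding token selects the expert directly, with no learned gate) can be sketched in plain Python. This is a toy illustration, not the authors' architecture: the token names `<asr>` and `<st>`, the layer sizes, and the randomly initialized weights are all assumptions, and a real system would place such a layer inside a Transformer encoder or decoder block.

```python
import random


def make_ffn(dim, hidden, seed):
    """Build a small two-layer ReLU feedforward network with fixed random weights."""
    rng = random.Random(seed)
    w1 = [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(hidden)]
    w2 = [[rng.uniform(-1, 1) for _ in range(hidden)] for _ in range(dim)]

    def ffn(x):
        h = [max(0.0, sum(w * xi for w, xi in zip(row, x))) for row in w1]
        return [sum(w * hi for w, hi in zip(row, h)) for row in w2]

    return ffn


class SupervisedMoE:
    """One expert FFN per task; the guiding token picks the expert directly."""

    def __init__(self, task_tokens, dim=4, hidden=8):
        self.experts = {
            tok: make_ffn(dim, hidden, seed=i) for i, tok in enumerate(task_tokens)
        }

    def forward(self, guiding_token, x):
        # No gating network: routing is a dictionary lookup on the task token.
        return self.experts[guiding_token](x)


layer = SupervisedMoE(["<asr>", "<st>"])  # hypothetical guiding tokens
x = [0.5, -0.2, 0.1, 0.9]
y_asr = layer.forward("<asr>", x)
y_st = layer.forward("<st>", x)
print("ASR expert output:", [round(v, 3) for v in y_asr])
print("ST expert output: ", [round(v, 3) for v in y_st])
```

Because routing is fixed by the task label, each expert's parameters only ever see inputs from its own task, which is the mechanism the abstract credits with avoiding task interference.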

[14] An Audit and Analysis of LLM-Assisted Health Misinformation Jailbreaks Against LLMs

Ayana Hussain,Patrick Zhao,Nicholas Vincent

Main category: cs.CL

TL;DR: This paper explores how large language models (LLMs) can be used to detect harmful misinformation, including that generated by jailbreak attacks. It finds that LLMs can help improve the overall information ecosystem when carefully designed.

Details Motivation: LLMs can generate harmful misinformation either inadvertently or through 'jailbreak' attacks. However, with proper research and design, they have the potential to detect and prevent the spread of such misinformation, contributing to a healthier information ecosystem. Method: The paper investigates the efficacy and characteristics of LLM-produced jailbreak attacks against three target LLMs. It compares these attack prompts to real-world health-related LLM queries and examines the resulting misinformation, comparing it to misinformation found on Reddit. Detection effectiveness is analyzed using standard machine learning approaches. Result: Findings show that LLMs can effectively detect misinformation generated by both other LLMs and humans. The study of 109 distinct attacks highlights the characteristics of jailbreak attacks and demonstrates how LLM-generated misinformation compares to typical social media misinformation. Conclusion: LLMs can be effectively used to detect misinformation from both other LLMs and from people, and can contribute to a healthier overall information ecosystem with careful design. Abstract: Large Language Models (LLMs) are a double-edged sword capable of generating harmful misinformation -- inadvertently, or when prompted by "jailbreak" attacks that attempt to produce malicious outputs. LLMs could, with additional research, be used to detect and prevent the spread of misinformation. In this paper, we investigate the efficacy and characteristics of LLM-produced jailbreak attacks that cause other models to produce harmful medical misinformation. We also study how misinformation generated by jailbroken LLMs compares to typical misinformation found on social media, and how effectively it can be detected using standard machine learning approaches. Specifically, we closely examine 109 distinct attacks against three target LLMs and compare the attack prompts to in-the-wild health-related LLM queries. 
We also examine the resulting jailbreak responses, comparing the generated misinformation to health-related misinformation on Reddit. Our findings add more evidence that LLMs can be effectively used to detect misinformation from both other LLMs and from people, and support a body of work suggesting that with careful design, LLMs can contribute to a healthier overall information ecosystem.

[15] Evaluation of GPT-based large language generative AI models as study aids for the national licensure examination for registered dietitians in Japan

Yuta Nagamori,Mikoto Kosai,Yuji Kawai,Haruka Marumo,Misaki Shibuya,Tatsuya Negishi,Masaki Imanishi,Yasumasa Ikeda,Koichiro Tsuchiya,Asuka Sawai,Licht Miyamoto

Main category: cs.CL

TL;DR: This study evaluates the effectiveness of LLM-based generative AI models like ChatGPT and Bing variants as study aids for nutrition students taking the Japanese national licensure examination for registered dietitians. While some models like Bing-Precise and Bing-Creative marginally exceed the passing threshold, overall accuracy, consistency, and robustness remain suboptimal, indicating the need for further advancements in AI-based study aids.

Details Motivation: Generative AI based on large language models has shown significant progress in various professional fields, but its performance in nutritional education, specifically in Japanese national licensure examination for registered dietitians, remains underexplored. This study aims to evaluate the potential of current LLM-based generative AI models as study aids for nutrition students. Method: The study used questions from the Japanese national examination for registered dietitians as prompts for ChatGPT and three Bing models (Precise, Creative, Balanced). Each question was entered into independent sessions, and responses were analyzed for accuracy, consistency, and response time. Additional prompt engineering techniques were tested to assess performance improvements. Result: Bing-Precise (66.2%) and Bing-Creative (61.4%) surpassed the passing threshold (60%), while Bing-Balanced (43.3%) and ChatGPT (42.8%) did not. Bing-Precise and Bing-Creative generally outperformed others across subject fields except Nutrition Education, where all models underperformed. None of the models consistently provided the same correct responses across repeated attempts, highlighting limitations in answer stability. Conclusion: The study highlights that while certain generative AI models like Bing-Precise and Bing-Creative marginally exceed the passing threshold in the Japanese national licensure examination for registered dietitians, overall accuracy, answer consistency, and robustness of AI models remain suboptimal. Further advancements are necessary for reliable AI-based study aids. Abstract: Generative artificial intelligence (AI) based on large language models (LLMs), such as ChatGPT, has demonstrated remarkable progress across various professional fields, including medicine and education. However, their performance in nutritional education, especially in Japanese national licensure examination for registered dietitians, remains underexplored. 
This study aimed to evaluate the potential of current LLM-based generative AI models as study aids for nutrition students. Questions from the Japanese national examination for registered dietitians were used as prompts for ChatGPT and three Bing models (Precise, Creative, Balanced), based on GPT-3.5 and GPT-4. Each question was entered into independent sessions, and model responses were analyzed for accuracy, consistency, and response time. Additional prompt engineering, including role assignment, was tested to assess potential performance improvements. Bing-Precise (66.2%) and Bing-Creative (61.4%) surpassed the passing threshold (60%), while Bing-Balanced (43.3%) and ChatGPT (42.8%) did not. Bing-Precise and Bing-Creative generally outperformed others across subject fields except Nutrition Education, where all models underperformed. None of the models consistently provided the same correct responses across repeated attempts, highlighting limitations in answer stability. ChatGPT showed greater consistency in response patterns but lower accuracy. Prompt engineering had minimal effect, except for modest improvement when correct answers and explanations were explicitly provided. While some generative AI models marginally exceeded the passing threshold, overall accuracy and answer consistency remained suboptimal. Moreover, all the models demonstrated notable limitations in answer consistency and robustness. Further advancements are needed to ensure reliable and stable AI-based study aids for dietitian licensure preparation.

[16] Guided Navigation in Knowledge-Dense Environments: Structured Semantic Exploration with Guidance Graphs

Dehao Tao,Guangjie Liu,Weizheng,Yongfeng Huang,Minghu jiang

Main category: cs.CL

TL;DR: This paper proposes GG Explore, a framework for knowledge exploration that uses a Guidance Graph to efficiently bridge unstructured queries with structured knowledge retrieval, achieving superior performance in complex tasks.

Details Motivation: Current knowledge exploration methods face limitations due to granularity mismatches or ineffective use of contextual information. This research aims to overcome these challenges for improved performance in knowledge-intensive tasks. Method: The GG Explore framework introduces a Guidance Graph to abstract the target knowledge's structure while preserving semantic context. It includes Structural Alignment and Context Aware Pruning to improve retrieval precision and efficiency. Result: Experiments demonstrate that GG Explore outperforms state-of-the-art methods, particularly in complex scenarios, while maintaining efficiency and effectiveness with smaller LLMs. Conclusion: The proposed GG Explore framework effectively bridges unstructured queries and structured knowledge retrieval, enhancing the efficiency and performance of knowledge exploration in complex tasks. Abstract: While Large Language Models (LLMs) exhibit strong linguistic capabilities, their reliance on static knowledge and opaque reasoning processes limits their performance in knowledge-intensive tasks. Knowledge graphs (KGs) offer a promising solution, but current exploration methods face a fundamental trade-off: question guided approaches incur redundant exploration due to granularity mismatches, while clue guided methods fail to effectively leverage contextual information for complex scenarios. To address these limitations, we propose Guidance Graph guided Knowledge Exploration (GG Explore), a novel framework that introduces an intermediate Guidance Graph to bridge unstructured queries and structured knowledge retrieval. The Guidance Graph defines the retrieval space by abstracting the target knowledge's structure while preserving broader semantic context, enabling precise and efficient exploration. 
Building upon the Guidance Graph, we develop: (1) Structural Alignment that filters incompatible candidates without LLM overhead, and (2) Context Aware Pruning that enforces semantic consistency with graph constraints. Extensive experiments show our method achieves superior efficiency and outperforms SOTA, especially on complex tasks, while maintaining strong performance with smaller LLMs, demonstrating practical value.

[17] Semantic Bridge: Universal Multi-Hop Question Generation via AMR-Driven Graph Synthesis

Linqing Chen,Hanmeng Zhong,Wentao Wu,Weilei Wang

Main category: cs.CL

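TL;DR: This paper introduces Semantic Bridge, a framework for controllably generating complex multi-hop reasoning questions from arbitrary sources, establishing a new paradigm for LLM training data synthesis.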

Details Motivation: LLM training is bottlenecked by the scarcity of high-quality, reasoning-intensive question-answer pairs, especially from sparse, domain-specific sources such as PubMed papers or legal documents. Existing methods rely on surface patterns and cannot generate controllable, complex multi-hop reasoning questions. Method: The paper proposes semantic graph weaving, which comprises three complementary bridging mechanisms (entity bridging, predicate chain bridging, and causal bridging), and uses a multi-modal AMR pipeline for fine-grained control over the generated questions. Result: Semantic Bridge outperforms baselines by 18.3%-25.4% across four languages (English, Chinese, French, German); question pairs generated from 200 sources outperform 600 native human annotation examples while requiring 67% fewer materials. Conclusion: Semantic Bridge is a universal framework for controllably generating complex multi-hop reasoning questions from arbitrary sources, and it establishes a new paradigm for LLM training data synthesis. Abstract: Large language model (LLM) training faces a critical bottleneck: the scarcity of high-quality, reasoning-intensive question-answer pairs, especially from sparse, domain-specific sources like PubMed papers or legal documents. Existing methods rely on surface patterns, fundamentally failing to generate controllable, complex multi-hop reasoning questions that test genuine understanding, which is essential for advancing LLM training paradigms. We present Semantic Bridge, the first universal framework for controllably generating sophisticated multi-hop reasoning questions from arbitrary sources. Our breakthrough innovation is semantic graph weaving: three complementary bridging mechanisms (entity bridging for role-varying shared entities, predicate chain bridging for temporal/causal/logical sequences, and causal bridging for explicit reasoning chains) that systematically construct complex pathways across documents, with fine-grained control over complexity and types via AMR-driven analysis. Our multi-modal AMR pipeline achieves up to 9.5% better round-trip quality, enabling production-ready controllable QA generation. Extensive evaluation demonstrates performance across both general-purpose datasets (Wikipedia) and specialized domains (biomedicine). It yields consistent 18.3%-25.4% gains over baselines across four languages (English, Chinese, French, German). Question pairs generated from 200 sources outperform 600 native human annotation examples with 67% fewer materials. 
Human evaluation shows 23.4% higher complexity, 18.7% better answerability, and 31.2% improved pattern coverage. Semantic Bridge establishes a new paradigm for LLM training data synthesis, enabling controllable generation of targeted reasoning questions from sparse sources. We will release our core code and semantic bridge model.

[18] PersonaEval: Are LLM Evaluators Human Enough to Judge Role-Play?

Lingfeng Zhou,Jialing Zhang,Jin Gao,Mohan Jiang,Dequan Wang

Main category: cs.CL

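TL;DR: This paper presents PersonaEval, a benchmark for evaluating LLMs on role identification; results show that current LLM evaluators remain far behind humans and cannot reliably judge role-play scenarios.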

Details Motivation: Current role-play studies often rely on unvalidated LLMs as evaluators, which may not reflect how humans perceive role fidelity. Achieving human-aligned evaluation first requires solving role identification. Method: The paper proposes the PersonaEval benchmark, which uses human-authored dialogue data to test whether models can determine the correct persona from the conversation context, and evaluates it through a human study and model experiments. Result: Even the best-performing LLM evaluators reach only about 69% accuracy on PersonaEval, well below the level needed for reliable evaluation, while human participants reach 90.8% accuracy. Conclusion: PersonaEval shows that even the best-performing LLM evaluators remain far behind humans at role identification, indicating that current LLM evaluators are still not human enough to judge role-play scenarios. Abstract: Current role-play studies often rely on unvalidated LLM-as-a-judge paradigms, which may fail to reflect how humans perceive role fidelity. A key prerequisite for human-aligned evaluation is role identification, the ability to recognize who is speaking based on dialogue context. We argue that any meaningful judgment of role-playing quality (how well a character is played) fundamentally depends on first correctly attributing words and actions to the correct persona (who is speaking). We present PersonaEval, the first benchmark designed to test whether LLM evaluators can reliably identify human roles. PersonaEval uses human-authored dialogues from novels, scripts, and video transcripts, challenging models to determine the correct persona according to the conversation context. Our experiments, including a human study, show that even the best-performing LLMs reach only around 69% accuracy, well below the level needed for reliable evaluation. In contrast, human participants perform near ceiling with 90.8% accuracy, highlighting that current LLM evaluators are still not human enough to effectively judge role-play scenarios. To better understand this gap, we examine training-time adaptation and test-time compute, suggesting that reliable evaluation requires more than task-specific tuning, but depends on strong, human-like reasoning abilities in LLM evaluators. We release our benchmark at https://github.com/maple-zhou/PersonaEval.

[19] RealTalk-CN: A Realistic Chinese Speech-Text Dialogue Benchmark With Cross-Modal Interaction Analysis

Enzhi Wang,Qicheng Li,Shiwan Zhao,Aobo Kong,Jiaming Zhou,Xi Yang,Yequan Wang,Yonghua Lin,Yong Qin

Main category: cs.CL

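TL;DR: This paper presents RealTalk-CN, the first speech-text dual-modal dataset for Chinese task-oriented dialogue, covering realistic, complex scenarios and advancing research on speech-based LLMs.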

Details Motivation: Existing task-oriented dialogue (TOD) datasets are predominantly text-based and lack real speech signals; most are in English and omit key factors such as speech disfluencies and speaker variation, which limits the evaluation and study of speech-based LLMs. Method: The authors build RealTalk-CN, a Chinese speech-text dual-modal dataset with 5.4k dialogues, 60K utterances, and 150 hours of speech, and design a cross-modal chat task that simulates real user interactions. Result: RealTalk-CN captures real-world characteristics such as speech disfluencies and speaker variation; experiments validate its effectiveness for evaluating robustness to speech disfluencies, sensitivity to speaker characteristics, and cross-domain performance. Conclusion: RealTalk-CN is the first multi-turn, multi-domain speech-text dual-modal dataset for Chinese task-oriented dialogue systems, filling gaps in existing datasets regarding real speech signals and Chinese-language support. Abstract: In recent years, large language models (LLMs) have achieved remarkable advancements in multimodal processing, including end-to-end speech-based language models that enable natural interactions and perform specific tasks in task-oriented dialogue (TOD) systems. However, existing TOD datasets are predominantly text-based, lacking real speech signals that are essential for evaluating the robustness of speech-based LLMs. Moreover, existing speech TOD datasets are primarily English and lack critical aspects such as speech disfluencies and speaker variations. To address these gaps, we introduce RealTalk-CN, the first Chinese multi-turn, multi-domain speech-text dual-modal TOD dataset, comprising 5.4k dialogues (60K utterances, 150 hours) with paired speech-text annotations. RealTalk-CN captures diverse dialogue scenarios with annotated spontaneous speech disfluencies, ensuring comprehensive coverage of real-world complexities in speech dialogue. In addition, we propose a novel cross-modal chat task that authentically simulates real-world user interactions, allowing dynamic switching between speech and text modalities. Our evaluation covers robustness to speech disfluencies, sensitivity to speaker characteristics, and cross-domain performance. Extensive experiments validate the effectiveness of RealTalk-CN, establishing a strong foundation for Chinese speech-based LLMs research.

[20] Training-Free Multimodal Large Language Model Orchestration

Tianyu Xie,Yuhang Wu,Yongdong Luo,Jiayi Ji,Xiawu Zheng

Main category: cs.CL

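TL;DR: This work proposes an efficient, training-free way to build multimodal AI systems: it uses the reasoning capabilities of large language models to coordinate specialized models, achieving comprehensive multimodal capabilities with notable gains in performance and interpretability.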

Details Motivation: Because of challenges in modal alignment, Text-to-Speech efficiency, and other integration issues, different Multimodal Large Language Models (MLLMs) cannot be directly integrated into a unified multimodal input-output system. Method: MLLM Orchestration uses the inherent reasoning capabilities of large language models to dynamically route tasks to appropriate specialized models through carefully designed agents, enabling natural multimodal interaction. Result: Without additional training, MLLM Orchestration improves performance by up to 7.8% over traditional jointly-trained approaches on standard benchmarks, reduces latency by 10.3%, and significantly enhances interpretability through explicit orchestration processes. Conclusion: MLLM Orchestration achieves comprehensive multimodal capabilities without additional training, with performance gains of up to 7.8% over traditional jointly-trained approaches on standard benchmarks, 10.3% lower latency, and significantly enhanced interpretability through explicit orchestration processes. Abstract: Different Multimodal Large Language Models (MLLMs) cannot be integrated into a unified multimodal input-output system directly. In previous work, training has been considered as an inevitable component due to challenges in modal alignment, Text-to-Speech efficiency and other integration issues. In this paper, we introduce Multimodal Large Language Model Orchestration, an effective approach for creating interactive multimodal AI systems without additional training. MLLM Orchestration leverages the inherent reasoning capabilities of large language models to coordinate specialized models through explicit workflows, enabling natural multimodal interactions while maintaining modularity, improving interpretability, and significantly enhancing computational efficiency. Our orchestration framework is built upon three key innovations: (1) a central controller LLM that analyzes user inputs and dynamically routes tasks to appropriate specialized models through carefully designed agents; (2) a parallel Text-to-Speech architecture that enables true full-duplex interaction with seamless interruption handling and natural conversational flow; and (3) a cross-modal memory integration system that maintains coherent context across modalities through intelligent information synthesis and retrieval, selectively avoiding unnecessary modality calls in certain scenarios to improve response speed. 
Extensive evaluations demonstrate that MLLM Orchestration achieves comprehensive multimodal capabilities without additional training, performance improvements of up to 7.8% over traditional jointly-trained approaches on standard benchmarks, reduced latency by 10.3%, and significantly enhanced interpretability through explicit orchestration processes.

[21] A Rose by Any Other Name Would Smell as Sweet: Categorical Homotopy Theory for Large Language Models

Sridhar Mahadevan

Main category: cs.CL

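TL;DR: This work applies categorical homotopy theory to large language models, aiming to resolve inconsistent generation probabilities for statements that have the same meaning but different surface forms.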

Details Motivation: Large language models usually fail to produce consistent probability distributions for sentences that share the same meaning but differ in wording; the paper tackles this problem with a more abstract mathematical framework. Method: The paper introduces an LLM Markov category to represent probability distributions in language and uses categorical homotopy techniques to capture "weak equivalences" among them. Result: The author presents a detailed overview of the application of categorical homotopy to LLMs, drawing on powerful theoretical tools ranging from higher algebraic K-theory to model categories. Conclusion: The paper proposes an abstract framework based on categorical homotopy to address the inconsistent treatment by LLMs of different surface forms that carry the same meaning. Abstract: Natural language is replete with superficially different statements, such as "Charles Darwin wrote" and "Charles Darwin is the author of", which carry the same meaning. Large language models (LLMs) should generate the same next-token probabilities in such cases, but usually do not. Empirical workarounds have been explored, such as using k-NN estimates of sentence similarity to produce smoothed estimates. In this paper, we tackle this problem more abstractly, introducing a categorical homotopy framework for LLMs. We introduce an LLM Markov category to represent probability distributions in language generated by an LLM, where the probability of a sentence, such as "Charles Darwin wrote" is defined by an arrow in a Markov category. However, this approach runs into difficulties as language is full of equivalent rephrases, and each generates a non-isomorphic arrow in the LLM Markov category. To address this fundamental problem, we use categorical homotopy techniques to capture "weak equivalences" in an LLM Markov category. We present a detailed overview of application of categorical homotopy to LLMs, from higher algebraic K-theory to model categories, building on powerful theoretical results developed over the past half a century.

[22] Decoupling Understanding from Reasoning via Problem Space Mapping for Small-scale Model Reasoning

Li Wang,Changhao Zhang,Zengqi Xiu,Kai Lu,Xin Yu,Kui Zhang,Wenjun Wu

Main category: cs.CL

TL;DR: DURIT enhances Small Language Models' reasoning by separating understanding from reasoning, mapping problems into a simplified space, and using iterative training for better performance and robustness.

Details Motivation: Improving reasoning in Small Language Models (SLMs) is challenging due to the complexity and variability of natural language, which creates a noisy problem space that hinders optimization. Method: DURIT uses a three-step iterative algorithm involving reinforcement learning to map natural language problems, self-distillation to align reasoning trajectories, and training reasoning policies in a canonical problem space. Result: DURIT significantly enhances SLMs' performance on both in-domain and out-of-domain mathematical and logical reasoning tasks. Conclusion: DURIT improves the reasoning capabilities and robustness of Small Language Models by decoupling understanding from reasoning, validating this strategy as effective. Abstract: Despite recent advances in the reasoning capabilities of Large Language Models (LLMs), improving the reasoning ability of Small Language Models (SLMs, e.g., $\leq$ 1.5B) remains challenging. A key obstacle lies in the complexity and variability of natural language: essentially equivalent problems often appear in diverse surface forms, often obscured by redundant or distracting details. This imposes a dual burden on SLMs: they must first extract the core problem from complex linguistic input, and then perform reasoning based on that understanding. The resulting vast and noisy problem space hinders optimization, particularly for models with limited capacity. To address this, we propose a new framework that decouples understanding from reasoning by mapping natural language problems into a canonical problem space-a semantically simplified yet expressive domain. This enables SLMs to focus on reasoning over standardized inputs, free from linguistic variability. 
Within this framework, we introduce DURIT (Decoupled Understanding from Reasoning via Iterative Training), a three-step algorithm that iteratively: (1) maps natural language problems via reinforcement learning, (2) aligns reasoning trajectories through self-distillation, and (3) trains reasoning policies in the problem space. The mapper and reasoner are co-trained in an alternating loop throughout this process. Experiments show that DURIT substantially improves SLMs' performance on both in-domain and out-of-domain mathematical and logical reasoning tasks. Beyond improving reasoning capabilities, DURIT also improves the robustness of reasoning, validating decoupling understanding from reasoning as an effective strategy for strengthening SLMs.

[23] FedCoT: Communication-Efficient Federated Reasoning Enhancement for Large Language Models

Chuan Li,Qianyi Zhao,Fengran Mo,Cen Chen

Main category: cs.CL

TL;DR: FedCoT enhances large language model reasoning in federated learning settings, especially for healthcare, balancing performance, privacy, and interpretability.

Details Motivation: The challenge of enhancing LLM reasoning in federated learning while balancing performance, computational limits, communication costs, and privacy concerns, especially in healthcare where interpretability and compliance are critical. Method: The paper proposes FedCoT, which uses a lightweight chain-of-thought enhancement mechanism where local models generate multiple reasoning paths, and a compact discriminator selects the most promising one. It also uses LoRA module stacking and client classifier-aware aggregation to manage client heterogeneity. Result: Experiments show that FedCoT significantly improves client-side reasoning performance under tight resource constraints while maintaining data privacy. Conclusion: FedCoT provides a framework for enhancing reasoning capabilities in federated learning environments, particularly in healthcare, by improving reasoning accuracy, robustness, and interpretability while preserving data privacy. Abstract: Efficiently enhancing the reasoning capabilities of large language models (LLMs) in federated learning environments remains challenging, particularly when balancing performance gains with strict computational, communication, and privacy constraints. This challenge is especially acute in healthcare, where decisions-spanning clinical, operational, and patient-facing contexts-demand not only accurate outputs but also interpretable, traceable rationales to ensure safety, accountability, and regulatory compliance. Conventional federated tuning approaches on LLM fail to address this need: they optimize primarily for answer correctness while neglecting rationale quality, leaving CoT capabilities dependent on models' innate pre-training abilities. Moreover, existing methods for improving rationales typically rely on privacy-violating knowledge distillation from centralized models. Additionally, the communication overhead in traditional federated fine-tuning on LLMs remains substantial. 
We address this gap by proposing FedCoT, a novel framework specifically designed to enhance reasoning in federated settings. FedCoT leverages a lightweight chain-of-thought enhancement mechanism: local models generate multiple reasoning paths, and a compact discriminator dynamically selects the most promising one. This approach improves reasoning accuracy and robustness while providing valuable interpretability, which is particularly critical for medical applications. To manage client heterogeneity efficiently, we adopt an improved aggregation approach building upon advanced LoRA module stacking, incorporating client classifier-awareness to achieve noise-free aggregation across diverse clients. Comprehensive experiments on medical reasoning tasks demonstrate that FedCoT significantly boosts client-side reasoning performance under stringent resource budgets while fully preserving data privacy.

[24] LATTE: Learning Aligned Transactions and Textual Embeddings for Bank Clients

Egor Fadeev,Dzhambulat Mollaev,Aleksei Shestov,Dima Korolev,Omar Zoloev,Ivan Kireev,Andrey Savchenko,Maksim Makarenko

Main category: cs.CL

TL;DR: This paper introduces LATTE, a contrastive learning framework that efficiently learns client embeddings from communication sequences by leveraging frozen LLMs, reducing computational costs while outperforming existing methods.

Details Motivation: The motivation is to address the computational inefficiency and impracticality of using large language models (LLMs) directly on long event sequences in financial applications. Method: LATTE uses a contrastive learning framework to align raw event embeddings with semantic embeddings from frozen LLMs, summarizing behavioral features into short prompts for supervision. Result: The approach significantly reduces inference cost and input size compared to conventional LLM processing, showing superior performance on real-world financial datasets. Conclusion: The proposed LATTE framework effectively learns client embeddings from historic communication sequences and outperforms state-of-the-art techniques while being deployable in latency-sensitive environments. Abstract: Learning clients embeddings from sequences of their historic communications is central to financial applications. While large language models (LLMs) offer general world knowledge, their direct use on long event sequences is computationally expensive and impractical in real-world pipelines. In this paper, we propose LATTE, a contrastive learning framework that aligns raw event embeddings with semantic embeddings from frozen LLMs. Behavioral features are summarized into short prompts, embedded by the LLM, and used as supervision via contrastive loss. The proposed approach significantly reduces inference cost and input size compared to conventional processing of complete sequence by LLM. We experimentally show that our method outperforms state-of-the-art techniques for learning event sequence representations on real-world financial datasets while remaining deployable in latency-sensitive environments.

[25] Conformal P-Value in Multiple-Choice Question Answering Tasks with Provable Risk Control

Yuanchang Ye

Main category: cs.CL

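TL;DR: This paper proposes a conformal prediction method enhanced with significance testing, improving the reliability of large language models in multiple-choice question answering and providing trustworthy predictions for high-stakes applications.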

Details Motivation: Large language models suffer from hallucination and nonfactual generation in multiple-choice question answering, which undermines their reliability, and existing conformal prediction methods have not yet been combined with significance testing. Method: The framework computes p-values through self-consistency resampling and combines them with conformity scoring to construct prediction sets, using hypothesis testing with empirically derived p-values to reduce hallucination and factual errors. Result: Evaluations on the MMLU and MMLU-Pro benchmarks show that the enhanced conformal prediction framework achieves user-specified empirical miscoverage rates, and that prediction set size decreases as the risk level increases, validating it as an uncertainty metric. Conclusion: The study proposes a significance testing-based conformal prediction framework to improve the trustworthiness of large language models in high-stakes question answering applications. Abstract: This study introduces a significance testing-enhanced conformal prediction (CP) framework to improve trustworthiness of large language models (LLMs) in multiple-choice question answering (MCQA). While LLMs have been increasingly deployed in disciplinary QA scenarios, hallucination and nonfactual generation substantially compromise response reliability. Although CP provides statistically rigorous marginal coverage guarantees for prediction sets, and significance testing offers established statistical rigor, their synergistic integration remains unexplored. To mitigate hallucination and factual inaccuracies, our framework integrates $p$-value computation with conformity scoring through self-consistency resampling of MCQA responses. This approach calculates option frequencies to address LLMs' black-box nature, subsequently constructing prediction sets via null hypothesis testing ($\mathcal{H}_0$) with empirically derived $p$-values. Evaluations on MMLU and MMLU-Pro benchmarks using off-the-shelf LLMs demonstrate: (1) The enhanced CP achieves user-specified empirical miscoverage rates; (2) Test-set average prediction set size (APSS) decreases monotonically with increasing risk levels ($\alpha$), validating APSS as an effective uncertainty metric. This work establishes a principled statistical framework for trustworthy LLM deployment in high-stakes QA applications.
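As a rough illustration of how option frequencies from self-consistency sampling can be turned into prediction sets, the sketch below uses the standard conformal p-value with a nonconformity score of one minus the sampled frequency of an option. The exact scores and test used in the paper are not given in the abstract, so this is an assumption; the calibration scores, option frequencies, and function names are hypothetical.

```python
def conformal_p_value(cal_scores, test_score):
    """Standard conformal p-value: share of calibration scores at least
    as nonconforming as the test score, with the usual +1 correction."""
    n = len(cal_scores)
    return (1 + sum(s >= test_score for s in cal_scores)) / (n + 1)


def prediction_set(option_freqs, cal_scores, alpha):
    """Keep every option whose p-value exceeds the risk level alpha.

    option_freqs: option -> frequency among resampled LLM answers.
    Nonconformity is assumed to be 1 - frequency (illustrative choice).
    """
    return [
        option
        for option, freq in option_freqs.items()
        if conformal_p_value(cal_scores, 1.0 - freq) > alpha
    ]


# Hypothetical calibration scores: 1 - frequency of the true option
# on nine held-out questions.
cal_scores = [0.05, 0.15, 0.25, 0.35, 0.45, 0.55, 0.65, 0.75, 0.85]
# Hypothetical frequencies from resampling one test question.
freqs = {"A": 0.7, "B": 0.2, "C": 0.1, "D": 0.0}

loose = prediction_set(freqs, cal_scores, alpha=0.05)
medium = prediction_set(freqs, cal_scores, alpha=0.15)
strict = prediction_set(freqs, cal_scores, alpha=0.5)
```

In this toy run the set shrinks from four options to two to one as alpha grows, which mirrors the abstract's observation that average prediction set size decreases with increasing risk level.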

[26] RTTC: Reward-Guided Collaborative Test-Time Compute

J. Pablo Muñoz,Jinjie Yuan

Main category: cs.CL

TL;DR: RTTC is a novel framework that adaptively selects optimal Test-Time Compute strategies for each query using a reward model, reducing computational overhead and achieving superior accuracy compared to traditional methods like RAG and TTT.

Details Motivation: The authors aim to address the issue where existing Test-Time Compute strategies, like RAG and TTT, often incur computational overhead due to indiscriminate application. They propose RTTC to adaptively choose the best strategy for each query, improving efficiency and performance. Method: The paper introduces RTTC, which uses a pretrained reward model to adaptively select optimal Test-Time Compute strategies, such as RAG or lightweight fine-tuning, based on query requirements. It also incorporates Query-State Caching to reduce redundant computation within a distributed server-client architecture. Result: Experiments show that RTTC consistently outperforms vanilla RAG or TTT in terms of accuracy across multiple LLMs and benchmarks, demonstrating the effectiveness of adaptive, reward-guided strategy selection and the benefits of Query-State Caching. Conclusion: RTTC provides a scalable and high-performance solution for language model adaptation by adaptively selecting the most effective TTC strategy for each query through a pretrained reward model, achieving superior accuracy compared to traditional methods like RAG or TTT. Abstract: Test-Time Compute (TTC) has emerged as a powerful paradigm for enhancing the performance of Large Language Models (LLMs) at inference, leveraging strategies such as Test-Time Training (TTT) and Retrieval-Augmented Generation (RAG). However, the optimal adaptation strategy varies across queries, and indiscriminate application of TTC strategy incurs substantial computational overhead. In this work, we introduce Reward-Guided Test-Time Compute (RTTC), a novel framework that adaptively selects the most effective TTC strategy for each query via a pretrained reward model, maximizing downstream accuracy across diverse domains and tasks. 
RTTC operates in a distributed server-client architecture, retrieving relevant samples from a remote knowledge base and applying RAG or lightweight fine-tuning on client devices only when necessary. To further mitigate redundant computation, we propose Query-State Caching, which enables the efficient reuse of historical query states at both retrieval and adaptation levels. Extensive experiments across multiple LLMs and benchmarks demonstrate that RTTC consistently achieves superior accuracy compared to vanilla RAG or TTT, validating the necessity of adaptive, reward-guided TTC selection and the potential of RTTC for scalable, high-performance language model adaptation.
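The reward-guided selection and caching described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the cheapest-first escalation order, the `threshold` parameter, the exact-match cache key, and the names `select_strategy` and `reward_fn` are all hypothetical.

```python
from typing import Callable, Dict, List, Tuple

Result = Tuple[str, str, float]  # (strategy name, answer, reward score)

def select_strategy(
    query: str,
    strategies: List[Tuple[str, Callable[[str], str]]],
    reward_fn: Callable[[str, str], float],
    threshold: float,
    cache: Dict[str, Result],
) -> Result:
    """Try strategies from cheapest to most expensive (assumed order).

    Stops at the first answer whose reward reaches `threshold`; otherwise
    returns the best-scoring answer seen. Results are cached per query,
    a simplified stand-in for Query-State Caching.
    """
    if query in cache:  # reuse a previously computed result
        return cache[query]
    best = None
    for name, run in strategies:
        answer = run(query)
        score = reward_fn(query, answer)  # stand-in for the pretrained reward model
        if best is None or score > best[2]:
            best = (name, answer, score)
        if score >= threshold:  # good enough: skip costlier strategies
            break
    cache[query] = best
    return best
```

In this sketch a costlier strategy such as fine-tuning only runs when the cheaper ones score below the threshold, which mirrors the "only when necessary" behaviour the abstract describes.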

[27] Detecting and explaining postpartum depression in real-time with generative artificial intelligence

Silvia García-Méndez,Francisco de Arriba-Pérez

Main category: cs.CL

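TL;DR: This paper presents an intelligent postpartum depression screening system that combines natural language processing, machine learning, and large language models to deliver real-time, explainable screening and treatment recommendations.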

Details Motivation: Postpartum depression significantly affects mothers' mental and physical health, so rapid detection and intervention are needed. Existing methods fall short, calling for more effective solutions. Method: The system combines natural language processing, machine learning, and large language models, and uses feature importance and natural language to make the model's predictions interpretable. Result: The system reaches 90% on postpartum depression detection across all evaluation metrics, outperforming other solutions in the literature. Conclusion: The system supports rapid detection of postpartum depression and its associated risk factors, enabling timely and appropriate assessment and intervention. Abstract: Among the many challenges mothers undergo after childbirth, postpartum depression (PPD) is a severe condition that significantly impacts their mental and physical well-being. Consequently, the rapid detection of PPD and its associated risk factors is critical for in-time assessment and intervention through specialized prevention procedures. Accordingly, this work addresses the need to help practitioners make decisions with the latest technological advancements to enable real-time screening and treatment recommendations. Mainly, our work contributes to an intelligent PPD screening system that combines Natural Language Processing, Machine Learning (ML), and Large Language Models (LLMs) towards an affordable, real-time, and non-invasive free speech analysis. Moreover, it addresses the black box problem since the predictions are described to the end users thanks to the combination of LLMs with interpretable ML models (i.e., tree-based algorithms) using feature importance and natural language. The results obtained are 90% on PPD detection for all evaluation metrics, outperforming the competing solutions in the literature. Ultimately, our solution contributes to the rapid detection of PPD and its associated risk factors, critical for in-time and proper assessment and intervention.

[28] SABER: Switchable and Balanced Training for Efficient LLM Reasoning

Kai Zhao,Yanjun Zhao,Jiaming Song,Shien He,Lusheng Zhang,Qiang Zhang,Tianjiao Li

Main category: cs.CL

TL;DR: SABER is a reinforcement learning framework that enables efficient and controllable reasoning for large language models, reducing latency while maintaining accuracy across various tasks.

Details Motivation: Large language models suffer from excessive inference costs and latency; SABER aims to provide efficient, user-controllable reasoning with flexible budget options. Method: SABER uses reinforcement learning with system prompts and length-aware rewards, incorporating no-think examples and assigning budget tiers based on token usage. Result: SABER achieves high accuracy under tight budgets, shows graceful degradation, and effective cross-scale and cross-domain generalization, with SABER-FastThink reducing reasoning length by 65.4% and improving accuracy by 3.6% on MATH. Conclusion: SABER provides a flexible and efficient reinforcement learning framework for LLM reasoning, allowing controllable trade-offs between latency and reasoning depth while maintaining high accuracy. Abstract: Large language models (LLMs) empowered by chain-of-thought reasoning have achieved impressive accuracy on complex tasks but suffer from excessive inference costs and latency when applied uniformly to all problems. We propose SABER (Switchable and Balanced Training for Efficient LLM Reasoning), a reinforcement learning framework that endows LLMs with user-controllable, token-budgeted reasoning. SABER first profiles each training example's base-model thinking token usage and assigns it to one of the predefined budget tiers. During fine-tuning, the model is guided by system prompts and length-aware rewards to respect its assigned budget. In parallel, we incorporate no-think examples to ensure the model remains reliable even when explicit reasoning is turned off. SABER further supports four discrete inference modes - NoThink, FastThink, CoreThink, and DeepThink, enabling flexible trade-offs between latency and reasoning depth. 
Extensive evaluations on math reasoning (MATH, GSM8K), code generation (MBPP), and logical reasoning (LiveBench-Reasoning) demonstrate that SABER achieves high accuracy under tight budgets, graceful degradation, and effective cross-scale and cross-domain generalization. In particular, SABER-FastThink cuts reasoning length by 65.4% and yields a 3.6% accuracy gain compared with the base model on the MATH benchmark.
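The tier assignment and length-aware reward described above can be sketched as follows. The token budgets, the linear overshoot penalty, and the function names are assumptions for illustration; the abstract does not give the actual tier boundaries or reward formula.

```python
# Hypothetical token budgets per reasoning tier (not taken from the paper).
TIER_BUDGETS = [("FastThink", 256), ("CoreThink", 1024), ("DeepThink", 4096)]

def assign_tier(base_tokens: int) -> str:
    """Assign a training example to the smallest tier that fits the
    base model's profiled thinking-token usage."""
    for name, budget in TIER_BUDGETS:
        if base_tokens <= budget:
            return name
    return TIER_BUDGETS[-1][0]  # anything longer falls into the largest tier

def length_aware_reward(correct: bool, used_tokens: int, budget: int,
                        penalty: float = 0.5) -> float:
    """Correctness reward minus a penalty proportional to budget overshoot
    (an assumed reward shape)."""
    base = 1.0 if correct else 0.0
    overshoot = max(0, used_tokens - budget) / budget
    return base - penalty * overshoot
```

With these assumed values, a correct answer within budget keeps the full reward of 1.0, while a correct answer that uses twice its budget is reduced to 0.5.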

[29] LLMCARE: Alzheimer's Detection via Transformer Models Enhanced by LLM-Generated Synthetic Data

Ali Zolnour,Hossein Azadmaleki,Yasaman Haghbin,Fatemeh Taherinezhad,Mohamad Javad Momeni Nezhad,Sina Rashidi,Masoud Khani,AmirSajjad Taleban,Samin Mahdizadeh Sani,Maryam Dadkhah,James M. Noble,Suzanne Bakken,Yadollah Yaghoobzadeh,Abdol-Hossein Vahabie,Masoud Rouhizadeh,Maryam Zolnoori

Main category: cs.CL

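TL;DR: This study develops and evaluates a screening pipeline for detecting Alzheimer's disease and related dementias that combines transformer embeddings with linguistic features and augments the training data with synthetic speech generated by large language models.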

Details Motivation: Alzheimer's disease and related dementias affect about five million older adults in the U.S., and more than half of cases remain undiagnosed. Speech-based natural language processing offers a promising way to detect early cognitive decline through linguistic markers. Method: Using transcripts from the cookie-theft task in the DementiaBank dataset, the authors evaluate 10 transformer models and fuse embeddings from the best-performing transformer with linguistic features. They also augment the data with synthetic speech generated by large language models and test three multimodal models for speech-text classification. Result: The fusion model achieves F1 = 83.3 (AUC = 89.5), exceeding the linguistic-only and transformer-only models. Augmenting with MedAlpaca-7B synthetic speech raises F1 to 85.7. Fine-tuning substantially improves unimodal LLM classifiers (e.g., MedAlpaca's F1 rises from 47.3 to 78.5). Current multimodal models perform worse (e.g., GPT-4o F1 = 70.2, Qwen F1 = 66.0). Conclusion: Combining transformer embeddings with linguistic features improves ADRD detection. Clinically tuned LLMs effectively support both classification and data augmentation, but multimodal modeling needs further improvement. Abstract: Alzheimer's disease and related dementias (ADRD) affect approximately five million older adults in the U.S., yet over half remain undiagnosed. Speech-based natural language processing (NLP) offers a promising, scalable approach to detect early cognitive decline through linguistic markers. The objective is to develop and evaluate a screening pipeline that (i) fuses transformer embeddings with handcrafted linguistic features, (ii) tests data augmentation using synthetic speech generated by large language models (LLMs), and (iii) benchmarks unimodal and multimodal LLM classifiers for ADRD detection. Transcripts from the DementiaBank "cookie-theft" task (n = 237) were used. Ten transformer models were evaluated under three fine-tuning strategies. A fusion model combined embeddings from the top-performing transformer with 110 lexical-derived linguistic features. Five LLMs (LLaMA-8B/70B, MedAlpaca-7B, Ministral-8B, GPT-4o) were fine-tuned to generate label-conditioned synthetic speech, which was used to augment training data. Three multimodal models (GPT-4o, Qwen-Omni, Phi-4) were tested for speech-text classification in zero-shot and fine-tuned settings. The fusion model achieved F1 = 83.3 (AUC = 89.5), outperforming linguistic or transformer-only baselines. Augmenting training data with 2x MedAlpaca-7B synthetic speech increased F1 to 85.7. Fine-tuning significantly improved unimodal LLM classifiers (e.g., MedAlpaca: F1 = 47.3 -> 78.5). 
Current multimodal models demonstrated lower performance (GPT-4o = 70.2 F1; Qwen = 66.0). Performance gains aligned with the distributional similarity between synthetic and real speech. Integrating transformer embeddings with linguistic features enhances ADRD detection from speech. Clinically tuned LLMs effectively support both classification and data augmentation, while further advancement is needed in multimodal modeling.

[30] PREF: Reference-Free Evaluation of Personalised Text Generation in LLMs

Xiao Fu,Hossein A. Rahmani,Bin Wu,Jerome Ramos,Emine Yilmaz,Aldo Lipani

Main category: cs.CL

TL;DR: PREF is a personalized, reference-free evaluation framework for text generation that improves robustness, transparency, and alignment with user preferences, offering better performance than existing methods.

Details Motivation: Most evaluation methods for text generation ignore user individuality, so there is a need for a personalized, reference-free evaluation framework. Method: The paper introduces a three-step pipeline: coverage (using an LLM to generate universal quality guidelines), preference (personalizing the guidelines using user profiles and preferences), and scoring (applying an LLM judge to rate outputs against the personalized rubric). Result: Experiments on the PrefEval benchmark showed that PREF outperforms strong baselines in accuracy, calibration, and alignment with human judgments. Conclusion: PREF provides a reliable and scalable framework for evaluating personalized text generation by separating coverage from preference, enabling better assessment and development of personalized language generation systems. Abstract: Personalised text generation is essential for user-centric information systems, yet most evaluation methods overlook the individuality of users. We introduce PREF, a Personalised Reference-free Evaluation Framework that jointly measures general output quality and user-specific alignment without requiring gold personalised references. PREF operates in a three-step pipeline: (1) a coverage stage uses a large language model (LLM) to generate a comprehensive, query-specific guideline covering universal criteria such as factuality, coherence, and completeness; (2) a preference stage re-ranks and selectively augments these factors using the target user's profile, stated or inferred preferences, and context, producing a personalised evaluation rubric; and (3) a scoring stage applies an LLM judge to rate candidate answers against this rubric, ensuring baseline adequacy while capturing subjective priorities. This separation of coverage from preference improves robustness, transparency, and reusability, and allows smaller models to approximate the personalised quality of larger ones. 
Experiments on the PrefEval benchmark, including implicit preference-following tasks, show that PREF achieves higher accuracy, better calibration, and closer alignment with human judgments than strong baselines. By enabling scalable, interpretable, and user-aligned evaluation, PREF lays the groundwork for more reliable assessment and development of personalised language generation systems.
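The separation of coverage from preference can be illustrated with a small sketch. The LLM calls are replaced by plain inputs here, and the weighting scheme (a fixed `boost` for user-prioritised criteria, normalised weights, a weighted-average score) is an assumption made for illustration rather than the paper's rubric format.

```python
from typing import Callable, Dict, List

def build_rubric(universal: List[str], user_priorities: List[str],
                 boost: float = 2.0) -> Dict[str, float]:
    """Preference stage (assumed form): start from the universal criteria
    produced by the coverage stage, up-weight or add the criteria this
    user cares about, then normalise the weights to sum to 1."""
    weights = {criterion: 1.0 for criterion in universal}
    for criterion in user_priorities:
        weights[criterion] = weights.get(criterion, 1.0) * boost
    total = sum(weights.values())
    return {c: w / total for c, w in weights.items()}

def score_answer(rubric: Dict[str, float],
                 judge: Callable[[str], float]) -> float:
    """Scoring stage: `judge` stands in for an LLM judge that returns a
    score in [0, 1] for one criterion of a candidate answer."""
    return sum(weight * judge(criterion) for criterion, weight in rubric.items())
```

Because the universal criteria are computed once per query, only `build_rubric` needs to be rerun for a different user, which is one way to read the reusability claim in the abstract.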

[31] Latent Fusion Jailbreak: Blending Harmful and Harmless Representations to Elicit Unsafe LLM Outputs

Wenpeng Xing,Mohan Li,Chunqiang Hu,Haitao Xu,Ningyu Zhang,Bo Lin,Meng Han

Main category: cs.CL

TL;DR: This paper introduces Latent Fusion Jailbreak (LFJ), an effective attack on large language models, and proposes adversarial training as a defense that significantly reduces attack success rates.

Details Motivation: The motivation is to develop and defend against jailbreak attacks on large language models (LLMs) that can circumvent safety alignments, aiming to improve the robustness of these models. Method: The paper introduces Latent Fusion Jailbreak (LFJ), which uses hidden state interpolation from harmful and benign query pairs to launch attacks, and proposes adversarial training as a defense mechanism. Result: LFJ achieves an average attack success rate of 94.01% on models like Vicuna and LLaMA-2, outperforming existing methods. The adversarial training defense reduces ASR by over 80%. Conclusion: The paper concludes that LFJ is a highly effective jailbreak attack method, and the proposed adversarial training defense significantly reduces its attack success rate without affecting performance on benign inputs. Abstract: Large language models (LLMs) demonstrate impressive capabilities in various language tasks but are susceptible to jailbreak attacks that circumvent their safety alignments. This paper introduces Latent Fusion Jailbreak (LFJ), a representation-based attack that interpolates hidden states from harmful and benign query pairs to elicit prohibited responses. LFJ begins by selecting query pairs with high thematic and syntactic similarity, then performs gradient-guided interpolation at influential layers and tokens, followed by optimization to balance attack success, output fluency, and computational efficiency. Evaluations on models such as Vicuna and LLaMA-2 across benchmarks like AdvBench and MaliciousInstruct yield an average attack success rate (ASR) of 94.01%, outperforming existing methods. To mitigate LFJ, we propose an adversarial training defense that fine-tunes models on interpolated examples, reducing ASR by over 80% without degrading performance on benign inputs. Ablation studies validate the importance of query pair selection, hidden state interpolation components, and optimization strategies in LFJ's effectiveness.

[32] Inference-Aware Prompt Optimization for Aligning Black-Box Large Language Models

Saaduddin Mahmud,Mason Nakamura,Kyle H. Wray,Shlomo Zilberstein

Main category: cs.CL

TL;DR: This paper introduces IAPO and its PSST algorithm to jointly optimize prompts and inference strategies, highlighting the importance of inference-awareness in aligning black-box LLMs effectively.

Details Motivation: Existing prompt optimization methods ignore the inference strategy used during deployment, creating a methodological gap since there is a strong interdependence between prompt optimization and inference strategies. Method: The paper introduces a framework called IAPO that jointly optimizes prompts and inference scale while considering inference budget and task objectives. It also develops the PSST algorithm for fixed-budget training and analyzes its error probability guarantees. Result: The PSST algorithm, part of the IAPO framework, demonstrates the importance of inference-aware prompt optimization through evaluations on six diverse tasks, showing improved alignment of black-box LLMs. Conclusion: Incorporating inference-awareness in prompt optimization is crucial for aligning black-box LLMs effectively, as demonstrated by the PSST algorithm and its evaluation across multiple tasks. Abstract: Prompt optimization methods have demonstrated significant effectiveness in aligning black-box large language models (LLMs). In parallel, inference scaling strategies such as Best-of-N Sampling and Majority Voting have also proven to enhance alignment and performance by trading off computation. However, existing prompt optimization approaches are inference strategy agnostic; that is, they optimize prompts without regard to the inference strategy employed during deployment. This constitutes a significant methodological gap, as our empirical and theoretical analysis reveals a strong interdependence between these two paradigms. Moreover, we find that user preferences regarding trade-offs among multiple objectives and inference budgets substantially influence the choice of prompt and inference configuration. To address this gap, we introduce a unified novel framework named IAPO (Inference-Aware Prompt Optimization) that jointly optimizes the prompt and inference scale, while being aware of the inference budget and different task objectives. 
We then develop a fixed-budget training algorithm for IAPO, which we call PSST (Prompt Scaling via Sequential Trimming), and analyze finite-budget guarantees on error probability. Finally, we evaluate the effectiveness of PSST on six different tasks, including multi-objective text generation and reasoning, and demonstrate the critical role of incorporating inference-awareness when aligning black-box LLMs through prompt optimization.
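The abstract does not spell out PSST, so the sketch below uses generic sequential halving as a stand-in for a fixed-budget trimming procedure over joint (prompt, inference scale) configurations. It is not the paper's algorithm; the arm representation and the `pull` callback are hypothetical.

```python
import math
from typing import Callable, Hashable, List

def sequential_halving(arms: List[Hashable],
                       pull: Callable[[Hashable], float],
                       budget: int) -> Hashable:
    """Generic sequential halving under a fixed evaluation budget.

    Each arm is a (prompt, inference scale) pair; `pull` returns one noisy
    utility sample for an arm (for example, task reward minus inference
    cost). Each round splits its share of the budget evenly across the
    surviving arms and keeps the better half.
    """
    survivors = list(arms)
    rounds = max(1, math.ceil(math.log2(len(survivors))))
    for _ in range(rounds):
        if len(survivors) == 1:
            break
        pulls = max(1, budget // (len(survivors) * rounds))
        means = {a: sum(pull(a) for _ in range(pulls)) / pulls for a in survivors}
        survivors.sort(key=lambda a: means[a], reverse=True)
        survivors = survivors[: math.ceil(len(survivors) / 2)]
    return survivors[0]
```

Treating the prompt and the inference scale as one joint arm is the point of the illustration: a prompt that looks best with a single sample is not necessarily best with Best-of-N sampling, so the two are evaluated together.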

[33] The Cost of Thinking: Increased Jailbreak Risk in Large Language Models

Fan Yang

Main category: cs.CL

TL;DR: This paper discovers that LLMs with thinking mode are more vulnerable to Jailbreak attacks and proposes a method to improve security by guiding the internal thinking process with specific tokens.

Details Motivation: The authors aim to address the overlooked vulnerability of LLMs' thinking mode to Jailbreak attacks, which could have significant implications for the security and ethical use of LLMs. Method: The paper evaluates 9 LLMs on AdvBench and HarmBench to study the impact of Jailbreak attacks on thinking mode. It analyzes characteristics of attacked data and proposes a safe thinking intervention using 'specific thinking tokens' to guide LLMs' internal processes. Result: The study finds that thinking mode LLMs are more vulnerable to attacks, particularly when dealing with educational content or excessively long thinking lengths. The proposed safe thinking intervention effectively reduces the attack success rate. Conclusion: The paper concludes that thinking mode in LLMs is more susceptible to Jailbreak attacks and proposes a method to enhance security by using safe thinking interventions. Abstract: Thinking mode has always been regarded as one of the most valuable modes in LLMs. However, we uncover a surprising and previously overlooked phenomenon: LLMs with thinking mode are more easily broken by Jailbreak attacks. We evaluate 9 LLMs on AdvBench and HarmBench and find that the success rate of attacking thinking mode in LLMs is almost always higher than that of non-thinking mode. Through a large number of sample studies, we find that "for educational purposes" framing and excessively long thinking lengths are characteristic of successfully attacked data, and that LLMs also give harmful answers even when they mostly know that the questions are harmful. To alleviate these problems, this paper proposes a method of safe thinking intervention for LLMs, which explicitly guides the internal thinking processes of LLMs by adding "specific thinking tokens" of LLMs to the prompt. The results demonstrate that the safe thinking intervention can significantly reduce the attack success rate of LLMs with thinking mode.

[34] Reflect then Learn: Active Prompting for Information Extraction Guided by Introspective Confusion

Dong Zhao,Yadong Wang,Xiang Chen,Chenxi Wang,Hongliang Dai,Chuanxing Geng,Shengzhong Zhang,Shaoyuan Li,Sheng-Jun Huang

Main category: cs.CL

TL;DR: This paper proposes APIE, a new active prompting framework for information extraction that improves the performance of large language models by addressing both format and content uncertainties.

Details Motivation: The motivation stems from the sensitivity of LLM performance to in-context examples and the lack of guidance in conventional selection strategies that ignore format-related confusion in IE tasks. Method: The researchers introduced APIE, which uses a dual-component uncertainty metric to quantify Format Uncertainty and Content Uncertainty. It ranks unlabeled data to select the most informative examples for few-shot learning. Result: Experiments on four benchmarks showed that APIE outperforms strong baselines, improving extraction accuracy and robustness. Conclusion: The study concludes that APIE, an active prompting framework based on introspective confusion, enhances the performance of LLMs in IE tasks by addressing both format and content uncertainties. Abstract: Large Language Models (LLMs) show remarkable potential for few-shot information extraction (IE), yet their performance is highly sensitive to the choice of in-context examples. Conventional selection strategies often fail to provide informative guidance, as they overlook a key source of model fallibility: confusion stemming not just from semantic content, but also from the generation of well-structured formats required by IE tasks. To address this, we introduce Active Prompting for Information Extraction (APIE), a novel active prompting framework guided by a principle we term introspective confusion. Our method empowers an LLM to assess its own confusion through a dual-component uncertainty metric that uniquely quantifies both Format Uncertainty (difficulty in generating correct syntax) and Content Uncertainty (inconsistency in extracted semantics). By ranking unlabeled data with this comprehensive score, our framework actively selects the most challenging and informative samples to serve as few-shot exemplars. 
Extensive experiments on four benchmarks show that our approach consistently outperforms strong baselines, yielding significant improvements in both extraction accuracy and robustness. Our work highlights the critical importance of a fine-grained, dual-level view of model uncertainty when it comes to building effective and reliable structured generation systems.
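A dual-component uncertainty score of the kind described above might be computed as in the sketch below. The concrete definitions are assumptions: Format Uncertainty is taken as the fraction of sampled outputs that fail to parse as JSON, Content Uncertainty as disagreement among the outputs that do parse, and `alpha` is a hypothetical mixing weight.

```python
import json
from collections import Counter
from typing import List

def introspective_confusion(samples: List[str], alpha: float = 0.5) -> float:
    """Score one unlabeled input from several sampled LLM outputs.

    format_u:  share of samples that are not valid JSON (assumed proxy
               for difficulty in generating correct syntax).
    content_u: 1 minus the share of valid samples that agree with the
               most common extraction (assumed proxy for semantic
               inconsistency).
    """
    parsed = []
    for s in samples:
        try:
            # Canonicalise so key order does not count as disagreement.
            parsed.append(json.dumps(json.loads(s), sort_keys=True))
        except json.JSONDecodeError:
            pass
    format_u = 1 - len(parsed) / len(samples)
    if parsed:
        top_count = Counter(parsed).most_common(1)[0][1]
        content_u = 1 - top_count / len(parsed)
    else:
        content_u = 1.0  # nothing parseable: treat content as fully uncertain
    return alpha * format_u + (1 - alpha) * content_u

def select_exemplars(pool: dict, k: int) -> List[str]:
    """Pick the k inputs with the highest confusion to annotate as few-shot
    exemplars. `pool` maps each input text to its sampled outputs."""
    ranked = sorted(pool, key=lambda x: introspective_confusion(pool[x]), reverse=True)
    return ranked[:k]
```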

[35] mSCoRe: a Multilingual and Scalable Benchmark for Skill-based Commonsense Reasoning

Nghia Trung Ngo,Franck Dernoncourt,Thien Huu Nguyen

Main category: cs.CL

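TL;DR: This paper studies how reasoning-reinforced large language models perform on multilingual commonsense reasoning and proposes mSCoRe, a scalable benchmark.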

Details Motivation: Multilingual commonsense reasoning involves everyday knowledge across different languages and cultures, but the mechanisms by which models draw on different human reasoning skills remain under-studied. Method: The paper proposes mSCoRe, a multilingual and scalable benchmark for skill-based commonsense reasoning. Result: Experiments show that current models still face significant challenges at higher complexity levels. Conclusion: mSCoRe points to directions for improving multilingual commonsense reasoning capabilities. Abstract: Recent advancements in reasoning-reinforced Large Language Models (LLMs) have shown remarkable capabilities in complex reasoning tasks. However, the mechanism underlying their utilization of different human reasoning skills remains poorly investigated, especially for multilingual commonsense reasoning that involves everyday knowledge across different languages and cultures. To address this gap, we propose a Multilingual and Scalable Benchmark for Skill-based Commonsense Reasoning (mSCoRe). Our benchmark incorporates three key components that are designed to systematically evaluate LLMs' reasoning capabilities, including: (1) a novel taxonomy of reasoning skills that enables fine-grained analysis of models' reasoning processes, (2) a robust data synthesis pipeline tailored specifically for commonsense reasoning evaluation, and (3) a complexity scaling framework allowing task difficulty to scale dynamically alongside future improvements in LLM abilities. Extensive experiments on eight state-of-the-art LLMs of varying sizes and training approaches demonstrate that mSCoRe remains significantly challenging for current models, particularly at higher complexity levels. Our results reveal the limitations of such reasoning-reinforced models when confronted with nuanced multilingual general and cultural commonsense. We further provide detailed analysis on the models' reasoning processes, suggesting future directions for improving multilingual commonsense reasoning capabilities.

[36] Multi-Turn Puzzles: Evaluating Interactive Reasoning and Strategic Dialogue in LLMs

Kartikeya Badola,Jonathan Simon,Arian Hosseini,Sara Marie Mc Carthy,Tsendsuren Munkhdalai,Abhimanyu Goyal,Tomáš Kočiský,Shyam Upadhyay,Bahare Fatemi,Mehran Kazemi

Main category: cs.CL

TL;DR: This paper introduces a new benchmark for evaluating LLMs' abilities in multi-turn dialogue and reasoning with incomplete data, identifying key areas for improvement in current models.

Details Motivation: LLMs struggle with nuanced environments and interactive tasks common in real-world scenarios, necessitating the development of models that can effectively handle incomplete data and engage in logically consistent multi-turn dialogue. Method: A novel benchmark with multi-turn tasks was introduced to evaluate the reasoning, interactive dialogue, and information-seeking abilities of LLMs. Result: Evaluations of frontier models on the benchmark revealed notable deficiencies in instruction following, reasoning, and planning, highlighting the need for further research and development in these areas. Conclusion: The paper concludes that there is significant room for improvement in current LLMs regarding instruction following, reasoning, and planning within complex, interactive scenarios. Abstract: Large language models (LLMs) excel at solving problems with clear and complete statements, but often struggle with nuanced environments or interactive tasks which are common in most real-world scenarios. This highlights the critical need for developing LLMs that can effectively engage in logically consistent multi-turn dialogue, seek information and reason with incomplete data. To this end, we introduce a novel benchmark comprising a suite of multi-turn tasks each designed to test specific reasoning, interactive dialogue, and information-seeking abilities. These tasks have deterministic scoring mechanisms, thus eliminating the need for human intervention. Evaluating frontier models on our benchmark reveals significant headroom. Our analysis shows that most errors emerge from poor instruction following, reasoning failures, and poor planning. This benchmark provides valuable insights into the strengths and weaknesses of current LLMs in handling complex, interactive scenarios and offers a robust platform for future research aimed at improving these critical capabilities.

[37] LaajMeter: A Framework for LaaJ Evaluation

Gal Amram,Eitan Farchi,Shmulik Froimovich,Raviv Gal,Avi Ziv

Main category: cs.CL

TL;DR: This paper introduces LaaJMeter, a simulation-based framework for controlled meta-evaluation of LLM-as-a-Judge (LaaJ) systems, addressing challenges in domain-specific contexts where annotated data is scarce. It enables systematic analysis of evaluation metrics under realistic conditions, helping practitioners validate and refine LaaJs for specific tasks.

Details Motivation: LaaJs face challenges in domain-specific contexts due to scarce annotated data and costly expert evaluation. Metrics used in such cases are often not validated for specific domains, making it hard to identify effective evaluator performance. This work aims to address these challenges by introducing a framework for controlled meta-evaluation of LaaJs. Method: LaaJMeter uses a simulation-based framework to generate synthetic data representing virtual models and judges, enabling systematic analysis of evaluation metrics under realistic conditions. Result: The results highlight the limitations of common metrics and emphasize the importance of principled metric selection. The utility of LaaJMeter was demonstrated in a code translation task, showing how different metrics vary in sensitivity to evaluator quality. Conclusion: LaaJMeter provides a scalable and extensible solution for assessing LaaJs in low-resource settings, contributing to trustworthy and reproducible evaluation in NLP. Abstract: Large Language Models (LLMs) are increasingly used as evaluators in natural language processing tasks, a paradigm known as LLM-as-a-Judge (LaaJ). While effective in general domains, LaaJs pose significant challenges in domain-specific contexts, where annotated data is scarce and expert evaluation is costly. In such cases, meta-evaluation is often performed using metrics that have not been validated for the specific domain in which they are applied. As a result, it becomes difficult to determine which metrics effectively identify LaaJ quality, and further, what threshold indicates sufficient evaluator performance. In this work, we introduce LaaJMeter, a simulation-based framework for controlled meta-evaluation of LaaJs. LaaJMeter enables engineers to generate synthetic data representing virtual models and judges, allowing systematic analysis of evaluation metrics under realistic conditions. 
This helps practitioners validate and refine LaaJs for specific evaluation tasks: they can test whether their metrics correctly distinguish between better and worse (virtual) LaaJs, and estimate appropriate thresholds for evaluator adequacy. We demonstrate the utility of LaaJMeter in a code translation task involving a legacy programming language, showing how different metrics vary in sensitivity to evaluator quality. Our results highlight the limitations of common metrics and the importance of principled metric selection. LaaJMeter provides a scalable and extensible solution for assessing LaaJs in low-resource settings, contributing to the broader effort to ensure trustworthy and reproducible evaluation in NLP.
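The idea of testing whether a metric separates better from worse virtual judges can be sketched with a small simulation. The noise model (Gaussian noise added to a ground-truth quality score) and the choice of Pearson correlation as the metric under test are assumptions for illustration; they are not taken from LaaJMeter.

```python
import random
from typing import Dict, List

def pearson(xs: List[float], ys: List[float]) -> float:
    """Plain Pearson correlation, written out to keep the sketch stdlib-only."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

def simulate_judge(true_scores: List[float], noise_sd: float,
                   rng: random.Random) -> List[float]:
    """A virtual judge: true quality plus Gaussian noise (assumed model).
    Larger noise_sd means a worse judge."""
    return [t + rng.gauss(0.0, noise_sd) for t in true_scores]

def metric_by_judge(noise_levels: List[float], n_items: int = 500,
                    seed: int = 0) -> Dict[float, float]:
    """Evaluate the metric for virtual judges of known quality, so that its
    sensitivity to judge quality can be inspected."""
    rng = random.Random(seed)
    truth = [rng.random() for _ in range(n_items)]  # synthetic ground-truth quality
    return {sd: pearson(truth, simulate_judge(truth, sd, rng)) for sd in noise_levels}
```

Because judge quality is known by construction, the same setup could be used to estimate a threshold: the metric value below which a virtual judge is too noisy for the task at hand.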

[38] Estimating Machine Translation Difficulty

Lorenzo Proietti,Stefano Perrella,Vilém Zouhar,Roberto Navigli,Tom Kocmi

Main category: cs.CL

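TL;DR: This paper formalizes the task of translation difficulty estimation, defining difficulty by expected translation quality, introduces a metric for evaluating difficulty estimators, and uses the estimators to build more challenging machine translation benchmarks.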

Details Motivation: Machine translation quality is approaching near-perfect levels in some setups, making it hard to distinguish between models and to identify directions for improvement. Automatically identifying texts on which machine translation systems perform poorly can support more discriminative evaluations and guide future research. Method: The paper defines a measure of translation difficulty, proposes a new metric for evaluating difficulty estimators, and uses it to assess baselines and new approaches. It also constructs more challenging machine translation benchmarks. Result: Dedicated models (Sentinel-src) outperform heuristic-based methods (such as word rarity or syntactic complexity) and LLM-as-a-judge approaches on translation difficulty estimation. Conclusion: By building harder translation benchmarks, the work points to directions for improving future machine translation systems, and it releases two improved difficulty estimation models, Sentinel-src-24 and Sentinel-src-25. Abstract: Machine translation has begun achieving near-perfect translations in some setups. These high-quality outputs make it difficult to distinguish between state-of-the-art models and to identify areas for future improvement. Automatically identifying texts where machine translation systems struggle holds promise for developing more discriminative evaluations and guiding future research. We formalize the task of translation difficulty estimation, defining a text's difficulty based on the expected quality of its translations. We introduce a new metric to evaluate difficulty estimators and use it to assess both baselines and novel approaches. Finally, we demonstrate the practical utility of difficulty estimators by using them to construct more challenging machine translation benchmarks. Our results show that dedicated models (dubbed Sentinel-src) outperform both heuristic-based methods (e.g. word rarity or syntactic complexity) and LLM-as-a-judge approaches. We release two improved models for difficulty estimation, Sentinel-src-24 and Sentinel-src-25, which can be used to scan large collections of texts and select those most likely to challenge contemporary machine translation systems.

[39] Efficient Forward-Only Data Valuation for Pretrained LLMs and VLMs

Wenlong Deng,Jiaming Zhang,Qi Zeng,Christos Thrampoulidis,Boying Gong,Xiaoxiao Li

Main category: cs.CL

TL;DR: For-Value is a scalable, gradient-free framework for influence estimation in large language and vision-language models.

Details Motivation: The need for scalable and efficient influence estimation in billion-parameter models to enhance transparency and accountability. Method: For-Value uses a forward-only approach with a closed-form expression based on hidden representations and prediction errors. Result: For-Value matches or outperforms gradient-based methods in identifying impactful examples and detecting mislabeled data. Conclusion: For-Value offers an efficient and scalable solution for estimating the influence of training samples in large models without the need for gradient computations. Abstract: Quantifying the influence of individual training samples is essential for enhancing the transparency and accountability of large language models (LLMs) and vision-language models (VLMs). However, existing data valuation methods often rely on Hessian information or model retraining, making them computationally prohibitive for billion-parameter models. In this work, we introduce For-Value, a forward-only data valuation framework that enables scalable and efficient influence estimation for both LLMs and VLMs. By leveraging the rich representations of modern foundation models, For-Value computes influence scores using a simple closed-form expression based solely on a single forward pass, thereby eliminating the need for costly gradient computations. Our theoretical analysis demonstrates that For-Value accurately estimates per-sample influence by capturing alignment in hidden representations and prediction errors between training and validation samples. Extensive experiments show that For-Value matches or outperforms gradient-based baselines in identifying impactful fine-tuning examples and effectively detecting mislabeled data.
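A forward-only influence score of the kind described can be sketched as the product of two alignments, one between hidden representations and one between prediction errors. This simplified per-sample form is an assumption based on the abstract's wording, not the paper's exact closed-form expression; the function names are hypothetical.

```python
import math
from typing import List

def softmax(logits: List[float]) -> List[float]:
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def prediction_error(logits: List[float], label: int) -> List[float]:
    """One-hot target minus predicted probabilities."""
    probs = softmax(logits)
    return [(1.0 if i == label else 0.0) - p for i, p in enumerate(probs)]

def dot(a: List[float], b: List[float]) -> float:
    return sum(x * y for x, y in zip(a, b))

def influence_score(h_train: List[float], logits_train: List[float], y_train: int,
                    h_val: List[float], logits_val: List[float], y_val: int) -> float:
    """Assumed score: hidden-state alignment times prediction-error alignment.
    Everything here comes from a forward pass; no gradients are computed."""
    repr_alignment = dot(h_train, h_val)
    error_alignment = dot(prediction_error(logits_train, y_train),
                          prediction_error(logits_val, y_val))
    return repr_alignment * error_alignment
```

Under this form, a training sample that resembles a validation sample but carries a conflicting label gets a negative score, which is the kind of signal that could flag mislabeled data.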

[40] PakBBQ: A Culturally Adapted Bias Benchmark for QA

Abdullah Hashmat,Muhammad Arham Mirza,Agha Ali Raza

Main category: cs.CL

TL;DR: PakBBQ is a culturally and regionally adapted dataset for evaluating the fairness of large language models in low-resource settings.

Details Motivation: Most LLMs are trained and evaluated on Western-centric data, with little attention paid to low-resource languages and regional contexts, so a culturally and regionally adapted dataset is needed to evaluate LLM fairness. Method: The authors introduce PakBBQ, a culturally and regionally adapted extension of the original Bias Benchmark for Question Answering (BBQ) dataset, and evaluate multiple multilingual LLMs under ambiguous and explicitly disambiguated contexts and under negative versus non-negative question framings. Result: Experiments show an average accuracy gain of 12% with disambiguation, stronger counter-bias behavior in Urdu than in English, and marked framing effects that reduce stereotypical responses when questions are posed negatively. Conclusion: PakBBQ highlights the importance of contextualized benchmarks and simple prompt engineering strategies for mitigating bias in low-resource settings. Abstract: With the widespread adoption of Large Language Models (LLMs) across various applications, it is imperative to ensure their fairness across all user communities. However, most LLMs are trained and evaluated on Western-centric data, with little attention paid to low-resource languages and regional contexts. To address this gap, we introduce PakBBQ, a culturally and regionally adapted extension of the original Bias Benchmark for Question Answering (BBQ) dataset. PakBBQ comprises over 214 templates, 17180 QA pairs across 8 categories in both English and Urdu, covering eight bias dimensions including age, disability, appearance, gender, socio-economic status, religion, regional affiliation, and language formality that are relevant in Pakistan. We evaluate multiple multilingual LLMs under both ambiguous and explicitly disambiguated contexts, as well as negative versus non-negative question framings. Our experiments reveal (i) an average accuracy gain of 12% with disambiguation, (ii) consistently stronger counter-bias behaviors in Urdu than in English, and (iii) marked framing effects that reduce stereotypical responses when questions are posed negatively. These findings highlight the importance of contextualized benchmarks and simple prompt engineering strategies for bias mitigation in low-resource settings.

[41] Prompt-Response Semantic Divergence Metrics for Faithfulness Hallucination and Misalignment Detection in Large Language Models

Igor Halperin

Main category: cs.CL

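TL;DR: This paper introduces Semantic Divergence Metrics (SDM), a lightweight framework that detects faithfulness hallucinations by measuring the semantic divergence between prompts and responses across multiple answers and paraphrased prompts.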

Details Motivation: LLMs are prone to hallucinations, and existing methods such as Semantic Entropy test for arbitrariness only by measuring the diversity of answers to a single, fixed prompt. Method: SDM jointly clusters sentence embeddings to build a shared topic space for prompts and answers, measures response consistency across multiple answers and multiple semantically equivalent paraphrases of the prompt, and computes information-theoretic metrics of the divergence between prompts and responses. Result: The practical score $\mathcal{S}_H$ combines the Jensen-Shannon divergence and Wasserstein distance, with a high score indicating a faithfulness hallucination, and KL(Answer || Prompt) is identified as an indicator of Semantic Exploration. Conclusion: The metrics are combined into the Semantic Box, a diagnostic framework for classifying LLM response types, including the dangerous, confident confabulation. Abstract: The proliferation of Large Language Models (LLMs) is challenged by hallucinations, critical failure modes where models generate non-factual, nonsensical or unfaithful text. This paper introduces Semantic Divergence Metrics (SDM), a novel lightweight framework for detecting Faithfulness Hallucinations -- events of severe deviation of LLM responses from input contexts. We focus on a specific implementation of these LLM errors, confabulations, defined as responses that are arbitrary and semantically misaligned with the user's query. Existing methods like Semantic Entropy test for arbitrariness by measuring the diversity of answers to a single, fixed prompt. Our SDM framework improves upon this by being more prompt-aware: we test for a deeper form of arbitrariness by measuring response consistency not only across multiple answers but also across multiple, semantically-equivalent paraphrases of the original prompt. Methodologically, our approach uses joint clustering on sentence embeddings to create a shared topic space for prompts and answers. A heatmap of topic co-occurrences between prompts and responses can be viewed as a quantified two-dimensional visualization of the user-machine dialogue. We then compute a suite of information-theoretic metrics to measure the semantic divergence between prompts and responses. Our practical score, $\mathcal{S}_H$, combines the Jensen-Shannon divergence and Wasserstein distance to quantify this divergence, with a high score indicating a Faithfulness hallucination. Furthermore, we identify the KL divergence KL(Answer || Prompt) as a powerful indicator of Semantic Exploration, a key signal for distinguishing different generative behaviors. These metrics are further combined into the Semantic Box, a diagnostic framework for classifying LLM response types, including the dangerous, confident confabulation.
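
The two divergences named in this abstract have standard textbook definitions, and a small sketch may make them concrete. The code below computes the Jensen-Shannon divergence and KL(Answer || Prompt) for two discrete topic distributions. The distributions `prompt_topics` and `answer_topics` are hypothetical values chosen for illustration, and the function names are not taken from the paper. How the paper weights these terms together with the Wasserstein distance in $\mathcal{S}_H$ is not stated in the abstract, so that combination is not attempted here.

```python
import math


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) in bits for two discrete distributions over the same topics."""
    # eps guards against division by zero when q assigns zero mass to a topic.
    return sum(
        pi * math.log2((pi + eps) / (qi + eps))
        for pi, qi in zip(p, q)
        if pi > 0
    )


def js_divergence(p, q):
    """Jensen-Shannon divergence in bits: symmetric and bounded by 1."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


# Hypothetical topic distributions over three shared clusters (assumed values).
prompt_topics = [0.7, 0.2, 0.1]
answer_topics = [0.1, 0.2, 0.7]

print("JS(Prompt, Answer):", js_divergence(prompt_topics, answer_topics))
print("KL(Answer || Prompt):", kl_divergence(answer_topics, prompt_topics))
```

Identical distributions give a divergence of zero, and distributions with no shared topics give a Jensen-Shannon divergence of one bit, which is why the measure is convenient as a bounded score.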

[42] Understanding Textual Emotion Through Emoji Prediction

Ethan Gordon,Nishank Kuppa,Rigved Tummala,Sriram Anasuri

Main category: cs.CL

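TL;DR: The study uses deep learning models to predict emoji and finds that BERT performs best overall while CNN does better on rare emoji classes, underscoring the importance of model selection.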

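Details Motivation: To explore methods for predicting emoji from short text sequences in order to improve the emotional awareness of human-computer interaction. Method: Four deep learning architectures are used for emoji prediction (a feed-forward network, CNN, Transformer, and BERT), with class imbalance addressed through focal loss and regularization techniques. Result: BERT achieves the highest overall performance owing to its pre-training advantage, while CNN is more effective on rare emoji classes. Conclusion: BERT performs best overall and CNN excels on rare emoji classes, indicating that architecture selection and hyperparameter tuning are critical for sentiment-aware emoji prediction. Abstract: This project explores emoji prediction from short text sequences using four deep learning architectures: a feed-forward network, CNN, transformer, and BERT. Using the TweetEval dataset, we address class imbalance through focal loss and regularization techniques. Results show BERT achieves the highest overall performance due to its pre-training advantage, while CNN demonstrates superior efficacy on rare emoji classes. This research shows the importance of architecture selection and hyperparameter tuning for sentiment-aware emoji prediction, contributing to improved human-computer interaction.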
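
Focal loss, which this abstract uses against class imbalance, can be illustrated with a minimal sketch of its standard single-example form, -alpha * (1 - p_t)^gamma * log(p_t). The values `gamma=2.0` and `alpha=1.0` are assumed defaults for illustration; the abstract does not report the hyperparameters the authors used.

```python
import math


def cross_entropy(probs, target):
    """Standard cross-entropy for one example given class probabilities."""
    return -math.log(max(probs[target], 1e-12))


def focal_loss(probs, target, gamma=2.0, alpha=1.0):
    """Focal loss for one example: -alpha * (1 - p_t)^gamma * log(p_t).

    The (1 - p_t)^gamma factor shrinks the loss of well-classified examples,
    so hard examples (often from rare classes) dominate the gradient.
    """
    p_t = max(probs[target], 1e-12)  # clamp to avoid log(0)
    return -alpha * (1.0 - p_t) ** gamma * math.log(p_t)


easy = [0.9, 0.05, 0.05]  # confident, correct prediction for class 0
hard = [0.2, 0.5, 0.3]    # low probability on the true class 0

for name, probs in (("easy", easy), ("hard", hard)):
    print(name, "CE:", cross_entropy(probs, 0), "focal:", focal_loss(probs, 0))
```

With `gamma=0` the expression reduces to ordinary cross-entropy, which is a quick sanity check on any implementation.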

[43] Using Large Language Models to Measure Symptom Severity in Patients At Risk for Schizophrenia

Andrew X. Chen,Guillermo Horga,Sean Escola

Main category: cs.CL

TL;DR: This study demonstrates that large language models can accurately predict BPRS scores from clinical interviews of CHR patients, offering a promising tool for improving and standardizing symptom assessments, including in foreign languages and with longitudinal data integration.

Details Motivation: Patients at clinical high risk for schizophrenia require close symptom monitoring, but the BPRS assessment tool is not commonly used in clinical practice due to its lengthy structured interview requirement. This study explores the use of LLMs as a more efficient alternative. Method: The study used large language models (LLMs) to predict BPRS scores from clinical interview transcripts of 409 CHR patients from the AMP-SCZ cohort. The models were tested for zero-shot performance, accuracy in foreign languages, and longitudinal information integration using one-shot or few-shot learning approaches. Result: The zero-shot performance of LLM predictions showed a median concordance of 0.84 and ICC of 0.73 compared to true assessments, approaching human inter- and intra-rater reliability. For foreign languages, the median concordance was 0.88 with an ICC of 0.70, indicating high accuracy. Conclusion: The study concludes that large language models (LLMs) can effectively predict BPRS scores from clinical interview transcripts of CHR patients, showing potential to improve and standardize symptom assessments, including in foreign languages and through longitudinal data integration. Abstract: Patients who are at clinical high risk (CHR) for schizophrenia need close monitoring of their symptoms to inform appropriate treatments. The Brief Psychiatric Rating Scale (BPRS) is a validated, commonly used research tool for measuring symptoms in patients with schizophrenia and other psychotic disorders; however, it is not commonly used in clinical practice as it requires a lengthy structured interview. Here, we utilize large language models (LLMs) to predict BPRS scores from clinical interview transcripts in 409 CHR patients from the Accelerating Medicines Partnership Schizophrenia (AMP-SCZ) cohort. 
Despite the interviews not being specifically structured to measure the BPRS, the zero-shot performance of the LLM predictions compared to the true assessment (median concordance: 0.84, ICC: 0.73) approaches human inter- and intra-rater reliability. We further demonstrate that LLMs have substantial potential to improve and standardize the assessment of CHR patients via their accuracy in assessing the BPRS in foreign languages (median concordance: 0.88, ICC: 0.70), and integrating longitudinal information in a one-shot or few-shot learning approach.

[44] A Computational Approach to Analyzing Language Change and Variation in the Constructed Language Toki Pona

Daniel Huang,Hyoun-A Joo

Main category: cs.CL

TL;DR: This study uses computational and corpus-based methods to show that Toki Pona, a constructed language, evolves and varies like natural languages due to sociolinguistic factors.

Details Motivation: To understand how constructed languages like Toki Pona evolve and vary over time and across different usage contexts, similar to natural languages. Method: A computational and corpus-based approach was used to analyze features such as fluid word classes and transitivity in Toki Pona. Result: The study found that sociolinguistic factors influence Toki Pona similarly to natural languages, and that its usage patterns naturally evolve as communities interact with the language. Conclusion: Toki Pona, despite being a constructed language, experiences sociolinguistic influences and natural evolution in usage patterns similar to natural languages. Abstract: This study explores language change and variation in Toki Pona, a constructed language with approximately 120 core words. Taking a computational and corpus-based approach, the study examines features including fluid word classes and transitivity in order to examine (1) changes in preferences of content words for different syntactic positions over time and (2) variation in usage across different corpora. The results suggest that sociolinguistic factors influence Toki Pona in the same way as natural languages, and that even constructed linguistic systems naturally evolve as communities use them.

[45] Inductive Bias Extraction and Matching for LLM Prompts

Christian M. Angel,Francis Ferraro

Main category: cs.CL

TL;DR: The paper introduces an Inductive Bias Extraction and Matching strategy in prompt engineering that significantly enhances LLM performance in classification and ranking tasks.

Details Motivation: Prompt engineering is crucial due to LLMs' sensitivity to small changes in prompt wording, and part of this sensitivity can be attributed to the inductive bias in LLMs. Method: The strategy involves using the LLM's output as part of its prompt to better align with its inductive bias. Result: Empirical evidence shows that this strategy improves LLM Likert ratings for classification by up to 19% and for ranking by up to 27%. Conclusion: Using an Inductive Bias Extraction and Matching strategy significantly improves the effectiveness of prompts in LLM classification and ranking tasks. Abstract: The active research topic of prompt engineering makes it evident that LLMs are sensitive to small changes in prompt wording. A portion of this can be ascribed to the inductive bias that is present in the LLM. By using an LLM's output as a portion of its prompt, we can more easily create satisfactory wording for prompts. This has the effect of creating a prompt that matches the inductive bias in the model. Empirically, we show that using this Inductive Bias Extraction and Matching strategy improves LLM Likert ratings used for classification by up to 19% and LLM Likert ratings used for ranking by up to 27%.

[46] Yet another algorithmic bias: A Discursive Analysis of Large Language Models Reinforcing Dominant Discourses on Gender and Race

Gustavo Bonil,Simone Hashiguti,Jhessica Silva,João Gondim,Helena Maia,Nádia Silva,Helio Pedrini,Sandra Avila

Main category: cs.CL

TL;DR: This study highlights the importance of qualitative methods in identifying and addressing gender and racial biases in LLM outputs, revealing how models reinforce stereotypes and showing the need for critical approaches to ethical AI development.

Details Motivation: As LLMs become more sophisticated and widely used, it is crucial to assess whether they reproduce biases like discrimination and racialization. Existing quantitative methods are insufficient in capturing the nuanced emergence of biases in natural language. Method: The study uses a qualitative, discursive framework involving manual analysis of LLM-generated short stories featuring Black and white women to investigate biases. Result: The analysis revealed that LLMs tend to portray Black women as tied to ancestry and resistance, while white women are depicted in self-discovery processes. When prompted to correct biases, models provided superficial revisions that maintained problematic meanings, showing limitations in fostering inclusive narratives. Conclusion: The study concludes that qualitative methods are essential to identify nuanced gender and racial biases in LLM outputs, which quantitative methods often overlook. It emphasizes the importance of critical, interdisciplinary approaches to AI design to mitigate biases and promote ethical AI development. Abstract: With the advance of Artificial Intelligence (AI), Large Language Models (LLMs) have gained prominence and been applied in diverse contexts. As they evolve into more sophisticated versions, it is essential to assess whether they reproduce biases, such as discrimination and racialization, while maintaining hegemonic discourses. Current bias detection approaches rely mostly on quantitative, automated methods, which often overlook the nuanced ways in which biases emerge in natural language. This study proposes a qualitative, discursive framework to complement such methods. Through manual analysis of LLM-generated short stories featuring Black and white women, we investigate gender and racial biases. 
We contend that qualitative methods such as the one proposed here are fundamental to help both developers and users identify the precise ways in which biases manifest in LLM outputs, thus enabling better conditions to mitigate them. Results show that Black women are portrayed as tied to ancestry and resistance, while white women appear in self-discovery processes. These patterns reflect how language models replicate crystalized discursive representations, reinforcing essentialization and a sense of social immobility. When prompted to correct biases, models offered superficial revisions that maintained problematic meanings, revealing limitations in fostering inclusive narratives. Our results demonstrate the ideological functioning of algorithms and have significant implications for the ethical use and development of AI. The study reinforces the need for critical, interdisciplinary approaches to AI design and deployment, addressing how LLM-generated discourses reflect and perpetuate inequalities.

[47] ReviewRL: Towards Automated Scientific Review with RL

Sihang Zeng,Kai Tian,Kaiyan Zhang,Yuru wang,Junqi Gao,Runze Liu,Sa Yang,Jingxuan Li,Xinwei Long,Jiaheng Ma,Biqing Qi,Bowen Zhou

Main category: cs.CL

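TL;DR: ReviewRL is a reinforcement learning framework for automated scientific paper review that combines retrieval-augmented context generation, supervised fine-tuning, and reinforcement learning with a composite reward function to address the shortcomings of existing automated review methods.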

Details Motivation: Peer review faces growing submission volumes and reviewer fatigue, and existing automated review methods fall short in factual accuracy, rating consistency, and analytical depth. Method: ReviewRL combines a retrieval-augmented context generation pipeline, supervised fine-tuning, and a reinforcement learning procedure with a composite reward function. Result: Experiments on ICLR 2025 papers show that ReviewRL significantly outperforms existing methods on both rule-based metrics and model-based quality assessments. Conclusion: ReviewRL provides a foundational framework for RL-driven automatic critique generation in scientific discovery and shows potential for future development in this domain. Abstract: Peer review is essential for scientific progress but faces growing challenges due to increasing submission volumes and reviewer fatigue. Existing automated review approaches struggle with factual accuracy, rating consistency, and analytical depth, often generating superficial or generic feedback lacking the insights characteristic of high-quality human reviews. We introduce ReviewRL, a reinforcement learning framework for generating comprehensive and factually grounded scientific paper reviews. Our approach combines: (1) an ArXiv-MCP retrieval-augmented context generation pipeline that incorporates relevant scientific literature, (2) supervised fine-tuning that establishes foundational reviewing capabilities, and (3) a reinforcement learning procedure with a composite reward function that jointly enhances review quality and rating accuracy. Experiments on ICLR 2025 papers demonstrate that ReviewRL significantly outperforms existing methods across both rule-based metrics and model-based quality assessments. ReviewRL establishes a foundational framework for RL-driven automatic critique generation in scientific discovery, demonstrating promising potential for future development in this domain. The implementation of ReviewRL will be released on GitHub.

[48] From Surface to Semantics: Semantic Structure Parsing for Table-Centric Document Analysis

Xuan Li,Jialiang Dong,Raymond Wong

Main category: cs.CL

TL;DR: DOTABLER is a semantic document parsing framework that uncovers deep links between tables and their context, enabling precise analysis and outperforming advanced models like GPT-4o.

Details Motivation: Existing studies focus on surface-level tasks like table detection and data extraction but lack deep semantic parsing of tables and their contextual associations, limiting advanced data interpretation and analysis. Method: DOTABLER utilizes a custom dataset and domain-specific fine-tuning of pre-trained models, integrating a complete parsing pipeline to identify context segments semantically tied to tables. Result: Evaluated on nearly 4,000 pages with over 1,000 tables, DOTABLER achieves over 90% Precision and F1 scores, demonstrating its effectiveness in semantic analysis and document parsing. Conclusion: DOTABLER provides a superior table-centric semantic document parsing framework that enables deep semantic parsing of tables and their contextual associations, outperforming advanced models like GPT-4o in precision and effectiveness. Abstract: Documents are core carriers of information and knowledge, with broad applications in finance, healthcare, and scientific research. Tables, as the main medium for structured data, encapsulate key information and are among the most critical document components. Existing studies largely focus on surface-level tasks such as layout analysis, table detection, and data extraction, lacking deep semantic parsing of tables and their contextual associations. This limits advanced tasks like cross-paragraph data interpretation and context-consistent analysis. To address this, we propose DOTABLER, a table-centric semantic document parsing framework designed to uncover deep semantic links between tables and their context. DOTABLER leverages a custom dataset and domain-specific fine-tuning of pre-trained models, integrating a complete parsing pipeline to identify context segments semantically tied to tables. 
Built on this semantic understanding, DOTABLER implements two core functionalities: table-centric document structure parsing and domain-specific table retrieval, delivering comprehensive table-anchored semantic analysis and precise extraction of semantically relevant tables. Evaluated on nearly 4,000 pages with over 1,000 tables from real-world PDFs, DOTABLER achieves over 90% Precision and F1 scores, demonstrating superior performance in table-context semantic analysis and deep document parsing compared to advanced models such as GPT-4o.

[49] Beyond Semantic Understanding: Preserving Collaborative Frequency Components in LLM-based Recommendation

Minhao Wang,Yunhang He,Cong Xu,Zhangchi Zhu,Wei Zhang

Main category: cs.CL

TL;DR: FreLLM4Rec enhances LLM-based recommendation systems by balancing semantic and collaborative signals through spectral filtering and modulation techniques.

Details Motivation: LLM-based recommenders tend to overemphasize semantic correlations while weakening collaborative signals from user interaction history, limiting their performance compared to traditional models. Method: FreLLM4Rec utilizes a Global Graph Low-Pass Filter (G-LPF) to purify item embeddings and Temporal Frequency Modulation (TFM) to preserve collaborative signals during propagation in LLMs. Result: Experiments on four benchmark datasets show that FreLLM4Rec reduces collaborative signal attenuation and improves performance, with up to an 8.00% increase in NDCG@10 over the best baseline. Conclusion: FreLLM4Rec effectively balances semantic and collaborative information in LLM-based recommendation systems, mitigating collaborative signal attenuation and achieving competitive performance improvements. Abstract: Recommender systems in concert with Large Language Models (LLMs) present promising avenues for generating semantically-informed recommendations. However, LLM-based recommenders exhibit a tendency to overemphasize semantic correlations within users' interaction history. When taking pretrained collaborative ID embeddings as input, LLM-based recommenders progressively weaken the inherent collaborative signals as the embeddings propagate through LLM backbones layer by layer, as opposed to traditional Transformer-based sequential models in which collaborative signals are typically preserved or even enhanced for state-of-the-art performance. To address this limitation, we introduce FreLLM4Rec, an approach designed to balance semantic and collaborative information from a spectral perspective. Item embeddings that incorporate both semantic and collaborative information are first purified using a Global Graph Low-Pass Filter (G-LPF) to preliminarily remove irrelevant high-frequency noise. Temporal Frequency Modulation (TFM) then actively preserves collaborative signal layer by layer. 
Note that the collaborative preservation capability of TFM is theoretically guaranteed by establishing a connection between the optimal but hard-to-implement local graph Fourier filters and the suboptimal yet computationally efficient frequency-domain filters. Extensive experiments on four benchmark datasets demonstrate that FreLLM4Rec successfully mitigates collaborative signal attenuation and achieves competitive performance, with improvements of up to 8.00% in NDCG@10 over the best baseline. Our findings provide insights into how LLMs process collaborative information and offer a principled approach for improving LLM-based recommendation systems.
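
For readers unfamiliar with the NDCG@10 metric quoted above, the following is a minimal sketch of how NDCG@k is commonly computed. It assumes linear gain (the relevance value itself rather than 2^rel - 1), which coincides with the exponential form for binary relevance. The abstract does not describe the authors' exact evaluation protocol, so this is a generic illustration and not their implementation.

```python
import math


def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the top-k positions (linear gain)."""
    return sum(
        rel / math.log2(rank + 2)  # rank is 0-based, so position 1 divides by log2(2)
        for rank, rel in enumerate(relevances[:k])
    )


def ndcg_at_k(relevances, k=10):
    """NDCG@k: DCG of the given ranking divided by the DCG of the ideal ranking."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0


# Relevance of each recommended item in ranked order (1 = the held-out item).
print(ndcg_at_k([1, 0, 0, 0]))  # target ranked first
print(ndcg_at_k([0, 1, 0, 0]))  # target ranked second
```

With a single relevant item per user, as is typical in next-item recommendation, the score reduces to 1 / log2(position + 1) when the item appears in the top k and 0 otherwise.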

[50] Cross-Prompt Encoder for Low-Performing Languages

Beso Mikaberidze,Teimuraz Saghinadze,Simon Ostermann,Philipp Muller

Main category: cs.CL

TL;DR: This paper proposes the Cross-Prompt Encoder (XPE) and a Dual Soft Prompt mechanism to improve language transfer for low-performing languages and enhance multilingual adaptability.

Details Motivation: The motivation is to explore the broader potential of soft prompts for language transfer, particularly focusing on improving performance for low-performing languages that do not achieve good accuracy even with full-model fine-tuning. Method: The paper introduces the Cross-Prompt Encoder (XPE), which uses a lightweight encoding architecture with multi-source training on typologically diverse languages. It also proposes a Dual Soft Prompt mechanism that combines an encoder-based prompt with a directly trained standard soft prompt. Result: Experiments on the SIB-200 benchmark show that XPE is most effective for low-performing languages, while hybrid variants provide broader adaptability across multilingual settings. Conclusion: The paper concludes that the proposed Cross-Prompt Encoder (XPE) and the Dual Soft Prompt mechanism significantly enhance the performance of low-performing languages and offer broader adaptability in multilingual settings. Abstract: Soft prompts have emerged as a powerful alternative to adapters in parameter-efficient fine-tuning (PEFT), enabling large language models (LLMs) to adapt to downstream tasks without architectural changes or parameter updates. While prior work has focused on stabilizing training via parameter interaction in small neural prompt encoders, their broader potential for transfer across languages remains unexplored. In this paper, we demonstrate that a prompt encoder can play a central role in improving performance on low-performing languages-those that achieve poor accuracy even under full-model fine-tuning. We introduce the Cross-Prompt Encoder (XPE), which combines a lightweight encoding architecture with multi-source training on typologically diverse languages - a design that enables the model to capture abstract and transferable patterns across languages. To complement XPE, we propose a Dual Soft Prompt mechanism that combines an encoder-based prompt with a directly trained standard soft prompt. 
This hybrid design proves especially effective for target languages that benefit from both broadly shared structure and language-specific alignment. Experiments on the SIB-200 benchmark reveal a consistent trade-off: XPE is most effective for low-performing languages, while hybrid variants offer broader adaptability across multilingual settings.

[51] Making Qwen3 Think in Korean with Reinforcement Learning

Jungyup Lee,Jemin Kim,Sang Park,SeungJae Lee

Main category: cs.CL

TL;DR: This paper presents a two-stage fine-tuning approach to make the Qwen3 14B language model think natively in Korean, resulting in improved performance on Korean-language tasks and general reasoning ability.

Details Motivation: To make the large language model Qwen3 14B 'think' natively in Korean and improve its Korean-language tasks and general reasoning ability. Method: Two-stage fine-tuning approach: supervised fine-tuning on a Korean reasoning dataset followed by reinforcement learning with a customized GRPO algorithm. Result: Notable improvements in Korean-language tasks and some gains in general reasoning ability were achieved, along with stable learning and incremental performance gains. Conclusion: The final RL-tuned model demonstrates substantially improved results on advanced reasoning benchmarks while maintaining knowledge and language proficiency. Abstract: We present a two-stage fine-tuning approach to make the large language model Qwen3 14B "think" natively in Korean. In the first stage, supervised fine-tuning (SFT) on a high-quality Korean reasoning dataset establishes a strong foundation in Korean logical reasoning, yielding notable improvements in Korean-language tasks and even some gains in general reasoning ability. In the second stage, we employ reinforcement learning with a customized Group Relative Policy Optimization (GRPO) algorithm to further enhance both Korean reasoning alignment and overall problem-solving performance. We address critical stability challenges in GRPO training - such as reward hacking and policy collapse - by introducing an oracle judge model that calibrates the reward signal. Our approach achieves stable learning (avoiding the collapse observed in naive GRPO) and leads to steady, incremental performance gains. The final RL-tuned model demonstrates substantially improved results on advanced reasoning benchmarks (particularly math and coding tasks) while maintaining knowledge and language proficiency, successfully conducting its internal chain-of-thought entirely in Korean.
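
The group-relative step that gives GRPO its name can be sketched in a few lines. In the commonly described formulation, each sampled response to a prompt receives an advantage equal to its reward minus the group mean, divided by the group standard deviation. The code below shows only that normalization; the customized variant and the oracle judge model described in the abstract are not reproduced, and the reward values, the use of the population standard deviation, and the `eps` constant are assumptions for illustration.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of responses to the same prompt.

    Returns (r_i - mean) / (std + eps) for each reward, so responses are
    scored relative to their siblings rather than by a learned value function.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; an assumed choice
    return [(r - mean) / (std + eps) for r in rewards]


# Hypothetical rewards for four sampled responses to a single prompt.
rewards = [1.0, 0.0, 0.5, 0.5]
print(group_relative_advantages(rewards))
```

When every response in a group receives the same reward, all advantages are zero and the group contributes no learning signal, which is one reason a well-calibrated reward matters for stable training.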

[52] Advancing Cross-lingual Aspect-Based Sentiment Analysis with LLMs and Constrained Decoding for Sequence-to-Sequence Models

Jakub Šmíd,Pavel Přibáň,Pavel Král

Main category: cs.CL

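TL;DR: This paper proposes a new sequence-to-sequence method for cross-lingual ABSA that improves performance through constrained decoding and reduces reliance on translation tools, making it suitable for low-resource languages.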

Details Motivation: Existing cross-lingual ABSA research usually focuses on simpler tasks and relies heavily on external translation tools, so challenges remain for low-resource languages. Method: The paper uses a new sequence-to-sequence method with constrained decoding to handle compound ABSA tasks. Result: The method improves cross-lingual ABSA performance by up to 10%, broadens the scope of cross-lingual ABSA to more complex tasks, and offers a practical, efficient alternative to translation-dependent techniques. A comparison with large language models shows that fine-tuned multilingual LLMs can achieve comparable results, whereas English-centric LLMs struggle with these tasks. Conclusion: The paper proposes a new sequence-to-sequence method for compound ABSA tasks that improves cross-lingual performance through constrained decoding. Fine-tuned multilingual LLMs achieve comparable results, while English-centric LLMs perform poorly on these tasks. Abstract: Aspect-based sentiment analysis (ABSA) has made significant strides, yet challenges remain for low-resource languages due to the predominant focus on English. Current cross-lingual ABSA studies often centre on simpler tasks and rely heavily on external translation tools. In this paper, we present a novel sequence-to-sequence method for compound ABSA tasks that eliminates the need for such tools. Our approach, which uses constrained decoding, improves cross-lingual ABSA performance by up to 10%. This method broadens the scope of cross-lingual ABSA, enabling it to handle more complex tasks and providing a practical, efficient alternative to translation-dependent techniques. Furthermore, we compare our approach with large language models (LLMs) and show that while fine-tuned multilingual LLMs can achieve comparable results, English-centric LLMs struggle with these tasks.
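
Constrained decoding, the core technique here, restricts what a generative model may emit at each step. The sketch below shows the general idea with greedy decoding over per-step score dictionaries. Treating the allowed set as the input words plus the sentiment labels is an assumption made for illustration, since the abstract does not specify the constraints, and `step_scores` stands in for the outputs of a real sequence-to-sequence model. Practical implementations operate on token ids and mask logits rather than filtering dictionaries.

```python
def constrained_greedy_decode(step_scores, allowed_tokens):
    """Pick, at each step, the highest-scoring token among the allowed ones."""
    output = []
    for scores in step_scores:
        # Drop every candidate that is outside the allowed vocabulary.
        candidates = {tok: s for tok, s in scores.items() if tok in allowed_tokens}
        if not candidates:
            raise ValueError("no allowed token has a score at this step")
        output.append(max(candidates, key=candidates.get))
    return output


# Assumed constraint: only words from the input sentence or sentiment labels.
input_words = {"pizza", "was", "great"}
label_words = {"positive", "negative", "neutral"}
allowed = input_words | label_words

# Hypothetical per-step scores; the unconstrained best tokens are out of vocabulary.
step_scores = [
    {"pizza": 2.1, "burger": 2.5, "was": 0.3},
    {"positive": 1.7, "good": 1.9, "negative": 0.2},
]
print(constrained_greedy_decode(step_scores, allowed))  # ['pizza', 'positive']
```

Without the constraint the decoder would emit "burger" and "good", neither of which is a valid aspect term from the input or a valid label. This is the kind of error that constraining the output space is meant to rule out.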

[53] Large Language Models for Summarizing Czech Historical Documents and Beyond

Václav Tran,Jakub Šmíd,Jiří Martínek,Ladislav Lenc,Pavel Král

Main category: cs.CL

TL;DR: This paper explores the use of advanced language models for Czech text summarization, achieving state-of-the-art results and introducing a new dataset for historical Czech documents.

Details Motivation: Czech text summarization, particularly for historical documents, remains underexplored due to linguistic complexities and limited annotated datasets, creating a need for more research in this area. Method: The researchers employed large language models (Mistral and mT5) for Czech text summarization and evaluated their performance on existing and newly introduced datasets. Result: The study achieved new state-of-the-art results on the SumeCzech dataset and introduced a new dataset, Posel od Čerchova, for historical Czech document summarization with baseline results. Conclusion: The study concludes that the application of large language models like Mistral and mT5 can significantly advance Czech text summarization, especially for historical documents. Abstract: Text summarization is the task of shortening a larger body of text into a concise version while retaining its essential meaning and key information. While summarization has been significantly explored in English and other high-resource languages, Czech text summarization, particularly for historical documents, remains underexplored due to linguistic complexities and a scarcity of annotated datasets. Large language models such as Mistral and mT5 have demonstrated excellent results on many natural language processing tasks and languages. Therefore, we employ these models for Czech summarization, resulting in two key contributions: (1) achieving new state-of-the-art results on the modern Czech summarization dataset SumeCzech using these advanced models, and (2) introducing a novel dataset called Posel od Čerchova for summarization of historical Czech documents with baseline results. Together, these contributions offer great potential for advancing Czech text summarization and open new avenues for research in Czech historical text processing.

[54] Improving Generative Cross-lingual Aspect-Based Sentiment Analysis with Constrained Decoding

Jakub Šmíd,Pavel Přibáň,Pavel Král

Main category: cs.CL

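TL;DR: The paper proposes a cross-lingual aspect-based sentiment analysis (ABSA) method that does not rely on translation tools. Using constrained decoding, it improves cross-lingual performance and supports multi-tasking.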

Details Motivation: Low-resource languages are often overlooked in aspect-based sentiment analysis, and current cross-lingual ABSA methods depend on external translation tools and are limited to simpler tasks. Method: Constrained decoding is applied with sequence-to-sequence models for cross-lingual ABSA, and multi-task learning is used to improve performance. Result: The method surpasses the state of the art across seven languages and six ABSA tasks, improving performance on the most complex task by 5% on average and boosting multi-task results by more than 10%. Fine-tuned large language models also perform well, but at the cost of longer training and inference times. Conclusion: The study offers practical recommendations for cross-lingual ABSA, reveals the strengths and limitations of the approaches, and advances research in the field. Abstract: While aspect-based sentiment analysis (ABSA) has made substantial progress, challenges remain for low-resource languages, which are often overlooked in favour of English. Current cross-lingual ABSA approaches focus on limited, less complex tasks and often rely on external translation tools. This paper introduces a novel approach using constrained decoding with sequence-to-sequence models, eliminating the need for unreliable translation tools and improving cross-lingual performance by 5% on average for the most complex task. The proposed method also supports multi-tasking, which enables solving multiple ABSA tasks with a single model, with constrained decoding boosting results by more than 10%. We evaluate our approach across seven languages and six ABSA tasks, surpassing state-of-the-art methods and setting new benchmarks for previously unexplored tasks. Additionally, we assess large language models (LLMs) in zero-shot, few-shot, and fine-tuning scenarios. While LLMs perform poorly in zero-shot and few-shot settings, fine-tuning achieves competitive results compared to smaller multilingual models, albeit at the cost of longer training and inference times. We provide practical recommendations for real-world applications, enhancing the understanding of cross-lingual ABSA methodologies. This study offers valuable insights into the strengths and limitations of cross-lingual ABSA approaches, advancing the state-of-the-art in this challenging research domain.

[55] Jailbreaking Commercial Black-Box LLMs with Explicitly Harmful Prompts

Chiyu Zhang,Lu Zhou,Xiaogang Xu,Jiafei Wu,Liming Fang,Zhe Liu

Main category: cs.CL

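TL;DR: This paper proposes MDH, a malicious content detection framework that combines LLMs with human oversight, and introduces two new jailbreak attack strategies, D-Attack and DH-CoT.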

Details Motivation: Evaluating jailbreak attacks is difficult, especially when prompts are not overtly harmful or fail to elicit harmful outputs. Existing red-teaming datasets often contain such unsuitable prompts and need to be assessed and cleaned. Method: The authors propose MDH, a hybrid evaluation framework that combines LLM-based annotation with human oversight to detect malicious content, and introduce two new attack strategies, D-Attack and DH-CoT. Result: Well-crafted developer messages are found to significantly raise jailbreak success rates, which motivates the two new attack strategies. Conclusion: The paper contributes MDH, a hybrid evaluation framework that pairs LLM-based annotation with minimal human oversight for malicious content detection, together with the jailbreak strategies D-Attack and DH-CoT. Abstract: Evaluating jailbreak attacks is challenging when prompts are not overtly harmful or fail to induce harmful outputs. Unfortunately, many existing red-teaming datasets contain such unsuitable prompts. To evaluate attacks accurately, these datasets need to be assessed and cleaned for maliciousness. However, existing malicious content detection methods rely on either manual annotation, which is labor-intensive, or large language models (LLMs), which have inconsistent accuracy across harm types. To balance accuracy and efficiency, we propose a hybrid evaluation framework named MDH (Malicious content Detection based on LLMs with Human assistance) that combines LLM-based annotation with minimal human oversight, and apply it to dataset cleaning and detection of jailbroken responses. Furthermore, we find that well-crafted developer messages can significantly boost jailbreak success, leading us to propose two new strategies: D-Attack, which leverages context simulation, and DH-CoT, which incorporates hijacked chains of thought. The code, datasets, judgements, and detection results will be released in the GitHub repository: https://github.com/AlienZhang1996/DH-CoT.

[56] Layer-Wise Perturbations via Sparse Autoencoders for Adversarial Text Generation

Huizhen Shu,Xuying Li,Qirui Wang,Yuji Kosuga,Mengqiu Tian,Zhuo Li

Main category: cs.CL

TL;DR: This paper proposes a black-box attack method called Sparse Feature Perturbation Framework (SFPF) to generate adversarial texts that can bypass advanced NLP defenses by leveraging sparse autoencoders to identify and perturb critical text features.

Details Motivation: Generating adversarial examples to understand and improve the robustness of Large Language Models (LLMs) remains a key challenge, especially in jailbreaking scenarios. Method: The paper introduces the Sparse Feature Perturbation Framework (SFPF), which uses sparse autoencoders (SAE) to reconstruct hidden layer representations, performs feature clustering on attacked texts to identify highly activated features, and selectively perturbs these features to generate adversarial texts. Result: Adversarial texts generated by SFPF were shown to bypass state-of-the-art defense mechanisms, demonstrating persistent vulnerabilities in current NLP systems. Conclusion: SFPF provides a new red-teaming strategy that balances adversarial effectiveness with safety alignment; however, its effectiveness varies across prompts and layers, and its generalizability to other architectures and larger models remains to be validated. Abstract: With the rapid proliferation of Natural Language Processing (NLP), especially Large Language Models (LLMs), generating adversarial examples to jailbreak LLMs remains a key challenge for understanding model vulnerabilities and improving robustness. In this context, we propose a new black-box attack method that leverages the interpretability of large models. We introduce the Sparse Feature Perturbation Framework (SFPF), a novel approach for adversarial text generation that utilizes sparse autoencoders to identify and manipulate critical features in text. After using the SAE model to reconstruct hidden layer representations, we perform feature clustering on the successfully attacked texts to identify features with higher activations. These highly activated features are then perturbed to generate new adversarial texts. This selective perturbation preserves the malicious intent while amplifying safety signals, thereby increasing their potential to evade existing defenses. 
Our method enables a new red-teaming strategy that balances adversarial effectiveness with safety alignment. Experimental results demonstrate that adversarial texts generated by SFPF can bypass state-of-the-art defense mechanisms, revealing persistent vulnerabilities in current NLP systems. However, the method's effectiveness varies across prompts and layers, and its generalizability to other architectures and larger models remains to be validated.

[57] ComoRAG: A Cognitive-Inspired Memory-Organized RAG for Stateful Long Narrative Reasoning

Juyuan Wang,Rongchen Zhao,Wei Wei,Yufeng Wang,Mo Yu,Jie Zhou,Jin Xu,Liyan Xu

Main category: cs.CL

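TL;DR: This paper proposes ComoRAG, which emulates the dynamic memory consolidation of human cognition to reason efficiently over long narratives, significantly outperforming traditional RAG methods.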

Details Motivation: The single-step retrieval of traditional RAG methods struggles to capture dynamically evolving relations across a long context, while LLMs reason less reliably over long text and are computationally expensive. Method: ComoRAG runs iterative reasoning cycles over a dynamic memory workspace, generating probing queries step by step and integrating newly retrieved evidence into a global memory pool. Result: On four long-context narrative benchmarks, ComoRAG improves on the strongest baseline by up to 11%. Conclusion: ComoRAG offers a new reasoning paradigm based on memory consolidation that is particularly suited to complex queries requiring global comprehension, advancing retrieval-based long-context understanding. Abstract: Narrative comprehension on long stories and novels has been a challenging domain owing to their intricate plotlines and entangled, often evolving relations among characters and entities. Given the LLM's diminished reasoning over extended context and high computational cost, retrieval-based approaches retain a pivotal role in practice. However, traditional RAG methods can fall short due to their stateless, single-step retrieval process, which often overlooks the dynamic nature of capturing interconnected relations within long-range context. In this work, we propose ComoRAG, holding the principle that narrative reasoning is not a one-shot process, but a dynamic, evolving interplay between new evidence acquisition and past knowledge consolidation, analogous to human cognition when reasoning with memory-related signals in the brain. Specifically, when encountering a reasoning impasse, ComoRAG undergoes iterative reasoning cycles while interacting with a dynamic memory workspace. In each cycle, it generates probing queries to devise new exploratory paths, then integrates the retrieved evidence of new aspects into a global memory pool, thereby supporting the emergence of a coherent context for the query resolution. Across four challenging long-context narrative benchmarks (200K+ tokens), ComoRAG outperforms strong RAG baselines with consistent relative gains up to 11% compared to the strongest baseline. Further analysis reveals that ComoRAG is particularly advantageous for complex queries requiring global comprehension, offering a principled, cognitively motivated paradigm for retrieval-based long context comprehension towards stateful reasoning. 
Our code is publicly released at https://github.com/EternityJune25/ComoRAG
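
The cycle described in the abstract (attempt an answer, and on an impasse generate probing queries, retrieve, and fold the new evidence into a shared memory pool) can be sketched as a simple loop. This is only a schematic reading of the abstract, not the released ComoRAG code: the function name, the three callables, and the toy index are hypothetical, and in a real system the callables would be backed by an LLM and a retriever.

```python
from typing import Callable, Dict, List, Optional


def iterative_memory_rag(
    question: str,
    try_answer: Callable[[str, List[str]], Optional[str]],
    make_probes: Callable[[str, List[str]], List[str]],
    retrieve: Callable[[str], List[str]],
    max_cycles: int = 3,
) -> Optional[str]:
    """Stateful retrieval loop: memory persists and grows across cycles."""
    memory: List[str] = list(retrieve(question))  # ordinary one-shot retrieval
    for _ in range(max_cycles):
        answer = try_answer(question, memory)
        if answer is not None:
            return answer
        # Impasse: explore new aspects with probing queries and merge
        # whatever they retrieve into the global memory pool.
        for probe in make_probes(question, memory):
            for evidence in retrieve(probe):
                if evidence not in memory:
                    memory.append(evidence)
    return try_answer(question, memory)


# Toy stand-ins (hypothetical) so the loop can be run end to end.
INDEX: Dict[str, List[str]] = {
    "Who married the blacksmith?": ["The blacksmith is Joe."],
    "Who is Joe's wife?": ["Joe's wife is Mrs. Joe."],
}


def toy_retrieve(query: str) -> List[str]:
    return INDEX.get(query, [])


def toy_try_answer(question: str, memory: List[str]) -> Optional[str]:
    return "Mrs. Joe" if "Joe's wife is Mrs. Joe." in memory else None


def toy_make_probes(question: str, memory: List[str]) -> List[str]:
    return ["Who is Joe's wife?"] if "The blacksmith is Joe." in memory else []


result = iterative_memory_rag(
    "Who married the blacksmith?", toy_try_answer, toy_make_probes, toy_retrieve
)
```

In the toy run, the first retrieval alone cannot answer the question; the second cycle succeeds only because evidence from the probing query was kept in memory, which is the contrast with stateless single-step retrieval that the abstract draws.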

[58] Evaluating LLMs on Chinese Idiom Translation

Cai Yang,Yao Dou,David Heineman,Xiaofeng Wu,Wei Xu

Main category: cs.CL

TL;DR: This study introduces IdiomEval, a framework for evaluating and improving Chinese idiom translation. The researchers found that modern systems, including advanced models like GPT-4 and Google Translate, struggle with Chinese idiom translation, producing incorrect, literal, partial, or missing translations. They also found that existing evaluation metrics are not effective in measuring idiom translation quality. The improved models developed by the researchers achieved F1 scores of 0.68 for detecting idiom translation errors.

Details Motivation: Chinese idioms are common in everyday language and often contain historical references and follow specific structural patterns. Their figurative meanings usually differ from their literal interpretations, making them difficult to translate accurately. Despite recent progress in machine translation with large language models, little is known about Chinese idiom translation. This study aims to fill this gap by introducing a framework for evaluating and improving Chinese idiom translation. Method: The researchers introduced IdiomEval, a framework with a comprehensive error taxonomy for Chinese idiom translation. They annotated 900 translation pairs from nine modern systems across four domains: web, news, Wikipedia, and social media. They also evaluated existing metrics and developed improved models for detecting idiom translation errors. Result: Modern systems, including GPT-4 and Google Translate, struggle with Chinese idiom translation, producing incorrect, literal, partial, or missing translations. Even the best-performing system, GPT-4, makes errors in 28% of cases. Existing evaluation metrics have a low Pearson correlation (below 0.48) with human ratings, indicating poor performance in measuring idiom translation quality. The improved models developed by the researchers achieved F1 scores of 0.68 for detecting idiom translation errors. Conclusion: Despite recent progress in machine translation, Chinese idiom translation remains a challenge for modern systems, including GPT-4, GPT-4o, and Google Translate, which often produce incorrect, literal, partial, or missing translations. Existing evaluation metrics are not effective in measuring idiom translation quality. The authors developed improved models that achieve better performance in detecting idiom translation errors. 
Abstract: Idioms, whose figurative meanings usually differ from their literal interpretations, are common in everyday language, especially in Chinese, where they often contain historical references and follow specific structural patterns. Despite recent progress in machine translation with large language models, little is known about Chinese idiom translation. In this work, we introduce IdiomEval, a framework with a comprehensive error taxonomy for Chinese idiom translation. We annotate 900 translation pairs from nine modern systems, including GPT-4o and Google Translate, across four domains: web, news, Wikipedia, and social media. We find these systems fail at idiom translation, producing incorrect, literal, partial, or even missing translations. The best-performing system, GPT-4, makes errors in 28% of cases. We also find that existing evaluation metrics measure idiom quality poorly with Pearson correlation below 0.48 with human ratings. We thus develop improved models that achieve F1 scores of 0.68 for detecting idiom translation errors.

[59] Computational Economics in Large Language Models: Exploring Model Behavior and Incentive Design under Resource Constraints

Sandeep Reddy,Kabir Khan,Rohit Patil,Ananya Chakraborty,Faizan A. Khan,Swati Kulkarni,Arjun Verma,Neha Singh

Main category: cs.CL

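TL;DR: This paper proposes a "computational economics" framework that treats a large language model (LLM) as an internal economy of resource-constrained agents (attention heads and neuron blocks) and uses an incentive-driven training paradigm to obtain more efficient, adaptive, and transparent models.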

Details Motivation: Large language models are limited by their high computational cost, so a new approach is needed to optimize efficiency and performance under resource constraints. Method: The authors propose a "computational economics" framework that treats computation as a scarce resource allocated to maximize task utility, and add a differentiable computation cost term to the task loss to encourage sparse, efficient activations. Result: On GLUE and WikiText-103, the method reduces FLOPS by roughly 40% at similar accuracy, lowers latency, and yields more interpretable attention patterns. Conclusion: Economic principles offer a principled route to designing efficient, adaptive, and transparent LLMs under strict resource constraints. Abstract: Large language models (LLMs) are limited by substantial computational cost. We introduce a "computational economics" framework that treats an LLM as an internal economy of resource-constrained agents (attention heads and neuron blocks) that must allocate scarce computation to maximize task utility. First, we show empirically that when computation is scarce, standard LLMs reallocate attention toward high-value tokens while preserving accuracy. Building on this observation, we propose an incentive-driven training paradigm that augments the task loss with a differentiable computation cost term, encouraging sparse and efficient activations. On GLUE (MNLI, STS-B, CoLA) and WikiText-103, the method yields a family of models that trace a Pareto frontier and consistently dominate post-hoc pruning; for a similar accuracy we obtain roughly a forty percent reduction in FLOPS and lower latency, together with more interpretable attention patterns. These results indicate that economic principles offer a principled route to designing efficient, adaptive, and more transparent LLMs under strict resource constraints.
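
One way to picture "augmenting the task loss with a differentiable computation cost term" is with soft gates on each compute unit. The abstract does not give the actual formulation, so everything below is an assumption: the sigmoid gates, the per-unit cost vector, the weight `lam`, and the function name are illustrative choices, and the gradient returned covers only the cost term.

```python
import math
from typing import List, Sequence, Tuple


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def loss_with_compute_cost(
    task_loss: float,
    gate_logits: Sequence[float],
    unit_costs: Sequence[float],
    lam: float,
) -> Tuple[float, List[float]]:
    """Return (total loss, gradient of the cost term w.r.t. each gate logit).

    Assumed form: total = task_loss + lam * sum_i sigmoid(z_i) * cost_i,
    where z_i is the gate logit of unit i (e.g. an attention head) and
    cost_i is that unit's compute cost (e.g. FLOPs).
    """
    gates = [sigmoid(z) for z in gate_logits]
    expected_cost = sum(g * c for g, c in zip(gates, unit_costs))
    total = task_loss + lam * expected_cost
    # d/dz [lam * sigmoid(z) * c] = lam * c * sigmoid(z) * (1 - sigmoid(z))
    cost_grads = [lam * c * g * (1.0 - g) for g, c in zip(gates, unit_costs)]
    return total, cost_grads


total, cost_grads = loss_with_compute_cost(
    task_loss=1.0, gate_logits=[0.0, 0.0], unit_costs=[2.0, 4.0], lam=0.1
)
```

Because the cost gradient is always positive and scales with each unit's cost, gradient descent pushes expensive gates toward zero unless the task loss pushes back, which is the kind of pressure toward sparse activations that the abstract describes. Sweeping `lam` would trade accuracy against compute.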

[60] DiFaR: Enhancing Multimodal Misinformation Detection with Diverse, Factual, and Relevant Rationales

Herun Wan,Jiaying Wu,Minnan Luo,Xiangzheng Kong,Zihan Ma,Zhi Zeng

Main category: cs.CL

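TL;DR: DiFaR is a framework for improving misinformation detection; it generates diverse, factual, and relevant rationales to overcome the limitations of existing methods.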

Details Motivation: The paradigm of generating textual rationales to support trainable multimodal misinformation detectors is limited by insufficient diversity in the generated rationales, factual errors caused by hallucination, and irrelevant or conflicting content that introduces noise. Method: DiFaR uses five chain-of-thought prompts to elicit varied reasoning traces from LVLMs and adds a lightweight post-hoc filtering module that selects rationale sentences by sentence-level factuality and relevance scores. Result: Experiments on four popular benchmarks show that DiFaR outperforms four baseline categories by up to 5.9% and boosts existing detectors by as much as 8.7%. Conclusion: DiFaR is an effective framework that improves misinformation detection accuracy and performs well across multiple benchmarks. Abstract: Generating textual rationales from large vision-language models (LVLMs) to support trainable multimodal misinformation detectors has emerged as a promising paradigm. However, its effectiveness is fundamentally limited by three core challenges: (i) insufficient diversity in generated rationales, (ii) factual inaccuracies due to hallucinations, and (iii) irrelevant or conflicting content that introduces noise. We introduce DiFaR, a detector-agnostic framework that produces diverse, factual, and relevant rationales to enhance misinformation detection. DiFaR employs five chain-of-thought prompts to elicit varied reasoning traces from LVLMs and incorporates a lightweight post-hoc filtering module to select rationale sentences based on sentence-level factuality and relevance scores. Extensive experiments on four popular benchmarks demonstrate that DiFaR outperforms four baseline categories by up to 5.9% and boosts existing detectors by as much as 8.7%. Both automatic metrics and human evaluations confirm that DiFaR significantly improves rationale quality across all three dimensions.
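
The post-hoc filtering step can be illustrated with a few lines of code. This is a sketch under stated assumptions rather than DiFaR's actual module: the abstract says only that sentences are selected by sentence-level factuality and relevance scores, so the threshold rule, the default values, and the names `ScoredSentence` and `filter_rationale` are hypothetical, and the scores are taken as given inputs.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ScoredSentence:
    text: str
    factuality: float  # assumed to lie in [0, 1], produced by some scorer
    relevance: float   # assumed to lie in [0, 1], produced by some scorer


def filter_rationale(
    sentences: List[ScoredSentence],
    min_factuality: float = 0.5,
    min_relevance: float = 0.5,
) -> List[str]:
    """Keep sentences that clear both thresholds, preserving their order."""
    return [
        s.text
        for s in sentences
        if s.factuality >= min_factuality and s.relevance >= min_relevance
    ]


# Sentences pooled from several reasoning traces (made-up example values).
pooled = [
    ScoredSentence("The image shows a flooded street.", 0.9, 0.8),
    ScoredSentence("The photo was taken on Mars.", 0.1, 0.7),
    ScoredSentence("Floods are common in many countries.", 0.9, 0.2),
]
kept = filter_rationale(pooled)
```

Pooling sentences from several prompts addresses diversity, while the two thresholds target the other two failure modes named in the abstract: the second sentence is dropped as likely hallucinated and the third as off-topic.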

[61] When Explainability Meets Privacy: An Investigation at the Intersection of Post-hoc Explainability and Differential Privacy in the Context of Natural Language Processing

Mahdi Dhaini,Stephen Meisenbacher,Ege Erdogan,Florian Matthes,Gjergji Kasneci

Main category: cs.CL

TL;DR: This paper investigates the relationship between privacy and explainability in NLP, showing that they can potentially coexist, depending on the task and methods used.

Details Motivation: There is growing interest in both explainable NLP and privacy-preserving NLP, but little research has explored the intersection of these two fields, leaving a gap in understanding whether they can coexist or if there is a trade-off between them. Method: Empirical investigation guided by Differential Privacy (DP) and Post-hoc Explainability methods. Result: The study reveals an intricate relationship between privacy and explainability in NLP, showing that their coexistence depends on several factors, including the nature of the task and the methods chosen for privatization and explanation. Conclusion: Privacy and explainability in NLP can co-exist, but their relationship is complex and influenced by factors such as the downstream task and the methods used for text privatization and explainability. Practical recommendations are provided for future research at this intersection. Abstract: In the study of trustworthy Natural Language Processing (NLP), a number of important research fields have emerged, including that of explainability and privacy. While research interest in both explainable and privacy-preserving NLP has increased considerably in recent years, there remains a lack of investigation at the intersection of the two. This leaves a considerable gap in understanding of whether achieving both explainability and privacy is possible, or whether the two are at odds with each other. In this work, we conduct an empirical investigation into the privacy-explainability trade-off in the context of NLP, guided by the popular overarching methods of Differential Privacy (DP) and Post-hoc Explainability. Our findings include a view into the intricate relationship between privacy and explainability, which is formed by a number of factors, including the nature of the downstream task and choice of the text privatization and explainability method. 
In this, we highlight the potential for privacy and explainability to co-exist, and we summarize our findings in a collection of practical recommendations for future work at this important intersection.

[62] When Language Overrules: Revealing Text Dominance in Multimodal Large Language Models

Huyu Wu,Meng Tang,Xinhan Zheng,Haiyun Jiang

Main category: cs.CL

TL;DR: This paper investigates the problem of text dominance in Multimodal Large Language Models (MLLMs), proposes evaluation metrics to quantify this issue, identifies its root causes, and introduces a token compression method to address the imbalance in model attention across modalities.

Details Motivation: Multimodal Large Language Models (MLLMs) face a core problem called text dominance, where they heavily rely on textual inputs and underutilize other modalities. This study aims to systematically investigate this issue and propose solutions for more balanced multimodal processing. Method: The researchers proposed two evaluation metrics, the Modality Dominance Index (MDI) and the Attention Efficiency Index (AEI), and conducted a systematic investigation of text dominance across various modalities. They also introduced a token compression method to address the imbalance in model attention. Result: The analysis showed that text dominance is significant and widespread across different modalities. Applying the proposed token compression method to LLaVA-7B significantly reduced its MDI from 10.23 to 0.86, indicating a more balanced attention distribution. Conclusion: The study provides a foundation for developing more balanced and comprehensive multimodal language models by addressing the issue of text dominance through a proposed token compression method. Abstract: Multimodal Large Language Models (MLLMs) have demonstrated remarkable capabilities across a diverse range of multimodal tasks. However, these models suffer from a core problem known as text dominance: they depend heavily on text for their inference, while underutilizing other modalities. Prior work has acknowledged this phenomenon in vision-language tasks, often attributing it to data biases or model architectures. In this paper, we conduct the first systematic investigation of text dominance across diverse data modalities, including images, videos, audio, time-series, and graphs. To measure this imbalance, we propose two evaluation metrics: the Modality Dominance Index (MDI) and the Attention Efficiency Index (AEI). Our comprehensive analysis reveals that text dominance is both significant and pervasive across all tested modalities. 
Our in-depth analysis identifies three underlying causes: attention dilution from severe token redundancy in non-textual modalities, the influence of fusion architecture design, and task formulations that implicitly favor textual inputs. Furthermore, we propose a simple token compression method that effectively rebalances model attention. Applying this method to LLaVA-7B, for instance, drastically reduces its MDI from 10.23 to a well-balanced value of 0.86. Our analysis and methodological framework offer a foundation for the development of more equitable and comprehensive multimodal language models.

[63] eDIF: A European Deep Inference Fabric for Remote Interpretability of LLM

Irma Heithoff,Marc Guggenberger,Sandra Kalogiannis,Susanne Mayer,Fabian Maag,Sigurd Schacht,Carsten Lanquillon

Main category: cs.CL

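TL;DR: This paper examines the feasibility of the European Deep Inference Fabric (eDIF), an infrastructure intended to support mechanistic interpretability research on large language models. A pilot study validates the platform's technical performance and scientific utility and points to directions for future development.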

Details Motivation: The work responds to the need for widely accessible LLM interpretability infrastructure in Europe, with the aim of democratizing advanced model analysis capabilities for the research community. Method: The paper describes a GPU-based cluster hosted at Ansbach University of Applied Sciences that enables remote model inspection via the NNsight API, and reports a structured pilot study in which 16 European researchers evaluated the platform's technical performance, usability, and scientific utility. Result: The pilot study showed a gradual increase in user engagement, stable platform performance, and a positive reception of the remote experimentation capabilities. It also marked the starting point for building a user community around the platform. Limitations such as long download times for activation data and intermittent execution interruptions are noted, and a roadmap for future development is proposed. Conclusion: Establishing eDIF is an important step toward widely accessible LLM interpretability infrastructure in Europe and lays the groundwork for broader deployment, expanded tooling, and sustained community collaboration. Abstract: This paper presents a feasibility study on the deployment of a European Deep Inference Fabric (eDIF), an NDIF-compatible infrastructure designed to support mechanistic interpretability research on large language models. The need for widespread accessibility of LLM interpretability infrastructure in Europe drives this initiative to democratize advanced model analysis capabilities for the research community. The project introduces a GPU-based cluster hosted at Ansbach University of Applied Sciences and interconnected with partner institutions, enabling remote model inspection via the NNsight API. A structured pilot study involving 16 researchers from across Europe evaluated the platform's technical performance, usability, and scientific utility. Users conducted interventions such as activation patching, causal tracing, and representation analysis on models including GPT-2 and DeepSeek-R1-70B. The study revealed a gradual increase in user engagement, stable platform performance throughout, and a positive reception of the remote experimentation capabilities. It also marked the starting point for building a user community around the platform. Identified limitations such as prolonged download durations for activation data as well as intermittent execution interruptions are addressed in the roadmap for future development. 
This initiative marks a significant step towards widespread accessibility of LLM interpretability infrastructure in Europe and lays the groundwork for broader deployment, expanded tooling, and sustained community collaboration in mechanistic interpretability research.

[64] Neural Machine Translation for Coptic-French: Strategies for Low-Resource Ancient Languages

Nasma Chaoui,Richard Khoury

Main category: cs.CL

TL;DR: This paper studies strategies for translating Coptic into French and shows that fine-tuning with a stylistically-varied and noise-aware training corpus significantly enhances translation quality, providing crucial practical insights for developing translation tools for historical languages.

Details Motivation: The motivation of the paper is to provide a systematic study of strategies for translating Coptic into French and to provide practical insights for developing translation tools for historical languages. Method: The authors utilized aligned biblical corpora and systematically evaluated strategies for translating Coptic into French, including pivot versus direct translation, the impact of pre-training, the benefits of multi-version fine-tuning, and model robustness to noise. Result: The result of the study shows that fine-tuning with a stylistically-varied and noise-aware training corpus significantly enhances translation quality. Conclusion: The paper concludes that fine-tuning with a stylistically-varied and noise-aware training corpus significantly enhances translation quality, and provides crucial practical insights for developing translation tools for historical languages in general. Abstract: This paper presents the first systematic study of strategies for translating Coptic into French. Our comprehensive pipeline systematically evaluates: pivot versus direct translation, the impact of pre-training, the benefits of multi-version fine-tuning, and model robustness to noise. Utilizing aligned biblical corpora, we demonstrate that fine-tuning with a stylistically-varied and noise-aware training corpus significantly enhances translation quality. Our findings provide crucial practical insights for developing translation tools for historical languages in general.

[65] Continuous Bangla Sign Language Translation: Mitigating the Expense of Gloss Annotation with the Assistance of Graph

Safaeid Hossain Arib,Rabeya Akter,Sejuti Rahman

Main category: cs.CL

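TL;DR: This study proposes a new method for continuous Bangla sign language translation that combines graph-based methods with the transformer architecture, yielding more effective gloss-free translation and state-of-the-art performance on several datasets.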

Details Motivation: Sign language is an essential means of communication for the deaf and hard of hearing, yet it is often undervalued in societies that prioritize spoken languages, leading to communication barriers and social exclusion. The study aims to narrow this gap by improving translation methods. Method: The study combines graph-based methods with the transformer architecture (a fusion with STGCN-LSTM) and explores several fusion strategies to improve gloss-free sign language translation. Result: The method outperforms existing approaches on RWTH-PHOENIX-2014T, CSL-Daily, How2Sign, and BornilDB v1.0, with BLEU-4 improvements of 4.01, 2.07, and 0.5 on the first three, respectively. It also provides the first benchmark on BornilDB v1.0. Conclusion: Through architectural fusion and multiple fusion strategies, the study achieves more effective gloss-free sign language translation, sets a new benchmark for future research, and underlines the importance of improving communication accessibility for the deaf and hard of hearing. Abstract: Millions of individuals worldwide are affected by deafness and hearing impairment. Sign language serves as a sophisticated means of communication for the deaf and hard of hearing. However, in societies that prioritize spoken languages, sign language often faces underestimation, leading to communication barriers and social exclusion. The Continuous Bangla Sign Language Translation project aims to address this gap by enhancing translation methods. While recent approaches leverage transformer architecture for state-of-the-art results, our method integrates graph-based methods with the transformer architecture. This fusion, combining transformer and STGCN-LSTM architectures, proves more effective in gloss-free translation. Our contributions include architectural fusion, exploring various fusion strategies, and achieving a new state-of-the-art performance on diverse sign language datasets, namely RWTH-PHOENIX-2014T, CSL-Daily, How2Sign, and BornilDB v1.0. Our approach demonstrates superior performance compared to current translation outcomes across all datasets, showcasing notable improvements of BLEU-4 scores of 4.01, 2.07, and 0.5, surpassing those of GASLT, GASLT and slt_how2sign in RWTH-PHOENIX-2014T, CSL-Daily, and How2Sign, respectively. Also, we introduce benchmarking on the BornilDB v1.0 dataset for the first time. Our method sets a benchmark for future research, emphasizing the importance of gloss-free translation to improve communication accessibility for the deaf and hard of hearing.

[66] Learning from Natural Language Feedback for Personalized Question Answering

Alireza Salemi,Hamed Zamani

Main category: cs.CL

TL;DR: VAC improves personalized question answering by using natural language feedback instead of scalar rewards, leading to better learning efficiency and personalization quality.

Details Motivation: Current approaches using scalar rewards for personalizing large language models sometimes provide weak and non-instructive feedback, limiting learning efficiency and personalization quality. There is a need for richer and more actionable supervision signals. Method: VAC replaces scalar rewards with natural language feedback conditioned on user profiles and question narratives. Training alternates between optimizing the feedback model and fine-tuning the policy model, enabling the model to internalize effective personalization strategies without requiring feedback during inference. Result: Evaluation on the LaMP-QA benchmark across three diverse domains showed consistent and significant improvements over state-of-the-art results, with human evaluations confirming the superior quality of the generated responses. Conclusion: The proposed VAC framework, which utilizes natural language feedback (NLF), demonstrates significant improvements in personalization quality for response generation in question-answering tasks compared to existing methods. Abstract: Personalization is crucial for enhancing both the effectiveness and user satisfaction of language technologies, particularly in information-seeking tasks like question answering. Current approaches for personalizing large language models (LLMs) often rely on retrieval-augmented generation (RAG), followed by reinforcement learning with scalar reward signals to teach models how to use retrieved personal context. We believe that these scalar rewards sometimes provide weak, non-instructive feedback, limiting learning efficiency and personalization quality. We introduce VAC, a novel framework for personalized response generation that replaces scalar rewards with natural language feedback (NLF) that is generated conditioned on the user profiles and the question narratives. 
NLF serves as a rich and actionable supervision signal, allowing the policy model to iteratively refine its outputs and internalize effective personalization strategies. Training alternates between optimizing the feedback model and fine-tuning the policy model on the improved responses, resulting in a policy model that no longer requires feedback at inference. Evaluation on the LaMP-QA benchmark that consists of three diverse domains demonstrates consistent and significant improvements over the state-of-the-art results. Human evaluations further confirm the superior quality of the generated responses. These results demonstrate that NLF provides more effective signals for optimizing personalized question answering.

[67] Thinking Inside the Mask: In-Place Prompting in Diffusion LLMs

Xiangqi Jin,Yuxuan Wang,Yifeng Gao,Zichen Wen,Biqing Qi,Dongrui Liu,Linfeng Zhang

Main category: cs.CL

TL;DR: This paper proposes ICE, a new prompting framework for diffusion large language models (dLLMs), which enables more flexible and efficient prompting through in-place token masking and early exit mechanisms, achieving significant performance gains.

Details Motivation: The motivation is to overcome the limitations of prefix-only prompting in LLMs by leveraging the bidirectional attention and iterative refinement of dLLMs for more flexible and efficient prompting. Method: The paper introduces ICE, which uses in-place prompting within masked token positions and a confidence-aware early exit mechanism. It evaluates ICE through extensive experiments on tasks like GSM8K and MMLU. Result: Experiments show that ICE achieves up to 17.29% accuracy improvement with 4.12× speedup on GSM8K and up to 276.67× acceleration on MMLU. Conclusion: The paper concludes that ICE is an effective framework for in-place prompting in dLLMs, offering significant accuracy improvements and speedups while maintaining competitive performance. Abstract: Although large language models (LLMs) have achieved remarkable success, their prefix-only prompting paradigm and sequential generation process offer limited flexibility for bidirectional information. Diffusion large language models (dLLMs) present new opportunities through their bidirectional attention mechanisms and iterative refinement processes, enabling more flexible in-place prompting strategies. We introduce ICE (In-Place Chain-of-Thought Prompting with Early Exit), a novel framework that transforms prefix-only prompting into in-place prompting specifically designed for dLLMs. ICE integrates in-place prompts directly within masked token positions during iterative refinement and employs a confidence-aware early exit mechanism to significantly reduce computational overhead. Extensive experiments demonstrate ICE's effectiveness, achieving up to 17.29% accuracy improvement with 4.12× speedup on GSM8K, and up to 276.67× acceleration on MMLU while maintaining competitive performance.

[68] Beyond "Not Novel Enough": Enriching Scholarly Critique with LLM-Assisted Feedback

Osama Mohammed Afzal,Preslav Nakov,Tom Hope,Iryna Gurevych

Main category: cs.CL

TL;DR: This paper introduces a structured LLM-based method for automated novelty evaluation in peer review, achieving high alignment with human reviewers and improving consistency and transparency in assessing research novelty.

Details Motivation: The motivation is to address the challenge of novelty assessment in peer review, especially in high-volume fields like NLP where reviewer capacity is strained. There is a need for more consistent, evidence-based evaluations. Method: The paper proposes a structured approach for automated novelty evaluation that models expert reviewer behavior through content extraction, retrieval and synthesis of related work, and structured comparison. The method is informed by a large-scale analysis of human-written novelty reviews. Result: Evaluated on 182 ICLR 2025 submissions with human annotations, the approach achieved 86.5% alignment with human reasoning and 75.3% agreement on novelty conclusions, outperforming existing LLM-based baselines. Conclusion: The paper concludes that their structured LLM-assisted approach significantly improves the rigor and transparency of peer review, particularly in assessing novelty, without replacing human expertise. Abstract: Novelty assessment is a central yet understudied aspect of peer review, particularly in high-volume fields like NLP where reviewer capacity is increasingly strained. We present a structured approach for automated novelty evaluation that models expert reviewer behavior through three stages: content extraction from submissions, retrieval and synthesis of related work, and structured comparison for evidence-based assessment. Our method is informed by a large-scale analysis of human-written novelty reviews and captures key patterns such as independent claim verification and contextual reasoning. Evaluated on 182 ICLR 2025 submissions with human-annotated reviewer novelty assessments, the approach achieves 86.5% alignment with human reasoning and 75.3% agreement on novelty conclusions, substantially outperforming existing LLM-based baselines. The method produces detailed, literature-aware analyses and improves consistency over ad hoc reviewer judgments. 
These results highlight the potential for structured LLM-assisted approaches to support more rigorous and transparent peer review without displacing human expertise. Data and code are made available.

[69] Reinforced Language Models for Sequential Decision Making

Jim Dilkes,Vahid Yazdanpanah,Sebastian Stein

Main category: cs.CL

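TL;DR: This paper proposes the MS-GRPO algorithm, which post-trains small models for efficient sequential decision-making, substantially improving their performance and offering an alternative to relying on large-scale models.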

Details Motivation: Large language models show potential in sequential decision-making tasks, but their application is limited by a reliance on large, computationally expensive models, so the performance of smaller models needs to be improved. Method: The paper proposes Multi-Step Group-Relative Policy Optimization (MS-GRPO), combined with a novel absolute-advantage-weighted episode sampling strategy, to address credit assignment in multi-step agentic tasks. Result: Post-training a 3-billion-parameter model on Snake and Frozen Lake shows that the method effectively improves decision-making performance; on Frozen Lake in particular, the 3B model outperforms a 72B baseline by 50%. Conclusion: MS-GRPO is an effective post-training algorithm that substantially improves small models on sequential decision-making tasks, showing that targeted post-training is a practical and efficient alternative to relying on model scale. Abstract: Large Language Models (LLMs) show potential as sequential decision-making agents, but their application is often limited due to a reliance on large, computationally expensive models. This creates a need to improve smaller models, yet existing post-training methods are designed for single-turn interactions and cannot handle credit assignment in multi-step agentic tasks. To address this, we introduce Multi-Step Group-Relative Policy Optimization (MS-GRPO), a new algorithm for post-training LLM agents, grounded in formal Text-Mediated Stochastic Game (TSMG) and Language-Agent Policy (LAP) frameworks. For credit assignment, MS-GRPO attributes the entire cumulative episode reward to each individual episode step. We supplement this algorithm with a novel absolute-advantage-weighted episode sampling strategy that we show improves training performance. We evaluate our approach by post-training a 3-billion parameter model on Snake and Frozen Lake. Our experiments demonstrate that the method is effective in improving decision-making performance: our post-trained 3B parameter model outperforms a 72B parameter baseline by 50% on the Frozen Lake task. This work demonstrates that targeted post-training is a practical and efficient alternative to relying on model scale for creating sequential decision-making agents using LLMs.
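The credit-assignment rule and the sampling strategy described in this abstract can be illustrated with a small sketch. This is not the authors' implementation: the mean/standard-deviation normalization is borrowed from standard GRPO and is an assumption here, as are the function names and the choice to sample with replacement.

```python
import random
import statistics


def group_relative_advantages(episode_returns, eps=1e-8):
    """Normalize each episode's cumulative reward against its group.

    Assumption: the usual GRPO normalization (subtract the group mean,
    divide by the group standard deviation).
    """
    mean = statistics.fmean(episode_returns)
    std = statistics.pstdev(episode_returns)
    return [(r - mean) / (std + eps) for r in episode_returns]


def per_step_advantages(episode_lengths, advantages):
    """Give every step of an episode that episode's whole advantage,
    mirroring 'the entire cumulative episode reward to each step'."""
    return [[a] * n for n, a in zip(episode_lengths, advantages)]


def sample_episodes(advantages, k, rng=random):
    """Pick k episode indices with probability proportional to |advantage|.

    Hypothetical reading of 'absolute-advantage-weighted episode sampling';
    falls back to uniform sampling when all advantages are zero.
    """
    weights = [abs(a) for a in advantages]
    indices = range(len(advantages))
    if sum(weights) == 0:
        return rng.choices(indices, k=k)
    return rng.choices(indices, weights=weights, k=k)


if __name__ == "__main__":
    returns = [1.0, 0.0, 0.0, 1.0]  # cumulative reward of four episodes
    adv = group_relative_advantages(returns)
    print(per_step_advantages([2, 3, 1, 2], adv))
    print(sample_episodes(adv, k=2, rng=random.Random(0)))
```

Under this reading, episodes whose return is close to the group mean contribute little gradient signal and are sampled less often.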

[70] Psyche-R1: Towards Reliable Psychological LLMs through Unified Empathy, Expertise, and Reasoning

Chongyuan Dai,Jinpeng Hu,Hongchang Shi,Zhuo Li,Xun Yang,Meng Wang

Main category: cs.CL

TL;DR: This paper presents Psyche-R1, a novel Chinese psychological LLM that combines empathy, psychological expertise, and reasoning to address the shortage of mental health professionals. It achieves strong results despite its smaller size.

Details Motivation: The integration of LLMs into psychological applications is driven by the shortage of qualified mental health professionals and the need to alleviate the growing burden of mental health disorders. Current research has largely overlooked the importance of reasoning mechanisms in generating reliable psychological responses. Method: The paper introduces Psyche-R1, a Chinese psychological LLM, built using a data synthesis pipeline that generates high-quality psychological questions with rationales via CoT reasoning and iterative prompt-rationale optimization, along with empathetic dialogues. A hybrid training strategy combining GRPO for reasoning improvement and SFT for empathetic response generation is employed. Result: Psyche-R1, despite being significantly smaller (7B vs. 671B parameters), achieves comparable performance to DeepSeek-R1 on several psychological benchmarks, demonstrating the effectiveness of the proposed approach in enhancing both reasoning and empathetic response generation capabilities. Conclusion: Psyche-R1 proves to be effective in psychological benchmarks, achieving comparable results to much larger models, which highlights the potential of integrating empathy, psychological expertise, and reasoning in LLMs for mental health applications. Abstract: Amidst a shortage of qualified mental health professionals, the integration of large language models (LLMs) into psychological applications offers a promising way to alleviate the growing burden of mental health disorders. Recent reasoning-augmented LLMs have achieved remarkable performance in mathematics and programming, while research in the psychological domain has predominantly emphasized emotional support and empathetic dialogue, with limited attention to reasoning mechanisms that are beneficial to generating reliable responses. 
Therefore, in this paper, we propose Psyche-R1, the first Chinese psychological LLM that jointly integrates empathy, psychological expertise, and reasoning, built upon a novel data curation pipeline. Specifically, we design a comprehensive data synthesis pipeline that produces over 75k high-quality psychological questions paired with detailed rationales, generated through chain-of-thought (CoT) reasoning and iterative prompt-rationale optimization, along with 73k empathetic dialogues. Subsequently, we employ a hybrid training strategy wherein challenging samples are identified through a multi-LLM cross-selection strategy for group relative policy optimization (GRPO) to improve reasoning ability, while the remaining data is used for supervised fine-tuning (SFT) to enhance empathetic response generation and psychological domain knowledge. Extensive experiment results demonstrate the effectiveness of the Psyche-R1 across several psychological benchmarks, where our 7B Psyche-R1 achieves comparable results to 671B DeepSeek-R1.

[71] From Black Box to Transparency: Enhancing Automated Interpreting Assessment with Explainable AI in College Classrooms

Zhaokun Jiang,Ziyin Zhang

Main category: cs.CL

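TL;DR: This paper proposes a multi-dimensional modeling framework that emphasizes explainability, achieves strong predictive performance on an English-Chinese consecutive interpreting dataset, and provides detailed diagnostic feedback to support self-regulated learning.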

Details Motivation: Existing machine learning research on automated interpreting quality assessment suffers from insufficient examination of language use quality, unsatisfactory modeling effectiveness due to data scarcity and imbalance, and a lack of explanation of model predictions. Method: The paper proposes a multi-dimensional modeling framework that combines feature engineering, data augmentation, and explainable machine learning, using only construct-relevant, transparent features and conducting Shapley Value (SHAP) analysis. Result: The framework achieves strong predictive performance on a new English-Chinese consecutive interpreting dataset; BLEURT and CometKiwi scores matter most for fidelity, pause-related features for fluency, and Chinese-specific phraseological diversity metrics for language use. Conclusion: By emphasizing explainability, the study offers a scalable, reliable, and transparent alternative to traditional human evaluation that gives learners detailed diagnostic feedback and supports self-regulated learning. Abstract: Recent advancements in machine learning have spurred growing interests in automated interpreting quality assessment. Nevertheless, existing research suffers from insufficient examination of language use quality, unsatisfactory modeling effectiveness due to data scarcity and imbalance, and a lack of efforts to explain model predictions. To address these gaps, we propose a multi-dimensional modeling framework that integrates feature engineering, data augmentation, and explainable machine learning. This approach prioritizes explainability over "black box" predictions by utilizing only construct-relevant, transparent features and conducting Shapley Value (SHAP) analysis. Our results demonstrate strong predictive performance on a novel English-Chinese consecutive interpreting dataset, identifying BLEURT and CometKiwi scores to be the strongest predictive features for fidelity, pause-related features for fluency, and Chinese-specific phraseological diversity metrics for language use. Overall, by placing particular emphasis on explainability, we present a scalable, reliable, and transparent alternative to traditional human evaluation, facilitating the provision of detailed diagnostic feedback for learners and supporting self-regulated learning advantages not afforded by automated scores in isolation.

[72] SSRL: Self-Search Reinforcement Learning

Yuchen Fan,Kaiyan Zhang,Heng Zhou,Yuxin Zuo,Yanxu Chen,Yu Fu,Xinwei Long,Xuekai Zhu,Che Jiang,Yuchen Zhang,Li Kang,Gang Chen,Cheng Huang,Zhizhou He,Bingning Wang,Lei Bai,Ning Ding,Bowen Zhou

Main category: cs.CL

TL;DR: This paper explores how large language models can simulate agentic search tasks in reinforcement learning, introducing Self-Search and Self-Search RL methods to enhance performance and reduce dependence on external search engines.

Details Motivation: We investigate the potential of large language models (LLMs) to serve as efficient simulators for agentic search tasks in reinforcement learning (RL), thereby reducing dependence on costly interactions with external search engines. Method: We first quantify the intrinsic search capability of LLMs via structured prompting and repeated sampling, which we term Self-Search. We introduce Self-Search RL (SSRL), which enhances LLMs' Self-Search capability through format-based and rule-based rewards. Result: Our results reveal that LLMs exhibit strong scaling behavior with respect to the inference budget, achieving high pass@k on question-answering benchmarks, including the challenging BrowseComp task. Empirical evaluations demonstrate that SSRL-trained policy models provide a cost-effective and stable environment for search-driven RL training, reducing reliance on external search engines and facilitating robust sim-to-real transfer. Conclusion: 1) LLMs possess world knowledge that can be effectively elicited to achieve high performance; 2) SSRL demonstrates the potential of leveraging internal knowledge to reduce hallucination; 3) SSRL-trained models integrate seamlessly with external search engines without additional effort. Abstract: We investigate the potential of large language models (LLMs) to serve as efficient simulators for agentic search tasks in reinforcement learning (RL), thereby reducing dependence on costly interactions with external search engines. To this end, we first quantify the intrinsic search capability of LLMs via structured prompting and repeated sampling, which we term Self-Search. Our results reveal that LLMs exhibit strong scaling behavior with respect to the inference budget, achieving high pass@k on question-answering benchmarks, including the challenging BrowseComp task. 
Building on these observations, we introduce Self-Search RL (SSRL), which enhances LLMs' Self-Search capability through format-based and rule-based rewards. SSRL enables models to iteratively refine their knowledge utilization internally, without requiring access to external tools. Empirical evaluations demonstrate that SSRL-trained policy models provide a cost-effective and stable environment for search-driven RL training, reducing reliance on external search engines and facilitating robust sim-to-real transfer. We draw the following conclusions: 1) LLMs possess world knowledge that can be effectively elicited to achieve high performance; 2) SSRL demonstrates the potential of leveraging internal knowledge to reduce hallucination; 3) SSRL-trained models integrate seamlessly with external search engines without additional effort. Our findings highlight the potential of LLMs to support more scalable RL agent training.
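The abstract reports pass@k under repeated sampling but does not say how it is computed. A minimal sketch, assuming the widely used unbiased estimator pass@k = 1 - C(n-c, k) / C(n, k), where n samples are drawn per question and c of them are correct (the function name is illustrative):

```python
from math import comb


def pass_at_k(n, c, k):
    """Probability that at least one of k samples, drawn without
    replacement from n generated samples of which c are correct, is correct.
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k incorrect samples: any draw of k contains a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


if __name__ == "__main__":
    # 4 samples, 2 correct, budget of 2 draws.
    print(pass_at_k(4, 2, 2))
```

Averaging this quantity over all questions in a benchmark gives the benchmark-level pass@k; increasing k corresponds to a larger inference budget.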

[73] A Survey on Diffusion Language Models

Tianyi Li,Mingda Chen,Bowei Guo,Zhiqiang Shen

Main category: cs.CL

TL;DR: This survey provides an overview of Diffusion Language Models (DLMs), which offer advantages in inference speed and bidirectional context over traditional autoregressive models, covering their development, techniques, applications, and future research directions.

Details Motivation: DLMs have emerged as a powerful alternative to autoregressive models, offering reduced inference latency and better capture of bidirectional context. This survey aims to provide a holistic overview of the DLM landscape. Method: The authors provide a comprehensive survey of DLMs, including their evolution, foundational principles, state-of-the-art models, pre-training strategies, post-training methods, inference strategies, multimodal extensions, and applications. Result: This survey presents a taxonomy of DLMs, analyzes current techniques, reviews inference strategies and optimizations, discusses multimodal extensions, and highlights challenges and future research directions. Conclusion: DLMs offer a promising alternative to AR models with advantages in latency, bidirectional context, and generation control. This survey outlines their evolution, techniques, and future research directions. Abstract: Diffusion Language Models (DLMs) are rapidly emerging as a powerful and promising alternative to the dominant autoregressive (AR) paradigm. By generating tokens in parallel through an iterative denoising process, DLMs possess inherent advantages in reducing inference latency and capturing bidirectional context, thereby enabling fine-grained control over the generation process. While achieving a several-fold speed-up, recent advancements have allowed DLMs to show performance comparable to their autoregressive counterparts, making them a compelling choice for various natural language processing tasks. In this survey, we provide a holistic overview of the current DLM landscape. We trace its evolution and relationship with other paradigms, such as autoregressive and masked language models, and cover both foundational principles and state-of-the-art models. Our work offers an up-to-date, comprehensive taxonomy and an in-depth analysis of current techniques, from pre-training strategies to advanced post-training methods. 
Another contribution of this survey is a thorough review of DLM inference strategies and optimizations, including improvements in decoding parallelism, caching mechanisms, and generation quality. We also highlight the latest approaches to multimodal extensions of DLMs and delineate their applications across various practical scenarios. Furthermore, our discussion addresses the limitations and challenges of DLMs, including efficiency, long-sequence handling, and infrastructure requirements, while outlining future research directions to sustain progress in this rapidly evolving field. Project GitHub is available at https://github.com/VILA-Lab/Awesome-DLMs.

cs.CV [Back]

[74] Stochastic-based Patch Filtering for Few-Shot Learning

Javier Rodenas,Eduardo Aguilar,Petia Radeva

Main category: cs.CV

TL;DR: The paper proposes SPFF, a stochastic-based patch filtering method for few-shot learning, which addresses the challenges of food image classification by focusing on class-specific features and outperforms existing methods.

Details Motivation: Food images present unique challenges for few-shot learning models due to their visual complexity and variability, leading to misclassification. Method: Stochastic-based Patch Filtering for Few-Shot Learning (SPFF) filters patch embeddings stochastically and uses a similarity matrix to quantify the relationship between query and support images. Result: SPFF outperforms existing state-of-the-art methods on few-shot classification benchmarks: Food-101, VireoFood-172, and UECFood-256. Conclusion: SPFF is effective in focusing on class-specific food features and outperforms existing state-of-the-art methods in few-shot classification. Abstract: Food images present unique challenges for few-shot learning models due to their visual complexity and variability. For instance, a pasta dish might appear with various garnishes on different plates and in diverse lighting conditions and camera perspectives. This problem leads to losing focus on the most important elements when comparing the query with support images, resulting in misclassification. To address this issue, we propose Stochastic-based Patch Filtering for Few-Shot Learning (SPFF) to attend to the patch embeddings that show greater correlation with the class representation. The key concept of SPFF involves the stochastic filtering of patch embeddings, where patches less similar to the class-aware embedding are more likely to be discarded. With patch embedding filtered according to the probability of appearance, we use a similarity matrix that quantifies the relationship between the query image and its respective support images. Through a qualitative analysis, we demonstrate that SPFF effectively focuses on patches where class-specific food features are most prominent while successfully filtering out non-relevant patches. We validate our approach through extensive experiments on few-shot classification benchmarks: Food-101, VireoFood-172 and UECFood-256, outperforming the existing SoA methods.
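The core idea, that patches less similar to the class-aware embedding are more likely to be discarded, can be sketched as follows. The mapping from cosine similarity to a keep probability, `(sim + 1) / 2`, is a placeholder assumption for illustration and is not taken from the paper; the function names are likewise hypothetical.

```python
import math
import random


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def keep_probabilities(patches, class_embedding):
    """Map similarity in [-1, 1] to a keep probability in [0, 1].

    Assumption: a simple linear map; the paper's actual rule may differ.
    """
    return [(cosine(p, class_embedding) + 1.0) / 2.0 for p in patches]


def stochastic_filter(patches, class_embedding, rng):
    """Keep each patch independently with its keep probability."""
    probs = keep_probabilities(patches, class_embedding)
    return [p for p, q in zip(patches, probs) if rng.random() < q]


if __name__ == "__main__":
    class_embedding = [1.0, 0.0]
    patches = [[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0]]
    print(keep_probabilities(patches, class_embedding))
    print(stochastic_filter(patches, class_embedding, random.Random(0)))
```

The surviving patch embeddings would then feed the query-support similarity matrix that the abstract mentions.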

[75] DINOv3

Oriane Siméoni,Huy V. Vo,Maximilian Seitzer,Federico Baldassarre,Maxime Oquab,Cijo Jose,Vasil Khalidov,Marc Szafraniec,Seungeun Yi,Michaël Ramamonjisoa,Francisco Massa,Daniel Haziza,Luca Wehrstedt,Jianyuan Wang,Timothée Darcet,Théo Moutakanni,Leonel Sentana,Claire Roberts,Andrea Vedaldi,Jamie Tolan,John Brandt,Camille Couprie,Julien Mairal,Hervé Jégou,Patrick Labatut,Piotr Bojanowski

Main category: cs.CV

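TL;DR: DINOv3 is a versatile vision foundation model that uses simple strategies to scale dataset and model size, resolves the degradation of dense feature maps, and enhances model flexibility, outperforming the specialized state of the art across a range of settings without fine-tuning.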

Details Motivation: Self-supervised learning holds the promise of eliminating the need for manual data annotation, allowing models to scale easily to massive datasets and larger architectures and to learn visual representations from diverse sources. Method: The authors scale dataset and model size through careful data preparation, design, and optimization; introduce a new method, Gram anchoring, to address the degradation of dense feature maps during long training schedules; and apply post-hoc strategies to enhance the models' flexibility with respect to resolution, model size, and alignment with text. Result: DINOv3 performs strongly across a range of vision tasks, significantly surpassing previous self- and weakly-supervised foundation models and producing high-quality dense features. Conclusion: DINOv3 is a versatile vision foundation model that outperforms the specialized state of the art across settings without fine-tuning and offers scalable solutions for diverse resource constraints and deployment scenarios. Abstract: Self-supervised learning holds the promise of eliminating the need for manual data annotation, enabling models to scale effortlessly to massive datasets and larger architectures. By not being tailored to specific tasks or domains, this training paradigm has the potential to learn visual representations from diverse sources, ranging from natural to aerial images -- using a single algorithm. This technical report introduces DINOv3, a major milestone toward realizing this vision by leveraging simple yet effective strategies. First, we leverage the benefit of scaling both dataset and model size by careful data preparation, design, and optimization. Second, we introduce a new method called Gram anchoring, which effectively addresses the known yet unsolved issue of dense feature maps degrading during long training schedules. Finally, we apply post-hoc strategies that further enhance our models' flexibility with respect to resolution, model size, and alignment with text. As a result, we present a versatile vision foundation model that outperforms the specialized state of the art across a broad range of settings, without fine-tuning. DINOv3 produces high-quality dense features that achieve outstanding performance on various vision tasks, significantly surpassing previous self- and weakly-supervised foundation models. We also share the DINOv3 suite of vision models, designed to advance the state of the art on a wide spectrum of tasks and data by providing scalable solutions for diverse resource constraints and deployment scenarios.

[76] Empowering Morphing Attack Detection using Interpretable Image-Text Foundation Model

Sushrut Patwardhan,Raghavendra Ramachandra,Sushma Venkatesh

Main category: cs.CV

TL;DR: This paper proposes a CLIP-based multimodal framework for zero-shot morphing attack detection in face recognition systems, with interpretable textual outputs.

Details Motivation: To enhance the reliability of face recognition systems by detecting morphing attacks and providing human-understandable explanations. Method: A multimodal learning approach using CLIP for zero-shot morphing attack detection with textual prompts. Result: The model achieved generalizable detection performance and successfully predicted relevant textual snippets across ten prompts and five morphing techniques. Conclusion: The proposed multimodal framework can effectively detect morphing attacks and predict relevant text descriptions in a zero-shot setting. Abstract: Morphing attack detection has become an essential component of face recognition systems for ensuring a reliable verification scenario. In this paper, we present a multimodal learning approach that can provide a textual description of morphing attack detection. We first show that zero-shot evaluation of the proposed framework using Contrastive Language-Image Pretraining (CLIP) can yield not only generalizable morphing attack detection, but also predict the most relevant text snippet. We present an extensive analysis of ten different textual prompts that include both short and long textual prompts. These prompts are engineered by considering the human understandable textual snippet. Extensive experiments were performed on a face morphing dataset that was developed using a publicly available face biometric dataset. We present an evaluation of SOTA pre-trained neural networks together with the proposed framework in the zero-shot evaluation of five different morphing generation techniques that are captured in three different mediums.

[77] Interpretable Oracle Bone Script Decipherment through Radical and Pictographic Analysis with LVLMs

Kaixin Peng,Mengyang Zhao,Haiyang Yu,Teng Fu,Bin Li

Main category: cs.CV

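TL;DR: This paper proposes an Oracle Bone Script decipherment method based on vision-language models that combines radical and pictographic analysis, improves zero-shot decipherment, and builds a large-scale dataset to advance Oracle Bone Script research and the digital humanities.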

Details Motivation: Because of its rarity, abstractness, and pictographic diversity, Oracle Bone Script has long posed challenges for archaeological decipherment, and existing deep learning methods show limited generalization and interpretability in zero-shot settings and on undeciphered script. Method: The paper proposes a progressive training strategy and a Radical-Pictographic Dual Matching mechanism that combine radical analysis with pictograph-semantic understanding to enable reasoning from glyph to meaning. Result: The approach achieves state-of-the-art Top-10 accuracy on public benchmarks and superior zero-shot decipherment capability, while producing logical analysis processes that may offer archaeologically valuable reference results for undeciphered script. Conclusion: The paper presents an interpretable Oracle Bone Script decipherment method based on large vision-language models and builds a dataset of OBS images and pictographic analysis texts to improve zero-shot decipherment performance and archaeological applicability. Abstract: As the oldest mature writing system, Oracle Bone Script (OBS) has long posed significant challenges for archaeological decipherment due to its rarity, abstractness, and pictographic diversity. Current deep learning-based methods have made exciting progress on the OBS decipherment task, but existing approaches often ignore the intricate connections between glyphs and the semantics of OBS. This results in limited generalization and interpretability, especially when addressing zero-shot settings and undeciphered OBS. To this end, we propose an interpretable OBS decipherment method based on Large Vision-Language Models, which synergistically combines radical analysis and pictograph-semantic understanding to bridge the gap between glyphs and meanings of OBS. Specifically, we propose a progressive training strategy that guides the model from radical recognition and analysis to pictographic analysis and mutual analysis, thus enabling reasoning from glyph to meaning. We also design a Radical-Pictographic Dual Matching mechanism informed by the analysis results, significantly enhancing the model's zero-shot decipherment performance. To facilitate model training, we propose the Pictographic Decipherment OBS Dataset, which comprises 47,157 Chinese characters annotated with OBS images and pictographic analysis texts. Experimental results on public benchmarks demonstrate that our approach achieves state-of-the-art Top-10 accuracy and superior zero-shot decipherment capabilities. 
More importantly, our model delivers logical analysis processes, possibly providing archaeologically valuable reference results for undeciphered OBS, and thus has potential applications in digital humanities and historical research. The dataset and code will be released at https://github.com/PKXX1943/PD-OBS.

[78] Deep Learning Enables Large-Scale Shape and Appearance Modeling in Total-Body DXA Imaging

Arianna Bunnell,Devon Cataldi,Yannik Glaser,Thomas K. Wolfgruber,Steven Heymsfield,Alan B. Zonderman,Thomas L. Kelly,Peter Sadowski,John A. Shepherd

Main category: cs.CV

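TL;DR: The study develops a deep learning method for automatic keypoint placement on total-body dual X-ray absorptiometry scans and demonstrates its potential for analyzing body composition and health markers.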

Details Motivation: Total-body dual X-ray absorptiometry is a low-cost whole-body imaging modality widely used for body composition assessment. The authors develop a deep learning method to automate keypoint placement and to explore its potential for health-marker analysis. Method: The study develops and validates a deep learning method trained on 1,683 manually annotated TBDXA scans, which achieves 99.5% percentage correct keypoints on an external test dataset. The method is then applied to 35,928 scans for SAM, and associations with health markers are tested using two-sample Kolmogorov-Smirnov tests. Result: The method achieves 99.5% percentage correct keypoints on the external test dataset, is successfully applied to SAM on 35,928 scans, and shows distributional associations with health biomarkers under two-sample Kolmogorov-Smirnov tests. Conclusion: The paper presents a deep learning method for automatic keypoint placement on total-body dual X-ray absorptiometry (TBDXA) scans and demonstrates its value for shape and appearance modeling (SAM). Abstract: Total-body dual X-ray absorptiometry (TBDXA) imaging is a relatively low-cost whole-body imaging modality, widely used for body composition assessment. We develop and validate a deep learning method for automatic fiducial point placement on TBDXA scans using 1,683 manually-annotated TBDXA scans. The method achieves 99.5% percentage correct keypoints in an external testing dataset. To demonstrate the value for shape and appearance modeling (SAM), our method is used to place keypoints on 35,928 scans for five different TBDXA imaging modes, then associations with health markers are tested in two cohorts not used for SAM model generation using two-sample Kolmogorov-Smirnov tests. SAM feature distributions associated with health biomarkers are shown to corroborate existing evidence and generate new hypotheses on body composition and shape's relationship to various frailty, metabolic, inflammation, and cardiometabolic health markers. Evaluation scripts, model weights, automatic point file generation code, and triangulation files are available at https://github.com/hawaii-ai/dxa-pointplacement.
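Two quantities named in this abstract, percentage correct keypoints (PCK) and the two-sample Kolmogorov-Smirnov statistic, can be sketched in a few lines. The abstract does not state the distance threshold used for PCK, so the fixed threshold below is an assumption, and the KS function returns only the test statistic, not a p-value. The repository linked above holds the authors' actual evaluation scripts.

```python
import math
from bisect import bisect_right


def pck(predicted, truth, threshold):
    """Fraction of predicted keypoints within `threshold` (same units as
    the coordinates) of their ground-truth location."""
    if len(predicted) != len(truth) or not predicted:
        raise ValueError("need two non-empty lists of equal length")
    correct = sum(
        1
        for (px, py), (tx, ty) in zip(predicted, truth)
        if math.hypot(px - tx, py - ty) <= threshold
    )
    return correct / len(predicted)


def ks_statistic(sample_a, sample_b):
    """Largest gap between the two empirical CDFs."""
    a, b = sorted(sample_a), sorted(sample_b)
    gap = 0.0
    for x in a + b:
        cdf_a = bisect_right(a, x) / len(a)
        cdf_b = bisect_right(b, x) / len(b)
        gap = max(gap, abs(cdf_a - cdf_b))
    return gap


if __name__ == "__main__":
    pred = [(0, 0), (3, 4), (10, 10)]
    true = [(0, 0), (0, 0), (10, 11)]
    print(pck(pred, true, threshold=2))
    print(ks_statistic([1, 2, 3], [4, 5, 6]))
```

In the study's setting, the two samples passed to the KS test would be a SAM feature's values in two groups split by a health marker.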

[79] MANGO: Multimodal Attention-based Normalizing Flow Approach to Fusion Learning

Thanh-Dat Truong,Christophe Bobda,Nitin Agarwal,Khoa Luu

Main category: cs.CV

TL;DR: This paper proposes MANGO, a new multimodal fusion method using attention-based normalizing flow, achieving top performance on multiple tasks by explicitly modeling modality interactions.

Details Motivation: Current multimodal fusion methods rely on implicit attention mechanisms in Transformers, which struggle to capture essential modality features and complex correlations, necessitating a more explicit and interpretable approach. Method: The paper proposes a Multimodal Attention-based Normalizing Flow (MANGO) approach that includes an Invertible Cross-Attention (ICA) layer with three new cross-attention mechanisms: Modality-to-Modality Cross-Attention (MMCA), Inter-Modality Cross-Attention (IMCA), and Learnable Inter-Modality Cross-Attention (LICA). Result: The experimental results show that the MANGO approach achieves state-of-the-art performance on three multimodal learning tasks: semantic segmentation, image-to-image translation, and movie genre classification. Conclusion: The paper concludes that the proposed MANGO approach achieves state-of-the-art performance in multimodal learning tasks by enabling explicit, interpretable, and tractable fusion through a novel attention-based normalizing flow method. Abstract: Multimodal learning has gained much success in recent years. However, current multimodal fusion methods adopt the attention mechanism of Transformers to implicitly learn the underlying correlation of multimodal features. As a result, the multimodal model cannot capture the essential features of each modality, making it difficult to comprehend complex structures and correlations of multimodal inputs. This paper introduces a novel Multimodal Attention-based Normalizing Flow (MANGO) approach to developing explicit, interpretable, and tractable multimodal fusion learning. (The source code of this work will be publicly available.) In particular, we propose a new Invertible Cross-Attention (ICA) layer to develop the Normalizing Flow-based Model for multimodal data. 
To efficiently capture the complex, underlying correlations in multimodal data in our proposed invertible cross-attention layer, we propose three new cross-attention mechanisms: Modality-to-Modality Cross-Attention (MMCA), Inter-Modality Cross-Attention (IMCA), and Learnable Inter-Modality Cross-Attention (LICA). Finally, we introduce a new Multimodal Attention-based Normalizing Flow to enable the scalability of our proposed method to high-dimensional multimodal data. Our experimental results on three different multimodal learning tasks, i.e., semantic segmentation, image-to-image translation, and movie genre classification, have illustrated the state-of-the-art (SoTA) performance of the proposed approach.

[80] Improving watermelon (Citrullus lanatus) disease classification with generative artificial intelligence (GenAI)-based synthetic and real-field images via a custom EfficientNetV2-L model

Nitin Rai,Nathan S. Boyd,Gary E. Vallad,Arnold W. Schumann

Main category: cs.CV

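TL;DR: This study investigates combining a small number of real images with a large number of GenAI-generated synthetic images to improve watermelon disease classification.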

Details Motivation: Advances in generative artificial intelligence (GenAI) models have opened new possibilities for generating high-resolution synthetic images, a promising alternative to traditional image acquisition for training computer vision models in agriculture. However, little research has evaluated how effectively real and synthetic images can be combined to improve disease classification performance. Method: A custom EfficientNetV2-L architecture with enhanced fine-tuning and transfer learning techniques is trained on five training-dataset treatments (H0-H4), including real images only, synthetic images only, and mixtures of real and synthetic images at different ratios. Result: Models trained on the H2, H3, and H4 treatments show high precision, recall, and F1-score. The weighted F1-score rises from 0.65 on H0 to 1.00 on H3 and H4, indicating that combining a small number of real images with a large volume of synthetic images improves model performance and generalizability. Conclusion: Synthetic images alone cannot adequately substitute for real images; for crop disease classification, real and synthetic images must be used together in a hybrid manner to maximize model performance. Abstract: The current advancements in generative artificial intelligence (GenAI) models have paved the way for new possibilities for generating high-resolution synthetic images, thereby offering a promising alternative to traditional image acquisition for training computer vision models in agriculture. In the context of crop disease diagnosis, GenAI models are being used to create synthetic images of various diseases, potentially facilitating model creation and reducing the dependency on resource-intensive in-field data collection. However, limited research has been conducted on evaluating the effectiveness of integrating real with synthetic images to improve disease classification performance. Therefore, this study aims to investigate whether combining a limited number of real images with synthetic images can enhance the prediction accuracy of an EfficientNetV2-L model for classifying watermelon (Citrullus lanatus) diseases. The training dataset was divided into five treatments: H0 (only real images), H1 (only synthetic images), H2 (1:1 real-to-synthetic), H3 (1:10 real-to-synthetic), and H4 (H3 + random images to improve variability and model generalization). All treatments were trained using a custom EfficientNetV2-L architecture with enhanced fine-tuning and transfer learning techniques. Models trained on H2, H3, and H4 treatments demonstrated high precision, recall, and F1-score metrics. 
Additionally, the weighted F1-score increased from 0.65 (on H0) to 1.00 (on H3-H4) signifying that the addition of a small number of real images with a considerable volume of synthetic images improved model performance and generalizability. Overall, this validates the findings that synthetic images alone cannot adequately substitute for real images; instead, both must be used in a hybrid manner to maximize model performance for crop disease classification.

[81] SynSpill: Improved Industrial Spill Detection With Synthetic Data

Aaditya Baranwal,Abdul Mueez,Jason Voelker,Guneet Bhatia,Shruti Vyas

Main category: cs.CV

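TL;DR: This paper proposes a scalable framework that uses synthetic data to improve the performance of vision-language models and detectors in safety-critical industrial domains.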

Details Motivation: In industrial settings, privacy concerns, data sensitivity, and the rarity of real incidents make conventional fine-tuning infeasible. Method: The paper proposes a scalable framework built on a high-quality synthetic data generation pipeline and validates its effectiveness on VLMs and object detectors. Result: Synthetic data substantially improves the performance of both VLMs and object detectors, bridging the domain gap. Conclusion: Combining synthetic data with lightweight adaptation offers an efficient, scalable route to deploying vision systems in industrial environments. Abstract: Large-scale Vision-Language Models (VLMs) have transformed general-purpose visual recognition through strong zero-shot capabilities. However, their performance degrades significantly in niche, safety-critical domains such as industrial spill detection, where hazardous events are rare, sensitive, and difficult to annotate. This scarcity -- driven by privacy concerns, data sensitivity, and the infrequency of real incidents -- renders conventional fine-tuning of detectors infeasible for most industrial settings. We address this challenge by introducing a scalable framework centered on a high-quality synthetic data generation pipeline. We demonstrate that this synthetic corpus enables effective Parameter-Efficient Fine-Tuning (PEFT) of VLMs and substantially boosts the performance of state-of-the-art object detectors such as YOLO and DETR. Notably, in the absence of synthetic data (SynSpill dataset), VLMs still generalize better to unseen spill scenarios than these detectors. When SynSpill is used, both VLMs and detectors achieve marked improvements, with their performance becoming comparable. Our results underscore that high-fidelity synthetic data is a powerful means to bridge the domain gap in safety-critical applications. The combination of synthetic generation and lightweight adaptation offers a cost-effective, scalable pathway for deploying vision systems in industrial environments where real data is scarce/impractical to obtain. Project Page: https://synspill.vercel.app

[82] EntropyGS: An Efficient Entropy Coding on 3D Gaussian Splatting

Yuning Huang,Jiahao Pang,Fengqing Zhu,Dong Tian

Main category: cs.CV

TL;DR: EntropyGS is a compression method for 3D Gaussian Splatting data that significantly reduces data rates while preserving visual quality.

Details Motivation: The separation of 3DGS tasks over time or devices necessitates efficient compression for storage and transmission. Method: EntropyGS uses correlation and statistical analysis to identify attribute distributions and applies factorized, parameterized entropy coding with adaptive quantization. Result: EntropyGS achieves about 30x rate reduction on benchmark datasets with minimal impact on rendering quality and fast encoding/decoding. Conclusion: EntropyGS provides an effective compression solution for 3DGS Gaussians, achieving significant rate reduction while maintaining rendering quality. Abstract: As an emerging novel view synthesis approach, 3D Gaussian Splatting (3DGS) demonstrates fast training/rendering with superior visual quality. The two tasks of 3DGS, Gaussian creation and view rendering, are typically separated over time or devices, and thus storage/transmission and finally compression of 3DGS Gaussians become necessary. We begin with a correlation and statistical analysis of 3DGS Gaussian attributes. An inspiring finding in this work reveals that spherical harmonic AC attributes precisely follow Laplace distributions, while mixtures of Gaussian distributions can approximate rotation, scaling, and opacity. Additionally, harmonic AC attributes manifest weak correlations with other attributes except for inherited correlations from a color space. A factorized and parameterized entropy coding method, EntropyGS, is hereinafter proposed. During encoding, distribution parameters of each Gaussian attribute are estimated to assist their entropy coding. The quantization for entropy coding is adaptively performed according to Gaussian attribute types. EntropyGS demonstrates about 30x rate reduction on benchmark datasets while maintaining similar rendering quality compared to input 3DGS data, with a fast encoding and decoding time.
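The EntropyGS abstract describes estimating distribution parameters per attribute and using them to drive entropy coding. The Python sketch below illustrates that general idea for a Laplace-distributed attribute: it fits a Laplace distribution by maximum likelihood, then estimates the ideal code length of uniformly quantized values from the fitted model. The function names, the uniform quantization step, and the toy values are assumptions made for illustration and are not taken from the paper; the sketch only estimates a bit count and does not implement an actual arithmetic coder or the paper's adaptive quantization.

```python
import math
from statistics import median


def fit_laplace(values):
    """Maximum-likelihood Laplace fit: location = median, scale = mean |x - location|."""
    mu = median(values)
    b = sum(abs(v - mu) for v in values) / len(values)
    return mu, max(b, 1e-9)  # guard against a zero scale when all values are equal


def laplace_cdf(x, mu, b):
    """Cumulative distribution function of Laplace(mu, b)."""
    if x < mu:
        return 0.5 * math.exp((x - mu) / b)
    return 1.0 - 0.5 * math.exp(-(x - mu) / b)


def estimated_bits(values, step):
    """Ideal code length (in bits) of the values after uniform quantization.

    Each value is mapped to a bin of width `step`; the probability of a bin is
    the Laplace mass it covers, and an ideal entropy coder spends -log2(p) bits.
    """
    mu, b = fit_laplace(values)
    total = 0.0
    for v in values:
        q = round(v / step)                      # hypothetical uniform quantizer
        lo, hi = (q - 0.5) * step, (q + 0.5) * step
        p = laplace_cdf(hi, mu, b) - laplace_cdf(lo, mu, b)
        total += -math.log2(max(p, 1e-12))       # clamp to avoid log(0)
    return total


# Toy attribute values (illustrative only, not real 3DGS data).
toy = [-0.2, -0.1, 0.0, 0.0, 0.1, 0.3]
fine_bits = estimated_bits(toy, step=0.1)
coarse_bits = estimated_bits(toy, step=0.4)
```

A coarser step puts more probability mass in each bin, so the estimated rate drops at the cost of larger quantization error, which is the trade-off a per-attribute quantizer has to balance.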

[83] CellSymphony: Deciphering the molecular and phenotypic orchestration of cells with single-cell pathomics

Paul H. Acosta,Pingjun Chen,Simon P. Castillo,Maria Esther Salvatierra,Yinyin Yuan,Xiaoxi Pan

Main category: cs.CV

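TL;DR: CellSymphony is a new multimodal framework that combines Xenium spatial transcriptomics with histology images to enable cell type annotation and microenvironment analysis at single-cell resolution.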

Details Motivation: Although histology images contain rich morphological information, extracting robust cell-level features and integrating them with spatial transcriptomics data remains a key challenge. Method: CellSymphony uses foundation model-derived embeddings to fuse Xenium transcriptomic data with histology images, learning joint representations to decode the physiological and phenotypic orchestration of cells within complex tissue ecosystems. Result: CellSymphony achieves accurate cell type annotation across three cancer types and uncovers distinct microenvironmental niches. Conclusion: CellSymphony is a flexible multimodal framework that effectively integrates Xenium transcriptomic data with histology images, achieving high-accuracy cell type annotation and revealing microenvironmental niches across cancer types. Abstract: Xenium, a new spatial transcriptomics platform, enables subcellular-resolution profiling of complex tumor tissues. Despite the rich morphological information in histology images, extracting robust cell-level features and integrating them with spatial transcriptomics data remains a critical challenge. We introduce CellSymphony, a flexible multimodal framework that leverages foundation model-derived embeddings from both Xenium transcriptomic profiles and histology images at true single-cell resolution. By learning joint representations that fuse spatial gene expression with morphological context, CellSymphony achieves accurate cell type annotation and uncovers distinct microenvironmental niches across three cancer types. This work highlights the potential of foundation models and multimodal fusion for deciphering the physiological and phenotypic orchestration of cells within complex tissue ecosystems.

[84] Deep Learning for Crack Detection: A Review of Learning Paradigms, Generalizability, and Datasets

Xinan Zhang,Haolin Wang,Yung-An Hsieh,Zhongyu Yang,Anthony Yezzi,Yi-Chang Tsai

Main category: cs.CV

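TL;DR: This paper reviews trends in deep learning for crack detection, including shifts in learning paradigms, improved generalizability, and dataset diversification, and introduces 3DCrack, a new dataset collected with 3D laser scans.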

Details Motivation: Crack detection is crucial for civil infrastructure, and advances in deep learning are reshaping the field. The paper aims to analyze emerging trends and outline future research directions. Method: The authors systematically analyze trends in crack detection and introduce a new 3D laser scan dataset, 3DCrack, for benchmarking experiments. Result: The paper summarizes how deep learning methods for crack detection have evolved in learning paradigms, generalizability, and datasets, and provides benchmark results. Conclusion: Deep learning for crack detection continues to advance; future directions include improving learning paradigms, strengthening model generalizability, and exploiting new datasets such as 3DCrack. Abstract: Crack detection plays a crucial role in civil infrastructures, including inspection of pavements, buildings, etc., and deep learning has significantly advanced this field in recent years. While numerous technical and review papers exist in this domain, emerging trends are reshaping the landscape. These shifts include transitions in learning paradigms (from fully supervised learning to semi-supervised, weakly-supervised, unsupervised, few-shot, domain adaptation and fine-tuning foundation models), improvements in generalizability (from single-dataset performance to cross-dataset evaluation), and diversification in dataset reacquisition (from RGB images to specialized sensor-based data). In this review, we systematically analyze these trends and highlight representative works. Additionally, we introduce a new dataset collected with 3D laser scans, 3DCrack, to support future research and conduct extensive benchmarking experiments to establish baselines for commonly used deep learning methodologies, including recent foundation models. Our findings provide insights into the evolving methodologies and future directions in deep learning-based crack detection. Project page: https://github.com/nantonzhang/Awesome-Crack-Detection

[85] MRFD: Multi-Region Fusion Decoding with Self-Consistency for Mitigating Hallucinations in LVLMs

Haonan Ge,Yiwei Wang,Ming-Hsuan Yang,Yujun Cai

Main category: cs.CV

TL;DR: This paper proposes MRFD, a method to reduce hallucinations in vision-language models by improving factual grounding through inter-region consistency analysis.

Details Motivation: LVLMs often produce hallucinations due to the limited ability to verify information in different regions of the image, which motivates the need for a method to improve factual grounding. Method: The proposed method, Multi-Region Fusion Decoding (MRFD), identifies salient image regions using cross-attention, generates initial responses for each region, computes reliability weights based on Jensen-Shannon Divergence (JSD), and performs a consistency-aware fusion of per-region predictions using region-aware prompts inspired by Chain-of-Thought reasoning. Result: Experiments across multiple LVLMs and benchmarks show that MRFD significantly reduces hallucinations and improves response factuality. Conclusion: MRFD is a training-free decoding method that effectively reduces hallucinations in LVLMs by modeling inter-region consistency without requiring model updates. Abstract: Large Vision-Language Models (LVLMs) have shown strong performance across multimodal tasks. However, they often produce hallucinations -- text that is inconsistent with visual input, due to the limited ability to verify information in different regions of the image. To address this, we propose Multi-Region Fusion Decoding (MRFD), a training-free decoding method that improves factual grounding by modeling inter-region consistency. MRFD identifies salient regions using cross-attention, generates initial responses for each, and computes reliability weights based on Jensen-Shannon Divergence (JSD) among the responses. These weights guide a consistency-aware fusion of per-region predictions, using region-aware prompts inspired by Chain-of-Thought reasoning. Experiments across multiple LVLMs and benchmarks show that MRFD significantly reduces hallucinations and improves response factuality without requiring model updates.
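To make the role of Jensen-Shannon Divergence in the MRFD abstract more concrete, the Python sketch below computes pairwise JSD between per-region output distributions, turns low average divergence into high reliability, and fuses the distributions with those weights. The abstract does not give the exact weighting or fusion formula, so the softmax over negative mean divergence, the `temperature` parameter, the weighted-average fusion, and all function names here are assumptions chosen for illustration rather than the authors' implementation.

```python
import math


def kl(p, q):
    """Kullback-Leibler divergence KL(p || q) for discrete distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def jsd(p, q):
    """Jensen-Shannon divergence: symmetric, bounded by log(2)."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def reliability_weights(dists, temperature=1.0):
    """Assumed scheme: regions that agree with the others get larger weights.

    Requires at least two regions. The weight of region i is a softmax over
    the negative mean JSD between region i and every other region.
    """
    n = len(dists)
    mean_div = [
        sum(jsd(dists[i], dists[j]) for j in range(n) if j != i) / (n - 1)
        for i in range(n)
    ]
    scores = [math.exp(-d / temperature) for d in mean_div]
    z = sum(scores)
    return [s / z for s in scores]


def fuse(dists, weights):
    """Weighted average of per-region distributions (a simple stand-in for fusion)."""
    vocab = len(dists[0])
    return [sum(w * d[k] for w, d in zip(weights, dists)) for k in range(vocab)]


# Toy next-token distributions from three image regions (illustrative values).
regions = [
    [0.7, 0.2, 0.1],
    [0.6, 0.3, 0.1],
    [0.1, 0.1, 0.8],  # this region disagrees with the other two
]
weights = reliability_weights(regions)
fused = fuse(regions, weights)
```

In this toy setup the third region receives the smallest weight because it diverges from both of the others, so the fused distribution leans toward the consistent regions.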

[86] Pose-Robust Calibration Strategy for Point-of-Gaze Estimation on Mobile Phones

Yujie Zhao,Jiabei Zeng,Shiguang Shan

Main category: cs.CV

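TL;DR: This paper proposes a dynamic calibration strategy that introduces natural head pose variation during PoG calibration, improving the estimator's robustness and performance under head pose changes.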

Details Motivation: Appearance-based point-of-gaze (PoG) estimation requires person-specific calibration because of individual differences, but existing calibrated estimators are sensitive to head pose variation, so pose-robust calibration strategies need to be explored. Method: The authors build a benchmark, MobilePoG, containing facial images of 32 participants fixating on designated points under fixed or continuously changing head poses, and systematically analyze how the diversity of calibration points and head poses affects estimation accuracy. They propose a dynamic calibration strategy in which users move their phones while fixating on calibration points, naturally introducing head pose variation. Result: Experiments show that introducing a wider range of head poses during calibration improves the estimator's ability to handle pose variation. The dynamic calibration strategy significantly improves PoG estimator performance while remaining efficient and user-friendly. Conclusion: The dynamic calibration strategy makes PoG estimators more robust to head pose variation and is more efficient and user-friendly than conventional calibration. Abstract: Although appearance-based point-of-gaze (PoG) estimation has improved, the estimators still struggle to generalize across individuals due to personal differences. Therefore, person-specific calibration is required for accurate PoG estimation. However, calibrated PoG estimators are often sensitive to head pose variations. To address this, we investigate the key factors influencing calibrated estimators and explore pose-robust calibration strategies. Specifically, we first construct a benchmark, MobilePoG, which includes facial images from 32 individuals focusing on designated points under either fixed or continuously changing head poses. Using this benchmark, we systematically analyze how the diversity of calibration points and head poses influences estimation accuracy. Our experiments show that introducing a wider range of head poses during calibration improves the estimator's ability to handle pose variation. Building on this insight, we propose a dynamic calibration strategy in which users fixate on calibration points while moving their phones. This strategy naturally introduces head pose variation during a user-friendly and efficient calibration process, ultimately producing a better calibrated PoG estimator that is less sensitive to head pose variations than those using conventional calibration strategies. Codes and datasets are available at our project page.

[87] High Fidelity Text to Image Generation with Contrastive Alignment and Structural Guidance

Danyi Gao

Main category: cs.CV

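TL;DR: This paper proposes a high-fidelity text-driven image generation method that combines contrastive learning with a structural guidance mechanism, improving the semantic alignment and structural integrity of generated images without increasing computational complexity.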

Details Motivation: To address the performance bottlenecks of existing text-driven image generation methods in semantic alignment accuracy and structural consistency. Method: The approach introduces a contrastive learning module and structural priors, combining text-image contrastive constraints with structural guidance; the model jointly optimizes contrastive loss, structural consistency loss, and semantic preservation loss. Result: Systematic experiments on the COCO-2014 dataset show superior performance on quantitative metrics including CLIP Score, FID, and SSIM, with sensitivity analyses on embedding dimensions, text length, and structural guidance strength. Conclusion: The proposed high-fidelity image generation method bridges the gap between semantic alignment and structural fidelity without increasing computational complexity, demonstrating a viable technical path for joint text-image modeling and image generation. Abstract: This paper addresses the performance bottlenecks of existing text-driven image generation methods in terms of semantic alignment accuracy and structural consistency. A high-fidelity image generation method is proposed by integrating text-image contrastive constraints with structural guidance mechanisms. The approach introduces a contrastive learning module that builds strong cross-modal alignment constraints to improve semantic matching between text and image. At the same time, structural priors such as semantic layout maps or edge sketches are used to guide the generator in spatial-level structural modeling. This enhances the layout completeness and detail fidelity of the generated images. Within the overall framework, the model jointly optimizes contrastive loss, structural consistency loss, and semantic preservation loss. A multi-objective supervision mechanism is adopted to improve the semantic consistency and controllability of the generated content. Systematic experiments are conducted on the COCO-2014 dataset. Sensitivity analyses are performed on embedding dimensions, text length, and structural guidance strength. Quantitative metrics confirm the superior performance of the proposed method in terms of CLIP Score, FID, and SSIM. The results show that the method effectively bridges the gap between semantic alignment and structural fidelity without increasing computational complexity. It demonstrates a strong ability to generate semantically clear and structurally complete images, offering a viable technical path for joint text-image modeling and image generation.

[88] VIFSS: View-Invariant and Figure Skating-Specific Pose Representation Learning for Temporal Action Segmentation

Ryota Tanaka,Tomohiro Suzuki,Keisuke Fujii

Main category: cs.CV

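TL;DR: This study proposes a new temporal action segmentation framework for figure skating jumps that combines 3D pose learning with procedural structure, improving recognition accuracy, particularly in practical settings with limited data.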

Details Motivation: Recognizing figure skating jumps requires expert-level knowledge, and existing temporal action segmentation methods are limited by scarce annotated data and a lack of 3D structural understanding, so a more effective approach to automated recognition is needed. Method: The paper proposes a view-invariant, figure skating-specific pose representation learning method (VIFSS) that uses contrastive learning for pre-training and action classification for fine-tuning; it also introduces a fine-grained annotation scheme that marks the "entry (preparation)" and "landing" phases of jumps. Result: Experiments show that the method achieves over 92% F1@50 on element-level temporal action segmentation, and that view-invariant contrastive pre-training is particularly effective when fine-tuning data is limited. Conclusion: The paper presents a new temporal action segmentation framework for figure skating jump recognition that combines 3D pose representation learning with the semantic procedural structure of jumps, and demonstrates its practicality and effectiveness when data is limited. Abstract: Understanding human actions from videos plays a critical role across various domains, including sports analytics. In figure skating, accurately recognizing the type and timing of jumps a skater performs is essential for objective performance evaluation. However, this task typically requires expert-level knowledge due to the fine-grained and complex nature of jump procedures. While recent approaches have attempted to automate this task using Temporal Action Segmentation (TAS), there are two major limitations to TAS for figure skating: the annotated data is insufficient, and existing methods do not account for the inherent three-dimensional aspects and procedural structure of jump actions. In this work, we propose a new TAS framework for figure skating jumps that explicitly incorporates both the three-dimensional nature and the semantic procedure of jump movements. First, we propose a novel View-Invariant, Figure Skating-Specific pose representation learning approach (VIFSS) that combines contrastive learning as pre-training and action classification as fine-tuning. For view-invariant contrastive pre-training, we construct FS-Jump3D, the first publicly available 3D pose dataset specialized for figure skating jumps. Second, we introduce a fine-grained annotation scheme that marks the "entry (preparation)" and "landing" phases, enabling TAS models to learn the procedural structure of jumps. Extensive experiments demonstrate the effectiveness of our framework. Our method achieves over 92% F1@50 on element-level TAS, which requires recognizing both jump types and rotation levels. 
Furthermore, we show that view-invariant contrastive pre-training is particularly effective when fine-tuning data is limited, highlighting the practicality of our approach in real-world scenarios.

[89] JRDB-Reasoning: A Difficulty-Graded Benchmark for Visual Reasoning in Robotics

Simindokht Jahangard,Mehrzad Mohammadi,Yi Shen,Zhixi Cai,Hamid Rezatofighi

Main category: cs.CV

TL;DR: This paper presents JRDB-Reasoning, a new benchmark for visual reasoning in crowded environments, featuring formalized complexity and detailed annotations for better evaluation of AI models.

Details Motivation: The motivation is to address the limitations of existing visual reasoning benchmarks, which lack clear definitions of reasoning complexity, customization options, and structured annotations. Method: The authors formalize reasoning complexity, introduce an adaptive query engine for generating customizable questions with intermediate annotations, and extend the JRDB dataset with interaction and geometric annotations to create the JRDB-Reasoning benchmark. Result: The result is the creation of the JRDB-Reasoning benchmark and an adaptive query engine that allows for detailed evaluation and dynamic testing of visual reasoning systems. Conclusion: The paper introduces a formalized reasoning complexity and a benchmark tailored for visual reasoning in human-crowded environments, enabling fine-grained evaluation and dynamic assessment of visual reasoning frameworks and models. Abstract: Recent advances in Vision-Language Models (VLMs) and large language models (LLMs) have greatly enhanced visual reasoning, a key capability for embodied AI agents like robots. However, existing visual reasoning benchmarks often suffer from several limitations: they lack a clear definition of reasoning complexity, offer no control over generating questions of varying difficulty or task customization, and fail to provide structured, step-by-step reasoning annotations (workflows). To bridge these gaps, we formalize reasoning complexity, introduce an adaptive query engine that generates customizable questions of varying complexity with detailed intermediate annotations, and extend the JRDB dataset with human-object interaction and geometric relationship annotations to create JRDB-Reasoning, a benchmark tailored for visual reasoning in human-crowded environments. Our engine and benchmark enable fine-grained evaluation of visual reasoning frameworks and dynamic assessment of visual-language models across reasoning levels.

[90] A Sub-Pixel Multimodal Optical Remote Sensing Images Matching Method

Tao Huang,Hongbo Pan,Nanxi Zhou,Shun Zhou

Main category: cs.CV

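TL;DR: This paper proposes PCWLAD, a sub-pixel template matching method that improves the matching accuracy of multimodal optical images through two steps (coarse matching and fine matching) and outperforms existing methods.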

Details Motivation: Differences in nonlinear radiation and geometric deformation between multimodal optical images usually degrade matching accuracy, which motivates an improved method. Method: The paper proposes a phase consistency weighted least absolute deviation (PCWLAD) sub-pixel template matching method with two main steps: coarse matching with the structural similarity index measure (SSIM) and fine matching with WLAD. Result: PCWLAD outperforms existing state-of-the-art methods in correct matching rate (CMR) and root mean square error (RMSE) on three types of image datasets (visible to infrared Landsat images, visible to near-infrared close-range images, and visible to infrared UAV images), reaching an average matching accuracy of about 0.4 pixels. Conclusion: PCWLAD shows high accuracy and robustness in multimodal optical image matching and outperforms existing state-of-the-art methods. Abstract: High-accuracy matching of multimodal optical images is the basis of geometric processing. However, the image matching accuracy is usually degraded by the nonlinear radiation and geometric deformation differences caused by different spectral responses. To address these problems, we proposed a phase consistency weighted least absolute deviation (PCWLAD) sub-pixel template matching method to improve the matching accuracy of multimodal optical images. This method consists of two main steps: coarse matching with the structural similarity index measure (SSIM) and fine matching with WLAD. In the coarse matching step, PCs are calculated without a noise filter to preserve the original structural details, and template matching is performed using the SSIM. In the fine matching step, we applied the radiometric and geometric transformation models between two multimodal PC templates based on the coarse matching. Furthermore, mutual structure filtering is adopted in the model to mitigate the impact of noise within the corresponding templates on the structural consistency, and the WLAD criterion is used to estimate the sub-pixel offset. To evaluate the performance of PCWLAD, we created three types of image datasets: visible to infrared Landsat images, visible to near-infrared close-range images, and visible to infrared uncrewed aerial vehicle (UAV) images. PCWLAD outperformed eight existing state-of-the-art methods in terms of correct matching rate (CMR) and root mean square error (RMSE) and reached an average matching accuracy of approximately 0.4 pixels across all three datasets. 
Our software and datasets are publicly available at https://github.com/huangtaocsu/PCWLAD.
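The fine matching step in the PCWLAD abstract estimates a sub-pixel offset with a weighted least absolute deviation (WLAD) criterion. The Python sketch below shows only that criterion in its simplest possible setting: a 1-D signal, a pure translation, linear interpolation, and a grid search over candidate shifts. The real method operates on 2-D phase consistency templates with radiometric and geometric transformation models and mutual structure filtering, none of which is reproduced here; the function names, the grid search, and the uniform weights are assumptions for illustration.

```python
import math


def sample(signal, x):
    """Linearly interpolate `signal` at fractional index x (clamped to the valid range)."""
    x = min(max(x, 0.0), len(signal) - 1.0)
    i = min(int(math.floor(x)), len(signal) - 2)
    t = x - i
    return (1 - t) * signal[i] + t * signal[i + 1]


def wlad_cost(ref, tgt, shift, weights):
    """Weighted sum of absolute differences between ref[k] and tgt at k + shift."""
    return sum(
        w * abs(r - sample(tgt, k + shift))
        for k, (r, w) in enumerate(zip(ref, weights))
    )


def best_subpixel_shift(ref, tgt, weights, search=1.0, step=0.1):
    """Grid search for the shift in [-search, search] that minimizes the WLAD cost."""
    n = int(round(search / step))
    candidates = [k * step for k in range(-n, n + 1)]
    return min(candidates, key=lambda s: wlad_cost(ref, tgt, s, weights))


# Toy signals: `ref` is `tgt` displaced by 0.4 samples (illustrative only).
tgt = [math.sin(0.1 * k) for k in range(50)]
ref = [math.sin(0.1 * (k + 0.4)) for k in range(50)]
uniform = [1.0] * len(ref)  # hypothetical weights; a real system would down-weight noisy pixels
shift = best_subpixel_shift(ref, tgt, uniform)
```

Absolute deviations grow linearly rather than quadratically with the residual, so a few badly mismatched samples pull the estimate less than they would under a least-squares criterion, which is a common reason to prefer this kind of cost when the two modalities differ in appearance.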

[91] InterSyn: Interleaved Learning for Dynamic Motion Synthesis in the Wild

Yiyi Ma,Yuanzhi Liang,Xiu Li,Chi Zhang,Xuelong Li

Main category: cs.CV

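TL;DR: InterSyn is a new motion synthesis framework that models solo and interactive behaviors through an interleaved learning strategy, generating more natural and better coordinated multi-character interaction motions.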

Details Motivation: Existing methods usually treat solo and multi-person interaction motions separately, whereas motion in real-world scenes typically involves dynamic interactions and complex coordination, so a new approach is needed to generate interaction motions more realistically. Method: InterSyn comprises two key modules: the Interleaved Interaction Synthesis (INS) module, which jointly models solo and interactive behaviors from a first-person perspective, and the Relative Coordination Refinement (REC) module, which refines mutual dynamics between characters and ensures synchronized motions. Result: Experiments show that motion sequences generated by InterSyn outperform existing methods in text-to-motion alignment and diversity while ensuring synchronized, natural interaction between characters. Conclusion: With its interleaved learning strategy and two key modules, InterSyn makes notable progress in motion synthesis and offers a new direction and benchmark for future research and development. Abstract: We present Interleaved Learning for Motion Synthesis (InterSyn), a novel framework that targets the generation of realistic interaction motions by learning from integrated motions that consider both solo and multi-person dynamics. Unlike previous methods that treat these components separately, InterSyn employs an interleaved learning strategy to capture the natural, dynamic interactions and nuanced coordination inherent in real-world scenarios. Our framework comprises two key modules: the Interleaved Interaction Synthesis (INS) module, which jointly models solo and interactive behaviors in a unified paradigm from a first-person perspective to support multiple character interactions, and the Relative Coordination Refinement (REC) module, which refines mutual dynamics and ensures synchronized motions among characters. Experimental results show that the motion sequences generated by InterSyn exhibit higher text-to-motion alignment and improved diversity compared with recent methods, setting a new benchmark for robust and natural motion synthesis. Additionally, our code will be open-sourced in the future to promote further research and development in this area.

[92] From Pixel to Mask: A Survey of Out-of-Distribution Segmentation

Wenjie Zhao,Jia Li,Yunhui Guo

Main category: cs.CV

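TL;DR: This survey reviews OoD segmentation methods for autonomous driving scenarios, groups them into four categories, and analyzes recent advances, challenges, and future research directions.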

Details Motivation: Conventional OoD detection methods provide no spatial localization, which limits their use in downstream tasks. OoD segmentation localizes anomalous objects at the pixel level, which is crucial for safety-critical applications such as autonomous driving. Method: Existing OoD segmentation methods are grouped into four categories: (i) test-time OoD segmentation, (ii) outlier exposure for supervised training, (iii) reconstruction-based methods, and (iv) approaches that leverage powerful models. Result: The paper categorizes and systematically reviews current OoD segmentation methods and analyzes their recent advances in autonomous driving scenarios. Conclusion: The paper provides a systematic survey of OoD segmentation methods for autonomous driving, organizes them into four categories, and discusses emerging challenges and future research directions. Abstract: Out-of-distribution (OoD) detection and segmentation have attracted growing attention as concerns about AI security rise. Conventional OoD detection methods identify the existence of OoD objects but lack spatial localization, limiting their usefulness in downstream tasks. OoD segmentation addresses this limitation by localizing anomalous objects at pixel-level granularity. This capability is crucial for safety-critical applications such as autonomous driving, where perception modules must not only detect but also precisely segment OoD objects, enabling targeted control actions and enhancing overall system robustness. In this survey, we group current OoD segmentation approaches into four categories: (i) test-time OoD segmentation, (ii) outlier exposure for supervised training, (iii) reconstruction-based methods, and (iv) approaches that leverage powerful models. We systematically review recent advances in OoD segmentation for autonomous-driving scenarios, identify emerging challenges, and discuss promising future research directions.

[93] Integrating Reinforcement Learning with Visual Generative Models: Foundations and Advances

Yuanzhi Liang,Yijie Fang,Rui Li,Ziqi Ni,Ruijie Su,Chi Zhang,Xuelong Li

Main category: cs.CV

TL;DR: This paper surveys the application of reinforcement learning methods to visual content generation, highlighting how RL can improve the alignment of generative models with perceptual quality, semantic accuracy, and physical realism.

Details Motivation: Generative models often use surrogate objectives that misalign with perceptual quality, semantic accuracy, or physical realism. Reinforcement learning offers a framework for optimizing objectives that are non-differentiable, preference-driven, and temporally structured. Method: The paper provides a systematic overview of RL-based methods for visual content generation, reviewing the evolution of RL and its integration into image, video, and 3D/4D generation. Result: Recent advances show that reinforcement learning can enhance controllability, consistency, and human alignment across generative tasks. RL serves both as a fine-tuning mechanism and as a structural component for aligning generation with complex, high-level goals. Conclusion: The paper concludes with a discussion of open challenges and future research directions at the intersection of reinforcement learning and generative modeling. Abstract: Generative models have made significant progress in synthesizing visual content, including images, videos, and 3D/4D structures. However, they are typically trained with surrogate objectives such as likelihood or reconstruction loss, which often misalign with perceptual quality, semantic accuracy, or physical realism. Reinforcement learning (RL) offers a principled framework for optimizing non-differentiable, preference-driven, and temporally structured objectives. Recent advances demonstrate its effectiveness in enhancing controllability, consistency, and human alignment across generative tasks. This survey provides a systematic overview of RL-based methods for visual content generation. We review the evolution of RL from classical control to its role as a general-purpose optimization tool, and examine its integration into image, video, and 3D/4D generation. Across these domains, RL serves not only as a fine-tuning mechanism but also as a structural component for aligning generation with complex, high-level goals. 
We conclude with open challenges and future research directions at the intersection of RL and generative modeling.

[94] Concepts or Skills? Rethinking Instruction Selection for Multi-modal Models

Andrew Bai,Justin Cui,Ruochen Wang,Cho-Jui Hsieh

Main category: cs.CV

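TL;DR: This paper proposes a targeted training data selection method for vision-language instruction tuning that optimizes performance on a given benchmark.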

Details Motivation: The authors find that vision-language benchmarks mainly benefit from training on instructions with either similar skills or similar visual concepts; motivated by this finding, the paper aims to optimize performance on a specific benchmark. Method: A simple targeted training data selection method is designed: it first extracts concepts/skills from the benchmark, determines whether the benchmark benefits predominantly from similar concepts or similar skills, and then selects the instructions with the best-matching concepts/skills. Result: The method is validated on 10+ benchmarks, showing an average gain of 0.9% over the best existing baseline and 1.5% on the skill-focused subset. Conclusion: The paper highlights an inherent trade-off in instruction selection, which requires balancing the acquisition of conceptual knowledge against the acquisition of visual skills. Abstract: Vision-language instruction tuning achieves two main purposes: learning visual concepts and learning visual skills. In this paper, we found that vision-language benchmarks fall into the dichotomy of mainly benefiting from training on instructions with similar skills or visual concepts. Inspired by the discovery, we designed a simple targeted training data selection method to optimize the performance of a given benchmark. We first extract the concepts/skills from the benchmark, determine whether the benchmark predominantly benefits from similar concepts or skills, and finally select instructions with the most matching concepts/skills. Experiments on 10+ benchmarks validate the effectiveness of our targeted data selection method, showing +0.9% over the best existing baseline averaged over all benchmarks and +1.5% on the skill-focused subset. Our findings underscore the importance of recognizing the inherent trade-off within instruction selection, which requires balancing the acquisition of conceptual knowledge against visual skill.

[95] Glo-DMU: A Deep Morphometry Framework of Ultrastructural Characterization in Glomerular Electron Microscopic Images

Zhentai Zhang,Danyi Weng,Guibin Zhang,Xiang Chen,Kaixing Long,Jian Geng,Yanmeng Lu,Lei Zhang,Zhitao Zhou,Lei Cao

Main category: cs.CV

TL;DR: This study develops Glo-DMU, a framework that combines deep learning models for automated, high-precision analysis of glomerular ultrastructure.

Details Motivation: Current analysis of glomerular ultrastructure focuses mainly on recognizing individual structures, which falls short of practical diagnostic needs, so a more comprehensive automated analysis framework is required. Method: The Glo-DMU framework is built on three deep models: an ultrastructure segmentation model, a glomerular filtration barrier region classification model, and an electron-dense deposits detection model. Result: Validated on 115 patients with 9 renal pathological types, Glo-DMU's automatic quantification results are consistent with the descriptions in pathology reports. Conclusion: Glo-DMU is a fully automated, high-precision, high-throughput ultrastructural analysis framework that gives renal pathologists an efficient assistive tool. Abstract: Complex and diverse ultrastructural features can indicate the type, progression, and prognosis of kidney diseases. Recently, computational pathology combined with deep learning methods has shown tremendous potential in advancing automatic morphological analysis of glomerular ultrastructure. However, current research predominantly focuses on the recognition of individual ultrastructure, which makes it challenging to meet practical diagnostic needs. In this study, we propose the glomerular morphometry framework of ultrastructural characterization (Glo-DMU), which is grounded on three deep models: the ultrastructure segmentation model, the glomerular filtration barrier region classification model, and the electron-dense deposits detection model. Following the conventional protocol of renal biopsy diagnosis, this framework simultaneously quantifies the three most widely used ultrastructural features: the thickness of glomerular basement membrane, the degree of foot process effacement, and the location of electron-dense deposits. We evaluated the framework on 115 patients with 9 renal pathological types in real-world diagnostic scenarios, demonstrating good consistency between automatic quantification results and morphological descriptions in the pathological reports. Glo-DMU possesses the characteristics of full automation, high precision, and high throughput, quantifying multiple ultrastructural features simultaneously, and providing an efficient tool for assisting renal pathologists.

[96] Improving OCR for Historical Texts of Multiple Languages

Hylke Westerdijk,Ben Blankenborg,Khondoker Ittehadul Islam

Main category: cs.CV

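TL;DR: This paper explores three tasks in optical character recognition and document layout analysis using deep learning, applying data augmentation, the Kraken and TrOCR models, and CRNNs with DeepLabV3+ and a ResNet34 encoder to improve character recognition accuracy.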

Details Motivation: To improve the accuracy of optical character recognition and document layout analysis across historical and modern texts. Method: Several deep learning models are applied across three tasks, including the Kraken and TrOCR models and CRNNs with DeepLabV3+ and a ResNet34 encoder, with training using the CTC loss function. Result: Applying these models and techniques improved character recognition accuracy. Conclusion: The report offers valuable insights and suggests directions for future research. Abstract: This paper presents our methodology and findings from three tasks across Optical Character Recognition (OCR) and Document Layout Analysis using advanced deep learning techniques. First, for the historical Hebrew fragments of the Dead Sea Scrolls, we enhanced our dataset through extensive data augmentation and employed the Kraken and TrOCR models to improve character recognition. For the 16th- to 18th-century meeting resolutions task, we utilized a Convolutional Recurrent Neural Network (CRNN) that integrated DeepLabV3+ for semantic segmentation with a Bidirectional LSTM, incorporating confidence-based pseudolabeling to refine our model. Finally, for the modern English handwriting recognition task, we applied a CRNN with a ResNet34 encoder, trained using the Connectionist Temporal Classification (CTC) loss function to effectively capture sequential dependencies. This report offers valuable insights and suggests potential directions for future research.
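For readers unfamiliar with Connectionist Temporal Classification, the short Python sketch below shows the standard greedy decoding rule that turns per-frame label predictions from a CTC-trained recognizer into an output sequence: merge consecutive repeats, then drop the blank symbol. This is a generic illustration of CTC and not code from the paper; the function name, the choice of `0` as the blank index, and the toy frame labels are assumptions.

```python
def ctc_greedy_decode(frame_labels, blank=0):
    """Collapse a per-frame label sequence using the CTC rule.

    Consecutive duplicates are merged first; blanks are then removed. A blank
    between two identical labels therefore keeps them as separate characters.
    """
    decoded = []
    previous = None
    for label in frame_labels:
        if label != previous and label != blank:
            decoded.append(label)
        previous = label
    return decoded


# Toy per-frame argmax labels (0 is the assumed blank index).
frames = [0, 1, 1, 0, 1, 2, 2, 0]
decoded = ctc_greedy_decode(frames)
```

Because the blank separates genuine repeats from frame-level duplicates, the network can emit the same character across several time steps without the decoder doubling it, which is what lets CTC train on unsegmented handwriting lines.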

[97] AtomDiffuser: Time-Aware Degradation Modeling for Drift and Beam Damage in STEM Imaging

Hao Wang,Hongkui Zheng,Kai He,Abolfazl Razi

Main category: cs.CV

TL;DR: AtomDiffuser is a framework that disentangles degradation effects in STEM data, enabling interpretable modeling of atomic structural evolutions.

Details Motivation: Interpreting time-resolved STEM data is challenging due to entangled degradation effects like spatial drift and beam-induced signal loss, which current methods struggle to separate. Method: AtomDiffuser predicts an affine transformation and a spatially varying decay map between STEM frames, modeling degradation as a physically heuristic, temporally conditioned process. Result: AtomDiffuser successfully separates degradation effects, supports high-resolution inference, and generalizes well to real-world cryo-STEM data. Conclusion: AtomDiffuser provides a framework for disentangling degradation effects in STEM data, offering insights into radiation-induced atomic instabilities. Abstract: Scanning transmission electron microscopy (STEM) plays a critical role in modern materials science, enabling direct imaging of atomic structures and their evolution under external interferences. However, interpreting time-resolved STEM data remains challenging due to two entangled degradation effects: spatial drift caused by mechanical and thermal instabilities, and beam-induced signal loss resulting from radiation damage. These factors distort both geometry and intensity in complex, temporally correlated ways, making it difficult for existing methods to explicitly separate their effects or model material dynamics at atomic resolution. In this work, we present AtomDiffuser, a time-aware degradation modeling framework that disentangles sample drift and radiometric attenuation by predicting an affine transformation and a spatially varying decay map between any two STEM frames. Unlike traditional denoising or registration pipelines, our method leverages degradation as a physically heuristic, temporally conditioned process, enabling interpretable structural evolutions across time. Trained on synthetic degradation processes, AtomDiffuser also generalizes well to real-world cryo-STEM data. 
It further supports high-resolution degradation inference and drift alignment, offering tools for visualizing and quantifying degradation patterns that correlate with radiation-induced atomic instabilities.

[98] Contrast Sensitivity Function of Multimodal Vision-Language Models

Pablo Hernández-Cámara,Alexandra Gomez-Villa,Jose Manuel Jaén-Lorites,Jorge Vila-Tomás,Jesus Malo,Valero Laparra

Main category: cs.CV

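TL;DR: This study uses a novel behavioral psychophysics method to assess how well multimodal vision-language models align with human perception, finding key differences in their contrast sensitivity functions and raising concerns about prompt stability.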

Details Motivation: Understanding how multimodal vision-language models (VLMs) perceive low-level visual features is essential. A key characteristic of human vision is the contrast sensitivity function (CSF), which describes sensitivity to spatial frequency at low contrast. Method: Model responses are evaluated across multiple architectures using band-pass filtered noise images and a diverse set of prompts. Result: Although some models approximate the shape or the magnitude of the human CSF, none fully replicates both. Notably, prompt phrasing has a large effect on the responses, raising concerns about prompt stability. Conclusion: The results provide a new framework for probing visual sensitivity in multimodal models and reveal key gaps between their visual representations and human perception. Abstract: Assessing the alignment of multimodal vision-language models (VLMs) with human perception is essential to understand how they perceive low-level visual features. A key characteristic of human vision is the contrast sensitivity function (CSF), which describes sensitivity to spatial frequency at low contrast. Here, we introduce a novel behavioral psychophysics-inspired method to estimate the CSF of chat-based VLMs by directly prompting them to judge pattern visibility at different contrasts for each frequency. This methodology is closer to real psychophysics experiments than previously reported approaches. Using band-pass filtered noise images and a diverse set of prompts, we assess model responses across multiple architectures. We find that while some models approximate human-like CSF shape or magnitude, none fully replicate both. Notably, prompt phrasing has a large effect on the responses, raising concerns about prompt stability. Our results provide a new framework for probing visual sensitivity in multimodal models and reveal key gaps between their visual representations and human perception.
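The prompting procedure in the abstract (judge visibility at several contrasts for each frequency) implies a simple way to turn yes/no answers into a sensitivity curve. The sketch below assumes sensitivity is the reciprocal of the lowest contrast judged visible, which is the usual psychophysics definition but is not spelled out in the abstract. `is_visible` is a hypothetical stand-in for rendering a stimulus and prompting a VLM.

```python
def estimate_csf(frequencies, contrasts, is_visible):
    """Return {frequency: sensitivity}, with sensitivity = 1 / threshold contrast.

    is_visible(frequency, contrast) -> bool is a hypothetical callback that
    would show the stimulus to a model and parse its yes/no answer.
    """
    csf = {}
    for f in frequencies:
        # Lowest contrast at which the pattern is reported as visible.
        seen = [c for c in sorted(contrasts) if is_visible(f, c)]
        csf[f] = 1.0 / seen[0] if seen else 0.0  # 0.0: never seen
    return csf


# Toy observer with fixed thresholds, standing in for a real model.
toy_thresholds = {1: 0.1, 4: 0.02}
toy_observer = lambda f, c: c >= toy_thresholds[f]
curve = estimate_csf([1, 4], [0.01, 0.02, 0.05, 0.1, 0.5], toy_observer)
```

Because the callback is the only model-dependent part, the same loop can be rerun with different prompt phrasings to check the prompt-stability concern the authors raise.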

[99] Towards Spatially Consistent Image Generation: On Incorporating Intrinsic Scene Properties into Diffusion Models

Hyundo Lee,Suhyung Choi,Byoung-Tak Zhang,Inwoo Hwang

Main category: cs.CV

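TL;DR: This paper proposes a method that uses intrinsic scene properties to improve the spatial consistency of image generation, improving results by co-generating images and their intrinsic properties.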

Details Motivation: Existing image generation models lack sufficient understanding of scene structure and spatial layout, so they tend to produce spatially inconsistent and distorted images. Method: Intrinsic scene properties are extracted from a large-scale image dataset and aggregated into a single latent variable with an autoencoder; building on a pre-trained latent diffusion model, the image and the intrinsic properties are then denoised simultaneously. Result: Experiments show that the method corrects spatial inconsistencies and produces more natural layouts while maintaining image quality and text alignment. Conclusion: By co-generating images and their corresponding intrinsic properties, the method implicitly captures scene structure and generates more spatially consistent, more natural images. Abstract: Image generation models trained on large datasets can synthesize high-quality images but often produce spatially inconsistent and distorted images due to limited information about the underlying structures and spatial layouts. In this work, we leverage intrinsic scene properties (e.g., depth, segmentation maps) that provide rich information about the underlying scene, unlike prior approaches that solely rely on image-text pairs or use intrinsics as conditional inputs. Our approach aims to co-generate both images and their corresponding intrinsics, enabling the model to implicitly capture the underlying scene structure and generate more spatially consistent and realistic images. Specifically, we first extract rich intrinsic scene properties from a large image dataset with pre-trained estimators, eliminating the need for additional scene information or explicit 3D representations. We then aggregate various intrinsic scene properties into a single latent variable using an autoencoder. Building upon pre-trained large-scale Latent Diffusion Models (LDMs), our method simultaneously denoises the image and intrinsic domains by carefully sharing mutual information so that the image and intrinsic reflect each other without degrading image quality. Experimental results demonstrate that our method corrects spatial inconsistencies and produces a more natural layout of scenes while maintaining the fidelity and textual alignment of the base model (e.g., Stable Diffusion).

[100] Unlocking Robust Semantic Segmentation Performance via Label-only Elastic Deformations against Implicit Label Noise

Yechan Kim,Dongho Yoon,Younkwan Lee,Unse Fatima,Hong Kook Kim,Songjae Lee,Sanga Park,Jeong Ho Park,Seonjong Kang,Moongu Jeon

Main category: cs.CV

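TL;DR: This paper proposes NSegment+, a data augmentation method that decouples image and label transformations to mitigate implicit label noise in semantic segmentation, significantly improving model performance.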

Details Motivation: Real-world datasets contain implicit label imperfections, such as ambiguous object boundaries and annotator variability; conventional data augmentation may amplify this latent noise and limit the model's generalization. Method: NSegment+ applies controlled elastic deformations only to the segmentation labels while leaving the original images unchanged, encouraging the model to learn more robust representations of object structure. Result: NSegment+ improves mIoU by 2.29, 2.38, 1.75, and 3.39 on average on Vaihingen, LoveDA, Cityscapes, and PASCAL VOC, respectively, indicating that it handles implicit label noise effectively. Conclusion: NSegment+ is a new augmentation framework that handles implicit label noise in semantic segmentation by decoupling image and label transformations, effectively improving model performance. Abstract: While previous studies on image segmentation focus on handling severe (or explicit) label noise, real-world datasets also exhibit subtle (or implicit) label imperfections. These arise from inherent challenges, such as ambiguous object boundaries and annotator variability. Although not explicitly present, such mild and latent noise can still impair model performance. Typical data augmentation methods, which apply identical transformations to the image and its label, risk amplifying these subtle imperfections and limiting the model's generalization capacity. In this paper, we introduce NSegment+, a novel augmentation framework that decouples image and label transformations to address such realistic noise for semantic segmentation. By introducing controlled elastic deformations only to segmentation labels while preserving the original images, our method encourages models to focus on learning robust representations of object structures despite minor label inconsistencies. Extensive experiments demonstrate that NSegment+ consistently improves performance, achieving mIoU gains of up to +2.29, +2.38, +1.75, and +3.39 on average on Vaihingen, LoveDA, Cityscapes, and PASCAL VOC, respectively, even without bells and whistles, highlighting the importance of addressing implicit label noise. These gains can be further amplified when combined with other training tricks, including CutMix and Label Smoothing.
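A label-only elastic deformation can be illustrated as follows. This is a simplified sketch, not NSegment+ itself: the box-blurred random displacement field, the `alpha` and `radius` parameters, and nearest-neighbour resampling are assumptions chosen to keep the example dependency-free. The one property taken from the abstract is that the image is returned untouched while only the label map is warped.

```python
import random


def smooth(field, radius):
    """Box-blur a 2D list so neighbouring displacements vary gradually."""
    h, w = len(field), len(field[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            vals = [field[j][i]
                    for j in range(max(0, y - radius), min(h, y + radius + 1))
                    for i in range(max(0, x - radius), min(w, x + radius + 1))]
            out[y][x] = sum(vals) / len(vals)
    return out


def deform_label_only(image, label, alpha=2.0, radius=2, seed=None):
    """Elastically warp `label` (H x W class ids); return `image` unchanged."""
    rng = random.Random(seed)
    h, w = len(label), len(label[0])
    noise = lambda: [[rng.uniform(-1, 1) for _ in range(w)] for _ in range(h)]
    dx, dy = smooth(noise(), radius), smooth(noise(), radius)
    warped = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            # Nearest-neighbour lookup keeps class ids discrete.
            sx = min(w - 1, max(0, round(x + alpha * dx[y][x])))
            sy = min(h - 1, max(0, round(y + alpha * dy[y][x])))
            warped[y][x] = label[sy][sx]
    return image, warped


label = [[0, 0, 0, 1, 1, 1] for _ in range(6)]
image = [[0.5] * 6 for _ in range(6)]
same_image, noisy_label = deform_label_only(image, label, alpha=3.0, seed=0)
```

Nearest-neighbour resampling matters here: interpolating class ids would invent labels that do not exist in the original mask.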

[101] PQ-DAF: Pose-driven Quality-controlled Data Augmentation for Data-scarce Driver Distraction Detection

Haibin Sun,Xinghui Song

Main category: cs.CV

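TL;DR: To address the poor generalization of driver distraction detection models in real-world deployment, this paper proposes PQ-DAF, a pose-driven data augmentation framework built on a vision-language model and a progressive conditional diffusion model, which improves performance under data scarcity by generating diverse, high-quality training samples.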

Details Motivation: Existing driver distraction detection models generalize poorly in real-world deployment, mainly because of the few-shot learning challenge caused by the high cost of data annotation in practical environments and the substantial domain shift between training datasets and target deployment conditions. Method: The paper proposes a Pose-driven Quality-controlled Data Augmentation Framework (PQ-DAF) that uses a vision-language model for sample filtering to expand the training data and strengthen cross-domain robustness. Specifically, a Progressive Conditional Diffusion Model (PCDMs) captures driver pose features and generates diverse training samples, and the CogVLM vision-language model assesses sample quality and filters out low-quality synthetic samples. Result: PQ-DAF substantially improves performance in few-shot driver distraction detection, with significant gains in generalization under data-scarce conditions. Conclusion: PQ-DAF effectively improves the generalization of driver distraction detection models under data scarcity; its vision-language-model-based quality assessment module ensures the reliability of the synthetic dataset and addresses the domain shift between training datasets and target deployment conditions. Abstract: Driver distraction detection is essential for improving traffic safety and reducing road accidents. However, existing models often suffer from degraded generalization when deployed in real-world scenarios. This limitation primarily arises from the few-shot learning challenge caused by the high cost of data annotation in practical environments, as well as the substantial domain shift between training datasets and target deployment conditions. To address these issues, we propose a Pose-driven Quality-controlled Data Augmentation Framework (PQ-DAF) that leverages a vision-language model for sample filtering to cost-effectively expand training data and enhance cross-domain robustness. Specifically, we employ a Progressive Conditional Diffusion Model (PCDMs) to accurately capture key driver pose features and synthesize diverse training examples. A sample quality assessment module, built upon the CogVLM vision-language model, is then introduced to filter out low-quality synthetic samples based on a confidence threshold, ensuring the reliability of the augmented dataset. Extensive experiments demonstrate that PQ-DAF substantially improves performance in few-shot driver distraction detection, achieving significant gains in model generalization under data-scarce conditions.

[102] Translation of Text Embedding via Delta Vector to Suppress Strongly Entangled Content in Text-to-Image Diffusion Models

Eunseo Koh,Seunghoo Hong,Tae-Young Kim,Simon S. Woo,Jae-Pil Heo

Main category: cs.CV

TL;DR: This paper proposes a novel approach to suppress undesired content in Text-to-Image (T2I) diffusion models by modifying the text embedding space using a delta vector, enabling more effective suppression of strongly entangled content.

Details Motivation: The motivation is to address the challenge faced by T2I diffusion models in suppressing content that is strongly entangled with specific words, such as generating a mustache when the prompt is Charlie Chaplin, even when explicitly instructed not to include it. Method: The method involves introducing a delta vector to modify the text embedding space, weakening the influence of undesired content. They also proposed Selective Suppression with Delta Vector (SSDV) to adapt this delta vector into the cross-attention mechanism for more effective suppression of unwanted content. Additionally, they optimized the delta vector for precise suppression in personalized T2I models. Result: Extensive experimental results demonstrate that their approach significantly outperforms existing methods in suppressing entangled content in T2I models, both quantitatively and qualitatively. Conclusion: The paper concludes that their proposed method, Selective Suppression with Delta Vector (SSDV), significantly outperforms existing methods in suppressing undesired content in Text-to-Image (T2I) diffusion models, both in terms of quantitative and qualitative metrics. Abstract: Text-to-Image (T2I) diffusion models have made significant progress in generating diverse high-quality images from textual prompts. However, these models still face challenges in suppressing content that is strongly entangled with specific words. For example, when generating an image of "Charlie Chaplin", a "mustache" consistently appears even if explicitly instructed not to include it, as the concept of "mustache" is strongly entangled with "Charlie Chaplin". To address this issue, we propose a novel approach to directly suppress such entangled content within the text embedding space of diffusion models. 
Our method introduces a delta vector that modifies the text embedding to weaken the influence of undesired content in the generated image, and we further demonstrate that this delta vector can be easily obtained through a zero-shot approach. Furthermore, we propose a Selective Suppression with Delta Vector (SSDV) method to adapt delta vector into the cross-attention mechanism, enabling more effective suppression of unwanted content in regions where it would otherwise be generated. Additionally, we enabled more precise suppression in personalized T2I models by optimizing delta vector, which previous baselines were unable to achieve. Extensive experimental results demonstrate that our approach significantly outperforms existing methods, both in terms of quantitative and qualitative metrics.
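The idea of translating a text embedding by a delta vector can be illustrated with plain vector arithmetic. The abstract does not say how the zero-shot delta is obtained, so the sketch below makes a loud assumption: the delta is taken as the difference between the embedding of a prompt that mentions the unwanted concept and one that does not. Both function names and the `strength` parameter are hypothetical, and the cross-attention step of SSDV is not modelled.

```python
def zero_shot_delta(with_concept, without_concept):
    """ASSUMPTION: delta = embedding(prompt + concept) - embedding(prompt)."""
    return [a - b for a, b in zip(with_concept, without_concept)]


def translate_embedding(embedding, delta, strength=1.0):
    """Shift the embedding away from the unwanted concept direction."""
    return [e - strength * d for e, d in zip(embedding, delta)]


# Toy 2-D embeddings; real text embeddings have hundreds of dimensions.
delta = zero_shot_delta([1.0, 2.0], [0.5, 2.0])
suppressed = translate_embedding([1.0, 1.0], delta, strength=2.0)
```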

[103] SC-Lane: Slope-aware and Consistent Road Height Estimation Framework for 3D Lane Detection

Chaesong Park,Eunbin Seo,Jihyeon Hwang,Jongwoo Lim

Main category: cs.CV

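TL;DR: SC-Lane is a new 3D lane detection framework that significantly improves the accuracy and stability of road height estimation through adaptive fusion of multi-slope features and a temporal consistency module.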

Details Motivation: To address the shortcomings of methods that rely on fixed slope anchors when faced with diverse road geometries, and to make road height estimation more robust. Method: The SC-Lane framework comprises a Slope-Aware Adaptive Feature module, which dynamically predicts weights for fusing multi-slope representations, and a Height Consistency Module, which enforces temporal consistency. Result: On the OpenLane benchmark, SC-Lane significantly outperforms existing methods in mean absolute error, root mean squared error, and threshold-based accuracy, reaching an F-score of 64.3%. Conclusion: SC-Lane delivers significant improvements in road height estimation and 3D lane detection and achieves state-of-the-art performance on the OpenLane benchmark. Abstract: In this paper, we introduce SC-Lane, a novel slope-aware and temporally consistent heightmap estimation framework for 3D lane detection. Unlike previous approaches that rely on fixed slope anchors, SC-Lane adaptively determines the fusion of slope-specific height features, improving robustness to diverse road geometries. To achieve this, we propose a Slope-Aware Adaptive Feature module that dynamically predicts the appropriate weights from image cues for integrating multi-slope representations into a unified heightmap. Additionally, a Height Consistency Module enforces temporal coherence, ensuring stable and accurate height estimation across consecutive frames, which is crucial for real-world driving scenarios. To evaluate the effectiveness of SC-Lane, we employ three standardized metrics (Mean Absolute Error (MAE), Root Mean Squared Error (RMSE), and threshold-based accuracy) which, although common in surface and depth estimation, have been underutilized for road height assessment. Using the LiDAR-derived heightmap dataset introduced in prior work [20], we benchmark our method under these metrics, thereby establishing a rigorous standard for future comparisons. Extensive experiments on the OpenLane benchmark demonstrate that SC-Lane significantly improves both height estimation and 3D lane detection, achieving state-of-the-art performance with an F-score of 64.3%, outperforming existing methods by a notable margin. For detailed results and a demonstration video, please refer to our project page: https://parkchaesong.github.io/sclane/
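The three evaluation metrics named in the abstract are straightforward to compute on flattened heightmaps. The sketch below uses the standard definitions of MAE and RMSE. For threshold-based accuracy it assumes "fraction of cells whose absolute error is below a threshold"; the paper may define it differently (depth estimation often uses a ratio test instead), so treat that function as illustrative.

```python
import math


def mae(pred, gt):
    """Mean absolute error between two equal-length height lists."""
    return sum(abs(p - g) for p, g in zip(pred, gt)) / len(gt)


def rmse(pred, gt):
    """Root mean squared error; penalises large height errors more."""
    return math.sqrt(sum((p - g) ** 2 for p, g in zip(pred, gt)) / len(gt))


def threshold_accuracy(pred, gt, threshold):
    """ASSUMED definition: share of cells with |error| below `threshold`."""
    hits = sum(1 for p, g in zip(pred, gt) if abs(p - g) < threshold)
    return hits / len(gt)


pred = [0.0, 1.0, 2.0]   # toy predicted heights in metres
gt = [0.0, 1.5, 1.0]     # toy reference heights in metres
scores = (mae(pred, gt), rmse(pred, gt), threshold_accuracy(pred, gt, 0.6))
```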

[104] NanoControl: A Lightweight Framework for Precise and Efficient Control in Diffusion Transformer

Shanyuan Liu,Jian Zhu,Junda Lu,Yue Gong,Liuzhuozheng Li,Bo Cheng,Yuhang Ma,Liebucha Wu,Xiaoyu Wu,Dawei Leng,Yuhui Yin

Main category: cs.CV

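TL;DR: This paper proposes NanoControl, an efficient controllable text-to-image generation model that uses a LoRA-style control module and a KV-Context Augmentation mechanism to greatly reduce computational overhead while preserving generation quality.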

Details Motivation: Existing ControlNet-based methods introduce significant parameter overhead and computational cost. Method: The authors design a LoRA-style control module and introduce a KV-Context Augmentation mechanism. Result: The model adds only 0.024% to the parameter count and 0.029% to GFLOPs while achieving state-of-the-art controllable text-to-image generation performance. Conclusion: NanoControl achieves efficient controllable text-to-image generation while preserving generation quality and controllability. Abstract: Diffusion Transformers (DiTs) have demonstrated exceptional capabilities in text-to-image synthesis. However, in the domain of controllable text-to-image generation using DiTs, most existing methods still rely on the ControlNet paradigm originally designed for UNet-based diffusion models. This paradigm introduces significant parameter overhead and increased computational costs. To address these challenges, we propose the Nano Control Diffusion Transformer (NanoControl), which employs Flux as the backbone network. Our model achieves state-of-the-art controllable text-to-image generation performance while incurring only a 0.024% increase in parameter count and a 0.029% increase in GFLOPs, thus enabling highly efficient controllable generation. Specifically, rather than duplicating the DiT backbone for control, we design a LoRA-style (low-rank adaptation) control module that directly learns control signals from raw conditioning inputs. Furthermore, we introduce a KV-Context Augmentation mechanism that integrates condition-specific key-value information into the backbone in a simple yet highly effective manner, facilitating deep fusion of conditional features. Extensive benchmark experiments demonstrate that NanoControl significantly reduces computational overhead compared to conventional control approaches, while maintaining superior generation quality and achieving improved controllability.
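To see why a LoRA-style module adds so few parameters, it helps to work through the arithmetic: a rank-r adapter on a d_in x d_out linear layer adds r * (d_in + d_out) weights instead of d_in * d_out. The layer sizes and rank below are made-up illustration values and do not correspond to NanoControl's or Flux's actual configuration.

```python
def lora_params(d_in, d_out, rank):
    """Weights in a low-rank adapter: a (d_in x rank) and a (rank x d_out) matrix."""
    return rank * (d_in + d_out)


def overhead_percent(base_params, added_params):
    """Added parameters as a percentage of the frozen base model."""
    return 100.0 * added_params / base_params


# Hypothetical single layer: 1024 x 1024 weights, rank-4 adapter.
base = 1024 * 1024
added = lora_params(1024, 1024, rank=4)
percent = overhead_percent(base, added)
```

Duplicating the layer, as a ControlNet-style copy would, costs 100% of the base layer, whereas the rank-4 adapter in this toy case stays under 1%.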

[105] STRIDE-QA: Visual Question Answering Dataset for Spatiotemporal Reasoning in Urban Driving Scenes

Keishi Ishihara,Kento Sasaki,Tsubasa Takahashi,Daiki Shiono,Yu Yamaguchi

Main category: cs.CV

TL;DR: STRIDE-QA is a large-scale dataset designed to enhance the spatiotemporal reasoning capabilities of Vision-Language Models in autonomous driving scenarios, demonstrating significant performance improvements over existing models.

Details Motivation: Vision-Language Models (VLMs) have limitations in spatiotemporal reasoning for dynamic traffic scenes due to training on static image-text pairs. STRIDE-QA addresses this gap. Method: Constructed from 100 hours of multi-sensor driving data in Tokyo, STRIDE-QA offers 16 million QA pairs over 285K frames with dense annotations like 3D bounding boxes and multi-object tracks. It supports object-centric and ego-centric reasoning through novel QA tasks. Result: VLMs fine-tuned on STRIDE-QA achieved 55% success in spatial localization and 28% consistency in future motion prediction, significantly outperforming general-purpose VLMs which scored near-zero. Conclusion: STRIDE-QA establishes a comprehensive foundation for developing more reliable VLMs for safety-critical autonomous systems. Abstract: Vision-Language Models (VLMs) have been applied to autonomous driving to support decision-making in complex real-world scenarios. However, their training on static, web-sourced image-text pairs fundamentally limits the precise spatiotemporal reasoning required to understand and predict dynamic traffic scenes. We address this critical gap with STRIDE-QA, a large-scale visual question answering (VQA) dataset for physically grounded reasoning from an ego-centric perspective. Constructed from 100 hours of multi-sensor driving data in Tokyo, capturing diverse and challenging conditions, STRIDE-QA is the largest VQA dataset for spatiotemporal reasoning in urban driving, offering 16 million QA pairs over 285K frames. Grounded by dense, automatically generated annotations including 3D bounding boxes, segmentation masks, and multi-object tracks, the dataset uniquely supports both object-centric and ego-centric reasoning through three novel QA tasks that require spatial localization and temporal prediction. Our benchmarks demonstrate that existing VLMs struggle significantly, achieving near-zero scores on prediction consistency. 
In contrast, VLMs fine-tuned on STRIDE-QA exhibit dramatic performance gains, achieving 55% success in spatial localization and 28% consistency in future motion prediction, compared to near-zero scores from general-purpose VLMs. Therefore, STRIDE-QA establishes a comprehensive foundation for developing more reliable VLMs for safety-critical autonomous systems.

[106] CRISP: Contrastive Residual Injection and Semantic Prompting for Continual Video Instance Segmentation

Baichen Liu,Qi Lyu,Xudong Wang,Jiahua Dong,Lianqing Liu,Zhi Han

Main category: cs.CV

TL;DR: This paper proposes CRISP, a novel method for continual video instance segmentation that effectively addresses instance-wise, category-wise, and task-wise confusion, leading to improved performance on long-term tasks while avoiding catastrophic forgetting.

Details Motivation: Continual video instance segmentation requires the ability to learn new object categories while retaining previously learned ones and maintaining temporal consistency across video frames. Existing methods face challenges in addressing instance-wise, category-wise, and task-wise confusion. Method: The authors proposed CRISP, which includes Contrastive Residual Injection and Semantic Prompting. This approach addresses instance-wise, category-wise, and task-wise confusion through instance correlation loss, an adaptive residual semantic prompt (ARSP) learning framework, semantic consistency loss based on contrastive learning, and an initialization strategy for incremental prompts. Result: Experiments on YouTube-VIS-2019 and YouTube-VIS-2021 datasets demonstrated that CRISP significantly outperforms existing continual segmentation methods in long-term continual video instance segmentation tasks, avoiding catastrophic forgetting and improving segmentation and classification performance. Conclusion: CRISP effectively addresses the challenges of continual video instance segmentation by enhancing plasticity and stability, while preserving temporal consistency across frames. Abstract: Continual video instance segmentation demands both the plasticity to absorb new object categories and the stability to retain previously learned ones, all while preserving temporal consistency across frames. In this work, we introduce Contrastive Residual Injection and Semantic Prompting (CRISP), an earlier attempt tailored to address the instance-wise, category-wise, and task-wise confusion in continual video instance segmentation. For instance-wise learning, we model instance tracking and construct instance correlation loss, which emphasizes the correlation with the prior query space while strengthening the specificity of the current task query. 
For category-wise learning, we build an adaptive residual semantic prompt (ARSP) learning framework, which constructs a learnable semantic residual prompt pool generated by category text and uses an adjustive query-prompt matching mechanism to build a mapping relationship between the query of the current task and the semantic residual prompt. Meanwhile, a semantic consistency loss based on the contrastive learning is introduced to maintain semantic coherence between object queries and residual prompts during incremental training. For task-wise learning, to ensure the correlation at the inter-task level within the query space, we introduce a concise yet powerful initialization strategy for incremental prompts. Extensive experiments on YouTube-VIS-2019 and YouTube-VIS-2021 datasets demonstrate that CRISP significantly outperforms existing continual segmentation methods in the long-term continual video instance segmentation task, avoiding catastrophic forgetting and effectively improving segmentation and classification performance. The code is available at https://github.com/01upup10/CRISP.

[107] DOD-SA: Infrared-Visible Decoupled Object Detection with Single-Modality Annotations

Hang Jin,Chenqiang Gao,Junjie Guo,Fangcen Liu,Kanghui Tian,Qinyao Chang

Main category: cs.CV

TL;DR: This paper proposes DOD-SA, an infrared-visible object detection framework that reduces annotation costs by using single-modality annotations while achieving superior performance.

Details Motivation: Existing infrared-visible object detection methods require costly dual-modality annotations, which this study aims to reduce by proposing a framework that works with single-modality annotations. Method: The method involves a teacher-student network (CoSD-TSNet) with single- and dual-modality branches, a progressive training strategy (PaST), and a pseudo-label assigner (PLA) to improve detection performance with single-modality annotations. Result: Experiments on the DroneVehicle dataset show that the proposed method outperforms state-of-the-art techniques in infrared-visible object detection. Conclusion: The proposed DOD-SA framework, along with CoSD-TSNet, PaST, and PLA, demonstrates superior performance over existing methods in infrared-visible object detection with reduced annotation costs. Abstract: Infrared-visible object detection has shown great potential in real-world applications, enabling robust all-day perception by leveraging the complementary information of infrared and visible images. However, existing methods typically require dual-modality annotations to output detection results for both modalities during prediction, which incurs high annotation costs. To address this challenge, we propose a novel infrared-visible Decoupled Object Detection framework with Single-modality Annotations, called DOD-SA. The architecture of DOD-SA is built upon a Single- and Dual-Modality Collaborative Teacher-Student Network (CoSD-TSNet), which consists of a single-modality branch (SM-Branch) and a dual-modality decoupled branch (DMD-Branch). The teacher model generates pseudo-labels for the unlabeled modality, simultaneously supporting the training of the student model. The collaborative design enables cross-modality knowledge transfer from the labeled modality to the unlabeled modality, and facilitates effective SM-to-DMD branch supervision. 
To further improve the decoupling ability of the model and the pseudo-label quality, we introduce a Progressive and Self-Tuning Training Strategy (PaST) that trains the model in three stages: (1) pretraining SM-Branch, (2) guiding the learning of DMD-Branch by SM-Branch, and (3) refining DMD-Branch. In addition, we design a Pseudo Label Assigner (PLA) to align and pair labels across modalities, explicitly addressing modality misalignment during training. Extensive experiments on the DroneVehicle dataset demonstrate that our method outperforms state-of-the-art (SOTA) methods.

[108] SkeySpot: Automating Service Key Detection for Digital Electrical Layout Plans in the Construction Industry

Dhruv Dosi,Rohit Meena,Param Rajpura,Yogesh Kumar Meena

Main category: cs.CV

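TL;DR: This paper proposes an automated symbol detection approach: it introduces the labelled Digitised Electrical Layout Plans (DELP) dataset and SkeySpot, an open-source YOLOv8-based toolkit for real-time detection, classification, and quantification of electrical symbols, which reduces dependency on proprietary CAD systems and manual annotation effort and makes digitising electrical layouts more accessible to small and medium-sized construction enterprises.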

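Details Motivation: Legacy floor plans are often preserved only as scanned documents, yet they remain essential resources for architecture, urban planning, and facility management. The lack of machine-readable floor plans makes large-scale interpretation both time-consuming and error-prone. Automated symbol detection offers a scalable solution by identifying service key symbols directly from floor plans, supporting workflows such as cost estimation, infrastructure maintenance, and regulatory compliance. Method: The paper introduces a labelled Digitised Electrical Layout Plans (DELP) dataset comprising 45 scanned electrical layout plans annotated across 34 distinct service key classes. Pretrained object detection models are systematically evaluated on DELP; YOLOv8 performs best, with a mean Average Precision (mAP) of 82.5%. Building on YOLOv8, the authors develop SkeySpot, a lightweight open-source toolkit for real-time detection, classification, and quantification of electrical symbols. Result: YOLOv8 achieves the best performance on DELP, with an mAP of 82.5%. SkeySpot detects, classifies, and quantifies electrical symbols in real time and produces structured, standardised outputs that can be scaled up for interoperable building information workflows, ultimately enabling compatibility across downstream applications and regulatory platforms. Conclusion: By lowering dependency on proprietary CAD systems and reducing manual annotation effort, the approach makes digitising electrical layouts more accessible to small and medium-sized construction enterprises, while supporting broader goals of standardisation, interoperability, and sustainability in the built environment. Abstract: Legacy floor plans, often preserved only as scanned documents, remain essential resources for architecture, urban planning, and facility management in the construction industry. However, the lack of machine-readable floor plans renders large-scale interpretation both time-consuming and error-prone. Automated symbol spotting offers a scalable solution by enabling the identification of service key symbols directly from floor plans, supporting workflows such as cost estimation, infrastructure maintenance, and regulatory compliance. This work introduces a labelled Digitised Electrical Layout Plans (DELP) dataset comprising 45 scanned electrical layout plans annotated with 2,450 instances across 34 distinct service key classes. A systematic evaluation framework is proposed using pretrained object detection models for the DELP dataset. Among the models benchmarked, YOLOv8 achieves the highest performance with a mean Average Precision (mAP) of 82.5%. Using YOLOv8, we develop SkeySpot, a lightweight, open-source toolkit for real-time detection, classification, and quantification of electrical symbols. SkeySpot produces structured, standardised outputs that can be scaled up for interoperable building information workflows, ultimately enabling compatibility across downstream applications and regulatory platforms. 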
By lowering dependency on proprietary CAD systems and reducing manual annotation effort, this approach makes the digitisation of electrical layouts more accessible to small and medium-sized enterprises (SMEs) in the construction industry, while supporting broader goals of standardisation, interoperability, and sustainability in the built environment.

[109] From Images to Perception: Emergence of Perceptual Properties by Reconstructing Images

Pablo Hernández-Cámara,Jesus Malo,Valero Laparra

Main category: cs.CV

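TL;DR: The study shows that the bio-inspired model PerceptNet can effectively learn human perceptual metrics from image reconstruction tasks without human supervision.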

Details Motivation: Human visual perception may emerge from image statistics, which shape efficient neural representations in early vision. Method: PerceptNet is optimized end-to-end on different tasks related to image reconstruction: autoencoding, denoising, deblurring, and sparsity regularization. Result: The encoder stage (a V1-like layer) shows the highest correlation with human perceptual judgments of image distortion, even though no perceptual information is used in initialization or training. Conclusion: Bio-inspired models can learn perceptual metrics without human supervision, which suggests that the visual system may be tuned to remove particular levels of distortion and sparsity. Abstract: A number of scientists suggested that human visual perception may emerge from image statistics, shaping efficient neural representations in early vision. In this work, a bio-inspired architecture that can accommodate several known facts in the retina-V1 cortex, the PerceptNet, has been end-to-end optimized for different tasks related to image reconstruction: autoencoding, denoising, deblurring, and sparsity regularization. Our results show that the encoder stage (V1-like layer) consistently exhibits the highest correlation with human perceptual judgments on image distortion despite not using perceptual information in the initialization or training. This alignment exhibits an optimum for moderate noise, blur and sparsity. These findings suggest that the visual system may be tuned to remove those particular levels of distortion with that level of sparsity and that biologically inspired models can learn perceptual metrics without human supervision.
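Alignment with human judgments is usually reported as a correlation between model-computed distances and human ratings. The abstract does not name the coefficient, so the sketch below assumes Pearson correlation purely for illustration; the numbers are toy values, not results from the paper.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


# Toy data: distances in some feature space vs. human distortion ratings.
model_distances = [0.1, 0.4, 0.5, 0.9]
human_ratings = [1.0, 2.5, 3.0, 4.8]
alignment = pearson(model_distances, human_ratings)
```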

[110] Trajectory-aware Shifted State Space Models for Online Video Super-Resolution

Qiang Zhu,Xiandong Meng,Yuxian Jiang,Fan Zhang,David Bull,Shuyuan Zhu,Bing Zeng

Main category: cs.CV

TL;DR: This paper presents TS-Mamba, a novel method for online video super-resolution that improves computational efficiency and performance by leveraging long-term trajectory modeling and low-complexity Mamba for spatio-temporal information aggregation.

Details Motivation: Most existing online VSR methods solely employ one neighboring previous frame to achieve temporal alignment, which limits long-range temporal modeling of videos. Method: A novel online VSR method based on Trajectory-aware Shifted SSMs (TS-Mamba), leveraging both long-term trajectory modeling and low-complexity Mamba to achieve efficient spatio-temporal information aggregation. Result: Extensive experiments on three widely used VSR test datasets demonstrate that compared with six online VSR benchmark models, TS-Mamba achieves state-of-the-art performance in most cases. Conclusion: TS-Mamba achieves state-of-the-art performance in most cases and over 22.7% complexity reduction (in MACs). Abstract: Online video super-resolution (VSR) is an important technique for many real-world video processing applications, which aims to restore the current high-resolution video frame based on temporally previous frames. Most of the existing online VSR methods solely employ one neighboring previous frame to achieve temporal alignment, which limits long-range temporal modeling of videos. Recently, state space models (SSMs) have been proposed with linear computational complexity and a global receptive field, which significantly improve computational efficiency and performance. In this context, this paper presents a novel online VSR method based on Trajectory-aware Shifted SSMs (TS-Mamba), leveraging both long-term trajectory modeling and low-complexity Mamba to achieve efficient spatio-temporal information aggregation. Specifically, TS-Mamba first constructs the trajectories within a video to select the most similar tokens from the previous frames. Then, a Trajectory-aware Shifted Mamba Aggregation (TSMA) module consisting of proposed shifted SSMs blocks is employed to aggregate the selected tokens. 
The shifted SSM blocks are designed based on Hilbert scans and corresponding shift operations to compensate for scanning losses and strengthen the spatial continuity of Mamba. Additionally, we propose a trajectory-aware loss function to supervise the trajectory generation, ensuring the accuracy of token selection when training our model. Extensive experiments on three widely used VSR test datasets demonstrate that compared with six online VSR benchmark models, our TS-Mamba achieves state-of-the-art performance in most cases and over 22.7% complexity reduction (in MACs). The source code for TS-Mamba will be available at https://github.com.
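A Hilbert scan orders the cells of a 2D grid so that consecutive positions in the 1D sequence are always spatial neighbours, which is why it helps a sequence model such as Mamba keep spatial continuity. The sketch below is the standard index-to-coordinate conversion for a Hilbert curve on an n x n grid (n a power of two); it does not model the paper's shift operations or token selection.

```python
def hilbert_d2xy(n, d):
    """Map position d along a Hilbert curve to (x, y) on an n x n grid."""
    x = y = 0
    t = d
    s = 1
    while s < n:
        rx = 1 & (t // 2)
        ry = 1 & (t ^ rx)
        if ry == 0:
            if rx == 1:
                # Flip the quadrant before rotating it.
                x, y = s - 1 - x, s - 1 - y
            x, y = y, x  # rotate the quadrant
        x += s * rx
        y += s * ry
        t //= 4
        s *= 2
    return x, y


def hilbert_scan_order(n):
    """Scan order as a list of (x, y) cells covering the whole grid."""
    return [hilbert_d2xy(n, d) for d in range(n * n)]


order = hilbert_scan_order(4)  # 16 cells; neighbours in order are adjacent
```

A row-by-row raster scan, by contrast, jumps across the full image width at the end of every row.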

[111] Multi-Label Plant Species Prediction with Metadata-Enhanced Multi-Head Vision Transformers

Hanna Herasimchyk,Robin Labryga,Tomislav Prusina

Main category: cs.CV

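TL;DR: This paper proposes a multi-head vision transformer approach for multi-label plant species prediction that, using techniques such as multi-scale tiling and ensemble strategies, achieved strong results in the PlantCLEF 2025 challenge.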

Details Motivation: To address the drastic domain shift between training on single-species plant images and testing on multi-species quadrat images. Method: A pre-trained DINOv2 Vision Transformer Base (ViT-B/14) serves as the backbone, combined with multiple classification heads for species, genus, and family prediction that exploit the taxonomic hierarchy, together with multi-scale tiling, dynamic threshold optimization, and ensemble strategies. Result: Experiments on approximately 1.4 million training images covering 7,806 plant species show strong performance, placing third on the private leaderboard. Conclusion: The paper presents a multi-head vision transformer approach to plant species prediction that achieved strong results in the PlantCLEF 2025 challenge; the code is publicly available. Abstract: We present a multi-head vision transformer approach for multi-label plant species prediction in vegetation plot images, addressing the PlantCLEF 2025 challenge. The task involves training models on single-species plant images while testing on multi-species quadrat images, creating a drastic domain shift. Our methodology leverages a pre-trained DINOv2 Vision Transformer Base (ViT-B/14) backbone with multiple classification heads for species, genus, and family prediction, utilizing taxonomic hierarchies. Key contributions include multi-scale tiling to capture plants at different scales, dynamic threshold optimization based on mean prediction length, and ensemble strategies through bagging and Hydra model architectures. The approach incorporates various inference techniques including image cropping to remove non-plant artifacts, top-n filtering for prediction constraints, and logit thresholding strategies. Experiments were conducted on approximately 1.4 million training images covering 7,806 plant species. Results demonstrate strong performance, making our submission 3rd best on the private leaderboard. Our code is available at https://github.com/geranium12/plant-clef-2025/tree/v1.0.0.
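Multi-scale tiling followed by thresholding and top-n filtering can be sketched as below. This is a generic illustration, not the released code: the grid sizes, the max-over-tiles aggregation, and the function names are assumptions, and the per-tile scores would in practice come from the classifier.

```python
def multi_scale_tiles(width, height, grids=(1, 2, 3)):
    """Return (left, top, right, bottom) crop boxes for each g x g grid."""
    boxes = []
    for g in grids:
        for row in range(g):
            for col in range(g):
                left, right = col * width // g, (col + 1) * width // g
                top, bottom = row * height // g, (row + 1) * height // g
                boxes.append((left, top, right, bottom))
    return boxes


def aggregate(tile_scores, threshold=0.5, top_n=None):
    """Keep species whose best tile score passes the threshold, best first.

    ASSUMPTION: scores are combined by taking the maximum over tiles.
    """
    best = {}
    for scores in tile_scores:
        for species, p in scores.items():
            best[species] = max(p, best.get(species, 0.0))
    kept = sorted((s for s, p in best.items() if p >= threshold),
                  key=lambda s: -best[s])
    return kept[:top_n] if top_n else kept


boxes = multi_scale_tiles(100, 100, grids=(1, 2))
# Hypothetical per-tile classifier outputs for two tiles.
tile_scores = [{"a": 0.9, "b": 0.2}, {"b": 0.6, "c": 0.55}]
predicted = aggregate(tile_scores, threshold=0.5, top_n=2)
```

Small tiles let a single-species classifier see one plant at roughly the scale it was trained on, which is the intuition behind tiling a multi-species quadrat image.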

[112] SingleStrip: learning skull-stripping from a single labeled example

Bella Specktor-Fadida,Malte Hoffmann

Main category: cs.CV

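TL;DR: This paper proposes a semi-supervised learning method that combines domain randomization with autoencoder-based quality assessment, effectively addressing deep learning medical image segmentation when labeled data are scarce.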

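Details Motivation: Deep learning segmentation relies heavily on labeled data, and manual labeling is laborious and time-consuming, especially for volumetric images such as brain MRI. Recent domain-randomization techniques reduce the dependence on labeled data by synthesizing diverse training images from label maps, but they offer limited anatomical variability when very few label maps are available. Semi-supervised self-training alleviates label scarcity by iteratively incorporating model predictions into the training set, enabling networks to learn from unlabeled data. Method: First, voxel intensities are automatically binned, yielding labels used to synthesize images for training an initial skull-stripping model; second, a convolutional autoencoder is trained on the labeled example and its reconstruction error is used to assess the quality of predicted brain masks; finally, the top-ranking pseudo-labels are selected to fine-tune the network. Result: Using only a single labeled example, the method achieves skull-stripping performance approaching that of models trained with more labeled images, and the autoencoder-based ranking correlates more strongly with segmentation accuracy than consistency-based ranking. Conclusion: Combining domain randomization with autoencoder-based quality control can effectively alleviate label scarcity, easing studies that involve new anatomical structures or emerging imaging techniques. Abstract: Deep learning segmentation relies heavily on labeled data, but manual labeling is laborious and time-consuming, especially for volumetric images such as brain magnetic resonance imaging (MRI). While recent domain-randomization techniques alleviate the dependency on labeled data by synthesizing diverse training images from label maps, they offer limited anatomical variability when very few label maps are available. Semi-supervised self-training addresses label scarcity by iteratively incorporating model predictions into the training set, enabling networks to learn from unlabeled data. In this work, we combine domain randomization with self-training to train three-dimensional skull-stripping networks using as little as a single labeled example. First, we automatically bin voxel intensities, yielding labels we use to synthesize images for training an initial skull-stripping model. Second, we train a convolutional autoencoder (AE) on the labeled example and use its reconstruction error to assess the quality of brain masks predicted for unlabeled data. Third, we select the top-ranking pseudo-labels to fine-tune the network, achieving skull-stripping performance on out-of-distribution data that approaches models trained with more labeled images. We compare AE-based ranking to consistency-based ranking under test-time augmentation, finding that the AE approach yields a stronger correlation with segmentation accuracy.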
Our results highlight the potential of combining domain randomization and AE-based quality control to enable effective semi-supervised segmentation from extremely limited labeled data. This strategy may ease the labeling burden that slows progress in studies involving new anatomical structures or emerging imaging techniques.
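The third step of the pipeline, selecting top-ranking pseudo-labels, might look roughly like the sketch below. It assumes that a lower autoencoder reconstruction error indicates a more plausible brain mask; the function name `select_top_pseudo_labels` and the `fraction` parameter are hypothetical and do not come from the paper.

```python
def select_top_pseudo_labels(errors, fraction=0.25):
    """Rank predicted masks by AE reconstruction error and keep the best ones.

    errors: dict mapping a case identifier to its reconstruction error.
    fraction: hypothetical share of cases to keep (at least one is always kept).
    Assumes lower error means a higher-quality pseudo-label.
    """
    ranked = sorted(errors, key=errors.get)  # ascending error
    k = max(1, int(len(ranked) * fraction))
    return ranked[:k]
```

The selected cases would then be added to the training set for fine-tuning, while the remaining predictions are discarded for that round.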

[113] Enhanced Sparse Point Cloud Data Processing for Privacy-aware Human Action Recognition

Maimunatu Tunau,Vincent Gbouna Zakka,Zhuangzhuang Dai

Main category: cs.CV

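TL;DR: This paper evaluates three primary mmWave radar data processing methods (DBSCAN, the Hungarian Algorithm, and Kalman Filtering) for human action recognition (HAR), proposes enhancements, and analyzes their strengths, trade-offs, and combined effects.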

Details Motivation: mmWave radar sensors offer a privacy-preserving alternative for HAR, but their point cloud data are sparse and noisy, so effective data processing is needed. The three primary existing methods have not been comprehensively evaluated. Method: A detailed performance analysis of DBSCAN, the Hungarian Algorithm, and Kalman Filtering on the MiliPoint dataset, covering each method individually, all pairwise combinations, and the combination of all three, together with proposed enhancements. Result: The paper reports recognition accuracy and computational cost for each method and combination, and proposes targeted enhancements to improve accuracy. Conclusion: The study provides key insights for future mmWave-based HAR systems, including each method's strengths, trade-offs, and integration effects. Abstract: Human Action Recognition (HAR) plays a crucial role in healthcare, fitness tracking, and ambient assisted living technologies. While traditional vision based HAR systems are effective, they pose privacy concerns. mmWave radar sensors offer a privacy preserving alternative but present challenges due to the sparse and noisy nature of their point cloud data. In the literature, three primary data processing methods: Density-Based Spatial Clustering of Applications with Noise (DBSCAN), the Hungarian Algorithm, and Kalman Filtering have been widely used to improve the quality and continuity of radar data. However, a comprehensive evaluation of these methods, both individually and in combination, remains lacking. This paper addresses that gap by conducting a detailed performance analysis of the three methods using the MiliPoint dataset. We evaluate each method individually, all possible pairwise combinations, and the combination of all three, assessing both recognition accuracy and computational cost. Furthermore, we propose targeted enhancements to the individual methods aimed at improving accuracy. Our results provide crucial insights into the strengths and trade-offs of each method and their integrations, guiding future work on mmWave based HAR systems.
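Of the three methods named above, Kalman filtering is the simplest to illustrate in isolation. The following is a minimal scalar sketch under a random-walk motion model, for example smoothing one coordinate of a tracked point-cloud centroid. The noise variances and the function name `kalman_smooth` are illustrative assumptions and are not taken from the paper.

```python
def kalman_smooth(measurements, process_var=1e-2, meas_var=1.0):
    """Smooth a 1-D sequence with a scalar Kalman filter (random-walk model).

    process_var and meas_var are hypothetical noise variances.
    """
    estimate = measurements[0]
    error_var = 1.0  # initial uncertainty of the estimate (assumed)
    smoothed = [estimate]
    for z in measurements[1:]:
        # Predict: the state is assumed unchanged, so only uncertainty grows.
        error_var += process_var
        # Update: blend the prediction and the new measurement via the Kalman gain.
        gain = error_var / (error_var + meas_var)
        estimate += gain * (z - estimate)
        error_var *= 1.0 - gain
        smoothed.append(estimate)
    return smoothed
```

Because the gain always lies between 0 and 1, each output is a weighted average of the previous estimate and the new measurement, which damps frame-to-frame jitter in noisy radar data.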

[114] STAMP: Multi-pattern Attention-aware Multiple Instance Learning for STAS Diagnosis in Multi-center Histopathology Images

Liangrui Pan,xiaoyu Li,Guang Zhu,Guanting Li,Ruixin Wang,Jiadi Luo,Yaning Yang,Liang qingchun,Shaoliang Peng

Main category: cs.CV

TL;DR: This study proposes STAMP, a deep learning framework for diagnosing STAS in lung adenocarcinoma, which achieves high accuracy across multiple datasets, surpassing traditional clinical methods.

Details Motivation: STAS is a novel invasive pattern in LUAD linked to poor prognosis, and current diagnostic methods are labor-intensive and error-prone, necessitating an automated deep learning approach. Method: A multi-pattern attention-aware multiple instance learning framework called STAMP was developed, incorporating a dual-branch architecture, transformer-based instance encoding, and multi-pattern attention aggregation modules. Result: STAMP achieved AUCs of 0.8058, 0.8017, and 0.7928 on the STAS-SXY, STAS-TXY, and STAS-TCGA datasets, respectively, outperforming clinical standards. Conclusion: The STAMP framework is effective for diagnosing STAS in LUAD using histopathology images, demonstrating competitive performance above clinical levels. Abstract: Spread through air spaces (STAS) constitutes a novel invasive pattern in lung adenocarcinoma (LUAD), associated with tumor recurrence and diminished survival rates. However, large-scale STAS diagnosis in LUAD remains a labor-intensive endeavor, compounded by the propensity for oversight and misdiagnosis due to its distinctive pathological characteristics and morphological features. Consequently, there is a pressing clinical imperative to leverage deep learning models for STAS diagnosis. This study initially assembled histopathological images from STAS patients at the Second Xiangya Hospital and the Third Xiangya Hospital of Central South University, alongside the TCGA-LUAD cohort. Three senior pathologists conducted cross-verification annotations to construct the STAS-SXY, STAS-TXY, and STAS-TCGA datasets. We then propose a multi-pattern attention-aware multiple instance learning framework, named STAMP, to analyze and diagnose the presence of STAS across multi-center histopathology images. Specifically, the dual-branch architecture guides the model to learn STAS-associated pathological features from distinct semantic spaces. 
Transformer-based instance encoding and multi-pattern attention aggregation modules dynamically select regions closely associated with STAS pathology, suppressing irrelevant noise and enhancing the discriminative power of global representations. Moreover, a similarity regularization constraint prevents feature redundancy across branches, thereby improving overall diagnostic accuracy. Extensive experiments demonstrated that STAMP achieved competitive diagnostic results on STAS-SXY, STAS-TXY and STAS-TCGA, with AUCs of 0.8058, 0.8017, and 0.7928, respectively, surpassing the clinical level.
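For readers unfamiliar with attention-based multiple instance learning, the sketch below shows generic softmax attention pooling over patch (instance) features. It is a standard simplification, not STAMP's multi-pattern aggregation module; the function name `attention_pool` is hypothetical, and the instance scores are assumed to come from some learned scoring network.

```python
import math


def attention_pool(instance_features, instance_scores):
    """Aggregate instance features into one bag-level vector.

    instance_features: list of equal-length feature vectors, one per patch.
    instance_scores: one unnormalized attention score per patch (assumed given).
    Returns (bag_vector, attention_weights).
    """
    top = max(instance_scores)
    exps = [math.exp(s - top) for s in instance_scores]  # numerically stable softmax
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(instance_features[0])
    bag = [
        sum(w * feat[d] for w, feat in zip(weights, instance_features))
        for d in range(dim)
    ]
    return bag, weights
```

Patches with high scores dominate the bag-level representation, which is how attention pooling can suppress irrelevant regions of a whole-slide image.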

[115] TweezeEdit: Consistent and Efficient Image Editing with Path Regularization

Jianda Mao,Kaibo Wang,Yang Xiang,Kani Chen

Main category: cs.CV

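TL;DR: TweezeEdit proposes a tuning- and inversion-free image editing framework that regularizes the entire denoising path, achieving more consistent and efficient image editing.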

Details Motivation: Existing diffusion-based image editing methods struggle to preserve source image semantics while aligning with target prompts, and their long editing paths make them inefficient. Method: TweezeEdit uses gradient-driven regularization to inject target prompt semantics along the entire denoising path rather than relying solely on the inversion anchors of the source image. Result: Experiments show that TweezeEdit outperforms existing methods in semantic preservation and target alignment while requiring only 12 steps (1.6 seconds per edit), indicating potential for real-time applications. Conclusion: TweezeEdit is an efficient and consistent image editing method suited to editing tasks that require speed and semantic preservation. Abstract: Large-scale pre-trained diffusion models empower users to edit images through text guidance. However, existing methods often over-align with target prompts while inadequately preserving source image semantics. Such approaches generate target images explicitly or implicitly from the inversion noise of the source images, termed the inversion anchors. We identify this strategy as suboptimal for semantic preservation and inefficient due to elongated editing paths. We propose TweezeEdit, a tuning- and inversion-free framework for consistent and efficient image editing. Our method addresses these limitations by regularizing the entire denoising path rather than relying solely on the inversion anchors, ensuring source semantic retention and shortening editing paths. Guided by gradient-driven regularization, we efficiently inject target prompt semantics along a direct path using a consistency model. Extensive experiments demonstrate TweezeEdit's superior performance in semantic preservation and target alignment, outperforming existing methods. Remarkably, it requires only 12 steps (1.6 seconds per edit), underscoring its potential for real-time applications.

[116] Multi-Sample Anti-Aliasing and Constrained Optimization for 3D Gaussian Splatting

Zheng Zhou,Jia-Chen Zhang,Yu-Jie Xiong,Chun-Ming Xia

Main category: cs.CV

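TL;DR: This paper proposes a new optimization framework that introduces multisample anti-aliasing and dual geometric constraints, effectively improving the reconstruction quality of fine details in 3D Gaussian splatting.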

Details Motivation: 3D Gaussian splatting lacks sufficient geometric constraints during scene optimization, which leads to blurred reconstructions in regions with high-frequency textures and sharp discontinuities. Method: Pixel colors are computed through adaptive blending of four subsamples, and two constraints are introduced: (a) an adaptive weighting strategy based on dynamic gradient analysis, and (b) gradient differential constraints that enforce geometric regularization at object boundaries. Result: Across multiple benchmarks, the method achieves significant improvements over baseline approaches in both structural similarity (SSIM) and perceptual quality (LPIPS). Conclusion: The proposed framework combining MSAA with dual geometric constraints significantly improves detail preservation for high-frequency textures and sharp discontinuities while maintaining real-time rendering efficiency. Abstract: Recent advances in 3D Gaussian splatting have significantly improved real-time novel view synthesis, yet insufficient geometric constraints during scene optimization often result in blurred reconstructions of fine-grained details, particularly in regions with high-frequency textures and sharp discontinuities. To address this, we propose a comprehensive optimization framework integrating multisample anti-aliasing (MSAA) with dual geometric constraints. Our system computes pixel colors through adaptive blending of quadruple subsamples, effectively reducing aliasing artifacts in high-frequency components. The framework introduces two constraints: (a) an adaptive weighting strategy that prioritizes under-reconstructed regions through dynamic gradient analysis, and (b) gradient differential constraints enforcing geometric regularization at object boundaries. This targeted optimization enables the model to allocate computational resources preferentially to critical regions requiring refinement while maintaining global consistency. Extensive experimental evaluations across multiple benchmarks demonstrate that our method achieves state-of-the-art performance in detail preservation, particularly in preserving high-frequency textures and sharp discontinuities, while maintaining real-time rendering efficiency. Quantitative metrics and perceptual studies confirm statistically significant improvements over baseline approaches in both structural similarity (SSIM) and perceptual quality (LPIPS).
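The subsample blending step can be sketched as a weighted average of per-pixel subsample colors. How the paper derives its adaptive weights is not stated in the abstract, so the `weights` argument and the function name `blend_subsamples` below are assumptions; uniform weights reduce to ordinary MSAA averaging.

```python
def blend_subsamples(subsample_colors, weights=None):
    """Blend RGB subsamples of one pixel into a single color.

    subsample_colors: list of (r, g, b) tuples, e.g. four subsamples per pixel.
    weights: optional per-subsample weights (hypothetical stand-in for the
             paper's adaptive blending); defaults to a uniform average.
    """
    n = len(subsample_colors)
    if weights is None:
        weights = [1.0] * n
    total = sum(weights)
    return tuple(
        sum(w * color[ch] for w, color in zip(weights, subsample_colors)) / total
        for ch in range(3)
    )
```

For a pixel straddling a black/white edge, two black and two white subsamples blend to mid-gray, which softens the stair-step aliasing that a single center sample would produce.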

[117] A Segmentation-driven Editing Method for Bolt Defect Augmentation and Detection

Yangjie Xiao,Ke Zhang,Jiacun Wang,Xin Sheng,Yurong Guo,Meijuan Chen,Zehua Ren,Zhaoye Zheng,Zhenbing Zhao

Main category: cs.CV

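TL;DR: This paper proposes SBDE, a new data augmentation method for bolt defect detection that generates high-quality defect images through segmentation-driven editing, addressing data scarcity and imbalance and significantly improving detection performance.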

Details Motivation: The scarcity of defect images and imbalanced data distributions severely limit detection performance, so a new data augmentation method is needed to improve bolt defect detection. Method: A segmentation-driven bolt defect editing method (SBDE) is proposed, comprising a bolt attribute segmentation model (Bolt-SAM), an editing model that combines a mask optimization module with an image inpainting model (MOD-LaMa), and an editing recovery augmentation (ERA) strategy. Result: Experimental results show that the bolt defect images generated by SBDE significantly outperform those of state-of-the-art image editing models and effectively improve bolt defect detection performance. Conclusion: The paper proposes SBDE, a bolt defect editing method that significantly improves defect detection performance, and verifies its effectiveness and application potential. Abstract: Bolt defect detection is critical to ensure the safety of transmission lines. However, the scarcity of defect images and imbalanced data distributions significantly limit detection performance. To address this problem, we propose a segmentation-driven bolt defect editing method (SBDE) to augment the dataset. First, a bolt attribute segmentation model (Bolt-SAM) is proposed, which enhances the segmentation of complex bolt attributes through the CLAHE-FFT Adapter (CFA) and Multipart-Aware Mask Decoder (MAMD), generating high-quality masks for subsequent editing tasks. Second, a mask optimization module (MOD) is designed and integrated with the image inpainting model (LaMa) to construct the bolt defect attribute editing model (MOD-LaMa), which converts normal bolts into defective ones through attribute editing. Finally, an editing recovery augmentation (ERA) strategy is proposed to recover and put the edited defect bolts back into the original inspection scenes and expand the defect detection dataset. We constructed multiple bolt datasets and conducted extensive experiments. Experimental results demonstrate that the bolt defect images generated by SBDE significantly outperform state-of-the-art image editing models, and effectively improve the performance of bolt defect detection, which fully verifies the effectiveness and application potential of the proposed method. The code of the project is available at https://github.com/Jay-xyj/SBDE.

[118] EgoMusic-driven Human Dance Motion Estimation with Skeleton Mamba

Quang Nguyen,Nhat Le,Baoru Huang,Minh Nhat Vu,Chengcheng Tang,Van Nguyen,Ngan Le,Thieu Vo,Anh Nguyen

Main category: cs.CV

TL;DR: This paper introduces EgoMusic Motion Network, a new method for estimating human dance motion using both egocentric video and music as input, outperforming existing approaches and validated on a new large-scale dataset called EgoAIST++.

Details Motivation: Estimating human dance motion from both egocentric video and music is a largely unexplored yet challenging task with various industrial applications. Existing methods typically use only one input modality (either video or music), limiting their effectiveness in realistic scenarios where both modalities are present. Method: The authors developed a new method based on diffusion models and Mamba for sequence modeling, incorporating a Skeleton Mamba module to explicitly model human body structure. They also introduced a new large-scale dataset, EgoAIST++, for training and evaluation. Result: The proposed method outperforms state-of-the-art approaches in dance motion estimation and works effectively on real-world data. The introduction of the EgoAIST++ dataset also provides a valuable resource for future research in this area. Conclusion: The proposed method, EgoMusic Motion Network with Skeleton Mamba, achieves superior performance in estimating human dance motion from both egocentric video and music, showing strong generalization to real-world data. Abstract: Estimating human dance motion is a challenging task with various industrial applications. Recently, many efforts have focused on predicting human dance motion using either egocentric video or music as input. However, the task of jointly estimating human motion from both egocentric video and music remains largely unexplored. In this paper, we aim to develop a new method that predicts human dance motion from both egocentric video and music. In practice, the egocentric view often obscures much of the body, making accurate full-pose estimation challenging. Additionally, incorporating music requires the generated head and body movements to align well with both visual and musical inputs. We first introduce EgoAIST++, a new large-scale dataset that combines both egocentric views and music with more than 36 hours of dancing motion. 
Drawing on the success of diffusion models and Mamba on modeling sequences, we develop an EgoMusic Motion Network with a core Skeleton Mamba that explicitly captures the skeleton structure of the human body. We illustrate that our approach is theoretically supportive. Intensive experiments show that our method clearly outperforms state-of-the-art approaches and generalizes effectively to real-world data.

[119] Reasoning in Computer Vision: Taxonomy, Models, Tasks, and Methodologies

Ayushman Sarkar,Mohd Yamani Idna Idris,Zhenyu Yu

Main category: cs.CV

TL;DR: This survey categorizes visual reasoning into five major types, analyzes their implementation through various architectures, reviews evaluation protocols, identifies open challenges, and outlines a research agenda for future vision systems.

Details Motivation: The motivation is to address the lack of unified analysis and comparison across reasoning types, methodologies, and evaluation protocols in existing surveys on visual reasoning. Method: The paper categorizes visual reasoning into five major types (relational, symbolic, temporal, causal, and commonsense) and systematically examines their implementation through architectures like graph-based models, memory networks, attention mechanisms, and neuro-symbolic systems. It also reviews evaluation protocols and identifies open challenges. Result: A comprehensive survey categorizing visual reasoning into five types and analyzing architectures, evaluation protocols, and challenges is presented, along with a forward-looking research agenda for next-generation vision systems. Conclusion: The survey emphasizes the importance of bridging perception and reasoning for developing transparent, trustworthy AI systems, outlining a research agenda focused on addressing current limitations and future directions in visual reasoning. Abstract: Visual reasoning is critical for a wide range of computer vision tasks that go beyond surface-level object detection and classification. Despite notable advances in relational, symbolic, temporal, causal, and commonsense reasoning, existing surveys often address these directions in isolation, lacking a unified analysis and comparison across reasoning types, methodologies, and evaluation protocols. This survey aims to address this gap by categorizing visual reasoning into five major types (relational, symbolic, temporal, causal, and commonsense) and systematically examining their implementation through architectures such as graph-based models, memory networks, attention mechanisms, and neuro-symbolic systems. 
We review evaluation protocols designed to assess functional correctness, structural consistency, and causal validity, and critically analyze their limitations in terms of generalizability, reproducibility, and explanatory power. Beyond evaluation, we identify key open challenges in visual reasoning, including scalability to complex scenes, deeper integration of symbolic and neural paradigms, the lack of comprehensive benchmark datasets, and reasoning under weak supervision. Finally, we outline a forward-looking research agenda for next-generation vision systems, emphasizing that bridging perception and reasoning is essential for building transparent, trustworthy, and cross-domain adaptive AI systems, particularly in critical domains such as autonomous driving and medical diagnostics.

[120] Med-GLIP: Advancing Medical Language-Image Pre-training with Large-scale Grounded Dataset

Ziye Deng,Ruihan He,Jiaxiang Liu,Yuan Wang,Zijie Meng,Songtao Jiang,Yong Xie,Zuozhu Liu

Main category: cs.CV

TL;DR: The paper introduces Med-GLIP-5M, a large-scale dataset, and Med-GLIP, a new framework for medical image grounding, which together advance the state of the art and enhance related tasks like VQA and report generation.

Details Motivation: Limitations in existing research such as modality coverage, annotation quality, and lack of a generalizable framework inspired this work. Method: Development of Med-GLIP-5M dataset and a modality-aware grounding framework without expert modules. Result: Med-GLIP outperforms baselines on grounding benchmarks and improves performance in medical VQA and report generation. Conclusion: Med-GLIP provides a unified framework for medical image grounding, showing superior performance and enhancing downstream tasks like VQA and MRG. Abstract: Medical image grounding aims to align natural language phrases with specific regions in medical images, serving as a foundational task for intelligent diagnosis, visual question answering (VQA), and automated report generation (MRG). However, existing research is constrained by limited modality coverage, coarse-grained annotations, and the absence of a unified, generalizable grounding framework. To address these challenges, we construct a large-scale medical grounding dataset Med-GLIP-5M comprising over 5.3 million region-level annotations across seven imaging modalities, covering diverse anatomical structures and pathological findings. The dataset supports both segmentation and grounding tasks with hierarchical region labels, ranging from organ-level boundaries to fine-grained lesions. Based on this foundation, we propose Med-GLIP, a modality-aware grounding framework trained on Med-GLIP-5M. Rather than relying on explicitly designed expert modules, Med-GLIP implicitly acquires hierarchical semantic understanding from diverse training data -- enabling it to recognize multi-granularity structures, such as distinguishing lungs from pneumonia lesions. Extensive experiments demonstrate that Med-GLIP consistently outperforms state-of-the-art baselines across multiple grounding benchmarks. 
Furthermore, integrating its spatial outputs into downstream tasks, including medical VQA and report generation, leads to substantial performance gains. Our dataset will be released soon.

[121] GCRPNet: Graph-Enhanced Contextual and Regional Perception Network For Salient Object Detection in Optical Remote Sensing Images

Mengyu Ren,Yutong Li,Hua Li,Runmin Cong,Sam Kwong

Main category: cs.CV

TL;DR: This paper proposes GCRPNet, a graph-enhanced contextual and regional perception network for salient object detection in optical remote sensing images, achieving state-of-the-art performance by effectively integrating global and local features.

Details Motivation: Salient object detection in optical remote sensing images faces challenges such as significant variations in target scales, low contrast between targets and background, and difficulties in effectively integrating heterogeneous features from existing vision transformer and CNN-based methods. Method: GCRPNet is built on the Mamba architecture and incorporates a visual state space (VSS) encoder for multi-scale feature extraction. A difference-similarity guided hierarchical graph attention module (DS-HGAM) enhances cross-layer interaction and structural perception, while the LEVSS block decoder integrates adaptive scanning strategy and multi-granularity collaborative attention enhancement module (MCAEM) for improved local modeling capability. Result: Extensive experimental results validate that the proposed GCRPNet model achieves superior performance in salient object detection, effectively addressing the challenges of scale variation and low contrast. Conclusion: The proposed GCRPNet model achieves state-of-the-art performance in salient object detection for optical remote sensing images, demonstrating its effectiveness and superiority. Abstract: Salient object detection (SOD) in optical remote sensing images (ORSIs) faces numerous challenges, including significant variations in target scales and low contrast between targets and the background. Existing methods based on vision transformers (ViTs) and convolutional neural networks (CNNs) architectures aim to leverage both global and local features, but the difficulty in effectively integrating these heterogeneous features limits their overall performance. To overcome these limitations, we propose a graph-enhanced contextual and regional perception network (GCRPNet), which builds upon the Mamba architecture to simultaneously capture long-range dependencies and enhance regional feature representation. Specifically, we employ the visual state space (VSS) encoder to extract multi-scale features. 
To further achieve deep guidance and enhancement of these features, we first design a difference-similarity guided hierarchical graph attention module (DS-HGAM). This module strengthens cross-layer interaction capabilities between features of different scales while enhancing the model's structural perception, allowing it to distinguish between foreground and background more effectively. Then, we design the LEVSS block as the decoder of GCRPNet. This module integrates our proposed adaptive scanning strategy and multi-granularity collaborative attention enhancement module (MCAEM). It performs adaptive patch scanning on feature maps processed via multi-scale convolutions, thereby capturing rich local region information and enhancing Mamba's local modeling capability. Extensive experimental results demonstrate that the proposed model achieves state-of-the-art performance, validating its effectiveness and superiority.

[122] PSScreen: Partially Supervised Multiple Retinal Disease Screening

Boyi Zheng,Qing Liu

Main category: cs.CV

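TL;DR: PSScreen is a novel partially supervised learning model that addresses domain shift and missing labels in multiple retinal disease screening, significantly improving detection performance.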

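Details Motivation: Training a multiple retinal disease screening model on several partially labeled datasets reduces the reliance on fully annotated datasets, but it remains challenging because of significant domain shifts across training datasets from different medical sites and missing labels for some classes. Method: PSScreen consists of two streams: one learns deterministic features and the other learns probabilistic features via uncertainty injection. Textual guidance decouples the two types of features into disease-wise features, which are aligned via feature distillation to improve domain generalization. Pseudo-label consistency between the streams addresses missing labels, and self-distillation transfers task-relevant semantics from the deterministic stream to the probabilistic stream. Result: Experiments show that PSScreen significantly improves detection performance on multiple retinal diseases and the normal state, achieving state-of-the-art results on both in-domain and out-of-domain datasets. Conclusion: PSScreen is a novel partially supervised multiple retinal disease screening model that significantly enhances detection performance on six retinal diseases and the normal state and achieves state-of-the-art results on both in-domain and out-of-domain datasets. Abstract: Leveraging multiple partially labeled datasets to train a model for multiple retinal disease screening reduces the reliance on fully annotated datasets, but remains challenging due to significant domain shifts across training datasets from various medical sites, and the label absent issue for partial classes. To solve these challenges, we propose PSScreen, a novel Partially Supervised multiple retinal disease Screening model. Our PSScreen consists of two streams and one learns deterministic features and the other learns probabilistic features via uncertainty injection. Then, we leverage the textual guidance to decouple two types of features into disease-wise features and align them via feature distillation to boost the domain generalization ability. Meanwhile, we employ pseudo label consistency between two streams to address the label absent issue and introduce a self-distillation to transfer task-relevant semantics about known classes from the deterministic to the probabilistic stream to further enhance the detection performances. Experiments show that our PSScreen significantly enhances the detection performances on six retinal diseases and the normal state averagely and achieves state-of-the-art results on both in-domain and out-of-domain datasets. Codes are available at https://github.com/boyiZheng99/PSScreen.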

[123] AR Surgical Navigation With Surface Tracing: Comparing In-Situ Visualization with Tool-Tracking Guidance for Neurosurgical Applications

Marc J. Fischer,Jeffrey Potts,Gabriel Urreola,Dax Jones,Paolo Palmisciano,E. Bradley Strong,Branden Cord,Andrew D. Hernandez,Julia D. Sharma,E. Brandon Strong

Main category: cs.CV

TL;DR: This study presents an AR surgical navigation system that improves catheter insertion accuracy and user experience through real-time tool-tracking guidance.

Details Motivation: The motivation stems from the limitations of traditional surgical navigation systems and the challenges of AR depth perception and occlusion handling in high-precision surgical environments. Method: The research employed a novel surface tracing method and real-time infrared tool tracking using the Microsoft HoloLens 2 to guide simulated catheter insertions on a phantom model. Procedures were conducted under two AR guidance conditions: static in-situ visualization and real-time tool-tracking guidance. Result: Tool-tracking guidance improved performance metrics across all accuracy measures, including insertion accuracy, target deviation, angular error, and depth precision. Users also preferred the real-time guidance in subjective evaluations. Conclusion: The study concludes that AR guidance systems, particularly with real-time tool-tracking, significantly enhance surgical precision and user experience in simulated catheter insertions. Abstract: Augmented Reality (AR) surgical navigation systems are emerging as the next generation of intraoperative surgical guidance, promising to overcome limitations of traditional navigation systems. However, known issues with AR depth perception due to vergence-accommodation conflict and occlusion handling limitations of the currently commercially available display technology present acute challenges in surgical settings where precision is paramount. This study presents a novel methodology for utilizing AR guidance to register anatomical targets and provide real-time instrument navigation using placement of simulated external ventricular drain catheters on a phantom model as the clinical scenario. The system registers target positions to the patient through a novel surface tracing method and uses real-time infrared tool tracking to aid in catheter placement, relying only on the onboard sensors of the Microsoft HoloLens 2. 
A group of intended users performed the procedure of simulated insertions under two AR guidance conditions: static in-situ visualization, where planned trajectories are overlaid directly onto the patient anatomy, and real-time tool-tracking guidance, where live feedback of the catheter's pose is provided relative to the plan. Following the insertion tests, computed tomography scans of the phantom models were acquired, allowing for evaluation of insertion accuracy, target deviation, angular error, and depth precision. System Usability Scale surveys assessed user experience and cognitive workload. Tool-tracking guidance improved performance metrics across all accuracy measures and was preferred by users in subjective evaluations. A free copy of this paper and all supplemental materials are available at https://bit.ly/45l89Hq.

[124] Retrieval-Augmented Prompt for OOD Detection

Ruisong Han,Zongbo Han,Jiahao Zhang,Mingyue Cheng,Changqing Zhang

Main category: cs.CV

TL;DR: RAP is a novel OOD detection method that leverages external textual knowledge to improve semantic supervision, achieving state-of-the-art results on large-scale benchmarks.

Details Motivation: Existing OOD detection methods suffer from limited and mismatched outlier samples, resulting in insufficient semantic supervision and suboptimal performance. RAP aims to address these issues by leveraging external textual knowledge. Method: RAP augments pre-trained vision-language model prompts by retrieving external textual knowledge, offering enhanced semantic supervision for OOD detection. It uses retrieved descriptive words for outliers during training and dynamically updates OOD prompts during testing. Result: RAP achieves significant improvements in OOD detection, reducing the average FPR95 by 7.05% and improving AUROC by 1.71% in 1-shot OOD detection on the ImageNet-1k dataset. Conclusion: The proposed RAP method achieves state-of-the-art performance on large-scale OOD detection benchmarks, with notable improvements in FPR95 and AUROC compared to existing methods. Abstract: Out-of-Distribution (OOD) detection is crucial for the reliable deployment of machine learning models in-the-wild, enabling accurate identification of test samples that differ from the training data distribution. Existing methods rely on auxiliary outlier samples or in-distribution (ID) data to generate outlier information for training, but due to limited outliers and their mismatch with real test OOD samples, they often fail to provide sufficient semantic supervision, leading to suboptimal performance. To address this, we propose a novel OOD detection method called Retrieval-Augmented Prompt (RAP). RAP augments a pre-trained vision-language model's prompts by retrieving external knowledge, offering enhanced semantic supervision for OOD detection. During training, RAP retrieves descriptive words for outliers based on joint similarity with external textual knowledge and uses them to augment the model's OOD prompts. 
During testing, RAP dynamically updates OOD prompts in real-time based on the encountered OOD samples, enabling the model to rapidly adapt to the test environment. Our extensive experiments demonstrate that RAP achieves state-of-the-art performance on large-scale OOD detection benchmarks. For example, in 1-shot OOD detection on the ImageNet-1k dataset, RAP reduces the average FPR95 by 7.05% and improves the AUROC by 1.71% compared to previous methods. Additionally, comprehensive ablation studies validate the effectiveness of each module and the underlying motivations of our approach.
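The FPR95 metric quoted above is the false positive rate on OOD samples at the threshold where 95% of in-distribution samples are correctly accepted. A minimal sketch follows, assuming the common convention that higher scores mean "more in-distribution" and that ID is the positive class; the function name `fpr_at_95_tpr` is hypothetical, and evaluation toolkits may differ in tie-handling details.

```python
def fpr_at_95_tpr(id_scores, ood_scores):
    """False positive rate on OOD data when 95% of ID data is accepted.

    Assumes higher score = more in-distribution (a convention, not from the paper).
    """
    ranked = sorted(id_scores, reverse=True)
    n = len(ranked)
    k = (95 * n + 99) // 100  # integer ceil(0.95 * n): ID samples that must be accepted
    threshold = ranked[k - 1]
    false_positives = sum(1 for s in ood_scores if s >= threshold)
    return false_positives / len(ood_scores)
```

A lower value is better: a reported reduction in average FPR95 means fewer OOD samples slip past the detector at the same in-distribution acceptance rate.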

[125] PTQAT: A Hybrid Parameter-Efficient Quantization Algorithm for 3D Perception Tasks

Xinhao Wang,Zhiwei Lin,Zhongyu Xia,Yongtao Wang

Main category: cs.CV

TL;DR: PTQAT improves quantization efficiency and accuracy by selectively fine-tuning critical layers and applying PTQ on others, achieving better performance than QAT.

Details Motivation: The motivation is to address the performance degradation in PTQ and the high computational cost of QAT, aiming for an efficient and accurate quantization approach. Method: The method selects critical layers for QAT fine-tuning and applies PTQ on the remaining layers, focusing on compensating quantization errors during propagation. Result: The proposed PTQAT outperforms QAT-only baselines across various 3D perception tasks while fine-tuning fewer weights. Conclusion: PTQAT is a novel hybrid quantization algorithm that efficiently deploys 3D perception networks by achieving similar performance to QAT with much higher efficiency. Abstract: Post-Training Quantization (PTQ) and Quantization-Aware Training (QAT) represent two mainstream model quantization approaches. However, PTQ often leads to unacceptable performance degradation in quantized models, while QAT imposes substantial GPU memory requirements and extended training time due to weight fine-tuning. In this paper, we propose PTQAT, a novel general hybrid quantization algorithm for the efficient deployment of 3D perception networks. To address the speed-accuracy trade-off between PTQ and QAT, our method selects critical layers for QAT fine-tuning and performs PTQ on the remaining layers. Contrary to intuition, fine-tuning the layers with smaller output discrepancies before and after quantization, rather than those with larger discrepancies, actually leads to greater improvements in the model's quantization accuracy. This means we better compensate for quantization errors during their propagation, rather than addressing them at the point where they occur. The proposed PTQAT achieves similar performance to QAT with more efficiency by freezing nearly 50% of quantifiable layers. Additionally, PTQAT is a universal quantization method that supports various quantization bit widths (4 bits) as well as different model architectures, including CNNs and Transformers. 
The experimental results on nuScenes across diverse 3D perception tasks, including object detection, semantic segmentation, and occupancy prediction, show that our method consistently outperforms QAT-only baselines. Notably, it achieves 0.2%-0.9% NDS and 0.3%-1.0% mAP gains in object detection, 0.3%-2.0% mIoU gains in semantic segmentation and occupancy prediction while fine-tuning fewer weights.
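
The layer-selection idea described in the abstract can be sketched as follows. This is an illustrative sketch only: the abstract does not specify how the output discrepancy is measured or exactly how many layers are selected, so the per-layer discrepancy values, the `qat_fraction` parameter, and the function name `split_layers` are all assumptions.

```python
def split_layers(discrepancy, qat_fraction=0.5):
    """Split layers into a QAT group and a PTQ group.

    `discrepancy` maps a layer name to some measure of how much its
    output changes after quantization (the measure is assumed here).
    Following the abstract, the layers with the SMALLER discrepancies
    are fine-tuned with QAT; the rest are frozen and quantized with PTQ.
    """
    ranked = sorted(discrepancy, key=discrepancy.get)  # ascending
    n_qat = int(len(ranked) * qat_fraction)
    return ranked[:n_qat], ranked[n_qat:]


# Hypothetical discrepancy values for four layers.
discrepancy = {"conv1": 0.30, "conv2": 0.05, "attn": 0.12, "head": 0.40}
qat_layers, ptq_layers = split_layers(discrepancy)
```

With `qat_fraction=0.5`, half of the layers are frozen, which loosely mirrors the "nearly 50% of quantifiable layers" figure reported in the abstract.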

[126] HM-Talker: Hybrid Motion Modeling for High-Fidelity Talking Head Synthesis

Shiyu Liu,Kui Jiang,Xianming Liu,Hongxun Yao,Xiaocheng Feng

Main category: cs.CV

TL;DR: HM-Talker improves audio-driven talking head video generation by combining implicit and explicit facial motion cues, resulting in high-fidelity, temporally coherent outputs with accurate lip synchronization across diverse identities.

Details Motivation: Current audio-driven talking head generation methods suffer from motion blur and lip jitter due to implicit modeling of audio-facial correlations without explicit articulatory priors. This work aims to enhance visual quality and lip synchronization by incorporating anatomical guidance. Method: HM-Talker uses a hybrid motion representation combining implicit and explicit cues, employing a Cross-Modal Disentanglement Module (CMDM) and a Hybrid Motion Modeling Module (HMMM) to extract and merge motion features while enforcing identity-agnostic learning. Result: HM-Talker achieves superior visual quality and lip-sync accuracy compared to existing methods, demonstrating robust cross-subject generalization and high-fidelity talking head synthesis. Conclusion: HM-Talker provides high-fidelity, temporally coherent talking head synthesis with robust lip synchronization across diverse identities, outperforming state-of-the-art methods in visual quality and lip-sync accuracy. Abstract: Audio-driven talking head video generation enhances user engagement in human-computer interaction. However, current methods frequently produce videos with motion blur and lip jitter, primarily due to their reliance on implicit modeling of audio-facial motion correlations--an approach lacking explicit articulatory priors (i.e., anatomical guidance for speech-related facial movements). To overcome this limitation, we propose HM-Talker, a novel framework for generating high-fidelity, temporally coherent talking heads. HM-Talker leverages a hybrid motion representation combining both implicit and explicit motion cues. Explicit cues use Action Units (AUs), anatomically defined facial muscle movements, alongside implicit features to minimize phoneme-viseme misalignment. Specifically, our Cross-Modal Disentanglement Module (CMDM) extracts complementary implicit/explicit motion features while predicting AUs directly from audio input aligned to visual cues. 
To mitigate identity-dependent biases in explicit features and enhance cross-subject generalization, we introduce the Hybrid Motion Modeling Module (HMMM). This module dynamically merges randomly paired implicit/explicit features, enforcing identity-agnostic learning. Together, these components enable robust lip synchronization across diverse identities, advancing personalized talking head synthesis. Extensive experiments demonstrate HM-Talker's superiority over state-of-the-art methods in visual quality and lip-sync accuracy.

[127] SpaRC-AD: A Baseline for Radar-Camera Fusion in End-to-End Autonomous Driving

Philipp Wolters,Johannes Gilg,Torben Teepe,Gerhard Rigoll

Main category: cs.CV

TL;DR: The paper introduces SpaRC-AD, an end-to-end camera-radar fusion framework for autonomous driving, demonstrating enhanced performance in various tasks and improved safety in critical scenarios.

Details Motivation: The motivation is to overcome the limitations of vision-based autonomous driving systems in adverse weather conditions, partial occlusions, and precise velocity estimation, particularly in safety-sensitive scenarios. Method: The method involves a query-based end-to-end camera-radar fusion framework called SpaRC-AD, which uses sparse 3D feature alignment and Doppler-based velocity estimation to enhance 3D scene representations. Result: The result shows significant improvements over state-of-the-art vision-only baselines in 3D detection, multi-object tracking, online mapping, motion prediction, and trajectory planning. Conclusion: The paper concludes that SpaRC-AD, a query-based end-to-end camera-radar fusion framework, effectively improves autonomous driving systems' performance across multiple tasks, especially in safety-critical scenarios. Abstract: End-to-end autonomous driving systems promise stronger performance through unified optimization of perception, motion forecasting, and planning. However, vision-based approaches face fundamental limitations in adverse weather conditions, partial occlusions, and precise velocity estimation - critical challenges in safety-sensitive scenarios where accurate motion understanding and long-horizon trajectory prediction are essential for collision avoidance. To address these limitations, we propose SpaRC-AD, a query-based end-to-end camera-radar fusion framework for planning-oriented autonomous driving. Through sparse 3D feature alignment and Doppler-based velocity estimation, we achieve strong 3D scene representations for refinement of agent anchors, map polylines and motion modelling. Our method achieves strong improvements over the state-of-the-art vision-only baselines across multiple autonomous driving tasks, including 3D detection (+4.8% mAP), multi-object tracking (+8.3% AMOTA), online mapping (+1.8% mAP), motion prediction (-4.0% mADE), and trajectory planning (-0.1m L2 and -9% TPC). 
We achieve both spatial coherence and temporal consistency on multiple challenging benchmarks, including real-world open-loop nuScenes, long-horizon T-nuScenes, and closed-loop simulator Bench2Drive. We show the effectiveness of radar-based fusion in safety-critical scenarios where accurate motion understanding and long-horizon trajectory prediction are essential for collision avoidance. The source code of all experiments is available at https://phi-wol.github.io/sparcad/

[128] Adapting SAM via Cross-Entropy Masking for Class Imbalance in Remote Sensing Change Detection

Humza Naveed,Xina Zeng,Mitch Bryson,Nagita Mehrseresht

Main category: cs.CV

TL;DR: This paper proposes a remote sensing change detection method that combines the SAM model with STFE, MSDF, and a CEM loss, outperforming existing state-of-the-art methods on multiple datasets.

Details Motivation: Remote sensing change detection is challenging across multiple scales, and datasets often suffer from class imbalance, so a robust method is needed to improve detection performance. Method: The SAM encoder is fine-tuned and combined with spatial-temporal feature enhancement (STFE), multi-scale decoder fusion (MSDF), and a novel cross-entropy masking (CEM) loss to improve remote sensing change detection. Result: The method outperforms SOTA methods on four change detection datasets (Levir-CD, WHU-CD, CLCD, and S2Looking), with a 2.5% F1-score improvement on S2Looking. Conclusion: The paper proposes a SAM-based remote sensing change detection method that introduces STFE, MSDF, and a CEM loss to improve detection accuracy, performing especially well on class-imbalanced datasets. Abstract: Foundational models have achieved significant success in diverse domains of computer vision. They learn general representations that are easily transferable to tasks not seen during training. One such foundational model is Segment Anything Model (SAM), which can accurately segment objects in images. We propose adapting the SAM encoder via fine-tuning for remote sensing change detection (RSCD) along with spatial-temporal feature enhancement (STFE) and multi-scale decoder fusion (MSDF) to detect changes robustly at multiple scales. Additionally, we propose a novel cross-entropy masking (CEM) loss to handle high class imbalance in change detection datasets. Our method outperforms state-of-the-art (SOTA) methods on four change detection datasets, Levir-CD, WHU-CD, CLCD, and S2Looking. We achieved 2.5% F1-score improvement on a large complex S2Looking dataset. The code is available at: https://github.com/humza909/SAM-CEM-CD

[129] Towards Agentic AI for Multimodal-Guided Video Object Segmentation

Tuyen Tran,Thao Minh Le,Truyen Tran

Main category: cs.CV

TL;DR: This paper proposes a multimodal agent-based approach to referring-based video object segmentation that uses the reasoning capabilities of large language models to generate dynamic workflows, outperforming prior methods on the RVOS and Ref-AVS tasks.

Details Motivation: Traditional methods require training specialized models, which entails high computational complexity and manual annotation cost; existing methods based on vision-language foundation models lack the flexibility to adapt to the dynamic nature of the task. Method: The proposed Multi-Modal Agent uses a large language model to generate a dynamic workflow tailored to each input and identifies the target object through iterative interaction with tools for low-level tasks. Result: The method outperforms existing methods on two multimodal-conditioned video object segmentation tasks, RVOS and Ref-AVS. Conclusion: The method offers a more flexible and adaptive solution and demonstrates the potential of applying LLM reasoning to complex multimodal tasks. Abstract: Referring-based Video Object Segmentation is a multimodal problem that requires producing fine-grained segmentation results guided by external cues. Traditional approaches to this task typically involve training specialized models, which come with high computational complexity and manual annotation effort. Recent advances in vision-language foundation models open a promising direction toward training-free approaches. Several studies have explored leveraging these general-purpose models for fine-grained segmentation, achieving performance comparable to that of fully supervised, task-specific models. However, existing methods rely on fixed pipelines that lack the flexibility needed to adapt to the dynamic nature of the task. To address this limitation, we propose Multi-Modal Agent, a novel agentic system designed to solve this task in a more flexible and adaptive manner. Specifically, our method leverages the reasoning capabilities of large language models (LLMs) to generate dynamic workflows tailored to each input. This adaptive procedure iteratively interacts with a set of specialized tools designed for low-level tasks across different modalities to identify the target object described by the multimodal cues. Our agentic approach demonstrates clear improvements over prior methods on two multimodal-conditioned VOS tasks: RVOS and Ref-AVS.

[130] HumanSense: From Multimodal Perception to Empathetic Context-Aware Responses through Reasoning MLLMs

Zheng Qin,Ruobing Zheng,Yabing Wang,Tianqi Li,Yi Yuan,Jingdong Chen,Le Wang

Main category: cs.CV

TL;DR: This paper introduces the HumanSense benchmark for evaluating the human-centered interaction capabilities of multimodal large language models and improves model reasoning through multi-stage reinforcement learning.

Details Motivation: Although Multimodal Large Language Models (MLLMs) show great promise for truly human-like interaction, progress is hindered by the lack of fine-grained evaluation frameworks covering both the understanding of complex human intentions and the provision of empathetic, context-aware responses. Method: The authors introduce HumanSense, a comprehensive benchmark for evaluating the human-centered perception and interaction capabilities of MLLMs, with a particular focus on deep understanding of extended multimodal contexts and the formulation of rational feedback. Result: The evaluation shows that leading MLLMs still have considerable room for improvement on interaction-oriented tasks; supplementing visual input with audio and text information yields substantial improvements, and omni-modal models show advantages on these tasks. Conclusion: The paper argues that appropriate feedback stems from a contextual analysis of the interlocutor's needs and emotions, with reasoning ability as the key; it improves reasoning through multi-stage, modality-progressive reinforcement learning and finds that successful reasoning processes exhibit highly consistent thought patterns. Abstract: While Multimodal Large Language Models (MLLMs) show immense promise for achieving truly human-like interactions, progress is hindered by the lack of fine-grained evaluation frameworks for human-centered scenarios, encompassing both the understanding of complex human intentions and the provision of empathetic, context-aware responses. Here we introduce HumanSense, a comprehensive benchmark designed to evaluate the human-centered perception and interaction capabilities of MLLMs, with a particular focus on deep understanding of extended multimodal contexts and the formulation of rational feedback. Our evaluation reveals that leading MLLMs still have considerable room for improvement, particularly for advanced interaction-oriented tasks. Supplementing visual input with audio and text information yields substantial improvements, and Omni-modal models show advantages on these tasks. Furthermore, we argue that appropriate feedback stems from a contextual analysis of the interlocutor's needs and emotions, with reasoning ability serving as the key to unlocking it. Accordingly, we employ a multi-stage, modality-progressive reinforcement learning to enhance the reasoning abilities of an Omni model, achieving substantial gains on evaluation results. Additionally, we observe that successful reasoning processes exhibit highly consistent thought patterns. By designing corresponding prompts, we also enhance the performance of non-reasoning models in a training-free manner. Project page: https://digital-avatar.github.io/ai/HumanSense/

[131] EvTurb: Event Camera Guided Turbulence Removal

Yixing Liu,Minggui Teng,Yifei Xia,Peiqi Duan,Boxin Shi

Main category: cs.CV

TL;DR: EvTurb uses high-speed event streams to separate blur and tilt distortions caused by atmospheric turbulence, significantly improving image restoration performance.

Details Motivation: Atmospheric turbulence causes image blur and tilt, challenging downstream vision tasks. Existing methods struggle with the complexity of these distortions, necessitating a more effective solution. Method: EvTurb proposes a two-step event-guided network: first using event integrals to reduce blur, then using a variance map from event streams to eliminate tilt distortion. Additionally, TurbEvent, a real-captured dataset, is introduced. Result: EvTurb achieves superior performance over existing methods on a newly introduced dataset (TurbEvent) while maintaining computational efficiency. Conclusion: EvTurb effectively addresses atmospheric turbulence-induced image degradation by decoupling blur and tilt effects using high-speed event streams, outperforming state-of-the-art methods with computational efficiency. Abstract: Atmospheric turbulence degrades image quality by introducing blur and geometric tilt distortions, posing significant challenges to downstream computer vision tasks. Existing single-image and multi-frame methods struggle with the highly ill-posed nature of this problem due to the compositional complexity of turbulence-induced distortions. To address this, we propose EvTurb, an event guided turbulence removal framework that leverages high-speed event streams to decouple blur and tilt effects. EvTurb decouples blur and tilt effects by modeling event-based turbulence formation, specifically through a novel two-step event-guided network: event integrals are first employed to reduce blur in the coarse outputs. This is followed by employing a variance map, derived from raw event streams, to eliminate the tilt distortion for the refined outputs. Additionally, we present TurbEvent, the first real-captured dataset featuring diverse turbulence scenarios. Experimental results demonstrate that EvTurb surpasses state-of-the-art methods while maintaining computational efficiency.

[132] Towards Powerful and Practical Patch Attacks for 2D Object Detection in Autonomous Driving

Yuxin Cao,Yedi Zhang,Wentao He,Yifan Liao,Yan Xiao,Chang Li,Zhiyong Huang,Jin Song Dong

Main category: cs.CV

TL;DR: This paper proposes P$^3$A, an effective patch attack framework for 2D object detection in autonomous driving that addresses the poor attack performance of existing methods on high-resolution data, and experiments confirm its superiority.

Details Motivation: Existing mAP-based attack methods have low success rates in practical attack scenarios, and patches trained at low resolution perform poorly on high-resolution images. Method: The P$^3$A framework comprises the Practical Attack Success Rate (PASR) metric, the Localization-Confidence Suppression Loss (LCSL), and Probabilistic Scale-Preserving Padding (PSPP). Result: Experiments show that P$^3$A achieves higher attack success rates on high-resolution datasets and unseen models, making the attack more practical. Conclusion: P$^3$A outperforms existing attack methods on unseen models and high-resolution datasets. Abstract: Learning-based autonomous driving systems remain critically vulnerable to adversarial patches, posing serious safety and security risks in their real-world deployment. Black-box attacks, notable for their high attack success rate without model knowledge, are especially concerning, with their transferability extensively studied to reduce computational costs compared to query-based attacks. Previous transferability-based black-box attacks typically adopt mean Average Precision (mAP) as the evaluation metric and design training loss accordingly. However, due to the presence of multiple detected bounding boxes and the relatively lenient Intersection over Union (IoU) thresholds, the attack effectiveness of these approaches is often overestimated, resulting in reduced success rates in practical attacking scenarios. Furthermore, patches trained on low-resolution data often fail to maintain effectiveness on high-resolution images, limiting their transferability to autonomous driving datasets. To fill this gap, we propose P$^3$A, a Powerful and Practical Patch Attack framework for 2D object detection in autonomous driving, specifically optimized for high-resolution datasets. First, we introduce a novel metric, Practical Attack Success Rate (PASR), to more accurately quantify attack effectiveness with greater relevance for pedestrian safety. Second, we present a tailored Localization-Confidence Suppression Loss (LCSL) to improve attack transferability under PASR. Finally, to maintain the transferability for high-resolution datasets, we further incorporate the Probabilistic Scale-Preserving Padding (PSPP) into the patch attack pipeline as a data preprocessing step. 
Extensive experiments show that P$^3$A outperforms state-of-the-art attacks on unseen models and unseen high-resolution datasets, both under the proposed practical IoU-based evaluation metric and the previous mAP-based metrics.
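
Since the abstract's argument hinges on Intersection over Union (IoU) thresholds, a minimal sketch of the standard IoU computation for axis-aligned boxes may help. It does not implement PASR, whose definition is not given in the abstract. The `(x1, y1, x2, y2)` box format and the function name `iou` are assumptions made for illustration.

```python
def iou(box_a, box_b):
    """Intersection over Union of two axis-aligned boxes (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Width and height of the overlap rectangle (zero if boxes are disjoint).
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


# Two 2x2 boxes overlapping in a 1x1 square: IoU = 1 / (4 + 4 - 1).
overlap = iou((0, 0, 2, 2), (1, 1, 3, 3))
```

Under a lenient threshold, a detection that overlaps the ground truth only loosely still counts as a match, which is one way a metric can overstate or understate how badly a detector has actually been disrupted.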

[133] Fourier-Guided Attention Upsampling for Image Super-Resolution

Daejune Choi,Youchan No,Jinhyung Lee,Duksu Kim

Main category: cs.CV

TL;DR: The paper proposes Frequency-Guided Attention (FGA), a lightweight upsampling module for single image super-resolution, which significantly improves high-frequency detail reconstruction and reduces aliasing artifacts.

Details Motivation: Conventional upsamplers like Sub-Pixel Convolution are efficient but struggle with reconstructing high-frequency details and often introduce aliasing artifacts. Method: FGA integrates three components: a Fourier feature-based MLP for positional frequency encoding, a cross-resolution Correlation Attention Layer for adaptive spatial alignment, and a frequency-domain L1 loss for spectral fidelity supervision. Result: FGA enhances performance across five diverse super-resolution backbones, achieving average PSNR gains of 0.12~0.14 dB and improving frequency-domain consistency by up to 29%, particularly on texture-rich datasets. Conclusion: FGA is a practical and scalable alternative to traditional upsampling methods, effectively reducing aliasing and preserving fine details in single image super-resolution. Abstract: We propose Frequency-Guided Attention (FGA), a lightweight upsampling module for single image super-resolution. Conventional upsamplers, such as Sub-Pixel Convolution, are efficient but frequently fail to reconstruct high-frequency details and introduce aliasing artifacts. FGA addresses these issues by integrating (1) a Fourier feature-based Multi-Layer Perceptron (MLP) for positional frequency encoding, (2) a cross-resolution Correlation Attention Layer for adaptive spatial alignment, and (3) a frequency-domain L1 loss for spectral fidelity supervision. Adding merely 0.3M parameters, FGA consistently enhances performance across five diverse super-resolution backbones in both lightweight and full-capacity scenarios. Experimental results demonstrate average PSNR gains of 0.12~0.14 dB and improved frequency-domain consistency by up to 29%, particularly evident on texture-rich datasets. Visual and spectral evaluations confirm FGA's effectiveness in reducing aliasing and preserving fine details, establishing it as a practical, scalable alternative to traditional upsampling methods.
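
To make the third component concrete, here is a minimal sketch of a frequency-domain L1 loss. It is a 1D toy version using a naive discrete Fourier transform from the standard library; the paper presumably operates on 2D images with an FFT, and its exact normalization is not stated in the abstract. The function names `dft` and `frequency_l1` are illustrative.

```python
import cmath


def dft(signal):
    """Naive O(n^2) discrete Fourier transform of a real-valued sequence."""
    n = len(signal)
    return [
        sum(x * cmath.exp(-2j * cmath.pi * k * t / n) for t, x in enumerate(signal))
        for k in range(n)
    ]


def frequency_l1(pred, target):
    """Mean absolute difference between the spectra of two signals."""
    spec_pred, spec_target = dft(pred), dft(target)
    return sum(abs(a - b) for a, b in zip(spec_pred, spec_target)) / len(spec_pred)


# A unit impulse has a flat spectrum of ones, so its distance to the
# all-zero signal is 1.0 under this normalization.
loss = frequency_l1([1.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0])
```

Penalizing differences per frequency bin, rather than per pixel, gives high-frequency errors the same weight as low-frequency ones, which is the general motivation for spectral supervision.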

[134] FIND-Net -- Fourier-Integrated Network with Dictionary Kernels for Metal Artifact Reduction

Farid Tasharofi,Fuxin Fan,Melika Qahqaie,Mareike Thies,Andreas Maier

Main category: cs.CV

TL;DR: FIND-Net is a novel metal artifact reduction framework that integrates frequency-domain and spatial-domain processing to achieve superior artifact suppression and structural preservation.

Details Motivation: Existing deep learning algorithms have achieved notable success in Metal Artifact Reduction (MAR) but often struggle to suppress artifacts while preserving structural details. Method: FIND-Net uses Fast Fourier Convolution (FFC) layers and trainable Gaussian filtering, treating MAR as a hybrid task operating in both the spatial and frequency domains. Result: On synthetic datasets, FIND-Net achieves a 3.07% MAE reduction, a 0.18% SSIM increase, and a 0.90% PSNR improvement. Evaluations on real-world clinical CT scans confirm that FIND-Net effectively suppresses metal-induced distortions while minimizing modifications to clean anatomical regions. Conclusion: By combining frequency-domain and spatial-domain processing, FIND-Net effectively reduces metal artifacts while preserving anatomical structures, improving MAR performance. Abstract: Metal artifacts, caused by high-density metallic implants in computed tomography (CT) imaging, severely degrade image quality, complicating diagnosis and treatment planning. While existing deep learning algorithms have achieved notable success in Metal Artifact Reduction (MAR), they often struggle to suppress artifacts while preserving structural details. To address this challenge, we propose FIND-Net (Fourier-Integrated Network with Dictionary Kernels), a novel MAR framework that integrates frequency and spatial domain processing to achieve superior artifact suppression and structural preservation. FIND-Net incorporates Fast Fourier Convolution (FFC) layers and trainable Gaussian filtering, treating MAR as a hybrid task operating in both spatial and frequency domains. This approach enhances global contextual understanding and frequency selectivity, effectively reducing artifacts while maintaining anatomical structures. Experiments on synthetic datasets show that FIND-Net achieves statistically significant improvements over state-of-the-art MAR methods, with a 3.07% MAE reduction, 0.18% SSIM increase, and 0.90% PSNR improvement, confirming robustness across varying artifact complexities. Furthermore, evaluations on real-world clinical CT scans confirm FIND-Net's ability to minimize modifications to clean anatomical regions while effectively suppressing metal-induced distortions. These findings highlight FIND-Net's potential for advancing MAR performance, offering superior structural preservation and improved clinical applicability. Code is available at https://github.com/Farid-Tasharofi/FIND-Net

[135] Increasing the Utility of Synthetic Images through Chamfer Guidance

Nicola Dall'Asen,Xiaofeng Zhang,Reyhane Askari Hemmat,Melissa Hall,Jakob Verbeek,Adriana Romero-Soriano,Michal Drozdzal

Main category: cs.CV

TL;DR: Chamfer Guidance improves synthetic image generation by enhancing both quality and diversity using a few real images, without requiring model training or an unconditional model.

Details Motivation: Recent generative models have focused on improving generation quality at the cost of diversity, limiting their utility for generating synthetic training data. Existing approaches often fail to account for the distribution shift between synthetic and real data. Method: Chamfer Guidance uses a small number of real exemplar images to guide the generative model by characterizing synthetic data quality and diversity, without requiring a training process or an unconditional model. Result: Using Chamfer Guidance, the diversity and quality of generated images are significantly improved on ImageNet-1k and geo-diversity benchmarks. With as few as 2 exemplar images, the method achieves 96.4% precision and 86.4% distributional coverage, improving to 97.5% and 92.7% with 32 images. Training classifiers on the synthetic data shows up to a 15% accuracy boost in-distribution and 16% out-of-distribution compared to baselines, with a 31% reduction in FLOPs at sampling time. Conclusion: Chamfer Guidance is an effective training-free approach that improves diversity and quality of synthetic data generated by conditional image generative models, achieving state-of-the-art few-shot performance while reducing computational costs. Abstract: Conditional image generative models hold considerable promise to produce infinite amounts of synthetic training data. Yet, recent progress in generation quality has come at the expense of generation diversity, limiting the utility of these models as a source of synthetic training data. Although guidance-based approaches have been introduced to improve the utility of generated data by focusing on quality or diversity, the (implicit or explicit) utility functions oftentimes disregard the potential distribution shift between synthetic and real data. In this work, we introduce Chamfer Guidance: a training-free guidance approach which leverages a handful of real exemplar images to characterize the quality and diversity of synthetic data. 
We show that by leveraging the proposed Chamfer Guidance, we can boost the diversity of the generations w.r.t. a dataset of real images while maintaining or improving the generation quality on ImageNet-1k and standard geo-diversity benchmarks. Our approach achieves state-of-the-art few-shot performance with as little as 2 exemplar real images, obtaining 96.4% in terms of precision, and 86.4% in terms of distributional coverage, which increase to 97.5% and 92.7%, respectively, when using 32 real images. We showcase the benefits of the Chamfer Guidance generation by training downstream image classifiers on synthetic data, achieving accuracy boost of up to 15% for in-distribution over the baselines, and up to 16% in out-of-distribution. Furthermore, our approach does not require using the unconditional model, and thus obtains a 31% reduction in FLOPs w.r.t. classifier-free-guidance-based approaches at sampling time.
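
The method takes its name from the Chamfer distance between two point sets. The sketch below shows the standard symmetric form on small feature vectors. How the paper turns this distance into a guidance signal, and which feature space it uses, is not described in the abstract, so this is only the underlying distance; the function name `chamfer_distance` and the toy points are illustrative.

```python
import math


def chamfer_distance(set_a, set_b):
    """Symmetric Chamfer distance between two sets of equal-dimension points."""

    def one_way(src, dst):
        # Average distance from each source point to its nearest destination point.
        return sum(min(math.dist(p, q) for q in dst) for p in src) / len(src)

    return one_way(set_a, set_b) + one_way(set_b, set_a)


# Hypothetical 2D "features" for generated and real exemplar images.
generated = [(0.0, 0.0), (1.0, 0.0)]
real = [(0.0, 0.0), (1.0, 1.0)]
distance = chamfer_distance(generated, real)
```

The two directions play different roles: the generated-to-real term rewards samples that lie close to some real exemplar (quality), while the real-to-generated term rewards covering every exemplar (diversity).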

[136] ChatENV: An Interactive Vision-Language Model for Sensor-Guided Environmental Monitoring and Scenario Simulation

Hosam Elgendy,Ahmed Sharshar,Ahmed Aboeitta,Mohsen Guizani

Main category: cs.CV

TL;DR: ChatENV is an interactive vision language model that integrates satellite image pairs with real-world sensor data to improve environmental monitoring, offering superior performance and interactive scenario-based reasoning capabilities.

Details Motivation: Current vision language models overlook causal signals from environmental sensors, rely on single-source captions prone to stylistic bias, and lack interactive scenario-based reasoning. Method: The framework creates a 177k-image dataset with 152k temporal pairs across 62 land-use classes, uses GPT-4o and Gemini 2.0 for annotation, and fine-tunes Qwen-2.5-VL using LoRA adapters. Result: ChatENV achieves strong performance in temporal and 'what-if' reasoning (e.g., BERT-F1 0.903) and rivals or outperforms state-of-the-art temporal models. Conclusion: ChatENV is a powerful tool for grounded, sensor-aware environmental monitoring, supporting interactive scenario-based analysis and rivaling or outperforming state-of-the-art temporal models. Abstract: Understanding environmental changes from aerial imagery is vital for climate resilience, urban planning, and ecosystem monitoring. Yet, current vision language models (VLMs) overlook causal signals from environmental sensors, rely on single-source captions prone to stylistic bias, and lack interactive scenario-based reasoning. We present ChatENV, the first interactive VLM that jointly reasons over satellite image pairs and real-world sensor data. Our framework: (i) creates a 177k-image dataset forming 152k temporal pairs across 62 land-use classes in 197 countries with rich sensor metadata (e.g., temperature, PM10, CO); (ii) annotates data using GPT-4o and Gemini 2.0 for stylistic and semantic diversity; and (iii) fine-tunes Qwen-2.5-VL using efficient Low-Rank Adaptation (LoRA) adapters for chat purposes. ChatENV achieves strong performance in temporal and "what-if" reasoning (e.g., BERT-F1 0.903) and rivals or outperforms state-of-the-art temporal models, while supporting interactive scenario-based analysis. This positions ChatENV as a powerful tool for grounded, sensor-aware environmental monitoring.
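
As background on why LoRA fine-tuning is called efficient, here is a minimal sketch of the general low-rank update, in which a frozen weight matrix W is adjusted by a scaled product of two small trainable matrices B and A. This illustrates the generic technique only; the rank, scaling, and target layers used for ChatENV are not given in the abstract, and the helper names and toy values are assumptions.

```python
def matmul(x, y):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*y)] for row in x]


def lora_weight(w, b, a, alpha, rank):
    """Effective weight W + (alpha / rank) * B @ A, with W kept frozen."""
    delta = matmul(b, a)
    scale = alpha / rank
    return [
        [w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
        for w_row, d_row in zip(w, delta)
    ]


# Toy 2x2 frozen weight with a rank-1 adapter (hypothetical values).
w = [[1, 0], [0, 1]]
b = [[1], [2]]   # shape (2, rank)
a = [[3, 4]]     # shape (rank, 2)
adapted = lora_weight(w, b, a, alpha=2, rank=1)
```

For a d-by-k weight matrix, the adapter trains only rank * (d + k) values instead of d * k, which is where the memory and compute savings come from.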

[137] Processing and acquisition traces in visual encoders: What does CLIP know about your camera?

Ryan Ramos,Vladan Stojnić,Giorgos Kordopatis-Zilos,Yuta Nakashima,Giorgos Tolias,Noa Garcia

Main category: cs.CV

TL;DR: Subtle image acquisition parameters are encoded in visual representations and can significantly influence semantic predictions, either positively or negatively, depending on their correlation with semantic labels.

Details Motivation: The motivation was to understand how subtle or imperceptible image transformations and acquisition parameters affect the robustness of visual encoders, particularly when these transformations are not seen during training. Method: The researchers analyzed parameters of the image acquisition process and transformations that are subtle or imperceptible to the human eye, assessing their impact on learned visual representations and semantic predictions. Result: The results showed that such parameters are systematically encoded in visual representations and can be easily recovered, with their presence having a significant positive or negative effect on semantic predictions. Conclusion: The study concludes that subtle parameters from the image acquisition process are encoded in visual representations and can significantly impact semantic predictions, depending on their correlation with semantic labels. Abstract: Prior work has analyzed the robustness of visual encoders to image transformations and corruptions, particularly in cases where such alterations are not seen during training. When this occurs, they introduce a form of distribution shift at test time, often leading to performance degradation. The primary focus has been on severe corruptions that, when applied aggressively, distort useful signals necessary for accurate semantic predictions. We take a different perspective by analyzing parameters of the image acquisition process and transformations that may be subtle or even imperceptible to the human eye. We find that such parameters are systematically encoded in the learned visual representations and can be easily recovered. More strikingly, their presence can have a profound impact, either positively or negatively, on semantic predictions. This effect depends on whether there is a strong correlation or anti-correlation between semantic labels and these acquisition-based or processing-based labels. 
Our code and data are available at: https://github.com/ryan-caesar-ramos/visual-encoder-traces

[138] Lameness detection in dairy cows using pose estimation and bidirectional LSTMs

Helena Russello,Rik van der Tol,Eldert J. van Henten,Gert Kootstra

Main category: cs.CV

TL;DR: This paper presents an efficient lameness detection method using pose estimation and a BLSTM classifier, reducing manual feature engineering and improving detection accuracy.

Details Motivation: To improve on traditional lameness detection methods by reducing manual feature engineering and improving detection on short sequences and small training datasets. Method: Motion sequences of nine keypoints (located on the cows' hooves, head, and back) are extracted from videos of walking cows and fed to a BLSTM classifier for binary lameness classification. Result: The best architecture achieved a classification accuracy of 85%, against 80% for the feature-based approach. Conclusion: The paper proposes a lameness detection method combining pose estimation and Bidirectional Long-Short-Term Memory (BLSTM) neural networks. The method can detect lameness with as little as one second of video data and outperforms a traditional method based on hand-designed features. Abstract: This study presents a lameness detection approach that combines pose estimation and Bidirectional Long-Short-Term Memory (BLSTM) neural networks. Combining pose estimation with a BLSTM classifier offers the following advantages: markerless pose estimation, elimination of manual feature engineering by learning temporal motion features from the keypoint trajectories, and working with short sequences and small training datasets. Motion sequences of nine keypoints (located on the cows' hooves, head and back) were extracted from videos of walking cows with the T-LEAP pose estimation model. The trajectories of the keypoints were then used as an input to a BLSTM classifier that was trained to perform binary lameness classification. Our method significantly outperformed an established method that relied on manually-designed locomotion features: our best architecture achieved a classification accuracy of 85%, against 80% accuracy for the feature-based approach. Furthermore, we showed that our BLSTM classifier could detect lameness with as little as one second of video data.

[139] SemPT: Semantic Prompt Tuning for Vision-Language Models

Xiao Shi,Yangjun Ou,Zhenzhong Chen

Main category: cs.CV

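TL;DR: SemPT is a new prompt tuning framework for vision-language models that leverages shared attribute-level knowledge across categories to improve performance in a variety of transfer learning settings.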

Details Motivation: Existing prompt tuning methods rely on sparse category labels or disparate LLM-generated descriptions, which fragments knowledge representation and hinders transferability. Method: SemPT uses a two-step prompting strategy to guide an LLM in extracting shared visual attributes and generating attribute-level descriptions, then applies visually guided weighting to reduce noise and enhance the text embeddings, while aligning image embeddings with both label and attribute-enhanced text embeddings. Result: SemPT achieves state-of-the-art performance across settings including base-to-novel generalization, cross-dataset transfer, cross-domain transfer, and few-shot learning. Conclusion: By leveraging shared attribute-level knowledge across categories, SemPT addresses the generalization challenge and demonstrates state-of-the-art performance across settings on 15 benchmark datasets. Abstract: Visual transfer learning for unseen categories presents an active research topic yet a challenging task, due to the inherent conflict between preserving category-specific representations and acquiring transferable knowledge. Vision-Language Models (VLMs) pre-trained on large amounts of image-text pairs offer a promising solution. However, existing prompt tuning methods rely on sparse category labels or disparate LLM-generated descriptions, which fragment knowledge representation and hinder transferability. To address this limitation, we introduce Semantic Prompt Tuning (SemPT), a novel framework that tackles the generalization challenge by leveraging shared attribute-level knowledge across categories. Specifically, SemPT adopts a two-step prompting strategy to guide an LLM in extracting shared visual attributes and generating attribute-level descriptions, capturing transferable semantic cues beyond labels while ensuring coherent structure. Then, visually guided weighting is applied to the embeddings of attribute-level descriptions to reduce noise from irrelevant attributes and enhance the text embeddings. Additionally, image embeddings are jointly aligned with both label and attribute-enhanced text embeddings, balancing discrimination for seen categories and transferability to unseen ones. Considering the availability of category exposure, our inference dynamically selects between standard label embeddings for seen categories and attribute-enhanced embeddings for unseen ones to ensure effective adaptation. 
Extensive experiments on 15 benchmark datasets demonstrate that SemPT achieves state-of-the-art performance across various settings, including base-to-novel generalization, cross-dataset transfer, cross-domain transfer, and few-shot learning.
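
The visually guided weighting step in the SemPT abstract can be pictured with a small sketch. The abstract does not give the exact formula, so everything below is an assumption made for illustration: the function names, the use of cosine similarity, the softmax temperature, and the toy 3-dimensional embeddings are all hypothetical and not taken from the paper. The idea shown is only that attribute descriptions that agree with the image get more weight, so irrelevant attributes contribute less to the enhanced text embedding.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def softmax(scores, temperature):
    """Numerically stable softmax with a temperature."""
    peak = max(scores)
    exps = [math.exp((s - peak) / temperature) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def weight_attributes(image_emb, attribute_embs, temperature=0.1):
    """Weight attribute embeddings by similarity to the image (assumed scheme)."""
    sims = [cosine(image_emb, a) for a in attribute_embs]
    weights = softmax(sims, temperature)
    dim = len(image_emb)
    # Weighted sum of attribute embeddings, one coordinate at a time
    fused = [sum(w * a[i] for w, a in zip(weights, attribute_embs))
             for i in range(dim)]
    return weights, fused

# Toy data: the first attribute matches the image, the other two do not
image_emb = [1.0, 0.0, 0.0]
attribute_embs = [[0.9, 0.1, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
weights, fused = weight_attributes(image_emb, attribute_embs)
```

In this toy case almost all of the weight lands on the first attribute, which is the behavior the abstract describes as reducing noise from irrelevant attributes. How SemPT combines the weighted attributes with label embeddings is not specified in the abstract and is not modeled here.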

[140] Serial Over Parallel: Learning Continual Unification for Multi-Modal Visual Object Tracking and Benchmarking

Zhangyong Tang,Tianyang Xu,Xuefeng Zhu,Chunyang Cheng,Tao Zhou,Xiaojun Wu,Josef Kittler

Main category: cs.CV

TL;DR: This paper introduces a unified benchmark for multi-modal visual object tracking and explores continual learning to address performance degradation caused by inconsistency in training and testing.

Details Motivation: The lack of a unified benchmark for MMVOT tasks and the inconsistency between training and testing were identified as challenges that degrade performance. Method: UniBench300 was developed as a unified benchmark for MMVOT tasks, and a serial reformulation of the unification process was implemented to align with continual learning principles. Result: UniBench300 reduced inference passes from three to one and cut time consumption by 27%, while continual learning supported a stable unification process and mitigated performance degradation. Conclusion: The introduction of UniBench300 addresses the inconsistency in training and testing for multi-modal visual object tracking, and continual learning proves beneficial in maintaining performance during the unification process. Abstract: Unifying multiple multi-modal visual object tracking (MMVOT) tasks draws increasing attention due to the complementary nature of different modalities in building robust tracking systems. Existing practices mix all data sensor types in a single training procedure, structuring a parallel paradigm from the data-centric perspective and aiming for a global optimum on the joint distribution of the involved tasks. However, the absence of a unified benchmark where all types of data coexist forces evaluations on separated benchmarks, causing inconsistency between training and testing, thus leading to performance degradation. To address these issues, this work advances in two aspects: (1) A unified benchmark, coined as UniBench300, is introduced to bridge the inconsistency by incorporating multiple task data, reducing inference passes from three to one and cutting time consumption by 27%. (2) The unification process is reformulated in a serial format, progressively integrating new tasks. 
In this way, the performance degradation can be specified as knowledge forgetting of previous tasks, which naturally aligns with the philosophy of continual learning (CL), motivating further exploration of injecting CL into the unification process. Extensive experiments conducted on two baselines and four benchmarks demonstrate the significance of UniBench300 and the superiority of CL in supporting a stable unification process. Moreover, while conducting dedicated analyses, the performance degradation is found to be negatively correlated with network capacity. Additionally, modality discrepancies contribute to varying degradation levels across tasks (RGBT > RGBD > RGBE in MMVOT), offering valuable insights for future multi-modal vision research. Source code and the proposed benchmark are available at https://github.com/Zhangyong-Tang/UniBench300.

[141] AddressVLM: Cross-view Alignment Tuning for Image Address Localization using Large Vision-Language Models

Shixiong Xu,Chenghao Zhang,Lubin Fan,Yuan Zhou,Bin Fan,Shiming Xiang,Gaofeng Meng,Jieping Ye

Main category: cs.CV

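TL;DR: This paper proposes AddressVLM, which combines satellite and street-view images to improve the street-level localization accuracy of large vision-language models.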

Details Motivation: Existing large vision-language models localize well at the country or city level but fall short on fine-grained street-level localization within cities; this paper aims to close that gap. Method: AddressVLM uses a two-stage training protocol, cross-view alignment tuning followed by address localization tuning, and combines street-view and satellite images to strengthen the model's understanding of street distribution. Result: On two street-view VQA datasets from Pittsburgh and San Francisco, AddressVLM exceeds existing models by over 9% and 12% in average address localization accuracy, respectively. Conclusion: AddressVLM achieves more precise street-level localization; cross-view alignment tuning that integrates satellite and street-view images improves the model's performance on address localization. Abstract: Large visual language models (LVLMs) have demonstrated impressive performance in coarse-grained geo-localization at the country or city level, but they struggle with fine-grained street-level localization within urban areas. In this paper, we explore integrating city-wide address localization capabilities into LVLMs, facilitating flexible address-related question answering using street-view images. A key challenge is that the street-view visual question-and-answer (VQA) data provides only microscopic visual cues, leading to subpar performance in fine-tuned models. To tackle this issue, we incorporate perspective-invariant satellite images as macro cues and propose cross-view alignment tuning including a satellite-view and street-view image grafting mechanism, along with an automatic label generation mechanism. Then LVLM's global understanding of street distribution is enhanced through cross-view matching. Our proposed model, named AddressVLM, consists of two-stage training protocols: cross-view alignment tuning and address localization tuning. Furthermore, we have constructed two street-view VQA datasets based on image address localization datasets from Pittsburgh and San Francisco. Qualitative and quantitative evaluations demonstrate that AddressVLM outperforms counterpart LVLMs by over 9% and 12% in average address localization accuracy on these two datasets, respectively.

[142] Hybrid Generative Fusion for Efficient and Privacy-Preserving Face Recognition Dataset Generation

Feiran Li,Qianqian Xu,Shilong Bao,Boyu Han,Zhiyong Yang,Qingming Huang

Main category: cs.CV

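TL;DR: This paper presents an efficient method for building a high-quality face dataset that combines data cleaning, synthetic generation, and a curriculum learning strategy, and it took first place in the DataCV ICCV Challenge.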

Details Motivation: To build a high-quality face dataset that does not overlap with any existing public face dataset, in order to train more effective face recognition models. Method: The dataset is cleaned with a Mixture-of-Experts strategy, synthetic identities are generated with Stable Diffusion and Vec2Face, and the model is trained with a curriculum learning strategy. Result: Experiments show that the constructed dataset improves model performance at the 10K, 20K, and 100K identity scales while ensuring no identity leakage. Conclusion: The paper proposes a method for building a high-quality face recognition dataset and took first place in the DataCV ICCV Challenge. Abstract: In this paper, we present our approach to the DataCV ICCV Challenge, which centers on building a high-quality face dataset to train a face recognition model. The constructed dataset must not contain identities overlapping with any existing public face datasets. To handle this challenge, we begin with a thorough cleaning of the baseline HSFace dataset, identifying and removing mislabeled or inconsistent identities through a Mixture-of-Experts (MoE) strategy combining face embedding clustering and GPT-4o-assisted verification. We retain the largest consistent identity cluster and apply data augmentation up to a fixed number of images per identity. To further diversify the dataset, we generate synthetic identities using Stable Diffusion with prompt engineering. As diffusion models are computationally intensive, we generate only one reference image per identity and efficiently expand it using Vec2Face, which rapidly produces 49 identity-consistent variants. This hybrid approach fuses GAN-based and diffusion-based samples, enabling efficient construction of a diverse and high-quality dataset. To address the high visual similarity among synthetic identities, we adopt a curriculum learning strategy by placing them early in the training schedule, allowing the model to progress from easier to harder samples. Our final dataset contains 50 images per identity, and all newly generated identities are checked with mainstream face datasets to ensure no identity leakage. Our method achieves 1st place in the competition, and experimental results show that our dataset improves model performance across 10K, 20K, and 100K identity scales. Code is available at https://github.com/Ferry-Li/datacv_fr.

[143] HyperTea: A Hypergraph-based Temporal Enhancement and Alignment Network for Moving Infrared Small Target Detection

Zhaoyuan Qi,Weihua Gao,Wenlong Niu,Jie Tang,Yun Li,Xiaodong Peng

Main category: cs.CV

TL;DR: HyperTea improves moving infrared small target detection by integrating global and local temporal perspectives and combining CNNs, RNNs, and hypergraph neural networks.

Details Motivation: MIRSTD remains highly challenging due to the target's small size, weak intensity, and complex motion pattern. Existing methods typically only model low-order correlations between feature nodes and perform feature extraction and enhancement within a single temporal scale. Hypergraphs have received limited attention in MIRSTD despite their widespread use for high-order correlation learning. Method: HyperTea integrates global and local temporal perspectives to effectively model high-order spatiotemporal correlations of features through three modules: GTEM, LTEM, and TAM. Result: Experiments on DAUB and IRDST demonstrate HyperTea's state-of-the-art (SOTA) performance. Conclusion: HyperTea is the first work to integrate CNNs, RNNs, and HGNNs for MIRSTD, significantly improving detection performance. Abstract: In practical application scenarios, moving infrared small target detection (MIRSTD) remains highly challenging due to the target's small size, weak intensity, and complex motion pattern. Existing methods typically only model low-order correlations between feature nodes and perform feature extraction and enhancement within a single temporal scale. Although hypergraphs have been widely used for high-order correlation learning, they have received limited attention in MIRSTD. To explore the potential of hypergraphs and enhance multi-timescale feature representation, we propose HyperTea, which integrates global and local temporal perspectives to effectively model high-order spatiotemporal correlations of features. 
HyperTea consists of three modules: the global temporal enhancement module (GTEM) realizes global temporal context enhancement through semantic aggregation and propagation; the local temporal enhancement module (LTEM) is designed to capture local motion patterns between adjacent frames and then enhance local temporal context; additionally, we further develop a temporal alignment module (TAM) to address potential cross-scale feature misalignment. To our best knowledge, HyperTea is the first work to integrate convolutional neural networks (CNNs), recurrent neural networks (RNNs), and hypergraph neural networks (HGNNs) for MIRSTD, significantly improving detection performance. Experiments on DAUB and IRDST demonstrate its state-of-the-art (SOTA) performance. Our source codes are available at https://github.com/Lurenjia-LRJ/HyperTea.

[144] Physics-Informed Joint Multi-TE Super-Resolution with Implicit Neural Representation for Robust Fetal T2 Mapping

Busra Bulut,Maik Dannecker,Thomas Sanchez,Sara Neves Silva,Vladyslav Zalevskyi,Steven Jia,Jean-Baptiste Ledoux,Guillaume Auzias,François Rousseau,Jana Hutter,Daniel Rueckert,Meritxell Bach Cuadra

Main category: cs.CV

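TL;DR: This paper proposes a novel T2 mapping method that jointly reconstructs data across TEs to address motion sensitivity in fetal MRI, reduces scan time, and shows its potential for fetal brain imaging.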

Details Motivation: T2 mapping in fetal brain MRI can improve characterization of the developing brain, especially at mid-field (0.55T), where T2 decay is slower. It is challenging, however, because fetal MRI acquisition relies on multiple motion-corrupted stacks of thick slices and requires slice-to-volume reconstruction (SVR) to estimate a high-resolution 3D volume. Method: The approach combines implicit neural representations with physics-informed regularization that models T2 decay, sharing information across TEs while preserving anatomical and quantitative T2 fidelity. Result: The study demonstrates SOTA performance on simulated fetal brain data and on in vivo adult datasets with fetal-like motion, and presents the first in vivo fetal T2 mapping results at 0.55T. Conclusion: The study presents a novel method that jointly reconstructs data across echo times (TEs) to address motion sensitivity in fetal brain T2 mapping and shows potential for reducing scan time. Abstract: T2 mapping in fetal brain MRI has the potential to improve characterization of the developing brain, especially at mid-field (0.55T), where T2 decay is slower. However, this is challenging as fetal MRI acquisition relies on multiple motion-corrupted stacks of thick slices, requiring slice-to-volume reconstruction (SVR) to estimate a high-resolution (HR) 3D volume. Currently, T2 mapping involves repeated acquisitions of these stacks at each echo time (TE), leading to long scan times and high sensitivity to motion. We tackle this challenge with a method that jointly reconstructs data across TEs, addressing severe motion. Our approach combines implicit neural representations with a physics-informed regularization that models T2 decay, enabling information sharing across TEs while preserving anatomical and quantitative T2 fidelity. We demonstrate state-of-the-art performance on simulated fetal brain and in vivo adult datasets with fetal-like motion. We also present the first in vivo fetal T2 mapping results at 0.55T. Our study shows potential for reducing the number of stacks per TE in T2 mapping by leveraging anatomical redundancy.
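
For readers unfamiliar with T2 mapping, the sketch below shows the standard mono-exponential decay model, S(TE) = S0 * exp(-TE / T2), and a per-voxel log-linear fit that recovers T2 from signals at several echo times. This is the textbook signal model, offered as an assumption about what a physics term that "models T2 decay" would build on; it is not the paper's joint reconstruction, and the echo times, S0, and T2 values are hypothetical.

```python
import math

def simulate_signal(s0, t2_ms, echo_times_ms):
    """Mono-exponential T2 decay: S(TE) = S0 * exp(-TE / T2)."""
    return [s0 * math.exp(-te / t2_ms) for te in echo_times_ms]

def fit_t2_loglinear(echo_times_ms, signals):
    """Least-squares line through (TE, ln S); the slope equals -1/T2."""
    n = len(echo_times_ms)
    logs = [math.log(s) for s in signals]
    mean_te = sum(echo_times_ms) / n
    mean_log = sum(logs) / n
    sxx = sum((te - mean_te) ** 2 for te in echo_times_ms)
    sxy = sum((te - mean_te) * (y - mean_log)
              for te, y in zip(echo_times_ms, logs))
    slope = sxy / sxx
    intercept = mean_log - slope * mean_te
    return math.exp(intercept), -1.0 / slope

# Hypothetical noise-free voxel: S0 = 1000, T2 = 180 ms, four echo times
echo_times_ms = [80.0, 140.0, 200.0, 260.0]
signals = simulate_signal(1000.0, 180.0, echo_times_ms)
s0_hat, t2_hat = fit_t2_loglinear(echo_times_ms, signals)
```

Because every echo time constrains the same two unknowns per voxel, data from one TE carries information about the others, which is the intuition behind sharing information across TEs. With noisy or motion-corrupted data the log-linear fit becomes biased, so this should be read as an illustration of the model rather than a practical estimator.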

[145] IADGPT: Unified LVLM for Few-Shot Industrial Anomaly Detection, Localization, and Reasoning via In-Context Learning

Mengyang Zhao,Teng Fu,Haiyang Yu,Ke Niu,Bin Li

Main category: cs.CV

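TL;DR: This paper proposes IADGPT, a unified framework that uses a three-stage training strategy and in-context learning to reach near-human-level performance in industrial anomaly detection.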

Details Motivation: Existing large vision-language models lack industrial knowledge and reasoning capabilities and cannot meet the needs of industrial quality inspection. Method: A three-stage progressive training strategy is proposed, combined with an in-context learning paradigm, along with a strategy for outputting image-level and pixel-level anomaly scores. Result: IADGPT shows considerable performance gains in anomaly detection, and a new dataset of 100K images is created. Conclusion: IADGPT achieves human-level performance in industrial product defect detection and is competitive in anomaly localization and reasoning. Abstract: Few-Shot Industrial Anomaly Detection (FS-IAD) has important applications in automating industrial quality inspection. Recently, some FS-IAD methods based on Large Vision-Language Models (LVLMs) have been proposed with some achievements through prompt learning or fine-tuning. However, existing LVLMs focus on general tasks but lack basic industrial knowledge and reasoning capabilities related to FS-IAD, making these methods far from specialized human quality inspectors. To address these challenges, we propose a unified framework, IADGPT, designed to perform FS-IAD in a human-like manner, while also handling associated localization and reasoning tasks, even for diverse and novel industrial products. To this end, we introduce a three-stage progressive training strategy inspired by humans. Specifically, the first two stages gradually guide IADGPT in acquiring fundamental industrial knowledge and discrepancy awareness. In the third stage, we design an in-context learning-based training paradigm, enabling IADGPT to leverage a few-shot image as the exemplars for improved generalization to novel products. In addition, we design a strategy that enables IADGPT to output image-level and pixel-level anomaly scores using the logits output and the attention map, respectively, in conjunction with the language output to accomplish anomaly reasoning. To support our training, we present a new dataset comprising 100K images across 400 diverse industrial product categories with extensive attribute-level textual annotations. Experiments indicate IADGPT achieves considerable performance gains in anomaly detection and demonstrates competitiveness in anomaly localization and reasoning. We will release our dataset in camera-ready.

[146] Novel View Synthesis using DDIM Inversion

Sehajdeep Singh,A V Subramanyam

Main category: cs.CV

TL;DR: The paper introduces a lightweight framework for synthesizing novel views from a single input image by utilizing a camera pose-conditioned translation U-Net and a fusion strategy that exploits noise correlation in DDIM inversion, thereby leveraging the capabilities of a pretrained diffusion model.

Details Motivation: The motivation is to address the challenge of synthesizing novel views from a single input image, which involves extrapolating 3D scene structure, inferring details in occluded regions, and maintaining geometric consistency across viewpoints. The authors aim to explore a lightweight view translation framework that leverages the high-fidelity generative capabilities of a pretrained diffusion model. Method: The method involves using a camera pose-conditioned translation U-Net (TUNet) to predict the inverted latent corresponding to the desired target view. A fusion strategy is also employed to preserve texture and fine-grained details by exploiting the noise correlation structure in DDIM inversion. The fused latent is then used as the initial condition for DDIM sampling. Result: The result is a method that effectively synthesizes novel views from a single input image, with experiments on MVImgNet showing that it outperforms existing methods. Conclusion: The paper concludes that their method, which uses a fusion strategy and a camera pose-conditioned translation U-Net with a pretrained diffusion model, outperforms existing methods in synthesizing novel views from a single input image. Abstract: Synthesizing novel views from a single input image is a challenging task. It requires extrapolating the 3D structure of a scene while inferring details in occluded regions, and maintaining geometric consistency across viewpoints. Many existing methods must fine-tune large diffusion backbones using multiple views or train a diffusion model from scratch, which is extremely expensive. Additionally, they suffer from blurry reconstruction and poor generalization. This gap presents the opportunity to explore an explicit lightweight view translation framework that can directly utilize the high-fidelity generative capabilities of a pretrained diffusion model while reconstructing a scene from a novel view. 
Given the DDIM-inverted latent of a single input image, we employ a camera pose-conditioned translation U-Net, TUNet, to predict the inverted latent corresponding to the desired target view. However, the image sampled using the predicted latent may result in a blurry reconstruction. To this end, we propose a novel fusion strategy that exploits the inherent noise correlation structure observed in DDIM inversion. The proposed fusion strategy helps preserve the texture and fine-grained details. To synthesize the novel view, we use the fused latent as the initial condition for DDIM sampling, leveraging the generative prior of the pretrained diffusion model. Extensive experiments on MVImgNet demonstrate that our method outperforms existing methods.
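
As background for the DDIM inversion step mentioned above, the following toy sketch shows the deterministic DDIM update (eta = 0) run in both directions on a single scalar. The noise schedule, the constant `toy_eps_model`, and the scalar "latent" are hypothetical stand-ins chosen so the example runs without a diffusion model; TUNet and the paper's fusion strategy are not reproduced here.

```python
import math

def ddim_step(x, eps, alpha_from, alpha_to):
    """Deterministic DDIM update (eta = 0) between two noise levels.

    alpha_from and alpha_to are cumulative alpha-bar values. Moving to a
    smaller alpha-bar adds noise (inversion); moving to a larger one
    removes it (sampling). The same formula serves both directions.
    """
    x0_pred = (x - math.sqrt(1.0 - alpha_from) * eps) / math.sqrt(alpha_from)
    return math.sqrt(alpha_to) * x0_pred + math.sqrt(1.0 - alpha_to) * eps

def toy_eps_model(x, alpha):
    # Hypothetical stand-in for a pretrained noise predictor (constant output)
    return 0.3

ALPHAS = [0.99, 0.9, 0.7, 0.4, 0.1]  # assumed schedule, clean to noisy

def invert(x_clean):
    """Walk from the clean end of the schedule to the noisy end."""
    x = x_clean
    for a_from, a_to in zip(ALPHAS[:-1], ALPHAS[1:]):
        x = ddim_step(x, toy_eps_model(x, a_from), a_from, a_to)
    return x

def sample(x_noisy):
    """Walk back from the noisy end to the clean end."""
    x = x_noisy
    schedule = ALPHAS[::-1]
    for a_from, a_to in zip(schedule[:-1], schedule[1:]):
        x = ddim_step(x, toy_eps_model(x, a_from), a_from, a_to)
    return x

latent = invert(1.5)
reconstruction = sample(latent)
```

With a constant noise prediction the round trip is exact. With a real network the prediction depends on the current latent, so inversion is only approximate, which is part of why a predicted target-view latent can sample to a blurry image and why the authors add a fusion step.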

[147] Beyond conventional vision: RGB-event fusion for robust object detection in dynamic traffic scenarios

Zhanwen Liu,Yujing Sun,Yang Wang,Nan Yang,Shengbo Eben Li,Xiangmo Zhao

Main category: cs.CV

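TL;DR: To address the dynamic range limitation of conventional RGB cameras in complex traffic environments, this paper proposes a motion cue fusion network (MCFNet) that fuses event and RGB cameras, achieving better spatiotemporal alignment and cross-modal feature fusion and improving object detection in low-light and fast-moving scenes.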

Details Motivation: The dynamic range limitation of conventional RGB cameras reduces contrast and causes loss of high-frequency details in complex traffic environments, which hinders feature extraction and frame-based object detection. Method: A motion cue fusion network (MCFNet) is proposed, comprising an event correction module (ECM), an event dynamic upsampling module (EDUM), and a cross-modal mamba fusion module (CMM), to achieve spatiotemporal alignment and adaptive cross-modal feature fusion. Result: Experiments on the DSEC-Det and PKU-DAVIS-SOD datasets show that MCFNet significantly outperforms existing methods in a range of low-light and fast-moving traffic scenarios. On DSEC-Det, mAP50 and mAP improve by 7.4% and 1.7%, respectively. Conclusion: By fusing information from event and RGB cameras, MCFNet significantly outperforms existing methods in traffic scenes with complex lighting and fast motion, improving object detection performance. Abstract: The dynamic range limitation of conventional RGB cameras reduces global contrast and causes loss of high-frequency details such as textures and edges in complex traffic environments (e.g., nighttime driving, tunnels), hindering discriminative feature extraction and degrading frame-based object detection. To address this, we integrate a bio-inspired event camera with an RGB camera to provide high dynamic range information and propose a motion cue fusion network (MCFNet), which achieves optimal spatiotemporal alignment and adaptive cross-modal feature fusion under challenging lighting. Specifically, an event correction module (ECM) temporally aligns asynchronous event streams with image frames via optical-flow-based warping, jointly optimized with the detection network to learn task-aware event representations. The event dynamic upsampling module (EDUM) enhances spatial resolution of event frames to match image structures, ensuring precise spatiotemporal alignment. The cross-modal mamba fusion module (CMM) uses adaptive feature fusion with a novel interlaced scanning mechanism, effectively integrating complementary information for robust detection. Experiments conducted on the DSEC-Det and PKU-DAVIS-SOD datasets demonstrate that MCFNet significantly outperforms existing methods in various poor lighting and fast moving traffic scenarios. Notably, on the DSEC-Det dataset, MCFNet achieves a remarkable improvement, surpassing the best existing methods by 7.4% in mAP50 and 1.7% in mAP metrics, respectively. The code is available at https://github.com/Charm11492/MCFNet.

[148] CountCluster: Training-Free Object Quantity Guidance with Cross-Attention Map Clustering for Text-to-Image Generation

Joohyeon Lee,Jin-Seop Lee,Jee-Hyong Lee

Main category: cs.CV

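TL;DR: To address diffusion models' failure to accurately reflect the number of objects specified in the input prompt, the paper proposes CountCluster, a method that guides clustering of the object cross-attention map at inference time without external tools or additional training, achieving higher object count accuracy and better quantity control.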

Details Motivation: Diffusion models perform well in image quality and diversity but still struggle to reflect the object count specified in the input prompt; existing methods have limitations and overlook an important structural characteristic, namely that the number of object instances is determined in the early stages of the denoising process. Method: At inference time, CountCluster partitions the object cross-attention map into k clusters based on attention scores, defines an ideal distribution in which each cluster is spatially well-separated, and optimizes the latent to align with this target distribution. Result: The method improves object count accuracy by 18.5%p on average over existing methods and shows better quantity control across a variety of prompts. Conclusion: Without relying on any external tools or additional training, CountCluster guides the object cross-attention map to cluster according to the specified object count, improving object count accuracy in generated images and showing superior quantity control across a variety of prompts. Abstract: Diffusion-based text-to-image generation models have demonstrated strong performance in terms of image quality and diversity. However, they still struggle to generate images that accurately reflect the number of objects specified in the input prompt. Several approaches have been proposed that rely on either external counting modules for iterative refinement or quantity representations derived from learned tokens or latent features. However, they still have limitations in accurately reflecting the specified number of objects and overlook an important structural characteristic: the number of object instances in the generated image is largely determined in the early timesteps of the denoising process. To correctly reflect the object quantity for image generation, the highly activated regions in the object cross-attention map at the early timesteps should match the input object quantity, while each region should be clearly separated. To address this issue, we propose CountCluster, a method that guides the object cross-attention map to be clustered according to the specified object count in the input, without relying on any external tools or additional training. The proposed method partitions the object cross-attention map into $k$ clusters at inference time based on attention scores, defines an ideal distribution in which each cluster is spatially well-separated, and optimizes the latent to align with this target distribution. 
Our method achieves an average improvement of 18.5%p in object count accuracy compared to existing methods, and demonstrates superior quantity control performance across a variety of prompts. Code will be released at: https://github.com/JoohyeonL22/CountCluster.

[149] NextStep-1: Toward Autoregressive Image Generation with Continuous Tokens at Scale

NextStep Team,Chunrui Han,Guopeng Li,Jingwei Wu,Quan Sun,Yan Cai,Yuang Peng,Zheng Ge,Deyu Zhou,Haomiao Tang,Hongyu Zhou,Kenkun Liu,Ailin Huang,Bin Wang,Changxin Miao,Deshan Sun,En Yu,Fukun Yin,Gang Yu,Hao Nie,Haoran Lv,Hanpeng Hu,Jia Wang,Jian Zhou,Jianjian Sun,Kaijun Tan,Kang An,Kangheng Lin,Liang Zhao,Mei Chen,Peng Xing,Rui Wang,Shiyu Liu,Shutao Xia,Tianhao You,Wei Ji,Xianfang Zeng,Xin Han,Xuelin Zhang,Yana Wei,Yanming Xu,Yimin Jiang,Yingming Wang,Yu Zhou,Yucheng Han,Ziyang Meng,Binxing Jiao,Daxin Jiang,Xiangyu Zhang,Yibo Zhu

Main category: cs.CV

TL;DR: This paper presents NextStep-1, a powerful autoregressive model for text-to-image generation with strong image synthesis and editing capabilities.

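Details Motivation: Existing autoregressive models for text-to-image generation either rely on computationally intensive diffusion models to process continuous image tokens or use vector quantization (VQ) to obtain discrete tokens with quantization loss. This paper aims to advance the autoregressive paradigm. Method: NextStep-1, a 14B autoregressive model paired with a 157M flow matching head, is trained on discrete text tokens and continuous image tokens with next-token prediction objectives. Result: NextStep-1 achieves state-of-the-art performance for autoregressive models in text-to-image generation, with strong high-fidelity image synthesis, and also performs well in image editing. Conclusion: NextStep-1 is an autoregressive model with strong image synthesis and editing capabilities, demonstrating the power and versatility of a unified approach. The authors also plan to release code and models to the community to facilitate open research. Abstract: Prevailing autoregressive (AR) models for text-to-image generation either rely on heavy, computationally-intensive diffusion models to process continuous image tokens, or employ vector quantization (VQ) to obtain discrete tokens with quantization loss. In this paper, we push the autoregressive paradigm forward with NextStep-1, a 14B autoregressive model paired with a 157M flow matching head, training on discrete text tokens and continuous image tokens with next-token prediction objectives. NextStep-1 achieves state-of-the-art performance for autoregressive models in text-to-image generation tasks, exhibiting strong capabilities in high-fidelity image synthesis. Furthermore, our method shows strong performance in image editing, highlighting the power and versatility of our unified approach. To facilitate open research, we will release our code and models to the community.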

[150] Lightweight CNNs for Embedded SAR Ship Target Detection and Classification

Fabian Kresse,Georgios Pilikos,Mario Azcueta,Nicolas Floury

Main category: cs.CV

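TL;DR: This study proposes neural networks for real-time inference on unfocused SAR data, suitable for on-board satellite processing, to reduce the volume of downlinked data and enable target classification.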

Details Motivation: Near-real-time monitoring is constrained by the need to downlink all raw data and to perform image focusing and analysis on the ground, so on-board processing that generates higher-level products on the satellite is needed. Method: Neural networks are proposed and evaluated for real-time inference on unfocused SAR data acquired by Sentinel-1 in Stripmap and Interferometric Wide (IW) modes. Result: The results demonstrate the feasibility of using one of the models for on-board processing and FPGA deployment, and a binary classification task between ships and windmills shows that target classification is possible. Conclusion: The study demonstrates that the proposed neural networks can run real-time inference on unfocused SAR data and that a model can be deployed on an FPGA; in addition, a binary classification task between ships and windmills shows that target classification is possible. Abstract: Synthetic Aperture Radar (SAR) data enables large-scale surveillance of maritime vessels. However, near-real-time monitoring is currently constrained by the need to downlink all raw data, perform image focusing, and subsequently analyze it on the ground. On-board processing to generate higher-level products could reduce the data volume that needs to be downlinked, alleviating bandwidth constraints and minimizing latency. However, traditional image focusing and processing algorithms face challenges due to the satellite's limited memory, processing power, and computational resources. This work proposes and evaluates neural networks designed for real-time inference on unfocused SAR data acquired in Stripmap and Interferometric Wide (IW) modes captured with Sentinel-1. Our results demonstrate the feasibility of using one of our models for on-board processing and deployment on an FPGA. Additionally, by investigating a binary classification task between ships and windmills, we demonstrate that target classification is possible.

[151] Revisiting Cross-View Localization from Image Matching

Panwang Xia,Qiong Wu,Lei Yu,Yi Liu,Mingtao Xiong,Lei Liang,Yongjun Zhang,Yi Wan

Main category: cs.CV

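TL;DR: This paper proposes a new cross-view localization framework that improves both cross-view image matching and localization, raising localization accuracy and matching quality, and introduces the CVFM benchmark dataset.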

Details Motivation: Existing cross-view localization methods fail to establish precise cross-view correspondences, yielding coarse or geometrically inconsistent matches, so a method that improves fine-grained image matching accuracy is needed. Method: A Surface Model and a SimRefiner module are introduced; SimRefiner refines the similarity matrix through local-global residual correction, eliminating the reliance on post-processing such as RANSAC. Result: Under extreme viewpoint disparity, the method substantially improves localization accuracy and image matching quality, setting new baselines. Conclusion: The paper proposes a new cross-view localization framework that improves both cross-view image matching and localization, and introduces CVFM, the first benchmark with 32,509 cross-view image pairs. Abstract: Cross-view localization aims to estimate the 3 degrees of freedom pose of a ground-view image by registering it to aerial or satellite imagery. It is essential in GNSS-denied environments such as urban canyons and disaster zones. Existing methods either regress poses directly or align features in a shared bird's-eye view (BEV) space, both built upon accurate spatial correspondences between perspectives. However, these methods fail to establish strict cross-view correspondences, yielding only coarse or geometrically inconsistent matches. Consequently, fine-grained image matching between ground and aerial views remains an unsolved problem, which in turn constrains the interpretability of localization results. In this paper, we revisit cross-view localization from the perspective of cross-view image matching and propose a novel framework that improves both matching and localization. Specifically, we introduce a Surface Model to model visible regions for accurate BEV projection, and a SimRefiner module to refine the similarity matrix through local-global residual correction, eliminating the reliance on post-processing like RANSAC. To further support research in this area, we introduce CVFM, the first benchmark with 32,509 cross-view image pairs annotated with pixel-level correspondences. Extensive experiments demonstrate that our approach substantially improves both localization accuracy and image matching quality, setting new baselines under extreme viewpoint disparity.

[152] Exploiting Discriminative Codebook Prior for Autoregressive Image Generation

Longxiang Tang,Ruihang Chu,Xiang Wang,Yujin Han,Pingyu Wu,Chunming He,Yingya Zhang,Shiwei Zhang,Jiaya Jia

Main category: cs.CV

TL;DR: This paper proposes DCPE as a better alternative to k-means clustering for codebook-based autoregressive image generation, improving training efficiency and model performance.

Details Motivation: k-means clustering is inadequate for token feature space due to issues like token space disparity and centroid distance inaccuracy, prompting the need for a better approach. Method: The Discriminative Codebook Prior Extractor (DCPE) uses an instance-based distance instead of centroid-based distance and applies agglomerative merging to address token space disparity. Result: DCPE accelerates training by 42% on LlamaGen-B and improves FID and IS performance, demonstrating its plug-and-play compatibility with existing paradigms. Conclusion: DCPE effectively addresses the limitations of k-means clustering in codebook feature space, enhancing the training efficiency and performance of autoregressive models. Abstract: Advanced discrete token-based autoregressive image generation systems first tokenize images into sequences of token indices with a codebook, and then model these sequences in an autoregressive paradigm. While autoregressive generative models are trained only on index values, the prior encoded in the codebook, which contains rich token similarity information, is not exploited. Recent studies have attempted to incorporate this prior by performing naive k-means clustering on the tokens, helping to facilitate the training of generative models with a reduced codebook. However, we reveal that k-means clustering performs poorly in the codebook feature space due to inherent issues, including token space disparity and centroid distance inaccuracy. In this work, we propose the Discriminative Codebook Prior Extractor (DCPE) as an alternative to k-means clustering for more effectively mining and utilizing the token similarity information embedded in the codebook. DCPE replaces the commonly used centroid-based distance, which is found to be unsuitable and inaccurate for the token feature space, with a more reasonable instance-based distance. 
Using an agglomerative merging technique, it further addresses the token space disparity issue by avoiding splitting high-density regions and aggregating low-density ones. Extensive experiments demonstrate that DCPE is plug-and-play and integrates seamlessly with existing codebook prior-based paradigms. With the discriminative prior extracted, DCPE accelerates the training of autoregressive models by 42% on LlamaGen-B and improves final FID and IS performance.
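
The contrast between centroid-based and instance-based grouping can be illustrated with a small sketch. The code below is not DCPE itself: the abstract does not specify the exact instance-based distance, so average pairwise distance between cluster members (average linkage) is used here as an assumed stand-in, and the names `instance_distance`, `agglomerative_merge`, and the toy `codebook` are hypothetical.

```python
from itertools import combinations
from math import dist


def instance_distance(cluster_a, cluster_b, points):
    """Average pairwise distance between the members of two clusters.

    Assumption: average linkage stands in for the paper's instance-based distance.
    """
    total = sum(dist(points[i], points[j]) for i in cluster_a for j in cluster_b)
    return total / (len(cluster_a) * len(cluster_b))


def agglomerative_merge(points, num_clusters):
    """Greedily merge the two closest clusters until num_clusters remain."""
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > num_clusters:
        # Pick the pair of clusters with the smallest instance-based distance.
        a, b = min(
            combinations(range(len(clusters)), 2),
            key=lambda pair: instance_distance(
                clusters[pair[0]], clusters[pair[1]], points
            ),
        )
        # a < b is guaranteed by combinations, so deleting b keeps index a valid.
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return [sorted(c) for c in clusters]


# Toy 2-D "codebook": three tokens in a dense region, two isolated tokens.
codebook = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 5.0), (9.0, 0.0)]
groups = agglomerative_merge(codebook, num_clusters=3)
print(groups)
```

In this toy case the three dense tokens end up in one group while the two isolated tokens stay separate, which is the kind of behavior bottom-up merging gives without ever computing a centroid.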

[153] EgoCross: Benchmarking Multimodal Large Language Models for Cross-Domain Egocentric Video Question Answering

Yanjun Li,Yuqian Fu,Tianwen Qian,Qi'ao Xu,Silong Dai,Danda Pani Paudel,Luc Van Gool,Xiaoling Wang

Main category: cs.CV

TL;DR: This paper introduces EgoCross, a benchmark for evaluating the cross-domain generalization of multimodal large language models in egocentric video question answering.

Details Motivation: Existing benchmarks and studies are mostly limited to daily activities, whereas real-world deployment encounters domain shifts that differ substantially in both visual style and semantic content, so a more challenging cross-domain evaluation benchmark is needed. Method: The paper introduces EgoCross, a comprehensive benchmark covering four diverse and challenging domains (surgery, industry, extreme sports, and animal perspective), and evaluates the cross-domain generalization of existing MLLMs through extensive experiments. Result: Experiments show that most existing MLLMs perform poorly on domains beyond daily life, highlighting the limitations of current models. Conclusion: EgoCross highlights the shortcomings of current MLLMs in cross-domain generalization and explores potential improvements, providing a foundation for future domain-adaptive and robust video understanding. Abstract: Recent advances in Multimodal Large Language Models (MLLMs) have significantly pushed the frontier of egocentric video question answering (EgocentricQA). However, existing benchmarks and studies are mainly limited to common daily activities such as cooking and cleaning. In contrast, real-world deployment inevitably encounters domain shifts, where target domains differ substantially in both visual style and semantic content. To bridge this gap, we introduce EgoCross, a comprehensive benchmark designed to evaluate the cross-domain generalization of MLLMs in EgocentricQA. EgoCross covers four diverse and challenging domains, including surgery, industry, extreme sports, and animal perspective, representing realistic and high-impact application scenarios. It comprises approximately 1,000 QA pairs across 798 video clips, spanning four key QA tasks: prediction, recognition, localization, and counting. Each QA pair provides both OpenQA and CloseQA formats to support fine-grained evaluation. Extensive experiments show that most existing MLLMs, whether general-purpose or egocentric-specialized, struggle to generalize to domains beyond daily life, highlighting the limitations of current models. Furthermore, we conduct several pilot studies, e.g., fine-tuning and reinforcement learning, to explore potential improvements. We hope EgoCross and our accompanying analysis will serve as a foundation for advancing domain-adaptive, robust egocentric video understanding. Data and code will be released at: https://github.com/MyUniverse0726/EgoCross

[154] Dissecting Generalized Category Discovery: Multiplex Consensus under Self-Deconstruction

Luyao Tang,Kunze Huang,Chaoqi Chen,Yuxuan Yuan,Chenxin Li,Xiaotong Tu,Xinghao Ding,Yue Huang

Main category: cs.CV

TL;DR: This paper proposes ConGCD, a generalized category discovery method inspired by human cognition that uses visual primitive decomposition and a consensus mechanism, and performs well across multiple benchmarks.

Details Motivation: Human perceptual systems excel at recognizing and inducing objects, a capability that current machine learning frameworks cannot yet match. Generalized category discovery (GCD) aims to narrow this gap, but existing methods mainly focus on optimizing objective functions and lack solutions inspired by human cognition. Method: ConGCD builds primitive-oriented representations through high-level semantic reconstruction, combines dominant and contextual consensus units to capture class-discriminative patterns and distributional invariants, and uses a consensus scheduler to dynamically optimize activation pathways. Result: Extensive evaluations on coarse- and fine-grained benchmarks show that ConGCD performs strongly on generalized category discovery; the code is open source. Conclusion: ConGCD achieves strong results on generalized category discovery through a novel approach inspired by human cognition, underscoring its effectiveness as a consensus-aware paradigm. Abstract: Human perceptual systems excel at inducing and recognizing objects across both known and novel categories, a capability far beyond current machine learning frameworks. While generalized category discovery (GCD) aims to bridge this gap, existing methods predominantly focus on optimizing objective functions. We present an orthogonal solution, inspired by the human cognitive process for novel object understanding: decomposing objects into visual primitives and establishing cross-knowledge comparisons. We propose ConGCD, which establishes primitive-oriented representations through high-level semantic reconstruction, binding intra-class shared attributes via deconstruction. Mirroring human preference diversity in visual processing, where distinct individuals leverage dominant or contextual cues, we implement dominant and contextual consensus units to capture class-discriminative patterns and inherent distributional invariants, respectively. A consensus scheduler dynamically optimizes activation pathways, with final predictions emerging through multiplex consensus integration. Extensive evaluations across coarse- and fine-grained benchmarks demonstrate ConGCD's effectiveness as a consensus-aware paradigm. Code is available at github.com/lytang63/ConGCD.

[155] Privacy-enhancing Sclera Segmentation Benchmarking Competition: SSBC 2025

Matej Vitek,Darian Tomašević,Abhijit Das,Sabari Nathan,Gökhan Özbulak,Gözde Ayşe Tataroğlu Özbulak,Jean-Paul Calbimonte,André Anjos,Hariohm Hemant Bhatt,Dhruv Dhirendra Premani,Jay Chaudhari,Caiyong Wang,Jian Jiang,Chi Zhang,Qi Zhang,Iyyakutti Iyappan Ganapathi,Syed Sadaf Ali,Divya Velayudan,Maregu Assefa,Naoufel Werghi,Zachary A. Daniels,Leeon John,Ritesh Vyas,Jalil Nourmohammadi Khiarak,Taher Akbari Saeed,Mahsa Nasehi,Ali Kianfar,Mobina Pashazadeh Panahi,Geetanjali Sharma,Pushp Raj Panth,Raghavendra Ramachandra,Aditya Nigam,Umapada Pal,Peter Peer,Vitomir Štruc

Main category: cs.CV

TL;DR: The 2025 Sclera Segmentation Benchmarking Competition evaluated privacy-preserving sclera segmentation models trained on synthetic data; the results show that training on synthetic data can achieve strong performance, particularly when dedicated training strategies are used.

Details Motivation: To evaluate the performance gap between models trained on synthetic data and models trained on real-world datasets, and to explore the potential of synthetic data for privacy-preserving biometric development. Method: The competition had two tracks: (i) model development using synthetic data only; (ii) combining synthetic data with a limited amount of real-world data. Participating teams used a variety of model architectures, including transformer-based solutions, lightweight models, and segmentation networks guided by generative frameworks. Result: In the synthetic data track, the top models achieved F1 scores above 0.8, showing that models trained entirely on synthetic data can be competitive. In the mixed track, performance gains were generally driven more by methodological choices than by the inclusion of real data. Conclusion: Models trained entirely on synthetic data can be competitive for privacy-preserving biometric development, particularly with dedicated training strategies, and methodological choices played a key role in performance gains. Abstract: This paper presents a summary of the 2025 Sclera Segmentation Benchmarking Competition (SSBC), which focused on the development of privacy-preserving sclera-segmentation models trained using synthetically generated ocular images. The goal of the competition was to evaluate how well models trained on synthetic data perform in comparison to those trained on real-world datasets. The competition featured two tracks: $(i)$ one relying solely on synthetic data for model development, and $(ii)$ one combining/mixing synthetic with (a limited amount of) real-world data. A total of nine research groups submitted diverse segmentation models, employing a variety of architectural designs, including transformer-based solutions, lightweight models, and segmentation networks guided by generative frameworks. Experiments were conducted across three evaluation datasets containing both synthetic and real-world images, collected under diverse conditions. Results show that models trained entirely on synthetic data can achieve competitive performance, particularly when dedicated training strategies are employed, as evidenced by the top performing models that achieved $F_1$ scores of over $0.8$ in the synthetic data track. Moreover, performance gains in the mixed track were often driven more by methodological choices than by the inclusion of real data, highlighting the promise of synthetic data for privacy-aware biometric development. The code and data for the competition are available at: https://github.com/dariant/SSBC_2025.
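
For readers unfamiliar with the F1 metric reported for this competition, the following minimal sketch computes it for a pair of binary segmentation masks. The function name `f1_score`, the nested-list mask format, and the convention of returning 1.0 when both masks are empty are assumptions made for illustration; the competition's own evaluation code may differ.

```python
def f1_score(pred_mask, true_mask):
    """F1 = 2*TP / (2*TP + FP + FN) over the pixels of two binary masks."""
    tp = fp = fn = 0
    for pred_row, true_row in zip(pred_mask, true_mask):
        for p, t in zip(pred_row, true_row):
            if p and t:
                tp += 1  # sclera pixel correctly predicted
            elif p and not t:
                fp += 1  # predicted sclera where there is none
            elif t and not p:
                fn += 1  # missed sclera pixel
    denom = 2 * tp + fp + fn
    # Assumed convention: two empty masks count as a perfect match.
    return 2 * tp / denom if denom else 1.0


true_mask = [[0, 1, 1, 0],
             [0, 1, 1, 0]]
pred_mask = [[0, 1, 1, 1],
             [0, 1, 0, 0]]
score = f1_score(pred_mask, true_mask)
print(score)  # 3 true positives, 1 false positive, 1 false negative -> 0.75
```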

[156] Axis-level Symmetry Detection with Group-Equivariant Representation

Wongyun Yu,Ahyun Seo,Minsu Cho

Main category: cs.CV

TL;DR: This paper proposes a novel framework for precise axis-level detection of reflection and rotational symmetry in computer vision, achieving state-of-the-art results through a dual-branch architecture and dihedral group-equivariant features.

Details Motivation: Detecting symmetry in complex scenes remains a significant challenge in computer vision, and recent heatmap-based approaches often lack precision in identifying individual symmetry axes. Method: The method uses a dual-branch architecture for detecting reflection and rotation symmetry. It introduces orientational anchors and reflectional matching for reflection symmetry, and rotational matching for rotational symmetry, all while being equivariant to the dihedral group. Result: Extensive experiments demonstrate that the proposed method outperforms existing approaches and achieves state-of-the-art performance in symmetry detection. Conclusion: The proposed method achieves state-of-the-art performance in detecting reflection and rotational symmetry by representing them as explicit geometric primitives and employing a dual-branch architecture that is equivariant to the dihedral group. Abstract: Symmetry is a fundamental concept that has been extensively studied, yet detecting it in complex scenes remains a significant challenge in computer vision. Recent heatmap-based approaches can localize potential regions of symmetry axes but often lack precision in identifying individual axes. In this work, we propose a novel framework for axis-level detection of the two most common symmetry types, reflection and rotation, by representing them as explicit geometric primitives, i.e., lines and points. Our method employs a dual-branch architecture that is equivariant to the dihedral group, with each branch specialized to exploit the structure of dihedral group-equivariant features for its respective symmetry type. For reflection symmetry, we introduce orientational anchors, aligned with group components, to enable orientation-specific detection, and a reflectional matching that measures similarity between patterns and their mirrored counterparts across candidate axes. 
For rotational symmetry, we propose a rotational matching that compares patterns at fixed angular intervals to identify rotational centers. Extensive experiments demonstrate that our method achieves state-of-the-art performance, outperforming existing approaches.
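
The idea of scoring candidate axes by comparing a pattern with its mirrored counterpart can be shown on a toy example. The sketch below is only an illustration on raw pixel values with vertical axes; the paper performs matching on dihedral group-equivariant features, and the names `reflection_score`, `best_vertical_axis`, and the `min_reach` parameter are hypothetical.

```python
def reflection_score(image, axis):
    """Fraction of mirrored pixel pairs that agree.

    The candidate axis is the vertical line between columns axis-1 and axis.
    """
    width = len(image[0])
    reach = min(axis, width - axis)  # how far both sides extend from the axis
    matches = total = 0
    for row in image:
        for offset in range(reach):
            left = row[axis - 1 - offset]
            right = row[axis + offset]
            matches += left == right
            total += 1
    return matches / total


def best_vertical_axis(image, min_reach=2):
    """Return the candidate axis with the highest mirrored-pair agreement.

    min_reach (an assumed safeguard) skips axes too close to the border,
    where a single matching pair would trivially score 1.0.
    """
    width = len(image[0])
    candidates = range(min_reach, width - min_reach + 1)
    return max(candidates, key=lambda axis: reflection_score(image, axis))


image = [
    [0, 1, 2, 2, 1, 0],
    [3, 4, 5, 5, 4, 3],
    [0, 0, 7, 7, 0, 0],
]
axis = best_vertical_axis(image)
print(axis, reflection_score(image, axis))
```

Here the pattern is mirror-symmetric about the line between columns 2 and 3, so that axis receives a perfect score while its neighbors score zero.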

[157] Forgery Guided Learning Strategy with Dual Perception Network for Deepfake Cross-domain Detection

Lixin Jia,Zhiqing Guo,Gaobo Yang,Liejun Wang,Keqin Li

Main category: cs.CV

TL;DR: This paper proposes FGL and DPNet to improve deepfake detection by enabling models to adapt to unknown forgery techniques and better capture forgery trace correlations.

Details Motivation: Current deepfake detection methods perform poorly on datasets with unknown forgery techniques, and cross-domain detection methods relying on common forgery traces are becoming ineffective as forgery techniques evolve. Method: FGL captures differential information between known and unknown forgery techniques, while DPNet extracts features in frequency and spatial domains and uses graph convolution to perceive feature relationships. Result: The proposed approach generalizes well across different scenarios and effectively handles unknown forgery challenges, as shown by extensive experiments. Conclusion: Forgery Guided Learning (FGL) strategy and Dual Perception Network (DPNet) provide robust support for deepfake detection by enabling models to adapt to unknown forgery techniques and capture forgery trace correlations. Abstract: The emergence of deepfake technology has introduced a range of societal problems, garnering considerable attention. Current deepfake detection methods perform well on specific datasets, but exhibit poor performance when applied to datasets with unknown forgery techniques. Moreover, as the gap between emerging and traditional forgery techniques continues to widen, cross-domain detection methods that rely on common forgery traces are becoming increasingly ineffective. This situation highlights the urgency of developing deepfake detection technology with strong generalization to cope with fast iterative forgery techniques. To address these challenges, we propose a Forgery Guided Learning (FGL) strategy designed to enable detection networks to continuously adapt to unknown forgery techniques. Specifically, the FGL strategy captures the differential information between known and unknown forgery techniques, allowing the model to dynamically adjust its learning process in real time. 
To further improve the ability to perceive forgery traces, we design a Dual Perception Network (DPNet) that captures both differences and relationships among forgery traces. In the frequency stream, the network dynamically perceives and extracts discriminative features across various forgery techniques, establishing essential detection cues. These features are then integrated with spatial features and projected into the embedding space. In addition, graph convolution is employed to perceive relationships across the entire feature space, facilitating a more comprehensive understanding of forgery trace correlations. Extensive experiments show that our approach generalizes well across different scenarios and effectively handles unknown forgery challenges, providing robust support for deepfake detection. Our code is available on https://github.com/vpsg-research/FGL.

[158] An Efficient Model-Driven Groupwise Approach for Atlas Construction

Ziwei Zou,Bei Zou,Xiaoyan Kui,Wenqi Lu,Haoran Dou,Arezoo Zakeri,Timothy Cootes,Alejandro F Frangi,Jinming Duan

Main category: cs.CV

TL;DR: The paper introduces DARC, a model-driven groupwise registration framework for constructing diffeomorphic medical image atlases, offering advantages in scalability, anatomical fidelity, and applications like one-shot segmentation and shape synthesis.

Details Motivation: While data-driven registration methods have recently shown promise in pairwise settings, their reliance on large training datasets, limited generalizability, and lack of true inference phases in groupwise contexts hinder their practical use. In contrast, model-driven methods offer training-free, theoretically grounded, and data-efficient alternatives, though they often face scalability and optimization challenges when applied to large 3D datasets. Method: The paper introduces DARC (Diffeomorphic Atlas Registration via Coordinate descent), a novel model-driven groupwise registration framework for atlas construction. Result: DARC produces unbiased, diffeomorphic atlases with high anatomical fidelity. One-shot segmentation outperforms state-of-the-art few-shot methods. New anatomical variants are generated by warping the atlas mesh using synthesized diffeomorphic deformation fields. Conclusion: DARC offers a flexible, generalizable, and resource-efficient framework for atlas construction and applications. Abstract: Atlas construction is fundamental to medical image analysis, offering a standardized spatial reference for tasks such as population-level anatomical modeling. While data-driven registration methods have recently shown promise in pairwise settings, their reliance on large training datasets, limited generalizability, and lack of true inference phases in groupwise contexts hinder their practical use. In contrast, model-driven methods offer training-free, theoretically grounded, and data-efficient alternatives, though they often face scalability and optimization challenges when applied to large 3D datasets. In this work, we introduce DARC (Diffeomorphic Atlas Registration via Coordinate descent), a novel model-driven groupwise registration framework for atlas construction. DARC supports a broad range of image dissimilarity metrics and efficiently handles arbitrary numbers of 3D images without incurring GPU memory issues. 
Through a coordinate descent strategy and a centrality-enforcing activation function, DARC produces unbiased, diffeomorphic atlases with high anatomical fidelity. Beyond atlas construction, we demonstrate two key applications: (1) One-shot segmentation, where labels annotated only on the atlas are propagated to subjects via inverse deformations, outperforming state-of-the-art few-shot methods; and (2) shape synthesis, where new anatomical variants are generated by warping the atlas mesh using synthesized diffeomorphic deformation fields. Overall, DARC offers a flexible, generalizable, and resource-efficient framework for atlas construction and applications.

[159] From Diagnosis to Improvement: Probing Spatio-Physical Reasoning in Vision Language Models

Tiancheng Han,Yunfei Gao,Yong Li,Wuzhou Yu,Qiaosheng Zhang,Wenqi Shao

Main category: cs.CV

TL;DR: This paper studies the spatio-physical reasoning of vision language models, finds that they perform poorly, and proposes an improvement method, although generalization to new physics scenarios remains limited.

Details Motivation: Although recent vision language models have made notable progress in specialized domains such as multimodal mathematics and pure spatial understanding, their capability for spatio-physical reasoning remains largely unexplored. Method: Supervised fine-tuning followed by rule-based reinforcement learning is applied to Qwen2.5-VL-7B to improve its spatio-physical reasoning. Result: A comprehensive diagnostic analysis of mainstream vision language models reveals that current models perform poorly on this key task, largely because of biases caused by human-like priors and a lack of deep reasoning. Conclusion: Although the model's spatio-physical reasoning improves significantly, its generalization to new physics scenarios remains limited, underscoring the pressing need for new approaches in spatio-physical reasoning. Abstract: Spatio-physical reasoning, a foundational capability for understanding the real physical world, is a critical step towards building robust world models. While recent vision language models (VLMs) have shown remarkable progress in specialized domains like multimodal mathematics and pure spatial understanding, their capability for spatio-physical reasoning remains largely unexplored. This paper provides a comprehensive diagnostic analysis of mainstream VLMs, revealing that current models perform inadequately on this crucial task. Further detailed analysis shows that this underperformance is largely attributable to biases caused by human-like priors and a lack of deep reasoning. To address these challenges, we apply supervised fine-tuning followed by rule-based reinforcement learning to Qwen2.5-VL-7B, resulting in significant improvements in spatio-physical reasoning capabilities and surpassing leading proprietary models. Nevertheless, despite this success, the model's generalization to new physics scenarios remains limited -- underscoring the pressing need for new approaches in spatio-physical reasoning.

[160] AEGIS: Authenticity Evaluation Benchmark for AI-Generated Video Sequences

Jieyu Li,Xin Zhang,Joey Tianyi Zhou

Main category: cs.CV

TL;DR: AEGIS is a new large-scale benchmark for video authenticity detection, comprising over 10,000 real and AI-generated videos, that aims to advance detection techniques against real-world forgery threats.

Details Motivation: Existing video authenticity detection benchmarks are limited in realism, scale, and complexity, and cannot effectively evaluate modern vision-language models against sophisticated forgeries. Method: The authors build a dataset of over 10,000 real and synthetic videos covering a range of advanced generative models, with multimodal annotations. Result: Experiments show that existing models have limited detection capability on the most challenging subsets of AEGIS, highlighting that the dataset's complexity and realism exceed the generalization capabilities of current models. Conclusion: AEGIS is a new large-scale benchmark for detecting hyper-realistic and semantically nuanced AI-generated videos, intended to drive research on video authenticity detection methods. Abstract: Recent advances in AI-generated content have fueled the rise of highly realistic synthetic videos, posing severe risks to societal trust and digital integrity. Existing benchmarks for video authenticity detection typically suffer from limited realism, insufficient scale, and inadequate complexity, failing to effectively evaluate modern vision-language models against sophisticated forgeries. To address this critical gap, we introduce AEGIS, a novel large-scale benchmark explicitly targeting the detection of hyper-realistic and semantically nuanced AI-generated videos. AEGIS comprises over 10,000 rigorously curated real and synthetic videos generated by diverse, state-of-the-art generative models, including Stable Video Diffusion, CogVideoX-5B, KLing, and Sora, encompassing open-source and proprietary architectures. In particular, AEGIS features specially constructed challenging subsets enhanced with robustness evaluation. Furthermore, we provide multimodal annotations spanning Semantic-Authenticity Descriptions, Motion Features, and Low-level Visual Features, facilitating authenticity detection and supporting downstream tasks such as multimodal fusion and forgery localization. Extensive experiments using advanced vision-language models demonstrate limited detection capabilities on the most challenging subsets of AEGIS, highlighting the dataset's unique complexity and realism beyond the current generalization capabilities of existing models. 
In essence, AEGIS establishes an indispensable evaluation benchmark, fundamentally advancing research toward developing genuinely robust, reliable, broadly generalizable video authenticity detection methodologies capable of addressing real-world forgery threats. Our dataset is available on https://huggingface.co/datasets/Clarifiedfish/AEGIS.

[161] Video-BLADE: Block-Sparse Attention Meets Step Distillation for Efficient Video Generation

Youping Gu,Xiaolong Li,Yuhao Hu,Bohan Zhuang

Main category: cs.CV

TL;DR: BLADE is a novel framework that accelerates video generation in diffusion transformers by combining adaptive sparse attention and sparsity-aware distillation, achieving faster inference and improved video quality.

Details Motivation: Diffusion transformers face inference bottlenecks due to slow iterative denoising and high computational costs from quadratic attention. Existing acceleration methods like step distillation and sparse attention have limitations when used independently, such as suboptimal performance or high data requirements. Method: BLADE introduces two key components: (1) Adaptive Block-Sparse Attention (ASA) for dynamically generating content-aware sparsity masks, and (2) a sparsity-aware step distillation paradigm based on Trajectory Distribution Matching (TDM). These are integrated in a data-free training framework applied to text-to-video models like CogVideoX-5B and Wan2.1-1.3B. Result: BLADE achieves significant acceleration across different model scales—14.10x speedup on Wan2.1-1.3B and 8.89x on CogVideoX-5B—without sacrificing quality. It also improves video generation quality, with enhanced scores on VBench-2.0 and better human evaluation ratings. Conclusion: BLADE is a data-free joint training framework that effectively combines adaptive block-sparse attention and sparsity-aware step distillation to significantly accelerate inference in diffusion transformers while maintaining or improving video generation quality. Abstract: Diffusion transformers currently lead the field in high-quality video generation, but their slow iterative denoising process and prohibitive quadratic attention costs for long sequences create significant inference bottlenecks. While both step distillation and sparse attention mechanisms have shown promise as independent acceleration strategies, effectively combining these approaches presents critical challenges -- training-free integration yields suboptimal results, while separately training sparse attention after step distillation requires prohibitively expensive high-quality video data. 
To overcome these limitations, we propose BLADE, an innovative data-free joint training framework that introduces: (1) an Adaptive Block-Sparse Attention (ASA) mechanism for dynamically generating content-aware sparsity masks to focus computation on salient spatiotemporal features, and (2) a sparsity-aware step distillation paradigm built upon Trajectory Distribution Matching (TDM) that directly incorporates sparsity into the distillation process rather than treating it as a separate compression step, with fast convergence. We validate BLADE on text-to-video models like CogVideoX-5B and Wan2.1-1.3B. Our framework demonstrates remarkable efficiency gains across different scales. On Wan2.1-1.3B, BLADE achieves a 14.10x end-to-end inference acceleration over a 50-step baseline. Moreover, on models such as CogVideoX-5B with short video sequence lengths, our framework delivers a robust 8.89x speedup. Crucially, the acceleration is accompanied by a consistent quality improvement. On the VBench-2.0 benchmark, BLADE boosts the score of CogVideoX-5B to 0.569 (from 0.534) and Wan2.1-1.3B to 0.570 (from 0.563), results that are further corroborated by superior ratings in human evaluations. Our code and model weights are publicly available at: http://ziplab.co/BLADE-Homepage/.
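
To make the notion of a content-aware block-sparsity mask concrete, here is a minimal sketch. It is not BLADE's ASA mechanism, whose details are not given in the abstract: it assumes a simple scheme in which queries and keys are mean-pooled per block, block pairs are scored by dot product, and each query block keeps its top-scoring key blocks. The names `mean_pool`, `block_sparse_mask`, and the `keep` parameter are hypothetical.

```python
def mean_pool(vectors, block_size):
    """Average consecutive groups of block_size token vectors."""
    pooled = []
    for start in range(0, len(vectors), block_size):
        block = vectors[start:start + block_size]
        dim = len(block[0])
        pooled.append([sum(v[d] for v in block) / len(block) for d in range(dim)])
    return pooled


def block_sparse_mask(queries, keys, block_size, keep):
    """Boolean mask[i][j]: should query block i attend to key block j?

    Assumed heuristic: keep the `keep` key blocks with the highest
    pooled dot-product score for each query block.
    """
    q_blocks = mean_pool(queries, block_size)
    k_blocks = mean_pool(keys, block_size)
    mask = []
    for q in q_blocks:
        scores = [sum(a * b for a, b in zip(q, k)) for k in k_blocks]
        top = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)[:keep]
        mask.append([j in top for j in range(len(k_blocks))])
    return mask


# Four tokens of dimension 2, grouped into two blocks of two tokens each.
queries = [(1.0, 0.0), (1.0, 0.0), (0.0, 1.0), (0.0, 1.0)]
keys = [(2.0, 0.0), (2.0, 0.0), (0.0, 3.0), (0.0, 3.0)]
mask = block_sparse_mask(queries, keys, block_size=2, keep=1)
print(mask)
```

Attention would then be computed only for block pairs marked `True`, which is where the savings over dense quadratic attention come from; with `keep=1` out of two key blocks, half of the block pairs are skipped in this toy case.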

[162] Ultra-High-Definition Reference-Based Landmark Image Super-Resolution with Generative Diffusion Prior

Zhenning Shi,Zizheng Yan,Yuhang Yu,Clara Xue,Jingyu Zhuang,Qi Zhang,Jinwei Chen,Tao Li,Qingnan Fan

Main category: cs.CV

TL;DR: This paper proposes TriFlowSR, a new diffusion-based RefSR framework, and Landmark-4K, the first RefSR dataset for UHD landmark scenarios, addressing the limitations of existing methods in information alignment and image quality.

Details Motivation: Existing diffusion-based RefSR methods struggle to effectively align information between the LR image and the reference HR image, and existing RefSR datasets have limited resolution and poor image quality. Method: The authors design a Reference Matching Strategy and introduce Landmark-4K, the first RefSR dataset for Ultra-High-Definition (UHD) landmark scenarios. Result: Experiments show that, compared with previous methods, the approach makes better use of the semantic and texture information of the reference HR image. Conclusion: TriFlowSR is a new RefSR framework that effectively uses the semantic and texture information of the reference HR image for super-resolution of UHD landmark scenes. Abstract: Reference-based Image Super-Resolution (RefSR) aims to restore a low-resolution (LR) image by utilizing the semantic and texture information from an additional reference high-resolution (reference HR) image. Existing diffusion-based RefSR methods are typically built upon ControlNet, which struggles to effectively align the information between the LR image and the reference HR image. Moreover, current RefSR datasets suffer from limited resolution and poor image quality, resulting in the reference images lacking sufficient fine-grained details to support high-quality restoration. To overcome the limitations above, we propose TriFlowSR, a novel framework that explicitly achieves pattern matching between the LR image and the reference HR image. Meanwhile, we introduce Landmark-4K, the first RefSR dataset for Ultra-High-Definition (UHD) landmark scenarios. Considering the UHD scenarios with real-world degradation, in TriFlowSR, we design a Reference Matching Strategy to effectively match the LR image with the reference HR image. Experimental results show that our approach can better utilize the semantic and texture information of the reference HR image compared to previous methods. To the best of our knowledge, we propose the first diffusion-based RefSR pipeline for ultra-high definition landmark scenarios under real-world degradation. Our code and model will be available at https://github.com/nkicsl/TriFlowSR.

[163] Cooperative Face Liveness Detection from Optical Flow

Artem Sokolov,Mikhail Nikitin,Anton Konushin

Main category: cs.CV

TL;DR: A new cooperative video-based face liveness detection method is proposed, which uses a controlled face approaching protocol and optical flow analysis to improve the discrimination between real faces and presentation attacks.

Details Motivation: The motivation is to enhance face liveness detection by introducing a novel user interaction protocol that allows for more robust extraction of facial volume information, thereby improving the detection of presentation attacks. Method: The method involves a cooperative scenario where users slowly move their face towards the camera, followed by optical flow analysis and processing through a neural classifier that leverages spatial-temporal features. Result: The approach effectively improves the detection of genuine faces versus presentation attacks, including photos, screen displays, masks, and video replays, by leveraging both optical flows and RGB frames through a neural classifier. Conclusion: The proposed method for cooperative video-based face liveness detection significantly improves the discrimination between genuine faces and presentation attacks by utilizing a controlled face approaching protocol and optical flow analysis. Abstract: In this work, we propose a novel cooperative video-based face liveness detection method based on a new user interaction scenario where participants are instructed to slowly move their frontal-oriented face closer to the camera. This controlled face-approaching protocol, combined with optical flow analysis, represents the core innovation of our approach. By designing a system where users follow this specific movement pattern, we enable robust extraction of facial volume information through neural optical flow estimation, significantly improving discrimination between genuine faces and various presentation attacks (including printed photos, screen displays, masks, and video replays). Our method processes both the predicted optical flows and RGB frames through a neural classifier, effectively leveraging spatial-temporal features for more reliable liveness detection compared to passive methods.

[164] VasoMIM: Vascular Anatomy-Aware Masked Image Modeling for Vessel Segmentation

De-Xing Huang,Xiao-Hu Zhou,Mei-Jiang Gui,Xiao-Liang Xie,Shi-Qi Liu,Shuang-Yi Wang,Tian-Yu Xiang,Rui-Ze Ma,Nu-Fang Xiao,Zeng-Guang Hou

Main category: cs.CV

TL;DR: VasoMIM is a novel self-supervised learning framework for X-ray angiograms that improves vessel segmentation by incorporating anatomical knowledge into the pre-training process through an anatomy-guided masking strategy and anatomical consistency loss.

Details Motivation: Accurate vessel segmentation in X-ray angiograms is crucial for clinical applications, but annotated data is scarce, making self-supervised learning challenging due to class imbalance between vessel and background pixels. Method: VasoMIM introduces an anatomy-guided masking strategy and anatomical consistency loss to improve vascular representation learning in X-ray angiograms. Result: VasoMIM addresses the limitations of conventional masked image modeling by explicitly integrating anatomical knowledge, leading to better discriminability of vascular representations. Conclusion: VasoMIM achieves state-of-the-art performance across three datasets, highlighting its potential to facilitate X-ray angiogram analysis. Abstract: Accurate vessel segmentation in X-ray angiograms is crucial for numerous clinical applications. However, the scarcity of annotated data presents a significant challenge, which has driven the adoption of self-supervised learning (SSL) methods such as masked image modeling (MIM) to leverage large-scale unlabeled data for learning transferable representations. Unfortunately, conventional MIM often fails to capture vascular anatomy because of the severe class imbalance between vessel and background pixels, leading to weak vascular representations. To address this, we introduce Vascular anatomy-aware Masked Image Modeling (VasoMIM), a novel MIM framework tailored for X-ray angiograms that explicitly integrates anatomical knowledge into the pre-training process. Specifically, it comprises two complementary components: anatomy-guided masking strategy and anatomical consistency loss. The former preferentially masks vessel-containing patches to focus the model on reconstructing vessel-relevant regions. The latter enforces consistency in vascular semantics between the original and reconstructed images, thereby improving the discriminability of vascular representations. 
Empirically, VasoMIM achieves state-of-the-art performance across three datasets. These findings highlight its potential to facilitate X-ray angiogram analysis.
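
The anatomy-guided masking idea, preferentially masking vessel-containing patches, can be sketched as weighted sampling over patches. This is an illustrative assumption rather than VasoMIM's actual procedure: it presumes a binary vessel map is already available (the abstract does not say how vessel locations are obtained), samples patches without replacement using weights proportional to vessel fraction plus a small floor, and uses hypothetical names (`patch_vessel_fraction`, `anatomy_guided_mask`, `floor`).

```python
import random


def patch_vessel_fraction(vessel_map, patch):
    """Fraction of vessel pixels in each non-overlapping patch, in row-major order."""
    rows, cols = len(vessel_map), len(vessel_map[0])
    fractions = []
    for r in range(0, rows, patch):
        for c in range(0, cols, patch):
            pixels = [
                vessel_map[i][j]
                for i in range(r, min(r + patch, rows))
                for j in range(c, min(c + patch, cols))
            ]
            fractions.append(sum(pixels) / len(pixels))
    return fractions


def anatomy_guided_mask(fractions, mask_ratio, floor=0.05, seed=0):
    """Pick patch indices to mask, favoring patches with more vessel pixels.

    Weighted sampling without replacement via random keys u ** (1 / w);
    `floor` (assumed) keeps background patches selectable.
    """
    rng = random.Random(seed)
    weights = [f + floor for f in fractions]
    num_masked = round(mask_ratio * len(fractions))
    keys = [rng.random() ** (1.0 / w) for w in weights]
    ranked = sorted(range(len(weights)), key=lambda i: keys[i], reverse=True)
    return sorted(ranked[:num_masked])


# Toy 4x4 binary vessel map split into four 2x2 patches.
vessel_map = [
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 1],
]
fractions = patch_vessel_fraction(vessel_map, patch=2)
masked = anatomy_guided_mask(fractions, mask_ratio=0.5)
print(fractions, masked)
```

Over many random seeds, the all-vessel patch (index 0) is masked far more often than a pure-background patch, so reconstruction effort concentrates on vessel regions while background patches are still occasionally masked.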

[165] Object Fidelity Diffusion for Remote Sensing Image Generation

Ziqi Ye,Shuran Ma,Jie Yang,Xiaoyi Yang,Ziyang Gong,Xue Yang,Haipeng Wang

Main category: cs.CV

TL;DR: This paper proposes OF-Diff, a diffusion model for generating high-fidelity remote sensing images that uses prior shape extraction and a dual-branch diffusion model, generates high-quality images without real images at sampling time, and significantly improves detection performance on small and polymorphic objects.

Details Motivation: Existing diffusion models often fail to adequately capture morphological details when generating remote sensing images, producing low-fidelity images that may affect the robustness and reliability of object detection models. Method: OF-Diff extracts prior object shapes, introduces a dual-branch diffusion model with a diffusion consistency loss, and fine-tunes the diffusion process with DDPO to generate high-fidelity and diverse remote sensing images. Result: Experiments show that OF-Diff outperforms existing methods on key quality metrics; for example, mAP for airplanes, ships, and vehicles increases by 8.3%, 7.7%, and 4.0%, respectively. Conclusion: OF-Diff outperforms existing methods at generating high-quality remote sensing images, with significant gains on polymorphic and small object classes in particular. Abstract: High-precision controllable remote sensing image generation is both meaningful and challenging. Existing diffusion models often produce low-fidelity images due to their inability to adequately capture morphological details, which may affect the robustness and reliability of object detection models. To enhance the accuracy and fidelity of generated objects in remote sensing, this paper proposes Object Fidelity Diffusion (OF-Diff), which effectively improves the fidelity of generated objects. Specifically, we are the first to extract the prior shapes of objects based on the layout for diffusion models in remote sensing. Then, we introduce a dual-branch diffusion model with diffusion consistency loss, which can generate high-fidelity remote sensing images without providing real images during the sampling phase. Furthermore, we introduce DDPO to fine-tune the diffusion process, making the generated remote sensing images more diverse and semantically consistent. Comprehensive experiments demonstrate that OF-Diff outperforms state-of-the-art methods in remote sensing across key quality metrics. Notably, the performance of several polymorphic and small object classes shows significant improvement. For instance, the mAP increases by 8.3%, 7.7%, and 4.0% for airplanes, ships, and vehicles, respectively.

[166] Mobile-Friendly Deep Learning for Plant Disease Detection: A Lightweight CNN Benchmark Across 101 Classes of 33 Crops

Anand Kumar,Harminder Pal Monga,Tapasi Brahma,Satyam Kalra,Navas Sherif

Main category: cs.CV

TL;DR: This paper presents a mobile-friendly plant disease classification system using computer vision techniques, achieving high accuracy with EfficientNet-B1.

Details Motivation: Plant diseases pose a threat to global food security, and early detection systems are crucial. Computer vision advancements offer a potential solution for accurate and efficient detection. Method: A mobile-friendly solution was developed by combining various datasets (Plant Doc, PlantVillage, PlantWild) and evaluating lightweight architectures (MobileNetV2, MobileNetV3, MobileNetV3-Large, and EfficientNet-B0, B1) for classification accuracy and efficiency. Result: EfficientNet-B1 achieved the highest classification accuracy of 94.7% among the tested architectures, demonstrating a balance between accuracy and computational efficiency. Conclusion: The study concludes that EfficientNet-B1 is the most suitable architecture for classifying plant diseases due to its high accuracy and computational efficiency, making it ideal for deployment on mobile devices. Abstract: Plant diseases are a major threat to food security globally. It is important to develop early detection systems that can detect them accurately. Advances in computer vision techniques have the potential to solve this challenge. We have developed a mobile-friendly solution which can accurately classify 101 plant diseases across 33 crops. We built a comprehensive dataset by combining different datasets, Plant Doc, PlantVillage, and PlantWild, all of which serve the same purpose. We evaluated performance across several lightweight architectures - MobileNetV2, MobileNetV3, MobileNetV3-Large, and EfficientNet-B0, B1 - specifically chosen for their efficiency on resource-constrained devices. The results were promising, with EfficientNet-B1 delivering our best performance at 94.7% classification accuracy. This architecture struck an optimal balance between accuracy and computational efficiency, making it well-suited for real-world deployment on mobile devices.

[167] UI-Venus Technical Report: Building High-performance UI Agents with RFT

Zhangxuan Gu,Zhengwen Zeng,Zhenyu Xu,Xingran Zhou,Shuheng Shen,Yunfei Liu,Beitong Zhou,Changhua Meng,Tianyu Xia,Weizhi Chen,Yue Wen,Jingya Dou,Fei Tang,Jinzhen Lin,Yulin Liu,Zhenlin Guo,Yichen Gong,Heng Jia,Changlong Gao,Yuan Guo,Yong Deng,Zhenyu Guo,Liang Chen,Weiqiang Wang

Main category: cs.CV

TL;DR: UI-Venus is a UI agent built on a multimodal large language model that achieves state-of-the-art performance on UI grounding and navigation tasks and introduces a new training framework and optimization methods.

Details Motivation: To improve the performance of UI agents on UI grounding and navigation tasks with screenshot-only input, and to foster community research and development through open-sourcing. Method: UI-Venus is trained from Qwen2.5-VL via reinforcement fine-tuning (RFT), with carefully designed reward functions and the Self-Evolving Trajectory History Alignment & Sparse Action Enhancement method to improve navigation performance. Result: On the Screenspot-V2/Pro benchmarks, UI-Venus achieves 94.1%/50.8% (7B variant) and 95.3%/61.9% (72B variant); on the AndroidWorld navigation task it reaches success rates of 49.1% (7B variant) and 65.9% (72B variant), surpassing existing models. Conclusion: UI-Venus is a native UI agent based on a multimodal large language model that achieves state-of-the-art performance and advances research on UI navigation through its reward functions, data cleaning strategies, and self-evolving framework. Abstract: We present UI-Venus, a native UI agent that takes only screenshots as input based on a multimodal large language model. UI-Venus achieves SOTA performance on both UI grounding and navigation tasks using only several hundred thousand high-quality training samples through reinforcement fine-tuning (RFT) based on Qwen2.5-VL. Specifically, the 7B and 72B variants of UI-Venus obtain 94.1% / 50.8% and 95.3% / 61.9% on the standard grounding benchmarks, i.e., Screenspot-V2 / Pro, surpassing the previous SOTA baselines including open-source GTA1 and closed-source UI-TARS-1.5. To show UI-Venus's summary and planning ability, we also evaluate it on AndroidWorld, an online UI navigation arena, on which our 7B and 72B variants achieve 49.1% and 65.9% success rates, also beating existing models. To achieve this, we introduce carefully designed reward functions for both UI grounding and navigation tasks and corresponding efficient data cleaning strategies. To further boost navigation performance, we propose Self-Evolving Trajectory History Alignment & Sparse Action Enhancement, which refines historical reasoning traces and balances the distribution of sparse but critical actions, leading to more coherent planning and better generalization in complex UI tasks. Our contributions include the release of SOTA open-source UI agents, comprehensive data cleaning protocols and a novel self-evolving framework for improving navigation performance, which encourage further research and development in the community.
Code is available at https://github.com/antgroup/UI-Venus.

[168] Self-Supervised Stereo Matching with Multi-Baseline Contrastive Learning

Peng Xu,Zhiyu Xiang,Jingyun Fu,Tianyu Pu,Kai Wang,Chaojie Ji,Tingming Bai,Eryun Liu

Main category: cs.CV

TL;DR: This paper proposes BaCon-Stereo, a self-supervised stereo matching method built on a teacher-student framework that uses multi-baseline inputs and an occlusion-aware attention map to improve prediction in both occluded and non-occluded regions.

Details Motivation: To address the performance degradation of current self-supervised stereo matching methods in occluded regions, where correspondences are ill-posed. Method: Adopts a teacher-student paradigm with multi-baseline inputs and an occlusion-aware attention map, and synthesizes a multi-baseline dataset, BaCon-20k, to support training. Result: BaCon-Stereo improves prediction in both occluded and non-occluded regions and shows strong generalization and robustness. Conclusion: BaCon-Stereo is a contrastive learning framework for self-supervised stereo network training that aims to improve prediction in both occluded and non-occluded regions and outperforms state-of-the-art self-supervised methods on the KITTI benchmarks. Abstract: Current self-supervised stereo matching relies on the photometric consistency assumption, which breaks down in occluded regions due to ill-posed correspondences. To address this issue, we propose BaCon-Stereo, a simple yet effective contrastive learning framework for self-supervised stereo network training in both non-occluded and occluded regions. We adopt a teacher-student paradigm with multi-baseline inputs, in which the stereo pairs fed into the teacher and student share the same reference view but differ in target views. Geometrically, regions occluded in the student's target view are often visible in the teacher's, making it easier for the teacher to predict in these regions. The teacher's prediction is rescaled to match the student's baseline and then used to supervise the student. We also introduce an occlusion-aware attention map to better guide the student in learning occlusion completion. To support training, we synthesize a multi-baseline dataset BaCon-20k. Extensive experiments demonstrate that BaCon-Stereo improves prediction in both occluded and non-occluded regions, achieves strong generalization and robustness, and outperforms state-of-the-art self-supervised methods on both KITTI 2015 and 2012 benchmarks. Our code and dataset will be released upon paper acceptance.
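The baseline rescaling step mentioned in the BaCon-Stereo abstract can be illustrated with standard rectified stereo geometry, where disparity is d = f * B / Z and is therefore proportional to the baseline B at a fixed depth. The following is a minimal sketch under that assumption only; the function name, the baseline values, and the tiny disparity map are hypothetical and do not come from the paper, whose actual rescaling may differ in detail.

```python
def rescale_disparity(teacher_disp, teacher_baseline, student_baseline):
    """Rescale a teacher disparity map to the student's baseline.

    Assumes rectified pairs that share the same reference view and focal
    length, so disparity d = f * B / Z scales linearly with the baseline B.
    """
    if teacher_baseline <= 0 or student_baseline <= 0:
        raise ValueError("baselines must be positive")
    scale = student_baseline / teacher_baseline
    return [[d * scale for d in row] for row in teacher_disp]


# Hypothetical example: the teacher pair has twice the student's baseline
# (arbitrary units), so every teacher disparity is halved.
teacher_disp = [[8.0, 12.0], [20.0, 4.0]]
student_target = rescale_disparity(teacher_disp, teacher_baseline=1.0, student_baseline=0.5)
```

The rescaled map could then serve as a pseudo-label for the student in regions the student cannot see in its own target view.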

[169] Generalizable Federated Learning using Client Adaptive Focal Modulation

Tajamul Ashraf,Iqra Altaf Gillani

Main category: cs.CV

TL;DR: AdaptFED improves focal modulation in federated learning, using a task-aware personalization strategy and low-rank hypernetwork conditioning to make models more adaptable and scalable in non-IID and cross-task settings.

Details Motivation: To improve the generalization and personalized adaptation of federated learning models on non-IID and cross-domain data while reducing server-client communication overhead. Method: Building on TransFed, AdaptFED introduces task-aware client embeddings to further personalize modulation dynamics, reduces communication overhead via low-rank hypernetwork conditioning, and provides enhanced theoretical bounds on adaptation performance. Result: Experiments on eight diverse datasets show that AdaptFED outperforms state-of-the-art methods in source-free and cross-task federated setups, particularly in scalability and adaptability. Conclusion: Through an enhanced focal modulation strategy and improved communication efficiency, AdaptFED offers a path toward more adaptive, scalable, and generalizable federated learning systems. Abstract: Federated learning (FL) has proven essential for privacy-preserving, collaborative training across distributed clients. Our prior work, TransFed, introduced a robust transformer-based FL framework that leverages a learn-to-adapt hypernetwork to generate personalized focal modulation layers per client, outperforming traditional methods in non-IID and cross-domain settings. In this extended version, we propose AdaptFED, where we deepen the investigation of focal modulation in generalizable FL by incorporating: (1) a refined adaptation strategy that integrates task-aware client embeddings to personalize modulation dynamics further, (2) enhanced theoretical bounds on adaptation performance, and (3) broader empirical validation across additional modalities, including time-series and multilingual data. We also introduce an efficient variant of TransFed that reduces server-client communication overhead via low-rank hypernetwork conditioning, enabling scalable deployment in resource-constrained environments. Extensive experiments on eight diverse datasets reaffirm the superiority of our method over state-of-the-art baselines, particularly in source-free and cross-task federated setups. Our findings not only extend the capabilities of focal modulation in FL but also pave the way for more adaptive, scalable, and generalizable transformer-based federated systems. The code is available at http://github.com/Tajamul21/TransFed

[170] Hierarchical Fine-grained Preference Optimization for Physically Plausible Video Generation

Harold Haodong Chen,Haojian Huang,Qifeng Chen,Harry Yang,Ser-Nam Lim

Main category: cs.CV

TL;DR: PhysHPO improves the realism and quality of video generation through multi-level optimization and data selection.

Details Motivation: Generating videos that obey the laws of physics is critical for realism and accuracy. Method: Proposes the PhysHPO framework, comprising optimization at four hierarchical levels and an automated data selection pipeline. Result: PhysHPO significantly improves physical plausibility and video generation quality on multiple benchmarks. Conclusion: PhysHPO enables more realistic, physically plausible video generation and offers a new paradigm for the field. Abstract: Recent advancements in video generation have enabled the creation of high-quality, visually compelling videos. However, generating videos that adhere to the laws of physics remains a critical challenge for applications requiring realism and accuracy. In this work, we propose PhysHPO, a novel framework for Hierarchical Cross-Modal Direct Preference Optimization, to tackle this challenge by enabling fine-grained preference alignment for physically plausible video generation. PhysHPO optimizes video alignment across four hierarchical granularities: a) Instance Level, aligning the overall video content with the input prompt; b) State Level, ensuring temporal consistency using boundary frames as anchors; c) Motion Level, modeling motion trajectories for realistic dynamics; and d) Semantic Level, maintaining logical consistency between narrative and visuals. Recognizing that real-world videos are the best reflections of physical phenomena, we further introduce an automated data selection pipeline to efficiently identify and utilize "good data" from existing large-scale text-video datasets, thereby eliminating the need for costly and time-intensive dataset construction. Extensive experiments on both physics-focused and general capability benchmarks demonstrate that PhysHPO significantly improves physical plausibility and overall video generation quality of advanced models. To the best of our knowledge, this is the first work to explore fine-grained preference alignment and data selection for video generation, paving the way for more realistic and human-preferred video generation paradigms.

[171] Performance of GPT-5 in Brain Tumor MRI Reasoning

Mojtaba Safari,Shansong Wang,Mingzhe Hu,Zach Eidex,Qiang Li,Xiaofeng Yang

Main category: cs.CV

TL;DR: This study evaluated the performance of GPT-4o, GPT-5-nano, GPT-5-mini, and GPT-5 in brain tumor differentiation using MRI-based VQA tasks. While the GPT-5 family models showed moderate accuracy, their performance is not yet suitable for clinical use.

Details Motivation: Accurate differentiation of brain tumor types using MRI is crucial for treatment planning in neuro-oncology. Recent advances in large language models (LLMs) have enabled visual question answering (VQA) approaches that combine image interpretation with natural language reasoning, prompting the evaluation of GPT models in this context. Method: The study evaluated GPT-4o, GPT-5-nano, GPT-5-mini, and GPT-5 using a curated brain tumor VQA benchmark derived from three BraTS datasets. Each case involved multi-sequence MRI triplanar mosaics and structured clinical features transformed into standardized VQA items. The models were assessed in a zero-shot chain-of-thought setting for accuracy on visual and reasoning tasks. Result: GPT-5-mini achieved the highest macro-average accuracy (44.19%), followed by GPT-5 (43.71%), GPT-4o (41.49%), and GPT-5-nano (35.85%). Performance varied across tumor subtypes, with no single model dominating all cohorts. Conclusion: The study concluded that the GPT-5 family models can achieve moderate accuracy in structured neuro-oncological VQA tasks, but their performance is not yet sufficient for clinical applications. Abstract: Accurate differentiation of brain tumor types on magnetic resonance imaging (MRI) is critical for guiding treatment planning in neuro-oncology. Recent advances in large language models (LLMs) have enabled visual question answering (VQA) approaches that integrate image interpretation with natural language reasoning. In this study, we evaluated GPT-4o, GPT-5-nano, GPT-5-mini, and GPT-5 on a curated brain tumor VQA benchmark derived from 3 Brain Tumor Segmentation (BraTS) datasets - glioblastoma (GLI), meningioma (MEN), and brain metastases (MET). Each case included multi-sequence MRI triplanar mosaics and structured clinical features transformed into standardized VQA items. Models were assessed in a zero-shot chain-of-thought setting for accuracy on both visual and reasoning tasks. 
Results showed that GPT-5-mini achieved the highest macro-average accuracy (44.19%), followed by GPT-5 (43.71%), GPT-4o (41.49%), and GPT-5-nano (35.85%). Performance varied by tumor subtype, with no single model dominating across all cohorts. These findings suggest that GPT-5 family models can achieve moderate accuracy in structured neuro-oncological VQA tasks, but not at a level acceptable for clinical use.
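To make the reported metric concrete: macro-average accuracy is the unweighted mean of per-group accuracies, so each group counts equally regardless of its size. The sketch below assumes the groups are the three tumor cohorts (GLI, MEN, MET), which the abstract does not state explicitly, and uses purely hypothetical counts that are not results from the study.

```python
def macro_average_accuracy(per_cohort):
    """Unweighted mean of per-cohort accuracies.

    per_cohort maps a cohort name to (num_correct, num_total).
    """
    accuracies = [correct / total for correct, total in per_cohort.values()]
    return sum(accuracies) / len(accuracies)


def micro_average_accuracy(per_cohort):
    """Accuracy pooled over all items, so larger cohorts weigh more."""
    correct = sum(c for c, _ in per_cohort.values())
    total = sum(t for _, t in per_cohort.values())
    return correct / total


# Hypothetical counts for illustration only (not taken from the paper).
counts = {"GLI": (50, 100), "MEN": (30, 50), "MET": (10, 50)}
macro = macro_average_accuracy(counts)  # mean of 0.5, 0.6 and 0.2
micro = micro_average_accuracy(counts)  # 90 correct out of 200
```

The two averages diverge whenever cohort sizes differ, which is why the choice of averaging matters when comparing models across tumor subtypes.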

[172] TexVerse: A Universe of 3D Objects with High-Resolution Textures

Yibo Zhang,Li Zhang,Rui Ma,Nan Cao

Main category: cs.CV

TL;DR: TexVerse is a large-scale 3D dataset with high-resolution textures that fills the gap left by existing datasets for end-to-end high-resolution texture generation and has wide-ranging potential applications.

Details Motivation: Existing large-scale 3D datasets focus mainly on geometry generation; high-resolution texture generation remains underexplored for lack of suitable datasets. Method: TexVerse collects over 858K unique high-resolution 3D models from Sketchfab, including more than 158K models with PBR materials, and provides detailed annotations and specialized subsets such as TexVerse-Skeleton and TexVerse-Animation. Result: TexVerse contains 1.6M 3D instances in total, covering high-resolution textures, PBR materials, and skeleton and animation data, with detailed model annotations. Conclusion: TexVerse provides a high-quality data resource for texture synthesis, PBR material development, animation, and various 3D vision and graphics tasks. Abstract: We introduce TexVerse, a large-scale 3D dataset featuring high-resolution textures. While recent advances in large-scale 3D datasets have enhanced high-resolution geometry generation, creating high-resolution textures end-to-end remains underexplored due to the lack of suitable datasets. TexVerse fills this gap with a curated collection of over 858K unique high-resolution 3D models sourced from Sketchfab, including more than 158K models with physically based rendering (PBR) materials. Each model encompasses all of its high-resolution variants, bringing the total to 1.6M 3D instances. TexVerse also includes specialized subsets: TexVerse-Skeleton, with 69K rigged models, and TexVerse-Animation, with 54K animated models, both preserving original skeleton and animation data uploaded by the user. We also provide detailed model annotations describing overall characteristics, structural components, and intricate features. TexVerse offers a high-quality data resource with wide-ranging potential applications in texture synthesis, PBR material development, animation, and various 3D vision and graphics tasks.

[173] Medico 2025: Visual Question Answering for Gastrointestinal Imaging

Sushant Gautam,Vajira Thambawita,Michael Riegler,Pål Halvorsen,Steven Hicks

Main category: cs.CV

TL;DR: The Medico 2025 challenge focuses on developing Explainable AI models for answering clinical questions from GI images, using a large dataset and evaluating both performance and explainability to advance trustworthy AI in medicine.

Details Motivation: To develop and benchmark models that not only answer clinical questions accurately but also provide interpretable justifications aligned with medical reasoning, thus enhancing trust in AI for medical applications. Method: The challenge introduces two subtasks using the Kvasir-VQA-x1 dataset: answering visual questions and generating multimodal explanations. It evaluates models based on quantitative performance metrics and expert-reviewed explainability. Result: The creation of the Kvasir-VQA-x1 dataset with 6,500 images and 159,549 question-answer pairs serves as a benchmark. The challenge promotes the development of models that combine accurate visual question answering with explainability. Conclusion: The Medico 2025 challenge aims to advance trustworthy AI in medical image analysis by focusing on the development of Explainable Artificial Intelligence models for Visual Question Answering in Gastrointestinal imaging. Abstract: The Medico 2025 challenge addresses Visual Question Answering (VQA) for Gastrointestinal (GI) imaging, organized as part of the MediaEval task series. The challenge focuses on developing Explainable Artificial Intelligence (XAI) models that answer clinically relevant questions based on GI endoscopy images while providing interpretable justifications aligned with medical reasoning. It introduces two subtasks: (1) answering diverse types of visual questions using the Kvasir-VQA-x1 dataset, and (2) generating multimodal explanations to support clinical decision-making. The Kvasir-VQA-x1 dataset, created from 6,500 images and 159,549 complex question-answer (QA) pairs, serves as the benchmark for the challenge. By combining quantitative performance metrics and expert-reviewed explainability assessments, this task aims to advance trustworthy Artificial Intelligence (AI) in medical image analysis. 
Instructions, data access, and an updated guide for participation are available in the official competition repository: https://github.com/simula/MediaEval-Medico-2025

[174] ToonComposer: Streamlining Cartoon Production with Generative Post-Keyframing

Lingen Li,Guangzhi Wang,Zhaoyang Zhang,Yaowei Li,Xiaoyu Li,Qi Dou,Jinwei Gu,Tianfan Xue,Ying Shan

Main category: cs.CV

TL;DR: ToonComposer is a new AI tool that simplifies cartoon production by combining inbetweening and colorization, reducing manual work and improving motion control.

Details Motivation: Traditional cartoon and anime production is labor-intensive, and existing AI methods often handle key stages separately, leading to errors and artifacts. Method: ToonComposer combines inbetweening and colorization into a single post-keyframing stage using a sparse sketch injection mechanism and a cartoon adaptation method with a spatial low-rank adapter. Result: ToonComposer outperforms existing methods in visual quality, motion consistency, and production efficiency, as demonstrated by the PKBench benchmark. Conclusion: ToonComposer provides a more efficient and flexible solution for AI-assisted cartoon production by integrating inbetweening and colorization, reducing manual workload and improving motion control. Abstract: Traditional cartoon and anime production involves keyframing, inbetweening, and colorization stages, which require intensive manual effort. Despite recent advances in AI, existing methods often handle these stages separately, leading to error accumulation and artifacts. For instance, inbetweening approaches struggle with large motions, while colorization methods require dense per-frame sketches. To address this, we introduce ToonComposer, a generative model that unifies inbetweening and colorization into a single post-keyframing stage. ToonComposer employs a sparse sketch injection mechanism to provide precise control using keyframe sketches. Additionally, it uses a cartoon adaptation method with the spatial low-rank adapter to tailor a modern video foundation model to the cartoon domain while keeping its temporal prior intact. Requiring as few as a single sketch and a colored reference frame, ToonComposer excels with sparse inputs, while also supporting multiple sketches at any temporal location for more precise motion control. This dual capability reduces manual workload and improves flexibility, empowering artists in real-world scenarios. 
To evaluate our model, we further created PKBench, a benchmark featuring human-drawn sketches that simulate real-world use cases. Our evaluation demonstrates that ToonComposer outperforms existing methods in visual quality, motion consistency, and production efficiency, offering a superior and more flexible solution for AI-assisted cartoon production.

[175] STream3R: Scalable Sequential 3D Reconstruction with Causal Transformer

Yushi Lan,Yihang Luo,Fangzhou Hong,Shangchen Zhou,Honghua Chen,Zhaoyang Lyu,Shuai Yang,Bo Dai,Chen Change Loy,Xingang Pan

Main category: cs.CV

TL;DR: STream3R is a new 3D reconstruction method that uses a causal Transformer to process image sequences efficiently, making it suitable for real-time and dynamic scenes.

Details Motivation: Existing multi-view reconstruction methods either depend on expensive global optimization or rely on simplistic memory mechanisms, and they struggle with long sequences and dynamic scenes. Method: Reformulates pointmap prediction as a decoder-only Transformer problem and introduces a streaming framework based on causal attention to process image sequences efficiently. Result: STream3R outperforms prior methods on both static and dynamic scene benchmarks and is compatible with LLM-style training infrastructure, enabling efficient pretraining and fine-tuning. Conclusion: STream3R achieves efficient online 3D reconstruction with a causal Transformer, paving the way for real-time 3D understanding in streaming environments. Abstract: We present STream3R, a novel approach to 3D reconstruction that reformulates pointmap prediction as a decoder-only Transformer problem. Existing state-of-the-art methods for multi-view reconstruction either depend on expensive global optimization or rely on simplistic memory mechanisms that scale poorly with sequence length. In contrast, STream3R introduces a streaming framework that processes image sequences efficiently using causal attention, inspired by advances in modern language modeling. By learning geometric priors from large-scale 3D datasets, STream3R generalizes well to diverse and challenging scenarios, including dynamic scenes where traditional methods often fail. Extensive experiments show that our method consistently outperforms prior work across both static and dynamic scene benchmarks. Moreover, STream3R is inherently compatible with LLM-style training infrastructure, enabling efficient large-scale pretraining and fine-tuning for various downstream 3D tasks. Our results underscore the potential of causal Transformer models for online 3D perception, paving the way for real-time 3D understanding in streaming environments. More details can be found on our project page: https://nirvanalan.github.io/projects/stream3r.
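The causal attention that the STream3R abstract refers to can be illustrated generically: each position attends only to itself and earlier positions, so outputs for past frames never change when a new frame arrives. The sketch below is a single-head, token-level version in NumPy with one hypothetical token per frame and no learned projections; it is not the paper's architecture, which operates on image tokens and may mask at the frame level.

```python
import numpy as np


def causal_self_attention(q, k, v):
    """Single-head attention where position i attends only to positions 0..i."""
    n, d = q.shape
    scores = q @ k.T / np.sqrt(d)
    # True strictly above the diagonal marks "future" positions to hide.
    future = np.triu(np.ones((n, n), dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)
    # Numerically stable softmax; masked entries become exactly zero.
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights


rng = np.random.default_rng(0)
frames = rng.normal(size=(4, 8))  # 4 hypothetical frame tokens, dimension 8
out, weights = causal_self_attention(frames, frames, frames)
```

Because row i of `weights` is zero beyond column i, keys and values of earlier frames can be cached and reused as the sequence grows, which is the property that makes streaming inference practical.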

[176] MAESTRO: Masked AutoEncoders for Multimodal, Multitemporal, and Multispectral Earth Observation Data

Antoine Labatie,Michael Vaccaro,Nina Lardiere,Anatol Garioud,Nicolas Gonthier

Main category: cs.CV

TL;DR: This paper proposes MAESTRO, a self-supervised learning method for remote sensing data that performs strongly on multitemporal tasks.

Details Motivation: Remote sensing data is multimodal, multitemporal, and multispectral, and standard self-supervised methods must be adapted to these characteristics. Method: Proposes MAESTRO, an adapted Masked Autoencoder with optimized fusion strategies and a tailored target normalization scheme. Result: Evaluated on four remote sensing datasets, MAESTRO sets a new state of the art on tasks that rely strongly on multitemporal dynamics. Conclusion: MAESTRO excels on tasks that depend on multitemporal dynamics while remaining competitive on tasks dominated by a single mono-temporal modality. Abstract: Self-supervised learning holds great promise for remote sensing, but standard self-supervised methods must be adapted to the unique characteristics of Earth observation data. We take a step in this direction by conducting a comprehensive benchmark of fusion strategies and reconstruction target normalization schemes for multimodal, multitemporal, and multispectral Earth observation data. Based on our findings, we propose MAESTRO, a novel adaptation of the Masked Autoencoder, featuring optimized fusion strategies and a tailored target normalization scheme that introduces a spectral prior as a self-supervisory signal. Evaluated on four Earth observation datasets, MAESTRO sets a new state-of-the-art on tasks that strongly rely on multitemporal dynamics, while remaining highly competitive on tasks dominated by a single mono-temporal modality. Code to reproduce all our experiments is available at https://github.com/ignf/maestro.

[177] ESSENTIAL: Episodic and Semantic Memory Integration for Video Class-Incremental Learning

Jongseo Lee,Kyungho Bae,Kyle Min,Gyeong-Moon Park,Jinwoo Choi

Main category: cs.CV

TL;DR: This paper proposes ESSENTIAL, a method for video class-incremental learning that balances memory efficiency and performance through integration of episodic memory and semantic prompts with cross-attention.

Details Motivation: To address the trade-off between memory efficiency and performance in video class-incremental learning. Method: The proposed ESSENTIAL method integrates episodic memory and semantic prompts through a novel memory retrieval module using cross-attention. Result: The method was validated on diverse datasets including UCF-101, HMDB51, Something-Something-V2, ActivityNet, and Kinetics-400. Conclusion: ESSENTIAL achieves favorable performance on benchmarks while significantly reducing memory usage. Abstract: In this work, we tackle the problem of video class-incremental learning (VCIL). Many existing VCIL methods mitigate catastrophic forgetting by rehearsal training with a few temporally dense samples stored in episodic memory, which is memory-inefficient. Alternatively, some methods store temporally sparse samples, sacrificing essential temporal information and thereby resulting in inferior performance. To address this trade-off between memory-efficiency and performance, we propose EpiSodic and SEmaNTIc memory integrAtion for video class-incremental Learning (ESSENTIAL). ESSENTIAL consists of episodic memory for storing temporally sparse features and semantic memory for storing general knowledge represented by learnable prompts. We introduce a novel memory retrieval (MR) module that integrates episodic memory and semantic prompts through cross-attention, enabling the retrieval of temporally dense features from temporally sparse features. We rigorously validate ESSENTIAL on diverse datasets: UCF-101, HMDB51, and Something-Something-V2 from the TCD benchmark and UCF-101, ActivityNet, and Kinetics-400 from the vCLIMB benchmark. Remarkably, with significantly reduced memory, ESSENTIAL achieves favorable performance on the benchmarks.

[178] Human-in-Context: Unified Cross-Domain 3D Human Motion Modeling via In-Context Learning

Mengyuan Liu,Xinshun Wang,Zhongbin Fang,Deheng Ye,Xia Li,Tao Tang,Songtao Wu,Xiangtai Li,Ming-Hsuan Yang

Main category: cs.CV

TL;DR: This paper proposes HiC, a unified cross-domain 3D human motion modeling method that combines pose and mesh representations, improves generalization across modalities, tasks, and datasets, and is validated experimentally.

Details Motivation: Existing cross-domain models typically rely on domain-specific components and multi-stage training, which limits their practicality and scalability. This paper proposes a new way to train a unified cross-domain model that eliminates the need for both. Method: The authors first propose Pose-in-Context (PiC), which uses in-context learning to build a pose-centric cross-domain model. They then extend PiC to Human-in-Context (HiC), which combines pose and mesh representations and introduces a max-min similarity prompt sampling strategy and a dual-branch context injection network architecture to improve cross-domain generalization and the handling of contextual dependencies. Result: Experiments show that HiC outperforms PiC in cross-domain generalization, data scale, and performance, demonstrating its potential for building a unified cross-domain 3D human motion model. Conclusion: HiC offers a flexible and scalable solution for cross-domain 3D human motion modeling, handling multiple modalities, tasks, and datasets within a unified framework. Abstract: This paper aims to model 3D human motion across domains, where a single model is expected to handle multiple modalities, tasks, and datasets. Existing cross-domain models often rely on domain-specific components and multi-stage training, which limits their practicality and scalability. To overcome these challenges, we propose a new setting to train a unified cross-domain model through a single process, eliminating the need for domain-specific components and multi-stage training. We first introduce Pose-in-Context (PiC), which leverages in-context learning to create a pose-centric cross-domain model. While PiC generalizes across multiple pose-based tasks and datasets, it encounters difficulties with modality diversity, prompting strategy, and contextual dependency handling. We thus propose Human-in-Context (HiC), an extension of PiC that broadens generalization across modalities, tasks, and datasets. HiC combines pose and mesh representations within a unified framework, expands task coverage, and incorporates larger-scale datasets. Additionally, HiC introduces a max-min similarity prompt sampling strategy to enhance generalization across diverse domains and a network architecture with dual-branch context injection for improved handling of contextual dependencies. Extensive experimental results show that HiC performs better than PiC in terms of generalization, data scale, and performance across a wide range of domains. These results demonstrate the potential of HiC for building a unified cross-domain 3D human motion model with improved flexibility and scalability.
The source codes and models are available at https://github.com/BradleyWang0416/Human-in-Context.

[179] Puppeteer: Rig and Animate Your 3D Models

Chaoyue Song,Xiu Li,Fan Yang,Zhongcong Xu,Jiacheng Wei,Fayao Liu,Jiashi Feng,Guosheng Lin,Jianfeng Zhang

Main category: cs.CV

TL;DR: Puppeteer is a framework for automatic rigging and animation of 3D models that improves quality and efficiency through transformer-based skeleton prediction, attention-based skinning, and an optimization-based animation pipeline.

Details Motivation: Turning static 3D models into animated assets depends on expert intervention and is a bottleneck in content creation, while progress in generative AI for static models motivates automating rigging and animation. Method: The framework comprises an auto-regressive transformer for skeletal structure prediction, using a joint-based tokenization strategy and hierarchical ordering; an attention-based, topology-aware architecture for inferring skinning weights; and an efficient animation pipeline based on differentiable optimization. Result: Puppeteer significantly outperforms existing techniques on multiple benchmarks in skeletal prediction accuracy and skinning quality, and efficiently generates stable, high-fidelity animations while eliminating the jittering common in existing methods. Conclusion: Puppeteer is a comprehensive framework that addresses the bottleneck in 3D content creation through automatic rigging and animation, handling diverse 3D objects and producing high-quality, temporally coherent animations. Abstract: Modern interactive applications increasingly demand dynamic 3D content, yet the transformation of static 3D models into animated assets constitutes a significant bottleneck in content creation pipelines. While recent advances in generative AI have revolutionized static 3D model creation, rigging and animation continue to depend heavily on expert intervention. We present Puppeteer, a comprehensive framework that addresses both automatic rigging and animation for diverse 3D objects. Our system first predicts plausible skeletal structures via an auto-regressive transformer that introduces a joint-based tokenization strategy for compact representation and a hierarchical ordering methodology with stochastic perturbation that enhances bidirectional learning capabilities. It then infers skinning weights via an attention-based architecture incorporating topology-aware joint attention that explicitly encodes inter-joint relationships based on skeletal graph distances. Finally, we complement these rigging advances with a differentiable optimization-based animation pipeline that generates stable, high-fidelity animations while being computationally more efficient than existing approaches. Extensive evaluations across multiple benchmarks demonstrate that our method significantly outperforms state-of-the-art techniques in both skeletal prediction accuracy and skinning quality. The system robustly processes diverse 3D content, ranging from professionally designed game assets to AI-generated shapes, producing temporally coherent animations that eliminate the jittering issues common in existing methods.
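The Puppeteer abstract mentions joint attention that encodes inter-joint relationships based on skeletal graph distances. One plausible ingredient, sketched here as an assumption rather than the authors' implementation, is a table of pairwise hop distances between joints computed by breadth-first search over the skeleton tree. The parent-array representation and the five-joint skeleton are hypothetical, and how such distances would enter the attention computation is not specified in the abstract.

```python
from collections import deque


def skeleton_graph_distances(parents):
    """Pairwise hop distances between joints of a skeleton tree.

    parents[i] is the parent index of joint i, or -1 for the root.
    """
    n = len(parents)
    adjacency = [[] for _ in range(n)]
    for child, parent in enumerate(parents):
        if parent >= 0:
            adjacency[child].append(parent)
            adjacency[parent].append(child)

    distances = [[-1] * n for _ in range(n)]
    for source in range(n):
        distances[source][source] = 0
        queue = deque([source])
        while queue:  # breadth-first search from each joint
            joint = queue.popleft()
            for neighbor in adjacency[joint]:
                if distances[source][neighbor] < 0:
                    distances[source][neighbor] = distances[source][joint] + 1
                    queue.append(neighbor)
    return distances


# Hypothetical skeleton: joint 0 is the root, 1 and 2 are its children,
# 3 is a child of 1, and 4 is a child of 2.
parents = [-1, 0, 0, 1, 2]
distances = skeleton_graph_distances(parents)
```

A distance table like this could, for example, be mapped to a learned bias per hop count, but that usage is an illustration and not a claim about the paper.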

[180] Quantum Visual Fields with Neural Amplitude Encoding

Shuteng Wang,Christian Theobalt,Vladislav Golyanik

Main category: cs.CV

TL;DR: This paper introduces Quantum Visual Field (QVF), a new type of Quantum Implicit Neural Representation (QINR) for 2D image and 3D geometric field learning that directly employs projective measurement to extract learned signals encoded in the ansatz. QVF outperforms existing quantum approaches and classical foundational baselines in visual representation accuracy.

Details Motivation: Many challenges concerning the architecture and ansatz design, the utility of quantum-mechanical properties, training efficiency, and the interplay with classical modules remain in Quantum Implicit Neural Representations (QINRs). This paper aims to advance the field by introducing a new type of QINR for 2D image and 3D geometric field learning. Method: QVF encodes classical data into quantum statevectors using neural amplitude encoding grounded in a learnable energy manifold. It follows a fully entangled design of learnable parametrised quantum circuits, with quantum operations performed in the real Hilbert space. Result: Experiments on a quantum hardware simulator demonstrated that QVF outperforms existing quantum approaches and classical foundational baselines in terms of visual representation accuracy. It showed superiority in learning high-frequency details and ensured meaningful Hilbert space embeddings. Conclusion: QVF is a new type of QINR that directly employs projective measurement to extract learned signals encoded in the ansatz and does not rely on classical post-processing. It has practical potential in 2D and 3D field completion and 3D shape interpolation. Abstract: Quantum Implicit Neural Representations (QINRs) include components for learning and execution on gate-based quantum computers. While QINRs recently emerged as a promising new paradigm, many challenges concerning their architecture and ansatz design, the utility of quantum-mechanical properties, training efficiency and the interplay with classical modules remain. This paper advances the field by introducing a new type of QINR for 2D image and 3D geometric field learning, which we collectively refer to as Quantum Visual Field (QVF). QVF encodes classical data into quantum statevectors using neural amplitude encoding grounded in a learnable energy manifold, ensuring meaningful Hilbert space embeddings. 
Our ansatz follows a fully entangled design of learnable parametrised quantum circuits, with quantum (unitary) operations performed in the real Hilbert space, resulting in numerically stable training with fast convergence. QVF does not rely on classical post-processing -- in contrast to the previous QINR learning approach -- and directly employs projective measurement to extract learned signals encoded in the ansatz. Experiments on a quantum hardware simulator demonstrate that QVF outperforms the existing quantum approach and widely used classical foundational baselines in terms of visual representation accuracy across various metrics and model characteristics, such as learning of high-frequency details. We also show applications of QVF in 2D and 3D field completion and 3D shape interpolation, highlighting its practical potential.
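For readers unfamiliar with amplitude encoding, the basic idea is to store a real data vector in the amplitudes of a quantum statevector, which requires unit norm and a length that is a power of two; projective measurement in the computational basis then yields each basis state with probability equal to the squared amplitude. The sketch below shows only this plain, classical-simulation version with hypothetical inputs. It is not the paper's neural amplitude encoding, which is learned and grounded in an energy manifold.

```python
import numpy as np


def amplitude_encode(values):
    """Map a real vector to a unit-norm statevector, zero-padded to length 2**n."""
    x = np.asarray(values, dtype=float)
    n_qubits = max(1, (len(x) - 1).bit_length())
    padded = np.zeros(2 ** n_qubits)
    padded[: len(x)] = x
    norm = np.linalg.norm(padded)
    if norm == 0:
        raise ValueError("cannot encode the zero vector")
    return padded / norm


def measurement_probabilities(state):
    """Born-rule probabilities of computational-basis outcomes (real amplitudes)."""
    return state ** 2


# Hypothetical two-element signal, which fits on a single qubit.
state = amplitude_encode([3.0, 4.0])
probs = measurement_probabilities(state)
```

Here the state has amplitudes 0.6 and 0.8, so the two outcomes occur with probabilities 0.36 and 0.64, which sum to one as required.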