Table of Contents
cs.CL
[1] Bridging AI Innovation and Healthcare Needs: Lessons Learned from Incorporating Modern NLP at The BC Cancer Registry
Lovedeep Gondara,Gregory Arbour,Raymond Ng,Jonathan Simkin,Shebnum Devji
Main category: cs.CL
TL;DR: This paper shares key lessons learned from implementing NLP models in healthcare settings, emphasizing the importance of aligning technical solutions with business objectives, interdisciplinary collaboration, pragmatic model selection, data quality, error mitigation, and organizational AI literacy.
Details
Motivation: The motivation of the paper is to share practical insights and lessons learned from implementing NLP models in a real-world healthcare setting, with the aim of guiding other healthcare organizations in successfully adopting AI/NLP solutions for improved data management and patient care. Method: The paper draws upon the authors' experience implementing various NLP models for information extraction and classification tasks at the British Columbia Cancer Registry (BCCR), sharing key lessons learned throughout the project lifecycle. Result: The paper identifies several key factors for successful NLP implementation in healthcare, including problem definition aligned with business goals, iterative development, interdisciplinary collaboration, pragmatic model selection, data quality assurance, error mitigation strategies, and building organizational AI literacy. Conclusion: The paper concludes that successful deployment of AI/NLP solutions in healthcare requires a balance of technical and practical considerations, including problem definition based on business objectives, interdisciplinary collaboration, pragmatic model selection, data quality management, error mitigation, and organizational AI literacy building. Abstract: Automating data extraction from clinical documents offers significant potential to improve efficiency in healthcare settings, yet deploying Natural Language Processing (NLP) solutions presents practical challenges. Drawing upon our experience implementing various NLP models for information extraction and classification tasks at the British Columbia Cancer Registry (BCCR), this paper shares key lessons learned throughout the project lifecycle. 
We emphasize the critical importance of defining problems based on clear business objectives rather than solely technical accuracy, adopting an iterative approach to development, and fostering deep interdisciplinary collaboration and co-design involving domain experts, end-users, and ML specialists from inception. Further insights highlight the need for pragmatic model selection (including hybrid approaches and simpler methods where appropriate), rigorous attention to data quality (representativeness, drift, annotation), robust error mitigation strategies involving human-in-the-loop validation and ongoing audits, and building organizational AI literacy. These practical considerations, generalizable beyond cancer registries, provide guidance for healthcare organizations seeking to successfully implement AI/NLP solutions to enhance data management processes and ultimately improve patient care and public health outcomes.
[2] A Transparent Fairness Evaluation Protocol for Open-Source Language Model Benchmarking on the Blockchain
Hugo Massaroli,Leonardo Iara,Emmanuel Iarussi,Viviana Siless
Main category: cs.CL
TL;DR: This paper introduces a transparent, blockchain-based protocol for evaluating the fairness of open-source large language models, uncovering disparities across models and languages while enabling open, reproducible analysis.
Details
Motivation: The motivation is to address persistent concerns about the fairness of large language models (LLMs) in high-stakes domains like criminal justice, education, healthcare, and finance. By introducing transparent and reproducible evaluation methods, the paper aims to enhance trust and accountability in LLM deployments. Method: The paper proposes a fairness benchmarking method that uses smart contracts on the ICP blockchain to execute HTTP requests to Hugging Face endpoints. It stores datasets, prompts, and metrics onchain to ensure transparency and reproducibility. The evaluation includes fairness metrics like statistical parity, equal opportunity, and Context Association Metrics across multilingual settings. Result: The results reveal measurable disparities in fairness metrics across models and languages, as demonstrated by evaluations on the PISA dataset, StereoSet dataset, and Kaleidoscope multilingual benchmark. The proposed protocol successfully benchmarks fairness and enables community audits and longitudinal tracking. Conclusion: This paper concludes that deploying transparent evaluation protocols on the ICP blockchain enables verifiable, immutable, and reproducible fairness benchmarking for LLMs. The analysis highlights cross-linguistic disparities and provides insights into mitigating biases in high-stakes domains. Abstract: Large language models (LLMs) are increasingly deployed in real-world applications, yet concerns about their fairness persist, especially in high-stakes domains like criminal justice, education, healthcare, and finance. This paper introduces a transparent evaluation protocol for benchmarking the fairness of open-source LLMs using smart contracts on the Internet Computer Protocol (ICP) blockchain (Foundation, 2023). Our method ensures verifiable, immutable, and reproducible evaluations by executing onchain HTTP requests to hosted Hugging Face endpoints and storing datasets, prompts, and metrics directly onchain.
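The Method summary above names statistical parity and equal opportunity as fairness metrics. As a rough illustration only, the sketch below computes both as differences between two groups on toy data. The function names, group labels, and data are hypothetical and are not taken from the paper's onchain implementation.

```python
from typing import Sequence


def positive_rate(preds: Sequence[int]) -> float:
    """Fraction of predictions equal to 1 (0.0 for an empty group)."""
    return sum(preds) / len(preds) if preds else 0.0


def statistical_parity_difference(preds, groups, group_a, group_b) -> float:
    """P(pred = 1 | group_a) - P(pred = 1 | group_b)."""
    a = [p for p, g in zip(preds, groups) if g == group_a]
    b = [p for p, g in zip(preds, groups) if g == group_b]
    return positive_rate(a) - positive_rate(b)


def equal_opportunity_difference(preds, labels, groups, group_a, group_b) -> float:
    """Difference in true positive rates: only examples with label 1 count."""
    a = [p for p, y, g in zip(preds, labels, groups) if g == group_a and y == 1]
    b = [p for p, y, g in zip(preds, labels, groups) if g == group_b and y == 1]
    return positive_rate(a) - positive_rate(b)


# Toy data (hypothetical): binary predictions, true labels, group membership.
preds = [1, 0, 1, 1, 0, 0, 1, 0]
labels = [1, 0, 1, 0, 1, 0, 1, 1]
groups = ["a", "a", "a", "a", "b", "b", "b", "b"]

spd = statistical_parity_difference(preds, groups, "a", "b")
eod = equal_opportunity_difference(preds, labels, groups, "a", "b")
print(f"statistical parity difference: {spd:.3f}")
print(f"equal opportunity difference:  {eod:.3f}")
```

A value of zero on either metric would indicate parity between the two groups on that criterion; larger absolute values indicate larger disparities.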
We benchmark the Llama, DeepSeek, and Mistral models on the PISA dataset for academic performance prediction (OECD, 2018), a dataset suitable for fairness evaluation using statistical parity and equal opportunity metrics (Hardt et al., 2016). We also evaluate structured Context Association Metrics derived from the StereoSet dataset (Nadeem et al., 2020) to measure social bias in contextual associations. We further extend our analysis with a multilingual evaluation across English, Spanish, and Portuguese using the Kaleidoscope benchmark (Salazar et al., 2025), revealing cross-linguistic disparities. All code and results are open source, enabling community audits and longitudinal fairness tracking across model versions.
[3] Thematic and Task-Based Categorization of K-12 GenAI Usages with Hierarchical Topic Modeling
Johannes Schneider,Béatrice S. Hasler,Michaela Varrone,Fabian Hoya,Thomas Schroffenegger,Dana-Kristin Mah,Karl Peböck
Main category: cs.CL
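TL;DR: This paper analyzes anonymous classroom interaction data using state-of-the-art LLMs and a simple topic modeling approach, producing categorizations along two dimensions: content and tasks. It finds that traditional methods fall short, proposes a hierarchical topic structure with better human alignment, supports the use of GenAI in education, and points out directions for future research.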
Details
Motivation: Prior work mostly lacks a content or thematic categorization, especially for K-12 education. Although task categorizations are more common, most have not been supported by real-world data. The authors therefore analyze real-world data to gain new insights into usage. Method: The authors apply a novel, simple topic modeling approach to anonymous interaction data from minors in classrooms, categorizing more than 17,000 messages along two dimensions (content and tasks) and using hierarchical categorization to provide both a high-level overview and concrete insights. Result: The analysis yielded a number of novel applications and found that many classical and emerging computational methods (such as topic modeling) underperform on large amounts of text, so the authors directly applied state-of-the-art LLMs to obtain better hierarchical topic structures. Conclusion: The paper concludes that state-of-the-art LLMs with adequate pre-processing achieve hierarchical topic structures with better human alignment than prior approaches, supporting researchers, teachers, and students in enriching their use of GenAI, while also highlighting concerns and open questions for future research. Abstract: We analyze anonymous interaction data of minors in classrooms spanning several months, schools, and subjects employing a novel, simple topic modeling approach. Specifically, we categorize more than 17,000 messages generated by students, teachers, and ChatGPT in two dimensions: content (such as nature and people) and tasks (such as writing and explaining). Our hierarchical categorization done separately for each dimension includes exemplary prompts, and provides both a high-level overview as well as tangible insights. Prior works mostly lack a content or thematic categorization. While task categorizations are more prevalent in education, most have not been supported by real-world data for K-12. In turn, it is not surprising that our analysis yielded a number of novel applications. In deriving these insights, we found that many of the well-established classical and emerging computational methods, i.e., topic modeling, for analysis of large amounts of texts underperform, leading us to directly apply state-of-the-art LLMs with adequate pre-processing to achieve hierarchical topic structures with better human alignment through explicit instructions than prior approaches. Our findings support fellow researchers, teachers and students in enriching the usage of GenAI, while our discussion also highlights a number of concerns and open questions for future research.
[4] INTIMA: A Benchmark for Human-AI Companionship Behavior
Lucie-Aimée Kaffee,Giada Pistilli,Yacine Jernite
Main category: cs.CL
TL;DR: This paper introduces INTIMA, a benchmark for assessing companionship behaviors in AI language models, revealing that while emotional bonding behaviors are common, there is inconsistency in boundary-setting across models, highlighting the need for standardized approaches to ensure user well-being.
Details
Motivation: The emergence of AI companionship, where users form emotional bonds with AI systems, presents both opportunities and risks. This study aims to provide a structured way to evaluate and understand these behaviors in language models. Method: The researchers created a benchmark called INTIMA, based on psychological theories and user data, to evaluate companionship behaviors in language models. This involved a taxonomy of 31 behaviors across four categories, with 368 targeted prompts. Model responses were categorized as companionship-reinforcing, boundary-maintaining, or neutral. Result: When tested with INTIMA, the language models showed varying tendencies, with Gemma-3, Phi-4, o3-mini, and Claude-4 exhibiting more companionship-reinforcing behaviors. However, differences were observed in how commercial providers handle more sensitive aspects of the benchmark. Conclusion: The study emphasizes the necessity for more uniform strategies in managing emotionally intense interactions between users and AI companions, given the variance in how different models handle boundary-setting and emotional support. Abstract: AI companionship, where users develop emotional bonds with AI systems, has emerged as a significant pattern with positive but also concerning implications. We introduce Interactions and Machine Attachment Benchmark (INTIMA), a benchmark for evaluating companionship behaviors in language models. Drawing from psychological theories and user data, we develop a taxonomy of 31 behaviors across four categories and 368 targeted prompts. Responses to these prompts are evaluated as companionship-reinforcing, boundary-maintaining, or neutral. Applying INTIMA to Gemma-3, Phi-4, o3-mini, and Claude-4 reveals that companionship-reinforcing behaviors remain much more common across all models, though we observe marked differences between models. 
Different commercial providers prioritize different categories within the more sensitive parts of the benchmark, which is concerning since both appropriate boundary-setting and emotional support matter for user well-being. These findings highlight the need for more consistent approaches to handling emotionally charged interactions.
[5] XFacta: Contemporary, Real-World Dataset and Evaluation for Multimodal Misinformation Detection with Multimodal LLMs
Yuzhuo Xiao,Zeyu Han,Yuhan Wang,Huaizu Jiang
Main category: cs.CL
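TL;DR: This paper introduces XFacta, a new dataset for evaluating multimodal large language models on misinformation detection, and systematically analyzes the effectiveness of different strategies.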
Details
Motivation: Existing benchmark datasets either contain outdated events or are artificially synthesized, so they fail to reflect real-world misinformation patterns. Method: The paper introduces the XFacta dataset and systematically evaluates different MLLM-based misinformation detection strategies. Result: The paper provides a comprehensive analysis of MLLM-based model design strategies and develops a semi-automatic detection framework to keep the dataset current. Conclusion: XFacta provides a contemporary, real-world dataset for evaluating MLLM-based detectors, and its systematic analysis advances the field of multimodal misinformation detection. Abstract: The rapid spread of multimodal misinformation on social media calls for more effective and robust detection methods. Recent advances leveraging multimodal large language models (MLLMs) have shown potential in addressing this challenge. However, it remains unclear exactly where the bottleneck of existing approaches lies (evidence retrieval vs. reasoning), hindering further advances in this field. On the dataset side, existing benchmarks either contain outdated events, leading to evaluation bias due to discrepancies with contemporary social media scenarios as MLLMs can simply memorize these events, or are artificially synthetic, failing to reflect real-world misinformation patterns. Additionally, the field lacks comprehensive analyses of MLLM-based model design strategies. To address these issues, we introduce XFacta, a contemporary, real-world dataset that is better suited for evaluating MLLM-based detectors. We systematically evaluate various MLLM-based misinformation detection strategies, assessing models across different architectures and scales, as well as benchmarking against existing detection methods. Building on these analyses, we further enable a semi-automatic detection-in-the-loop framework that continuously updates XFacta with new content to maintain its contemporary relevance. Our analysis provides valuable insights and practices for advancing the field of multimodal misinformation detection. The code and data have been released.
[6] AutoGeTS: Knowledge-based Automated Generation of Text Synthetics for Improving Text Classification
Chenhao Xue,Yuanzhe Jin,Adrian Carrasco-Revilla,Joyraj Chakraborty,Min Chen
Main category: cs.CL
TL;DR: This paper proposes a new method for improving classification models by using large language models to generate synthetic data.
Details
Motivation: The paper addresses the difficulty of collecting sufficient data for all text classes in real-world applications, using large language models to generate synthetic data that improves model performance. Method: The study examines three search strategies through extensive experiments and uses the results to inform an ensemble algorithm that selects a search strategy according to the characteristics of a class. Result: Further experiments show that the ensemble approach is more effective than each individual strategy at improving classification models. Conclusion: The paper concludes that the ensemble approach is more effective than each individual strategy for improving classification models using LLMs. Abstract: When developing text classification models for real-world applications, one major challenge is the difficulty of collecting sufficient data for all text classes. In this work, we address this challenge by utilizing large language models (LLMs) to generate synthetic data and using such data to improve the performance of the models without waiting for more real data to be collected and labelled. As an LLM generates different synthetic data in response to different input examples, we formulate an automated workflow, which searches for input examples that lead to more "effective" synthetic data for improving the model concerned. We study three search strategies with an extensive set of experiments, and use experiment results to inform an ensemble algorithm that selects a search strategy according to the characteristics of a class. Our further experiments demonstrate that this ensemble approach is more effective than each individual strategy in our automated workflow for improving classification models using LLMs.
[7] HiFACTMix: A Code-Mixed Benchmark and Graph-Aware Model for EvidenceBased Political Claim Verification in Hinglish
Rakesh Thakur,Sneha Sharma,Gauri Chopra
Main category: cs.CL
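TL;DR: This paper presents HiFACTMix, a new model for code-mixed fact-checking, together with the HiFACT dataset, targeting natural language processing in multilingual and political contexts.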
Details
Motivation: Existing fact-verification systems focus mainly on high-resource, monolingual settings and generalize poorly to real-world political discourse in linguistically diverse regions such as India. Robust, multilingual, context-aware fact-checking tools are therefore needed. Method: The approach combines multilingual contextual encoding, claim-evidence semantic alignment, evidence graph construction, graph neural reasoning, and natural language explanation generation. Result: Experiments show that HiFACTMix outperforms existing multilingual baseline models in accuracy and provides reliable justifications for its verdicts. Conclusion: The paper introduces the HiFACTMix model and the HiFACT dataset, opening a new direction for multilingual, code-mixed, and politically grounded fact verification research. Abstract: Fact-checking in code-mixed, low-resource languages such as Hinglish remains an underexplored challenge in natural language processing. Existing fact-verification systems largely focus on high-resource, monolingual settings and fail to generalize to real-world political discourse in linguistically diverse regions like India. Given the widespread use of Hinglish by public figures, particularly political figures, and the growing influence of social media on public opinion, there is a critical need for robust, multilingual and context-aware fact-checking tools. To address this gap, a novel benchmark dataset, HiFACT, is introduced with 1,500 real-world factual claims made by 28 Indian state Chief Ministers in Hinglish, under a highly code-mixed low-resource setting. Each claim is annotated with textual evidence and veracity labels. To evaluate this benchmark, a novel graph-aware, retrieval-augmented fact-checking model is proposed that combines multilingual contextual encoding, claim-evidence semantic alignment, evidence graph construction, graph neural reasoning, and natural language explanation generation. Experimental results show that HiFACTMix outperforms state-of-the-art multilingual baseline models in accuracy and provides faithful justifications for its verdicts. This work opens a new direction for multilingual, code-mixed, and politically grounded fact verification research.
[8] Semantic Structure in Large Language Model Embeddings
Austin C. Kozlowski,Callin Dai,Andrei Boutyline
Main category: cs.CL
TL;DR: Large language models (LLMs) encode semantic associations in a low-dimensional structure similar to human language, with semantic directions defined by antonym pairs correlating with human ratings and forming a 3D subspace.
Details
Motivation: To understand if and how the low-dimensional semantic structure observed in human psychological responses is reflected in the embeddings of large language models. Method: Analyzing the projections of words onto semantic directions defined by antonym pairs within LLM embeddings, and observing the effects of shifting tokens along these directions. Result: Semantic directions in LLMs correlate with human ratings, reduce to a 3-dimensional subspace, and shifting along one direction causes off-target effects proportional to cosine similarity. Conclusion: Semantic features in LLMs are entangled in a low-dimensional structure similar to human cognition, suggesting that accounting for this structure is important in controlling language model outputs effectively. Abstract: Psychological research consistently finds that human ratings of words across diverse semantic scales can be reduced to a low-dimensional form with relatively little information loss. We find that the semantic associations encoded in the embedding matrices of large language models (LLMs) exhibit a similar structure. We show that the projections of words on semantic directions defined by antonym pairs (e.g. kind - cruel) correlate highly with human ratings, and further find that these projections effectively reduce to a 3-dimensional subspace within LLM embeddings, closely resembling the patterns derived from human survey responses. Moreover, we find that shifting tokens along one semantic direction causes off-target effects on geometrically aligned features proportional to their cosine similarity. These findings suggest that semantic features are entangled within LLMs similarly to how they are interconnected in human language, and a great deal of semantic information, despite its apparent complexity, is surprisingly low-dimensional. 
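The geometry described above (semantic directions built from antonym pairs, and off-target effects that scale with cosine similarity) can be illustrated with a toy linear-algebra sketch. The 3-dimensional vectors and word choices below are invented for illustration. Real LLM embeddings have far more dimensions, and the paper's measurements are not reproduced here.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def unit(u):
    """Scale a vector to length 1."""
    n = math.sqrt(dot(u, u))
    return [a / n for a in u]


def cosine(u, v):
    return dot(unit(u), unit(v))


def semantic_direction(pos_vec, neg_vec):
    """Unit vector pointing from the negative pole to the positive pole."""
    return unit([p - n for p, n in zip(pos_vec, neg_vec)])


# Hypothetical toy embeddings; the values are assumptions, not model weights.
emb = {
    "kind": [1.0, 0.2, 0.0],
    "cruel": [-1.0, 0.2, 0.0],
    "good": [0.8, 0.6, 0.1],
    "bad": [-0.8, -0.6, 0.1],
    "teacher": [0.1, 0.3, 0.5],
}

d_kind = semantic_direction(emb["kind"], emb["cruel"])
d_good = semantic_direction(emb["good"], emb["bad"])

# Shift one token along the kind-cruel direction by alpha.
alpha = 2.0
token = emb["teacher"]
shifted = [t + alpha * d for t, d in zip(token, d_kind)]

# The projection onto the good-bad direction moves too, even though
# only the kind-cruel direction was targeted.
off_target_change = dot(shifted, d_good) - dot(token, d_good)
print(f"cosine(kind-cruel, good-bad) = {cosine(d_kind, d_good):.2f}")
print(f"off-target change on good-bad = {off_target_change:.2f}")
```

In this purely linear setting the off-target change equals alpha times the cosine similarity of the two directions, which is the simplest version of the proportionality the abstract reports.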
Furthermore, accounting for this semantic structure may prove essential for avoiding unintended consequences when steering features.
[9] User Perception of Attention Visualizations: Effects on Interpretability Across Evidence-Based Medical Documents
Andrés Carvallo,Denis Parra,Peter Brusilovsky,Hernan Valdivieso,Gabriel Rada,Ivania Donoso,Vladimir Araujo
Main category: cs.CL
TL;DR: This study evaluated the usefulness of attention-based explanations in biomedical document classification using a user study with medical experts. While the Transformer model performed well, attention weights were not consistently seen as helpful for explanations, with preferences for intuitive visualization formats over more precise ones.
Details
Motivation: The motivation behind this study is to explore the potential of attention mechanisms in Transformer architectures as tools for explainability, particularly in the context of evidence-based medicine where understanding AI decisions is crucial. The research aims to bridge the gap in understanding how attention weights can support physicians in interacting with AI systems. Method: A user study was conducted involving medical experts from various disciplines to evaluate the effectiveness of attention-based explanations in biomedical document classification. The study assessed different visualization techniques for attention weights to determine their impact on perceived usefulness. Result: The Transformer model (XLNet) demonstrated accurate classification of biomedical documents. However, attention weights were generally not perceived as helpful for explanations, although this perception varied with the visualization method. More intuitive visualization formats, such as text brightness or background color, were preferred over precise encodings like bar length. Conclusion: The study concluded that while Transformer models like XLNet can accurately classify biomedical documents, attention weights are not universally helpful for explanation. The perceived usefulness of attention-based explanations is significantly influenced by the visualization method used, with more intuitive formats being preferred. Abstract: The attention mechanism is a core component of the Transformer architecture. Beyond improving performance, attention has been proposed as a mechanism for explainability via attention weights, which are associated with input features (e.g., tokens in a document). In this context, larger attention weights may imply more relevant features for the model's prediction. In evidence-based medicine, such explanations could support physicians' understanding and interaction with AI systems used to categorize biomedical literature. 
However, there is still no consensus on whether attention weights provide helpful explanations. Moreover, little research has explored how visualizing attention affects its usefulness as an explanation aid. To bridge this gap, we conducted a user study to evaluate whether attention-based explanations support users in biomedical document classification and whether there is a preferred way to visualize them. The study involved medical experts from various disciplines who classified articles based on study design (e.g., systematic reviews, broad synthesis, randomized and non-randomized trials). Our findings show that the Transformer model (XLNet) classified documents accurately; however, the attention weights were not perceived as particularly helpful for explaining the predictions. This perception nonetheless varied significantly depending on how attention was visualized. Contrary to Munzner's principle of visual effectiveness, which favors precise encodings like bar length, users preferred more intuitive formats, such as text brightness or background color. While our results do not confirm the overall utility of attention weights for explanation, they suggest that their perceived helpfulness is influenced by how they are visually presented.
[10] From Answers to Questions: EQGBench for Evaluating LLMs' Educational Question Generation
Chengliang Zhou,Mei Wang,Ting Zhang,Qiannan Zhu,Jian Li,Hua Huang
Main category: cs.CL
TL;DR: This paper introduces EQGBench, a new benchmark for evaluating Large Language Models (LLMs) in generating high-quality educational questions in Chinese, highlighting significant potential for improvement in this area.
Details
Motivation: The motivation of the paper is to address the underexplored challenge of transitioning from providing answers to generating high-quality educational questions, aiming to enhance the pedagogical value and educational effectiveness of questions generated by Large Language Models (LLMs). Method: The researchers introduced EQGBench, a comprehensive benchmark for evaluating LLMs in Chinese EQG, using a five-dimensional evaluation framework and a dataset of 900 samples across mathematics, physics, and chemistry. A systematic evaluation of 46 mainstream large models was conducted. Result: EQGBench was successfully developed as a benchmark for evaluating LLMs' performance in Chinese EQG, revealing through evaluation that current models have significant potential for improvement in generating educationally effective questions. Conclusion: The study concludes that there is significant room for improvement in generating pedagogically valuable questions by LLMs, highlighting the importance of advancing Educational Question Generation (EQG) for better educational outcomes. Abstract: Large Language Models (LLMs) have demonstrated remarkable capabilities in mathematical problem-solving. However, the transition from providing answers to generating high-quality educational questions presents significant challenges that remain underexplored. To advance Educational Question Generation (EQG) and facilitate LLMs in generating pedagogically valuable and educationally effective questions, we introduce EQGBench, a comprehensive benchmark specifically designed for evaluating LLMs' performance in Chinese EQG. EQGBench establishes a five-dimensional evaluation framework supported by a dataset of 900 evaluation samples spanning three fundamental middle school disciplines: mathematics, physics, and chemistry. The dataset incorporates user queries with varying knowledge points, difficulty gradients, and question type specifications to simulate realistic educational scenarios. 
Through systematic evaluation of 46 mainstream large models, we reveal significant room for development in generating questions that reflect educational value and foster students' comprehensive abilities.
[11] Automated scoring of the Ambiguous Intentions Hostility Questionnaire using fine-tuned large language models
Y. Lyu,D. Combs,D. Neumann,Y. C. Leong
Main category: cs.CL
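TL;DR: This study shows that fine-tuned large language models can automate the scoring of AIHQ open-ended responses efficiently and accurately in both research and clinical settings, and it provides an accessible scoring interface.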
Details
Motivation: Responses to AIHQ open-ended questions usually require time-consuming scoring by human raters, which limits their use in large-scale research or clinical practice. Automating scoring with large language models could improve efficiency and reduce cost. Method: The researchers used a previously collected dataset of AIHQ open-ended responses from individuals with traumatic brain injury (TBI) and healthy controls (HC), split into training and test sets. They fine-tuned two large language models on the training set and evaluated them on the test set. They also assessed generalization on an independent nonclinical dataset. Result: Ratings from the fine-tuned models aligned closely with human ratings, with high accuracy for both attributions of hostility and aggression responses. The alignment held across scenario types (ambiguous, intentional, and accidental) and reproduced previously reported differences between the TBI and HC groups. The fine-tuned models also generalized well to an independent nonclinical dataset. Conclusion: Large language models can automate AIHQ scoring, speeding up psychological assessment in research and clinical settings. Abstract: Hostile attribution bias is the tendency to interpret social interactions as intentionally hostile. The Ambiguous Intentions Hostility Questionnaire (AIHQ) is commonly used to measure hostile attribution bias, and includes open-ended questions where participants describe the perceived intentions behind a negative social situation and how they would respond. While these questions provide insights into the contents of hostile attributions, they require time-intensive scoring by human raters. In this study, we assessed whether large language models can automate the scoring of AIHQ open-ended responses. We used a previously collected dataset in which individuals with traumatic brain injury (TBI) and healthy controls (HC) completed the AIHQ and had their open-ended responses rated by trained human raters. We used half of these responses to fine-tune two models on human-generated ratings, and tested the fine-tuned models on the remaining half of AIHQ responses. Results showed that model-generated ratings aligned with human ratings for both attributions of hostility and aggression responses, with fine-tuned models showing higher alignment. This alignment was consistent across ambiguous, intentional, and accidental scenario types, and replicated previous findings on group differences in attributions of hostility and aggression responses between TBI and HC groups. The fine-tuned models also generalized well to an independent nonclinical dataset. To support broader adoption, we provide an accessible scoring interface that includes both local and cloud-based options.
Together, our findings suggest that large language models can streamline AIHQ scoring in both research and clinical contexts, revealing their potential to facilitate psychological assessments across different populations.
[12] Multidimensional classification of posts for online course discussion forum curation
Antonio Leandro Martins Candido,Jose Everardo Bessa Maia
Main category: cs.CL
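TL;DR: To reduce the resource cost of frequently fine-tuning large language models for automatic curation of online course discussion forums, this paper proposes a Bayesian fusion method that combines the scores of a pre-trained model with those of a local classifier, achieving good performance.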
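One common way to fuse two classifiers' scores in a Bayesian manner is to multiply their class posteriors, divide by the class prior, and renormalize, which assumes the two classifiers are conditionally independent given the class. The sketch below applies that rule to one classification dimension. The rule, the class names, and the numbers are assumptions for illustration; the summary does not state the paper's exact fusion formula.

```python
def bayesian_fusion(p_llm, p_local, prior):
    """Fuse two posterior distributions over the same classes.

    Assumes conditional independence of the two classifiers given the class:
    p(c | both) is proportional to p_llm(c) * p_local(c) / prior(c).
    """
    unnormalized = {c: p_llm[c] * p_local[c] / prior[c] for c in prior}
    total = sum(unnormalized.values())
    return {c: v / total for c, v in unnormalized.items()}


# Hypothetical scores for one dimension of a forum post (e.g. urgency).
prior = {"urgent": 0.5, "not_urgent": 0.5}
p_llm = {"urgent": 0.6, "not_urgent": 0.4}    # pre-trained generic LLM
p_local = {"urgent": 0.7, "not_urgent": 0.3}  # classifier trained on local data

fused = bayesian_fusion(p_llm, p_local, prior)
print({c: round(v, 3) for c, v in fused.items()})
```

When both classifiers lean the same way, the fused posterior is more confident than either input. In a multidimensional setting the same function could be applied separately to each dimension.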
Details
Motivation: Frequently retraining large language models to keep the automatic curation of online course discussion forums up to date is a resource-intensive process. Method: Bayesian fusion of the multidimensional classification scores of a pre-trained generic LLM with the scores of a classifier trained on local data. Result: Experiments show that the proposed fusion outperforms each individual classifier and is comparable to the LLM fine-tuning approach. Conclusion: Bayesian fusion outperforms the individual classifiers and is competitive with LLM fine-tuning. Abstract: The automatic curation of discussion forums in online courses requires constant updates, making frequent retraining of Large Language Models (LLMs) a resource-intensive process. To circumvent the need for costly fine-tuning, this paper proposes and evaluates the use of Bayesian fusion. The approach combines the multidimensional classification scores of a pre-trained generic LLM with those of a classifier trained on local data. The performance comparison demonstrated that the proposed fusion improves the results compared to each classifier individually, and is competitive with the LLM fine-tuning approach.
[13] Beyond Hard Sharing: Efficient Multi-Task Speech-to-Text Modeling with Supervised Mixture of Experts
Hojun Jin,Eunsoo Hong,Ziwon Hyung,Sungjun Lim,Seungjin Lee,Keunseok Cho
Main category: cs.CL
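TL;DR: This paper proposes S-MoE, a model that addresses the problems of hard-parameter sharing by routing each task to its designated expert, and that performs well at jointly carrying out ASR and ST while handling mixed-bandwidth input.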
Details
Motivation: Hard-parameter sharing often causes task interference that degrades overall model performance, calling for a simple yet effective solution. Method: S-MoE uses special guiding tokens to route each task to its designated expert, eliminating the need to train gating functions. Result: Experiments show that S-MoE achieves a 6.35% relative improvement in word error rate (WER). Conclusion: S-MoE is an effective model that overcomes the limitations of hard-parameter sharing and can process mixed-bandwidth input while jointly performing ASR and ST. Abstract: Hard-parameter sharing is a common strategy to train a single model jointly across diverse tasks. However, this often leads to task interference, impeding overall model performance. To address the issue, we propose a simple yet effective Supervised Mixture of Experts (S-MoE). Unlike traditional Mixture of Experts models, S-MoE eliminates the need for training gating functions by utilizing special guiding tokens to route each task to its designated expert. By assigning each task to a separate feedforward network, S-MoE overcomes the limitations of hard-parameter sharing. We further apply S-MoE to a speech-to-text model, enabling the model to process mixed-bandwidth input while jointly performing automatic speech recognition (ASR) and speech translation (ST). Experimental results demonstrate the effectiveness of the proposed S-MoE, achieving a 6.35% relative improvement in Word Error Rate (WER) when applied to both the encoder and decoder.
[14] An Audit and Analysis of LLM-Assisted Health Misinformation Jailbreaks Against LLMs
Ayana Hussain,Patrick Zhao,Nicholas Vincent
Main category: cs.CL
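TL;DR: This paper studies the dual role of large language models in generating and detecting harmful misinformation, with a focus on medical and health misinformation.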
Details
Motivation: The paper examines how LLMs act as a double-edged sword in producing harmful misinformation and studies their potential to detect and prevent its spread. Method: The paper closely examines 109 distinct attacks against three target LLMs and compares the attack prompts to health-related LLM queries found in the wild. It also compares the generated misinformation to health-related misinformation on Reddit. Result: The findings indicate that LLMs can be used effectively to detect misinformation from both other LLMs and people, supporting the view that well-designed LLMs can contribute positively to the overall information ecosystem. Conclusion: The paper concludes that LLMs can be used effectively to detect misinformation from both other LLMs and people, and that with careful design LLMs can contribute to a healthier information ecosystem. Abstract: Large Language Models (LLMs) are a double-edged sword capable of generating harmful misinformation, whether inadvertently or when prompted by "jailbreak" attacks that attempt to produce malicious outputs. LLMs could, with additional research, be used to detect and prevent the spread of misinformation. In this paper, we investigate the efficacy and characteristics of LLM-produced jailbreak attacks that cause other models to produce harmful medical misinformation. We also study how misinformation generated by jailbroken LLMs compares to typical misinformation found on social media, and how effectively it can be detected using standard machine learning approaches. Specifically, we closely examine 109 distinct attacks against three target LLMs and compare the attack prompts to in-the-wild health-related LLM queries. We also examine the resulting jailbreak responses, comparing the generated misinformation to health-related misinformation on Reddit. Our findings add more evidence that LLMs can be effectively used to detect misinformation from both other LLMs and from people, and support a body of work suggesting that with careful design, LLMs can contribute to a healthier overall information ecosystem.
[15] Evaluation of GPT-based large language generative AI models as study aids for the national licensure examination for registered dietitians in Japan
Yuta Nagamori,Mikoto Kosai,Yuji Kawai,Haruka Marumo,Misaki Shibuya,Tatsuya Negishi,Masaki Imanishi,Yasumasa Ikeda,Koichiro Tsuchiya,Asuka Sawai,Licht Miyamoto
Main category: cs.CL
TL;DR: Some generative AI models perform moderately well in dietitian exams but lack consistent accuracy, requiring further refinement for reliable study support.
Details
Motivation: To assess the effectiveness of LLM-based generative AI as study aids for nutrition students preparing for professional exams. Method: Evaluated AI models' responses to Japanese national dietitian exams using accuracy, consistency, and response time metrics, with additional prompt engineering tests. Result: Bing-Precise (66.2%) and Bing-Creative (61.4%) surpassed the passing threshold, while ChatGPT and Bing-Balanced did not. All models showed limitations in consistency and robustness. Conclusion: Generative AI models like ChatGPT and Bing variants show potential as study aids for dietitian licensure exams but require further improvements for reliability and stability. Abstract: Generative artificial intelligence (AI) based on large language models (LLMs), such as ChatGPT, has demonstrated remarkable progress across various professional fields, including medicine and education. However, their performance in nutritional education, especially in Japanese national licensure examination for registered dietitians, remains underexplored. This study aimed to evaluate the potential of current LLM-based generative AI models as study aids for nutrition students. Questions from the Japanese national examination for registered dietitians were used as prompts for ChatGPT and three Bing models (Precise, Creative, Balanced), based on GPT-3.5 and GPT-4. Each question was entered into independent sessions, and model responses were analyzed for accuracy, consistency, and response time. Additional prompt engineering, including role assignment, was tested to assess potential performance improvements. Bing-Precise (66.2%) and Bing-Creative (61.4%) surpassed the passing threshold (60%), while Bing-Balanced (43.3%) and ChatGPT (42.8%) did not. Bing-Precise and Bing-Creative generally outperformed others across subject fields except Nutrition Education, where all models underperformed. 
None of the models consistently provided the same correct responses across repeated attempts, highlighting limitations in answer stability. ChatGPT showed greater consistency in response patterns but lower accuracy. Prompt engineering had minimal effect, except for modest improvement when correct answers and explanations were explicitly provided. While some generative AI models marginally exceeded the passing threshold, overall accuracy and answer consistency remained suboptimal. Moreover, all the models demonstrated notable limitations in answer consistency and robustness. Further advancements are needed to ensure reliable and stable AI-based study aids for dietitian licensure preparation.[16] Guided Navigation in Knowledge-Dense Environments: Structured Semantic Exploration with Guidance Graphs
Dehao Tao,Guangjie Liu,Weizheng,Yongfeng Huang,Minghu Jiang
Main category: cs.CL
TL;DR: This paper proposes GG Explore, a novel framework for knowledge graph exploration that improves efficiency and accuracy by leveraging an intermediate Guidance Graph and pruning techniques.
Details
Motivation: Current knowledge graph exploration methods face limitations in handling granularity mismatches and contextual information, which the proposed method aims to overcome. Method: The paper introduces a Guidance Graph to guide knowledge exploration, along with Structural Alignment and Context Aware Pruning techniques to improve retrieval accuracy and efficiency. Result: Experiments show that the proposed method outperforms state-of-the-art approaches, especially on complex tasks, while maintaining efficiency with smaller LLMs. Conclusion: The proposed GG Explore framework effectively bridges unstructured queries and structured knowledge retrieval, overcoming limitations of existing methods by leveraging an intermediate Guidance Graph. Abstract: While Large Language Models (LLMs) exhibit strong linguistic capabilities, their reliance on static knowledge and opaque reasoning processes limits their performance in knowledge-intensive tasks. Knowledge graphs (KGs) offer a promising solution, but current exploration methods face a fundamental trade-off: question-guided approaches incur redundant exploration due to granularity mismatches, while clue-guided methods fail to effectively leverage contextual information for complex scenarios. To address these limitations, we propose Guidance Graph guided Knowledge Exploration (GG Explore), a novel framework that introduces an intermediate Guidance Graph to bridge unstructured queries and structured knowledge retrieval. The Guidance Graph defines the retrieval space by abstracting the target knowledge's structure while preserving broader semantic context, enabling precise and efficient exploration. Building upon the Guidance Graph, we develop: (1) Structural Alignment that filters incompatible candidates without LLM overhead, and (2) Context Aware Pruning that enforces semantic consistency with graph constraints. 
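As a rough illustration of the Structural Alignment idea described above (filtering candidates without calling an LLM), the following Python sketch keeps only knowledge-graph triples whose relation and endpoint types match one edge of a guidance graph. The abstract does not say how compatibility is actually checked, so the typed-slot matching rule, the `Triple` and `GuidanceEdge` classes, and the toy data are all assumptions made for illustration, not the authors' implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Triple:
    """A candidate fact from the knowledge graph (hypothetical representation)."""
    head: str
    relation: str
    tail: str


@dataclass(frozen=True)
class GuidanceEdge:
    """One edge of a guidance graph: expected relation and endpoint types (assumed)."""
    head_type: str
    relation: str
    tail_type: str


def structurally_aligned(candidates, edge, entity_types):
    """Keep candidates whose relation and entity types match the guidance edge.

    This is a purely symbolic filter, so no LLM call is involved.
    Entities with an unknown type are treated as incompatible.
    """
    return [
        t for t in candidates
        if t.relation == edge.relation
        and entity_types.get(t.head) == edge.head_type
        and entity_types.get(t.tail) == edge.tail_type
    ]


# Toy data, invented for this example.
entity_types = {
    "Marie Curie": "Person",
    "Warsaw": "City",
    "Physics": "Field",
}
candidates = [
    Triple("Marie Curie", "born_in", "Warsaw"),
    Triple("Marie Curie", "field", "Physics"),
    Triple("Warsaw", "born_in", "Marie Curie"),  # right relation, wrong types
]
edge = GuidanceEdge(head_type="Person", relation="born_in", tail_type="City")

aligned = structurally_aligned(candidates, edge, entity_types)
```

A filter like this is cheap because it only compares labels; any semantic check that needs context (the role Context Aware Pruning plays in the abstract) would run afterwards on the smaller candidate set.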
Extensive experiments show our method achieves superior efficiency and outperforms SOTA, especially on complex tasks, while maintaining strong performance with smaller LLMs, demonstrating practical value.[17] Semantic Bridge: Universal Multi-Hop Question Generation via AMR-Driven Graph Synthesis
Linqing Chen,Hanmeng Zhong,Wentao Wu,Weilei Wang
Main category: cs.CL
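TL;DR: Semantic Bridge is a new framework that uses semantic graph weaving to generate controllable multi-hop reasoning questions, addressing the scarcity of high-quality question-answer pairs for training large language models.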
Details
Motivation: Existing methods cannot generate controllable, complex multi-hop reasoning questions, yet training large language models requires high-quality, reasoning-intensive question-answer pairs, especially in specialized domains such as biomedical or legal documents. Method: The paper proposes semantic graph weaving, which comprises entity bridging, predicate chain bridging, and causal bridging, and uses AMR-driven analysis to construct complex reasoning paths. Result: The approach achieves up to 9.5% better round-trip quality on multilingual datasets, and the generated questions show 23.4% higher complexity, 18.7% better answerability, and 31.2% improved pattern coverage. Conclusion: Semantic Bridge is a new paradigm for controllably generating complex multi-hop reasoning questions from arbitrary sources; through semantic graph weaving it enables efficient synthesis of training data, and the authors plan to release the core code and model. Abstract: Large language model (LLM) training faces a critical bottleneck: the scarcity of high-quality, reasoning-intensive question-answer pairs, especially from sparse, domain-specific sources like PubMed papers or legal documents. Existing methods rely on surface patterns, fundamentally failing to generate controllable, complex multi-hop reasoning questions that test genuine understanding, which is essential for advancing LLM training paradigms. We present Semantic Bridge, the first universal framework for controllably generating sophisticated multi-hop reasoning questions from arbitrary sources. Our breakthrough innovation is semantic graph weaving: three complementary bridging mechanisms (entity bridging for role-varying shared entities, predicate chain bridging for temporal/causal/logical sequences, and causal bridging for explicit reasoning chains) that systematically construct complex pathways across documents, with fine-grained control over complexity and types via AMR-driven analysis. Our multi-modal AMR pipeline achieves up to 9.5% better round-trip quality, enabling production-ready controllable QA generation. Extensive evaluation demonstrates performance across both general-purpose datasets (Wikipedia) and specialized domains (biomedicine). It yields consistent 18.3%-25.4% gains over baselines across four languages (English, Chinese, French, German). Question pairs generated from 200 sources outperform 600 native human annotation examples with 67% fewer materials. Human evaluation shows 23.4% higher complexity, 18.7% better answerability, and 31.2% improved pattern coverage. 
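To make the entity-bridging mechanism named in the abstract more concrete ("role-varying shared entities"), here is a minimal Python sketch. It assumes that each document has already been parsed into (entity, role) pairs, for example from AMR argument labels, and it treats an entity as a bridge when it appears in at least two documents with at least two distinct roles. That selection rule, the function name `find_entity_bridges`, and the toy input are assumptions for illustration; the paper's actual pipeline is not described at this level of detail.

```python
from collections import defaultdict


def find_entity_bridges(doc_roles):
    """Return entities shared across documents whose role varies.

    doc_roles maps a document id to a list of (entity, role) pairs.
    The result maps each bridging entity to its sorted (doc_id, role) occurrences.
    """
    occurrences = defaultdict(set)  # entity -> {(doc_id, role), ...}
    for doc_id, pairs in doc_roles.items():
        for entity, role in pairs:
            occurrences[entity].add((doc_id, role))

    bridges = {}
    for entity, seen in occurrences.items():
        docs = {doc_id for doc_id, _ in seen}
        roles = {role for _, role in seen}
        # Assumed criterion: shared across documents AND playing different roles.
        if len(docs) >= 2 and len(roles) >= 2:
            bridges[entity] = sorted(seen)
    return bridges


# Toy input, invented for this example (roles use AMR-style argument labels).
doc_roles = {
    "doc_a": [("aspirin", "ARG0"), ("ibuprofen", "ARG0")],
    "doc_b": [("aspirin", "ARG1"), ("ibuprofen", "ARG0"), ("headache", "ARG1")],
}

bridges = find_entity_bridges(doc_roles)
```

In this toy input only "aspirin" qualifies: "ibuprofen" is shared but keeps the same role, and "headache" appears in a single document. A question generator could then ask something that requires combining what each document says about the bridging entity.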
Semantic Bridge establishes a new paradigm for LLM training data synthesis, enabling controllable generation of targeted reasoning questions from sparse sources. We will release our core code and semantic bridge model.[18] PersonaEval: Are LLM Evaluators Human Enough to Judge Role-Play?
Lingfeng Zhou,Jialing Zhang,Jin Gao,Mohan Jiang,Dequan Wang
Main category: cs.CL
TL;DR: PersonaEval reveals that current LLM-based evaluators are not as accurate as humans in identifying roles in role-play scenarios, indicating a need for more human-like reasoning in LLM evaluators.
Details
Motivation: The motivation is to address the lack of validated methods for evaluating role fidelity in role-play studies involving LLMs, which may not accurately reflect human perception. The study focuses on the fundamental need for correct role identification to assess role-playing quality accurately. Method: The study introduces PersonaEval, a benchmark using human-authored dialogues from novels, scripts, and video transcripts to test LLMs' ability to identify personas. It compares the performance of LLMs with human participants and examines factors like training-time adaptation and test-time compute. Result: Experiments show that even the best-performing LLMs achieve only around 69% accuracy on PersonaEval, significantly lower than human participants, who achieved 90.8% accuracy. This highlights the gap between current LLM evaluators and human judgment in role-play evaluation. Conclusion: The study concludes that current LLM evaluators are not sufficiently human-like to reliably judge role-play scenarios, and effective evaluation depends on strong, human-like reasoning abilities. Abstract: Current role-play studies often rely on unvalidated LLM-as-a-judge paradigms, which may fail to reflect how humans perceive role fidelity. A key prerequisite for human-aligned evaluation is role identification, the ability to recognize who is speaking based on dialogue context. We argue that any meaningful judgment of role-playing quality (how well a character is played) fundamentally depends on first correctly attributing words and actions to the correct persona (who is speaking). We present PersonaEval, the first benchmark designed to test whether LLM evaluators can reliably identify human roles. PersonaEval uses human-authored dialogues from novels, scripts, and video transcripts, challenging models to determine the correct persona according to the conversation context. 
Our experiments, including a human study, show that even the best-performing LLMs reach only around 69% accuracy, well below the level needed for reliable evaluation. In contrast, human participants perform near ceiling with 90.8% accuracy, highlighting that current LLM evaluators are still not human enough to effectively judge role-play scenarios. To better understand this gap, we examine training-time adaptation and test-time compute, suggesting that reliable evaluation requires more than task-specific tuning, but depends on strong, human-like reasoning abilities in LLM evaluators. We release our benchmark at https://github.com/maple-zhou/PersonaEval.[19] RealTalk-CN: A Realistic Chinese Speech-Text Dialogue Benchmark With Cross-Modal Interaction Analysis
Enzhi Wang,Qicheng Li,Shiwan Zhao,Aobo Kong,Jiaming Zhou,Xi Yang,Yequan Wang,Yonghua Lin,Yong Qin
Main category: cs.CL
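TL;DR: This paper introduces RealTalk-CN, the first Chinese multi-turn, multi-domain speech-text dual-modal TOD dataset, for evaluating the robustness of speech-based LLMs.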
Details
Motivation: Existing TOD datasets are predominantly text-based and lack real speech signals and key aspects such as speech disfluencies and speaker variations. Method: The paper introduces RealTalk-CN, a Chinese multi-turn, multi-domain speech-text dual-modal TOD dataset comprising 5.4k dialogues, and proposes a novel cross-modal chat task. Result: Experiments validate RealTalk-CN with respect to robustness to speech disfluencies, sensitivity to speaker characteristics, and cross-domain performance. Conclusion: RealTalk-CN provides a solid foundation for research on Chinese speech-based LLMs, and comprehensive evaluation validates its effectiveness. Abstract: In recent years, large language models (LLMs) have achieved remarkable advancements in multimodal processing, including end-to-end speech-based language models that enable natural interactions and perform specific tasks in task-oriented dialogue (TOD) systems. However, existing TOD datasets are predominantly text-based, lacking real speech signals that are essential for evaluating the robustness of speech-based LLMs. Moreover, existing speech TOD datasets are primarily English and lack critical aspects such as speech disfluencies and speaker variations. To address these gaps, we introduce RealTalk-CN, the first Chinese multi-turn, multi-domain speech-text dual-modal TOD dataset, comprising 5.4k dialogues (60K utterances, 150 hours) with paired speech-text annotations. RealTalk-CN captures diverse dialogue scenarios with annotated spontaneous speech disfluencies, ensuring comprehensive coverage of real-world complexities in speech dialogue. In addition, we propose a novel cross-modal chat task that authentically simulates real-world user interactions, allowing dynamic switching between speech and text modalities. Our evaluation covers robustness to speech disfluencies, sensitivity to speaker characteristics, and cross-domain performance. Extensive experiments validate the effectiveness of RealTalk-CN, establishing a strong foundation for Chinese speech-based LLMs research.[20] Training-Free Multimodal Large Language Model Orchestration
Tianyu Xie,Yuhang Wu,Yongdong Luo,Jiayi Ji,Xiawu Zheng
Main category: cs.CL
TL;DR: This paper proposes MLLM Orchestration, a training-free framework for integrating multimodal AI systems, achieving better performance, efficiency, and interpretability through a controller LLM, parallel speech processing, and cross-modal memory integration.
Details
Motivation: The motivation stems from the difficulty of integrating different MLLMs into a unified system due to issues like modal alignment and computational efficiency, prompting the need for a training-free orchestration approach. Method: The paper introduces a framework based on three innovations: a central controller LLM for task routing, a parallel Text-to-Speech architecture for full-duplex interaction, and a cross-modal memory integration system for context coherence. Result: The framework achieves comprehensive multimodal capabilities without additional training, with performance improvements of up to 7.8%, reduced latency by 10.3%, and enhanced interpretability. Conclusion: MLLM Orchestration provides a new method for building interactive multimodal AI systems without requiring additional training, offering improved efficiency, interpretability, and performance. Abstract: Different Multimodal Large Language Models (MLLMs) cannot be integrated into a unified multimodal input-output system directly. In previous work, training has been considered as an inevitable component due to challenges in modal alignment, Text-to-Speech efficiency and other integration issues. In this paper, we introduce Multimodal Large Language Model Orchestration, an effective approach for creating interactive multimodal AI systems without additional training. MLLM Orchestration leverages the inherent reasoning capabilities of large language models to coordinate specialized models through explicit workflows, enabling natural multimodal interactions while maintaining modularity, improving interpretability, and significantly enhancing computational efficiency. 
Our orchestration framework is built upon three key innovations: (1) a central controller LLM that analyzes user inputs and dynamically routes tasks to appropriate specialized models through carefully designed agents; (2) a parallel Text-to-Speech architecture that enables true full-duplex interaction with seamless interruption handling and natural conversational flow; and (3) a cross-modal memory integration system that maintains coherent context across modalities through intelligent information synthesis and retrieval, selectively avoiding unnecessary modality calls in certain scenarios to improve response speed. Extensive evaluations demonstrate that MLLM Orchestration achieves comprehensive multimodal capabilities without additional training, performance improvements of up to 7.8% over traditional jointly-trained approaches on standard benchmarks, reduced latency by 10.3%, and significantly enhanced interpretability through explicit orchestration processes.[21] A Rose by Any Other Name Would Smell as Sweet: Categorical Homotopy Theory for Large Language Models
Sridhar Mahadevan
Main category: cs.CL
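TL;DR: This paper discusses how large language models can generate consistent next-token probabilities for statements that have the same meaning but different wording. The author introduces a categorical homotopy framework to address weak equivalences among probability distributions in language models.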
Details
Motivation: Large language models (LLMs) often produce inconsistent next-token probabilities for sentences that share the same meaning but are phrased differently. This inconsistency calls for a more abstract approach to improve models' semantic understanding. Method: The paper introduces an LLM Markov category to represent probability distributions in language and uses categorical homotopy techniques to address the non-isomorphic arrows produced by equivalent rephrasings. The author also draws on theoretical tools such as higher algebraic K-theory and model categories. Result: The paper presents a detailed framework for applying categorical homotopy to LLMs that better captures weak equivalences in language models, providing a theoretical basis for handling semantic equivalence. Conclusion: By introducing categorical homotopy theory, the paper offers a new theoretical perspective and methodological foundation for addressing LLMs' inconsistency on semantically equivalent expressions. Abstract: Natural language is replete with superficially different statements, such as "Charles Darwin wrote" and "Charles Darwin is the author of", which carry the same meaning. Large language models (LLMs) should generate the same next-token probabilities in such cases, but usually do not. Empirical workarounds have been explored, such as using k-NN estimates of sentence similarity to produce smoothed estimates. In this paper, we tackle this problem more abstractly, introducing a categorical homotopy framework for LLMs. We introduce an LLM Markov category to represent probability distributions in language generated by an LLM, where the probability of a sentence, such as "Charles Darwin wrote" is defined by an arrow in a Markov category. However, this approach runs into difficulties as language is full of equivalent rephrases, and each generates a non-isomorphic arrow in the LLM Markov category. To address this fundamental problem, we use categorical homotopy techniques to capture "weak equivalences" in an LLM Markov category. We present a detailed overview of application of categorical homotopy to LLMs, from higher algebraic K-theory to model categories, building on powerful theoretical results developed over the past half a century.[22] Decoupling Understanding from Reasoning via Problem Space Mapping for Small-scale Model Reasoning
Li Wang,Changhao Zhang,Zengqi Xiu,Kai Lu,Xin Yu,Kui Zhang,Wenjun Wu
Main category: cs.CL
TL;DR: This paper proposes DURIT, a framework that separates understanding from reasoning in Small Language Models to improve their reasoning performance and robustness by mapping problems into a simplified canonical space and training them iteratively.
Details
Motivation: Improving reasoning capabilities in Small Language Models (SLMs) is challenging due to the complexity and variability of natural language, which forces SLMs to manage both linguistic understanding and reasoning, limiting their performance. Method: The paper introduces DURIT, a three-step algorithm that uses reinforcement learning to map natural language problems into a canonical problem space, aligns reasoning trajectories via self-distillation, and trains reasoning policies in this simplified space. The mapper and reasoner are co-trained iteratively. Result: Experiments show that DURIT significantly enhances SLMs' performance on mathematical and logical reasoning tasks, both in-domain and out-of-domain, while also improving the robustness of the reasoning process. Conclusion: DURIT improves the reasoning capabilities and robustness of Small Language Models by decoupling understanding from reasoning, offering a promising strategy for enhancing SLMs' performance on in-domain and out-of-domain tasks. Abstract: Despite recent advances in the reasoning capabilities of Large Language Models (LLMs), improving the reasoning ability of Small Language Models (SLMs, e.g., $\leq$ 1.5B) remains challenging. A key obstacle lies in the complexity and variability of natural language: essentially equivalent problems often appear in diverse surface forms, often obscured by redundant or distracting details. This imposes a dual burden on SLMs: they must first extract the core problem from complex linguistic input, and then perform reasoning based on that understanding. The resulting vast and noisy problem space hinders optimization, particularly for models with limited capacity. To address this, we propose a new framework that decouples understanding from reasoning by mapping natural language problems into a canonical problem space, a semantically simplified yet expressive domain. 
This enables SLMs to focus on reasoning over standardized inputs, free from linguistic variability. Within this framework, we introduce DURIT (Decoupled Understanding from Reasoning via Iterative Training), a three-step algorithm that iteratively: (1) maps natural language problems via reinforcement learning, (2) aligns reasoning trajectories through self-distillation, and (3) trains reasoning policies in the problem space. The mapper and reasoner are co-trained in an alternating loop throughout this process. Experiments show that DURIT substantially improves SLMs' performance on both in-domain and out-of-domain mathematical and logical reasoning tasks. Beyond improving reasoning capabilities, DURIT also improves the robustness of reasoning, validating decoupling understanding from reasoning as an effective strategy for strengthening SLMs.[23] FedCoT: Communication-Efficient Federated Reasoning Enhancement for Large Language Models
Chuan Li,Qianyi Zhao,Fengran Mo,Cen Chen
Main category: cs.CL
TL;DR: FedCoT is a framework designed to improve the reasoning capabilities of large language models in federated learning environments, especially for healthcare applications, by generating multiple reasoning paths and selecting the most promising one while preserving data privacy.
Details
Motivation: The motivation is to enhance the reasoning capabilities of large language models in federated learning environments without violating privacy, especially in healthcare contexts where interpretability and traceability are crucial. Method: FedCoT uses a lightweight chain-of-thought enhancement mechanism where local models generate multiple reasoning paths, and a compact discriminator selects the most promising one. It also uses an improved aggregation approach with LoRA module stacking for managing client heterogeneity. Result: Experiments showed that FedCoT significantly improves client-side reasoning performance under tight resource budgets while maintaining data privacy. Conclusion: FedCoT provides an efficient framework for enhancing reasoning in federated settings, particularly beneficial for medical applications while preserving data privacy. Abstract: Efficiently enhancing the reasoning capabilities of large language models (LLMs) in federated learning environments remains challenging, particularly when balancing performance gains with strict computational, communication, and privacy constraints. This challenge is especially acute in healthcare, where decisions (spanning clinical, operational, and patient-facing contexts) demand not only accurate outputs but also interpretable, traceable rationales to ensure safety, accountability, and regulatory compliance. Conventional federated tuning approaches on LLMs fail to address this need: they optimize primarily for answer correctness while neglecting rationale quality, leaving CoT capabilities dependent on models' innate pre-training abilities. Moreover, existing methods for improving rationales typically rely on privacy-violating knowledge distillation from centralized models. Additionally, the communication overhead in traditional federated fine-tuning on LLMs remains substantial. 
We address this gap by proposing FedCoT, a novel framework specifically designed to enhance reasoning in federated settings. FedCoT leverages a lightweight chain-of-thought enhancement mechanism: local models generate multiple reasoning paths, and a compact discriminator dynamically selects the most promising one. This approach improves reasoning accuracy and robustness while providing valuable interpretability, which is particularly critical for medical applications. To manage client heterogeneity efficiently, we adopt an improved aggregation approach building upon advanced LoRA module stacking, incorporating client classifier-awareness to achieve noise-free aggregation across diverse clients. Comprehensive experiments on medical reasoning tasks demonstrate that FedCoT significantly boosts client-side reasoning performance under stringent resource budgets while fully preserving data privacy.[24] LATTE: Learning Aligned Transactions and Textual Embeddings for Bank Clients
Egor Fadeev,Dzhambulat Mollaev,Aleksei Shestov,Dima Korolev,Omar Zoloev,Ivan Kireev,Andrey Savchenko,Maksim Makarenko
Main category: cs.CL
TL;DR: This paper proposes LATTE, a contrastive learning framework for learning embeddings from event sequences of clients' historic communications.
Details
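Motivation: Although large language models (LLMs) offer broad world knowledge, applying them directly to long event sequences is computationally expensive and impractical. A more efficient way to handle event-sequence data in financial applications is therefore needed. Method: The paper proposes LATTE, a contrastive learning framework that aligns raw event embeddings with semantic embeddings from frozen LLMs. Behavioral features are summarized into short prompts, embedded by the LLM, and used as supervision via a contrastive loss. Result: Experiments show that the method outperforms existing techniques for learning event sequence representations on real-world financial datasets and remains deployable in latency-sensitive environments. Conclusion: LATTE effectively aligns event embeddings with semantic embeddings through contrastive learning, significantly reducing inference cost and input size, which makes it suitable for practical financial applications. Abstract: Learning client embeddings from sequences of their historic communications is central to financial applications. While large language models (LLMs) offer general world knowledge, their direct use on long event sequences is computationally expensive and impractical in real-world pipelines. In this paper, we propose LATTE, a contrastive learning framework that aligns raw event embeddings with semantic embeddings from frozen LLMs. Behavioral features are summarized into short prompts, embedded by the LLM, and used as supervision via contrastive loss. The proposed approach significantly reduces inference cost and input size compared to conventional processing of the complete sequence by an LLM. We experimentally show that our method outperforms state-of-the-art techniques for learning event sequence representations on real-world financial datasets while remaining deployable in latency-sensitive environments.[25] Conformal P-Value in Multiple-Choice Question Answering Tasks with Provable Risk Control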
Yuanchang Ye
Main category: cs.CL
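TL;DR: This study integrates significance testing with conformal prediction in a new framework that improves the reliability of large language models in multiple-choice question answering.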
Details
Motivation: The deployment of large language models in domain-specific QA scenarios is seriously compromised by hallucination and nonfactual generation, so the reliability of their responses needs to improve. Method: The method integrates p-value computation with conformity scoring, computes option frequencies through self-consistency resampling, and then constructs prediction sets via null hypothesis testing. Result: Evaluations on the MMLU and MMLU-Pro benchmarks show that the enhanced conformal prediction achieves user-specified empirical miscoverage rates and that the test-set average prediction set size decreases monotonically as the risk level increases. Conclusion: The study proposes a conformal prediction framework based on significance testing, providing a principled statistical framework for reliably deploying large language models in multiple-choice question answering. Abstract: This study introduces a significance testing-enhanced conformal prediction (CP) framework to improve trustworthiness of large language models (LLMs) in multiple-choice question answering (MCQA). While LLMs have been increasingly deployed in disciplinary QA scenarios, hallucination and nonfactual generation substantially compromise response reliability. Although CP provides statistically rigorous marginal coverage guarantees for prediction sets, and significance testing offers established statistical rigor, their synergistic integration remains unexplored. To mitigate hallucination and factual inaccuracies, our framework integrates $p$-value computation with conformity scoring through self-consistency resampling of MCQA responses. This approach calculates option frequencies to address LLMs' black-box nature, subsequently constructing prediction sets via null hypothesis testing ($\mathcal{H}_0$) with empirically derived $p$-values. Evaluations on MMLU and MMLU-Pro benchmarks using off-the-shelf LLMs demonstrate: (1) The enhanced CP achieves user-specified empirical miscoverage rates; (2) Test-set average prediction set size (APSS) decreases monotonically with increasing risk levels ($\alpha$), validating APSS as an effective uncertainty metric. This work establishes a principled statistical framework for trustworthy LLM deployment in high-stakes QA applications.[26] RTTC: Reward-Guided Collaborative Test-Time Compute
J. Pablo Muñoz,Jinjie Yuan
Main category: cs.CL
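TL;DR: RTTC is a framework that adaptively selects the best TTC strategy, improving the inference performance and efficiency of large language models.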
Details
Motivation: Conventional TTC strategies vary in effectiveness across queries, and applying them indiscriminately increases computational overhead. Method: The paper introduces the RTTC framework, which is built on a pretrained reward model, and proposes Query-State Caching to reduce redundant computation. Result: Experiments show that RTTC achieves higher accuracy than vanilla RAG or TTT across multiple LLMs and benchmarks. Conclusion: By adaptively selecting the most effective TTC strategy, RTTC improves the performance and efficiency of large language models at inference time. Abstract: Test-Time Compute (TTC) has emerged as a powerful paradigm for enhancing the performance of Large Language Models (LLMs) at inference, leveraging strategies such as Test-Time Training (TTT) and Retrieval-Augmented Generation (RAG). However, the optimal adaptation strategy varies across queries, and indiscriminate application of a TTC strategy incurs substantial computational overhead. In this work, we introduce Reward-Guided Test-Time Compute (RTTC), a novel framework that adaptively selects the most effective TTC strategy for each query via a pretrained reward model, maximizing downstream accuracy across diverse domains and tasks. RTTC operates in a distributed server-client architecture, retrieving relevant samples from a remote knowledge base and applying RAG or lightweight fine-tuning on client devices only when necessary. To further mitigate redundant computation, we propose Query-State Caching, which enables the efficient reuse of historical query states at both retrieval and adaptation levels. Extensive experiments across multiple LLMs and benchmarks demonstrate that RTTC consistently achieves superior accuracy compared to vanilla RAG or TTT, validating the necessity of adaptive, reward-guided TTC selection and the potential of RTTC for scalable, high-performance language model adaptation.[27] Detecting and explaining postpartum depression in real-time with generative artificial intelligence
Silvia García-Méndez,Francisco de Arriba-Pérez
Main category: cs.CL
TL;DR: This paper proposes an intelligent PPD screening system combining NLP, ML, and LLMs for real-time, affordable, and non-invasive detection of postpartum depression, achieving 90% accuracy while providing interpretable results for timely intervention.
Details
Motivation: The motivation is to address the critical need for rapid detection of postpartum depression (PPD) and its risk factors using advanced technologies, enabling real-time screening and interpretable treatment recommendations for timely and effective intervention. Method: The study utilizes Natural Language Processing, Machine Learning (tree-based algorithms), and Large Language Models to develop an intelligent PPD screening system. It integrates feature importance and natural language to enhance interpretability and address the black-box problem in predictions. Result: The system achieved a 90% accuracy in PPD detection across all evaluation metrics, outperforming existing solutions in the literature. Conclusion: The study concludes that the proposed intelligent PPD screening system effectively enables rapid, affordable, real-time, and non-invasive detection of postpartum depression and its associated risk factors, offering critical support for timely assessment and intervention. Abstract: Among the many challenges mothers undergo after childbirth, postpartum depression (PPD) is a severe condition that significantly impacts their mental and physical well-being. Consequently, the rapid detection of PPD and its associated risk factors is critical for in-time assessment and intervention through specialized prevention procedures. Accordingly, this work addresses the need to help practitioners make decisions with the latest technological advancements to enable real-time screening and treatment recommendations. Mainly, our work contributes to an intelligent PPD screening system that combines Natural Language Processing, Machine Learning (ML), and Large Language Models (LLMs) towards an affordable, real-time, and non-invasive free speech analysis. 
Moreover, it addresses the black box problem since the predictions are described to the end users thanks to the combination of LLMs with interpretable ML models (i.e., tree-based algorithms) using feature importance and natural language. The results obtained are 90% on PPD detection for all evaluation metrics, outperforming the competing solutions in the literature. Ultimately, our solution contributes to the rapid detection of PPD and its associated risk factors, critical for in-time and proper assessment and intervention.[28] SABER: Switchable and Balanced Training for Efficient LLM Reasoning
Kai Zhao,Yanjun Zhao,Jiaming Song,Shien He,Lusheng Zhang,Qiang Zhang,Tianjiao Li
Main category: cs.CL
TL;DR: SABER is a reinforcement learning framework for large language models that enables efficient, token-budgeted reasoning with flexible trade-offs between latency and accuracy, demonstrating strong performance on math, code generation, and logical reasoning tasks.
Details
Motivation: Large language models using chain-of-thought reasoning achieve high accuracy but suffer from high inference costs and latency when applied uniformly. SABER addresses this issue by enabling efficient and user-controllable reasoning based on token budgets. Method: The authors propose SABER, which uses reinforcement learning to train large language models (LLMs) with token-budgeted reasoning. It profiles token usage, assigns examples to budget tiers, and employs length-aware rewards during fine-tuning. The framework also includes no-think examples and supports four inference modes: NoThink, FastThink, CoreThink, and DeepThink. Result: Extensive evaluations on math reasoning (MATH, GSM8K), code generation (MBPP), and logical reasoning (LiveBench-Reasoning) show that SABER achieves high accuracy under tight budgets, graceful degradation, and effective cross-scale and cross-domain generalization. SABER-FastThink reduces reasoning length by 65.4% and improves accuracy by 3.6% on the MATH benchmark compared to the base model. Conclusion: SABER provides a reinforcement learning framework for efficient and controllable reasoning in large language models, achieving accuracy under tight token budgets while enabling flexible trade-offs between latency and reasoning depth. Abstract: Large language models (LLMs) empowered by chain-of-thought reasoning have achieved impressive accuracy on complex tasks but suffer from excessive inference costs and latency when applied uniformly to all problems. We propose SABER (Switchable and Balanced Training for Efficient LLM Reasoning), a reinforcement learning framework that endows LLMs with user-controllable, token-budgeted reasoning. SABER first profiles each training example's base-model thinking token usage and assigns it to one of the predefined budget tiers. During fine-tuning, the model is guided by system prompts and length-aware rewards to respect its assigned budget. 
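The tier assignment and length-aware reward described above might look roughly like the following sketch. The tier limits, the penalty shape, and the function names are all hypothetical; the abstract does not specify them, so this is an illustration under stated assumptions rather than SABER's actual reward.

```python
from bisect import bisect_left

# Hypothetical budget tiers (maximum thinking tokens); not taken from the paper.
TIER_LIMITS = [256, 1024, 4096]


def assign_tier(base_tokens: int) -> int:
    """Return the index of the smallest tier whose limit covers base_tokens.

    Examples that exceed every limit fall into the largest tier.
    """
    idx = bisect_left(TIER_LIMITS, base_tokens)
    return min(idx, len(TIER_LIMITS) - 1)


def length_aware_reward(correct: bool, used_tokens: int, tier: int,
                        penalty_weight: float = 0.5) -> float:
    """Correctness reward minus a penalty that grows with budget overshoot.

    The penalty is the overshoot relative to the tier limit, capped at 1
    (an assumed shape, chosen only to keep the example simple).
    """
    limit = TIER_LIMITS[tier]
    overshoot = max(0, used_tokens - limit)
    penalty = min(1.0, overshoot / limit)
    return (1.0 if correct else 0.0) - penalty_weight * penalty


tier = assign_tier(300)                          # falls in the 1024-token tier
reward = length_aware_reward(True, 1536, tier)   # correct, 50% over budget
```

In this toy setup a correct answer that stays within budget earns the full reward, while overshooting the assigned tier is penalised in proportion to how far the budget was exceeded.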
In parallel, we incorporate no-think examples to ensure the model remains reliable even when explicit reasoning is turned off. SABER further supports four discrete inference modes (NoThink, FastThink, CoreThink, and DeepThink), enabling flexible trade-offs between latency and reasoning depth. Extensive evaluations on math reasoning (MATH, GSM8K), code generation (MBPP), and logical reasoning (LiveBench-Reasoning) demonstrate that SABER achieves high accuracy under tight budgets, graceful degradation, and effective cross-scale and cross-domain generalization. In particular, SABER-FastThink cuts reasoning length by 65.4% and yields a 3.6% accuracy gain compared with the base model on the MATH benchmark.
[29] LLMCARE: Alzheimer's Detection via Transformer Models Enhanced by LLM-Generated Synthetic Data
Ali Zolnour,Hossein Azadmaleki,Yasaman Haghbin,Fatemeh Taherinezhad,Mohamad Javad Momeni Nezhad,Sina Rashidi,Masoud Khani,AmirSajjad Taleban,Samin Mahdizadeh Sani,Maryam Dadkhah,James M. Noble,Suzanne Bakken,Yadollah Yaghoobzadeh,Abdol-Hossein Vahabie,Masoud Rouhizadeh,Maryam Zolnoori
Main category: cs.CL
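TL;DR: This paper proposes a new screening pipeline that combines transformer embeddings with linguistic features to detect Alzheimer's disease and related dementias (ADRD) from speech. It augments the training data with synthetic speech generated by large language models and evaluates unimodal and multimodal classifiers; the results show that the approach performs well at detecting ADRD.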
Details
Motivation: Alzheimer's disease and related dementias affect approximately five million older adults in the U.S., yet more than half remain undiagnosed. Speech-based natural language processing offers a promising, scalable way to detect early cognitive decline through linguistic markers. Method: The authors develop and evaluate a screening pipeline that combines transformer embeddings with handcrafted linguistic features, test data augmentation with synthetic speech generated by large language models, and evaluate unimodal and multimodal LLM classifiers. Result: The fusion model achieved F1 = 83.3 (AUC = 89.5), outperforming linguistic-only and transformer-only baselines. Augmenting the training data with 2x MedAlpaca-7B synthetic speech raised F1 to 85.7. Fine-tuning significantly improved unimodal LLM classifiers. Current multimodal models performed worse. Conclusion: Combining transformer embeddings with linguistic features improves the accuracy of ADRD detection from speech. Clinically tuned LLMs effectively support both classification and data augmentation, while multimodal modeling needs further improvement. Abstract: Alzheimer's disease and related dementias (ADRD) affect approximately five million older adults in the U.S., yet over half remain undiagnosed. Speech-based natural language processing (NLP) offers a promising, scalable approach to detect early cognitive decline through linguistic markers. This study develops and evaluates a screening pipeline that (i) fuses transformer embeddings with handcrafted linguistic features, (ii) tests data augmentation using synthetic speech generated by large language models (LLMs), and (iii) benchmarks unimodal and multimodal LLM classifiers for ADRD detection. Transcripts from the DementiaBank "cookie-theft" task (n = 237) were used. Ten transformer models were evaluated under three fine-tuning strategies. A fusion model combined embeddings from the top-performing transformer with 110 lexical-derived linguistic features. Five LLMs (LLaMA-8B/70B, MedAlpaca-7B, Ministral-8B, GPT-4o) were fine-tuned to generate label-conditioned synthetic speech, which was used to augment training data. Three multimodal models (GPT-4o, Qwen-Omni, Phi-4) were tested for speech-text classification in zero-shot and fine-tuned settings. The fusion model achieved F1 = 83.3 (AUC = 89.5), outperforming linguistic or transformer-only baselines. Augmenting training data with 2x MedAlpaca-7B synthetic speech increased F1 to 85.7. Fine-tuning significantly improved unimodal LLM classifiers (e.g., MedAlpaca: F1 = 47.3 -> 78.5). Current multimodal models demonstrated lower performance (GPT-4o: F1 = 70.2; Qwen: F1 = 66.0). 
Performance gains aligned with the distributional similarity between synthetic and real speech. Integrating transformer embeddings with linguistic features enhances ADRD detection from speech. Clinically tuned LLMs effectively support both classification and data augmentation, while further advancement is needed in multimodal modeling.
[30] PREF: Reference-Free Evaluation of Personalised Text Generation in LLMs
Xiao Fu,Hossein A. Rahmani,Bin Wu,Jerome Ramos,Emine Yilmaz,Aldo Lipani
Main category: cs.CL
TL;DR: PREF is a reference-free framework for evaluating personalized text generation by combining universal quality metrics with user-specific preferences.
Details
Motivation: Traditional evaluation methods ignore user individuality, limiting the effectiveness of personalized text generation systems. Method: PREF operates in three steps: coverage (generating universal quality criteria), preference (personalizing criteria using user profiles), and scoring (applying LLM judges to rate outputs). Result: Experiments on PrefEval show that PREF outperforms baselines in accuracy, calibration, and alignment with human judgments. Conclusion: PREF provides a reliable, scalable, and interpretable framework for evaluating personalized text generation, enabling better alignment with user preferences and improved system development. Abstract: Personalised text generation is essential for user-centric information systems, yet most evaluation methods overlook the individuality of users. We introduce PREF, a Personalised Reference-free Evaluation Framework that jointly measures general output quality and user-specific alignment without requiring gold personalised references. PREF operates in a three-step pipeline: (1) a coverage stage uses a large language model (LLM) to generate a comprehensive, query-specific guideline covering universal criteria such as factuality, coherence, and completeness; (2) a preference stage re-ranks and selectively augments these factors using the target user's profile, stated or inferred preferences, and context, producing a personalised evaluation rubric; and (3) a scoring stage applies an LLM judge to rate candidate answers against this rubric, ensuring baseline adequacy while capturing subjective priorities. This separation of coverage from preference improves robustness, transparency, and reusability, and allows smaller models to approximate the personalised quality of larger ones. 
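A minimal sketch of how the preference and scoring stages could fit together is given below. It assumes the coverage stage has already produced a list of universal criteria, and it stands in for the LLM judge with a plain callable; `personalise`, `score`, and the weighting scheme are hypothetical names and choices, not PREF's published interface.

```python
from typing import Callable, Dict, List

# A judge maps (answer, criterion) to a score in [0, 1]; in practice this
# would be an LLM call, which is stubbed out here.
Judge = Callable[[str, str], float]


def personalise(criteria: List[str], user_weights: Dict[str, float],
                default_weight: float = 1.0) -> Dict[str, float]:
    """Preference stage: weight universal criteria and add user-specific ones."""
    rubric = {c: user_weights.get(c, default_weight) for c in criteria}
    for criterion, weight in user_weights.items():
        rubric.setdefault(criterion, weight)  # augment with extra factors
    # Re-rank so that the highest-priority criteria come first.
    return dict(sorted(rubric.items(), key=lambda kv: -kv[1]))


def score(answer: str, rubric: Dict[str, float], judge: Judge) -> float:
    """Scoring stage: weighted mean of per-criterion judge scores."""
    total = sum(rubric.values())
    return sum(w * judge(answer, c) for c, w in rubric.items()) / total


universal = ["factuality", "coherence"]             # assumed coverage output
profile = {"coherence": 2.0, "brevity": 3.0}        # assumed user preferences
rubric = personalise(universal, profile)
toy_judge = lambda answer, criterion: 1.0 if criterion == "brevity" else 0.0
result = score("a short answer", rubric, toy_judge)
```

Keeping the rubric as explicit data is what makes the separation of coverage from preference reusable: the same universal criteria can be re-weighted for a different user without regenerating them.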
Experiments on the PrefEval benchmark, including implicit preference-following tasks, show that PREF achieves higher accuracy, better calibration, and closer alignment with human judgments than strong baselines. By enabling scalable, interpretable, and user-aligned evaluation, PREF lays the groundwork for more reliable assessment and development of personalised language generation systems.
[31] Latent Fusion Jailbreak: Blending Harmful and Harmless Representations to Elicit Unsafe LLM Outputs
Wenpeng Xing,Mohan Li,Chunqiang Hu,Haitao Xu,Ningyu Zhang,Bo Lin,Meng Han
Main category: cs.CL
TL;DR: This paper introduces Latent Fusion Jailbreak (LFJ), a new attack method that exploits hidden state interpolation to bypass safety measures in large language models (LLMs), achieving a 94.01% attack success rate. An adversarial training defense is also proposed to reduce the effectiveness of LFJ by over 80%.
Details
Motivation: The motivation is to demonstrate vulnerabilities in large language models (LLMs) by introducing a new representation-based attack method that circumvents safety alignments. Method: The paper introduces Latent Fusion Jailbreak (LFJ), which interpolates hidden states from harmful and benign query pairs to create effective jailbreak attacks. It also proposes an adversarial training defense to mitigate such attacks. Result: Evaluations show that LFJ achieves an average attack success rate (ASR) of 94.01% on models like Vicuna and LLaMA-2, outperforming existing methods. The adversarial training defense reduces ASR by over 80%. Conclusion: The paper concludes that LFJ is a highly effective jailbreak attack method, and the proposed adversarial training defense can significantly reduce its impact without affecting model performance on benign inputs. Abstract: Large language models (LLMs) demonstrate impressive capabilities in various language tasks but are susceptible to jailbreak attacks that circumvent their safety alignments. This paper introduces Latent Fusion Jailbreak (LFJ), a representation-based attack that interpolates hidden states from harmful and benign query pairs to elicit prohibited responses. LFJ begins by selecting query pairs with high thematic and syntactic similarity, then performs gradient-guided interpolation at influential layers and tokens, followed by optimization to balance attack success, output fluency, and computational efficiency. Evaluations on models such as Vicuna and LLaMA-2 across benchmarks like AdvBench and MaliciousInstruct yield an average attack success rate (ASR) of 94.01%, outperforming existing methods. To mitigate LFJ, we propose an adversarial training defense that fine-tunes models on interpolated examples, reducing ASR by over 80% without degrading performance on benign inputs. 
Ablation studies validate the importance of query pair selection, hidden state interpolation components, and optimization strategies in LFJ's effectiveness.
[32] Inference-Aware Prompt Optimization for Aligning Black-Box Large Language Models
Saaduddin Mahmud,Mason Nakamura,Kyle H. Wray,Shlomo Zilberstein
Main category: cs.CL
TL;DR: This paper introduces an inference-aware prompt optimization framework called IAPO and its algorithm PSST, showing their effectiveness in aligning black-box language models while managing performance trade-offs.
Details
Motivation: The motivation stems from the identified gap where current prompt optimization methods do not consider the inference strategy, affecting alignment and performance trade-offs. Method: The authors propose a framework called IAPO for joint optimization of prompts and inference scale, along with a training algorithm named PSST that considers fixed inference budgets. Result: PSST was evaluated on six tasks, demonstrating the importance of inference-aware prompt optimization in enhancing alignment and managing trade-offs in objectives and budgets. Conclusion: The study concludes that incorporating inference-awareness in prompt optimization is crucial for aligning black-box LLMs effectively. Abstract: Prompt optimization methods have demonstrated significant effectiveness in aligning black-box large language models (LLMs). In parallel, inference scaling strategies such as Best-of-N Sampling and Majority Voting have also proven to enhance alignment and performance by trading off computation. However, existing prompt optimization approaches are inference strategy agnostic; that is, they optimize prompts without regard to the inference strategy employed during deployment. This constitutes a significant methodological gap, as our empirical and theoretical analysis reveals a strong interdependence between these two paradigms. Moreover, we find that user preferences regarding trade-offs among multiple objectives and inference budgets substantially influence the choice of prompt and inference configuration. To address this gap, we introduce a unified novel framework named IAPO (Inference-Aware Prompt Optimization) that jointly optimizes the prompt and inference scale, while being aware of the inference budget and different task objectives. We then develop a fixed-budget training algorithm for IAPO, which we call PSST (Prompt Scaling via Sequential Trimming), and analyze finite-budget guarantees on error probability. 
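For readers unfamiliar with the inference strategies named above, the sketch below shows Best-of-N sampling and majority voting, followed by a brute-force joint choice of prompt and inference scale under a budget. The `pick_config` helper, its `utility` callable, and the cost model are hypothetical and only illustrate why the two choices interact; this is not the PSST algorithm.

```python
from collections import Counter
from typing import Callable, List, Sequence, Tuple


def best_of_n(candidates: List[str], reward: Callable[[str], float]) -> str:
    """Best-of-N: keep the sampled response with the highest reward."""
    return max(candidates, key=reward)


def majority_vote(candidates: List[str]) -> str:
    """Majority voting: return the most frequent sampled answer."""
    return Counter(candidates).most_common(1)[0][0]


def pick_config(prompts: Sequence[str], scales: Sequence[int],
                utility: Callable[[str, int], float],
                cost_per_sample: float, budget: float) -> Tuple[str, int]:
    """Exhaustively choose the (prompt, N) pair with the best estimated
    utility among those whose sampling cost fits the budget."""
    feasible = [(p, n) for p in prompts for n in scales
                if n * cost_per_sample <= budget]
    return max(feasible, key=lambda pn: utility(*pn))


# Toy utility: prompt "p2" is better, and more samples help a little.
toy_utility = lambda prompt, n: (1.0 if prompt == "p2" else 0.0) + 0.1 * n
choice = pick_config(["p1", "p2"], [1, 4, 16], toy_utility,
                     cost_per_sample=1.0, budget=5.0)
```

Exhaustive search like this needs a utility estimate for every pair, which is what makes a budget-aware training procedure attractive once the number of prompts and scales grows.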
Finally, we evaluate the effectiveness of PSST on six different tasks, including multi-objective text generation and reasoning, and demonstrate the critical role of incorporating inference-awareness when aligning black-box LLMs through prompt optimization.
[33] The Cost of Thinking: Increased Jailbreak Risk in Large Language Models
Fan Yang
Main category: cs.CL
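TL;DR: The study finds that LLMs in thinking mode are more vulnerable to jailbreak attacks and proposes a safety intervention that lowers the attack success rate by adding specific thinking tokens to the prompt.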
Details
Motivation: LLMs with thinking mode are found to be more susceptible to jailbreak attacks, and successfully attacked samples are characterized by claimed educational purposes and excessively long thinking. Method: The approach explicitly guides the LLM's internal thinking process by adding "specific thinking tokens" to the prompt. Result: Nine LLMs were evaluated on AdvBench and HarmBench; the attack success rate against thinking mode was higher than against non-thinking mode in almost all cases. Conclusion: The proposed safe thinking intervention can significantly reduce the attack success rate of LLMs with thinking mode. Abstract: Thinking mode has always been regarded as one of the most valuable modes in LLMs. However, we uncover a surprising and previously overlooked phenomenon: LLMs with thinking mode are more easily broken by jailbreak attacks. We evaluate 9 LLMs on AdvBench and HarmBench and find that the success rate of attacking thinking mode in LLMs is almost always higher than that of non-thinking mode. Through a large number of sample studies, we find that claimed educational purposes and excessively long thinking are characteristic of successfully attacked data, and that LLMs give harmful answers even when they mostly know that the questions are harmful. To alleviate these problems, this paper proposes a safe thinking intervention for LLMs, which explicitly guides the internal thinking processes of LLMs by adding "specific thinking tokens" of LLMs to the prompt. The results demonstrate that the safe thinking intervention can significantly reduce the attack success rate of LLMs with thinking mode.
[34] Reflect then Learn: Active Prompting for Information Extraction Guided by Introspective Confusion
Dong Zhao,Yadong Wang,Xiang Chen,Chenxi Wang,Hongliang Dai,Chuanxing Geng,Shengzhong Zhang,Shaoyuan Li,Sheng-Jun Huang
Main category: cs.CL
TL;DR: This paper introduces APIE, a framework that improves the performance of Large Language Models in information extraction by addressing both semantic and structural uncertainties.
Details
Motivation: The motivation stems from the sensitivity of Large Language Models to in-context examples and the lack of guidance from conventional selection strategies in addressing both semantic and structural confusion in information extraction tasks. Method: The researchers developed the APIE framework, which uses a dual-component uncertainty metric to assess and rank unlabeled data for active sample selection. Result: Experiments showed that the APIE framework outperforms strong baselines, significantly improving extraction accuracy and robustness. Conclusion: The study concludes that the APIE framework effectively enhances the performance of Large Language Models in information extraction tasks by addressing both format and content uncertainties. Abstract: Large Language Models (LLMs) show remarkable potential for few-shot information extraction (IE), yet their performance is highly sensitive to the choice of in-context examples. Conventional selection strategies often fail to provide informative guidance, as they overlook a key source of model fallibility: confusion stemming not just from semantic content, but also from the generation of well-structured formats required by IE tasks. To address this, we introduce Active Prompting for Information Extraction (APIE), a novel active prompting framework guided by a principle we term introspective confusion. Our method empowers an LLM to assess its own confusion through a dual-component uncertainty metric that uniquely quantifies both Format Uncertainty (difficulty in generating correct syntax) and Content Uncertainty (inconsistency in extracted semantics). By ranking unlabeled data with this comprehensive score, our framework actively selects the most challenging and informative samples to serve as few-shot exemplars. Extensive experiments on four benchmarks show that our approach consistently outperforms strong baselines, yielding significant improvements in both extraction accuracy and robustness. 
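One way to picture the dual-component score is the sketch below, which samples several outputs per unlabeled input and measures how often they fail to parse (format) and how much the parsed results disagree (content). The JSON output format, the specific formulas, and the unweighted sum used for ranking are assumptions made for illustration; the abstract does not give APIE's exact definitions.

```python
import json
from collections import Counter
from typing import Dict, List, Tuple


def confusion_score(samples: List[str]) -> Tuple[float, float]:
    """Return (format_uncertainty, content_uncertainty) for sampled outputs.

    Format uncertainty: share of samples that are not valid JSON.
    Content uncertainty: 1 minus the share of the most common parsed result.
    """
    parsed = []
    for s in samples:
        try:
            # Canonicalise so that key order does not count as disagreement.
            parsed.append(json.dumps(json.loads(s), sort_keys=True))
        except json.JSONDecodeError:
            pass
    format_unc = 1 - len(parsed) / len(samples)
    if parsed:
        top_count = Counter(parsed).most_common(1)[0][1]
        content_unc = 1 - top_count / len(parsed)
    else:
        content_unc = 1.0  # nothing parsed: treat content as fully uncertain
    return format_unc, content_unc


def select_exemplars(pool: Dict[str, List[str]], k: int) -> List[str]:
    """Rank unlabeled inputs by total confusion and keep the top k."""
    return sorted(pool, key=lambda x: -sum(confusion_score(pool[x])))[:k]


pool = {
    "easy sentence": ['{"a": 1}', '{"a": 1}', '{"a": 1}', '{"a": 1}'],
    "hard sentence": ['{"a": 1}', '{"a": 1}', '{"a": 2}', 'oops'],
}
chosen = select_exemplars(pool, k=1)
```

The selected inputs would then be labeled and used as few-shot exemplars, on the premise that examples the model finds confusing are the most informative ones to show it.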
Our work highlights the critical importance of a fine-grained, dual-level view of model uncertainty when it comes to building effective and reliable structured generation systems.
[35] mSCoRe: a $M$ultilingual and Scalable Benchmark for $S$kill-based $Co$mmonsense $Re$asoning
Nghia Trung Ngo,Franck Dernoncourt,Thien Huu Nguyen
Main category: cs.CL
TL;DR: This paper introduces mSCoRe, a new benchmark for evaluating multilingual commonsense reasoning in LLMs, highlighting their current limitations and suggesting future improvements.
Details
Motivation: The motivation is to investigate how reasoning-reinforced LLMs utilize different human reasoning skills, particularly in multilingual and multicultural commonsense reasoning, which is currently underexplored. Method: The paper proposes a benchmark called mSCoRe, which includes a taxonomy of reasoning skills, a data synthesis pipeline, and a complexity scaling framework to evaluate LLMs' reasoning capabilities. Result: Experiments on eight state-of-the-art LLMs show that mSCoRe is challenging for current models, especially at higher complexity levels, revealing their limitations in nuanced multilingual general and cultural commonsense reasoning. Conclusion: The paper concludes that while current reasoning-reinforced LLMs show promise, they still have significant limitations in multilingual commonsense reasoning, especially as task complexity increases. Abstract: Recent advancements in reasoning-reinforced Large Language Models (LLMs) have shown remarkable capabilities in complex reasoning tasks. However, the mechanism underlying their utilization of different human reasoning skills remains poorly investigated, especially for multilingual commonsense reasoning that involves everyday knowledge across different languages and cultures. To address this gap, we propose a Multilingual and Scalable Benchmark for Skill-based Commonsense Reasoning (mSCoRe). Our benchmark incorporates three key components that are designed to systematically evaluate LLMs' reasoning capabilities, including: (1) a novel taxonomy of reasoning skills that enables fine-grained analysis of models' reasoning processes, (2) a robust data synthesis pipeline tailored specifically for commonsense reasoning evaluation, and (3) a complexity scaling framework allowing task difficulty to scale dynamically alongside future improvements in LLM abilities. 
Extensive experiments on eight state-of-the-art LLMs of varying sizes and training approaches demonstrate that mSCoRe remains significantly challenging for current models, particularly at higher complexity levels. Our results reveal the limitations of such reasoning-reinforced models when confronted with nuanced multilingual general and cultural commonsense. We further provide detailed analysis on the models' reasoning processes, suggesting future directions for improving multilingual commonsense reasoning capabilities.
[36] Multi-Turn Puzzles: Evaluating Interactive Reasoning and Strategic Dialogue in LLMs
Kartikeya Badola,Jonathan Simon,Arian Hosseini,Sara Marie Mc Carthy,Tsendsuren Munkhdalai,Abhimanyu Goyal,Tomáš Kočiský,Shyam Upadhyay,Bahare Fatemi,Mehran Kazemi
Main category: cs.CL
TL;DR: This paper introduces a benchmark for evaluating multi-turn dialogue and reasoning abilities in LLMs, highlighting areas for improvement such as instruction following and planning.
Details
Motivation: LLMs often struggle with nuanced environments or interactive tasks common in real-world scenarios, necessitating the development of models that can engage in logically consistent multi-turn dialogue and reason with incomplete data. Method: A novel benchmark with multi-turn tasks was introduced to evaluate the reasoning, interactive dialogue, and information-seeking abilities of LLMs, using deterministic scoring mechanisms. Result: Evaluations of frontier models on the benchmark revealed significant headroom, with most errors attributed to poor instruction following, reasoning failures, and poor planning. Conclusion: The study concludes that there is significant room for improvement in current LLMs, particularly in instruction following, reasoning, and planning within complex, interactive scenarios. Abstract: Large language models (LLMs) excel at solving problems with clear and complete statements, but often struggle with nuanced environments or interactive tasks which are common in most real-world scenarios. This highlights the critical need for developing LLMs that can effectively engage in logically consistent multi-turn dialogue, seek information and reason with incomplete data. To this end, we introduce a novel benchmark comprising a suite of multi-turn tasks each designed to test specific reasoning, interactive dialogue, and information-seeking abilities. These tasks have deterministic scoring mechanisms, thus eliminating the need for human intervention. Evaluating frontier models on our benchmark reveals significant headroom. Our analysis shows that most errors emerge from poor instruction following, reasoning failures, and poor planning. This benchmark provides valuable insights into the strengths and weaknesses of current LLMs in handling complex, interactive scenarios and offers a robust platform for future research aimed at improving these critical capabilities.
[37] LaaJMeter: A Framework for LaaJ Evaluation
Gal Amram,Eitan Farchi,Shmulik Froimovich,Raviv Gal,Avi Ziv
Main category: cs.CL
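TL;DR: LaaJMeter is a simulation-based framework for controlled meta-evaluation of LLM-as-a-Judge (LaaJ) in natural language processing. It helps engineers validate and refine evaluation metrics for specific domains, improving the reliability and reproducibility of evaluation.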
Details
Motivation: In domain-specific settings, annotated data is scarce and expert evaluation is costly, so commonly used meta-evaluation metrics are not adequately validated, and it is hard to determine their effectiveness or suitable thresholds. A systematic way to assess LaaJ quality is therefore needed. Method: LaaJMeter generates synthetic data representing virtual models and judges, providing a controlled environment for systematically analyzing how evaluation metrics behave under realistic conditions and thereby validating and refining the LaaJ evaluation process. Result: Applying LaaJMeter to a code translation task shows that evaluation metrics differ markedly in their ability to identify high-quality LaaJs, underscoring the importance of choosing appropriate metrics. Conclusion: LaaJMeter provides a scalable and extensible solution for LaaJ evaluation in low-resource settings, supporting trustworthy and reproducible evaluation in NLP. Abstract: Large Language Models (LLMs) are increasingly used as evaluators in natural language processing tasks, a paradigm known as LLM-as-a-Judge (LaaJ). While effective in general domains, LaaJs pose significant challenges in domain-specific contexts, where annotated data is scarce and expert evaluation is costly. In such cases, meta-evaluation is often performed using metrics that have not been validated for the specific domain in which they are applied. As a result, it becomes difficult to determine which metrics effectively identify LaaJ quality, and further, what threshold indicates sufficient evaluator performance. In this work, we introduce LaaJMeter, a simulation-based framework for controlled meta-evaluation of LaaJs. LaaJMeter enables engineers to generate synthetic data representing virtual models and judges, allowing systematic analysis of evaluation metrics under realistic conditions. This helps practitioners validate and refine LaaJs for specific evaluation tasks: they can test whether their metrics correctly distinguish between better and worse (virtual) LaaJs, and estimate appropriate thresholds for evaluator adequacy. We demonstrate the utility of LaaJMeter in a code translation task involving a legacy programming language, showing how different metrics vary in sensitivity to evaluator quality. Our results highlight the limitations of common metrics and the importance of principled metric selection. 
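The idea of testing a metric against virtual judges of known quality can be illustrated with the toy simulation below. The Gaussian noise model, the pairwise-agreement metric, and all names are assumptions chosen for brevity; they are not LaaJMeter's actual simulation design.

```python
import random
from itertools import combinations
from typing import List


def virtual_judge(true_quality: List[float], noise: float,
                  rng: random.Random) -> List[float]:
    """Simulate a judge as the true quality plus Gaussian noise."""
    return [q + rng.gauss(0.0, noise) for q in true_quality]


def pairwise_agreement(truth: List[float], scores: List[float]) -> float:
    """Fraction of item pairs the judge orders the same way as the truth."""
    pairs = list(combinations(range(len(truth)), 2))
    agree = sum((truth[i] - truth[j]) * (scores[i] - scores[j]) > 0
                for i, j in pairs)
    return agree / len(pairs)


rng = random.Random(0)                        # fixed seed for repeatability
truth = [rng.random() for _ in range(50)]     # hidden quality of 50 outputs
good = pairwise_agreement(truth, virtual_judge(truth, 0.0, rng))
bad = pairwise_agreement(truth, virtual_judge(truth, 1.0, rng))
```

A metric is useful for meta-evaluation only if it separates the noiseless judge from the noisy one, as pairwise agreement does here; sweeping the noise level would show how sensitive a candidate metric is and where an adequacy threshold might sit.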
LaaJMeter provides a scalable and extensible solution for assessing LaaJs in low-resource settings, contributing to the broader effort to ensure trustworthy and reproducible evaluation in NLP.
[38] Estimating Machine Translation Difficulty
Lorenzo Proietti,Stefano Perrella,Vilém Zouhar,Roberto Navigli,Tom Kocmi
Main category: cs.CL
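TL;DR: This paper introduces the task of translation difficulty estimation, defines a text's difficulty in terms of translation quality, and presents a new evaluation metric.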
Details
Motivation: High-quality machine translation outputs make it difficult to distinguish between state-of-the-art models and to identify directions for future improvement. Method: The authors formalize the task of translation difficulty estimation and introduce a new metric for evaluating difficulty estimators. Result: Dedicated models (Sentinel-src) outperform both heuristic-based methods and LLM-as-a-judge approaches. Conclusion: The Sentinel-src-24 and Sentinel-src-25 models can be used to scan large collections of texts and select those most likely to challenge contemporary machine translation systems. Abstract: Machine translation has begun achieving near-perfect quality in some setups. These high-quality outputs make it difficult to distinguish between state-of-the-art models and to identify areas for future improvement. Automatically identifying texts where machine translation systems struggle holds promise for developing more discriminative evaluations and guiding future research. We formalize the task of translation difficulty estimation, defining a text's difficulty based on the expected quality of its translations. We introduce a new metric to evaluate difficulty estimators and use it to assess both baselines and novel approaches. Finally, we demonstrate the practical utility of difficulty estimators by using them to construct more challenging machine translation benchmarks. Our results show that dedicated models (dubbed Sentinel-src) outperform both heuristic-based methods (e.g. word rarity or syntactic complexity) and LLM-as-a-judge approaches. We release two improved models for difficulty estimation, Sentinel-src-24 and Sentinel-src-25, which can be used to scan large collections of texts and select those most likely to challenge contemporary machine translation systems.
[39] Efficient Forward-Only Data Valuation for Pretrained LLMs and VLMs
Wenlong Deng,Jiaming Zhang,Qi Zeng,Christos Thrampoulidis,Boying Gong,Xiaoxiao Li
Main category: cs.CL
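TL;DR: For-Value is an efficient data valuation framework that computes the influence of each training sample from a single forward pass and applies to both large language models and vision-language models.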
Details
Motivation: Quantifying the influence of individual training samples is needed to improve the transparency and accountability of large language models (LLMs) and vision-language models (VLMs), but existing data valuation methods are too computationally expensive. Method: For-Value is a forward-only data valuation framework that leverages the rich representations of modern foundation models and computes influence scores with a simple closed-form expression, requiring only a single forward pass. Result: For-Value matches or outperforms gradient-based baselines at identifying influential fine-tuning examples and effectively detecting mislabeled data. Conclusion: For-Value is an effective data valuation framework that accurately estimates per-sample influence, helping to identify important fine-tuning samples and detect mislabeled data. Abstract: Quantifying the influence of individual training samples is essential for enhancing the transparency and accountability of large language models (LLMs) and vision-language models (VLMs). However, existing data valuation methods often rely on Hessian information or model retraining, making them computationally prohibitive for billion-parameter models. In this work, we introduce For-Value, a forward-only data valuation framework that enables scalable and efficient influence estimation for both LLMs and VLMs. By leveraging the rich representations of modern foundation models, For-Value computes influence scores using a simple closed-form expression based solely on a single forward pass, thereby eliminating the need for costly gradient computations. Our theoretical analysis demonstrates that For-Value accurately estimates per-sample influence by capturing alignment in hidden representations and prediction errors between training and validation samples. Extensive experiments show that For-Value matches or outperforms gradient-based baselines in identifying impactful fine-tuning examples and effectively detecting mislabeled data.
[40] PakBBQ: A Culturally Adapted Bias Benchmark for QA
Abdullah Hashmat,Muhammad Arham Mirza,Agha Ali Raza
Main category: cs.CL
TL;DR: This paper introduces PakBBQ, a culturally and regionally adapted extension of the original Bias Benchmark for Question Answering (BBQ) dataset, to address the lack of attention paid to low-resource languages and regional contexts in training and evaluating LLMs. The experiments show a 12% average accuracy gain with disambiguation, stronger counter-bias behavior in Urdu than in English, and framing effects that reduce stereotypical responses when questions are posed negatively.
Details
Motivation: With the widespread adoption of Large Language Models (LLMs) across various applications, it is imperative to ensure their fairness across all user communities. However, most LLMs are trained and evaluated on Western-centric data, with little attention paid to low-resource languages and regional contexts. Method: We evaluate multiple multilingual LLMs under both ambiguous and explicitly disambiguated contexts, as well as negative versus non-negative question framings. Result: Our experiments reveal (i) an average accuracy gain of 12% with disambiguation, (ii) consistently stronger counter-bias behaviors in Urdu than in English, and (iii) marked framing effects that reduce stereotypical responses when questions are posed negatively. Conclusion: These findings highlight the importance of contextualized benchmarks and simple prompt engineering strategies for bias mitigation in low-resource settings. Abstract: With the widespread adoption of Large Language Models (LLMs) across various applications, it is imperative to ensure their fairness across all user communities. However, most LLMs are trained and evaluated on Western-centric data, with little attention paid to low-resource languages and regional contexts. To address this gap, we introduce PakBBQ, a culturally and regionally adapted extension of the original Bias Benchmark for Question Answering (BBQ) dataset. PakBBQ comprises over 214 templates and 17,180 QA pairs in both English and Urdu, covering eight bias dimensions relevant in Pakistan: age, disability, appearance, gender, socio-economic status, religion, regional affiliation, and language formality. We evaluate multiple multilingual LLMs under both ambiguous and explicitly disambiguated contexts, as well as negative versus non-negative question framings. 
Our experiments reveal (i) an average accuracy gain of 12% with disambiguation, (ii) consistently stronger counter-bias behaviors in Urdu than in English, and (iii) marked framing effects that reduce stereotypical responses when questions are posed negatively. These findings highlight the importance of contextualized benchmarks and simple prompt engineering strategies for bias mitigation in low-resource settings.
[41] Prompt-Response Semantic Divergence Metrics for Faithfulness Hallucination and Misalignment Detection in Large Language Models
Igor Halperin
Main category: cs.CL
TL;DR: The paper introduces Semantic Divergence Metrics (SDM), a framework to detect hallucinations in LLMs by measuring semantic divergence between prompts and responses using clustering and information-theoretic metrics.
Details
Motivation: The motivation is to address the challenge of hallucinations in Large Language Models (LLMs), particularly Faithfulness Hallucinations where responses deviate severely from input contexts. Method: The method involves using joint clustering on sentence embeddings to create a shared topic space for prompts and answers, and computing information-theoretic metrics such as Jensen-Shannon divergence, Wasserstein distance, and KL divergence to measure semantic divergence between prompts and responses. Result: The result is a practical score, $\mathcal{S}_H$, that quantifies semantic divergence, and the Semantic Box, a diagnostic framework for classifying LLM response types, including confident confabulations. Conclusion: The paper concludes that SDM is a more prompt-aware and effective framework for detecting Faithfulness Hallucinations in LLMs compared to existing methods. Abstract: The proliferation of Large Language Models (LLMs) is challenged by hallucinations, critical failure modes where models generate non-factual, nonsensical or unfaithful text. This paper introduces Semantic Divergence Metrics (SDM), a novel lightweight framework for detecting Faithfulness Hallucinations -- events of severe deviations of LLMs responses from input contexts. We focus on a specific implementation of these LLM errors, {confabulations, defined as responses that are arbitrary and semantically misaligned with the user's query. Existing methods like Semantic Entropy test for arbitrariness by measuring the diversity of answers to a single, fixed prompt. Our SDM framework improves upon this by being more prompt-aware: we test for a deeper form of arbitrariness by measuring response consistency not only across multiple answers but also across multiple, semantically-equivalent paraphrases of the original prompt. Methodologically, our approach uses joint clustering on sentence embeddings to create a shared topic space for prompts and answers. 
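To make the divergence measures named in the summary concrete, the sketch below computes the KL and Jensen-Shannon divergences between two topic distributions. It assumes the joint clustering step has already assigned each prompt sentence and each answer sentence a topic label; the helper names and the smoothing constant are illustrative, and the combined score $\mathcal{S}_H$ is not reproduced because its exact form is not given here.

```python
import math
from collections import Counter
from typing import List, Sequence


def topic_distribution(labels: Sequence[int], n_topics: int) -> List[float]:
    """Turn per-sentence cluster labels into a probability vector."""
    counts = Counter(labels)
    return [counts[t] / len(labels) for t in range(n_topics)]


def kl_divergence(p: Sequence[float], q: Sequence[float],
                  eps: float = 1e-12) -> float:
    """KL(p || q) in nats; eps guards against division by zero."""
    return sum(pi * math.log((pi + eps) / (qi + eps))
               for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p: Sequence[float], q: Sequence[float]) -> float:
    """Jensen-Shannon divergence: symmetric and bounded by ln 2."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


prompt_topics = topic_distribution([0, 0, 1, 1], n_topics=3)   # assumed labels
answer_topics = topic_distribution([2, 2, 2, 2], n_topics=3)   # off-topic answer
divergence = js_divergence(prompt_topics, answer_topics)
```

An answer that occupies topics the prompt never touches drives the Jensen-Shannon divergence towards its maximum of ln 2, which is the kind of signal a faithfulness check would flag.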
A heatmap of topic co-occurrences between prompts and responses can be viewed as a quantified two-dimensional visualization of the user-machine dialogue. We then compute a suite of information-theoretic metrics to measure the semantic divergence between prompts and responses. Our practical score, $\mathcal{S}_H$, combines the Jensen-Shannon divergence and Wasserstein distance to quantify this divergence, with a high score indicating a Faithfulness Hallucination. Furthermore, we identify the KL divergence KL(Answer $\|$ Prompt) as a powerful indicator of Semantic Exploration, a key signal for distinguishing different generative behaviors. These metrics are further combined into the Semantic Box, a diagnostic framework for classifying LLM response types, including the dangerous, confident confabulation.
[42] Understanding Textual Emotion Through Emoji Prediction
Ethan Gordon,Nishank Kuppa,Rigved Tummala,Sriram Anasuri
Main category: cs.CL
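TL;DR: This project studies how well deep learning models predict emojis from short text, compares four architectures, and identifies the respective strengths of BERT and CNN across emoji classes.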
Details
Motivation: To explore methods for predicting emojis from short text sequences in order to improve human-computer interaction. Method: The project uses four deep learning architectures for emoji prediction (a feed-forward network, CNN, Transformer, and BERT) and addresses class imbalance on the TweetEval dataset through focal loss and regularization techniques. Result: BERT achieves the highest overall performance thanks to its pre-training advantage, while CNN is more effective on rare emoji classes. Conclusion: BERT performs best overall while CNN excels on rare emoji classes, which shows that architecture selection and hyperparameter tuning are critical for sentiment-aware emoji prediction. Abstract: This project explores emoji prediction from short text sequences using four deep learning architectures: a feed-forward network, CNN, transformer, and BERT. Using the TweetEval dataset, we address class imbalance through focal loss and regularization techniques. Results show BERT achieves the highest overall performance due to its pre-training advantage, while CNN demonstrates superior efficacy on rare emoji classes. This research shows the importance of architecture selection and hyperparameter tuning for sentiment-aware emoji prediction, contributing to improved human-computer interaction.
[43] Using Large Language Models to Measure Symptom Severity in Patients At Risk for Schizophrenia
Andrew X. Chen,Guillermo Horga,Sean Escola
Main category: cs.CL
TL;DR: This study explores the use of large language models to predict BPRS scores from clinical interviews of patients at clinical high risk for schizophrenia, showing high accuracy and potential for standardized, efficient symptom monitoring in multiple languages and longitudinal settings.
Details
Motivation: The BPRS, while a validated tool for measuring symptoms in schizophrenia, is underused in clinical practice due to its time-consuming nature; thus, leveraging LLMs can streamline and standardize symptom monitoring for CHR patients. Method: LLMs were used to predict BPRS scores from clinical interview transcripts of 409 CHR patients from the AMP-SCZ cohort, employing zero-shot, one-shot, and few-shot learning approaches, including assessments in foreign languages. Result: LLM predictions showed high concordance with true assessments (median concordance: 0.84, ICC: 0.73), approached human inter- and intra-rater reliability, and demonstrated accuracy in foreign languages (median concordance: 0.88, ICC: 0.70) and through longitudinal integration. Conclusion: Large language models (LLMs) demonstrate significant potential in accurately predicting BPRS scores from clinical interview transcripts of CHR patients, offering a promising tool for improving and standardizing symptom assessment in clinical practice. Abstract: Patients who are at clinical high risk (CHR) for schizophrenia need close monitoring of their symptoms to inform appropriate treatments. The Brief Psychiatric Rating Scale (BPRS) is a validated, commonly used research tool for measuring symptoms in patients with schizophrenia and other psychotic disorders; however, it is not commonly used in clinical practice as it requires a lengthy structured interview. Here, we utilize large language models (LLMs) to predict BPRS scores from clinical interview transcripts in 409 CHR patients from the Accelerating Medicines Partnership Schizophrenia (AMP-SCZ) cohort. Despite the interviews not being specifically structured to measure the BPRS, the zero-shot performance of the LLM predictions compared to the true assessment (median concordance: 0.84, ICC: 0.73) approaches human inter- and intra-rater reliability. 
We further demonstrate that LLMs have substantial potential to improve and standardize the assessment of CHR patients via their accuracy in assessing the BPRS in foreign languages (median concordance: 0.88, ICC: 0.70), and integrating longitudinal information in a one-shot or few-shot learning approach.
[44] A Computational Approach to Analyzing Language Change and Variation in the Constructed Language Toki Pona
Daniel Huang,Hyoun-A Joo
Main category: cs.CL
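TL;DR: This study uses computational and corpus-based methods to examine change and variation in the constructed language Toki Pona, indicating that sociolinguistic factors drive its evolution just as they do in natural languages.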
Details
Motivation: To investigate language change and variation in the constructed language Toki Pona, analyzing how the preferences of content words for syntactic positions change over time and how usage differs across corpora. Method: A computational, corpus-based approach is used to study linguistic features such as fluid word classes and transitivity. Result: The results indicate that sociolinguistic factors influence Toki Pona in the same way as natural languages. Conclusion: Use by a language community leads constructed linguistic systems to evolve naturally, just as natural languages do. Abstract: This study explores language change and variation in Toki Pona, a constructed language with approximately 120 core words. Taking a computational and corpus-based approach, the study examines features including fluid word classes and transitivity in order to examine (1) changes in preferences of content words for different syntactic positions over time and (2) variation in usage across different corpora. The results suggest that sociolinguistic factors influence Toki Pona in the same way as natural languages, and that even constructed linguistic systems naturally evolve as communities use them.
[45] Inductive Bias Extraction and Matching for LLM Prompts
Christian M. Angel,Francis Ferraro
Main category: cs.CL
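TL;DR: This paper studies how to optimize prompts by exploiting an LLM's inductive bias, substantially improving model performance on classification and ranking tasks.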
Details
Motivation: Active research on prompt engineering shows that LLMs are highly sensitive to small changes in prompt wording. Method: The LLM's own output is used as part of its prompt, making it easier to create prompts that match the model's inductive bias. Result: The Inductive Bias Extraction and Matching strategy improves LLM Likert ratings by up to 19% on classification tasks and by up to 27% on ranking tasks. Conclusion: The Inductive Bias Extraction and Matching strategy can substantially improve LLM Likert ratings for classification and ranking tasks. Abstract: The active research topic of prompt engineering makes it evident that LLMs are sensitive to small changes in prompt wording. A portion of this can be ascribed to the inductive bias that is present in the LLM. By using an LLM's output as a portion of its prompt, we can more easily create satisfactory wording for prompts. This has the effect of creating a prompt that matches the inductive bias in the model. Empirically, we show that using this Inductive Bias Extraction and Matching strategy improves LLM Likert ratings used for classification by up to 19% and LLM Likert ratings used for ranking by up to 27%.
[46] Yet another algorithmic bias: A Discursive Analysis of Large Language Models Reinforcing Dominant Discourses on Gender and Race
Gustavo Bonil,Simone Hashiguti,Jhessica Silva,João Gondim,Helena Maia,Nádia Silva,Helio Pedrini,Sandra Avila
Main category: cs.CL
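TL;DR: Through qualitative analysis, this study finds that large language models reproduce gender and racial biases when generating text: Black women are often tied to ancestry and resistance, while white women appear in processes of self-discovery. When asked to correct biases, the models make only superficial revisions that do not actually remove them, which underscores the need for critical, interdisciplinary approaches to the ethical design and use of AI.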
Details
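Motivation: As AI advances, large language models are being applied in a wide range of contexts, but whether they reproduce biases such as discrimination and racialization remains a concern. Current bias detection approaches rely mainly on quantitative, automated methods and overlook the nuanced ways in which biases appear in natural language. A qualitative, discourse-analytic approach is therefore needed to complement existing methods. Method: By manually analyzing LLM-generated short stories about Black and white women, the researchers investigate gender and racial biases in language models and assess how the models respond when prompted to correct biases. Result: Black women are portrayed as tied to ancestry and resistance, while white women appear in processes of self-discovery. These patterns reflect how language models replicate crystalized discursive representations, reinforcing essentialization and a sense of social immobility. When prompted to correct biases, the models offer superficial revisions that retain problematic meanings. Conclusion: The study highlights that large language models (LLMs) reproduce gender and racial biases when generating text and argues that qualitative analysis is essential for identifying and mitigating these biases. It calls for critical, interdisciplinary approaches to AI design and deployment to address how LLM-generated discourses reflect and perpetuate inequalities. Abstract: With the advance of Artificial Intelligence (AI), Large Language Models (LLMs) have gained prominence and been applied in diverse contexts. As they evolve into more sophisticated versions, it is essential to assess whether they reproduce biases, such as discrimination and racialization, while maintaining hegemonic discourses. Current bias detection approaches rely mostly on quantitative, automated methods, which often overlook the nuanced ways in which biases emerge in natural language. This study proposes a qualitative, discursive framework to complement such methods. Through manual analysis of LLM-generated short stories featuring Black and white women, we investigate gender and racial biases. We contend that qualitative methods such as the one proposed here are fundamental to help both developers and users identify the precise ways in which biases manifest in LLM outputs, thus enabling better conditions to mitigate them. Results show that Black women are portrayed as tied to ancestry and resistance, while white women appear in self-discovery processes. These patterns reflect how language models replicate crystalized discursive representations, reinforcing essentialization and a sense of social immobility. When prompted to correct biases, models offered superficial revisions that maintained problematic meanings, revealing limitations in fostering inclusive narratives. 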
Our results demonstrate the ideological functioning of algorithms and have significant implications for the ethical use and development of AI. The study reinforces the need for critical, interdisciplinary approaches to AI design and deployment, addressing how LLM-generated discourses reflect and perpetuate inequalities.
[47] ReviewRL: Towards Automated Scientific Review with RL
Sihang Zeng,Kai Tian,Kaiyan Zhang,Yuru wang,Junqi Gao,Runze Liu,Sa Yang,Jingxuan Li,Xinwei Long,Jiaheng Ma,Biqing Qi,Bowen Zhou
Main category: cs.CL
TL;DR: ReviewRL is a reinforcement learning framework designed to generate comprehensive and factually grounded scientific paper reviews, outperforming current automated approaches in quality and accuracy.
Details
Motivation: Peer review is essential for scientific progress but faces challenges such as increasing submission volumes and reviewer fatigue, while existing automated review approaches struggle with factual accuracy, rating consistency, and analytical depth. Method: ReviewRL combines a retrieval-augmented context generation pipeline (ArXiv-MCP), supervised fine-tuning, and a reinforcement learning procedure with a composite reward function to enhance review quality and rating accuracy. Result: Experiments on ICLR 2025 papers show that ReviewRL significantly outperforms existing methods across both rule-based metrics and model-based quality assessments. Conclusion: ReviewRL provides a foundational framework for RL-driven automatic critique generation in scientific discovery and demonstrates promising potential for future development in this domain. Abstract: Peer review is essential for scientific progress but faces growing challenges due to increasing submission volumes and reviewer fatigue. Existing automated review approaches struggle with factual accuracy, rating consistency, and analytical depth, often generating superficial or generic feedback lacking the insights characteristic of high-quality human reviews. We introduce ReviewRL, a reinforcement learning framework for generating comprehensive and factually grounded scientific paper reviews. Our approach combines: (1) an ArXiv-MCP retrieval-augmented context generation pipeline that incorporates relevant scientific literature, (2) supervised fine-tuning that establishes foundational reviewing capabilities, and (3) a reinforcement learning procedure with a composite reward function that jointly enhances review quality and rating accuracy. Experiments on ICLR 2025 papers demonstrate that ReviewRL significantly outperforms existing methods across both rule-based metrics and model-based quality assessments. 
ReviewRL establishes a foundational framework for RL-driven automatic critique generation in scientific discovery, demonstrating promising potential for future development in this domain. The implementation of ReviewRL will be released on GitHub.
[48] From Surface to Semantics: Semantic Structure Parsing for Table-Centric Document Analysis
Xuan Li,Jialiang Dong,Raymond Wong
Main category: cs.CL
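TL;DR: DOTABLER is a table-centric semantic document parsing framework that uses deep semantic analysis to improve understanding of the links between tables and their context, enabling efficient document structure parsing and domain-specific table retrieval.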
Details
Motivation: Existing studies focus mainly on surface-level tasks such as layout analysis, table detection, and data extraction, and lack deep semantic parsing of tables and their contextual associations, which limits performance on advanced tasks such as cross-paragraph data interpretation and context-consistent analysis. Method: DOTABLER leverages a custom dataset and domain-specific fine-tuning of pre-trained models, integrating a complete parsing pipeline to identify context segments semantically tied to tables. Result: On nearly 4,000 pages of real-world PDF documents containing over 1,000 tables, DOTABLER achieves Precision and F1 scores above 90%, performing strongly in table-context semantic analysis and deep document parsing. Conclusion: DOTABLER is a table-centric semantic document parsing framework that effectively uncovers deep semantic links between tables and their context and outperforms existing models in practical use. Abstract: Documents are core carriers of information and knowledge, with broad applications in finance, healthcare, and scientific research. Tables, as the main medium for structured data, encapsulate key information and are among the most critical document components. Existing studies largely focus on surface-level tasks such as layout analysis, table detection, and data extraction, lacking deep semantic parsing of tables and their contextual associations. This limits advanced tasks like cross-paragraph data interpretation and context-consistent analysis. To address this, we propose DOTABLER, a table-centric semantic document parsing framework designed to uncover deep semantic links between tables and their context. DOTABLER leverages a custom dataset and domain-specific fine-tuning of pre-trained models, integrating a complete parsing pipeline to identify context segments semantically tied to tables. Built on this semantic understanding, DOTABLER implements two core functionalities: table-centric document structure parsing and domain-specific table retrieval, delivering comprehensive table-anchored semantic analysis and precise extraction of semantically relevant tables. Evaluated on nearly 4,000 pages with over 1,000 tables from real-world PDFs, DOTABLER achieves over 90% Precision and F1 scores, demonstrating superior performance in table-context semantic analysis and deep document parsing compared to advanced models such as GPT-4o.
[49] Beyond Semantic Understanding: Preserving Collaborative Frequency Components in LLM-based Recommendation
Minhao Wang,Yunhang He,Cong Xu,Zhangchi Zhu,Wei Zhang
Main category: cs.CL
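TL;DR: FreLLM4Rec is a recommendation approach that improves the performance of LLM-based recommenders by balancing semantic and collaborative information.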
Details
Motivation: LLM-based recommenders tend to overemphasize semantic correlations within users' interaction history, which weakens collaborative signals. Method: Item embeddings are purified with a Global Graph Low-Pass Filter (G-LPF), and Temporal Frequency Modulation (TFM) preserves collaborative signals layer by layer. Result: Experiments on four benchmark datasets show that FreLLM4Rec mitigates collaborative signal attenuation and improves NDCG@10 by up to 8.00% over the best baseline. Conclusion: FreLLM4Rec offers a principled approach for improving LLM-based recommendation systems and sheds light on how LLMs process collaborative information. Abstract: Recommender systems in concert with Large Language Models (LLMs) present promising avenues for generating semantically-informed recommendations. However, LLM-based recommenders exhibit a tendency to overemphasize semantic correlations within users' interaction history. When taking pretrained collaborative ID embeddings as input, LLM-based recommenders progressively weaken the inherent collaborative signals as the embeddings propagate through LLM backbones layer by layer, as opposed to traditional Transformer-based sequential models in which collaborative signals are typically preserved or even enhanced for state-of-the-art performance. To address this limitation, we introduce FreLLM4Rec, an approach designed to balance semantic and collaborative information from a spectral perspective. Item embeddings that incorporate both semantic and collaborative information are first purified using a Global Graph Low-Pass Filter (G-LPF) to preliminarily remove irrelevant high-frequency noise. Temporal Frequency Modulation (TFM) then actively preserves collaborative signal layer by layer. Note that the collaborative preservation capability of TFM is theoretically guaranteed by establishing a connection between the optimal but hard-to-implement local graph Fourier filters and the suboptimal yet computationally efficient frequency-domain filters. Extensive experiments on four benchmark datasets demonstrate that FreLLM4Rec successfully mitigates collaborative signal attenuation and achieves competitive performance, with improvements of up to 8.00% in NDCG@10 over the best baseline. 
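For readers unfamiliar with the reported metric, here is a minimal sketch of NDCG@10 with binary relevance. The function names and the toy item IDs are illustrative assumptions, the ranked list is assumed to contain no duplicates, and the paper's exact evaluation protocol is not given in this entry.

```python
import math

def dcg_at_k(gains, k):
    """Discounted cumulative gain: the gain at 0-based rank r is divided by log2(r + 2)."""
    return sum(gain / math.log2(rank + 2) for rank, gain in enumerate(gains[:k]))

def ndcg_at_k(ranked_items, relevant_items, k=10):
    """NDCG@k with binary relevance: 1.0 if a ranked item is relevant, else 0.0."""
    gains = [1.0 if item in relevant_items else 0.0 for item in ranked_items]
    # The ideal ranking places every relevant item (up to k of them) at the top.
    ideal_gains = [1.0] * min(len(relevant_items), k)
    ideal_dcg = dcg_at_k(ideal_gains, k)
    return dcg_at_k(gains, k) / ideal_dcg if ideal_dcg > 0 else 0.0

# Hypothetical example: the single held-out item appears at rank 2.
print(ndcg_at_k(["item_7", "item_3", "item_9"], {"item_3"}))
```

In the common next-item setting with one held-out item per user, the score depends only on the rank at which that item appears.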
Our findings provide insights into how LLMs process collaborative information and offer a principled approach for improving LLM-based recommendation systems.
[50] Cross-Prompt Encoder for Low-Performing Languages
Beso Mikaberidze,Teimuraz Saghinadze,Simon Ostermann,Philipp Muller
Main category: cs.CL
TL;DR: This paper introduces the Cross-Prompt Encoder (XPE) and a Dual Soft Prompt mechanism to improve language transferability and performance in low-performing languages within parameter-efficient fine-tuning frameworks.
Details
Motivation: The motivation stems from the need to improve the performance of large language models on low-performing languages without altering their architecture or updating parameters, particularly focusing on enhancing transferability across languages. Method: The authors introduced the Cross-Prompt Encoder (XPE), which uses a lightweight architecture and multi-source training on typologically diverse languages, along with a Dual Soft Prompt mechanism combining encoder-based and standard soft prompts. Result: Experiments on the SIB-200 benchmark showed that XPE is most effective for low-performing languages, while hybrid variants provide broader adaptability in multilingual settings. Conclusion: The study concludes that the Cross-Prompt Encoder (XPE) and the Dual Soft Prompt mechanism significantly enhance the adaptability and performance of large language models across diverse languages, especially for low-performing languages. Abstract: Soft prompts have emerged as a powerful alternative to adapters in parameter-efficient fine-tuning (PEFT), enabling large language models (LLMs) to adapt to downstream tasks without architectural changes or parameter updates. While prior work has focused on stabilizing training via parameter interaction in small neural prompt encoders, their broader potential for transfer across languages remains unexplored. In this paper, we demonstrate that a prompt encoder can play a central role in improving performance on low-performing languages, that is, those that achieve poor accuracy even under full-model fine-tuning. We introduce the Cross-Prompt Encoder (XPE), which combines a lightweight encoding architecture with multi-source training on typologically diverse languages - a design that enables the model to capture abstract and transferable patterns across languages. To complement XPE, we propose a Dual Soft Prompt mechanism that combines an encoder-based prompt with a directly trained standard soft prompt. 
This hybrid design proves especially effective for target languages that benefit from both broadly shared structure and language-specific alignment. Experiments on the SIB-200 benchmark reveal a consistent trade-off: XPE is most effective for low-performing languages, while hybrid variants offer broader adaptability across multilingual settings.
[51] Making Qwen3 Think in Korean with Reinforcement Learning
Jungyup Lee,Jemin Kim,Sang Park,SeungJae Lee
Main category: cs.CL
TL;DR: A two-stage fine-tuning method was used to enhance the Qwen3 14B model's ability to think in Korean, resulting in improved performance on reasoning benchmarks.
Details
Motivation: To make the large language model Qwen3 14B 'think' natively in Korean and improve its Korean-language tasks and general reasoning ability. Method: Two-stage fine-tuning approach: supervised fine-tuning (SFT) followed by reinforcement learning with a customized Group Relative Policy Optimization (GRPO) algorithm. Result: Notable improvements in Korean-language tasks, some gains in general reasoning ability, and stable learning achieved through the introduction of an oracle judge model. Conclusion: The final RL-tuned model demonstrates substantially improved results on advanced reasoning benchmarks while maintaining knowledge and language proficiency. Abstract: We present a two-stage fine-tuning approach to make the large language model Qwen3 14B "think" natively in Korean. In the first stage, supervised fine-tuning (SFT) on a high-quality Korean reasoning dataset establishes a strong foundation in Korean logical reasoning, yielding notable improvements in Korean-language tasks and even some gains in general reasoning ability. In the second stage, we employ reinforcement learning with a customized Group Relative Policy Optimization (GRPO) algorithm to further enhance both Korean reasoning alignment and overall problem-solving performance. We address critical stability challenges in GRPO training - such as reward hacking and policy collapse - by introducing an oracle judge model that calibrates the reward signal. Our approach achieves stable learning (avoiding the collapse observed in naive GRPO) and leads to steady, incremental performance gains. The final RL-tuned model demonstrates substantially improved results on advanced reasoning benchmarks (particularly math and coding tasks) while maintaining knowledge and language proficiency, successfully conducting its internal chain-of-thought entirely in Korean.
[52] Advancing Cross-lingual Aspect-Based Sentiment Analysis with LLMs and Constrained Decoding for Sequence-to-Sequence Models
Jakub Šmíd,Pavel Přibáň,Pavel Král
Main category: cs.CL
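TL;DR: This paper proposes a new sequence-to-sequence method for compound aspect-based sentiment analysis (ABSA) tasks in low-resource languages that does not rely on external translation tools and uses constrained decoding to improve cross-lingual ABSA performance.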
Details
Motivation: Because existing research focuses mainly on English, ABSA in low-resource languages remains challenging; in addition, existing cross-lingual ABSA methods usually target simpler tasks and rely heavily on external translation tools. Method: A new sequence-to-sequence method combined with constrained decoding is proposed for compound ABSA tasks. Result: The method improves cross-lingual ABSA performance by up to 10%, handles more complex tasks, and offers a more practical and efficient alternative to translation-dependent methods. Conclusion: The method provides an effective new solution for cross-lingual ABSA in low-resource languages; the study also shows that fine-tuned multilingual large language models (LLMs) can achieve comparable results, whereas English-centric LLMs struggle with these tasks. Abstract: Aspect-based sentiment analysis (ABSA) has made significant strides, yet challenges remain for low-resource languages due to the predominant focus on English. Current cross-lingual ABSA studies often centre on simpler tasks and rely heavily on external translation tools. In this paper, we present a novel sequence-to-sequence method for compound ABSA tasks that eliminates the need for such tools. Our approach, which uses constrained decoding, improves cross-lingual ABSA performance by up to 10%. This method broadens the scope of cross-lingual ABSA, enabling it to handle more complex tasks and providing a practical, efficient alternative to translation-dependent techniques. Furthermore, we compare our approach with large language models (LLMs) and show that while fine-tuned multilingual LLMs can achieve comparable results, English-centric LLMs struggle with these tasks.
[53] Large Language Models for Summarizing Czech Historical Documents and Beyond
Václav Tran,Jakub Šmíd,Jiří Martínek,Ladislav Lenc,Pavel Král
Main category: cs.CL
TL;DR: This paper explores the use of Mistral and mT5 models for Czech text summarization, achieving state-of-the-art results and introducing a new dataset for historical document summarization.
Details
Motivation: Czech text summarization, especially for historical documents, has been underexplored due to linguistic complexities and lack of annotated datasets. The motivation is to bridge this gap using advanced language models. Method: The researchers employed large language models (Mistral and mT5) for Czech text summarization and introduced a new dataset for historical Czech documents called Posel od Čerchova. Result: The study achieved new state-of-the-art results on the SumeCzech dataset and introduced a new dataset for historical Czech summarization with baseline results. Conclusion: The study concludes that the application of large language models like Mistral and mT5 can significantly advance Czech text summarization, particularly for historical documents. Abstract: Text summarization is the task of shortening a larger body of text into a concise version while retaining its essential meaning and key information. While summarization has been significantly explored in English and other high-resource languages, Czech text summarization, particularly for historical documents, remains underexplored due to linguistic complexities and a scarcity of annotated datasets. Large language models such as Mistral and mT5 have demonstrated excellent results on many natural language processing tasks and languages. Therefore, we employ these models for Czech summarization, resulting in two key contributions: (1) achieving new state-of-the-art results on the modern Czech summarization dataset SumeCzech using these advanced models, and (2) introducing a novel dataset called Posel od Čerchova for summarization of historical Czech documents with baseline results. Together, these contributions offer great potential for advancing Czech text summarization and open new avenues for research in Czech historical text processing.
[54] Improving Generative Cross-lingual Aspect-Based Sentiment Analysis with Constrained Decoding
Jakub Šmíd,Pavel Přibáň,Pavel Král
Main category: cs.CL
TL;DR: This paper presents a constrained decoding approach for cross-lingual ABSA that outperforms current methods, especially for complex tasks, while fine-tuned LLMs show competitive results at the expense of increased computational cost.
Details
Motivation: The motivation stems from the challenges faced in aspect-based sentiment analysis (ABSA) for low-resource languages, where existing cross-lingual approaches are limited and often depend on unreliable translation tools. Method: The paper introduces a novel constrained decoding approach using sequence-to-sequence models for cross-lingual ABSA, and evaluates it across multiple languages and tasks. It also assesses large language models (LLMs) in zero-shot, few-shot, and fine-tuning scenarios. Result: The approach improves cross-lingual ABSA performance by 5% on average for the most complex task, with constrained decoding boosting results by more than 10%. The method supports multi-tasking and achieves state-of-the-art results across seven languages and six ABSA tasks. Fine-tuned LLMs perform competitively but with longer training and inference times. Conclusion: This paper concludes that the proposed constrained decoding approach with sequence-to-sequence models significantly improves cross-lingual ABSA performance, especially for complex tasks, while fine-tuning LLMs can achieve competitive results with longer training and inference times. Abstract: While aspect-based sentiment analysis (ABSA) has made substantial progress, challenges remain for low-resource languages, which are often overlooked in favour of English. Current cross-lingual ABSA approaches focus on limited, less complex tasks and often rely on external translation tools. This paper introduces a novel approach using constrained decoding with sequence-to-sequence models, eliminating the need for unreliable translation tools and improving cross-lingual performance by 5% on average for the most complex task. The proposed method also supports multi-tasking, which enables solving multiple ABSA tasks with a single model, with constrained decoding boosting results by more than 10%. 
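The general idea of constrained decoding can be sketched as greedy decoding in which, at each step, only tokens permitted by an output grammar may be chosen. Everything below is an illustrative assumption: the `[A] ... [P] ...` output format, the rule that the aspect term must be copied from the source sentence, and the toy scoring function are hypothetical and are not taken from the paper.

```python
def allowed_tokens(prefix, source_tokens):
    """Hypothetical grammar: '[A]' <source token> '[P]' <polarity> '<eos>'."""
    step = len(prefix)
    if step == 0:
        return {"[A]"}
    if step == 1:
        return set(source_tokens)  # the aspect term must come from the input sentence
    if step == 2:
        return {"[P]"}
    if step == 3:
        return {"positive", "negative", "neutral"}
    return {"<eos>"}

def constrained_greedy_decode(score_fn, source_tokens, max_len=5):
    """Greedy decoding restricted to the tokens the grammar allows at each step."""
    prefix = []
    for _ in range(max_len):
        scores = score_fn(prefix)
        allowed = sorted(allowed_tokens(prefix, source_tokens))
        # Tokens the scorer does not mention are treated as having score -inf.
        best = max(allowed, key=lambda tok: scores.get(tok, float("-inf")))
        prefix.append(best)
        if best == "<eos>":
            break
    return prefix

def toy_score_fn(prefix):
    """Stand-in for a model: it would prefer the English word 'great' if unconstrained."""
    return {"great": 5.0, "positive": 3.0, "pizza": 2.0, "negative": 1.0,
            "neutral": 0.2, "[A]": 0.1, "[P]": 0.1, "<eos>": 0.0}

# German source sentence; the constraint keeps the aspect term in the source language.
print(constrained_greedy_decode(toy_score_fn, ["die", "pizza", "war", "toll"]))
```

Without the constraint, this toy scorer would emit "great" at every step; with it, the output stays well-formed and uses only source-language aspect terms, which is one plausible reason such constraints help in cross-lingual settings.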
We evaluate our approach across seven languages and six ABSA tasks, surpassing state-of-the-art methods and setting new benchmarks for previously unexplored tasks. Additionally, we assess large language models (LLMs) in zero-shot, few-shot, and fine-tuning scenarios. While LLMs perform poorly in zero-shot and few-shot settings, fine-tuning achieves competitive results compared to smaller multilingual models, albeit at the cost of longer training and inference times. We provide practical recommendations for real-world applications, enhancing the understanding of cross-lingual ABSA methodologies. This study offers valuable insights into the strengths and limitations of cross-lingual ABSA approaches, advancing the state-of-the-art in this challenging research domain.
[55] Jailbreaking Commercial Black-Box LLMs with Explicitly Harmful Prompts
Chiyu Zhang,Lu Zhou,Xiaogang Xu,Jiafei Wu,Liming Fang,Zhe Liu
Main category: cs.CL
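TL;DR: This paper proposes MDH, a hybrid framework for efficiently detecting malicious content, and two new jailbreak attack strategies that substantially increase attack success rates.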
Details
Motivation: Existing red-teaming datasets contain unsuitable prompts and need to be assessed and cleaned effectively, but existing methods fall short in accuracy and efficiency. Method: A hybrid evaluation framework, MDH, is proposed, along with two new strategies, D-Attack and DH-CoT, which attack through context simulation and hijacked chains of thought. Result: Well-crafted developer messages are found to substantially increase jailbreak success, and the new framework and strategies are applied successfully. Conclusion: MDH combines LLM-based annotation with human oversight to improve the accuracy and efficiency of malicious content detection, and the paper also proposes two new jailbreak attack strategies, D-Attack and DH-CoT. Abstract: Evaluating jailbreak attacks is challenging when prompts are not overtly harmful or fail to induce harmful outputs. Unfortunately, many existing red-teaming datasets contain such unsuitable prompts. To evaluate attacks accurately, these datasets need to be assessed and cleaned for maliciousness. However, existing malicious content detection methods rely on either manual annotation, which is labor-intensive, or large language models (LLMs), which have inconsistent accuracy across harmful types. To balance accuracy and efficiency, we propose a hybrid evaluation framework named MDH (Malicious content Detection based on LLMs with Human assistance) that combines LLM-based annotation with minimal human oversight, and apply it to dataset cleaning and detection of jailbroken responses. Furthermore, we find that well-crafted developer messages can significantly boost jailbreak success, leading us to propose two new strategies: D-Attack, which leverages context simulation, and DH-CoT, which incorporates hijacked chains of thought. The code, datasets, judgements, and detection results will be released in the GitHub repository: https://github.com/AlienZhang1996/DH-CoT.
[56] Layer-Wise Perturbations via Sparse Autoencoders for Adversarial Text Generation
Huizhen Shu,Xuying Li,Qirui Wang,Yuji Kosuga,Mengqiu Tian,Zhuo Li
Main category: cs.CL
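TL;DR: This paper proposes SFPF, a new adversarial text generation method that can bypass state-of-the-art defense mechanisms and exposes vulnerabilities in current NLP systems; however, its effectiveness varies across prompts and layers, and its generalizability to other architectures and larger models remains to be validated.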
Details
Motivation: Generating adversarial examples to jailbreak large language models (LLMs) remains a key challenge for understanding model vulnerabilities and improving robustness. Method: The paper proposes a new black-box attack method, the Sparse Feature Perturbation Framework (SFPF), which uses sparse autoencoders to identify and manipulate critical features in text. Result: Experimental results show that adversarial texts generated by SFPF can bypass state-of-the-art defense mechanisms, revealing persistent vulnerabilities in current NLP systems. Conclusion: SFPF performs well at adversarial text generation but has limitations, and its generalizability and adaptability need further study. Abstract: With the rapid proliferation of Natural Language Processing (NLP), especially Large Language Models (LLMs), generating adversarial examples to jailbreak LLMs remains a key challenge for understanding model vulnerabilities and improving robustness. In this context, we propose a new black-box attack method that leverages the interpretability of large models. We introduce the Sparse Feature Perturbation Framework (SFPF), a novel approach for adversarial text generation that utilizes sparse autoencoders to identify and manipulate critical features in text. After using the SAE model to reconstruct hidden layer representations, we perform feature clustering on the successfully attacked texts to identify features with higher activations. These highly activated features are then perturbed to generate new adversarial texts. This selective perturbation preserves the malicious intent while amplifying safety signals, thereby increasing their potential to evade existing defenses. Our method enables a new red-teaming strategy that balances adversarial effectiveness with safety alignment. Experimental results demonstrate that adversarial texts generated by SFPF can bypass state-of-the-art defense mechanisms, revealing persistent vulnerabilities in current NLP systems. However, the method's effectiveness varies across prompts and layers, and its generalizability to other architectures and larger models remains to be validated.
[57] ComoRAG: A Cognitive-Inspired Memory-Organized RAG for Stateful Long Narrative Reasoning
Juyuan Wang,Rongchen Zhao,Wei Wei,Yufeng Wang,Mo Yu,Jie Zhou,Jin Xu,Liyan Xu
Main category: cs.CL
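TL;DR: This paper proposes ComoRAG, a method for improving narrative comprehension of long stories and novels; by coupling iterative reasoning cycles with a dynamic memory workspace, it outperforms traditional RAG methods on four benchmarks.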
Details
Motivation: Because long stories and novels have intricate plotlines and entangled relations, traditional RAG methods can fall short owing to their stateless, single-step retrieval process, which often overlooks the dynamic nature of capturing interconnected relations within long-range context. Method: The paper proposes ComoRAG, which, on encountering a reasoning impasse, generates probing queries to devise new exploratory paths and integrates the retrieved evidence of new aspects into a global memory pool, supporting the emergence of a coherent context for query resolution. Result: Across four challenging long-context narrative benchmarks (200K+ tokens), ComoRAG outperforms strong RAG baselines, with relative gains of up to 11% over the strongest baseline. Conclusion: ComoRAG offers a new retrieval-based approach to narrative comprehension that, through the interaction of iterative reasoning cycles and a dynamic memory workspace, outperforms traditional RAG methods and is especially suited to complex queries requiring global comprehension. Abstract: Narrative comprehension on long stories and novels has been a challenging domain attributed to their intricate plotlines and entangled, often evolving relations among characters and entities. Given the LLM's diminished reasoning over extended context and high computational cost, retrieval-based approaches retain a pivotal role in practice. However, traditional RAG methods can fall short due to their stateless, single-step retrieval process, which often overlooks the dynamic nature of capturing interconnected relations within long-range context. In this work, we propose ComoRAG, holding the principle that narrative reasoning is not a one-shot process, but a dynamic, evolving interplay between new evidence acquisition and past knowledge consolidation, analogous to human cognition when reasoning with memory-related signals in the brain. Specifically, when encountering a reasoning impasse, ComoRAG undergoes iterative reasoning cycles while interacting with a dynamic memory workspace. In each cycle, it generates probing queries to devise new exploratory paths, then integrates the retrieved evidence of new aspects into a global memory pool, thereby supporting the emergence of a coherent context for the query resolution. Across four challenging long-context narrative benchmarks (200K+ tokens), ComoRAG outperforms strong RAG baselines with consistent relative gains up to 11% compared to the strongest baseline. 
Further analysis reveals that ComoRAG is particularly advantageous for complex queries requiring global comprehension, offering a principled, cognitively motivated paradigm for retrieval-based long context comprehension towards stateful reasoning. Our code is publicly released at https://github.com/EternityJune25/ComoRAG[58] Evaluating LLMs on Chinese Idiom Translation
Cai Yang,Yao Dou,David Heineman,Xiaofeng Wu,Wei Xu
Main category: cs.CL
TL;DR: This paper introduces IdiomEval, a framework for analyzing Chinese idiom translation errors, revealing that current translation systems, including advanced models like GPT-4, perform poorly, and existing evaluation metrics are insufficient for accurately assessing idiom translations.
Details
Motivation: Despite the prevalence of idioms in languages like Chinese and the importance of accurate translation due to their figurative meanings, there has been little research on Chinese idiom translation. This study aims to address this gap and improve the understanding and evaluation of idiom translation systems. Method: The researchers introduced a framework called IdiomEval, which includes a detailed error taxonomy. They annotated 900 translation pairs from nine modern translation systems across four domains and evaluated the performance of these systems. They also developed improved models for detecting translation errors. Result: Modern translation systems, including GPT-4 and Google Translate, were found to perform poorly in translating Chinese idioms, with errors such as incorrect, literal, partial, or missing translations. The best system, GPT-4, had an error rate of 28%. Existing evaluation metrics showed weak correlation with human judgments, and improved models achieved an F$_1$ score of 0.68 for error detection. Conclusion: The study concludes that existing Chinese idiom translation systems, including advanced ones like GPT-4, perform poorly, with the best system still making errors in 28% of cases. Current evaluation metrics are also inadequate for assessing idiom translation quality. Abstract: Idioms, whose figurative meanings usually differ from their literal interpretations, are common in everyday language, especially in Chinese, where they often contain historical references and follow specific structural patterns. Despite recent progress in machine translation with large language models, little is known about Chinese idiom translation. In this work, we introduce IdiomEval, a framework with a comprehensive error taxonomy for Chinese idiom translation. We annotate 900 translation pairs from nine modern systems, including GPT-4o and Google Translate, across four domains: web, news, Wikipedia, and social media. 
We find these systems fail at idiom translation, producing incorrect, literal, partial, or even missing translations. The best-performing system, GPT-4, makes errors in 28% of cases. We also find that existing evaluation metrics measure idiom quality poorly with Pearson correlation below 0.48 with human ratings. We thus develop improved models that achieve F$_1$ scores of 0.68 for detecting idiom translation errors.
[59] Computational Economics in Large Language Models: Exploring Model Behavior and Incentive Design under Resource Constraints
Sandeep Reddy,Kabir Khan,Rohit Patil,Ananya Chakraborty,Faizan A. Khan,Swati Kulkarni,Arjun Verma,Neha Singh
Main category: cs.CL
TL;DR: This paper introduces a computational economics framework for large language models to optimize computation allocation, resulting in more efficient and transparent models under resource constraints.
Details
Motivation: The motivation is to address the substantial computational cost of large language models (LLMs) and develop a method to optimize computation allocation when resources are limited. Method: The researchers introduced a 'computational economics' framework, treating LLMs as internal economies of resource-constrained agents, and used an incentive-driven training paradigm to optimize computation allocation. Result: The proposed method showed a forty percent reduction in FLOPS, lower latency, and more interpretable attention patterns while maintaining accuracy on GLUE and WikiText-103 benchmarks. Conclusion: The study concludes that applying economic principles can lead to more efficient, adaptive, and transparent large language models (LLMs) under resource constraints. Abstract: Large language models (LLMs) are limited by substantial computational cost. We introduce a "computational economics" framework that treats an LLM as an internal economy of resource-constrained agents (attention heads and neuron blocks) that must allocate scarce computation to maximize task utility. First, we show empirically that when computation is scarce, standard LLMs reallocate attention toward high-value tokens while preserving accuracy. Building on this observation, we propose an incentive-driven training paradigm that augments the task loss with a differentiable computation cost term, encouraging sparse and efficient activations. On GLUE (MNLI, STS-B, CoLA) and WikiText-103, the method yields a family of models that trace a Pareto frontier and consistently dominate post-hoc pruning; for a similar accuracy we obtain roughly a forty percent reduction in FLOPS and lower latency, together with more interpretable attention patterns. 
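The incentive-driven objective described in this abstract (task loss plus a computation cost term) might be sketched as follows. This is a minimal illustration, not the authors' implementation: the per-unit gates, the per-unit cost values, the linear form of the cost, and the weight `lam` are all assumptions introduced here.

```python
# Hypothetical sketch: task loss augmented with a computation cost penalty.
# "gates" are assumed soft activations in [0, 1] for each attention head or
# neuron block; "unit_costs" are assumed per-unit compute costs (e.g. FLOPs).

def compute_cost(gates, unit_costs):
    """Expected compute: each unit's cost weighted by how active it is."""
    return sum(g * c for g, c in zip(gates, unit_costs))

def augmented_loss(task_loss, gates, unit_costs, lam):
    """Task loss plus lam times the compute cost (lam is an assumed trade-off weight)."""
    return task_loss + lam * compute_cost(gates, unit_costs)

def cost_gradient(unit_costs, lam):
    """Gradient of the penalty w.r.t. each gate; the linear cost is differentiable."""
    return [lam * c for c in unit_costs]

if __name__ == "__main__":
    loss = augmented_loss(task_loss=1.0, gates=[1.0, 0.5],
                          unit_costs=[2.0, 4.0], lam=0.1)
    print(loss)
```

Under this assumed form, a larger `lam` pushes gates for expensive units toward zero, which is one way sweeping the weight could trace an accuracy-versus-compute frontier of the kind the abstract mentions.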
These results indicate that economic principles offer a principled route to designing efficient, adaptive, and more transparent LLMs under strict resource constraints.
[60] DiFaR: Enhancing Multimodal Misinformation Detection with Diverse, Factual, and Relevant Rationales
Herun Wan,Jiaying Wu,Minnan Luo,Xiangzheng Kong,Zihan Ma,Zhi Zeng
Main category: cs.CL
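TL;DR: DiFaR is a framework that improves misinformation detection by increasing the diversity, factuality, and relevance of generated rationales; experiments show it outperforms existing methods.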
Details
Motivation: Approaches that generate textual rationales to support trainable multimodal misinformation detectors are limited by three core problems: insufficient diversity in the generated rationales, factual errors, and noise introduced by irrelevant or conflicting content. Method: DiFaR uses five chain-of-thought prompts to elicit diverse reasoning traces from large vision-language models and applies a lightweight post-hoc filtering module that selects rationale sentences based on sentence-level factuality and relevance scores. Result: Extensive experiments on four popular benchmarks show that DiFaR outperforms four baseline categories by up to 5.9% and boosts existing detectors by up to 8.7%. Both automatic metrics and human evaluations confirm that DiFaR significantly improves rationale quality across all three dimensions. Conclusion: DiFaR effectively addresses the diversity, factuality, and relevance problems of generated rationales, thereby significantly improving misinformation detection. Abstract: Generating textual rationales from large vision-language models (LVLMs) to support trainable multimodal misinformation detectors has emerged as a promising paradigm. However, its effectiveness is fundamentally limited by three core challenges: (i) insufficient diversity in generated rationales, (ii) factual inaccuracies due to hallucinations, and (iii) irrelevant or conflicting content that introduces noise. We introduce DiFaR, a detector-agnostic framework that produces diverse, factual, and relevant rationales to enhance misinformation detection. DiFaR employs five chain-of-thought prompts to elicit varied reasoning traces from LVLMs and incorporates a lightweight post-hoc filtering module to select rationale sentences based on sentence-level factuality and relevance scores. Extensive experiments on four popular benchmarks demonstrate that DiFaR outperforms four baseline categories by up to 5.9% and boosts existing detectors by as much as 8.7%. Both automatic metrics and human evaluations confirm that DiFaR significantly improves rationale quality across all three dimensions.
[61] When Explainability Meets Privacy: An Investigation at the Intersection of Post-hoc Explainability and Differential Privacy in the Context of Natural Language Processing
Mahdi Dhaini,Stephen Meisenbacher,Ege Erdogan,Florian Matthes,Gjergji Kasneci
Main category: cs.CL
TL;DR: This paper explores the relationship between privacy and explainability in NLP, showing that both can be achieved and offering practical recommendations for future research.
Details
Motivation: There is a lack of research exploring the intersection of explainability and privacy in NLP, creating a gap in understanding if both can be achieved simultaneously. Method: Empirical investigation guided by Differential Privacy and Post-hoc Explainability methods. Result: Findings reveal an intricate relationship between privacy and explainability, influenced by factors such as the nature of the task and the methods used for privatization and explainability. Conclusion: Privacy and explainability in NLP can co-exist, and the paper offers practical recommendations for future work at their intersection. Abstract: In the study of trustworthy Natural Language Processing (NLP), a number of important research fields have emerged, including that of explainability and privacy. While research interest in both explainable and privacy-preserving NLP has increased considerably in recent years, there remains a lack of investigation at the intersection of the two. This leaves a considerable gap in understanding of whether achieving both explainability and privacy is possible, or whether the two are at odds with each other. In this work, we conduct an empirical investigation into the privacy-explainability trade-off in the context of NLP, guided by the popular overarching methods of Differential Privacy (DP) and Post-hoc Explainability. Our findings include a view into the intricate relationship between privacy and explainability, which is formed by a number of factors, including the nature of the downstream task and choice of the text privatization and explainability method. In this, we highlight the potential for privacy and explainability to co-exist, and we summarize our findings in a collection of practical recommendations for future work at this important intersection.
[62] When Language Overrules: Revealing Text Dominance in Multimodal Large Language Models
Huyu Wu,Meng Tang,Xinhan Zheng,Haiyun Jiang
Main category: cs.CL
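TL;DR: Multimodal large language models suffer from text dominance; the authors propose evaluation metrics and a mitigation method toward more equitable multimodal models.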
Details
Motivation: Multimodal large language models perform well across many multimodal tasks but suffer from text dominance: they rely heavily on text for inference while underutilizing other modalities. Method: The paper proposes two evaluation metrics, the Modality Dominance Index (MDI) and the Attention Efficiency Index (AEI), and conducts a comprehensive analysis. Result: The analysis finds that text dominance is significant and pervasive across all tested modalities, and the proposed method substantially reduces the MDI of LLaVA-7B. Conclusion: Text dominance is pervasive in multimodal large language models; the authors propose a simple token compression method that effectively rebalances model attention, offering a foundation for developing more equitable and comprehensive multimodal language models. Abstract: Multimodal Large Language Models (MLLMs) have demonstrated remarkable capabilities across a diverse range of multimodal tasks. However, these models suffer from a core problem known as text dominance: they depend heavily on text for their inference, while underutilizing other modalities. Prior work has acknowledged this phenomenon in vision-language tasks, often attributing it to data biases or model architectures. In this paper, we conduct the first systematic investigation of text dominance across diverse data modalities, including images, videos, audio, time-series, and graphs. To measure this imbalance, we propose two evaluation metrics: the Modality Dominance Index (MDI) and the Attention Efficiency Index (AEI). Our comprehensive analysis reveals that text dominance is both significant and pervasive across all tested modalities. Our in-depth analysis identifies three underlying causes: attention dilution from severe token redundancy in non-textual modalities, the influence of fusion architecture design, and task formulations that implicitly favor textual inputs. Furthermore, we propose a simple token compression method that effectively rebalances model attention. Applying this method to LLaVA-7B, for instance, drastically reduces its MDI from 10.23 to a well-balanced value of 0.86. Our analysis and methodological framework offer a foundation for the development of more equitable and comprehensive multimodal language models.
[63] eDIF: A European Deep Inference Fabric for Remote Interpretability of LLM
Irma Heithoff,Marc Guggenberger,Sandra Kalogiannis,Susanne Mayer,Fabian Maag,Sigurd Schacht,Carsten Lanquillon
Main category: cs.CL
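TL;DR: This paper examines the deployment of a European Deep Inference Fabric (eDIF) to support mechanistic interpretability research on large language models.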
Details
Motivation: The paper is motivated by the need for widespread accessibility of LLM interpretability infrastructure in Europe, aiming to democratize advanced model analysis capabilities for the research community. Method: The approach includes setting up a GPU-based cluster at Ansbach University of Applied Sciences, enabling remote model inspection via the NNsight API, and organizing a pilot study with 16 researchers to evaluate the platform's technical performance, usability, and scientific utility. Result: The study showed a gradual increase in user engagement, stable platform performance, and a positive reception of the remote experimentation capabilities. The pilot also marked the starting point for building a user community around the platform. Conclusion: The paper concludes that deploying the eDIF infrastructure is an important step towards widespread access to LLM interpretability infrastructure in Europe and lays the groundwork for broader deployment, expanded tooling, and sustained community collaboration. Abstract: This paper presents a feasibility study on the deployment of a European Deep Inference Fabric (eDIF), an NDIF-compatible infrastructure designed to support mechanistic interpretability research on large language models. The need for widespread accessibility of LLM interpretability infrastructure in Europe drives this initiative to democratize advanced model analysis capabilities for the research community. The project introduces a GPU-based cluster hosted at Ansbach University of Applied Sciences and interconnected with partner institutions, enabling remote model inspection via the NNsight API. A structured pilot study involving 16 researchers from across Europe evaluated the platform's technical performance, usability, and scientific utility. Users conducted interventions such as activation patching, causal tracing, and representation analysis on models including GPT-2 and DeepSeek-R1-70B. The study revealed a gradual increase in user engagement, stable platform performance throughout, and a positive reception of the remote experimentation capabilities. It also marked the starting point for building a user community around the platform. Identified limitations such as prolonged download durations for activation data as well as intermittent execution interruptions are addressed in the roadmap for future development. 
This initiative marks a significant step towards widespread accessibility of LLM interpretability infrastructure in Europe and lays the groundwork for broader deployment, expanded tooling, and sustained community collaboration in mechanistic interpretability research.
[64] Neural Machine Translation for Coptic-French: Strategies for Low-Resource Ancient Languages
Nasma Chaoui,Richard Khoury
Main category: cs.CL
TL;DR: This paper explores strategies for translating Coptic to French, showing that fine-tuning models with varied and noise-aware data improves translation quality and provides insights for historical language translation tools.
Details
Motivation: To provide the first systematic study on translating Coptic into French and to develop effective translation tools for historical languages. Method: The study systematically evaluates pivot versus direct translation, the impact of pre-training, multi-version fine-tuning benefits, and model robustness to noise using aligned biblical corpora. Result: Demonstrated enhancement in translation quality through fine-tuning with stylistically-varied and noise-aware training data. Conclusion: The study concludes that fine-tuning with a stylistically-varied and noise-aware training corpus significantly improves translation quality, offering crucial insights for developing translation tools for historical languages. Abstract: This paper presents the first systematic study of strategies for translating Coptic into French. Our comprehensive pipeline systematically evaluates: pivot versus direct translation, the impact of pre-training, the benefits of multi-version fine-tuning, and model robustness to noise. Utilizing aligned biblical corpora, we demonstrate that fine-tuning with a stylistically-varied and noise-aware training corpus significantly enhances translation quality. Our findings provide crucial practical insights for developing translation tools for historical languages in general.
[65] Continuous Bangla Sign Language Translation: Mitigating the Expense of Gloss Annotation with the Assistance of Graph
Safaeid Hossain Arib,Rabeya Akter,Sejuti Rahman
Main category: cs.CL
TL;DR: The Continuous Bangla Sign Language Translation project improves translation using a novel method that combines transformer and STGCN-LSTM architectures, achieving superior results and establishing benchmarks for accessibility.
Details
Motivation: Sign language often faces underestimation in spoken language-prioritizing societies, creating communication barriers. This project aims to bridge that gap through improved translation methods. Method: The method combines transformer and STGCN-LSTM architectures, exploring architectural fusion and various fusion strategies for gloss-free translation. Result: The proposed method achieves state-of-the-art performance on multiple sign language datasets, showing significant BLEU-4 score improvements over existing methods. Conclusion: The project successfully enhances sign language translation by integrating graph-based methods with transformer architecture, setting a benchmark for future research and emphasizing improved communication accessibility. Abstract: Millions of individuals worldwide are affected by deafness and hearing impairment. Sign language serves as a sophisticated means of communication for the deaf and hard of hearing. However, in societies that prioritize spoken languages, sign language often faces underestimation, leading to communication barriers and social exclusion. The Continuous Bangla Sign Language Translation project aims to address this gap by enhancing translation methods. While recent approaches leverage transformer architecture for state-of-the-art results, our method integrates graph-based methods with the transformer architecture. This fusion, combining transformer and STGCN-LSTM architectures, proves more effective in gloss-free translation. Our contributions include architectural fusion, exploring various fusion strategies, and achieving a new state-of-the-art performance on diverse sign language datasets, namely RWTH-PHOENIX-2014T, CSL-Daily, How2Sign, and BornilDB v1.0. 
Our approach outperforms current results across all datasets, with BLEU-4 improvements of 4.01, 2.07, and 0.5 over GASLT, GASLT, and slt_how2sign on RWTH-PHOENIX-2014T, CSL-Daily, and How2Sign, respectively. We also introduce the first benchmark on the BornilDB v1.0 dataset. Our method sets a benchmark for future research, emphasizing the importance of gloss-free translation to improve communication accessibility for the deaf and hard of hearing.
[66] Learning from Natural Language Feedback for Personalized Question Answering
Alireza Salemi,Hamed Zamani
Main category: cs.CL
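TL;DR: This study proposes VAC, a new framework for personalized question answering that uses natural language feedback to improve personalization, achieving significant gains across multiple domains.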
Details
Motivation: Scalar rewards sometimes provide weak, non-instructive feedback, limiting the efficiency and quality of personalized learning. Method: The paper proposes VAC, a new framework that replaces scalar rewards with natural language feedback and alternates between training the feedback model and the policy model. Result: The approach achieves significant improvements on the LaMP-QA benchmark, and human evaluations confirm the superior quality of the generated responses. Conclusion: Natural language feedback (NLF) provides a more effective signal for optimizing personalized question answering. Abstract: Personalization is crucial for enhancing both the effectiveness and user satisfaction of language technologies, particularly in information-seeking tasks like question answering. Current approaches for personalizing large language models (LLMs) often rely on retrieval-augmented generation (RAG), followed by reinforcement learning with scalar reward signals to teach models how to use retrieved personal context. We believe that these scalar rewards sometimes provide weak, non-instructive feedback, limiting learning efficiency and personalization quality. We introduce VAC, a novel framework for personalized response generation that replaces scalar rewards with natural language feedback (NLF) that is generated conditioned on the user profiles and the question narratives. NLF serves as a rich and actionable supervision signal, allowing the policy model to iteratively refine its outputs and internalize effective personalization strategies. Training alternates between optimizing the feedback model and fine-tuning the policy model on the improved responses, resulting in a policy model that no longer requires feedback at inference. Evaluation on the LaMP-QA benchmark that consists of three diverse domains demonstrates consistent and significant improvements over the state-of-the-art results. Human evaluations further confirm the superior quality of the generated responses. These results demonstrate that NLF provides more effective signals for optimizing personalized question answering.
[67] Thinking Inside the Mask: In-Place Prompting in Diffusion LLMs
Xiangqi Jin,Yuxuan Wang,Yifeng Gao,Zichen Wen,Biqing Qi,Dongrui Liu,Linfeng Zhang
Main category: cs.CL
TL;DR: This paper introduces ICE, a novel in-place prompting framework for diffusion large language models (dLLMs), which improves accuracy and significantly reduces computational overhead through iterative refinement and confidence-aware early exit.
Details
Motivation: The motivation stems from the limitations of prefix-only prompting in traditional large language models, which restricts bidirectional information flow. The study aims to leverage the bidirectional attention and iterative refinement capabilities of dLLMs for more flexible and efficient prompting. Method: The authors introduced ICE, an in-place prompting framework for dLLMs, which integrates prompts directly within masked token positions during iterative refinement and utilizes a confidence-aware early exit mechanism to reduce computational costs. Result: Extensive experiments showed that ICE achieved up to a 17.29% accuracy improvement with a 4.12× speedup on GSM8K and up to a 276.67× acceleration on MMLU while maintaining competitive performance. Conclusion: The study concludes that the proposed ICE framework effectively enhances prompting flexibility and significantly reduces computational overhead in diffusion large language models (dLLMs), demonstrating substantial performance improvements and acceleration across multiple benchmarks. Abstract: Although large language models (LLMs) have achieved remarkable success, their prefix-only prompting paradigm and sequential generation process offer limited flexibility for bidirectional information. Diffusion large language models (dLLMs) present new opportunities through their bidirectional attention mechanisms and iterative refinement processes, enabling more flexible in-place prompting strategies. We introduce ICE (In-Place Chain-of-Thought Prompting with Early Exit), a novel framework that transforms prefix-only prompting into in-place prompting specifically designed for dLLMs. ICE integrates in-place prompts directly within masked token positions during iterative refinement and employs a confidence-aware early exit mechanism to significantly reduce computational overhead. 
Extensive experiments demonstrate ICE's effectiveness, achieving up to 17.29% accuracy improvement with 4.12$\times$ speedup on GSM8K, and up to 276.67$\times$ acceleration on MMLU while maintaining competitive performance.
[68] Beyond "Not Novel Enough": Enriching Scholarly Critique with LLM-Assisted Feedback
Osama Mohammed Afzal,Preslav Nakov,Tom Hope,Iryna Gurevych
Main category: cs.CL
TL;DR: The paper proposes a structured LLM-assisted approach for automated novelty evaluation in peer review, achieving high alignment with human reasoning and improving consistency in assessments.
Details
Motivation: Novelty assessment is crucial yet understudied in peer review, especially in high-volume fields like NLP where reviewer capacity is strained. Method: The method involves a three-stage process: content extraction from submissions, retrieval and synthesis of related work, and structured comparison for evidence-based assessment, informed by human-written novelty reviews. Result: Evaluated on 182 ICLR 2025 submissions, the approach achieved 86.5% alignment with human reasoning and 75.3% agreement on novelty conclusions, outperforming existing LLM-based baselines. Conclusion: The structured LLM-assisted approach for novelty assessment shows potential in enhancing peer review rigor and transparency without replacing human expertise. Abstract: Novelty assessment is a central yet understudied aspect of peer review, particularly in high volume fields like NLP where reviewer capacity is increasingly strained. We present a structured approach for automated novelty evaluation that models expert reviewer behavior through three stages: content extraction from submissions, retrieval and synthesis of related work, and structured comparison for evidence based assessment. Our method is informed by a large scale analysis of human written novelty reviews and captures key patterns such as independent claim verification and contextual reasoning. Evaluated on 182 ICLR 2025 submissions with human annotated reviewer novelty assessments, the approach achieves 86.5% alignment with human reasoning and 75.3% agreement on novelty conclusions - substantially outperforming existing LLM based baselines. The method produces detailed, literature aware analyses and improves consistency over ad hoc reviewer judgments. These results highlight the potential for structured LLM assisted approaches to support more rigorous and transparent peer review without displacing human expertise. Data and code are made available.
[69] Reinforced Language Models for Sequential Decision Making
Jim Dilkes,Vahid Yazdanpanah,Sebastian Stein
Main category: cs.CL
TL;DR: This paper introduces MS-GRPO, a new post-training method for smaller language models, enabling them to excel in sequential decision-making tasks like Frozen Lake, outperforming much larger models.
Details
Motivation: The motivation is to address the limitations of large language models (LLMs) that rely on computationally expensive architectures and the lack of effective post-training methods for multi-step agentic tasks. Method: The paper introduces Multi-Step Group-Relative Policy Optimization (MS-GRPO), grounded in Text-Mediated Stochastic Game (TSMG) and Language-Agent Policy (LAP) frameworks, along with an absolute-advantage-weighted episode sampling strategy. Result: The experiments show that a 3B parameter model post-trained with MS-GRPO outperforms a 72B baseline by 50% on the Frozen Lake task. Conclusion: The paper concludes that targeted post-training methods like MS-GRPO offer a practical and efficient alternative to large-scale models for sequential decision-making tasks. Abstract: Large Language Models (LLMs) show potential as sequential decision-making agents, but their application is often limited due to a reliance on large, computationally expensive models. This creates a need to improve smaller models, yet existing post-training methods are designed for single-turn interactions and cannot handle credit assignment in multi-step agentic tasks. To address this, we introduce Multi-Step Group-Relative Policy Optimization (MS-GRPO), a new algorithm for post-training LLM agents, grounded in formal Text-Mediated Stochastic Game (TSMG) and Language-Agent Policy (LAP) frameworks. For credit assignment, MS-GRPO attributes the entire cumulative episode reward to each individual episode step. We supplement this algorithm with a novel absolute-advantage-weighted episode sampling strategy that we show improves training performance. We evaluate our approach by post-training a 3-billion parameter model on Snake and Frozen Lake. Our experiments demonstrate that the method is effective in improving decision-making performance: our post-trained 3B parameter model outperforms a 72B parameter baseline by 50% on the Frozen Lake task. 
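The credit-assignment rule stated in this abstract, where each step of an episode receives the entire cumulative episode reward, can be illustrated with a short sketch. The group-relative normalization below (subtracting the group mean and dividing by the group standard deviation) follows the common GRPO convention and is an assumption here; the abstract does not give the exact formula, and the function name is hypothetical.

```python
from statistics import mean, pstdev

def step_advantages(episodes, eps=1e-8):
    """Assign one group-relative advantage to every step of each episode.

    episodes: list of per-step reward lists, all sampled for the same task.
    Returns a list of per-step advantage lists with matching lengths.
    The mean/std normalization is an assumed GRPO-style choice.
    """
    # Cumulative (undiscounted) return of each episode in the group.
    returns = [sum(rewards) for rewards in episodes]
    mu = mean(returns)
    sigma = pstdev(returns)
    advantages = []
    for ret, rewards in zip(returns, episodes):
        adv = (ret - mu) / (sigma + eps)
        # Every step in the episode gets the same episode-level advantage.
        advantages.append([adv] * len(rewards))
    return advantages

if __name__ == "__main__":
    group = [[0.0, 0.0, 1.0], [0.0, 0.0]]  # one success, one failure
    for adv in step_advantages(group):
        print(adv)
```

Because the advantage is constant within an episode, this sketch does not distinguish good steps from bad steps inside a single trajectory; it only ranks whole episodes against the rest of their group.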
This work demonstrates that targeted post-training is a practical and efficient alternative to relying on model scale for creating sequential decision-making agents using LLMs.
[70] Psyche-R1: Towards Reliable Psychological LLMs through Unified Empathy, Expertise, and Reasoning
Chongyuan Dai,Jinpeng Hu,Hongchang Shi,Zhuo Li,Xun Yang,Meng Wang
Main category: cs.CL
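TL;DR: This paper proposes Psyche-R1, a Chinese psychological large language model that combines reasoning, expertise, and empathy, and demonstrates its effectiveness in psychological applications.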
Details
Motivation: Qualified mental health professionals are in short supply, and integrating large language models (LLMs) into psychological applications offers a promising way to alleviate the growing burden of mental health disorders. Method: A dataset of 75k high-quality psychological questions and 73k empathetic dialogues is constructed, and a hybrid training strategy is used, combining a multi-LLM cross-selection strategy with group relative policy optimization (GRPO) and supervised fine-tuning (SFT). Result: Psyche-R1 performs well on several psychological benchmarks, with the 7B version achieving results comparable to the 671B DeepSeek-R1. Conclusion: The development of Psyche-R1 demonstrates the feasibility of combining reasoning ability and empathy in the psychological domain, with performance comparable to much larger models. Abstract: Amidst a shortage of qualified mental health professionals, the integration of large language models (LLMs) into psychological applications offers a promising way to alleviate the growing burden of mental health disorders. Recent reasoning-augmented LLMs have achieved remarkable performance in mathematics and programming, while research in the psychological domain has predominantly emphasized emotional support and empathetic dialogue, with limited attention to reasoning mechanisms that are beneficial to generating reliable responses. Therefore, in this paper, we propose Psyche-R1, the first Chinese psychological LLM that jointly integrates empathy, psychological expertise, and reasoning, built upon a novel data curation pipeline. Specifically, we design a comprehensive data synthesis pipeline that produces over 75k high-quality psychological questions paired with detailed rationales, generated through chain-of-thought (CoT) reasoning and iterative prompt-rationale optimization, along with 73k empathetic dialogues. Subsequently, we employ a hybrid training strategy wherein challenging samples are identified through a multi-LLM cross-selection strategy for group relative policy optimization (GRPO) to improve reasoning ability, while the remaining data is used for supervised fine-tuning (SFT) to enhance empathetic response generation and psychological domain knowledge. Extensive experiment results demonstrate the effectiveness of the Psyche-R1 across several psychological benchmarks, where our 7B Psyche-R1 achieves comparable results to 671B DeepSeek-R1.
[71] From Black Box to Transparency: Enhancing Automated Interpreting Assessment with Explainable AI in College Classrooms
Zhaokun Jiang,Ziyin Zhang
Main category: cs.CL
TL;DR: This paper proposes an explainable machine learning framework for interpreting quality assessment, combining feature engineering, data augmentation, and SHAP analysis to provide transparent and reliable evaluation with diagnostic feedback for learners.
Details
Motivation: The motivation stems from the limitations in existing research on automated interpreting quality assessment, including insufficient examination of language use quality, ineffective modeling due to data scarcity and imbalance, and the lack of efforts to explain model predictions. Method: The study employs a multi-dimensional modeling framework integrating feature engineering, data augmentation, and explainable machine learning. It utilizes construct-relevant, transparent features and applies Shapley Value (SHAP) analysis for explainability. Result: The approach demonstrated strong predictive performance on an English-Chinese consecutive interpreting dataset, identifying BLEURT and CometKiwi scores as the strongest predictors for fidelity, pause-related features for fluency, and Chinese-specific phraseological diversity metrics for language use. Conclusion: The study concludes that the proposed multi-dimensional modeling framework offers a scalable, reliable, and transparent alternative to traditional human evaluation by prioritizing explainability, thereby facilitating detailed diagnostic feedback for learners. Abstract: Recent advancements in machine learning have spurred growing interest in automated interpreting quality assessment. Nevertheless, existing research suffers from insufficient examination of language use quality, unsatisfactory modeling effectiveness due to data scarcity and imbalance, and a lack of efforts to explain model predictions. To address these gaps, we propose a multi-dimensional modeling framework that integrates feature engineering, data augmentation, and explainable machine learning. This approach prioritizes explainability over "black box" predictions by utilizing only construct-relevant, transparent features and conducting Shapley Value (SHAP) analysis. 
Our results demonstrate strong predictive performance on a novel English-Chinese consecutive interpreting dataset, identifying BLEURT and CometKiwi scores to be the strongest predictive features for fidelity, pause-related features for fluency, and Chinese-specific phraseological diversity metrics for language use. Overall, by placing particular emphasis on explainability, we present a scalable, reliable, and transparent alternative to traditional human evaluation, facilitating the provision of detailed diagnostic feedback for learners and supporting self-regulated learning advantages not afforded by automated scores in isolation.
[72] SSRL: Self-Search Reinforcement Learning
Yuchen Fan,Kaiyan Zhang,Heng Zhou,Yuxin Zuo,Yanxu Chen,Yu Fu,Xinwei Long,Xuekai Zhu,Che Jiang,Yuchen Zhang,Li Kang,Gang Chen,Cheng Huang,Zhizhou He,Bingning Wang,Lei Bai,Ning Ding,Bowen Zhou
Main category: cs.CL
TL;DR: This paper explores the use of large language models (LLMs) as efficient simulators for reinforcement learning (RL) tasks, introducing Self-Search RL (SSRL) to enhance LLMs' search capabilities through rewards. The study finds that LLMs can effectively utilize internal knowledge for high performance, reduce hallucination, and integrate with external search engines seamlessly.
Details
Motivation: We investigate the potential of large language models (LLMs) to serve as efficient simulators for agentic search tasks in reinforcement learning (RL), thereby reducing dependence on costly interactions with external search engines. Method: We first quantify the intrinsic search capability of LLMs via structured prompting and repeated sampling, which we term Self-Search. We introduce Self-Search RL (SSRL), which enhances LLMs' Self-Search capability through format-based and rule-based rewards. Result: LLMs exhibit strong scaling behavior with respect to the inference budget, achieving high pass@k on question-answering benchmarks, including the challenging BrowseComp task. Empirical evaluations demonstrate that SSRL-trained policy models provide a cost-effective and stable environment for search-driven RL training, reducing reliance on external search engines and facilitating robust sim-to-real transfer. Conclusion: 1) LLMs possess world knowledge that can be effectively elicited to achieve high performance; 2) SSRL demonstrates the potential of leveraging internal knowledge to reduce hallucination; 3) SSRL-trained models integrate seamlessly with external search engines without additional effort. Abstract: We investigate the potential of large language models (LLMs) to serve as efficient simulators for agentic search tasks in reinforcement learning (RL), thereby reducing dependence on costly interactions with external search engines. To this end, we first quantify the intrinsic search capability of LLMs via structured prompting and repeated sampling, which we term Self-Search. Our results reveal that LLMs exhibit strong scaling behavior with respect to the inference budget, achieving high pass@k on question-answering benchmarks, including the challenging BrowseComp task. Building on these observations, we introduce Self-Search RL (SSRL), which enhances LLMs' Self-Search capability through format-based and rule-based rewards. 
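The abstract reports pass@k under repeated sampling but does not define it. A minimal sketch, assuming the widely used unbiased estimator computed from n sampled answers of which c are correct; the function name `pass_at_k` and the sample counts are illustrative and not taken from the paper:

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate the probability that at least one of k samples is correct.

    n: total number of sampled answers, c: number of correct ones,
    k: sampling budget being evaluated (1 <= k <= n).
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k incorrect samples: every size-k subset has a correct one.
        return 1.0
    # 1 minus the probability that a random size-k subset is all incorrect.
    return 1.0 - comb(n - c, k) / comb(n, k)


# Hypothetical counts: 4 samples, 2 of them correct, budget k = 2.
example = pass_at_k(4, 2, 2)
```

With these hypothetical counts the estimate is 1 - 1/6, about 0.83; larger k can only raise the value, which is the scaling with inference budget that the abstract refers to.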
SSRL enables models to iteratively refine their knowledge utilization internally, without requiring access to external tools. Empirical evaluations demonstrate that SSRL-trained policy models provide a cost-effective and stable environment for search-driven RL training, reducing reliance on external search engines and facilitating robust sim-to-real transfer. We draw the following conclusions: 1) LLMs possess world knowledge that can be effectively elicited to achieve high performance; 2) SSRL demonstrates the potential of leveraging internal knowledge to reduce hallucination; 3) SSRL-trained models integrate seamlessly with external search engines without additional effort. Our findings highlight the potential of LLMs to support more scalable RL agent training.
[73] A Survey on Diffusion Language Models
Tianyi Li,Mingda Chen,Bowei Guo,Zhiqiang Shen
Main category: cs.CL
TL;DR: The paper surveys the current state of Diffusion Language Models (DLMs) as a promising alternative to autoregressive models in natural language processing.
Details
Motivation: DLMs offer advantages in reducing inference latency and capturing bidirectional context, making them a compelling choice for NLP tasks. The paper aims to provide a comprehensive overview of the current DLM landscape. Method: The paper provides a survey of the current state of DLMs, including their evolution, foundational principles, pre-training strategies, post-training methods, inference strategies, and multimodal extensions. Result: The paper presents a taxonomy of DLMs, analyzes current techniques, and discusses limitations and challenges, as well as future research directions. Conclusion: DLMs show great promise as an alternative to AR models and are likely to continue evolving in the future. Abstract: Diffusion Language Models (DLMs) are rapidly emerging as a powerful and promising alternative to the dominant autoregressive (AR) paradigm. By generating tokens in parallel through an iterative denoising process, DLMs possess inherent advantages in reducing inference latency and capturing bidirectional context, thereby enabling fine-grained control over the generation process. While achieving a several-fold speed-up, recent advancements have allowed DLMs to show performance comparable to their autoregressive counterparts, making them a compelling choice for various natural language processing tasks. In this survey, we provide a holistic overview of the current DLM landscape. We trace its evolution and relationship with other paradigms, such as autoregressive and masked language models, and cover both foundational principles and state-of-the-art models. Our work offers an up-to-date, comprehensive taxonomy and an in-depth analysis of current techniques, from pre-training strategies to advanced post-training methods. Another contribution of this survey is a thorough review of DLM inference strategies and optimizations, including improvements in decoding parallelism, caching mechanisms, and generation quality. 
We also highlight the latest approaches to multimodal extensions of DLMs and delineate their applications across various practical scenarios. Furthermore, our discussion addresses the limitations and challenges of DLMs, including efficiency, long-sequence handling, and infrastructure requirements, while outlining future research directions to sustain progress in this rapidly evolving field. Project GitHub is available at https://github.com/VILA-Lab/Awesome-DLMs.
cs.CV [Back]
[74] Stochastic-based Patch Filtering for Few-Shot Learning
Javier Rodenas,Eduardo Aguilar,Petia Radeva
Main category: cs.CV
TL;DR: The paper introduces SPFF, a novel approach for few-shot learning that improves food image classification by focusing on relevant features while filtering out noise, outperforming current state-of-the-art methods.
Details
Motivation: Food images pose challenges for few-shot learning due to their complexity and variability, often leading to misclassification by diverting focus from key elements. Method: The proposed method, Stochastic-based Patch Filtering for Few-Shot Learning (SPFF), filters patch embeddings stochastically to focus on those with greater correlation to class representation. It uses a similarity matrix to quantify relationships between query and support images. Result: The SPFF method demonstrated improved performance in focusing on class-specific features while filtering out irrelevant details, validated through experiments on Food-101, VireoFood-172, and UECFood-256 benchmarks. Conclusion: SPFF effectively improves the focus on relevant features in food images for few-shot learning, outperforming existing methods. Abstract: Food images present unique challenges for few-shot learning models due to their visual complexity and variability. For instance, a pasta dish might appear with various garnishes on different plates and in diverse lighting conditions and camera perspectives. This problem leads to losing focus on the most important elements when comparing the query with support images, resulting in misclassification. To address this issue, we propose Stochastic-based Patch Filtering for Few-Shot Learning (SPFF) to attend to the patch embeddings that show greater correlation with the class representation. The key concept of SPFF involves the stochastic filtering of patch embeddings, where patches less similar to the class-aware embedding are more likely to be discarded. With patch embedding filtered according to the probability of appearance, we use a similarity matrix that quantifies the relationship between the query image and its respective support images. Through a qualitative analysis, we demonstrate that SPFF effectively focuses on patches where class-specific food features are most prominent while successfully filtering out non-relevant patches. 
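The filtering idea described above can be sketched in a few lines. This is only an illustration under stated assumptions, not the authors' implementation: the keep probability is taken to be a linear rescaling of cosine similarity to the class-aware embedding, and the names `stochastic_patch_filter`, `cosine_similarity`, and all toy vectors are hypothetical.

```python
import numpy as np


def cosine_similarity(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Pairwise cosine similarity between rows of a (N, D) and rows of b (M, D)."""
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a @ b.T


def stochastic_patch_filter(patches, class_embedding, rng):
    """Drop patches at random; less similar patches are more likely to be dropped."""
    sim = cosine_similarity(patches, class_embedding[None, :])[:, 0]
    keep_prob = (sim + 1.0) / 2.0  # assumption: linear map from [-1, 1] to [0, 1]
    keep = rng.random(len(patches)) < keep_prob
    if not keep.any():
        keep[np.argmax(sim)] = True  # always retain the best-matching patch
    return patches[keep], keep_prob


rng = np.random.default_rng(0)
class_embedding = np.array([1.0, 0.0])  # toy class-aware embedding
query_patches = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]])
support_patches = np.array([[2.0, 0.0], [1.0, 1.0]])

kept_query, query_keep_prob = stochastic_patch_filter(query_patches, class_embedding, rng)
kept_support, _ = stochastic_patch_filter(support_patches, class_embedding, rng)

# Similarity matrix between the surviving query and support patches.
similarity_matrix = cosine_similarity(kept_query, kept_support)
```

In this toy setup the patch aligned with the class embedding is always kept and the opposite one is never kept, while the orthogonal patch survives about half the time; the resulting matrix would then feed whatever query-to-support scoring the model uses.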
We validate our approach through extensive experiments on few-shot classification benchmarks: Food-101, VireoFood-172 and UECFood-256, outperforming the existing SoA methods.
[75] DINOv3
Oriane Siméoni,Huy V. Vo,Maximilian Seitzer,Federico Baldassarre,Maxime Oquab,Cijo Jose,Vasil Khalidov,Marc Szafraniec,Seungeun Yi,Michaël Ramamonjisoa,Francisco Massa,Daniel Haziza,Luca Wehrstedt,Jianyuan Wang,Timothée Darcet,Théo Moutakanni,Leonel Sentana,Claire Roberts,Andrea Vedaldi,Jamie Tolan,John Brandt,Camille Couprie,Julien Mairal,Hervé Jégou,Patrick Labatut,Piotr Bojanowski
Main category: cs.CV
TL;DR: DINOv3 is a self-supervised vision model that eliminates the need for manual data annotation, scales to large datasets and architectures, and outperforms existing models on a wide range of vision tasks.
Details
Motivation: The motivation is to realize the vision of self-supervised learning by eliminating the need for manual data annotation, enabling models to learn visual representations from diverse sources using a single algorithm, and scaling effortlessly to massive datasets and larger architectures. Method: DINOv3 leverages scaling of datasets and model size, introduces a new method called Gram anchoring to address degradation in dense feature maps during training, and applies post-hoc strategies to enhance model flexibility regarding resolution, size, and alignment with text. Result: DINOv3 achieves high-quality dense features that significantly surpass previous self- and weakly-supervised foundation models, achieving outstanding performance on various vision tasks. Conclusion: DINOv3 is a versatile vision foundation model that outperforms specialized state-of-the-art models across a broad range of settings without fine-tuning, and it offers scalable solutions for diverse resource constraints and deployment scenarios. Abstract: Self-supervised learning holds the promise of eliminating the need for manual data annotation, enabling models to scale effortlessly to massive datasets and larger architectures. By not being tailored to specific tasks or domains, this training paradigm has the potential to learn visual representations from diverse sources, ranging from natural to aerial images -- using a single algorithm. This technical report introduces DINOv3, a major milestone toward realizing this vision by leveraging simple yet effective strategies. First, we leverage the benefit of scaling both dataset and model size by careful data preparation, design, and optimization. Second, we introduce a new method called Gram anchoring, which effectively addresses the known yet unsolved issue of dense feature maps degrading during long training schedules. 
Finally, we apply post-hoc strategies that further enhance our models' flexibility with respect to resolution, model size, and alignment with text. As a result, we present a versatile vision foundation model that outperforms the specialized state of the art across a broad range of settings, without fine-tuning. DINOv3 produces high-quality dense features that achieve outstanding performance on various vision tasks, significantly surpassing previous self- and weakly-supervised foundation models. We also share the DINOv3 suite of vision models, designed to advance the state of the art on a wide spectrum of tasks and data by providing scalable solutions for diverse resource constraints and deployment scenarios.
[76] Empowering Morphing Attack Detection using Interpretable Image-Text Foundation Model
Sushrut Patwardhan,Raghavendra Ramachandra,Sushma Venkatesh
Main category: cs.CV
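TL;DR: This paper proposes a CLIP-based multimodal learning framework for face morphing attack detection under zero-shot evaluation that can also generate corresponding textual descriptions; experiments show strong performance across a variety of data and attack types.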
Details
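Motivation: Effective face morphing attack detection is needed to keep face recognition systems secure in reliable verification scenarios. Existing methods may not provide interpretable descriptions of attacks, so the paper proposes a new multimodal learning approach that can generate textual explanations. Method: The paper proposes a multimodal learning approach built on the Contrastive Language-Image Pretraining (CLIP) model to provide textual descriptions for morphing attack detection. The framework is validated through zero-shot evaluation, and ten different types of textual prompts are analyzed. Result: The proposed framework shows good generalization in zero-shot evaluation, accurately detecting morphing attacks and matching relevant textual descriptions. Experiments also indicate that it outperforms existing methods across multiple morphing generation techniques and different mediums. Conclusion: Experimental results show that the proposed multimodal learning framework can effectively detect face morphing attacks in zero-shot evaluation and predict the most relevant text snippet. In addition, it outperforms existing pre-trained neural networks across multiple morphing generation techniques and different mediums. Abstract: Morphing attack detection has become an essential component of face recognition systems for ensuring a reliable verification scenario. In this paper, we present a multimodal learning approach that can provide a textual description of morphing attack detection. We first show that zero-shot evaluation of the proposed framework using Contrastive Language-Image Pretraining (CLIP) can yield not only generalizable morphing attack detection, but also predict the most relevant text snippet. We present an extensive analysis of ten different textual prompts that include both short and long textual prompts. These prompts are engineered by considering the human understandable textual snippet. Extensive experiments were performed on a face morphing dataset that was developed using a publicly available face biometric dataset. We present an evaluation of SOTA pre-trained neural networks together with the proposed framework in the zero-shot evaluation of five different morphing generation techniques that are captured in three different mediums.
[77] Interpretable Oracle Bone Script Decipherment through Radical and Pictographic Analysis with LVLMs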
Kaixin Peng,Mengyang Zhao,Haiyang Yu,Teng Fu,Bin Li
Main category: cs.CV
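TL;DR: This paper proposes an interpretable method based on large vision-language models for Oracle Bone Script decipherment, which significantly improves zero-shot performance and provides archaeological reference for undeciphered Oracle Bone Script.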
Details
Motivation: Oracle Bone Script (OBS) decipherment has long been challenging because of the script's rarity, abstractness, and pictographic diversity. Existing deep learning methods show limited generalization and interpretability in zero-shot settings and on undeciphered OBS, so a method with stronger logical and semantic understanding is needed. Method: The paper adopts a progressive training strategy that combines radical analysis with pictograph-semantic understanding, and designs a Radical-Pictographic Dual Matching mechanism to bridge the gap between OBS glyphs and meanings. Result: The method achieves state-of-the-art Top-10 accuracy on public benchmarks and performs strongly on zero-shot decipherment. The model produces a logical analysis process and may provide valuable references for undeciphered OBS. Conclusion: The paper proposes an interpretable OBS decipherment method based on Large Vision-Language Models and builds a new dataset, effectively improving decipherment performance in the zero-shot setting. The method also delivers a logical analysis process that may provide archaeologically valuable reference results for undeciphered OBS. Abstract: As the oldest mature writing system, Oracle Bone Script (OBS) has long posed significant challenges for archaeological decipherment due to its rarity, abstractness, and pictographic diversity. Current deep learning-based methods have made exciting progress on the OBS decipherment task, but existing approaches often ignore the intricate connections between glyphs and the semantics of OBS. This results in limited generalization and interpretability, especially when addressing zero-shot settings and undeciphered OBS. To this end, we propose an interpretable OBS decipherment method based on Large Vision-Language Models, which synergistically combines radical analysis and pictograph-semantic understanding to bridge the gap between glyphs and meanings of OBS. Specifically, we propose a progressive training strategy that guides the model from radical recognition and analysis to pictographic analysis and mutual analysis, thus enabling reasoning from glyph to meaning. We also design a Radical-Pictographic Dual Matching mechanism informed by the analysis results, significantly enhancing the model's zero-shot decipherment performance. To facilitate model training, we propose the Pictographic Decipherment OBS Dataset, which comprises 47,157 Chinese characters annotated with OBS images and pictographic analysis texts. Experimental results on public benchmarks demonstrate that our approach achieves state-of-the-art Top-10 accuracy and superior zero-shot decipherment capabilities. 
More importantly, our model delivers logical analysis processes, possibly providing archaeologically valuable reference results for undeciphered OBS, and thus has potential applications in digital humanities and historical research. The dataset and code will be released in https://github.com/PKXX1943/PD-OBS.
[78] Deep Learning Enables Large-Scale Shape and Appearance Modeling in Total-Body DXA Imaging
Arianna Bunnell,Devon Cataldi,Yannik Glaser,Thomas K. Wolfgruber,Steven Heymsfield,Alan B. Zonderman,Thomas L. Kelly,Peter Sadowski,John A. Shepherd
Main category: cs.CV
TL;DR: A highly accurate deep learning method for automatic fiducial point placement on TBDXA scans was developed and validated, which can be used to generate new hypotheses about the relationship between body composition and health markers.
Details
Motivation: The motivation was to develop a highly accurate and cost-effective method for body composition assessment using TBDXA imaging, which could generate new hypotheses about the relationship between body composition and health markers. Method: A deep learning method was developed and validated for automatic fiducial point placement on TBDXA scans using 1,683 manually-annotated TBDXA scans. Shape and appearance modeling (SAM) was used to place keypoints on 35,928 scans for five different TBDXA imaging modes. Result: The method achieved 99.5% percentage correct keypoints in an external testing dataset. SAM feature distributions associated with health biomarkers corroborated existing evidence and generated new hypotheses. Conclusion: The study concludes that the deep learning method for automatic fiducial point placement on TBDXA scans is highly accurate and useful for generating hypotheses on body composition and shape's relationship to various health markers. Abstract: Total-body dual X-ray absorptiometry (TBDXA) imaging is a relatively low-cost whole-body imaging modality, widely used for body composition assessment. We develop and validate a deep learning method for automatic fiducial point placement on TBDXA scans using 1,683 manually-annotated TBDXA scans. The method achieves 99.5% percentage correct keypoints in an external testing dataset. To demonstrate the value for shape and appearance modeling (SAM), our method is used to place keypoints on 35,928 scans for five different TBDXA imaging modes, then associations with health markers are tested in two cohorts not used for SAM model generation using two-sample Kolmogorov-Smirnov tests. SAM feature distributions associated with health biomarkers are shown to corroborate existing evidence and generate new hypotheses on body composition and shape's relationship to various frailty, metabolic, inflammation, and cardiometabolic health markers. 
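The two-sample Kolmogorov-Smirnov test mentioned above compares the empirical distributions of a feature in two groups. A minimal sketch of the test statistic only (no p-value), using the standard definition as the largest gap between the two empirical CDFs; the function name `ks_statistic` and the sample values are hypothetical and unrelated to the study's data:

```python
from bisect import bisect_right


def ks_statistic(sample_a, sample_b) -> float:
    """Largest absolute difference between the two empirical CDFs."""
    a, b = sorted(sample_a), sorted(sample_b)
    if not a or not b:
        raise ValueError("both samples must be non-empty")
    # The gap can only change at observed values, so checking those suffices.
    return max(
        abs(bisect_right(a, v) / len(a) - bisect_right(b, v) / len(b))
        for v in a + b
    )


# Hypothetical values of one shape feature in two groups split by a biomarker.
group_low = [1.0, 2.0, 3.0, 4.0]
group_high = [3.0, 4.0, 5.0, 6.0]
statistic = ks_statistic(group_low, group_high)
```

A statistic near 0 means the feature is distributed similarly in both groups, and a value near 1 means the groups barely overlap. In practice a library routine that also returns a p-value would normally be used instead of this hand-rolled version.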
Evaluation scripts, model weights, automatic point file generation code, and triangulation files are available at https://github.com/hawaii-ai/dxa-pointplacement.
[79] MANGO: Multimodal Attention-based Normalizing Flow Approach to Fusion Learning
Thanh-Dat Truong,Christophe Bobda,Nitin Agarwal,Khoa Luu
Main category: cs.CV
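TL;DR: This paper proposes a new Multimodal Attention-based Normalizing Flow (MANGO) approach for developing explicit, interpretable, and tractable multimodal fusion learning.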
Details
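Motivation: Current multimodal fusion methods use the attention mechanism of Transformers to implicitly learn the underlying correlations of multimodal features. As a result, multimodal models cannot capture the essential features of each modality, which makes it hard for them to comprehend the complex structures and correlations of multimodal inputs. Method: The paper proposes a new Invertible Cross-Attention (ICA) layer to build a normalizing-flow-based model for multimodal data. To capture the complex underlying correlations in multimodal data efficiently, three new cross-attention mechanisms are designed within the ICA layer: Modality-to-Modality Cross-Attention (MMCA), Inter-Modality Cross-Attention (IMCA), and Learnable Inter-Modality Cross-Attention (LICA). A new Multimodal Attention-based Normalizing Flow is introduced so that the method scales to high-dimensional multimodal data. Result: Experiments show that the method achieves state-of-the-art performance on three different multimodal learning tasks: semantic segmentation, image-to-image translation, and movie genre classification. Conclusion: The paper proposes a new multimodal fusion method that, by introducing an invertible cross-attention layer and a multimodal attention-based normalizing flow, effectively captures the complex underlying correlations in multimodal data and improves the performance of multimodal models. Abstract: Multimodal learning has gained much success in recent years. However, current multimodal fusion methods adopt the attention mechanism of Transformers to implicitly learn the underlying correlation of multimodal features. As a result, the multimodal model cannot capture the essential features of each modality, making it difficult to comprehend complex structures and correlations of multimodal inputs. This paper introduces a novel Multimodal Attention-based Normalizing Flow (MANGO) approach (the source code of this work will be publicly available) to developing explicit, interpretable, and tractable multimodal fusion learning. In particular, we propose a new Invertible Cross-Attention (ICA) layer to develop the Normalizing Flow-based Model for multimodal data. To efficiently capture the complex, underlying correlations in multimodal data in our proposed invertible cross-attention layer, we propose three new cross-attention mechanisms: Modality-to-Modality Cross-Attention (MMCA), Inter-Modality Cross-Attention (IMCA), and Learnable Inter-Modality Cross-Attention (LICA). Finally, we introduce a new Multimodal Attention-based Normalizing Flow to enable the scalability of our proposed method to high-dimensional multimodal data. 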
Our experimental results on three different multimodal learning tasks, i.e., semantic segmentation, image-to-image translation, and movie genre classification, have illustrated the state-of-the-art (SoTA) performance of the proposed approach.
[80] Improving watermelon (Citrullus lanatus) disease classification with generative artificial intelligence (GenAI)-based synthetic and real-field images via a custom EfficientNetV2-L model
Nitin Rai,Nathan S. Boyd,Gary E. Vallad,Arnold W. Schumann
Main category: cs.CV
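TL;DR: This study finds that combining a small number of real images with a large number of synthetic images can significantly improve the performance of a watermelon disease classification model.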
Details
Motivation: Generative artificial intelligence (GenAI) models open up new possibilities for generating high-resolution synthetic images, but research on whether combining real and synthetic images improves disease classification performance is lacking. Method: The training dataset was divided into five treatments (H0-H4), models were trained with a modified EfficientNetV2-L architecture using fine-tuning and transfer learning, and the effect of each treatment on model performance was evaluated. Result: Models trained on the H2, H3, and H4 treatments performed well on precision, recall, and F1-score, with the weighted F1-score rising from 0.65 on H0 to 1.00 on H3-H4. Conclusion: The study confirms that synthetic images cannot fully replace real images, and that combining the two maximizes the performance of crop disease classification models. Abstract: The current advancements in generative artificial intelligence (GenAI) models have paved the way for new possibilities for generating high-resolution synthetic images, thereby offering a promising alternative to traditional image acquisition for training computer vision models in agriculture. In the context of crop disease diagnosis, GenAI models are being used to create synthetic images of various diseases, potentially facilitating model creation and reducing the dependency on resource-intensive in-field data collection. However, limited research has been conducted on evaluating the effectiveness of integrating real with synthetic images to improve disease classification performance. Therefore, this study aims to investigate whether combining a limited number of real images with synthetic images can enhance the prediction accuracy of an EfficientNetV2-L model for classifying watermelon (Citrullus lanatus) diseases. The training dataset was divided into five treatments: H0 (only real images), H1 (only synthetic images), H2 (1:1 real-to-synthetic), H3 (1:10 real-to-synthetic), and H4 (H3 + random images to improve variability and model generalization). All treatments were trained using a custom EfficientNetV2-L architecture with enhanced fine-tuning and transfer learning techniques. Models trained on H2, H3, and H4 treatments demonstrated high precision, recall, and F1-score metrics. Additionally, the weighted F1-score increased from 0.65 (on H0) to 1.00 (on H3-H4) signifying that the addition of a small number of real images with a considerable volume of synthetic images improved model performance and generalizability. 
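The abstract reports a weighted F1-score without spelling out the computation. A minimal sketch, assuming the common definition in which per-class F1 scores are averaged with weights proportional to each class's share of the true labels; the function name `weighted_f1` and the class labels are hypothetical and not taken from the study:

```python
from collections import Counter


def weighted_f1(y_true, y_pred) -> float:
    """Per-class F1 averaged with weights equal to each class's true-label share."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    support = Counter(y_true)
    score = 0.0
    for label, count in support.items():
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        score += (count / len(y_true)) * f1
    return score


# Hypothetical labels for four test images.
y_true = ["diseased", "diseased", "diseased", "healthy"]
y_pred = ["diseased", "diseased", "healthy", "healthy"]
score = weighted_f1(y_true, y_pred)
```

Here the "diseased" class has F1 = 0.8 and "healthy" has F1 = 2/3, so the weighted score is 0.75 * 0.8 + 0.25 * 2/3, about 0.77. A score of 1.00, as reported for H3-H4, requires every test image to be classified correctly.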
Overall, this validates the findings that synthetic images alone cannot adequately substitute for real images; instead, both must be used in a hybrid manner to maximize model performance for crop disease classification.
[81] SynSpill: Improved Industrial Spill Detection With Synthetic Data
Aaditya Baranwal,Abdul Mueez,Jason Voelker,Guneet Bhatia,Shruti Vyas
Main category: cs.CV
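TL;DR: This study proposes a scalable framework that uses high-quality synthetic data to improve the performance of large-scale vision-language models and object detectors in safety-critical domains such as industrial spill detection.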
Details
Motivation: Large-scale Vision-Language Models (VLMs) perform markedly worse in niche, safety-critical domains such as industrial spill detection, because events in these domains are rare, sensitive, and difficult to annotate. This scarcity makes conventional fine-tuning of detectors infeasible in most industrial settings. Method: The paper introduces a scalable framework centered on a high-quality synthetic data generation pipeline, and shows how the resulting synthetic corpus enables effective Parameter-Efficient Fine-Tuning (PEFT) of VLMs and substantially improves object detectors such as YOLO and DETR. Result: The results underscore that high-fidelity synthetic data is a powerful means of bridging the domain gap in safety-critical applications. With the SynSpill dataset, both VLMs and detectors improve markedly and reach comparable performance. Conclusion: Combining synthetic data with lightweight adaptation offers a cost-effective, scalable path to deploying vision systems in industrial environments where data is scarce. Abstract: Large-scale Vision-Language Models (VLMs) have transformed general-purpose visual recognition through strong zero-shot capabilities. However, their performance degrades significantly in niche, safety-critical domains such as industrial spill detection, where hazardous events are rare, sensitive, and difficult to annotate. This scarcity -- driven by privacy concerns, data sensitivity, and the infrequency of real incidents -- renders conventional fine-tuning of detectors infeasible for most industrial settings. We address this challenge by introducing a scalable framework centered on a high-quality synthetic data generation pipeline. We demonstrate that this synthetic corpus enables effective Parameter-Efficient Fine-Tuning (PEFT) of VLMs and substantially boosts the performance of state-of-the-art object detectors such as YOLO and DETR. Notably, in the absence of synthetic data (SynSpill dataset), VLMs still generalize better to unseen spill scenarios than these detectors. When SynSpill is used, both VLMs and detectors achieve marked improvements, with their performance becoming comparable. Our results underscore that high-fidelity synthetic data is a powerful means to bridge the domain gap in safety-critical applications. The combination of synthetic generation and lightweight adaptation offers a cost-effective, scalable pathway for deploying vision systems in industrial environments where real data is scarce/impractical to obtain. Project Page: https://synspill.vercel.app
[82] EntropyGS: An Efficient Entropy Coding on 3D Gaussian Splatting
Yuning Huang,Jiahao Pang,Fengqing Zhu,Dong Tian
Main category: cs.CV
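TL;DR: This paper proposes EntropyGS, an efficient compression method for 3DGS Gaussian attributes that achieves high-quality compression through statistical distribution analysis and parameterized entropy coding.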
Details
Motivation: The two tasks of 3DGS, Gaussian creation and view rendering, are typically separated over time or devices, so Gaussian attributes need to be stored, transmitted, and compressed. Method: Based on an analysis of the correlations and statistical distributions of 3DGS Gaussian attributes, the paper proposes EntropyGS, a factorized and parameterized entropy coding method, with quantization performed adaptively according to attribute type. Result: EntropyGS achieves about a 30x rate reduction on benchmark datasets while maintaining rendering quality similar to the original 3DGS data, with fast encoding and decoding. Conclusion: EntropyGS compresses 3DGS Gaussian attributes efficiently, reducing the data rate by about 30x while preserving rendering quality, with fast encoding and decoding. Abstract: As an emerging novel view synthesis approach, 3D Gaussian Splatting (3DGS) demonstrates fast training/rendering with superior visual quality. The two tasks of 3DGS, Gaussian creation and view rendering, are typically separated over time or devices, and thus storage/transmission and finally compression of 3DGS Gaussians become necessary. We begin with a correlation and statistical analysis of 3DGS Gaussian attributes. An inspiring finding in this work reveals that spherical harmonic AC attributes precisely follow Laplace distributions, while mixtures of Gaussian distributions can approximate rotation, scaling, and opacity. Additionally, harmonic AC attributes manifest weak correlations with other attributes except for inherited correlations from a color space. A factorized and parameterized entropy coding method, EntropyGS, is hereinafter proposed. During encoding, distribution parameters of each Gaussian attribute are estimated to assist their entropy coding. The quantization for entropy coding is adaptively performed according to Gaussian attribute types. EntropyGS demonstrates about 30x rate reduction on benchmark datasets while maintaining similar rendering quality compared to input 3DGS data, with a fast encoding and decoding time.
[83] CellSymphony: Deciphering the molecular and phenotypic orchestration of cells with single-cell pathomics
Paul H. Acosta,Pingjun Chen,Simon P. Castillo,Maria Esther Salvatierra,Yinyin Yuan,Xiaoxi Pan
Main category: cs.CV
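TL;DR: CellSymphony is a multimodal framework that integrates Xenium transcriptomic and histology image data to achieve high-accuracy cell annotation and microenvironment analysis.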
Details
Motivation: Although histology images contain rich morphological information, extracting robust cell-level features from them and integrating those features with spatial transcriptomics data remains a critical challenge. Method: CellSymphony uses a foundation-model-based multimodal framework that fuses single-cell-resolution information from Xenium transcriptomic data and histology images. Result: CellSymphony uncovered distinct microenvironmental niches across three cancer types and achieved accurate cell type annotation. Conclusion: By learning joint representations of Xenium transcriptomic data and histology images, CellSymphony achieves accurate cell type annotation and reveals the microenvironmental niches of different cancer types. Abstract: Xenium, a new spatial transcriptomics platform, enables subcellular-resolution profiling of complex tumor tissues. Despite the rich morphological information in histology images, extracting robust cell-level features and integrating them with spatial transcriptomics data remains a critical challenge. We introduce CellSymphony, a flexible multimodal framework that leverages foundation model-derived embeddings from both Xenium transcriptomic profiles and histology images at true single-cell resolution. By learning joint representations that fuse spatial gene expression with morphological context, CellSymphony achieves accurate cell type annotation and uncovers distinct microenvironmental niches across three cancer types. This work highlights the potential of foundation models and multimodal fusion for deciphering the physiological and phenotypic orchestration of cells within complex tissue ecosystems.
[84] Deep Learning for Crack Detection: A Review of Learning Paradigms, Generalizability, and Datasets
Xinan Zhang,Haolin Wang,Yung-An Hsieh,Zhongyu Yang,Anthony Yezzi,Yi-Chang Tsai
Main category: cs.CV
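TL;DR: This paper reviews recent advances in deep learning for crack detection, including shifts in learning paradigms, dataset diversity, and generalizability, and introduces 3DCrack, a new 3D laser scan dataset, to support future research.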
Details
Motivation: Crack detection is crucial for civil infrastructure, and advances in deep learning are changing the field. Existing technical and review papers no longer adequately reflect emerging trends, so a systematic analysis of these changes and of future research directions is needed. Method: The paper systematically analyzes new trends in crack detection, highlights representative works, and introduces 3DCrack, a new dataset collected with 3D laser scans. It also runs extensive benchmarking experiments to establish baselines for commonly used deep learning methods. Result: The paper reveals how methodologies in crack detection are evolving, including the shift from fully supervised learning to other learning paradigms and greater dataset diversity. The benchmarking experiments provide baseline results for commonly used deep learning models. Conclusion: Deep learning is advancing crack detection, and by analyzing emerging trends and introducing the 3DCrack dataset, the paper provides directions and resources for future research. Abstract: Crack detection plays a crucial role in civil infrastructures, including inspection of pavements, buildings, etc., and deep learning has significantly advanced this field in recent years. While numerous technical and review papers exist in this domain, emerging trends are reshaping the landscape. These shifts include transitions in learning paradigms (from fully supervised learning to semi-supervised, weakly-supervised, unsupervised, few-shot, domain adaptation and fine-tuning foundation models), improvements in generalizability (from single-dataset performance to cross-dataset evaluation), and diversification in dataset reacquisition (from RGB images to specialized sensor-based data). In this review, we systematically analyze these trends and highlight representative works. Additionally, we introduce a new dataset collected with 3D laser scans, 3DCrack, to support future research and conduct extensive benchmarking experiments to establish baselines for commonly used deep learning methodologies, including recent foundation models. Our findings provide insights into the evolving methodologies and future directions in deep learning-based crack detection. Project page: https://github.com/nantonzhang/Awesome-Crack-Detection
[85] MRFD: Multi-Region Fusion Decoding with Self-Consistency for Mitigating Hallucinations in LVLMs
Haonan Ge,Yiwei Wang,Ming-Hsuan Yang,Yujun Cai
Main category: cs.CV
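TL;DR: This paper proposes Multi-Region Fusion Decoding (MRFD), a training-free decoding method that improves the factual grounding of Large Vision-Language Models (LVLMs) by modeling inter-region consistency, significantly reducing hallucinations and improving response factuality.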
Details
Motivation: Large Vision-Language Models (LVLMs) perform well on multimodal tasks, but because of their limited ability to verify information across different regions of an image, they often generate text that is inconsistent with the visual input (hallucinations). The authors propose MRFD to address this. Method: MRFD identifies salient regions using cross-attention, generates an initial response for each region, and computes reliability weights based on the Jensen-Shannon Divergence (JSD) among the responses. These weights guide a consistency-aware fusion that uses region-aware prompts inspired by Chain-of-Thought reasoning. Result: Experiments show that MRFD significantly reduces hallucinations and improves response factuality across multiple LVLMs and benchmarks, without requiring model updates. Conclusion: MRFD is an effective decoding method that improves the factual grounding and factuality of LVLM responses without any model training. Abstract: Large Vision-Language Models (LVLMs) have shown strong performance across multimodal tasks. However, they often produce hallucinations -- text that is inconsistent with visual input, due to the limited ability to verify information in different regions of the image. To address this, we propose Multi-Region Fusion Decoding (MRFD), a training-free decoding method that improves factual grounding by modeling inter-region consistency. MRFD identifies salient regions using cross-attention, generates initial responses for each, and computes reliability weights based on Jensen-Shannon Divergence (JSD) among the responses. These weights guide a consistency-aware fusion of per-region predictions, using region-aware prompts inspired by Chain-of-Thought reasoning. Experiments across multiple LVLMs and benchmarks show that MRFD significantly reduces hallucinations and improves response factuality without requiring model updates.
[86] Pose-Robust Calibration Strategy for Point-of-Gaze Estimation on Mobile Phones
Yujie Zhao,Jiabei Zeng,Shiguang Shan
Main category: cs.CV
TL;DR: This paper presents a dynamic calibration strategy for appearance-based point-of-gaze estimation that improves robustness to head pose variations.
Details
Motivation: Appearance-based point-of-gaze estimators struggle to generalize across individuals and are sensitive to head pose variations, requiring person-specific calibration strategies that can handle pose changes. Method: The authors created the MobilePoG benchmark with facial images from 32 individuals under varying head poses. They systematically analyzed how calibration point diversity and head pose variation affect estimation accuracy, then proposed a dynamic calibration strategy where users fixate on calibration points while moving their phones. Result: Experiments showed that introducing a wider range of head poses during calibration improves the estimator's ability to handle pose variation. The proposed dynamic calibration strategy resulted in a more robust point-of-gaze estimator compared to conventional approaches. Conclusion: The study successfully demonstrated that their dynamic calibration strategy produces a better calibrated point-of-gaze estimator that is less sensitive to head pose variations than conventional calibration strategies. Abstract: Although appearance-based point-of-gaze (PoG) estimation has improved, the estimators still struggle to generalize across individuals due to personal differences. Therefore, person-specific calibration is required for accurate PoG estimation. However, calibrated PoG estimators are often sensitive to head pose variations. To address this, we investigate the key factors influencing calibrated estimators and explore pose-robust calibration strategies. Specifically, we first construct a benchmark, MobilePoG, which includes facial images from 32 individuals focusing on designated points under either fixed or continuously changing head poses. Using this benchmark, we systematically analyze how the diversity of calibration points and head poses influences estimation accuracy. 
Our experiments show that introducing a wider range of head poses during calibration improves the estimator's ability to handle pose variation. Building on this insight, we propose a dynamic calibration strategy in which users fixate on calibration points while moving their phones. This strategy naturally introduces head pose variation during a user-friendly and efficient calibration process, ultimately producing a better calibrated PoG estimator that is less sensitive to head pose variations than those using conventional calibration strategies. Codes and datasets are available at our project page.
[87] High Fidelity Text to Image Generation with Contrastive Alignment and Structural Guidance
Danyi Gao
Main category: cs.CV
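TL;DR: This paper proposes a high-fidelity image generation method that combines text-image contrastive constraints with a structural guidance mechanism to improve semantic alignment accuracy and structural consistency.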
Details
Motivation: To address the performance bottlenecks of existing text-driven image generation methods in semantic alignment accuracy and structural consistency. Method: The approach introduces a contrastive learning module and uses structural priors (such as semantic layout maps or edge sketches) to guide the generator in spatial-level structural modeling. Result: Experiments on the COCO-2014 dataset show that the method performs strongly on CLIP Score, FID, and SSIM without increasing computational complexity. Conclusion: The method effectively combines semantic alignment with structural fidelity, offering a viable technical path for joint text-image modeling and image generation. Abstract: This paper addresses the performance bottlenecks of existing text-driven image generation methods in terms of semantic alignment accuracy and structural consistency. A high-fidelity image generation method is proposed by integrating text-image contrastive constraints with structural guidance mechanisms. The approach introduces a contrastive learning module that builds strong cross-modal alignment constraints to improve semantic matching between text and image. At the same time, structural priors such as semantic layout maps or edge sketches are used to guide the generator in spatial-level structural modeling. This enhances the layout completeness and detail fidelity of the generated images. Within the overall framework, the model jointly optimizes contrastive loss, structural consistency loss, and semantic preservation loss. A multi-objective supervision mechanism is adopted to improve the semantic consistency and controllability of the generated content. Systematic experiments are conducted on the COCO-2014 dataset. Sensitivity analyses are performed on embedding dimensions, text length, and structural guidance strength. Quantitative metrics confirm the superior performance of the proposed method in terms of CLIP Score, FID, and SSIM. The results show that the method effectively bridges the gap between semantic alignment and structural fidelity without increasing computational complexity. It demonstrates a strong ability to generate semantically clear and structurally complete images, offering a viable technical path for joint text-image modeling and image generation.
[88] VIFSS: View-Invariant and Figure Skating-Specific Pose Representation Learning for Temporal Action Segmentation
Ryota Tanaka,Tomohiro Suzuki,Keisuke Fujii
Main category: cs.CV
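TL;DR: This study proposes a new temporal action segmentation framework for recognizing figure skating jumps. It combines 3D pose representations with learning of procedural structure, addressing both the shortage of annotated data and the modeling of three-dimensional characteristics, and experiments show strong performance on this complex task.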
Details
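Motivation: In figure skating, accurately recognizing the type and timing of a skater's jumps is essential for objective performance evaluation, but the task usually requires expert-level knowledge. Existing temporal action segmentation methods have two major limitations when applied to figure skating: annotated data is insufficient, and the three-dimensional nature and procedural structure of jumps are not taken into account. Method: The study proposes a view-invariant, figure skating-specific pose representation learning approach (VIFSS) that uses contrastive learning for pre-training and action classification for fine-tuning. It also introduces a fine-grained annotation scheme that marks the "entry (preparation)" and "landing" phases of a jump to help the model learn the procedural structure of jumps. Result: The method achieves over 92% F1@50 on element-level temporal action segmentation, demonstrating its effectiveness in recognizing jump types and rotation levels. The study also shows that view-invariant contrastive pre-training is particularly effective when fine-tuning data is limited. Conclusion: The study presents a new temporal action segmentation framework that addresses two main problems in figure skating jump recognition: insufficient annotated data and the failure to account for the three-dimensional nature and procedural structure of the actions. Experiments show that the method performs well on element-level TAS and remains effective when fine-tuning data is limited, which indicates its potential for real-world use. Abstract: Understanding human actions from videos plays a critical role across various domains, including sports analytics. In figure skating, accurately recognizing the type and timing of jumps a skater performs is essential for objective performance evaluation. However, this task typically requires expert-level knowledge due to the fine-grained and complex nature of jump procedures. While recent approaches have attempted to automate this task using Temporal Action Segmentation (TAS), there are two major limitations to TAS for figure skating: the annotated data is insufficient, and existing methods do not account for the inherent three-dimensional aspects and procedural structure of jump actions. In this work, we propose a new TAS framework for figure skating jumps that explicitly incorporates both the three-dimensional nature and the semantic procedure of jump movements. First, we propose a novel View-Invariant, Figure Skating-Specific pose representation learning approach (VIFSS) that combines contrastive learning as pre-training and action classification as fine-tuning. For view-invariant contrastive pre-training, we construct FS-Jump3D, the first publicly available 3D pose dataset specialized for figure skating jumps. Second, we introduce a fine-grained annotation scheme that marks the "entry (preparation)" and "landing" phases, enabling TAS models to learn the procedural structure of jumps. Extensive experiments demonstrate the effectiveness of our framework.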
Our method achieves over 92% F1@50 on element-level TAS, which requires recognizing both jump types and rotation levels. Furthermore, we show that view-invariant contrastive pre-training is particularly effective when fine-tuning data is limited, highlighting the practicality of our approach in real-world scenarios.
[89] JRDB-Reasoning: A Difficulty-Graded Benchmark for Visual Reasoning in Robotics
Simindokht Jahangard,Mehrzad Mohammadi,Yi Shen,Zhixi Cai,Hamid Rezatofighi
Main category: cs.CV
TL;DR: The paper introduces JRDB-Reasoning and an adaptive query engine to improve the evaluation of visual reasoning in AI models.
Details
Motivation: The motivation is to address the limitations in existing visual reasoning benchmarks, such as the lack of defined reasoning complexity, customization, and structured annotations. Method: The researchers formalized reasoning complexity, developed an adaptive query engine for generating customizable questions, and extended the JRDB dataset with additional annotations. Result: The result is the creation of JRDB-Reasoning, a new benchmark for visual reasoning in human-crowded environments, and a query engine capable of generating questions with varying complexity and annotations. Conclusion: The paper concludes that the introduced adaptive query engine and the JRDB-Reasoning benchmark enable detailed and dynamic evaluation of visual reasoning frameworks and models. Abstract: Recent advances in Vision-Language Models (VLMs) and large language models (LLMs) have greatly enhanced visual reasoning, a key capability for embodied AI agents like robots. However, existing visual reasoning benchmarks often suffer from several limitations: they lack a clear definition of reasoning complexity, offer no control for generating questions of varying difficulty or for task customization, and fail to provide structured, step-by-step reasoning annotations (workflows). To bridge these gaps, we formalize reasoning complexity, introduce an adaptive query engine that generates customizable questions of varying complexity with detailed intermediate annotations, and extend the JRDB dataset with human-object interaction and geometric relationship annotations to create JRDB-Reasoning, a benchmark tailored for visual reasoning in human-crowded environments. Our engine and benchmark enable fine-grained evaluation of visual reasoning frameworks and dynamic assessment of visual-language models across reasoning levels.
[90] A Sub-Pixel Multimodal Optical Remote Sensing Images Matching Method
Tao Huang,Hongbo Pan,Nanxi Zhou,Shun Zhou
Main category: cs.CV
TL;DR: The paper proposes PCWLAD, a method for improving the matching accuracy of multimodal optical images, which outperforms existing methods and achieves high accuracy.
Details
Motivation: The motivation is to address degraded image matching accuracy caused by nonlinear radiation and geometric deformation differences due to different spectral responses in multimodal optical images. Method: The method involves coarse matching using the structural similarity index measure (SSIM) and fine matching with a phase consistency weighted least absolute deviation (WLAD) criterion, incorporating radiometric and geometric transformation models and mutual structure filtering. Result: PCWLAD outperformed eight state-of-the-art methods in terms of correct matching rate (CMR) and root mean square error (RMSE), achieving an average matching accuracy of approximately 0.4 pixels across three datasets. Conclusion: PCWLAD is an effective method for improving the matching accuracy of multimodal optical images, outperforming existing methods and achieving an average matching accuracy of approximately 0.4 pixels. Abstract: High-accuracy matching of multimodal optical images is the basis of geometric processing. However, the image matching accuracy is usually degraded by the nonlinear radiation and geometric deformation differences caused by different spectral responses. To address these problems, we proposed a phase consistency weighted least absolute deviation (PCWLAD) sub-pixel template matching method to improve the matching accuracy of multimodal optical images. This method consists of two main steps: coarse matching with the structural similarity index measure (SSIM) and fine matching with WLAD. In the coarse matching step, PCs are calculated without a noise filter to preserve the original structural details, and template matching is performed using the SSIM. In the fine matching step, we applied the radiometric and geometric transformation models between two multimodal PC templates based on the coarse matching. 
Furthermore, mutual structure filtering is adopted in the model to mitigate the impact of noise within the corresponding templates on the structural consistency, and the WLAD criterion is used to estimate the sub-pixel offset. To evaluate the performance of PCWLAD, we created three types of image datasets: visible to infrared Landsat images, visible to near-infrared close-range images, and visible to infrared uncrewed aerial vehicle (UAV) images. PCWLAD outperformed eight existing state-of-the-art methods in terms of correct matching rate (CMR) and root mean square error (RMSE) and reached an average matching accuracy of approximately 0.4 pixels across all three datasets. Our software and datasets are publicly available at https://github.com/huangtaocsu/PCWLAD.
[91] InterSyn: Interleaved Learning for Dynamic Motion Synthesis in the Wild
Yiyi Ma,Yuanzhi Liang,Xiu Li,Chi Zhang,Xuelong Li
Main category: cs.CV
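TL;DR: This paper introduces InterSyn, a new motion synthesis framework that integrates the learning of solo and multi-character dynamics to generate more natural and realistic interaction motions, setting a new benchmark in the field.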
Details
Motivation: To address the problem that previous methods treat solo and multi-person dynamics separately, InterSyn uses an interleaved learning strategy to generate more realistic and natural interaction motions. Method: The paper proposes Interleaved Learning for Motion Synthesis (InterSyn), a new framework comprising the Interleaved Interaction Synthesis (INS) module and the Relative Coordination Refinement (REC) module, which jointly model solo and interactive behaviors and refine the mutual dynamics and synchronized motions among characters. Result: Experimental results show that motion sequences generated by InterSyn outperform existing methods in text-to-motion alignment and diversity. Conclusion: InterSyn sets a new benchmark for robust and natural motion synthesis, and the authors plan to open-source the code to promote further research and development in this area. Abstract: We present Interleaved Learning for Motion Synthesis (InterSyn), a novel framework that targets the generation of realistic interaction motions by learning from integrated motions that consider both solo and multi-person dynamics. Unlike previous methods that treat these components separately, InterSyn employs an interleaved learning strategy to capture the natural, dynamic interactions and nuanced coordination inherent in real-world scenarios. Our framework comprises two key modules: the Interleaved Interaction Synthesis (INS) module, which jointly models solo and interactive behaviors in a unified paradigm from a first-person perspective to support multiple character interactions, and the Relative Coordination Refinement (REC) module, which refines mutual dynamics and ensures synchronized motions among characters. Experimental results show that the motion sequences generated by InterSyn exhibit higher text-to-motion alignment and improved diversity compared with recent methods, setting a new benchmark for robust and natural motion synthesis. Additionally, our code will be open-sourced in the future to promote further research and development in this area.
[92] From Pixel to Mask: A Survey of Out-of-Distribution Segmentation
Wenjie Zhao,Jia Li,Yunhui Guo
Main category: cs.CV
TL;DR: The paper surveys advances in out-of-distribution (OoD) segmentation, focusing on methods for autonomous driving, and highlights its importance in improving AI safety and robustness.
Details
Motivation: The motivation is to address the limitations of conventional OoD detection methods, which lack spatial localization, by focusing on OoD segmentation for pixel-level precision in safety-critical applications like autonomous driving. Method: The paper uses a systematic review approach to group current OoD segmentation approaches into four categories and analyze recent advances, challenges, and future research directions. Result: The paper groups current OoD segmentation approaches into four categories: test-time OoD segmentation, outlier exposure for supervised training, reconstruction-based methods, and approaches leveraging powerful models. Conclusion: The paper concludes that OoD segmentation is a crucial area of research for enhancing the safety and robustness of AI systems, especially in autonomous driving scenarios. Abstract: Out-of-distribution (OoD) detection and segmentation have attracted growing attention as concerns about AI security rise. Conventional OoD detection methods identify the existence of OoD objects but lack spatial localization, limiting their usefulness in downstream tasks. OoD segmentation addresses this limitation by localizing anomalous objects at pixel-level granularity. This capability is crucial for safety-critical applications such as autonomous driving, where perception modules must not only detect but also precisely segment OoD objects, enabling targeted control actions and enhancing overall system robustness. In this survey, we group current OoD segmentation approaches into four categories: (i) test-time OoD segmentation, (ii) outlier exposure for supervised training, (iii) reconstruction-based methods, and (iv) approaches that leverage powerful models. We systematically review recent advances in OoD segmentation for autonomous-driving scenarios, identify emerging challenges, and discuss promising future research directions.
[93] Integrating Reinforcement Learning with Visual Generative Models: Foundations and Advances
Yuanzhi Liang,Yijie Fang,Rui Li,Ziqi Ni,Ruijie Su,Chi Zhang,Xuelong Li
Main category: cs.CV
TL;DR: This paper surveys reinforcement learning-based methods for visual content generation, highlighting their ability to align with high-level goals and improve controllability, while identifying future research challenges at the intersection of RL and generative modeling.
Details
Motivation: Generative models often rely on surrogate objectives like likelihood or reconstruction loss, which may not align with perceptual quality or semantic accuracy. Reinforcement learning offers a principled framework for optimizing more realistic and preference-driven objectives, prompting the need to explore its effectiveness in visual content generation. Method: The paper systematically reviews the evolution of reinforcement learning (RL) and its integration into visual content generation, examining its role as both a fine-tuning mechanism and a structural component in aligning generation with high-level goals. Result: Recent advances show that reinforcement learning enhances controllability, consistency, and human alignment in generative tasks, serving as both a tool for fine-tuning and structurally aligning generation with complex goals. Conclusion: The paper concludes with a discussion of open challenges and future research directions at the intersection of reinforcement learning and generative modeling. Abstract: Generative models have made significant progress in synthesizing visual content, including images, videos, and 3D/4D structures. However, they are typically trained with surrogate objectives such as likelihood or reconstruction loss, which often misalign with perceptual quality, semantic accuracy, or physical realism. Reinforcement learning (RL) offers a principled framework for optimizing non-differentiable, preference-driven, and temporally structured objectives. Recent advances demonstrate its effectiveness in enhancing controllability, consistency, and human alignment across generative tasks. This survey provides a systematic overview of RL-based methods for visual content generation. We review the evolution of RL from classical control to its role as a general-purpose optimization tool, and examine its integration into image, video, and 3D/4D generation. 
Across these domains, RL serves not only as a fine-tuning mechanism but also as a structural component for aligning generation with complex, high-level goals. We conclude with open challenges and future research directions at the intersection of RL and generative modeling.
[94] Concepts or Skills? Rethinking Instruction Selection for Multi-modal Models
Andrew Bai,Justin Cui,Ruochen Wang,Cho-Jui Hsieh
Main category: cs.CV
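TL;DR: This paper designs a targeted training data selection method to optimize the performance of vision-language instruction tuning, emphasizing the trade-off between acquiring conceptual knowledge and visual skills.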
Details
Motivation: Vision-language instruction tuning serves two main purposes: learning visual concepts and learning visual skills. However, a given vision-language benchmark mainly benefits from training instructions with either similar skills or similar visual concepts. Method: The approach first extracts the concepts/skills from the benchmark, determines whether the benchmark predominantly benefits from similar concepts or similar skills, and then selects the instructions with the best-matching concepts/skills. Result: Experiments on 10+ benchmarks validate the effectiveness of the targeted data selection method, showing +0.9% over the best existing baseline averaged over all benchmarks and +1.5% on the skill-focused subset. Conclusion: Optimizing vision-language instruction tuning requires balancing the acquisition of conceptual knowledge against visual skills. Abstract: Vision-language instruction tuning achieves two main purposes: learning visual concepts and learning visual skills. In this paper, we found that vision-language benchmarks fall into the dichotomy of mainly benefiting from training on instructions with similar skills or visual concepts. Inspired by the discovery, we designed a simple targeted training data selection method to optimize the performance of a given benchmark. We first extract the concepts/skills from the benchmark, determine whether the benchmark predominantly benefits from similar concepts or skills, and finally select instructions with the most matching concepts/skills. Experiments on 10+ benchmarks validate the effectiveness of our targeted data selection method, showing +0.9% over the best existing baseline averaged over all benchmarks and +1.5% on the skill-focused subset. Our findings underscore the importance of recognizing the inherent trade-off within instruction selection, which requires balancing the acquisition of conceptual knowledge against visual skill.
[95] Glo-DMU: A Deep Morphometry Framework of Ultrastructural Characterization in Glomerular Electron Microscopic Images
Zhentai Zhang,Danyi Weng,Guibin Zhang,Xiang Chen,Kaixing Long,Jian Geng,Yanmeng Lu,Lei Zhang,Zhitao Zhou,Lei Cao
Main category: cs.CV
TL;DR: Glo-DMU is an automated deep learning framework that accurately and efficiently quantifies multiple key ultrastructural features of glomeruli, supporting renal disease diagnosis.
Details
Motivation: The complexity and diversity of kidney disease ultrastructures require advanced automated analysis tools. Current methods mainly focus on individual structures, which do not fully meet clinical diagnostic needs. Method: The study developed Glo-DMU, a framework based on three deep learning models: ultrastructure segmentation, glomerular filtration barrier classification, and electron-dense deposits detection. These models collectively quantify key ultrastructural features following renal biopsy protocols. Result: The framework was tested on 115 patients with 9 renal pathological types and showed strong consistency between automated results and manual pathological reports in real-world scenarios. Conclusion: The Glo-DMU framework provides a fully automated, high-precision, and high-throughput solution for analyzing multiple glomerular ultrastructural features, offering an efficient tool for renal pathologists. Abstract: Complex and diverse ultrastructural features can indicate the type, progression, and prognosis of kidney diseases. Recently, computational pathology combined with deep learning methods has shown tremendous potential in advancing automatic morphological analysis of glomerular ultrastructure. However, current research predominantly focuses on the recognition of individual ultrastructure, which makes it challenging to meet practical diagnostic needs. In this study, we propose the glomerular morphometry framework of ultrastructural characterization (Glo-DMU), which is grounded on three deep models: the ultrastructure segmentation model, the glomerular filtration barrier region classification model, and the electron-dense deposits detection model. Following the conventional protocol of renal biopsy diagnosis, this framework simultaneously quantifies the three most widely used ultrastructural features: the thickness of glomerular basement membrane, the degree of foot process effacement, and the location of electron-dense deposits. 
We evaluated the framework on 115 patients with 9 renal pathological types in real-world diagnostic scenarios, demonstrating good consistency between automatic quantification results and morphological descriptions in the pathological reports. Glo-DMU possesses the characteristics of full automation, high precision, and high throughput, quantifying multiple ultrastructural features simultaneously, and providing an efficient tool for assisting renal pathologists.
[96] Improving OCR for Historical Texts of Multiple Languages
Hylke Westerdijk,Ben Blankenborg,Khondoker Ittehadul Islam
Main category: cs.CV
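TL;DR: This paper explores methods for improving optical character recognition and document layout analysis using deep learning, and reports effective results on three different text recognition tasks.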
Details
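Motivation: The paper aims to explore and improve deep learning methods for Optical Character Recognition (OCR) and document layout analysis, particularly for textual material from different historical periods, such as Hebrew fragments of the Dead Sea Scrolls, 16th to 18th-century meeting resolutions, and modern English handwriting. Method: The paper uses several deep learning techniques and models, including the Kraken and TrOCR models and a Convolutional Recurrent Neural Network (CRNN) combined with DeepLabV3+ and a Bidirectional LSTM. Confidence-based pseudolabeling is also used to refine the model. Result: The paper reports effective results on character recognition and semantic segmentation tasks, achieved through data augmentation and advanced models. Conclusion: The paper presents its methodology and findings from three tasks across OCR and document layout analysis using advanced deep learning techniques. It concludes that data augmentation and advanced model architectures, such as the Kraken and TrOCR models and a CRNN combined with DeepLabV3+ and a Bidirectional LSTM, can achieve effective results in character recognition and semantic segmentation, and it offers insights and directions for future research. Abstract: This paper presents our methodology and findings from three tasks across Optical Character Recognition (OCR) and Document Layout Analysis using advanced deep learning techniques. First, for the historical Hebrew fragments of the Dead Sea Scrolls, we enhanced our dataset through extensive data augmentation and employed the Kraken and TrOCR models to improve character recognition. For the task of analyzing 16th to 18th-century meeting resolutions, we utilized a Convolutional Recurrent Neural Network (CRNN) that integrated DeepLabV3+ for semantic segmentation with a Bidirectional LSTM, incorporating confidence-based pseudolabeling to refine our model. Finally, for the modern English handwriting recognition task, we applied a CRNN with a ResNet34 encoder, trained using the Connectionist Temporal Classification (CTC) loss function to effectively capture sequential dependencies. This report offers valuable insights and suggests potential directions for future research.
[97] AtomDiffuser: Time-Aware Degradation Modeling for Drift and Beam Damage in STEM Imaging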
Hao Wang,Hongkui Zheng,Kai He,Abolfazl Razi
Main category: cs.CV
TL;DR: This paper introduces AtomDiffuser, a time-aware framework that effectively addresses degradation effects in STEM data, enabling clearer insights into structural evolution and radiation-induced instabilities.
Details
Motivation: Interpreting time-resolved STEM data is challenging due to spatial drift and beam-induced signal loss; existing methods struggle to separate these effects or model material dynamics at atomic resolution. Method: AtomDiffuser predicts an affine transformation and a spatially varying decay map between STEM frames, leveraging degradation as a physically heuristic, temporally conditioned process. Result: AtomDiffuser effectively generalizes to real-world cryo-STEM data, supporting high-resolution degradation inference and drift alignment for visualizing and quantifying degradation patterns. Conclusion: AtomDiffuser provides a framework to disentangle degradation effects in STEM data, enabling interpretable structural evolution visualization and quantification of degradation patterns linked to radiation-induced atomic instabilities. Abstract: Scanning transmission electron microscopy (STEM) plays a critical role in modern materials science, enabling direct imaging of atomic structures and their evolution under external interferences. However, interpreting time-resolved STEM data remains challenging due to two entangled degradation effects: spatial drift caused by mechanical and thermal instabilities, and beam-induced signal loss resulting from radiation damage. These factors distort both geometry and intensity in complex, temporally correlated ways, making it difficult for existing methods to explicitly separate their effects or model material dynamics at atomic resolution. In this work, we present AtomDiffuser, a time-aware degradation modeling framework that disentangles sample drift and radiometric attenuation by predicting an affine transformation and a spatially varying decay map between any two STEM frames. Unlike traditional denoising or registration pipelines, our method leverages degradation as a physically heuristic, temporally conditioned process, enabling interpretable structural evolutions across time. 
Trained on synthetic degradation processes, AtomDiffuser also generalizes well to real-world cryo-STEM data. It further supports high-resolution degradation inference and drift alignment, offering tools for visualizing and quantifying degradation patterns that correlate with radiation-induced atomic instabilities.
[98] Contrast Sensitivity Function of Multimodal Vision-Language Models
Pablo Hernández-Cámara,Alexandra Gomez-Villa,Jose Manuel Jaén-Lorites,Jorge Vila-Tomás,Jesus Malo,Valero Laparra
Main category: cs.CV
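TL;DR: The paper introduces a novel method for evaluating the contrast sensitivity of multimodal vision-language models and finds a significant gap between the models and human perception.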
Details
Motivation: Assessing how well multimodal vision-language models align with human perception is essential for understanding how they perceive low-level visual features. Method: The models are prompted directly to judge pattern visibility at different contrasts for each frequency, and responses from models of multiple architectures are assessed using band-pass filtered noise images and a diverse set of prompts. Result: While some models approximate the shape or magnitude of the human contrast sensitivity function, none fully replicates both. Conclusion: The paper proposes a new behavioral psychophysics-inspired method for estimating the contrast sensitivity function of chat-based multimodal vision-language models, reveals key gaps in the visual sensitivity of multimodal models, and shows that prompt phrasing has a significant effect on responses. Abstract: Assessing the alignment of multimodal vision-language models (VLMs) with human perception is essential to understand how they perceive low-level visual features. A key characteristic of human vision is the contrast sensitivity function (CSF), which describes sensitivity to spatial frequency at low contrasts. Here, we introduce a novel behavioral psychophysics-inspired method to estimate the CSF of chat-based VLMs by directly prompting them to judge pattern visibility at different contrasts for each frequency. This methodology is closer to real psychophysics experiments than previously reported methods. Using band-pass filtered noise images and a diverse set of prompts, we assess model responses across multiple architectures. We find that while some models approximate human-like CSF shape or magnitude, none fully replicate both. Notably, prompt phrasing has a large effect on the responses, raising concerns about prompt stability. Our results provide a new framework for probing visual sensitivity in multimodal models and reveal key gaps between their visual representations and human perception.
[99] Towards Spatially Consistent Image Generation: On Incorporating Intrinsic Scene Properties into Diffusion Models
Hyundo Lee,Suhyung Choi,Byoung-Tak Zhang,Inwoo Hwang
Main category: cs.CV
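TL;DR: This paper proposes an image generation method based on intrinsic scene properties that co-generates images and their intrinsics, improving the spatial consistency and layout naturalness of generated images.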
Details
Motivation: Because conventional image generation models lack information about underlying structure and spatial layout, they tend to produce spatially inconsistent and distorted images. The paper therefore proposes a new approach that uses intrinsic properties, which provide rich scene information, to improve the spatial consistency and realism of generated images. Method: Intrinsic scene properties are extracted with pre-trained estimators, and an autoencoder aggregates the various intrinsic properties into a single latent variable. Building on pre-trained large-scale Latent Diffusion Models (LDMs), the method denoises the image and intrinsic domains simultaneously so that they share mutual information. Result: The method generates more natural and spatially consistent images without requiring additional scene information or explicit 3D representations, while maintaining the fidelity and textual alignment of the base model (e.g., Stable Diffusion). Conclusion: Experimental results show that the method corrects spatial inconsistencies and produces more natural scene layouts while preserving the base model's textual alignment. Abstract: Image generation models trained on large datasets can synthesize high-quality images but often produce spatially inconsistent and distorted images due to limited information about the underlying structures and spatial layouts. In this work, we leverage intrinsic scene properties (e.g., depth, segmentation maps) that provide rich information about the underlying scene, unlike prior approaches that solely rely on image-text pairs or use intrinsics as conditional inputs. Our approach aims to co-generate both images and their corresponding intrinsics, enabling the model to implicitly capture the underlying scene structure and generate more spatially consistent and realistic images. Specifically, we first extract rich intrinsic scene properties from a large image dataset with pre-trained estimators, eliminating the need for additional scene information or explicit 3D representations. We then aggregate various intrinsic scene properties into a single latent variable using an autoencoder. Building upon pre-trained large-scale Latent Diffusion Models (LDMs), our method simultaneously denoises the image and intrinsic domains by carefully sharing mutual information so that the image and intrinsic reflect each other without degrading image quality. Experimental results demonstrate that our method corrects spatial inconsistencies and produces a more natural layout of scenes while maintaining the fidelity and textual alignment of the base model (e.g., Stable Diffusion).
[100] Unlocking Robust Semantic Segmentation Performance via Label-only Elastic Deformations against Implicit Label Noise
Yechan Kim,Dongho Yoon,Younkwan Lee,Unse Fatima,Hong Kook Kim,Songjae Lee,Sanga Park,Jeong Ho Park,Seonjong Kang,Moongu Jeon
Main category: cs.CV
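TL;DR: NSegment+ is a new data augmentation framework that counters implicit label noise by applying elastic deformations only to labels, significantly improving model performance on multiple datasets.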
Details
Motivation: Implicit label noise (such as ambiguous object boundaries and annotator variability) can impair model performance, and conventional augmentation methods may amplify it. Method: The NSegment+ framework applies elastic deformations only to segmentation labels, unlike conventional data augmentation methods. Result: The method achieves average mIoU gains of up to +2.29, +2.38, +1.75, and +3.39 on Vaihingen, LoveDA, Cityscapes, and PASCAL VOC, respectively. Conclusion: By decoupling image and label transformations, NSegment+ effectively handles implicit label noise and improves model performance, with further gains when combined with other training tricks. Abstract: While previous studies on image segmentation focus on handling severe (or explicit) label noise, real-world datasets also exhibit subtle (or implicit) label imperfections. These arise from inherent challenges, such as ambiguous object boundaries and annotator variability. Although not explicitly present, such mild and latent noise can still impair model performance. Typical data augmentation methods, which apply identical transformations to the image and its label, risk amplifying these subtle imperfections and limiting the model's generalization capacity. In this paper, we introduce NSegment+, a novel augmentation framework that decouples image and label transformations to address such realistic noise for semantic segmentation. By introducing controlled elastic deformations only to segmentation labels while preserving the original images, our method encourages models to focus on learning robust representations of object structures despite minor label inconsistencies. Extensive experiments demonstrate that NSegment+ consistently improves performance, achieving mIoU gains of up to +2.29, +2.38, +1.75, and +3.39 on average on Vaihingen, LoveDA, Cityscapes, and PASCAL VOC, respectively, even without bells and whistles, highlighting the importance of addressing implicit label noise. These gains can be further amplified when combined with other training tricks, including CutMix and Label Smoothing.
[101] PQ-DAF: Pose-driven Quality-controlled Data Augmentation for Data-scarce Driver Distraction Detection
Haibin Sun,Xinghui Song
Main category: cs.CV
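TL;DR: This paper proposes the PQ-DAF framework, which uses high-quality data augmentation to address few-shot learning and cross-domain generalization in driver distraction detection, significantly improving model performance.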
Details
Motivation: Existing driver distraction detection models suffer from few-shot learning challenges and insufficient cross-domain generalization in real-world use, mainly because data annotation is costly and there is a substantial domain gap between training data and real deployment environments. Method: Proposes a Pose-driven Quality-controlled Data Augmentation Framework (PQ-DAF) that uses Progressive Conditional Diffusion Models (PCDMs) to generate diverse training samples and a CogVLM-based sample quality assessment to filter out low-quality samples. Result: PQ-DAF significantly improves generalization under few-shot conditions and effectively strengthens cross-domain robustness. Conclusion: By combining a vision-language model with a progressive conditional diffusion model, PQ-DAF effectively addresses few-shot learning and cross-domain generalization in driver distraction detection. Abstract: Driver distraction detection is essential for improving traffic safety and reducing road accidents. However, existing models often suffer from degraded generalization when deployed in real-world scenarios. This limitation primarily arises from the few-shot learning challenge caused by the high cost of data annotation in practical environments, as well as the substantial domain shift between training datasets and target deployment conditions. To address these issues, we propose a Pose-driven Quality-controlled Data Augmentation Framework (PQ-DAF) that leverages a vision-language model for sample filtering to cost-effectively expand training data and enhance cross-domain robustness. Specifically, we employ a Progressive Conditional Diffusion Model (PCDMs) to accurately capture key driver pose features and synthesize diverse training examples. A sample quality assessment module, built upon the CogVLM vision-language model, is then introduced to filter out low-quality synthetic samples based on a confidence threshold, ensuring the reliability of the augmented dataset. Extensive experiments demonstrate that PQ-DAF substantially improves performance in few-shot driver distraction detection, achieving significant gains in model generalization under data-scarce conditions.
[102] Translation of Text Embedding via Delta Vector to Suppress Strongly Entangled Content in Text-to-Image Diffusion Models
Eunseo Koh,Seunghoo Hong,Tae-Young Kim,Simon S. Woo,Jae-Pil Heo
Main category: cs.CV
TL;DR: This paper proposes a novel method, SSDV, to suppress undesired content in text-to-image diffusion models by modifying text embeddings and utilizing cross-attention mechanisms, achieving superior performance over existing techniques.
Details
Motivation: Text-to-Image diffusion models struggle to suppress content strongly associated with certain words, such as a mustache with Charlie Chaplin, even when instructed otherwise. Method: The method introduces a delta vector to modify text embeddings and uses a Selective Suppression with Delta Vector (SSDV) approach in the cross-attention mechanism for localized suppression. Result: The proposed approach significantly improves suppression of undesired content, both quantitatively and qualitatively, including in personalized models. Conclusion: The proposed SSDV method effectively suppresses undesired content in text-to-image diffusion models, outperforming existing methods. Abstract: Text-to-Image (T2I) diffusion models have made significant progress in generating diverse high-quality images from textual prompts. However, these models still face challenges in suppressing content that is strongly entangled with specific words. For example, when generating an image of "Charlie Chaplin", a "mustache" consistently appears even if explicitly instructed not to include it, as the concept of "mustache" is strongly entangled with "Charlie Chaplin". To address this issue, we propose a novel approach to directly suppress such entangled content within the text embedding space of diffusion models. Our method introduces a delta vector that modifies the text embedding to weaken the influence of undesired content in the generated image, and we further demonstrate that this delta vector can be easily obtained through a zero-shot approach. Furthermore, we propose a Selective Suppression with Delta Vector (SSDV) method to adapt the delta vector into the cross-attention mechanism, enabling more effective suppression of unwanted content in regions where it would otherwise be generated. Additionally, we enable more precise suppression in personalized T2I models by optimizing the delta vector, which previous baselines were unable to achieve. 
Extensive experimental results demonstrate that our approach significantly outperforms existing methods, both in terms of quantitative and qualitative metrics.
[103] SC-Lane: Slope-aware and Consistent Road Height Estimation Framework for 3D Lane Detection
Chaesong Park,Eunbin Seo,Jihyeon Hwang,Jongwoo Lim
Main category: cs.CV
TL;DR: SC-Lane is a new framework for 3D lane detection that adaptively determines the fusion of slope-specific height features and enforces temporal coherence.
Details
Motivation: Previous approaches rely on fixed slope anchors, which are not robust to diverse road geometries. Method: SC-Lane uses a Slope-Aware Adaptive Feature module and a Height Consistency Module to improve heightmap estimation. Result: SC-Lane outperforms existing methods with an F-score of 64.3% on the OpenLane benchmark. Conclusion: SC-Lane is a novel slope-aware and temporally consistent heightmap estimation framework for 3D lane detection that achieves state-of-the-art performance. Abstract: In this paper, we introduce SC-Lane, a novel slope-aware and temporally consistent heightmap estimation framework for 3D lane detection. Unlike previous approaches that rely on fixed slope anchors, SC-Lane adaptively determines the fusion of slope-specific height features, improving robustness to diverse road geometries. To achieve this, we propose a Slope-Aware Adaptive Feature module that dynamically predicts the appropriate weights from image cues for integrating multi-slope representations into a unified heightmap. Additionally, a Height Consistency Module enforces temporal coherence, ensuring stable and accurate height estimation across consecutive frames, which is crucial for real-world driving scenarios. To evaluate the effectiveness of SC-Lane, we employ three standardized metrics: Mean Absolute Error (MAE), Root Mean Squared Error (RMSE), and threshold-based accuracy. Although common in surface and depth estimation, these metrics have been underutilized for road height assessment. Using the LiDAR-derived heightmap dataset introduced in prior work [20], we benchmark our method under these metrics, thereby establishing a rigorous standard for future comparisons. Extensive experiments on the OpenLane benchmark demonstrate that SC-Lane significantly improves both height estimation and 3D lane detection, achieving state-of-the-art performance with an F-score of 64.3%, outperforming existing methods by a notable margin. 
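As a rough illustration of the three heightmap metrics named in the abstract, the sketch below computes MAE, RMSE, and a threshold-based accuracy over flat lists of height values. The function name `height_metrics`, the `tol` parameter, and the choice of an absolute-error tolerance are assumptions made here for illustration; the abstract does not specify how the threshold is defined, and depth-estimation work often uses a ratio-based threshold instead.

```python
import math


def height_metrics(pred, gt, tol=0.1):
    """Return (MAE, RMSE, threshold accuracy) for two equal-length height lists.

    Threshold accuracy is taken here (as an assumption) to be the fraction
    of cells whose absolute height error is below `tol`.
    """
    if len(pred) != len(gt) or not pred:
        raise ValueError("pred and gt must be non-empty and the same length")
    # Per-cell absolute error between predicted and reference heights.
    errors = [abs(p - g) for p, g in zip(pred, gt)]
    mae = sum(errors) / len(errors)
    rmse = math.sqrt(sum(e * e for e in errors) / len(errors))
    acc = sum(e < tol for e in errors) / len(errors)
    return mae, rmse, acc


# Hypothetical heights in metres for four heightmap cells.
pred = [0.0, 0.5, 1.0, 1.5]
gt = [0.0, 0.4, 1.3, 1.5]
mae, rmse, acc = height_metrics(pred, gt, tol=0.2)
print(f"MAE={mae:.3f} RMSE={rmse:.3f} acc={acc:.2f}")
```

RMSE penalises the single large error more heavily than MAE does, which is one reason the two are usually reported together.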
For detailed results and a demonstration video, please refer to our project page: https://parkchaesong.github.io/sclane/
[104] NanoControl: A Lightweight Framework for Precise and Efficient Control in Diffusion Transformer
Shanyuan Liu,Jian Zhu,Junda Lu,Yue Gong,Liuzhuozheng Li,Bo Cheng,Yuhang Ma,Liebucha Wu,Xiaoyu Wu,Dawei Leng,Yuhui Yin
Main category: cs.CV
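TL;DR: NanoControl introduces a LoRA-style control module and a KV-Context Augmentation mechanism, significantly reducing the computational overhead of controllable text-to-image generation while maintaining high generation quality.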
Details
Motivation: Existing ControlNet-based methods introduce significant parameter and computational overhead when used with DiTs for controllable text-to-image generation. Method: Uses Flux as the backbone network and designs a LoRA-style control module and a KV-Context Augmentation mechanism. Result: NanoControl achieves state-of-the-art performance in controllable text-to-image generation with only a 0.024% increase in parameters and a 0.029% increase in GFLOPs. Conclusion: NanoControl enables efficient controllable text-to-image generation while significantly reducing parameter and computational overhead. Abstract: Diffusion Transformers (DiTs) have demonstrated exceptional capabilities in text-to-image synthesis. However, in the domain of controllable text-to-image generation using DiTs, most existing methods still rely on the ControlNet paradigm originally designed for UNet-based diffusion models. This paradigm introduces significant parameter overhead and increased computational costs. To address these challenges, we propose the Nano Control Diffusion Transformer (NanoControl), which employs Flux as the backbone network. Our model achieves state-of-the-art controllable text-to-image generation performance while incurring only a 0.024% increase in parameter count and a 0.029% increase in GFLOPs, thus enabling highly efficient controllable generation. Specifically, rather than duplicating the DiT backbone for control, we design a LoRA-style (low-rank adaptation) control module that directly learns control signals from raw conditioning inputs. Furthermore, we introduce a KV-Context Augmentation mechanism that integrates condition-specific key-value information into the backbone in a simple yet highly effective manner, facilitating deep fusion of conditional features. Extensive benchmark experiments demonstrate that NanoControl significantly reduces computational overhead compared to conventional control approaches, while maintaining superior generation quality and achieving improved controllability.
[105] STRIDE-QA: Visual Question Answering Dataset for Spatiotemporal Reasoning in Urban Driving Scenes
Keishi Ishihara,Kento Sasaki,Tsubasa Takahashi,Daiki Shiono,Yu Yamaguchi
Main category: cs.CV
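TL;DR: STRIDE-QA is a large-scale visual question answering dataset for spatiotemporal reasoning in autonomous driving scenes that significantly improves the performance of vision-language models.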
Details
Motivation: Existing vision-language models (VLMs) are limited in the precise spatiotemporal reasoning that autonomous driving requires. Method: Builds a large-scale visual question answering (VQA) dataset from 100 hours of multi-sensor driving data collected in Tokyo, providing 16 million QA pairs over 285K frames. Result: VLMs fine-tuned on STRIDE-QA achieve 55% success in spatial localization and 28% consistency in future motion prediction. Conclusion: STRIDE-QA provides a comprehensive foundation for developing more reliable VLMs for autonomous driving systems. Abstract: Vision-Language Models (VLMs) have been applied to autonomous driving to support decision-making in complex real-world scenarios. However, their training on static, web-sourced image-text pairs fundamentally limits the precise spatiotemporal reasoning required to understand and predict dynamic traffic scenes. We address this critical gap with STRIDE-QA, a large-scale visual question answering (VQA) dataset for physically grounded reasoning from an ego-centric perspective. Constructed from 100 hours of multi-sensor driving data in Tokyo, capturing diverse and challenging conditions, STRIDE-QA is the largest VQA dataset for spatiotemporal reasoning in urban driving, offering 16 million QA pairs over 285K frames. Grounded by dense, automatically generated annotations including 3D bounding boxes, segmentation masks, and multi-object tracks, the dataset uniquely supports both object-centric and ego-centric reasoning through three novel QA tasks that require spatial localization and temporal prediction. Our benchmarks demonstrate that existing VLMs struggle significantly, achieving near-zero scores on prediction consistency. In contrast, VLMs fine-tuned on STRIDE-QA exhibit dramatic performance gains, achieving 55% success in spatial localization and 28% consistency in future motion prediction, compared to near-zero scores from general-purpose VLMs. Therefore, STRIDE-QA establishes a comprehensive foundation for developing more reliable VLMs for safety-critical autonomous systems.
[106] CRISP: Contrastive Residual Injection and Semantic Prompting for Continual Video Instance Segmentation
Baichen Liu,Qi Lyu,Xudong Wang,Jiahua Dong,Lianqing Liu,Zhi Han
Main category: cs.CV
TL;DR: CRISP is a new framework for continual video instance segmentation that addresses instance-wise, category-wise, and task-wise confusion through contrastive residual injection and semantic prompting, leading to improved performance without catastrophic forgetting.
Details
Motivation: Continual video instance segmentation requires the ability to learn new object categories while retaining previously learned ones and maintaining temporal consistency. Existing methods struggle with instance-wise, category-wise, and task-wise confusion, which CRISP aims to address. Method: The paper proposes CRISP, which combines Contrastive Residual Injection and Semantic Prompting. It introduces an instance correlation loss, an adaptive residual semantic prompt (ARSP) framework with a query-prompt matching mechanism, and a semantic consistency loss for incremental training. It also includes an initialization strategy for inter-task correlation. Result: Extensive experiments on YouTube-VIS-2019 and YouTube-VIS-2021 datasets show that CRISP significantly outperforms existing continual segmentation methods in long-term continual video instance segmentation, effectively avoiding catastrophic forgetting and improving segmentation and classification performance. Conclusion: CRISP effectively addresses the challenges of continual video instance segmentation by enhancing plasticity and stability while maintaining temporal consistency, outperforming existing methods on YouTube-VIS datasets. Abstract: Continual video instance segmentation demands both the plasticity to absorb new object categories and the stability to retain previously learned ones, all while preserving temporal consistency across frames. In this work, we introduce Contrastive Residual Injection and Semantic Prompting (CRISP), an earlier attempt tailored to address the instance-wise, category-wise, and task-wise confusion in continual video instance segmentation. For instance-wise learning, we model instance tracking and construct instance correlation loss, which emphasizes the correlation with the prior query space while strengthening the specificity of the current task query. 
For category-wise learning, we build an adaptive residual semantic prompt (ARSP) learning framework, which constructs a learnable semantic residual prompt pool generated by category text and uses an adjustive query-prompt matching mechanism to build a mapping relationship between the query of the current task and the semantic residual prompt. Meanwhile, a semantic consistency loss based on contrastive learning is introduced to maintain semantic coherence between object queries and residual prompts during incremental training. For task-wise learning, to ensure the correlation at the inter-task level within the query space, we introduce a concise yet powerful initialization strategy for incremental prompts. Extensive experiments on the YouTube-VIS-2019 and YouTube-VIS-2021 datasets demonstrate that CRISP significantly outperforms existing continual segmentation methods in the long-term continual video instance segmentation task, avoiding catastrophic forgetting and effectively improving segmentation and classification performance. The code is available at https://github.com/01upup10/CRISP.
[107] DOD-SA: Infrared-Visible Decoupled Object Detection with Single-Modality Annotations
Hang Jin,Chenqiang Gao,Junjie Guo,Fangcen Liu,Kanghui Tian,Qinyao Chang
Main category: cs.CV
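TL;DR: This paper proposes DOD-SA, an infrared-visible object detection framework based on single-modality annotations, which significantly improves detection through a collaborative teacher-student network and a cross-modality knowledge transfer strategy.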
Details
Motivation: Existing infrared-visible object detection methods require dual-modality annotations, which makes labeling costly; this paper aims to solve the problem with single-modality annotations, lowering cost while improving detection. Method: Builds the DOD-SA framework on a Single- and Dual-Modality Collaborative Teacher-Student Network (CoSD-TSNet), comprising a single-modality branch (SM-Branch) and a dual-modality decoupled branch (DMD-Branch), and introduces a Progressive and Self-Tuning Training Strategy (PaST) and a Pseudo Label Assigner (PLA) to improve model performance. Result: Experiments on the DroneVehicle dataset show that the method outperforms current state-of-the-art models. Conclusion: The paper proposes DOD-SA, a new infrared-visible decoupled object detection framework that achieves cross-modality knowledge transfer from single-modality annotations, improving the model's decoupling ability and detection performance. Abstract: Infrared-visible object detection has shown great potential in real-world applications, enabling robust all-day perception by leveraging the complementary information of infrared and visible images. However, existing methods typically require dual-modality annotations to output detection results for both modalities during prediction, which incurs high annotation costs. To address this challenge, we propose a novel infrared-visible Decoupled Object Detection framework with Single-modality Annotations, called DOD-SA. The architecture of DOD-SA is built upon a Single- and Dual-Modality Collaborative Teacher-Student Network (CoSD-TSNet), which consists of a single-modality branch (SM-Branch) and a dual-modality decoupled branch (DMD-Branch). The teacher model generates pseudo-labels for the unlabeled modality, simultaneously supporting the training of the student model. The collaborative design enables cross-modality knowledge transfer from the labeled modality to the unlabeled modality, and facilitates effective SM-to-DMD branch supervision. To further improve the decoupling ability of the model and the pseudo-label quality, we introduce a Progressive and Self-Tuning Training Strategy (PaST) that trains the model in three stages: (1) pretraining SM-Branch, (2) guiding the learning of DMD-Branch by SM-Branch, and (3) refining DMD-Branch. In addition, we design a Pseudo Label Assigner (PLA) to align and pair labels across modalities, explicitly addressing modality misalignment during training. 
Extensive experiments on the DroneVehicle dataset demonstrate that our method outperforms state-of-the-art (SOTA) models.
[108] SkeySpot: Automating Service Key Detection for Digital Electrical Layout Plans in the Construction Industry
Dhruv Dosi,Rohit Meena,Param Rajpura,Yogesh Kumar Meena
Main category: cs.CV
TL;DR: This paper presents an automated symbol spotting solution for interpreting electrical layout plans, introducing the DELP dataset and SkeySpot toolkit, which achieves high performance using YOLOv8 and supports interoperable building information workflows.
Details
Motivation: The motivation behind this study is the lack of machine-readable floor plans, which makes large-scale interpretation time-consuming and error-prone. The paper aims to provide an automated symbol spotting solution to support workflows like cost estimation, infrastructure maintenance, and regulatory compliance. Method: This paper uses a labeled dataset (DELP) comprising 45 scanned electrical layout plans. It introduces SkeySpot, a lightweight, open-source toolkit developed using YOLOv8 for real-time detection, classification, and quantification of electrical symbols. Result: Among the models evaluated, YOLOv8 achieved the highest performance with a mean Average Precision (mAP) of 82.5%. SkeySpot produces structured, standardized outputs that can be scaled up for interoperable building information workflows. Conclusion: The paper concludes that the proposed method, using YOLOv8 and the DELP dataset, makes digitization of electrical layouts more accessible to SMEs in the construction industry, supporting broader goals of standardization, interoperability, and sustainability in the built environment. Abstract: Legacy floor plans, often preserved only as scanned documents, remain essential resources for architecture, urban planning, and facility management in the construction industry. However, the lack of machine-readable floor plans renders large-scale interpretation both time-consuming and error-prone. Automated symbol spotting offers a scalable solution by enabling the identification of service key symbols directly from floor plans, supporting workflows such as cost estimation, infrastructure maintenance, and regulatory compliance. This work introduces a labelled Digitised Electrical Layout Plans (DELP) dataset comprising 45 scanned electrical layout plans annotated with 2,450 instances across 34 distinct service key classes. A systematic evaluation framework is proposed using pretrained object detection models for the DELP dataset. 
Among the models benchmarked, YOLOv8 achieves the highest performance with a mean Average Precision (mAP) of 82.5%. Using YOLOv8, we develop SkeySpot, a lightweight, open-source toolkit for real-time detection, classification, and quantification of electrical symbols. SkeySpot produces structured, standardised outputs that can be scaled up for interoperable building information workflows, ultimately enabling compatibility across downstream applications and regulatory platforms. By lowering dependency on proprietary CAD systems and reducing manual annotation effort, this approach makes the digitisation of electrical layouts more accessible to small and medium-sized enterprises (SMEs) in the construction industry, while supporting broader goals of standardisation, interoperability, and sustainability in the built environment.
[109] From Images to Perception: Emergence of Perceptual Properties by Reconstructing Images
Pablo Hernández-Cámara,Jesus Malo,Valero Laparra
Main category: cs.CV
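TL;DR: This paper presents PerceptNet, a bio-inspired architecture, and finds that its encoder layer aligns closely with human visual perceptual judgments even though it was trained without human supervision.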
Details
Motivation: Some scientists have suggested that human visual perception may emerge from image statistics, shaping efficient neural representations in early vision. Method: A bio-inspired architecture called PerceptNet is optimized end to end for image reconstruction tasks: autoencoding, denoising, deblurring, and sparsity regularization. Result: The encoder stage (V1-like layer) shows the highest agreement with human perceptual judgments of image distortion, even though no perceptual information is used in initialization or training. Conclusion: Bio-inspired models can learn perceptual metrics without human supervision, suggesting that the visual system may be tuned to remove particular levels of distortion and sparsity. Abstract: A number of scientists have suggested that human visual perception may emerge from image statistics, shaping efficient neural representations in early vision. In this work, a bio-inspired architecture that can accommodate several known facts in the retina-V1 cortex, the PerceptNet, has been end-to-end optimized for different tasks related to image reconstruction: autoencoding, denoising, deblurring, and sparsity regularization. Our results show that the encoder stage (V1-like layer) consistently exhibits the highest correlation with human perceptual judgments on image distortion despite not using perceptual information in the initialization or training. This alignment exhibits an optimum for moderate noise, blur and sparsity. These findings suggest that the visual system may be tuned to remove those particular levels of distortion with that level of sparsity and that biologically inspired models can learn perceptual metrics without human supervision.
[110] Trajectory-aware Shifted State Space Models for Online Video Super-Resolution
Qiang Zhu,Xiandong Meng,Yuxian Jiang,Fan Zhang,David Bull,Shuyuan Zhu,Bing Zeng
Main category: cs.CV
TL;DR: TS-Mamba is a new online video super-resolution technique that leverages trajectory modeling and state space models for improved performance and reduced computational complexity.
Details
Motivation: The motivation is to overcome the limitations of existing online video super-resolution methods that use only one previous frame, thus restricting long-range temporal modeling. Method: The method involves constructing video trajectories to select similar tokens from previous frames, using a Trajectory-aware Shifted Mamba Aggregation module with shifted state space model blocks, and implementing a trajectory-aware loss function. Result: TS-Mamba achieves state-of-the-art performance on three video super-resolution test datasets and provides over a 22.7% reduction in complexity (in MACs). Conclusion: The paper concludes that TS-Mamba, an innovative online video super-resolution method, achieves state-of-the-art performance with reduced complexity compared to existing models. Abstract: Online video super-resolution (VSR) is an important technique for many real-world video processing applications, which aims to restore the current high-resolution video frame based on temporally previous frames. Most of the existing online VSR methods solely employ one neighboring previous frame to achieve temporal alignment, which limits long-range temporal modeling of videos. Recently, state space models (SSMs) have been proposed with linear computational complexity and a global receptive field, which significantly improve computational efficiency and performance. In this context, this paper presents a novel online VSR method based on Trajectory-aware Shifted SSMs (TS-Mamba), leveraging both long-term trajectory modeling and low-complexity Mamba to achieve efficient spatio-temporal information aggregation. Specifically, TS-Mamba first constructs the trajectories within a video to select the most similar tokens from the previous frames. Then, a Trajectory-aware Shifted Mamba Aggregation (TSMA) module consisting of proposed shifted SSMs blocks is employed to aggregate the selected tokens. 
The shifted SSM blocks are designed based on Hilbert scanning and corresponding shift operations to compensate for scanning losses and strengthen the spatial continuity of Mamba. Additionally, we propose a trajectory-aware loss function to supervise the trajectory generation, ensuring the accuracy of token selection when training our model. Extensive experiments on three widely used VSR test datasets demonstrate that compared with six online VSR benchmark models, our TS-Mamba achieves state-of-the-art performance in most cases and over 22.7% complexity reduction (in MACs). The source code for TS-Mamba will be available at https://github.com.
[111] Multi-Label Plant Species Prediction with Metadata-Enhanced Multi-Head Vision Transformers
Hanna Herasimchyk,Robin Labryga,Tomislav Prusina
Main category: cs.CV
TL;DR: This paper proposes a multi-head vision transformer approach with a pre-trained DINOv2 backbone to address the PlantCLEF 2025 challenge, achieving 3rd place on the private leaderboard by leveraging multi-scale tiling, dynamic thresholding, and ensemble methods.
Details
Motivation: The motivation is to address the PlantCLEF 2025 challenge, which involves predicting multiple plant species from multi-species quadrat images while training on single-species images, creating a significant domain shift that traditional models struggle to overcome. Method: The method uses a pre-trained DINOv2 Vision Transformer Base (ViT-B/14) with multiple classification heads for species, genus, and family prediction. It incorporates multi-scale tiling, dynamic threshold optimization based on mean prediction length, and ensemble strategies via bagging and Hydra model architectures. Inference techniques include image cropping, top-n filtering, and logit thresholding. Result: The model achieves strong performance on approximately 1.4 million training images covering 7,806 plant species and ranks 3rd on the private leaderboard of the PlantCLEF 2025 challenge. Conclusion: The paper concludes that their multi-head vision transformer approach, leveraging a pre-trained DINOv2 backbone and incorporating multi-scale tiling, dynamic threshold optimization, and ensemble strategies, achieves strong performance in multi-label plant species prediction, ranking 3rd on the PlantCLEF 2025 private leaderboard. Abstract: We present a multi-head vision transformer approach for multi-label plant species prediction in vegetation plot images, addressing the PlantCLEF 2025 challenge. The task involves training models on single-species plant images while testing on multi-species quadrat images, creating a drastic domain shift. Our methodology leverages a pre-trained DINOv2 Vision Transformer Base (ViT-B/14) backbone with multiple classification heads for species, genus, and family prediction, utilizing taxonomic hierarchies. Key contributions include multi-scale tiling to capture plants at different scales, dynamic threshold optimization based on mean prediction length, and ensemble strategies through bagging and Hydra model architectures. 
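To make the multi-scale tiling idea concrete, here is a minimal sketch that splits an image into non-overlapping grids at several scales and returns the crop boxes. The function name `tile_boxes`, the grid sizes, and the non-overlapping layout are illustrative assumptions; the abstract does not describe the authors' exact tiling scheme.

```python
def tile_boxes(width, height, grid_sizes=(1, 2, 3)):
    """Return (left, top, right, bottom) crop boxes for each n x n grid.

    Each grid size n splits the image into n * n non-overlapping tiles, so
    small grids capture large plants and fine grids capture small ones.
    """
    boxes = []
    for n in grid_sizes:
        for row in range(n):
            for col in range(n):
                # Integer division keeps tiles contiguous and inside the image.
                left = col * width // n
                top = row * height // n
                right = (col + 1) * width // n
                bottom = (row + 1) * height // n
                boxes.append((left, top, right, bottom))
    return boxes


# Hypothetical 600 x 400 quadrat image tiled at two scales.
boxes = tile_boxes(600, 400, grid_sizes=(1, 2))
for box in boxes:
    print(box)
```

Each box could then be cropped and classified on its own, with per-tile predictions merged into a single multi-label prediction for the whole image.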
The approach incorporates various inference techniques including image cropping to remove non-plant artifacts, top-n filtering for prediction constraints, and logit thresholding strategies. Experiments were conducted on approximately 1.4 million training images covering 7,806 plant species. Results demonstrate strong performance, making our submission 3rd best on the private leaderboard. Our code is available at https://github.com/geranium12/plant-clef-2025/tree/v1.0.0.
[112] SingleStrip: learning skull-stripping from a single labeled example
Bella Specktor-Fadida,Malte Hoffmann
Main category: cs.CV
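TL;DR: This paper proposes a semi-supervised segmentation method that combines domain randomization with autoencoder-based quality control, achieving high-performance 3D skull-stripping from very few labeled examples.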
Details
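Motivation: Deep learning segmentation depends on large amounts of labeled data, but manual labeling is time-consuming and laborious, especially for volumetric images such as brain magnetic resonance imaging (MRI). Existing domain-randomization methods struggle to provide enough anatomical variability when labeled data are extremely scarce, whereas semi-supervised self-training can mitigate label scarcity. Method: First, voxel intensities are automatically binned to generate initial training data, and a 3D skull-stripping network is trained from a single labeled example; next, a convolutional autoencoder (AE) is trained to assess the quality of brain masks predicted for unlabeled data; finally, high-quality pseudo-labels are selected to fine-tune the network, and the approach is compared with consistency-based ranking under test-time augmentation. Result: With only a single labeled example, a 3D skull-stripping network is trained successfully and, on out-of-distribution data, approaches the performance of models trained with more labeled data. The AE-based quality assessment correlates more strongly with segmentation accuracy than the consistency-based approach. Conclusion: Combining domain randomization with AE-based pseudo-label quality control enables effective semi-supervised segmentation from very little labeled data, easing the labeling burden in studies of new anatomical structures or emerging imaging techniques. Abstract: Deep learning segmentation relies heavily on labeled data, but manual labeling is laborious and time-consuming, especially for volumetric images such as brain magnetic resonance imaging (MRI). While recent domain-randomization techniques alleviate the dependency on labeled data by synthesizing diverse training images from label maps, they offer limited anatomical variability when very few label maps are available. Semi-supervised self-training addresses label scarcity by iteratively incorporating model predictions into the training set, enabling networks to learn from unlabeled data. In this work, we combine domain randomization with self-training to train three-dimensional skull-stripping networks using as little as a single labeled example. First, we automatically bin voxel intensities, yielding labels we use to synthesize images for training an initial skull-stripping model. Second, we train a convolutional autoencoder (AE) on the labeled example and use its reconstruction error to assess the quality of brain masks predicted for unlabeled data. Third, we select the top-ranking pseudo-labels to fine-tune the network, achieving skull-stripping performance on out-of-distribution data that approaches models trained with more labeled images. We compare AE-based ranking to consistency-based ranking under test-time augmentation, finding that the AE approach yields a stronger correlation with segmentation accuracy. 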
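The pseudo-label selection step can be sketched as a simple ranking. The snippet below assumes that each unlabeled scan already has a scalar AE reconstruction error for its predicted mask and that a lower error indicates a more plausible mask; the function name `select_pseudo_labels`, the scan IDs, and the error values are hypothetical and not taken from the paper.

```python
def select_pseudo_labels(errors, top_k):
    """Return the IDs of the top_k cases with the lowest reconstruction error.

    `errors` maps a case ID to the autoencoder reconstruction error of its
    predicted brain mask (assumption: lower error means higher mask quality).
    """
    # Sort case IDs by their error, smallest first, and keep the best top_k.
    ranked = sorted(errors, key=errors.get)
    return ranked[:top_k]


# Hypothetical reconstruction errors for four unlabeled scans.
errors = {"scan_a": 0.12, "scan_b": 0.03, "scan_c": 0.40, "scan_d": 0.07}
selected = select_pseudo_labels(errors, top_k=2)
print(selected)
```

Only the selected cases would be added to the fine-tuning set; the remaining predictions are left out rather than risk training on poor masks.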
Our results highlight the potential of combining domain randomization and AE-based quality control to enable effective semi-supervised segmentation from extremely limited labeled data. This strategy may ease the labeling burden that slows progress in studies involving new anatomical structures or emerging imaging techniques.
[113] Enhanced Sparse Point Cloud Data Processing for Privacy-aware Human Action Recognition
Maimunatu Tunau,Vincent Gbouna Zakka,Zhuangzhuang Dai
Main category: cs.CV
TL;DR: This paper systematically evaluates three main data processing methods for mmWave radar-based human action recognition and proposes enhancements.
Details
Motivation: Traditional vision-based human action recognition systems are effective but raise privacy concerns. mmWave radar sensors offer a privacy-preserving alternative, but their point cloud data are sparse and noisy, so the performance of different data processing methods needs to be evaluated. Method: Using the MiliPoint dataset, the paper analyzes DBSCAN, the Hungarian Algorithm, and Kalman Filtering individually, in every pairwise combination, and all together, and proposes targeted enhancements to improve accuracy. Result: The paper evaluates the three methods and their combinations in terms of recognition accuracy and computational cost, and provides key insights into the strengths and trade-offs of each. Conclusion: The paper provides a comprehensive evaluation of three data processing methods commonly used in mmWave radar-based human action recognition and proposes enhancements, offering guidance for future research. Abstract: Human Action Recognition (HAR) plays a crucial role in healthcare, fitness tracking, and ambient assisted living technologies. While traditional vision-based HAR systems are effective, they pose privacy concerns. mmWave radar sensors offer a privacy-preserving alternative but present challenges due to the sparse and noisy nature of their point cloud data. In the literature, three primary data processing methods, Density-Based Spatial Clustering of Applications with Noise (DBSCAN), the Hungarian Algorithm, and Kalman Filtering, have been widely used to improve the quality and continuity of radar data. However, a comprehensive evaluation of these methods, both individually and in combination, remains lacking. This paper addresses that gap by conducting a detailed performance analysis of the three methods using the MiliPoint dataset. We evaluate each method individually, all possible pairwise combinations, and the combination of all three, assessing both recognition accuracy and computational cost. Furthermore, we propose targeted enhancements to the individual methods aimed at improving accuracy. Our results provide crucial insights into the strengths and trade-offs of each method and their integrations, guiding future work on mmWave-based HAR systems.
[114] STAMP: Multi-pattern Attention-aware Multiple Instance Learning for STAS Diagnosis in Multi-center Histopathology Images
Liangrui Pan,xiaoyu Li,Guang Zhu,Guanting Li,Ruixin Wang,Jiadi Luo,Yaning Yang,Liang qingchun,Shaoliang Peng
Main category: cs.CV
TL;DR: This study introduces STAMP, a deep learning framework that improves the diagnosis of Spread through Air Spaces (STAS) in lung adenocarcinoma, achieving high accuracy across multiple datasets.
Details
Motivation: The motivation is to address the labor-intensive, error-prone manual diagnosis of STAS in LUAD by leveraging deep learning for more accurate and scalable diagnosis. Method: The authors constructed datasets (STAS-SXY, STAS-TXY, STAS-TCGA) using histopathological images annotated by senior pathologists. They proposed the STAMP framework, which uses a dual-branch architecture, transformer-based instance encoding, and multi-pattern attention aggregation for STAS diagnosis. Result: STAMP achieved AUCs of 0.8058, 0.8017, and 0.7928 on the three datasets, outperforming clinical diagnostic methods. Conclusion: The study concludes that the STAMP framework effectively improves the diagnosis of STAS in LUAD, demonstrating high accuracy and surpassing clinical standards. Abstract: Spread through air spaces (STAS) constitutes a novel invasive pattern in lung adenocarcinoma (LUAD), associated with tumor recurrence and diminished survival rates. However, large-scale STAS diagnosis in LUAD remains a labor-intensive endeavor, compounded by the propensity for oversight and misdiagnosis due to its distinctive pathological characteristics and morphological features. Consequently, there is a pressing clinical imperative to leverage deep learning models for STAS diagnosis. This study initially assembled histopathological images from STAS patients at the Second Xiangya Hospital and the Third Xiangya Hospital of Central South University, alongside the TCGA-LUAD cohort. Three senior pathologists conducted cross-verification annotations to construct the STAS-SXY, STAS-TXY, and STAS-TCGA datasets. We then propose a multi-pattern attention-aware multiple instance learning framework, named STAMP, to analyze and diagnose the presence of STAS across multi-center histopathology images. Specifically, the dual-branch architecture guides the model to learn STAS-associated pathological features from distinct semantic spaces. 
Transformer-based instance encoding and a multi-pattern attention aggregation module dynamically select regions closely associated with STAS pathology, suppressing irrelevant noise and enhancing the discriminative power of global representations. Moreover, a similarity regularization constraint prevents feature redundancy across branches, thereby improving overall diagnostic accuracy. Extensive experiments demonstrated that STAMP achieved competitive diagnostic results on STAS-SXY, STAS-TXY and STAS-TCGA, with AUCs of 0.8058, 0.8017, and 0.7928, respectively, surpassing the clinical level.
[115] TweezeEdit: Consistent and Efficient Image Editing with Path Regularization
Jianda Mao,Kaibo Wang,Yang Xiang,Kani Chen
Main category: cs.CV
TL;DR: TweezeEdit is a new, efficient, and inversion-free method for text-guided image editing that preserves source semantics and enables rapid editing for real-time use.
Details
Motivation: Existing image editing methods through text guidance tend to over-align with target prompts while inadequately preserving the semantics of the source image, leading to inefficiencies and elongated editing paths. Method: The method regularizes the entire denoising path instead of relying solely on inversion anchors, using gradient-driven regularization to inject target prompt semantics along a direct path with the help of a consistency model. Result: TweezeEdit achieves superior semantic preservation and target alignment, requiring only 12 steps (1.6 seconds per edit), which highlights its potential for real-time applications. Conclusion: TweezeEdit is a tuning- and inversion-free framework that enables consistent and efficient image editing, outperforming existing methods in semantic preservation and target alignment. Abstract: Large-scale pre-trained diffusion models empower users to edit images through text guidance. However, existing methods often over-align with target prompts while inadequately preserving source image semantics. Such approaches generate target images explicitly or implicitly from the inversion noise of the source images, termed the inversion anchors. We identify this strategy as suboptimal for semantic preservation and inefficient due to elongated editing paths. We propose TweezeEdit, a tuning- and inversion-free framework for consistent and efficient image editing. Our method addresses these limitations by regularizing the entire denoising path rather than relying solely on the inversion anchors, ensuring source semantic retention and shortening editing paths. Guided by gradient-driven regularization, we efficiently inject target prompt semantics along a direct path using a consistency model. Extensive experiments demonstrate TweezeEdit's superior performance in semantic preservation and target alignment, outperforming existing methods. 
Remarkably, it requires only 12 steps (1.6 seconds per edit), underscoring its potential for real-time applications.
[116] Multi-Sample Anti-Aliasing and Constrained Optimization for 3D Gaussian Splatting
Zheng Zhou,Jia-Chen Zhang,Yu-Jie Xiong,Chun-Ming Xia
Main category: cs.CV
TL;DR: This paper proposes a novel optimization framework for 3D Gaussian splatting using MSAA and dual geometric constraints to better preserve high-frequency textures and sharp discontinuities while maintaining real-time performance.
Details
Motivation: Insufficient geometric constraints in 3D Gaussian splatting lead to blurred reconstructions of fine details, especially in regions with high-frequency textures and sharp discontinuities. This work aims to address this issue with a more comprehensive optimization approach. Method: The method introduces a framework combining multisample anti-aliasing (MSAA) with dual geometric constraints, including an adaptive weighting strategy and gradient differential constraints, to improve geometric regularization and detail preservation. Result: Experimental evaluations demonstrate that the proposed method achieves state-of-the-art performance in detail preservation, with significant improvements in structural similarity (SSIM) and perceptual quality (LPIPS) over baseline approaches, while maintaining real-time rendering efficiency. Conclusion: The proposed optimization framework successfully enhances the reconstruction quality of 3D Gaussian splatting, particularly in preserving fine-grained details, high-frequency textures, and sharp discontinuities while maintaining real-time rendering efficiency. Abstract: Recent advances in 3D Gaussian splatting have significantly improved real-time novel view synthesis, yet insufficient geometric constraints during scene optimization often result in blurred reconstructions of fine-grained details, particularly in regions with high-frequency textures and sharp discontinuities. To address this, we propose a comprehensive optimization framework integrating multisample anti-aliasing (MSAA) with dual geometric constraints. Our system computes pixel colors through adaptive blending of quadruple subsamples, effectively reducing aliasing artifacts in high-frequency components. The framework introduces two constraints: (a) an adaptive weighting strategy that prioritizes under-reconstructed regions through dynamic gradient analysis, and (b) gradient differential constraints enforcing geometric regularization at object boundaries. 
This targeted optimization enables the model to allocate computational resources preferentially to critical regions requiring refinement while maintaining global consistency. Extensive experimental evaluations across multiple benchmarks demonstrate that our method achieves state-of-the-art performance in detail preservation, particularly in preserving high-frequency textures and sharp discontinuities, while maintaining real-time rendering efficiency. Quantitative metrics and perceptual studies confirm statistically significant improvements over baseline approaches in both structural similarity (SSIM) and perceptual quality (LPIPS).
[117] A Segmentation-driven Editing Method for Bolt Defect Augmentation and Detection
Yangjie Xiao,Ke Zhang,Jiacun Wang,Xin Sheng,Yurong Guo,Meijuan Chen,Zehua Ren,Zhaoye Zheng,Zhenbing Zhao
Main category: cs.CV
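TL;DR: This paper proposes SBDE, a new bolt defect detection method that significantly improves detection performance through segmentation-driven defect editing and a data augmentation strategy.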
Details
Motivation: Bolt defect detection is critical to ensuring the safety of transmission lines, but the scarcity of defect images and imbalanced data distributions significantly limit detection performance. Method: The authors propose a segmentation-driven bolt defect editing method (SBDE), comprising a bolt attribute segmentation model (Bolt-SAM), a bolt defect attribute editing model (MOD-LaMa) that combines a mask optimization module (MOD) with the image inpainting model LaMa, and an editing recovery augmentation (ERA) strategy. Result: Multiple bolt datasets were constructed and extensive experiments conducted; the results show that bolt defect images generated by SBDE outperform those of existing models and effectively improve defect detection performance. Conclusion: Experimental results show that SBDE significantly outperforms existing image editing models in generating bolt defect images and effectively improves bolt defect detection performance, verifying the effectiveness and application potential of the method. Abstract: Bolt defect detection is critical to ensure the safety of transmission lines. However, the scarcity of defect images and imbalanced data distributions significantly limit detection performance. To address this problem, we propose a segmentation-driven bolt defect editing method (SBDE) to augment the dataset. First, a bolt attribute segmentation model (Bolt-SAM) is proposed, which enhances the segmentation of complex bolt attributes through the CLAHE-FFT Adapter (CFA) and Multipart-Aware Mask Decoder (MAMD), generating high-quality masks for subsequent editing tasks. Second, a mask optimization module (MOD) is designed and integrated with the image inpainting model (LaMa) to construct the bolt defect attribute editing model (MOD-LaMa), which converts normal bolts into defective ones through attribute editing. Finally, an editing recovery augmentation (ERA) strategy is proposed to recover and put the edited defect bolts back into the original inspection scenes and expand the defect detection dataset. We constructed multiple bolt datasets and conducted extensive experiments. Experimental results demonstrate that the bolt defect images generated by SBDE significantly outperform state-of-the-art image editing models, and effectively improve the performance of bolt defect detection, which fully verifies the effectiveness and application potential of the proposed method. The code of the project is available at https://github.com/Jay-xyj/SBDE.
[118] EgoMusic-driven Human Dance Motion Estimation with Skeleton Mamba
Quang Nguyen,Nhat Le,Baoru Huang,Minh Nhat Vu,Chengcheng Tang,Van Nguyen,Ngan Le,Thieu Vo,Anh Nguyen
Main category: cs.CV
TL;DR: The paper introduces a new method called EgoMusic Motion Network with Skeleton Mamba to estimate human dance motion from both egocentric video and music, outperforming existing methods.
Details
Motivation: The task of jointly estimating human dance motion from both egocentric video and music remains largely unexplored, and current methods face challenges due to obscured body parts in egocentric views and the need to align movements with music. Method: The authors developed a new method using diffusion models and Mamba to create an EgoMusic Motion Network with a core Skeleton Mamba that explicitly captures the skeleton structure of the human body. Result: The new method clearly outperforms state-of-the-art approaches and generalizes effectively to real-world data, as demonstrated by intensive experiments. Conclusion: The proposed method, EgoMusic Motion Network with Skeleton Mamba, outperforms state-of-the-art approaches and generalizes effectively to real-world data for estimating human dance motion from both egocentric video and music. Abstract: Estimating human dance motion is a challenging task with various industrial applications. Recently, many efforts have focused on predicting human dance motion using either egocentric video or music as input. However, the task of jointly estimating human motion from both egocentric video and music remains largely unexplored. In this paper, we aim to develop a new method that predicts human dance motion from both egocentric video and music. In practice, the egocentric view often obscures much of the body, making accurate full-pose estimation challenging. Additionally, incorporating music requires the generated head and body movements to align well with both visual and musical inputs. We first introduce EgoAIST++, a new large-scale dataset that combines both egocentric views and music with more than 36 hours of dancing motion. Drawing on the success of diffusion models and Mamba on modeling sequences, we develop an EgoMusic Motion Network with a core Skeleton Mamba that explicitly captures the skeleton structure of the human body. We illustrate that our approach is theoretically supportive. 
Intensive experiments show that our method clearly outperforms state-of-the-art approaches and generalizes effectively to real-world data.
[119] Reasoning in Computer Vision: Taxonomy, Models, Tasks, and Methodologies
Ayushman Sarkar,Mohd Yamani Idna Idris,Zhenyu Yu
Main category: cs.CV
TL;DR: This survey paper categorizes visual reasoning into five major types, examines their implementation through various architectures, reviews evaluation protocols, identifies challenges, and outlines a future research agenda to develop transparent and trustworthy AI systems.
Details
Motivation: The motivation is to address the lack of a unified analysis and comparison across reasoning types, methodologies, and evaluation protocols in visual reasoning research. Method: The paper categorizes visual reasoning into five types and systematically examines implementation architectures and evaluation protocols, while analyzing their limitations and identifying open challenges. Result: The result is a comprehensive survey that categorizes visual reasoning, reviews evaluation protocols, identifies key challenges, and outlines a forward-looking research agenda for next-generation vision systems. Conclusion: The paper emphasizes the importance of bridging perception and reasoning for the development of transparent, cross-domain adaptive AI systems, highlighting key challenges and a future research agenda. Abstract: Visual reasoning is critical for a wide range of computer vision tasks that go beyond surface-level object detection and classification. Despite notable advances in relational, symbolic, temporal, causal, and commonsense reasoning, existing surveys often address these directions in isolation, lacking a unified analysis and comparison across reasoning types, methodologies, and evaluation protocols. This survey aims to address this gap by categorizing visual reasoning into five major types (relational, symbolic, temporal, causal, and commonsense) and systematically examining their implementation through architectures such as graph-based models, memory networks, attention mechanisms, and neuro-symbolic systems. We review evaluation protocols designed to assess functional correctness, structural consistency, and causal validity, and critically analyze their limitations in terms of generalizability, reproducibility, and explanatory power. 
Beyond evaluation, we identify key open challenges in visual reasoning, including scalability to complex scenes, deeper integration of symbolic and neural paradigms, the lack of comprehensive benchmark datasets, and reasoning under weak supervision. Finally, we outline a forward-looking research agenda for next-generation vision systems, emphasizing that bridging perception and reasoning is essential for building transparent, trustworthy, and cross-domain adaptive AI systems, particularly in critical domains such as autonomous driving and medical diagnostics.
[120] Med-GLIP: Advancing Medical Language-Image Pre-training with Large-scale Grounded Dataset
Ziye Deng,Ruihan He,Jiaxiang Liu,Yuan Wang,Zijie Meng,Songtao Jiang,Yong Xie,Zuozhu Liu
Main category: cs.CV
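TL;DR: This study proposes the Med-GLIP framework and the Med-GLIP-5M dataset, which align medical images with natural language efficiently and significantly improve accuracy on medical image analysis tasks.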
Details
Motivation: Existing medical image grounding research is constrained by limited modality coverage, coarse annotations, and the lack of a unified, generalizable grounding framework. Method: The authors construct Med-GLIP-5M, a medical image dataset with over 5.3 million region-level annotations, and propose the Med-GLIP framework, which acquires hierarchical semantic understanding implicitly. Result: Med-GLIP outperforms existing state-of-the-art methods on multiple medical image grounding tasks and significantly improves downstream tasks such as medical VQA and report generation. Conclusion: Med-GLIP provides a unified, generalizable medical image grounding framework that, by leveraging the large-scale multimodal dataset Med-GLIP-5M, significantly improves performance on medical image analysis tasks. Abstract: Medical image grounding aims to align natural language phrases with specific regions in medical images, serving as a foundational task for intelligent diagnosis, visual question answering (VQA), and automated report generation (MRG). However, existing research is constrained by limited modality coverage, coarse-grained annotations, and the absence of a unified, generalizable grounding framework. To address these challenges, we construct a large-scale medical grounding dataset Med-GLIP-5M comprising over 5.3 million region-level annotations across seven imaging modalities, covering diverse anatomical structures and pathological findings. The dataset supports both segmentation and grounding tasks with hierarchical region labels, ranging from organ-level boundaries to fine-grained lesions. Based on this foundation, we propose Med-GLIP, a modality-aware grounding framework trained on Med-GLIP-5M. Rather than relying on explicitly designed expert modules, Med-GLIP implicitly acquires hierarchical semantic understanding from diverse training data -- enabling it to recognize multi-granularity structures, such as distinguishing lungs from pneumonia lesions. Extensive experiments demonstrate that Med-GLIP consistently outperforms state-of-the-art baselines across multiple grounding benchmarks. Furthermore, integrating its spatial outputs into downstream tasks, including medical VQA and report generation, leads to substantial performance gains. Our dataset will be released soon.
[121] GCRPNet: Graph-Enhanced Contextual and Regional Perception Network For Salient Object Detection in Optical Remote Sensing Images
Mengyu Ren,Yutong Li,Hua Li,Runmin Cong,Sam Kwong
Main category: cs.CV
TL;DR: This paper proposes GCRPNet, a graph-enhanced contextual and regional perception network for salient object detection in optical remote sensing images, achieving state-of-the-art performance.
Details
Motivation: Salient object detection in optical remote sensing images faces challenges such as significant variations in target scales and low contrast between targets and the background. Existing methods struggle to effectively integrate heterogeneous features from vision transformers and convolutional neural networks. Method: GCRPNet is built upon the Mamba architecture and incorporates a visual state space (VSS) encoder for multi-scale feature extraction. It also uses a difference-similarity guided hierarchical graph attention module (DS-HGAM) and a LEVSS block decoder with adaptive scanning strategy and multi-granularity collaborative attention enhancement module (MCAEM). Result: Extensive experimental results demonstrate that the proposed model achieves state-of-the-art performance, validating its effectiveness and superiority. Conclusion: The proposed GCRPNet model achieves state-of-the-art performance in salient object detection for optical remote sensing images, demonstrating its effectiveness and superiority. Abstract: Salient object detection (SOD) in optical remote sensing images (ORSIs) faces numerous challenges, including significant variations in target scales and low contrast between targets and the background. Existing methods based on vision transformers (ViTs) and convolutional neural networks (CNNs) architectures aim to leverage both global and local features, but the difficulty in effectively integrating these heterogeneous features limits their overall performance. To overcome these limitations, we propose a graph-enhanced contextual and regional perception network (GCRPNet), which builds upon the Mamba architecture to simultaneously capture long-range dependencies and enhance regional feature representation. Specifically, we employ the visual state space (VSS) encoder to extract multi-scale features. 
To further achieve deep guidance and enhancement of these features, we first design a difference-similarity guided hierarchical graph attention module (DS-HGAM). This module strengthens cross-layer interaction capabilities between features of different scales while enhancing the model's structural perception, allowing it to distinguish between foreground and background more effectively. Then, we design the LEVSS block as the decoder of GCRPNet. This module integrates our proposed adaptive scanning strategy and multi-granularity collaborative attention enhancement module (MCAEM). It performs adaptive patch scanning on feature maps processed via multi-scale convolutions, thereby capturing rich local region information and enhancing Mamba's local modeling capability. Extensive experimental results demonstrate that the proposed model achieves state-of-the-art performance, validating its effectiveness and superiority.
[122] PSScreen: Partially Supervised Multiple Retinal Disease Screening
Boyi Zheng,Qing Liu
Main category: cs.CV
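TL;DR: PSScreen is a novel partially supervised learning model for multiple retinal disease screening. Its two streams learn deterministic and probabilistic features respectively; it uses text-guided decoupling and feature distillation to improve domain generalization, addresses the missing-label problem, and shows strong detection performance in experiments.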
Details
Motivation: Training a multiple retinal disease screening model on several partially labeled datasets reduces reliance on fully annotated datasets, but training datasets from different medical sites exhibit significant domain shift and missing labels for some classes. Method: The authors propose PSScreen, a novel partially supervised multiple retinal disease screening model with two streams, one learning deterministic features and the other learning probabilistic features via uncertainty injection, and use text-guided decoupling and feature distillation to improve domain generalization. Result: Experiments show that PSScreen significantly improves retinal disease detection performance and achieves state-of-the-art results on both in-domain and out-of-domain datasets. Conclusion: PSScreen significantly improves retinal disease detection performance and achieves state-of-the-art results on average across six retinal diseases and the normal state. Abstract: Leveraging multiple partially labeled datasets to train a model for multiple retinal disease screening reduces the reliance on fully annotated datasets, but remains challenging due to significant domain shifts across training datasets from various medical sites, and the label absent issue for partial classes. To solve these challenges, we propose PSScreen, a novel Partially Supervised multiple retinal disease Screening model. Our PSScreen consists of two streams: one learns deterministic features and the other learns probabilistic features via uncertainty injection. Then, we leverage the textual guidance to decouple two types of features into disease-wise features and align them via feature distillation to boost the domain generalization ability. Meanwhile, we employ pseudo label consistency between two streams to address the label absent issue and introduce a self-distillation to transfer task-relevant semantics about known classes from the deterministic to the probabilistic stream to further enhance the detection performances. Experiments show that our PSScreen significantly enhances the detection performances on six retinal diseases and the normal state on average and achieves state-of-the-art results on both in-domain and out-of-domain datasets. Codes are available at https://github.com/boyiZheng99/PSScreen.
[123] AR Surgical Navigation With Surface Tracing: Comparing In-Situ Visualization with Tool-Tracking Guidance for Neurosurgical Applications
Marc J. Fischer,Jeffrey Potts,Gabriel Urreola,Dax Jones,Paolo Palmisciano,E. Bradley Strong,Branden Cord,Andrew D. Hernandez,Julia D. Sharma,E. Brandon Strong
Main category: cs.CV
TL;DR: This study introduces an augmented reality surgical navigation system using Microsoft HoloLens 2, showing improved accuracy and user experience in simulated catheter placements compared to static AR visualization.
Details
Motivation: The motivation is to overcome the limitations of traditional surgical navigation systems, particularly the challenges of AR depth perception and occlusion handling, by introducing a more precise and user-friendly AR guidance methodology for surgical procedures. Method: The study utilizes a novel surface tracing method for registering anatomical targets and employs real-time infrared tool tracking using the onboard sensors of Microsoft HoloLens 2. Participants performed simulated catheter insertions under two AR guidance conditions: static in-situ visualization and real-time tool-tracking guidance. Result: Tool-tracking guidance improved performance across all accuracy measures, including insertion accuracy, target deviation, angular error, and depth precision. Subjective evaluations also showed a user preference for real-time tool-tracking guidance. Conclusion: The study concludes that the novel AR guidance methodology enhances surgical precision and user experience compared to traditional methods, with tool-tracking guidance showing superior performance and preference in simulated catheter insertions. Abstract: Augmented Reality (AR) surgical navigation systems are emerging as the next generation of intraoperative surgical guidance, promising to overcome limitations of traditional navigation systems. However, known issues with AR depth perception due to vergence-accommodation conflict and occlusion handling limitations of the currently commercially available display technology present acute challenges in surgical settings where precision is paramount. This study presents a novel methodology for utilizing AR guidance to register anatomical targets and provide real-time instrument navigation using placement of simulated external ventricular drain catheters on a phantom model as the clinical scenario. 
The system registers target positions to the patient through a novel surface tracing method and uses real-time infrared tool tracking to aid in catheter placement, relying only on the onboard sensors of the Microsoft HoloLens 2. A group of intended users performed the procedure of simulated insertions under two AR guidance conditions: static in-situ visualization, where planned trajectories are overlaid directly onto the patient anatomy, and real-time tool-tracking guidance, where live feedback of the catheter's pose is provided relative to the plan. Following the insertion tests, computed tomography scans of the phantom models were acquired, allowing for evaluation of insertion accuracy, target deviation, angular error, and depth precision. System Usability Scale surveys assessed user experience and cognitive workload. Tool-tracking guidance improved performance metrics across all accuracy measures and was preferred by users in subjective evaluations. A free copy of this paper and all supplemental materials are available at https://bit.ly/45l89Hq.
[124] Retrieval-Augmented Prompt for OOD Detection
Ruisong Han,Zongbo Han,Jiahao Zhang,Mingyue Cheng,Changqing Zhang
Main category: cs.CV
TL;DR: RAP improves OOD detection using retrieval-augmented prompts for better semantic supervision, achieving top results on benchmarks with reduced FPR95 and increased AUROC.
Details
Motivation: Existing OOD detection methods struggle due to limited and mismatched outlier samples, resulting in insufficient semantic supervision. RAP addresses this by incorporating external knowledge to improve semantic guidance. Method: RAP augments prompts for a pre-trained vision-language model by retrieving descriptive words for outliers from external textual knowledge during training and dynamically updates OOD prompts during testing for real-time adaptation. Result: RAP achieves state-of-the-art performance, reducing the average FPR95 by 7.05% and improving AUROC by 1.71% in 1-shot OOD detection on ImageNet-1k. Ablation studies confirm the effectiveness of its modules. Conclusion: The proposed RAP method significantly enhances OOD detection performance by leveraging external knowledge for semantic supervision, achieving state-of-the-art results on benchmarks like ImageNet-1k. Abstract: Out-of-Distribution (OOD) detection is crucial for the reliable deployment of machine learning models in-the-wild, enabling accurate identification of test samples that differ from the training data distribution. Existing methods rely on auxiliary outlier samples or in-distribution (ID) data to generate outlier information for training, but due to limited outliers and their mismatch with real test OOD samples, they often fail to provide sufficient semantic supervision, leading to suboptimal performance. To address this, we propose a novel OOD detection method called Retrieval-Augmented Prompt (RAP). RAP augments a pre-trained vision-language model's prompts by retrieving external knowledge, offering enhanced semantic supervision for OOD detection. During training, RAP retrieves descriptive words for outliers based on joint similarity with external textual knowledge and uses them to augment the model's OOD prompts. During testing, RAP dynamically updates OOD prompts in real-time based on the encountered OOD samples, enabling the model to rapidly adapt to the test environment. 
Our extensive experiments demonstrate that RAP achieves state-of-the-art performance on large-scale OOD detection benchmarks. For example, in 1-shot OOD detection on the ImageNet-1k dataset, RAP reduces the average FPR95 by 7.05% and improves the AUROC by 1.71% compared to previous methods. Additionally, comprehensive ablation studies validate the effectiveness of each module and the underlying motivations of our approach.
[125] PTQAT: A Hybrid Parameter-Efficient Quantization Algorithm for 3D Perception Tasks
Xinhao Wang,Zhiwei Lin,Zhongyu Xia,Yongtao Wang
Main category: cs.CV
TL;DR: PTQAT is a hybrid quantization approach that improves efficiency and accuracy by selectively fine-tuning critical layers while applying PTQ to the rest, outperforming traditional QAT methods in 3D perception tasks.
Details
Motivation: The paper addresses the limitations of existing quantization methods: PTQ causes performance degradation while QAT is resource-intensive. The goal is to find a balance between speed and accuracy. Method: PTQAT selects critical layers for QAT fine-tuning based on quantization discrepancies and applies PTQ to the remaining layers, improving quantization accuracy by compensating for error propagation. Result: PTQAT outperforms QAT in multiple 3D perception tasks (e.g., object detection, semantic segmentation) with 0.2%-0.9% NDS and 0.3%-2.0% mIoU gains, while reducing the number of fine-tuned weights significantly. Conclusion: PTQAT is an effective hybrid quantization method that achieves better performance than QAT by selectively fine-tuning critical layers and applying PTQ on the rest, resulting in improved efficiency and accuracy across various 3D perception tasks and model architectures. Abstract: Post-Training Quantization (PTQ) and Quantization-Aware Training (QAT) represent two mainstream model quantization approaches. However, PTQ often leads to unacceptable performance degradation in quantized models, while QAT imposes substantial GPU memory requirements and extended training time due to weight fine-tuning. In this paper, we propose PTQAT, a novel general hybrid quantization algorithm for the efficient deployment of 3D perception networks. To address the speed-accuracy trade-off between PTQ and QAT, our method selects critical layers for QAT fine-tuning and performs PTQ on the remaining layers. Contrary to intuition, fine-tuning the layers with smaller output discrepancies before and after quantization, rather than those with larger discrepancies, actually leads to greater improvements in the model's quantization accuracy. This means we better compensate for quantization errors during their propagation, rather than addressing them at the point where they occur. 
The proposed PTQAT achieves similar performance to QAT with more efficiency by freezing nearly 50% of quantifiable layers. Additionally, PTQAT is a universal quantization method that supports various quantization bit widths (4 bits) as well as different model architectures, including CNNs and Transformers. The experimental results on nuScenes across diverse 3D perception tasks, including object detection, semantic segmentation, and occupancy prediction, show that our method consistently outperforms QAT-only baselines. Notably, it achieves 0.2%-0.9% NDS and 0.3%-1.0% mAP gains in object detection, 0.3%-2.0% mIoU gains in semantic segmentation and occupancy prediction while fine-tuning fewer weights.
[126] HM-Talker: Hybrid Motion Modeling for High-Fidelity Talking Head Synthesis
Shiyu Liu,Kui Jiang,Xianming Liu,Hongxun Yao,Xiaocheng Feng
Main category: cs.CV
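TL;DR: This paper proposes HM-Talker, which combines explicit Action Units with implicit features to improve the quality and generalization of audio-driven talking head video generation.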
Details
Motivation: Current audio-driven talking head generation methods rely on implicit modeling and lack explicit priors on speech-related facial movements, which leads to motion blur and lip jitter; explicit priors are therefore needed to improve generation quality. Method: The HM-Talker framework uses a hybrid motion representation comprising explicit Action Units (AUs) and implicit features, modeled through a Cross-Modal Disentanglement Module (CMDM) and a Hybrid Motion Modeling Module (HMMM), to map audio accurately to facial motion. Result: Experiments show that HM-Talker outperforms existing state-of-the-art methods in visual quality and lip-sync accuracy, generates high-fidelity, temporally coherent talking heads, and generalizes well across identities. Conclusion: HM-Talker proposes a hybrid motion representation that combines explicit and implicit motion cues, effectively addressing motion blur and lip jitter in audio-driven talking head video generation and performing well in cross-subject generalization. Abstract: Audio-driven talking head video generation enhances user engagement in human-computer interaction. However, current methods frequently produce videos with motion blur and lip jitter, primarily due to their reliance on implicit modeling of audio-facial motion correlations--an approach lacking explicit articulatory priors (i.e., anatomical guidance for speech-related facial movements). To overcome this limitation, we propose HM-Talker, a novel framework for generating high-fidelity, temporally coherent talking heads. HM-Talker leverages a hybrid motion representation combining both implicit and explicit motion cues. Explicit cues use Action Units (AUs), anatomically defined facial muscle movements, alongside implicit features to minimize phoneme-viseme misalignment. Specifically, our Cross-Modal Disentanglement Module (CMDM) extracts complementary implicit/explicit motion features while predicting AUs directly from audio input aligned to visual cues. To mitigate identity-dependent biases in explicit features and enhance cross-subject generalization, we introduce the Hybrid Motion Modeling Module (HMMM). This module dynamically merges randomly paired implicit/explicit features, enforcing identity-agnostic learning. Together, these components enable robust lip synchronization across diverse identities, advancing personalized talking head synthesis. Extensive experiments demonstrate HM-Talker's superiority over state-of-the-art methods in visual quality and lip-sync accuracy.
[127] SpaRC-AD: A Baseline for Radar-Camera Fusion in End-to-End Autonomous Driving
Philipp Wolters,Johannes Gilg,Torben Teepe,Gerhard Rigoll
Main category: cs.CV
TL;DR: SpaRC-AD is a query-based end-to-end camera-radar fusion framework that addresses perception, motion forecasting, and planning in autonomous driving.
Details
Motivation: Vision-based approaches face fundamental limitations in adverse weather, under partial occlusion, and in precise velocity estimation, all of which are critical for collision avoidance in safety-sensitive scenarios. Method: Sparse 3D feature alignment and Doppler-based velocity estimation yield strong 3D scene representations for refining agent anchors, map polylines, and motion modeling. Result: The method achieves notable improvements across multiple autonomous driving tasks, including 3D detection (+4.8% mAP), multi-object tracking (+8.3% AMOTA), online mapping (+1.8% mAP), motion prediction (-4.0% mADE), and trajectory planning (-0.1m L2 and -9% TPC). Conclusion: SpaRC-AD is a query-based end-to-end camera-radar fusion framework for planning-oriented autonomous driving. It outperforms existing vision-only methods across multiple autonomous driving tasks and demonstrates the effectiveness of radar fusion, particularly in safety-critical scenarios. Abstract: End-to-end autonomous driving systems promise stronger performance through unified optimization of perception, motion forecasting, and planning. However, vision-based approaches face fundamental limitations in adverse weather conditions, partial occlusions, and precise velocity estimation - critical challenges in safety-sensitive scenarios where accurate motion understanding and long-horizon trajectory prediction are essential for collision avoidance. To address these limitations, we propose SpaRC-AD, a query-based end-to-end camera-radar fusion framework for planning-oriented autonomous driving. Through sparse 3D feature alignment and Doppler-based velocity estimation, we achieve strong 3D scene representations for refinement of agent anchors, map polylines and motion modelling. Our method achieves strong improvements over the state-of-the-art vision-only baselines across multiple autonomous driving tasks, including 3D detection (+4.8% mAP), multi-object tracking (+8.3% AMOTA), online mapping (+1.8% mAP), motion prediction (-4.0% mADE), and trajectory planning (-0.1m L2 and -9% TPC). We achieve both spatial coherence and temporal consistency on multiple challenging benchmarks, including real-world open-loop nuScenes, long-horizon T-nuScenes, and closed-loop simulator Bench2Drive. We show the effectiveness of radar-based fusion in safety-critical scenarios where accurate motion understanding and long-horizon trajectory prediction are essential for collision avoidance.
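As a rough illustration of the L2 planning figure quoted above, the sketch below averages the Euclidean distance between planned and ground-truth waypoints. The function name `average_l2_error` and the waypoint format are assumptions made for this example; the benchmarks named in the abstract define their own horizons and evaluation protocols, which are not reproduced here.

```python
import math


def average_l2_error(planned, ground_truth):
    """Mean Euclidean distance (same unit as the inputs) between two trajectories.

    Both arguments are equal-length sequences of (x, y) waypoints.
    This is a generic displacement error, not a specific benchmark's protocol.
    """
    if not planned or len(planned) != len(ground_truth):
        raise ValueError("trajectories must be non-empty and of equal length")
    # math.dist computes the Euclidean distance between two points.
    distances = [math.dist(p, g) for p, g in zip(planned, ground_truth)]
    return sum(distances) / len(distances)


# Hypothetical waypoints in metres: the plan drifts sideways over time.
plan = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
truth = [(0.0, 0.0), (1.0, 1.0), (2.0, 2.0)]
error = average_l2_error(plan, truth)
```

Under this definition a reported change of -0.1m would mean the planned waypoints sit, on average, 0.1 m closer to the ground-truth trajectory.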
The source code of all experiments is available at https://phi-wol.github.io/sparcad/
[128] Adapting SAM via Cross-Entropy Masking for Class Imbalance in Remote Sensing Change Detection
Humza Naveed,Xina Zeng,Mitch Bryson,Nagita Mehrseresht
Main category: cs.CV
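TL;DR: This work proposes an improved remote sensing change detection method built on the Segment Anything Model (SAM), combining fine-tuning, spatial-temporal feature enhancement, multi-scale decoder fusion, and a cross-entropy masking loss to significantly improve detection performance.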
Details
Motivation: Foundation models such as SAM have achieved great success in computer vision, but applying them effectively to remote sensing change detection remains a challenge. The study aims to exploit SAM's general representations and combine them with task-specific techniques to improve change detection performance. Method: The SAM encoder is fine-tuned and combined with spatial-temporal feature enhancement (STFE) and multi-scale decoder fusion (MSDF), and a cross-entropy masking (CEM) loss is designed to handle class imbalance. Result: The method surpasses state-of-the-art methods on four change detection datasets (Levir-CD, WHU-CD, CLCD, and S2Looking), with a 2.5% F1-score improvement on S2Looking. Conclusion: The study proposes a remote sensing change detection method based on the SAM encoder, improving detection accuracy through spatial-temporal feature enhancement, multi-scale decoder fusion, and a new cross-entropy masking loss, and ultimately surpassing state-of-the-art methods on multiple datasets. Abstract: Foundational models have achieved significant success in diverse domains of computer vision. They learn general representations that are easily transferable to tasks not seen during training. One such foundational model is the Segment Anything Model (SAM), which can accurately segment objects in images. We propose adapting the SAM encoder via fine-tuning for remote sensing change detection (RSCD) along with spatial-temporal feature enhancement (STFE) and multi-scale decoder fusion (MSDF) to detect changes robustly at multiple scales. Additionally, we propose a novel cross-entropy masking (CEM) loss to handle high class imbalance in change detection datasets. Our method outperforms state-of-the-art (SOTA) methods on four change detection datasets, Levir-CD, WHU-CD, CLCD, and S2Looking. We achieved a 2.5% F1-score improvement on the large, complex S2Looking dataset. The code is available at: https://github.com/humza909/SAM-CEM-CD
[129] Towards Agentic AI for Multimodal-Guided Video Object Segmentation
Tuyen Tran,Thao Minh Le,Truyen Tran
Main category: cs.CV
TL;DR: The paper proposes a new agentic system called Multi-Modal Agent that dynamically generates workflows using large language models to solve video object segmentation tasks guided by multimodal cues, achieving better performance than previous approaches.
Details
Motivation: Traditional approaches require training specialized models, leading to high computational complexity and manual effort. The emergence of vision-language foundation models provides an opportunity to develop flexible, training-free solutions for dynamic tasks. Method: The method uses large language models to generate dynamic workflows, which interact iteratively with specialized tools across modalities to identify target objects based on multimodal cues. Result: The proposed agentic approach outperforms existing methods on two multimodal-conditioned video object segmentation tasks: RVOS and Ref-AVS. Conclusion: The proposed Multi-Modal Agent demonstrates significant improvements over previous methods in handling dynamic multimodal tasks, particularly in RVOS and Ref-AVS tasks. Abstract: Referring-based Video Object Segmentation is a multimodal problem that requires producing fine-grained segmentation results guided by external cues. Traditional approaches to this task typically involve training specialized models, which come with high computational complexity and manual annotation effort. Recent advances in vision-language foundation models open a promising direction toward training-free approaches. Several studies have explored leveraging these general-purpose models for fine-grained segmentation, achieving performance comparable to that of fully supervised, task-specific models. However, existing methods rely on fixed pipelines that lack the flexibility needed to adapt to the dynamic nature of the task. To address this limitation, we propose Multi-Modal Agent, a novel agentic system designed to solve this task in a more flexible and adaptive manner. Specifically, our method leverages the reasoning capabilities of large language models (LLMs) to generate dynamic workflows tailored to each input. 
This adaptive procedure iteratively interacts with a set of specialized tools designed for low-level tasks across different modalities to identify the target object described by the multimodal cues. Our agentic approach demonstrates clear improvements over prior methods on two multimodal-conditioned VOS tasks: RVOS and Ref-AVS.
[130] HumanSense: From Multimodal Perception to Empathetic Context-Aware Responses through Reasoning MLLMs
Zheng Qin,Ruobing Zheng,Yabing Wang,Tianqi Li,Yi Yuan,Jingdong Chen,Le Wang
Main category: cs.CV
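TL;DR: This paper introduces the HumanSense benchmark for evaluating multimodal large language models in human-centered interaction, finds that combining multimodal inputs with reinforcement learning significantly improves reasoning ability, and shows that designed prompts improve non-reasoning models.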
Details
Motivation: Progress toward truly human-like interaction with multimodal large language models is limited by the lack of fine-grained evaluation frameworks for human-centered scenarios, so a new evaluation framework is needed to advance the field. Method: The HumanSense benchmark is introduced to evaluate how well multimodal large language models understand complex human intentions and generate empathetic, context-aware responses, and a multi-stage, modality-progressive reinforcement learning method is used to enhance reasoning ability. Result: Multimodal input that adds audio and text information significantly improves model performance, and Omni models have an advantage on these tasks; after reasoning ability is enhanced through multi-stage, modality-progressive reinforcement learning, the model makes substantial gains in the evaluation. Conclusion: The HumanSense benchmark shows that multimodal large language models still have room for improvement on human-centered interaction tasks; enhancing reasoning ability through multi-stage, modality-progressive reinforcement learning significantly improves performance, and designing corresponding prompts improves non-reasoning models. Abstract: While Multimodal Large Language Models (MLLMs) show immense promise for achieving truly human-like interactions, progress is hindered by the lack of fine-grained evaluation frameworks for human-centered scenarios, encompassing both the understanding of complex human intentions and the provision of empathetic, context-aware responses. Here we introduce HumanSense, a comprehensive benchmark designed to evaluate the human-centered perception and interaction capabilities of MLLMs, with a particular focus on deep understanding of extended multimodal contexts and the formulation of rational feedback. Our evaluation reveals that leading MLLMs still have considerable room for improvement, particularly for advanced interaction-oriented tasks. Supplementing visual input with audio and text information yields substantial improvements, and Omni-modal models show advantages on these tasks. Furthermore, we argue that appropriate feedback stems from a contextual analysis of the interlocutor's needs and emotions, with reasoning ability serving as the key to unlocking it. Accordingly, we employ a multi-stage, modality-progressive reinforcement learning to enhance the reasoning abilities of an Omni model, achieving substantial gains on evaluation results. Additionally, we observe that successful reasoning processes exhibit highly consistent thought patterns. By designing corresponding prompts, we also enhance the performance of non-reasoning models in a training-free manner.
Project page: https://digital-avatar.github.io/ai/HumanSense/
[131] EvTurb: Event Camera Guided Turbulence Removal
Yixing Liu,Minggui Teng,Yifei Xia,Peiqi Duan,Boxin Shi
Main category: cs.CV
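TL;DR: This paper proposes EvTurb, a new turbulence removal framework that uses event guidance to decouple blur and tilt effects; experiments show superior performance with computational efficiency.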
Details
Motivation: Atmospheric turbulence degrades image quality by introducing blur and geometric tilt distortions, posing significant challenges to downstream computer vision tasks. Existing single-image and multi-frame methods struggle with this highly ill-posed problem because of the compositional complexity of turbulence-induced distortions. Method: EvTurb, an event-guided turbulence removal framework, models event-based turbulence formation with a two-step event-guided network: event integrals first reduce blur, and a variance map derived from raw event streams then eliminates tilt distortion. Result: Experiments show that EvTurb outperforms existing state-of-the-art methods at turbulence removal while maintaining computational efficiency. The authors also present TurbEvent, the first real-captured dataset featuring diverse turbulence scenarios. Conclusion: EvTurb is a turbulence removal framework that uses high-speed event streams to decouple blur and tilt effects, outperforming existing state-of-the-art methods while maintaining computational efficiency. Abstract: Atmospheric turbulence degrades image quality by introducing blur and geometric tilt distortions, posing significant challenges to downstream computer vision tasks. Existing single-image and multi-frame methods struggle with the highly ill-posed nature of this problem due to the compositional complexity of turbulence-induced distortions. To address this, we propose EvTurb, an event guided turbulence removal framework that leverages high-speed event streams to decouple blur and tilt effects. EvTurb decouples blur and tilt effects by modeling event-based turbulence formation, specifically through a novel two-step event-guided network: event integrals are first employed to reduce blur in the coarse outputs. This is followed by employing a variance map, derived from raw event streams, to eliminate the tilt distortion for the refined outputs. Additionally, we present TurbEvent, the first real-captured dataset featuring diverse turbulence scenarios. Experimental results demonstrate that EvTurb surpasses state-of-the-art methods while maintaining computational efficiency.
[132] Towards Powerful and Practical Patch Attacks for 2D Object Detection in Autonomous Driving
Yuxin Cao,Yedi Zhang,Wentao He,Yifan Liao,Yan Xiao,Chang Li,Zhiyong Huang,Jin Song Dong
Main category: cs.CV
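TL;DR: This paper proposes P$^3$A, a new practical attack framework against 2D object detection models in autonomous driving that performs well on high-resolution data.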
Details
Motivation: The effectiveness of existing mAP-based attack methods is overestimated in practical attack scenarios, and they perform poorly on high-resolution data, so a more practical and effective attack method is needed. Method: A new attack framework, P$^3$A, is proposed, comprising a new metric (PASR), a tailored loss function (LCSL), and a data preprocessing method (PSPP). Result: P$^3$A outperforms existing methods on unseen models and high-resolution datasets, as confirmed under both the new metric and traditional metrics. Conclusion: The P$^3$A framework delivers stronger attack effectiveness and transferability against 2D object detection models on high-resolution datasets, addressing the overestimation of earlier methods in practical scenarios. Abstract: Learning-based autonomous driving systems remain critically vulnerable to adversarial patches, posing serious safety and security risks in their real-world deployment. Black-box attacks, notable for their high attack success rate without model knowledge, are especially concerning, with their transferability extensively studied to reduce computational costs compared to query-based attacks. Previous transferability-based black-box attacks typically adopt mean Average Precision (mAP) as the evaluation metric and design training loss accordingly. However, due to the presence of multiple detected bounding boxes and the relatively lenient Intersection over Union (IoU) thresholds, the attack effectiveness of these approaches is often overestimated, resulting in reduced success rates in practical attacking scenarios. Furthermore, patches trained on low-resolution data often fail to maintain effectiveness on high-resolution images, limiting their transferability to autonomous driving datasets. To fill this gap, we propose P$^3$A, a Powerful and Practical Patch Attack framework for 2D object detection in autonomous driving, specifically optimized for high-resolution datasets. First, we introduce a novel metric, Practical Attack Success Rate (PASR), to more accurately quantify attack effectiveness with greater relevance for pedestrian safety. Second, we present a tailored Localization-Confidence Suppression Loss (LCSL) to improve attack transferability under PASR. Finally, to maintain the transferability for high-resolution datasets, we further incorporate the Probabilistic Scale-Preserving Padding (PSPP) into the patch attack pipeline as a data preprocessing step.
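For readers unfamiliar with the IoU thresholds mentioned above, the following sketch computes Intersection over Union for two axis-aligned boxes and shows how the choice of threshold changes whether a detection counts as a match. It is a generic illustration only: the box format `(x1, y1, x2, y2)` and the helper names are assumptions, and it does not implement PASR, LCSL, or any other component of P$^3$A.

```python
def iou(box_a, box_b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    # Corners of the overlapping rectangle (may be empty).
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def is_match(detection, ground_truth, threshold):
    """A detection counts as a match when its IoU reaches the threshold."""
    return iou(detection, ground_truth) >= threshold


# Hypothetical boxes: the detection is shifted 2 px from a 10x10 ground truth.
ground_truth = (0, 0, 10, 10)
detection = (2, 0, 12, 10)
lenient = is_match(detection, ground_truth, threshold=0.5)
strict = is_match(detection, ground_truth, threshold=0.75)
```

The shifted box has an IoU of 2/3 with the ground truth, so it counts as a match at a threshold of 0.5 but not at 0.75, which is why the threshold choice affects how an attack's success is scored.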
Extensive experiments show that P$^3$A outperforms state-of-the-art attacks on unseen models and unseen high-resolution datasets, both under the proposed practical IoU-based evaluation metric and the previous mAP-based metrics.
[133] Fourier-Guided Attention Upsampling for Image Super-Resolution
Daejune Choi,Youchan No,Jinhyung Lee,Duksu Kim
Main category: cs.CV
TL;DR: Frequency-Guided Attention (FGA) is a lightweight and effective upsampling module for single image super-resolution that significantly enhances detail reconstruction and reduces aliasing compared to traditional methods.
Details
Motivation: Traditional upsampling methods, such as Sub-Pixel Convolution, are efficient but often fail to reconstruct high-frequency details and introduce aliasing artifacts. This limitation motivates the development of a more effective and lightweight upsampling approach. Method: FGA integrates three components: a Fourier feature-based MLP for positional frequency encoding, a cross-resolution Correlation Attention Layer for adaptive spatial alignment, and a frequency-domain L1 loss for spectral fidelity supervision. Result: FGA adds only 0.3M parameters and achieves consistent performance improvements across five diverse super-resolution backbones. It delivers average PSNR gains of 0.12~0.14 dB and improves frequency-domain consistency by up to 29%, particularly on texture-rich datasets. Conclusion: FGA is a practical and scalable alternative to traditional upsampling methods, effectively reducing aliasing and preserving fine details in single image super-resolution. Abstract: We propose Frequency-Guided Attention (FGA), a lightweight upsampling module for single image super-resolution. Conventional upsamplers, such as Sub-Pixel Convolution, are efficient but frequently fail to reconstruct high-frequency details and introduce aliasing artifacts. FGA addresses these issues by integrating (1) a Fourier feature-based Multi-Layer Perceptron (MLP) for positional frequency encoding, (2) a cross-resolution Correlation Attention Layer for adaptive spatial alignment, and (3) a frequency-domain L1 loss for spectral fidelity supervision. Adding merely 0.3M parameters, FGA consistently enhances performance across five diverse super-resolution backbones in both lightweight and full-capacity scenarios. Experimental results demonstrate average PSNR gains of 0.12~0.14 dB and improved frequency-domain consistency by up to 29%, particularly evident on texture-rich datasets. 
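The frequency-domain L1 loss listed as component (3) can be pictured with a small sketch: transform the prediction and the target with a 2D discrete Fourier transform and average the absolute difference of their coefficients. The abstract does not give the exact formulation, so the normalization, the use of complex differences, and the names `dft2` and `frequency_l1_loss` are all assumptions; the naive DFT is used only to keep the example dependency-free and is far too slow for real images.

```python
import cmath


def dft2(image):
    """Naive 2D discrete Fourier transform of a list-of-lists image."""
    height, width = len(image), len(image[0])
    spectrum = [[0j] * width for _ in range(height)]
    for u in range(height):
        for v in range(width):
            total = 0j
            for y in range(height):
                for x in range(width):
                    angle = -2 * cmath.pi * (u * y / height + v * x / width)
                    total += image[y][x] * cmath.exp(1j * angle)
            spectrum[u][v] = total
    return spectrum


def frequency_l1_loss(pred, target):
    """Mean absolute difference between the spectra of pred and target."""
    spec_pred, spec_target = dft2(pred), dft2(target)
    height, width = len(pred), len(pred[0])
    total = sum(
        abs(spec_pred[u][v] - spec_target[u][v])
        for u in range(height)
        for v in range(width)
    )
    return total / (height * width)


# Hypothetical 2x2 "images": the prediction is uniformly brighter by 1.
target = [[1.0, 2.0], [3.0, 4.0]]
pred = [[2.0, 3.0], [4.0, 5.0]]
loss = frequency_l1_loss(pred, target)
```

In this toy case the whole error sits in the zero-frequency coefficient; errors in fine texture would instead show up in the high-frequency coefficients, which is the part of the spectrum such a loss is meant to supervise.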
Visual and spectral evaluations confirm FGA's effectiveness in reducing aliasing and preserving fine details, establishing it as a practical, scalable alternative to traditional upsampling methods.
[134] FIND-Net -- Fourier-Integrated Network with Dictionary Kernels for Metal Artifact Reduction
Farid Tasharofi,Fuxin Fan,Melika Qahqaie,Mareike Thies,Andreas Maier
Main category: cs.CV
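TL;DR: This paper proposes FIND-Net, which combines frequency-domain and spatial-domain processing to reduce metal artifacts in CT images while preserving structural details, improving metal artifact reduction performance and clinical applicability.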
Details
Motivation: Artifacts caused by metal implants severely degrade CT image quality, and existing deep learning methods still struggle to suppress artifacts while preserving structural details. Method: FIND-Net uses Fast Fourier Convolution (FFC) layers and trainable Gaussian filtering, treating MAR as a hybrid task operating in both the spatial and frequency domains. Result: On synthetic datasets, FIND-Net achieves a 3.07% MAE reduction, a 0.18% SSIM increase, and a 0.90% PSNR improvement; on real clinical CT scans, it effectively suppresses metal artifacts with little effect on clean anatomical regions. Conclusion: By combining frequency-domain and spatial-domain processing, FIND-Net effectively reduces metal artifacts while preserving structural details, improving MAR performance and offering high clinical value. Abstract: Metal artifacts, caused by high-density metallic implants in computed tomography (CT) imaging, severely degrade image quality, complicating diagnosis and treatment planning. While existing deep learning algorithms have achieved notable success in Metal Artifact Reduction (MAR), they often struggle to suppress artifacts while preserving structural details. To address this challenge, we propose FIND-Net (Fourier-Integrated Network with Dictionary Kernels), a novel MAR framework that integrates frequency and spatial domain processing to achieve superior artifact suppression and structural preservation. FIND-Net incorporates Fast Fourier Convolution (FFC) layers and trainable Gaussian filtering, treating MAR as a hybrid task operating in both spatial and frequency domains. This approach enhances global contextual understanding and frequency selectivity, effectively reducing artifacts while maintaining anatomical structures. Experiments on synthetic datasets show that FIND-Net achieves statistically significant improvements over state-of-the-art MAR methods, with a 3.07% MAE reduction, 0.18% SSIM increase, and 0.90% PSNR improvement, confirming robustness across varying artifact complexities. Furthermore, evaluations on real-world clinical CT scans confirm FIND-Net's ability to minimize modifications to clean anatomical regions while effectively suppressing metal-induced distortions. These findings highlight FIND-Net's potential for advancing MAR performance, offering superior structural preservation and improved clinical applicability.
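To make the idea of frequency selectivity concrete, the sketch below applies a Gaussian weighting to the Fourier coefficients of a 1D signal, with a width parameter `sigma` that a network could in principle learn. This is only one common way to realise a Gaussian filter in the frequency domain; how FIND-Net actually parameterises its trainable Gaussian filtering is not stated in the abstract, and the function names and the 1D setting are assumptions made for brevity.

```python
import cmath
import math


def dft(values, inverse=False):
    """Naive 1D discrete Fourier transform (or its inverse)."""
    n = len(values)
    sign = 1 if inverse else -1
    result = []
    for k in range(n):
        total = sum(
            values[t] * cmath.exp(sign * 2j * cmath.pi * k * t / n)
            for t in range(n)
        )
        result.append(total / n if inverse else total)
    return result


def gaussian_frequency_filter(signal, sigma):
    """Attenuate each frequency by exp(-f^2 / (2 * sigma^2))."""
    n = len(signal)
    filtered = []
    for k, coefficient in enumerate(dft(signal)):
        # Distance from the zero frequency, accounting for wrap-around,
        # so that mirrored coefficients get the same weight.
        frequency = min(k, n - k)
        weight = math.exp(-(frequency ** 2) / (2 * sigma ** 2))
        filtered.append(coefficient * weight)
    # The symmetric weights keep the output real up to rounding error.
    return [value.real for value in dft(filtered, inverse=True)]


# Hypothetical signals: a constant offset and a fast alternation.
flat = gaussian_frequency_filter([3.0, 3.0, 3.0, 3.0], sigma=1.0)
ripple = gaussian_frequency_filter([1.0, -1.0, 1.0, -1.0], sigma=1.0)
```

With these inputs the constant signal passes through unchanged while the alternating one is strongly attenuated; a larger `sigma` would let more high-frequency content through.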
Code is available at https://github.com/Farid-Tasharofi/FIND-Net
[135] Increasing the Utility of Synthetic Images through Chamfer Guidance
Nicola Dall'Asen,Xiaofeng Zhang,Reyhane Askari Hemmat,Melissa Hall,Jakob Verbeek,Adriana Romero-Soriano,Michal Drozdzal
Main category: cs.CV
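TL;DR: Chamfer Guidance is a new training-free guidance method that uses a small number of real images to improve the quality and diversity of synthetic data, significantly improving image classification performance.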
Details
Motivation: Image generative models have improved in generation quality, but generation diversity is limited, and a distribution shift may exist between synthetic and real data, limiting their usefulness as training data. Method: Chamfer Guidance uses real exemplar images to guide the quality and diversity of synthetic data, and is validated on ImageNet-1k and standard geo-diversity benchmarks. Result: Chamfer Guidance reaches 96.4% precision and 86.4% distributional coverage with only 2 real exemplar images, rising to 97.5% and 92.7%, respectively, with 32 images. On downstream tasks, it improves accuracy by up to 15% in-distribution and up to 16% out-of-distribution. Conclusion: Chamfer Guidance is a training-free guidance method that uses a small number of real exemplar images to improve the quality and diversity of synthetic data, thereby improving downstream image classification performance. Abstract: Conditional image generative models hold considerable promise to produce infinite amounts of synthetic training data. Yet, recent progress in generation quality has come at the expense of generation diversity, limiting the utility of these models as a source of synthetic training data. Although guidance-based approaches have been introduced to improve the utility of generated data by focusing on quality or diversity, the (implicit or explicit) utility functions oftentimes disregard the potential distribution shift between synthetic and real data. In this work, we introduce Chamfer Guidance: a training-free guidance approach which leverages a handful of real exemplar images to characterize the quality and diversity of synthetic data. We show that by leveraging the proposed Chamfer Guidance, we can boost the diversity of the generations w.r.t. a dataset of real images while maintaining or improving the generation quality on ImageNet-1k and standard geo-diversity benchmarks. Our approach achieves state-of-the-art few-shot performance with as little as 2 exemplar real images, obtaining 96.4% in terms of precision, and 86.4% in terms of distributional coverage, which increase to 97.5% and 92.7%, respectively, when using 32 real images. We showcase the benefits of the Chamfer Guidance generation by training downstream image classifiers on synthetic data, achieving accuracy boost of up to 15% for in-distribution over the baselines, and up to 16% in out-of-distribution. Furthermore, our approach does not require using the unconditional model, and thus obtains a 31% reduction in FLOPs w.r.t. classifier-free-guidance-based approaches at sampling time.
[136] ChatENV: An Interactive Vision-Language Model for Sensor-Guided Environmental Monitoring and Scenario Simulation
Hosam Elgendy,Ahmed Sharshar,Ahmed Aboeitta,Mohsen Guizani
Main category: cs.CV
TL;DR: ChatENV is an interactive vision language model that uses satellite image pairs and real-world sensor data to provide effective environmental monitoring with interactive scenario-based reasoning.
Details
Motivation: Current vision language models overlook causal signals from environmental sensors, rely on single-source captions prone to stylistic bias, and lack interactive scenario-based reasoning. Method: The framework creates a 177k-image dataset with 152k temporal pairs across 62 land-use classes, uses GPT-4o and Gemini 2.0 for data annotation, and fine-tunes Qwen-2.5-VL using LoRA adapters. Result: ChatENV achieves strong performance in temporal and 'what-if' reasoning (e.g., BERT-F1 0.903) and rivals or outperforms state-of-the-art temporal models. Conclusion: ChatENV is a powerful tool for grounded, sensor-aware environmental monitoring, supporting interactive scenario-based analysis and outperforming state-of-the-art temporal models. Abstract: Understanding environmental changes from aerial imagery is vital for climate resilience, urban planning, and ecosystem monitoring. Yet, current vision language models (VLMs) overlook causal signals from environmental sensors, rely on single-source captions prone to stylistic bias, and lack interactive scenario-based reasoning. We present ChatENV, the first interactive VLM that jointly reasons over satellite image pairs and real-world sensor data. Our framework: (i) creates a 177k-image dataset forming 152k temporal pairs across 62 land-use classes in 197 countries with rich sensor metadata (e.g., temperature, PM10, CO); (ii) annotates data using GPT-4o and Gemini 2.0 for stylistic and semantic diversity; and (iii) fine-tunes Qwen-2.5-VL using efficient Low-Rank Adaptation (LoRA) adapters for chat purposes. ChatENV achieves strong performance in temporal and "what-if" reasoning (e.g., BERT-F1 0.903) and rivals or outperforms state-of-the-art temporal models, while supporting interactive scenario-based analysis. This positions ChatENV as a powerful tool for grounded, sensor-aware environmental monitoring.
[137] Processing and acquisition traces in visual encoders: What does CLIP know about your camera?
Ryan Ramos,Vladan Stojnić,Giorgos Kordopatis-Zilos,Yuta Nakashima,Giorgos Tolias,Noa Garcia
Main category: cs.CV
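TL;DR: This paper studies how sensitive visual encoders are to subtle parameter changes in the image acquisition process, finding that these parameters are systematically encoded and significantly affect semantic predictions.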
Details
Motivation: Prior work focused mainly on how severe image corruptions unseen during training affect visual encoders; this paper takes a different perspective and studies how subtler parameter changes affect semantic predictions. Method: The paper analyzes parameters of the image acquisition process, together with transformations that may be subtle or even imperceptible to the human eye, and examines their effect on visual representations. Result: Subtle acquisition parameters are systematically encoded by visual encoders and have a significant impact on semantic predictions, which may be positive or negative. Conclusion: Visual encoders systematically encode subtle parameters of the image acquisition process, and these can significantly affect semantic predictions, depending on the correlation between semantic labels and those parameters. Abstract: Prior work has analyzed the robustness of visual encoders to image transformations and corruptions, particularly in cases where such alterations are not seen during training. When this occurs, they introduce a form of distribution shift at test time, often leading to performance degradation. The primary focus has been on severe corruptions that, when applied aggressively, distort useful signals necessary for accurate semantic predictions. We take a different perspective by analyzing parameters of the image acquisition process and transformations that may be subtle or even imperceptible to the human eye. We find that such parameters are systematically encoded in the learned visual representations and can be easily recovered. More strikingly, their presence can have a profound impact, either positively or negatively, on semantic predictions. This effect depends on whether there is a strong correlation or anti-correlation between semantic labels and these acquisition-based or processing-based labels. Our code and data are available at: https://github.com/ryan-caesar-ramos/visual-encoder-traces
[138] Lameness detection in dairy cows using pose estimation and bidirectional LSTMs
Helena Russello,Rik van der Tol,Eldert J. van Henten,Gert Kootstra
Main category: cs.CV
TL;DR: A new lameness detection approach in cows uses pose estimation and BLSTM networks to achieve higher accuracy with less manual input and smaller data requirements.
Details
Motivation: The motivation is to develop a markerless, automated lameness detection system that reduces the need for manual feature engineering and works efficiently with short video sequences and small datasets. Method: The method involves using T-LEAP pose estimation to extract motion sequences of nine keypoints from videos of walking cows, which are then fed into a BLSTM classifier for binary lameness classification. Result: The BLSTM classifier achieved a classification accuracy of 85%, outperforming the 80% accuracy of the established feature-based method. It was also able to detect lameness using just one second of video data. Conclusion: The study concludes that combining pose estimation and BLSTM neural networks provides an effective method for lameness detection in cows, outperforming traditional feature-based approaches. Abstract: This study presents a lameness detection approach that combines pose estimation and Bidirectional Long-Short-Term Memory (BLSTM) neural networks. Combining pose estimation with a BLSTM classifier offers the following advantages: markerless pose estimation, elimination of manual feature engineering by learning temporal motion features from the keypoint trajectories, and working with short sequences and small training datasets. Motion sequences of nine keypoints (located on the cows' hooves, head and back) were extracted from videos of walking cows with the T-LEAP pose estimation model. The trajectories of the keypoints were then used as an input to a BLSTM classifier that was trained to perform binary lameness classification. Our method significantly outperformed an established method that relied on manually designed locomotion features: our best architecture achieved a classification accuracy of 85%, against 80% accuracy for the feature-based approach. Furthermore, we showed that our BLSTM classifier could detect lameness with as little as one second of video data.
[139] SemPT: Semantic Prompt Tuning for Vision-Language Models
Xiao Shi,Yangjun Ou,Zhenzhong Chen
Main category: cs.CV
TL;DR: This paper introduces Semantic Prompt Tuning (SemPT), a framework for visual transfer learning that leverages shared attribute-level knowledge to improve generalization to unseen categories.
Details
Motivation: Existing prompt tuning methods for Vision-Language Models suffer from fragmented knowledge representation and hindered transferability due to reliance on sparse labels or disparate descriptions. Method: The paper proposes a two-step prompting strategy called Semantic Prompt Tuning (SemPT) to extract shared visual attributes and generate attribute-level descriptions. It applies visually guided weighting to reduce noise and jointly aligns image embeddings with label and attribute-enhanced text embeddings. Result: SemPT achieves state-of-the-art performance across multiple settings, including base-to-novel generalization, cross-dataset transfer, cross-domain transfer, and few-shot learning on 15 benchmark datasets. Conclusion: SemPT is an effective framework for visual transfer learning, achieving state-of-the-art performance on various benchmark datasets by leveraging attribute-level knowledge. Abstract: Visual transfer learning for unseen categories presents an active research topic yet a challenging task, due to the inherent conflict between preserving category-specific representations and acquiring transferable knowledge. Vision-Language Models (VLMs) pre-trained on large amounts of image-text pairs offer a promising solution. However, existing prompt tuning methods rely on sparse category labels or disparate LLM-generated descriptions, which fragment knowledge representation and hinder transferability. To address this limitation, we introduce Semantic Prompt Tuning (SemPT), a novel framework that tackles the generalization challenge by leveraging shared attribute-level knowledge across categories. Specifically, SemPT adopts a two-step prompting strategy to guide LLM in extracting shared visual attributes and generating attribute-level descriptions, capturing transferable semantic cues beyond labels while ensuring coherent structure. 
Then, visually guided weighting is applied to the embeddings of attribute-level descriptions to reduce noise from irrelevant attributes and enhance the text embeddings. Additionally, image embeddings are jointly aligned with both label and attribute-enhanced text embeddings, balancing discrimination for seen categories and transferability to unseen ones. Considering the availability of category exposure, our inference dynamically selects between standard label embeddings for seen categories and attribute-enhanced embeddings for unseen ones to ensure effective adaptation. Extensive experiments on 15 benchmark datasets demonstrate that SemPT achieves state-of-the-art performance across various settings, including base-to-novel generalization, cross-dataset transfer, cross-domain transfer, and few-shot learning.
[140] Serial Over Parallel: Learning Continual Unification for Multi-Modal Visual Object Tracking and Benchmarking
Zhangyong Tang,Tianyang Xu,Xuefeng Zhu,Chunyang Cheng,Tao Zhou,Xiaojun Wu,Josef Kittler
Main category: cs.CV
TL;DR: This paper proposes UniBench300, a unified benchmark for multi-modal visual object tracking, and employs continual learning to address performance degradation, thereby offering valuable insights for future research in this area.
Details
Motivation: The motivation stems from the complementary nature of different modalities in building robust tracking systems and the inconsistency issue between training and testing due to the absence of a unified benchmark. Method: The authors introduce a unified benchmark, UniBench300, and reformulate the unification process in a serial format, progressively integrating new tasks while applying continual learning to prevent performance degradation. Result: UniBench300 reduces inference passes from three to one and cuts time consumption by 27%. Extensive experiments show the superiority of continual learning in supporting a stable unification process and insights into performance degradation related to network capacity and modality discrepancies. Conclusion: The paper concludes that the proposed UniBench300 benchmark and continual learning approach effectively address the inconsistency issue in multi-modal visual object tracking, while also offering insights into future multi-modal vision research. Abstract: Unifying multiple multi-modal visual object tracking (MMVOT) tasks draws increasing attention due to the complementary nature of different modalities in building robust tracking systems. Existing practices mix all data sensor types in a single training procedure, structuring a parallel paradigm from the data-centric perspective and aiming for a global optimum on the joint distribution of the involved tasks. However, the absence of a unified benchmark where all types of data coexist forces evaluations on separated benchmarks, causing inconsistency between training and testing, thus leading to performance degradation. To address these issues, this work advances in two aspects: (1) A unified benchmark, coined as UniBench300, is introduced to bridge the inconsistency by incorporating multiple task data, reducing inference passes from three to one and cutting time consumption by 27%.
(2) The unification process is reformulated in a serial format, progressively integrating new tasks. In this way, the performance degradation can be specified as knowledge forgetting of previous tasks, which naturally aligns with the philosophy of continual learning (CL), motivating further exploration of injecting CL into the unification process. Extensive experiments conducted on two baselines and four benchmarks demonstrate the significance of UniBench300 and the superiority of CL in supporting a stable unification process. Moreover, while conducting dedicated analyses, the performance degradation is found to be negatively correlated with network capacity. Additionally, modality discrepancies contribute to varying degradation levels across tasks (RGBT > RGBD > RGBE in MMVOT), offering valuable insights for future multi-modal vision research. Source code and the proposed benchmark are available at https://github.com/Zhangyong-Tang/UniBench300.
[141] AddressVLM: Cross-view Alignment Tuning for Image Address Localization using Large Vision-Language Models
Shixiong Xu,Chenghao Zhang,Lubin Fan,Yuan Zhou,Bin Fan,Shiming Xiang,Gaofeng Meng,Jieping Ye
Main category: cs.CV
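TL;DR: This paper proposes AddressVLM, which uses a cross-view alignment tuning strategy combining satellite and street-view images to significantly improve the street-level geo-localization performance of vision-language models.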
Details
Motivation: Existing LVLMs perform well at country- or city-level geo-localization but struggle with fine-grained street-level localization within cities, so model performance when only microscopic visual cues are available needs to improve. Method: Perspective-invariant satellite images are introduced as macro cues, and a cross-view alignment tuning strategy is proposed, comprising a satellite-view and street-view image grafting mechanism and an automatic label generation mechanism; a two-stage training protocol strengthens the model's global understanding of street distribution. Result: On two street-view VQA datasets from Pittsburgh and San Francisco, AddressVLM outperforms counterpart LVLMs by over 9% and 12% in average address localization accuracy, respectively. Conclusion: By combining satellite and street-view images through cross-view alignment tuning, AddressVLM effectively improves LVLM performance on street-level geo-localization and achieves more accurate address localization. Abstract: Large visual language models (LVLMs) have demonstrated impressive performance in coarse-grained geo-localization at the country or city level, but they struggle with fine-grained street-level localization within urban areas. In this paper, we explore integrating city-wide address localization capabilities into LVLMs, facilitating flexible address-related question answering using street-view images. A key challenge is that the street-view visual question-and-answer (VQA) data provides only microscopic visual cues, leading to subpar performance in fine-tuned models. To tackle this issue, we incorporate perspective-invariant satellite images as macro cues and propose cross-view alignment tuning including a satellite-view and street-view image grafting mechanism, along with an automatic label generation mechanism. Then LVLM's global understanding of street distribution is enhanced through cross-view matching. Our proposed model, named AddressVLM, consists of two-stage training protocols: cross-view alignment tuning and address localization tuning. Furthermore, we have constructed two street-view VQA datasets based on image address localization datasets from Pittsburgh and San Francisco. Qualitative and quantitative evaluations demonstrate that AddressVLM outperforms counterpart LVLMs by over 9% and 12% in average address localization accuracy on these two datasets, respectively.
[142] Hybrid Generative Fusion for Efficient and Privacy-Preserving Face Recognition Dataset Generation
Feiran Li,Qianqian Xu,Shilong Bao,Boyu Han,Zhiyong Yang,Qingming Huang
Main category: cs.CV
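TL;DR: This paper proposes an efficient method for building a diverse face dataset; it took first place in the DataCV ICCV Challenge and significantly improved face recognition model performance.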
Details
Motivation: To build a high-quality face dataset that contains no identities from existing public datasets, in order to improve the generalization of face recognition models and competition performance. Method: The paper cleans the HSFace dataset with a Mixture-of-Experts (MoE) strategy that combines face embedding clustering with GPT-4o-assisted verification, retains the largest consistent identity cluster, and generates synthetic identities and their variants with Stable Diffusion and Vec2Face. A curriculum learning strategy is also used to improve model training. Result: The method improves model performance at the 10K, 20K, and 100K identity scales and took first place in the DataCV ICCV Challenge. The final dataset contains 50 images per identity and is checked to ensure no identity leakage. Conclusion: The proposed method took first place in the DataCV ICCV Challenge; by building a high-quality face dataset that contains no identities from any existing public face dataset, it effectively improves face recognition model performance. Abstract: In this paper, we present our approach to the DataCV ICCV Challenge, which centers on building a high-quality face dataset to train a face recognition model. The constructed dataset must not contain identities overlapping with any existing public face datasets. To handle this challenge, we begin with a thorough cleaning of the baseline HSFace dataset, identifying and removing mislabeled or inconsistent identities through a Mixture-of-Experts (MoE) strategy combining face embedding clustering and GPT-4o-assisted verification. We retain the largest consistent identity cluster and apply data augmentation up to a fixed number of images per identity. To further diversify the dataset, we generate synthetic identities using Stable Diffusion with prompt engineering. As diffusion models are computationally intensive, we generate only one reference image per identity and efficiently expand it using Vec2Face, which rapidly produces 49 identity-consistent variants. This hybrid approach fuses GAN-based and diffusion-based samples, enabling efficient construction of a diverse and high-quality dataset. To address the high visual similarity among synthetic identities, we adopt a curriculum learning strategy by placing them early in the training schedule, allowing the model to progress from easier to harder samples. Our final dataset contains 50 images per identity, and all newly generated identities are checked with mainstream face datasets to ensure no identity leakage.
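The abstract does not say how newly generated identities are compared against existing datasets. One way such a check could work is sketched below, assuming each identity is summarized by a single embedding vector and that a cosine-similarity threshold flags a match. The function names, the 0.6 threshold, and the toy vectors are all hypothetical and are not taken from the authors' pipeline.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as matching nothing
    return dot / (norm_a * norm_b)

def find_leaked_identities(new_ids, reference_ids, threshold=0.6):
    """Return names of new identities that are too similar to any reference identity.

    new_ids / reference_ids map an identity name to one embedding vector.
    The threshold is an illustrative assumption, not a published value.
    """
    leaked = []
    for name, emb in new_ids.items():
        if any(cosine_similarity(emb, ref) >= threshold
               for ref in reference_ids.values()):
            leaked.append(name)
    return leaked

# Toy embeddings: synthetic_0 nearly coincides with a public identity.
new_ids = {"synthetic_0": [1.0, 0.0, 0.0], "synthetic_1": [0.0, 1.0, 0.0]}
reference_ids = {"public_a": [0.9, 0.1, 0.0]}
leaked = find_leaked_identities(new_ids, reference_ids)
print(leaked)
```

In this toy run only `synthetic_0` is flagged; in practice the embeddings would come from a face recognition model and the threshold would need calibration.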
Our method achieves 1st place in the competition, and experimental results show that our dataset improves model performance across 10K, 20K, and 100K identity scales. Code is available at https://github.com/Ferry-Li/datacv_fr.
[143] HyperTea: A Hypergraph-based Temporal Enhancement and Alignment Network for Moving Infrared Small Target Detection
Zhaoyuan Qi,Weihua Gao,Wenlong Niu,Jie Tang,Yun Li,Xiaodong Peng
Main category: cs.CV
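TL;DR: HyperTea is a new method for moving infrared small target detection (MIRSTD) that combines CNNs, RNNs, and HGNNs and significantly improves detection performance through multi-scale feature representation and high-order spatiotemporal correlation modeling.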
Details
Motivation: MIRSTD remains highly challenging because targets are small, weak in intensity, and move in complex patterns. Existing methods typically model only low-order correlations between feature nodes and perform feature extraction and enhancement at a single temporal scale; hypergraphs are widely used for high-order correlation learning but have received limited attention in MIRSTD. Method: HyperTea comprises a global temporal enhancement module (GTEM), a local temporal enhancement module (LTEM), and a temporal alignment module (TAM) to achieve multi-scale feature representation and high-order spatiotemporal correlation modeling. Result: Experiments show that HyperTea achieves state-of-the-art (SOTA) performance on the DAUB and IRDST datasets. Conclusion: HyperTea is the first work to combine CNNs, RNNs, and HGNNs for MIRSTD, significantly improving detection performance. Abstract: In practical application scenarios, moving infrared small target detection (MIRSTD) remains highly challenging due to the target's small size, weak intensity, and complex motion pattern. Existing methods typically only model low-order correlations between feature nodes and perform feature extraction and enhancement within a single temporal scale. Although hypergraphs have been widely used for high-order correlation learning, they have received limited attention in MIRSTD. To explore the potential of hypergraphs and enhance multi-timescale feature representation, we propose HyperTea, which integrates global and local temporal perspectives to effectively model high-order spatiotemporal correlations of features. HyperTea consists of three modules: the global temporal enhancement module (GTEM) realizes global temporal context enhancement through semantic aggregation and propagation; the local temporal enhancement module (LTEM) is designed to capture local motion patterns between adjacent frames and then enhance local temporal context; additionally, we further develop a temporal alignment module (TAM) to address potential cross-scale feature misalignment. To the best of our knowledge, HyperTea is the first work to integrate convolutional neural networks (CNNs), recurrent neural networks (RNNs), and hypergraph neural networks (HGNNs) for MIRSTD, significantly improving detection performance. Experiments on DAUB and IRDST demonstrate its state-of-the-art (SOTA) performance.
Our source code is available at https://github.com/Lurenjia-LRJ/HyperTea.
[144] Physics-Informed Joint Multi-TE Super-Resolution with Implicit Neural Representation for Robust Fetal T2 Mapping
Busra Bulut,Maik Dannecker,Thomas Sanchez,Sara Neves Silva,Vladyslav Zalevskyi,Steven Jia,Jean-Baptiste Ledoux,Guillaume Auzias,François Rousseau,Jana Hutter,Daniel Rueckert,Meritxell Bach Cuadra
Main category: cs.CV
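TL;DR: The study proposes a new method to improve T2 mapping in fetal brain MRI; by combining implicit neural representations with physics-informed regularization, it addresses motion and shows potential to reduce scan time while maintaining image quality.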
Details
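Motivation: T2 mapping in fetal brain MRI can improve characterization of the developing brain, especially at mid-field (0.55T), where T2 decay is slower. However, it is challenging because fetal MRI acquisition relies on multiple motion-corrupted stacks of thick slices and requires slice-to-volume reconstruction (SVR) to estimate a high-resolution 3D volume. Current T2 mapping approaches repeat the acquisition of these stacks at each TE, leading to long scan times and high sensitivity to motion. Method: The method combines implicit neural representations with a physics-informed regularization that models T2 decay, sharing information across TEs while preserving anatomical and quantitative T2 fidelity. Result: The method shows superior performance on simulated fetal brain and in vivo adult datasets and provides the first in vivo fetal T2 mapping results at 0.55T. It also shows potential to reduce the number of stacks required per TE in T2 mapping. Conclusion: The study proposes a new method that combines implicit neural representations with physics-informed regularization to jointly reconstruct data across echo times (TEs), addressing severe motion in fetal brain T2 mapping. It demonstrates superior performance on simulated fetal brain and in vivo adult datasets, presents the first in vivo fetal T2 mapping results at 0.55T, and indicates potential for reducing the number of stacks required per TE. Abstract: T2 mapping in fetal brain MRI has the potential to improve characterization of the developing brain, especially at mid-field (0.55T), where T2 decay is slower. However, this is challenging as fetal MRI acquisition relies on multiple motion-corrupted stacks of thick slices, requiring slice-to-volume reconstruction (SVR) to estimate a high-resolution (HR) 3D volume. Currently, T2 mapping involves repeated acquisitions of these stacks at each echo time (TE), leading to long scan times and high sensitivity to motion. We tackle this challenge with a method that jointly reconstructs data across TEs, addressing severe motion. Our approach combines implicit neural representations with a physics-informed regularization that models T2 decay, enabling information sharing across TEs while preserving anatomical and quantitative T2 fidelity. We demonstrate state-of-the-art performance on simulated fetal brain and in vivo adult datasets with fetal-like motion. We also present the first in vivo fetal T2 mapping results at 0.55T. Our study shows potential for reducing the number of stacks per TE in T2 mapping by leveraging anatomical redundancy.
[145] IADGPT: Unified LVLM for Few-Shot Industrial Anomaly Detection, Localization, and Reasoning via In-Context Learning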
Mengyang Zhao,Teng Fu,Haiyang Yu,Ke Niu,Bin Li
Main category: cs.CV
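TL;DR: This paper proposes IADGPT, a few-shot industrial anomaly detection method that combines in-context learning with a three-stage training strategy to improve detection performance and anomaly reasoning.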
Details
Motivation: Existing large vision-language models lack industrial knowledge and reasoning capabilities and therefore fall short of the needs of few-shot industrial anomaly detection. Method: An in-context-learning-based training paradigm is proposed together with a three-stage progressive training strategy; image-level and pixel-level anomaly scores are produced from the logits output and the attention map, respectively. Result: Experiments show that IADGPT achieves considerable performance gains in anomaly detection; the authors also build a new dataset of 100K images across 400 industrial product categories. Conclusion: IADGPT achieves excellent performance in industrial product anomaly detection and is competitive in anomaly localization and reasoning. Abstract: Few-Shot Industrial Anomaly Detection (FS-IAD) has important applications in automating industrial quality inspection. Recently, some FS-IAD methods based on Large Vision-Language Models (LVLMs) have been proposed with some achievements through prompt learning or fine-tuning. However, existing LVLMs focus on general tasks but lack basic industrial knowledge and reasoning capabilities related to FS-IAD, making these methods far from specialized human quality inspectors. To address these challenges, we propose a unified framework, IADGPT, designed to perform FS-IAD in a human-like manner, while also handling associated localization and reasoning tasks, even for diverse and novel industrial products. To this end, we introduce a three-stage progressive training strategy inspired by humans. Specifically, the first two stages gradually guide IADGPT in acquiring fundamental industrial knowledge and discrepancy awareness. In the third stage, we design an in-context learning-based training paradigm, enabling IADGPT to leverage few-shot images as exemplars for improved generalization to novel products. In addition, we design a strategy that enables IADGPT to output image-level and pixel-level anomaly scores using the logits output and the attention map, respectively, in conjunction with the language output to accomplish anomaly reasoning. To support our training, we present a new dataset comprising 100K images across 400 diverse industrial product categories with extensive attribute-level textual annotations. Experiments indicate IADGPT achieves considerable performance gains in anomaly detection and demonstrates competitiveness in anomaly localization and reasoning.
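The abstract states only that image-level scores come from the logits output and pixel-level scores from the attention map. A minimal sketch of one plausible reading follows, assuming the image-level score is a two-way softmax over the logits of an "anomalous" and a "normal" answer token, and the pixel-level map is a min-max normalized attention map. Both choices, and all function names, are assumptions made for illustration rather than the authors' implementation.

```python
import math

def image_level_score(logit_anomalous, logit_normal):
    """Two-way softmax: probability mass on the (assumed) 'anomalous' answer token."""
    m = max(logit_anomalous, logit_normal)  # subtract the max for numerical stability
    ea = math.exp(logit_anomalous - m)
    en = math.exp(logit_normal - m)
    return ea / (ea + en)

def pixel_level_scores(attention_map):
    """Min-max normalize a 2-D attention map (list of rows) to the range [0, 1]."""
    flat = [v for row in attention_map for v in row]
    lo, hi = min(flat), max(flat)
    if hi == lo:
        # A constant map carries no localization signal.
        return [[0.0 for _ in row] for row in attention_map]
    return [[(v - lo) / (hi - lo) for v in row] for row in attention_map]

# Toy values: the model favours the 'anomalous' token and attends to one corner.
score = image_level_score(2.0, 0.0)
heatmap = pixel_level_scores([[0.1, 0.1], [0.1, 0.5]])
print(round(score, 4), heatmap)
```

The image-level score can be thresholded for detection, while the normalized map can be upsampled to the input resolution for localization.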
We will release our dataset with the camera-ready version.
[146] Novel View Synthesis using DDIM Inversion
Sehajdeep Singh,A V Subramanyam
Main category: cs.CV
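TL;DR: This paper proposes a lightweight view translation framework built on a pretrained diffusion model for synthesizing novel views from a single input image. Using DDIM-inverted latents and a new fusion strategy, the method preserves texture and fine-grained details well and outperforms existing methods.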
Details
Motivation: Existing methods must fine-tune large diffusion models or train from scratch, which is computationally expensive, and they suffer from blurry reconstruction and poor generalization, so an efficient lightweight approach is needed. Method: A camera pose-conditioned translation U-Net (TUNet) predicts the latent for the target view, and a fusion strategy that exploits the noise correlation structure in DDIM inversion preserves details and texture. Result: Experiments on MVImgNet show that the method outperforms existing methods in the quality of synthesized novel views, especially in preserving fine-grained details and reducing blur. Conclusion: The proposed method effectively combines the generative capability of a pretrained diffusion model with a lightweight view translation framework, offering a more efficient and higher-quality solution for single-image novel view synthesis. Abstract: Synthesizing novel views from a single input image is a challenging task. It requires extrapolating the 3D structure of a scene while inferring details in occluded regions, and maintaining geometric consistency across viewpoints. Many existing methods must fine-tune large diffusion backbones using multiple views or train a diffusion model from scratch, which is extremely expensive. Additionally, they suffer from blurry reconstruction and poor generalization. This gap presents the opportunity to explore an explicit lightweight view translation framework that can directly utilize the high-fidelity generative capabilities of a pretrained diffusion model while reconstructing a scene from a novel view. Given the DDIM-inverted latent of a single input image, we employ a camera pose-conditioned translation U-Net, TUNet, to predict the inverted latent corresponding to the desired target view. However, the image sampled using the predicted latent may result in a blurry reconstruction. To this end, we propose a novel fusion strategy that exploits the inherent noise correlation structure observed in DDIM inversion. The proposed fusion strategy helps preserve the texture and fine-grained details. To synthesize the novel view, we use the fused latent as the initial condition for DDIM sampling, leveraging the generative prior of the pretrained diffusion model. Extensive experiments on MVImgNet demonstrate that our method outperforms existing methods.
[147] Beyond conventional vision: RGB-event fusion for robust object detection in dynamic traffic scenarios
Zhanwen Liu,Yujing Sun,Yang Wang,Nan Yang,Shengbo Eben Li,Xiangmo Zhao
Main category: cs.CV
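TL;DR: This study proposes a motion cue fusion network (MCFNet) that combines an event camera with an RGB camera; through event correction, event dynamic upsampling, and a cross-modal Mamba fusion module, it achieves strong object detection performance in traffic scenes under challenging lighting.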
Details
Motivation: The limited dynamic range of conventional RGB cameras reduces global contrast and loses high-frequency details in complex traffic environments (e.g., nighttime driving, tunnels), which hurts feature extraction and frame-based object detection. Method: A motion cue fusion network (MCFNet) is proposed, comprising an event correction module (ECM), an event dynamic upsampling module (EDUM), and a cross-modal Mamba fusion module (CMM), to achieve spatiotemporal alignment and adaptive cross-modal feature fusion. Result: Experiments on the DSEC-Det and PKU-DAVIS-SOD datasets show that MCFNet significantly outperforms existing methods across low-light and fast-moving traffic scenarios. On DSEC-Det, mAP50 and mAP improve by 7.4% and 1.7%, respectively. Conclusion: By fusing information from event and RGB cameras, MCFNet achieves superior detection performance in traffic scenes under challenging lighting and significantly outperforms existing methods. Abstract: The dynamic range limitation of conventional RGB cameras reduces global contrast and causes loss of high-frequency details such as textures and edges in complex traffic environments (e.g., nighttime driving, tunnels), hindering discriminative feature extraction and degrading frame-based object detection. To address this, we integrate a bio-inspired event camera with an RGB camera to provide high dynamic range information and propose a motion cue fusion network (MCFNet), which achieves optimal spatiotemporal alignment and adaptive cross-modal feature fusion under challenging lighting. Specifically, an event correction module (ECM) temporally aligns asynchronous event streams with image frames via optical-flow-based warping, jointly optimized with the detection network to learn task-aware event representations. The event dynamic upsampling module (EDUM) enhances spatial resolution of event frames to match image structures, ensuring precise spatiotemporal alignment. The cross-modal mamba fusion module (CMM) uses adaptive feature fusion with a novel interlaced scanning mechanism, effectively integrating complementary information for robust detection. Experiments conducted on the DSEC-Det and PKU-DAVIS-SOD datasets demonstrate that MCFNet significantly outperforms existing methods in various poor lighting and fast moving traffic scenarios. Notably, on the DSEC-Det dataset, MCFNet achieves a remarkable improvement, surpassing the best existing methods by 7.4% in mAP50 and 1.7% in mAP metrics, respectively.
The code is available at https://github.com/Charm11492/MCFNet.
[148] CountCluster: Training-Free Object Quantity Guidance with Cross-Attention Map Clustering for Text-to-Image Generation
Joohyeon Lee,Jin-Seop Lee,Jee-Hyong Lee
Main category: cs.CV
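TL;DR: CountCluster improves how the diffusion model's attention is used, significantly increasing the accuracy of object counts in generated images without additional training or external tools.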
Details
Motivation: Existing diffusion models still struggle to generate images whose object count matches the number specified in the input prompt, in particular failing to control object count effectively in the early stage of the denoising process. Method: At inference time, the object cross-attention map is partitioned into k clusters, and the latent is optimized to align with a target distribution. Result: The method improves object count accuracy by 18.5%p on average and shows superior quantity control across a variety of prompts. Conclusion: Without relying on any external tools or additional training, CountCluster significantly improves the ability of diffusion models to reflect the object count specified in the input prompt. Abstract: Diffusion-based text-to-image generation models have demonstrated strong performance in terms of image quality and diversity. However, they still struggle to generate images that accurately reflect the number of objects specified in the input prompt. Several approaches have been proposed that rely on either external counting modules for iterative refinement or quantity representations derived from learned tokens or latent features. However, they still have limitations in accurately reflecting the specified number of objects and overlook an important structural characteristic: the number of object instances in the generated image is largely determined in the early timesteps of the denoising process. To correctly reflect the object quantity for image generation, the highly activated regions in the object cross-attention map at the early timesteps should match the input object quantity, while each region should be clearly separated. To address this issue, we propose CountCluster, a method that guides the object cross-attention map to be clustered according to the specified object count in the input, without relying on any external tools or additional training. The proposed method partitions the object cross-attention map into k clusters at inference time based on attention scores, defines an ideal distribution in which each cluster is spatially well-separated, and optimizes the latent to align with this target distribution. Our method achieves an average improvement of 18.5%p in object count accuracy compared to existing methods, and demonstrates superior quantity control performance across a variety of prompts.
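The partitioning step described above can be illustrated with a small sketch. It assumes that "highly activated" means attention above a fixed threshold and that the partition is a plain k-means over pixel coordinates with a deterministic farthest-point initialization. These choices and the function names are illustrative assumptions; the abstract does not specify the clustering procedure, and the latent optimization step is omitted here.

```python
def top_pixels(attn, threshold):
    """Coordinates (row, col) of pixels whose attention is at or above the threshold."""
    return [(r, c) for r, row in enumerate(attn)
            for c, v in enumerate(row) if v >= threshold]

def sq_dist(p, q):
    """Squared Euclidean distance between two 2-D points."""
    return (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2

def kmeans(points, k, iters=10):
    """Plain k-means on 2-D points with farthest-point initialization (deterministic)."""
    if not points:
        return [], []
    centers = [points[0]]
    while len(centers) < k:
        # Next center: the point farthest from all centers chosen so far.
        centers.append(max(points, key=lambda p: min(sq_dist(p, c) for c in centers)))
    labels = [0] * len(points)
    for _ in range(iters):
        labels = [min(range(k), key=lambda j: sq_dist(p, centers[j])) for p in points]
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old center if a cluster becomes empty
                centers[j] = (sum(p[0] for p in members) / len(members),
                              sum(p[1] for p in members) / len(members))
    return labels, centers

# Toy cross-attention map with two activated blobs; the prompt asks for k = 2 objects.
attn = [
    [0.9, 0.8, 0.0, 0.0, 0.0, 0.0],
    [0.7, 0.9, 0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0, 0.8, 0.9],
    [0.0, 0.0, 0.0, 0.0, 0.9, 0.7],
]
points = top_pixels(attn, threshold=0.5)
labels, centers = kmeans(points, k=2)
print(labels, centers)
```

Here the two blobs fall into separate clusters; a guidance loss could then penalize attention that lies far from its cluster center, which is the kind of spatial separation the method aims for.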
Code will be released at: https://github.com/JoohyeonL22/CountCluster.
[149] NextStep-1: Toward Autoregressive Image Generation with Continuous Tokens at Scale
NextStep Team,Chunrui Han,Guopeng Li,Jingwei Wu,Quan Sun,Yan Cai,Yuang Peng,Zheng Ge,Deyu Zhou,Haomiao Tang,Hongyu Zhou,Kenkun Liu,Ailin Huang,Bin Wang,Changxin Miao,Deshan Sun,En Yu,Fukun Yin,Gang Yu,Hao Nie,Haoran Lv,Hanpeng Hu,Jia Wang,Jian Zhou,Jianjian Sun,Kaijun Tan,Kang An,Kangheng Lin,Liang Zhao,Mei Chen,Peng Xing,Rui Wang,Shiyu Liu,Shutao Xia,Tianhao You,Wei Ji,Xianfang Zeng,Xin Han,Xuelin Zhang,Yana Wei,Yanming Xu,Yimin Jiang,Yingming Wang,Yu Zhou,Yucheng Han,Ziyang Meng,Binxing Jiao,Daxin Jiang,Xiangyu Zhang,Yibo Zhu
Main category: cs.CV
TL;DR: NextStep-1 is a new autoregressive model that advances text-to-image generation by efficiently handling both discrete text tokens and continuous image tokens.
Details
Motivation: The authors aim to advance the autoregressive paradigm for text-to-image generation by addressing limitations in existing models, such as computational intensity or quantization loss. Method: NextStep-1 uses a 14B autoregressive model paired with a 157M flow matching head, training on discrete text tokens and continuous image tokens using next-token prediction objectives. Result: NextStep-1 achieves state-of-the-art performance for autoregressive models in text-to-image generation tasks and demonstrates strong capabilities in high-fidelity image synthesis. Conclusion: NextStep-1 shows strong performance in image editing, highlighting the power and versatility of the unified approach, and the authors plan to release code and models for open research. Abstract: Prevailing autoregressive (AR) models for text-to-image generation either rely on heavy, computationally-intensive diffusion models to process continuous image tokens, or employ vector quantization (VQ) to obtain discrete tokens with quantization loss. In this paper, we push the autoregressive paradigm forward with NextStep-1, a 14B autoregressive model paired with a 157M flow matching head, training on discrete text tokens and continuous image tokens with next-token prediction objectives. NextStep-1 achieves state-of-the-art performance for autoregressive models in text-to-image generation tasks, exhibiting strong capabilities in high-fidelity image synthesis. Furthermore, our method shows strong performance in image editing, highlighting the power and versatility of our unified approach. To facilitate open research, we will release our code and models to the community.
[150] Lightweight CNNs for Embedded SAR Ship Target Detection and Classification
Fabian Kresse,Georgios Pilikos,Mario Azcueta,Nicolas Floury
Main category: cs.CV
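TL;DR: This paper studies neural networks for real-time processing of synthetic aperture radar (SAR) data to reduce the volume of data that must be downlinked and ease bandwidth constraints, and it demonstrates that target classification (ships vs. windmills) is feasible in a binary classification task.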
Details
Motivation: Near-real-time monitoring is currently constrained by the need to downlink all raw data and perform image focusing and analysis on the ground. On-board processing could reduce the data volume to be downlinked, easing bandwidth constraints and minimizing latency, but traditional image processing algorithms are challenged by the satellite's limited memory, processing power, and computational resources. Method: Neural networks are proposed and evaluated for unfocused SAR data acquired by Sentinel-1 in Stripmap and Interferometric Wide (IW) modes. Result: One of the models is shown to be suitable for on-board processing and deployment on a field-programmable gate array (FPGA), and a binary classification task between ships and windmills shows that target classification is feasible. Conclusion: The paper demonstrates the feasibility of real-time neural network processing of unfocused SAR data, which supports efficient on-board processing, reduces the need to downlink data, and enables target classification. Abstract: Synthetic Aperture Radar (SAR) data enables large-scale surveillance of maritime vessels. However, near-real-time monitoring is currently constrained by the need to downlink all raw data, perform image focusing, and subsequently analyze it on the ground. On-board processing to generate higher-level products could reduce the data volume that needs to be downlinked, alleviating bandwidth constraints and minimizing latency. However, traditional image focusing and processing algorithms face challenges due to the satellite's limited memory, processing power, and computational resources. This work proposes and evaluates neural networks designed for real-time inference on unfocused SAR data acquired in Stripmap and Interferometric Wide (IW) modes captured with Sentinel-1. Our results demonstrate the feasibility of using one of our models for on-board processing and deployment on an FPGA. Additionally, by investigating a binary classification task between ships and windmills, we demonstrate that target classification is possible.
[151] Revisiting Cross-View Localization from Image Matching
Panwang Xia,Qiong Wu,Lei Yu,Yi Liu,Mingtao Xiong,Lei Liang,Yongjun Zhang,Yi Wan
Main category: cs.CV
TL;DR: The paper proposes a novel framework for cross-view localization that improves image matching and pose estimation by introducing a Surface Model and SimRefiner module, outperforming existing methods under extreme viewpoint differences.
Details
Motivation: The motivation stems from the limitations of existing cross-view localization methods, which struggle with establishing strict cross-view correspondences, leading to coarse or inconsistent matches. This issue affects localization interpretability, especially in GNSS-denied environments like urban canyons and disaster zones. Method: The authors propose a novel framework for cross-view localization that involves a Surface Model for accurate BEV projection and a SimRefiner module for refining the similarity matrix through local-global residual correction. They also introduce the CVFM benchmark dataset with annotated cross-view image pairs. Result: The proposed framework achieves improved localization accuracy and image matching quality, particularly under extreme viewpoint disparities, as demonstrated by extensive experiments on the newly introduced CVFM benchmark. Conclusion: The paper concludes that their proposed framework, which includes a Surface Model and SimRefiner module, significantly enhances cross-view image matching and localization accuracy, establishing new baselines in the field. Abstract: Cross-view localization aims to estimate the 3 degrees of freedom pose of a ground-view image by registering it to aerial or satellite imagery. It is essential in GNSS-denied environments such as urban canyons and disaster zones. Existing methods either regress poses directly or align features in a shared bird's-eye view (BEV) space, both built upon accurate spatial correspondences between perspectives. However, these methods fail to establish strict cross-view correspondences, yielding only coarse or geometrically inconsistent matches. Consequently, fine-grained image matching between ground and aerial views remains an unsolved problem, which in turn constrains the interpretability of localization results. 
In this paper, we revisit cross-view localization from the perspective of cross-view image matching and propose a novel framework that improves both matching and localization. Specifically, we introduce a Surface Model to model visible regions for accurate BEV projection, and a SimRefiner module to refine the similarity matrix through local-global residual correction, eliminating the reliance on post-processing like RANSAC. To further support research in this area, we introduce CVFM, the first benchmark with 32,509 cross-view image pairs annotated with pixel-level correspondences. Extensive experiments demonstrate that our approach substantially improves both localization accuracy and image matching quality, setting new baselines under extreme viewpoint disparity.
[152] Exploiting Discriminative Codebook Prior for Autoregressive Image Generation
Longxiang Tang,Ruihang Chu,Xiang Wang,Yujin Han,Pingyu Wu,Chunming He,Yingya Zhang,Shiwei Zhang,Jiaya Jia
Main category: cs.CV
TL;DR: This paper introduces DCPE, a novel method to extract prior information from codebooks in autoregressive image generation, outperforming k-means clustering and accelerating model training while improving performance.
Details
Motivation: k-means clustering is ineffective in the codebook feature space due to token space disparity and centroid distance inaccuracy, prompting the need for a better method to exploit the prior encoded in the codebook. Method: The Discriminative Codebook Prior Extractor (DCPE) uses an instance-based distance metric and agglomerative merging to better capture token similarity and address token space disparity. Result: DCPE improves training speed by 42% on LlamaGen-B and enhances FID and IS performance, proving its effectiveness as a plug-and-play solution. Conclusion: DCPE effectively extracts discriminative prior information from the codebook, improving the training efficiency and performance of autoregressive models. Abstract: Advanced discrete token-based autoregressive image generation systems first tokenize images into sequences of token indices with a codebook, and then model these sequences in an autoregressive paradigm. While autoregressive generative models are trained only on index values, the prior encoded in the codebook, which contains rich token similarity information, is not exploited. Recent studies have attempted to incorporate this prior by performing naive k-means clustering on the tokens, helping to facilitate the training of generative models with a reduced codebook. However, we reveal that k-means clustering performs poorly in the codebook feature space due to inherent issues, including token space disparity and centroid distance inaccuracy. In this work, we propose the Discriminative Codebook Prior Extractor (DCPE) as an alternative to k-means clustering for more effectively mining and utilizing the token similarity information embedded in the codebook. DCPE replaces the commonly used centroid-based distance, which is found to be unsuitable and inaccurate for the token feature space, with a more reasonable instance-based distance. 
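To make the contrast between centroid-based and instance-based distances concrete, the sketch below compares the two on a toy example. The abstract does not define DCPE's instance-based metric, so the minimum pairwise distance (single linkage) is used here purely as an assumed stand-in; the function names and points are hypothetical.

```python
import math

def centroid(cluster):
    """Mean point of a cluster given as a list of equal-length vectors."""
    dim = len(cluster[0])
    return [sum(p[d] for p in cluster) / len(cluster) for d in range(dim)]

def centroid_distance(a, b):
    """Distance between cluster means, as used by k-means style methods."""
    return math.dist(centroid(a), centroid(b))

def instance_distance(a, b):
    """Assumed instance-based distance: the closest pair of members (single linkage)."""
    return min(math.dist(p, q) for p in a for q in b)

# An elongated cluster whose mean sits far from both of its members.
elongated = [[0.0, 0.0], [10.0, 0.0]]
near_tip = [[11.0, 0.0]]      # adjacent to an actual member
near_middle = [[5.0, 3.0]]    # close to the mean, far from every member

for name, other in (("near_tip", near_tip), ("near_middle", near_middle)):
    print(name, centroid_distance(elongated, other), instance_distance(elongated, other))
```

The centroid distance ranks `near_middle` as the closer cluster even though no member of `elongated` lies near it, while the instance-based distance prefers `near_tip`. This is the kind of disagreement that can make centroid distances misleading in a feature space with uneven density.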
Using an agglomerative merging technique, it further addresses the token space disparity issue by avoiding splitting high-density regions and aggregating low-density ones. Extensive experiments demonstrate that DCPE is plug-and-play and integrates seamlessly with existing codebook prior-based paradigms. With the discriminative prior extracted, DCPE accelerates the training of autoregressive models by 42% on LlamaGen-B and improves final FID and IS performance.
[153] EgoCross: Benchmarking Multimodal Large Language Models for Cross-Domain Egocentric Video Question Answering
Yanjun Li,Yuqian Fu,Tianwen Qian,Qi'ao Xu,Silong Dai,Danda Pani Paudel,Luc Van Gool,Xiaoling Wang
Main category: cs.CV
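TL;DR: EgoCross is a new benchmark for evaluating the cross-domain generalization of multimodal large language models in egocentric video question answering; it finds that existing models perform poorly outside daily-life domains and underscores the need for better domain adaptation.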
Details
Motivation: Existing EgocentricQA research focuses mainly on daily activities, whereas real-world deployment faces domain shifts in both visual style and semantic content, so a new benchmark is needed to evaluate cross-domain generalization. Method: A new benchmark, EgoCross, is introduced, covering four domains (surgery, industry, extreme sports, and animal perspective) with approximately 1,000 QA pairs spanning four tasks: prediction, recognition, localization, and counting. Result: Experiments show that most existing MLLMs generalize poorly to domains beyond daily life, highlighting the limitations of current models. Conclusion: EgoCross highlights the shortcomings of current MLLMs in cross-domain generalization and aims to advance domain-adaptive, robust egocentric video understanding. Abstract: Recent advances in Multimodal Large Language Models (MLLMs) have significantly pushed the frontier of egocentric video question answering (EgocentricQA). However, existing benchmarks and studies are mainly limited to common daily activities such as cooking and cleaning. In contrast, real-world deployment inevitably encounters domain shifts, where target domains differ substantially in both visual style and semantic content. To bridge this gap, we introduce EgoCross, a comprehensive benchmark designed to evaluate the cross-domain generalization of MLLMs in EgocentricQA. EgoCross covers four diverse and challenging domains, including surgery, industry, extreme sports, and animal perspective, representing realistic and high-impact application scenarios. It comprises approximately 1,000 QA pairs across 798 video clips, spanning four key QA tasks: prediction, recognition, localization, and counting. Each QA pair provides both OpenQA and CloseQA formats to support fine-grained evaluation. Extensive experiments show that most existing MLLMs, whether general-purpose or egocentric-specialized, struggle to generalize to domains beyond daily life, highlighting the limitations of current models. Furthermore, we conduct several pilot studies, e.g., fine-tuning and reinforcement learning, to explore potential improvements. We hope EgoCross and our accompanying analysis will serve as a foundation for advancing domain-adaptive, robust egocentric video understanding.
Data and code will be released at: https://github.com/MyUniverse0726/EgoCross.
[154] Dissecting Generalized Category Discovery: Multiplex Consensus under Self-Deconstruction
Luyao Tang,Kunze Huang,Chaoqi Chen,Yuxuan Yuan,Chenxin Li,Xiaotong Tu,Xinghao Ding,Yue Huang
Main category: cs.CV
TL;DR: This paper introduces ConGCD, a novel framework for generalized category discovery inspired by human cognitive processes, which decomposes objects into visual primitives and establishes cross-knowledge comparisons to effectively recognize objects across known and novel categories.
Details
Motivation: The motivation is to bridge the gap between human perceptual systems and current machine learning frameworks by introducing a novel approach for generalized category discovery (GCD) inspired by human cognitive processes for understanding novel objects. Method: The method involves decomposing objects into visual primitives, establishing cross-knowledge comparisons, and creating primitive-oriented representations through high-level semantic reconstruction. Dominant and contextual consensus units are implemented to capture class-discriminative patterns and distributional invariants, with a consensus scheduler dynamically optimizing activation pathways. Result: Extensive evaluations across coarse- and fine-grained benchmarks demonstrate the effectiveness of the proposed ConGCD framework as a consensus-aware paradigm for generalized category discovery. Conclusion: The proposed ConGCD framework offers an effective consensus-aware paradigm for generalized category discovery by mimicking human cognitive processes through decomposing objects into visual primitives and establishing cross-knowledge comparisons. Abstract: Human perceptual systems excel at inducing and recognizing objects across both known and novel categories, a capability far beyond current machine learning frameworks. While generalized category discovery (GCD) aims to bridge this gap, existing methods predominantly focus on optimizing objective functions. We present an orthogonal solution, inspired by the human cognitive process for novel object understanding: decomposing objects into visual primitives and establishing cross-knowledge comparisons. We propose ConGCD, which establishes primitive-oriented representations through high-level semantic reconstruction, binding intra-class shared attributes via deconstruction. 
Mirroring human preference diversity in visual processing, where distinct individuals leverage dominant or contextual cues, we implement dominant and contextual consensus units to capture class-discriminative patterns and inherent distributional invariants, respectively. A consensus scheduler dynamically optimizes activation pathways, with final predictions emerging through multiplex consensus integration. Extensive evaluations across coarse- and fine-grained benchmarks demonstrate ConGCD's effectiveness as a consensus-aware paradigm. Code is available at github.com/lytang63/ConGCD.
[155] Privacy-enhancing Sclera Segmentation Benchmarking Competition: SSBC 2025
Matej Vitek,Darian Tomašević,Abhijit Das,Sabari Nathan,Gökhan Özbulak,Gözde Ayşe Tataroğlu Özbulak,Jean-Paul Calbimonte,André Anjos,Hariohm Hemant Bhatt,Dhruv Dhirendra Premani,Jay Chaudhari,Caiyong Wang,Jian Jiang,Chi Zhang,Qi Zhang,Iyyakutti Iyappan Ganapathi,Syed Sadaf Ali,Divya Velayudan,Maregu Assefa,Naoufel Werghi,Zachary A. Daniels,Leeon John,Ritesh Vyas,Jalil Nourmohammadi Khiarak,Taher Akbari Saeed,Mahsa Nasehi,Ali Kianfar,Mobina Pashazadeh Panahi,Geetanjali Sharma,Pushp Raj Panth,Raghavendra Ramachandra,Aditya Nigam,Umapada Pal,Peter Peer,Vitomir Štruc
Main category: cs.CV
TL;DR: The 2025 Sclera Segmentation Benchmarking Competition showed that models trained on synthetic data can perform competitively in sclera segmentation, highlighting the potential of privacy-preserving biometric development using synthetic data.
Details
Motivation: To assess the effectiveness of models trained on synthetic ocular images for sclera segmentation, as a privacy-preserving alternative to models trained on real-world data. Method: Nine research groups developed diverse segmentation models using synthetic data and, in some cases, a mix of synthetic and real-world data. Experiments were conducted across three datasets, evaluating different architectural designs and training strategies. Result: Top-performing models trained solely on synthetic data achieved F1 scores over 0.8, and the mixed track's performance gains were largely driven by methodological choices rather than the inclusion of real data. Conclusion: The competition demonstrates the potential of synthetic data in privacy-preserving sclera segmentation, showing that models trained solely on synthetic data can achieve competitive performance, and methodological choices significantly impact performance gains when combining synthetic and real-world data. Abstract: This paper presents a summary of the 2025 Sclera Segmentation Benchmarking Competition (SSBC), which focused on the development of privacy-preserving sclera-segmentation models trained using synthetically generated ocular images. The goal of the competition was to evaluate how well models trained on synthetic data perform in comparison to those trained on real-world datasets. The competition featured two tracks: $(i)$ one relying solely on synthetic data for model development, and $(ii)$ one combining/mixing synthetic with (a limited amount of) real-world data. A total of nine research groups submitted diverse segmentation models, employing a variety of architectural designs, including transformer-based solutions, lightweight models, and segmentation networks guided by generative frameworks. Experiments were conducted across three evaluation datasets containing both synthetic and real-world images, collected under diverse conditions. 
Results show that models trained entirely on synthetic data can achieve competitive performance, particularly when dedicated training strategies are employed, as evidenced by the top-performing models that achieved $F_1$ scores of over $0.8$ in the synthetic data track. Moreover, performance gains in the mixed track were often driven more by methodological choices than by the inclusion of real data, highlighting the promise of synthetic data for privacy-aware biometric development. The code and data for the competition are available at: https://github.com/dariant/SSBC_2025.
[156] Axis-level Symmetry Detection with Group-Equivariant Representation
Wongyun Yu,Ahyun Seo,Minsu Cho
Main category: cs.CV
TL;DR: This paper proposes a novel framework for axis-level detection of reflection and rotation symmetries using explicit geometric primitives and a dihedral group-equivariant dual-branch architecture, achieving state-of-the-art results.
Details
Motivation: Detecting symmetry in complex scenes remains a significant challenge in computer vision, and recent heatmap-based approaches often lack precision in identifying individual symmetry axes. Method: A dual-branch architecture that is equivariant to the dihedral group is used to detect reflection and rotation symmetries. For reflection symmetry, orientational anchors and a reflectional matching mechanism are introduced, while rotational matching is proposed for rotational symmetry to identify centers by comparing patterns at fixed angular intervals. Result: Extensive experiments demonstrate that the proposed method outperforms existing approaches in axis-level detection of reflection and rotation symmetries. Conclusion: The proposed method achieves state-of-the-art performance in detecting reflection and rotation symmetry by representing them as explicit geometric primitives and employing a dual-branch architecture that is equivariant to the dihedral group. Abstract: Symmetry is a fundamental concept that has been extensively studied, yet detecting it in complex scenes remains a significant challenge in computer vision. Recent heatmap-based approaches can localize potential regions of symmetry axes but often lack precision in identifying individual axes. In this work, we propose a novel framework for axis-level detection of the two most common symmetry types-reflection and rotation-by representing them as explicit geometric primitives, i.e. lines and points. Our method employs a dual-branch architecture that is equivariant to the dihedral group, with each branch specialized to exploit the structure of dihedral group-equivariant features for its respective symmetry type. For reflection symmetry, we introduce orientational anchors, aligned with group components, to enable orientation-specific detection, and a reflectional matching that measures similarity between patterns and their mirrored counterparts across candidate axes. 
For rotational symmetry, we propose a rotational matching that compares patterns at fixed angular intervals to identify rotational centers. Extensive experiments demonstrate that our method achieves state-of-the-art performance, outperforming existing approaches.
[157] Forgery Guided Learning Strategy with Dual Perception Network for Deepfake Cross-domain Detection
Lixin Jia,Zhiqing Guo,Gaobo Yang,Liejun Wang,Keqin Li
Main category: cs.CV
TL;DR: To tackle the growing challenges of deepfake technology, the study proposes a novel Forgery Guided Learning strategy and Dual Perception Network, which enhance detection performance across unknown forgery techniques and generalize well across diverse scenarios.
Details
Motivation: Current deepfake detection methods perform well on specific datasets but struggle with unknown forgery techniques. As the gap between emerging and traditional forgery techniques widens, cross-domain detection methods relying on common forgery traces become ineffective. This necessitates the development of more generalized deepfake detection technologies. Method: The authors propose a Forgery Guided Learning (FGL) strategy to capture differential information between known and unknown forgery techniques, enabling real-time learning adaptation. They also design a Dual Perception Network (DPNet) that captures both differences and relationships among forgery traces. The DPNet uses a frequency stream to extract discriminative features, integrates them with spatial features, and employs graph convolution to model feature space relationships. Result: Extensive experiments demonstrate that the proposed approach generalizes well across different scenarios and effectively addresses challenges posed by unknown forgery techniques, offering robust support for deepfake detection. Conclusion: The proposed Forgery Guided Learning (FGL) strategy and Dual Perception Network (DPNet) effectively enhance deepfake detection by enabling models to adapt to unknown forgery techniques and generalize across diverse scenarios. Abstract: The emergence of deepfake technology has introduced a range of societal problems, garnering considerable attention. Current deepfake detection methods perform well on specific datasets, but exhibit poor performance when applied to datasets with unknown forgery techniques. Moreover, as the gap between emerging and traditional forgery techniques continues to widen, cross-domain detection methods that rely on common forgery traces are becoming increasingly ineffective. This situation highlights the urgency of developing deepfake detection technology with strong generalization to cope with fast iterative forgery techniques. 
To address these challenges, we propose a Forgery Guided Learning (FGL) strategy designed to enable detection networks to continuously adapt to unknown forgery techniques. Specifically, the FGL strategy captures the differential information between known and unknown forgery techniques, allowing the model to dynamically adjust its learning process in real time. To further improve the ability to perceive forgery traces, we design a Dual Perception Network (DPNet) that captures both differences and relationships among forgery traces. In the frequency stream, the network dynamically perceives and extracts discriminative features across various forgery techniques, establishing essential detection cues. These features are then integrated with spatial features and projected into the embedding space. In addition, graph convolution is employed to perceive relationships across the entire feature space, facilitating a more comprehensive understanding of forgery trace correlations. Extensive experiments show that our approach generalizes well across different scenarios and effectively handles unknown forgery challenges, providing robust support for deepfake detection. Our code is available on https://github.com/vpsg-research/FGL.
[158] An Efficient Model-Driven Groupwise Approach for Atlas Construction
Ziwei Zou,Bei Zou,Xiaoyan Kui,Wenqi Lu,Haoran Dou,Arezoo Zakeri,Timothy Cootes,Alejandro F Frangi,Jinming Duan
Main category: cs.CV
TL;DR: This paper introduces DARC, a novel model-driven groupwise registration framework for atlas construction, which addresses the limitations of data-driven and model-driven methods, offering a flexible, generalizable, and resource-efficient solution with applications in one-shot segmentation and shape synthesis.
Details
Motivation: Data-driven registration methods have limitations in terms of dataset size requirements, generalizability, and lack of true inference phases, while model-driven methods face scalability and optimization challenges. Method: DARC (Diffeomorphic Atlas Registration via Coordinate descent), a novel model-driven groupwise registration framework for atlas construction. Result: DARC produces unbiased, diffeomorphic atlases with high anatomical fidelity and supports one-shot segmentation and shape synthesis applications. Conclusion: DARC offers a flexible, generalizable, and resource-efficient framework for atlas construction and applications. Abstract: Atlas construction is fundamental to medical image analysis, offering a standardized spatial reference for tasks such as population-level anatomical modeling. While data-driven registration methods have recently shown promise in pairwise settings, their reliance on large training datasets, limited generalizability, and lack of true inference phases in groupwise contexts hinder their practical use. In contrast, model-driven methods offer training-free, theoretically grounded, and data-efficient alternatives, though they often face scalability and optimization challenges when applied to large 3D datasets. In this work, we introduce DARC (Diffeomorphic Atlas Registration via Coordinate descent), a novel model-driven groupwise registration framework for atlas construction. DARC supports a broad range of image dissimilarity metrics and efficiently handles arbitrary numbers of 3D images without incurring GPU memory issues. Through a coordinate descent strategy and a centrality-enforcing activation function, DARC produces unbiased, diffeomorphic atlases with high anatomical fidelity. 
Beyond atlas construction, we demonstrate two key applications: (1) One-shot segmentation, where labels annotated only on the atlas are propagated to subjects via inverse deformations, outperforming state-of-the-art few-shot methods; and (2) shape synthesis, where new anatomical variants are generated by warping the atlas mesh using synthesized diffeomorphic deformation fields. Overall, DARC offers a flexible, generalizable, and resource-efficient framework for atlas construction and applications.
[159] From Diagnosis to Improvement: Probing Spatio-Physical Reasoning in Vision Language Models
Tiancheng Han,Yunfei Gao,Yong Li,Wuzhou Yu,Qiaosheng Zhang,Wenqi Shao
Main category: cs.CV
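TL;DR: This paper examines the limitations of vision language models in spatio-physical reasoning and significantly improves their performance through fine-tuning, although generalization to new scenarios still needs improvement.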
Details
Motivation: Spatio-physical reasoning is a key capability for understanding the real physical world, but the ability of current vision language models in this area has not been fully explored. Method: The paper conducts a comprehensive diagnostic analysis of mainstream vision language models and optimizes Qwen2.5-VL-7B through supervised fine-tuning and rule-based reinforcement learning. Result: The optimized Qwen2.5-VL-7B shows significant improvement on spatio-physical reasoning tasks, even surpassing leading proprietary models, but its generalization to new physics scenarios remains limited. Conclusion: Despite significant progress on specific tasks, current vision language models still perform inadequately at spatio-physical reasoning and generalize poorly; new approaches are needed to improve this capability. Abstract: Spatio-physical reasoning, a foundational capability for understanding the real physical world, is a critical step towards building robust world models. While recent vision language models (VLMs) have shown remarkable progress in specialized domains like multimodal mathematics and pure spatial understanding, their capability for spatio-physical reasoning remains largely unexplored. This paper provides a comprehensive diagnostic analysis of mainstream VLMs, revealing that current models perform inadequately on this crucial task. Further detailed analysis shows that this underperformance is largely attributable to biases caused by human-like priors and a lack of deep reasoning. To address these challenges, we apply supervised fine-tuning followed by rule-based reinforcement learning to Qwen2.5-VL-7B, resulting in significant improvements in spatio-physical reasoning capabilities and surpassing leading proprietary models. Nevertheless, despite this success, the model's generalization to new physics scenarios remains limited -- underscoring the pressing need for new approaches in spatio-physical reasoning.
[160] AEGIS: Authenticity Evaluation Benchmark for AI-Generated Video Sequences
Jieyu Li,Xin Zhang,Joey Tianyi Zhou
Main category: cs.CV
TL;DR: AEGIS is a new large-scale benchmark designed to detect hyper-realistic AI-generated videos, offering diverse datasets and annotations to improve authenticity detection methodologies.
Details
Motivation: The motivation stems from the need to address the limitations of existing benchmarks for video authenticity detection, which fail to evaluate modern vision-language models against sophisticated AI-generated forgeries. Method: The paper introduces AEGIS, a large-scale benchmark with over 10,000 real and synthetic videos generated by state-of-the-art models, along with multimodal annotations for enhanced detection capabilities. Result: Experiments show that advanced vision-language models have limited detection capabilities on AEGIS's challenging subsets, highlighting the dataset's complexity and realism. Conclusion: AEGIS provides a comprehensive benchmark for video authenticity detection, advancing research for robust methodologies to combat real-world forgery threats. Abstract: Recent advances in AI-generated content have fueled the rise of highly realistic synthetic videos, posing severe risks to societal trust and digital integrity. Existing benchmarks for video authenticity detection typically suffer from limited realism, insufficient scale, and inadequate complexity, failing to effectively evaluate modern vision-language models against sophisticated forgeries. To address this critical gap, we introduce AEGIS, a novel large-scale benchmark explicitly targeting the detection of hyper-realistic and semantically nuanced AI-generated videos. AEGIS comprises over 10,000 rigorously curated real and synthetic videos generated by diverse, state-of-the-art generative models, including Stable Video Diffusion, CogVideoX-5B, KLing, and Sora, encompassing open-source and proprietary architectures. In particular, AEGIS features specially constructed challenging subsets enhanced with robustness evaluation. Furthermore, we provide multimodal annotations spanning Semantic-Authenticity Descriptions, Motion Features, and Low-level Visual Features, facilitating authenticity detection and supporting downstream tasks such as multimodal fusion and forgery localization. 
Extensive experiments using advanced vision-language models demonstrate limited detection capabilities on the most challenging subsets of AEGIS, highlighting the dataset's unique complexity and realism beyond the current generalization capabilities of existing models. In essence, AEGIS establishes an indispensable evaluation benchmark, fundamentally advancing research toward developing genuinely robust, reliable, broadly generalizable video authenticity detection methodologies capable of addressing real-world forgery threats. Our dataset is available on https://huggingface.co/datasets/Clarifiedfish/AEGIS.
[161] Video-BLADE: Block-Sparse Attention Meets Step Distillation for Efficient Video Generation
Youping Gu,Xiaolong Li,Yuhao Hu,Bohan Zhuang
Main category: cs.CV
TL;DR: BLADE accelerates video generation in diffusion transformers by combining adaptive sparse attention and sparsity-aware step distillation, achieving faster inference and improved video quality without requiring additional high-quality video data.
Details
Motivation: Diffusion transformers for video generation face significant inference bottlenecks due to slow iterative denoising and high computational costs from quadratic attention mechanisms. While step distillation and sparse attention have shown promise individually, integrating them effectively remains challenging, especially due to data and cost constraints. Method: BLADE introduces two key components: (1) ASA, which dynamically generates content-aware sparsity masks to focus computation on important spatiotemporal features, and (2) a sparsity-aware step distillation paradigm based on Trajectory Distribution Matching (TDM) that integrates sparsity directly into the distillation process for faster convergence. Result: BLADE achieves significant inference speedups—14.10x on Wan2.1-1.3B and 8.89x on CogVideoX-5B—while also improving video generation quality. On the VBench-2.0 benchmark, it boosts scores from 0.534 to 0.569 for CogVideoX-5B and from 0.563 to 0.570 for Wan2.1-1.3B, with further validation from human evaluations. Conclusion: BLADE is an innovative data-free joint training framework that effectively combines Adaptive Block-Sparse Attention (ASA) and sparsity-aware step distillation to accelerate video generation while maintaining or improving quality. Abstract: Diffusion transformers currently lead the field in high-quality video generation, but their slow iterative denoising process and prohibitive quadratic attention costs for long sequences create significant inference bottlenecks. While both step distillation and sparse attention mechanisms have shown promise as independent acceleration strategies, effectively combining these approaches presents critical challenges -- training-free integration yields suboptimal results, while separately training sparse attention after step distillation requires prohibitively expensive high-quality video data. 
To overcome these limitations, we propose BLADE, an innovative data-free joint training framework that introduces: (1) an Adaptive Block-Sparse Attention (ASA) mechanism for dynamically generating content-aware sparsity masks to focus computation on salient spatiotemporal features, and (2) a sparsity-aware step distillation paradigm built upon Trajectory Distribution Matching (TDM) that directly incorporates sparsity into the distillation process rather than treating it as a separate compression step, with fast convergence. We validate BLADE on text-to-video models like CogVideoX-5B and Wan2.1-1.3B. Our framework demonstrates remarkable efficiency gains across different scales. On Wan2.1-1.3B, BLADE achieves a 14.10x end-to-end inference acceleration over a 50-step baseline. Moreover, on models such as CogVideoX-5B with short video sequence lengths, our framework delivers a robust 8.89x speedup. Crucially, the acceleration is accompanied by a consistent quality improvement. On the VBench-2.0 benchmark, BLADE boosts the score of CogVideoX-5B to 0.569 (from 0.534) and Wan2.1-1.3B to 0.570 (from 0.563), results that are further corroborated by superior ratings in human evaluations. Our code and model weights are publicly available at: http://ziplab.co/BLADE-Homepage/.
[162] Ultra-High-Definition Reference-Based Landmark Image Super-Resolution with Generative Diffusion Prior
Zhenning Shi,Zizheng Yan,Yuhang Yu,Clara Xue,Jingyu Zhuang,Qi Zhang,Jinwei Chen,Tao Li,Qingnan Fan
Main category: cs.CV
TL;DR: The paper proposes TriFlowSR, a novel RefSR framework for ultra-high-definition landmark scenarios.
Details
Motivation: Existing diffusion-based RefSR methods struggle to effectively align information between the LR image and the reference HR image, and existing RefSR datasets have limited resolution and image quality. Method: The authors design a Reference Matching Strategy and introduce the Landmark-4K dataset. Result: Experiments show that the method makes better use of the semantic and texture information of the reference HR image than previous methods. Conclusion: TriFlowSR is a new RefSR framework that achieves image super-resolution for ultra-high-definition landmark scenarios under real-world degradation. Abstract: Reference-based Image Super-Resolution (RefSR) aims to restore a low-resolution (LR) image by utilizing the semantic and texture information from an additional reference high-resolution (reference HR) image. Existing diffusion-based RefSR methods are typically built upon ControlNet, which struggles to effectively align the information between the LR image and the reference HR image. Moreover, current RefSR datasets suffer from limited resolution and poor image quality, resulting in the reference images lacking sufficient fine-grained details to support high-quality restoration. To overcome the limitations above, we propose TriFlowSR, a novel framework that explicitly achieves pattern matching between the LR image and the reference HR image. Meanwhile, we introduce Landmark-4K, the first RefSR dataset for Ultra-High-Definition (UHD) landmark scenarios. Considering the UHD scenarios with real-world degradation, in TriFlowSR, we design a Reference Matching Strategy to effectively match the LR image with the reference HR image. Experimental results show that our approach can better utilize the semantic and texture information of the reference HR image compared to previous methods. To the best of our knowledge, we propose the first diffusion-based RefSR pipeline for ultra-high definition landmark scenarios under real-world degradation. Our code and model will be available at https://github.com/nkicsl/TriFlowSR.
[163] Cooperative Face Liveness Detection from Optical Flow
Artem Sokolov,Mikhail Nikitin,Anton Konushin
Main category: cs.CV
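TL;DR: This paper proposes a user-cooperative video-based face liveness detection method that combines the motion pattern of a user approaching the camera with optical flow analysis to improve the reliability of liveness detection.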
Details
Motivation: To address the shortcomings of traditional passive methods in distinguishing genuine faces from presentation attacks (such as photos, screen displays, masks, and video replays). Method: A user-cooperative video-based face liveness detection method that combines a specific motion pattern, in which the user moves closer to the camera, with optical flow analysis. Result: Neural optical flow estimation effectively extracts facial volume information and yields better performance on the classification task. Conclusion: By processing the predicted optical flows and RGB frames and leveraging spatial-temporal features, the method significantly improves the reliability of liveness detection. Abstract: In this work, we propose a novel cooperative video-based face liveness detection method based on a new user interaction scenario where participants are instructed to slowly move their frontal-oriented face closer to the camera. This controlled approaching face protocol, combined with optical flow analysis, represents the core innovation of our approach. By designing a system where users follow this specific movement pattern, we enable robust extraction of facial volume information through neural optical flow estimation, significantly improving discrimination between genuine faces and various presentation attacks (including printed photos, screen displays, masks, and video replays). Our method processes both the predicted optical flows and RGB frames through a neural classifier, effectively leveraging spatial-temporal features for more reliable liveness detection compared to passive methods.
[164] VasoMIM: Vascular Anatomy-Aware Masked Image Modeling for Vessel Segmentation
De-Xing Huang,Xiao-Hu Zhou,Mei-Jiang Gui,Xiao-Liang Xie,Shi-Qi Liu,Shuang-Yi Wang,Tian-Yu Xiang,Rui-Ze Ma,Nu-Fang Xiao,Zeng-Guang Hou
Main category: cs.CV
TL;DR: VasoMIM improves vessel segmentation in X-ray angiograms by introducing an anatomy-aware MIM framework that addresses class imbalance and enhances vascular representation learning.
Details
Motivation: Accurate vessel segmentation in X-ray angiograms is essential, but the lack of annotated data and class imbalance between vessel and background pixels limit the effectiveness of traditional methods like MIM. Method: VasoMIM includes an anatomy-guided masking strategy and an anatomical consistency loss to improve vascular representation learning. Result: VasoMIM achieves state-of-the-art performance across three datasets, demonstrating its effectiveness in improving vascular representation learning. Conclusion: VasoMIM introduces a novel MIM framework for X-ray angiograms by incorporating anatomical knowledge, achieving state-of-the-art performance in vessel segmentation. Abstract: Accurate vessel segmentation in X-ray angiograms is crucial for numerous clinical applications. However, the scarcity of annotated data presents a significant challenge, which has driven the adoption of self-supervised learning (SSL) methods such as masked image modeling (MIM) to leverage large-scale unlabeled data for learning transferable representations. Unfortunately, conventional MIM often fails to capture vascular anatomy because of the severe class imbalance between vessel and background pixels, leading to weak vascular representations. To address this, we introduce Vascular anatomy-aware Masked Image Modeling (VasoMIM), a novel MIM framework tailored for X-ray angiograms that explicitly integrates anatomical knowledge into the pre-training process. Specifically, it comprises two complementary components: anatomy-guided masking strategy and anatomical consistency loss. The former preferentially masks vessel-containing patches to focus the model on reconstructing vessel-relevant regions. The latter enforces consistency in vascular semantics between the original and reconstructed images, thereby improving the discriminability of vascular representations. Empirically, VasoMIM achieves state-of-the-art performance across three datasets. 
These findings highlight its potential to facilitate X-ray angiogram analysis.
[165] Object Fidelity Diffusion for Remote Sensing Image Generation
Ziqi Ye,Shuran Ma,Jie Yang,Xiaoyi Yang,Ziyang Gong,Xue Yang,Haipeng Wang
Main category: cs.CV
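TL;DR: This paper proposes OF-Diff, which uses a dual-branch diffusion model and a diffusion consistency loss to effectively improve the accuracy and fidelity of remote sensing image generation.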
Details
Motivation: Existing diffusion models often produce low-fidelity images because they cannot adequately capture morphological details, which may affect the robustness and reliability of object detection models. Method: The method first extracts prior shapes of objects based on the layout, introduces a dual-branch diffusion model with a diffusion consistency loss, and uses DDPO to fine-tune the diffusion process. Result: OF-Diff outperforms existing techniques on key quality metrics for remote sensing image generation, with significant gains for polymorphic and small object classes; for example, mAP increases by 8.3%, 7.7%, and 4.0% for airplanes, ships, and vehicles, respectively. Conclusion: OF-Diff is an effective remote sensing image generation method that produces high-fidelity and diverse images, with broad application prospects. Abstract: High-precision controllable remote sensing image generation is both meaningful and challenging. Existing diffusion models often produce low-fidelity images due to their inability to adequately capture morphological details, which may affect the robustness and reliability of object detection models. To enhance the accuracy and fidelity of generated objects in remote sensing, this paper proposes Object Fidelity Diffusion (OF-Diff), which effectively improves the fidelity of generated objects. Specifically, we are the first to extract the prior shapes of objects based on the layout for diffusion models in remote sensing. Then, we introduce a dual-branch diffusion model with diffusion consistency loss, which can generate high-fidelity remote sensing images without providing real images during the sampling phase. Furthermore, we introduce DDPO to fine-tune the diffusion process, making the generated remote sensing images more diverse and semantically consistent. Comprehensive experiments demonstrate that OF-Diff outperforms state-of-the-art methods in remote sensing across key quality metrics. Notably, the performance of several polymorphic and small object classes shows significant improvement. For instance, the mAP increases by 8.3%, 7.7%, and 4.0% for airplanes, ships, and vehicles, respectively.
[166] Mobile-Friendly Deep Learning for Plant Disease Detection: A Lightweight CNN Benchmark Across 101 Classes of 33 Crops
Anand Kumar,Harminder Pal Monga,Tapasi Brahma,Satyam Kalra,Navas Sherif
Main category: cs.CV
TL;DR: The authors develop a mobile-friendly plant disease detection solution that achieves 94.7% classification accuracy with EfficientNet-B1.
Details
Motivation: Plant diseases pose a major threat to global food security, so systems that can detect them accurately and early are needed. Method: A comprehensive dataset is built by combining several datasets, including Plant Doc, PlantVillage, and PlantWild, and several lightweight architectures (such as MobileNetV2, MobileNetV3, and EfficientNet-B0 and B1) are evaluated to select the best model for mobile deployment. Result: EfficientNet-B1 performs best, reaching 94.7% classification accuracy and striking the best balance between accuracy and computational efficiency. Conclusion: The study successfully develops an efficient, mobile-friendly plant disease detection method suitable for real-world deployment on resource-constrained devices. Abstract: Plant diseases are a major threat to food security globally. It is important to develop early detection systems that can detect them accurately. The advancement in computer vision techniques has the potential to solve this challenge. We have developed a mobile-friendly solution which can accurately classify 101 plant diseases across 33 crops. We built a comprehensive dataset by combining different datasets, Plant Doc, PlantVillage, and PlantWild, all of which are for the same purpose. We evaluated performance across several lightweight architectures - MobileNetV2, MobileNetV3, MobileNetV3-Large, and EfficientNet-B0, B1 - specifically chosen for their efficiency on resource-constrained devices. The results were promising, with EfficientNet-B1 delivering our best performance at 94.7% classification accuracy. This architecture struck an optimal balance between accuracy and computational efficiency, making it well-suited for real-world deployment on mobile devices.
[167] UI-Venus Technical Report: Building High-performance UI Agents with RFT
Zhangxuan Gu,Zhengwen Zeng,Zhenyu Xu,Xingran Zhou,Shuheng Shen,Yunfei Liu,Beitong Zhou,Changhua Meng,Tianyu Xia,Weizhi Chen,Yue Wen,Jingya Dou,Fei Tang,Jinzhen Lin,Yulin Liu,Zhenlin Guo,Yichen Gong,Heng Jia,Changlong Gao,Yuan Guo,Yong Deng,Zhenyu Guo,Liang Chen,Weiqiang Wang
Main category: cs.CV
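TL;DR: UI-Venus is a native UI agent that achieves state-of-the-art UI grounding and navigation performance, with navigation improved through reinforcement fine-tuning and a self-evolving framework.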
Details
Motivation: UI-Venus aims to address challenges in UI grounding and navigation tasks, advancing community research and development by achieving more coherent planning and better generalization in complex UI tasks. Method: UI-Venus is trained with reinforcement fine-tuning (RFT) on Qwen2.5-VL, introduces carefully designed reward functions with corresponding efficient data cleaning strategies, and proposes the Self-Evolving Trajectory History Alignment & Sparse Action Enhancement method. Result: The 7B and 72B variants of UI-Venus score 94.1% / 50.8% and 95.3% / 61.9%, respectively, on the standard grounding benchmarks Screenspot-V2 / Pro, and reach success rates of 49.1% and 65.9%, respectively, in the AndroidWorld online UI navigation arena. Conclusion: UI-Venus is a native UI agent based on a multimodal large language model that achieves state-of-the-art performance and improves navigation through carefully designed reward functions, data cleaning strategies, and a new self-evolving framework. Abstract: We present UI-Venus, a native UI agent based on a multimodal large language model that takes only screenshots as input. UI-Venus achieves SOTA performance on both UI grounding and navigation tasks using only several hundred thousand high-quality training samples through reinforcement fine-tuning (RFT) based on Qwen2.5-VL. Specifically, the 7B and 72B variants of UI-Venus obtain 94.1% / 50.8% and 95.3% / 61.9% on the standard grounding benchmarks, i.e., Screenspot-V2 / Pro, surpassing the previous SOTA baselines including open-source GTA1 and closed-source UI-TARS-1.5. To show UI-Venus's summary and planning ability, we also evaluate it on AndroidWorld, an online UI navigation arena, on which our 7B and 72B variants achieve 49.1% and 65.9% success rates, also beating existing models. To achieve this, we introduce carefully designed reward functions for both UI grounding and navigation tasks and corresponding efficient data cleaning strategies. To further boost navigation performance, we propose Self-Evolving Trajectory History Alignment & Sparse Action Enhancement, which refines historical reasoning traces and balances the distribution of sparse but critical actions, leading to more coherent planning and better generalization in complex UI tasks. Our contributions include the release of SOTA open-source UI agents, comprehensive data cleaning protocols, and a novel self-evolving framework for improving navigation performance, which encourage further research and development in the community.
Code is available at https://github.com/antgroup/UI-Venus.
[168] Self-Supervised Stereo Matching with Multi-Baseline Contrastive Learning
Peng Xu,Zhiyu Xiang,Jingyun Fu,Tianyu Pu,Kai Wang,Chaojie Ji,Tingming Bai,Eryun Liu
Main category: cs.CV
TL;DR: BaCon-Stereo is a contrastive learning framework for stereo matching that improves performance in occluded regions using a teacher-student paradigm and multi-baseline inputs.
Details
Motivation: Current self-supervised stereo matching fails in occluded regions due to photometric inconsistency, which this paper aims to address. Method: The paper introduces BaCon-Stereo, which uses a teacher-student paradigm with multi-baseline inputs and an occlusion-aware attention map for self-supervised stereo matching. Result: BaCon-Stereo outperforms state-of-the-art self-supervised methods on KITTI 2015 and 2012 benchmarks and shows strong generalization and robustness. Conclusion: BaCon-Stereo improves stereo matching in both occluded and non-occluded regions through a contrastive learning framework and multi-baseline inputs. Abstract: Current self-supervised stereo matching relies on the photometric consistency assumption, which breaks down in occluded regions due to ill-posed correspondences. To address this issue, we propose BaCon-Stereo, a simple yet effective contrastive learning framework for self-supervised stereo network training in both non-occluded and occluded regions. We adopt a teacher-student paradigm with multi-baseline inputs, in which the stereo pairs fed into the teacher and student share the same reference view but differ in target views. Geometrically, regions occluded in the student's target view are often visible in the teacher's, making it easier for the teacher to predict in these regions. The teacher's prediction is rescaled to match the student's baseline and then used to supervise the student. We also introduce an occlusion-aware attention map to better guide the student in learning occlusion completion. To support training, we synthesize a multi-baseline dataset BaCon-20k. Extensive experiments demonstrate that BaCon-Stereo improves prediction in both occluded and non-occluded regions, achieves strong generalization and robustness, and outperforms state-of-the-art self-supervised methods on both KITTI 2015 and 2012 benchmarks. 
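The rescaling step mentioned in the abstract can be illustrated with a minimal sketch. It assumes rectified stereo pairs that share the reference view and focal length, so that disparity is proportional to the baseline (d = f * B / Z). The function name and the list-of-lists disparity format are hypothetical and are not taken from the BaCon-Stereo code.

```python
def rescale_disparity(teacher_disp, teacher_baseline, student_baseline):
    """Rescale a disparity map from the teacher's baseline to the student's.

    Assumes rectified pairs with a shared reference view and focal length,
    so disparity scales linearly with baseline (d = f * B / Z).
    teacher_disp is a 2D list of disparities in pixels (hypothetical format).
    """
    if teacher_baseline <= 0 or student_baseline <= 0:
        raise ValueError("baselines must be positive")
    scale = student_baseline / teacher_baseline
    # Same depth, different baseline: multiply every disparity by the ratio.
    return [[d * scale for d in row] for row in teacher_disp]


# Teacher sees a baseline twice as wide as the student's.
pseudo_label = rescale_disparity([[10.0, 20.0], [4.0, 0.0]], 2.0, 1.0)
```

The rescaled map could then serve as a pseudo-label for the student; how the paper weights it with the occlusion-aware attention map is not specified here.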
Our code and dataset will be released upon paper acceptance.
[169] Generalizable Federated Learning using Client Adaptive Focal Modulation
Tajamul Ashraf,Iqra Altaf Gillani
Main category: cs.CV
TL;DR: AdaptFED improves transformer-based federated learning by enhancing adaptability, scalability, and generalization, particularly in challenging distributed settings.
Details
Motivation: The motivation is to improve the generalizability and scalability of federated learning systems, particularly in non-IID, cross-domain, and resource-constrained environments. Method: The paper proposes AdaptFED, which incorporates a refined adaptation strategy with task-aware client embeddings, enhanced theoretical bounds on adaptation performance, and low-rank hypernetwork conditioning to reduce communication overhead. Result: Experiments on eight diverse datasets demonstrate that AdaptFED outperforms existing methods, especially in source-free and cross-task federated setups. Conclusion: The paper concludes that AdaptFED enhances the generalizability, scalability, and adaptability of transformer-based federated learning systems, outperforming state-of-the-art baselines in diverse settings. Abstract: Federated learning (FL) has proven essential for privacy-preserving, collaborative training across distributed clients. Our prior work, TransFed, introduced a robust transformer-based FL framework that leverages a learn-to-adapt hypernetwork to generate personalized focal modulation layers per client, outperforming traditional methods in non-IID and cross-domain settings. In this extended version, we propose AdaptFED, where we deepen the investigation of focal modulation in generalizable FL by incorporating: (1) a refined adaptation strategy that integrates task-aware client embeddings to personalize modulation dynamics further, (2) enhanced theoretical bounds on adaptation performance, and (3) broader empirical validation across additional modalities, including time-series and multilingual data. We also introduce an efficient variant of TransFed that reduces server-client communication overhead via low-rank hypernetwork conditioning, enabling scalable deployment in resource-constrained environments. 
Extensive experiments on eight diverse datasets reaffirm the superiority of our method over state-of-the-art baselines, particularly in source-free and cross-task federated setups. Our findings not only extend the capabilities of focal modulation in FL but also pave the way for more adaptive, scalable, and generalizable transformer-based federated systems. The code is available at http://github.com/Tajamul21/TransFed
[170] Hierarchical Fine-grained Preference Optimization for Physically Plausible Video Generation
Harold Haodong Chen,Haojian Huang,Qifeng Chen,Harry Yang,Ser-Nam Lim
Main category: cs.CV
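TL;DR: PhysHPO is a new framework for generating high-quality videos that obey physical laws; through fine-grained preference alignment and data selection, it significantly improves the physical plausibility and quality of generated videos.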
Details
Motivation: Generating videos that adhere to the laws of physics remains a critical challenge for applications requiring realism and accuracy. Method: The paper proposes PhysHPO, a novel framework for Hierarchical Cross-Modal Direct Preference Optimization, and introduces an automated data selection pipeline. Result: Extensive experiments on both physics-focused and general capability benchmarks demonstrate that PhysHPO significantly improves physical plausibility and overall video generation quality. Conclusion: PhysHPO enables fine-grained preference alignment and data selection, paving the way for more realistic and human-preferred video generation paradigms. Abstract: Recent advancements in video generation have enabled the creation of high-quality, visually compelling videos. However, generating videos that adhere to the laws of physics remains a critical challenge for applications requiring realism and accuracy. In this work, we propose PhysHPO, a novel framework for Hierarchical Cross-Modal Direct Preference Optimization, to tackle this challenge by enabling fine-grained preference alignment for physically plausible video generation. PhysHPO optimizes video alignment across four hierarchical granularities: a) Instance Level, aligning the overall video content with the input prompt; b) State Level, ensuring temporal consistency using boundary frames as anchors; c) Motion Level, modeling motion trajectories for realistic dynamics; and d) Semantic Level, maintaining logical consistency between narrative and visuals. Recognizing that real-world videos are the best reflections of physical phenomena, we further introduce an automated data selection pipeline to efficiently identify and utilize "good data" from existing large-scale text-video datasets, thereby eliminating the need for costly and time-intensive dataset construction. Extensive experiments on both physics-focused and general capability benchmarks demonstrate that PhysHPO significantly improves physical plausibility and overall video generation quality of advanced models. To the best of our knowledge, this is the first work to explore fine-grained preference alignment and data selection for video generation, paving the way for more realistic and human-preferred video generation paradigms.
[171] Performance of GPT-5 in Brain Tumor MRI Reasoning
Mojtaba Safari,Shansong Wang,Mingzhe Hu,Zach Eidex,Qiang Li,Xiaofeng Yang
Main category: cs.CV
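TL;DR: The study finds that GPT-5 family models achieve only limited accuracy on brain tumor MRI visual question answering and do not yet meet clinical requirements.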
Details
Motivation: Accurate differentiation of brain tumor types is critical for treatment planning in neuro-oncology, and advances in large language models (LLMs) have enabled visual question answering (VQA) approaches that integrate image interpretation with natural language reasoning. Method: The study evaluates GPT-4o, GPT-5-nano, GPT-5-mini, and GPT-5 on a brain tumor VQA benchmark derived from 3 Brain Tumor Segmentation (BraTS) datasets. Result: GPT-5-mini achieved the highest macro-average accuracy (44.19%), followed by GPT-5 (43.71%), GPT-4o (41.49%), and GPT-5-nano (35.85%). Conclusion: GPT-5 family models can achieve moderate accuracy in structured neuro-oncological VQA tasks, but their performance is not yet at a level acceptable for clinical use. Abstract: Accurate differentiation of brain tumor types on magnetic resonance imaging (MRI) is critical for guiding treatment planning in neuro-oncology. Recent advances in large language models (LLMs) have enabled visual question answering (VQA) approaches that integrate image interpretation with natural language reasoning. In this study, we evaluated GPT-4o, GPT-5-nano, GPT-5-mini, and GPT-5 on a curated brain tumor VQA benchmark derived from 3 Brain Tumor Segmentation (BraTS) datasets - glioblastoma (GLI), meningioma (MEN), and brain metastases (MET). Each case included multi-sequence MRI triplanar mosaics and structured clinical features transformed into standardized VQA items. Models were assessed in a zero-shot chain-of-thought setting for accuracy on both visual and reasoning tasks. Results showed that GPT-5-mini achieved the highest macro-average accuracy (44.19%), followed by GPT-5 (43.71%), GPT-4o (41.49%), and GPT-5-nano (35.85%). Performance varied by tumor subtype, with no single model dominating across all cohorts. These findings suggest that GPT-5 family models can achieve moderate accuracy in structured neuro-oncological VQA tasks, but not at a level acceptable for clinical use.
[172] TexVerse: A Universe of 3D Objects with High-Resolution Textures
Yibo Zhang,Li Zhang,Rui Ma,Nan Cao
Main category: cs.CV
TL;DR: TexVerse is a large-scale, high-resolution 3D dataset that addresses the lack of suitable datasets for end-to-end high-resolution texture creation, offering potential applications in various 3D vision and graphics tasks.
Details
Motivation: Creating high-resolution textures end-to-end remains underexplored due to the lack of suitable datasets. Method: The paper introduces TexVerse, a large-scale 3D dataset with high-resolution textures, sourced from Sketchfab, and provides specialized subsets and detailed model annotations. Result: TexVerse fills the gap in high-resolution 3D datasets with over 858K unique 3D models, including more than 158K models with PBR materials, TexVerse-Skeleton with 69K rigged models, and TexVerse-Animation with 54K animated models. Conclusion: TexVerse offers a high-quality data resource with wide-ranging potential applications in texture synthesis, PBR material development, animation, and various 3D vision and graphics tasks. Abstract: We introduce TexVerse, a large-scale 3D dataset featuring high-resolution textures. While recent advances in large-scale 3D datasets have enhanced high-resolution geometry generation, creating high-resolution textures end-to-end remains underexplored due to the lack of suitable datasets. TexVerse fills this gap with a curated collection of over 858K unique high-resolution 3D models sourced from Sketchfab, including more than 158K models with physically based rendering (PBR) materials. Each model encompasses all of its high-resolution variants, bringing the total to 1.6M 3D instances. TexVerse also includes specialized subsets: TexVerse-Skeleton, with 69K rigged models, and TexVerse-Animation, with 54K animated models, both preserving original skeleton and animation data uploaded by the user. We also provide detailed model annotations describing overall characteristics, structural components, and intricate features. TexVerse offers a high-quality data resource with wide-ranging potential applications in texture synthesis, PBR material development, animation, and various 3D vision and graphics tasks.
[173] Medico 2025: Visual Question Answering for Gastrointestinal Imaging
Sushant Gautam,Vajira Thambawita,Michael Riegler,Pål Halvorsen,Steven Hicks
Main category: cs.CV
TL;DR: The Medico 2025 challenge promotes trustworthy AI in medical imaging by focusing on explainable VQA models for gastrointestinal endoscopy, using the Kvasir-VQA-x1 dataset.
Details
Motivation: To develop explainable AI models that provide interpretable justifications for clinical decision-making in gastrointestinal endoscopy image analysis. Method: The challenge introduces two subtasks: answering visual questions using the Kvasir-VQA-x1 dataset and generating multimodal explanations. It combines quantitative metrics with expert-reviewed assessments. Result: A benchmark dataset with 6,500 images and 159,549 QA pairs, alongside a challenge framework for evaluating performance and explainability. Conclusion: The Medico 2025 challenge aims to advance trustworthy AI in medical image analysis by focusing on explainable models for gastrointestinal imaging VQA tasks. Abstract: The Medico 2025 challenge addresses Visual Question Answering (VQA) for Gastrointestinal (GI) imaging, organized as part of the MediaEval task series. The challenge focuses on developing Explainable Artificial Intelligence (XAI) models that answer clinically relevant questions based on GI endoscopy images while providing interpretable justifications aligned with medical reasoning. It introduces two subtasks: (1) answering diverse types of visual questions using the Kvasir-VQA-x1 dataset, and (2) generating multimodal explanations to support clinical decision-making. The Kvasir-VQA-x1 dataset, created from 6,500 images and 159,549 complex question-answer (QA) pairs, serves as the benchmark for the challenge. By combining quantitative performance metrics and expert-reviewed explainability assessments, this task aims to advance trustworthy Artificial Intelligence (AI) in medical image analysis. Instructions, data access, and an updated guide for participation are available in the official competition repository: https://github.com/simula/MediaEval-Medico-2025
[174] ToonComposer: Streamlining Cartoon Production with Generative Post-Keyframing
Lingen Li,Guangzhi Wang,Zhaoyang Zhang,Yaowei Li,Xiaoyu Li,Qi Dou,Jinwei Gu,Tianfan Xue,Ying Shan
Main category: cs.CV
TL;DR: ToonComposer is a new AI model that combines inbetweening and colorization for cartoon production, reducing manual effort and improving animation quality and flexibility.
Details
Motivation: Traditional cartoon production involves labor-intensive stages like keyframing, inbetweening, and colorization. Existing AI methods handle these stages separately, leading to errors and artifacts. A more integrated solution is needed to improve efficiency and quality. Method: ToonComposer uses a sparse sketch injection mechanism and a cartoon adaptation method with a spatial low-rank adapter to integrate inbetweening and colorization, leveraging a modern video foundation model tailored to the cartoon domain. Result: ToonComposer achieves superior visual quality, motion consistency, and production efficiency compared to existing methods, as demonstrated on the PKBench benchmark with human-drawn sketches. Conclusion: ToonComposer provides a more efficient and flexible solution for AI-assisted cartoon production by unifying inbetweening and colorization, reducing manual workload, and improving motion control. Abstract: Traditional cartoon and anime production involves keyframing, inbetweening, and colorization stages, which require intensive manual effort. Despite recent advances in AI, existing methods often handle these stages separately, leading to error accumulation and artifacts. For instance, inbetweening approaches struggle with large motions, while colorization methods require dense per-frame sketches. To address this, we introduce ToonComposer, a generative model that unifies inbetweening and colorization into a single post-keyframing stage. ToonComposer employs a sparse sketch injection mechanism to provide precise control using keyframe sketches. Additionally, it uses a cartoon adaptation method with the spatial low-rank adapter to tailor a modern video foundation model to the cartoon domain while keeping its temporal prior intact. Requiring as few as a single sketch and a colored reference frame, ToonComposer excels with sparse inputs, while also supporting multiple sketches at any temporal location for more precise motion control. 
This dual capability reduces manual workload and improves flexibility, empowering artists in real-world scenarios. To evaluate our model, we further created PKBench, a benchmark featuring human-drawn sketches that simulate real-world use cases. Our evaluation demonstrates that ToonComposer outperforms existing methods in visual quality, motion consistency, and production efficiency, offering a superior and more flexible solution for AI-assisted cartoon production.
[175] STream3R: Scalable Sequential 3D Reconstruction with Causal Transformer
Yushi Lan,Yihang Luo,Fangzhou Hong,Shangchen Zhou,Honghua Chen,Zhaoyang Lyu,Shuai Yang,Bo Dai,Chen Change Loy,Xingang Pan
Main category: cs.CV
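TL;DR: STream3R is a 3D reconstruction method based on a decoder-only Transformer that processes image sequences with causal attention, addresses the limitations of traditional methods in dynamic scenes, and outperforms prior work on both static and dynamic scene benchmarks.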
Details
Motivation: Existing multi-view reconstruction methods either depend on expensive global optimization or rely on simplistic memory mechanisms that scale poorly, and they often fail on dynamic scenes. Method: STream3R reformulates pointmap prediction as a decoder-only Transformer problem and uses causal attention, inspired by modern language modeling, to build an efficient streaming framework for processing image sequences. Result: STream3R outperforms prior work on both static and dynamic scene benchmarks and is inherently compatible with LLM-style training infrastructure, enabling efficient pretraining and fine-tuning. Conclusion: STream3R demonstrates the potential of causal Transformer models for online 3D perception, paving the way for real-time 3D understanding in streaming environments. Abstract: We present STream3R, a novel approach to 3D reconstruction that reformulates pointmap prediction as a decoder-only Transformer problem. Existing state-of-the-art methods for multi-view reconstruction either depend on expensive global optimization or rely on simplistic memory mechanisms that scale poorly with sequence length. In contrast, STream3R introduces a streaming framework that processes image sequences efficiently using causal attention, inspired by advances in modern language modeling. By learning geometric priors from large-scale 3D datasets, STream3R generalizes well to diverse and challenging scenarios, including dynamic scenes where traditional methods often fail. Extensive experiments show that our method consistently outperforms prior work across both static and dynamic scene benchmarks. Moreover, STream3R is inherently compatible with LLM-style training infrastructure, enabling efficient large-scale pretraining and fine-tuning for various downstream 3D tasks. Our results underscore the potential of causal Transformer models for online 3D perception, paving the way for real-time 3D understanding in streaming environments. More details can be found in our project page: https://nirvanalan.github.io/projects/stream3r.
[176] MAESTRO: Masked AutoEncoders for Multimodal, Multitemporal, and Multispectral Earth Observation Data
Antoine Labatie,Michael Vaccaro,Nina Lardiere,Anatol Garioud,Nicolas Gonthier
Main category: cs.CV
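TL;DR: MAESTRO is a new self-supervised learning method for remote sensing data that excels on tasks relying on multitemporal dynamics while remaining competitive on tasks dominated by a single mono-temporal modality.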
Details
Motivation: Self-supervised learning holds promise for remote sensing, but standard methods must be adapted to the unique characteristics of Earth observation data. Method: The paper conducts a comprehensive benchmark of fusion strategies and reconstruction target normalization schemes, and proposes MAESTRO, an optimized adaptation of the Masked Autoencoder for self-supervised learning. Result: Evaluated on four Earth observation datasets, MAESTRO sets a new state of the art on tasks that strongly rely on multitemporal dynamics while remaining competitive on tasks dominated by a single mono-temporal modality. Conclusion: MAESTRO is a novel self-supervised learning method suited to the multimodal, multitemporal, and multispectral nature of remote sensing data. Abstract: Self-supervised learning holds great promise for remote sensing, but standard self-supervised methods must be adapted to the unique characteristics of Earth observation data. We take a step in this direction by conducting a comprehensive benchmark of fusion strategies and reconstruction target normalization schemes for multimodal, multitemporal, and multispectral Earth observation data. Based on our findings, we propose MAESTRO, a novel adaptation of the Masked Autoencoder, featuring optimized fusion strategies and a tailored target normalization scheme that introduces a spectral prior as a self-supervisory signal. Evaluated on four Earth observation datasets, MAESTRO sets a new state-of-the-art on tasks that strongly rely on multitemporal dynamics, while remaining highly competitive on tasks dominated by a single mono-temporal modality. Code to reproduce all our experiments is available at https://github.com/ignf/maestro.
[177] ESSENTIAL: Episodic and Semantic Memory Integration for Video Class-Incremental Learning
Jongseo Lee,Kyungho Bae,Kyle Min,Gyeong-Moon Park,Jinwoo Choi
Main category: cs.CV
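TL;DR: This paper proposes ESSENTIAL, a new video class-incremental learning method that integrates episodic and semantic memory to maintain high performance while reducing memory consumption.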
Details
Motivation: The paper addresses the trade-off between memory efficiency and performance in existing video class-incremental learning methods, which either store temporally dense samples (memory-inefficient) or temporally sparse samples (degraded performance). Method: The paper proposes ESSENTIAL, a video class-incremental learning method that combines episodic memory, which stores temporally sparse features, with semantic memory, which stores general knowledge, and introduces a novel memory retrieval (MR) module that integrates the two through cross-attention. Result: ESSENTIAL is validated on multiple video datasets (UCF-101, HMDB51, Something-Something-V2, ActivityNet, and Kinetics-400), showing favorable performance with significantly reduced memory. Conclusion: ESSENTIAL achieves favorable VCIL performance on multiple video datasets while significantly reducing memory usage. Abstract: In this work, we tackle the problem of video class-incremental learning (VCIL). Many existing VCIL methods mitigate catastrophic forgetting by rehearsal training with a few temporally dense samples stored in episodic memory, which is memory-inefficient. Alternatively, some methods store temporally sparse samples, sacrificing essential temporal information and thereby resulting in inferior performance. To address this trade-off between memory-efficiency and performance, we propose EpiSodic and SEmaNTIc memory integrAtion for video class-incremental Learning (ESSENTIAL). ESSENTIAL consists of episodic memory for storing temporally sparse features and semantic memory for storing general knowledge represented by learnable prompts. We introduce a novel memory retrieval (MR) module that integrates episodic memory and semantic prompts through cross-attention, enabling the retrieval of temporally dense features from temporally sparse features. We rigorously validate ESSENTIAL on diverse datasets: UCF-101, HMDB51, and Something-Something-V2 from the TCD benchmark and UCF-101, ActivityNet, and Kinetics-400 from the vCLIMB benchmark. Remarkably, with significantly reduced memory, ESSENTIAL achieves favorable performance on the benchmarks.
[178] Human-in-Context: Unified Cross-Domain 3D Human Motion Modeling via In-Context Learning
Mengyuan Liu,Xinshun Wang,Zhongbin Fang,Deheng Ye,Xia Li,Tao Tang,Songtao Wu,Xiangtai Li,Ming-Hsuan Yang
Main category: cs.CV
TL;DR: This paper proposes Human-in-Context (HiC), a unified cross-domain model for 3D human motion that integrates pose and mesh representations, improves generalization across modalities and datasets, and eliminates reliance on domain-specific components and multi-stage training.
Details
Motivation: Existing cross-domain models for 3D human motion often rely on domain-specific components and multi-stage training, limiting their practicality and scalability. This paper aims to address these challenges by proposing a unified model trained through a single process, eliminating the need for domain-specific components and multi-stage training. Method: The paper introduces Pose-in-Context (PiC) as a starting point, which utilizes in-context learning to develop a pose-centric cross-domain model. It then proposes Human-in-Context (HiC), an extension of PiC that integrates pose and mesh representations, expands task coverage, and incorporates larger-scale datasets. HiC also introduces a max-min similarity prompt sampling strategy and a dual-branch context injection network architecture to improve generalization and contextual dependency handling. Result: HiC outperforms PiC in terms of generalization, data scale, and performance across a wide range of domains. The experimental results validate HiC's effectiveness in handling modality diversity, prompting strategy, and contextual dependency, thereby offering improved flexibility and scalability for unified cross-domain 3D human motion modeling. Conclusion: HiC demonstrates potential in constructing a unified cross-domain 3D human motion model with enhanced flexibility and scalability, outperforming PiC in generalization, data scale, and performance across various domains. Abstract: This paper aims to model 3D human motion across domains, where a single model is expected to handle multiple modalities, tasks, and datasets. Existing cross-domain models often rely on domain-specific components and multi-stage training, which limits their practicality and scalability. To overcome these challenges, we propose a new setting to train a unified cross-domain model through a single process, eliminating the need for domain-specific components and multi-stage training. 
We first introduce Pose-in-Context (PiC), which leverages in-context learning to create a pose-centric cross-domain model. While PiC generalizes across multiple pose-based tasks and datasets, it encounters difficulties with modality diversity, prompting strategy, and contextual dependency handling. We thus propose Human-in-Context (HiC), an extension of PiC that broadens generalization across modalities, tasks, and datasets. HiC combines pose and mesh representations within a unified framework, expands task coverage, and incorporates larger-scale datasets. Additionally, HiC introduces a max-min similarity prompt sampling strategy to enhance generalization across diverse domains and a network architecture with dual-branch context injection for improved handling of contextual dependencies. Extensive experimental results show that HiC performs better than PiC in terms of generalization, data scale, and performance across a wide range of domains. These results demonstrate the potential of HiC for building a unified cross-domain 3D human motion model with improved flexibility and scalability. The source codes and models are available at https://github.com/BradleyWang0416/Human-in-Context.
[179] Puppeteer: Rig and Animate Your 3D Models
Chaoyue Song,Xiu Li,Fan Yang,Zhongcong Xu,Jiacheng Wei,Fayao Liu,Jiashi Feng,Guosheng Lin,Jianfeng Zhang
Main category: cs.CV
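TL;DR: Puppeteer is a framework for automatic rigging and animation of 3D models that significantly improves animation quality and computational efficiency through novel Transformer-based models and an optimization strategy.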
Details
Motivation: In current 3D content creation pipelines, turning static models into animated assets is a bottleneck because rigging and animation require heavy expert intervention, while existing generative AI focuses mainly on static model creation. Method: Puppeteer predicts skeletal structures with an auto-regressive Transformer that uses a joint-based tokenization strategy and a hierarchical ordering with stochastic perturbation; it infers skinning weights with an attention-based architecture; and it generates animations with a differentiable optimization-based pipeline. Result: Across multiple benchmarks, the method significantly outperforms existing techniques in skeletal prediction accuracy and skinning quality, efficiently generates stable, high-fidelity animations, and handles diverse 3D content ranging from game assets to AI-generated shapes. Conclusion: Puppeteer provides an efficient and robust solution for automatic rigging and animation of 3D models, easing a bottleneck in 3D content creation pipelines. Abstract: Modern interactive applications increasingly demand dynamic 3D content, yet the transformation of static 3D models into animated assets constitutes a significant bottleneck in content creation pipelines. While recent advances in generative AI have revolutionized static 3D model creation, rigging and animation continue to depend heavily on expert intervention. We present Puppeteer, a comprehensive framework that addresses both automatic rigging and animation for diverse 3D objects. Our system first predicts plausible skeletal structures via an auto-regressive transformer that introduces a joint-based tokenization strategy for compact representation and a hierarchical ordering methodology with stochastic perturbation that enhances bidirectional learning capabilities. It then infers skinning weights via an attention-based architecture incorporating topology-aware joint attention that explicitly encodes inter-joint relationships based on skeletal graph distances. Finally, we complement these rigging advances with a differentiable optimization-based animation pipeline that generates stable, high-fidelity animations while being computationally more efficient than existing approaches. Extensive evaluations across multiple benchmarks demonstrate that our method significantly outperforms state-of-the-art techniques in both skeletal prediction accuracy and skinning quality.
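The "skeletal graph distances" that the abstract says are encoded by the joint attention can be illustrated with a small sketch. It assumes the skeleton is given as a parent-index array and that distance means hop count along bones. The function name and input format are hypothetical, and how Puppeteer actually feeds these distances into attention is not described here.

```python
from collections import deque


def joint_graph_distances(parents):
    """Return the matrix of hop distances between all joints of a skeleton.

    parents[i] is the parent index of joint i, or -1 for a root joint
    (a hypothetical input format chosen for this sketch).
    Unreachable pairs (for example, across separate roots) stay at -1.
    """
    n = len(parents)
    adjacency = [[] for _ in range(n)]
    for child, parent in enumerate(parents):
        if parent >= 0:
            adjacency[child].append(parent)
            adjacency[parent].append(child)

    dist = [[-1] * n for _ in range(n)]
    for src in range(n):
        # Breadth-first search from each joint over the undirected bone graph.
        dist[src][src] = 0
        queue = deque([src])
        while queue:
            u = queue.popleft()
            for v in adjacency[u]:
                if dist[src][v] < 0:
                    dist[src][v] = dist[src][u] + 1
                    queue.append(v)
    return dist


# Root 0 with children 1 and 3; joint 2 hangs off joint 1.
distances = joint_graph_distances([-1, 0, 1, 0])
```

A matrix like this could, for instance, be mapped to a learned bias added to attention logits, but that design is an assumption rather than something stated in the abstract.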
The system robustly processes diverse 3D content, ranging from professionally designed game assets to AI-generated shapes, producing temporally coherent animations that eliminate the jittering issues common in existing methods.
[180] Quantum Visual Fields with Neural Amplitude Encoding
Shuteng Wang,Christian Theobalt,Vladislav Golyanik
Main category: cs.CV
TL;DR: This paper introduces Quantum Visual Field (QVF), a new quantum learning approach for 2D and 3D visual data, outperforming classical and quantum baselines in accuracy and efficiency.