cs.CL
[1] Bridging AI Innovation and Healthcare Needs: Lessons Learned from Incorporating Modern NLP at The BC Cancer Registry
Lovedeep Gondara,Gregory Arbour,Raymond Ng,Jonathan Simkin,Shebnum Devji
Main category: cs.CL
TL;DR: This paper shares practical insights from implementing NLP at a cancer registry, emphasizing that successful AI adoption in healthcare hinges on collaboration, clear objectives, data quality, and human oversight.
Details
Motivation: NLP has the potential to improve efficiency in healthcare settings, but deploying such solutions poses practical challenges. Method: The authors draw on their experience implementing NLP models at the British Columbia Cancer Registry for information extraction and classification tasks. Result: Key lessons learned include the importance of aligning NLP projects with business goals, iterative development, interdisciplinary collaboration, hybrid model approaches, data quality management, human-in-the-loop validation, and building AI literacy within organizations. Conclusion: The paper concludes that successful implementation of NLP and AI in healthcare requires a focus on business objectives, interdisciplinary collaboration, pragmatic model selection, data quality, error mitigation, and organizational AI literacy. Abstract: Automating data extraction from clinical documents offers significant potential to improve efficiency in healthcare settings, yet deploying Natural Language Processing (NLP) solutions presents practical challenges. Drawing upon our experience implementing various NLP models for information extraction and classification tasks at the British Columbia Cancer Registry (BCCR), this paper shares key lessons learned throughout the project lifecycle. We emphasize the critical importance of defining problems based on clear business objectives rather than solely technical accuracy, adopting an iterative approach to development, and fostering deep interdisciplinary collaboration and co-design involving domain experts, end-users, and ML specialists from inception. Further insights highlight the need for pragmatic model selection (including hybrid approaches and simpler methods where appropriate), rigorous attention to data quality (representativeness, drift, annotation), robust error mitigation strategies involving human-in-the-loop validation and ongoing audits, and building organizational AI literacy. These practical considerations, generalizable beyond cancer registries, provide guidance for healthcare organizations seeking to successfully implement AI/NLP solutions to enhance data management processes and ultimately improve patient care and public health outcomes.
[2] A Transparent Fairness Evaluation Protocol for Open-Source Language Model Benchmarking on the Blockchain
Hugo Massaroli,Leonardo Iara,Emmanuel Iarussi,Viviana Siless
Main category: cs.CL
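TL;DR: This paper introduces a transparent protocol that uses blockchain technology to evaluate the fairness of open-source large language models, and it reveals cross-lingual disparities.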
Details
Motivation: As large language models are increasingly deployed in real-world applications, concerns about their fairness are growing, especially in high-stakes domains such as criminal justice, education, healthcare, and finance. Method: Smart contracts on the Internet Computer Protocol (ICP) blockchain execute HTTP requests and store the datasets, prompts, and metrics on-chain, making evaluations verifiable, immutable, and reproducible. Result: Using the PISA dataset and the Kaleidoscope benchmark, the study evaluates the Llama, DeepSeek, and Mistral models on academic performance prediction and in a multilingual setting, revealing social bias and cross-lingual disparities. Conclusion: The paper proposes a blockchain-based transparent evaluation protocol for benchmarking the fairness of open-source large language models, and its multilingual evaluation reveals cross-lingual disparities. Abstract: Large language models (LLMs) are increasingly deployed in real-world applications, yet concerns about their fairness persist, especially in high-stakes domains like criminal justice, education, healthcare, and finance. This paper introduces a transparent evaluation protocol for benchmarking the fairness of open-source LLMs using smart contracts on the Internet Computer Protocol (ICP) blockchain (Foundation, 2023). Our method ensures verifiable, immutable, and reproducible evaluations by executing on-chain HTTP requests to hosted Hugging Face endpoints and storing datasets, prompts, and metrics directly on-chain. We benchmark the Llama, DeepSeek, and Mistral models on the PISA dataset for academic performance prediction (OECD, 2018), a dataset suitable for fairness evaluation using statistical parity and equal opportunity metrics (Hardt et al., 2016). We also evaluate structured Context Association Metrics derived from the StereoSet dataset (Nadeem et al., 2020) to measure social bias in contextual associations. We further extend our analysis with a multilingual evaluation across English, Spanish, and Portuguese using the Kaleidoscope benchmark (Salazar et al., 2025), revealing cross-linguistic disparities. All code and results are open source, enabling community audits and longitudinal fairness tracking across model versions.
[3] Thematic and Task-Based Categorization of K-12 GenAI Usages with Hierarchical Topic Modeling
Johannes Schneider,Béatrice S. Hasler,Michaela Varrone,Fabian Hoya,Thomas Schroffenegger,Dana-Kristin Mah,Karl Peböck
Main category: cs.CL
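TL;DR: This paper analyzes educational interaction data with a novel topic modeling approach, finds that LLMs perform well for this kind of analysis, and points out directions and potential concerns for future research.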
Details
Motivation: Existing research lacks categorizations of content and tasks in K-12 education, and traditional topic modeling methods perform poorly, so text analysis techniques better matched to practical needs are required. Method: The authors analyze anonymous interaction data, combining natural language processing with task categorization, and use a hierarchical categorization approach to classify more than 17,000 messages, applying LLMs with pre-processing and explicit instructions. Result: The study reveals new application scenarios, shows that LLMs have advantages over traditional methods for analyzing educational text, and offers educators and researchers directions for enriching GenAI usage. Conclusion: The paper highlights the effectiveness of advanced LLMs (such as ChatGPT) for analyzing educational text and raises open questions and concerns for future research. Abstract: We analyze anonymous interaction data of minors in classrooms spanning several months, schools, and subjects employing a novel, simple topic modeling approach. Specifically, we categorize more than 17,000 messages generated by students, teachers, and ChatGPT in two dimensions: content (such as nature and people) and tasks (such as writing and explaining). Our hierarchical categorization done separately for each dimension includes exemplary prompts, and provides both a high-level overview as well as tangible insights. Prior works mostly lack a content or thematic categorization. While task categorizations are more prevalent in education, most have not been supported by real-world data for K-12. In turn, it is not surprising that our analysis yielded a number of novel applications. In deriving these insights, we found that many of the well-established classical and emerging computational methods, i.e., topic modeling, for analysis of large amounts of texts underperform, leading us to directly apply state-of-the-art LLMs with adequate pre-processing to achieve hierarchical topic structures with better human alignment through explicit instructions than prior approaches. Our findings support fellow researchers, teachers and students in enriching the usage of GenAI, while our discussion also highlights a number of concerns and open questions for future research.
[4] INTIMA: A Benchmark for Human-AI Companionship Behavior
Lucie-Aimée Kaffee,Giada Pistilli,Yacine Jernite
Main category: cs.CL
TL;DR: The study introduces INTIMA, a benchmark for evaluating AI companionship behaviors, revealing that while companionship-reinforcing responses are common, there are notable differences between AI models in handling emotional and boundary-related interactions, raising concerns about user well-being.
Details
Motivation: The motivation for this study is the increasing emotional attachment users form with AI systems, which has both positive and concerning implications. The researchers aim to create a standardized benchmark to evaluate how AI models handle companionship-related behaviors, particularly focusing on emotional support and boundary-setting. Method: The researchers developed the Interactions and Machine Attachment Benchmark (INTIMA), which includes a taxonomy of 31 behaviors across four categories and 368 targeted prompts. Responses to these prompts were classified as companionship-reinforcing, boundary-maintaining, or neutral. This benchmark was applied to several AI models including Gemma-3, Phi-4, o3-mini, and Claude-4. Result: The application of INTIMA to various AI models revealed that companionship-reinforcing behaviors are much more common across all models. However, different commercial providers prioritize different categories within the more sensitive parts of the benchmark. Marked differences were observed between models in handling emotionally charged interactions. Conclusion: The study concludes that while companionship-reinforcing behaviors are common across AI models, there are significant differences in how models handle emotionally charged interactions, highlighting the need for more consistent and thoughtful approaches to boundary-setting and emotional support. Abstract: AI companionship, where users develop emotional bonds with AI systems, has emerged as a significant pattern with positive but also concerning implications. We introduce Interactions and Machine Attachment Benchmark (INTIMA), a benchmark for evaluating companionship behaviors in language models. Drawing from psychological theories and user data, we develop a taxonomy of 31 behaviors across four categories and 368 targeted prompts. Responses to these prompts are evaluated as companionship-reinforcing, boundary-maintaining, or neutral. 
Applying INTIMA to Gemma-3, Phi-4, o3-mini, and Claude-4 reveals that companionship-reinforcing behaviors remain much more common across all models, though we observe marked differences between models. Different commercial providers prioritize different categories within the more sensitive parts of the benchmark, which is concerning since both appropriate boundary-setting and emotional support matter for user well-being. These findings highlight the need for more consistent approaches to handling emotionally charged interactions.
[5] XFacta: Contemporary, Real-World Dataset and Evaluation for Multimodal Misinformation Detection with Multimodal LLMs
Yuzhuo Xiao,Zeyu Han,Yuhan Wang,Huaizu Jiang
Main category: cs.CL
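TL;DR: This paper introduces XFacta, a new dataset for evaluating multimodal large language models (MLLMs) on detecting multimodal misinformation on social media, and proposes a semi-automatic detection framework that keeps the dataset up to date.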
Details
Motivation: The rapid spread of multimodal misinformation on social media calls for more effective detection methods. Current datasets are either outdated, which biases evaluation, or artificially synthesized, so they do not reflect real-world misinformation patterns. In addition, comprehensive analyses of MLLM-based model design strategies are lacking. Method: The authors introduce XFacta, a contemporary, real-world dataset, systematically evaluate a range of MLLM-based misinformation detection strategies, and build a semi-automatic detection framework that continuously updates XFacta. Result: The study provides a comprehensive evaluation of MLLM-based detection strategies and contributes the XFacta dataset and detection framework to support further progress in multimodal misinformation detection. Conclusion: By introducing the XFacta dataset and a semi-automatic detection framework, the paper offers valuable insights and practices for multimodal misinformation detection. Abstract: The rapid spread of multimodal misinformation on social media calls for more effective and robust detection methods. Recent advances leveraging multimodal large language models (MLLMs) have shown potential in addressing this challenge. However, it remains unclear exactly where the bottleneck of existing approaches lies (evidence retrieval vs. reasoning), hindering further advances in this field. On the dataset side, existing benchmarks either contain outdated events, leading to evaluation bias due to discrepancies with contemporary social media scenarios as MLLMs can simply memorize these events, or are artificially synthetic, failing to reflect real-world misinformation patterns. Additionally, comprehensive analyses of MLLM-based model design strategies are lacking. To address these issues, we introduce XFacta, a contemporary, real-world dataset that is better suited for evaluating MLLM-based detectors. We systematically evaluate various MLLM-based misinformation detection strategies, assessing models across different architectures and scales, as well as benchmarking against existing detection methods. Building on these analyses, we further enable a semi-automatic detection-in-the-loop framework that continuously updates XFacta with new content to maintain its contemporary relevance. Our analysis provides valuable insights and practices for advancing the field of multimodal misinformation detection. The code and data have been released.
[6] AutoGeTS: Knowledge-based Automated Generation of Text Synthetics for Improving Text Classification
Chenhao Xue,Yuanzhe Jin,Adrian Carrasco-Revilla,Joyraj Chakraborty,Min Chen
Main category: cs.CL
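TL;DR: This paper studies how synthetic data generated by large language models can improve text classification models, and proposes an automated workflow and an ensemble method for selecting more effective search strategies.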
Details
Motivation: The paper addresses a major challenge in developing text classification models for real-world applications: the difficulty of collecting sufficient data for every text class. Method: The approach uses large language models to generate synthetic data and designs an automated workflow that applies search strategies to find input examples yielding more effective synthetic data. Result: The results show that the proposed ensemble method selects a suitable search strategy according to the characteristics of each class and improves classification models more effectively than any individual search strategy. Conclusion: The paper concludes that generating synthetic data with large language models and using it to improve model performance yields an effective automated workflow, in which the ensemble method outperforms each individual strategy. Abstract: When developing text classification models for real-world applications, one major challenge is the difficulty of collecting sufficient data for all text classes. In this work, we address this challenge by utilizing large language models (LLMs) to generate synthetic data and using such data to improve the performance of the models without waiting for more real data to be collected and labelled. As an LLM generates different synthetic data in response to different input examples, we formulate an automated workflow, which searches for input examples that lead to more "effective" synthetic data for improving the model concerned. We study three search strategies with an extensive set of experiments, and use experiment results to inform an ensemble algorithm that selects a search strategy according to the characteristics of a class. Our further experiments demonstrate that this ensemble approach is more effective than each individual strategy in our automated workflow for improving classification models using LLMs.
[7] HiFACTMix: A Code-Mixed Benchmark and Graph-Aware Model for Evidence-Based Political Claim Verification in Hinglish
Rakesh Thakur,Sneha Sharma,Gauri Chopra
Main category: cs.CL
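TL;DR: This paper introduces HiFACTMix, a new multilingual fact-checking model designed for political claims in low-resource, code-mixed languages such as Hinglish; experiments show that it outperforms existing methods.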
Details
Motivation: Existing fact-checking systems focus mainly on high-resource, monolingual settings and fail to generalize to real-world political discourse in linguistically diverse regions such as India. The paper aims to address the challenge of fact-checking in low-resource, code-mixed language settings. Method: The paper introduces a dataset called HiFACT and proposes a graph-aware, retrieval-augmented fact-checking model that combines multilingual contextual encoding, claim-evidence semantic alignment, evidence graph construction, graph neural reasoning, and natural language explanation generation. Result: Experiments show that HiFACTMix is more accurate than state-of-the-art multilingual baselines and provides faithful explanations for its verdicts. Conclusion: The paper presents HiFACTMix, a new fact-checking model for code-mixed, low-resource languages such as Hinglish, and shows that it outperforms existing models at multilingual fact-checking in political contexts. Abstract: Fact-checking in code-mixed, low-resource languages such as Hinglish remains an underexplored challenge in natural language processing. Existing fact-verification systems largely focus on high-resource, monolingual settings and fail to generalize to real-world political discourse in linguistically diverse regions like India. Given the widespread use of Hinglish by public figures, particularly political figures, and the growing influence of social media on public opinion, there's a critical need for robust, multilingual and context-aware fact-checking tools. To address this gap, a novel benchmark dataset, HiFACT, is introduced with 1,500 real-world factual claims made by 28 Indian state Chief Ministers in Hinglish, under a highly code-mixed low-resource setting. Each claim is annotated with textual evidence and veracity labels. To evaluate this benchmark, a novel graph-aware, retrieval-augmented fact-checking model is proposed that combines multilingual contextual encoding, claim-evidence semantic alignment, evidence graph construction, graph neural reasoning, and natural language explanation generation. Experimental results show that HiFACTMix outperforms state-of-the-art multilingual baseline models in accuracy and provides faithful justifications for its verdicts. This work opens a new direction for multilingual, code-mixed, and politically grounded fact verification research.
[8] Semantic Structure in Large Language Model Embeddings
Austin C. Kozlowski,Callin Dai,Andrei Boutyline
Main category: cs.CL
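TL;DR: Semantic structure in large language models resembles that of human language: semantic information is low-dimensional and highly entangled.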
Details
Motivation: Human ratings of words across different semantic scales can be reduced to a low-dimensional form, so the paper investigates whether large language models exhibit a similar semantic structure. Method: The authors analyze the semantic associations in LLM embedding matrices, study the projections of words onto semantic directions defined by antonym pairs, and examine how these projections affect geometrically aligned features. Result: Projections of words onto specific semantic directions correlate highly with human ratings, semantic features are distributed along a three-dimensional subspace in the model, and changing one semantic direction affects other related features. Conclusion: Semantic features are entangled in large language models much as they are interconnected in human language, and semantic information, despite its apparent complexity, is surprisingly low-dimensional. Abstract: Psychological research consistently finds that human ratings of words across diverse semantic scales can be reduced to a low-dimensional form with relatively little information loss. We find that the semantic associations encoded in the embedding matrices of large language models (LLMs) exhibit a similar structure. We show that the projections of words on semantic directions defined by antonym pairs (e.g. kind - cruel) correlate highly with human ratings, and further find that these projections effectively reduce to a 3-dimensional subspace within LLM embeddings, closely resembling the patterns derived from human survey responses. Moreover, we find that shifting tokens along one semantic direction causes off-target effects on geometrically aligned features proportional to their cosine similarity. These findings suggest that semantic features are entangled within LLMs similarly to how they are interconnected in human language, and a great deal of semantic information, despite its apparent complexity, is surprisingly low-dimensional. Furthermore, accounting for this semantic structure may prove essential for avoiding unintended consequences when steering features.
[9] User Perception of Attention Visualizations: Effects on Interpretability Across Evidence-Based Medical Documents
Andrés Carvallo,Denis Parra,Peter Brusilovsky,Hernan Valdivieso,Gabriel Rada,Ivania Donoso,Vladimir Araujo
Main category: cs.CL
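TL;DR: A Transformer model (XLNet) classifies biomedical literature accurately, but how attention weights are visualized affects their usefulness as an explanation tool; users prefer intuitive encodings such as text brightness or background color.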
Details
Motivation: Attention has been proposed as a mechanism for interpretability via attention weights, but whether it can effectively help physicians understand and use AI systems in evidence-based medicine remains contested. Method: A user study evaluates whether attention-based explanations help with biomedical literature classification and explores the effect of different attention visualizations. Result: The XLNet model classifies documents accurately, but attention weights are not generally regarded as an effective explanation tool; perceived usefulness varies with the visualization, and users prefer intuitive text brightness or background color. Conclusion: How attention weights are visualized significantly affects their perceived usefulness as an explanation tool; future designs should consider user preferences rather than rely solely on precise encodings. Abstract: The attention mechanism is a core component of the Transformer architecture. Beyond improving performance, attention has been proposed as a mechanism for explainability via attention weights, which are associated with input features (e.g., tokens in a document). In this context, larger attention weights may imply more relevant features for the model's prediction. In evidence-based medicine, such explanations could support physicians' understanding and interaction with AI systems used to categorize biomedical literature. However, there is still no consensus on whether attention weights provide helpful explanations. Moreover, little research has explored how visualizing attention affects its usefulness as an explanation aid. To bridge this gap, we conducted a user study to evaluate whether attention-based explanations support users in biomedical document classification and whether there is a preferred way to visualize them. The study involved medical experts from various disciplines who classified articles based on study design (e.g., systematic reviews, broad synthesis, randomized and non-randomized trials). Our findings show that the Transformer model (XLNet) classified documents accurately; however, the attention weights were not perceived as particularly helpful for explaining the predictions. Nevertheless, this perception varied significantly depending on how attention was visualized. Contrary to Munzner's principle of visual effectiveness, which favors precise encodings like bar length, users preferred more intuitive formats, such as text brightness or background color. While our results do not confirm the overall utility of attention weights for explanation, they suggest that their perceived helpfulness is influenced by how they are visually presented.
[10] From Answers to Questions: EQGBench for Evaluating LLMs' Educational Question Generation
Chengliang Zhou,Mei Wang,Ting Zhang,Qiannan Zhu,Jian Li,Hua Huang
Main category: cs.CL
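TL;DR: EQGBench is a comprehensive benchmark for evaluating large language models on Chinese educational question generation, revealing room for improvement in LLMs' ability to generate questions with educational value.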
Details
Motivation: Moving from providing answers to generating high-quality educational questions poses significant challenges that need to be explored in order to advance Educational Question Generation (EQG) and help LLMs generate pedagogically valuable and effective questions. Method: The paper introduces EQGBench, a comprehensive benchmark designed specifically to evaluate LLMs on Chinese EQG, and uses it to systematically evaluate 46 mainstream large models. Result: A dataset of 900 evaluation samples was built, covering three fundamental middle school disciplines, and a five-dimensional evaluation framework is used to simulate realistic educational scenarios. Conclusion: EQGBench reveals that LLMs still have considerable room for development in generating questions that reflect educational value and foster students' comprehensive abilities. Abstract: Large Language Models (LLMs) have demonstrated remarkable capabilities in mathematical problem-solving. However, the transition from providing answers to generating high-quality educational questions presents significant challenges that remain underexplored. To advance Educational Question Generation (EQG) and facilitate LLMs in generating pedagogically valuable and educationally effective questions, we introduce EQGBench, a comprehensive benchmark specifically designed for evaluating LLMs' performance in Chinese EQG. EQGBench establishes a five-dimensional evaluation framework supported by a dataset of 900 evaluation samples spanning three fundamental middle school disciplines: mathematics, physics, and chemistry. The dataset incorporates user queries with varying knowledge points, difficulty gradients, and question type specifications to simulate realistic educational scenarios. Through systematic evaluation of 46 mainstream large models, we reveal significant room for development in generating questions that reflect educational value and foster students' comprehensive abilities.
[11] Automated scoring of the Ambiguous Intentions Hostility Questionnaire using fine-tuned large language models
Y. Lyu,D. Combs,D. Neumann,Y. C. Leong
Main category: cs.CL
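TL;DR: This study evaluates whether large language models can automatically score open-ended responses to the Ambiguous Intentions Hostility Questionnaire (AIHQ). It finds that the models can automate scoring effectively, which could streamline psychological assessment in research and clinical settings.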
Details
Motivation: The AIHQ's open-ended questions require time-intensive scoring by human raters, so the study assesses whether large language models can score these responses automatically. Method: The study uses a previously collected dataset in which individuals with traumatic brain injury (TBI) and healthy controls (HC) completed the AIHQ and had their open-ended responses scored by human raters. Half of the responses were used to fine-tune two models, and the fine-tuned models were tested on the remaining half. Result: Model-generated ratings aligned with human ratings, with fine-tuned models showing higher alignment. This alignment was consistent across ambiguous, intentional, and accidental scenario types and replicated previous findings on group differences between the TBI and HC groups in attributions of hostility and aggression responses. The fine-tuned models also performed well on an independent nonclinical dataset. Conclusion: Large language models can automatically score AIHQ open-ended responses, which could streamline psychological assessment in research and clinical settings. Abstract: Hostile attribution bias is the tendency to interpret social interactions as intentionally hostile. The Ambiguous Intentions Hostility Questionnaire (AIHQ) is commonly used to measure hostile attribution bias, and includes open-ended questions where participants describe the perceived intentions behind a negative social situation and how they would respond. While these questions provide insights into the contents of hostile attributions, they require time-intensive scoring by human raters. In this study, we assessed whether large language models can automate the scoring of AIHQ open-ended responses. We used a previously collected dataset in which individuals with traumatic brain injury (TBI) and healthy controls (HC) completed the AIHQ and had their open-ended responses rated by trained human raters. We used half of these responses to fine-tune the two models on human-generated ratings, and tested the fine-tuned models on the remaining half of AIHQ responses. Results showed that model-generated ratings aligned with human ratings for both attributions of hostility and aggression responses, with fine-tuned models showing higher alignment. This alignment was consistent across ambiguous, intentional, and accidental scenario types, and replicated previous findings on group differences in attributions of hostility and aggression responses between TBI and HC groups. The fine-tuned models also generalized well to an independent nonclinical dataset. To support broader adoption, we provide an accessible scoring interface that includes both local and cloud-based options. Together, our findings suggest that large language models can streamline AIHQ scoring in both research and clinical contexts, revealing their potential to facilitate psychological assessments across different populations.
[12] Multidimensional classification of posts for online course discussion forum curation
Antonio Leandro Martins Candido,Jose Everardo Bessa Maia
Main category: cs.CL
TL;DR: This paper proposes a Bayesian fusion approach to combine classification scores from a pre-trained LLM and a local classifier, avoiding costly fine-tuning while maintaining performance for curating online course discussion forums.
Details
Motivation: The motivation is to avoid the resource-intensive process of frequently retraining Large Language Models (LLMs) for the automatic curation of discussion forums in online courses. Method: The paper proposes a Bayesian fusion method that combines multidimensional classification scores from a pre-trained generic LLM with those of a classifier trained on local data. Result: The performance comparison showed that the proposed fusion method improves results compared to using each classifier individually and is competitive with the LLM fine-tuning approach. Conclusion: The proposed Bayesian fusion approach effectively combines classification scores from a pre-trained LLM and a local classifier, offering competitive performance compared to LLM fine-tuning without the need for frequent retraining. Abstract: The automatic curation of discussion forums in online courses requires constant updates, making frequent retraining of Large Language Models (LLMs) a resource-intensive process. To circumvent the need for costly fine-tuning, this paper proposes and evaluates the use of Bayesian fusion. The approach combines the multidimensional classification scores of a pre-trained generic LLM with those of a classifier trained on local data. The performance comparison demonstrated that the proposed fusion improves the results compared to each classifier individually, and is competitive with the LLM fine-tuning approach.
[13] Beyond Hard Sharing: Efficient Multi-Task Speech-to-Text Modeling with Supervised Mixture of Experts
Hojun Jin,Eunsoo Hong,Ziwon Hyung,Sungjun Lim,Seungjin Lee,Keunseok Cho
Main category: cs.CL
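TL;DR: We propose a new Supervised Mixture of Experts (S-MoE) method that uses guiding tokens to route each task to its designated expert, overcoming the limitations of hard-parameter sharing and achieving notable performance gains on speech recognition and speech translation.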
Details
Motivation: Hard-parameter sharing is a common strategy for jointly training a single model across different tasks, but it often causes task interference that hurts overall model performance. S-MoE is proposed to address this problem. Method: The proposed Supervised Mixture of Experts (S-MoE) requires no trained gating function; instead, special guiding tokens route each task to its designated expert. Result: Experiments on joint speech recognition and speech translation show that S-MoE achieves a 6.35% relative improvement in Word Error Rate (WER) when applied to both the encoder and decoder. Conclusion: S-MoE is an effective model that overcomes the limitations of hard-parameter sharing by using guiding tokens to route each task to its designated expert. Abstract: Hard-parameter sharing is a common strategy to train a single model jointly across diverse tasks. However, this often leads to task interference, impeding overall model performance. To address the issue, we propose a simple yet effective Supervised Mixture of Experts (S-MoE). Unlike traditional Mixture of Experts models, S-MoE eliminates the need for training gating functions by utilizing special guiding tokens to route each task to its designated expert. By assigning each task to a separate feedforward network, S-MoE overcomes the limitations of hard-parameter sharing. We further apply S-MoE to a speech-to-text model, enabling the model to process mixed-bandwidth input while jointly performing automatic speech recognition (ASR) and speech translation (ST). Experimental results demonstrate the effectiveness of the proposed S-MoE, achieving a 6.35% relative improvement in Word Error Rate (WER) when applied to both the encoder and decoder.
[14] An Audit and Analysis of LLM-Assisted Health Misinformation Jailbreaks Against LLMs
Ayana Hussain,Patrick Zhao,Nicholas Vincent
Main category: cs.CL
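TL;DR: This study examines how LLM-generated jailbreak attacks lead to harmful medical misinformation and evaluates how well LLMs can detect misinformation from other LLMs and from people.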
Details
Motivation: Large language models (LLMs) are a double-edged sword: they can generate harmful misinformation, but they could also be used to detect and prevent its spread. Method: The study examines 109 jailbreak attacks against three target LLMs, compares the attack prompts with real-world health-related LLM queries, and analyzes how effectively the resulting misinformation can be detected with standard machine learning approaches. Result: The study finds that LLM-generated misinformation differs from the misinformation commonly found on social media and can be detected effectively with standard machine learning approaches. Conclusion: LLMs can be used effectively to detect misinformation from both other LLMs and people and, with careful design, can contribute to a healthier information ecosystem. Abstract: Large Language Models (LLMs) are a double-edged sword capable of generating harmful misinformation -- inadvertently, or when prompted by "jailbreak" attacks that attempt to produce malicious outputs. LLMs could, with additional research, be used to detect and prevent the spread of misinformation. In this paper, we investigate the efficacy and characteristics of LLM-produced jailbreak attacks that cause other models to produce harmful medical misinformation. We also study how misinformation generated by jailbroken LLMs compares to typical misinformation found on social media, and how effectively it can be detected using standard machine learning approaches. Specifically, we closely examine 109 distinct attacks against three target LLMs and compare the attack prompts to in-the-wild health-related LLM queries. We also examine the resulting jailbreak responses, comparing the generated misinformation to health-related misinformation on Reddit. Our findings add more evidence that LLMs can be effectively used to detect misinformation from both other LLMs and from people, and support a body of work suggesting that with careful design, LLMs can contribute to a healthier overall information ecosystem.
[15] Evaluation of GPT-based large language generative AI models as study aids for the national licensure examination for registered dietitians in Japan
Yuta Nagamori,Mikoto Kosai,Yuji Kawai,Haruka Marumo,Misaki Shibuya,Tatsuya Negishi,Masaki Imanishi,Yasumasa Ikeda,Koichiro Tsuchiya,Asuka Sawai,Licht Miyamoto
Main category: cs.CL
TL;DR: The study evaluated the effectiveness of generative AI models like ChatGPT and Bing models (Precise, Creative, Balanced) as study aids for the Japanese national licensure examination for registered dietitians. It found that while some models, like Bing-Precise and Bing-Creative, marginally exceeded the passing threshold, the overall accuracy and consistency of all models were suboptimal, highlighting the need for further advancements in AI-based study aids for dietitian licensure preparation.
Details
Motivation: The motivation of the study was to evaluate the potential of current LLM-based generative AI models as study aids for nutrition students, particularly in the context of the Japanese national licensure examination for registered dietitians, where the performance of such models remains underexplored. Method: The study used questions from the Japanese national examination for registered dietitians as prompts for ChatGPT and three Bing models (Precise, Creative, Balanced), based on GPT-3.5 and GPT-4. Each question was entered into independent sessions, and model responses were analyzed for accuracy, consistency, and response time. Additional prompt engineering was tested to assess potential performance improvements. Result: Bing-Precise (66.2%) and Bing-Creative (61.4%) surpassed the passing threshold (60%), while Bing-Balanced (43.3%) and ChatGPT (42.8%) did not. Bing-Precise and Bing-Creative generally outperformed others across subject fields except Nutrition Education, where all models underperformed. None of the models consistently provided the same correct responses across repeated attempts, highlighting limitations in answer stability. ChatGPT showed greater consistency in response patterns but lower accuracy. Prompt engineering had minimal effect, except for modest improvement when correct answers and explanations were explicitly provided. Conclusion: The study concludes that while some generative AI models, like Bing-Precise and Bing-Creative, marginally exceeded the passing threshold in the Japanese national licensure examination for registered dietitians, the overall accuracy and consistency of all models were suboptimal. The study highlights the limitations in answer stability and robustness, indicating the need for further advancements to ensure reliable AI-based study aids for dietitian licensure preparation. 
Abstract: Generative artificial intelligence (AI) based on large language models (LLMs), such as ChatGPT, has demonstrated remarkable progress across various professional fields, including medicine and education. However, their performance in nutritional education, especially in Japanese national licensure examination for registered dietitians, remains underexplored. This study aimed to evaluate the potential of current LLM-based generative AI models as study aids for nutrition students. Questions from the Japanese national examination for registered dietitians were used as prompts for ChatGPT and three Bing models (Precise, Creative, Balanced), based on GPT-3.5 and GPT-4. Each question was entered into independent sessions, and model responses were analyzed for accuracy, consistency, and response time. Additional prompt engineering, including role assignment, was tested to assess potential performance improvements. Bing-Precise (66.2%) and Bing-Creative (61.4%) surpassed the passing threshold (60%), while Bing-Balanced (43.3%) and ChatGPT (42.8%) did not. Bing-Precise and Bing-Creative generally outperformed others across subject fields except Nutrition Education, where all models underperformed. None of the models consistently provided the same correct responses across repeated attempts, highlighting limitations in answer stability. ChatGPT showed greater consistency in response patterns but lower accuracy. Prompt engineering had minimal effect, except for modest improvement when correct answers and explanations were explicitly provided. While some generative AI models marginally exceeded the passing threshold, overall accuracy and answer consistency remained suboptimal. Moreover, all the models demonstrated notable limitations in answer consistency and robustness. 
Further advancements are needed to ensure reliable and stable AI-based study aids for dietitian licensure preparation.
[16] Guided Navigation in Knowledge-Dense Environments: Structured Semantic Exploration with Guidance Graphs
Dehao Tao,Guangjie Liu,Weizheng,Yongfeng Huang,Minghu jiang
Main category: cs.CL
TL;DR: This paper proposes GG Explore, a novel framework for knowledge graph exploration that improves structured knowledge retrieval using a Guidance Graph, achieving better efficiency and performance than existing methods.
Details
Motivation: Current knowledge graph exploration methods face limitations in handling granularity mismatches and leveraging contextual information effectively, which limits their performance in complex knowledge-intensive tasks. Method: The authors propose GG Explore, which uses a Guidance Graph to bridge unstructured queries and structured knowledge retrieval. It incorporates Structural Alignment to filter incompatible candidates and Context Aware Pruning to maintain semantic consistency. Result: Experiments demonstrate that GG Explore outperforms state-of-the-art methods in terms of efficiency and performance, particularly for complex tasks, while maintaining strong results with smaller LLMs. Conclusion: The proposed GG Explore framework effectively addresses the limitations of existing knowledge exploration methods by introducing an intermediate Guidance Graph that enhances structured knowledge retrieval while achieving high efficiency and performance across complex tasks. Abstract: While Large Language Models (LLMs) exhibit strong linguistic capabilities, their reliance on static knowledge and opaque reasoning processes limits their performance in knowledge-intensive tasks. Knowledge graphs (KGs) offer a promising solution, but current exploration methods face a fundamental trade-off: question-guided approaches incur redundant exploration due to granularity mismatches, while clue-guided methods fail to effectively leverage contextual information for complex scenarios. To address these limitations, we propose Guidance Graph guided Knowledge Exploration (GG Explore), a novel framework that introduces an intermediate Guidance Graph to bridge unstructured queries and structured knowledge retrieval. The Guidance Graph defines the retrieval space by abstracting the target knowledge's structure while preserving broader semantic context, enabling precise and efficient exploration.
Building upon the Guidance Graph, we develop: (1) Structural Alignment that filters incompatible candidates without LLM overhead, and (2) Context Aware Pruning that enforces semantic consistency with graph constraints. Extensive experiments show our method achieves superior efficiency and outperforms SOTA, especially on complex tasks, while maintaining strong performance with smaller LLMs, demonstrating practical value.
[17] Semantic Bridge: Universal Multi-Hop Question Generation via AMR-Driven Graph Synthesis
Linqing Chen,Hanmeng Zhong,Wentao Wu,Weilei Wang
Main category: cs.CL
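TL;DR: Semantic Bridge is a framework for controllably generating complex multi-hop reasoning questions; it uses semantic graph weaving to produce high-quality question-answer pairs across multiple languages and domains.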
Details
Motivation: LLM training suffers from a scarcity of high-quality, reasoning-intensive question-answer pairs, especially from domain-specific documents, and existing methods cannot generate controllable multi-hop reasoning questions. Method: The authors propose semantic graph weaving, which comprises entity bridging, predicate chain bridging, and causal bridging, and uses AMR-driven analysis to construct complex pathways with fine-grained control over complexity and question type. Result: Semantic Bridge performs strongly in evaluations across four languages (English, Chinese, French, German) and two domains (Wikipedia, biomedicine), improving on baselines by 18.3%-25.4%. Conclusion: Semantic Bridge is a new paradigm for LLM training data synthesis that controllably generates targeted reasoning questions from sparse sources and performs well across languages and domains. Abstract: Large language model (LLM) training faces a critical bottleneck: the scarcity of high-quality, reasoning-intensive question-answer pairs, especially from sparse, domain-specific sources like PubMed papers or legal documents. Existing methods rely on surface patterns, fundamentally failing to generate controllable, complex multi-hop reasoning questions that test genuine understanding, which is essential for advancing LLM training paradigms. We present Semantic Bridge, the first universal framework for controllably generating sophisticated multi-hop reasoning questions from arbitrary sources. Our breakthrough innovation is semantic graph weaving: three complementary bridging mechanisms (entity bridging for role-varying shared entities, predicate chain bridging for temporal/causal/logical sequences, and causal bridging for explicit reasoning chains) that systematically construct complex pathways across documents, with fine-grained control over complexity and types via AMR-driven analysis. Our multi-modal AMR pipeline achieves up to 9.5% better round-trip quality, enabling production-ready controllable QA generation. Extensive evaluation demonstrates performance across both general-purpose datasets (Wikipedia) and specialized domains (biomedicine). It yields consistent 18.3%-25.4% gains over baselines across four languages (English, Chinese, French, German). Question pairs generated from 200 sources outperform 600 native human annotation examples with 67% fewer materials. Human evaluation shows 23.4% higher complexity, 18.7% better answerability, and 31.2% improved pattern coverage.
Semantic Bridge establishes a new paradigm for LLM training data synthesis, enabling controllable generation of targeted reasoning questions from sparse sources. We will release our core code and semantic bridge model.
[18] PersonaEval: Are LLM Evaluators Human Enough to Judge Role-Play?
Lingfeng Zhou,Jialing Zhang,Jin Gao,Mohan Jiang,Dequan Wang
Main category: cs.CL
TL;DR: PersonaEval reveals that current LLM evaluators lack the human-like reasoning needed to effectively judge role-play quality, as they perform significantly worse than humans in identifying personas from dialogues.
Details
Motivation: Current role-play studies use unvalidated LLM-as-a-judge paradigms that may not reflect how humans perceive role fidelity. The study aims to address this by emphasizing the importance of role identification for meaningful evaluation. Method: The study introduces PersonaEval, a benchmark using human-authored dialogues to test LLM evaluators' ability to identify personas. It includes experiments comparing model performance with human participants and examines factors like training-time adaptation and test-time compute. Result: Experiments show that even the best-performing LLMs achieve only around 69% accuracy on PersonaEval, significantly lower than human participants, who achieve 90.8% accuracy. Conclusion: The study concludes that current LLM evaluators are not sufficiently human-like in their reasoning abilities to reliably judge role-play scenarios, and improvements require more than just task-specific tuning. Abstract: Current role-play studies often rely on unvalidated LLM-as-a-judge paradigms, which may fail to reflect how humans perceive role fidelity. A key prerequisite for human-aligned evaluation is role identification, the ability to recognize who is speaking based on dialogue context. We argue that any meaningful judgment of role-playing quality (how well a character is played) fundamentally depends on first correctly attributing words and actions to the correct persona (who is speaking). We present PersonaEval, the first benchmark designed to test whether LLM evaluators can reliably identify human roles. PersonaEval uses human-authored dialogues from novels, scripts, and video transcripts, challenging models to determine the correct persona according to the conversation context. Our experiments, including a human study, show that even the best-performing LLMs reach only around 69% accuracy, well below the level needed for reliable evaluation. 
In contrast, human participants perform near ceiling with 90.8% accuracy, highlighting that current LLM evaluators are still not human enough to effectively judge role-play scenarios. To better understand this gap, we examine training-time adaptation and test-time compute, suggesting that reliable evaluation requires more than task-specific tuning, but depends on strong, human-like reasoning abilities in LLM evaluators. We release our benchmark at https://github.com/maple-zhou/PersonaEval.
[19] RealTalk-CN: A Realistic Chinese Speech-Text Dialogue Benchmark With Cross-Modal Interaction Analysis
Enzhi Wang,Qicheng Li,Shiwan Zhao,Aobo Kong,Jiaming Zhou,Xi Yang,Yequan Wang,Yonghua Lin,Yong Qin
Main category: cs.CL
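TL;DR: RealTalk-CN is the first Chinese multi-turn, multi-domain speech-text dual-modal TOD dataset, filling the gap left by existing datasets in evaluation on real speech.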
Details
Motivation: Existing TOD datasets are mostly text-based and predominantly English; they lack real speech signals and key speech characteristics, so they cannot effectively evaluate speech-based LLMs. Method: The authors build RealTalk-CN, a dataset of 5.4k dialogues (60K utterances, 150 hours), and propose a new cross-modal chat task. Result: RealTalk-CN provides multi-turn, multi-domain, speech-text dual-modal data that captures real-world characteristics such as spontaneous speech disfluencies and speaker variation, and experiments validate its effectiveness. Conclusion: RealTalk-CN is a comprehensive and realistic Chinese speech-text dual-modal TOD dataset that provides a solid foundation for research on speech-based LLMs. Abstract: In recent years, large language models (LLMs) have achieved remarkable advancements in multimodal processing, including end-to-end speech-based language models that enable natural interactions and perform specific tasks in task-oriented dialogue (TOD) systems. However, existing TOD datasets are predominantly text-based, lacking real speech signals that are essential for evaluating the robustness of speech-based LLMs. Moreover, existing speech TOD datasets are primarily English and lack critical aspects such as speech disfluencies and speaker variations. To address these gaps, we introduce RealTalk-CN, the first Chinese multi-turn, multi-domain speech-text dual-modal TOD dataset, comprising 5.4k dialogues (60K utterances, 150 hours) with paired speech-text annotations. RealTalk-CN captures diverse dialogue scenarios with annotated spontaneous speech disfluencies, ensuring comprehensive coverage of real-world complexities in speech dialogue. In addition, we propose a novel cross-modal chat task that authentically simulates real-world user interactions, allowing dynamic switching between speech and text modalities. Our evaluation covers robustness to speech disfluencies, sensitivity to speaker characteristics, and cross-domain performance. Extensive experiments validate the effectiveness of RealTalk-CN, establishing a strong foundation for research on Chinese speech-based LLMs.
[20] Training-Free Multimodal Large Language Model Orchestration
Tianyu Xie,Yuhang Wu,Yongdong Luo,Jiayi Ji,Xiawu Zheng
Main category: cs.CL
TL;DR: This paper introduces MLLM Orchestration, a method to build interactive multimodal AI systems without additional training, achieving better performance, lower latency, and higher interpretability by using a central controller and explicit workflows.
Details
Motivation: The motivation stems from the difficulty of integrating different MLLMs into a unified system due to issues like modal alignment and Text-to-Speech efficiency, prompting the search for a modular and efficient solution. Method: The approach uses a central controller LLM to manage specialized models through explicit workflows, includes a parallel Text-to-Speech architecture for full-duplex interaction, and employs a cross-modal memory integration system for coherent context handling. Result: The evaluations show that MLLM Orchestration achieves better multimodal capabilities without training, with up to 7.8% performance improvement, 10.3% reduced latency, and improved interpretability. Conclusion: MLLM Orchestration provides a new method for building multimodal AI systems without additional training, offering improved efficiency, interpretability, and performance. Abstract: Different Multimodal Large Language Models (MLLMs) cannot be integrated into a unified multimodal input-output system directly. In previous work, training has been considered as an inevitable component due to challenges in modal alignment, Text-to-Speech efficiency and other integration issues. In this paper, we introduce Multimodal Large Language Model Orchestration, an effective approach for creating interactive multimodal AI systems without additional training. MLLM Orchestration leverages the inherent reasoning capabilities of large language models to coordinate specialized models through explicit workflows, enabling natural multimodal interactions while maintaining modularity, improving interpretability, and significantly enhancing computational efficiency. 
Our orchestration framework is built upon three key innovations: (1) a central controller LLM that analyzes user inputs and dynamically routes tasks to appropriate specialized models through carefully designed agents; (2) a parallel Text-to-Speech architecture that enables true full-duplex interaction with seamless interruption handling and natural conversational flow; and (3) a cross-modal memory integration system that maintains coherent context across modalities through intelligent information synthesis and retrieval, selectively avoiding unnecessary modality calls in certain scenarios to improve response speed. Extensive evaluations demonstrate that MLLM Orchestration achieves comprehensive multimodal capabilities without additional training, performance improvements of up to 7.8% over traditional jointly-trained approaches on standard benchmarks, reduced latency by 10.3%, and significantly enhanced interpretability through explicit orchestration processes.
[21] A Rose by Any Other Name Would Smell as Sweet: Categorical Homotopy Theory for Large Language Models
Sridhar Mahadevan
Main category: cs.CL
TL;DR: This paper proposes a categorical homotopy framework for LLMs to address the issue of different next-token probabilities for semantically equivalent statements, by introducing an LLM Markov category and using categorical homotopy techniques to capture 'weak equivalences'.
Details
Motivation: The motivation is to solve the problem where LLMs do not generate the same next-token probabilities for semantically equivalent statements, which is an issue in natural language processing. Method: The paper introduces an LLM Markov category to represent probability distributions in language generated by an LLM and applies categorical homotopy techniques to capture 'weak equivalences' in this category. Result: The result is a detailed overview of the application of categorical homotopy to LLMs, providing a framework that addresses the fundamental problem of non-isomorphic arrows generated by equivalent rephrases in the LLM Markov category. Conclusion: The paper concludes that categorical homotopy techniques can be effectively applied to LLMs to address the issue of capturing 'weak equivalences' in an LLM Markov category, building on theoretical results from higher algebraic K-theory to model categories. Abstract: Natural language is replete with superficially different statements, such as "Charles Darwin wrote" and "Charles Darwin is the author of", which carry the same meaning. Large language models (LLMs) should generate the same next-token probabilities in such cases, but usually do not. Empirical workarounds have been explored, such as using k-NN estimates of sentence similarity to produce smoothed estimates. In this paper, we tackle this problem more abstractly, introducing a categorical homotopy framework for LLMs. We introduce an LLM Markov category to represent probability distributions in language generated by an LLM, where the probability of a sentence, such as "Charles Darwin wrote", is defined by an arrow in a Markov category. However, this approach runs into difficulties as language is full of equivalent rephrases, and each generates a non-isomorphic arrow in the LLM Markov category. To address this fundamental problem, we use categorical homotopy techniques to capture "weak equivalences" in an LLM Markov category.
We present a detailed overview of the application of categorical homotopy to LLMs, from higher algebraic K-theory to model categories, building on powerful theoretical results developed over the past half a century.
[22] Decoupling Understanding from Reasoning via Problem Space Mapping for Small-scale Model Reasoning
Li Wang,Changhao Zhang,Zengqi Xiu,Kai Lu,Xin Yu,Kui Zhang,Wenjun Wu
Main category: cs.CL
TL;DR: DURIT enhances Small Language Models' reasoning by separating understanding from reasoning, improving performance and robustness.
Details
Motivation: Small Language Models struggle with reasoning due to linguistic variability and complexity, which hinders optimization. Method: Mapper and reasoner co-training in a canonical problem space using reinforcement learning, self-distillation, and iterative training. Result: DURIT enhances performance on in-domain and out-of-domain mathematical and logical reasoning tasks for Small Language Models. Conclusion: DURIT improves the reasoning capabilities and robustness of Small Language Models by decoupling understanding from reasoning. Abstract: Despite recent advances in the reasoning capabilities of Large Language Models (LLMs), improving the reasoning ability of Small Language Models (SLMs, e.g., $\leq$ 1.5B) remains challenging. A key obstacle lies in the complexity and variability of natural language: essentially equivalent problems often appear in diverse surface forms, often obscured by redundant or distracting details. This imposes a dual burden on SLMs: they must first extract the core problem from complex linguistic input, and then perform reasoning based on that understanding. The resulting vast and noisy problem space hinders optimization, particularly for models with limited capacity. To address this, we propose a new framework that decouples understanding from reasoning by mapping natural language problems into a canonical problem space, a semantically simplified yet expressive domain. This enables SLMs to focus on reasoning over standardized inputs, free from linguistic variability. Within this framework, we introduce DURIT (Decoupled Understanding from Reasoning via Iterative Training), a three-step algorithm that iteratively: (1) maps natural language problems via reinforcement learning, (2) aligns reasoning trajectories through self-distillation, and (3) trains reasoning policies in the problem space. The mapper and reasoner are co-trained in an alternating loop throughout this process.
Experiments show that DURIT substantially improves SLMs' performance on both in-domain and out-of-domain mathematical and logical reasoning tasks. Beyond improving reasoning capabilities, DURIT also improves the robustness of reasoning, validating decoupling understanding from reasoning as an effective strategy for strengthening SLMs.
[23] FedCoT: Communication-Efficient Federated Reasoning Enhancement for Large Language Models
Chuan Li,Qianyi Zhao,Fengran Mo,Cen Chen
Main category: cs.CL
TL;DR: FedCoT is a novel framework designed to improve reasoning in federated learning environments for large language models, particularly in healthcare contexts, by enhancing rationale quality and reducing communication overhead while preserving data privacy.
Details
Motivation: Enhancing reasoning capabilities of large language models in federated learning environments is challenging, especially in healthcare, where accurate outputs and interpretable rationales are critical. Existing approaches neglect rationale quality and rely on privacy-violating methods, prompting the need for a better solution. Method: FedCoT uses a lightweight chain-of-thought enhancement mechanism where local models generate multiple reasoning paths and a compact discriminator selects the most promising one. It also employs LoRA module stacking for efficient aggregation across diverse clients. Result: FedCoT significantly improves client-side reasoning performance under strict resource budgets while ensuring data privacy, as demonstrated by experiments on medical reasoning tasks. Conclusion: FedCoT provides a framework to enhance reasoning capabilities of LLMs in federated settings while preserving data privacy and improving performance under resource constraints. Abstract: Efficiently enhancing the reasoning capabilities of large language models (LLMs) in federated learning environments remains challenging, particularly when balancing performance gains with strict computational, communication, and privacy constraints. This challenge is especially acute in healthcare, where decisions spanning clinical, operational, and patient-facing contexts demand not only accurate outputs but also interpretable, traceable rationales to ensure safety, accountability, and regulatory compliance. Conventional federated tuning approaches for LLMs fail to address this need: they optimize primarily for answer correctness while neglecting rationale quality, leaving CoT capabilities dependent on models' innate pre-training abilities. Moreover, existing methods for improving rationales typically rely on privacy-violating knowledge distillation from centralized models. Additionally, the communication overhead in traditional federated fine-tuning of LLMs remains substantial.
We address this gap by proposing FedCoT, a novel framework specifically designed to enhance reasoning in federated settings. FedCoT leverages a lightweight chain-of-thought enhancement mechanism: local models generate multiple reasoning paths, and a compact discriminator dynamically selects the most promising one. This approach improves reasoning accuracy and robustness while providing valuable interpretability, which is particularly critical for medical applications. To manage client heterogeneity efficiently, we adopt an improved aggregation approach building upon advanced LoRA module stacking, incorporating client classifier-awareness to achieve noise-free aggregation across diverse clients. Comprehensive experiments on medical reasoning tasks demonstrate that FedCoT significantly boosts client-side reasoning performance under stringent resource budgets while fully preserving data privacy.
[24] LATTE: Learning Aligned Transactions and Textual Embeddings for Bank Clients
Egor Fadeev,Dzhambulat Mollaev,Aleksei Shestov,Dima Korolev,Omar Zoloev,Ivan Kireev,Andrey Savchenko,Maksim Makarenko
Main category: cs.CL
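TL;DR: This paper proposes LATTE, an efficient contrastive learning framework that combines semantic information from large language models with contrastive learning, substantially reducing the computational cost of processing long event sequences and outperforming existing methods on financial data.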
Details
Motivation: Although large language models (LLMs) have broad knowledge, applying them directly to long event sequences is computationally expensive and impractical, so a more efficient way to process event sequence data in finance is needed. Method: LATTE aligns raw event embeddings with semantic embeddings generated by a frozen LLM and is trained with a contrastive loss. Result: Experiments show that the method outperforms existing event sequence representation learning techniques on real-world financial datasets and is suitable for latency-sensitive environments. Conclusion: The paper proposes LATTE, a contrastive learning framework for learning embeddings of client behavior sequences that substantially reduces computational cost and input size while maintaining performance. Abstract: Learning client embeddings from sequences of their historic communications is central to financial applications. While large language models (LLMs) offer general world knowledge, their direct use on long event sequences is computationally expensive and impractical in real-world pipelines. In this paper, we propose LATTE, a contrastive learning framework that aligns raw event embeddings with semantic embeddings from frozen LLMs. Behavioral features are summarized into short prompts, embedded by the LLM, and used as supervision via contrastive loss. The proposed approach significantly reduces inference cost and input size compared to conventional processing of the complete sequence by an LLM. We experimentally show that our method outperforms state-of-the-art techniques for learning event sequence representations on real-world financial datasets while remaining deployable in latency-sensitive environments.
[25] Conformal P-Value in Multiple-Choice Question Answering Tasks with Provable Risk Control
Yuanchang Ye
Main category: cs.CL
TL;DR: This study proposes a statistically rigorous framework combining significance testing and conformal prediction to enhance the reliability of large language models in multiple-choice question answering, addressing hallucinations and factual inaccuracies.
Details
Motivation: The motivation is to address hallucination and factual inaccuracies in large language models (LLMs) used in disciplinary question answering scenarios, thereby improving their reliability and trustworthiness. Method: The method involves integrating p-value computation with conformity scoring through self-consistency resampling of MCQA responses, calculating option frequencies and constructing prediction sets via null hypothesis testing. Result: The enhanced conformal prediction framework achieves user-specified empirical miscoverage rates, and the average prediction set size decreases with increasing risk levels, validating it as an effective uncertainty metric. Conclusion: The study concludes that integrating significance testing with conformal prediction provides a statistically rigorous framework for enhancing the reliability of LLMs in high-stakes multiple-choice question answering scenarios. Abstract: This study introduces a significance testing-enhanced conformal prediction (CP) framework to improve trustworthiness of large language models (LLMs) in multiple-choice question answering (MCQA). While LLMs have been increasingly deployed in disciplinary QA scenarios, hallucination and nonfactual generation substantially compromise response reliability. Although CP provides statistically rigorous marginal coverage guarantees for prediction sets, and significance testing offers established statistical rigor, their synergistic integration remains unexplored. To mitigate hallucination and factual inaccuracies, our framework integrates $p$-value computation with conformity scoring through self-consistency resampling of MCQA responses. This approach calculates option frequencies to address LLMs' black-box nature, subsequently constructing prediction sets via null hypothesis testing ($\mathcal{H}_0$) with empirically derived $p$-values. 
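The frequency-based scoring and p-value test described in this abstract can be illustrated with a standard split-conformal construction. The sketch below is an assumption-laden illustration, not the paper's exact procedure: the nonconformity score (one minus the sampled frequency of an option), the calibration scores, and the function names are all hypothetical.

```python
from collections import Counter


def option_frequencies(samples, options):
    """Relative frequency of each option among repeated sampled answers."""
    counts = Counter(samples)
    total = len(samples)
    return {opt: counts.get(opt, 0) / total for opt in options}


def conformal_p_value(cal_scores, test_score):
    """Share of calibration scores at least as nonconforming as test_score,
    with the usual +1 finite-sample correction."""
    at_least = sum(1 for s in cal_scores if s >= test_score)
    return (at_least + 1) / (len(cal_scores) + 1)


def prediction_set(cal_scores, test_freqs, alpha):
    """Keep every option whose p-value exceeds the risk level alpha.
    Assumed nonconformity score: 1 - sampled frequency of the option."""
    return sorted(
        opt
        for opt, freq in test_freqs.items()
        if conformal_p_value(cal_scores, 1.0 - freq) > alpha
    )


# Hypothetical calibration scores: 1 - frequency of the TRUE option,
# computed on held-out questions with known answers.
cal_scores = [0.05, 0.15, 0.25, 0.55, 0.85]

# Ten sampled answers to one test question (self-consistency resampling).
samples = ["A"] * 7 + ["B"] * 2 + ["C"]
freqs = option_frequencies(samples, ["A", "B", "C", "D"])

for alpha in (0.1, 0.2, 0.4):
    print(alpha, prediction_set(cal_scores, freqs, alpha))
```

In this toy run the set shrinks from four options to two to one as alpha grows from 0.1 to 0.4, which mirrors the abstract's observation that average prediction set size decreases with increasing risk level.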
Evaluations on MMLU and MMLU-Pro benchmarks using off-the-shelf LLMs demonstrate: (1) The enhanced CP achieves user-specified empirical miscoverage rates; (2) Test-set average prediction set size (APSS) decreases monotonically with increasing risk levels ($\alpha$), validating APSS as an effective uncertainty metric. This work establishes a principled statistical framework for trustworthy LLM deployment in high-stakes QA applications.
[26] RTTC: Reward-Guided Collaborative Test-Time Compute
J. Pablo Muñoz,Jinjie Yuan
Main category: cs.CL
TL;DR: This paper proposes RTTC, a framework that improves LLM inference by adaptively choosing the best computation strategy (like RAG or fine-tuning) per query using a reward model, resulting in higher accuracy and efficiency.
Details
Motivation: The motivation is to address the issue where existing TTC strategies like TTT or RAG are applied indiscriminately, leading to suboptimal performance and high computational costs. The need for an adaptive, efficient, and scalable solution for LLM inference drives this work. Method: The paper introduces RTTC, which uses a distributed server-client architecture and a pretrained reward model to adaptively select the optimal Test-Time Compute (TTC) strategy, such as RAG or lightweight fine-tuning, based on the query. It also incorporates Query-State Caching to reduce redundant computations. Result: Experiments show that RTTC consistently outperforms baseline methods such as vanilla RAG or TTT in terms of accuracy across multiple LLMs and benchmarks, demonstrating the effectiveness of adaptive strategy selection and caching. Conclusion: RTTC provides a scalable and high-performance solution for language model adaptation by adaptively selecting the most effective TTC strategy for each query using a reward model, achieving superior accuracy compared to traditional methods like RAG or TTT. Abstract: Test-Time Compute (TTC) has emerged as a powerful paradigm for enhancing the performance of Large Language Models (LLMs) at inference, leveraging strategies such as Test-Time Training (TTT) and Retrieval-Augmented Generation (RAG). However, the optimal adaptation strategy varies across queries, and indiscriminate application of TTC strategy incurs substantial computational overhead. In this work, we introduce Reward-Guided Test-Time Compute (RTTC), a novel framework that adaptively selects the most effective TTC strategy for each query via a pretrained reward model, maximizing downstream accuracy across diverse domains and tasks. RTTC operates in a distributed server-client architecture, retrieving relevant samples from a remote knowledge base and applying RAG or lightweight fine-tuning on client devices only when necessary. 
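One plausible reading of reward-guided strategy selection is an escalation loop: try the cheapest strategy first, score the answer with a reward model, and move to a costlier strategy only when the score falls below a threshold. The sketch below illustrates that reading only. The escalation order, the threshold, and every name in it are hypothetical, and the stub callables stand in for real generation and reward models; it is not the RTTC implementation.

```python
from typing import Callable, List, Tuple

# A strategy is a (name, generate) pair; generate maps a query to an answer.
Strategy = Tuple[str, Callable[[str], str]]


def reward_guided_answer(
    query: str,
    strategies: List[Strategy],
    reward_fn: Callable[[str, str], float],
    threshold: float,
) -> Tuple[str, str, float]:
    """Try strategies in order of increasing cost and stop at the first answer
    whose reward reaches the threshold; otherwise return the best one seen."""
    if not strategies:
        raise ValueError("at least one strategy is required")
    best = None
    for name, generate in strategies:
        answer = generate(query)
        score = reward_fn(query, answer)
        if best is None or score > best[2]:
            best = (name, answer, score)
        if score >= threshold:
            break  # good enough: skip the more expensive strategies
    return best


# Toy stand-ins for direct inference, RAG, and test-time training.
strategies = [
    ("direct", lambda q: "guess"),
    ("rag", lambda q: "grounded"),
    ("ttt", lambda q: "adapted"),
]
toy_rewards = {"guess": 0.2, "grounded": 0.8, "adapted": 0.9}


def reward_fn(query: str, answer: str) -> float:
    return toy_rewards[answer]


print(reward_guided_answer("q", strategies, reward_fn, threshold=0.7))
```

With a threshold of 0.7 the loop stops at the RAG answer and never runs the costlier adaptation step, which is the kind of saving the abstract attributes to applying RAG or fine-tuning only when necessary.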
To further mitigate redundant computation, we propose Query-State Caching, which enables the efficient reuse of historical query states at both retrieval and adaptation levels. Extensive experiments across multiple LLMs and benchmarks demonstrate that RTTC consistently achieves superior accuracy compared to vanilla RAG or TTT, validating the necessity of adaptive, reward-guided TTC selection and the potential of RTTC for scalable, high-performance language model adaptation.
[27] Detecting and explaining postpartum depression in real-time with generative artificial intelligence
Silvia García-Méndez,Francisco de Arriba-Pérez
Main category: cs.CL
TL;DR: This research presents an intelligent, real-time, and non-invasive system for detecting postpartum depression using speech analysis and interpretable AI techniques, achieving high accuracy and transparency in predictions.
Details
Motivation: The motivation is to address the critical need for rapid detection of postpartum depression (PPD) and its risk factors to enable timely assessment and intervention, leveraging the latest technological advancements for real-time screening and recommendations. Method: The method involves an intelligent PPD screening system that integrates Natural Language Processing, Machine Learning (tree-based algorithms), and Large Language Models. It uses feature importance and natural language to explain predictions, addressing the black box issue in ML models. Result: The system achieved 90% accuracy in PPD detection across all evaluation metrics, outperforming existing solutions in the literature. Conclusion: The study concludes that the intelligent PPD screening system combining NLP, ML, and LLMs is effective in rapid and accurate detection of PPD, offering real-time and non-invasive analysis through free speech. Abstract: Among the many challenges mothers undergo after childbirth, postpartum depression (PPD) is a severe condition that significantly impacts their mental and physical well-being. Consequently, the rapid detection of PPD and its associated risk factors is critical for in-time assessment and intervention through specialized prevention procedures. Accordingly, this work addresses the need to help practitioners make decisions with the latest technological advancements to enable real-time screening and treatment recommendations. Mainly, our work contributes to an intelligent PPD screening system that combines Natural Language Processing, Machine Learning (ML), and Large Language Models (LLMs) towards an affordable, real-time, and non-invasive free speech analysis. Moreover, it addresses the black box problem since the predictions are described to the end users thanks to the combination of LLMs with interpretable ML models (i.e., tree-based algorithms) using feature importance and natural language.
The results obtained are 90% on PPD detection for all evaluation metrics, outperforming the competing solutions in the literature. Ultimately, our solution contributes to the rapid detection of PPD and its associated risk factors, critical for in-time and proper assessment and intervention.
[28] SABER: Switchable and Balanced Training for Efficient LLM Reasoning
Kai Zhao,Yanjun Zhao,Jiaming Song,Shien He,Lusheng Zhang,Qiang Zhang,Tianjiao Li
Main category: cs.CL
TL;DR: SABER is a reinforcement learning framework for large language models that provides efficient reasoning by allowing user-controllable, token-budgeted thinking processes.
Details
Motivation: Large language models suffer from excessive inference costs and latency; SABER addresses this by providing user-controllable reasoning with budget constraints. Method: SABER uses a reinforcement learning framework to provide token-budgeted reasoning, profiling each training example's token usage and assigning it to budget tiers, guiding the model with system prompts and rewards. Result: SABER achieves high accuracy under tight budgets, with a 65.4% reduction in reasoning length and a 3.6% accuracy gain on the MATH benchmark compared to the base model. Conclusion: SABER enables flexible trade-offs between latency and reasoning depth through four discrete inference modes: NoThink, FastThink, CoreThink, and DeepThink. Abstract: Large language models (LLMs) empowered by chain-of-thought reasoning have achieved impressive accuracy on complex tasks but suffer from excessive inference costs and latency when applied uniformly to all problems. We propose SABER (Switchable and Balanced Training for Efficient LLM Reasoning), a reinforcement learning framework that endows LLMs with user-controllable, token-budgeted reasoning. SABER first profiles each training example's base-model thinking token usage and assigns it to one of the predefined budget tiers. During fine-tuning, the model is guided by system prompts and length-aware rewards to respect its assigned budget. In parallel, we incorporate no-think examples to ensure the model remains reliable even when explicit reasoning is turned off. SABER further supports four discrete inference modes - NoThink, FastThink, CoreThink, and DeepThink, enabling flexible trade-offs between latency and reasoning depth. Extensive evaluations on math reasoning (MATH, GSM8K), code generation (MBPP), and logical reasoning (LiveBench-Reasoning) demonstrate that SABER achieves high accuracy under tight budgets, graceful degradation, and effective cross-scale and cross-domain generalization. 
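The budget-tier assignment and length-aware reward described in this abstract might look roughly like the following. This is a minimal sketch under stated assumptions: the tier budgets, the penalty weight, and the reward shape are invented for illustration and are not taken from the paper.

```python
import bisect

# Hypothetical token budgets for the predefined tiers (not from the paper).
TIER_BUDGETS = [256, 1024, 4096]


def assign_budget(tokens_used: int) -> int:
    """Map a profiled base-model token count to the smallest tier that covers
    it; counts above the largest tier are capped at the largest tier."""
    idx = bisect.bisect_left(TIER_BUDGETS, tokens_used)
    return TIER_BUDGETS[min(idx, len(TIER_BUDGETS) - 1)]


def length_aware_reward(
    correct: bool, tokens_used: int, budget: int, penalty: float = 0.5
) -> float:
    """Assumed reward shape: 1 for a correct answer, minus a penalty that
    grows with the relative budget overshoot and saturates at `penalty`."""
    base = 1.0 if correct else 0.0
    overshoot = max(0, tokens_used - budget) / budget
    return base - penalty * min(1.0, overshoot)


print(assign_budget(200), assign_budget(300), assign_budget(5000))
print(length_aware_reward(True, 512, 256))
```

Under these assumptions, an answer that stays within its assigned budget keeps its full correctness reward, while one that overshoots is penalized, which gives the policy a reason to respect the tier named in its system prompt.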
In particular, SABER-FastThink cuts reasoning length by 65.4% and yields a 3.6% accuracy gain compared with the base model on the MATH benchmark.
[29] LLMCARE: Alzheimer's Detection via Transformer Models Enhanced by LLM-Generated Synthetic Data
Ali Zolnour,Hossein Azadmaleki,Yasaman Haghbin,Fatemeh Taherinezhad,Mohamad Javad Momeni Nezhad,Sina Rashidi,Masoud Khani,AmirSajjad Taleban,Samin Mahdizadeh Sani,Maryam Dadkhah,James M. Noble,Suzanne Bakken,Yadollah Yaghoobzadeh,Abdol-Hossein Vahabie,Masoud Rouhizadeh,Maryam Zolnoori
Main category: cs.CL
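TL;DR: The paper develops a new approach to detecting Alzheimer's disease and related dementias that combines transformer embeddings with linguistic features, and shows that data augmentation with LLM-generated synthetic speech improves model performance.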
Details
Motivation: Alzheimer's disease and related dementias (ADRD) affect approximately five million older adults in the U.S., and over half remain undiagnosed; speech-based language processing may offer a scalable approach to early detection. Method: The authors develop and evaluate a screening pipeline that combines transformer embeddings with handcrafted linguistic features, uses LLM-generated synthetic speech for data augmentation, and tests unimodal and multimodal LLM classifiers. Result: The fusion model reached F1 = 83.3, and augmenting with synthetic speech raised F1 to 85.7. Fine-tuning significantly improved unimodal LLM classifiers, while multimodal models performed worse. Conclusion: Combining transformer embeddings with linguistic features enhances ADRD detection; clinically tuned LLMs can effectively support classification and data augmentation, but multimodal modeling needs further development. Abstract: Alzheimer's disease and related dementias (ADRD) affect approximately five million older adults in the U.S., yet over half remain undiagnosed. Speech-based natural language processing (NLP) offers a promising, scalable approach to detect early cognitive decline through linguistic markers. We develop and evaluate a screening pipeline that (i) fuses transformer embeddings with handcrafted linguistic features, (ii) tests data augmentation using synthetic speech generated by large language models (LLMs), and (iii) benchmarks unimodal and multimodal LLM classifiers for ADRD detection. Transcripts from the DementiaBank "cookie-theft" task (n = 237) were used. Ten transformer models were evaluated under three fine-tuning strategies. A fusion model combined embeddings from the top-performing transformer with 110 lexical-derived linguistic features. Five LLMs (LLaMA-8B/70B, MedAlpaca-7B, Ministral-8B, GPT-4o) were fine-tuned to generate label-conditioned synthetic speech, which was used to augment training data. Three multimodal models (GPT-4o, Qwen-Omni, Phi-4) were tested for speech-text classification in zero-shot and fine-tuned settings. The fusion model achieved F1 = 83.3 (AUC = 89.5), outperforming linguistic or transformer-only baselines. Augmenting training data with 2x MedAlpaca-7B synthetic speech increased F1 to 85.7. Fine-tuning significantly improved unimodal LLM classifiers (e.g., MedAlpaca: F1 = 47.3 -> 78.5). Current multimodal models demonstrated lower performance (GPT-4o = 70.2 F1; Qwen = 66.0). Performance gains aligned with the distributional similarity between synthetic and real speech. 
Integrating transformer embeddings with linguistic features enhances ADRD detection from speech. Clinically tuned LLMs effectively support both classification and data augmentation, while further advancement is needed in multimodal modeling.
[30] PREF: Reference-Free Evaluation of Personalised Text Generation in LLMs
Xiao Fu,Hossein A. Rahmani,Bin Wu,Jerome Ramos,Emine Yilmaz,Aldo Lipani
Main category: cs.CL
TL;DR: PREF is a personalized, reference-free evaluation framework for text generation that improves robustness, transparency, and alignment with user preferences, outperforming existing methods in performance and human alignment.
Details
Motivation: Most evaluation methods for text generation overlook the individuality of users, necessitating a personalized, reference-free evaluation framework. Method: PREF operates in a three-step pipeline: (1) a coverage stage generates universal criteria; (2) a preference stage personalizes the evaluation rubric based on user profiles and preferences; and (3) a scoring stage uses an LLM judge to rate outputs against the rubric. Result: Experiments on the PrefEval benchmark show that PREF outperforms strong baselines in accuracy, calibration, and alignment with human judgments. Conclusion: PREF enables scalable, interpretable, and user-aligned evaluation, providing a foundation for more reliable assessment and development of personalized language generation systems. Abstract: Personalised text generation is essential for user-centric information systems, yet most evaluation methods overlook the individuality of users. We introduce PREF, a Personalised Reference-free Evaluation Framework that jointly measures general output quality and user-specific alignment without requiring gold personalised references. PREF operates in a three-step pipeline: (1) a coverage stage uses a large language model (LLM) to generate a comprehensive, query-specific guideline covering universal criteria such as factuality, coherence, and completeness; (2) a preference stage re-ranks and selectively augments these factors using the target user's profile, stated or inferred preferences, and context, producing a personalised evaluation rubric; and (3) a scoring stage applies an LLM judge to rate candidate answers against this rubric, ensuring baseline adequacy while capturing subjective priorities. This separation of coverage from preference improves robustness, transparency, and reusability, and allows smaller models to approximate the personalised quality of larger ones. 
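The three-stage separation described above can be sketched as below. This is a minimal toy, not the authors' implementation: the function names, the representation of the rubric as criterion weights, and the weighted-mean aggregation are all assumptions, and the `propose` and `judge` callables stand in for LLM calls that the abstract does not specify.

```python
from typing import Callable, Dict, List


def coverage_stage(query: str, propose: Callable[[str], List[str]]) -> List[str]:
    """Stage 1: obtain query-specific universal criteria from a model."""
    return propose(query)


def preference_stage(criteria: List[str],
                     user_prefs: Dict[str, float]) -> Dict[str, float]:
    """Stage 2: re-weight universal criteria and add user-specific ones.

    The rubric maps each criterion to a weight; criteria the user has not
    expressed a preference about keep a default weight of 1.0.
    """
    rubric = {c: user_prefs.get(c, 1.0) for c in criteria}
    for criterion, weight in user_prefs.items():
        rubric.setdefault(criterion, weight)
    return rubric


def scoring_stage(answer: str, rubric: Dict[str, float],
                  judge: Callable[[str, str], float]) -> float:
    """Stage 3: weighted mean of per-criterion judge scores."""
    total_weight = sum(rubric.values())
    return sum(w * judge(answer, c) for c, w in rubric.items()) / total_weight


# Stub "models" so the sketch runs without any LLM.
stub_propose = lambda query: ["factuality", "coherence"]
stub_judge = lambda answer, criterion: 1.0 if criterion == "brevity" else 0.5

criteria = coverage_stage("Explain transformers", stub_propose)
rubric = preference_stage(criteria, {"coherence": 2.0, "brevity": 3.0})
score = scoring_stage("A short answer.", rubric, stub_judge)
```

Keeping stage 1 independent of the user means the same universal criteria can be reused across users, with only stage 2 recomputed per profile.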
Experiments on the PrefEval benchmark, including implicit preference-following tasks, show that PREF achieves higher accuracy, better calibration, and closer alignment with human judgments than strong baselines. By enabling scalable, interpretable, and user-aligned evaluation, PREF lays the groundwork for more reliable assessment and development of personalised language generation systems.
[31] Latent Fusion Jailbreak: Blending Harmful and Harmless Representations to Elicit Unsafe LLM Outputs
Wenpeng Xing,Mohan Li,Chunqiang Hu,Haitao Xu,Ningyu Zhang,Bo Lin,Meng Han
Main category: cs.CL
TL;DR: This paper introduces Latent Fusion Jailbreak (LFJ), a new attack on large language models, and proposes adversarial training as a defense, achieving high attack success rates and effective mitigation.
Details
Motivation: The motivation is to explore vulnerabilities in large language models caused by jailbreak attacks and propose effective defense mechanisms. Method: The paper introduces Latent Fusion Jailbreak (LFJ), which uses hidden state interpolation from harmful and benign query pairs, followed by gradient-guided adjustments and optimization. An adversarial training defense is also proposed. Result: LFJ achieves an average attack success rate of 94.01% on models like Vicuna and LLaMA-2, outperforming existing methods. Adversarial training reduces ASR by over 80%. Conclusion: The paper concludes that LFJ is a potent jailbreak attack on large language models, and the proposed adversarial training defense effectively mitigates its impact. Abstract: Large language models (LLMs) demonstrate impressive capabilities in various language tasks but are susceptible to jailbreak attacks that circumvent their safety alignments. This paper introduces Latent Fusion Jailbreak (LFJ), a representation-based attack that interpolates hidden states from harmful and benign query pairs to elicit prohibited responses. LFJ begins by selecting query pairs with high thematic and syntactic similarity, then performs gradient-guided interpolation at influential layers and tokens, followed by optimization to balance attack success, output fluency, and computational efficiency. Evaluations on models such as Vicuna and LLaMA-2 across benchmarks like AdvBench and MaliciousInstruct yield an average attack success rate (ASR) of 94.01%, outperforming existing methods. To mitigate LFJ, we propose an adversarial training defense that fine-tunes models on interpolated examples, reducing ASR by over 80% without degrading performance on benign inputs. Ablation studies validate the importance of query pair selection, hidden state interpolation components, and optimization strategies in LFJ's effectiveness.
[32] Inference-Aware Prompt Optimization for Aligning Black-Box Large Language Models
Saaduddin Mahmud,Mason Nakamura,Kyle H. Wray,Shlomo Zilberstein
Main category: cs.CL
TL;DR: This paper introduces IAPO, a new framework for prompt optimization that incorporates inference-awareness, along with the PSST algorithm, to better align black-box large language models while considering inference budgets and multi-objective trade-offs.
Details
Motivation: The motivation stems from the identified gap in existing prompt optimization methods, which are inference-strategy agnostic. The researchers aimed to address how prompts and inference strategies should be jointly optimized to better align LLMs according to user preferences and budget constraints. Method: The researchers introduced the IAPO framework, which jointly optimizes prompts and inference scale while considering inference budgets and task objectives. They also developed the PSST algorithm, a fixed-budget training method, and analyzed its error probability under finite-budget constraints. Result: The proposed PSST algorithm demonstrated effectiveness across six tasks, showing the importance of integrating inference-awareness into prompt optimization for improved alignment of LLMs. Conclusion: The study concludes that incorporating inference-awareness in prompt optimization is crucial for aligning black-box LLMs effectively, as demonstrated by the proposed IAPO framework and the PSST algorithm across six diverse tasks. Abstract: Prompt optimization methods have demonstrated significant effectiveness in aligning black-box large language models (LLMs). In parallel, inference scaling strategies such as Best-of-N Sampling and Majority Voting have also proven to enhance alignment and performance by trading off computation. However, existing prompt optimization approaches are inference strategy agnostic; that is, they optimize prompts without regard to the inference strategy employed during deployment. This constitutes a significant methodological gap, as our empirical and theoretical analysis reveals a strong interdependence between these two paradigms. Moreover, we find that user preferences regarding trade-offs among multiple objectives and inference budgets substantially influence the choice of prompt and inference configuration. 
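For readers unfamiliar with the two inference scaling strategies named above, a minimal sketch follows. It assumes the candidate responses have already been sampled; the `score` callable is a hypothetical stand-in for a reward model or other scorer, and nothing here reflects the IAPO or PSST algorithms themselves.

```python
from collections import Counter
from typing import Callable, List


def best_of_n(candidates: List[str], score: Callable[[str], float]) -> str:
    """Best-of-N sampling: keep the candidate with the highest score."""
    return max(candidates, key=score)


def majority_vote(answers: List[str]) -> str:
    """Majority voting: return the most frequent final answer.

    Ties are broken in favour of the answer that appeared first.
    """
    return Counter(answers).most_common(1)[0][0]


# Three sampled responses for one prompt (made-up values).
samples = ["4", "5", "4"]
voted = majority_vote(samples)

# A toy scorer that simply prefers longer responses.
picked = best_of_n(["a", "bbb", "cc"], score=len)
```

Both strategies spend more inference compute per query as N grows, which is why the prompt that works best can depend on how many samples the deployment budget allows.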
To address this gap, we introduce a unified novel framework named IAPO (Inference-Aware Prompt Optimization) that jointly optimizes the prompt and inference scale, while being aware of the inference budget and different task objectives. We then develop a fixed-budget training algorithm for IAPO, which we call PSST (Prompt Scaling via Sequential Trimming), and analyze finite-budget guarantees on error probability. Finally, we evaluate the effectiveness of PSST on six different tasks, including multi-objective text generation and reasoning, and demonstrate the critical role of incorporating inference-awareness when aligning black-box LLMs through prompt optimization.
[33] The Cost of Thinking: Increased Jailbreak Risk in Large Language Models
Fan Yang
Main category: cs.CL
TL;DR: This paper reveals that thinking mode in LLMs is vulnerable to Jailbreak attacks and proposes a safe thinking intervention method to reduce such vulnerabilities.
Details
Motivation: The motivation is to address the previously overlooked vulnerability of thinking mode in LLMs to jailbreak attacks, which can lead to harmful responses even when the model recognizes the harm. Method: The authors evaluated 9 LLMs using the AdvBench and HarmBench datasets and analyzed the characteristics of successfully attacked data. They proposed a method involving "specific thinking tokens" to guide LLMs' internal thinking process for improved security. Result: The study found that thinking mode LLMs are more susceptible to jailbreak attacks, especially when the input claims educational purposes or the thinking is excessively long. The proposed safe thinking intervention significantly reduced the attack success rate. Conclusion: The paper concludes that thinking mode in LLMs is vulnerable to jailbreak attacks, and a new method called safe thinking intervention can reduce the success rate of such attacks. Abstract: Thinking mode has long been regarded as one of the most valuable modes in LLMs. However, we uncover a surprising and previously overlooked phenomenon: LLMs with thinking mode are more easily broken by jailbreak attacks. We evaluate 9 LLMs on AdvBench and HarmBench and find that the attack success rate against thinking mode is almost always higher than against non-thinking mode. A study of a large number of samples shows that successfully attacked data is characterized by appeals to educational purposes and by excessively long thinking, and that LLMs give harmful answers even when they mostly recognize that the questions are harmful. To alleviate these problems, this paper proposes a safe thinking intervention for LLMs, which explicitly guides the model's internal thinking process by adding the LLM's "specific thinking tokens" to the prompt. 
The results demonstrate that the safe thinking intervention can significantly reduce the attack success rate of LLMs with thinking mode.
[34] Reflect then Learn: Active Prompting for Information Extraction Guided by Introspective Confusion
Dong Zhao,Yadong Wang,Xiang Chen,Chenxi Wang,Hongliang Dai,Chuanxing Geng,Shengzhong Zhang,Shaoyuan Li,Sheng-Jun Huang
Main category: cs.CL
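TL;DR: This paper proposes APIE, a new active prompting framework that improves large language model performance on information extraction tasks through a principle called "introspective confusion".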
Details
Motivation: Existing strategies for selecting in-context examples fail to provide effective guidance because they overlook a key source of model failure: confusion arising not only from semantic content but also from generating the well-structured formats that information extraction tasks require. Method: The paper introduces APIE, an active prompting framework that uses a dual-component uncertainty metric to quantify format uncertainty and content uncertainty, letting the LLM assess its own confusion and select the most challenging and informative samples as few-shot exemplars. Result: Extensive experiments on four benchmarks show that the method significantly outperforms strong baselines in both extraction accuracy and robustness. Conclusion: The work highlights the importance of a fine-grained, dual-level view of model uncertainty when building effective and reliable structured generation systems. Abstract: Large Language Models (LLMs) show remarkable potential for few-shot information extraction (IE), yet their performance is highly sensitive to the choice of in-context examples. Conventional selection strategies often fail to provide informative guidance, as they overlook a key source of model fallibility: confusion stemming not just from semantic content, but also from the generation of well-structured formats required by IE tasks. To address this, we introduce Active Prompting for Information Extraction (APIE), a novel active prompting framework guided by a principle we term introspective confusion. Our method empowers an LLM to assess its own confusion through a dual-component uncertainty metric that uniquely quantifies both Format Uncertainty (difficulty in generating correct syntax) and Content Uncertainty (inconsistency in extracted semantics). By ranking unlabeled data with this comprehensive score, our framework actively selects the most challenging and informative samples to serve as few-shot exemplars. Extensive experiments on four benchmarks show that our approach consistently outperforms strong baselines, yielding significant improvements in both extraction accuracy and robustness. Our work highlights the critical importance of a fine-grained, dual-level view of model uncertainty when it comes to building effective and reliable structured generation systems.
[35] mSCoRe: a Multilingual and Scalable Benchmark for Skill-based Commonsense Reasoning
Nghia Trung Ngo,Franck Dernoncourt,Thien Huu Nguyen
Main category: cs.CL
TL;DR: This paper introduces mSCoRe, a benchmark for evaluating multilingual commonsense reasoning in LLMs, revealing their current limitations and suggesting future improvements.
Details
Motivation: Investigate how reasoning-reinforced LLMs utilize human reasoning skills, particularly in multilingual commonsense reasoning involving diverse languages and cultures. Method: Developed mSCoRe benchmark with a taxonomy of reasoning skills, a data synthesis pipeline, and a complexity scaling framework; conducted experiments on eight state-of-the-art LLMs. Result: Experiments showed that current LLMs struggle with the mSCoRe benchmark, especially at higher complexity levels, highlighting their limitations in nuanced multilingual general and cultural commonsense reasoning. Conclusion: mSCoRe benchmark reveals current limitations of reasoning-reinforced LLMs in multilingual commonsense reasoning and suggests future directions for improvement. Abstract: Recent advancements in reasoning-reinforced Large Language Models (LLMs) have shown remarkable capabilities in complex reasoning tasks. However, the mechanism underlying their utilization of different human reasoning skills remains poorly investigated, especially for multilingual commonsense reasoning that involves everyday knowledge across different languages and cultures. To address this gap, we propose a Multilingual and Scalable Benchmark for Skill-based Commonsense Reasoning (mSCoRe). Our benchmark incorporates three key components that are designed to systematically evaluate LLMs' reasoning capabilities, including: (1) a novel taxonomy of reasoning skills that enables fine-grained analysis of models' reasoning processes, (2) a robust data synthesis pipeline tailored specifically for commonsense reasoning evaluation, and (3) a complexity scaling framework allowing task difficulty to scale dynamically alongside future improvements in LLM abilities. 
Extensive experiments on eight state-of-the-art LLMs of varying sizes and training approaches demonstrate that mSCoRe remains significantly challenging for current models, particularly at higher complexity levels. Our results reveal the limitations of such reasoning-reinforced models when confronted with nuanced multilingual general and cultural commonsense. We further provide detailed analysis on the models' reasoning processes, suggesting future directions for improving multilingual commonsense reasoning capabilities.
[36] Multi-Turn Puzzles: Evaluating Interactive Reasoning and Strategic Dialogue in LLMs
Kartikeya Badola,Jonathan Simon,Arian Hosseini,Sara Marie Mc Carthy,Tsendsuren Munkhdalai,Abhimanyu Goyal,Tomáš Kočiský,Shyam Upadhyay,Bahare Fatemi,Mehran Kazemi
Main category: cs.CL
TL;DR: The paper introduces a benchmark for evaluating LLMs' abilities in multi-turn dialogue and reasoning with incomplete data, revealing key areas for improvement.
Details
Motivation: The motivation is to address the limitations of LLMs in handling nuanced environments and interactive tasks common in real-world scenarios. Method: The paper introduces a novel benchmark with multi-turn tasks designed to test reasoning, interactive dialogue, and information-seeking abilities, using deterministic scoring mechanisms. Result: Evaluations on the benchmark show that most errors in LLMs stem from poor instruction following, reasoning failures, and poor planning. Conclusion: The paper concludes that there is significant room for improvement in current LLMs regarding instruction following, reasoning, and planning in interactive, multi-turn dialogue scenarios. Abstract: Large language models (LLMs) excel at solving problems with clear and complete statements, but often struggle with nuanced environments or interactive tasks which are common in most real-world scenarios. This highlights the critical need for developing LLMs that can effectively engage in logically consistent multi-turn dialogue, seek information and reason with incomplete data. To this end, we introduce a novel benchmark comprising a suite of multi-turn tasks each designed to test specific reasoning, interactive dialogue, and information-seeking abilities. These tasks have deterministic scoring mechanisms, thus eliminating the need for human intervention. Evaluating frontier models on our benchmark reveals significant headroom. Our analysis shows that most errors emerge from poor instruction following, reasoning failures, and poor planning. This benchmark provides valuable insights into the strengths and weaknesses of current LLMs in handling complex, interactive scenarios and offers a robust platform for future research aimed at improving these critical capabilities.
[37] LaajMeter: A Framework for LaaJ Evaluation
Gal Amram,Eitan Farchi,Shmulik Froimovich,Raviv Gal,Avi Ziv
Main category: cs.CL
TL;DR: The paper introduces LaaJMeter, a simulation-based framework for controlled meta-evaluation of LLM-as-a-Judge (LaaJ) systems, particularly useful in domain-specific contexts with scarce annotated data. It allows systematic analysis of evaluation metrics under realistic conditions, aiding in the validation and refinement of LaaJs for specific tasks.
Details
Motivation: While effective in general domains, LaaJs pose significant challenges in domain-specific contexts, where annotated data is scarce and expert evaluation is costly. This makes it difficult to determine which metrics effectively identify LaaJ quality and what threshold indicates sufficient evaluator performance. Method: This work introduces LaaJMeter, a simulation-based framework for controlled meta-evaluation of LaaJs, enabling engineers to generate synthetic data representing virtual models and judges. Result: The results highlight the limitations of common metrics and the importance of principled metric selection. Conclusion: LaaJMeter provides a scalable and extensible solution for assessing LaaJs in low-resource settings, contributing to the broader effort to ensure trustworthy and reproducible evaluation in NLP. Abstract: Large Language Models (LLMs) are increasingly used as evaluators in natural language processing tasks, a paradigm known as LLM-as-a-Judge (LaaJ). While effective in general domains, LaaJs pose significant challenges in domain-specific contexts, where annotated data is scarce and expert evaluation is costly. In such cases, meta-evaluation is often performed using metrics that have not been validated for the specific domain in which they are applied. As a result, it becomes difficult to determine which metrics effectively identify LaaJ quality, and further, what threshold indicates sufficient evaluator performance. In this work, we introduce LaaJMeter, a simulation-based framework for controlled meta-evaluation of LaaJs. LaaJMeter enables engineers to generate synthetic data representing virtual models and judges, allowing systematic analysis of evaluation metrics under realistic conditions. This helps practitioners validate and refine LaaJs for specific evaluation tasks: they can test whether their metrics correctly distinguish between better and worse (virtual) LaaJs, and estimate appropriate thresholds for evaluator adequacy. 
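The idea of testing a metric against virtual judges of known quality can be illustrated with a small simulation. This is a toy in the spirit of the description above, not LaaJMeter itself: the Gaussian noise model, the pairwise-agreement metric, and all names and values are assumptions chosen for illustration.

```python
import random
from itertools import combinations


def simulate_judge(true_quality, noise_sd, rng):
    """A virtual judge: the true quality of each output plus Gaussian noise."""
    return [q + rng.gauss(0.0, noise_sd) for q in true_quality]


def pairwise_agreement(true_quality, judged):
    """Fraction of output pairs that the judge orders the same way as the truth."""
    pairs = list(combinations(range(len(true_quality)), 2))
    agree = sum(
        (true_quality[i] < true_quality[j]) == (judged[i] < judged[j])
        for i, j in pairs
    )
    return agree / len(pairs)


rng = random.Random(0)
# Synthetic "true" quality scores for 200 virtual model outputs.
true_quality = [rng.random() for _ in range(200)]

perfect_judge = simulate_judge(true_quality, noise_sd=0.0, rng=rng)
noisy_judge = simulate_judge(true_quality, noise_sd=5.0, rng=rng)

perfect_score = pairwise_agreement(true_quality, perfect_judge)
noisy_score = pairwise_agreement(true_quality, noisy_judge)
```

Because the quality of each virtual judge is known by construction, one can check whether a candidate metric ranks the better judge higher, and sweep `noise_sd` to see where the metric stops separating judges.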
We demonstrate the utility of LaaJMeter in a code translation task involving a legacy programming language, showing how different metrics vary in sensitivity to evaluator quality. Our results highlight the limitations of common metrics and the importance of principled metric selection. LaaJMeter provides a scalable and extensible solution for assessing LaaJs in low-resource settings, contributing to the broader effort to ensure trustworthy and reproducible evaluation in NLP.
[38] Estimating Machine Translation Difficulty
Lorenzo Proietti,Stefano Perrella,Vilém Zouhar,Roberto Navigli,Tom Kocmi
Main category: cs.CL
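TL;DR: This paper proposes a method for estimating translation difficulty and introduces a new metric for evaluating difficulty estimators, demonstrating their practical utility for building more challenging machine translation benchmarks.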
Details
Motivation: Machine translation quality has become near-perfect in some setups, which makes it hard to distinguish between state-of-the-art models and to identify directions for future improvement. A method is therefore needed to automatically identify texts that machine translation systems struggle with. Method: The paper formalizes the task of translation difficulty estimation and defines a text's difficulty based on the expected quality of its translations. It introduces a new metric for evaluating difficulty estimators and uses it to assess baselines and novel approaches. Result: Experiments show that dedicated models (Sentinel-src) outperform heuristic-based methods (such as word rarity or syntactic complexity) and LLM-as-a-judge approaches. Conclusion: Difficulty estimators can be used to scan large collections of texts and select those most likely to challenge contemporary machine translation systems; the paper releases two improved difficulty estimation models, Sentinel-src-24 and Sentinel-src-25. Abstract: Machine translation has begun achieving near-perfect quality in some setups. These high-quality outputs make it difficult to distinguish between state-of-the-art models and to identify areas for future improvement. Automatically identifying texts where machine translation systems struggle holds promise for developing more discriminative evaluations and guiding future research. We formalize the task of translation difficulty estimation, defining a text's difficulty based on the expected quality of its translations. We introduce a new metric to evaluate difficulty estimators and use it to assess both baselines and novel approaches. Finally, we demonstrate the practical utility of difficulty estimators by using them to construct more challenging machine translation benchmarks. Our results show that dedicated models (dubbed Sentinel-src) outperform both heuristic-based methods (e.g. word rarity or syntactic complexity) and LLM-as-a-judge approaches. We release two improved models for difficulty estimation, Sentinel-src-24 and Sentinel-src-25, which can be used to scan large collections of texts and select those most likely to challenge contemporary machine translation systems.
[39] Efficient Forward-Only Data Valuation for Pretrained LLMs and VLMs
Wenlong Deng,Jiaming Zhang,Qi Zeng,Christos Thrampoulidis,Boying Gong,Xiaoxiao Li
Main category: cs.CL
TL;DR: For-Value is a forward-only, scalable method for estimating the influence of training samples in large language and vision-language models.
Details
Motivation: To improve transparency and accountability of large models by efficiently quantifying the influence of training samples. Method: For-Value uses a closed-form expression based on a single forward pass, leveraging hidden representations and prediction errors. Result: For-Value accurately identifies impactful examples and detects mislabeled data without costly gradient computations. Conclusion: For-Value is a scalable and efficient framework for influence estimation in large models, matching or outperforming gradient-based methods. Abstract: Quantifying the influence of individual training samples is essential for enhancing the transparency and accountability of large language models (LLMs) and vision-language models (VLMs). However, existing data valuation methods often rely on Hessian information or model retraining, making them computationally prohibitive for billion-parameter models. In this work, we introduce For-Value, a forward-only data valuation framework that enables scalable and efficient influence estimation for both LLMs and VLMs. By leveraging the rich representations of modern foundation models, For-Value computes influence scores using a simple closed-form expression based solely on a single forward pass, thereby eliminating the need for costly gradient computations. Our theoretical analysis demonstrates that For-Value accurately estimates per-sample influence by capturing alignment in hidden representations and prediction errors between training and validation samples. Extensive experiments show that For-Value matches or outperforms gradient-based baselines in identifying impactful fine-tuning examples and effectively detecting mislabeled data.
[40] PakBBQ: A Culturally Adapted Bias Benchmark for QA
Abdullah Hashmat,Muhammad Arham Mirza,Agha Ali Raza
Main category: cs.CL
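TL;DR: This paper introduces PakBBQ, a culturally and regionally adapted bias benchmark dataset for question answering, and evaluates multilingual large language models across different contexts and question framings.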
Details
Motivation: With the widespread adoption of large language models (LLMs) across applications, ensuring their fairness across all user communities is imperative. However, most LLMs are trained and evaluated on Western-centric data, with little attention to low-resource languages and regional contexts. Method: The paper introduces PakBBQ, a culturally and regionally adapted extension of the original Bias Benchmark for Question Answering (BBQ) dataset, comprising over 214 templates and 17180 QA pairs covering eight bias dimensions. Multiple multilingual LLMs are evaluated under ambiguous and explicitly disambiguated contexts and under negative versus non-negative question framings. Result: The experiments show (i) an average accuracy gain of 12% with disambiguation, (ii) stronger counter-bias behavior in Urdu than in English, and (iii) marked framing effects that reduce stereotypical responses when questions are framed negatively. Conclusion: PakBBQ highlights the importance of contextualized benchmarks and simple prompt engineering strategies for bias mitigation in low-resource settings. Abstract: With the widespread adoption of Large Language Models (LLMs) across various applications, it is imperative to ensure their fairness across all user communities. However, most LLMs are trained and evaluated on Western-centric data, with little attention paid to low-resource languages and regional contexts. To address this gap, we introduce PakBBQ, a culturally and regionally adapted extension of the original Bias Benchmark for Question Answering (BBQ) dataset. PakBBQ comprises over 214 templates, 17180 QA pairs across 8 categories in both English and Urdu, covering eight bias dimensions including age, disability, appearance, gender, socio-economic status, religion, regional affiliation, and language formality that are relevant in Pakistan. We evaluate multiple multilingual LLMs under both ambiguous and explicitly disambiguated contexts, as well as negative versus non-negative question framings. Our experiments reveal (i) an average accuracy gain of 12% with disambiguation, (ii) consistently stronger counter-bias behaviors in Urdu than in English, and (iii) marked framing effects that reduce stereotypical responses when questions are posed negatively. These findings highlight the importance of contextualized benchmarks and simple prompt engineering strategies for bias mitigation in low-resource settings.
[41] Prompt-Response Semantic Divergence Metrics for Faithfulness Hallucination and Misalignment Detection in Large Language Models
Igor Halperin
Main category: cs.CL
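TL;DR: This paper proposes Semantic Divergence Metrics (SDM), a new framework for detecting faithfulness hallucinations in large language models; it uses joint clustering and information-theoretic metrics to provide a visual analysis of the dialogue and to identify response types effectively.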
Details
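Motivation: Large language models (LLMs) can produce faithfulness hallucinations when generating text, that is, non-factual or nonsensical text that deviates severely from the input context. Existing methods such as Semantic Entropy measure arbitrariness only through the diversity of answers to a fixed prompt, whereas the proposed method also varies the prompt in order to detect a deeper form of arbitrariness. Method: The approach applies joint clustering to sentence embeddings to build a shared topic space for prompts and answers, visualized as a heatmap of topic co-occurrences. It then computes semantic divergence using information-theoretic metrics such as the Jensen-Shannon divergence, Wasserstein distance, and KL divergence to identify faithfulness hallucinations. Result: The SDM framework measures response consistency across multiple prompts and their semantically equivalent paraphrases, detecting faithfulness hallucinations by quantifying semantic divergence. The results indicate that the KL divergence can serve as a key signal for distinguishing different generative behaviors, and that the SDM metrics can be combined into a diagnostic framework, the Semantic Box, that effectively classifies LLM response types. Conclusion: The paper proposes Semantic Divergence Metrics (SDM), a new framework for detecting faithfulness hallucinations in LLMs. By combining information-theoretic metrics with joint clustering of sentence embeddings, SDM offers a visual and quantitative way to analyze user-machine dialogue and can effectively identify LLM response types, including confident confabulation. Abstract: The proliferation of Large Language Models (LLMs) is challenged by hallucinations, critical failure modes where models generate non-factual, nonsensical or unfaithful text. This paper introduces Semantic Divergence Metrics (SDM), a novel lightweight framework for detecting Faithfulness Hallucinations -- events of severe deviations of LLM responses from input contexts. We focus on a specific implementation of these LLM errors, confabulations, defined as responses that are arbitrary and semantically misaligned with the user's query. Existing methods like Semantic Entropy test for arbitrariness by measuring the diversity of answers to a single, fixed prompt. Our SDM framework improves upon this by being more prompt-aware: we test for a deeper form of arbitrariness by measuring response consistency not only across multiple answers but also across multiple, semantically-equivalent paraphrases of the original prompt. Methodologically, our approach uses joint clustering on sentence embeddings to create a shared topic space for prompts and answers. A heatmap of topic co-occurrences between prompts and responses can be viewed as a quantified two-dimensional visualization of the user-machine dialogue. We then compute a suite of information-theoretic metrics to measure the semantic divergence between prompts and responses. 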
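As a rough illustration of two of the divergence metrics this abstract names, the sketch below computes the KL and Jensen-Shannon divergences between a prompt topic distribution and an answer topic distribution. The distributions are made-up values, not data from the paper; the Wasserstein term is left out because it needs a ground distance between topics that the abstract does not specify, and how the paper combines the terms into its score is not reproduced here.

```python
import math
from typing import Sequence


def kl_divergence(p: Sequence[float], q: Sequence[float]) -> float:
    """KL(p || q) in nats; assumes q[i] > 0 wherever p[i] > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p: Sequence[float], q: Sequence[float]) -> float:
    """Jensen-Shannon divergence: symmetric and bounded by ln 2."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


# Hypothetical share of sentences falling into each of three shared topics.
prompt_topics = [0.6, 0.3, 0.1]
answer_topics = [0.2, 0.3, 0.5]

jsd = js_divergence(prompt_topics, answer_topics)
# KL(Answer || Prompt): how much the answer spreads beyond the prompt's topics.
kl_answer_prompt = kl_divergence(answer_topics, prompt_topics)
```

The Jensen-Shannon divergence is zero when prompt and answer occupy the same topics in the same proportions and reaches ln 2 when they share no topics at all, which makes it a convenient bounded signal of misalignment.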
Our practical score, $\mathcal{S}_H$, combines the Jensen-Shannon divergence and Wasserstein distance to quantify this divergence, with a high score indicating a Faithfulness Hallucination. Furthermore, we identify the KL divergence KL(Answer || Prompt) as a powerful indicator of Semantic Exploration, a key signal for distinguishing different generative behaviors. These metrics are further combined into the Semantic Box, a diagnostic framework for classifying LLM response types, including the dangerous, confident confabulation.
[42] Understanding Textual Emotion Through Emoji Prediction
Ethan Gordon,Nishank Kuppa,Rigved Tummala,Sriram Anasuri
Main category: cs.CL
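TL;DR: This study explores emoji prediction from short text sequences using four deep learning architectures (a feed-forward network, CNN, transformer, and BERT), focusing on class imbalance and highlighting the importance of architecture selection and hyperparameter tuning.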
Details
Motivation: To improve sentiment-aware emoji prediction for human-computer interaction. Method: Using the TweetEval dataset, the study addresses class imbalance through focal loss and regularization techniques, and explores four deep learning architectures: a feed-forward network, CNN, transformer, and BERT. Result: BERT achieves the highest overall performance thanks to its pre-training advantage, while the CNN is more effective on rare emoji classes. Conclusion: BERT performs best overall and the CNN excels on rare emoji classes, underscoring the importance of architecture selection and hyperparameter tuning for sentiment-aware emoji prediction. Abstract: This project explores emoji prediction from short text sequences using four deep learning architectures: a feed-forward network, CNN, transformer, and BERT. Using the TweetEval dataset, we address class imbalance through focal loss and regularization techniques. Results show BERT achieves the highest overall performance due to its pre-training advantage, while CNN demonstrates superior efficacy on rare emoji classes. This research shows the importance of architecture selection and hyperparameter tuning for sentiment-aware emoji prediction, contributing to improved human-computer interaction.
[43] Using Large Language Models to Measure Symptom Severity in Patients At Risk for Schizophrenia
Andrew X. Chen,Guillermo Horga,Sean Escola
Main category: cs.CL
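TL;DR: Large language models (LLMs) can predict BPRS scores for clinical high-risk (CHR) patients with accuracy close to human assessment, and can handle other languages and integrate longitudinal data, which could improve symptom monitoring and treatment decisions.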
Details
Motivation: The BPRS is rarely used in clinical practice because it requires a lengthy structured interview, so more efficient tools are needed to improve symptom assessment and monitoring in CHR patients. Method: LLMs predict BPRS scores from clinical interview transcripts of 409 CHR patients in the AMP-SCZ cohort, and the agreement between predictions and true assessments is evaluated. Result: Zero-shot LLM predictions are close to the true assessments (median concordance: 0.84, ICC: 0.73), and the models show potential for cross-language assessment (median concordance: 0.88, ICC: 0.70) and for integrating longitudinal information. Conclusion: LLMs can effectively predict BPRS scores, offering a standardized and accurate assessment tool for clinical practice, with potential for cross-language use and longitudinal data integration. Abstract: Patients who are at clinical high risk (CHR) for schizophrenia need close monitoring of their symptoms to inform appropriate treatments. The Brief Psychiatric Rating Scale (BPRS) is a validated, commonly used research tool for measuring symptoms in patients with schizophrenia and other psychotic disorders; however, it is not commonly used in clinical practice as it requires a lengthy structured interview. Here, we utilize large language models (LLMs) to predict BPRS scores from clinical interview transcripts in 409 CHR patients from the Accelerating Medicines Partnership Schizophrenia (AMP-SCZ) cohort. Despite the interviews not being specifically structured to measure the BPRS, the zero-shot performance of the LLM predictions compared to the true assessment (median concordance: 0.84, ICC: 0.73) approaches human inter- and intra-rater reliability. We further demonstrate that LLMs have substantial potential to improve and standardize the assessment of CHR patients via their accuracy in assessing the BPRS in foreign languages (median concordance: 0.88, ICC: 0.70), and integrating longitudinal information in a one-shot or few-shot learning approach.
[44] A Computational Approach to Analyzing Language Change and Variation in the Constructed Language Toki Pona
Daniel Huang,Hyoun-A Joo
Main category: cs.CL
TL;DR: This study uses computational methods to show that Toki Pona, though constructed, evolves like natural languages due to sociolinguistic influences and community usage patterns.
Details
Motivation: To understand how constructed languages like Toki Pona change and vary over time and across different usages, similar to natural languages. Method: A computational and corpus-based approach was used to analyze language change and variation in Toki Pona. Result: The study found that sociolinguistic factors influence Toki Pona and that its linguistic structures evolve as community usage changes. Conclusion: Toki Pona, a constructed language, evolves naturally through usage, much like natural languages, influenced by sociolinguistic factors. Abstract: This study explores language change and variation in Toki Pona, a constructed language with approximately 120 core words. Taking a computational and corpus-based approach, the study examines features including fluid word classes and transitivity in order to examine (1) changes in preferences of content words for different syntactic positions over time and (2) variation in usage across different corpora. The results suggest that sociolinguistic factors influence Toki Pona in the same way as natural languages, and that even constructed linguistic systems naturally evolve as communities use them.
[45] Inductive Bias Extraction and Matching for LLM Prompts
Christian M. Angel,Francis Ferraro
Main category: cs.CL
TL;DR: The paper introduces a strategy to better engineer prompts for LLMs by matching the model's inductive bias, leading to significant improvements in model performance.
Details
Motivation: Prompt engineering is an active research area because LLMs are sensitive to small changes in prompt wording. This sensitivity can be attributed in part to the inductive bias in LLMs. Method: The LLM's own output is used as part of its prompt to create wording that matches the inductive bias in the model. Result: Empirical results show an improvement of up to 19% in LLM Likert ratings used for classification and up to 27% in those used for ranking. Conclusion: The strategy of Inductive Bias Extraction and Matching significantly improves the performance of LLMs in classification and ranking tasks. Abstract: The active research topic of prompt engineering makes it evident that LLMs are sensitive to small changes in prompt wording. A portion of this can be ascribed to the inductive bias that is present in the LLM. By using an LLM's output as a portion of its prompt, we can more easily create satisfactory wording for prompts. This has the effect of creating a prompt that matches the inductive bias in the model. Empirically, we show that using this Inductive Bias Extraction and Matching strategy improves LLM Likert ratings used for classification by up to 19% and LLM Likert ratings used for ranking by up to 27%.
[46] Yet another algorithmic bias: A Discursive Analysis of Large Language Models Reinforcing Dominant Discourses on Gender and Race
Gustavo Bonil,Simone Hashiguti,Jhessica Silva,João Gondim,Helena Maia,Nádia Silva,Helio Pedrini,Sandra Avila
Main category: cs.CL
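TL;DR: This study proposes a qualitative analytical framework for detecting gender and racial bias in large language models. It finds that LLM-generated content tends to reinforce stereotypes and that attempts to correct the bias have limited effect, highlighting the need for critical, interdisciplinary research on AI.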
Details
Motivation: As AI advances, large language models are widely deployed, but whether they reproduce discriminatory and racializing biases and maintain existing hegemonic discourses is a pressing question. Current bias detection methods rely mostly on quantitative, automated techniques and overlook the nuanced ways in which bias appears in natural language. The study therefore proposes a complementary qualitative approach. Method: Through manual analysis of LLM-generated short stories about Black and white women, the researchers apply a qualitative, discourse-analytic framework to detect gender and racial bias. Result: Black women are frequently associated with ancestry and resistance in the generated content, while white women more often appear in processes of self-discovery. This pattern reflects how language models reproduce entrenched stereotypes, reinforcing essentialization and a sense of social immobility. In addition, when asked to correct biases, the models offer largely superficial revisions that fail to address the underlying problems. Conclusion: The study concludes that qualitative analysis is essential for identifying and mitigating bias in LLM outputs, and it reveals how these models operate ideologically on gender and race. It stresses the need for critical, interdisciplinary approaches to AI design and deployment. Abstract: With the advance of Artificial Intelligence (AI), Large Language Models (LLMs) have gained prominence and been applied in diverse contexts. As they evolve into more sophisticated versions, it is essential to assess whether they reproduce biases, such as discrimination and racialization, while maintaining hegemonic discourses. Current bias detection approaches rely mostly on quantitative, automated methods, which often overlook the nuanced ways in which biases emerge in natural language. This study proposes a qualitative, discursive framework to complement such methods. Through manual analysis of LLM-generated short stories featuring Black and white women, we investigate gender and racial biases. We contend that qualitative methods such as the one proposed here are fundamental to help both developers and users identify the precise ways in which biases manifest in LLM outputs, thus enabling better conditions to mitigate them. Results show that Black women are portrayed as tied to ancestry and resistance, while white women appear in self-discovery processes. These patterns reflect how language models replicate crystalized discursive representations, reinforcing essentialization and a sense of social immobility. When prompted to correct biases, models offered superficial revisions that maintained problematic meanings, revealing limitations in fostering inclusive narratives.
Our results demonstrate the ideological functioning of algorithms and have significant implications for the ethical use and development of AI. The study reinforces the need for critical, interdisciplinary approaches to AI design and deployment, addressing how LLM-generated discourses reflect and perpetuate inequalities.
[47] ReviewRL: Towards Automated Scientific Review with RL
Sihang Zeng,Kai Tian,Kaiyan Zhang,Yuru wang,Junqi Gao,Runze Liu,Sa Yang,Jingxuan Li,Xinwei Long,Jiaheng Ma,Biqing Qi,Bowen Zhou
Main category: cs.CL
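TL;DR: ReviewRL is a reinforcement learning framework for automated scientific paper review that combines retrieval-augmented context generation, supervised fine-tuning, and reinforcement learning to significantly improve review quality and accuracy.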
Details
Motivation: Peer review is strained by rising submission volumes and reviewer fatigue, and existing automated review methods fall short in factual accuracy, rating consistency, and analytical depth. Method: The approach combines a retrieval-augmented context generation pipeline, supervised fine-tuning, and a reinforcement learning procedure with a composite reward function that improves review quality and rating accuracy. Result: Experiments on ICLR 2025 papers show that ReviewRL significantly outperforms existing methods on both rule-based metrics and model-based quality assessments. Conclusion: ReviewRL provides a reinforcement learning framework for automated scientific paper review and shows promising potential for future development in scientific discovery. Abstract: Peer review is essential for scientific progress but faces growing challenges due to increasing submission volumes and reviewer fatigue. Existing automated review approaches struggle with factual accuracy, rating consistency, and analytical depth, often generating superficial or generic feedback lacking the insights characteristic of high-quality human reviews. We introduce ReviewRL, a reinforcement learning framework for generating comprehensive and factually grounded scientific paper reviews. Our approach combines: (1) an ArXiv-MCP retrieval-augmented context generation pipeline that incorporates relevant scientific literature, (2) supervised fine-tuning that establishes foundational reviewing capabilities, and (3) a reinforcement learning procedure with a composite reward function that jointly enhances review quality and rating accuracy. Experiments on ICLR 2025 papers demonstrate that ReviewRL significantly outperforms existing methods across both rule-based metrics and model-based quality assessments. ReviewRL establishes a foundational framework for RL-driven automatic critique generation in scientific discovery, demonstrating promising potential for future development in this domain. The implementation of ReviewRL will be released at GitHub.
[48] From Surface to Semantics: Semantic Structure Parsing for Table-Centric Document Analysis
Xuan Li,Jialiang Dong,Raymond Wong
Main category: cs.CL
TL;DR: DOTABLER is a semantic document parsing framework that effectively uncovers deep semantic links between tables and their context, achieving high performance in advanced document analysis tasks.
Details
Motivation: Existing studies focus on surface-level tasks like layout analysis and data extraction, lacking deep semantic parsing of tables and their contextual associations. This limits advanced tasks such as cross-paragraph data interpretation and context-consistent analysis. Method: DOTABLER uses a custom dataset and domain-specific fine-tuning of pre-trained models, integrating a complete parsing pipeline to identify context segments semantically tied to tables. Result: Evaluated on nearly 4,000 pages with over 1,000 tables from real-world PDFs, DOTABLER achieves over 90% Precision and F1 scores. Conclusion: DOTABLER is an effective table-centric semantic document parsing framework that outperforms advanced models like GPT-4o in table-context semantic analysis and deep document parsing. Abstract: Documents are core carriers of information and knowledge, with broad applications in finance, healthcare, and scientific research. Tables, as the main medium for structured data, encapsulate key information and are among the most critical document components. Existing studies largely focus on surface-level tasks such as layout analysis, table detection, and data extraction, lacking deep semantic parsing of tables and their contextual associations. This limits advanced tasks like cross-paragraph data interpretation and context-consistent analysis. To address this, we propose DOTABLER, a table-centric semantic document parsing framework designed to uncover deep semantic links between tables and their context. DOTABLER leverages a custom dataset and domain-specific fine-tuning of pre-trained models, integrating a complete parsing pipeline to identify context segments semantically tied to tables. Built on this semantic understanding, DOTABLER implements two core functionalities: table-centric document structure parsing and domain-specific table retrieval, delivering comprehensive table-anchored semantic analysis and precise extraction of semantically relevant tables.
Evaluated on nearly 4,000 pages with over 1,000 tables from real-world PDFs, DOTABLER achieves over 90% Precision and F1 scores, demonstrating superior performance in table-context semantic analysis and deep document parsing compared to advanced models such as GPT-4o.
[49] Beyond Semantic Understanding: Preserving Collaborative Frequency Components in LLM-based Recommendation
Minhao Wang,Yunhang He,Cong Xu,Zhangchi Zhu,Wei Zhang
Main category: cs.CL
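TL;DR: The paper proposes FreLLM4Rec, which balances semantic and collaborative information from a spectral perspective to strengthen collaborative signals in LLM-based recommender systems.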
Details
Motivation: LLMs used in recommender systems tend to overemphasize semantic correlations in users' interaction histories and weaken collaborative signals, whereas traditional models preserve or even enhance those signals. Method: Taking a spectral perspective, the approach purifies item embeddings with a Global Graph Low-Pass Filter (G-LPF) and preserves collaborative signals layer by layer with Temporal Frequency Modulation (TFM). Result: Experiments on four benchmark datasets show that FreLLM4Rec improves NDCG@10 by up to 8.00% over the best baseline. Conclusion: FreLLM4Rec effectively mitigates the attenuation of collaborative signals in LLMs and offers a principled approach to improving LLM-based recommender systems. Abstract: Recommender systems in concert with Large Language Models (LLMs) present promising avenues for generating semantically-informed recommendations. However, LLM-based recommenders exhibit a tendency to overemphasize semantic correlations within users' interaction history. When taking pretrained collaborative ID embeddings as input, LLM-based recommenders progressively weaken the inherent collaborative signals as the embeddings propagate through LLM backbones layer by layer, as opposed to traditional Transformer-based sequential models in which collaborative signals are typically preserved or even enhanced for state-of-the-art performance. To address this limitation, we introduce FreLLM4Rec, an approach designed to balance semantic and collaborative information from a spectral perspective. Item embeddings that incorporate both semantic and collaborative information are first purified using a Global Graph Low-Pass Filter (G-LPF) to preliminarily remove irrelevant high-frequency noise. Temporal Frequency Modulation (TFM) then actively preserves collaborative signal layer by layer. Note that the collaborative preservation capability of TFM is theoretically guaranteed by establishing a connection between the optimal but hard-to-implement local graph Fourier filters and the suboptimal yet computationally efficient frequency-domain filters. Extensive experiments on four benchmark datasets demonstrate that FreLLM4Rec successfully mitigates collaborative signal attenuation and achieves competitive performance, with improvements of up to 8.00% in NDCG@10 over the best baseline.
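To make the idea of low-pass filtering item embeddings on a graph more concrete, here is a generic spectral sketch. It is not the paper's G-LPF: it assumes a small, hypothetical item-item adjacency matrix in which every node has nonzero degree, builds the symmetric normalized Laplacian, and keeps only the `k` lowest-frequency eigenvectors. The function name `graph_low_pass` and the toy data are illustrative only.

```python
import numpy as np


def graph_low_pass(adjacency, embeddings, k):
    """Project embeddings onto the k lowest-frequency graph Fourier modes.

    adjacency:  (n, n) symmetric, non-negative item-item matrix
                (hypothetical; every node is assumed to have degree > 0).
    embeddings: (n, d) item embedding matrix.
    k:          number of low-frequency eigenvectors to keep.
    """
    degree = adjacency.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(degree)
    # Symmetric normalized Laplacian: L = I - D^{-1/2} A D^{-1/2}
    normalized = d_inv_sqrt[:, None] * adjacency * d_inv_sqrt[None, :]
    laplacian = np.eye(adjacency.shape[0]) - normalized
    # eigh returns eigenvalues in ascending order, so the first k columns
    # are the smoothest (lowest-frequency) modes.
    _, eigvecs = np.linalg.eigh(laplacian)
    low = eigvecs[:, :k]
    return low @ (low.T @ embeddings)


# Toy example: a 4-item ring graph with one smooth and one oscillating feature.
ring = np.array(
    [[0, 1, 0, 1],
     [1, 0, 1, 0],
     [0, 1, 0, 1],
     [1, 0, 1, 0]],
    dtype=float,
)
items = np.column_stack([np.ones(4), np.array([1.0, -1.0, 1.0, -1.0])])
print(graph_low_pass(ring, items, k=3))
```

A full eigendecomposition is only practical for small graphs; larger item catalogs would normally call for polynomial or other approximate filters, which this sketch does not attempt.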
Our findings provide insights into how LLMs process collaborative information and offer a principled approach for improving LLM-based recommendation systems.
[50] Cross-Prompt Encoder for Low-Performing Languages
Beso Mikaberidze,Teimuraz Saghinadze,Simon Ostermann,Philipp Muller
Main category: cs.CL
TL;DR: The paper proposes the Cross-Prompt Encoder (XPE) and Dual Soft Prompt mechanism to enhance performance on low-performing languages and improve adaptability in multilingual settings.
Details
Motivation: The motivation is to explore the broader potential of prompt encoders in improving performance on low-performing languages, which achieve poor accuracy even under full-model fine-tuning. Method: The paper introduces the Cross-Prompt Encoder (XPE), which uses a lightweight encoding architecture with multi-source training on typologically diverse languages, and a Dual Soft Prompt mechanism that combines an encoder-based prompt with a directly trained standard soft prompt. Result: Experiments on the SIB-200 benchmark reveal a consistent trade-off: XPE is most effective for low-performing languages, while hybrid variants offer broader adaptability across multilingual settings. Conclusion: The paper concludes that the Cross-Prompt Encoder (XPE) and the Dual Soft Prompt mechanism improve performance on low-performing languages and offer broader adaptability in multilingual settings. Abstract: Soft prompts have emerged as a powerful alternative to adapters in parameter-efficient fine-tuning (PEFT), enabling large language models (LLMs) to adapt to downstream tasks without architectural changes or parameter updates. While prior work has focused on stabilizing training via parameter interaction in small neural prompt encoders, their broader potential for transfer across languages remains unexplored. In this paper, we demonstrate that a prompt encoder can play a central role in improving performance on low-performing languages - those that achieve poor accuracy even under full-model fine-tuning. We introduce the Cross-Prompt Encoder (XPE), which combines a lightweight encoding architecture with multi-source training on typologically diverse languages - a design that enables the model to capture abstract and transferable patterns across languages. To complement XPE, we propose a Dual Soft Prompt mechanism that combines an encoder-based prompt with a directly trained standard soft prompt.
This hybrid design proves especially effective for target languages that benefit from both broadly shared structure and language-specific alignment. Experiments on the SIB-200 benchmark reveal a consistent trade-off: XPE is most effective for low-performing languages, while hybrid variants offer broader adaptability across multilingual settings.
[51] Making Qwen3 Think in Korean with Reinforcement Learning
Jungyup Lee,Jemin Kim,Sang Park,SeungJae Lee
Main category: cs.CL
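TL;DR: The paper proposes a two-stage fine-tuning approach that enables the large language model Qwen3 14B to think in Korean. The first stage performs supervised fine-tuning on a high-quality Korean reasoning dataset; the second applies reinforcement learning with a customized Group Relative Policy Optimization (GRPO) algorithm to improve Korean reasoning and problem solving.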
Details
Motivation: To enable the large language model Qwen3 14B to think in Korean and to improve its performance on Korean-language tasks and general reasoning. Method: The model is first fine-tuned with supervision on a high-quality Korean reasoning dataset to build a solid foundation in Korean logical reasoning; reinforcement learning with a customized Group Relative Policy Optimization algorithm then further improves Korean reasoning and problem solving. Result: The approach achieves stable training and yields substantial improvements on advanced reasoning benchmarks, particularly math and coding tasks, while preserving knowledge and language proficiency. Conclusion: The two-stage fine-tuning approach enables Qwen3 14B to carry out its internal chain-of-thought in Korean and to perform well across multiple tasks. Abstract: We present a two-stage fine-tuning approach to make the large language model Qwen3 14B "think" natively in Korean. In the first stage, supervised fine-tuning (SFT) on a high-quality Korean reasoning dataset establishes a strong foundation in Korean logical reasoning, yielding notable improvements in Korean-language tasks and even some gains in general reasoning ability. In the second stage, we employ reinforcement learning with a customized Group Relative Policy Optimization (GRPO) algorithm to further enhance both Korean reasoning alignment and overall problem-solving performance. We address critical stability challenges in GRPO training - such as reward hacking and policy collapse - by introducing an oracle judge model that calibrates the reward signal. Our approach achieves stable learning (avoiding the collapse observed in naive GRPO) and leads to steady, incremental performance gains. The final RL-tuned model demonstrates substantially improved results on advanced reasoning benchmarks (particularly math and coding tasks) while maintaining knowledge and language proficiency, successfully conducting its internal chain-of-thought entirely in Korean.
[52] Advancing Cross-lingual Aspect-Based Sentiment Analysis with LLMs and Constrained Decoding for Sequence-to-Sequence Models
Jakub Šmíd,Pavel Přibáň,Pavel Král
Main category: cs.CL
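TL;DR: The paper proposes a new method for cross-lingual ABSA tasks that uses constrained decoding to reduce reliance on translation tools and improve performance.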
Details
Motivation: Existing cross-lingual ABSA research focuses mainly on simpler tasks and relies heavily on external translation tools, which is problematic for low-resource languages. Method: The paper uses a sequence-to-sequence approach with constrained decoding to handle compound cross-lingual ABSA tasks. Result: The method improves cross-lingual ABSA performance by up to 10%. A comparison with large language models shows that fine-tuned multilingual models can reach comparable results, while English-centric models struggle with these tasks. Conclusion: The paper presents a new sequence-to-sequence method for compound ABSA tasks that removes the need for translation tools and improves cross-lingual ABSA performance through constrained decoding. Abstract: Aspect-based sentiment analysis (ABSA) has made significant strides, yet challenges remain for low-resource languages due to the predominant focus on English. Current cross-lingual ABSA studies often centre on simpler tasks and rely heavily on external translation tools. In this paper, we present a novel sequence-to-sequence method for compound ABSA tasks that eliminates the need for such tools. Our approach, which uses constrained decoding, improves cross-lingual ABSA performance by up to 10%. This method broadens the scope of cross-lingual ABSA, enabling it to handle more complex tasks and providing a practical, efficient alternative to translation-dependent techniques. Furthermore, we compare our approach with large language models (LLMs) and show that while fine-tuned multilingual LLMs can achieve comparable results, English-centric LLMs struggle with these tasks.
[53] Large Language Models for Summarizing Czech Historical Documents and Beyond
Václav Tran,Jakub Šmíd,Jiří Martínek,Ladislav Lenc,Pavel Král
Main category: cs.CL
TL;DR: This study explores the use of advanced language models for Czech text summarization, achieves top results on an existing dataset, and introduces a new dataset for historical Czech documents.
Details
Motivation: Czech text summarization, especially for historical documents, is underexplored due to linguistic complexities and limited annotated datasets, creating a need for more research in this area. Method: The researchers employed large language models (Mistral and mT5) for Czech text summarization and evaluated their performance on existing and newly introduced datasets. Result: The study achieved new state-of-the-art results on the SumeCzech dataset and introduced a new dataset, Posel od Čerchova, for historical Czech document summarization with baseline results. Conclusion: The study concludes that using large language models like Mistral and mT5 can significantly advance Czech text summarization, particularly for historical documents. Abstract: Text summarization is the task of shortening a larger body of text into a concise version while retaining its essential meaning and key information. While summarization has been significantly explored in English and other high-resource languages, Czech text summarization, particularly for historical documents, remains underexplored due to linguistic complexities and a scarcity of annotated datasets. Large language models such as Mistral and mT5 have demonstrated excellent results on many natural language processing tasks and languages. Therefore, we employ these models for Czech summarization, resulting in two key contributions: (1) achieving new state-of-the-art results on the modern Czech summarization dataset SumeCzech using these advanced models, and (2) introducing a novel dataset called Posel od Čerchova for summarization of historical Czech documents with baseline results. Together, these contributions provide a great potential for advancing Czech text summarization and open new avenues for research in Czech historical text processing.
[54] Improving Generative Cross-lingual Aspect-Based Sentiment Analysis with Constrained Decoding
Jakub Šmíd,Pavel Přibáň,Pavel Král
Main category: cs.CL
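TL;DR: This study proposes a new approach to cross-lingual aspect-based sentiment analysis (ABSA) that uses sequence-to-sequence models with constrained decoding, requires no translation tools, substantially improves performance for low-resource languages, and supports multi-task learning.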
Details
Motivation: Although aspect-based sentiment analysis (ABSA) has made substantial progress, low-resource languages remain under-studied. Existing cross-lingual ABSA methods are usually limited to simple tasks and depend on external translation tools, so a more effective and self-contained approach is needed. Method: The study applies sequence-to-sequence models with constrained decoding to cross-lingual ABSA, with support for multi-task learning. It also evaluates large language models in zero-shot, few-shot, and fine-tuning scenarios. Result: The new approach improves cross-lingual performance by 5% on average on the most complex task, and constrained decoding boosts multi-task results by more than 10%. Across seven languages and six ABSA tasks, it surpasses the state of the art and sets new benchmarks for previously unexplored tasks. Conclusion: By using sequence-to-sequence models with constrained decoding, the study delivers a cross-lingual ABSA approach that does not depend on translation tools and achieves substantial gains for low-resource languages. It also offers practical recommendations for real-world applications, advancing research in cross-lingual ABSA. Abstract: While aspect-based sentiment analysis (ABSA) has made substantial progress, challenges remain for low-resource languages, which are often overlooked in favour of English. Current cross-lingual ABSA approaches focus on limited, less complex tasks and often rely on external translation tools. This paper introduces a novel approach using constrained decoding with sequence-to-sequence models, eliminating the need for unreliable translation tools and improving cross-lingual performance by 5% on average for the most complex task. The proposed method also supports multi-tasking, which enables solving multiple ABSA tasks with a single model, with constrained decoding boosting results by more than 10%. We evaluate our approach across seven languages and six ABSA tasks, surpassing state-of-the-art methods and setting new benchmarks for previously unexplored tasks. Additionally, we assess large language models (LLMs) in zero-shot, few-shot, and fine-tuning scenarios. While LLMs perform poorly in zero-shot and few-shot settings, fine-tuning achieves competitive results compared to smaller multilingual models, albeit at the cost of longer training and inference times. We provide practical recommendations for real-world applications, enhancing the understanding of cross-lingual ABSA methodologies. This study offers valuable insights into the strengths and limitations of cross-lingual ABSA approaches, advancing the state-of-the-art in this challenging research domain.
[55] Jailbreaking Commercial Black-Box LLMs with Explicitly Harmful Prompts
Chiyu Zhang,Lu Zhou,Xiaogang Xu,Jiafei Wu,Liming Fang,Zhe Liu
Main category: cs.CL
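TL;DR: The paper introduces MDH, a new hybrid evaluation framework for detecting malicious content, and proposes two new strategies that raise jailbreak attack success rates.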
Details
Motivation: Existing red-teaming datasets contain many unsuitable prompts and need to be assessed and cleaned. However, existing malicious content detection methods rely either on manual annotation or on large language models (LLMs), whose accuracy is inconsistent across harm types. Method: The paper proposes MDH, a hybrid evaluation framework that combines LLM-based annotation with minimal human oversight, along with two new strategies, D-Attack and DH-CoT. Result: MDH balances accuracy and efficiency well, and well-crafted developer messages significantly raise jailbreak success rates. Conclusion: The paper presents MDH, a hybrid evaluation framework combining LLMs with human oversight for malicious content detection, and demonstrates its effectiveness for dataset cleaning and for detecting jailbroken responses. Abstract: Evaluating jailbreak attacks is challenging when prompts are not overtly harmful or fail to induce harmful outputs. Unfortunately, many existing red-teaming datasets contain such unsuitable prompts. To evaluate attacks accurately, these datasets need to be assessed and cleaned for maliciousness. However, existing malicious content detection methods rely on either manual annotation, which is labor-intensive, or large language models (LLMs), which have inconsistent accuracy in harmful types. To balance accuracy and efficiency, we propose a hybrid evaluation framework named MDH (Malicious content Detection based on LLMs with Human assistance) that combines LLM-based annotation with minimal human oversight, and apply it to dataset cleaning and detection of jailbroken responses. Furthermore, we find that well-crafted developer messages can significantly boost jailbreak success, leading us to propose two new strategies: D-Attack, which leverages context simulation, and DH-CoT, which incorporates hijacked chains of thought. The code, datasets, judgements, and detection results will be released in the GitHub repository: https://github.com/AlienZhang1996/DH-CoT.
[56] Layer-Wise Perturbations via Sparse Autoencoders for Adversarial Text Generation
Huizhen Shu,Xuying Li,Qirui Wang,Yuji Kosuga,Mengqiu Tian,Zhuo Li
Main category: cs.CL
TL;DR: This paper proposes SFPF, a new black-box attack method for adversarial text generation that leverages the interpretability of large models.
Details
Motivation: Generating adversarial examples to jailbreak LLMs remains a key challenge for understanding model vulnerabilities and improving robustness. Method: Sparse Feature Perturbation Framework (SFPF) utilizing sparse autoencoders to identify and manipulate critical features in text. Result: Adversarial texts generated by SFPF can bypass state-of-the-art defense mechanisms. Conclusion: SFPF enables a new red-teaming strategy that balances adversarial effectiveness with safety alignment. Abstract: With the rapid proliferation of Natural Language Processing (NLP), especially Large Language Models (LLMs), generating adversarial examples to jailbreak LLMs remains a key challenge for understanding model vulnerabilities and improving robustness. In this context, we propose a new black-box attack method that leverages the interpretability of large models. We introduce the Sparse Feature Perturbation Framework (SFPF), a novel approach for adversarial text generation that utilizes sparse autoencoders to identify and manipulate critical features in text. After using the SAE model to reconstruct hidden layer representations, we perform feature clustering on the successfully attacked texts to identify features with higher activations. These highly activated features are then perturbed to generate new adversarial texts. This selective perturbation preserves the malicious intent while amplifying safety signals, thereby increasing their potential to evade existing defenses. Our method enables a new red-teaming strategy that balances adversarial effectiveness with safety alignment. 
Experimental results demonstrate that adversarial texts generated by SFPF can bypass state-of-the-art defense mechanisms, revealing persistent vulnerabilities in current NLP systems. However, the method's effectiveness varies across prompts and layers, and its generalizability to other architectures and larger models remains to be validated.
[57] ComoRAG: A Cognitive-Inspired Memory-Organized RAG for Stateful Long Narrative Reasoning
Juyuan Wang,Rongchen Zhao,Wei Wei,Yufeng Wang,Mo Yu,Jie Zhou,Jin Xu,Liyan Xu
Main category: cs.CL
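TL;DR: The paper proposes ComoRAG, a long-narrative comprehension method based on dynamic memory consolidation and iterative reasoning that significantly outperforms traditional RAG methods on long-context benchmarks.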
Details
Motivation: Traditional RAG methods are limited when handling the intricate plots and evolving character relations of long stories and novels, which motivates a dynamic reasoning approach closer to human cognition. Method: ComoRAG uses an iterative reasoning mechanism with a dynamic memory workspace: it generates probing queries and integrates newly retrieved evidence into a global memory pool, building a coherent context for answering the query. Result: On four long-context narrative benchmarks, ComoRAG achieves relative gains of up to 11% over strong RAG baselines and shows a clear advantage on complex queries. Conclusion: ComoRAG offers a new paradigm for retrieval-based long-context comprehension, improving narrative understanding through iterative reasoning and dynamic memory consolidation, especially on complex queries that require global comprehension. Abstract: Narrative comprehension on long stories and novels has been a challenging domain attributed to their intricate plotlines and entangled, often evolving relations among characters and entities. Given the LLM's diminished reasoning over extended context and high computational cost, retrieval-based approaches still play a pivotal role in practice. However, traditional RAG methods can fall short due to their stateless, single-step retrieval process, which often overlooks the dynamic nature of capturing interconnected relations within long-range context. In this work, we propose ComoRAG, holding the principle that narrative reasoning is not a one-shot process, but a dynamic, evolving interplay between new evidence acquisition and past knowledge consolidation, analogous to human cognition when reasoning with memory-related signals in the brain. Specifically, when encountering a reasoning impasse, ComoRAG undergoes iterative reasoning cycles while interacting with a dynamic memory workspace. In each cycle, it generates probing queries to devise new exploratory paths, then integrates the retrieved evidence of new aspects into a global memory pool, thereby supporting the emergence of a coherent context for the query resolution. Across four challenging long-context narrative benchmarks (200K+ tokens), ComoRAG outperforms strong RAG baselines with consistent relative gains up to 11% compared to the strongest baseline. Further analysis reveals that ComoRAG is particularly advantageous for complex queries requiring global comprehension, offering a principled, cognitively motivated paradigm for retrieval-based long context comprehension towards stateful reasoning.
Our code is publicly released at https://github.com/EternityJune25/ComoRAG
[58] Evaluating LLMs on Chinese Idiom Translation
Cai Yang,Yao Dou,David Heineman,Xiaofeng Wu,Wei Xu
Main category: cs.CL
TL;DR: This paper introduces IdiomEval, a new framework for evaluating Chinese idiom translation, revealing that current machine translation systems, including advanced models like GPT-4, struggle with accurately translating idioms, and existing evaluation metrics are inadequate for this task.
Details
Motivation: Despite the prevalence of idioms in everyday Chinese language and recent advancements in machine translation, there has been little research on the translation of Chinese idioms. This study aims to address this gap by systematically evaluating the performance of current translation systems on idiom translation. Method: The researchers introduced IdiomEval, a framework for evaluating Chinese idiom translation that includes a detailed error taxonomy. They annotated 900 translation pairs from nine modern translation systems across four domains and analyzed the types of errors made. They also evaluated existing translation metrics against human ratings. Result: Modern machine translation systems, including GPT-4o and Google Translate, perform poorly in translating Chinese idioms, often producing incorrect, literal, partial, or missing translations. GPT-4, the best-performing system, still makes errors in 28% of cases. Existing evaluation metrics also show poor correlation with human assessments of idiom translation quality. Conclusion: The study concludes that current machine translation systems struggle with translating Chinese idioms effectively, with even the best-performing system, GPT-4, making errors in a significant percentage of cases. Additionally, traditional evaluation metrics are not well-suited for assessing idiom translation quality. Abstract: Idioms, whose figurative meanings usually differ from their literal interpretations, are common in everyday language, especially in Chinese, where they often contain historical references and follow specific structural patterns. Despite recent progress in machine translation with large language models, little is known about Chinese idiom translation. In this work, we introduce IdiomEval, a framework with a comprehensive error taxonomy for Chinese idiom translation. 
We annotate 900 translation pairs from nine modern systems, including GPT-4o and Google Translate, across four domains: web, news, Wikipedia, and social media. We find these systems fail at idiom translation, producing incorrect, literal, partial, or even missing translations. The best-performing system, GPT-4, makes errors in 28% of cases. We also find that existing evaluation metrics measure idiom quality poorly with Pearson correlation below 0.48 with human ratings. We thus develop improved models that achieve F$_1$ scores of 0.68 for detecting idiom translation errors.
[59] Computational Economics in Large Language Models: Exploring Model Behavior and Incentive Design under Resource Constraints
Sandeep Reddy,Kabir Khan,Rohit Patil,Ananya Chakraborty,Faizan A. Khan,Swati Kulkarni,Arjun Verma,Neha Singh
Main category: cs.CL
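TL;DR: This paper proposes a "computational economics" framework that treats a large language model (LLM) as an internal economy of resource-constrained agents (attention heads and neuron blocks) that allocate scarce computation to maximize task utility, substantially reducing computational cost and latency while preserving accuracy.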
Details
Motivation: The high computational cost of large language models limits their application. This paper aims to improve model efficiency by applying economic principles to optimize the allocation of computational resources. Method: The authors propose an incentive-driven training framework that adds a differentiable computation cost term to the task loss, encouraging sparse and efficient activations; they also analyze how standard LLMs reallocate attention to preserve accuracy when computation is scarce. Result: On GLUE and WikiText-103, the method reduces FLOPS by roughly 40% and lowers latency at similar accuracy, while yielding more interpretable attention patterns. Conclusion: The paper shows that a framework grounded in economic principles offers a viable route to designing efficient, adaptive, and transparent LLMs under strict resource constraints. Abstract: Large language models (LLMs) are limited by substantial computational cost. We introduce a "computational economics" framework that treats an LLM as an internal economy of resource-constrained agents (attention heads and neuron blocks) that must allocate scarce computation to maximize task utility. First, we show empirically that when computation is scarce, standard LLMs reallocate attention toward high-value tokens while preserving accuracy. Building on this observation, we propose an incentive-driven training paradigm that augments the task loss with a differentiable computation cost term, encouraging sparse and efficient activations. On GLUE (MNLI, STS-B, CoLA) and WikiText-103, the method yields a family of models that trace a Pareto frontier and consistently dominate post-hoc pruning; for a similar accuracy we obtain roughly a forty percent reduction in FLOPS and lower latency, together with more interpretable attention patterns. These results indicate that economic principles offer a principled route to designing efficient, adaptive, and more transparent LLMs under strict resource constraints.
[60] DiFaR: Enhancing Multimodal Misinformation Detection with Diverse, Factual, and Relevant Rationales
Herun Wan,Jiaying Wu,Minnan Luo,Xiangzheng Kong,Zihan Ma,Zhi Zeng
Main category: cs.CL
TL;DR: DiFaR is a framework that generates diverse, factual, and relevant rationales to enhance misinformation detection.
Details
Motivation: The effectiveness of generating textual rationales to support trainable multimodal misinformation detectors is limited by three core challenges: insufficient diversity in the generated rationales, factual errors caused by hallucination, and irrelevant or conflicting content that introduces noise. Method: DiFaR uses five chain-of-thought prompts to elicit varied reasoning traces and adds a lightweight post-hoc filtering module that selects rationale sentences based on sentence-level factuality and relevance scores. Result: Experiments show that DiFaR outperforms four baseline categories on four popular benchmarks by up to 5.9% and boosts existing detectors by as much as 8.7%. Conclusion: DiFaR is an effective framework that improves multimodal misinformation detection and outperforms baseline models on four popular benchmarks. Abstract: Generating textual rationales from large vision-language models (LVLMs) to support trainable multimodal misinformation detectors has emerged as a promising paradigm. However, its effectiveness is fundamentally limited by three core challenges: (i) insufficient diversity in generated rationales, (ii) factual inaccuracies due to hallucinations, and (iii) irrelevant or conflicting content that introduces noise. We introduce DiFaR, a detector-agnostic framework that produces diverse, factual, and relevant rationales to enhance misinformation detection. DiFaR employs five chain-of-thought prompts to elicit varied reasoning traces from LVLMs and incorporates a lightweight post-hoc filtering module to select rationale sentences based on sentence-level factuality and relevance scores. Extensive experiments on four popular benchmarks demonstrate that DiFaR outperforms four baseline categories by up to 5.9% and boosts existing detectors by as much as 8.7%. Both automatic metrics and human evaluations confirm that DiFaR significantly improves rationale quality across all three dimensions.
[61] When Explainability Meets Privacy: An Investigation at the Intersection of Post-hoc Explainability and Differential Privacy in the Context of Natural Language Processing
Mahdi Dhaini,Stephen Meisenbacher,Ege Erdogan,Florian Matthes,Gjergji Kasneci
Main category: cs.CL
TL;DR: This paper explores the interplay between privacy and explainability in NLP, showing that they can co-exist and offering practical recommendations for future research.
Details
Motivation: There is a lack of research at the intersection of explainability and privacy in NLP, leaving uncertainty about whether both can be achieved simultaneously. Method: Empirical investigation guided by Differential Privacy (DP) and Post-hoc Explainability methods. Result: Findings reveal an intricate relationship between privacy and explainability influenced by factors like the downstream task and methods used for text privatization and explainability. Conclusion: Privacy and explainability in NLP can co-exist, and the study provides practical recommendations for future work at their intersection. Abstract: In the study of trustworthy Natural Language Processing (NLP), a number of important research fields have emerged, including that of explainability and privacy. While research interest in both explainable and privacy-preserving NLP has increased considerably in recent years, there remains a lack of investigation at the intersection of the two. This leaves a considerable gap in understanding of whether achieving both explainability and privacy is possible, or whether the two are at odds with each other. In this work, we conduct an empirical investigation into the privacy-explainability trade-off in the context of NLP, guided by the popular overarching methods of Differential Privacy (DP) and Post-hoc Explainability. Our findings include a view into the intricate relationship between privacy and explainability, which is formed by a number of factors, including the nature of the downstream task and choice of the text privatization and explainability method. In this, we highlight the potential for privacy and explainability to co-exist, and we summarize our findings in a collection of practical recommendations for future work at this important intersection.
[62] When Language Overrules: Revealing Text Dominance in Multimodal Large Language Models
Huyu Wu,Meng Tang,Xinhan Zheng,Haiyun Jiang
Main category: cs.CL
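TL;DR: This paper studies the text dominance problem in multimodal large language models and proposes evaluation metrics and a mitigation method that effectively addresses it.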
Details
Motivation: Multimodal large language models perform well across many tasks but suffer from text dominance: they rely heavily on text while underutilizing other modalities. This paper aims to study the problem systematically and propose a solution. Method: The authors propose two evaluation metrics (MDI and AEI) to measure text dominance and a simple token compression method to rebalance model attention. Result: The analysis shows that text dominance is pervasive and identifies three underlying causes. Applied to LLaVA-7B, the method sharply reduces the MDI from 10.23 to 0.86. Conclusion: The paper provides metrics and a method for measuring and mitigating text dominance in multimodal large language models, and offers a foundation for developing more equitable multimodal models. Abstract: Multimodal Large Language Models (MLLMs) have demonstrated remarkable capabilities across a diverse range of multimodal tasks. However, these models suffer from a core problem known as text dominance: they depend heavily on text for their inference, while underutilizing other modalities. Prior work has acknowledged this phenomenon in vision-language tasks, often attributing it to data biases or model architectures. In this paper, we conduct the first systematic investigation of text dominance across diverse data modalities, including images, videos, audio, time-series, and graphs. To measure this imbalance, we propose two evaluation metrics: the Modality Dominance Index (MDI) and the Attention Efficiency Index (AEI). Our comprehensive analysis reveals that text dominance is both significant and pervasive across all tested modalities. Our in-depth analysis identifies three underlying causes: attention dilution from severe token redundancy in non-textual modalities, the influence of fusion architecture design, and task formulations that implicitly favor textual inputs. Furthermore, we propose a simple token compression method that effectively rebalances model attention. Applying this method to LLaVA-7B, for instance, drastically reduces its MDI from 10.23 to a well-balanced value of 0.86. Our analysis and methodological framework offer a foundation for the development of more equitable and comprehensive multimodal language models.
[63] eDIF: A European Deep Inference Fabric for Remote Interpretability of LLM
Irma Heithoff,Marc Guggenberger,Sandra Kalogiannis,Susanne Mayer,Fabian Maag,Sigurd Schacht,Carsten Lanquillon
Main category: cs.CL
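TL;DR: This paper studies the feasibility of the European Deep Inference Fabric (eDIF), an infrastructure designed to support mechanistic interpretability research on large language models, and evaluates its performance and user feedback through a pilot study.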
Details
Motivation: To make infrastructure for mechanistic interpretability research on large language models widely accessible to the European research community, thereby democratizing advanced model analysis capabilities. Method: The paper introduces a GPU-based cluster hosted at Ansbach University of Applied Sciences that enables remote model inspection via the NNsight API, and evaluates the platform's technical performance, usability, and scientific utility through a structured pilot study. Result: The pilot study showed a gradual increase in user engagement, stable platform performance, and a positive reception of the remote experimentation capabilities. It also identified limitations such as long download times for activation data and intermittent execution interruptions. Conclusion: The eDIF initiative is an important step toward widespread access to LLM interpretability infrastructure in Europe and lays the groundwork for broader deployment, expanded tooling, and sustained community collaboration. Abstract: This paper presents a feasibility study on the deployment of a European Deep Inference Fabric (eDIF), an NDIF-compatible infrastructure designed to support mechanistic interpretability research on large language models. The need for widespread accessibility of LLM interpretability infrastructure in Europe drives this initiative to democratize advanced model analysis capabilities for the research community. The project introduces a GPU-based cluster hosted at Ansbach University of Applied Sciences and interconnected with partner institutions, enabling remote model inspection via the NNsight API. A structured pilot study involving 16 researchers from across Europe evaluated the platform's technical performance, usability, and scientific utility. Users conducted interventions such as activation patching, causal tracing, and representation analysis on models including GPT-2 and DeepSeek-R1-70B. The study revealed a gradual increase in user engagement, stable platform performance throughout, and a positive reception of the remote experimentation capabilities. It also marked the starting point for building a user community around the platform. Identified limitations such as prolonged download durations for activation data as well as intermittent execution interruptions are addressed in the roadmap for future development. 
This initiative marks a significant step towards widespread accessibility of LLM interpretability infrastructure in Europe and lays the groundwork for broader deployment, expanded tooling, and sustained community collaboration in mechanistic interpretability research.
[64] Neural Machine Translation for Coptic-French: Strategies for Low-Resource Ancient Languages
Nasma Chaoui,Richard Khoury
Main category: cs.CL
TL;DR: The paper provides a systematic study on translating Coptic into French, showing that fine-tuning with a stylistically-varied and noise-aware training corpus significantly enhances translation quality.
Details
Motivation: The motivation of this study is to explore strategies for translating Coptic into French, which provides insights for developing translation tools for historical languages. Method: The study systematically evaluates pivot versus direct translation, the impact of pre-training, the benefits of multi-version fine-tuning, and model robustness to noise using aligned biblical corpora. Result: The result of the study is the demonstration that fine-tuning with a stylistically-varied and noise-aware training corpus significantly enhances translation quality. Conclusion: The study concludes that fine-tuning with a stylistically-varied and noise-aware training corpus significantly enhances translation quality, providing crucial practical insights for developing translation tools for historical languages. Abstract: This paper presents the first systematic study of strategies for translating Coptic into French. Our comprehensive pipeline systematically evaluates: pivot versus direct translation, the impact of pre-training, the benefits of multi-version fine-tuning, and model robustness to noise. Utilizing aligned biblical corpora, we demonstrate that fine-tuning with a stylistically-varied and noise-aware training corpus significantly enhances translation quality. Our findings provide crucial practical insights for developing translation tools for historical languages in general.
[65] Continuous Bangla Sign Language Translation: Mitigating the Expense of Gloss Annotation with the Assistance of Graph
Safaeid Hossain Arib,Rabeya Akter,Sejuti Rahman
Main category: cs.CL
TL;DR: This paper introduces an innovative method for sign language translation by fusing graph-based techniques with transformer architecture, achieving significant improvements in translation accuracy and setting new benchmarks for communication accessibility for the deaf and hard of hearing.
Details
Motivation: The motivation behind the research is to address the communication barriers faced by the deaf and hard of hearing due to the underestimation of sign language in spoken language-dominated societies. Method: The method involves integrating graph-based approaches with transformer architecture, specifically combining transformer and STGCN-LSTM architectures to enhance gloss-free translation effectiveness. Result: The proposed method achieves notable improvements in BLEU-4 scores across multiple datasets, introduces benchmarking for the BornilDB v1.0 dataset, and demonstrates superior translation performance compared to existing approaches. Conclusion: The paper concludes that the Continuous Bangla Sign Language Translation project, through its innovative fusion of graph-based methods with transformer architecture, achieves state-of-the-art results in gloss-free translation, significantly improving communication accessibility for the deaf and hard of hearing. Abstract: Millions of individuals worldwide are affected by deafness and hearing impairment. Sign language serves as a sophisticated means of communication for the deaf and hard of hearing. However, in societies that prioritize spoken languages, sign language often faces underestimation, leading to communication barriers and social exclusion. The Continuous Bangla Sign Language Translation project aims to address this gap by enhancing translation methods. While recent approaches leverage transformer architecture for state-of-the-art results, our method integrates graph-based methods with the transformer architecture. This fusion, combining transformer and STGCN-LSTM architectures, proves more effective in gloss-free translation. Our contributions include architectural fusion, exploring various fusion strategies, and achieving a new state-of-the-art performance on diverse sign language datasets, namely RWTH-PHOENIX-2014T, CSL-Daily, How2Sign, and BornilDB v1.0. 
Our approach outperforms current results on all datasets, improving BLEU-4 by 4.01, 2.07, and 0.5 over GASLT, GASLT, and slt_how2sign on RWTH-PHOENIX-2014T, CSL-Daily, and How2Sign, respectively. We also introduce benchmarking on the BornilDB v1.0 dataset for the first time. Our method sets a benchmark for future research, emphasizing the importance of gloss-free translation to improve communication accessibility for the deaf and hard of hearing.
[66] Learning from Natural Language Feedback for Personalized Question Answering
Alireza Salemi,Hamed Zamani
Main category: cs.CL
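TL;DR: The VAC framework replaces scalar rewards with natural language feedback, significantly improving the performance and response quality of personalized question answering systems.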
Details
Motivation: Current approaches to personalizing large language models rely on retrieval-augmented generation and scalar reward signals, but these signals sometimes provide weak, non-instructive feedback, limiting learning efficiency and personalization quality. Method: The authors introduce VAC, a framework that replaces scalar rewards with natural language feedback generated from the user profile and question narrative. Training alternates between optimizing the feedback model and fine-tuning the policy model. Result: On the LaMP-QA benchmark, VAC shows significant improvements over the state of the art in all three domains, and human evaluations confirm the superior quality of the generated responses. Conclusion: Natural language feedback provides a more effective signal for optimizing personalized question answering, and VAC significantly improves personalized response generation. Abstract: Personalization is crucial for enhancing both the effectiveness and user satisfaction of language technologies, particularly in information-seeking tasks like question answering. Current approaches for personalizing large language models (LLMs) often rely on retrieval-augmented generation (RAG), followed by reinforcement learning with scalar reward signals to teach models how to use retrieved personal context. We believe that these scalar rewards sometimes provide weak, non-instructive feedback, limiting learning efficiency and personalization quality. We introduce VAC, a novel framework for personalized response generation that replaces scalar rewards with natural language feedback (NLF) that is generated conditioned on the user profiles and the question narratives. NLF serves as a rich and actionable supervision signal, allowing the policy model to iteratively refine its outputs and internalize effective personalization strategies. Training alternates between optimizing the feedback model and fine-tuning the policy model on the improved responses, resulting in a policy model that no longer requires feedback at inference. Evaluation on the LaMP-QA benchmark that consists of three diverse domains demonstrates consistent and significant improvements over the state-of-the-art results. Human evaluations further confirm the superior quality of the generated responses. These results demonstrate that NLF provides more effective signals for optimizing personalized question answering.
[67] Thinking Inside the Mask: In-Place Prompting in Diffusion LLMs
Xiangqi Jin,Yuxuan Wang,Yifeng Gao,Zichen Wen,Biqing Qi,Dongrui Liu,Linfeng Zhang
Main category: cs.CL
TL;DR: ICE improves prompting for diffusion large language models by enabling bidirectional processing and early exit, resulting in faster and more accurate results.
Details
Motivation: Traditional prefix-only prompting in LLMs limits bidirectional information flow and flexibility, which dLLMs can overcome with their iterative and bidirectional architecture. Method: The ICE framework uses in-place prompting within masked token positions and a confidence-aware early exit mechanism for efficiency. Result: ICE achieved up to 17.29% accuracy improvement with 4.12× speedup on GSM8K and up to 276.67× acceleration on MMLU while maintaining competitive performance. Conclusion: ICE offers a more efficient and flexible prompting framework for dLLMs, outperforming traditional methods in both accuracy and speed across multiple tasks. Abstract: Although large language models (LLMs) have achieved remarkable success, their prefix-only prompting paradigm and sequential generation process offer limited flexibility for bidirectional information. Diffusion large language models (dLLMs) present new opportunities through their bidirectional attention mechanisms and iterative refinement processes, enabling more flexible in-place prompting strategies. We introduce ICE (In-Place Chain-of-Thought Prompting with Early Exit), a novel framework that transforms prefix-only prompting into in-place prompting specifically designed for dLLMs. ICE integrates in-place prompts directly within masked token positions during iterative refinement and employs a confidence-aware early exit mechanism to significantly reduce computational overhead. Extensive experiments demonstrate ICE's effectiveness, achieving up to 17.29% accuracy improvement with 4.12$\times$ speedup on GSM8K, and up to 276.67$\times$ acceleration on MMLU while maintaining competitive performance.
[68] Beyond "Not Novel Enough": Enriching Scholarly Critique with LLM-Assisted Feedback
Osama Mohammed Afzal,Preslav Nakov,Tom Hope,Iryna Gurevych
Main category: cs.CL
TL;DR: A structured, literature-aware method for automated novelty evaluation in peer review, showing strong alignment with human judgments and improving consistency.
Details
Motivation: Novelty assessment in peer review is understudied yet critical, especially in high-volume fields like NLP where reviewer capacity is strained. Method: An automated novelty evaluation approach modeling expert reviewer behavior through content extraction, related work synthesis, and structured comparison, informed by human review analysis. Result: Evaluated on 182 ICLR 2025 submissions, achieving 86.5% alignment with human reasoning and 75.3% agreement on novelty conclusions, outperforming existing LLM baselines. Conclusion: Structured LLM-assisted approaches can enhance peer review rigor and transparency without replacing human expertise. Abstract: Novelty assessment is a central yet understudied aspect of peer review, particularly in high volume fields like NLP where reviewer capacity is increasingly strained. We present a structured approach for automated novelty evaluation that models expert reviewer behavior through three stages: content extraction from submissions, retrieval and synthesis of related work, and structured comparison for evidence based assessment. Our method is informed by a large scale analysis of human written novelty reviews and captures key patterns such as independent claim verification and contextual reasoning. Evaluated on 182 ICLR 2025 submissions with human annotated reviewer novelty assessments, the approach achieves 86.5% alignment with human reasoning and 75.3% agreement on novelty conclusions - substantially outperforming existing LLM based baselines. The method produces detailed, literature aware analyses and improves consistency over ad hoc reviewer judgments. These results highlight the potential for structured LLM assisted approaches to support more rigorous and transparent peer review without displacing human expertise. Data and code are made available.
[69] Reinforced Language Models for Sequential Decision Making
Jim Dilkes,Vahid Yazdanpanah,Sebastian Stein
Main category: cs.CL
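TL;DR: This paper proposes MS-GRPO, a new post-training method for improving small language models on sequential decision-making tasks; small models trained with it can outperform much larger models.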
Details
Motivation: The use of large language models for sequential decision making is limited by their computational cost, so a post-training method suited to multi-step tasks is needed to improve smaller models. Method: The authors introduce the Multi-Step Group-Relative Policy Optimization (MS-GRPO) algorithm, combined with a new absolute-advantage-weighted episode sampling strategy that improves training performance. Result: After post-training on Snake and Frozen Lake, the 3B-parameter model outperforms a 72B-parameter baseline by 50% on Frozen Lake. Conclusion: MS-GRPO is an effective post-training method that improves small models on sequential decision-making tasks, reducing reliance on large-scale models. Abstract: Large Language Models (LLMs) show potential as sequential decision-making agents, but their application is often limited due to a reliance on large, computationally expensive models. This creates a need to improve smaller models, yet existing post-training methods are designed for single-turn interactions and cannot handle credit assignment in multi-step agentic tasks. To address this, we introduce Multi-Step Group-Relative Policy Optimization (MS-GRPO), a new algorithm for post-training LLM agents, grounded in formal Text-Mediated Stochastic Game (TSMG) and Language-Agent Policy (LAP) frameworks. For credit assignment, MS-GRPO attributes the entire cumulative episode reward to each individual episode step. We supplement this algorithm with a novel absolute-advantage-weighted episode sampling strategy that we show improves training performance. We evaluate our approach by post-training a 3-billion parameter model on Snake and Frozen Lake. Our experiments demonstrate that the method is effective in improving decision-making performance: our post-trained 3B parameter model outperforms a 72B parameter baseline by 50% on the Frozen Lake task. This work demonstrates that targeted post-training is a practical and efficient alternative to relying on model scale for creating sequential decision-making agents using LLMs.
[70] Psyche-R1: Towards Reliable Psychological LLMs through Unified Empathy, Expertise, and Reasoning
Chongyuan Dai,Jinpeng Hu,Hongchang Shi,Zhuo Li,Xun Yang,Meng Wang
Main category: cs.CL
TL;DR: This paper introduces Psyche-R1, a novel Chinese psychological large language model that integrates empathy, expertise, and reasoning to improve mental health support through a hybrid training approach.
Details
Motivation: To address the shortage of mental health professionals and enhance psychological applications using reasoning-augmented large language models (LLMs). Method: A hybrid training strategy was used, combining a multi-LLM cross-selection strategy for group relative policy optimization (GRPO) with supervised fine-tuning (SFT) on empathetic dialogues and psychological questions with rationales. Result: The 7B Psyche-R1 achieved comparable results to the 671B DeepSeek-R1 across several psychological benchmarks. Conclusion: Psyche-R1, a Chinese psychological LLM, successfully integrates empathy, psychological expertise, and reasoning, showing promising results in alleviating the burden of mental health disorders. Abstract: Amidst a shortage of qualified mental health professionals, the integration of large language models (LLMs) into psychological applications offers a promising way to alleviate the growing burden of mental health disorders. Recent reasoning-augmented LLMs have achieved remarkable performance in mathematics and programming, while research in the psychological domain has predominantly emphasized emotional support and empathetic dialogue, with limited attention to reasoning mechanisms that are beneficial to generating reliable responses. Therefore, in this paper, we propose Psyche-R1, the first Chinese psychological LLM that jointly integrates empathy, psychological expertise, and reasoning, built upon a novel data curation pipeline. Specifically, we design a comprehensive data synthesis pipeline that produces over 75k high-quality psychological questions paired with detailed rationales, generated through chain-of-thought (CoT) reasoning and iterative prompt-rationale optimization, along with 73k empathetic dialogues. 
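The Method summary above names group relative policy optimization (GRPO). This entry does not spell out the update rule, so the following is only a minimal sketch of the group-relative advantage commonly used in GRPO, assuming scalar rewards for several responses sampled for the same prompt. The function name, the reward values, and the `eps` constant are hypothetical, and implementations differ on whether they use the population or sample standard deviation.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against the mean and standard deviation
    of its own sampling group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation (an assumption)
    # eps keeps the division defined when all rewards in the group are equal.
    return [(r - mu) / (sigma + eps) for r in rewards]

# Hypothetical rewards for four responses sampled for one question.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
print(advantages)
```

Responses scoring above the group mean receive positive advantages and the rest negative ones, so no separate value model is needed to provide a baseline.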
Subsequently, we employ a hybrid training strategy wherein challenging samples are identified through a multi-LLM cross-selection strategy for group relative policy optimization (GRPO) to improve reasoning ability, while the remaining data is used for supervised fine-tuning (SFT) to enhance empathetic response generation and psychological domain knowledge. Extensive experimental results demonstrate the effectiveness of Psyche-R1 across several psychological benchmarks, where our 7B Psyche-R1 achieves comparable results to 671B DeepSeek-R1.
[71] From Black Box to Transparency: Enhancing Automated Interpreting Assessment with Explainable AI in College Classrooms
Zhaokun Jiang,Ziyin Zhang
Main category: cs.CL
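TL;DR: This study proposes a multi-dimensional modeling framework that emphasizes explainability, integrating feature engineering, data augmentation, and explainable machine learning to address shortcomings in existing research on automated interpreting quality assessment.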
Details
Motivation: Existing research insufficiently examines language use quality, achieves unsatisfactory modeling results because of data scarcity and imbalance, and does not explain model predictions. Method: The study integrates feature engineering, data augmentation, and explainable machine learning, using only construct-relevant, transparent features and achieving explainability through Shapley value (SHAP) analysis. Result: The framework achieves strong predictive performance on a new English-Chinese consecutive interpreting dataset. BLEURT and CometKiwi scores are the strongest predictors of fidelity, pause-related features matter most for fluency, and Chinese-specific phraseological diversity metrics matter most for language use. Conclusion: The study presents a highly explainable multi-dimensional modeling framework as a scalable, reliable, and transparent alternative to traditional human evaluation, one that gives learners detailed diagnostic feedback and supports self-regulated learning. Abstract: Recent advancements in machine learning have spurred growing interest in automated interpreting quality assessment. Nevertheless, existing research suffers from insufficient examination of language use quality, unsatisfactory modeling effectiveness due to data scarcity and imbalance, and a lack of efforts to explain model predictions. To address these gaps, we propose a multi-dimensional modeling framework that integrates feature engineering, data augmentation, and explainable machine learning. This approach prioritizes explainability over "black box" predictions by utilizing only construct-relevant, transparent features and conducting Shapley Value (SHAP) analysis. Our results demonstrate strong predictive performance on a novel English-Chinese consecutive interpreting dataset, identifying BLEURT and CometKiwi scores to be the strongest predictive features for fidelity, pause-related features for fluency, and Chinese-specific phraseological diversity metrics for language use. Overall, by placing particular emphasis on explainability, we present a scalable, reliable, and transparent alternative to traditional human evaluation, facilitating the provision of detailed diagnostic feedback for learners and supporting self-regulated learning advantages not afforded by automated scores in isolation.
[72] SSRL: Self-Search Reinforcement Learning
Yuchen Fan,Kaiyan Zhang,Heng Zhou,Yuxin Zuo,Yanxu Chen,Yu Fu,Xinwei Long,Xuekai Zhu,Che Jiang,Yuchen Zhang,Li Kang,Gang Chen,Cheng Huang,Zhizhou He,Bingning Wang,Lei Bai,Ning Ding,Bowen Zhou
Main category: cs.CL
TL;DR: This paper investigates the use of large language models (LLMs) as efficient simulators for reinforcement learning (RL) tasks, reducing the need for costly external search engine interactions. The proposed Self-Search RL (SSRL) method enhances LLMs' search capabilities through format-based and rule-based rewards, enabling internal knowledge utilization without external tools. SSRL-trained models show promise in reducing hallucination and seamlessly integrating with external search engines.
Details
Motivation: We investigate the potential of large language models (LLMs) to serve as efficient simulators for agentic search tasks in reinforcement learning (RL), thereby reducing dependence on costly interactions with external search engines. Method: We first quantify the intrinsic search capability of LLMs via structured prompting and repeated sampling, termed Self-Search. Building on these observations, we introduce Self-Search RL (SSRL), which enhances LLMs' Self-Search capability through format-based and rule-based rewards. Result: Our results reveal that LLMs exhibit strong scaling behavior with respect to the inference budget, achieving high pass@k on question-answering benchmarks, including the challenging BrowseComp task. Empirical evaluations demonstrate that SSRL-trained policy models provide a cost-effective and stable environment for search-driven RL training, reducing reliance on external search engines and facilitating robust sim-to-real transfer. Conclusion: 1) LLMs possess world knowledge that can be effectively elicited to achieve high performance; 2) SSRL demonstrates the potential of leveraging internal knowledge to reduce hallucination; 3) SSRL-trained models integrate seamlessly with external search engines without additional effort. Abstract: We investigate the potential of large language models (LLMs) to serve as efficient simulators for agentic search tasks in reinforcement learning (RL), thereby reducing dependence on costly interactions with external search engines. To this end, we first quantify the intrinsic search capability of LLMs via structured prompting and repeated sampling, which we term Self-Search. Our results reveal that LLMs exhibit strong scaling behavior with respect to the inference budget, achieving high pass@k on question-answering benchmarks, including the challenging BrowseComp task. 
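The pass@k figures mentioned above come from repeated sampling. The entry does not say how pass@k is computed; a minimal sketch, assuming the widely used unbiased estimator 1 - C(n-c, k)/C(n, k) for n samples of which c are correct, is shown here. The function name `pass_at_k` and the sample counts are illustrative, not taken from the paper.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate the probability that at least one of k samples is correct,
    given n total samples of which c were correct."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) is 0 when fewer than k incorrect samples exist,
    # so the estimate is exactly 1.0 in that case.
    return 1.0 - comb(n - c, k) / comb(n, k)

# Hypothetical counts for one question: 64 samples, 3 of them correct.
for k in (1, 8, 64):
    print(k, round(pass_at_k(64, 3, k), 4))
```

Averaging this estimate over all questions gives a benchmark-level pass@k. The estimate rises with k, which is the kind of scaling with inference budget that the abstract reports.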
Building on these observations, we introduce Self-Search RL (SSRL), which enhances LLMs' Self-Search capability through format-based and rule-based rewards. SSRL enables models to iteratively refine their knowledge utilization internally, without requiring access to external tools. Empirical evaluations demonstrate that SSRL-trained policy models provide a cost-effective and stable environment for search-driven RL training, reducing reliance on external search engines and facilitating robust sim-to-real transfer. We draw the following conclusions: 1) LLMs possess world knowledge that can be effectively elicited to achieve high performance; 2) SSRL demonstrates the potential of leveraging internal knowledge to reduce hallucination; 3) SSRL-trained models integrate seamlessly with external search engines without additional effort. Our findings highlight the potential of LLMs to support more scalable RL agent training.
[73] A Survey on Diffusion Language Models
Tianyi Li,Mingda Chen,Bowei Guo,Zhiqiang Shen
Main category: cs.CL
TL;DR: The paper presents a survey on Diffusion Language Models (DLMs), highlighting their advantages over autoregressive models, providing a comprehensive overview of their techniques, applications, and challenges, and outlining future research directions.
Details
Motivation: The motivation of the paper is to provide a comprehensive overview of the current landscape of Diffusion Language Models (DLMs) as an alternative to autoregressive (AR) models. It aims to explore the advantages of DLMs, such as reduced inference latency, bidirectional context capture, and fine-grained generation control, while analyzing their techniques, applications, limitations, and future research directions. Method: The paper provides a survey on the current state of Diffusion Language Models (DLMs). It reviews their evolution, foundational principles, and state-of-the-art models. It also presents a taxonomy and in-depth analysis of DLM techniques, including pre-training strategies, post-training methods, inference strategies, and optimizations. Additionally, the paper explores multimodal extensions of DLMs and discusses their applications in various scenarios. Result: The paper successfully provides a comprehensive and up-to-date overview of the DLM landscape. It offers a taxonomy, in-depth analysis of current techniques, a review of inference strategies and optimizations, and insights into multimodal extensions and applications. It also identifies the limitations and challenges of DLMs and outlines future research directions. Conclusion: DLMs are becoming a strong alternative to AR models in NLP with their advantages in latency reduction, bidirectional context capturing, and generation control. They have comparable performance to AR models and offer promising applications in various NLP tasks. However, challenges such as efficiency, long-sequence handling, and infrastructure requirements need to be addressed for further development. Abstract: Diffusion Language Models (DLMs) are rapidly emerging as a powerful and promising alternative to the dominant autoregressive (AR) paradigm. 
By generating tokens in parallel through an iterative denoising process, DLMs possess inherent advantages in reducing inference latency and capturing bidirectional context, thereby enabling fine-grained control over the generation process. While achieving a several-fold speed-up, recent advancements have allowed DLMs to show performance comparable to their autoregressive counterparts, making them a compelling choice for various natural language processing tasks. In this survey, we provide a holistic overview of the current DLM landscape. We trace its evolution and relationship with other paradigms, such as autoregressive and masked language models, and cover both foundational principles and state-of-the-art models. Our work offers an up-to-date, comprehensive taxonomy and an in-depth analysis of current techniques, from pre-training strategies to advanced post-training methods. Another contribution of this survey is a thorough review of DLM inference strategies and optimizations, including improvements in decoding parallelism, caching mechanisms, and generation quality. We also highlight the latest approaches to multimodal extensions of DLMs and delineate their applications across various practical scenarios. Furthermore, our discussion addresses the limitations and challenges of DLMs, including efficiency, long-sequence handling, and infrastructure requirements, while outlining future research directions to sustain progress in this rapidly evolving field. Project GitHub is available at https://github.com/VILA-Lab/Awesome-DLMs.
cs.CV [Back]
[74] Stochastic-based Patch Filtering for Few-Shot Learning
Javier Rodenas,Eduardo Aguilar,Petia Radeva
Main category: cs.CV
TL;DR: This paper proposes SPFF, a method for few-shot food image classification that filters out patch embeddings weakly correlated with the class representation, improving classification accuracy.
Details
Motivation: Food images are visually complex and variable; for example, the same dish may appear on different plates and under different lighting and viewpoints. As a result, models tend to overlook key elements when comparing query and support images, leading to misclassification. Method: The paper proposes Stochastic-based Patch Filtering (SPFF), which stochastically filters out less relevant patches according to the similarity between patch embeddings and a class-aware embedding, and uses a similarity matrix to quantify the relationship between the query image and the support images. Result: Qualitative analysis and experiments on the few-shot classification benchmarks Food-101, VireoFood-172, and UECFood-256 show that SPFF focuses on patches with class-specific features, filters out irrelevant regions, and outperforms existing state-of-the-art methods. Conclusion: SPFF performs well on few-shot learning for food images; by selecting relevant patches it improves classification accuracy and offers an effective strategy for handling visual complexity and variability. Abstract: Food images present unique challenges for few-shot learning models due to their visual complexity and variability. For instance, a pasta dish might appear with various garnishes on different plates and in diverse lighting conditions and camera perspectives. This problem leads to losing focus on the most important elements when comparing the query with support images, resulting in misclassification. To address this issue, we propose Stochastic-based Patch Filtering for Few-Shot Learning (SPFF) to attend to the patch embeddings that show greater correlation with the class representation. The key concept of SPFF involves the stochastic filtering of patch embeddings, where patches less similar to the class-aware embedding are more likely to be discarded. With patch embedding filtered according to the probability of appearance, we use a similarity matrix that quantifies the relationship between the query image and its respective support images. Through a qualitative analysis, we demonstrate that SPFF effectively focuses on patches where class-specific food features are most prominent while successfully filtering out non-relevant patches. We validate our approach through extensive experiments on few-shot classification benchmarks: Food-101, VireoFood-172 and UECFood-256, outperforming the existing SoA methods.
[75] DINOv3
Oriane Siméoni,Huy V. Vo,Maximilian Seitzer,Federico Baldassarre,Maxime Oquab,Cijo Jose,Vasil Khalidov,Marc Szafraniec,Seungeun Yi,Michaël Ramamonjisoa,Francisco Massa,Daniel Haziza,Luca Wehrstedt,Jianyuan Wang,Timothée Darcet,Théo Moutakanni,Leonel Sentana,Claire Roberts,Andrea Vedaldi,Jamie Tolan,John Brandt,Camille Couprie,Julien Mairal,Hervé Jégou,Patrick Labatut,Piotr Bojanowski
Main category: cs.CV
TL;DR: DINOv3 is a self-supervised vision model that improves training stability and flexibility, outperforming previous models on diverse tasks without fine-tuning.
Details
Motivation: Self-supervised learning aims to eliminate manual data annotation, allowing models to scale to large datasets and architectures while learning visual representations from diverse sources. Method: DINOv3 employs Gram anchoring to prevent degradation of dense feature maps during long training and uses post-hoc strategies to enhance flexibility in resolution, model size, and text alignment. Result: DINOv3 achieves superior performance on a wide range of vision tasks compared to previous self- and weakly-supervised models, producing high-quality dense features without fine-tuning. Conclusion: DINOv3 is a versatile vision foundation model that outperforms specialized state-of-the-art models across various tasks without fine-tuning, providing scalable solutions for diverse resource constraints. Abstract: Self-supervised learning holds the promise of eliminating the need for manual data annotation, enabling models to scale effortlessly to massive datasets and larger architectures. By not being tailored to specific tasks or domains, this training paradigm has the potential to learn visual representations from diverse sources, ranging from natural to aerial images -- using a single algorithm. This technical report introduces DINOv3, a major milestone toward realizing this vision by leveraging simple yet effective strategies. First, we leverage the benefit of scaling both dataset and model size by careful data preparation, design, and optimization. Second, we introduce a new method called Gram anchoring, which effectively addresses the known yet unsolved issue of dense feature maps degrading during long training schedules. Finally, we apply post-hoc strategies that further enhance our models' flexibility with respect to resolution, model size, and alignment with text. As a result, we present a versatile vision foundation model that outperforms the specialized state of the art across a broad range of settings, without fine-tuning. 
DINOv3 produces high-quality dense features that achieve outstanding performance on various vision tasks, significantly surpassing previous self- and weakly-supervised foundation models. We also share the DINOv3 suite of vision models, designed to advance the state of the art on a wide spectrum of tasks and data by providing scalable solutions for diverse resource constraints and deployment scenarios.
[76] Empowering Morphing Attack Detection using Interpretable Image-Text Foundation Model
Sushrut Patwardhan,Raghavendra Ramachandra,Sushma Venkatesh
Main category: cs.CV
TL;DR: This paper introduces a multimodal learning framework for detecting morphing attacks in face recognition systems and generating descriptive text, achieving strong performance in zero-shot evaluations.
Details
Motivation: The motivation stems from the need to enhance the reliability of face recognition systems by detecting morphing attacks, while also providing human-understandable textual descriptions. Method: The authors employed Contrastive Language-Image Pretraining (CLIP) for zero-shot evaluation and analyzed ten different textual prompts. They conducted experiments on a face morphing dataset derived from a publicly available biometric dataset. Result: The proposed framework demonstrated generalizable morphing attack detection capabilities and successfully predicted relevant text snippets. Extensive experiments validated its effectiveness across five morphing generation techniques and three mediums. Conclusion: The paper concludes that the proposed multimodal learning approach effectively detects morphing attacks and predicts relevant textual descriptions, showing promising results in zero-shot evaluation scenarios. Abstract: Morphing attack detection has become an essential component of face recognition systems for ensuring a reliable verification scenario. In this paper, we present a multimodal learning approach that can provide a textual description of morphing attack detection. We first show that zero-shot evaluation of the proposed framework using Contrastive Language-Image Pretraining (CLIP) can yield not only generalizable morphing attack detection, but also predict the most relevant text snippet. We present an extensive analysis of ten different textual prompts that include both short and long textual prompts. These prompts are engineered by considering the human understandable textual snippet. Extensive experiments were performed on a face morphing dataset that was developed using a publicly available face biometric dataset. 
We present an evaluation of SOTA pre-trained neural networks together with the proposed framework in the zero-shot evaluation of five different morphing generation techniques that are captured in three different mediums.
[77] Interpretable Oracle Bone Script Decipherment through Radical and Pictographic Analysis with LVLMs
Kaixin Peng,Mengyang Zhao,Haiyang Yu,Teng Fu,Bin Li
Main category: cs.CV
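TL;DR: This paper proposes an interpretable Oracle Bone Script decipherment method that combines radical analysis with pictographic-semantic understanding, improves zero-shot decipherment performance through progressive training and a dual matching mechanism, and releases a large-scale dataset.
Details
Motivation: Because Oracle Bone Script is rare, abstract, and pictographically diverse, its decipherment has long been a major challenge in archaeology; existing deep learning methods perform poorly in zero-shot settings and on undeciphered characters, and lack interpretability. Method: The paper proposes a progressive training strategy that combines radical analysis with pictographic-semantic understanding, together with a Radical-Pictographic Dual Matching mechanism, enabling reasoning from glyph to meaning. Result: The method achieves state-of-the-art Top-10 accuracy on public benchmarks and superior zero-shot decipherment capability; the model also provides a logical analysis process that can support research on undeciphered Oracle Bone Script. Conclusion: The paper presents an interpretable Oracle Bone Script decipherment method based on large vision-language models and builds a dataset of Oracle Bone Script images with pictographic analysis texts, improving zero-shot decipherment performance and providing archaeologically valuable reference results for undeciphered characters. Abstract: As the oldest mature writing system, Oracle Bone Script (OBS) has long posed significant challenges for archaeological decipherment due to its rarity, abstractness, and pictographic diversity. Current deep learning-based methods have made exciting progress on the OBS decipherment task, but existing approaches often ignore the intricate connections between glyphs and the semantics of OBS. This results in limited generalization and interpretability, especially when addressing zero-shot settings and undeciphered OBS. To this end, we propose an interpretable OBS decipherment method based on Large Vision-Language Models, which synergistically combines radical analysis and pictograph-semantic understanding to bridge the gap between glyphs and meanings of OBS. Specifically, we propose a progressive training strategy that guides the model from radical recognition and analysis to pictographic analysis and mutual analysis, thus enabling reasoning from glyph to meaning. We also design a Radical-Pictographic Dual Matching mechanism informed by the analysis results, significantly enhancing the model's zero-shot decipherment performance. To facilitate model training, we propose the Pictographic Decipherment OBS Dataset, which comprises 47,157 Chinese characters annotated with OBS images and pictographic analysis texts. Experimental results on public benchmarks demonstrate that our approach achieves state-of-the-art Top-10 accuracy and superior zero-shot decipherment capabilities.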
More importantly, our model delivers logical analysis processes, possibly providing archaeologically valuable reference results for undeciphered OBS, and thus has potential applications in digital humanities and historical research. The dataset and code will be released in https://github.com/PKXX1943/PD-OBS.
[78] Deep Learning Enables Large-Scale Shape and Appearance Modeling in Total-Body DXA Imaging
Arianna Bunnell,Devon Cataldi,Yannik Glaser,Thomas K. Wolfgruber,Steven Heymsfield,Alan B. Zonderman,Thomas L. Kelly,Peter Sadowski,John A. Shepherd
Main category: cs.CV
TL;DR: This study develops an efficient deep learning model for automatic keypoint detection on TBDXA images, supporting large-scale association analyses between body composition and health markers.
Details
Motivation: TBDXA imaging is a low-cost whole-body imaging modality used for body composition assessment, but the lack of automated keypoint annotation methods has limited its use in large-scale studies. Method: A deep learning model is trained and validated on a dataset of 1,683 manually annotated TBDXA scans, then used to place keypoints on 35,928 scans for shape and appearance modeling (SAM); associations with health markers are tested with two-sample Kolmogorov-Smirnov tests. Result: On an external test dataset the model achieves 99.5% percentage of correct keypoints, and it is used to generate new hypotheses about associations with a range of health markers. Conclusion: The paper presents a deep learning method that automatically places keypoints on TBDXA scans and validates, through large-scale experiments, its accuracy and its value for health marker analysis. Abstract: Total-body dual X-ray absorptiometry (TBDXA) imaging is a relatively low-cost whole-body imaging modality, widely used for body composition assessment. We develop and validate a deep learning method for automatic fiducial point placement on TBDXA scans using 1,683 manually-annotated TBDXA scans. The method achieves 99.5% percentage correct keypoints in an external testing dataset. To demonstrate the value for shape and appearance modeling (SAM), our method is used to place keypoints on 35,928 scans for five different TBDXA imaging modes, then associations with health markers are tested in two cohorts not used for SAM model generation using two-sample Kolmogorov-Smirnov tests. SAM feature distributions associated with health biomarkers are shown to corroborate existing evidence and generate new hypotheses on body composition and shape's relationship to various frailty, metabolic, inflammation, and cardiometabolic health markers. Evaluation scripts, model weights, automatic point file generation code, and triangulation files are available at https://github.com/hawaii-ai/dxa-pointplacement.
[79] MANGO: Multimodal Attention-based Normalizing Flow Approach to Fusion Learning
Thanh-Dat Truong,Christophe Bobda,Nitin Agarwal,Khoa Luu
Main category: cs.CV
TL;DR: This paper proposes MANGO, a novel multimodal fusion method using attention-based normalizing flow, which achieves superior performance in various multimodal tasks by explicitly capturing complex correlations between modalities.
Details
Motivation: Current multimodal fusion methods rely on implicit attention mechanisms that struggle to capture essential features and complex correlations in multimodal data. This work aims to develop a more explicit, interpretable, and tractable approach to multimodal fusion learning. Method: The paper proposes a Multimodal Attention-based Normalizing Flow (MANGO) approach, which includes an Invertible Cross-Attention (ICA) layer with three cross-attention mechanisms: Modality-to-Modality Cross-Attention (MMCA), Inter-Modality Cross-Attention (IMCA), and Learnable Inter-Modality Cross-Attention (LICA). Result: The experimental results on semantic segmentation, image-to-image translation, and movie genre classification tasks demonstrate that the proposed MANGO approach achieves state-of-the-art performance. Conclusion: The paper concludes that the proposed MANGO approach achieves state-of-the-art performance in multimodal learning tasks, demonstrating its effectiveness in explicitly capturing complex correlations among multimodal data. Abstract: Multimodal learning has gained much success in recent years. However, current multimodal fusion methods adopt the attention mechanism of Transformers to implicitly learn the underlying correlation of multimodal features. As a result, the multimodal model cannot capture the essential features of each modality, making it difficult to comprehend complex structures and correlations of multimodal inputs. This paper introduces a novel Multimodal Attention-based Normalizing Flow (MANGO) approach (the source code of this work will be publicly available) to developing explicit, interpretable, and tractable multimodal fusion learning. In particular, we propose a new Invertible Cross-Attention (ICA) layer to develop the Normalizing Flow-based Model for multimodal data.
To efficiently capture the complex, underlying correlations in multimodal data in our proposed invertible cross-attention layer, we propose three new cross-attention mechanisms: Modality-to-Modality Cross-Attention (MMCA), Inter-Modality Cross-Attention (IMCA), and Learnable Inter-Modality Cross-Attention (LICA). Finally, we introduce a new Multimodal Attention-based Normalizing Flow to enable the scalability of our proposed method to high-dimensional multimodal data. Our experimental results on three different multimodal learning tasks, i.e., semantic segmentation, image-to-image translation, and movie genre classification, have illustrated the state-of-the-art (SoTA) performance of the proposed approach.
[80] Improving watermelon (Citrullus lanatus) disease classification with generative artificial intelligence (GenAI)-based synthetic and real-field images via a custom EfficientNetV2-L model
Nitin Rai,Nathan S. Boyd,Gary E. Vallad,Arnold W. Schumann
Main category: cs.CV
TL;DR: This study finds that combining a small number of real images with a large number of synthetic images significantly improves the performance of crop disease classification models, whereas synthetic images alone are not sufficient.
Details
Motivation: Research on integrating real and synthetic images to improve disease classification performance is limited, so this study investigates whether combining a limited number of real images with synthetic images can improve the prediction accuracy of an EfficientNetV2-L model for watermelon disease classification. Method: The training dataset is divided into five treatments (H0-H4); each is trained with a custom EfficientNetV2-L architecture using enhanced fine-tuning and transfer learning techniques, and model performance metrics are compared across treatments. Result: Models trained on the H2, H3, and H4 treatments show high precision, recall, and F1-score. The weighted F1-score rises from 0.65 on H0 to 1.00 on H3-H4, indicating that adding a small number of real images to a large volume of synthetic images improves model performance and generalizability. Conclusion: The study confirms that synthetic images cannot fully replace real images; the two must be used together in a hybrid manner to maximize the performance of crop disease classification models. Abstract: The current advancements in generative artificial intelligence (GenAI) models have paved the way for new possibilities for generating high-resolution synthetic images, thereby offering a promising alternative to traditional image acquisition for training computer vision models in agriculture. In the context of crop disease diagnosis, GenAI models are being used to create synthetic images of various diseases, potentially facilitating model creation and reducing the dependency on resource-intensive in-field data collection. However, limited research has been conducted on evaluating the effectiveness of integrating real with synthetic images to improve disease classification performance. Therefore, this study aims to investigate whether combining a limited number of real images with synthetic images can enhance the prediction accuracy of an EfficientNetV2-L model for classifying watermelon (Citrullus lanatus) diseases. The training dataset was divided into five treatments: H0 (only real images), H1 (only synthetic images), H2 (1:1 real-to-synthetic), H3 (1:10 real-to-synthetic), and H4 (H3 + random images to improve variability and model generalization). All treatments were trained using a custom EfficientNetV2-L architecture with enhanced fine-tuning and transfer learning techniques. Models trained on H2, H3, and H4 treatments demonstrated high precision, recall, and F1-score metrics.
Additionally, the weighted F1-score increased from 0.65 (on H0) to 1.00 (on H3-H4) signifying that the addition of a small number of real images with a considerable volume of synthetic images improved model performance and generalizability. Overall, this validates the findings that synthetic images alone cannot adequately substitute for real images; instead, both must be used in a hybrid manner to maximize model performance for crop disease classification.
[81] SynSpill: Improved Industrial Spill Detection With Synthetic Data
Aaditya Baranwal,Abdul Mueez,Jason Voelker,Guneet Bhatia,Shruti Vyas
Main category: cs.CV
TL;DR: This study uses the high-quality synthetic dataset SynSpill to address the poor performance of large-scale vision-language models in niche domains such as industrial spill detection, and introduces a scalable framework.
Details
Motivation: Large-scale vision-language models perform poorly in niche domains such as industrial spill detection because such events are rare, sensitive, and difficult to annotate. Method: The paper introduces a scalable framework centered on a high-quality synthetic data generation pipeline. Result: The synthetic dataset SynSpill yields marked improvements for both VLMs and detectors, whose performance becomes comparable. Conclusion: Combining synthetic data with lightweight adaptation offers a cost-effective and scalable path to deploying vision systems in industrial environments. Abstract: Large-scale Vision-Language Models (VLMs) have transformed general-purpose visual recognition through strong zero-shot capabilities. However, their performance degrades significantly in niche, safety-critical domains such as industrial spill detection, where hazardous events are rare, sensitive, and difficult to annotate. This scarcity -- driven by privacy concerns, data sensitivity, and the infrequency of real incidents -- renders conventional fine-tuning of detectors infeasible for most industrial settings. We address this challenge by introducing a scalable framework centered on a high-quality synthetic data generation pipeline. We demonstrate that this synthetic corpus enables effective Parameter-Efficient Fine-Tuning (PEFT) of VLMs and substantially boosts the performance of state-of-the-art object detectors such as YOLO and DETR. Notably, in the absence of synthetic data (SynSpill dataset), VLMs still generalize better to unseen spill scenarios than these detectors. When SynSpill is used, both VLMs and detectors achieve marked improvements, with their performance becoming comparable. Our results underscore that high-fidelity synthetic data is a powerful means to bridge the domain gap in safety-critical applications. The combination of synthetic generation and lightweight adaptation offers a cost-effective, scalable pathway for deploying vision systems in industrial environments where real data is scarce/impractical to obtain. Project Page: https://synspill.vercel.app
[82] EntropyGS: An Efficient Entropy Coding on 3D Gaussian Splatting
Yuning Huang,Jiahao Pang,Fengqing Zhu,Dong Tian
Main category: cs.CV
TL;DR: This paper proposes EntropyGS, a method for compressing 3DGS Gaussian attributes; by analyzing the statistics and correlations of the attributes and applying factorized, parameterized entropy coding, it significantly reduces the data rate while preserving rendering quality.
Details
Motivation: The two 3DGS tasks, Gaussian creation and view rendering, are typically separated over time or devices, so storage, transmission, and compression of 3DGS Gaussian data become necessary. Method: The paper proposes EntropyGS, a factorized and parameterized entropy coding method that estimates the distribution parameters of each Gaussian attribute to drive entropy coding and quantizes adaptively according to attribute type. Result: EntropyGS achieves about a 30x rate reduction on benchmark datasets while maintaining rendering quality similar to the input 3DGS data, with fast encoding and decoding times. Conclusion: EntropyGS compresses 3DGS Gaussian attributes effectively: by analyzing attribute statistics and correlations and applying factorized, parameterized entropy coding, it achieves about a 30x rate reduction while preserving rendering quality. Abstract: As an emerging novel view synthesis approach, 3D Gaussian Splatting (3DGS) demonstrates fast training/rendering with superior visual quality. The two tasks of 3DGS, Gaussian creation and view rendering, are typically separated over time or devices, and thus storage/transmission and finally compression of 3DGS Gaussians become necessary. We begin with a correlation and statistical analysis of 3DGS Gaussian attributes. An inspiring finding in this work reveals that spherical harmonic AC attributes precisely follow Laplace distributions, while mixtures of Gaussian distributions can approximate rotation, scaling, and opacity. Additionally, harmonic AC attributes manifest weak correlations with other attributes except for inherited correlations from a color space. A factorized and parameterized entropy coding method, EntropyGS, is hereinafter proposed. During encoding, distribution parameters of each Gaussian attribute are estimated to assist their entropy coding. The quantization for entropy coding is adaptively performed according to Gaussian attribute types. EntropyGS demonstrates about 30x rate reduction on benchmark datasets while maintaining similar rendering quality compared to input 3DGS data, with a fast encoding and decoding time.
[83] CellSymphony: Deciphering the molecular and phenotypic orchestration of cells with single-cell pathomics
Paul H. Acosta,Pingjun Chen,Simon P. Castillo,Maria Esther Salvatierra,Yinyin Yuan,Xiaoxi Pan
Main category: cs.CV
TL;DR: This paper presents CellSymphony, a framework that uses foundation models to extract features from Xenium transcriptomic profiles and histology images and fuses them multimodally, achieving high-accuracy cell type annotation and microenvironment analysis.
Details
Motivation: Although histology images contain rich morphological information, extracting robust cell-level features and integrating them with spatial transcriptomics data remains a key challenge. Method: The paper introduces CellSymphony, a flexible multimodal framework that uses foundation models to extract single-cell-resolution features from Xenium transcriptomic profiles and histology images. Result: CellSymphony achieves accurate cell type annotation across three cancer types and uncovers distinct microenvironmental niches. Conclusion: By learning joint representations of Xenium transcriptomic profiles and histology images, CellSymphony achieves accurate cell type annotation and reveals distinct microenvironmental niches across three cancer types. Abstract: Xenium, a new spatial transcriptomics platform, enables subcellular-resolution profiling of complex tumor tissues. Despite the rich morphological information in histology images, extracting robust cell-level features and integrating them with spatial transcriptomics data remains a critical challenge. We introduce CellSymphony, a flexible multimodal framework that leverages foundation model-derived embeddings from both Xenium transcriptomic profiles and histology images at true single-cell resolution. By learning joint representations that fuse spatial gene expression with morphological context, CellSymphony achieves accurate cell type annotation and uncovers distinct microenvironmental niches across three cancer types. This work highlights the potential of foundation models and multimodal fusion for deciphering the physiological and phenotypic orchestration of cells within complex tissue ecosystems.
[84] Deep Learning for Crack Detection: A Review of Learning Paradigms, Generalizability, and Datasets
Xinan Zhang,Haolin Wang,Yung-An Hsieh,Zhongyu Yang,Anthony Yezzi,Yi-Chang Tsai
Main category: cs.CV
TL;DR: This paper reviews trends in deep learning for crack detection, introduces 3DCrack, a new dataset collected with 3D laser scans, and establishes benchmark results for commonly used deep learning methods.
Details
Motivation: Crack detection is crucial for civil infrastructure, and advances in deep learning are reshaping the field. Studying these emerging trends helps advance crack detection technology. Method: The paper conducts a systematic review of trends in crack detection and introduces a new dataset, 3DCrack, for benchmarking. Result: The paper identifies shifts in learning paradigms, generalizability, and dataset acquisition in crack detection, and establishes performance baselines for commonly used deep learning methods through benchmarking. Conclusion: The paper systematically analyzes trends in deep learning for crack detection and introduces a new dataset, 3DCrack, providing benchmarks and a foundation for future research. Abstract: Crack detection plays a crucial role in civil infrastructures, including inspection of pavements, buildings, etc., and deep learning has significantly advanced this field in recent years. While numerous technical and review papers exist in this domain, emerging trends are reshaping the landscape. These shifts include transitions in learning paradigms (from fully supervised learning to semi-supervised, weakly-supervised, unsupervised, few-shot, domain adaptation and fine-tuning foundation models), improvements in generalizability (from single-dataset performance to cross-dataset evaluation), and diversification in dataset reacquisition (from RGB images to specialized sensor-based data). In this review, we systematically analyze these trends and highlight representative works. Additionally, we introduce a new dataset collected with 3D laser scans, 3DCrack, to support future research and conduct extensive benchmarking experiments to establish baselines for commonly used deep learning methodologies, including recent foundation models. Our findings provide insights into the evolving methodologies and future directions in deep learning-based crack detection. Project page: https://github.com/nantonzhang/Awesome-Crack-Detection
[85] MRFD: Multi-Region Fusion Decoding with Self-Consistency for Mitigating Hallucinations in LVLMs
Haonan Ge,Yiwei Wang,Ming-Hsuan Yang,Yujun Cai
Main category: cs.CV
TL;DR: This paper proposes MRFD, a training-free decoding method that enhances the factual grounding of Large Vision-Language Models by leveraging inter-region consistency in image analysis.
Details
Motivation: The motivation is to enhance the factual grounding of LVLMs due to their tendency to produce hallucinations stemming from limited region-based information verification. Method: The method involves identifying salient image regions, generating initial responses for each, computing reliability weights using Jensen-Shannon Divergence, and applying a consistency-aware fusion with region-aware prompts. Result: The experiments demonstrate that MRFD significantly improves response factuality and reduces hallucinations across multiple LVLMs and benchmarks. Conclusion: MRFD effectively reduces hallucinations in LVLMs by leveraging region-level consistency without the need for training. Abstract: Large Vision-Language Models (LVLMs) have shown strong performance across multimodal tasks. However, they often produce hallucinations -- text that is inconsistent with visual input, due to the limited ability to verify information in different regions of the image. To address this, we propose Multi-Region Fusion Decoding (MRFD), a training-free decoding method that improves factual grounding by modeling inter-region consistency. MRFD identifies salient regions using cross-attention, generates initial responses for each, and computes reliability weights based on Jensen-Shannon Divergence (JSD) among the responses. These weights guide a consistency-aware fusion of per-region predictions, using region-aware prompts inspired by Chain-of-Thought reasoning. Experiments across multiple LVLMs and benchmarks show that MRFD significantly reduces hallucinations and improves response factuality without requiring model updates.
[86] Pose-Robust Calibration Strategy for Point-of-Gaze Estimation on Mobile Phones
Yujie Zhao,Jiabei Zeng,Shiguang Shan
Main category: cs.CV
TL;DR: This paper proposes a dynamic calibration strategy that improves the robustness of PoG estimators by introducing head pose variation during a user-friendly and efficient calibration process.
Details
Motivation: Because of individual differences, appearance-based PoG estimators generalize poorly across people and require person-specific calibration. However, calibrated PoG estimators are often sensitive to head pose variations. Method: The authors construct a benchmark, MobilePoG, and use it to systematically analyze how the diversity of calibration points and head poses affects estimation accuracy. Result: Experiments show that introducing a wider range of head poses during calibration improves the estimator's ability to handle pose variation. Conclusion: The dynamic calibration strategy handles head pose variation better than conventional calibration strategies, making the PoG estimator less sensitive to head pose changes. Abstract: Although appearance-based point-of-gaze (PoG) estimation has improved, the estimators still struggle to generalize across individuals due to personal differences. Therefore, person-specific calibration is required for accurate PoG estimation. However, calibrated PoG estimators are often sensitive to head pose variations. To address this, we investigate the key factors influencing calibrated estimators and explore pose-robust calibration strategies. Specifically, we first construct a benchmark, MobilePoG, which includes facial images from 32 individuals focusing on designated points under either fixed or continuously changing head poses. Using this benchmark, we systematically analyze how the diversity of calibration points and head poses influences estimation accuracy. Our experiments show that introducing a wider range of head poses during calibration improves the estimator's ability to handle pose variation. Building on this insight, we propose a dynamic calibration strategy in which users fixate on calibration points while moving their phones. This strategy naturally introduces head pose variation during a user-friendly and efficient calibration process, ultimately producing a better calibrated PoG estimator that is less sensitive to head pose variations than those using conventional calibration strategies. Codes and datasets are available at our project page.
[87] High Fidelity Text to Image Generation with Contrastive Alignment and Structural Guidance
Danyi Gao
Main category: cs.CV
TL;DR: This paper proposes a new text-driven image generation method that integrates text-image contrastive constraints with a structural guidance mechanism to address the performance bottlenecks of existing methods in semantic alignment accuracy and structural consistency.
Details
Motivation: The paper aims to address the performance bottlenecks of existing text-driven image generation methods in semantic alignment accuracy and structural consistency. Method: The paper proposes a high-fidelity image generation method that integrates text-image contrastive constraints with structural guidance mechanisms. A contrastive learning module builds strong cross-modal alignment constraints, while structural priors such as semantic layout maps or edge sketches guide the generator in spatial-level structural modeling. Result: Quantitative metrics confirm the superior performance of the proposed method in terms of CLIP Score, FID, and SSIM. Experiments show that the method generates semantically clear and structurally complete images. Conclusion: The proposed method bridges the gap between semantic alignment and structural fidelity without increasing computational complexity, demonstrating a viable technical path for joint text-image modeling and image generation. Abstract: This paper addresses the performance bottlenecks of existing text-driven image generation methods in terms of semantic alignment accuracy and structural consistency. A high-fidelity image generation method is proposed by integrating text-image contrastive constraints with structural guidance mechanisms. The approach introduces a contrastive learning module that builds strong cross-modal alignment constraints to improve semantic matching between text and image. At the same time, structural priors such as semantic layout maps or edge sketches are used to guide the generator in spatial-level structural modeling. This enhances the layout completeness and detail fidelity of the generated images. Within the overall framework, the model jointly optimizes contrastive loss, structural consistency loss, and semantic preservation loss. A multi-objective supervision mechanism is adopted to improve the semantic consistency and controllability of the generated content. Systematic experiments are conducted on the COCO-2014 dataset. Sensitivity analyses are performed on embedding dimensions, text length, and structural guidance strength. Quantitative metrics confirm the superior performance of the proposed method in terms of CLIP Score, FID, and SSIM. The results show that the method effectively bridges the gap between semantic alignment and structural fidelity without increasing computational complexity.
It demonstrates a strong ability to generate semantically clear and structurally complete images, offering a viable technical path for joint text-image modeling and image generation.
[88] VIFSS: View-Invariant and Figure Skating-Specific Pose Representation Learning for Temporal Action Segmentation
Ryota Tanaka,Tomohiro Suzuki,Keisuke Fujii
Main category: cs.CV
TL;DR: This paper proposes a novel TAS framework for figure skating jumps that integrates 3D pose representation learning and procedural structure annotations, achieving high accuracy even with limited data.
Details
Motivation: Accurate recognition of figure skating jumps is crucial for objective performance evaluation, but existing TAS methods face challenges due to insufficient annotated data and lack of consideration for 3D aspects and procedural structures of jumps. Method: A new Temporal Action Segmentation (TAS) framework incorporating 3D pose representation learning (VIFSS) and a fine-grained annotation scheme for jump phases. Result: The method achieves over 92% F1@50 on element-level TAS and demonstrates effectiveness in limited data scenarios due to view-invariant contrastive pre-training. Conclusion: The proposed TAS framework with VIFSS and fine-grained annotations effectively addresses the challenges in recognizing figure skating jumps, particularly showing high performance with limited data. Abstract: Understanding human actions from videos plays a critical role across various domains, including sports analytics. In figure skating, accurately recognizing the type and timing of jumps a skater performs is essential for objective performance evaluation. However, this task typically requires expert-level knowledge due to the fine-grained and complex nature of jump procedures. While recent approaches have attempted to automate this task using Temporal Action Segmentation (TAS), there are two major limitations to TAS for figure skating: the annotated data is insufficient, and existing methods do not account for the inherent three-dimensional aspects and procedural structure of jump actions. In this work, we propose a new TAS framework for figure skating jumps that explicitly incorporates both the three-dimensional nature and the semantic procedure of jump movements. First, we propose a novel View-Invariant, Figure Skating-Specific pose representation learning approach (VIFSS) that combines contrastive learning as pre-training and action classification as fine-tuning. 
For view-invariant contrastive pre-training, we construct FS-Jump3D, the first publicly available 3D pose dataset specialized for figure skating jumps. Second, we introduce a fine-grained annotation scheme that marks the "entry (preparation)" and "landing" phases, enabling TAS models to learn the procedural structure of jumps. Extensive experiments demonstrate the effectiveness of our framework. Our method achieves over 92% F1@50 on element-level TAS, which requires recognizing both jump types and rotation levels. Furthermore, we show that view-invariant contrastive pre-training is particularly effective when fine-tuning data is limited, highlighting the practicality of our approach in real-world scenarios.
[89] JRDB-Reasoning: A Difficulty-Graded Benchmark for Visual Reasoning in Robotics
Simindokht Jahangard,Mehrzad Mohammadi,Yi Shen,Zhixi Cai,Hamid Rezatofighi
Main category: cs.CV
TL;DR: This paper introduces JRDB-Reasoning, a new benchmark for visual reasoning in crowded environments, offering customizable questions and detailed annotations for better evaluation of AI models.
Details
Motivation: The motivation is to address limitations in existing visual reasoning benchmarks, such as the lack of defined reasoning complexity, customization, and detailed annotations. Method: The paper introduces an adaptive query engine and extends the JRDB dataset with additional annotations to create the JRDB-Reasoning benchmark. Result: The result is the creation of JRDB-Reasoning, a benchmark that enables customizable question generation and structured evaluation of visual reasoning. Conclusion: The paper concludes that the proposed benchmark and engine allow for fine-grained evaluation and dynamic assessment of visual reasoning frameworks and VLMs in human-crowded environments. Abstract: Recent advances in Vision-Language Models (VLMs) and large language models (LLMs) have greatly enhanced visual reasoning, a key capability for embodied AI agents like robots. However, existing visual reasoning benchmarks often suffer from several limitations: they lack a clear definition of reasoning complexity, offer no control over question generation across varying difficulty and task customization, and fail to provide structured, step-by-step reasoning annotations (workflows). To bridge these gaps, we formalize reasoning complexity, introduce an adaptive query engine that generates customizable questions of varying complexity with detailed intermediate annotations, and extend the JRDB dataset with human-object interaction and geometric relationship annotations to create JRDB-Reasoning, a benchmark tailored for visual reasoning in human-crowded environments. Our engine and benchmark enable fine-grained evaluation of visual reasoning frameworks and dynamic assessment of visual-language models across reasoning levels.
[90] A Sub-Pixel Multimodal Optical Remote Sensing Images Matching Method
Tao Huang,Hongbo Pan,Nanxi Zhou,Shun Zhou
Main category: cs.CV
TL;DR: This paper proposes a novel PCWLAD method for multimodal image matching, combining coarse and fine matching techniques to enhance accuracy, achieving impressive results on three image datasets.
Details
Motivation: The research aims to address degraded image matching accuracy caused by nonlinear radiation and geometric deformation differences in multimodal optical images. Method: The method involves two steps: coarse matching using the structural similarity index measure (SSIM) and fine matching using phase consistency weighted least absolute deviation (WLAD), with mutual structure filtering to mitigate noise impact. Result: The PCWLAD method outperformed eight state-of-the-art methods in terms of correct matching rate (CMR) and root mean square error (RMSE). Conclusion: The proposed PCWLAD method demonstrates superior performance in matching accuracy for multimodal optical images, achieving an average accuracy of approximately 0.4 pixels across three datasets. Abstract: High-accuracy matching of multimodal optical images is the basis of geometric processing. However, the image matching accuracy is usually degraded by the nonlinear radiation and geometric deformation differences caused by different spectral responses. To address these problems, we proposed a phase consistency weighted least absolute deviation (PCWLAD) sub-pixel template matching method to improve the matching accuracy of multimodal optical images. This method consists of two main steps: coarse matching with the structural similarity index measure (SSIM) and fine matching with WLAD. In the coarse matching step, PCs are calculated without a noise filter to preserve the original structural details, and template matching is performed using the SSIM. In the fine matching step, we applied the radiometric and geometric transformation models between two multimodal PC templates based on the coarse matching. Furthermore, mutual structure filtering is adopted in the model to mitigate the impact of noise within the corresponding templates on the structural consistency, and the WLAD criterion is used to estimate the sub-pixel offset. 
To evaluate the performance of PCWLAD, we created three types of image datasets: visible to infrared Landsat images, visible to near-infrared close-range images, and visible to infrared uncrewed aerial vehicle (UAV) images. PCWLAD outperformed eight existing state-of-the-art methods in terms of correct matching rate (CMR) and root mean square error (RMSE) and reached an average matching accuracy of approximately 0.4 pixels across all three datasets. Our software and datasets are publicly available at https://github.com/huangtaocsu/PCWLAD.
[91] InterSyn: Interleaved Learning for Dynamic Motion Synthesis in the Wild
Yiyi Ma,Yuanzhi Liang,Xiu Li,Chi Zhang,Xuelong Li
Main category: cs.CV
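TL;DR: InterSyn uses an interleaved learning strategy that jointly learns solo and interactive motions to generate more realistic and diverse motion sequences.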
Details
Motivation: Previous interaction motion generation methods handle solo and interactive motions separately, whereas InterSyn uses a unified framework to model the natural interactions and coordination of real-world scenarios. Method: An interleaved learning strategy is proposed, comprising the Interleaved Interaction Synthesis module and the Relative Coordination Refinement module. Result: Experiments show that the motion sequences generated by InterSyn outperform existing methods in text-to-motion alignment and diversity. Conclusion: InterSyn is a novel motion synthesis framework that generates realistic interaction motions by learning from integrated motions. Abstract: We present Interleaved Learning for Motion Synthesis (InterSyn), a novel framework that targets the generation of realistic interaction motions by learning from integrated motions that consider both solo and multi-person dynamics. Unlike previous methods that treat these components separately, InterSyn employs an interleaved learning strategy to capture the natural, dynamic interactions and nuanced coordination inherent in real-world scenarios. Our framework comprises two key modules: the Interleaved Interaction Synthesis (INS) module, which jointly models solo and interactive behaviors in a unified paradigm from a first-person perspective to support multiple character interactions, and the Relative Coordination Refinement (REC) module, which refines mutual dynamics and ensures synchronized motions among characters. Experimental results show that the motion sequences generated by InterSyn exhibit higher text-to-motion alignment and improved diversity compared with recent methods, setting a new benchmark for robust and natural motion synthesis. Additionally, our code will be open-sourced in the future to promote further research and development in this area.
[92] From Pixel to Mask: A Survey of Out-of-Distribution Segmentation
Wenjie Zhao,Jia Li,Yunhui Guo
Main category: cs.CV
TL;DR: This survey paper explores out-of-distribution (OoD) detection and segmentation methods, focusing on their importance in AI security for safety-critical applications like autonomous driving. It categorizes current approaches, reviews recent advances, highlights challenges, and suggests future research directions.
Details
Motivation: The motivation behind this paper is the increasing concern about AI security, particularly in safety-critical applications like autonomous driving, where there is a need for precise segmentation of out-of-distribution (OoD) objects to enable targeted control actions and enhance system robustness. Method: The paper groups current OoD segmentation approaches into four categories: (i) test-time OoD segmentation, (ii) outlier exposure for supervised training, (iii) reconstruction-based methods, and (iv) approaches that leverage powerful models. Result: The result of the paper is a comprehensive survey that categorizes current OoD segmentation approaches, reviews recent advances, identifies challenges, and discusses future research directions in the field of OoD segmentation for autonomous-driving scenarios. Conclusion: The paper concludes by systematically reviewing recent advances in OoD segmentation for autonomous-driving scenarios, identifying emerging challenges, and discussing promising future research directions. Abstract: Out-of-distribution (OoD) detection and segmentation have attracted growing attention as concerns about AI security rise. Conventional OoD detection methods identify the existence of OoD objects but lack spatial localization, limiting their usefulness in downstream tasks. OoD segmentation addresses this limitation by localizing anomalous objects at pixel-level granularity. This capability is crucial for safety-critical applications such as autonomous driving, where perception modules must not only detect but also precisely segment OoD objects, enabling targeted control actions and enhancing overall system robustness. In this survey, we group current OoD segmentation approaches into four categories: (i) test-time OoD segmentation, (ii) outlier exposure for supervised training, (iii) reconstruction-based methods, and (iv) approaches that leverage powerful models. 
We systematically review recent advances in OoD segmentation for autonomous-driving scenarios, identify emerging challenges, and discuss promising future research directions.
[93] Integrating Reinforcement Learning with Visual Generative Models: Foundations and Advances
Yuanzhi Liang,Yijie Fang,Rui Li,Ziqi Ni,Ruijie Su,Chi Zhang,Xuelong Li
Main category: cs.CV
TL;DR: This survey explores the application of reinforcement learning in visual content generation, aiming to improve the alignment of generative models with perceptual quality and high-level goals, and discusses future research directions.
Details
Motivation: The motivation is to address the misalignment of traditional training objectives for generative models with perceptual quality, semantic accuracy, or physical realism, and to explore how RL can enhance controllability, consistency, and human alignment. Method: The paper provides a systematic overview of RL-based methods for visual content generation, examining the evolution of RL from classical control to a general-purpose optimization tool. Result: The result is an in-depth survey that highlights how RL can serve as both a fine-tuning mechanism and a structural component for aligning generation with complex, high-level goals in image, video, and 3D/4D generation. Conclusion: The paper concludes with a discussion on the open challenges and future research directions at the intersection of reinforcement learning and generative modeling. Abstract: Generative models have made significant progress in synthesizing visual content, including images, videos, and 3D/4D structures. However, they are typically trained with surrogate objectives such as likelihood or reconstruction loss, which often misalign with perceptual quality, semantic accuracy, or physical realism. Reinforcement learning (RL) offers a principled framework for optimizing non-differentiable, preference-driven, and temporally structured objectives. Recent advances demonstrate its effectiveness in enhancing controllability, consistency, and human alignment across generative tasks. This survey provides a systematic overview of RL-based methods for visual content generation. We review the evolution of RL from classical control to its role as a general-purpose optimization tool, and examine its integration into image, video, and 3D/4D generation. Across these domains, RL serves not only as a fine-tuning mechanism but also as a structural component for aligning generation with complex, high-level goals. 
We conclude with open challenges and future research directions at the intersection of RL and generative modeling.
[94] Concepts or Skills? Rethinking Instruction Selection for Multi-modal Models
Andrew Bai,Justin Cui,Ruochen Wang,Cho-Jui Hsieh
Main category: cs.CV
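TL;DR: This paper proposes a data selection method that optimizes vision-language instruction tuning performance, emphasizing the balance between concepts and skills.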
Details
Motivation: The authors find that vision-language benchmarks mainly benefit from training on instructions with either similar skills or similar visual concepts. Method: A targeted training data selection method is designed: it first extracts concepts/skills from the benchmark, determines whether the benchmark predominantly benefits from concepts or skills, and then selects the best-matching instructions. Result: Experiments validate the method, showing +0.9% over the best existing baseline averaged over all benchmarks and +1.5% on the skill-focused subset. Conclusion: The study underscores the importance of recognizing the inherent trade-off between conceptual knowledge and visual skill in instruction selection. Abstract: Vision-language instruction tuning achieves two main purposes: learning visual concepts and learning visual skills. In this paper, we found that vision-language benchmarks fall into the dichotomy of mainly benefiting from training on instructions with similar skills or visual concepts. Inspired by the discovery, we designed a simple targeted training data selection method to optimize the performance of a given benchmark. We first extract the concepts/skills from the benchmark, determine whether the benchmark predominantly benefits from similar concepts or skills, and finally select instructions with the most matching concepts/skills. Experiments on 10+ benchmarks validate the effectiveness of our targeted data selection method, showing +0.9% over the best existing baseline averaged over all benchmarks and +1.5% on the skill-focused subset. Our findings underscore the importance of recognizing the inherent trade-off within instruction selection, which requires balancing the acquisition of conceptual knowledge against visual skill.
[95] Glo-DMU: A Deep Morphometry Framework of Ultrastructural Characterization in Glomerular Electron Microscopic Images
Zhentai Zhang,Danyi Weng,Guibin Zhang,Xiang Chen,Kaixing Long,Jian Geng,Yanmeng Lu,Lei Zhang,Zhitao Zhou,Lei Cao
Main category: cs.CV
TL;DR: Glo-DMU is a fully automated analysis tool for glomerular ultrastructure that offers high precision and high throughput and assists renal pathologists in diagnosis.
Details
Motivation: Computational pathology combined with deep learning has shown great potential for automatic analysis of glomerular ultrastructure. However, prior research has focused mainly on recognizing individual ultrastructures, which does not meet practical diagnostic needs. Method: The study develops a glomerular morphometry framework, Glo-DMU, built on three deep models: an ultrastructure segmentation model, a glomerular filtration barrier region classification model, and an electron-dense deposits detection model. Result: In an evaluation of 115 patients with 9 renal pathological types in real-world diagnostic scenarios, Glo-DMU showed good consistency between the automatic quantification results and the morphological descriptions in the pathological reports. Conclusion: Glo-DMU is a fully automated, precise, high-throughput tool for analyzing glomerular ultrastructure that meets practical diagnostic needs and assists pathologists in diagnosis. Abstract: Complex and diverse ultrastructural features can indicate the type, progression, and prognosis of kidney diseases. Recently, computational pathology combined with deep learning methods has shown tremendous potential in advancing automatic morphological analysis of glomerular ultrastructure. However, current research predominantly focuses on the recognition of individual ultrastructure, which makes it challenging to meet practical diagnostic needs. In this study, we propose the glomerular morphometry framework of ultrastructural characterization (Glo-DMU), which is grounded on three deep models: the ultrastructure segmentation model, the glomerular filtration barrier region classification model, and the electron-dense deposits detection model. 
Following the conventional protocol of renal biopsy diagnosis, this framework simultaneously quantifies the three most widely used ultrastructural features: the thickness of glomerular basement membrane, the degree of foot process effacement, and the location of electron-dense deposits. We evaluated 115 patients with 9 renal pathological types in real-world diagnostic scenarios, demonstrating good consistency between automatic quantification results and morphological descriptions in the pathological reports. Glo-DMU possesses the characteristics of full automation, high precision, and high throughput, quantifying multiple ultrastructural features simultaneously, and providing an efficient tool for assisting renal pathologists.
[96] Improving OCR for Historical Texts of Multiple Languages
Hylke Westerdijk,Ben Blankenborg,Khondoker Ittehadul Islam
Main category: cs.CV
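TL;DR: This paper studies deep learning for three text recognition tasks (historical Hebrew fragments, 16th to 18th-century meeting resolutions, and modern English handwriting), presenting improved modeling approaches and directions for future research.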
Details
Motivation: To improve the accuracy of character recognition and layout analysis for historical and modern documents, and to explore deep learning techniques across different text recognition tasks. Method: The Kraken and TrOCR models are used to improve character recognition; a CRNN integrating DeepLabV3+ with a Bidirectional LSTM is used for semantic segmentation with confidence-based pseudolabeling; and a CRNN with a ResNet34 encoder trained with the CTC loss models sequential dependencies. Result: Data augmentation and model optimization improved character recognition and document layout analysis performance, particularly on the historical and modern text processing tasks. Conclusion: Using advanced deep learning techniques, the paper presents methods and findings on three OCR and document layout analysis tasks and offers valuable directions for future research. Abstract: This paper presents our methodology and findings from three tasks across Optical Character Recognition (OCR) and Document Layout Analysis using advanced deep learning techniques. First, for the historical Hebrew fragments of the Dead Sea Scrolls, we enhanced our dataset through extensive data augmentation and employed the Kraken and TrOCR models to improve character recognition. For the 16th to 18th-century meeting resolutions task, we utilized a Convolutional Recurrent Neural Network (CRNN) that integrated DeepLabV3+ for semantic segmentation with a Bidirectional LSTM, incorporating confidence-based pseudolabeling to refine our model. Finally, for the modern English handwriting recognition task, we applied a CRNN with a ResNet34 encoder, trained using the Connectionist Temporal Classification (CTC) loss function to effectively capture sequential dependencies. This report offers valuable insights and suggests potential directions for future research.
[97] AtomDiffuser: Time-Aware Degradation Modeling for Drift and Beam Damage in STEM Imaging
Hao Wang,Hongkui Zheng,Kai He,Abolfazl Razi
Main category: cs.CV
TL;DR: AtomDiffuser is a new framework for analyzing STEM data that effectively separates spatial drift and signal loss, enabling clearer interpretation of atomic-level material dynamics.
Details
Motivation: Interpreting time-resolved STEM data is challenging due to entangled degradation effects like spatial drift and beam-induced signal loss. Traditional methods struggle to separate these effects, necessitating a more effective solution. Method: AtomDiffuser uses a time-aware approach to disentangle spatial drift and radiometric attenuation by predicting affine transformations and spatially varying decay maps between STEM frames. It leverages synthetic data for training and applies to real-world cryo-STEM data. Result: AtomDiffuser successfully models and separates degradation effects, enabling high-resolution degradation inference, drift alignment, and quantification of radiation-induced atomic instabilities. Conclusion: AtomDiffuser offers a novel framework for modeling degradation in STEM data, effectively separating drift and signal loss, and enabling accurate visualization and quantification of atomic instabilities caused by radiation. Abstract: Scanning transmission electron microscopy (STEM) plays a critical role in modern materials science, enabling direct imaging of atomic structures and their evolution under external interferences. However, interpreting time-resolved STEM data remains challenging due to two entangled degradation effects: spatial drift caused by mechanical and thermal instabilities, and beam-induced signal loss resulting from radiation damage. These factors distort both geometry and intensity in complex, temporally correlated ways, making it difficult for existing methods to explicitly separate their effects or model material dynamics at atomic resolution. In this work, we present AtomDiffuser, a time-aware degradation modeling framework that disentangles sample drift and radiometric attenuation by predicting an affine transformation and a spatially varying decay map between any two STEM frames. 
Unlike traditional denoising or registration pipelines, our method leverages degradation as a physically heuristic, temporally conditioned process, enabling interpretable structural evolutions across time. Trained on synthetic degradation processes, AtomDiffuser also generalizes well to real-world cryo-STEM data. It further supports high-resolution degradation inference and drift alignment, offering tools for visualizing and quantifying degradation patterns that correlate with radiation-induced atomic instabilities.
[98] Contrast Sensitivity Function of Multimodal Vision-Language Models
Pablo Hernández-Cámara,Alexandra Gomez-Villa,Jose Manuel Jaén-Lorites,Jorge Vila-Tomás,Jesus Malo,Valero Laparra
Main category: cs.CV
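TL;DR: This study uses a novel method to evaluate the contrast sensitivity function of vision-language models and finds that existing models still fall significantly short of reproducing human visual perception.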
Details
Motivation: Assessing how well multimodal vision-language models (VLMs) align with human perception is essential for understanding how they perceive low-level visual features, and a key characteristic of human vision is the contrast sensitivity function (CSF). Method: The model's CSF is estimated by directly prompting it to judge pattern visibility at different contrasts, using band-pass filtered noise images and a diverse set of prompts. Result: While some models approximate the shape or the magnitude of the human CSF, none fully replicates both. Prompt phrasing also has a large effect on model responses, raising concerns about prompt stability. Conclusion: The study proposes a novel behavioral psychophysics method for estimating the CSF of chat-based VLMs and reveals important gaps between the models' visual representations and human perception. Abstract: Assessing the alignment of multimodal vision-language models (VLMs) with human perception is essential to understand how they perceive low-level visual features. A key characteristic of human vision is the contrast sensitivity function (CSF), which describes sensitivity to spatial frequency at low contrasts. Here, we introduce a novel behavioral psychophysics-inspired method to estimate the CSF of chat-based VLMs by directly prompting them to judge pattern visibility at different contrasts for each frequency. This methodology is closer to the real experiments in psychophysics than those previously reported. Using band-pass filtered noise images and a diverse set of prompts, we assess model responses across multiple architectures. We find that while some models approximate human-like CSF shape or magnitude, none fully replicate both. Notably, prompt phrasing has a large effect on the responses, raising concerns about prompt stability. Our results provide a new framework for probing visual sensitivity in multimodal models and reveal key gaps between their visual representations and human perception.
[99] Towards Spatially Consistent Image Generation: On Incorporating Intrinsic Scene Properties into Diffusion Models
Hyundo Lee,Suhyung Choi,Byoung-Tak Zhang,Inwoo Hwang
Main category: cs.CV
TL;DR: This paper proposes a method for improving spatial consistency in image generation models by co-generating images and intrinsic scene properties (like depth and segmentation maps), resulting in more realistic and structurally accurate images while maintaining the quality and text alignment of the base model.
Details
Motivation: Image generation models often produce spatially inconsistent and distorted images due to limited understanding of underlying structures. Prior approaches rely solely on image-text pairs or use intrinsic properties as conditional inputs, which limits their effectiveness. This work aims to address these limitations by incorporating intrinsic scene properties into the generation process. Method: The approach involves extracting intrinsic scene properties using pre-trained estimators, aggregating these properties into a latent variable via an autoencoder, and extending pre-trained Latent Diffusion Models to simultaneously denoise both image and intrinsic domains by sharing mutual information. Result: The experimental results show that the method effectively corrects spatial inconsistencies, produces more natural scene layouts, and maintains the fidelity and textual alignment of the base model (e.g., Stable Diffusion). Conclusion: The proposed method successfully improves spatial consistency and realism in generated images by co-generating images and intrinsic scene properties, maintaining fidelity and textual alignment with the base model. Abstract: Image generation models trained on large datasets can synthesize high-quality images but often produce spatially inconsistent and distorted images due to limited information about the underlying structures and spatial layouts. In this work, we leverage intrinsic scene properties (e.g., depth, segmentation maps) that provide rich information about the underlying scene, unlike prior approaches that solely rely on image-text pairs or use intrinsics as conditional inputs. Our approach aims to co-generate both images and their corresponding intrinsics, enabling the model to implicitly capture the underlying scene structure and generate more spatially consistent and realistic images. 
Specifically, we first extract rich intrinsic scene properties from a large image dataset with pre-trained estimators, eliminating the need for additional scene information or explicit 3D representations. We then aggregate various intrinsic scene properties into a single latent variable using an autoencoder. Building upon pre-trained large-scale Latent Diffusion Models (LDMs), our method simultaneously denoises the image and intrinsic domains by carefully sharing mutual information so that the image and intrinsic reflect each other without degrading image quality. Experimental results demonstrate that our method corrects spatial inconsistencies and produces a more natural layout of scenes while maintaining the fidelity and textual alignment of the base model (e.g., Stable Diffusion).
[100] Unlocking Robust Semantic Segmentation Performance via Label-only Elastic Deformations against Implicit Label Noise
Yechan Kim,Dongho Yoon,Younkwan Lee,Unse Fatima,Hong Kook Kim,Songjae Lee,Sanga Park,Jeong Ho Park,Seonjong Kang,Moongu Jeon
Main category: cs.CV
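TL;DR: NSegment+ handles implicit label noise in semantic segmentation by decoupling image and label transformations, significantly improving model performance.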
Details
Motivation: Real-world datasets contain implicit label noise that arises from inherent challenges such as ambiguous object boundaries and annotator variability, and a new augmentation approach is needed to address it. Method: Controlled elastic deformations are applied only to the segmentation labels while the original images are left unchanged, so that the model learns representations robust to minor label inconsistencies. Result: NSegment+ achieves average mIoU gains of up to 2.29, 2.38, 1.75, and 3.39 on the Vaihingen, LoveDA, Cityscapes, and PASCAL VOC datasets, respectively, indicating its effectiveness against implicit label noise. Conclusion: NSegment+ is a new augmentation framework that decouples image and label transformations to handle implicit label noise in semantic segmentation and improve model performance. Abstract: While previous studies on image segmentation focus on handling severe (or explicit) label noise, real-world datasets also exhibit subtle (or implicit) label imperfections. These arise from inherent challenges, such as ambiguous object boundaries and annotator variability. Although not explicitly present, such mild and latent noise can still impair model performance. Typical data augmentation methods, which apply identical transformations to the image and its label, risk amplifying these subtle imperfections and limiting the model's generalization capacity. In this paper, we introduce NSegment+, a novel augmentation framework that decouples image and label transformations to address such realistic noise for semantic segmentation. By introducing controlled elastic deformations only to segmentation labels while preserving the original images, our method encourages models to focus on learning robust representations of object structures despite minor label inconsistencies. Extensive experiments demonstrate that NSegment+ consistently improves performance, achieving mIoU gains of up to +2.29, +2.38, +1.75, and +3.39 on average on Vaihingen, LoveDA, Cityscapes, and PASCAL VOC, respectively, even without bells and whistles, highlighting the importance of addressing implicit label noise. These gains can be further amplified when combined with other training tricks, including CutMix and Label Smoothing.
[101] PQ-DAF: Pose-driven Quality-controlled Data Augmentation for Data-scarce Driver Distraction Detection
Haibin Sun,Xinghui Song
Main category: cs.CV
TL;DR: This paper proposes PQ-DAF, a data augmentation framework that uses vision-language models to enhance few-shot learning for driver distraction detection, resulting in better generalization in real-world scenarios.
Details
Motivation: Existing models face challenges in generalization due to few-shot learning and domain shifts between training and real-world deployment. Method: A Pose-driven Quality-controlled Data Augmentation Framework (PQ-DAF) using a Progressive Conditional Diffusion Model (PCDMs) and a sample quality assessment module based on the CogVLM vision-language model. Result: PQ-DAF effectively expands training data and improves cross-domain robustness for driver distraction detection. Conclusion: PQ-DAF improves few-shot driver distraction detection and enhances model generalization under data-scarce conditions. Abstract: Driver distraction detection is essential for improving traffic safety and reducing road accidents. However, existing models often suffer from degraded generalization when deployed in real-world scenarios. This limitation primarily arises from the few-shot learning challenge caused by the high cost of data annotation in practical environments, as well as the substantial domain shift between training datasets and target deployment conditions. To address these issues, we propose a Pose-driven Quality-controlled Data Augmentation Framework (PQ-DAF) that leverages a vision-language model for sample filtering to cost-effectively expand training data and enhance cross-domain robustness. Specifically, we employ a Progressive Conditional Diffusion Model (PCDMs) to accurately capture key driver pose features and synthesize diverse training examples. A sample quality assessment module, built upon the CogVLM vision-language model, is then introduced to filter out low-quality synthetic samples based on a confidence threshold, ensuring the reliability of the augmented dataset. 
Extensive experiments demonstrate that PQ-DAF substantially improves performance in few-shot driver distraction detection, achieving significant gains in model generalization under data-scarce conditions.
[102] Translation of Text Embedding via Delta Vector to Suppress Strongly Entangled Content in Text-to-Image Diffusion Models
Eunseo Koh,Seunghoo Hong,Tae-Young Kim,Simon S. Woo,Jae-Pil Heo
Main category: cs.CV
TL;DR: The paper proposes a novel method to suppress entangled content in text-to-image diffusion models, achieving superior performance compared to existing techniques.
Details
Motivation: The motivation is to overcome the challenge of suppressing content strongly entangled with specific words in text-to-image diffusion models. Method: The method involves modifying the text embedding using a delta vector to weaken the influence of undesired content and integrating this delta vector into the cross-attention mechanism for selective suppression. Result: The approach significantly outperforms existing methods in both quantitative and qualitative metrics. Conclusion: The proposed SSDV method effectively suppresses undesired content in text-to-image diffusion models, outperforming existing methods. Abstract: Text-to-Image (T2I) diffusion models have made significant progress in generating diverse high-quality images from textual prompts. However, these models still face challenges in suppressing content that is strongly entangled with specific words. For example, when generating an image of "Charlie Chaplin", a "mustache" consistently appears even if the model is explicitly instructed not to include it, as the concept of "mustache" is strongly entangled with "Charlie Chaplin". To address this issue, we propose a novel approach to directly suppress such entangled content within the text embedding space of diffusion models. Our method introduces a delta vector that modifies the text embedding to weaken the influence of undesired content in the generated image, and we further demonstrate that this delta vector can be easily obtained through a zero-shot approach. Furthermore, we propose a Selective Suppression with Delta Vector (SSDV) method to adapt the delta vector into the cross-attention mechanism, enabling more effective suppression of unwanted content in regions where it would otherwise be generated. Additionally, we enable more precise suppression in personalized T2I models by optimizing the delta vector, which previous baselines were unable to achieve. 
Extensive experimental results demonstrate that our approach significantly outperforms existing methods, both in terms of quantitative and qualitative metrics.
[103] SC-Lane: Slope-aware and Consistent Road Height Estimation Framework for 3D Lane Detection
Chaesong Park,Eunbin Seo,Jihyeon Hwang,Jongwoo Lim
Main category: cs.CV
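TL;DR: SC-Lane is a new 3D lane detection framework that uses adaptive slope feature fusion and temporal consistency to improve the accuracy and robustness of height estimation.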
Details
Motivation: Conventional methods rely on fixed slope anchors and cannot cope with diverse road geometries, so a more flexible and robust height estimation method is needed. Method: A Slope-Aware Adaptive Feature module and a Height Consistency Module are proposed, which dynamically predict fusion weights and enforce temporal coherence to improve height estimation. Result: On the OpenLane benchmark, SC-Lane achieves state-of-the-art performance with an F-score of 64.3% and significantly improves both height estimation and 3D lane detection. Conclusion: By adaptively fusing slope features and strengthening temporal consistency, SC-Lane achieves more robust and accurate height estimation for 3D lane detection and significantly outperforms existing methods. Abstract: In this paper, we introduce SC-Lane, a novel slope-aware and temporally consistent heightmap estimation framework for 3D lane detection. Unlike previous approaches that rely on fixed slope anchors, SC-Lane adaptively determines the fusion of slope-specific height features, improving robustness to diverse road geometries. To achieve this, we propose a Slope-Aware Adaptive Feature module that dynamically predicts the appropriate weights from image cues for integrating multi-slope representations into a unified heightmap. Additionally, a Height Consistency Module enforces temporal coherence, ensuring stable and accurate height estimation across consecutive frames, which is crucial for real-world driving scenarios. To evaluate the effectiveness of SC-Lane, we employ three standardized metrics, Mean Absolute Error (MAE), Root Mean Squared Error (RMSE), and threshold-based accuracy, which, although common in surface and depth estimation, have been underutilized for road height assessment. Using the LiDAR-derived heightmap dataset introduced in prior work [20], we benchmark our method under these metrics, thereby establishing a rigorous standard for future comparisons. Extensive experiments on the OpenLane benchmark demonstrate that SC-Lane significantly improves both height estimation and 3D lane detection, achieving state-of-the-art performance with an F-score of 64.3%, outperforming existing methods by a notable margin. For detailed results and a demonstration video, please refer to our project page: https://parkchaesong.github.io/sclane/
[104] NanoControl: A Lightweight Framework for Precise and Efficient Control in Diffusion Transformer
Shanyuan Liu,Jian Zhu,Junda Lu,Yue Gong,Liuzhuozheng Li,Bo Cheng,Yuhang Ma,Liebucha Wu,Xiaoyu Wu,Dawei Leng,Yuhui Yin
Main category: cs.CV
TL;DR: NanoControl achieves efficient controllable text-to-image generation by introducing a low-rank adaptation control module and a KV-Context Augmentation mechanism.
Details
Motivation: Existing ControlNet-based methods introduce significant parameter overhead and computational cost when used for controllable text-to-image generation with DiTs. Method: A LoRA-style control module and a KV-Context Augmentation mechanism are designed, with Flux as the backbone network. Result: NanoControl increases the parameter count by only 0.024% and GFLOPs by only 0.029%. Conclusion: NanoControl achieves efficient controllable text-to-image generation while maintaining generation quality and controllability. Abstract: Diffusion Transformers (DiTs) have demonstrated exceptional capabilities in text-to-image synthesis. However, in the domain of controllable text-to-image generation using DiTs, most existing methods still rely on the ControlNet paradigm originally designed for UNet-based diffusion models. This paradigm introduces significant parameter overhead and increased computational costs. To address these challenges, we propose the Nano Control Diffusion Transformer (NanoControl), which employs Flux as the backbone network. Our model achieves state-of-the-art controllable text-to-image generation performance while incurring only a 0.024% increase in parameter count and a 0.029% increase in GFLOPs, thus enabling highly efficient controllable generation. Specifically, rather than duplicating the DiT backbone for control, we design a LoRA-style (low-rank adaptation) control module that directly learns control signals from raw conditioning inputs. Furthermore, we introduce a KV-Context Augmentation mechanism that integrates condition-specific key-value information into the backbone in a simple yet highly effective manner, facilitating deep fusion of conditional features. Extensive benchmark experiments demonstrate that NanoControl significantly reduces computational overhead compared to conventional control approaches, while maintaining superior generation quality and achieving improved controllability.
[105] STRIDE-QA: Visual Question Answering Dataset for Spatiotemporal Reasoning in Urban Driving Scenes
Keishi Ishihara,Kento Sasaki,Tsubasa Takahashi,Daiki Shiono,Yu Yamaguchi
Main category: cs.CV
TL;DR: STRIDE-QA is a large-scale dataset for visual question answering that enables advanced spatiotemporal reasoning in autonomous driving scenarios, significantly improving the performance of Vision-Language Models (VLMs) in spatial localization and motion prediction.
Details
Motivation: Existing Vision-Language Models (VLMs) are limited in spatiotemporal reasoning for autonomous driving due to reliance on static image-text pairs. STRIDE-QA addresses this gap. Method: Creation of STRIDE-QA, a large-scale VQA dataset based on 100 hours of multi-sensor driving data in Tokyo, featuring dense annotations like 3D bounding boxes and multi-object tracks supporting object-centric and ego-centric reasoning. Result: VLMs fine-tuned on STRIDE-QA achieved 55% success in spatial localization and 28% consistency in future motion prediction, significantly outperforming general-purpose VLMs which scored near-zero. Conclusion: STRIDE-QA provides a foundation for developing more reliable VLMs for autonomous systems by enabling advanced spatiotemporal reasoning. Abstract: Vision-Language Models (VLMs) have been applied to autonomous driving to support decision-making in complex real-world scenarios. However, their training on static, web-sourced image-text pairs fundamentally limits the precise spatiotemporal reasoning required to understand and predict dynamic traffic scenes. We address this critical gap with STRIDE-QA, a large-scale visual question answering (VQA) dataset for physically grounded reasoning from an ego-centric perspective. Constructed from 100 hours of multi-sensor driving data in Tokyo, capturing diverse and challenging conditions, STRIDE-QA is the largest VQA dataset for spatiotemporal reasoning in urban driving, offering 16 million QA pairs over 285K frames. Grounded by dense, automatically generated annotations including 3D bounding boxes, segmentation masks, and multi-object tracks, the dataset uniquely supports both object-centric and ego-centric reasoning through three novel QA tasks that require spatial localization and temporal prediction. Our benchmarks demonstrate that existing VLMs struggle significantly, achieving near-zero scores on prediction consistency. 
In contrast, VLMs fine-tuned on STRIDE-QA exhibit dramatic performance gains, achieving 55% success in spatial localization and 28% consistency in future motion prediction, compared to near-zero scores from general-purpose VLMs. Therefore, STRIDE-QA establishes a comprehensive foundation for developing more reliable VLMs for safety-critical autonomous systems.
[106] CRISP: Contrastive Residual Injection and Semantic Prompting for Continual Video Instance Segmentation
Baichen Liu,Qi Lyu,Xudong Wang,Jiahua Dong,Lianqing Liu,Zhi Han
Main category: cs.CV
TL;DR: This paper proposes CRISP, a novel method for continual video instance segmentation that addresses instance-wise, category-wise, and task-wise confusion, significantly improving performance while avoiding catastrophic forgetting.
Details
Motivation: Continual video instance segmentation requires the ability to learn new object categories while retaining previously learned ones and maintaining temporal consistency across frames. Method: The paper proposes Contrastive Residual Injection and Semantic Prompting (CRISP), which includes instance correlation loss, adaptive residual semantic prompt (ARSP) learning framework, semantic consistency loss, and an initialization strategy for incremental prompts. Result: Experiments on YouTube-VIS-2019 and YouTube-VIS-2021 datasets show that CRISP outperforms existing methods in long-term continual video instance segmentation, effectively avoiding catastrophic forgetting and improving performance. Conclusion: CRISP effectively addresses the challenges of continual video instance segmentation by enhancing plasticity and stability while preserving temporal consistency, significantly outperforming existing methods. Abstract: Continual video instance segmentation demands both the plasticity to absorb new object categories and the stability to retain previously learned ones, all while preserving temporal consistency across frames. In this work, we introduce Contrastive Residual Injection and Semantic Prompting (CRISP), an early attempt tailored to address the instance-wise, category-wise, and task-wise confusion in continual video instance segmentation. For instance-wise learning, we model instance tracking and construct an instance correlation loss, which emphasizes the correlation with the prior query space while strengthening the specificity of the current task query. For category-wise learning, we build an adaptive residual semantic prompt (ARSP) learning framework, which constructs a learnable semantic residual prompt pool generated by category text and uses an adjustive query-prompt matching mechanism to build a mapping relationship between the query of the current task and the semantic residual prompt. 
Meanwhile, a semantic consistency loss based on contrastive learning is introduced to maintain semantic coherence between object queries and residual prompts during incremental training. For task-wise learning, to ensure the correlation at the inter-task level within the query space, we introduce a concise yet powerful initialization strategy for incremental prompts. Extensive experiments on YouTube-VIS-2019 and YouTube-VIS-2021 datasets demonstrate that CRISP significantly outperforms existing continual segmentation methods in the long-term continual video instance segmentation task, avoiding catastrophic forgetting and effectively improving segmentation and classification performance. The code is available at https://github.com/01upup10/CRISP.
[107] DOD-SA: Infrared-Visible Decoupled Object Detection with Single-Modality Annotations
Hang Jin,Chenqiang Gao,Junjie Guo,Fangcen Liu,Kanghui Tian,Qinyao Chang
Main category: cs.CV
TL;DR: This paper introduces DOD-SA, an efficient infrared-visible object detection framework using single-modality annotations, achieving better performance than current methods.
Details
Motivation: The motivation is to reduce the high annotation costs of existing dual-modality methods in infrared-visible object detection while maintaining robust performance. Method: The method involves a Teacher-Student Network (CoSD-TSNet) with single-modality and dual-modality branches, along with a Progressive and Self-Tuning Training Strategy (PaST) and a Pseudo Label Assigner (PLA) to improve model performance and label alignment. Result: Experiments on the DroneVehicle dataset show that the proposed method outperforms state-of-the-art approaches. Conclusion: The paper proposes a new framework called DOD-SA for infrared-visible object detection, which reduces annotation costs by using single-modality annotations while achieving superior performance compared to existing methods. Abstract: Infrared-visible object detection has shown great potential in real-world applications, enabling robust all-day perception by leveraging the complementary information of infrared and visible images. However, existing methods typically require dual-modality annotations to output detection results for both modalities during prediction, which incurs high annotation costs. To address this challenge, we propose a novel infrared-visible Decoupled Object Detection framework with Single-modality Annotations, called DOD-SA. The architecture of DOD-SA is built upon a Single- and Dual-Modality Collaborative Teacher-Student Network (CoSD-TSNet), which consists of a single-modality branch (SM-Branch) and a dual-modality decoupled branch (DMD-Branch). The teacher model generates pseudo-labels for the unlabeled modality, simultaneously supporting the training of the student model. The collaborative design enables cross-modality knowledge transfer from the labeled modality to the unlabeled modality, and facilitates effective SM-to-DMD branch supervision. 
To further improve the decoupling ability of the model and the pseudo-label quality, we introduce a Progressive and Self-Tuning Training Strategy (PaST) that trains the model in three stages: (1) pretraining SM-Branch, (2) guiding the learning of DMD-Branch by SM-Branch, and (3) refining DMD-Branch. In addition, we design a Pseudo Label Assigner (PLA) to align and pair labels across modalities, explicitly addressing modality misalignment during training. Extensive experiments on the DroneVehicle dataset demonstrate that our method outperforms state-of-the-art (SOTA) approaches.
[108] SkeySpot: Automating Service Key Detection for Digital Electrical Layout Plans in the Construction Industry
Dhruv Dosi,Rohit Meena,Param Rajpura,Yogesh Kumar Meena
Main category: cs.CV
TL;DR: This paper presents SkeySpot, an open-source toolkit for automated electrical symbol detection in scanned floor plans, making digitization accessible for SMEs and supporting industry goals of standardization and sustainability.
Details
Motivation: Legacy floor plans are essential in the construction industry but lack machine-readable formats, making interpretation difficult. Automated symbol spotting is proposed as a scalable solution to support various workflows like cost estimation and regulatory compliance. Method: The paper introduces the DELP dataset with annotated electrical layout plans and proposes a systematic evaluation framework using pre-trained object detection models. YOLOv8 was used to develop SkeySpot, a lightweight and open-source toolkit for symbol detection. Result: YOLOv8 achieved the highest performance with an mAP of 82.5%. SkeySpot produces structured outputs that can be scaled for interoperable building workflows. Conclusion: The paper concludes that SkeySpot, a toolkit developed using the YOLOv8 model, effectively enables real-time detection, classification, and quantification of electrical symbols in floor plans. This approach makes digitization more accessible for SMEs and supports goals like standardization, interoperability, and sustainability. Abstract: Legacy floor plans, often preserved only as scanned documents, remain essential resources for architecture, urban planning, and facility management in the construction industry. However, the lack of machine-readable floor plans renders large-scale interpretation both time-consuming and error-prone. Automated symbol spotting offers a scalable solution by enabling the identification of service key symbols directly from floor plans, supporting workflows such as cost estimation, infrastructure maintenance, and regulatory compliance. This work introduces a labelled Digitised Electrical Layout Plans (DELP) dataset comprising 45 scanned electrical layout plans annotated with 2,450 instances across 34 distinct service key classes. A systematic evaluation framework is proposed using pretrained object detection models for the DELP dataset. 
Among the models benchmarked, YOLOv8 achieves the highest performance with a mean Average Precision (mAP) of 82.5%. Using YOLOv8, we develop SkeySpot, a lightweight, open-source toolkit for real-time detection, classification, and quantification of electrical symbols. SkeySpot produces structured, standardised outputs that can be scaled up for interoperable building information workflows, ultimately enabling compatibility across downstream applications and regulatory platforms. By lowering dependency on proprietary CAD systems and reducing manual annotation effort, this approach makes the digitisation of electrical layouts more accessible to small and medium-sized enterprises (SMEs) in the construction industry, while supporting broader goals of standardisation, interoperability, and sustainability in the built environment.
[109] From Images to Perception: Emergence of Perceptual Properties by Reconstructing Images
Pablo Hernández-Cámara,Jesus Malo,Valero Laparra
Main category: cs.CV
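TL;DR: This paper presents PerceptNet, a bio-inspired architecture optimized on several image reconstruction tasks, and finds that its encoder layer correlates strongly with human perceptual judgments, suggesting that the visual system may be tuned to specific levels of distortion and sparsity and that perceptual metrics can be learned without human supervision.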
Details
Motivation: To investigate whether human visual perception may emerge from image statistics, and to explore how bio-inspired models form efficient neural representations in early vision. Method: PerceptNet is optimized end to end on image reconstruction tasks (autoencoding, denoising, deblurring, and sparsity regularization), and its correlation with human perceptual judgments is evaluated. Result: Although no perceptual information is used in initialization or training, the encoder stage (V1 layer) shows the highest correlation with human perceptual judgments, with the best alignment at moderate levels of noise, blur, and sparsity. Conclusion: Bio-inspired models can learn perceptual metrics without human supervision, and the visual system may be tuned to remove particular levels of distortion at a particular level of sparsity. Abstract: A number of scientists suggested that human visual perception may emerge from image statistics, shaping efficient neural representations in early vision. In this work, a bio-inspired architecture that can accommodate several known facts in the retina-V1 cortex, the PerceptNet, has been end-to-end optimized for different tasks related to image reconstruction: autoencoding, denoising, deblurring, and sparsity regularization. Our results show that the encoder stage (V1-like layer) consistently exhibits the highest correlation with human perceptual judgments on image distortion despite not using perceptual information in the initialization or training. This alignment exhibits an optimum for moderate noise, blur and sparsity. These findings suggest that the visual system may be tuned to remove those particular levels of distortion with that level of sparsity and that biologically inspired models can learn perceptual metrics without human supervision.
[110] Trajectory-aware Shifted State Space Models for Online Video Super-Resolution
Qiang Zhu,Xiandong Meng,Yuxian Jiang,Fan Zhang,David Bull,Shuyuan Zhu,Bing Zeng
Main category: cs.CV
TL;DR: This paper introduces TS-Mamba, a novel online video super-resolution method leveraging trajectory-aware modeling and shifted state space models to achieve efficient spatio-temporal aggregation and improved performance with reduced complexity.
Details
Motivation: The motivation is to improve upon existing online VSR methods by addressing limitations in temporal alignment and long-range temporal modeling using state space models. Method: The paper proposes TS-Mamba, which uses Trajectory-aware Shifted SSMs for efficient spatio-temporal information aggregation in online VSR. Result: TS-Mamba outperforms six online VSR benchmark models in most cases and achieves over 22.7% complexity reduction (in MACs). Conclusion: The paper concludes that TS-Mamba achieves state-of-the-art performance in online video super-resolution with reduced complexity compared to existing methods. Abstract: Online video super-resolution (VSR) is an important technique for many real-world video processing applications, which aims to restore the current high-resolution video frame based on temporally previous frames. Most of the existing online VSR methods solely employ one neighboring previous frame to achieve temporal alignment, which limits long-range temporal modeling of videos. Recently, state space models (SSMs) have been proposed with linear computational complexity and a global receptive field, which significantly improve computational efficiency and performance. In this context, this paper presents a novel online VSR method based on Trajectory-aware Shifted SSMs (TS-Mamba), leveraging both long-term trajectory modeling and low-complexity Mamba to achieve efficient spatio-temporal information aggregation. Specifically, TS-Mamba first constructs the trajectories within a video to select the most similar tokens from the previous frames. Then, a Trajectory-aware Shifted Mamba Aggregation (TSMA) module consisting of proposed shifted SSMs blocks is employed to aggregate the selected tokens. The shifted SSMs blocks are designed based on Hilbert scannings and corresponding shift operations to compensate for scanning losses and strengthen the spatial continuity of Mamba. 
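As an illustration of the Hilbert scanning mentioned above, the following sketch enumerates the cells of a square grid along a Hilbert curve, so that consecutive tokens in the flattened sequence are always spatially adjacent. It uses the standard distance-to-coordinate conversion and is not taken from the TS-Mamba code; the function names and the 4x4 grid are assumptions, the grid side must be a power of two, and the paper's shift operations are not modeled here.

```python
def hilbert_d2xy(n, d):
    """Map distance d along a Hilbert curve to (x, y) on an n x n grid.

    n must be a power of two and 0 <= d < n * n.
    """
    x = y = 0
    t = d
    s = 1
    while s < n:
        rx = 1 & (t // 2)
        ry = 1 & (t ^ rx)
        # Rotate or flip the current quadrant so sub-curves join up.
        if ry == 0:
            if rx == 1:
                x = s - 1 - x
                y = s - 1 - y
            x, y = y, x
        x += s * rx
        y += s * ry
        t //= 4
        s *= 2
    return x, y


def hilbert_order(n):
    """Return all grid cells in Hilbert-curve visiting order."""
    return [hilbert_d2xy(n, d) for d in range(n * n)]


def hilbert_flatten(grid):
    """Flatten a square 2D token grid (indexed grid[y][x]) in Hilbert order."""
    n = len(grid)
    return [grid[y][x] for x, y in hilbert_order(n)]


order = hilbert_order(4)
tokens = hilbert_flatten([[4 * r + c for c in range(4)] for r in range(4)])
print(order[:4])
print(tokens[:4])
```

Compared with a row-by-row raster scan, this ordering never jumps across the image between consecutive tokens, which is the locality property that makes such scans attractive for sequence models over image patches.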
Additionally, we propose a trajectory-aware loss function to supervise the trajectory generation, ensuring the accuracy of token selection when training our model. Extensive experiments on three widely used VSR test datasets demonstrate that compared with six online VSR benchmark models, our TS-Mamba achieves state-of-the-art performance in most cases and over 22.7% complexity reduction (in MACs). The source code for TS-Mamba will be available at https://github.com.
[111] Multi-Label Plant Species Prediction with Metadata-Enhanced Multi-Head Vision Transformers
Hanna Herasimchyk,Robin Labryga,Tomislav Prusina
Main category: cs.CV
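TL;DR: This paper proposes a multi-head vision transformer approach to plant species prediction, addressing the multi-label plant species prediction problem in the PlantCLEF 2025 challenge.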
Details
Motivation: The paper aims to address the domain gap between training on single-species plant images and testing on multi-species quadrat images, thereby improving the accuracy of plant species prediction. Method: The approach uses a pre-trained DINOv2 Vision Transformer Base (ViT-B/14) backbone with multiple classification heads that predict species, genus, and family, exploiting the taxonomic hierarchy. Key contributions also include multi-scale tiling, dynamic threshold optimization, and ensemble strategies. Result: Experiments show strong performance on a dataset of approximately 1.4 million training images, and the submission ranks third on the private leaderboard. Conclusion: The study demonstrates the effectiveness of multi-head vision transformers for multi-label plant species prediction, with techniques such as multi-scale tiling and dynamic threshold optimization improving model performance. Abstract: We present a multi-head vision transformer approach for multi-label plant species prediction in vegetation plot images, addressing the PlantCLEF 2025 challenge. The task involves training models on single-species plant images while testing on multi-species quadrat images, creating a drastic domain shift. Our methodology leverages a pre-trained DINOv2 Vision Transformer Base (ViT-B/14) backbone with multiple classification heads for species, genus, and family prediction, utilizing taxonomic hierarchies. Key contributions include multi-scale tiling to capture plants at different scales, dynamic threshold optimization based on mean prediction length, and ensemble strategies through bagging and Hydra model architectures. The approach incorporates various inference techniques including image cropping to remove non-plant artifacts, top-n filtering for prediction constraints, and logit thresholding strategies. Experiments were conducted on approximately 1.4 million training images covering 7,806 plant species. Results demonstrate strong performance, making our submission 3rd best on the private leaderboard. Our code is available at https://github.com/geranium12/plant-clef-2025/tree/v1.0.0.
[112] SingleStrip: learning skull-stripping from a single labeled example
Bella Specktor-Fadida,Malte Hoffmann
Main category: cs.CV
TL;DR: This paper presents a semi-supervised method for 3D skull-stripping in MRI images using minimal labeled data by combining domain randomization and autoencoder-based quality control.
Details
Motivation: The motivation is to reduce the laborious and time-consuming process of manual labeling in deep learning segmentation, particularly for volumetric images like brain MRI, by developing a semi-supervised approach that requires minimal labeled data. Method: The paper proposes a method that combines domain randomization with self-training to train 3D skull-stripping networks using very few labeled examples. It includes steps like voxel intensity binning, convolutional autoencoder training for quality assessment, and pseudo-label selection for fine-tuning. Result: The method achieved skull-stripping performance on out-of-distribution data that approaches models trained with more labeled images. The AE-based ranking showed a stronger correlation with segmentation accuracy compared to consistency-based ranking under test-time augmentation. Conclusion: The paper concludes that combining domain randomization with AE-based quality control enables effective semi-supervised segmentation from extremely limited labeled data, reducing the reliance on manual labeling. Abstract: Deep learning segmentation relies heavily on labeled data, but manual labeling is laborious and time-consuming, especially for volumetric images such as brain magnetic resonance imaging (MRI). While recent domain-randomization techniques alleviate the dependency on labeled data by synthesizing diverse training images from label maps, they offer limited anatomical variability when very few label maps are available. Semi-supervised self-training addresses label scarcity by iteratively incorporating model predictions into the training set, enabling networks to learn from unlabeled data. In this work, we combine domain randomization with self-training to train three-dimensional skull-stripping networks using as little as a single labeled example. First, we automatically bin voxel intensities, yielding labels we use to synthesize images for training an initial skull-stripping model. 
Second, we train a convolutional autoencoder (AE) on the labeled example and use its reconstruction error to assess the quality of brain masks predicted for unlabeled data. Third, we select the top-ranking pseudo-labels to fine-tune the network, achieving skull-stripping performance on out-of-distribution data that approaches models trained with more labeled images. We compare AE-based ranking to consistency-based ranking under test-time augmentation, finding that the AE approach yields a stronger correlation with segmentation accuracy. Our results highlight the potential of combining domain randomization and AE-based quality control to enable effective semi-supervised segmentation from extremely limited labeled data. This strategy may ease the labeling burden that slows progress in studies involving new anatomical structures or emerging imaging techniques.
[113] Enhanced Sparse Point Cloud Data Processing for Privacy-aware Human Action Recognition
Maimunatu Tunau,Vincent Gbouna Zakka,Zhuangzhuang Dai
Main category: cs.CV
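TL;DR: This paper comprehensively evaluates three data processing methods commonly used in mmWave radar human action recognition, individually and in combination, proposes enhancements, and analyzes their strengths and trade-offs.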
Details
Motivation: Traditional vision-based human action recognition systems are effective but raise privacy concerns. mmWave radar sensors offer a privacy-preserving alternative, but their point cloud data is sparse and noisy and requires dedicated data processing to improve performance. Method: Using the MiliPoint dataset, the paper conducts a detailed performance analysis of Density-Based Spatial Clustering of Applications with Noise (DBSCAN), the Hungarian Algorithm, and Kalman Filtering, covering each method on its own, every pairwise combination, and all three together. Result: The paper evaluates the three methods in terms of recognition accuracy and computational cost, proposes enhancements for each, and analyzes the strengths and trade-offs of combining them. Conclusion: The paper provides a comprehensive evaluation of the three main mmWave radar data processing methods and proposes improvements, offering key insights for future research on mmWave human action recognition systems. Abstract: Human Action Recognition (HAR) plays a crucial role in healthcare, fitness tracking, and ambient assisted living technologies. While traditional vision-based HAR systems are effective, they pose privacy concerns. mmWave radar sensors offer a privacy-preserving alternative but present challenges due to the sparse and noisy nature of their point cloud data. In the literature, three primary data processing methods, Density-Based Spatial Clustering of Applications with Noise (DBSCAN), the Hungarian Algorithm, and Kalman Filtering, have been widely used to improve the quality and continuity of radar data. However, a comprehensive evaluation of these methods, both individually and in combination, remains lacking. This paper addresses that gap by conducting a detailed performance analysis of the three methods using the MiliPoint dataset. We evaluate each method individually, all possible pairwise combinations, and the combination of all three, assessing both recognition accuracy and computational cost. Furthermore, we propose targeted enhancements to the individual methods aimed at improving accuracy. Our results provide crucial insights into the strengths and trade-offs of each method and their integrations, guiding future work on mmWave-based HAR systems.
[114] STAMP: Multi-pattern Attention-aware Multiple Instance Learning for STAS Diagnosis in Multi-center Histopathology Images
Liangrui Pan,Xiaoyu Li,Guang Zhu,Guanting Li,Ruixin Wang,Jiadi Luo,Yaning Yang,Liang Qingchun,Shaoliang Peng
Main category: cs.CV
TL;DR: This study introduces STAMP, a deep learning framework that effectively identifies STAS in LUAD histopathology images, offering improved diagnostic accuracy over current clinical approaches.
Details
Motivation: STAS is a novel invasive pattern in LUAD associated with poor clinical outcomes, but its diagnosis is labor-intensive and error-prone, necessitating the development of deep learning models for improved detection. Method: A dual-branch architecture with transformer-based instance encoding and multi-pattern attention aggregation modules was developed to identify STAS-associated features while suppressing irrelevant noise. Additionally, a similarity regularization constraint was applied to improve accuracy. Result: STAMP achieved AUCs of 0.8058, 0.8017, and 0.7928 on the STAS-SXY, STAS-TXY, and STAS-TCGA datasets, respectively, demonstrating superior performance over existing clinical methods. Conclusion: STAMP, a multi-pattern attention-aware multiple instance learning framework, demonstrates competitive diagnostic results for STAS in LUAD, surpassing clinical standards. Abstract: Spread through air spaces (STAS) constitutes a novel invasive pattern in lung adenocarcinoma (LUAD), associated with tumor recurrence and diminished survival rates. However, large-scale STAS diagnosis in LUAD remains a labor-intensive endeavor, compounded by the propensity for oversight and misdiagnosis due to its distinctive pathological characteristics and morphological features. Consequently, there is a pressing clinical imperative to leverage deep learning models for STAS diagnosis. This study initially assembled histopathological images from STAS patients at the Second Xiangya Hospital and the Third Xiangya Hospital of Central South University, alongside the TCGA-LUAD cohort. Three senior pathologists conducted cross-verification annotations to construct the STAS-SXY, STAS-TXY, and STAS-TCGA datasets. We then propose a multi-pattern attention-aware multiple instance learning framework, named STAMP, to analyze and diagnose the presence of STAS across multi-center histopathology images. 
Specifically, the dual-branch architecture guides the model to learn STAS-associated pathological features from distinct semantic spaces. Transformer-based instance encoding and multi-pattern attention aggregation modules dynamically select regions closely associated with STAS pathology, suppressing irrelevant noise and enhancing the discriminative power of global representations. Moreover, a similarity regularization constraint prevents feature redundancy across branches, thereby improving overall diagnostic accuracy. Extensive experiments demonstrated that STAMP achieved competitive diagnostic results on STAS-SXY, STAS-TXY and STAS-TCGA, with AUCs of 0.8058, 0.8017, and 0.7928, respectively, surpassing the clinical level.
[115] TweezeEdit: Consistent and Efficient Image Editing with Path Regularization
Jianda Mao,Kaibo Wang,Yang Xiang,Kani Chen
Main category: cs.CV
TL;DR: TweezeEdit is a fast and effective image editing method that preserves original image meaning by avoiding reliance on inversion anchors and using gradient-driven regularization with a consistency model.
Details
Motivation: Existing image editing methods using large-scale pre-trained diffusion models often over-align with target prompts but poorly preserve source image semantics, and they are inefficient due to long editing paths. Method: TweezeEdit uses gradient-driven regularization to inject target prompt semantics along a direct path with the help of a consistency model, instead of relying on inversion anchors. Result: TweezeEdit outperforms existing methods in both semantic preservation and target alignment, achieving efficient editing in only 12 steps (1.6 seconds per edit), which highlights its potential for real-time applications. Conclusion: TweezeEdit is a tuning- and inversion-free framework that achieves consistent and efficient image editing by regularizing the entire denoising path, thereby preserving source semantics and shortening editing paths. Abstract: Large-scale pre-trained diffusion models empower users to edit images through text guidance. However, existing methods often over-align with target prompts while inadequately preserving source image semantics. Such approaches generate target images explicitly or implicitly from the inversion noise of the source images, termed the inversion anchors. We identify this strategy as suboptimal for semantic preservation and inefficient due to elongated editing paths. We propose TweezeEdit, a tuning- and inversion-free framework for consistent and efficient image editing. Our method addresses these limitations by regularizing the entire denoising path rather than relying solely on the inversion anchors, ensuring source semantic retention and shortening editing paths. Guided by gradient-driven regularization, we efficiently inject target prompt semantics along a direct path using a consistency model. Extensive experiments demonstrate TweezeEdit's superior performance in semantic preservation and target alignment, outperforming existing methods. 
Remarkably, it requires only 12 steps (1.6 seconds per edit), underscoring its potential for real-time applications.
[116] Multi-Sample Anti-Aliasing and Constrained Optimization for 3D Gaussian Splatting
Zheng Zhou,Jia-Chen Zhang,Yu-Jie Xiong,Chun-Ming Xia
Main category: cs.CV
TL;DR: This paper proposes an optimization framework combining MSAA and dual geometric constraints to improve detail preservation in 3D Gaussian splatting, particularly for high-frequency textures and sharp edges, while maintaining real-time performance.
Details
Motivation: Insufficient geometric constraints during scene optimization in 3D Gaussian splatting lead to blurred reconstructions of fine-grained details, especially in regions with high-frequency textures and sharp discontinuities. Method: The framework integrates multisample anti-aliasing (MSAA) with dual geometric constraints, using adaptive blending of quadruple subsamples and dynamic gradient analysis to prioritize under-reconstructed regions. Result: Extensive experiments show that the method outperforms baseline approaches in structural similarity (SSIM) and perceptual quality (LPIPS), achieving superior detail preservation and real-time efficiency. Conclusion: The proposed optimization framework with dual geometric constraints and MSAA achieves state-of-the-art performance in preserving high-frequency textures and sharp discontinuities while maintaining real-time rendering efficiency. Abstract: Recent advances in 3D Gaussian splatting have significantly improved real-time novel view synthesis, yet insufficient geometric constraints during scene optimization often result in blurred reconstructions of fine-grained details, particularly in regions with high-frequency textures and sharp discontinuities. To address this, we propose a comprehensive optimization framework integrating multisample anti-aliasing (MSAA) with dual geometric constraints. Our system computes pixel colors through adaptive blending of quadruple subsamples, effectively reducing aliasing artifacts in high-frequency components. The framework introduces two constraints: (a) an adaptive weighting strategy that prioritizes under-reconstructed regions through dynamic gradient analysis, and (b) gradient differential constraints enforcing geometric regularization at object boundaries. This targeted optimization enables the model to allocate computational resources preferentially to critical regions requiring refinement while maintaining global consistency. 
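The subsample blending described in this abstract can be illustrated with a minimal sketch. It assumes each pixel has four subsample colors and one non-negative weight per subsample; the abstract does not say how the adaptive weights are derived, so `blend_subsamples` and its inputs are hypothetical names rather than the authors' implementation. With equal weights it reduces to a plain four-sample MSAA average.

```python
from typing import Sequence, Tuple

Color = Tuple[float, float, float]


def blend_subsamples(colors: Sequence[Color], weights: Sequence[float]) -> Color:
    """Blend per-subsample colors into a single pixel color.

    Weights are normalized to sum to 1. If every weight is zero, the
    function falls back to a plain average of the subsamples.
    """
    if not colors or len(colors) != len(weights):
        raise ValueError("need one weight per subsample")
    if any(w < 0 for w in weights):
        raise ValueError("weights must be non-negative")
    total = sum(weights)
    if total == 0:
        norm = [1.0 / len(colors)] * len(colors)
    else:
        norm = [w / total for w in weights]
    # Weighted sum per color channel.
    r = sum(w * c[0] for w, c in zip(norm, colors))
    g = sum(w * c[1] for w, c in zip(norm, colors))
    b = sum(w * c[2] for w, c in zip(norm, colors))
    return (r, g, b)


# Hypothetical colors for the four subsamples of one pixel.
subsamples = [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0), (1.0, 1.0, 1.0)]
uniform_pixel = blend_subsamples(subsamples, [1.0, 1.0, 1.0, 1.0])
single_pixel = blend_subsamples(subsamples, [1.0, 0.0, 0.0, 0.0])
```

Putting all the weight on one subsample reproduces ordinary single-sample rendering, which makes the blend easy to sanity-check.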
Extensive experimental evaluations across multiple benchmarks demonstrate that our method achieves state-of-the-art performance in detail preservation, particularly in preserving high-frequency textures and sharp discontinuities, while maintaining real-time rendering efficiency. Quantitative metrics and perceptual studies confirm statistically significant improvements over baseline approaches in both structural similarity (SSIM) and perceptual quality (LPIPS).
[117] A Segmentation-driven Editing Method for Bolt Defect Augmentation and Detection
Yangjie Xiao,Ke Zhang,Jiacun Wang,Xin Sheng,Yurong Guo,Meijuan Chen,Zehua Ren,Zhaoye Zheng,Zhenbing Zhao
Main category: cs.CV
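TL;DR: This paper proposes SBDE, a new data augmentation method for bolt defect detection, comprising the bolt attribute segmentation model Bolt-SAM, the bolt defect attribute editing model MOD-LaMa, and the editing recovery augmentation strategy ERA. Experiments show that the method outperforms existing models at generating bolt defect images and effectively improves bolt defect detection performance.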
Details
Motivation: Bolt defect detection is critical to ensuring the safety of transmission lines, but the scarcity of defect images and imbalanced data distributions significantly limit detection performance. This paper aims to address the problem with a new data augmentation method. Method: A segmentation-driven bolt defect editing method (SBDE) is proposed, comprising a bolt attribute segmentation model (Bolt-SAM), a bolt defect attribute editing model (MOD-LaMa) built from a mask optimization module (MOD) and the image inpainting model LaMa, and an editing recovery augmentation (ERA) strategy. Result: Experiments show that the bolt defect images generated by SBDE significantly outperform those of state-of-the-art image editing models and effectively improve bolt defect detection performance. Conclusion: The paper proposes SBDE, a new method that augments datasets through segmentation-driven bolt defect editing and thereby improves bolt defect detection. Experiments show that it significantly outperforms state-of-the-art image editing models at generating bolt defect images and effectively improves detection results. Abstract: Bolt defect detection is critical to ensure the safety of transmission lines. However, the scarcity of defect images and imbalanced data distributions significantly limit detection performance. To address this problem, we propose a segmentation-driven bolt defect editing method (SBDE) to augment the dataset. First, a bolt attribute segmentation model (Bolt-SAM) is proposed, which enhances the segmentation of complex bolt attributes through the CLAHE-FFT Adapter (CFA) and Multipart-Aware Mask Decoder (MAMD), generating high-quality masks for subsequent editing tasks. Second, a mask optimization module (MOD) is designed and integrated with the image inpainting model (LaMa) to construct the bolt defect attribute editing model (MOD-LaMa), which converts normal bolts into defective ones through attribute editing. Finally, an editing recovery augmentation (ERA) strategy is proposed to recover and put the edited defect bolts back into the original inspection scenes and expand the defect detection dataset. We constructed multiple bolt datasets and conducted extensive experiments. Experimental results demonstrate that the bolt defect images generated by SBDE significantly outperform state-of-the-art image editing models, and effectively improve the performance of bolt defect detection, which fully verifies the effectiveness and application potential of the proposed method. The code of the project is available at https://github.com/Jay-xyj/SBDE.
[118] EgoMusic-driven Human Dance Motion Estimation with Skeleton Mamba
Quang Nguyen,Nhat Le,Baoru Huang,Minh Nhat Vu,Chengcheng Tang,Van Nguyen,Ngan Le,Thieu Vo,Anh Nguyen
Main category: cs.CV
TL;DR: This paper introduces EgoAIST++ and an EgoMusic Motion Network using Skeleton Mamba to accurately estimate dance motion from egocentric video and music, outperforming current methods.
Details
Motivation: The motivation is to address the underexplored challenge of jointly estimating human dance motion from both egocentric video and music, aiming to enhance accuracy and alignment with multimodal inputs. Method: The method involves creating a new large-scale dataset (EgoAIST++) and designing an EgoMusic Motion Network with a Skeleton Mamba to model human body structure and motion sequences effectively. Result: The experiments demonstrate that the proposed method outperforms existing approaches and generalizes well to real-world data, proving its effectiveness and theoretical support. Conclusion: The paper concludes that the proposed EgoMusic Motion Network, leveraging the Skeleton Mamba and the EgoAIST++ dataset, outperforms state-of-the-art methods in dance motion estimation from egocentric video and music. Abstract: Estimating human dance motion is a challenging task with various industrial applications. Recently, many efforts have focused on predicting human dance motion using either egocentric video or music as input. However, the task of jointly estimating human motion from both egocentric video and music remains largely unexplored. In this paper, we aim to develop a new method that predicts human dance motion from both egocentric video and music. In practice, the egocentric view often obscures much of the body, making accurate full-pose estimation challenging. Additionally, incorporating music requires the generated head and body movements to align well with both visual and musical inputs. We first introduce EgoAIST++, a new large-scale dataset that combines both egocentric views and music with more than 36 hours of dancing motion. Drawing on the success of diffusion models and Mamba on modeling sequences, we develop an EgoMusic Motion Network with a core Skeleton Mamba that explicitly captures the skeleton structure of the human body. We illustrate that our approach is theoretically supportive. 
Intensive experiments show that our method clearly outperforms state-of-the-art approaches and generalizes effectively to real-world data.
[119] Reasoning in Computer Vision: Taxonomy, Models, Tasks, and Methodologies
Ayushman Sarkar,Mohd Yamani Idna Idris,Zhenyu Yu
Main category: cs.CV
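TL;DR: This paper presents a comprehensive taxonomy of visual reasoning, reviews the architectures and evaluation protocols involved, and identifies future research directions and key challenges.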
Details
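Motivation: Despite notable advances in relational, symbolic, temporal, causal, and commonsense reasoning, existing surveys often treat these directions in isolation and lack a unified analysis and comparison across reasoning types, methodologies, and evaluation protocols. This paper aims to fill that gap. Method: The paper categorizes visual reasoning into five major types and systematically reviews how they are implemented in architectures such as graph-based models, memory networks, attention mechanisms, and neuro-symbolic systems. It also reviews protocols for evaluating functional correctness, structural consistency, and causal validity. Result: The paper provides a comprehensive taxonomy of visual reasoning and a systematic review of architectures and evaluation protocols. It also points out the limitations of existing evaluation methods in generalizability, reproducibility, and explanatory power, and identifies key open challenges in visual reasoning. Conclusion: The paper emphasizes the importance of visual reasoning for computer vision tasks and proposes a taxonomy covering five major types (relational, symbolic, temporal, causal, and commonsense). It also outlines directions for future research, including building transparent, trustworthy, and cross-domain adaptive AI systems. Abstract: Visual reasoning is critical for a wide range of computer vision tasks that go beyond surface-level object detection and classification. Despite notable advances in relational, symbolic, temporal, causal, and commonsense reasoning, existing surveys often address these directions in isolation, lacking a unified analysis and comparison across reasoning types, methodologies, and evaluation protocols. This survey aims to address this gap by categorizing visual reasoning into five major types (relational, symbolic, temporal, causal, and commonsense) and systematically examining their implementation through architectures such as graph-based models, memory networks, attention mechanisms, and neuro-symbolic systems. We review evaluation protocols designed to assess functional correctness, structural consistency, and causal validity, and critically analyze their limitations in terms of generalizability, reproducibility, and explanatory power. Beyond evaluation, we identify key open challenges in visual reasoning, including scalability to complex scenes, deeper integration of symbolic and neural paradigms, the lack of comprehensive benchmark datasets, and reasoning under weak supervision.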
Finally, we outline a forward-looking research agenda for next-generation vision systems, emphasizing that bridging perception and reasoning is essential for building transparent, trustworthy, and cross-domain adaptive AI systems, particularly in critical domains such as autonomous driving and medical diagnostics.
[120] Med-GLIP: Advancing Medical Language-Image Pre-training with Large-scale Grounded Dataset
Ziye Deng,Ruihan He,Jiaxiang Liu,Yuan Wang,Zijie Meng,Songtao Jiang,Yong Xie,Zuozhu Liu
Main category: cs.CV
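TL;DR: To address limited modality coverage, coarse-grained annotations, and the lack of a unified framework in medical image grounding, the authors construct the large-scale multimodal dataset Med-GLIP-5M and develop the Med-GLIP framework, which acquires multi-level semantic understanding implicitly and performs strongly across multiple tasks.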
Details
Motivation: Existing research on medical image grounding is constrained by limited modality coverage, coarse-grained annotations, and the absence of a unified, generalizable grounding framework, so a more comprehensive and efficient solution is needed. Method: The authors construct Med-GLIP-5M, a large-scale medical grounding dataset with over 5.3 million region-level annotations, and build on it the Med-GLIP framework, which implicitly acquires hierarchical semantic understanding from diverse training data without explicitly designed expert modules. Result: Med-GLIP consistently outperforms state-of-the-art baselines on multiple grounding benchmarks, and integrating its spatial outputs into downstream tasks such as medical VQA and report generation yields substantial performance gains. Conclusion: Med-GLIP provides a unified, generalizable medical image grounding framework that, by leveraging the large-scale multimodal dataset Med-GLIP-5M, substantially improves performance on medical image analysis tasks; the authors plan to release the dataset to support further research. Abstract: Medical image grounding aims to align natural language phrases with specific regions in medical images, serving as a foundational task for intelligent diagnosis, visual question answering (VQA), and automated report generation (MRG). However, existing research is constrained by limited modality coverage, coarse-grained annotations, and the absence of a unified, generalizable grounding framework. To address these challenges, we construct a large-scale medical grounding dataset Med-GLIP-5M comprising over 5.3 million region-level annotations across seven imaging modalities, covering diverse anatomical structures and pathological findings. The dataset supports both segmentation and grounding tasks with hierarchical region labels, ranging from organ-level boundaries to fine-grained lesions. Based on this foundation, we propose Med-GLIP, a modality-aware grounding framework trained on Med-GLIP-5M. Rather than relying on explicitly designed expert modules, Med-GLIP implicitly acquires hierarchical semantic understanding from diverse training data -- enabling it to recognize multi-granularity structures, such as distinguishing lungs from pneumonia lesions. Extensive experiments demonstrate that Med-GLIP consistently outperforms state-of-the-art baselines across multiple grounding benchmarks. Furthermore, integrating its spatial outputs into downstream tasks, including medical VQA and report generation, leads to substantial performance gains.
Our dataset will be released soon.
[121] GCRPNet: Graph-Enhanced Contextual and Regional Perception Network For Salient Object Detection in Optical Remote Sensing Images
Mengyu Ren,Yutong Li,Hua Li,Runmin Cong,Sam Kwong
Main category: cs.CV
TL;DR: This paper proposes GCRPNet, a novel network based on the Mamba architecture, for improved salient object detection in optical remote sensing images through enhanced feature integration.
Details
Motivation: Current methods using ViTs and CNNs struggle to effectively integrate heterogeneous global and local features for salient object detection in ORSIs, leading to suboptimal performance. Method: GCRPNet uses a visual state space encoder, a difference-similarity guided hierarchical graph attention module, and a LEVSS block with adaptive scanning and multi-granularity attention enhancement in the decoder. Result: Extensive experiments show that the proposed GCRPNet achieves state-of-the-art results, demonstrating its effectiveness and superiority in salient object detection. Conclusion: The proposed GCRPNet model based on the Mamba architecture improves salient object detection in optical remote sensing images by effectively integrating global and local features through novel modules. Abstract: Salient object detection (SOD) in optical remote sensing images (ORSIs) faces numerous challenges, including significant variations in target scales and low contrast between targets and the background. Existing methods based on vision transformers (ViTs) and convolutional neural networks (CNNs) architectures aim to leverage both global and local features, but the difficulty in effectively integrating these heterogeneous features limits their overall performance. To overcome these limitations, we propose a graph-enhanced contextual and regional perception network (GCRPNet), which builds upon the Mamba architecture to simultaneously capture long-range dependencies and enhance regional feature representation. Specifically, we employ the visual state space (VSS) encoder to extract multi-scale features. To further achieve deep guidance and enhancement of these features, we first design a difference-similarity guided hierarchical graph attention module (DS-HGAM). 
This module strengthens cross-layer interaction capabilities between features of different scales while enhancing the model's structural perception, allowing it to distinguish between foreground and background more effectively. Then, we design the LEVSS block as the decoder of GCRPNet. This module integrates our proposed adaptive scanning strategy and multi-granularity collaborative attention enhancement module (MCAEM). It performs adaptive patch scanning on feature maps processed via multi-scale convolutions, thereby capturing rich local region information and enhancing Mamba's local modeling capability. Extensive experimental results demonstrate that the proposed model achieves state-of-the-art performance, validating its effectiveness and superiority.
[122] PSScreen: Partially Supervised Multiple Retinal Disease Screening
Boyi Zheng,Qing Liu
Main category: cs.CV
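TL;DR: PSScreen is a novel partially supervised retinal disease screening model that uses two streams to learn deterministic and probabilistic features, applies textual guidance and feature distillation to improve domain generalization, and addresses the missing-label problem.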
Details
Motivation: Training a model on multiple partially labeled datasets reduces the reliance on fully annotated datasets, but remains challenging because of significant domain shifts across datasets from different medical sites and missing labels for some classes. Method: PSScreen consists of two streams: one learns deterministic features and the other learns probabilistic features via uncertainty injection. Textual guidance decouples both types of features into disease-wise features, which are aligned through feature distillation to improve domain generalization. Pseudo label consistency addresses the missing-label problem, and self-distillation transfers task-relevant semantics from the deterministic stream to the probabilistic stream to further improve detection performance. Result: Experiments show that PSScreen significantly improves average detection performance on six retinal diseases and the normal state, and achieves state-of-the-art results on both in-domain and out-of-domain datasets. Conclusion: By addressing domain shift and missing labels, PSScreen improves retinal disease detection performance and reduces the reliance on fully annotated data. Abstract: Leveraging multiple partially labeled datasets to train a model for multiple retinal disease screening reduces the reliance on fully annotated datasets, but remains challenging due to significant domain shifts across training datasets from various medical sites, and the missing-label issue for some classes. To solve these challenges, we propose PSScreen, a novel Partially Supervised multiple retinal disease Screening model. PSScreen consists of two streams: one learns deterministic features and the other learns probabilistic features via uncertainty injection. Then, we leverage textual guidance to decouple the two types of features into disease-wise features and align them via feature distillation to boost the domain generalization ability. Meanwhile, we employ pseudo label consistency between the two streams to address the missing-label issue and introduce self-distillation to transfer task-relevant semantics about known classes from the deterministic to the probabilistic stream to further enhance detection performance. Experiments show that PSScreen significantly enhances average detection performance on six retinal diseases and the normal state, and achieves state-of-the-art results on both in-domain and out-of-domain datasets. Codes are available at https://github.com/boyiZheng99/PSScreen.
[123] AR Surgical Navigation With Surface Tracing: Comparing In-Situ Visualization with Tool-Tracking Guidance for Neurosurgical Applications
Marc J. Fischer,Jeffrey Potts,Gabriel Urreola,Dax Jones,Paolo Palmisciano,E. Bradley Strong,Branden Cord,Andrew D. Hernandez,Julia D. Sharma,E. Brandon Strong
Main category: cs.CV
TL;DR: This study presents a new AR-guided methodology for precise surgical navigation, showing improved accuracy and user preference for real-time tool-tracking guidance over static visualization.
Details
Motivation: Augmented Reality (AR) surgical navigation systems are emerging as the next generation of intraoperative surgical guidance, promising to overcome limitations of traditional navigation systems. However, known issues with AR depth perception due to vergence-accommodation conflict and occlusion handling limitations of the currently commercially available display technology present acute challenges in surgical settings where precision is paramount. Method: A novel methodology for utilizing AR guidance to register anatomical targets and provide real-time instrument navigation using placement of simulated external ventricular drain catheters on a phantom model as the clinical scenario. Result: Following the insertion tests, computed tomography scans of the phantom models were acquired, allowing for evaluation of insertion accuracy, target deviation, angular error, and depth precision. Conclusion: Tool-tracking guidance improved performance metrics across all accuracy measures and was preferred by users in subjective evaluations. Abstract: Augmented Reality (AR) surgical navigation systems are emerging as the next generation of intraoperative surgical guidance, promising to overcome limitations of traditional navigation systems. However, known issues with AR depth perception due to vergence-accommodation conflict and occlusion handling limitations of the currently commercially available display technology present acute challenges in surgical settings where precision is paramount. This study presents a novel methodology for utilizing AR guidance to register anatomical targets and provide real-time instrument navigation using placement of simulated external ventricular drain catheters on a phantom model as the clinical scenario. The system registers target positions to the patient through a novel surface tracing method and uses real-time infrared tool tracking to aid in catheter placement, relying only on the onboard sensors of the Microsoft HoloLens 2. 
A group of intended users performed the procedure of simulated insertions under two AR guidance conditions: static in-situ visualization, where planned trajectories are overlaid directly onto the patient anatomy, and real-time tool-tracking guidance, where live feedback of the catheter's pose is provided relative to the plan. Following the insertion tests, computed tomography scans of the phantom models were acquired, allowing for evaluation of insertion accuracy, target deviation, angular error, and depth precision. System Usability Scale surveys assessed user experience and cognitive workload. Tool-tracking guidance improved performance metrics across all accuracy measures and was preferred by users in subjective evaluations. A free copy of this paper and all supplemental materials are available at https://bit.ly/45l89Hq.
[124] Retrieval-Augmented Prompt for OOD Detection
Ruisong Han,Zongbo Han,Jiahao Zhang,Mingyue Cheng,Changqing Zhang
Main category: cs.CV
TL;DR: RAP is a novel OOD detection approach that leverages external textual knowledge to enhance semantic supervision, achieving state-of-the-art results on large-scale benchmarks.
Details
Motivation: Existing OOD detection methods suffer from limited and mismatched outlier samples, leading to suboptimal performance due to insufficient semantic supervision. Method: RAP augments pre-trained vision-language models' prompts by retrieving external textual knowledge to enhance semantic supervision for OOD detection, dynamically updating OOD prompts in real-time during testing. Result: RAP reduces the average FPR95 by 7.05% and improves AUROC by 1.71% compared to previous methods in 1-shot OOD detection on the ImageNet-1k dataset. Conclusion: RAP provides state-of-the-art performance on large-scale OOD detection benchmarks, with significant improvements in FPR95 and AUROC on the ImageNet-1k dataset. Abstract: Out-of-Distribution (OOD) detection is crucial for the reliable deployment of machine learning models in-the-wild, enabling accurate identification of test samples that differ from the training data distribution. Existing methods rely on auxiliary outlier samples or in-distribution (ID) data to generate outlier information for training, but due to limited outliers and their mismatch with real test OOD samples, they often fail to provide sufficient semantic supervision, leading to suboptimal performance. To address this, we propose a novel OOD detection method called Retrieval-Augmented Prompt (RAP). RAP augments a pre-trained vision-language model's prompts by retrieving external knowledge, offering enhanced semantic supervision for OOD detection. During training, RAP retrieves descriptive words for outliers based on joint similarity with external textual knowledge and uses them to augment the model's OOD prompts. During testing, RAP dynamically updates OOD prompts in real-time based on the encountered OOD samples, enabling the model to rapidly adapt to the test environment. Our extensive experiments demonstrate that RAP achieves state-of-the-art performance on large-scale OOD detection benchmarks. 
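The two metrics this entry reports, FPR95 and AUROC, can be computed as in the sketch below. It assumes a higher score means "more in-distribution" and treats ID samples as the positive class; that is a common convention, not something the abstract specifies. The function names and toy scores are illustrative only.

```python
from typing import Sequence


def fpr_at_95_tpr(id_scores: Sequence[float], ood_scores: Sequence[float]) -> float:
    """Fraction of OOD samples accepted when at least 95% of ID samples are accepted."""
    ranked = sorted(id_scores, reverse=True)
    # Smallest count k with k / n >= 0.95, computed with integer arithmetic.
    k = (95 * len(ranked) + 99) // 100
    threshold = ranked[k - 1]
    accepted_ood = sum(1 for s in ood_scores if s >= threshold)
    return accepted_ood / len(ood_scores)


def auroc(id_scores: Sequence[float], ood_scores: Sequence[float]) -> float:
    """Probability that a random ID sample outscores a random OOD sample (ties count half)."""
    wins = 0.0
    for i in id_scores:
        for o in ood_scores:
            if i > o:
                wins += 1.0
            elif i == o:
                wins += 0.5
    return wins / (len(id_scores) * len(ood_scores))


# Toy scores, purely illustrative.
id_scores = [0.9, 0.8, 0.7, 0.6]
ood_scores = [0.65, 0.3, 0.2, 0.1]
fpr95 = fpr_at_95_tpr(id_scores, ood_scores)
area = auroc(id_scores, ood_scores)
```

Lower FPR95 and higher AUROC are better, which matches the direction of the improvements quoted for RAP.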
For example, in 1-shot OOD detection on the ImageNet-1k dataset, RAP reduces the average FPR95 by 7.05% and improves the AUROC by 1.71% compared to previous methods. Additionally, comprehensive ablation studies validate the effectiveness of each module and the underlying motivations of our approach.
[125] PTQAT: A Hybrid Parameter-Efficient Quantization Algorithm for 3D Perception Tasks
Xinhao Wang,Zhiwei Lin,Zhongyu Xia,Yongtao Wang
Main category: cs.CV
TL;DR: This paper introduces PTQAT, a hybrid quantization method that improves model quantization accuracy and efficiency by selecting critical layers for QAT fine-tuning while performing PTQ on other layers.
Details
Motivation: To address the speed-accuracy trade-off between PTQ and QAT in model quantization. Method: PTQAT selects critical layers for QAT fine-tuning and performs PTQ on the remaining layers, focusing on layers with smaller output discrepancies. Result: PTQAT achieves similar performance to QAT with more efficiency and consistently outperforms QAT-only baselines across various 3D perception tasks. Conclusion: PTQAT is a universal quantization method that supports various quantization bit widths as well as different model architectures. Abstract: Post-Training Quantization (PTQ) and Quantization-Aware Training (QAT) represent two mainstream model quantization approaches. However, PTQ often leads to unacceptable performance degradation in quantized models, while QAT imposes substantial GPU memory requirements and extended training time due to weight fine-tuning. In this paper, we propose PTQAT, a novel general hybrid quantization algorithm for the efficient deployment of 3D perception networks. To address the speed-accuracy trade-off between PTQ and QAT, our method selects critical layers for QAT fine-tuning and performs PTQ on the remaining layers. Contrary to intuition, fine-tuning the layers with smaller output discrepancies before and after quantization, rather than those with larger discrepancies, actually leads to greater improvements in the model's quantization accuracy. This means we better compensate for quantization errors during their propagation, rather than addressing them at the point where they occur. The proposed PTQAT achieves similar performance to QAT with more efficiency by freezing nearly 50% of quantifiable layers. Additionally, PTQAT is a universal quantization method that supports various quantization bit widths (4 bits) as well as different model architectures, including CNNs and Transformers.
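The layer-selection rule described above can be sketched as follows. The sketch assumes each quantizable layer has a scalar output discrepancy measured before and after quantization (mean squared error here, which is an assumption; the abstract does not name the metric). `split_layers` and the layer names are hypothetical, not taken from the paper's code.

```python
from typing import Dict, List, Sequence, Tuple


def mean_squared_error(a: Sequence[float], b: Sequence[float]) -> float:
    """Discrepancy between full-precision and quantized layer outputs (assumed metric)."""
    if not a or len(a) != len(b):
        raise ValueError("outputs must be non-empty and the same length")
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def split_layers(
    discrepancy: Dict[str, float], qat_fraction: float = 0.5
) -> Tuple[List[str], List[str]]:
    """Return (layers to fine-tune with QAT, layers to freeze after PTQ).

    Following the abstract, the layers with the SMALLEST discrepancy are
    fine-tuned; the rest stay frozen at their PTQ parameters.
    """
    ranked = sorted(discrepancy, key=lambda name: discrepancy[name])
    n_qat = int(len(ranked) * qat_fraction)
    return ranked[:n_qat], ranked[n_qat:]


# Hypothetical per-layer discrepancies.
layer_discrepancy = {"conv1": 0.30, "conv2": 0.05, "attn": 0.10, "head": 0.50}
qat_layers, frozen_layers = split_layers(layer_discrepancy)
```

The default `qat_fraction=0.5` mirrors the abstract's statement that nearly 50% of quantifiable layers are frozen; any other value is a tuning choice.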
The experimental results on nuScenes across diverse 3D perception tasks, including object detection, semantic segmentation, and occupancy prediction, show that our method consistently outperforms QAT-only baselines. Notably, it achieves 0.2%-0.9% NDS and 0.3%-1.0% mAP gains in object detection, 0.3%-2.0% mIoU gains in semantic segmentation and occupancy prediction while fine-tuning fewer weights.
[126] HM-Talker: Hybrid Motion Modeling for High-Fidelity Talking Head Synthesis
Shiyu Liu,Kui Jiang,Xianming Liu,Hongxun Yao,Xiaocheng Feng
Main category: cs.CV
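TL;DR: This paper proposes HM-Talker, an audio-driven talking head video generation framework that combines explicit (Action Unit) and implicit motion cues and uses a cross-modal disentanglement module and a hybrid motion modeling module to improve lip synchronization and cross-subject generalization; experiments show it outperforms existing methods in visual quality and synchronization accuracy.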
Details
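Motivation: Current audio-driven talking head video generation methods often produce videos with motion blur and lip jitter, mainly because they rely on implicit modeling of audio-facial motion correlations and lack explicit articulatory priors (i.e., anatomical guidance for speech-related facial movements). Overcoming this limitation requires explicit priors that guide facial motion and so improve lip-sync accuracy and video quality. Method: HM-Talker uses a hybrid motion representation that combines implicit and explicit motion cues. The explicit cues are based on Action Units (AUs), anatomically defined facial muscle movements, while the Cross-Modal Disentanglement Module (CMDM) extracts implicit features and predicts AUs. In addition, the Hybrid Motion Modeling Module (HMMM) dynamically merges randomly paired implicit/explicit features to strengthen cross-subject generalization. Result: HM-Talker outperforms state-of-the-art methods in visual quality and lip-sync accuracy. Experiments show that it generates high-fidelity, temporally coherent talking head videos and achieves robust lip synchronization across diverse identities, advancing personalized talking head synthesis. Conclusion: HM-Talker introduces a novel hybrid motion representation that combines implicit and explicit motion cues to generate high-fidelity, temporally coherent talking head videos. It uses Action Units (AUs) and implicit features to reduce phoneme-viseme misalignment, and its cross-modal disentanglement and hybrid motion modeling modules enforce identity-agnostic learning, yielding robust lip synchronization across diverse identities. Abstract: Audio-driven talking head video generation enhances user engagement in human-computer interaction. However, current methods frequently produce videos with motion blur and lip jitter, primarily due to their reliance on implicit modeling of audio-facial motion correlations--an approach lacking explicit articulatory priors (i.e., anatomical guidance for speech-related facial movements). To overcome this limitation, we propose HM-Talker, a novel framework for generating high-fidelity, temporally coherent talking heads. HM-Talker leverages a hybrid motion representation combining both implicit and explicit motion cues. Explicit cues use Action Units (AUs), anatomically defined facial muscle movements, alongside implicit features to minimize phoneme-viseme misalignment. Specifically, our Cross-Modal Disentanglement Module (CMDM) extracts complementary implicit/explicit motion features while predicting AUs directly from audio input aligned to visual cues. To mitigate identity-dependent biases in explicit features and enhance cross-subject generalization, we introduce the Hybrid Motion Modeling Module (HMMM). This module dynamically merges randomly paired implicit/explicit features, enforcing identity-agnostic learning. Together, these components enable robust lip synchronization across diverse identities, advancing personalized talking head synthesis.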
Extensive experiments demonstrate HM-Talker's superiority over state-of-the-art methods in visual quality and lip-sync accuracy.
[127] SpaRC-AD: A Baseline for Radar-Camera Fusion in End-to-End Autonomous Driving
Philipp Wolters,Johannes Gilg,Torben Teepe,Gerhard Rigoll
Main category: cs.CV
TL;DR: SpaRC-AD is a camera-radar fusion framework for autonomous driving that improves 3D scene understanding and trajectory prediction, especially in challenging conditions.
Details
Motivation: Vision-based autonomous driving systems struggle with adverse weather, occlusions, and precise velocity estimation, which are critical for safety in complex scenarios. Method: SpaRC-AD uses a query-based end-to-end camera-radar fusion framework, leveraging sparse 3D feature alignment and Doppler-based velocity estimation for improved 3D scene representation. Result: SpaRC-AD outperforms state-of-the-art vision-only baselines in multiple tasks, including 3D detection (+4.8% mAP), multi-object tracking (+8.3% AMOTA), online mapping (+1.8% mAP), motion prediction (-4.0% mADE), and trajectory planning (-0.1m L2 and -9% TPC). Conclusion: The proposed SpaRC-AD framework effectively addresses the limitations of vision-only autonomous driving systems, particularly in challenging scenarios requiring accurate motion understanding and long-horizon trajectory prediction. Abstract: End-to-end autonomous driving systems promise stronger performance through unified optimization of perception, motion forecasting, and planning. However, vision-based approaches face fundamental limitations in adverse weather conditions, partial occlusions, and precise velocity estimation - critical challenges in safety-sensitive scenarios where accurate motion understanding and long-horizon trajectory prediction are essential for collision avoidance. To address these limitations, we propose SpaRC-AD, a query-based end-to-end camera-radar fusion framework for planning-oriented autonomous driving. Through sparse 3D feature alignment, and doppler-based velocity estimation, we achieve strong 3D scene representations for refinement of agent anchors, map polylines and motion modelling. Our method achieves strong improvements over the state-of-the-art vision-only baselines across multiple autonomous driving tasks, including 3D detection (+4.8% mAP), multi-object tracking (+8.3% AMOTA), online mapping (+1.8% mAP), motion prediction (-4.0% mADE), and trajectory planning (-0.1m L2 and -9% TPC). 
We achieve both spatial coherence and temporal consistency on multiple challenging benchmarks, including real-world open-loop nuScenes, long-horizon T-nuScenes, and closed-loop simulator Bench2Drive. We show the effectiveness of radar-based fusion in safety-critical scenarios where accurate motion understanding and long-horizon trajectory prediction are essential for collision avoidance. The source code of all experiments is available at https://phi-wol.github.io/sparcad/
[128] Adapting SAM via Cross-Entropy Masking for Class Imbalance in Remote Sensing Change Detection
Humza Naveed,Xina Zeng,Mitch Bryson,Nagita Mehrseresht
Main category: cs.CV
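TL;DR: This study proposes a SAM-based method for remote sensing change detection that combines spatial-temporal feature enhancement, multi-scale decoder fusion, and a new loss function, improving detection performance, especially on complex datasets.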
Details
Motivation: Foundation models such as SAM perform well in computer vision but still need adaptation for remote sensing change detection (RSCD). Existing methods fall short on multi-scale change detection and class imbalance, so a more robust solution is needed. Method: The paper proposes a remote sensing change detection method based on SAM (Segment Anything Model). It fine-tunes the SAM encoder and introduces spatial-temporal feature enhancement (STFE) and multi-scale decoder fusion (MSDF) modules. It also designs a new cross-entropy masking (CEM) loss to address class imbalance. Result: The method outperforms existing state-of-the-art methods on four change detection datasets (Levir-CD, WHU-CD, CLCD, and S2Looking), with a 2.5% F1-score improvement on the complex S2Looking dataset. Conclusion: Experiments show that the proposed SAM-based method, combined with STFE, MSDF, and the CEM loss, outperforms SOTA methods on four change detection datasets, with a 2.5% F1-score gain on S2Looking in particular. Abstract: Foundational models have achieved significant success in diverse domains of computer vision. They learn general representations that are easily transferable to tasks not seen during training. One such foundational model is the Segment Anything Model (SAM), which can accurately segment objects in images. We propose adapting the SAM encoder via fine-tuning for remote sensing change detection (RSCD) along with spatial-temporal feature enhancement (STFE) and multi-scale decoder fusion (MSDF) to detect changes robustly at multiple scales. Additionally, we propose a novel cross-entropy masking (CEM) loss to handle high class imbalance in change detection datasets. Our method outperforms state-of-the-art (SOTA) methods on four change detection datasets, Levir-CD, WHU-CD, CLCD, and S2Looking. We achieved a 2.5% F1-score improvement on the large, complex S2Looking dataset. The code is available at: https://github.com/humza909/SAM-CEM-CD
[129] Towards Agentic AI for Multimodal-Guided Video Object Segmentation
Tuyen Tran,Thao Minh Le,Truyen Tran
Main category: cs.CV
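TL;DR: This paper proposes a multimodal agent approach to referring-based video object segmentation (RVOS) that uses the reasoning capabilities of large language models (LLMs) to dynamically generate a workflow for each input, improving flexibility and adaptability.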
Details
Motivation: Traditional referring-based video object segmentation methods rely on training specialized models, which entails high computational complexity and substantial manual annotation. Existing methods built on vision-language foundation models lack flexibility and cannot adapt to the dynamic nature of the task. Method: The paper proposes a multimodal agent system that uses a large language model to dynamically generate workflows and interact with a set of tools designed for low-level tasks, in order to identify the target object described by multimodal cues. Result: On two multimodal-conditioned video object segmentation tasks (RVOS and Ref-AVS), the proposed method shows clear improvements over existing methods. Conclusion: The proposed multimodal agent approach offers greater flexibility and better performance on referring-based video object segmentation and points to a new direction for future training-free methods. Abstract: Referring-based Video Object Segmentation is a multimodal problem that requires producing fine-grained segmentation results guided by external cues. Traditional approaches to this task typically involve training specialized models, which come with high computational complexity and manual annotation effort. Recent advances in vision-language foundation models open a promising direction toward training-free approaches. Several studies have explored leveraging these general-purpose models for fine-grained segmentation, achieving performance comparable to that of fully supervised, task-specific models. However, existing methods rely on fixed pipelines that lack the flexibility needed to adapt to the dynamic nature of the task. To address this limitation, we propose Multi-Modal Agent, a novel agentic system designed to solve this task in a more flexible and adaptive manner. Specifically, our method leverages the reasoning capabilities of large language models (LLMs) to generate dynamic workflows tailored to each input. This adaptive procedure iteratively interacts with a set of specialized tools designed for low-level tasks across different modalities to identify the target object described by the multimodal cues. Our agentic approach demonstrates clear improvements over prior methods on two multimodal-conditioned VOS tasks: RVOS and Ref-AVS.
[130] HumanSense: From Multimodal Perception to Empathetic Context-Aware Responses through Reasoning MLLMs
Zheng Qin,Ruobing Zheng,Yabing Wang,Tianqi Li,Yi Yuan,Jingdong Chen,Le Wang
Main category: cs.CV
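TL;DR: This paper introduces the HumanSense benchmark for evaluating the human-centered perception and interaction capabilities of multimodal large language models, strengthens model reasoning through reinforcement learning, and also improves non-reasoning models in a training-free manner.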
Details
Motivation: Multimodal large language models (MLLMs) show great potential for human-like interaction, but fine-grained evaluation frameworks for human-centered scenarios are lacking, particularly for understanding complex human intentions and providing empathetic, context-aware responses. Method: The paper introduces the HumanSense benchmark to evaluate the human-centered perception and interaction capabilities of MLLMs, applies multi-stage, modality-progressive reinforcement learning to strengthen model reasoning, and designs prompts to improve non-reasoning models. Result: The evaluation shows that leading MLLMs still have considerable room for improvement on advanced interaction-oriented tasks. Supplementing visual input with audio and text information substantially improves performance, and Omni models show advantages on these tasks. Strengthening reasoning through reinforcement learning yields substantial gains on the evaluation, and the designed prompts also improve non-reasoning models. Conclusion: Strengthening reasoning through multi-stage, modality-progressive reinforcement learning substantially improves an Omni model's results on the HumanSense benchmark, and suitably designed prompts improve non-reasoning models without training. Abstract: While Multimodal Large Language Models (MLLMs) show immense promise for achieving truly human-like interactions, progress is hindered by the lack of fine-grained evaluation frameworks for human-centered scenarios, encompassing both the understanding of complex human intentions and the provision of empathetic, context-aware responses. Here we introduce HumanSense, a comprehensive benchmark designed to evaluate the human-centered perception and interaction capabilities of MLLMs, with a particular focus on deep understanding of extended multimodal contexts and the formulation of rational feedback. Our evaluation reveals that leading MLLMs still have considerable room for improvement, particularly for advanced interaction-oriented tasks. Supplementing visual input with audio and text information yields substantial improvements, and Omni-modal models show advantages on these tasks. Furthermore, we argue that appropriate feedback stems from a contextual analysis of the interlocutor's needs and emotions, with reasoning ability serving as the key to unlocking it. Accordingly, we employ multi-stage, modality-progressive reinforcement learning to enhance the reasoning abilities of an Omni model, achieving substantial gains on evaluation results. Additionally, we observe that successful reasoning processes exhibit highly consistent thought patterns. By designing corresponding prompts, we also enhance the performance of non-reasoning models in a training-free manner. 
Project page: https://digital-avatar.github.io/ai/HumanSense/
[131] EvTurb: Event Camera Guided Turbulence Removal
Yixing Liu,Minggui Teng,Yifei Xia,Peiqi Duan,Boxin Shi
Main category: cs.CV
TL;DR: EvTurb is a novel framework that uses event streams to effectively remove atmospheric turbulence distortions in images, outperforming existing methods.
Details
Motivation: Atmospheric turbulence introduces blur and tilt distortions in images, which are challenging for existing single-image and multi-frame methods to address due to the complexity of turbulence-induced distortions. Method: EvTurb uses a two-step event-guided network: event integrals to reduce blur, and a variance map to eliminate tilt distortion. A new dataset, TurbEvent, is also introduced. Result: EvTurb achieves superior performance over existing methods in removing turbulence effects while maintaining computational efficiency. Conclusion: EvTurb is an effective turbulence removal framework that decouples blur and tilt effects using high-speed event streams and outperforms state-of-the-art methods. Abstract: Atmospheric turbulence degrades image quality by introducing blur and geometric tilt distortions, posing significant challenges to downstream computer vision tasks. Existing single-image and multi-frame methods struggle with the highly ill-posed nature of this problem due to the compositional complexity of turbulence-induced distortions. To address this, we propose EvTurb, an event-guided turbulence removal framework that leverages high-speed event streams to decouple blur and tilt effects. EvTurb decouples blur and tilt effects by modeling event-based turbulence formation, specifically through a novel two-step event-guided network: event integrals are first employed to reduce blur in the coarse outputs. This is followed by employing a variance map, derived from raw event streams, to eliminate the tilt distortion for the refined outputs. Additionally, we present TurbEvent, the first real-captured dataset featuring diverse turbulence scenarios. Experimental results demonstrate that EvTurb surpasses state-of-the-art methods while maintaining computational efficiency.
[132] Towards Powerful and Practical Patch Attacks for 2D Object Detection in Autonomous Driving
Yuxin Cao,Yedi Zhang,Wentao He,Yifan Liao,Yan Xiao,Chang Li,Zhiyong Huang,Jin Song Dong
Main category: cs.CV
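TL;DR: This paper proposes P$^3$A, a new adversarial patch attack framework for 2D object detection in autonomous driving; a new metric, a new loss function, and a data preprocessing step improve attack transferability on high-resolution datasets.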
Details
Motivation: Learning-based autonomous driving systems are vulnerable to adversarial patch attacks, and existing transferability-based black-box attacks perform poorly on high-resolution images. Method: The paper proposes a new metric, PASR, and a tailored loss, LCSL, and incorporates Probabilistic Scale-Preserving Padding (PSPP) as a data preprocessing step. Result: P$^3$A outperforms existing attacks on unseen models and high-resolution datasets, under both the new PASR metric and the conventional mAP metric. Conclusion: P$^3$A is a powerful and practical patch attack framework for 2D object detection in autonomous driving, specifically optimized for attack transferability on high-resolution datasets. Abstract: Learning-based autonomous driving systems remain critically vulnerable to adversarial patches, posing serious safety and security risks in their real-world deployment. Black-box attacks, notable for their high attack success rate without model knowledge, are especially concerning, with their transferability extensively studied to reduce computational costs compared to query-based attacks. Previous transferability-based black-box attacks typically adopt mean Average Precision (mAP) as the evaluation metric and design training loss accordingly. However, due to the presence of multiple detected bounding boxes and the relatively lenient Intersection over Union (IoU) thresholds, the attack effectiveness of these approaches is often overestimated, resulting in reduced success rates in practical attacking scenarios. Furthermore, patches trained on low-resolution data often fail to maintain effectiveness on high-resolution images, limiting their transferability to autonomous driving datasets. To fill this gap, we propose P$^3$A, a Powerful and Practical Patch Attack framework for 2D object detection in autonomous driving, specifically optimized for high-resolution datasets. First, we introduce a novel metric, Practical Attack Success Rate (PASR), to more accurately quantify attack effectiveness with greater relevance for pedestrian safety. Second, we present a tailored Localization-Confidence Suppression Loss (LCSL) to improve attack transferability under PASR. Finally, to maintain the transferability for high-resolution datasets, we further incorporate the Probabilistic Scale-Preserving Padding (PSPP) into the patch attack pipeline as a data preprocessing step. 
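The abstract attributes the overestimation of attack effectiveness partly to lenient Intersection over Union (IoU) thresholds. As a reminder of the quantity those thresholds are applied to, here is a minimal sketch of IoU for two axis-aligned boxes. The (x1, y1, x2, y2) box format, the function name `iou`, and the example boxes are assumptions made for illustration; the PASR metric itself is not defined in this digest and is not reproduced here.

```python
def iou(box_a, box_b):
    """Intersection over Union of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Width and height of the overlap rectangle, clamped at zero when boxes are disjoint
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

# Hypothetical boxes: a detection shifted by half the width of the ground truth
ground_truth = (0.0, 0.0, 10.0, 10.0)
detection = (5.0, 0.0, 15.0, 10.0)
print(iou(ground_truth, detection))
```

With these boxes the overlap is one third of the union, so the shifted detection would still count as a match under an IoU threshold of 0.3 but not under 0.5, which is the kind of sensitivity to threshold choice the abstract points to.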
Extensive experiments show that P$^3$A outperforms state-of-the-art attacks on unseen models and unseen high-resolution datasets, both under the proposed practical IoU-based evaluation metric and the previous mAP-based metrics.
[133] Fourier-Guided Attention Upsampling for Image Super-Resolution
Daejune Choi,Youchan No,Jinhyung Lee,Duksu Kim
Main category: cs.CV
TL;DR: This paper introduces FGA, a lightweight and effective upsampling module for image super-resolution that addresses the limitations of conventional methods in reconstructing high-frequency details and reducing aliasing artifacts.
Details
Motivation: Conventional upsamplers like Sub-Pixel Convolution are efficient but struggle to reconstruct high-frequency details and often introduce aliasing artifacts in image super-resolution. Method: FGA integrates three components: a Fourier feature-based MLP for positional frequency encoding, a cross-resolution Correlation Attention Layer for adaptive spatial alignment, and a frequency-domain L1 loss for spectral fidelity supervision. Result: FGA achieves average PSNR gains of 0.12 to 0.14 dB, improves frequency-domain consistency by up to 29%, and performs particularly well on texture-rich datasets. Conclusion: FGA is a practical and scalable alternative to traditional upsampling methods, effectively reducing aliasing and preserving fine details in image super-resolution. Abstract: We propose Frequency-Guided Attention (FGA), a lightweight upsampling module for single image super-resolution. Conventional upsamplers, such as Sub-Pixel Convolution, are efficient but frequently fail to reconstruct high-frequency details and introduce aliasing artifacts. FGA addresses these issues by integrating (1) a Fourier feature-based Multi-Layer Perceptron (MLP) for positional frequency encoding, (2) a cross-resolution Correlation Attention Layer for adaptive spatial alignment, and (3) a frequency-domain L1 loss for spectral fidelity supervision. Adding merely 0.3M parameters, FGA consistently enhances performance across five diverse super-resolution backbones in both lightweight and full-capacity scenarios. Experimental results demonstrate average PSNR gains of 0.12 to 0.14 dB and improved frequency-domain consistency by up to 29%, particularly evident on texture-rich datasets. Visual and spectral evaluations confirm FGA's effectiveness in reducing aliasing and preserving fine details, establishing it as a practical, scalable alternative to traditional upsampling methods.
[134] FIND-Net -- Fourier-Integrated Network with Dictionary Kernels for Metal Artifact Reduction
Farid Tasharofi,Fuxin Fan,Melika Qahqaie,Mareike Thies,Andreas Maier
Main category: cs.CV
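TL;DR: FIND-Net is a new MAR framework that integrates frequency-domain and spatial-domain processing to reduce artifacts effectively while preserving anatomical structures.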
Details
Motivation: Metal artifacts severely degrade CT image quality, and existing deep learning algorithms struggle to suppress artifacts while preserving structural details. Method: FIND-Net combines frequency-domain and spatial-domain processing, using Fast Fourier Convolution (FFC) layers and trainable Gaussian filtering. Result: On synthetic datasets, FIND-Net achieves a 3.07% MAE reduction, a 0.18% SSIM increase, and a 0.90% PSNR improvement; it also performs well on real-world clinical CT scans. Conclusion: FIND-Net marks a clear advance in MAR performance, offering superior structural preservation and improved clinical applicability. Abstract: Metal artifacts, caused by high-density metallic implants in computed tomography (CT) imaging, severely degrade image quality, complicating diagnosis and treatment planning. While existing deep learning algorithms have achieved notable success in Metal Artifact Reduction (MAR), they often struggle to suppress artifacts while preserving structural details. To address this challenge, we propose FIND-Net (Fourier-Integrated Network with Dictionary Kernels), a novel MAR framework that integrates frequency and spatial domain processing to achieve superior artifact suppression and structural preservation. FIND-Net incorporates Fast Fourier Convolution (FFC) layers and trainable Gaussian filtering, treating MAR as a hybrid task operating in both spatial and frequency domains. This approach enhances global contextual understanding and frequency selectivity, effectively reducing artifacts while maintaining anatomical structures. Experiments on synthetic datasets show that FIND-Net achieves statistically significant improvements over state-of-the-art MAR methods, with a 3.07% MAE reduction, 0.18% SSIM increase, and 0.90% PSNR improvement, confirming robustness across varying artifact complexities. Furthermore, evaluations on real-world clinical CT scans confirm FIND-Net's ability to minimize modifications to clean anatomical regions while effectively suppressing metal-induced distortions. These findings highlight FIND-Net's potential for advancing MAR performance, offering superior structural preservation and improved clinical applicability. Code is available at https://github.com/Farid-Tasharofi/FIND-Net
[135] Increasing the Utility of Synthetic Images through Chamfer Guidance
Nicola Dall'Asen,Xiaofeng Zhang,Reyhane Askari Hemmat,Melissa Hall,Jakob Verbeek,Adriana Romero-Soriano,Michal Drozdzal
Main category: cs.CV
TL;DR: Chamfer Guidance is a training-free method that uses real exemplar images to enhance both the quality and diversity of synthetic image generation, improving downstream model performance and reducing computational cost.
Details
Motivation: Recent advances in conditional image generation have prioritized generation quality over diversity, limiting their usefulness as sources of synthetic training data. Existing guidance methods often fail to account for the distribution shift between synthetic and real data, necessitating a new approach like Chamfer Guidance. Method: The authors introduced Chamfer Guidance, a training-free approach that uses a small number of real images to guide the generation process, focusing on both diversity and quality. They evaluated the method using metrics like precision and distributional coverage on ImageNet-1k and geo-diversity benchmarks, and tested its utility by training downstream classifiers on synthetic data. Result: Chamfer Guidance significantly improved generation diversity while maintaining or enhancing quality. Using only 2 real exemplar images, it achieved 96.4% precision and 86.4% distributional coverage; these metrics increased to 97.5% and 92.7% with 32 images. Training downstream classifiers on the generated data led to accuracy improvements of up to 15% for in-distribution and 16% for out-of-distribution tasks. The method also reduced FLOPs by 31% compared to classifier-free-guidance approaches. Conclusion: Chamfer Guidance is an effective training-free guidance approach that enhances the diversity and quality of synthetic data generated by image models, surpassing baseline methods in performance while reducing computational costs. Abstract: Conditional image generative models hold considerable promise to produce infinite amounts of synthetic training data. Yet, recent progress in generation quality has come at the expense of generation diversity, limiting the utility of these models as a source of synthetic training data. 
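The method takes its name from the Chamfer distance between point sets. The sketch below computes a symmetric Chamfer distance between two small sets of feature vectors in plain Python, as one way to see how a single score can reflect both how close generated samples are to real ones and how well the real ones are covered. The function names, the use of Euclidean distance, and the toy 2-D features are assumptions for illustration; how the paper turns such a distance into a guidance signal is not specified in this digest.

```python
import math

def nearest_distance(point, others):
    """Euclidean distance from point to its nearest neighbor in others."""
    return min(math.dist(point, other) for other in others)

def chamfer_distance(generated, real):
    """Symmetric Chamfer distance between two non-empty sets of feature vectors."""
    # Generated-to-real term: penalizes generated samples far from every real exemplar
    gen_to_real = sum(nearest_distance(g, real) for g in generated) / len(generated)
    # Real-to-generated term: penalizes real exemplars that no generated sample is near
    real_to_gen = sum(nearest_distance(r, generated) for r in real) / len(real)
    return gen_to_real + real_to_gen

# Toy 2-D features (hypothetical): two real exemplars and two candidate batches
real = [(0.0, 0.0), (4.0, 0.0)]
collapsed = [(0.0, 0.0), (0.0, 0.0)]  # every sample sits on one exemplar
spread = [(0.0, 0.0), (4.0, 0.0)]     # samples cover both exemplars

print(chamfer_distance(collapsed, real))  # 2.0
print(chamfer_distance(spread, real))     # 0.0
```

The collapsed batch scores perfectly on the generated-to-real term yet is penalized by the real-to-generated term because one exemplar is left uncovered, which is the intuition behind using a two-sided distance when both quality and diversity matter.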
Although guidance-based approaches have been introduced to improve the utility of generated data by focusing on quality or diversity, the (implicit or explicit) utility functions oftentimes disregard the potential distribution shift between synthetic and real data. In this work, we introduce Chamfer Guidance: a training-free guidance approach which leverages a handful of real exemplar images to characterize the quality and diversity of synthetic data. We show that by leveraging the proposed Chamfer Guidance, we can boost the diversity of the generations w.r.t. a dataset of real images while maintaining or improving the generation quality on ImageNet-1k and standard geo-diversity benchmarks. Our approach achieves state-of-the-art few-shot performance with as few as 2 real exemplar images, obtaining 96.4% in terms of precision, and 86.4% in terms of distributional coverage, which increase to 97.5% and 92.7%, respectively, when using 32 real images. We showcase the benefits of the Chamfer Guidance generation by training downstream image classifiers on synthetic data, achieving an accuracy boost of up to 15% for in-distribution over the baselines, and up to 16% in out-of-distribution. Furthermore, our approach does not require using the unconditional model, and thus obtains a 31% reduction in FLOPs w.r.t. classifier-free-guidance-based approaches at sampling time.
[136] ChatENV: An Interactive Vision-Language Model for Sensor-Guided Environmental Monitoring and Scenario Simulation
Hosam Elgendy,Ahmed Sharshar,Ahmed Aboeitta,Mohsen Guizani
Main category: cs.CV
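TL;DR: ChatENV is an interactive environmental monitoring model that combines satellite imagery with sensor data, supports temporal change analysis and what-if reasoning, and targets applications such as climate change and urban planning.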
Details
Motivation: Current vision-language models ignore causal signals from environmental sensors, rely on single-source captions, and lack interactive scenario-based reasoning. Method: The authors build a dataset of 177,000 images, annotate it using GPT-4o and Gemini 2.0, and fine-tune Qwen-2.5-VL. Result: ChatENV performs strongly on temporal reasoning and what-if questions, with a BERT-F1 score of 0.903, and supports interactive scenario analysis. Conclusion: ChatENV is an interactive environmental monitoring tool grounded in sensor data and satellite imagery, with strong temporal and what-if analysis capabilities. Abstract: Understanding environmental changes from aerial imagery is vital for climate resilience, urban planning, and ecosystem monitoring. Yet, current vision language models (VLMs) overlook causal signals from environmental sensors, rely on single-source captions prone to stylistic bias, and lack interactive scenario-based reasoning. We present ChatENV, the first interactive VLM that jointly reasons over satellite image pairs and real-world sensor data. Our framework: (i) creates a 177k-image dataset forming 152k temporal pairs across 62 land-use classes in 197 countries with rich sensor metadata (e.g., temperature, PM10, CO); (ii) annotates data using GPT-4o and Gemini 2.0 for stylistic and semantic diversity; and (iii) fine-tunes Qwen-2.5-VL using efficient Low-Rank Adaptation (LoRA) adapters for chat purposes. ChatENV achieves strong performance in temporal and "what-if" reasoning (e.g., BERT-F1 0.903) and rivals or outperforms state-of-the-art temporal models, while supporting interactive scenario-based analysis. This positions ChatENV as a powerful tool for grounded, sensor-aware environmental monitoring.
[137] Processing and acquisition traces in visual encoders: What does CLIP know about your camera?
Ryan Ramos,Vladan Stojnić,Giorgos Kordopatis-Zilos,Yuta Nakashima,Giorgos Tolias,Noa Garcia
Main category: cs.CV
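TL;DR: The paper analyzes how parameters of the image acquisition process affect visual encoders and finds that these parameters, even when subtle, can significantly affect semantic predictions.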
Details
Motivation: Prior work has focused on how severe image corruptions affect visual encoders; this paper examines the effect on semantic predictions of image acquisition parameters that are subtle or even imperceptible to the human eye. Method: The authors analyze visual encoders under different image acquisition parameters, measure the effect on semantic predictions, and test whether these parameters are encoded in the visual representations. Result: Image acquisition parameters are systematically encoded in the visual representations and can be easily recovered; their presence significantly affects semantic predictions, with the effect depending on their correlation with the semantic labels. Conclusion: Visual encoders learn not only semantic information but also encode image acquisition parameters, which poses new challenges for model robustness and generalization. Abstract: Prior work has analyzed the robustness of visual encoders to image transformations and corruptions, particularly in cases where such alterations are not seen during training. When this occurs, they introduce a form of distribution shift at test time, often leading to performance degradation. The primary focus has been on severe corruptions that, when applied aggressively, distort useful signals necessary for accurate semantic predictions. We take a different perspective by analyzing parameters of the image acquisition process and transformations that may be subtle or even imperceptible to the human eye. We find that such parameters are systematically encoded in the learned visual representations and can be easily recovered. More strikingly, their presence can have a profound impact, either positively or negatively, on semantic predictions. This effect depends on whether there is a strong correlation or anti-correlation between semantic labels and these acquisition-based or processing-based labels. Our code and data are available at: https://github.com/ryan-caesar-ramos/visual-encoder-traces
[138] Lameness detection in dairy cows using pose estimation and bidirectional LSTMs
Helena Russello,Rik van der Tol,Eldert J. van Henten,Gert Kootstra
Main category: cs.CV
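TL;DR: This paper proposes a lameness detection method that combines pose estimation with a Bidirectional Long-Short-Term Memory (BLSTM) neural network; it outperforms a method based on manually designed features and can detect lameness from as little as one second of video.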
Details
Motivation: To overcome the reliance of traditional lameness detection methods on manual feature engineering and long data sequences, the paper proposes a more automated and efficient detection method. Method: The T-LEAP pose estimation model extracts keypoint trajectories from videos of walking cows; these trajectories are fed to a BLSTM classifier for binary lameness classification, which avoids manual feature extraction and exploits temporal motion information. Result: The proposed BLSTM method achieves significantly higher classification accuracy than the traditional method (85% vs. 80%) and can detect lameness effectively from as little as one second of video. Conclusion: Combining pose estimation with a BLSTM classifier performs well for lameness detection; it is markerless, learns features automatically, and works with short sequences, offering a new approach to animal health monitoring. Abstract: This study presents a lameness detection approach that combines pose estimation and Bidirectional Long-Short-Term Memory (BLSTM) neural networks. Combining pose estimation with a BLSTM classifier offers the following advantages: markerless pose estimation, elimination of manual feature engineering by learning temporal motion features from the keypoint trajectories, and working with short sequences and small training datasets. Motion sequences of nine keypoints (located on the cows' hooves, head and back) were extracted from videos of walking cows with the T-LEAP pose estimation model. The trajectories of the keypoints were then used as an input to a BLSTM classifier that was trained to perform binary lameness classification. Our method significantly outperformed an established method that relied on manually-designed locomotion features: our best architecture achieved a classification accuracy of 85%, against 80% accuracy for the feature-based approach. Furthermore, we showed that our BLSTM classifier could detect lameness with as little as one second of video data.
[139] SemPT: Semantic Prompt Tuning for Vision-Language Models
Xiao Shi,Yangjun Ou,Zhenzhong Chen
Main category: cs.CV
TL;DR: This paper proposes a new framework called Semantic Prompt Tuning (SemPT) for visual transfer learning, which leverages shared attribute-level knowledge across categories to improve transferability and generalization.
Details
Motivation: Existing prompt tuning methods rely on sparse category labels or disparate LLM-generated descriptions, which fragment knowledge representation and hinder transferability. There is a need for a method that can preserve category-specific representations while acquiring transferable knowledge. Method: SemPT adopts a two-step prompting strategy to guide LLM in extracting shared visual attributes and generating attribute-level descriptions. Visually guided weighting is applied to the embeddings of attribute-level descriptions. Image embeddings are jointly aligned with both label and attribute-enhanced text embeddings. Result: Extensive experiments on 15 benchmark datasets demonstrate the effectiveness of SemPT. Conclusion: Semantic Prompt Tuning (SemPT) achieves state-of-the-art performance across various settings, including base-to-novel generalization, cross-dataset transfer, cross-domain transfer, and few-shot learning. Abstract: Visual transfer learning for unseen categories presents an active research topic yet a challenging task, due to the inherent conflict between preserving category-specific representations and acquiring transferable knowledge. Vision-Language Models (VLMs) pre-trained on large amounts of image-text pairs offer a promising solution. However, existing prompt tuning methods rely on sparse category labels or disparate LLM-generated descriptions, which fragment knowledge representation and hinder transferability. To address this limitation, we introduce Semantic Prompt Tuning (SemPT), a novel framework that tackles the generalization challenge by leveraging shared attribute-level knowledge across categories. Specifically, SemPT adopts a two-step prompting strategy to guide LLM in extracting shared visual attributes and generating attribute-level descriptions, capturing transferable semantic cues beyond labels while ensuring coherent structure. 
Then, visually guided weighting is applied to the embeddings of attribute-level descriptions to reduce noise from irrelevant attributes and enhance the text embeddings. Additionally, image embeddings are jointly aligned with both label and attribute-enhanced text embeddings, balancing discrimination for seen categories and transferability to unseen ones. Considering the availability of category exposure, our inference dynamically selects between standard label embeddings for seen categories and attribute-enhanced embeddings for unseen ones to ensure effective adaptation. Extensive experiments on 15 benchmark datasets demonstrate that SemPT achieves state-of-the-art performance across various settings, including base-to-novel generalization, cross-dataset transfer, cross-domain transfer, and few-shot learning.
[140] Serial Over Parallel: Learning Continual Unification for Multi-Modal Visual Object Tracking and Benchmarking
Zhangyong Tang,Tianyang Xu,Xuefeng Zhu,Chunyang Cheng,Tao Zhou,Xiaojun Wu,Josef Kittler
Main category: cs.CV
TL;DR: This paper introduces UniBench300, a unified benchmark for multi-modal visual object tracking, and proposes a serial integration approach inspired by continual learning to address inconsistency and performance degradation in existing methods.
Details
Motivation: The motivation stems from the complementary nature of different modalities in building robust tracking systems and the inconsistency between training and testing caused by the absence of a unified benchmark. Method: The authors introduce a unified benchmark, UniBench300, and reformulate the unification process in a serial format inspired by continual learning to progressively integrate new tasks while mitigating knowledge forgetting. Result: Experiments show that UniBench300 reduces inference passes from three to one, cuts time consumption by 27%, and demonstrates that continual learning supports a stable unification process. Additionally, performance degradation is found to be negatively correlated with network capacity. Conclusion: The paper concludes that the proposed UniBench300 benchmark and the serial integration approach, inspired by continual learning, effectively address the inconsistency and performance degradation in multi-modal visual object tracking tasks. Abstract: Unifying multiple multi-modal visual object tracking (MMVOT) tasks draws increasing attention due to the complementary nature of different modalities in building robust tracking systems. Existing practices mix all data sensor types in a single training procedure, structuring a parallel paradigm from the data-centric perspective and aiming for a global optimum on the joint distribution of the involved tasks. However, the absence of a unified benchmark where all types of data coexist forces evaluations on separated benchmarks, causing inconsistency between training and testing, thus leading to performance degradation. To address these issues, this work advances in two aspects: (1) A unified benchmark, coined as UniBench300, is introduced to bridge the inconsistency by incorporating multiple task data, reducing inference passes from three to one and cutting time consumption by 27%. 
(2) The unification process is reformulated in a serial format, progressively integrating new tasks. In this way, the performance degradation can be specified as knowledge forgetting of previous tasks, which naturally aligns with the philosophy of continual learning (CL), motivating further exploration of injecting CL into the unification process. Extensive experiments conducted on two baselines and four benchmarks demonstrate the significance of UniBench300 and the superiority of CL in supporting a stable unification process. Moreover, while conducting dedicated analyses, the performance degradation is found to be negatively correlated with network capacity. Additionally, modality discrepancies contribute to varying degradation levels across tasks (RGBT > RGBD > RGBE in MMVOT), offering valuable insights for future multi-modal vision research. Source code and the proposed benchmark are available at https://github.com/Zhangyong-Tang/UniBench300.
[141] AddressVLM: Cross-view Alignment Tuning for Image Address Localization using Large Vision-Language Models
Shixiong Xu,Chenghao Zhang,Lubin Fan,Yuan Zhou,Bin Fan,Shiming Xiang,Gaofeng Meng,Jieping Ye
Main category: cs.CV
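TL;DR: This paper proposes AddressVLM, which integrates satellite and street-view imagery to improve the performance of large vision-language models on street-level address localization.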
Details
Motivation: Existing LVLMs perform well at country-level or city-level geo-localization but struggle with fine-grained street-level localization within cities. Method: The paper proposes AddressVLM, trained with a two-stage protocol of cross-view alignment tuning followed by address localization tuning, using satellite images as macro cues to strengthen the model's understanding of street distribution. Result: On two datasets from Pittsburgh and San Francisco, AddressVLM exceeds counterpart LVLMs in average address localization accuracy by over 9% and 12%, respectively. Conclusion: AddressVLM significantly outperforms other LVLMs on street-level localization; integrating multi-view information from satellite and street-view images improves address localization accuracy. Abstract: Large visual language models (LVLMs) have demonstrated impressive performance in coarse-grained geo-localization at the country or city level, but they struggle with fine-grained street-level localization within urban areas. In this paper, we explore integrating city-wide address localization capabilities into LVLMs, facilitating flexible address-related question answering using street-view images. A key challenge is that the street-view visual question-and-answer (VQA) data provides only microscopic visual cues, leading to subpar performance in fine-tuned models. To tackle this issue, we incorporate perspective-invariant satellite images as macro cues and propose cross-view alignment tuning including a satellite-view and street-view image grafting mechanism, along with an automatic label generation mechanism. The LVLM's global understanding of street distribution is then enhanced through cross-view matching. Our proposed model, named AddressVLM, is trained with a two-stage protocol: cross-view alignment tuning and address localization tuning. Furthermore, we have constructed two street-view VQA datasets based on image address localization datasets from Pittsburgh and San Francisco. Qualitative and quantitative evaluations demonstrate that AddressVLM outperforms counterpart LVLMs by over 9% and 12% in average address localization accuracy on these two datasets, respectively.
[142] Hybrid Generative Fusion for Efficient and Privacy-Preserving Face Recognition Dataset Generation
Feiran Li,Qianqian Xu,Shilong Bao,Boyu Han,Zhiyong Yang,Qingming Huang
Main category: cs.CV
TL;DR: This paper introduces a hybrid approach combining GAN-based and diffusion-based techniques to create a high-quality, non-overlapping face dataset, achieving top performance in a face recognition challenge.
Details
Motivation: The motivation is to build a high-quality, non-overlapping face dataset for training a face recognition model, addressing challenges in dataset consistency and identity leakage. Method: The method involves dataset cleaning using a Mixture-of-Experts strategy, data augmentation, synthetic identity generation with Stable Diffusion and Vec2Face, and a curriculum learning approach to manage synthetic identity similarity. Result: The final dataset contains 50 images per identity, with no overlap with existing datasets, and the method significantly improves model performance across multiple identity scales. Conclusion: The paper concludes that their proposed method effectively constructs a diverse and high-quality face dataset without overlapping identities, achieving 1st place in the DataCV ICCV Challenge. Abstract: In this paper, we present our approach to the DataCV ICCV Challenge, which centers on building a high-quality face dataset to train a face recognition model. The constructed dataset must not contain identities overlapping with any existing public face datasets. To handle this challenge, we begin with a thorough cleaning of the baseline HSFace dataset, identifying and removing mislabeled or inconsistent identities through a Mixture-of-Experts (MoE) strategy combining face embedding clustering and GPT-4o-assisted verification. We retain the largest consistent identity cluster and apply data augmentation up to a fixed number of images per identity. To further diversify the dataset, we generate synthetic identities using Stable Diffusion with prompt engineering. As diffusion models are computationally intensive, we generate only one reference image per identity and efficiently expand it using Vec2Face, which rapidly produces 49 identity-consistent variants. This hybrid approach fuses GAN-based and diffusion-based samples, enabling efficient construction of a diverse and high-quality dataset. 
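As an illustration of the cleaning step described above (retaining the largest consistent identity cluster), the following is a minimal sketch, not the authors' implementation. It assumes each image of one labeled identity already has a face embedding, links two images when their cosine similarity reaches a hypothetical threshold `sim_threshold`, and keeps the largest connected component. The GPT-4o-assisted verification part of the MoE strategy is not modeled here, and the function name and toy data are illustrative only.

```python
import numpy as np

def largest_consistent_cluster(embeddings, sim_threshold=0.5):
    """Return sorted indices of the largest connected component of the
    cosine-similarity graph (sim_threshold is a hypothetical value)."""
    x = np.asarray(embeddings, dtype=float)
    x = x / np.linalg.norm(x, axis=1, keepdims=True)  # unit-normalise rows
    adjacency = (x @ x.T) >= sim_threshold            # pairwise cosine links
    n = len(x)
    seen = np.zeros(n, dtype=bool)
    best = []
    for start in range(n):
        if seen[start]:
            continue
        # Depth-first search over images linked to `start`.
        stack, component = [start], []
        seen[start] = True
        while stack:
            i = stack.pop()
            component.append(i)
            for j in np.flatnonzero(adjacency[i] & ~seen):
                seen[j] = True
                stack.append(j)
        if len(component) > len(best):
            best = component
    return sorted(int(i) for i in best)

# Toy embeddings: three similar faces and one mislabeled outlier.
faces = np.array([[1.0, 0.0], [0.9, 0.1], [0.95, 0.05], [0.0, 1.0]])
kept = largest_consistent_cluster(faces)
print(kept)
```

Images outside the returned index set would be treated as mislabeled or inconsistent and dropped before augmentation.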
To address the high visual similarity among synthetic identities, we adopt a curriculum learning strategy by placing them early in the training schedule, allowing the model to progress from easier to harder samples. Our final dataset contains 50 images per identity, and all newly generated identities are checked against mainstream face datasets to ensure no identity leakage. Our method achieves 1st place in the competition, and experimental results show that our dataset improves model performance across 10K, 20K, and 100K identity scales. Code is available at https://github.com/Ferry-Li/datacv_fr.
[143] HyperTea: A Hypergraph-based Temporal Enhancement and Alignment Network for Moving Infrared Small Target Detection
Zhaoyuan Qi,Weihua Gao,Wenlong Niu,Jie Tang,Yun Li,Xiaodong Peng
Main category: cs.CV
TL;DR: HyperTea introduces a novel approach for MIRSTD by integrating CNNs, RNNs, and HGNNs, effectively modeling high-order spatiotemporal correlations and achieving SOTA performance on benchmark datasets.
Details
Motivation: MIRSTD is challenging due to the small size, weak intensity, and complex motion patterns of targets. Existing methods are limited in modeling high-order spatiotemporal correlations, prompting the exploration of hypergraphs for better performance. Method: HyperTea incorporates three modules: GTEM for global temporal context enhancement, LTEM for capturing local motion patterns, and TAM for addressing cross-scale feature misalignment. Result: HyperTea achieves state-of-the-art (SOTA) performance on the DAUB and IRDST datasets, demonstrating its effectiveness in MIRSTD tasks. Conclusion: HyperTea is the first work to integrate CNNs, RNNs, and HGNNs for MIRSTD, significantly improving detection performance, as demonstrated by experiments on DAUB and IRDST datasets. Abstract: In practical application scenarios, moving infrared small target detection (MIRSTD) remains highly challenging due to the target's small size, weak intensity, and complex motion pattern. Existing methods typically only model low-order correlations between feature nodes and perform feature extraction and enhancement within a single temporal scale. Although hypergraphs have been widely used for high-order correlation learning, they have received limited attention in MIRSTD. To explore the potential of hypergraphs and enhance multi-timescale feature representation, we propose HyperTea, which integrates global and local temporal perspectives to effectively model high-order spatiotemporal correlations of features. HyperTea consists of three modules: the global temporal enhancement module (GTEM) realizes global temporal context enhancement through semantic aggregation and propagation; the local temporal enhancement module (LTEM) is designed to capture local motion patterns between adjacent frames and then enhance local temporal context; additionally, we further develop a temporal alignment module (TAM) to address potential cross-scale feature misalignment. 
To the best of our knowledge, HyperTea is the first work to integrate convolutional neural networks (CNNs), recurrent neural networks (RNNs), and hypergraph neural networks (HGNNs) for MIRSTD, significantly improving detection performance. Experiments on DAUB and IRDST demonstrate its state-of-the-art (SOTA) performance. Our source code is available at https://github.com/Lurenjia-LRJ/HyperTea.
[144] Physics-Informed Joint Multi-TE Super-Resolution with Implicit Neural Representation for Robust Fetal T2 Mapping
Busra Bulut,Maik Dannecker,Thomas Sanchez,Sara Neves Silva,Vladyslav Zalevskyi,Steven Jia,Jean-Baptiste Ledoux,Guillaume Auzias,François Rousseau,Jana Hutter,Daniel Rueckert,Meritxell Bach Cuadra
Main category: cs.CV
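TL;DR: This study develops a new method that uses implicit neural representations and physics-informed regularization to address motion in T2 mapping for fetal brain MRI, thereby reducing the required scan time.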
Details
Motivation: T2 mapping in fetal brain MRI, where T2 decay is slower at mid-field strength, can improve characterization of the developing brain, but fetal MRI acquisition relies on multiple motion-corrupted stacks of thick slices and requires slice-to-volume reconstruction to estimate a high-resolution 3D volume, so a new method is needed to address this challenge. Method: The method combines implicit neural representations with physics-informed regularization to jointly reconstruct data across echo times and handle severe motion. Result: The study demonstrates state-of-the-art performance on simulated fetal brain and in vivo adult datasets and presents the first in vivo fetal T2 mapping results. Conclusion: The study shows the potential for reducing the number of stacks per echo time in fetal brain T2 mapping. Abstract: T2 mapping in fetal brain MRI has the potential to improve characterization of the developing brain, especially at mid-field (0.55T), where T2 decay is slower. However, this is challenging as fetal MRI acquisition relies on multiple motion-corrupted stacks of thick slices, requiring slice-to-volume reconstruction (SVR) to estimate a high-resolution (HR) 3D volume. Currently, T2 mapping involves repeated acquisitions of these stacks at each echo time (TE), leading to long scan times and high sensitivity to motion. We tackle this challenge with a method that jointly reconstructs data across TEs, addressing severe motion. Our approach combines implicit neural representations with a physics-informed regularization that models T2 decay, enabling information sharing across TEs while preserving anatomical and quantitative T2 fidelity. We demonstrate state-of-the-art performance on simulated fetal brain and in vivo adult datasets with fetal-like motion. We also present the first in vivo fetal T2 mapping results at 0.55T. Our study shows potential for reducing the number of stacks per TE in T2 mapping by leveraging anatomical redundancy.
[145] IADGPT: Unified LVLM for Few-Shot Industrial Anomaly Detection, Localization, and Reasoning via In-Context Learning
Mengyang Zhao,Teng Fu,Haiyang Yu,Ke Niu,Bin Li
Main category: cs.CV
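TL;DR: This paper proposes IADGPT, a unified framework for few-shot industrial anomaly detection, together with a three-stage training strategy and an in-context-learning-based training paradigm, which significantly improve detection performance and reasoning ability.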
Details
Motivation: Existing large vision-language models lack the basic industrial knowledge and reasoning capabilities needed for few-shot industrial anomaly detection, and fall short of professional quality inspectors. Method: The paper proposes a three-stage progressive training strategy and designs an in-context-learning-based training paradigm that lets IADGPT use few-shot images as exemplars, improving generalization to novel products. Result: Experiments show that IADGPT achieves considerable performance gains in anomaly detection and is competitive in anomaly localization and reasoning. Conclusion: IADGPT achieves considerable performance gains in industrial product anomaly detection and demonstrates competitiveness in anomaly localization and reasoning. Abstract: Few-Shot Industrial Anomaly Detection (FS-IAD) has important applications in automating industrial quality inspection. Recently, some FS-IAD methods based on Large Vision-Language Models (LVLMs) have been proposed with some achievements through prompt learning or fine-tuning. However, existing LVLMs focus on general tasks but lack basic industrial knowledge and reasoning capabilities related to FS-IAD, making these methods far from specialized human quality inspectors. To address these challenges, we propose a unified framework, IADGPT, designed to perform FS-IAD in a human-like manner, while also handling associated localization and reasoning tasks, even for diverse and novel industrial products. To this end, we introduce a three-stage progressive training strategy inspired by humans. Specifically, the first two stages gradually guide IADGPT in acquiring fundamental industrial knowledge and discrepancy awareness. In the third stage, we design an in-context learning-based training paradigm, enabling IADGPT to leverage few-shot images as exemplars for improved generalization to novel products. In addition, we design a strategy that enables IADGPT to output image-level and pixel-level anomaly scores using the logits output and the attention map, respectively, in conjunction with the language output to accomplish anomaly reasoning. To support our training, we present a new dataset comprising 100K images across 400 diverse industrial product categories with extensive attribute-level textual annotations. Experiments indicate IADGPT achieves considerable performance gains in anomaly detection and demonstrates competitiveness in anomaly localization and reasoning. 
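The abstract states that image-level scores come from the logits output and pixel-level scores from the attention map, without giving the exact formulas. The sketch below shows one plausible reading, offered only as an assumption: the image-level score is a two-way softmax over the logits of two hypothetical answer tokens (one meaning "anomalous", one meaning "normal"), and the pixel-level score is a min-max normalized attention map. The function names and inputs are illustrative and do not come from the paper.

```python
import numpy as np

def image_level_score(logit_anomalous, logit_normal):
    """Two-way softmax probability of the (hypothetical) 'anomalous' token."""
    m = max(logit_anomalous, logit_normal)  # subtract max for stability
    e_a = np.exp(logit_anomalous - m)
    e_n = np.exp(logit_normal - m)
    return float(e_a / (e_a + e_n))

def pixel_level_scores(attention_map):
    """Min-max normalise an attention map to the range [0, 1]."""
    a = np.asarray(attention_map, dtype=float)
    span = a.max() - a.min()
    return np.zeros_like(a) if span == 0 else (a - a.min()) / span

score = image_level_score(2.0, 0.0)                    # equals sigmoid(2.0)
heat = pixel_level_scores([[0.1, 0.3], [0.5, 0.9]])    # toy 2x2 attention map
print(round(score, 4), heat)
```

Under this reading, a threshold on `score` would give the image-level decision and a threshold on `heat` would give a localization mask; both thresholds would need to be chosen on validation data.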
We will release our dataset with the camera-ready version.
[146] Novel View Synthesis using DDIM Inversion
Sehajdeep Singh,A V Subramanyam
Main category: cs.CV
TL;DR: This paper proposes an efficient framework for synthesizing novel views from a single image using a translation U-Net and fusion strategy, leveraging a pretrained diffusion model for high-quality results.
Details
Motivation: The study is motivated by the limitations of existing methods, such as high computational cost, blurry reconstruction, and poor generalization, when synthesizing novel views from a single input image. Method: The approach employs TUNet to predict the latent corresponding to the target view based on the DDIM-inverted latent of the input image. A fusion strategy is then applied to preserve texture and fine-grained details before using DDIM sampling for novel view synthesis. Result: The experiments conducted on MVImgNet demonstrate that the proposed method achieves better performance compared to existing techniques. Conclusion: The proposed method, which uses a camera pose-conditioned translation U-Net (TUNet) and a novel fusion strategy, outperforms existing methods in synthesizing novel views from a single input image. Abstract: Synthesizing novel views from a single input image is a challenging task. It requires extrapolating the 3D structure of a scene while inferring details in occluded regions, and maintaining geometric consistency across viewpoints. Many existing methods must fine-tune large diffusion backbones using multiple views or train a diffusion model from scratch, which is extremely expensive. Additionally, they suffer from blurry reconstruction and poor generalization. This gap presents the opportunity to explore an explicit lightweight view translation framework that can directly utilize the high-fidelity generative capabilities of a pretrained diffusion model while reconstructing a scene from a novel view. Given the DDIM-inverted latent of a single input image, we employ a camera pose-conditioned translation U-Net, TUNet, to predict the inverted latent corresponding to the desired target view. However, the image sampled using the predicted latent may result in a blurry reconstruction. To this end, we propose a novel fusion strategy that exploits the inherent noise correlation structure observed in DDIM inversion. 
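For readers unfamiliar with DDIM inversion, the sketch below shows the standard deterministic DDIM update run in both directions. It is a toy under stated assumptions: `toy_eps_model` is a hypothetical stand-in for the pretrained noise predictor, the `alphas` schedule is invented for illustration, and neither TUNet nor the fusion strategy from the paper is modeled.

```python
import numpy as np

def ddim_step(x, eps, alpha_from, alpha_to):
    """Deterministic DDIM update between two cumulative-alpha levels.

    Moving to a smaller alpha (more noise) is inversion; moving to a
    larger alpha (less noise) is sampling.
    """
    x0_pred = (x - np.sqrt(1.0 - alpha_from) * eps) / np.sqrt(alpha_from)
    return np.sqrt(alpha_to) * x0_pred + np.sqrt(1.0 - alpha_to) * eps

def toy_eps_model(x, alpha):
    """Hypothetical noise predictor: always predicts a constant."""
    return np.full_like(x, 0.1)

alphas = np.array([0.99, 0.9, 0.7, 0.4, 0.1])  # assumed schedule, clean -> noisy
latent = np.array([0.5, -1.0, 2.0])

# Inversion: walk from the nearly clean latent toward noise.
x = latent.copy()
for a_from, a_to in zip(alphas[:-1], alphas[1:]):
    x = ddim_step(x, toy_eps_model(x, a_from), a_from, a_to)
inverted = x.copy()

# Sampling: walk back from the inverted latent toward the clean end.
rev = alphas[::-1]
for a_from, a_to in zip(rev[:-1], rev[1:]):
    x = ddim_step(x, toy_eps_model(x, a_from), a_from, a_to)
reconstructed = x
print(inverted, reconstructed)
```

With a constant noise prediction the round trip is exact; with a real network the inversion is only approximate, because the noise predicted at one step is used as a proxy for the next.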
The proposed fusion strategy helps preserve the texture and fine-grained details. To synthesize the novel view, we use the fused latent as the initial condition for DDIM sampling, leveraging the generative prior of the pretrained diffusion model. Extensive experiments on MVImgNet demonstrate that our method outperforms existing methods.
[147] Beyond conventional vision: RGB-event fusion for robust object detection in dynamic traffic scenarios
Zhanwen Liu,Yujing Sun,Yang Wang,Nan Yang,Shengbo Eben Li,Xiangmo Zhao
Main category: cs.CV
TL;DR: This paper proposes MCFNet, a motion cue fusion framework combining event and RGB cameras, which enhances object detection in poor lighting and fast-moving traffic scenarios.
Details
Motivation: The dynamic range limitation of conventional RGB cameras reduces global contrast and detail clarity in complex traffic environments, hindering effective feature extraction and object detection. Method: A motion cue fusion network (MCFNet) is introduced, incorporating an event correction module (ECM), event dynamic upsampling module (EDUM), and cross-modal mamba fusion module (CMM) to align and fuse asynchronous event and RGB data effectively. Result: MCFNet outperforms existing methods significantly, achieving a 7.4% improvement in mAP50 and 1.7% in mAP on the DSEC-Det dataset under challenging lighting conditions. Conclusion: The proposed MCFNet, which integrates a bio-inspired event camera with an RGB camera, achieves superior performance in object detection under challenging lighting conditions compared to existing methods. Abstract: The dynamic range limitation of conventional RGB cameras reduces global contrast and causes loss of high-frequency details such as textures and edges in complex traffic environments (e.g., nighttime driving, tunnels), hindering discriminative feature extraction and degrading frame-based object detection. To address this, we integrate a bio-inspired event camera with an RGB camera to provide high dynamic range information and propose a motion cue fusion network (MCFNet), which achieves optimal spatiotemporal alignment and adaptive cross-modal feature fusion under challenging lighting. Specifically, an event correction module (ECM) temporally aligns asynchronous event streams with image frames via optical-flow-based warping, jointly optimized with the detection network to learn task-aware event representations. The event dynamic upsampling module (EDUM) enhances spatial resolution of event frames to match image structures, ensuring precise spatiotemporal alignment. 
The cross-modal mamba fusion module (CMM) uses adaptive feature fusion with a novel interlaced scanning mechanism, effectively integrating complementary information for robust detection. Experiments conducted on the DSEC-Det and PKU-DAVIS-SOD datasets demonstrate that MCFNet significantly outperforms existing methods in various poor-lighting and fast-moving traffic scenarios. Notably, on the DSEC-Det dataset, MCFNet surpasses the best existing methods by 7.4% in mAP50 and 1.7% in mAP. The code is available at https://github.com/Charm11492/MCFNet.
[148] CountCluster: Training-Free Object Quantity Guidance with Cross-Attention Map Clustering for Text-to-Image Generation
Joohyeon Lee,Jin-Seop Lee,Jee-Hyong Lee
Main category: cs.CV
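TL;DR: CountCluster improves the object-count accuracy of text-to-image generation models by guiding the clustering of the object cross-attention map during early denoising steps.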
Details
Motivation: Diffusion models excel at text-to-image generation but still fall short in generating images that accurately reflect the number of objects specified in the input prompt. Method: At inference time, the method partitions the object cross-attention map into k clusters based on attention scores, defines an ideal distribution, and optimizes the latent to align with this target distribution. Result: The method improves object count accuracy by an average of 18.5%p over existing methods and shows better quantity control across a variety of prompts. Conclusion: Without relying on external tools or additional training, CountCluster significantly improves the accuracy of object counts in generated images and demonstrates superior quantity control. Abstract: Diffusion-based text-to-image generation models have demonstrated strong performance in terms of image quality and diversity. However, they still struggle to generate images that accurately reflect the number of objects specified in the input prompt. Several approaches have been proposed that rely on either external counting modules for iterative refinement or quantity representations derived from learned tokens or latent features. However, they still have limitations in accurately reflecting the specified number of objects and overlook an important structural characteristic: the number of object instances in the generated image is largely determined in the early timesteps of the denoising process. To correctly reflect the object quantity for image generation, the highly activated regions in the object cross-attention map at the early timesteps should match the input object quantity, while each region should be clearly separated. To address this issue, we propose \textit{CountCluster}, a method that guides the object cross-attention map to be clustered according to the specified object count in the input, without relying on any external tools or additional training. The proposed method partitions the object cross-attention map into $k$ clusters at inference time based on attention scores, defines an ideal distribution in which each cluster is spatially well-separated, and optimizes the latent to align with this target distribution. Our method achieves an average improvement of 18.5\%p in object count accuracy compared to existing methods, and demonstrates superior quantity control performance across a variety of prompts. 
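The partitioning step described above can be illustrated with a small sketch. The abstract does not specify the clustering algorithm, so this example swaps in an attention-weighted k-means over the coordinates of highly activated pixels, with a deterministic farthest-point initialization. The `keep_ratio` threshold, the function name, and the toy map are all assumptions, and the later steps (defining the ideal distribution and optimizing the latent) are not modeled.

```python
import numpy as np

def cluster_attention(attn, k, keep_ratio=0.5, iters=20):
    """Weighted k-means over coordinates of highly activated pixels.

    keep_ratio is a hypothetical threshold relative to the map maximum.
    """
    attn = np.asarray(attn, dtype=float)
    ys, xs = np.nonzero(attn >= keep_ratio * attn.max())
    pts = np.stack([ys, xs], axis=1).astype(float)
    w = attn[ys, xs]
    # Farthest-point initialisation, seeded at the strongest pixel.
    centers = [pts[np.argmax(w)]]
    for _ in range(k - 1):
        d = np.min([np.sum((pts - c) ** 2, axis=1) for c in centers], axis=0)
        centers.append(pts[np.argmax(d)])
    centers = np.array(centers)
    for _ in range(iters):
        # Assign each pixel to its nearest centre, then recompute the
        # centres as attention-weighted means.
        dist = ((pts[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dist.argmin(axis=1)
        for j in range(k):
            mask = labels == j
            if mask.any():
                centers[j] = np.average(pts[mask], axis=0, weights=w[mask])
    return centers, labels, pts

# Toy 8x8 map with two separated activation blobs (prompt asks for 2 objects).
attn = np.zeros((8, 8))
attn[1, 1] = attn[1, 2] = 1.0
attn[6, 6] = attn[6, 5] = 0.9
centers, labels, pts = cluster_attention(attn, k=2)
print(centers, labels)
```

In a guidance setting, the resulting cluster centers could serve as anchors for a target distribution in which each cluster is spatially well separated, which is the role the abstract assigns to the ideal distribution.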
Code will be released at: https://github.com/JoohyeonL22/CountCluster.
[149] NextStep-1: Toward Autoregressive Image Generation with Continuous Tokens at Scale
NextStep Team,Chunrui Han,Guopeng Li,Jingwei Wu,Quan Sun,Yan Cai,Yuang Peng,Zheng Ge,Deyu Zhou,Haomiao Tang,Hongyu Zhou,Kenkun Liu,Ailin Huang,Bin Wang,Changxin Miao,Deshan Sun,En Yu,Fukun Yin,Gang Yu,Hao Nie,Haoran Lv,Hanpeng Hu,Jia Wang,Jian Zhou,Jianjian Sun,Kaijun Tan,Kang An,Kangheng Lin,Liang Zhao,Mei Chen,Peng Xing,Rui Wang,Shiyu Liu,Shutao Xia,Tianhao You,Wei Ji,Xianfang Zeng,Xin Han,Xuelin Zhang,Yana Wei,Yanming Xu,Yimin Jiang,Yingming Wang,Yu Zhou,Yucheng Han,Ziyang Meng,Binxing Jiao,Daxin Jiang,Xiangyu Zhang,Yibo Zhu
Main category: cs.CV
TL;DR: NextStep-1 is a 14B autoregressive model that advances text-to-image generation by achieving state-of-the-art performance and excelling in high-fidelity image synthesis and image editing.
Details
Motivation: The authors aim to advance the autoregressive paradigm in text-to-image generation by addressing the limitations of existing methods, such as the computational intensity of diffusion models and the quantization loss in vector quantization approaches. Method: NextStep-1 pairs a 14B autoregressive model with a 157M flow matching head, training on discrete text tokens and continuous image tokens using next-token prediction objectives. Result: NextStep-1 achieves state-of-the-art performance for autoregressive models in text-to-image generation tasks, with strong results in high-fidelity image synthesis and image editing. Conclusion: NextStep-1 is a 14B autoregressive model that achieves state-of-the-art performance in text-to-image generation tasks, showing strong capabilities in high-fidelity image synthesis and image editing. The model will be released to facilitate open research. Abstract: Prevailing autoregressive (AR) models for text-to-image generation either rely on heavy, computationally-intensive diffusion models to process continuous image tokens, or employ vector quantization (VQ) to obtain discrete tokens with quantization loss. In this paper, we push the autoregressive paradigm forward with NextStep-1, a 14B autoregressive model paired with a 157M flow matching head, training on discrete text tokens and continuous image tokens with next-token prediction objectives. NextStep-1 achieves state-of-the-art performance for autoregressive models in text-to-image generation tasks, exhibiting strong capabilities in high-fidelity image synthesis. Furthermore, our method shows strong performance in image editing, highlighting the power and versatility of our unified approach. To facilitate open research, we will release our code and models to the community.
[150] Lightweight CNNs for Embedded SAR Ship Target Detection and Classification
Fabian Kresse,Georgios Pilikos,Mario Azcueta,Nicolas Floury
Main category: cs.CV
TL;DR: This study proposes neural networks for on-board processing of SAR data, demonstrating real-time inference feasibility and target classification capability while addressing bandwidth and latency challenges.
Details
Motivation: The motivation is to overcome the limitations of near-real-time SAR data monitoring, such as bandwidth constraints and latency, by implementing on-board processing to reduce data volume. Method: The study evaluates neural networks designed for real-time inference on unfocused SAR data acquired in Stripmap and Interferometric Wide (IW) modes using Sentinel-1 data. Result: The results demonstrate the feasibility of deploying the models on an FPGA for on-board processing and show that binary classification between ships and windmills is achievable. Conclusion: This work concludes that the proposed neural networks can be used for on-board processing of SAR data, enabling real-time inference and classification of targets such as ships and windmills. Abstract: Synthetic Aperture Radar (SAR) data enables large-scale surveillance of maritime vessels. However, near-real-time monitoring is currently constrained by the need to downlink all raw data, perform image focusing, and subsequently analyze it on the ground. On-board processing to generate higher-level products could reduce the data volume that needs to be downlinked, alleviating bandwidth constraints and minimizing latency. However, traditional image focusing and processing algorithms face challenges due to the satellite's limited memory, processing power, and computational resources. This work proposes and evaluates neural networks designed for real-time inference on unfocused SAR data acquired in Stripmap and Interferometric Wide (IW) modes captured with Sentinel-1. Our results demonstrate the feasibility of using one of our models for on-board processing and deployment on an FPGA. Additionally, by investigating a binary classification task between ships and windmills, we demonstrate that target classification is possible.
[151] Revisiting Cross-View Localization from Image Matching
Panwang Xia,Qiong Wu,Lei Yu,Yi Liu,Mingtao Xiong,Lei Liang,Yongjun Zhang,Yi Wan
Main category: cs.CV
TL;DR: This paper proposes a novel framework for cross-view localization that improves image matching and pose estimation by introducing a Surface Model and SimRefiner module, outperforming existing methods under extreme viewpoint differences.
Details
Motivation: The motivation stems from the limitations of existing cross-view localization methods, which struggle with establishing strict cross-view correspondences, leading to coarse or geometrically inconsistent matches between ground and aerial views. Method: The paper proposes a novel framework that introduces a Surface Model for accurate BEV projection and a SimRefiner module for refining the similarity matrix through local-global residual correction, eliminating the need for post-processing techniques like RANSAC. Result: The experiments show that the proposed approach enhances both localization accuracy and image matching quality. Additionally, the introduction of the CVFM benchmark with 32,509 annotated cross-view image pairs supports further research in this area. Conclusion: The paper concludes that the proposed framework, which includes a Surface Model and SimRefiner module, significantly improves cross-view image matching and localization accuracy, especially under extreme viewpoint disparity. Abstract: Cross-view localization aims to estimate the 3 degrees of freedom pose of a ground-view image by registering it to aerial or satellite imagery. It is essential in GNSS-denied environments such as urban canyons and disaster zones. Existing methods either regress poses directly or align features in a shared bird's-eye view (BEV) space, both built upon accurate spatial correspondences between perspectives. However, these methods fail to establish strict cross-view correspondences, yielding only coarse or geometrically inconsistent matches. Consequently, fine-grained image matching between ground and aerial views remains an unsolved problem, which in turn constrains the interpretability of localization results. In this paper, we revisit cross-view localization from the perspective of cross-view image matching and propose a novel framework that improves both matching and localization. 
Specifically, we introduce a Surface Model to model visible regions for accurate BEV projection, and a SimRefiner module to refine the similarity matrix through local-global residual correction, eliminating the reliance on post-processing like RANSAC. To further support research in this area, we introduce CVFM, the first benchmark with 32,509 cross-view image pairs annotated with pixel-level correspondences. Extensive experiments demonstrate that our approach substantially improves both localization accuracy and image matching quality, setting new baselines under extreme viewpoint disparity.
[152] Exploiting Discriminative Codebook Prior for Autoregressive Image Generation
Longxiang Tang,Ruihang Chu,Xiang Wang,Yujin Han,Pingyu Wu,Chunming He,Yingya Zhang,Shiwei Zhang,Jiaya Jia
Main category: cs.CV
TL;DR: This paper proposes DCPE as a better alternative to k-means clustering for extracting token similarity information in autoregressive image generation, leading to faster training and improved model performance.
Details
Motivation: k-means clustering is suboptimal for token feature spaces due to issues like token space disparity and centroid distance inaccuracy, limiting the effectiveness of autoregressive image generation models. Method: The Discriminative Codebook Prior Extractor (DCPE) uses an instance-based distance instead of centroid-based distance and applies agglomerative merging to address token space disparity. This approach better captures token similarity and improves generative model performance. Result: DCPE improves training speed by 42% on LlamaGen-B and enhances FID and IS performance. It also successfully addresses token space disparity by aggregating low-density regions and preserving high-density ones. Conclusion: DCPE effectively addresses the limitations of k-means clustering in the codebook feature space for autoregressive image generation, offering a more accurate and efficient method for utilizing token similarity information. Abstract: Advanced discrete token-based autoregressive image generation systems first tokenize images into sequences of token indices with a codebook, and then model these sequences in an autoregressive paradigm. While autoregressive generative models are trained only on index values, the prior encoded in the codebook, which contains rich token similarity information, is not exploited. Recent studies have attempted to incorporate this prior by performing naive k-means clustering on the tokens, helping to facilitate the training of generative models with a reduced codebook. However, we reveal that k-means clustering performs poorly in the codebook feature space due to inherent issues, including token space disparity and centroid distance inaccuracy. In this work, we propose the Discriminative Codebook Prior Extractor (DCPE) as an alternative to k-means clustering for more effectively mining and utilizing the token similarity information embedded in the codebook. 
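To make the contrast with centroid-based k-means concrete, the sketch below performs agglomerative merging with an instance-based cluster distance. The source does not define DCPE's exact distance, so this example swaps in single linkage (the minimum distance between any two member tokens) as one possible instance-based choice. The function name and toy codebook are illustrative assumptions, not the authors' code.

```python
import numpy as np

def agglomerative_merge(tokens, n_clusters):
    """Greedily merge the two clusters whose closest members are nearest
    (single linkage, used here as an assumed instance-based distance)."""
    x = np.asarray(tokens, dtype=float)
    # Pairwise Euclidean distances between individual tokens.
    pair = np.linalg.norm(x[:, None, :] - x[None, :, :], axis=2)
    clusters = [[i] for i in range(len(x))]
    while len(clusters) > n_clusters:
        best = (np.inf, 0, 1)
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Cluster distance is taken over member instances,
                # never over cluster centroids.
                d = pair[np.ix_(clusters[a], clusters[b])].min()
                if d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return [sorted(c) for c in clusters]

# Toy codebook: a dense group of three tokens and a distant pair.
codebook = np.array([[0.0, 0.0], [0.1, 0.0], [0.2, 0.0],
                     [5.0, 5.0], [5.1, 5.0]])
groups = agglomerative_merge(codebook, n_clusters=2)
print(groups)
```

Because merging is bottom-up, nearby tokens are grouped before distant ones, which is one way to avoid splitting a tight region the way a fixed number of centroids can. The nested loop is quadratic per merge and is meant for illustration on small codebooks only.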
DCPE replaces the commonly used centroid-based distance, which is found to be unsuitable and inaccurate for the token feature space, with a more reasonable instance-based distance. Using an agglomerative merging technique, it further addresses the token space disparity issue by avoiding splitting high-density regions and aggregating low-density ones. Extensive experiments demonstrate that DCPE is plug-and-play and integrates seamlessly with existing codebook prior-based paradigms. With the discriminative prior extracted, DCPE accelerates the training of autoregressive models by 42% on LlamaGen-B and improves final FID and IS performance.
[153] EgoCross: Benchmarking Multimodal Large Language Models for Cross-Domain Egocentric Video Question Answering
Yanjun Li,Yuqian Fu,Tianwen Qian,Qi'ao Xu,Silong Dai,Danda Pani Paudel,Luc Van Gool,Xiaoling Wang
Main category: cs.CV
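TL;DR: This paper introduces EgoCross, a new benchmark for evaluating the generalization of MLLMs in cross-domain EgocentricQA, reveals the limitations of current models on domains beyond daily life, and explores potential improvements.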
Details
Motivation: Existing EgocentricQA research focuses mainly on daily activities, whereas real-world deployment faces domain shift, so the generalization of MLLMs in cross-domain scenarios needs to be evaluated. Method: The authors build a new benchmark, EgoCross, covering four domains (surgery, industry, extreme sports, and animal perspective) with approximately 1,000 QA pairs across 798 video clips, and conduct extensive experiments and pilot studies. Result: Experiments show that most existing MLLMs perform poorly on domains beyond daily life, highlighting the limitations of current models. Conclusion: EgoCross highlights the shortcomings of current MLLMs in cross-domain generalization and provides a challenging benchmark to advance domain-adaptive, robust video understanding. Abstract: Recent advances in Multimodal Large Language Models (MLLMs) have significantly pushed the frontier of egocentric video question answering (EgocentricQA). However, existing benchmarks and studies are mainly limited to common daily activities such as cooking and cleaning. In contrast, real-world deployment inevitably encounters domain shifts, where target domains differ substantially in both visual style and semantic content. To bridge this gap, we introduce EgoCross, a comprehensive benchmark designed to evaluate the cross-domain generalization of MLLMs in EgocentricQA. EgoCross covers four diverse and challenging domains, including surgery, industry, extreme sports, and animal perspective, representing realistic and high-impact application scenarios. It comprises approximately 1,000 QA pairs across 798 video clips, spanning four key QA tasks: prediction, recognition, localization, and counting. Each QA pair provides both OpenQA and CloseQA formats to support fine-grained evaluation. Extensive experiments show that most existing MLLMs, whether general-purpose or egocentric-specialized, struggle to generalize to domains beyond daily life, highlighting the limitations of current models. Furthermore, we conduct several pilot studies, e.g., fine-tuning and reinforcement learning, to explore potential improvements. We hope EgoCross and our accompanying analysis will serve as a foundation for advancing domain-adaptive, robust egocentric video understanding. 
Data and code will be released at: https://github.com/MyUniverse0726/EgoCross.
[154] Dissecting Generalized Category Discovery: Multiplex Consensus under Self-Deconstruction
Luyao Tang,Kunze Huang,Chaoqi Chen,Yuxuan Yuan,Chenxin Li,Xiaotong Tu,Xinghao Ding,Yue Huang
Main category: cs.CV
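TL;DR: Drawing on human cognitive processes, ConGCD proposes a new generalized category discovery method that decomposes objects into visual primitives and establishes cross-knowledge comparisons.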
Details
Motivation: Human perceptual systems outperform current machine learning frameworks at recognizing objects from known and novel categories, while existing generalized category discovery methods focus mainly on optimizing objective functions and do not draw on human-like cognitive processes. Method: The paper proposes ConGCD, which builds primitive-oriented representations through high-level semantic reconstruction, uses dominant and contextual consensus units to capture class-discriminative patterns and distributional invariants, respectively, and employs a consensus scheduler to dynamically optimize activation pathways before multiplex consensus integration. Result: Extensive evaluations on multiple coarse- and fine-grained benchmarks validate the effectiveness of ConGCD as a consensus-aware paradigm. Conclusion: By combining dominant and contextual consensus units with a consensus scheduler, ConGCD offers a novel, human-like approach to generalized category discovery that is effective on both coarse- and fine-grained benchmarks. Abstract: Human perceptual systems excel at inducing and recognizing objects across both known and novel categories, a capability far beyond current machine learning frameworks. While generalized category discovery (GCD) aims to bridge this gap, existing methods predominantly focus on optimizing objective functions. We present an orthogonal solution, inspired by the human cognitive process for novel object understanding: decomposing objects into visual primitives and establishing cross-knowledge comparisons. We propose ConGCD, which establishes primitive-oriented representations through high-level semantic reconstruction, binding intra-class shared attributes via deconstruction. Mirroring human preference diversity in visual processing, where distinct individuals leverage dominant or contextual cues, we implement dominant and contextual consensus units to capture class-discriminative patterns and inherent distributional invariants, respectively. A consensus scheduler dynamically optimizes activation pathways, with final predictions emerging through multiplex consensus integration. Extensive evaluations across coarse- and fine-grained benchmarks demonstrate ConGCD's effectiveness as a consensus-aware paradigm. Code is available at github.com/lytang63/ConGCD.
[155] Privacy-enhancing Sclera Segmentation Benchmarking Competition: SSBC 2025
Matej Vitek,Darian Tomašević,Abhijit Das,Sabari Nathan,Gökhan Özbulak,Gözde Ayşe Tataroğlu Özbulak,Jean-Paul Calbimonte,André Anjos,Hariohm Hemant Bhatt,Dhruv Dhirendra Premani,Jay Chaudhari,Caiyong Wang,Jian Jiang,Chi Zhang,Qi Zhang,Iyyakutti Iyappan Ganapathi,Syed Sadaf Ali,Divya Velayudan,Maregu Assefa,Naoufel Werghi,Zachary A. Daniels,Leeon John,Ritesh Vyas,Jalil Nourmohammadi Khiarak,Taher Akbari Saeed,Mahsa Nasehi,Ali Kianfar,Mobina Pashazadeh Panahi,Geetanjali Sharma,Pushp Raj Panth,Raghavendra Ramachandra,Aditya Nigam,Umapada Pal,Peter Peer,Vitomir Štruc
Main category: cs.CV
TL;DR: The 2025 Sclera Segmentation Benchmarking Competition showed that models trained on synthetic ocular images can achieve high performance in sclera segmentation, highlighting the potential of synthetic data for privacy-preserving biometric development.
Details
Motivation: To assess the effectiveness of models trained on synthetic ocular images for sclera segmentation in comparison to those trained on real-world data, promoting privacy-preserving biometric development. Method: Nine research groups developed diverse sclera-segmentation models using synthetic and/or limited real-world ocular image data. Models included transformer-based, lightweight, and generative framework-guided architectures. Performance was evaluated across three datasets with both synthetic and real-world images. Result: Top-performing models trained solely on synthetic data achieved F1 scores over 0.8. In the mixed data track, performance gains were more influenced by methodological choices than the inclusion of real data. Conclusion: The SSBC 2025 competition demonstrated that models trained solely on synthetic data can achieve competitive performance in sclera segmentation, especially with dedicated training strategies. Methodological choices significantly impact performance, underscoring the potential of synthetic data in privacy-aware biometric development. Abstract: This paper presents a summary of the 2025 Sclera Segmentation Benchmarking Competition (SSBC), which focused on the development of privacy-preserving sclera-segmentation models trained using synthetically generated ocular images. The goal of the competition was to evaluate how well models trained on synthetic data perform in comparison to those trained on real-world datasets. The competition featured two tracks: $(i)$ one relying solely on synthetic data for model development, and $(ii)$ one combining/mixing synthetic with (a limited amount of) real-world data. A total of nine research groups submitted diverse segmentation models, employing a variety of architectural designs, including transformer-based solutions, lightweight models, and segmentation networks guided by generative frameworks. 
Experiments were conducted across three evaluation datasets containing both synthetic and real-world images, collected under diverse conditions. Results show that models trained entirely on synthetic data can achieve competitive performance, particularly when dedicated training strategies are employed, as evidenced by the top-performing models that achieved $F_1$ scores of over $0.8$ in the synthetic data track. Moreover, performance gains in the mixed track were often driven more by methodological choices than by the inclusion of real data, highlighting the promise of synthetic data for privacy-aware biometric development. The code and data for the competition are available at: https://github.com/dariant/SSBC_2025.
[156] Axis-level Symmetry Detection with Group-Equivariant Representation
Wongyun Yu,Ahyun Seo,Minsu Cho
Main category: cs.CV
TL;DR: This paper introduces a novel framework for precise axis-level detection of reflection and rotational symmetry using geometric primitives and group-equivariant features, achieving state-of-the-art results.
Details
Motivation: Detecting symmetry in complex scenes is a significant challenge in computer vision, and existing heatmap-based approaches lack precision in identifying individual symmetry axes. Method: The method uses a dual-branch architecture equivariant to the dihedral group, with specialized branches for reflection and rotational symmetry. It introduces orientational anchors and reflectional/rotational matching techniques for precise symmetry detection. Result: Extensive experiments demonstrate that the proposed method excels in axis-level detection of reflection and rotational symmetry, achieving superior performance compared to existing methods. Conclusion: The proposed method achieves state-of-the-art performance in detecting reflection and rotation symmetry axes, outperforming existing approaches. Abstract: Symmetry is a fundamental concept that has been extensively studied, yet detecting it in complex scenes remains a significant challenge in computer vision. Recent heatmap-based approaches can localize potential regions of symmetry axes but often lack precision in identifying individual axes. In this work, we propose a novel framework for axis-level detection of the two most common symmetry types, reflection and rotation, by representing them as explicit geometric primitives, i.e., lines and points. Our method employs a dual-branch architecture that is equivariant to the dihedral group, with each branch specialized to exploit the structure of dihedral group-equivariant features for its respective symmetry type. For reflection symmetry, we introduce orientational anchors, aligned with group components, to enable orientation-specific detection, and a reflectional matching that measures similarity between patterns and their mirrored counterparts across candidate axes. For rotational symmetry, we propose a rotational matching that compares patterns at fixed angular intervals to identify rotational centers.
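The matching idea in the abstract above can be illustrated with a toy NumPy sketch. This is not the authors' implementation, which works on dihedral group-equivariant features; here the comparison is done on raw pixel patches, only 90-degree rotation intervals are supported, and the function names `cosine`, `reflection_score`, and `rotation_score` are hypothetical.

```python
import numpy as np

def cosine(a, b):
    """Mean-centered cosine similarity between two equally shaped arrays."""
    a = a.ravel() - a.mean()
    b = b.ravel() - b.mean()
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom > 0 else 0.0

def reflection_score(patch):
    """Similarity between a patch and its mirror image across the vertical axis."""
    return cosine(patch, np.fliplr(patch))

def rotation_score(patch, folds=4):
    """Mean similarity between a square patch and copies rotated at fixed intervals.

    Assumption for this sketch: only 2-fold (180 degrees) and 4-fold (90 degrees)
    intervals are handled, because np.rot90 rotates exactly without interpolation.
    """
    if folds not in (2, 4):
        raise ValueError("this sketch supports folds of 2 or 4 only")
    step = 4 // folds  # number of quarter turns per angular interval
    scores = [cosine(patch, np.rot90(patch, k * step)) for k in range(1, folds)]
    return float(np.mean(scores))
```

A plus-shaped patch scores close to 1 under both functions, while a patch that is bright only on its left side gets a negative reflection score. A detector would evaluate such scores around many candidate axes or centers and keep the strongest responses.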
Extensive experiments demonstrate that our method achieves state-of-the-art performance, outperforming existing approaches.
[157] Forgery Guided Learning Strategy with Dual Perception Network for Deepfake Cross-domain Detection
Lixin Jia,Zhiqing Guo,Gaobo Yang,Liejun Wang,Keqin Li
Main category: cs.CV
TL;DR: This paper proposes a novel Forgery Guided Learning strategy and Dual Perception Network to improve deepfake detection, enabling dynamic adaptation to unknown forgery techniques and robust performance across diverse scenarios.
Details
Motivation: Deepfake technology has introduced societal problems, and current detection methods perform poorly on datasets with unknown forgery techniques. Traditional cross-domain detection methods relying on common forgery traces are increasingly ineffective as the gap between emerging and traditional techniques widens, highlighting the urgency of developing detection technology with strong generalization. Method: The FGL strategy captures differential information between known and unknown forgery techniques, enabling dynamic adaptation. The DPNet captures differences and relationships among forgery traces through a frequency stream and graph convolution, integrating spatial features into an embedding space for comprehensive analysis. Result: Extensive experiments demonstrate that the proposed approach generalizes well across scenarios and effectively addresses challenges posed by unknown forgery techniques. Conclusion: The proposed Forgery Guided Learning (FGL) strategy and Dual Perception Network (DPNet) provide robust support for deepfake detection, effectively handling unknown forgery challenges and generalizing across different scenarios. Abstract: The emergence of deepfake technology has introduced a range of societal problems, garnering considerable attention. Current deepfake detection methods perform well on specific datasets, but exhibit poor performance when applied to datasets with unknown forgery techniques. Moreover, as the gap between emerging and traditional forgery techniques continues to widen, cross-domain detection methods that rely on common forgery traces are becoming increasingly ineffective. This situation highlights the urgency of developing deepfake detection technology with strong generalization to cope with fast iterative forgery techniques. To address these challenges, we propose a Forgery Guided Learning (FGL) strategy designed to enable detection networks to continuously adapt to unknown forgery techniques. 
Specifically, the FGL strategy captures the differential information between known and unknown forgery techniques, allowing the model to dynamically adjust its learning process in real time. To further improve the ability to perceive forgery traces, we design a Dual Perception Network (DPNet) that captures both differences and relationships among forgery traces. In the frequency stream, the network dynamically perceives and extracts discriminative features across various forgery techniques, establishing essential detection cues. These features are then integrated with spatial features and projected into the embedding space. In addition, graph convolution is employed to perceive relationships across the entire feature space, facilitating a more comprehensive understanding of forgery trace correlations. Extensive experiments show that our approach generalizes well across different scenarios and effectively handles unknown forgery challenges, providing robust support for deepfake detection. Our code is available at https://github.com/vpsg-research/FGL.
[158] An Efficient Model-Driven Groupwise Approach for Atlas Construction
Ziwei Zou,Bei Zou,Xiaoyan Kui,Wenqi Lu,Haoran Dou,Arezoo Zakeri,Timothy Cootes,Alejandro F Frangi,Jinming Duan
Main category: cs.CV
TL;DR: This paper introduces DARC, a model-driven groupwise registration framework for atlas construction that supports a broad range of image dissimilarity metrics, handles large 3D datasets efficiently, and produces unbiased, high-fidelity atlases. It also demonstrates the framework's applications in one-shot segmentation and shape synthesis.
Details
Motivation: Data-driven registration methods have recently shown promise in pairwise settings but their reliance on large training datasets, limited generalizability, and lack of true inference phases in groupwise contexts hinder their practical use. Model-driven methods often face scalability and optimization challenges when applied to large 3D datasets. Method: We introduce DARC (Diffeomorphic Atlas Registration via Coordinate descent), a novel model-driven groupwise registration framework for atlas construction. DARC supports a broad range of image dissimilarity metrics and efficiently handles arbitrary numbers of 3D images without incurring GPU memory issues. Through a coordinate descent strategy and a centrality-enforcing activation function, DARC produces unbiased, diffeomorphic atlases with high anatomical fidelity. Result: We demonstrate two key applications of DARC: (1) One-shot segmentation, where labels annotated only on the atlas are propagated to subjects via inverse deformations, outperforming state-of-the-art few-shot methods; and (2) Shape synthesis, where new anatomical variants are generated by warping the atlas mesh using synthesized diffeomorphic deformation fields. Conclusion: DARC offers a flexible, generalizable, and resource-efficient framework for atlas construction and applications. Abstract: Atlas construction is fundamental to medical image analysis, offering a standardized spatial reference for tasks such as population-level anatomical modeling. While data-driven registration methods have recently shown promise in pairwise settings, their reliance on large training datasets, limited generalizability, and lack of true inference phases in groupwise contexts hinder their practical use. In contrast, model-driven methods offer training-free, theoretically grounded, and data-efficient alternatives, though they often face scalability and optimization challenges when applied to large 3D datasets. 
In this work, we introduce DARC (Diffeomorphic Atlas Registration via Coordinate descent), a novel model-driven groupwise registration framework for atlas construction. DARC supports a broad range of image dissimilarity metrics and efficiently handles arbitrary numbers of 3D images without incurring GPU memory issues. Through a coordinate descent strategy and a centrality-enforcing activation function, DARC produces unbiased, diffeomorphic atlases with high anatomical fidelity. Beyond atlas construction, we demonstrate two key applications: (1) one-shot segmentation, where labels annotated only on the atlas are propagated to subjects via inverse deformations, outperforming state-of-the-art few-shot methods; and (2) shape synthesis, where new anatomical variants are generated by warping the atlas mesh using synthesized diffeomorphic deformation fields. Overall, DARC offers a flexible, generalizable, and resource-efficient framework for atlas construction and applications.
[159] From Diagnosis to Improvement: Probing Spatio-Physical Reasoning in Vision Language Models
Tiancheng Han,Yunfei Gao,Yong Li,Wuzhou Yu,Qiaosheng Zhang,Wenqi Shao
Main category: cs.CV
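TL;DR: This paper analyzes how mainstream vision language models perform on spatio-physical reasoning, finds them lacking, and substantially improves Qwen2.5-VL-7B through targeted training, although new methods are still needed to improve generalization to new physics scenarios.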
Details
Motivation: Spatio-physical reasoning is a foundational capability for understanding the real physical world and a key step towards building robust world models. Although recent vision language models have made remarkable progress in areas such as multimodal mathematics and pure spatial understanding, their spatio-physical reasoning ability remains largely unexplored. Method: The paper conducts a comprehensive diagnostic analysis of mainstream vision language models and optimizes Qwen2.5-VL-7B with supervised fine-tuning and rule-based reinforcement learning. Result: Current models perform poorly on this key task, mainly because of biases from human-like priors and a lack of deep reasoning. Supervised fine-tuning combined with rule-based reinforcement learning significantly improves the spatio-physical reasoning of Qwen2.5-VL-7B, which surpasses leading proprietary models. Conclusion: Despite significant progress in improving the model's spatio-physical reasoning, its generalization to new physics scenarios remains limited, underscoring the pressing need for new approaches to spatio-physical reasoning. Abstract: Spatio-physical reasoning, a foundational capability for understanding the real physical world, is a critical step towards building robust world models. While recent vision language models (VLMs) have shown remarkable progress in specialized domains like multimodal mathematics and pure spatial understanding, their capability for spatio-physical reasoning remains largely unexplored. This paper provides a comprehensive diagnostic analysis of mainstream VLMs, revealing that current models perform inadequately on this crucial task. Further detailed analysis shows that this underperformance is largely attributable to biases caused by human-like priors and a lack of deep reasoning. To address these challenges, we apply supervised fine-tuning followed by rule-based reinforcement learning to Qwen2.5-VL-7B, resulting in significant improvements in spatio-physical reasoning capabilities and surpassing leading proprietary models. Nevertheless, despite this success, the model's generalization to new physics scenarios remains limited, underscoring the pressing need for new approaches in spatio-physical reasoning.
[160] AEGIS: Authenticity Evaluation Benchmark for AI-Generated Video Sequences
Jieyu Li,Xin Zhang,Joey Tianyi Zhou
Main category: cs.CV
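TL;DR: AEGIS is a large-scale benchmark of over 10,000 real and synthetic videos for evaluating how well vision-language models detect hyper-realistic, semantically nuanced AI-generated video.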
Details
Motivation: Highly realistic AI-generated videos pose severe risks to societal trust and digital integrity, yet existing video authenticity benchmarks suffer from limited realism, insufficient scale, and inadequate complexity. Method: The authors build AEGIS, a benchmark of over 10,000 curated real and synthetic videos generated by models including Stable Video Diffusion, CogVideoX-5B, KLing, and Sora, with challenging subsets for robustness evaluation and multimodal annotations covering Semantic-Authenticity Descriptions, Motion Features, and Low-level Visual Features. Result: Experiments with advanced vision-language models show limited detection capability on the most challenging subsets of AEGIS. Conclusion: AEGIS provides an evaluation benchmark for developing robust, generalizable video authenticity detection methods. Abstract: Recent advances in AI-generated content have fueled the rise of highly realistic synthetic videos, posing severe risks to societal trust and digital integrity. Existing benchmarks for video authenticity detection typically suffer from limited realism, insufficient scale, and inadequate complexity, failing to effectively evaluate modern vision-language models against sophisticated forgeries. To address this critical gap, we introduce AEGIS, a novel large-scale benchmark explicitly targeting the detection of hyper-realistic and semantically nuanced AI-generated videos. AEGIS comprises over 10,000 rigorously curated real and synthetic videos generated by diverse, state-of-the-art generative models, including Stable Video Diffusion, CogVideoX-5B, KLing, and Sora, encompassing open-source and proprietary architectures. In particular, AEGIS features specially constructed challenging subsets enhanced with robustness evaluation. Furthermore, we provide multimodal annotations spanning Semantic-Authenticity Descriptions, Motion Features, and Low-level Visual Features, facilitating authenticity detection and supporting downstream tasks such as multimodal fusion and forgery localization. Extensive experiments using advanced vision-language models demonstrate limited detection capabilities on the most challenging subsets of AEGIS, highlighting the dataset's unique complexity and realism beyond the current generalization capabilities of existing models. In essence, AEGIS establishes an indispensable evaluation benchmark, fundamentally advancing research toward developing genuinely robust, reliable, broadly generalizable video authenticity detection methodologies capable of addressing real-world forgery threats. Our dataset is available at https://huggingface.co/datasets/Clarifiedfish/AEGIS.
[161] Video-BLADE: Block-Sparse Attention Meets Step Distillation for Efficient Video Generation
Youping Gu,Xiaolong Li,Yuhao Hu,Bohan Zhuang
Main category: cs.CV
TL;DR: BLADE is a data-free joint training framework that combines adaptive sparse attention and sparsity-aware step distillation to significantly accelerate diffusion transformers in video generation while improving output quality.
Details
Motivation: Diffusion transformers face inference bottlenecks due to slow iterative denoising and high computational costs from quadratic attention for long sequences. Existing strategies like step distillation and sparse attention have shown promise but face integration challenges. Method: BLADE introduces an Adaptive Block-Sparse Attention (ASA) mechanism and a sparsity-aware step distillation paradigm based on Trajectory Distribution Matching (TDM) to jointly optimize sparsity and distillation without requiring additional high-quality video data. Result: BLADE achieves a 14.10x end-to-end inference speedup on Wan2.1-1.3B and an 8.89x speedup on CogVideoX-5B, with improved video generation quality as measured by VBench-2.0 and human evaluations. Conclusion: The proposed BLADE framework effectively integrates adaptive sparse attention and sparsity-aware step distillation to overcome the inference bottlenecks in diffusion transformers for video generation, achieving significant acceleration and quality improvement. Abstract: Diffusion transformers currently lead the field in high-quality video generation, but their slow iterative denoising process and prohibitive quadratic attention costs for long sequences create significant inference bottlenecks. While both step distillation and sparse attention mechanisms have shown promise as independent acceleration strategies, effectively combining these approaches presents critical challenges -- training-free integration yields suboptimal results, while separately training sparse attention after step distillation requires prohibitively expensive high-quality video data. 
To overcome these limitations, we propose BLADE, an innovative data-free joint training framework that introduces: (1) an Adaptive Block-Sparse Attention (ASA) mechanism for dynamically generating content-aware sparsity masks to focus computation on salient spatiotemporal features, and (2) a sparsity-aware step distillation paradigm built upon Trajectory Distribution Matching (TDM) that directly incorporates sparsity into the distillation process rather than treating it as a separate compression step, with fast convergence. We validate BLADE on text-to-video models like CogVideoX-5B and Wan2.1-1.3B. Our framework demonstrates remarkable efficiency gains across different scales. On Wan2.1-1.3B, BLADE achieves a 14.10x end-to-end inference acceleration over a 50-step baseline. Moreover, on models such as CogVideoX-5B with short video sequence lengths, our framework delivers a robust 8.89x speedup. Crucially, the acceleration is accompanied by a consistent quality improvement. On the VBench-2.0 benchmark, BLADE boosts the score of CogVideoX-5B to 0.569 (from 0.534) and Wan2.1-1.3B to 0.570 (from 0.563), results that are further corroborated by superior ratings in human evaluations. Our code and model weights are publicly available at: http://ziplab.co/BLADE-Homepage/.
[162] Ultra-High-Definition Reference-Based Landmark Image Super-Resolution with Generative Diffusion Prior
Zhenning Shi,Zizheng Yan,Yuhang Yu,Clara Xue,Jingyu Zhuang,Qi Zhang,Jinwei Chen,Tao Li,Qingnan Fan
Main category: cs.CV
TL;DR: This paper introduces TriFlowSR, a diffusion-based framework for reference image super-resolution that improves alignment between low-resolution and reference high-resolution images, along with a new UHD landmark dataset called Landmark-4K.
Details
Motivation: Existing diffusion-based RefSR methods struggle with aligning LR and reference HR images, and current datasets lack resolution and quality to support high-quality restoration. Method: The paper designs a Reference Matching Strategy to explicitly achieve pattern matching between LR and reference HR images, based on a diffusion model without relying on ControlNet. Result: TriFlowSR outperforms existing methods in utilizing semantic and texture information from the reference HR image for high-quality UHD restoration. Conclusion: The paper proposes TriFlowSR, a novel diffusion-based RefSR framework for Ultra-High-Definition (UHD) landmark scenarios, and introduces Landmark-4K, the first UHD RefSR dataset. Abstract: Reference-based Image Super-Resolution (RefSR) aims to restore a low-resolution (LR) image by utilizing the semantic and texture information from an additional reference high-resolution (reference HR) image. Existing diffusion-based RefSR methods are typically built upon ControlNet, which struggles to effectively align the information between the LR image and the reference HR image. Moreover, current RefSR datasets suffer from limited resolution and poor image quality, resulting in the reference images lacking sufficient fine-grained details to support high-quality restoration. To overcome the limitations above, we propose TriFlowSR, a novel framework that explicitly achieves pattern matching between the LR image and the reference HR image. Meanwhile, we introduce Landmark-4K, the first RefSR dataset for Ultra-High-Definition (UHD) landmark scenarios. Considering the UHD scenarios with real-world degradation, in TriFlowSR, we design a Reference Matching Strategy to effectively match the LR image with the reference HR image. Experimental results show that our approach can better utilize the semantic and texture information of the reference HR image compared to previous methods. 
To the best of our knowledge, we propose the first diffusion-based RefSR pipeline for ultra-high definition landmark scenarios under real-world degradation. Our code and model will be available at https://github.com/nkicsl/TriFlowSR.
[163] Cooperative Face Liveness Detection from Optical Flow
Artem Sokolov,Mikhail Nikitin,Anton Konushin
Main category: cs.CV
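TL;DR: The paper proposes a cooperative face liveness detection method that combines a user-specific motion pattern with optical flow analysis to improve detection accuracy and robustness.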
Details
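Motivation: To make face liveness detection more robust against different types of presentation attacks, including printed photos, screen displays, masks, and video replays. Method: The method defines a controlled approaching-face protocol in which the user slowly moves their frontal-oriented face towards the camera, and combines it with optical flow analysis: neural optical flow estimates and RGB frames are processed by a neural classifier for liveness detection. Result: By extracting facial volume information, the method significantly improves discrimination between genuine faces and presentation attacks and is more reliable than passive methods. Conclusion: The paper proposes a cooperative video-based face liveness detection method built on a new user interaction scenario; combining a user-specific motion pattern with optical flow analysis improves the ability to distinguish genuine faces from presentation attacks. Abstract: In this work, we propose a novel cooperative video-based face liveness detection method based on a new user interaction scenario where participants are instructed to slowly move their frontal-oriented face closer to the camera. This controlled approaching face protocol, combined with optical flow analysis, represents the core innovation of our approach. By designing a system where users follow this specific movement pattern, we enable robust extraction of facial volume information through neural optical flow estimation, significantly improving discrimination between genuine faces and various presentation attacks (including printed photos, screen displays, masks, and video replays). Our method processes both the predicted optical flows and RGB frames through a neural classifier, effectively leveraging spatial-temporal features for more reliable liveness detection compared to passive methods.
[164] VasoMIM: Vascular Anatomy-Aware Masked Image Modeling for Vessel Segmentation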
De-Xing Huang,Xiao-Hu Zhou,Mei-Jiang Gui,Xiao-Liang Xie,Shi-Qi Liu,Shuang-Yi Wang,Tian-Yu Xiang,Rui-Ze Ma,Nu-Fang Xiao,Zeng-Guang Hou
Main category: cs.CV
TL;DR: VasoMIM improves vessel segmentation in X-ray angiograms by integrating anatomical knowledge into masked image modeling, outperforming existing methods.
Details
Motivation: The motivation is to overcome the limitations of conventional masked image modeling (MIM) in capturing vascular anatomy due to class imbalance between vessel and background pixels, by leveraging self-supervised learning with anatomical knowledge integration. Method: The study introduces VasoMIM, a novel masked image modeling framework that includes an anatomy-guided masking strategy and an anatomical consistency loss to improve vascular representation learning from X-ray angiograms. Result: VasoMIM achieved state-of-the-art performance across three datasets for vessel segmentation in X-ray angiograms, demonstrating its effectiveness in improving vascular representation learning. Conclusion: VasoMIM has demonstrated superior performance in vessel segmentation in X-ray angiograms, highlighting its potential in facilitating medical image analysis by incorporating anatomical knowledge into the pre-training process. Abstract: Accurate vessel segmentation in X-ray angiograms is crucial for numerous clinical applications. However, the scarcity of annotated data presents a significant challenge, which has driven the adoption of self-supervised learning (SSL) methods such as masked image modeling (MIM) to leverage large-scale unlabeled data for learning transferable representations. Unfortunately, conventional MIM often fails to capture vascular anatomy because of the severe class imbalance between vessel and background pixels, leading to weak vascular representations. To address this, we introduce Vascular anatomy-aware Masked Image Modeling (VasoMIM), a novel MIM framework tailored for X-ray angiograms that explicitly integrates anatomical knowledge into the pre-training process. Specifically, it comprises two complementary components: anatomy-guided masking strategy and anatomical consistency loss. The former preferentially masks vessel-containing patches to focus the model on reconstructing vessel-relevant regions. 
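To make the anatomy-guided masking idea above concrete, here is a minimal NumPy sketch. It is an illustration under stated assumptions, not the VasoMIM implementation: it assumes a binary vessel map is already available for each image, and the function name `anatomy_guided_mask` and the parameters `patch`, `mask_ratio`, and `eps` are hypothetical.

```python
import numpy as np

def anatomy_guided_mask(vessel_map, patch=16, mask_ratio=0.75, eps=1e-3, seed=0):
    """Return a boolean patch grid in which True marks a patch chosen for masking.

    vessel_map is assumed to be a 2D binary array (1 or True for vessel pixels).
    Patches are sampled without replacement, with probability proportional to
    their vessel-pixel fraction plus a small eps.
    """
    h, w = vessel_map.shape
    gh, gw = h // patch, w // patch
    # Split the (cropped) image into non-overlapping patches.
    blocks = vessel_map[:gh * patch, :gw * patch].reshape(gh, patch, gw, patch)
    # Fraction of vessel pixels inside each patch, flattened to one value per patch.
    vessel_frac = blocks.mean(axis=(1, 3)).ravel()
    # eps keeps background patches selectable, so the mask is biased, not exclusive.
    weights = vessel_frac + eps
    probs = weights / weights.sum()
    n_mask = int(round(mask_ratio * gh * gw))
    rng = np.random.default_rng(seed)
    chosen = rng.choice(gh * gw, size=n_mask, replace=False, p=probs)
    mask = np.zeros(gh * gw, dtype=bool)
    mask[chosen] = True
    return mask.reshape(gh, gw)
```

Compared with uniform random masking, this weighting makes the reconstruction target contain vessel regions more often, which is one plausible way to counter the vessel/background class imbalance the abstract describes.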
The latter enforces consistency in vascular semantics between the original and reconstructed images, thereby improving the discriminability of vascular representations. Empirically, VasoMIM achieves state-of-the-art performance across three datasets. These findings highlight its potential to facilitate X-ray angiogram analysis.
[165] Object Fidelity Diffusion for Remote Sensing Image Generation
Ziqi Ye,Shuran Ma,Jie Yang,Xiaoyi Yang,Ziyang Gong,Xue Yang,Haipeng Wang
Main category: cs.CV
TL;DR: This paper proposes OF-Diff, a method for generating high-fidelity remote sensing images.
Details
Motivation: Existing diffusion models fail to adequately capture morphological details when generating remote sensing images, which leads to low-fidelity images. Method: The paper introduces a dual-branch diffusion model with a diffusion consistency loss, combined with DDPO fine-tuning. Result: OF-Diff outperforms existing methods on key quality metrics, with mAP gains of 8.3%, 7.7%, and 4.0% for airplanes, ships, and vehicles, respectively. Conclusion: OF-Diff generates high-fidelity remote sensing images and outperforms existing methods on multiple metrics. Abstract: High-precision controllable remote sensing image generation is both meaningful and challenging. Existing diffusion models often produce low-fidelity images due to their inability to adequately capture morphological details, which may affect the robustness and reliability of object detection models. To enhance the accuracy and fidelity of generated objects in remote sensing, this paper proposes Object Fidelity Diffusion (OF-Diff), which effectively improves the fidelity of generated objects. Specifically, we are the first to extract the prior shapes of objects based on the layout for diffusion models in remote sensing. Then, we introduce a dual-branch diffusion model with diffusion consistency loss, which can generate high-fidelity remote sensing images without providing real images during the sampling phase. Furthermore, we introduce DDPO to fine-tune the diffusion process, making the generated remote sensing images more diverse and semantically consistent. Comprehensive experiments demonstrate that OF-Diff outperforms state-of-the-art methods in remote sensing across key quality metrics. Notably, the performance of several polymorphic and small object classes shows significant improvement. For instance, the mAP increases by 8.3%, 7.7%, and 4.0% for airplanes, ships, and vehicles, respectively.
[166] Mobile-Friendly Deep Learning for Plant Disease Detection: A Lightweight CNN Benchmark Across 101 Classes of 33 Crops
Anand Kumar,Harminder Pal Monga,Tapasi Brahma,Satyam Kalra,Navas Sherif
Main category: cs.CV
TL;DR: A mobile-friendly system using EfficientNet-B1 achieves 94.7% accuracy in classifying 101 plant diseases, offering a promising solution for early plant disease detection.
Details
Motivation: Plant diseases threaten global food security, so early detection systems are crucial. Computer vision offers a promising solution. Method: The researchers built a comprehensive dataset by combining Plant Doc, PlantVillage, and PlantWild, and evaluated several lightweight architectures for classification performance. Result: EfficientNet-B1 achieved the highest classification accuracy of 94.7% across the tested architectures. Conclusion: The study concludes that EfficientNet-B1 is the most suitable architecture for classifying plant diseases due to its balance of accuracy and computational efficiency. Abstract: Plant diseases are a major threat to food security globally. It is important to develop early detection systems that can detect them accurately. Advances in computer vision techniques have the potential to solve this challenge. We have developed a mobile-friendly solution which can accurately classify 101 plant diseases across 33 crops. We built a comprehensive dataset by combining three datasets that serve the same purpose: Plant Doc, PlantVillage, and PlantWild. We evaluated performance across several lightweight architectures - MobileNetV2, MobileNetV3, MobileNetV3-Large, and EfficientNet-B0, B1 - specifically chosen for their efficiency on resource-constrained devices. The results were promising, with EfficientNet-B1 delivering our best performance at 94.7% classification accuracy. This architecture struck an optimal balance between accuracy and computational efficiency, making it well-suited for real-world deployment on mobile devices.
[167] UI-Venus Technical Report: Building High-performance UI Agents with RFT
Zhangxuan Gu,Zhengwen Zeng,Zhenyu Xu,Xingran Zhou,Shuheng Shen,Yunfei Liu,Beitong Zhou,Changhua Meng,Tianyu Xia,Weizhi Chen,Yue Wen,Jingya Dou,Fei Tang,Jinzhen Lin,Yulin Liu,Zhenlin Guo,Yichen Gong,Heng Jia,Changlong Gao,Yuan Guo,Yong Deng,Zhenyu Guo,Liang Chen,Weiqiang Wang
Main category: cs.CV
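TL;DR: UI-Venus is a native UI agent built on a multimodal large language model that achieves state-of-the-art performance on UI grounding and navigation tasks through reinforcement fine-tuning and carefully designed reward functions.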
Details
Motivation: UI-Venus targets UI grounding and navigation tasks, aiming for state-of-the-art performance using only screenshots as input, and proposes new methods to improve navigation performance. Method: UI-Venus is based on Qwen2.5-VL and reaches SOTA performance through reinforcement fine-tuning (RFT) with only several hundred thousand high-quality training samples. It introduces reward functions and data cleaning strategies for UI grounding and navigation tasks, and proposes Self-Evolving Trajectory History Alignment & Sparse Action Enhancement to refine historical reasoning traces and balance the distribution of sparse but critical actions. Result: The 7B and 72B variants of UI-Venus reach 94.1% / 50.8% and 95.3% / 61.9% on the standard grounding benchmarks, surpassing previous state-of-the-art baselines including the open-source GTA1 and the closed-source UI-TARS-1.5. On AndroidWorld they achieve success rates of 49.1% and 65.9%. Conclusion: UI-Venus is presented as a state-of-the-art open-source UI agent that performs strongly on UI grounding and navigation tasks thanks to carefully designed reward functions, efficient data cleaning strategies, and a new self-evolving framework, and it encourages further research and development in the community. Code is publicly available at https://github.com/antgroup/UI-Venus. Abstract: We present UI-Venus, a native UI agent that takes only screenshots as input based on a multimodal large language model. UI-Venus achieves SOTA performance on both UI grounding and navigation tasks using only several hundred thousand high-quality training samples through reinforcement fine-tuning (RFT) based on Qwen2.5-VL. Specifically, the 7B and 72B variants of UI-Venus obtain 94.1% / 50.8% and 95.3% / 61.9% on the standard grounding benchmarks, i.e., Screenspot-V2 / Pro, surpassing the previous SOTA baselines including open-source GTA1 and closed-source UI-TARS-1.5. To show UI-Venus's summarization and planning ability, we also evaluate it on AndroidWorld, an online UI navigation arena, on which our 7B and 72B variants achieve 49.1% and 65.9% success rates, also beating existing models. To achieve this, we introduce carefully designed reward functions for both UI grounding and navigation tasks and corresponding efficient data cleaning strategies. To further boost navigation performance, we propose Self-Evolving Trajectory History Alignment & Sparse Action Enhancement, which refines historical reasoning traces and balances the distribution of sparse but critical actions, leading to more coherent planning and better generalization in complex UI tasks.
Our contributions include the release of SOTA open-source UI agents, comprehensive data cleaning protocols, and a novel self-evolving framework for improving navigation performance, which encourage further research and development in the community. Code is available at https://github.com/antgroup/UI-Venus.
[168] Self-Supervised Stereo Matching with Multi-Baseline Contrastive Learning
Peng Xu,Zhiyu Xiang,Jingyun Fu,Tianyu Pu,Kai Wang,Chaojie Ji,Tingming Bai,Eryun Liu
Main category: cs.CV
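TL;DR: BaCon-Stereo is a new self-supervised stereo matching framework that uses a teacher-student model and an occlusion-aware attention map to handle matching in occluded regions effectively.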
Details
Motivation: Existing self-supervised stereo matching relies on the photometric consistency assumption, which breaks down in occluded regions. The authors propose BaCon-Stereo to address this. Method: BaCon-Stereo is a contrastive learning method based on a teacher-student framework, with an occlusion-aware attention map that guides the student in learning occlusion completion. Result: BaCon-Stereo outperforms state-of-the-art self-supervised methods on both the KITTI 2015 and 2012 benchmarks. Conclusion: BaCon-Stereo improves prediction in both occluded and non-occluded regions and achieves stronger generalization and robustness than existing state-of-the-art self-supervised methods. Abstract: Current self-supervised stereo matching relies on the photometric consistency assumption, which breaks down in occluded regions due to ill-posed correspondences. To address this issue, we propose BaCon-Stereo, a simple yet effective contrastive learning framework for self-supervised stereo network training in both non-occluded and occluded regions. We adopt a teacher-student paradigm with multi-baseline inputs, in which the stereo pairs fed into the teacher and student share the same reference view but differ in target views. Geometrically, regions occluded in the student's target view are often visible in the teacher's, making it easier for the teacher to predict in these regions. The teacher's prediction is rescaled to match the student's baseline and then used to supervise the student. We also introduce an occlusion-aware attention map to better guide the student in learning occlusion completion. To support training, we synthesize a multi-baseline dataset BaCon-20k. Extensive experiments demonstrate that BaCon-Stereo improves prediction in both occluded and non-occluded regions, achieves strong generalization and robustness, and outperforms state-of-the-art self-supervised methods on both KITTI 2015 and 2012 benchmarks. Our code and dataset will be released upon paper acceptance.
[169] Generalizable Federated Learning using Client Adaptive Focal Modulation
Tajamul Ashraf,Iqra Altaf Gillani
Main category: cs.CV
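TL;DR: AdaptFED improves focal modulation for federated learning through a personalized modulation strategy, stronger theoretical bounds, and cross-modal validation, improving model adaptability, scalability, and generalization.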
Details
Motivation: To improve federated learning performance in non-IID and cross-domain settings, AdaptFED investigates focal modulation in greater depth. Method: AdaptFED introduces task-aware client embeddings to personalize modulation dynamics, strengthens the theoretical bounds on adaptation performance, and validates across modalities using time-series and multilingual data. It also reduces server-client communication overhead via low-rank hypernetwork conditioning. Result: Extensive experiments on eight diverse datasets show that AdaptFED outperforms state-of-the-art baselines in source-free and cross-task federated setups. Conclusion: AdaptFED extends the capabilities of focal modulation in federated learning and paves the way for more adaptive, scalable, and generalizable transformer-based federated systems. Abstract: Federated learning (FL) has proven essential for privacy-preserving, collaborative training across distributed clients. Our prior work, TransFed, introduced a robust transformer-based FL framework that leverages a learn-to-adapt hypernetwork to generate personalized focal modulation layers per client, outperforming traditional methods in non-IID and cross-domain settings. In this extended version, we propose AdaptFED, where we deepen the investigation of focal modulation in generalizable FL by incorporating: (1) a refined adaptation strategy that integrates task-aware client embeddings to personalize modulation dynamics further, (2) enhanced theoretical bounds on adaptation performance, and (3) broader empirical validation across additional modalities, including time-series and multilingual data. We also introduce an efficient variant of TransFed that reduces server-client communication overhead via low-rank hypernetwork conditioning, enabling scalable deployment in resource-constrained environments. Extensive experiments on eight diverse datasets reaffirm the superiority of our method over state-of-the-art baselines, particularly in source-free and cross-task federated setups. Our findings not only extend the capabilities of focal modulation in FL but also pave the way for more adaptive, scalable, and generalizable transformer-based federated systems. The code is available at http://github.com/Tajamul21/TransFed
[170] Hierarchical Fine-grained Preference Optimization for Physically Plausible Video Generation
Harold Haodong Chen,Haojian Huang,Qifeng Chen,Harry Yang,Ser-Nam Lim
Main category: cs.CV
TL;DR: This paper introduces PhysHPO, a framework that improves video generation by aligning preferences at multiple levels and selecting high-quality data automatically, resulting in more realistic and physically accurate videos.
Details
Motivation: The challenge of generating videos that adhere to the laws of physics was the primary motivation, especially for applications requiring realism and accuracy. Method: PhysHPO uses Hierarchical Cross-Modal Direct Preference Optimization across four granularities—Instance, State, Motion, and Semantic Levels—alongside an automated data selection pipeline to identify useful data from existing datasets. Result: Experiments show that PhysHPO improves physical plausibility and overall video generation quality, validated on both physics-focused and general capability benchmarks. Conclusion: PhysHPO significantly enhances the physical plausibility and quality of video generation, marking it as the first work to explore fine-grained preference alignment and data selection for realistic video generation. Abstract: Recent advancements in video generation have enabled the creation of high-quality, visually compelling videos. However, generating videos that adhere to the laws of physics remains a critical challenge for applications requiring realism and accuracy. In this work, we propose PhysHPO, a novel framework for Hierarchical Cross-Modal Direct Preference Optimization, to tackle this challenge by enabling fine-grained preference alignment for physically plausible video generation. PhysHPO optimizes video alignment across four hierarchical granularities: a) Instance Level, aligning the overall video content with the input prompt; b) State Level, ensuring temporal consistency using boundary frames as anchors; c) Motion Level, modeling motion trajectories for realistic dynamics; and d) Semantic Level, maintaining logical consistency between narrative and visuals. 
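For readers unfamiliar with Direct Preference Optimization, the following is a minimal sketch of the standard DPO loss for a single preferred/rejected pair. It is the generic objective, not PhysHPO's hierarchical formulation: how the four levels are scored and combined is not specified in this entry, and the function name, argument names, and the default `beta` are assumptions made for illustration.

```python
import math


def dpo_loss(logp_win, logp_lose, ref_logp_win, ref_logp_lose, beta=0.1):
    """Standard DPO loss for one (preferred, rejected) pair.

    Each argument is a log-probability of a full sample under either the
    policy being trained (logp_*) or the frozen reference model (ref_logp_*).
    """
    # Implicit reward margin: how much more the policy favors the preferred
    # sample than the reference model does, scaled by beta.
    margin = beta * ((logp_win - ref_logp_win) - (logp_lose - ref_logp_lose))
    # -log(sigmoid(margin)) equals softplus(-margin); this form is stable
    # for large positive or negative margins.
    x = -margin
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))
```

When the policy and the reference agree, the margin is zero and the loss is log 2; the loss falls as the policy shifts probability toward the preferred sample.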
Recognizing that real-world videos are the best reflections of physical phenomena, we further introduce an automated data selection pipeline to efficiently identify and utilize "good data" from existing large-scale text-video datasets, thereby eliminating the need for costly and time-intensive dataset construction. Extensive experiments on both physics-focused and general capability benchmarks demonstrate that PhysHPO significantly improves physical plausibility and overall video generation quality of advanced models. To the best of our knowledge, this is the first work to explore fine-grained preference alignment and data selection for video generation, paving the way for more realistic and human-preferred video generation paradigms.
[171] Performance of GPT-5 in Brain Tumor MRI Reasoning
Mojtaba Safari,Shansong Wang,Mingzhe Hu,Zach Eidex,Qiang Li,Xiaofeng Yang
Main category: cs.CV
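TL;DR: This study evaluates GPT-5 family models on visual question answering over brain tumor MRI and finds that they reach moderate accuracy but fall short of the standard required for clinical use.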
Details
Motivation: Accurately differentiating brain tumor types is critical for guiding treatment planning in neuro-oncology, and advances in large language models have enabled visual question answering approaches that integrate image interpretation with natural language reasoning. Method: GPT-4o, GPT-5-nano, GPT-5-mini, and GPT-5 were evaluated on a brain tumor VQA benchmark derived from 3 brain tumor segmentation datasets, with accuracy on visual and reasoning tasks assessed in a zero-shot chain-of-thought setting. Result: GPT-5-mini achieved the highest macro-average accuracy (44.19%), followed by GPT-5 (43.71%), GPT-4o (41.49%), and GPT-5-nano (35.85%). Conclusion: GPT-5 family models can achieve moderate accuracy on structured neuro-oncological VQA tasks, but not at a level acceptable for clinical use. Abstract: Accurate differentiation of brain tumor types on magnetic resonance imaging (MRI) is critical for guiding treatment planning in neuro-oncology. Recent advances in large language models (LLMs) have enabled visual question answering (VQA) approaches that integrate image interpretation with natural language reasoning. In this study, we evaluated GPT-4o, GPT-5-nano, GPT-5-mini, and GPT-5 on a curated brain tumor VQA benchmark derived from 3 Brain Tumor Segmentation (BraTS) datasets - glioblastoma (GLI), meningioma (MEN), and brain metastases (MET). Each case included multi-sequence MRI triplanar mosaics and structured clinical features transformed into standardized VQA items. Models were assessed in a zero-shot chain-of-thought setting for accuracy on both visual and reasoning tasks. Results showed that GPT-5-mini achieved the highest macro-average accuracy (44.19%), followed by GPT-5 (43.71%), GPT-4o (41.49%), and GPT-5-nano (35.85%). Performance varied by tumor subtype, with no single model dominating across all cohorts. These findings suggest that GPT-5 family models can achieve moderate accuracy in structured neuro-oncological VQA tasks, but not at a level acceptable for clinical use.
[172] TexVerse: A Universe of 3D Objects with High-Resolution Textures
Yibo Zhang,Li Zhang,Rui Ma,Nan Cao
Main category: cs.CV
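TL;DR: TexVerse is a large-scale 3D dataset with high-resolution textures, suited to texture synthesis, PBR material development, animation, and a range of 3D vision and graphics tasks.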
Details
Motivation: End-to-end high-resolution texture generation remains underexplored because suitable high-resolution texture datasets are lacking. TexVerse fills this gap. Method: TexVerse collects over 858K unique high-resolution 3D models, including more than 158K models with physically based rendering (PBR) materials, and provides the high-resolution variants of each model along with detailed annotations. Result: TexVerse comprises 1.6M 3D instances, plus the specialized subsets TexVerse-Skeleton and TexVerse-Animation, with detailed model annotations. Conclusion: TexVerse provides a high-quality data resource for high-resolution texture generation and a variety of 3D vision and graphics tasks. Abstract: We introduce TexVerse, a large-scale 3D dataset featuring high-resolution textures. While recent advances in large-scale 3D datasets have enhanced high-resolution geometry generation, creating high-resolution textures end-to-end remains underexplored due to the lack of suitable datasets. TexVerse fills this gap with a curated collection of over 858K unique high-resolution 3D models sourced from Sketchfab, including more than 158K models with physically based rendering (PBR) materials. Each model encompasses all of its high-resolution variants, bringing the total to 1.6M 3D instances. TexVerse also includes specialized subsets: TexVerse-Skeleton, with 69K rigged models, and TexVerse-Animation, with 54K animated models, both preserving original skeleton and animation data uploaded by the user. We also provide detailed model annotations describing overall characteristics, structural components, and intricate features. TexVerse offers a high-quality data resource with wide-ranging potential applications in texture synthesis, PBR material development, animation, and various 3D vision and graphics tasks.
[173] Medico 2025: Visual Question Answering for Gastrointestinal Imaging
Sushant Gautam,Vajira Thambawita,Michael Riegler,Pål Halvorsen,Steven Hicks
Main category: cs.CV
TL;DR: The Medico 2025 challenge focuses on creating Explainable Artificial Intelligence models to answer clinical questions from Gastrointestinal endoscopy images, using the Kvasir-VQA-x1 dataset, with the goal of making AI in medical image analysis more trustworthy through interpretable results.
Details
Motivation: The motivation behind the Medico 2025 challenge is to develop Explainable Artificial Intelligence models that can answer clinically relevant questions from GI endoscopy images, thereby enhancing trust and transparency in AI-assisted medical image analysis. Method: The challenge introduces two subtasks using the Kvasir-VQA-x1 dataset: answering diverse types of visual questions and generating multimodal explanations. It evaluates models based on quantitative performance metrics and expert-reviewed explainability assessments. Result: The result of the challenge is expected to be the advancement of AI models that can accurately answer visual questions related to GI imaging and provide clear, interpretable justifications for these answers, aligning with medical reasoning. Conclusion: The Medico 2025 challenge aims to advance trustworthy AI in medical image analysis by focusing on the development of Explainable Artificial Intelligence models for Gastrointestinal imaging, with an emphasis on providing interpretable justifications for clinical decision-making. Abstract: The Medico 2025 challenge addresses Visual Question Answering (VQA) for Gastrointestinal (GI) imaging, organized as part of the MediaEval task series. The challenge focuses on developing Explainable Artificial Intelligence (XAI) models that answer clinically relevant questions based on GI endoscopy images while providing interpretable justifications aligned with medical reasoning. It introduces two subtasks: (1) answering diverse types of visual questions using the Kvasir-VQA-x1 dataset, and (2) generating multimodal explanations to support clinical decision-making. The Kvasir-VQA-x1 dataset, created from 6,500 images and 159,549 complex question-answer (QA) pairs, serves as the benchmark for the challenge. By combining quantitative performance metrics and expert-reviewed explainability assessments, this task aims to advance trustworthy Artificial Intelligence (AI) in medical image analysis. 
Instructions, data access, and an updated guide for participation are available in the official competition repository: https://github.com/simula/MediaEval-Medico-2025
[174] ToonComposer: Streamlining Cartoon Production with Generative Post-Keyframing
Lingen Li,Guangzhi Wang,Zhaoyang Zhang,Yaowei Li,Xiaoyu Li,Qi Dou,Jinwei Gu,Tianfan Xue,Ying Shan
Main category: cs.CV
TL;DR: ToonComposer is a generative AI model that unifies inbetweening and colorization in cartoon production, reducing manual work and offering better flexibility and performance.
Details
Motivation: Traditional cartoon production requires intensive manual effort, and existing AI methods often handle stages separately, leading to errors and artifacts. Method: ToonComposer uses a sparse sketch injection mechanism and a cartoon adaptation method with a spatial low-rank adapter to unify inbetweening and colorization stages. Result: ToonComposer outperforms existing methods in visual quality, motion consistency, and production efficiency while supporting sparse inputs and multiple sketches for precise control. Conclusion: ToonComposer provides a unified solution for cartoon and anime production, combining inbetweening and colorization to reduce manual effort and improve flexibility. Abstract: Traditional cartoon and anime production involves keyframing, inbetweening, and colorization stages, which require intensive manual effort. Despite recent advances in AI, existing methods often handle these stages separately, leading to error accumulation and artifacts. For instance, inbetweening approaches struggle with large motions, while colorization methods require dense per-frame sketches. To address this, we introduce ToonComposer, a generative model that unifies inbetweening and colorization into a single post-keyframing stage. ToonComposer employs a sparse sketch injection mechanism to provide precise control using keyframe sketches. Additionally, it uses a cartoon adaptation method with the spatial low-rank adapter to tailor a modern video foundation model to the cartoon domain while keeping its temporal prior intact. Requiring as few as a single sketch and a colored reference frame, ToonComposer excels with sparse inputs, while also supporting multiple sketches at any temporal location for more precise motion control. This dual capability reduces manual workload and improves flexibility, empowering artists in real-world scenarios. 
To evaluate our model, we further created PKBench, a benchmark featuring human-drawn sketches that simulate real-world use cases. Our evaluation demonstrates that ToonComposer outperforms existing methods in visual quality, motion consistency, and production efficiency, offering a superior and more flexible solution for AI-assisted cartoon production.
[175] STream3R: Scalable Sequential 3D Reconstruction with Causal Transformer
Yushi Lan,Yihang Luo,Fangzhou Hong,Shangchen Zhou,Honghua Chen,Zhaoyang Lyu,Shuai Yang,Bo Dai,Chen Change Loy,Xingang Pan
Main category: cs.CV
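TL;DR: STream3R is a causal-attention-based 3D reconstruction method that processes image sequences efficiently and achieves state-of-the-art performance across a variety of 3D tasks.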
Details
Motivation: Existing multi-view reconstruction methods either depend on expensive global optimization or rely on simplistic memory mechanisms that scale poorly with sequence length. A more efficient streaming framework is needed. Method: STream3R reformulates pointmap prediction as a decoder-only Transformer problem and processes image sequences with causal attention. Result: Experiments show that STream3R consistently outperforms prior methods on both static and dynamic scene benchmarks and is inherently compatible with LLM-style training infrastructure. Conclusion: STream3R demonstrates the potential of causal Transformer models for online 3D perception and paves the way for real-time 3D understanding in streaming environments. Abstract: We present STream3R, a novel approach to 3D reconstruction that reformulates pointmap prediction as a decoder-only Transformer problem. Existing state-of-the-art methods for multi-view reconstruction either depend on expensive global optimization or rely on simplistic memory mechanisms that scale poorly with sequence length. In contrast, STream3R introduces a streaming framework that processes image sequences efficiently using causal attention, inspired by advances in modern language modeling. By learning geometric priors from large-scale 3D datasets, STream3R generalizes well to diverse and challenging scenarios, including dynamic scenes where traditional methods often fail. Extensive experiments show that our method consistently outperforms prior work across both static and dynamic scene benchmarks. Moreover, STream3R is inherently compatible with LLM-style training infrastructure, enabling efficient large-scale pretraining and fine-tuning for various downstream 3D tasks. Our results underscore the potential of causal Transformer models for online 3D perception, paving the way for real-time 3D understanding in streaming environments. More details can be found on our project page: https://nirvanalan.github.io/projects/stream3r.
[176] MAESTRO: Masked AutoEncoders for Multimodal, Multitemporal, and Multispectral Earth Observation Data
Antoine Labatie,Michael Vaccaro,Nina Lardiere,Anatol Garioud,Nicolas Gonthier
Main category: cs.CV
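TL;DR: This paper proposes MAESTRO, a new self-supervised learning method for Earth observation data that combines multimodal, multitemporal, and multispectral inputs and, through optimized fusion strategies and a tailored target normalization scheme, achieves state-of-the-art performance on tasks that depend on multitemporal dynamics.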
Details
Motivation: Standard self-supervised learning methods must be adapted to the particular characteristics of remote sensing data, and this paper aims to advance research in that direction. Method: The paper runs a comprehensive benchmark of fusion strategies and reconstruction target normalization schemes for multimodal, multitemporal, and multispectral Earth observation data, and proposes MAESTRO, a new adaptation of the Masked Autoencoder with optimized fusion strategies and a tailored target normalization scheme. Result: Evaluated on four Earth observation datasets, MAESTRO sets a new state of the art on tasks that rely strongly on multitemporal dynamics while remaining highly competitive on tasks dominated by a single mono-temporal modality. Conclusion: The results indicate that MAESTRO is an effective self-supervised learning method for Earth observation data with broad potential applications. Abstract: Self-supervised learning holds great promise for remote sensing, but standard self-supervised methods must be adapted to the unique characteristics of Earth observation data. We take a step in this direction by conducting a comprehensive benchmark of fusion strategies and reconstruction target normalization schemes for multimodal, multitemporal, and multispectral Earth observation data. Based on our findings, we propose MAESTRO, a novel adaptation of the Masked Autoencoder, featuring optimized fusion strategies and a tailored target normalization scheme that introduces a spectral prior as a self-supervisory signal. Evaluated on four Earth observation datasets, MAESTRO sets a new state-of-the-art on tasks that strongly rely on multitemporal dynamics, while remaining highly competitive on tasks dominated by a single mono-temporal modality. Code to reproduce all our experiments is available at https://github.com/ignf/maestro.
[177] ESSENTIAL: Episodic and Semantic Memory Integration for Video Class-Incremental Learning
Jongseo Lee,Kyungho Bae,Kyle Min,Gyeong-Moon Park,Jinwoo Choi
Main category: cs.CV
TL;DR: The paper proposes ESSENTIAL to improve video class-incremental learning by efficiently integrating episodic and semantic memory.
Details
Motivation: Existing VCIL methods either use memory-inefficient rehearsal training or sacrifice temporal information, leading to suboptimal performance. Method: ESSENTIAL integrates episodic memory and semantic prompts through a memory retrieval module using cross-attention. Result: ESSENTIAL achieves favorable performance on diverse datasets with significantly reduced memory. Conclusion: ESSENTIAL provides a balanced solution for the trade-off between memory-efficiency and performance in video class-incremental learning. Abstract: In this work, we tackle the problem of video class-incremental learning (VCIL). Many existing VCIL methods mitigate catastrophic forgetting by rehearsal training with a few temporally dense samples stored in episodic memory, which is memory-inefficient. Alternatively, some methods store temporally sparse samples, sacrificing essential temporal information and thereby resulting in inferior performance. To address this trade-off between memory-efficiency and performance, we propose EpiSodic and SEmaNTIc memory integrAtion for video class-incremental Learning (ESSENTIAL). ESSENTIAL consists of episodic memory for storing temporally sparse features and semantic memory for storing general knowledge represented by learnable prompts. We introduce a novel memory retrieval (MR) module that integrates episodic memory and semantic prompts through cross-attention, enabling the retrieval of temporally dense features from temporally sparse features. We rigorously validate ESSENTIAL on diverse datasets: UCF-101, HMDB51, and Something-Something-V2 from the TCD benchmark and UCF-101, ActivityNet, and Kinetics-400 from the vCLIMB benchmark. Remarkably, with significantly reduced memory, ESSENTIAL achieves favorable performance on the benchmarks.
[178] Human-in-Context: Unified Cross-Domain 3D Human Motion Modeling via In-Context Learning
Mengyuan Liu,Xinshun Wang,Zhongbin Fang,Deheng Ye,Xia Li,Tao Tang,Songtao Wu,Xiangtai Li,Ming-Hsuan Yang
Main category: cs.CV
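TL;DR: This paper proposes Human-in-Context (HiC), a unified cross-domain 3D human motion model that overcomes the practicality and scalability limits of existing cross-domain models by combining pose and mesh representations, expanding task coverage, and incorporating larger-scale datasets.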
Details
Motivation: Existing cross-domain models typically rely on domain-specific components and multi-stage training, which limits their practicality and scalability when handling multiple modalities, tasks, and datasets. A more flexible and general unified model is therefore needed. Method: The paper proposes HiC, an extension of Pose-in-Context (PiC). HiC handles cross-domain data better by combining pose and mesh representations, introducing a max-min similarity prompt sampling strategy, and adopting a network architecture with dual-branch context injection. Result: Experiments show that HiC outperforms PiC in generalization across domains, data scale, and performance, demonstrating its potential for building a unified cross-domain 3D human motion model. Conclusion: HiC offers a more flexible and scalable approach to building a unified cross-domain 3D human motion model that handles multiple modalities, tasks, and datasets. Abstract: This paper aims to model 3D human motion across domains, where a single model is expected to handle multiple modalities, tasks, and datasets. Existing cross-domain models often rely on domain-specific components and multi-stage training, which limits their practicality and scalability. To overcome these challenges, we propose a new setting to train a unified cross-domain model through a single process, eliminating the need for domain-specific components and multi-stage training. We first introduce Pose-in-Context (PiC), which leverages in-context learning to create a pose-centric cross-domain model. While PiC generalizes across multiple pose-based tasks and datasets, it encounters difficulties with modality diversity, prompting strategy, and contextual dependency handling. We thus propose Human-in-Context (HiC), an extension of PiC that broadens generalization across modalities, tasks, and datasets. HiC combines pose and mesh representations within a unified framework, expands task coverage, and incorporates larger-scale datasets. Additionally, HiC introduces a max-min similarity prompt sampling strategy to enhance generalization across diverse domains and a network architecture with dual-branch context injection for improved handling of contextual dependencies. Extensive experimental results show that HiC performs better than PiC in terms of generalization, data scale, and performance across a wide range of domains. These results demonstrate the potential of HiC for building a unified cross-domain 3D human motion model with improved flexibility and scalability.
The source code and models are available at https://github.com/BradleyWang0416/Human-in-Context.
[179] Puppeteer: Rig and Animate Your 3D Models
Chaoyue Song,Xiu Li,Fan Yang,Zhongcong Xu,Jiacheng Wei,Fayao Liu,Jiashi Feng,Guosheng Lin,Jianfeng Zhang
Main category: cs.CV
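TL;DR: Puppeteer is a new framework for automated rigging and animation of 3D models that uses transformer models, attention mechanisms, and optimization strategies to generate animations more efficiently and accurately across diverse 3D content.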
Details
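Motivation: Modern interactive applications increasingly demand dynamic 3D content, but turning static 3D models into animated assets remains a bottleneck in content creation pipelines. Although generative AI has advanced static 3D model creation, rigging and animation still depend on expert intervention. Puppeteer aims to solve this by automating both rigging and animation. Method: Puppeteer rigs and animates 3D models automatically in three main steps: 1) it predicts the skeletal structure with a transformer-based auto-regressive model, using a joint-based tokenization strategy and a hierarchical ordering method with stochastic perturbation; 2) it infers skinning weights with an attention-based model that incorporates joint attention based on skeletal graph distances; 3) it generates high-fidelity animations through differentiable optimization. Result: Puppeteer performs strongly on multiple benchmarks and significantly outperforms existing techniques. It is better in both skeletal prediction accuracy and skinning quality, and it efficiently generates stable, high-quality animations while resolving the jittering issues common in existing methods. Conclusion: Puppeteer is a complete framework for automatic rigging and animation of 3D models. Through its transformer model, attention mechanism, and optimization scheme, it achieves more accurate skeletal prediction and higher-quality skinning than existing methods and handles diverse types of 3D content. Abstract: Modern interactive applications increasingly demand dynamic 3D content, yet the transformation of static 3D models into animated assets constitutes a significant bottleneck in content creation pipelines. While recent advances in generative AI have revolutionized static 3D model creation, rigging and animation continue to depend heavily on expert intervention. We present Puppeteer, a comprehensive framework that addresses both automatic rigging and animation for diverse 3D objects. Our system first predicts plausible skeletal structures via an auto-regressive transformer that introduces a joint-based tokenization strategy for compact representation and a hierarchical ordering methodology with stochastic perturbation that enhances bidirectional learning capabilities. It then infers skinning weights via an attention-based architecture incorporating topology-aware joint attention that explicitly encodes inter-joint relationships based on skeletal graph distances. Finally, we complement these rigging advances with a differentiable optimization-based animation pipeline that generates stable, high-fidelity animations while being computationally more efficient than existing approaches. Extensive evaluations across multiple benchmarks demonstrate that our method significantly outperforms state-of-the-art techniques in both skeletal prediction accuracy and skinning quality.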
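The "skeletal graph distances" that the joint attention is based on can be illustrated with a small sketch. The code below computes all-pairs hop distances between joints of a skeleton tree by breadth-first search. The parent-index representation and the function name are assumptions made for illustration, and how Puppeteer feeds such distances into its attention layers is not specified in this entry.

```python
from collections import deque


def skeleton_hop_distances(parents):
    """All-pairs hop distances in a skeleton tree.

    parents[i] is the index of joint i's parent, or -1 for the root
    (a hypothetical encoding chosen for this sketch).
    """
    n = len(parents)
    # Build an undirected adjacency list from the parent links.
    adj = [[] for _ in range(n)]
    for child, parent in enumerate(parents):
        if parent >= 0:
            adj[child].append(parent)
            adj[parent].append(child)
    # Breadth-first search from every joint; -1 marks "not yet visited".
    dist = [[-1] * n for _ in range(n)]
    for src in range(n):
        dist[src][src] = 0
        queue = deque([src])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if dist[src][v] < 0:
                    dist[src][v] = dist[src][u] + 1
                    queue.append(v)
    return dist
```

For example, with `parents = [-1, 0, 1, 1, 0]`, joints 2 and 3 are siblings under joint 1, and joint 4 hangs off the root. A matrix like this could, for instance, be turned into an additive attention bias, though that use is only one plausible option.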
The system robustly processes diverse 3D content, ranging from professionally designed game assets to AI-generated shapes, producing temporally coherent animations that eliminate the jittering issues common in existing methods.
[180] Quantum Visual Fields with Neural Amplitude Encoding
Shuteng Wang,Christian Theobalt,Vladislav Golyanik
Main category: cs.CV
TL;DR: This paper introduces Quantum Visual Field (QVF), a novel approach for learning 2D and 3D visual representations on quantum computers, outperforming classical and existing quantum methods in accuracy and efficiency.